## Dataset

I am using the Boston housing dataset that ships with scikit-learn (in versions before 1.2, which removed `load_boston`). It contains information about housing prices in the Boston area. First, import the necessary packages and load the dataset.

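import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston  # only available in scikit-learn < 1.2
%matplotlib inline

# Load the data and build a DataFrame, with the target added as the "prices" column
boston_data = load_boston()
df = pd.DataFrame(data=boston_data.data, columns=boston_data.feature_names)
df["prices"] = boston_data.target

df.shape
# Output:
# (506, 14)

df.columns
# Output:
# Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
#        'PTRATIO', 'B', 'LSTAT', 'prices'], dtype='object')

# Count the missing values in each column
df.isnull().sum()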

## Dependent and independent variables

In this dataset, we need to identify the dependent variable and the independent variables. A buyer would want to predict housing prices from the other features, because nobody wants to pay more than fair market value. From experience, we can expect housing prices to vary with the other features in the dataset. So the housing price (`prices`) is the dependent variable, and the remaining 13 columns are the independent variables.
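
As a minimal sketch of that split, assuming a small stand-in DataFrame with made-up values (the real `df` has 506 rows and 14 columns), the dependent variable can be separated from the independent variables as follows. The names `X` and `y` are a common convention, not something taken from the dataset:

```python
import pandas as pd

# Tiny stand-in frame with made-up values; only the column names mirror the Boston data
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.2, 7.0],
    "DIS": [4.1, 5.0, 5.0, 6.1],
    "LSTAT": [5.0, 9.1, 4.0, 2.9],
    "prices": [24.0, 21.6, 34.7, 33.4],
})

# Dependent variable: the column we want to predict
y = df["prices"]

# Independent variables: every other column
X = df.drop(columns=["prices"])

print(X.shape, y.shape)
```

Dropping the target by name keeps the split correct even if more feature columns are added later.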

## Exploratory Analysis

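# Distribution of the target variable
sns.set(rc={'figure.figsize': (12, 8)})
# histplot with kde=True stands in for distplot, which is deprecated in recent seaborn releases
sns.histplot(df["prices"], bins=25, kde=True)
plt.show()

# Pairwise correlations between all columns, rounded to two decimals
correlation_matrix = df.corr().round(2)
sns.set(rc={'figure.figsize': (11, 8)})
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()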
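
The heatmap can be hard to scan when only the relationship with `prices` matters. Here is a minimal sketch, using synthetic stand-in data rather than the real Boston values, of ranking features by the strength of their correlation with the target. The variable names `corr_with_target` and `ranked` are hypothetical:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the Boston values); only the column names mirror the dataset
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 200)
dis = rng.uniform(1.0, 12.0, 200)
prices = 9.0 * rm - 34.0 + rng.normal(0.0, 2.0, 200)  # built to depend on RM only
df = pd.DataFrame({"RM": rm, "DIS": dis, "prices": prices})

# Correlation of each feature with the target, excluding the target itself
corr_with_target = df.corr()["prices"].drop("prices")

# Order by absolute value so strong negative correlations rank as high as positive ones
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked.round(2))
```

Sorting on the absolute value matters because a feature with a strong negative correlation is just as informative as one with a strong positive correlation.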

# Scatter plots of two features against the price
plt.figure(figsize=(20, 5))
features = ['RM', 'DIS']
target = df['prices']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('prices')

# The same kind of plot after transforming AGE (cubed) and DIS (natural log)
plt.figure(figsize=(20, 5))
df['AgeN'] = df['AGE']**3
df['DisN'] = np.log(df['DIS'])
features = ['AgeN', 'DisN']
target = df['prices']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('prices')
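
One common reason for transforms like these, which is an assumption here and not something stated above, is to make a curved relationship with the target closer to a straight line. The sketch below uses synthetic data (not the Boston values) in which the target is built to depend on the logarithm of a `DIS`-like feature, and compares the Pearson correlation before and after the transform:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the Boston values): prices depend on log(DIS) plus noise
rng = np.random.default_rng(42)
dis = rng.uniform(1.0, 12.0, 500)
prices = 10.0 + 5.0 * np.log(dis) + rng.normal(0.0, 0.5, 500)
df = pd.DataFrame({"DIS": dis, "prices": prices})

# Same transform as in the article
df["DisN"] = np.log(df["DIS"])

# Pearson correlation measures linear association only
r_raw = df["DIS"].corr(df["prices"])
r_log = df["DisN"].corr(df["prices"])
print(f"raw: {r_raw:.3f}, log-transformed: {r_log:.3f}")
```

On real data the improvement is not guaranteed, so it is worth checking the scatter plot and the correlation for each transformed feature before keeping it.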
