Exploratory Data Analysis For Data Modeling

Dataset

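I am using the Boston housing dataset that ships with the scikit-learn library. It contains information about housing prices in Boston. Note that load_boston was removed in scikit-learn 1.2, so the code below needs an older version of the library. First, import the necessary packages and the dataset.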

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.datasets import load_boston
boston_data = load_boston()
df = pd.DataFrame(data=boston_data.data, columns=boston_data.feature_names)
df["prices"] = boston_data.target


df.shape
# Output:
# (506, 14)

df.columns
# Output:
# Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'prices'], dtype='object')

# Count the missing values in each column
df.isnull().sum()


Dependent and independent variables

In this dataset, we need to identify the dependent variable and the independent variables. Someone looking at this data would most likely want to predict housing prices from the other features, since nobody wants to pay more than fair market value. From experience, we can expect housing prices to vary with the other features in the dataset. So here the housing price is the dependent variable, and the remaining features are the independent variables.
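In code, this split usually means putting the independent variables in one object and the dependent variable in another. Here is a minimal sketch, assuming a small toy frame with invented values (the frame and the helper name split_features_target are hypothetical, not part of the original notebook):

```python
import pandas as pd

# Hypothetical miniature frame with Boston-style column names.
# The values are invented for illustration only.
toy = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1, 6.0],
    "LSTAT": [4.9, 12.4, 3.1, 9.8],
    "prices": [24.0, 19.5, 34.7, 21.6],
})

def split_features_target(frame, target_col):
    """Return (X, y): all columns except target_col, and target_col itself."""
    X = frame.drop(columns=[target_col])  # independent variables
    y = frame[target_col]                 # dependent variable
    return X, y

X, y = split_features_target(toy, "prices")
print(list(X.columns))
print(y.name)
```

The same helper should work on the full df from above by calling split_features_target(df, "prices").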

Exploratory Analysis

Start with the distribution of the target variable or dependent variable:

sns.set(rc={'figure.figsize': (12, 8)})
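# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(df["prices"], bins=25, kde=True, stat="density")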
plt.show()
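Alongside the plot, a couple of numbers can back up what the histogram shows. Here is a minimal sketch using describe() and skew() from pandas. The prices series here is synthetic (drawn from a lognormal distribution that I picked only to get a right-skewed shape), so the numbers are not those of the Boston data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (an assumption for illustration, not real data).
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.4, size=500), name="prices")

summary = prices.describe()  # count, mean, std, min, quartiles, max
skewness = prices.skew()     # > 0 indicates a longer right tail

print(summary.round(2))
print(f"skewness: {skewness:.2f}")
```

On the real data, df["prices"].describe() and df["prices"].skew() give the same kind of summary.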


correlation_matrix = df.corr().round(2)
sns.set(rc={'figure.figsize': (11, 8)})
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
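The heatmap is easier to act on if you also pull out the column for the target and sort it. Here is a minimal sketch of ranking features by the absolute value of their correlation with prices. The toy frame is synthetic, with coefficients I made up, and rank_by_target_correlation is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

# Synthetic data: prices depend on RM (positively) and LSTAT (negatively).
# NOISE is unrelated to prices. All coefficients are invented for illustration.
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)
lstat = rng.normal(12.0, 5.0, n)
unrelated = rng.normal(0.0, 1.0, n)
prices = 9.0 * rm - 0.6 * lstat + rng.normal(0.0, 2.0, n)
toy = pd.DataFrame({"RM": rm, "LSTAT": lstat, "NOISE": unrelated, "prices": prices})

def rank_by_target_correlation(frame, target_col):
    """Correlation of each feature with target_col, strongest (by |r|) first."""
    corr = frame.corr()[target_col].drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranked = rank_by_target_correlation(toy, "prices")
print(ranked.round(2))
```

Sorting by absolute value matters because a strong negative correlation is just as informative as a strong positive one.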


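plt.figure(figsize=(20, 5))

# Scatter each feature against the target:
# RM is the average number of rooms per dwelling,
# DIS is the weighted distance to Boston employment centres.
features = ['RM', 'DIS']
target = df['prices']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)  # one panel per feature, side by side
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('prices')
plt.show()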


plt.figure(figsize=(20, 5))

# Transformed features: cube of AGE and natural log of DIS
df['AgeN'] = df['AGE']**3
df['DisN'] = np.log(df['DIS'])

features = ['AgeN', 'DisN']
target = df['prices']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)  # one panel per transformed feature
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('prices')
plt.show()
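To check whether a transformation like the log of DIS actually makes the relationship more linear, one option is to compare the Pearson correlation with the target before and after transforming. Here is a minimal sketch on synthetic data. The toy frame and the assumed log relationship between DIS and prices are invented for illustration and are not taken from the Boston dataset:

```python
import numpy as np
import pandas as pd

# Synthetic data where prices grow with log(DIS) plus noise (an assumption).
rng = np.random.default_rng(1)
n = 300
dis = rng.uniform(1.0, 12.0, n)
prices = 10.0 + 8.0 * np.log(dis) + rng.normal(0.0, 1.0, n)
toy = pd.DataFrame({"DIS": dis, "prices": prices})

# Same transformation as in the notebook code above.
toy["DisN"] = np.log(toy["DIS"])

raw_corr = toy["DIS"].corr(toy["prices"])   # Pearson r with the raw feature
log_corr = toy["DisN"].corr(toy["prices"])  # Pearson r after the log transform

print(f"raw DIS vs prices: {raw_corr:.3f}")
print(f"log DIS vs prices: {log_corr:.3f}")
```

If the transformed feature shows a clearly higher absolute correlation, the transformation is probably worth keeping for a linear model. On the real data the same comparison can be run with df['DIS'], df['DisN'] and df['prices'].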


