Exploratory data analysis (EDA) is a broad topic that is hard to cover in a single article. EDA can be done to learn about a dataset and the relationships between its features, or to identify the important features and prepare the dataset so that a statistical model can be fit to it. Either way, the goal is to understand the features and how they relate to each other. In this article, I will focus on EDA for data modeling. We will not build a prediction model here; we will focus only on the EDA part.
Dataset
I am using the Boston dataset that ships with the scikit-learn library. It contains information about housing prices in Boston. First, import the necessary packages and load the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.datasets import load_boston
boston_data = load_boston()
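A caveat worth noting: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. If you are on a newer version, one alternative is to fetch the same data from OpenML. This is a minimal sketch; the OpenML copy typically names the target column MEDV, and a couple of columns may come back as categorical:
from sklearn.datasets import fetch_openml
# Fetch the Boston housing data from OpenML as a pandas DataFrame
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame.rename(columns={"MEDV": "prices"})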
For reference, the Boston dataset is organized into two parts: the data part, which contains the features, and the target, which contains the house prices. We are going to put the prices in the same DataFrame as the features.
df = pd.DataFrame(data=boston_data.data, columns=boston_data.feature_names)
df["prices"] = boston_data.target
Find out how many features and how many observations are there:
df.shape
#Output:
(506, 14)
The shape says there are 506 observations and 14 columns in the dataset (13 features plus the target we just added). Now, list all the columns:
df.columns
#Output:
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'prices'], dtype='object')
The names of the features may look obscure. Here is an explanation of each:
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration in parts per 10 million
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built before 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Property tax rate per $10,000
PTRATIO: Student-teacher ratio by town
B: 1000(Bk − 0.63)², where Bk is the proportion of people of African American descent by town
LSTAT: Percentage of the lower status of the population
prices: The prices of homes in $1000s
Now we have a clearer picture of the dataset. Before any data exploration, it is important to check for missing data:
df.isnull().sum()
Fortunately, there are no missing values in this dataset, so we can move on to further analysis.
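If there were missing values, we would have to handle them before modeling. As a minimal sketch (purely hypothetical here, since this dataset has no NaNs), two common options are dropping the affected rows or filling the gaps with a summary statistic:
# Option 1: drop every row that contains a missing value
df_clean = df.dropna()
# Option 2: fill missing values in a column with its median
# (hypothetical example; CRIM has no NaNs in this dataset)
df["CRIM"] = df["CRIM"].fillna(df["CRIM"].median())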
Dependent and independent variables
In this dataset, we need to figure out which variable is dependent and which are independent. People would typically want to predict the housing prices from the other features, because they do not want to pay more than fair market value. From experience, we can also expect housing prices to vary with the other features in the dataset. So, in this dataset, the housing price is the dependent variable.
Choosing independent variables, or predictor variables, can be tricky. It primarily depends on each variable’s relationship with the dependent variable. We can start from instinct and then test it. My instinct says the predictor variables for this dataset could be RM, CRIM, DIS, AGE, and PTRATIO. Thinking through the details:
RM (the average number of rooms) affects the square footage and therefore the price of a house. The relationship with price is expected to be positive: the more rooms, the higher the price.
The crime rate (CRIM) may also affect the price: where the crime rate is higher, the housing price is probably lower. The weighted distance to five Boston employment centers (DIS) may also have a negative correlation with housing prices. PTRATIO (the student-teacher ratio) may have a negative correlation as well, because children’s education matters to parents. Now, let’s check whether our instincts are right.
Exploratory Analysis
Start with the distribution of the target variable or dependent variable:
sns.set(rc={'figure.figsize': (12, 8)})
# Note: distplot is deprecated in recent seaborn versions;
# sns.histplot(df["prices"], bins=25, kde=True) is the modern equivalent
sns.distplot(df["prices"], bins=25)
plt.show()
As the plot above shows, the distribution of the prices is nearly normal, with some outliers in the upper tail.
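We can back this visual impression up with numbers. As a quick sketch, describe() summarizes the spread and skew() quantifies the asymmetry; a positive skew confirms the long right tail created by the expensive outliers:
# Summary statistics for the target variable
print(df["prices"].describe())
# Positive skewness indicates a longer right tail
print("Skewness:", df["prices"].skew())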
One very important step is to learn about the relationships between the dependent and independent variables. The correlation between predictor variables is also essential when choosing the right predictors. Here is how to draw a correlation matrix for all the features in a dataset, or for a subset of them. Because this dataset is not too big, I am using the whole dataset:
correlation_matrix = df.corr().round(2)
sns.set(rc={'figure.figsize':(11, 8)})
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
Just as a reminder: correlation coefficients range from -1 to 1. A value of -1 means a perfect negative correlation, 1 means a perfect positive correlation, and 0 means no linear correlation at all.
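For a single pair of variables you do not need the whole matrix; pandas can compute one coefficient directly. A quick sketch:
# Correlation between the target and a single feature
print(df["prices"].corr(df["RM"]))  # roughly 0.7 for this dataset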
Now, check the correlation between the target variable ‘prices’ and the predictor variables of our choice (RM, CRIM, DIS, AGE, and PTRATIO) from the correlation matrix above.
The picture shows that the correlation coefficient between prices and RM is 0.7, which is a strong positive correlation. The correlation coefficient between prices and CRIM is -0.39, a significant negative correlation. The coefficients between prices and DIS, AGE, and PTRATIO are 0.25, -0.38, and -0.51 respectively, which indicate quite significant correlations as well. You may notice that, other than CHAS, all the features have a good correlation with our dependent variable, and that is right. If I were writing a prediction algorithm for this dataset, I would exclude only CHAS and use the rest of the features as predictor variables or independent variables.
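This kind of cutoff can be automated. Here is a minimal sketch that keeps every feature whose absolute correlation with the target exceeds a threshold; the 0.2 cutoff is an arbitrary choice for illustration:
# Correlation of every feature with the target
target_corr = df.corr()["prices"].drop("prices")
# Keep features whose absolute correlation exceeds the threshold
threshold = 0.2
selected = target_corr[target_corr.abs() > threshold].index.tolist()
print(selected)  # with this cutoff, only CHAS is dropped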
Let’s investigate the relationship between the dependent and independent variables some more. This time I would like to look at individual relationships. I am not going to check every predictor variable against the target in this article. Instead, I am choosing two predictor variables based on the correlation matrix above: AGE, which has a good negative correlation with ‘prices’, and DIS, which has a good positive correlation.
To see the relationships of AGE and DIS with the target variable, a scatter plot is most helpful.
plt.figure(figsize=(20, 5))
features = ['AGE', 'DIS']
target = df['prices']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i+1)
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('prices')
plt.show()
From the scatter plots above, we can see that the distribution of AGE is left-skewed and the distribution of DIS is right-skewed. In the AGE plot there is a cluster in the upper quantile, and in the DIS plot there is a cluster in the lower quantile. The relationships show a linear trend, but if we intend to fit a linear model, we should do some data manipulation to make the relationships more strongly linear. When the data is left-skewed, one way to transform it is to take cubes of the values; when it is right-skewed, taking a log helps. There are several other ways to transform data, which I will discuss in detail in a later article.
plt.figure(figsize=(20, 5))
df['AgeN'] = df['AGE']**3
df['DisN'] = np.log(df['DIS'])
features = ['AgeN', 'DisN']
target = df['prices']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i+1)
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('prices')
plt.show()
Notice that the clusters are gone and the linearity is stronger. These two predictor variables are now ready for modeling. Please practice with the other predictor variables as well.
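We can also confirm the improvement numerically rather than just visually. Here is a small sketch comparing the correlations with ‘prices’ before and after the transformation; if the linearity strengthened, the absolute values should be larger for the transformed columns:
# Correlation with the target before and after transforming
print(df['prices'].corr(df['AGE']), df['prices'].corr(df['AgeN']))
print(df['prices'].corr(df['DIS']), df['prices'].corr(df['DisN']))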
In this article, we learned to check for null values, identify the predictor and dependent variables, make a correlation matrix, and transform data to improve the quality of data modeling.
#ExploratoryDataAnalysis #DataScience #DataAnalysis #pandas #python #matplotlib #seaborn