A Complete Exploratory Data Analysis in Python
Exploratory Data Analysis, Visualization, and Machine Learning in Python


I have written a few tutorials on Exploratory Data Analysis before, but I feel I should do some more. It is worth taking a dataset and exploring it end to end: data cleaning, analytics, visualization, and a prediction model, all in one piece. As a Data Scientist or Data Analyst, we may have to work with a lot of unfamiliar data, and sometimes we may not even understand the features properly, but that should not stop us from doing our job. It is best to know the features very well, but if that information is not available, the analytics should not suffer.

In this article, I will work on a dataset that I got from Kaggle. Most of us may not understand the features or column names. But let’s see what we can do with it.

We will focus on:

  1. Basic understanding of the dataset.
  2. Brainstorming and getting a deeper overall idea of the dataset through some visualization.
  3. Finding ways to derive specific information from the basic visualizations in the previous steps.
  4. Predictive modeling.

Please feel free to download the dataset from this link:

Florida_Subsidence_Incident_Reports csv-file (kaggle.com)

The necessary imports first:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

Using the data to create a Pandas’ DataFrame:

pd.set_option('display.max_columns', 100)
df1 = pd.read_csv('Florida_Subsidence_Incident_Reports.csv')

The dataset is quite large, so I am not showing a preview here. Let's look at the column names instead.
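They can be listed with the DataFrame's columns attribute:

df1.columns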

Index(['X', 'Y', 'OBJECTID', 'REF_NUM', 'DATE_REV', 'EVENT_DATE', 'TRUE_SINK',
       'LONGDD', 'LATDD', 'COUNTY', 'TWNSHP', 'TWNSHP_D', 'RANGE', 'RANGE_D',
       'SECTION', 'QTRSECT1', 'QTRSECT2', 'ACCURACY', 'RPT_SOURCE', 'RPT_NAME',
       'OCITY', 'OZIP', 'SIZDIM', 'SINSHAPE', 'SINLNGTH', 'SINWIDTH',
       'SINDEPTH', 'SLOPE', 'WATSIN', 'WATBLS', 'LIMVIS', 'CAVVIS', 'SUBRATE',
       'PROPDAM', 'REPAIR_S', 'DRAINSTR', 'SOILTYPE', 'COMMENTS', 'COMMENTS_2',
       'ACCESS_'],
      dtype='object')

The column names are not very clear or expressive, but we still need to approach this. To begin with, it is always a good idea to check for null values. The following line of code finds the number of null values in each column of the DataFrame and divides it by the total number of rows to get the proportion of null values in each feature:

df1.isna().sum()/len(df1)

Output:

X             0.000000
Y             0.000000
OBJECTID      0.000000
REF_NUM       0.000248
DATE_REV      0.000000
EVENT_DATE    0.000000
TRUE_SINK     0.000000
LONGDD        0.001734
LATDD         0.001981
COUNTY        0.000000
TWNSHP        0.008668
TWNSHP_D      0.008668
RANGE         0.008420
RANGE_D       0.009411
SECTION       0.010649
QTRSECT1      0.099802
QTRSECT2      0.083705
ACCURACY      0.000000
RPT_SOURCE    0.435612
RPT_NAME      0.796929
OCITY         0.457405
OZIP          0.777117
SIZDIM        0.000000
SINSHAPE      0.000000
SINLNGTH      0.253096
SINWIDTH      0.246161
SINDEPTH      0.303863
SLOPE         0.716691
WATSIN        0.000000
WATBLS        0.928430
LIMVIS        0.000000
CAVVIS        0.000000
SUBRATE       0.000000
PROPDAM       0.000000
REPAIR_S      0.000000
DRAINSTR      0.000000
SOILTYPE      0.256067
COMMENTS      0.067112
COMMENTS_2    0.408866
ACCESS_       0.300149
dtype: float64

As we can see, 'WATBLS' has about 93% null values, 'SLOPE' has about 72%, 'OZIP' has about 78%, and 'RPT_NAME' has about 80%. I usually try to get rid of columns with that many null values. You can simply use the df.drop() method to remove them, but I chose to write a line of code that drops any column with more than 50% null values.

df1 = df1[[column for column in df1 if df1[column].count() / len(df1) >= 0.5]]

So, we are left with the columns that have less than 50% null values.
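If you prefer a built-in alternative to the list comprehension above, pandas' dropna with a thresh argument does roughly the same thing. This is only a sketch; df1_alt is a hypothetical name, and the rest of the article keeps the df1 filtered above:

# Keep columns that have at least 50% non-null values
df1_alt = df1.dropna(axis=1, thresh=int(0.5 * len(df1)))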

Let’s dive into the analytics part now!

As this dataset has a lot of features, you can create many graphs and visuals from them. In the workplace, we usually talk to the stakeholders and focus primarily on the parts that are useful to them; then, if there is extra time, it is good practice to report any other interesting trends or facts we can see in the data.

But this is a tutorial where I am trying to demonstrate data processing, data visualization techniques, and machine learning. So, I will only present the ideas.

There are a lot of features in this dataset; some are continuous, some are categorical. First, I will take the continuous variables.

To begin with, it is important to check the distribution of each continuous variable. Here are the distributions for six variables:

df1[['TWNSHP', 'RANGE', 'SECTION', 'SINLNGTH', 'SINWIDTH', 'SINDEPTH']].hist(bins = 25,
                                                                            figsize=(10, 10))

Now we have a general idea of the distribution of the data: the range in which the majority of the data falls, the outliers, and the type of distribution. For a slightly deeper dive and some more specific information, we can use the .describe() method:

df1[['TWNSHP', 'RANGE', 'SECTION', 'SINLNGTH', 'SINWIDTH', 'SINDEPTH']].describe()

So, these are some general statistical details here.

It is important to know how all these variables interact with each other. I like to use a pairplot from the seaborn library for this; it shows the relationship between each pair of variables in one figure.

sns.pairplot(df1[['TWNSHP', 'RANGE', 'SECTION', 'SINLNGTH', 'SINWIDTH', 'SINDEPTH']],
            diag_kind='kde', kind = 'scatter', palette='husl')
plt.show()

This plot can be useful for spotting interesting relationships between the variables and for deciding whether any part needs a closer look. The diagonal also shows the distribution of each variable.

For example, I see a trend between RANGE and TWNSHP, so I will elaborate on that one. I am also adding SECTION as the size of the dots.

fig, ax = plt.subplots(figsize = (12, 8))
sns.scatterplot(x = "RANGE", y = "TWNSHP", size = "SECTION",
                sizes = (20, 300), data = df1, alpha = 0.4, ax = ax)

Here it is. Please feel free to check on some other variables in the same way if you are interested.

Let's check on some relationships between continuous and categorical variables. I picked RANGE and RANGE_D. Here is a violinplot and a stripplot together; this will show the distribution of RANGE for each category of RANGE_D.

fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.violinplot(x = "RANGE_D", y = "RANGE", data = df1, 
                   inner=None, color = "0.4")
ax = sns.stripplot(x = "RANGE_D", y = "RANGE", data = df1)
plt.title("RANGE and RANGE_D ")
ax.tick_params(labelsize=12)

There is not much data in RANGE for the 'S' category of RANGE_D.

In the same way, the following plot shows the relationship between SUBRATE and SINLNGTH. This time I use a boxplot with a stripplot. Whenever I plot a continuous variable against a categorical one, I like to include a stripplot as well, simply because it shows how much data goes into each box or violin.

fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.boxplot(x = "SUBRATE", y = "SINLNGTH", data = df1[df1['SINLNGTH'] < 50], 
                   color = "0.4")
ax = sns.stripplot(x = "SUBRATE", y = "SINLNGTH", data = df1[df1['SINLNGTH'] < 50])
plt.title("SUBRATE vs SINLNGTH Plot", fontsize = 20)
ax.tick_params(labelsize=12)

We saw SECTION vs TWNSHP in the pairplot above and did not see any significant or interesting trend there, but we can still explore it some more. Here I will add PROPDAM to the relationship to see how it looks for each category of PROPDAM.

# lmplot creates its own figure, so a separate plt.figure() call is not needed
g = sns.lmplot(x = 'SECTION', y = "TWNSHP", data = df1,
               robust = True, col = "PROPDAM", height = 7)
for ax in g.axes.flat:
    ax.set_title(ax.get_title(), fontsize='x-large')
    ax.set_ylabel(ax.get_ylabel(), fontsize='x-large')
    ax.set_xlabel(ax.get_xlabel(), fontsize='x-large')

You can see the TWNSHP vs SECTION for each category of PROPDAM.

We were able to create those visualizations without much data preparation. But before moving further, it is important to process the data a bit more.

These are the few things to do:

  1. Extracting the month, year, and day from the date columns.
  2. Filling in the null values.
  3. Converting the categorical variables to numeric values.

The date columns do not have any null values, so we just need to extract the month, year, and day from them. Before that, it is necessary to convert them to datetime format:

df1['DATE_REV'] = pd.to_datetime(df1['DATE_REV'])
df1['EVENT_DATE'] = pd.to_datetime(df1['EVENT_DATE'])

df1['Date_Rev_Month'] = df1['DATE_REV'].dt.month
df1['Date_Rev_Year'] = df1['DATE_REV'].dt.year
df1['Date_Rev_Day'] = df1['DATE_REV'].dt.day

df1['Event_Date_Month'] = df1['EVENT_DATE'].dt.month
df1['Event_Date_Year'] = df1['EVENT_DATE'].dt.year
df1['Event_Date_Day'] = df1['EVENT_DATE'].dt.day

This part is done. Now we have all these new columns in the DataFrame, which are categorical in nature. Just as we looked at the general distribution of the continuous variables, we should do the same for the categorical variables.

As there are a lot of categorical variables here, I decided to divide them into two parts and plot bar graphs for them. Here is the first part:

df1_categorical1 = df1[['TRUE_SINK', 'TWNSHP_D', 'RANGE_D', 
     'QTRSECT1', 'QTRSECT2', 'ACCURACY', 'SIZDIM', 'SINSHAPE', 'WATSIN', 
     'LIMVIS']]

fig, axes = plt.subplots(round(len(df1_categorical1.columns) / 3), 3, figsize=(12, 15))

for i, ax in enumerate(fig.axes):
    if i < len(df1_categorical1.columns):
        # Plot first, then rotate the tick labels so the rotation actually applies
        sns.countplot(x=df1_categorical1.columns[i], alpha=0.7, data=df1_categorical1, ax=ax)
        ax.tick_params(axis='x', rotation=45)

fig.tight_layout()

And this is the second part:

df1_categorical2 = df1[['CAVVIS', 'SUBRATE', 'PROPDAM', 'REPAIR_S', 'DRAINSTR',
   'Event_Date_Month', 'Date_Rev_Year', 'Date_Rev_Month']]

fig, axes = plt.subplots(round(len(df1_categorical2.columns) / 3), 3, figsize=(13, 18))

for i, ax in enumerate(fig.axes):
    if i < len(df1_categorical2.columns):
        # Plot first, then rotate the tick labels so the rotation actually applies
        sns.countplot(x=df1_categorical2.columns[i], alpha=0.7, data=df1_categorical2, ax=ax)
        ax.tick_params(axis='x', rotation=45)

fig.tight_layout()

There is some very good information in these two graphs. Now we can see the frequency of occurrence of each category for each variable.
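If you also want the exact counts behind these bar charts, value_counts gives them directly; SUBRATE here is just one example column:

# Frequency of each category, including any remaining nulls
df1['SUBRATE'].value_counts(dropna=False)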

Dealing with null values is a very important part. There are several ways to handle them: you can fill them using the rest of the values, for example with the mean or median, or with a machine learning model. In this example, I would like to fill them with a value that clearly shows it was originally null.

So, for continuous variables I filled the null values with the value 9999, and for the categorical variables I used the string 'None'.

for c in df1.columns:
    if df1[c].dtype == 'object':
        df1[c] = df1[c].fillna('None')
    else:
        df1[c] = df1[c].fillna(9999)
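If you would rather use the mean or median imputation mentioned above, here is a minimal sketch using scikit-learn's SimpleImputer (df1_imputed is a hypothetical name; the rest of this article keeps the sentinel-filled df1):

from sklearn.impute import SimpleImputer

# Median imputation for the numeric columns only (an alternative to the loop above)
num_cols = df1.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='median')
df1_imputed = df1.copy()
df1_imputed[num_cols] = imputer.fit_transform(df1[num_cols])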

Finally, we should convert the categorical variables to numeric. Here is the tricky part: if you check the data types of the date columns, you will see that they are objects, but we do not want to convert them. So, we will keep them out of the conversion process:

dates = ['DATE_REV', 'EVENT_DATE']
for c in df1.columns:
    if c not in dates and df1[c].dtype == 'object':
        df1[c] = df1[c].astype('category')
        df1[c] = df1[c].cat.codes
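As an alternative to integer category codes, you could one-hot encode the categorical columns with pd.get_dummies instead; some models handle indicator columns better than arbitrary integer codes. This is only a sketch, df1_onehot is a hypothetical name and is not used later in the article:

# If used instead of the loop above, this creates one indicator column per category,
# still leaving the date columns untouched
df1_onehot = pd.get_dummies(df1, columns=[c for c in df1.columns
                                          if c not in dates and df1[c].dtype == 'object'])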

Let's make our correlation matrix. Because we have a lot of variables, it could become too crowded, so I decided to mask out correlations weaker than 0.4 to make it a bit easier to read.

# Compute the correlation matrix first
corr = df1.corr()

plt.figure(figsize=(18, 18))
sns.heatmap(corr[(corr >= 0.4) | (corr <= -0.4)],
            cmap='viridis', vmax = 1.0, vmin = -1, linewidths=1.5, annot=True, annot_kws={'size': 8},
            square=True)

Now you can see it here. Please feel free to exclude some variables that are not showing any significant correlation with the others.
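If you would rather read off the strongest relationships as a list instead of scanning the heatmap, a small sketch like this ranks the absolute pairwise correlations (note that each pair appears twice, once in each order):

# Mask the diagonal, flatten the matrix, and sort by absolute correlation
corr_pairs = (corr.abs()
                  .where(~np.eye(len(corr), dtype=bool))
                  .unstack()
                  .dropna()
                  .sort_values(ascending=False))
corr_pairs.head(10)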

I want to work on one more visualization for this dataset and then finish the visualization part. There are latitude and longitude data available in the dataset, so it would be good to show the incidents on an interactive map. That requires the 'folium' package to be installed. You can use the following line in your notebook to install folium:

!pip install folium

Now, let's import folium and draw the map. Here is the complete code for it:

import folium
fl = folium.Map(location=[27.665, -81.516], zoom_start=7)
incidents = folium.map.FeatureGroup()
for lat, lng in zip(df1.Y, df1.X):
    incidents.add_child(
        folium.features.CircleMarker(
            [lat, lng], 
            radius = 5,
            color = 'yellow',
            fill = True, 
            fill_color = 'blue',
            fill_opacity = 0.6
        ))
fl.add_child(incidents)

I just took a picture of the map, but it is actually interactive: you can zoom in and out and have a closer look at a specific area.
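Since the map object is interactive, you can also save it as a standalone HTML file to open in a browser or share (the file name here is just an example):

fl.save('florida_sinkhole_map.html')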

Predictive Modeling

Not all datasets are suitable for predictive modeling. Still, it is common practice to look for the scope of a predictive model while doing an exploratory data analysis.

I will not explain the details of machine learning here. If you need more details on machine learning, please go through this series.

You should choose your target variable based on your stakeholders' needs. But when we are dealing with a random, unfamiliar dataset, we can pick one by inspection. Here I am taking 'SUBRATE' as the target variable.

We have many variables here, and all of them may not be necessary. At first glance, you can tell that these are a few columns we can safely get rid of before machine learning:

df1 = df1.drop(columns=['OBJECTID', 'X', 'Y', 'DATE_REV', 'EVENT_DATE', 'REF_NUM', 'COMMENTS', 'COMMENTS_2', 'ACCESS_'])

We do not need the date columns anymore because we already extracted the day, month, and year from them.

Checking how many total features we are left with:

len(df1.columns)

Output:

42

I would like to pick 25 features for this model, since too many features may cause overfitting. The SelectKBest method from the sklearn library can be used for this. We also need to define the input features and the target variable.

from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import f_classif 
X = df1.drop(columns=['SUBRATE'])
y = df1['SUBRATE']

We call the SelectKBest method, passing the parameter k as 25 since we want 25 features, and score_func as f_classif, which we already imported above.

uni = SelectKBest(score_func=f_classif, k=25)
fit = uni.fit(X, y)

Here is the list of the 25 best features:

X_best = X.columns[fit.get_support(indices=True)].tolist()
X_best

Output:

['TRUE_SINK',
 'COUNTY',
 'TWNSHP',
 'TWNSHP_D',
 'RANGE',
 'RANGE_D',
 'SECTION',
 'ACCURACY',
 'RPT_SOURCE',
 'OCITY',
 'SIZDIM',
 'SINSHAPE',
 'SINLNGTH',
 'SINWIDTH',
 'SINDEPTH',
 'WATSIN',
 'LIMVIS',
 'CAVVIS',
 'REPAIR_S',
 'DRAINSTR',
 'SOILTYPE',
 'Date_Rev_Year',
 'Event_Date_Month',
 'Event_Date_Year',
 'Event_Date_Day']
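If you are curious how strongly each candidate feature scored, the fitted selector also exposes the ANOVA F-scores, which you can rank (a quick inspection step, not required for the model):

# Highest-scoring candidate features first
pd.Series(fit.scores_, index=X.columns).sort_values(ascending=False).head(10)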

Splitting the data into training and testing sets using the train_test_split method:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1[X_best], y, test_size = 0.4, random_state=27)

The data is ready for model development. There are many machine learning models for classification; I picked a decision tree for this one.

To pick suitable parameters, it may be necessary to try a few combinations. The sklearn library has a 'GridSearchCV' method that can be used to try several parameter combinations in one go. Here is a detailed tutorial for that.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Try every combination of these parameters with cross-validation
parameters = {'max_depth': [8, 10, 12], 'max_leaf_nodes': [20, 24, 28, 32]}
dt = DecisionTreeClassifier()
dt = GridSearchCV(dt, parameters)

The model is ready. Now we need to fit the training data we prepared earlier:

dt.fit(X_train, y_train)
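After the grid search finishes, it is worth checking which parameter combination it picked and its cross-validated score:

print(dt.best_params_)   # the winning max_depth / max_leaf_nodes combination
print(dt.best_score_)    # mean cross-validated accuracy of that combination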

Model training is done. Time to check the accuracy of the predictions. Accuracy score on the training data:

dt.score(X_train, y_train)

Output:

0.8447563996696945

Accuracy score on test data:

dt.score(X_test, y_test)

Output:

0.8118811881188119

The accuracy scores look pretty good to me. Please feel free to try some other models and some other model evaluations as well.
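For example, since accuracy alone can hide class imbalance, here is a quick sketch of two other common evaluations on the test set:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall and the confusion matrix give a fuller picture than accuracy alone
y_pred = dt.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))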

Conclusion

There are many different ways to approach a dataset, and it is impossible to show all the ways to explore a big one. However, I wanted to share some methods of brainstorming, visualization, and modeling. Hopefully, this article gives you ideas and direction that you can use in your own work.

Feel free to follow me on Twitter and like my Facebook page.


#DataScience #machinelearning #artificialintelligence #datavisualization #python #sklearn 
