Exploratory data analysis is fundamental. Sometimes it is needed simply to understand the data well; sometimes it is done before diving into modeling. Either way, a big dataset is of no use if you cannot extract the necessary information from it. This article explains some techniques and visualization code for extracting important information from a dataset.
There is another aspect of exploratory data analysis. We are only data scientists, right? But data comes from all areas of life. There might be a medical, environmental, or financial dataset in which some or all of the terms are unfamiliar. What to do then?
The dataset I will use for demonstration is a heart disease dataset from Kaggle. Several of its terms may not be familiar to a lot of people. I did not understand all of them myself when I worked on it; for example, I did not know exactly what ST depression or major vessels are. Still, you can derive some good information from the data. We will see how.
All the findings in this article are particular to this dataset only. Please do not take them as absolute truth.
Overview of the Dataset
As mentioned earlier, this dataset is about heart disease. Each row holds the health data of one person.
Feel free to download the dataset from this link and follow along.
I downloaded the dataset and put it in the same folder as my R file. I used this code to import the dataset in the RStudio environment:
heart = read.csv("Heart.csv")
The dataset has 14 columns, which is too many to show in a screenshot here. Here are the names of the columns and an explanation of each variable as described on Kaggle.
1. age: The age of a person
2. sex: The person’s gender (1 = male, 0 = female)
3. cp: The types of chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
4. trestbps: Resting blood pressure (mm Hg on admission to the hospital)
5. chol: Cholesterol measurement in mg/dl
6. fbs: Fasting blood sugar (if > 120 mg/dl, 1 = true; 0 = false)
7. restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
8. thalach: Maximum heart rate achieved
9. exang: Exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)
11. slope: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
12. ca: The number of major vessels (0–3)
13. thal: A blood disorder called thalassemia (1 = normal; 2 = fixed defect; 3 = reversible defect)
14. target: Heart disease (0 = no, 1 = yes)
The dataset is already clean and well organized, so not much cleaning was necessary.
There are a lot of things that can be done with a dataset like this. It can be analyzed in many different ways, and many different plots and tables can be generated to explain it.
For this article, I chose to look for correlations between heart disease and the other parameters of the data.
Variable number 14 above, ‘target’, shows whether a person has heart disease or not. We will focus a lot on this variable.
To begin with, it will be helpful to see the correlation between heart disease and the other variables in the dataset. I will use the library ‘corrplot’ and make a correlation plot that will show the correlation of each variable with the others.
library(corrplot)
corrplot(cor(heart), type="upper")
Because our focus is the relation between heart disease and the other parameters, let’s have a close look at the correlation between the ‘target’ variable and the other variables. The size of the dots shows how strong the correlation is.
This correlation plot shows that the ‘restecg’, ‘fbs’, and ‘chol’ parameters are only weakly correlated with the ‘target’ variable, so I will drop them from the dataset for this particular study.
heart = subset(heart, select=c(-restecg, -chol,-fbs))
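Reading dot sizes off a plot can be imprecise, so it may help to look at the numbers behind it. The following is a minimal sketch that pulls the ‘target’ column out of a correlation matrix and orders the other variables by the strength of their correlation. The data frame ‘toy’ and its values are made up for illustration and are not the Kaggle data; with the real data you would pass ‘heart’ (before recoding any column to strings) instead.

```r
# Toy stand-in for the numeric 'heart' data frame (made-up values, NOT the Kaggle data)
toy = data.frame(
  thalach = c(150, 172, 130, 165, 120, 160),
  oldpeak = c(0.0, 0.2, 2.5, 0.5, 3.0, 0.1),
  chol    = c(240, 210, 250, 230, 215, 245),
  target  = c(1, 1, 0, 1, 0, 1)
)

# Full correlation matrix, the same input that corrplot() draws
cor_mat = cor(toy)

# Keep only the correlations with 'target', dropping target's correlation with itself
target_cor = cor_mat[, "target"]
target_cor = target_cor[names(target_cor) != "target"]

# Order by absolute strength so that weak candidates end up last
target_cor = target_cor[order(abs(target_cor), decreasing = TRUE)]
round(target_cor, 2)
```

Variables at the end of this vector are the ones a plot would show with the smallest dots.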
We have 11 variables now. I will demonstrate how some of the numeric and some of the categorical variables relate to the ‘target’ variable.
But the categorical variables are denoted as 0, 1, 2, 3. I changed them to more meaningful string values as per the description above. Note that the code below assumes ‘cp’ and ‘slope’ are coded starting from 0 in the data file (0 to 3 and 0 to 2), rather than from 1 as in the description.
Here is the code for changing the categorical variables to the corresponding strings:
heart$sex[heart$sex == 0] = "female"
heart$sex[heart$sex == 1] = "male"
heart$cp[heart$cp == 0] = "typical angina"
heart$cp[heart$cp == 1] = "atypical angina"
heart$cp[heart$cp == 2] = "non-anginal pain"
heart$cp[heart$cp == 3] = "asymptomatic"
heart$exang[heart$exang == 0] = "no"
heart$exang[heart$exang == 1] = "yes"
heart$slope[heart$slope == 0] = "upsloping"
heart$slope[heart$slope == 1] = "flat"
heart$slope[heart$slope == 2] = "downsloping"
heart$thal[heart$thal == 1] = "normal"
heart$thal[heart$thal == 2] = "fixed defect"
heart$thal[heart$thal == 3] = "reversible defect"
heart$target1 = heart$target
heart$target1[heart$target1 == 0] = "no heart disease"
heart$target1[heart$target1 == 1] = "heart disease"
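As an alternative to the element-by-element replacement above, the same recoding can be sketched with base R’s factor(), which maps each code to a label in one call. This is not the approach used in this article, only a different way to reach a similar result. The data frame ‘toy’ below is made up for illustration.

```r
# Toy stand-in with two of the coded columns (made-up values, NOT the Kaggle data)
toy = data.frame(sex = c(0, 1, 1, 0), cp = c(0, 2, 3, 1))

# factor() pairs each numeric code in 'levels' with the label at the same position
toy$sex = factor(toy$sex, levels = c(0, 1), labels = c("female", "male"))
toy$cp = factor(
  toy$cp,
  levels = 0:3,
  labels = c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic")
)

str(toy)
table(toy$cp)
```

One difference worth knowing: any code that is not listed in ‘levels’ becomes NA, whereas the replacement approach leaves such a value in place as it was.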
The dataset is ready. Let’s dive into exploratory analysis.
Exploratory data analysis starts from questions, curiosity, and necessity. I came up with these questions below and will answer them. As we will focus primarily on heart disease, it is intuitive to start with the proportion of the people with heart disease and with no heart disease in the dataset.
It’s very close. This dataset has 51% of people with heart disease and 49% of people with no heart disease.
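A split like this can be computed with table() and prop.table(). The sketch below uses a small made-up data frame named ‘toy’ (not the Kaggle data); with the real data, ‘heart$target1’ would take the place of ‘toy$target1’.

```r
# Toy stand-in for the 'target1' column (made-up values, NOT the Kaggle data)
toy = data.frame(
  target1 = c("heart disease", "no heart disease", "heart disease", "heart disease")
)

# table() counts each category; prop.table() turns the counts into shares of the total
target_share = round(prop.table(table(toy$target1)), 2)
target_share
```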
Age of the Population
It is a common belief that older people are more prone to heart disease. Here is the distribution of the age of the population in the dataset.
library(ggplot2)
ggplot(heart, aes(x=age)) +
  geom_histogram() +
  ggtitle("Distribution of age of the population") +
  xlab("Age") +
  ylab("Count")
The distribution is nearly normal and slightly right-skewed. The majority of the population lies in the 50 to 65 age group. Very few people are in their thirties, and very few are above 70.
Instead of looking at each particular age, looking at age groups might be more meaningful in terms of heart disease rate. For that, the ‘age’ variable was divided into age groups and stored in a separate column named ‘age_grp’.
heart$age_grp = cut(heart$age, breaks = seq(25, 77, 4))
Find the number of people with heart disease in each age group.
library(dplyr)
target_by_age = heart %>%
  group_by(age_grp) %>%
  summarise(heart_disease = sum(target))
target_by_age
The result has one row per age group. Now, make a bar plot of this data to see the number of people with heart disease in each age group.
target_by_age %>%
  ggplot(aes(x=age_grp, y=heart_disease)) +
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
  xlab("") +
  ylab("No. of People with Heart Disease") +
  ggtitle("No of Heart disease in Age Group") +
  theme_bw()
This plot shows that the 49 to 57 years range has the most heart disease cases, even more than the people above 57. On the other hand, people below 30 and above 73 have a similar number of heart disease patients. That could be because there are far fewer people in their 30s, 40s, and 70s in the sample.
To understand it some more, the proportion of heart disease patients in each age group will help. For that let’s find the proportion of people with heart disease in each group.
prop_in_age = heart %>%
  group_by(age_grp) %>%
  summarise(heart_disease_proportion = round(sum(target)/n(), 3)*100)
prop_in_age
Here is the bar plot of this data:
prop_in_age %>%
  ggplot(aes(x=age_grp, y=heart_disease_proportion)) +
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
  xlab("") +
  ylab("Proportion of People with Heart Disease") +
  ggtitle("Proportion of Heart disease in Age Groups") +
  theme_bw()
In the age group below 30, 100% of people have heart disease. That is definitely not the case in real life. It is clear that this is not a representative sample, so it is not possible to draw any conclusion from this dataset about how age contributes to heart disease.
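A proportion such as 100% is easier to judge when the group size is shown next to it. The following sketch, written in base R and using a made-up data frame named ‘toy’ (not the Kaggle data), puts the number of people and the heart disease percentage of each age group side by side. The column names ‘n’ and ‘heart_disease_pct’ are chosen here for illustration only.

```r
# Toy stand-in for the 'age' and 'target' columns (made-up values, NOT the Kaggle data)
toy = data.frame(
  age    = c(29, 45, 46, 52, 54, 55, 60, 76),
  target = c(1,  1,  0,  1,  0,  1,  0,  0)
)

# Same grouping as in the article: 4-year bins from 25 to 77
toy$age_grp = cut(toy$age, breaks = seq(25, 77, 4))

# Number of people in each group (empty groups are kept with a count of 0)
group_size = table(toy$age_grp)

# Share of people with heart disease in each group (NA for empty groups)
disease_share = tapply(toy$target, toy$age_grp, mean)

summary_by_age = data.frame(
  age_grp = names(group_size),
  n = as.vector(group_size),
  heart_disease_pct = round(100 * as.vector(disease_share), 1)
)
summary_by_age
```

A group with only one or two people can easily show 0% or 100%, which says more about the sample size than about age.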
Gender or Sex
Before looking into the relationship between heart disease and gender, it is important to know the proportion of males and females in the dataset.
The proportions of males and females are not similar: the dataset is 30% female and 70% male. The proportion of males and females with heart disease is the next important finding.
round(prop.table(table(heart$sex, heart$target1)), 2)
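In the female population, far more people have heart disease than not. In the male population it is the reverse: males without heart disease make up 40% of the dataset, while males with heart disease make up only 29%. Note that these percentages are shares of the whole dataset, not of the male population alone, because prop.table() was called without a margin.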
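To see the heart disease rate within each sex rather than as a share of the whole dataset, prop.table() accepts a ‘margin’ argument. The sketch below contrasts the two on a made-up data frame named ‘toy’ (not the Kaggle data).

```r
# Toy stand-in for the 'sex' and 'target1' columns (made-up values, NOT the Kaggle data)
toy = data.frame(
  sex = c("female", "female", "female", "male", "male", "male", "male", "male"),
  target1 = c("heart disease", "heart disease", "no heart disease",
              "heart disease", "heart disease",
              "no heart disease", "no heart disease", "no heart disease")
)

counts = table(toy$sex, toy$target1)

# Without a margin: every cell is a share of the whole table (all cells sum to 1)
joint_share = round(prop.table(counts), 2)

# margin = 1: every cell is a share of its row, so each sex sums to 1
within_sex = round(prop.table(counts, margin = 1), 2)

joint_share
within_sex
```

The second table answers the question "what share of males have heart disease?" directly, which the first table does not.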
The Slope of the Peak Exercise ST Segment (slope)
A bar plot of different types of slope and heart disease conditions will be appropriate to understand it.
ggplot(heart, aes(x= slope, fill=target1)) +
  geom_bar(position = 'dodge') +
  xlab("Type of Slope") +
  ylab("Count") +
  ggtitle("Analysis of types of slope") +
  # fill values sort alphabetically: "heart disease" first, then "no heart disease"
  scale_fill_discrete(name = "Heart disease", labels = c("Yes", "No"))
Clearly, the rate of heart disease differs across slope types. With a downsloping segment, the number of heart disease patients is far higher (about 340) than the number of people with no heart disease (about 125). With a flat segment, it is almost the opposite: about 325 people have no heart disease and about 160 have heart disease. For upsloping, the difference is small, but the number of no heart disease cases is higher.
Is that trend the same in the male and female population?
That is another valid question to answer. The same type of bar plot for the male and the female portion of the dataset will help. First, the dataset was separated into a male and a female population:
male_data = heart[heart$sex=="male",]
female_data = heart[heart$sex=="female",]
Now, make the same bar plot for the male and the female population.
ggplot(male_data, aes(x= slope, fill=target1)) +
  geom_bar(position = 'dodge') +
  xlab("Type of Slope") +
  ylab("Count") +
  ggtitle("Analysis of types of slope for males") +
  # same label order as above: "heart disease" sorts before "no heart disease"
  scale_fill_discrete(name = "Heart disease", labels = c("Yes", "No"))
ggplot(female_data, aes(x= slope, fill=target1)) +
  geom_bar(position = 'dodge') +
  xlab("Type of Slope") +
  ylab("Count") +
  ggtitle("Analysis of types of slope for females") +
  scale_fill_discrete(name = "Heart disease", labels = c("Yes", "No"))
The plot for the male population follows the same trend as the overall bar plot for the slope. But in the female population, the trend is different. For downsloping, the number of heart disease cases is far higher (180) than the number of no heart disease cases (25). For the flat slope, the two counts are close, but the number of heart disease cases is a bit higher.
The Number of Major Vessels (ca)
The data contains 0, 1, 2, 3, or 4 as the number of major vessels for a person (the description above lists 0–3, but the value 4 also occurs). According to the correlation plot, the number of vessels has a good correlation with heart disease. Here is a visual representation of how the number of major vessels relates to heart disease:
mosaicplot(table(heart$target1, heart$ca),
           col=c("#754869", "coral", "skyblue", "#423f42", "#ed18c6"),
           las=1, main="Heart Disease for Major Vessels")
About 2/3 of the people with heart disease have 0 major vessels. Very few people have 4 major vessels, so it is hard to know the impact of that.
Male and female populations may have a different number of major vessels or different levels of relationship between major vessels and heart disease. Here is a plot that shows the major vessels vs heart disease in males:
mosaicplot(table(male_data$target1, male_data$ca),
           col=c("#754869", "coral", "skyblue", "#423f42", "#ed18c6"),
           las=1, main="Major Vessels in Males")
Looks like the male population follows a very similar trend as the total population for major vessels. Here is a plot that shows the correlation of the number of major vessels and heart disease in the female population:
mosaicplot(table(female_data$target1, female_data$ca),
           col=c("#754869", "coral", "skyblue", "#423f42", "#ed18c6"),
           las=1, main="Major Vessels in Females")
In the female population, the number of major vessels is 0, 1, 2, or 3; no female has 4. As in the male population, most females with heart disease have 0 major vessels. In the no heart disease group as well, the majority of females have 0 or 2 major vessels.
ST Depression Induced by Exercise Relative to Rest (oldpeak)
Here are the boxplots showing the distribution of ST depression for people with and without heart disease.
ggplot(heart, aes(x = target1, y = oldpeak)) +
  ylab("ST Depression") +
  xlab("Heart Disease State") +
  ggtitle("ST Depression Induced by Exercise vs Heart Disease") +
  geom_boxplot()
On the no heart disease side, the interquartile range is higher (about 2) than on the heart disease side (1).
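The interquartile ranges read off the boxplots can also be computed directly. Here is a minimal sketch using base R’s IQR() with tapply(), on a made-up data frame named ‘toy’ (not the Kaggle data); with the real data, ‘heart$oldpeak’ and ‘heart$target1’ would be used instead.

```r
# Toy stand-in for the 'oldpeak' and 'target1' columns (made-up values, NOT the Kaggle data)
toy = data.frame(
  oldpeak = c(0.0, 0.4, 1.0, 1.6, 0.0, 1.0, 2.0, 3.0),
  target1 = rep(c("heart disease", "no heart disease"), each = 4)
)

# IQR() returns the 75th percentile minus the 25th; tapply() applies it per group
iqr_by_state = tapply(toy$oldpeak, toy$target1, IQR)
iqr_by_state

# The quartiles themselves, in case the box edges are of interest
tapply(toy$oldpeak, toy$target1, quantile, probs = c(0.25, 0.5, 0.75))
```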
Does this type of depression change with age and together do they have different impacts on heart condition?
A combined scatter plot may provide some insights on that.
ggplot(heart, aes(x = age, y = oldpeak, color=target1, size = factor(oldpeak))) +
  geom_point(alpha=0.3) +
  labs(color = "Heart Disease State") +
  guides(size="none") +
  xlab("Age") +
  ylab("ST Depression") +
  ggtitle("Age vs ST Depression Separated by Heart Condition")
As discussed in the beginning, this dataset is different. It shows that heart disease decreases with higher age. It also looks like heart disease cases go down when ST depression goes up. The size of the dots changes with ST depression. But from this picture, it is hard to derive any relationship between age and ST depression.
Resting Blood Pressure
Boxplots of resting blood pressure separated by heart disease state will provide an initial idea.
ggplot(heart, aes(x = target1, y = trestbps)) +
  geom_boxplot() +
  xlab("Heart Disease State") +
  ylab("Resting Blood Pressure") +
  ggtitle("Boxplots of Resting Blood Pressure by Heart Condition")
The plot above shows that the interquartile range of resting blood pressure is slightly higher for the no heart disease group, but the medians of the two box plots look the same. The next chart is a scatter plot of age vs resting blood pressure, with color showing the heart disease state and dot size depending on ST depression. This one should reveal some more information.
ggplot(data=heart, aes(x=age, y=trestbps, color=target1, size=factor(oldpeak))) +
  geom_point(alpha=0.3) +
  xlab("Age") +
  ylab("Resting Blood Pressure") +
  labs(color="Heart Disease State") +
  guides(size="none") +
  ggtitle("Age vs Resting Blood Pressure Separated by Heart Condition")
This plot shows something very interesting. When the resting blood pressure is really low, around 100 or less, heart disease cases outnumber no heart disease cases. When the resting blood pressure is above 165, no heart disease cases outnumber heart disease cases.
As we found before, most big dots are blue. That means more people with high ST depression have no heart disease. At the same time, more of the bigger dots are in the higher age range, so ST depression is higher in older people.
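The observation about the resting blood pressure cut-offs of 100 and 165 comes from reading a scatter plot, so it may be worth confirming with counts. The sketch below bins the pressure with cut() and cross-tabulates the bands against the heart disease state. The data frame ‘toy’ and the column name ‘bp_band’ are made up for illustration and are not part of the Kaggle data.

```r
# Toy stand-in for the 'trestbps' and 'target1' columns (made-up values, NOT the Kaggle data)
toy = data.frame(
  trestbps = c(94, 100, 120, 130, 140, 150, 170, 180),
  target1  = c("heart disease", "heart disease", "no heart disease", "heart disease",
               "no heart disease", "heart disease", "no heart disease", "no heart disease")
)

# Three bands: up to and including 100, above 100 up to 165, and above 165
toy$bp_band = cut(
  toy$trestbps,
  breaks = c(-Inf, 100, 165, Inf),
  labels = c("100 or less", "101 to 165", "above 165")
)

# Counts of each heart disease state within each band
bp_table = table(toy$bp_band, toy$target1)
bp_table
```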
I tried to show some visualizations and techniques for summarising a dataset. This dataset is not too big; it has only 14 columns. Still, there is a lot more that can be explored, and there are a few variables that I did not even touch. Please feel free to explore some more on your own.