Univariate data analysis is the simplest form of data analysis. As the name suggests, it deals with one variable. It does not look for cause and effect or for relationships between variables; its purpose is to summarize and describe a single variable. When two variables are included, the analysis becomes bivariate. In this article, we will explore and visualize some data using univariate and bivariate analysis, and a few exercises will include three variables as well. All the findings hold only for the particular dataset used in this article.

## Know the Dataset

We will use the Heart dataset from Kaggle. First import the packages and the dataset.

```
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from statsmodels import api as sm
import numpy as np

df = pd.read_csv("Heart.csv")
```

Let’s see the column names clearly:

```
df.columns
#Output:
Index(['Unnamed: 0', 'Age', 'Sex', 'ChestPain', 'RestBP', 'Chol', 'Fbs', 'RestECG', 'MaxHR', 'ExAng', 'Oldpeak', 'Slope', 'Ca', 'Thal', 'AHD'], dtype='object')
```

You may not understand now what each column means. I will only use a few columns in this article and I will keep explaining what the column name means as we go.

## Solve Some Questions

1. Find the population proportions with different types of blood disorders.

We will find that in the ‘Thal’ column. Here, ‘Thal’ refers to a blood disorder called thalassemia. Pandas has a function called ‘value_counts’ that counts the occurrences of each category in a Series.

```
x = df.Thal.value_counts()
x
```

These are the numbers of people having normal, reversible, and fixed disorders. Now, divide each of them by the total population to find the population proportion.

`x / x.sum()`

Notice that these proportions add up to 1, but only because ‘value_counts’ leaves out missing values by default, so the denominator is not the whole population. There might be some missing values. Fill those with ‘Missing’ and then calculate the proportions again.

```
df["Thal"] = df.Thal.fillna("Missing")
x = df.Thal.value_counts()
x / x.sum()
```
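As an alternative to filling the gaps by hand, ‘value_counts’ can count missing values itself. The sketch below uses a small made-up Series in place of `df.Thal` (the values are hypothetical, not taken from the Heart dataset) and relies on the `dropna` and `normalize` parameters of ‘value_counts’.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df.Thal; the values are made up.
thal = pd.Series(["normal", "reversable", "normal", np.nan, "fixed", "normal"])

# Default behaviour: missing values are dropped before counting.
counts_without_missing = thal.value_counts()

# dropna=False keeps NaN as its own category;
# normalize=True returns proportions instead of counts.
props_with_missing = thal.value_counts(dropna=False, normalize=True)

print(counts_without_missing.sum())  # number of non-missing rows
print(props_with_missing)
```

With this approach the denominator is the full number of rows, so the missing share shows up as its own line in the output.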

So, there were a few missing values. In this dataset, 54.79% of people fall in the normal category. The next biggest group, 38.16%, has reversible thalassemia.

2. Find the minimum, maximum, average, and standard deviation of Cholesterol data.

There is a function called ‘describe’. Let’s use that. We will get all the information we needed and also some other useful parameters which will help us understand the data even better.

So, we got a few extra useful parameters. The population count is 303. We are not going to use that in this article, but it is important in statistical analysis, especially in inferential statistics. The ‘describe’ function also returns the 25th, 50th, and 75th percentiles, which give an idea of the distribution of the data.
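The call itself is not shown above; presumably it is something like `df.Chol.describe()`. Here is a minimal sketch on a hypothetical five-value Series (not the real cholesterol data) that shows what ‘describe’ returns and how to pick out individual statistics by label.

```python
import pandas as pd

# Hypothetical cholesterol readings; not taken from the Heart dataset.
chol = pd.Series([180, 200, 220, 240, 260], name="Chol")

# describe() returns a Series indexed by statistic name:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = chol.describe()
print(summary)

# Individual statistics can be read from the result by label.
print(summary["min"], summary["max"], summary["mean"], summary["std"])
```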

3. Make a plot of the distribution of the Cholesterol data.

`sns.histplot(df.Chol.dropna(), kde=True)`

The distribution is slightly right-skewed with some outliers.
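A numeric check can back up what the plot suggests. The sketch below uses made-up numbers with one large value (an assumption for illustration, not the Heart data): in a right-skewed sample the mean sits above the median and the pandas `skew` method returns a positive value.

```python
import pandas as pd

# Hypothetical sample with one large value pulling the right tail out.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 10])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()  # sample skewness; positive means right-skewed

print(mean_value, median_value, skewness)
```

On the real data the same idea would be `df.Chol.skew()`, assuming the column is numeric.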

4. Find the mean of RestBP (resting blood pressure). Then, calculate the proportion of people whose RestBP is higher than the mean RestBP.

`mean_rbp = df.RestBP.dropna().mean()`

Mean RestBP was 131.69. First, select the rows where RestBP is greater than the mean RestBP. Then divide their number by the length of the total dataset.

`len(df[df["RestBP"] > mean_rbp])/len(df)`

The result is 0.44 or 44%.
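The same proportion can be computed without slicing the DataFrame, because the mean of a boolean Series is the share of `True` values. A minimal sketch on hypothetical blood pressure values (not the real RestBP column):

```python
import pandas as pd

# Hypothetical resting blood pressure values; not from the Heart dataset.
rest_bp = pd.Series([110, 120, 130, 140, 150], name="RestBP")

mean_rbp = rest_bp.mean()

# Comparing a Series with a scalar gives a boolean Series.
above_mean = rest_bp > mean_rbp

# The mean of booleans equals (number of True) / (number of rows).
prop_above = above_mean.mean()

# Same result as the slicing approach used in the article.
prop_by_len = len(rest_bp[rest_bp > mean_rbp]) / len(rest_bp)

print(prop_above, prop_by_len)
```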

5. Plot the Cholesterol data against the age group to observe the difference in cholesterol levels in different age groups of people.

Here is the solution. Make a new column in the dataset that assigns each person to an age group.

`df["agegrp"]=pd.cut(df.Age, [29,40,50,60,70,80])`
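To see what ‘pd.cut’ produces, here is a small sketch on a few hypothetical ages with the same bin edges. By default the bins are closed on the right, so a value equal to the lowest edge (29 here) falls outside the first bin and becomes missing; `include_lowest=True` is one way to keep it.

```python
import pandas as pd

# Hypothetical ages chosen to hit the bin edges.
ages = pd.Series([29, 35, 40, 41, 77])
bins = [29, 40, 50, 60, 70, 80]

# Default: intervals are (29, 40], (40, 50], ... so 29 itself is not covered.
groups = pd.cut(ages, bins)
print(groups)

# include_lowest=True makes the first interval include its left edge.
groups_incl = pd.cut(ages, bins, include_lowest=True)
print(groups_incl)
```

Whether this matters for the article's data depends on whether anyone in the dataset is exactly 29 years old.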

Now, make the box plots. Place the age groups on the x-axis and the cholesterol level on the y-axis.

```
plt.figure(figsize=(12,5))
sns.boxplot(x = "agegrp", y = "Chol", data=df)
```

The box plot shows cholesterol increasing with age. It is a good idea to check whether gender plays any role, that is, whether cholesterol levels differ between genders. In our Sex column, the numbers 0 and 1 stand for female and male. We will make a new column that replaces 1 with ‘Male’ and 0 with ‘Female’.

```
df["Sex1"] = df.Sex.replace({1: "Male", 0: "Female"})
plt.figure(figsize=(12, 4))
sns.boxplot(x = "agegrp", y = "Chol", hue = "Sex1", data=df)
```

Overall, the female population in this dataset has a higher level of cholesterol, except in the age group of 29 to 40. In the age group of 70 to 80, the plot shows a cholesterol box only for the female population. That does not mean that the male population of that age has no cholesterol; our dataset simply does not have enough men in that age group. Plotting age by sex helps to see this.

`sns.boxplot(x = "Sex1", y = "Age", data=df)`

6. Make a chart to show the number of people having each type of chest pain in each age group.

`df.groupby('agegrp')["ChestPain"].value_counts().unstack()`

For each type of chest pain, the largest number of people seems to be in the age group of 50 to 60, probably because that age group has the most people in our dataset. Look at the picture above.
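The groupby, ‘value_counts’, and ‘unstack’ chain builds a two-way count table. The sketch below uses a tiny made-up DataFrame (hypothetical labels and counts) and compares the chain with `pd.crosstab`, which gives the same counts but shows 0 instead of NaN for combinations that never occur.

```python
import pandas as pd

# Hypothetical data; the labels mimic the article's columns but are made up.
toy = pd.DataFrame({
    "agegrp": ["40-50", "40-50", "50-60", "50-60", "50-60"],
    "ChestPain": ["typical", "asymptomatic", "asymptomatic",
                  "asymptomatic", "nonanginal"],
})

# Count chest pain types per age group, then pivot the types into columns.
table = toy.groupby("agegrp")["ChestPain"].value_counts().unstack()
print(table)

# crosstab builds the same table directly and fills empty cells with 0.
alt = pd.crosstab(toy["agegrp"], toy["ChestPain"])
print(alt)
```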

7. Make the same chart as in the previous exercise, with the addition of the gender variable. Segregate the numbers by gender.

`dx = df.dropna().groupby(["agegrp", "Sex1"])["ChestPain"].value_counts().unstack()`

8. Present the population proportion for each type of chest pain within the same groups as in the previous chart.

`dx = dx.apply(lambda x: x/x.sum(), axis=1)`
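To make the row-wise normalization concrete, here is a sketch on a hypothetical two-row count table (made-up numbers, not the real `dx`). It applies the same lambda and, as an alternative, divides each row by its row total with `DataFrame.div`.

```python
import pandas as pd

# Hypothetical counts: rows are groups, columns are chest pain types.
counts = pd.DataFrame(
    {"asymptomatic": [1, 3], "typical": [1, 1]},
    index=["40-50", "50-60"],
)

# Same technique as in the article: divide each row by its own total.
props = counts.apply(lambda x: x / x.sum(), axis=1)

# Equivalent vectorized form: divide by the row sums, aligned on the index.
props_alt = counts.div(counts.sum(axis=1), axis=0)

print(props)
```

Each row of the result sums to 1, so the values read as proportions within a group rather than across the whole dataset.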

That was the last exercise. These were some techniques to make univariate and multivariate charts and plots. I hope that was helpful.
