Exploratory data analysis is essential for understanding a dataset properly, and it matters just as much when the goal is machine learning. I have published a few exploratory data analysis projects before, but this time I am taking a bigger dataset. Let’s see how it goes.
I am going to use a public dataset called the FIFA dataset from Kaggle. The user license is mentioned here.
Please feel free to download the dataset from this link.
First import the packages and the dataset:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("fifa.csv")
This dataset has a total of 80 columns. The original Kaggle dataset had 89 columns; I already deleted 9 of them and uploaded the result at the link provided above. So, you get a slightly cleaner dataset to start with.
But there is an unnecessary column at the beginning that I removed:
df = df.drop(columns="Unnamed: 0")
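A column named "Unnamed: 0" usually appears when a DataFrame index was written to the CSV. Here is a small sketch of two ways to handle it, using a tiny made-up CSV string in place of fifa.csv (the names csv_text, df_demo, and df_demo2 are only for illustration):

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for fifa.csv; the first, unnamed column
# is an index that was written out when the file was saved.
csv_text = ",short_name,age\n0,Player A,32\n1,Player B,34\n"

# Option 1: read normally, then drop the stray column if it is present.
# errors="ignore" keeps the call from failing when the column is absent.
df_demo = pd.read_csv(io.StringIO(csv_text))
df_demo = df_demo.drop(columns="Unnamed: 0", errors="ignore")

# Option 2: tell pandas up front that the first column is the index.
df_demo2 = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_demo.columns.tolist())
print(df_demo2.columns.tolist())
```

Either way, only the real feature columns remain.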
Now, these are the columns left:
df.columns
Output:
Index(['sofifa_id', 'player_url', 'short_name', 'long_name', 'age', 'dob', 'height_cm', 'weight_kg', 'nationality', 'club_name', 'league_name', 'league_rank', 'overall', 'potential', 'value_eur', 'wage_eur', 'player_positions', 'preferred_foot', 'international_reputation', 'weak_foot', 'skill_moves', 'work_rate', 'body_type', 'real_face', 'release_clause_eur', 'player_tags', 'team_position', 'team_jersey_number', 'loaned_from', 'joined', 'contract_valid_until', 'nation_position', 'nation_jersey_number', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning', 'player_traits', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance', 'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots', 'mentality_aggression', 'mentality_interceptions', 'mentality_positioning', 'mentality_vision', 'mentality_penalties', 'mentality_composure', 'defending_marking', 'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 'goalkeeping_reflexes'], dtype='object')
As you can see, there are so many variables! You could fill a big book with the analysis possible from this dataset. But this is a blog article, so I have to limit my analysis.
Also, there are many variables that I do not quite understand because I am not that familiar with soccer. But that is also a part of a data scientist’s life. Even without knowing everything about the features, you can still create some interesting visualizations and analyses.
Let’s start.
The first visualization will be on the relationship between the work rate and wage by league rank. This plot will be a combination of scatterplot and line plot.
plt.figure(figsize=(10, 6))
#Scatter plot
ax = sns.scatterplot(x ='work_rate', y = df['wage_eur'], hue = "league_rank", data = df,
    palette = ["green", "red", "coral", "blue"], legend="full", alpha = 0.4)
#Getting the max wage for each work rate
max_wage_eur = df.groupby("work_rate")["wage_eur"].max()
#Making a line plot of max wages
sns.lineplot(data = max_wage_eur, ax = ax.axes, color="grey")
ax.tick_params(axis= "x", rotation=90)
plt.xlabel("Work Rate")
plt.ylabel("Wage EUR")
plt.title("Relationship between work rate and wage by league rank", fontsize = 18)
plt.show()
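The grey line in this plot comes from a groupby that takes the maximum wage per work rate. Here is a minimal sketch of that step on made-up numbers (the DataFrame named wages is hypothetical), together with idxmax, which can show which row is behind each maximum:

```python
import pandas as pd

# Made-up wages for two work-rate categories (illustration only).
wages = pd.DataFrame({
    "work_rate": ["High/High", "High/High", "Medium/Low", "Medium/Low"],
    "wage_eur": [1000, 5000, 2000, 800],
})

# Maximum wage per work rate: one value per category.
max_wage = wages.groupby("work_rate")["wage_eur"].max()

# idxmax returns the row label of each maximum, so we can pull the full rows.
top_rows = wages.loc[wages.groupby("work_rate")["wage_eur"].idxmax()]

print(max_wage)
print(top_rows)
```

Since a maximum is driven by a single player, it can be worth checking who that player is before reading too much into the line.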
There is a nationality column. It will be interesting to see some analysis on that.
How many nationalities are represented in this dataset?
len(df['nationality'].unique())
Output:
149
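Pandas also has a nunique method for this kind of count. One difference worth knowing, sketched here on a made-up Series: unique keeps a missing value as its own entry, while nunique skips missing values by default.

```python
import pandas as pd

# Made-up nationality values with one missing entry (illustration only).
nat = pd.Series(["England", "Spain", "England", None])

n_with_missing = len(nat.unique())   # the missing value counts as an entry
n_without_missing = nat.nunique()    # missing values are ignored by default

print(n_with_missing, n_without_missing)
```

If the column has no missing values, both approaches give the same number.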
A word cloud of nationalities will help show which nationalities dominate. To do that, we need to join all the nationalities into one string and then make a word cloud.
nationality = " ".join(n for n in df['nationality'])
from wordcloud import WordCloud
plt.figure(figsize=(10, 10))
wc = WordCloud().generate(nationality)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
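One thing to watch for: joining everything into a single string means names with several words, such as Republic of Ireland or Korea Republic, may be broken into separate words. A possible alternative is to count whole nationalities first. The sketch below uses a made-up Series and only builds the frequency dictionary; as far as I know, the wordcloud package accepts such a dictionary through WordCloud().generate_from_frequencies, which is left as a comment so the example runs without that package.

```python
import pandas as pd

# Made-up nationality column (illustration only).
nat = pd.Series([
    "England", "Republic of Ireland", "England",
    "Korea Republic", "Republic of Ireland", "England",
])

# Joining and splitting on spaces mixes up the multi-word names:
# "Republic" is counted for both Ireland and Korea.
word_counts = pd.Series(" ".join(nat).split()).value_counts()

# Counting whole values keeps each nationality intact.
freqs = nat.value_counts().to_dict()

print(word_counts["Republic"])
print(freqs)

# With the wordcloud package installed, this dictionary could be used as:
# wc = WordCloud().generate_from_frequencies(freqs)
```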
That means we have more players from England, France, Germany, Spain, Italy, Netherlands and so on.
I was curious to see the top 20 nationalities in terms of the number of players. The value_counts function of the Pandas library returns the count for every nationality, sorted in descending order by default, so the first 20 entries are the top 20.
nationality_count = df['nationality'].value_counts()
nationality_count[:20]
Output:
England 1627
Spain 1051
France 958
Argentina 867
Italy 795
Germany 701
Colombia 543
Republic of Ireland 460
Netherlands 419
Mexico 416
Brazil 416
Chile 411
Sweden 398
Saudi Arabia 362
United States 342
Poland 342
Turkey 341
Portugal 337
Korea Republic 328
Scotland 323
Name: nationality, dtype: int64
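Raw counts can also be turned into shares. Here is a small sketch on made-up data showing head for the top entries and normalize=True for proportions:

```python
import pandas as pd

# Made-up nationality column (illustration only).
nat = pd.Series(["England"] * 3 + ["Spain"] * 2 + ["France"])

counts = nat.value_counts()                 # sorted from most to least frequent
shares = nat.value_counts(normalize=True)   # same order, as fractions of the total

print(counts.head(2))
print(shares.round(2))
```

On the real data, the shares would show what fraction of all players each of the top nationalities accounts for.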
Soccer players have a certain active timeframe in their lives. The distribution of age is here:
df['age'].hist()
plt.title("Distribution of age of the players")
It will be a good idea to check if the distribution matches the distribution of age of top nationalities, leagues, and clubs based on the number of players. First let’s make the DataFrame based on the top nationalities, clubs, and leagues:
top_20_nat= nationality_count.index[:20]
df_nationality = df[df.nationality.isin(top_20_nat)]
clubs_count = df['club_name'].value_counts()
top_20_club = clubs_count[:20].index
df_clubs = df[df.club_name.isin(top_20_club)]
league_count = df['league_name'].value_counts()
top_20_leagues = league_count[:20].index
df_league = df[df.league_name.isin(top_20_leagues)]
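The three blocks above follow the same pattern, so they could be wrapped in a small helper. This is only a sketch; the function name top_n_subset and the tiny players DataFrame are made up for illustration.

```python
import pandas as pd

def top_n_subset(frame, column, n=20):
    """Keep only the rows whose value in `column` is among the n most frequent."""
    top_values = frame[column].value_counts().head(n).index
    return frame[frame[column].isin(top_values)]

# Made-up data: club A has 3 players, B has 2, C has 1.
players = pd.DataFrame({
    "club_name": ["A", "A", "A", "B", "B", "C"],
    "age": [21, 25, 30, 19, 27, 33],
})

top_2_clubs = top_n_subset(players, "club_name", n=2)
print(top_2_clubs)
```

With the real DataFrame, the same call with "nationality", "club_name", or "league_name" and n=20 should reproduce the three subsets.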
I want to put the distribution of ages in the same plot to compare the distribution properly:
pd.DataFrame({'nationality': df_nationality['age'],
'club': df_clubs['age'],
'league': df_league['age']}).hist(bins = 15,
figsize=(12, 6),
grid=False,
rwidth=0.9,
sharex=True,
sharey=True)
plt.show()
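Look, the club panel sits so much lower than the other two. My first guess was that the biggest clubs by number of players are children’s clubs. Keep in mind, though, that the three panels share a y-axis and the top 20 clubs together hold far fewer players than the top 20 nationalities or leagues, so the low bars may simply reflect a smaller group rather than younger players.
I also checked the distributions of height and weight for all three groups and got similar results. Please feel free to check for yourself.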
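A quick numeric summary can help separate group size from the ages themselves. The sketch below uses three short made-up age Series in place of df_nationality['age'], df_league['age'], and df_clubs['age']:

```python
import pandas as pd

# Made-up ages for the three groups (illustration only).
# The groups have different sizes, as they do in the real data.
nat_age = pd.Series([22, 25, 27, 30, 24, 28, 26, 31])
league_age = pd.Series([21, 24, 26, 29, 33, 27])
club_age = pd.Series([23, 26, 29])

# describe() gives count, mean, quartiles, min and max for each group.
summary = pd.DataFrame({
    "nationality": nat_age.describe(),
    "league": league_age.describe(),
    "club": club_age.describe(),
})

print(summary.round(1))
```

The count row shows how many players sit behind each histogram, and the mean and quartile rows show whether the ages really differ between groups.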
Let’s see a plot that gives a point estimation of wages for each nationality using the top 20 nationalities DataFrame we created.
ax = sns.catplot(x = 'nationality', y = 'wage_eur', data = df_nationality,
hue = 'preferred_foot', height=6, aspect=2,
capsize=0.2, kind='point')
plt.xlabel("Nationality", fontsize=12)
plt.ylabel("Wage EUR", fontsize=12)
plt.title("Point Estimation of Wages per Top 20 Nationalities", fontsize=20)
plt.xticks(rotation = 60, fontsize=13)
plt.show()

The plot above shows wages for players from the top 20 nationalities. We have the top 20 clubs as well, so let’s look at wages by nationality for the players in those clubs.
fig, ax = plt.subplots(figsize=(16, 6), dpi=80)
sns.stripplot(x = "nationality", y = "wage_eur",
data=df_clubs, size = 7, ax=ax)
plt.tick_params(axis='x', which='major', labelsize=12, rotation=90)
plt.xlabel("Nationality")
plt.ylabel("Wage")
plt.title("Wages for Nationalities in Club Level")
plt.show()

You can see some new nationalities in this club group. Also, in the point estimate plot Brazil was at the top, while in this plot Brazil is in the middle.
The next plot will show the wages for nation_position by international reputation. I will use another combined plot here that shows the mean wages for each nation_position, with different colors denoting the international_reputation.
To do that, we need to convert the string values of the nation_position column to integers. Below is a function that makes a dictionary where the keys are the values of the nation_position column and the values are a unique integer for each key.
def make_dictionary(col):
    dictn = {}
    val = 0
    for i in col.unique():
        if i not in dictn:
            dictn[i] = val+1
            val += 1
    return dictn
Now, we will make a new column named nation_position1 where data type will be integers:
new_nation_position= make_dictionary(df['nation_position'])
df['nation_position1'] = df['nation_position'].replace(new_nation_position)
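Pandas has a built-in that does a similar encoding. Here is a sketch using pd.factorize on a made-up Series. Note that it treats missing values differently from the function above: factorize gives them the code -1, so after adding 1 they end up as 0 rather than receiving their own positive integer.

```python
import numpy as np
import pandas as pd

# Made-up positions with missing values (illustration only).
pos = pd.Series(["RW", "LS", np.nan, "GK", "RW", np.nan])

# factorize numbers the distinct values in order of first appearance,
# starting at 0, and marks missing values with -1.
codes, labels = pd.factorize(pos)

# Shift by one so real positions start at 1 and missing values become 0.
pos_code = pd.Series(codes + 1, index=pos.index)

print(pos_code.tolist())
print(list(labels))
```

The labels array keeps the original strings in code order, which is handy for tick labels later.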
Let’s make the plot using this new nation_position1 column:
plt.figure(figsize=(15, 10))
ax = sns.scatterplot(x= df['nation_position'], y = df['wage_eur'],
    hue = df['international_reputation'], legend = "full", data = df, x_jitter = 1000)
ax = sns.regplot(x= 'nation_position1', y = df['wage_eur'], data = df,
    ax=ax.axes, x_estimator=np.mean)
plt.xlabel("Nation Position")
plt.ylabel("Wage EUR")
plt.title("Wages for National Position by International Reputation", fontsize=18)
plt.show()

You can see the mean wages for each nation position. Wages are higher when international reputation is higher.
I have to admit that I couldn’t quite understand all the variables in this dataset. For example, there is one feature called ‘value_eur’ and another one called ‘wage_eur’. Are they related? Some visuals may help understand that.
The next plot explores the relationship between the wage and value of the players, and whether the preferred foot has any effect on those two features.
In this plot, I plotted wages against values, where color indicates the preferred foot. Because most of the dots were cluttered in one place, I added a zoomed-in view of that portion on top of the full scatter plot. There are also density plots of wages and values on the sides.
plt.figure(figsize = (18, 10))
grid = plt.GridSpec(4, 4, wspace =0.3, hspace = 0.8)
g1 = plt.subplot(grid[:, 0])
g2 = plt.subplot(grid[:2, 1:3])
g3 = plt.subplot(grid[2:, 1:3])
g4 = plt.subplot(grid[:, 3])
# Left panel: density of wages, drawn vertically by assigning the data to y
g1.set_title("Distribution of Wage", fontsize=12)
sns.kdeplot(y = "wage_eur", hue="preferred_foot",
    data=df, ax=g1)
# Top middle panel: zoomed-in view of the crowded region
sns.scatterplot(x = "wage_eur", y = "value_eur",
    hue = "preferred_foot", alpha = 0.3,
    palette=['coral', 'blue'],
    data = df, ax=g2)
g2.set_xlim(0, 50000)
g2.set_ylim(0, 0.05*1e8)
# Bottom middle panel: the full scatter plot
sns.scatterplot(x = "wage_eur", y = "value_eur",
    hue = "preferred_foot", alpha = 0.3,
    palette=['coral', 'blue'],
    data = df, ax=g3)
# Right panel: density of values, drawn vertically
g4.set_title("Distribution of Value", fontsize=12)
sns.kdeplot(y = "value_eur", hue="preferred_foot",
    data=df, ax=g4)
plt.show()
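A correlation coefficient can back up what the scatter plot suggests about wage and value. The sketch below uses a handful of made-up wage and value pairs, and also computes the correlation on a log scale, since both columns are heavily skewed by a few very expensive players:

```python
import numpy as np
import pandas as pd

# Made-up wage/value pairs with one very expensive player (illustration only).
money = pd.DataFrame({
    "wage_eur": [1000, 2000, 5000, 20000, 100000],
    "value_eur": [200000, 500000, 1500000, 9000000, 60000000],
})

# Pearson correlation on the raw values.
pearson = money["wage_eur"].corr(money["value_eur"])

# Same measure after a log transform, which reduces the pull of extreme values.
log_pearson = np.log1p(money["wage_eur"]).corr(np.log1p(money["value_eur"]))

print(round(pearson, 3), round(log_pearson, 3))
```

On the real data, the same two lines with df in place of money would give a number to put next to the plot.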

Do height, weight, and body type affect the performance or value of the players? Is there any correlation between height/weight and international reputation? This plot explores that.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
#fig.suptitle("International Reputation and Value with Height and Weight, and Body Type")
# Note: 'body_type1' is not created anywhere in this article; it is presumably
# a cleaned-up version of 'body_type' with three categories.
sns.scatterplot(x = "height_cm", y = "weight_kg",
    hue = "body_type1",
    alpha = 0.5, size = "value_eur", sizes = (60, 300),
    palette=['green','orange','dodgerblue'],
    data = df.sample(frac=0.25), ax=ax1)
ax1.set_ylim([50, 100])
ax1.set_xlim([160, 200])
ax1.set_title("Height vs Weight By Body Type and Value",fontsize=18)
ax1.set_xlabel("Height")
ax1.set_ylabel("Weight")
sns.scatterplot(x = "height_cm", y = "weight_kg",
    hue = "international_reputation",
    alpha = 0.5, size = "value_eur", sizes = (60, 300),
    palette=['green','orange','dodgerblue', 'red', 'black'],
    data = df.sample(frac=0.25), ax=ax2)
ax2.set_ylim([50, 100])
ax2.set_xlim([160, 200])
ax2.set_title("Height vs Weight By International Reputation and Value",fontsize=18)
ax2.set_xlabel("Height")
ax2.set_ylabel("Weight")
plt.show()

Here are the value counts of international reputation.
df['international_reputation'].value_counts()
Output:
1 14512
2 1362
3 235
4 37
5 9
The higher the reputation, the lower the count, which is expected.
We have an interesting feature named “weak_foot”. Let us see how weak foot relates to value segregated by preferred foot. I will use a violin plot for that.
plt.figure(figsize=(15, 10))
sns.set()
_, ax = plt.subplots(figsize=(10, 7))
sns.violinplot(x="weak_foot",
y="value_eur",
hue = "preferred_foot",
data=df,
split=True,
bw=.4,
cut = 0.2,
linewidth=1,
palette=sns.color_palette(["green", "orange"]))
ax.set(ylim = (-0.1, 1*1e7))
plt.show()

You can see the distribution changes a lot. I used a ‘ylim’ parameter to exclude the outliers. If I include all the values the shape of the violins does not show.
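Instead of picking the y-axis limit by eye, a quantile can be used as the cutoff. A minimal sketch on made-up player values (the 0.90 quantile is an arbitrary choice for this example, not the cutoff used in the plot above):

```python
import pandas as pd

# Made-up player values with one extreme outlier (illustration only).
values = pd.Series([
    100_000, 250_000, 400_000, 600_000, 900_000,
    1_200_000, 2_000_000, 3_500_000, 8_000_000, 105_000_000,
])

# Use the 90th percentile as the upper limit of the y-axis.
upper = values.quantile(0.90)
kept = values[values <= upper]

print(upper, len(kept), "of", len(values), "values inside the limit")

# In the violin plot this limit could then be applied with:
# ax.set(ylim=(0, upper))
```

This makes it explicit how much of the data falls outside the visible range.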
In the next plot, I would like to explore the relationship between body type and shooting. It will also be good to separate them by preferred foot. I will use a point plot to add the point estimation of the group. The point represents the central tendencies of the groups.
_, ax = plt.subplots(figsize=(10, 8))
sns.despine(bottom=True, left=True)
sns.stripplot(x = "shooting",
y = "body_type",
hue = "preferred_foot",
data = df,
dodge = 0.8, zorder=1)
sns.pointplot(x = "shooting",
y = "body_type",
hue = "preferred_foot",
data=df, dodge=0.5, join=False,
palette="dark", markers="d",
scale=0.75, ci=None)
handles, labels = ax.get_legend_handles_labels()

The point estimates do not look too different across the groups.
We did not explore the nation_position as much. Also, another intriguing feature is called mentality aggression. We will see the relationship between them and also include league rank as a ‘hue’ parameter. Before that, we need to convert the nation_position to a numeric variable. I used the make_dictionary function mentioned before to create a dictionary and eventually created a new column named ‘nation_position1’ replacing the original string values with the integers.
nat_dictn = make_dictionary(df['nation_position'])
df['nation_position1'] = df['nation_position'].replace(nat_dictn)
Here is the code block to the plot. The ‘lmplot’ and ‘regplot’ are combined together in this plot:
ax = sns.lmplot(x = "nation_position1",
y = "mentality_aggression",
data = df, hue = "league_rank", fit_reg = False, height = 5, aspect = 2.2)
sns.regplot(x = "nation_position1",
y = "mentality_aggression",
data = df, scatter=False, ax=ax.axes[0, 0], order = 3)
plt.ylabel("Mentality Aggression")
plt.xticks(list(range(1,30)), list(df['nation_position'].unique()))
plt.title("Relationship Between Nation Position and Mentality Aggression", fontsize=18)
plt.xlabel("Nation Position", fontsize=14)
plt.ylabel("Mentality Aggression", fontsize=14)
plt.show()

You can see lots of the data are ‘nan’ for league_rank four. But otherwise, this gives a good idea about the mentality_aggression of nation position.
When we have this big of a dataset and a lot of continuous variables, I find it helpful to make a pair plot. It gives the distribution and also the several relationships between variables in the same plot. Here is a pair plot:
sns.set(color_codes=True)
plt.rcParams["axes.labelsize"] = 20
g1 = sns.PairGrid(df.sample(frac = 0.2), vars = ['pace', 'shooting',
'passing', 'dribbling', 'defending', 'attacking_crossing',
'attacking_finishing', 'attacking_heading_accuracy'],
hue = 'preferred_foot')
g1.map_lower(sns.regplot)
g1.map_diag(plt.hist, alpha=0.7)
g1.map_upper(sns.kdeplot, shade=True)
g1.add_legend(title='Foot', fontsize=20)
for axes in g1.axes.flat:
    axes.set_ylabel(axes.get_ylabel(), rotation=0, horizontalalignment='right')

I wanted to explore the relationship between attacking_heading_accuracy and defending a bit more. I want to include the body_type and mentality aggression in this exploration as well:
df2 = df[['attacking_heading_accuracy', 'defending', 'body_type', 'mentality_aggression']]
df_encircle = df2[(df2['attacking_heading_accuracy'] >= 50) & (df2['attacking_heading_accuracy'] <= 70)].dropna()
df_encircle
from scipy.spatial import ConvexHull
plt.figure(figsize=(18, 8))
ax = sns.scatterplot(x = "attacking_heading_accuracy", y = "defending",
    hue = "body_type",
    alpha = 0.5, size = "mentality_aggression", sizes = (20, 300),
    data = df.sample(frac=0.10))
def encircle(x, y, ax=None, **kw):
    if not ax:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)
encircle(df_encircle.attacking_heading_accuracy, df_encircle.defending,
    ec = "k", fc="gold",
    alpha = 0.1)
encircle(df_encircle.attacking_heading_accuracy, df_encircle.defending,
    ec = "firebrick", fc="None",
    linewidth = 1.5)
plt.xlabel("Attacking Heading Accuracy", fontsize=12)
plt.ylabel("Defending", fontsize=12)
plt.title("Defending vs Attacking Heading Accuracy")
plt.show()

You may wonder why I encircled an area. If you look at the distribution of attacking_heading_accuracy in the pair plot above, the majority of the population lies in the range of 50 to 70. I encircled that area so that we can focus on it.
The following plot will show the relationships between the nation_position and movement_sprint_speed. This time I combined the boxplot and strip plot.
fig, ax = plt.subplots(figsize=(14, 8))
ax = sns.boxplot(x = 'nation_position', y = "movement_sprint_speed",
    data = df)
ax.tick_params(rotation=90, labelsize=18)
ax = sns.stripplot(x = 'nation_position', y = "movement_sprint_speed", data=df)
plt.xlabel("Nation Position", labelpad = 16, fontsize=24)
plt.ylabel("Movement Sprint Speed", labelpad = 16, fontsize=24)
plt.title("Nation Position vs Movement Sprint Speed", fontsize=32)
plt.show()

Next, a combination of violin plot and strip plot will show work_rate vs. skill_curve.
fig, ax = plt.subplots(figsize=(12, 6))
ax=sns.violinplot(x = "work_rate", y = "skill_curve", hue="preferred_foot",
data=df, inner=None, color = "0.4")
ax = sns.stripplot(x = "work_rate", y = "skill_curve", alpha = 0.6, data=df)
ax.tick_params(rotation=90, labelsize=12)
plt.xlabel("Work Rate", fontsize = 12)
plt.ylabel("Skill Curve", fontsize = 12)
plt.title("Skill Curve for Work Rate", fontsize=24)
plt.show()

In my next plot, I will explore the median power stamina. First, the top 20 nationalities were found in terms of median power stamina, and then the visualization was created.
d = df.groupby('nationality')['power_stamina'].agg(['median'])
# Sort so that the first 20 rows really are the highest medians
d1 = d.sort_values('median', ascending=False)[:20].reset_index()
d1 = d1.rename(columns={"median": 'power_stamina'})
fig, ax = plt.subplots(figsize=(14, 6))
ax.vlines(x=d1['nationality'], ymin=0, ymax = d1.power_stamina,
    color = 'coral', alpha = 0.7, linewidth = 3)
ax.scatter(x = d1["nationality"], y = d1["power_stamina"],
    s = 70, color = "firebrick")
ax.set_ylabel("Median Power Stamina", fontsize = 12)
ax.set_xticklabels(d1.nationality, rotation=90)
ax.set_ylim(0, 80)
ax.set_title("Median Power Stamina for Top 20 Nationalities", fontsize=18)
for row in d1.itertuples():
    ax.text(row.Index, row.power_stamina+1, s = round(row.power_stamina),
        horizontalalignment = 'center', verticalalignment='bottom', fontsize=14)
plt.show()
There’s quite a range here. But the interesting fact is the top 2 countries with the highest median power stamina are not in the top 20 nationality list we have seen before.
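A ranking by median can be dominated by nationalities with only one or two players. The sketch below, on made-up data, puts the median and the player count side by side, checks whether the top medians overlap with the most common nationalities, and then repeats the ranking with a minimum player count (the threshold of 3 is an arbitrary choice for this tiny example):

```python
import pandas as pd

# Made-up data: A and B have several players, C and D have one each.
players = pd.DataFrame({
    "nationality": ["A"] * 5 + ["B"] * 4 + ["C", "D"],
    "power_stamina": [60, 62, 64, 66, 68, 55, 57, 59, 61, 90, 85],
})

# Median and number of players per nationality.
stats = players.groupby("nationality")["power_stamina"].agg(["median", "count"])

# Top 2 by median versus top 2 by number of players.
top_by_median = stats.sort_values("median", ascending=False).head(2)
top_by_count = stats["count"].nlargest(2).index
overlap = top_by_median.index.isin(top_by_count)

# Ranking again, keeping only nationalities with at least 3 players.
reliable = stats[stats["count"] >= 3].sort_values("median", ascending=False)

print(top_by_median)
print(overlap)
print(reliable)
```

In this toy example the two highest medians belong to single-player nationalities, which is one possible reason for a mismatch between the two lists.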
Conclusion
As I mentioned before, a big book could be written with such a huge dataset. But this is a blog post, so I am stopping here. I always like to include some predictive modeling in exploratory data analysis, but I have to save it for a future article because I do not want to make this post any longer. I tried to include some cool visualizations in this article. Hopefully, it was helpful.