CSV files are a very common and popular format of storing data. Data Scientists deal with csv files almost regularly. Pandas not only has the option to import a dataset as a regular Pandas DataFrame, also there are other options to clean and shape the dataframe while importing. Here I want to discuss few of those options:
As usual, import pandas and the dataset as a Dataframe with read_csv method:
import pandas as pd
df = pd.read_csv(‘olympics.csv’)
df.head()
Output:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 ! | 02 ! | 03 ! | Total | № Games | 01 ! | 02 ! | 03 ! | Combined total |
1 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
2 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
3 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
4 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
df.head() gives only the top five rows of dataframe so we can see some properties of the dataframe. df.tail() returns the last five rows of the DataFrame. Number of rows can be specified as well like this: df.head(8). This will return first 8 rows. You can use any number in the range of the length of your dataframe. Number of rows can be specified in df.tail() in the same way.
df.tail()
Output:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
143 | Independent Olympic Participants (IOP) [IOP] | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 3 |
144 | Zambia (ZAM) [ZAM] | 12 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 1 | 1 | 2 |
145 | Zimbabwe (ZIM) [ZIM] | 12 | 3 | 4 | 1 | 8 | 1 | 0 | 0 | 0 | 0 | 13 | 3 | 4 | 1 | 8 |
146 | Mixed team (ZZX) [ZZX] | 3 | 8 | 5 | 4 | 17 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 5 | 4 | 17 |
147 | Totals | 27 | 4809 | 4775 | 5130 | 14714 | 22 | 959 | 958 | 948 | 2865 | 49 | 5768 | 5733 | 6078 | 17579 |
Look at the dataset. On top there is an unnecessary row. That’s actually not the header. There is an option called skiprows to get rid of unnecessary rows like that.
df1 = pd.read_csv(‘olympics.csv’, skiprows = 1)
df1.head()
Output:
Unnamed: 0 | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
1 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
2 | Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
3 | Australia (AUS) [AUS] [Z] | 25 | 139 | 152 | 177 | 468 | 18 | 5 | 3 | 4 | 12 | 43 | 144 | 155 | 181 | 480 |
4 | Austria (AUT) | 26 | 18 | 33 | 35 | 86 | 22 | 59 | 78 | 81 | 218 | 48 | 77 | 111 | 116 | 304 |
Here skiprows = 1, means delete one row. By default it will delete one row from the top. You can delete 2, 3, 4 or any number of rows you want using skiprows. But you may not want to delete all the rows from the top. Like you may want to delete first row, third row and forth row. In that case you can specify the rows in a list.
df2 = pd.read_csv(‘olympics.csv’, skiprows = [0, 2, 3])
df2.head()
Output:
Unnamed: 0 | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
1 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
2 | Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
3 | Australia (AUS) [AUS] [Z] | 25 | 139 | 152 | 177 | 468 | 18 | 5 | 3 | 4 | 12 | 43 | 144 | 155 | 181 | 480 |
4 | Austria (AUT) | 26 | 18 | 33 | 35 | 86 | 22 | 59 | 78 | 81 | 218 | 48 | 77 | 111 | 116 | 304 |
Sometimes in the csv files, there is no header, only values. But by default, pandas take the row as a header. To avoid that, we can use ‘header = None’. In this dataset there is a header. But for the sake of this example let’s just say that there is no header.
df3 = pd.read_csv(‘olympics.csv’, header = None)
df3.head()
Output:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
1 | NaN | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 ! | 02 ! | 03 ! | Total | № Games | 01 ! | 02 ! | 03 ! | Combined total |
2 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
3 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
4 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
Look pandas added an extra row of label on top to add a header. It didn’t take any row from the original dataset as a header. But actually, in this dataset there is a header. The second row, that means row index 1 is actually a header.
df4 = pd.read_csv(‘olympics.csv’, header = 1)
df4.head()
Output:
Unnamed: 0 | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
1 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
2 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
3 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
4 | Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
Here in this dataframe, the default index column is a generic serial. It is useful to have a meaningful index. While importing you can set the index as any of the column you want. I am choosing the mane of the countries that is the first column. So the column index is zero.
df5 = pd.read_csv(‘olympics.csv’, index_col = 0, skiprows = 1)
df5.head()
Output:
№ Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
Sometimes all the columns of the dataset may not be necessary for the analysis. Specially when the dataset is too big, it’s always makes sense to present part of it. Otherwise it becomes too hard to understand it. Here is way to import few columns.
df6 = pd.read_csv(‘olympics.csv’, skiprows = 1, usecols = [‘Unnamed: 0′,’№ Summer’, ‘№ Winter’, ‘№ Games’], index_col = 0)
df6.head()
Output:
№ Summer | № Winter | № Games | |
---|---|---|---|
Afghanistan (AFG) | 13 | 0 | 13 |
Algeria (ALG) | 12 | 3 | 15 |
Argentina (ARG) | 23 | 18 | 41 |
Armenia (ARM) | 5 | 6 | 11 |
Australasia (ANZ) [ANZ] | 2 | 0 | 2 |
Let’s get back to skiprows again. You can skip rows of any range using a for loop. This option can be very useful at times. I am using a simple for loop here.
df7 = pd.read_csv(‘olympics.csv’, skip_blank_lines = False, skiprows = [i for i in range(2, 5)] )
df7.head()
Output:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 ! | 02 ! | 03 ! | Total | № Games | 01 ! | 02 ! | 03 ! | Combined total |
1 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
2 | Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
3 | Australia (AUS) [AUS] [Z] | 25 | 139 | 152 | 177 | 468 | 18 | 5 | 3 | 4 | 12 | 43 | 144 | 155 | 181 | 480 |
4 | Austria (AUT) | 26 | 18 | 33 | 35 | 86 | 22 | 59 | 78 | 81 | 218 | 48 | 77 | 111 | 116 | 304 |
Another very dynamic idea is to define a function and use that function. For this example, I am defining a simple function here. A function to return only the even indexed rows. Then, use this in the skiprows to skip the even indexed rows. You can use some advanced function as per your requirement.
def even_rows(index):
if index%2 == 0:
return True
return False
df8 = pd.read_csv(‘olympics.csv’, skiprows = lambda x: even_rows(x) )
df8.head()
Output:
Unnamed: 0 | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
1 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
2 | Australia (AUS) [AUS] [Z] | 25 | 139 | 152 | 177 | 468 | 18 | 5 | 3 | 4 | 12 | 43 | 144 | 155 | 181 | 480 |
3 | Azerbaijan (AZE) | 5 | 6 | 5 | 15 | 26 | 5 | 0 | 0 | 0 | 0 | 10 | 6 | 5 | 15 | 26 |
4 | Bahrain (BRN) | 8 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 1 | 1 |
At last, I should talk about skipfooter. Skipfooter skips rows from the bottom. Here I am using a value of 3 to the skipfooter. So, it will skip three rows from the footer.
df10 = pd.read_csv(‘olympics.csv’, header = 1, skipfooter = 3)
df10.tail()
Output:
â„– Summer | 01 ! | 02 ! | 03 ! | Total | â„– Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | â„– Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
139 | Vietnam (VIE) | 14 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 2 | 0 | 2 |
140 | Virgin Islands (ISV) | 11 | 0 | 1 | 0 | 1 | 7 | 0 | 0 | 0 | 0 | 18 | 0 | 1 | 0 | 1 |
141 | Yugoslavia (YUG) [YUG] | 16 | 26 | 29 | 28 | 83 | 14 | 0 | 3 | 1 | 4 | 30 | 26 | 32 | 29 | 87 |
142 | Independent Olympic Participants (IOP) [IOP] | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 3 |
143 | Zambia (ZAM) [ZAM] | 12 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 1 | 1 | 2 |
That’s all I wanted to share about importing csv file in a notebook using Pandas library. I hope it was helpful. Please subscribe to stay with me. There are lot more resources coming over time.
Click here for the dataset.
Click here for the working video of this contents.