Import CSV Files As Pandas DataFrame With skiprows, skipfooter, usecols, index_col and header Options

CSV files are a very common and popular format of storing data. Data Scientists deal with csv files almost regularly. Pandas not only has the option to import a dataset as a regular Pandas DataFrame, also there are other options to clean and shape the dataframe while importing. Here I want to discuss few of those options:

As usual, import pandas and the dataset as a Dataframe with read_csv method:

import pandas as pd

df = pd.read_csv(‘olympics.csv’)

df.head()

Output:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 NaN № Summer 01 ! 02 ! 03 ! Total № Winter 01 ! 02 ! 03 ! Total № Games 01 ! 02 ! 03 ! Combined total
1 Afghanistan (AFG) 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2
2 Algeria (ALG) 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15
3 Argentina (ARG) 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
4 Armenia (ARM) 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12

df.head() gives only the top five rows of dataframe so we can see some properties of the dataframe. df.tail()  returns the last five rows of the DataFrame. Number of rows can be specified as well like this: df.head(8). This will return first 8 rows. You can use any number in the range of the length of your dataframe. Number of rows can be specified in df.tail() in the same way. 

df.tail()

Output:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
143 Independent Olympic Participants (IOP) [IOP] 1 0 1 2 3 0 0 0 0 0 1 0 1 2 3
144 Zambia (ZAM) [ZAM] 12 0 1 1 2 0 0 0 0 0 12 0 1 1 2
145 Zimbabwe (ZIM) [ZIM] 12 3 4 1 8 1 0 0 0 0 13 3 4 1 8
146 Mixed team (ZZX) [ZZX] 3 8 5 4 17 0 0 0 0 0 3 8 5 4 17
147 Totals 27 4809 4775 5130 14714 22 959 958 948 2865 49 5768 5733 6078 17579

Look at the dataset. On top there is an unnecessary row. That’s actually not the header. There is an option called skiprows to get rid of unnecessary rows like that.

df1 = pd.read_csv(‘olympics.csv’, skiprows = 1)

df1.head()

Output:

Unnamed: 0 № Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 03 !.2 Combined total
0 Argentina (ARG) 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
1 Armenia (ARM) 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
2 Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
3 Australia (AUS) [AUS] [Z] 25 139 152 177 468 18 5 3 4 12 43 144 155 181 480
4 Austria (AUT) 26 18 33 35 86 22 59 78 81 218 48 77 111 116 304

Here skiprows = 1, means delete one row. By default it will delete one row from the top. You can delete 2, 3, 4 or any number of rows you want using skiprows. But you may not want to delete all the rows from the top. Like you may want to delete first row, third row and forth row. In that case you can specify the rows in a list.

df2 = pd.read_csv(‘olympics.csv’, skiprows = [0, 2, 3])

df2.head()

Output:

Unnamed: 0 № Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 03 !.2 Combined total
0 Argentina (ARG) 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
1 Armenia (ARM) 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
2 Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
3 Australia (AUS) [AUS] [Z] 25 139 152 177 468 18 5 3 4 12 43 144 155 181 480
4 Austria (AUT) 26 18 33 35 86 22 59 78 81 218 48 77 111 116 304

Sometimes in the csv files, there is no header, only values. But by default, pandas take the row as a header. To avoid that, we can use ‘header = None’. In this dataset there is a header. But for the sake of this example let’s just say that there is no header.

df3 = pd.read_csv(‘olympics.csv’, header = None)

df3.head()

Output:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 NaN № Summer 01 ! 02 ! 03 ! Total № Winter 01 ! 02 ! 03 ! Total № Games 01 ! 02 ! 03 ! Combined total
2 Afghanistan (AFG) 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2
3 Algeria (ALG) 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15
4 Argentina (ARG) 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70

Look pandas added an extra row of label on top to add a header. It didn’t take any row from the original dataset as a header. But actually, in this dataset there is a header. The second row, that means row index 1 is actually a header.

df4 = pd.read_csv(‘olympics.csv’, header = 1)

df4.head()

Output:

Unnamed: 0 № Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 03 !.2 Combined total
0 Afghanistan (AFG) 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2
1 Algeria (ALG) 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15
2 Argentina (ARG) 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
3 Armenia (ARM) 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12

Here in this dataframe, the default index column is a generic serial. It is useful to have a meaningful index. While importing you can set the index as any of the column you want. I am choosing the mane of the countries that is the first column. So the column index is zero.  

df5 = pd.read_csv(‘olympics.csv’, index_col = 0, skiprows = 1)

df5.head()

Output:

№ Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 03 !.2 Combined total
Afghanistan (AFG) 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2
Algeria (ALG) 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15
Argentina (ARG) 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
Armenia (ARM) 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12

Sometimes all the columns of the dataset may not be necessary for the analysis. Specially when the dataset is too big, it’s always makes sense to present part of it. Otherwise it becomes too hard to understand it. Here is way to import few columns. 

df6 = pd.read_csv(‘olympics.csv’, skiprows = 1, usecols = [‘Unnamed: 0′,’№ Summer’, ‘№ Winter’, ‘№ Games’], index_col = 0)

df6.head()

Output:

№ Summer № Winter № Games
Afghanistan (AFG) 13 0 13
Algeria (ALG) 12 3 15
Argentina (ARG) 23 18 41
Armenia (ARM) 5 6 11
Australasia (ANZ) [ANZ] 2 0 2

Let’s get back to skiprows again. You can skip rows of any range using a for loop. This option can be very useful at times. I am using a simple for loop here.  

df7 = pd.read_csv(‘olympics.csv’, skip_blank_lines = False, skiprows = [i for i in range(2, 5)] )

df7.head()

Output:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 NaN № Summer 01 ! 02 ! 03 ! Total № Winter 01 ! 02 ! 03 ! Total № Games 01 ! 02 ! 03 ! Combined total
1 Armenia (ARM) 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
2 Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
3 Australia (AUS) [AUS] [Z] 25 139 152 177 468 18 5 3 4 12 43 144 155 181 480
4 Austria (AUT) 26 18 33 35 86 22 59 78 81 218 48 77 111 116 304

Another very dynamic idea is to define a function and use that function. For this example, I am defining a simple function here. A function to return only the even indexed rows. Then, use this in the skiprows to skip the even indexed rows. You can use some advanced function as per your requirement.

def even_rows(index):

    if index%2 == 0:

        return True

    return False

 

df8 = pd.read_csv(‘olympics.csv’, skiprows = lambda x: even_rows(x) )

df8.head()

Output:

Unnamed: 0 № Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 03 !.2 Combined total
0 Algeria (ALG) 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15
1 Armenia (ARM) 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
2 Australia (AUS) [AUS] [Z] 25 139 152 177 468 18 5 3 4 12 43 144 155 181 480
3 Azerbaijan (AZE) 5 6 5 15 26 5 0 0 0 0 10 6 5 15 26
4 Bahrain (BRN) 8 0 0 1 1 0 0 0 0 0 8 0 0 1 1

At last, I should talk about skipfooter. Skipfooter skips rows from the bottom. Here I am using a value of 3 to the skipfooter. So, it will skip three rows from the footer.

df10 = pd.read_csv(‘olympics.csv’, header = 1, skipfooter = 3)

df10.tail()   

Output:

â„– Summer 01 ! 02 ! 03 ! Total â„– Winter 01 !.1 02 !.1 03 !.1 Total.1 â„– Games 01 !.2 02 !.2 03 !.2 Combined total
139 Vietnam (VIE) 14 0 2 0 2 0 0 0 0 0 14 0 2 0 2
140 Virgin Islands (ISV) 11 0 1 0 1 7 0 0 0 0 18 0 1 0 1
141 Yugoslavia (YUG) [YUG] 16 26 29 28 83 14 0 3 1 4 30 26 32 29 87
142 Independent Olympic Participants (IOP) [IOP] 1 0 1 2 3 0 0 0 0 0 1 0 1 2 3
143 Zambia (ZAM) [ZAM] 12 0 1 1 2 0 0 0 0 0 12 0 1 1 2

That’s all I wanted to share about importing csv file in a notebook using Pandas library. I hope it was helpful. Please subscribe to stay with me. There are lot more resources coming over time.  

Click here for the dataset.

Click here for the working video of this contents.

Leave a Reply

Close Menu