What is the time-series analysis?
It is the analysis of the dataset that has a sequence of time stamps. It has become more and more important with the increasing emphasis on machine learning. So many different types of industries use time-series data now for time series forecasting, seasonality analysis, finding trends, and making important business and research decisions. So it is very important as a data scientist or data analyst to understand the time series data clearly.
Time series data is used in:
Banks and Financial institutes
Power, gas, and oil industries
Periodic measures in a mechanical or chemical process
And many more areas
It is unavoidable. So, let’s learn it.
I will start with some general functions and show some more topics using the Facebook Stock price dataset.
Let’s dive in!
Using the correct format
Time series data can come in with so many different formats. But not all of those formats are friendly to python’s pandas’ library. The most convenient format is the timestamp format for Pandas. But most of the time time-series data come in string formats. Here I have the example of the different formats time series data may be found in.
import pandas as pd import numpy as np
dates = ['2020-11-25 2:30:00 PM', 'Jan 5, 2020 18:45:00', '01/11/2020', '2020.01.11', '2020/01/11', '20201105']
The ‘dates’ variable above showing five different formats of date-time settings and all are correct. But we need this specific format to work conveniently. So, convert those dates to the right format.
DatetimeIndex(['2020-11-25 14:30:00', '2020-01-05 18:45:00', '2020-01-11 00:00:00', '2020-01-11 00:00:00', '2020-01-11 00:00:00', '2020-11-05 00:00:00'], dtype='datetime64[ns]', freq=None)
There are other countries around the world, who use days first. For example in Americal style June 1st, 2002 is written as ‘6/1/2020’. But in England, South Asian countries like India, Bangladesh, Pakistan, and some other parts of the world write it as ‘1/6/2020’.
If you are working for a client from those other parts of the world, here is how to format the dates.
Index(['25-11-20', '05-01-20', '11-01-20', '11-01-20', '11-01-20', '05-11-20'], dtype='object')
I just used ‘%d-%m-%y’ as a format here. You can change the sequence as required.
If you need to put the month first or year first, you only need to change the sequence in the format.
Now I will import the dataset that we will use to demonstrate many of the functions. Our Facebook Stock data. Feel free to download the dataset here and follow along.
If you reading this to learn, I strongly recommend to practice along with reading.
df = pd.read_csv('FB_data.csv') df.head()
This is a raw dataset. Doesn’t look bad! Well organized. But we need to change the format of the ‘Date’ column as we discussed earlier. We will make it to the DatetimeIndex format and put it as the index column.
Because when the ‘date’ column is the index column we will be able to resample it very easily. You will see what that means in the later sections.
Here is the correct way of importing the data where I am changing the format of the dates and setting it as an index while importing.
df = pd.read_csv('FB_data.csv', parse_dates=['Date'], index_col="Date") df.head()
Look, we changed the format of the ‘Date’ column!
In this section, I will discuss how to resample the data.
Why resampling is important?
Because we do not always need all the data in a huge dataset. For example, we may need only the data from June 2019. If your date format is in DatetimeIndex, it is very easy:
We have the data for eight days only. Find the mean of the opening stock price in June 2019.
We can get the data on an individual date as well.
Open 1.887500e+02 High 1.920000e+02 Low 1.887500e+02 Close 1.911400e+02 Adj Close 1.911400e+02 Volume 2.275120e+07 Name: 2019-06-21 00:00:00, dtype: float64
Let’s say, we need two weeks’ data from June 27th to July 10th of 2019.
You can also resample by month. That gives the monthly average. For example, here I will get the monthly average of closing data:
Date 2019-06-30 190.324286 2019-07-31 199.595454 2019-08-31 184.497726 2019-09-30 185.735000 2019-10-31 184.383912 2019-11-30 195.718500 2019-12-31 201.951904 2020-01-31 216.643333 2020-02-29 207.505263 2020-03-31 165.747727 2020-04-30 177.003335 2020-05-31 216.549001 2020-06-30 232.671332 Freq: M, Name: Close, dtype: float64
We can take the monthly average and plot with just one line of code:
If you want weekly data and plot it, you can get it by this code:
Instead of simple line plot, you can get total 13 types of plots using a ‘kind’ parameter in plot() function.
I named those 13 types of plots after this bar plot.
I will make a bar plot of quarterly closing data.
The ‘kind’ parameter above takes the following 13 types of visualization:
Please see this article for details about those visualizations.
The way it sounds, the shift function shifts the data to the specified number of times. FB dataset we are using starts on June 20th, 2019. Now, if we shift our data by 1, June 20th, 2019 data will move to June 21st, 2019, June 21st, 2019 data will shift to June 22nd, 2019, and so on.
I will explain a lillte later why people use shift
For this example, I will only use the column. You will see the shifts very clearly. I will make a new DataFrame called ‘df1′ with only opening data.
df1 = pd.DataFrame(df['Open']) df1.head()
Shift this data by 1.
The first row has a null value. Simply because the first row moves to the second row. But there is no data before the first row.
If you use a negative value in shift it will do just the opposite.
Why we use shift?
The shift gives you the previous day data or the next day’s data. Here is a use case. I will put today’s data and the previous day data side by side using shift.
df1['Prev Day Opening'] = df1['Open'].shift(1) df1
Isn’t it useful!?
You can also get the change in 1-day data in another column:
df1['1 day change'] = df1['Open'] - df1['Prev Day Opening']
Find the 1 week total in percentage. for that, we have to shift by 5 days. Right? And then take the difference from today and 5 days early data. Multiply by 100 and divide by the original today data. Let’s see it to understand it better.
df1['One week total return'] = (df1['Open'] - df1['Open'].shift(5)) * 100/df1['Open'].shift(5)df1.tail()
I am taking df.tail() because we did a 5-day shift. So the first 5 rows will be null.
Understanding of timezone is important. Probably, you are in one timezone and your client is in another timezone. Pandas have great functionality to deal with different timezones.
We have two types of DateTime data. Naive DateTime which has no idea about timezone and time zone aware DateTime that knows the time zone. The data we have is naive DateTime. So, we need to use tz_localize to convert this DateTime.
Convert the index of the Facebook dataset to ‘US/Eastern’.
df.index = df.index.tz_localize(tz = 'US/Eastern') df.index
DatetimeIndex(['2019-06-20 00:00:00-04:00', '2019-06-21 00:00:00-04:00', '2019-06-24 00:00:00-04:00', '2019-06-25 00:00:00-04:00', '2019-06-26 00:00:00-04:00', '2019-06-27 00:00:00-04:00', '2019-06-28 00:00:00-04:00', '2019-07-01 00:00:00-04:00', '2019-07-02 00:00:00-04:00', '2019-07-03 00:00:00-04:00', ... '2020-06-08 00:00:00-04:00', '2020-06-09 00:00:00-04:00', '2020-06-10 00:00:00-04:00', '2020-06-11 00:00:00-04:00', '2020-06-12 00:00:00-04:00', '2020-06-15 00:00:00-04:00', '2020-06-16 00:00:00-04:00', '2020-06-17 00:00:00-04:00', '2020-06-18 00:00:00-04:00', '2020-06-19 00:00:00-04:00'], dtype='datetime64[ns, US/Eastern]', name='Date', length=253, freq=None)
On each date, it shows negative 4 hours. Again, if we convert it to ‘Europe/Berline’ it will add 6 hours to it.
df = df.tz_convert('Europe/Berlin') df.index
DatetimeIndex(['2019-06-20 06:00:00+02:00', '2019-06-21 06:00:00+02:00', '2019-06-24 06:00:00+02:00', '2019-06-25 06:00:00+02:00', '2019-06-26 06:00:00+02:00', '2019-06-27 06:00:00+02:00', '2019-06-28 06:00:00+02:00', '2019-07-01 06:00:00+02:00', '2019-07-02 06:00:00+02:00', '2019-07-03 06:00:00+02:00', ... '2020-06-08 06:00:00+02:00', '2020-06-09 06:00:00+02:00', '2020-06-10 06:00:00+02:00', '2020-06-11 06:00:00+02:00', '2020-06-12 06:00:00+02:00', '2020-06-15 06:00:00+02:00', '2020-06-16 06:00:00+02:00', '2020-06-17 06:00:00+02:00', '2020-06-18 06:00:00+02:00', '2020-06-19 06:00:00+02:00'], dtype='datetime64[ns, Europe/Berlin]', name='Date', length=253, freq=None)
You can find all the time zones available in the world and use the one suitable for you this way:
from pytz import all_timezones print(all_timezones)
Here is part of the output. The full output is too big:
['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Asmera', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo',.....
How to generate missing dates
What if you have the data and you know the period but the time is not recorded in the dataset. Or you have data for the second quarter of last year but you do not have that for this year. And you need to use last year’s data this year.
There might be many occasions where you may need to generate a series of dates. Pandas date_range function will come in handy. Let’s generate a period of 10 days:
rng = pd.date_range(start='11/1/2020', periods=10) rng
DatetimeIndex(['2020-11-01', '2020-11-02', '2020-11-03', '2020-11-04', '2020-11-05', '2020-11-06', '2020-11-07', '2020-11-08', '2020-11-09', '2020-11-10'], dtype='datetime64[ns]', freq='D')
What if I need only business days?
I need to add only an extra parameter called frequency like this:
rng = pd.date_range(start='11/1/2020', periods=10, freq='B') rng
DatetimeIndex(['2020-11-02', '2020-11-03', '2020-11-04', '2020-11-05', '2020-11-06', '2020-11-09', '2020-11-10', '2020-11-11', '2020-11-12', '2020-11-13'], dtype='datetime64[ns]', freq='B')
There are several more options and frequencies like that. Please check in this article where I explained only the date_range function in details:
So, we can generate a series of dates and add them to our dataset!
Rolling function aggregates data for a specified number of DateTime. Here I will take the mean of every three days. I will explain some more after working on this example:
What exactly happened here? I passed 3 as an argument in the rolling function and the aggregate function is mean. So, it is taking a mean of 20th, 21st, and 24th June ‘High’ data and putting on 24th. Doing the same for 21st, 24th, and 25th data and putting on 25th and so on. Lots of time we use the weekly average or 3-day average results to make decisions.
You can also choose where to put the rolling data. Here is an example:
data_rol = df[['High', 'Low']].rolling(window = 7, center = True).mean() data_rol
Here in rolling function, I passed window = 7. That means it will take a 7-day average. center = True means it will put that average in the 4th row instead of the 7th row. That’s why it has some null values at the bottom as well.
Let’s plot the original ‘High’ data and 7 days rolled ‘High’ data in the same plot:
%matplotlib inline import matplotlib.ticker as ticker fig, ax = plt.subplots(figsize= (11, 4))
ax.plot(df['High'], marker = '.', markersize=4, color='0.4', linestyle='None', label='Daily')
ax.xaxis.set_major_locator(ticker.MultipleLocator(30)) ax.plot(data_rol['High'], linewidth=2, label='7-d rolling mean') ax.set_xlabel('Month')
Usually, this type of plot is used to observe any trend in the data
In our data, there is a trend observable. After January 2020 the values start dropping and the curve is steep. In time series analysis we sometimes work for finding the trend. But sometimes we need to remove the trends from the data. Especially when we need to use the time series data for machine learning or forecasting.
Again after the march, it has a steep rise. In the next section, I will show you how to get rid of this type of trend.
Differencing for removing the trend
If there is any trend in the data, it is not good for modeling, forecasting, or observing seasonality. To improve model performance, or to observe any seasonality or any noise in the data, differencing is a common practice. It takes the difference in data for a specified number of days. Here is an example:
df_first_order_diff = df[['High', 'Low']].diff() df_first_order_diff
Here I did not specify any number of days in the .diff() function. So by default, it took just a 1-day difference.
Do you see what happened in the resulting table? The ‘High’ and ‘Low’ data is ‘20–06–19’ is the difference in ‘High’ and ‘Low’ data of 21–06–19 and 20–06–19. And it is set in 21–06–19. That’s why it’s null in 20–06–19. Because there no data before that to subtract.
This process of differencing is supposed to remove the trend. By any chance it does not, try with a 3 day differencing or 7 days differencing. This is how to take a 3 day differencing:
Let’s plot the data from first-order differencing from above to see if the trend we observed in the last section is removed.
start = '20-06-19'
fig, ax = plt.subplots(figsize = (11, 4))
ax.plot(df_first_order_diff.loc[start:, "High"], marker = 'o', markersize = 4, linestyle = '-', label = 'First Order Differencing') ax.xaxis.set_major_locator(ticker.MultipleLocator(30))
Look that obvious trend is gone! Feel free to check with the 3day differencing I talked about earlier if you can get rid of that slight trend at the end.
Important Time Feature Extraction
You can extract the year, month, week, or weekday from the time series that can be very useful. Let’s start with extracting the year from our index column ‘Date’.
Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, ... 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020], dtype='int64', name='Date', length=253)
Now, take a subset of the dataset to make it smaller and add the years in a separate column.
df3 = df[['High','Low', 'Volume']] df3['Year'] = pd.DatetimeIndex(df3.index).year df3
See we added the year in the end. In the same way, you can extract the month and weekday. Here is the code for that:
Int64Index([6, 6, 6, 6, 6, 6, 6, 7, 7, 7, ... 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], dtype='int64', name='Date', length=253)
Here is how to extract the weekdays:
Int64Index([3, 4, 0, 1, 2, 3, 4, 0, 1, 2, ... 0, 1, 2, 3, 4, 0, 1, 2, 3, 4], dtype='int64', name='Date', length=253)
Weekday comes out to be the numbers. What if you need weekdays format as Sunday, Monday, and so on? That will be more useful!
df3['Weekday'] = pd.DatetimeIndex(df3.index).to_series().dt.day_name() df3.head()
Let’s check if weekday has any effect on the ‘High’, ‘Low’, and ‘Volume’ data.
import seaborn as sns
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(['High', 'Low', 'Volume'], axes): sns.boxplot(data=df3, x = 'Weekday', y = name, ax=ax) ax.set_title(name)
Weekday has an effect on those data, right? On Wednesday ‘High’, ‘Low’ and ‘Volume’ everything is higher. On Monday it’s the opposite. Boxplots give a lot of information in one bundle. If you need a refresher on how to extract all the data from boxplots, here is a detailed article.
Another essential python function. We will learn it by doing. The most basic way of using the Period function:
y = pd.Period('2020') y
This output shows that this period ‘2020’ will end in December. Make sense, right?
Here is the directory of all the information that can be extracted from the Period function:
Here is part of the output. Because the directory is big!
. . . . . 'asfreq', 'day', 'dayofweek', 'dayofyear', 'days_in_month', 'daysinmonth', 'end_time', 'freq', 'freqstr', 'hour', 'is_leap_year', 'minute', 'month', 'now', 'ordinal', 'quarter', 'qyear', 'second', 'start_time', 'strftime', 'to_timestamp', 'week', 'weekday', 'weekofyear', 'year']
I will demonstrate some of them.
If we use the frequency as month:
month = pd.Period('2020-2', freq="M") month
What can we do with this type of month data?
You can add or subtract if necessary. For example, if you have age data of students and need to update the years or months, you can do that like this:
month + 2
In the same way, you can add or subtract days. If we put a date it will take the frequency as the day by default.
d = pd.Period('2020-02-28') d
If you add a day or two it will add a day or two. But the date I put here is February 28th. That is different, right? In leap years we have 29 days in February and the other years we have 28 days in February.
Let’s add 2 days on top of the date d above:
d + 2
After adding 2 days to February 28th, I got March 1st. That means the Period function knows the leap years.
In the same way, you can add year, hours, minutes even quarters.
Let’s see an example of a quarter:
q = pd.Period('2020Q1') q
Check the start and end-month of q.
The first month of 2020Q1 is January. That means by default the 1st quarter starts in January. Check when it ends. Though we know it should end in March.
The Period q starts in January and ends in March. The business year does not start in January and end in March everywhere. I will talk about it some more in a minute.
Add 1 to the q:
q + 1
Here, ‘Q-DEC’ means the quarter ends in December. There are four quarters in a year and the last quarter ends in December. But there are several industries out there who use January as the end of the 4th quarter or June as the 4th quarter.
We can specify the end of quarters using a ‘freq’ parameter. In the next example, I will use the end of the fourth quarter as January.
q1 = pd.Period('2020Q2', freq = 'Q-Jan') q1
Look, Here we changed the end of the 4th quarter to January! Feel free to check the start and end-month of q1. You will see the start month will be march instead of April. Because the first quarter runs from February to April.
The way we generated date_range before, we can generate period range as well:
idx = pd.period_range('2017', '2020', freq = 'Q') idx
PeriodIndex(['2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2', '2018Q3', '2018Q4', '2019Q1', '2019Q2', '2019Q3', '2019Q4', '2020Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
By default, it started by ‘2017Q1’. Because by default quarter starts from January and ends in December. But as before if we specify the end of the Quarter in January, it will start with 2017Q4.
idx = pd.period_range('2017', '2020', freq = 'Q-Jan') idx
PeriodIndex(['2017Q4', '2018Q1', '2018Q2', '2018Q3', '2018Q4', '2019Q1', '2019Q2', '2019Q3', '2019Q4', '2020Q1', '2020Q2', '2020Q3', '2020Q4'], dtype='period[Q-JAN]', freq='Q-JAN')
You can convert these quarters to timestamps:
idx = idx.to_timestamp() idx
DatetimeIndex(['2016-11-01', '2017-02-01', '2017-05-01', '2017-08-01', '2017-11-01', '2018-02-01', '2018-05-01', '2018-08-01', '2018-11-01', '2019-02-01', '2019-05-01', '2019-08-01', '2019-11-01'], dtype='datetime64[ns]', freq='QS-NOV')
Again, when we have timestamps we can convert it to quarters using to_period().
PeriodIndex(['2016Q4', '2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2', '2018Q3', '2018Q4', '2019Q1', '2019Q2', '2019Q3', '2019Q4'], dtype='period[Q-DEC]', freq='Q-DEC')
Congratulations! You just learned to perform a time series analysis on any dataset now!
I tried to document and explain most of the major pandas’ function for time series analysis. After working on this entire page, you should have enough knowledge to perform an efficient time series analysis on any time series data. But remember, it will take a lot of practice to become proficient at using all these functions! Happy coding!
#datascience #dataAlatytics #python #programming #DataAnalysis