How to Detect Seasonality in the Time Series Data, And Remove Seasonality in Python

Time series data can be subject to seasonal fluctuations. For example, Halloween costumes are in high demand around Halloween, red roses and candy sell most around Valentine’s Day, and restaurants have more customers on weekends. Seasonality can occur at daily, weekly, or monthly intervals.

It is crucial to understand the seasonality in time series data before building forecasting models. In this article, I will explain how to detect seasonality in the data and how to remove it. I will explain the code and the process as we go.

First, the necessary imports:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

A dataset of US pollution data will be used for this tutorial. Please feel free to find the dataset in the link below to follow along:

U.S. Pollution Data (kaggle.com)

First, read the dataset into a DataFrame using the Pandas read_csv() method.

Then, make the DataFrame smaller by keeping only two columns. I picked ‘Date Local’, as this is a time series exercise, and ‘CO Mean’. Please feel free to practice with another variable if you want.

Finally, rename the columns to ‘Date’ and ‘CO’ for convenience.

# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' is the replacement
df = pd.read_csv('/content/uspollution_pollution_us_2000_2016.csv', on_bad_lines='skip', engine="python")
df_co = df[['Date Local', 'CO Mean']]
df_co = df_co.rename(columns={'Date Local': 'Date', 'CO Mean': 'CO'})
df_co

The Date column should be in Pandas datetime format, and it is helpful to use it as the index of the DataFrame in a time series analysis.

df_co['Date'] = pd.to_datetime(df_co['Date'])
df_co = df_co.set_index('Date')

If you look at the DataFrame, there are multiple readings for each day. I would like to create two more DataFrames: one that groups the data by day to get the daily average of CO, and another that groups the data by month to get the monthly average of CO.

# 'M' and 'D' are the documented period aliases for month and day
df_month = pd.DataFrame(df_co.groupby(df_co.index.to_period('M'))['CO'].mean())
df_day = pd.DataFrame(df_co.groupby(df_co.index.to_period('D'))['CO'].mean())
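As a side note, the same averages can usually be obtained with resample(). The sketch below uses a small synthetic frame (the name demo and its values are made up for illustration, not taken from the pollution dataset) and compares the two approaches. One difference to keep in mind: resample() also creates rows for empty days or months, while grouping by period does not.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_co: three readings per day over two months.
rng = np.random.default_rng(0)
stamps = pd.date_range("2000-01-01", "2000-02-29", freq="D").repeat(3)
demo = pd.DataFrame({"CO": rng.uniform(0.1, 1.0, len(stamps))}, index=stamps)

# Grouping by monthly period, as in the article.
by_period = demo.groupby(demo.index.to_period("M"))["CO"].mean()

# resample() with "MS" (month start) labels each bucket by its first day
# and returns a DatetimeIndex directly.
by_resample = demo["CO"].resample("MS").mean()

same_values = np.allclose(by_period.to_numpy(), by_resample.to_numpy())
print(same_values)
```

Because the resample() result already has a DatetimeIndex, no to_timestamp() conversion would be needed afterwards.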

This dataset covers several years. Plotting the daily and monthly data may reveal the seasonality.

df_day.plot(figsize=(15, 6)) 

The daily plot already shows seasonality in the data clearly. You can see spikes in the data after certain intervals. Let’s see what the monthly average data shows.

df_month.plot(figsize=(15, 6)) 

The monthly average smooths out the daily noise and gives a much cleaner line.

Seasonal Decompose

The seasonal_decompose() function in the statsmodels library splits a series into trend, seasonal, and residual components. You can plot all the components together, or take only the ‘seasonal’ part and plot that on its own. Here I call seasonal_decompose() on the monthly average and plot all the components.

First, the index of df_month, which became a period index in the groupby step, needs to be converted to a datetime index with to_timestamp(). Otherwise, it will generate an error.

import statsmodels.api as sm
df_month.index = df_month.index.to_timestamp()
res = sm.tsa.seasonal_decompose(df_month['CO'])
resplot = res.plot()

Here, the top plot is the original data, the second one shows the trend part from the data, the third one extracts the seasonal part, and the fourth one shows the residuals after excluding the trend and seasonality. This is probably the most common visualization for time series analysis.

ACF And PACF Plot

In time series analysis, ACF and PACF plots are also very commonly used to understand the pattern of the data. They show how strongly past values are related to the current value. So, if seasonality exists in the series, it can also be inferred from the ACF and PACF plots.

We already imported the functions from the statsmodels library in the beginning. Let’s see what the plots look like, and then we will interpret them.

fig, (ax1) = plt.subplots(1, 1, figsize=(10, 3))
plot_acf(df_month, ax=ax1)
ax1.set_title('ACF Plot')
plt.show()

The ACF plot shows the correlation between the series and a lagged copy of itself. The ACF plot above is made using the monthly average CO data, so the bar at lag k is the correlation between each month’s value and the value k months earlier. The first bar, at lag 0, is 1.0 because the series is perfectly correlated with itself. The blue zone shows the 95% confidence interval. If a bar reaches outside the blue zone, the correlation at that lag is statistically significant.

The plot above shows the correlation becoming significant again at regular lag intervals, which indicates clear seasonality.

Let’s have a look at the PACF plot.

max_lag = 12  # number of lags to show
fig, (ax1) = plt.subplots(1, 1, figsize=(10, 3))
plot_pacf(df_month, ax=ax1, lags=max_lag)
ax1.set_title('PACF Plot')
plt.show()

As usual, the first bar is 1.0 because it is the correlation of the series with itself. Each remaining bar shows the partial correlation at that lag: the correlation between the series and its value that many months earlier, after removing the effect of the shorter lags in between. For example, the bar at the 10th position shows the partial correlation with the tenth lag. There is a significant correlation at the 8th, 9th, 10th, and 11th lags. That means the CO values are significantly correlated with their earlier values, which is consistent with the seasonality seen in the data.

How to Remove Seasonality?

Seasonality can also be removed using the same techniques I used for removing the trend in my last article. Please check that article for the details.

But those techniques are good for removing any type of fluctuations and making the data stationary. Here I will demonstrate how you can target the seasonality only.

As we have data for several years here, we can take the average for each calendar month and divide the data by its corresponding monthly average. Let’s do it step by step.

First, take the average for each calendar month:

monthly_average = df_co.groupby(df_co.index.month).mean()
monthly_average

Now that we have the average for each calendar month, map it onto the df_month we created earlier.

mapped_monthly_average = df_month.index.map(lambda x:monthly_average['CO'][x.month])

Then divide the CO column of df_month by this mapped_monthly_average to get the CO values without the seasonality effect.

df_month['norm_co'] = df_month['CO']/mapped_monthly_average

Plot this ‘norm_co’ column to see if it worked:

df_month['norm_co'].plot(figsize=(15, 5))
plt.xlabel('Date')
plt.ylabel('Standardized CO')
plt.grid()
plt.show()

Look, there is no seasonality anymore, but a downward trend is still visible in the graph. That is expected, as we only wanted to deal with seasonality here.
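The division step can also be checked on data where the answer is known. The sketch below builds a synthetic monthly series (made-up numbers, with the hypothetical name demo) from a flat level multiplied by a seasonal factor, then divides by the calendar-month average. I used groupby(...).transform("mean") here as a shortcut that does the averaging and the mapping in one step. It is an alternative to the map() approach above, not a correction of it.

```python
import numpy as np
import pandas as pd

# Six years of synthetic monthly data: a flat level times a seasonal factor.
idx = pd.date_range("2000-01-01", periods=72, freq="MS")
month_numbers = idx.month.to_numpy()
month_factor = 1 + 0.3 * np.cos(2 * np.pi * (month_numbers - 1) / 12)
rng = np.random.default_rng(1)
level = 0.5 + rng.normal(0, 0.01, len(idx))
demo = pd.DataFrame({"CO": level * month_factor}, index=idx)

# Average for each calendar month (1 to 12), broadcast back to every row.
month_mean = demo.groupby(demo.index.month)["CO"].transform("mean")

# Dividing by the calendar-month average removes the multiplicative seasonal effect.
demo["norm_co"] = demo["CO"] / month_mean

# Spread between the highest and lowest calendar-month average, before and after.
before = demo.groupby(demo.index.month)["CO"].mean()
after = demo.groupby(demo.index.month)["norm_co"].mean()
print(round(before.max() - before.min(), 3), round(after.max() - after.min(), 3))
```

After the division, every calendar month averages to 1 by construction, so the values show how far each month sits above or below what is typical for that time of year.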

Conclusion

In my last article, we worked on how to detect and remove trends, and in this one, we learned to deal with seasonality in time series data. We will gradually move toward time series forecasting models. I hope this was helpful.
