A learning curve is a very useful tool for checking whether an algorithm is working correctly and for improving its performance. It shows whether an algorithm is suffering from bias (underfitting), variance (overfitting), or a bit of both.
If your machine learning algorithm is not working as expected, what should you do next? There are several options:
- Getting more training data, which is very time-consuming. It may even take months to obtain more data.
- Getting more training features. This may also take a lot of time, but if adding some polynomial features works, that is a quick win.
- Selecting a smaller set of training features.
- Increasing the regularization term.
- Decreasing the regularization term.
So, which one should you try next? It is not a good idea to start trying things at random, because you may end up spending too much time on something that does not help. You need to detect the problem first and then act accordingly. A learning curve makes the problem easy to detect, which saves a lot of time.
How a Learning Curve Works
A learning curve is a plot of the cost function. Plotting the cost on the training data and the cost on the cross-validation data in the same figure gives important insight into the algorithm. As a reminder, here is the formula for the cost function:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
In other words, it is the sum of the squared differences between the predicted output and the original output, divided by twice the number of training examples (m). To make the learning curve, we plot these costs as a function of the number of training examples. Instead of using all the training data at once, we train the algorithm on progressively larger subsets of it.
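To make the cost concrete, here is a minimal sketch that evaluates it by hand on three made-up points. The data, the parameter values, and the helper names are assumptions chosen for illustration; they are not taken from the dataset used later in this article.

```python
def hypothesis(theta, x):
    # Straight-line prediction: intercept theta[0] plus slope theta[1] times x
    return theta[0] + theta[1] * x

def cost(theta, xs, ys):
    # Sum of squared errors divided by twice the number of examples
    m = len(xs)
    return sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up example: the line y = 2x matches the first two points exactly
# and misses the third one by 1
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
theta = [0.0, 2.0]

example_cost = cost(theta, xs, ys)
print(example_cost)
```

The only non-zero error is 1, so the sum of squared errors is 1 and the cost is 1 / (2 * 3), or about 0.167.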
Have a look at the picture below:
Here is the concept. If we train on too few examples, the algorithm fits the training data perfectly and the training cost is 0. The picture above shows this clearly: with only one, two, or three examples the algorithm can learn them very well, so the training cost is zero or close to zero. But a model trained this way does not perform well on other data. When you evaluate it on the cross-validation data, it will very likely perform poorly, so the cross-validation cost will be very high.
On the other hand, as we use more and more data to train the algorithm, it no longer fits the training data perfectly, so the training cost rises. At the same time, because the algorithm has been trained on more data, it performs better on the cross-validation data and the cross-validation cost falls. The following sections show how to develop a learning curve.
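The effect described above can be reproduced on a toy example before working with the real dataset. The sketch below is an illustration under stated assumptions: the data points are made up, and the line is fitted with the closed-form least-squares solution instead of gradient descent to keep the example short.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # A single point cannot define a slope: use a flat line through it
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def cost(params, xs, ys):
    # Squared-error cost divided by twice the number of examples
    intercept, slope = params
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Made-up noisy, roughly linear data (not the article's dataset)
train_x = [1, 2, 3, 4, 5, 6]
train_y = [1.0, 3.5, 2.5, 4.8, 4.4, 6.9]
val_x = [1.5, 2.5, 3.5, 4.5, 5.5]
val_y = [1.8, 2.9, 3.4, 4.6, 5.3]

j_tr, j_val = [], []
for size in range(1, len(train_x) + 1):
    params = fit_line(train_x[:size], train_y[:size])
    j_tr.append(cost(params, train_x[:size], train_y[:size]))  # cost on the subset
    j_val.append(cost(params, val_x, val_y))                   # cost on held-out data

for size, (tr, val) in enumerate(zip(j_tr, j_val), start=1):
    print(size, round(tr, 4), round(val, 4))
```

With one or two training points the line passes through them exactly, so the training cost is zero while the validation cost is large. By the time all six points are used, the training cost has risen and the validation cost has dropped well below its starting value.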
Develop A Learning Algorithm
I will demonstrate how to draw a learning curve step by step. To draw a learning curve, we first need a machine learning algorithm. For simplicity, I will work with a linear regression algorithm. I will move a bit faster here and not explain every step, because I am assuming you already know how to develop a machine learning algorithm. If you need a refresher on how to develop a linear regression algorithm, please check this article first.
First, import the packages and the dataset. The dataset I am using here is taken from Andrew Ng’s machine learning course on Coursera. In this dataset, the X-values and y-values are organized in separate sheets of an Excel file. The X and y values of the cross-validation data are organized in two other sheets of the same Excel file. I provided the link to the dataset at the end of this article. Please feel free to download the dataset and practice yourself.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

file = pd.ExcelFile('dataset.xlsx')
df = pd.read_excel(file, 'Xval', header=None)
df.head()
In the same way, import the y-values for the training set:
y = pd.read_excel(file, 'yval', header=None)
y.head()
Let’s develop the linear regression algorithm quickly. Define hypothesis and cost function.
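m = len(df)  # number of training examples

def hypothesis(theta, X):
    # Straight-line model: intercept theta[0] plus slope theta[1] times X
    return theta[0] + theta[1] * X

def cost_calc(theta, X, y):
    # Sum of squared errors divided by twice the number of examples in X
    return np.sum((hypothesis(theta, X) - y) ** 2) / (2 * len(X))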
Now, we will define gradient descent to optimize the parameters:
def gradient_descent(theta, X, y, epoch, alpha):
    cost = []   # cost after each epoch
    n = len(X)  # number of examples being trained on
    for _ in range(epoch):
        hx = hypothesis(theta, X)
        # Both parameters are updated from the same predictions hx
        theta[0] -= alpha * np.sum(hx - y) / n
        theta[1] -= alpha * np.sum((hx - y) * X) / n
        cost.append(cost_calc(theta, X, y))
    return theta, cost
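One quick way to check that gradient descent is behaving is to run it on data where the answer is known and confirm that the cost falls. The sketch below is a self-contained, plain-Python illustration: the data (exactly y = 1 + 2x), the learning rate, and the epoch count are assumptions chosen so the example converges, not values from this article.

```python
def hypothesis(theta, x):
    return theta[0] + theta[1] * x

def cost(theta, xs, ys):
    m = len(xs)
    return sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, epochs, alpha):
    theta = [0.0, 0.0]
    m = len(xs)
    history = []  # cost after every epoch
    for _ in range(epochs):
        # Compute all errors first so both parameters use the same predictions
        errors = [hypothesis(theta, x) - y for x, y in zip(xs, ys)]
        theta[0] -= alpha * sum(errors) / m
        theta[1] -= alpha * sum(e * x for e, x in zip(errors, xs)) / m
        history.append(cost(theta, xs, ys))
    return theta, history

# Made-up data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta, history = gradient_descent(xs, ys, epochs=2000, alpha=0.1)
print(theta, history[0], history[-1])
```

If the final cost is not clearly lower than the first one, the learning rate is usually the first thing to revisit.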
The linear regression algorithm is done. We need a method to predict the output:
def predict(theta, X, y, epoch, alpha):
    theta, cost = gradient_descent(theta, X, y, epoch, alpha)
    return hypothesis(theta, X), cost, theta
Now, initialize the parameters as zeros and use the predict function to predict the output variable.
theta = [0, 0]
# The sheets were read with header=None, so the values are in column 0
y_predict, cost, theta = predict(theta, df[0], y[0], 1400, 0.001)
The updated theta values are: [10.724868115832654, 0.3294833798797125]
Now, plot the predicted output and the original output against the X-values in the same plot:
plt.figure()
plt.scatter(df[0], y[0], label='original')
plt.scatter(df[0], y_predict, label='predicted')
plt.legend()
Looks like the algorithm is working well.
Draw A Learning Curve
Now, we can draw a learning curve. First, let’s import the X and y values for our cross-validation dataset. As I mentioned earlier, they are organized in separate Excel sheets.
file = pd.ExcelFile('dataset.xlsx')
cross_val = pd.read_excel(file, 'X', header=None)
cross_val.head()
cross_y = pd.read_excel(file, 'y', header=None)
cross_y.head()
For this purpose, I want to modify the gradient_descent function a little. In our previous gradient_descent function, we calculated the cost in each iteration. I did that because it is good practice in traditional machine learning algorithm development. But for the learning curve, we do not need the cost in each iteration. So, to save running time, I will skip the cost calculation in each epoch and return only the updated parameters.
def grad_descent(theta, X, y, epoch, alpha):
    n = len(X)  # number of examples in this training subset
    for _ in range(epoch):
        hx = hypothesis(theta, X)
        # Both parameters are updated from the same predictions hx
        theta[0] -= alpha * np.sum(hx - y) / n
        theta[1] -= alpha * np.sum((hx - y) * X) / n
    return theta
As I discussed earlier, to develop a learning curve, we need to train the learning algorithm on different subsets of the training data. Our training dataset has 21 examples. I will train the algorithm using just one example, then two, then three, all the way up to 21. So, we will train the algorithm 21 times on 21 subsets of the training data. We will also keep track of the cost for each subset. Please have a close look at the code; it will make this clearer.
j_tr = []        # training cost for each subset size
theta_list = []  # fitted parameters for each subset size
for i in range(1, len(df) + 1):
    theta = [0, 0]  # restart from zeros for every subset
    theta_list.append(grad_descent(theta, df[0][:i], y[0][:i], 1400, 0.001))
    j_tr.append(cost_calc(theta, df[0][:i], y[0][:i]))
theta_list
Here are the training parameters for each subset of training data:
Here is the cost for each training subset:
Look at the cost for each subset. When the training data had only 1 or 2 examples, the cost was zero or almost zero. As we kept increasing the training data, the cost also went up, which was expected. Now, use the parameters fitted on each training subset to calculate the cost on the cross-validation data:
j_val = []  # cross-validation cost for each set of fitted parameters
for i in theta_list:
    j_val.append(cost_calc(i, cross_val[0], cross_y[0]))
j_val
In the beginning, the cost was really high because the parameters were trained on too few examples. But as the parameters improved with more training data, the cross-validation error kept going down. Let’s plot the training error and cross-validation error in the same plot:
%matplotlib inline
import matplotlib.pyplot as plt
sizes = range(1, len(j_tr) + 1)  # number of training examples in each subset
plt.figure()
plt.scatter(sizes, j_tr, label='training')
plt.scatter(sizes, j_val, label='cross-validation')
plt.xlabel('number of training examples')
plt.ylabel('cost')
plt.legend()
This is our learning curve.
Drawing Conclusions From a Learning Curve
The learning curve above looks nice. It behaves the way we expected. In the beginning, the training error was very small and the validation error was very high. Gradually the two curves came together until they almost overlapped. That is the ideal case! But in real life, it does not happen very often. Most machine learning algorithms do not work perfectly the first time. Almost always they suffer from some problem that we need to fix. Here I will discuss some common issues.
We may find our learning curve looks like this:
If there is a significant gap between the training error and the validation error, that indicates a high variance problem, also called an overfitting problem. Getting more training data, selecting a smaller set of features, or both may fix this problem.
If a learning curve looks like this, it means that in the beginning the training error was very small and the validation error was very high. Gradually, the training error goes up and the validation error goes down, but at some point the two curves become parallel. You can see from the picture that, after a point, the cross-validation error does not go down anymore even with more training data. In this case, getting more training data will not improve the machine learning algorithm. This indicates that the learning algorithm is suffering from a high bias problem. In this case, getting more training features may help.
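The two patterns above can be turned into a rough rule of thumb. The sketch below is only an illustration: the function name diagnose and both thresholds (gap_ratio and high_error) are hypothetical and would need tuning to the scale of the cost in a real problem. It looks only at the last point of each curve.

```python
def diagnose(j_tr, j_val, gap_ratio=0.5, high_error=1.0):
    """Rough reading of a learning curve from its final points.

    gap_ratio and high_error are hypothetical thresholds chosen for this
    example; suitable values depend on the scale of the cost.
    """
    final_tr, final_val = j_tr[-1], j_val[-1]
    gap = final_val - final_tr
    # A large remaining gap between the curves suggests high variance
    if gap > gap_ratio * max(final_val, 1e-12):
        return "high variance"
    # Curves that have met but at a high cost suggest high bias
    if final_tr > high_error:
        return "high bias"
    return "ok"

# Made-up curves for illustration
print(diagnose([0.0, 0.1, 0.2], [9.0, 5.0, 3.0]))    # wide gap at the end
print(diagnose([0.0, 4.0, 5.0], [12.0, 6.0, 5.4]))   # curves meet at a high cost
print(diagnose([0.0, 0.2, 0.3], [5.0, 0.6, 0.35]))   # curves meet at a low cost
```

A check like this is no substitute for looking at the plot, but it can help when comparing many models at once.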
Fixing A Learning Algorithm
Assume we are implementing linear regression, but the algorithm is not working as expected. What should we do now?
First, draw a learning curve as I demonstrated here. If you detect a high variance problem, select a smaller set of features based on the importance of the features. If that helps, it will save some time. If not, try getting more training data.
If you detect a high bias problem from the learning curve, you already know that getting additional features is a possible solution. You may even try adding some polynomial features. That often helps and saves a lot of time.
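As an illustration of why polynomial features can help with high bias, the sketch below fits a straight line and a quadratic to made-up curved data and compares the training costs. It uses NumPy's polyfit and polyval for brevity instead of the gradient descent code in this article, and the data and the helper name training_cost are assumptions for this example.

```python
import numpy as np

# Made-up data that follows a curve, so a straight line underfits it
x = np.linspace(-3, 3, 13)
y = 1.0 + 0.5 * x + 2.0 * x**2

def training_cost(degree):
    # Least-squares polynomial fit of the given degree, then the usual
    # squared-error cost divided by twice the number of examples
    coeffs = np.polyfit(x, y, degree)
    residuals = np.polyval(coeffs, x) - y
    return np.sum(residuals**2) / (2 * len(x))

cost_linear = training_cost(1)     # x only
cost_quadratic = training_cost(2)  # x and x squared
print(cost_linear, cost_quadratic)
```

The straight line leaves a large training cost no matter how much data it sees, while adding the squared feature brings the cost down to essentially zero on this data.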
If you are implementing an algorithm with the regularization term lambda, try decreasing lambda if the algorithm is suffering from high bias, and try increasing lambda if it is suffering from high variance.
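To see what lambda does, here is a minimal sketch of a regularized straight-line fit. It assumes one particular form of regularization: the slope b minimizes the sum of squared errors on centered data plus lam * b**2, with the intercept left unpenalized, which gives the closed form used below. The data and the function name ridge_slope are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    # Slope that minimizes sum((y_c - b * x_c)**2) + lam * b**2,
    # where x_c and y_c are the centered inputs and outputs
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + lam)

# Made-up data with a slope close to 2
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slopes = [ridge_slope(xs, ys, lam) for lam in (0, 1, 10, 100)]
print(slopes)
```

A larger lambda pulls the slope toward zero, which makes the model simpler (less variance, more bias); a smaller lambda lets the model follow the training data more closely.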
With a neural network, we may also come across this bias or variance problem. For a high bias (underfitting) problem, we can try increasing the number of neurons or the number of hidden layers. To address a high variance (overfitting) problem, we can try decreasing the number of neurons or the number of hidden layers. We can even draw a learning curve for different numbers of neurons.
Thank you so much for reading this article. I hope this was helpful.
Here is the dataset used in this article