Understanding Regularization in Plain Language: L1 and L2 Regularization

The meaning of the word regularization is “the act of changing a situation or system so that it follows laws or rules”. That’s what it does in the machine learning world as well.

Regularization is a method that constrains, or regularizes, the weights. Why do we need to constrain the weights?

One of the major problems in machine learning is overfitting, and regularization is used to prevent it. This article focuses on why overfitting happens and how to prevent it.

If the degree of the polynomial in a polynomial regression is too high, or the number of features is too high, the algorithm learns the noise in the training set and fits it so well that the model does not generalize to new data. It is only good for the data in the training set. This is known as the overfitting problem, or the high-variance problem. In this case, the training accuracy is very high but the validation accuracy is poor.

Let’s take the linear regression formula for this explanation, because it is the simplest formula:

y = mx + c

If we have only one feature, this is the formula. But in the real world, it almost never happens that we are dealing with only one feature. Most of the time we have several features.

So, the formula becomes:

y = m1x1 + m2x2 + m3x3 + … + mnxn

I did not include the intercept term here, because this article focuses on the weights. The slopes m1, m2, m3, …, mn are randomly generated values in the beginning. In machine learning, the slopes are also referred to as the weights. The slopes get updated based on the Mean Squared Error (MSE) of the output values. If you need a refresher, here is the formula for the MSE:

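MSE = (1/N) Σ (yi - ŷi)²

Here N is the number of training samples, yi is the actual output for sample i, and ŷi is the predicted output for that sample.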
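As a minimal sketch of the two formulas above, the prediction is a weighted sum of the features and the MSE averages the squared differences between actual and predicted outputs. The function names `predict` and `mse` and the toy data are made up for illustration; they are not from the original article.

```python
def predict(weights, features):
    """Weighted sum m1*x1 + ... + mn*xn for one sample (no intercept)."""
    return sum(m * x for m, x in zip(weights, features))


def mse(weights, X, y):
    """Mean squared error of the predictions over all samples."""
    errors = [(target - predict(weights, row)) ** 2 for row, target in zip(X, y)]
    return sum(errors) / len(errors)


# Toy data (assumed for illustration): y = 2*x1 + 3*x2 exactly.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [8.0, 4.0, 9.0]

perfect = mse([2.0, 3.0], X, y)  # weights that reproduce y exactly
poor = mse([0.0, 0.0], X, y)     # all-zero weights predict 0 for every sample
```

With the weights that generated the data the MSE is zero, while the all-zero weights give a much larger error. That gap is the signal a training loop uses to update the slopes.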

To solve an overfitting problem, a regularization term is added to the MSE. There are two common types of regularization: L1 and L2.

L1 Regularization:

Here is the expression for L1 regularization. Linear regression with the L1 norm is known as Lasso regression:

Cost = MSE + λ (|m1| + |m2| + … + |mn|)

The first term of this formula is the plain MSE. The second term is the regularization term: the sum of the absolute values of all the slopes, multiplied by lambda. You need to choose lambda based on the results on the cross-validation data. If lambda is bigger, the regularization term is bigger for the same slopes, and so is the total cost. That means a bigger penalty. To bring the cost back down, the slopes get smaller.
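The effect of lambda on the cost can be sketched in a few lines. The helper `lasso_cost`, the MSE value of 1.0, and the example slopes are assumptions chosen only to make the arithmetic easy to follow.

```python
def lasso_cost(mse_value, weights, lam):
    """MSE plus lambda times the sum of the absolute values of the slopes."""
    return mse_value + lam * sum(abs(m) for m in weights)


# Hypothetical slopes and a hypothetical MSE of 1.0.
weights = [0.5, -2.0, 0.0]

small_lambda_cost = lasso_cost(1.0, weights, lam=0.1)
large_lambda_cost = lasso_cost(1.0, weights, lam=1.0)
```

The slopes and the MSE are identical in both calls. Only lambda changes, and the larger lambda produces the larger cost.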

On the other hand, if the slopes are bigger, the regularization term also becomes larger. That is also a penalty, so the slopes start getting smaller. Some of the slopes may become exactly zero, or so close to zero that it makes no difference, and the corresponding features are dismissed, because each slope is multiplied by a feature. In this way, L1 regularization can work for feature selection as well. The downside is that if you do not want to lose any information or eliminate any feature, you have to be careful.
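One way to see how slopes end up at zero is the soft-thresholding step that many Lasso solvers use. This is a swapped-in illustration rather than something stated in the article: `soft_threshold`, the threshold value, and the example slopes are all assumptions.

```python
def soft_threshold(m, threshold):
    """Shrink a slope toward zero by `threshold`; small slopes become exactly 0."""
    if m > threshold:
        return m - threshold
    if m < -threshold:
        return m + threshold
    return 0.0


# Hypothetical slopes before the shrinkage step.
slopes = [3.0, -0.4, 0.05, -2.0]
shrunk = [soft_threshold(m, 0.5) for m in slopes]

# Features whose slope survived the shrinkage (index positions).
kept_features = [i for i, m in enumerate(shrunk) if m != 0.0]
```

Slopes whose magnitude is below the threshold are set to zero, so the features at those positions no longer contribute to the prediction. The larger slopes are only reduced.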

One advantage of L1 regularization is that it is generally considered more robust to outliers than L2 regularization. It can also be used for feature selection.

L2 Regularization:

Here is the expression for L2 regularization. Linear regression with the L2 norm is also called Ridge regression:

Cost = MSE + λ (m1² + m2² + … + mn²)

As you can see in the formula, we add the sum of the squares of all the slopes, multiplied by lambda. As with L1 regularization, if you choose a higher lambda value, the penalty will be higher, so the slopes will become smaller. Also, if the values of the slopes are higher, the regularization term is higher. That means a higher penalty. But because it takes the squares of the slopes, the pull of the penalty weakens as a slope approaches zero, so the slopes shrink but do not go all the way down to zero. So, you will not lose any feature contribution in the algorithm.

The downside is that it is affected too much by outliers. Because we square the weights, a value that is a lot higher than the others dominates the penalty.
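The dominating effect of squaring can be made concrete by comparing how much of the penalty a single large weight accounts for under each norm. The helper `penalty_shares` and the example weights are hypothetical; lambda is left out because it multiplies every term equally and cancels in the ratio.

```python
def penalty_shares(weights, power):
    """Fraction of the total penalty contributed by each weight.

    power=1 corresponds to the L1 penalty |m|, power=2 to the L2 penalty m**2.
    """
    parts = [abs(m) ** power for m in weights]
    total = sum(parts)
    return [p / total for p in parts]


# Hypothetical weights: two small values and one much larger value.
weights = [0.5, 0.5, 10.0]

l1_shares = penalty_shares(weights, 1)
l2_shares = penalty_shares(weights, 2)
```

Under L1 the large weight accounts for 10/11 of the penalty. Under L2 it accounts for 100/100.5, so the two small weights barely register.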

The advantage of the L2 norm is that it is easier to get the derivative of the regularization term, so it can be used in gradient descent formulas more easily. Also, because no slope becomes zero and you do not lose any information, it may give you better performance if outliers are not an issue.
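A small sketch of that derivative in use: the derivative of lambda times m squared is 2 times lambda times m, so each gradient descent step on the penalty removes a fixed fraction of the slope. The function names, learning rate, and starting slope are assumptions, and the MSE part of the cost is deliberately ignored to isolate the penalty.

```python
def ridge_penalty_gradient(m, lam):
    """Derivative of lam * m**2 with respect to m."""
    return 2.0 * lam * m


def shrink(m, lam, learning_rate, steps):
    """Run gradient descent on the L2 penalty term alone (MSE ignored)."""
    for _ in range(steps):
        m -= learning_rate * ridge_penalty_gradient(m, lam)
    return m


# Hypothetical starting slope of 3.0; each step multiplies it by roughly 0.8.
m_final = shrink(3.0, lam=1.0, learning_rate=0.1, steps=50)
```

After 50 steps the slope is tiny but still positive. The update is proportional to the slope itself, so it keeps getting smaller without landing on zero.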


Both L1 and L2 regularization have advantages and disadvantages. Depending on the project, you can choose your type of regularization. Or, you can try both of them to see which one works better.
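Trying things out usually means comparing error on data the model did not train on. Here is a minimal sketch for choosing lambda in a one-feature Ridge regression, using the closed-form slope for a model with no intercept. The data, the candidate lambda values, and the function names are all made up for illustration, and the same comparison could be repeated with an L1 fit.

```python
def fit_ridge_1d(xs, ys, lam):
    """Slope minimizing MSE + lam * m**2 for one feature and no intercept."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


def mse_1d(m, xs, ys):
    """Mean squared error of the one-feature model y = m * x."""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training and validation data.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.3, 3.7, 6.4, 7.6]
x_val, y_val = [5.0, 6.0], [10.1, 11.8]

# Hypothetical candidate values for lambda.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]

# Fit on the training data, score on the validation data.
scores = {
    lam: mse_1d(fit_ridge_1d(x_train, y_train, lam), x_val, y_val)
    for lam in candidates
}
best_lam = min(scores, key=scores.get)
```

The lambda with the lowest validation error is kept. A larger lambda always gives a smaller slope here, and the validation score shows when that shrinkage starts to hurt.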


