## Stochastic Gradient Descent: Explanation and Complete Implementation from Scratch

Stochastic gradient descent is a widely used approach in machine learning and deep learning. This article explains stochastic gradient descent using a single perceptron, using the famous iris dataset. I am assuming that you already know the basics of gradient descent. If you need a refresher, please check out this linear regression tutorial which explains gradient descent with a simple machine learning problem.

What is a Stochastic Gradient Descent?

Before diving into the stochastic gradient descent, let’s have an overview of regular gradient descent. Gradient descent is an iterative algorithm. Let’s put it in a simple example. As I mentioned, I will use a single perceptron:

Here is the simple linear formula:

Y = WX+ W0

Here W0 is the bias term. If there are multiple features:

Y = W1X1 + W2X2 + … WnXn + W0

Using this formula let’s check step by step, how to perform a gradient descent algorithm:

1. First, randomly initialize the Ws.

2. Using this random Ws calculate the predicted output Y_hat.

3. Take the Root mean square error of the Y and Y_hat.

4. Calculate the updated Ws using the formula: W= W— stepSize*derivative of the RMS, where stepSize needs to be chosen.

5. Keep repeating steps 2 to 4 till the convergence.

In this process, we use the whole dataset in each iteration. If you have only a hundred data points, it will be ok. But if you have 10,000 or 100, 000 data points and you have to run a 200 iteration for convergence, computation becomes slow and costly. To solve that problem instead of using all 100, 000 data, one random data point can be chosen for each iteration. Computation will be really fast and easy. This is called stochastic gradient descent.

Instead of choosing just one data point, we can also choose a mini batch of data points such as 10/15 or 20 data points. The example I will show in a bit will use 12 data points in each iteration.

## Data Preparation

First load the iris dataset from sklearn library:

```from sklearn.datasets import load_iris

This iris dataset comes in a dictionary format. The features are in the key ‘data’ and the target is in a key ‘target’. Here are all the keys:

`list(iris.keys())`

Output:

```['data',
'target',
'frame',
'target_names',
'DESCR',
'feature_names',
'filename']```

We need the first two keys: data and target. I like to work with DataFrame. So, I will convert the data to a DataFrame:

```import pandas as pd
ir = pd.DataFrame(iris['data'])
ir['Species'] = pd.DataFrame(iris['target'])
ir```

Output:

In the species column, there are numbers only. But those numbers represent the species of the iris flower. Target_names key in the iris dataset shows them.

`ir['Species'] = ir['Species'].replace({0: "setosa", 1: "versicolor", 2: 'virginica'})`

I just replaced the numbers in the species column with the real species names. There are three different species.

My goal is to show a demonstration of stochastic gradient descent. So, I want to work on a simple binary classification. So, I will make the ‘setosa’ as the positive class and the rest of the species as the negative class. That means. this classifier will tell us if a flower is ‘setosa’ or not. Here we need to change the ‘setosa’ to 1 and the rest of the species to 0.

```for i in range(len(ir['Species'])):
if ir['Species'][i] == 'setosa':
ir['Species'][i] = 1
else:
ir['Species'][i] = 0
ir['Species']```

Output:

```0      1
1      1
2      1
3      1
4      1
..
145    0
146    0
147    0
148    0
149    0
Name: Species, Length: 150, dtype: object```

The dataset is ready! Now the fun begins.

## Develop the Classifier Using Stochastic Gradient Descent Approach

First, we will separate the dataset as training set and test set using the train_test_split function in sklearn library. Before that, the features and the target needs to be separated.

`from sklearn.model_selection import train_test_split`
```ir_features = ir.drop(columns = 'Species')
ir_label = ir['Species']```
```x_train, x_test, y_train, y_test = train_test_split(
ir_features, ir_label,
test_size = 0.2,
random_state = 10
)```

Now, we have training and test set separately. As I explained before, we will randomly select a few data points for each iteration. I will combine the x_train and y_train because we will need the y_train as well for training,

```x_train['Species'] = y_train
df = x_train```

Below is the function to randomly select 12 data points from the training set for each iteration. In the original iris dataset, there were 50 ‘setosa’ and 100 other species. That means there are 50 positive class and 100 negative class data. The sample should be accorning to that ratio. So, let’s take 4 data from the positive class and 8 data from the negative class.

```def stratified_spl(df):
df1 = df[df['Species'] == 1]
df0 = df[df['Species'] == 0]
df1_spl = df1.sample(n=4)
df0_spl = df0.sample(n=8)
return pd.concat([df1_spl, df0_spl])```

I will use the sigmoid activation function here:

```def hypothesis(X, w):
z = np.dot(w, X.T)
return 1/(1+np.exp(-(z)))```

The next function will separate the features and target of the mini training set of 12 data again for the training purpose.

As you can see in the linear formula above, we need a bias W0. An extra feature of 1s is added as a bias term. We will improve the bias term as well in each iteration:

```def xy(df):
df_features = df.drop(columns = 'Species')
df_label = df['Species']
df_features['00'] = [1]*12
return df_features, df_label```

Let’s define the error function to calculate the Mean Square Error(MSE). This function will calculate the predicted target first and then the MSE:

```def error(X, y, w):
n = len(X)
yp = hypothesis(X, w)
return np.sum((yp-y)**2)/n```

All the functions are ready. Now gradient descent function. In each iteration,

1. It will sample 12 data points using the stratified_spl function defined before.
2. Then split the features and target as X and y.
3. Calculate the predicted y using Ws and Xs.
4. Update the Ws using the gradient descent formula.

This is common to use a log to regularize the error term in a classification problem. But I will use MSE here. Because I guess more people are familiar with the MSE. This dataset is very easy to work with and usign MSE worked for this project.

Feel free to try with a log on the error term. Here is an example.

We will collect the errors in each iteration and the Ws.

`def grad_des(df, w, alpha, epoch):    j = []    w1 = []    w1.append(w)    for i in range(epoch):        d = stratified_spl(df)        X, y = xy(d)        n= len(X)        yp = hypothesis(X, w)                for i in range(4):            w[i] -= (alpha/n) * np.sum(-2*X[i]*(y-yp))        w[4] -= (alpha/n) *np.sum(-2*(y-yp))        w1.append(list(w))        j.append(error(X, y, w))    return j, w1`

The classifier is done! Now, it needs to be tested.

## Testing the Classifier

I will randomly initialize Ws. There are total of five features including the bias. So, for each feature or X, I need to initialize a W as you saw in the linear formula in the beginning.

```import numpy as np
w = np.random.rand(5)
w```

Output:

`array([0.05837806, 0.91017305, 0.71097702, 0.91990355, 0.71139191])`

Now, use the gradient descent function using the step size (alpha) as 0.01 for 100 iterations:

`j, w1 = grad_des(x_train, w, 0.01, 100)`

This should give us 100 w1 for 100 iterations. Using this 100 w1s, we can calculate 100 MSEs to observe the change in MSE in each iteration. I need a function that should calculate the predicted y for each w1 and then error.

```def err_test(X, y, w):
er = []
for i in range(len(w1)):
er.append(error(X, y, w[i]))
return er```

It is good to see a plot of MSE in each iteration.

Here is the plot function:

```def plot(X, y, w):
error = err_test(X, y, w)
return plt.scatter(range(len(error)), error)```

Remember we added the target in the x_train. So, separate the target from x_train and add the bias term:

```X = x_train.drop(columns = 'Species')
X['00'] = [1]*len(X)```

Now, plot the MSE for the training set:

```import matplotlib.pyplot as plt
plot(X, y_train, w1)```

Let’s add a bias term in the test set as well.

`X_t['00'] = [1]*len(x_test)`

Now, plot the MSE for the test data as well:

`plot(X_t, y_test, w1)`

One unusual thing that happened in this data is that the first MSE is too low. That does not happen usually. Normally, the first MSE is too high and it keeps going down gradually as the plot shows after the first MSE. I couldn’t find the reason for that.

Performance evaluation is incomplete without looking at the accuracy. Let’s define a function for accuracy:

```def accuracy(X, y, w):
yp = hypothesis(X, w)
for i in range(len(yp)):
if yp[i] >=0.5:
yp[i] = 1
else:
yp[i] = 0
return sum(yp == y)/len(y)```

Now, we want to see, how accuracy changes with each iteration:

```def accuracy_series(X, y, w1):
acc = []
for i in range(len(w1)):
acc.append(accuracy(X, y, w1[i]))
return acc```

Using this function to see the accuracy series for the training set:

`np.array(accuracy_series(X, y_train, w1))`

Output:

```array([0.975, 0.34166667, 0.34166667, 0.34166667, 0.34166667,
0.34166667, 0.34166667, 0.34166667, 0.33333333, 0.23333333,
0.34166667, 0.5       , 0.6       , 0.63333333, 0.64166667,
0.65833333, 0.65833333, 0.65833333, 0.65833333, 0.65833333,
0.65833333, 0.65833333, 0.65833333, 0.65833333, 0.65833333,
0.65833333, 0.65833333, 0.65833333, 0.65833333, 0.65833333,
0.65833333, 0.65833333, 0.65833333, 0.65833333, 0.65833333,
0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666667,
0.66666667, 0.675     , 0.69166667, 0.69166667, 0.69166667,
0.69166667, 0.71666667, 0.74166667, 0.75      , 0.76666667,
0.76666667, 0.78333333, 0.78333333, 0.83333333, 0.83333333,
0.83333333, 0.83333333, 0.85833333, 0.86666667, 0.83333333,
0.85      , 0.88333333, 0.88333333, 0.88333333, 0.90833333,
0.90833333, 0.90833333, 0.91666667, 0.925     , 0.925     ,
0.925     , 0.925     , 0.925     , 0.94166667, 0.94166667,
0.94166667, 0.95833333, 0.95833333, 0.96666667, 0.98333333,
0.98333333, 0.98333333, 0.98333333, 0.98333333, 0.98333333,
0.98333333, 0.99166667, 0.99166667, 0.99166667, 0.99166667,
0.99166667, 0.99166667, 0.99166667, 0.99166667, 0.99166667,
0.99166667, 0.99166667, 0.99166667, 0.99166667, 0.99166667,
0.99166667])```

Again. it is unusual to get 99% accuracy with the first w. Because the first w is the randomly initiated w. That does not happen in real-life projects.

Now, the accuracy series for the test set:

`np.array(accuracy_series(X_t, y_test, w1))`

Output:

```array([.93333333        , 0.3       , 0.3       , 0.3       , 0.3       ,
0.3       , 0.3       , 0.3       , 0.3       , 0.16666667,
0.33333333, 0.53333333, 0.6       , 0.63333333, 0.63333333,
0.7       , 0.7       , 0.7       , 0.7       , 0.7       ,
0.7       , 0.7       , 0.7       , 0.7       , 0.7       ,
0.7       , 0.7       , 0.7       , 0.7       , 0.7       ,
0.7       , 0.7       , 0.7       , 0.7       , 0.7       ,
0.7       , 0.7       , 0.7       , 0.7       , 0.7       ,
0.7       , 0.7       , 0.7       , 0.7       , 0.7       ,
0.7       , 0.73333333, 0.73333333, 0.73333333, 0.83333333,
0.83333333, 0.83333333, 0.83333333, 0.83333333, 0.86666667,
0.86666667, 0.86666667, 0.86666667, 0.86666667, 0.86666667,
0.86666667, 0.86666667, 0.86666667, 0.86666667, 0.9       ,
0.9       , 0.9       , 0.9       , 0.93333333, 0.93333333,
0.93333333, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
0.96666667, 1.        , 1.        , 1.        , 1.        ,
1.        , 1.        , 1.        , 1.        , 1.        ,
1.        ])```

You can see the accuracy become 100% at the end.

## Conclusion

My goal was to demonstrate the stochastic gradient descent implementation. I hope that was successful. If you are having a hard time understanding, my suggestion is to run the code yourself. It will be clearer. And you can see, why stochastic gradient descent is so popular. Because instead of using all 150 data from the dataset I used 12 data in each iteration. This same technique can be used when there are 100, 000 data points in a dataset. That should save a lot of time and computation cost.