Logistic Regression With Python and Scikit-Learn

Logistic regression is one of the most widely used classification algorithms. In one of my previous blogs, I talked about the definition, use and types of logistic regression. In this article I want to focus more about its functional side. First, the idea of cost function and gradient descent and implementation of the algorithm with python will be presented. At the end, same model will be implemented with scikit-learn library.

Here is the link for my previous article on Logistic Regression:

Cost Function

The definition of cost function is same as the cost function of linear regression. Does that mean, Cost function of linear regression and logistic regression are exactly the same? Not really. Because The hypothesis is different. Just as a reminder hypothesis for logistic regression is shown below. This hypothesis equation is called sigmoid function or logistic function.

It creates another problem. Because sigmoid function is not linear, so the cost function representation will be non-convex like this picture. It will not just come to a global convergence, instead we will get several local minima like this. 

To solve this problem, this equation is used to get the cost function in logistic regression.

Gradient descent is the same linear. Repeat until convergence.

Let’s dive into most exciting part of it, the code implementation of logistic regression.

Logistic Regression With Python

First let’s define the hypothesis and Gradient Descent

def hypothesis(theta, X, n):
    h = np.ones((X.shape[0],1))
    theta = theta.reshape(1,n+1)
    for i in range(0,X.shape[0]):
        h[i] = 1 / (1 + np.exp(-float(np.matmul(theta, X[i]))))
    h = h.reshape(X.shape[0])
    return h
Now, gradient descent using sigmoid function and keep repeating till convergence
def GD(theta, alpha, num_iters, h, X, y, n):
    theta_history = np.ones((num_iters,n+1))
    cost = np.ones(num_iters)
    for i in range(0,num_iters):
        theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
        for j in range(1,n+1):
        theta_history[i] = theta
        h = hypothesis(theta, X, n)
        cost[i]=(-1/X.shape[0])*sum(y*np.log(h)+(1-y)*np.log(1 - h))
    theta = theta.reshape(1,n+1)
    return theta, theta_history, cost
Now using this gradient descent and hypothesis get the parameters theta.
def logistic_regression(X, y, alpha, num_iters):
    n = X.shape[1]
    one_column = np.ones((X.shape[0],1))
    X = np.concatenate((one_column, X), axis = 1)
    # initializing the parameter vector...
    theta = np.zeros(n+1)
    # hypothesis calculation....
    h = hypothesis(theta, X, n)
    # returning the optimized parameters by Gradient Descent...
    theta,theta_history,cost = GD(theta,alpha,num_iters,h,X,y,n)
    return theta, theta_history, cost

The above mentioned functions have been taken from this article written by Navoneel Chakrabarty Now we will use these functions on a dataset and see how we can predict the labels of a test dataset.  

First let’s import the dataset in a ipython notebook. The dataset is available in my github repository. Please click here for the dataset.

df = pd.read_table('marks.txt', delimiter = ',')
In this dataset X1 and X2 are the features and y is the label. Though this dataset is presented them as X1 and X2, we can assume that they are the marks of students of their two test. In that case, y may be the admitted or not admitted label. Let’s choose 1 for admitted label and 0 for not admitted label.  
X = df[['X1','X2']]
y = df['y']
Import train_test_split from scikit-learn library to split the dataset as training set and test set. So, we can use the training set to train the model and test set to test, how our model is performing. 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Separate the test results according to passing or failing label. x3 is the not admitted student’s grades and x4 is the admitted student’s grade. 
x3 = X[y == 0]
x4 = X[y == 1]
Using matplotlib library plotting the admitted students grades in green and not admitted student’s grades in red.
import matplotlib.pyplot as plt
plt.scatter(x3['X1'], x3['X2'], c = 'r', label = 'Not Admitted')
plt.scatter(x4['X1'], x4['X2'], c = 'green', label = 'Admitted')
plt.xlabel("Marks obtained in 1st Exam")
plt.ylabel("Marks obtained in 2nd Exam")
Graph shows a clear distinction between admitted and not admitted student’s grades. 
This add_one_column function adds an one column to the features set. 
def add_one_column(X):
    one_column = np.ones((X.shape[0],1))
    return np.concatenate((one_column, X), axis = 1)
Use this predict function to predict the accuracy score of our model. This function takes X, y and a probability threshold value ‘thr’. If the hypothesis is more than or equal to this threshold value, the predicted label will be 1. Otherwise predicted label will be 0. 
def predict(X, y, thr = 0.5):
    X_new = add_one_column(X)
    n = X.shape[1]
    predicted_classes = (hypothesis(theta, X_new, n) >= thr).astype(int)
    #predicted_classes = predicted_classes.flatten()
    accuracy = np.mean(predicted_classes == y)
    return accuracy * 100        
First check the accuracy score on our training set. 
predict(X_train, y_train)
Accuracy score is 61.3 which is not too good.
Check the accuracy on the test set now. 
predict(X_test, y_test)
Accuracy score is not good either. 
What to do now? My idea was  just putting an average value like 0.5 as a probability threshold wasn’t a good idea. We probably should find an optimum threshold.
for i in np.arange(0, 1, 0.05):
print ("Training Set Accuracy for : " + str(i) + " is " + str(predict(X_train, y_train, thr = i)))
print ("Test Set Accuracy: " + str(i) + " is " + str(predict(X_test, y_test, thr = i )))
Here we get this set of accuracy score for different probability threshold values. 
for i in np.arange(0, 1, 0.05):
    print ("Training Set Accuracy for " + str(i) + " is: " +  str(predict(X_train, y_train, thr = i)))
    print ("Test Set Accuracy for " + str(i) + " is: " +  str(predict(X_test, y_test, thr = i )))
Here is the accuracy score for different probability threshold. Originally here the output was 20 sets of training and test set accuracy scores for 20 different probability threshold. But I am only showing from 0.5 to 0.75. Because our best accuracy score lies in these range.
Training Set Accuracy for 0.5 is: 61.33333333333333
Test Set Accuracy for 0.5 is: 56.00000000000001
Training Set Accuracy for 0.55 is: 68.0
Test Set Accuracy for 0.55 is: 56.00000000000001
Training Set Accuracy for 0.6000000000000001 is: 80.0
Test Set Accuracy for 0.6000000000000001 is: 68.0
Training Set Accuracy for 0.65 is: 82.66666666666667
Test Set Accuracy for 0.65 is: 92.0
Training Set Accuracy for 0.7000000000000001 is: 77.33333333333333
Test Set Accuracy for 0.7000000000000001 is: 68.0
Training Set Accuracy for 0.75 is: 56.00000000000001
Test Set Accuracy for 0.75 is: 48.0

From all the results above, accuracy score is the best in probability threshold 0.65. 

Logistic Regression with Optimized Cost Function

It is a good idea to start with writing pure python code step by step to learn the basics and know what runs behind the scene. Now let’s see a little more efficient and faster way of implementating the same same algorithm. 

Let’s start with defining the hypothesis. Here I am calling it the probability. 

def sigmoid(x): 
return 1 / (1 + np.exp(-x))

def hypothesis(theta, x):
return sigmoid(np.dot(x, theta))
Next compute the cost function for all the training samples.
def cost_function(theta, x, y):
    m = x.shape[0]
    total_cost = -(1 / m) * np.sum(
    y*np.log(hypothesis(theta, x)) + (1 - y) * np.log(
    1 - hypothesis(theta, x)))
    return total_cost
This is the gradient descent:
def gradient(theta, x, y):
    m = x.shape[0]
    return (1/m) * np.dot(x.T, sigmoid(net_input(theta, x)) - y)
Here I am using fmin_tnc function from scipy library to find the optimized parameters.
First, adding a one column to the features column. 
X = np.c_[np.ones((X.shape[0], 1)), X]
y = y[:, np.newaxis]
We need to initialize the theta as a column of zeros. It can be any other values but I choose to initialize with zero.
theta = np.zeros((X.shape[1], 1))
from scipy.optimize import fmin_tnc
def fit(x, y, theta):
    opt_weights = fmin_tnc(func=cost_function, x0=theta,
                          fprime=gradient, args=(x,y.flatten()))
    return opt_weights[0]
parameters = fit(X, y, theta)
These are the parameters (theta values) we get from the fmin_tnc function:
array([-25.16131863, 0.20623159, 0.20147149])
 Now we will use them to predict the admitted or not admitted labels. Then we will see how accurate our prediction is. Here we are using the probability threshold as 0.65 because we found that to be optimum from our first try. 
def predict(x):
    theta = parameters[:, np.newaxis]
    return hypothesis(theta, x)

def accuracy(x, actual_classes, probab_threshold = 0.65):
    predicted_classes = (predict(x) >= probab_threshold).astype(int)
    predicted_classes = predicted_classes.flatten()
    accuracy = np.mean(predicted_classes == actual_classes)
    return accuracy * 100
accuracy(add_one_column(X_train), y_train)
Our accuracy score come out to be 89.33% on the training set.
accuracy(add_one_column(X_test), y_test)
On test dataset this shows 84% accuracy, which is very good.

Logistic Regression Using scikit-learn

Now I want to use the logistic regression model that is available in scikit-learn library. It’s very useful to have a library like that. 

We need to start with importing Logistic Regression model from scikit-learn library. Then fit our training data in the model. Now, we can use this trained model to predict the admitted or not admitted label and finally get the accuracy score.

from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score 
model = LogisticRegression()
model.fit(X_train, y_train)
predicted_classes = model.predict(X_train)
accuracy = accuracy_score(y_train,predicted_classes)
parameters = model.coef_
Find out our parameters of the model and the accuracy score on the training data
parameters, accuracy
Here is the result. First array is the parameters and the accuracy score is .8133 or 81.33%.
(array([[0.0330462 , 0.02631392]]), 0.8133)
model_fit = model.fit(X_train, y_train)
The model model_fit above fit the training data in the model
model_fit.score(X_test, y_test)
The accuracy score on the test set is 88%.  Sorry we didn’t check the accuracy score on the training set yet. Let’s do it now.
model_fit.score(X_train, y_train)
It’s 81.33%. Both training and test set score is pretty good! 

In conclusion, my intention was not to compare all the different way and draw a conclusion on which one is good. I just wanted to explore and share the information.

Please follow me on linkedin, Instagram and twitter to stay connected.  

Leave a Reply

Close Menu