Logistic Regression From Scratch Using a Real Dataset

Logistic regression has been a popular method since the last century. It establishes the relationship between a categorical variable and one or more independent variables. In machine learning, this relationship is used to predict the outcome of a categorical variable. It is widely used in many fields, such as medicine, trading and business, and technology. This article explains the process of developing a binary classification algorithm and implements it on a medical dataset.

Problem Statement

In this article, a logistic regression algorithm will be developed that should predict a categorical variable. Ultimately, it will return a 0 or 1.

Important Equations

At the core of logistic regression is the sigmoid function, which returns a value between 0 and 1. Logistic regression uses the sigmoid function to predict the output. Here is the sigmoid activation function:
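In LaTeX notation, writing h for the predicted output (the symbol h is a naming choice made here, matching the `hypothesis` function defined later in the article):

```latex
h_\theta(X) = \frac{1}{1 + e^{-z}}
```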

z is the input features multiplied by a randomly initialized term theta.
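Written out to match the `np.dot(theta, X.T)` call used later in the code, with i indexing the rows and j indexing the features (the index names are assumptions made for this sketch):

```latex
z = \theta X^{T}, \qquad z^{(i)} = \sum_{j} \theta_j \, x_j^{(i)}
```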

Here, X is the input features and theta is the set of randomly initialized values that will be updated by this algorithm. Generally, we add a bias term as well.

Another important term is the cost function. The cost function measures how far the predicted values are from the original values. Here is the cost function expression:
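Reconstructed from the `cost` function implemented later in the article, with m the number of rows and h the sigmoid output (the per-row index i is notation assumed here):

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]
```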

Then, we need to update our randomly initialized theta values using the following equation:
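Reconstructed from the `gradient_descent` function implemented later in the article, where each theta value is updated once per epoch (the indices i for rows and j for features are notation assumed here):

```latex
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```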

Here, alpha is the learning rate.

It is time to use all the equations above to develop the algorithm.

Model Development

Step 1: Develop the hypothesis.

The hypothesis is simply the implementation of the sigmoid function.

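import numpy as np
def hypothesis(X, theta):
    # z is the linear combination of the input features and the theta values
    z = np.dot(theta, X.T)
    # Sigmoid of z, shifted down slightly so the output never equals exactly 1
    return 1/(1+np.exp(-(z))) - 0.0000001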

I subtracted 0.0000001 from the output here because of the log(1 - h) term in the cost function, where h is the output of the hypothesis.

If the hypothesis returns exactly 1, this term becomes the log of zero. To mitigate that, I subtract this very small number at the end.
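The subtraction guards against an output of exactly 1, but an output very close to 0 would become slightly negative after the subtraction and break the other log term. One common alternative, sketched here as an assumption and not as the approach used in this article, is to clip the sigmoid output into a small open interval. The names `hypothesis_clipped`, `eps`, `X_demo`, and `theta_demo` are made up for this illustration.

```python
import numpy as np

def hypothesis_clipped(X, theta, eps=1e-7):
    """Sigmoid of X @ theta, clipped to [eps, 1 - eps] so both logs stay finite."""
    z = np.dot(np.asarray(X, dtype=float), np.asarray(theta, dtype=float))
    h = 1.0 / (1.0 + np.exp(-z))
    return np.clip(h, eps, 1.0 - eps)

# Made-up inputs chosen so the raw sigmoid is (numerically) 1 and almost 0
X_demo = np.array([[1.0, 50.0], [1.0, -50.0]])
theta_demo = [0.0, 1.0]
h_demo = hypothesis_clipped(X_demo, theta_demo)
```

With clipping, both `np.log(h_demo)` and `np.log(1 - h_demo)` are finite for these extreme rows.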

Step 2: Determine the cost function.

def cost(X, y, theta):
    # y1 holds the predicted probabilities for every row
    y1 = hypothesis(X, theta)
    # Average log loss over all rows
    return -(1/len(X)) * np.sum(y*np.log(y1) + (1-y)*np.log(1-y1))

This is just a straightforward implementation of the cost function equation above.
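A quick way to sanity-check a cost implementation: when every theta value is zero, the sigmoid returns 0.5 for every row, so the cost equals log 2 (about 0.693) whatever the labels are. The sketch below uses a plain sigmoid without the small offset from Step 1, so the article's `cost` function will give a very slightly different number. The names `sigmoid`, `log_loss`, and the demo arrays are made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, theta):
    # Mean of the per-row log loss, same formula as the cost function
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Made-up data: first column is the bias column of ones
X_demo = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y_demo = np.array([1, 0, 1])
theta_zero = np.zeros(2)

cost_at_zero = log_loss(X_demo, y_demo, theta_zero)
```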

Step 3: Update the theta values.

Theta values need to keep updating until the cost function reaches its minimum. We should get our final theta values and the cost of each iteration as output.

def gradient_descent(X, y, theta, alpha, epochs):
    m = len(X)
    # Record the cost before any update
    J = [cost(X, y, theta)]
    for _ in range(epochs):
        # Predictions are computed once per epoch, so all theta values update simultaneously
        h = hypothesis(X, theta)
        for j in range(len(X.columns)):
            theta[j] -= (alpha/m) * np.sum((h-y)*X.iloc[:, j])
        J.append(cost(X, y, theta))
    return J, theta
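The per-column loop above can also be written as a single matrix product. The following is a minimal sketch of that idea on NumPy arrays, not the implementation used in this article. The function name `gradient_descent_vectorized`, the clipping constant, and the toy data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_vectorized(X, y, theta, alpha, epochs):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.array(theta, dtype=float)  # copy, so the caller's list is untouched
    m = len(X)
    costs = []
    for _ in range(epochs):
        h = sigmoid(X @ theta)
        # Same update as the loop version, applied to all theta values at once
        theta -= (alpha / m) * (X.T @ (h - y))
        # Clip before taking logs so the recorded cost stays finite
        h = np.clip(sigmoid(X @ theta), 1e-7, 1 - 1e-7)
        costs.append(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))
    return costs, theta

# Made-up toy data: a bias column of ones plus one feature
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])
costs, theta_fit = gradient_descent_vectorized(X_toy, y_toy, [0.0, 0.0], alpha=0.1, epochs=500)
```

Unlike the loop version, this sketch records the cost only after each update and does not modify the `theta` passed in.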

Step 4: Calculate the final prediction and accuracy.

Use the theta values that come out of the ‘gradient_descent’ function and calculate the final prediction using the sigmoid function. Then, calculate the accuracy.

def predict(X, y, theta, alpha, epochs):
    # Train first, then classify with the fitted theta values
    J, th = gradient_descent(X, y, theta, alpha, epochs)
    h = hypothesis(X, th)
    # Threshold the probabilities at 0.5 to get class labels
    for i in range(len(h)):
        h[i] = 1 if h[i] >= 0.5 else 0
    y = list(y)
    acc = np.sum([y[i] == h[i] for i in range(len(y))])/len(y)
    return J, acc

The final output is the list of costs in each epoch and the accuracy. Let’s implement this model to solve a real problem.
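As a variation on Step 4, the thresholding and the accuracy calculation can be separated from training and written without explicit loops. This is a sketch under the assumption that theta has already been fitted. The names `predict_labels`, `accuracy`, and the demo values are made up for this illustration.

```python
import numpy as np

def predict_labels(X, theta, threshold=0.5):
    # Sigmoid of the linear combination, then threshold to get 0/1 labels
    z = np.asarray(X, dtype=float) @ np.asarray(theta, dtype=float)
    return (1.0 / (1.0 + np.exp(-z)) >= threshold).astype(int)

def accuracy(y_true, y_pred):
    # Fraction of rows where the predicted label matches the true label
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Made-up example: z evaluates to -2, 0.5 and 3 for the three rows
X_demo = np.array([[1.0, -2.0], [1.0, 0.5], [1.0, 3.0]])
theta_demo = [0.0, 1.0]
labels = predict_labels(X_demo, theta_demo)
acc_demo = accuracy([0, 1, 0], labels)
```

Here the first two rows are classified correctly and the third is not, so `acc_demo` is two thirds.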

Data Preprocessing

I am using the ‘Heart.csv’ dataset from kaggle.com. You can download it from Kaggle if you want to work with it too. First, import the necessary packages and load the dataset.

import pandas as pd
import numpy as np
df = pd.read_csv('Heart.csv')

The dataset looks like this:

Top five rows of the Heart.csv dataset

There are a few categorical features in the dataset. We need to convert them to numerical data.

df["ChestPainx"]= df.ChestPain.replace({"typical": 1, "asymptomatic": 2, "nonanginal": 3, "nontypical": 4})
df["Thalx"] = df.Thal.replace({"fixed": 1, "normal":2, "reversable":3})
df["AHD"] = df.AHD.replace({"Yes": 1, "No":0})
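The same kind of encoding can be tried on a tiny frame to see what it produces. The sketch below uses `Series.map` with a dictionary instead of `replace`; this is a swapped-in alternative, not the call used in this article. With `map`, any category missing from the dictionary becomes NaN, which makes unmapped values easy to spot. The frame `toy` and the dictionary name `codes` are made up for this illustration.

```python
import pandas as pd

# Made-up frame with the four chest pain categories used above
toy = pd.DataFrame({"ChestPain": ["typical", "asymptomatic", "nonanginal", "nontypical"]})

codes = {"typical": 1, "asymptomatic": 2, "nonanginal": 3, "nontypical": 4}
toy["ChestPainx"] = toy["ChestPain"].map(codes)

# Count of categories that were not covered by the dictionary
unmapped = int(toy["ChestPainx"].isna().sum())
```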

Add one extra column for the bias. This should be a column of ones because any real number remains unchanged if multiplied by one.

df = pd.concat([pd.Series(1, index = df.index, name = '00'), df], axis=1)

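Define the input features and the output variable. The output is the AHD column that we want to predict. The input features are the remaining columns, without the row-index column and without the original string columns ChestPain and Thal, which were replaced by the numeric ChestPainx and Thalx. Note that the drop list below leaves AHD itself in X; add "AHD" to the list to keep the target out of the features, in which case the accuracy may differ from the figure reported below.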

X = df.drop(columns=["Unnamed: 0", "ChestPain", "Thal"])
y = df["AHD"]

Get the Accuracy Result

Finally, initialize the theta values in a list, predict the result, and calculate the accuracy. Here I am initializing every theta value to 0.5, but any other value can be used. As each feature should have a corresponding theta value, one theta value should be initialized for each feature in X, including the bias column.

theta = [0.5]*len(X.columns)
J, acc = predict(X, y, theta, 0.0001, 25000)

The final accuracy is 84.85%. I used 0.0001 as the learning rate and 25000 iterations, and settled on those values after running the algorithm a few times. Please check my GitHub link for this project provided below.

The ‘predict’ function also returns the list of costs in each iteration. In a good algorithm, the cost should keep going down with each iteration. Plot the cost for each iteration to visualize the trend.

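# The next line is a Jupyter magic; omit it when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize = (12, 8))
plt.scatter(range(0, len(J)), J)
plt.xlabel("Iteration")
plt.ylabel("Cost")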
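Besides plotting, the downward trend can be checked numerically. The following is a small sketch of such a check. The helper name `is_non_increasing`, the tolerance, and the two cost lists are made up for illustration and are not results from this project.

```python
import numpy as np

def is_non_increasing(costs, tol=1e-12):
    """Return True if no cost is larger than the one before it (within tol)."""
    diffs = np.diff(np.asarray(costs, dtype=float))
    return bool(np.all(diffs <= tol))

# Made-up cost histories, not taken from the Heart.csv run
good_run = [0.69, 0.55, 0.48, 0.45, 0.44]
bad_run = [0.69, 0.55, 0.61, 0.50]

good_ok = is_non_increasing(good_run)
bad_ok = is_non_increasing(bad_run)
```

A cost history that fails this check often points to a learning rate that is too large.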

The cost decreased very rapidly at the beginning, and then the rate of decrease slowed down. Here is the GitHub link for this project:


