Support Vector Machine Classification in Python -sklearn

The support vector machine is one of the oldest machine learning models and is still popular. I have written tutorials on more complex deep learning methods before, but I should cover some simple machine learning models as well.

Before diving into model development, I want to give a high-level introduction to the model and how it works.

For simplicity, suppose we are working on a binary classification problem. The support vector machine finds a hyperplane that best divides the dataset into two classes. But how do we find the best separator? In the image below, which line is the best separator?

The support vector machine finds the hyperplane that separates the classes with the maximum margin, as in this image:

The margin on each side of the hyperplane extends until it touches at least one point of each class. The points that the margin boundaries touch are called support vectors.

The hyperplane does not always separate the classes as cleanly as in the picture above. Even so, the goal of the model is to find the hyperplane that separates the classes as well as possible.
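
To make the idea of the margin and the support vectors concrete, here is a minimal sketch on a tiny made-up dataset (the six points below are an assumption for illustration, not part of the iris data). It assumes a linear kernel and a very large C, which approximates a hard-margin classifier:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (made up for illustration)
X_toy = np.array([[0, 0], [-1, 1], [-1, -1],   # class 0
                  [2, 0], [3, 1], [3, -1]])    # class 1
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a very large C approximates a hard-margin SVM
toy_clf = SVC(kernel="linear", C=1e6)
toy_clf.fit(X_toy, y_toy)

# For a linear SVM the margin width is 2 / ||w||
w = toy_clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)

print(toy_clf.support_vectors_)  # the points that the margin touches
print(margin_width)
```

On this toy data the two classes are closest at the points (0, 0) and (2, 0), so those are the points the margin touches, and the margin width comes out at about 2.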

Let’s work on an example. We will use the very popular “iris” dataset for this tutorial. Here we create a data frame using the iris dataset:

import pandas as pd 
df = pd.read_csv("iris.csv")
df.head(10)

The variety column will be the target variable for this tutorial, so this is a classification problem.

But the values in the ‘variety’ column are strings. Most machine learning models expect numeric data, so it is common to encode text labels as numbers. First, check how many unique values there are in the variety column:

df['variety'].unique()

Output:

array(['Setosa', 'Versicolor', 'Virginica'], dtype=object)

Replace these three values in the ‘variety’ column with 1, 2, and 3:

df['variety'] = df['variety'].replace({"Setosa": 1, "Versicolor": 2, "Virginica": 3})
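
As an alternative sketch of the same encoding step, pandas' `Series.map` can be used with the same dictionary. The small `df_demo` frame below is a hypothetical stand-in for the real iris.csv data. One difference worth knowing is that `map` leaves NaN for any label that is missing from the dictionary, which makes typos in the mapping easy to spot:

```python
import pandas as pd

# Hypothetical stand-in for the 'variety' column of iris.csv
df_demo = pd.DataFrame({"variety": ["Setosa", "Versicolor", "Virginica", "Setosa"]})

label_map = {"Setosa": 1, "Versicolor": 2, "Virginica": 3}
encoded = df_demo["variety"].map(label_map)

# map() leaves NaN for any label that is missing from the dictionary
unmapped = int(encoded.isna().sum())
encoded = encoded.astype(int)

print(encoded.tolist())
print(unmapped)
```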

I always check for null values before training a machine learning model, because null values are troublemakers in a machine learning model. The line of code below shows how many null values there are in each column of the DataFrame:

df.isna().sum()

Output:

sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64

Data preparation is done. Now define the training features and the target variable. Here, the goal is to predict the ‘variety’ using the other columns as features:

X = df.drop(columns="variety")
y = df['variety']

We do not use the whole dataset for training the model. Some data should be kept aside for testing the model as well. Here we use the train_test_split function to split the data for training and testing. It takes the X and y that we created above, test_size, which is the fraction of the data to be kept for testing, and random_state, which can be any integer of your choice and makes the split reproducible.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
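
To see what test_size and random_state do in practice, here is a small sketch. It assumes the copy of the iris data that ships with scikit-learn (`load_iris`) rather than the iris.csv file used above, so the column names and label values differ from the DataFrame in this tutorial:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

# Two splits with the same random_state
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

print(X_tr.shape, X_te.shape)          # rows kept for training and for testing
print(X_te.index.equals(X_te2.index))  # same random_state, same rows
```

With 150 rows and test_size=0.33, 50 rows go to the test set and 100 to the training set, and repeating the call with the same random_state selects the same rows.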

Finally, the model development! This is the easy part when you use the sklearn library.

First, import the support vector machine module from the sklearn library and create the classifier:

from sklearn import svm 
clf = svm.SVC()
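
Calling `svm.SVC()` with no arguments uses scikit-learn's default settings. As a quick sketch, `get_params()` shows what those defaults are. The values in the explicit version below reflect the defaults in recent scikit-learn releases, which is an assumption about your installed version:

```python
from sklearn import svm

# Inspect the settings used when no arguments are passed
clf_default = svm.SVC()
params = clf_default.get_params()
print(params["kernel"], params["C"], params["gamma"])

# The same classifier written out explicitly (recent scikit-learn defaults)
clf_explicit = svm.SVC(kernel="rbf", C=1.0, gamma="scale")
```

Writing the parameters explicitly can make it easier to see what to change later, for example when trying a different kernel.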

Now fit the classifier on the training data:

clf.fit(X_train, y_train)

The model training part is done. It’s time to evaluate whether the model is working for us. There are several ways to evaluate the performance of a model. Here I am using the accuracy score:

clf.score(X_test, y_test)

Output:

1.0

The accuracy score is 1.0 on the test data, which means the model predicted the ‘variety’ correctly for 100% of the test samples.
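
For a classifier, the score method returns the fraction of test samples that were predicted correctly. The sketch below computes that fraction by hand and compares it with score. It assumes scikit-learn's bundled iris data (`load_iris`) in place of iris.csv, so the exact number it prints may not match the output above:

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

model = svm.SVC().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Accuracy is the fraction of predictions that match the true labels
manual_accuracy = np.mean(y_pred == y_te)
print(manual_accuracy, model.score(X_te, y_te))
```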

There are several other ways to evaluate a model, which I have discussed in a different article. An accuracy score of 100% is a perfect result on this test set, but in most real-world projects that does not happen.
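
As a brief sketch of two other common checks, the code below builds a confusion matrix for the test set and runs 5-fold cross-validation. It again assumes scikit-learn's bundled iris data (`load_iris`) rather than iris.csv, and the choice of 5 folds is arbitrary:

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

model = svm.SVC().fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1, 2])
print(cm)

# Accuracy on 5 different train/test folds of the full dataset
cv_scores = cross_val_score(svm.SVC(), X_demo, y_demo, cv=5)
print(cv_scores.mean())
```

The confusion matrix shows which classes get mixed up with each other, and the cross-validation scores give a sense of how much the accuracy depends on one particular split.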
