Support Vector Machine for Regression in Python - sklearn

The support vector machine is one of the older machine learning models, and it is still popular. I have written about the Support Vector Machine classifier before, so I thought it was worth covering regression with a support vector machine as well.

We will use the housing dataset for this practice. Please feel free to download the dataset from this link:

Machine-Learning-Tutorials-Scikit-Learn/housing_data.csv at main · rashida048/Machine-Learning-Tutorials-Scikit-Learn (github.com)

Without further delay, let’s get right to it. First, make a DataFrame from the housing data:

import pandas as pd 
pd.set_option("display.max_columns", 30)
df = pd.read_csv("housing_data.csv")

These are the columns in the dataset:

df.columns

Output:

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15'],
dtype='object')

The data is clean to begin with, so we do not need to do much. It is still a good idea to at least check for null values, because SVR raises an error if the features contain them.

df.isna().sum()

Output:

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

So, each column has 0 null values. 
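For comparison, here is a minimal sketch of what the same check looks like when a value is actually missing. The tiny DataFrame `toy` and its values are made up for illustration, and dropping the incomplete rows is just one possible way to handle them, not something this dataset needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 'bedrooms'.
toy = pd.DataFrame({
    "bedrooms": [3, np.nan, 2],
    "price": [300000, 450000, 250000],
})

# isna().sum() counts the missing values per column.
print(toy.isna().sum())

# One simple option: drop every row that contains a missing value.
cleaned = toy.dropna()
print(len(cleaned))  # 2
```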

Now, we can define the training features and target variable.

In this tutorial we will try to predict the ‘price’ using the other variables. So, ‘price’ is the target variable here and the other variables can be the training features. We don’t need the ‘id’ column as a training feature, and we will leave out the ‘date’ column as well. There are ways to use date features in regression, but that’s for a different tutorial. So, the training features will be all the columns except id, date, and price.

X = df.drop(columns = ['price', 'id', 'date'])
y = df['price']

We shouldn’t use all the data for training. We need to keep some of it separate, so the model can be evaluated on data it has never seen.

We can use the train_test_split function for that. It takes the features and the target, plus test_size, which is the fraction of the data to keep aside for testing or evaluation (0.25 here means 25%). It also takes an optional random_state parameter, which can be any integer of your choice and makes the split reproducible.

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 1)
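To make the effect of test_size and random_state concrete, here is a small sketch on made-up data. The arrays `X_demo` and `y_demo` are random stand-ins, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# test_size=0.25 holds out 25% of the rows for testing.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
print(Xd_train.shape, Xd_test.shape)  # (75, 3) (25, 3)

# Using the same random_state again reproduces exactly the same split.
Xd_train_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
print(np.array_equal(Xd_train, Xd_train_again))  # True
```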

The next step is data scaling. It is not mandatory: the model will still run without it. But support vector machines are sensitive to the scale of the features, so scaling is strongly recommended. It puts all the features in a similar range so that no feature overpowers the others.

I will use StandardScaler from the sklearn library. Sometimes I use a min-max scaler as well.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

So, both the training and test features are now scaled.
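As a quick illustration of what StandardScaler does, here is a sketch with two made-up features on very different scales. The numbers are hypothetical. Note that the scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows, which is why the code above calls fit_transform on the training set and only transform on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: a large-scale feature (like square feet)
# next to a small-scale one (like a bedroom count).
train_demo = np.array([[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]])
test_demo = np.array([[2500.0, 5.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from the training rows
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```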

Import SVR from the sklearn library and fit it on the scaled training features and the target.

from sklearn.svm import SVR
svr = SVR().fit(X_train_scaler, y_train)

Model training is done. Now, let’s see how it performs. We will test the performance of the model on the test data. I used mean_absolute_error here, which is the average absolute difference between the actual and the predicted prices.

from sklearn.metrics import mean_absolute_error
y_pred = svr.predict(X_test_scaler)
mean_absolute_error(y_test, y_pred)

Output:

223820.56232200025
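To read this number, it helps to see how mean_absolute_error is computed. The sketch below uses three made-up prices and predictions (not values from the dataset) and checks the manual calculation against sklearn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices.
y_true_demo = np.array([300000.0, 450000.0, 500000.0])
y_pred_demo = np.array([320000.0, 400000.0, 510000.0])

# Absolute errors are 20000, 50000 and 10000; the MAE is their average.
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(manual_mae)
print(np.isclose(manual_mae, sklearn_mae))  # True
```

The result is in the same unit as the target, so an error of about 223,821 above means the predictions are off by that many price units on average.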

Let’s see if we can improve the model performance further. Before, I accepted the default parameters of the SVR class, which uses kernel='rbf' by default. This time I will pass the ‘kernel’ parameter.

from sklearn.svm import SVR
svr1 = SVR(kernel = 'linear').fit(X_train_scaler, y_train)

I used kernel='linear' here. Let’s see the mean_absolute_error again:

y_pred1 = svr1.predict(X_test_scaler)
mean_absolute_error(y_test, y_pred1)

Output:

202704.9873763359

Yes! The mean absolute error dropped from about 223,821 to about 202,705. Please feel free to look at the documentation here
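The same kind of comparison can be automated with a loop. The sketch below is only an illustration under some assumptions: it runs on synthetic data from make_regression rather than the housing data, and it also varies C, SVR's regularization parameter (default 1.0), which this tutorial did not tune. The values 1 and 100 are arbitrary picks, and the ranking on your own data may differ.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (assumption for this sketch).
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

# Scale the features exactly as before: fit on train, reuse on test.
scaler_syn = StandardScaler()
Xs_train_scaled = scaler_syn.fit_transform(Xs_train)
Xs_test_scaled = scaler_syn.transform(Xs_test)

# Try each kernel with a small and a large C, and record the test MAE.
results = {}
for kernel in ["rbf", "linear"]:
    for C in [1, 100]:
        model = SVR(kernel=kernel, C=C).fit(Xs_train_scaled, ys_train)
        y_hat = model.predict(Xs_test_scaled)
        results[(kernel, C)] = mean_absolute_error(ys_test, y_hat)

# Print the combinations from lowest to highest error.
for (kernel, C), mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"kernel={kernel:<6} C={C:<4} MAE={mae:.2f}")
```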

Conclusion

Support Vector Machine is a simple-to-use and capable machine learning model. It is always a good idea to try the simple models first, because they are easy to explain to stakeholders and save a lot of time if they work for you. Hope this tutorial was helpful.

The video version of Regression in Support Vector Machine is here:
