A Complete Sentiment Analysis Project Using Python’s Scikit-Learn

Sentiment analysis is one of the most important applications of Natural Language Processing. It differs from machine learning on numeric data because an algorithm cannot process text directly: the text first needs to be transformed into a numeric form. So, text data are vectorized before they get fed into the machine learning model. There are several methods of vectorization. This article will demonstrate sentiment analysis using two types of vectorizers and three machine learning models.

Data Preprocessing

I am using the Amazon Baby Products dataset from Kaggle for this project. Please feel free to download the dataset from Kaggle if you want to follow along.

The original dataset has three features: name (the name of the product), review (the customer’s review of the product), and rating (the customer’s rating of the product, ranging from 1 to 5). The review column will be the input column and the rating column will be used to derive the sentiment of each review. Here are some important data preprocessing steps:

  1. The dataset has about 183,500 rows. There are 1,147 null values, which I simply drop.
  2. As the dataset is pretty big, it takes a lot of time to run some machine learning algorithms on it. So, I used a 30% sample of the data for this project, which is still about 54,000 rows. The sample was representative.
  3. If the rating is 1 or 2, the review will be considered a bad or negative review. If the rating is 3, 4, or 5, the review will be considered a good or positive review. So, I added a new column named ‘sentiments’ to the dataset, with 1 for positive reviews and 0 for negative reviews.

I may have put a lot of code into one block. If it feels like too much at once, please break it down and run it in smaller pieces for better understanding.

Here is the code block that imports the dataset, takes a 30% representative sample, and adds the new column ‘sentiments’:

import pandas as pd
import numpy as np

df = pd.read_csv('amazon_baby.csv')
#Getting rid of null values
df = df.dropna()
#Taking a 30% representative sample
np.random.seed(34)
df1 = df.sample(frac = 0.3)
#Adding the sentiments column
df1['sentiments'] = df1.rating.apply(lambda x: 0 if x in [1, 2] else 1)
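One thing worth checking before modeling (my addition, not part of the original walkthrough): since ratings of 3 to 5 all map to 1, the classes are almost certainly imbalanced, with positive reviews dominating. A quick look:

#Check the class balance; positive reviews should dominate,
#which is why the true negative rate matters later on
print(df1['sentiments'].value_counts())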

Here is how the dataset looks now. These are the first five rows of data:

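The original post shows a screenshot at this point. If you are following along, you can reproduce the same view with one line:

#Display the first five rows of the sampled data
print(df1.head())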

Sentiment Analysis

Before starting the sentiment analysis, it is necessary to define the input features and the labels. Here there is only one feature: the ‘review’ column. The label will be the ‘sentiments’ column. The goal of this project is to train a model that can predict whether a review is positive or negative.

X = df1['review']
y = df1['sentiments']

First, I will use Count Vectorizer as the vectorizing method. This article will focus on how to apply the vectorizers, so I will not go into the details here. But please feel free to check out this article to learn more about Count Vectorizer if you are totally new to vectorizers.
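As a quick illustration before we start, here is a minimal sketch of what a count vectorizer actually produces, using a made-up three-document corpus (not the project data):

from sklearn.feature_extraction.text import CountVectorizer

#A tiny made-up corpus just to show the output format
docs = ["good product", "bad product", "really good"]
cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs)
#The learned vocabulary, e.g. ['bad' 'good' 'product' 'really']
#(use get_feature_names() on scikit-learn versions older than 1.0)
print(cv_demo.get_feature_names_out())
#A document-term matrix of raw word counts, one row per document
print(counts.toarray())

Each review becomes a row of word counts over the vocabulary; that is the numeric form the classifiers below consume.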

Count Vectorizer

I will use a count vectorizer to vectorize the text data in the review column (the training feature for this project) and then train three different classification models from scikit-learn. After that, I will evaluate each model on this dataset by finding the accuracy, the confusion matrix, the true positive rate, and the true negative rate. Here are the steps.

  1. Split the dataset into a training set and a testing set.
  2. Vectorize the input feature, that is, our review column (both training and testing data).
  3. Import the model from the scikit-learn library.
  4. Find the accuracy score.
  5. Find the true positive and true negative rates.

I will now repeat this same process for three different classifiers: Logistic Regression, Support Vector Machine, and K Nearest Neighbor. I will summarise the results towards the end of this article.

Logistic Regression

Here is the code block for logistic regression. Comments in the code mark each step.


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.5, random_state=24)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
#Vectorizing the text data
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
#Training the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(ctmTr, y_train)
#Accuracy score
lr_score = lr.score(X_test_dtm, y_test)
print("Results for Logistic Regression with CountVectorizer")
print(lr_score)
#Predicting the labels for test data
y_pred_lr = lr.predict(X_test_dtm)
from sklearn.metrics import confusion_matrix
#Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print(tn, fp, fn, tp)
#True positive and true negative rates
tpr_lr = round(tp/(tp + fn), 4)
tnr_lr = round(tn/(tn+fp), 4)
print(tpr_lr, tnr_lr)


As you can see, there are print statements for the accuracy; the true negative, false positive, false negative, and true positive counts; and the true positive and true negative rates.
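As a reminder of the bookkeeping here (a small sketch with made-up labels, not project data): for binary labels 0/1, scikit-learn lays the confusion matrix out as [[tn, fp], [fn, tp]], so ravel() yields the four counts in exactly the order unpacked above.

from sklearn.metrics import confusion_matrix

#Made-up labels just to show the ordering
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
#Prints [[1 1]
#        [0 2]]: one true negative, one false positive,
#zero false negatives, two true positives
print(confusion_matrix(y_true, y_pred))
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  #1 1 0 2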

Support Vector Machine

I will repeat the exact same process as before, this time with a support vector machine.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.5, random_state=123)
#Vectorizing the text data
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
#Training the model
from sklearn import svm
svcl = svm.SVC()
svcl.fit(ctmTr, y_train)
svcl_score = svcl.score(X_test_dtm, y_test)
print("Results for Support Vector Machine with CountVectorizer")
print(svcl_score)
y_pred_sv = svcl.predict(X_test_dtm)
#Confusion matrix
cm_sv = confusion_matrix(y_test, y_pred_sv)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_sv).ravel()
print(tn, fp, fn, tp)
tpr_sv = round(tp/(tp + fn), 4)
tnr_sv = round(tn/(tn+fp), 4)
print(tpr_sv, tnr_sv)

I should warn you that the support vector machine takes much longer to train than logistic regression.
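If the training time becomes impractical on a corpus this size, one option (my suggestion, not part of the original workflow) is scikit-learn’s LinearSVC, which is limited to a linear kernel but runs much faster on large sparse text data than svm.SVC:

from sklearn.svm import LinearSVC

#A much faster linear-kernel alternative for large sparse data;
#reuses the vectorized data from the block above
fast_svm = LinearSVC()
fast_svm.fit(ctmTr, y_train)
print(fast_svm.score(X_test_dtm, y_test))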

K Nearest Neighbor

I will run a KNN classifier and compute the same evaluation metrics as before. The code is almost the same.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.5, random_state=143)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(ctmTr, y_train)
knn_score = knn.score(X_test_dtm, y_test)
print("Results for KNN Classifier with CountVectorizer")
print(knn_score)
y_pred_knn = knn.predict(X_test_dtm)
#Confusion matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()
print(tn, fp, fn, tp)
tpr_knn = round(tp/(tp + fn), 4)
tnr_knn = round(tn/(tn+fp), 4)
print(tpr_knn, tnr_knn)

The KNN classifier takes less time than the Support Vector Machine classifier.

With that, the three classifiers are done for the count vectorizer method.

TFIDF Vectorizer

Next, I will use the TF-IDF vectorizer. This vectorizer is often preferred because, instead of raw counts, it weights each word by its term frequency offset by how common the word is across all documents, so very common words count for less. Please feel free to check this article to learn the details about the TF-IDF vectorizer.
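To make the difference concrete, here is the same made-up three-document corpus from the count vectorizer sketch above, run through the TF-IDF vectorizer (again, not project data):

from sklearn.feature_extraction.text import TfidfVectorizer

#Same toy corpus as before; 'good' and 'product' each appear in
#two of the three documents, so TF-IDF downweights them relative
#to the rarer words 'bad' and 'really'
docs = ["good product", "bad product", "really good"]
tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(docs)
print(tfidf_demo.get_feature_names_out())
#Fractional weights instead of raw counts
print(weights.toarray().round(2))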

I will follow exactly the same process I used for the count vectorizer; only the vectorizer will be different. But that’s not a problem: the scikit-learn library takes care of the calculations as usual.

Logistic Regression

Here is the complete code block for logistic regression again, this time with the TF-IDF vectorizer:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.5, random_state=45)
#tfidf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
#Training the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train_vec, y_train)
#Accuracy score
lr_score = lr.score(X_test_vec, y_test)
print("Results for Logistic Regression with tfidf")
print(lr_score)
y_pred_lr = lr.predict(X_test_vec)
#Confusion matrix
from sklearn.metrics import confusion_matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print(tn, fp, fn, tp)
tpr_lr = round(tp/(tp + fn), 4)
tnr_lr = round(tn/(tn+fp), 4)
print(tpr_lr, tnr_lr)

As you can see, you can reuse the code from before, except for the vectorizer part.

Support Vector Machine

It will also be the same process as the previous support vector machine run, except for the vectorizer.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.5, random_state=55)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
from sklearn import svm
#params = {'kernel':('linear', 'rbf'), 'C':[1, 10, 100]}
svcl = svm.SVC(kernel = 'rbf')
#clf_sv = GridSearchCV(svcl, params)
svcl.fit(X_train_vec, y_train)
svcl_score = svcl.score(X_test_vec, y_test)
print("Results for Support Vector Machine with tfidf")
print(svcl_score)
y_pred_sv = svcl.predict(X_test_vec)
#Confusion matrix
from sklearn.metrics import confusion_matrix
cm_sv = confusion_matrix(y_test, y_pred_sv)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_sv).ravel()
print(tn, fp, fn, tp)
tpr_sv = round(tp/(tp + fn), 4)
tnr_sv = round(tn/(tn+fp), 4)
print(tpr_sv, tnr_sv)

As before, it will take much longer than logistic regression, so it may require some patience.

K Nearest Neighbor

This is the last one.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.5, random_state=65)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_vec, y_train)
knn_score = knn.score(X_test_vec, y_test)
print("Results for KNN Classifier with tfidf")
print(knn_score)
y_pred_knn = knn.predict(X_test_vec)
#Confusion matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()
print(tn, fp, fn, tp)
tpr_knn = round(tp/(tp + fn), 4)
tnr_knn = round(tn/(tn+fp), 4)
print(tpr_knn, tnr_knn)

Results

Here I have summarised the results for all six code blocks above.

[Results table from the original post: accuracy, true positive rate, and true negative rate for each classifier and vectorizer]

Here are some key findings:

  1. Overall, the TF-IDF vectorizer gave slightly better results than the count vectorizer, and this held across all three classifiers.
  2. Logistic regression was the best of the three classifiers used for this project, considering overall accuracy, true positive rate, and true negative rate.
  3. The KNN classifier does not seem to be suitable for this project. Though its true positive rates look very good, its true negative rates look really poor.

Conclusion

I did this project as part of one of my classes and decided to share it. You may wonder why I split the dataset into training and testing sets anew every time I trained a model. This was a requirement for the class: the idea is that any single train-test split may be biased towards a certain classifier, so I had to re-split the dataset before each model.
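Outside of that class constraint, a more standard way to guard against split bias is cross-validation. Here is a minimal sketch (my addition, not part of the original assignment) using a scikit-learn Pipeline so the vectorizer is fit only on each training fold:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

#5-fold cross-validation averages accuracy over five different splits,
#so no single split can favor one classifier
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())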

Feel free to follow me on Twitter and like my Facebook page.


#NaturalLanguageProcessing #DataScience #MachineLearning #Python #ArtificialIntelligence #DataAnalytics
