Sentiment Analysis Using CountVectorizer: Scikit-Learn

Sentiment analysis is a common NLP task that data scientists perform in their jobs. In this article, I will solve a sentiment analysis task using the Scikit-Learn library. It should be pretty simple and easy.

Dataset and Task Overview

This dataset contains Amazon reviews of baby products. It has three columns: name, review, and rating. Reviews are strings and ratings are numbers from 1 to 5, where 1 is the worst and 5 is the best. Our job is to classify the reviews as positive or negative. Let’s have a look at the dataset.

Here we look at the first five entries just to examine the data. For that, I am using Pandas.

import pandas as pd

products = pd.read_csv('amazon_baby.csv')

products.head()

This is the output. 

name | review | rating
Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion … | 3
Planetwise Wipe Pouch | it came early and was not disappointed. i love… | 5
Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l… | 5
Stop Pacifier Sucking without tears with Thumb… | This is a product well worth the purchase. I … | 5
Stop Pacifier Sucking without tears with Thumb… | All of my kids have cried non-stop when I trie… | 5

Data Preprocessing

In real life, data scientists almost never get data that are perfectly clean and already prepared for machine learning models. For almost every project, you have to spend time cleaning and processing the data. So, let’s clean the dataset we will be working with now.

One important data cleaning step is to get rid of NaN values. There are several ways to handle them, but in this project we will simply delete the rows where the review is NaN.

products = products.dropna(subset=['review'])

Now let’s iterate over the review column and check whether it contains any data type other than string.

for i in range(len(products)):
    if type(products.iloc[i]['review']) != str:
        print(i)   # print the position of any non-string review
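An equivalent, more compact check is a pandas one-liner that counts how many values of each Python type appear in the column (just an optional alternative to the loop above):

# count the Python types present in the review column
products['review'].map(type).value_counts()

After dropping the NaN reviews, this should show only str.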

The product name is also an important feature. We will remove any row where the name value is NaN.

products = products.dropna(subset=['name'])

As we are doing sentiment analysis, it is important to tell our model what counts as a positive sentiment and what counts as a negative one. In our rating column, we have ratings from 1 to 5. We can easily define 1 and 2 as bad reviews and 4 and 5 as good reviews. What about 3? It sits in the middle: neither good nor bad, just average. But we want to classify reviews as good or bad, so I decided to get rid of all the 3s. This is a judgment call that depends on the project or your own idea of good and bad; if you would rather put 3 in the good-review slot, just do it. But I am getting rid of them.

products = products[products['rating'] != 3]
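If you are curious, here is a quick optional look at how the remaining ratings are distributed (not needed for the model, just a sanity check):

# how many reviews are left for each rating after dropping the 3s
products['rating'].value_counts()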

We will denote positive sentiments as 1 and negative sentiments as 0. Let’s write a function ‘sentiment’ that returns 1 if the rating is 4 or more, and 0 otherwise.

def sentiment(n):

    return 1 if n >= 4 else 0

Now apply the function sentiment and create a new column that will represent the positive and negative sentiment as 1 or 0.

products['sentiment'] = products['rating'].apply(sentiment)
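To confirm the mapping worked, we can peek at the rating and sentiment columns side by side:

products[['rating', 'sentiment']].head()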

You will see the new column named ‘sentiment’ in the table. Next, we need to prepare the training features by combining the name and review columns into one single column. First, write a function ‘combined_features’ that joins the two strings; then apply it to create a new column ‘all_features’ that contains the text from both the name and review columns.

def combined_features(row):
    return row['name'] + ' ' + row['review']

products['all_features'] = products.apply(combined_features, axis=1)

Let’s Develop the Sentiment Classifier

We need to define X and y. X should be the ‘all_features’ column and y should be the ‘sentiment’ column.

X = products['all_features']
y = products['sentiment']

Now we are ready to develop our sentiment classifier. First, we need to split our dataset so that we have a training set and a test set. We will use the ‘train_test_split’ function from the sklearn library. We will train the model on the training set and then test the accuracy of the model on the test set. We are not specifying the sizes of the training and test sets; when we don’t, ‘train_test_split’ automatically uses 75% of the data for training and 25% for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

We are using CountVectorizer for this problem. CountVectorizer builds a vocabulary from all the words in the text and turns each document into a vector of word counts. Import CountVectorizer, fit it on the training data, and use it to transform both the training and the test data.


from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

ctmTr = cv.fit_transform(X_train)

X_test_dtm = cv.transform(X_test)
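To see what CountVectorizer actually does, here is a tiny standalone example on a made-up corpus (the three sentences are invented purely for illustration; get_feature_names_out is the method name in recent scikit-learn versions, older versions use get_feature_names):

# toy example: learn a vocabulary and count words per sentence
toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(['good soft quilt', 'bad quilt', 'good good wipes'])

print(toy_cv.get_feature_names_out())   # learned vocabulary: ['bad' 'good' 'quilt' 'soft' 'wipes']
print(toy_counts.toarray())             # one row of word counts per sentence

In the same way, ctmTr and X_test_dtm are sparse matrices of word counts over the vocabulary learned from X_train.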

Let’s dive into the actual model part. This is the most fun part. We will use LogisticRegression. Let’s do the necessary imports, fit the model on our training data, and then predict the sentiment for the test data.

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

model = LogisticRegression()

model.fit(ctmTr, y_train)

y_pred_class = model.predict(X_test_dtm)

Use the accuracy_score function to get the accuracy of the model on the test data.

accuracy_score(y_test, y_pred_class)

The accuracy score I got for this data is 93%, which is very good. I will show another more efficient way of doing Natural Language Processing in my next article.
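Before closing, here is a quick sanity check you can try: run the fitted vectorizer and model on a couple of made-up reviews (the two sentences below are invented just for illustration):

# classify two new, made-up reviews with the trained vectorizer and model
sample_reviews = ['This blanket is so soft and my baby loves it',
                  'Broke after two days, a complete waste of money']

sample_counts = cv.transform(sample_reviews)   # reuse the CountVectorizer fitted on X_train
print(model.predict(sample_counts))            # 1 = positive, 0 = negative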
