Sentiment Analysis in Python, Scikit-Learn

Sentiment analysis can play a vital role in almost any industry. Classifying tweets, Facebook comments, or product reviews with an automated system can save a lot of time and money, and it is often more consistent than labeling by hand. In this article, I will walk through a sentiment analysis task using a product review dataset.

I am going to use Python and a few of its libraries. Even if you haven’t used these libraries before, you should be able to follow along. If this is new to you, please copy each step of the code into your notebook and look at the output for a better understanding.

Tools Used

  1. Python
  2. Pandas library
  3. scikit-learn library
  4. Jupyter Notebook as an IDE.

Dataset and Task Overview

As I mentioned earlier, I am going to use a product review dataset. The dataset contains Amazon baby product reviews. Please download the dataset for yourself from this link if you want to practice with it. It has three columns: name, review, and rating. Reviews are text, and ratings are integers from 1 to 5, where 1 is the worst and 5 is the best. Our job is to classify each review as positive or negative. Let’s have a look at the dataset. Here we display the first five entries to examine the data.

import pandas as pd
products = pd.read_csv('amazon_baby.csv')
products.head()
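
If you do not have the file at hand, a small stand-in frame with the same three columns is enough to try the steps. The rows below are invented for illustration and are not taken from the real dataset; `sample` and `csv_text` are names chosen for this sketch.

```python
import io
import pandas as pd

# Invented rows with the same three columns as the real file
csv_text = """name,review,rating
Crib sheet,Soft and fits well,5
Bottle set,Leaks a little,2
Teether,Baby loves it,5
Night light,Stopped working after a month,1
Stroller,Does the job,3
Play mat,Bright colors and easy to clean,4
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())   # first five rows
print(sample.dtypes)   # name and review hold text, rating holds integers
```

Calling head() without an argument returns the first five rows, which is why the sixth row does not appear in the first printout.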

Data Preprocessing

In real life, data scientists rarely get data that is clean and already prepared for machine learning models. In almost every project, you have to spend time cleaning and processing the data. So, let’s clean the dataset we will work on.

One important data cleaning step is to get rid of NaN (null) values. We will work with all three columns, and all three are crucial, so a row that is missing a value in any column is of no use to us. Let’s check how many rows contain at least one null value.

len(products) - len(products.dropna())
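
The same check can be broken down per column, which shows where the gaps are. This is a minimal sketch on an invented frame named `toy`; the values are made up for illustration.

```python
import pandas as pd

# Invented rows; None marks a missing value
toy = pd.DataFrame({
    "name": ["Crib sheet", None, "Bottle set", "Teether"],
    "review": ["Soft and fits well", "Broke in a week", None, "Baby loves it"],
    "rating": [5, 1, 4, 5],
})

# Missing values in each column
nulls_per_column = toy.isna().sum()
print(nulls_per_column)

# Rows with at least one missing value; matches the len() difference
rows_with_nulls = toy.isna().any(axis=1).sum()
print(rows_with_nulls, len(toy) - len(toy.dropna()))
```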

We have null values in 1147 rows. Now, check how many rows we have in total.

len(products)

We have a total of 183,531 rows. So, even if we delete every row that has a null value, we will still have a sizable dataset to train an algorithm. So, drop those rows.

products = products.dropna()

Every entry in the review column needs to be a string. If any entry has another type, it will cause trouble in later steps. Instead of checking the type of each row in a loop, we can convert the whole column to strings in one step; entries that are already strings stay unchanged.

products['review'] = products['review'].astype(str)
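
To see what the conversion does, here is a small sketch on an invented column that mixes strings with a number. The names `toy_reviews` and `converted` are assumptions made for this example.

```python
import pandas as pd

# Invented column mixing strings with a number
toy_reviews = pd.Series(["Works great", 12345, "Too small"])

# Count the Python types present before the conversion
types_before = toy_reviews.map(type).value_counts()
print(types_before)

# Convert everything to strings and confirm no other type is left
converted = toy_reviews.astype(str)
all_strings = converted.map(lambda value: isinstance(value, str)).all()
print(converted.tolist(), all_strings)
```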

As we are doing sentiment analysis, we need to tell our model what counts as positive sentiment and what counts as negative sentiment. The rating column holds values from 1 to 5. We can treat 1 and 2 as bad reviews and 4 and 5 as good reviews. What about 3? A 3 sits in the middle: neither good nor bad, just average. Since we want to classify reviews as good or bad, I decided to drop all the 3s. This choice depends on your employer’s definition, or your own, of good and bad. If you would rather count 3 as a good review, go ahead. But I am getting rid of them.

# .copy() avoids a SettingWithCopyWarning when we add columns later
products = products[products['rating'] != 3].copy()

We will denote positive sentiment as 1 and negative sentiment as 0. Let’s write a function ‘sentiment’ that returns 1 if the rating is 4 or more and 0 otherwise. Then, apply the function to the rating column to create a new column that represents the sentiment as 1 or 0.

def sentiment(n):
    return 1 if n >= 4 else 0

products['sentiment'] = products['rating'].apply(sentiment)
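
The filtering and labeling rule can be tried on a handful of invented ratings. The sketch below uses a vectorized comparison in place of apply(), which gives the same labels; `toy` is a made-up frame.

```python
import pandas as pd

# Invented ratings, including two neutral 3-star rows
toy = pd.DataFrame({"rating": [5, 3, 1, 4, 2, 3]})

# Drop the neutral rows, then label the rest: 4 or 5 -> 1, 1 or 2 -> 0
toy = toy[toy["rating"] != 3].copy()
toy["sentiment"] = (toy["rating"] >= 4).astype(int)

print(toy)
print(toy["sentiment"].value_counts())  # how many rows fall in each class
```

Looking at value_counts() at this point is worthwhile, because it shows how balanced the two classes are before any model is trained.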

Before we can develop our sentiment classifier, we need to prepare the training features. We will combine the ‘name’ and ‘review’ columns into one single column. First, write a function ‘combined_features’ that joins the two columns with a space. Then, apply the function row by row to create a new column ‘all_features’ that contains the strings from both the name and review columns.

def combined_features(row):
    return row['name'] + ' ' + row['review']

products['all_features'] = products.apply(combined_features, axis=1)
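
Applying a function row by row can be slow on a large frame. As an alternative sketch, pandas can concatenate two string columns directly, which should produce the same result. The frame `toy` and its values are invented for this comparison.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Crib sheet", "Bottle set"],
    "review": ["Soft and fits well", "Leaks a little"],
})

# Row-by-row version, as in the walkthrough
def combined_features(row):
    return row["name"] + " " + row["review"]

row_wise = toy.apply(combined_features, axis=1)

# Vectorized version: the two columns are joined element by element
vectorized = toy["name"] + " " + toy["review"]

print(vectorized.tolist())
print(row_wise.tolist() == vectorized.tolist())
```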

Develop the Sentiment Classifier

Here is the process step by step.

We need to define the input variable X and the output variable y. X is the ‘all_features’ column and y is the ‘sentiment’ column.

X = products['all_features']
y = products['sentiment']

Next, we need to split the dataset into a training set and a test set. The ‘train_test_split’ function from the scikit-learn library does this. The model is trained on the training set, and its performance is measured on the test set. By default, ‘train_test_split’ splits the data in a 75/25 proportion: 75% for training and 25% for testing. If you want a different proportion, pass the ‘test_size’ parameter.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
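
Here is a small sketch of the default split and of a custom one, using 100 invented samples. The `stratify` option used in the second call is an addition that the walkthrough does not use; it keeps the share of each class the same in both parts.

```python
from sklearn.model_selection import train_test_split

# 100 invented samples: 80 positive (1) and 20 negative (0)
texts = [f"review {i}" for i in range(100)]
labels = [1] * 80 + [0] * 20

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=0)
print(len(X_tr), len(X_te))

# Custom proportion, keeping the class ratio the same in both parts
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)
print(len(X_te2), sum(y_te2))  # test rows, and how many of them are positive
```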

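I am going to use ‘CountVectorizer’ from the scikit-learn library. CountVectorizer builds a vocabulary of all the words in the training text and turns each string into a vector of word counts. Import CountVectorizer, fit it on the training data, and then use the fitted vectorizer to transform the test data. The test data is only transformed, never fitted, so no information from the test set leaks into training.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)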
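
To make the word counts concrete, here is a minimal sketch on three invented sentences. The names `train_texts`, `test_texts`, and `vectorizer` are chosen for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["great stroller great price", "terrible stroller"]
test_texts = ["great awful stroller"]  # "awful" never appears in training

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)  # learn vocabulary, then count
test_counts = vectorizer.transform(test_texts)        # count known words only

# List the vocabulary words in the order of their matrix columns
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
print(train_counts.toarray())
print(test_counts.toarray())
```

Each row is one text and each column is one vocabulary word. A word that was not seen during fitting, such as "awful" here, has no column and is simply ignored at transform time.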

Let’s dive into the actual modeling part. This is the most fun part. We will use logistic regression, as this is a binary classification problem. Let’s do the necessary imports and fit the model on our training data.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(ctmTr, y_train)

The logistic regression model is trained with the training data. Here is the output of the training.

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

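If this output looks obscure to you, please do not worry about it. It lists the settings (hyperparameters) the model was created with, which here are all scikit-learn defaults. It does not show the weights the model learned from the data. Recent scikit-learn versions print a much shorter line that lists only the settings you changed.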
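
The difference between the settings and the learned weights can be seen on a toy problem. This is a sketch under stated assumptions: the four reviews are invented, and `toy_model`, `settings`, and `weights` are names chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy reviews: 1 = positive, 0 = negative
texts = ["great stroller", "great price", "terrible stroller", "terrible price"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)
toy_model = LogisticRegression().fit(features, labels)

# Settings chosen before training (hyperparameters)
settings = toy_model.get_params()
print(settings["C"], settings["max_iter"])

# Weights learned during training: one per vocabulary word
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = dict(zip(vocabulary, toy_model.coef_[0]))
for word, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{word:10s} {weight:+.3f}")
```

Words that appear only in positive reviews end up with a positive weight, and words that appear only in negative reviews end up with a negative one.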


Use the trained model above to predict the sentiments for the test data.

y_pred_class = model.predict(X_test_dtm)
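
The same pattern applies to any new review: the raw text has to pass through the already fitted vectorizer before it reaches predict(). Below is a minimal sketch with invented reviews; `toy_model` and the other names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy reviews: 1 = positive, 0 = negative
train_texts = ["great stroller", "great price", "terrible stroller", "terrible price"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(vectorizer.fit_transform(train_texts), train_labels)

# New text: transform with the fitted vectorizer, do not fit again
new_texts = ["great blanket", "terrible blanket"]
new_counts = vectorizer.transform(new_texts)

predictions = toy_model.predict(new_counts)
probabilities = toy_model.predict_proba(new_counts)  # columns: P(class 0), P(class 1)
print(predictions)
print(probabilities.round(3))
```

predict_proba() can be useful when you want a confidence value rather than only a 0 or 1 label.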

Use the accuracy_score function to get the accuracy of the predictions on the test data.

accuracy_score(y_test, y_pred_class)
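
Accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on invented labels, compares it with accuracy_score, and adds a confusion matrix, which is an extra the walkthrough does not use, to show where the errors fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels and predictions: 1 = positive, 0 = negative
true_labels = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]

# Accuracy by hand: matching predictions divided by all predictions
matches = sum(t == p for t, p in zip(true_labels, predicted))
manual_accuracy = matches / len(true_labels)
library_accuracy = accuracy_score(true_labels, predicted)
print(manual_accuracy, library_accuracy)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
matrix = confusion_matrix(true_labels, predicted)
print(matrix)
```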

The accuracy score I got on the test set is 84%, which is very good.
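
One way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most common class. This sanity check is not part of the walkthrough above; the sketch uses invented labels in place of y_test, and `toy_y_test` is a made-up name.

```python
import pandas as pd

# Invented labels standing in for y_test: 1 = positive, 0 = negative
toy_y_test = pd.Series([1, 1, 1, 1, 0, 1, 1, 0, 1, 1])

# Share of each class; the largest share is the accuracy of a "model"
# that always predicts the most common class
class_shares = toy_y_test.value_counts(normalize=True)
baseline_accuracy = class_shares.max()

print(class_shares)
print(f"Majority-class baseline: {baseline_accuracy:.0%}")
```

Running the same two lines on the real y_test shows how much the classifier improves on that baseline.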


