Sentiment analysis is one of the most common natural language processing tasks. Businesses use it to make sense of social media comments, product reviews, and other text data efficiently. TensorFlow and Keras are great tools for the job.
TensorFlow is arguably the most popular deep learning library. It lets you build neural networks with very little code, and it is popular because it is easy to use and runs fast. You can train a neural network without knowing every detail of how one works, though knowing the basics helps.
TensorFlow also has very good documentation, so it is easy to learn. Many of us want a career in machine learning and deep learning nowadays but worry that it requires very advanced programming and problem-solving skills.
Yes, programming knowledge is definitely needed. But thanks to libraries like TensorFlow, intermediate-level programming skills are enough to work on machine learning and deep learning problems. Most professional data scientists do not develop their algorithms from scratch. Using these libraries is common in industry.
This article walks through a simple sentiment analysis project in TensorFlow. Please feel free to download the dataset from this link.
This dataset has three columns: the product name, the review, and the rating. The review column contains the customers' comments, and the rating column holds a numeric rating from 1 (worst) to 5 (best).
What to Expect From This Article?
This article will provide:
- Step-by-step explanation of how to use TensorFlow and Keras to perform sentiment analysis
- Complete working code that can be adapted to some real-life projects
- Explanation of the parameters
- Some resources to understand the parameters better
Data Preparation
In natural language processing projects, data preprocessing is half the work, because algorithms do not understand raw text. We need to convert the text into numbers that algorithms can work with.
Before working on the text, it is also important to define what counts as positive and negative sentiment. In this dataset, we can use the rating column for that. Ratings range from 1 to 5, so a review rated 1 or 2 can be considered negative, and a review rated 3, 4, or 5 can be considered positive. We set 0 for negative sentiment and 1 for positive sentiment.
df['sentiments'] = df.rating.apply(lambda x: 0 if x in [1, 2] else 1)
After adding the ‘sentiments’ column with the line of code above, the dataset looks like this:
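As a rough illustration on made-up rows (the product names, reviews, and ratings below are invented and are not taken from the actual dataset), the same labeling rule can be applied to a tiny DataFrame:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
df = pd.DataFrame({
    "name": ["Phone A", "Phone B", "Phone C"],
    "review": ["Stopped working in a week", "It is okay", "Love it"],
    "rating": [1, 3, 5],
})

# Ratings 1 and 2 become 0 (negative); ratings 3, 4, and 5 become 1 (positive)
df["sentiments"] = df.rating.apply(lambda x: 0 if x in [1, 2] else 1)

print(df[["rating", "sentiments"]])
```

Note that a rating of 3 lands on the positive side under this rule. That is a modeling choice, and a different threshold may suit other datasets better.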
The next step is to tokenize the text. The Keras Tokenizer class will be used for that. By default, it lowercases the text, removes punctuation, and splits on whitespace. The tokenizer then maps each word to an integer. Let’s import the tokenizer:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Later, the tokenizer’s oov_token will be set to ‘<OOV>’. That means any word the tokenizer has not seen will be replaced by this token, which is a better option than dropping unknown words. We will talk about ‘pad_sequences’ later.
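To make the effect of the out-of-vocabulary token concrete, here is a minimal sketch on two made-up training sentences. The sentences and the variable names are assumptions for illustration; only the Tokenizer calls come from Keras.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Made-up sentences, for illustration only
train_texts = ["the phone is great", "the battery is bad"]
new_text = ["the screen is great"]  # "screen" was never seen during fitting

# With an OOV token, the unseen word is replaced by the OOV index
with_oov = Tokenizer(oov_token="<OOV>")
with_oov.fit_on_texts(train_texts)
seq_with_oov = with_oov.texts_to_sequences(new_text)
oov_index = with_oov.word_index["<OOV>"]

# Without an OOV token, the unseen word is silently dropped
without_oov = Tokenizer()
without_oov.fit_on_texts(train_texts)
seq_without_oov = without_oov.texts_to_sequences(new_text)

print(seq_with_oov)     # four integers, one of them the OOV index
print(seq_without_oov)  # only three integers
```

Keeping a placeholder for unknown words preserves the length and word positions of a review, which is why it is usually preferred over dropping them.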
Splitting the Dataset
We will keep eighty percent of the data for training and twenty percent for testing purposes.
split = round(len(df)*0.8)
train_reviews = df['review'][:split]
test_reviews = df['review'][split:]
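The slice-based split above keeps the rows in their original order. Whether the dataset file is sorted in some way (by product or by rating, for example) is not stated here; if it is, shuffling before splitting is a common precaution. A minimal sketch on a toy DataFrame, where the data and the random_state value are assumptions for illustration:

```python
import pandas as pd

# Toy frame sorted by rating, standing in for a file that happens to be ordered
df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "rating": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

# Shuffle the rows reproducibly, then reset the index
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Same 80/20 slicing as in the article, applied to the shuffled rows
split = round(len(shuffled) * 0.8)
train_reviews = shuffled["review"][:split]
test_reviews = shuffled["review"][split:]

print(len(train_reviews), len(test_reviews))
```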
As an extra precaution, I will convert each review to a string. If any value is not in string format, we would get an error later, so this extra step is worth taking. The labels are collected into lists at the same time.
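import numpy as np

# Convert every review to a string so a non-string value cannot cause an error later
training_sentences = [str(row) for row in train_reviews]
testing_sentences = [str(row) for row in test_reviews]

# Labels as plain lists, using the same 80/20 split as the reviews
training_labels = list(df['sentiments'][:split])
testing_labels = list(df['sentiments'][split:])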
The training and testing sets are ready for action. A few hyperparameters need to be set first. I will explain them after this code block:
vocab_size = 40000
embedding_dim = 16
trunc_type = 'post'
oov_tok = "<OOV>"
max_length = 120
Here, a vocab_size of 40,000 means the tokenizer keeps only the most frequent words, up to about 40,000, when converting reviews to numbers. Rarer words are replaced by the OOV token.
An embedding dimension of 16 means each word will be represented by a 16-dimensional vector. A max_length of 120 fixes the length of each review at 120 words. If a review is longer than 120 words, it will be truncated.
The term trunc_type is set to ‘post’, so a review longer than 120 words is truncated at the end. On the other hand, a review shorter than 120 words is padded with zeros to reach 120. A padding type of ‘post’ would apply the padding at the end rather than the beginning. Note that the pad_sequences calls in this article do not pass a padding argument, so Keras falls back to its default, ‘pre’, and pads at the beginning. Pass padding='post' if you want the padding at the end.
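The difference between ‘pre’ and ‘post’ is easiest to see on tiny inputs. The sketch below uses two made-up integer sequences and a maxlen of 5 (both are assumptions chosen to keep the output readable):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Made-up integer sequences: one shorter and one longer than maxlen
sequences = [[1, 2, 3], [1, 2, 3, 4, 5, 6]]

# Pad and truncate at the end
post = pad_sequences(sequences, maxlen=5, padding="post", truncating="post")

# Keras defaults: pad and truncate at the beginning
pre = pad_sequences(sequences, maxlen=5)

print(post)
print(pre)
```

With ‘post’, the short sequence gets zeros appended and the long one loses its last element. With the defaults, zeros are prepended and the long sequence loses its first element.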
Now, let’s start by tokenizing the words:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
Here is part of the word_index values:
{'<OOV>': 1,
'the': 2,
'it': 3,
'i': 4,
'and': 5,
'to': 6,
'a': 7,
'is': 8,
'this': 9,
'for': 10,
Notice how each word has an integer value. Now, each review can be represented as a sequence of integers. The next code block converts the sentences into sequences and then pads or truncates them to the same length:
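sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)
# The test reviews must also be converted to integer sequences before padding,
# and are truncated the same way as the training reviews
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, truncating=trunc_type)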
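One detail of num_words is easy to miss: word_index always lists every word seen during fitting, and the limit is applied only when texts are converted to sequences. A minimal sketch with a deliberately tiny, hypothetical limit of 4 on made-up sentences:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Made-up sentences, for illustration only
texts = ["good good good phone", "good phone bad"]

# Hypothetical tiny vocabulary limit, chosen only to make the effect visible
tokenizer = Tokenizer(num_words=4, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

# word_index still lists every word seen during fitting
print(tokenizer.word_index)

# Only indices below num_words survive; "bad" (index 4) becomes the OOV index
limited = tokenizer.texts_to_sequences(["good phone bad"])
print(limited)
```

So with vocab_size = 40000, a word_index longer than 40,000 entries is expected on a large corpus, and the embedding layer still only ever sees indices below 40,000.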
The data processing is done here. Now, the model development can be done very easily.
Model Development
For this project, the ‘keras.Sequential’ model will be used. Please check the attached link to learn details about ‘keras.Sequential’. I will explain how to use it in this type of project. Here is the model:
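import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    # Hidden layer with 6 neurons, matching the description and the model summary
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])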
The first layer is the embedding layer, whose parameters have all been defined and explained above. The second layer, ‘GlobalAveragePooling1D()’, averages the word vectors across each review. The embedding output is three-dimensional (batch_size x steps x features); GlobalAveragePooling1D reduces it to (batch_size x features).
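The shape changes in these two layers can be reproduced with plain NumPy. This is only a sketch of the arithmetic, not the Keras implementation; the toy sizes and the random embedding matrix are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, not the article's 40,000 and 16
vocab_size, embedding_dim = 10, 4
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# Two padded "reviews" of 3 word indices each: shape (batch_size, steps)
batch = np.array([[2, 5, 7], [1, 3, 0]])

# Embedding lookup: one vector per word -> (batch_size, steps, features)
embedded = embedding_matrix[batch]

# Global average pooling: mean over the steps axis -> (batch_size, features)
pooled = embedded.mean(axis=1)

print(embedded.shape, pooled.shape)
```

One consequence of plain averaging is that padded positions (index 0) contribute to the mean as well, unless masking is enabled on the embedding layer.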
The third layer is a Dense layer with a ‘relu’ activation function. You can try ‘tanh’ or any other activation function of your choice. This is the hidden layer. I used only one hidden layer, with 6 neurons. Feel free to try multiple hidden layers; more complex problems may require more of them. You may wonder how to choose the number of neurons.
There are so many articles on that. Here is a short article that provides some insights in brief.
The last layer uses the sigmoid activation function, also called the logistic function.
In the hidden layers you can use ‘relu’ or ‘tanh’ activation functions, but the last layer in a classification problem typically uses sigmoid (for two classes) or softmax (for more than two classes).
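To see why these two functions suit the output layer, here is a minimal NumPy sketch of both. The input values are made up; the point is that sigmoid returns a single probability and softmax returns a set of probabilities that sum to 1.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability, then normalize to sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

# One output neuron: probability of the positive class
p_positive = sigmoid(0.0)

# Three output neurons: one probability per class
class_probs = softmax(np.array([2.0, 1.0, 0.1]))

print(p_positive)
print(class_probs, class_probs.sum())
```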
Now, compile the model:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
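The binary_crossentropy loss named in the compile call can be written out by hand. The sketch below is a plain NumPy version for intuition, not the Keras implementation; the clipping constant eps and the example predictions are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1.0, 0.0, 1.0])

# Predictions close to the true labels give a small loss
loss_good = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8]))

# Confident but wrong predictions give a much larger loss
loss_bad = binary_crossentropy(y_true, np.array([0.1, 0.9, 0.2]))

print(loss_good, loss_bad)
```

The loss punishes confident mistakes heavily, which is why it pairs well with a sigmoid output that produces probabilities.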
I chose binary_crossentropy as the loss function because it is a probabilistic loss suited to two-class labels. There are other loss functions that are described here. I used ‘adam’ as the optimizer. There are several other optimizers such as RMSprop, Adadelta, Adagrad, Adamax, and more. Here is the model summary:
model.summary()
Output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 120, 16) 640000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16) 0
_________________________________________________________________
dense (Dense) (None, 6) 102
_________________________________________________________________
dense_1 (Dense) (None, 1) 7
=================================================================
Total params: 640,109
Trainable params: 640,109
Non-trainable params: 0
_________________________________________________________________
If you look at the model summary, a lot of the terms we discussed before will make sense now.
Let me explain a little bit. The embedding layer output has three dimensions. The first is the batch size, shown as None because the model does not fix it in advance; model.fit uses its default batch size of 32, which is why the training log shows 4589 steps per epoch. The other two say that each review has 120 words and each word is represented as a 16-element vector (we chose 16 earlier). That means we have 16 features. Global average pooling averages over the 120 words, leaving only the batch size, which is None, and the number of features.
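The parameter counts in the summary can be checked with a few lines of arithmetic. This sketch only uses the sizes already given in the article (40,000 words, 16 dimensions, 6 hidden neurons, 1 output); the variable names are mine.

```python
vocab_size, embedding_dim, hidden_units = 40000, 16, 6

# Embedding: one 16-element vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Dense hidden layer: a weight per input feature per neuron, plus one bias per neuron
hidden_params = embedding_dim * hidden_units + hidden_units

# Output layer: one weight per hidden neuron, plus one bias
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so vocab_size and embedding_dim dominate the size of this model.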
Training the Model
Before training the model, we just need to convert the labels to NumPy arrays. If you notice, they are in list form:
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
Let’s dive into training the model. I will train the model for 20 epochs.
num_epochs = 20
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Output:
Epoch 1/20 4589/4589 [==============================] - 40s 8ms/step - loss: 0.2849 - accuracy: 0.8831 - val_loss: 0.2209 - val_accuracy: 0.9094
Epoch 2/20 4589/4589 [==============================] - 35s 8ms/step - loss: 0.2098 - accuracy: 0.9127 - val_loss: 0.1990 - val_accuracy: 0.9186
Epoch 3/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1931 - accuracy: 0.9195 - val_loss: 0.2000 - val_accuracy: 0.9177
Epoch 4/20 4589/4589 [==============================] - 35s 8ms/step - loss: 0.1837 - accuracy: 0.9234 - val_loss: 0.1993 - val_accuracy: 0.9168
Epoch 5/20 4589/4589 [==============================] - 35s 8ms/step - loss: 0.1766 - accuracy: 0.9264 - val_loss: 0.2013 - val_accuracy: 0.9163
Epoch 6/20 4589/4589 [==============================] - 35s 8ms/step - loss: 0.1708 - accuracy: 0.9287 - val_loss: 0.2044 - val_accuracy: 0.9174
Epoch 7/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1656 - accuracy: 0.9309 - val_loss: 0.2164 - val_accuracy: 0.9166
Epoch 8/20 4589/4589 [==============================] - 35s 8ms/step - loss: 0.1606 - accuracy: 0.9332 - val_loss: 0.2122 - val_accuracy: 0.9155
Epoch 9/20 4589/4589 [==============================] - 35s 8ms/step - loss: 0.1560 - accuracy: 0.9354 - val_loss: 0.2203 - val_accuracy: 0.9170
Epoch 10/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1515 - accuracy: 0.9373 - val_loss: 0.2222 - val_accuracy: 0.9161
Epoch 11/20 4589/4589 [==============================] - 35s 8ms/step - loss: 0.1468 - accuracy: 0.9396 - val_loss: 0.2225 - val_accuracy: 0.9143
Epoch 12/20 4589/4589 [==============================] - 37s 8ms/step - loss: 0.1427 - accuracy: 0.9413 - val_loss: 0.2330 - val_accuracy: 0.9120
Epoch 13/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1386 - accuracy: 0.9432 - val_loss: 0.2369 - val_accuracy: 0.9131
Epoch 14/20 4589/4589 [==============================] - 34s 7ms/step - loss: 0.1344 - accuracy: 0.9455 - val_loss: 0.2418 - val_accuracy: 0.9102
Epoch 15/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1307 - accuracy: 0.9470 - val_loss: 0.2487 - val_accuracy: 0.9073
Epoch 16/20 4589/4589 [==============================] - 37s 8ms/step - loss: 0.1272 - accuracy: 0.9490 - val_loss: 0.2574 - val_accuracy: 0.9058
Epoch 17/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1237 - accuracy: 0.9502 - val_loss: 0.2663 - val_accuracy: 0.9009
Epoch 18/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1202 - accuracy: 0.9519 - val_loss: 0.2734 - val_accuracy: 0.9028
Epoch 19/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1173 - accuracy: 0.9536 - val_loss: 0.2810 - val_accuracy: 0.8978
Epoch 20/20 4589/4589 [==============================] - 36s 8ms/step - loss: 0.1144 - accuracy: 0.9550 - val_loss: 0.2959 - val_accuracy: 0.9058
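As you can see from the result of the last epoch, the training accuracy is 95.5% and the validation accuracy is 90.58%. The model is overfitting: validation accuracy peaks at 91.86% in epoch 2, and validation loss climbs from 0.1990 to 0.2959 by epoch 20 while training loss keeps falling. Training for more epochs is unlikely to help; stopping earlier or adding regularization such as dropout is more promising.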
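The validation numbers in the log can be scanned programmatically to find the best epoch. The lists below are copied from the training output above; the variable names are mine.

```python
# val_loss and val_accuracy for epochs 1 to 20, copied from the training log
val_loss = [0.2209, 0.1990, 0.2000, 0.1993, 0.2013, 0.2044, 0.2164, 0.2122,
            0.2203, 0.2222, 0.2225, 0.2330, 0.2369, 0.2418, 0.2487, 0.2574,
            0.2663, 0.2734, 0.2810, 0.2959]
val_acc = [0.9094, 0.9186, 0.9177, 0.9168, 0.9163, 0.9174, 0.9166, 0.9155,
           0.9170, 0.9161, 0.9143, 0.9120, 0.9131, 0.9102, 0.9073, 0.9058,
           0.9009, 0.9028, 0.8978, 0.9058]

# Epochs are numbered from 1, list positions from 0
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_acc.index(max(val_acc)) + 1

print(best_loss_epoch, min(val_loss))
print(best_acc_epoch, max(val_acc))
```

Both metrics are at their best in epoch 2. Keras also provides the tf.keras.callbacks.EarlyStopping callback, which can stop training automatically once a monitored metric such as val_loss stops improving.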
We can plot the training and validation accuracies, and training and validation losses:
%matplotlib inline
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))

# Accuracy curves
plt.figure()
plt.plot(epochs, acc, 'r', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation Accuracy')
plt.title('Training and validation accuracy')
plt.legend()

# Loss curves
plt.figure()
plt.plot(epochs, loss, 'r', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
The model is done, and that is the main portion of the work. If you want, you can now evaluate it with other performance metrics.
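As one example of other metrics, precision, recall, and F1 can be computed from predicted probabilities with NumPy. The labels and probabilities below are made up for illustration; in this project the probabilities would come from something like model.predict(testing_padded), and the 0.5 threshold is an assumption.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1])
y_prob = np.array([0.9, 0.4, 0.3, 0.8, 0.7, 0.6])

# Turn probabilities into class predictions with an assumed 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

# Confusion matrix counts
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(precision, recall, f1)
```

These metrics are worth checking when one class is much more common than the other, because accuracy alone can look high even if the rarer class is predicted poorly.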
Conclusion
This is a basic but useful model. You can use the same model with different hidden layers and neurons to solve quite a lot of problems in natural language processing. I also provided some more resources in the article that can be used to improve or change the model. Feel free to experiment with a different number of hidden layers, neurons, activation functions, metrics, or optimizers. That will give you a lot of learning experience.