Text data analysis is getting easier every day. Prominent programming languages like Python and R have excellent libraries for it. There was a time when people thought you needed to be an expert coder to do this kind of complex task, but with today’s improved libraries you can perform text data analysis with simple, beginner-level coding knowledge.
In this article, I will work on a dataset that is primarily text. The dataset contains customer reviews and ratings of Amazon baby products. Please feel free to download the dataset from this link and follow along. It is a great dataset for machine learning because it pairs ratings with reviews, but this article focuses only on exploratory data analysis. We will still talk about sentiment, but not sentiment analysis with machine learning.
Let’s jump in!
Dataset
First, import the dataset, and then we will talk about it some more.
import pandas as pd
import numpy as np
df = pd.read_csv("amazon_baby.csv")
df.head()
As you can see, we have the product names, customer reviews, and ratings in the dataset. I like to start by checking how many rows the dataset has:
len(df)
Output:
183531
That’s a lot of data, so I decided to make the dataset smaller. When I tried to work with the full dataset, the visualizations, calculations, and data manipulation took too long to complete. To make the code easier and less time-consuming to run, I will shrink the dataset. One way is to simply take 5,000 to 10,000 rows, or whatever number you want, from the dataset.
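For example, if you prefer a random subset over simply taking the first rows, a minimal sketch with pandas’ sample would look like this (df_small and the random_state value are just illustrative; later in this article I simply take the first 10,000 rows):

# hypothetical alternative: draw a reproducible random subset instead of the first N rows
df_small = df.sample(n=10000, random_state=42)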
But before doing that, let’s check the value counts of the reviews for each product.
df['name'].value_counts()
Output:
Vulli Sophie the Giraffe Teether 785
Simple Wishes Hands-Free Breastpump Bra, Pink, XS-L 562
Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision 561
Baby Einstein Take Along Tunes 547
Cloud b Twilight Constellation Night Light, Turtle 520
...
Mud Pie Baby Stroller Bear Buddy (Set of 4) 1
Baby Mod Modena 3 in 1 Convertible Crib, White 1
Britax Kick Mats 1
Camouflage Camo print Cloth Diaper 1
Baby Gear Blue Bear with Dots and Circles Security Blanket Lovey 1
Name: name, Length: 32415, dtype: int64
You can see from the output that some products have only one review. It’s not possible to draw any conclusion from a single review. Here I am keeping only the products that have more than 20 reviews.
df = df[df.groupby("name")["name"].transform('size') > 20]
I checked the length of the dataset again and found that we now have 89,691 rows. That’s still a lot. I want only 10,000 rows for this demonstration. If you don’t worry about running time, or if you have a higher-capacity computer, please feel free to use all the data.
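The check is the same call as before:

len(df)

Output:

89691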
df = df.head(10000)
Now I have a dataset with 10,000 rows in it.
Preprocessing
Some text cleaning and processing is necessary before jumping into the analysis; preprocessing and data cleaning are a big part of data analysis. First, I will convert the ‘review’ column to string format. It looks like strings already, but if by any chance some entries are not strings, this will simply convert the whole column.
df['review'] = df['review'].astype(str)
Reviews are the main information in this dataset. If any row is missing the review, we don’t need that row.
df = df[~df["review"].isnull()]
I still have 10,000 rows. That means there were no null values in the review column. (Strictly speaking, because we already cast the column to strings, any missing value would have become the string 'nan' by now; in a stricter pipeline you would drop nulls before the cast.)
Delete the Special Characters
Reviews may contain a lot of special characters that are not helpful for any analysis. It’s good to clean them in the beginning.
def clean(txt):
    # strip leftover HTML tags, HTML entities, and non-breaking spaces
    txt = txt.str.replace("(<br/>)", "", regex=True)
    txt = txt.str.replace('(<a).*(>).*(</a>)', '', regex=True)
    txt = txt.str.replace('(&amp)', '', regex=True)
    txt = txt.str.replace('(&gt)', '', regex=True)
    txt = txt.str.replace('(&lt)', '', regex=True)
    txt = txt.str.replace('(\xa0)', ' ', regex=True)
    return txt

df['review'] = clean(df['review'])
Converting to lower case
Converting to lower case is necessary. Otherwise, the same word with different capitalization will be counted as different words: ‘me’ and ‘Me’ would be treated as two separate words, and we don’t want that.
df['review1'] = df['review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['review1'].head()
Output:
153 we bought these for our son when he turned two...
154 my son loves stacking cups, so a friend recomm...
155 my son cameron just loves these great little s...
156 my one year old son received these as a birthd...
157 i purchased this toy for my great grandson\'s ...
Name: review1, dtype: object
Removing the Punctuation
This step removes the punctuation. Because of punctuation, a word might be treated differently than it originally is; for example, ‘use’ and ‘use:’ become different words because of the colon.
df['review1'] = df['review1'].str.replace(r'[^\w\s]', '', regex=True)
df['review1'].head()
Output:
153 we bought these for our son when he turned two...
154 my son loves stacking cups so a friend recomme...
155 my son cameron just loves these great little s...
156 my one year old son received these as a birthd...
157 i purchased this toy for my great grandsons fi...
Name: review1, dtype: object
As you can see, the punctuation is gone! In line 157 of the output, there was an apostrophe in ‘grandson’s’. We will eventually get rid of the trailing ‘s’ in ‘grandsons’ in the stemming or lemmatization section in a bit.
Removing Stopwords
Stopwords are grammatical or binding words like ‘is’, ‘the’, ‘and’, ‘so’, and ‘my’. These words appear very frequently but may not add any value to the analysis. That is arguable, though: some people consider them important at times, and in some artificial intelligence projects they do matter. For this example, the stopwords won’t be necessary.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['review1'] = df['review1'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['review1'].head()
Output:
153 bought son turned two seen playmates home love...
154 son loves stacking cups friend recommended toy...
155 son cameron loves great little stacking cars e...
156 one year old son received birthday gift loves ...
157 purchased toy great grandsons first christmas ...
Name: review1, dtype: object
See, all the stopwords are gone!
Remove the Rare Words
There are some words that appear only once. Those rare words do not add anything to the analysis, so we can safely discard them. First, find the frequency of each word, then pick out the words that appear only once.
freq = pd.Series(' '.join(df['review1']).split()).value_counts()
less_freq = list(freq[freq ==1].index)
less_freq
Output:
['djgs',
'now7monthold',
'joseph',
'area8',
'activing',
'tea',
'productdespite',
'worth3the',
'aroundand',
'80lb',
'combinedit',
'hikesnow',
'bubblesbeing',
'cheast',
'inexcusable',
'heavyeven',
This is part of the output. There are 14,352 words in this list in total. If you look closely, most of these words look weird; they are mostly typos or spelling mistakes. Let’s get rid of them from the reviews:
df['review1'] = df['review1'].apply(lambda x: " ".join(x for x in x.split() if x not in less_freq))
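One optional note: checking membership in a plain Python list of roughly 14,000 items inside an apply is slow. A small tweak of my own (not in the original code) is to convert the list to a set first, which makes each lookup much faster:

# optional speed-up: set membership checks are much faster than list membership checks
less_freq = set(less_freq)
df['review1'] = df['review1'].apply(lambda x: " ".join(x for x in x.split() if x not in less_freq))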
Spelling Correction
Simple spelling mistakes can be corrected using TextBlob’s correct() method.
from textblob import TextBlob, Word, Blobber
df['review1'] = df['review1'].apply(lambda x: str(TextBlob(x).correct()))
df['review1'].head()
Output:
153 He bought these for our son when he turn two H...
154 By son love stick cups so a friend recommend t...
155 By son cameron just love these great little st...
156 By one year old son receive these a a birthday...
157 I purchase the toy for my great grandson first...
Just a warning: this piece of code takes several hours to run, so you have to be patient. If you want, you can skip this step to save time.
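If you want a rough idea of the runtime before committing to the full run, a minimal sketch of my own is to correct only a small sample first:

# hypothetical quick check: correct just the first 100 reviews to estimate how long the full column would take
sample_corrected = df['review1'].head(100).apply(lambda x: str(TextBlob(x).correct()))
sample_corrected.head()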
Stemming and Lemmatization
Stemming cuts suffixes like ‘ly’, ‘ing’, and ‘ed’ off the words. We talked about it a bit before.
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['review1'] = df['review'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
Output:
153 We bought these for our son when he turn two. ...
154 My son love stack cups, so a friend recommend ...
155 My son cameron just love these great littl sta...
156 My one year old son receiv these as a birthday...
157 I purchas thi toy for my great grandson\' firs...
Look at line 157 in the output. After the word ‘grandson’, the punctuation is back. That is because the stemming above was applied to the original ‘review’ column rather than the cleaned ‘review1’ column. Don’t worry, we will take care of it later.
The next step is lemmatization, which brings the words to their root form. You can choose either stemming or lemmatization. After stemming, you may not see many changes from the lemmatization; I am still showing it for demonstration purposes.
df['review1'] = df['review1'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['review1'].head()
Output:
153 We bought these for our son when he turn two. ...
154 My son love stack cups, so a friend recommend ...
155 My son cameron just love these great littl sta...
156 My one year old son receiv these a a birthday ...
157 I purchas thi toy for my great grandson\' firs...
Now we should remove the punctuation again:
df['review1'] = df['review1'].str.replace(r'[^\w\s]', '', regex=True)
df['review1'].head()
Output:
153 We bought these for our son when he turn two H...
154 My son love stack cups so a friend recommend t...
155 My son cameron just love these great littl sta...
156 My one year old son receiv these a a birthday ...
157 I purchas thi toy for my great grandson first ...
Name: review1, dtype: object
Actually, I did an extra step here. You could skip removing the punctuation at the beginning and do it only at this point; that would save you an unnecessary step.
Data Analysis
Let’s start the analysis by adding some more features to the dataset. Here, I am adding the length of the review and the word count of each review.
df['review_len'] = df['review'].astype(str).apply(len)
df['word_count'] = df['review'].apply(lambda x: len(str(x).split()))
I want to add one more feature called polarity. Polarity expresses the sentiment of a piece of text: based on the positive and negative words in it, TextBlob assigns a score between -1 and 1, where -1 represents negative sentiment, 0 represents neutral, and 1 represents positive sentiment.
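As a quick sanity check (a toy example of my own, not part of the article’s pipeline), you can see the range of the score on two made-up sentences:

from textblob import TextBlob  # already imported earlier
print(TextBlob("This toy is wonderful").sentiment.polarity)  # strongly positive, close to 1
print(TextBlob("This toy is terrible").sentiment.polarity)   # strongly negative, close to -1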
df['polarity'] = df['review1'].map(lambda text: TextBlob(text).sentiment.polarity)
df.head()
Output: the first rows of the dataframe now show the new review_len, word_count, and polarity columns alongside the original data.
Distributions
I would like to start by seeing the distribution of the word_count, review_len, and polarity.
df[["review_len", "word_count", "polarity"]].hist(bins=20, figsize=(15, 10))
Polarity vs Rating
Because there is a rating column available, we should check if the polarity goes with the rating. Here are the boxplots of the polarity of each rating:
plt.figure(figsize = (10, 8))
sns.set_style('whitegrid')
sns.set(font_scale = 1.5)
sns.boxplot(x = 'rating', y = 'polarity', data = df)
plt.xlabel("Rating")
plt.ylabel("Polatiry")
plt.title("Product Ratings vs Polarity")
plt.show()
Polarity generally goes up with higher ratings, although there are a lot of outliers for ratings 1 and 5. Maybe looking at the numbers will help a bit more.
mean_pol = df.groupby('rating')['polarity'].agg([np.mean])
mean_pol.columns = ['mean_polarity']
fig, ax = plt.subplots(figsize=(8, 6))
plt.bar(mean_pol.index, mean_pol.mean_polarity, width=0.3)
#plt.gca().set_xticklabels(mean_pol.index, fontdict={'size': 14})
for i in ax.patches:
ax.text(i.get_x(), i.get_height()+0.01, str("{:.2f}".format(i.get_height())))
plt.title("Polarity of Ratings", fontsize=22)
plt.ylabel("Polarity", fontsize=16)
plt.xlabel("Rating", fontsize=16)
plt.ylim(0, 0.35)
plt.show()
I was expecting ratings 1 and 2 to have a polarity close to -1, but they look closer to 0. That means the reviews may not contain that many negative words; I am guessing this just from the polarity values. Please read a few reviews with rating 1 to double-check.
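A minimal way to spot-check this (my own snippet, using the same column names as above):

# print a few one-star reviews to see how negative the wording actually is
for review in df[df['rating'] == 1]['review'].head(3):
    print(review[:300], "\n")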
Count of the Reviews for Each Rating
Below is a count plot showing how many reviews there are for each rating in the dataset.
plt.figure(figsize=(8, 6))
sns.countplot(x='rating', data=df)
plt.xlabel("Rating")
plt.title("Number of data of each rating")
plt.show()
Most of the reviews of the dataset have a rating of 5.
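If you want the exact counts behind the plot, value_counts gives them directly:

df['rating'].value_counts()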
Length of the Review vs the Rating
It will be interesting to see if the review length changes with rating.
plt.figure(figsize=(10, 6))
sns.pointplot(x = "rating", y = "review_len", data = df)
plt.xlabel("Rating")
plt.ylabel("Review Length")
plt.title("Product Rating vs Review Length")
plt.show()
Top 20 products based on the Polarity
These are the top 20 products based on mean polarity:
product_pol = df.groupby('name')['polarity'].agg([np.mean])
product_pol.columns = ['polarity']
product_pol = product_pol.sort_values('polarity', ascending=False)
product_pol = product_pol.head(20)
product_pol
WordCloud
A word cloud is a common and beautiful visualization for text data that plots the frequency of words. You may need to install wordcloud if you do not have it already, using this command:
conda install -c conda-forge wordcloud
I use the Anaconda distribution, which is why I am giving the conda installation command.
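If you use pip instead, the equivalent command is:

pip install wordcloud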
To create the word cloud, I combined all the texts in the review1 column to make one big text block.
text = " ".join(review for review in df.review1)
Using this text block, I created the word cloud. Notice that before making the word cloud, I get rid of some more words that I thought were not useful. If you want, you can clean it up further.
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

stopwords = set(STOPWORDS)
stopwords = stopwords.union(["ha", "thi", "now", "onli", "im", "becaus", "wa", "will", "even", "go", "realli", "didnt", "abl"])
wordcl = WordCloud(stopwords=stopwords, background_color='white', max_font_size=50, max_words=5000).generate(text)
plt.figure(figsize=(14, 12))
plt.imshow(wordcl, interpolation='bilinear')
plt.axis('off')
plt.show()
The bigger the word, the more frequently it appears in the reviews.
Frequency Charts
It is common practice in text data analysis to chart the frequency of words; that gives a good idea of what people are talking about most in the text. First, find the frequency of each word in the review column of the dataset, then plot the top 20 words by frequency.
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['review1'], 20)
df1 = pd.DataFrame(common_words, columns=['Review', 'count'])
df1.head()
Here is the bar plot of the frequency of the top 20 words:
df1.groupby('Review').sum()['count'].sort_values(ascending=False).plot(
kind='bar',
figsize=(10, 6),
xlabel = "Top Words",
ylabel = "Count",
title = "Bar Chart of Top Words Frequency"
)
These are the most frequently occurring words in the reviews. But instead of single words, two or three consecutive words are often more helpful because they carry more meaning. The following plot shows the most frequent bigrams:
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words2 = get_top_n_bigram(df['review1'], 30)
df2 = pd.DataFrame(common_words2, columns=['Review', 'Count'])
df2.head()
This is the bar chart of topmost occurring bigrams:
df2.groupby('Review').sum()['Count'].sort_values(ascending=False).plot(
kind='bar',
figsize=(12,6),
xlabel = "Bigram Words",
ylabel = "Count",
title = "Bar chart of Bigrams Frequency"
)
Look at the bigrams: they read more like phrases and make more sense. The next plot shows the trigrams; maybe that will give us some more ideas about what people are saying in the reviews.
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words3 = get_top_n_trigram(df['review1'], 30)
df3 = pd.DataFrame(common_words3, columns=['Review', 'Count'])
df3.groupby('Review').sum()['Count'].sort_values(ascending=False).plot(
    kind='bar',
    figsize=(12, 6),
    xlabel="Trigram Words",
    ylabel="Count",
    title="Bar chart of Trigrams Frequency"
)
Part-of-Speech Tagging
This is the process of tagging each word with its part of speech, such as noun, pronoun, verb, or adjective. It can be done easily using the TextBlob API.
# reuse the combined text block from the word-cloud section; str(df['review1']) would only tag a truncated preview of the Series
blob = TextBlob(text)
pos_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])
pos_df = pos_df.pos.value_counts()[:30]
pos_df.plot(kind='bar', xlabel="Part Of Speech", ylabel="Frequency", title="Bar Chart of the Frequency of the Parts of Speech", figsize=(10, 6))
Conclusion
I tried to present some useful ways to understand and extract information from a piece of text. In the preprocessing section, I introduced several preprocessing techniques that are useful in machine learning as well. You may not use all of them in every analysis or machine learning project; choose whatever feels suitable for your task. After that, some general exploratory analysis techniques were presented, and there are many more out there. I hope you will try all of this on your own dataset and do some cool analysis.
Feel free to follow me on Twitter and like my Facebook page.
#DataScience #DataAnalytics #programming #DataVisualization #DataAnalysis