CountVectorizer to Extract Features From Text in Python, in Detail
CountVectorizer in sklearn for Natural Language Processing


The most basic processing step in any Natural Language Processing (NLP) project is converting text data to numeric data. As long as the data is in text form, we cannot perform any computation on it.

There are multiple methods available for this text-to-numeric conversion. This tutorial explains one of the most basic vectorizers, the CountVectorizer class in the scikit-learn library.

This method is very simple. It takes the frequency of occurrence of each word as the numeric value. An example will make it clear.

In the following code block:

Import the CountVectorizer class,

Create a CountVectorizer instance,

Fit it to the text data and convert the result to an array.

import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer

# This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. "
        "I am trying to learn how to use count vectorizer."]

cv= CountVectorizer()
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_arr

Output:

array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],
      dtype=int64)

Here I have the numeric values representing the text data above.

How do we know which values represent which words in the text?

To make that clear, it will be helpful to convert the array into a DataFrame where column names will be the words themselves.

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Now it is clear. The value for the word ‘also’ is 1, which means ‘also’ appeared only once in the text. The word ‘aunt’ appeared twice, so its value is 2.
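Another way to look up a single word, without building a DataFrame, is the fitted vectorizer's vocabulary_ attribute, which maps each word to its column index. Here is a minimal sketch, assuming the same one-string text as above; word_counts is just a helper name I made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. "
        "I am trying to learn how to use count vectorizer."]

cv = CountVectorizer()
counts = cv.fit_transform(text).toarray()[0]  # the single row of counts

# vocabulary_ maps each word to the index of its column in the count matrix
word_counts = {word: int(counts[col]) for word, col in cv.vocabulary_.items()}

print(word_counts["aunt"])
print(word_counts["also"])
```

This gives the same numbers as the DataFrame, just keyed by word instead of laid out as columns.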

In the last example, all the sentences were in one string. So, we got only one row of data for four sentences. Let’s rearrange the text and see what happens:

text = ["Hello Everyone! This is Lilly", 
        "My aunt's name is also Lilly",
        "I love my aunt",
        "I am trying to learn how to use count vectorizer"]
cv= CountVectorizer() 
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_arr

Output:

array([[0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 1, 1, 1]],
      dtype=int64)

This time we have a two-dimensional array with one row for each string in the text. Putting this array in a DataFrame:

cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Look carefully at this DataFrame. All the words are there as column names. Each row represents a string in the text, and the value in each column shows how many times that word appeared in the string. If the word doesn’t appear, the value is zero.
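To see where these numbers come from, the sketch below rebuilds the same table by hand using only the standard library. It assumes CountVectorizer's default behavior: lowercase the text, keep tokens of two or more word characters, and sort the vocabulary alphabetically. The names token_pattern, tokenized, vocabulary, and rows are my own. This is an illustration of the idea, not how scikit-learn is implemented internally.

```python
import re
from collections import Counter

text = ["Hello Everyone! This is Lilly",
        "My aunt's name is also Lilly",
        "I love my aunt",
        "I am trying to learn how to use count vectorizer"]

# Tokens of at least two word characters, so 'I' and the 's' in "aunt's" are dropped
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in text]

# One column per distinct word, in alphabetical order
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per string: how often each vocabulary word occurs in that string
rows = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

for row in rows:
    print(row)
```

The printed rows should line up with the array shown earlier, which is a handy check that you understand what each column means.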

CountVectorizer has several parameters that are worth checking.

lowercase

As you may have noticed, CountVectorizer converts all words to lowercase by default. If you do not want that, set lowercase = False.

cv= CountVectorizer(lowercase=False) 
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Now the words are kept exactly as they appear in the text. The word ‘my’ shows up as two separate columns in the DataFrame, ‘My’ and ‘my’.

stop_words

Stop words are the words that we consider unnecessary for the analysis. In our text, I may think ‘also’, ‘is’, and ‘to’ are not necessary words. I can simply exclude them, which is a very important part of data processing for most analytics or machine learning models. Here we have only 4 strings. But in real-world analytics, we need to deal with thousands of strings. Thousands of strings may involve thousands of words, and each word becomes a feature. If we can exclude some of the words that appear very frequently or are not really necessary for the model, it will save a lot of computational effort.

CountVectorizer comes with a built-in stop word list for English, selected with stop_words='english'. ‘english’ is the only built-in list; for other languages you need to provide your own list. Here is an example.

cv= CountVectorizer(stop_words='english') 
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Look! A lot of the words are gone!

If you think the words that are gone are not enough for you or too many words are gone, please provide your own list of stop_words. For example, if I only want ‘also’, ‘is’, ‘am’, and ‘to’ to be excluded, I will provide the list of stop_words like this:

cv= CountVectorizer(stop_words=['also', 'is', 'am', 'to']) 

max_df

This is another way of eliminating words. Setting max_df = 0.5 means that any word appearing in more than 50% of the documents (strings) is eliminated. An integer value can be used as well: max_df = 20 means that any word appearing in more than 20 documents is eliminated.

To demonstrate this, I created a new text:

text = ["lilly is a good girl", 
        "lilly is a good student",
        "lilly is very good in math", 
        "lilly loves coffee", 
        "She is from Brazil"]
cnt_vect = CountVectorizer(max_df=0.75)
count_mtrx = cnt_vect.fit_transform(text)
cnt_arr = count_mtrx.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns=cnt_vect.get_feature_names_out())
cnt_df

‘lilly’ appeared in 4 documents out of 5, that is 80%, which is above the 75% threshold. So it is eliminated. The same goes for ‘is’.

min_df

This is the opposite of max_df. If a word appears in fewer documents than a specified proportion or count, it is eliminated by min_df. In this example, I am using the same text as the last example and setting min_df = 2. So, any word that appears in fewer than 2 documents is eliminated.

cnt_vect = CountVectorizer(min_df=2)
count_mtrx = cnt_vect.fit_transform(text)
cnt_arr = count_mtrx.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns=cnt_vect.get_feature_names_out())
cnt_df

We have only three words left because we have just 5 documents, and most words occur in only one of them. This can be useful in machine learning projects.

When we are trying to extract a trend, words that appear in only a couple of documents out of thousands are not very helpful.

max_features

This is another useful parameter. When we have thousands of words, processing them is computationally expensive and time-consuming. If we have a total of 10000 words, that becomes 10000 features. Now if you think only the top 2000 words by term frequency might be good enough, you can simply use max_features = 2000. Here we do not even have that many words. So, I will use max_features = 5.

cnt_vect = CountVectorizer(max_features=5, stop_words='english')
count_mtrx = cnt_vect.fit_transform(text)
cnt_arr = count_mtrx.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns=cnt_vect.get_feature_names_out())
cnt_df

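Here we have five of the most frequent words. Only ‘lilly’ and ‘good’ appear more than once after the stop words are removed, so the remaining three columns are filled from words that appear once each.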

Conclusion

This article explained CountVectorizer and how you can make the best use of it for text processing. The parameters I explained here can make your analytics or Natural Language Processing models more efficient if used correctly. These parameters can be used alone, or you can combine several of them based on your needs. There is a lot of scope for experimentation. There are more sophisticated methods to vectorize text data nowadays, but this simple method still works in many cases.
