Text Preprocessing to Prepare for Machine Learning in Python — Natural Language Processing
Text preprocessing in Python

In this age of social media and online business, text data comes from everywhere. However, dealing with text data is tricky, because raw text may come in with all kinds of impurities: unnecessary noise, spelling mistakes, and more. So, it is necessary to go through proper preprocessing before diving into any modeling with text data.

In this article, we will work on some common text preprocessing techniques to prepare text data for machine learning.

Removing Numbers

Numbers in text can be misleading for machine learning models. The text itself eventually has to be converted into numbers, and digits left in the text may interfere with that numeric representation unnecessarily. So, removing numbers can be helpful.

Here I used a regular expression to remove the numbers, so the ‘re’ module needs to be imported first.

import re

text = "Class A has 35 students, class B has 29 students and all of them are good at math"
# Replace every run of digits with an empty string
res = re.sub(r'\d+', '', text)
res

Output:

'Class A has  students, class B has  students and all of them are good at math'

All the numbers are gone from the text.
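Depending on the task, deleting the digits outright can lose useful signal. A small variation (my own sketch, not part of this article’s workflow) replaces each run of digits with a placeholder token instead:

import re

text = "Class A has 35 students, class B has 29 students"
# Swap each run of digits for a placeholder token rather than deleting it
print(re.sub(r'\d+', '<num>', text))
# Class A has <num> students, class B has <num> students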

Removing Extra Space

This is another subtle problem. Sometimes an extra space sneaks in at the beginning or the end of the raw data and does not seem to be a problem, but it can cause one. Because of an extra space, the same word may appear as two different words. For example, if the word ‘ song’ carries a leading space, a model will treat it as a different word than ‘song’ only because of the space, which may hurt model performance.

st = " result was great "
st.strip()

Output:

'result was great'

The spaces at the beginning and at the end are gone.
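Note that strip() only removes the spaces at the two ends. If repeated spaces can also appear between words, a regular expression handles both at once (an extra sketch of my own, assuming internal whitespace is unwanted too):

import re

st = "  result   was  great "
# Collapse every run of whitespace to one space, then trim the ends
print(re.sub(r'\s+', ' ', st).strip())
# result was great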

I used twitter.csv data from Kaggle for the next exercises.

This dataset has an Attribution 4.0 International (CC BY 4.0) license. Please feel free to download the dataset from the link below to follow along:

Twitter Sentiment Analysis (www.kaggle.com)

Create a DataFrame from this dataset:

import pandas as pd 
df = pd.read_csv('twitter.csv')
df.head()

The ‘tweet’ column of the DataFrame is the focus for today. Each row contains a tweet. The goal of today’s work is to make these tweets as clean as possible so that a machine learning model can find the trends easily.

Remove Punctuation

To begin with, a first glance shows punctuation marks like ‘@’, ‘#’, and ‘:’ all over the tweets. The function below removes punctuation from a text.

import string

def remove_punctuation(text):
    # Keep only the characters that are not in string.punctuation
    return ''.join([i for i in text if i not in string.punctuation])

Apply this function to each row of the ‘tweet’ column to create a new column ‘tweet_clean’, free from punctuation:

df['tweet_clean'] = df['tweet'].apply(lambda x: remove_punctuation(x))
df.head()

Look, ‘tweet_clean’ does not have any of that punctuation anymore!

Tokenization

The next cleaning steps will be applied to individual words, not to whole sentences, so the words need to be separated first. In the next code block, a function ‘tokenize’ is defined to split the sentences into words, and then this function is applied to the newly created column ‘tweet_clean’.

def tokenize(text):
    # Split the sentence into a list of words on single spaces
    return text.split(' ')

df['tweet_tokenized'] = df['tweet_clean'].apply(lambda x: tokenize(x))
df.head()

The new column ‘tweet_tokenized’ includes the list of words in each row. We can deal with the words now.
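Splitting on a single space is the simplest possible tokenizer, and it can produce empty strings when two spaces appear in a row. As an optional alternative (not what the rest of this article uses), nltk ships a more robust tokenizer:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models; newer nltk versions may ask for 'punkt_tab' instead
print(word_tokenize("thanks for the lyft credit, i can't use it"))
# ['thanks', 'for', 'the', 'lyft', 'credit', ',', 'i', 'ca', "n't", 'use', 'it']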

Stemming

Stemming is applied to the words to bring the words to their base form. For example, the word ‘talk’ may come in all these forms:

talk

talks

talked

talking

Stemming will bring any of these forms down to the simple ‘talk’. Here we need to:

Import the ‘PorterStemmer’ from the ‘nltk’ library

Initialize the PorterStemmer

Define a function ‘stemm’ that stems the words in a sentence, and

Then apply the function to the ‘tweet_tokenized’ column to stem the words:

from nltk.stem.porter import PorterStemmer

# Initialize the Porter stemmer
stemmer = PorterStemmer()

def stemm(text):
    # Stem each word in the list of tokens
    return [stemmer.stem(word) for word in text]

df['tweet_stemmed'] = df['tweet_tokenized'].apply(lambda x: stemm(x))

Look at the first row. The word ‘dysfunctional’ in the tweet_tokenized column becomes ‘dysfunct’ in the tweet_stemmed column. In the second row, the word ‘thanks’ in the ‘tweet_tokenized’ column becomes ‘thank’ in the ‘tweet_stemmed’ column.
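To see the stemmer in isolation, it can also be called directly on a few words (a quick standalone check, separate from the DataFrame):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Stems are base forms, but not always dictionary words
print([stemmer.stem(w) for w in ['dysfunctional', 'thanks', 'talking', 'talks']])
# ['dysfunct', 'thank', 'talk', 'talk']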

Lemmatization

Lemmatization is another way of bringing words to their base form. With stemming, some words may lose their meaning. For example, ‘dysfunctional’ becoming ‘dysfunct’ may not be meaningful in the sentence. Lemmatization gives you a little more control over which parts of speech to change. In this example, we will lemmatize only the verbs.

In the following code block:

Import the WordNetLemmatizer first

Then download ‘wordnet’ and ‘omw-1.4’

Initialize the lemmatizer

Define a function ‘lemmatization’ that lemmatizes only the verbs, and

Apply the function to the ‘tweet_tokenized’ column

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
nltk.download("omw-1.4")

# Initialize the WordNet lemmatizer
wnl = WordNetLemmatizer()

def lemmatization(text):
    # Lemmatize each token, treating it as a verb (pos='v')
    return [wnl.lemmatize(w, pos='v') for w in text]

df['tweet_lemmatized'] = df['tweet_tokenized'].apply(lambda x: lemmatization(x))

As you can see, this time we changed only the verbs. So, in the first row, ‘dysfunctional’ stays as ‘dysfunctional’ after lemmatization. But ‘thanks’ becomes ‘thank’ in the second row, ‘talking’ becomes ‘talk’ in the sixth row, and ‘camping’ becomes ‘camp’ in the seventh row.
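The pos='v' argument is what restricts the change to verbs. Calling the lemmatizer directly on single words shows the effect (a quick check, assuming ‘wordnet’ was already downloaded above):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('talking'))            # 'talking' - the default pos is noun, so nothing changes
print(wnl.lemmatize('talking', pos='v'))   # 'talk'    - treated as a verb
print(wnl.lemmatize('dysfunctional', pos='v'))  # 'dysfunctional' - not a verb form, left alone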

Conclusion

There are more text-processing techniques out there, but these are some of the most commonly used preprocessing steps for modeling. Not all of them need to be used all the time; use the ones that fit your needs.

A couple of other very common steps appear in most articles on this topic: making the words lowercase and removing stop words. I didn’t use them here because the text definitely has to be vectorized before modeling, and the most popular vectorizers can take care of lowercasing and stop-word removal themselves. So we don’t need to do that in the text-preprocessing steps.
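For example, scikit-learn’s CountVectorizer shows both options (a minimal sketch, my own illustration with a made-up corpus):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The result WAS great", "Thanks for the great result"]
# lowercase=True is the default; stop_words='english' drops words like 'the' and 'was'
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# ['great' 'result' 'thanks']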

Feel free to follow me on Twitter and like my Facebook page.

 

