Text Files Processing, Cleaning, and Classification of Documents in R

With the increasing number of text documents, text document classification has become an important task in data science. At the same time, machine learning and data mining techniques are also improving every day. Both Python and R programming languages have amazing functionalities for text data cleaning and classification.

This article focuses on processing and classifying text documents using R libraries.

Problem Statement

The data used here is a set of text files packed in a folder named 20Newsgroups. This folder has two subfolders: one contains the training data and the other contains the test data. Each subfolder contains 20 folders, and each of those 20 folders contains hundreds of files, which are news posts on different topics. The purpose of this project is to select two topics and develop a classifier that can classify the files of those two topics.

Please feel free to download the dataset from this link and follow along:


Data Preparation

We will use the ‘tm’ library, which is a framework for text mining. This framework has a ‘texts’ folder built into it. Let’s find the path of the ‘texts’ folder on the computer.

First, call all the libraries required for this project:

library(tm) # Framework for text mining.
library(SnowballC) # Provides wordStem() for stemming.
library(dplyr) # Data preparation and pipes %>%.
library(ggplot2) # Plot word frequencies.
library(scales) # Scale functions for plots.
library(class) # Provides knn() for classification.
library(caret) # Provides confusionMatrix() for evaluation.

Using the system.file() function, the path of the ‘texts’ folder can be found:

system.file("texts", package = "tm")


[1] "C:/Users/User/Documents/R/win-library/4.0/tm/texts"

Then I put the ‘20Newsgroups’ folder in that ‘texts’ folder. Now, we will bring in the training and test data one by one. As mentioned in the problem statement, we will use only two of the 20 topics available in this folder. I chose ‘rec.autos’ and ‘sci.med’. Here is the path to the ‘rec.autos’ folder in the training folder:

mac.path.loc = system.file("texts", "20Newsgroups", "20news-bydate-train", "rec.autos", package = "tm")
mac.path.loc


[1] "C:/Users/User/Documents/R/win-library/4.0/tm/texts/20Newsgroups/20news-bydate-train/rec.autos"

It is a good idea to check what’s in this ‘rec.autos’ folder. Passing the path above to the ‘DirSource’ function gives us that information.

mac.files = DirSource(mac.path.loc)

The output is pretty big. Here I am showing part of the output:


$encoding
[1] ""

$length
[1] 594

$position
[1] 0

$reader
function (elem, language, id) 
{
    if (!is.null(elem$uri)) 
        id <- basename(elem$uri)
    PlainTextDocument(elem$content, id = id, language = language)
}

$mode
[1] "text"

$filelist
  [1] "C:/Users/User/Documents/R/win-library/4.0/tm/texts/20Newsgroups/20news-bydate-train/rec.autos/101551"
  [2] "C:/Users/User/Documents/R/win-library/4.0/tm/texts/20Newsgroups/20news-bydate-train/rec.autos/101552"
  [3] "C:/Users/User/Documents/R/win-library/4.0/tm/texts/20Newsgroups/20news-bydate-train/rec.autos/101553"
  [4] "C:/Users/User/Documents/R/win-

The $length variable at the top of the output shows 594. That means the ‘rec.autos’ folder has 594 items in it, that is, 594 text files. How do I know that they are text files? Here in the output, the $mode variable says ‘text’. After that comes the list of the 594 files, but I am only showing 4 to save space.

Now I want to make a function named ‘fun.corpus’ that brings in a specified number of files from the train or test folder for a specified topic. An explanation follows the function.

fun.corpus = function(t, f, n){
  mac.path.loc = system.file("texts", "20Newsgroups", t, f, package = "tm")
  mac.files = DirSource(mac.path.loc)
  mac.corpus = VCorpus(URISource(mac.files$filelist[1:n]),
                     readerControl = list(reader=readPlain))
  return(mac.corpus)
}

The function takes three parameters: ‘t’ is the test or training folder, ‘f’ is the topic (we chose ‘sci.med’ and ‘rec.autos’), and ‘n’ is the number of files we want to load. The ‘VCorpus’ function converts the files into a corpus format. If this format is new to you, it will become clearer as we keep moving, so don’t worry about it.
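To get a feel for the corpus format without touching the newsgroup files, here is a minimal sketch that builds a tiny corpus in memory. It assumes the ‘tm’ package is installed; the three documents and the name toy.corpus are made up for illustration and are not part of the dataset.

```r
library(tm)  # assumes the 'tm' package is installed

# Three made-up documents standing in for newsgroup files (illustrative only).
docs <- c("From: a@x.org\nSubject: brakes",
          "From: b@y.edu\nSubject: vaccine",
          "From: c@z.com\nSubject: tires")

# VectorSource() treats each element of the vector as one document.
toy.corpus <- VCorpus(VectorSource(docs))

length(toy.corpus)          # number of documents in the corpus
class(toy.corpus[[1]])      # each element is a PlainTextDocument
content(toy.corpus[[2]])    # the raw text of the second document
```

A corpus behaves much like a list of documents, which is why the code in this article can index it with [[i]].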

Now, using this function I brought in 300 files from the ‘sci.med’ topic and 300 files from the ‘rec.autos’ topic from the training folder. For the test folder, I brought in 200 files from each topic.

rautos_train = fun.corpus("20news-bydate-train", "rec.autos", 300)
smed_train = fun.corpus("20news-bydate-train", "sci.med", 300)
rautos_test = fun.corpus("20news-bydate-test", "rec.autos", 200)
smed_test = fun.corpus("20news-bydate-test", "sci.med", 200)

Let’s just check one of the files in the rautos_train corpus we just created.



Metadata:  7
Content:  chars: 2589

From: cs012055@cs.brown.edu (Hok-Chung Tsang)
Subject: Re: Saturn's Pricing Policy
Article-I.D.: cs.1993Apr5.230808.581
Organization: Brown Computer Science Dept.
Lines: 51

In article <C4vIr5.L3r@shuksan.ds.boeing.com>, fredd@shuksan (Fred Dickey) writes:
|> CarolinaFan@uiuc (cka52397@uxa.cso.uiuc.edu) wrote:
|> :  I have been active in defending Saturn lately on the net and would
|> : like to state my full opinion on the subject, rather than just reply to others'
|> : points.
|> :  
|> :  The biggest problem some people seem to be having is that Saturn
|> : Dealers make ~$2K on a car.  I think most will agree with me that the car is
|> : comparably priced with its competitors, that is, they aren't overpriced 
|> : compared to most cars in their class.  I don't understand the point of 
|> : arguing over whether the dealer makes the $2K or not?  
|> I have never understood what the big deal over dealer profits is either.
|> The only thing that I can figure out is that people believe that if
|> they minimize the dealer profit they will minimize their total out-of-pocket
|> expenses for the car. While this may be true in some cases, I do not
|> believe that it is generally true. I bought a Saturn SL in January of '92.
|> AT THAT TIME, based on studying car prices, I decided that there was
|> no comparable car that was priced as cheaply as the Saturn. Sure, maybe I
|> could have talked the price for some other car to the Saturn price, but
|> my out-of-pocket expenses wouldn't have been any different. What's important
|> to me is how much money I have left after I buy the car. REDUCING DEALER PROFIT
|> IS NOT THE SAME THING AS SAVING MONEY! Show me how reducing dealer profit
|> saves me money, and I'll believe that it's important. My experience has
|> been that reducing dealer profit does not necessarily save me money.
|> Fred

Say, you bought your Saturn at $13k, with a dealer profit of $2k.
If the dealer profit is $1000, then you would only be paying $12k for
the same car.  So isn't that saving money?

Moreover, if Saturn really does reduce the dealer profit margin by $1000, 
then their cars will be even better deals.  Say, if the price of a Saturn was
already $1000 below market average for the class of cars, then after they
reduce the dealer profit, it would be $2000 below market average.  It will:

1) Attract even more people to buy Saturns because it would SAVE THEM MONEY.

2) Force the competitors to lower their prices to survive.

Now, not only will Saturn owners benefit from a lower dealer profit, even 
the buyers for other cars will pay less.

Isn't that saving money?

$0.02,

Notice that the beginning of the file shows the email address the text came from, the subject, and the organization name. In this article, I used only these three pieces of information to classify the files.

If you want, feel free to use the full document. I tried that too and got similar results for this dataset.

Here is the ‘ext’ function, which takes a corpus and the number of files as input and returns a list of vectors containing only the email address, organization name, and subject of each text file. An explanation follows the function.

ext = function(corp, n){
  meta.info = list()
  for (i in 1:n){
    g1 = grep("From: ", corp[[i]]$content)
    g2 = grep("Organization: ", corp[[i]]$content)
    g3 = grep("Subject: ", corp[[i]]$content)
    each_c = c(corp[[i]]$content[g1], corp[[i]]$content[g2], corp[[i]]$content[g3])
    meta.info[[i]] = each_c
  }
  return(meta.info)
}

Here is the explanation of what has been done here. Two parameters are passed to this function: a corpus and the number of files that need to be worked on. First, the function loops through the files. From each file, it extracts the lines of text that contain the strings “From: ”, “Organization: ”, and “Subject: ”. It then makes a vector of only these three pieces of information for each file and adds it to the list. Using this function, we can extract the necessary information from all the corpora we created before.
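The extraction step can be tried on a single message without any packages. The sketch below uses a made-up message stored as a vector of lines, which is the shape the grep() calls in ext() expect; the names msg and header are illustrative only.

```r
# One made-up message, stored as a character vector with one element per line.
msg <- c("From: jane@example.edu (Jane Doe)",
         "Subject: Re: brake pads",
         "Organization: Example University",
         "Lines: 2",
         "",
         "Has anyone tried ceramic pads?")

# grep() returns the indices of the lines that contain each pattern.
g1 <- grep("From: ", msg)
g2 <- grep("Organization: ", msg)
g3 <- grep("Subject: ", msg)

# Same ordering as in ext(): sender, organization, subject.
header <- c(msg[g1], msg[g2], msg[g3])
header
```

Because grep() matches the pattern anywhere in a line, a quoted line in the message body that happens to contain “From: ” would be picked up as well, so some vectors may hold more than three elements.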

sm_train = ext(smed_train, 300)
sm_test = ext(smed_test, 200)
ra_train = ext(rautos_train, 300)
ra_test = ext(rautos_test, 200)

Now merge all the lists.

merged = c(sm_train, ra_train, ra_test, sm_test)
merged.vec = VectorSource(merged)

This is a big list with 1000 objects in it, from both the training and test folders.

Converting the ‘merged’ list into a corpus again.

v = VCorpus(merged.vec)

Checking an element of this corpus:



Metadata:  7
Content:  chars: 114

From: bed@intacc.uucp (Deb Waddington)
Organization: Matrix Artists' Network
Subject: INFO NEEDED: Gaucher's Disease

It has 114 characters that include the information that we extracted. So, it’s perfect!

Data Cleaning

Data cleaning is very important. It improved the performance of the classifier significantly every time I worked with text data. It is hard to find perfect text data, so most of the time we have to clean it.

We made each text very small: only three pieces of information. I will replace the ‘@’ symbol in the email address with a space, remove the ‘From: ’, ‘Organization: ’, and ‘Subject: ’ parts of the strings because they are not very informative, remove the punctuation, and stem the data. I am saving the transformed corpus in the ‘temp.v’ variable.

transform.words = content_transformer(function(x, from, to) gsub(from, to, x))
temp.v = tm_map(v, transform.words, "@", " ")
temp.v = tm_map(temp.v, transform.words, "From: |Organization: |Subject: ", "")
temp.v = tm_map(temp.v, removePunctuation)
temp.v = tm_map(temp.v, stemDocument, language = "english")

Here, the transform.words function is built with the ‘content_transformer’ function. It takes three parameters: ‘x’ is the content of a document in the corpus, ‘from’ is the pattern (in this case the ‘@’ symbol or “From: |Organization: |Subject: ”), and ‘to’ is the replacement (in this case a space or nothing).
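The replacement itself is plain gsub(), so its effect can be checked on ordinary strings before applying it to a corpus. This is a small base R sketch with made-up header lines; replace.pattern is a hypothetical name for the same three-argument function that is wrapped above.

```r
# Made-up header lines, similar in shape to what ext() returns.
x <- c("From: jane@example.edu (Jane Doe)",
       "Organization: Example University",
       "Subject: Re: brake pads")

# Same three arguments as the function wrapped by content_transformer().
replace.pattern <- function(x, from, to) gsub(from, to, x)

# Step 1: replace the '@' symbol with a space.
step1 <- replace.pattern(x, "@", " ")

# Step 2: drop the field labels; '|' separates the alternative patterns.
step2 <- replace.pattern(step1, "From: |Organization: |Subject: ", "")
step2
```

Punctuation removal and stemming are left out of this sketch because they rely on the tm and SnowballC functions used in the article.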

Checking if the transformation worked:



Metadata:  7
Content:  chars: 76

bed intaccuucp Deb Waddington
Matrix Artist Network
INFO NEEDED Gaucher Diseas

It worked! There is no ‘@’ symbol, no ‘From: ’, ‘Organization: ’, or ‘Subject: ’ part, and no punctuation. If you look carefully at the words, some of them are stemmed.

Developing and Training the Classifier

This is the fun part! We need a document term matrix for that. I converted the temp.v corpus to a document term matrix because the classifier does not take a corpus; a matrix is the right format for the data.

dtm = DocumentTermMatrix(temp.v, control = list(wordLengths = c(2, Inf), bounds = list(global=c(5, Inf))))
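If the document term matrix is new to you, the idea can be illustrated in base R on a few short texts. This is only a sketch of the concept, not what tm does internally: the four documents are made up, the names toy.dtm and doc.freq are illustrative, and the document-frequency threshold is 2 here simply because the toy data is tiny.

```r
# Four made-up, already-cleaned documents (illustrative only).
docs <- c("saturn dealer price", "dealer profit price",
          "diseas info need", "saturn price")

# Split each document into terms on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are term counts.
toy.dtm <- t(vapply(tokens,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
colnames(toy.dtm) <- vocab

# Keep only the terms that occur in at least two documents.
doc.freq <- colSums(toy.dtm > 0)
kept <- toy.dtm[, doc.freq >= 2, drop = FALSE]
kept
```

Dropping rare terms keeps the matrix small, which matters for a distance-based classifier because every remaining column becomes one dimension of the distance.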

Remember, when we imported the data from the folders, we took a total of 600 files from the training folder and 400 from the test folder, and later created the merged corpus with the training corpora on top. Let’s separate the training and test data from the document term matrix.

dtm.train = dtm[1:600,]
dtm.test = dtm[601:1000,]

To train the model, the classifier needs the right labels or tags for the training data, that is, which topic each file belongs to. The tags are also required for the test data, because after we predict the tags we need a reference to compare against to find out the prediction accuracy.

We know the sequence of the data. So, here are the tags for training and test data:

tags = factor(c(rep("smed", 300), rep("rauto", 300)))
tags1 = factor(c(rep("rauto", 200), rep("smed", 200)))

Here, ‘tags’ holds the labels for the training data and ‘tags1’ holds the labels for the test data.

I used a K Nearest Neighbor or KNN classifier. You need to install the ‘class’ library for that if you do not have it already. The classifier takes the training data, the test data, the tags for the training data, and the ‘k’ value.

Here is a brief, high-level idea of the KNN classifier. It does not build an explicit model; it keeps the training data and their tags. When it gets a new data point to classify, it computes the distance from that point to the training points and assigns the label that is most common among the closest neighbors. This is where the value of K comes in. If K is 5, the new data point gets the majority label of its five nearest training points. You need to find a suitable value of K.
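To make the idea concrete, here is a minimal hand-written sketch of the voting step in base R. It assumes Euclidean distance and uses made-up two-dimensional points; knn.one is a hypothetical helper written for this illustration and is not the knn() function from the ‘class’ library.

```r
# Made-up 2-D training points with known tags (illustrative only).
train <- rbind(c(1, 1), c(1, 2), c(2, 1),    # tagged "rauto"
               c(6, 6), c(6, 7), c(7, 6))    # tagged "smed"
train.tags <- c("rauto", "rauto", "rauto", "smed", "smed", "smed")

# Hypothetical helper: classify one new point by majority vote
# among its k nearest training points.
knn.one <- function(train, train.tags, new.point, k) {
  # Euclidean distance from the new point to every training point.
  d <- sqrt(rowSums(sweep(train, 2, new.point)^2))
  nearest <- order(d)[1:k]            # indices of the k closest points
  votes <- table(train.tags[nearest]) # count the tags among them
  names(votes)[which.max(votes)]      # most frequent tag wins
}

pred1 <- knn.one(train, train.tags, c(2, 2), k = 3)
pred2 <- knn.one(train, train.tags, c(5, 6), k = 3)
pred1
pred2
```

In this sketch a tied vote goes to the tag that comes first alphabetically, whereas knn() breaks ties at random, which is one reason to call set.seed() before using it.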

Here I am using the value of K as 4.

set.seed(245)
prob.test = knn(dtm.train, dtm.test, tags, k=4, prob = TRUE)
prob.test

Here is the part of the output:

  [1] rauto rauto rauto rauto rauto smed  smed  smed  smed  smed  smed 
 [12] rauto rauto smed  rauto rauto smed  rauto rauto rauto rauto rauto
 [23] smed  rauto rauto smed  rauto rauto rauto smed  smed  rauto smed 
 [34] rauto rauto smed  rauto smed  smed  rauto smed  smed  rauto rauto
 [45] rauto rauto rauto smed  rauto smed  smed  smed  smed  smed  smed 
 [56] smed  smed  rauto rauto smed  smed  rauto rauto rauto smed  rauto
 [67] smed  rauto rauto smed  smed  rauto rauto rauto rauto smed  rauto
 [78] smed  rauto rauto smed  rauto rauto rauto smed  smed  rauto rauto
 [89] rauto rauto smed  rauto rauto smed  rauto rauto smed  smed  smed

It has a total of 400 predictions: the classifier uses the training data and tags to predict a label for each of the 400 test files.

Now, I will make a data frame with three columns: a serial number from 601 to 1000, the predicted labels from the classifier, and whether each label is correct.

a = 601:1000
b = levels(prob.test)[prob.test]
c = prob.test==tags1
res = data.frame(SI = a, Predict = b, Correct = c)
head(res)


Here is the accuracy of the prediction:



[1] 0.685
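A figure like this is the share of predictions that match the true tags. As a minimal sketch, using short made-up vectors rather than the article's data, it can be computed by taking the mean of a logical vector; the names predicted, actual, and accuracy are illustrative.

```r
# Made-up predictions and true tags (illustrative only, not the article's data).
predicted <- factor(c("rauto", "smed", "smed", "rauto", "smed"))
actual    <- factor(c("rauto", "smed", "rauto", "rauto", "rauto"))

correct  <- predicted == actual   # TRUE where the prediction matches
accuracy <- mean(correct)         # TRUE counts as 1, FALSE as 0
accuracy
```

Three of the five made-up predictions match, so the sketch gives 0.6.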

So, the accuracy is 68.5%. But this accuracy rate may vary with a different k value. I created a function that takes the k value as a parameter and returns the accuracy rate.

Here is the function:

prob.t = function(n){
  prob.test = knn(dtm.train, dtm.test, tags, k=n, prob = TRUE)
  a = 601:1000
  b = levels(prob.test)[prob.test]
  c = prob.test==tags1
  return(mean(c)) # share of correct predictions
}

Using this function, the accuracy rate is calculated for k values of 1 to 12:

res.list = c()
for (i in 1:12){
  acc = prob.t(i)
  res.list = append(res.list, acc)
}
res.list

[1] 0.7175 0.6875 0.7150 0.6850 0.6625 0.6050 0.6000 0.5850 0.5750
[10] 0.5725 0.5675 0.5475

A scatter plot of k value vs accuracy will show the trend of how the accuracy rate changed with the k value.

res.data = data.frame(K_values = 1:12, res.list = res.list)
ggplot(res.data, aes(x = K_values, y = res.list)) + geom_point() +
  labs(
    title = "Accuracy vs k values",
    x = "K Values",
    y = "Accuracy"
  )


Accuracy is highest for the k value of 1 and then for the k value of 3. After that, it kept going down.
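The same conclusion can be read off the numbers directly. The sketch below copies the twelve accuracy values printed above into a vector (the names acc, best.k, and top3 are illustrative) and asks R for the best-performing k values.

```r
# Accuracy values as reported above for k = 1 to 12.
acc <- c(0.7175, 0.6875, 0.7150, 0.6850, 0.6625, 0.6050,
         0.6000, 0.5850, 0.5750, 0.5725, 0.5675, 0.5475)

best.k <- which.max(acc)                      # position of the highest accuracy
top3   <- order(acc, decreasing = TRUE)[1:3]  # three best k values, best first
best.k
top3
```

Since position i in the vector corresponds to k = i, the positions returned are the k values themselves.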

Analysis of the Efficiency of the Classifier

I created the data frame of the prediction accuracy again using the k value of 1:

prob.test = knn(dtm.train, dtm.test, tags, k=1, prob = TRUE)
a = 601:1000
b = levels(prob.test)[prob.test]
c = prob.test==tags1
res = data.frame(SI = a, Predict = b, Correct = c)


The confusion matrix, computed here with the confusionMatrix() function from the ‘caret’ library, provides a lot of information that helps evaluate the efficiency of the classifier.

confusionMatrix(as.factor(b), as.factor(tags1), positive = "smed")


Confusion Matrix and Statistics

          Reference
Prediction rauto smed
     rauto   105   18
     smed     95  182

Accuracy : 0.7175
95% CI : (0.6706, 0.7611)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.435

Mcnemar’s Test P-Value : 8.711e-13

Sensitivity : 0.9100
Specificity : 0.5250
Pos Pred Value : 0.6570
Neg Pred Value : 0.8537
Prevalence : 0.5000
Detection Rate : 0.4550
Detection Prevalence : 0.6925
Balanced Accuracy : 0.7175

‘Positive’ Class : smed

Look, the classifier labeled 105 ‘rec.autos’ files correctly and labeled 95 ‘rec.autos’ files wrongly as ‘sci.med’ files. On the other hand, it labeled 182 ‘sci.med’ files correctly and 18 ‘sci.med’ files wrongly as ‘rec.autos’ files.

Also, sensitivity is 0.91, which is the same as recall. The precision is 0.657, which is the ‘Pos Pred Value’ in the confusion matrix result above. Using these two, we can calculate the F1 score:

F1 = (2*0.657*0.91) / (0.657 + 0.91)
F1


[1] 0.7630759

F1 score is a great measure of the efficiency of a model. The closer the F1 score is to 1, the better the model is.
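Precision, recall, and F1 can also be derived straight from the counts in the confusion matrix. The sketch below copies the three relevant counts from the matrix above, with ‘smed’ as the positive class; the variable names tp, fp, and fn are illustrative.

```r
# Counts taken from the confusion matrix above, with "smed" as the positive class.
tp <- 182   # sci.med files predicted as smed (true positives)
fp <- 95    # rec.autos files predicted as smed (false positives)
fn <- 18    # sci.med files predicted as rauto (false negatives)

precision <- tp / (tp + fp)   # share of "smed" predictions that are right
recall    <- tp / (tp + fn)   # share of actual sci.med files that are found
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 4)
```

With unrounded precision the F1 score comes out at about 0.7631, marginally different from the value obtained from the rounded 0.657.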

If precision, recall, and F1 score are new to you, here is a detailed discussion on it.


I hope this demonstration was helpful. I extracted the sender, organization, and subject lines and ran the classifier on those. But please feel free to use the whole text and run the classifier again on your own. It will only take some extra time, as the whole texts are a lot of data. It may also take some more cleaning. For example, the complete text will have a lot of stop words, and you may consider taking them out to make the text a bit smaller. Also, I only used 2 groups of data, which made this a binary classification problem. But the KNN classifier also works for multi-class classification, so you may want to choose more groups.

