Machine Learning with Text Data Using R




The domain of analytics that addresses how computers understand text is called Natural Language Processing (NLP). NLP has multiple applications like sentiment analysis, chatbots, AI agents, social media analytics, as well as text classification. In this guide, you will learn how to build a supervised machine learning model on text data, using the popular statistical programming language, ‘R’.


The data used in this guide comes from Kaggle, a machine learning competition website. It is a women’s clothing e-commerce dataset consisting of reviews written by customers. The task is to predict whether a customer will recommend the product or not. We work with a sample of the original dataset that contains 500 rows and three variables, as described below:

1. Clothing ID: This is the unique ID.
2. Review Text: Text containing reviews by the customer.
3. Recommended IND: Binary variable stating whether the customer recommends the product (“1”) or not (“0”). This is the target variable.

Let us start by loading the required libraries and the data.

library(readr)     # read_csv
library(dplyr)     # glimpse
library(tm)        # text mining
library(SnowballC) # stemming
t1 <- read_csv("ml_text_data.csv")  # loading the data
glimpse(t1)


Observations: 500
Variables: 3
$ Clothing_ID 	<int> 1088, 996, 936, 856, 1047, 862, 194, 1117, 996...
$ Review_Text 	<chr> "Yummy, soft material, but very faded looking....
$ Recommended_IND <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

The above output shows that the data has three variables, but the important ones are ‘Review_Text’ and ‘Recommended_IND’.

Preparing Data for Modeling

Since the text data is not in the traditional format of observations in rows, and variables in columns, we will have to perform certain text-specific steps. The list of such steps is discussed in the subsequent sections.

Step 1 – Create the Text Corpus

The variable containing text needs to be converted to a corpus for preprocessing. A corpus is a collection of documents. The first line of code below performs this task. The second line prints the content of the first document in the corpus, while the third line prints the corresponding recommendation score.

corpus = Corpus(VectorSource(t1$Review_Text))
corpus[[1]][1]
t1$Recommended_IND[1]


[1] "Yummy, soft material, but very faded looking. so much so that i am sending it back. if a faded look is something you like, then this is for you."
[1] 0

Looking at the review text, it is obvious that the customer was not happy with the product, and hence gave the recommendation score of zero.

Step 2 – Conversion to Lowercase

The model needs to treat words like ‘soft’ and ‘Soft’ as the same. Hence, all the words are converted to lowercase with the lines of code below.

corpus = tm_map(corpus, PlainTextDocument)
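corpus = tm_map(corpus, content_transformer(tolower))  # content_transformer() is the tm wrapper for plain character functions
corpus[[1]][1]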


[1] "yummy, soft material, but very faded looking. so much so that i am sending it back. if a faded look is something you like, then this is for you."  

Step 3 – Removing Punctuation

The idea here is to remove everything that isn’t a standard number or letter.

corpus = tm_map(corpus, removePunctuation)
corpus[[1]][1]


[1] "yummy soft material but very faded looking so much so that i am sending it back if a faded look is something you like then this is for you"
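To see what Steps 2 and 3 do to a single review, here is a minimal base R sketch that does not use tm. It assumes that lowercasing and deleting every character in the `[[:punct:]]` class is a fair stand-in for the two tm transformations; the variable names are illustrative only.

```r
# The first review from the sample data, as printed in Step 1
review <- "Yummy, soft material, but very faded looking. so much so that i am sending it back. if a faded look is something you like, then this is for you."

# Step 2: convert to lowercase
lowered <- tolower(review)

# Step 3: delete punctuation characters
no_punct <- gsub("[[:punct:]]+", "", lowered)

print(no_punct)
```

The printed string should match the Step 3 output shown above.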

Step 4 – Removing Stopwords

Stopwords are common words like ‘i’, ‘is’, ‘at’, ‘me’, ‘our’. They occur with high frequency in the corpus but do not help in differentiating the target classes, so removing them is important.

The line of code below uses the tm_map function on the ‘corpus’ and removes stopwords, as well as the word ‘cloth’. The word ‘cloth’ is removed because the dataset consists of clothing reviews, so this word will not add any predictive power to the model.

corpus = tm_map(corpus, removeWords, c("cloth", stopwords("english")))
corpus[[1]][1]


[1] "yummy soft material   faded looking  much     sending  back   faded look  something  like     "
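The extra spaces in the output above are what remains when words are deleted in place. The following base R sketch illustrates the idea on a short phrase. The two-word stopword list is a hypothetical stand-in for stopwords("english"), and the regular expression is an assumption about how such a filter can be written, not the tm source code.

```r
text <- "yummy soft material but very faded looking"

# Tiny illustrative stopword list (hypothetical, not the tm list)
stop_words <- c("but", "very")

# Match any stopword as a whole word
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Deleting the words leaves their surrounding spaces behind
removed <- gsub(pattern, "", text, perl = TRUE)
print(removed)

# Optional clean-up: collapse repeated whitespace
squished <- gsub("\\s+", " ", trimws(removed))
print(squished)
```

The leftover whitespace is harmless here because the later steps split the text into words again.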

Step 5 – Stemming

The idea behind stemming is to reduce the number of inflectional forms of words appearing in the text. For example, words such as “argue”, “argued”, “arguing”, “argues” are reduced to their common stem “argu”. This helps in decreasing the size of the vocabulary space. The lines of code below perform the stemming on the corpus.

corpus = tm_map(corpus, stemDocument)
corpus[[1]][1]


[1] "yummi soft materi fade look much send back fade look someth like"
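The “argue” example can be checked directly with the SnowballC package loaded earlier. This short sketch assumes SnowballC is installed and calls its wordStem function on individual words rather than on the corpus.

```r
library(SnowballC)

words <- c("argue", "argued", "arguing", "argues")

# Stem each word; all four should collapse to the same stem
stems <- wordStem(words)

print(unique(stems))
```

Because the four forms share one stem, they end up as a single column in the matrix built in the next section instead of four.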

Create Document Term Matrix

The most commonly used text preprocessing steps are complete. Now we are ready to extract the word frequencies, which will be used as features in our prediction problem. The line of code below uses the function called DocumentTermMatrix from the tm package and generates a matrix. The rows in the matrix correspond to the documents, in our case reviews, and the columns correspond to words in those reviews. The values in the matrix are the number of times each word appears in each document.

frequencies = DocumentTermMatrix(corpus)
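To make the layout of a document term matrix concrete, here is a small base R sketch that builds one by hand for three made-up, already-stemmed reviews. The documents and names are hypothetical, and the code only illustrates the rows-are-documents, columns-are-words structure; it is not how tm stores the matrix internally.

```r
# Three hypothetical documents, already cleaned and stemmed
docs <- c("soft fade look", "fade look fade", "love soft fit")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary becomes the columns
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word in each document
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)

# vapply returns words in rows, so transpose to get documents in rows
dtm <- t(counts)
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(dtm)
```

Even in this tiny example several cells are zero, which is the typical shape of word-count features.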

The resulting document term matrix contains zeroes in most of its cells, a property called sparsity. It is advisable to remove terms that are absent from almost all documents. The line of code below does this with a maximum allowed sparsity of 0.995, which drops every term that is missing from more than 99.5 percent of the reviews.

sparse = removeSparseTerms(frequencies, 0.995)
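The effect of the sparsity threshold can be sketched in base R on a toy matrix. The rule used here (keep a term only if the number of documents containing it is greater than the number of documents times one minus the threshold) is our reading of how removeSparseTerms behaves and should be treated as an assumption; the matrix and the threshold of 0.6 are made up for illustration.

```r
# Toy document term matrix: 4 documents, 3 terms (hypothetical values)
dtm <- matrix(
  c(1, 0, 1,
    2, 0, 0,
    0, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("fade", "fit", "look"))
)

# Number of documents in which each term occurs at least once
doc_freq <- colSums(dtm > 0)

# Maximum allowed sparsity (illustrative; the guide uses 0.995)
max_sparsity <- 0.6

# Keep terms that occur in enough documents
keep <- doc_freq > nrow(dtm) * (1 - max_sparsity)
reduced <- dtm[, keep, drop = FALSE]

print(colnames(reduced))
```

Under the same rule, a threshold of 0.995 on 500 reviews keeps terms that occur in at least three reviews, since 500 * 0.005 is 2.5.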

The final data preparation step is to convert the matrix into a data frame, a format widely used in ‘R’ for predictive modeling. The first line of code below converts the matrix into a data frame called ‘tSparse’. The second line makes all the variable names R-friendly, while the third line of code adds the dependent variable to the data set.

tSparse = as.data.frame(as.matrix(sparse))
colnames(tSparse) = make.names(colnames(tSparse))
tSparse$recommended_id = t1$Recommended_IND

Now we are ready for building the predictive model. But before that, it is always a good idea to set the baseline accuracy of the model. The baseline accuracy, in the case of a classification problem, is the proportion of the majority label in the target variable. The line of code below prints the proportion of the labels in the target variable, ‘recommended_id’.

prop.table(table(tSparse$recommended_id)) #73.6% is the baseline accuracy


      0     1
   0.264 0.736

The above output shows that 73.6 percent of the reviews are from customers who recommended the product. This becomes the baseline accuracy for predictive modeling.
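As a quick check of the baseline idea, the sketch below rebuilds a label vector from the reported proportions and computes the accuracy of always predicting the majority class. The counts 132 and 368 are inferred from 0.264 and 0.736 of 500 rows; the variable names are illustrative.

```r
# Labels reconstructed from the reported proportions (132 zeros, 368 ones)
labels <- c(rep(0, 132), rep(1, 368))

# Share of each class
shares <- prop.table(table(labels))
print(shares)

# A model that always predicts the majority class gets this accuracy
baseline <- max(shares)
print(baseline)
```

Any model worth keeping should score above this value on held-out data.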

Creating Training and Test Data for Machine Learning

To evaluate how the predictive model performs, we will divide the data into training and test data. The first line of code below loads the caTools package, which will be used for creating the training and test data. The second line sets the ‘random seed’ so that the results are reproducible. The third line creates the data partition so that 70% of the data is kept for training the model. The fourth and fifth lines of code create the training (‘trainSparse’) and testing (‘testSparse’) datasets.

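library(caTools)
set.seed(100)  # assumed value; the original seed is not shown, so your counts may differ
split = sample.split(tSparse$recommended_id, SplitRatio = 0.7)
trainSparse = subset(tSparse, split==TRUE)
testSparse = subset(tSparse, split==FALSE)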
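A 70/30 split that preserves the class proportions can be sketched in base R as follows. This is not the caTools implementation, only an illustration of a stratified split under the assumption that 70 percent of each class, rounded, goes to training. The label counts are reconstructed from the proportions reported earlier, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Labels reconstructed from the reported proportions
labels <- c(rep(0, 132), rep(1, 368))

# Pick 70% of each class for training
in_train <- logical(length(labels))
for (cls in unique(labels)) {
  idx <- which(labels == cls)
  n_train <- round(0.7 * length(idx))
  in_train[sample(idx, n_train)] <- TRUE
}

# Class counts in the held-out test set
test_counts <- table(labels[!in_train])
print(test_counts)
```

The test set ends up with 150 rows, 40 of class 0 and 110 of class 1, which matches the row totals of the confusion matrix reported in the next section.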

Random Forest

The Random Forest classification algorithm is a collection of several classification trees that operate as an ensemble. It is one of the most robust machine learning algorithms. In ‘R’, the randomForest library can be used to build the random forest model, which is loaded in the first line of code below. The second line sets the random state for reproducibility, while the third and fourth lines of code convert the target variable into the ‘factor’ type. The fifth line trains the random forest algorithm on the training data, while the sixth line uses the trained model to predict on the test data. The seventh line prints the confusion matrix.

library(randomForest)
set.seed(100)  # assumed value; the original seed is not shown, so your counts may differ
trainSparse$recommended_id = as.factor(trainSparse$recommended_id)
testSparse$recommended_id = as.factor(testSparse$recommended_id)
# Lines 5 to 7
RF_model = randomForest(recommended_id ~ ., data=trainSparse)
predictRF = predict(RF_model, newdata=testSparse)
table(testSparse$recommended_id, predictRF)
# Accuracy
117/(117+33) #78%

       0    1
  0   12   28
  1    5  105
[1] 0.78

The above output shows that out of 150 records in the test data, the model got the predictions correct for 117 of them, giving an accuracy of 78 percent.
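The accuracy figure can be derived from the confusion matrix itself rather than typed in by hand. The sketch below re-enters the reported counts (rows are actual labels, columns are predictions) and also computes the share of each class that was predicted correctly; the variable names are illustrative.

```r
# Confusion matrix as reported above: rows = actual, columns = predicted
cm <- matrix(
  c(12, 28,
     5, 105),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Share of each actual class that was predicted correctly
recall_0 <- cm["0", "0"] / sum(cm["0", ])
recall_1 <- cm["1", "1"] / sum(cm["1", ])

print(c(accuracy = accuracy, recall_0 = recall_0, recall_1 = recall_1))
```

Note that the model identifies most recommended products (105 of 110) but only 12 of the 40 reviews that did not recommend the product, so accuracy alone hides a weakness on the minority class.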

Evaluation of the Predictive Model

The baseline accuracy we had set for our data was 73.6 percent. The Random Forest model comfortably beats this baseline, achieving an accuracy of 78 percent on the test data.


In this guide, you have learned the fundamentals of text cleaning and pre-processing using the powerful statistical programming language, ‘R’. You also learned how to build and evaluate a random forest classification algorithm on the text data. The random forest model outperformed the baseline method.


Two Machine Learning Fields

There are two sides to machine learning:

  • Practical Machine Learning: This is about querying databases, cleaning data, writing scripts to transform data, gluing algorithms and libraries together, and writing custom code to squeeze reliable answers from data to satisfy difficult and ill-defined questions. It’s the mess of reality.
  • Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.

