Visualization of Text Data Using Word Cloud in R
Visualization plays an important role in exploratory data analysis and feature engineering. However, visualizing text data can be tricky because it is unstructured. A word cloud is a convenient way to visualize text data as a set of tags, or words, where the importance of a word is indicated by its frequency.
In this guide, you will learn how to visualize text data with a word cloud using the popular statistical programming language R. We will begin by understanding the data.
The data we’ll be using in this guide comes from Kaggle, a machine learning competition website. It is a women’s clothing e-commerce dataset consisting of reviews written by customers. In this guide, we use a sample of the original dataset. The sampled data contains 500 rows and three variables, as described below:
- Clothing ID: Categorical variable that refers to the specific piece being reviewed. This is a unique ID.
- Review Text: The text containing a review about the product. This is a string variable.
- Recommended IND: Binary variable stating whether the customer recommends the product (“1”) or not (“0”).
Let’s start by loading the required libraries and the data.
library(readr)
library(dplyr)
library(e1071)
library(mlbench)

# Text mining packages
library(tm)
library(SnowballC)
library("wordcloud")
library("RColorBrewer")

# Loading the data
t1 <- read_csv("ml_text_data.csv")
glimpse(t1)
Observations: 500
Variables: 3
$ Clothing_ID     <int> 1088, 996, 936, 856, 1047, 862, 194, 1117, 996...
$ Review_Text     <chr> "Yummy, soft material, but very faded looking....
$ Recommended_IND <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
The output shows that the dataset has three variables, but the important one is the ‘Review_Text’ variable.
Preparing Data for Word Cloud Visualization
Text data needs to be converted into a format that can be used for creating the word cloud. Since the text data is not in the traditional format (observations in rows and variables in columns), we will have to perform certain text-specific steps. The list of such steps is discussed in the subsequent sections.
Create the Text Corpus
The first step is to convert the column containing text into a corpus for preprocessing. A corpus is a collection of documents. The first line of code below performs this task, while the second line prints the content of the first document in the corpus.
# Create corpus
corpus = Corpus(VectorSource(t1$Review_Text))

# Look at the first document in the corpus
corpus[[1]][1]
[1] "Yummy, soft material, but very faded looking. so much so that i am sending it back. if a faded look is something you like, then this is for you."
Looking at the output, it is obvious that the customer was not happy with the product.
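As a small illustration of what a corpus holds, the sketch below builds one from two made-up review strings (the vector `reviews` and the name `toy_corpus` are hypothetical and not part of the dataset) and then checks the number of documents and reads one back.

```r
library(tm)

# Hypothetical review texts, used only for illustration
reviews <- c("Soft material, fits well.",
             "Too small, sending it back.")

# Each element of the character vector becomes one document
toy_corpus <- Corpus(VectorSource(reviews))

# Number of documents in the corpus
n_docs <- length(toy_corpus)
print(n_docs)

# Text of the second document
second_doc <- content(toy_corpus[[2]])
print(second_doc)
```

The same two calls, `length()` and `content()`, can be applied to the review corpus to confirm that it contains one document per row of the data.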
Once the corpus is created, text cleaning and pre-processing steps have to be performed. These steps are summarized below.
Conversion to Lowercase: Words like ‘soft’ and ‘Soft’ should be treated as the same, so these have to be converted to lowercase.
Removing Punctuation: The idea here is to remove everything that isn’t a standard number or letter.
Removing Stopwords: Stopwords are unhelpful words like ‘i’, ‘is’, ‘at’, ‘me’, ‘our’. The removal of Stopwords is therefore important.
Stemming: The idea behind stemming is to reduce the number of inflectional forms of words appearing in the text. For example, words such as “argue”, “argued”, “arguing”, “argues” are reduced to their common stem “argu”. This helps in decreasing the size of the vocabulary space.
Eliminating extra white spaces: The idea here is to strip whitespaces from the text.
The lines of code below perform the above steps.
# Conversion to lowercase (content_transformer keeps the result a valid tm document)
corpus = tm_map(corpus, content_transformer(tolower))

# Removing punctuation
corpus = tm_map(corpus, removePunctuation)

# Remove stopwords, plus the domain word "cloth"
corpus = tm_map(corpus, removeWords, c("cloth", stopwords("english")))

# Stemming
corpus = tm_map(corpus, stemDocument)

# Eliminate extra white spaces
corpus = tm_map(corpus, stripWhitespace)

# Look at the first document again
corpus[[1]][1]
$content
[1] "yummi soft materi fade look much send back fade look someth like"
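To see what each cleaning step does in isolation, the same functions can be applied to a plain character string instead of a corpus. The sketch below is only an illustration: the sentence in `review` is made up, and the `step1` to `step4` names are hypothetical. Stemming is shown on the word list from the stemming description above, using `wordStem` from SnowballC.

```r
library(tm)         # removePunctuation, removeWords, stopwords, stripWhitespace
library(SnowballC)  # wordStem

# Hypothetical review text, used only for illustration
review <- "The Dress is SOFT, and it fits me perfectly!"

step1 <- tolower(review)                            # lowercase
step2 <- removePunctuation(step1)                   # drop punctuation marks
step3 <- removeWords(step2, stopwords("english"))   # drop English stopwords
step4 <- trimws(stripWhitespace(step3))             # collapse and trim spaces
print(step4)

# Stemming reduces inflected forms to a common stem
stems <- wordStem(c("argue", "argued", "arguing", "argues"))
print(unique(stems))  # "argu"
```

Printing the intermediate values `step1` to `step3` shows that removing stopwords leaves gaps of several spaces, which is why whitespace stripping comes last.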
Create Document Term Matrix
The text preprocessing steps are complete. Now, we are ready to extract the word frequencies, to be used as tags, for building the word cloud. The lines of code below create the term document matrix (terms in rows, documents in columns), sum each row to get the total frequency of every word, and store the words and their frequencies in a data frame, ‘dat’. The head(dat, 5) command prints the five most frequent words in the corpus.
DTM <- TermDocumentMatrix(corpus)
mat <- as.matrix(DTM)
f <- sort(rowSums(mat), decreasing = TRUE)
dat <- data.frame(word = names(f), freq = f)
head(dat, 5)
   word freq
  dress  286
   look  248
   size  241
    fit  221
   love  185
The above output shows that words like ‘dress’, ‘look’, and ‘size’ are among the most frequent words in the corpus. This is not surprising given that the data consists of clothing reviews.
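The shape of the term document matrix is easier to see on a tiny example. The sketch below uses three made-up, already-cleaned documents (`docs`, `toy_corpus`, `toy_tdm`, and the other `toy_` names are hypothetical) and follows the same steps as the code above: build the matrix, then sum each row to get a word's total frequency.

```r
library(tm)

# Hypothetical, already-cleaned documents used only for illustration
docs <- c("dress fits well", "love this dress", "dress looks great")

toy_corpus <- Corpus(VectorSource(docs))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# One row per term, one column per document
toy_mat <- as.matrix(toy_tdm)
print(dim(toy_mat))

# Row sums give the total count of each term across all documents
toy_freq <- sort(rowSums(toy_mat), decreasing = TRUE)
print(head(toy_freq, 3))
```

Because ‘dress’ appears once in each of the three documents, its row sums to 3 and it sorts to the top, which is the same logic that puts ‘dress’ first in the review data.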
Word Cloud Generation
Word Cloud in ‘R’ is generated using the wordcloud function. The major arguments of this function are given below:
words: The words to be plotted.
freq: The frequencies of the words.
min.freq: An argument that ensures that words with a frequency below ‘min.freq’ will not be plotted in the word cloud.
max.words: The maximum number of words to be plotted.
random.order: An argument that specifies plotting of words in random order. If false, the words are plotted in decreasing frequency.
rot.per: The proportion of words with 90 degree rotation (vertical text).
colors: An argument that specifies coloring of words from least to most frequent.
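The effect of ‘min.freq’ and ‘max.words’ can be previewed without drawing anything. The sketch below is a base R approximation of what those two arguments mean, not the package's internal code. The helper `select_words` is hypothetical, the first five rows reuse the frequencies printed earlier, and the last two rows (‘zipper’ and ‘hem’) are made-up low-frequency words added for illustration.

```r
# Frequency table: first five rows from the output above,
# last two rows are hypothetical rare words
toy <- data.frame(
  word = c("dress", "look", "size", "fit", "love", "zipper", "hem"),
  freq = c(286, 248, 241, 221, 185, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep words at or above min.freq,
# then keep at most max.words of the most frequent ones
select_words <- function(d, min.freq = 3, max.words = Inf) {
  d <- d[order(d$freq, decreasing = TRUE), ]
  d <- d[d$freq >= min.freq, ]
  d[seq_len(min(nrow(d), max.words)), ]
}

kept_all <- select_words(toy, min.freq = 3)
kept_top <- select_words(toy, min.freq = 3, max.words = 3)
print(kept_all$word)
print(kept_top$word)
```

With ‘min.freq = 3’ the two rare words drop out, and adding ‘max.words = 3’ leaves only the three most frequent words.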
We will build word clouds using the different arguments and visualize how they change the output.
The first word cloud will use the mandatory arguments ‘words’ and ‘freq’, and we will set ‘random.order = TRUE’. The first line of code below sets the seed for reproducibility of the result, while the second line generates the word cloud.
set.seed(100)
wordcloud(words = dat$word, freq = dat$freq, random.order = TRUE)
The output above shows that there is no specific order – ascending or descending – in which the words are displayed. The words that are prominent, such as dress, size, fit, perfect, or fabric, represent the words that have the highest frequency in the corpus.
Word Cloud 2
Now, we change the additional argument by setting ‘random.order = FALSE’. The output generated shows that the words are now plotted in decreasing frequency, which means that the most frequent words are in the center of the word cloud, while the words with lower frequency are farther away from the center.
set.seed(100)
wordcloud(words = dat$word, freq = dat$freq, random.order = FALSE)
The previous two word clouds used only one additional argument, ‘random.order’. However, there are other arguments that can be used. We will now create the word cloud by changing the other arguments, which is done in the lines of code below.
set.seed(100)
wordcloud(words = dat$word, freq = dat$freq, min.freq = 3, max.words = 250,
          random.order = FALSE, rot.per = 0.30, colors = brewer.pal(8, "Dark2"))
The output now has different colors, displayed as per the frequency of the words in the corpus. Other arguments have also changed the appearance of the word cloud.
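To keep a copy of the plot, the word cloud can be drawn to an image file rather than the screen. The sketch below is one possible way to do this with base R's png() device. The output path `out_file` and the image size are arbitrary assumptions, and `toy` is a small stand-in for ‘dat’ built from the five frequencies printed earlier.

```r
library(wordcloud)
library(RColorBrewer)

# Small stand-in for `dat`, using the five frequencies shown earlier
toy <- data.frame(
  word = c("dress", "look", "size", "fit", "love"),
  freq = c(286, 248, 241, 221, 185),
  stringsAsFactors = FALSE
)

# The palette is a character vector of eight hex colour codes
pal <- brewer.pal(8, "Dark2")
print(pal)

# Hypothetical output location; any writable path works
out_file <- file.path(tempdir(), "wordcloud.png")

png(out_file, width = 800, height = 800)  # open the file device
set.seed(100)
wordcloud(words = toy$word, freq = toy$freq, min.freq = 3,
          random.order = FALSE, rot.per = 0.30, colors = pal)
dev.off()                                  # close the device to write the file
```

Closing the device with dev.off() is what actually writes the file, so it should always follow the wordcloud() call.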
In this guide, we have explored how to build a word cloud and the important parameters that can be altered to improve its appearance. You also learned how to clean and prepare the text required for generating the word cloud, and how to identify the most and least frequent words in it.
Two Machine Learning Fields
There are two sides to machine learning:
- Practical Machine Learning: This is about querying databases, cleaning data, writing scripts to transform data, gluing algorithms and libraries together, and writing custom code to squeeze reliable answers from data to satisfy difficult and ill-defined questions. It’s the mess of reality.
- Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.