Beginners Guide to R – R Factor

R Factor

In real-world problems, you often encounter data that can be classified into categories.

For example, suppose a survey was conducted of a group of seven individuals, who were asked to identify their hair color and gender.

The result might appear as follows:

Name  Hair color  Gender
Amy   Blonde      Female
Bob   Black       Male
Eve   Black       Female
Kim   Red         Female
Max   Blonde      Male
Ray   Brown       Male
Sam   Black       Male

Here, hair color and gender are examples of categorical data. To store such categorical data, R has a special data structure called a factor.

A factor stores a vector of values together with the fixed set of distinct values it is allowed to take. These distinct values are called levels.

Create a Factor

In R, you can create a factor with the factor() function.

# Factor storing hair color values
hcolors <- c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black")
f <- factor(hcolors)
f
[1] Blonde Black  Black  Red    Blonde Brown  Black 
Levels: Black Blonde Brown Red
# Factor storing gender values
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male   Female Female Male   Male   Male  
Levels: Female Male
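
Factors are often kept as columns of a data frame. The following is a minimal sketch that builds the survey table from the start of this article and converts its categorical columns; the data frame name survey and the column names are assumptions made for illustration.

```r
# Survey data from the table above, held in a data frame.
# The names `survey`, `name`, `hair` and `gender` are illustrative.
survey <- data.frame(
  name   = c("Amy", "Bob", "Eve", "Kim", "Max", "Ray", "Sam"),
  hair   = c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black"),
  gender = c("Female", "Male", "Female", "Female", "Male", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Convert only the categorical columns; names stay as plain character strings
survey$hair   <- factor(survey$hair)
survey$gender <- factor(survey$gender)

is.factor(survey$hair)     # TRUE
is.character(survey$name)  # TRUE
```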

Factor Levels

A factor looks like a vector, but it has special properties. Levels are one of them.

Notice that when you print the factor, R displays the distinct levels below the factor. R keeps track of all the possible values in a vector, and each value is called a level of the associated factor.

gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male   Female Female Male   Male   Male  
Levels: Female Male

The levels() function shows all the levels from a factor.

gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
levels(f)
[1] "Female" "Male" 
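
To see how the levels relate to the stored values, it can help to look at the integer codes behind a factor. This short sketch reuses the gender example and only calls base R functions.

```r
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)

nlevels(f)                # number of levels: 2
as.integer(f)             # integer codes: 1 2 1 1 2 2 2
levels(f)[as.integer(f)]  # look the codes up in the levels to recover the labels
as.character(f)           # same labels, obtained directly
```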

Specifying Levels Explicitly

If your vector contains only a subset of all the possible levels, then R will have an incomplete picture of the possible levels.

Consider the following example of a vector consisting of directions:

# Factor with missing level "South"
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
f
[1] North West  North East  North West  East 
Levels: East North West

Notice that the levels of your new factor do not contain the value “South”. So, R thinks that North, West, and East are the only possible levels. However, in practice, it makes sense to have all the possible directions as levels of your factor.

To add all the possible levels explicitly, you specify the levels argument of factor().

# Specifying all the possible levels explicitly
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions,
            levels = c("North", "East", "South", "West"))
f
[1] North West  North East  North West  East 
Levels: North East South West
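
One practical effect of listing every level is that counts include the categories with no observations. The sketch below compares the two factors with table(); the names all_dirs, f_auto and f_full are illustrative. It also shows that a value not listed in levels becomes NA.

```r
directions <- c("North", "West", "North", "East", "North", "West", "East")
all_dirs   <- c("North", "East", "South", "West")

f_auto <- factor(directions)                     # levels inferred from the data
f_full <- factor(directions, levels = all_dirs)  # levels given explicitly

table(f_auto)            # East 2, North 3, West 2 (no South column)
counts <- table(f_full)
counts                   # North 3, East 2, South 0, West 2

# A value that is not among the given levels is stored as NA
is.na(factor("Up", levels = all_dirs))  # TRUE
```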

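You can also add a level to an existing factor with the levels() function. Assigning to levels() relabels the existing levels by position, so keep the current levels in their order and append the new one.

directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
# Append "South" without renaming the existing levels
levels(f) <- c(levels(f), "South")
f
[1] North West  North East  North West  East 
Levels: East North West South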
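
Because assigning a vector to levels() relabels the existing levels by position rather than matching them by name, it is worth confirming that the values themselves are unchanged afterwards. As a minimal sketch of one alternative, an existing factor can be passed back through factor() with the full set of levels, which matches values by name. The name f2 is illustrative.

```r
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)

# Re-level by name: the values are matched against the new levels
f2 <- factor(f, levels = c("North", "East", "South", "West"))

levels(f2)                               # "North" "East" "South" "West"
identical(as.character(f2), directions)  # TRUE: the data are unchanged
```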

Factor Labels

R lets you assign abbreviated names for the levels. You can do this by specifying the labels argument of factor().

directions <- c("North", "West", "South", "East", "West", "North", "South")
f <- factor(directions,
            levels = c("North", "East", "South", "West"),
            labels = c("N", "E", "S", "W"))
f
[1] N W S E W N S
Levels: N E S W

Ordered Factors

Sometimes data has some kind of natural order between elements.

For example, sports analysts use a three-point scale to determine how well a sports team is competing: loss < tie < win.

In market research, it’s very common to use a five-point scale to measure perceptions: strongly disagree < disagree < neutral < agree < strongly agree.

Data that can be placed in order or on a scale like this is known as ordinal data.

In R, ordinal data is stored in a special kind of factor called an ordered factor.

To create an ordered factor, use the factor() function with the argument ordered = TRUE. Unless you pass the levels argument, the levels are sorted alphabetically, which in the example below happens to match loss < tie < win.

# Create ordinal levels
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record, 
            ordered = TRUE)
f
[1] win  tie  loss tie  loss win  win 
Levels: loss < tie < win
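
When the alphabetical order does not match the natural order, the levels need to be given explicitly. The sketch below uses the five-point scale mentioned above; the responses vector is made-up illustration data and the name likert_levels is an assumption. It also shows that ordered factors support comparisons.

```r
likert_levels <- c("strongly disagree", "disagree", "neutral",
                   "agree", "strongly agree")

# Hypothetical survey responses, for illustration only
responses <- c("agree", "neutral", "strongly agree", "disagree", "agree")

f <- factor(responses, levels = likert_levels, ordered = TRUE)

is.ordered(f)  # TRUE
f[1] > f[2]    # TRUE: agree ranks above neutral
f >= "agree"   # TRUE FALSE TRUE FALSE TRUE
max(f)         # strongly agree
```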

You can also reverse the order of levels using the rev() function.

# Reverse the order of levels
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record, 
            ordered = TRUE, 
            levels = rev(levels(f)))
f
[1] win  tie  loss tie  loss win  win 
Levels: win < tie < loss

Recode Factor Levels

Suppose you have dining experience data with three levels (good, average and bad) and you want to recode them as happy, neutral and unhappy. You can do this with the revalue() function from the plyr package.

experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)
f
[1] good    average bad     good    bad     good    average
Levels: average bad good

plyr::revalue(f, c("good"="happy", "average"="neutral", "bad"="unhappy"))
[1] happy   neutral unhappy happy   unhappy happy   neutral
Levels: neutral unhappy happy
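
If plyr is not installed, the same recoding can be sketched in base R with a named lookup vector. This is one possible approach rather than the only one; the name lookup is illustrative.

```r
experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)

# Old level names on the left, new names on the right
lookup <- c(good = "happy", average = "neutral", bad = "unhappy")

# Look up each existing level by name so that the positions stay aligned
levels(f) <- unname(lookup[levels(f)])

as.character(f)  # "happy" "neutral" "unhappy" "happy" "unhappy" "happy" "neutral"
levels(f)        # "neutral" "unhappy" "happy"
```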

Drop Unused Factor Levels

If you have no observations in one of the levels, you can drop it using the droplevels() function.

# Drop unused level "tie"
record <- c("win", "loss", "loss", "win", "loss", "win")
f <- factor(record,
            levels = c("loss", "tie", "win"))

f
[1] win  loss loss win  loss win 
Levels: loss tie win

droplevels(f)
[1] win  loss loss win  loss win 
Levels: loss win
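
Unused levels commonly appear after subsetting, because a subset keeps all the levels of the original factor. The following sketch filters out the ties and then drops the leftover level; the name no_ties is illustrative.

```r
record <- c("win", "tie", "loss", "win", "loss", "win")
f <- factor(record, levels = c("loss", "tie", "win"))

# Subsetting removes the observations but keeps the "tie" level
no_ties <- f[f != "tie"]
levels(no_ties)              # "loss" "tie" "win"

# droplevels() removes levels that no longer occur
levels(droplevels(no_ties))  # "loss" "win"

# The drop argument of `[` does the same in one step
levels(f[f != "tie", drop = TRUE])  # "loss" "win"
```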

Summarizing a Factor

The summary() function will give you a quick overview of the contents of a factor.

gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
summary(f)
Female   Male 
     3      4

The function table() tabulates observations.

table(f)
f
Female   Male 
     3      4 
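
table() also accepts more than one factor, which gives a cross-tabulation, and prop.table() turns counts into proportions. Here is a short sketch using the hair color and gender data from the survey at the start of this article; the variable names are illustrative.

```r
hair <- factor(c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black"))
gender <- factor(c("Female", "Male", "Female", "Female", "Male", "Male", "Male"))

# Cross-tabulate hair color (rows) against gender (columns)
counts <- table(hair, gender)
counts

# Share of each gender among the seven respondents
shares <- prop.table(table(gender))
shares
```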

 

