R Tutorial for Business Analysts: dplyr select(), filter(), arrange(), and the Pipeline

In this tutorial, you will learn

  • select()
  • filter()
  • Pipeline
  • arrange()

The library called dplyr contains valuable verbs to navigate inside a dataset. Throughout this tutorial, you will use the Travel Times dataset. The dataset collects information on the trips made by a driver between his home and his workplace. There are fourteen variables in the dataset, including:

  • DayOfWeek: Identifies the day of the week the driver uses his car
  • Distance: The total distance of the journey
  • MaxSpeed: The maximum speed of the journey
  • TotalTime: The length in minutes of the journey

The dataset has around 200 observations, and the rides occurred between Monday and Friday.

First of all, you need to:

  • load the dataset
  • check the structure of the data.

One handy feature of dplyr is the glimpse() function, an improvement over str(). We can use glimpse() to see the structure of the dataset and decide what manipulation is required.

# Load dplyr to get access to glimpse() and the verbs used below
library(dplyr)

PATH <- "/datafolder/travel_times.csv"
df <- read.csv(PATH)
glimpse(df)

Output:

## Observations: 205
## Variables: 14
## $ X              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ Date           <fctr> 1/6/2012, 1/6/2012, 1/4/2012, 1/4/2012, 1/3/20...
## $ StartTime      <fctr> 16:37, 08:20, 16:17, 07:53, 18:57, 07:57, 17:3...
## $ DayOfWeek      <fctr> Friday, Friday, Wednesday, Wednesday, Tuesday,...
## $ GoingTo        <fctr> Home, GSK, Home, GSK, Home, GSK, Home, GSK, GS...
## $ Distance       <dbl> 51.29, 51.63, 51.27, 49.17, 51.15, 51.80, 51.37...
## $ MaxSpeed       <dbl> 127.4, 130.3, 127.4, 132.3, 136.2, 135.8, 123.2...
## $ AvgSpeed       <dbl> 78.3, 81.8, 82.0, 74.2, 83.4, 84.5, 82.9, 77.5,...
## $ AvgMovingSpeed <dbl> 84.8, 88.9, 85.8, 82.9, 88.1, 88.8, 87.3, 85.9,...
## $ FuelEconomy    <fctr> , , , , , , -, -, 8.89, 8.89, 8.89, 8.89, 8.89...
## $ TotalTime      <dbl> 39.3, 37.9, 37.5, 39.8, 36.8, 36.8, 37.2, 37.9,...
## $ MovingTime     <dbl> 36.3, 34.9, 35.9, 35.6, 34.8, 35.0, 35.3, 34.3,...
## $ Take407All     <fctr> No, No, No, No, No, No, No, No, No, No, No, No...
## $ Comments       <fctr> , , , , , , , , , , , , , , , Put snow tires o...	

It is obvious that the variable Comments needs further diagnosis. The first observations of the Comments variable are only missing values (empty strings).

sum(df$Comments =="")

Code Explanation

  • sum(df$Comments == ""): Sums the observations equal to "" (the empty string) in the column Comments of df

Output:

## [1] 181

select()

We will begin with the select() verb. We don’t necessarily need all the variables, and a good practice is to select only the variables you find relevant.

We have 181 missing observations, almost 90 percent of the dataset. If you decide to exclude them, you won’t be able to carry on the analysis.
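A quick check of that share, using the counts printed above:

# Share of empty Comments among all observations (181 out of 205, roughly 88 percent)
sum(df$Comments == "") / nrow(df)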

The other possibility is to drop the variable Comment with the select() verb.

We can select variables in different ways with select(). Note that the first argument is the dataset; a short sketch of the first two forms follows the list below.

- `select(df, A, B, C)`: Select the variables A, B and C from the df dataset.
- `select(df, A:C)`: Select all variables from A to C from the df dataset.
- `select(df, -C)`: Exclude C from the df dataset.
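For instance, a quick sketch of the first two forms, using column names taken from the glimpse() output above:

# Keep three named columns
head(select(df, DayOfWeek, Distance, TotalTime))

# Keep the contiguous range of columns from Date to GoingTo
head(select(df, Date:GoingTo))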

You can use the third way to exclude the Comments variable.

step_1_df <- select(df, -Comments)
dim(df)

Output:

## [1] 205  14
dim(step_1_df)

Output:

## [1] 205  13

The original dataset has 14 features while the step_1_df has 13.
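You can confirm that the column is gone with a quick membership check:

# Check whether Comments survived the selection
"Comments" %in% colnames(step_1_df)  # returns FALSE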

filter()

The filter() verb helps to keep the observations that satisfy a criterion. filter() works like select(): you pass the data frame first and then a condition separated by a comma:

filter(df, condition)
arguments:
- df: dataset used to filter the data
- condition:  Condition used to filter the data	
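The condition can be any expression that evaluates to TRUE or FALSE for each row. For instance, a sketch filtering on a numeric column (the 60-minute cut-off is arbitrary, and the resulting row count depends on the data):

# Keep only the journeys longer than 60 minutes
long_trips <- filter(df, TotalTime > 60)
dim(long_trips)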

One criterion

First of all, you can count the number of observations within each level of a factor variable.

table(step_1_df$GoingTo)

Code Explanation

  • table(): Count the number of observations by level. Note that table() expects a categorical variable, such as a factor
  • table(step_1_df$GoingTo): Count the number of trips toward each final destination

Output:

## 
##  GSK Home 
##  105  100	

The function table() indicates 105 rides are going to GSK and 100 to Home.

We can filter the data to return one dataset with 105 observations and another one with 100 observations.

# Select observations if GoingTo == Home
select_home <- filter(df, GoingTo == "Home")
dim(select_home)

Output:

## [1] 100  14
# Select observations if GoingTo == Work (recorded as "GSK")
select_work <- filter(df, GoingTo == "GSK")
dim(select_work)

Output:

## [1] 105  14

Multiple criteria

We can filter a dataset with more than one criterion. For instance, you can extract the observations where the destination is Home and the trip occurred on a Wednesday.

select_home_wed <- filter(df, GoingTo == "Home" & DayOfWeek == "Wednesday")
dim(select_home_wed)

Output:

## [1] 23 14

23 observations matched this criterion.
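Conditions can also be combined with the OR operator |. For instance, a sketch keeping the trips taken on either Monday or Friday (the row count will depend on the data):

# Keep the trips that occurred on Monday or Friday
select_mon_fri <- filter(df, DayOfWeek == "Monday" | DayOfWeek == "Friday")
dim(select_mon_fri)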

Pipeline

The creation of a dataset requires a lot of operations, such as:

  • importing
  • merging
  • selecting
  • filtering
  • and so on

The dplyr library comes with a practical operator, %>%, called the pipeline. The pipeline feature makes the manipulation clean, fast and less prone to error.

This operator chains operations without saving intermediate results as separate objects. Going back to our example from above, you can select the variables of interest and filter them. We have three steps:

  • Step 1: Import data: Import the GPS data
  • Step 2: Select data: Select GoingTo and DayOfWeek
  • Step 3: Filter data: Return only Home and Wednesday

We can use the hard way to do it:

# Step 1
step_1 <- read.csv(PATH)

# Step 2 
step_2 <- select(step_1, GoingTo, DayOfWeek)

# Step 3 
step_3 <- filter(step_2, GoingTo == "Home", DayOfWeek == "Wednesday")

head(step_3)

Output:

##   GoingTo DayOfWeek
## 1    Home Wednesday
## 2    Home Wednesday
## 3    Home Wednesday
## 4    Home Wednesday
## 5    Home Wednesday
## 6    Home Wednesday	

That is not a convenient way to perform many operations, especially in situations with lots of steps. The environment ends up cluttered with intermediate objects.

Let's use the pipeline operator %>% instead. We only need to define the data frame at the beginning, and the whole process will flow from it.

Basic syntax of pipeline

New_df <- df %>%
step 1 %>%
step 2 %>%
...
arguments
- New_df: Name of the new data frame
- df: Data frame used to compute the steps
- step: Instruction for each step
- Note: The last instruction does not need the pipe operator `%>%`, since there are no further instructions to pipe
Note: Creating a new data frame is optional. If not included, the output will be displayed in the console.

You can create your first pipe following the steps enumerated above.

# Create the data frame filter_home_wed. It will be the object returned at the end of the pipeline
filter_home_wed <-

# Step 1: import the data
read.csv(PATH) %>%

# Step 2: select the variables of interest
select(GoingTo, DayOfWeek) %>%

# Step 3: keep only the Home / Wednesday trips
filter(GoingTo == "Home", DayOfWeek == "Wednesday")

# Check that the pipeline output matches the step-by-step result
identical(step_3, filter_home_wed)

Output:

## [1] TRUE

We are now ready to build datasets with the pipeline operator.

arrange()

In the previous tutorial, you learned how to sort values with the sort() function. The dplyr library has its own sorting function, and it works like a charm with the pipeline. The arrange() verb reorders rows by one or many variables, either ascending (the default) or descending.

- `arrange(A)`: Ascending sort of variable A
- `arrange(A, B)`: Ascending sort of variables A and B
- `arrange(desc(A), B)`: Descending sort of variable A and ascending sort of B

We can sort the trips by destination and then by distance.

# Sort by destination, then by distance
step_2_df <- step_1_df %>%
	arrange(GoingTo, Distance)
head(step_2_df)

Output:

##     X       Date StartTime DayOfWeek GoingTo Distance MaxSpeed AvgSpeed
## 1 193  7/25/2011     08:06    Monday     GSK    48.32    121.2     63.4
## 2 196  7/21/2011     07:59  Thursday     GSK    48.35    129.3     81.5
## 3 198  7/20/2011     08:24 Wednesday     GSK    48.50    125.8     75.7
## 4 189  7/27/2011     08:15 Wednesday     GSK    48.82    124.5     70.4
## 5  95 10/11/2011     08:25   Tuesday     GSK    48.94    130.8     85.7
## 6 171  8/10/2011     08:13 Wednesday     GSK    48.98    124.8     72.8
##   AvgMovingSpeed FuelEconomy TotalTime MovingTime Take407All
## 1           78.4        8.45      45.7       37.0         No
## 2           89.0        8.28      35.6       32.6        Yes
## 3           87.3        7.89      38.5       33.3        Yes
## 4           77.8        8.45      41.6       37.6         No
## 5           93.2        7.81      34.3       31.5        Yes
## 6           78.8        8.54      40.4       37.3         No
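To illustrate desc(), a quick sketch that puts the longest journeys first:

# Sort by descending distance: longest trips at the top
step_1_df %>%
	arrange(desc(Distance)) %>%
	head()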

Summary

The table below summarizes all the operations you learned during this tutorial.

| Verb | Objective | Code | Explanation |
|------|-----------|------|-------------|
| glimpse() | Check the structure of a data frame | `glimpse(df)` | Identical to str() |
| select() | Select/exclude variables | `select(df, A, B, C)` | Select the variables A, B and C |
| | | `select(df, A:C)` | Select all variables from A to C |
| | | `select(df, -C)` | Exclude C |
| filter() | Filter the data frame based on one or many conditions | `filter(df, condition1)` | One condition |
| | | `filter(df, condition1 & condition2)` | Multiple conditions |
| arrange() | Sort the dataset with one or many variables | `arrange(A)` | Ascending sort of variable A |
| | | `arrange(A, B)` | Ascending sort of variables A and B |
| | | `arrange(desc(A), B)` | Descending sort of variable A and ascending sort of B |
| %>% | Create a pipeline between each step | `step_1 %>% step_2 %>% step_3` | Chain the operations |
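As a recap, here is a sketch chaining all four verbs in a single pipeline (reusing PATH from the import step above; the desc(Distance) sort is just an illustrative choice):

# Import, drop Comments, keep the Home/Wednesday trips, sort longest first
read.csv(PATH) %>%
	select(-Comments) %>%
	filter(GoingTo == "Home", DayOfWeek == "Wednesday") %>%
	arrange(desc(Distance)) %>%
	glimpse()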
