Beginners Guide to R – R Data Frame

R Data Frame

Suppose you want to store the names of your employees, their age and addresses all in one dataset.

The first structure that comes to mind is a matrix. But a matrix can hold only one data type, so you can’t combine all this data in one matrix without converting everything to character data.

So, you need a new data structure to keep all this information together. That data structure is a Data Frame.

Unlike vectors or matrices, data frames do not require every value to share one data type: each column can hold a different type (numeric, character, and so on), although all values within a single column share the same type.

In a nutshell, a data frame is a list of equal-length vectors. The easiest way to think of a data frame is as an Excel worksheet.
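
The list-of-vectors view can be checked directly in the console. The following is a minimal sketch; the variable name emp is only illustrative and is not used elsewhere in this guide.

```r
# A data frame is a list whose elements (the columns) all have the same length
emp <- data.frame(name = c("Bob", "Max", "Sam"),
                  age  = c(25, 26, 23),
                  stringsAsFactors = FALSE)

is.list(emp)         # a data frame is built on top of a list
length(emp)          # one list element per column
sapply(emp, length)  # every column holds the same number of values
sapply(emp, class)   # each column keeps its own type
```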

Create a Data Frame

You can create a data frame using the data.frame() function.

# Create a data frame to store employee records
name <- c("Bob", "Max", "Sam")
age <- c(25,26,23)
city <- c("New York", "Chicago", "Seattle")

df <- data.frame(name, age, city)
df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

You can also convert pre-existing structures to a data frame using the as.data.frame() function.

# Convert a list of vectors into a data frame
lst <- list(name = c("Bob", "Max", "Sam"),
            age = c(25,26,23),
            city = c("New York", "Chicago", "Seattle"))
df <- as.data.frame(lst)
df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

# Convert a matrix into a data frame
m <- matrix(1:12, nrow = 4, ncol = 3)
df <- as.data.frame(m)
df
  V1 V2 V3
1  1  5  9
2  2  6 10
3  3  7 11
4  4  8 12

Keeping Characters as Characters

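Let’s have a look at the internal structure of the employee data frame.

# Rebuild the employee data frame (df currently holds the matrix example)
df <- data.frame(name, age, city)
# Print internal structure of a data frame
str(df)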
'data.frame':	3 obs. of  3 variables:
 $ name: Factor w/ 3 levels "Bob","Max","Sam": 1 2 3
 $ age : num  25 26 23
 $ city: Factor w/ 3 levels "Chicago","New York",..: 2 1 3

The str() function provides a compact display of the internal structure of any R object.

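You may have noticed that the character columns (name and city) were converted to factors. In R versions before 4.0.0, data.frame() does this by default; since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay as characters. On older versions, you can avoid the conversion by setting the stringsAsFactors argument to FALSE explicitly.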

df <- data.frame(name, age, city, stringsAsFactors = FALSE)
str(df)
'data.frame':	3 obs. of  3 variables:
 $ name: chr  "Bob" "Max" "Sam"
 $ age : num  25 26 23
 $ city: chr  "New York" "Chicago" "Seattle"
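
If you already have a data frame whose text columns were turned into factors, you can convert them back afterwards. The following is a minimal sketch, assuming you want every factor column converted; the helper variable is_fct is only illustrative.

```r
# Force factor columns here so the example behaves the same on any R version
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = TRUE)

# Find the factor columns and convert each one back to character
is_fct <- sapply(df, is.factor)
df[is_fct] <- lapply(df[is_fct], as.character)

sapply(df, class)
```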

Naming Data Frame Rows and Columns

Every column in a data frame has a name. Even if you didn’t specify them yourself, R will take the column names from your program variables.

v1 <- c("Bob", "Max", "Sam")
v2 <- c(25,26,23)
v3 <- c("New York", "Chicago", "Seattle")

df <- data.frame(v1, v2, v3)
df
   v1 v2       v3
1 Bob 25 New York
2 Max 26  Chicago
3 Sam 23  Seattle

But you can give columns a sensible name by using colnames() or names().

names(df) <- c("name", "age", "city")
df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

By default, the rows of a data frame are simply numbered 1, 2, 3, and so on, but you can assign meaningful row names with rownames().

rownames(df) <- c("row1", "row2", "row3")
df
     name age     city
row1  Bob  25 New York
row2  Max  26  Chicago
row3  Sam  23  Seattle

You can use the same colnames() (or names()) and rownames() functions to print the column names and row names, respectively.

# print column names
colnames(df)
[1] "name" "age"  "city"

# names() returns the same column names
names(df)
[1] "name" "age"  "city"

# print row names
rownames(df)
[1] "row1" "row2" "row3"
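
The same functions can also change a single name or reset the row names. A small sketch, assuming the employee data frame from above; the new column name location is only an illustration.

```r
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)
rownames(df) <- c("row1", "row2", "row3")

# Rename only the 'city' column, leaving the other names untouched
names(df)[names(df) == "city"] <- "location"
names(df)

# Assigning NULL restores the default row numbering
rownames(df) <- NULL
rownames(df)
```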

Subsetting Data Frames Like a List

Data frames possess the characteristics of lists. So, when you subset with a single vector, they behave like lists and will return the selected columns with all rows.

df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

# subset for 1st column
df[1]
  name
1  Bob
2  Max
3  Sam

# subset for 1st and 3rd column
df[c(1,3)]
  name     city
1  Bob New York
2  Max  Chicago
3  Sam  Seattle

# omit 3rd column
df[-3]
  name age
1  Bob  25
2  Max  26
3  Sam  23
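
Single brackets always return a data frame, even for one column. To get the column itself as a vector, use double brackets or the $ operator, just as you would with a list. A minimal sketch with illustrative variable names:

```r
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

col_df     <- df["age"]    # single brackets: a one-column data frame
col_vec    <- df[["age"]]  # double brackets: the underlying numeric vector
col_dollar <- df$age       # $ is shorthand for [[ ]] with a column name

class(col_df)
class(col_vec)
identical(col_vec, col_dollar)
```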

Subsetting Data Frames Like a Matrix

Data frames also possess the characteristics of matrices. So, when you subset with two vectors, they behave like matrices and can be subset by row and column.

df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

# subset for 2nd row
df[2,]
  name age    city
2  Max  26 Chicago

# subset for 3rd column
df[,3]
[1] "New York" "Chicago"  "Seattle" 

# select single element
df[2,3]
[1] "Chicago"

# subset for row 1 and 2 but keep all columns
df[1:2,]
  name age     city
1  Bob  25 New York
2  Max  26  Chicago

# subset for both rows and columns
df[1:2,2:3]
  age     city
1  25 New York
2  26  Chicago

# omit 2nd row
df[-2,]
  name age     city
1  Bob  25 New York
3  Amy  23  Seattle

# omit 2nd row and 3rd column
df[-2,-3]
  name age
1  Bob  25
3  Amy  23

# subset for 'city' column
df[,"city"]
[1] "New York" "Chicago"  "Seattle" 
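
Two details of matrix-style subsetting are worth a quick sketch: selecting a single column returns a plain vector unless you pass drop = FALSE, and the row index can be a logical condition. The variable names below are only illustrative.

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

# Keep the result as a data frame instead of a vector
one_col <- df[, "city", drop = FALSE]
class(one_col)

# Use a logical vector as the row index to keep only matching rows
older <- df[df$age > 24, ]
older
```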

The subset() function

There’s a more convenient way to subset a data frame using the subset() function.

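The function takes three main arguments.

subset(x, subset, select)

x: The data frame you want to subset.

subset: A logical expression that selects rows.

select: A column name, or a vector of column names, to be selected.

Because subset comes before select in the signature, it is safest to pass both by name, as the examples here do.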

To see how the subset() function works, let’s start with a simple data set. Suppose you have a data frame df storing employee records:

df
  name age sex     city
1  Eve  21   F  Chicago
2  Max  24   M  Houston
3  Ray  22   M New York
4  Kim  21   F New York
5  Sam  23   M  Chicago

# select the employee name
subset(df, select=name)
  name
1  Eve
2  Max
3  Ray
4  Kim
5  Sam

# select the employee name and city
subset(df, select=c(name,city))
  name     city
1  Eve  Chicago
2  Max  Houston
3  Ray New York
4  Kim New York
5  Sam  Chicago

# select all employees from 'Chicago'
subset(df, subset=(city == "Chicago"))
  name age sex    city
1  Eve  21   F Chicago
5  Sam  23   M Chicago

# select the employee name and city with age > 22
subset(df, select=c(name,city), subset=(age > 22))
  name    city
2  Max Houston
5  Sam Chicago
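
Row conditions can be combined with & (and) or | (or), and a minus sign in select drops a column. The sketch below also shows the equivalent bracket expression for comparison; the variable names are only illustrative.

```r
df <- data.frame(name = c("Eve", "Max", "Ray", "Kim", "Sam"),
                 age  = c(21, 24, 22, 21, 23),
                 sex  = c("F", "M", "M", "F", "M"),
                 city = c("Chicago", "Houston", "New York", "New York", "Chicago"),
                 stringsAsFactors = FALSE)

# Male employees older than 22, keeping only name and age
res <- subset(df, subset = (sex == "M" & age > 22), select = c(name, age))
res

# The same selection written with the [] operator
res_brackets <- df[df$sex == "M" & df$age > 22, c("name", "age")]

# Keep every column except 'sex'
no_sex <- subset(df, select = -sex)
```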

Add New Rows and Columns to Data Frame

You can add new columns to a data frame using the cbind() function.

df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

sex <- factor(c("M", "M", "F"))
cbind(df, sex)
  name age     city sex
1  Bob  25 New York   M
2  Max  26  Chicago   M
3  Amy  23  Seattle   F
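
Note that cbind() returns a new data frame and leaves df unchanged unless you store the result. As a sketch of an alternative, you can also add a column in place by assigning to a new column name; the column name senior is only an illustration.

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)
sex <- factor(c("M", "M", "F"))

wider <- cbind(df, sex)  # df itself still has 3 columns
ncol(df)
ncol(wider)

# Assigning to a new name adds the column to df directly
df$sex <- sex
df[["senior"]] <- df$age > 24
df
```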

To add new rows (observations) to a data frame, use the rbind() function.

Warning:

Take extra care when adding new rows to the data frame. Adding elements of wrong type can change the type of the columns.

For example, if your data frame contains a numeric column and you append a row given as a character vector, R coerces that numeric column to character. Building the new row as a one-row data frame with matching column types avoids this:

df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

new_row <- data.frame(name = "Sam",
                      age = 22,
                      city = "New York")
rbind(df, new_row)
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle
4  Sam  22 New York
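
The type change described in the warning is easy to reproduce. The following sketch contrasts appending a character vector with appending a properly typed one-row data frame; the names bad and good are only illustrative.

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

# Appending a character vector silently turns the age column into text
bad <- rbind(df, c("Sam", "22", "New York"))
class(bad$age)

# Appending a one-row data frame with matching types keeps age numeric
good <- rbind(df, data.frame(name = "Sam", age = 22, city = "New York",
                             stringsAsFactors = FALSE))
class(good$age)
```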

Combine Two Data Frames

You can combine data frames in one of two ways:

Combine the Columns

Use the cbind() function to combine the columns of two data frames side by side, creating a wider data frame.

df1
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

df2
     sex salary
1   Male  22000
2   Male  25400
3 Female  24800

cbind(df1, df2)
  name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

Make sure the data frames have the same height (number of rows).

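Otherwise, R will invoke the Recycling Rule to extend the shorter columns when the larger row count is an exact multiple of the smaller one, which may or may not be what you want, and it will stop with an error when it is not.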

Combine the Rows

Use the rbind() function to stack the rows of two data frames, creating a taller data frame.

df1
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

df2
  name age     city
1  Eve  21  Chicago
2  Ray  22  Houston
3  Kim  24 New York

rbind(df1, df2)
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle
4  Eve  21  Chicago
5  Ray  22  Houston
6  Kim  24 New York

Make sure the data frames have the same width (same number of columns and same column names).

However, the columns need not be in the same order.
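
As a quick sketch of that last point, rbind() matches the columns of data frames by name, so a data frame with the same columns in a different order still lines up correctly:

```r
df1 <- data.frame(name = c("Bob", "Max"),
                  age  = c(25, 26),
                  stringsAsFactors = FALSE)

# Same columns as df1, but listed in the opposite order
df2 <- data.frame(age  = 21,
                  name = "Eve",
                  stringsAsFactors = FALSE)

both <- rbind(df1, df2)
both
```

The result keeps the column order of the first data frame.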

Merge Data Frames by Common Column

You can merge two data frames by matching on the common column using the merge() function. You just need to specify the two data frames and the name of the common column.

# Merge two data frames by common column 'name'
df1
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

df2
  name    sex salary
1  Max   Male  25400
2  Amy Female  24800
3  Bob   Male  22000

merge(df1, df2, by="name")
  name age     city    sex salary
1  Amy  23  Seattle Female  24800
2  Bob  25 New York   Male  22000
3  Max  26  Chicago   Male  25400

The merge() function does not require the rows to occur in the same order.

By default, it also discards rows that appear in only one data frame or the other; set all = TRUE to keep them.
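
The all, all.x and all.y arguments of merge() control which unmatched rows are kept. A minimal sketch follows; the employee "Zoe" and the variable names are made up for illustration, so that each data frame has one row without a partner.

```r
df1 <- data.frame(name = c("Bob", "Max", "Amy"),
                  age  = c(25, 26, 23),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name   = c("Max", "Amy", "Zoe"),
                  salary = c(25400, 24800, 30000),
                  stringsAsFactors = FALSE)

# Default: only names present in both data frames
inner <- merge(df1, df2, by = "name")

# Keep every row from both sides; missing values become NA
outer <- merge(df1, df2, by = "name", all = TRUE)

# Keep every row of df1 only
left <- merge(df1, df2, by = "name", all.x = TRUE)

outer
```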

Modify a Data Frame

Modifying a data frame is pretty straightforward. Access the element using the [] operator and simply assign a new value.

df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

# modify 2nd column
df[2] <- c(23,21,22)
df
  name age     city
1  Bob  23 New York
2  Max  21  Chicago
3  Amy  22  Seattle

# modify 2nd row
df[2,] <- list("Eve",24,"Houston")
df
  name age     city
1  Bob  23 New York
2  Eve  24  Houston
3  Amy  22  Seattle

# modify single element
df[3,1] <- "Sam"
df
  name age     city
1  Bob  23 New York
2  Eve  24  Houston
3  Sam  22  Seattle
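
You can also pick the cells to change with a condition instead of a position. A short sketch, with values chosen only for illustration:

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

# Change the age of the employee named Max
df[df$name == "Max", "age"] <- 27

# Update a whole column based on its current values
df$age <- df$age + 1
df
```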

Editing a Data Frame

R offers two convenient ways to edit the contents of a data frame interactively: the edit() function and the fix() function.

The edit() Function

It opens up the data editor that displays your data frame in a spreadsheet-like window. Invoke the editor like this:

df
  name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

temp <- edit(df)
df <- temp

Once you are done with the changes, close the editor window. The updated data frame will be assigned to the temp variable. If you are happy with the changes, overwrite your data frame with the results.

The fix() Function

There’s another function called fix() which works exactly like edit() except it overwrites the data frame once you close the editor.

Use it if you are confident, because there is no undo.

df
  name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

fix(df)

Create an Empty Data Frame

You can create an empty data frame by using the numeric(), character(), and factor() functions to preallocate the columns and then joining them together with data.frame().

This technique is useful especially when you want to build a data frame row-by-row.

df <- data.frame(name=character(),
                 age=numeric(),
                 sex=factor(levels=c("M","F")),
                 stringsAsFactors = FALSE)
str(df)
'data.frame':	0 obs. of  3 variables:
 $ name: chr 
 $ age : num 
 $ sex : Factor w/ 2 levels "M","F":
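
Building a data frame row by row from such an empty frame could look like the sketch below. The input vectors names_in and ages_in are hypothetical stand-ins for records arriving one at a time.

```r
# Start with an empty data frame that already has the right column types
df <- data.frame(name = character(),
                 age  = numeric(),
                 stringsAsFactors = FALSE)

# Hypothetical input, processed one record at a time
names_in <- c("Bob", "Max", "Sam")
ages_in  <- c(25, 26, 23)

for (i in seq_along(names_in)) {
  new_row <- data.frame(name = names_in[i],
                        age  = ages_in[i],
                        stringsAsFactors = FALSE)
  df <- rbind(df, new_row)  # append one typed row per iteration
}
df
```

Each call to rbind() copies the data frame, so this approach suits small data sets; for many rows, preallocating a fixed size is usually faster.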

You can even create an empty data frame of fixed size if you know the required number of rows in advance.

# Create an empty data frame with 3 rows
N <- 3
df <- data.frame(name=character(N),
                 age=numeric(N),
                 sex=factor(rep(NA, N), levels=c("M","F")),
                 stringsAsFactors = FALSE)
df
  name age  sex
1        0 <NA>
2        0 <NA>
3        0 <NA>

Sorting a Data Frame

You can sort the contents of a data frame by using the order() function and specifying one of the columns as the sort key.

The order() function alone only tells you how to rearrange the rows: it returns the row indices in sorted order, not the data values. Combine it with the subsetting operator [] to get the sorted data frame.

By default, sorting is ascending. Prefix a numeric sorting variable with a minus sign to sort in descending order; for other column types, use order(..., decreasing = TRUE).

# Sort the data frame by age
df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

# sort in ascending order
df[order(df$age),]
  name age     city
3  Amy  23  Seattle
1  Bob  25 New York
2  Max  26  Chicago

# sort in descending order
df[order(-df$age),]
  name age     city
2  Max  26  Chicago
1  Bob  25 New York
3  Amy  23  Seattle
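
order() also accepts several sort keys, with later keys breaking ties in earlier ones. The sketch below adds a made-up fourth employee ("Eve") so that two rows share the same age; the variable names are only illustrative.

```r
df <- data.frame(name = c("Bob", "Max", "Amy", "Eve"),
                 age  = c(25, 26, 23, 25),
                 city = c("New York", "Chicago", "Seattle", "Houston"),
                 stringsAsFactors = FALSE)

# Sort by age, then by name within the same age
by_age_name <- df[order(df$age, df$name), ]
by_age_name

# Sort a character column in descending order
by_name_desc <- df[order(df$name, decreasing = TRUE), ]

# Optionally renumber the rows after sorting
rownames(by_age_name) <- NULL
```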

 
