R tutorials for Business Analyst – Import Data into R: Read CSV, Excel, SPSS, Stata, SAS Files

Data could exist in various formats. For each format R has a specific function and argument. This tutorial explains how to import data to R.

In this tutorial, you will learn

  • Read CSV
  • Read Excel files
  • readxl_example()
  • read_excel()
  • excel_sheets()
  • Import data from other Statistical software
  • Read sas
  • Read STATA
  • Read SPSS
  • Best practices for Data Import

Read CSV

One of the most widely data store is the .csv (comma-separated values) file formats. R loads an array of libraries during the start-up, including the utils package. This package is convenient to open csv files combined with the reading.csv() function. Here is the syntax for read.csv

read.csv(file, header = TRUE, sep = ",")

Argument:

  • file: PATH where the file is stored
  • header: confirm if the file has a header or not, by default, the header is set to TRUE
  • sep: the symbol used to split the variable. By default, `,`.

We will read the data file name mtcats. The csv file is stored online. If your .csv file is stored locally, you can replace the PATH inside the code snippet. Don’t forget to wrap it inside ‘ ‘. The PATH needs to be a string value.

For mac user, the path for the download folder is:

 "/Users/USERNAME/Downloads/FILENAME.csv"

For windows user:

"C:\Users\USERNAME\Downloads\FILENAME.csv"

Note that, we should always specify the extension of the file name.

  • .csv
  • .xlsx
  • .txt
PATH <- '/datafolder/mtcars.csv'                
df <- read.csv(PATH, header =  TRUE, sep = ',')
length(df)

Output:

## [1] 12
class(df$X)

Output:

## [1] "factor"

R, by default, returns character values as Factor. We can turn off this setting by adding stringsAsFactors = FALSE.

PATH <- '/datafolder/mtcars.csv'
df <-read.csv(PATH, header =TRUE, sep = ',', stringsAsFactors =FALSE)
class(df$X)

Output:

## [1] "character"

The class for the variable X is now a character.

Read Excel files

Excel files are very popular among data analysts. Spreadsheets are easy to work with and flexible. R is equipped with a library readxl to import Excel spreadsheet.

Use this code

require(readxl)

to check if readxl is installed in your machine. If you install r with r-conda-essential, the library is already installed. You should see in the command window:

Output:

Loading required package: readxl.

If the package does not exit, you can install it with the conda library or in the terminal, use conda install -c mittner r-readxl.

Use the following command to load the library to import excel files.

library(readxl)

readxl_example()

We use the examples included in the package readxl during this tutorial.

Use code

readxl_example()

to see all the available spreadsheets in the library.

To check the location of the spreadsheet named clippy.xls, simple use

readxl_example("geometry.xls")

If you install R with conda, the spreadsheets are located in Anaconda3/lib/R/library/readxl/extdata/filename.xls

read_excel()

The function read_excel() is of great use when it comes to opening xls and xlsx extention.

The syntax is:

read_excel(PATH, sheet = NULL, range= NULL, col_names = TRUE)
arguments:
-PATH: Path where the excel is located
-sheet: Select the sheet to import. By default, all
-range: Select the range to import. By default, all non-null cells
-col_names: Select the columns to import. By default, all non-null columns

We can import the spreadsheets from the readxl library and count the number of columns in the first sheet.

# Store the path of `datasets.xlsx`
example <- readxl_example("datasets.xlsx")
# Import the spreadsheet
df <- read_excel(example)
# Count the number of columns
length(df)

Output:

## [1] 5

excel_sheets()

The file datasets.xlsx is composed of 4 sheets. We can find out which sheets are available in the workbook by using excel_sheets() function

example <- readxl_example("datasets.xlsx")

excel_sheets(example)

Output:

[1] "iris"     "mtcars"   "chickwts" "quakes"

If a worksheet includes many sheets, it is easy to select a particular sheet by using the sheet arguments. We can specify the name of the sheet or the sheet index. We can verify if both function returns the same output with identical().

example <- readxl_example("datasets.xlsx")
quake <- read_excel(example, sheet = "quakes")
quake_1 <-read_excel(example, sheet = 4)
identical(quake, quake_1)

Output:

## [1] TRUE

We can control what cells to read in 2 ways

  1. Use n_max argument to return n rows
  2. Use range argument combined with cell_rows or cell_cols

For example, we set n_max equals to 5 to import the first five rows.

# Read the first five row: with header
iris <-read_excel(example, n_max =5, col_names =TRUE)

If we change col_names to FALSE, R creates the headers automatically.

# Read the first five row: without header
iris_no_header <-read_excel(example, n_max =5, col_names =FALSE)

iris_no_header

In the data frame iris_no_header, R created five new variables named X__1, X__2, X__3, X__4 and X__5

We can also use the argument range to select rows and columns in the spreadsheet. In the code below, we use the excel style to select the range A1 to B5.

# Read rows A1 to B5
example_1 <-read_excel(example, range = "A1:B5", col_names =TRUE)
dim(example_1)

Output:

## [1] 4 2

We can see that the example_1 returns 4 rows with 2 columns. The dataset has header, that the reason the dimension is 4×2.

In the second example, we use the function cell_rows() which controls the range of rows to return. If we want to import the rows 1 to 5, we can set cell_rows(1:5). Note that, cell_rows(1:5) returns the same output as cell_rows(5:1).

# Read rows 1 to 5
example_2 <-read_excel(example, range =cell_rows(1:5),col_names =TRUE)			
dim(example_2)

Output:

## [1] 4 5

The example_2 however is a 4×5 matrix. The iris dataset has 5 columns with header. We return the first four rows with header of all columns

In case we want to import rows which do not begin at the first row, we have to include col_names = FALSE. If we use range = cell_rows(2:5), it becomes obvious our data frame does not have header anymore.

iris_row_with_header <-read_excel(example, range =cell_rows(2:3), col_names=TRUE)
iris_row_no_header <-read_excel(example, range =cell_rows(2:3),col_names =FALSE)

We can select the columns with the letter, like in Excel.
# Select columns A and B
col <-read_excel(example, range =cell_cols("A:B"))
dim(col)

Output:

## [1] 150   2

Note : range = cell_cols(“A:B”), returns output all cells with non-null value. The dataset contains 150 rows, therefore, read_excel() returns rows up to 150. This is verified with the dim() function.

read_excel() returns NA when a symbol without numerical value appears in the cell. We can count the number of missing values with the combination of two functions

  1. sum
  2. is.na

Here is the code

iris_na <-read_excel(example, na ="setosa")
sum(is.na(iris_na))

Output:

## [1] 50

We have 50 values missing, which are the rows belonging to the setosa species.

Import data from other Statistical software

We will import different files format with the heaven package. This package support SAS, STATA and SPSS softwares. We can use the following function to open different types of dataset, according to the extension of the file:

  • SAS: read_sas()
  • STATA: read_dta() (or read_stata(), which are identical)
  • SPSS: read_sav() or read_por(). We need to check the extension

Only one argument is required within these function. We need to know the PATH where the file is stored. That’s it, we are ready to open all the files from SAS, STATA and SPSS. These three function accepts an URL as well.

library(haven)

haven comes with conda r-essential otherwise go to the link or in the terminal conda install -c conda-forge r-haven

Read sas

For our example, we are going to use the admission dataset from IDRE.

PATH_sas <- '/datafolder/binary.sas7bdat?raw=true'
df <- read_sas(PATH_sas)
head(df)

Output:

## # A tibble: 6 x 4
##   ADMIT   GRE   GPA  RANK
##   <dbl> <dbl> <dbl> <dbl>
## 1     0   380  3.61     3
## 2     1   660  3.67     3
## 3     1   800  4.00     1
## 4     1   640  3.19     4
## 5     0   520  2.93     4
## 6     1   760  3.00     2

Read STATA

For STATA data files you can use read_dta(). We use exactly the same dataset but store in .dta file.

PATH_stata <- '/datafolder/binary.dta?raw=true'
df <- read_dta(PATH_stata)
head(df)

Output:

## # A tibble: 6 x 4				
##   admit   gre   gpa  rank				
##   <dbl> <dbl> <dbl> <dbl>				
## 1     0   380  3.61     3				
## 2     1   660  3.67     3				
## 3     1   800  4.00     1				
## 4     1   640  3.19     4				
## 5     0   520  2.93     4				
## 6     1   760  3.00     2

Read SPSS

We use the read_sav()function to open a SPSS file. The file extension “.sav”

PATH_spss <- '/datafolder/binary.sav?raw=true'
df <- read_sav(PATH_spss)
head(df)

Output:

## # A tibble: 6 x 4				
##   admit   gre   gpa  rank				
##   <dbl> <dbl> <dbl> <dbl>				
## 1     0   380  3.61     3				
## 2     1   660  3.67     3			
## 3     1   800  4.00     1				
## 4     1   640  3.19     4				
## 5     0   520  2.93     4				
## 6     1   760  3.00     2

Best practices for Data Import

When we want to import data into R, it is useful to implement following checklist. It will make it easy to import data correctly into R:

  • The typical format for a spreadsheet is to use the first rows as the header (usually variables name).
  • Avoid to name a dataset with blank spaces; it can lead to interpreting as a separate variable. Alternatively, prefer to use ‘_’ or ‘-.’
  • Short names are preferred
  • Do not include symbol in the name: i.e: exchange_rate_$_€ is not correct. Prefer to name it: exchange_rate_dollar_euro
  • Use NA for missing values otherwise; we need to clean the format later.

Summary

Following table summarizes the function to use in order to import different types of file in R. The column one states the library related to the function. The last column refers to the default argument.

Library Objective Function Default Arguments
utils Read CSV file read.csv() file, header =,TRUE, sep = “,”
readxl Read EXCEL file read_excel() path, range = NULL, col_names = TRUE
haven Read SAS file read_sas() path
haven Read STATA file read_stata() path
haven Read SPSS fille read_sav() path

Following table shows the different ways to import a selection with read_excel() function.

Function Objectives Arguments
read_excel() Read n number of rows n_max = 10
Select rows and columns like in excel range = “A1:D10”
Select rows with indexes range= cell_rows(1:3)
Select columns with letters range = cell_cols(“A:C”)

 

Personal Career & Learning Guide for Data Analyst, Data Engineer and Data Scientist

Applied Machine Learning & Data Science Projects and Coding Recipes for Beginners

A list of FREE programming examples together with eTutorials & eBooks @ SETScholars

95% Discount on “Projects & Recipes, tutorials, ebooks”

Projects and Coding Recipes, eTutorials and eBooks: The best All-in-One resources for Data Analyst, Data Scientist, Machine Learning Engineer and Software Developer

Topics included: Classification, Clustering, Regression, Forecasting, Algorithms, Data Structures, Data Analytics & Data Science, Deep Learning, Machine Learning, Programming Languages and Software Tools & Packages.
(Discount is valid for limited time only)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.

Learn by Coding: v-Tutorials on Applied Machine Learning and Data Science for Beginners

Please do not waste your valuable time by watching videos, rather use end-to-end (Python and R) recipes from Professional Data Scientists to practice coding, and land the most demandable jobs in the fields of Predictive analytics & AI (Machine Learning and Data Science).

The objective is to guide the developers & analysts to “Learn how to Code” for Applied AI using end-to-end coding solutions, and unlock the world of opportunities!

 

R tutorials for Business Analyst – R Exporting Data to Excel, CSV, SAS, STATA, Text File

R Examples for Beginners – How read data files in R

Beginners tutorial with R – How to load CSV Files