R Examples for Beginners – How to read data files in R

(R Example for Citizen Data Scientist & Business Analyst)

 

This code reads a dataset file of population estimates published by the US Census Bureau.

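# Pick the CSV file in a dialog and read it into a data frame
tbl <- read.table(file.choose(), header = TRUE, sep = ",")
# Keep only the column with the 2009 population estimates
population <- tbl["POPESTIMATE2009"]
# Summarize the column, leaving out the first five (non-state) rows
print(summary(population[-1:-5, ]))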
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  544300  1734000  4141000  5980000  6613000 36960000 

Reading a CSV file

read.table can read a variety of basic data formats into tables or “data frames”.
sep specifies the separator for the data, which is a comma for CSV files.
header indicates whether the first row contains the names of the data columns.

The first argument is the file name. In this case file.choose is called instead of a fixed name: it opens a file selection dialog and returns the path of the chosen file.

(The user’s home folder is the default working directory in RStudio.)
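To try read.table without the census file, the following minimal sketch writes a small CSV to a temporary file and reads it back. The column names and values are invented for illustration only. It also calls read.csv, the base R wrapper around read.table that already defaults to header = TRUE and sep = ",".

```r
# Write a tiny, made-up CSV file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("NAME,POP",
             "Alpha,100",
             "Beta,250",
             "Gamma,75"), path)

# header = TRUE: the first row holds the column names
# sep = ",":     fields are separated by commas
demo <- read.table(path, header = TRUE, sep = ",")

# read.csv applies the same header and separator settings by default
demo_csv <- read.csv(path)

print(demo)
print(names(demo))   # column names taken from the header row
print(nrow(demo))    # number of data rows, header excluded
```

Passing a path string like this is the non-interactive alternative to file.choose, which is useful when a script has to run without user input.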

Indexing data frames

Getting a specific column

You can use the column name as a string in brackets, tbl["POPESTIMATE2009"], which returns a data frame containing only that column:

   POPESTIMATE2009
1        307006550
2         55283679
3         66836911
[...]

Using the column number also works: tbl[17].
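As a small illustration of both forms, the sketch below builds a made-up data frame (the names NAME and POP are placeholders, not columns of the census file) and selects the same column by name and by position.

```r
# Made-up data frame standing in for the census table
df <- data.frame(NAME = c("Alpha", "Beta", "Gamma"),
                 POP  = c(100, 250, 75))

by_name   <- df["POP"]  # select by column name
by_number <- df[2]      # select by column position

print(class(by_name))                 # single-bracket selection keeps a data frame
print(identical(by_name, by_number))  # both forms pick the same column
```

Selecting by name is usually the safer choice, because a column number silently points at different data if the file layout changes.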

Getting a column as a vector

You can use the dollar sign for this: tbl$POPESTIMATE2009

[1] 307006550  55283679  66836911 113317879  71568081   4708708    698473
[8]   6595778   2889450  36961664   5024748   3518288    885122    599657
[...]
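The difference between the two access styles can be checked directly. This sketch uses an invented data frame (NAME and POP are placeholder names) and compares single brackets, the dollar sign, and double brackets.

```r
# Made-up data frame standing in for the census table
df <- data.frame(NAME = c("Alpha", "Beta", "Gamma"),
                 POP  = c(100, 250, 75))

as_frame  <- df["POP"]    # one-column data frame
as_vector <- df$POP       # plain numeric vector
also_vec  <- df[["POP"]]  # double brackets also return the vector

print(is.data.frame(as_frame))   # the bracket form is still a data frame
print(is.numeric(as_vector))     # the dollar form is a numeric vector
print(mean(as_vector))           # vector functions work directly on it
```

Double brackets are handy when the column name is stored in a variable, which the dollar sign cannot handle.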

Fetching specific rows and columns

Here the data frame is indexed like a 2-dimensional matrix, in the form [rows, columns].
To get the first 5 rows from the population table:

population[1:5,]  #  first the rows, then the columns
[1] 307006550  55283679  66836911 113317879  71568081

The comma after the row range indicates that we want all columns. In this case we could also have written [1:5,1] because population has only 1 column. For the same reason, R returns the result as a plain vector rather than as a data frame.
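The row and column indexing described above can be sketched on a small invented data frame (NAME and POP are placeholder names). The sketch also shows drop = FALSE, a standard argument of the bracket operator that keeps the data frame structure even when only one column remains.

```r
# Made-up data frame standing in for the census table
df <- data.frame(NAME = c("Alpha", "Beta", "Gamma", "Delta"),
                 POP  = c(100, 250, 75, 300))
pop <- df["POP"]  # one-column data frame, like 'population' in the text

first_two <- pop[1:2, ]                # rows 1-2, all columns: drops to a vector
kept      <- pop[1:2, , drop = FALSE]  # same rows, but stays a data frame
block     <- df[1:2, c("NAME", "POP")] # rows 1-2 of two named columns

print(first_two)
print(class(kept))
print(dim(block))  # number of rows, then number of columns
```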

Look at this data from the first 5 rows in the population column:

[1] 307006550  55283679  66836911 113317879  71568081

These are too big to be population values for US states. They are the totals for the whole US and for the four US Census Bureau regions: Northeast, Midwest, South and West.
Since we are only interested in the states we can drop them like this:

population[-1:-5,]

Negative numbers in indices omit the given rows or columns instead of selecting them.
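A short sketch of negative indexing on made-up values (the vector x and the data frame df are invented for illustration) shows that -1:-2 and -(1:2) describe the same set of positions to drop.

```r
# Made-up values for illustration
x  <- c(10, 20, 30, 40)
df <- data.frame(NAME = c("Total", "Region", "Alpha", "Beta"),
                 POP  = c(425, 350, 100, 250))

rest_x    <- x[-1:-2]     # drop elements 1 and 2 of the vector
rest_rows <- df[-1:-2, ]  # drop rows 1 and 2, keep all columns
same_rows <- df[-(1:2), ] # equivalent spelling of the same index

print(rest_x)
print(rest_rows)
print(identical(rest_rows, same_rows))
```

Positive and negative numbers cannot be mixed in one index; R stops with an error if both appear together.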

A short equivalent of the code

You can also fetch the population column at the same time as you remove the multi-state rows. Replace

population <- tbl["POPESTIMATE2009"]
print(summary(population[-1:-5,]))

with

print(summary(tbl[-1:-5,"POPESTIMATE2009"]))
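To check that the two forms agree, the sketch below runs both on a tiny invented table. Only the column name POPESTIMATE2009 is taken from the text; the values, the NAME column, and the single "Total" row to drop are assumptions made for the example.

```r
# Made-up table: one aggregate row followed by three "state" rows
tbl_demo <- data.frame(NAME = c("Total", "Alpha", "Beta", "Gamma"),
                       POPESTIMATE2009 = c(425, 100, 250, 75))

# Two-step form: extract the column, then drop the aggregate row
population_demo <- tbl_demo["POPESTIMATE2009"]
two_step <- summary(population_demo[-1, ])

# One-step form: drop the row and pick the column in a single index
one_step <- summary(tbl_demo[-1, "POPESTIMATE2009"])

print(one_step)
print(identical(two_step, one_step))
```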

The summary function

summary calculates a few values based on the data passed as the first argument. The exact values calculated depend on the class of the data.

summary(1:10)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 
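How the result depends on the class of the data can be seen by passing a few small, made-up vectors of different classes to summary.

```r
num_sum <- summary(c(1.5, 2.5, 10))          # numeric: quartiles and mean
lgl_sum <- summary(c(TRUE, FALSE, TRUE))     # logical: counts of FALSE and TRUE
fct_sum <- summary(factor(c("a", "b", "a"))) # factor: number of values per level
chr_sum <- summary(c("a", "b", "a"))         # character: length, class and mode

print(num_sum)
print(lgl_sum)
print(fct_sum)
print(chr_sum)
```

Called on a whole data frame, summary applies the matching method to each column in turn.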

