Beginner's Guide to R – R Histogram – Base Graph

R Histogram – Base Graph

A histogram is a graphical display of continuous data using bars of different heights.

It is similar to a bar graph, except a histogram groups the data into bins. The height of each bar shows the number of elements in the bin.
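To make the idea of bins concrete, here is a minimal sketch that asks hist() to compute the bins without drawing anything (plot = FALSE). It uses the built-in faithful data set, which is introduced later in this guide.

```r
# Compute the bins without drawing the plot
h <- hist(faithful$waiting, plot = FALSE)

h$breaks   # bin boundaries
h$counts   # number of observations falling in each bin

# Every observation lands in exactly one bin
sum(h$counts) == length(faithful$waiting)
```

The counts are exactly the bar heights that hist() would draw for the same data.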

[Figure: a typical histogram]

They are a great way to display the distribution or variation of data over a range.

The hist() function

In R, you can create a histogram using the hist() function.

It takes many arguments that control, among other things, the bin size, labels, titles and colors.

Syntax

The syntax for the hist() function is:

hist(x, breaks, freq, labels, density, angle, col, border, main, xlab, ylab, ...)

Parameters

Parameter   Description
x           A numeric vector of values to plot as a histogram
breaks      A number suggesting how many bins to use, or a vector of break points
freq        If TRUE, hist() plots counts; if FALSE, it plots densities
labels      If TRUE, draws labels on top of the bars
density     The density of shading lines, in lines per inch
angle       The slope of shading lines, in degrees
col         A color (or vector of colors) for the bars
border      The color to be used for the border of the bars
main        An overall title for the plot
xlab        The label for the x axis
ylab        The label for the y axis
...         Other graphical parameters

Create a Histogram

To get started, you need a set of data to work with. Let’s use the built-in faithful data set as an example.

Here are the first six observations of the data set.

# First six observations of the 'Faithful' data set
head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

Faithful data set

The faithful data set contains 272 observations from the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA.

Each observation consists of two measurements, both in minutes: eruptions, the duration of the eruption, and waiting, the waiting time until the next eruption.
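Before plotting, it can help to confirm the shape of the data. The following is a small sketch using standard base R inspection functions on the faithful data set.

```r
# Inspect the structure of the built-in faithful data set
str(faithful)               # two numeric columns: eruptions and waiting
nrow(faithful)              # number of observations
summary(faithful$waiting)   # range and quartiles of the waiting times
```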

To create a histogram, just pass the vector to the hist() function.

# Create a histogram of time between eruptions of Old Faithful
hist(faithful$waiting)

Choose the Number of Bins

How well a histogram represents the data depends heavily on the number of bins used to plot it.

Too few bins hide important details about the distribution, while too many bins produce a lot of noise and obscure its overall shape.

By default, the hist() function chooses the number of bins with Sturges' formula and rounds the break points to convenient values that cover the range of the data.
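As a rough illustration of the default, the sketch below compares the number of classes suggested by Sturges' rule (the default value of the breaks argument in hist()) with the number of bins hist() actually uses. The variable names are just for illustration.

```r
x <- faithful$waiting

# Sturges' rule: ceiling(log2(n) + 1) classes
n_suggested <- nclass.Sturges(x)
n_suggested

# hist() treats this as a suggestion and picks rounded break points,
# so the final number of bins can differ from it
h <- hist(x, plot = FALSE)
length(h$counts)
```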

However, there are a couple of ways to manually set the number of bins.

1. You can tell R the number of bars you want in the histogram by giving a single number as a value to the breaks argument.

# Specify the number of bars you want in the histogram
hist(faithful$waiting,
     breaks = 20)

Just keep in mind that the number is only a suggestion.

R uses it to compute round, evenly spaced break points (via the pretty() function), so the histogram may end up with more or fewer bins than you asked for.
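To check how many bins you actually got, and to force an exact number when you need one, something like the following sketch can be used. It assumes you want equal-width bins running from the smallest to the largest value; exact_breaks is just an illustrative name.

```r
x <- faithful$waiting

# Ask for 20 bins; hist() rounds the break points to "pretty" values
h <- hist(x, breaks = 20, plot = FALSE)
length(h$counts)   # actual number of bins, may not be 20

# To force exactly 20 equal-width bins, supply 21 break points
exact_breaks <- seq(min(x), max(x), length.out = 21)
h_exact <- hist(x, breaks = exact_breaks, plot = FALSE)
length(h_exact$counts)
```

The trade-off is that the forced break points usually fall on awkward, non-round values.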

2. You can tell R exactly where to put the breaks by giving a vector with the break points as the argument.

# Histogram with custom breaks
hist(faithful$waiting,
     breaks = c(40,45,55,60,65,70,75,85,90,100))
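Note that the break points above are not equally spaced. In that case hist() plots densities rather than counts by default, so that the area of each bar, not its height, reflects the share of observations. The break points must also span the full range of the data. A short sketch to verify this behavior:

```r
x <- faithful$waiting
custom_breaks <- c(40, 45, 55, 60, 65, 70, 75, 85, 90, 100)

h <- hist(x, breaks = custom_breaks, plot = FALSE)
h$equidist   # FALSE: the bins do not all have the same width

# With unequal bins, bar areas (not heights) represent proportions
bin_widths <- diff(h$breaks)
sum(h$density * bin_widths)   # total area of the bars
```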

Coloring a Histogram

Use the col argument to change the color of the bars.

hist(faithful$waiting,
     col="dodgerblue3")

By using the border argument, you can even change the color used for the border of the bars.

hist(faithful$waiting,
     col="lightblue1",
     border="dodgerblue3")
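Because col accepts a vector, each bar can get its own color. One possible approach, sketched below, is to compute the bins first and then choose colors based on the bin midpoints. The choice to highlight bins above the mean is only an example.

```r
x <- faithful$waiting

# Compute the bins first so the colors can depend on bin position
h <- hist(x, plot = FALSE)

# One color per bar: darker for bins centered above the mean
bar_cols <- ifelse(h$mids > mean(x), "dodgerblue3", "lightblue1")

plot(h, col = bar_cols, border = "white",
     main = "Waiting times, bins above the mean highlighted",
     xlab = "Waiting time (minutes)")
```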

Create a Hatched Histogram

Creating hatched charts in R is rather easy: just specify the density argument in the hist() function.

By default, the plot is hatched with lines slanting at 45°. You can change this with the angle argument.

# Create a hatched histogram with 60° slanting lines
hist(faithful$waiting,
     col="dodgerblue3",
     density=25,
     angle=60)

Adding Titles and Axis Labels

You can easily add your own title and axis labels by specifying the following arguments.

Argument   Description
main       Main plot title
xlab       x-axis label
ylab       y-axis label

hist(faithful$waiting,
     col="dodgerblue3",
     main="Time between eruptions of Old Faithful",
     xlab="Time (minutes)")

Add Value Markers

Often you want to draw attention to specific values or observations in your graphic to provide unique insight. You can do this by adding markers to your graphic.

For example, adding a mean line gives you an idea of how much of the distribution lies above and below the average.

You can add such a marker with the abline() function.

# Add mean line in the histogram
hist(faithful$waiting,
     col="lightblue1")
abline(v=mean(faithful$waiting),
       col="dodgerblue3",
       lty=2,
       lwd=2)
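The same technique extends to several markers. As a sketch, the example below draws both the mean and the median and adds a legend so the two lines can be told apart. The colors and line types are arbitrary choices.

```r
x <- faithful$waiting

hist(x, col = "lightblue1",
     main = "Waiting times with mean and median",
     xlab = "Waiting time (minutes)")

# Dashed line for the mean, dotted line for the median
abline(v = mean(x),   col = "dodgerblue3", lty = 2, lwd = 2)
abline(v = median(x), col = "firebrick",   lty = 3, lwd = 2)

# Legend identifying the two markers
legend("topleft",
       legend = c("Mean", "Median"),
       col = c("dodgerblue3", "firebrick"),
       lty = c(2, 3), lwd = 2, bty = "n")
```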

Another example is placing values on top of the bars, which makes the exact counts easier to read.

You can add them by setting the labels argument to TRUE.

# Show values on top of each bar in the histogram
hist(faithful$waiting,
     col="dodgerblue3",
     labels=TRUE)
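The labels argument also accepts a character vector, in which case the given strings are drawn instead of the counts. A minimal sketch, assuming you want percentages on top of the bars:

```r
x <- faithful$waiting

# Compute the bins first to derive one label per bar
h <- hist(x, plot = FALSE)
pct_labels <- paste0(round(100 * h$counts / sum(h$counts)), "%")

# Redraw with the same breaks and the custom labels
hist(x, breaks = h$breaks,
     col = "dodgerblue3",
     labels = pct_labels,
     ylim = c(0, max(h$counts) * 1.1))  # headroom so the top label is not clipped
```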

Plotting a Kernel Density Estimate (KDE)

A histogram gives you a rough sense of the density of the underlying distribution of your data.

A smoother and more detailed description comes from estimating the probability density function (PDF) of your variable.

Use the density() function to compute a kernel density estimate from the sample, and then use the lines() function to draw it on top of the histogram.

By default, the hist() function plots the counts in the histogram. By setting freq argument to FALSE, you can plot the densities.

# Add a kernel density estimate to a histogram
hist(faithful$waiting,
     col="lightblue1",
     freq = FALSE)
lines(density(faithful$waiting),
      col="dodgerblue3",
      lwd=2)
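How smooth the curve looks depends on the bandwidth. The sketch below uses the adjust argument of density(), which scales the default bandwidth, to compare three levels of smoothing. The specific values 0.5 and 2 are arbitrary.

```r
x <- faithful$waiting

hist(x, col = "lightblue1", freq = FALSE,
     main = "Effect of the bandwidth adjustment",
     xlab = "Waiting time (minutes)")

# adjust scales the default bandwidth: < 1 is wigglier, > 1 is smoother
lines(density(x, adjust = 0.5), col = "firebrick",   lwd = 2)
lines(density(x),               col = "dodgerblue3", lwd = 2)
lines(density(x, adjust = 2),   col = "darkgreen",   lwd = 2)
```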

To fill the density plot, use the polygon() function.

# Fill the density plot
hist(faithful$waiting,
     col="lightblue1",
     freq = FALSE)
lines(density(faithful$waiting))
polygon(density(faithful$waiting),
        col=rgb(1,0,1,.2))

Instead of setting freq = FALSE, you can achieve the same result by setting probability = TRUE (often abbreviated to prob = TRUE, which R resolves through partial argument matching).
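To see what the density scale means, the following sketch recomputes the densities by hand from the counts and bin widths and checks them against what hist() reports. The name manual_density is illustrative.

```r
x <- faithful$waiting
h <- hist(x, plot = FALSE)

# Density of a bin = count / (total observations * bin width)
manual_density <- h$counts / (sum(h$counts) * diff(h$breaks))
all.equal(manual_density, h$density)

# The bar areas therefore sum to 1
sum(h$density * diff(h$breaks))
```

This is why a kernel density estimate, which also integrates to 1, lines up with the histogram only when the histogram is drawn on the density scale.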

Plot Multiple Histograms

Often you want to compare the distributions of different variables within your data.

You can overlay the histograms by setting the add argument of the second histogram to TRUE.

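# Two samples of 1000 normally distributed random numbers (means 6 and 4)
set.seed(1)  # fixed seed so the plot is reproducible
h1 <- rnorm(1000, 6)
h2 <- rnorm(1000, 4)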

# Overlay two histograms
hist(h1,
     col=rgb(1,0,0,0.5))
hist(h2,
     col=rgb(0,0,1,0.5),
     add=TRUE)
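With add = TRUE, the second histogram is drawn on the axes set up by the first, so part of it can be cut off, and the two histograms may use different bin widths. One way to handle this, sketched below, is to compute a shared set of break points covering both samples. The seed and the choice of roughly 30 bins are arbitrary.

```r
set.seed(42)  # fixed seed so the example is reproducible
h1 <- rnorm(1000, 6)
h2 <- rnorm(1000, 4)

# Common break points spanning both samples
shared_breaks <- pretty(range(c(h1, h2)), n = 30)

hist(h1, breaks = shared_breaks,
     col = rgb(1, 0, 0, 0.5),
     xlim = range(shared_breaks),
     main = "Two overlaid histograms",
     xlab = "Value")
hist(h2, breaks = shared_breaks,
     col = rgb(0, 0, 1, 0.5),
     add = TRUE)
```

Using identical breaks makes the bar heights directly comparable between the two samples.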

 
