Beginner’s Guide to R – R Scatter Plot – ggplot2

R Scatter Plot – ggplot2

A scatter plot is a graphical display of the relationship between two variables.

typical scatter plot

Scatter plots are useful when you want to visualize how two variables are correlated, which is why they are also called correlation plots.

Create a Scatter Plot

To get started with plotting, you need a data set to work with. Let’s use the built-in iris flower data set as an example.

Here are the first six observations of the data set.

# First six observations of the 'Iris' data set
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Iris data set

The iris data set contains 150 observations on three species of iris flower (setosa, versicolor and virginica), 50 per species. Each observation records four measurements of the flower: petal length, petal width, sepal length and sepal width.
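
A quick way to check the shape of the data is to inspect the data frame directly. The sketch below uses only base R functions on the built-in iris data; the variable names (n_obs, n_per_species, measurements) are chosen here for illustration.

```r
# Inspect the structure of the built-in iris data set (base R only)
n_obs <- nrow(iris)                                    # number of observations (rows)
n_per_species <- table(iris$Species)                   # observations per species
measurements <- names(iris)[sapply(iris, is.numeric)]  # names of the numeric columns

print(n_obs)
print(n_per_species)
print(measurements)
```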

To create a scatter plot, use ggplot() with geom_point() and specify what variables you want on the X and Y axes.

# Load ggplot2, then create a basic scatter plot with ggplot
library(ggplot2)
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point()
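
Since scatter plots are used to judge correlation, it can help to put a number on what the plot shows. As a minimal sketch, base R's cor() computes the Pearson correlation between the two plotted variables; the variable name r is just for illustration.

```r
# Pearson correlation between the two plotted variables (base R)
r <- cor(iris$Petal.Length, iris$Petal.Width)

# A value close to 1 indicates a strong positive linear relationship
print(round(r, 2))
```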

Change the Shape and Size of the Points

You can use different point shapes in a scatter plot by setting the shape argument in geom_point().

Shapes are identified by integer codes from 0 to 25; for example, shape=3 draws a plus sign.

The size of the points can be controlled with the size argument. In current versions of ggplot2 the default size is 1.5.

# Change the shape of the points to a plus sign and set their size explicitly
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point(shape=3, size=1.5)
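
As a rough reference for choosing a shape, the following sketch draws the numbered point shapes 0 to 25 on a grid. The data frame shapes and its x and y layout columns are made up purely for this illustration.

```r
library(ggplot2)

# Hypothetical helper data: one row per shape code, laid out on a grid
shapes <- data.frame(
  shape = 0:25,
  x = 0:25 %% 7,       # column position in the grid
  y = -(0:25 %/% 7)    # row position (negative so that shape 0 starts at the top)
)

p <- ggplot(shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = shape), size = 5, fill = "red") +  # fill only affects shapes 21 to 25
  geom_text(aes(label = shape), nudge_y = -0.3, size = 3) + # print the code below each shape
  scale_shape_identity() +  # use the numbers in 'shape' directly as shape codes
  theme_void()

print(p)
```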

Change Theme

The ggplot2 package provides some premade themes to change the overall plot appearance.

With themes you can easily customize some commonly used properties, like background color, panel background color and grid lines.

# Change the ggplot theme to 'Minimal'
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  theme_minimal()

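Besides theme_minimal(), ggplot2 ships with the following complete themes: theme_gray() (the default), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_classic() and theme_void().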
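
The properties mentioned above (background color, panel background color and grid lines) can also be adjusted one at a time with theme(). The sketch below is one possible combination, not a recommended style; the colors and the object names p, p_bw and p_custom are arbitrary choices for illustration.

```r
library(ggplot2)

# Store the base plot once, then vary its appearance
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()

# Switch to another built-in theme
p_bw <- p + theme_bw()

# Adjust individual properties on top of a built-in theme
p_custom <- p +
  theme_minimal() +
  theme(
    plot.background  = element_rect(fill = "grey95", colour = NA),  # whole-plot background
    panel.background = element_rect(fill = "white", colour = NA),   # plotting-area background
    panel.grid.minor = element_blank()                              # hide minor grid lines
  )

print(p_custom)
```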

Adding Titles and Axis Labels

You can easily add your own title and axis labels with the following functions.

Function    Description
ggtitle()   Main plot title
xlab()      x-axis label
ylab()      y-axis label

ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  ggtitle("Iris Flower Data Set") +
  xlab("Petal Length (cm)") +
  ylab("Petal Width (cm)")
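
The same title and labels can also be set in a single call with labs(), which some people find easier to read. A minimal sketch, reusing the label text from the example above:

```r
library(ggplot2)

# Set the title and both axis labels in one labs() call
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  labs(
    title = "Iris Flower Data Set",
    x = "Petal Length (cm)",
    y = "Petal Width (cm)"
  )

print(p)
```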

Create a Scatter Plot of Multiple Groups

Plotting several groups in one scatter plot without distinguishing them creates an uninformative mess. The graphic is far more informative if each group can be told apart from the others.

The following examples map the categorical variable “Species” to color and to shape, so that each species gets its own color or shape.

# Group points by 'Species' mapped to color
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) +
  geom_point()

# Group points by 'Species' mapped to shape
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, shape=Species)) +
  geom_point()
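
Color and shape can also be combined in a single plot, which keeps the groups distinguishable when the figure is printed in grayscale. A minimal sketch; because both aesthetics map to the same variable, ggplot2 merges them into one legend.

```r
library(ggplot2)

# Map 'Species' to both color and shape at the same time
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                      colour = Species, shape = Species)) +
  geom_point()

print(p)
```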

Map a Continuous Variable to Color or Size

In a basic scatter plot, two continuous variables are mapped to the x-axis and y-axis. If you have more than two continuous variables, you can map the additional ones to other aesthetics such as size or color.

The following examples map the continuous variable “Sepal.Width” to color and to size.

# A continuous variable 'Sepal.Width' mapped to color
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Sepal.Width)) +
  geom_point()

# A continuous variable 'Sepal.Width' mapped to size
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species, size=Sepal.Width)) +
  geom_point(alpha=0.3)
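
When a continuous variable is mapped to color, the gradient itself can be changed with scale_colour_gradient(). The sketch below is one possible choice; the two endpoint colors are arbitrary and only serve as an illustration.

```r
library(ggplot2)

# Map 'Sepal.Width' to color and choose the endpoints of the color gradient
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Sepal.Width)) +
  geom_point() +
  scale_colour_gradient(low = "lightblue", high = "darkblue")

print(p)
```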

Plotting the Regression Line

To add a regression line (line of best fit) to the scatter plot, use the stat_smooth() function and specify method=lm.

ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  stat_smooth(method=lm)

By default, stat_smooth() adds a 95% confidence region for the regression fit.

You can change the confidence level by setting level, e.g. stat_smooth(method=lm, level=0.9), or disable the confidence region by setting se, e.g. stat_smooth(method=lm, se=FALSE).
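
To see the numbers behind the plotted line, the same model can be fitted directly with base R's lm(). This is a sketch of the equivalent fit, not a description of ggplot2's internals; the object names fit, new_obs and band and the three petal lengths are chosen for illustration.

```r
# Fit the same straight-line model that method=lm draws
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Intercept and slope of the regression line
print(coef(fit))

# Fitted values with a 95% confidence interval at a few illustrative petal lengths
new_obs <- data.frame(Petal.Length = c(2, 4, 6))
band <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
print(band)
```

The lwr and upr columns correspond to the lower and upper edges of a confidence region at those x values.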

If your scatter plot has points grouped by a categorical variable, you can add one regression line for each group.

# Add one regression line for each group
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) +
  geom_point() +
  stat_smooth(method=lm)

Plotting the LOESS Line

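When you add stat_smooth() without specifying the method, ggplot2 chooses a smoother based on the size of the data. For fewer than 1,000 observations, as here, it adds a loess line, so specifying method=loess gives the same result. For larger data sets the default is a generalized additive model instead.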

ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  stat_smooth(method=loess)
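
How closely the loess line follows the points is controlled by the span argument, which defaults to 0.75. The sketch below compares two values; the specific values 0.5 and 1 and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()

# Smaller span: each local fit uses fewer neighbours, so the curve follows the data more closely
p_flexible <- p + stat_smooth(method = "loess", formula = y ~ x, span = 0.5)

# Larger span: each local fit uses more of the data, so the curve is smoother
p_smooth <- p + stat_smooth(method = "loess", formula = y ~ x, span = 1)

print(p_flexible)
print(p_smooth)
```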

Adding Marginal Rugs to a Scatter Plot

A marginal rug is a one-dimensional display of the data drawn along an axis of a plot, with one tick mark per observation. It can be used to observe the marginal distributions more clearly.

By using geom_rug(), you can add marginal rugs to your scatter plot.

If you have too many points, you can jitter the line positions and make them slightly thinner.

# Add marginal rugs and use jittering to avoid overplotting
# (in ggplot2 3.4.0 and later, linewidth=.2 is the preferred spelling of size=.2 for lines)
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  geom_rug(position="jitter", size=.2)
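
Rugs do not have to appear on both axes. As a small sketch, the sides argument of geom_rug() selects which edges get a rug ("b" for bottom, "l" for left, "t" for top, "r" for right), and alpha makes overlapping ticks easier to read; the values used here are illustrative.

```r
library(ggplot2)

# Draw a semi-transparent rug along the bottom edge only
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_rug(sides = "b", alpha = 0.3)

print(p)
```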

2D Density Plot

A 2D density plot uses kernel density estimation to visualize a bivariate distribution. This can be useful for dealing with overplotting.

geom_density_2d() and stat_density_2d() perform a 2D kernel density estimation and display the results as contours.

# Show the contour only
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  geom_density_2d()

# Show the area only
# (after_stat(level) replaces the older ..level.. notation; it requires ggplot2 3.3.0 or later)
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  stat_density_2d(aes(fill = after_stat(level)), geom="polygon")

# Area + contour
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  stat_density_2d(aes(fill = after_stat(level)), geom="polygon", colour="white")

If you turn contouring off, you can use geoms like tiles or points.

# tiles
# (after_stat(density) replaces the older stat(density) notation; it requires ggplot2 3.3.0 or later)
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  stat_density_2d(geom = "raster", aes(fill = after_stat(density)), contour = FALSE)

# points
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  stat_density_2d(geom = "point", aes(size = after_stat(density)), n = 20, contour = FALSE)
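
To look at the estimated density values themselves rather than a picture of them, a 2D kernel density estimate can be computed directly with kde2d() from the MASS package, which is distributed with R. This is a sketch with the default bandwidth, so the numbers may not match a particular plot exactly; the object name dens is chosen for illustration.

```r
# 2D kernel density estimate on a 20 x 20 grid (MASS is distributed with R)
dens <- MASS::kde2d(iris$Petal.Length, iris$Petal.Width, n = 20)

# dens$x and dens$y hold the grid coordinates, dens$z the estimated density at each grid cell
print(dim(dens$z))
print(range(dens$z))
```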

Scatter Plot with Prediction Ellipse

A prediction ellipse is a region for predicting the location of a new observation under the assumption that the population is bivariate normal.

It is helpful for detecting deviation from normality.

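stat_ellipse() computes and displays a 95% ellipse by default (level=0.95). Note that its default type, "t", assumes a multivariate t-distribution; use type="norm" for an ellipse based on the bivariate normal assumption described above.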

# Overlay a prediction ellipse on a scatter plot
ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
  geom_point() +
  stat_ellipse()

Sometimes you might want to overlay prediction ellipses for each group. It helps to visualize how characteristics vary between the groups.

# Draw prediction ellipses for each group
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) +
  geom_point() +
  stat_ellipse()
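
The type and level arguments of stat_ellipse() control the distributional assumption and the coverage of the ellipse. The sketch below draws 90% ellipses under a multivariate normal assumption for each species; the 0.9 level is an arbitrary choice for illustration.

```r
library(ggplot2)

# 90% ellipses per species, assuming a multivariate normal distribution
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  stat_ellipse(type = "norm", level = 0.9)

print(p)
```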

 

