Can a scatter plot have more than 2 variables?

Welcome back to Quantitative Reasoning! Last time, we learned how to make scatter plots that show the association between two variables (e.g. between the day of the year and wind speed). We plotted one variable along the x-axis and another variable along the y-axis. Sometimes we want to show in a single plot how multiple pairs of variables are associated with each other. For example, here is a plot that shows how the gross domestic product (GDP) of the United States and the Soviet Union developed between the years 1928 and 1991. The data are from the Groningen Growth and Development Centre (https://www.rug.nl/ggdc/).

We could make a multi-panel plot with one panel for each country, but a single-panel plot makes it much easier to draw comparisons between the two countries.

Let's download the spreadsheet gdp_usa_ussr.csv from the URL below this video (https://michaelgastner.com/data_for_QR/gdp_usa_ussr.csv). Then we import the data into a data frame that we call gdp for short. The data frame has three columns: year, the GDP of the United States and the GDP of the Soviet Union.

    gdp <- read.csv("gdp_usa_ussr.csv")
    head(gdp)
    ##   year  usa ussr
    ## 1 1928 6569 1370
    ## 2 1929 6899 1386
    ## 3 1930 6213 1448
    ## 4 1931 5691 1462
    ## 5 1932 4908 1439
    ## 6 1933 4777 1493
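If you'd like to experiment before downloading the file, here is a minimal sketch that rebuilds only the six rows printed by head(gdp). It assumes that the CSV file is comma-separated with the header year,usa,ussr, which matches the printed output but isn't confirmed by the file itself. The names csv_text and gdp_head are hypothetical and chosen only for this illustration.

```r
# Rebuild the six rows shown by head(gdp) from inline text.
# Assumption: the real file uses commas and the header "year,usa,ussr".
csv_text <- "year,usa,ussr
1928,6569,1370
1929,6899,1386
1930,6213,1448
1931,5691,1462
1932,4908,1439
1933,4777,1493"

gdp_head <- read.csv(text = csv_text)

str(gdp_head)             # Three columns: year, usa, ussr
colSums(is.na(gdp_head))  # Number of missing values in each column
```

Running colSums(is.na(gdp)) on the full data frame is a quick way to check which columns contain missing values.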

Our goal is to make a single plot that shows the GDP of both countries as functions of the year. There are different methods to make multi-variable scatter plots with R. The method I'm going to show you isn't the most elegant from a programmer's perspective, but it's easy to learn and gets the job done. The idea is to break the task into smaller pieces. First we make a scatter plot for only one of the two countries (e.g. the United States) with the plot() function. Here I use the arguments main, xlab, ylab and cex from our previous tutorial.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75)

Then we add the data for the Soviet Union to the existing plot. Because we want to show the data as unconnected points, we use the points() function. If we wanted to connect the points by lines, we would use the lines() function instead. The functions points() and lines() accept input coordinates either in the form x-coordinate comma y-coordinate (i.e. gdp$year, gdp$ussr),

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75)
    points(gdp$year, gdp$ussr)  # General format: points(x, y)

or as a formula with the tilde operator and an optional data argument (ussr ~ year, data = gdp).

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75)
    points(ussr ~ year,  # General format: points(y ~ x)
           data = gdp)

The sequence of the commands is important. We must run plot() first. Only then can we add more data with the points() or lines() functions.
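Here's a small sketch of what happens if we get the order wrong. The data frame toy and its values are made up for this illustration and aren't part of the GDP data. With no plot open, calling points() before plot() should stop with an error (typically "plot.new has not been called yet"), which we catch with tryCatch() so that the script keeps running.

```r
# Hypothetical toy data, invented for this illustration only.
toy <- data.frame(x = 1:5,
                  y1 = c(2, 4, 6, 8, 10),
                  y2 = c(1, 3, 5, 7, 9))

graphics.off()  # Close all open plots so that no existing plot is reused

# Wrong order: points() has no plot to draw on, so R raises an error.
outcome <- tryCatch({
  points(y2 ~ x, data = toy)
  "points() succeeded"
}, error = function(e) {
  paste("Error:", conditionMessage(e))
})
print(outcome)

# Right order: plot() first, then points() in either input format.
plot(y1 ~ x, data = toy)
points(toy$x, toy$y2, pch = 0)       # Format points(x, y)
points(y2 ~ x, data = toy, pch = 0)  # Format points(y ~ x), same result
```

Note that graphics.off() closes every open plot window, so you may not want to run that line in the middle of other work.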

There are several problems with the current plot.

  • The data for the USSR are cut off at the bottom.
  • It's confusing that the points for the USA and the USSR have the same shape and colour.
  • We need a legend to distinguish between the data for each country.

Let's fix these problems one by one.

To make the data at the bottom visible, we can use the ylim argument of the plot() function to expand the y-axis range.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))  # Adjust y-axis range
    points(ussr ~ year, data = gdp)
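Instead of typing the limits by hand, we could let R work them out from the data. This is a minimal sketch and not the method used in the video. It uses only the six rows printed by head(gdp), and the names gdp_head and y_limits are hypothetical. The option na.rm = TRUE tells range() to skip missing values.

```r
# The six rows printed by head(gdp) earlier in this tutorial.
gdp_head <- data.frame(year = 1928:1933,
                       usa  = c(6569, 6899, 6213, 5691, 4908, 4777),
                       ussr = c(1370, 1386, 1448, 1462, 1439, 1493))

# Smallest and largest value across both countries, ignoring any NA.
y_limits <- range(c(gdp_head$usa, gdp_head$ussr), na.rm = TRUE)
print(y_limits)

plot(usa ~ year,
     data = gdp_head,
     ylim = y_limits)  # The y-axis now covers both countries
points(ussr ~ year, data = gdp_head)
```

The video uses c(0, 25000) instead, which has the advantage that the y-axis starts at zero.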

Next we change the point symbol for the USSR in the points() function with pch. If you need a reminder about how to choose a numeric value for pch, please have a look again at the previous tutorial. In this example, I'm choosing squares, but feel free to choose another point symbol.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))
    points(ussr ~ year, data = gdp, pch = 0)  # Squares
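As a quick reminder of the available symbols, the following sketch draws each of the pch values from 0 to 25 with its number underneath. The variable name pch_codes is only for this illustration.

```r
pch_codes <- 0:25  # Numeric point symbols available in base R

# One symbol per position along the x-axis, all at the same height.
plot(pch_codes, rep(1, length(pch_codes)),
     pch = pch_codes,
     ylim = c(0, 2),
     xlab = "Value of pch",
     ylab = "",
     yaxt = "n",  # Hide the y-axis because it carries no information
     main = "Point symbols for pch = 0 to 25")

# Print each pch number just below its symbol.
text(pch_codes, rep(0.5, length(pch_codes)),
     labels = pch_codes,
     cex = 0.6)
```

In this chart, 0 is the open square that we use for the USSR and 1 is the open circle that plot() draws by default.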

Right now, the squares are overlapping a little bit. We can cure this problem by reducing the size of the squares with the cex argument.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))
    points(ussr ~ year, data = gdp, pch = 0, cex = 0.75)  # Reduce symbol size

Let's also change the colour of the squares with the col argument so that the data for each country look clearly distinct.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))
    points(ussr ~ year, data = gdp, pch = 0, cex = 0.75, col = "red")  # Change colour to red

We can add a faint grid with the grid() function to improve readability.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))
    points(ussr ~ year, data = gdp, pch = 0, cex = 0.75, col = "red")
    grid()  # Add faint grid

Right now, it would be unclear to a reader which symbol corresponds to which country. We can add a legend with the legend() function. The first argument in legend() specifies the legend's position. In our example, we can place the legend in the top left corner of the plot without obstructing any data point. The second argument in the legend() function is a character vector that contains the legend text, in our case the names of the two countries.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))
    points(ussr ~ year, data = gdp, pch = 0, cex = 0.75, col = "red")
    grid()
    legend("topleft", legend = c("USA", "USSR"))

In its current form, the legend is still useless because it isn't showing the point symbols. We add the symbols with arguments for pch and col that match the symbol types and colours in the plot.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))
    points(ussr ~ year, data = gdp, pch = 0, cex = 0.75, col = "red")
    grid()
    legend("topleft",
           legend = c("USA", "USSR"),
           pch = c(1, 0),
           col = c("black", "red"))
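One way to reduce the risk of a mismatch between the plot and the legend is to store the symbols and colours in variables and reuse them. This is only a sketch of that idea and not the method used in the video. The vectors pch_values and col_values are hypothetical names, and the data are the six rows printed by head(gdp) rather than the full file.

```r
# The six rows printed by head(gdp) earlier in this tutorial.
gdp_head <- data.frame(year = 1928:1933,
                       usa  = c(6569, 6899, 6213, 5691, 4908, 4777),
                       ussr = c(1370, 1386, 1448, 1462, 1439, 1493))

# Define each country's style once ...
pch_values <- c(usa = 1, ussr = 0)  # Circle for USA, square for USSR
col_values <- c(usa = "black", ussr = "red")

# ... then reuse the same values in plot(), points() and legend().
plot(usa ~ year,
     data = gdp_head,
     ylim = c(0, 8000),
     pch = pch_values[["usa"]],
     col = col_values[["usa"]])
points(ussr ~ year,
       data = gdp_head,
       pch = pch_values[["ussr"]],
       col = col_values[["ussr"]])
legend("topright",
       legend = c("USA", "USSR"),
       pch = unname(pch_values),
       col = unname(col_values))
```

If we later decide to change a symbol or colour, we only have to edit one line, and the legend follows automatically.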

As icing on the cake, we can add trend lines with the lowess() function from last time. There's a small complication when adding a LOWESS curve for the USSR because some GDP data during World War II are missing. The trick is to remove the NA values first and only then apply the lowess() function. I won't go into details right now. You can see the solution here or in the transcript at the URL below the video.

    plot(usa ~ year,
         data = gdp,
         main = "GDP of USA and USSR",
         xlab = "Year",
         ylab = "GDP per capita (1990 international dollar)",
         cex = 0.75,
         ylim = c(0, 25000))
    points(ussr ~ year, data = gdp, pch = 0, col = "red", cex = 0.75)
    grid()
    legend("topleft",
           legend = c("USA", "USSR"),
           col = c("black", "red"),
           pch = c(1, 0))
    lines(lowess(gdp$year, gdp$usa))
    lines(lowess(gdp$year[!is.na(gdp$ussr)],
                 na.omit(gdp$ussr)),
          col = "red")
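For those who'd like the details: the logical vector !is.na(gdp$ussr) is TRUE for the years that have a USSR value, so it picks out the matching years, and na.omit() drops the missing GDP values. The sketch below applies the same idea to a made-up data frame. The name toy and all numbers in it are hypothetical and not taken from the GDP file.

```r
# Hypothetical data with two missing y-values, for illustration only.
toy <- data.frame(x = 1:10,
                  y = c(3, 5, 4, NA, NA, 8, 9, 11, 10, 13))

keep <- !is.na(toy$y)   # TRUE where y is present
x_clean <- toy$x[keep]  # x-values that have a matching y
y_clean <- toy$y[keep]  # Same values as na.omit(toy$y)

trend <- lowess(x_clean, y_clean)

plot(y ~ x, data = toy)    # The rows with NA are not drawn
lines(trend, col = "red")  # LOWESS curve through the remaining points
```

The important point is that the x-values and y-values passed to lowess() must have the same length, so both have to be filtered with the same condition.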

This plot looks nice. Unfortunately, the procedure was a little bit tedious. For example, it would be nice if the legend automatically showed the correct symbol shapes and colours. There are free add-on R packages that have lots of nifty features to automate this task, but their learning curve is quite steep. For our Quantitative Reasoning course, the basic method I've just shown you is perfectly adequate.

Here are the main points of this tutorial.

  • If we want to make a scatter plot or a line chart with multiple data sets shown in the same plot, we first use the plot() function to show one variable along the x-axis and another variable along the y-axis. Then we add more data sets to the plot with the points() or lines() functions.
  • We may need to adjust the coordinate ranges with the xlim and ylim arguments of the plot() function.
  • We should choose unique point symbols for each data set.
  • We must add a legend to communicate clearly which point symbol belongs to which data set.
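The same recipe works for a line chart. The following sketch uses the six rows printed by head(gdp) and swaps points() for lines(). The line types chosen with lty and the name gdp_head are assumptions made for this illustration.

```r
# The six rows printed by head(gdp) earlier in this tutorial.
gdp_head <- data.frame(year = 1928:1933,
                       usa  = c(6569, 6899, 6213, 5691, 4908, 4777),
                       ussr = c(1370, 1386, 1448, 1462, 1439, 1493))

# Step 1: plot() draws the first data set; type = "l" gives a line.
plot(usa ~ year,
     data = gdp_head,
     type = "l",
     ylim = c(0, 8000),  # Wide enough for both countries
     xlab = "Year",
     ylab = "GDP per capita (1990 international dollar)")

# Step 2: lines() adds the second data set with a distinct style.
lines(ussr ~ year, data = gdp_head, lty = 2, col = "red")

# Step 3: the legend repeats the line types and colours used above.
legend("topright",
       legend = c("USA", "USSR"),
       lty = c(1, 2),
       col = c("black", "red"))
```

For lines, the lty argument plays the role that pch plays for points: it gives each data set its own look.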

The data in our example were relatively evenly spaced along the x-axis and y-axis. That is, we didn't have large gaps in some parts of the plot and dense clumps of points in other parts. Some data sets have more skewed distributions, so we may need to apply a coordinate transformation, also known as re-expression, before plotting them. We talk about re-expression in our next tutorial.

See you soon.
