This R package has a companion working paper.
Lundberg, Ian, Robin C. Lee, and Brandon M. Stewart. 2021. “The quantile plot: A visualization for bivariate population relationships.” Working paper.
The quantileplot package visualizes bivariate associations. When summarizing bivariate data, a best-fit regression line often obscures important trends: it is too simple. A scatter plot shows all the data but overwhelms the viewer: it is too complicated. A quantileplot is a middle ground with three components.
A quantileplot may be useful in many settings. One example is when
This vignette illustrates the package functionality with a simulated example.
The quantileplot package is available via GitHub. Install it using the devtools package.
devtools::install_github("ilundberg/quantileplot")
Suppose we have a continuous predictor X and a continuous outcome Y.
library(quantileplot)
x <- rbeta(1000, 1, 2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)
sim_data <- data.frame(x = x, y = y)
A call to the quantileplot function produces the most basic quantile plot.
quantileplot(y ~ s(x), data = sim_data)
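To build intuition for what the quantile curves summarize, the base-R sketch below computes empirical quantiles of Y within equal-width bins of X. It is only a crude cross-check and not the package's estimator; the seed, the bin width, and the particular quantile levels are assumptions chosen for illustration.

```r
# Simulate data with the same recipe as above (seed is an assumption,
# added only so the example is reproducible)
set.seed(1)
x <- rbeta(1000, 1, 2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)
sim_data <- data.frame(x = x, y = y)

# Cut the predictor into 5 equal-width bins on [0, 1]
sim_data$bin <- cut(sim_data$x, breaks = seq(0, 1, by = 0.2),
                    include.lowest = TRUE)

# Empirical quantiles of y within each bin (levels chosen for illustration)
probs <- c(.1, .25, .5, .75, .9)
binned_q <- t(sapply(split(sim_data$y, sim_data$bin), quantile, probs = probs))

# Rows are bins of x; columns are quantiles of y
print(round(binned_q, 2))
```

Each row is a bin of X and each column is a quantile of Y. Smoothly estimated quantile curves would be expected to track these binned values roughly, while avoiding the arbitrary choice of bins.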
A call to quantileplot may take some time to compute. In the simulated setting above, the time was 4 seconds for 1,000 observations and 24 seconds for 10,000 observations.
This most basic version of the visualization does not present any estimates of uncertainty. There are two options for visualizing uncertainty.
The argument show_ci = TRUE will layer 95% pointwise credible interval bands on top of the estimated quantile curves. A credible interval is the Bayesian analog to a confidence interval, here supported by the default variance estimation methods in the qgam package.
quantileplot(y ~ s(x), data = sim_data, show_ci = TRUE)
Importantly, these are pointwise credible intervals. At each predictor value X=x, they are designed to contain the middle 95% of the posterior distribution of the quantile of Y given X=x. This is distinct from uncertainty statements about the entire curve. For example, it would be incorrect to conclude that in 95% of repeated draws the entire curve would fall within the plotted band.
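The gap between pointwise and whole-curve statements can be seen in a small base-R simulation. The sketch below is an analogy only: it uses frequentist 95% intervals for 20 independent means, not the package's posterior bands, and the number of points, sample size, and seed are assumptions chosen for illustration.

```r
set.seed(2)          # assumption: seed for reproducibility only
n_sims   <- 2000     # number of simulated repetitions
n_points <- 20       # number of predictor values where an interval is formed
n_obs    <- 50       # observations per predictor value
z        <- qnorm(0.975)

# For each repetition, record whether each pointwise interval covers the truth.
# The true mean is 0 at every point and the standard deviation is known to be 1.
covered <- replicate(n_sims, {
  draws <- matrix(rnorm(n_obs * n_points), nrow = n_obs)
  est   <- colMeans(draws)
  se    <- 1 / sqrt(n_obs)
  abs(est) <= z * se
})

# covered is an n_points x n_sims logical matrix
pointwise_coverage    <- mean(covered)                 # averaged over points
simultaneous_coverage <- mean(apply(covered, 2, all))  # all points at once

print(c(pointwise = pointwise_coverage, simultaneous = simultaneous_coverage))
```

Pointwise coverage should be close to 0.95, while the share of repetitions in which every interval covers at once is far lower (about 0.95^20, or roughly 0.36, for independent points). Neighboring points on a smooth curve are correlated, so the gap for a fitted curve would typically be smaller, but the qualitative point is the same.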
The ci argument allows the user to specify a credible level other than 0.95 (e.g., ci = 0.90 for 90% credible bands).
One may want to visualize uncertainty by simulating a series of hypothetical curves. The argument uncertainty_draws = 10 will add a panel of plots below the main plot. In each plot, the solid black line depicts the point estimate of the curve. The gray lines depict curves sampled from the posterior distribution.
quantileplot(y ~ s(x), data = sim_data, uncertainty_draws = 10)
Importantly, these curves are not like a confidence interval. They do not show the user a range such that the truth would fall in that range with some probability. Instead, they are useful to convey to the viewer the more basic idea that the estimated curve is only our best guess for a curve that is statistically uncertain.
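The general idea of drawing hypothetical curves can be sketched in base R. The example below is a stand-in under stated assumptions: it fits a cubic least-squares curve for the conditional mean (not a quantile curve from qgam) and draws coefficient vectors from a normal approximation centered at the estimate. How quantileplot samples its curves internally is not shown here.

```r
set.seed(4)  # assumption: seed for reproducibility only
x <- rbeta(1000, 1, 2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)

# Stand-in model: cubic polynomial for the conditional mean of y
fit      <- lm(y ~ poly(x, 3, raw = TRUE))
beta_hat <- coef(fit)
V        <- vcov(fit)

# Draw 10 coefficient vectors from N(beta_hat, V) using the Cholesky factor
n_draws <- 10
R       <- chol(V)  # upper triangular, t(R) %*% R equals V
draws   <- matrix(rnorm(n_draws * length(beta_hat)), nrow = n_draws) %*% R
draws   <- sweep(draws, 2, beta_hat, "+")

# Evaluate the point estimate and each sampled curve on a grid of x values
x_grid    <- seq(min(x), max(x), length.out = 100)
X_grid    <- cbind(1, x_grid, x_grid^2, x_grid^3)
curves    <- X_grid %*% t(draws)   # one column per sampled curve
point_est <- X_grid %*% beta_hat

# Gray lines: sampled curves. Black line: point estimate.
matplot(x_grid, curves, type = "l", lty = 1, col = "gray",
        xlab = "x", ylab = "Fitted mean of y")
lines(x_grid, point_est, lwd = 2)
```

The spread of the gray lines around the black line conveys how much the estimated curve could plausibly move, without defining an interval with a stated coverage probability.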
After creating a quantileplot object, you can convert that object into an analogous scatter plot to visualize the raw data. If desired in a large sample, you can use the fraction argument to plot some random fraction of the raw data.
qp <- quantileplot(y ~ s(x), data = sim_data)
scatter.quantileplot(qp, fraction = .5)
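For readers who want to see what plotting a random fraction amounts to, the base-R sketch below subsamples half of the rows before drawing a scatter plot. It is an illustration of the idea only; whether scatter.quantileplot subsamples in exactly this way is an assumption, and the seed and variable names are hypothetical.

```r
set.seed(3)  # assumption: seed for reproducibility only
x <- rbeta(1000, 1, 2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)
sim_data <- data.frame(x = x, y = y)

# Keep a random half of the rows, sampled without replacement
fraction   <- 0.5
keep       <- sample(nrow(sim_data), size = round(fraction * nrow(sim_data)))
sim_subset <- sim_data[keep, ]

# Scatter plot of the subsample
plot(sim_subset$x, sim_subset$y, pch = 16, cex = 0.5,
     xlab = "x", ylab = "y")
```

Subsampling thins the point cloud so that dense regions remain readable, at the cost of hiding some individual observations.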
There are two ways to customize the plot: with arguments and by manually modifying the resulting ggplot2 object.
Customizing with quantileplot arguments
You can customize many features of the plot with arguments to the quantileplot function.
quantileplot(
  y ~ s(x),
  data = sim_data,
  # Provide axis titles
  xlab = "A name for the predictor",
  ylab = "A name for the outcome",
  # Customize the number of vertical slices
  slice_n = 3,
  # Customize which quantiles are depicted
  quantiles = c(.3, .5, .7),
  # Denote quantiles by labels instead of colors
  quantile_notation = "label"
)
If predictors are extremely skewed, you may only want to visualize part of the space. For example, the code below restricts the visualization to the region of X∈(0,.5) and Y∈(0,1).
quantileplot(y ~ s(x),
             data = sim_data,
             x_data_range = c(0, .5),
             y_data_range = c(0, 1),
             show_ci = TRUE)
Note that the vertical densities are redistributed across the user-specified x_data_range.
Customizing with ggplot2
Because a quantileplot object contains a ggplot2 object in its plot element, you can modify the output by adding plotting layers. For instance, you can add a custom title, change colors, or modify axes.
library(ggplot2)
my_plot <- quantileplot(y ~ s(x), data = sim_data)
my_plot$plot +
  ggtitle("A custom title for the plot") +
  theme_light() +
  scale_color_manual(values = rainbow(5),
                     guide = guide_legend(reverse = TRUE, label.position = "left",
                                          title = "Custom\nlegend\ntitle and\ncolors")) +
  xlab(expression(Custom~axis~title~could~have~something~bold(bold))) +
  scale_y_continuous(breaks = c(0, 1, 2),
                     labels = c("Custom\nlabel at 0",
                                "Another\ncustom\nlabel at 1",
                                "Custom\nlabel at 2"),
                     name = "Custom y-axis\nbreaks and\nrotated title") +
  theme(axis.title.y = element_text(angle = 0, vjust = .5))
#> Scale for 'colour' is already present. Adding another scale for 'colour',
#> which will replace the existing scale.
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
If you want to modify the axis limits after the fact, you need to use coord_cartesian() because this is how quantileplot() sets the axis limits.
my_plot <- quantileplot(y ~ s(x), data = sim_data, quantile_notation = "label")
my_plot$plot +
  coord_cartesian(xlim = c(0, 2)) +
  annotate(geom = "label", x = 1.75, y = 1,
           label = "Extra space\nadded with\ncoord_cartesian()\nto allow more\nroom for\nannotations.",
           size = 3)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
In some settings, estimation of quantile curves is computationally intensive. If you are modifying some aspect of the densities and want to reuse the quantile curves cached from a previous call, you can use the previous_fit argument.
quantileplot(y ~ s(x), data = sim_data, previous_fit = my_plot)
You can also pass other arguments to qgam through the ... at the end of the function call. For instance, you can specify lsig rather than learn the log learning rate, as discussed in the qgam documentation. Note how these two plots differ sharply in the wiggliness of the estimated quantile curves.
quantileplot(y ~ s(x), data = sim_data, lsig = log(10))
quantileplot(y ~ s(x), data = sim_data, lsig = log(.01))