quantileplot

This R package has a companion working paper.

Lundberg, Ian, Robin C. Lee, and Brandon M. Stewart. 2021. “The quantile plot: A visualization for bivariate population relationships.” Working paper.

The quantileplot package visualizes bivariate associations. When summarizing bivariate data, a best-fit regression line often obscures important trends—it is too simple. A scatter plot shows all the data but overwhelms the viewer—it is too complicated. A quantileplot is a middle ground with three components.

  1. The marginal density of the predictor variable.
  2. The conditional density of the outcome at selected values of the predictor.
  3. Smooth curves for conditional quantiles of the outcome as functions of the predictor.

A quantileplot may be useful in many settings. One example is when

This vignette illustrates the package functionality with a simulated example.

Installation

The quantileplot package is available via GitHub.

Basic functionality

Suppose we have a continuous predictor \(X\) and a continuous outcome \(Y\).

library(quantileplot)
x <- rbeta(1000,1,2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)
sim_data <- data.frame(x = x, y = y)

A call to the quantileplot function, produces the most basic quantile plot.

quantileplot(y ~ s(x), data = sim_data)

A call to quantileplot may take some time to compute. In the simulated setting above, the time was 4 seconds for 1,000 observations and 24 seconds for 10,000 observations.

Visualizing uncertainty

This most basic version of the visualization does not present any estimates of uncertainty. There are two options for visualizing uncertainty.

Uncertainty method 1: Pointwise confidence bands

The argument show_ci = T will layer 95% pointwise credible interval bands on top of the estimated quantile curves. A credible interval is the Bayesian analog to a confidence interval, here supported by the default variance estimation methods in the qgam package.

quantileplot(y ~ s(x), data = sim_data, show_ci = T)

Importantly, these are pointwise credible intervals. At each predictor value \(X = x\), they are designed to contain the middle 95% of the posterior distribution of the quantile of \(Y\) given \(X = x\). This is distinct from uncertainty statements about the entire curve. For example, it would be incorrect to conclude that over 95% of repeated draws the entire curve would fall within the plotted band.

The ci argument allows the user to specify a credible value other than 0.95 (e.g. for 90% confidence bands.)

Uncertainty method 2: Posterior samples of curves

One may want to visualize uncertainty by simulating a series of hypothetical curves. The argument uncertainty_draws = 10 will add a panel of plots below the main plot. In each plot, the 10 solid black line depicts the point estimate of the curve. The gray lines depict curves sampled from the posterior distribution.

quantileplot(y ~ s(x), data = sim_data, uncertainty_draws = 10)

Importantly, these curves are not like a confidence interval. They do not show the user a range such that the truth would fall in that range with some probability. Instead, they are useful to convey to the viewer the more basic idea that the estimated curve is only our best guess for a curve that is statistically uncertain.

Visualizing the raw data

After creating a quantileplot object, you can convert that object into an analogous scatter plot to visualize the raw data. If desired in a large sample, you can use the fraction argument to plot some random fraction of the raw data.

qp <- quantileplot(y ~ s(x), data = sim_data)
scatter.quantileplot(qp, fraction = .5)

Customizing the visualization

There are two ways to customize the plot: with arguments and by manually modifying the resulting ggplot2 object.

Customization 1: Arguments to quantileplot

You can customize many features of the plot with arguments to the quantileplot function.

quantileplot(
  y ~ s(x), 
  data = sim_data, 
  # Provide axis titles
  xlab = "A name for the predictor", 
  ylab = "A name for the outcome",
  # Customize the number of vertical slices
  slice_n = 3,
  # Customize which quantiles are depicted
  quantiles = c(.3,.5,.7),
  # Denote quantiles by colors with labels instead of colors
  quantile_notation = "label"
)

If predictors are extremely skewed, you may only want to visualize part of the space. For example, the code below restricts the visualization to the region of \(X\in(0,.5)\) and \(Y\in (0,1)\).

quantileplot(y ~ s(x), 
             data = sim_data,
             x_data_range = c(0,.5),
             y_data_range = c(0,1),
             show_ci = T)

Note that the vertical densities are redistributed across the user-specified x_data_range.

Customization 2: Make modifications as in ggplot2

Because the basic quantileplot contains a ggplot2 object in the plot element, you can modify the output by providing plotting layers. For instance, you can add a custom title, change colors, modify axes, etc.

library(ggplot2)
my_plot <- quantileplot(y ~ s(x), data = sim_data)
my_plot$plot +
  ggtitle("A custom title for the plot") +
  theme_light() +
  scale_color_manual(values = rainbow(5),
                     guide = guide_legend(reverse = TRUE, label.position = "left",
                                          title = "Custom\nlegend\ntitle and\ncolors")) +
  xlab(expression(Custom~axis~title~could~have~something~bold(bold))) +
  scale_y_continuous(breaks = c(0,1,2),
                     labels = c("Custom\nlabel at 0",
                                "Another\ncustom\nlabel at 1",
                                "Custom\nlabel at 2"),
                     name = "Custom y-axis\nbreaks and\nrotated title") +
  theme(axis.title.y = element_text(angle = 0, vjust = .5))
#> Scale for 'colour' is already present. Adding another scale for 'colour',
#> which will replace the existing scale.
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.

If you want to modify the axis limits after the fact, you need to use coord_cartesian() because this is how quantileplot() set the axis limits.

my_plot <- quantileplot(y ~ s(x), data = sim_data, quantile_notation = "label")
my_plot$plot +
  coord_cartesian(xlim = c(0,2)) +
  annotate(geom = "label", x = 1.75, y = 1,
           label = "Extra space\nadded with\ncoord_cartesian()\nto allow more\nroom for\nannotations.",
           size = 3)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Additional notes

In some settings, estimation of quantile curves is computationally intensive. If you are modifying some aspect of the densities and want to use a cache from a previous call for the quantile curves, you can use the argument.

quantileplot(y ~ s(x), data = sim_data, previous_fit = my_plot)

You can also pass other arguments to through the at the end of the function call. For instance, you can specify rather than learn the log learning rate as discussed in the documentation. Note how these two plots have very different wiggliness of the estimated quantile curves.

quantileplot(y ~ s(x), data = sim_data, lsig = log(10))

quantileplot(y ~ s(x), data = sim_data, lsig = log(.01))