Statistical Transformation in ggplot2

Ha Khanh Nguyen

Statistical Transformations

  • Before we discuss transformations in graphs, consider a bar chart of the diamonds dataset drawn with geom_bar().

  • The bar chart displays the total number of diamonds in the diamonds dataset, grouped by cut.
    • geom_bar() requires only one aesthetic, x.
    • On the y-axis, it displays count, but count is not a variable in diamonds!
  • Many graphs (like scatterplots) plot the raw values of your dataset.
  • Other graphs (like bar charts, histograms, etc.) calculate new values to plot:
    • bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
    • smoothers fit a model to your data and then plot predictions from the model.
    • boxplots compute a robust summary of the distribution and then display a specially formatted box.
  • The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
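
The bar chart described above can be reproduced with a short sketch like the one below. It assumes the ggplot2 package is installed (diamonds ships with it); the object names p and computed are illustrative only. layer_data() is a ggplot2 helper that returns the data frame a layer actually draws, which makes the values computed by the stat visible.

```r
library(ggplot2)

# Bar chart of diamonds by cut: only the x aesthetic is supplied.
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
print(p)

# count is not a column of the raw data ...
"count" %in% names(diamonds)

# ... it is computed by the stat. layer_data() exposes the transformed
# data, with one row per cut and a computed count column.
computed <- layer_data(p)
computed[, c("x", "count")]
```

The counts in computed add up to the number of rows in diamonds, which is a quick way to check that the y-axis shows a tally of rows rather than a variable from the dataset.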

geom and stat functions

  • You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar().

  • This works because every geom has a default stat, and every stat has a default geom: geom_bar() uses stat_count() by default, and stat_count() uses geom_bar() by default.
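
As a minimal sketch of this interchangeability, assuming ggplot2 is installed, the two plots below should draw the same bar chart. The names p_geom and p_stat are illustrative only.

```r
library(ggplot2)

# Bar chart built from the geom (its default stat does the counting).
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# The same chart built from the stat (its default geom draws the bars).
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
print(p_stat)

# Both layers compute the same count for each cut.
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```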

Use a specific stat for graphing

There are three reasons you might need to use a stat explicitly:

  • You might want to override the default stat.

  • You might want to override the default mapping from transformed variables to aesthetics.
    • For example, you might want to display a bar chart of proportion, rather than count.

  • You might want to draw greater attention to the statistical transformation in your code.
    • For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing.
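
The three cases above can be sketched as follows. This is an illustration under stated assumptions rather than the only way to write them: it assumes ggplot2 3.3.0 or later (for after_stat() and the fun, fun.min, and fun.max arguments), and the names cut_counts, p_identity, p_prop, and p_summary are made up for this example. The choice of depth as the summarised variable is also just an illustration.

```r
library(ggplot2)

# 1. Override the default stat.
# cut_counts already holds one row per cut with its frequency (column Freq),
# so the bar heights should come straight from the data, not from counting.
cut_counts <- as.data.frame(table(cut = diamonds$cut))
p_identity <- ggplot(data = cut_counts) +
  geom_bar(mapping = aes(x = cut, y = Freq), stat = "identity")

# 2. Override the default mapping from transformed variables to aesthetics.
# stat_count() computes both count and prop; map y to prop instead of count.
# group = 1 makes the proportions relative to the whole dataset rather
# than to each bar separately.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# 3. Draw attention to the statistical transformation.
# For each cut, show the median depth with a range from min to max.
p_summary <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

print(p_identity)
print(p_prop)
print(p_summary)
```

In the second plot, leaving out group = 1 would make every bar the same height, because each cut would then be treated as its own group and its proportion within that group is 1.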