Combining Statistical Testing and Visualization, directly in R using ggpubr and rstatix

8 mins

Overview

This article delves into the integration of `ggpubr` and `rstatix` packages in R, providing step-by-step guidance on (a) conducting statistical tests and (b) automating the results into your visualizations.

`ggpubr` is a R package is a package well suited to create elegant, publication-quality plots. Its strength lies in integrating statistical analysis results, such as significance testing, directly into plots. This building block will delve into using functions like `compare_means()`,` ggbarplot()`, and `stat_p_value()` to create a bar plot that not only displays data but also its statistical significance.

Furthermore we will use the `rstatix` package. `rstatix` provides functions for statistical tests and data preparation, while `ggpubr` is used for creating plots. Together, they streamline and automate the process of visualizing statistical comparisons.

Step-by-Step Guide: Crafting Publishable charts with `ggpubr`

Step 1: Installing and Loading `ggpubr` and `rstatix`

First, ensure that you have `ggpubr` and `rstatix` installed and loaded in your R environment:

Step 2: Data Cleaning and Preparation

The next step is to clean your data and prepare it for statistical analysis and visualization.

Step 3: Statistical Analysis using `rstatix`

`rstatix` offers functions for conducting statistical tests, such as t-tests and ANOVA, which can then be directly incorporated into `ggpubr` for creating informative visualizations. This integration within the `ggplot2` ecosystem combines the familiar sytnax of `ggplot2` and enhances the efficiency of the analysis process.

We will now use the `compare_means()` function from the `rstatix` package, designed to compare the mean values of a specific variable across different groups.

Tip

How does the `compare_means` function work?

-Syntax:

• compare_means(formula, data, method, ref.group)

Components:

• formula: This is where you specify the comparison you want to make.
• This similar to the formula specification within the lm() function.
• Example: height ~ gender implies comparing the average height across different genders.
• data: This parameter is used to specify the dataset in which the comparison is to be made.
• method: This parameter defines the statistical test to be used for comparison.
• ref.group: This is the reference group against which other groups are compared.
• Example: If you have groups ‘A’, ‘B’, and ‘C’, and ‘A’ is set as the reference group, the function will compare ‘B’ and ‘C’ with ‘A’.

This function is a powerful tool for conducting and interpreting various statistical comparisons, particularly useful in data analysis and research.

Step 4: Data Visualization using `ggpubr`

For this analysis, we have chosen to use `ggbarplot()` from the `ggpubr` package. This function is used to create bar plots, however `ggpubr` incorporates many more types of visualization. Examples are `gghistogram()`, `ggdensity()` and `ggline()`.

Tip

Key Features of `ggbarplot()`:

Simplicity:
One of the main advantages of the `ggpubr` package and therfore the function `ggbarplot()` is its simplicity. It allows you to create bar plots with minimal coding, making it user-friendly, especially for those new to data visualization in R.

Customization:
Despite its simplicity, `ggbarplot()` offers a wide range of customization options. You can change colors, add labels, adjust the width of the bars, and much more.

Integration with ggplot2:
ggpubr, the package that includes ggbarplot(), is built on top of ggplot2, a powerful and popular plotting system in R. This means that ggbarplot() benefits from the robustness and flexibility of ggplot2.

Let’s look at a practical example next.

Step 5: Adding Statistical Significance with `stat_p_value()`

Now, integrate the test results (step 3) into the bar plot (step 4) using `stat_p_value()`.
The `stat_pvalue_manual()` function comes from the `ggpubr` package, and it’s used in conjunction with bar plots or other types of plots to add statistical annotations, specifically p-values, to the plot.
Adding these to a plot can provide a visual cue about the statistical differences between groups.

Tip

Breaking Down the Syntax of `stat_pvalue_manual()`:

• stat.test: This is your statistical test result that you want to annotate on the plot. The function uses information from this result to create the annotations.
• tip.length: This parameter controls the length of the ‘tips’ or the small lines that point to the bars or groups being compared.
• y.position: It sets the vertical (y-axis) position of the p-value labels on the plot. You can adjust these numbers to position the labels higher or lower relative to the bars.
• label: Here, you specify what kind of label you want to use for your p-value.
• For example, “p.signif” will automatically convert the p-value into a significance level. Like ‘**’ for highlight significant results.

Tip

For a step-by-step guide on creating professional bar charts and saving them, refer to this article. It provides a detailed walkthrough of the process to help you create publishable bar charts with ease.

Advanced case with Multiple Categorical Variables

In our previous analysis, we delved into the relationship between education levels and hourly earnings, focusing on a single categorical variable, which was education level. Now, let’s take our analysis a step further by considering the influence of not just one, but two categorical variables. Specifically, we aim to understand how both education level and gender collectively impact the mean hourly wages of individuals.

Summary

This article explores the integration of R packages, `ggpubr` and `rstatix`, to blend statistical analysis with data visualization. These combination of packages are builiding on the `ggplot2` syntax, assuring visualizations are created within a familiar syntax.

• It covers all the steps, from data preparation and package installation to conducting statistical tests and creating academic publication-quality bar plots.
• The process involves loading data, cleaning it, and performing tests using `compare_means() `from `rstatix`. Then, an example of `ggbarplot() `from `ggpubr` is employed to create bar plots. To enhance the plots, statistical significance is added with `stat_p_value()`.
• An advanced case considers the influence of multiple categorical variables on mean hourly wages.

This guide empowers users to produce informative and visually appealing charts in R.