Data Visualization Best Practices

Overview

Preparing your data set and computing a few summary statistics is often not enough to understand your data well. Mapping those statistics into charts and figures can help you tell a compelling story!

Data visualization makes it easier to compare values, understand trends and identify outliers. It also speeds up the decision-making process, because you (or your reader) can comprehend the information more quickly, especially when the data is large and complicated.

In this building block, we go through the theory of data visualization, describe the most common chart types and conclude with best practices for plotting.

Theory of Data Visualization

Data Encoding

Data visualization makes use of marks (geometric primitives) and channels (appearance of marks) to create a chart. Marks can consist of:

• points
• lines
• areas
• any other complex shapes

Channels are ways in which we present the marks and consist of:

• position (horizontal, vertical)
• color (hue, saturation, luminance)
• size (length, area, volume)
• shape (orientation/tilt, curvature)

Several aspects should be considered when constructing a data visualization:

1. What is the data, and how is it structured?
3. What should the user be able to do: explore, confirm or communicate the data?
4. What are the actions and targets?
5. How should data items be mapped to visual elements?

Chart Elements

The chart below contains all necessary elements for a proper visualization.

First of all, a chart needs a coordinate system, axes and a scaling of the data. In this example, the horizontal (X) axis represents months and the primary vertical (Y) axis represents financial indicators. Additionally, a secondary vertical axis shows the ROI.

A complete chart also has a legend for providing mapping information, axes titles, a chart title, data labels and gridlines for better readability of the data.

Chart Types

There is a vast range of chart types that can be used to visualize data. In this building block we describe 7 of the most common ones, as they cover most visualization goals.

1. Scatterplot

The scatterplot represents data with 2 quantitative attributes, encoded as horizontal and vertical positions. It uses points as marks. The purposes of a scatterplot are to find trends or outliers, visualize a distribution or correlation, or identify clusters.

Code

With the following code you can create a scatterplot with self-generated data in R. You can substitute this kind of data with any dataset you are working with. This process will be replicated for every figure in this building block.
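A minimal sketch of such a scatterplot, assuming the ggplot2 package is installed and using simulated data (the names `scatter_data`, `x` and `y` are illustrative, not part of any real dataset):

```r
library(ggplot2)

# Simulated data: two quantitative attributes with a linear relation plus noise
set.seed(123)
scatter_data <- data.frame(x = rnorm(100, mean = 50, sd = 10))
scatter_data$y <- 2 * scatter_data$x + rnorm(100, mean = 0, sd = 15)

# Points are the marks; horizontal and vertical position are the channels
scatter_plot <- ggplot(scatter_data, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Scatterplot of simulated data", x = "X", y = "Y") +
  theme_minimal()

print(scatter_plot)
```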

The output should look like this:

2. Bar plot

The bar plot visualizes one categorical and one quantitative attribute. It uses bars (thick lines) as marks, and the channels are length (to express the quantitative value) and spatial regions (one per mark). The bars can be separated horizontally and aligned vertically (or the other way around), and are ordered either by label (alphabetically) or by value (bar length). The task of the bar chart is to compare or look up values.

Code
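A minimal sketch of a bar plot with ggplot2, assuming made-up values (the data frame `bar_data` and its columns are illustrative). The bars are ordered by value rather than alphabetically:

```r
library(ggplot2)

# Made-up data: one categorical and one quantitative attribute
bar_data <- data.frame(
  category = c("A", "B", "C", "D"),
  value = c(23, 45, 12, 36)
)

# reorder() sorts the categories by descending value;
# geom_col() maps the value to bar length
bar_plot <- ggplot(bar_data, aes(x = reorder(category, -value), y = value)) +
  geom_col() +
  labs(title = "Bar plot of made-up data", x = "Category", y = "Value") +
  theme_minimal()

print(bar_plot)
```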

The output should look like this:

3. Stacked bar chart

The stacked bar chart visualizes two categorical attributes and one quantitative attribute. As a mark it uses a vertical stack of line marks. As channels it uses length and color hue, as well as spatial regions. Its task is again to compare and look up values; additionally, it can show part-to-whole relationships.

Code
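A minimal sketch of a stacked bar chart with ggplot2, assuming made-up revenue figures (the names `stacked_data`, `region`, `product` and `revenue` are illustrative):

```r
library(ggplot2)

# Made-up data: two categorical attributes and one quantitative attribute
stacked_data <- expand.grid(
  region = c("North", "South", "East"),
  product = c("X", "Y")
)
stacked_data$revenue <- c(10, 15, 8, 12, 9, 14)

# The fill colour encodes the second categorical attribute;
# geom_col() stacks the segments within each bar by default
stacked_plot <- ggplot(stacked_data, aes(x = region, y = revenue, fill = product)) +
  geom_col() +
  labs(title = "Stacked bar chart of made-up data",
       x = "Region", y = "Revenue", fill = "Product") +
  theme_minimal()

print(stacked_plot)
```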

The output should look like this:

4. Line chart

The line chart represents 2 quantitative attributes and uses points with line connections between them as marks. The channels are aligned lengths to express quantitative value and are separated and ordered by attributes into horizontal regions. The task of the line chart is to find trends.

Code
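A minimal sketch of a line chart with ggplot2, assuming a made-up monthly series (the names `line_data`, `month` and `value` are illustrative):

```r
library(ggplot2)

# Made-up data: one value per month
line_data <- data.frame(
  month = 1:12,
  value = c(12, 15, 14, 18, 21, 25, 27, 26, 22, 19, 16, 13)
)

# Points connected by lines; the horizontal axis is ordered by month
line_plot <- ggplot(line_data, aes(x = month, y = value)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Line chart of made-up data", x = "Month", y = "Value") +
  theme_minimal()

print(line_plot)
```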

The output should look like this:

5. Heatmap

The heatmap can visualize 2 categorical attributes, usually in order, and one quantitative attribute. It uses areas as marks in the shape of a matrix indexed by the 2 categorical attributes. The channel is color hue ordered by the quantitative attribute. The purpose of the heatmap is to find clusters and outliers.

Code
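A minimal sketch of a heatmap with ggplot2, assuming randomly generated counts (the names `heat_data`, `day`, `slot` and `count` are illustrative). It uses `geom_tile()` to draw one area per combination of the two categorical attributes:

```r
library(ggplot2)

# Made-up data: two ordered categorical attributes and one quantitative attribute
set.seed(1)
heat_data <- expand.grid(
  day = c("Mon", "Tue", "Wed", "Thu", "Fri"),
  slot = c("Morning", "Afternoon", "Evening")
)
heat_data$count <- round(runif(nrow(heat_data), min = 0, max = 100))

# Each tile is an area mark; the fill colour encodes the quantitative attribute
heat_plot <- ggplot(heat_data, aes(x = day, y = slot, fill = count)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Heatmap of made-up data",
       x = "Day", y = "Time slot", fill = "Count") +
  theme_minimal()

print(heat_plot)
```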

The output should look like this:

6. Histogram

The histogram is used to find the distribution, or shape, of some data. It visualizes the frequency of an attribute from a table by using bins and counts. The bins are the intervals into which the range of values is divided, and the counts are the frequencies of the values inside each interval.

Code
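A minimal sketch of a histogram with ggplot2, assuming simulated, normally distributed values (the names `hist_data` and `value`, and the choice of 30 bins, are illustrative):

```r
library(ggplot2)

# Simulated data: 500 draws from a normal distribution
set.seed(42)
hist_data <- data.frame(value = rnorm(500, mean = 100, sd = 15))

# bins sets the number of intervals; the bar height is the count per interval
hist_plot <- ggplot(hist_data, aes(x = value)) +
  geom_histogram(bins = 30, colour = "white") +
  labs(title = "Histogram of simulated data", x = "Value", y = "Count") +
  theme_minimal()

print(hist_plot)
```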

The output should look like this:

7. Box plot

The box plot is also used to find the distribution of the data. It maps the attributes by calculating 5 quantitative values:

• median: central value/line
• lower and upper quartile: boxes
• lower and upper limits: whiskers

Any values outside the limits are considered outliers.

Code
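A minimal sketch of a box plot with ggplot2, assuming three simulated groups (the names `box_data`, `group` and `value` are illustrative). Note that `geom_boxplot()` by default draws each whisker to the most extreme observation within 1.5 times the interquartile range of the box, and plots anything beyond that as an individual point:

```r
library(ggplot2)

# Simulated data: three groups with different means and spreads
set.seed(7)
box_data <- data.frame(
  group = rep(c("A", "B", "C"), each = 50),
  value = c(rnorm(50, mean = 10, sd = 2),
            rnorm(50, mean = 12, sd = 3),
            rnorm(50, mean = 9, sd = 1.5))
)

# One box per group: median line, quartile box, whiskers and outlier points
box_plot <- ggplot(box_data, aes(x = group, y = value)) +
  geom_boxplot() +
  labs(title = "Box plot of simulated data", x = "Group", y = "Value") +
  theme_minimal()

print(box_plot)
```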

The output should look like this:

Best Practices for Data Visualization

We have compiled a few best practices geared towards academic publishing.

Tip
1. Monochrome Compatibility: Prioritize monochrome (black and white) designs for figures, especially when unsure of the printing format. If the publication medium allows for color, use a distinct and color-blind friendly palette. Tools like ColorBrewer can assist in choosing appropriate colors.

2. Simplicity and Clarity: Don’t overcrowd the visualization. It’s better to have multiple clear visualizations than one cluttered and hard-to-decipher chart. Each visualization should convey a singular, focused message.

3. Stick to Recognizable Formats: While innovative charts can be captivating, academic readers expect clarity and familiarity (such as the common formats shown above).

4. Detailed Annotation: Every visualization should be self-contained and include all necessary information:

• Title: A succinct description of what the visualization represents.
• Axes Labels: Clearly labeled with variables being represented.
• Units: Always specify the units of measurement.
• Legends: Ensure that any symbols, colors, or patterns used are clearly explained.
• Captions: Provide a brief overview or important insight, especially if there’s a key takeaway or if the visualization requires additional context.
5. Scaling and Typography:

• Axes Ranges: Choose ranges that highlight the data’s key aspects without misrepresenting any variability or skewing perception.
• Font Size & Style: Fonts should be legible even when the figure is downsized for print. Avoid decorative fonts; stick to clean, universally-readable fonts like Arial, Helvetica, or Times New Roman.

Remember, the primary goal of data visualization in academic papers is clarity and effective communication of the research findings. Your visualization should aid comprehension rather than introduce confusion.
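Several of these tips can be combined in a single figure. The following is a minimal sketch, assuming ggplot2 and entirely made-up data (the variable names, units and label texts are illustrative). Groups are distinguished by point shape and line type rather than color, so the figure stays readable in black and white:

```r
library(ggplot2)

# Made-up data: a response measured at five dose levels in two groups
set.seed(10)
practice_data <- data.frame(
  dose = rep(1:5, times = 2),
  group = rep(c("Control", "Treatment"), each = 5)
)
slope <- ifelse(practice_data$group == "Treatment", 3, 2)
practice_data$response <- slope * practice_data$dose + rnorm(10, sd = 0.5)

practice_plot <- ggplot(practice_data,
                        aes(x = dose, y = response,
                            shape = group, linetype = group)) +
  geom_line() +
  geom_point(size = 2) +
  # Title, axis labels with units, legend title and caption
  labs(title = "Response by dose (made-up data)",
       x = "Dose (mg)", y = "Response (units)",
       shape = "Group", linetype = "Group",
       caption = "Simulated values for illustration only.") +
  # Plain black-and-white theme with a base font size that survives downsizing
  theme_bw(base_size = 12)

print(practice_plot)
```

Giving the shape and linetype legends the same title lets ggplot2 merge them into one legend.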

Saving plots

To save the plots just created we will use the function ggsave, part of the ggplot2 package.
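A minimal sketch of saving a plot with `ggsave`, assuming ggplot2 is installed. The example plot (built from R's built-in `mtcars` data), the file name, the output location and the dimensions are all illustrative choices:

```r
library(ggplot2)

# An example plot to save, built from the built-in mtcars data
plot_to_save <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Write to a temporary directory here; replace with your own output path
out_file <- file.path(tempdir(), "scatterplot.pdf")

# The file extension determines the output format; width and height are in inches
ggsave(filename = out_file, plot = plot_to_save, width = 6, height = 4)
```

If the `plot` argument is omitted, `ggsave` saves the last plot that was displayed.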
