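// Axis mappings for the three overplotting situations (a)-(c) discussed in the Introduction.
problems = [
  {name: '(a) Line patterns', carX: 'Cylinders', carY: 'MPG'},
  {name: '(b) Dot patterns', carX: 'Cylinders', carY: 'Cylinders'},
  {name: '(c) Diagonal patterns', carX: 'MPG', carY: 'MPG'}
];
// `problem` is the currently selected entry of `problems`; it is not defined in this excerpt.
Plot.plot({
  inset: 8,
  grid: true,
  color: {
    legend: true,
  },
  marks: [
    // One dot per car; identical (x, y) values are drawn on top of each other.
    Plot.dot(cars, {x: problem.carX, y: problem.carY, r: 8})
  ]
});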
Gatherplot: A Non-Overlapping Scatterplot
This paper has been published on the experimental track of the Journal of Visualization and Interaction. See the reviews and issues for this paper.
1 Introduction
Scatterplots, one of the most common types of statistical graphics (Cleveland and McGill 1988; Elmqvist, Dragicevic, and Fekete 2008; Utts 1996), are often used to visualize two continuous variables using visual marks mapped to a two-dimensional Cartesian space, where the color, size, and shape of the marks can represent additional dimensions. They can also be used for exploring multidimensional datasets in the form of scatterplot matrices (SPLOMs), where all the possible combinations of axes are presented in table form. However, scatterplots are so-called overlapping visualizations (Fekete and Plaisant 2002) in that the visual marks representing individual data points may begin to overlap each other in screen space when the marks are large, when there is insufficient screen space to fit all the data at the desired resolution, or simply when several data points share the same value. Furthermore, real-world multidimensional datasets often contain categorical variables, such as nominal variables or discrete data dimensions with a small domain, which lead to many data points being mapped to the exact same screen position. This kind of overlap is known as overplotting (or overdrawing) in visualization, and is problematic because it may lead to data points being entirely hidden by other points, which in turn may lead to the viewer making incorrect assessments of the data. As can be seen in Figure 2, there are three situations for mapping variables to axes in scatterplots in which overplotting is inevitable:
- Plotting categorical vs. continuous variables gives rise to line patterns (Figure 2(a));
- Plotting categorical vs. categorical variables gives rise to single dot patterns (Figure 2(b)); and
- Plotting the same continuous variable on both axes gives rise to diagonal line patterns (Figure 2(c)).
Figure 2: Scatterplots of one continuous variable (MPG) and one categorical variable (Cylinders), showing limitations of scatterplots when plotting categorical variables on one or both axes.
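To make the three situations concrete, the following minimal sketch counts how many marks are completely hidden because they share an exact position with another mark. The function name `hiddenMarks` and the four toy rows are assumptions made for illustration; they are not taken from the car dataset used in the figures.

```javascript
// Hypothetical helper: counts marks that land on an already occupied position.
// For every distinct (x, y) position, all marks beyond the first are hidden.
function hiddenMarks(points, xKey, yKey) {
  const counts = new Map();
  for (const p of points) {
    const position = `${p[xKey]}|${p[yKey]}`;
    counts.set(position, (counts.get(position) || 0) + 1);
  }
  let hidden = 0;
  for (const count of counts.values()) hidden += count - 1;
  return hidden;
}

// Toy rows for illustration only (not real data).
const toyCars = [
  {Cylinders: 4, MPG: 30},
  {Cylinders: 4, MPG: 30},
  {Cylinders: 4, MPG: 25},
  {Cylinders: 6, MPG: 20},
];
```

With these rows, mapping Cylinders to both axes hides two of the four marks, whereas mapping Cylinders against MPG hides one; the categorical vs. categorical case collapses fastest because only as many positions exist as there are categories.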
Several approaches have been proposed to address this problem (Ellis and Dix 2007), the most prominent being transparency, jittering, and clustering techniques. The first, changing transparency, does not so much address the problem as sidestep it by making the visual marks semi-transparent so that an accumulation of overlapping points is still visible. However, this does not scale for large datasets, and also causes blending issues if color is used to encode additional variables. Jittering perturbs visual marks using a random displacement (Trutschl, Grinstein, and Cvek 2003) so that no mark falls on the exact same screen location as any other mark, but this approach is still prone to overplotting for large data. It also introduces distortion that is not aptly communicated by the scatterplot since marks will no longer be placed at their true location in the Cartesian space. Still other approaches attempt to organize overlapping marks into visual groups that summarize their distribution, such as histograms, violin plots, and kernel density estimation (KDE) plots (Fua, Ward, and Rundensteiner 1999; Mayorga and Gleicher 2013; Im, McGuffin, and Leung 2013; Silverman 1986). However, this comes at the cost of losing the identity of individual points, which can be problematic when filtering or searching; e.g., brushing data points is difficult in histograms (Im, McGuffin, and Leung 2013).
In this paper, we propose the concept of gathering as an alternative to scattering and jittering, and then show how we can use this visual transformation to define a novel visualization technique called a gatherplot. The gatherplot is an instance of a recently recognized family of visualization techniques called unit visualizations (Park et al. 2018) that maintain a strict mapping between every data item and its unique visual mark, improving understandability over aggregated representations as well as enabling more direct interactions. Gathering is a generalization of the linear mapping used by scatterplots, and works by first partitioning the graphical axis into segments based on the data dimension and then organizing points into packed groups for each segment to avoid overplotting. This means that the gather operation relaxes the continuous spatial mapping commonly used for a graphical axis; instead, each discrete segment occupies a certain interval of screen space that maps to the same data value. This is communicated using graphical brackets on the axis that show the value or interval for each segment (Figure 1(b)).
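The two steps of the gather operation described above (partition the axis into segments, then pack the points of each segment) can be sketched for a single categorical axis as follows. This is an illustrative simplification and not the gatherplot implementation: the function `gather`, its parameters `segmentWidth` and `markSize`, and the row-by-row grid packing are all assumptions.

```javascript
// Hypothetical sketch of gathering along one categorical axis.
// Each distinct value receives its own interval of screen space (a segment),
// and the points of a segment are packed into a grid so no two marks overlap.
function gather(points, key, segmentWidth, markSize) {
  // Step 1: partition the points into segments, in order of first appearance.
  const segments = new Map();
  for (const p of points) {
    if (!segments.has(p[key])) segments.set(p[key], []);
    segments.get(p[key]).push(p);
  }
  // Step 2: pack each segment into rows of `perRow` marks.
  const perRow = Math.max(1, Math.floor(segmentWidth / markSize));
  const placed = [];
  let segmentIndex = 0;
  for (const [value, members] of segments) {
    const x0 = segmentIndex * segmentWidth; // left edge of this segment
    members.forEach((p, i) => {
      placed.push({
        ...p,
        segment: value,
        x: x0 + (i % perRow) * markSize + markSize / 2,
        y: Math.floor(i / perRow) * markSize + markSize / 2,
      });
    });
    segmentIndex++;
  }
  return placed;
}
```

In this sketch every mark keeps its identity and every position is unique, but the x coordinate inside a segment no longer encodes a data value; the whole interval `[x0, x0 + segmentWidth)` stands for a single category, which is the relaxation of the continuous mapping described above.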
The contributions of our paper are the following: (1) the gatherplot technique that applies the gather operation to scatterplots to mitigate overplotting; and (2) results from a crowdsourced graphical perception study on the effectiveness of gatherplots.
2 Background
Scatterplots are classic statistical data graphics with many design variations to address challenges of scale, complexity, and specific tasks (Sarikaya and Gleicher 2018). Our goal with gatherplots is to generalize scatterplots to a representation that maintains their simplicity and familiarity while eliminating overplotting. Partial or complete overplotting generally leads to visual clutter. Ellis and Dix (2007) survey the literature and derive a general approach to reduce clutter. According to their treatment, there are three ways to reduce clutter in a visualization: by changing the visual appearance, by distorting visual space, or by presenting data over time. Some straightforward mechanisms they list include decreasing mark size, increasing display space, or animating the data. Below we review more sophisticated approaches based on appearance and distortion.
Appearance-based Methods
Practical appearance-based approaches to mitigate overplotting include transparency, sampling, kernel density estimation (KDE), and aggregation. Transparency changes the opacity of the visual marks, and has been shown to convey overlap for up to five occurrences (Zhai, Buxton, and Milgram 1996). However, there is still an upper limit for how much overlap is perceptible to the user, and the blending caused by overlapping marks of different colors makes identifying colors difficult.
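One way to see why transparency saturates is to compute the combined opacity of a stack of identical marks. The sketch below assumes standard "over" alpha compositing of same-colored marks, under which n marks of opacity a combine to 1 - (1 - a)^n; the function name `stackedOpacity` is hypothetical, and the sketch says nothing about the perceptual limit reported in the cited study.

```javascript
// Combined opacity of n identical, fully overlapping marks with opacity
// `alpha`, assuming standard "over" compositing: 1 - (1 - alpha)^n.
function stackedOpacity(alpha, n) {
  return 1 - Math.pow(1 - alpha, n);
}

// stackedOpacity(0.5, 1) → 0.5
// stackedOpacity(0.5, 2) → 0.75
```

Because each additional mark contributes a geometrically shrinking increment, the difference between, say, 20 and 21 overlapping marks at an opacity of 0.2 is far smaller than the difference between one and two marks, so large stacks become indistinguishable.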
Sampling uses stochastic methods to statistically reduce the amount of data to visualize (Dix and Ellis 2002). This may reduce the amount of overplotting, but since the sampling is random, overplotting cannot be reliably eliminated. Furthermore, one of the core strengths of a scatterplot is its ability to show outliers effectively, whereas sampling will likely eliminate all outliers (due to the intrinsic nature of an outlier).
Aggregation methods can also mitigate overplotting. KDE (Silverman 1986) and other binned aggregations (Elmqvist and Fekete 2010; Fua, Ward, and Rundensteiner 1999; Mayorga and Gleicher 2013; Im, McGuffin, and Leung 2013) replace a cluster of marks with a single entity that has a distinct visual representation. Similarly, splatterplots (Mayorga and Gleicher 2013) combine individual marks with aggregated entities, using marks to show outliers and aggregated entities to show the general trends. While aggregation techniques are effective against overplotting for continuous variables, they fare poorly for categorical ones. Therefore, the generalized plot matrices (GPLOMs) (Im, McGuffin, and Leung 2013) were proposed to solve this particular problem by adopting non-homogeneous plots into a matrix. The technique uses a histogram for categorical vs. continuous variables, and a treemap for categorical vs. categorical variables. While effective in providing overview, aggregated techniques sacrifice some compatibility with scatterplots since they no longer maintain object identity, meaning that each visual mark no longer represents a single data point.
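The loss of object identity under aggregation can be illustrated with a minimal square-binning sketch. This is a generic example rather than any of the cited techniques; the function `binCounts` and the parameter `binSize` are assumptions.

```javascript
// Hypothetical square binning: replaces the individual points in each
// binSize x binSize cell by a single count.
function binCounts(points, binSize) {
  const bins = new Map();
  for (const {x, y} of points) {
    const key = `${Math.floor(x / binSize)},${Math.floor(y / binSize)}`;
    bins.set(key, (bins.get(key) || 0) + 1);
  }
  return bins; // Map from "column,row" to the number of points in that cell
}
```

The result contains one entry per occupied cell instead of one per data item, so a brushing or search interaction can select a cell but can no longer address the individual points inside it.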
Figure 3: Scatterplot mapping the horizontal axis to Cylinders and the vertical to Displacement in a car dataset.
Distortion-based Methods
Distortion-based techniques avoid overplotting by changing the spatial mapping of the space and have the advantage of keeping the identity of individual data points. The canonical distortion technique is jittering, where a random displacement is used to subtly modify the exact screen space position of a data point (Figure 3). This has the effect of spreading data points apart so that they are easier to distinguish. However, naïve jittering mechanisms apply the displacement indiscriminately to all data points, regardless of whether they are overlapping or not. This has the drawback of distorting points away from their true location on the visual canvas, and still does not completely eliminate overplotting.
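A minimal sketch of naïve jittering makes both drawbacks visible. The function `jitter`, the uniform displacement in `[-amount, amount]`, and the injectable `random` parameter (used only to make the behavior reproducible) are assumptions made for illustration.

```javascript
// Hypothetical naive jitter: every point is displaced by a uniform random
// offset in [-amount, amount] on both axes, whether or not it overlaps.
function jitter(points, amount, random = Math.random) {
  return points.map(p => ({
    ...p,
    x: p.x + (random() * 2 - 1) * amount,
    y: p.y + (random() * 2 - 1) * amount,
  }));
}
```

Isolated points are moved just like overlapping ones, so every mark ends up away from its true position, and nothing in the procedure checks the result for collisions: two points that draw similar offsets still overlap after jittering.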
Bezerianos et al. (2010) use a more structured approach to displacement, where overlapping marks are organized onto the perimeter of a circle. The circle is grown to a radius so that all marks fit, which means that its size is also an indication of the number of grouped points. However, this mechanism still introduces uncertainty in the spatial mapping, and it is also not clear how well it scales for very dense data, as this can lead to a circle of arbitrarily large size. Nevertheless, the approach is a good example of how deterministic displacement can be used to great effect for eliminating overplotting.
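The relation between group size and circle size can be sketched geometrically. The code below is not the cited authors' algorithm; it assumes circular marks of diameter `markDiameter` spaced evenly on the perimeter, and picks the smallest radius at which neighboring marks just touch, namely d / (2 sin(π / n)) for n ≥ 2. The function name `circleLayout` is hypothetical.

```javascript
// Hypothetical sketch: place n circular marks evenly on a circle around
// (cx, cy), using the smallest radius at which adjacent marks do not overlap.
function circleLayout(cx, cy, n, markDiameter) {
  if (n <= 0) return [];
  if (n === 1) return [{x: cx, y: cy}];
  // Adjacent marks are a chord of length 2 * radius * sin(pi / n) apart;
  // setting that chord equal to the mark diameter gives the radius.
  const radius = markDiameter / (2 * Math.sin(Math.PI / n));
  return Array.from({length: n}, (_, i) => {
    const angle = (2 * Math.PI * i) / n;
    return {x: cx + radius * Math.cos(angle), y: cy + radius * Math.sin(angle)};
  });
}
```

Under these assumptions the radius grows roughly linearly with n (about n · d / (2π) for large n), which illustrates the scalability concern raised above: a group of a few thousand marks needs a circle hundreds of mark diameters wide.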
Trutschl et al. (2003) propose a deterministic displacement (“smart jittering”) that adds meaning to the jittered position based on clustering results. This makes it easier to understand the resulting spatial display.
Data-aware Methods
The most advanced and effective overplotting mitigations are data-aware, in that they identify instances of overplotting in a chart. As a case in point, recent work by Chen et al. (2018) uses animation to cycle the depth value of overlapping points in a scatterplot over time to ensure that every point is shown on top at some point in the rotation. This means that overplotting is alleviated through the notion of “guaranteed visibility over time” presented by Munzner et al. (2003). However, a common criticism of techniques designed to mitigate overplotting is that they often do not scale to large datasets.
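The idea of guaranteed visibility over time can be illustrated with a simple rotation of the draw order. This is only a sketch of the general principle, not the method of Chen et al.; the function `drawOrderAt` and the convention that the last element is drawn on top are assumptions.

```javascript
// Hypothetical sketch: rotate the draw order of a group of overlapping
// points by one position per animation frame. The last element of the
// returned array is drawn last and is therefore the visible one.
function drawOrderAt(points, frame) {
  const n = points.length;
  if (n === 0) return [];
  const shift = frame % n;
  return points.slice(shift).concat(points.slice(0, shift));
}
```

Over n consecutive frames each of the n overlapping points is on top exactly once. The same property hints at the scalability issue: a stack of n points needs n frames to be shown completely.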
Shneiderman et al. (2000) propose a data-aware structured displacement approach called hieraxes, which combines hierarchical browsing with two-dimensional scatterplots. In hieraxes, a two-dimensional visual space is subdivided into rectangular segments for different categories in the data, and points are then coalesced into stacked groups inside the different segments. This work inspired gatherplots, which refine the layout and design of hieraxes further.
Microsoft’s SandDance (Microsoft Research 2011) uses atomic visual marks as the building blocks of a highly interactive and visual interface built on smooth transitions between different spatial mappings. Drawing on an older experimental tool called Pivot from Microsoft Live Labs, SandDance now exists as a custom visual in the Microsoft Power BI tool.
Keim et al. (2010) propose generalized scatterplots that use a data-aware combination of overlapping and distortion to avoid overplotting in a scatterplot display. By balancing the overlapping and distortion, the user can achieve a display that conforms to their prior familiarity with scatterplots while retaining minimal occlusion and appropriate distortion of data points. In contrast, our gatherplot approach balances discriminability (the ability to distinguish individual marks) and spatial accuracy (the deviation from the true mapped position of a point). While gatherplots thus yield better discriminability, they are also less visually scalable and less accurate.
Hieraxes (Shneiderman 1996), SandDance (Microsoft Research 2011), and gatherplots (which we present in this paper) are all examples of a recently recognized family of visualizations called unit visualizations (Park et al. 2018) where the relation between data items and their marks is explicitly maintained. This identity property between data and display is exemplified in visualizations such as unit charts, dotplots, and scatterplots. It can be contrasted with aggregated visualizations that combine multiple data items into a single visual mark, such as bar charts, pie charts, and histograms. Our gatherplots technique shows how a unit visualization can be designed, evaluated, and even deployed from the ground up based on unit visualization principles.
It is worth comparing gatherplots to our own prior work on the Atom (Park et al. 2018) grammar for unit visualization. Atom is a general-purpose visualization grammar with gathering (packing with no overlap) as one of the visual layout options. In other words, the gatherplots technique can be more or less replicated using Atom. However, our focus in this paper is on specialized design aspects of gatherplots, including the interval marks, streamgraph layout, and layout modes. Also, where Atom is merely a grammar, this paper contributes a detailed user study involving the gatherplots technique in comparison to jittering.
Visualizing Categorical Variables
While we have already ascertained that scatterplots are not optimal for categorical variables, there exists a multitude of visualization techniques that have been specifically designed for such data (Bederson, Shneiderman, and Wattenberg 2002; Hofmann, Siebes, and Wilhelm 2000; Kosara, Bendix, and Hauser 2006). Simplest among them are bar chart histograms, which visualize the item count for each categorical value (Stevens 1946). Boxplots and violin plots show the distribution of continuous variables over categorical variables (Wickham and Stryjewski 2011). While hieraxes, histograms, and treemaps are effective in dealing with categorical variables, it is difficult to extend these to continuous vs. categorical variables. One way is to apply binning to continuous variables to create groups of values. However, the optimal number of bins depends on statistical characteristics of the data and the required task. Dot plots by Wilkinson (1999) render continuous univariate variables without overplotting by stacking nodes that fall within one dot size of each other. Dang et al. (2010) extended this to scatterplots by stacking nodes whose values are similar in 3D visual space. These pioneering works provide the theoretical background for the determination of optimal bin size for gatherplots.
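The stacking idea behind dot plots can be sketched with a greedy left-to-right sweep. This is a simplified illustration in the spirit of the approach, not Wilkinson's published algorithm: the function `dotplotStacks`, the rule that a new stack starts once a value lies at least `dotSize` beyond the start of the current stack, and the placement of each stack at its first value are all assumptions.

```javascript
// Hypothetical dot-plot stacking: sweep the sorted values and put every
// value that lies within `dotSize` of the current stack's first value on
// top of that stack. Stacks start at least `dotSize` apart, so dots of
// diameter `dotSize` never overlap.
function dotplotStacks(values, dotSize) {
  const sorted = [...values].sort((a, b) => a - b);
  const stacks = [];
  let current = null;
  for (const v of sorted) {
    if (current === null || v >= current.start + dotSize) {
      current = {start: v, members: []};
      stacks.push(current);
    }
    current.members.push(v);
  }
  // Each member keeps its true value but is drawn at the stack position.
  return stacks.flatMap(s =>
    s.members.map((v, i) => ({value: v, x: s.start, y: i * dotSize + dotSize / 2}))
  );
}
```

In this sketch each dot is displaced horizontally by less than one dot size, which makes the trade-off explicit: the dot size acts as the bin width, so larger dots are easier to see but group more distinct values into the same stack.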
Specialized versions of dot plots (Wilkinson 1999) have been given their own monikers. The R graphics package presents a version called a stripchart, which allows for both jittering and stacking categorical data in a one-dimensional scatterplot. Beeswarm plots (Eklund and Trimble 2021) improve on stripcharts by allowing for closely packed, non-overlapping points. The seaborn (Waskom 2021) statistical data visualization library for Python extends these chart types to two-dimensional space. Stripplots (the name is somewhat unfortunate as it clashes with the strip plots technique for rendering univariate data as strips, yielding a plot reminiscent of a barcode) are 1D or 2D versions of stripcharts with random jittering where at least one axis is expected to be categorical. Swarmplots are 2D versions of beeswarm plots that place points to avoid overlap. In comparison, gatherplots partition the available space into stacked groups and then organize the marks inside each group into ordered grids. This yields a richer visual language that allows for sorting marks based on color and even resizing marks to fill the visual space.
Another practical use of categorical data visualization is to support inferences based on statistical and probabilistic data. Cosmides and Tooby (1996) used frequency grids as discrete countable objects, and Micallef et al. (2012) extended this with six different area-proportional representations of categorical data organized into different classes. Huron et al. (2013) suggested using sedimentation as a metaphor where individual objects coming from a data stream gradually transform into aggregated areas, or strata.
Finally, Kinetica (Rzeszotarski and Kittur 2014), beyond presenting a novel touch- and physics-based interaction for 2D scatterplots, provides a clustering algorithm where visual marks clump together with collision detection (and thus no overlap) based on their data ranges. The clusters are similar to the partitioned clusters in gatherplots, but are organically shaped, not arranged in orderly grid lines, and thus harder to compare. The attraction approach used in Kinetica is reminiscent of the gravity model used for marks in dust & magnet (Yi et al. 2005).