How to create Jitter Plots in Microsoft Excel to visualize social outcome disparities

Making a Jitter Plot of U.S. regional differences in new HIV diagnoses
6 minute read
Screenshot of a jitter plot created in Microsoft Excel, showing regional differences in new HIV diagnoses
Screenshot of a jitter plot created in Microsoft Excel, showing regional differences in new HIV diagnoses

After almost every Equity Dataviz workshop, participants inevitably say something like: “Great! We’re sold on equitable charts. Jitter plots seem fun! But how do we do this in Excel? Or Google Sheets? Or R? Or Python?” This post answers that question for spreadsheets like Excel or Microsoft Excel.

Why use Jitter Plots to visualize social inequality?

Visualizing outcomes for different social groups (e.g. race, gender, age, income, etc) is a common first step in analyzing social inequity and holding policy-makers accountable for the decisions behind these disparities.

3iap’s peer-reviewed research shows that, when visualizing social outcomes, certain types of visualizations can increase harmful stereotypes about the people being visualized.

Some visualizations (e.g. bar charts, dot plots) make it seem like differences between groups are much larger than they really are. This triggers unconscious social biases that can lead to unfairly blaming the outcome differences on the people themselves.

On the other hand, charts like jitter-plots or range-plots make it clear that there are also wide outcome-differences within groups. This makes it difficult to stereotype a particular group because viewers can see that, even though there are differences between groups, a person’s group identity only explains a small part of the differences.

In this post, we’ll look at how to visualize between- and within-group-differences using Jitter Plots.

Example dataset

To make this realistic, we’ll look at differences in new HIV diagnoses between different regions in the United States, using a publicly available dataset from AIDSVu. The United States has a national goal of reducing HIV incidence by 75% before 2030. They’re making progress, but for a number of reasons, progress in southern states has been slow.

Walkthrough


First we’ll prepare the data. Above is the original spreadsheet from AidsVu, but I’ve added a column Region (see E:E) using a vlookup from the state-region-lookup sheet.




Next we’ll choose just the columns we want and filter the data using this formula: CHOOSECOLS(FILTER(CHOOSECOLS(Data!C:H,3,2,4,5,6), Data!H:H="Y"),1,2,3,4)

  • With CHOOSECOLS(Data!C:H,3,2,4,5,6), we’ve selected the columns for Region, State, County, New Diagnoses Rate, and CHOOSECOLS(Data!C:H,3,2,4,5,6)
  • With the filter Data!H:H="Y", we’re choosing only rows that are listed as rate-stable. AIDSVu discusses rate-stability in their methodology documentation.




Since Excel won’t let us use text values in a scatter plot, we create a column that represents each region as a value from 1-4.

We use a vlookup in column E:E to give each row an index value: =VLOOKUP(A2,$H$2:$I$5,2,FALSE). This looks up the region’s index from the table on the right.

We’ve also renamed the columns to clarify which columns drive which axes, for when we create the chart.




Next we’ll add a chart.

  • Select columns D:E
  • In the top menu, choose Insert > Scatter
  • You’ll see Excel lays out a rough scatter plot
  • To tidy things up, we’ll delete the chart title and x-axis title. We’ll also adjust the formatting for the Vertical Axis, setting the Bounds to 0-5 and the units to 1.




How do we make the dots jitter?

  • Jittering helps make the underlying distribution more apparent.
  • First, we’ll add a constant to K2 to control the amount of vertical jitter on each line
  • Then we’ll modify our formula for column E, changing the formula to =VLOOKUP(A2,$H$2:$I$5,2,FALSE)+RAND()*$K$2-0.5*$K$2

    • Remember =VLOOKUP(A2,$H$2:$I$5,2,FALSE) is how we look up the row index from the table on the right
    • We’ve added +RAND()*$K$2-0.5*$K$2 which generates a random number from 0—1, then multiplies that number by our jitter amount constant K2. We subtract half of K2 to center it on the line.
  • Finally we drag the formula down column E to apply it to each row
  • You can see the dots on each row are now vertically jittered.




With jitter plots, people generally still expect to see what the “typical” value is for the dots (either an average or ideally a median). To show this using Excel we’ll have to get a bit clever.

Our plan here is to prepend the average values as part of the overall dataset shown on the scatter plot, but we’ll differentiate the average values by putting them in another “series”

  • First we claim column F to contain data for the average series that we can use in the chart
  • Then we’ll move the whole dataset down 4 rows
  • In column A we put the 4 region names
  • In column F we do a similar vlookup as we used in column E, to look up the row index for the region, i.e. setting F2 to: =VLOOKUP(A2,$H$2:$I$5,2,FALSE). Then we drag that down for the 4 rows.
  • In column D, we use averageifs to get the averge value for each region, i.e. setting D2 to: =AVERAGEIFS(D$6:D$366,A$6:A$366,"="&A2)
  • Now we’ve got the data in place to show the averages. Next we’ll update the chart…




To show the averages on the chart:

  • We open the chart settings
  • We change the source by highlighting from D1 down to F366
  • Then adjust the Y Values if excel doesn’t do this automatically
  • You can see the average values show up as orange dots.
  • You can see this has all the parts we need, but we need to do some cleanup to make it clearer.




The most important part of any dataviz is the way we label it. So let’s add the region names. To add labels to the rows, we just insert a textbox using Insert > Text Box. Then adjust the size of the chart so the rows line up with the text.




For extra credit we can use shapes to overlay a goal range and a small star icon, indicating the U.S. goal to reduce HIV incidence by 75%.

Next steps:

Hopefully this helps you show outcome variance when visualizing social outcome disparities (or any kind of group outcome differences).

  • A demo version of this spreadsheet is available here. Please feel free to copy and use as a template.
  • There may certainly be easier ways to accomplish the above. If you’ve got other good spreadsheet hacks, please send us a note at hi@3isapattern.com.
  • If you or your teams are interested in learning about more equitable dataviz in general, please check out 3iap’s Equity Dataviz workshop, where we’ll cover our latest research on the topic, unpack the underlying psychology of data-driven stereotyping, and cover alternative ways to visualize social outcome disparities (without making them worse).
a curious guinea pig
Would you like to be a guinea pig?

Join 3iap’s mailing list for early access to the latest dataviz research, writing, and experiments.

(Note: No guinea pigs –or humans– have been harmed in the course of 3iap’s research, writing, or experiments.)