Thursday, October 30, 2025

Sankey plots can work, but need polishing like any other graphic


So a critical discussion of Sankey plots floated across my feed on Bluesky recently, and one reply included an ugly example and the comment “Anyone who thought that this illustration enhanced readability lives in an alternate reality”. I’ve included the actual chart further down this post. I agree it’s pretty unhelpful, but I thought I saw potential, and said so. This blog post is me seeing if in fact something can be done with it.

The post that started the discussion was Emily Moin saying “Sankey diagrams are just as bad as pie charts, and also worse because conventional wisdom has not rejected them yet so I still have to look at them”. As it happens, I think that even pie charts have their (very limited) place so long as they’re carefully chosen (not when there are too many categories, for example) and properly polished for the audience. So I guess I’m being consistent in thinking that Sankey charts can also be useful.

Here’s what I think is wrong with the original graphic (reproduced later in this post), which I think tracks the progression of patients experiencing symptoms of differing severity over a period of six weeks:

  • The labels clutter the image and have a lot of redundant information repeated multiple times (“Severity Week…”)
  • The weeks and severity are both measured in numbers and presented together in the labels, creating a large cognitive load to parse the labels (“Severity Week 0: 3” takes some effort to work out that the severity is 3 and the week is 0, which is exactly the sort of thing you want to intuit directly from position or colour in a plot rather than have to read it)
  • The colours aren’t colour-blind friendly.
  • Although the colours are mapped appropriately to the severity scale (blue for low severity, green for mid and red for high) there’s no legend to draw this to the reader’s attention, and because of the ordering of ribbons on the page (see next point) this sequencing of colours is never obvious to the reader.
  • The nodes representing severity in a given week aren’t in any fixed order on the page, and change from week to week. They seem to have been chosen more to get the severity levels with more patients towards the centre of the chart. This stops the reader getting any easy reading of severity (which could have been mapped to vertical position in the plotting area) and adds to the cluttered and complex feel of the plot – for example by having the blue ribbon for severity 2 jumping from near the bottom of the plot in weeks 0 and 1 to the top in weeks 3 and 6.

You’ll have to go down a bit in the post to see that original graphic; you’ll see that the combined effect of these problems is definitely one of complexity and clutter. I think the last point – the severity nodes swapping places vertically – is the most important.

I had a go at improving this and came up with a couple of alternatives, using David Sjoberg’s ggsankey R package. Here’s a Sankey plot version:

… and here’s an alluvial plot version. Alluvial plots are similar to Sankey plots but have no spaces between the nodes, which means in this case you can read the nodes vertically at each week, much like a stacked bar chart:

Here’s the original graphic for comparison:

I’m pretty confident that both the Sankey and alluvial plots are definite improvements. They give a better sense of the average severity in each week and of the overall trend (which is towards more blue, low severity cases), while still giving a sense of people moving in multiple directions (sometimes upwards) from each severity-week combination. So I think I’ve addressed the main points here:

  • Decluttered the labels by having axis labels for “Week zero”, “Week one”, etc; that means we don’t have to repeat this in each node. And the node label is now just the single number of the severity.
  • Avoided the cognitive load of week and severity both being numerals, partly by the simplified labels above and partly by spelling out weeks in English words (one, three, etc) rather than numerals.
  • Chosen a more colour-blind friendly palette based on the Brewer Red-Yellow-Blue scheme rather than Red-Green-Blue
  • I still don’t have a legend, but I think it’s now much clearer to the reader that red is high severity and blue is low, because of the vertical sequencing of the nodes…
  • … which is the main fix here – I’ve strictly ordered the 1,2,3,4,5,6,7 severity nodes vertically so they never swap positions. This means less crossing-of-the-beams and hence less feel of complexity in the plot. Most importantly, it gives the eye an easy way to judge the proportion of people in each severity level by vertical position and size on the page.
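The fixed vertical ordering in that last point comes down to turning severity into a factor with explicit levels before plotting. Here is a minimal base R sketch of the idea, using a made-up vector rather than the data behind the charts:

```r
# Made-up severity codes as they might arrive from a CSV: character, unordered.
# "NA" here is an ordinary string, not a missing value.
severity <- c("3", "NA", "7", "1", "5", "2")

# Default factor() sorts the observed values, so the level order depends on
# the data (and the locale) rather than on anything we chose
default_levels <- levels(factor(severity))

# Explicit levels pin the order: the "NA" category first, then 1 (low) to 7 (high),
# including severities that happen not to appear in this particular vector
fixed <- factor(severity, levels = c("NA", 1:7))
fixed_levels <- levels(fixed)

print(default_levels)
print(fixed_levels)
```

Because the levels are declared once and reused for every week, a node for a given severity always sits in the same vertical slot, whether or not that severity is common in a particular week.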

I’ve left all the code to the end because most of it was about me trying to put together by hand a dataset that resembles the one in the original flow chart. Then I had to calibrate it in R to fix the problems in my hand-made version. These included things like the number of people changing from week to week, and the number of people entering a particular severity state in one week not matching the number exiting it. Once that stuff is dealt with, drawing the actual plot is a relatively simple ggplot2 and ggsankey chunk of code.
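To make those two problems concrete, here is a small sketch of how they can be detected. The `flows` data frame is a toy example invented for illustration (only its column names mirror the real data), and the checks are one reasonable way to do it, not necessarily how the charts’ data was checked:

```r
library(dplyr)

# Toy flows between three time points; values are deliberately inconsistent
flows <- tibble(
  week_from     = c(0, 0, 0, 1, 1, 1),
  severity_from = c("1", "1", "2", "1", "2", "2"),
  week_to       = c(1, 1, 1, 3, 3, 3),
  severity_to   = c("1", "2", "2", "1", "1", "2"),
  value         = c(0.3, 0.2, 0.6, 0.5, 0.1, 0.2)
)

# Problem 1: does the total number of people differ from week to week?
weekly_totals <- flows |>
  group_by(week_from) |>
  summarise(total = sum(value))

# Problem 2: does the amount entering each node (a week-severity pair)
# match the amount leaving it?
entering <- flows |>
  group_by(week = week_to, severity = severity_to) |>
  summarise(entered = sum(value), .groups = "drop")

leaving <- flows |>
  group_by(week = week_from, severity = severity_from) |>
  summarise(left = sum(value), .groups = "drop")

# Only nodes with flows both in and out can be compared
node_check <- inner_join(entering, leaving, by = c("week", "severity")) |>
  mutate(gap = entered - left)

print(weekly_totals)
print(node_check)
```

Any non-zero `gap`, or any difference between the weekly totals, is a sign the hand-made data needs rescaling before it is plotted.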

library(tidyverse)
library(janitor)
library(glue)
library(RColorBrewer)
remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey) # one fairly simple approach to sankey charts / flow diagrams

# read in some data. This was very crudely hand-entered with some
# rough visual judgements based on a chart that I saw on the internet
# and don't know the origin of. So treat as made-up example data:
d <- read_csv("https://raw.githubusercontent.com/ellisp/blog-source/refs/heads/master/data/complicated-sankey-data.csv", 
              col_types = "ccccd",
              # we want the NAs in the original to be characters, not actual NA:
              na = "missing") |> 
  clean_names()

#------------tidying up data---------------
# we have some adjustments to deal with because of having made up data

# An extra bunch of rows of data that are needed by the Sankey function
# to make the week 6 nodes show up:
extras <- d |> 
  filter(week_to == "6") |> 
  mutate(
    week_from = "6",
    week_to = NA, 
    severity_from = severity_to)

#' Convenience relabelling function for turning week numbers into a factor:
weekf <- function(x){
  x <-  case_when(
    x == 0 ~ "Week zero",
    x == 1 ~ "Week one",
    x == 3 ~ "Week three",
    x == 6 ~ "Week six"
  )
  x <- factor(x, levels = c("Week zero","Week one","Week three", "Week six"))
}

# going to start by treating all flow widths as proportions 
total_people <- 1

# add in the extra data rows to show the final week of nodes,
# and relabel the weeks:
d2 <- d |> 
  rbind(extras)  |> 
  mutate(week_from =  weekf(week_from),
         week_to = weekf(week_to))

# there should be the same total number of people each week,
# and the same number of people leaving each "node" (a severity-week
# combination) as arrived at it on the flow from the last week.
# we have a little iterative process to clean this up. If we had
# real data, none of this would be necessary; this is basically
# because I made up data with some rough visual judgements:
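# NOTE: the body of this loop is a reconstruction from the comments above
# (the published listing is incomplete here), so treat it as a sketch:
for(i in 1:5){
  # rescale so the flows leaving each week sum to total_people
  d2 <- d2 |> 
    group_by(week_from) |> 
    mutate(value = value / sum(value) * total_people) |> 
    ungroup()
  
  # total arriving at each node from the previous week
  arrivals <- d2 |> 
    filter(!is.na(week_to)) |> 
    group_by(week_to, severity_to) |> 
    summarise(arrived = sum(value), .groups = "drop")
  
  # rescale flows leaving each node to match the total that arrived at it
  # (week zero nodes have no arrivals, so are left unchanged)
  d2 <- d2 |> 
    left_join(arrivals, by = c("week_from" = "week_to", 
                               "severity_from" = "severity_to")) |> 
    group_by(week_from, severity_from) |> 
    mutate(value = if_else(is.na(arrived), value, value / sum(value) * arrived)) |> 
    ungroup() |> 
    select(-arrived)
}

# total arriving at each node, for checking:
tot_arrived <- d2 |> 
  group_by(week_to, severity_to) |> 
  mutate(arrived_sev_from = sum(value)) 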

# manual check - these should all be basically the same numbers
filter(tot_arrived, week_to == "Week one" & severity_to == 4)
filter(d2, week_to == "Week one" & severity_to == 4) |> summarise(sum(value))
filter(d2, week_from == "Week one" & severity_from == 4) |> summarise(sum(value))


#--------------draw plot-------------

# palette that is colourblind-ok and shows sequence. This
# actually wasn't too bad in the original, but it got lost
# in the vertical shuffling of all the severity nodes:
pal <-  c("gray", brewer.pal(7, "RdYlBu")[7:1])
names(pal) <- c("NA", 1:7)

# Draw the actual chart. First, the base of the chart, common to both:
p0 <- d2 |> 
  # ggsankey wants one row per person, so turn proportions into counts
  mutate(value = round(value * 1000)) |> 
  uncount(weights = value) |> 
  mutate(severity_from = factor(severity_from, levels = c("NA", 1:7)),
         severity_to = factor(severity_to, levels = c("NA", 1:7))) |> 
  ggplot(aes(x = week_from, 
             next_x = week_to,
             node = severity_from, 
             next_node = severity_to,
             fill = severity_from,
             label = severity_from)) +
  # default has a lot of white space between y axis and the data
  # so reduce the expansion of x axis to reduce that
  scale_x_discrete(expand = c(0.05, 0)) +
  scale_fill_manual(values = pal) +
  labs(subtitle = "Chart is still cluttered, but decreasing severity over time is apparent.
To achieve this, vertical sequencing is mapped to severity, and repetitive labels have been moved into the axis guides.",
       x = "",
       caption = "Data has been hand-synthesised to be close to an original plot of unknown provenance.") 

# Sankey plot:
p1 <- p0 +
  geom_sankey(alpha = 0.8) +
  geom_sankey_label() +
  theme_sankey(base_family = "Roboto") +
  theme(legend.position = "none",
        plot.title = element_text(family = "Sarala"),
        panel.background = element_rect(fill = "black")) +
  labs(title = "Severity of an unknown disease shown in a Sankey chart")

# Alluvial plot:
p2 <- p0 +
  geom_alluvial(alpha = 0.8) +
  geom_alluvial_label() +
  theme_alluvial(base_family = "Roboto") +
  theme(legend.position = "none",
        plot.title = element_text(family = "Sarala"),
        panel.background = element_rect(fill = "black")) +
  labs(title = "Severity of an unknown disease shown in an alluvial chart",
       y = "Number of people")

print(p1) # Sankey plot
print(p2) # alluvial plot

[Edited 8 July 2025 for black panel backgrounds for the Sankey and alluvial charts.]


