Your Data Doesn't Lie, Don't Make It

Visualizing continuous data over categorical.

Visualization of data is critical in understanding it properly. The famous Anscombe’s quartet is a great example of the perils of skipping visualizations of data where the two axes are both continuous. In this blog post we look at visualizing continous data split over categorical variables.

Why do I want to do this

When we want to look at a continuous variable against a discrete one we often employ a box plot to get a sense of the data. But We often lose a lot of information by such a broad summary of our data. The only points we see are extreme data points (Not outliers necessarily).

# If you don't use the tidyverse, start! It will make your life easier 
# ggplot2 is an amazing graphing tool from the tidyverse
library(tidyverse)  
library(titanic)
library(gridExtra)
library(devtools)
library(ggiraph)

theme_custom_stark_sahir <- function(){
  
    theme_minimal() %+replace%
    theme(
        panel.grid.major.x = element_blank(),
        plot.background = element_rect(fill = "#ffffff", color = "#ffffff"),
        plot.title = element_text(size = 16, colour = "black", hjust = 0.5, vjust = 3),
        axis.line = element_line(colour = "black"),
        plot.margin = unit(c(0.2,0.2,0.2,0.2), "cm"),
        panel.grid = element_blank()
        )
    }

tut_titanic_train <- titanic_train

# One's and Zero's are so unintuitive, let's replace them with ream name
tut_titanic_train$Survived[tut_titanic_train$Survived == 0] <- "Dead"
tut_titanic_train$Survived[tut_titanic_train$Survived == 1] <- "Alive"

# This sets up our dataframe quickly by telling R that the Survived variable is categorical, and organizes the data so we can look at the important colums quickly.
tut_titanic_train <- tut_titanic_train %>% 
  mutate(Survived = as.factor(Survived)) %>% 
  select(Name, Survived, Fare, everything())


# A rough look at the data
(gg_tut_initial <- ggplot(tut_titanic_train, aes(x = Survived, 
                               y = Fare, 
                               fill = Survived)) +
   geom_boxplot())

# A nice output of the dataframe in table format
knitr::kable(head(tut_titanic_train), format = "markdown")

Name	Survived	Fare	PassengerId	Pclass	Sex	Age	SibSp	Ticket	Cabin	Embarked
Braund, Mr. Owen Harris	Dead	7.2500	1	3	male	22	1	A/5 21171		S
Cumings, Mrs. John Bradley (Florence Briggs Thayer)	Alive	71.2833	2	1	female	38	1	PC 17599	C85	C
Heikkinen, Miss. Laina	Alive	7.9250	3	3	female	26	0	STON/O2. 3101282		S
Futrelle, Mrs. Jacques Heath (Lily May Peel)	Alive	53.1000	4	1	female	35	1	113803	C123	S
Allen, Mr. William Henry	Dead	8.0500	5	3	male	35	0	373450		S
Moran, Mr. James	Dead	8.4583	6	3	male	NA	0	330877		Q

Those outliers! You may be tempted to apply a log transformation to make this information easier to view but be careful. Transforming data is not always a good idea and can often obscure outliers. For visualizing we will do a log transform and we will come back to the idea of obscured extreme data. I will also apply a few visual updates so we can read the graph better.

gg_tut_box <- gg_tut_initial + 
  coord_flip() +
  scale_y_log10() +
  theme_custom_stark_sahir() +
  labs(x = "Survival Status of Passengers", 
         y = "Log Fare of Passengerss", 
         title = "Relationship of Fare and Surviving Gender", 
         fill = "Survived")

We still don’t have a very good idea of how the data is organized. Does it have one mode, two? and is the data skewed in any way. The normal response is to put the data on top of the plot and jitter it so it’s more spaced out. I hate doing that because many data points will overwhelm the graph. The issue with alpha is that points with lower neighbours will not show up much either.e

gg_tut_jittered <- gg_tut_box +
  geom_jitter(width = .2)

gg_tut_jittered_alpha <- gg_tut_box +
  geom_jitter(width = .2, alpha = 0.1)


grid.arrange(gg_tut_jittered, gg_tut_jittered_alpha)

We get some sense of data this way but a density plot would be better to see the distribution of data. But just a density plot does not give us information that the box plot gives us like the median, extreme data classification, and the interquartile range. I decided to create a way of graphing so that we can get all this data and possibly more as well.

How Did I Do this?

I learned and used mostly David Robinson’s Gist code but modified it slightly to allow us to flip the density from one side to the other. I used David’s modified violin plot as the density plot and combined it with a regular geom_boxplot.

This code chunk is rather long but I’ll point out the edits I made in comments.

How does ggplot work on the inside?

Hadley Wickham has a tutorial on extending ggplot that goes more in detail that I will here but if you’re looking to be able to add a small functionality to an existing geom you can continue reading. A geom that you use like geom_violin in ggplot can be thought of in two parts: a ggproto object, and a layer function. The layer function will take the ggproto object to create our geom. In order to add the ability for the user to speify the direction of the density plot, I added a default value to the variable in the layer function (1) and then add it to the list of parameter variables (2). Now that I added direction to the parameters I could use it in the ggproto objecet (3, 4). Lastly, we just tell the ggproto object what the default value should be (5).

# somewhat hackish solution to:
# https://twitter.com/EamonCaddigan/status/646759751242620928
# based mostly on copy/pasting from ggplot2 geom_violin source:
# https://github.com/hadley/ggplot2/blob/master/R/geom-violin.r
# altered minorly to allow directionality of density plot
# 

library(ggplot2)
library(dplyr)


"%||%" <- function(a, b) {
  if (!is.null(a)) a else b
}


#' @rdname ggplot2-ggproto
#' @format NULL
#' @usage NULL
#' @export
GeomFlatViolin <-
  ggproto("GeomFlatViolin", Geom,
          
          setup_data = function(data, params) {
          data$width <- data$width %||%
          params$width %||% (resolution(data$x, FALSE) * 0.9)
            
          ## (3) Here we take the value the user givss and determines which direction the user wants.
          if (params$direction == 1) { 
               direction <-  +1
          } else if (params$direction == 2) {
               direction <-  -1
          }
            
            # ymin, ymax, xmin, and xmax define the bounding rectangle for each group
          data %>%
            group_by(group) %>%
            mutate(ymin = min(y),
                   ymax = max(y),
                   xmin = x,
                   ## (4) Now we can multiply by the direction to change it.
                   xmax = x + (direction * width / 2))
          },
          
          draw_group = function(data, panel_scales, coord) {
            # Find the points for the line to go all the way around
            data <- transform(data, xminv = x,
                              xmaxv = x + violinwidth * (xmax - x))
            
            # Make sure it's sorted properly to draw the outline
            newdata <- rbind(plyr::arrange(transform(data, x = xminv), y),
                             plyr::arrange(transform(data, x = xmaxv), -y))
            
            # Close the polygon: set first and last point the same
            # Needed for coord_polar and such
            newdata <- rbind(newdata, newdata[1,])
            
            ggplot2:::ggname("geom_flat_violin", GeomPolygon$draw_panel(newdata, panel_scales, coord))
            
          },
          
          draw_key = draw_key_polygon,
          
          # (5) We also specify the default setting of the direction here
          default_aes = aes(weight = 1, colour = "white", fill = "white", size = 0.5,
                            alpha = 0.5, linetype = "solid", direction = 1),
          
          
          required_aes = c("x", "y")
)

geom_flat_violin <- function(mapping = NULL, data = NULL, stat = "ydensity",
                        position = "dodge", trim = TRUE, scale = "area",
                        show.legend = NA, inherit.aes = TRUE, direction = 1, ...) {
  
  ## (1) Right above here is where I added 'direction = 1'
  ggplot2::layer(
    data = data,
    mapping = mapping,
    stat = stat,
    ## This calls the ggproto object
    geom = GeomFlatViolin,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      trim = trim,
      scale = scale,
      direction = direction, ## (2), told the layer function that this is a parameter
      ...
    )
  )
}

Handling Outliers

Currently I have outliers disabled but I want them back on now. This is what the graph looks like right now.

(gg_tut_titanic <- ggplot(tut_titanic_train,
                         aes(x = Survived, 
                         y = Fare, 
                         fill = Survived))+
    geom_flat_violin() +
    geom_boxplot(width = 0.05, 
                 coef = 0, 
                 color = "black", 
                 outlier.shape = NA) +
    scale_y_log10(breaks=c(10,100)) +
    labs(x = "Survival Status of Passengers", 
         y = "Log Fare of Passengerss", 
         "Relationship of Fare and Surviving Gender", 
         fill = "Survived") +
    theme_custom_stark_sahir() +
    coord_flip())

We know that outliers get hidden away when we do a log transform but sometimes we still want to know what they are. If you just let ggplot plot your outliers with geom_boxplot you will only see outliers that are outliers after you scale the data which is not always what you want. To display only outliers of my choice and with a text option I adapted a function from a Stack Overflow post here and here. The function we are going to use is defined as so:

get_specials_for <- function(y, coef = 1.5, type = 1, na.rm = TRUE, keyquartile = FALSE) {
  stopifnot(is.numeric(y))
  limits <- quantile(y, c(0.25, 0.75), na.rm = na.rm, type = type)
  lower_bound <- limits[1] - coef * IQR(y, na.rm = na.rm)
  upper_bound <- limits[2] + coef * IQR(y, na.rm = na.rm)
  boolean_outliers <- (y <  lower_bound | y > upper_bound)
  outliers <- if_else(boolean_outliers, y, as.numeric(NA))
  
  # Tackle keyquartile later
  
  return(outliers)
}

This function will get the interquartile range you want and allows you to set the coefficient and type of quantile calculation. If we want the log transformed outliers we give it the logged value of the columns but if we want the unlogged outliers we can just give the function the unaltered values. With this we have a good graph that tells us a lot about the data.

(gg_tut_titanic + 
   geom_point(aes(y = get_specials_for(Fare)))
   )

Bonus ggplot2 extension ggiraph

We can still do more and find out what the values of our outliers have when we hover over them. To do that we need to use the ggiraph extension I found on a website dedicated to ggplot2 extension. They have a lot of great extensions but we are going to use just ggiraph. To do so we need the developer version of ggplot2 and ggiraph from their github pages. Run the code below to set that up.

## First use of devtools package.
dev_mode(TRUE)
#install_github("tidyverse/ggplot2", force = TRUE)
install_github("davidgohel/ggiraph", force = TRUE)
dev_mode(FALSE)

Now that you have those enabled you can add the necessary code to activate the hover features. You can style the hover effect with CSS but I thought the default code provided by David Gohel to be nice enough!

## First use of ggiraph here

tut_titanic_train$tooltip = get_specials_for(tut_titanic_train$Fare)
tut_titanic_train$clickjs = paste0("alert(\"",tut_titanic_train$Fare, "\")" )

gg_tut_titanic_final <- ggplot(tut_titanic_train, 
                               aes(x = Survived, 
                                   y = Fare, 
                                   fill = as.factor(Survived)),
                               tooltip = tooltip, onclick = clickjs) +
    geom_flat_violin() +
    geom_boxplot(width = 0.05, 
                 coef = 0, 
                 color = "black", 
                 outlier.shape = NA) +
    geom_point_interactive(aes(y = get_specials_for(Fare), tooltip = get_specials_for(Fare)), size = 0.7) +
    scale_y_log10(breaks=c(10,100)) +
    labs(x = "Survival Status of Passengers", 
         y = "Log Fare of Passengerss", 
         title = "Relationship of Fare of the surviving gender", 
         fill = "Survived") +
    theme_custom_stark_sahir() +
    coord_flip()


ggiraph(code = {print(gg_tut_titanic_final)}, width = .6,
        tooltip_extra_css = "padding:2px;background:rgba(70,70,70,0.1);color:black;border-radius:2px 2px 2px 2px;",
        hover_css = "fill:#1279BF;stroke:#1279BF;cursor:pointer;")

And we’re done! The code works inside RStudio as well which is great for showing people a quick glanace at data and for yourself to track down any tricky points quickly.

Let me know what you think! If you enjoyed it, need some help or have improvements.

References and in-text citations:

http://blog.heapanalytics.com/anscombes-quartet-and-why-summary-statistics-dont-tell-the-whole-story/

https://gist.github.com/dgrtwo/eb7750e74997891d7c20

http://docs.ggplot2.org/current/vignettes/extending-ggplot2.html

http://stackoverflow.com/questions/3010403/jitter-if-multiple-outliers-in-ggplot2-boxplot?rq=1

http://stackoverflow.com/questions/33524669/labeling-outliers-of-boxplots-in-r

https://www.ggplot2-exts.org/ggiraph.html

https://github.com/davidgohel/ggiraph/blob/master/vignettes/custom_animation_effects.Rmd