Visualizing continuous data over categorical variables

Visualizing data is critical to understanding it properly. The famous Anscombe’s quartet is a great example of the perils of skipping visualization when both axes are continuous. In this blog post we look at visualizing continuous data split over categorical variables.

Why do I want to do this?

When we want to look at a continuous variable against a discrete one, we often employ a box plot to get a sense of the data. But we lose a lot of information with such a broad summary. The only individual points we see are the extreme ones (which are not necessarily outliers).
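To make that loss concrete, here is a minimal sketch in base R. The two vectors are made up for illustration and are not part of the Titanic data: one is clustered at both ends, the other is spread out evenly. `fivenum()` returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum), which is essentially what a box plot draws.

```r
# Hypothetical samples with very different shapes
clustered <- c(0, 1, 1, 1, 5, 9, 9, 9, 10)    # values pile up near 1 and 9
spread    <- c(0, 0.5, 1, 3, 5, 7, 9, 9.5, 10) # values spread across the range

# Tukey's five-number summary for each sample
fivenum(clustered)  # 0 1 5 9 10
fivenum(spread)     # 0 1 5 9 10

# The summaries match even though the samples do not
identical(fivenum(clustered), fivenum(spread))  # TRUE
```

Both samples would produce the same box, so the clustering in the first one is invisible unless the individual points are drawn as well.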

# If you don't use the tidyverse, start! It will make your life easier.
# ggplot2 is an amazing graphing tool from the tidyverse.
library(tidyverse)
# Assumption: titanic_train comes from the titanic package
library(titanic)

# A custom theme: theme_minimal() with selected elements replaced
theme_custom_stark_sahir <- function() {
    theme_minimal() %+replace%
        theme(
            panel.grid.major.x = element_blank(),
            plot.background = element_rect(fill = "#ffffff", color = "#ffffff"),
            plot.title = element_text(size = 16, colour = "black", hjust = 0.5, vjust = 3),
            axis.line = element_line(colour = "black"),
            plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"),
            panel.grid = element_blank()
        )
}

tut_titanic_train <- titanic_train

# Ones and zeros are unintuitive, so let's replace them with real names
tut_titanic_train$Survived[tut_titanic_train$Survived == 0] <- "Dead"
tut_titanic_train$Survived[tut_titanic_train$Survived == 1] <- "Alive"

# This sets up our data frame by telling R that the Survived variable is categorical,
# and reorders the columns so the important ones come first.
tut_titanic_train <- tut_titanic_train %>% 
  mutate(Survived = as.factor(Survived)) %>% 
  select(Name, Survived, Fare, everything())

# A rough look at the data
(gg_tut_initial <- ggplot(tut_titanic_train, aes(x = Survived, 
                               y = Fare, 
                               fill = Survived)) +