In this post I will make a an introduction to a new data visulazation package that I recently published on Github - ggrapid! ggrapid enables creation of the most common ggplot-based visualizations fast and with just a few lines of code.

Doing EDA (Exploratory Data Analysis) is a crucial step in every Data Sciene & Machine Learning project and typically that’s were all Data Scientists spend most of their time when working on a project. There’s already many great visualization packages in the R community that specifically aim at streamlining that process: DataExplorer or GGally for instance, but none of them was 100% fulfilling my needs. My main requirements were: speed of usage, interface consistency and elegance. As long as DataExplorer pretty much offers the first two points (although the execution is different), I think it’s missing out on the last factor - especially when you would like to share that report with managers or externals.

That’s probably where ggrapid fits much better - EDA and reporting that’s intented to be shared in an elegant way with managers and externals, while still built simply and with speed with the help of a specific Rmd syntax. Let’s check it out!

Note: similarity of names with DataExplorer is coincidental. I’ve been working on the ggrapid package for a long time already not having known that DataExplorer even exists. It was only very recently when I eventually decided to publish ggrapid to Github. Nevertheless, perhaps there’s room for the two packages to merge and offer a single, cohesive solution on all fronts :)

Introduction

ggrapid offers a couple wrappers around the most commonly used ggplot functions in the course of doing an EDA or building a report:

  • plot_density
  • plot_boxplot
  • plot_deciles (with calculate_decile_table)
  • plot_correlation
  • plot_bars
  • plot_line

Density plot


library(tidyverse)
library(ggrapid)

diamonds %>%
  plot_density(x = carat)

Box-plot


diamonds %>%
  plot_boxplot(x = cut,
               y = carat,
               fill = cut)

Decile plot


diamonds %>% 
  filter(cut %in% c("Ideal", "Premium")) %>% 
  calculate_decile_table(price, cut, "Ideal") %>%
  plot_deciles()

Correlation


diamonds %>%
  plot_correlation()

Barplot


diamonds %>%
  plot_bars(x = carat,
            x_type = "num",
            fill = cut)

Lineplot


tibble(
  time = 1:20,
  value = rnorm(20, 0.5, 2)
  ) %>%
  plot_line(
    x = time,
    y = value
  )

Main arguments

The most commonly implemented ggplot arguments across all main ggrapid functions ensure that you can build your basic EDA report without making additional changes or custom functions. ggrapid tries to do most things for you but many of those arguments you can modify yourself. They are mainly (might slightly differ across functions):

  • fill
  • facet
  • position
  • ticks
  • angle
  • title
  • subtitle
  • caption
  • lab_x
  • lab_y
  • legend
  • vline/ hline
  • alpha
  • quantile_low
  • quantile_high
  • theme_type
  • palette

They allow the user to further customize the plot almost as flexibly as if you were using the classic ggplot:


diamonds %>%
  plot_density(x = carat)


diamonds %>%
  plot_density(x = carat,
               fill = cut,
               position = "stack")


diamonds %>%
  plot_density(x = carat,
               fill = cut,
               position = "fill")


diamonds %>%
  plot_density(x = carat,
               fill = cut,
               facet = cut,
               title = "Write your title here",
               subtitle = "Write your subtitle here",
               caption = "Write your caption here",
               lab_x = "Carat",
               alpha = .5,
               vline = 1)

Complete usage

The main idea of ggrapid is to apply it’s functions programatically to your entire dataset and then leverage this single object in the main reporting file. You can easily iterate across selected columns and create a set of plots for your EDA file:


library(recipes)

credit_data_nested <- credit_data %>% 
  select(-one_of("Home", "Marital", "Records", "Job")) %>% # removing categorical variables
  gather(variable, variable_value,
         one_of("Seniority", "Time", "Age", "Expenses", # selecting variables to gather
                "Income", "Assets", "Debt", "Amount", "Price")) %>% 
  nest(-variable) %>% 
  mutate(
    decile_table = map(data, 
                       ~calculate_decile_table(
                         .x,
                         binning = variable_value,
                         grouping = Status,
                         top_level = "bad",
                         format = FALSE
                         )
    ),
    plot_deciles  = pmap(list(x = decile_table, y = variable),
                         ~plot_deciles(
                           .x,
                           title = glue::glue("Decile plot of {.y}"),
                           quantile_low = 0, 
                           quantile_high = 1, 
                           lab_x = "Decile",
                           lab_y = "Bad rate, %"
                           )
    ),
    plot_boxplot  = pmap(list(x = data, y = variable),
                         ~plot_boxplot(
                           .x,
                           x = Status,
                           y = variable_value,
                           fill = Status,
                           title = glue::glue("Box plot of {.y} by Status"),
                           quantile_low = 0.01,
                           quantile_high = 0.99,
                           lab_x = "Performance",
                           caption = "Removed 1% of observations from each side",
                           palette = "inv_binary"
                           )
    ),
    plot_density  = pmap(list(x = data, y = variable),
                     ~plot_density(
                       .x,
                       x = variable_value,
                       fill = Status,
                       title = glue::glue("Box plot of {.y} by Status"),
                       quantile_low = 0.01,
                       quantile_high = 0.99,
                       lab_x = "Performance",
                       caption = "Removed 1% of observations from each side",
                       palette = "inv_binary"
                       )
    )
  )

This will give you the following tidy structure. Each row represents an individual variable, and columns are different plots you created before you would like to inspect:


credit_data_nested[1:3, ]
## # A tibble: 3 x 6
##   variable  data       decile_table  plot_deciles plot_boxplot plot_density
##   <chr>     <list>     <list>        <list>       <list>       <list>      
## 1 Seniority <tibble [… <tibble [10 … <gg>         <gg>         <gg>        
## 2 Time      <tibble [… <tibble [10 … <gg>         <gg>         <gg>        
## 3 Age       <tibble [… <tibble [10 … <gg>         <gg>         <gg>

Exemplary EDA format

Creating a standardised and elegant EDA file from the previous data structure is just as easy as calling your data frame and putting the results in a Rmd format:

Variable: Seniority

[above code: glue::glue(“Variable: {credit_data_nested$variable[[1]]}”)]

Decile analysis

credit_data_nested$decile_table[[1]]
## # A tibble: 10 x 8
##    decile   min median   max top_level total bottom_level ratio     
##    <fct>  <dbl>  <dbl> <dbl>     <int> <int>        <int> <formttbl>
##  1 1          0      0     0       235   446          211 52.69%    
##  2 2          0      1     1       209   445          236 46.97%    
##  3 3          1      2     2       174   446          272 39.01%    
##  4 4          2      3     3       146   445          299 32.81%    
##  5 5          3      4     5       122   445          323 27.42%    
##  6 6          5      6     8       105   446          341 23.54%    
##  7 7          8     10    10        88   445          357 19.78%    
##  8 8         10     12    14        76   446          370 17.04%    
##  9 9         14     16    20        54   445          391 12.13%    
## 10 10        20     25    48        45   445          400 10.11%
credit_data_nested$plot_deciles[[1]]

Aditional plots

credit_data_nested$plot_boxplot[[1]]

credit_data_nested$plot_density[[1]]

Variable: Time

[above code: glue::glue(“Variable: {credit_data_nested$variable[[2]]}”)]

Decile analysis

credit_data_nested$decile_table[[2]]
## # A tibble: 10 x 8
##    decile   min median   max top_level total bottom_level ratio     
##    <fct>  <dbl>  <dbl> <dbl>     <int> <int>        <int> <formttbl>
##  1 1          6     18    24        64   446          382 14.35%    
##  2 2         24     30    36       109   445          336 24.49%    
##  3 3         36     36    36       124   446          322 27.80%    
##  4 4         36     36    48       136   445          309 30.56%    
##  5 5         48     48    48       135   445          310 30.34%    
##  6 6         48     48    60       125   446          321 28.03%    
##  7 7         60     60    60       133   445          312 29.89%    
##  8 8         60     60    60       155   446          291 34.75%    
##  9 9         60     60    60       129   445          316 28.99%    
## 10 10        60     60    72       144   445          301 32.36%
credit_data_nested$plot_deciles[[2]]

Aditional plots

credit_data_nested$plot_boxplot[[2]]

credit_data_nested$plot_density[[2]]

Predefined Rmd skeleton for automated EDA

It becomes even easier when you take advantage of the Rmd child document propery of every chunk. You need to have two Rmd documents to leverage that functionality and create customised and great looking EDA reports in no time:

  • child Rmd template
  • main Rmd file

The Rmd template dictates the main Rmd document what type of reporting structure you would like to leverage for every single variable. The main advantage of that is that you do not maintain all the code in the main reporting file which could become very problematic when you’re working on a really large project. With a child document you just make changes to the child and they automatically take place in the main report file - you can think of it as defining a simple R function!

A simple example of a child template could be something like the one presented below - it needs to be saved as an individual Rmd file called e.g.: ‘child_template.Rmd’. It can obviously become much more complicated and could leverage Rmd tabsets and other Rmd functionalities to offer the best user experience. Note the intentional usage of [[i]] which I’m explaining further.


Decile analysis

credit_data_nested$decile_table[[i]]
credit_data_nested$plot_deciles[[i]]

Aditional plots

credit_data_nested$plot_boxplot[[i]]
credit_data_nested$plot_density[[i]]

In the main Rmd file you then apply the following pattern:

# ```{r include=FALSE}
i <- 1 
# ```

glue::glue(“Variable: {credit_data_nested$variable[[i]]}”)

# ```{r child='child_template.Rmd'}
# ```

Then when the main Rmd file renders it imports the Rmd child every time it’s called, plugs in the respective [[i]] object into it and renders the chunk. So you would repeat that pattern for every single attribute (row) of the main data frame with all the rendered plots while changing the i <- assignment to account for the next attribute. It’s most probably not the most streamlined solution, but it used to work out for me very well for a long time and offered the best trade-off between speed, maintenance and elegance of the overall solution.

Wrapping up

That would be really it with ggrapid for now - I hope you guys enjoyed this blogpost and will find the package usefull in your day-to-day practice! I’m planning to continue working on it to offer a more comprehensive solution (perhaps by combining it’s functionalities with DataExplorer?) and would invite you to participate and give your ideas as well. Thanks!