{ggrapid}: Create neat & complete ggplot visualizations with as little code as possible
In this post I will introduce a new data visualization package that I recently published on GitHub - ggrapid! ggrapid lets you create the most common ggplot-based visualizations quickly, with just a few lines of code.
Doing EDA (Exploratory Data Analysis) is a crucial step in every Data Science & Machine Learning project, and typically that’s where Data Scientists spend most of their time when working on a project. There are already many great visualization packages in the R community that specifically aim at streamlining that process, DataExplorer or GGally for instance, but none of them fully met my needs. My main requirements were: speed of usage, interface consistency and elegance. While DataExplorer pretty much offers the first two points (although the execution is different), I think it’s missing out on the last factor - especially when you would like to share that report with managers or externals.
That’s probably where ggrapid fits much better - EDA and reporting that’s intended to be shared in an elegant way with managers and externals, while still being built simply and quickly with the help of a specific Rmd syntax. Let’s check it out!
Note: similarity of names with DataExplorer is coincidental. I had been working on the ggrapid package for a long time without knowing that DataExplorer even existed. It was only very recently that I eventually decided to publish ggrapid to GitHub. Nevertheless, perhaps there’s room for the two packages to merge and offer a single, cohesive solution on all fronts :)
Introduction
ggrapid offers a handful of wrappers around the ggplot functions most commonly used in the course of doing an EDA or building a report:
- plot_density
- plot_boxplot
- plot_deciles (with calculate_decile_table)
- plot_correlation
- plot_bars
- plot_line
Density plot
library(tidyverse)
library(ggrapid)
diamonds %>%
  plot_density(x = carat)
Box-plot
diamonds %>%
  plot_boxplot(x = cut,
               y = carat,
               fill = cut)
Decile plot
diamonds %>%
  filter(cut %in% c("Ideal", "Premium")) %>%
  calculate_decile_table(price, cut, "Ideal") %>%
  plot_deciles()
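To make the decile table less of a black box, here is a rough dplyr-only sketch of the kind of summary it holds. This is not ggrapid's implementation: binning with ntile() and the column names (borrowed from the decile tables printed later in this post) are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)  # only needed here for the diamonds dataset

decile_sketch <- diamonds %>%
  filter(cut %in% c("Ideal", "Premium")) %>%
  # assumption: split price into 10 roughly equal-sized bins
  mutate(decile = ntile(price, 10)) %>%
  group_by(decile) %>%
  summarise(
    min = min(price),
    median = median(price),
    max = max(price),
    top_level = sum(cut == "Ideal"),    # rows belonging to the level of interest
    total = n(),
    bottom_level = total - top_level,   # all remaining rows
    ratio = top_level / total           # share of the level of interest per bin
  )

decile_sketch
```

Each row describes one price bin, and ratio shows how the share of "Ideal" diamonds changes from the cheapest to the most expensive bin.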
Correlation
diamonds %>%
  plot_correlation()
Barplot
diamonds %>%
  plot_bars(x = carat,
            x_type = "num",
            fill = cut)
Lineplot
tibble(
  time = 1:20,
  value = rnorm(20, 0.5, 2)
) %>%
  plot_line(
    x = time,
    y = value
  )
Main arguments
Most of the commonly used ggplot arguments are implemented across all main ggrapid functions, so you can build your basic EDA report without making additional changes or writing custom functions. ggrapid tries to do most things for you, but you can modify many of those arguments yourself. They are mainly (might slightly differ across functions):
- fill
- facet
- position
- ticks
- angle
- title
- subtitle
- caption
- lab_x
- lab_y
- legend
- vline / hline
- alpha
- quantile_low
- quantile_high
- theme_type
- palette
They let you customize the plot further, almost as flexibly as with classic ggplot:
diamonds %>%
  plot_density(x = carat)

diamonds %>%
  plot_density(x = carat,
               fill = cut,
               position = "stack")

diamonds %>%
  plot_density(x = carat,
               fill = cut,
               position = "fill")

diamonds %>%
  plot_density(x = carat,
               fill = cut,
               facet = cut,
               title = "Write your title here",
               subtitle = "Write your subtitle here",
               caption = "Write your caption here",
               lab_x = "Carat",
               alpha = .5,
               vline = 1)
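For comparison, this is roughly what the last call would look like in plain ggplot2. It is only a sketch: which geoms, themes and defaults ggrapid actually uses under the hood is an assumption here, so the result will not look identical.

```r
library(ggplot2)

# Plain-ggplot2 approximation of the fully customised plot_density() call
p <- ggplot(diamonds, aes(x = carat, fill = cut)) +
  geom_density(alpha = 0.5) +          # alpha = .5
  geom_vline(xintercept = 1) +         # vline = 1
  facet_wrap(vars(cut)) +              # facet = cut
  labs(
    title = "Write your title here",
    subtitle = "Write your subtitle here",
    caption = "Write your caption here",
    x = "Carat"                        # lab_x = "Carat"
  )

p
```

Every ggrapid argument maps to one extra ggplot2 layer or labs() entry, which is the typing the wrapper saves you.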
Complete usage
The main idea of ggrapid is to apply its functions programmatically to your entire dataset and then leverage this single object in the main reporting file. You can easily iterate across selected columns and create a set of plots for your EDA file:
library(recipes) # source of credit_data at the time of writing; newer recipes releases moved it to modeldata

credit_data_nested <- credit_data %>%
  select(-one_of("Home", "Marital", "Records", "Job")) %>% # removing categorical variables
  gather(variable, variable_value,
         one_of("Seniority", "Time", "Age", "Expenses", # selecting variables to gather
                "Income", "Assets", "Debt", "Amount", "Price")) %>%
  nest(-variable) %>%
  mutate(
    # one decile table per variable
    decile_table = map(data,
                       ~calculate_decile_table(
                         .x,
                         binning = variable_value,
                         grouping = Status,
                         top_level = "bad",
                         format = FALSE
                       )
    ),
    # one decile plot per variable, built from its decile table
    plot_deciles = pmap(list(x = decile_table, y = variable),
                        ~plot_deciles(
                          .x,
                          title = glue::glue("Decile plot of {.y}"),
                          quantile_low = 0,
                          quantile_high = 1,
                          lab_x = "Decile",
                          lab_y = "Bad rate, %"
                        )
    ),
    plot_boxplot = pmap(list(x = data, y = variable),
                        ~plot_boxplot(
                          .x,
                          x = Status,
                          y = variable_value,
                          fill = Status,
                          title = glue::glue("Box plot of {.y} by Status"),
                          quantile_low = 0.01,
                          quantile_high = 0.99,
                          lab_x = "Performance",
                          caption = "Removed 1% of observations from each side",
                          palette = "inv_binary"
                        )
    ),
    plot_density = pmap(list(x = data, y = variable),
                        ~plot_density(
                          .x,
                          x = variable_value,
                          fill = Status,
                          title = glue::glue("Density plot of {.y} by Status"),
                          quantile_low = 0.01,
                          quantile_high = 0.99,
                          lab_x = "Performance",
                          caption = "Removed 1% of observations from each side",
                          palette = "inv_binary"
                        )
    )
  )
This will give you the following tidy structure. Each row represents an individual variable, and the columns hold the different plots you created earlier and would now like to inspect:
credit_data_nested[1:3, ]
## # A tibble: 3 x 6
## variable data decile_table plot_deciles plot_boxplot plot_density
## <chr> <list> <list> <list> <list> <list>
## 1 Seniority <tibble [… <tibble [10 … <gg> <gg> <gg>
## 2 Time <tibble [… <tibble [10 … <gg> <gg> <gg>
## 3 Age <tibble [… <tibble [10 … <gg> <gg> <gg>
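The same "one row per variable, one list-column per plot" structure can be reproduced without ggrapid, which may help if the nesting step feels opaque. The sketch below uses the built-in iris data and plain ggplot2. The dataset, the object names iris_nested and sepal_plot, and the use of pivot_longer() in place of gather() are choices made for this illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

iris_nested <- iris %>%
  # long format: one row per (observation, variable) pair
  pivot_longer(-Species, names_to = "variable", values_to = "variable_value") %>%
  # one row per variable, with its observations stored in a list-column
  nest(data = -variable) %>%
  mutate(
    plot_density = map2(
      data, variable,
      ~ ggplot(.x, aes(x = variable_value, fill = Species)) +
        geom_density(alpha = 0.5) +
        labs(title = paste("Density plot of", .y))
    )
  )

# Pull a single plot out by variable name instead of by row position
sepal_plot <- iris_nested %>%
  filter(variable == "Sepal.Length") %>%
  pull(plot_density) %>%
  pluck(1)
```

Looking plots up by name rather than by index keeps the report stable if the order of the rows ever changes.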
Example EDA format
Creating a standardised and elegant EDA file from the previous data structure is as easy as calling your data frame and putting the results in an Rmd format:
Variable: Seniority
[above code: glue::glue("Variable: {credit_data_nested$variable[[1]]}")]
Decile analysis
credit_data_nested$decile_table[[1]]
## # A tibble: 10 x 8
## decile min median max top_level total bottom_level ratio
## <fct> <dbl> <dbl> <dbl> <int> <int> <int> <formttbl>
## 1 1 0 0 0 235 446 211 52.69%
## 2 2 0 1 1 209 445 236 46.97%
## 3 3 1 2 2 174 446 272 39.01%
## 4 4 2 3 3 146 445 299 32.81%
## 5 5 3 4 5 122 445 323 27.42%
## 6 6 5 6 8 105 446 341 23.54%
## 7 7 8 10 10 88 445 357 19.78%
## 8 8 10 12 14 76 446 370 17.04%
## 9 9 14 16 20 54 445 391 12.13%
## 10 10 20 25 48 45 445 400 10.11%
credit_data_nested$plot_deciles[[1]]
Additional plots
credit_data_nested$plot_boxplot[[1]]
credit_data_nested$plot_density[[1]]
Variable: Time
[above code: glue::glue("Variable: {credit_data_nested$variable[[2]]}")]
Decile analysis
credit_data_nested$decile_table[[2]]
## # A tibble: 10 x 8
## decile min median max top_level total bottom_level ratio
## <fct> <dbl> <dbl> <dbl> <int> <int> <int> <formttbl>
## 1 1 6 18 24 64 446 382 14.35%
## 2 2 24 30 36 109 445 336 24.49%
## 3 3 36 36 36 124 446 322 27.80%
## 4 4 36 36 48 136 445 309 30.56%
## 5 5 48 48 48 135 445 310 30.34%
## 6 6 48 48 60 125 446 321 28.03%
## 7 7 60 60 60 133 445 312 29.89%
## 8 8 60 60 60 155 446 291 34.75%
## 9 9 60 60 60 129 445 316 28.99%
## 10 10 60 60 72 144 445 301 32.36%
credit_data_nested$plot_deciles[[2]]
Additional plots
credit_data_nested$plot_boxplot[[2]]
credit_data_nested$plot_density[[2]]
Predefined Rmd skeleton for automated EDA
It becomes even easier when you take advantage of the Rmd child document property that every chunk has. You need two Rmd documents to leverage that functionality and create customised and great looking EDA reports in no time:
- child Rmd template
- main Rmd file
The Rmd template tells the main Rmd document what type of reporting structure you would like to use for every single variable. The main advantage is that you do not maintain all the code in the main reporting file, which could become very problematic when you’re working on a really large project. With a child document you just make changes to the child and they automatically take effect in the main report file - you can think of it as defining a simple R function!
A simple example of a child template could be something like the one presented below - it needs to be saved as an individual Rmd file called e.g. ‘child_template.Rmd’. It can obviously become much more complicated and could leverage Rmd tabsets and other Rmd functionalities to offer the best user experience. Note the intentional usage of [[i]], which I explain further below.
Decile analysis
credit_data_nested$decile_table[[i]]
credit_data_nested$plot_deciles[[i]]
Additional plots
credit_data_nested$plot_boxplot[[i]]
credit_data_nested$plot_density[[i]]
In the main Rmd file you then apply the following pattern:
# ```{r include=FALSE}
i <- 1
# ```
glue::glue("Variable: {credit_data_nested$variable[[i]]}")
# ```{r child='child_template.Rmd'}
# ```
Then, when the main Rmd file renders, it imports the Rmd child every time it’s called, plugs the respective [[i]] object into it and renders the chunk. So you would repeat that pattern for every single attribute (row) of the main data frame with all the rendered plots, changing the i <- assignment each time to account for the next attribute. It’s most probably not the most streamlined solution, but it worked very well for me for a long time and offered the best trade-off between speed, maintenance and elegance of the overall solution.
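If repeating the pattern by hand for every row becomes tedious, one possible alternative is to loop over the rows and knit the child once per iteration with knitr::knit_child(). This is a minimal sketch and not part of ggrapid: the helper name render_children is hypothetical, and it assumes the chunks in child_template.Rmd are unnamed (duplicate chunk labels would make knitr stop with an error).

```r
library(knitr)

# Hypothetical helper: knit the child template once for each row index.
# Inside the anonymous function, `i` is a local variable; passing
# envir = environment() lets the child's [[i]] references resolve to it,
# while credit_data_nested is still found in the enclosing environment.
render_children <- function(n_rows, child = "child_template.Rmd") {
  pieces <- lapply(seq_len(n_rows), function(i) {
    knit_child(child, envir = environment(), quiet = TRUE)
  })
  unlist(pieces)
}

# Intended use, inside a chunk of the main Rmd that has results='asis':
# cat(render_children(nrow(credit_data_nested)), sep = "\n")
```

The helper only works while the main document is being knitted, so calling it from the console will not work. With this approach the per-variable heading would have to move into the child template, since the loop renders nothing but the child.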
Wrapping up
That’s really it for ggrapid for now - I hope you guys enjoyed this blog post and will find the package useful in your day-to-day practice! I’m planning to continue working on it to offer a more comprehensive solution (perhaps by combining its functionalities with DataExplorer?) and would invite you to participate and share your ideas as well. Thanks!