In this post I will compare the most popular (by number of monthly downloads from GitHub) ML framework available for R to date, caret, with its successor packages written by the same author (Max Kuhn) and wrapped together in the so-called tidymodels framework. Tidymodels is a collection of packages such as rsample, recipes, parsnip, dials and more, which allow you to run an entire ML project end-to-end in a tidy format.

Many of them are still in a development phase and it will take a good couple of months before they settle down, so I’ll try to keep this post up-to-date over time. Nevertheless, I’ve wanted to take a closer look at what tidymodels has to offer for a while, and I thought a blogpost would be a great way to demonstrate that.

Update 16.02.2020 - the following parts of this blogpost were updated:

  • since the initial write-up of this post many tidymodels packages have been updated and the following new packages were released: tune and workflows, which significantly simplify the overall modelling workflow
  • updated the entire tidymodels implementation using the new functions instead of the handler functions I had to write before; the previous post content is kept in its original form at the very end for anyone interested
  • added variable importance for the tidymodels implementation using the vip package

In order to write this post I carefully read through all the individual package websites, and this excellent blogpost from Alex Hayes helped me a lot to put things together.

Initial setup

To begin with, let’s load all the required packages and the credit_data dataset (available in the modeldata package) that we will use for modelling. Note also that I’m setting the random seed to make sampling reproducible, as well as setting the furrr plan to multicore. This is important unless you want this script to run for a really long time on your machine - we’ll be fitting many different models, so making sure you utilize all your local resources will speed things up a lot.

set.seed(42)
options(max.print = 150)

library(modeldata)
library(tidymodels)
library(tidyverse)
library(caret)
library(magrittr)
library(naniar)
library(furrr)
library(skimr)
library(vip)
library(workflows)
library(tune)
library(scales)   # percent() used further below comes from scales

plan(multicore)  
data("credit_data")

Data preparation

In this example, I’m building a classification model to distinguish between good and bad loans, indicated by the column ‘status’. We have relatively many observations compared to the number of variables available for modelling. Before taking any other steps, let’s convert all column names to lowercase.

credit_data %<>%
  set_names(., tolower(names(.)))

glimpse(credit_data)
## Observations: 4,454
## Variables: 14
## $ status    <fct> good, good, bad, good, good, good, good, good, good, b…
## $ seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15…
## $ home      <fct> rent, rent, owner, rent, rent, owner, owner, parents, …
## $ time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36…
## $ age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37…
## $ marital   <fct> married, widow, married, single, single, married, marr…
## $ records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, n…
## $ job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixe…
## $ expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75…
## $ income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 1…
## $ assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3…
## $ debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0…
## $ amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200,…
## $ price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 14…

With the help of the excellent naniar package I’m checking the percentage of missing data per variable. For this particular dataset there are very few missing values, so they won’t pose a problem for us during modelling.

credit_data %>% miss_var_summary()
## # A tibble: 14 x 3
##    variable  n_miss pct_miss
##    <chr>      <int>    <dbl>
##  1 income       381   8.55  
##  2 assets        47   1.06  
##  3 debt          18   0.404 
##  4 home           6   0.135 
##  5 job            2   0.0449
##  6 marital        1   0.0225
##  7 status         0   0     
##  8 seniority      0   0     
##  9 time           0   0     
## 10 age            0   0     
## 11 records        0   0     
## 12 expenses       0   0     
## 13 amount         0   0     
## 14 price          0   0
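If you prefer a visual overview of missingness, naniar also comes with plotting helpers; a one-line sketch (output not shown here):

gg_miss_var(credit_data)  # bar chart of missing-value counts per variable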

Another important step is to make some basic numerical summaries of the data in order to catch any unusual observations. I will do that using the skimr package. Apart from the fact that many numerical variables show high skewness and some categorical variables have levels with very low frequency, it doesn’t seem that we will have to deal with any specially encoded values or other problems.

credit_data %>% skim()
## Skim summary statistics
##  n obs: 4454 
##  n variables: 14 
## 
## ── Variable type:factor ────────────────────────────────────────────────────────
##  variable missing complete    n n_unique
##      home       6     4448 4454        6
##       job       2     4452 4454        4
##   marital       1     4453 4454        5
##   records       0     4454 4454        2
##    status       0     4454 4454        2
##                                top_counts ordered
##   own: 2107, ren: 973, par: 783, oth: 319   FALSE
##  fix: 2805, fre: 1024, par: 452, oth: 171   FALSE
##    mar: 3241, sin: 977, sep: 130, wid: 67   FALSE
##                 no: 3681, yes: 773, NA: 0   FALSE
##               goo: 3200, bad: 1254, NA: 0   FALSE
## 
## ── Variable type:integer ───────────────────────────────────────────────────────
##   variable missing complete    n    mean       sd  p0     p25  p50    p75
##        age       0     4454 4454   37.08    10.98  18   28      36   45  
##     amount       0     4454 4454 1038.92   474.55 100  700    1000 1300  
##     assets      47     4407 4454 5403.98 11574.42   0    0    3000 6000  
##       debt      18     4436 4454  343.03  1245.99   0    0       0    0  
##   expenses       0     4454 4454   55.57    19.52  35   35      51   72  
##     income     381     4073 4454  141.69    80.75   6   90     125  170  
##      price       0     4454 4454 1462.78   628.13 105 1117.25 1400 1691.5
##  seniority       0     4454 4454    7.99     8.17   0    2       5   12  
##       time       0     4454 4454   46.44    14.66   6   36      48   60  
##   p100     hist
##     68 ▅▇▇▇▅▃▂▁
##   5000 ▅▇▃▁▁▁▁▁
##  3e+05 ▇▁▁▁▁▁▁▁
##  30000 ▇▁▁▁▁▁▁▁
##    180 ▇▃▃▁▁▁▁▁
##    959 ▇▆▁▁▁▁▁▁
##  11140 ▇▆▁▁▁▁▁▁
##     48 ▇▃▂▁▁▁▁▁
##     72 ▁▁▂▃▁▃▇▁

Another point we need to keep in mind when dealing with credit scoring problems is target class imbalance, but in this particular case it’s not that severe. For the sake of comparing programming frameworks rather than implementing the best possible ML model, I will ignore it (a quick sketch of how it could be handled follows after the class counts below).

table(credit_data$status)
## 
##  bad good 
## 1254 3200
round(prop.table(table(credit_data$status)), 2)
## 
##  bad good 
## 0.28 0.72
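For completeness, here is a sketch of how the imbalance could be addressed inside a recipe by downsampling the majority class. It assumes the themis package is installed, the object name recipe_balanced is just a placeholder, and it is not used anywhere else in this post:

# Hypothetical sketch only - downsample the majority class ('good') until
# both classes are balanced, before any model sees the data.
library(themis)

recipe_balanced <- credit_data %>%
  recipe(status ~ .) %>%
  step_downsample(status)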

Data splitting and preprocessing

Let’s finally move on and start modelling! First, I’ll divide our dataset into training and testing sets with the help of the rsample package, using an initial, stratified split in which 80% of the data is dedicated to training and the rest to evaluating both models.

Furthermore, I’m creating 5-fold cross-validation splits from the training data. For compatibility with caret I’m using the rsample2caret function to make use of the same splits in both frameworks - otherwise the two solutions wouldn’t be 100% comparable.

split <- initial_split(credit_data, prop = 0.80, strata = "status")

df_train <- training(split)
df_test  <- testing(split)

train_cv <- vfold_cv(df_train, v = 5, strata = "status")
train_cv_caret <- rsample2caret(train_cv)

# write_rds(split, "split.rds")
# write_rds(train_cv, "train_cv.rds")

I would like to fit a Random Forest model, for which I will specify a simple recipe. In principle, tree-based models require very little preprocessing; in this particular example I mainly focus on imputing missing numeric values, assigning missing categorical values to a new level, pooling infrequent/unobserved levels and one-hot encoding the categorical variables. The same recipe will be used for both the caret and the tidymodels model.

Normally I would do much more feature engineering, try to assess potential interactions etc., but I will write a separate post dedicated to that, to see how much further we can improve the model!
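Just to make that concrete, here is a purely illustrative sketch of the kind of extra steps I have in mind - for example taming the heavily skewed numeric predictors and adding an interaction term. The object name recipe_extra is a placeholder and this code is not used further; the actual recipe follows right after it:

# Hypothetical sketch only - impute, log-transform the skewed numeric
# predictors (offset = 1 handles the zeros) and add one example interaction.
recipe_extra <- df_train %>%
  recipe(status ~ .) %>%
  step_medianimpute(all_numeric()) %>%
  step_log(income, assets, debt, amount, price, offset = 1) %>%
  step_interact(terms = ~ income:amount)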

recipe <- df_train %>%
  recipe(status ~ .) %>%

  # Imputation: assigning NAs to a new level for categorical and median imputation for numeric
  step_unknown(all_nominal(), -status) %>% 
  step_medianimpute(all_numeric()) %>%

  # Combining infrequent categorical levels and introducing a new level for prediction time
  step_other(all_nominal(), -status, other = "infrequent_combined") %>%
  step_novel(all_nominal(), -status, new_level = "unrecorded_observation") %>%

  # Hot-encoding categorical variables
  step_dummy(all_nominal(), -status, one_hot = TRUE)

Let’s take a quick look at the output of the recipe:

(recipe_preped <- prep(recipe, retain = TRUE))
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         13
## 
## Training data contained 3565 data points and 335 incomplete rows. 
## 
## Operations:
## 
## Unknown factor level assignment for home, marital, records, job [trained]
## Median Imputation for seniority, time, age, expenses, ... [trained]
## Collapsing factor levels for home, marital, records, job [trained]
## Novel factor level assignment for home, marital, records, job [trained]
## Dummy variables from home, marital, records, job [trained]
tidy(recipe_preped)
## # A tibble: 5 x 6
##   number operation type         trained skip  id                
##    <int> <chr>     <chr>        <lgl>   <lgl> <chr>             
## 1      1 step      unknown      TRUE    FALSE unknown_XU845     
## 2      2 step      medianimpute TRUE    FALSE medianimpute_tS05C
## 3      3 step      other        TRUE    FALSE other_FHtHk       
## 4      4 step      novel        TRUE    FALSE novel_ncrtZ       
## 5      5 step      dummy        TRUE    FALSE dummy_VMDKC
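Since the recipe was prepped with retain = TRUE, the preprocessed training set can be pulled out directly with juice - that’s exactly what will be fed to caret below. A quick peek (output not shown):

juice(recipe_preped) %>% glimpse()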

Fitting our models

Caret

In the code below I’m setting the control parameters for the caret model fit, as well as the grid of hyperparameters that will be assessed in order to pick the best performing combination. Note that I’m reusing the original observation indexes for cross-validation to ensure reproducibility. The trainControl function will also ensure that the final hold-out predictions from cross-validation are persisted for further assessment, thanks to savePredictions = "final".

We have 5 different CV folds and 30 grid combinations to assess, which results in 150 models being fit, each comprising 500 individual trees! All models will be assessed with the twoClassSummary function, using the area under the ROC curve (AUC) as the selection metric.

control_caret <- trainControl(
  method = "cv",
  verboseIter = FALSE,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  returnResamp = "final",
  savePredictions = "final",
  index = train_cv_caret$index,
  indexOut = train_cv_caret$indexOut,
  )

(grid_caret <- expand.grid(
  mtry = seq(1, ncol(df_train) - 1, 3),
  splitrule = c("extratrees", "gini"),
  min.node.size = c(1, 3, 5)
  ))
##    mtry  splitrule min.node.size
## 1     1 extratrees             1
## 2     4 extratrees             1
## 3     7 extratrees             1
## 4    10 extratrees             1
## 5    13 extratrees             1
## 6     1       gini             1
## 7     4       gini             1
## 8     7       gini             1
## 9    10       gini             1
## 10   13       gini             1
## 11    1 extratrees             3
## 12    4 extratrees             3
## 13    7 extratrees             3
## 14   10 extratrees             3
## 15   13 extratrees             3
## 16    1       gini             3
## 17    4       gini             3
## 18    7       gini             3
## 19   10       gini             3
## 20   13       gini             3
## 21    1 extratrees             5
## 22    4 extratrees             5
## 23    7 extratrees             5
## 24   10 extratrees             5
## 25   13 extratrees             5
## 26    1       gini             5
## 27    4       gini             5
## 28    7       gini             5
## 29   10       gini             5
## 30   13       gini             5

The great advantage of caret is that it wraps a lot of small code pieces in just one high-level API call that does all the job for you - it fits all individual models across CV folds and resamples, selects the best one and refits it on the entire training dataset. It also makes sure this is done as fast as possible by using parallel processing whenever that option is enabled.

The drawback, on the other hand, is that it’s quite monolithic, untidy and, in the end, doesn’t offer a great deal of granularity to the end user.

model_caret <- train(
  status ~ .,
  data = juice(recipe_preped),
  method = "ranger",
  metric = "ROC",
  trControl = control_caret,
  tuneGrid = grid_caret,
  importance = "impurity",
  num.trees = 500
  )

print(model_caret)
## Random Forest 
## 
## 3565 samples
##   29 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2851, 2852, 2852, 2852, 2853 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   min.node.size  ROC        Sens          Spec     
##    1    extratrees  1              0.7901464  0.0000000000  1.0000000
##    1    extratrees  3              0.7902729  0.0000000000  1.0000000
##    1    extratrees  5              0.7897509  0.0000000000  1.0000000
##    1    gini        1              0.8222071  0.0009950249  1.0000000
##    1    gini        3              0.8203006  0.0000000000  1.0000000
##    1    gini        5              0.8194996  0.0000000000  1.0000000
##    4    extratrees  1              0.8066284  0.4492786070  0.9191734
##    4    extratrees  3              0.8077006  0.4422786070  0.9226852
##    4    extratrees  5              0.8064328  0.4442587065  0.9226844
##    4    gini        1              0.8326194  0.4671791045  0.9238563
##    4    gini        3              0.8336397  0.4602089552  0.9254188
##    4    gini        5              0.8351950  0.4711691542  0.9281509
##    7    extratrees  1              0.8089259  0.4841144279  0.9086280
##    7    extratrees  3              0.8101340  0.4831144279  0.9094100
##    7    extratrees  5              0.8108062  0.4791343284  0.9117515
##    7    gini        1              0.8325371  0.4960696517  0.9125335
##    7    gini        3              0.8334290  0.4910696517  0.9160491
##    7    gini        5              0.8321708  0.4990447761  0.9140975
##   10    extratrees  1              0.8088898  0.4880945274  0.9039405
##   10    extratrees  3              0.8097806  0.4821044776  0.9074569
##   10    extratrees  5              0.8116245  0.4821194030  0.9066787
##   10    gini        1              0.8304535  0.5050248756  0.9074622
##   10    gini        3              0.8318021  0.5040348259  0.9098022
##   10    gini        5              0.8326123  0.5070248756  0.9121429
##   13    extratrees  1              0.8081539  0.4880845771  0.9031615
##  [ reached getOption("max.print") -- omitted 5 rows ]
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
##  and min.node.size = 5.
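As a quick visual sanity check, caret also ships a ggplot method for train objects that plots the resampled metric (here ROC) across the tuning grid; a one-line sketch (output not shown):

ggplot(model_caret)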

Caret also comes with handy built-in functions for assessing the strength of the model’s individual predictors. By setting importance = "impurity" for the ranger engine we ensure that variable importance will be returned by the final train object. As of now there is no such possibility directly within the tidymodels ecosystem, but this can be solved using another great package called vip.

# Accessing most predictive attributes from caret 
varImp(model_caret, scale = TRUE)$importance %>% 
  rownames_to_column() %>% 
  arrange(-Overall)
##                           rowname    Overall
## 1                          income 100.000000
## 2                       seniority  97.871716
## 3                          amount  85.421899
## 4                           price  79.249485
## 5                             age  66.333280
## 6                          assets  65.518786
## 7                     records_yes  50.558920
## 8                      records_no  50.378803
## 9                        expenses  49.052474
## 10                           time  34.953502
## 11                    job_partime  29.341679
## 12                      job_fixed  27.799490
## 13                           debt  23.744981
## 14                     home_owner  19.524707
## 15                      home_rent  12.910345
## 16                  job_freelance  11.051452
## 17                     home_other  10.008306
## 18                   home_parents   9.289318
## 19                marital_married   8.808904
## 20                 marital_single   7.715428
## 21    marital_infrequent_combined   6.194835
## 22                      home_priv   4.499495
## 23        job_infrequent_combined   3.441151
## 24       home_infrequent_combined   1.463759
## 25    home_unrecorded_observation   0.000000
## 26 marital_unrecorded_observation   0.000000
## 27    records_infrequent_combined   0.000000
## 28 records_unrecorded_observation   0.000000
## 29     job_unrecorded_observation   0.000000

Final cross-validated and test results are easily available with just a couple of lines of code. Note that the cross-validation predictions are aggregated per index (observation) and averaged out before the final performance metric is calculated.

Getting the test performance is a matter of baking the test set with the already prepped recipe and then making the prediction using the train object. 83.5% AUC for cross-validated training performance and 81.3% for testing - not a bad result for so little preprocessing! The close results also suggest that our model is likely to generalize well to new samples.

df_train_pred_caret <- model_caret$pred %>% 
  group_by(rowIndex) %>% 
  summarise(bad = mean(bad)) %>% 
  transmute(estimate = bad) %>% 
  add_column(truth = df_train$status)

# Cross-validated training performance
percent(roc_auc(df_train_pred_caret, truth, estimate)$.estimate)
## [1] "83.5%"
df_test_pred_caret <- predict(
    model_caret,
    newdata = bake(recipe_preped, df_test),
    type = "prob") %>%
  as_tibble() %>%
  transmute(estimate = bad) %>%
  add_column(truth = df_test$status)

# Test performance
percent(roc_auc(df_test_pred_caret, truth, estimate)$.estimate)
## [1] "81.3%"
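If a single number isn’t enough, the same prediction tibble can also be passed to yardstick’s roc_curve and plotted - a small sketch (output not shown):

df_test_pred_caret %>%
  roc_curve(truth, estimate) %>%
  autoplot()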

Tidymodels

First, let’s use parsnip to define our ‘modelling engine’ - just like before, we’re setting it up as a classification problem using a Random Forest running on the ranger engine. On top of that I’m using dials to define a grid of parameters to optimize. dials provides a set of handy functions, such as grid_random or grid_regular, that let you choose the range of parameters in a very flexible way. Lastly, I will use tune and workflows to optimize the parameters, build the overall modelling workflow and finalize it with the best parameter values.

From what I can see, the set of tunable parameters differs slightly between the two frameworks: with caret I tune mtry, splitrule and min.node.size while keeping the number of trees constant, whereas with parsnip I tune mtry, trees and min_n while leaving the split rule at its ranger default. Nevertheless, the total number of combinations is the same in both cases and equal to 30.

# Specifying the modelling engine
(engine_tidym <- rand_forest(
    mode = "classification",
    mtry = tune(),
    trees = tune(),
    min_n = tune()
  ) %>% 
  set_engine("ranger", importance = "impurity")) # you can provide additional, engine specific arguments to '...'
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger
# Specifying the grid of hyperparameters that should be tested
(gridy_tidym <- grid_random(
  mtry() %>% range_set(c(1, 20)),
  trees() %>% range_set(c(500, 1000)), 
  min_n() %>% range_set(c(2, 10)),
  size = 30
  ))
## # A tibble: 30 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     9   525     4
##  2     7   970     7
##  3    20   835     4
##  4     1   567     5
##  5    15   813     8
##  6     9   589     4
##  7    14   821     8
##  8    11   763     6
##  9     1   500     3
## 10    19   861     7
## # … with 20 more rows
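As mentioned above, dials also offers grid_regular if you prefer an evenly spaced grid over random sampling. A quick sketch of what that would look like for the same parameter ranges (not used further; levels = 3 would give 27 combinations):

grid_regular(
  mtry() %>% range_set(c(1, 20)),
  trees() %>% range_set(c(500, 1000)),
  min_n() %>% range_set(c(2, 10)),
  levels = 3
  )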

Then we can combine the recipe we specified before with the modelling engine to form a so-called workflow. A workflow puts together all the pieces of the overall modelling pipeline, which makes them easier to manipulate and control.

wkfl_tidym <- workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(engine_tidym)

Next, I pass the grid of parameters and the workflow to tune_grid to find the best performing combination of hyperparameters.

grid_tidym <- tune_grid(
  wkfl_tidym,
  resamples = train_cv,
  grid = gridy_tidym,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE)
  )

print(grid_tidym)
## #  5-fold cross-validation using stratification 
## # A tibble: 5 x 5
##   splits           id    .metrics        .notes         .predictions       
## * <list>           <chr> <list>          <list>         <list>             
## 1 <split [2.9K/71… Fold1 <tibble [30 × … <tibble [30 ×… <tibble [21,420 × …
## 2 <split [2.9K/71… Fold2 <tibble [30 × … <tibble [0 × … <tibble [21,390 × …
## 3 <split [2.9K/71… Fold3 <tibble [30 × … <tibble [30 ×… <tibble [21,390 × …
## 4 <split [2.9K/71… Fold4 <tibble [30 × … <tibble [30 ×… <tibble [21,390 × …
## 5 <split [2.9K/71… Fold5 <tibble [30 × … <tibble [30 ×… <tibble [21,360 × …

You can aggregate the performance metrics for each parameter combination across all cross-validation folds to find the best performing set, which I will use in the final model.

collect_metrics(grid_tidym)
## # A tibble: 30 x 8
##     mtry trees min_n .metric .estimator  mean     n std_err
##    <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
##  1     1   500     3 roc_auc binary     0.823     5  0.0101
##  2     1   567     5 roc_auc binary     0.820     5  0.0111
##  3     1   568     5 roc_auc binary     0.820     5  0.0109
##  4     1   644     2 roc_auc binary     0.822     5  0.0114
##  5     1   676     2 roc_auc binary     0.821     5  0.0102
##  6     2   591     4 roc_auc binary     0.831     5  0.0118
##  7     6   592     7 roc_auc binary     0.834     5  0.0109
##  8     7   970     7 roc_auc binary     0.834     5  0.0104
##  9     9   525     4 roc_auc binary     0.832     5  0.0102
## 10     9   589     4 roc_auc binary     0.832     5  0.0106
## # … with 20 more rows
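tune also provides an autoplot method for tuning results, which makes it easy to eyeball how the AUC changes across the grid; a one-line sketch (output not shown):

autoplot(grid_tidym)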

We can propagate the best combination of parameters into the workflow using the finalize_workflow function. At this point we’re almost done - the last step is to refit the finalized workflow on the entire training data.

grid_tidym_best <- select_best(grid_tidym)
(wkfl_tidym_best <- finalize_workflow(wkfl_tidym, grid_tidym_best))
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## ● step_unknown()
## ● step_medianimpute()
## ● step_other()
## ● step_novel()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 7
##   trees = 970
##   min_n = 7
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

This can be easily achieved using the last_fit function, which fits the finalized workflow on the entire training data and at the same time provides test set performance metrics. This makes the workflow object complete and gives the data scientist comprehensive insight into overall model performance, as well as a fully operational model pipeline that can be deployed to production.

(wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = split))
## ! Resample1: model (predictions): Novel levels found in column 'home': NA. The l...
## # # Monte Carlo cross-validation (0.8/0.2) with 1 resamples  
## # A tibble: 1 x 6
##   splits       id          .metrics     .notes     .predictions   .workflow
## * <list>       <chr>       <list>       <list>     <list>         <list>   
## 1 <split [3.6… train/test… <tibble [2 … <tibble [… <tibble [889 … <workflo…

We can then easily check both cross-validated training performance, as well as test set performance with just two lines of code.

# Cross-validated training performance
percent(show_best(grid_tidym, n = 1)$mean)
## [1] "83.4%"
# Test performance
percent(wkfl_tidym_final$.metrics[[1]]$.estimate[[2]])
## [1] "81.2%"
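As a side note, the test metrics computed by last_fit can also be pulled out in a tidier way with collect_metrics, which should return the accuracy and ROC AUC measured on the test set (sketch, output not shown):

collect_metrics(wkfl_tidym_final)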

At the moment there’s no package in the tidymodels universe for calculating model importance metrics (I assume that will change at some point), but this can be achieved either with the vip or the DALEX package. It would be fantastic if either of these packages seamlessly worked with tidymodels objects!

In this case I decided to use model-specific metrics (by passing importance = "impurity" to the engine before) and then simply used the vip function on the model object extracted from the workflow. However, you might need to change this for every specific model type that you decide to use. Additionally, at this link you can find how to achieve the same using DALEX.

vip(pull_workflow_fit(wkfl_tidym_final$.workflow[[1]]))$data
## # A tibble: 10 x 2
##    Variable    Importance
##    <chr>            <dbl>
##  1 income           150. 
##  2 seniority        139. 
##  3 amount           128. 
##  4 price            123. 
##  5 age               99.5
##  6 assets            89.8
##  7 expenses          71.5
##  8 records_no        63.4
##  9 records_yes       61.0
## 10 time              47.3
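The call above extracts the underlying data frame with $data; dropping that part gives the usual variable importance bar chart directly (sketch, output not shown):

vip(pull_workflow_fit(wkfl_tidym_final$.workflow[[1]]), num_features = 10)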

Wrapping up

I’ve fit a credit scoring classification Random Forest model using both the caret and the tidymodels framework. I have to admit that before I started writing this post I expected to have to write a lot more additional code in tidymodels to achieve the same goal, but to my surprise those packages already offer a very concise (and tidy!) way of doing ML in R, and things will become even more streamlined in the upcoming months. That’s definitely a really big step-up for the entire R community when it comes to doing ML in R.

Future considerations

  1. I still haven’t fully explored the tidyposterior and probably packages - I will do that in one of my next posts.

  2. On the rsample page there’s an interesting article listed on so-called nested resampling. I’ve never used it in practice, but I’m curious to check it out and compare my model’s current cross-validated performance estimate with the one obtained through nested resampling.

  3. There’s also a lot of buzz in the R community regarding the beta release of mlr3, the successor of the mlr package (the second most popular ML framework in R). mlr3 could be a very strong competitor to the tidymodels framework, and since I’ve never really used mlr, it’s an excellent opportunity to put it to the test. It is also modular in design like tidymodels, but it is built on top of data.table and uses the R6 object-oriented class system, which could give it a substantial speed advantage over tidymodels at the expense of ‘tidyness’.

Previous tidymodels workflow

In case you’re interested, you can find the original content of this blogpost below. Please note that the sections below are not evaluated, to avoid potential errors when rendering this blogpost due to deprecations/changes.


In the beginning, when I saw some of the very first articles about doing ML the tidy way by combining recipes and rsample, my impression was that it was all way too complicated compared to what caret offered. I was very surprised to discover how clean and simple it has become over the last year, and apparently things will be further simplified over the next months (link)!

First let’s define two helper functions that will be used later during the modelling process. I imagine these might be wrapped into predefined helper functions in tidymodels packages instead of having to do that every time.

# Defining helper functions that will be used later on
fit_on_fold <- function(spec, prepped) {
  
  x <- juice(prepped, all_predictors())
  y <- juice(prepped, all_outcomes())
  
  fit_xy(spec, x, y)
}

predict_helper <- function(split, recipe, fit) {
  
  new_x <- bake(recipe, new_data = assessment(split), all_predictors())
  
  predict(fit, new_x, type = "prob") %>% 
    bind_cols(assessment(split) %>% select(status)) 
}

First, let’s use parsnip to define our ‘modelling engine’ - just like before we’re setting it as a classification problem, using Random Forest running on the ranger engine. On top of that I’m using dials to define a grid of parameters to optimize. dials provides a set of handy functions, such as: grid_random or grid_regular, that let you choose the range of parameters in a very flexible way.

From what I can see the parameters that could be optimized slightly differ between both frameworks: caret allows for tuning ‘min.node.size’ while keeping ‘trees’ constant, while parsnip allows for tuning ‘trees’ while keeping ‘min.node.size’ constant (I assume it’s using the default ranger values). Nevertheless, the total number of combinations is the same in both cases and equal to 30.

# Specifying the modelling engine
(engine_tidym <- rand_forest(mode = "classification") %>% 
  set_engine("ranger"))
# Specifying the grid of hyperparameters that should be tested
(gridy_tidym <- grid_random(
  mtry %>% range_set(c( 1,  20)),
  trees %>% range_set(c( 500, 1000)), 
  min_n %>% range_set(c(2,  10)),
  size = 30
  ))

Now comes the really interesting part of tidymodels: we’re using a merge helper function from dials to bind our predefined ‘modelling engine’ with all grid combinations of the hyperparameters to tune.

merge(engine_tidym, gridy_tidym)[1:3] # just to see the top 3

Subsequently, I’m putting it into a tidy data frame structure where each model-parameter combination is bound together and assigned a model id that will be used later to distinguish between consecutive fits.

# Merging all possibilities with our cross-validated data frame
(spec_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym)) %>% 
  mutate(model_id = row_number()))

Lastly, I’m adding the last component into this tidy structure: all cross-validation splits that were specified before with the use of the crossing function. This part is very likely to evolve and be simplified in the upcoming months. Now we’re all set to start the actual tidy-modelling!

(spec_tidym <- crossing(train_cv, spec_tidym))

To speed things up let’s use the furrr package and fit many models simultaneously. In the following code our original recipe is first prepped on each split’s training set and then it’s used by the fit_on_fold helper function to fit a given model-parameter combination.

# Fitting each model-fold pair
fits_tidym <- spec_tidym %>%
  mutate(
    prepped = future_map(splits, prepper, recipe),
    fit = future_map2(spec, prepped, fit_on_fold)
  )

The last step of modelling involves the predict_helper function, which bakes the already prepped split’s recipe, applies it to the testing set of the split and makes a prediction for the given model-parameter combination.

# Making predictions of each fitted model on the testing set
fits_pred_tidym <- fits_tidym %>%
  mutate(
    preds = future_pmap(list(splits, prepped, fit), predict_helper)
  )
# Top row of the entire structure as example
fits_pred_tidym[1, ]

After training is done I would like to assess which model performs best based on the cross-validated hold-out performance. In order to do that, let’s calculate the AUC on the test sets across all model-parameter combinations. By averaging the results, I can see the entire performance profile of all possibilities.

Tidymodels also includes two very handy packages: probably and tidyposterior, which are very useful for analysing a model’s estimated probabilities and its resampled performance profile. I will make an introduction to those packages in one of my next posts.

# Assessing individual model-fold performance and averaging performance across all folds for each model
(perf_summary_tidym <- fits_pred_tidym %>% 
  unnest(preds) %>% 
  group_by(id, model_id) %>% 
  roc_auc(truth = status, .pred_bad) %>% 
  group_by(model_id, .metric, .estimator) %>% 
  summarize(mean = mean(.estimate, na.rm = TRUE))) %>% 
  arrange(-mean)

Just by sorting the previous results we can easily see which model performs best. Let’s now take a step back, filter only that model specification and fit it on the entire training set. As of now I’m not 100% sure what the recommended and most efficient way of doing that would be, but I decided to go for something like this:

# Selecting the best model with:
# perf_summary_tidym$model_id[which.max(perf_summary_tidym$mean)]

# Fitting the best model on the full training set
(model_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym)) %>% 
  mutate(model_id = row_number()) %>% 
  filter(model_id == perf_summary_tidym$model_id[which.max(perf_summary_tidym$mean)]) %>% 
  pull(spec) %>% 
  .[[1]] %>% 
  fit(status ~ ., juice(recipe_preped)))

As before with caret, I can now summarize our cross-validated and test performances.

# Cross-validated training performance
percent(perf_summary_tidym$mean[which.max(perf_summary_tidym$mean)])
# Test performance
df_train_pred_tidym <- predict(
  model_tidym, 
  new_data = bake(recipe_preped, df_test), 
  type = "prob"
  ) %>% 
  transmute(estimate = .pred_bad) %>%
  add_column(truth = df_test$status)

percent(roc_auc(df_train_pred_tidym, truth, estimate)$.estimate)

The entire tidymodels code that was scattered across the sections above can easily be squeezed into one longer pipeline. Note that I limited the grid to just its first two rows, gridy_tidym[1:2, ], in order to demonstrate the solution and save on processing time.

df_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym[1:2, ])) %>% 
  mutate(model_id = row_number()) %>% 
  crossing(train_cv, .) %>%
  mutate(
    prepped = future_map(splits, prepper, recipe),
    fit = future_map2(spec, prepped, fit_on_fold),
    preds = future_pmap(list(splits, prepped, fit), predict_helper)
  )