I was excited to start using the new set of 'tidymodels' packages from Max Kuhn (creator of caret): rsample, recipes, yardstick, parsnip and dials. These are still under development but seem promising. The one I have found most useful so far is recipes. Here I'll give a quick overview of how to use it for some simple data preparation for machine learning.
R's approach to machine learning has always been a bit haphazard and fragmented. There has never been an equivalent to Python's scikit-learn. I have never really got along with caret (the main contender) or mlr. I found their APIs difficult to learn, and I've never liked the amount of control you give up by using them. I like the fact that this new set of packages is modular, so each one can be used without fully giving up on other approaches.
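Basically, recipes provides a bunch of tools for preparing data and creating design matrices. This is a form of feature engineering. These matrices can then be used as training data for ML models. This is done in four steps: make a recipe with recipe(), add preprocessing steps with the step_*() functions, estimate those steps on the training data with prep(), and apply the prepared recipe to a dataset with bake().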
Here is a quick example that does median imputation, then centres and scales the airquality dataset, to give an idea of how it works.
library(recipes)

# first 100 rows for training, the rest for testing
aq_train = airquality[1:100, ]
aq_test = airquality[-(1:100), ]

# make recipe
recipe_1 = recipe(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = aq_train) %>%
  # add steps
  # (median imputation was called step_medianimpute() in older recipes releases)
  step_impute_median(all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  # prep recipe: estimate medians, means and standard deviations from the training data
  prep(training = aq_train, retain = TRUE, verbose = TRUE)

# make model matrices
mm_train = bake(recipe_1, new_data = aq_train, composition = 'matrix')
mm_test = bake(recipe_1, new_data = aq_test, composition = 'matrix')
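To make the split between prep() and bake() more concrete, here is a rough base-R sketch of the same idea: the medians, means and standard deviations are estimated once from the training rows and then reused unchanged on the test rows. This is not how recipes is implemented internally, and its numbers may differ slightly from the package's output. The helper names estimate_params() and apply_params() are made up for illustration.

```r
# Base-R sketch of "estimate on training data, apply to new data".
# estimate_params() and apply_params() are hypothetical helpers, not recipes functions.
aq_train <- airquality[1:100, ]
aq_test  <- airquality[-(1:100), ]

# Rough analogue of prep(): learn the parameters from the training data only
estimate_params <- function(train) {
  medians <- sapply(train, median, na.rm = TRUE)
  imputed <- train
  for (col in names(imputed)) {
    imputed[[col]][is.na(imputed[[col]])] <- medians[[col]]
  }
  # means and SDs are computed after imputation, mirroring the step order
  list(
    medians = medians,
    means   = sapply(imputed, mean),
    sds     = sapply(imputed, sd)
  )
}

# Rough analogue of bake(): apply the stored parameters to any dataset
apply_params <- function(new_data, params) {
  out <- new_data
  for (col in names(out)) {
    out[[col]][is.na(out[[col]])] <- params$medians[[col]]
    out[[col]] <- (out[[col]] - params$means[[col]]) / params$sds[[col]]
  }
  as.matrix(out)
}

params     <- estimate_params(aq_train)
base_train <- apply_params(aq_train, params)
base_test  <- apply_params(aq_test, params)
```

Because the test rows are transformed with the training statistics, the columns of base_test will generally not have a mean of exactly zero or a standard deviation of exactly one. That is the intended behaviour: nothing about the test set leaks into the preprocessing.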
Once the data are baked, you can go off and do what you want with the model matrix. Changing the composition argument lets you get the result as a "tibble", "matrix", "data.frame" or "dgCMatrix".
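As one illustration of what you might do next, the sketch below bakes the same data as tibbles and fits a plain linear model to them. It assumes a recent recipes release, where the median imputation step is called step_impute_median(). The choice of lm() and the object names are mine, purely for illustration.

```r
library(recipes)

aq_train <- airquality[1:100, ]
aq_test  <- airquality[-(1:100), ]

# Same preprocessing as before: impute, centre, scale, then prep on the training rows
rec <- recipe(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = aq_train) %>%
  step_impute_median(all_numeric()) %>%  # step_medianimpute() in older releases
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep(training = aq_train)

# composition = "tibble" keeps column names, which is convenient for formula interfaces
tb_train <- bake(rec, new_data = aq_train, composition = "tibble")
tb_test  <- bake(rec, new_data = aq_test, composition = "tibble")

# Illustrative model only: ordinary least squares on the preprocessed data
fit  <- lm(Ozone ~ ., data = tb_train)
pred <- predict(fit, newdata = tb_test)
```

Note that the recipe centres and scales Ozone along with the predictors, so the predictions are on the standardised scale rather than in the original units.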
The recipes package is really useful and I've been using it a lot lately. It has streamlined a part of my workflow that I'd been struggling with. It still has a few rough edges but is really worth trying out.