The iml package can now handle bigger datasets. Earlier problems with exploding memory have been fixed for FeatureEffect, FeatureImp and Interaction. It is also possible now to compute FeatureImp and Interaction in parallel. This document describes how.
First we load some data, fit a random forest and create a Predictor object.
set.seed(42)
library("iml")
library("randomForest")
data("Boston", package = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 500)
# Keep all features except the target column medv
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(rf, data = X, y = Boston$medv)
You need to install the doParallel package or a similar parallel backend to compute in parallel. Before you can use parallelization to compute, for example, the feature importance on multiple CPU cores, you have to set up a cluster. Fortunately, the doParallel package makes it easy to set up and register a cluster:
library("doParallel")
#> Loading required package: iterators
#> Loading required package: parallel
# Creates a cluster with 2 cores
cl = makePSOCKcluster(2)
# Registers cluster
registerDoParallel(cl)
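If you would rather not hard-code the number of cores, the parallel package (loaded by doParallel above) can detect how many are available. This is a small sketch of an alternative cluster setup, not part of the original example:
# Use all but one of the available cores, leaving one free for other work
n.cores = parallel::detectCores() - 1
cl = makePSOCKcluster(n.cores)
registerDoParallel(cl)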
Now we can easily compute feature importance in parallel. This means that the computation per feature is distributed among the 2 cores I specified earlier.
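For example, mirroring the call that is timed below, the feature importance can be computed with parallel = TRUE:
# Compute feature importance; the work per feature is distributed over the registered cores
fi = FeatureImp$new(predictor, loss = "mae", parallel = TRUE)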
That wasn’t very impressive, so let’s see how much speedup we actually get from parallelization.
system.time(FeatureImp$new(predictor, loss = "mae", parallel = FALSE))
#> user system elapsed
#> 1.520 0.004 1.524
system.time(FeatureImp$new(predictor, loss = "mae", parallel = TRUE))
#> user system elapsed
#> 0.084 0.012 1.136
A little bit of improvement, but not too impressive. Parallelization is more useful when the model uses many features or when the feature importance computation is repeated more often (a higher n.repetitions) to get more stable results.
system.time(FeatureImp$new(predictor, loss = "mae", parallel = FALSE, n.repetitions = 20))
#> user system elapsed
#> 5.544 0.000 5.543
system.time(FeatureImp$new(predictor, loss = "mae", parallel = TRUE, n.repetitions = 20))
#> user system elapsed
#> 0.084 0.008 3.225
Here the parallel computation is roughly twice as fast as the sequential computation of the feature importance.
The parallelization also speeds up the computation of the interaction statistics:
system.time(Interaction$new(predictor, parallel = FALSE))
#> user system elapsed
#> 11.912 0.004 11.926
system.time(Interaction$new(predictor, parallel = TRUE))
#> user system elapsed
#> 0.064 0.008 7.144
Remember to stop the cluster when you are done, for example as sketched below.
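A minimal sketch of the cleanup, assuming the cluster object cl created above is still in scope:
# Shut down the worker processes and free their resources
stopCluster(cl)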