In this vignette, we present a global variable importance measure based on Partial Dependence Profiles (PDP) for the random forest regression model.
We work with the apartments dataset from the DALEX package.
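The first rows printed below can be reproduced with a call along these lines. This is a minimal sketch that assumes the DALEX package is installed and that its bundled apartments data frame is the dataset meant here.

```r
# Assumes the DALEX package is installed; it bundles the apartments data.
library("DALEX")

# Print the first six rows of the training data.
head(apartments)
```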
#>   m2.price construction.year surface floor no.rooms    district
#> 1     5897              1953      25     3        1 Srodmiescie
#> 2     1818              1992     143     9        5     Bielany
#> 3     3643              1937      56     1        2       Praga
#> 4     3517              1995      93     7        3      Ochota
#> 5     3013              1992     144     6        5     Mokotow
#> 6     5795              1926      61     6        2 Srodmiescie
Now, we fit a random forest regression model and wrap it with the explain() function from DALEX.
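library("DALEX")         # provides explain() and the apartments data
library("randomForest")
# Fit a random forest on the four numeric predictors.
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
                                      no.rooms, data = apartments)
# Wrap the model in an explainer evaluated on the test set.
explainer_rf <- explain(apartments_rf_model,
                        data = apartmentsTest[, 2:5], y = apartmentsTest$m2.price)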
#> Preparation of a new explainer is initiated
#> -> model label : randomForest ( default )
#> -> data : 9000 rows 4 cols
#> -> target variable : 9000 values
#> -> predict function : yhat.randomForest will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package randomForest , ver. 4.7.1.2 , task regression ( default )
#> -> predicted values : numerical, min = 2099.014 , mean = 3515.384 , max = 5291.89
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -1203.878 , mean = -3.860303 , max = 2118.176
#> A new explainer has been created!
Let's look at the Partial Dependence Profiles calculated with the DALEX::model_profile() function. PDPs can also be calculated with DALEX::variable_profile() or ingredients::partial_dependence().
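As an illustration of the step described above, the sketch below fits the random forest, builds the explainer, and computes the profiles with model_profile(). It assumes DALEX and randomForest are installed; the object names rf_model, rf_explainer and profiles, the seed value, and verbose = FALSE are choices made for this example and do not come from the original text.

```r
# A sketch assuming the DALEX and randomForest packages are installed.
library("DALEX")
library("randomForest")

set.seed(1)  # arbitrary seed, added only to make the sketch repeatable

rf_model <- randomForest(m2.price ~ construction.year + surface + floor + no.rooms,
                         data = apartments)

rf_explainer <- explain(rf_model,
                        data = apartmentsTest[, 2:5],
                        y = apartmentsTest$m2.price,
                        verbose = FALSE)  # suppress the preparation log

# Partial Dependence Profiles for the variables in the explainer data.
profiles <- model_profile(rf_explainer)

# The aggregated profiles are stored as a data frame in agr_profiles.
head(profiles$agr_profiles)

# Draw one panel per variable.
plot(profiles)
```

Because a random forest is fitted with random sampling, the exact profile values will vary between runs unless a seed is set.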
Now, we calculate a measure of global variable importance based on the oscillations of the PDP.
According to this measure, the most important variable is surface, followed by no.rooms, floor, and construction.year.
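To make the idea of an oscillation-based measure concrete, the sketch below computes a partial dependence profile by hand and summarises how far it deviates from its own mean. It uses only base R, with a linear model on the built-in mtcars data standing in for the apartment model. The helper names pdp_by_hand() and oscillation() are hypothetical, and the mean absolute deviation is an assumed form of the measure that may differ from the formula used to obtain the ranking above.

```r
# Hypothetical helper: partial dependence of `model` on `variable`,
# evaluated on an evenly spaced grid spanning the observed range.
pdp_by_hand <- function(model, data, variable, grid_size = 20) {
  grid <- seq(min(data[[variable]]), max(data[[variable]]),
              length.out = grid_size)
  yhat <- vapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value             # fix the variable at one grid value
    mean(predict(model, newdata = modified))  # average over all observations
  }, numeric(1))
  data.frame(x = grid, yhat = yhat)
}

# Hypothetical helper: assumed oscillation measure, the mean absolute
# deviation of the profile from its average value.
oscillation <- function(profile) {
  mean(abs(profile$yhat - mean(profile$yhat)))
}

# Stand-in model and data (not the apartments example from the text).
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
variables <- c("wt", "hp", "qsec")

importance <- vapply(variables, function(v) {
  oscillation(pdp_by_hand(model, mtcars, v))
}, numeric(1))

# Larger values mean the average prediction changes more along the variable.
sort(importance, decreasing = TRUE)
```

A flat profile gives a value of zero under this definition, so a variable that the model ignores ends up at the bottom of the ranking.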
Next, we create a linear regression model and its explainer object.
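# Fit a linear model on the same predictors.
apartments_lm_model <- lm(m2.price ~ construction.year + surface + floor +
                            no.rooms, data = apartments)
# Wrap it in an explainer evaluated on the same test set.
explainer_lm <- explain(apartments_lm_model,
                        data = apartmentsTest[, 2:5], y = apartmentsTest$m2.price)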
#> Preparation of a new explainer is initiated
#> -> model label : lm ( default )
#> -> data : 9000 rows 4 cols
#> -> target variable : 9000 values
#> -> predict function : yhat.lm will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.4.2 , task regression ( default )
#> -> predicted values : numerical, min = 2231.8 , mean = 3507.346 , max = 4769.053
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -733.2516 , mean = 4.177813 , max = 2107.979
#> A new explainer has been created!
We calculate the Partial Dependence Profiles and the importance measure for the linear model as well.
Now we can compare the order of variable importance between the two models.