--- title: "Understanding random forests with randomForestExplainer" author: "Aleksandra PaluszyƄska" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Understanding random forests with randomForestExplainer} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 5) ``` # Introduction This vignette demonstrates how to use the randomForestExplainer package. We will use the `Boston` from package `MASS`. In fact, the development of randomForestExplainer was motivated by problems that include lots of predictors and not many observations. However, as they usually require growing large forests and are computationally intensive, we use the small `Boston` data set here and encourage you to follow our analysis of such large data set concerning glioblastoma here: [initial vignette](https://htmlpreview.github.io/?https://github.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer/inst/doc/randomForestExplainer.html). We will use the following packages: ```{r} library(randomForest) # devtools::install_github("MI2DataLab/randomForestExplainer") library(randomForestExplainer) ``` # Data and forest The data set `Boston` has the following structure: ```{r} data(Boston, package = "MASS") Boston$chas <- as.logical(Boston$chas) str(Boston) ``` We use the `randomForest::randomForest` function to train a forest of $B=500$ trees (default value of the `mtry` parameter of this function), with option `localImp = TRUE`. The forest is supposed to predict the median price of an apartment `medv` based on its characteristics. ```{r} set.seed(2017) forest <- randomForest(medv ~ ., data = Boston, localImp = TRUE) ``` The prediction accuracy of our forest can be summarized as follows: ```{r} forest ``` Now, we will use all the functions of `randomForestExplainer` in turn and comment on the obtained results. # Distribution of minimal depth To obtain the distribution of minimal depth we pass our forest to the function `min_depth_distribution` and store the result, which contains the following columns (we save this and load it from memory as it takes a while): ```{r} min_depth_frame <- min_depth_distribution(forest) head(min_depth_frame, n = 10) ``` Next, we pass it to the function `plot_min_depth_distribution` and under default settings obtain obtain a plot of the distribution of minimal depth for top ten variables according to mean minimal depth calculated using top trees (`mean_sample = "top_trees"`). We could also pass our forest directly to the plotting function but if we want to make more than one plot of the minimal depth distribution is more efficient to pass the `min_depth_frame` to the plotting function so that it will not be calculated again for each plot (this works similarly for other plotting functions of `randomForestExplainer`). ```{r} # plot_min_depth_distribution(forest) # gives the same result as below but takes longer plot_min_depth_distribution(min_depth_frame) ``` The function `plot_min_depth_distribution` offers three possibilities when it comes to calculating the mean minimal depth, which differ in he way they treat missing values that appear when a variable is not used for splitting in a tree. They can be described as follows: - `mean_sample = "all_trees"` (filling missing value): the minimal depth of a variable in a tree that does not use it for splitting is equal to the mean depth of trees. Note that the depth of a tree is equal to the length of the longest path from root to leave in this tree. This equals the maximum depth of a variable in this tree plus one, as leaves are by definition not split by any variable. - `mean_sample = "top_trees"` (restricting the sample): to calculate the mean minimal depth only $\tilde{B}$ out of $B$ (number of trees) observations are considered, where $\tilde{B}$ is equal to the maximum number of trees in which any variable was used for splitting. Remaining missing values for variables that were used for splitting less than $\tilde{B}$ times are filled in as in `mean_sample = "all_trees"`. - `mean_sample = "relevant_trees"` (ignoring missing values): mean minimal depth is calculated using only non-missing values. Note that the $x$-axis ranges from zero trees to the maximum number of trees in which any variable was used for splitting ($\tilde{B}$) which is in this case equal to 500 and is reached by all variables plotted. The ordering of variables in our plot by their mean minimal depth seems quite accurate when we look at the distribution of minimal depth, though one could argue that for example `indus` should be ranked higher than `dis` as the latter is never used for splitting at the root. Usually we would obtain different orderings when changing the `mean_sample` option but this is not the case if variables are used for splitting in all trees as this option only influences how and whether missing values are treated. The default option, `"top_trees"`, penalizes missing values and this penalization makes the interpretation of the values less obvious -- to address that we can calculate the mean minimal depth only using non-missing observations (`mean_sample = "relevant_trees"`). For forests with many variables with a lot of missing observations we should always consider adding the `min_no_of_trees` option so that only variables used for splitting in at least the declared number of trees will be considered for the plot. This allows us to avoid selecting variables that have been by chance used for splitting e.g., only once but at the root (their mean would be equal to 0). However, in our case we can simply increase the `k` parameter to plot all trees: ```{r fig.height = 7} plot_min_depth_distribution(min_depth_frame, mean_sample = "relevant_trees", k = 15) ``` Clearly, using only relevant trees for calculating the mean does not change it for variables that have no missing values. Also, in this case the change does not influence the ordering of variables, but of course this usually not the case in more complex examples. Regardless of the exact parameters used in `plot_min_depth_distribution`, looking at the whole distribution of minimal depth offers a lot more insight into the role that a predictor plays in a forest in contrast to looking only at the mean, especially as it can be calculated in more than one way. Additionally, the function allows us to specify the maximum number of variables plotted `k`, whether the values of mean minimal depth should be scaled to the $[0,1]$ interval (`mean_scale`, logical), the number of digits to round the mean to for display (`mean_round`) and the title of the plot (`main`). # Various variable importance measures To further explore variable importance measures we pass our forest to `measure_importance` function and get the following data frame (we save and load it from memory to save time): ```{r} importance_frame <- measure_importance(forest) importance_frame ``` It contains 13 rows, each corresponding to a predictor, and 8 columns of which one stores the variable names and the rest store the variable importance measures of a variable $X_j$: (a) `accuracy_decrease` (classification) -- mean decrease of prediction accuracy after $X_j$ is permuted, (b) `gini_decrease` (classification) -- mean decrease in the Gini index of node impurity (i.e. increase of node purity) by splits on $X_j$, (c) `mse_increase` (regression) -- mean increase of mean squared error after $X_j$ is permuted, (d) `node_purity_increase` (regression) -- mean node purity increase by splits on $X_j$, as measured by the decrease in sum of squares, (e) `mean_minimal_depth` -- mean minimal depth calculated in one of three ways specified by the parameter `mean_sample`, (f) `no_of_trees` -- total number of trees in which a split on $X_j$ occurs, (g) `no_of_nodes` -- total number of nodes that use $X_j$ for splitting (it is usually equal to `no_of_trees` if trees are shallow), (h) `times_a_root` -- total number of trees in which $X_j$ is used for splitting the root node (i.e., the whole sample is divided into two based on the value of $X_j$), (i) `p_value` -- $p$-value for the one-sided binomial test using the following distribution: $$Bin(\texttt{no_of_nodes},\ \mathbf{P}(\text{node splits on } X_j)),$$ where we calculate the probability of split on $X_j$ as if $X_j$ was uniformly drawn from the $r$ candidate variables $$\mathbf{P}(\text{node splits on } X_j) = \mathbf{P}(X_j \text{ is a candidate})\cdot\mathbf{P}(X_j \text{ is selected}) = \frac{r}{p}\cdot \frac{1}{r} = \frac{1}{p}.$$ This test tells us whether the observed number of successes (number of nodes in which $X_j$ was used for splitting) exceeds the theoretical number of successes if they were random (i.e. following the binomial distribution given above). Measures (a)-(d) are calculated by the `randomForest` package so need only to be extracted from our `forest` object if option `localImp = TRUE` was used for growing the forest (we assume this is the case). Note that measures (a) and (c) are based on the decrease in predictive accuracy of the forest after perturbation of the variable, (b) and (d) are based on changes in node purity after splits on the variable and (e)-(i) are based on the structure of the forest. The function `measure_importance` allows you to specify the method of calculating mean minimal depth (`mean_sample` parameter, default `"top_trees"`) and the measures to be calculated as a character vector a subset of names of measures given above (`measures` parameter, default to `NULL` leads to calculating all measures). ## Multi-way importance plot Below we present the result of `plot_multi_way_importance` for the default values of `x_measure` and `y_measure`, which specify measures to use on $x$ and $y$-axis, and the size of points reflects the number of nodes split on the variable. For problems with many variables we can restrict the plot to only those used for splitting in at least `min_no_of_trees` trees. By default 10 top variables in the plot are highlighted in blue and labeled (`no_of_labels`) -- these are selected using the function `important_variables`, i.e. using the sum of rankings based on importance measures used in the plot (more variables may be labeled if ties occur). ```{r} # plot_multi_way_importance(forest, size_measure = "no_of_nodes") # gives the same result as below but takes longer plot_multi_way_importance(importance_frame, size_measure = "no_of_nodes") ``` Observe the marked negative relation between `times_a_root` and `mean_min_depth`. Also, the superiority of `lstat` and `rm` is clear in all three dimensions plotted (though it is not clear which of the two is better). Further, we present the multi-way importance plot for a different set of importance measures: increase of mean squared error after permutation ($x$-axis), increase in the node purity index ($y$-axis) and levels of significance (color of points). We also set `no_of_labels` to five so that only five top variables will be highlighted (as ties occur, six are eventually labeled). ```{r} plot_multi_way_importance(importance_frame, x_measure = "mse_increase", y_measure = "node_purity_increase", size_measure = "p_value", no_of_labels = 5) ``` As in the previous plot, the two measures used as coordinates seem correlated, but in this case this is somewhat more surprising as one is connected to the structure of the forest and the other to its prediction, whereas in the previous plot both measures reflected the structure. Also, in this plot we see that although `lstat` and `rm` are similar in terms of node purity increase and $p$-value, the former is markedly better if we look at the increase in MSE. Interestingly, `nox` and `indus` are quite good when it comes to the two measures reflected on the axes, but are not significant according to our $p$-value, which is a derivative of the number of nodes that use a variable for splitting. ## Compare measures using ggpairs Generally, the multi-way importance plot offers a wide variety of possibilities so it can be hard to select the most informative one. One idea of overcoming this obstacle is to first explore relations between different importance measures to then select three that least agree with each other and use them in the multi-way importance plot to select top variables. The first is easily done by plotting selected importance measures pairwise against each other using `plot_importance_ggpairs` as below. One could of course include all seven measures in the plot but by default $p$-value and the number of trees are excluded as both carry similar information as the number of nodes. ```{r} # plot_importance_ggpairs(forest) # gives the same result as below but takes longer plot_importance_ggpairs(importance_frame) ``` We can see that all depicted measures are highly correlated (of course the correlation of any measure with mean minimal depth is negative as the latter is lowest for best variables), but some less than others. Moreover, regardless of which measures we compare, there always seem to be two points that stand out and these most likely correspond to `lstat` and `rm` (to now for sure we could just examine the `importance_frame`). ## Compare different rankings In addition to scatter plots and correlation coefficients, the ggpairs plot also depicts density estimate for each importance measure -- all of which are in this case very skewed. An attempt to eliminate this feature by plotting rankings instead of raw measures is implemented in the function `plot_importance_rankings` that also includes the fitted LOESS curve in each plot. ```{r} # plot_importance_rankings(forest) # gives the same result as below but takes longer plot_importance_rankings(importance_frame) ``` The above density estimates show that skewness was eliminated for all of our importance measures (this is not always the case, e.g., when ties in rankings are frequent, and this is likely for discrete importance measures such as `times_a_root`, then the distribution of the ranking will also be skewed). When comparing the rankings in the above plot we can see that two pairs of measures almost exactly agree in their rankings of variables: `mean_min_depth` vs. `mse_increase` and `mse_increase` vs. `node_purity_increase`. In applications where there are many variables, the LOESS curve may be the main takeaway from this plot (if points fill in the whole plotting area and this is likely if the distributions of measures are close to uniform). # Variable interactions ## Conditional minimal depth After selecting a set of most important variables we can investigate interactions with respect to them, i.e. splits appearing in maximal subtrees with respect to one of the variables selected. To extract the names of 5 most important variables according to both the mean minimal depth and number of trees in which a variable appeared, we pass our `importance_frame` to the function `important_variables` as follows: ```{r} # (vars <- important_variables(forest, k = 5, measures = c("mean_min_depth", "no_of_trees"))) # gives the same result as below but takes longer (vars <- important_variables(importance_frame, k = 5, measures = c("mean_min_depth", "no_of_trees"))) ``` We pass the result together with or forest to the `min_depth_interactions` function to obtain a data frame containing information on mean conditional minimal depth of variables with respect to each element of `vars` (missing values are filled analogously as for unconditional minimal depth, in one of three ways specified by `mean_sample`). If we would not specify the `vars` argument then the vector of conditioning variables would be by default obtained using `important_variables(measure_importance(forest))`. ```{r} # interactions_frame <- min_depth_interactions(forest, vars) # save(interactions_frame, file = "interactions_frame.rda") load("interactions_frame.rda") head(interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE), ]) ``` Then, we pass our `interactions_frame` to the plotting function `plot_min_depth_interactions` and obtain the following: ```{r} # plot_min_depth_interactions(forest) # calculates the interactions_frame for default settings so may give different results than the function below depending on our settings and takes more time plot_min_depth_interactions(interactions_frame) ``` Note that the interactions are ordered by decreasing number of occurrences -- the most frequent one, `lstat:rm`, is also the one with minimal mean conditional minimal depth. Remarkably, the unconditional mean minimal depth of `rm` in the forest is almost equal to its mean minimal depth across maximal subtrees with `lstat` as the root variable. Generally, the plot contains much information and can be interpreted in many ways but always bear in mind the method used for calculating the conditional (`mean_sample` parameter) and unconditional (`uncond_mean_sample` parameter) mean minimal depth. Using the default `"top_trees"` penalizes interactions that occur less frequently than the most frequent one. Of course, one can switch between `"all_trees"`, `"top_trees"` and `"relevant_trees"` for calculating the mean of both the conditional and unconditional minimal depth but each of them has its drawbacks and we favour using `"top_trees"` (the default). However, as `plot_min_depth_interactions` plots interactions by decreasing frequency the major drawback of calculating the mean only for relevant variables vanishes as interactions appearing for example only once but with conditional depth 0 will not be included in the plot anyway. Thus, we repeat the computation of means using `"relevant_trees"` and get the following result: ```{r} # interactions_frame <- min_depth_interactions(forest, vars, mean_sample = "relevant_trees", uncond_mean_sample = "relevant_trees") # save(interactions_frame, file = "interactions_frame_relevant.rda") load("interactions_frame_relevant.rda") plot_min_depth_interactions(interactions_frame) ``` Comparing this plot with the previous one we see that removing penalization of missing values lowers the mean conditional minimal depth of all interactions except the most frequent one. Now, in addition to the frequent ones, some of the less frequent like `rm:tax` stand out. ## Prediction of the forest on a grid To further investigate the most frequent interaction `lstat:rm` we use the function `plot_predict_interaction` to plot the prediction of our forest on a grid of values for the components of each interaction. The function requires the forest, training data, variable to use on $x$ and $y$-axis, respectively. In addition, one can also decrease the number of points in both dimensions of the grid from the default of 100 in case of insufficient memory using the parameter `grid`. ```{r} plot_predict_interaction(forest, Boston, "rm", "lstat") ``` In the above plot we can clearly see the effect of interaction: the predicted median price is highest when `lstat` is low and `rm` is high and low when the reverse is true. To further investigate the effect of interactions we could plot other frequent ones on a grid. # Explain the forest The `explain_forest()` function is the flagship function of the `randomForestExplainer` package, as it takes your random forest and produces a html report that summarizes all basic results obtained for the forest with the new package. Below, we show how to run this function with default settings (we only supply the forest, training data, set `interactions = TRUE` contrary to the default to show full functionality and decrease the grid for prediction plots in our most computationally-intense examples) for our data set. ```{r, eval = FALSE} explain_forest(forest, interactions = TRUE, data = Boston) ``` To see the resulting HTML document click here: [Boston forest summary](https://htmlpreview.github.io/?https://github.com/geneticsMiNIng/BlackBoxOpener/master/examples/Boston_forest_explained.html) For additional examples see: [initial vignette](https://htmlpreview.github.io/?https://github.com/geneticsMiNIng/BlackBoxOpener/blob/master/randomForestExplainer/inst/doc/randomForestExplainer.html).