Title: | Explaining and Visualizing Random Forests in Terms of Variable Importance |
---|---|
Description: | A set of tools to help explain which variables are most important in a random forest. Various variable importance measures are calculated and visualized in different settings to show how their importance changes depending on the chosen criteria (Hemant Ishwaran and Udaya B. Kogalur and Eiran Z. Gorodeski and Andy J. Minn and Michael S. Lauer (2010) <doi:10.1198/jasa.2009.tm08622>, Leo Breiman (2001) <doi:10.1023/A:1010933404324>). |
Authors: | Aleksandra Paluszynska [aut], Przemyslaw Biecek [aut, ths], Michael Mayer [aut], Olivier Roy [aut], Yue Jiang [aut, cre] |
Maintainer: | Yue Jiang <[email protected]> |
License: | GPL |
Version: | 0.11.0 |
Built: | 2024-11-17 04:41:15 UTC |
Source: | https://github.com/modeloriented/randomforestexplainer |
Explains a random forest in an HTML document using plots created by randomForestExplainer
explain_forest( forest, path = NULL, interactions = FALSE, data = NULL, vars = NULL, no_of_pred_plots = 3, pred_grid = 100, measures = NULL )
forest |
A randomForest object created with the option localImp = TRUE |
path |
Path to which the output HTML file is written |
interactions |
Logical value: should variable interactions be considered (this may be time-consuming) |
data |
The data frame on which forest was trained - necessary if interactions = TRUE |
vars |
A character vector of variables with respect to which interactions will be considered; if NULL, they are selected using the important_variables() function |
no_of_pred_plots |
The number of most frequent interactions of numeric variables to plot predictions for |
pred_grid |
The number of points on the grid of plot_predict_interaction (decrease in case of memory problems) |
measures |
A character vector specifying the importance measures to be used for plotting ggpairs |
An HTML document. If path is not specified, the document is written to "Your_forest_explained.html" in your working directory
## Not run:
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE)
explain_forest(forest, interactions = TRUE)
## End(Not run)
Get the names of k variables with highest sum of rankings based on the specified importance measures
important_variables( importance_frame, k = 15, measures = names(importance_frame)[2:min(5, ncol(importance_frame))], ties_action = "all" )
importance_frame |
The result of applying measure_importance() to a random forest, or a randomForest object |
k |
The number of variables to extract |
measures |
A character vector specifying the measures of importance to be used |
ties_action |
One of three: c("none", "all", "draw"); specifies which variables to pick when ties occur. With "none" we may get fewer than k variables, with "all" we may get more, and with "draw" we get exactly k. |
A character vector with names of k variables with highest sum of rankings
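The rank-sum selection and the three ties_action options described above can be illustrated with a small base R sketch. It is not the package's implementation: the helper top_k_by_rank_sum() and the toy data are hypothetical, only two measures are used, rank 1 is assumed to be best (so the smallest rank sum wins), and "draw" is resolved by row order here rather than at random so that the result is reproducible.

```r
# Toy importance frame; the numbers are invented for illustration.
toy <- data.frame(
  variable       = c("a", "b", "c", "d"),
  mean_min_depth = c(1.0, 2.0, 2.0, 3.5),  # lower means more important
  times_a_root   = c(40, 25, 25, 10),      # higher means more important
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of randomForestExplainer.
top_k_by_rank_sum <- function(frame, k, ties_action = c("all", "none", "draw")) {
  ties_action <- match.arg(ties_action)
  # Rank every measure so that 1 is best; tied values share the smallest rank.
  score <- rank(frame$mean_min_depth, ties.method = "min") +
    rank(-frame$times_a_root, ties.method = "min")
  ord <- order(score)
  cutoff <- score[ord][k]        # rank sum of the k-th best variable
  within <- score <= cutoff      # every variable at least as good as that
  if (sum(within) == k || ties_action == "all") {
    return(frame$variable[within])
  }
  if (ties_action == "none") {
    return(frame$variable[score < cutoff])  # drop the whole tied group
  }
  frame$variable[ord][seq_len(k)]  # "draw": first k in row order (assumption)
}

top_k_by_rank_sum(toy, k = 2, ties_action = "all")   # "a" "b" "c"
top_k_by_rank_sum(toy, k = 2, ties_action = "none")  # "a"
top_k_by_rank_sum(toy, k = 2, ties_action = "draw")  # "a" "b"
```

Variables b and c tie for second place, which is why "all" returns three names, "none" returns one, and "draw" returns exactly two.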
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE, ntree = 300)
important_variables(measure_importance(forest), k = 2)
Get a data frame with various measures of importance of variables in a random forest
measure_importance(forest, mean_sample = "top_trees", measures = NULL)
forest |
A random forest produced by the function randomForest with option localImp = TRUE |
mean_sample |
The sample of trees on which mean minimal depth is calculated, possible values are "all_trees", "top_trees", "relevant_trees" |
measures |
A vector of names of importance measures to be calculated - if equal to NULL then all are calculated;
if "p_value" is to be calculated then "no_of_nodes" will be too. Suitable measures for |
A data frame with rows corresponding to variables and columns to various measures of importance of variables
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE, ntree = 300)
measure_importance(forest)
Get minimal depth values for all trees in a random forest
min_depth_distribution(forest)
forest |
A randomForest or ranger object |
A data frame with the value of minimal depth for every variable in every tree
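As a rough illustration of what a minimal depth value is, the sketch below computes it for a single hand-written tree in base R. The minimal depth of a variable in a tree is the depth of the node closest to the root that splits on that variable. The table layout (columns left, right, split_var), the helper names, and the convention that the root has depth 0 are assumptions made for this example; the package works on real forest objects and may organize the computation differently.

```r
# Toy tree: row i describes node i; 0 in a daughter column marks a leaf.
# The layout is loosely modelled on a tree table and is purely illustrative.
tree <- data.frame(
  left      = c(2, 4, 6, 0, 0, 0, 0),
  right     = c(3, 5, 7, 0, 0, 0, 0),
  split_var = c("Petal.Length", "Petal.Width", "Petal.Length", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: depth of every node, with the root at depth 0.
node_depths <- function(tree) {
  depth <- rep(NA_integer_, nrow(tree))
  depth[1] <- 0L
  for (i in seq_len(nrow(tree))) {   # parents precede daughters in this layout
    kids <- c(tree$left[i], tree$right[i])
    kids <- kids[kids > 0]
    depth[kids] <- depth[i] + 1L
  }
  depth
}

# Hypothetical helper: smallest depth at which each variable is used to split.
min_depth <- function(tree) {
  d <- node_depths(tree)
  is_split <- !is.na(tree$split_var)
  sapply(split(d[is_split], tree$split_var[is_split]), min)
}

md <- min_depth(tree)
md
```

Here Petal.Length splits at the root and again one level down, so its minimal depth is 0, while Petal.Width first appears at depth 1.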
min_depth_distribution(randomForest::randomForest(Species ~ ., data = iris, ntree = 100))
min_depth_distribution(ranger::ranger(Species ~ ., data = iris, num.trees = 100))
Calculate mean conditional minimal depth with respect to a vector of variables
min_depth_interactions( forest, vars = important_variables(measure_importance(forest)), mean_sample = "top_trees", uncond_mean_sample = mean_sample )
forest |
A randomForest object |
vars |
A character vector of variables with respect to which conditional minimal depth will be calculated; by default it is extracted by the important_variables() function, but this may be time-consuming |
mean_sample |
The sample of trees on which conditional mean minimal depth is calculated, possible values are "all_trees", "top_trees", "relevant_trees" |
uncond_mean_sample |
The sample of trees on which unconditional mean minimal depth is calculated, possible values are "all_trees", "top_trees", "relevant_trees" |
A data frame with each observation giving the means of conditional minimal depth and the size of sample for a given interaction
forest <- randomForest::randomForest(Species ~ ., data = iris, ntree = 100)
min_depth_interactions(forest, c("Petal.Width", "Petal.Length"))
Plot selected measures of importance of variables in a forest using ggpairs
plot_importance_ggpairs( importance_frame, measures = NULL, main = "Relations between measures of importance" )
importance_frame |
The result of applying measure_importance() to a random forest, or a randomForest object |
measures |
A character vector specifying the measures of importance to be used |
main |
A string to be used as title of the plot |
A ggplot object
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE, ntree = 200)
frame <- measure_importance(forest, measures = c("mean_min_depth", "times_a_root"))
plot_importance_ggpairs(frame, measures = c("mean_min_depth", "times_a_root"))
Plot against each other rankings of variables according to various measures of importance
plot_importance_rankings( importance_frame, measures = NULL, main = "Relations between rankings according to different measures" )
importance_frame |
The result of applying measure_importance() to a random forest, or a randomForest object |
measures |
A character vector specifying the measures of importance to be used. |
main |
A string to be used as title of the plot |
A ggplot object
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE, ntree = 300)
frame <- measure_importance(forest, measures = c("mean_min_depth", "times_a_root"))
# Call the function documented in this section (the original example called plot_importance_ggpairs)
plot_importance_rankings(frame, measures = c("mean_min_depth", "times_a_root"))
Plot the distribution of minimal depth in a random forest
plot_min_depth_distribution( min_depth_frame, k = 10, min_no_of_trees = 0, mean_sample = "top_trees", mean_scale = FALSE, mean_round = 2, main = "Distribution of minimal depth and its mean" )
min_depth_frame |
A data frame output of min_depth_distribution function or a randomForest object |
k |
The maximal number of variables with lowest mean minimal depth to be used for plotting |
min_no_of_trees |
The minimal number of trees in which a variable has to be used for splitting to be used for plotting |
mean_sample |
The sample of trees on which mean minimal depth is calculated, possible values are "all_trees", "top_trees", "relevant_trees" |
mean_scale |
Logical: should the values of mean minimal depth be rescaled to the interval [0,1]? |
mean_round |
The number of digits used for displaying mean minimal depth |
main |
A string to be used as title of the plot |
A ggplot object
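The effect of mean_scale and mean_round on the displayed means can be pictured with a short base R sketch. It assumes that "rescaled to the interval [0,1]" means ordinary min-max scaling, which the text above does not spell out; the helper rescale01() and the mean depths are invented for the example.

```r
# Invented mean minimal depths for three variables.
mean_depths <- c(Petal.Length = 1.2, Petal.Width = 0.8, Sepal.Length = 2.4)

# Hypothetical helper: min-max scaling, so the smallest mean maps to 0
# and the largest to 1 (assumed reading of mean_scale = TRUE).
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- rescale01(mean_depths)
round(scaled, 2)  # analogous to mean_round = 2
```

After rescaling, the variable with the lowest mean minimal depth sits at 0 and the one with the highest at 1, which makes plots from forests of different depths easier to compare.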
forest <- randomForest::randomForest(Species ~ ., data = iris, ntree = 300)
plot_min_depth_distribution(min_depth_distribution(forest))
Plot the top mean conditional minimal depth
plot_min_depth_interactions( interactions_frame, k = 30, main = paste0("Mean minimal depth for ", paste0(k, " most frequent interactions")) )
interactions_frame |
A data frame produced by the min_depth_interactions() function or a randomForest object |
k |
The number of best interactions to plot; if set to NULL, all are plotted |
main |
A string to be used as title of the plot |
A ggplot2 object
forest <- randomForest::randomForest(Species ~ ., data = iris, ntree = 100)
plot_min_depth_interactions(min_depth_interactions(forest, c("Petal.Width", "Petal.Length")))
Plot two or three measures of importance of variables in a random forest. Choose importance measures from colnames(importance_frame).
plot_multi_way_importance( importance_frame, x_measure = "mean_min_depth", y_measure = "times_a_root", size_measure = NULL, min_no_of_trees = 0, no_of_labels = 10, main = "Multi-way importance plot" )
importance_frame |
The result of applying measure_importance() to a random forest, or a randomForest object |
x_measure |
The measure of importance to be shown on the X axis |
y_measure |
The measure of importance to be shown on the Y axis |
size_measure |
The measure of importance to be shown as size of points (optional) |
min_no_of_trees |
The minimal number of trees in which a variable has to be used for splitting to be used for plotting |
no_of_labels |
The approximate number of best variables (according to all measures plotted) to be labeled (more will be labeled in case of ties) |
main |
A string to be used as title of the plot |
A ggplot object
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE)
plot_multi_way_importance(measure_importance(forest))
Plot the prediction of the forest for a grid of values of two numerical variables
plot_predict_interaction( forest, data, variable1, variable2, grid = 100, main = paste0("Prediction of the forest for different values of ", paste0(variable1, paste0(" and ", variable2))), time = NULL )
forest |
A randomForest or ranger object |
data |
The data frame on which forest was trained |
variable1 |
A character string with the name of a numerical predictor that will be on the X-axis |
variable2 |
A character string with the name of a numerical predictor that will be on the Y-axis |
grid |
The number of grid points along each of the x- and y-axes |
main |
A string to be used as title of the plot |
time |
A numeric value specifying the time at which to predict survival probability, only applies to survival forests. If not specified, the time closest to predicted median survival time is used |
A ggplot2 object
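To see why the grid argument drives memory use, the base R sketch below builds the kind of two-dimensional grid over which predictions would be evaluated. It is only an illustration under assumptions: the use of expand.grid() over the observed range of each variable is one plausible construction, not necessarily what the package does internally, and the object name newdata is hypothetical.

```r
grid <- 100
variable1 <- "Petal.Width"
variable2 <- "Sepal.Width"

# One-dimensional grids spanning the observed range of each predictor,
# crossed to give every (variable1, variable2) combination.
newdata <- expand.grid(
  seq(min(iris[[variable1]]), max(iris[[variable1]]), length.out = grid),
  seq(min(iris[[variable2]]), max(iris[[variable2]]), length.out = grid)
)
colnames(newdata) <- c(variable1, variable2)

nrow(newdata)  # grid^2 = 10000 rows to predict for
```

Because the number of rows grows with the square of grid, halving grid cuts the number of predictions to a quarter, which is the reason for lowering it when memory is tight.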
forest <- randomForest::randomForest(Species ~ ., data = iris)
plot_predict_interaction(forest, iris, "Petal.Width", "Sepal.Width")
forest_ranger <- ranger::ranger(Species ~ ., data = iris)
# Pass the ranger forest here (the original example reused the randomForest object)
plot_predict_interaction(forest_ranger, iris, "Petal.Width", "Sepal.Width")