Title: | moDel Agnostic Language for Exploration and eXplanation |
---|---|
Description: | Any unverified black box model is the path to failure. Opaqueness leads to distrust. Distrust leads to ignoration. Ignoration leads to rejection. DALEX package xrays any model and helps to explore and explain its behaviour. Machine Learning (ML) models are widely used and have various applications in classification or regression. Models created with boosting, bagging, stacking or similar techniques are often used due to their high performance. But such black-box models usually lack direct interpretability. DALEX package contains various methods that help to understand the link between input variables and model output. Implemented methods help to explore the model on the level of a single instance as well as a level of the whole dataset. All model explainers are model agnostic and can be compared across different models. DALEX package is the cornerstone for 'DrWhy.AI' universe of packages for visual model exploration. Find more details in (Biecek 2018) <https://jmlr.org/papers/v19/18-416.html>. |
Authors: | Przemyslaw Biecek [aut, cre] , Szymon Maksymiuk [aut] , Hubert Baniecki [aut] |
Maintainer: | Przemyslaw Biecek <[email protected]> |
License: | GPL |
Version: | 2.5.1 |
Built: | 2024-11-01 06:11:04 UTC |
Source: | https://github.com/modeloriented/dalex |
Datasets apartments and apartments_test are artificial, generated from the same model. The structure of the dataset is copied from a real dataset from the PBImisc package, but they were generated in a way to mimic the effect of Anscombe's quartet for complex black box models.
data(apartments)
a data frame with 1000 rows and 6 columns
m2.price - price per square meter
surface - apartment area in square meters
no.rooms - number of rooms (correlated with surface)
district - district in which apartment is located, factor with 10 levels
floor - floor
construction.year - construction year
DrWhy color palettes for ggplot objects
colors_discrete_drwhy(n = 2)
colors_diverging_drwhy()
colors_breakdown_drwhy()
n |
number of colors for color palette |
color palette as a character vector
Two datasets of characteristics of patients infected with COVID. It is important to note that these are not real patient data. This is simulated data, generated to have relationships consistent with real data (obtained from NIH), but the data itself is not real. Fortunately, they are sufficient for the purposes of our exercise.
data(covid_summer)
data(covid_spring)
two data frames with 10 000 rows each and 12 columns
The data is divided into two sets covid_spring and covid_summer. The first is acquired in spring 2020 and will be used as training data while the second dataset is acquired in summer and will be used for validation. In machine learning, model validation is performed on a separate data set. This controls the risk of overfitting an elastic model to the data. If we do not have a separate set then it is generated using cross-validation, out of sample or out of time techniques.
It contains 20 000 rows related to COVID mortality. It contains 11 variables such as: Gender, Age, Cardiovascular.Diseases, Diabetes, Neurological.Diseases, Kidney.Diseases.
Source: https://github.com/BetaAndBit/RML
Datasets dragons and dragons_test are artificial, generated from the same ground truth model, but sometimes with a different data distribution.
data(dragons)
a data frame with 2000 rows and 8 columns
Values are generated in a way to:
- have nonlinearity in year_of_birth and height
- have concept drift in the test set
year_of_birth - year in which the dragon was born. A negative year means a year BC, e.g. -1200 = 1201 BC
year_of_discovery - year in which the dragon was found.
height - height of the dragon in yards.
weight - weight of the dragon in tons.
scars - number of scars.
colour - colour of the dragon.
number_of_lost_teeth - number of teeth that the dragon lost.
life_length - life length of the dragon.
Black-box models may have very different structures. This function creates a unified representation of a model, which can be further processed by functions for explanations.
explain.default(
  model,
  data = NULL,
  y = NULL,
  predict_function = NULL,
  predict_function_target_column = NULL,
  residual_function = NULL,
  weights = NULL,
  ...,
  label = NULL,
  verbose = TRUE,
  precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")),
  model_info = NULL,
  type = NULL
)
explain(
  model,
  data = NULL,
  y = NULL,
  predict_function = NULL,
  predict_function_target_column = NULL,
  residual_function = NULL,
  weights = NULL,
  ...,
  label = NULL,
  verbose = TRUE,
  precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")),
  model_info = NULL,
  type = NULL
)
model |
object - a model to be explained |
data |
data.frame or matrix - data which will be used to calculate the explanations. If not provided, then it will be extracted from the model. Data should be passed without a target column (this shall be provided as the |
y |
numeric vector with outputs/scores. If provided, then it shall have the same size as |
predict_function |
function that takes two arguments: model and new data and returns a numeric vector with predictions. By default it is |
predict_function_target_column |
Character or numeric containing either column name or column number in the model prediction object of the class that should be considered as positive (i.e. the class that is associated with probability 1). If NULL, the second column of the output will be taken for binary classification. For a multiclass classification setting, that parameter cause switch to binary classification mode with one vs others probabilities. |
residual_function |
function that takes four arguments: model, data, target vector y and predict function (optionally). It should return a numeric vector with model residuals for given data. If not provided, response residuals ( |
weights |
numeric vector with sampling weights. By default it's |
... |
other parameters |
label |
character - the name of the model. By default it's extracted from the 'class' attribute of the model |
verbose |
logical. If TRUE (default) then diagnostic messages will be printed |
precalculate |
logical. If TRUE (default) then |
colorize |
logical. If TRUE (default) then |
model_info |
a named list ( |
type |
type of a model, either |
Please NOTE that the model is the only required argument. But some explanations may expect that other arguments will be provided too.
An object of the class explainer. It's a list with the following fields:
model - the explained model.
data - the dataset used for training.
y - response for observations from data.
weights - sample weights for data. NULL if weights are not specified.
y_hat - calculated predictions.
residuals - calculated residuals.
predict_function - function that may be used for model predictions, shall return a single numerical value for each observation.
residual_function - function that returns residuals, shall return a single numerical value for each observation.
class - class/classes of a model.
label - label of explainer.
model_info - named list containing basic information about the model, like package, version of package and type.
Explanatory Model Analysis. Explore, Explain and Examine Predictive Models. https://ema.drwhy.ai/
# simple explainer for regression problem
aps_lm_model4 <- lm(m2.price ~., data = apartments)
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v")
aps_lm_explainer4

# various parameters for the explain function
# all defaults
aps_lm <- explain(aps_lm_model4)

# silent execution
aps_lm <- explain(aps_lm_model4, verbose = FALSE)

# set target variable
aps_lm <- explain(aps_lm_model4, data = apartments, label = "model_4v",
                  y = apartments$m2.price)
aps_lm <- explain(aps_lm_model4, data = apartments, label = "model_4v",
                  y = apartments$m2.price, predict_function = predict)

# user provided predict_function
aps_ranger <- ranger::ranger(m2.price~., data = apartments, num.trees = 50)
custom_predict <- function(X.model, newdata) {
  predict(X.model, newdata)$predictions
}
aps_ranger_exp <- explain(aps_ranger, data = apartments, y = apartments$m2.price,
                          predict_function = custom_predict)

# user provided residual_function
aps_ranger <- ranger::ranger(m2.price~., data = apartments, num.trees = 50)
custom_residual <- function(X.model, newdata, y, predict_function) {
  abs(y - predict_function(X.model, newdata))
}
aps_ranger_exp <- explain(aps_ranger, data = apartments, y = apartments$m2.price,
                          residual_function = custom_residual)

# binary classification
titanic_ranger <- ranger::ranger(as.factor(survived)~., data = titanic_imputed,
                                 num.trees = 50, probability = TRUE)
# keep in mind that for binary classification y parameter has to be numeric with 0 and 1 values
titanic_ranger_exp <- explain(titanic_ranger, data = titanic_imputed,
                              y = titanic_imputed$survived)

# multiclass task
hr_ranger <- ranger::ranger(status~., data = HR, num.trees = 50, probability = TRUE)
# keep in mind that for multiclass y parameter has to be a factor,
# with same levels as in training data
hr_ranger_exp <- explain(hr_ranger, data = HR, y = HR$status)

# set model_info
model_info <- list(package = "stats", ver = "3.6.2", type = "regression")
aps_lm_model4 <- lm(m2.price ~., data = apartments)
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v",
                             model_info = model_info)

# simple function
aps_fun <- function(x) 58*x$surface
aps_fun_explainer <- explain(aps_fun, data = apartments, y = apartments$m2.price,
                             label = "sfun")
model_performance(aps_fun_explainer)

# set sampling weights
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v",
                             weights = as.numeric(apartments$construction.year > 2000))

# more complex model
library("ranger")
aps_ranger_model4 <- ranger(m2.price ~., data = apartments, num.trees = 50)
aps_ranger_explainer4 <- explain(aps_ranger_model4, data = apartments,
                                 label = "model_ranger")
aps_ranger_explainer4
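To make the idea of a "unified representation" more concrete, the sketch below builds a small explainer-like list in base R. It is only an illustration of the fields listed in the return value above, not the code behind explain(): the helper name make_explainer, the class name sketch_explainer and the use of the built-in mtcars data are all assumptions made for this example.

```r
# Hypothetical helper (NOT DALEX::explain): bundles a model with its data,
# target, prediction function and precalculated predictions/residuals.
make_explainer <- function(model, data, y,
                           predict_function = function(m, newdata) {
                             as.numeric(predict(m, newdata))
                           },
                           residual_function = NULL,
                           label = class(model)[1]) {
  if (is.null(residual_function)) {
    # assumed default: response residuals, observed minus predicted
    residual_function <- function(m, newdata, y, predict_function) {
      y - predict_function(m, newdata)
    }
  }
  y_hat <- predict_function(model, data)
  structure(
    list(model = model, data = data, y = y,
         predict_function = predict_function,
         residual_function = residual_function,
         y_hat = y_hat,
         residuals = residual_function(model, data, y, predict_function),
         class = class(model), label = label),
    class = "sketch_explainer"
  )
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
ex <- make_explainer(fit, data = mtcars[, c("wt", "hp")], y = mtcars$mpg)
ex$label
head(ex$residuals)
```

Because every downstream function only needs predict_function(model, data) and y, any model that can be wrapped this way can be processed by the same explanation code, which is the point of the model-agnostic design.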
The fifa dataset is a preprocessed players_20.csv dataset which comes as a part of "FIFA 20 complete player dataset" at Kaggle.
data(fifa)
a data frame with 5000 rows, 42 columns and rownames
It contains 5000 'overall' best players and 43 variables. These are:
short_name (rownames)
nationality of the player (not used in modeling)
overall, potential, value_eur, wage_eur (4 potential target variables)
age, height, weight, attacking skills, defending skills, goalkeeping skills (37 variables)
It is advised to leave only one target variable for modeling.
Source: https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset
All transformations:
take 43 columns: [3, 5, 7:9, 11:14, 45:78] (R indexing)
take rows with value_eur > 0
convert short_name to ASCII
remove rows with duplicated short_name (keep first)
sort rows on overall and take top 5000
set short_name column as rownames
transform nationality to factor
reorder columns
The players_20.csv dataset was downloaded from the Kaggle site and went through a few transformations. The complete dataset was obtained from https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset#players_20.csv on January 1, 2020.
The yardstick package provides many auxiliary functions for calculating the predictive performance of the model. However, they have an interface that is consistent with the tidyverse philosophy. The loss_yardstick function adapts loss functions from the yardstick package to functions understood by DALEX. Type compatibility for y-values and for predictions must be guaranteed by the user.
get_loss_yardstick(loss, reverse = FALSE, reference = 1)
loss_yardstick(loss, reverse = FALSE, reference = 1)
loss |
loss function from the |
reverse |
shall the metric be reversed? for loss metrics lower values are better. |
reference |
if the metric is reverse then it is calculated as |
loss function that can be used in the model_parts function
titanic_glm_model <- glm(survived~., data = titanic_imputed, family = "binomial")
explainer_glm <- DALEX::explain(titanic_glm_model,
                                data = titanic_imputed[,-8],
                                y = factor(titanic_imputed$survived))
# See the 'How to use DALEX with the yardstick package' vignette
# which explains this model with measures implemented in the 'yardstick' package
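The adaptation described above can be pictured with a small base R sketch. It does not use yardstick or DALEX: tidy_accuracy is a hypothetical stand-in for a tidyverse-style metric that takes a data frame plus column names and returns a data frame with an .estimate column, and adapt_metric is a hypothetical adapter. The sketch also assumes that reversing a metric means subtracting it from reference; the argument description here is truncated, so treat that as an assumption rather than the documented behaviour.

```r
# Hypothetical tidy-style metric (NOT from yardstick): data frame in, data frame out.
tidy_accuracy <- function(data, truth, estimate) {
  truth_values    <- eval(substitute(truth), data)
  estimate_values <- eval(substitute(estimate), data)
  data.frame(.metric = "accuracy",
             .estimate = mean(truth_values == estimate_values))
}

# Hypothetical adapter: turns such a metric into a function(observed, predicted).
# ASSUMPTION: a reversed metric is computed as reference - metric value.
adapt_metric <- function(metric, reverse = FALSE, reference = 1) {
  function(observed, predicted) {
    df <- data.frame(observed = observed, predicted = predicted)
    value <- metric(df, observed, predicted)$.estimate
    if (reverse) reference - value else value
  }
}

# Accuracy is "higher is better", so it is reversed to behave like a loss.
loss_from_accuracy <- adapt_metric(tidy_accuracy, reverse = TRUE)
loss_value <- loss_from_accuracy(observed = c(1, 0, 1, 1), predicted = c(1, 0, 0, 1))
loss_value
```

The result has the two-argument (observed, predicted) shape that the loss functions in this manual use, which is why a reversed score can stand in for a loss.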
The happiness_train and happiness_test datasets are generated based on the "World Happiness Report" at Kaggle https://www.kaggle.com/datasets/unsdsn/world-happiness.
data(happiness_train)
data(happiness_test)
two data frames with a total of 781 rows, 7 columns each, and rownames
It contains data for 781 countries and 7 variables. These are:
score - Happiness score
gdp_per_capita - GDP per capita
social_support - Social support
healthy_life_expectancy - Healthy life expectancy
freedom_life_choices - Freedom to make life choices
generosity - Generosity
perceptions_of_corruption - Perceptions of corruption
World Happiness Report data https://worldhappiness.report/
Datasets HR and HR_test are artificial, generated from the same model. The structure of the dataset is based on real data from a Human Resources department, with information on which employees were promoted and which were fired.
data(HR)
a data frame with 10000 rows and 6 columns
Values are generated in a way to:
- have interaction between age and gender for the 'fired' variable
- have non monotonic relation for the salary variable
- have linear effects for hours and evaluation.
gender - gender of an employee.
age - age of an employee at the moment of evaluation.
hours - average number of working hours per week.
evaluation - evaluation in the scale 2 (bad) - 5 (very good).
salary - level of salary in the scale 0 (lowest) - 5 (highest).
status - target variable, either 'fired' or 'promoted' or 'ok'.
By default 'heavy' dependencies are not installed along with DALEX. This function silently installs all required packages.
install_dependencies(packages = c("ingredients", "iBreakDown", "ggpubr"))
packages |
which packages shall be installed? |
Calculate Loss Functions
loss_cross_entropy(observed, predicted, p_min = 1e-04, na.rm = TRUE)
loss_sum_of_squares(observed, predicted, na.rm = TRUE)
loss_root_mean_square(observed, predicted, na.rm = TRUE)
loss_accuracy(observed, predicted, na.rm = TRUE)
loss_one_minus_accuracy(observed, predicted, cutoff = 0.5, na.rm = TRUE)
get_loss_one_minus_accuracy(cutoff = 0.5, na.rm = TRUE)
loss_one_minus_auc(observed, predicted)
get_loss_default(x)
loss_default(x)
observed |
observed scores or labels, these are supplied as explainer specific |
predicted |
predicted scores, either vector or matrix, these are returned from the model specific |
p_min |
for cross entropy, minimal value for probability to make sure that |
na.rm |
logical, should missing values be removed? |
cutoff |
classification threshold for the accuracy loss functions |
x |
either an explainer or type of the model. One of "regression", "classification", "multiclass". |
numeric - value of the loss function
library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 50,
                               probability = TRUE)
loss_one_minus_auc(titanic_imputed$survived, yhat(titanic_ranger_model, titanic_imputed))

HR_ranger_model_multi <- ranger(status~., data = HR, num.trees = 50, probability = TRUE)
loss_cross_entropy(as.numeric(HR$status), yhat(HR_ranger_model_multi, HR))
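For readers who want to see the arithmetic behind these names, the following base R sketch computes textbook versions of the sum of squares, root mean square, one-minus-accuracy and cross-entropy losses on tiny hand-made vectors. It does not call the package functions, and details such as whether cross entropy is summed or averaged, or whether the cutoff comparison is inclusive, are assumptions of this sketch.

```r
# Regression-style losses on toy data
observed  <- c(3.0, 5.0, 2.5, 7.0)
predicted <- c(2.5, 5.0, 4.0, 8.0)
sum_of_squares   <- sum((observed - predicted)^2)
root_mean_square <- sqrt(mean((observed - predicted)^2))

# Binary classification: scores are thresholded at the cutoff (assumed >=)
labels <- c(1, 0, 1, 1)
scores <- c(0.9, 0.4, 0.3, 0.8)
cutoff <- 0.5
one_minus_accuracy <- 1 - mean((scores >= cutoff) == (labels == 1))

# Multiclass cross entropy: one row of class probabilities per observation,
# true classes given as column indices; p_min floors the probability so the
# logarithm stays finite. This sketch sums over observations (an assumption).
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)
true_class <- c(1, 2, 3)
p_min <- 1e-4
p_true <- probs[cbind(seq_along(true_class), true_class)]
cross_entropy <- sum(-log(pmax(p_true, p_min)))

c(sum_of_squares = sum_of_squares, root_mean_square = root_mean_square,
  one_minus_accuracy = one_minus_accuracy, cross_entropy = cross_entropy)
```

All four follow the same (observed, predicted) convention, so any of them can be swapped in wherever a loss function is expected.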
This function performs model diagnostics based on residuals. Residuals are calculated and plotted against predictions, true y values or selected variables. Find information on how to use this function here: https://ema.drwhy.ai/residualDiagnostic.html.
model_diagnostics(explainer, variables = NULL, ...)
explainer |
a model to be explained, preprocessed by the |
variables |
character - name of variables to be explained. Default |
... |
other parameters |
An object of the class model_diagnostics. It's a data frame with residuals and selected variables.
Explanatory Model Analysis. Explore, Explain and Examine Predictive Models. https://ema.drwhy.ai/
library(DALEX)
apartments_lm_model <- lm(m2.price ~ ., data = apartments)
explainer_lm <- explain(apartments_lm_model,
                        data = apartments,
                        y = apartments$m2.price)
diag_lm <- model_diagnostics(explainer_lm)
diag_lm
plot(diag_lm)

library("ranger")
apartments_ranger_model <- ranger(m2.price ~ ., data = apartments)
explainer_ranger <- explain(apartments_ranger_model,
                            data = apartments,
                            y = apartments$m2.price)
diag_ranger <- model_diagnostics(explainer_ranger)
diag_ranger
plot(diag_ranger)
plot(diag_ranger, diag_lm)
plot(diag_ranger, diag_lm, variable = "y")
plot(diag_ranger, diag_lm, variable = "construction.year")
plot(diag_ranger, variable = "y", yvariable = "y_hat")
plot(diag_ranger, variable = "y", yvariable = "abs_residuals")
plot(diag_ranger, variable = "ids")
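The kind of table that a residual diagnostic works from can be assembled by hand. The base R sketch below is an illustration only and does not reproduce model_diagnostics(): the column names y, y_hat, abs_residuals and ids are borrowed from the plotting calls above, the residuals column name is an assumption, and the built-in mtcars data stands in for a real modelling dataset.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
y_hat <- as.numeric(predict(fit, mtcars))

# One row per observation: selected variables plus prediction-related columns
diag_df <- data.frame(
  mtcars[, c("wt", "hp")],
  y = mtcars$mpg,
  y_hat = y_hat,
  residuals = mtcars$mpg - y_hat,          # assumed: observed minus predicted
  abs_residuals = abs(mtcars$mpg - y_hat),
  ids = seq_len(nrow(mtcars))
)

head(diag_df)
# Observations with the largest absolute residuals deserve a closer look
diag_df[order(-diag_df$abs_residuals)[1:3], c("ids", "y", "y_hat", "abs_residuals")]
```

Plotting any pair of these columns, for example plot(diag_df$y_hat, diag_df$residuals), gives the usual residuals-versus-fitted view.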
This generic function lets the user extract basic information about a model. The function returns a named list of class model_info that contains information about the package of the model, its version and the task type. For wrappers like mlr or caret, both package and wrapper information are stored.
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'lm'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'randomForest'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'svm'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'glm'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'lrm'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'glmnet'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'cv.glmnet'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'ranger'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'gbm'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'model_fit'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'train'
model_info(model, is_multiclass = FALSE, ...)

## S3 method for class 'rpart'
model_info(model, is_multiclass = FALSE, ...)

## Default S3 method:
model_info(model, is_multiclass = FALSE, ...)
model |
- model object |
is_multiclass |
- if TRUE and task is classification, then multitask classification is set. Else is omitted. If |
... |
- other arguments |
Currently supported packages are:
class cv.glmnet and glmnet - models created with glmnet package
class glm - generalized linear models
class lrm - models created with rms package
class model_fit - models created with parsnip package
class lm - linear models created with stats::lm
class ranger - models created with ranger package
class randomForest - random forest models created with randomForest package
class svm - support vector machines models created with the e1071 package
class train - models created with caret package
class gbm - models created with gbm package
A named list of class model_info
aps_lm_model4 <- lm(m2.price ~., data = apartments)
model_info(aps_lm_model4)

library("ranger")
model_regr_rf <- ranger::ranger(status~., data = HR, num.trees = 50, probability = TRUE)
model_info(model_regr_rf, is_multiclass = TRUE)
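The per-class methods listed above rely on ordinary S3 dispatch. As a rough illustration of that mechanism, here is a base R sketch with a hypothetical generic called sketch_model_info; it is not the package's model_info, and only the field names package, ver and type are taken from the examples in this manual.

```r
# Hypothetical generic (NOT DALEX::model_info): dispatches on the model's class
sketch_model_info <- function(model, ...) UseMethod("sketch_model_info")

# Method for linear models fitted with stats::lm
sketch_model_info.lm <- function(model, ...) {
  structure(
    list(package = "stats",
         ver = as.character(utils::packageVersion("stats")),
         type = "regression"),
    class = "sketch_model_info"
  )
}

# Fallback used when no class-specific method exists
sketch_model_info.default <- function(model, ...) {
  structure(
    list(package = "unrecognized", ver = "unknown", type = NA_character_),
    class = "sketch_model_info"
  )
}

info_lm <- sketch_model_info(lm(mpg ~ wt, data = mtcars))
info_other <- sketch_model_info(42)
info_lm$package
info_other$package
```

Supporting a new model class in this style only requires adding one more method, which is how a list of supported packages like the one above can grow without touching the generic.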
From DALEX version 1.0 this function calls the feature_importance function.
Find information on how to use this function here: https://ema.drwhy.ai/featureImportance.html.
model_parts( explainer, loss_function = get_loss_default(explainer$model_info$type), ..., type = "variable_importance", N = n_sample, n_sample = 1000 )
explainer |
a model to be explained, preprocessed by the |
loss_function |
a function that will be used to assess variable importance. By default it is 1-AUC for classification, cross entropy for multilabel classification and RMSE for regression. Custom, user-made loss function should accept two obligatory parameters (observed, predicted), where |
... |
other parameters |
type |
character, type of transformation that should be applied for dropout loss. |
N |
number of observations that should be sampled for calculation of variable importance. If |
n_sample |
alias for |
An object of the class feature_importance. It's a data frame with calculated average response.
Explanatory Model Analysis. Explore, Explain and Examine Predictive Models. https://ema.drwhy.ai/
# regression
library("ranger")
apartments_ranger_model <- ranger(m2.price~., data = apartments, num.trees = 50)
explainer_ranger <- explain(apartments_ranger_model, data = apartments[,-1],
                            y = apartments$m2.price, label = "Ranger Apartments")
model_parts_ranger_aps <- model_parts(explainer_ranger, type = "raw")
head(model_parts_ranger_aps, 8)
plot(model_parts_ranger_aps)

# binary classification
titanic_glm_model <- glm(survived~., data = titanic_imputed, family = "binomial")
explainer_glm_titanic <- explain(titanic_glm_model, data = titanic_imputed[,-8],
                                 y = titanic_imputed$survived)
logit <- function(x) exp(x)/(1+exp(x))
custom_loss <- function(observed, predicted){
  sum((observed - logit(predicted))^2)
}
attr(custom_loss, "loss_name") <- "Logit residuals"
model_parts_glm_titanic <- model_parts(explainer_glm_titanic, type = "raw",
                                       loss_function = custom_loss)
head(model_parts_glm_titanic, 8)
plot(model_parts_glm_titanic)

# multilabel classification
HR_ranger_model_HR <- ranger(status~., data = HR, num.trees = 50, probability = TRUE)
explainer_ranger_HR <- explain(HR_ranger_model_HR, data = HR[,-6],
                               y = HR$status, label = "Ranger HR")
model_parts_ranger_HR <- model_parts(explainer_ranger_HR, type = "raw")
head(model_parts_ranger_HR, 8)
plot(model_parts_ranger_HR)
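The idea of assessing variable importance through a loss function can be sketched in a few lines of base R using permutations: shuffle one variable, re-predict, and see how much the loss grows. This is a minimal illustration under stated assumptions (a linear model on the built-in mtcars data, RMSE as the loss, 20 permutation rounds, and the name dropout_loss chosen for this sketch), not the implementation that model_parts() delegates to.

```r
set.seed(1)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
X <- mtcars[, c("wt", "hp", "qsec")]
y <- mtcars$mpg

# Loss with the (observed, predicted) signature
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Loss of the model on the untouched data
baseline <- rmse(y, predict(fit, X))

# For each variable: permute its column, recompute the loss, average over rounds
dropout_loss <- sapply(names(X), function(v) {
  mean(replicate(20, {
    X_perm <- X
    X_perm[[v]] <- sample(X_perm[[v]])
    rmse(y, predict(fit, X_perm))
  }))
})

# Larger increases over the baseline suggest the model relies more on that variable
sort(dropout_loss - baseline, decreasing = TRUE)
```

Reporting the permuted loss as-is, as a difference from the baseline, or as a ratio to it are all reasonable presentations of the same numbers.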
Function model_performance() calculates various performance measures for classification and regression models.
For classification models the following measures are calculated: F1, accuracy, recall, precision and AUC.
For regression models the following measures are calculated: mean squared error, R squared, median absolute deviation.
model_performance(explainer, ..., cutoff = 0.5)
explainer |
a model to be explained, preprocessed by the |
... |
other parameters |
cutoff |
a cutoff for classification models, needed for measures like recall, precision, ACC, F1. By default 0.5. |
An object of the class model_performance. It's a list with the following fields:
residuals - data frame that contains residuals for each observation
measures - list with calculated measures that are dedicated for the task, whether it is regression, binary classification or multiclass classification.
type - character that specifies type of the task.
Explanatory Model Analysis. Explore, Explain, and Examine Predictive Models. https://ema.drwhy.ai/
# regression library("ranger") apartments_ranger_model <- ranger(m2.price~., data = apartments, num.trees = 50) explainer_ranger_apartments <- explain(apartments_ranger_model, data = apartments[,-1], y = apartments$m2.price, label = "Ranger Apartments") model_performance_ranger_aps <- model_performance(explainer_ranger_apartments ) model_performance_ranger_aps plot(model_performance_ranger_aps) plot(model_performance_ranger_aps, geom = "boxplot") plot(model_performance_ranger_aps, geom = "histogram") # binary classification titanic_glm_model <- glm(survived~., data = titanic_imputed, family = "binomial") explainer_glm_titanic <- explain(titanic_glm_model, data = titanic_imputed[,-8], y = titanic_imputed$survived) model_performance_glm_titanic <- model_performance(explainer_glm_titanic) model_performance_glm_titanic plot(model_performance_glm_titanic) plot(model_performance_glm_titanic, geom = "boxplot") plot(model_performance_glm_titanic, geom = "histogram") # multilabel classification HR_ranger_model <- ranger(status~., data = HR, num.trees = 50, probability = TRUE) explainer_ranger_HR <- explain(HR_ranger_model, data = HR[,-6], y = HR$status, label = "Ranger HR") model_performance_ranger_HR <- model_performance(explainer_ranger_HR) model_performance_ranger_HR plot(model_performance_ranger_HR) plot(model_performance_ranger_HR, geom = "boxplot") plot(model_performance_ranger_HR, geom = "histogram")
This function calculates dataset-level explanations that explore the model response as a function of selected variables.
The explanations can be calculated as Partial Dependence Profiles or Accumulated Local Dependence Profiles.
Find information on how to use this function here: https://ema.drwhy.ai/partialDependenceProfiles.html.
The variable_profile function is a copy of model_profile.
model_profile(
  explainer,
  variables = NULL,
  N = 100,
  ...,
  groups = NULL,
  k = NULL,
  center = TRUE,
  type = "partial"
)

variable_profile(
  explainer,
  variables = NULL,
  N = 100,
  ...,
  groups = NULL,
  k = NULL,
  center = TRUE,
  type = "partial"
)

single_variable(explainer, variable, type = "pdp", ...)
explainer |
a model to be explained, preprocessed by the |
variables |
character - names of variables to be explained |
N |
number of observations used for calculation of aggregated profiles. By default |
... |
other parameters that will be passed to |
groups |
a variable name that will be used for grouping.
By default |
k |
number of clusters for the hclust function (for clustered profiles) |
center |
shall profiles be centered before clustering |
type |
the type of variable profile. Either |
variable |
deprecated, use variables instead |
Underneath, this function calls the partial_dependence or accumulated_dependence functions from the ingredients package.
An object of the class model_profile.
It's a data frame with calculated average model responses.
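To make "average model responses" concrete, a partial dependence profile can be sketched by hand in base R. This is an illustration of the idea only, not the code used by ingredients: the mtcars data, the lm model and the grid size are assumptions chosen to keep the example self-contained, and N plays the role of the N argument above.

```r
set.seed(1)

# Stand-in model and data (assumptions for illustration)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Use N sampled observations, mirroring the role of the N argument
N <- 20
sample_rows <- mtcars[sample(nrow(mtcars), N), ]

# Grid of values for the explained variable
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10)

# For each grid value: overwrite wt in every sampled row, predict, average
pdp <- sapply(grid, function(v) {
  modified <- sample_rows
  modified$wt <- v
  mean(predict(model, newdata = modified))
})

profile <- data.frame(wt = grid, average_prediction = pdp)
print(profile)
```

For a linear model the resulting profile is a straight line whose slope equals the wt coefficient. For a black-box model the same procedure reveals nonlinear average effects.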
Explanatory Model Analysis. Explore, Explain, and Examine Predictive Models. https://ema.drwhy.ai/
titanic_glm_model <- glm(survived~., data = titanic_imputed, family = "binomial")
explainer_glm <- explain(titanic_glm_model, data = titanic_imputed)
model_profile_glm_fare <- model_profile(explainer_glm, "fare")
plot(model_profile_glm_fare)

library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 50,
                               probability = TRUE)
explainer_ranger <- explain(titanic_ranger_model, data = titanic_imputed)
model_profile_ranger <- model_profile(explainer_ranger)
plot(model_profile_ranger, geom = "profiles")

model_profile_ranger_1 <- model_profile(explainer_ranger, type = "partial",
                                        variables = c("age", "fare"))
plot(model_profile_ranger_1, variables = c("age", "fare"), geom = "points")

model_profile_ranger_2 <- model_profile(explainer_ranger, type = "partial", k = 3)
plot(model_profile_ranger_2, geom = "profiles")

model_profile_ranger_3 <- model_profile(explainer_ranger, type = "partial", groups = "gender")
plot(model_profile_ranger_3, geom = "profiles")

model_profile_ranger_4 <- model_profile(explainer_ranger, type = "accumulated")
plot(model_profile_ranger_4, geom = "profiles")

# Multiple profiles
model_profile_ranger_fare <- model_profile(explainer_ranger, "fare")
plot(model_profile_ranger_fare, model_profile_glm_fare)
Plot List of Explanations
## S3 method for class 'list'
plot(x, ...)
x |
a list of explanations of the same class |
... |
other parameters |
An object of the class ggplot.
library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 50,
                               probability = TRUE)
explainer_ranger <- explain(titanic_ranger_model, data = titanic_imputed[,-8],
                            y = titanic_imputed$survived)
mp_ranger <- model_performance(explainer_ranger)

titanic_ranger_model2 <- ranger(survived~gender + fare, data = titanic_imputed,
                                num.trees = 50, probability = TRUE)
explainer_ranger2 <- explain(titanic_ranger_model2, data = titanic_imputed[,-8],
                             y = titanic_imputed$survived, label = "ranger2")
mp_ranger2 <- model_performance(explainer_ranger2)

plot(list(mp_ranger, mp_ranger2), geom = "prc")
plot(list(mp_ranger, mp_ranger2), geom = "roc")

tmp <- list(mp_ranger, mp_ranger2)
names(tmp) <- c("ranger", "ranger2")
plot(tmp)
Plot Dataset Level Model Diagnostics
## S3 method for class 'model_diagnostics'
plot(x, ..., variable = "y_hat", yvariable = "residuals", smooth = TRUE)
x |
a data.frame to be explained, preprocessed by the |
... |
other objects to be included in the plot |
variable |
character - name of the variable on OX axis to be explained, by default |
yvariable |
character - name of the variable on OY axis, by default |
smooth |
logical, shall the smooth line be added |
An object of the class model_diagnostics_explainer.
apartments_lm_model <- lm(m2.price ~ ., data = apartments)
explainer_lm <- explain(apartments_lm_model, data = apartments, y = apartments$m2.price)
diag_lm <- model_diagnostics(explainer_lm)
diag_lm
plot(diag_lm)

library("ranger")
apartments_ranger_model <- ranger(m2.price ~ ., data = apartments)
explainer_ranger <- explain(apartments_ranger_model, data = apartments, y = apartments$m2.price)
diag_ranger <- model_diagnostics(explainer_ranger)
diag_ranger
plot(diag_ranger)
plot(diag_ranger, diag_lm)
plot(diag_ranger, diag_lm, variable = "y")
plot(diag_ranger, diag_lm, variable = "construction.year")
plot(diag_ranger, variable = "y", yvariable = "y_hat")
Plot Variable Importance Explanations
## S3 method for class 'model_parts'
plot(x, ...)
x |
an object of the class |
... |
other parameters described below |
An object of the class ggplot.
max_vars - maximal number of features to be included in the plot. Default value is 10
show_boxplots - logical, if TRUE (default) boxplot will be plotted to show permutation data.
bar_width - width of bars. By default 10
desc_sorting - logical. Should the bars be sorted descending? By default TRUE
title - the plot's title, by default 'Feature Importance'
subtitle - a character. Plot subtitle. By default NULL - then subtitle is set to "created for the XXX, YYY model", where XXX, YYY are labels of given explainers.
Plot Dataset Level Model Performance Explanations
## S3 method for class 'model_performance'
plot(
  x,
  ...,
  geom = "ecdf",
  show_outliers = 0,
  ptlabel = "name",
  lossFunction = loss_function,
  loss_function = function(x) sqrt(mean(x^2))
)
x |
a model to be explained, preprocessed by the |
... |
other parameters |
geom |
either |
show_outliers |
number of largest residuals to be presented (only when geom = boxplot). |
ptlabel |
either |
lossFunction |
alias for |
loss_function |
function that calculates the loss for a model based on model residuals. By default it's the root mean square. NOTE that this argument was called |
An object of the class model_performance.
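The default loss_function in the usage above is the root mean square of the residuals. The base R sketch below shows what it computes on made-up residuals (hypothetical values, not DALEX output). It also defines mae_loss, an illustrative name introduced here, to show that any function mapping a residual vector to a single number has the same interface.

```r
# Default loss from the usage above: root mean square of residuals
loss_function <- function(x) sqrt(mean(x^2))

# Hypothetical residuals, for illustration only
residuals <- c(-2, 1, 0, 3, -1)

rms <- loss_function(residuals)

# Hypothetical alternative with the same interface: mean absolute error
mae_loss <- function(x) mean(abs(x))
mae <- mae_loss(residuals)

print(c(rms = rms, mae = mae))
```

Because the root mean square weights large residuals more heavily than the mean absolute error, the two losses can rank models differently when a few observations are predicted very poorly.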
library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 50,
                               probability = TRUE)
explainer_ranger <- explain(titanic_ranger_model, data = titanic_imputed[,-8],
                            y = titanic_imputed$survived)
mp_ranger <- model_performance(explainer_ranger)
plot(mp_ranger)
plot(mp_ranger, geom = "boxplot", show_outliers = 1)

titanic_ranger_model2 <- ranger(survived~gender + fare, data = titanic_imputed,
                                num.trees = 50, probability = TRUE)
explainer_ranger2 <- explain(titanic_ranger_model2, data = titanic_imputed[,-8],
                             y = titanic_imputed$survived, label = "ranger2")
mp_ranger2 <- model_performance(explainer_ranger2)
plot(mp_ranger, mp_ranger2, geom = "prc")
plot(mp_ranger, mp_ranger2, geom = "roc")
plot(mp_ranger, mp_ranger2, geom = "lift")
plot(mp_ranger, mp_ranger2, geom = "gain")
plot(mp_ranger, mp_ranger2, geom = "boxplot")
plot(mp_ranger, mp_ranger2, geom = "histogram")
plot(mp_ranger, mp_ranger2, geom = "ecdf")

titanic_glm_model <- glm(survived~., data = titanic_imputed, family = "binomial")
explainer_glm <- explain(titanic_glm_model, data = titanic_imputed[,-8],
                         y = titanic_imputed$survived, label = "glm",
                         predict_function = function(m,x) predict.glm(m,x,type = "response"))
mp_glm <- model_performance(explainer_glm)
plot(mp_glm)

titanic_lm_model <- lm(survived~., data = titanic_imputed)
explainer_lm <- explain(titanic_lm_model, data = titanic_imputed[,-8],
                        y = titanic_imputed$survived, label = "lm")
mp_lm <- model_performance(explainer_lm)
plot(mp_lm)

plot(mp_ranger, mp_glm, mp_lm)
plot(mp_ranger, mp_glm, mp_lm, geom = "boxplot")
plot(mp_ranger, mp_glm, mp_lm, geom = "boxplot", show_outliers = 1)
Plot Dataset Level Model Profile Explanations
## S3 method for class 'model_profile'
plot(x, ..., geom = "aggregates")
x |
a variable profile explanation, created with the |
... |
other parameters |
geom |
either |
An object of the class ggplot.
color - a character. Either name of a color, or hex code for a color, or _label_ if models shall be colored, or _ids_ if instances shall be colored
size - a numeric. Size of lines to be plotted
alpha - a numeric between 0 and 1. Opacity of lines
facet_ncol - number of columns for the facet_wrap
variables - if not NULL then only variables will be presented
title - a character. Partial and accumulated dependence explainers have a default value.
subtitle - a character. If NULL, the value will depend on model usage.
titanic_glm_model <- glm(survived~., data = titanic_imputed, family = "binomial")
explainer_glm <- explain(titanic_glm_model, data = titanic_imputed)
expl_glm <- model_profile(explainer_glm, "fare")
plot(expl_glm)

library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 50,
                               probability = TRUE)
explainer_ranger <- explain(titanic_ranger_model, data = titanic_imputed)
expl_ranger <- model_profile(explainer_ranger)
plot(expl_ranger)
plot(expl_ranger, geom = "aggregates")

vp_ra <- model_profile(explainer_ranger, type = "partial", variables = c("age", "fare"))
plot(vp_ra, variables = c("age", "fare"), geom = "points")

vp_ra <- model_profile(explainer_ranger, type = "partial", k = 3)
plot(vp_ra)
plot(vp_ra, geom = "profiles")
plot(vp_ra, geom = "points")

vp_ra <- model_profile(explainer_ranger, type = "partial", groups = "gender")
plot(vp_ra)
plot(vp_ra, geom = "profiles")
plot(vp_ra, geom = "points")

vp_ra <- model_profile(explainer_ranger, type = "accumulated")
plot(vp_ra)
plot(vp_ra, geom = "profiles")
plot(vp_ra, geom = "points")
Plot Instance Level Residual Diagnostics
## S3 method for class 'predict_diagnostics'
plot(x, ...)
x |
an object with instance level residual diagnostics created with |
... |
other parameters that will be passed to |
A ggplot2 object of the class gg.
library("ranger")
# note: despite the "glm" in the object names, the model fitted here is a ranger forest
titanic_glm_model <- ranger(survived ~ gender + age + class + fare + sibsp + parch,
                            data = titanic_imputed)
explainer_glm <- explain(titanic_glm_model, data = titanic_imputed,
                         y = titanic_imputed$survived)
johny_d <- titanic_imputed[24, c("gender", "age", "class", "fare", "sibsp", "parch")]

pl <- predict_diagnostics(explainer_glm, johny_d, variables = NULL)
plot(pl)

pl <- predict_diagnostics(explainer_glm, johny_d, neighbors = 10,
                          variables = c("age", "fare"))
plot(pl)

pl <- predict_diagnostics(explainer_glm, johny_d, neighbors = 10,
                          variables = c("class", "gender"))
plot(pl)
Plot Variable Attribution Explanations
## S3 method for class 'predict_parts'
plot(x, ...)
x |
an object of the class |
... |
other parameters described below |
An object of the class ggplot.
max_features - maximal number of features to be included in the plot. Default value is 10
min_max - a range of OX axis. By default NA, therefore it will be extracted from the contributions of x. But it can be set to some constants, useful if these plots are to be used for comparisons.
add_contributions - if TRUE, variable contributions will be added to the plot.
shift_contributions - number describing how much labels should be shifted to the right, as a fraction of range. By default equal to 0.05.
vcolors - if NA (default), DrWhy colors are used.
vnames - a character vector, if specified then will be used as labels on OY axis. By default NULL.
digits - number of decimal places (round) or significant digits (signif) to be used.
rounding_function - a function to be used for rounding numbers.
plot_distributions - if TRUE then distributions of conditional proportions will be plotted. This requires keep_distributions=TRUE in the break_down, local_attributions, or local_interactions.
baseline - if numeric then the vertical line starts in baseline.
title - a character. Plot title. By default "Break Down profile".
subtitle - a character. Plot subtitle. By default NULL - then subtitle is set to "created for the XXX, YYY model", where XXX, YYY are labels of given explainers.
max_vars - alias for the max_features parameter.
show_boxplots - logical, if TRUE (default) boxplot will be plotted to show uncertainty of attributions.
vcolors - if NA (default), DrWhy colors are used.
max_features - maximal number of features to be included in the plot. Default value is 10
max_vars - alias for the max_features parameter.
bar_width - width of bars. By default 10
Plot Variable Profile Explanations
## S3 method for class 'predict_profile'
plot(x, ...)
x |
an object of the class |
... |
other parameters |
An object of the class ggplot.
color - a character. Either name of a color or name of a variable that should be used for coloring
size - a numeric. Size of lines to be plotted
alpha - a numeric between 0 and 1. Opacity of lines
facet_ncol - number of columns for the facet_wrap
variables - if not NULL then only variables will be presented
variable_type - a character. If numerical then only numerical variables will be plotted. If categorical then only categorical variables will be plotted.
title - a character. Plot title. By default "Ceteris Paribus profile".
subtitle - a character. Plot subtitle. By default NULL - then subtitle is set to "created for the XXX, YYY model", where XXX, YYY are labels of given explainers.
categorical_type - a character. How categorical variables shall be plotted? Either "lines" (default) or "bars".
Displays a waterfall plot of aggregated SHAP values for objects of the shap_aggregated class.
## S3 method for class 'shap_aggregated'
plot(
  x,
  ...,
  shift_contributions = 0.05,
  add_contributions = TRUE,
  add_boxplots = TRUE,
  max_features = 10,
  title = "Aggregated SHAP"
)
x |
an explanation object created with function |
... |
other parameters like |
shift_contributions |
number describing how much labels should be shifted to the right, as a fraction of range. By default equal to |
add_contributions |
if |
add_boxplots |
if |
max_features |
maximal number of features to be included in the plot. default value is |
title |
a character. Plot title. By default |
A ggplot2 object.
library("DALEX")
set.seed(1313)
model_titanic_glm <- glm(survived ~ gender + age + fare,
                         data = titanic_imputed, family = "binomial")
explain_titanic_glm <- explain(model_titanic_glm, data = titanic_imputed,
                               y = titanic_imputed$survived, label = "glm")

bd_glm <- shap_aggregated(explain_titanic_glm, titanic_imputed[1:10, ])
bd_glm
plot(bd_glm)
plot(bd_glm, max_features = 3)
plot(bd_glm, max_features = 3,
     vnames = c("average","+ male","+ young","+ cheap ticket", "+ other factors", "final"))
This function performs local diagnostics of residuals. For a single instance, its neighbors are identified in the validation data. Residuals are calculated for the neighbors and plotted against residuals for all data. Find information on how to use this function here: https://ema.drwhy.ai/localDiagnostics.html.
predict_diagnostics(
  explainer,
  new_observation,
  variables = NULL,
  ...,
  nbins = 20,
  neighbors = 50,
  distance = gower::gower_dist
)

individual_diagnostics(
  explainer,
  new_observation,
  variables = NULL,
  ...,
  nbins = 20,
  neighbors = 50,
  distance = gower::gower_dist
)
explainer |
a model to be explained, preprocessed by the 'explain' function |
new_observation |
a new observation for which predictions need to be explained |
variables |
character - names of variables to be explained |
... |
other parameters |
nbins |
number of bins for the histogram. By default 20 |
neighbors |
number of neighbors for histogram. By default 50. |
distance |
the distance function, by default the |
An object of the class 'predict_diagnostics'. It's a data frame with calculated distribution of residuals.
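The neighbor-based comparison described above can be sketched in base R. This is an illustration under stated assumptions, not the DALEX implementation: it uses mtcars and an lm model as stand-ins, and Euclidean distance on scaled features in place of the Gower distance that the function uses by default.

```r
# Stand-in model and data (assumptions for illustration)
model <- lm(mpg ~ wt + hp, data = mtcars)
residuals_all <- mtcars$mpg - predict(model, newdata = mtcars)

# Instance of interest: the first row of the data
features <- scale(mtcars[, c("wt", "hp")])

# Euclidean distance on scaled features (stand-in for Gower distance)
centered <- sweep(features, 2, features[1, ])
d <- sqrt(rowSums(centered^2))

# Pick the closest observations and collect their residuals
neighbors <- 10
idx <- order(d)[seq_len(neighbors)]
residuals_neighbors <- residuals_all[idx]

# Compare the local residual distribution with the global one
print(summary(residuals_neighbors))
print(summary(residuals_all))
```

If the neighbors' residuals are shifted or more spread out than the residuals for all data, the model may be less reliable around the instance of interest.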
Explanatory Model Analysis. Explore, Explain, and Examine Predictive Models. https://ema.drwhy.ai/
library("ranger")
# note: despite the "glm" in the object names, the model fitted here is a ranger forest
titanic_glm_model <- ranger(survived ~ gender + age + class + fare + sibsp + parch,
                            data = titanic_imputed)
explainer_glm <- explain(titanic_glm_model, data = titanic_imputed,
                         y = titanic_imputed$survived)
johny_d <- titanic_imputed[24, c("gender", "age", "class", "fare", "sibsp", "parch")]

id_johny <- predict_diagnostics(explainer_glm, johny_d, variables = NULL)
id_johny
plot(id_johny)

id_johny <- predict_diagnostics(explainer_glm, johny_d, neighbors = 10,
                                variables = c("age", "fare"))
id_johny
plot(id_johny)

id_johny <- predict_diagnostics(explainer_glm, johny_d, neighbors = 10,
                                variables = c("class", "gender"))
id_johny
plot(id_johny)
Instance Level Variable Attributions as Break Down, SHAP, aggregated SHAP or Oscillations explanations.
Model prediction is decomposed into parts that are attributed to particular variables.
From DALEX version 1.0 this function calls the break_down or shap functions from the iBreakDown package, or ceteris_paribus from the ingredients package, or kernelshap from the kernelshap package.
Find information on how to use the break_down method here: https://ema.drwhy.ai/breakDown.html.
Find information on how to use the shap method here: https://ema.drwhy.ai/shapley.html.
Find information on how to use the oscillations method here: https://ema.drwhy.ai/ceterisParibusOscillations.html.
Find information on how to use the kernelshap method here: https://modeloriented.github.io/kernelshap/
The aSHAP method provides explanations for a set of observations based on SHAP.
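The decomposition behind the break_down type can be sketched in base R for one fixed variable ordering. This is a simplified illustration, not the iBreakDown code: mtcars, the lm model, the new_observation values and the ordering in variable_order are all assumptions chosen to keep the example self-contained.

```r
# Stand-in model and data (assumptions for illustration)
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
data  <- mtcars[, c("wt", "hp", "qsec")]
new_observation <- data.frame(wt = 3, hp = 120, qsec = 18)

# Assumed ordering in which variables are fixed
variable_order <- c("wt", "hp", "qsec")

# Start from the average prediction over the data
intercept <- mean(predict(model, newdata = data))

# Fix variables one by one at the values of new_observation
# and record the average prediction after each step
current <- data
means <- numeric(length(variable_order))
for (i in seq_along(variable_order)) {
  v <- variable_order[i]
  current[[v]] <- new_observation[[v]]
  means[i] <- mean(predict(model, newdata = current))
}

# The contribution of each variable is the change in the average prediction
contributions <- diff(c(intercept, means))
names(contributions) <- variable_order

prediction <- unname(predict(model, newdata = new_observation))
print(c(intercept = intercept, contributions, prediction = prediction))
```

The intercept plus all contributions adds up to the prediction for new_observation. For an additive model such as this one the result does not depend on the ordering, which is no longer true once interactions are present.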
predict_parts(
  explainer,
  new_observation,
  ...,
  N = if (substr(type, 1, 4) == "osci") 500 else NULL,
  type = "break_down"
)

predict_parts_oscillations(explainer, new_observation, ...)

predict_parts_oscillations_uni(
  explainer,
  new_observation,
  variable_splits_type = "uniform",
  ...
)

predict_parts_oscillations_emp(
  explainer,
  new_observation,
  variable_splits = NULL,
  variables = colnames(explainer$data),
  ...
)

predict_parts_break_down(explainer, new_observation, ...)

predict_parts_break_down_interactions(explainer, new_observation, ...)

predict_parts_shap(explainer, new_observation, ...)

predict_parts_shap_aggregated(explainer, new_observation, ...)

predict_parts_kernel_shap(explainer, new_observation, ...)

predict_parts_kernel_shap_break_down(explainer, new_observation, ...)

predict_parts_kernel_shap_aggreagted(explainer, new_observation, ...)

variable_attribution(
  explainer,
  new_observation,
  ...,
  N = if (substr(type, 1, 4) == "osci") 500 else NULL,
  type = "break_down"
)
explainer |
a model to be explained, preprocessed by the |
new_observation |
a new observation for which predictions need to be explained |
... |
other parameters that will be passed to |
N |
the maximum number of observations used for calculation of attributions. By default NULL (use all) or 500 (for oscillations). |
type |
the type of variable attributions. Either |
variable_splits_type |
how variable grids shall be calculated? Will be passed to |
variable_splits |
named list of splits for variables. It is used by oscillations based measures. Will be passed to |
variables |
names of variables for which splits shall be calculated. Will be passed to |
Depending on the type there are different classes of the resulting object.
It's a data frame with calculated average response.
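When a model contains interactions, the attribution of a variable depends on the order in which variables are fixed. SHAP-style attributions average over orderings. The base R sketch below illustrates this with two variables and both possible orderings. It is a simplified illustration under stated assumptions (mtcars, an lm model with an interaction, a helper named break_down_for_order introduced here), not the iBreakDown or kernelshap implementation.

```r
# Stand-in model with an interaction (assumption for illustration)
model <- lm(mpg ~ wt * hp, data = mtcars)
data  <- mtcars[, c("wt", "hp")]
new_observation <- data.frame(wt = 3, hp = 120)

# Hypothetical helper: sequential contributions for one ordering
break_down_for_order <- function(variable_order) {
  current <- data
  previous <- mean(predict(model, newdata = current))
  contributions <- numeric(0)
  for (v in variable_order) {
    current[[v]] <- new_observation[[v]]
    updated <- mean(predict(model, newdata = current))
    contributions[v] <- updated - previous
    previous <- updated
  }
  # Return in a fixed variable order so results can be averaged
  contributions[c("wt", "hp")]
}

# With two variables there are only two orderings
orderings <- list(c("wt", "hp"), c("hp", "wt"))
per_order <- sapply(orderings, break_down_for_order)

# Average contribution of each variable across orderings
shap_values <- rowMeans(per_order)
print(per_order)
print(shap_values)
```

The averaged values still add up to the difference between the prediction for new_observation and the average prediction. With many variables, enumerating all orderings is infeasible, so a random sample of orderings is typically used instead.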
Explanatory Model Analysis. Explore, Explain, and Examine Predictive Models. https://ema.drwhy.ai/
library(DALEX)
new_dragon <- data.frame(
  year_of_birth = 200,
  height = 80,
  weight = 12.5,
  scars = 0,
  number_of_lost_teeth = 5
)

model_lm <- lm(life_length ~ year_of_birth + height + weight + scars + number_of_lost_teeth,
               data = dragons)
# y must be the modelled response (life_length), not a predictor
explainer_lm <- explain(model_lm, data = dragons, y = dragons$life_length,
                        label = "model_lm")
bd_lm <- predict_parts_break_down(explainer_lm, new_observation = new_dragon)
head(bd_lm)
plot(bd_lm)

library("ranger")
model_ranger <- ranger(life_length ~ year_of_birth + height + weight + scars + number_of_lost_teeth,
                       data = dragons, num.trees = 50)
explainer_ranger <- explain(model_ranger, data = dragons, y = dragons$life_length,
                            label = "model_ranger")
bd_ranger <- predict_parts_break_down(explainer_ranger, new_observation = new_dragon)
head(bd_ranger)
plot(bd_ranger)
This function calculates individual profiles, also known as Ceteris Paribus Profiles.
From DALEX version 1.0 this function calls ceteris_paribus from the ingredients package.
Find information on how to use this function here: https://ema.drwhy.ai/ceterisParibus.html.
predict_profile(
  explainer,
  new_observation,
  variables = NULL,
  ...,
  type = "ceteris_paribus",
  variable_splits_type = "uniform"
)

individual_profile(
  explainer,
  new_observation,
  variables = NULL,
  ...,
  type = "ceteris_paribus",
  variable_splits_type = "uniform"
)
explainer |
a model to be explained, preprocessed by the |
new_observation |
a new observation for which predictions need to be explained |
variables |
character - names of variables to be explained |
... |
other parameters |
type |
character, currently only the |
variable_splits_type |
how shall variable grids be calculated? Use "uniform" (default) to get a uniform grid of points or "quantiles" for percentiles. Will be passed to 'ingredients'. |
An object of the class ceteris_paribus_explainer.
It's a data frame with calculated average response.
Explanatory Model Analysis. Explore, Explain, and Examine Predictive Models. https://ema.drwhy.ai/
new_dragon <- data.frame(year_of_birth = 200,
                         height = 80,
                         weight = 12.5,
                         scars = 0,
                         number_of_lost_teeth = 5)

dragon_lm_model4 <- lm(life_length ~ year_of_birth + height + weight + scars + number_of_lost_teeth,
                       data = dragons)
# y is the modelled response, i.e. life_length
dragon_lm_explainer4 <- explain(dragon_lm_model4, data = dragons, y = dragons$life_length,
                                label = "model_4v")
dragon_lm_predict4 <- predict_profile(dragon_lm_explainer4,
                                      new_observation = new_dragon,
                                      variables = c("year_of_birth", "height", "scars"))
head(dragon_lm_predict4)
plot(dragon_lm_predict4, variables = c("year_of_birth", "height", "scars"))

library("ranger")
dragon_ranger_model4 <- ranger(life_length ~ year_of_birth + height + weight + scars + number_of_lost_teeth,
                               data = dragons, num.trees = 50)
dragon_ranger_explainer4 <- explain(dragon_ranger_model4, data = dragons, y = dragons$life_length,
                                    label = "model_ranger")
dragon_ranger_predict4 <- predict_profile(dragon_ranger_explainer4,
                                          new_observation = new_dragon,
                                          variables = c("year_of_birth", "height", "scars"))
head(dragon_ranger_predict4)
plot(dragon_ranger_predict4, variables = c("year_of_birth", "height", "scars"))
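To make the idea of a ceteris paribus profile and the two grid styles of variable_splits_type more concrete, here is a minimal hand-rolled sketch in base R. It is not the code used by ingredients; the helpers make_grid and cp_profile, the built-in mtcars data and the grid size of 11 points are all assumptions chosen for illustration.

```r
# Hand-rolled ceteris paribus sketch on a built-in dataset (not the 'ingredients' code).
model <- lm(mpg ~ wt + hp, data = mtcars)
new_obs <- data.frame(wt = 3, hp = 120)

# Hypothetical helper: build a grid for one variable in either style.
make_grid <- function(x, type = c("uniform", "quantiles"), n = 11) {
  type <- match.arg(type)
  if (type == "uniform") {
    # equally spaced points between the observed minimum and maximum
    seq(min(x), max(x), length.out = n)
  } else {
    # empirical quantiles, so the grid is denser where the data are denser
    unique(quantile(x, probs = seq(0, 1, length.out = n), names = FALSE))
  }
}

# Hypothetical helper: vary one variable, keep all others at the observed values.
cp_profile <- function(model, data, new_obs, variable, type = "uniform") {
  grid <- make_grid(data[[variable]], type)
  profile <- new_obs[rep(1, length(grid)), , drop = FALSE]
  profile[[variable]] <- grid
  profile$yhat <- as.numeric(predict(model, newdata = profile))
  rownames(profile) <- NULL
  profile
}

cp_uniform <- cp_profile(model, mtcars, new_obs, "wt", type = "uniform")
cp_quantiles <- cp_profile(model, mtcars, new_obs, "wt", type = "quantiles")
head(cp_uniform)
```

In both data frames only wt changes from row to row while hp stays at the value of the new observation, which is the "all else unchanged" reading of a ceteris paribus profile.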
This is a generic predict() function that works for explainer objects.
## S3 method for class 'explainer'
predict(object, newdata, ...)

model_prediction(explainer, new_data, ...)
object |
a model to be explained, object of the class |
newdata |
data.frame or matrix - observations for prediction |
... |
other parameters that will be passed to the predict function |
explainer |
a model to be explained, object of the class |
new_data |
data.frame or matrix - observations for prediction |
A numeric matrix of predictions
HR_glm_model <- glm(status == "fired"~., data = HR, family = "binomial")
explainer_glm <- explain(HR_glm_model, data = HR)
predict(explainer_glm, HR[1:3,])

library("ranger")
HR_ranger_model <- ranger(status~., data = HR, num.trees = 50, probability = TRUE)
explainer_ranger <- explain(HR_ranger_model, data = HR)
predict(explainer_ranger, HR[1:3,])
model_prediction(explainer_ranger, HR[1:3,])
Generic function
## S3 method for class 'description'
print(x, ...)
x |
an individual explainer produced with the 'describe()' function |
... |
other arguments |
Print Explainer Summary
## S3 method for class 'explainer'
print(x, ...)
x |
a model explainer created with the 'explain' function |
... |
other parameters |
aps_lm_model4 <- lm(m2.price~., data = apartments)
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, y = apartments$m2.price,
                             label = "model_4v")
aps_lm_explainer4

library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 50,
                               probability = TRUE)
explainer_ranger <- explain(titanic_ranger_model, data = titanic_imputed[,-8],
                            y = titanic_imputed$survived, label = "model_ranger")
explainer_ranger
Generic function
## S3 method for class 'model_diagnostics'
print(x, ...)
x |
an object with dataset level residual diagnostics created with |
... |
other parameters |
Prints an object of class model_info created with model_info
## S3 method for class 'model_info'
print(x, ...)
x |
- an object of class |
... |
- other parameters |
Print Dataset Level Model Performance Summary
## S3 method for class 'model_performance'
print(x, ...)
x |
a model to be explained, object of the class 'model_performance_explainer' |
... |
other parameters |
library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 100,
                               probability = TRUE)
# It's a good practice to pass data without target variable
explainer_ranger <- explain(titanic_ranger_model, data = titanic_imputed[,-8],
                            y = titanic_imputed$survived)
# resulting dataframe has predicted values and residuals
mp_ex_rn <- model_performance(explainer_ranger)
mp_ex_rn
plot(mp_ex_rn)
Generic function
## S3 method for class 'model_profile'
print(x, ...)
x |
an object with dataset level profile created with |
... |
other parameters |
Generic function
## S3 method for class 'predict_diagnostics'
print(x, ...)
x |
an object with instance level residual diagnostics created with |
... |
other parameters |
Default Theme for DALEX plots
set_theme_dalex(
  default_theme = "drwhy",
  default_theme_vertical = default_theme
)

theme_default_dalex()

theme_vertical_default_dalex()
default_theme |
object - string ("drwhy" or "ema") or an object of ggplot theme class. Will be applied by default by DALEX to all horizontal plots |
default_theme_vertical |
object - string ("drwhy" or "ema") or an object of ggplot theme class. Will be applied by default by DALEX to all vertical plots |
list with current default themes
old <- set_theme_dalex("ema")

library("ranger")
apartments_ranger_model <- ranger(m2.price~., data = apartments, num.trees = 50)
explainer_ranger <- explain(apartments_ranger_model, data = apartments[,-1],
                            y = apartments$m2.price, label = "Ranger Apartments")
model_parts_ranger_aps <- model_parts(explainer_ranger, type = "raw")
head(model_parts_ranger_aps, 8)
plot(model_parts_ranger_aps)

old <- set_theme_dalex(ggplot2::theme_void(), ggplot2::theme_void())
plot(model_parts_ranger_aps)

old <- set_theme_dalex("drwhy")
plot(model_parts_ranger_aps)

old <- set_theme_dalex(ggplot2::theme_void(), ggplot2::theme_void())
plot(model_parts_ranger_aps)
This function works in a similar way to the shap function from iBreakDown,
but it calculates explanations for a set of observations and then aggregates them.
shap_aggregated(
  explainer,
  new_observations,
  order = NULL,
  B = 25,
  kernelshap = FALSE,
  ...
)
explainer |
a model to be explained, preprocessed by the |
new_observations |
a set of new observations with columns that correspond to variables used in the model. |
order |
if not |
B |
number of random paths; works only if kernelshap=FALSE |
kernelshap |
indicates whether the kernelshap method should be used |
... |
other parameters like |
an object of the shap_aggregated class.
Explanatory Model Analysis. Explore, Explain and Examine Predictive Models. https://ema.drwhy.ai
library("DALEX")
set.seed(1313)
model_titanic_glm <- glm(survived ~ gender + age + fare,
                         data = titanic_imputed, family = "binomial")
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed,
                               y = titanic_imputed$survived,
                               label = "glm")
bd_glm <- shap_aggregated(explain_titanic_glm, titanic_imputed[1:10, ])
bd_glm
plot(bd_glm, max_features = 3)
DrWhy Theme for ggplot objects
theme_drwhy()

theme_ema()

theme_drwhy_vertical()

theme_ema_vertical()
theme for ggplot2 objects
The titanic data is a complete list of passengers and crew members on the RMS Titanic.
It includes a variable indicating whether a person survived the sinking of the RMS
Titanic on April 15, 1912.
data(titanic)

data(titanic_imputed)
a data frame with 2207 rows and 9 columns
This dataset was copied from the stablelearner package and went through a few variable
transformations. Levels in embarked were replaced with full names; sibsp, parch and fare
were converted to numerical variables; and values for crew were replaced with 0.
If you use this dataset please cite the original package.
From stablelearner: The website https://www.encyclopedia-titanica.org offers detailed information about passengers and crew
members on the RMS Titanic. According to the website, 1317 passengers and 890 crew members were aboard.
8 musicians and 9 employees of the shipyard company are listed as passengers, but travelled with a
free ticket, which is why they have NA values in fare. In addition to that, fare is truly missing for a few regular passengers.
gender a factor with levels male and female.
age a numeric value with the person's age on the day of the sinking.
class a factor specifying the class for passengers or the type of service aboard for crew members.
embarked a factor with the person's place of embarkment (Belfast/Cherbourg/Queenstown/Southampton).
country a factor with the person's home country.
fare a numeric value with the ticket price (0 for crew members, musicians and employees of the shipyard company).
sibsp an ordered factor specifying the number of siblings/spouses aboard; adopted from the Vanderbilt data set (see below).
parch an ordered factor specifying the number of parents/children aboard; adopted from the Vanderbilt data set (see below).
survived a factor with two levels (no and yes) specifying whether the person has survived the sinking.
NOTE: The titanic_imputed dataset uses the following imputation rules.
Missing 'age' is replaced with the mean of the observed ones, i.e., 30.
For sibsp and parch, missing values are replaced by the most frequently observed value, i.e., 0.
For fare, mean fare for a given class is used, i.e., 0 pounds for crew, 89 pounds for the 1st, 22 pounds for the 2nd, and 13 pounds for the 3rd class.
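The three imputation rules above can be sketched in base R on a small made-up data frame. This is not the script that produced titanic_imputed; the toy values and the helper most_frequent are assumptions used only to show the mechanics of mean, most-frequent-value and per-class mean imputation.

```r
# Toy data; the values are made up for illustration only.
toy <- data.frame(
  class = factor(c("1st", "1st", "2nd", "2nd", "3rd", "3rd")),
  age   = c(40, NA, 30, 20, NA, 10),
  sibsp = c(0, 1, NA, 0, 0, 2),
  fare  = c(80, NA, 20, 24, 10, NA)
)

# Hypothetical helper: most frequently observed value of a numeric vector.
most_frequent <- function(x) {
  counts <- table(x)  # table() drops NA by default
  as.numeric(names(counts)[which.max(counts)])
}

imputed <- toy

# Rule 1: missing age is replaced with the mean of the observed ages.
imputed$age[is.na(imputed$age)] <- mean(toy$age, na.rm = TRUE)

# Rule 2: missing sibsp is replaced with the most frequently observed value.
imputed$sibsp[is.na(imputed$sibsp)] <- most_frequent(toy$sibsp)

# Rule 3: missing fare is replaced with the mean fare within the same class.
class_mean_fare <- ave(toy$fare, toy$class, FUN = function(v) mean(v, na.rm = TRUE))
imputed$fare[is.na(imputed$fare)] <- class_mean_fare[is.na(toy$fare)]

imputed
```

Computing the per-class means from the original column, before any values are filled in, keeps the imputed fares from influencing the class averages.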
This dataset was copied from the stablelearner package and went through a few variable
transformations. The complete list of persons on the RMS Titanic was downloaded from
https://www.encyclopedia-titanica.org on April 5, 2016. The information given
in sibsp and parch was adopted from a data set obtained from https://biostat.app.vumc.org/wiki/Main/DataSets.
https://www.encyclopedia-titanica.org and https://CRAN.R-project.org/package=stablelearner
This function allows users to update the data and y of any explainer in a unified way. It doesn't require knowledge about the structure of an explainer.
update_data(explainer, data, y = NULL, verbose = TRUE)
explainer |
- explainer object that is supposed to be updated. |
data |
- new data, is going to be passed to an explainer |
y |
- new y, is going to be passed to an explainer |
verbose |
- logical, indicates if information about update should be printed |
updated explainer object
aps_lm_model4 <- lm(m2.price ~., data = apartments)
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v")
# switch the explainer to the test set shipped alongside apartments
explainer <- update_data(aps_lm_explainer4, data = apartments_test, y = apartments_test$m2.price)
This function allows users to update the label of any explainer in a unified way. It doesn't require knowledge about the structure of an explainer.
update_label(explainer, label, verbose = TRUE)
explainer |
- explainer object that is supposed to be updated. |
label |
- new label, is going to be passed to an explainer |
verbose |
- logical, indicates if information about update should be printed |
updated explainer object
aps_lm_model4 <- lm(m2.price ~., data = apartments)
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v")
explainer <- update_label(aps_lm_explainer4, label = "lm")
From DALEX version 1.0 this function calls accumulated_dependence or partial_dependence from the ingredients package.
For information on how to use this function, see https://ema.drwhy.ai/partialDependenceProfiles.html.
variable_effect(explainer, variables, ..., type = "partial_dependency")

variable_effect_partial_dependency(explainer, variables, ...)

variable_effect_accumulated_dependency(explainer, variables, ...)
explainer |
a model to be explained, preprocessed by the 'explain' function |
variables |
character - names of variables to be explained |
... |
other parameters |
type |
character - type of the response to be calculated. Currently the following options are implemented: 'partial_dependency' for Partial Dependency and 'accumulated_dependency' for Accumulated Local Effects |
An object of the class 'aggregated_profiles_explainer'. It's a data frame with calculated average response.
Explanatory Model Analysis. Explore, Explain, and Examine Predictive Models. https://ema.drwhy.ai/
titanic_glm_model <- glm(survived~., data = titanic_imputed, family = "binomial")
explainer_glm <- explain(titanic_glm_model, data = titanic_imputed)
# type comes after ... in the signature, so it has to be passed by name
expl_glm <- variable_effect(explainer_glm, "fare", type = "partial_dependency")
plot(expl_glm)

library("ranger")
titanic_ranger_model <- ranger(survived~., data = titanic_imputed, num.trees = 50,
                               probability = TRUE)
explainer_ranger <- explain(titanic_ranger_model, data = titanic_imputed)
expl_ranger <- variable_effect(explainer_ranger, variables = "fare",
                               type = "partial_dependency")
plot(expl_ranger)
plot(expl_ranger, expl_glm)

# Example for factor variable (with factorMerger)
expl_ranger_factor <- variable_effect(explainer_ranger, variables = "class")
plot(expl_ranger_factor)
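The "calculated average response" returned for type = 'partial_dependency' can be illustrated with a hand-rolled computation in base R. This is only a sketch of the general partial dependence idea, not the ingredients implementation; the helper pd_profile, its output column names, the mtcars data and the grid of 11 points are assumptions.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: partial dependence of the prediction on one variable.
pd_profile <- function(model, data, variable, n = 11) {
  grid <- seq(min(data[[variable]]), max(data[[variable]]), length.out = n)
  avg <- vapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value              # same grid value for every row
    mean(predict(model, newdata = modified))   # average prediction over the dataset
  }, numeric(1))
  data.frame(variable = variable, x = grid, yhat = avg)
}

pd_wt <- pd_profile(model, mtcars, "wt")
pd_wt
```

Each row of pd_wt is the mean of the model predictions after the chosen variable has been set to one grid value in every observation, so the profile averages the individual ceteris paribus profiles over the dataset.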
This function is a wrapper over the various predict functions for different models and different model structures. The wrapper returns a single numeric score for each new observation. To do this, it uses different extraction techniques for models from different classes; for example, for a classification random forest it forces the output to be probabilities rather than classes.
yhat(X.model, newdata, ...)

## S3 method for class 'lm'
yhat(X.model, newdata, ...)

## S3 method for class 'randomForest'
yhat(X.model, newdata, ...)

## S3 method for class 'svm'
yhat(X.model, newdata, ...)

## S3 method for class 'gbm'
yhat(X.model, newdata, ...)

## S3 method for class 'glm'
yhat(X.model, newdata, ...)

## S3 method for class 'cv.glmnet'
yhat(X.model, newdata, ...)

## S3 method for class 'glmnet'
yhat(X.model, newdata, ...)

## S3 method for class 'ranger'
yhat(X.model, newdata, ...)

## S3 method for class 'model_fit'
yhat(X.model, newdata, ...)

## S3 method for class 'train'
yhat(X.model, newdata, ...)

## S3 method for class 'lrm'
yhat(X.model, newdata, ...)

## S3 method for class 'rpart'
yhat(X.model, newdata, ...)

## S3 method for class 'function'
yhat(X.model, newdata, ...)

## S3 method for class 'party'
yhat(X.model, newdata, ...)

## Default S3 method:
yhat(X.model, newdata, ...)
X.model |
object - a model to be explained |
newdata |
data.frame or matrix - observations for prediction |
... |
other parameters that will be passed to the predict function |
Currently supported packages are:
class cv.glmnet and glmnet - models created with glmnet package,
class glm - generalized linear models created with glm,
class model_fit - models created with parsnip package,
class lm - linear models created with lm,
class ranger - models created with ranger package,
class randomForest - random forest models created with randomForest package,
class svm - support vector machines models created with the e1071 package,
class train - models created with caret package,
class gbm - models created with gbm package,
class lrm - models created with rms package,
class rpart - models created with rpart package.
A numeric matrix of predictions
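The wrapper pattern described above, one generic with a class-specific extraction rule per model type, can be sketched with base R S3 dispatch. The generic below is deliberately named score so that it is not mistaken for DALEX's yhat; it is a simplified stand-in under the assumption that each method only needs to pick the right predict() arguments, and it covers just lm and glm.

```r
# Hypothetical generic; a simplified stand-in, not DALEX's yhat().
score <- function(model, newdata, ...) UseMethod("score")

# Linear models: predict() already returns one numeric score per row.
score.lm <- function(model, newdata, ...) {
  as.numeric(predict(model, newdata = newdata, ...))
}

# Generalized linear models: ask for the response scale,
# which gives probabilities for a binomial family.
score.glm <- function(model, newdata, ...) {
  as.numeric(predict(model, newdata = newdata, type = "response", ...))
}

# Fallback for any other class that has a predict() method.
score.default <- function(model, newdata, ...) {
  as.numeric(predict(model, newdata, ...))
}

fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(am ~ wt, data = mtcars, family = "binomial")

score(fit_lm, mtcars[1:3, ])
score(fit_glm, mtcars[1:3, ])   # probabilities, not log-odds
```

Because a glm object has class c("glm", "lm"), dispatch reaches score.glm before score.lm, so the more specific extraction rule wins without any explicit class checks in the caller.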