--- title: "Observation influence audit" author: "Alicja Gosiewska" date: "`r Sys.Date()`" output: html_document vignette: > %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Observation influence audit} %\usepackage[UTF-8]{inputenc} --- ```{r setup, echo = FALSE} knitr::opts_chunk$set(warning = FALSE) knitr::opts_chunk$set(message = FALSE) ``` Observation influence audit, i.e. the impact of individual observation on a model. ## Data To illustrate application of `auditor` we will use dataset "dragons" available in the [`DALEX`](https://github.com/ModelOriented/DALEX) package. The dataset contains characteristics of fictional creatures (dragons), like year of birth, height, weight, etc (see below). The goal is to predict the length of life of dragons (a regression problem). ```{r} library(DALEX) data(dragons) head(dragons) ``` ## Models First, we need models to compare. We selected linear regression and random forest because of their different structures. Linear regression model linear relationships between target response and independent variables. While random forest should be able to capture also non-linear relationships between variables. ```{r} # Linear regression lm_model <- lm(life_length ~ ., data = dragons) # Random forest library(randomForest) set.seed(59) rf_model <- randomForest(life_length ~ ., data = dragons) ``` ## Preparation for analysis of observation influence Analysis begins with creation of an explainer object with `explain` function from `DALEX` package. Explainer wraps a model with its meta-data, such as dataset that was used for training or observed response. ```{r results = 'hide'} lm_exp <- DALEX::explain(lm_model, label = "lm", data = dragons, y = dragons$life_length) rf_exp <- DALEX::explain(rf_model, label = "rf", data = dragons, y = dragons$life_length) ``` Next step requires creation of `model_cooksdistance` objects of each explained model. In the case of models of classes other than `lm` and `glm`, the distances are computed directly from the definition, so this may take a while. In this example we will compute them only for a linear model. ```{r} library(auditor) lm_cd <- model_cooksdistance(lm_exp) ``` ## Plot of Cook's distances Cook's distance is used to estimate of the influence of an single observation. It is a tool for identifying observations that may negatively affect the model. Data points indicated by Cook's distances are worth checking for validity. Cook's distances may be also used for indicating regions of the design space where it would be good to obtain more observations. Cook’s Distances are calculated by removing the i-th observation from the data and recalculating the model. It shows how much all the values in the model change when the i-th observation is removed. ```{r} plot(lm_cd) ``` ## Other methods Other methods and plots are described in following vignettes: * [Model residuals audit](https://modeloriented.github.io/auditor/articles/model_residuals_audit.html) * [Model evaluation audit](https://modeloriented.github.io/auditor/articles/model_fit_audit.html) * [Model fit audit](https://modeloriented.github.io/auditor/articles/model_evaluation_audit.html) * [Model performance audit](https://modeloriented.github.io/auditor/articles/model_performance_audit.html)