Title: | Explore Correlations Between Variables in a Machine Learning Model |
---|---|
Description: | When exploring data or models we often examine variables one by one. This analysis is incomplete if the relationship between these variables is not taken into account. The 'corrgrapher' package facilitates simultaneous exploration of the Partial Dependence Profiles and the correlation between variables in the model. The package 'corrgrapher' is a part of the 'DrWhy.AI' universe. |
Authors: | Pawel Morgen [aut, cre], Przemyslaw Biecek [aut] |
Maintainer: | Pawel Morgen <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.2 |
Built: | 2024-11-17 04:10:17 UTC |
Source: | https://github.com/modeloriented/corrgrapher |
Calculate correlation coefficients between variables in a data.frame, matrix or table, using 3 different functions for the 3 possible pairs of variables:
numeric - numeric
numeric - categorical
categorical - categorical
calculate_cors(x, num_num_f = NULL, num_cat_f = NULL, cat_cat_f = NULL, max_cor = NULL)

## S3 method for class 'explainer'
calculate_cors(x, num_num_f = NULL, num_cat_f = NULL, cat_cat_f = NULL, max_cor = NULL)

## S3 method for class 'matrix'
calculate_cors(x, num_num_f = NULL, num_cat_f = NULL, cat_cat_f = NULL, max_cor = NULL)

## S3 method for class 'table'
calculate_cors(x, num_num_f = NULL, num_cat_f = NULL, cat_cat_f = NULL, max_cor = NULL)

## Default S3 method:
calculate_cors(x, num_num_f = NULL, num_cat_f = NULL, cat_cat_f = NULL, max_cor = NULL)
x | object used to select method. See more below. |
num_num_f | A |
num_cat_f | A |
cat_cat_f | A |
max_cor | A number used to indicate absolute correlation (like 1 in
A symmetric matrix A of size n x n, where n is the number of columns in x (or the number of dimensions for a table).
The value at A(i,j) is the correlation coefficient between the i-th and j-th variables.
The diagonal is set to max_cor.
When x is a data.frame, all columns of numeric type are treated as numeric variables and all columns of factor type are treated as categorical variables. Columns of other types are ignored.
When x is a matrix, it is converted to a data.frame using as.data.frame.matrix.
When x is an explainer, the tests are performed on its data element.
When x is a table, it is treated as a contingency table. Its dimensions must be named, but none of them may be named Frequency.
By default, the function calculates the p-value of a statistical test for each pair of variables (cor.test for two numeric variables, chisq.test for two factors and kruskal.test for a mixed pair).
The correlation coefficients are then calculated as -log10(p_value). Any result above 100 is treated as absolute correlation and cut to 100.
The results are then divided by 100 to fit inside [0,1].
If only numeric data was supplied, the function used is cor.test.
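The scaling described above can be sketched in base R. This is only an illustration of the arithmetic (p-value, then -log10, then a cap at 100, then division by 100), not the package's actual implementation; the helper name p_to_cor and the choice of mtcars columns are assumptions made for this example.

```r
# Hypothetical helper, not part of corrgrapher:
# maps a p-value to a score in [0, 1] as described in the text.
p_to_cor <- function(p_value) {
  score <- -log10(p_value)   # p = 0 gives Inf, which the cap turns into 100
  pmin(score, 100) / 100     # cut at 100, then rescale to [0, 1]
}

data(mtcars)
cyl_f <- factor(mtcars$cyl)
am_f  <- factor(mtcars$am)

# numeric - numeric pair: cor.test
p_num_num <- cor.test(mtcars$mpg, mtcars$wt)$p.value
# numeric - categorical pair: kruskal.test
p_num_cat <- kruskal.test(mtcars$mpg, cyl_f)$p.value
# categorical - categorical pair: chisq.test
# (warnings about small expected counts are silenced for this small data set)
p_cat_cat <- suppressWarnings(chisq.test(table(cyl_f, am_f))$p.value)

scores <- c(
  num_num = p_to_cor(p_num_num),
  num_cat = p_to_cor(p_num_cat),
  cat_cat = p_to_cor(p_cat_cat)
)
print(scores)
```

Because all three kinds of pairs go through the same p-value transformation, the resulting scores share one scale, which is the property the default measure relies on.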
Creating consistent measures of correlation that are comparable across different kinds of variables is a non-trivial task. Therefore, a user who wishes to use a custom function for calculating correlation coefficients must provide all necessary functions. Using a custom function for one case and a default for another is deliberately not supported. Naturally, users may supply copies of the default functions at their own responsibility.
The function calculate_cors determines which of the *_f parameters are required based on the data supplied.
For example, for a matrix with numeric data, only num_num_f is required.
On the other hand, for a table, only cat_cat_f is required.
All *_f parameters must be functions that accept 2 parameters (numeric or factor vectors respectively) and return a single number from [0,max_cor]. The num_cat_f function must accept the numeric argument first and the factor argument second.
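As a sketch of what a complete set of custom functions could look like for mixed data, the three helpers below each take two vectors and return one number in [0, 1], so max_cor = 1 is used. The helper names and the chosen measures (absolute Spearman correlation, correlation ratio, Cramer's V) are assumptions for illustration and are not supplied by corrgrapher; the final call is only run if the package is installed.

```r
# Hypothetical custom measures, each returning a single number in [0, 1].

# numeric - numeric: absolute Spearman rank correlation
num_num <- function(x, y) abs(cor(x, y, method = "spearman"))

# numeric - categorical (numeric first, factor second):
# correlation ratio, i.e. sqrt of R^2 from a one-way linear model
num_cat <- function(x, g) sqrt(summary(lm(x ~ g))$r.squared)

# categorical - categorical: Cramer's V from the chi-squared statistic
cat_cat <- function(a, b) {
  tab  <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k    <- min(dim(tab)) - 1
  unname(sqrt(chi2 / (sum(tab) * k)))
}

data(mtcars)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)

# All three functions are passed, since the data mixes numeric and factor columns.
if (requireNamespace("corrgrapher", quietly = TRUE)) {
  cors <- corrgrapher::calculate_cors(
    mtcars,
    num_num_f = num_num,
    num_cat_f = num_cat,
    cat_cat_f = cat_cat,
    max_cor = 1
  )
  print(round(cors, 2))
}
```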
cor.test, chisq.test, kruskal.test
data(mtcars)
# Make sure that categorical variables are factors
mtcars$vs <- factor(mtcars$vs, labels = c('V-shaped', 'straight'))
mtcars$am <- factor(mtcars$am, labels = c('automatic', 'manual'))
calculate_cors(mtcars)

# For a table:
data(HairEyeColor)
calculate_cors(HairEyeColor)

# Custom functions:
num_mtcars <- mtcars[,-which(colnames(mtcars) %in% c('vs', 'am'))]
my_f <- function(x,y) cor.test(x, y, method = 'spearman', exact=FALSE)$estimate
calculate_cors(num_mtcars, num_num_f = my_f, max_cor = 1)
corrgrapher object

This is the main function of the corrgrapher package. It does the necessary calculations and creates a corrgrapher object.
Feel free to pass it into plot, include it in a knitr report or generate a simple HTML.
corrgrapher(x, ...)

## S3 method for class 'explainer'
corrgrapher(
  x,
  cutoff = 0.2,
  values = NULL,
  cor_functions = list(),
  ...,
  feature_importance = NULL,
  partial_dependence = NULL
)

## S3 method for class 'matrix'
corrgrapher(x, cutoff = 0.2, values = NULL, cor_functions = list(), ...)

## Default S3 method:
corrgrapher(x, cutoff = 0.2, values = NULL, cor_functions = list(), ...)
x | an object to be used to select the method, which must satisfy conditions: |
... | other arguments. |
cutoff | a number. Correlations below this are treated as no correlation. Edges corresponding to them will not be included in the graph. |
values | a |
cor_functions | a named |
feature_importance | Either: |
partial_dependence | a named
If only one kind of data was used, use a list with 1 object. |
Data analysis (and creating ML models) involves many stages. For early exploration, it is useful to have a grip not only on the individual series (AKA variables) available, but also on the relations between them. Unfortunately, understanding correlations between variables proves to be difficult. The corrgrapher package aims to plot correlations between variables in the form of a graph. Each node in the graph is associated with a single variable. Variables strongly correlated with each other (positively or negatively) are placed close together, and weakly correlated ones far from each other.
A corrgrapher object. Essentially a list, consisting of the following fields:
nodes - a data.frame to pass as the nodes argument to the visNetwork function
edges - a data.frame to pass as the edges argument to the visNetwork function
pds (if x was of explainer class) - a list with 2 elements: numerical and categorical. Each of them contains an object of class aggregated_profiles_explainer used to create partial dependency plots.
data - the data used to create the object.
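To make the shape of the nodes and edges fields more concrete, here is a rough base R sketch that builds two data frames of that kind from a correlation matrix and a cutoff. It is not the package's implementation: it uses plain absolute Pearson correlations rather than the default measure, and the column names (id, label, from, to, value) follow the usual visNetwork convention, which is an assumption about what a corrgrapher object actually stores.

```r
# Illustrative only: nodes/edges data frames derived from a correlation matrix.
df <- as.data.frame(datasets::Seatbelts)[, 1:7]   # numeric columns only
cors <- abs(cor(df))                              # absolute Pearson correlations
cutoff <- 0.2                                     # same default as in the usage above

# One node per variable
nodes <- data.frame(
  id = seq_len(ncol(cors)),
  label = colnames(cors)
)

# One edge per variable pair at or above the cutoff;
# the upper triangle avoids duplicates and self-loops.
idx <- which(upper.tri(cors) & cors >= cutoff, arr.ind = TRUE)
edges <- data.frame(
  from = unname(idx[, "row"]),
  to = unname(idx[, "col"]),
  value = cors[idx],
  row.names = NULL
)

print(nodes)
print(edges)

# If visNetwork is installed, the two data frames can be handed to it directly.
if (requireNamespace("visNetwork", quietly = TRUE)) {
  graph <- visNetwork::visNetwork(nodes, edges)
}
```

Pairs below the cutoff simply produce no row in edges, which mirrors the description of the cutoff argument: such correlations are treated as no correlation and are left out of the graph.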
plot.corrgrapher, knit_print.corrgrapher, save_to_html
# convert the category variable
df <- as.data.frame(datasets::Seatbelts)
df$law <- factor(df$law)
cgr <- corrgrapher(df)
This method allows corrgrapher objects to be displayed nicely in knitr/rmarkdown documents.
## S3 method for class 'corrgrapher'
knit_print(x, ...)
x | An object of |
... | Other parameters, passed directly to |
Two objects will be displayed: a graph of correlations on the left and a plot on the right.
If x was created from an explainer, the plot will visualize the partial dependency of the currently selected variable.
Otherwise, the plot will visualize the distribution of the variable.
Visualize correlations between variables, using a previously created corrgrapher object.
## S3 method for class 'corrgrapher'
plot(x, ...)
x | a |
... | other parameters, passed directly to |
A visNetwork object: a graph in which the edges are treated as springs.
Strongly correlated variables (positively or negatively) are close to each other, and uncorrelated or weakly correlated ones are far from each other.
df <- as.data.frame(datasets::Seatbelts)[,1:7] # drop the binary target variable
cgr <- corrgrapher(df)
plot(cgr)
This method allows corrgrapher objects to be displayed nicely in the RStudio viewer.
## S3 method for class 'corrgrapher'
print(x, ...)
x | An object of |
... | Other parameters, passed directly to |
Create an interactive HTML document based on a corrgrapher object.
save_to_html(cgr, file = "report.html", overwrite = FALSE, ...)
cgr | An object of |
file | File to write content to; passed directly to |
overwrite | If |
... | Other parameters |
A file named file will be generated with 2 elements: a graph of correlations in the middle and a plot on the right.
If cgr was created from an explainer, the plot will visualize the partial dependency of the currently selected variable.
Otherwise, the plot will visualize the distribution of the variable.
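A short usage sketch, assuming the corrgrapher package is installed and reusing the Seatbelts data from the earlier examples. The output path under tempdir() is a hypothetical choice for this illustration, and overwrite is left at its documented default.

```r
# Runs only when corrgrapher is available.
if (requireNamespace("corrgrapher", quietly = TRUE)) {
  df <- as.data.frame(datasets::Seatbelts)
  df$law <- factor(df$law)              # treat the binary column as categorical
  cgr <- corrgrapher::corrgrapher(df)

  # Hypothetical output location inside the session's temporary directory
  out <- file.path(tempdir(), "report.html")
  corrgrapher::save_to_html(cgr, file = out)
  message("Report written to: ", out)
}
```

Since cgr is built from a data.frame here rather than an explainer, the plot beside the graph should show variable distributions instead of partial dependency profiles.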