Mixed-type dissimilarities for distance-based learning
Computes a dissimilarity object for numerical, categorical, or mixed-type data. The function combines continuous and categorical components according to either a predefined `preset` or a user-defined custom specification.
Usage
mdist(
  x,
  new_data = NULL,
  response = NULL,
  method_cat = "tvd",
  method_num = "std",
  commensurable = TRUE,
  ncomp = NULL,
  threshold = NULL,
  preset = "custom",
  interaction = FALSE,
  prop_nn = 0.1,
  score = "ba",
  decision = "prior_corrected",
  gower_average = TRUE
)
Arguments
- x
A data frame or matrix containing the training observations. Columns can be numeric, factors, or a mixture of both.
- new_data
Optional data frame or matrix containing new observations. If supplied, distances are computed from rows of `new_data` to rows of `x`, producing a rectangular test-to-training dissimilarity matrix.
- response
Optional response variable used for response-aware categorical dissimilarities. It can be supplied as an unquoted column name or as a character string. The response column is removed from the predictors before computing distances.
- method_cat
Character string specifying the categorical-variable dissimilarity used when `preset = "custom"`. Common values include `"matching"` and `"tvd"`. Use `all_dist_method_specs()` to inspect available methods.
- method_num
Character string specifying the numerical-variable preprocessing used when `preset = "custom"`. Available options include `"none"` for no preprocessing, `"std"` for standard-deviation scaling, `"range"` for range scaling, `"robust"` for inter-quartile-range-based scaling, and `"pc_scores"` for principal-component score scaling.
- commensurable
Logical. If `TRUE`, dissimilarities are scaled so that the average contribution of each variable to the overall distance is equal to 1.
- ncomp
Integer or `NULL`. Number of principal components to retain when `method_num = "pc_scores"`. If `NULL`, all available components are used unless `threshold` is supplied and supported by the underlying method.
- threshold
Numeric or `NULL`. Optional cumulative variance threshold used when `method_num = "pc_scores"`.
- preset
Character string specifying a predefined distance specification. Available values include `"custom"`, `"gower"`, `"unbiased_dependent"`, `"u_dep"`, `"u_indep"`, `"u_mix"`, `"hl"`, `"gudmm"`, `"dkss"`, `"mod_gower"`, and `"euclidean"`. When `preset` is not `"custom"`, the preset determines arguments such as `method_cat`, `method_num`, `commensurable`, and `interaction`, and any user-supplied values for them are ignored.
- interaction
Logical. If `TRUE`, adds an interaction-aware continuous-categorical component based on local predictive separability.
- prop_nn
Numeric. Proportion of nearest neighbours used when `interaction = TRUE`.
- score
Character string specifying the score used when `interaction = TRUE`. Available values include `"ba"` for balanced accuracy and `"logloss"`.
- decision
Character string specifying the decision rule used when `score = "ba"`. The default is `"prior_corrected"`.
- gower_average
Logical; only used when `preset = "gower"`. If `TRUE`, returns the standard Gower dissimilarity averaged over variables, matching the scale of `cluster::daisy()` with `metric = "gower"`. If `FALSE`, returns the sum of per-variable Gower contributions, equivalent to multiplying the averaged Gower dissimilarity by the number of active variables.
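The scaling described for the `commensurable` argument can be illustrated with a minimal base-R sketch. This is not the package's implementation: it assumes that each variable's pairwise dissimilarities are divided by their mean, and the helpers `per_variable_dist()` and `make_commensurable()` are hypothetical names introduced only for this illustration.

```r
# Toy mixed-type data (illustrative only)
toy <- data.frame(
  height = c(150, 160, 170, 180),
  colour = factor(c("red", "blue", "red", "green"))
)

# Hypothetical helper: pairwise dissimilarity for a single variable.
# Numeric: absolute difference; factor: simple matching (0 if equal, 1 otherwise).
per_variable_dist <- function(v) {
  if (is.numeric(v)) {
    as.matrix(dist(v, method = "manhattan"))
  } else {
    outer(as.character(v), as.character(v), FUN = "!=") * 1
  }
}

# Hypothetical helper: rescale so the mean pairwise dissimilarity equals 1.
make_commensurable <- function(d) {
  d / mean(d[upper.tri(d)])
}

scaled <- lapply(toy, function(v) make_commensurable(per_variable_dist(v)))
total <- Reduce(`+`, scaled)

# Average contribution of each variable after rescaling
sapply(scaled, function(d) mean(d[upper.tri(d)]))
```

After rescaling, no variable dominates the total merely because of its measurement units or number of categories.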
Value
An object of class `"MDist"`. The object contains the computed dissimilarity in its `$distance` field, the selected `preset`, the training data, and a list of parameters describing the fitted distance specification. Square train-train dissimilarities are stored as `"dissimilarity"`/`"dist"` objects; rectangular test-to-training dissimilarities are stored as `"dissimilarity"`/`"matrix"` objects.
Details
`mdist()` is the main distance-construction function in `manydist`. It can return ordinary train-train dissimilarities or rectangular test-to-training dissimilarities when `new_data` is supplied. The resulting object stores both the dissimilarity matrix and metadata about the distance specification that was used.
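To make the shape of a rectangular test-to-training result concrete, here is a minimal base-R sketch that does not call `mdist()`. It assumes, for illustration only, that the numeric columns of both sets are scaled by the training standard deviations and that rows index the new observations; the source does not state how the package estimates its scaling parameters.

```r
train <- data.frame(a = c(1, 2, 4, 7), b = c(10, 20, 20, 40))
new   <- data.frame(a = c(3, 5),       b = c(15, 30))

# Assumption for this sketch: scale both sets by the training standard deviations.
train_sd <- sapply(train, sd)
train_s  <- sweep(as.matrix(train), 2, train_sd, "/")
new_s    <- sweep(as.matrix(new),   2, train_sd, "/")

# Manhattan distance from every new row to every training row.
rect <- apply(train_s, 1, function(tr) {
  rowSums(abs(sweep(new_s, 2, tr, "-")))
})

dim(rect)  # rows: new observations, columns: training observations
```

A matrix of this shape can be passed to distance-based predictors such as nearest-neighbour classifiers, where each new observation is compared only with the training rows.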
With `preset = "custom"`, users manually choose the numerical preprocessing, categorical dissimilarity, commensurability, and optional interaction term.
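The numerical preprocessing options named for `method_num` can be sketched in base R as follows. These are plausible textbook definitions of standard-deviation, range, and inter-quartile-range scaling, given as assumptions; the package's exact formulas (for example, whether it also centres the data) are not stated here. The function names are hypothetical.

```r
x <- c(2, 4, 4, 5, 9, 12)

# Hypothetical helpers illustrating three common scalings
scale_std    <- function(v) v / sd(v)           # standard-deviation scaling
scale_range  <- function(v) v / diff(range(v))  # range scaling
scale_robust <- function(v) v / IQR(v)          # IQR-based scaling

scaled <- sapply(
  list(std = scale_std, range = scale_range, robust = scale_robust),
  function(f) f(x)
)
scaled
```

Each scaling sets a different spread measure to 1, which changes how strongly outliers influence the resulting distances.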
The `"gower"` preset follows the usual Gower construction based on range scaling for continuous variables and matching dissimilarities for categorical variables. The `gower_average` argument controls whether the result is averaged over variables or returned as a sum of variable-wise contributions.
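The relationship between the averaged and summed forms of the Gower dissimilarity can be checked by hand with a small base-R sketch. It follows the construction described above (range-scaled absolute differences and simple matching) on complete data and is not the package's code; `gower_num()` and `gower_cat()` are hypothetical helper names.

```r
toy <- data.frame(
  mass   = c(3.0, 4.5, 6.0),
  island = factor(c("A", "B", "A"))
)

# Range-scaled absolute difference for a numeric variable.
gower_num <- function(v) as.matrix(dist(v, method = "manhattan")) / diff(range(v))
# Simple matching for a categorical variable (0 if equal, 1 otherwise).
gower_cat <- function(v) outer(as.character(v), as.character(v), "!=") * 1

parts <- list(gower_num(toy$mass), gower_cat(toy$island))

gower_sum <- Reduce(`+`, parts)           # analogue of gower_average = FALSE
gower_avg <- gower_sum / length(parts)    # analogue of gower_average = TRUE

gower_avg[1, 2]  # (1.5 / 3 + 1) / 2 = 0.75
```

The summed form is the averaged form multiplied by the number of variables, here 2.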
The `"u_dep"`, `"unbiased_dependent"`, `"u_indep"`, and `"u_mix"` presets are convenience specifications for unbiased or commensurable mixed-variable dissimilarities. The `"euclidean"` preset computes a Euclidean distance after one-hot encoding categorical variables. The `"gudmm"`, `"dkss"`, and `"mod_gower"` presets provide additional mixed-type distance constructions. Some presets currently support only train-train distances and will stop if `new_data` is supplied.
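As an illustration of the idea behind the `"euclidean"` preset, the following base-R sketch one-hot encodes a factor and then applies `dist()`. It is a sketch under assumptions: whether the preset additionally scales the numeric columns or reweights the indicator columns is not stated here.

```r
toy <- data.frame(
  len = c(1.0, 2.0, 4.0),
  sp  = factor(c("a", "b", "a"))
)

# Full one-hot encoding: one indicator column per factor level, no intercept.
onehot <- model.matrix(~ sp - 1, data = toy)
X <- cbind(len = toy$len, onehot)

# Euclidean distance on the numeric column plus the indicator columns.
d_euc <- dist(X, method = "euclidean")
d_euc
```

Rows 1 and 2 differ by 1 in `len` and in both indicator columns, so their distance is `sqrt(3)`; rows 1 and 3 share a level and differ only in `len`.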
Use `all_dist_method_specs()` to inspect the available distance components and method specifications.
Examples
# The example needs several suggested packages; run it only if all are installed
pkgs <- c("palmerpenguins", "dplyr", "tidyr", "rsample")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  data("penguins", package = "palmerpenguins")

  # Keep four numeric and three categorical variables, dropping incomplete rows
  penguins_small <- palmerpenguins::penguins |>
    dplyr::select(
      bill_length_mm, bill_depth_mm, flipper_length_mm,
      body_mass_g, species, island, sex
    ) |>
    tidyr::drop_na()

  # Gower distance on mixed-type data
  d_gower <- mdist(penguins_small, preset = "gower")
  d_gower

  # Custom mixed-type specification
  d_custom <- mdist(
    penguins_small,
    preset = "custom",
    method_cat = "matching",
    method_num = "std",
    commensurable = TRUE
  )
  d_custom

  # Train-to-new-data distances
  set.seed(123)  # make the random train/test split reproducible
  penguin_split <- rsample::initial_split(penguins_small, prop = 0.75)
  penguin_train <- rsample::training(penguin_split)
  penguin_test <- rsample::testing(penguin_split)

  d_new <- mdist(
    penguin_train,
    new_data = penguin_test,
    preset = "gower"
  )
  d_new
}
#> MDist object
#> preset : gower
#> number of training observations : 249
#> number of test observations : 84
#> number of continuous variables : 4
#> number of categorical variables : 3
#> parameters:
#> - commensurability adjustment: FALSE