This page provides access to material for benchmarking "ready-to-use" clustering methods for mixed data. It accompanies a paper recently published in Nature Scientific Reports by
Gregoire Preud’homme 1, 2, Kevin Duarte 1, Kevin Dalleau 3, Claire Lacomblez 1, Emmanuel Bresso 3, Malika Smaïl-Tabbone 2,3, Miguel Couceiro 3, Marie-Dominique Devignes 2,3, Masatake Kobayashi 1, 2, Olivier Huttin 1, 2, João Pedro Ferreira 1, 2, Faiez Zannad 1, 2, Patrick Rossignol 1, 2, Nicolas Girerd 1, 2
1. Université de Lorraine, Centre d’Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, France.
2. F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, France.
3. Université de Lorraine, CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Vandoeuvre-lès-Nancy, France.
Simulated datasets
For each tested design, 1000 simulated datasets were generated (see the paper for the method used). Each set of 1000 datasets can be downloaded as a single RDS file (R's native serialization format, readable with readRDS), named according to the design parameters.
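As a minimal sketch of how such a file might be read in R: `readRDS` restores a single R object from an RDS file. The file name and the internal structure (assumed here to be a list of data frames) are assumptions made for illustration only. The example first writes a small stand-in file so that it runs without any download.

```r
# Stand-in for a downloaded file. The real file names and the internal
# structure of the stored object are assumptions, not taken from this page.
set.seed(1)
toy_sets <- lapply(1:3, function(i) {
  data.frame(
    x = rnorm(5),
    g = factor(sample(c("a", "b"), 5, replace = TRUE))
  )
})
path <- file.path(tempdir(), "hypothetical_design.rds")
saveRDS(toy_sets, path)

# Reading a real download works the same way: one call, one R object.
datasets <- readRDS(path)

# Inspect the object before use, since its layout is not described here.
class(datasets)
length(datasets)
str(datasets[[1]])
```

For a real download, only `path` needs to change; checking `class()` and `str()` first avoids relying on a guessed structure.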
Default parameters are as follows:
- Population size: 300;
- Number of clusters: 6;
- Number of continuous variables: 4;
- Degree of relevance of continuous variables: mild;
- Proportion of relevant continuous variables: 100% (for a total of 4 variables);
- Total number of categorical variables: 4;
- Degree of relevance of categorical variables: mild;
- Proportion of relevant categorical variables: 100% (for a total of 4 variables).
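The defaults listed above can be kept as a named list in R, so that a design is described by overriding only the parameters it varies. This is a sketch for bookkeeping: the element names are hypothetical, and the overridden value is illustrative rather than one of the tested designs.

```r
# Default design parameters as listed above (element names are hypothetical).
default_design <- list(
  population_size       = 300,
  n_clusters            = 6,
  n_continuous          = 4,
  relevance_continuous  = "mild",
  prop_relevant_cont    = 1.0,   # 100% of the 4 continuous variables
  n_categorical         = 4,
  relevance_categorical = "mild",
  prop_relevant_cat     = 1.0    # 100% of the 4 categorical variables
)

# A design changes one or more parameters and inherits the other defaults.
# The value 600 is purely illustrative.
design <- modifyList(default_design, list(population_size = 600))

str(design)
```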
In the table, the values of the varying parameters are given for each design, together with a link for downloading the corresponding dataset (average size 10 MB).
Tested algorithms
Distance-based
- Gower dissimilarity: cluster package (daisy function).
- Unsupervised Extra Tree dissimilarity: UET package (not yet published, available at gitlab address; use the build_randomized_tree_and_get_sim function).
- Partitioning Around Medoids (PAM): cluster package (pam function).
- Hierarchical Ascendant Clustering: stats package (part of base R; hclust function).
- K-Prototypes: clustMixType package (kproto function).
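The distance-based tools listed above can be chained as in the following sketch on a small random mixed dataset. The toy data, the number of clusters and the linkage method are illustrative assumptions, not the settings of the benchmark. The UET dissimilarity is left out because its package is not published. The K-prototypes call is guarded because clustMixType is not shipped with R.

```r
library(cluster)  # provides daisy() and pam()

set.seed(42)
n <- 60
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  c1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  c2 = factor(sample(c("yes", "no"), n, replace = TRUE))
)
k <- 3  # illustrative number of clusters

# Gower dissimilarity handles numeric and factor columns together.
d <- daisy(toy, metric = "gower")

# PAM applied to the precomputed dissimilarity.
pam_labels <- pam(d, k = k, diss = TRUE)$clustering

# Hierarchical clustering on the same dissimilarity. The linkage is an
# illustrative choice, not necessarily the one used in the benchmark.
hc_labels <- cutree(hclust(d, method = "average"), k = k)

# K-prototypes works on the raw mixed data frame, not on a dissimilarity.
kproto_labels <- NULL
if (requireNamespace("clustMixType", quietly = TRUE)) {
  kproto_labels <- clustMixType::kproto(toy, k = k, verbose = FALSE)$cluster
}

# Cross-tabulate two of the partitions.
table(pam = pam_labels, hclust = hc_labels)
```

Since the toy data contain no cluster structure, the partitions are not expected to agree; the sketch only shows how the calls fit together.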
Model-based
- Kamila: kamila package (kamila function).
- Latent Class Analysis (LCA): poLCA package (poLCA function).
- Latent Class Model (LCM): VarSelLCM package (VarSelCluster function).
- Clustering by Mixture Modeling (MixMod): Rmixmod package (mixmodCluster function).
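Two of the model-based functions listed above are sketched below on random toy data. The data, the number of clusters and the number of restarts are illustrative assumptions. poLCA accepts categorical indicators only, so it is run here on the categorical columns alone; this is a simplification and not necessarily how the benchmark handled continuous variables. Both calls are guarded because the packages are not shipped with R.

```r
set.seed(42)
n <- 120
lev <- c("a", "b", "c")
con_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
cat_df <- data.frame(
  c1 = factor(sample(lev, n, replace = TRUE)),
  c2 = factor(sample(lev, n, replace = TRUE)),
  c3 = factor(sample(lev, n, replace = TRUE)),
  c4 = factor(sample(lev, n, replace = TRUE))
)
k <- 2  # illustrative number of clusters

# KAMILA takes the continuous and categorical parts as separate data frames.
kamila_labels <- NULL
if (requireNamespace("kamila", quietly = TRUE)) {
  fit_kam <- kamila::kamila(
    conVar = con_df, catFactor = cat_df,
    numClust = k, numInit = 10
  )
  kamila_labels <- fit_kam$finalMemb
}

# poLCA expects indicators coded as integers 1, 2, ..., so the factors
# are converted to their integer codes first.
lca_labels <- NULL
if (requireNamespace("poLCA", quietly = TRUE)) {
  int_df <- as.data.frame(lapply(cat_df, as.integer))
  fit_lca <- poLCA::poLCA(
    cbind(c1, c2, c3, c4) ~ 1,
    data = int_df, nclass = k, verbose = FALSE
  )
  lca_labels <- fit_lca$predclass
}
```

Each fitted object yields one cluster label per row (`finalMemb` for kamila, `predclass` for poLCA), which makes the partitions directly comparable with those of the distance-based methods.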