Clustering of mixed data

This page provides access to material for benchmarking « ready-to-use » clustering methods for mixed data. It accompanies a paper published in Scientific Reports (Nature Portfolio), entitled

« Head-to-head comparison of clustering methods for heterogeneous data. A simulation-driven benchmark »

Gregoire Preud’homme (1, 2), Kevin Duarte (1), Kevin Dalleau (3), Claire Lacomblez (1), Emmanuel Bresso (3), Malika Smaïl-Tabbone (2, 3), Miguel Couceiro (3), Marie-Dominique Devignes (2, 3), Masatake Kobayashi (1, 2), Olivier Huttin (1, 2), João Pedro Ferreira (1, 2), Faiez Zannad (1, 2), Patrick Rossignol (1, 2), Nicolas Girerd (1, 2)

(1) Université de Lorraine, Centre d’Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, France.
(2) F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, France.
(3) Université de Lorraine, CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Vandoeuvre-lès-Nancy, France.

Simulated datasets

For each tested design, 1000 simulated datasets were generated (see the paper for the method used). Each set of 1000 datasets can be downloaded as a single rds file (formatted for use with R packages), named according to the design parameters.
The default parameters are as follows :

Population size : 300 ;
Number of clusters : 6 ;
Number of continuous variables : 4 ;
Degree of relevance of continuous variables : mild ;
Proportion of relevant continuous variables : 100% (for a total of 4 variables);
Total number of categorical variables : 4 ;
Degree of relevance of categorical variables : mild ;
Proportion of relevant categorical variables : 100% (for a total of 4 variables).
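
As an illustration of how each design changes one parameter while keeping the others at their defaults, here is a minimal Python sketch. The class name `DesignParameters`, its field names and the value 600 are hypothetical and chosen only for this example. The values actually tested in each design are those given in the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DesignParameters:
    """Default simulation parameters listed above (field names are hypothetical)."""
    population_size: int = 300
    n_clusters: int = 6
    n_continuous: int = 4
    continuous_relevance: str = "mild"
    prop_relevant_continuous: float = 1.0   # 100% of the 4 continuous variables
    n_categorical: int = 4
    categorical_relevance: str = "mild"
    prop_relevant_categorical: float = 1.0  # 100% of the 4 categorical variables


defaults = DesignParameters()

# Hypothetical variation: only the population size changes, and 600 is an
# illustrative value, not one taken from the benchmark.
example_design = replace(defaults, population_size=600)

# Number of relevant continuous variables implied by the defaults.
n_relevant_continuous = round(defaults.prop_relevant_continuous * defaults.n_continuous)
```

Freezing the dataclass keeps the defaults immutable, so each variation is an explicit copy rather than an in-place change.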

In the table below, each design is listed with the values taken by its varying parameter, each with a link for downloading the corresponding dataset (average size 10 MB).

Design number	Varying Parameter	Value 1	Value 2	Value 3
Design 1	Population size
Design 2	Number of clusters
Design 3	Number of continuous variables with 2 categorical variables
Design 3bis	Number of continuous variables with 4 categorical variables
Design 3ter	Number of continuous variables with 8 categorical variables
Design 4	Degree of relevance of continuous variables
Design 5	Degree of relevance of categorical variables
Design 6	Proportion of relevant continuous variables for a total of 10 continuous variables
Design 7	Proportion of relevant categorical variables for a total of 10 categorical variables

Tested algorithms

Distance-based

- Gower dissimilarity : cluster package (daisy function).
- Unsupervised Extra Tree dissimilarity : UET package (not yet published, available at gitlab address), build_randomized_tree_and_get_sim function.
- Partitioning Around Medoids (PAM) : cluster package (pam function).
- Hierarchical Ascendant Clustering : stats package (part of base R, hclust function).
- K-Prototypes : clustMixType package (kproto function).
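
To make the distance-based approach concrete, the following Python sketch computes a Gower dissimilarity matrix for mixed data and then partitions the observations around medoids. It is a simplified illustration, not the code used in the benchmark: the function names `gower_matrix` and `best_medoids` and the toy data are hypothetical, continuous differences are scaled by the variable range, categorical variables use simple matching, and the medoid search is exhaustive rather than the build/swap procedure of PAM. The daisy function in R supports further options (weights, ordinal variables, missing values) that are ignored here.

```python
from itertools import combinations


def gower_matrix(rows, is_categorical):
    """Pairwise Gower dissimilarities for rows of mixed values."""
    n, p = len(rows), len(is_categorical)
    # Range of each continuous column, used to scale differences into [0, 1].
    ranges = []
    for j in range(p):
        if is_categorical[j]:
            ranges.append(None)
        else:
            col = [r[j] for r in rows]
            ranges.append(max(col) - min(col))
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j in range(p):
                if is_categorical[j]:
                    # Simple matching: 0 if the categories agree, 1 otherwise.
                    total += 0.0 if rows[a][j] == rows[b][j] else 1.0
                elif ranges[j] > 0:
                    # A constant column (range 0) contributes nothing.
                    total += abs(rows[a][j] - rows[b][j]) / ranges[j]
            d[a][b] = d[b][a] = total / p
    return d


def best_medoids(d, k):
    """Exhaustive k-medoids search; only feasible for tiny inputs."""
    n = len(d)

    def cost(medoids):
        # Sum of each observation's dissimilarity to its closest medoid.
        return sum(min(d[i][m] for m in medoids) for i in range(n))

    return min(combinations(range(n), k), key=cost)


# Toy data (hypothetical): two continuous variables and one categorical variable.
rows = [
    (1.0, 10.0, "a"),
    (1.5, 12.0, "a"),
    (9.0, 40.0, "b"),
    (10.0, 38.0, "b"),
]
d = gower_matrix(rows, is_categorical=[False, False, True])
medoids = best_medoids(d, k=2)
# Each observation is labelled with the index of its closest medoid.
labels = [min(medoids, key=lambda m: d[i][m]) for i in range(len(rows))]
```

On this toy data the first two rows end up in one cluster and the last two in the other. The same dissimilarity matrix could equally be passed to a hierarchical method, which is how Gower and UET dissimilarities combine with PAM and hierarchical clustering in the list above.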

Model-based

- Kamila : kamila package (kamila function).
- Latent Class Analysis (LCA) : poLCA package (poLCA function).
- Latent Class Model (LCM) : VarSelLCM package (VarSelCluster function).
- Clustering by Mixture Modeling (MixMod) : Rmixmod package (mixmodCluster function).
