Benchmarking of clustering tools for mixed data
Recently accepted in Nature/Scientific Reports, this study examines the performance of various clustering strategies for mixed data, i.e. data with both continuous and categorical variables. It is a joint project between the LORIA CAPSID and CHRU Nancy CIC-P teams.
« We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performances were assessed by Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). »
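For readers unfamiliar with the evaluation metric named above, here is a minimal sketch of how the Adjusted Rand Index can be computed from two label vectors. It is an illustration in plain Python, not the code used in the study (which relied on R tools); the function name `adjusted_rand_index` and its arguments are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Illustrative ARI between a reference partition and a clustering result."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("Both label vectors must have the same length.")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two individuals: partitions trivially agree

    # Contingency table cells and its row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of pairs of individuals grouped together in each view.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2

    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels: the ARI ignores how clusters are named.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

An ARI of 1 indicates that the clustering recovers the reference partition exactly (up to label names), while values near 0 correspond to chance-level agreement.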
Simulated datasets and links to the tested tools are available here.
Further contributions and feedback on existing « ready-to-use » tools for mixed data are welcome.
Resources for pharmacogenomics analyses
PGxLOD is a semantic web resource (Linked Open Data) intended to host pharmacogenomic knowledge extracted from various sources (PharmGKB, the literature and Electronic Health Records).
PGxLOD and PGxO were developed during the ANR PraktikPharma project (ANR-15-CE23-0028).
Go to PGxLOD main page.
Integration of rare diseases, genes and phenotypes from Orphanet: Orphamine
Orphamine is a tool for visualizing data from Orphanet: 8496 diseases, 1360 clinical signs and 3129 genes. It integrates cross-references with OMIM, ICD-10, HGNC, UniProtKB and GeneAtlas.
Go to the main Orphamine website.
Explainable Machine Learning