Data Science for Health


Benchmarking of clustering tools for mixed data

Recently accepted in Nature/Scientific Reports, this study examines the performance of various clustering strategies for mixed data, i.e., data containing both continuous and categorical variables. It is a joint project between the LORIA CAPSID and CHRU Nancy CIC-P teams.

« We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performances were assessed by Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). »
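
As an illustration of the evaluation metric mentioned in the abstract, the following is a minimal sketch of how an Adjusted Rand Index can be computed between the simulated (true) cluster labels and the labels returned by a clustering tool. It is written in plain Python for readability and is not the code used in the study, which relied on R tools; the function name `adjusted_rand_index` and the toy label lists are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Illustrative helper (hypothetical name), not taken from the benchmark code.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label lists must have the same length")
    n = len(labels_true)

    # Contingency counts: items shared by (true cluster, predicted cluster).
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of item pairs grouped together in both, in truth, in prediction.
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())
    total_pairs = comb(n, 2)

    if total_pairs == 0:
        return 1.0
    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case (e.g. one single cluster on both sides).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example (made-up labels): the same partition with renamed clusters.
truth = ["a", "a", "a", "b", "b", "b"]
recovered = [2, 2, 2, 1, 1, 1]
score = adjusted_rand_index(truth, recovered)  # 1.0: labels differ, grouping is identical
```

An ARI of 1 indicates that the recovered clusters match the simulated ones exactly, whereas values close to 0 indicate agreement no better than chance. The index does not depend on how clusters are named, which makes it suitable for comparing tools that number their clusters arbitrarily.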

Simulated datasets and links to the tested tools are available here.

Further contributions and feedback on existing “ready-to-use” tools for mixed data are welcome.

Contact Nicolas Girerd at CIC-P or Marie-Dominique Devignes at LORIA.

Resources for pharmacogenomics analyses


PGxLOD is a semantic web resource (Linked Open Data) intended to host pharmacogenomic knowledge extracted from various sources (PharmGKB, the literature and Electronic Health Records).

PGxLOD uses the PGxO ontology. A full description of the motivation, implementation and instantiation of PGxO and PGxLOD is available in [1].

PGxLOD and PGxO were developed during the ANR PraktikPharma project (ANR-15-CE23-0028).

Go to the PGxLOD main page.

Rare diseases

Integration of rare diseases, genes and phenotypes from Orphanet: Orphamine

Orphamine is a tool for visualizing data from Orphanet: 8496 diseases, 1360 clinical signs and 3129 genes. It integrates cross-references with OMIM, ICD-10, HGNC, UniProtKB and GeneAtlas.

Go to the main Orphamine website.


Explainable Machine Learning


Knowledge graphs