Thread by @hippopedoid, A year ago in Nature Biotechnology, Becht et al. argued that UMAP [...]

A year ago in Nature Biotechnology, Becht et al. argued that UMAP preserved global structure better than t-SNE. Now @GCLinderman and I wrote a comment saying that their results were entirely due to the different initialization choices: https://www.biorxiv.org/content/10.1101/2019.12.19.877522v1. Thread. (1/n)

UMAP does not preserve global structure any better than t-SNE when using the same initialization

One of the most ubiquitous analysis tools employed in single-cell transcriptomics and cytometry is t-distributed stochastic neighbor embedding (t-SNE) [1], used to visualize individual cells as...

Here is the original paper: https://www.nature.com/articles/nbt.4314 by @EtienneBecht @leland_mcinnes @EvNewell1 et al. They used three data sets and two quantitative evaluation metrics: (1) preservation of pairwise distances and (2) reproducibility across repeated runs. UMAP won 6/6. (2/10)
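
To make the first metric concrete, here is a minimal sketch of one way to score "preservation of pairwise distances": the Pearson correlation between pairwise distances in the original space and in the embedding. This is an illustration under assumptions, not the exact procedure of Becht et al.; the function names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distances for all unordered pairs, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def distance_preservation(high_dim, embedding):
    """Correlation between original and embedded pairwise distances."""
    return pearson(pairwise_distances(high_dim), pairwise_distances(embedding))

# Toy data (hypothetical): two pairs of nearby points, far apart from each other.
high_dim = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
# An embedding that keeps the global arrangement.
faithful = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
# An embedding that swaps two points and so distorts the global arrangement.
scrambled = [(0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (11.0, 0.0)]

score_faithful = distance_preservation(high_dim, faithful)
score_scrambled = distance_preservation(high_dim, scrambled)
```

A score near 1 means large distances in the data stay large in the embedding; an embedding that rearranges far-apart groups scores lower.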

Dimensionality reduction for visualizing single-cell data using UMAP

A benchmarking analysis on single-cell RNA-seq and mass cytometry data reveals the best-performing technique for dimensionality reduction.

UMAP and t-SNE optimize different loss functions, but the implementations used in Becht et al. also used different default initialization choices: t-SNE was initialized randomly, whereas UMAP was initialized using the Laplacian eigenmaps (LE) embedding of the kNN graph. (3/10)

Were the results due to the different loss functions or due to the different initializations? George extended the code of Becht et al. to add UMAP with random initialization and t-SNE (using FIt-SNE) with PCA initialization to the benchmark comparison. This is the result. (4/10)

Turns out, it was *entirely* due to initialization! UMAP with random initialization preserved global structure as poorly as t-SNE with random initialization, whereas t-SNE with informative (PCA) initialization performed as well as UMAP with informative (LE) initialization. (5/10)
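
The four-way comparison described above can be sketched as follows. This is not the benchmark code from the Comment: it swaps in scikit-learn's `TSNE` instead of FIt-SNE, uses the `init` parameter of `TSNE` and of `umap.UMAP` (from the umap-learn package), and treats UMAP's `"spectral"` option as the LE initialization. The names `CONFIGS`, `embed` and `run_all` are hypothetical.

```python
# Hypothetical configuration table: (method, initialization) for each variant.
CONFIGS = {
    "t-SNE, random init": ("tsne", "random"),
    "t-SNE, PCA init": ("tsne", "pca"),
    "UMAP, random init": ("umap", "random"),
    "UMAP, LE init": ("umap", "spectral"),
}

def embed(X, method, init, seed=0):
    """Return a 2-D embedding of X (array of shape n_samples x n_features)."""
    if method == "tsne":
        # Imported lazily so the table above can be inspected without the library.
        from sklearn.manifold import TSNE
        return TSNE(n_components=2, init=init, random_state=seed).fit_transform(X)
    if method == "umap":
        import umap  # provided by the umap-learn package
        return umap.UMAP(n_components=2, init=init, random_state=seed).fit_transform(X)
    raise ValueError(f"unknown method: {method!r}")

def run_all(X, seed=0):
    """Embed X with every (method, initialization) combination in CONFIGS."""
    return {name: embed(X, method, init, seed)
            for name, (method, init) in CONFIGS.items()}
```

Calling `run_all(X)` with several different seeds and scoring each embedding with the same metric separates the effect of the loss function from the effect of the initialization.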

This is particularly obvious for the reproducibility metric: of course if one runs t-SNE with random initialization and different random seeds, one can get very different global arrangements of clusters. People tend to think this does not happen with UMAP, but we show that it does. (6/10)

In our view, the results of Becht et al. do not actually support the claim that UMAP preserves global structure better than t-SNE, which is how it's been cited in the field. The real lesson is that one should not be using random initialization for either of these methods. (7/10)

This is in agreement with the recommendation to use PCA initialization (rather than random initialization) for t-SNE made in the recent paper by @CellTypist and me: https://twitter.com/hippopedoid/status/1206535867831083008. (8/10)

Just to be clear: this is *not* an attack on UMAP! I think UMAP is great :-) But I also think t-SNE is great. And there is plenty of room for further improvements and for better conceptual understanding of this whole family of embedding methods. (9/10)

But to decide which algorithm is more faithful to the single-cell data, further research is needed. Our Comment argues that the Becht et al. paper does not answer that. (10/10)