Thread by @emilydoesastro, What's the best way to find open clusters new and old in [...]

What's the best way to find open clusters new and old in Gaia data?

I did a deep dive into the world of clustering algorithms to find out! And the things we learned along the way were fascinating...

Check out the thread below for a broad overview of the paper!

What are open clusters?

When gas clouds collapse to form stars, those stars are often born together in tight groups - known in astronomy as open clusters!

Below: the Trapezium Cluster in the Orion nebula. It's still surrounded by the gas it formed from! [NASA/ESA]

Stars are the engines of the universe and understanding how they evolve is essential. By studying groups of very similar stars in open clusters, we can learn significantly more about early steps of stellar evolution than by studying stars one at a time.

And this is where my research comes in!

The Gaia satellite has revolutionised galactic astronomy with precise astrometry & photometry on over a billion stars. All the open clusters we knew about before as well as hundreds of new ones are hiding in Gaia's data, but extracting them from the data requires sophisticated techniques.

Clustering algorithms are perfect for this! Given a dataset the algorithm will try to pick out the clusters for you.

E.g. below: 2 blobs

apply HDBSCAN

it highlights the blobs!

Now, do this on 1.3 BILLION stars?

There are dozens of potential clustering algorithms to try, all with pros and cons - and they've never been compared side-by-side on Gaia data, so they aren't well tested for this use.

I trawled the literature for contenders. A lot of algorithms can't cope with the huge size of the Gaia dataset (they're too slow), or with the fact that fewer than ~0.1% of stars are even *in* an open cluster - so the algorithms we picked are the cream of the crop for this!

First up is DBSCAN! Set a density threshold for your dataset and it will find all clusters denser than that threshold.
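Here's a rough sketch of the idea using scikit-learn's `DBSCAN` on made-up 2D data. The density threshold is set by two parameters, `eps` and `min_samples`; the values below are hand-picked for this toy example and are not the ones used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Toy data: 300 sparse "field" points plus one dense blob of 60 points.
field = rng.uniform(0.0, 10.0, size=(300, 2))
blob = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(60, 2))
points = np.vstack([field, blob])

# eps and min_samples together define the density threshold: a point is a
# "core" point if at least min_samples points (itself included) lie within eps.
labels = DBSCAN(eps=0.15, min_samples=10).fit_predict(points)

# Label -1 means "noise", i.e. the point is below the density threshold.
field_noise_frac = float((labels[:300] == -1).mean())
blob_found_frac = float((labels[300:] != -1).mean())
print(f"field labelled noise: {field_noise_frac:.0%}, blob recovered: {blob_found_frac:.0%}")
```

Anything sparser than the threshold is labelled noise, which is why the choice of `eps` matters so much.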

It has seen lots of success already finding open clusters (e.g. Castro-Ginard et al. papers). The hard part is choosing the density threshold automatically.

Their threshold method works well but is quite time intensive, so I also came up with my own method. It fits a basic two-component model for randomly distributed points to the nearest neighbour distance distribution of stars, and uses this to set DBSCAN's threshold.

The model is based on a derivation I found for random points that originally comes from analytical work on... ideal gases?! January, when I worked on this, was wild. Definitely doing another thread on this, it was such a journey in itself

[below: black/stars, red/model]
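For a flavour of what "a model for randomly distributed points" means, here's a simplified sketch: a single component in 2D, not the two-component model from the paper. For uniformly random points with density rho, the textbook nearest neighbour distance distribution in 2D is p(r) = 2*pi*rho*r*exp(-pi*rho*r^2), with mean 1/(2*sqrt(rho)). The point count and box size are arbitrary choices.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)

n_points, box = 20_000, 1.0
points = rng.uniform(0.0, box, size=(n_points, 2))
density = n_points / box**2  # points per unit area

# k=2 because each point's closest match is itself (distance 0).
# boxsize makes the box periodic, so edge points aren't short of neighbours.
tree = cKDTree(points, boxsize=box)
nn_dist = tree.query(points, k=2)[0][:, 1]

# Mean of p(r) = 2*pi*rho*r*exp(-pi*rho*r^2) is 1 / (2*sqrt(rho)).
expected_mean = 1.0 / (2.0 * np.sqrt(density))
measured_mean = float(nn_dist.mean())
print(f"measured {measured_mean:.5f} vs predicted {expected_mean:.5f}")
```

Stars in a real cluster sit closer together than this random-field prediction, which is the kind of contrast a density threshold can exploit.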

Our next contender is HDBSCAN. A new & improved version of DBSCAN, it trades the global threshold for intelligent hierarchical clustering that can adapt to different density levels in different areas of a dataset.

HDBSCAN's hierarchical representation of data around a cluster:

I think it's a lot easier to use: you just have to specify a minimum cluster size (e.g. 20 stars.) And it should be more sensitive, too!

But the catch is that it's much more sensitive to random dense groupings of stars, making lots of false positives that have to be thrown away.

HDBSCAN has been used on Gaia data to find moving groups in some very nice papers by Kounkel & Covey (2019) and Kounkel et al. (2020), but hasn't been used purely to find open clusters before.

Last up is Gaussian mixture models (Gaussian MMs)! Instead of looking at density levels in data like the last two, Gaussian MMs try to model a dataset as a combination of Gaussian distributions. That means that *all* stars get put into a Gaussian cluster of sorts, and you have to tune the algorithm to put open clusters into single Gaussians that you can then extract with some criteria (e.g. picking out the things that are small enough).

Cantat-Gaudin et al. used it to find 41 new open clusters last year!

It's also closely related to K Means, which powers UPMASK - the algorithm behind the hugely successful Cantat-Gaudin et al. (2018) paper that made the first big census of open clusters with Gaia data.
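As a rough illustration of the Gaussian MM approach, here's a sketch using scikit-learn's `GaussianMixture` on made-up 2D data. The extraction criterion (keep the component with the smallest covariance determinant) is a hypothetical stand-in for "small enough", not the criterion from the paper, and the number of components is chosen by hand.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Toy data: a broad "field" of 500 points plus a tight cluster of 100 points.
field = rng.normal(loc=(0.0, 0.0), scale=3.0, size=(500, 2))
cluster = rng.normal(loc=(2.0, 2.0), scale=0.1, size=(100, 2))
points = np.vstack([field, cluster])

# Every point is assigned to one of the Gaussians, field stars included.
gmm = GaussianMixture(n_components=2, n_init=5, max_iter=500, random_state=0)
labels = gmm.fit_predict(points)

# Hypothetical extraction criterion: keep the component with the smallest
# covariance determinant, i.e. the most compact Gaussian.
sizes = np.linalg.det(gmm.covariances_)
compact = int(np.argmin(sizes))
compact_mean = gmm.means_[compact]
print(f"most compact component sits at {np.round(compact_mean, 2)}")
```

On real data you don't know how many components to ask for, which is a big part of the tuning effort.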

SHOWDOWN TIME!

To test them out, we picked 100 fields at random (with overlap) containing open clusters. That gave us 100 clusters to study intensively plus 1285 others that the literature lists in those fields, distributed across the galaxy to give us a range of test scenarios!

By far the hardest part of the paper was cleaning up the results of the algorithms. They all produced a lot of false positives (especially HDBSCAN or DBSCAN at high sensitivity) that have to be removed.

Cutting candidates with clearly erroneous sizes or velocity dispersions helped somewhat with obvious cases, but that still left us with a huge number of dodgy candidates (tens of thousands per algorithm). We needed another way to cut through the noise.

I wanted something like a signal to noise ratio but for clusters. High values would mean your object must be there, while low values indicate a possible false positive.

Cantat-Gaudin et al. 2019 did something similar, but it was based on density bins and only using parallaxes and proper motions. I wanted to try and make something that used all dimensions of the data and could act more autonomously, not requiring binning.

The result was one of the coolest parts of the paper! I use the nearest neighbour distribution of stars in a cluster (analogous to the density of the cluster) and compare it to surrounding field stars, using all 5 dimensions of the astrometric data and doing a statistical test.

Real clusters are denser and should be incompatible with being drawn from the field; false positives are just groups of field stars compatible with being from the field.

See below: the top is a real cluster (blue is the cluster, black is field stars.)

It's to the left (denser!) and has a significance of 20.52 sigma (it's very good), while the lower example (a false positive) is just like the field stars and has 0 sigmas of significance.
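Here's a toy sketch of that kind of test. The thread doesn't say which statistical test is used, so the Mann-Whitney U test below is an assumption picked for illustration, as are the fake 5D data, the helper name `significance`, and the shortcut of measuring each star's nearest neighbour distance within the whole catalogue.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import mannwhitneyu, norm

rng = np.random.default_rng(5)

# Toy 5D "astrometry": 5000 uniform field stars plus a compact 50-star cluster.
field = rng.uniform(0.0, 1.0, size=(5000, 5))
cluster = rng.normal(loc=0.5, scale=0.02, size=(50, 5))
stars = np.vstack([field, cluster])

# Nearest neighbour distance of every star (k=2 skips the star itself).
nn_dist = cKDTree(stars).query(stars, k=2)[0][:, 1]

def significance(member_idx, field_idx):
    """One-sided test: are the members' neighbour distances smaller than the field's?"""
    result = mannwhitneyu(nn_dist[member_idx], nn_dist[field_idx], alternative="less")
    # Express the p-value as a Gaussian sigma, floored at 0.
    return max(float(norm.isf(result.pvalue)), 0.0)

real_idx = np.arange(5000, 5050)                      # the injected cluster
fake_idx = rng.choice(5000, size=50, replace=False)   # 50 random field stars
field_idx = np.setdiff1d(np.arange(5000), fake_idx)   # field stars in neither candidate

sigma_real = significance(real_idx, field_idx)
sigma_fake = significance(fake_idx, field_idx)
print(f"real cluster: {sigma_real:.1f} sigma, random field stars: {sigma_fake:.1f} sigma")
```

The injected cluster should come out highly significant, while the random handful of field stars should look just like the field.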

In this way, we can provide astrometry-driven probabilities of whether or not a cluster is real, which is super awesome!

Almost all papers only publish a binary yes/no on whether clusters are real, but this method makes it possible to quantify uncertainties with the data itself!

(More tweets coming in a moment, I hit the thread length limit. We're well over halfway through explaining the paper though!)

So, results time!!

Firstly, from our intensive study of 100 clusters, we found roughly what we expected.

DBSCAN is a good method for retrieving open clusters. The existing literature way of determining its threshold was particularly reliable. It could find up to ~60% of objects we thought it should be able to.

HDBSCAN had better sensitivity, finding 82% of objects but with more false positives.

Last were Gaussian MMs - unfortunately they were much slower than the other two and only found 33% of the real objects out of the 100.

Next up, we looked at all 1385 alleged clusters in the fields and crossmatched our results against them! The results as a function of distance and size are really cool.

In summary: HDBSCAN is the best across almost all distances. It's also the best at all sizes of cluster.

DBSCAN gets the best results when you try lots of different thresholds and combine the results at the end, but that requires more processing and is a bit of a pain, and still isn't as good as HDBSCAN.

Gaussian MMs are only good at big clusters in the way we set them up; finding smaller objects with them is possible but gets even more time intensive. They're not a good choice for a large-scale blind search because of this.

So: HDBSCAN is the best overall, but you have to be extremely careful to reduce false positives. The significance test thing I mentioned would throw away over 95% of candidates in high sensitivity runs (yes, only a ~5% true positive rate - which is really, really small).

Working with my supervisor (Sabine Reffert) we set out on a big journey to trial and improve methodologies and learnt SO MUCH! And it's now on the arXiv for you to mull over right now! There's tons more here about clustering algorithms and open clusters. https://arxiv.org/abs/2012.04267

We were encouraged by the reviewer to look for new open clusters in our data and found 41 good quality new candidates! Which is really cool - the open cluster census is incomplete and there's still more to get from Gaia data. I'll do a thread about them later today or tomorrow!
