Thread by @tpoi, Do you like viruses? Do you like data that are thoroughly cleaned [...]

Do you like viruses? Do you like data that are thoroughly cleaned and enable really cool science? Do you like that to be open? Read on, because @viralemergence's latest project, CLOVER, is now available as a preprint: https://www.biorxiv.org/content/10.1101/2021.01.14.426572v1

Data proliferation, reconciliation, and synthesis in viral ecology

The fields of viral ecology and evolution have rapidly expanded in the last two decades, driven by technological improvements, and motivated by efforts to discover potentially zoonotic wildlife...


This is the result of a long-term effort by @roryjgibb and @Gfalbery, with many contributions from @danjbecker, @L_Brierley, @taddallas, @evanaeskew, @maxjfarrell, @angie_rasmussen, @SadieRyan, @arsweeny, and @wormmaps. Also I made a figure in it, but otherwise it's cool.

CLOVER is a "Four-in-one value pack of host-virus association data" - https://github.com/viralemergence/clover - where we aggregated four datasets, and reconciled them.

This means that yes, every single node is a valid entry in the NCBI taxonomy, which we accomplished through the use of taxize, NCBITaxonomy.jl, and hours of manual validation.

This is massively important, because as we show in the preprint, datasets that each use their own naming conventions underestimate how much they have in common, sometimes dramatically. EID2 and Shaw look like they have 41% of viruses in common; it's 94% once the cleaning is done.
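To make the naming problem concrete, here is a minimal sketch of how harmonizing names can change the apparent overlap between two lists. Everything in it is an assumption made for illustration: the virus names, the hand-written synonym map, and the overlap measure (share of the smaller list found in the other one) are not taken from CLOVER, EID2, Shaw, or the preprint.

```python
# Toy illustration only: the names, the synonym map, and the overlap
# measure are made up for this sketch, not taken from CLOVER or the preprint.

def percent_shared(a, b):
    """Percentage of the smaller set that also appears in the other set."""
    a, b = set(a), set(b)
    return 100 * len(a & b) / min(len(a), len(b))

def harmonize(names, synonyms):
    """Lower-case each name, then map it to a canonical label if one is known."""
    cleaned = (n.strip().lower() for n in names)
    return {synonyms.get(n, n) for n in cleaned}

dataset_a = ["Rabies virus", "West Nile virus", "Hendra virus", "Zika virus"]
dataset_b = ["rabies lyssavirus", "WNV", "Hendra henipavirus", "Zika virus", "Nipah virus"]

# Hypothetical hand-curated map from dataset-specific labels to one label.
synonyms = {
    "rabies virus": "rabies lyssavirus",
    "wnv": "west nile virus",
    "hendra virus": "hendra henipavirus",
}

raw = percent_shared(dataset_a, dataset_b)
clean = percent_shared(harmonize(dataset_a, synonyms), harmonize(dataset_b, synonyms))
print(f"raw: {raw:.0f}% shared, harmonized: {clean:.0f}% shared")
```

In this toy case only one name matches before cleaning, and all four names in the smaller list match afterwards. A real reconciliation would resolve names against NCBI rather than a hand-written dictionary.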

Now that these datasets are reconciled, we can dig a little bit more into their structure! This is a t-SNE embedding of the graph, and it shows that the sampling effort was not evenly distributed across network clusters. Merging datasets brings us closer to the full picture.

CLOVER contains 1081 hosts (mammals), 829 viruses, and 5494 interactions. So although CLOVER does not cover a lot more diversity, the depth of our coverage is far better than that of the currently most exhaustive dataset available.
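For a rough sense of scale, the three counts above can be turned into a few simple network summaries. This is only a sketch: the choice of summaries (mean links per node and connectance) is an assumption here, not necessarily how the preprint measures depth of coverage, and the function name is hypothetical.

```python
def summarize(n_hosts, n_viruses, n_links):
    """Basic summaries of a bipartite host-virus network from its counts."""
    return {
        "viruses_per_host": n_links / n_hosts,       # mean links per host
        "hosts_per_virus": n_links / n_viruses,      # mean links per virus
        "connectance": n_links / (n_hosts * n_viruses),  # realized / possible links
    }

# Counts quoted in the thread.
stats = summarize(n_hosts=1081, n_viruses=829, n_links=5494)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With these counts, each host has about 5 known viruses on average, each virus about 6.6 known hosts, and roughly 0.6% of all possible host-virus pairs are recorded.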

I'd like to end this thread on a few thoughts.
1. Yes, viral ecology synthesis is just viral ecology with additional steps - but these steps are extremely important, because they give a better understanding of both the network-level and node-level structure of what happens.

2. A lot of this work is enabled by new tools, but it also relies on communication between people - @viralemergence succeeds because we reconcile not only data, but also virology, public health, ecology, network science, and biodiversity. Cleaning data is a human effort.

3. CLOVER is, deep down, an act of service - we did not build it to make claims of novelty or larger numbers, or to grandstand about data liberation; we did this so analyses can draw on more data and more integration, without the hundreds of hours of overhead needed to clean them.

4. I will end the thread here: CLOVER is also an invitation. I'll allow myself a bad metaphor: a clover lawn has more biodiversity, demands less upkeep, and persists with fewer external resources. Our hope is that CLOVER will have the same effect.

So if you want to use it to build a project, reach out because we are eager to help - one of the mandates of @viralemergence is to distribute better data and craft better tools. Come on and have some fun.
