This is the result of a long-term effort by @roryjgibb and @Gfalbery, with many contributions from @danjbecker, @L_Brierley, @taddallas, @evanaeskew, @maxjfarrell, @angie_rasmussen, @SadieRyan, @arsweeny, and @wormmaps. Also I made a figure in it, but otherwise it's cool.
CLOVER is a "Four-in-one value pack of host-virus association data" - https://github.com/viralemergence/clover - where we aggregated four datasets, and reconciled them.
This means that yes, every single node is a valid entry in the NCBI taxonomy, which we accomplished through the use of taxize, NCBITaxonomy.jl, and hours of manual validation.
This is massively important, because as we show in the preprint, datasets that each follow their own naming convention underestimate how much they have in common, sometimes dramatically. EID2 and Shaw look like they have 41% of viruses in common; once the cleaning is done, it's 94%.
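To make the naming problem concrete, here is a minimal sketch of how overlap can be measured before and after reconciliation. Everything in it is an assumption for illustration: the two toy datasets, the hand-written synonym table, and the choice of the Jaccard index as the overlap measure. It is not the CLOVER pipeline, which relied on taxize, NCBITaxonomy.jl, and manual validation, and the preprint may measure overlap differently.

```python
# Toy illustration: naming differences hide real overlap between datasets.
# The synonym table and both datasets are hypothetical, NOT real NCBI lookups.

CANONICAL = {
    "Rabies virus": "Rabies lyssavirus",
    "Ebola virus": "Zaire ebolavirus",
}


def reconcile(names):
    """Map each name to its canonical form; unknown names are kept as-is."""
    return {CANONICAL.get(name.strip(), name.strip()) for name in names}


def shared_fraction(a, b):
    """Jaccard index: names found in both sets, divided by names found in either."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


dataset_a = ["Rabies virus", "Ebola virus", "Hendra virus"]
dataset_b = ["Rabies lyssavirus", "Zaire ebolavirus", "Hendra virus", "Nipah virus"]

raw_overlap = shared_fraction(dataset_a, dataset_b)
clean_overlap = shared_fraction(reconcile(dataset_a), reconcile(dataset_b))

print(f"before reconciliation: {raw_overlap:.2f}")
print(f"after reconciliation:  {clean_overlap:.2f}")
```

In this toy case the two lists go from sharing one name out of six to sharing three out of four, simply because synonyms now point to the same entry.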
Now that these datasets are reconciled, we can dig a little bit more into their structure! This is a t-SNE embedding of the graph, and it shows that the sampling effort was not evenly distributed across network clusters. Merging datasets brings us closer to the full picture.
CLOVER contains 1081 hosts (mammals), 829 viruses, and 5494 interactions. So although CLOVER does not cover a lot more diversity, the depth of our coverage is far better than that of the most exhaustive dataset currently available.
I'd like to end this thread on a few thoughts.
1. Yes, viral ecology synthesis is just viral ecology with additional steps - but these steps are extremely important, because they give a better understanding of both the network-level and node-level structure of what happens.
2. A lot of this work is enabled by new tools, but it also relies on communication between people - @viralemergence succeeds because we reconcile not only data, but also virology, public health, ecology, network science, and biodiversity. Cleaning data is a human effort.
3. CLOVER is, deep down, an act of service - we did not build it to make claims of novelty or of larger numbers, or to grandstand about data liberation; we did this so analyses can draw on more data and more integration, without the hundreds of hours of overhead needed to clean them.
4. I will end the thread here: CLOVER is also an invitation. I'll allow myself a bad metaphor: a clover lawn has more biodiversity, demands less upkeep, and persists with fewer external resources. Our hope is that CLOVER will have the same effect.
So if you want to use it to build a project, reach out because we are eager to help - one of the mandates of @viralemergence is to distribute better data and craft better tools. Come on and have some fun.
You can follow @tpoi.