This paper was a long time coming. In no particular order, here are some of the interesting things to learn from the sequencing and analysis of 53,831 human genomes. 1/n
https://www.nature.com/articles/s41586-021-03205-y
PDF: https://www.nature.com/articles/s41586-021-03205-y.pdf
First, this paper provides an overview of a project that has now sequenced >150,000 human genomes (see status here, https://nhlbi.sph.umich.edu/report/ ).
The goal is to help understand heart, lung, blood and sleep disorders (this is the mission of @nih_nhlbi ) 2/n
The samples being studied were each donated by a generous research participant and collected by hundreds of investigators over decades. I can’t list them all in a tweet, but if you browse the author list in the paper you will recognize many smart people at all career stages. 3/n
While the first human genome took years to sequence at the turn of the century, this project has been sequencing a genome every 15-20 minutes for the past few years. That’s impressive but not nearly fast enough for what it will take to understand human disease. 4/n
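A quick back-of-the-envelope check of that pace, as a minimal sketch. It assumes round-the-clock sequencing at a steady rate (my simplification, not a claim from the paper), and the function names are just illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def genomes_per_year(minutes_per_genome: float) -> float:
    """Genomes finished per year at a steady, nonstop pace (assumed)."""
    return MINUTES_PER_YEAR / minutes_per_genome


def years_to_sequence(n_genomes: int, minutes_per_genome: float) -> float:
    """Years needed to finish n_genomes at that steady pace."""
    return n_genomes / genomes_per_year(minutes_per_genome)


if __name__ == "__main__":
    for minutes in (15, 20):
        rate = genomes_per_year(minutes)
        years = years_to_sequence(150_000, minutes)
        print(f"{minutes} min/genome: {rate:,.0f} genomes/year, "
              f"{years:.1f} years for 150,000 genomes")
```

At one genome every 15-20 minutes, that works out to roughly 26,000-35,000 genomes a year, so the >150,000 genomes mentioned above would take about 4-6 years at this pace.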
Among the hundreds of millions of variants in TOPMed, nearly half are seen in only one person.
The current list of TOPMed variants is here: https://bravo.sph.umich.edu/freeze8/hg38/
It’s really hard to assign function to these variants until they’re seen several times. 5/n
Each TOPMed participant has a few thousand unique variants, shared only with their close relatives. That is one unique variant per million bases of sequence or so.
Surprisingly, once you find one unique variant in a person, there is a good chance you will find another a few bases away! 6/n
That’s because mutations often occur in pairs! Fancy math shows that ~2% of the time, mutations change bases within 2-8 bases of each other, and another ~18% of the time, bases within 500-5,000 bp mutate together. 7/n
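To see why this is surprising, here is a rough sketch of what you would expect if unique variants were scattered independently and uniformly along the genome. This is a deliberately naive Poisson model, not the paper's method. The genome size (~3.1 billion bases) and the count of 3,000 unique variants (standing in for "a few thousand") are my assumptions:

```python
import math

GENOME_BASES = 3.1e9     # assumed approximate human genome length
UNIQUE_VARIANTS = 3_000  # assumed stand-in for "a few thousand"


def density_per_base(n_variants: float, genome_bases: float) -> float:
    """Average number of unique variants per base."""
    return n_variants / genome_bases


def prob_neighbor_within(distance_bp: int, density: float) -> float:
    """Chance that at least one other variant falls within distance_bp
    on either side, if variants land independently (Poisson model)."""
    expected_in_window = density * 2 * distance_bp
    return -math.expm1(-expected_in_window)  # 1 - exp(-x), computed stably


if __name__ == "__main__":
    density = density_per_base(UNIQUE_VARIANTS, GENOME_BASES)
    p_near = prob_neighbor_within(8, density)
    print(f"about {density * 1e6:.2f} unique variants per million bases")
    print(f"chance of a neighbor within 8 bp under independence: {p_near:.1e}")
    print(f"the ~2% reported for 2-8 bp clusters is {0.02 / p_near:,.0f}x that")
```

Under these assumptions, independent mutations would almost never land within a few bases of each other, so the ~2% figure points to mutations arising together rather than by coincidence.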
If you marveled at how Google completes your thoughts, you will immediately see how reading all these genomes might power the world’s best DNA auto-completion service. 8/n
It’s the TOPMed imputation server:
…https://imputation.biodatacatalyst.nhlbi.nih.gov/# !
TOPMed base auto-completion (imputation if you prefer the jargon) has been used on >10M human genomes — one every few seconds if you track the imputation server counters. (But, who would?) 9/n
As an example, the paper shows how this auto-completion capability can help study rare cancer-associated variants in UKB participants.
Soon enough, I expect we will be able to compare those results with direct sequencing of all of @uk_biobank ... 10/n
The lab work was a joint effort of @debnick60 at UW, @gabriel_stacey at Broad, @mczody at NYGC, Eric Boerwinkle at Baylor and colleagues @GenomeInstitute ... sequences were processed together and then cleaned using machine learning. Kudos to Hyun Min Kang and Jonathon LeFaive. 11/n
You won’t be surprised to hear that many of these sequencing mavens have been moonlighting sequencing covid genomes. After all, data in this paper was generated years ago... 12/n
Why did it take so long? Well, because while our methods for sequencing genomes in the lab are now super fast, the processes for analyzing, cleaning, interpreting and sharing data are still being tuned. 13/n
Most investigators will build on TOPMed results through dbSNP, the variant server and imputation server.
But if you want to study the genetics of heart, lung, blood and sleep disorders directly, the data is also in dbGaP, here:
https://www.ncbi.nlm.nih.gov/gap/?term=TOPMed
... 14/n
Yes, we all wish dbGaP access was simpler. Here is a Nature piece on the issue:
https://www.nature.com/articles/d41586-021-00331-5
Facilitating responsible data sharing is a problem worth solving. @eric_lander called a previous iteration of dbGaP “a write-only database”. Maybe he will help find a better way! 15/n
Genes that encode DNA- and RNA-binding proteins, or that are involved in translation initiation or RNA splicing and processing, show less damaging variation. That’s probably important to keeping us alive. 16/n
Disease genes in general seem to be depleted in damaging variation. Quote: “Genes associated with human disease in COSMIC (31% depletion), GWAS catalogue (around 8% depletion), OMIM (4% depletion) and ClinVar (4% depletion) contained fewer pLOF variants than expected.” 17/n
For those who like solving puzzles, the authors show how raw sequence data can be used to assemble structural variants and decode complex genotypes in the CYP2D6 locus. Many more such explorations are possible. Just requires patience, creativity and some coding ability. 18/n
Well, that’s now 19 tweets that no one (except you, dear reader) will read. So I should stop... if you liked this thread with a taste of the paper, you will probably enjoy reading the whole thing.
19/n
https://www.nature.com/articles/s41586-021-03205-y.pdf#page8
I will end by thanking my TOPMed co-authors for the 5-year journey that got us to this paper. And I wish that they all procrastinate/meander a lot less than me as they write and publish the next TOPMed advances and discoveries. 20/20
For more on the paper, look up @rdhernand @OcOutlier