1/
Enhancers and silencers are “cis-regulatory sequences” (CRS): non-coding DNA that controls when and where genes are expressed to determine cellular identity. Enhancers increase gene expression, while silencers decrease it.

2/
Transcription factors (TFs) recognize specific motifs within CRS and recruit chromatin modifying enzymes. These events lead to epigenetic changes in chromatin accessibility and histone modifications.

3/
Epigenetic data is often used to train machine learning models that predict CRS from the genome, but only half of predicted CRS are active when tested directly, as shown in the latest ENCODE paper.

These findings suggest there are other DNA sequence features of genuine CRS.

4/
To find these sequence features, we teamed up with Joe Corbo to do Massively Parallel Reporter Assays (MPRAs) in *live, developing mouse retinas*

This is super cool because we can test thousands of candidate CRS in the native cell type!

5/
We cloned a library of candidate CRS upstream of the rod-specific Rho promoter and the DsRed reporter gene. Each CRS has unique barcodes (BCs) read out w/ sequencing. If the CRS is functional, it will significantly change the activity of Rho.

Activity = # RNA BCs / # DNA BCs
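In code, the activity score for one candidate CRS might look like this (a minimal sketch; the function name and the barcode counts are illustrative, not from the paper):

```python
# Sketch: computing MPRA activity from barcode (BC) counts.
# The thread defines activity = (# RNA BCs) / (# DNA BCs);
# names and numbers here are illustrative stand-ins.

def activity(rna_bc_counts, dna_bc_counts):
    """Activity of one candidate CRS: total RNA barcode reads
    divided by total DNA barcode reads across its barcodes."""
    return sum(rna_bc_counts) / sum(dna_bc_counts)

# One CRS tagged with three unique barcodes:
print(activity([120, 95, 110], [50, 48, 52]))  # 325/150 ≈ 2.17
```

Dividing RNA counts by DNA counts normalizes for how many copies of each barcode made it into the library in the first place.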

6/
Even though every sequence has epigenetic properties of CRS and motifs for a TF called CRX, they have a wide range of activity: strong enhancers (dark blue), weak enhancers (light blue), inactive (green), and silencers (red).

This difference must be due to the DNA sequence!

7/
We computed the "predicted occupancy" of CRX -- the number and affinity of CRX motifs.

Strong enhancers and silencers both have higher predicted CRX occupancy than inactive sequences.

In other words, more CRX can either increase or decrease activity.
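One way to sketch "predicted occupancy" is as the expected number of bound TF molecules, summed over candidate motif sites. The simple thermodynamic binding model and the affinity values below are my illustrative assumptions, not the paper's exact parameterization:

```python
# Sketch: predicted occupancy = sum over motif sites of the
# probability each site is bound (captures both the number
# and the affinity of motifs). Model and values are assumptions.

def site_occupancy(rel_affinity, tf_conc=1.0):
    """Probability a single site is bound under a simple
    thermodynamic model: occupancy = c*K / (1 + c*K)."""
    return tf_conc * rel_affinity / (1.0 + tf_conc * rel_affinity)

def predicted_occupancy(site_affinities, tf_conc=1.0):
    """Total predicted occupancy of a sequence: sum of
    per-site binding probabilities."""
    return sum(site_occupancy(k, tf_conc) for k in site_affinities)

# A sequence with one strong and two weak CRX sites:
print(round(predicted_occupancy([5.0, 0.2, 0.1]), 2))  # 1.09
```

Note that several weak sites can add up to the same occupancy as one strong site, which is why this metric reflects "number and affinity" jointly.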

8/
This result means there must be sequence features that distinguish strong enhancers from silencers.

To find these features, we ran a de novo motif enrichment analysis and found motifs for several lineage-defining TFs in strong enhancers.

9/
Using the predicted occupancies of these 8 TFs, we trained a logistic regression model to classify strong enhancers vs silencers.

This model does nearly as well as a support vector machine (a black-box machine learning model) trained on 6-mer counts.
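A toy version of that 8-feature classifier can be written as a tiny pure-Python logistic regression. The random "occupancy" data below are stand-ins for the paper's measurements, and the gradient-descent trainer is just one simple way to fit the model:

```python
# Sketch: logistic regression on 8 TF-occupancy features to
# classify strong enhancers (1) vs silencers (0).
# All data here are synthetic stand-ins, not the paper's.
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit weights + bias by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j in range(len(w)):
                w[j] -= lr * err * xi[j]
            b -= lr * err
    return w, b

random.seed(0)
# Toy pattern: "enhancers" spread occupancy across all 8 TFs,
# "silencers" concentrate it in just two.
X = [[random.uniform(0.5, 1.5) for _ in range(8)] for _ in range(50)]
X += [[3.0, 3.0] + [0.0] * 6 for _ in range(50)]
y = [1] * 50 + [0] * 50

w, b = train_logreg(X, y)
predict = lambda xi: sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5
acc = sum(predict(xi) == (yi == 1) for xi, yi in zip(X, y)) / len(X)
print(f"training accuracy: {acc:.2f}")
```

The point of the comparison in the thread is interpretability: 8 named TF features are easy to reason about, while thousands of 6-mer features are not.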

10/
This result is really exciting because we went from an unbiased set of 2080 features to only 8 features with little change in model performance! That's a 260-fold reduction in features!

How do these 8 TFs differentiate strong enhancers from silencers?

11/
Based on our predicted occupancy metric, strong enhancers have a more diverse set of TF motifs, but each motif is in the minority of sequences.

Additionally, each motif occurs independently of other motifs.

12/
These results suggest that, relative to inactive sequences, strong enhancers have more motifs for a diverse set of lineage-defining TFs, but the exact identity of those TFs is not so important.

Meanwhile, silencers also have more motifs, but for a less diverse set of TFs.

13/
To capture the effect of both the number and diversity of TF motifs, we borrowed ideas from statistical mechanics to calculate the "information content" of a sequence.

This metric describes the number of *unique* ways the motifs can be ordered in a sequence.
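The metric above can be sketched as the log of a multinomial coefficient: the number of distinct ways to arrange the sequence's motifs. Whether the paper uses log base 2 or natural log I'm treating as an assumption here (base 2 below), and the motif counts are illustrative:

```python
# Sketch: "information content" as the Boltzmann-style entropy of
# a sequence's motif composition: log of the number of *unique*
# orderings of its motifs. Log base and counts are assumptions.
import math

def information_content(motif_counts):
    """log2( N! / prod(n_i!) ), where n_i is the count of motifs
    for TF i and N = sum(n_i). Both more motifs AND more motif
    diversity raise this value."""
    w = math.factorial(sum(motif_counts))
    for n in motif_counts:
        w //= math.factorial(n)
    return math.log2(w)

# Same total number of motifs (4), different diversity:
print(information_content([1, 1, 1, 1]))  # 4 distinct TFs: log2(24) ≈ 4.58
print(information_content([4]))           # 1 TF, 4 copies: log2(1) = 0
```

This is why strong enhancers (many motifs, many TFs) score high while silencers (many motifs, few TFs) score lower.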

14/
Strong enhancers tend to have higher information content than other sequences. In fact, information content alone can classify strong enhancers from other sequences!

15/
We've gone from 2080 features to 1 with only a modest reduction in performance. This suggests that most of the signal captured by our black box machine learning model is the number and diversity of lineage-defining TF motifs!

16/
To test if TF identity is important, we repeated the MPRA without the Rho promoter. The Rho promoter has a motif for a TF called NRL. If TF identity is important, then only sequences with NRL motifs will be "autonomous" without the Rho promoter.

17/
90% of autonomous sequences are enhancers, but only 39% of strong enhancers are autonomous.

Autonomous strong enhancers have higher information content but each motif, including NRL, occurs in the minority of these sequences.

Thus, information content > TF identity.

18/
Information content doesn't consider interactions between motifs, yet it still classifies well. This suggests each motif contributes independently to specifying enhancer activity.

To test this idea, we looked at sequences after mutating all CRX motifs.

Remember, every sequence had CRX motifs.

19/
Mutating CRX motifs causes both enhancers and silencers to regress towards basal levels, indicating that both classes depend on CRX in some capacity.

However, 40% of strong enhancers have low CRX dependence, meaning they stay strong enhancers without CRX motifs.

20/
Strong enhancers with low CRX dependence have lower predicted CRX occupancy. They also have higher "residual" information content (information content without CRX motifs) than strong enhancers with high CRX dependence.

21/
Strong enhancers with high and low CRX dependence have similar wild-type information content. As a result, sequences with more CRX motifs have fewer motifs for other TFs.

The number of motifs for any one TF is not important, so long as there is enough information content.

22/
These results suggest that there is no evolutionary pressure for enhancers to contain additional motifs beyond the minimum amount of information content necessary for activity!

23/
In summary, although every sequence in our assay has epigenetic properties of CRS, genuine enhancers and silencers have more TF motifs than inactive sequences. Enhancers also contain a more diverse set of motifs relative to silencers.

24/
We can capture these differences with information content. This single metric does nearly as well as 2080 features used for machine learning!

Our work illustrates how motif context differentiates enhancers and silencers targeted by the same TF.

25/
Finally, a huge shout out to David Granas, who helped clone libraries and helped me troubleshoot as I learned how to do wet lab work, and Connie Myers, who took care of the animal work!

All of my code is available on GitHub: https://github.com/barakcohenlab/CRX-Information-Content

26/26
You can follow @rfriedman22.