Here is a (long overdue) thread on some of Dexiong's work.

This work provides powerful data representations for biological sequences, connecting kernel methods and convolutional networks.

I'm intentionally taking an intuitive point of view. https://twitter.com/ljacob/status/1339590995936079884
In lots of different contexts, it can be helpful to learn a function that predicts some property of a biological sequence.
Is this piece of DNA a transcription factor binding site?

To which family does this protein belong?

Will a bacterium with this genome resist an antibiotic?
In order to learn such a prediction function, you first need to design a relevant representation of sequences as vectors.

That is, one such that similar sequences also have similar properties (to be predicted).
There are lots of ways to do this, but a very generic approach relies on k-mers (words of length k).

For example, each DNA sequence can be represented by a vector with (up to) 4ᵏ entries, each entry indicating whether the sequence contains a particular k-mer.
This representation is:

- Flexible, because it does not require aligning sequences, and it handles sequences of different lengths and even bags of sequences.

- Expressive, because it captures point mutations, translocations and insertions (short or long).
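
To make the k-mer presence representation concrete, here is a tiny Python sketch (my own toy code, not Dexiong's; the function name and parameters are just illustrative):

```python
from itertools import product

def kmer_presence_vector(seq, k=3, alphabet="ACGT"):
    """Binary vector with one entry per possible k-mer (4**k for DNA),
    indicating whether that k-mer occurs in the sequence."""
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    kmers_in_seq = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [1 if kmer in kmers_in_seq else 0 for kmer in all_kmers]

# No alignment needed, and sequences of different lengths map to vectors of the same size.
v1 = kmer_presence_vector("AATCGGT")
v2 = kmer_presence_vector("AATCGGTTTACG")
print(len(v1), sum(v1))  # 64 entries (4**3), a handful of them set to 1
```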
But sometimes what matters for prediction is not the presence of an exact k-mer in the sequence, but the presence of one of several similar k-mers. This is often summarized by a motif, represented by a logo.
Dexiong's CKN-seq can be thought of as an extension of the k-mer presence representation to motifs.

Instead of a finite vector with one entry per k-mer, each sequence is represented by a function, i.e. an infinite vector with one "entry" per motif.
Each entry quantifies how similar the corresponding motif is to k-mers contained in the sequence.

Here is a (schematic) example of what happens for a sequence containing AATC: a large entry for AATC, and exponentially decreasing entries for motifs that are similar to AATC.
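
Roughly, one entry of that infinite vector could be computed like this (a toy sketch with my own scoring: an exponential decay in the number of mismatches, pooled over positions; the actual model uses something like a Gaussian comparison of one-hot encodings):

```python
import math

def motif_entry(seq, motif, alpha=1.0):
    """Toy version of one 'entry': similarity between the motif and the k-mers
    of the sequence, pooled over all positions, decaying exponentially with
    the number of mismatches."""
    k = len(motif)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return sum(math.exp(-alpha * sum(a != b for a, b in zip(kmer, motif)))
               for kmer in kmers)

seq = "GGAATCGG"
for motif in ["AATC", "AATG", "CCTA"]:
    print(motif, round(motif_entry(seq, motif), 3))
# AATC gets the largest entry; motifs further away from AATC get exponentially smaller ones.
```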
Even though this representation is infinite-dimensional, it can still be used in learning tasks through the so-called kernel trick.

But this approach doesn't scale well, and doesn't tell us which motifs are predictive.
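
For intuition on the kernel trick (still my toy similarity, not the exact kernel of the papers): because each sequence embedding is a pooled sum of per-k-mer embeddings, the inner product between two of these infinite vectors collapses to a double sum over pairs of k-mers, which is perfectly computable:

```python
import math

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def sequence_kernel(s1, s2, k=4, alpha=1.0):
    """Inner product between the two (infinite) representations, computed as a
    double sum of pairwise k-mer similarities: no infinite vector is ever built."""
    sim = lambda a, b: math.exp(-alpha * sum(x != y for x, y in zip(a, b)))
    return sum(sim(a, b) for a in kmers(s1, k) for b in kmers(s2, k))

print(sequence_kernel("GGAATCGG", "TTAATCAA"))
```

A kernel method (SVM, kernel ridge regression...) only ever needs these pairwise values, which is why the infinite dimension is not a problem; the cost, however, grows with the number of training sequences, hence the scaling issue.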
Instead, Dexiong exploited a projection of the infinite vectors onto the finite-dimensional subspace spanned by a few motifs (zⱼ).

Each sequence Φ(x) is now represented by the (finite) coordinates of this projection Ψ_Z(x), reflecting how much its k-mers look like these particular motifs.
Optimizing both the set of motifs that generates the representation and the predictive function over this representation selects a finite set of predictive sequence motifs.
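
A minimal sketch of that projection (Nyström-style), reusing the toy k-mer similarity from above; the motif set Z is hand-picked here, whereas CKN-seq learns it:

```python
import math
import numpy as np

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def sim(a, b, alpha=1.0):
    # toy Gaussian-like similarity between two k-mers (same as in the sketches above)
    return math.exp(-alpha * sum(x != y for x, y in zip(a, b)))

def project_sequence(seq, motifs, k=4, alpha=1.0):
    """Coordinates of the (infinite) embedding of `seq` on the subspace spanned
    by the motifs z_j: Psi_Z(x) = K_ZZ^{-1/2} K_Z(x)."""
    # pooled similarities between the sequence's k-mers and each motif
    k_zx = np.array([sum(sim(a, z, alpha) for a in kmers(seq, k)) for z in motifs])
    # Gram matrix of the motifs themselves
    k_zz = np.array([[sim(zi, zj, alpha) for zj in motifs] for zi in motifs])
    # inverse square root of K_ZZ (assumed well-conditioned in this toy example)
    w, v = np.linalg.eigh(k_zz)
    return v @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-8))) @ v.T @ k_zx

motifs = ["AATC", "ATCG", "GGCC"]             # learned in the real model
print(project_sequence("GGAATCGGT", motifs))  # one coordinate per motif
```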
Learning a task-specific representation is a known feature of convolutional neural networks. And indeed, CKN-seq is formally a particular form of such networks over sequences.
All this intuition corresponds to a single-layer CKN-seq. Multilayer versions are also possible but they bring little improvement when working with short sequences.
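
Seen as a network, the single layer is roughly: one-hot encode the sequence, compare each k-mer window to the motif filters through a Gaussian non-linearity (the convolution + activation), pool over positions, then apply the K_ZZ^{-1/2} linear map. A rough numpy sketch (the shapes, names and omitted normalizations are my simplifications):

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """(length, 4) one-hot encoding of a DNA sequence."""
    return np.array([[c == a for a in ALPHABET] for c in seq], dtype=float)

def ckn_seq_layer(seq, motif_filters, alpha=0.5):
    """Simplified single-layer CKN-seq forward pass.
    motif_filters: array of shape (n_motifs, k, 4)."""
    x = one_hot(seq)
    n_motifs, k, _ = motif_filters.shape
    # 'convolution' + Gaussian activation + sum pooling over positions
    pooled = np.zeros(n_motifs)
    for i in range(len(seq) - k + 1):
        patch = x[i:i + k]                                    # (k, 4) window
        d2 = ((patch - motif_filters) ** 2).sum(axis=(1, 2))  # distance to each filter
        pooled += np.exp(-alpha * d2)
    # linear correction by K_ZZ^{-1/2} so the output follows the kernel geometry
    dz2 = ((motif_filters[:, None] - motif_filters[None, :]) ** 2).sum(axis=(2, 3))
    k_zz = np.exp(-alpha * dz2)
    w, v = np.linalg.eigh(k_zz)
    return v @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-8))) @ v.T @ pooled

# filters initialized from actual 4-mers for the example; in CKN-seq they are learned
filters = np.stack([one_hot(m) for m in ["AATC", "ATCG", "GGCC"]])
print(ckn_seq_layer("GGAATCGGT", filters))
```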
Dexiong developed a version of CKN-seq that includes non-contiguous motifs, and is therefore able to account for short indels.

He showed that the resulting model can be written as a form of recurrent neural network.

https://papers.nips.cc/paper/2019/file/d60743aab4b625940d39b3b51c3c6a78-Paper.pdf
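
To see why gaps naturally lead to a recurrence (my own schematic recursion, not the exact equations of the paper): when a motif is allowed to match non-contiguously, the score of matching its first j characters can be updated position by position, with a decay each time a gap is inserted; that is exactly the shape of an RNN update.

```python
import math

def gapped_motif_score(seq, motif, alpha=1.0, gap_decay=0.5):
    """Schematic recurrence for matching a motif with gaps:
    h[j] accumulates the score of matching the first j motif characters as a
    (possibly gapped) subsequence of the prefix read so far. At each new
    character, a partial match either 'waits' (gap, multiplied by gap_decay)
    or is extended by matching the next motif character."""
    k = len(motif)
    h = [1.0] + [0.0] * k                # h[0]: the empty match
    for c in seq:
        # update longest prefixes first so each character extends a match at most once
        for j in range(k, 0, -1):
            match = math.exp(-alpha * (c != motif[j - 1]))
            h[j] = gap_decay * h[j] + h[j - 1] * match
    return h[k]

print(gapped_motif_score("AAGTC", "AATC"))  # AATC present with a one-character gap: high score
print(gapped_motif_score("AAGGG", "AATC"))  # AATC (nearly) absent: much lower score
```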
He also extended his method to graph-structured data.

Here is a thread by @julienmairal on the topic: https://twitter.com/julienmairal/status/1239145184463552514
You can find Dexiong's software and papers on all these projects (and more) on his website: https://dexiong.me/software/ 