Longformer was an important paper because transformer models like BERT use an attention mechanism with O(N^2) time complexity, where N is the sequence length.

Longformer's attention has O(N) time complexity, so it can process much longer input sequences. https://medium.com/dair-ai/longformer-what-bert-should-have-been-78f4cd595be9
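
Longformer actually combines sliding-window (local) attention with a few globally-attending tokens; the toy NumPy sketch below only illustrates the sliding-window part and how it drops the cost from O(N^2) to O(N*w) for a fixed window size w. All names and sizes here are made up for illustration, not the official implementation:

```python
import numpy as np

def full_attention(q, k, v):
    # q, k, v: (N, d) -- every token attends to every other token: O(N^2)
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (N, N) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ v                                   # (N, d)

def sliding_window_attention(q, k, v, w=2):
    # Each token attends only to tokens within +/- w positions: O(N * w)
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)        # local window
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)          # (window,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out

n, d = 8, 4
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
print(full_attention(q, k, v).shape)             # (8, 4)
print(sliding_window_attention(q, k, v).shape)   # (8, 4)
```

The point of the sketch: the inner score computation per token touches only ~2w+1 keys instead of all N, which is why the overall cost grows linearly with sequence length.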
"Shortformer", uses two methods to improve speed and performance:

1- Staged training: start training with short input sequences, then switch to longer sequences in later iterations (a toy schedule is sketched further below).

2- Position-Infused Attention: apply position embeddings just before computing attention, i.e. add them to the queries and keys rather than to the token embeddings at the input (see the sketch after this list).
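
The sketch below is my minimal, single-head reconstruction of the position-infused-attention idea, assuming sinusoidal position embeddings and made-up weight matrices; it is not the paper's actual code. Positions are infused into the queries and keys right before the attention scores are computed, while the values (and thus the layer output) stay position-free:

```python
import numpy as np

def sinusoidal_positions(n, d):
    # Standard sinusoidal position embeddings, shape (n, d)
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def position_infused_attention(x, w_q, w_k, w_v):
    # x: token representations WITHOUT position information, shape (n, d)
    n, d = x.shape
    p = sinusoidal_positions(n, d)
    q = (x + p) @ w_q        # positions infused into queries ...
    k = (x + p) @ w_k        # ... and keys, just before attention
    v = x @ w_v              # values remain position-free
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v       # output carries no position information either

n, d = 6, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
print(position_infused_attention(x, w_q, w_k, w_v).shape)  # (6, 8)
```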
"Staged training" reminds me of "Progressive GANs", which trains the GAN model starting with low quality images (going easy on the model), and going towards high quality images.

https://machinelearningmastery.com/how-to-train-a-progressive-growing-gan-in-keras-for-synthesizing-faces/
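
As a toy illustration of what such a length curriculum could look like, here is a small sketch; the stage lengths and step counts are invented for illustration and are not the schedule from the Shortformer paper:

```python
# Each stage: (sequence_length, number_of_updates) -- illustrative values only
stages = [
    (128, 1000),
    (512, 1000),
    (2048, 1000),
]

def training_schedule(stages):
    """Yield the sequence length to use at every training step."""
    for seq_len, num_steps in stages:
        for _ in range(num_steps):
            yield seq_len

# The model first sees length-128 batches, then 512, then 2048.
schedule = training_schedule(stages)
print(next(schedule))   # 128
```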