LSG implementation

#2
by neil268 - opened

Hi @ccdv - really enjoyed the BART-Base PubMed fine-tuning! The models' performance stays consistent on other topics as well.

Would like to know if the Local + Sparse + Global attention (LSG) paper is available, and how to reference it?

Hi @neil268
The paper is currently under anonymous review, so it is not available right now.
You can reference this repo if you really want to reference something.

Hi @ccdv

Thanks for the quick reply :) will do that! My last question is about how LSG differs from the BigBird architecture (https://huggingface.co/blog/big-bird). Is the main difference the random key blocks?


@neil268
BigBird is a model/architecture, while LSG is more of an attention pattern that replaces vanilla attention. I use "LSG" to refer to the attention mechanism, or to models using LSG attention instead of the vanilla (full) one.

BigBird relies on 3 things (a toy sketch of this pattern follows the list):

  • block local attention
  • random block attention (sparse attention)
  • global attention: some tokens from the sequence are defined as global. The way they are selected is unclear.
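To make that pattern concrete, here is a toy mask construction (plain NumPy, not the actual BigBird code; the block size, random seed and global indices are arbitrary choices for illustration):

```python
# Toy illustration of a BigBird-style mask: block local + random block + global.
import numpy as np

seq_len, block = 16, 4                       # illustrative sizes only
n_blocks = seq_len // block
mask = np.zeros((seq_len, seq_len), dtype=bool)

# block local attention: each block attends to itself and its neighbours
for b in range(n_blocks):
    lo, hi = max(0, b - 1) * block, min(n_blocks, b + 2) * block
    mask[b * block:(b + 1) * block, lo:hi] = True

# random block attention (sparse): each block also attends to one random block
rng = np.random.default_rng(0)
for b in range(n_blocks):
    r = rng.integers(n_blocks)
    mask[b * block:(b + 1) * block, r * block:(r + 1) * block] = True

# global attention: a few tokens attend to, and are attended by, everything
global_idx = [0, 1]                          # e.g. [CLS]-like positions
mask[global_idx, :] = True
mask[:, global_idx] = True
```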

LSG relies on:

  • block local attention
  • extended local attention (sparse attention) with various selection schemas. The goal is to expand the context at minimal cost and without randomness. Each head is processed independently. The best schema is task specific, and it can also be removed.
  • global attention: global tokens are prepended to the sequence and are learnable; they are not selected from the sequence (see the sketch after this list).
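Here is a minimal sketch of the prepended learnable global tokens idea (simplified for illustration, not the actual LSG code; the module name and sizes are made up):

```python
# Simplified sketch: learnable global embeddings prepended to the token embeddings.
import torch
import torch.nn as nn

class GlobalTokenPrepender(nn.Module):
    def __init__(self, num_global: int, hidden_size: int):
        super().__init__()
        # learnable global token embeddings, independent of the input sequence
        # (in an actual conversion these would be initialized more carefully than
        # randomly, cf. the point about initialization below)
        self.global_embeddings = nn.Parameter(torch.randn(num_global, hidden_size) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden_size)
        batch = token_embeddings.size(0)
        global_part = self.global_embeddings.unsqueeze(0).expand(batch, -1, -1)
        # prepend: the attention layers then treat these positions as global
        return torch.cat([global_part, token_embeddings], dim=1)
```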

Other differences:

  • The goal of LSG is to replace vanilla attention in a wide range of models so they can process long sequences; we don't want to train anything from scratch. We want minimal training/fine-tuning for maximum performance.
  • LSG has better extrapolation capabilities (e.g. training on 4096 tokens and running inference on 16384 tokens)
  • LSG has a very small performance loss when converting an existing model to its LSG variant, thanks to the way global tokens are initialized (a minimal usage sketch follows this list).
  • LSG (with RoBERTa) is a lot faster than the HF implementation of BigBird (about +80% training speed for a similar model size). The same holds for summarization models (LED, BigBird-Pegasus, etc.)
  • LSG is more memory efficient
  • LSG converges in fewer training steps, since random attention slows BigBird down during training and also affects inference
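For completeness, a minimal usage sketch for one of the converted checkpoints; the exact model id and the need for trust_remote_code=True are assumptions here (LSG attention ships as custom modeling code):

```python
# Hedged sketch: load an LSG summarization checkpoint from the Hub and summarize.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "ccdv/lsg-bart-base-4096-pubmed"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

text = "Long PubMed article goes here..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
summary_ids = model.generate(**inputs, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```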

Thank you so much for the detailed explanation @ccdv, really looking forward to reading the paper in the near future 🤗!

neil268 changed discussion status to closed
