LSG implementation

#2
by neil268 - opened

Hi @ccdv - really enjoyed the BART-Base PubMed fine-tuning! The models' performance stays consistent on other topics as well.

Would like to know if the Local + Sparse + Global attention (LSG) paper is available, and how to reference it?

Hi @neil268
The paper is currently under anonymous review, so it is not available right now.
You can reference this repo if you really want to reference something.

Hi @ccdv

Thanks for the quick reply :) will do that! My last question is about how LSG differs from the BigBird architecture (https://huggingface.co/blog/big-bird). Is the main difference the random key blocks?


@neil268
BigBird is a model/architecture, while LSG is more of an attention pattern that replaces vanilla attention. I use "LSG" to refer to the attention mechanism, or to models using LSG attention instead of the vanilla (full) one.

BigBird relies on 3 things (a toy sketch of this pattern follows the list):

  • block local attention
  • random block attention (sparse attention)
  • global attention: some tokens from the sequence are defined as global. The way they are selected is unclear.
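To make that pattern concrete, here is a toy mask construction (plain NumPy, not the actual BigBird code; the block size, random seed and global indices are arbitrary choices for illustration):

```python
# Toy illustration of a BigBird-style mask: block local + random block + global.
import numpy as np

seq_len, block = 16, 4                       # illustrative sizes only
n_blocks = seq_len // block
mask = np.zeros((seq_len, seq_len), dtype=bool)

# block local attention: each block attends to itself and its neighbours
for b in range(n_blocks):
    lo, hi = max(0, b - 1) * block, min(n_blocks, b + 2) * block
    mask[b * block:(b + 1) * block, lo:hi] = True

# random block attention (sparse): each block also attends to one random block
rng = np.random.default_rng(0)
for b in range(n_blocks):
    r = rng.integers(n_blocks)
    mask[b * block:(b + 1) * block, r * block:(r + 1) * block] = True

# global attention: a few tokens attend to, and are attended by, everything
global_idx = [0, 1]                          # e.g. [CLS]-like positions
mask[global_idx, :] = True
mask[:, global_idx] = True
```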

LSG relies on:

  • block local attention
  • extended local attention (sparse attention) with various selection schemas. The goal is to expand the context at minimal cost and without randomness. Each head is processed independently. The best schema is task specific, and it can also be removed.
  • global attention: global tokens are prepended to the sequence and are learnable; they are not selected from the sequence (see the sketch after this list).
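Here is a minimal sketch of the prepended learnable global tokens idea (simplified for illustration, not the actual LSG code; the module name and sizes are made up):

```python
# Simplified sketch: learnable global embeddings prepended to the token embeddings.
import torch
import torch.nn as nn

class GlobalTokenPrepender(nn.Module):
    def __init__(self, num_global: int, hidden_size: int):
        super().__init__()
        # learnable global token embeddings, independent of the input sequence
        # (in an actual conversion these would be initialized more carefully than
        # randomly, cf. the point about initialization below)
        self.global_embeddings = nn.Parameter(torch.randn(num_global, hidden_size) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden_size)
        batch = token_embeddings.size(0)
        global_part = self.global_embeddings.unsqueeze(0).expand(batch, -1, -1)
        # prepend: the attention layers then treat these positions as global
        return torch.cat([global_part, token_embeddings], dim=1)
```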

Other differences:

  • The goal of LSG is to replace vanilla attention in a wide range of models so they can process long sequences; we don't want to train anything from scratch. We want minimal training/fine-tuning for maximum performance.
  • LSG has better extrapolation capabilities (e.g. training on 4096 tokens and running inference on 16384 tokens)
  • LSG has a very small performance loss when converting an existing model to its LSG variant, thanks to the way global tokens are initialized (a minimal usage sketch follows this list).
  • LSG (with RoBERTa) is a lot faster than the HF implementation of BigBird (about +80% training speed for a similar model size). The same holds for summarization models (LED, BigBird-Pegasus, etc.)
  • LSG is more memory efficient
  • LSG converges in fewer training steps, since random attention slows BigBird down during training and also affects inference
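For completeness, a minimal usage sketch for one of the converted checkpoints; the exact model id and the need for trust_remote_code=True are assumptions here (LSG attention ships as custom modeling code):

```python
# Hedged sketch: load an LSG summarization checkpoint from the Hub and summarize.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "ccdv/lsg-bart-base-4096-pubmed"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

text = "Long PubMed article goes here..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
summary_ids = model.generate(**inputs, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```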

Thank you so much for the detailed explanation @ccdv, really looking forward to reading the paper in the near future 🤗!

neil268 changed discussion status to closed
