Training loss curves

#1
by patrickvonplaten - opened

Hey @ccdv - amazing job on this BART-Base implementation! That's a really impressive performance :-)

With @Stancld we are working on LongT5: https://arxiv.org/pdf/2112.07916.pdf and are currently fine-tuning the model on PubMed - do you by any chance have a link to a loss curve we could look into?

Hi @patrickvonplaten

Sadly I don't have the loss curve (empty tensorboard), but I have the final training loss and training params (eval_loss and eval metrics are on the test set).
Training is done using the summarization script (with a few tweaks) from Transformers.
Note that the model was trained at 4096 length for 8 epochs, then converted and fine-tuned for 1 epoch at 16384 length to reduce overall computation.
To convert BART-base and increase the maximum model length, I rely on a conversion script that can also convert BERT, RoBERTa and DistilBERT checkpoints (the attention is replaced and global tokens are added).
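
For reference, here is a minimal loading sketch (not the conversion script itself): it assumes the already converted 16384-length PubMed checkpoint attached to this model page, and trust_remote_code=True is needed because the LSG attention is implemented as custom modeling code.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint name (the converted 16384-length PubMed model); adjust if needed.
model_name = "ccdv/lsg-bart-base-16384-pubmed"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

# Summarize a (long) PubMed article.
article = "..."  # your long input text
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=16384)
summary_ids = model.generate(**inputs, max_length=512, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))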

The objective of the LSG approach is to significantly reduce computation costs (tight budget) for standard and encoder-decoder models on long sequences, using plain PyTorch.
You can get roughly 2x training speed compared to Longformer/BigBird if you convert RoBERTa to 4096 length (replacing the attention mechanism only).
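
To illustrate where the savings come from, here is a toy block-local attention in plain PyTorch (not the actual LSG code, which also adds sparse and global tokens): each token attends only to the tokens in its own block, so the score tensor scales with seq_len * block instead of seq_len^2.

import torch

def block_local_attention(q, k, v, block=256):
    # q, k, v: (batch, seq_len, dim); seq_len is assumed divisible by block
    b, n, d = q.shape
    q = q.view(b, n // block, block, d)
    k = k.view(b, n // block, block, d)
    v = v.view(b, n // block, block, d)
    # Scores are (batch, n_blocks, block, block) instead of (batch, n, n)
    scores = torch.einsum("bgid,bgjd->bgij", q, k) / d ** 0.5
    probs = scores.softmax(dim=-1)
    out = torch.einsum("bgij,bgjd->bgid", probs, v)
    return out.reshape(b, n, d)

x = torch.randn(1, 4096, 64)
print(block_local_attention(x, x, x).shape)  # torch.Size([1, 4096, 64])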

Final losses:
On 4096 length:

{
    "epoch": 8.0,
    "train_loss": 1.727705581620244,
    "train_samples": 119924,
    "eval_gen_len": 353.9225, 
    "eval_loss": 1.5759657621383667,
    "eval_samples": 6658
}

On 16384 length:

{
    "epoch": 1.0,
    "train_loss": 1.5442171282598995,
    "train_samples": 119924,
    "eval_gen_len": 337.5673, 
    "eval_loss": 1.505071759223938,
    "eval_samples": 6658
}

Training params:

  • total batch_size: 32
  • lr: 8e-5
  • warmup_ratio: 0.1
  • epochs (4096 length): 8
  • epochs (16384 length): 1
  • lr linear decay
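
For anyone reproducing this, a rough sketch of how these settings map onto Seq2SeqTrainingArguments from Transformers; the per-device batch size / gradient accumulation split and the output name are assumptions, not the exact command used.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="lsg-bart-base-4096-pubmed",  # assumed output name
    learning_rate=8e-5,
    warmup_ratio=0.1,
    num_train_epochs=8,                      # 1 for the 16384-length stage
    per_device_train_batch_size=4,           # assumed split: 4 x 8 accumulation = 32 total
    gradient_accumulation_steps=8,
    lr_scheduler_type="linear",              # linear decay after warmup
    predict_with_generate=True,
)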

This model only has 145M params; a different batch size/lr may be required for larger models, especially for LongT5 (the smallest model has about 220M params, the largest 3B).
The setup is similar (fewer epochs) for the ArXiv summarization dataset, see lsg-bart-base-16384-arxiv.

Super cool - thanks a lot for posting this! cc @Stancld
