Discrepancy in HW description

#13
by mfajcik - opened

Hello,

This model card says:

Training Configuration
This model was trained on 8 A100-80GBs for about 2 days using the MosaicML Platform. The model was trained with sharded data parallelism using FSDP and used the LION optimizer.
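(For concreteness, the kind of setup that sentence describes, sharded data parallelism via PyTorch FSDP plus the LION optimizer, can be sketched roughly as below. This is not MosaicML's actual training code, and the package choice, model handle, and hyperparameters are illustrative assumptions.)

```python
# Minimal sketch (not MosaicML's training stack): wrap a causal LM in PyTorch
# FSDP and pair it with the LION optimizer. Hyperparameters are placeholders.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM
from lion_pytorch import Lion  # assumption: the open-source lion-pytorch package

def setup(model_name: str = "mosaicml/mpt-7b"):
    # One process per GPU; `torchrun --nproc_per_node=8 ...` supplies the env vars.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = AutoModelForCausalLM.from_pretrained(
        model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
    )
    # FSDP shards parameters, gradients, and optimizer state across the GPUs,
    # which is what lets a 7B model plus long-context activations fit in memory.
    model = FSDP(model, device_id=torch.cuda.current_device())

    # LION optimizer, as named in the model card; lr/weight_decay are guesses.
    optimizer = Lion(model.parameters(), lr=1e-5, weight_decay=0.1)
    return model, optimizer
```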

However, the blog post says:

Most open-source language models can only handle sequences with up to a few thousand tokens (see Figure 1). But with the MosaicML platform and a single node of 8xA100-40GB, you can easily finetune MPT-7B to handle context lengths up to 65k!

So, just to avoid confusion: is it possible to finetune this model with ~60k context lengths on 8x A100-40GB?

Cheers,
Martin

Mosaic ML, Inc. org

Hi, Martin. Thanks for catching that error in our blog post! And sorry for the confusion!
The model card is correct, though. You need pretty hefty GPU memory to train on a 65k context, and we used A100-80GB cards throughout. We haven't profiled this yet, but I would expect that with 8 A100-40GB cards you could finetune on closer to 30k, give or take.

Mosaic ML, Inc. org

The blog has been updated. Thanks again for catching that!

atrott changed discussion status to closed
