Understanding how big of a model can fit on your machine
One very difficult aspect when exploring potential models to use on your machine is knowing just how big of a model will fit into memory with your current graphics card (such as loading the model onto CUDA).
To help alleviate this, 🤗 Accelerate has a CLI interface through accelerate estimate-memory
. This tutorial will
help walk you through using it, what to expect, and at the end link to the interactive demo hosted on the 🤗 Hub which will
even let you post those results directly on the model repo!
Currently we support searching for models that can be used in timm
and transformers
.
This API will load the model into memory on the meta
device, so we are not actually downloading
and loading the full weights of the model into memory, nor do we need to. As a result it’s
perfectly fine to measure 8 billion parameter models (or more), without having to worry about
if your CPU can handle it!
Gradio Demos
Below are a few gradio demos related to what was described above. The first is the official Hugging Face memory estimation space, utilizing Accelerate directly:
A community member has taken the idea and expanded it further, allowing you to filter models directly and see if you can run a particular LLM given GPU constraints and LoRA configurations. To play with it, see here for more details.
The Command
When using accelerate estimate-memory
, you need to pass in the name of the model you want to use, potentially the framework
that model utilizing (if it can’t be found automatically), and the data types you want the model to be loaded in with.
For example, here is how we can calculate the memory footprint for bert-base-cased
:
accelerate estimate-memory bert-base-cased
This will download the config.json
for bert-based-cased
, load the model on the meta
device, and report back how much space
it will use:
Memory Usage for loading bert-base-cased
:
dtype | Largest Layer | Total Size | Training using Adam |
---|---|---|---|
float32 | 84.95 MB | 418.18 MB | 1.61 GB |
float16 | 42.47 MB | 206.59 MB | 826.36 MB |
int8 | 21.24 MB | 103.29 MB | 413.18 MB |
int4 | 10.62 MB | 51.65 MB | 206.59 MB |
By default it will return all the supported dtypes (int4
through float32
), but if you are interested in specific ones these can be filtered.
Specific libraries
If the source library cannot be determined automatically (like it could in the case of bert-base-cased
), a library name can
be passed in.
accelerate estimate-memory HuggingFaceM4/idefics-80b-instruct --library_name transformers
Memory Usage for loading HuggingFaceM4/idefics-80b-instruct
:
dtype | Largest Layer | Total Size | Training using Adam |
---|---|---|---|
float32 | 3.02 GB | 297.12 GB | 1.16 TB |
float16 | 1.51 GB | 148.56 GB | 594.24 GB |
int8 | 772.52 MB | 74.28 GB | 297.12 GB |
int4 | 386.26 MB | 37.14 GB | 148.56 GB |
accelerate estimate-memory timm/resnet50.a1_in1k --library_name timm
Memory Usage for loading timm/resnet50.a1_in1k
:
dtype | Largest Layer | Total Size | Training using Adam |
---|---|---|---|
float32 | 9.0 MB | 97.7 MB | 390.78 MB |
float16 | 4.5 MB | 48.85 MB | 195.39 MB |
int8 | 2.25 MB | 24.42 MB | 97.7 MB |
int4 | 1.12 MB | 12.21 MB | 48.85 MB |
Specific dtypes
As mentioned earlier, while we return int4
through float32
by default, any dtype can be used from float32
, float16
, int8
, and int4
.
To do so, pass them in after specifying --dtypes
:
accelerate estimate-memory bert-base-cased --dtypes float32 float16
Memory Usage for loading bert-base-cased
:
dtype | Largest Layer | Total Size | Training using Adam |
---|---|---|---|
float32 | 84.95 MB | 413.18 MB | 1.61 GB |
float16 | 42.47 MB | 206.59 MB | 826.36 MB |
Caveats with this calculator
This calculator will tell you how much memory is needed to purely load the model in, not to perform inference.
This calculation is accurate within a few % of the actual value, so it is a very good view of just how much memory it will take. For instance loading bert-base-cased
actually takes 413.68 MB
when loaded on CUDA in full precision, and the calculator estimates 413.18 MB
.
When performing inference you can expect to add up to an additional 20% as found by EleutherAI. We’ll be conducting research into finding a more accurate estimate to these values, and will update this calculator once done.
< > Update on GitHub