How much GPU memory do you need to serve the model?

#1
by dcastm - opened

Hi,

Thank you for making this available. I'm interested in setting up my own server to serve the model, and was wondering how much memory I'd need.

Based on your experience, what's the least amount of memory required to serve the model?

Thank you!

BERTIN Project org

Hi,

This demo runs the half-precision (float16) version of the model on a 16 GB GPU provided by Hugging Face, and it needs at least that much VRAM. In my experiments with a local GPU, I noticed that usage sometimes spikes well above 16 GB and then settles back to around 13.5 GB of VRAM. To be on the safe side, I'd use a 20 GB GPU if possible, with at least another 20 GB of system RAM available. Regular RAM use is around 3.5 GB (with resident memory around 40 GB for some reason).

The full-precision (float32) model is more demanding, though, as it roughly doubles the requirements.
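As a rough back-of-the-envelope check, weight memory is just parameter count times bytes per parameter (a sketch only; actual usage is higher because of activations, the KV cache, and framework overhead, which matches the spikes above 16 GB mentioned above):

```python
# Rough VRAM estimate for the weights of a ~6B-parameter model alone.
# Real usage is higher: activations, KV cache, CUDA context, fragmentation.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1024**3

n_params = 6e9  # GPT-J-6B has roughly 6 billion parameters

print(f"float32: {weight_memory_gb(n_params, 4):.1f} GB")  # ~22.4 GB
print(f"float16: {weight_memory_gb(n_params, 2):.1f} GB")  # ~11.2 GB
print(f"int8:    {weight_memory_gb(n_params, 1):.1f} GB")  # ~5.6 GB
```

This is consistent with the numbers above: float16 weights fit in 16 GB with some headroom, float32 roughly doubles that, and int8 brings it into Colab territory.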

We are also working on an int8 version that would make it possible to load the model even on a Colab GPU, but that's ongoing work at the moment.

Please, let us know if you have any problem setting it up.
Cheers.

BERTIN Project org

If you have problems fitting the model in memory, check this out: https://huggingface.co/blog/hf-bitsandbytes-integration#is-it-faster-than-native-models?
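For intuition, here is a minimal sketch of absmax int8 quantization, the basic idea behind loading weights in 8 bits. Note this is a simplification: the bitsandbytes LLM.int8() scheme described in that post uses vector-wise scaling plus mixed-precision handling of outliers, so treat this as conceptual only:

```python
def absmax_quantize(weights):
    """Quantize a list of floats to int8 codes with one absmax scale factor."""
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to 127
    codes = [round(w / scale) for w in weights]   # each code fits in int8
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

w = [0.5, -1.0, 0.25, 0.75]
codes, scale = absmax_quantize(w)
w_hat = dequantize(codes, scale)
print(codes)  # [64, -127, 32, 95]
```

Each weight is stored in one byte instead of two (fp16) or four (fp32), at the cost of a small rounding error bounded by half the scale factor.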

BERTIN Project org

The next snippet should work as long as there is enough RAM:

```python
!pip install --quiet -i https://test.pypi.org/simple/ bitsandbytes
!pip install --quiet git+https://github.com/huggingface/transformers.git  # latest transformers
!pip install --quiet accelerate

from transformers import pipeline

name = "bertin-project/bertin-gpt-j-6B"
text = "Hola, mi nombre es"
max_new_tokens = 20

# device_map="auto" lets accelerate place layers across available devices;
# load_in_8bit=True loads the weights quantized via bitsandbytes.
pipe = pipeline(
    model=name,
    model_kwargs={"device_map": "auto", "load_in_8bit": True},
    max_new_tokens=max_new_tokens,
)
pipe(text)
```

BERTIN Project org

Did you measure the memory footprint and the token-generation slowdown against the fp16 version?
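One way to measure the slowdown would be something like the following sketch, which times a generation call and reports tokens per second (the `pipe` and `text` names refer to the snippet above; on a CUDA setup, peak VRAM could additionally be read with `torch.cuda.max_memory_allocated()`):

```python
import time

def tokens_per_second(generate, n_new_tokens: int) -> float:
    """Time a no-argument generate() call and return new tokens per second."""
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Hypothetical usage, run once per model variant and compare:
#   int8_tps = tokens_per_second(lambda: pipe(text), max_new_tokens)
# Peak VRAM after a run (in GB, CUDA only):
#   torch.cuda.max_memory_allocated() / 1024**3
```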

BERTIN Project org

I haven't yet :(

Sorry for the late reply! Thanks a lot for your help.

versae changed discussion status to closed
