Hardware requirements for inference?

#9
by spartanml - opened

Where can I find the hardware requirements for this model? (Specifically, can it run on 3060/12GB)?

Together org

Theoretically, GPT-JT cannot run on a single 3060 with 12GB: the model weights alone take up ~12GB, which leaves no headroom for inference. I'd recommend VRAM >= 16GB. An alternative is to split the model across multiple 3060 GPUs with accelerate:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# Load the model on CPU in float16, so the weights match the float16 memory estimates below
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-JT-6B-v1", torch_dtype=torch.float16
)

# Work out how much of the model each GPU can hold, keeping whole GPTJBlocks on one device
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16',
    low_zero=False,
)

# Assign each module to a device within that memory budget
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16'
)

# Spread the model across the available GPUs
model = dispatch_model(model, device_map=device_map)

I'm using this code and inference still takes ~12 seconds. I'm running on 2x NVIDIA T4. For inference I call model.generate; do you know if I need to do anything else to make it use the GPU?

Do you have a code snippet with an inference example that uses the GPU? :) That would be awesome.

Thanks for the good work!

Together org

@billy-ai Sorry for the late reply. If you use this code, inference should run on the GPU. How many tokens were you trying to generate? Generation can be slow if max_new_tokens is large.

If you use a T4 with 16GB VRAM, simply moving the model to the GPU with model = model.half().to('cuda:0') and calling output = model.generate(input_ids, max_new_tokens=10) is enough to run inference on the GPU.
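In case a full snippet helps, here is a minimal single-GPU sketch of that path; the prompt string and max_new_tokens value are just placeholders:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
# Load in float16 and move the whole model onto a single 16GB GPU
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-JT-6B-v1", torch_dtype=torch.float16
).to("cuda:0")

# Inputs must live on the same device as the model
input_ids = tokenizer("The benefits of open models are", return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))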

If I only have a 3070 with 8GB VRAM but a lot of regular RAM (46GB), can I get away with running it on the CPU instead? I don't mind if it's much slower.

Together org

If I only have a 3070 with 8GB VRAM but a lot of regular RAM (46GB), can I get away with running it on the CPU instead? I don't mind if it's much slower.

Sure, you can run it on the CPU without any problem. You can also try 8-bit quantization: model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-JT-6B-v1', device_map='auto', load_in_8bit=True, int8_threshold=6.0) :)
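For the plain CPU route, a minimal sketch looks like the following; the float32 weights need roughly 24GB of system RAM, generation will be slow, and the prompt is just a placeholder:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
# Default float32 weights stay on the CPU (~24GB of system RAM)
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))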

Theoretically, GPT-JT cannot run on a single 3060 with 12GB [...] An alternative is to split the model across multiple 3060 GPUs with accelerate.

Thanks! Sadly, won't be able to get another GPU soon!
