
Help Needed!! Text Generation Taking Too Long

#17
by debajyoti111 - opened

Hi, I am new to NLP and am still learning. I am using a GCP VM (e2-highmem-4, an efficient instance with 4 vCPUs and 32 GB RAM) to load the model and use it. Here is the code I have written:

import torch
import transformers
from transformers import pipeline, AutoTokenizer

config = transformers.AutoConfig.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  trust_remote_code=True,
)
# config.attn_config['attn_impl'] = 'flash'

model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  cache_dir="./cache"
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b", cache_dir="./cache")
text_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
text_gen(text_inputs="what is 2+2?")

Now the code is taking way too long to generate the text. Am I doing something wrong, or is there any way to make things faster?
Also, when creating the pipeline, I am getting the following warning:

The model 'MPTForCausalLM' is not supported for text-generation

I saw in another discussion that this shouldn't be a problem since the architecture is custom, is that right?

Hi @debajyoti111, could you try removing the line torch_dtype=torch.bfloat16? I'm seeing in another post that on some CPU machines this causes the model to run very slowly. Removing that line will fall back to the default torch.float32 weights and math.
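For reference, here is a minimal sketch of that CPU-friendly load (same code as above, just without the torch_dtype argument, so the weights stay in the default torch.float32):

import transformers

config = transformers.AutoConfig.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  trust_remote_code=True,
)

# No torch_dtype here: weights load as torch.float32, which avoids the
# slow bfloat16 math path seen on some CPUs.
model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  config=config,
  trust_remote_code=True,
  cache_dir="./cache"
)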

The not supported for text-generation warning can be ignored.

Also, taking a step back to separate the system from the code: can you confirm whether MPT is faster or slower than other HF models like OPT-6.7B when you run generation with them? In general, running LLMs on CPUs is going to be very slow without a custom framework like GGML. Right now we are focused mainly on GPU inference, which should be quite fast when using the attn_impl: triton backend (rough sketch below).
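A rough sketch of that GPU path, assuming a CUDA machine with the triton package installed (the attn_impl and init_device settings follow the conventions from the MPT model card; adjust as needed for your setup):

import torch
import transformers

config = transformers.AutoConfig.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  trust_remote_code=True,
)
config.attn_config['attn_impl'] = 'triton'  # fused attention kernels on GPU
config.init_device = 'cuda:0'               # initialize weights directly on the GPU

model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  config=config,
  torch_dtype=torch.bfloat16,  # bf16 is fast on GPU, unlike the slow CPU path above
  trust_remote_code=True,
  cache_dir="./cache"
)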

Let me know if the generation speed gets better!

Closing as stale.

Also wanted to note that we added support for device_map and faster KV caching in this PR: https://huggingface.co/mosaicml/mpt-7b-instruct/discussions/41
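As a rough sketch of what that enables (exact arguments may vary, see the linked discussion; device_map='auto' requires the accelerate package):

import torch
import transformers
from transformers import AutoTokenizer

model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  device_map='auto',  # place weights across available GPUs/CPU automatically
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("what is 2+2?", return_tensors="pt").to(model.device)
# use_cache=True reuses the key/value cache between decoding steps
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))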

abhi-mosaic changed discussion status to closed
