
Help Needed!! Text Generation Taking Too Long

#17
by debajyoti111 - opened

Hi, I am new to NLP and am still learning. I am using a GCP VM (e2-highmem-4, an efficient instance with 4 vCPUs and 32 GB RAM) to load the model and use it. Here is the code I have written:

import torch
import transformers
from transformers import pipeline, AutoTokenizer

config = transformers.AutoConfig.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  trust_remote_code=True,
)
# config.attn_config['attn_impl'] = 'flash'

model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  cache_dir="./cache"
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b", cache_dir="./cache")
text_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
text_gen(text_inputs="what is 2+2?")

Now the code is taking way too long to generate the text. Am I doing something wrong, or is there any way to make things faster?
Also, when creating the pipeline, I am getting the following warning:

The model 'MPTForCausalLM' is not supported for text-generation

I saw in another discussion that this shouldn't be a problem since the architecture is custom, is that right?

Hi @debajyoti111, could you try removing the line torch_dtype=torch.bfloat16? I'm seeing in another post that on some CPU machines this causes the model to run very slowly. Removing that line will fall back to the default torch.float32 weights and math.
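For reference, here is a minimal sketch of that CPU-friendly load (same code as above, just without the torch_dtype argument, so the weights stay in the default torch.float32):

import transformers

config = transformers.AutoConfig.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  trust_remote_code=True,
)

# No torch_dtype here: weights load as torch.float32, which avoids the
# slow bfloat16 math path seen on some CPUs.
model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  config=config,
  trust_remote_code=True,
  cache_dir="./cache"
)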

The not supported for text-generation warning can be ignored.

Also, taking a step back to separate the system from the code: can you confirm whether MPT is faster or slower than other HF models like OPT-6.7B when you run generation with them? In general, running LLMs on CPUs is going to be very slow without a custom framework like GGML. Right now we are focused mainly on GPU inference, which should be quite fast when using the attn_impl: triton backend (rough sketch below).
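A rough sketch of that GPU path, assuming a CUDA machine with the triton package installed (the attn_impl and init_device settings follow the conventions from the MPT model card; adjust as needed for your setup):

import torch
import transformers

config = transformers.AutoConfig.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  trust_remote_code=True,
)
config.attn_config['attn_impl'] = 'triton'  # fused attention kernels on GPU
config.init_device = 'cuda:0'               # initialize weights directly on the GPU

model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  config=config,
  torch_dtype=torch.bfloat16,  # bf16 is fast on GPU, unlike the slow CPU path above
  trust_remote_code=True,
  cache_dir="./cache"
)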

Let me know if the generation speed gets better!

Closing as stale.

Also wanted to note that we added support for device_map and faster KV caching in this PR: https://huggingface.co/mosaicml/mpt-7b-instruct/discussions/41
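As a rough sketch of what that enables (exact arguments may vary, see the linked discussion; device_map='auto' requires the accelerate package):

import torch
import transformers
from transformers import AutoTokenizer

model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct',
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  device_map='auto',  # place weights across available GPUs/CPU automatically
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("what is 2+2?", return_tensors="pt").to(model.device)
# use_cache=True reuses the key/value cache between decoding steps
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))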

abhi-mosaic changed discussion status to closed
