Speed on CPU

#8
by zokica - opened

I have tried LLaMa 7B and this model on a CPU, and LLaMa is much faster (7 seconds vs. 43 for 20 tokens). Is this the right way to run the model on a CPU, or am I missing something? Here is my script:

import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load in bfloat16 to keep the memory footprint down; trust_remote_code is required for MPT's custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

timea = time.time()
prompt = "A lion is"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=20, do_sample=True, temperature=0.75, return_dict_in_generate=True
)
output_str = tokenizer.decode(outputs.sequences[0])
print(output_str)
print("elapsed seconds:", time.time() - timea)

The output:

MPT-7B:

A lion is a large cat. Lions are native to Africa. Lions live in the savanna, a grassland
elapsed seconds: 43.37369394302368

LLaMa-7B:

<s> A lion is the king of the jungle. The lion is the strongest animal in the animal kingdom
elapsed seconds: 6.919593811035156

You're comparing ggml vs. PyTorch – until this model gets the ggml treatment, expect CPU-only speeds to be slower.

How did you conclude that I used ggml?

Of course I did not use ggml; I used exactly the same BF16 setup for both LLaMa and MPT-7B, and LLaMa is much faster:
model_name = "huggyllama/llama-7b"
model_name = "mosaicml/mpt-7b"

Here is exactly what I used for LLaMa, so you can replicate it and see for yourself:

import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

timea = time.time()
prompt = "A lion is"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(
    inputs.input_ids, max_new_tokens=20, do_sample=True, temperature=0.75, return_dict_in_generate=True
)
output_str = tokenizer.decode(outputs.sequences[0])
print(output_str)
print("elapsed seconds:", time.time() - timea)

Hi @zokica , we will take a look at this, as we're seeing a couple of reports of slow CPU inference. Since you have a system on hand that shows the issue, could you help confirm whether any of the MPT vs. LLaMa speed gap changes based on the torch_dtype and low_cpu_mem_usage flags? Basically this matrix (a rough sweep script is sketched after the list):

  • torch_dtype=torch.float32, low_cpu_mem_usage=False: ?
  • torch_dtype=torch.float32, low_cpu_mem_usage=True: ?
  • torch_dtype=torch.bfloat16, low_cpu_mem_usage=False: ?
  • torch_dtype=torch.bfloat16, low_cpu_mem_usage=True: MPT slower than LLaMa
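A minimal sketch of how that matrix could be swept, just as a starting point: it assumes enough RAM for the float32 loads, reuses the prompt and sampling settings from the scripts above, and times only the generate() call.

import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def time_generation(model_name, dtype, low_cpu_mem):
    # Load the model with the given dtype / low_cpu_mem_usage combination and time one short generation.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=dtype,
        low_cpu_mem_usage=low_cpu_mem,
        trust_remote_code=True,
    )
    inputs = tokenizer("A lion is", return_tensors="pt").to(model.device)
    start = time.time()
    model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.75)
    return time.time() - start

for model_name in ("mosaicml/mpt-7b", "huggyllama/llama-7b"):
    for dtype in (torch.float32, torch.bfloat16):
        for low_cpu_mem in (False, True):
            elapsed = time_generation(model_name, dtype, low_cpu_mem)
            print(f"{model_name} | {dtype} | low_cpu_mem_usage={low_cpu_mem}: {elapsed:.1f}s")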

In the meantime we will try to reproduce as well. Thank you for the report!

Hi,

I tested both scripts from zokica above on a cheap VPS: 18 cores, 48 GB RAM, 2048 GB SSD (RAID10).

LLaMa is still faster, but with float32 only by a factor of about 2.

torch_dtype=torch.float32, low_cpu_mem_usage=False:  | MPT: 95.7 s   | LLaMa: 43.2 s
torch_dtype=torch.float32, low_cpu_mem_usage=True:   | MPT: 98.6 s   | LLaMa: 48.6 s
torch_dtype=torch.bfloat16, low_cpu_mem_usage=False: | MPT: 1747.8 s | LLaMa: 177.7 s
torch_dtype=torch.bfloat16, low_cpu_mem_usage=True:  | MPT: 1764.6 s | LLaMa: 178.2 s
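The very large bfloat16 gap is consistent with bfloat16 matmuls falling back to much slower kernels on many CPUs, so float32 is often the better choice for CPU inference. A quick, machine-specific sanity check (the matrix size and iteration count here are arbitrary):

import time

import torch

x = torch.randn(2048, 2048)
for dtype in (torch.float32, torch.bfloat16):
    a = x.to(dtype)
    a @ a  # warm-up
    start = time.time()
    for _ in range(20):
        a @ a
    print(f"{dtype}: {time.time() - start:.2f}s for 20 matmuls")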

Mosaic ML, Inc. org

Thank you so much! This definitely seems like a bottleneck somewhere in the MPT forward pass or KV caching logic. It's very interesting that this shows up on CPU but not on GPU (where we saw the opposite relation: MPT roughly 1.5-2x faster with Triton). We will look into it and patch the model source once we find a fix.

Last question, what version of torch were you using for those results?

I actually ran it in BF16 because I have only 32 GB of RAM in this server, so I had to use a low-memory option.

Is there any other way to run it on a CPU with just 32 GB of memory, without using BF16?

I am using the CPU, and probably most people will just use the CPU for testing. It would be nice if this could work a bit faster, but it is not a big problem.

So it runs faster than LLaMa on a GPU, right, even without Triton?

You're comparing ggml vs. PyTorch – until this model gets the ggml treatment, expect CPU-only speeds to be slower.

There are ggml versions on Hugging Face 🤗
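For reference, one common way to run a ggml conversion on CPU at the time was the ctransformers library. The sketch below is only illustrative and assumes a community ggml conversion of MPT-7B on the Hub; the repo id is an example, not an official MosaicML release.

# pip install ctransformers
from ctransformers import AutoModelForCausalLM

# Example community ggml checkpoint; substitute whichever ggml conversion you actually use.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/MPT-7B-GGML", model_type="mpt")
print(llm("A lion is"))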

Last question, what version of torch were you using for those results?

2.0.1+cpu

For me, it is taking 35 minutes to generate 100 tokens.
Laptop specification: no GPU, 20 GB RAM (4 + 16 GB), 1 TB SSD, i5 processor.
I have a very slow laptop with no GPU.

def customGenerate(argPrompt):
    # Generate a single new token and return the full decoded text so far.
    inputs = tokenizer(argPrompt, return_tensors='pt').to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=1, do_sample=True, temperature=0.75, return_dict_in_generate=True
    )
    return tokenizer.decode(outputs.sequences[0])

import time
from datetime import datetime

timea = time.time()
print("now =", datetime.now())

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
)

prompt = ["Earth is"]

# Generate 100 tokens one at a time, feeding the whole decoded text back in on every step.
for count in range(100):
    prompt.append(customGenerate(prompt[-1]))
    print(len(prompt), ':', prompt[-1])
    print("Time taken in sec:", time.time() - timea)
    print("Time taken in min:", (time.time() - timea) / 60)

print("now =", datetime.now())

I am having a hard time running this on CPU; could someone please help me? I get the error:

ImportError: This modeling file requires the following packages that were not found in your environment: einops. Run pip install einops

But then it seems einops needs a CUDA driver to be installed :(
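einops itself is a pure-Python tensor-manipulation library and does not require CUDA; the CUDA complaint most likely comes from the PyTorch build rather than from einops. A quick check that a CPU-only environment is usable:

# pip install einops
import einops
import torch

print(einops.__version__)          # einops works the same with or without a GPU
print(torch.__version__)           # a "+cpu" suffix means a CPU-only PyTorch build
print(torch.cuda.is_available())   # False is expected and fine on a CPU-only machine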

Mosaic ML, Inc. org
edited Jun 3, 2023

The CPU load time should be fixed now as of this PR, as long as you use device_map="auto": https://huggingface.co/mosaicml/mpt-7b/discussions/47
We also added some logic to improve KV caching speed. Let us know if you see improvements!
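A minimal loading snippet with that flag (device_map requires the accelerate package; on a machine with no GPU the weights still end up on CPU):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # requires `pip install accelerate`
    trust_remote_code=True,
)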

Mosaic ML, Inc. org

Closing as complete, but if anyone sees any CPU inference speed issues, please reopen this or open a new issue!

sam-mosaic changed discussion status to closed
