How to run that?

#2
opened by Guilherme34

Just wondering how to run this model.

You can use this: https://github.com/oobabooga/text-generation-webui

In the download section, just use "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5" (there is now a spot in the UI to download models, load/reload/unload them, and apply any LoRA, etc.).

On my PC it takes ~13,220 MiB of VRAM; loaded in 8-bit it generates around 5-10 tokens/sec on a consumer GPU with 16 GB of VRAM.

Hey,
If you want to run this with Hugging Face and the transformers API, you can use the model this way:

from transformers import GPTNeoXForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

# Note: this model takes ~15 GB of VRAM when loaded in 8-bit
model = GPTNeoXForCausalLM.from_pretrained(
    "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", device_map="auto", load_in_8bit=True)

# CPU version instead (needs `import torch` for the dtype):
# model = GPTNeoXForCausalLM.from_pretrained(
#     "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", torch_dtype=torch.bfloat16)

message = "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=1000, do_sample=True, temperature=0.8)
tokenizer.decode(tokens[0])

output:
'<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>A meme is a cultural idea, behavior, or style that spreads from person to person within a society. The word "meme" was first used by Richard Dawkins in his 1976 book "The Selfish Gene." He defined a meme as a unit of cultural information that is transmitted from one individual to another through language, gestures, or other means.\n\nThe history of the word "meme" dates back to the late 1960s, when Richard Dawkins was working on his book "The Selfish Gene." He was fascinated by the way cultural information spreads and evolved within a society, and he began to use the term "meme" to describe these ideas.\n\nSince then, the word "meme" has become widely used in the field of cultural studies and has been adopted by many different academic fields and disciplines, including linguistics, anthropology, and psychology. Today, the term "meme" is used to refer to any cultural idea, behavior, or style that is spread from person to person within a society.<|endoftext|>'

If using the oobabooga setup, just go to the directory "oobabooga-windows\text-generation-webui\models" and create a folder named "open-assistant" or whatever. Copy all the files from the Files tab on this page into that folder: the 3 large .bin files and lots of JSONs, etc. When you run it, you should see it in the list of options. Note that you need a GPU with a lot of VRAM. I can run a few queries with 12 GB, but then it drops out due to memory. So CPU is probably the way to go if you want to play hardcore.
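
If you prefer to script that download instead of copying files by hand, here is a minimal sketch using huggingface_hub (assuming a version recent enough to support local_dir; the folder name is just the example from above, not something the webui requires):

from huggingface_hub import snapshot_download

# Download every file of the model repo straight into the webui's models folder.
# "open-assistant" is an arbitrary folder name, matching the example above.
snapshot_download(
    repo_id="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
    local_dir="text-generation-webui/models/open-assistant",
)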

Someone needs to make a 4-bit version of this model. I've been using various 4-bit 12B/13B models on Colab's free tier with zero issues, and I've even gotten this model to run just fine there.
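
For reference, a rough sketch of what a 4-bit load could look like, assuming a transformers/bitsandbytes combination that already supports load_in_4bit (newer than the versions discussed here); this is not an official 4-bit release of the model:

from transformers import GPTNeoXForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

# Hypothetical 4-bit quantized load via bitsandbytes; requires a
# transformers version with load_in_4bit support.
model = GPTNeoXForCausalLM.from_pretrained(
    "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
    device_map="auto",
    load_in_4bit=True,
)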

@cojosef96 Thank you for the sample code. I tried your code on CPU, but model.generate seems to run endlessly. I have a 6-core CPU, and it seems the process is using only one of them. Do you have any clue?

We turned the Inference API on for this model. You can now use the text-generation client to prompt this model:

pip install --upgrade text-generation==0.5.0

from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

complete_answer = ""
for response in client.generate_stream("<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"):
    print(response.token)
    complete_answer += response.token.text

print(complete_answer)

You can also run this model locally with text-generation-inference.

To run on a GPU with enough VRAM:

# Use a volume to share weights between independent docker runs
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58  --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5

The first run will take a while as it needs to download the model.

or to run on two smaller GPUs:

# Use a volume to share weights between independent docker runs
docker run --gpus "device=0,1" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 --num-shard 2

Then you can query the local server with the Python client:

from text_generation import Client

client = Client("http://localhost:8080")

complete_answer = ""
for response in client.generate_stream("<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"):
    print(response.token)
    complete_answer += response.token.text

print(complete_answer)

@olivierdehaene Thank you very much. I have tried the API. It generates a reply but seems to have a limit on the number of output tokens:
A meme is a cultural idea, behavior, or style that spreads from person to person within a
Is that normal?

You can modify the parameters easily (in this case, it's the max_new_tokens parameter). Check the signature of the generate or generate_stream functions.
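
For example, a quick sketch of raising the limit when streaming (512 is just an example value):

from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

complete_answer = ""
# max_new_tokens defaults to a small value; raise it for longer replies
for response in client.generate_stream(
    "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>",
    max_new_tokens=512,
):
    complete_answer += response.token.text

print(complete_answer)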

Very helpful, thanks a lot @olivierdehaene !

@olivierdehaene Concise & comprehensive, thank you!

I have set up a workstation with 2 x RTX 3060 (12 GB each). I am using the latest version of transformers (4.28.1) from Hugging Face.
The example from @cojosef96 works correctly, and with device_map="auto" the model is evenly distributed across the 2 GPUs.
The 1st call to model.generate() took over 2 minutes to finish; successive calls took merely a couple of seconds.

(Maybe I have some problems with the 2nd PCIe slot on the motherboard, since during the 1st run the 2nd GPU's utilization was only around 15%~20%.)
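
If you want to check how device_map="auto" actually split the layers across the two cards, one way (assuming the model was loaded with the accelerate-backed device_map as in the example above) is to print the recorded placement:

# The device placement chosen by device_map="auto" is stored on the model.
print(model.hf_device_map)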

@olivierdehaene Thanks a lot for sharing. I am getting the following error while deploying the model on a g5.4xlarge instance on AWS. I am able to load the model and run inference in a Jupyter notebook, but the inference server is not starting.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 150.00 MiB (GPU 0; 22.04 GiB total capacity; 20.99 GiB already allocated; 91.19 MiB free; 20.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
Can someone please guide me in setting this up? Do I need a bigger instance?

@himanshu3344 22.04 GiB total is not enough for the FP16 version. You can try the 8-bit quantized version, which consumes < 15 GiB of VRAM.

@olivierdehaene Thanks a lot for sharing your code. I am trying to use the text_generation InferenceAPIClient on a text column of a Python dataframe with a suitable prompt. But it seems to have a rate limit, and I have to wait at least an hour or so before I can use the Inference API again once the rate limit is exceeded. Is there a way to bypass it? The dataframe is quite large, but the task at hand is a one-time operation, as I will be writing the generated texts to a separate file. All suggestions are welcome.

You can pay for a Pro subscription to raise the rate limit.
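
Note that higher limits only apply to authenticated requests; a minimal sketch of passing your own Hugging Face token to the client (the token value is a placeholder, and I am assuming the client's optional token parameter here):

from text_generation import InferenceAPIClient

# Authenticate so requests count against your (Pro) account.
client = InferenceAPIClient(
    "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
    token="hf_...",  # placeholder for your HF access token
)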

Thanks @captainst. I was able to deploy the model on a bigger instance.

I don't get any results when running this model. My result is "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|><|endoftext|>". Why is that?

@davidhung it is giving no output because the prompt you are using ends with <|endoftext|>, which the model interprets as the end of its generation.
If you prompt with <|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>, you should get some output.
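
In other words, build the prompt so it ends right after the assistant token, for example:

# The model continues generating after <|assistant|>; do not append
# <|endoftext|> after it, or the model will stop immediately.
user_message = "What is a meme, and what's the history behind this word?"
prompt = "<|prompter|>" + user_message + "<|endoftext|><|assistant|>"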

@timesler I did use that as the prompt. That was my response. It just added <|endoftext|> at the end. It isn't working with load_in_8bit=True.

For the 8-bit setup described above, let's say I want to rent a cloud GPU in order to run this; would you say something like an NVIDIA T4 (16 GiB of VRAM) along with a machine with 4 vCPUs and 26 GB of RAM would suffice?

@olivierdehaene In your InferenceAPIClient code above, which other models can I use?

@cojosef96 I have 3 questions about your transformers example above; kindly guide me with their answers:

  1. Can I run the above code on Colab's CPU?

  2. What's the difference between 4-bit and 8-bit?

  3. Which other models can I use with the above code?
