How to run that?

#2
opened by Guilherme34

Just wondering how to run this model.

You can use this: https://github.com/oobabooga/text-generation-webui

In the download section, just use "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5" (there is now a spot in the UI to download models, load/reload/unload them, and apply any LoRA, etc.).

On my PC it takes ~13,220 MiB of VRAM; loaded in 8-bit it generates around 5-10 tokens/sec on a consumer GPU with 16 GB of VRAM.

Hey,
If you want to run this with Hugging Face and the transformers API, you can use the model this way:

from transformers import GPTNeoXForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

# Note: this model takes ~15 GB of VRAM when loaded in 8-bit
model = GPTNeoXForCausalLM.from_pretrained(
    "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", device_map="auto", load_in_8bit=True)

# CPU version instead (needs `import torch` for the dtype):
# model = GPTNeoXForCausalLM.from_pretrained(
#     "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", torch_dtype=torch.bfloat16)

message = "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=1000, do_sample=True, temperature=0.8)
tokenizer.decode(tokens[0])

output:
'<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>A meme is a cultural idea, behavior, or style that spreads from person to person within a society. The word "meme" was first used by Richard Dawkins in his 1976 book "The Selfish Gene." He defined a meme as a unit of cultural information that is transmitted from one individual to another through language, gestures, or other means.\n\nThe history of the word "meme" dates back to the late 1960s, when Richard Dawkins was working on his book "The Selfish Gene." He was fascinated by the way cultural information spreads and evolved within a society, and he began to use the term "meme" to describe these ideas.\n\nSince then, the word "meme" has become widely used in the field of cultural studies and has been adopted by many different academic fields and disciplines, including linguistics, anthropology, and psychology. Today, the term "meme" is used to refer to any cultural idea, behavior, or style that is spread from person to person within a society.<|endoftext|>'

If using the oobabooga setup, just go to the directory "oobabooga-windows\text-generation-webui\models" and create a folder named "open-assistant" or whatever. Copy all the files from the Files tab on this page into that folder: the 3 large .bin files and lots of JSONs, etc. When you run it, you should see it in the list of options. Note that you need a GPU with a lot of VRAM. I can run a few queries with 12 GB, but then it drops out due to memory. So CPU is probably the way to go if you want to play hardcore.
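
If you prefer to script that download instead of copying files by hand, here is a minimal sketch using huggingface_hub (assuming a version recent enough to support local_dir; the folder name is just the example from above, not something the webui requires):

from huggingface_hub import snapshot_download

# Download every file of the model repo straight into the webui's models folder.
# "open-assistant" is an arbitrary folder name, matching the example above.
snapshot_download(
    repo_id="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
    local_dir="text-generation-webui/models/open-assistant",
)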

Someone needs to make a 4-bit version of this model. I've been using various 4-bit 12B/13B models on Colab's free tier with zero issues, and I've even gotten this model to run just fine there.
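
For reference, a rough sketch of what a 4-bit load could look like, assuming a transformers/bitsandbytes combination that already supports load_in_4bit (newer than the versions discussed here); this is not an official 4-bit release of the model:

from transformers import GPTNeoXForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

# Hypothetical 4-bit quantized load via bitsandbytes; requires a
# transformers version with load_in_4bit support.
model = GPTNeoXForCausalLM.from_pretrained(
    "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
    device_map="auto",
    load_in_4bit=True,
)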

@cojosef96 Thank you for the sample code. I tried your code on CPU, but model.generate seems to run endlessly. I have a 6-core CPU, and it seems the process is using only one of them. Do you have any clue?

We turned the Inference API on for this model. You can now use the text-generation client to prompt this model:

pip install --upgrade text-generation==0.5.0

from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

complete_answer = ""
for response in client.generate_stream("<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"):
    print(response.token)
    complete_answer += response.token.text

print(complete_answer)

You can also run this model locally with text-generation-inference.

To run on a GPU with enough VRAM:

# Use a volume to share weights between independent docker runs
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58  --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5

The first run will take a while as it needs to download the model.

or to run on two smaller GPUs:

# Use a volume to share weights between independent docker runs
docker run --gpus "device=0,1" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 --num-shard 2

Then you can query the local server with the Python client:

from text_generation import Client

client = Client("http://localhost:8080")

complete_answer = ""
for response in client.generate_stream("<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"):
    print(response.token)
    complete_answer += response.token.text

print(complete_answer)

@olivierdehaene Thank you very much. I have tried the API. It generates a reply but seems to have a limit on the number of output tokens:
A meme is a cultural idea, behavior, or style that spreads from person to person within a
Is that normal?

You can modify the parameters easily (in this case, it's the max_new_tokens parameter). Check the signature of the generate or generate_stream functions.
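
For example, a quick sketch of raising the limit when streaming (512 is just an example value):

from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")

complete_answer = ""
# max_new_tokens defaults to a small value; raise it for longer replies
for response in client.generate_stream(
    "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>",
    max_new_tokens=512,
):
    complete_answer += response.token.text

print(complete_answer)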

Very helpful, thanks a lot @olivierdehaene !

@olivierdehaene Concise & comprehensive, thank you!

I have set up a workstation with 2 x RTX 3060 (12 GB each). I am using the latest version of transformers (4.28.1) from Hugging Face.
The example from @cojosef96 works correctly, and with device_map="auto" the model is evenly distributed across the 2 GPUs.
The 1st call to model.generate() took over 2 minutes to finish; successive calls took merely a couple of seconds.

(Maybe I have some problems with the 2nd PCIe slot on the motherboard, since during the 1st run the 2nd GPU's utilization was only around 15%~20%.)
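
If you want to check how device_map="auto" actually split the layers across the two cards, one way (assuming the model was loaded with the accelerate-backed device_map as in the example above) is to print the recorded placement:

# The device placement chosen by device_map="auto" is stored on the model.
print(model.hf_device_map)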

@olivierdehaene Thanks a lot for sharing. I am getting the following error while deploying the model on a g5.4xlarge instance on AWS. I am able to load the model and run inference in a Jupyter notebook, but the inference server is not starting.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 150.00 MiB (GPU 0; 22.04 GiB total capacity; 20.99 GiB already allocated; 91.19 MiB free; 20.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
Can someone please guide me in setting this up? Do I need a bigger instance?

@himanshu3344 22.04 GiB total is not enough for the FP16 version. You can try the 8-bit quantized version, which consumes < 15 GiB of VRAM.

@olivierdehaene Thanks a lot for sharing your code. I am trying to use the text_generation InferenceAPIClient on a text column of a Python dataframe with a suitable prompt. But it seems to have a rate limit, and I have to wait at least an hour or so before I can use the Inference API again once the rate limit is exceeded. Is there a way to bypass it? The dataframe is quite large, but the task at hand is a one-time operation, as I will be writing the generated texts to a separate file. All suggestions are welcome.

You can pay for a Pro subscription to raise the rate limit.
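
Note that higher limits only apply to authenticated requests; a minimal sketch of passing your own Hugging Face token to the client (the token value is a placeholder, and I am assuming the client's optional token parameter here):

from text_generation import InferenceAPIClient

# Authenticate so requests count against your (Pro) account.
client = InferenceAPIClient(
    "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
    token="hf_...",  # placeholder for your HF access token
)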

Thanks @captainst. I was able to deploy the model on a bigger instance.

I don't get any results when running this model. My result is "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|><|endoftext|>". Why is that?

@davidhung it is giving no output because the prompt you are using ends with <|endoftext|>, which the model interprets as the end of its generation.
If you prompt with <|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>, you should get some output.
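
In other words, build the prompt so it ends right after the assistant token, for example:

# The model continues generating after <|assistant|>; do not append
# <|endoftext|> after it, or the model will stop immediately.
user_message = "What is a meme, and what's the history behind this word?"
prompt = "<|prompter|>" + user_message + "<|endoftext|><|assistant|>"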

@timesler I did use that as the prompt. That was my response. It just added <|endoftext|> at the end. It isn't working with load_in_8bit=True.

For the 8-bit setup described above, let's say I want to rent a cloud GPU in order to run this; would you say something like an NVIDIA T4 (16 GiB of VRAM) along with a machine with 4 vCPUs and 26 GB of RAM would suffice?

@olivierdehaene In your InferenceAPIClient code above, which other models can I use?

@cojosef96 I have 3 questions about your transformers example above; kindly guide me with their answers:

  1. Can I run the above code on Colab's CPU?

  2. What's the difference between 4-bit and 8-bit?

  3. Which other models can I use with the above code?
