Running dolly2 locally

#27
by egorulz - opened

What sort of system requirements would I need to run this model locally, as opposed to something like, say, Vicuna-13B?

Databricks org

Ideally a GPU with at least 32GB of RAM for the 12B model. It should work in 16GB if you load in 8-bit.
The smaller models should work in less GPU RAM too.

I can confirm that the 12B version runs on 1x RTX 3090 (24GB of VRAM) loaded in int8 precision:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# InstructionTextGenerationPipeline comes from the instruct_pipeline.py file published by the Dolly team;
# copy it next to this script (or use trust_remote_code, see below)
from instruct_pipeline import InstructionTextGenerationPipeline

base_model = "databricks/dolly-v2-12b"
load_8bit = True

tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=load_8bit, torch_dtype=torch.float16, device_map="auto"
)

model.eval()
if torch.__version__ >= "2":
    model = torch.compile(model)

pipe = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
pipe("any prompt you want to provide")

Don't forget to import the InstructionTextGenerationPipeline given by the team.

Databricks org

You can also just use trust_remote_code=True to auto-import it, but this works fine for sure.
I think bitsandbytes will complain if you set bfloat16, as it ends up using fp16 for the floating-point parts anyway, but it just ignores that.
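
For reference, if you go the trust_remote_code route instead, a minimal sketch (this mirrors the snippet on the model card; the prompt is just a placeholder):

import torch
from transformers import pipeline

# trust_remote_code=True pulls instruct_pipeline.py from the model repo,
# so you don't need to copy InstructionTextGenerationPipeline yourself
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res)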

You are right @srowen, I updated the code snippet.

Hi,

I want to run the dolly 12 b model in Azure cloud. Can you suggest which VM I should go for?

Hi @rickey94 , I don't know or use Azure, but from my experiments you need the following to run the model successfully:

  • fp16: 40GB of VRAM -> RTX A6000 or NVIDIA A100 40GB
  • int8 (with Peft): 15-24GB of VRAM (depending on the prompt size) -> NVIDIA V100 (16GB) or RTX 3090 (or similar)

How long does it take to generate a response for a decent-size prompt?

@rickey94

  • fp16: between 5 and 15 sec.
  • int8 and Peft: between 1 and 5 sec.

It also depends on the num_beams you require and any other generation parameters you use. I used long inputs as a reference, between 1536 and 2048 tokens. You may also have a faster inference time if your inputs are smaller.
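
To make that concrete, here is roughly what I mean by generation parameters; the values below are only illustrative, not the settings I benchmarked:

# generation kwargs are forwarded to model.generate(); larger num_beams and
# max_new_tokens both increase latency noticeably
res = pipe(
    "Summarize the following text: ...",
    max_new_tokens=256,
    num_beams=1,       # greedy/sampling is much faster than beam search
    do_sample=True,
    temperature=0.7,
)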

Here is the tutorial video for how to install and use it on Windows.

The video includes a Gradio user interface script and teaches you how to enable 8-bit loading and lower-VRAM quantization.

The results I got were not very good though, for some reason :/

Dolly 2.0 : Free ChatGPT-like Model for Commercial Use - How To Install And Use Locally On Your PC

I have the 12B model running on my computer under Linux, with an RTX 3060 graphics card, an i9-10900X CPU, and 48GB of memory. I'm using https://github.com/oobabooga/text-generation-webui as the front end. The settings I tried were GPU memory 7.5GB, CPU memory 22GB, auto-devices, and load-in-8-bit.
Looking at memory usage, it never gets anywhere close to using the 22GB of CPU memory, but GPU memory does go above the 7.5GB limit.
It generates about 1 token per second.

I got to around 1200-1500 tokens (current prompt + context/history) with the dolly 12B model.
You might be able to get more by tweaking the model settings, but this works as a starting point.
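
If you want to reproduce that kind of GPU/CPU split in plain transformers instead of the webui, the closest equivalent is the max_memory argument; a rough sketch (the 7GiB/22GiB caps just mirror the settings above, and this uses fp16 rather than 8-bit, since 8-bit plus CPU offload needs extra bitsandbytes configuration):

import torch
from transformers import AutoModelForCausalLM

# cap GPU 0 at roughly the webui's 7.5GB setting; layers that don't fit
# stay in CPU RAM and are streamed to the GPU during generation (slower, but it fits)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-12b",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "7GiB", "cpu": "22GiB"},
)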

I just ran a few prompts through the model and it took 6-7 minutes each. I run on Databricks on a Standard_NC6s_v3 machine with 112GB of memory.
Any hint as to why inference takes so long would be highly appreciated!

Databricks org

That's a V100 16GB. The 12B model does not fit onto that GPU. So you are mostly running on the CPU and it takes a long time.
Did you look at https://github.com/databrickslabs/dolly#generating-on-other-instances ?
You need to load in 8-bit, but a 16GB V100 will struggle with the 12B model a bit.
Use A10 or better, or use the 7B model.
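
If you go the 7B route, the earlier 8-bit snippet applies unchanged with the smaller checkpoint, roughly:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# same pattern as the 12B snippet above, just the smaller checkpoint
base_model = "databricks/dolly-v2-7b"
tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)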

@srowen Thanks a lot for the hint - I had completely confused a few things!

When I try it locally, it says the pytorch_model.bin is not in the correct JSON format. I am using the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from instruct_pipeline import InstructionTextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
# passing the weights file path here is what triggers the JSON error below
model = AutoModelForCausalLM.from_pretrained("./pytorch_model.bin", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("What are the differences between dog and cat?")
print(res)

It says:

OSError: It looks like the config file at './pytorch_model.bin' is not a valid JSON file.

But changing to model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", torch_dtype=torch.bfloat16) works. I have also tried using the exact same file as in ~/.cache/huggingface/hub/models--databricks--dolly-v2-12b/blobs/, and that also does not work.

Pass the directory containing this file, not the file path. It's looking for several artifacts in that dir, not just the model. You do not need to download the model like this.
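
For example, if you have downloaded the repo contents (config.json, the tokenizer files, and the weight files) into a local folder, point from_pretrained at that folder; the path below is just an example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

local_dir = "./dolly-v2-12b"  # example: folder containing config.json, tokenizer files, and weights
tokenizer = AutoTokenizer.from_pretrained(local_dir, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    local_dir, device_map="auto", torch_dtype=torch.bfloat16
)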

Hi,

I want to run the dolly 12B model in Cloudera Workbench. Can anyone suggest how much RAM and which GPUs I should go for?

srowen changed discussion status to closed

@chainyo, if you used LoRA, would you mind sharing your LoraConfig? (reference)

@opyate Sorry for the confusion. I was discussing another Alpaca/LLaMA model loaded using the LoRA PEFT loader. You can find some code snippets in this repo.

But you don't need LoRA for this Dolly model unless you fine-tune it using the LoRA technique.
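
For what it's worth, if you do later want to fine-tune one of the dolly-v2 checkpoints with LoRA, a minimal PEFT config would look something like this; the rank/alpha/dropout values are only illustrative, and target_modules points at the GPT-NeoX attention projection that dolly-v2 uses:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# illustrative hyperparameters -- tune r / lora_alpha / lora_dropout for your data
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # GPT-NeoX attention projection used by dolly-v2
)

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()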

Hi, like in OpenAI, where we have a token limit of 4096, do we have a token limit in Dolly 2 as well when we deploy it locally? Thanks!

Databricks org

Yes, 2048 tokens
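
If you'd rather not hard-code it, you can read the limit from the config and count your prompt tokens before generating; something along these lines:

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("databricks/dolly-v2-12b")
print(config.max_position_embeddings)  # 2048 for the dolly-v2 (GPT-NeoX) models

# count tokens in a prompt so you can stay under the context limit
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")
n_tokens = len(tokenizer("your prompt here")["input_ids"])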

Do you have a notebook to run Dolly 2.0 on Azure Databricks? I tried, but I get an error :-(

Databricks org

Yes, the snippet on the model page works. You need a big enough GPU and instance. You didn't say what the problem was.

Can you give me the link? I do not see the snippet.

Databricks org

Merci, Thanks, Namaste :-)

I have this error when I try to run it: We couldn't connect to 'https://huggingface.co' to load this file.

Databricks org

You'll have to solve that access problem yourself; it's specific to your environment.

Hi @srowen

I'm trying to fine-tune "TinyPixel/Llama-2-7B-bf16-sharded" with 8 GB of RAM and one GPU, but I'm facing issues like:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):

Is it because of RAM and GPU?

Databricks org

Wrong forum - not a question about Dolly.
