Running dolly2 locally

#27
by egorulz - opened

What sort of system requirements would I need to run this model locally, as opposed to something like, say, Vicuna-13B?

Databricks org

Ideally a GPU with at least 32GB of RAM for the 12B model. It should work in 16GB if you load in 8-bit.
The smaller models should work in less GPU RAM too.

I can confirm that the 12B version runs on 1x RTX 3090 (24GB of VRAM) loaded in int8 precision:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# InstructionTextGenerationPipeline comes from the instruct_pipeline.py file published by the Dolly team;
# copy it next to this script (or use trust_remote_code, see below)
from instruct_pipeline import InstructionTextGenerationPipeline

base_model = "databricks/dolly-v2-12b"
load_8bit = True

tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=load_8bit, torch_dtype=torch.float16, device_map="auto"
)

model.eval()
if torch.__version__ >= "2":
    model = torch.compile(model)

pipe = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
pipe("any prompt you want to provide")

Don't forget to import the InstructionTextGenerationPipeline given by the team.

Databricks org

You can also just use trust_remote_code=True to auto-import it, but this works fine for sure.
I think bitsandbytes will complain if you set bfloat16, as it ends up using fp16 for the floating-point parts anyway, but it just ignores that.
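
For reference, if you go the trust_remote_code route instead, a minimal sketch (this mirrors the snippet on the model card; the prompt is just a placeholder):

import torch
from transformers import pipeline

# trust_remote_code=True pulls instruct_pipeline.py from the model repo,
# so you don't need to copy InstructionTextGenerationPipeline yourself
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res)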

You are right @srowen, I updated the code snippet.

Hi,

I want to run the dolly 12 b model in Azure cloud. Can you suggest which VM I should go for?

Hi @rickey94 , I don't know or use Azure, but from my experiments you need the following to run the model successfully:

  • fp16: 40GB of VRAM -> RTX A6000 or NVIDIA A100 40GB
  • int8 (with Peft): 15-24GB of VRAM (depending on the prompt size) -> NVIDIA V100 (16GB) or RTX 3090 (or similar)

How long does it take to generate a response for a decent-size prompt?

@rickey94

  • fp16: between 5 and 15 sec.
  • int8 and Peft: between 1 and 5 sec.

It also depends on the num_beams you require and any other generation parameters you use. I used long inputs as a reference, between 1536 and 2048 tokens. You may also have a faster inference time if your inputs are smaller.
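
To make that concrete, here is roughly what I mean by generation parameters; the values below are only illustrative, not the settings I benchmarked:

# generation kwargs are forwarded to model.generate(); larger num_beams and
# max_new_tokens both increase latency noticeably
res = pipe(
    "Summarize the following text: ...",
    max_new_tokens=256,
    num_beams=1,       # greedy/sampling is much faster than beam search
    do_sample=True,
    temperature=0.7,
)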

Here is the tutorial video for how to install and use it on Windows.

The video includes a Gradio user interface script and teaches you how to enable 8-bit loading and lower-VRAM quantization.

The results I got were not very good though, for some reason :/

Dolly 2.0 : Free ChatGPT-like Model for Commercial Use - How To Install And Use Locally On Your PC

I have the 12B model running on my computer under Linux, with an RTX 3060 graphics card, an i9-10900X CPU, and 48GB of memory. I'm using https://github.com/oobabooga/text-generation-webui as the front end. The settings I tried were GPU memory 7.5GB, CPU memory 22GB, auto-devices, and load-in-8-bit.
Looking at memory usage, it never gets anywhere close to using the 22GB of CPU memory, but GPU memory does go above the 7.5GB limit.
It generates about 1 token per second.

I got to around 1200-1500 tokens (current prompt + context/history) with the dolly 12B model.
You might be able to get more by tweaking the model settings, but this works as a starting point.
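
If you want to reproduce that kind of GPU/CPU split in plain transformers instead of the webui, the closest equivalent is the max_memory argument; a rough sketch (the 7GiB/22GiB caps just mirror the settings above, and this uses fp16 rather than 8-bit, since 8-bit plus CPU offload needs extra bitsandbytes configuration):

import torch
from transformers import AutoModelForCausalLM

# cap GPU 0 at roughly the webui's 7.5GB setting; layers that don't fit
# stay in CPU RAM and are streamed to the GPU during generation (slower, but it fits)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-12b",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "7GiB", "cpu": "22GiB"},
)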

I just ran a few prompts through the model and it took 6-7 minutes each. I run on Databricks on a Standard_NC6s_v3 machine with 112GB of memory.
Any hint as to why inference takes so long would be highly appreciated!

Databricks org

That's a V100 16GB. The 12B model does not fit onto that GPU. So you are mostly running on the CPU and it takes a long time.
Did you look at https://github.com/databrickslabs/dolly#generating-on-other-instances ?
You need to load in 8-bit, but a 16GB V100 will struggle with the 12B model a bit.
Use A10 or better, or use the 7B model.
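
If you go the 7B route, the earlier 8-bit snippet applies unchanged with the smaller checkpoint, roughly:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# same pattern as the 12B snippet above, just the smaller checkpoint
base_model = "databricks/dolly-v2-7b"
tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)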

@srowen Thanks a lot for the hint - I had completely confused a few things!

When I try it locally, it says the pytorch_model.bin is not in the correct JSON format. I am using the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from instruct_pipeline import InstructionTextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
# passing the weights file path here is what triggers the JSON error below
model = AutoModelForCausalLM.from_pretrained("./pytorch_model.bin", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("What are the differences between dog and cat?")
print(res)

It says:

OSError: It looks like the config file at './pytorch_model.bin' is not a valid JSON file.

But changing to model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", torch_dtype=torch.bfloat16) works. I have also tried using the exact same file as in ~/.cache/huggingface/hub/models--databricks--dolly-v2-12b/blobs/, and that also does not work.

Pass the directory containing this file, not the file path. It's looking for several artifacts in that dir, not just the model. You do not need to download the model like this.
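
For example, if you have downloaded the repo contents (config.json, the tokenizer files, and the weight files) into a local folder, point from_pretrained at that folder; the path below is just an example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

local_dir = "./dolly-v2-12b"  # example: folder containing config.json, tokenizer files, and weights
tokenizer = AutoTokenizer.from_pretrained(local_dir, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    local_dir, device_map="auto", torch_dtype=torch.bfloat16
)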

Hi,

I want to run the dolly 12B model in Cloudera Workbench. Can anyone suggest how much RAM and which GPUs I should go for?

srowen changed discussion status to closed

@chainyo, if you used LoRA, would you mind sharing your LoraConfig? (reference)

@opyate Sorry for the confusion. I was discussing another Alpaca/LLaMA model loaded using the LoRA PEFT loader. You can find some code snippets in this repo.

But you don't need LoRA for this Dolly model unless you fine-tune it using the LoRA technique.
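
For what it's worth, if you do later want to fine-tune one of the dolly-v2 checkpoints with LoRA, a minimal PEFT config would look something like this; the rank/alpha/dropout values are only illustrative, and target_modules points at the GPT-NeoX attention projection that dolly-v2 uses:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# illustrative hyperparameters -- tune r / lora_alpha / lora_dropout for your data
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # GPT-NeoX attention projection used by dolly-v2
)

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()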

Hi, like in OpenAI, where we have a token limit of 4096, do we have a token limit in Dolly 2 as well when we deploy it locally? Thanks!

Databricks org

Yes, 2048 tokens
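
If you'd rather not hard-code it, you can read the limit from the config and count your prompt tokens before generating; something along these lines:

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("databricks/dolly-v2-12b")
print(config.max_position_embeddings)  # 2048 for the dolly-v2 (GPT-NeoX) models

# count tokens in a prompt so you can stay under the context limit
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")
n_tokens = len(tokenizer("your prompt here")["input_ids"])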

Do you have a notebook to run Dolly 2.0 on Azure Databricks? I tried, but I get an error :-(

Databricks org

Yes, the snippet on the model page works. You need a big enough GPU and instance. You didn't say what the problem was.

Can you give me the link? I do not see the snippet.

Databricks org

Merci, Thanks, Namaste :-)

I have this error when I try to run it: We couldn't connect to 'https://huggingface.co' to load this file.

Databricks org

You'll have to solve that access problem yourself; it's specific to your environment.

Hi @srowen

I'm trying to fine-tune "TinyPixel/Llama-2-7B-bf16-sharded" with 8 GB of RAM and one GPU, but I'm facing issues like:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):

Is it because of RAM and GPU?

Databricks org

Wrong forum - not a question about Dolly.
