
Continuation of the discussion: for more than 10 minutes the status has been stuck at Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation. #28

#31
by Madhugraj - opened

Adding more details:
First I ran:
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True, token=auth_token)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True, token=auth_token)
All 61 files were downloaded.
and then I did:

Model + tokenizer save

save_directory = "llmdb/model"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

Later I did:

model = AutoModelForCausalLM.from_pretrained(save_directory, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)
Loading checkpoint shards: 100%
 61/61 [00:13<00:00,  4.69it/s]

Can I take it that the model and tokenizer were saved successfully?
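One quick way to confirm is to list the save directory and check that the expected files were written (a sketch; llmdb/model is the save_directory used above):

import os

save_directory = "llmdb/model"
print(sorted(os.listdir(save_directory)))
# Typically you should see config.json, the tokenizer files, the sharded
# model-*-of-*.safetensors weights, and model.safetensors.index.json.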
Now I am running:

input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt")
input_ids

{'input_ids': tensor([[100278, 9125, 198, 2675, 527, 6078, 46913, 11, 3549,
555, 423, 2143, 78889, 13, 1472, 1051, 1566, 6177,
304, 6790, 220, 2366, 18, 13, 1472, 4320, 4860,
3196, 389, 2038, 2561, 709, 311, 430, 1486, 627,
57489, 15843, 36, 66024, 77273, 50, 5257, 66024, 57828,
43486, 2794, 23233, 29863, 11, 719, 3493, 17879, 14847,
311, 810, 6485, 323, 1825, 84175, 4860, 627, 2675,
7945, 449, 5370, 9256, 11, 505, 4477, 311, 11058,
320, 985, 51594, 369, 2082, 10215, 2001, 6227, 311,
1005, 55375, 449, 2082, 11, 4823, 11, 323, 12920,
4390, 7, 2675, 656, 539, 617, 1972, 7394, 828,
2680, 477, 2082, 11572, 17357, 13, 1472, 5766, 23473,
67247, 323, 3493, 24770, 39555, 389, 20733, 13650, 13,
1472, 656, 539, 3493, 5609, 24142, 11, 45319, 11,
477, 3754, 9908, 323, 656, 539, 82791, 713, 3649,
315, 701, 4967, 828, 29275, 2028, 374, 701, 1887,
10137, 11, 51346, 701, 14847, 13, 3234, 539, 5905,
433, 11, 1120, 6013, 311, 279, 1217, 13, 1442,
499, 1505, 6261, 7556, 922, 420, 1984, 11, 3009,
13, 1472, 1288, 387, 30438, 36001, 323, 6118, 430,
3445, 539, 45391, 420, 627, 57489, 9503, 4276, 386,
72983, 4230, 3083, 10245, 45613, 52912, 21592, 66873, 6781,
38873, 3247, 45613, 3507, 20843, 9109, 393, 3481, 691,
1863, 5257, 3247, 14194, 13575, 68235, 13, 100279, 198,
100278, 882, 198, 3923, 1587, 433, 1935, 311, 1977,
264, 2294, 445, 11237, 30, 100279, 198, 100278, 78191,
198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Can I take it that the input tokens were generated correctly?

Next:
outputs = model.generate(**input_ids, max_new_tokens=200)
Setting pad_token_id to eos_token_id:100257 for open-end generation.
And now there is no output and the cell has been running for hours...
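One way to tell whether generation is stuck or just extremely slow on CPU is to attach a streamer so tokens are printed as soon as they are produced (a sketch, assuming the model, tokenizer, and input_ids from the steps above; a 132B model on CPU can take minutes per token):

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)
# If tokens trickle out, generation is working but very slow;
# if nothing ever appears, something is genuinely stuck.
outputs = model.generate(**input_ids, max_new_tokens=20, streamer=streamer)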

I am just doing what is instructed in https://github.com/databricks/dbrx/blob/main/MODEL_CARD_dbrx_instruct.md under Run the model on a CPU.

Please explain.

I have the same issue, but I'm using the example code for running on GPU.

Databricks org

What GPUs? Are you sure it's not loading only partly on the GPU? That is likely what you get if you use device_map="auto" and don't have multiple big GPUs.


Sorry, I just saw the previously closed thread...

I'm having the same issue, but my machine has 4 x A6000 Ada GPUs with 48 GB each (192 GB of VRAM total) and 512 GB of system RAM. Am I not able to run this model? Is there a quantized version of it, or some way I can get it to fit?

Code I'm using just to test whether I can run it:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True, token="HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, attn_implementation="flash_attention_2", token="HF_TOKEN")
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Databricks org
edited Apr 1

Right, 132B parameters x 16 bits = 264 GB of VRAM. Much of the model could load into 192 GB, but there would be a performance hit, as at least some of it would be offloaded to CPU; device_map="auto" is almost surely doing that here. You can sanity-check with nvidia-smi (your GPU memory is likely ~100% full) and by calling .hf_device_map on the loaded model to see which devices have loaded which layers, and which, if any, are on CPU (I expect some are).
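A minimal sketch of that check, assuming model was loaded with device_map="auto" as in the snippet above:

print(model.hf_device_map)  # maps each module to a GPU index, "cpu", or "disk"
on_cpu = [name for name, dev in model.hf_device_map.items() if dev == "cpu"]
print(f"{len(on_cpu)} modules offloaded to CPU")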

You can check out third-party 4-bit quantizations, for example https://huggingface.co/PrunaAI/dbrx-instruct-bnb-4bit
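If you would rather quantize on the fly than use a pre-quantized checkpoint, something along these lines should bring the weights down to roughly 70 GB (a sketch, assuming bitsandbytes is installed; not tested with DBRX specifically):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    token="HF_TOKEN",  # placeholder, same as in the snippet above
)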

I have only one GeForce 3090 with 32 GB of memory and I'm stuck at the same message. Can someone help?

Databricks org

As above, that is unfortunately far too little memory to load the model. It's too little even to load the 4-bit quantizations.

Got it. Thank you!
