Prompt template

#1
by monuminu - opened

What is the prompt template?

prompt = "USER: write a poem about sky in 300 words ASSISTANT:"

Response :

I'm sorry, but i can't do that. A poem about the sky could take take a a a a a a a a a a a a a a a a [... "a" repeats for the rest of the response]
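For reference, the Vicuna v1.5 checkpoints are trained on FastChat's vicuna_v1.1 conversation template, which, as far as I understand, prepends a system message and ends each assistant turn with </s>. A minimal sketch of a single-turn prompt (the system wording below is what FastChat uses; double-check fastchat.conversation if in doubt):

# Sketch of a vicuna_v1.1-style prompt; assumes the FastChat template wording.
system = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
user_msg = "write a poem about the sky in 300 words"

# System message, then "USER: ... ASSISTANT:" separated by single spaces.
prompt = f"{system} USER: {user_msg} ASSISTANT:"

FastChat can also build this for you via fastchat.conversation.get_conv_template("vicuna_v1.1"), which is less error-prone than hand-rolling the string.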

Large Model Systems Organization org

(image attachment: image.png)

I cannot make this prompt work in TGI. It writes a little and starts repeating everything!

input="""USER: Give me a 3 day plan to trip to Paris?"""

Day 1:
* Wake up early in the morning and head to the Eiffel Tower for sunrise.
* After the tower, take a stroll around the the beautiful Champs-Élyséeséeséesées [... the suffix keeps repeating for the rest of the output]
input="""USER: Hi"""

, I'm trying to use the `get_object_or_40()` function in my code, but I'm getting an error message that says says says says says says [... "says" repeats for the rest of the output]

I think we are missing some special tokens somewhere. (I tried the Llama-2 chat prompt template, it didn't work)

Same here ..

I thought it was just me. Thanks for reporting.

I am also experiencing the same problem with the standard HF transformers example:

from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-13b-v1.5-16k")
model = LlamaForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5-16k", device_map="auto")

inputs = tokenizer("How are you?", return_tensors="pt")
# Send the inputs to the same device the model's first layer was placed on.
generate_ids = model.generate(inputs.input_ids.to(model.device), max_length=16000)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)

Same here. V1.5 works fine, while V1.5-16K keeps repeating nonsense after a few words.

Large Model Systems Organization org

Did you use transformers >= 4.31.0?

Thanks @lmzheng
It seems the latest TGI uses an older transformers (https://github.com/huggingface/text-generation-inference/blob/main/server/requirements.txt#L53). Let me try plain CausalLM inference and I'll get back to you.

Did you use transformers >= 4.31.0?

Thx, it seems the version of the transformers library was the problem. I upgraded it from 4.30.2 to 4.31.0, and the mumbling no longer happens.

However, I have started to run into OOM errors ("torch.cuda.OutOfMemoryError: CUDA out of memory.") on my GPU (Tesla V100, 31.75 GiB total capacity). Is this related to the memory needed for intermediate activations at the 16K context length?

@lmzheng It works great with plain CausalLM inference on 4.31.0. I'll wait for TGI to start using 4.31.0.

The memory usage is also pretty decent on short inputs; however, once I feed in a lot of data I get the same error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 8.85 GiB (GPU 0; 79.19 GiB total capacity; 24.32 GiB 
already allocated; 27.62 MiB free; 28.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated 
memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and 
PYTORCH_CUDA_ALLOC_CONF

Not sure if there is any way around this, but would it be possible to estimate how much memory is needed for the full 16,000-token input length? (60 GB? 80 GB?)
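For a rough back-of-the-envelope sketch (assuming standard Llama-13B dimensions of 40 layers, hidden size 5120, 40 attention heads, and fp16 activations; real usage also depends on batch size, attention implementation, and framework overhead):

# Hedged, order-of-magnitude estimate of the extra memory a 16K-token context
# needs on a Llama-13B-style model; assumes 40 layers, hidden size 5120, 40 heads, fp16.
num_layers = 40
hidden_size = 5120
num_heads = 40
seq_len = 16000
bytes_per_elem = 2  # fp16

# KV cache: K and V tensors of shape [seq_len, hidden_size] per layer.
kv_cache = 2 * num_layers * seq_len * hidden_size * bytes_per_elem
print(f"KV cache: {kv_cache / 2**30:.1f} GiB")  # roughly 12 GiB

# Naive (non-flash) attention also materializes a [num_heads, seq_len, seq_len]
# score matrix, which dominates at long context lengths:
attn_scores = num_heads * seq_len * seq_len * bytes_per_elem
print(f"Attention scores per layer: {attn_scores / 2**30:.1f} GiB")  # roughly 19 GiB

On top of the roughly 25 GiB of fp16 weights for a 13B model, that puts full-length 16K inputs out of reach for a single 32 GiB V100 without memory-efficient attention, while an 80 GiB card has considerably more headroom.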

Large Model Systems Organization org
edited Aug 5, 2023

Did you use transformers >= 4.31.0?

My transformers is 4.31.0, but I still have the same problem. How can I fix it? QAQ

I realized I don't see the issue with TGI 1.0, but I do with other containers like DJL. I think the issue may be related to RoPE scaling not being implemented.
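That would fit the symptoms: the 16k checkpoint relies on linear RoPE scaling, which transformers only added in 4.31.0, so older versions apply unscaled positions to a model fine-tuned with scaled ones and the output degenerates after a few words. A quick sanity check (a sketch; the exact rope_scaling values are whatever the checkpoint's config.json ships):

import transformers
from transformers import AutoConfig

# rope_scaling support for Llama models landed in transformers 4.31.0;
# older versions silently ignore the field.
print(transformers.__version__)

config = AutoConfig.from_pretrained("lmsys/vicuna-13b-v1.5-16k")
# Expect something like {'type': 'linear', 'factor': 4.0} for the 16k model.
print(config.rope_scaling)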

I can confirm it works flawlessly with a fresh install. I just created a new Linux user on my GPU server, installed everything, and it ran like a charm. Quality is shockingly good. I used the OpenAI API interface to redirect some of my existing scripts to this endpoint and they just worked, even with very complex prompts and contexts :-) Well done guys!

@rboehme86
Just out of curiosity, would you mind sharing your GPU specs and how much memory it uses when you feed it a 16k input?

I tried to use Vicuna-13b-16k with the vLLM worker (a feature of the FastChat library). In that case, it repeats a single word in the output.
To reproduce the error:
" python3 -m fastchat.serve.vllm_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-13b-v1.5-16k --num-gpus 2"

However, it works when I replace "vllm_worker" with "model_worker".
