Problems using this model

#1
by deter3 - opened

When using transformers, once the context gets longer (5,572 words), a single A6000 48GB is not enough. When going for 2 GPUs, the code in the model card throws "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!".

When using the Docker image deployed on RunPod with an A6000, I always get the error "RuntimeError: weight gptq_bits does not exist". (HUGGING_FACE_HUB_TOKEN has been set up and the model downloads fine, but the error occurs during sharding.)

Can you share the full command you are using?

You'll need to specify the quantization as awq or bitsandbytes NF4 (not gptq), and also set the revision (branch) to awq if using awq.

One A6000 should get you to 16k. Two should get you to 64k.

--model-id Trelis/Mistral-7B-Instruct-v0.1-Summarize-64k --port 8080 --max-input-length 63000 --max-total-tokens 64000 --max-batch-prefill-tokens 64000 --quantize awq --revision awq

It's from the template link on this model card; you can try it yourself.

Sorry for the trouble. TGI is still having issues running Mistral models in AWQ.

I've updated the template to use --quantize bitsandbytes-nf4 and tested it works now. You can also play around with --quantize eetq, which will be faster if you're using larger batch sizes but may fit slightly less context.
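
For reference, the updated flags should look roughly like the command quoted above with the quantization swapped (this is an assumption based on the template change described here, not necessarily the exact current template; --revision awq is dropped since that branch is only needed for AWQ):

--model-id Trelis/Mistral-7B-Instruct-v0.1-Summarize-64k --port 8080 --max-input-length 63000 --max-total-tokens 64000 --max-batch-prefill-tokens 64000 --quantize bitsandbytes-nf4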

I'll need to report further on the AWQ issue and will get to it after Xmas, cheers.

Thanks for the quick response, Ronan. Let's deal with it after Xmas. Cheers.

Howdy @deter3, I've just tested this out completely; have a look here for the full video:

RonanMcGovern changed discussion status to closed

Hi Ronan,
Below is the problem I came across. Can you let me know the root cause of this problem? I am using the 64k model more for Q&A than for summarization. Should I train my own model, or can I tweak this one to fit my purpose? We can talk about it after Xmas. Cheers!!
image.png

Before going to the trouble of training, I recommend:

  1. Using the prompt on the model card, minus the summarization. Instead of:
B_INST, E_INST = "[INST] ", " [/INST]"
prompt = f"{B_INST}Provide a summary of the following text:\n\n[TEXT_START]\n\n{text_to_summarize}\n\n[TEXT_END]\n\n{E_INST}"

Use something like:

prompt = ""  # put your question here
B_INST, E_INST = "[INST] ", " [/INST]"
formatted_prompt = f"{B_INST}{prompt}{E_INST}"

That should help quite a bit (a minimal request sketch follows after this list).

  2. Test on 50k tokens (just a little lower than the upper bound).

  3. If the above doesn't work, you can add the following:
B_INST, E_INST = "[INST] ", " [/INST]"
systemPrompt = f"The following is a discussion between a user and a helpful assistant. {B_INST} is pre-pended to user messages and {E_INST} is pre-pended to assistant messages."
prompt = ""  # put your question here
formatted_prompt = f"{systemPrompt}\n\n{B_INST}{prompt}{E_INST}"
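
For anyone following along, here is a minimal sketch of sending the point-1 format to a TGI /generate endpoint (assuming the Python requests library and a RunPod-style proxy URL; the URL and question below are placeholders, adapt them to your own deployment):

import requests

# Prompt format from point 1: just the question wrapped in [INST] tags.
B_INST, E_INST = "[INST] ", " [/INST]"
prompt = "Your question (plus any context) goes here"  # placeholder question
formatted_prompt = f"{B_INST}{prompt}{E_INST}"

# Placeholder endpoint; replace with your own TGI / RunPod proxy URL.
response = requests.post(
    "https://YOUR-POD-ID.proxy.runpod.net/generate",
    headers={"Content-Type": "application/json"},
    json={
        "inputs": formatted_prompt,
        "parameters": {"max_new_tokens": 500, "temperature": 0.01},
    },
)
print(response.json().get("generated_text", response.text))
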
RonanMcGovern changed discussion status to open

Thanks for the information. I tried it and found my use case needs more reasoning in Q&A, so the results are not good.

Ok, interesting. Hard for me to say without seeing an example, but depending on what the model is doing, it may mean:

  • the prompt needs improvement
  • Mistral 7B is not strong enough, so a stronger model is needed (e.g. you could try Yi 34B, which has a function-calling chat fine-tuned version).
RonanMcGovern changed discussion status to closed

I pulled a couple of news articles from the New York Times and put them all into one txt file. One of the articles is https://www.nytimes.com/2024/01/01/world/asia/south-korea-opposition-lee-jae-myung.html. I want to test whether this 64k model is able to answer questions with factual information from the context.

What I found:

  1. If there is information for the question in the context, the answer is good most of the time. But if there is no information for the question, the answer is full of fabrication or keeps repeating until the max token limit. I guess the fine-tuning training material does not include such a situation.

  2. The 64k model is not following the instructions well; I guess that comes from the base model.

Code:

import requests

# Read the concatenated news articles used as context.
with open('./data/text6.txt', 'r') as f:
    text = f.read()

question = """
What are Lee Jae-myung's main ideas and philosophical thought?
"""

prompt = f"""
Your tasks are:

1. Based solely on the context, please provide an answer of around 200-300 words to the following question. Do not make things up. If the context is not enough to answer the question, please reply "I don't know".
2. Also provide a concise 50-100 word summary of any relevant statistics, facts, or case studies that support your answer.
3. Also provide a concise 50-100 word explanation of the key logical reasoning, conceptual framework, or analysis that underpins your answer.

{question} {text}"""

B_INST, E_INST = "[INST] ", " [/INST]"

headers = {
    "Content-Type": "application/json",
}

data = {
    'inputs': f'{B_INST}{prompt}{E_INST}',
    'parameters': {
        'max_new_tokens': 4000,
        'temperature': 0.01
    },
}

response = requests.post('https://2s.proxy.runpod.net/generate', headers=headers, json=data)
try:
    data1 = response.json()
    print(data1["generated_text"])
except Exception as e:
    print(e)
    print(response.text)

Howdy @deter3.

  1. Yes, you're seeing hallucination. This is a problem across all LLMs, particularly smaller ones.

  2. Yes, it makes sense that the model isn't following your instructions, as they differ from the prompt template this model was tuned on. Small models aren't good at one-shot or zero-shot performance.

To get better performance, I think you need a larger model (maybe the Yi 34B 200k model). If you want a similar size, OpenChat 3.5 is probably better than Mistral 7B (although OpenChat would need to be fine-tuned for longer context).
