Model Card for Firefly-Qwen1.5-14B-En-Alpha

firefly-qwen1.5-en-14b-alpha is a preview version of our new model. It outperforms Qwen1.5-14B-Chat on AlpacaEval 2.0 and on the single-turn tasks of MT-Bench.

Note: More importantly, it was trained with neither SFT nor RLHF. We may share our method later.

What's exciting is that our experimental method can achieve good performance, even though it's still in a very preliminary stage.

Although our model is trained on English data, you can also try chatting with it in Chinese, because Qwen1.5 is also good at Chinese. However, we have not yet evaluated its performance in Chinese.

We advise you to install transformers>=4.37.0.

Because this is a validation experiment and our training resources are limited, we trained this model with QLoRA on top of Qwen1.5-14B, using a max length of 1024. This may limit the model's performance.

Performance

We automatically evaluate models on AlpacaEval 2.0 and MT-Bench with gpt-4o.

We evaluate models on AlpacaEval 2.0 with 805 questions, and our model outperforms Qwen1.5-14B-Chat, with a win rate of 52.17% to 47.83%.

| Task | Ours wins | Qwen1.5-14B-Chat wins |
|---|---|---|
| helpful_base | 67 | 62 |
| koala | 80 | 76 |
| oasst | 100 | 88 |
| selfinstruct | 127 | 125 |
| vicuna | 46 | 34 |
| total | 420 | 385 |
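The totals and the win rate can be re-derived from the per-task counts. The short check below is only a sketch: the variable names are illustrative, and it assumes every one of the 805 questions was decided for one model or the other (no ties), which is consistent with the two totals summing to 805.

```python
# Per-task win counts copied from the table above: (ours, Qwen1.5-14B-Chat).
wins = {
    "helpful_base": (67, 62),
    "koala": (80, 76),
    "oasst": (100, 88),
    "selfinstruct": (127, 125),
    "vicuna": (46, 34),
}

ours_total = sum(ours for ours, _ in wins.values())
baseline_total = sum(base for _, base in wins.values())

# Assumption: no ties, so the number of questions is the sum of both totals.
n_questions = ours_total + baseline_total

ours_rate = 100 * ours_total / n_questions
baseline_rate = 100 * baseline_total / n_questions

print(f"{ours_total} vs {baseline_total} out of {n_questions} questions")
print(f"win rate: {ours_rate:.2f}% : {baseline_rate:.2f}%")
```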

We also evaluate models on MT-Bench. Although our model's overall score is lower than that of Qwen1.5-14B-Chat, it outperforms Qwen1.5-14B-Chat on seven of the eight single-turn tasks (all except coding) and underperforms it on seven of the eight multi-turn tasks (again, all except coding). We conjecture that this may be caused by the training length, and we will investigate this phenomenon later.

Overall performance on MT-Bench:

| Task | Ours | Qwen1.5-14B-Chat |
|---|---|---|
| Avg Score | 7.03 | 7.21 |
| Single-turn Score | 8.01 | 7.66 |
| Multi-turn Score | 6.05 | 6.75 |

Performance on the single-turn tasks of MT-Bench:

| Task | Ours | Qwen1.5-14B-Chat |
|---|---|---|
| writing | 9.1 | 8.9 |
| roleplay | 8.5 | 8.3 |
| extraction | 8.6 | 8.2 |
| stem | 8.8 | 8.5 |
| humanities | 9.0 | 8.8 |
| reasoning | 6.8 | 5.3 |
| math | 7.5 | 7.1 |
| coding | 5.8 | 6.2 |

Performance on the multi-turn tasks of MT-Bench:

| Task | Ours | Qwen1.5-14B-Chat |
|---|---|---|
| writing | 6.5 | 7.7 |
| roleplay | 7.7 | 8.3 |
| extraction | 5.1 | 6.7 |
| stem | 6.3 | 6.9 |
| humanities | 8.3 | 8.8 |
| reasoning | 4.7 | 5.7 |
| math | 4.9 | 5.5 |
| coding | 4.9 | 4.4 |
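The overall MT-Bench scores appear to be plain unweighted means of the per-task scores. The following sketch checks that reading against the reported numbers. The aggregation rule is an assumption on our part, not something stated by the evaluation tooling, and the names are illustrative.

```python
from statistics import mean

# Per-task scores copied from the tables above, in the order:
# writing, roleplay, extraction, stem, humanities, reasoning, math, coding.
scores = {
    "ours": {
        "single": [9.1, 8.5, 8.6, 8.8, 9.0, 6.8, 7.5, 5.8],
        "multi": [6.5, 7.7, 5.1, 6.3, 8.3, 4.7, 4.9, 4.9],
    },
    "qwen1.5-14b-chat": {
        "single": [8.9, 8.3, 8.2, 8.5, 8.8, 5.3, 7.1, 6.2],
        "multi": [7.7, 8.3, 6.7, 6.9, 8.8, 5.7, 5.5, 4.4],
    },
}

summary = {}
for model, turns in scores.items():
    summary[model] = {
        "single": mean(turns["single"]),
        "multi": mean(turns["multi"]),
        # Assumption: the average score is the mean over all 16 per-task scores.
        "avg": mean(turns["single"] + turns["multi"]),
    }
    print(model, {k: round(v, 2) for k, v in summary[model].items()})
```

Under this assumption the computed means agree with the reported overall scores to two decimal places.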

Usage

The chat template of our model is the same as that of the official Qwen1.5-14B-Chat:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello, who are you?<|im_end|>
<|im_start|>assistant
I am a AI program developed by Firefly<|im_end|>
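To make the layout concrete, here is a minimal sketch that renders a message list into this format by hand. The helper name `build_chatml_prompt` is hypothetical and exists only for illustration. In practice the tokenizer's `apply_chat_template` should be treated as the authoritative implementation.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts in the template layout shown above.

    Hypothetical helper for illustration; not part of transformers or Firefly.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello, who are you?"},
]
prompt_text = build_chatml_prompt(messages)
print(prompt_text)
```

The model's reply is expected to end with `<|im_end|>`, which is why that token is a natural stop token during generation.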

You can run inference with the script in Firefly.

You can also use the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name_or_path = "YeungNLP/firefly-qwen1.5-en-14b-alpha"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. "
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
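# Move the inputs to the model's device (also works with device_map='auto')
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes both input_ids and attention_mask
    max_new_tokens=1500,
    do_sample=True,  # needed for top_p and temperature to take effect
    top_p=0.8,
    temperature=0.6,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode('<|im_end|>', add_special_tokens=False)
)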
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)