# Model Card for OMNI DVPS-V1-LT
Qwen2.5-Omni-7B fine-tuned on the v1 tasks:

- Simplification (Yuka)
- Automatic Speech Recognition (Binh)
- Speech Translation (Enes)

The model was trained with a different system prompt for each task (to be unified in the next version). Refer to the dataset for the exact system prompts, although in practice the model also seems to work with other prompts.
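Since each task expects its own system prompt, it can be convenient to keep them in one place and build the system turn programmatically. The snippet below is only an illustration: the speech-translation prompt matches the usage example in this card, while the other two strings are placeholders; the exact prompts live in the training dataset.

```python
# Illustrative placeholders: only the speech-translation prompt below is the
# one shown in this card's usage example; the ASR and simplification prompts
# are assumptions -- check the dataset for the exact strings.
SYSTEM_PROMPTS = {
    "speech_translation": "You are a helpful assistant that translates audio into text.",
    "asr": "You are a helpful assistant that transcribes audio into text.",
    "simplification": "You are a helpful assistant that simplifies text.",
}


def system_message(task: str) -> dict:
    """Build the system turn for a given task in the Qwen-Omni chat format."""
    return {
        "role": "system",
        "content": [{"type": "text", "text": SYSTEM_PROMPTS[task]}],
    }
```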
## How to Get Started with the Model
Use the code below to get started with the model. Make sure you first follow the installation instructions at https://huggingface.co/Qwen/Qwen2.5-Omni-7B.

Everything works like a standard Hugging Face pipeline. You can enable Flash Attention etc. as usual, provided your environment is set up for it.

To use the base model, simply do not load the adapter. Note, however, that the prompt should be different: use the system prompt from the base model page.
```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
from peft import PeftModel
import torch


def merge_lora(
    base_model_path: str,
    lora_checkpoint_path: str,
    submodule_name: str = "thinker",
    cache_dir: str = "cache/",
):
    """Load the original model and merge the LoRA weights into one submodule.

    Args:
        base_model_path (str): Path or Hub ID of the original model.
        lora_checkpoint_path (str): Path or Hub ID of the LoRA adapter.
        submodule_name (str): Name of the submodule to merge (default: "thinker").
        cache_dir (str): Hugging Face cache directory.

    Returns:
        The base model with the LoRA weights merged into the submodule.
    """
    # 1. Load the original model
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        base_model_path, torch_dtype=torch.bfloat16, device_map="auto", cache_dir=cache_dir
    )
    print("Successfully loaded the original model.")

    # 2. Extract the submodule to be merged (e.g., model.thinker)
    if not hasattr(model, submodule_name):
        raise AttributeError(f"The model does not have a submodule named '{submodule_name}'.")
    base_submodule = getattr(model, submodule_name)
    print(f"Successfully extracted submodule: {submodule_name}.")

    # 3. Load the LoRA weights onto the extracted submodule
    lora_model = PeftModel.from_pretrained(base_submodule, lora_checkpoint_path)
    print("LoRA weights loaded successfully.")

    # 4. Merge the LoRA weights into the submodule and unload the LoRA modules
    merged_submodule = lora_model.merge_and_unload()
    print("LoRA weights merged successfully.")

    # 5. Replace the original submodule with the merged one
    setattr(model, submodule_name, merged_submodule)
    return model


cache_dir = ""  # SET YOUR HF CACHE DIR
assert cache_dir != "", "Please set cache_dir first."

model = merge_lora("Qwen/Qwen2.5-Omni-7B", "kit-isl-ai4lt/qwen_omni_lt_v1", cache_dir=cache_dir)

### FOR BASE MODEL, skip the adapter:
### model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
###     "Qwen/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto", cache_dir=cache_dir
### )

# The Talker is enabled by default. We disable it for efficiency; it also no
# longer works after the text-output-only LoRA fine-tuning.
model.disable_talker()

processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", cache_dir=cache_dir)

# You can directly insert a local file path, a URL, or a base64-encoded input
# at the position where you want it in the conversation.
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant that translates audio into text."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "./allo.mp3"},
            {"type": "text", "text": "Translate into English:"},
        ],
    },
]

text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, images=images, videos=videos, audio=audios, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs)
# Slice off the prompt tokens so that only the generated text is printed
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
## Demo

Install Gradio if you want to try the demo system and hold a conversation. It has only been tested in offline mode and with audio-and-text inputs for now.

```shell
python app.py
```
## Data Balancing

In this first version we did not train on all of the datasets; instead, we sampled the data to validate the training pipeline. As a result, some language pairs may perform worse than the baseline after fine-tuning. This will be addressed in the next version.
## Speech Translation (ST) Results

We evaluate two scenarios:

- Seen language pairs: language directions included during training
- Zero-shot (unseen) language pairs: directions not seen during training

Each pair reports BLEU and COMET scores for the baseline model and our fine-tuned model.
### Seen Language Pairs

| Source → Target | Baseline BLEU | Baseline COMET | Omni_lt_V1 BLEU | Omni_lt_V1 COMET |
|---|---|---|---|---|
| zh → de | 13.88 | 0.6203 | 15.64 | 0.8345 |
| zh → en | 14.94 | 0.7140 | 23.88 | 0.8487 |
| zh → fr | 17.27 | 0.6273 | 15.88 | 0.8058 |
| en → zh | 25.04 | 0.7440 | 40.99 | 0.8798 |
| en → fr | 33.74 | 0.6290 | 25.52 | 0.8378 |
### Zero-shot (Unseen) Language Pairs

| Source → Target | Baseline BLEU | Baseline COMET | Omni_lt_V1 BLEU | Omni_lt_V1 COMET |
|---|---|---|---|---|
| zh → it | 11.97 | 0.6590 | 15.17 | 0.8531 |
| zh → ja | 17.23 | 0.8328 | 24.72 | 0.8864 |
| zh → tr | 6.46 | 0.6175 | 7.76 | 0.8013 |
| fr → zh | 19.57 | 0.7174 | 32.72 | 0.8607 |
| fr → de | 19.85 | 0.5994 | 20.77 | 0.8370 |
| fr → it | 19.49 | 0.6363 | 20.17 | 0.8625 |
| tr → zh | 8.56 | 0.6298 | 22.97 | 0.7836 |
## Automatic Speech Recognition Results

Word Error Rate (%); lower is better.
| Dataset | Shard | Baseline | Omni_lt_V1 |
|---|---|---|---|
| ami | sdm | 32.70 | 76.25 |
| ami | ihm | 18.71 | 11.38 |
| earnings22 | split | 19.11 | 29.36 |
| fleurs | ar_eg | 27.01 | 27.89 |
| fleurs | de_de | 9.12 | 7.56 |
| fleurs | cmn_hans_cn | 53.30 | 13.65 |
| fleurs | tr_tr | 46.53 | 39.29 |
| fleurs | fr_fr | 6.62 | 130.10 |
| fleurs | ja_jp | 10.40 | 7.47 |
| fleurs | es_419 | 6.16 | 5.92 |
| fleurs | it_it | 4.30 | 4.70 |
| voxpopuli | fr | 15.39 | 9.63 |
| voxpopuli | de | 15.67 | 11.55 |
| voxpopuli | es | 6.65 | 6.95 |
| voxpopuli | en | 8.87 | 7.93 |
| voxpopuli | it | 15.44 | 35.70 |
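The WER figures above are the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch of the metric (not the evaluation script used for these numbers, which typically also applies text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            if ref[i - 1] == hyp[j - 1]:
                dp[j] = prev_diag  # exact match: no edit needed
            else:
                # substitution, deletion, or insertion
                dp[j] = 1 + min(prev_diag, dp[j], dp[j - 1])
            prev_diag = cur
    return dp[-1] / len(ref)


print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```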
## Text Simplification Results

| Language | n | Baseline BLEU ↑ | Baseline chrF ↑ | Baseline SARI ↑ | Omni_lt_V1 BLEU ↑ | Omni_lt_V1 chrF ↑ | Omni_lt_V1 SARI ↑ |
|---|---|---|---|---|---|---|---|
| Brazilian Portuguese | 420 | 66.22 | 77.15 | 62.28 | 73.59 | 89.62 | 89.13 |
| English | 5,361 | 50.31 | 66.84 | 57.90 | 75.00 | 81.45 | 79.62 |
| French | 445 | 55.76 | 72.26 | 59.31 | 72.86 | 83.44 | 71.64 |
| German | 106 | 57.33 | 73.43 | 60.00 | 87.38 | 92.27 | 87.10 |
| Italian | 1,279 | 52.45 | 68.01 | 57.83 | 71.60 | 82.39 | 70.56 |
| Japanese | 1,099 | 32.82 | 30.67 | 74.01 | 91.11 | 90.63 | 97.56 |
| ALL | 8,710 | 52.16 | 67.09 | 60.23 | 75.31 | 82.48 | 80.69 |
## Training Hyperparameters

**Model**

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Omni-7B |
| image_max_pixels | 262,144 |
| video_max_pixels | 16,384 |
| trust_remote_code | true |

**Method**

| Parameter | Value |
|---|---|
| stage | sft |
| do_train | true |
| finetuning_type | lora |
| lora_rank | 32 |
| lora_target | all |
| freeze_vision_tower | true |
| freeze_multi_modal_projector | true |
| deepspeed | examples/deepspeed/ds_z2_config.json |

**Dataset**

| Parameter | Value |
|---|---|
| dataset | train_v1 (video: mllm_video_demo) |
| eval_dataset | dev_v1 |
| template | qwen2_omni |
| cutoff_len | 8096 |
| overwrite_cache | true |
| streaming | true |
| buffer_size | 128 |
| preprocessing_batch_size | 128 |
| preprocessing_num_workers | 16 |
| dataloader_num_workers | 4 |
| accelerator_config.dispatch_batches | false |

**Output**

| Parameter | Value |
|---|---|
| output_dir | `` |
| logging_steps | 10 |
| save_steps | 1000 |
| plot_loss | true |
| overwrite_output_dir | true |
| save_only_model | false |

**Training**

| Parameter | Value |
|---|---|
| per_device_train_batch_size | 8 |
| gradient_accumulation_steps | 2 |
| learning_rate | 1e-4 |
| num_train_epochs | 3 |
| lr_scheduler_type | cosine |
| warmup_ratio | 0.1 |
| bf16 | true |
| ddp_timeout | 180000000 |
| resume_from_checkpoint | null |
| save_total_limit | 3 |
| max_steps | 30000 |
| report_to | wandb |
| run_name | v1_omni |
| auto_find_batch_size | true |

**Evaluation**

| Parameter | Value |
|---|---|
| per_device_eval_batch_size | 8 |
| eval_strategy | steps |
| eval_steps | 2000 |
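For reproducibility, the tables above correspond to a LLaMA-Factory-style training config. The YAML below is a reconstruction from the tables, not the exact file used: key names follow LLaMA-Factory conventions, and `output_dir` is left blank as in this card.

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32
lora_target: all
freeze_vision_tower: true
freeze_multi_modal_projector: true
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: train_v1  # video portion: mllm_video_demo
eval_dataset: dev_v1
template: qwen2_omni
cutoff_len: 8096
overwrite_cache: true
streaming: true
buffer_size: 128
preprocessing_batch_size: 128
preprocessing_num_workers: 16
dataloader_num_workers: 4
accelerator_config:
  dispatch_batches: false

### output
output_dir:   # set your output directory
logging_steps: 10
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
max_steps: 30000
save_total_limit: 3
report_to: wandb
run_name: v1_omni
auto_find_batch_size: true

### eval
per_device_eval_batch_size: 8
eval_strategy: steps
eval_steps: 2000
```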
## Framework versions

- PEFT 0.15.2