Model Card for OMNI DVPS-V1-LT

Qwen2.5 Omni 7B Fine-tuned on v1 tasks:

  1. Simplification (Yuka)
  2. Automatic Speech Recognition (Binh)
  3. Speech Translation (Enes)

The model was trained with a different system prompt for each task (to be unified in the next version). Refer to the dataset for the exact system prompts, although the model appears to work with other prompts as well.
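Since the system prompt varies per task, one convenient pattern is to keep a small task-to-prompt mapping. The sketch below is illustrative only: the translation prompt matches the example further down this card, but the simplification and ASR strings are placeholders, not the exact prompts from the training dataset.

```python
# Illustrative only: the exact per-task system prompts live in the dataset.
# The "st" prompt matches the usage example on this card; the other two
# strings are placeholders.
TASK_SYSTEM_PROMPTS = {
    "simplification": "You are a helpful assistant that simplifies text.",
    "asr": "You are a helpful assistant that transcribes audio into text.",
    "st": "You are a helpful assistant that translates audio into text.",
}

def system_message(task: str) -> dict:
    """Build the system turn of a chat conversation for a given task."""
    return {
        "role": "system",
        "content": [{"type": "text", "text": TASK_SYSTEM_PROMPTS[task]}],
    }

print(system_message("st")["content"][0]["text"])
```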

How to Get Started with the Model

Use the code below to get started with the model. Please make sure you follow the installation instructions at https://huggingface.co/Qwen/Qwen2.5-Omni-7B.

Everything works like a standard Hugging Face pipeline. You can use Flash Attention etc. as usual if you want, but you have to set up the environment for it.

For the base model, simply do not load the adapter. However, the prompt should be different: use the system prompt from the base model page.

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
from peft import PeftModel
import torch

def merge_lora(
    base_model_path: str,
    lora_checkpoint_path: str,
    extra_file: str = "spk_dict.pt",
    submodule_name: str = "thinker",
    cache_dir: str = "cache/",
):
    """Load the original model, merge the LoRA weights into a specified
    submodule, and return the merged model.

    Args:
        base_model_path (str): Path to the original model directory.
        lora_checkpoint_path (str): Path to the directory containing LoRA weights.
        extra_file (str): Name of the extra file to be copied (default: "spk_dict.pt"; unused in this snippet).
        submodule_name (str): Name of the submodule to merge (default: "thinker").
        cache_dir (str): Hugging Face cache directory.
    """
    # 1. Load the original model
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        base_model_path, torch_dtype=torch.bfloat16, device_map="auto", cache_dir=cache_dir
    )
    print("Successfully loaded the original model.")
    # 2. Extract the submodule to be merged (e.g., model.thinker)
    if not hasattr(model, submodule_name):
        raise AttributeError(f"The model does not have a submodule named '{submodule_name}'.")
    base_submodule = getattr(model, submodule_name)
    print(f"Successfully extracted submodule: {submodule_name}.")
    # 3. Load the LoRA weights onto the extracted submodule
    lora_model = PeftModel.from_pretrained(base_submodule, lora_checkpoint_path)
    processor = Qwen2_5OmniProcessor.from_pretrained(lora_checkpoint_path)
    print("LoRA weights and processor loaded successfully.")
    # 4. Merge the LoRA weights into the submodule and unload the LoRA modules
    merged_submodule = lora_model.merge_and_unload()
    print("LoRA weights merged successfully.")
    # 5. Replace the original submodule with the merged submodule in the model
    setattr(model, submodule_name, merged_submodule)
    return model

cache_dir = ""  # SET YOUR HF CACHE DIR
assert cache_dir != ""
model = merge_lora("Qwen/Qwen2.5-Omni-7B", "kit-isl-ai4lt/qwen_omni_lt_v1", cache_dir=cache_dir)
# Disable the talker: it is enabled by default, is not needed for text output,
# and no longer works after the text-output LoRA fine-tuning.
model.disable_talker()
### FOR BASE MODEL
### model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto", cache_dir=cache_dir)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", cache_dir=cache_dir)

# You can directly insert a local file path, a URL, or base64-encoded data
# at the position where you want it in the conversation.
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant that translates audio into text."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "./allo.mp3"},
            {"type": "text", "text": "Translate into English:"}
        ],
    },
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, images=images, videos=videos, audio=audios, padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs)
# Print only the newly generated text by slicing off the prompt tokens
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```

Demo

Please install Gradio if you want to try the demo system and have a conversation. Only the offline version with audio-text inputs has been tested so far.

```shell
python app.py
```

Data Balancing

In this first version, we do not train on all of the data; instead, we sample from the datasets to validate the training pipeline. As a result, some language pairs may perform worse than the baseline after fine-tuning; this will be addressed in the next version.
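Capping the number of examples drawn from each task's dataset can be sketched as follows; the `cap` parameter and dataset names are hypothetical, not the values used for v1.

```python
import random

def subsample(datasets: dict, cap: int, seed: int = 0) -> dict:
    """Take at most `cap` examples from each task's dataset (illustrative sketch)."""
    rng = random.Random(seed)
    sampled = {}
    for task, examples in datasets.items():
        if len(examples) <= cap:
            sampled[task] = list(examples)       # keep small datasets whole
        else:
            sampled[task] = rng.sample(examples, cap)  # downsample large ones
    return sampled

data = {"asr": list(range(10)), "st": list(range(3))}
print({k: len(v) for k, v in subsample(data, cap=5).items()})
# {'asr': 5, 'st': 3}
```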


Speech-Translation (ST) Results

We evaluate two scenarios:

  • Seen language pairs: language directions included during training
  • Zero-shot (unseen) language pairs: directions not seen during training

Each pair shows BLEU and COMET scores for the Baseline model and our Fine-tuned model.
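The scores below were presumably produced with standard tooling (e.g. sacrebleu for BLEU and a trained COMET checkpoint). For reference, BLEU combines clipped n-gram precisions with a brevity penalty; a simplified sentence-level sketch, not the exact scorer used here:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: clipped n-gram precisions + brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # real BLEU applies smoothing instead
        log_precisions.append(math.log(overlap / total))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

COMET, by contrast, is a learned metric from a neural model and cannot be reduced to a short formula like this.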

Seen Language Pairs

| Source → Target | Baseline BLEU | Baseline COMET | Omni_lt_V1 BLEU | Omni_lt_V1 COMET |
|---|---|---|---|---|
| zh → de | 13.88 | 0.6203 | 15.64 | 0.8345 |
| zh → en | 14.94 | 0.7140 | 23.88 | 0.8487 |
| zh → fr | 17.27 | 0.6273 | 15.88 | 0.8058 |
| en → zh | 25.04 | 0.7440 | 40.99 | 0.8798 |
| en → fr | 33.74 | 0.6290 | 25.52 | 0.8378 |

Zero-shot (Unseen) Language Pairs

| Source → Target | Baseline BLEU | Baseline COMET | Omni_lt_V1 BLEU | Omni_lt_V1 COMET |
|---|---|---|---|---|
| zh → it | 11.97 | 0.6590 | 15.17 | 0.8531 |
| zh → ja | 17.23 | 0.8328 | 24.72 | 0.8864 |
| zh → tr | 6.46 | 0.6175 | 7.76 | 0.8013 |
| fr → zh | 19.57 | 0.7174 | 32.72 | 0.8607 |
| fr → de | 19.85 | 0.5994 | 20.77 | 0.8370 |
| fr → it | 19.49 | 0.6363 | 20.17 | 0.8625 |
| tr → zh | 8.56 | 0.6298 | 22.97 | 0.7836 |

Automatic Speech Recognition Results

Word Error Rate (%); lower is better.

| Dataset | Shard | Baseline | Omni_lt_V1 |
|---|---|---|---|
| ami | sdm | 32.70 | 76.25 |
| ami | ihm | 18.71 | 11.38 |
| earnings22 | split | 19.11 | 29.36 |
| fleurs | ar_eg | 27.01 | 27.89 |
| fleurs | de_de | 9.12 | 7.56 |
| fleurs | cmn_hans_cn | 53.30 | 13.65 |
| fleurs | tr_tr | 46.53 | 39.29 |
| fleurs | fr_fr | 6.62 | 130.10 |
| fleurs | ja_jp | 10.40 | 7.47 |
| fleurs | es_419 | 6.16 | 5.92 |
| fleurs | it_it | 4.30 | 4.70 |
| voxpopuli | fr | 15.39 | 9.63 |
| voxpopuli | de | 15.67 | 11.55 |
| voxpopuli | es | 6.65 | 6.95 |
| voxpopuli | en | 8.87 | 7.93 |
| voxpopuli | it | 15.44 | 35.70 |
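The WER values above are word-level edit distances (substitutions + insertions + deletions) divided by the reference length. As an illustration of the metric itself (real evaluations typically also apply text normalization before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(wer("the cat sat", "the bat sat"))  # one substitution over three reference words
```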

Text Simplification Results

| Language | n | Baseline BLEU ↑ | Baseline chrF ↑ | Baseline SARI ↑ | Omni_lt_V1 BLEU ↑ | Omni_lt_V1 chrF ↑ | Omni_lt_V1 SARI ↑ |
|---|---|---|---|---|---|---|---|
| Brazilian Portuguese | 420 | 66.22 | 77.15 | 62.28 | 73.59 | 89.62 | 89.13 |
| English | 5,361 | 50.31 | 66.84 | 57.90 | 75.00 | 81.45 | 79.62 |
| French | 445 | 55.76 | 72.26 | 59.31 | 72.86 | 83.44 | 71.64 |
| German | 106 | 57.33 | 73.43 | 60.00 | 87.38 | 92.27 | 87.10 |
| Italian | 1,279 | 52.45 | 68.01 | 57.83 | 71.60 | 82.39 | 70.56 |
| Japanese | 1,099 | 32.82 | 30.67 | 74.01 | 91.11 | 90.63 | 97.56 |
| ALL | 8,710 | 52.16 | 67.09 | 60.23 | 75.31 | 82.48 | 80.69 |
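Of the metrics above, chrF compares character n-grams (n = 1..6 by default) between hypothesis and reference and reports an F-score with β = 2, weighting recall over precision. A simplified sketch of the idea, assuming whitespace is stripped before counting (real scores are typically computed with sacrebleu, which also handles corpus-level aggregation):

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average character n-gram F-beta over n = 1..max_n."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    scores = []
    for n in range(1, max_n + 1):
        hyp_ng = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ng = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ng or not ref_ng:
            continue  # string shorter than n characters
        overlap = sum((hyp_ng & ref_ng).values())
        precision = overlap / sum(hyp_ng.values())
        recall = overlap / sum(ref_ng.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * precision * recall
                          / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0
```

SARI additionally compares against the source sentence, scoring added, kept, and deleted n-grams, so it needs the inputs as well as the references.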

Training Hyperparameters

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Omni-7B |
| image_max_pixels | 262,144 |
| video_max_pixels | 16,384 |
| trust_remote_code | true |

| Parameter | Value |
|---|---|
| stage | sft |
| do_train | true |
| finetuning_type | lora |
| lora_rank | 32 |
| lora_target | all |
| freeze_vision_tower | true |
| freeze_multi_modal_projector | true |
| deepspeed | examples/deepspeed/ds_z2_config.json |

| Parameter | Value |
|---|---|
| dataset | train_v1 (video: mllm_video_demo) |
| eval_dataset | dev_v1 |
| template | qwen2_omni |
| cutoff_len | 8096 |
| overwrite_cache | true |
| streaming | true |
| buffer_size | 128 |
| preprocessing_batch_size | 128 |
| preprocessing_num_workers | 16 |
| dataloader_num_workers | 4 |
| accelerator_config.dispatch_batches | false |

| Parameter | Value |
|---|---|
| output_dir | `` |
| logging_steps | 10 |
| save_steps | 1000 |
| plot_loss | true |
| overwrite_output_dir | true |
| save_only_model | false |

| Parameter | Value |
|---|---|
| per_device_train_batch_size | 8 |
| gradient_accumulation_steps | 2 |
| learning_rate | 1e-4 |
| num_train_epochs | 3 |
| lr_scheduler_type | cosine |
| warmup_ratio | 0.1 |
| bf16 | true |
| ddp_timeout | 180000000 |
| resume_from_checkpoint | null |
| save_total_limit | 3 |
| max_steps | 30000 |
| report_to | wandb |
| run_name | v1_omni |
| auto_find_batch_size | true |

| Parameter | Value |
|---|---|
| per_device_eval_batch_size | 8 |
| eval_strategy | steps |
| eval_steps | 2000 |
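With per_device_train_batch_size = 8 and gradient_accumulation_steps = 2, the effective global batch size per optimizer step is 8 × 2 × (number of GPUs); the 4-GPU world size below is a hypothetical example, since the hardware is not listed here.

```python
def effective_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Global batch size seen by the optimizer per update step."""
    return per_device * grad_accum * world_size

# With the values above and a hypothetical 4-GPU node:
print(effective_batch_size(8, 2, 4))  # 64
```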

Framework versions

  • PEFT 0.15.2