CosyVoice 3 — Norwegian LoRA fine-tune

LoRA fine-tune of Fun-CosyVoice3-0.5B-2512 on Norwegian Bokmål speech. Trained as part of an internal self-hosted TTS stack.

Release: step 20,880 · Published: 2026-05-28

What's in this release

  • model_20880_ema.pt (2,025 MB) — EMA-merged Qwen2 LLM weights, drop-in CosyVoice3 inference checkpoint.
  • model_20880_lora_state.pt (212 MB) — LoRA adapters + AdamW state + EMA shadow + step counter, for resuming fine-tunes.

The EMA-merged file is what you load for inference. The LoRA-state sidecar is for anyone who wants to continue training from this checkpoint.
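
For readers unfamiliar with what "merged" means here, the following is a minimal sketch of folding a low-rank adapter into a base weight matrix, W' = W + s·(B·A). It is a toy illustration, not the script used for this release: the helper names, the 2×2 values, and the scale s (conventionally alpha / r; alpha is not stated in this card) are all assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scale):
    """Return weight + scale * (lora_b @ lora_a), leaving the inputs untouched."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (values are illustrative only).
base = [[1.0, 0.0], [0.0, 1.0]]
ema_a = [[0.5, -0.5]]    # shape (rank, in_features): averaged A factor
ema_b = [[2.0], [4.0]]   # shape (out_features, rank): averaged B factor

# scale=0.5 is a hypothetical stand-in for alpha / r.
merged = merge_lora(base, ema_a, ema_b, scale=0.5)
print(merged)  # [[1.5, -0.5], [1.0, 0.0]]
```

After merging, the layer is an ordinary dense weight again, which is why the merged file can be loaded without any LoRA-aware code.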

Training setup

  • Stage: stage_4_llm_lora — LoRA on the Qwen2-0.5B LLM frontend only. Flow-matching decoder + HiFi-GAN inherit the base CosyVoice3 weights unchanged.

  • LoRA shape: r=24, last_k=24 (wraps all 24 Qwen2 transformer blocks). ~13.2M trainable params over ~500M base.

  • Optimizer: AdamW, peak LR 1.0e-5, warmup 2,000 steps, Noam decay.

  • Batch: 4,800 frames/GPU × grad_accum=4 (effective 19,200 frames).

  • Max grad norm: 0.3.

  • EMA: maintained over the LoRA delta itself (power-law warmup, β→0.9999, update_every=10). Inference checkpoint is EMA-merged.

  • Hardware: single NVIDIA RTX 3090 (24 GB).

  • Instruct prefix: "You are a helpful assistant.<|endofprompt|>" injected at tokenize time during training and at inference; CV3's LLM asserts the <|endofprompt|> token (id 151646) is present.

  • Latest training loss: 0.3835 at step 20,880.

  • Total training time: ~49.9 h.
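
Several figures in the list above can be sanity-checked with a few lines of arithmetic. The sketch below rests on assumptions not stated in this card: the Qwen2-0.5B layer dimensions, adapters on all seven linear projections of every block, and the usual Noam form (linear warmup, then inverse-square-root decay). Under those assumptions the adapter count comes out at about 13.2M, which is consistent with the figure above but does not prove the exact configuration.

```python
# ASSUMPTION: Qwen2-0.5B dimensions; they are not listed in this card.
HIDDEN = 896    # model width
KV_DIM = 128    # 2 key/value heads x 64 head dim
FFN = 4864      # MLP intermediate width
LAYERS = 24     # last_k=24 covers every transformer block
RANK = 24       # r=24

# ASSUMPTION: every linear projection in a block carries an adapter.
# Each entry is (in_features, out_features).
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_params(rank, shapes, layers):
    """A LoRA pair adds rank * (in + out) weights per wrapped linear layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * layers


def noam_lr(step, peak_lr=1.0e-5, warmup=2000):
    """Linear warmup to peak_lr, then inverse-square-root decay (assumed form)."""
    step = max(step, 1)
    return peak_lr * min(step / warmup, (warmup / step) ** 0.5)


trainable = lora_params(RANK, PROJECTIONS, LAYERS)
effective_frames = 4800 * 4          # frames/GPU x grad_accum on one GPU
lr_at_release = noam_lr(20880)       # learning rate at the released step

print(trainable)         # 13197312
print(effective_frames)  # 19200
print(lr_at_release)
```

If the trainer uses a different decay shape or wraps fewer projections, only the constants and the `noam_lr` body need to change.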

Reference dataset

Norwegian Bokmål speech, ~458 h total, four sources combined into a single CosyVoice list-format corpus:

| Source | Clips | Hours | License |
|---|---|---|---|
| NbAiLab/NST | ~219,000 | ~540 | Apache 2.0 |
| NbAiLab/NPSC | ~32,000 | ~140 | CC-0 |
| NRK Politisk kvarter (podcast, segmented + diarized) | ~26,000 | (not listed) | NRK content, see license caveat below |
| NRK Burde vært pensum + adjacent podcast/audiobook sources | ~11,000 | (not listed) | NRK content, see license caveat below |

Per-clip preprocessing, applied in this order:

  • Demucs music/noise separation.
  • nb-whisper-large transcription.
  • pyannote 3.1 speaker diarization; mixed-speaker clips are rejected.
  • nb-wav2vec2-1b-bokmaal verbatim CTC pass. Its output becomes the training label because it preserves the disfluencies and repeats that Whisper elides.
  • Length and word-confidence filtering.
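
The rejection and filtering steps at the end of this pipeline can be sketched as a single predicate. This is only an illustration of the idea: the `Clip` fields, the duration bounds, the confidence threshold, and the choice to test the lowest word confidence (rather than, say, the mean) are hypothetical and are not the values used to build this corpus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Clip:
    duration_s: float                 # clip length in seconds
    num_speakers: int                 # speakers found by diarization
    word_confidences: List[float] = field(default_factory=list)


def keep_clip(clip, min_s=1.0, max_s=30.0, min_conf=0.5):
    """Return True if the clip survives all filters (thresholds are hypothetical)."""
    if clip.num_speakers != 1:
        return False                  # mixed-speaker (or empty) clips are rejected
    if not (min_s <= clip.duration_s <= max_s):
        return False                  # too short or too long
    if not clip.word_confidences:
        return False                  # nothing was transcribed
    return min(clip.word_confidences) >= min_conf


clips = [
    Clip(4.2, 1, [0.9, 0.8]),   # kept
    Clip(4.2, 2, [0.9, 0.8]),   # two speakers
    Clip(0.3, 1, [0.9]),        # too short
    Clip(4.2, 1, [0.9, 0.2]),   # one low-confidence word
]
kept = [c for c in clips if keep_clip(c)]
print(len(kept))  # 1
```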

Quick start

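import torch
from cosyvoice.cli.cosyvoice import CosyVoice3

# Load the base model; its LLM weights are overwritten below.
cosy = CosyVoice3('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', fp16=True)

# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
state = torch.load('model_20880_ema.pt', map_location='cpu', weights_only=False)
# Drop training bookkeeping entries that are not model tensors.
state = {k: v for k, v in state.items() if k not in ('step', 'epoch')}
# strict=False tolerates key mismatches instead of raising on them.
cosy.model.llm.load_state_dict(state, strict=False)

# Supply these yourself before running the loop:
#   ref_audio_16k  - 16 kHz reference recording of the target voice
#   ref_transcript - transcript of that recording
audio_chunks = []
for chunk in cosy.inference_zero_shot(
    tts_text='Norsk talesyntese skal være tilgjengelig for alle.',
    prompt_text='You are a helpful assistant.<|endofprompt|>' + ref_transcript,
    prompt_speech_16k=ref_audio_16k,
):
    audio_chunks.append(chunk['tts_speech'])

# Join the chunks; assumes each tts_speech tensor is shaped (1, num_samples).
audio = torch.cat(audio_chunks, dim=1)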

License

This release: CC BY-NC 4.0. Research and non-commercial use only.

Three pieces of licensing apply:

  • The base model FunAudioLLM/Fun-CosyVoice3-0.5B-2512 is Apache 2.0 — that license still applies to the unmodified base weights inside the EMA-merged checkpoint. Apache 2.0 permits redistribution under additional terms for derivative work.
  • Our LoRA delta and the model card are released under CC BY-NC 4.0.
  • The NRK podcast portion of the training corpus is NOT cleared for commercial redistribution, which is the binding constraint that makes the whole release non-commercial. NST (Apache 2.0) and NPSC (CC-0) on their own would allow a more permissive release.

If you need a commercial-friendly model, the right move is to re-train on a fully commercial-cleared corpus (e.g. NST + NPSC only, or with a properly-licensed conversational dataset substituted for the podcast portion).

Caveats

  • Bokmål-focused; Nynorsk and dialectal coverage is limited by the source corpus.
  • The prefix "You are a helpful assistant.<|endofprompt|>" is still required at inference time (the Quick start prepends it to prompt_text). This matches the training-time tokenization, and without it the CV3 LLM does not emit speech tokens at all.
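
One way to avoid forgetting the prefix is a small helper that adds it exactly once. This is a hedged sketch: `with_instruct_prefix` is a hypothetical name, not part of CosyVoice, and the prefix string is copied from this card.

```python
INSTRUCT_PREFIX = "You are a helpful assistant.<|endofprompt|>"


def with_instruct_prefix(text: str) -> str:
    """Prepend the instruct prefix unless the text already starts with it."""
    if text.startswith(INSTRUCT_PREFIX):
        return text
    return INSTRUCT_PREFIX + text


once = with_instruct_prefix("Hei på deg.")
twice = with_instruct_prefix(once)
print(once)
print(once == twice)  # True
```

Calling the helper on already-prefixed text leaves it unchanged, so it is safe to apply at every call site.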

Auto-generated by training-cosy/scripts/publish_hf.py on 2026-05-28.


Model repository: AlexKjes/cosyvoice3-norwegian-lora