CosyVoice 3 — Norwegian LoRA fine-tune
LoRA fine-tune of Fun-CosyVoice3-0.5B-2512 on Norwegian Bokmål speech. Trained as part of an internal self-hosted TTS stack.
Release: step 20,880 · Published: 2026-05-28
What's in this release
- model_20880_ema.pt (2,025 MB): EMA-merged Qwen2 LLM weights, drop-in CosyVoice3 inference checkpoint.
- model_20880_lora_state.pt (212 MB): LoRA adapters + AdamW state + EMA shadow + step counter, for resuming fine-tunes.
The EMA-merged file is what you load for inference. The LoRA-state sidecar is for anyone who wants to continue training from this checkpoint.
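As a rough consistency check on the sidecar size, the sketch below multiplies the ~13.2M trainable LoRA parameters reported in the training setup by four tensor copies. Two things here are assumptions, not statements from the card: that everything is stored in fp32, and that the sidecar holds exactly four copies per parameter (adapter weights, the two AdamW moment tensors, and the EMA shadow).

```python
# Back-of-the-envelope size estimate for model_20880_lora_state.pt.
LORA_PARAMS = 13_200_000  # "~13.2M trainable params" from the training setup
BYTES_PER_VALUE = 4       # ASSUMPTION: fp32 storage
COPIES = 4                # ASSUMPTION: adapters + AdamW exp_avg + exp_avg_sq + EMA shadow


def sidecar_size_mb(params: int = LORA_PARAMS) -> float:
    """Estimated sidecar size in decimal megabytes (1 MB = 1e6 bytes)."""
    return params * BYTES_PER_VALUE * COPIES / 1e6


if __name__ == "__main__":
    print(f"{sidecar_size_mb():.1f} MB")  # 211.2 MB
```

Under those assumptions the estimate lands close to the listed 212 MB; the small remainder would be the step counter and serialization overhead.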
Training setup
- Stage: stage_4_llm_lora (LoRA on the Qwen2-0.5B LLM frontend only). The flow-matching decoder and HiFi-GAN inherit the base CosyVoice3 weights unchanged.
- LoRA shape: r=24, last_k=24 (wraps all 24 Qwen2 transformer blocks). ~13.2M trainable params over ~500M base.
- Optimizer: AdamW, peak LR 1.0e-5, warmup 2,000 steps, Noam decay.
- Batch: 4,800 frames/GPU × grad_accum=4 (effective 19,200 frames).
- Max grad norm: 0.3.
- EMA: maintained over the LoRA delta itself (power-law warmup, β→0.9999, update_every=10). The inference checkpoint is EMA-merged.
- Hardware: single NVIDIA RTX 3090 (24 GB).
- Instruct prefix: "You are a helpful assistant.<|endofprompt|>" is injected at tokenize time during training and at inference; CV3's LLM asserts that the <|endofprompt|> token (id 151646) is present.
- Latest training loss: 0.3835 at step 20,880.
- Total training time: ~49.9 h.
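The numbers in the setup above can be sanity-checked with a few lines of arithmetic. The sketch below uses the layer dimensions from the public Qwen2-0.5B config (hidden size 896, MLP size 4864, two key/value heads of dimension 64). It assumes that LoRA wraps all seven linear projections in each block; the card does not list the target modules, but that choice reproduces the ~13.2M figure. The learning-rate function is the common peak-normalised Noam form, which may differ in scaling from what the training script actually uses.

```python
import math

# Qwen2-0.5B dimensions, taken from the public model config.
HIDDEN = 896
INTERMEDIATE = 4864
KV_DIM = 2 * 64  # 2 key/value heads x head_dim 64
NUM_BLOCKS = 24
RANK = 24

# ASSUMPTION: LoRA wraps all seven linear projections in every block.
# The card does not list target modules; this choice reproduces ~13.2M.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int = RANK, blocks: int = NUM_BLOCKS) -> int:
    """A LoRA pair (A: rank x in, B: out x rank) adds rank * (in + out) weights."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_block * blocks


def noam_lr(step: int, peak_lr: float = 1.0e-5, warmup: int = 2000) -> float:
    """Linear warmup to peak_lr at `warmup`, then inverse-square-root decay.

    ASSUMPTION: peak-normalised Noam form; the real script may scale differently.
    """
    step = max(step, 1)
    return peak_lr * min(step / warmup, math.sqrt(warmup / step))


# Frames per optimizer step on a single GPU: frames/GPU x grad_accum.
EFFECTIVE_FRAMES = 4800 * 4

if __name__ == "__main__":
    print(lora_param_count())  # 13197312
    print(EFFECTIVE_FRAMES)    # 19200
    print(noam_lr(2000))       # 1e-05
```

Under this form the learning rate peaks at step 2,000 and has fallen to half the peak by step 8,000.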
Reference dataset
Norwegian Bokmål speech, ~458 h total, four sources combined into a single CosyVoice list-format corpus:
| Source | Clips | Hours | License |
|---|---|---|---|
| NbAiLab/NST | ~219,000 | ~540 | Apache 2.0 |
| NbAiLab/NPSC | ~32,000 | ~140 | CC-0 |
| NRK Politisk kvarter (podcast, segmented + diarized) | ~26,000 | — | NRK content, see license caveat below |
| NRK Burde vært pensum + adjacent podcast/audiobook sources | ~11,000 | — | NRK content, see license caveat below |
Per-clip preprocessing: Demucs music/noise separation → nb-whisper-large transcription → pyannote 3.1 speaker diarization (mixed-speaker clips rejected) → nb-wav2vec2-1b-bokmaal verbatim CTC pass (becomes the training label, preserves disfluencies/repeats that whisper elides) → length + word-confidence filtering.
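The last two gates of that pipeline (rejecting mixed-speaker clips, then filtering on length and word confidence) could look roughly like the sketch below. The `Clip` structure, the `keep_clip` helper, and all three thresholds are hypothetical, chosen only for illustration; the card does not publish the actual cut-offs or data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """Hypothetical per-clip record produced by the earlier pipeline stages."""
    duration_s: float
    word_confidences: List[float]  # one CTC confidence per transcribed word
    num_speakers: int              # speakers found by diarization


# HYPOTHETICAL thresholds; the real values are not given in the card.
MIN_DURATION_S = 1.0
MAX_DURATION_S = 30.0
MIN_WORD_CONFIDENCE = 0.5


def keep_clip(clip: Clip) -> bool:
    """Return True if the clip passes the speaker, length, and confidence gates."""
    if clip.num_speakers != 1:
        return False  # mixed-speaker (or empty) clips are rejected
    if not (MIN_DURATION_S <= clip.duration_s <= MAX_DURATION_S):
        return False
    if not clip.word_confidences:
        return False  # nothing was transcribed
    # Gate on the weakest word so one badly aligned word rejects the clip.
    return min(clip.word_confidences) >= MIN_WORD_CONFIDENCE


if __name__ == "__main__":
    clips = [Clip(4.2, [0.93, 0.88], 1), Clip(4.2, [0.93, 0.21], 1), Clip(6.0, [0.9], 2)]
    print([keep_clip(c) for c in clips])  # [True, False, False]
```

Gating on the minimum word confidence is one possible design; a mean or percentile would be more lenient toward clips with a single uncertain word.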
Quick start
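import torch
from cosyvoice.cli.cosyvoice import CosyVoice3

# Load the base model, then overwrite its LLM weights with the EMA-merged checkpoint.
cosy = CosyVoice3('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', fp16=True)
state = torch.load('model_20880_ema.pt', map_location='cpu', weights_only=False)
# Drop training bookkeeping entries that are not model tensors.
state = {k: v for k, v in state.items() if k not in ('step', 'epoch')}
cosy.model.llm.load_state_dict(state, strict=False)

# ref_transcript (str) and ref_audio_16k (16 kHz reference audio) must be
# supplied by the caller: a short clip of the target voice and its transcript.
audio_chunks = []
for chunk in cosy.inference_zero_shot(
    tts_text='Norsk talesyntese skal være tilgjengelig for alle.',
    prompt_text='You are a helpful assistant.<|endofprompt|>' + ref_transcript,
    prompt_speech_16k=ref_audio_16k,
):
    audio_chunks.append(chunk['tts_speech'])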
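Because the quick start builds prompt_text by string concatenation, it is easy to forget the prefix or to add it twice. A small guard like the one below may help. `build_prompt_text` is a hypothetical helper, not part of CosyVoice; it only handles the string and does not check the tokenizer's token id.

```python
INSTRUCT_PREFIX = "You are a helpful assistant.<|endofprompt|>"
END_OF_PROMPT = "<|endofprompt|>"


def build_prompt_text(ref_transcript: str) -> str:
    """Hypothetical helper: return the transcript with the instruct prefix present once.

    If the text already contains the <|endofprompt|> marker it is returned
    unchanged, so calling the helper twice does not duplicate the prefix.
    """
    if END_OF_PROMPT in ref_transcript:
        return ref_transcript
    return INSTRUCT_PREFIX + ref_transcript.strip()


if __name__ == "__main__":
    print(build_prompt_text("Hei, jeg heter Kari."))
    # You are a helpful assistant.<|endofprompt|>Hei, jeg heter Kari.
```

The result would then be passed as prompt_text in the inference_zero_shot call.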
License
This release: CC BY-NC 4.0. Research and non-commercial use only.
Three pieces of licensing apply:
- The base model FunAudioLLM/Fun-CosyVoice3-0.5B-2512 is Apache 2.0, and that license still applies to the unmodified base weights inside the EMA-merged checkpoint. Apache 2.0 permits redistribution under additional terms for derivative work.
- Our LoRA delta and the model card are released under CC BY-NC 4.0.
- The NRK podcast portion of the training corpus is NOT cleared for commercial redistribution, which is the binding constraint that makes the whole release non-commercial. NST (Apache 2.0) and NPSC (CC-0) on their own would allow a more permissive release.
If you need a commercial-friendly model, the right move is to re-train on a fully commercial-cleared corpus (e.g. NST + NPSC only, or with a properly-licensed conversational dataset substituted for the podcast portion).
Caveats
- Bokmål-focused; Nynorsk and dialectal coverage is limited by the source corpus.
- The Norwegian text input must still be prefixed with "You are a helpful assistant.<|endofprompt|>" at inference time. This matches the training-time tokenization and is required for the CV3 LLM to emit speech tokens at all.
Auto-generated by training-cosy/scripts/publish_hf.py on 2026-05-28.