Model Card — Whisper Large Icelandic (Spjallrómur Fine-tune)

Model Details

Model Description

This model is a fine-tuned version of language-and-voice-lab/whisper-large-icelandic-62640-steps-967h for Automatic Speech Recognition (ASR) in Icelandic, with a focus on conversational and spontaneous speech. It was further fine-tuned on the Spjallrómur corpus — an Icelandic conversational speech dataset — to improve robustness on informal, dialogue-style audio.

  • Developed by: Páll Rúnarsson
  • Funded by: Almannarómur (Language Technology Programme for Icelandic)
  • Shared by: Language and Voice Laboratory, Reykjavík University
  • Model type: Automatic Speech Recognition (Transformer encoder-decoder)
  • Language(s): Icelandic (is)
  • License: CC BY-SA 4.0 — free to use, share, and adapt, provided you give appropriate credit and distribute any derivatives under the same license.
  • Fine-tuned from: language-and-voice-lab/whisper-large-icelandic-62640-steps-967h, which is itself a fine-tune of openai/whisper-large

Model Sources

  • Repository: (add link)
  • Demo: (add link)

Uses

Direct Use

This model is intended for transcription of Icelandic speech, particularly conversational and spontaneous speech. It can be used directly via the 🤗 Transformers pipeline API or integrated into larger ASR pipelines.

Downstream Use

The model may be fine-tuned further for domain-specific Icelandic ASR tasks (e.g., legal, broadcast, or medical transcription) where spontaneous speech patterns are common. Any derivative models must be released under the same CC BY-SA 4.0 license with appropriate attribution.

Out-of-Scope Use

  • Languages other than Icelandic
  • Highly technical or domain-specific jargon without additional fine-tuning
  • Real-time streaming inference without appropriate latency optimisations

Bias, Risks, and Limitations

  • Performance may degrade on heavily accented, dialectal, or code-switched speech (e.g., Icelandic/English or Icelandic/Danish mixing).
  • The Spjallrómur training data reflects particular speaker demographics; underrepresented groups may see higher error rates.
  • As with all ASR systems, proper nouns, rare words, and domain-specific terminology present challenges.

Recommendations

Users should evaluate the model on their target domain before deployment. For sensitive applications (legal, medical), human review of transcripts is strongly recommended.


How to Get Started with the Model

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="palli23/Whisper-Large-Spjallromur"
)

result = asr("audio.wav")
print(result["text"])

For more control:

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_NAME = "palli23/Whisper-Large-Spjallromur"
# Use a GPU when one is available; otherwise fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to(DEVICE)

def transcribe(audio_path):
    # Whisper expects 16 kHz mono audio; librosa resamples on load.
    audio, sr = librosa.load(audio_path, sr=16_000)
    input_features = processor(
        audio, sampling_rate=sr, return_tensors="pt"
    ).input_features.to(DEVICE)

    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        predicted_ids = model.generate(input_features)[0]

    return processor.decode(predicted_ids, skip_special_tokens=True)

print(transcribe("audio.wav"))
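One practical caveat with the manual route: by default the Whisper feature extractor pads or truncates its input to a 30-second window, so the `transcribe` function above would likely cover only the start of a longer recording. A minimal sketch of one workaround is to split the waveform into fixed windows first. The helper name `chunk_bounds` and its parameters are hypothetical, introduced here for illustration only.

```python
def chunk_bounds(num_samples: int, sr: int = 16_000, chunk_s: float = 30.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices that cover the audio in windows of at most chunk_s seconds.

    Hypothetical helper: the name and defaults are assumptions, not part of the model or library.
    """
    if num_samples < 0 or sr <= 0 or chunk_s <= 0:
        raise ValueError("num_samples must be >= 0; sr and chunk_s must be > 0")
    step = int(sr * chunk_s)  # samples per window
    # The last window is clipped to the end of the audio, so it may be shorter.
    return [(start, min(start + step, num_samples)) for start in range(0, num_samples, step)]


# A 70-second clip at 16 kHz splits into two full windows and one 10-second remainder.
print(chunk_bounds(70 * 16_000))
```

Each `audio[start:end]` slice can then be passed through the processor and `model.generate` separately and the texts joined. Fixed windows can cut a word in half at a boundary, so for conversational audio it may be preferable to use the pipeline's `chunk_length_s` argument, which handles overlapping chunks itself.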

Training Details

Training Data

Fine-tuned on Spjallrómur, an Icelandic conversational speech corpus published via CLARIN-IS. The corpus contains spontaneous, dialogue-style speech and is distinct from the read-speech corpora (Samrómur Milljón, Málrómur) used in the base model.

The base model checkpoint used as the starting point was trained for 62,640 steps on 967 hours of Icelandic read speech from Samrómur Milljón.

Training Procedure

Starting checkpoint: language-and-voice-lab/whisper-large-icelandic-62640-steps-967h

Training Hyperparameters

  • Training regime: (add: fp16/bf16, batch size, learning rate, steps, warmup, etc.)

Speeds, Sizes, Times

  • (add: GPU type, training duration, model size)

Evaluation

Testing Data

The model was evaluated on the Spjallrómur test set — a held-out partition of conversational Icelandic speech not seen during fine-tuning.

Metrics

Word Error Rate (WER) is used as the primary evaluation metric, computed after normalising both reference and hypothesis transcripts.
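The card does not specify the normalisation or the WER tooling used, so the following is only a sketch of how such a score could be computed. It assumes a simple normalisation (lowercasing and punctuation removal) and corpus-level WER (total word errors divided by total reference words); the function names are illustrative, not the author's evaluation code.

```python
import re


def normalise(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace (assumed normalisation)."""
    # \w is Unicode-aware for str patterns, so Icelandic letters such as þ, ð, æ and á are kept.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total word errors divided by total reference words over all utterance pairs."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = normalise(ref), normalise(hyp)
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


# One substitution and one deletion against a four-word reference.
print(corpus_wer(["a b c d"], ["a x c"]))
```

Because the normalisation step changes which differences count as errors, WER figures are only comparable when the same normalisation is applied to every system being compared.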

Results

The model was evaluated on a held-out test set from Spjallrómur.

Model                                      Test WER
Base (Icelandic read-speech fine-tune)     46.2%
This model (Spjallrómur conversational)    28.6%

This represents a 17.6 percentage point absolute reduction — a 38% relative improvement in WER on conversational Icelandic speech.
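The two figures can be reproduced from the reported WER values with a short calculation. The variable names are illustrative only.

```python
base_wer = 46.2   # base read-speech model, test WER in percent
tuned_wer = 28.6  # this model, test WER in percent

absolute_reduction = base_wer - tuned_wer                 # percentage points
relative_reduction = absolute_reduction / base_wer * 100  # percent of the base WER

print(f"absolute: {absolute_reduction:.1f} pp")
print(f"relative: {relative_reduction:.1f} %")
```

The relative figure is the absolute reduction expressed as a share of the base model's WER, which is why it is larger than the percentage-point difference.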

Technical Specifications

Model Architecture and Objective

Whisper Large — a transformer encoder-decoder architecture trained with a sequence-to-sequence cross-entropy objective. The architecture is unchanged from the OpenAI Whisper Large baseline; only the weights are adapted via fine-tuning on Icelandic conversational data.
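For reference, the sequence-to-sequence cross-entropy objective mentioned above is conventionally written as follows. The notation is introduced here for illustration and does not come from the original card: $x$ is the input log-Mel spectrogram, $y_1, \dots, y_T$ are the target transcript tokens, and $\theta$ denotes the model weights.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```

Fine-tuning minimises this loss over the training transcripts, so each token is predicted from the audio and the preceding reference tokens.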

Compute Infrastructure

Training was conducted at the Language and Voice Laboratory (lvl.ru.is), Reykjavík University, Iceland.


Citation

If you use this model, please cite both this work and the base model:

@misc{runarsson2026whisper,
  author    = {Páll Rúnarsson},
  title     = {Whisper Large Icelandic Fine-tuned on Spjallrómur},
  year      = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/language-and-voice-lab/<your-model-id>}}
}

@inproceedings{mena2024samromur,
  title     = {Samr{\'o}mur Millj{\'o}n: An ASR Corpus of One Million Verified
               Read Prompts in Icelandic},
  author    = {Mena, Carlos Daniel Hernandez and Gunnarsson, {\TH}orsteinn
               Da{\dh}i and Gu{\dh}nason, J{\'o}n},
  booktitle = {Proceedings of the 2024 Joint International Conference on
               Computational Linguistics, Language Resources and Evaluation
               (LREC-COLING 2024)},
  pages     = {14305--14312},
  year      = {2024}
}

@misc{fong2026spjallromur,
  author    = {Fong, Judy Y. and Borsky, Michal and Runarsson, Pall
               and Hedström, Staffan and Jónsson, Ólafur Helgi
               and Hólmfriðardóttir, Lára Margrét H. and Þorsteinsdóttir, Sunneva
               and Eiríksdóttir, Málfriður Anna and Mollberg, David Erik
               and Magnúsdóttir, Eydís Huld and Þórhallsdóttir, Ragnheiður
               and Gudnason, Jon},
  title     = {Spjallromur 26.03 -- Icelandic Conversational Speech},
  year      = {2026},
  publisher = {CLARIN-IS / Reykjavík University},
  howpublished = {\url{http://hdl.handle.net/20.500.12537/379}}
}

Acknowledgements

This work was carried out at the Language and Voice Laboratory (lvl.ru.is) at Reykjavík University, Iceland, under the supervision of Jón Guðnason, in parallel with Michal Borsky's Zipformer work.

This work was supported by the Language Technology Programme for Icelandic, which is managed and coordinated by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture.


Model Card Author

Páll Rúnarsson, Research Associate, Language and Voice Laboratory, Reykjavík University

Model Card Contact

(add contact email or HF profile link)
