Model Card

AMALIA-speech-encoder is an open-source speech encoder adapted for European Portuguese.

Model Description

AMALIA-speech-encoder is the specialized speech encoder that is part of the ASR model inesc-id/WhisperLv3-FT-EP-CPP, fine-tuned by the Instituto Superior Técnico/INESC-ID for European Portuguese ASR. The backbone model is the Whisper large-v3 model from OpenAI.

Training Details

Training Data

The model was trained on the CAMÕES dataset, a curated collection of up to 14 sub-corpora that brings together proprietary datasets acquired through previous research collaborations, speech corpora recorded by the consortium, and data collected from publicly available online sources. Overall, it contains approximately 425 hours of speech with high-quality manual transcriptions. Details are described in the CAMÕES paper (see Citation).

Training Process

We apply supervised fine-tuning to the Whisper large-v3 model (openai/whisper-large-v3), updating all model parameters. Training was carried out on Instituto Superior Técnico/INESC-ID's own computational facilities. The model provided here is only the resulting fine-tuned transformer speech encoder, without the ASR decoder.

Intended Use

AMALIA-speech-encoder is intended as a specialized speech encoder for European Portuguese. The model receives speech as input and outputs a high-dimensional latent representation of the speech content, commonly known as a speech embedding. It is expected to be used as a speech pre-processing stage integrated into a specific downstream task, for instance speech-to-text.
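
To make the size of that representation concrete, the sketch below works through the arithmetic for the standard Whisper large-v3 front end: 16 kHz audio, a 10 ms hop between log-mel frames, a fixed 30-second input window, and a hidden size of 1280. These constants and the helper names are assumptions taken from the public Whisper large-v3 configuration, not values stated in this card; the config.json and preprocessor_config.json in the repository are authoritative.

```python
# Illustrative arithmetic only. The constants are the standard Whisper
# large-v3 settings and are ASSUMED to be unchanged in this checkpoint.
SAMPLE_RATE = 16_000   # Hz, expected input sampling rate
HOP_LENGTH = 160       # samples between log-mel frames (10 ms)
WINDOW_SECONDS = 30    # fixed input window; shorter audio is zero-padded
HIDDEN_SIZE = 1280     # d_model of Whisper large-v3


def mel_frames(duration_s: float) -> int:
    """Number of log-mel frames covering `duration_s` seconds of audio."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


def encoder_frames(n_mel_frames: int) -> int:
    """Sequence length after the encoder's strided convolution
    (kernel 3, stride 2, padding 1)."""
    return (n_mel_frames - 1) // 2 + 1


def embedding_shape() -> tuple:
    """(frames, hidden size) produced for one padded 30-second window."""
    return encoder_frames(mel_frames(WINDOW_SECONDS)), HIDDEN_SIZE


def speech_frames(duration_s: float) -> int:
    """Approximate count of leading output frames that cover real audio
    rather than padding; audio beyond the window is not represented."""
    clipped = min(duration_s, WINDOW_SECONDS)
    return encoder_frames(mel_frames(clipped))
```

Under these assumptions each output frame spans roughly 20 ms of audio, so a 10-second clip occupies about the first 500 of the 1500 frames returned for a window.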

Limitations

This checkpoint is intended as a research artifact. Performance may vary depending on audio quality, speaker domain, recording conditions, and transcription style. The model may be less reliable on noisy audio, long-form speech, code-switching, or domains that differ from the training data.

Contents and use example

This repo stores:

encoder.safetensors: Whisper speech encoder weights only
config.json: Whisper configuration needed to reconstruct the encoder
preprocessor_config.json: feature extractor from openai/whisper-large-v3

Load example

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import WhisperConfig, WhisperFeatureExtractor
from transformers.models.whisper.modeling_whisper import WhisperEncoder

repo_id = "amalia-llm/AMALIA-speech-encoder"

# The architecture settings and the log-mel feature extractor come from the repo.
config = WhisperConfig.from_pretrained(repo_id)
feature_extractor = WhisperFeatureExtractor.from_pretrained(repo_id)

# Build a randomly initialized encoder from the config, then load the fine-tuned weights.
encoder = WhisperEncoder(config)
state = load_file(hf_hub_download(repo_id, "encoder.safetensors"))
encoder.load_state_dict(state)
encoder.eval()  # inference mode: disables dropout
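
Once the encoder is loaded, extracting embeddings might look like the following minimal sketch. The helper `embed_waveform` is hypothetical and not part of the repository. It assumes a mono, floating-point waveform sampled at 16 kHz, and it relies on the feature extractor's default behaviour of padding or truncating the input to a single 30-second window.

```python
import numpy as np
import torch


def embed_waveform(encoder, feature_extractor, waveform, sampling_rate=16_000):
    """Return frame-level embeddings of shape (frames, hidden_size) for one clip.

    Hypothetical helper: `waveform` is assumed to be a 1-D array of mono audio
    at `sampling_rate` Hz. Audio longer than 30 s is truncated by the feature
    extractor's defaults.
    """
    # Log-mel features, zero-padded to the fixed window the encoder expects.
    features = feature_extractor(
        np.asarray(waveform, dtype=np.float32),
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
    input_features = features.input_features

    # Match the encoder's device and dtype in case it was moved or cast.
    param = next(encoder.parameters())
    input_features = input_features.to(device=param.device, dtype=param.dtype)

    with torch.no_grad():  # inference only, no gradients needed
        outputs = encoder(input_features)

    # Drop the batch dimension: (1, frames, hidden) -> (frames, hidden).
    return outputs.last_hidden_state[0]
```

With the objects from the load example, a call such as `embeddings = embed_waveform(encoder, feature_extractor, waveform)` returns one embedding vector per encoder frame. Downstream tasks can consume these frames directly or pool them, for example by averaging, into a single utterance-level vector.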

Citation

BibTeX:

@inproceedings{camoes,
    title={{CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese}},
    author={Carlos Carvalho and Francisco Teixeira and Catarina Botelho and Anna Pompili and Rubén Solera-Ureña and Sérgio Paulo and Mariana Julião and Thomas Rolland and John Mendonça and Diogo Pereira and Isabel Trancoso and Alberto Abad},
    booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
    year={2025},
}