Model Card
AMALIA-speech-encoder is an open-source speech encoder adapted for European Portuguese.
Model Description
AMALIA-speech-encoder is the specialized speech encoder that is part of the ASR model inesc-id/WhisperLv3-FT-EP-CPP, fine-tuned by the Instituto Superior Técnico/INESC-ID for European Portuguese ASR.
The backbone model is the Whisper large-v3 model from OpenAI.
Training Details
Training Data
The data used to train this model is the CAMÕES dataset, a curated collection of up to 14 sub-corpora, bringing together proprietary datasets acquired through previous research collaborations, speech corpora recorded by the consortium, and data collected from publicly available online sources. Overall, it contains approximately 425 hours of speech with high-quality manual transcriptions. Details are described in the CAMÕES paper (see Citation).
Training Process
We apply supervised fine-tuning on top of the Whisper large-v3 model (openai/whisper-large-v3), updating all model parameters. Training was carried out on the Instituto Superior Técnico/INESC-ID's own computational facilities.
The model provided is the resulting fine-tuned transformer speech encoder only (without the ASR decoder).
Intended Use
AMALIA-speech-encoder is intended as a specialized speech encoder for European Portuguese. The model receives speech as input and outputs a high-dimensional latent representation of the speech content, commonly known as a speech embedding. It is expected to be used as a speech pre-processing stage within a downstream task, for instance speech-to-text.
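To make the size of the speech embedding concrete, the arithmetic below is a minimal sketch that assumes this repository keeps the standard Whisper large-v3 front end and hidden size (16 kHz audio, 160-sample hop, 30-second windows, a total convolutional stride of 2, and a hidden size of 1280). These constants are assumptions taken from the public Whisper large-v3 configuration, not values stated in this card; check them against config.json and preprocessor_config.json before relying on them.

```python
from typing import Tuple

# Standard Whisper large-v3 constants (assumed, not stated in this card).
SAMPLING_RATE = 16_000  # Hz, expected input sampling rate
HOP_LENGTH = 160        # samples between consecutive log-mel frames
CHUNK_SECONDS = 30      # inputs are padded or trimmed to this duration
CONV_STRIDE = 2         # total temporal stride of the encoder's conv layers
D_MODEL = 1280          # hidden size of Whisper large-v3


def embedding_shape(batch_size: int = 1) -> Tuple[int, int, int]:
    """Shape of the encoder output for a batch of 30-second windows."""
    mel_frames = CHUNK_SECONDS * SAMPLING_RATE // HOP_LENGTH  # 3000 frames
    encoder_frames = mel_frames // CONV_STRIDE                # 1500 frames
    return (batch_size, encoder_frames, D_MODEL)


def frame_duration_ms() -> float:
    """Duration of audio covered by one step between encoder frames."""
    return 1000 * HOP_LENGTH * CONV_STRIDE / SAMPLING_RATE


print(embedding_shape())    # (1, 1500, 1280)
print(frame_duration_ms())  # 20.0
```

Under these assumptions, each 30-second window yields 1500 embedding vectors, one every 20 ms, which is what a downstream module would consume.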
Limitations
This checkpoint is intended as a research artifact. Performance may vary depending on audio quality, speaker domain, recording conditions, and transcription style. The model may be less reliable on noisy audio, long-form speech, code-switching, or domains that differ from the training data.
Contents and use example
This repo stores:
- encoder.safetensors: Whisper speech encoder weights only
- config.json: Whisper configuration needed to reconstruct the encoder
- preprocessor_config.json: feature extractor from openai/whisper-large-v3
Load example
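from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import WhisperConfig, WhisperFeatureExtractor
from transformers.models.whisper.modeling_whisper import WhisperEncoder
repo_id = "amalia-llm/AMALIA-speech-encoder"
# Read the Whisper configuration and the feature extractor stored in the repo.
config = WhisperConfig.from_pretrained(repo_id)
feature_extractor = WhisperFeatureExtractor.from_pretrained(repo_id)
# Build an encoder-only module from the configuration (randomly initialized).
encoder = WhisperEncoder(config)
# Download the encoder weights and load them into the module.
state = load_file(hf_hub_download(repo_id, "encoder.safetensors"))
encoder.load_state_dict(state)
# Switch to inference mode (disables dropout).
encoder.eval()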
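Once the encoder and feature extractor are loaded, embeddings can be extracted from a waveform. The helper below is a minimal sketch rather than an official usage recipe: the function name encode_waveform is hypothetical, and it assumes a mono waveform sampled at 16 kHz that fits in a single 30-second Whisper window.

```python
import numpy as np
import torch


def encode_waveform(encoder, feature_extractor, waveform, sampling_rate=16_000):
    """Return the encoder's last hidden states for one mono waveform.

    `waveform` is a 1-D float array; the name and signature of this helper
    are illustrative and not part of the repository.
    """
    # Convert raw audio to log-mel features, padded or trimmed to 30 seconds.
    features = feature_extractor(
        np.asarray(waveform, dtype=np.float32),
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
    # Match the device and dtype of the encoder weights.
    param = next(encoder.parameters())
    input_features = features.input_features.to(device=param.device, dtype=param.dtype)
    # No gradients are needed when the encoder is used as a feature extractor.
    with torch.no_grad():
        outputs = encoder(input_features)
    # Shape: (batch, encoder frames, hidden size).
    return outputs.last_hidden_state
```

With the objects from the load example, calling encode_waveform(encoder, feature_extractor, waveform) should return a tensor whose last dimension equals config.d_model. Audio longer than 30 seconds would need to be split into windows first, which this sketch does not handle.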
Citation
BibTeX:
@inproceedings{camoes,
title={{CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese}},
author={Carlos Carvalho and Francisco Teixeira and Catarina Botelho and Anna Pompili and Rubén Solera-Ureña and Sérgio Paulo and Mariana Julião and Thomas Rolland and John Mendonça and Diogo Pereira and Isabel Trancoso and Alberto Abad},
booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
year={2025},
}