AUTOLYRICS: Whisper-small + LoRA for Singing Lyrics Transcription
LoRA adapter for openai/whisper-small, fine-tuned for singing voice → lyrics
transcription. Built as a 4-day end-to-end ML project; see the full repo on
GitHub and the live demo on the HF Space.
Why this exists
Off-the-shelf ASR degrades on singing because of pitch variation, sustained phonemes, rhythm irregularities, and (often) backing music. This adapter recovers part of that loss (CER drops from 27.1% to 17.8% on the held-out test set) with ~0.5% extra trainable parameters.
Results on held-out singing test set
| Metric | Whisper-small (baseline) | + LoRA (this adapter) | Δ |
|---|---|---|---|
| WER | 37.5% | 34.5% | -3.0 pts |
| CER | 27.1% | 17.8% | -9.3 pts |
| RTF on T4 | 0.03 | 0.03 | ~same |
Test set: 13 clips, song-disjoint from train.
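For reference, WER and CER are both edit-distance rates: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words (WER) or characters (CER). The following is a minimal sketch in plain Python, not the evaluation script behind the table above; the function names are illustrative, and any text normalization (casing, punctuation) is left out as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one clip; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one clip; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Corpus-level scores are usually obtained by summing the edit distances and the reference lengths over all clips before dividing, rather than averaging per-clip rates.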
How to use
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

# Load the base model, then attach the LoRA adapter on top of it.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "Petercoder/autolyrics-whisper-small-lora")
proc = WhisperProcessor.from_pretrained("Petercoder/autolyrics-whisper-small-lora")

# Pin language and task instead of relying on automatic language detection.
model.generation_config.language = "de"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

# Whisper expects mono audio sampled at 16 kHz.
wav, sr = torchaudio.load("song_clip.wav")
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Convert the waveform to log-mel features, decode with beam search, print the text.
feats = proc(wav.squeeze(0).numpy(), sampling_rate=16000,
             return_tensors="pt").input_features
ids = model.generate(input_features=feats, num_beams=5, max_new_tokens=225)
print(proc.batch_decode(ids, skip_special_tokens=True)[0])
For best results, isolate vocals first with Demucs
(htdemucs_ft), then pass the vocals.wav to this model.
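The vocal-isolation step can be scripted by calling the Demucs command-line tool. The sketch below assumes Demucs is installed (`pip install demucs`) and uses its `--two-stems=vocals`, `-n`, and `-o` options. The helper names and the `separated` output directory are illustrative, and the expected output layout (`<out>/<model>/<track name>/vocals.wav`) should be checked against your installed Demucs version.

```python
import subprocess
import sys
from pathlib import Path

MODEL = "htdemucs_ft"  # Demucs model named in this card


def demucs_command(track, out_dir="separated"):
    """Build the Demucs CLI call that splits a track into vocals and accompaniment."""
    return [sys.executable, "-m", "demucs",
            "--two-stems=vocals", "-n", MODEL, "-o", str(out_dir), str(track)]


def vocals_path(track, out_dir="separated"):
    """Path where Demucs is expected to write the isolated vocals (assumed layout)."""
    return Path(out_dir) / MODEL / Path(track).stem / "vocals.wav"


def isolate_vocals(track, out_dir="separated"):
    """Run Demucs on one track and return the expected path of vocals.wav."""
    subprocess.run(demucs_command(track, out_dir), check=True)
    return vocals_path(track, out_dir)
```

Calling `isolate_vocals("song.mp3")` would then return a path that can be loaded with `torchaudio.load` in place of `song_clip.wav`.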
Training details
- Base model: openai/whisper-small (244M params)
- PEFT: LoRA, r=32, alpha=64, dropout=0.05, target=q_proj,v_proj
- Trainable params: 1.2M (0.5% of total)
- Optimizer: AdamW, lr=1e-3, linear warmup 50 steps
- Batch: 8 × grad_accum 2 = effective 16; fp16
- Epochs: 5 with early stopping (patience=2) on eval WER
- Hardware: single NVIDIA T4 (Colab Pro)
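The arithmetic behind these numbers can be sketched as follows. LoRA adds two small matrices per adapted weight, so each adapted `d_out × d_in` projection contributes `r * (d_in + d_out)` trainable parameters. The hidden size of 768 is that of whisper-small; the number of adapted matrices (24) is an assumption chosen here because it reproduces the reported ~1.2M, and is not stated in this card.

```python
def lora_param_count(rank, d_in, d_out, n_matrices):
    """Trainable parameters added by LoRA: each adapted weight gets
    A (rank x d_in) and B (d_out x rank)."""
    return n_matrices * rank * (d_in + d_out)


HIDDEN = 768      # whisper-small hidden size
RANK = 32         # r from the training details
N_MATRICES = 24   # assumption: picked to match the reported ~1.2M

trainable = lora_param_count(RANK, HIDDEN, HIDDEN, N_MATRICES)
share = trainable / 244e6   # fraction of the 244M base parameters

effective_batch = 8 * 2     # per-step batch x gradient accumulation steps
scaling = 64 / RANK         # standard LoRA scaling factor alpha / r

print(f"{trainable:,} trainable params ({share:.2%}), "
      f"effective batch {effective_batch}, LoRA scaling {scaling}")
```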
Dataset
DSing30 + curated Jamendo Lyrics subset, vocal-isolated via Demucs htdemucs_ft, song-disjoint train/val/test splits.
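A song-disjoint split means that all clips cut from one song end up in exactly one of train, val, or test, so the model is never evaluated on a song it saw during training. A minimal sketch of one way to do this is shown below; the `song_id` key, the split fractions, and the function name are hypothetical and not taken from the project's code.

```python
import random
from collections import defaultdict


def song_disjoint_split(clips, val_frac=0.1, test_frac=0.1, seed=0):
    """Split clips so that every song appears in exactly one split.

    `clips` is a list of dicts carrying a (hypothetical) "song_id" key.
    """
    by_song = defaultdict(list)
    for clip in clips:
        by_song[clip["song_id"]].append(clip)

    songs = sorted(by_song)             # fixed order before shuffling
    random.Random(seed).shuffle(songs)  # seeded for reproducibility

    n_test = max(1, round(len(songs) * test_frac))
    n_val = max(1, round(len(songs) * val_frac))
    groups = {
        "test": songs[:n_test],
        "val": songs[n_test:n_test + n_val],
        "train": songs[n_test + n_val:],
    }
    # Expand each group of song IDs back into its clips.
    return {name: [c for s in ids for c in by_song[s]]
            for name, ids in groups.items()}
```

Splitting at the song level rather than the clip level is what keeps the reported test numbers from being inflated by near-duplicate segments of the same recording.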
Limitations
- German only (training data was German).
- Heavy distortion / extreme growl vocals are still hard.
- Best results require vocal isolation as a preprocessing step.
Citation
@misc{autolyrics2026,
  author = {ramduvvuri},
  title = {AUTOLYRICS: LoRA Fine-tuning of Whisper for Singing Lyrics},
  year = {2026},
  howpublished = {\url{https://github.com/ramduvvuri/autolyrics}}
}