AUTOLYRICS: Whisper-small + LoRA for Singing Lyrics Transcription

LoRA adapter for openai/whisper-small, fine-tuned to transcribe singing voice into lyrics. Built as a 4-day end-to-end ML project; see the full repo on GitHub and the live demo on HF Space.

Why this exists

Off-the-shelf ASR degrades on singing because of pitch variation, sustained phonemes, irregular rhythm, and (often) backing music. This adapter recovers part of that loss, most visibly in character error rate (27.1% to 17.8% on the held-out test set), while training only ~0.5% additional parameters.

Results on held-out singing test set

Metric | Whisper-small (baseline) | + LoRA (this adapter) | Δ
WER | 37.5% | 34.5% | -3.0 pts
CER | 27.1% | 17.8% | -9.3 pts
RTF on T4 | 0.03 | 0.03 | ~same

Test set: 13 clips, song-disjoint from train.
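For readers who want to reproduce the metrics on their own clips, here is a minimal sketch of corpus-level WER and CER built on plain Levenshtein distance, plus the usual real-time factor ratio. The text normalization (lowercasing and whitespace collapsing) is an assumption; the evaluation script behind the table above may normalize differently, so absolute numbers can shift.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (every edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion of a reference token
                            curr[j - 1] + 1,       # insertion of a hypothesis token
                            prev[j - 1] + cost))   # substitution or exact match
        prev = curr
    return prev[-1]


def normalize(text):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def error_rate(pairs, tokenize):
    """Total edits divided by total reference tokens over (reference, hypothesis) pairs."""
    edits = 0
    ref_len = 0
    for ref, hyp in pairs:
        ref_tokens = tokenize(normalize(ref))
        edits += edit_distance(ref_tokens, tokenize(normalize(hyp)))
        ref_len += len(ref_tokens)
    return edits / ref_len


def wer(pairs):
    # Word error rate: tokens are whitespace-separated words.
    return error_rate(pairs, str.split)


def cer(pairs):
    # Character error rate: tokens are single characters, spaces included.
    return error_rate(pairs, list)


def rtf(processing_seconds, audio_seconds):
    # Real-time factor: values below 1.0 mean faster than real time.
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    pairs = [("ich liebe dich", "ich lebe dich")]  # made-up example pair
    print(f"WER={wer(pairs):.3f}  CER={cer(pairs):.3f}")
```

Both rates are pooled over the whole test set rather than averaged per clip, which keeps very short clips from dominating the score.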

How to use

import torch
import torchaudio
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter weights on top of it.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "Petercoder/autolyrics-whisper-small-lora")
proc = WhisperProcessor.from_pretrained("Petercoder/autolyrics-whisper-small-lora")
model.eval()

# Pin language and task so generation skips language detection.
model.generation_config.language = "de"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

# Whisper expects mono audio sampled at 16 kHz.
wav, sr = torchaudio.load("song_clip.wav")
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Convert the waveform to log-mel features and decode with beam search.
feats = proc(wav.squeeze(0).numpy(), sampling_rate=16000,
             return_tensors="pt").input_features
with torch.no_grad():
    ids = model.generate(input_features=feats, num_beams=5, max_new_tokens=225)
print(proc.batch_decode(ids, skip_special_tokens=True)[0])

For best results, isolate vocals first with Demucs (htdemucs_ft), then pass the vocals.wav to this model.
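The vocal-isolation step can be scripted around the Demucs command-line tool. The sketch below assumes Demucs is installed in the current environment and relies on its default output layout (`<out_dir>/<model>/<track name>/vocals.wav`); the helper names and the `separated` output directory are illustrative, not part of this repo.

```python
import subprocess
import sys
from pathlib import Path


def demucs_command(audio_path, out_dir="separated", model="htdemucs_ft"):
    """Build the Demucs CLI call that splits a track into vocals / no_vocals."""
    return [
        sys.executable, "-m", "demucs",
        "--two-stems", "vocals",   # only separate vocals from everything else
        "-n", model,               # pretrained separation model to use
        "-o", str(out_dir),        # root folder for separated stems
        str(audio_path),
    ]


def expected_vocals_path(audio_path, out_dir="separated", model="htdemucs_ft"):
    """Where Demucs is expected to write the vocal stem (default layout)."""
    return Path(out_dir) / model / Path(audio_path).stem / "vocals.wav"


def isolate_vocals(audio_path, out_dir="separated", model="htdemucs_ft"):
    """Run Demucs and return the path of the isolated vocal stem."""
    subprocess.run(demucs_command(audio_path, out_dir, model), check=True)
    vocals = expected_vocals_path(audio_path, out_dir, model)
    if not vocals.exists():
        raise FileNotFoundError(f"Demucs finished but {vocals} was not found")
    return vocals


if __name__ == "__main__":
    # Hypothetical input file; pass the result to torchaudio.load() afterwards.
    print(isolate_vocals("song_clip.mp3"))
```

The returned path can replace "song_clip.wav" in the transcription snippet, which already handles downmixing and resampling to 16 kHz.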

Training details

  • Base model: openai/whisper-small (244M params)
  • PEFT: LoRA, r=32, alpha=64, dropout=0.05, target=q_proj,v_proj
  • Trainable params: 1.2M (0.5% of total)
  • Optimizer: AdamW, lr=1e-3, linear warmup 50 steps
  • Batch: 8 × grad_accum 2 = effective 16; fp16
  • Epochs: 5 with early stopping (patience=2) on eval WER
  • Hardware: single NVIDIA T4 (Colab Pro)
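The hyperparameters listed above map fairly directly onto PEFT and Transformers argument names. The following is a sketch of how they might be wired up, not the project's actual training script: the dictionary and function names are illustrative, and the heavy imports are deferred into the function so the configuration can be inspected without loading a model.

```python
# LoRA settings from the list above, keyed by peft.LoraConfig argument names.
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}

# Optimization settings, keyed by Seq2SeqTrainingArguments field names.
TRAIN_HPARAMS = {
    "learning_rate": 1e-3,
    "warmup_steps": 50,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 5,
    "fp16": True,
}

# Intended for transformers.EarlyStoppingCallback(early_stopping_patience=...).
EARLY_STOPPING_PATIENCE = 2


def effective_batch_size(hparams, num_devices=1):
    """Samples contributing to each optimizer step."""
    return (hparams["per_device_train_batch_size"]
            * hparams["gradient_accumulation_steps"]
            * num_devices)


def build_lora_model(base_name="openai/whisper-small"):
    """Wrap the base Whisper model with LoRA adapters (downloads weights)."""
    from peft import LoraConfig, get_peft_model
    from transformers import WhisperForConditionalGeneration

    base = WhisperForConditionalGeneration.from_pretrained(base_name)
    model = get_peft_model(base, LoraConfig(**LORA_KWARGS))
    model.print_trainable_parameters()  # reports trainable vs. total counts
    return model


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(TRAIN_HPARAMS))
```

Calling `build_lora_model()` prints the trainable-parameter count for the chosen rank and target modules, which is a quick way to check a configuration before launching a run.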

Dataset

DSing30 + curated Jamendo Lyrics subset, vocal-isolated via Demucs htdemucs_ft, song-disjoint train/val/test splits.
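A song-disjoint split means every clip cut from the same song lands in the same partition, so the model is never evaluated on a song it saw in training. The source does not say how the splits were produced; the sketch below shows one common approach, assigning each song to a partition by hashing its ID. The `song_id` field and the 10% validation and test fractions are assumptions for illustration.

```python
import hashlib


def split_for_song(song_id, val_frac=0.1, test_frac=0.1):
    """Deterministically map a song ID to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(song_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_clips(clips, val_frac=0.1, test_frac=0.1):
    """Group clips by the partition of their parent song."""
    splits = {"train": [], "val": [], "test": []}
    for clip in clips:
        name = split_for_song(clip["song_id"], val_frac, test_frac)
        splits[name].append(clip)
    return splits


if __name__ == "__main__":
    # Made-up clip records: 20 songs with 3 clips each.
    clips = [{"song_id": f"song_{s:03d}", "clip": c}
             for s in range(20) for c in range(3)]
    for name, items in split_clips(clips).items():
        songs = {c["song_id"] for c in items}
        print(name, len(items), "clips from", len(songs), "songs")
```

Because the assignment depends only on the song ID, adding new clips for an existing song never moves that song between partitions.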

Limitations

  • German only (training data was German).
  • Heavy distortion / extreme growl vocals are still hard.
  • Best results require vocal isolation as a preprocessing step.

Citation

@misc{autolyrics2026,
  author = {ramduvvuri},
  title  = {AUTOLYRICS: LoRA Fine-tuning of Whisper for Singing Lyrics},
  year   = {2026},
  howpublished = {\url{https://github.com/ramduvvuri/autolyrics}}
}