AUTOLYRICS — Whisper-small + LoRA for Singing Lyrics Transcription

LoRA adapter for openai/whisper-small, fine-tuned for singing voice → lyrics transcription. Built as a 4-day end-to-end ML project; the full repo is on GitHub and a live demo runs as an HF Space.

Why this exists

Off-the-shelf ASR degrades on singing because of pitch variation, sustained phonemes, rhythmic irregularities, and (often) backing music. This adapter recovers part of that loss (CER drops from 27.1% to 17.8% and WER from 37.5% to 34.5% on the held-out test set) with only ~0.5% extra trainable parameters.

Results on held-out singing test set

Metric	Whisper-small (baseline)	+ LoRA (this adapter)	Δ
WER	37.5%	34.5%	-3.0 pts
CER	27.1%	17.8%	-9.3 pts
RTF on T4	0.03	0.03	~same

Test set: 13 clips, song-disjoint from train. RTF is the real-time factor (processing time divided by audio duration; lower is faster).
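
For readers who want to reproduce the metrics on their own clips, here is a minimal sketch of how WER, CER, and RTF are conventionally computed. It is not the evaluation script behind the table above: the function names are hypothetical, the example sentences are made up, and no text normalization (casing, punctuation) is applied, whereas the card's own evaluation may have normalized differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits divided by total reference length."""
    split = str.split if unit == "word" else list
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(ref), split(hyp)
        edits += edit_distance(r, h)
        total += len(r)
    return edits / total


def real_time_factor(processing_seconds, audio_seconds):
    """RTF: time spent transcribing divided by the duration of the audio."""
    return processing_seconds / audio_seconds


# Made-up example: one wrong word out of five.
refs = ["ich liebe dich so sehr"]
hyps = ["ich liebe dir so sehr"]
wer = error_rate(refs, hyps, unit="word")  # 1 edit over 5 reference words
cer = error_rate(refs, hyps, unit="char")  # 2 edits over 22 reference characters
rtf = real_time_factor(0.9, 30.0)          # 0.9 s of compute for a 30 s clip
```

In the example a single near-miss word costs a full word error but only two character edits, which is one way CER can improve much more than WER, as it does in the table.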

How to use

import torch
import torchaudio
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter on top of it.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "Petercoder/autolyrics-whisper-small-lora")
proc  = WhisperProcessor.from_pretrained("Petercoder/autolyrics-whisper-small-lora")
model.eval()

# Pin language and task so generation skips language detection.
model.generation_config.language = "de"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

# Whisper expects mono audio sampled at 16 kHz.
wav, sr = torchaudio.load("song_clip.wav")
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

feats = proc(wav.squeeze(0).numpy(), sampling_rate=16000,
             return_tensors="pt").input_features
with torch.no_grad():
    ids = model.generate(input_features=feats, num_beams=5, max_new_tokens=225)
print(proc.batch_decode(ids, skip_special_tokens=True)[0])

For best results, isolate vocals first with Demucs (htdemucs_ft), then pass the vocals.wav to this model.
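
The vocal-isolation step can be sketched as follows. This assumes the demucs command-line tool is installed and that it writes stems to its usual `<out_dir>/<model>/<track name>/` layout; the helper name `demucs_vocals_command` and the example file path are hypothetical, not part of this project.

```python
from pathlib import Path


def demucs_vocals_command(track, out_dir="separated", model="htdemucs_ft"):
    """Build a Demucs CLI call that splits a track into vocals / no_vocals.

    Returns the command as an argument list and the path where Demucs is
    expected to write the vocals stem (assumed default output layout).
    """
    cmd = ["demucs", "--two-stems=vocals", "-n", model, "-o", out_dir, track]
    vocals = Path(out_dir) / model / Path(track).stem / "vocals.wav"
    return cmd, vocals


cmd, vocals = demucs_vocals_command("songs/my_track.mp3")

# To actually run the separation (requires demucs to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
# Then load str(vocals) in place of "song_clip.wav" in the snippet above.
```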

Training details

Base model: openai/whisper-small (244M params)
PEFT: LoRA, r=32, alpha=64, dropout=0.05, target=q_proj,v_proj
Trainable params: ~1.2M (~0.5% of total)
Optimizer: AdamW, lr=1e-3, linear warmup 50 steps
Batch: 8 × grad_accum 2 = effective 16; fp16
Epochs: 5 with early stopping (patience=2) on eval WER
Hardware: single NVIDIA T4 (Colab Pro)
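
As a rough sanity check on the numbers above, the sketch below works through the LoRA parameter count, the warmup schedule, and the early-stopping rule in plain Python. The assumptions are: whisper-small uses a hidden size of 768; the learning rate stays constant after warmup (the card states only the warmup); and the eval WER sequence is invented for illustration. All function names are hypothetical. Which attention blocks the adapter covers is not stated here, so both a 24-matrix and a 72-matrix case are shown.

```python
def lora_param_count(d_model, rank, n_matrices):
    """Each adapted d_model x d_model projection gains two low-rank factors:
    A (rank x d_model) and B (d_model x rank)."""
    return n_matrices * rank * (d_model + d_model)


def warmup_lr(step, peak_lr=1e-3, warmup_steps=50):
    """Linear ramp from 0 to peak_lr, then constant (the constant part is an assumption)."""
    return peak_lr * min(1.0, step / warmup_steps)


def stopping_epoch(eval_wers, patience=2):
    """Return the 1-based epoch at which training stops: after `patience`
    consecutive evaluations without a new best WER, or at the last epoch."""
    best, bad = float("inf"), 0
    for epoch, wer in enumerate(eval_wers, start=1):
        if wer < best:
            best, bad = wer, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return len(eval_wers)


D_MODEL, RANK = 768, 32  # whisper-small hidden size; r from the list above

# q_proj + v_proj in 12 layers of a single attention stack -> 24 matrices.
one_stack = lora_param_count(D_MODEL, RANK, 24)
# q_proj + v_proj in encoder self-attn, decoder self-attn and cross-attn -> 72 matrices.
all_attention = lora_param_count(D_MODEL, RANK, 72)

fraction = one_stack / 244e6                       # share of the 244M base parameters
effective_batch = 8 * 2                            # per-device batch x grad_accum
mid_warmup_lr = warmup_lr(25)                      # halfway through warmup
stop_at = stopping_epoch([0.40, 0.36, 0.37, 0.38, 0.35])  # made-up WERs
```

Under these assumptions the 24-matrix case gives about 1.18M parameters (roughly 0.5% of 244M), in line with the figure listed above, while adapting every q_proj and v_proj in the model would give about 3.5M.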

Dataset

DSing30 + curated Jamendo Lyrics subset, vocal-isolated via Demucs htdemucs_ft, song-disjoint train/val/test splits.
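
A song-disjoint split means every clip from a given song lands in exactly one of train, val, or test, so the model is never evaluated on a song it has heard. Below is a minimal sketch of one way to do this; the function name, the split fractions, and the `(clip_id, song_id)` input format are assumptions for illustration, not the project's actual split code.

```python
import random
from collections import defaultdict


def song_disjoint_split(clips, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (clip_id, song_id) pairs so that no song spans two splits."""
    by_song = defaultdict(list)
    for clip_id, song_id in clips:
        by_song[song_id].append((clip_id, song_id))

    # Shuffle songs, not clips, so all clips of a song move together.
    songs = sorted(by_song)
    random.Random(seed).shuffle(songs)

    n_test = max(1, round(len(songs) * test_frac))
    n_val = max(1, round(len(songs) * val_frac))
    groups = {
        "test": songs[:n_test],
        "val": songs[n_test:n_test + n_val],
        "train": songs[n_test + n_val:],
    }
    return {name: [clip for song in group for clip in by_song[song]]
            for name, group in groups.items()}


# Toy data: 10 songs with 3 clips each.
clips = [(f"song{s}_clip{c}", f"song{s}") for s in range(10) for c in range(3)]
splits = song_disjoint_split(clips)
songs_in = {name: {song for _, song in items} for name, items in splits.items()}
```

Splitting at the clip level instead would let segments of the same song appear in both train and test, which tends to inflate test scores.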

Limitations

German only (training data was German).
Heavy distortion / extreme growl vocals are still hard.
Best results require vocal isolation as a preprocessing step.

Citation

@misc{autolyrics2026,
  author = {ramduvvuri},
  title  = {AUTOLYRICS: LoRA Fine-tuning of Whisper for Singing Lyrics},
  year   = {2026},
  howpublished = {\url{https://github.com/ramduvvuri/autolyrics}}
}
