Model Card

A multimodal text + audio model currently in training. This card documents the in-progress run; metrics and details will be updated when training completes.

Status: 🟢 Training in progress, ~38% complete (epoch 2 of 3, step ~12,510 of ~32,500). No instability observed.

Model Details

Modalities: Joint text and audio
Objective: Combined text + audio loss
Status: Mid-training checkpoint (not final)

Training Procedure

Configuration

Experimental setup (self-exploratory): https://github.com/Ranjit246/duplex-model-exp/tree/hinglish-indic-adaptation-kame-moshi

Setting	Value
Epochs	3
Examples per epoch	86,671
Micro-batch size	1
Gradient accumulation	8
Effective batch size	8
Steps per epoch	~10,834
Total planned steps	~32,500
Learning rate	3e-5
LR schedule	WarmupLR (linear warmup to 3e-5 by ~step 110, held flat, no decay)
Checkpoint interval	Every 500 steps
Throughput	~10 sec/step (+ ~5–6 min checkpoint stall per 500 steps)
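
The step counts in the table follow from the dataset size and batch settings. Below is a minimal sketch of that arithmetic, together with a warmup-then-constant learning-rate function. All names are illustrative and not taken from the training code. It assumes that a trailing partial batch still triggers an optimizer step and that warmup rises linearly from 0.

```python
import math

# Values copied from the configuration table.
EXAMPLES_PER_EPOCH = 86_671
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 8
EPOCHS = 3
PEAK_LR = 3e-5
WARMUP_STEPS = 110  # approximate; the table says "~step 110"


def effective_batch_size() -> int:
    """Examples consumed per optimizer step."""
    return MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS


def steps_per_epoch() -> int:
    """Optimizer steps per epoch.

    Assumption: a trailing partial batch still produces one step,
    hence the ceiling division.
    """
    return math.ceil(EXAMPLES_PER_EPOCH / effective_batch_size())


def total_steps() -> int:
    """Planned optimizer steps over all epochs."""
    return steps_per_epoch() * EPOCHS


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then held flat with no decay.

    Assumption: warmup starts from 0; the actual scheduler may start
    from a nonzero minimum.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR
```

With these inputs, `steps_per_epoch()` gives 10,834 and `total_steps()` gives 32,502, which matches the rounded ~32,500 in the table.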

Run Timeline

Started: 2026-05-23 20:37
Last logged: 2026-06-01 04:03 (~8.3 days elapsed, still running)
Estimated remaining: ~20,000 steps, or roughly 2.5 days at ~10 sec/step including checkpoint stalls
Checkpoints retained: step_12000, step_12500 (older checkpoints rotated out)
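
The elapsed and remaining times can be rechecked from the figures on this card. The sketch below is one way to do that. It assumes both timestamps are in the same time zone, takes 5.5 minutes as the midpoint of the quoted checkpoint stall, and uses illustrative constant and function names.

```python
from datetime import datetime

# Figures copied from this card.
STARTED = datetime(2026, 5, 23, 20, 37)
LAST_LOGGED = datetime(2026, 6, 1, 4, 3)
CURRENT_STEP = 12_510
TOTAL_STEPS = 32_500
SECONDS_PER_STEP = 10.0
CHECKPOINT_INTERVAL = 500
CHECKPOINT_STALL_SECONDS = 5.5 * 60  # assumed midpoint of "~5–6 min"

SECONDS_PER_DAY = 86_400


def elapsed_days() -> float:
    """Wall-clock days between the start and the last log line."""
    return (LAST_LOGGED - STARTED).total_seconds() / SECONDS_PER_DAY


def remaining_days_at_quoted_throughput() -> float:
    """Remaining time if every step takes SECONDS_PER_STEP plus checkpoint stalls."""
    remaining_steps = TOTAL_STEPS - CURRENT_STEP
    remaining_checkpoints = (
        TOTAL_STEPS // CHECKPOINT_INTERVAL - CURRENT_STEP // CHECKPOINT_INTERVAL
    )
    seconds = (
        remaining_steps * SECONDS_PER_STEP
        + remaining_checkpoints * CHECKPOINT_STALL_SECONDS
    )
    return seconds / SECONDS_PER_DAY


def observed_seconds_per_step() -> float:
    """Average wall-clock seconds per step since the run started."""
    return (LAST_LOGGED - STARTED).total_seconds() / CURRENT_STEP
```

Under these assumptions, about 8.3 days have elapsed and about 2.5 days remain. The wall-clock average is roughly 57 seconds per step, well above the quoted ~10 sec/step. The timeline may therefore include pauses or restarts that are not recorded here, and the remainder would take longer if that slower pace continued.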

Loss Curve

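Loss is averaged over 1,000-step bins. The large initial drop occurs during warmup; thereafter both losses decline slowly, though not monotonically (the 11k–12k bin sits above the 9k–10k bin), with text falling faster than audio in relative terms. Audio loss is the harder signal and has stayed in the 1.4–1.5 range from 5k through 12k steps, dipping to 1.30 in the latest bin.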

Step range	Total	Text	Audio
0–1k	4.65	2.12	2.53
2k–3k	2.77	1.16	1.61
5k–6k	2.65	1.13	1.51
8k–9k	2.52	1.02	1.50
9k–10k	2.34	0.93	1.41
11k–12k	2.47	1.04	1.42
12k+	2.14	0.84	1.30
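
The rows above are consistent with the total being an unweighted sum of the text and audio terms. This is an inference from the numbers, not something the card states. The sketch below checks that sum to within rounding and compares how far each component has fallen since the 2k–3k bin. The helper names are illustrative.

```python
# (step range, total, text, audio) rows copied from the loss table.
LOSS_TABLE = [
    ("0–999", 4.65, 2.12, 2.53),
    ("2k–3k", 2.77, 1.16, 1.61),
    ("5k–6k", 2.65, 1.13, 1.51),
    ("8k–9k", 2.52, 1.02, 1.50),
    ("9k–10k", 2.34, 0.93, 1.41),
    ("11k–12k", 2.47, 1.04, 1.42),
    ("12k+", 2.14, 0.84, 1.30),
]

TOTAL, TEXT, AUDIO = 1, 2, 3  # column indices within each row


def max_sum_gap(rows) -> float:
    """Largest |total - (text + audio)| across rows.

    A small gap is expected because each column is rounded to two decimals.
    """
    return max(abs(row[TOTAL] - (row[TEXT] + row[AUDIO])) for row in rows)


def relative_drop(rows, column: int, start_row: int = 1) -> float:
    """Fractional decrease of one column from rows[start_row] to the last row.

    start_row defaults to 1 so the warmup-dominated first bin is skipped.
    """
    first = rows[start_row][column]
    last = rows[-1][column]
    return (first - last) / first
```

On these rows the largest gap is about 0.01, which is within rounding. From the 2k–3k bin to the latest bin, text loss falls by roughly 28% and audio loss by roughly 19%.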

Notes on loss: Per-step loss is noisy (individual steps range from ~0.17 to ~4.4), which is expected with a micro-batch size of 1 accumulated to an effective batch of 8. The per-1,000-step averages are the meaningful view of the trend.
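
The binned view can be reproduced from any log of (step, loss) pairs. The sketch below is one possible way to do it and is not the script used for this card. The function name and input format are assumptions.

```python
from collections import defaultdict


def binned_mean_loss(step_losses, bin_size=1000):
    """Average per-step losses into fixed-width step bins.

    step_losses: iterable of (step, loss) pairs, in any order.
    Returns a dict mapping each bin's starting step to the mean loss
    of the steps that fall in it, ordered by bin start.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for step, loss in step_losses:
        start = (step // bin_size) * bin_size
        sums[start] += loss
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Hypothetical values for illustration only, not taken from the run's logs.
example = [(0, 4.0), (999, 2.0), (1000, 0.17), (1500, 4.4), (1999, 1.43)]
bins = binned_mean_loss(example)
```

In the illustrative input, the second bin mixes a very low and a very high step loss. Its mean is still a moderate value, which is why the binned averages are easier to read than individual steps.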

Stability

No NaNs, OOMs, exceptions, or tracebacks across the run so far. Training is progressing normally, with binned loss still trending down.

Intended Use

This is an intermediate training artifact. The final model and evaluation results are not yet available. Use mid-training checkpoints only for monitoring or experimentation, not for production.

Limitations

Training is not complete; performance will continue to change.
No formal evaluation has been run yet.
Audio loss is plateauing higher than text loss (~1.3–1.5 versus ~0.8–1.0 in recent bins), reflecting the greater difficulty of the audio signal.
