Model Card

A multimodal text + audio model currently in training. This card documents the in-progress run; metrics and details will be updated when training completes.

Status: 🟢 Training in progress, ~38% complete (epoch 2 of 3, step ~12,510 of ~32,500). No instability observed.

Model Details

  • Modalities: Joint text and audio
  • Objective: Combined text + audio loss
  • Status: Mid-training checkpoint (not final)

Training Procedure

Configuration

Experimental setup (a self-directed exploratory project): https://github.com/Ranjit246/duplex-model-exp/tree/hinglish-indic-adaptation-kame-moshi

| Setting | Value |
|---|---|
| Epochs | 3 |
| Examples per epoch | 86,671 |
| Micro-batch size | 1 |
| Gradient accumulation | 8 |
| Effective batch size | 8 |
| Steps per epoch | ~10,834 |
| Total planned steps | ~32,500 |
| Learning rate | 3e-5 |
| LR schedule | WarmupLR (linear warmup to 3e-5 by ~step 110, then held flat, no decay) |
| Checkpoint interval | Every 500 steps |
| Throughput | ~10 sec/step (plus a ~5–6 min checkpoint stall every 500 steps) |
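
The derived rows in the configuration (effective batch size, steps per epoch, total steps) follow from the base settings, and the learning-rate schedule is simple enough to write out. The sketch below is an illustration only, not the training code: the function names are hypothetical, the step count assumes a trailing partial accumulation group still counts as one optimizer step, and the warmup is assumed to start from a learning rate of 0, which the card does not state.

```python
import math

# Values taken from the configuration in this card.
EXAMPLES_PER_EPOCH = 86_671
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 8
EPOCHS = 3
PEAK_LR = 3e-5
WARMUP_STEPS = 110  # approximate: the card says "by ~step 110"


def effective_batch_size() -> int:
    """Examples consumed per optimizer step."""
    return MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS


def steps_per_epoch() -> int:
    """Optimizer steps per epoch.

    Assumption: a final partial accumulation group is rounded up to a step.
    """
    return math.ceil(EXAMPLES_PER_EPOCH / effective_batch_size())


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then constant (no decay).

    Assumption: warmup starts from 0; the real starting value is not given.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR


print(effective_batch_size())      # 8
print(steps_per_epoch())           # 10834
print(EPOCHS * steps_per_epoch())  # 32502
```

The computed total of 32,502 matches the rounded ~32,500 planned steps in the configuration.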

Run Timeline

  • Started: 2026-05-23 20:37
  • Last logged: 2026-06-01 04:03 (~8.3 days elapsed, still running)
  • Estimated remaining: ~20,000 steps, roughly 2.5 days if the ~10 sec/step throughput and checkpoint stalls hold
  • Checkpoints retained: step_12000, step_12500 (older checkpoints rotated out)
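
The remaining-time estimate can be reproduced from the step counts and throughput reported in this card. This is a rough sketch with hypothetical function names; it assumes the ~10 sec/step rate stays constant, uses 5.5 minutes as the midpoint of the reported checkpoint stall, and ignores any pauses or restarts.

```python
# Values taken from this card; the stall midpoint is an assumption.
TOTAL_STEPS = 32_500
CURRENT_STEP = 12_510
SECONDS_PER_STEP = 10.0
CHECKPOINT_INTERVAL = 500
CHECKPOINT_STALL_MINUTES = 5.5  # midpoint of the reported ~5-6 min


def remaining_steps(current: int = CURRENT_STEP, total: int = TOTAL_STEPS) -> int:
    """Optimizer steps left in the planned run."""
    return max(total - current, 0)


def remaining_checkpoints(current: int = CURRENT_STEP, total: int = TOTAL_STEPS) -> int:
    """Checkpoints still to be written, i.e. multiples of the interval in (current, total]."""
    return total // CHECKPOINT_INTERVAL - current // CHECKPOINT_INTERVAL


def remaining_days(current: int = CURRENT_STEP, total: int = TOTAL_STEPS) -> float:
    """Estimated wall-clock days left: compute time plus checkpoint stalls."""
    compute_seconds = remaining_steps(current, total) * SECONDS_PER_STEP
    stall_seconds = remaining_checkpoints(current, total) * CHECKPOINT_STALL_MINUTES * 60
    return (compute_seconds + stall_seconds) / 86_400


print(remaining_steps())           # 19990
print(remaining_checkpoints())     # 40
print(round(remaining_days(), 1))  # 2.5
```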

Loss Curve

Loss is averaged over 1,000-step bins. The large initial drop occurs during warmup; thereafter both losses decline gradually, though not monotonically (the 11k–12k bin ticks back up), with text falling faster than audio. Audio loss is the harder signal and is plateauing at roughly 1.4–1.5.

| Step range | Total | Text | Audio |
|---|---|---|---|
| 0–999 | 4.65 | 2.12 | 2.53 |
| 2k–3k | 2.77 | 1.16 | 1.61 |
| 5k–6k | 2.65 | 1.13 | 1.51 |
| 8k–9k | 2.52 | 1.02 | 1.50 |
| 9k–10k | 2.34 | 0.93 | 1.41 |
| 11k–12k | 2.47 | 1.04 | 1.42 |
| 12k+ | 2.14 | 0.84 | 1.30 |
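
The reported totals are consistent with the combined objective being a plain sum of the text and audio losses. The check below is a small sketch over the binned values in this section; it suggests an unweighted sum but does not prove it, and the tolerance is an assumption that allows for each of the three values being rounded to two decimals.

```python
# (step range, total, text, audio) as reported in this card.
ROWS = [
    ("0-999", 4.65, 2.12, 2.53),
    ("2k-3k", 2.77, 1.16, 1.61),
    ("5k-6k", 2.65, 1.13, 1.51),
    ("8k-9k", 2.52, 1.02, 1.50),
    ("9k-10k", 2.34, 0.93, 1.41),
    ("11k-12k", 2.47, 1.04, 1.42),
    ("12k+", 2.14, 0.84, 1.30),
]

# Three values rounded to 2 decimals can each be off by up to 0.005.
TOLERANCE = 0.015


def is_sum_consistent(total: float, text: float, audio: float,
                      tol: float = TOLERANCE) -> bool:
    """True if total matches text + audio within rounding error."""
    return abs(total - (text + audio)) <= tol


for label, total, text, audio in ROWS:
    print(label, is_sum_consistent(total, text, audio))
```

Every row passes, with the 5k–6k and 11k–12k bins differing by 0.01 from the sum, which is within rounding error.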

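Notes on loss: Per-step loss is noisy (individual steps swing from ~0.17 to ~4.4), which is expected with a micro-batch size of 1 and gradient accumulation to an effective batch of 8. The binned per-1,000-step averages are the meaningful view of the trend. The 12k+ bin is partial (about 500 steps so far), so its average is less settled than the full bins.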
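
Binned averages of this kind can be computed from per-step logs along the following lines. This is a minimal sketch, not the script used for this card: `bin_losses` is a hypothetical helper, and the sample records are made up for illustration (only the 0.17 and 4.4 extremes echo values mentioned above).

```python
from collections import defaultdict
from statistics import mean


def bin_losses(records, bin_size=1000):
    """Average per-step losses into fixed-width step bins.

    records: iterable of (step, loss) pairs.
    Returns a dict mapping each bin's starting step to its mean loss.
    """
    bins = defaultdict(list)
    for step, loss in records:
        bin_start = (step // bin_size) * bin_size
        bins[bin_start].append(loss)
    return {start: mean(values) for start, values in sorted(bins.items())}


# Illustrative records only, not real training logs.
sample = [(0, 4.4), (1, 0.17), (999, 2.0), (1000, 1.0), (1500, 3.0)]
for start, avg in bin_losses(sample).items():
    print(f"{start}-{start + 999}: {avg:.2f}")
```

Averaging over a fixed step window smooths out the single-step swings while keeping the bins comparable across the run; a partial final bin is averaged over fewer steps and is correspondingly noisier.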

Stability

No NaNs, no out-of-memory (OOM) errors, no exceptions, and no tracebacks across the run so far. Training is progressing normally with loss still trending down.
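
A stability check like this one can be automated by scanning the training log for failure markers. The sketch below is hypothetical: the card does not describe how the log was inspected, and both the patterns and the sample log lines are assumptions chosen for illustration.

```python
import re

# Assumed failure markers; adjust to the actual log format.
FAILURE_PATTERNS = {
    "nan": re.compile(r"\bnan\b", re.IGNORECASE),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
    "traceback": re.compile(r"Traceback \(most recent call last\)"),
}


def scan_log(lines):
    """Return, per failure type, the 1-based line numbers where it appears."""
    hits = {name: [] for name in FAILURE_PATTERNS}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                hits[name].append(lineno)
    return hits


# Made-up log lines, not taken from this run.
example_log = [
    "step 12500 loss 2.14",
    "step 12501 loss nan",
    "RuntimeError: CUDA out of memory.",
]
print(scan_log(example_log))
```

A clean run corresponds to every list in the result being empty.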

Intended Use

This is an intermediate training artifact. The final model and evaluation results are not yet available. Use mid-training checkpoints only for monitoring or experimentation, not for production.

Limitations

  • Training is not complete; performance will continue to change.
  • No formal evaluation has been run yet.
  • Audio loss is plateauing higher than text loss, reflecting the greater difficulty of the audio signal.