Qwen3.5AE v6: Trajectory checkpoint share

Selected training checkpoints from the projector alignment phase (Stage-1) and the LoRA fine-tune phase (Stage-2) of the "Where Does the Sound Go? Tracing Acoustic Information Loss in Audio-Conditioned LLMs" project. In Stage-1 the audio encoder and the LM backbone (Qwen3.5-4B) are both frozen, and only the 4-layer Transformer projector (~18M params) is updated. In Stage-2 LoRA is applied to the LM's transformer linear layers while the projector is trained with full parameters; the audio encoder stays frozen.

Contents

Stage-1 (projector-only alignment)

| Folder | Encoder | Step | Tokens seen | Rows seen |
|---|---|---|---|---|
| encodec/checkpoint-30000 | EnCodec (24 kHz) | 30,000 | 3.93B | 8.4M |
| encodec/checkpoint-40000 | EnCodec (24 kHz) | 40,000 | 5.24B | 11.2M |
| encodec/checkpoint-80000 | EnCodec (24 kHz) | 80,000 | 10.49B | 22.3M |
| encodec/checkpoint-90000 ★ | EnCodec (24 kHz) | 90,000 | 11.80B | 25.1M |
| dacvae/checkpoint-80000 ★ | DAC-VAE (48 kHz) | 80,000 | 4.59B | 9.76M |
| dacvae-asr-only/checkpoint-100000 ★ | DAC-VAE (48 kHz), ASR-only mix | 100,000 | 5.73B | 12.2M |
| whisper-small/checkpoint-94000 ★ | Whisper-small (16 kHz) | 94,000 | 10.78B | 22.9M |

Stage-2 (LoRA + projector fine-tune)

| Folder | Encoder | Step | Notes |
|---|---|---|---|
| whisper-tiny/stage2/checkpoint-30000 ★ | Whisper-tiny (16 kHz) | 30,000 | LoRA r=32 α=64 on Qwen3.5-4B linears; projector retrained. Base = Stage-1 best (ckpt-100000). Best by 10-metric rank-sum. |

★ = best checkpoint under the unified 9- or 10-metric rank-sum protocol (ASR WER + emotion macro-F1 + captioning CIDEr; tiebreaker = head-to-head wins).
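A rank-sum selection with a head-to-head tiebreaker can be sketched as follows. This is only an illustration of the general idea, not the project's evaluation code: the checkpoint names and scores are made up, only three metrics are shown, and the handling of ties within a single metric (here: sort order) is an assumption.

```python
# Hypothetical scores for three checkpoints; every number here is invented.
# Each metric maps to (higher_is_better, {checkpoint: score}).
METRICS = {
    "asr_wer": (False, {"ckpt-a": 0.12, "ckpt-b": 0.10, "ckpt-c": 0.15}),
    "emotion_macro_f1": (True, {"ckpt-a": 0.61, "ckpt-b": 0.58, "ckpt-c": 0.55}),
    "caption_cider": (True, {"ckpt-a": 0.40, "ckpt-b": 0.45, "ckpt-c": 0.35}),
}


def rank_sums(metrics):
    """Sum each checkpoint's per-metric rank (1 = best); lower total is better."""
    totals = {}
    for higher_is_better, scores in metrics.values():
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    return totals


def head_to_head_wins(metrics, a, b):
    """Count the metrics on which checkpoint a strictly beats checkpoint b."""
    wins = 0
    for higher_is_better, scores in metrics.values():
        if higher_is_better:
            wins += scores[a] > scores[b]
        else:
            wins += scores[a] < scores[b]
    return wins


def best_checkpoint(metrics):
    """Pick the lowest rank-sum; break ties by head-to-head wins among the tied."""
    totals = rank_sums(metrics)
    lowest = min(totals.values())
    tied = [name for name, total in totals.items() if total == lowest]
    if len(tied) == 1:
        return tied[0]
    return max(
        tied,
        key=lambda n: sum(head_to_head_wins(metrics, n, o) for o in tied if o != n),
    )


print(rank_sums(METRICS))
print(best_checkpoint(METRICS))
```

WER is ranked ascending and the other two metrics descending, so a checkpoint that is best on every metric would receive the minimum possible total.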

  • Tokens per step: global_batch x cutoff_len. EnCodec = 32 x 4096 = 131,072. Whisper-small = 32 x 3584 = 114,688. DAC-VAE = 16 x 3584 = 57,344.
  • Rows per step: tokens per step / 470, where 470 is the row-size weighted mean token length of a training row.
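The "Tokens seen" and "Rows seen" columns follow directly from these two rules. A minimal sketch, using the batch sizes, cutoffs, and the 470-token mean row length quoted above; the function and dictionary names are illustrative only:

```python
# (global_batch, cutoff_len) per encoder, as listed above.
BATCH_AND_CUTOFF = {
    "encodec": (32, 4096),
    "whisper-small": (32, 3584),
    "dacvae": (16, 3584),
}
MEAN_ROW_TOKENS = 470  # row-size weighted mean token length of a training row


def tokens_per_step(encoder):
    global_batch, cutoff_len = BATCH_AND_CUTOFF[encoder]
    return global_batch * cutoff_len


def tokens_seen(encoder, step):
    return tokens_per_step(encoder) * step


def rows_seen(encoder, step):
    return tokens_seen(encoder, step) / MEAN_ROW_TOKENS


for encoder, step in [("encodec", 90_000), ("dacvae", 80_000), ("whisper-small", 94_000)]:
    print(
        f"{encoder} @ {step}: "
        f"{tokens_seen(encoder, step) / 1e9:.2f}B tokens, "
        f"{rows_seen(encoder, step) / 1e6:.1f}M rows"
    )
```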

EnCodec uses cutoff 4096 because its 75 tokens/s rate yields 2,250 audio frames for a 30-second utterance. Whisper-small uses cutoff 3584 (50 tokens/s × 30 s = 1,500 audio frames + text headroom). DAC-VAE uses global batch 16 (half that of the other encoders) because its 48 kHz input carries ~3x more raw samples per utterance, forcing a smaller per-step batch under fixed GPU memory.
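The frame arithmetic above can be checked in a few lines. This sketch treats "headroom" as simply the cutoff minus the audio frames of one maximum-length utterance, which is a simplification: it ignores sequence packing and any control tokens.

```python
def audio_frames(tokens_per_second, seconds=30):
    """Audio frames produced by one utterance at the encoder's token rate."""
    return tokens_per_second * seconds


def headroom(cutoff_len, tokens_per_second, seconds=30):
    """Positions left in a cutoff_len window after one full-length utterance."""
    return cutoff_len - audio_frames(tokens_per_second, seconds)


print("EnCodec:", audio_frames(75), "frames,", headroom(4096, 75), "positions left")
print("Whisper-small:", audio_frames(50), "frames,", headroom(3584, 50), "positions left")
```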

DAC-VAE at 80,000 steps is roughly token-matched to EnCodec at ~35,000 steps. Conversely, EnCodec at 80,000 steps has seen ~2.3x as many tokens as DAC-VAE at 80,000 steps. Whisper-small at 94,000 steps is roughly token-matched to EnCodec at ~82,000 steps.
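To find token-matched steps for other checkpoint pairs, scale the step count by the ratio of tokens per step. A small sketch using the per-step token counts given earlier; the helper name is illustrative:

```python
# Tokens per optimizer step (global_batch x cutoff_len) for each encoder.
TOKENS_PER_STEP = {
    "encodec": 32 * 4096,
    "whisper-small": 32 * 3584,
    "dacvae": 16 * 3584,
}


def token_matched_step(src, src_step, dst):
    """Step at which `dst` has seen as many tokens as `src` at `src_step`."""
    return src_step * TOKENS_PER_STEP[src] / TOKENS_PER_STEP[dst]


print(token_matched_step("dacvae", 80_000, "encodec"))
print(token_matched_step("whisper-small", 94_000, "encodec"))
print(TOKENS_PER_STEP["encodec"] / TOKENS_PER_STEP["dacvae"])
```

When comparing trajectories across encoders, aligning on tokens seen rather than on step count avoids attributing a data-volume difference to the encoder itself.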

dacvae-asr-only: ASR-only Stage-1 ablation

Same architecture and base model as dacvae/checkpoint-80000, but trained with the ASR-only manifest (single-source audio_asr mix at probability 1.0) instead of the default Stage-1 multi-task mix (ASR 65% / env_sound 25% / emotion 10%). Useful for isolating the effect of multi-task interference on DAC-VAE's projector. Best ASR WER at step 100,000 (no rank-sum tradeoff since non-ASR metrics are not meaningful for an ASR-only run).

whisper-tiny/stage2/checkpoint-30000: Stage-2 LoRA ablation

Initialized from Stage-1 best whisper-tiny ckpt-100000 and fine-tuned with LoRA (r=32, alpha=64) on the LM's transformer linears (q/k/v/o/gate/up/down) plus projector full-param. Audio encoder remains frozen. Same multi-task mix as Stage-1. LoRA + projector merged into the base model for inference-ready loading.
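For readers unfamiliar with what "merged" means here: in the standard LoRA formulation, merging folds the low-rank update into the base weight as W' = W + (alpha / r) · B · A, so no adapter file is needed at load time. The sketch below shows that formula on a toy matrix in plain Python. It illustrates the general technique only; the tooling actually used to merge this checkpoint is not stated in the source.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Return weight + (alpha / r) * lora_b @ lora_a.

    weight: out x in, lora_b: out x r, lora_a: r x in.
    """
    assert len(lora_a) == r, "lora_a must have r rows"
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with rank 1; alpha / r = 2.0, the same scale as r=32, alpha=64.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x in
B = [[3.0], [4.0]]  # out x r
print(merge_lora(W, A, B, r=1, alpha=2))
```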

Note on encodec/checkpoint-{30000,40000,80000} vs checkpoint-90000

The three earlier encodec/checkpoint-* folders contain the full training state (model + DeepSpeed optimizer + RNG, ~26 GiB each) for resume / forensic use. encodec/checkpoint-90000 and whisper-small/checkpoint-94000 are inference-only (model + tokenizer + code, ~10 GB each): global_step{N}/, rng_state_*.pth, training_args.bin, latest, zero_to_fp32.py, and trainer_state.json are excluded. dacvae-asr-only/checkpoint-100000 includes the full training state (resume-capable). whisper-tiny/stage2/checkpoint-30000 is a merged inference-ready model (LoRA + projector merged into base, no adapter file).
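To tell the two kinds of folder apart after download, one option is to look for the training-state files named above. A minimal sketch: the helper name is hypothetical, and requiring all three markers is an assumption about what "resume-capable" needs in practice.

```python
from pathlib import Path


def has_training_state(ckpt_dir):
    """Heuristic: True if the folder holds optimizer, RNG, and trainer state."""
    root = Path(ckpt_dir)
    has_optimizer = any(p.is_dir() for p in root.glob("global_step*"))
    has_rng = any(root.glob("rng_state_*.pth"))
    has_trainer_state = (root / "trainer_state.json").is_file()
    return has_optimizer and has_rng and has_trainer_state


if __name__ == "__main__":
    # Path to a locally downloaded checkpoint folder (adjust as needed).
    print(has_training_state("encodec/checkpoint-80000"))
```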

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "SJ2048/qwen35ae-v6-trajectory"
subfolder = "encodec/checkpoint-90000"  # or any other folder above

model = AutoModelForCausalLM.from_pretrained(
    repo, subfolder=subfolder, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    repo, subfolder=subfolder, trust_remote_code=True
)
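If you want a local copy of a single checkpoint folder rather than the whole repository (the full-state folders are large), `huggingface_hub.snapshot_download` accepts `allow_patterns` to restrict the download. A sketch, assuming `huggingface_hub` is installed; the helper names are illustrative:

```python
def subfolder_patterns(subfolder):
    """Glob pattern list matching every file under one checkpoint folder."""
    return [f"{subfolder.strip('/')}/*"]


def download_subfolder(repo, subfolder):
    """Download only `subfolder` from `repo`; returns the local snapshot path."""
    # Imported lazily so the pattern helper works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo, allow_patterns=subfolder_patterns(subfolder))


if __name__ == "__main__":
    local_dir = download_subfolder(
        "SJ2048/qwen35ae-v6-trajectory", "encodec/checkpoint-90000"
    )
    print(local_dir)
```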

Training context

  • LM: Qwen3.5-4B (frozen during Stage-1 alignment, LoRA-adapted in Stage-2; tokenizer extended with audio control tokens).
  • Projector: 4-layer causal Transformer (hidden 512, 8 heads, FFN 2048, pre-norm RMSNorm, SwiGLU, RoPE, FlashAttention-2). 18M params total (0.45% of LLM).
  • Training mixture (default Stage-1 / Stage-2): ASR (65%) + environmental sound (25%) + emotion (10%), interleaved at the row level with stopping_strategy=all_exhausted.
  • 30s utterance cap, online resampling to encoder native rate, sequence packing into length-3584 or 4096 bins.
  • Optimizer: AdamW with WSD schedule, 1000-step linear warm-up, peak LR 2e-4 (Stage-1) or 2e-5 (Stage-2).
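For orientation, a warmup-stable-decay (WSD) schedule has three phases: a linear warm-up, a plateau at peak LR, and a final decay. The sketch below uses the 1000-step warm-up and Stage-1 peak LR from the list above; the total step count, decay length, linear decay shape, and final LR are NOT given in this card and are placeholders chosen for illustration.

```python
def wsd_lr(step, peak_lr, total_steps, warmup_steps=1000, decay_steps=0, min_lr=0.0):
    """Learning rate at `step` under a warmup-stable-decay schedule.

    decay_steps and min_lr are assumptions; the card does not specify them.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if decay_steps <= 0 or step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr to min_lr over the final decay_steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Placeholder run length and decay length, for illustration only.
for s in (0, 500, 1000, 50_000, 95_000, 100_000):
    print(s, wsd_lr(s, peak_lr=2e-4, total_steps=100_000, decay_steps=10_000))
```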

Status

Shared for trajectory analysis purposes. Not a polished release; checkpoint quality varies across steps.
