Qwen3.5AE v6 — Trajectory checkpoint share

Selected training checkpoints from the projector alignment phase (Stage-1) and the LoRA fine-tune phase (Stage-2) of the "Where Does the Sound Go? Tracing Acoustic Information Loss in Audio-Conditioned LLMs" project.

Stage-1: the audio encoder and the LM backbone (Qwen3.5-4B) are frozen; only the 4-layer Transformer projector (~18M params) is updated.

Stage-2: LoRA on the LM's transformer linear layers plus full-parameter training of the projector. The audio encoder stays frozen.

Stage-1 (projector-only alignment)

| Folder | Encoder | Step | Tokens seen | Rows seen |
|---|---|---|---|---|
| `encodec/checkpoint-30000` | EnCodec (24 kHz) | 30,000 | 3.93B | 8.4M |
| `encodec/checkpoint-40000` | EnCodec (24 kHz) | 40,000 | 5.24B | 11.2M |
| `encodec/checkpoint-80000` | EnCodec (24 kHz) | 80,000 | 10.49B | 22.3M |
| `encodec/checkpoint-90000` ★ | EnCodec (24 kHz) | 90,000 | 11.80B | 25.1M |
| `dacvae/checkpoint-80000` ★ | DAC-VAE (48 kHz) | 80,000 | 4.59B | 9.76M |
| `dacvae-asr-only/checkpoint-100000` ★ | DAC-VAE (48 kHz), ASR-only mix | 100,000 | 5.73B | 12.2M |
| `whisper-small/checkpoint-94000` ★ | Whisper-small (16 kHz) | 94,000 | 10.78B | 22.9M |

Stage-2 (LoRA + projector fine-tune)

| Folder | Encoder | Step | Notes |
|---|---|---|---|
| `whisper-tiny/stage2/checkpoint-30000` ★ | Whisper-tiny (16 kHz) | 30,000 | LoRA r=32 α=64 on Qwen3.5-4B linears; projector retrained. Base = Stage-1 best (ckpt-100000). Best by 10-metric rank-sum. |

★ = best checkpoint by the unified 9- or 10-metric rank-sum protocol (ASR WER + emotion macro-F1 + captioning CIDEr; tiebreaker = head-to-head wins).
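
As a rough illustration of how a rank-sum selection with a head-to-head tiebreaker might work, here is a minimal sketch. The function name `rank_sum_best`, the metric keys, and all scores are hypothetical placeholders, not results from this project, and the protocol actually used for ★ may differ (for example in how within-metric ties are ranked).

```python
# Hypothetical sketch of rank-sum checkpoint selection.
# Every name and number here is an illustrative placeholder.

def rank_sum_best(scores, lower_is_better):
    """Return (best_checkpoint, rank_sums).

    scores: {checkpoint: {metric: value}}
    lower_is_better: set of metric names where smaller is better (e.g. WER).
    """
    ckpts = sorted(scores)
    metrics = sorted(next(iter(scores.values())))
    rank_sums = {c: 0 for c in ckpts}
    wins = {c: 0 for c in ckpts}

    for m in metrics:
        # Flip the sign so that "smaller is better" holds for every metric.
        sign = 1 if m in lower_is_better else -1
        ordered = sorted(ckpts, key=lambda c: sign * scores[c][m])
        # Rank 1 = best. Ties fall back to name order in this sketch.
        for rank, c in enumerate(ordered, start=1):
            rank_sums[c] += rank
        # Head-to-head: count strict pairwise wins on this metric.
        for a in ckpts:
            for b in ckpts:
                if a != b and sign * scores[a][m] < sign * scores[b][m]:
                    wins[a] += 1

    # Lowest rank sum wins; more head-to-head wins breaks ties.
    best = min(ckpts, key=lambda c: (rank_sums[c], -wins[c]))
    return best, rank_sums


example_scores = {
    "ckpt-A": {"wer": 0.12, "emotion_f1": 0.55, "cider": 0.40},
    "ckpt-B": {"wer": 0.10, "emotion_f1": 0.50, "cider": 0.45},
    "ckpt-C": {"wer": 0.15, "emotion_f1": 0.60, "cider": 0.35},
}
best, sums = rank_sum_best(example_scores, lower_is_better={"wer"})
print(best, sums)
```

With these placeholder scores, `ckpt-B` ranks first on two of three metrics and gets the lowest rank sum.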

Tokens per step = global_batch × cutoff_len. EnCodec = 32 × 4096 = 131,072. Whisper-small = 32 × 3584 = 114,688. DAC-VAE = 16 × 3584 = 57,344.
Rows per step = tokens per step / 470, where 470 is the row-size-weighted mean token length of a training row.
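
The "Tokens seen" and "Rows seen" columns follow directly from these two formulas. A small sketch that recomputes them (the dictionary keys and function names are chosen here for illustration):

```python
# Recompute "Tokens seen" / "Rows seen" from global_batch x cutoff_len.
TOKENS_PER_STEP = {
    "encodec": 32 * 4096,        # 131,072
    "whisper-small": 32 * 3584,  # 114,688
    "dacvae": 16 * 3584,         # 57,344
}
MEAN_ROW_TOKENS = 470  # row-size-weighted mean row length, from the text


def tokens_seen(encoder, step):
    """Total tokens processed after `step` optimizer steps."""
    return TOKENS_PER_STEP[encoder] * step


def rows_seen(encoder, step):
    """Approximate number of training rows processed after `step` steps."""
    return tokens_seen(encoder, step) / MEAN_ROW_TOKENS


for encoder, step in [("encodec", 90_000), ("dacvae", 80_000), ("whisper-small", 94_000)]:
    print(
        f"{encoder}@{step}: "
        f"{tokens_seen(encoder, step) / 1e9:.2f}B tokens, "
        f"{rows_seen(encoder, step) / 1e6:.1f}M rows"
    )
```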

EnCodec uses cutoff 4096 because its 75 tokens/s rate yields 2,250 audio frames for a 30-second utterance (75 × 30). Whisper-small uses cutoff 3584 (50 tokens/s × 30 s = 1,500 audio frames, plus text headroom). DAC-VAE uses a global batch of 16 (half that of the other encoders) because its 48 kHz input carries ~3× as many raw samples per utterance as 16 kHz audio, forcing a smaller per-step batch under fixed GPU memory.

DAC-VAE at 80,000 steps is roughly token-matched to EnCodec at ~35,000 steps; at an equal step count of 80,000, EnCodec has seen ~2.3× as many tokens as DAC-VAE. Whisper-small at 94,000 steps is roughly token-matched to EnCodec at ~82,000 steps.
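
The token-matched step counts can be checked with a one-line conversion. The helper name `token_matched_step` is introduced here for illustration only:

```python
def token_matched_step(src_tokens_per_step, src_step, dst_tokens_per_step):
    """Step at which the destination run has seen as many tokens as the source run."""
    return src_tokens_per_step * src_step / dst_tokens_per_step


ENCODEC = 32 * 4096        # tokens per step
WHISPER_SMALL = 32 * 3584
DACVAE = 16 * 3584

# DAC-VAE @ 80,000 steps expressed in EnCodec steps
dacvae_as_encodec = token_matched_step(DACVAE, 80_000, ENCODEC)
# Whisper-small @ 94,000 steps expressed in EnCodec steps
whisper_as_encodec = token_matched_step(WHISPER_SMALL, 94_000, ENCODEC)
# Token ratio at equal step count
encodec_over_dacvae = ENCODEC / DACVAE

print(dacvae_as_encodec, whisper_as_encodec, round(encodec_over_dacvae, 2))
```

This gives 35,000 and 82,250 EnCodec-equivalent steps and a per-step token ratio of about 2.29, matching the rounded figures in the text.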

`dacvae-asr-only`: ASR-only Stage-1 ablation

Same architecture and base model as `dacvae/checkpoint-80000`, but trained with the ASR-only manifest (single-source `audio_asr` mix at probability 1.0) instead of the default Stage-1 multi-task mix (ASR 65% / env_sound 25% / emotion 10%). Useful for isolating the effect of multi-task interference on DAC-VAE's projector. The starred checkpoint (step 100,000) is the one with the best ASR WER; no rank-sum tradeoff applies, since non-ASR metrics are not meaningful for an ASR-only run.
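
To make the difference between the two mixes concrete, here is a minimal pure-Python sketch of row-level interleaving with sampling probabilities and an "all exhausted" stopping rule (every source is fully consumed at least once, and smaller sources restart). It is an illustration under assumptions, not the project's data loader. Hugging Face `datasets.interleave_datasets` exposes `probabilities` and `stopping_strategy="all_exhausted"`, but whether that function was used here is not stated. The source names and toy rows are placeholders.

```python
import random


def interleave_all_exhausted(sources, probabilities, seed=0):
    """Yield (source_name, row), picking a source per row by probability.

    Stops once every source has been fully consumed at least once;
    a source that runs out restarts from its first row (oversampling).
    Assumes every source is non-empty.
    """
    rng = random.Random(seed)
    names = list(sources)
    weights = [probabilities[n] for n in names]
    position = {n: 0 for n in names}
    exhausted = set()
    while len(exhausted) < len(names):
        name = rng.choices(names, weights=weights, k=1)[0]
        rows = sources[name]
        yield name, rows[position[name]]
        position[name] += 1
        if position[name] == len(rows):
            exhausted.add(name)
            position[name] = 0  # restart this source


# Toy stand-ins for the three task manifests (placeholder rows).
toy_sources = {
    "audio_asr": [f"asr-{i}" for i in range(13)],
    "env_sound": [f"env-{i}" for i in range(5)],
    "emotion": [f"emo-{i}" for i in range(2)],
}
default_mix = {"audio_asr": 0.65, "env_sound": 0.25, "emotion": 0.10}
asr_only_mix = {"audio_asr": 1.0}

mixed = list(interleave_all_exhausted(toy_sources, default_mix, seed=0))
asr_only = list(
    interleave_all_exhausted({"audio_asr": toy_sources["audio_asr"]}, asr_only_mix)
)
print(len(mixed), len(asr_only))
```

In the ASR-only case the loop degenerates to a single pass over the one source, which is the ablation described above.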

`whisper-tiny/stage2/checkpoint-30000`: Stage-2 LoRA ablation

Initialized from the Stage-1 best whisper-tiny checkpoint (ckpt-100000) and fine-tuned with LoRA (r=32, alpha=64) on the LM's transformer linears (q/k/v/o/gate/up/down), with the projector trained full-parameter. The audio encoder remains frozen. The multi-task mix is the same as in Stage-1. LoRA and projector weights are merged into the base model so the checkpoint loads inference-ready.
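
For readers unfamiliar with what "merged" means here, the sketch below shows the usual LoRA merge for a single linear layer, W' = W + (alpha / r) · B · A, on toy matrices. It assumes standard LoRA scaling (not a variant such as rsLoRA) and uses r=2, alpha=4 so that the scale matches the r=32, alpha=64 ratio of 2.0. The matrices and function names are illustrative, not the project's merge script.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Return W + (alpha / r) * B @ A for one linear layer.

    weight: d_out x d_in, lora_b: d_out x r, lora_a: r x d_in.
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 example (placeholder values).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]    # d_out x r
A = [[0.5, 0.0], [0.0, 0.25]]   # r x d_in
merged = merge_lora(W, A, B, r=2, alpha=4)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is consistent with the checkpoint shipping without a separate adapter file.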

Note on `encodec/checkpoint-{30000,40000,80000}` vs `checkpoint-90000`

The three earlier `encodec/checkpoint-*` folders contain the full training state (model + DeepSpeed optimizer + RNG, ~26 GiB each) for resume / forensic use. `encodec/checkpoint-90000` and `whisper-small/checkpoint-94000` are inference-only (model + tokenizer + code, ~10 GB each): `global_step{N}/`, `rng_state_*.pth`, `training_args.bin`, `latest`, `zero_to_fp32.py`, and `trainer_state.json` are excluded. `dacvae-asr-only/checkpoint-100000` includes the full training state (resume-capable). `whisper-tiny/stage2/checkpoint-30000` is a merged, inference-ready model (LoRA + projector merged into the base, no adapter file).
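
After downloading a folder, a quick heuristic check can tell a resume-capable checkpoint from an inference-only one by looking for the training-state artifacts listed above. This is a sketch: the function name and the choice of which files to require are assumptions, and it does not validate file contents.

```python
from pathlib import Path


def is_resume_capable(ckpt_dir):
    """Heuristic: True if the folder holds the training-state artifacts
    that the inference-only folders exclude."""
    ckpt = Path(ckpt_dir)
    # DeepSpeed optimizer/model shards live in a global_step{N}/ directory.
    has_ds_state = any(p.is_dir() for p in ckpt.glob("global_step*"))
    # Per-rank RNG snapshots.
    has_rng = any(ckpt.glob("rng_state_*.pth"))
    # Trainer bookkeeping (step counter, log history).
    has_trainer_state = (ckpt / "trainer_state.json").is_file()
    return has_ds_state and has_rng and has_trainer_state
```

For example, `is_resume_capable("encodec/checkpoint-80000")` would be expected to return `True` on a local copy, and `False` for `encodec/checkpoint-90000`.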

Loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "SJ2048/qwen35ae-v6-trajectory"
subfolder = "encodec/checkpoint-90000"  # or any other folder above

# trust_remote_code=True lets transformers run the custom model code
# shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    repo, subfolder=subfolder, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    repo, subfolder=subfolder, trust_remote_code=True
)
```

Training context

LM: Qwen3.5-4B (frozen during Stage-1 alignment, LoRA-adapted in Stage-2; tokenizer extended with audio control tokens).
Projector: 4-layer causal Transformer (hidden 512, 8 heads, FFN 2048, pre-norm RMSNorm, SwiGLU, RoPE, FlashAttention-2). ~18M params total (~0.45% of LLM).
Training mixture (default Stage-1 / Stage-2): ASR (65%) + environmental sound (25%) + emotion (10%), interleaved at the row level with stopping_strategy=all_exhausted.
30s utterance cap, online resampling to encoder native rate, sequence packing into length-3584 or 4096 bins.
Optimizer: AdamW with WSD schedule, 1000-step linear warm-up, peak LR 2e-4 (Stage-1) or 2e-5 (Stage-2).
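
For reference, a WSD (warmup-stable-decay) schedule with the stated 1000-step linear warm-up and peak learning rates can be sketched as follows. The total step count, the decay length, the linear decay shape, and the final learning rate are not given above and are assumptions chosen for illustration.

```python
def wsd_lr(step, peak_lr, total_steps, warmup_steps=1000,
           decay_steps=10_000, min_lr=0.0):
    """Learning rate at `step` under a warmup-stable-decay schedule.

    warmup_steps matches the text; decay_steps, min_lr, and the linear
    decay shape are assumptions made for this sketch.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear ramp from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Stage-1 peak LR from the text; total_steps is a placeholder.
for s in (0, 500, 1_000, 50_000, 95_000, 100_000):
    print(s, wsd_lr(s, peak_lr=2e-4, total_steps=100_000))
```

Stage-2 would use the same shape with `peak_lr=2e-5`.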

Status

Shared for trajectory analysis purposes. Not a polished release; checkpoint quality varies across steps.
