Qwen3.5AE v6: Trajectory checkpoint share
Selected training checkpoints from the projector alignment phase (Stage-1) and the LoRA fine-tune phase (Stage-2) of the "Where Does the Sound Go? Tracing Acoustic Information Loss in Audio-Conditioned LLMs" project. In Stage-1 the audio encoder is frozen and only the 4-layer Transformer projector (~18M params) is updated; the LM backbone, Qwen3.5-4B, is also frozen. In Stage-2, LoRA is applied to the LM's transformer linear layers and the projector is trained with full parameters; the audio encoder remains frozen.
Contents
Stage-1 (projector-only alignment)
| Folder | Encoder | Step | Tokens seen | Rows seen |
|---|---|---|---|---|
| `encodec/checkpoint-30000` | EnCodec (24 kHz) | 30,000 | 3.93B | 8.4M |
| `encodec/checkpoint-40000` | EnCodec (24 kHz) | 40,000 | 5.24B | 11.2M |
| `encodec/checkpoint-80000` | EnCodec (24 kHz) | 80,000 | 10.49B | 22.3M |
| `encodec/checkpoint-90000` ★ | EnCodec (24 kHz) | 90,000 | 11.80B | 25.1M |
| `dacvae/checkpoint-80000` ★ | DAC-VAE (48 kHz) | 80,000 | 4.59B | 9.76M |
| `dacvae-asr-only/checkpoint-100000` ★ | DAC-VAE (48 kHz), ASR-only mix | 100,000 | 5.73B | 12.2M |
| `whisper-small/checkpoint-94000` ★ | Whisper-small (16 kHz) | 94,000 | 10.78B | 22.9M |

Stage-2 (LoRA + projector fine-tune)

| Folder | Encoder | Step | Notes |
|---|---|---|---|
| `whisper-tiny/stage2/checkpoint-30000` ★ | Whisper-tiny (16 kHz) | 30,000 | LoRA r=32, α=64 on Qwen3.5-4B linears; projector retrained. Base = Stage-1 best (ckpt-100000). Best by 10-metric rank-sum. |

★ = best checkpoint by the unified 9- or 10-metric rank-sum protocol (ASR WER + emotion macro-F1 + captioning CIDEr; tiebreaker = head-to-head wins).
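
A rank-sum selection of this kind can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's evaluation code: the checkpoint labels, the metric keys, the scores, and the helper names are all hypothetical, only three metrics are shown, and the tiebreaker is one plausible reading of "head-to-head wins" (the number of metrics on which one tied checkpoint strictly beats another).

```python
# Hypothetical scores for illustration only; these are NOT results from this repo.
# For each metric: True if higher is better, False if lower is better.
HIGHER_IS_BETTER = {"asr_wer": False, "emotion_macro_f1": True, "caption_cider": True}

SCORES = {
    "ckpt-A": {"asr_wer": 0.12, "emotion_macro_f1": 0.61, "caption_cider": 0.40},
    "ckpt-B": {"asr_wer": 0.11, "emotion_macro_f1": 0.63, "caption_cider": 0.39},
    "ckpt-C": {"asr_wer": 0.13, "emotion_macro_f1": 0.60, "caption_cider": 0.38},
}


def rank_sums(scores, higher_is_better):
    """Sum each checkpoint's per-metric rank (1 = best); a lower sum is better.

    Exact score ties within a metric are not handled specially in this sketch.
    """
    totals = {name: 0 for name in scores}
    for metric, higher in higher_is_better.items():
        ordered = sorted(scores, key=lambda n: scores[n][metric], reverse=higher)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals


def head_to_head_wins(a, b, scores, higher_is_better):
    """Count the metrics on which checkpoint a strictly beats checkpoint b."""
    wins = 0
    for metric, higher in higher_is_better.items():
        sa, sb = scores[a][metric], scores[b][metric]
        if (sa > sb) if higher else (sa < sb):
            wins += 1
    return wins


def rank_sum_best(scores, higher_is_better):
    """Pick the lowest rank-sum; break ties by head-to-head wins among the tied."""
    totals = rank_sums(scores, higher_is_better)
    best_total = min(totals.values())
    tied = [name for name, total in totals.items() if total == best_total]

    def wins_vs_tied(a):
        return sum(head_to_head_wins(a, b, scores, higher_is_better)
                   for b in tied if b != a)

    return max(tied, key=wins_vs_tied)


print(rank_sums(SCORES, HIGHER_IS_BETTER))
print(rank_sum_best(SCORES, HIGHER_IS_BETTER))
```
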
- Tokens per step: `global_batch × cutoff_len`. EnCodec = 32 × 4096 = 131,072. Whisper-small = 32 × 3584 = 114,688. DAC-VAE = 16 × 3584 = 57,344.
- Rows per step: `tokens / 470`, where 470 is the row-size-weighted mean token length of a training row.
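
The "Tokens seen" and "Rows seen" columns follow directly from these two formulas. A minimal sketch of the arithmetic, using the batch sizes and cutoff lengths quoted above (the dictionary keys and function names are illustrative, not identifiers from the training code):

```python
# (global_batch, cutoff_len) per encoder, as quoted in the notes above.
CONFIGS = {
    "encodec": (32, 4096),
    "whisper-small": (32, 3584),
    "dacvae": (16, 3584),
}
MEAN_ROW_TOKENS = 470  # weighted mean token length of a training row


def tokens_per_step(encoder: str) -> int:
    global_batch, cutoff_len = CONFIGS[encoder]
    return global_batch * cutoff_len


def tokens_seen(encoder: str, step: int) -> int:
    return tokens_per_step(encoder) * step


def rows_seen(encoder: str, step: int) -> float:
    return tokens_seen(encoder, step) / MEAN_ROW_TOKENS


for encoder, step in [("encodec", 90_000), ("dacvae", 80_000), ("whisper-small", 94_000)]:
    print(
        f"{encoder:>14} step {step:>7,}: "
        f"{tokens_seen(encoder, step) / 1e9:.2f}B tokens, "
        f"{rows_seen(encoder, step) / 1e6:.2f}M rows"
    )
```
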
EnCodec uses cutoff 4096 because its 75 tokens/s rate fills 2,250 audio frames for a 30-second utterance. Whisper-small uses cutoff 3584 (50 tokens/s × 30 s = 1,500 audio frames + text headroom). DAC-VAE uses global batch 16 (half of others) because its 48 kHz input carries ~3x more raw samples per utterance, forcing a smaller per-step batch under fixed GPU memory.
DAC-VAE at 80,000 steps is roughly token-matched to EnCodec at ~35,000 steps. EnCodec at 80,000 steps sees ~2.3x more tokens than DAC-VAE at 80,000 steps. Whisper-small at 94,000 steps is roughly token-matched to EnCodec at ~82,000 steps.
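
These token-matching figures can be reproduced from the tokens-per-step values alone. A small sketch (the helper name `token_matched_step` is illustrative):

```python
# Tokens per optimizer step = global_batch * cutoff_len, per encoder.
TOKENS_PER_STEP = {
    "encodec": 32 * 4096,        # 131,072
    "whisper-small": 32 * 3584,  # 114,688
    "dacvae": 16 * 3584,         # 57,344
}


def token_matched_step(src: str, src_step: int, dst: str) -> float:
    """Step at which `dst` has seen as many tokens as `src` has at `src_step`."""
    return src_step * TOKENS_PER_STEP[src] / TOKENS_PER_STEP[dst]


print(token_matched_step("dacvae", 80_000, "encodec"))         # DAC-VAE 80k in EnCodec steps
print(token_matched_step("whisper-small", 94_000, "encodec"))  # Whisper-small 94k in EnCodec steps
print(TOKENS_PER_STEP["encodec"] / TOKENS_PER_STEP["dacvae"])  # token ratio at equal step count
```
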
dacvae-asr-only: ASR-only Stage-1 ablation
Same architecture and base model as dacvae/checkpoint-80000, but trained with the ASR-only manifest (single-source audio_asr mix at probability 1.0) instead of the default Stage-1 multi-task mix (ASR 65% / env_sound 25% / emotion 10%). Useful for isolating the effect of multi-task interference on DAC-VAE's projector. Step 100,000 is the checkpoint with the best ASR WER; no rank-sum tradeoff applies because non-ASR metrics are not meaningful for an ASR-only run.
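
The difference between the two manifests comes down to the per-row source probabilities. The sketch below only simulates that draw with the standard library. It is not the project's data loader: the dictionary layout and the helper `sample_sources` are hypothetical, and it ignores dataset exhaustion and sequence packing.

```python
import random
from collections import Counter

# Source probabilities as stated in the text; the dict layout is illustrative.
DEFAULT_MIX = {"audio_asr": 0.65, "env_sound": 0.25, "emotion": 0.10}
ASR_ONLY_MIX = {"audio_asr": 1.0}


def sample_sources(mix: dict, n_rows: int, seed: int = 0) -> Counter:
    """Draw the source of each training row independently according to `mix`."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_rows))


n = 100_000
for label, mix in [("default", DEFAULT_MIX), ("asr-only", ASR_ONLY_MIX)]:
    counts = sample_sources(mix, n)
    shares = {name: round(counts[name] / n, 3) for name in mix}
    print(label, shares)
```
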
whisper-tiny/stage2/checkpoint-30000: Stage-2 LoRA ablation
Initialized from Stage-1 best whisper-tiny ckpt-100000 and fine-tuned with LoRA (r=32, alpha=64) on the LM's transformer linears (q/k/v/o/gate/up/down) plus projector full-param. Audio encoder remains frozen. Same multi-task mix as Stage-1. LoRA + projector merged into the base model for inference-ready loading.
Note on encodec/checkpoint-{30000,40000,80000} vs checkpoint-90000
The three earlier encodec/checkpoint-* folders contain the full training state (model + DeepSpeed optimizer + RNG, ~26 GiB each) for resume / forensic use. encodec/checkpoint-90000 and whisper-small/checkpoint-94000 are inference-only (model + tokenizer + code, ~10 GB each): global_step{N}/, rng_state_*.pth, training_args.bin, latest, zero_to_fp32.py, and trainer_state.json are excluded. dacvae-asr-only/checkpoint-100000 includes the full training state (resume-capable). whisper-tiny/stage2/checkpoint-30000 is a merged inference-ready model (LoRA + projector merged into base, no adapter file).
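
To check which kind of folder you have after downloading, one option is to match its top-level entries against the training-state names listed above. A minimal sketch: the helper name and the example listings are hypothetical, and `global_step{N}/` is written here as the glob `global_step*`.

```python
import fnmatch

# Training-state entries named in the note above.
TRAINING_STATE_PATTERNS = [
    "global_step*",
    "rng_state_*.pth",
    "training_args.bin",
    "latest",
    "zero_to_fp32.py",
    "trainer_state.json",
]


def training_state_entries(entries):
    """Return the top-level entries that match a training-state pattern."""
    cleaned = (entry.rstrip("/") for entry in entries)
    return sorted(
        entry
        for entry in cleaned
        if any(fnmatch.fnmatchcase(entry, pattern) for pattern in TRAINING_STATE_PATTERNS)
    )


# Hypothetical directory listings, for illustration only.
full_state = ["config.json", "global_step80000/", "rng_state_0.pth", "latest", "trainer_state.json"]
inference_only = ["config.json", "tokenizer.json"]

print(training_state_entries(full_state))      # non-empty: training state present
print(training_state_entries(inference_only))  # empty: inference-only layout
```

For a local copy, the entries could come from `os.listdir(path_to_checkpoint)`.
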
Loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "SJ2048/qwen35ae-v6-trajectory"
subfolder = "encodec/checkpoint-90000"  # or any other folder above

# Load the model and tokenizer from the chosen subfolder of the repo.
model = AutoModelForCausalLM.from_pretrained(
    repo, subfolder=subfolder, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    repo, subfolder=subfolder, trust_remote_code=True
)
```
Training context
- LM: Qwen3.5-4B (frozen during Stage-1 alignment, LoRA-adapted in Stage-2; tokenizer extended with audio control tokens).
- Projector: 4-layer causal Transformer (hidden 512, 8 heads, FFN 2048, pre-norm RMSNorm, SwiGLU, RoPE, FlashAttention-2). 18M params total (0.45% of LLM).
- Training mixture (default Stage-1 / Stage-2): ASR (65%) + environmental sound (25%) + emotion (10%), interleaved at the row level with `stopping_strategy=all_exhausted`.
- 30s utterance cap, online resampling to encoder native rate, sequence packing into length-3584 or 4096 bins.
- Optimizer: AdamW with WSD schedule, 1000-step linear warm-up, peak LR 2e-4 (Stage-1) or 2e-5 (Stage-2).
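
For readers unfamiliar with WSD (warmup-stable-decay), the shape of such a schedule can be sketched as below. Only the 1000-step linear warm-up and the peak learning rates come from the list above. The total step count, the length of the decay phase, the linear decay shape, and the final learning rate are assumptions chosen for illustration, and `wsd_lr` is a hypothetical helper, not the project's scheduler.

```python
def wsd_lr(
    step: int,
    peak_lr: float = 2e-4,       # Stage-1 peak LR from the text (2e-5 for Stage-2)
    warmup_steps: int = 1000,    # linear warm-up length from the text
    total_steps: int = 100_000,  # ASSUMPTION: not stated in the text
    decay_steps: int = 10_000,   # ASSUMPTION: not stated in the text
    min_lr: float = 0.0,         # ASSUMPTION: not stated in the text
) -> float:
    """Warmup-stable-decay: linear ramp, constant plateau, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


for s in (0, 500, 1000, 50_000, 95_000, 100_000):
    print(s, wsd_lr(s))
```
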
Status
Shared for trajectory analysis purposes. Not a polished release; checkpoint quality varies across steps.