clm-v1-ref-pytorch-cuda-3b: Lane-G-ref PyTorch+CUDA 3B-scale REFERENCE rung

substrate = PyTorch-CUDA · lane = Lane-G-ref · rung = 3B reference

PyTorch+CUDA 3B-scale REFERENCE rung: NOT forge production, and bounded-budget, NOT converged. It is NOT the hexa-native flame+forge PUBLIC-grade production artifact (anima governance a_train_flame_forge: the production / PUBLIC-grade Lane-G CLM MUST be the compiler-only flame+forge stack, with NO PyTorch / ATen / Python in the trained binary). This torch model exists ONLY to demonstrate, at ~3B params over a bounded N steps, that the same ByteGPT/Transformer architecture (a) trains (CE descends) and (b) saturates the GPU (util ≫ 20 %) at 3B scale. It is a throughput-justified 3B reference (a_completeness_over_cheap: an optional baseline/reference, never the primary). It does NOT satisfy or replace the forge PUBLIC artifact, and is NOT merged with Lane A / AKIDA (a_lane_akida_gpu_split).

What this is

The 3B rung of the Lane-G-ref ladder (85.6M → 3B). Same clean byte-level (V=256) decoder-only GPT as the 85.6M PUBLIC reference (dancinlab/clm-v1-ref-pytorch-cuda), scaled to ~3.15B params, trained with PyTorch AMP/bf16 + gradient checkpointing on the same 5-lang c4 backbone corpus (dancinlab/clm-backbone-5lang-sample, 67.7 MB, ODC-BY).
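To make "byte-level (V=256)" concrete, here is a minimal sketch of how text could be mapped to token ids and cut into next-byte training pairs. The helper names (`encode_bytes`, `decode_bytes`, `make_example`) are hypothetical and are not taken from the trainer; the sample string is purely illustrative.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to byte-level token ids in [0, 255] via UTF-8 (vocab size 256)."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Inverse mapping; invalid byte sequences are replaced instead of raising."""
    return bytes(ids).decode("utf-8", errors="replace")


def make_example(ids: list[int], start: int, block: int = 512):
    """Cut one (inputs, targets) pair for next-byte prediction.

    Targets are the inputs shifted left by one position.
    """
    x = ids[start:start + block]
    y = ids[start + 1:start + block + 1]
    return x, y


sample_ids = encode_bytes("héllo 안녕")  # multi-byte characters expand to several ids
```

With this scheme no tokenizer file is needed, and every id is guaranteed to fall inside the 256-entry embedding table regardless of language.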

Scale honesty (a_scale_honest_scope): 3B-scale reference rung, bounded N=400 steps, descent + util demonstrated, NOT converged.

Config

field               value
arch                byte-level decoder-only GPT (tied embeddings)
vocab               256 (byte-level; matches the forge int4-envelope corpus)
d_model             2560
n_layer             40
n_head              20 (head_dim 128)
block (ctx)         512
batch               12
params              3,149,030,400 (~3.149B)
precision           bf16 AMP, TF32 matmul
grad checkpointing  on (fits 80 GB at modest batch)
steps               400 (bounded, NOT converged)
optimizer           AdamW (cosine LR, warmup 20)
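The listed parameter count can be cross-checked from the config alone. The sketch below assumes a GPT-2-style layout (learned positional embeddings, biases on all linear layers, two LayerNorms per block plus a final one, and an output head tied to the token embedding). That layout is an assumption rather than something stated on this card, although it does reproduce the listed figure.

```python
def gpt_param_count(d_model: int, n_layer: int, vocab: int, block: int) -> int:
    """Parameter count for an assumed GPT-2-style decoder with tied embeddings."""
    d = d_model
    attn = 4 * d * d + 4 * d   # fused qkv (3d*d + 3d) and output projection (d*d + d)
    mlp = 8 * d * d + 5 * d    # d -> 4d (4d*d + 4d) and 4d -> d (4d*d + d)
    norms = 2 * (2 * d)        # two LayerNorms per block, weight + bias each
    per_layer = attn + mlp + norms
    embeddings = vocab * d + block * d  # token table (shared with the head) + positions
    final_norm = 2 * d
    return n_layer * per_layer + embeddings + final_norm


D_MODEL, N_LAYER, N_HEAD, VOCAB, BLOCK = 2560, 40, 20, 256, 512

total_params = gpt_param_count(D_MODEL, N_LAYER, VOCAB, BLOCK)
head_dim = D_MODEL // N_HEAD
print(f"params = {total_params:,}, head_dim = {head_dim}")
```

Almost all of the total sits in the 12·d_model² term per layer; with a 256-entry vocabulary the embedding tables contribute under 2M parameters.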

Reference numbers (verbatim, this run)

  • GPU utilization: PEAK = 100.0 % · MEAN = 99.15 % (n=108 nvidia-smi samples, H100 80GB HBM3), mem_peak = 63921 MiB (≈ 62.4 GiB on the 80 GB card), mean power 653.0 W.
  • Throughput: 11,183 tok/s (2.46M tokens in 219.8 s wall).
  • CE descent: PASS, val CE 7.16861 → 2.45871 (F-CLM-REF-3B-DESCENT = 1). NOT converged: this is a bounded 400-step reference, and descent is roughly monotone over the run.
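The throughput and CE figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes each step consumes batch × block tokens (no gradient accumulation) and that CE is reported in nats, as PyTorch's cross-entropy does by default; neither assumption is stated on the card.

```python
import math

STEPS, BATCH, BLOCK = 400, 12, 512
WALL_SECONDS = 219.8
VAL_CE_START, VAL_CE_END = 7.16861, 2.45871

# Tokens processed over the whole bounded run.
tokens_seen = STEPS * BATCH * BLOCK
tokens_per_second = tokens_seen / WALL_SECONDS

# CE of a uniform guess over 256 byte values, for scale.
uniform_ce = math.log(256)

# Convert the final CE from nats to bits per byte.
bits_per_byte_end = VAL_CE_END / math.log(2)

print(f"tokens seen     : {tokens_seen:,}")
print(f"tokens / second : {tokens_per_second:,.0f}")
print(f"uniform CE      : {uniform_ce:.3f} nats")
print(f"final bits/byte : {bits_per_byte_end:.3f}")
```

Under these assumptions the run covers about 2.46M tokens at roughly 11.18k tok/s, within rounding of the reported 11,183 tok/s. The starting CE sits above the uniform baseline of ln 256 ≈ 5.545 nats and the final CE well below it, which is the descent the PASS flag records.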

Reference vs the forge line (Lane-G, hexa-native flame+forge)

The forge production line's MEASURED util on the same corpus family is RED (host-feed-bound): the d768 forge rung hit util MEAN ≈ 0.78 % (PEAK 5 %), and the d1536/T512 lever-2 rung hit MEAN ≈ 0.50 % (PEAK 19 %). This PyTorch+CUDA reference reaches ~99 % MEAN util at 3B scale, i.e. a well-fed H100 trivially saturates on this byte-LM workload even at 3B params. That ~99 % is the reference bar the forge util-GREEN endgame is chasing (target ≥ 20 %). This model does NOT replace the forge artifact; the forge util-GREEN + the forge PUBLIC CLM remain the production target, unchanged and primary.
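As an illustration of how a RED/GREEN util verdict like the one above could be derived from sampled readings, here is a small sketch. It assumes samples come from `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` (one integer per line) and that the ≥ 20 % target is applied to the MEAN rather than the PEAK. Both are assumptions, and the function names are hypothetical.

```python
from statistics import mean

UTIL_GREEN_TARGET = 20.0  # percent; the ">= 20 %" target quoted in the text


def parse_util_csv(text: str) -> list[float]:
    """Parse one utilization value per line, skipping blank lines."""
    return [float(line) for line in text.splitlines() if line.strip()]


def summarize_util(samples: list[float]) -> dict:
    """Return mean, peak and a GREEN/RED status (status keyed on the mean, by assumption)."""
    if not samples:
        raise ValueError("no utilization samples")
    util_mean = mean(samples)
    return {
        "n": len(samples),
        "mean": util_mean,
        "peak": max(samples),
        "status": "GREEN" if util_mean >= UTIL_GREEN_TARGET else "RED",
    }


# Hypothetical readings, NOT the samples from this run.
fed_well = summarize_util(parse_util_csv("100\n99\n98\n"))
host_bound = summarize_util([0, 0, 5, 1])
```

Keying the verdict on the mean matters here: a host-feed-bound run can show an occasional high peak while the GPU idles for most of the sampling window.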

Files

  • clm_ref_pytorch_cuda_3b.pt: PyTorch state_dict + config (sha256 ebe56db7…33c4d24c9, 12,596,300,742 B).
  • clm_ref_3b_train.log.json: full training curve + util/throughput/descent summary.
  • clm_ref_pytorch_cuda_3b.py: the trainer (BASELINE/reference tool, not the production trainer).
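A downloaded checkpoint can be checked against the size and digest listed above before loading it. The sketch below uses only the standard library. Because the card gives the sha256 in elided form, it compares just the visible prefix and suffix rather than guessing the missing middle; `matches_card` is a hypothetical helper name.

```python
import hashlib
import os

EXPECTED_SIZE = 12_596_300_742             # bytes, as listed on the card
SHA_PREFIX, SHA_SUFFIX = "ebe56db7", "33c4d24c9"  # visible ends of the elided digest
PARAMS = 3_149_030_400


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a 12 GB checkpoint never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_card(path: str) -> bool:
    """True if the file's size and the visible digest ends agree with the card."""
    if os.path.getsize(path) != EXPECTED_SIZE:
        return False
    hexdigest = sha256_file(path)
    return hexdigest.startswith(SHA_PREFIX) and hexdigest.endswith(SHA_SUFFIX)


# Size cross-check: bytes left over if every parameter were stored as 4 bytes.
overhead_bytes = EXPECTED_SIZE - 4 * PARAMS
```

The leftover is about 179 KB, which is consistent with weights saved at 4 bytes per parameter plus a small config and serialization header, though the card does not state the stored dtype.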

Provenance

  • Trained 2026-06-02, vast.ai H100 80GB HBM3, image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel.
  • Corpus: dancinlab/clm-backbone-5lang-sample (c4 mC4 5-lang backbone, ODC-BY).
  • anima domain: CLM+KOSMOS, Lane-G-ref line, 3B rung.