
AGILLM-3: Technical Documentation

A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training

Scott Bisset
OpenTransformers Ltd
January 2026


Abstract

This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned orthogonal projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models; AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.


1. Motivation

1.1 What This Is

AGILLM-3 is a research project exploring:

  1. Tuneable attention rank: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?

  2. Joint AR+SAT training: Can a model learn both next-token prediction AND multi-token speculation simultaneously?

1.2 What This Isn't

This is not:

  • A frontier model
  • A competitor to GPT-4/Claude/Gemini
  • A claim that small models can match large ones
  • A business

AGI already exists. This is documentation, not disruption.


2. Architecture

2.1 Overview

Input tokens
    ↓
Embedding (vocab → d)
    ↓
[Block × L layers]
    ├── LayerNorm → TuneableAttentionMHA → +residual
    └── LayerNorm → FFN (d → 4d → d) → +residual
    ↓
Final LayerNorm
    ↓
├── ARHead (next token prediction)
└── SATHead (multi-token speculation)

2.2 Tuneable Attention (The Novel Bit)

Standard multi-head attention computes:

Q = XWq,  K = XWk,  V = XWv
Attention = softmax(QKᵀ/√d_k) · V

Where Q, K have shape [batch, seq, heads, d_k].

AGILLM-3's modification:

class TuneableAttentionMHA(nn.Module):
    def __init__(self, d: int, h: int, r: int):
        super().__init__()
        self.h, self.d_k = h, d // h
        # r = rank (the tuneable parameter)
        self.U = nn.Parameter(torch.randn(self.d_k, r))
        nn.init.orthogonal_(self.U)

    def _proj_qk(self, x):
        # Project through U: [batch, seq, heads, d_k] @ [d_k, r] -> [batch, heads, seq, r]
        B, N, _ = x.shape
        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U

The attention computation becomes:

Q' = Q @ U    # [batch, heads, seq, r]
K' = K @ U    # [batch, heads, seq, r]
Attention = softmax(Q'K'ᵀ/√d_k) · V

What this means:

| Regime      | Condition | Effect                                      |
|-------------|-----------|---------------------------------------------|
| Compression | r < d_k   | Q-K similarity computed in lower-dim space  |
| Identity    | r = d_k   | Equivalent to standard attention (if U = I) |
| Expansion   | r > d_k   | Q-K similarity computed in higher-dim space |

The presets encode this as ratios:

  • nano_1x: r = d_k (standard)
  • nano_3x: r = 3 × d_k (expansion)
  • nano_12x: r = 12 × d_k (heavy expansion)

Hypothesis being tested: Does expanding the Q-K interaction space improve attention quality? The orthogonal initialization ensures U starts as a rotation/reflection, not destroying information.
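To make the shapes concrete, the projected attention can be sketched end to end with toy sizes (an illustration of the equations above, not the n.py implementation):

```python
import torch

B, N, h, d_k, r = 2, 8, 4, 16, 48           # r = 3 * d_k, a "nano_3x"-style expansion
Q = torch.randn(B, h, N, d_k)
K = torch.randn(B, h, N, d_k)
V = torch.randn(B, h, N, d_k)

U = torch.empty(d_k, r)
torch.nn.init.orthogonal_(U)                 # U starts as an isometry, as in the text

Qp, Kp = Q @ U, K @ U                        # [B, h, N, r]
scores = (Qp @ Kp.transpose(-2, -1)) / d_k ** 0.5
out = torch.softmax(scores, dim=-1) @ V      # [B, h, N, d_k]
```

Note that V is never projected: only the Q-K similarity moves into the r-dimensional space, so the output shape is unchanged.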

2.3 Positional Encoding: ALiBi

AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:

def alibi_bias(n_heads, n_tokens):
    # Head i (1-indexed) gets slope 2^(-8*i/n_heads); the attention score is
    # penalized by distance: score -= slope * |i - j|
    slopes = torch.tensor([2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)])
    pos = torch.arange(n_tokens)
    dist = (pos[None, :] - pos[:, None]).abs()
    return -slopes[:, None, None] * dist  # [heads, tokens, tokens]

ALiBi chosen for:

  • Zero additional parameters
  • Good length extrapolation
  • Simplicity

2.4 Block Structure

Each transformer block:

class Block(nn.Module):
    def forward(self, x, mask):
        # Pre-norm architecture
        x = x + self.mha(self.ln1(x), mask)
        x = x + self.ff(self.ln2(x))
        return x

FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)

2.5 Model Configurations

From the presets in code:

| Preset    | d_model | Layers | Heads | Rank | ~Params |
|-----------|---------|--------|-------|------|---------|
| nano_3x   | 64      | 2      | 4     | 48   | ~200K   |
| micro_12x | 128     | 4      | 8     | 192  | ~2M     |
| small     | 512     | 8      | 16    | 64   | ~50M    |
| base      | 768     | 12     | 24    | 96   | ~125M   |
| large     | 1024    | 24     | 16    | 128  | ~698M   |

The "large" preset at 698M parameters is the primary AGILLM-3 configuration.


3. Joint AR+SAT Training

3.1 The Idea

Standard language models train only on next-token prediction (autoregressive, AR).

AGILLM-3 trains on BOTH:

  1. AR objective: Predict token t+1 from tokens 1..t
  2. SAT objective: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)

3.2 Masking

AR mask (standard causal):

Position can attend to: all previous positions
[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]
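The causal pattern above can be built as an additive attention bias (a minimal sketch; the helper name is illustrative, not from n.py):

```python
import torch

def causal_mask(n):
    # Lower-triangular allow matrix: query i attends to keys 0..i
    allow = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return torch.where(allow, 0.0, float("-inf"))
```

`causal_mask(4)` reproduces the 4x4 pattern above, with 0 where attention is allowed and -inf elsewhere.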

SAT mask (block-wise):

SAT_BLOCK = 2
Positions in the same block can attend to each other AND to all previous blocks:

Block 0: positions 0,1 can see each other
Block 1: positions 2,3 can see each other + block 0
etc.

def sat_mask(n, block=2):
    idx = torch.arange(n)
    grp = idx // block
    # Query i may attend to key j if j's block is the same or an earlier one
    allow = grp[None, :] <= grp[:, None]
    return torch.where(allow, 0.0, float("-inf"))

3.3 Training Loop

Each batch:

# Forward pass 1: AR
h_ar = core(ids, causal_mask(n))
logits_ar = ar_head(h_ar)[:, :-1]
loss_ar = cross_entropy(logits_ar, targets[:, 1:])

# Forward pass 2: SAT
h_sat = core(ids, sat_mask(n))
logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])

# Optional: gate loss (predict how many tokens to emit)
if gate is not None:
    loss_sat += 0.1 * cross_entropy(gate, emit_target)

loss = loss_ar + loss_sat

3.4 SAT Head with Gating

class SATHead(nn.Module):
    def __init__(self, d, mode="var"):
        self.proj = nn.Linear(d, vocab)  # Token prediction
        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?

The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.
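Reducing the gate's two logits to a stride can be sketched as follows (the helper name and shape convention are assumptions, not from n.py):

```python
import torch

def choose_stride(gate_logits):
    # gate_logits: [1, 2] logits for a single sequence;
    # class 0 -> emit 1 token, class 1 -> emit 2 tokens
    return int(gate_logits.argmax(dim=-1).item()) + 1
```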

3.5 Why Joint Training?

Hypothesis: Training both objectives together might:

  1. Improve representation quality (multi-task learning)
  2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
  3. Learn confidence estimation via the gate

Current status: Experimental. No claims of improvement over AR-only.
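The propose-then-verify loop in point 2 can be sketched schematically; `propose` and `verify` are placeholders standing in for the SAT and AR heads, not n.py functions:

```python
def speculative_generate(prompt_ids, propose, verify, max_new=8, block=2):
    # propose(ids) -> up to `block` candidate next tokens (the SAT head's role)
    # verify(ids, cands) -> how many leading candidates the AR head accepts
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new:
        cands = propose(ids)[:block]
        accepted = verify(ids, cands)
        # Always make progress; a real loop would fall back to the AR token here
        ids.extend(cands[:max(accepted, 1)])
    return ids
```

When every candidate is accepted, each iteration emits `block` tokens for one SAT forward pass, which is where the potential speedup comes from.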


4. Training Infrastructure

4.1 Data Pipeline

def token_stream(ds_names, target_tokens, seed, ...):
    """
    Streaming token generator from HuggingFace datasets.
    - Supports multiple comma-separated datasets
    - Auto-rotates through sources
    - Handles chat format (messages key) or raw text
    - Appends EOS tokens
    """

Default pretraining sources (from code):

OpenTransformer/goddess-crawl
OpenTransformer/agillm-crawl-data
OpenTransformer/web-crawl-2026
OpenTransformer/web-crawl-clean-v2
OpenTransformer/scraped-web-data
OpenTransformer/turbo-crawl
OpenTransformer/sft-data-clean
OpenTransformer/web-crawl-v1

4.2 Optimizer Configuration

opt = AdamW([
    {"params": core.parameters(), "lr": 5e-5},   # LR_CORE
    {"params": ar_head.parameters(), "lr": 2e-4}, # LR_HEAD  
    {"params": sat_head.parameters(), "lr": 2e-4},
])

Separate learning rates for core vs heads.

4.3 Training Features

  • AMP: Automatic mixed precision (bf16 if available, else fp16)
  • Gradient clipping: max_norm=1.0
  • Label smoothing: 0.1
  • Dropout: 0.1 in attention
  • Checkpointing: Configurable interval (default 24h), automatic pruning

4.4 Chinchilla Scaling

ratio = 51.2 if args.chilla_max_double else 25
param_count = count_params(core, ar_h, sat_h)
target_tokens = int(ratio * param_count)

Default follows ~25× Chinchilla ratio; optional 51.2× for "double Chinchilla".

For 698M params: ~17.5B tokens default, ~35.7B tokens with double.
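The token budgets above can be reproduced with a few lines (a sketch of the same arithmetic as the snippet above, with a hypothetical function name):

```python
def target_tokens(param_count, double=False):
    # ~25x params by default; 51.2x for "double Chinchilla"
    ratio = 51.2 if double else 25
    return int(ratio * param_count)
```

For the 698M-parameter large preset this gives ~17.45B tokens by default and ~35.7B with the double flag.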

4.5 Hot Config

Runtime dataset switching without restart:

# /workspace/hot_config.json
{"datasets": ["new_dataset_1", "new_dataset_2"]}

Trainer checks this file periodically and switches data sources.
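The periodic check might look like this (a sketch; the function name and return convention are assumptions, not the n.py code):

```python
import json
import os

def maybe_reload_datasets(current, path="/workspace/hot_config.json"):
    # Return the dataset list from the hot-config file if it exists,
    # otherwise keep the current list unchanged
    if not os.path.exists(path):
        return current
    with open(path) as f:
        return json.load(f).get("datasets", current)
```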

4.6 Auto-Grow

Optional feature to increase block size during training:

--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000

Starts with smaller context, grows as training stabilizes.
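The growth schedule implied by these flags can be sketched as a pure function of the training step (illustrative, not the n.py implementation):

```python
def block_for_step(step, plan=(576, 640, 768, 896, 1024, 1122), every=50_000):
    # Move one notch up the plan every `every` steps, capped at the final size
    return plan[min(step // every, len(plan) - 1)]
```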


5. Inference

5.1 AR Mode (Standard)

python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"

Standard autoregressive generation with KV-cache.

5.2 SAT Mode (Speculative)

python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var

Generates SAT_BLOCK tokens at once, optionally using gate to choose stride.

5.3 Sampling Parameters

| Parameter          | AR Default | SAT Default |
|--------------------|------------|-------------|
| temperature        | 0.7        | 0.5         |
| top_k              | 0          | 30          |
| repetition_penalty | 1.3        | 2.0         |
| presence_penalty   | 0.0        | 0.6         |
| frequency_penalty  | 0.3        | 1.0         |
| penalty_last_n     | 128        | 200         |

SAT mode uses more aggressive penalties to avoid repetition from parallel generation.
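One common way such penalties are applied to a logit vector, as a hedged sketch (the exact formulas in n.py may differ):

```python
import torch

def apply_penalties(logits, recent_ids, rep=1.3, presence=0.0, freq=0.3):
    # recent_ids: token ids in the last-n penalty window
    counts = torch.bincount(recent_ids, minlength=logits.numel()).float()
    seen = counts > 0
    # Repetition penalty: shrink positive logits, push negative ones lower
    penalized = torch.where(logits > 0, logits / rep, logits * rep)
    logits = torch.where(seen, penalized, logits)
    # Presence (flat) and frequency (count-scaled) penalties, additive
    return logits - presence * seen.float() - freq * counts
```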


6. Weight Tying

Optional embedding-LM head weight tying:

class ARHead(nn.Module):
    def __init__(self, d, tie_weights=False, embedding_weight=None):
        super().__init__()
        if tie_weights and embedding_weight is not None:
            self.proj = nn.Linear(d, vocab, bias=False)   # vocab = tokenizer size
            self.proj.weight = embedding_weight           # Share weights

Reduces parameters by ~vocab × d (significant for large vocab).
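In PyTorch, tying amounts to sharing one tensor between the embedding and the output head; a self-contained sketch with small illustrative sizes:

```python
import torch.nn as nn

vocab, d = 1000, 64                 # sizes are illustrative
emb = nn.Embedding(vocab, d)        # weight: [vocab, d]
head = nn.Linear(d, vocab, bias=False)
head.weight = emb.weight            # one shared [vocab, d] tensor
# Without tying, head.weight would add vocab * d extra parameters
```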


7. Current Training Status

As of January 2026:

  • Step: 2.2M+
  • Tokens seen: ~2.4B
  • Preset: large (698M params)
  • Training on vast.ai 3090
  • Checkpoints every 6 hours

8. Observations and Notes

8.1 Expansion Ratio Effects

Early experiments suggest:

  • 1x (standard): baseline behavior
  • 3x-6x: slight improvement in attention patterns
  • 12x+: diminishing returns, increased compute

Not rigorously benchmarked. Observations only.

8.2 AR vs AR+SAT

AR-only mode (--ar_only) is available for comparison. Joint training roughly doubles the forward passes per batch.

8.3 Known Issues

  1. SAT inference quality lags AR (expected, as it is a harder task)
  2. Gate accuracy is mediocre (it often just predicts "emit 2")
  3. Memory usage is higher than an equivalent AR-only model

9. Code Location

Primary file: n.py

Key classes:

  • TuneableAttentionMHA: The modified attention
  • Block: Transformer block
  • Encoder: Full encoder stack
  • ARHead, SATHead: Output heads
  • token_stream: Data pipeline
  • _train_phase: Training loop

10. License and Citation

Code released under MIT license.

If referencing this work:

@misc{agillm3,
  author = {Bisset, Scott},
  title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
  year = {2026},
  publisher = {OpenTransformers Ltd}
}

Appendix A: Full Preset Table

PRESETS = {
    "femto_1x":  dict(d=16,   layers=1,  heads=1,  rank=16),
    "femto_12x": dict(d=16,   layers=1,  heads=1,  rank=192),
    "pico_1x":   dict(d=32,   layers=1,  heads=2,  rank=16),
    "pico_12x":  dict(d=32,   layers=1,  heads=2,  rank=192),
    "nano_1x":   dict(d=64,   layers=2,  heads=4,  rank=16),
    "nano_3x":   dict(d=64,   layers=2,  heads=4,  rank=48),
    "nano_12x":  dict(d=64,   layers=2,  heads=4,  rank=192),
    "micro_12x": dict(d=128,  layers=4,  heads=8,  rank=192),
    "small":     dict(d=512,  layers=8,  heads=16, rank=64),
    "base":      dict(d=768,  layers=12, heads=24, rank=96),
    "large":     dict(d=1024, layers=24, heads=16, rank=128),
}

Appendix B: Example Training Command

python n.py train \
    --preset large \
    --batch_size 4 \
    --block 1122 \
    --amp \
    --save_every_sec 21600 \
    --save_dir /workspace/ckpts_expansion \
    --max_ckpts 5 \
    --resume /workspace/ckpts_expansion

Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM