
AGILLM-3: Technical Documentation

A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training

Scott Bisset
OpenTransformers Ltd
January 2026


Abstract

This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned orthogonal projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models; AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.


1. Motivation

1.1 What This Is

AGILLM-3 is a research project exploring:

  1. Tuneable attention rank: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?

  2. Joint AR+SAT training: Can a model learn both next-token prediction AND multi-token speculation simultaneously?

1.2 What This Isn't

This is not:

  • A frontier model
  • A competitor to GPT-4/Claude/Gemini
  • A claim that small models can match large ones
  • A business

AGI already exists. This is documentation, not disruption.


2. Architecture

2.1 Overview

Input tokens
    ↓
Embedding (vocab → d)
    ↓
[Block × L layers]
    ├── LayerNorm → TuneableAttentionMHA → +residual
    └── LayerNorm → FFN (d → 4d → d) → +residual
    ↓
Final LayerNorm
    ↓
├── ARHead (next token prediction)
└── SATHead (multi-token speculation)

2.2 Tuneable Attention (The Novel Bit)

Standard multi-head attention computes:

Q = XWq,  K = XWk,  V = XWv
Attention = softmax(QKᵀ/√d_k) · V

Where Q, K have shape [batch, seq, heads, d_k].

AGILLM-3's modification:

class TuneableAttentionMHA(nn.Module):
    def __init__(self, d: int, h: int, r: int):
        super().__init__()
        self.h, self.d_k = h, d // h
        # r = rank (the tuneable parameter)
        self.U = nn.Parameter(torch.randn(self.d_k, r))
        nn.init.orthogonal_(self.U)

    def _proj_qk(self, x):
        # Project through U: [batch, seq, heads, d_k] @ [d_k, r] -> [batch, heads, seq, r]
        B, N, _ = x.shape
        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U

The attention computation becomes:

Q' = Q @ U    # [batch, heads, seq, r]
K' = K @ U    # [batch, heads, seq, r]
Attention = softmax(Q'K'ᵀ/√d_k) · V

What this means:

| Regime      | Condition | Effect                                      |
|-------------|-----------|---------------------------------------------|
| Compression | r < d_k   | Q-K similarity computed in lower-dim space  |
| Identity    | r = d_k   | Equivalent to standard attention (if U = I) |
| Expansion   | r > d_k   | Q-K similarity computed in higher-dim space |

The presets encode this as ratios:

  • nano_1x: r = d_k (standard)
  • nano_3x: r = 3 × d_k (expansion)
  • nano_12x: r = 12 × d_k (heavy expansion)

Hypothesis being tested: Does expanding the Q-K interaction space improve attention quality? The orthogonal initialization ensures U starts as a rotation/reflection, not destroying information.
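To make the shapes concrete, the projected attention can be sketched end to end with toy sizes (an illustration of the equations above, not the n.py implementation):

```python
import torch

B, N, h, d_k, r = 2, 8, 4, 16, 48           # r = 3 * d_k, a "nano_3x"-style expansion
Q = torch.randn(B, h, N, d_k)
K = torch.randn(B, h, N, d_k)
V = torch.randn(B, h, N, d_k)

U = torch.empty(d_k, r)
torch.nn.init.orthogonal_(U)                 # U starts as an isometry, as in the text

Qp, Kp = Q @ U, K @ U                        # [B, h, N, r]
scores = (Qp @ Kp.transpose(-2, -1)) / d_k ** 0.5
out = torch.softmax(scores, dim=-1) @ V      # [B, h, N, d_k]
```

Note that V is never projected: only the Q-K similarity moves into the r-dimensional space, so the output shape is unchanged.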

2.3 Positional Encoding: ALiBi

AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:

def alibi_bias(n_heads, n_tokens):
    # Head i (1-indexed) gets slope 2^(-8*i/n_heads); the attention score is
    # penalized by distance: score -= slope * |i - j|
    slopes = torch.tensor([2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)])
    pos = torch.arange(n_tokens)
    dist = (pos[None, :] - pos[:, None]).abs()
    return -slopes[:, None, None] * dist  # [heads, tokens, tokens]

ALiBi chosen for:

  • Zero additional parameters
  • Good length extrapolation
  • Simplicity

2.4 Block Structure

Each transformer block:

class Block(nn.Module):
    def forward(self, x, mask):
        # Pre-norm architecture
        x = x + self.mha(self.ln1(x), mask)
        x = x + self.ff(self.ln2(x))
        return x

FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)

2.5 Model Configurations

From the presets in code:

| Preset    | d_model | Layers | Heads | Rank | ~Params |
|-----------|---------|--------|-------|------|---------|
| nano_3x   | 64      | 2      | 4     | 48   | ~200K   |
| micro_12x | 128     | 4      | 8     | 192  | ~2M     |
| small     | 512     | 8      | 16    | 64   | ~50M    |
| base      | 768     | 12     | 24    | 96   | ~125M   |
| large     | 1024    | 24     | 16    | 128  | ~698M   |

The "large" preset at 698M parameters is the primary AGILLM-3 configuration.


3. Joint AR+SAT Training

3.1 The Idea

Standard language models train only on next-token prediction (autoregressive, AR).

AGILLM-3 trains on BOTH:

  1. AR objective: Predict token t+1 from tokens 1..t
  2. SAT objective: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)

3.2 Masking

AR mask (standard causal):

Position can attend to: all previous positions
[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]
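The causal pattern above can be built as an additive attention bias (a minimal sketch; the helper name is illustrative, not from n.py):

```python
import torch

def causal_mask(n):
    # Lower-triangular allow matrix: query i attends to keys 0..i
    allow = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return torch.where(allow, 0.0, float("-inf"))
```

`causal_mask(4)` reproduces the 4x4 pattern above, with 0 where attention is allowed and -inf elsewhere.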

SAT mask (block-wise):

SAT_BLOCK = 2
Positions in the same block can attend to each other AND to all previous blocks:

Block 0: positions 0,1 can see each other
Block 1: positions 2,3 can see each other + block 0
etc.

def sat_mask(n, block=2):
    idx = torch.arange(n)
    grp = idx // block
    # Query i may attend to key j if j's block is the same or an earlier one
    allow = grp[None, :] <= grp[:, None]
    return torch.where(allow, 0.0, float("-inf"))

3.3 Training Loop

Each batch:

# Forward pass 1: AR
h_ar = core(ids, causal_mask(n))
logits_ar = ar_head(h_ar)[:, :-1]
loss_ar = cross_entropy(logits_ar, targets[:, 1:])

# Forward pass 2: SAT
h_sat = core(ids, sat_mask(n))
logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])

# Optional: gate loss (predict how many tokens to emit)
if gate is not None:
    loss_sat += 0.1 * cross_entropy(gate, emit_target)

loss = loss_ar + loss_sat

3.4 SAT Head with Gating

class SATHead(nn.Module):
    def __init__(self, d, mode="var"):
        self.proj = nn.Linear(d, vocab)  # Token prediction
        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?

The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.
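Reducing the gate's two logits to a stride can be sketched as follows (the helper name and shape convention are assumptions, not from n.py):

```python
import torch

def choose_stride(gate_logits):
    # gate_logits: [1, 2] logits for a single sequence;
    # class 0 -> emit 1 token, class 1 -> emit 2 tokens
    return int(gate_logits.argmax(dim=-1).item()) + 1
```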

3.5 Why Joint Training?

Hypothesis: Training both objectives together might:

  1. Improve representation quality (multi-task learning)
  2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
  3. Learn confidence estimation via the gate

Current status: Experimental. No claims of improvement over AR-only.
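The propose-then-verify loop in point 2 can be sketched schematically; `propose` and `verify` are placeholders standing in for the SAT and AR heads, not n.py functions:

```python
def speculative_generate(prompt_ids, propose, verify, max_new=8, block=2):
    # propose(ids) -> up to `block` candidate next tokens (the SAT head's role)
    # verify(ids, cands) -> how many leading candidates the AR head accepts
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new:
        cands = propose(ids)[:block]
        accepted = verify(ids, cands)
        # Always make progress; a real loop would fall back to the AR token here
        ids.extend(cands[:max(accepted, 1)])
    return ids
```

When every candidate is accepted, each iteration emits `block` tokens for one SAT forward pass, which is where the potential speedup comes from.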


4. Training Infrastructure

4.1 Data Pipeline

def token_stream(ds_names, target_tokens, seed, ...):
    """
    Streaming token generator from HuggingFace datasets.
    - Supports multiple comma-separated datasets
    - Auto-rotates through sources
    - Handles chat format (messages key) or raw text
    - Appends EOS tokens
    """

Default pretraining sources (from code):

OpenTransformer/goddess-crawl
OpenTransformer/agillm-crawl-data
OpenTransformer/web-crawl-2026
OpenTransformer/web-crawl-clean-v2
OpenTransformer/scraped-web-data
OpenTransformer/turbo-crawl
OpenTransformer/sft-data-clean
OpenTransformer/web-crawl-v1

4.2 Optimizer Configuration

opt = AdamW([
    {"params": core.parameters(), "lr": 5e-5},   # LR_CORE
    {"params": ar_head.parameters(), "lr": 2e-4}, # LR_HEAD  
    {"params": sat_head.parameters(), "lr": 2e-4},
])

Separate learning rates for core vs heads.

4.3 Training Features

  • AMP: Automatic mixed precision (bf16 if available, else fp16)
  • Gradient clipping: max_norm=1.0
  • Label smoothing: 0.1
  • Dropout: 0.1 in attention
  • Checkpointing: Configurable interval (default 24h), automatic pruning

4.4 Chinchilla Scaling

ratio = 51.2 if args.chilla_max_double else 25
param_count = count_params(core, ar_h, sat_h)
target_tokens = int(ratio * param_count)

Default follows ~25× Chinchilla ratio; optional 51.2× for "double Chinchilla".

For 698M params: ~17.5B tokens default, ~35.7B tokens with double.
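The token budgets above can be reproduced with a few lines (a sketch of the same arithmetic as the snippet above, with a hypothetical function name):

```python
def target_tokens(param_count, double=False):
    # ~25x params by default; 51.2x for "double Chinchilla"
    ratio = 51.2 if double else 25
    return int(ratio * param_count)
```

For the 698M-parameter large preset this gives ~17.45B tokens by default and ~35.7B with the double flag.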

4.5 Hot Config

Runtime dataset switching without restart:

# /workspace/hot_config.json
{"datasets": ["new_dataset_1", "new_dataset_2"]}

Trainer checks this file periodically and switches data sources.
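The periodic check might look like this (a sketch; the function name and return convention are assumptions, not the n.py code):

```python
import json
import os

def maybe_reload_datasets(current, path="/workspace/hot_config.json"):
    # Return the dataset list from the hot-config file if it exists,
    # otherwise keep the current list unchanged
    if not os.path.exists(path):
        return current
    with open(path) as f:
        return json.load(f).get("datasets", current)
```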

4.6 Auto-Grow

Optional feature to increase block size during training:

--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000

Starts with smaller context, grows as training stabilizes.
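The growth schedule implied by these flags can be sketched as a pure function of the training step (illustrative, not the n.py implementation):

```python
def block_for_step(step, plan=(576, 640, 768, 896, 1024, 1122), every=50_000):
    # Move one notch up the plan every `every` steps, capped at the final size
    return plan[min(step // every, len(plan) - 1)]
```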


5. Inference

5.1 AR Mode (Standard)

python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"

Standard autoregressive generation with KV-cache.

5.2 SAT Mode (Speculative)

python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var

Generates SAT_BLOCK tokens at once, optionally using gate to choose stride.

5.3 Sampling Parameters

| Parameter          | AR Default | SAT Default |
|--------------------|------------|-------------|
| temperature        | 0.7        | 0.5         |
| top_k              | 0          | 30          |
| repetition_penalty | 1.3        | 2.0         |
| presence_penalty   | 0.0        | 0.6         |
| frequency_penalty  | 0.3        | 1.0         |
| penalty_last_n     | 128        | 200         |

SAT mode uses more aggressive penalties to avoid repetition from parallel generation.
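One common way such penalties are applied to a logit vector, as a hedged sketch (the exact formulas in n.py may differ):

```python
import torch

def apply_penalties(logits, recent_ids, rep=1.3, presence=0.0, freq=0.3):
    # recent_ids: token ids in the last-n penalty window
    counts = torch.bincount(recent_ids, minlength=logits.numel()).float()
    seen = counts > 0
    # Repetition penalty: shrink positive logits, push negative ones lower
    penalized = torch.where(logits > 0, logits / rep, logits * rep)
    logits = torch.where(seen, penalized, logits)
    # Presence (flat) and frequency (count-scaled) penalties, additive
    return logits - presence * seen.float() - freq * counts
```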


6. Weight Tying

Optional embedding-LM head weight tying:

class ARHead(nn.Module):
    def __init__(self, d, tie_weights=False, embedding_weight=None):
        super().__init__()
        if tie_weights and embedding_weight is not None:
            self.proj = nn.Linear(d, vocab, bias=False)   # vocab = tokenizer size
            self.proj.weight = embedding_weight           # Share weights

Reduces parameters by ~vocab × d (significant for large vocab).
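In PyTorch, tying amounts to sharing one tensor between the embedding and the output head; a self-contained sketch with small illustrative sizes:

```python
import torch.nn as nn

vocab, d = 1000, 64                 # sizes are illustrative
emb = nn.Embedding(vocab, d)        # weight: [vocab, d]
head = nn.Linear(d, vocab, bias=False)
head.weight = emb.weight            # one shared [vocab, d] tensor
# Without tying, head.weight would add vocab * d extra parameters
```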


7. Current Training Status

As of January 2026:

  • Step: 2.2M+
  • Tokens seen: ~2.4B
  • Preset: large (698M params)
  • Training on vast.ai 3090
  • Checkpoints every 6 hours

8. Observations and Notes

8.1 Expansion Ratio Effects

Early experiments suggest:

  • 1x (standard): baseline behavior
  • 3x-6x: slight improvement in attention patterns
  • 12x+: diminishing returns, increased compute

Not rigorously benchmarked. Observations only.

8.2 AR vs AR+SAT

AR-only mode (--ar_only) is available for comparison. Joint training roughly doubles the forward passes per batch.

8.3 Known Issues

  1. SAT inference quality lags AR (expected, as it is a harder task)
  2. Gate accuracy is mediocre (it often just predicts "emit 2")
  3. Memory usage is higher than an equivalent AR-only model

9. Code Location

Primary file: n.py

Key classes:

  • TuneableAttentionMHA: The modified attention
  • Block: Transformer block
  • Encoder: Full encoder stack
  • ARHead, SATHead: Output heads
  • token_stream: Data pipeline
  • _train_phase: Training loop

10. License and Citation

Code released under MIT license.

If referencing this work:

@misc{agillm3,
  author = {Bisset, Scott},
  title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
  year = {2026},
  publisher = {OpenTransformers Ltd}
}

Appendix A: Full Preset Table

PRESETS = {
    "femto_1x":  dict(d=16,   layers=1,  heads=1,  rank=16),
    "femto_12x": dict(d=16,   layers=1,  heads=1,  rank=192),
    "pico_1x":   dict(d=32,   layers=1,  heads=2,  rank=16),
    "pico_12x":  dict(d=32,   layers=1,  heads=2,  rank=192),
    "nano_1x":   dict(d=64,   layers=2,  heads=4,  rank=16),
    "nano_3x":   dict(d=64,   layers=2,  heads=4,  rank=48),
    "nano_12x":  dict(d=64,   layers=2,  heads=4,  rank=192),
    "micro_12x": dict(d=128,  layers=4,  heads=8,  rank=192),
    "small":     dict(d=512,  layers=8,  heads=16, rank=64),
    "base":      dict(d=768,  layers=12, heads=24, rank=96),
    "large":     dict(d=1024, layers=24, heads=16, rank=128),
}

Appendix B: Example Training Command

python n.py train \
    --preset large \
    --batch_size 4 \
    --block 1122 \
    --amp \
    --save_every_sec 21600 \
    --save_dir /workspace/ckpts_expansion \
    --max_ckpts 5 \
    --resume /workspace/ckpts_expansion

Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM