CHIAR-Former

Chiaroscuro Attention: Spending Compute in the Dark
Operator Routing via Spectral Entropy Across Tasks and Scales

Prateek Kumar Sikdar · AI Architect, Agentic AI Practice · Accenture, Bengaluru · 2026


Model Description

CHIAR-Former is a 400M parameter language model that routes each token to either DCT spectral mixing (O(d log d), sub-quadratic) or full self-attention (O(n²d), quadratic) based on per-token spectral entropy H(x) in [0,1] — the normalised information entropy of its DCT power spectrum.

The core insight is that not every token deserves the same compute budget. Function words (the, of, and) have smooth, low-frequency embeddings and are handled efficiently by DCT mixing. Content words (overwhelming, consensus, paradox) carry rich, high-entropy representations that require full self-attention. The model measures this distinction analytically — no learned black-box gate for the per-token decision.

The name comes from chiaroscuro, the Renaissance painting technique of Caravaggio and Rembrandt: spend light only where the eye needs detail, leave the periphery in inexpensive shadow. CHIAR-Former applies the same principle to computation.

  • Developed by: Prateek Kumar Sikdar, Accenture
  • Model type: Causal Language Model (decoder-only transformer)
  • Language: English
  • License: MIT
  • Repository: hackie123/CHIAR-former
  • Paper: arXiv:2606.08327
  • Finetuned from: Trained from scratch — not fine-tuned from any existing model


Uses

Direct Use

CHIAR-Former is a causal language model trained on WikiText-103. It can be used directly for:

  • Text generation — auto-regressive next-token prediction on English prose.
  • Perplexity evaluation — scoring text fluency on language modelling benchmarks.
  • Routing analysis — inspecting per-token spectral entropy and operator assignments to study how compute is allocated across token types.
  • Research — as a baseline or component for efficient transformer research; the spectral routing mechanism is modular and can be studied independently of the rest of the architecture.

Downstream Use

CHIAR-Former can be fine-tuned for downstream NLP tasks. The model includes a classifier head (CHIARClassifier) for sequence classification. The MetaRouter's task-level gate adapts automatically to new input distributions when fine-tuned on mixed batches. The architecture has been validated on IMDB sentiment classification and ListOps (symbolic reasoning) in addition to language modelling.

Fine-tuning recommendations:

  • Use a lower learning rate than pretraining (5e-5 or below) to avoid output collapse at this scale.
  • The MetaRouter gate will adapt to the new task domain within the first few hundred steps.

Out-of-Scope Use

  • Non-English text: Trained on English text only. Will not perform well on other languages.
  • Long sequences beyond 256 tokens: The model was trained with max_seq_len=256. Inputs longer than this are truncated. RoPE allows some generalisation, but significant degradation is expected at much longer contexts.
  • Purely symbolic tasks without mixed training: On mathematical expressions, code, or structured data, the checkpoint trained on WikiText-103 alone underperforms. Use the mixed-training checkpoint (chiar_v3_mixed_best.pt) for these domains.
  • Safety-critical generation: This is a research model. It has not been aligned, RLHF-trained, or safety-evaluated for deployment in production systems.

Bias, Risks, and Limitations

Training data bias: CHIAR-Former was trained on WikiText-103 (Wikipedia articles). Wikipedia has well-documented demographic and geographic biases — English-language, Western-centric, and skewed toward topics with active editorial communities. These biases are present in the model's outputs.

Technical limitations:

  • Maximum context of 256 tokens. Longer documents must be chunked.
  • The spectral entropy threshold tau = 0.8954 was calibrated on WikiText-103 embeddings. On very different domains, recalibration via calibrate_tau.py is recommended.
  • The MetaRouter's gate g ≈ 0.22 was learned on a specific four-dataset mixture. Fine-tuning on new domains will shift this value.
  • The full 37% FLOP saving requires hard gating at inference. The current soft-blend formulation computes DCT regardless of the gate value.

Not a generative assistant: This is a base language model, not an instruction-following or chat model. It completes text in a continuation style.

Recommendations

Users should treat CHIAR-Former as a research artefact. For any application involving generated text shown to end users, appropriate content filtering and safety evaluation should be applied independently. The routing heatmap tool in the demo is useful for understanding what the model finds computationally complex in any given input.


How to Get Started with the Model

Install

git clone https://github.com/hackie123/CHIAR-former.git
cd CHIAR-former
pip install "torch>=2.1.0" "transformers>=4.38.0" huggingface_hub

Download the checkpoint

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id='prateeksikdar/CHIAR-Former',
    filename='chiar_dct_attn_400M_best.pt',
    local_dir='./checkpoints'
)

Load and run inference

import torch
from config import Config
from model import CHIARFormer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

cfg = Config()
cfg.vocab_size = len(tokenizer)
model = CHIARFormer(cfg)

ckpt = torch.load('checkpoints/chiar_dct_attn_400M_best.pt', map_location='cpu')
model.load_state_dict(ckpt['model'])
model.eval()

text = "Despite the overwhelming scientific consensus on climate change,"
ids  = tokenizer.encode(text, return_tensors='pt')

with torch.no_grad():
    logits, _ = model(ids)

next_token = tokenizer.decode(logits[0, -1].argmax().item())
print(f"Next token: {next_token}")
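The single-step example above extends to multi-token generation by feeding the chosen token back in. The following is a minimal sketch of greedy decoding, not code from the repository: `greedy_generate` and `next_token_logits` are hypothetical names, and the toy logits function stands in for a call such as `model(ids)`. It crops the context to the most recent 256 tokens, on the assumption that inputs should never exceed the max_seq_len=256 used in training.

```python
from typing import Callable, List, Sequence

MAX_SEQ_LEN = 256  # training context length stated in this card


def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Greedy decoding: append the argmax token until EOS or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-MAX_SEQ_LEN:]         # assumption: keep only the last 256 tokens
        logits = next_token_logits(context)  # hypothetical stand-in for the model call
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_logits(context: Sequence[int]) -> List[float]:
    """Toy 'model' over a 5-token vocabulary: prefers (last token + 1) mod 5."""
    scores = [0.0] * 5
    scores[(context[-1] + 1) % 5] = 1.0
    return scores


generated = greedy_generate(toy_logits, prompt_ids=[0], max_new_tokens=10, eos_id=4)
print(generated)
```

With the real model, `next_token_logits` would wrap the forward pass and return `logits[0, -1]` as a list; sampling strategies other than argmax would replace the `max(...)` line.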

Inspect per-token routing decisions

with torch.no_grad():
    logits, routing_infos, _, meta_gate = model(ids, return_routing_info=True)

# routing_infos[1] = L2, the first routing layer
op_idx  = routing_infos[1]['op_idx'][0]   # (T,)  0=DCT, 1=Attention
entropy = routing_infos[1]['H'][0]         # (T,)  spectral entropy per token

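tokens = [tokenizer.decode([t]) for t in ids[0].tolist()]
# .tolist() turns the tensors into plain Python numbers for comparison and formatting
for tok, op, h in zip(tokens, op_idx.tolist(), entropy.tolist()):
    route = 'Attention' if op == 1 else 'DCT'
    print(f"{tok:15s}  H={h:.3f}  ->  {route}")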

print(f"\nMetaRouter gate: {meta_gate.item():.3f}  (1=use DCT, 0=bypass)")
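Aggregating the per-token assignments gives the operator split for a given input, which is the quantity behind the DCT/Attention ratio quoted in the evaluation section. The sketch below is illustrative only: `routing_fractions` is a hypothetical helper, and the example assignments are made up rather than produced by the model. It assumes the index convention from the snippet above (0=DCT, 1=Attention).

```python
from collections import Counter
from typing import Dict, Sequence

OPERATOR_NAMES = {0: "DCT", 1: "Attention"}  # index convention from the snippet above


def routing_fractions(op_idx: Sequence[int]) -> Dict[str, float]:
    """Return the share of tokens assigned to each operator."""
    if len(op_idx) == 0:
        raise ValueError("op_idx must contain at least one token")
    counts = Counter(op_idx)
    return {name: counts.get(i, 0) / len(op_idx) for i, name in OPERATOR_NAMES.items()}


example_ops = [0, 1, 0, 0, 1, 1, 0, 1]  # made-up assignments, not model output
fractions = routing_fractions(example_ops)
print(fractions)
```

With real outputs, `routing_infos[1]['op_idx'][0].tolist()` could be passed in place of `example_ops`.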

Training Details

Training Data

  • Primary: WikiText-103 (wikitext-103-v1) — 118M tokens of verified Good and Featured Wikipedia articles. Used for all ablations and the main 400M training run.
  • Secondary (mixed training only): WikiText-2 (2.4M tokens), IMDB sentiment dataset, ListOps (symbolic reasoning sequences from the Long Range Arena benchmark).

The GPT-2 tokeniser (vocab size 50,257) was used for all datasets. Sequences were packed to max_seq_len=256 with no padding; shorter articles were concatenated with EOS tokens.

Training Procedure

Preprocessing

Text was tokenised with the GPT-2 BPE tokeniser. Articles shorter than one sequence were concatenated sequentially, and the resulting token stream was cut into non-overlapping windows of exactly 256 tokens.
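The packing step can be sketched as follows. This is not the repository's preprocessing code: `pack_sequences` is a hypothetical name, and two details are assumptions, namely that an EOS token is appended after every article and that a trailing partial window is dropped. For the GPT-2 tokeniser the EOS id is 50256; the toy example uses 0 and a window of 4 for readability.

```python
from typing import Iterable, List


def pack_sequences(
    articles: Iterable[List[int]], eos_id: int, seq_len: int = 256
) -> List[List[int]]:
    """Concatenate tokenised articles with EOS separators and cut fixed-length windows."""
    stream: List[int] = []
    for token_ids in articles:
        stream.extend(token_ids)
        stream.append(eos_id)  # assumption: one EOS after each article
    n_full = len(stream) // seq_len  # assumption: the trailing remainder is discarded
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


toy_articles = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = pack_sequences(toy_articles, eos_id=0, seq_len=4)
print(chunks)
```

Because windows never overlap and no padding is used, every position in every training sequence carries a real token.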

Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Training regime | fp16 mixed precision (fp32 weight updates) |
| Optimiser | AdamW (beta1=0.9, beta2=0.95, weight decay=0.01) |
| Learning rate | 1e-4 peak, cosine decay to 1e-6 |
| Warmup steps | 1,000 |
| Gradient clipping | 1.0 |
| Batch size (per GPU) | 8 |
| Gradient accumulation | 16 steps (effective batch = 128) |
| Max sequence length | 256 tokens |
| Epochs | 5 (400M) / 10 (17M ablations) |
| Dropout | 0.1 |
| Weight initialisation | N(0, 0.02) for all Linear and Embedding layers |
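The learning-rate settings above can be read as a warmup followed by cosine decay. The sketch below shows one common way to realise that schedule; it is not taken from the training script. The linear shape of the warmup and the `total_steps` value are assumptions, and `learning_rate` is a hypothetical function name.

```python
import math

PEAK_LR = 1e-4        # peak learning rate from the table
MIN_LR = 1e-6         # cosine decay floor from the table
WARMUP_STEPS = 1_000  # warmup steps from the table


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup (assumed) to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, progress)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical value, chosen only for illustration
effective_batch = 8 * 16  # per-GPU batch size times gradient accumulation steps

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}")
print("effective batch:", effective_batch)
```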

The spectral router threshold tau = 0.8954 was calibrated after baseline training by measuring the 33rd and 67th percentiles of H(x) on the WikiText-103 validation set and taking the midpoint.
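The calibration rule described above reduces to a few lines. The following is a sketch of that rule only, not the contents of calibrate_tau.py: `calibrate_tau_sketch` is a hypothetical name, and the toy entropy values are synthetic. In practice the input would be the H(x) values collected from the baseline model on the validation set.

```python
import numpy as np


def calibrate_tau_sketch(entropies, low_pct: float = 33.0, high_pct: float = 67.0) -> float:
    """Midpoint of the 33rd and 67th percentiles of the observed spectral entropies."""
    h = np.asarray(entropies, dtype=np.float64)
    low, high = np.percentile(h, [low_pct, high_pct])
    return float((low + high) / 2.0)


toy_entropies = np.linspace(0.0, 1.0, 101)  # synthetic values, evenly spread over [0, 1]
tau = calibrate_tau_sketch(toy_entropies)
print(f"tau = {tau:.4f}")
```

On real embeddings the entropy distribution is concentrated near the top of the range, which is why the calibrated value reported here (0.8954) sits far from the 0.5 obtained for the evenly spread toy data.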

For mixed training (MetaRouter checkpoint), batches were sampled 75% from WikiText-103 and 25% randomly from WikiText-2, IMDB, and ListOps. The MetaRouter bias was initialised to 2.0 (gate ≈ 0.88) and converged to g ≈ 0.22 over training.
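The gate values quoted above follow directly from the sigmoid. The short check below reproduces the initial value from the bias of 2.0 and shows what pre-activation corresponds to the converged gate; it assumes the linear layer's weight contribution is close to zero at initialisation, so that the bias alone determines the starting gate.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


initial_gate = sigmoid(2.0)    # bias initialised to 2.0
converged_logit = logit(0.22)  # pre-activation implied by the converged gate g = 0.22

print(f"initial gate    = {initial_gate:.4f}")
print(f"converged logit = {converged_logit:.4f}")
```

The pre-activation therefore moves from +2.0 to a negative value over training, that is, from mostly applying DCT mixing at L1 to mostly bypassing it.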

Speeds, Sizes, Times

| | 17M (ablation) | 400M (main) |
| --- | --- | --- |
| d_model | 256 | 1024 |
| n_heads | 4 | 16 |
| n_layers | 4 | 28 |
| Checkpoint size | ~65 MB | ~1.6 GB |
| Training time | ~2 hrs | ~10 hrs |
| Hardware | NVIDIA RTX A5000 24GB | NVIDIA RTX A5000 24GB |

Evaluation

Testing Data

All evaluation used the held-out test splits of the respective datasets with no test-set contamination during training: WikiText-103 test set (245K tokens) and WikiText-2 test set (245K tokens, for the mixed-training checkpoint only).

Metrics

Perplexity (PPL): exp(average cross-entropy loss) on the test set. Lower is better. All values use the GPT-2 tokeniser for comparability.
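As a concrete illustration of the definition, the sketch below computes perplexity from per-token negative log-likelihoods (natural log). `perplexity` is a hypothetical helper, and the toy input is constructed so the answer is known in advance: a model that assigns probability 1/4 to every correct token has a perplexity of 4.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if len(token_nlls) == 0:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


uniform_nlls = [math.log(4.0)] * 10  # every correct token gets probability 1/4
ppl = perplexity(uniform_nlls)
print(f"PPL = {ppl:.2f}")
```

Averaging must be done over tokens, not over sequences or batches, for the result to be comparable across evaluation setups.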

FLOP reduction: Fraction of FLOPs saved in routing layers (L2–L27) when exactly one operator runs per token. At the observed ~50/50 DCT/Attention split on WikiText-103, routing layers save 62.5% of attention FLOPs, yielding ~37% total model FLOP reduction vs. full attention.

Results

WikiText-103 (400M parameters):

| Model | Params | Val PPL | Test PPL | FLOP Reduction |
| --- | --- | --- | --- | --- |
| Baseline (Full Attention + RoPE) | 404M | 23.73 | 23.58 | n/a |
| CHIAR-Former DCT+Attn | 400M | 27.75 | 27.51 | ~37% |
| CHIAR-Former Mixed Training | 400M | 28.81 | 28.56 | ~37% |

WikiText-103 ablations (17M parameters):

| Model | Val PPL | Test PPL |
| --- | --- | --- |
| Baseline (Full Attention + RoPE) | 45.78 | 44.63 |
| CHIAR Soft routing | 46.75 | 45.62 |
| CHIAR Hard routing (STE) | 46.86 | 45.67 |
| CHIAR Threshold (ours) | 49.62 | 48.34 |
| CHIAR Threshold + Regulariser | 49.70 | 48.45 |

Summary

CHIAR-Former achieves ~37% FLOP reduction at 400M parameters with a 3.93 PPL cost on WikiText-103 (Test PPL 27.51 vs. 23.58), using 4M fewer parameters than the full-attention baseline. The MetaRouter stabilises at g ≈ 0.22, a stable compute–quality equilibrium where spectral preprocessing and attention capacity play complementary roles at scale.


Model Examination

The routing heatmap is the primary interpretability tool. For any input text, each token is assigned to DCT Mixing (blue) or Full Attention (red) based on spectral entropy. The pattern is consistent and interpretable:

  • Function words (the, of, and, is, that) → DCT (H(x) low, smooth embeddings)
  • Content words (overwhelming, consensus, carbon, reduction) → Attention (H(x) high, complex embeddings)
  • Punctuation and numerals → mostly DCT
  • Domain-specific or rare terms → Attention

The MetaRouter gate value g is also informative across task types:

  • Naturalistic prose (Wikipedia, news) → g ≈ 0.90–0.95
  • Sentiment text (IMDB reviews) → g ≈ 0.88
  • Symbolic sequences (ListOps expressions) → g ≈ 0.05

The interactive routing heatmap demo in CHIAR-former_demo/ allows real-time exploration of these patterns on any user-supplied text.


Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator.

  • Hardware type: NVIDIA RTX A5000 (24 GB VRAM)
  • Hours used: ~40 hours total across all runs (baseline + CHIAR + ablations + mixed training)
  • Cloud provider: RunPod
  • Compute region: US (estimated)
  • Carbon emitted: ~3–5 kg CO2eq (estimated; single GPU, ~40 hrs at ~250W TDP)
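The figures in this list can be cross-checked with simple arithmetic. The sketch below derives the energy use from the stated hours and power draw, then backs out the grid carbon intensity implied by the 3–5 kg range. It assumes the GPU drew its full ~250 W for the whole 40 hours and ignores datacentre overhead; both are simplifications, not figures from the calculator.

```python
HOURS = 40.0      # total GPU hours stated above
POWER_KW = 0.250  # ~250 W, assumed constant draw

energy_kwh = HOURS * POWER_KW

# Carbon intensity (kg CO2eq per kWh) implied by each end of the reported range.
implied_intensity = [emitted_kg / energy_kwh for emitted_kg in (3.0, 5.0)]

print(f"energy: {energy_kwh:.1f} kWh")
print("implied intensity (kg CO2eq/kWh):", implied_intensity)
```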

Technical Specifications

Model Architecture and Objective

Objective: Causal language modelling (next-token prediction) with cross-entropy loss.

Architecture (v3, 400M), N=28 layers:

Token Embedding (d=1024, vocab=50257)
    no absolute positional encoding — RoPE applied inside attention

MetaRouter:  g = sigmoid( Linear(d=1024, 1)( mean_over_batch_and_seq(x) ) )
    g in [0,1]; bias initialised to 2.0; converges to g ≈ 0.22

L1:    DCTMixNoFFN -> shared FFN (4x expansion, GELU) -> LayerNorm
       output = g * DCTMix(X) + (1-g) * X

L2-L27 (per token, per layer):
       H(x) <= tau=0.8954  ->  DCTMixNoFFN
       H(x) >  tau=0.8954  ->  MultiHeadSelfAttention (16 heads, RoPE in Q/K)
       -> shared FFN (4x expansion, GELU) -> LayerNorm

L28:   MultiHeadSelfAttention (16 heads, RoPE) -> shared FFN -> LayerNorm
       (accuracy anchor — always runs full attention)

LayerNorm -> Linear(1024, 50257)   weight-tied with token embedding
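The L1 blend in the diagram above can be illustrated with a small NumPy sketch. This is a simplified stand-in, not the repository's DCTMixNoFFN: it keeps only the spectral filter iDCT(DCT(X) * w) and the MetaRouter blend, and omits the residual connection, FFN, and LayerNorm. The names `dct_matrix`, `dct_filter`, and `meta_blend` are hypothetical, and the explicit O(d^2) matrix product is used for clarity instead of a fast O(d log d) transform.

```python
import numpy as np


def dct_matrix(d: int) -> np.ndarray:
    """Orthonormal DCT-II matrix M of shape (d, d): x_hat = M @ x and x = M.T @ x_hat."""
    k = np.arange(d)[:, None]
    n = np.arange(d)[None, :]
    m = np.sqrt(2.0 / d) * np.cos(np.pi * (n + 0.5) * k / d)
    m[0, :] /= np.sqrt(2.0)
    return m


def dct_filter(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """iDCT(DCT(x) * w) along the feature axis. x: (T, d), w: (d,)."""
    m = dct_matrix(x.shape[-1])
    x_hat = x @ m.T        # forward DCT of every token vector
    return (x_hat * w) @ m  # per-frequency scaling, then inverse DCT


def meta_blend(x: np.ndarray, w: np.ndarray, g: float) -> np.ndarray:
    """Soft blend g * filtered + (1 - g) * identity, as in the L1 formula."""
    return g * dct_filter(x, w) + (1.0 - g) * x


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))            # 5 toy tokens, d = 16
low_pass = np.zeros(16)
low_pass[0] = 1.0                        # keep only the lowest frequency
smoothed = dct_filter(x, low_pass)       # each token collapses to its mean value
blended = meta_blend(x, low_pass, g=0.22)
print(smoothed[0, :3], x[0].mean())
```

Note that the soft blend always evaluates `dct_filter`, even when g is close to 0, which is the reason a hard gate is needed to realise compute savings from the bypass.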

Core equations:

  • Spectral entropy: H(x) = -1/log(d) * sum( p_i * log(p_i) ) where p_i = x_hat_i^2 / sum(x_hat_j^2)
  • DCT Mixing: DCTMix(X) = LN( X + FFN( iDCT( DCT(X) * w ) ) ) where w in R^d is learned
  • MetaRouter blend: h1 = g * DCTMix(X) + (1-g) * X
  • RoPE: rotates Q and K by position-dependent angle matrices inside every attention layer
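The spectral entropy equation and the threshold rule can be combined into a short NumPy sketch. It is an illustration under stated assumptions, not the repository's SpectralRouter: the DCT is an orthonormal DCT-II built as an explicit matrix, a small `eps` is added for numerical safety, and the function names are hypothetical. Only tau = 0.8954 and the 0=DCT, 1=Attention convention come from this card.

```python
import numpy as np

TAU = 0.8954  # routing threshold reported in this card


def dct_matrix(d: int) -> np.ndarray:
    """Orthonormal DCT-II matrix M of shape (d, d): x_hat = M @ x."""
    k = np.arange(d)[:, None]
    n = np.arange(d)[None, :]
    m = np.sqrt(2.0 / d) * np.cos(np.pi * (n + 0.5) * k / d)
    m[0, :] /= np.sqrt(2.0)
    return m


def spectral_entropy(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """H(x) = -(1 / log d) * sum_i p_i log p_i, with p_i = x_hat_i^2 / sum_j x_hat_j^2."""
    d = x.shape[-1]
    x_hat = x @ dct_matrix(d).T
    power = x_hat ** 2
    p = power / (power.sum(axis=-1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=-1) / np.log(d)


def route(x: np.ndarray, tau: float = TAU) -> np.ndarray:
    """Return 0 (DCT) where H(x) <= tau and 1 (Attention) where H(x) > tau."""
    return (spectral_entropy(x) > tau).astype(np.int64)


d = 64
smooth = np.ones((1, d))                       # constant vector: all energy at frequency 0
flat = (dct_matrix(d).T @ np.ones(d))[None, :]  # vector whose DCT spectrum is uniform
tokens = np.concatenate([smooth, flat], axis=0)

print(spectral_entropy(tokens))
print(route(tokens))
```

The two toy vectors sit at the extremes of the range: the constant vector has entropy near 0 and is routed to DCT, while the flat-spectrum vector has entropy near 1 and is routed to attention.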

Compute Infrastructure

  • Hardware: Single NVIDIA RTX A5000 (24 GB VRAM)
  • Cloud: RunPod
  • Framework: PyTorch 2.1, HuggingFace Transformers 4.38
  • Precision: fp16 mixed precision via torch.cuda.amp, fp32 weight updates
  • Tokeniser: GPT-2 BPE via HuggingFace Transformers

Citation

BibTeX:

@article{sikdar2026chiar,
  title   = {Chiaroscuro Attention: Spending Compute in the Dark},
  author  = {Sikdar, Prateek Kumar},
  journal = {arXiv preprint arXiv:2606.08327},
  year    = {2026}
}

APA:

Sikdar, P. K. (2026). Chiaroscuro Attention: Spending Compute in the Dark. arXiv preprint arXiv:2606.08327.


Glossary

Spectral entropy H(x): The normalised Shannon entropy of a token embedding's DCT power spectrum. H=0 means all energy in one frequency (maximally smooth). H=1 means energy uniformly distributed (maximally complex). Used as the per-token routing signal.

DCT Mixing: Discrete Cosine Transform spectral mixing. Applies a learned per-frequency filter w in the DCT domain, then inverts back. Complexity is O(d log d) per token, versus O(n^2 d) per sequence (O(n d) per token) for self-attention.

SpectralRouter: The per-token routing module. Computes H(x) and compares against threshold tau to assign each token to DCT or Attention. No learnable parameters in threshold mode (default).

MetaRouter: The task-level gate g in [0,1]. A learned scalar that soft-blends DCT Mixing and Identity bypass at L1, based on the mean embedding of the current batch. Learns the naturalistic/symbolic boundary from mixed-dataset training.

RoPE (Rotary Position Embedding): Encodes relative token positions by rotating Q and K vectors inside attention with position-dependent rotation matrices. Zero learnable parameters. Allows some generalisation beyond the training sequence length, although degradation is expected at much longer contexts (see Out-of-Scope Use). Used by LLaMA, Qwen, Mistral, Falcon.

Routing collapse: The observed phenomenon in v1 (three-operator: DCT + RBF + Attention) where the RBF operator drops to 0% usage during training, revealing that DCT+Attention is the sufficient operator pair.

tau (threshold): The spectral entropy routing threshold. Set to the midpoint of the 33rd and 67th entropy percentiles of the baseline model's embedding distribution on WikiText-103 validation set. At 400M scale: tau = 0.8954.



Model Card Authors

Prateek Kumar Sikdar · AI Architect, Agentic AI Practice · Accenture, Bengaluru, India

Model Card Contact

prateek.k.sikdar@accenture.com
