Steklov LLaMA 105M — Checkpoint Collection

Proof-of-concept checkpoints demonstrating Steklov activation sparsity on a LLaMA-style architecture.

Paper: Steklov Activations: Piecewise-Polynomial Gates with Compact Support and Tunable Sparsity
Code: steklov-activations (pip install steklov-activations)

What are these models?

These are 105M-parameter LLaMA-style models (12 layers, d=768, d_ff=2048, RMSNorm, RoPE, no bias) trained on OpenWebText for 25K steps. The only difference between checkpoints is the Steklov activation scale parameter α, which controls what fraction of the MLP activations is nonzero for each token. In the table below, smaller α gives more exact zeros.

Checkpoints

Checkpoint	Activation	α	Per-token zeros	2:4 Compliance	PPL	Seeds
`steklov-a2.0`	SteklovSiLU	2.0	3.4%	—	30.88 ± 0.89	3
`steklov-a0.8`	SteklovSiLU	0.8	28.0%	31.3%	30.99 ± 0.88	3
`steklov-learned`	SteklovSiLU	→1.73	6.5%	—	30.79 ± 0.90	3
`steklov-a0.1`	SteklovSiLU	0.1	87.2%	98.4%	30.57	1
`steklov-a0.05`	SteklovSiLU	0.05	88.9%	98.9%	30.47	1
`steklov-a0.01`	SteklovSiLU	0.01	~90%	99.5%	~30.5	1
`steklov-a0.005`	SteklovSiLU	0.005	90.2%	99.2%	30.47	1

For reference, a SiLU baseline (same architecture, no Steklov) achieves PPL 31.43 ± 0.87 with 0% activation sparsity.

Key result: The α=0.005 model (single seed) has 90.2% of its MLP activations exactly zero per token, yet its perplexity (30.47) is lower than that of the dense SiLU baseline (31.43 ± 0.87).

Downstream Benchmarks (single seed)

Checkpoint	ARC-E	HellaSwag	LAMBADA	PIQA	WinoGrande	Mean
SiLU baseline*	35.61	26.28	19.31	57.34	49.80	37.67
steklov-a2.0	36.78	26.63	20.51	57.78	50.36	38.41
steklov-a0.8	36.24	26.45	17.98	56.58	50.20	37.49
steklov-learned	35.31	26.24	20.43	57.73	52.57	38.46
steklov-a0.1	35.65	26.30	18.52	56.69	49.96	37.42
steklov-a0.05	36.32	26.33	18.86	57.18	52.17	38.17
steklov-a0.01	36.24	26.29	18.55	56.96	49.57	37.52
steklov-a0.005	35.52	26.64	19.15	56.58	52.09	38.00

*SiLU baseline not included in this repo (standard LLaMA with SiLU activation).

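Mean scores range from 37.42 to 38.46, against 37.67 for the SiLU baseline, so there is no systematic downstream degradation, even at 89–90% per-token activation sparsity. Individual tasks move in both directions (LAMBADA is lower for most of the sparse checkpoints, WinoGrande is higher for several), and all of these numbers come from a single seed.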
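The Mean column and the gap to the SiLU baseline can be recomputed from the per-task scores. The snippet below is a small sanity check that copies the numbers from the table above; the variable names are illustrative only.

```python
# Benchmark accuracies copied from the table above, in the order
# ARC-E, HellaSwag, LAMBADA, PIQA, WinoGrande.
scores = {
    "SiLU baseline":   [35.61, 26.28, 19.31, 57.34, 49.80],
    "steklov-a2.0":    [36.78, 26.63, 20.51, 57.78, 50.36],
    "steklov-a0.8":    [36.24, 26.45, 17.98, 56.58, 50.20],
    "steklov-learned": [35.31, 26.24, 20.43, 57.73, 52.57],
    "steklov-a0.1":    [35.65, 26.30, 18.52, 56.69, 49.96],
    "steklov-a0.05":   [36.32, 26.33, 18.86, 57.18, 52.17],
    "steklov-a0.01":   [36.24, 26.29, 18.55, 56.96, 49.57],
    "steklov-a0.005":  [35.52, 26.64, 19.15, 56.58, 52.09],
}

# Unweighted mean over the five tasks for each checkpoint.
means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
baseline = means["SiLU baseline"]

# Difference of each Steklov checkpoint's mean from the SiLU baseline.
deltas = {
    name: m - baseline for name, m in means.items() if name != "SiLU baseline"
}

for name, delta in deltas.items():
    print(f"{name:16s} mean={means[name]:.2f}  delta vs SiLU={delta:+.2f}")
```

With single-seed runs, differences of a few tenths of a point in either direction should be read as noise rather than as a ranking.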

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "masalskikh/steklov-llama-105m"

# Load the α=0.05 checkpoint (89% sparse, beats SiLU)
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="steklov-a0.05", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Generate text
model.eval()
input_ids = tokenizer.encode("The future of artificial intelligence is", return_tensors="pt")
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids).logits[:, -1, :]
        next_token = torch.multinomial(torch.softmax(logits / 0.8, dim=-1), 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
print(tokenizer.decode(input_ids[0]))

# Check sparsity: count exact zeros in MLP activations
# (see steklov_llama.py get_sparsity_stats() for full profiling)
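One way to count exact zeros without relying on the repo's own profiler is a forward hook on each activation module. The sketch below is an independent illustration, not the authors' `get_sparsity_stats()`: `ToyMLP` and `measure_zero_fraction` are hypothetical names, and `nn.ReLU` is used purely as a stand-in that also emits exact zeros. It assumes the activation modules are registered under the attribute name `act_fn`, as the architecture listing suggests.

```python
import torch
import torch.nn as nn


class ToyMLP(nn.Module):
    """Stand-in for one MLP block: up_proj -> act_fn -> down_proj."""

    def __init__(self, d_model=768, d_ff=2048):
        super().__init__()
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.act_fn = nn.ReLU()  # placeholder only, NOT SteklovSiLU
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(self.act_fn(self.up_proj(x)))


def measure_zero_fraction(model, inputs, act_name="act_fn"):
    """Return {module_name: fraction of exactly-zero outputs} per activation."""
    zeros, totals, handles = {}, {}, []

    def make_hook(name):
        def hook(_module, _inputs, out):
            zeros[name] = zeros.get(name, 0) + (out == 0).sum().item()
            totals[name] = totals.get(name, 0) + out.numel()
        return hook

    # Attach a hook to every submodule whose attribute name matches act_name.
    for name, module in model.named_modules():
        if name.split(".")[-1] == act_name:
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        for handle in handles:
            handle.remove()  # always detach hooks, even on error
    return {name: zeros[name] / totals[name] for name in zeros}


torch.manual_seed(0)
toy = nn.Sequential(ToyMLP(), ToyMLP())
stats = measure_zero_fraction(toy, torch.randn(2, 16, 768))
for name, frac in stats.items():
    print(f"{name}: {frac:.1%} exact zeros")
```

If the module naming assumption holds for the remote code, calling `measure_zero_fraction(model, input_ids)` on a loaded checkpoint should give one zero fraction per layer, which can be compared against the "Per-token zeros" column.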

Architecture

LlamaForCausalLM(
  embed_tokens: Embedding(50257, 768)
  layers: 12 × LlamaDecoderLayer(
    self_attn: LlamaAttention(768, 12 heads)
    mlp: LlamaMLP(
      up_proj: Linear(768 → 2048)
      act_fn: SteklovSiLU(alpha=α, order=3)
      down_proj: Linear(2048 → 768)
    )
    input_layernorm: LlamaRMSNorm(768)
    post_attention_layernorm: LlamaRMSNorm(768)
  )
)
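The 105M figure can be checked against the listing with a few lines of arithmetic. The sketch below counts only the modules shown above. It assumes standard multi-head attention with four square projections (q, k, v, o) and an output head that shares weights with `embed_tokens`, neither of which is stated explicitly on this card. It ignores the final norm and any learned α parameters, which are negligible at this scale.

```python
# Dimensions taken from the architecture listing above.
vocab, d_model, d_ff, n_layers = 50257, 768, 2048, 12

embed = vocab * d_model        # token embedding table
attn = 4 * d_model * d_model   # q, k, v, o projections, no bias (assumed layout)
mlp = 2 * d_model * d_ff       # up_proj and down_proj, no bias, no gate_proj
norms = 2 * d_model            # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

# Assumption: lm_head is tied to embed_tokens, so it adds no parameters.
total_tied = embed + n_layers * per_layer
# For comparison: a separate, untied output head of the same shape.
total_untied = total_tied + vocab * d_model

print(f"per layer: {per_layer:,}")
print(f"tied:      {total_tied / 1e6:.1f}M")
print(f"untied:    {total_untied / 1e6:.1f}M")
```

Under these assumptions the tied count comes to roughly 104.7M, which matches the stated 105M, while an untied head would push the total above 140M.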

Training details

Data: OpenWebText (2B tokens, deduplicated)
Steps: 25,000
Batch size: 8 × 4 grad_accum × 1024 tokens = 32K tokens/step
Optimizer: AdamW (lr=3e-4, β₁=0.9, β₂=0.95, wd=0.1)
Schedule: Cosine decay with 2,000 warmup steps
Hardware: 1× RTX 5090 (multi-seed runs) / RTX 4090 (single seed)
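The token budget and learning-rate schedule implied by these settings can be sketched as follows. The numbers come from the list above; the linear warmup shape, the warmup being counted inside the 25,000 steps, and the decay floor `min_lr=0.0` are assumptions, since the card does not state them.

```python
import math

micro_batch, grad_accum, seq_len = 8, 4, 1024
total_steps, warmup_steps, peak_lr = 25_000, 2_000, 3e-4
dataset_tokens = 2e9  # "2B tokens" from the Data line

tokens_per_step = micro_batch * grad_accum * seq_len  # the "32K tokens/step"
total_tokens = tokens_per_step * total_steps
passes_over_data = total_tokens / dataset_tokens


def lr_at(step, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (min_lr assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"tokens/step: {tokens_per_step:,}")
print(f"total tokens: {total_tokens:,}")
print(f"passes over the data: {passes_over_data:.2f}")
for step in (0, 1_000, 2_000, 13_500, 25_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

On these figures the run sees about 0.82B tokens, which is less than one pass over the 2B-token dataset.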

Intended use

These checkpoints are proof-of-concept models for reproducing the paper's claims. They are not intended for production use. The 105M parameter count is too small for practical applications. Their value is in verifying:

Steklov activations produce exact zeros (profile the model yourself)
The sparsity is tunable via α
Quality is maintained at high sparsity
The 2:4 N:M compliance numbers are reproducible
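To make the two sparsity metrics concrete, the sketch below computes them on a hand-made activation vector. The card does not define "2:4 compliance", so this assumes it means the fraction of consecutive groups of four activations that contain at least two exact zeros, which is the pattern 2:4 structured-sparsity hardware expects. All function names are illustrative.

```python
def per_token_zero_fraction(activations):
    """Fraction of exactly-zero entries in one token's activation vector."""
    return sum(1 for a in activations if a == 0.0) / len(activations)


def two_four_compliance(activations):
    """Fraction of consecutive groups of 4 holding at least 2 exact zeros."""
    if len(activations) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    groups = [activations[i:i + 4] for i in range(0, len(activations), 4)]
    compliant = sum(1 for g in groups if sum(1 for a in g if a == 0.0) >= 2)
    return compliant / len(groups)


def expected_compliance(p):
    """Expected 2:4 compliance if each entry is zero independently with prob. p."""
    q = 1.0 - p
    # 1 - P(no zeros in the group) - P(exactly one zero in the group)
    return 1.0 - q**4 - 4 * p * q**3


# Hand-made example vector (not model output): 3 groups of 4.
acts = [0.0, 0.0, 0.3, 0.0,   # 3 zeros -> compliant
        0.1, 0.0, 0.2, 0.4,   # 1 zero  -> not compliant
        0.0, 0.5, 0.0, 0.7]   # 2 zeros -> compliant

print(per_token_zero_fraction(acts))
print(two_four_compliance(acts))
print(f"{expected_compliance(0.28):.3f}")
```

As a consistency check, independent zeros at the 28.0% rate reported for `steklov-a0.8` would give an expected compliance of about 31.3%, the same value as in the checkpoint table. That agreement supports this reading of the metric but does not confirm it.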

Limitations

105M parameters (too small for practical use)
Single-seed runs for α ≤ 0.1
Trained for only 25K steps
N:M sparse tensor core kernel is slower than dense at this scale
Swapping a Steklov activation into an already-trained model post hoc does NOT work; models must be trained from scratch with it

Citation

@article{masalskikh2026steklov,
  author  = {Masalskikh, A.},
  title   = {Steklov Activations: Piecewise-Polynomial Gates with Compact Support and Tunable Sparsity},
  journal = {Zenodo},
  year    = {2026},
  doi     = {10.5281/zenodo.19454642},
  url     = {https://doi.org/10.5281/zenodo.19454642}
}
