Steklov LLaMA 105M — Checkpoint Collection

Proof-of-concept checkpoints demonstrating Steklov activation sparsity on a LLaMA-style architecture.

Paper: Steklov Activations: Piecewise-Polynomial Gates with Compact Support and Tunable Sparsity
Code: steklov-activations (pip install steklov-activations)

What are these models?

These are 105M-parameter LLaMA-style models (12 layers, d=768, d_ff=2048, RMSNorm, RoPE, no bias) trained on OpenWebText for 25K steps. The only difference between checkpoints is the Steklov activation scale parameter α, which controls how much of the MLP is active per token.

Checkpoints

Checkpoint Activation α Per-token zeros 2:4 Compliance PPL Seeds
steklov-a2.0 SteklovSiLU 2.0 3.4% 30.88 ± 0.89 3
steklov-a0.8 SteklovSiLU 0.8 28.0% 31.3% 30.99 ± 0.88 3
steklov-learned SteklovSiLU →1.73 6.5% 30.79 ± 0.90 3
steklov-a0.1 SteklovSiLU 0.1 87.2% 98.4% 30.57 1
steklov-a0.05 SteklovSiLU 0.05 88.9% 98.9% 30.47 1
steklov-a0.01 SteklovSiLU 0.01 ~90% 99.5% ~30.5 1
steklov-a0.005 SteklovSiLU 0.005 90.2% 99.2% 30.47 1

For reference, a SiLU baseline (same architecture, no Steklov) achieves PPL 31.43 ± 0.87 with 0% activation sparsity.

Key result: The α=0.005 model has 90.2% of its MLP activations exactly zero per token, yet its perplexity (30.47, single seed) is lower than that of the dense SiLU baseline (31.43 ± 0.87, 3 seeds).

Downstream Benchmarks (single seed)

Checkpoint        ARC-E   HellaSwag   LAMBADA   PIQA    WinoGrande   Mean
SiLU baseline*    35.61   26.28       19.31     57.34   49.80        37.67
steklov-a2.0      36.78   26.63       20.51     57.78   50.36        38.41
steklov-a0.8      36.24   26.45       17.98     56.58   50.20        37.49
steklov-learned   35.31   26.24       20.43     57.73   52.57        38.46
steklov-a0.1      35.65   26.30       18.52     56.69   49.96        37.42
steklov-a0.05     36.32   26.33       18.86     57.18   52.17        38.17
steklov-a0.01     36.24   26.29       18.55     56.96   49.57        37.52
steklov-a0.005    35.52   26.64       19.15     56.58   52.09        38.00

*SiLU baseline not included in this repo (standard LLaMA with SiLU activation).

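Downstream accuracy shows no consistent degradation: mean scores range from 37.42 to 38.46 against 37.67 for the SiLU baseline, including at 89–90% per-token activation sparsity.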

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "masalskikh/steklov-llama-105m"

# Load the α=0.05 checkpoint (89% sparse, beats SiLU)
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="steklov-a0.05", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Generate text
model.eval()
input_ids = tokenizer.encode("The future of artificial intelligence is", return_tensors="pt")
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids).logits[:, -1, :]
        next_token = torch.multinomial(torch.softmax(logits / 0.8, dim=-1), 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
print(tokenizer.decode(input_ids[0]))

# Check sparsity: count exact zeros in MLP activations
# (see steklov_llama.py get_sparsity_stats() for full profiling)
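
The sparsity check mentioned in the comment above can be approximated without the repo's own profiler. The following is a minimal sketch, not the get_sparsity_stats() implementation: it assumes the activation is registered as a submodule whose name ends in act_fn (as the Architecture section suggests) and counts exact zeros in its output with forward hooks. The helper name attach_zero_counters and the toy model are hypothetical, and ReLU stands in for SteklovSiLU only so the example runs without downloading a checkpoint.

```python
from collections import OrderedDict

import torch
from torch import nn


def attach_zero_counters(model: nn.Module, name_suffix: str = "act_fn"):
    """Count exact zeros in the outputs of modules whose name ends with name_suffix.

    Hypothetical helper: assumes the activation is a registered submodule.
    Returns (stats, handles); call handle.remove() on each handle when done.
    """
    stats, handles = {}, []
    for name, module in model.named_modules():
        if not name.endswith(name_suffix):
            continue
        stats[name] = {"zeros": 0, "total": 0}

        def hook(_module, _inputs, output, key=name):
            # Exact comparison with 0, not a small-magnitude threshold.
            stats[key]["zeros"] += int((output == 0).sum())
            stats[key]["total"] += output.numel()

        handles.append(module.register_forward_hook(hook))
    return stats, handles


# Toy stand-in for one MLP block; ReLU replaces SteklovSiLU purely for illustration.
toy = nn.Sequential(OrderedDict([
    ("up_proj", nn.Linear(8, 16, bias=False)),
    ("act_fn", nn.ReLU()),
    ("down_proj", nn.Linear(16, 8, bias=False)),
]))

stats, handles = attach_zero_counters(toy)
with torch.no_grad():
    toy(torch.randn(3, 8))
for handle in handles:
    handle.remove()

for name, s in stats.items():
    print(f"{name}: {s['zeros'] / s['total']:.1%} exact zeros")
```

On a loaded checkpoint the same call would be attach_zero_counters(model) followed by a forward pass. If the custom model code applies the activation as a plain function rather than a submodule, no hooks fire and stats stays empty.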

Architecture

LlamaForCausalLM(
  embed_tokens: Embedding(50257, 768)
  layers: 12 × LlamaDecoderLayer(
    self_attn: LlamaAttention(768, 12 heads)
    mlp: LlamaMLP(
      up_proj: Linear(768 → 2048)
      act_fn: SteklovSiLU(alpha=α, order=3)
      down_proj: Linear(2048 → 768)
    )
    input_layernorm: LlamaRMSNorm(768)
    post_attention_layernorm: LlamaRMSNorm(768)
  )
)
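
As a sanity check on the 105M figure, the layer shapes above can be tallied by hand. This is a back-of-the-envelope sketch that assumes the output head shares weights with embed_tokens, that attention uses four bias-free 768 × 768 projections, and that a final RMSNorm exists; none of these is shown in the printout.

```python
vocab, d_model, d_ff, n_layers = 50257, 768, 2048, 12

embed = vocab * d_model
attn = 4 * d_model * d_model      # q, k, v, o projections, no bias (assumed full multi-head attention)
mlp = 2 * d_model * d_ff          # up_proj and down_proj only, as listed above
norms = 2 * d_model               # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms
final_norm = d_model              # assumed, not shown in the printout

total = embed + n_layers * per_layer + final_norm
print(f"embedding: {embed:,}")
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the tally is 104,676,864 parameters, which rounds to 105M. An untied output head would add another 38,597,376.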

Training details

  • Data: OpenWebText (2B tokens, deduplicated)
  • Steps: 25,000
  • Batch size: 8 × 4 grad_accum × 1024 tokens = 32K tokens/step
  • Optimizer: AdamW (lr=3e-4, β₁=0.9, β₂=0.95, wd=0.1)
  • Schedule: Cosine decay with 2,000 warmup steps
  • Hardware: 1× RTX 5090 (multi-seed runs) / RTX 4090 (single seed)
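
The batch and schedule settings above can be made concrete with a few lines of arithmetic. The sketch below assumes linear warmup and cosine decay to zero by the final step; the list does not state the warmup shape or the minimum learning rate, so lr_at is illustrative rather than the training script's scheduler.

```python
import math

MICRO_BATCH, GRAD_ACCUM, SEQ_LEN = 8, 4, 1024
TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 25_000, 2_000, 3e-4

tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN
tokens_seen = tokens_per_step * TOTAL_STEPS


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step (assumed linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens seen:     {tokens_seen:,}")
for step in (0, 1_000, 2_000, 13_500, 25_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

That works out to 32,768 tokens per step and about 0.82B tokens over 25,000 steps, which is less than one pass over the 2B-token dataset.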

Intended use

These checkpoints are proof-of-concept models for reproducing the paper's claims. They are not intended for production use. The 105M parameter count is too small for practical applications. Their value is in verifying:

  1. Steklov activations produce exact zeros (profile the model yourself)
  2. The sparsity is tunable via α
  3. Quality is maintained at high sparsity
  4. The 2:4 N:M compliance numbers are reproducible
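
Item 4 can be checked with a few lines once the activations have been extracted. The sketch below uses one plausible definition of 2:4 compliance, which is an assumption here and may differ from the paper's metric: the fraction of contiguous groups of four values that contain at most two nonzeros. The function name nm_compliance is hypothetical.

```python
from typing import Sequence


def nm_compliance(values: Sequence[float], n: int = 2, m: int = 4) -> float:
    """Fraction of contiguous groups of m values with at most n nonzero entries."""
    if len(values) == 0 or len(values) % m != 0:
        raise ValueError(f"length must be a positive multiple of {m}")
    groups = [values[i:i + m] for i in range(0, len(values), m)]
    compliant = sum(1 for g in groups if sum(v != 0 for v in g) <= n)
    return compliant / len(groups)


# One activation row of width 12, split into three groups of four.
row = [0.0, 0.7, 0.0, 1.2,   # 2 nonzeros: compliant
       0.3, 0.1, 0.0, 0.9,   # 3 nonzeros: not compliant
       0.0, 0.0, 0.0, 0.0]   # 0 nonzeros: compliant
print(nm_compliance(row))
```

Two of the three groups in the example row qualify, so the function returns two thirds. For real activations, each row of the d_ff=2048 hidden vector would be flattened to a list and passed in the same way.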

Limitations

  • 105M parameters (too small for practical use)
  • Single-seed runs for α ≤ 0.1
  • Trained for only 25K steps
  • N:M sparse tensor core kernel is slower than dense at this scale
  • Post-hoc activation swap does NOT work; must train from scratch

Citation

@article{masalskikh2026steklov,
  author  = {Masalskikh, A.},
  title   = {Steklov Activations: Piecewise-Polynomial Gates with Compact Support and Tunable Sparsity},
  journal = {Zenodo},
  year    = {2026},
  doi     = {10.5281/zenodo.19454642},
  url     = {https://doi.org/10.5281/zenodo.19454642}
}