Steklov LLaMA 105M — Checkpoint Collection
Proof-of-concept checkpoints demonstrating Steklov activation sparsity on a LLaMA-style architecture.
Paper: Steklov Activations: Piecewise-Polynomial Gates with Compact Support and Tunable Sparsity
Code: steklov-activations (pip install steklov-activations)
What are these models?
These are 105M-parameter LLaMA-style models (12 layers, d=768, d_ff=2048, RMSNorm, RoPE, no bias) trained on OpenWebText for 25K steps. The only difference between checkpoints is the Steklov activation scale parameter α, which controls how much of the MLP is active per token.
Checkpoints
| Checkpoint | Activation | α | Per-token zeros | 2:4 Compliance | PPL | Seeds |
|---|---|---|---|---|---|---|
| steklov-a2.0 | SteklovSiLU | 2.0 | 3.4% | n/a | 30.88 ± 0.89 | 3 |
| steklov-a0.8 | SteklovSiLU | 0.8 | 28.0% | 31.3% | 30.99 ± 0.88 | 3 |
| steklov-learned | SteklovSiLU | →1.73 | 6.5% | n/a | 30.79 ± 0.90 | 3 |
| steklov-a0.1 | SteklovSiLU | 0.1 | 87.2% | 98.4% | 30.57 | 1 |
| steklov-a0.05 | SteklovSiLU | 0.05 | 88.9% | 98.9% | 30.47 | 1 |
| steklov-a0.01 | SteklovSiLU | 0.01 | ~90% | 99.5% | ~30.5 | 1 |
| steklov-a0.005 | SteklovSiLU | 0.005 | 90.2% | 99.2% | 30.47 | 1 |
For reference, a SiLU baseline (same architecture, no Steklov) achieves PPL 31.43 ± 0.87 with 0% activation sparsity.
Key result: The α=0.005 model has about 90% of its MLP activations exactly zero per token (90.2% in the table above), yet its perplexity (30.47, single seed) is lower than that of the dense SiLU baseline (31.43 ± 0.87).
Downstream Benchmarks (single seed)
| Checkpoint | ARC-E | HellaSwag | LAMBADA | PIQA | WinoGrande | Mean |
|---|---|---|---|---|---|---|
| SiLU baseline* | 35.61 | 26.28 | 19.31 | 57.34 | 49.80 | 37.67 |
| steklov-a2.0 | 36.78 | 26.63 | 20.51 | 57.78 | 50.36 | 38.41 |
| steklov-a0.8 | 36.24 | 26.45 | 17.98 | 56.58 | 50.20 | 37.49 |
| steklov-learned | 35.31 | 26.24 | 20.43 | 57.73 | 52.57 | 38.46 |
| steklov-a0.1 | 35.65 | 26.30 | 18.52 | 56.69 | 49.96 | 37.42 |
| steklov-a0.05 | 36.32 | 26.33 | 18.86 | 57.18 | 52.17 | 38.17 |
| steklov-a0.01 | 36.24 | 26.29 | 18.55 | 56.96 | 49.57 | 37.52 |
| steklov-a0.005 | 35.52 | 26.64 | 19.15 | 56.58 | 52.09 | 38.00 |
*SiLU baseline not included in this repo (standard LLaMA with SiLU activation).
No downstream degradation: mean accuracy stays within about 0.8 points of the SiLU baseline (37.42 to 38.46 versus 37.67), even at 89–90% per-token activation sparsity.
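As a quick sanity check, the Mean column can be recomputed from the five per-task scores. The sketch below copies the numbers from the table above and assumes the reported mean is the unweighted average of the five tasks, rounded to two decimals.

```python
# Per-task accuracies copied from the table: ARC-E, HellaSwag, LAMBADA, PIQA, WinoGrande.
scores = {
    "SiLU baseline":   [35.61, 26.28, 19.31, 57.34, 49.80],
    "steklov-a2.0":    [36.78, 26.63, 20.51, 57.78, 50.36],
    "steklov-a0.8":    [36.24, 26.45, 17.98, 56.58, 50.20],
    "steklov-learned": [35.31, 26.24, 20.43, 57.73, 52.57],
    "steklov-a0.1":    [35.65, 26.30, 18.52, 56.69, 49.96],
    "steklov-a0.05":   [36.32, 26.33, 18.86, 57.18, 52.17],
    "steklov-a0.01":   [36.24, 26.29, 18.55, 56.96, 49.57],
    "steklov-a0.005":  [35.52, 26.64, 19.15, 56.58, 52.09],
}

# Unweighted mean per checkpoint (assumed to be how the Mean column was built).
means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Difference of each Steklov checkpoint from the SiLU baseline mean.
baseline = means["SiLU baseline"]
deltas = {name: m - baseline for name, m in means.items() if name != "SiLU baseline"}

for name, m in means.items():
    print(f"{name:16s} mean={m:.2f}")
```

All deltas fall between roughly -0.25 and +0.79 points, which is the spread the summary sentence refers to.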
How to use
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "masalskikh/steklov-llama-105m"

# Load the α=0.05 checkpoint (89% sparse, beats SiLU)
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="steklov-a0.05", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Generate text
model.eval()
input_ids = tokenizer.encode("The future of artificial intelligence is", return_tensors="pt")
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids).logits[:, -1, :]
        next_token = torch.multinomial(torch.softmax(logits / 0.8, dim=-1), 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
print(tokenizer.decode(input_ids[0]))

# Check sparsity: count exact zeros in MLP activations
# (see steklov_llama.py get_sparsity_stats() for full profiling)
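For a rough independent check of the zero counts, forward hooks on the activation modules are one option. The sketch below is not the repo's `get_sparsity_stats()`; `attach_zero_counters` is a hypothetical helper, and a small ReLU MLP stands in for the real model so the example runs without downloading a checkpoint.

```python
import torch
from torch import nn


def attach_zero_counters(model: nn.Module, act_types: tuple) -> tuple[dict, list]:
    """Register forward hooks that tally exact zeros in activation outputs.

    Hypothetical helper for illustration; returns the running stats dict
    and the hook handles so the caller can remove the hooks afterwards.
    """
    stats = {"zeros": 0, "total": 0}
    handles = []

    def hook(_module, _inputs, output):
        # Count entries that are exactly 0.0, not merely small.
        stats["zeros"] += int((output == 0).sum())
        stats["total"] += output.numel()

    for module in model.modules():
        if isinstance(module, act_types):
            handles.append(module.register_forward_hook(hook))
    return stats, handles


# Toy stand-in with the card's MLP widths; ReLU is used only because it
# also yields exact zeros. It is NOT the Steklov activation.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(768, 2048), nn.ReLU(), nn.Linear(2048, 768))

stats, handles = attach_zero_counters(toy, (nn.ReLU,))
with torch.no_grad():
    toy(torch.randn(4, 16, 768))  # (batch, tokens, d_model)
for handle in handles:
    handle.remove()

zero_fraction = stats["zeros"] / stats["total"]
print(f"exact-zero fraction: {zero_fraction:.3f}")
```

To apply this to a loaded checkpoint, one would pass the model and the class of its `act_fn` modules (listed as `SteklovSiLU` in the architecture summary), for example `type(some_layer.mlp.act_fn)`. The attribute path is an assumption about the remote code's layout.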
Architecture
LlamaForCausalLM(
  embed_tokens: Embedding(50257, 768)
  layers: 12 × LlamaDecoderLayer(
    self_attn: LlamaAttention(768, 12 heads)
    mlp: LlamaMLP(
      up_proj: Linear(768 → 2048)
      act_fn: SteklovSiLU(alpha=α, order=3)
      down_proj: Linear(2048 → 768)
    )
    input_layernorm: LlamaRMSNorm(768)
    post_attention_layernorm: LlamaRMSNorm(768)
  )
)
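The 105M figure can be roughly reproduced from the layer shapes listed above. The sketch below is a back-of-the-envelope count, not output from the actual checkpoint. It assumes full multi-head attention with four 768×768 projections, an MLP with only the two listed projections, output embeddings tied to `embed_tokens`, and one final RMSNorm. The last two are not shown in the summary and are assumptions.

```python
# Shapes taken from the architecture summary.
vocab, d_model, d_ff, n_layers = 50257, 768, 2048, 12

embed = vocab * d_model              # token embedding table
attn = 4 * d_model * d_model         # q, k, v, o projections, no bias (assumes no GQA)
mlp = 2 * d_model * d_ff             # up_proj and down_proj as listed
norms = 2 * d_model                  # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

final_norm = d_model                 # assumption: one RMSNorm before the output head
# Assumption: the output head shares weights with embed_tokens, so it adds nothing.
total = embed + n_layers * per_layer + final_norm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,} (~{total / 1e6:.1f}M)")
```

Under these assumptions the count lands at about 104.7M, consistent with the stated 105M. An untied output head or a third (gate) projection in the MLP would push the total well above that.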
Training details
- Data: OpenWebText (2B tokens, deduplicated)
- Steps: 25,000
- Batch size: 8 × 4 grad_accum × 1024 tokens = 32K tokens/step
- Optimizer: AdamW (lr=3e-4, β₁=0.9, β₂=0.95, wd=0.1)
- Schedule: Cosine decay with 2,000 warmup steps
- Hardware: 1× RTX 5090 (multi-seed runs) / RTX 4090 (single seed)
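The figures in this list imply a total token budget, which can be worked out as below. This is plain arithmetic on the listed values, assuming "32K" means 32,768 and "2B tokens" means 2×10⁹.

```python
# Values from the training details list.
micro_batch, grad_accum, seq_len = 8, 4, 1024
steps, warmup_steps = 25_000, 2_000
dataset_tokens = 2_000_000_000  # assumption: "2B tokens" read as exactly 2e9

tokens_per_step = micro_batch * grad_accum * seq_len  # sequences per step × tokens each
total_tokens = tokens_per_step * steps
epochs = total_tokens / dataset_tokens
warmup_fraction = warmup_steps / steps

print(f"tokens/step:  {tokens_per_step:,}")
print(f"total tokens: {total_tokens:,}")
print(f"epochs over the dataset: {epochs:.2f}")
print(f"warmup share of schedule: {warmup_fraction:.0%}")
```

On these numbers each run sees about 0.82B tokens, which is less than half a pass over the dataset.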
Intended use
These checkpoints are proof-of-concept models for reproducing the paper's claims. They are not intended for production use. The 105M parameter count is too small for practical applications. Their value is in verifying:
- Steklov activations produce exact zeros (profile the model yourself)
- The sparsity is tunable via α
- Quality is maintained at high sparsity
- The 2:4 N:M compliance numbers are reproducible
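The checks listed above can be prototyped on a single activation row before profiling the full model. The sketch below assumes "2:4 compliance" means the fraction of contiguous groups of four values that contain at least two exact zeros; the paper's definition may differ. `two_four_compliance` and the sample row are hypothetical.

```python
def two_four_compliance(activations: list[float]) -> float:
    """Fraction of contiguous groups of 4 values with at least 2 exact zeros.

    Assumed definition for illustration; check the paper for the exact metric.
    """
    if not activations or len(activations) % 4 != 0:
        raise ValueError("length must be a positive multiple of 4")
    groups = [activations[i:i + 4] for i in range(0, len(activations), 4)]
    compliant = sum(1 for group in groups if sum(v == 0.0 for v in group) >= 2)
    return compliant / len(groups)


# Hypothetical activation row for one token (made-up values, not model output).
row = [
    0.0, 0.0, 0.3, 0.0,   # 3 zeros: compliant
    0.0, 1.2, 0.0, 0.0,   # 3 zeros: compliant
    0.5, 0.1, 0.0, 0.7,   # 1 zero: not compliant
    0.0, 0.0, 0.0, 0.0,   # 4 zeros: compliant
]

zero_fraction = sum(v == 0.0 for v in row) / len(row)
compliance = two_four_compliance(row)
print(f"per-token zeros: {zero_fraction:.1%}, 2:4 compliance: {compliance:.1%}")
```

The two metrics measure different things: compliance can be well above the raw zero fraction, because a group only needs two of its four entries to be zero.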
Limitations
- 105M parameters (too small for practical use)
- Single-seed runs for α ≤ 0.1
- Trained for only 25K steps
- N:M sparse tensor core kernel is slower than dense at this scale
- Post-hoc activation swap does NOT work; must train from scratch
Citation
@article{masalskikh2026steklov,
  author  = {Masalskikh, A.},
  title   = {Steklov Activations: Piecewise-Polynomial Gates with Compact Support and Tunable Sparsity},
  journal = {Zenodo},
  year    = {2026},
  doi     = {10.5281/zenodo.19454642},
  url     = {https://doi.org/10.5281/zenodo.19454642}
}