OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1 — Model Card
To use this model, first install flash-linear-attention==0.4.1 (for example, `pip install flash-linear-attention==0.4.1`).
⚠️ This is a Proof-of-Concept (v0.1) — NO performance guarantees
This release exists for one reason only: to verify whether a triple-hybrid stack of SWA + Linear (RWKV-7) + TICA (Tiny Infused Causal Attention) is structurally viable when distilled from a Gemma-4 SWA-hybrid teacher.
It is not optimized for downstream quality, and no benchmark results are claimed. Please treat it as a research artifact for architecture validation, not a usable chatbot or assistant model.
Model Summary
RWKV-Gemma-4-5B-E2B-Preview-v0.1 is an experimental linearization of Gemma-4 (5B / E2B effective) in which the model's Full (global) MQA attention layers are replaced with a RWKV-7 + TICA path, while the original Sliding-Window Attention (SWA) layers are inherited and preserved as-is from the teacher.
The resulting network is therefore a triple-hybrid:
| Slot in original Gemma-4 | Replacement in this model |
|---|---|
| SWA layer (local, 512-token window) | SWA — kept (inherited from teacher) |
| Full / Global MQA layer | RWKV-7 (linear) + TICA (tiny SDPA) — new |
This is the first OpenMOSE release that targets Gemma-4's hybrid SWA-Full attention pattern rather than a uniform Transformer stack, and it serves as the architectural proving ground for the line of work pursued in RWKV-GLM-4.7-Flash (full → linear) extended to SWA-hybrid teachers.
Why This Model Exists
Modern hybrid attention models — Gemma-3, Gemma-4, and similar — alternate between local SWA and global Full attention. The Full layers are hypothesized to act as the model's induction heads and long-range retrievers, while SWA layers handle local pattern matching.
If we can:
- Keep SWA layers untouched (they are already efficient — O(N · W) with a fixed window W = 512), and
- Replace only the Full / Global layers with a linear core (RWKV-7) plus a small auxiliary attention path (TICA),
…then we can drastically reduce the global-attention KV cache and FLOPs without disrupting the SWA pathway that the model relies on for short-range behavior.
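To make the O(N · W) versus full-attention comparison concrete, the sketch below counts causal query-key pairs for a single layer. It is only an illustration: the function names are hypothetical, the window is assumed to include the current token, and heads, head dimension, and kernel-level details are ignored.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Causal full attention: token i attends to tokens 0..i."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens: int, window: int = 512) -> int:
    """Causal sliding window: each token attends to at most `window` tokens, itself included."""
    if n_tokens <= window:
        return full_attention_pairs(n_tokens)
    # The first `window` tokens see a growing prefix; every later token sees exactly `window`.
    return full_attention_pairs(window) + (n_tokens - window) * window


if __name__ == "__main__":
    for n in (512, 8_192, 131_072):
        full = full_attention_pairs(n)
        swa = sliding_window_pairs(n)
        print(f"N={n:>7}: full={full:>14,}  swa={swa:>12,}  ratio={full / swa:.1f}x")
```

The gap grows linearly with N, which is why only the Full / Global slots are targeted for replacement.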
The single question this v0.1 release attempts to answer is:
Does SWA + Linear (RWKV-7) + TICA actually train and run end-to-end on a Gemma-4 base?
Quality, benchmarks, and long-context performance are explicitly out of scope for this preview.
Key Facts
| Item | Value |
|---|---|
| Name | OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1 |
| Base model | google/gemma-4-E2B-it (Gemma-4 5B with E2B effective parameters) |
| Total layers | 35 |
| SWA layers (inherited) | 28 |
| RWKV-7 + TICA layers (new) | 7 |
| Sliding window | 512 tokens |
| Hidden size | 1536 |
| Intermediate size (MLP) | 6144 (double-wide MLP enabled) |
| Vocab size | 262,144 (Gemma multimodal tokenizer) |
| Max position embeddings | 131,072 |
| Final logit softcapping | 30.0 (preserved from Gemma-4) |
| dtype | bfloat16 |
| Status | Experimental proof-of-concept (v0.1) |
Architecture
Layer composition (35 layers total)
The original Gemma-4 SWA/Full pattern is preserved structurally, with each Full layer slot replaced by a RWKV-7 + TICA block:
[SWA, SWA, SWA, SWA, RWKV+TICA] × 7
Concretely, layers [4, 9, 14, 19, 24, 29, 34] are RWKV-7 layers, and every
single one of them carries a TICA module. The remaining 28 layers are SWA
inherited verbatim from the teacher.
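The layer indices above follow directly from the repeating pattern. A minimal sketch (the variable names are illustrative and not taken from the model's config):

```python
NUM_LAYERS = 35
BLOCK = ["SWA", "SWA", "SWA", "SWA", "RWKV+TICA"]

# Repeat the 5-layer block 7 times to cover all 35 layers.
layer_types = BLOCK * (NUM_LAYERS // len(BLOCK))

rwkv_layers = [i for i, kind in enumerate(layer_types) if kind == "RWKV+TICA"]
swa_layers = [i for i, kind in enumerate(layer_types) if kind == "SWA"]

print(rwkv_layers)                         # [4, 9, 14, 19, 24, 29, 34]
print(len(swa_layers), len(rwkv_layers))   # 28 7
```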
SWA layers (inherited from Gemma-4)
- 8 attention heads, 1 KV head (MQA)
- `head_dim` = 256, sliding window = 512
- RoPE: `theta` = 10,000, default rope type
- These layers are not retrained in the conversion: their weights, RoPE configuration, KV-sharing pattern (`num_kv_shared_layers = 20`), and positional behavior are all left identical to Gemma-4.
RWKV-7 layers (replace Full / Global MQA)
The seven Full-attention slots are converted to RWKV-7 (hxa07i family) with:
- 32 attention heads / 4 KV heads, `head_dim` = 128
- QK-Norm enabled
- LoRA ranks: `decay = 256`, `iclr = 256`, `gate = 256`
- No RoPE inside the RWKV core (`use_rope_in_rwkv = false`)
This gives the model a recurrent, linear-time global pathway whose state stays a fixed size at inference, so memory on these layers is O(1) in sequence length: the original Full-attention KV cache on these 7 layers is eliminated.
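As a rough illustration of constant-size state versus a growing KV cache, the sketch below compares per-layer memory in bfloat16. Two assumptions are made here and are not confirmed by this card: the RWKV-7 state is taken to be one `head_dim × head_dim` matrix per head (the usual RWKV-7 layout), and the KV-cache example uses a hypothetical MQA layer with 1 KV head and `head_dim` = 256, since the teacher's Full-layer shapes are not listed above.

```python
BYTES_BF16 = 2


def kv_cache_bytes(seq_len: int, n_kv_heads: int, head_dim: int) -> int:
    """Keys plus values for one attention layer; grows linearly with seq_len."""
    return 2 * seq_len * n_kv_heads * head_dim * BYTES_BF16


def rwkv_state_bytes(n_heads: int, head_dim: int) -> int:
    """One head_dim x head_dim state matrix per head (assumed layout); independent of seq_len."""
    return n_heads * head_dim * head_dim * BYTES_BF16


if __name__ == "__main__":
    state = rwkv_state_bytes(n_heads=32, head_dim=128)
    print(f"RWKV state per layer: {state / 2**20:.2f} MiB at any context length")
    for n in (4_096, 32_768, 131_072):
        # Hypothetical MQA layer: 1 KV head, head_dim 256 (assumed, not from the card).
        kv = kv_cache_bytes(n, n_kv_heads=1, head_dim=256)
        print(f"KV cache per layer at N={n:>7}: {kv / 2**20:.2f} MiB")
```

Under these assumptions the recurrent state is a fixed cost, while the cache it replaces keeps growing with context length.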
TICA — Tiny Infused Causal Attention (on all RWKV layers)
To compensate for retrieval and induction-head behavior that pure linear attention struggles with, every RWKV layer is augmented with a small gated SDPA path:
- 4 heads / 2 KV heads (GQA), `head_dim` = 128
- Gate rank = 128
- No RoPE inside TICA (`use_rope_in_tiny_attn = false`)
TICA's QK compute is approximately 1024 × 1024 per layer versus 4096 × 4096 for the original Full attention, which is up to a 16× reduction (to roughly 1/16) in attention FLOPs on the global pathway. This is particularly beneficial on compute-bound consumer GPUs (e.g. RTX 4090) and APU-class hardware (e.g. Strix Halo), where memory bandwidth is not the dominant bottleneck.
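The 1/16 figure is simple arithmetic on the two quoted sizes. The snippet below takes those sizes at face value; `qk_cost` is a hypothetical helper, not a measurement of real kernel cost.

```python
def qk_cost(rows: int, cols: int) -> int:
    """Size of a rows x cols QK score computation, in illustrative units."""
    return rows * cols


tica = qk_cost(1024, 1024)
full = qk_cost(4096, 4096)

print(full // tica)   # 16
print(tica / full)    # 0.0625
```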
Resulting triple-hybrid
Token in
│
├── SWA (×4) ← inherited, 512-window MQA
├── RWKV-7 + TICA ← new linear core + tiny GQA SDPA
├── SWA (×4)
├── RWKV-7 + TICA
│ …
└── RWKV-7 + TICA ← layer 34 (final block)
Logits → final logit softcapping (cap = 30.0)
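The final softcapping step in the diagram above is commonly implemented as `cap * tanh(logits / cap)`, the form used in earlier Gemma releases. The sketch below assumes this checkpoint follows the same formula; `softcap` is an illustrative name, not a function from the model code.

```python
import math


def softcap(logits: list[float], cap: float = 30.0) -> list[float]:
    """Squash logits smoothly into the range [-cap, cap]."""
    return [cap * math.tanh(x / cap) for x in logits]


capped = softcap([-1000.0, -5.0, 0.0, 5.0, 1000.0])
print(capped)
```

Small logits pass through almost unchanged, while extreme values saturate at ±30.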
Conversion / Training Approach (high level)
- Teacher: Gemma-4 5B (E2B), used for both hidden-state and logit-level distillation.
- SWA layers: frozen throughout. (As an additional motivation: on the single MI300X / FlashAttention stack used during this work, gradients in SWA layers could not be sparsified to the 512-token window in practice, so freezing was also an OOM-mitigation measure.)
- RWKV-7 + TICA layers: newly initialized and trained.
- Stage 1: per-layer hidden-state alignment.
- Stage 2: KL-divergence distillation (forward KL, temperature ≈ 1.0, vocab-filtered) + cumulative hidden-state alignment.
- Logit softcapping (30.0) is preserved end-to-end so that the student's logit scale matches the teacher's calibration.
This v0.1 has only been trained to the point necessary to demonstrate that the architecture is functional — it has not been pushed to convergence, and the SWA-frozen + Full→Linear regime is known to need substantially more tokens than a uniform Transformer→Linear conversion (the Full layers of the teacher likely act as induction heads, and reproducing their behavior through RWKV + TICA is non-trivial).
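For readers unfamiliar with the Stage 2 objective, here is a minimal sketch of forward KL, KL(teacher ‖ student), at a single token position. It is not the training code used for this release: the function names are hypothetical, and the optional `keep_ids` argument is only one possible reading of "vocab-filtered".

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher_logits, student_logits, temperature=1.0, keep_ids=None):
    """KL(teacher || student) for one token position.

    keep_ids optionally restricts the vocabulary before the softmax
    (an assumed interpretation of "vocab-filtered").
    """
    if keep_ids is not None:
        teacher_logits = [teacher_logits[i] for i in keep_ids]
        student_logits = [student_logits[i] for i in keep_ids]
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.5, 0.7, -0.5, 0.1]
print(forward_kl(teacher, student))
print(forward_kl(teacher, teacher))   # 0.0
```

Forward KL penalizes the student wherever the teacher places probability mass that the student misses, which is why it is a common choice for logit-level distillation.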
Limitations and Caveats
- Quality is not guaranteed. This release exists to validate the architecture, not to deliver a competitive model. Outputs may be incoherent, off-topic, or otherwise lower quality than the teacher.
- No benchmark numbers are published with this preview. Any apparent capability is incidental.
- Multimodal tokens are present in the tokenizer but not exercised. Audio, image, and video token IDs are kept in the config for compatibility with the Gemma-4 vocabulary, but this checkpoint is text-only.
- Long-context behavior is unverified. The configured `max_position_embeddings` of 131,072 reflects the structural capability, not measured RULER / NIAH performance.
- The model is intended for researchers investigating SWA-hybrid linearization, RWKV-7 distillation, and TICA-style auxiliary attention. It is not intended as a drop-in replacement for the Gemma-4 base model.
Intended Use
- Architecture research on SWA + Linear + TICA triple-hybrid stacks.
- Distillation methodology research on SWA-hybrid teachers.
- Inference-engine work (kernel authors validating SWA + RWKV-7 + TICA graphs end-to-end).
Not intended for production deployment, user-facing assistants, safety-critical applications, or any setting that depends on output quality.
License
Apache 2.0
Acknowledgments
- The Google Gemma team for releasing Gemma-4 with its SWA-hybrid architecture.
- The RWKV community and the authors of RWKV-7 / `hxa07` family for the linear-attention foundation.
- The RADLADS distillation work and earlier OpenMOSE PrimeRWKV / TICA research, which directly informed this release.
Citation
If you use this model in research, please cite it as a v0.1 architecture preview:
@misc{openmose2026rwkvgemma4preview,
title = {RWKV-Gemma-4-5B-E2B-Preview-v0.1: A SWA + RWKV-7 + TICA Triple-Hybrid Proof of Concept},
author = {OpenMOSE},
year = {2026},
note = {Experimental proof-of-concept release. No performance guarantees.},
url = {https://huggingface.co/OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1}
}
© 2026 OpenMOSE. Released as an experimental research artifact.