Libraries

How to use OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1

SGLang

How to use OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenMOSE/RWKV-Gemma-5B-E2B-Preview-0.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1 — Model Card

To use this model, first install flash-linear-attention==0.4.1.
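As a quick sanity check before loading the model, the pinned dependency can be verified from Python. This is a minimal sketch using only the standard library; the helper name `fla_status` is hypothetical, and the distribution name is assumed to match the pip package named above.

```python
from importlib import metadata

REQUIRED = "0.4.1"  # version pinned by this model card


def fla_status(required: str = REQUIRED) -> str:
    """Return a short status string for the flash-linear-attention install."""
    try:
        installed = metadata.version("flash-linear-attention")
    except metadata.PackageNotFoundError:
        return f"missing: run `pip install flash-linear-attention=={required}`"
    if installed != required:
        return f"mismatch: found {installed}, model card pins {required}"
    return f"ok: flash-linear-attention {installed}"


print(fla_status())
```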

⚠️ This is a Proof-of-Concept (v0.1) — NO performance guarantees

This release exists for one reason only: to verify whether a triple-hybrid stack of SWA + Linear (RWKV-7) + TICA (Tiny Infused Causal Attention) is structurally viable when distilled from a Gemma-4 SWA-hybrid teacher.

It is not optimized for downstream quality, and no benchmark results are claimed. Please treat it as a research artifact for architecture validation, not a usable chatbot or assistant model.

Model Summary

RWKV-Gemma-4-5B-E2B-Preview-v0.1 is an experimental linearization of Gemma-4 (5B / E2B effective) in which the model's Full (global) MQA attention layers are replaced with a RWKV-7 + TICA path, while the original Sliding-Window Attention (SWA) layers are inherited and preserved as-is from the teacher.

The resulting network is therefore a triple-hybrid:

| Slot in original Gemma-4 | Replacement in this model |
| --- | --- |
| SWA layer (local, 512-token window) | SWA, kept (inherited from teacher) |
| Full / Global MQA layer | RWKV-7 (linear) + TICA (tiny SDPA), new |

This is the first OpenMOSE release that targets Gemma-4's hybrid SWA-Full attention pattern rather than a uniform Transformer stack, and it serves as the architectural proving ground for the line of work pursued in RWKV-GLM-4.7-Flash (full → linear) extended to SWA-hybrid teachers.

Why This Model Exists

Modern hybrid attention models — Gemma-3, Gemma-4, and similar — alternate between local SWA and global Full attention. The Full layers are hypothesized to act as the model's induction heads and long-range retrievers, while SWA layers handle local pattern matching.

If we can:

Keep SWA layers untouched (they are already efficient — O(N · W) with a fixed window W = 512), and
Replace only the Full / Global layers with a linear core (RWKV-7) plus a small auxiliary attention path (TICA),

…then we can drastically reduce the global-attention KV cache and FLOPs without disrupting the SWA pathway that the model relies on for short-range behavior.

The single question this v0.1 release attempts to answer is:

Does SWA + Linear (RWKV-7) + TICA actually train and run end-to-end on a Gemma-4 base?

Quality, benchmarks, and long-context performance are explicitly out of scope for this preview.

Key Facts

| Item | Value |
| --- | --- |
| Name | `OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1` |
| Base model | `google/gemma-4-E2B-it` (Gemma-4 5B with E2B effective parameters) |
| Total layers | 35 |
| SWA layers (inherited) | 28 |
| RWKV-7 + TICA layers (new) | 7 |
| Sliding window | 512 tokens |
| Hidden size | 1536 |
| Intermediate size (MLP) | 6144 (double-wide MLP enabled) |
| Vocab size | 262,144 (Gemma multimodal tokenizer) |
| Max position embeddings | 131,072 |
| Final logit softcapping | 30.0 (preserved from Gemma-4) |
| dtype | bfloat16 |
| Status | Experimental proof-of-concept (v0.1) |

Architecture

Layer composition (35 layers total)

The original Gemma-4 SWA/Full pattern is preserved structurally, with each Full layer slot replaced by a RWKV-7 + TICA block:

[SWA, SWA, SWA, SWA, RWKV+TICA] × 7

Concretely, layers [4, 9, 14, 19, 24, 29, 34] are RWKV-7 layers, and every single one of them carries a TICA module. The remaining 28 layers are SWA inherited verbatim from the teacher.
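The layer schedule described above can be reproduced in a few lines. This is an illustrative sketch only: the labels `"swa"` and `"rwkv7_tica"` and the function name are hypothetical and are not taken from the model's configuration file.

```python
NUM_LAYERS = 35
PERIOD = 5  # four SWA layers followed by one RWKV-7 + TICA layer


def layer_types(num_layers: int = NUM_LAYERS, period: int = PERIOD) -> list[str]:
    """Label each 0-indexed layer; every `period`-th layer is RWKV-7 + TICA."""
    return [
        "rwkv7_tica" if (i + 1) % period == 0 else "swa"
        for i in range(num_layers)
    ]


types = layer_types()
rwkv_layers = [i for i, t in enumerate(types) if t == "rwkv7_tica"]

print(rwkv_layers)                                   # [4, 9, 14, 19, 24, 29, 34]
print(types.count("swa"), types.count("rwkv7_tica"))  # 28 7
```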

SWA layers (inherited from Gemma-4)

8 attention heads, 1 KV head (MQA)
head_dim = 256, sliding window = 512
RoPE: theta = 10,000, default rope type
These layers are not retrained in the conversion — their weights, RoPE configuration, KV-sharing pattern (num_kv_shared_layers = 20), and positional behavior are all left identical to Gemma-4.
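To give a feel for why the SWA pathway is cheap to keep, the figures above allow a rough per-sequence estimate of its KV cache. This sketch assumes bfloat16 storage and an inference engine that retains only the last 512 tokens per SWA layer. It ignores the KV-sharing pattern, so the total is an upper bound rather than a measured number.

```python
def swa_kv_cache_bytes(
    window: int = 512,
    kv_heads: int = 1,
    head_dim: int = 256,
    bytes_per_elem: int = 2,  # bfloat16
    layers: int = 1,
) -> int:
    """Bytes needed to hold keys and values for `layers` SWA layers."""
    # Factor 2 accounts for storing both K and V.
    return 2 * window * kv_heads * head_dim * bytes_per_elem * layers


per_layer = swa_kv_cache_bytes()
all_swa = swa_kv_cache_bytes(layers=28)  # upper bound: no KV sharing assumed

print(per_layer // 1024, "KiB per SWA layer")        # 512 KiB per SWA layer
print(all_swa // (1024 * 1024), "MiB for 28 layers")  # 14 MiB for 28 layers
```

Because the window is fixed, this bound does not grow with context length.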

RWKV-7 layers (replace Full / Global MQA)

The seven Full-attention slots are converted to RWKV-7 (hxa07i family) with:

32 attention heads / 4 KV heads, head_dim = 128
QK-Norm enabled
LoRA ranks: decay = 256, iclr = 256, gate = 256
No RoPE inside the RWKV core (use_rope_in_rwkv = false)

This gives the model a recurrent, linear-time global pathway whose state stays constant in size (O(1) memory with respect to sequence length) at inference, so the original Full-attention KV cache on these 7 layers is eliminated.

TICA — Tiny Infused Causal Attention (on all RWKV layers)

To compensate for retrieval and induction-head behavior that pure linear attention struggles with, every RWKV layer is augmented with a small gated SDPA path:

4 heads / 2 KV heads (GQA), head_dim = 128
Gate rank = 128
No RoPE inside TICA (use_rope_in_tiny_attn = false)

TICA's QK compute is approximately 1024 × 1024 per layer versus 4096 × 4096 for the original Full attention, which cuts attention FLOPs on the global pathway to as little as roughly 1/16. This is particularly beneficial on compute-bound consumer GPUs (e.g. RTX 4090) and APU-class hardware (e.g. Strix Halo), where memory bandwidth is not the dominant bottleneck.
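The 1/16 figure follows directly from the two products quoted above. The sketch below takes those numbers at face value; what each dimension denotes is not spelled out in this card, so the variable names are only descriptive.

```python
from fractions import Fraction

tica_qk = 1024 * 1024  # TICA figure quoted in the text
full_qk = 4096 * 4096  # original Full-attention figure quoted in the text

ratio = Fraction(tica_qk, full_qk)
print(ratio)  # 1/16
```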

Resulting triple-hybrid

Token in
  │
  ├── SWA (×4)         ← inherited, 512-window MQA
  ├── RWKV-7 + TICA    ← new linear core + tiny GQA SDPA
  ├── SWA (×4)
  ├── RWKV-7 + TICA
  │   …
  └── RWKV-7 + TICA    ← layer 34 (final block)
Logits → final logit softcapping (cap = 30.0)
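For readers unfamiliar with the last step of the diagram, final logit softcapping in the Gemma family is a scaled tanh that bounds every logit to the open interval (-cap, cap). The scalar sketch below assumes this checkpoint's remote code follows the same form; the function name `softcap` is illustrative.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit smoothly into (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near the cap.
for x in (1.0, 30.0, 1000.0):
    print(x, softcap(x))
```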

Conversion / Training Approach (high level)

Teacher: Gemma-4 5B (E2B), used for both hidden-state and logit-level distillation.
SWA layers: frozen throughout. (As an additional motivation: on the single MI300X / FlashAttention stack used during this work, gradients in SWA layers could not be sparsified to the 512-token window in practice, so freezing was also an OOM-mitigation measure.)
RWKV-7 + TICA layers: newly initialized and trained.
Stage 1: per-layer hidden-state alignment.
Stage 2: KL-divergence distillation (forward KL, temperature ≈ 1.0, vocab-filtered) + cumulative hidden-state alignment.
Logit softcapping (30.0) is preserved end-to-end so that the student's logit scale matches the teacher's calibration.
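The Stage 2 objective can be illustrated for a single token position. This is a standard-library sketch, not the training code. It takes forward KL to mean KL(teacher || student), assumes both logit vectors are already softcapped, and omits the vocab filtering and hidden-state alignment terms because the card does not specify them.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy 3-token vocabulary; the values are made up for illustration.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]

print(forward_kl(teacher, student))  # small positive number
print(forward_kl(teacher, teacher))  # 0.0
```

In a real training loop this quantity would be averaged over positions and batches and computed with a tensor library.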

This v0.1 has only been trained to the point necessary to demonstrate that the architecture is functional. It has not been pushed to convergence, and the SWA-frozen + Full→Linear regime is known to need substantially more tokens than a uniform Transformer→Linear conversion: the Full layers of the teacher likely act as induction heads, and reproducing their behavior through RWKV + TICA is non-trivial.

Limitations and Caveats

Quality is not guaranteed. This release exists to validate the architecture, not to deliver a competitive model. Outputs may be incoherent, off-topic, or otherwise lower quality than the teacher.
No benchmark numbers are published with this preview. Any apparent capability is incidental.
Multimodal tokens are present in the tokenizer but not exercised. Audio, image, and video token IDs are kept in the config for compatibility with the Gemma-4 vocabulary, but this checkpoint is text-only.
Long-context behavior is unverified. The configured max_position_embeddings of 131,072 reflects the structural capability, not measured RULER / NIAH performance.
The model is intended for researchers investigating SWA-hybrid linearization, RWKV-7 distillation, and TICA-style auxiliary attention. It is not intended as a drop-in replacement for the Gemma-4 base model.

Intended Use

Architecture research on SWA + Linear + TICA triple-hybrid stacks.
Distillation methodology research on SWA-hybrid teachers.
Inference-engine work (kernel authors validating SWA + RWKV-7 + TICA graphs end-to-end).

Not intended for production deployment, user-facing assistants, safety-critical applications, or any setting that depends on output quality.

License

Apache 2.0

Acknowledgments

The Google Gemma team for releasing Gemma-4 with its SWA-hybrid architecture.
The RWKV community and the authors of the RWKV-7 / hxa07 family for the linear-attention foundation.
The RADLADS distillation work and earlier OpenMOSE PrimeRWKV / TICA research, which directly informed this release.

Citation

If you use this model in research, please cite it as a v0.1 architecture preview:

@misc{openmose2026rwkvgemma4preview,
  title  = {RWKV-Gemma-4-5B-E2B-Preview-v0.1: A SWA + RWKV-7 + TICA Triple-Hybrid Proof of Concept},
  author = {OpenMOSE},
  year   = {2026},
  note   = {Experimental proof-of-concept release. No performance guarantees.},
  url    = {https://huggingface.co/OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1}
}
