OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1 — Model Card

Usage

Install flash-linear-attention==0.4.1 before using this model.
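
A quick way to confirm that the pinned version is present is a small check like the one below. This is only a sketch: the helper name `fla_version_ok` is hypothetical, and it assumes the package is registered under the distribution name `flash-linear-attention`.

```python
from importlib import metadata

REQUIRED = "0.4.1"  # version pinned by this model card


def fla_version_ok(dist_name="flash-linear-attention", required=REQUIRED):
    """Return (ok, installed); installed is None when the package is missing."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False, None
    return installed == required, installed


ok, installed = fla_version_ok()
print(f"flash-linear-attention installed={installed} ok={ok}")
```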

⚠️ This is a Proof-of-Concept (v0.1) — NO performance guarantees

This release exists for one reason only: to verify whether a triple-hybrid stack of SWA + Linear (RWKV-7) + TICA (Tiny Infused Causal Attention) is structurally viable when distilled from a Gemma-4 SWA-hybrid teacher.

It is not optimized for downstream quality, and no benchmark results are claimed. Please treat it as a research artifact for architecture validation, not a usable chatbot or assistant model.

hxa07i

Model Summary

RWKV-Gemma-4-5B-E2B-Preview-v0.1 is an experimental linearization of Gemma-4 (5B / E2B effective) in which the model's Full (global) MQA attention layers are replaced with an RWKV-7 + TICA path, while the original Sliding-Window Attention (SWA) layers are inherited unchanged from the teacher.

The resulting network is therefore a triple-hybrid:

| Slot in original Gemma-4 | Replacement in this model |
|---|---|
| SWA layer (local, 512-token window) | SWA: kept (inherited from teacher) |
| Full / Global MQA layer | RWKV-7 (linear) + TICA (tiny SDPA): new |

This is the first OpenMOSE release that targets Gemma-4's hybrid SWA-Full attention pattern rather than a uniform Transformer stack, and it serves as the architectural proving ground for the line of work pursued in RWKV-GLM-4.7-Flash (full → linear) extended to SWA-hybrid teachers.


Why This Model Exists

Modern hybrid attention models — Gemma-3, Gemma-4, and similar — alternate between local SWA and global Full attention. The Full layers are hypothesized to act as the model's induction heads and long-range retrievers, while SWA layers handle local pattern matching.

If we can:

  1. Keep SWA layers untouched (they are already efficient — O(N · W) with a fixed window W = 512), and
  2. Replace only the Full / Global layers with a linear core (RWKV-7) plus a small auxiliary attention path (TICA),

…then we can drastically reduce the global-attention KV cache and FLOPs without disrupting the SWA pathway that the model relies on for short-range behavior.

The single question this v0.1 release attempts to answer is:

Does SWA + Linear (RWKV-7) + TICA actually train and run end-to-end on a Gemma-4 base?

Quality, benchmarks, and long-context performance are explicitly out of scope for this preview.


Key Facts

| Item | Value |
|---|---|
| Name | OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1 |
| Base model | google/gemma-4-E2B-it (Gemma-4 5B with E2B effective parameters) |
| Total layers | 35 |
| SWA layers (inherited) | 28 |
| RWKV-7 + TICA layers (new) | 7 |
| Sliding window | 512 tokens |
| Hidden size | 1536 |
| Intermediate size (MLP) | 6144 (double-wide MLP enabled) |
| Vocab size | 262,144 (Gemma multimodal tokenizer) |
| Max position embeddings | 131,072 |
| Final logit softcapping | 30.0 (preserved from Gemma-4) |
| dtype | bfloat16 |
| Status | Experimental proof-of-concept (v0.1) |

Architecture

Layer composition (35 layers total)

The original Gemma-4 SWA/Full pattern is preserved structurally, with each Full layer slot replaced by an RWKV-7 + TICA block:

[SWA, SWA, SWA, SWA, RWKV+TICA] × 7

Concretely, layers [4, 9, 14, 19, 24, 29, 34] (zero-indexed) are RWKV-7 layers, and every one of them carries a TICA module. The remaining 28 layers are SWA layers inherited verbatim from the teacher.
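
The layout can be reproduced in a few lines. The sketch below simply expands the repeating pattern and reads off the indices; the string labels are illustrative names, not identifiers taken from the model's config.

```python
# Expand the repeating 5-layer pattern described in this section.
PATTERN = ["SWA", "SWA", "SWA", "SWA", "RWKV+TICA"]  # illustrative labels
REPEATS = 7

layer_types = PATTERN * REPEATS
rwkv_layers = [i for i, kind in enumerate(layer_types) if kind == "RWKV+TICA"]
swa_layers = [i for i, kind in enumerate(layer_types) if kind == "SWA"]

print("total layers:", len(layer_types))
print("RWKV-7 + TICA layer indices (zero-based):", rwkv_layers)
print("SWA layer count:", len(swa_layers))
```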

SWA layers (inherited from Gemma-4)

  • 8 attention heads, 1 KV head (MQA)
  • head_dim = 256, sliding window = 512
  • RoPE: theta = 10,000, default rope type
  • These layers are not retrained in the conversion — their weights, RoPE configuration, KV-sharing pattern (num_kv_shared_layers = 20), and positional behavior are all left identical to Gemma-4.
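
To make the sliding-window behavior concrete, here is a minimal pure-Python sketch of a causal sliding-window mask. It assumes the window counts the current token, which is one common convention; the card does not state which convention Gemma-4 uses, and the function names are hypothetical.

```python
def swa_mask(n_tokens, window):
    """mask[i][j] is True when query position i may attend to key position j.

    ASSUMPTION: the window includes the current token, so each query sees
    at most `window` keys (itself plus the window - 1 previous tokens).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


# Tiny example so the pattern is visible; the model itself uses window = 512.
for row in swa_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))
```

For long sequences the number of attended pairs approaches N · W rather than the N² / 2 of full causal attention, which is why these layers are already cheap.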

RWKV-7 layers (replace Full / Global MQA)

The seven Full-attention slots are converted to RWKV-7 (hxa07i family) with:

  • 32 attention heads / 4 KV heads, head_dim = 128
  • QK-Norm enabled
  • LoRA ranks: decay = 256, iclr = 256, gate = 256
  • No RoPE inside the RWKV core (use_rope_in_rwkv = false)

This gives the model a recurrent, linear-time global pathway whose state has a fixed size regardless of sequence length (O(1) memory in the sequence length at inference), so the original Full-attention KV cache on these 7 layers is eliminated.
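
The difference in growth can be illustrated with rough element counts. This is a sketch under stated assumptions, not a measurement: it assumes the teacher's Full layers use 1 KV head with head_dim = 256 (the figures this card gives for the SWA layers), and that the RWKV-7 core keeps one head_dim × head_dim state matrix per head. It ignores any cache kept by the TICA path and any auxiliary RWKV state, neither of which the card specifies.

```python
def kv_cache_elements(seq_len, n_kv_heads, head_dim):
    """Keys plus values cached by one attention layer; grows with seq_len."""
    return 2 * n_kv_heads * head_dim * seq_len


def recurrent_state_elements(n_heads, head_dim):
    """One head_dim x head_dim matrix per head (ASSUMED state layout)."""
    return n_heads * head_dim * head_dim


# ASSUMPTION: teacher Full layer = 1 KV head, head_dim 256 (not stated in the card).
TEACHER_KV_HEADS, TEACHER_HEAD_DIM = 1, 256
# From the RWKV-7 bullet list: 32 heads, head_dim 128.
RWKV_HEADS, RWKV_HEAD_DIM = 32, 128

state = recurrent_state_elements(RWKV_HEADS, RWKV_HEAD_DIM)
for seq_len in (1_024, 8_192, 131_072):
    cache = kv_cache_elements(seq_len, TEACHER_KV_HEADS, TEACHER_HEAD_DIM)
    print(f"T={seq_len:>7}: KV cache {cache:>11,} elements, "
          f"recurrent state {state:,} elements")
```

Under these assumptions the per-layer state is constant while the KV cache scales linearly with the context length, so the gap widens as sequences get longer.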

TICA — Tiny Infused Causal Attention (on all RWKV layers)

To compensate for retrieval and induction-head behavior that pure linear attention struggles with, every RWKV layer is augmented with a small gated SDPA path:

  • 4 heads / 2 KV heads (GQA), head_dim = 128
  • Gate rank = 128
  • No RoPE inside TICA (use_rope_in_tiny_attn = false)
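
For readers unfamiliar with GQA, the head sharing can be sketched as follows. The contiguous grouping shown here is the usual convention and is assumed, not confirmed, for this model; `kv_head_for` is a hypothetical helper name.

```python
def kv_head_for(query_head, n_heads=4, n_kv_heads=2):
    """Map a query head to the KV head it shares (contiguous groups, ASSUMED)."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads
    return query_head // group_size


# TICA: 4 query heads share 2 KV heads, i.e. 2 query heads per KV head.
print([kv_head_for(q) for q in range(4)])
```

The same helper covers the RWKV-7 layers' 32 heads / 4 KV heads by passing `n_heads=32, n_kv_heads=4`.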

TICA's QK compute is approximately 1024 × 1024 per layer versus 4096 × 4096 for the original Full attention, a reduction of up to 16× (to about 1/16) in attention FLOPs on the global pathway. This is particularly beneficial on compute-bound consumer GPUs (e.g. RTX 4090) and APU-class hardware (e.g. Strix Halo), where memory bandwidth is not the dominant bottleneck.
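
Taking the two figures quoted above at face value, the ratio works out as follows. The snippet only checks the arithmetic; it does not model the actual kernels.

```python
from fractions import Fraction

tica_qk = 1024 * 1024   # approximate TICA QK compute per layer (from this card)
full_qk = 4096 * 4096   # original Full-attention QK compute (from this card)

ratio = Fraction(tica_qk, full_qk)
print("TICA / Full QK compute:", ratio)
print("reduction factor:", full_qk // tica_qk)
```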

Resulting triple-hybrid

Token in
  │
  ├── SWA (×4)         ← inherited, 512-window MQA
  ├── RWKV-7 + TICA    ← new linear core + tiny GQA SDPA
  ├── SWA (×4)
  ├── RWKV-7 + TICA
  │   …
  └── RWKV-7 + TICA    ← layer 34 (final block)
Logits → final logit softcapping (cap = 30.0)
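
The final softcapping step can be sketched in a few lines. The card does not spell out the formula, so this assumes the tanh form used for logit softcapping in earlier Gemma releases; the function name `softcap` is illustrative.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly squash a logit so its magnitude never exceeds `cap`.

    ASSUMPTION: cap * tanh(logit / cap), as in earlier Gemma releases.
    """
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near the cap.
for raw in (-100.0, -5.0, 0.0, 5.0, 100.0):
    print(f"{raw:>7.1f} -> {softcap(raw):.3f}")
```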

Conversion / Training Approach (high level)

  • Teacher: Gemma-4 5B (E2B), used for both hidden-state and logit-level distillation.
  • SWA layers: frozen throughout. (As an additional motivation: on the single MI300X / FlashAttention stack used during this work, gradients in SWA layers could not be sparsified to the 512-token window in practice, so freezing was also an OOM-mitigation measure.)
  • RWKV-7 + TICA layers: newly initialized and trained.
  • Stage 1: per-layer hidden-state alignment.
  • Stage 2: KL-divergence distillation (forward KL, temperature ≈ 1.0, vocab-filtered) + cumulative hidden-state alignment.
  • Logit softcapping (30.0) is preserved end-to-end so that the student's logit scale matches the teacher's calibration.
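
The Stage 2 objective can be illustrated with a minimal forward-KL computation for a single token position. This is a sketch, not the training code: it takes forward KL to mean KL(teacher ‖ student), uses plain Python lists in place of tensors, and leaves out the vocab filtering and hidden-state alignment terms, whose details the card does not give.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a 3-token vocabulary (made-up values for illustration).
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]
print("forward KL:", forward_kl(teacher, student))
```

The loss is zero when the student's distribution matches the teacher's and positive otherwise, which is the property the distillation stage relies on.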

This v0.1 has only been trained to the point necessary to demonstrate that the architecture is functional; it has not been pushed to convergence. The SWA-frozen + Full→Linear regime is known to need substantially more tokens than a uniform Transformer→Linear conversion: the Full layers of the teacher likely act as induction heads, and reproducing their behavior through RWKV + TICA is non-trivial.


Limitations and Caveats

  • Quality is not guaranteed. This release exists to validate the architecture, not to deliver a competitive model. Outputs may be incoherent, off-topic, or otherwise lower quality than the teacher.
  • No benchmark numbers are published with this preview. Any apparent capability is incidental.
  • Multimodal tokens are present in the tokenizer but not exercised. Audio, image, and video token IDs are kept in the config for compatibility with the Gemma-4 vocabulary, but this checkpoint is text-only.
  • Long-context behavior is unverified. The configured max_position_embeddings of 131,072 reflects the structural capability, not measured RULER / NIAH performance.
  • The model is intended for researchers investigating SWA-hybrid linearization, RWKV-7 distillation, and TICA-style auxiliary attention. It is not intended as a drop-in replacement for the Gemma-4 base model.

Intended Use

  • Architecture research on SWA + Linear + TICA triple-hybrid stacks.
  • Distillation methodology research on SWA-hybrid teachers.
  • Inference-engine work (kernel authors validating SWA + RWKV-7 + TICA graphs end-to-end).

Not intended for production deployment, user-facing assistants, safety-critical applications, or any setting that depends on output quality.


License

Apache 2.0


Acknowledgments

  • The Google Gemma team for releasing Gemma-4 with its SWA-hybrid architecture.
  • The RWKV community and the authors of RWKV-7 / hxa07 family for the linear-attention foundation.
  • The RADLADS distillation work and earlier OpenMOSE PrimeRWKV / TICA research, which directly informed this release.

Citation

If you use this model in research, please cite it as a v0.1 architecture preview:

@misc{openmose2026rwkvgemma4preview,
  title  = {RWKV-Gemma-4-5B-E2B-Preview-v0.1: A SWA + RWKV-7 + TICA Triple-Hybrid Proof of Concept},
  author = {OpenMOSE},
  year   = {2026},
  note   = {Experimental proof-of-concept release. No performance guarantees.},
  url    = {https://huggingface.co/OpenMOSE/RWKV-Gemma-4-5B-E2B-Preview-v0.1}
}

© 2026 OpenMOSE. Released as an experimental research artifact.
