Wiola13M

Wiola13M is a decoder-only Transformer with roughly 13 million parameters (12.9M), developed by OSCOWL AI.

The model is a lightweight architecture built from the following components:

Spiral Rotary Position Embeddings (Spiral RoPE)
Gated Spiral Attention
Butterfly Feed Forward Network
RMSNorm
Weight-tied language modeling head

Wiola13M is designed as a compact research language model that can be trained efficiently on consumer GPUs while remaining fully compatible with the Hugging Face Transformers ecosystem.

Model Details

Property	Value
Model Name	Wiola13M
Parameters	12.9 Million
Architecture	Decoder-only Transformer
Hidden Size	256
Layers	6
Attention Heads	8
Context Length	512 Tokens
Position Encoding	Spiral Rotary Embeddings
Feed Forward	Butterfly MLP
Framework	PyTorch
Library	Hugging Face Transformers
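
The table values imply a few derived quantities that help when sizing hardware. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes the heads split the hidden size evenly, that weights are held as float32 (4 bytes each), and the common AdamW accounting of one copy each for weights and gradients plus two moment buffers. None of these assumptions are stated in the table, and activation memory is ignored.

```python
# Back-of-the-envelope sizing from the Model Details table.
# Assumptions (not stated in the model card): heads split the hidden size
# evenly, weights are float32, and AdamW keeps two moment buffers per weight.

HIDDEN_SIZE = 256
NUM_HEADS = 8
NUM_PARAMS = 12_900_000  # "12.9 Million" from the table, rounded
BYTES_PER_FLOAT32 = 4

head_dim = HIDDEN_SIZE // NUM_HEADS

# Weights only, e.g. for inference or the checkpoint on disk.
weights_mb = NUM_PARAMS * BYTES_PER_FLOAT32 / 1e6

# Weights + gradients + AdamW first and second moments = 4 copies.
training_state_mb = 4 * weights_mb

print(f"head dimension:       {head_dim}")
print(f"float32 weights:      {weights_mb:.1f} MB")
print(f"AdamW training state: {training_state_mb:.1f} MB (excluding activations)")
```

Under these assumptions the weights and optimizer state occupy a few hundred megabytes at most, which is consistent with the goal of training on consumer GPUs. Mixed precision changes the exact figures.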

Training

The model was trained on the TinyStories dataset.

Training configuration:

Optimizer: AdamW
Learning Rate: 3e-4
Scheduler: Cosine
Maximum Steps: 20,000
Effective Batch Size: 32
Precision: Mixed
Sequence Length: 512
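
To make the configuration above concrete, here is a minimal sketch of a cosine learning-rate schedule and the token budget these settings imply. The card does not say whether warmup was used or what the final learning rate was, so the sketch assumes no warmup and decay to zero. It also assumes every sequence is filled to the full 512 tokens, which makes the token count an upper bound. The function name `cosine_lr` is illustrative and is not taken from the training code.

```python
import math

PEAK_LR = 3e-4
MAX_STEPS = 20_000
EFFECTIVE_BATCH_SIZE = 32
SEQUENCE_LENGTH = 512


def cosine_lr(step: int, peak_lr: float = PEAK_LR,
              max_steps: int = MAX_STEPS, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at max_steps.

    Assumption: no warmup phase and min_lr = 0 (not stated in the card).
    """
    progress = min(max(step, 0), max_steps) / max_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Tokens consumed per optimizer step and over the whole run (upper bound).
tokens_per_step = EFFECTIVE_BATCH_SIZE * SEQUENCE_LENGTH
max_tokens_seen = tokens_per_step * MAX_STEPS

for step in (0, 10_000, 20_000):
    print(f"step {step:>6}: lr = {cosine_lr(step):.2e}")
print(f"tokens per step: {tokens_per_step:,}")
print(f"upper bound on tokens processed: {max_tokens_seen:,}")
```

With these settings each step processes at most 16,384 tokens, so the full run covers at most about 328 million tokens.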

Final Training Loss:

2.0568
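
A loss value is easier to interpret as a perplexity. The conversion below assumes the reported figure is the mean per-token cross-entropy in nats, which is the PyTorch default but is not stated explicitly in the card.

```python
import math

final_training_loss = 2.0568  # value reported in the model card

# Assumption: mean per-token cross-entropy measured in nats.
perplexity = math.exp(final_training_loss)

print(f"training perplexity: {perplexity:.2f}")
```

This gives a training perplexity of roughly 7.8. It is a training-set figure and says nothing about held-out performance.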

Architecture

Wiola13M replaces standard Transformer attention with Gated Spiral Attention.

The architecture consists of:

Embedding
      ↓
Spiral Rotary Embedding
      ↓
Gated Multi-Head Attention
      ↓
Butterfly Feed Forward Network
      ↓
RMSNorm
      ↓
Language Modeling Head
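
The card does not specify how Spiral RoPE differs from standard rotary position embeddings, so the sketch below shows standard RoPE only, as background for the positional step in the flow above. It is not the Wiola13M implementation. The head dimension of 32 assumes the hidden size of 256 is split evenly over 8 heads, and the base of 10000 is the conventional default rather than a documented setting.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Apply standard rotary position embedding to a single head vector.

    Each consecutive pair (vec[i], vec[i + 1]), for even i, is rotated by
    the angle position * base ** (-i / d), where d is the vector length.
    """
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


head_dim = 32  # assumption: 256 hidden size / 8 heads
query = [math.sin(0.1 * k) for k in range(head_dim)]  # toy vectors
key = [math.cos(0.2 * k) for k in range(head_dim)]

# The attention score depends only on the relative offset (here 2),
# not on the absolute positions.
score_near = dot(rope(query, 5), rope(key, 3))
score_far = dot(rope(query, 105), rope(key, 103))
print(abs(score_near - score_far) < 1e-9)
```

The final check illustrates the property that makes rotary embeddings attractive for attention: shifting both positions by the same amount leaves the query-key score unchanged.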

Key innovations include:

Content-adaptive attention gating
Spiral positional encoding
Efficient Butterfly MLP
KV-cache compatible autoregressive decoding
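
For a sense of what KV caching costs at this scale, the sketch below estimates cache size for a conventional multi-head attention cache. It assumes one key and one value tensor per layer, each as wide as the hidden size, stored as float32. Gated Spiral Attention may cache different or additional tensors, so treat this as an estimate for a standard Transformer with the same dimensions.

```python
NUM_LAYERS = 6
HIDDEN_SIZE = 256
CONTEXT_LENGTH = 512
BYTES_PER_ELEMENT = 4  # assumption: float32 cache entries


def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
    """Size of a standard KV cache: K and V per layer, [batch, seq, hidden]."""
    return 2 * NUM_LAYERS * batch_size * seq_len * HIDDEN_SIZE * BYTES_PER_ELEMENT


full_context = kv_cache_bytes(CONTEXT_LENGTH)
print(f"{full_context:,} bytes = {full_context / 2**20:.1f} MiB per sequence")
```

Under these assumptions a full 512-token context needs about 6 MiB of cache per sequence.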

Usage

Install the package:

pip install wiola13m

Load the model:

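from transformers import AutoTokenizer
from wiola13m import WiolaForCausalLM

# Download the tokenizer and weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("oscowlai/Wiola13M")
model = WiolaForCausalLM.from_pretrained("oscowlai/Wiola13M")
model.eval()  # inference mode: disables dropout, if any is configured

# Tokenize the prompt; token_type_ids are not needed for causal generation.
inputs = tokenizer(
    "Once upon a time",
    return_tensors="pt",
    return_token_type_ids=False,
)

# Sample up to 100 new tokens with temperature and nucleus (top-p) sampling.
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

# Decode the prompt plus continuation back to text.
print(tokenizer.decode(output[0], skip_special_tokens=True))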
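
The `temperature` and `top_p` arguments in the `generate` call control how the next token is drawn. The sketch below illustrates the general technique in plain Python on a toy distribution. It is not the Transformers implementation, and the function name `sample_top_p` and the five-token logits are made up for the example.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling and nucleus filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens; choices() renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [4.0, 3.5, 1.0, -2.0, -5.0]  # hypothetical 5-token vocabulary
rng = random.Random(0)
draws = [sample_top_p(toy_logits, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

For these toy logits the two most likely tokens already cover more than 95% of the probability mass, so the remaining three are never sampled. Lowering `temperature` sharpens the distribution further, and lowering `top_p` shrinks the set of candidates.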

Intended Uses

Wiola13M is intended for:

Language model research
Efficient transformer experimentation
Education
Architecture benchmarking
Fine-tuning experiments

It is not intended for production deployment without further evaluation and fine-tuning.

Limitations

Trained primarily on TinyStories.
Not instruction tuned.
Not RLHF aligned.
May generate inaccurate or repetitive outputs.
Performance outside the training domain has not been extensively evaluated.

Citation

@software{wiola13m2026,
  title={Wiola 13M, a Gated Spiral Attention Architecture for Parameter Efficient Small Language Models},
  author={OSCOWL AI},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/oscowlai/Wiola13M}
}

License

Apache License 2.0

Author

Developed by OSCOWL AI

GitHub: https://github.com/Wiola-OSCOWL-ai/Wiola13M

Hugging Face: https://huggingface.co/oscowlai/Wiola13M
