Wiola13M

Wiola13M is a decoder-only Transformer with roughly 13 million parameters (12.9M), developed by OSCOWL AI.

The model introduces a lightweight architecture built from:

  • Spiral Rotary Position Embeddings (Spiral RoPE)
  • Gated Spiral Attention
  • Butterfly Feed Forward Network
  • RMSNorm
  • Weight-tied language modeling head

Wiola13M is designed as a compact research language model that can be trained efficiently on consumer GPUs while remaining fully compatible with the Hugging Face Transformers ecosystem.


Model Details

Property            Value
Model Name          Wiola13M
Parameters          12.9 million
Architecture        Decoder-only Transformer
Hidden Size         256
Layers              6
Attention Heads     8
Context Length      512 tokens
Position Encoding   Spiral Rotary Embeddings
Feed Forward        Butterfly MLP
Framework           PyTorch
Library             Hugging Face Transformers
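
A few quantities follow directly from the values above. The sketch below derives them under two assumptions that this card does not state: that the hidden size is split evenly across heads, and that one key and one value vector of full hidden width are cached per layer and per token. `WiolaConfigSketch` is a hypothetical container for illustration, not the package's config class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WiolaConfigSketch:
    """Hypothetical holder for the values in the Model Details table."""

    hidden_size: int = 256
    num_layers: int = 6
    num_heads: int = 8
    context_length: int = 512

    @property
    def head_dim(self) -> int:
        # Assumption: the hidden size is split evenly across attention heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    def kv_cache_elements(self) -> int:
        # Assumption: one key and one value vector of width hidden_size
        # is stored per layer for every position in the context window.
        return 2 * self.num_layers * self.context_length * self.hidden_size


cfg = WiolaConfigSketch()
print(cfg.head_dim)             # 32
print(cfg.kv_cache_elements())  # 1572864
```

Under these assumptions each head works on 32 dimensions, and a full 512-token cache holds about 1.6 million values per sequence.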

Training

The model was trained on the TinyStories dataset.

Training configuration:

  • Optimizer: AdamW
  • Learning Rate: 3e-4
  • Scheduler: Cosine
  • Maximum Steps: 20,000
  • Effective Batch Size: 32
  • Mixed Precision Training
  • Sequence Length: 512
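
To make the schedule and the token budget concrete, here is a minimal sketch of a cosine learning-rate decay using the values listed above. It assumes no warmup and a decay to zero, neither of which is stated in this card; `cosine_lr` is a hypothetical helper, not the training script's code. The token count is an upper bound, since it assumes every sequence is filled to 512 tokens.

```python
import math

PEAK_LR = 3e-4          # learning rate from the list above
MAX_STEPS = 20_000      # maximum steps from the list above
EFFECTIVE_BATCH = 32    # sequences per optimizer step
SEQ_LEN = 512           # tokens per sequence


def cosine_lr(step, peak_lr=PEAK_LR, max_steps=MAX_STEPS, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr (assumed: no warmup, min_lr=0)."""
    step = min(max(step, 0), max_steps)  # clamp to the scheduled range
    progress = step / max_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Upper bound on tokens processed, assuming fully packed sequences.
tokens_per_step = EFFECTIVE_BATCH * SEQ_LEN
total_tokens = tokens_per_step * MAX_STEPS

for step in (0, 5_000, 10_000, 20_000):
    print(step, f"{cosine_lr(step):.2e}")
print(f"{total_tokens:,} tokens at most")
```

With these settings the run processes at most 16,384 tokens per step and about 328 million tokens in total.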

Final Training Loss:

2.0568
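
If the reported loss is the mean per-token cross-entropy in nats (the PyTorch default, but an assumption here), it converts to a training perplexity as follows:

```python
import math

final_loss = 2.0568  # reported final training loss

# Perplexity is the exponential of the mean per-token cross-entropy (in nats).
perplexity = math.exp(final_loss)
print(f"{perplexity:.2f}")  # 7.82
```

This is a training-set figure; the card reports no held-out evaluation, so it should not be read as a validation perplexity.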

Architecture

Wiola13M replaces standard Transformer attention with Gated Spiral Attention.

The architecture consists of:

Embedding
      ↓
Spiral Rotary Embedding
      ↓
Gated Multi-Head Attention
      ↓
Butterfly Feed Forward Network
      ↓
RMSNorm
      ↓
Language Modeling Head
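
Of the stages in the diagram, RMSNorm is the one with a widely used standard definition, so it can be illustrated without guessing at Wiola-specific details. The sketch below is the textbook formulation in plain Python; the epsilon value and the function name are assumptions, and the model's actual PyTorch implementation may differ.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


hidden = [3.0, -4.0, 0.0, 0.0]   # toy hidden vector; its RMS is 2.5
gain = [1.0, 1.0, 1.0, 1.0]      # learned per-channel gain, identity here
print(rms_norm(hidden, gain))    # approximately [1.2, -1.6, 0.0, 0.0]
```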

Key innovations include:

  • Content-adaptive attention gating
  • Spiral positional encoding
  • Efficient Butterfly MLP
  • KV-cache compatible autoregressive decoding
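
The last point can be illustrated with a toy example. The sketch below uses plain single-head scaled dot-product attention, not Gated Spiral Attention, and all names in it are hypothetical. It shows only the general idea of KV caching: keys and values from earlier positions are stored, so each decoding step attends over the cache instead of recomputing the whole prefix, and the result matches full causal attention.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVCache:
    """Stores the keys and values of all positions decoded so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    # Only the newest key/value is added; earlier ones are reused from the cache.
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


def full_causal_attention(queries, keys, values):
    # Reference: position t attends over positions 0..t, recomputed from scratch.
    return [attend(q, keys[: t + 1], values[: t + 1]) for t, q in enumerate(queries)]


# Toy per-position query/key/value vectors (projections omitted for brevity).
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
incremental = [decode_step(cache, q, k, v) for q, k, v in zip(qs, ks, vs)]
reference = full_causal_attention(qs, ks, vs)
print(incremental == reference)
```

In the Transformers API this behavior is normally controlled through the `use_cache` option of `generate`; whether Wiola13M enables it by default is not stated here.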

Usage

Install the package:

pip install wiola13m

Load the model:

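from transformers import AutoTokenizer
from wiola13m import WiolaForCausalLM

# Download the tokenizer and weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("oscowlai/Wiola13M")
model = WiolaForCausalLM.from_pretrained("oscowlai/Wiola13M")

# Tokenize the prompt; token_type_ids are dropped so they are not passed to generate().
inputs = tokenizer(
    "Once upon a time",
    return_tensors="pt",
    return_token_type_ids=False,
)

# Sample up to 100 new tokens with temperature and nucleus (top-p) sampling.
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

# Decode the first (and only) generated sequence back to text.
print(tokenizer.decode(output[0], skip_special_tokens=True))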
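
For readers unfamiliar with the `temperature` and `top_p` arguments used above, the following plain-Python sketch shows the general technique on a toy logit vector. It is an illustration, not the Transformers implementation, which differs in details; `top_p_filter` and `sample` are hypothetical names.

```python
import math
import random


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_id: probability} for the nucleus after temperature scaling."""
    scaled = [l / temperature for l in logits]  # temperature < 1 sharpens the distribution
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng, temperature=0.8, top_p=0.95):
    """Draw one token id from the nucleus distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


toy_logits = [4.0, 3.0, 0.0, -4.0]
print(top_p_filter(toy_logits))             # only token ids 0 and 1 survive
print(sample(toy_logits, random.Random(0)))
```

Lower temperatures and smaller `top_p` values make output more deterministic, which can reduce incoherent text but tends to increase repetition in a model this small.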

Intended Uses

Wiola13M is intended for:

  • Language model research
  • Efficient transformer experimentation
  • Education
  • Architecture benchmarking
  • Fine-tuning experiments

It is not intended for production deployment without further evaluation and fine-tuning.


Limitations

  • Trained primarily on TinyStories.
  • Not instruction tuned.
  • Not RLHF aligned.
  • May generate inaccurate or repetitive outputs.
  • Performance outside the training domain has not been extensively evaluated.

Citation

@software{wiola13m2026,
  title={Wiola 13M, a Gated Spiral Attention Architecture for Parameter Efficient Small Language Models},
  author={OSCOWL AI},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/oscowlai/Wiola13M}
}

License

Apache License 2.0


Author

Developed by OSCOWL AI

GitHub: https://github.com/Wiola-OSCOWL-ai/Wiola13M

Hugging Face: https://huggingface.co/oscowlai/Wiola13M
