TinyV4: 11M Bilingual Base Model

TinyV4 is a compact, 11-million-parameter bilingual (Indonesian and English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.

At just 58 MB, it's small enough to run anywhere. Smart enough to be worth your time.

What is this?

Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.

TinyV4 is different: 11M parameters with a Mixture-of-Experts architecture, pre-trained on bilingual data so it already understands both Indonesian and English. You bring the task, it brings the foundation.

Why use TinyV4 as your base?

| Reason | Why it matters |
| --- | --- |
| 11M params | Fine-tune in minutes, not days |
| 58 MB | Fits anywhere: mobile, edge, browser |
| CPU-friendly | No GPU? No problem |
| Bilingual | Already understands ID + EN |
| MoE architecture | Efficient capacity without the bloat |
| MIT license | No restrictions, no strings |

Architecture

| Component | Spec |
| --- | --- |
| Parameters | 11,034,955 |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |
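
As a rough sanity check on the figures in this table, the sketch below works out how much of the parameter budget the token embedding takes and what the F32 size would be. It assumes the embedding is a plain 32,000 x 128 matrix and that the 11,034,955 figure counts it once. The "untied head" case is a guess at how the checkpoint might be stored, not something the source states.

```python
# Figures taken from the architecture table.
total_params = 11_034_955
vocab_size = 32_000
dim = 128

# Assumption: the token embedding is a plain vocab_size x dim matrix.
embed_params = vocab_size * dim
embed_share = embed_params / total_params

BYTES_PER_F32 = 4
MIB = 1024 ** 2

# Size if the output head shares the embedding matrix (tied weights).
tied_mib = total_params * BYTES_PER_F32 / MIB
# Size if the file also stores a separate vocab_size x dim output head
# (an assumption, not confirmed by the source).
untied_mib = (total_params + embed_params) * BYTES_PER_F32 / MIB

print(f"embedding params: {embed_params:,} ({embed_share:.1%} of total)")
print(f"F32 size, tied head:   {tied_mib:.1f} MiB")
print(f"F32 size, untied head: {untied_mib:.1f} MiB")
```

Under these assumptions the embedding alone is about 37% of the parameters, and the untied case lands near the 58 MB listed in the table.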

Built with Mixture-of-Experts (MoE), Sinkhorn-Knopp load balancing, Multi-Token Prediction (MTP), and Hierarchical Compressed Attention, techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
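
To make the "4 routed + 1 shared, 2 active per token" layout concrete, here is a minimal sketch of such a feed-forward layer in PyTorch. It is not TinyV4's actual implementation: the class name `TinyMoE`, the hidden width of 256, the GELU activation, and the softmax gating are all assumptions, and Sinkhorn-Knopp balancing is left out entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoE(nn.Module):
    """Illustrative MoE feed-forward layer: routed experts plus one shared expert."""

    def __init__(self, dim=128, hidden=256, n_routed=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_routed, bias=False)

        def make_expert():
            # hidden=256 and GELU are illustrative choices, not from the source
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

        self.experts = nn.ModuleList([make_expert() for _ in range(n_routed)])
        self.shared = make_expert()  # always active, sees every token

    def forward(self, x):
        # x: (batch, seq, dim) -> flatten to (tokens, dim) for routing
        b, s, d = x.shape
        flat = x.reshape(-1, d)

        # Pick the top_k experts per token and renormalise their gate weights.
        scores = F.softmax(self.router(flat), dim=-1)
        gate, idx = scores.topk(self.top_k, dim=-1)
        gate = gate / gate.sum(dim=-1, keepdim=True)

        out = self.shared(flat)
        for e, expert in enumerate(self.experts):
            # Rows (tokens) and top-k slots where expert e was selected.
            token_pos, slot = (idx == e).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue
            weight = gate[token_pos, slot].unsqueeze(-1)
            out = out.index_add(0, token_pos, weight * expert(flat[token_pos]))
        return out.reshape(b, s, d)


layer = TinyMoE()
y = layer(torch.randn(2, 16, 128))
print(y.shape)
```

Each token runs through the shared expert and only two of the four routed experts, so per-token compute stays well below what the total parameter count suggests.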

What can you fine-tune it for?

TinyV4 is a blank canvas. Some ideas:

  • Translation (ID ↔ EN): it already has bilingual foundations
  • Text classification: sentiment, topic, intent
  • Story generation: fine-tune on your own narrative dataset
  • Chat / instruction following: add conversation data
  • Code generation: yes, even at 11M, it can learn patterns
  • Domain-specific tasks (medical, legal, technical): your data, your model

The point is: you control the final model. TinyV4 just gives you a running start.

Quick Start

pip install transformers safetensors torch

Load the base model

from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

# Tied weights are counted once by model.parameters()
print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")

Generate text (zero-shot)

import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's max sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        # Logits for the final position, scaled by temperature
        logits = logits[:, -1, :] / temperature

        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)

        # Sample the next token and append it to the sequence
        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

        # Stop early at end-of-sequence
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
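
The sampling step inside `generate` can be tried on its own, without downloading the model. The sketch below pulls the temperature and top-k logic into a standalone helper and runs it on made-up logits. The name `sample_top_k` and the toy values are illustrative, not part of TinyV4.

```python
import torch


def sample_top_k(logits, temperature=0.8, top_k=40):
    """Sample one token id per row from the top_k highest-scoring logits."""
    logits = logits / temperature
    k = min(top_k, logits.size(-1))  # guard against tiny vocabularies
    kth_value = torch.topk(logits, k).values[:, [-1]]
    filtered = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)


torch.manual_seed(0)
fake_logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])
draws = torch.cat([sample_top_k(fake_logits, top_k=2) for _ in range(200)])
print(sorted(draws.unique().tolist()))  # only ids 0 and 1 can appear
```

Lower `temperature` sharpens the distribution toward the best-scoring token, while a smaller `top_k` cuts off the long tail of unlikely tokens.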

Fine-tune for your task

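from torch.optim import AdamW
from safetensors.torch import save_model

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task.
# your_dataloader and compute_your_loss are placeholders you supply.
for batch in your_dataloader:
    optimizer.zero_grad()
    # Forward pass returns three values: logits, MTP logits, and a balancing loss
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()

# Save your fine-tuned model.
# save_model handles the tied embed/head weights; save_file rejects
# state dicts whose tensors share memory.
save_model(model, "my-finetuned-model.safetensors")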
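
For language-modeling-style fine-tuning, the loss placeholder in the loop could be filled in with a shifted next-token cross-entropy. The sketch below is one possible version, not the loss used to train TinyV4. The name `causal_lm_loss` and the `bal_weight` value of 0.01 are assumptions, and the MTP logits are ignored because their layout is not documented.

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, input_ids, bal_loss=None, bal_weight=0.01):
    """Next-token cross-entropy, optionally plus a weighted auxiliary loss."""
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    if bal_loss is not None:
        loss = loss + bal_weight * bal_loss  # bal_weight is a guess; tune it
    return loss


# Smoke test with random tensors standing in for real model output.
vocab = 32_000
ids = torch.randint(0, vocab, (2, 16))
fake_logits = torch.randn(2, 16, vocab)
loss_value = causal_lm_loss(fake_logits, ids, bal_loss=torch.tensor(0.5))
print(loss_value.item())
```

In the training loop this would be called as `causal_lm_loss(logits, batch, bal_loss)`, assuming `batch` is a tensor of token ids with shape (batch, seq).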

Comparison: Sub-100M Base Models

Let's be honest: most base models under 100M parameters are either:

  • Distilled from larger models (not truly small)
  • Overly specialized (can't adapt to new tasks)
  • Poorly architected (waste parameters on the wrong things)

TinyV4 is different. At 11M parameters, it delivers:

  • Real bilingual understanding: not just token overlap
  • MoE efficiency: 4 experts, 2 active, more capacity per parameter
  • Proven adaptability: fine-tunes well across diverse tasks
  • Zero-shot generation: coherent output without any task-specific training

We're not saying 11M beats 1B. We're saying that at this size, nothing else gives you this much to work with.

Pre-training Details

| Metric | Value |
| --- | --- |
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |
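
A schedule like the one in this table can be reproduced with PyTorch's `LambdaLR`. The sketch below is one way to set it up, not the original training script. The step count and weight decay come from the table, but the warmup length, peak learning rate, and minimum-LR ratio are assumptions because the source does not give them.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 5_000   # from the table
WARMUP_STEPS = 200    # assumption: not stated in the source
PEAK_LR = 3e-4        # assumption: borrowed from the fine-tuning example
MIN_LR_RATIO = 0.1    # assumption: decay to 10% of the peak


def lr_factor(step):
    """Multiplier on PEAK_LR: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine


params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = AdamW(params, lr=PEAK_LR, weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)

for step in range(TOTAL_STEPS):
    optimizer.step()   # a real loop would compute a loss and call backward() first
    scheduler.step()

final_lr = scheduler.get_last_lr()[0]
print(f"final lr: {final_lr:.2e}")
```

The same `lr_factor` function can be reused for fine-tuning by changing `TOTAL_STEPS` to match the length of your own run.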

Limitations

Be realistic about what 11M parameters can do:

  • Zero-shot output will be basic: this is a base model, not a finished product
  • Long-form coherence requires fine-tuning with appropriate data
  • Domain expertise needs your data: it won't magically know medical terms or legal jargon
  • Reasoning is limited: complex logical chains need more parameters

Think of TinyV4 as the best possible starting point at 11M. Not the finish line.

License

MIT: use it, modify it, ship it. No attribution required (but appreciated).

Citation

@misc{tinyv4-11m,
  title  = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
  year   = {2025},
  url    = {https://huggingface.co/ukung/tinyv4}
}