mouy (មួយ)

mouy (meaning "one" in Khmer) is the first truly native Khmer language model. Unlike many existing models, which are fine-tuned from Western or multilingual base models, mouy was built and pretrained entirely from scratch by Cambodians, for Cambodia.

🚀 Key Features

100% Pretrained From Scratch: This model was not fine-tuned on top of an existing English-centric LLM. It was trained from the ground up natively on Khmer text data, giving it a deep, foundational grasp of the language's syntax and nuances.
Custom 5K Khmer Tokenizer: Built with a custom 5,000-token vocabulary optimized for the Khmer script. This avoids the "token tax" of general-purpose Western tokenizers, which tend to split Khmer text into many small pieces, so the same text needs fewer tokens and inference is cheaper.
Basic QA Capability: Ready out of the box for basic Question-Answering (QA) tasks and text generation in Khmer.
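
The "token tax" can be made concrete with a small calculation. The sketch below does not use mouy's tokenizer. It assumes a worst-case byte-level fallback in which every UTF-8 byte becomes one token, and the helper name `bytes_per_char` is hypothetical. Khmer code points (U+1780 to U+17FF) each take three bytes in UTF-8, so such a fallback spends three tokens per character before any merges apply.

```python
def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per code point: the token cost per character
    under a pure byte-level fallback (an assumed worst case)."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# The Khmer word for "hello", written as escapes so the code points are explicit.
khmer_hello = "\u179f\u17bd\u179f\u17d2\u178f\u17b8"
english_hello = "hello"

print(bytes_per_char(khmer_hello))    # 3.0
print(bytes_per_char(english_hello))  # 1.0
```

A vocabulary trained on Khmer text can map whole syllables or words to single tokens, which is where the saving comes from.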

📚 Training Data & Dataset Curation

mouy was pretrained on a curated corpus of 1.361 billion tokens (counted with the model's custom tokenizer).

To ensure high-quality generation and robust linguistic understanding, the dataset was built using:

FineWeb-2 (Khmer Extension): Leveraging the massive, cleaned web-scale data from the FineWeb-2 initiative as our core foundation.
Custom Cleanup Pipelines: We applied proprietary filtering and deduplication methods tailored to the Khmer script. This process stripped out machine-translated gibberish, HTML noise, and non-Khmer text, leaving text that reflects authentic Khmer usage.
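
The actual pipeline is proprietary and not described in detail, so the following is only a minimal sketch of the kind of script-based filtering and exact deduplication mentioned above. The function names, the 0.7 threshold, and the whitespace normalization are all assumptions made for illustration.

```python
import hashlib
import re

# Khmer (U+1780-U+17FF) and Khmer Symbols (U+19E0-U+19FF) Unicode blocks.
_KHMER = re.compile(r"[\u1780-\u17ff\u19e0-\u19ff]")
_WHITESPACE = re.compile(r"\s+")


def khmer_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Khmer script."""
    visible = _WHITESPACE.sub("", text)
    if not visible:
        return 0.0
    return len(_KHMER.findall(visible)) / len(visible)


def clean_corpus(docs, min_ratio=0.7):
    """Yield documents that are mostly Khmer, dropping exact duplicates.

    min_ratio is a hypothetical threshold, not the value used for mouy.
    """
    seen = set()
    for doc in docs:
        if khmer_ratio(doc) < min_ratio:
            continue  # mostly non-Khmer text or leftover markup
        # Hash whitespace-normalized text so trivial spacing changes collapse.
        normalized = _WHITESPACE.sub(" ", doc).strip()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield doc
```

Real Khmer text often contains zero-width spaces (U+200B), which this sketch counts as non-Khmer characters; a production filter would likely handle them explicitly.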

🏗️ Model Architecture

mouy uses a deep, narrow Gemma-style autoregressive decoder. Where many lightweight models cut depth to reduce parameter count, mouy keeps 28 layers to help capture long-range structural dependencies in Khmer, and holds the hidden dimension to 512 to limit memory use and latency.

Key Architectural Features

Grouped Query Attention (GQA): Uses 8 query heads that share 2 key-value (KV) heads. The KV cache is therefore a quarter of the size it would be with 8 KV heads, which lowers memory use during generation and leaves room for larger batch sizes.
GeGLU Activation: The feed-forward network (MLP) uses a GELU-gated linear unit (the gate_proj output is passed through GELU and multiplied elementwise with the up_proj output before the down projection), a design that has been reported to improve quality over ungated ReLU or GELU feed-forward layers.
Rotary Position Embeddings (RoPE): Implements dynamic rotary embeddings to inject positional context directly into the attention mechanism, supporting a context window of up to 2,048 tokens.
Root Mean Square Normalization (RMSNorm): Applied at both the input and post-attention stages of every decoder layer to stabilize gradient flow and speed up training convergence. Unlike standard LayerNorm, it skips mean-centering, which makes it slightly cheaper to compute.
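
To put the GQA saving in numbers, the sketch below estimates KV-cache size from the layer count, KV heads, and head dimension listed for mouy. It assumes bfloat16 cache entries (2 bytes each) and a batch size of 1, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=2, head_dim=64,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for seq_len tokens.

    The leading factor of 2 covers one key and one value tensor per layer.
    bytes_per_value=2 assumes a bfloat16 cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


gqa = kv_cache_bytes(2048)                  # 2 KV heads, as in mouy
mha = kv_cache_bytes(2048, num_kv_heads=8)  # hypothetical: one KV head per query head

print(gqa / 2**20, "MiB with 2 KV heads")  # 28.0
print(mha / 2**20, "MiB with 8 KV heads")  # 112.0
```

At the full 2,048-token context, sharing KV heads reduces the cache from 112 MiB to 28 MiB per sequence under these assumptions.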

📊 Hyperparameters at a Glance

Hyperparameter	Value	Description
Parameters	~100M	Total trainable parameter count
Layers (`num_hidden_layers`)	28	Deep transformer stack for complex linguistic hierarchy
Hidden Size (`hidden_size`)	512	Width of the embedding and hidden states
Intermediate Size	2,048	Dimension of the GeGLU feed-forward layer
Attention Heads ($Q$)	8	Number of query heads
Key-Value Heads ($K, V$)	2	Grouped Query Attention (GQA) configuration
Head Dimension	64	Dimension per attention head
Context Length (`max_position_embeddings`)	2,048	Maximum sequence token window
Vocabulary Size	5,000	Custom localized Khmer-optimized vocabulary
RoPE Theta	10,000.0	Base frequency for rotary position embeddings
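
As a sanity check on the ~100M figure, the parameter count can be estimated from the table. This sketch assumes a Gemma-style layout: bias-free linear layers, two RMSNorm weight vectors per layer plus a final norm, and an output head tied to the input embedding. None of these details is confirmed here, so treat the result as an estimate.

```python
def estimate_params(vocab=5000, hidden=512, layers=28, intermediate=2048,
                    q_heads=8, kv_heads=2, head_dim=64, tied_embeddings=True):
    """Rough parameter count for a Gemma-style decoder (assumed layout)."""
    attn = (hidden * q_heads * head_dim          # q_proj
            + 2 * hidden * kv_heads * head_dim   # k_proj and v_proj
            + q_heads * head_dim * hidden)       # o_proj
    mlp = 3 * hidden * intermediate              # gate_proj, up_proj, down_proj
    norms = 2 * hidden                           # input and post-attention RMSNorm
    per_layer = attn + mlp + norms
    embed = vocab * hidden
    head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embed + final_norm + head


total = estimate_params()
print(f"{total:,}")  # 109,019,648
```

Under these assumptions the 28 decoder layers account for almost all of the weights, while the 5,000-token embedding table contributes only about 2.6M parameters.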

🛠️ How to Use

You can load mouy with the Hugging Face transformers library:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Specify the model repository path
model_id = "attentionlab/mouy"

# 2. Load the custom tokenizer and optimized model weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)

# 3. Build an example prompt ("Hello, can you ..." in Khmer) for text generation
prompt = "សួស្តី តើអ្នកអាច"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 4. Sample up to 50 new tokens (temperature 0.7, nucleus sampling with top_p 0.9)
with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

📊 Benchmarks

⏳ Coming Soon: We are preparing a benchmark suite tailored to evaluating the model on formal, informal, and historical Khmer text. Results will be published here.

🤝 Acknowledgments & Authors

This model is a proud step forward for the Cambodian AI ecosystem, developed independently by local researchers and engineers to push the boundaries of Khmer Natural Language Processing (NLP).

Model weights: Safetensors format, 0.1B parameters, F32 tensor type.
