mouy (មួយ)
mouy (meaning "one" in Khmer) is the first truly native Khmer language model. Unlike many existing models that are simply fine-tuned versions of Western or multilingual architectures, mouy was built and pretrained entirely from scratch by Cambodians, for Cambodia.
Key Features
- 100% Pretrained From Scratch: This model was not fine-tuned on top of an existing English-centric LLM. It was trained from the ground up natively on Khmer text data, giving it a deep, foundational grasp of the language's syntax and nuances.
- Custom 5K Khmer Tokenizer: Built with a custom-engineered 5,000-token vocabulary optimized specifically for the Khmer script. This eliminates the "token tax" commonly found in Western tokenizers, making inference significantly faster and more efficient for Khmer text.
- Basic QA Capability: Ready out of the box for basic Question-Answering (QA) tasks and text generation in Khmer.
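To make the "token tax" concrete: every Khmer code point takes three bytes in UTF-8, so a tokenizer that falls back to raw bytes for a script it rarely saw can spend several tokens on a single character. The snippet below only measures that byte overhead for the word មួយ. It does not call mouy's tokenizer, and the one-token-per-byte worst case is an illustrative assumption, not a measurement of any particular model.

```python
# Illustrative only: measures UTF-8 overhead, not any real tokenizer's output.
word = "មួយ"  # "one" in Khmer

n_code_points = len(word)                  # Unicode code points in the word
n_utf8_bytes = len(word.encode("utf-8"))   # bytes a byte-level fallback would see

print(f"code points: {n_code_points}, UTF-8 bytes: {n_utf8_bytes}")
```

A vocabulary trained on Khmer text can instead assign whole syllables or words to single tokens, which is the effect the custom 5K tokenizer is aiming for.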
Training Data & Dataset Curation
mouy was pretrained on a highly curated corpus consisting of 1.361 billion tokens (as processed by the model's custom tokenizer).
To ensure high-quality generation and robust linguistic understanding, the dataset was built using:
- FineWeb-2 (Khmer Extension): Leveraging the massive, cleaned web-scale data from the FineWeb-2 initiative as our core foundation.
- Custom Cleanup Pipelines: We applied rigorous, proprietary filtering and deduplication methods tailored specifically to the Khmer script. This process stripped out machine-translated gibberish, HTML noise, and non-Khmer text, leaving a pristine dataset representing authentic language use.
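The actual cleanup pipeline is not published here, but one common building block for this kind of filtering is a script-ratio check. The sketch below is a hypothetical example, not the authors' method: the function names and the 0.5 threshold are assumptions chosen for illustration, and it only considers the main Khmer Unicode block (U+1780 to U+17FF).

```python
def khmer_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Khmer block (U+1780-U+17FF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    khmer = sum(1 for c in chars if "\u1780" <= c <= "\u17ff")
    return khmer / len(chars)


def keep_document(text: str, min_ratio: float = 0.5) -> bool:
    """Hypothetical rule: keep a document only if it is mostly Khmer script."""
    return khmer_ratio(text) >= min_ratio


samples = ["សួស្តី", "<div>hello world</div>"]
kept = [s for s in samples if keep_document(s)]
print(kept)
```

A real pipeline would combine a check like this with deduplication and quality heuristics; a script ratio alone cannot detect machine-translated text.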
Model Architecture
mouy is built on a deep-but-narrow, Gemma-style autoregressive decoder architecture. While many lightweight models sacrifice depth to reduce parameter counts, mouy prioritizes architectural depth (28 layers) to capture complex, long-range structural dependencies in the Khmer language, while keeping the hidden dimension small (512) so the model stays fast and memory-efficient.
Key Architectural Features
- Grouped Query Attention (GQA): Features 8 attention heads for queries but scales down to 2 heads for Keys and Values (KV). This significantly cuts down the KV-cache memory footprint during generation, allowing for faster inference and larger batch sizes.
- GeGLU Activation: The feed-forward network (MLP) utilizes Gated Linear Units with GELU activation functions (`gate_proj` paired with `up_proj` before projecting down), which has been shown to offer superior semantic representation over standard ReLU or vanilla GELU.
- Rotary Position Embeddings (RoPE): Implements dynamic rotary embeddings to inject positional context directly into the attention mechanism, supporting a context window of up to 2,048 tokens.
- Root Mean Square Normalization (RMSNorm): Applied at both the input and post-attention stages of every decoder layer to stabilize gradient flows and speed up training convergence without the computational overhead of standard LayerNorm.
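The KV-cache saving from Grouped Query Attention can be checked with simple arithmetic. The sketch below uses the layer count, head counts, head dimension, and context length stated in this card; the 2 bytes per value assumes a bfloat16 cache, and `kv_cache_bytes` is a hypothetical helper written for this example, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the key/value cache: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# mouy's configuration: 28 layers, 2 KV heads, head dimension 64, 2,048 tokens.
gqa_bytes = kv_cache_bytes(num_layers=28, num_kv_heads=2, head_dim=64, seq_len=2048)

# Hypothetical comparison: the same model with one KV head per query head (8).
mha_bytes = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=64, seq_len=2048)

print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")  # 28 MiB
print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB")  # 112 MiB
```

Under these assumptions, sharing each KV head across four query heads shrinks the full-context cache by a factor of four per sequence.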
Hyperparameters at a Glance
| Hyperparameter | Value | Description |
|---|---|---|
| Parameters | ~100M | Total trainable parameter count |
| Layers (`num_hidden_layers`) | 28 | Deep transformer stack for complex linguistic hierarchy |
| Hidden Size (`hidden_size`) | 512 | Width of the embedding and hidden states |
| Intermediate Size | 2,048 | Dimension of the GeGLU feed-forward layer |
| Attention Heads ($Q$) | 8 | Number of query heads |
| Key-Value Heads ($K, V$) | 2 | Grouped Query Attention (GQA) configuration |
| Head Dimension | 64 | Dimension per attention head |
| Context Length (`max_position_embeddings`) | 2,048 | Maximum sequence token window |
| Vocabulary Size | 5,000 | Custom localized Khmer-optimized vocabulary |
| Rope Theta | 10,000.0 | Base frequency for rotary position embeddings |
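As a sanity check on the "~100M" figure, the parameter count can be estimated from the table. The sketch below assumes a typical Gemma-style layout that this card does not confirm: no bias terms, input and output embeddings tied, two RMSNorm weight vectors per layer, and one final norm. The variable names are chosen for this example only.

```python
# Values taken from the hyperparameter table.
hidden_size = 512
num_layers = 28
intermediate_size = 2048
num_q_heads, num_kv_heads, head_dim = 8, 2, 64
vocab_size = 5000

# Attention projections (assumed bias-free): q, k, v, and output.
attn = (hidden_size * num_q_heads * head_dim          # q_proj
        + 2 * hidden_size * num_kv_heads * head_dim   # k_proj and v_proj
        + num_q_heads * head_dim * hidden_size)       # o_proj

# GeGLU MLP: gate_proj, up_proj, and down_proj.
mlp = 3 * hidden_size * intermediate_size

# Two RMSNorm weight vectors per layer (input and post-attention).
norms = 2 * hidden_size

per_layer = attn + mlp + norms
embeddings = vocab_size * hidden_size  # assumed shared with the output head
total = num_layers * per_layer + embeddings + hidden_size  # plus final norm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")  # about 109M under these assumptions
```

If the output head were not tied to the embeddings, the estimate would grow by another 2.56M parameters; either way the result is consistent with the rounded figure in the table, and the 5,000-token vocabulary accounts for only a small share of the total.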
How to Use
You can load and run mouy with the Hugging Face `transformers` library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Specify the model repository path
model_id = "attentionlab/mouy"

# 2. Load the custom tokenizer and optimized model weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Format an example prompt (Basic QA / Text Generation)
prompt = "សួស្តី តើអ្នកអាច"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 4. Generate sequences natively
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Benchmarks
Coming soon. We are currently preparing a comprehensive benchmark suite tailored specifically to evaluate the model's performance on formal, informal, and historical Khmer text structures. Results will be published here shortly.
Acknowledgments & Authors
This model is a proud step forward for the Cambodian AI ecosystem, developed independently by local researchers and engineers to push the boundaries of Khmer Natural Language Processing (NLP).