
mouy (មួយ)

mouy (meaning "one" in Khmer) is the first truly native Khmer language model. Unlike many existing models, which are fine-tuned versions of Western or multilingual models, mouy was built and pretrained entirely from scratch by Cambodians, for Cambodia.


🚀 Key Features

  • 100% Pretrained From Scratch: This model was not fine-tuned on top of an existing English-centric LLM. It was trained from the ground up on Khmer text data, so its grasp of Khmer syntax and nuance is learned directly from the language rather than adapted from another one.
  • Custom 5K Khmer Tokenizer: Uses a custom 5,000-token vocabulary optimized specifically for the Khmer script. This reduces the "token tax" that Khmer text incurs under tokenizers trained mostly on Western languages, which tend to split Khmer words into many small pieces. The same text needs fewer tokens, which makes inference faster and more efficient.
  • Basic QA Capability: Ready out of the box for basic Question-Answering (QA) tasks and text generation in Khmer.
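
As a rough illustration of where the "token tax" comes from, the sketch below assumes a tokenizer that falls back to raw UTF-8 bytes for characters missing from its vocabulary. This is an assumption about other tokenizers, not a measurement of any specific one, and `utf8_bytes_per_char` is a hypothetical helper. Every character in the Khmer Unicode block takes 3 bytes in UTF-8, so such a fallback can spend up to three tokens per Khmer character.

```python
# Illustrative only: worst-case cost of Khmer text under byte-level fallback.
# The Khmer word for "hello", written with escapes so the code points are explicit.
word = "\u179f\u17bd\u179f\u17d2\u178f\u17b8"  # សួស្តី


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


khmer_ratio = utf8_bytes_per_char(word)
ascii_ratio = utf8_bytes_per_char("hello")

print(f"Khmer: {khmer_ratio} bytes/char, ASCII: {ascii_ratio} bytes/char")
```

A vocabulary built from Khmer text can instead cover whole syllables or words with single tokens, which is the effect the custom tokenizer is aiming for.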

📚 Training Data & Dataset Curation

mouy was pretrained on a highly curated corpus consisting of 1.361 billion tokens (as processed by the model's custom tokenizer).

To ensure high-quality generation and robust linguistic understanding, the dataset was built using:

  • FineWeb-2 (Khmer Extension): Leveraging the massive, cleaned web-scale data from the FineWeb-2 initiative as our core foundation.
  • Custom Cleanup Pipelines: We applied rigorous, proprietary filtering and deduplication methods tailored specifically to the Khmer script. This process stripped out machine-translated gibberish, HTML noise, and non-Khmer text, leaving a pristine dataset representing authentic language use.

πŸ—οΈ Model Architecture

mouy uses a deep-yet-narrow, Gemma-style autoregressive decoder architecture. While many lightweight models sacrifice depth to reduce parameter counts, mouy prioritizes depth (28 layers) to capture the complex, long-range structural dependencies of the Khmer language, and keeps the hidden dimension lean (512) so the model stays fast and memory-efficient.

Key Architectural Features

  • Grouped Query Attention (GQA): Uses 8 attention heads for queries but only 2 heads for Keys and Values (KV), so each KV head is shared by 4 query heads. This cuts the KV-cache memory footprint during generation to a quarter of what 8 KV heads would need, allowing for faster inference and larger batch sizes.
  • GeGLU Activation: The feed-forward network (MLP) uses Gated Linear Units with a GELU activation: the output of gate_proj passes through GELU and is multiplied element-wise with the output of up_proj before projecting down. Gated feed-forward layers of this kind have been reported to outperform ungated ReLU or GELU layers in Transformer language models.
  • Rotary Position Embeddings (RoPE): Implements dynamic rotary embeddings to inject positional context directly into the attention mechanism, supporting a context window of up to 2,048 tokens.
  • Root Mean Square Normalization (RMSNorm): Applied at both the input and post-attention stages of every decoder layer to stabilize gradient flows and speed up training convergence without the computational overhead of standard LayerNorm.
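
To make the GQA saving concrete, here is a back-of-the-envelope sketch of the KV-cache size at the full 2,048-token context. It assumes bfloat16 storage (2 bytes per value) and a batch size of 1. `kv_cache_bytes` is a hypothetical helper written for this example and is not part of the model code.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one tensor for keys and one for values in each layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Values taken from the hyperparameter table: 28 layers, head dimension 64.
gqa = kv_cache_bytes(num_layers=28, num_kv_heads=2, head_dim=64, seq_len=2048)
# Hypothetical baseline with one KV head per query head (8 KV heads).
mha = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=64, seq_len=2048)

print(f"GQA (2 KV heads): {gqa / 2**20:.0f} MiB")
print(f"8 KV heads:       {mha / 2**20:.0f} MiB")
print(f"Reduction:        {mha // gqa}x")
```

Under these assumptions the cache is 28 MiB per sequence instead of 112 MiB, and the saving scales linearly with batch size.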

📊 Hyperparameters at a Glance

| Hyperparameter | Value | Description |
|---|---|---|
| Parameters | ~100M | Total trainable parameter count |
| Layers (num_hidden_layers) | 28 | Deep transformer stack for complex linguistic hierarchy |
| Hidden Size (hidden_size) | 512 | Width of the embedding and hidden states |
| Intermediate Size | 2,048 | Dimension of the GeGLU feed-forward layer |
| Attention Heads ($Q$) | 8 | Number of query heads |
| Key-Value Heads ($K, V$) | 2 | Grouped Query Attention (GQA) configuration |
| Head Dimension | 64 | Dimension per attention head |
| Context Length (max_position_embeddings) | 2,048 | Maximum sequence token window |
| Vocabulary Size | 5,000 | Custom localized Khmer-optimized vocabulary |
| RoPE Theta | 10,000.0 | Base frequency for rotary position embeddings |
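
The parameter count can be sanity-checked from the other values in the table. The sketch below assumes a Gemma-style layout that this card does not spell out: no bias terms, two RMSNorm weight vectors per layer plus a final norm, and an output head tied to the input embedding. `count_params` is a hypothetical helper for this estimate.

```python
def count_params(vocab: int, hidden: int, intermediate: int, layers: int,
                 q_heads: int, kv_heads: int, head_dim: int,
                 tie_embeddings: bool = True) -> int:
    """Estimate trainable parameters of a bias-free GQA + GeGLU decoder."""
    q_dim = q_heads * head_dim
    kv_dim = kv_heads * head_dim
    # q_proj and o_proj are hidden x q_dim; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden * q_dim + 2 * hidden * kv_dim
    # gate_proj, up_proj and the down projection are each hidden x intermediate.
    mlp = 3 * hidden * intermediate
    # Input and post-attention RMSNorm weights.
    norms = 2 * hidden
    per_layer = attention + mlp + norms

    embedding = vocab * hidden
    total = layers * per_layer + embedding + hidden  # + final RMSNorm
    if not tie_embeddings:
        total += embedding  # separate output projection
    return total


total = count_params(vocab=5000, hidden=512, intermediate=2048, layers=28,
                     q_heads=8, kv_heads=2, head_dim=64)
print(f"{total:,} parameters")
```

Under these assumptions the estimate comes to about 109 million parameters, in line with the ~100M figure in the table. Only about 2.6 million of those sit in the embedding matrix, since the vocabulary is small.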

🛠️ How to Use

You can load mouy and generate text with the Hugging Face transformers library.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Specify the model repository path
model_id = "attentionlab/mouy"

# 2. Load the custom tokenizer and optimized model weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)

# 3. Format an example prompt (Basic QA / Text Generation)
prompt = "សួស្តី តើអ្នកអាច"  # "Hello, can you"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 4. Generate a sampled continuation of up to 50 new tokens
with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

📊 Benchmarks

⏳ Coming Soon: We are currently preparing a comprehensive benchmark suite tailored specifically to evaluate the model's performance on formal, informal, and historical Khmer text structures. Results will be published here shortly.


🤝 Acknowledgments & Authors

This model is a proud step forward for the Cambodian AI ecosystem, developed independently by local researchers and engineers to push the boundaries of Khmer Natural Language Processing (NLP).
