Keural-14.8B-Base (Stage 1 Checkpoint, Step 80K)
Status: Early-stage pretraining checkpoint. Not a finished model. This is a research preview of an ongoing training run, shared for transparency.
Model Overview
| Property | Value |
|---|---|
| Architecture | Mixtral-style MoE (Mixture of Experts) |
| Total Parameters | 14.83B |
| Active Parameters per Token | ~3.7B (top-2 of 8 experts) |
| Context Length | 4,096 tokens |
| Languages | English, Korean (primary) |
| Training Stage | Stage 1 Pretraining (step 80K / 100K) |
| License | Apache 2.0 |
| Precision | bfloat16 |
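The "top-2 of 8 experts" entry above can be illustrated with a minimal routing sketch. It assumes the usual Mixtral-style formulation (pick the two largest router logits, then softmax-normalize over just those two); the card does not show Keural's actual router code, and the logits below are hypothetical values chosen for illustration.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    Assumption: Mixtral-style routing, where the selected experts' weights
    are a softmax over only the selected logits.
    """
    # Indices of the two largest router logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:2]
    # Softmax restricted to the selected experts (max-subtracted for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router logits for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
routing = top2_route(logits)
print(routing)  # experts 1 and 7 are selected; their weights sum to 1
```

Only the two selected experts run their feed-forward block for that token, which is why the active parameter count is much smaller than the total.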
Architecture Details
| Parameter | Value |
|---|---|
| Layers | 24 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| KV heads (GQA) | 8 |
| Head dim | 128 |
| Experts per layer | 8 |
| Active experts per token | 2 (top-2 routing) |
| FFN type | SwiGLU |
| Positional encoding | RoPE (θ = 500,000) |
| Attention | Alternating full causal + sliding window (512) |
| Norm | RMSNorm (ε = 1e-5) |
| Vocab size | 131,072 |
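To make the GQA numbers concrete, the sketch below estimates KV-cache size from the table values. It is an upper bound under stated assumptions: every layer caches the full context (a runtime may cache only 512 tokens for the sliding-window layers), values are stored in bfloat16, and the function name `kv_cache_bytes` is hypothetical.

```python
LAYERS = 24
KV_HEADS = 8          # GQA: 8 KV heads shared by 32 query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # bfloat16
CONTEXT = 4096

def kv_cache_bytes(num_tokens, kv_heads=KV_HEADS):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both a key and a value per position.
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * num_tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(CONTEXT)
mha_context = kv_cache_bytes(CONTEXT, kv_heads=32)  # hypothetical non-GQA variant

print(per_token / 1024)       # 96.0 (KiB per token)
print(full_context / 2**20)   # 384.0 (MiB per 4,096-token sequence)
print(mha_context / full_context)  # 4.0
```

Under these assumptions, using 8 KV heads instead of 32 cuts the cache to a quarter of what full multi-head attention would need.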
Training Details
Hardware
- GPUs: 2× NVIDIA H200 (150 GB VRAM each)
- Parallelism: FSDP (Fully Sharded Data Parallel)
- Precision: bfloat16 with gradient checkpointing
Hyperparameters
| Parameter | Value |
|---|---|
| Batch size | 4 per GPU |
| Gradient accumulation | 8 steps |
| Effective batch size | 64 sequences |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| LR schedule | Cosine decay |
| Warmup steps | 2,000 |
| Weight decay | 0.1 |
| Gradient clip | 1.0 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, fused) |
| Sequence length | 4,096 tokens |
| Total steps | 100,000 (this checkpoint: step 80,000) |
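The schedule rows above can be sketched as a single function. This assumes linear warmup from zero and a cosine decay that spans the remaining steps down to the minimum learning rate; the card names only "cosine decay" and the warmup length, so the exact warmup shape and decay horizon are assumptions.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2_000
TOTAL_STEPS = 100_000

def learning_rate(step):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Assumed: linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

for step in (0, 2_000, 50_000, 80_000, 100_000):
    print(step, learning_rate(step))
```

Under these assumptions the learning rate at step 80,000 is already close to the minimum, since roughly 80% of the decay phase has elapsed.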
Loss Curve
| Step | Loss |
|---|---|
| 10 | 12.68 |
| 2,000 | ~3.5 |
| 15,000 | 2.64 |
| 33,000 | 1.37 |
| 50,000 | 1.79 |
| 56,000 | 1.11 |
| 70,000 | 1.06 |
| 78,000 | 0.85 |
| 80,000 | ~0.85 |
Training Data (Stage 1)
| Domain | Source | Tokens |
|---|---|---|
| English | FineWeb (HuggingFaceFW) | 30B |
| Code | The Stack v1 (BigCode) | 8B |
| Science | arXiv | 3.5B |
| Science | PubMed | 2.4B |
| Korean | Wikipedia-ko | 0.5B |
| Korean | Korean-Webtext (HAERAE) | 2.2B |
| Korean | WanJuan-Korean | 3.0B |
| Korean | CC-100 Korean | 0.16B |
| Literature | PG-19 | 0.45B |
| Total | | ~50B raw / ~70B packed |
Binary dataset: 158 shards, 15.76M sequences, 95.1% sequence utilization.
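As a rough cross-check, the batch settings and the sequence count can be combined to estimate how much of the binary dataset a given step has consumed. This is arithmetic derived from the tables in this card, not a logged figure: it assumes every optimizer step consumes exactly one effective batch of full-length sequences, with no restarts or re-sampling.

```python
PER_GPU_BATCH = 4
NUM_GPUS = 2
GRAD_ACCUM_STEPS = 8
SEQ_LEN = 4096
TOTAL_SEQUENCES = 15_760_000  # 15.76M sequences in the binary dataset

# 4 per GPU x 2 GPUs x 8 accumulation steps = 64 sequences per optimizer step.
effective_batch = PER_GPU_BATCH * NUM_GPUS * GRAD_ACCUM_STEPS
tokens_per_step = effective_batch * SEQ_LEN

def sequences_seen(step):
    """Sequences consumed after `step` optimizer steps (assumed: no restarts)."""
    return step * effective_batch

def dataset_fraction(step):
    """Share of the binary dataset consumed after `step` optimizer steps."""
    return sequences_seen(step) / TOTAL_SEQUENCES

print(effective_batch)            # 64
print(tokens_per_step)            # 262144
print(sequences_seen(80_000))     # 5120000
print(round(dataset_fraction(80_000), 3))
```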
Tokenizer: Custom SentencePiece model trained on Korean + English + code corpus. Vocab size: 131,072.
Known Limitations
This is a raw pretraining checkpoint, not an instruction-tuned or RLHF'd model. It has significant known issues:
- Data quality: Stage 1 training data contains unfiltered web content, including HTML artifacts ([content7], <table>), spam, and low-quality Korean web pages. This directly affects output quality.
- Korean outputs: May produce brand spam, gambling content, or HTML artifacts, all of which trace back to noisy Korean web data in the training set.
- No instruction following: This is a base language model. It continues text, it does not follow instructions or answer questions in a chat format.
- Not safety-tuned: No RLHF, DPO, or safety filtering has been applied.
- Incomplete training: This checkpoint is at step 80K of a planned 100K-step run. Training was still ongoing at upload time.
Stage 2 pretraining with cleaner data (FineWeb-edu, FineWeb2-Korean, HPLT-Korean) is planned before instruction tuning.
Tokenizer
Custom SentencePiece tokenizer with a 131,072-token vocabulary, trained on a multilingual corpus (Korean + English + code). It uses the LlamaTokenizer interface for HuggingFace compatibility.
Special tokens:
- <s> (BOS) → ID 1
- </s> (EOS) → ID 2
- <unk> → ID 0
Usage
With vLLM
pip install vllm
vllm serve mkd-chanwoo/keural-14.8b-base --dtype bfloat16 --max-model-len 4096
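Once the server is running, it can be queried over vLLM's OpenAI-compatible completions endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default `localhost:8000`; the helper names `build_completion_request` and `complete` are hypothetical. Because this is a base model, it uses the plain completions route rather than a chat route.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: vLLM default host and port
MODEL_ID = "mkd-chanwoo/keural-14.8b-base"

def build_completion_request(prompt, max_tokens=200, temperature=0.7):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt):
    """Send the prompt to the running server and return the continuation text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# With the server running:
# print(complete("The future of artificial intelligence is"))
```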
With Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "mkd-chanwoo/keural-14.8b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Korean prompt meaning "The future of artificial intelligence is"
inputs = tokenizer("인공지능의 미래는", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Text Generation with Sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.5,  # recommended: reduces repetition loops
    do_sample=True,
)
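For intuition about what `repetition_penalty` does, here is a simplified pure-Python sketch of the usual formulation: logits of tokens that have already appeared are divided by the penalty when positive and multiplied by it when negative, so they become less likely either way. This mirrors the common behaviour of repetition-penalty logits processing but is not the Transformers implementation itself; the function name and the toy logits are hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.5):
    """Return a copy of `logits` with previously seen tokens penalized."""
    adjusted = list(logits)
    for tok in set(seen_token_ids):
        score = adjusted[tok]
        # Positive scores shrink toward zero; negative scores move further down.
        adjusted[tok] = score * penalty if score < 0 else score / penalty
    return adjusted

# Toy vocabulary of 4 tokens; tokens 0 and 1 were already generated.
logits = [3.0, -1.0, 0.5, 2.0]
seen = [0, 1, 0]
penalized = apply_repetition_penalty(logits, seen)
print(penalized)  # [2.0, -1.5, 0.5, 2.0]
```

With a penalty of 1.0 the logits are unchanged; larger values suppress repeats more aggressively, at the cost of also discouraging legitimate repetition.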
Model Card Metadata
- Model type: Causal language model, MoE
- Training regime: Pretraining only (no SFT, no RLHF)
- Checkpoint step: 80,000
- Converted from: Native Keural .pt format → HuggingFace Mixtral-compatible safetensors
- Conversion: Weights remapped to MixtralForCausalLM schema for vLLM/transformers compatibility
Citation
@misc{keural2026,
title = {Keural: A Korean-English Mixture-of-Experts Language Model},
author = {mkd-chanwoo},
year = {2026},
url = {https://huggingface.co/mkd-chanwoo/keural-14.8b-base}
}
Roadmap
- Stage 1 pretraining (50B tokens, mixed-quality data)
- Stage 1 completion (100K steps)
- Stage 2 pretraining (70B clean tokens: FineWeb-edu + FineWeb2-Korean + HPLT-Korean)
- Supervised Fine-Tuning (SFT)
- Preference alignment (DPO/RLHF)
- Evaluation on Korean benchmarks (KoBEST, KLUE)