Keural-SFT3-Final — 14.83B Bilingual MoE (SFT Epoch 3)

Keural is a 14.83B-parameter Mixture-of-Experts (MoE) language model trained from scratch for bilingual Korean–English instruction following.

This checkpoint is the result of SFT Epoch 3 (65,849 steps, 2.35M samples) and serves as the base model for DPO Round 2. The final preference-optimised model is mkd-hossain/keural-dpo2-final.

Architecture

Property	Value
Architecture	KeuralMoECausalLM
Parameters	14.83B total / ~7.42B active per token
Layers	24
Hidden size	4,096
Attention heads	32 Q / 8 KV (GQA)
Head dimension	128
Experts	8 total, top-2 per token
Expert intermediate size	5,632 (SwiGLU)
Context length	4,096 tokens
Vocabulary	131,074 (131,072 SPM + `<|im_start|>`, `<|im_end|>`)
RoPE theta	500,000
Sliding window	512 tokens (even layers only)
Normalization	RMSNorm (eps=1e-5)
Dtype	bfloat16

KeuralMoECausalLM is a custom architecture. Load it with trust_remote_code=True so that transformers uses the modeling code shipped in the model repository.
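As a sanity check on the table above, the total parameter count can be estimated from the listed dimensions. This is a rough sketch, not the model's own code: it assumes tied input/output embeddings, no bias terms, one linear router per layer, and two RMSNorm weights per layer plus a final norm, none of which the card states. Under those assumptions the estimate lands at about 14.83B.

```python
# Rough parameter estimate from the architecture table.
# Assumptions (not stated in the card): tied embeddings, no biases,
# one linear router per layer, two RMSNorms per layer plus a final norm.

HIDDEN = 4096
LAYERS = 24
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
EXPERTS = 8
EXPERT_INTERMEDIATE = 5632
VOCAB = 131_074


def estimate_total_params() -> int:
    q_width = Q_HEADS * HEAD_DIM    # 4096
    kv_width = KV_HEADS * HEAD_DIM  # 1024 (GQA shares K/V across query heads)
    # Q and O projections use the full width; K and V use the smaller GQA width.
    attention = 2 * HIDDEN * q_width + 2 * HIDDEN * kv_width
    # A SwiGLU expert has gate, up, and down projections.
    expert = 3 * HIDDEN * EXPERT_INTERMEDIATE
    router = HIDDEN * EXPERTS
    norms = 2 * HIDDEN
    per_layer = attention + EXPERTS * expert + router + norms
    embeddings = VOCAB * HIDDEN     # counted once because of the tied-embedding assumption
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


total = estimate_total_params()
print(f"{total / 1e9:.2f}B")
```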

Special Tokens

Token	ID	Purpose
`<|im_start|>`	131072	Start of each conversation turn
`<|im_end|>`	131073	End of turn; use as eos_token_id
`<bos>`	1	Beginning of sequence
`<eos>`	2	Not used for chat
`<pad>`	0	Padding

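Critical: Always set eos_token_id=131073 (`<|im_end|>`) when generating. Chat turns end with `<|im_end|>`, and `<eos>` (ID 2) is not used in chat, so stopping only on ID 2 can let generation run past the end of the assistant turn.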
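If token IDs come from a code path where the stop token was not applied, the assistant turn can still be recovered by cutting at the first `<|im_end|>`. The helper below is a hypothetical illustration (`truncate_at_eos` is not part of the model repository) and works on a plain list of token IDs.

```python
IM_END_ID = 131073  # <|im_end|>, from the Special Tokens table


def truncate_at_eos(token_ids, eos_id=IM_END_ID):
    """Return the tokens before the first eos_id, or all tokens if it never appears."""
    for index, token in enumerate(token_ids):
        if token == eos_id:
            return token_ids[:index]
    return token_ids


# Illustrative IDs only: everything from the first <|im_end|> onward is dropped.
trimmed = truncate_at_eos([17, 42, 131073, 2, 99])
print(trimmed)
```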

Full Training Pipeline

Stage	Steps	Tokens	Data	Hardware
Pretraining Stage 1	100,000	~50B	Korean + English web corpus	2× H200 SXM
Pretraining Stage 2	120,000	~19B	Korean + English web corpus	2× H200 SXM
SFT Epoch 1	18,000	~710M	710K instruction samples (9 sources)	2× H200 SXM
DPO Round 1	6,927	—	440K preference pairs (6 sources)	2× H200 SXM
SFT Epoch 2	29,112	~7.6B	710K filtered samples	2× H200 SXM
SFT Epoch 3	65,849	~17.3B	2.35M samples (12 sources)	2× H200 SXM

SFT Epoch 3 Dataset (2,351,212 samples)

Source	Samples	Language
OpenHermes-2.5	1,001,551	English
SlimOrca	517,982	English
UltraChat	193,212	English
OpenOrca	138,639	English
AIHub multisession sci	127,868	Korean
AIHub daily conversation	120,867	Korean
AIHub multisession social	85,346	Korean
Alpaca	46,303	English
KoInstruct QA	45,299	Korean
KoInstruct base	42,276	Korean
KoAlpaca	21,091	Korean
AIHub expert QA	10,778	Korean
Total	2,351,212
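The language balance of this mix follows directly from the table. The short script below sums the per-source counts by language; the figures are copied from the table, and the variable and function names are illustrative. It gives roughly 81% English and 19% Korean by sample count.

```python
# Per-source sample counts and languages, copied from the table above.
SFT3_SOURCES = {
    "OpenHermes-2.5": (1_001_551, "English"),
    "SlimOrca": (517_982, "English"),
    "UltraChat": (193_212, "English"),
    "OpenOrca": (138_639, "English"),
    "AIHub multisession sci": (127_868, "Korean"),
    "AIHub daily conversation": (120_867, "Korean"),
    "AIHub multisession social": (85_346, "Korean"),
    "Alpaca": (46_303, "English"),
    "KoInstruct QA": (45_299, "Korean"),
    "KoInstruct base": (42_276, "Korean"),
    "KoAlpaca": (21_091, "Korean"),
    "AIHub expert QA": (10_778, "Korean"),
}


def samples_by_language(sources):
    """Sum sample counts per language."""
    totals = {}
    for count, language in sources.values():
        totals[language] = totals.get(language, 0) + count
    return totals


totals = samples_by_language(SFT3_SOURCES)
grand_total = sum(totals.values())
for language, count in sorted(totals.items()):
    print(f"{language}: {count:,} ({count / grand_total:.1%})")
```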

SFT Epoch 3 Hyperparameters

Hyperparameter	Value
Learning rate	2e-5 → 2e-6 cosine decay
Warmup steps	100
Effective batch size	64 (micro-batch 2 × 16 gradient-accumulation steps × 2 GPUs)
Max sequence length	2,048 tokens
Total steps	65,849
Optimizer	AdamW (β1=0.9, β2=0.95, ε=1e-8)
Gradient clipping	1.0
Hardware	2× NVIDIA H200 SXM (143 GiB each)
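The schedule and batch-size rows can be written out as a small sketch. The warmup shape is an assumption (linear from zero), since the table only gives the number of warmup steps, and the function name is illustrative rather than taken from the training code.

```python
import math

PEAK_LR = 2e-5
MIN_LR = 2e-6
WARMUP_STEPS = 100
TOTAL_STEPS = 65_849

# Effective batch size: per-GPU micro-batch x accumulation steps x GPUs.
EFFECTIVE_BATCH = 2 * 16 * 2


def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 50, 100, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.2e}")
```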

Chat Format (ChatML)

<|im_start|>system
You are a helpful, accurate, and safe bilingual Korean-English AI assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
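For reference, the layout above can be reproduced with plain string formatting. This is a minimal sketch that assumes exactly the layout shown (a newline after the role and after each `<|im_end|>`). The tokenizer's own chat template remains the authoritative source and may differ in details such as a leading `<bos>`.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful, accurate, and safe bilingual Korean-English AI assistant."},
    {"role": "user", "content": "Your question here"},
])
print(prompt)
```

In practice, prefer tokenizer.apply_chat_template, which applies the template shipped with the model.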

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# trust_remote_code=True is required because KeuralMoECausalLM is a custom architecture.
tokenizer = AutoTokenizer.from_pretrained(
    "mkd-hossain/keural-sft3-final",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mkd-hossain/keural-sft3-final",
    torch_dtype=torch.bfloat16,  # matches the checkpoint dtype
    device_map="auto",           # place weights on the available GPU(s)
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful bilingual Korean-English AI assistant."},
    {"role": "user",   "content": "안녕하세요! 오늘 날씨가 어때요?"},
]
# Render the conversation as ChatML and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    eos_token_id=131073,  # <|im_end|>; do not use ID 2 for chat
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Next Model

DPO Round 2 has been completed on top of this checkpoint using 485,793 preference pairs (8 sources). The final preference-optimised model is available at: mkd-hossain/keural-dpo2-final

License

Apache 2.0

Training data includes datasets from AI Hub (Korean government open data platform) and publicly available English instruction datasets. All sources are Apache 2.0 or CC-BY compatible.

Model tree for mkd-hossain/keural-sft3-final

Base model: mkd-hossain/keural-sft-18k
Finetuned: mkd-hossain/keural-dpo-final
Finetuned: this model
Finetunes of this model: 3 models