Keural-SFT3-Final: 14.83B Bilingual MoE (SFT Epoch 3)

Keural is a 14.83B-parameter Mixture-of-Experts language model trained from scratch for bilingual Korean–English instruction following.

This checkpoint is the result of SFT Epoch 3 (65,849 steps, 2.35M samples) and serves as the base model for DPO Round 2. The final preference-optimised model is mkd-hossain/keural-dpo2-final.


Architecture

| Property | Value |
|---|---|
| Architecture | KeuralMoECausalLM |
| Parameters | 14.83B total / ~7.42B active per token |
| Layers | 24 |
| Hidden size | 4,096 |
| Attention heads | 32 Q / 8 KV (GQA) |
| Head dimension | 128 |
| Experts | 8 total, top-2 per token |
| Expert intermediate size | 5,632 (SwiGLU) |
| Context length | 4,096 tokens |
| Vocabulary | 131,074 (131,072 SPM + `< |
| RoPE theta | 500,000 |
| Sliding window | 512 tokens (even layers only) |
| Normalization | RMSNorm (eps=1e-5) |
| Dtype | bfloat16 |

KeuralMoECausalLM is a custom architecture whose modeling code ships with the repository, so loading it requires trust_remote_code=True.
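As a sanity check on the figures in the architecture table, the total parameter count can be estimated from the listed dimensions. This is a rough sketch, not the model's actual code: it assumes no bias terms, tied input/output embeddings, one linear router per layer, and two RMSNorm layers per block plus a final norm. None of these details are stated in this card.

```python
# Rough parameter estimate from the architecture table.
# Assumptions (not confirmed by the model card): no biases, tied
# input/output embeddings, a linear router per layer, two RMSNorms per layer.
HIDDEN = 4096
LAYERS = 24
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
EXPERTS = 8
EXPERT_FFN = 5632
VOCAB = 131_074


def estimate_total_params() -> int:
    # Attention: Q and O projections use all 32 heads; K and V use 8 heads (GQA).
    attn = 2 * HIDDEN * Q_HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # SwiGLU expert: gate, up, and down projections.
    expert = 3 * HIDDEN * EXPERT_FFN
    router = HIDDEN * EXPERTS
    norms = 2 * HIDDEN
    per_layer = attn + EXPERTS * expert + router + norms
    embeddings = VOCAB * HIDDEN  # counted once under the tying assumption
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


total = estimate_total_params()
print(f"{total / 1e9:.2f}B")  # prints 14.83B
```

Under these assumptions the estimate agrees with the stated 14.83B total; untied embeddings would add roughly another 0.54B.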


Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|im_start\|>` | 131072 | Start of each conversation turn |
| `<\|im_end\|>` | 131073 | End of turn; use as eos_token_id |
| `<bos>` | 1 | Beginning of sequence |
| `<eos>` | 2 | Not used for chat |
| `<pad>` | 0 | Padding |

Critical: Always set eos_token_id=131073. Do not use ID 2 for chat generation.


Full Training Pipeline

| Stage | Steps | Tokens | Data | Hardware |
|---|---|---|---|---|
| Pretraining Stage 1 | 100,000 | ~50B | Korean + English web corpus | 2× H200 SXM |
| Pretraining Stage 2 | 120,000 | ~19B | Korean + English web corpus | 2× H200 SXM |
| SFT Epoch 1 | 18,000 | ~710M | 710K instruction samples (9 sources) | 2× H200 SXM |
| DPO Round 1 | 6,927 | - | 440K preference pairs (6 sources) | 2× H200 SXM |
| SFT Epoch 2 | 29,112 | ~7.6B | 710K filtered samples | 2× H200 SXM |
| SFT Epoch 3 | 65,849 | ~17.3B | 2.35M samples (12 sources) | 2× H200 SXM |

SFT Epoch 3 Dataset (2,351,212 samples)

| Source | Samples | Language |
|---|---|---|
| OpenHermes-2.5 | 1,001,551 | English |
| SlimOrca | 517,982 | English |
| UltraChat | 193,212 | English |
| OpenOrca | 138,639 | English |
| AIHub multisession sci | 127,868 | Korean |
| AIHub daily conversation | 120,867 | Korean |
| AIHub multisession social | 85,346 | Korean |
| Alpaca | 46,303 | English |
| KoInstruct QA | 45,299 | Korean |
| KoInstruct base | 42,276 | Korean |
| KoAlpaca | 21,091 | Korean |
| AIHub expert QA | 10,778 | Korean |
| Total | 2,351,212 | |
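The per-source counts can be aggregated to see the language balance of the SFT Epoch 3 mix. The snippet below is an illustrative tally of the numbers in the table; the dictionary name and helper function are hypothetical and not part of the training code.

```python
# Sample counts copied from the SFT Epoch 3 dataset table.
SFT3_SOURCES = {
    "OpenHermes-2.5": (1_001_551, "English"),
    "SlimOrca": (517_982, "English"),
    "UltraChat": (193_212, "English"),
    "OpenOrca": (138_639, "English"),
    "AIHub multisession sci": (127_868, "Korean"),
    "AIHub daily conversation": (120_867, "Korean"),
    "AIHub multisession social": (85_346, "Korean"),
    "Alpaca": (46_303, "English"),
    "KoInstruct QA": (45_299, "Korean"),
    "KoInstruct base": (42_276, "Korean"),
    "KoAlpaca": (21_091, "Korean"),
    "AIHub expert QA": (10_778, "Korean"),
}


def language_totals(sources):
    """Sum sample counts per language."""
    totals = {}
    for samples, language in sources.values():
        totals[language] = totals.get(language, 0) + samples
    return totals


totals = language_totals(SFT3_SOURCES)
grand_total = sum(totals.values())
for language, count in totals.items():
    print(f"{language}: {count:,} ({count / grand_total:.1%})")
```

The per-language sums add up to the stated total of 2,351,212, with English making up roughly four fifths of the samples (1,897,687) and Korean about one fifth (453,525).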

SFT Epoch 3 Hyperparameters

| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-5 → 2e-6 (cosine decay) |
| Warmup steps | 100 |
| Effective batch size | 64 (2 × 16 accum × 2 GPUs) |
| Max sequence length | 2,048 tokens |
| Total steps | 65,849 |
| Optimizer | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
| Gradient clipping | 1.0 |
| Hardware | 2× NVIDIA H200 SXM (143 GiB each) |
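The learning-rate schedule and effective batch size above can be illustrated with a small sketch. The exact schedule implementation is not given in this card, so the linear warmup shape and the function name `lr_at` are assumptions; only the peak, floor, warmup length, and step count come from the table.

```python
import math

PEAK_LR = 2e-5       # from the table
MIN_LR = 2e-6        # from the table
WARMUP_STEPS = 100   # from the table
TOTAL_STEPS = 65_849 # from the table


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumes linear warmup to PEAK_LR followed by cosine decay to MIN_LR;
    the warmup shape is an assumption, not confirmed by the model card.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


# Effective batch size: per-GPU micro-batch x accumulation steps x GPUs.
effective_batch = 2 * 16 * 2

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

With these values the rate peaks at 2e-5 right after warmup and settles at 2e-6 on the final step, and the effective batch size works out to 64 sequences per optimizer step.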

Chat Format (ChatML)

<|im_start|>system
You are a helpful, accurate, and safe bilingual Korean-English AI assistant.<|im_end|>
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
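
For cases where the prompt has to be built by hand (for example, when serving without the tokenizer's chat template), the layout above can be reproduced with a small helper. This is a minimal sketch: `format_chatml` is a hypothetical function, and the newline placement is inferred from the example above, so the template shipped with the tokenizer remains the authoritative reference.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} messages as a ChatML prompt string."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful, accurate, and safe bilingual Korean-English AI assistant."},
    {"role": "user", "content": "Your question here"},
])
print(prompt)
```

Generation should then stop at `<|im_end|>` (ID 131073), as noted in the Special Tokens section.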

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "mkd-hossain/keural-sft3-final",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mkd-hossain/keural-sft3-final",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful bilingual Korean-English AI assistant."},
    # Korean: "Hello! How is the weather today?"
    {"role": "user",   "content": "안녕하세요! 오늘 날씨가 어때요?"},
]
# Render the conversation as ChatML and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    eos_token_id=131073,  # <|im_end|>; do not use ID 2 for chat
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Next Model

DPO Round 2 has been completed on top of this checkpoint using 485,793 preference pairs (8 sources). The final preference-optimised model is available at: mkd-hossain/keural-dpo2-final


License

Apache 2.0

Training data includes datasets from AI Hub (Korean government open data platform) and publicly available English instruction datasets. All sources are Apache 2.0 or CC-BY compatible.
