Keural-SFT-14.83B
Keural is a bilingual Korean-English Mixture-of-Experts language model trained from scratch. This is the SFT (Supervised Fine-Tuning) checkpoint at step 18,000, fine-tuned from the Keural stage-2 pretrained base using the ChatML instruction format.
Model Details
| Property | Value |
|---|---|
| Architecture | Mixtral-style MoE (8 experts, top-2) |
| Parameters | 14.83B total / ~7.42B active per token |
| Layers | 24 |
| Hidden size | 4096 |
| Attention heads | 32 (GQA, 8 KV heads) |
| Expert intermediate size | 5632 |
| Context length | 4096 tokens |
| Vocabulary | 131,074 (131,072 SPM + `<\|im_start\|>` + `<\|im_end\|>`) |
| RoPE theta | 500,000 |
| Sliding window | 512 (every other layer) |
| Dtype | bfloat16 |
| Languages | Korean, English |
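As a rough cross-check, the total parameter count can be reproduced from the other rows of the table. The sketch below assumes a Mixtral-style layer layout (three projection matrices per expert, a linear router, two RMSNorm weight vectors per layer plus a final norm), no bias terms, and input and output embeddings that share weights. None of these details are stated in this card, so treat the breakdown as an estimate rather than the checkpoint's actual layout.

```python
# Rough parameter count from the Model Details table.
# Assumptions (not confirmed by the card): Mixtral-style layers, no biases,
# tied input/output embeddings, one weight vector per RMSNorm.

LAYERS = 24
HIDDEN = 4096
HEADS = 32
KV_HEADS = 8
EXPERT_INTERMEDIATE = 5632
EXPERTS = 8
VOCAB = 131_074


def total_parameters() -> int:
    head_dim = HIDDEN // HEADS                 # 128
    kv_dim = KV_HEADS * head_dim               # 1024, grouped-query attention
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q, o and k, v
    expert = 3 * HIDDEN * EXPERT_INTERMEDIATE  # gate, up, down projections
    router = HIDDEN * EXPERTS                  # one logit per expert
    norms = 2 * HIDDEN                         # pre-attention and pre-MoE norms
    per_layer = attention + EXPERTS * expert + router + norms
    embeddings = VOCAB * HIDDEN                # counted once because tied
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


total = total_parameters()
print(f"{total / 1e9:.2f}B")  # prints 14.83B
```

Under these assumptions the experts account for roughly 13.3B of the total, which is why the expert intermediate size dominates the model's footprint.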
Training Pipeline
| Stage | Steps | Data |
|---|---|---|
| Pretraining Stage 1 | 100,000 | Korean + English web corpus |
| Pretraining Stage 2 | 20,000 | Korean + English web corpus (continued) |
| SFT (this checkpoint) | 18,000 | mkd-chanwoo/keural-SFT (1.14M ChatML samples) |
SFT hyperparameters: learning rate 1e-5 → 1e-6 (cosine decay), effective batch size 64 (4 per GPU × 8 gradient-accumulation steps × 2 GPUs), max_seq 4096, weight_decay 0.05, trained on 2× H200 SXM with FSDP FULL_SHARD.
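To make the schedule concrete, here is a minimal sketch of a cosine decay between the two stated learning rates, together with the effective batch arithmetic. It assumes the decay spans all 18,000 SFT steps with no warmup phase; the card does not say whether warmup was used, so the constant `TOTAL_STEPS` and the function `cosine_lr` are illustrative only.

```python
import math

PEAK_LR = 1e-5
FINAL_LR = 1e-6
TOTAL_STEPS = 18_000  # assumed: decay covers the whole SFT run, no warmup


def cosine_lr(step: int) -> float:
    """Learning rate at a given step under a plain cosine decay."""
    progress = min(max(step, 0), TOTAL_STEPS) / TOTAL_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


# Per-GPU batch x gradient-accumulation steps x number of GPUs
effective_batch = 4 * 8 * 2

print(cosine_lr(0), cosine_lr(TOTAL_STEPS // 2), cosine_lr(TOTAL_STEPS))
print(effective_batch)
```

At the halfway point this schedule sits at the midpoint of the two rates, 5.5e-6.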
Chat Format (ChatML)
This model uses ChatML format. You must use this exact format for good results.
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
안녕하세요! 오늘 날씨가 어때요?<|im_end|>
<|im_start|>assistant
The model generates until it produces <|im_end|> (token ID 131073).
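The layout above can be produced from a list of role/content messages with a few lines of plain Python. This is a sketch only: the helper name `build_chatml_prompt` is hypothetical, and it assumes a single newline after each `<|im_end|>`, as in the example. The chat template shipped with the tokenizer remains the authoritative definition.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render role/content dicts as a ChatML prompt string."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model writes the reply next.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about Seoul."},
]
print(build_chatml_prompt(messages))
```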
How to Use
With transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mkd-hossain/keural-sft-18k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful bilingual Korean-English assistant."},
    # "Please tell me how to sort a list in Python."
    {"role": "user", "content": "파이썬에서 리스트를 정렬하는 방법을 알려주세요."},
]

# Render the ChatML prompt and open the assistant turn
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
        eos_token_id=131073,  # <|im_end|>
    )

# Decode only the newly generated tokens, then cut at the end-of-turn marker
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
response = response.split("<|im_end|>")[0].strip()
print(response)
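For multi-turn chats, the trimmed reply has to be appended to the message history before the next call to `apply_chat_template`. The helpers below are a minimal sketch of that bookkeeping; `extract_reply` and `append_turn` are hypothetical names, not part of transformers.

```python
IM_END = "<|im_end|>"


def extract_reply(decoded: str) -> str:
    """Keep only the text before the first end-of-turn marker."""
    return decoded.split(IM_END, 1)[0].strip()


def append_turn(messages: list, decoded: str) -> list:
    """Return a new history with the assistant reply added."""
    return messages + [{"role": "assistant", "content": extract_reply(decoded)}]


history = [{"role": "user", "content": "Tell me about Seoul."}]
raw = "Seoul is the capital of South Korea.<|im_end|>\n<|im_start|>user"
history = append_turn(history, raw)
history.append({"role": "user", "content": "How many people live there?"})
print(history)
```

The updated `history` can then be passed to `tokenizer.apply_chat_template` exactly as `messages` is in the example above.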
With vLLM (recommended for serving)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model mkd-hossain/keural-sft-18k \
--tokenizer mkd-hossain/keural-sft-18k \
--dtype bfloat16 \
--max-model-len 4096 \
--tensor-parallel-size 1
Then call the OpenAI-compatible endpoint:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="mkd-hossain/keural-sft-18k",
    messages=[
        {"role": "system", "content": "You are a helpful bilingual assistant."},
        # "What is the capital of Korea?"
        {"role": "user", "content": "한국의 수도는 어디인가요?"},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
With vLLM on multiple GPUs
python -m vllm.entrypoints.openai.api_server \
--model mkd-hossain/keural-sft-18k \
--dtype bfloat16 \
--max-model-len 4096 \
--tensor-parallel-size 2
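When choosing a tensor-parallel size, a quick weights-only estimate helps. The sketch below uses the 14.83B parameter count and 2 bytes per bfloat16 weight, and assumes the weights split evenly across GPUs. It ignores the KV cache, activations, and framework overhead, so real usage per GPU will be higher; `weight_gib_per_gpu` is a hypothetical helper.

```python
TOTAL_PARAMS = 14.83e9  # from the Model Details table
BYTES_PER_PARAM = 2     # bfloat16


def weight_gib_per_gpu(tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU in GiB, assuming an even split."""
    total_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
    return total_bytes / tensor_parallel_size / 2**30


for tp in (1, 2):
    print(f"tensor-parallel-size {tp}: ~{weight_gib_per_gpu(tp):.1f} GiB of weights per GPU")
```

This works out to roughly 27.6 GiB on one GPU and about 13.8 GiB per GPU with two.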
Manual ChatML prompt (without apply_chat_template)
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Tell me about Seoul.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
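A hand-built prompt like this is already formatted, so it belongs on a raw text-completion endpoint rather than the chat endpoint, which would apply the template a second time. The sketch below only builds the JSON body for an OpenAI-style `/v1/completions` request. It assumes a vLLM server started as shown earlier and listening on `localhost:8000`; `completion_payload` is a hypothetical helper, and the sampling values simply mirror the ones used elsewhere in this card.

```python
import json


def completion_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Request body for an OpenAI-compatible text-completion endpoint."""
    return {
        "model": "mkd-hossain/keural-sft-18k",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stop": ["<|im_end|>"],  # end the reply at the ChatML end-of-turn marker
    }


prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Tell me about Seoul.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
body = json.dumps(completion_payload(prompt), ensure_ascii=False)
print(body)  # POST this to http://localhost:8000/v1/completions (assumed address)
```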
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<\|im_start\|>` | 131072 | Start of a chat turn |
| `<\|im_end\|>` | 131073 | End of a chat turn (generation stop token) |
| `<bos>` | 1 | Beginning of sequence |
| `<eos>` | 2 | End of sequence |
| `<pad>` | 0 | Padding |
Important: Always set `eos_token_id=131073` (`<|im_end|>`) when generating. If you use `eos_token_id=2` (`<eos>`), generation may not stop correctly.
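Rather than hard-coding the ID, it can be looked up from the tokenizer and checked against the documented value. The following is a minimal sketch; `resolve_im_end_id` is a hypothetical helper that works with any object exposing the Hugging Face `convert_tokens_to_ids` method.

```python
EXPECTED_IM_END_ID = 131073  # value documented in this card


def resolve_im_end_id(tokenizer) -> int:
    """Look up <|im_end|> and fail loudly if it differs from the documented ID."""
    token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    if token_id != EXPECTED_IM_END_ID:
        raise ValueError(
            f"<|im_end|> maps to {token_id}, expected {EXPECTED_IM_END_ID}"
        )
    return token_id


# Usage (requires the tokenizer to be downloaded first):
#   tokenizer = AutoTokenizer.from_pretrained("mkd-hossain/keural-sft-18k")
#   model.generate(**inputs, eos_token_id=resolve_im_end_id(tokenizer))
```

A mismatch usually means a different tokenizer was loaded than the one this checkpoint was trained with.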
Recommended Generation Settings
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "do_sample": True,
    "eos_token_id": 131073,
}
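For factual or deterministic tasks, use greedy decoding with `do_sample=False`; `temperature` and the other sampling parameters then have no effect. To keep sampling but reduce randomness, lower the temperature to about 0.1 instead.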
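The two modes can be kept side by side with a small helper. This is a sketch under one assumption about usage: the result is passed to `model.generate(**inputs, **kwargs)`. In transformers, sampling parameters such as `temperature`, `top_p`, and `top_k` take effect only when `do_sample=True`, so the deterministic variant leaves them out. `generation_kwargs` is a hypothetical name.

```python
BASE_KWARGS = {
    "max_new_tokens": 512,
    "repetition_penalty": 1.1,
    "eos_token_id": 131073,  # <|im_end|>
}


def generation_kwargs(deterministic: bool = False) -> dict:
    """Return generate() keyword arguments for greedy or sampled decoding."""
    if deterministic:
        # Greedy decoding: sampling parameters would be ignored, so omit them.
        return {**BASE_KWARGS, "do_sample": False}
    return {
        **BASE_KWARGS,
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 50,
    }


print(generation_kwargs(deterministic=True))
print(generation_kwargs())
```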
Limitations
- SFT training loss plateaued at ~1.96 (comparable models reach ~1.3 to 1.6). The model follows instructions but may produce repetitive or off-topic responses on complex prompts.
- The pretraining corpus contains Korean web data which skews the style toward informal language.
- Maximum context is 4096 tokens. Inputs longer than this will be truncated.
- This is an intermediate checkpoint; a DPO-aligned version will be released separately.
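Because the prompt and the generated reply share the 4096-token window noted in the list above, it can help to budget the reply length before generating. The helper below is a minimal sketch; `max_new_tokens_for` is a hypothetical name, and the prompt length is assumed to come from the tokenizer, for example `inputs.input_ids.shape[1]`.

```python
MAX_CONTEXT = 4096  # context length from the Model Details table


def max_new_tokens_for(prompt_tokens: int, requested: int = 512) -> int:
    """Clamp the reply budget so prompt + reply fit in the context window."""
    remaining = MAX_CONTEXT - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"Prompt uses {prompt_tokens} tokens; nothing is left of {MAX_CONTEXT}"
        )
    return min(requested, remaining)


print(max_new_tokens_for(1000))  # 512: the full request fits
print(max_new_tokens_for(4000))  # 96: only the remainder of the window
```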
License
Apache 2.0