LFM2.5-8B-A1B-KO-SFT

Korean full-parameter SFT continuation of LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL, based on LiquidAI/LFM2.5-8B-A1B.

Status

Stage2 is the main KO-SFT model line and has been uploaded to this repository. Stage3 Agentic/Fable training is a separate follow-up model line under LLM-OS-Models/LFM2.5-8B-A1B-KO-Agentic-SFT.

The first selected full benchmark run shows that this Stage2 SFT checkpoint is not a blanket improvement over Base/CPT. It preserves or recovers a few axes, but it is weak on multiple-choice likelihood-style Korean benchmarks. Treat the numbers below as a diagnostic snapshot for the Stage2 SFT checkpoint, not as the final Agentic model report.

| stage | status | samples | tokens | max seq | note |
|---|---|---|---|---|---|
| Stage0 legal | completed | 8,747 | 35,068,923 | 8192 | Korean legal source/bar-style warmup |
| Stage0b finance/Text2SQL | completed/uploaded | 280,000 | 58,090,087 | 4096 | 8 x H200 full SFT, 2,188 planned steps |
| Stage1 4k finance/Text2SQL | completed/uploaded | 2,302,304 | 1,285,864,494 | 4096 | 8 x H200 full SFT |
| Stage1 8k legal/terminal | completed/uploaded | 1,600,835 | 1,658,848,754 | 8192 | legal long-context and terminal/tool behavior |
| Stage2 diverse KO/SWE/reasoning | completed | 1,467,864 | 1,364,349,642 | 4096 | excludes raw CPT corpora |
| Stage2 plus KoTSQA | completed/uploaded | 1,468,598 | 1,364,863,776 | 4096 | main KO-SFT checkpoint; adds KoTSQA train split only |
| Stage3 Agentic/Fable | separate repo / running | 3,943 | 7,124,298 | 8192 | Fable5/Helio + local doc/log grounded SFT |

Current staged main SFT total is about 4.309577B tokens:

  • Stage1 4k finance/Text2SQL: 1.286B tokens
  • Stage1 8k legal/terminal: 1.659B tokens
  • Stage2 diverse plus KoTSQA: 1.364864B tokens
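
The total can be re-derived from the exact token counts in the stage table above. A minimal check, where the dictionary keys are labels chosen here for illustration and are not names from the project code:

```python
# Token counts copied from the stage table; the keys are illustrative labels.
stage_tokens = {
    "stage1_4k_finance_text2sql": 1_285_864_494,
    "stage1_8k_legal_terminal": 1_658_848_754,
    "stage2_diverse_plus_kotsqa": 1_364_863_776,
}

# Sum the three stages that make up the staged main SFT line.
total_tokens = sum(stage_tokens.values())
print(f"{total_tokens:,} tokens = {total_tokens / 1e9:.6f}B")
```

Stage0, Stage0b, and Stage3 are left out of the sum, which matches the three bullets above.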

Stage2 Selected Full Benchmark Snapshot

Evaluation was run with vLLM/lm-eval on the uploaded Stage2 full checkpoint. Base and CPT reference values are copied from the CPT model card for the same task axes. KMMLU direct hard STEM failed once during a crowded vLLM queue and is marked as pending rather than reported here.

| task | metric | Base | CPT | KO-SFT Stage2 | SFT vs Base | SFT vs CPT |
|---|---|---|---|---|---|---|
| IFEval | prompt loose acc | 0.2921 | 0.3216 | 0.1738 | -0.1183 | -0.1478 |
| Leaderboard IFEval | prompt loose acc | 0.2902 | 0.3457 | 0.1756 | -0.1146 | -0.1701 |
| GSM8K | exact match | 0.4845 | 0.5701 | 0.3381 | -0.1464 | -0.2320 |
| BoolQ | acc | 0.6544 | 0.7902 | 0.6664 | +0.0120 | -0.1238 |
| ARC-Challenge | acc_norm | 0.3771 | 0.4241 | 0.2287 | -0.1484 | -0.1954 |
| PIQA | acc_norm | 0.7203 | 0.7476 | 0.5930 | -0.1273 | -0.1546 |
| Global MMLU KO medical genetics | acc | 0.2900 | 0.3800 | 0.3000 | +0.0100 | -0.0800 |
| Global MMLU KO nutrition | acc | 0.2549 | 0.3203 | 0.2157 | -0.0392 | -0.1046 |
| Global MMLU KO philosophy | acc | 0.2669 | 0.3215 | 0.1994 | -0.0675 | -0.1221 |
| Global MMLU KO miscellaneous | acc | 0.3372 | 0.3921 | 0.2401 | -0.0971 | -0.1520 |
| Global MMLU KO professional medicine | acc | 0.3235 | 0.2316 | 0.1838 | -0.1397 | -0.0478 |
| Global MMLU KO high school statistics | acc | 0.2870 | 0.1574 | 0.2222 | -0.0648 | +0.0648 |
| Global MMLU KO astronomy | acc | 0.3421 | 0.2829 | 0.1974 | -0.1447 | -0.0855 |
| Global MMLU KO high school computer science | acc | 0.3100 | 0.2800 | 0.2800 | -0.0300 | +0.0000 |
| Global MMLU KO jurisprudence | acc | 0.2870 | 0.2685 | 0.2593 | -0.0277 | -0.0092 |
| KMMLU direct hard | exact match | 0.2015 | 0.1720 | 0.1055 | -0.0960 | -0.0665 |
| MMLU-ProX Lite KO | exact match | 0.2585 | 0.1667 | 0.0867 | -0.1718 | -0.0800 |
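
The two delta columns are plain differences, SFT minus Base and SFT minus CPT. A small sketch that recomputes them for three rows copied from the table; the helper name `deltas` is introduced here only for illustration:

```python
# (task, base, cpt, sft) copied from three rows of the table above.
rows = [
    ("GSM8K exact match", 0.4845, 0.5701, 0.3381),
    ("BoolQ acc", 0.6544, 0.7902, 0.6664),
    ("KMMLU direct hard exact match", 0.2015, 0.1720, 0.1055),
]


def deltas(base, cpt, sft):
    """Return (SFT - Base, SFT - CPT), rounded to four decimals."""
    return round(sft - base, 4), round(sft - cpt, 4)


for task, base, cpt, sft in rows:
    vs_base, vs_cpt = deltas(base, cpt, sft)
    print(f"{task}: vs Base {vs_base:+.4f}, vs CPT {vs_cpt:+.4f}")
```

A negative value means the Stage2 SFT checkpoint scored below the reference model on that task.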

Interpretation:

  • Stage2 SFT preserved only a small subset of public benchmark axes. BoolQ (+0.0120) and Global MMLU KO medical genetics (+0.0100) are slightly above Base, and high school statistics recovers part of the CPT regression (0.1574 to 0.2222, against a Base of 0.2870).
  • Korean multiple-choice and exact-answer tasks are mostly below Base/CPT. This suggests the SFT mix improved conversation/domain behavior more than likelihood-style option selection.
  • The next SFT data mix should add explicit Korean MCQA formats: question, choices, answer-only labels, and short rationales with the final option separated. This is especially important for KMMLU, Global MMLU KO, and MMLU-ProX style evaluation.
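
One possible shape for such MCQA records is sketched below. This is only an illustration of the idea in the last bullet: the function name `format_mcqa`, the record schema, the `정답:` ("answer:") marker, and the sample question are all assumptions made here, not the project's actual data format.

```python
from typing import List, Optional

OPTION_LABELS = "ABCDEFGHIJ"  # supports up to ten options


def format_mcqa(question: str, choices: List[str], answer_index: int,
                rationale: Optional[str] = None) -> dict:
    """Build one chat-style MCQA training record (hypothetical schema)."""
    if len(choices) > len(OPTION_LABELS):
        raise ValueError("too many choices")
    if not 0 <= answer_index < len(choices):
        raise ValueError("answer_index out of range")
    labels = OPTION_LABELS[: len(choices)]
    choice_lines = [f"{label}. {text}" for label, text in zip(labels, choices)]
    prompt = "\n".join([question, *choice_lines, "정답:"])
    final = labels[answer_index]
    # Answer-only label by default; with a rationale, the final option
    # stays on its own last line so it can be parsed separately.
    response = f"{rationale}\n정답: {final}" if rationale else final
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]}


# Made-up example: "What is the capital of South Korea?"
record = format_mcqa("대한민국의 수도는 어디인가?", ["부산", "서울", "인천", "대구"], 1)
print(record["messages"][1]["content"])
```

Emitting both the answer-only and the rationale variant for the same question would cover the two label styles named above.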

Goal

The goal is to keep LFM2.5 chat, tool-use, and general reasoning behavior while improving Korean legal, finance, Text2SQL, coding, and exact-answer behavior.

The SFT data follows the LFM ChatML-like template and keeps tool-use examples in the LFM tool-call style. Liquid's public docs describe this format with structured conversation roles and tool call delimiters such as <|tool_call_start|> and <|tool_call_end|>.
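
As an illustration of the delimiters mentioned above, the sketch below pulls the raw payload out of each delimited span in a generated string. The helper name `extract_tool_calls` and the sample payload are assumptions made for this example; the payload is returned as an unparsed string because its inner syntax is not specified here.

```python
import re

TOOL_CALL_START = "<|tool_call_start|>"
TOOL_CALL_END = "<|tool_call_end|>"

# Non-greedy match so that several tool calls in one output stay separate.
_TOOL_CALL_RE = re.compile(
    re.escape(TOOL_CALL_START) + r"(.*?)" + re.escape(TOOL_CALL_END),
    flags=re.DOTALL,
)


def extract_tool_calls(generated_text: str) -> list:
    """Return the raw payload of every delimited tool-call span."""
    return [payload.strip() for payload in _TOOL_CALL_RE.findall(generated_text)]


# Hypothetical model output, for illustration only.
sample = 'ok <|tool_call_start|>[get_weather(city="Seoul")]<|tool_call_end|> done'
print(extract_tool_calls(sample))
```

If the delimiters are registered as special tokens in the tokenizer, the output likely needs to be decoded with `skip_special_tokens=False` for them to appear in the text at all.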

Data

Main source groups:

Project implementation and runbooks are public at:

Public dataset releases:

| release | kind | size | source / purpose |
|---|---|---|---|
| CPT LFM-style full raw | raw LFM text JSONL | 20.54GB | Korean Wiki, finance, legal, legal RAG/bar-answer, terminal/tool traces |
| CPT LFM-style source shards | source-separated raw shards | 26.20GB | auditable per-source CPT shards |
| CPT raw mix before LFM wrapping | raw JSONL | 4.10GB | pre-conversion CPT mix |
| SFT Stage0 legal 8k | tokenized response-only arrays | 0.16GB | legal source/RAG/bar warmup |
| SFT Stage0b finance/Text2SQL 4k | tokenized response-only arrays | 0.26GB | finance and Text2SQL smoke stage |
| SFT Stage1 finance/Text2SQL 4k | tokenized response-only arrays | 5.24GB | main finance/accounting and Text2SQL stage |
| SFT Stage1 legal/terminal 8k | tokenized response-only arrays | 6.71GB | legal long-context and terminal/tool traces |
| SFT Stage2 diverse raw | raw LFM chat JSONL | 5.61GB | Korean domain, SWE/coding, reasoning, finance/legal/Text2SQL |
| SFT Stage2 diverse 4k | tokenized response-only arrays | 5.52GB | Stage2 diverse prepared set |
| KoTSQA train raw | raw LFM chat JSONL | 0.002GB | KoTSQA v2 train only; test held out |
| SFT Stage2 plus KoTSQA 4k | tokenized response-only arrays | 5.52GB | planned Stage2 main KO-SFT training set |
| Agentic/Fable grounded raw | raw LFM chat JSONL | 0.04GB | Fable5/Helio plus local docs/log grounded traces |
| Agentic/Fable grounded 8k | tokenized response-only arrays | 0.05GB | Stage3 Agentic/Fable response-only arrays |
| Dataset index and sources | source index | tiny | LLM-Ko-Datasets README/LICENSE snapshot |
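
Several releases above are described as tokenized response-only arrays. The card does not spell out their layout, so the following is only a sketch of the usual response-only idea: prompt positions are masked in the labels so that the loss is computed on response tokens alone. The function name and the toy token ids are assumptions; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_response_only_labels(prompt_ids, response_ids, max_seq_len):
    """Concatenate prompt and response ids; mask prompt positions in labels."""
    input_ids = (list(prompt_ids) + list(response_ids))[:max_seq_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_seq_len]
    return input_ids, labels


# Toy ids, for illustration only.
input_ids, labels = build_response_only_labels([11, 12, 13], [21, 22], max_seq_len=4096)
print(input_ids)
print(labels)
```

Truncating both lists to the same `max_seq_len` keeps them aligned for the 4096-token and 8192-token stage variants.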

The current prepared Stage1 pool is about 2.945B tokens:

  • 4k finance/Text2SQL: 1.286B tokens
  • 8k legal/terminal: 1.659B tokens

The Stage2 pool was prepared from Korean domain SFT, behavior mix, SWE/coding, reasoning, compact finance/legal, and Text2SQL reinforcement data. Raw CPT-style corpora such as Korean Wikipedia and raw law text were intentionally excluded from this SFT phase.

Quick Sanity Evaluation

This is a small limit=50 vLLM sanity slice, not a final benchmark.

| task | Base (LiquidAI/LFM2.5-8B-A1B) | CPT (LFM2.5-8B-A1B-KO-CPT-FULL) |
|---|---|---|
| ARC Challenge acc | 0.2000 | 0.2000 |
| HellaSwag acc | 0.4200 | 0.3800 |
| GSM8K exact match | 0.4600 | 0.2200 |
| IFEval strict prompt acc | 0.1600 | 0.1200 |
| TruthfulQA MC2 acc | 0.5546 | 0.5407 |

The current CPT checkpoint is Korean-knowledge heavy and does not improve this small English/general sanity slice. The ongoing SFT stages are intended to recover instruction following, reasoning format, legal/finance QA, tool use, and coding behavior.
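
To give a sense of how noisy a 50-sample slice is, the sketch below computes a binomial standard error for two of the per-item accuracy metrics in the table. It assumes independent items and a 0/1 score per item, so it is a rough guide rather than the harness's own error estimate; the function name is introduced here for illustration.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


n = 50  # the limit used for this sanity slice
for task, base, cpt in [
    ("GSM8K exact match", 0.46, 0.22),
    ("HellaSwag acc", 0.42, 0.38),
]:
    print(
        f"{task}: base {base:.2f} ± {accuracy_standard_error(base, n):.3f}, "
        f"CPT {cpt:.2f} ± {accuracy_standard_error(cpt, n):.3f}"
    )
```

At 50 items an accuracy near 0.5 carries a standard error of roughly 0.07, so small gaps in this table should not be over-read.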

Training Recipe

  • Method: full-parameter supervised fine-tuning, not LoRA.
  • Precision: BF16.
  • Parallelism: torchrun DDP across 8 H200 GPUs.
  • Optimizer: fused AdamW.
  • Scheduler: cosine with warmup.
  • Current Stage0b batch: per_device_train_batch_size=2, gradient_accumulation_steps=8, effective batch 128 sequences/update.
  • Checkpoints: every 1000 steps with total limit 2, plus final full model.
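
The effective batch and the 2,188 planned Stage0b steps listed in the Status table follow from the settings above. A short check, assuming one DDP rank per GPU and a single pass over the 280,000 Stage0b samples:

```python
import math

num_gpus = 8  # 8 x H200, assuming one DDP rank per GPU
per_device_train_batch_size = 2
gradient_accumulation_steps = 8

# Sequences consumed per optimizer update across all ranks.
effective_batch = num_gpus * per_device_train_batch_size * gradient_accumulation_steps

stage0b_samples = 280_000  # from the stage table
# Assuming a single epoch, with the last partial batch counted as a step.
planned_steps = math.ceil(stage0b_samples / effective_batch)

print(effective_batch, planned_steps)
```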

The direct DDP trainer is used because a previous Hugging Face Trainer attempt loaded the model but stalled before active GPU training on the second stage.

Evaluation Plan

We will report base, CPT, and SFT under the same vLLM settings. Planned public benchmark families:

| area | benchmark / probe | purpose |
|---|---|---|
| Official LFM lineage | IFEval, IFBench, Multi-IF | instruction following preservation |
| Official LFM lineage | MATH500, AIME25 | math/reasoning preservation |
| Official LFM lineage | BFCLv3, BFCLv4 | function/tool calling |
| Official LFM lineage | Tau2 Telecom, Tau2 Retail | agentic task behavior |
| Korean language | Global MMLU Korean, KMMLU | Korean knowledge and MCQA |
| Korean domain | legal/bar/accounting/finance probes | target-domain lift |
| Structured output | Text2SQL and JSON exact extraction | format and exact-answer behavior |

Final scores will be added after the same evaluation matrix has been run on all comparison models.

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful Korean legal and finance assistant."},
    # "Briefly explain a director's duty of loyalty under the Korean Commercial Act."
    {"role": "user", "content": "대한민국 상법상 이사의 충실의무를 간단히 설명해줘."},
]
# return_dict=True returns input_ids and attention_mask together, so the
# result can be unpacked into generate() regardless of the library default.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# Greedy decoding; print only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))

Colab Example

!pip install -U transformers accelerate safetensors

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a precise Korean assistant."},
    # "Explain why a chat template is used when using the LFM2.5 model in Korean."
    {"role": "user", "content": "한국어로 LFM2.5 모델을 사용할 때 chat template을 쓰는 이유를 설명해줘."},
]
# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# Low-temperature sampling; print only the newly generated tokens.
output = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))

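Korean Description

LFM2.5-8B-A1B-KO-SFT is a Korean SFT model trained as a continuation of LFM2.5-8B-A1B-KO-CPT-FULL. The goal is to strengthen Korean legal, finance, accounting, Text2SQL, coding, terminal, and tool-call behavior while keeping the English reasoning and tool-use ability of the original LFM2.5.

Training is still in progress ahead of the final release. The model performance table will be updated after base, CPT, and SFT have been run under the same vLLM evaluation settings.

For Korean usage examples, see the Usage and Colab Example sections above.

The project code and run documentation are public on GitHub.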
