LFM2.5-8B-A1B-KO-SFT

Korean full-parameter SFT continuation of LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL, based on LiquidAI/LFM2.5-8B-A1B.

GitHub: https://github.com/gyunggyung/LFM25-KO-SFT
CPT GitHub: https://github.com/gyunggyung/LFM25-KO-CPT
CPT base checkpoint: https://huggingface.co/LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL
Agentic follow-up repo: https://huggingface.co/LLM-OS-Models/LFM2.5-8B-A1B-KO-Agentic-SFT
Public data releases: 14 Hugging Face dataset repos are published with README.md, dataset_manifest.json, and uploaded data/ files. Combined uploaded size is about 79.94GB, including duplicate raw/tokenized releases.
Korean section: 한국어 설명
Base model: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B
Liquid prompting docs: https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
Liquid chat template docs: https://docs.liquid.ai/lfm/key-concepts/chat-template
Liquid tool-use docs: https://docs.liquid.ai/lfm/key-concepts/tool-use

Status

Stage2 is the main KO-SFT model line and has been uploaded to this repository. Stage3 Agentic/Fable training is a separate follow-up model line under LLM-OS-Models/LFM2.5-8B-A1B-KO-Agentic-SFT.

The first full run over the selected benchmarks shows that this Stage2 SFT checkpoint is not a blanket improvement over Base/CPT. It preserves or recovers a few axes, but it is weak on likelihood-style Korean multiple-choice benchmarks. Treat the numbers below as a diagnostic snapshot of the Stage2 SFT checkpoint, not as the final Agentic model report.

stage	status	samples	tokens	max seq	note
Stage0 legal	completed	8,747	35,068,923	8192	Korean legal source/bar-style warmup
Stage0b finance/Text2SQL	completed/uploaded	280,000	58,090,087	4096	8 x H200 full SFT, 2,188 planned steps
Stage1 4k finance/Text2SQL	completed/uploaded	2,302,304	1,285,864,494	4096	8 x H200 full SFT
Stage1 8k legal/terminal	completed/uploaded	1,600,835	1,658,848,754	8192	legal long-context and terminal/tool behavior
Stage2 diverse KO/SWE/reasoning	completed	1,467,864	1,364,349,642	4096	excludes raw CPT corpora
Stage2 plus KoTSQA	completed/uploaded	1,468,598	1,364,863,776	4096	main KO-SFT checkpoint; adds KoTSQA train split only
Stage3 Agentic/Fable	separate repo / running	3,943	7,124,298	8192	Fable5/Helio + local doc/log grounded SFT

Current staged main SFT total is about 4.309577B tokens:

Stage1 4k finance/Text2SQL: 1.286B tokens
Stage1 8k legal/terminal: 1.659B tokens
Stage2 diverse plus KoTSQA: 1.364864B tokens
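
The total can be checked against the per-stage token counts in the status table. A minimal sketch; the dictionary name `stage_tokens` is chosen here only for illustration:

```python
# Token counts copied from the status table above.
stage_tokens = {
    "Stage1 4k finance/Text2SQL": 1_285_864_494,
    "Stage1 8k legal/terminal": 1_658_848_754,
    "Stage2 diverse plus KoTSQA": 1_364_863_776,
}

total_tokens = sum(stage_tokens.values())

for name, count in stage_tokens.items():
    print(f"{name}: {count / 1e9:.6f}B tokens")
print(f"total: {total_tokens / 1e9:.6f}B tokens")
```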

Stage2 Selected Full Benchmark Snapshot

Evaluation was run with vLLM/lm-eval on the uploaded Stage2 full checkpoint. Base and CPT reference values are copied from the CPT model card for the same task axes. KMMLU direct hard STEM failed once during a crowded vLLM queue and is marked as pending rather than reported here.

task	metric	Base	CPT	KO-SFT Stage2	SFT vs Base	SFT vs CPT
IFEval	prompt loose acc	0.2921	0.3216	0.1738	-0.1183	-0.1478
Leaderboard IFEval	prompt loose acc	0.2902	0.3457	0.1756	-0.1146	-0.1701
GSM8K	exact match	0.4845	0.5701	0.3381	-0.1464	-0.2320
BoolQ	acc	0.6544	0.7902	0.6664	+0.0120	-0.1238
ARC-Challenge	acc_norm	0.3771	0.4241	0.2287	-0.1484	-0.1954
PIQA	acc_norm	0.7203	0.7476	0.5930	-0.1273	-0.1546
Global MMLU KO medical genetics	acc	0.2900	0.3800	0.3000	+0.0100	-0.0800
Global MMLU KO nutrition	acc	0.2549	0.3203	0.2157	-0.0392	-0.1046
Global MMLU KO philosophy	acc	0.2669	0.3215	0.1994	-0.0675	-0.1221
Global MMLU KO miscellaneous	acc	0.3372	0.3921	0.2401	-0.0971	-0.1520
Global MMLU KO professional medicine	acc	0.3235	0.2316	0.1838	-0.1397	-0.0478
Global MMLU KO high school statistics	acc	0.2870	0.1574	0.2222	-0.0648	+0.0648
Global MMLU KO astronomy	acc	0.3421	0.2829	0.1974	-0.1447	-0.0855
Global MMLU KO high school computer science	acc	0.3100	0.2800	0.2800	-0.0300	+0.0000
Global MMLU KO jurisprudence	acc	0.2870	0.2685	0.2593	-0.0277	-0.0092
KMMLU direct hard	exact match	0.2015	0.1720	0.1055	-0.0960	-0.0665
MMLU-ProX Lite KO	exact match	0.2585	0.1667	0.0867	-0.1718	-0.0800
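
The two delta columns are plain differences of the score columns. The sketch below recomputes them for a few rows copied from the table; the helper name `deltas` is an illustrative choice, not part of the evaluation code.

```python
# A few rows copied from the table above: (task, Base, CPT, KO-SFT Stage2).
rows = [
    ("IFEval", 0.2921, 0.3216, 0.1738),
    ("GSM8K", 0.4845, 0.5701, 0.3381),
    ("BoolQ", 0.6544, 0.7902, 0.6664),
    ("Global MMLU KO high school statistics", 0.2870, 0.1574, 0.2222),
]

def deltas(base, cpt, sft):
    """Return (SFT - Base, SFT - CPT), rounded to four decimals like the table."""
    return round(sft - base, 4), round(sft - cpt, 4)

for task, base, cpt, sft in rows:
    vs_base, vs_cpt = deltas(base, cpt, sft)
    print(f"{task}: SFT vs Base {vs_base:+.4f}, SFT vs CPT {vs_cpt:+.4f}")
```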

Interpretation:

Stage2 SFT preserved only a small subset of public benchmark axes. BoolQ is slightly above Base (+0.0120), Global MMLU KO medical genetics is slightly above Base (+0.0100), and high school statistics recovers part of the CPT regression (0.2222 versus 0.1574 for CPT, still below 0.2870 for Base).
Korean multiple-choice and exact-answer tasks are mostly below Base/CPT. This suggests the SFT mix improved conversation/domain behavior more than likelihood-style option selection.
The next SFT data mix should add explicit Korean MCQA formats: question, choices, answer-only labels, and short rationales with the final option separated. This is especially important for KMMLU, Global MMLU KO, and MMLU-ProX style evaluation.
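
One way to picture the proposed MCQA format is a chat record with the choices in the user turn and the final option on its own line in the assistant turn. This is a sketch only: the function name `build_mcqa_sample`, the `messages` layout, and the "정답:" answer prefix are assumptions made for illustration, not the project's actual schema, and the sample question is invented.

```python
import json
import string

def build_mcqa_sample(question, choices, answer_index, rationale=None):
    """Build one chat-style MCQA record (hypothetical schema).

    With a rationale, the assistant turn is the rationale followed by the
    final option on a separate line; without one, it is the answer only.
    """
    labels = string.ascii_uppercase[: len(choices)]
    option_lines = [f"{label}. {text}" for label, text in zip(labels, choices)]
    user_content = question + "\n" + "\n".join(option_lines)
    final_line = f"정답: {labels[answer_index]}"
    assistant_content = f"{rationale}\n{final_line}" if rationale else final_line
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content},
        ]
    }

# Invented example question, used only to show the record shape.
sample = build_mcqa_sample(
    question="다음 중 대한민국의 수도는 어디인가?",
    choices=["부산", "서울", "대구", "인천"],
    answer_index=1,
    rationale="대한민국의 수도는 서울이다.",
)
print(json.dumps(sample, ensure_ascii=False, indent=2))
```

Keeping the final option on a separate last line makes it straightforward to score exact-answer outputs and to derive an answer-only variant by dropping the rationale.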

Goal

The goal is to keep LFM2.5 chat, tool-use, and general reasoning behavior while improving Korean legal, finance, Text2SQL, coding, and exact-answer behavior.

The SFT data follows the LFM ChatML-like template and keeps tool-use examples in the LFM tool-call style. Liquid's public docs describe this format with structured conversation roles and tool call delimiters such as <|tool_call_start|> and <|tool_call_end|>.
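
A sketch of pulling tool-call payloads out of generated text using the two delimiters named above. The helper name `extract_tool_calls` and the sample string, including the payload written between the delimiters, are illustrative assumptions; the Liquid tool-use docs are the reference for the actual payload syntax.

```python
import re

TOOL_CALL_START = "<|tool_call_start|>"
TOOL_CALL_END = "<|tool_call_end|>"

# Non-greedy match of whatever sits between one start/end delimiter pair.
_TOOL_CALL_RE = re.compile(
    re.escape(TOOL_CALL_START) + r"(.*?)" + re.escape(TOOL_CALL_END),
    flags=re.DOTALL,
)

def extract_tool_calls(text):
    """Return the raw payload strings found between tool-call delimiters."""
    return [payload.strip() for payload in _TOOL_CALL_RE.findall(text)]

# Hypothetical model output; the payload format here is only an example.
generated = (
    "환율을 조회하겠습니다."
    '<|tool_call_start|>[get_exchange_rate(base="USD", quote="KRW")]<|tool_call_end|>'
)
print(extract_tool_calls(generated))
```

If the delimiters are registered as special tokens in the tokenizer, decoding with skip_special_tokens=True would remove them, so text intended for this kind of parsing would need to be decoded with skip_special_tokens=False.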

Data

Main source groups:

Korean legal tasks, bar-style JSON answers, source-grounded legal agent data, and RAG-style legal QA. Legal data includes sources from the legalize-kr ecosystem: https://github.com/legalize-kr.
Korean finance/accounting instruction data.
Text2SQL and structured reasoning data.
Terminal/tool-use and ToolBench-style conversations.
Coding/SWE data.
KoTSQA train split for Korean evidence QA and false-premise correction. The test split is kept out for later evaluation: https://huggingface.co/datasets/etri-lirs/KoTSQA-v.2.0.
Korean dataset index reviewed for additional candidates: https://github.com/gyunggyung/LLM-Ko-Datasets.

Project implementation and runbooks are public at:

SFT code and docs: https://github.com/gyunggyung/LFM25-KO-SFT
CPT code and docs: https://github.com/gyunggyung/LFM25-KO-CPT

Public dataset releases:

release	kind	size	source / purpose
CPT LFM-style full raw	raw LFM text JSONL	20.54GB	Korean Wiki, finance, legal, legal RAG/bar-answer, terminal/tool traces
CPT LFM-style source shards	source-separated raw shards	26.20GB	auditable per-source CPT shards
CPT raw mix before LFM wrapping	raw JSONL	4.10GB	pre-conversion CPT mix
SFT Stage0 legal 8k	tokenized response-only arrays	0.16GB	legal source/RAG/bar warmup
SFT Stage0b finance/Text2SQL 4k	tokenized response-only arrays	0.26GB	finance and Text2SQL smoke stage
SFT Stage1 finance/Text2SQL 4k	tokenized response-only arrays	5.24GB	main finance/accounting and Text2SQL stage
SFT Stage1 legal/terminal 8k	tokenized response-only arrays	6.71GB	legal long-context and terminal/tool traces
SFT Stage2 diverse raw	raw LFM chat JSONL	5.61GB	Korean domain, SWE/coding, reasoning, finance/legal/Text2SQL
SFT Stage2 diverse 4k	tokenized response-only arrays	5.52GB	Stage2 diverse prepared set
KoTSQA train raw	raw LFM chat JSONL	0.002GB	KoTSQA v2 train only; test held out
SFT Stage2 plus KoTSQA 4k	tokenized response-only arrays	5.52GB	main Stage2 KO-SFT training set
Agentic/Fable grounded raw	raw LFM chat JSONL	0.04GB	Fable5/Helio plus local docs/log grounded traces
Agentic/Fable grounded 8k	tokenized response-only arrays	0.05GB	Stage3 Agentic/Fable response-only arrays
Dataset index and sources	source index	tiny	LLM-Ko-Datasets README/LICENSE snapshot
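
Several releases above are described as tokenized response-only arrays. The usual meaning of response-only SFT is that loss is computed on assistant tokens only, with prompt positions masked out of the labels. The sketch below shows that idea with plain Python lists. The function name `mask_prompt_labels`, the toy token IDs, and the single prompt/response split are assumptions; the actual on-disk layout of the released arrays is described by each dataset's README.md and dataset_manifest.json, not by this example.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy by default

def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels and mask the first prompt_length positions."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])

# Toy IDs: five prompt tokens (system + user) followed by three response tokens.
input_ids = [11, 12, 13, 14, 15, 901, 902, 903]
labels = mask_prompt_labels(input_ids, prompt_length=5)
print(labels)
```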

The current prepared Stage1 pool is about 2.945B tokens:

4k finance/Text2SQL: 1.286B tokens
8k legal/terminal: 1.659B tokens

The Stage2 pool is built from Korean domain SFT, behavior mix, SWE/coding, reasoning, compact finance/legal, and Text2SQL reinforcement data. Raw CPT-style corpora such as Korean Wikipedia and raw law text are intentionally excluded from this SFT phase.

Quick Sanity Evaluation

This is a small limit=50 vLLM sanity slice, not a final benchmark.

task	base `LiquidAI/LFM2.5-8B-A1B`	CPT `LFM2.5-8B-A1B-KO-CPT-FULL`
ARC Challenge acc	0.2000	0.2000
HellaSwag acc	0.4200	0.3800
GSM8K exact match	0.4600	0.2200
IFEval strict prompt acc	0.1600	0.1200
TruthfulQA MC2 acc	0.5546	0.5407

The current CPT checkpoint is Korean-knowledge heavy and does not improve on any task in this small English/general sanity slice: it ties Base on ARC Challenge and trails it on the other four. The SFT stages are intended to recover instruction following, reasoning format, legal/finance QA, tool use, and coding behavior.
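
To see why a limit=50 slice is only a sanity check, a rough binomial standard error can be attached to each accuracy. This is a back-of-the-envelope sketch that treats the 50 items as independent samples and the two models as independent runs, which ignores that both models saw the same items; it is not part of the evaluation pipeline. TruthfulQA MC2 is left out because it is not a per-item 0/1 accuracy.

```python
import math

N = 50  # items per task in the limit=50 slice

def binomial_se(acc, n=N):
    """Standard error of an accuracy estimated from n independent items."""
    return math.sqrt(acc * (1.0 - acc) / n)

# (task, Base, CPT) for the 0/1-accuracy rows of the table above.
sanity_rows = [
    ("ARC Challenge acc", 0.2000, 0.2000),
    ("HellaSwag acc", 0.4200, 0.3800),
    ("GSM8K exact match", 0.4600, 0.2200),
    ("IFEval strict prompt acc", 0.1600, 0.1200),
]

for task, base, cpt in sanity_rows:
    # Standard error of the difference under the independence approximation.
    se_diff = math.sqrt(binomial_se(base) ** 2 + binomial_se(cpt) ** 2)
    print(f"{task}: CPT - Base = {cpt - base:+.4f}, SE of difference ~ {se_diff:.4f}")
```

Under this approximation, only the GSM8K gap exceeds two standard errors of the difference; the HellaSwag and IFEval gaps are well inside the noise of a 50-item sample.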

Training Recipe

Method: full-parameter supervised fine-tuning, not LoRA.
Precision: BF16.
Parallelism: torchrun DDP across 8 H200 GPUs.
Optimizer: fused AdamW.
Scheduler: cosine with warmup.
Stage0b batch: per_device_train_batch_size=2, gradient_accumulation_steps=8, so the effective batch is 2 x 8 x 8 GPUs = 128 sequences/update.
Checkpoints: every 1000 steps with total limit 2, plus final full model.
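
The batch settings above determine both the effective batch size and the step count for a stage. A small worked example, assuming a single pass over the Stage0b data, which is consistent with the 2,188 planned steps listed in the status table:

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 8  # 8 x H200 under torchrun DDP

# Sequences contributing to each optimizer update across all GPUs.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

stage0b_samples = 280_000  # from the status table
planned_steps = math.ceil(stage0b_samples / effective_batch)

print(f"effective batch: {effective_batch} sequences/update")
print(f"planned steps for one pass: {planned_steps}")
```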

The direct DDP trainer is used because a previous Hugging Face Trainer attempt loaded the model but stalled before active GPU training on the second stage.

Evaluation Plan

We will report base, CPT, and SFT under the same vLLM settings. Planned public benchmark families:

area	benchmark / probe	purpose
Official LFM lineage	IFEval, IFBench, Multi-IF	instruction following preservation
Official LFM lineage	MATH500, AIME25	math/reasoning preservation
Official LFM lineage	BFCLv3, BFCLv4	function/tool calling
Official LFM lineage	Tau2 Telecom, Tau2 Retail	agentic task behavior
Korean language	Global MMLU Korean, KMMLU	Korean knowledge and MCQA
Korean domain	legal/bar/accounting/finance probes	target-domain lift
Structured output	Text2SQL and JSON exact extraction	format and exact-answer behavior

Final scores will be added after the same evaluation matrix has been run on all comparison models.
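
Running all three models under the same settings can be scripted by generating one lm-eval command per model from a shared configuration. This is a sketch under assumptions: the task list, the `dtype` value, the batch size, and the output directory are placeholders rather than the project's actual settings, and the helper name `build_lm_eval_command` is made up for this example. The script only prints the commands; it does not launch them.

```python
import shlex

MODELS = {
    "base": "LiquidAI/LFM2.5-8B-A1B",
    "cpt": "LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL",
    "sft": "LLM-OS-Models/LFM2.5-8B-A1B-KO-SFT",
}

# Shared settings (placeholders) so that every model is evaluated identically.
TASKS = ["ifeval", "gsm8k", "arc_challenge"]
SHARED_MODEL_ARGS = "dtype=bfloat16"

def build_lm_eval_command(tag, model_id):
    """Return the argv list for one lm-eval run on the vLLM backend."""
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", f"pretrained={model_id},{SHARED_MODEL_ARGS}",
        "--tasks", ",".join(TASKS),
        "--batch_size", "auto",
        "--output_path", f"results/{tag}",
    ]

for tag, model_id in MODELS.items():
    print(shlex.join(build_lm_eval_command(tag, model_id)))
```

Generating every command from the same function keeps the only per-model differences to the checkpoint name and the output path.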

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful Korean legal and finance assistant."},
    {"role": "user", "content": "대한민국 상법상 이사의 충실의무를 간단히 설명해줘."},
]
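# return_dict=True yields input_ids plus attention_mask, whatever the library's default return type is.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
# Greedy decoding; print only the newly generated tokens, not the prompt.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))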

Colab Example

!pip install -U transformers accelerate safetensors

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a precise Korean assistant."},
    {"role": "user", "content": "한국어로 LFM2.5 모델을 사용할 때 chat template을 쓰는 이유를 설명해줘."},
]
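# return_dict=True yields input_ids plus attention_mask, whatever the library's default return type is.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
# Low-temperature sampling; print only the newly generated tokens, not the prompt.
output = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))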

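Korean Description (한국어 설명)

LFM2.5-8B-A1B-KO-SFT is a Korean SFT model trained as a continuation of LFM2.5-8B-A1B-KO-CPT-FULL. The goal is to strengthen Korean legal, finance, accounting, Text2SQL, coding, terminal, and tool-call behavior while preserving the English reasoning and tool-use ability of the original LFM2.5.

Training is currently still in progress ahead of the final release. The performance table will be updated once base, CPT, and SFT have been run under the same vLLM evaluation settings.

For Korean usage examples, see Usage and Colab Example above.

The project code and runbooks are public on GitHub.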