LFM2.5-8B-A1B-KO-CPT-FULL

Full-parameter Korean continued-pretraining project for LiquidAI/LFM2.5-8B-A1B.

This model is intended to make LFM2.5 stronger at Korean legal, finance, wiki-style knowledge, and terminal/tool-use behavior while preserving the base model's general English and instruction-following ability.

Public CPT dataset releases:

| Release | Size | Format | Source / purpose |
| --- | --- | --- | --- |
| CPT LFM-style full raw | 20.54GB | single LFM-style JSONL | full Korean CPT source after LFM-style wrapping |
| CPT LFM-style source shards | 26.20GB | source-separated JSONL shards | auditable Korean Wiki, finance, legal, legal RAG/bar-answer, terminal/tool shards |
| CPT raw mix before LFM wrapping | 4.10GB | raw JSONL | pre-conversion CPT mix for debugging/rebuilding |

Status: full CPT completed on 2026-06-28. The weights were prepared from the verified final-step checkpoint (checkpoint-10196) and uploaded to Hugging Face. vLLM evaluation shows strong gains on instruction following, GSM8K, BoolQ, ARC, and several Korean knowledge subjects, but also regressions on Korean hard MCQA, MMLU-ProX-lite-ko, and some STEM/legal/accounting slices.

Performance Snapshot

All numbers below are vLLM/lm-eval base-vs-CPT comparisons against LiquidAI/LFM2.5-8B-A1B. Higher is better.
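The Delta and Relative columns used throughout this card follow from the Base and CPT scores. The sketch below assumes Relative is Delta divided by the Base score; the card does not state the formula, but the reported values are consistent with it. The helper name is illustrative.

```python
def delta_and_relative(base: float, cpt: float) -> tuple[float, float]:
    """Return (absolute delta, relative change in percent) for one metric."""
    delta = cpt - base
    relative_pct = 100.0 * delta / base
    return delta, relative_pct


# BoolQ accuracy as reported in this card: base 0.6544, CPT 0.7902.
delta, relative_pct = delta_and_relative(0.6544, 0.7902)
print(f"{delta:+.4f} {relative_pct:+.2f}%")  # prints +0.1358 +20.75%
```

Because the published scores are rounded to four decimals, recomputed values can differ from the reported ones in the last digit.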

Confirmed Gains

| Benchmark | Metric | Base | CPT | Delta | Relative |
| --- | --- | --- | --- | --- | --- |
| leaderboard_instruction_following / leaderboard_ifeval | prompt loose | 0.2902 | 0.3457 | +0.0555 | +19.11% |
| IFEval full | prompt loose | 0.2921 | 0.3216 | +0.0295 | +10.10% |
| GSM8K full 5-shot | exact_match flexible | 0.4845 | 0.5701 | +0.0856 | +17.67% |
| GSM8K full 5-shot | exact_match strict | 0.2472 | 0.4617 | +0.2145 | +86.77% |
| BoolQ full | acc | 0.6544 | 0.7902 | +0.1358 | +20.75% |
| ARC-Challenge full | acc_norm | 0.3771 | 0.4241 | +0.0469 | +12.44% |
| PIQA full | acc_norm | 0.7203 | 0.7476 | +0.0272 | +3.78% |
| Global MMLU KO medical_genetics | acc | 0.2900 | 0.3800 | +0.0900 | +31.03% |
| Global MMLU KO nutrition | acc | 0.2549 | 0.3203 | +0.0654 | +25.64% |
| Global MMLU KO philosophy | acc | 0.2669 | 0.3215 | +0.0547 | +20.48% |
| Global MMLU KO miscellaneous | acc | 0.3372 | 0.3921 | +0.0549 | +16.29% |
| MMLU-Pro economics | exact_match | 0.4277 | 0.4704 | +0.0427 | +9.97% |

Regressions To Fix

| Benchmark | Metric | Base | CPT | Delta | Relative |
| --- | --- | --- | --- | --- | --- |
| MMLU-ProX Lite KO | exact_match | 0.2585 | 0.1667 | -0.0918 | -35.51% |
| KMMLU hard | acc | 0.2015 | 0.1720 | -0.0295 | -14.63% |
| KMMLU hard STEM | acc | 0.1973 | 0.1564 | -0.0409 | -20.74% |
| Global MMLU KO professional_medicine | acc | 0.3235 | 0.2316 | -0.0919 | -28.41% |
| Global MMLU KO high_school_statistics | acc | 0.2870 | 0.1574 | -0.1296 | -45.16% |
| Global MMLU KO astronomy | acc | 0.3421 | 0.2829 | -0.0592 | -17.31% |
| Global MMLU KO high_school_computer_science | acc | 0.3100 | 0.2800 | -0.0300 | -9.68% |
| MMLU-Pro law LIMIT=500 | exact_match | 0.1840 | 0.1240 | -0.0600 | -32.61% |
| Leaderboard Math hard | exact_match | 0.4977 | 0.4275 | -0.0702 | -14.11% |

Interpretation: the CPT run successfully injects Korean-domain knowledge and preserves or improves several general benchmarks, but it is not a finished Korean instruction model. The next post-training stage should target Korean MCQA reliability, option-label extraction, STEM hard questions, legal/accounting reasoning, and preservation of the current IFEval/GSM8K/BoolQ gains.

Likely failure mode: many regressions are not simple "Korean got worse" failures. They cluster around multiple-choice answering, exact answer extraction, option-label discipline, and hard STEM/legal/accounting formats. Open-ended Korean knowledge slices and instruction-following often improve, while Korean MCQA and parser-sensitive exact-match tasks need targeted remediation.
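To illustrate why option-label discipline matters for parser-sensitive scoring, here is a minimal sketch contrasting a strict and a flexible answer extractor. The regular expressions and function names are hypothetical and are not the filters used by lm-evaluation-harness; they only show how the same correct answer can pass one parser and fail another.

```python
import re
from typing import Optional

# Strict: the whole output must be a bare option label such as "B" or "(B)".
_STRICT = re.compile(r"\(?([A-D])\)?\.?")
# Flexible: accept a label that follows a Korean or English answer cue.
_FLEXIBLE = re.compile(
    r"(?:정답|답|[Aa]nswer)\s*(?:은|는|is)?\s*[:：]?\s*\(?([A-D])\)?"
)


def strict_label(output: str) -> Optional[str]:
    """Return the label only if the output is nothing but the label."""
    match = _STRICT.fullmatch(output.strip())
    return match.group(1) if match else None


def flexible_label(output: str) -> Optional[str]:
    """Return the first label that follows an answer cue, if any."""
    match = _FLEXIBLE.search(output)
    return match.group(1) if match else None


for text in ["B", "정답은 B입니다.", "The answer is (C)."]:
    print(repr(text), strict_label(text), flexible_label(text))
```

In this sketch the bare "B" is accepted only by the strict matcher, while the two sentence-style answers are accepted only by the flexible one, so a model that drifts between output styles can lose points without losing knowledge.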

English

LFM2.5-8B-A1B-KO-CPT-FULL is a full-parameter Korean CPT checkpoint, not a LoRA adapter. The training objective is text completion over a Korean-heavy corpus, with LFM chat-template formatting applied to instruction, RAG, and tool-use examples.

Training code, source manifests, dataset cards, and runbooks are published at https://github.com/gyunggyung/LFM25-KO-CPT. The supervised fine-tuning follow-up is tracked at https://github.com/gyunggyung/LFM25-KO-SFT.

Target strengths:

  • Korean legal document understanding and legal RAG-style answering
  • Korean finance explanations and finance-domain terminology
  • Korean wiki/general knowledge prose
  • Korean instruction-following
  • Terminal/tool-use style structured assistant behavior

Quick Start

The examples below use the full model repository.

Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a precise Korean assistant."},
    {"role": "user", "content": "대한민국 민법상 계약 해제와 해지의 차이를 간단히 설명해줘."},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.6,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(output[0], skip_special_tokens=False))

vLLM

vllm serve LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 8

OpenAI-compatible request:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL",
    messages=[
        {"role": "system", "content": "You are a precise Korean assistant."},
        {"role": "user", "content": "한국 기준금리 인상이 은행 순이자마진에 미치는 영향을 설명해줘."},
    ],
    temperature=0.5,
    max_tokens=512,
)

print(response.choices[0].message.content)

Colab Example

For typical Colab GPUs, start with 4-bit loading to avoid out-of-memory errors.

!pip install -U "transformers>=4.44" accelerate bitsandbytes sentencepiece huggingface_hub

import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Optional for gated/private models:
# login("hf_xxx")

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "너는 한국어로 정확하고 간결하게 답하는 어시스턴트다."},
    {"role": "user", "content": "한국어로 주택임대차보호법의 대항력 요건을 설명해줘."},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.5,
        top_p=0.9,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

If you have an A100/H100/H200 runtime, bf16 loading can be used instead of 4-bit:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Prompt Format

The model follows the LFM2 chat-template style. Use the tokenizer chat template when possible. The CPT corpus preserves these special tokens for chat and tool-use records:

  • <|startoftext|>
  • <|im_start|>
  • <|im_end|>
  • roles: system, user, assistant, tool
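As a rough picture of what the chat template produces, the sketch below renders messages with the special tokens listed above. The exact newline placement is an assumption based on the usual ChatML layout, and tool-call markup is not covered; `tokenizer.apply_chat_template` remains the authoritative source. The function name `render_chat` is illustrative.

```python
from typing import Dict, List

BOS = "<|startoftext|>"
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in a ChatML-like layout (assumed, not the official template)."""
    parts = [BOS]
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


print(render_chat([
    {"role": "system", "content": "You are a precise Korean assistant."},
    {"role": "user", "content": "안녕하세요."},
]))
```

Comparing this output with the string returned by the tokenizer is a quick way to confirm that a hand-built prompt matches the format the model was trained on.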

Training Configuration

  • Base model: LiquidAI/LFM2.5-8B-A1B
  • Method: full-parameter continued pretraining, not LoRA
  • Framework: Unsloth + TRL SFTTrainer
  • Hardware target: 8x NVIDIA H200
  • Context length: 8192
  • Precision: bf16 when supported
  • Optimizer: adamw_8bit
  • GPUs: 8
  • Per-device batch size: 2
  • Gradient accumulation steps: 4
  • Effective batch: 64 sequences/update
  • Maximum tokens/update: 524,288
  • Learning rate: 2e-5
  • Schedule: 1 epoch over the prepared full corpus
  • max_steps: -1
  • Checkpoint interval: 1,000 steps
  • Checkpoint retention: 4 latest checkpoints plus final model

Completed run:

  • Estimated tokens: 6,492,697,020
  • Raw estimated steps before packing: 12,384
  • Actual packed trainer steps: 10,196
  • Train runtime: about 9h 38m
  • Train samples/sec: 18.81
  • Train steps/sec: 0.294
  • Final logged train loss: 0.712
  • Final checkpoint source: checkpoint-10196
  • Final model integrity check: model.safetensors opens successfully with 2,302 tensors
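The throughput figures above can be cross-checked against each other. This is a small arithmetic sketch using only numbers reported in this card; it assumes every optimizer step consumes one effective batch of 64 sequences.

```python
STEPS = 10_196
STEPS_PER_SEC = 0.294
SEQUENCES_PER_STEP = 64  # effective batch reported in the training configuration

runtime_sec = STEPS / STEPS_PER_SEC
hours, remainder = divmod(runtime_sec, 3600)
minutes = remainder / 60
samples_per_sec = STEPS_PER_SEC * SEQUENCES_PER_STEP

print(f"runtime ~ {int(hours)}h {minutes:.0f}m")
print(f"samples/sec ~ {samples_per_sec:.2f}")
```

The result lands at roughly 9h 38m and about 18.8 samples per second, in line with the logged values; the small gap to 18.81 comes from the rounded steps/sec figure.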

Note: the distributed torchrun process reached step 10196/10196 and wrote checkpoint-10196. A SIGSEGV occurred during the extra post-train trainer.save_model(final_full) write, leaving the initial final_full/model.safetensors incomplete. The published final_full was rebuilt from the verified checkpoint-10196 inference files.
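An interrupted write like the one described above can be detected from the file layout alone. A safetensors file begins with an 8-byte little-endian header length, followed by a JSON header whose entries record each tensor's `data_offsets`. The sketch below is one possible check using only the standard library; it is not the verification procedure used for this release, and the function name is illustrative.

```python
import json
import os
import struct


def inspect_safetensors(path: str) -> tuple[int, bool]:
    """Return (tensor count, True if the file holds every byte the header promises)."""
    with open(path, "rb") as handle:
        # The format starts with an 8-byte little-endian header length.
        (header_len,) = struct.unpack("<Q", handle.read(8))
        header = json.loads(handle.read(header_len))
    # "__metadata__" is an optional non-tensor entry.
    tensors = {name: info for name, info in header.items() if name != "__metadata__"}
    # data_offsets are relative to the first byte after the header.
    data_end = max((info["data_offsets"][1] for info in tensors.values()), default=0)
    complete = os.path.getsize(path) >= 8 + header_len + data_end
    return len(tensors), complete
```

Calling it on a path such as `final_full/model.safetensors` would return the tensor count and whether the data section is complete. A file cut off inside the header raises an exception instead, which is also a clear failure signal.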

Data Mix

Prepared full mix:

/home/work/.data/lfm2_ko_cpt/datasets/ko_cpt_mix_full_lfmstyle_20260627.jsonl

Statistics:

  • Rows after global deduplication: 4,622,971
  • Characters: 11,581,567,658
  • Estimated tokens: 6,492,697,020
  • Raw estimated training steps: 12,384 at effective batch 64 and sequence length 8192
  • Actual packed trainer steps: 10,196
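The raw step estimate follows from the batch configuration and the token estimate. The sketch below reproduces that arithmetic from the numbers in this card; the variable names are illustrative.

```python
import math

GPUS = 8
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
CONTEXT_LENGTH = 8192
ESTIMATED_TOKENS = 6_492_697_020
CHARACTERS = 11_581_567_658

# Sequences consumed by one optimizer update across all GPUs.
effective_batch = GPUS * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
# Upper bound on tokens per update when every sequence is full.
max_tokens_per_update = effective_batch * CONTEXT_LENGTH
raw_steps = math.ceil(ESTIMATED_TOKENS / max_tokens_per_update)
chars_per_token = CHARACTERS / ESTIMATED_TOKENS

print(effective_batch, max_tokens_per_update, raw_steps, round(chars_per_token, 2))
# prints 64 524288 12384 1.78
```

The packed run finished in 10,196 steps, which covers at most 10,196 x 524,288 (about 5.35B) tokens, so the 6.49B figure is best read as an estimate rather than an exact trained-token count.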

Per-source rows:

| Source | Rows |
| --- | --- |
| kowiki_raw_full_20260524 | 611,403 |
| bcai_finance_kor_hrm_20260524 | 1,861,531 |
| korean_legal_raw_full_20260523 | 227,687 |
| korean_legal_tasks_full_20260524 | 1,383,340 |
| korean_admrule_precedent_raw_full_20260524 | 203,477 |
| ko_legal_source_agent_sft_20260621 | 5,999 |
| ko_legal_rag_agent_sft_round15_v2 | 749 |
| current_law_bar_json_answer_sft_20260621 | 2,000 |
| lfm25_terminal_toolbench_hrm_turns_v1 | 326,785 |
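The per-source row counts can be turned into mix shares with a few lines of arithmetic. This sketch uses only the counts listed above; note that row share is not the same as token share, because documents differ widely in length.

```python
ROWS = {
    "kowiki_raw_full_20260524": 611_403,
    "bcai_finance_kor_hrm_20260524": 1_861_531,
    "korean_legal_raw_full_20260523": 227_687,
    "korean_legal_tasks_full_20260524": 1_383_340,
    "korean_admrule_precedent_raw_full_20260524": 203_477,
    "ko_legal_source_agent_sft_20260621": 5_999,
    "ko_legal_rag_agent_sft_round15_v2": 749,
    "current_law_bar_json_answer_sft_20260621": 2_000,
    "lfm25_terminal_toolbench_hrm_turns_v1": 326_785,
}

total = sum(ROWS.values())
# Print sources from largest to smallest with their share of all rows.
for source, rows in sorted(ROWS.items(), key=lambda item: item[1], reverse=True):
    print(f"{source:45s} {rows:>10,d} {100 * rows / total:6.2f}%")
print("total", total)
```

The total comes to 4,622,971, matching the row count after global deduplication, with the finance and legal-task sources making up the largest shares by rows.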

Raw Korean wiki/legal/finance documents are kept as plain completion text for CPT. Instruction, legal RAG, and terminal/tool-use examples are converted to LFM ChatML-style text.

Legal Data Attribution

Legal-domain data is attributed to the public Legalize-KR ecosystem and related Korean legal source corpora used in the local CPT mix.

Legalize-KR links:

The Legalize-KR organization describes its project as converting Korean statutes, precedents, administrative rules, and local ordinances into Markdown and Git history. Its README states that source data is obtained from the National Law Information Center OpenAPI and transformed into Git repositories. Long-term reproducibility should pin a snapshot or release where possible because Legalize-KR notes that Git history can be reconstructed when parsing and normalization rules improve.

Recommended attribution format:

  • Statutes: cite legalize-kr/legalize-kr, the Markdown path such as kr/{statute-name}/{statute-type}.md, and stable metadata fields such as 법령ID, 법령MST, promulgation date, effective date, and the 출처 URL from law.go.kr.
  • Precedents: cite legalize-kr/precedent-kr, the Markdown path such as {case-type}/{court-level}/{court}_{decision-date}_{case-number}.md, and stable identifiers such as 판례일련번호, court name, decision date, and case number.
  • Administrative rules: cite legalize-kr/admrule-kr, the Markdown path such as {agency-path}/{rule-type}/{rule-name}/본문.md, plus rule serial number or issuing number when available.
  • Local ordinances: cite legalize-kr/ordinance-kr, the Markdown path such as {province}/{city-or-office}/{ordinance-type}/{ordinance-name}/본문.md, plus 자치법규ID, 자치법규일련번호, promulgation date, promulgation number, and the 출처 URL.
  • Avoid using only Git commit hashes as long-term identifiers because Legalize-KR warns that repository history may be reconstructed after parser or normalization improvements.
  • License note from the Legalize-KR READMEs: original legal text is Korean government public work; repository structure and metadata are MIT where specified by the repository.

Local legal sources included in this CPT run:

  • korean_legal_raw_full_20260523
  • korean_legal_tasks_full_20260524
  • korean_admrule_precedent_raw_full_20260524
  • ko_legal_source_agent_sft_20260621
  • ko_legal_rag_agent_sft_round15_v2
  • current_law_bar_json_answer_sft_20260621

Korean

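LFM2.5-8B-A1B-KO-CPT-FULL is a full-parameter CPT model, not a LoRA adapter. The goal is to transfer Korean legal, finance, and wiki knowledge, along with terminal/tool-use style, into LFM2.5-8B-A1B through continued pretraining.

Target strengths:

  • Korean legal document understanding and legal RAG answering
  • Korean finance explanations and finance terminology handling
  • Korean wiki/general knowledge prose style
  • Korean instruction following
  • Preservation of terminal/tool-calling assistant behavior

Korean Usage

With the weights uploaded, the model can be used as shown below.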

Using Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "너는 한국어로 정확하고 간결하게 답하는 어시스턴트다."},
    {"role": "user", "content": "상법상 이사의 충실의무를 실무 관점에서 설명해줘."},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.5,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(output[0], skip_special_tokens=False))

Using vLLM

vllm serve LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 8

Example request:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL",
    messages=[
        {"role": "system", "content": "너는 한국어로 정확하고 간결하게 답하는 어시스턴트다."},
        {"role": "user", "content": "부동산 임대차 계약에서 보증금 반환 분쟁의 핵심 쟁점을 정리해줘."},
    ],
    temperature=0.5,
    max_tokens=512,
)

print(response.choices[0].message.content)

Colab Usage Example

On typical Colab GPUs, start with 4-bit loading to avoid running out of VRAM.

!pip install -U "transformers>=4.44" accelerate bitsandbytes sentencepiece huggingface_hub

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "너는 한국어로 정확하고 간결하게 답하는 어시스턴트다."},
    {"role": "user", "content": "한국 금융시장에서 기준금리와 채권 가격의 관계를 설명해줘."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.5,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

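Recommended Generation Settings

  • Legal/finance explanations: temperature=0.3-0.6, top_p=0.8-0.95
  • General Korean answers: temperature=0.5-0.8, top_p=0.9
  • Long document summarization: max_new_tokens=1024 or higher
  • Tool use/structured output: low temperature recommended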
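The recommended settings can be kept as named presets and passed to `model.generate`. This is a sketch under stated assumptions: the preset names are hypothetical, each temperature is one arbitrary pick from inside the recommended range, and 0.2 for tool use is an assumed value because the card only says "low temperature".

```python
GENERATION_PRESETS = {
    # Recommended ranges: temperature 0.3-0.6, top_p 0.8-0.95.
    "legal_finance": {"temperature": 0.4, "top_p": 0.9, "max_new_tokens": 512},
    # Recommended ranges: temperature 0.5-0.8, top_p 0.9.
    "general_korean": {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 512},
    # Long summaries: max_new_tokens of 1024 or more.
    "long_summary": {"temperature": 0.5, "top_p": 0.9, "max_new_tokens": 1024},
    # Only "low temperature" is recommended; 0.2 is an assumed value.
    "tool_use": {"temperature": 0.2, "top_p": 0.9, "max_new_tokens": 512},
}


def generation_kwargs(task: str) -> dict:
    """Return sampling arguments suitable for model.generate(**inputs, **kwargs)."""
    return {"do_sample": True, **GENERATION_PRESETS[task]}


print(generation_kwargs("legal_finance"))
```

With the examples in this card, a call would look like `model.generate(**inputs, **generation_kwargs("legal_finance"))`.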

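Korean Training Configuration

  • Base model: LiquidAI/LFM2.5-8B-A1B
  • Method: full-parameter CPT, not LoRA
  • Hardware: 8x NVIDIA H200
  • Context length: 8192
  • Per-GPU batch size: 2
  • Gradient accumulation: 4
  • Effective batch: 64 sequences/update
  • Maximum tokens per update: 524,288
  • Learning rate: 2e-5
  • Epochs: 1
  • max_steps: -1
  • Save interval: 1,000 steps
  • Checkpoint retention: 4 latest checkpoints plus final model

Training scale:

  • Total rows: 4,622,971
  • Estimated tokens: 6.49B
  • Raw estimated steps: 12,384
  • Actual packed trainer steps: 10,196
  • Actual train runtime: about 9h 38m
  • Final train loss: 0.712
  • Final weight source: checkpoint-10196, which passed the integrity check

Korean Legal Data Sources

Legal-domain data is attributed to the Legalize-KR ecosystem and the local Korean legal corpora.

Legalize-KR is a public project that manages statutes, precedents, administrative rules, and local ordinances as Markdown with Git history. According to the organization README, source data comes from the National Law Information Center OpenAPI. Because Git history can be reconstructed when parsing and normalization rules improve, long-term reproducibility requires pinning a snapshot or release.

Attribution format:

  • Statutes: cite the legalize-kr/legalize-kr repository, the path kr/{statute-name}/{statute-type}.md, 법령ID, 법령MST, promulgation date, effective date, and the 출처 URL.
  • Precedents: cite the legalize-kr/precedent-kr repository, the path {case-type}/{court-level}/{court}_{decision-date}_{case-number}.md, 판례일련번호, court name, decision date, and case number.
  • Administrative rules: cite the legalize-kr/admrule-kr repository, the path {agency-path}/{rule-type}/{rule-name}/본문.md, and the 행정규칙일련번호 (rule serial number) or 발령번호 (issuing number).
  • Local ordinances: cite the legalize-kr/ordinance-kr repository, the path {province}/{municipality, _본청, or _교육청}/{ordinance-type}/{ordinance-name}/본문.md, 자치법규ID, 자치법규일련번호, promulgation date, promulgation number, and the 출처 URL.
  • Do not use commit hashes alone as long-term source identifiers. The Legalize-KR README notes that repository history may be reconstructed when the parser or normalization improves.
  • Per the Legalize-KR README, the original legal text is a public work of the Government of the Republic of Korea, and repository structure and metadata follow each repository's MIT notice.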

Evaluation Plan

Current vLLM Smoke Check

This is not a benchmark score. It verifies that both the base model and the CPT model load and generate with vLLM tensor parallelism.

  • Date: 2026-06-28
  • vLLM environment: local .vllm-lfm-cu12, vLLM 0.19.1, Torch 2.10.0+cu128
  • Tensor parallel size: 8
  • Max model length: 8192
  • Base model smoke: passed model load and generation
  • CPT model smoke: passed model load and generation
  • Smoke result path: /home/work/.data/lfm2_ko_cpt/evals/20260628_1052_smoke_clean_vllm_smoke
  • CPT checks passed: Korean legal, Korean finance, tool-call format, English instruction smoke
  • CPT wiki smoke note: the answer was relevant, but the simple keyword check expected the literal word 요약, so that specific automatic check is false.

Current vLLM Benchmark Results

Evaluation uses EleutherAI lm-evaluation-harness with vLLM tensor parallelism. The IFEval run below is the full 541-prompt public task, not a limited smoke sample.

  • Date: 2026-06-28
  • Task: ifeval
  • Runner: lm_eval==0.4.11, vllm==0.19.1, Torch 2.10.0+cu128
  • Tensor parallel size: 8
  • Max model length: 8192
  • Result path: /home/work/.data/lfm2_ko_cpt/evals/20260628_022743_ifeval_full_vllm_vllm_matrix

| Metric | LiquidAI/LFM2.5-8B-A1B | LFM2.5-8B-A1B-KO-CPT-FULL | Delta | Relative |
| --- | --- | --- | --- | --- |
| prompt_level_strict_acc | 0.2810 | 0.2976 | +0.0166 | +5.91% |
| prompt_level_loose_acc | 0.2921 | 0.3216 | +0.0295 | +10.10% |
| inst_level_strict_acc | 0.4221 | 0.4365 | +0.0144 | +3.41% |
| inst_level_loose_acc | 0.4341 | 0.4628 | +0.0287 | +6.61% |
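A base-vs-CPT run like the IFEval comparison above could be launched with the lm-evaluation-harness CLI and its vLLM backend. The sketch below only builds and prints the command lines. It is an assumed invocation assembled from the settings listed in this card (tensor parallel size 8, max model length 8192), not the exact command used for these results, and the output directory is a placeholder.

```python
import shlex

MODELS = [
    "LiquidAI/LFM2.5-8B-A1B",
    "LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL",
]


def lm_eval_command(model_id: str, task: str, output_dir: str) -> list[str]:
    """Build an lm_eval argv list for one model and one task (assumed settings)."""
    model_args = ",".join([
        f"pretrained={model_id}",
        "tensor_parallel_size=8",
        "max_model_len=8192",
        "dtype=bfloat16",
        "trust_remote_code=True",
    ])
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", task,
        "--batch_size", "auto",
        "--output_path", output_dir,
    ]


# "results/ifeval" is a placeholder path, not the result path reported above.
for model_id in MODELS:
    print(shlex.join(lm_eval_command(model_id, "ifeval", "results/ifeval")))
```

Running both models through the same function keeps the settings identical, which is what makes the Delta column meaningful.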

GSM8K 5-shot LIMIT=200 limited regression check:

| Metric | LiquidAI/LFM2.5-8B-A1B | LFM2.5-8B-A1B-KO-CPT-FULL | Delta | Relative |
| --- | --- | --- | --- | --- |
| exact_match strict-match | 0.2600 | 0.4250 | +0.1650 | +63.46% |
| exact_match flexible-extract | 0.4250 | 0.4950 | +0.0700 | +16.47% |

Global MMLU Korean LIMIT=500 limited check:

| Metric | LiquidAI/LFM2.5-8B-A1B | LFM2.5-8B-A1B-KO-CPT-FULL | Delta | Relative |
| --- | --- | --- | --- | --- |
| global_mmlu_full_ko acc | 0.2803 | 0.3086 | +0.0283 | +10.10% |
| humanities acc | 0.2784 | 0.3022 | +0.0238 | +8.55% |
| other acc | 0.2914 | 0.3385 | +0.0471 | +16.16% |
| social_sciences acc | 0.2911 | 0.3404 | +0.0493 | +16.93% |
| stem acc | 0.2623 | 0.2591 | -0.0032 | -1.22% |

Note: GSM8K and Global MMLU Korean above are limited runs and should be treated as early regression checks, not final public benchmark scores. Additional vLLM evaluations are running with one task per GPU.

Additional vLLM checks:

| Task | Metric | LiquidAI/LFM2.5-8B-A1B | LFM2.5-8B-A1B-KO-CPT-FULL | Delta | Relative | Note |
| --- | --- | --- | --- | --- | --- | --- |
| arc_challenge LIMIT=500 | acc | 0.3600 | 0.4020 | +0.0420 | +11.67% | limited |
| arc_challenge LIMIT=500 | acc_norm | 0.3760 | 0.4140 | +0.0380 | +10.11% | limited |
| gsm8k full 5-shot | exact_match strict | 0.2472 | 0.4617 | +0.2145 | +86.77% | full task |
| gsm8k full 5-shot | exact_match flexible | 0.4845 | 0.5701 | +0.0856 | +17.67% | full task |
| mmlu_pro_economics LIMIT=500 | exact_match | 0.4420 | 0.4900 | +0.0480 | +10.86% | limited |
| mmlu_pro_law LIMIT=500 | exact_match | 0.1840 | 0.1240 | -0.0600 | -32.61% | limited |
| mmlu_prox_lite_ko LIMIT=500 | exact_match | 0.2585 | 0.1667 | -0.0918 | -35.51% | limited |
| global_mmlu_full_ko_professional_law full | acc | 0.2581 | 0.2595 | +0.0014 | +0.54% | full subject |
| global_mmlu_full_ko_professional_accounting full | acc | 0.2730 | 0.2340 | -0.0390 | -14.29% | full subject |
| global_mmlu_full_ko_high_school_macroeconomics full | acc | 0.2436 | 0.2846 | +0.0410 | +16.83% | full subject |
| global_mmlu_full_ko_virology full | acc | 0.2831 | 0.3795 | +0.0964 | +34.05% | full subject |
| global_mmlu_full_ko_world_religions full | acc | 0.3450 | 0.4854 | +0.1404 | +40.70% | full subject |
| hellaswag LIMIT=1000 | acc | 0.4320 | 0.4430 | +0.0110 | +2.55% | limited |
| hellaswag LIMIT=1000 | acc_norm | 0.4330 | 0.5110 | +0.0780 | +18.01% | limited |
| winogrande full | acc | 0.5643 | 0.5699 | +0.0055 | +0.98% | full task |
| piqa full | acc | 0.7350 | 0.7541 | +0.0190 | +2.59% | full task |
| piqa full | acc_norm | 0.7209 | 0.7465 | +0.0256 | +3.55% | full task |
| boolq full | acc | 0.6544 | 0.7902 | +0.1358 | +20.75% | full task |
| global_mmlu_full_ko_high_school_geography full | acc | 0.3384 | 0.3434 | +0.0051 | +1.49% | full subject |
| global_mmlu_full_ko_public_relations full | acc | 0.2273 | 0.3000 | +0.0727 | +32.00% | full subject |
| global_mmlu_full_ko_management full | acc | 0.3107 | 0.4369 | +0.1262 | +40.63% | full subject |
| global_mmlu_full_ko_human_sexuality full | acc | 0.2672 | 0.3740 | +0.1069 | +40.00% | full subject |
| global_mmlu_full_ko_international_law full | acc | 0.3223 | 0.4215 | +0.0992 | +30.77% | full subject |
| leaderboard_instruction_following / leaderboard_ifeval | prompt_level_loose_acc | 0.2976 | 0.3346 | +0.0370 | +12.42% | lm-eval leaderboard task |
| global_mmlu_full_ko_business_ethics full | acc | 0.2100 | 0.4500 | +0.2400 | +114.29% | full subject |
| global_mmlu_full_ko_sociology full | acc | 0.2886 | 0.4776 | +0.1891 | +65.52% | full subject |
| global_mmlu_full_ko_computer_security full | acc | 0.2900 | 0.4500 | +0.1600 | +55.17% | full subject |
| global_mmlu_full_ko_marketing full | acc | 0.3590 | 0.5000 | +0.1410 | +39.29% | full subject |
| global_mmlu_full_ko_professional_psychology full | acc | 0.2729 | 0.3284 | +0.0556 | +20.36% | full subject |
| global_mmlu_full_ko_college_biology full | acc | 0.2569 | 0.3333 | +0.0764 | +29.73% | full subject |
| kmmlu_hard_humss LIMIT=1000 | acc | 0.2533 | 0.2675 | +0.0143 | +5.63% | limited |
| kmmlu_hard LIMIT=1000 | acc | 0.2015 | 0.1720 | -0.0295 | -14.63% | limited |
| kmmlu_hard_stem LIMIT=1000 | acc | 0.1973 | 0.1564 | -0.0409 | -20.74% | limited |

Latest Global MMLU Korean subject sweep:

| Task | Metric | LiquidAI/LFM2.5-8B-A1B | LFM2.5-8B-A1B-KO-CPT-FULL | Delta | Relative |
| --- | --- | --- | --- | --- | --- |
| global_mmlu_full_ko_astronomy | acc | 0.3421 | 0.2829 | -0.0592 | -17.31% |
| global_mmlu_full_ko_conceptual_physics | acc | 0.3149 | 0.2936 | -0.0213 | -6.76% |
| global_mmlu_full_ko_econometrics | acc | 0.2632 | 0.2807 | +0.0175 | +6.67% |
| global_mmlu_full_ko_electrical_engineering | acc | 0.2759 | 0.3103 | +0.0345 | +12.50% |
| global_mmlu_full_ko_formal_logic | acc | 0.3254 | 0.2778 | -0.0476 | -14.63% |
| global_mmlu_full_ko_high_school_biology | acc | 0.2710 | 0.2871 | +0.0161 | +5.95% |
| global_mmlu_full_ko_high_school_chemistry | acc | 0.2315 | 0.1921 | -0.0394 | -17.02% |
| global_mmlu_full_ko_high_school_statistics | acc | 0.2870 | 0.1574 | -0.1296 | -45.16% |
| global_mmlu_full_ko_high_school_european_history | acc | 0.2788 | 0.3152 | +0.0364 | +13.04% |
| global_mmlu_full_ko_high_school_world_history | acc | 0.2911 | 0.3376 | +0.0464 | +15.94% |
| global_mmlu_full_ko_jurisprudence | acc | 0.2870 | 0.2685 | -0.0185 | -6.45% |
| global_mmlu_full_ko_logical_fallacies | acc | 0.3067 | 0.2945 | -0.0123 | -4.00% |

The limited checks are useful for regression tracking, but they should not be read as final leaderboard-quality numbers. The model improves strongly on several reasoning and instruction-following checks, while law-focused MMLU-Pro and MMLU-ProX-lite-ko need targeted remediation.
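One way to judge how much weight a limited run can bear is to compare the delta with its sampling noise. The sketch below uses the binomial standard error and assumes that LIMIT=500 means 500 scored items and that the two runs are independent samples. The second assumption is conservative here, since both models actually see the same items. The function names are illustrative.

```python
import math


def accuracy_stderr(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n items."""
    return math.sqrt(p * (1.0 - p) / n)


def delta_in_stderr_units(base: float, cpt: float, n: int) -> float:
    """Delta divided by the standard error of a difference of two independent accuracies."""
    se = math.sqrt(accuracy_stderr(base, n) ** 2 + accuracy_stderr(cpt, n) ** 2)
    return (cpt - base) / se


# mmlu_pro_law at LIMIT=500: base 0.1840, CPT 0.1240.
print(round(accuracy_stderr(0.1840, 500), 4))
print(round(delta_in_stderr_units(0.1840, 0.1240, 500), 2))
```

Under these assumptions the mmlu_pro_law drop is a bit more than two and a half standard errors, large enough to take seriously but still worth confirming on the full task.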

KMMLU direct exact-match runs currently show near-zero base scores and small non-zero CPT scores. Treat those as prompt/extraction diagnostics rather than quality benchmarks until the direct-answer parser is fixed.

Recommended Next Post-Training

The next stage should be targeted post-training, not another broad CPT-only pass. Liquid's official LFM2.5-8B-A1B reporting emphasizes instruction following, math, tool use, and agentic workflows, measured with benchmarks such as IFEval, IFBench, Multi-IF, MATH500, AIME, BFCL, and Tau2. The Japanese LFM2.5-1.2B-JP model card reports language-specialized axes such as JMMLU-ProX, JMMLU, J-MIFEval, J-GSM8K, J-MATH500, JHumanEval+, and J-BFCLv3. The Korean follow-up should mirror those axes with Korean data and Korean eval gates.

Priority plan:

  1. Korean MCQA remediation SFT: KMMLU, KMMLU-hard, MMLU-ProX-lite-ko style, legal/accounting/finance questions, exact option-label outputs, and short rationales.
  2. STEM/legal/accounting remediation SFT: target current regressions in high-school statistics, astronomy, chemistry, formal logic, jurisprudence, professional accounting, and MMLU-Pro law.
  3. Korean instruction-following SFT: Ko-IFEval-style constraints, Korean formatting, uncertainty, refusal, and multi-condition prompts.
  4. Tool and agent SFT: Korean BFCL-style tool schemas, terminal/tool-call traces, JSON validity, and multi-turn task completion.
  5. Preference tuning: DPO/ORPO/KTO on current failure pairs. Reward correct option extraction, concise Korean, valid tools, and uncertainty over hallucination.
  6. Small preservation mix: keep GSM8K, ARC, BoolQ, IFEval, and high-gain Global MMLU KO examples in the mix so post-training does not erase current gains.

Concrete gates for the next model:

  • kmmlu_hard: improve from 0.1720 to at least 0.2300.
  • kmmlu_hard_stem: improve from 0.1564 to at least 0.2100.
  • mmlu_prox_lite_ko: recover above the base score 0.2585 and target 0.3000.
  • mmlu_pro_law: recover above the base score 0.1840 and target 0.2300.
  • Preserve gsm8k flexible exact match at or above 0.5701.
  • Preserve boolq accuracy at or above 0.7902.
  • Preserve leaderboard_instruction_following prompt loose at or above 0.3346.
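The gates above can be encoded as minimum scores and checked automatically after each evaluation run. This is a minimal sketch: the dictionary keys are illustrative names rather than lm-eval result keys, the recovery gates use the base score as the floor, and the stretch targets (0.3000 and 0.2300) are not encoded.

```python
# Minimum acceptable scores taken from the gates listed above.
GATES = {
    "kmmlu_hard": 0.2300,
    "kmmlu_hard_stem": 0.2100,
    "mmlu_prox_lite_ko": 0.2585,  # floor: recover to the base score
    "mmlu_pro_law": 0.1840,       # floor: recover to the base score
    "gsm8k_flexible": 0.5701,
    "boolq": 0.7902,
    "leaderboard_ifeval_prompt_loose": 0.3346,
}


def failed_gates(scores: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {name: (score, minimum)} for every gate that is missing or below its minimum."""
    failures = {}
    for name, minimum in GATES.items():
        score = scores.get(name, float("-inf"))
        if score < minimum:
            failures[name] = (score, minimum)
    return failures


# Current CPT scores reported in this card.
current = {
    "kmmlu_hard": 0.1720,
    "kmmlu_hard_stem": 0.1564,
    "mmlu_prox_lite_ko": 0.1667,
    "mmlu_pro_law": 0.1240,
    "gsm8k_flexible": 0.5701,
    "boolq": 0.7902,
    "leaderboard_ifeval_prompt_loose": 0.3346,
}
print(sorted(failed_gates(current)))
```

With the current CPT scores, the four remediation gates fail and the three preservation gates pass, which matches the priorities in the plan.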

Public Benchmark Plan

Primary public Korean benchmarks:

Secondary checks:

  • Korean legal RAG holdout
  • Korean finance explanation holdout
  • Korean wiki QA/summarization holdout
  • Terminal/tool-use smoke tests

Results for these benchmarks and holdout checks will be added once their vLLM base-vs-CPT evaluations are complete.
