LFM2.5-8B-A1B-KO-Agentic-SFT

Agentic/Fable diagnostic SFT follow-up for LLM-OS-Models/LFM2.5-8B-A1B-KO-SFT.

SFT model repo: https://huggingface.co/LLM-OS-Models/LFM2.5-8B-A1B-KO-SFT
CPT model repo: https://huggingface.co/LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL
GitHub SFT repo: https://github.com/gyunggyung/LFM25-KO-SFT
GitHub CPT repo: https://github.com/gyunggyung/LFM25-KO-CPT
Base model: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B
Liquid chat template docs: https://docs.liquid.ai/lfm/key-concepts/chat-template
Liquid tool-use docs: https://docs.liquid.ai/lfm/key-concepts/tool-use

Status

This model is uploaded for reproducibility and diagnostic analysis. It should not be treated as a public benchmark improvement over KO-CPT. It adds a small Agentic/Fable behavior stage after the failed Stage2 KO-SFT line.

item	value
samples	3,943
tokens	7,124,298
max sequence length	8192
training method	full-parameter response-only SFT
source checkpoint	Stage2 KO-SFT final
purpose	terminal/log/document grounded behavior probe

Verdict

The Stage3 Agentic/Fable run is not a successful public benchmark repair. It shows small partial recovery on a few diagnostic slices, but it does not restore the broad benchmark quality that KO-CPT had before SFT.

task	Stage2 KO-SFT	Agentic/Fable	change
Global MMLU KO limit50	0.244681	0.251773	+0.007092
Global MMLU KO medical limit50	0.361111	0.416667	+0.055556
IFEval strict limit50	0.1000	0.1000	+0.0000
KMMLU direct hard limit50	0.113407	0.109734	-0.003673
MMLU-Pro law	0.134423	0.150772	+0.016349
MMLU-Pro economics	0.323460	0.331754	+0.008294
TruthfulQA MC2	0.474975	0.476824	+0.001849
BoolQ	0.6664	0.664220	-0.002180
GSM8K exact	0.3381	0.360879	+0.022779

Interpretation:

The model slightly recovers some law/economics/global-mmlu diagnostic slices.
It does not improve IFEval.
KMMLU direct hard remains weak.
GSM8K recovers only from the failed Stage2 level and remains below KO-CPT and the original base reference.

Data

Agentic/Fable data sources:

source	local source	purpose
Fable5 Korean traces	`fable_distillation/datasets_ko/fable5_ko_sft_20260624.jsonl`	terminal, search, file reading, error-fix traces
Helio Korean traces	`fable_distillation/datasets_ko/helio_ko_sft_20260628.jsonl`	long-form reasoning traces
local grounded examples	generated from this workspace	README/runbook/train-log/git/vLLM diagnosis examples

Public dataset releases:

Raw Agentic/Fable data: https://huggingface.co/datasets/LLM-OS-Models/LFM2.5-KO-Agentic-Fable-Grounded-LFMChat-Raw
Tokenized Agentic/Fable 8k arrays: https://huggingface.co/datasets/LLM-OS-Models/LFM2.5-KO-Agentic-Fable-Grounded-LFMChat-8K

Why It Did Not Fix Benchmarks

This stage is only 7.12M tokens and is focused on behavior traces: reading logs, following repository docs, planning terminal commands, and explaining evidence. Those examples are not direct repair data for Korean multiple-choice likelihood, exact-answer extraction, or option-only output.

The previous Stage2 KO-SFT checkpoint had already moved away from the KO-CPT benchmark distribution. Training a small agentic trace set on top of that did not restore the lost answer-token scoring behavior.

Recommended Use

Use this checkpoint only for inspecting the Agentic/Fable behavior experiment. For public Korean benchmark quality, prefer:

https://huggingface.co/LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL

If future work resumes, start from KO-CPT and run a small MCQA/answer-format repair SFT with strict gates instead of continuing this checkpoint.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "LLM-OS-Models/LFM2.5-8B-A1B-KO-Agentic-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a careful terminal and log analysis assistant."},
    {"role": "user", "content": "다음 학습 로그에서 loss가 갑자기 튀는 원인을 어떻게 확인할지 단계별로 말해줘."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Korean Summary

이 모델은 Stage2 KO-SFT 위에 Fable/문서/로그 기반 agentic SFT를 소량 얹은 진단용 모델입니다. 공개 벤치 개선 모델이 아닙니다. 일부 항목은 Stage2보다 조금 회복했지만, KO-CPT가 가진 공개 벤치 성능을 되찾지 못했습니다.

Downloads last month: -

Safetensors

Model size

8B params

Tensor type

F32

BF16

Model tree for LLM-OS-Models/LFM2.5-8B-A1B-KO-Agentic-SFT

Base model

LiquidAI/LFM2.5-8B-A1B-Base

Finetuned

LiquidAI/LFM2.5-8B-A1B

Finetuned

LLM-OS-Models/LFM2.5-8B-A1B-KO-CPT-FULL

Finetuned

(2)

this model