KoHRM-Text-1.4B

KoHRM-Text-1.4B is a scratch-pretrained Korean/English/code/terminal/tool-use model based on the sapientinc/HRM-Text PrefixLM training stack.

This is not a continued finetune of sapientinc/HRM-Text-1B. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.

Links

Release Policy

The main model repository is intended to expose the latest model-only artifact:

  • model.safetensors
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • README.md

The main repository is not intended to keep every training checkpoint as visible model files. Intermediate FSDP2 .distcp checkpoints are large resume artifacts and are kept separately in LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints when needed. The main repo may still have normal Hugging Face git history, but the current file tree should be treated as the latest public model export.

Current public artifact: the stage-1 HRM fast-cap checkpoint at step_25000, converted to safetensors from the EMA weights. Training is still in progress.

Model Details

| Field | Value |
| --- | --- |
| Model id | LLM-OS-Models/KoHRM-Text-1.4B |
| Standard name | KoHRM-Text-1.4B |
| Training origin | scratch |
| Architecture family | HRM-Text PrefixLM |
| Architecture size | XL |
| Parameters | 1,384,120,320 |
| Context length | 4,096 tokens |
| Training dtype | bfloat16 |
| Tokenizer | byte-level BPE, NFC normalization |
| Vocabulary size | 131,072 |
| Objective | PrefixLM response-only loss |
| Optimizer | Adam-atan2 from upstream HRM-Text |
| EMA | 0.9999 |

The model config uses model_type: hrm_text and architectures: ["HrmTextForCausalLM"]. At the time of this checkpoint, HrmTextForCausalLM is a project-side custom architecture, not a built-in Transformers architecture.

Tokenizer

The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It intentionally keeps common chat/tool special tokens as stable single tokens where possible.

| Sample bucket | Chars/token |
| --- | --- |
| Korean general text | 2.60 |
| Korean legal text | 2.36 |
| Korean terminal instruction | 2.18 |
| shell command | 2.68 |
| tool-call JSON | 3.32 |
| Python code | 3.37 |
| English | 4.40 |
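A chars/token figure like those above can be reproduced with a few lines of Python. This is a minimal sketch, not the script used to produce the table: it assumes "chars" means Unicode code points and that the ratio is total characters over total tokens for a bucket. `chars_per_token` and `byte_encode` are hypothetical names, and the byte-level stand-in encoder exists only so the sketch runs without downloading the tokenizer.

```python
from typing import Callable, Iterable, List


def chars_per_token(texts: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Return total characters divided by total tokens over `texts`."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)            # Unicode code points (assumption)
        total_tokens += len(encode(text))   # token ids produced by the encoder
    if total_tokens == 0:
        raise ValueError("no tokens produced; cannot compute chars/token")
    return total_chars / total_tokens


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder for offline runs: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


samples = ["ls -la /var/log", "du -sh *"]
print(f"{chars_per_token(samples, byte_encode):.2f} chars/token")
```

To measure the released tokenizer instead, pass an encoder such as `lambda t: tokenizer(t, add_special_tokens=False)["input_ids"]`, where `tokenizer` is loaded with `AutoTokenizer.from_pretrained`.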

Important formatting tokens include:

  • <|im_start|>
  • <|im_end|>
  • <|box_end|>
  • <|object_ref_start|> for direct condition
  • <|object_ref_end|> for cot condition
  • <|quad_start|> for noisy condition
  • <|quad_end|> for synth condition
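
The condition tokens can be attached to an instruction with a small helper. This is a sketch under one assumption: the prompt layout is `<|im_start|>`, then the condition token, then the instruction, then `<|im_end|>`. `build_prompt` and `CONDITION_TOKENS` are hypothetical names for illustration, not part of the project code.

```python
# Condition names and their marker tokens, as listed above.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def build_prompt(condition: str, instruction: str) -> str:
    """Hypothetical helper: wrap an instruction in the assumed prompt layout."""
    try:
        marker = CONDITION_TOKENS[condition]
    except KeyError:
        raise ValueError(f"unknown condition: {condition!r}") from None
    # Assumed layout: <|im_start|> + condition token + instruction + <|im_end|>
    return f"<|im_start|>{marker}{instruction}<|im_end|>"


print(build_prompt("direct", "List the largest files in the current directory."))
```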

Usage

Tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "LLM-OS-Models/KoHRM-Text-1.4B",
    use_fast=True,
)

# Korean prompt: "Please tell me, in Korean, the command for finding large files in the current directory."
prompt = "<|im_start|><|object_ref_start|>한국어로 현재 디렉터리의 큰 파일을 찾는 명령을 알려주세요.<|im_end|>"
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(len(ids), ids[:20])

Model Weights

The repo currently contains a model-only safetensors export. Because the architecture is custom (hrm_text), direct AutoModelForCausalLM.from_pretrained(...) generation requires an HRM-Text-compatible modeling wrapper or remote-code integration. Until that wrapper is added to the model repo, use the project code and raw FSDP2 checkpoint path for internal inference/resume workflows.
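
Even without a modeling wrapper, the export can be inspected directly. The sketch below counts parameters from the safetensors header using only the standard library. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). `count_safetensors_params` is a hypothetical helper name, and the total may differ from the reported 1,384,120,320 if the export stores extra buffers or omits tied tensors.

```python
import json
import struct
from math import prod


def count_safetensors_params(path: str) -> int:
    """Sum tensor element counts from a safetensors header without loading weights."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length,
        # followed by that many bytes of UTF-8 encoded JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        total += prod(info["shape"])  # an empty shape (scalar) counts as 1
    return total
```

Calling `count_safetensors_params("model.safetensors")` on a downloaded copy gives a quick consistency check against the parameter count in the Model Details table.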

Raw checkpoint inference pattern:

from simple_inference_engine import inference_load_checkpoint, inference_generate

ckpt = inference_load_checkpoint(
    ckpt_path="/path/to/KoHRM-Text-1.4B-stage1-hrm-fastcap-gbs180",
    ckpt_epoch=25000,
    ckpt_use_ema=True,
    device="cuda",
)

prompts = iter([
    # Korean prompt: "Please explain, in Korean, the difference between `du` and `df`."
    (0, ("direct", "한국어로 `du`와 `df`의 차이를 설명해주세요.")),
])

for _, text in inference_generate(
    ckpt,
    prompts,
    max_tokens=4096,
    max_generation=512,
    batch_size=1,
    temp=0.0,
):
    print(text)

For code and training scripts, see https://github.com/LLM-OS-Models/KoHRM-text.

Training Data

Prepared data artifacts are uploaded to the Hugging Face dataset repository, not to the model repository:

https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data

All datasets are converted into HRM-Text V1Dataset style records with instruction, response, and condition fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.
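
As an illustration of response-only loss, the sketch below turns one record into token ids plus a loss mask that is 0 over the instruction span and 1 over the response span. It is not the upstream HRM-Text implementation: `prefixlm_example` and `byte_encode` are hypothetical, the byte-level encoder is a stand-in for the real tokenizer, and special tokens, the condition marker, and next-token label shifting are left out.

```python
from typing import Callable, Dict, List, Tuple


def prefixlm_example(
    record: Dict[str, str],
    encode: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Return (token_ids, loss_mask) with loss enabled only on response tokens."""
    prefix_ids = encode(record["instruction"])
    response_ids = encode(record["response"])
    token_ids = prefix_ids + response_ids
    # 0 = no loss on instruction tokens, 1 = loss on response tokens.
    loss_mask = [0] * len(prefix_ids) + [1] * len(response_ids)
    return token_ids, loss_mask


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder for offline runs: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


record = {
    "instruction": "Show disk usage of the current directory.",
    "response": "du -sh .",
    "condition": "direct",  # unused here; it would select the condition token
}
ids, mask = prefixlm_example(record, byte_encode)
print(len(ids), sum(mask))
```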

Completed and prepared datasets:

| Dataset | Tokens | Disk | Use |
| --- | --- | --- | --- |
| koterm_pretrain_mix_v1 | 711.3M | 2.8G | stage-0/stage0b |
| HRM cleaned base sample | 250.0M | 994M | included in stage-0 mix |
| SWE-ZERO + GLM pilot mix | 251.2M | 990M | included in stage-0 mix |
| Korean legal SFT/task data | 83.1M | 336M | included in stage-0 mix |
| ToolBench train tool-call data | 127.0M | 500M | included in stage-0 mix |
| HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
| Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
| Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
| Korean legal/admin full task data | 629.0M | 2.5G | uploaded to prepared dataset repo |
| Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
| HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
| Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
| BCAI Finance Korean | 857.7M | 3.3G | prepared and uploaded for later Korean finance/domain stages |
| SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
| GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |

Major source groups and provenance:

| Source group | Origin | Prepared dataset usage |
| --- | --- | --- |
| HRM-Text cleaned pretraining data | https://huggingface.co/datasets/sapientinc/HRM-Text-data-io-cleaned-20260515 | hrm_cleaned_base_sample_v1, koterm_hrm_cleaned_fastcap_stage1_v1; full no-cap retokenization is still running |
| Korean Wikipedia | https://dumps.wikimedia.org/kowiki/20260501/ | kowiki_raw_full_v1 |
| Korean statutes | https://github.com/legalize-kr/legalize-kr | korean_legal_raw_full_v1, sft_korean_legal_v1, korean_legal_tasks_full_v1 |
| Korean local ordinances | https://github.com/legalize-kr/ordinance-kr | korean_legal_raw_full_v1, sft_korean_legal_v1, korean_legal_tasks_full_v1 |
| Korean administrative rules | local Markdown snapshot at admrule-kr/ | korean_admrule_precedent_raw_full_v1, korean_legal_tasks_full_v1 |
| Korean precedents | local Markdown snapshot at precedent-kr/ | korean_admrule_precedent_raw_full_v1, korean_legal_tasks_full_v1 |
| ToolBench train data | local extraction under data_toolbench/data/; eval split excluded | sft_toolbench_v1 |
| SWE-ZERO trajectories | https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories | sft_swe_zero_v1, sft_swe_glm_mix_v1 |
| GLM reasoning | https://huggingface.co/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned | sft_glm_reasoning_v1, sft_swe_glm_mix_v1 |
| Claude reasoning sample | https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k | small reviewed reasoning subset inside hf_extra_reasoning_agent_mm_v1 |
| Open-MM-RL text subset | https://huggingface.co/datasets/TuringEnterprises/Open-MM-RL | text-only reviewed subset inside hf_extra_reasoning_agent_mm_v1 |
| DeepSeek agent traces | https://huggingface.co/datasets/TeichAI/DeepSeek-v4-Pro-Agent | limited agent/tool-use subset; license-sensitive |
| structured Wikipedia | https://huggingface.co/datasets/wikimedia/structured-wikipedia | tokenizer/general text support |
| Local terminal/code/math conversations | local swe, code, and math parquet conversations | local_terminal_conversations_ctx9k_resp6k_v1 |
| BCAI Finance Kor | https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-1862K | sft_bcai_finance_kor_v1 |

The full Korean legal/admin task upload and the BCAI finance upload are present in the dataset repository at:

  • korean_legal_tasks_full_v1/
  • raw_jsonl/korean_legal_tasks_full_20260524.jsonl
  • LEGAL_FULL_TASKS_README.md
  • sft_bcai_finance_kor_v1/
  • raw_jsonl/bcai_finance_kor_hrm_20260524.jsonl
  • FINANCE_BCAI_README.md

Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented chi-bench data.

Licenses and terms remain those of the original sources. The prepared dataset upload does not relicense upstream content.

Training Run

The current public checkpoint was produced through staged pretraining:

  1. Train stage-0 on koterm_pretrain_mix_v1 with 711.3M tokens.
  2. Continue training on the same mix for a second pass (stage0b).
  3. Continue to stage-1 on HRM cleaned fast-cap data with 14.55B tokens.
  4. Convert stage1 step_25000 EMA weights to safetensors and upload to the main model repo.

Current long-running stage-1 settings:

| Field | Value |
| --- | --- |
| Hardware | 8 x NVIDIA H200 |
| Data | koterm_hrm_cleaned_fastcap_stage1_v1 |
| Tokens in current stage dataset | 14.55B |
| Global batch | 180,224 tokens |
| Local token slots/GPU | 22,528 |
| Context | 4,096 |
| LR | 2.2e-4 |
| LR warmup | 2,000 steps |
| Checkpoint interval | 5,000 steps |
| Current public export | step_25000, EMA, safetensors |
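
The batch figures above can be cross-checked with simple arithmetic. The sketch assumes that every token slot is filled at every step and that step_25000 counts stage-1 steps only. Neither is confirmed here, so the result (about 4.51B tokens, roughly 31% of the stage-1 dataset) is an upper-bound estimate rather than a reported number.

```python
GPUS = 8
LOCAL_TOKEN_SLOTS_PER_GPU = 22_528
GLOBAL_BATCH_TOKENS = 180_224
STAGE1_DATASET_TOKENS = 14.55e9
EXPORT_STEP = 25_000

# The global batch should equal GPUs x local token slots per GPU.
assert GPUS * LOCAL_TOKEN_SLOTS_PER_GPU == GLOBAL_BATCH_TOKENS

# Upper bound on tokens processed by the export step (assumes full slots and
# that the step counter covers stage-1 only).
tokens_at_export = EXPORT_STEP * GLOBAL_BATCH_TOKENS
fraction_of_stage1 = tokens_at_export / STAGE1_DATASET_TOKENS

print(f"tokens at export step: {tokens_at_export / 1e9:.2f}B")
print(f"fraction of stage-1 dataset: {fraction_of_stage1:.1%}")
```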

The run uses staged continuation. The checkpoint carries model, optimizer, EMA, and recurrent carry state forward. resume_step_offset and total_steps_override are used so the learning-rate schedule follows the intended longer pretraining run rather than resetting at every data stage.
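
The effect of a step offset on the schedule can be sketched as follows. This is an illustration only: the semantics assigned to `resume_step_offset` and the total step count are assumptions about how those options behave, and the cosine decay is a hypothetical shape that the card does not specify. Only the peak LR (2.2e-4) and the 2,000-step warmup come from the settings table.

```python
import math


def lr_at(stage_step: int, resume_step_offset: int, total_steps: int,
          peak_lr: float = 2.2e-4, warmup_steps: int = 2_000) -> float:
    """Learning rate at a stage-local step, evaluated on the global schedule."""
    step = stage_step + resume_step_offset  # position within the whole run
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    # Assumed decay shape: cosine from peak_lr to 0 over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical numbers: a stage resumes after 10,000 earlier steps of a 100,000-step run.
print(lr_at(0, resume_step_offset=0, total_steps=100_000))       # no offset: warmup restarts
print(lr_at(0, resume_step_offset=10_000, total_steps=100_000))  # with offset: schedule continues
```

Without the offset, each new data stage would start again from the bottom of the warmup ramp. With it, the first step of the stage picks up where the global schedule left off.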

The full 328G HRM cleaned corpus is being retokenized with the new 131K tokenizer. This no-cap retokenization is intended to support a larger training continuation of 40B+ tokens rather than stopping at the 14.55B-token fast-cap stage.

Intended Use

This checkpoint is intended for:

  • continued pretraining experiments
  • Korean tokenizer and HRM-Text architecture experiments
  • terminal/tool-call/code pretraining research
  • checkpoint conversion and evaluation work

It is not yet intended as a finished assistant model.

Limitations

  • This is an intermediate checkpoint, not a final aligned instruct model.
  • It has not completed the full planned 40B+ token continuation.
  • It has not completed final SFT or safety tuning.
  • Public benchmark scores for this new checkpoint are not final.
  • Direct Transformers generation requires adding the custom hrm_text modeling wrapper or remote-code files.
  • Tool-call JSON validity and terminal action safety must be evaluated before production use.

Citation

This work builds on the HRM-Text architecture and training stack from sapientinc/HRM-Text.
