Japanese Pretraining Data Filter Classifier

This repository contains a 4-class Japanese document quality classifier for filtering pretraining data. The target is a Japanese small language model focused on business decision support, keyword/tag extraction, causal reasoning, scenario analysis, and insight generation.

The classifier was trained on Japanese Common Crawl samples that had been labeled by an LLM filter. The main supervision field is filter_result.d.

Label Mapping

Original LLM labels were mapped into 4 classifier classes:

Class ID | Class      | Source filter_result.d
0        | reject     | X, R
1        | low_value  | KL, D
2        | keep       | K
3        | high_value | KH

Original label meaning:

d  | Meaning
KH | keep high value
K  | keep
KL | keep low weight
D  | deduplicate
R  | needs review
X  | reject

The LLM field w was used as the sample weight. Because X and R examples have w=0.0, training used sample_weight = max(w, 0.2) so that the reject class could still be learned. The floor of 0.2 corresponds to the --weight-floor option of the preparation script.
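
The label mapping and the weight floor described above can be sketched as follows. This is a minimal illustration, not the repository's preparation script: the names CLASS_NAMES, LABEL_TO_CLASS, map_label, and floored_weight are hypothetical, and the record layout (a dict with "d" and "w" keys) is an assumption.

```python
# Hypothetical helper names; only the mapping and the 0.2 floor come from this README.
CLASS_NAMES = ["reject", "low_value", "keep", "high_value"]

LABEL_TO_CLASS = {
    "X": 0, "R": 0,    # reject
    "KL": 1, "D": 1,   # low_value
    "K": 2,            # keep
    "KH": 3,           # high_value
}

WEIGHT_FLOOR = 0.2


def map_label(d: str) -> int:
    """Map an original filter_result.d label to a classifier class ID."""
    return LABEL_TO_CLASS[d]


def floored_weight(w: float, floor: float = WEIGHT_FLOOR) -> float:
    """Apply sample_weight = max(w, floor) so zero-weight examples still count."""
    return max(w, floor)


# Assumed record layout: the LLM output is a dict with "d" and "w" keys.
record = {"d": "X", "w": 0.0}
class_id = map_label(record["d"])
print(CLASS_NAMES[class_id], floored_weight(record["w"]))
```

With the floor in place, an X example with w=0.0 is assigned to the reject class with weight 0.2 instead of being ignored by a weighted loss.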

Input Construction

Each training example is built from:

  • URL, if available
  • script statistics, if available: kana_ratio, cjk_ratio, latin_ratio, chars
  • document text

Long documents are truncated to the following three parts, in order:

  • first 10,000 characters
  • [TAIL]
  • last 2,000 characters

The resulting input is tokenized with max_length=2048.

Example structure:

[URL]
https://example.jp/article
[SCRIPT_STATS]
kana_ratio=... cjk_ratio=... latin_ratio=... chars=...
[TEXT]
...
[TAIL]
...
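
The input layout above could be assembled roughly as in the sketch below. The function names truncate_text and build_input are hypothetical, and several details are assumptions rather than facts from the preparation script: the threshold at which a document counts as "long" (here, more than head + tail characters), the newline placement around [TAIL], and the exact formatting of the script statistics.

```python
from typing import Optional

HEAD_CHARS = 10_000  # matches --head-chars in the preparation command
TAIL_CHARS = 2_000   # matches --tail-chars in the preparation command


def truncate_text(text: str, head: int = HEAD_CHARS, tail: int = TAIL_CHARS) -> str:
    """Keep the head and tail of a long document, joined by a [TAIL] marker."""
    # Assumption: documents that fit within head + tail characters are left untouched.
    if len(text) <= head + tail:
        return text
    return f"{text[:head]}\n[TAIL]\n{text[-tail:]}"


def build_input(text: str, url: Optional[str] = None, stats: Optional[dict] = None) -> str:
    """Assemble the [URL] / [SCRIPT_STATS] / [TEXT] sections, skipping missing ones."""
    parts = []
    if url:
        parts += ["[URL]", url]
    if stats:
        # Assumed formatting: space-separated key=value pairs in a fixed order.
        keys = ("kana_ratio", "cjk_ratio", "latin_ratio", "chars")
        parts += ["[SCRIPT_STATS]", " ".join(f"{k}={stats[k]}" for k in keys if k in stats)]
    parts += ["[TEXT]", truncate_text(text)]
    return "\n".join(parts)


example = build_input("短い本文", url="https://example.jp/article")
print(example)
```

Inputs passed to the classifier at inference time should follow whatever layout the preparation script actually produced, so it is worth checking a few rows of the prepared JSONL before relying on a reimplementation like this one.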

Data

Source paths used locally:

  • LLM-filtered raw labeled data: /mnt/data/commoncrawl_ja_2025_present/data/filtered_random_10k
  • Raw sample file: /mnt/data/commoncrawl_ja_2025_present/data/ja_sample.jsonl

Prepared classifier split:

Split | Rows
Train | 178,986
Test  | 19,888
Total | 198,874

Split method:

  • stratified train/test split
  • test size: 0.1
  • seed: 20260604
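
A stratified split with these settings can be sketched in plain Python as below. This is an illustration of the technique only: the function name stratified_split is hypothetical, and the preparation script may use a library routine instead, so the exact row assignment and per-class counts from this sketch can differ slightly from the reported ones (for example, simple rounding would put 34 rather than 35 high_value examples in the test split).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=20260604):
    """Return (train_idx, test_idx) with each class split in roughly the same ratio."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test split.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy label list: 100 examples of class 0 and 20 of class 1.
toy_labels = [0] * 100 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps rare classes such as high_value represented in both splits, which a plain random split does not guarantee.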

Original filter_result.d counts:

Label | Count
D     | 137,652
X     | 30,408
KL    | 17,014
K     | 12,082
R     | 1,374
KH    | 344

Mapped 4-class counts:

Class      | Total   | Train   | Test
reject     | 31,782  | 28,604  | 3,178
low_value  | 154,666 | 139,199 | 15,467
keep       | 12,082  | 10,874  | 1,208
high_value | 344     | 309     | 35

Model

Base model:

sbintuitions/modernbert-ja-130m

Reason for choosing this model:

  • Japanese-focused encoder model
  • efficient 130M parameter size
  • supports long-context classification better than typical BERT-style 512-token encoders
  • tokenizer/model loaded cleanly in the local training environment

The trained model is stored in:

outputs_4class_modernbert/full_best/best_model/

Final training checkpoints are also included:

outputs_4class_modernbert/full_best/checkpoint-11000/
outputs_4class_modernbert/full_best/checkpoint-11188/

The best checkpoint selected by macro_f1 was checkpoint-11000; best_model/ contains the loadable best model and tokenizer.

Hyperparameter Sweep

The sweep was run on a subset only:

  • train subset: 60,000 samples
  • eval subset: 12,000 samples
  • epochs: 1

Run                  | Max Length | LR   | Batch/GPU | Macro F1 | Weighted F1 | Accuracy | Reject F1 | Keep+ F1
sweep_len2048_lr2e-5 | 2048       | 2e-5 | 16        | 0.5995   | 0.8395      | 0.8322   | 0.6796    | 0.6519
sweep_len1024_lr2e-5 | 1024       | 2e-5 | 32        | 0.4925   | 0.7593      | 0.7383   | 0.5853    | 0.5895
sweep_len4096_lr1e-5 | 4096       | 1e-5 | 8         | 0.4314   | 0.7559      | 0.7705   | 0.3577    | 0.5249

The best sweep setting was max_length=2048, learning_rate=2e-5.

Final Training Configuration

Final training used the full train split, not the 60k sweep subset.

CUDA_VISIBLE_DEVICES=0,1 TOKENIZERS_PARALLELISM=false \
torchrun --standalone --nproc_per_node=2 \
  code/train_4class_classifier.py \
  --model-name sbintuitions/modernbert-ja-130m \
  --data-dir data_4class \
  --output-dir outputs_4class_modernbert/full_best \
  --max-length 2048 \
  --learning-rate 2e-5 \
  --weight-decay 0.01 \
  --epochs 2 \
  --batch-size 16 \
  --gradient-accumulation-steps 1 \
  --warmup-ratio 0.06 \
  --class-weight sqrt_balanced \
  --eval-steps 1000 \
  --save-steps 1000 \
  --num-proc 8

Important settings:

Parameter                   | Value
GPUs                        | 2 (devices 0 and 1)
Max length                  | 2048
Epochs                      | 2
Per-device train batch size | 16
Effective train batch size  | 32
Learning rate               | 2e-5
Weight decay                | 0.01
Warmup ratio                | 0.06
Class weighting             | sqrt_balanced
BF16                        | enabled
Seed                        | 20260604
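
The effective batch size and the number of optimizer steps follow from the settings above. The arithmetic below is a quick consistency check, assuming the last partial batch of each epoch is kept rather than dropped; under that assumption the total matches the final checkpoint name, checkpoint-11188.

```python
import math

train_rows = 178_986       # rows in the train split
per_device_batch = 16      # --batch-size
num_gpus = 2               # --nproc_per_node
grad_accum = 1             # --gradient-accumulation-steps
epochs = 2                 # --epochs

effective_batch = per_device_batch * num_gpus * grad_accum
# Assumption: the final partial batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 32 5594 11188
```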

Class weights:

Class      | Weight
reject     | 0.3150
low_value  | 0.1428
keep       | 0.5110
high_value | 3.0312
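
The weights above are consistent with weights proportional to 1/sqrt(class count) on the train split, rescaled so that they average to 1. The sketch below is inferred from the reported numbers, not taken from the training script, so the function name sqrt_balanced_weights and the normalization are assumptions; it does reproduce the four values to four decimal places.

```python
import math

# Train-split class counts from the "Mapped 4-class counts" table.
train_counts = {"reject": 28_604, "low_value": 139_199, "keep": 10_874, "high_value": 309}


def sqrt_balanced_weights(counts):
    """Weights proportional to 1/sqrt(count), rescaled so they average to 1."""
    raw = {name: 1.0 / math.sqrt(n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: value * scale for name, value in raw.items()}


weights = sqrt_balanced_weights(train_counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")
```

Compared with fully inverse-frequency weighting, the square root softens the correction: high_value gets roughly 21 times the weight of low_value here, rather than roughly 450 times.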

Final Results

Final evaluation was run on the held-out test split of 19,888 examples.

Metric            | Value
Accuracy          | 0.8702
Macro F1          | 0.7050
Weighted F1       | 0.8736
Reject F1         | 0.7355
Keep+ F1          | 0.7181
Eval loss         | 0.1485
Train runtime     | 3,112.9 sec
Train samples/sec | 115.0

Per-class metrics:

Class      | Precision | Recall | F1     | Support
reject     | 0.7099    | 0.7631 | 0.7355 | 3,178
low_value  | 0.9372    | 0.8959 | 0.9161 | 15,467
keep       | 0.6074    | 0.8377 | 0.7042 | 1,208
high_value | 0.6190    | 0.3714 | 0.4643 | 35

Confusion matrix, rows=true labels and columns=predicted labels in order [reject, low_value, keep, high_value]:

[
  [2425,   739,   14,  0],
  [ 989, 13857,  620,  1],
  [   2,   187, 1012,  7],
  [   0,     2,   20, 13]
]
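
The headline metrics can be recomputed from this confusion matrix, which is a useful sanity check when comparing runs. The sketch below uses plain Python with hypothetical helper names. Keep+ F1 is not defined in this README; treating it as the binary F1 obtained after merging keep and high_value into one positive group is an assumption, although it does reproduce the reported 0.7181.

```python
CLASSES = ["reject", "low_value", "keep", "high_value"]
CM = [
    [2425, 739, 14, 0],
    [989, 13857, 620, 1],
    [2, 187, 1012, 7],
    [0, 2, 20, 13],
]


def prf(tp, predicted, actual):
    """Precision, recall, and F1 from true positives and the two marginals."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def per_class_metrics(cm):
    n = len(cm)
    # Column sums are predicted counts; row sums are true counts.
    return [prf(cm[k][k], sum(cm[i][k] for i in range(n)), sum(cm[k])) for k in range(n)]


def merged_f1(cm, positive):
    """Binary F1 after merging the class IDs in `positive` into one group."""
    n = len(cm)
    tp = sum(cm[i][j] for i in positive for j in positive)
    predicted = sum(cm[i][j] for i in range(n) for j in positive)
    actual = sum(cm[i][j] for i in positive for j in range(n))
    return prf(tp, predicted, actual)[2]


metrics = per_class_metrics(CM)
total = sum(map(sum, CM))
accuracy = sum(CM[k][k] for k in range(len(CM))) / total
macro_f1 = sum(m[2] for m in metrics) / len(metrics)
keep_plus_f1 = merged_f1(CM, positive=[2, 3])  # assumed definition of Keep+

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} keep_plus_f1={keep_plus_f1:.4f}")
```

The matrix also shows where the errors concentrate: most mistakes are confusions between reject and low_value, and 20 of the 35 true high_value examples are predicted as keep.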

Notes:

  • low_value is the strongest class (F1 0.9161), largely because it dominates the dataset (about 78% of all examples).
  • keep has high recall (0.8377) but moderate precision (0.6074), which is acceptable for permissive pretraining filtering.
  • high_value is weak (F1 0.4643) and its score is unstable because the test split has only 35 examples (and the train split only 309).
  • For production use, treat high_value as a useful ranking signal rather than a fully reliable hard class until more KH examples are labeled.
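
One way to use high_value as a ranking signal is to sort documents by the softmax probability of that class instead of taking the argmax. The sketch below is only an illustration: the helper names are hypothetical, the logits are made-up numbers rather than model outputs, and it assumes the output order matches class IDs 0 to 3 from the label mapping.

```python
import math

CLASS_NAMES = ["reject", "low_value", "keep", "high_value"]


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def high_value_score(logits):
    """P(high_value) for one document, used for ranking rather than hard labels."""
    return softmax(logits)[CLASS_NAMES.index("high_value")]


# Made-up logits for three documents, purely for illustration.
docs = {
    "doc_a": [0.1, 2.0, 1.0, -1.0],
    "doc_b": [-2.0, 0.5, 2.0, 1.5],
    "doc_c": [3.0, 0.0, -1.0, -3.0],
}

ranked = sorted(docs, key=lambda name: high_value_score(docs[name]), reverse=True)
print(ranked)
```

In this toy example doc_b would be predicted as keep by argmax, yet it still ranks first by high_value probability, which is the kind of soft signal a hard label would discard.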

Reproducing the Data Split

python code/prepare_4class_dataset.py \
  --input-dir raw/filtered_random_10k \
  --output-dir data_4class \
  --test-size 0.1 \
  --seed 20260604 \
  --head-chars 10000 \
  --tail-chars 2000 \
  --weight-floor 0.2

Inference Example

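import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Directory holding the best model and tokenizer (see the Model section);
# adjust the path if the files were downloaded elsewhere.
model_dir = "outputs_4class_modernbert/full_best/best_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

# Use the same section markers as in training; [URL] and [SCRIPT_STATS] are optional.
text = "[TEXT]\n日本語の文書本文..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 4)

# Single-document batch, so argmax yields one class ID.
pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])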

Repository Layout

README.md
prepare_4class_dataset.py
train_4class_classifier.py
run_modernbert_4class_sweep.sh
data_4class/
outputs_4class_modernbert/
raw/

  • outputs_4class_modernbert/full_best/best_model/: best finetuned classifier
  • outputs_4class_modernbert/full_best/checkpoint-*: final training checkpoints
  • reports/sweep_*: sweep final reports and logs, without sweep checkpoints
  • prepare_4class_dataset.py, train_4class_classifier.py: dataset preparation and training scripts
  • data_4class/: prepared train/test JSONL split
  • raw/filtered_random_10k/: LLM-filtered raw labeled data used to create the split
  • raw/ja_sample.jsonl: requested raw sample file
  • outputs_4class_modernbert/full_best/final_report.json: final training report
  • reports/sweep_summary.json: sweep summary