Japanese Pretraining Data Filter Classifier

This repository contains a 4-class Japanese document-quality classifier for filtering pretraining data. The target is a Japanese small language model focused on business decision support, keyword/tag extraction, causal reasoning, scenario analysis, and insight generation.

The classifier was trained on Japanese Common Crawl samples that had been labeled by an LLM filter. The main supervision field is `filter_result.d`.

Label Mapping

Original LLM labels were mapped into 4 classifier classes:

Class ID	Class	Source `filter_result.d`
0	`reject`	`X`, `R`
1	`low_value`	`KL`, `D`
2	`keep`	`K`
3	`high_value`	`KH`
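The mapping table above can be written as a small lookup. This is a minimal sketch, not the repository's code; the names `LABEL_TO_CLASS`, `CLASS_TO_ID`, and `map_label` are illustrative assumptions.

```python
# Original LLM label (filter_result.d) -> classifier class, transcribed from the table.
# All identifiers here are illustrative, not taken from the repository scripts.
LABEL_TO_CLASS = {
    "X": "reject",
    "R": "reject",
    "KL": "low_value",
    "D": "low_value",
    "K": "keep",
    "KH": "high_value",
}
CLASS_TO_ID = {"reject": 0, "low_value": 1, "keep": 2, "high_value": 3}


def map_label(d: str) -> int:
    """Return the 4-class ID for an original `d` label."""
    return CLASS_TO_ID[LABEL_TO_CLASS[d]]


print(map_label("KH"))  # 3
```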

Original label meaning:

`d`	Meaning
`KH`	keep high value
`K`	keep
`KL`	keep low weight
`D`	deduplicate
`R`	needs review
`X`	reject

The LLM field `w` was used as the sample weight. Because `X` and `R` examples have `w=0.0`, training used `sample_weight = max(w, 0.2)` so that the reject class could still be learned.
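A minimal sketch of the weight floor, assuming each record exposes the LLM fields `d` and `w`. The example records and their non-zero weights are hypothetical; only the 0.2 floor and the zero weight for `X` come from the text.

```python
WEIGHT_FLOOR = 0.2  # floor stated above


def sample_weight(w: float, floor: float = WEIGHT_FLOOR) -> float:
    """Clamp the LLM-assigned weight from below so zero-weight rows still train."""
    return max(w, floor)


# Hypothetical records; the non-zero weights are made up for illustration.
records = [
    {"d": "X", "w": 0.0},
    {"d": "KL", "w": 0.5},
    {"d": "KH", "w": 1.0},
]
weights = [sample_weight(r["w"]) for r in records]
print(weights)  # [0.2, 0.5, 1.0]
```

Without the floor, every `reject` example would contribute zero loss and the class would never be learned.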

Input Construction

Each training example is built from:

URL, if available
script statistics, if available: `kana_ratio`, `cjk_ratio`, `latin_ratio`, `chars`
document text

For long documents, the text is truncated as:

first 10,000 characters
[TAIL]
last 2,000 characters

The resulting input is tokenized with `max_length=2048`.

Example structure:

[URL]
https://example.jp/article
[SCRIPT_STATS]
kana_ratio=... cjk_ratio=... latin_ratio=... chars=...
[TEXT]
...
[TAIL]
...
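The layout above could be assembled roughly as follows. This is a sketch under assumptions, not the code in `prepare_4class_dataset.py`: the function names, the newline separators, the `key=value` formatting of the statistics, and the rule that only documents longer than head + tail characters are truncated are all guesses.

```python
from typing import Mapping, Optional

HEAD_CHARS = 10_000  # first N characters kept
TAIL_CHARS = 2_000   # last N characters kept


def truncate_text(text: str, head: int = HEAD_CHARS, tail: int = TAIL_CHARS) -> str:
    """Keep the head and tail of a long document, joined by a [TAIL] marker."""
    # Assumption: documents of at most head + tail characters are left untouched.
    if len(text) <= head + tail:
        return text
    return f"{text[:head]}\n[TAIL]\n{text[len(text) - tail:]}"


def build_input(text: str, url: Optional[str] = None,
                stats: Optional[Mapping[str, float]] = None) -> str:
    """Assemble the [URL] / [SCRIPT_STATS] / [TEXT] layout; optional parts are skipped."""
    parts = []
    if url:
        parts += ["[URL]", url]
    if stats:
        keys = ("kana_ratio", "cjk_ratio", "latin_ratio", "chars")
        parts += ["[SCRIPT_STATS]",
                  " ".join(f"{k}={stats[k]}" for k in keys if k in stats)]
    parts += ["[TEXT]", truncate_text(text)]
    return "\n".join(parts)


example = build_input("あ" * 13_000, url="https://example.jp/article")
print(example[:40])
```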

Data

Source paths used locally:

LLM-filtered raw labeled data: /mnt/data/commoncrawl_ja_2025_present/data/filtered_random_10k
Raw sample file: /mnt/data/commoncrawl_ja_2025_present/data/ja_sample.jsonl

Prepared classifier split:

Split	Rows
Train	178,986
Test	19,888
Total	198,874

Split method:

stratified train/test split
test size: 0.1
seed: 20260604

Original `filter_result.d` counts:

Label	Count
`D`	137,652
`X`	30,408
`KL`	17,014
`K`	12,082
`R`	1,374
`KH`	344

Mapped 4-class counts:

Class	Total	Train	Test
`reject`	31,782	28,604	3,178
`low_value`	154,666	139,199	15,467
`keep`	12,082	10,874	1,208
`high_value`	344	309	35

Model

Base model:

sbintuitions/modernbert-ja-130m

Reason for choosing this model:

Japanese-focused encoder model
efficient 130M parameter size
supports longer inputs than typical 512-token BERT-style encoders, which suits long-document classification
tokenizer/model loaded cleanly in the local training environment

The trained model is stored in:

outputs_4class_modernbert/full_best/best_model/

Final training checkpoints are also included:

outputs_4class_modernbert/full_best/checkpoint-11000/
outputs_4class_modernbert/full_best/checkpoint-11188/

The best checkpoint, selected by `macro_f1`, was `checkpoint-11000`; `best_model/` contains the loadable best model and tokenizer.

Hyperparameter Sweep

The sweep was run on a subset only:

train subset: 60,000 samples
eval subset: 12,000 samples
epochs: 1

Run	Max Length	LR	Batch/GPU	Macro F1	Weighted F1	Accuracy	Reject F1	Keep+ F1
`sweep_len2048_lr2e-5`	2048	`2e-5`	16	0.5995	0.8395	0.8322	0.6796	0.6519
`sweep_len1024_lr2e-5`	1024	`2e-5`	32	0.4925	0.7593	0.7383	0.5853	0.5895
`sweep_len4096_lr1e-5`	4096	`1e-5`	8	0.4314	0.7559	0.7705	0.3577	0.5249

The best sweep setting was `max_length=2048`, `learning_rate=2e-5`, which scored highest on every reported metric.

Final Training Configuration

Final training used the full train split, not the 60k sweep subset.

CUDA_VISIBLE_DEVICES=0,1 TOKENIZERS_PARALLELISM=false \
torchrun --standalone --nproc_per_node=2 \
  code/train_4class_classifier.py \
  --model-name sbintuitions/modernbert-ja-130m \
  --data-dir data_4class \
  --output-dir outputs_4class_modernbert/full_best \
  --max-length 2048 \
  --learning-rate 2e-5 \
  --weight-decay 0.01 \
  --epochs 2 \
  --batch-size 16 \
  --gradient-accumulation-steps 1 \
  --warmup-ratio 0.06 \
  --class-weight sqrt_balanced \
  --eval-steps 1000 \
  --save-steps 1000 \
  --num-proc 8

Important settings:

Parameter	Value
GPUs	2 (GPU 0 and 1)
Max length	2048
Epochs	2
Per-device train batch size	16
Effective train batch size	32
Learning rate	`2e-5`
Weight decay	`0.01`
Warmup ratio	`0.06`
Class weighting	`sqrt_balanced`
BF16	enabled
Seed	`20260604`
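The effective batch size and the step count implied by these settings can be checked with a few lines of arithmetic. The sketch assumes the last partial batch of each epoch is kept rather than dropped; under that assumption the total matches the final checkpoint name, `checkpoint-11188`.

```python
import math

train_rows = 178_986     # train split size from the Data section
num_gpus = 2
per_device_batch = 16
grad_accum = 1
epochs = 2

effective_batch = num_gpus * per_device_batch * grad_accum
# Assumption: the last partial batch of each epoch is kept (not dropped).
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)   # 32
print(steps_per_epoch)   # 5594
print(total_steps)       # 11188
```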

Class weights:

Class	Weight
`reject`	0.3150
`low_value`	0.1428
`keep`	0.5110
`high_value`	3.0312
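The `sqrt_balanced` scheme is not spelled out here. One formula that reproduces the table is inverse-square-root frequency weighting, rescaled so the weights average 1.0. The sketch below is an inference from the reported numbers, not the code in `train_4class_classifier.py`; the function name is an assumption.

```python
import math

# Per-class training counts from the "Mapped 4-class counts" table.
train_counts = {"reject": 28_604, "low_value": 139_199, "keep": 10_874, "high_value": 309}


def sqrt_balanced_weights(counts: dict) -> dict:
    """Inverse-square-root frequency weights, rescaled to average 1.0."""
    raw = {name: 1.0 / math.sqrt(n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: value * scale for name, value in raw.items()}


weights = sqrt_balanced_weights(train_counts)
for name, value in weights.items():
    print(f"{name}: {value:.4f}")
```

Taking the square root softens plain inverse-frequency balancing: `high_value` is about 450 times rarer than `low_value` in the train split but receives only about 21 times the weight.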

Final Results

Final evaluation was run on the held-out test split of 19,888 examples.

Metric	Value
Accuracy	0.8702
Macro F1	0.7050
Weighted F1	0.8736
Reject F1	0.7355
Keep+ F1	0.7181
Eval loss	0.1485
Train runtime	3,112.9 sec
Train samples/sec	115.0

Per-class metrics:

Class	Precision	Recall	F1	Support
`reject`	0.7099	0.7631	0.7355	3,178
`low_value`	0.9372	0.8959	0.9161	15,467
`keep`	0.6074	0.8377	0.7042	1,208
`high_value`	0.6190	0.3714	0.4643	35

Confusion matrix, rows=true labels and columns=predicted labels in order [reject, low_value, keep, high_value]:

[
  [2425,   739,   14,  0],
  [ 989, 13857,  620,  1],
  [   2,   187, 1012,  7],
  [   0,     2,   20, 13]
]
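The headline metrics can be recomputed from the confusion matrix alone, which is a useful sanity check. In this sketch the "Keep+" definition (merging `keep` and `high_value` into one positive class) is an assumption; it is not stated in the report, but it reproduces the reported 0.7181.

```python
CLASSES = ["reject", "low_value", "keep", "high_value"]
# Rows are true labels, columns are predicted labels (copied from the matrix above).
CM = [
    [2425, 739, 14, 0],
    [989, 13857, 620, 1],
    [2, 187, 1012, 7],
    [0, 2, 20, 13],
]


def f1(tp: int, n_pred: int, n_true: int) -> float:
    """F1 from true positives, predicted-positive count, and actual-positive count."""
    return 2 * tp / (n_pred + n_true) if (n_pred + n_true) else 0.0


n = len(CLASSES)
support = [sum(row) for row in CM]
predicted = [sum(row[j] for row in CM) for j in range(n)]
total = sum(support)

per_class_f1 = [f1(CM[i][i], predicted[i], support[i]) for i in range(n)]
accuracy = sum(CM[i][i] for i in range(n)) / total
macro_f1 = sum(per_class_f1) / n
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / total

# Assumption: "Keep+" treats keep and high_value as a single positive class.
keep_plus = (2, 3)
tp_plus = sum(CM[i][j] for i in keep_plus for j in keep_plus)
keep_plus_f1 = f1(tp_plus,
                  sum(predicted[j] for j in keep_plus),
                  sum(support[i] for i in keep_plus))

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
print(f"keep_plus_f1={keep_plus_f1:.4f}")
```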

Notes:

`low_value` is the strongest class (F1 0.9161) because it dominates the dataset.
`keep` has high recall (0.8377) but moderate precision (0.6074), which is acceptable for permissive pretraining filtering.
`high_value` is weak and unstable: the dataset has only 344 `KH` examples, 35 of them in the test split.
For production use, treat `high_value` as a ranking signal rather than a reliable hard class until more `KH` examples are labeled.
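One way to use `high_value` as a ranking signal is to sort documents by the softmax probability of that class instead of taking the argmax. The sketch below uses plain Python and hypothetical logits; the document IDs and values are made up, and in practice the logits would come from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


HIGH_VALUE_ID = 3  # class ID from the label mapping

# Hypothetical logits in class order [reject, low_value, keep, high_value].
doc_logits = {
    "doc_a": [0.1, 0.3, 2.0, 1.5],
    "doc_b": [2.5, 1.0, -1.0, -3.0],
    "doc_c": [-1.0, 0.0, 1.0, 2.5],
}

# Score each document by P(high_value) and sort from highest to lowest.
scores = {doc: softmax(logits)[HIGH_VALUE_ID] for doc, logits in doc_logits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['doc_c', 'doc_a', 'doc_b']
```

Here `doc_a` would be predicted as `keep` by argmax, yet it still ranks above `doc_b`, so the score keeps information that a hard label discards.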

Reproducing the Data Split

python code/prepare_4class_dataset.py \
  --input-dir raw/filtered_random_10k \
  --output-dir data_4class \
  --test-size 0.1 \
  --seed 20260604 \
  --head-chars 10000 \
  --tail-chars 2000 \
  --weight-floor 0.2

Inference Example

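from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Path to the best model and tokenizer described in the Model section.
model_dir = "outputs_4class_modernbert/full_best/best_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()  # disable dropout for inference

# Use the same layout as training; [URL] and [SCRIPT_STATS] are optional.
text = "[TEXT]\n日本語の文書本文..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    logits = model(**inputs).logits
    pred_id = int(logits.argmax(dim=-1))

# Map the predicted class ID back to its label name.
print(model.config.id2label[pred_id])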

Repository Layout

README.md
prepare_4class_dataset.py
train_4class_classifier.py
run_modernbert_4class_sweep.sh
data_4class/
outputs_4class_modernbert/
raw/

outputs_4class_modernbert/full_best/best_model/: best finetuned classifier
outputs_4class_modernbert/full_best/checkpoint-*: final training checkpoints
reports/sweep_*: sweep final reports and logs, without sweep checkpoints
prepare_4class_dataset.py, train_4class_classifier.py: dataset preparation and training scripts
data_4class/: prepared train/test JSONL split
raw/filtered_random_10k/: LLM-filtered raw labeled data used to create the split
raw/ja_sample.jsonl: raw sample file
outputs_4class_modernbert/full_best/final_report.json: final training report
reports/sweep_summary.json: sweep summary
