Japanese Pretraining Data Filter Classifier
This repository contains a 4-class Japanese document quality classifier for filtering pretraining data for a Japanese small language model focused on business decision support, keyword/tag extraction, causal reasoning, scenario analysis, and insight generation.
The classifier was trained from LLM-filtered Common Crawl Japanese samples. The main supervision field is filter_result.d.
Label Mapping
Original LLM labels were mapped into 4 classifier classes:
| Class ID | Class | Source filter_result.d |
|---|---|---|
| 0 | reject | X, R |
| 1 | low_value | KL, D |
| 2 | keep | K |
| 3 | high_value | KH |
Original label meaning:
| d | Meaning |
|---|---|
| KH | keep high value |
| K | keep |
| KL | keep low weight |
| D | deduplicate |
| R | needs review |
| X | reject |
The LLM field w was used as sample weight. Because X and R examples have w=0.0, training used sample_weight = max(w, 0.2) so the reject class could still be learned.
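The mapping and the weight floor above can be sketched in a few lines. This is a minimal illustration, not the preparation script itself: the helper name `to_example` is hypothetical, and the sketch assumes `w` sits next to `d` inside `filter_result`, which the text does not state.

```python
# Mapping from the original LLM label (filter_result.d) to the 4 classifier classes.
LABEL_TO_CLASS = {
    "X": "reject", "R": "reject",
    "KL": "low_value", "D": "low_value",
    "K": "keep",
    "KH": "high_value",
}
CLASS_TO_ID = {"reject": 0, "low_value": 1, "keep": 2, "high_value": 3}

WEIGHT_FLOOR = 0.2  # X and R have w=0.0, so the floor keeps them trainable


def to_example(record: dict) -> dict:
    """Turn one labeled record into a class id and a sample weight.

    Assumption: the record looks like {"filter_result": {"d": ..., "w": ...}}.
    """
    result = record["filter_result"]
    class_name = LABEL_TO_CLASS[result["d"]]
    return {
        "label": CLASS_TO_ID[class_name],
        "sample_weight": max(float(result["w"]), WEIGHT_FLOOR),
    }
```

With this floor, a rejected record with `w=0.0` contributes with weight 0.2 instead of being ignored by the loss.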
Input Construction
Each training example is built from:
- URL, if available
- script statistics, if available: kana_ratio, cjk_ratio, latin_ratio, chars
- document text
For long documents, the text is truncated as:
- first 10,000 characters
- [TAIL] marker
- last 2,000 characters
The resulting input is tokenized with max_length=2048.
Example structure:
[URL]
https://example.jp/article
[SCRIPT_STATS]
kana_ratio=... cjk_ratio=... latin_ratio=... chars=...
[TEXT]
...
[TAIL]
...
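A rough sketch of how such an input string could be assembled is shown below. The function names, the newline separators, the formatting of the statistics, and the rule that a document counts as "long" once it exceeds head plus tail characters are all assumptions; the preparation script may differ in these details.

```python
HEAD_CHARS = 10_000
TAIL_CHARS = 2_000
STAT_KEYS = ("kana_ratio", "cjk_ratio", "latin_ratio", "chars")


def truncate_text(text, head_chars=HEAD_CHARS, tail_chars=TAIL_CHARS):
    """Keep the head and the tail of a long document, joined by a [TAIL] marker."""
    # Assumption: documents that fit in head + tail characters are left untouched.
    if len(text) <= head_chars + tail_chars:
        return text
    return f"{text[:head_chars]}\n[TAIL]\n{text[-tail_chars:]}"


def build_input(text, url=None, stats=None):
    """Assemble the [URL] / [SCRIPT_STATS] / [TEXT] layout; optional parts are skipped."""
    parts = []
    if url:
        parts += ["[URL]", url]
    if stats:
        parts += ["[SCRIPT_STATS]", " ".join(f"{key}={stats[key]}" for key in STAT_KEYS)]
    parts += ["[TEXT]", truncate_text(text)]
    return "\n".join(parts)
```

The resulting string would then be passed to the tokenizer with `max_length=2048`, so very long heads can still be cut at the token level.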
Data
Source paths used locally:
- LLM-filtered raw labeled data: /mnt/data/commoncrawl_ja_2025_present/data/filtered_random_10k
- Raw sample file: /mnt/data/commoncrawl_ja_2025_present/data/ja_sample.jsonl
Prepared classifier split:
| Split | Rows |
|---|---|
| Train | 178,986 |
| Test | 19,888 |
| Total | 198,874 |
Split method:
- stratified train/test split
- test size: 0.1
- seed: 20260604
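For readers unfamiliar with stratified splitting, the idea can be sketched with the standard library alone. This is an illustrative stand-in, not the code in the preparation script: `stratified_split` is a hypothetical helper, and its per-class rounding can differ by a row or so from what a library implementation produces, so it is not expected to reproduce the exact row counts reported here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=20260604):
    """Return (train_indices, test_indices), preserving each class's share."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Splitting per class matters here because the rarest class has only a few hundred rows; a plain random split could leave the test set with almost none of them.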
Original filter_result.d counts:
| Label | Count |
|---|---|
| D | 137,652 |
| X | 30,408 |
| KL | 17,014 |
| K | 12,082 |
| R | 1,374 |
| KH | 344 |
Mapped 4-class counts:
| Class | Total | Train | Test |
|---|---|---|---|
| reject | 31,782 | 28,604 | 3,178 |
| low_value | 154,666 | 139,199 | 15,467 |
| keep | 12,082 | 10,874 | 1,208 |
| high_value | 344 | 309 | 35 |
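As a consistency check, the mapped totals follow directly from the original label counts and the label mapping. The short script below only re-adds numbers already listed in this README.

```python
# Original filter_result.d counts, as listed above.
ORIGINAL_COUNTS = {"D": 137_652, "X": 30_408, "KL": 17_014, "K": 12_082, "R": 1_374, "KH": 344}

LABEL_TO_CLASS = {
    "X": "reject", "R": "reject",
    "KL": "low_value", "D": "low_value",
    "K": "keep",
    "KH": "high_value",
}

# Sum the original counts per mapped class.
mapped_totals = {}
for label, count in ORIGINAL_COUNTS.items():
    class_name = LABEL_TO_CLASS[label]
    mapped_totals[class_name] = mapped_totals.get(class_name, 0) + count

# Train/test rows per class, as listed above.
SPLIT_COUNTS = {
    "reject": (28_604, 3_178),
    "low_value": (139_199, 15_467),
    "keep": (10_874, 1_208),
    "high_value": (309, 35),
}

for class_name, (train_rows, test_rows) in SPLIT_COUNTS.items():
    assert train_rows + test_rows == mapped_totals[class_name]

print(mapped_totals, sum(mapped_totals.values()))
```

The check makes the imbalance explicit: low_value holds roughly 78% of all rows, while high_value holds under 0.2%.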
Model
Base model:
sbintuitions/modernbert-ja-130m
Reasons for choosing this model:
- Japanese-focused encoder model
- efficient 130M parameter size
- supports long-context classification better than typical BERT-style 512-token encoders
- tokenizer/model loaded cleanly in the local training environment
The trained model is stored in:
outputs_4class_modernbert/full_best/best_model/
Final training checkpoints are also included:
outputs_4class_modernbert/full_best/checkpoint-11000/
outputs_4class_modernbert/full_best/checkpoint-11188/
The best checkpoint selected by macro_f1 was checkpoint-11000; best_model/ contains the loadable best model and tokenizer.
Hyperparameter Sweep
The sweep was run on a subset only:
- train subset: 60,000 samples
- eval subset: 12,000 samples
- epochs: 1
| Run | Max Length | LR | Batch/GPU | Macro F1 | Weighted F1 | Accuracy | Reject F1 | Keep+ F1 |
|---|---|---|---|---|---|---|---|---|
| sweep_len2048_lr2e-5 | 2048 | 2e-5 | 16 | 0.5995 | 0.8395 | 0.8322 | 0.6796 | 0.6519 |
| sweep_len1024_lr2e-5 | 1024 | 2e-5 | 32 | 0.4925 | 0.7593 | 0.7383 | 0.5853 | 0.5895 |
| sweep_len4096_lr1e-5 | 4096 | 1e-5 | 8 | 0.4314 | 0.7559 | 0.7705 | 0.3577 | 0.5249 |
The best sweep setting was max_length=2048, learning_rate=2e-5.
Final Training Configuration
Final training used the full train split, not the 60k sweep subset.
CUDA_VISIBLE_DEVICES=0,1 TOKENIZERS_PARALLELISM=false \
torchrun --standalone --nproc_per_node=2 \
code/train_4class_classifier.py \
--model-name sbintuitions/modernbert-ja-130m \
--data-dir data_4class \
--output-dir outputs_4class_modernbert/full_best \
--max-length 2048 \
--learning-rate 2e-5 \
--weight-decay 0.01 \
--epochs 2 \
--batch-size 16 \
--gradient-accumulation-steps 1 \
--warmup-ratio 0.06 \
--class-weight sqrt_balanced \
--eval-steps 1000 \
--save-steps 1000 \
--num-proc 8
Important settings:
| Parameter | Value |
|---|---|
| GPUs | 2, GPU 0 and 1 |
| Max length | 2048 |
| Epochs | 2 |
| Per-device train batch size | 16 |
| Effective train batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
| Class weighting | sqrt_balanced |
| BF16 | enabled |
| Seed | 20260604 |
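The effective batch size and the step count implied by these settings can be worked out as follows. The sketch assumes the last partial batch of each epoch is kept rather than dropped; under that assumption the total matches the final checkpoint name, checkpoint-11188, and the throughput matches the train samples/sec reported in the final results.

```python
import math

train_rows = 178_986
num_gpus = 2
per_device_batch = 16
grad_accum_steps = 1
epochs = 2
train_runtime_sec = 3112.9  # from the final results table

# Samples consumed per optimizer step across both GPUs.
effective_batch = num_gpus * per_device_batch * grad_accum_steps

# Assumption: the last partial batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

samples_per_sec = train_rows * epochs / train_runtime_sec

print(effective_batch, steps_per_epoch, total_steps, round(samples_per_sec, 1))
```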
Class weights:
| Class | Weight |
|---|---|
| reject | 0.3150 |
| low_value | 0.1428 |
| keep | 0.5110 |
| high_value | 3.0312 |
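The training script's definition of `sqrt_balanced` is not shown here. One formula that reproduces the listed weights to about three decimal places is inverse square-root frequency, rescaled so the weights average to 1. Treat the sketch below as an inference from the numbers, not as the script's actual code.

```python
import math

# Train-split row counts per class, from the mapped 4-class table.
TRAIN_COUNTS = {"reject": 28_604, "low_value": 139_199, "keep": 10_874, "high_value": 309}


def sqrt_balanced_weights(counts):
    """Weight each class by 1/sqrt(count), rescaled so the mean weight is 1."""
    raw = {name: 1.0 / math.sqrt(rows) for name, rows in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: value * scale for name, value in raw.items()}


weights = sqrt_balanced_weights(TRAIN_COUNTS)
for name, value in weights.items():
    print(f"{name}: {value:.4f}")
```

The square root softens the correction: full inverse-frequency weighting would give high_value about 450 times the weight of low_value, whereas here the ratio is roughly 21.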
Final Results
Final evaluation was run on the held-out test split of 19,888 examples.
| Metric | Value |
|---|---|
| Accuracy | 0.8702 |
| Macro F1 | 0.7050 |
| Weighted F1 | 0.8736 |
| Reject F1 | 0.7355 |
| Keep+ F1 | 0.7181 |
| Eval loss | 0.1485 |
| Train runtime | 3,112.9 sec |
| Train samples/sec | 115.0 |
Per-class metrics:
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| reject | 0.7099 | 0.7631 | 0.7355 | 3,178 |
| low_value | 0.9372 | 0.8959 | 0.9161 | 15,467 |
| keep | 0.6074 | 0.8377 | 0.7042 | 1,208 |
| high_value | 0.6190 | 0.3714 | 0.4643 | 35 |
Confusion matrix, rows=true labels and columns=predicted labels in order [reject, low_value, keep, high_value]:
[
[2425, 739, 14, 0],
[ 989, 13857, 620, 1],
[ 2, 187, 1012, 7],
[ 0, 2, 20, 13]
]
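The reported metrics can be recomputed from this confusion matrix. The sketch below derives per-class precision, recall, and F1, plus accuracy, macro F1, and weighted F1. The Keep+ computation is an assumption: it treats keep and high_value as a single positive class, an interpretation that happens to reproduce the reported 0.7181.

```python
CLASSES = ["reject", "low_value", "keep", "high_value"]
CM = [  # rows = true labels, columns = predicted labels
    [2425, 739, 14, 0],
    [989, 13857, 620, 1],
    [2, 187, 1012, 7],
    [0, 2, 20, 13],
]


def per_class_metrics(cm):
    """Precision, recall, F1, and support for each class of a square matrix."""
    metrics = {}
    for i, name in enumerate(CLASSES):
        true_pos = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum
        actual = sum(cm[i])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1, "support": actual}
    return metrics


metrics = per_class_metrics(CM)
total = sum(map(sum, CM))
accuracy = sum(CM[i][i] for i in range(len(CM))) / total
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
weighted_f1 = sum(m["f1"] * m["support"] for m in metrics.values()) / total

# Assumption: Keep+ merges keep (index 2) and high_value (index 3) into one class.
keep_plus = (2, 3)
kp_true_pos = sum(CM[i][j] for i in keep_plus for j in keep_plus)
kp_predicted = sum(row[j] for row in CM for j in keep_plus)
kp_actual = sum(sum(CM[i]) for i in keep_plus)
keep_plus_f1 = 2 * kp_true_pos / (kp_predicted + kp_actual)

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} "
      f"weighted_f1={weighted_f1:.4f} keep_plus_f1={keep_plus_f1:.4f}")
```

Reading the matrix row by row also shows where errors concentrate: most mistakes are between reject and low_value, and 20 of the 35 high_value documents are predicted as keep.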
Notes:
- low_value is the strongest class because it dominates the dataset.
- keep has high recall but moderate precision, which is acceptable for permissive pretraining filtering.
- high_value is weak and unstable because the test split has only 35 examples.
- For production use, treat high_value as a useful ranking signal rather than a fully reliable hard class until more KH examples are labeled.
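One way to use high_value as a ranking signal is to sort documents by the softmax probability of that class instead of relying on the argmax label. The sketch below uses plain Python and made-up logits; the document names and logit values are hypothetical, and in practice the logits would come from the model's output.

```python
import math

CLASSES = ["reject", "low_value", "keep", "high_value"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def high_value_score(logits):
    """Probability mass assigned to high_value, used as a soft ranking score."""
    return softmax(logits)[CLASSES.index("high_value")]


# Hypothetical logits for three documents (not real model outputs).
docs = {
    "doc_a": [0.1, 0.2, 2.0, 1.5],    # argmax is keep, but high_value is close
    "doc_b": [3.0, 1.0, -1.0, -2.0],  # clearly reject
    "doc_c": [-1.0, 0.0, 1.0, 2.5],   # argmax is high_value
}

ranked = sorted(docs, key=lambda name: high_value_score(docs[name]), reverse=True)
print(ranked)
```

A document such as doc_a would be labeled keep by a hard decision, yet it still receives a meaningful high_value score, which is the behavior the note above recommends while KH labels remain scarce.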
Reproducing the Data Split
python code/prepare_4class_dataset.py \
--input-dir raw/filtered_random_10k \
--output-dir data_4class \
--test-size 0.1 \
--seed 20260604 \
--head-chars 10000 \
--tail-chars 2000 \
--weight-floor 0.2
Inference Example
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model_dir = "model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()
text = "[TEXT]\n日本語の文書本文..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])
Repository Layout
README.md
prepare_4class_dataset.py
train_4class_classifier.py
run_modernbert_4class_sweep.sh
data_4class/
outputs_4class_modernbert/
raw/
- outputs_4class_modernbert/full_best/best_model/: best fine-tuned classifier
- outputs_4class_modernbert/full_best/checkpoint-*: final training checkpoints
- reports/sweep_*: sweep final reports and logs, without sweep checkpoints
- prepare_4class_dataset.py, train_4class_classifier.py: dataset preparation and training scripts
- data_4class/: prepared train/test JSONL split
- raw/filtered_random_10k/: LLM-filtered raw labeled data used to create the split
- raw/ja_sample.jsonl: requested raw sample file
- outputs_4class_modernbert/full_best/final_report.json: final training report
- reports/sweep_summary.json: sweep summary