[CAUTION] This is an alpha version of the model.

noise-line-remover-jp

noise-line-remover-jp is a Japanese line-level noise classifier for corpus cleanup. Given a multi-line text, it predicts whether each line should be kept or deleted.

This model is fine-tuned from sbintuitions/modernbert-ja-310m. The model uses ModernBERT hidden states at special line-token positions and applies a binary classifier to each line.

Quick Start

import torch
from transformers import AutoModel, AutoTokenizer


LINE_TOKEN = "<line>"
THRESHOLD = 0.6
TEXT = """
富士山は日本で最も高い山で、標高は3,776メートルである。
山頂付近は夏でも気温が低く、天候が急に変化することがある。
外部リンク: https://example.com/fuji
この記事は検証可能な参考文献が不足しています。
登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
カテゴリ: 日本の山 | 火山 | 世界遺産
"""

tokenizer = AutoTokenizer.from_pretrained("MK0727/noise-line-remover-jp")
model = AutoModel.from_pretrained("MK0727/noise-line-remover-jp", trust_remote_code=True)
model.eval()

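# TEXT begins and ends with a newline, so the first and last entries of
# `lines` are empty strings; they are scored like any other line.
lines = TEXT.split("\n")
# Prefix every line with the line token so the model can score each line.
text = "".join(f"{LINE_TOKEN}{line}" for line in lines)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Column 1 is the delete class: one delete probability per line.
probabilities = torch.softmax(logits, dim=-1)[:, 1].detach().cpu().tolist()

# zip(..., strict=True) requires Python 3.10+ and raises if the counts differ.
for line_number, (line, probability) in enumerate(zip(lines, probabilities, strict=True), start=1):
    label = "DELETE" if probability >= THRESHOLD else "KEEP"
    print(f"{line_number:02d} [{label:<6}] {probability:.4f} {line}")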

Example output:

01 [DELETE] 0.9714 
02 [KEEP  ] 0.0100 富士山は日本で最も高い山で、標高は3,776メートルである。
03 [KEEP  ] 0.0009 山頂付近は夏でも気温が低く、天候が急に変化することがある。
04 [DELETE] 0.8480 外部リンク: https://example.com/fuji
05 [DELETE] 0.8244 この記事は検証可能な参考文献が不足しています。
06 [KEEP  ] 0.0472 登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
07 [DELETE] 0.8707 カテゴリ: 日本の山 | 火山 | 世界遺産
08 [KEEP  ] 0.4276

Intended Use

This model is intended for preprocessing Japanese web corpora before language model training. It is useful when a dataset contains lines such as boilerplate text, navigation fragments, repeated links, low-value fragments, or other noisy content that should be removed while keeping useful body text.

The output is a delete probability for each line. A typical workflow is:

Split a document into lines.
Run line-level prediction.
Delete lines whose probability is at or above a chosen threshold.
Join the remaining lines back into cleaned text.

The recommended deletion threshold is 0.6.
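
The last two workflow steps can be sketched in plain Python as follows. This is a minimal illustration, not the author's code: the helper name `clean_lines` is hypothetical, the probabilities are made-up placeholders rather than model output, and the comparison follows the Quick Start, where a line is deleted when its probability is `>=` the threshold.

```python
from typing import Sequence


def clean_lines(
    lines: Sequence[str],
    delete_probs: Sequence[float],
    threshold: float = 0.6,
) -> str:
    """Drop lines whose delete probability reaches the threshold, then rejoin."""
    if len(lines) != len(delete_probs):
        raise ValueError("expected exactly one probability per line")
    # Keep a line only when its delete probability is strictly below the threshold.
    kept = [line for line, p in zip(lines, delete_probs) if p < threshold]
    return "\n".join(kept)


# Placeholder values for illustration only, not real predictions.
lines = ["本文の一行目。", "外部リンク: https://example.com", "本文の二行目。"]
probs = [0.02, 0.91, 0.10]
cleaned = clean_lines(lines, probs)
print(cleaned)
```

Raising the threshold keeps more borderline lines, and lowering it removes more aggressively, so the value is worth tuning on a sample of your own corpus.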

Model Details

Base model: sbintuitions/modernbert-ja-310m
Task: binary line-level classification
Positive label: line should be deleted
Negative label: line should be kept
Input length: up to 4096 tokens per window
Long documents: split into overlapping line windows, then duplicate predictions are averaged
Architecture: ModernBERT encoder plus a classification head over line-token hidden states

Each input line must be prefixed with the special line token, <line>. The model classifies the hidden state at each line-token position, so a single forward pass produces predictions for every line in the input.
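
A minimal sketch of the input construction, mirroring the Quick Start. The helper name `build_model_input` and the guard against lines that already contain the token are assumptions added for illustration; the model card does not say how such lines should be handled.

```python
LINE_TOKEN = "<line>"


def build_model_input(lines, line_token=LINE_TOKEN):
    """Concatenate lines, each prefixed with the line token (no newlines kept)."""
    # Assumption: a literal line token inside the raw text would add an extra
    # prediction position, so this sketch rejects it instead of guessing.
    for line in lines:
        if line_token in line:
            raise ValueError("input line already contains the line token")
    return "".join(f"{line_token}{line}" for line in lines)


lines = ["一行目", "二行目", ""]
text = build_model_input(lines)

# One line token per input line, so predictions can be paired with lines by position.
assert text.count(LINE_TOKEN) == len(lines)
print(text)
```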

Training Data

The model was trained on MK0727/line-noise-label, a Japanese line-level dataset with delete labels. Training used 4096-token windows and ran for 3 epochs.

Performance

Final test-set metrics from the training run described above:

Metric	Value
Test precision for delete lines	0.8041
Test recall for delete lines	0.7572
Test F1 for delete lines	0.7799
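
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall using the standard harmonic-mean formula. Only the two input values come from the table; the script itself is an illustrative addition.

```python
# Values copied from the table above.
precision = 0.8041
recall = 0.7572

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # prints 0.7799, matching the reported F1
```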

Inference Notes

For short texts, all lines can be processed in a single 4096-token window. For longer texts, split the document into overlapping windows and average probabilities for lines that appear in more than one window.
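
The windowing and averaging can be sketched as below. This is one possible scheme under stated assumptions, not the procedure used by the author: the greedy window construction, the `overlap_lines` parameter, and both function names are hypothetical. Per-line token counts are taken as input; in practice they would come from the tokenizer and should leave room for the line tokens and any special tokens.

```python
def make_line_windows(token_counts, max_tokens=4096, overlap_lines=2):
    """Group consecutive line indices into windows that fit within max_tokens.

    Consecutive windows share up to `overlap_lines` lines (an assumed setting).
    """
    windows = []
    n = len(token_counts)
    start = 0
    while start < n:
        end, total = start, 0
        # Always take at least one line, even if it alone exceeds the budget
        # (such a line would be truncated by the tokenizer).
        while end < n and (end == start or total + token_counts[end] <= max_tokens):
            total += token_counts[end]
            end += 1
        windows.append(list(range(start, end)))
        if end >= n:
            break
        # Step back to create the overlap, but always advance by at least one line.
        start = max(end - overlap_lines, start + 1)
    return windows


def average_window_predictions(windows, window_probs, num_lines):
    """Average delete probabilities for lines that appear in several windows."""
    sums = [0.0] * num_lines
    counts = [0] * num_lines
    for indices, probs in zip(windows, window_probs):
        for i, p in zip(indices, probs):
            sums[i] += p
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy example: five lines of 10 tokens each and a 30-token budget.
windows = make_line_windows([10] * 5, max_tokens=30, overlap_lines=1)
# Placeholder per-window probabilities, not model output.
window_probs = [[0.0, 0.25, 0.75], [0.25, 0.5, 1.0]]
averaged = average_window_predictions(windows, window_probs, num_lines=5)
print(windows)
print(averaged)
```

Here line index 2 appears in both windows, so its final probability is the mean of its two scores. A larger overlap gives boundary lines more surrounding context at the cost of extra forward passes.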

Limitations

This model is specialized for Japanese web-text cleanup. It may perform poorly on other languages, highly structured documents, code, and tables.
