[CAUTION!!] This is an alpha version model.
noise-line-remover-jp
noise-line-remover-jp is a Japanese line-level noise classifier for corpus cleanup. Given a multi-line text, it predicts whether each line should be kept or deleted.
This model is fine-tuned from sbintuitions/modernbert-ja-310m. The model uses ModernBERT hidden states at special line-token positions and applies a binary classifier to each line.
Quick Start
import torch
from transformers import AutoModel, AutoTokenizer

LINE_TOKEN = "<line>"  # special token that marks the start of each line
THRESHOLD = 0.6  # recommended deletion threshold
TEXT = """
富士山は日本で最も高い山で、標高は3,776メートルである。
山頂付近は夏でも気温が低く、天候が急に変化することがある。
外部リンク: https://example.com/fuji
この記事は検証可能な参考文献が不足しています。
登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
カテゴリ: 日本の山 | 火山 | 世界遺産
"""

# The model class is loaded from the repository, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("MK0727/noise-line-remover-jp")
model = AutoModel.from_pretrained("MK0727/noise-line-remover-jp", trust_remote_code=True)
model.eval()

# Prefix every line with the line token and concatenate into one sequence.
lines = TEXT.split("\n")
text = "".join(f"{LINE_TOKEN}{line}" for line in lines)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Column 1 of the softmax output is the delete probability of each line.
probabilities = torch.softmax(logits, dim=-1)[:, 1].detach().cpu().tolist()

for line_number, (line, probability) in enumerate(zip(lines, probabilities, strict=True), start=1):
    label = "DELETE" if probability >= THRESHOLD else "KEEP"
    print(f"{line_number:02d} [{label:<6}] {probability:.4f} {line}")
Example output (lines 01 and 08 are the empty strings that TEXT.split("\n") produces from the leading and trailing newlines in TEXT):
01 [DELETE] 0.9714
02 [KEEP  ] 0.0100 富士山は日本で最も高い山で、標高は3,776メートルである。
03 [KEEP  ] 0.0009 山頂付近は夏でも気温が低く、天候が急に変化することがある。
04 [DELETE] 0.8480 外部リンク: https://example.com/fuji
05 [DELETE] 0.8244 この記事は検証可能な参考文献が不足しています。
06 [KEEP  ] 0.0472 登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
07 [DELETE] 0.8707 カテゴリ: 日本の山 | 火山 | 世界遺産
08 [KEEP  ] 0.4276
Intended Use
This model is intended for preprocessing Japanese web corpora before language model training. It is useful when a dataset contains lines such as boilerplate text, navigation fragments, repeated links, low-value fragments, or other noisy content that should be removed while keeping useful body text.
The output is a delete probability for each line. A typical workflow is:
- Split a document into lines.
- Run line-level prediction.
- Delete lines whose probability is at or above a chosen threshold.
- Join the remaining lines back into cleaned text.
The recommended deletion threshold is 0.6.
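The workflow above can be sketched in plain Python once per-line delete probabilities are available, for example from the Quick Start code. The helper name `clean_text` is hypothetical and not part of the released model code, and the probabilities in the usage line are made up. The comparison uses `>=` to match the Quick Start.

```python
def clean_text(text: str, probabilities: list[float], threshold: float = 0.6) -> str:
    """Drop lines whose delete probability is at or above `threshold`.

    `probabilities` must hold one delete probability per line of `text`,
    in order. Hypothetical helper, not part of the released model code.
    """
    lines = text.split("\n")
    if len(lines) != len(probabilities):
        raise ValueError("expected one probability per line")
    # Keep only the lines that fall below the deletion threshold.
    kept = [line for line, p in zip(lines, probabilities) if p < threshold]
    return "\n".join(kept)


# Probabilities here are invented for illustration only.
example = "本文の行\n外部リンク: https://example.com\n別の本文の行"
print(clean_text(example, [0.01, 0.85, 0.05]))
```

Raising the threshold keeps more lines at the cost of letting more noise through; lowering it does the opposite.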
Model Details
- Base model: sbintuitions/modernbert-ja-310m
- Task: binary line-level classification
- Positive label: line should be deleted
- Negative label: line should be kept
- Input length: up to 4096 tokens per window
- Long documents: split into overlapping windows of lines; predictions for lines that appear in more than one window are averaged
- Architecture: ModernBERT encoder plus a classification head over line-token hidden states
Each input line has to be prefixed with a special line token, <line>. The model classifies the hidden state corresponding to each line token, so one forward pass can produce predictions for multiple lines.
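To illustrate why one forward pass yields one prediction per line, the sketch below locates the line-token positions in a tokenized sequence; the hidden states at those positions are the ones the classifier head scores. The token IDs and the function name `line_token_positions` are invented for illustration. The released model handles this step inside its own code, and its actual implementation may differ.

```python
def line_token_positions(input_ids: list[int], line_token_id: int) -> list[int]:
    """Return the indices at which the special line token occurs."""
    return [i for i, token_id in enumerate(input_ids) if token_id == line_token_id]


# Hypothetical IDs: 9 stands in for "<line>", the rest are ordinary tokens.
LINE_TOKEN_ID = 9
input_ids = [1, 9, 20, 21, 9, 30, 9, 40, 41, 42, 2]
positions = line_token_positions(input_ids, LINE_TOKEN_ID)
print(positions)  # [1, 4, 6]: one position, and so one prediction, per line
```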
Training Data
The model was trained on MK0727/line-noise-label, a Japanese line-level dataset with delete labels.
Training used 4096-token windows for 3 epochs.
Performance
Final test metrics from the training run described above:
| Metric | Value |
|---|---|
| Test precision for delete lines | 0.8041 |
| Test recall for delete lines | 0.7572 |
| Test F1 for delete lines | 0.7799 |
Inference Notes
For short texts, all lines can be processed in a single 4096-token window. For longer texts, split the document into overlapping windows and average probabilities for lines that appear in more than one window.
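A minimal sketch of the long-document procedure follows. It assumes per-line token counts are already known and that each window's probabilities come from a separate forward pass. The function names, the greedy packing strategy, and the `overlap_lines` value are assumptions; the source does not state how windows are built or how large the overlap is.

```python
def make_windows(token_counts: list[int], max_tokens: int = 4096,
                 overlap_lines: int = 2) -> list[list[int]]:
    """Group line indices into overlapping windows that fit a token budget.

    `token_counts[i]` is the token length of line i, including its line token.
    `overlap_lines` is an assumed setting, not a documented one.
    """
    windows: list[list[int]] = []
    n = len(token_counts)
    start = 0
    while start < n:
        end, used = start, 0
        # Add lines greedily; always take at least one line per window
        # (a single over-long line would still need truncation elsewhere).
        while end < n and (end == start or used + token_counts[end] <= max_tokens):
            used += token_counts[end]
            end += 1
        windows.append(list(range(start, end)))
        if end == n:
            break
        # Step back by the overlap, but always advance by at least one line.
        start = max(end - overlap_lines, start + 1)
    return windows


def average_predictions(windows: list[list[int]],
                        window_probs: list[list[float]],
                        num_lines: int) -> list[float]:
    """Average delete probabilities for lines that appear in several windows."""
    sums = [0.0] * num_lines
    counts = [0] * num_lines
    for indices, probs in zip(windows, window_probs):
        for i, p in zip(indices, probs):
            sums[i] += p
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy example: five lines of 3 tokens each and a budget of 7 tokens per window.
windows = make_windows([3, 3, 3, 3, 3], max_tokens=7, overlap_lines=1)
print(windows)  # [[0, 1], [1, 2], [2, 3], [3, 4]]

# Invented per-window probabilities, one list per window.
window_probs = [[0.25, 0.75], [0.25, 0.5], [0.0, 1.0], [0.5, 0.125]]
print(average_predictions(windows, window_probs, num_lines=5))
# [0.25, 0.5, 0.25, 0.75, 0.125]
```

Overlap gives lines near a window boundary context on both sides in at least one window, at the cost of extra forward passes.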
Limitations
This model is specialized for Japanese web-text cleanup. It may perform poorly on other languages, highly structured documents, code, and tables.