WebOrganizer/FormatClassifier-NoURL
The FormatClassifier-NoURL organizes web content into 24 categories based on the text content of web pages (without using URL information). The model is an Alibaba-NLP/gte-base-en-v1.5 with 140M parameters, fine-tuned in two stages on the following training data:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
All Domain Classifiers
- WebOrganizer/FormatClassifier
- WebOrganizer/FormatClassifier-NoURL ← you are here!
- WebOrganizer/TopicClassifier
- WebOrganizer/TopicClassifier-NoURL
Usage
This classifier expects input in the following format:
```
{text}
```
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier-NoURL")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

web_page = """How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)
```
You can convert the model's logits with a softmax to obtain a probability distribution over the following 24 categories (in label order; see also `id2label` and `label2id` in the model config). A sketch after the list shows how to read off the top categories by name:
- Academic Writing
- Content Listing
- Creative Writing
- Customer Support
- Comment Section
- FAQ
- Truncated
- Knowledge Article
- Legal Notices
- Listicle
- News Article
- Nonfiction Writing
- About (Org.)
- News (Org.)
- About (Pers.)
- Personal Blog
- Product Page
- Q&A Forum
- Spam / Ads
- Structured Data
- Documentation
- Audio Transcript
- Tutorial
- User Review
The full definitions of the categories can be found in the taxonomy config.
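As a concrete example of reading the distribution, the following sketch reuses `outputs` and `model` from the usage example above and prints the three most likely categories by name via the `id2label` mapping:

```python
import torch

probs = outputs.logits.softmax(dim=-1)  # shape: (batch_size, 24)

# Look up category names for the top predictions via the
# id2label mapping stored in the model config.
top = torch.topk(probs[0], k=3)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```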
Efficient Inference
We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing the xformers package and loading the model as follows:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
```
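Memory-efficient attention via xformers runs on CUDA devices; below is a minimal inference sketch under that assumption, reusing the tokenizer from the usage example:

```python
# xformers memory-efficient attention requires a CUDA device.
model = model.cuda().eval()

inputs = tokenizer(["How to make a good sandwich?"], return_tensors="pt").to("cuda")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
```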
Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```