
WebOrganizer/FormatClassifier-NoURL

[Paper] [Website] [GitHub]

The FormatClassifier-NoURL organizes web content into 24 format categories based solely on the text contents of web pages (without using URL information). The model is a gte-base-en-v1.5 with 140M parameters, fine-tuned in two stages on the following training data:

  1. WebOrganizer/FormatAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
  2. WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

All Domain Classifiers

Usage

This classifier expects input in the following format:

{text}

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier-NoURL")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)

You can convert the model's logits with a softmax to obtain a probability distribution over the following 24 categories (listed in label order; see also id2label and label2id in the model config):

  1. Academic Writing
  2. Content Listing
  3. Creative Writing
  4. Customer Support
  5. Comment Section
  6. FAQ
  7. Truncated
  8. Knowledge Article
  9. Legal Notices
  10. Listicle
  11. News Article
  12. Nonfiction Writing
  13. About (Org.)
  14. News (Org.)
  15. About (Pers.)
  16. Personal Blog
  17. Product Page
  18. Q&A Forum
  19. Spam / Ads
  20. Structured Data
  21. Documentation
  22. Audio Transcript
  23. Tutorial
  24. User Review

The full definitions of the categories can be found in the taxonomy config.
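As a sketch of how logits map to the category names above, the snippet below applies a softmax by hand and looks up the predicted label. The logits here are simulated (in practice use `outputs.logits[0].tolist()`), and the hand-built `labels` list simply mirrors the order above; in real code, read `model.config.id2label` instead:

```python
import math

# Category names in label-id order (0-23), mirroring the model's id2label.
labels = [
    "Academic Writing", "Content Listing", "Creative Writing",
    "Customer Support", "Comment Section", "FAQ", "Truncated",
    "Knowledge Article", "Legal Notices", "Listicle", "News Article",
    "Nonfiction Writing", "About (Org.)", "News (Org.)", "About (Pers.)",
    "Personal Blog", "Product Page", "Q&A Forum", "Spam / Ads",
    "Structured Data", "Documentation", "Audio Transcript", "Tutorial",
    "User Review",
]

# Simulated logits for one document; in practice: outputs.logits[0].tolist().
logits = [0.0] * 24
logits[6] = 5.0  # pretend the model strongly favors label id 6

# Softmax: exponentiate, then normalize to a probability distribution.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

pred = max(range(len(probs)), key=probs.__getitem__)
print(labels[pred])  # -> Truncated
```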

Efficient Inference

We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers (see more here) and loading the model like:

import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
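Since xformers is an optional dependency, one pattern is to enable the efficient path only when it is actually installed. This is a hypothetical fallback sketch (the `from_pretrained` call is left commented out so the snippet stands alone; the bfloat16 dtype from the snippet above would be added to `kwargs` the same way):

```python
import importlib.util

# Enable unpadding + memory-efficient attention only if xformers is available.
# find_spec checks for the package without importing it.
has_xformers = importlib.util.find_spec("xformers") is not None

kwargs = {"trust_remote_code": True}
if has_xformers:
    kwargs.update(unpad_inputs=True, use_memory_efficient_attention=True)

# model = AutoModelForSequenceClassification.from_pretrained(
#     "WebOrganizer/FormatClassifier-NoURL", **kwargs)
print(sorted(kwargs))
```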

Citation

@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}