Hummingbird GBM

LightGBM binary classifier for block-level web content extraction: each block is classified as keep or discard. Part of the hummingbird pipeline.

Models

  • model.txt (102 MB): Combined model trained on WebMainBench + Common Crawl (793K blocks, 7696 rounds)
  • model_wmb_only.txt (41 MB): WebMainBench-only model (446K blocks, 3083 rounds)
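
A minimal sketch of how one of these files might be loaded and applied, assuming it is a standard LightGBM text model and that the positive class means "keep". The helper names, the 0.5 threshold, and the `feature_matrix` and `block_texts` variables in the usage comment are illustrative assumptions, not part of the hummingbird pipeline.

```python
from typing import List, Sequence


def load_booster(path: str = "model.txt"):
    """Load a LightGBM model saved in text format."""
    # Imported lazily so the filtering helper below works without lightgbm.
    import lightgbm as lgb

    return lgb.Booster(model_file=path)


def select_kept_blocks(
    blocks: Sequence[str],
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff; tune on held-out data
) -> List[str]:
    """Return the blocks whose score is at or above the threshold."""
    if len(blocks) != len(scores):
        raise ValueError("blocks and scores must have the same length")
    return [block for block, score in zip(blocks, scores) if score >= threshold]


# Usage (needs lightgbm, the model file, and one feature row per block):
#   booster = load_booster("model.txt")
#   scores = booster.predict(feature_matrix)  # shape: (n_blocks, n_features)
#   main_text = "\n".join(select_kept_blocks(block_texts, scores))
```

For a binary objective, `Booster.predict` returns one probability per row, so the threshold trades precision of kept blocks against recall.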

Features

40 structural DOM features per block. See selected_features.json for the full list and importance scores.
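
Because the model expects a fixed set of features, each block's values have to be arranged in a consistent column order before prediction. The sketch below shows one way to do that, assuming the feature names have already been read from selected_features.json into a list (the file's schema is not described here). The function name and the feature names in the usage note are hypothetical.

```python
from typing import List, Mapping, Sequence


def to_feature_row(
    features: Mapping[str, float],
    names: Sequence[str],
    expected: int = 40,  # feature count stated in this card
) -> List[float]:
    """Order one block's features by `names`, failing loudly on mismatches."""
    if len(names) != expected:
        raise ValueError(f"expected {expected} feature names, got {len(names)}")
    missing = [name for name in names if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in names]
```

For example, with three hypothetical names, `to_feature_row({"dom_depth": 4, "text_length": 120, "link_density": 0.1}, ["text_length", "link_density", "dom_depth"], expected=3)` yields the values in the order given by the name list.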

Performance

ROUGE-5 F1 (WebMainBench, English, 6647 pages)

Model               All     Simple  Mid     Hard
Combined (WMB+CC)   0.808   0.885   0.805   0.740
WMB-only            0.806   0.884   0.806   0.733
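
To make the metric concrete, here is a minimal sketch of an n-gram overlap F1 with n = 5, reading "ROUGE-5" as ROUGE-N over 5-grams. It assumes whitespace tokenization; the benchmark's own tokenization and normalization may differ, so scores from this function are not expected to reproduce the table exactly.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction: str, reference: str, n: int = 5) -> float:
    """F1 of clipped n-gram overlap between extracted and reference text."""
    pred = ngram_counts(prediction.split(), n)
    ref = ngram_counts(reference.split(), n)
    pred_total = sum(pred.values())
    ref_total = sum(ref.values())
    if pred_total == 0 or ref_total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)
```

With 5-grams, texts shorter than five tokens contain no n-grams and score 0.0 under this sketch.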

Qrater Clean Rate (500 pages)

Method            Clean%
Dripper 0.6B      80.0%
Hummingbird GBM   66.0%
Raw html2text     5.6%