Hummingbird GBM
A LightGBM binary classifier for block-level web content extraction: each block of a page is labeled keep or discard. Part of the hummingbird pipeline.
Models
- model.txt (102 MB): Combined model trained on WebMainBench + Common Crawl (793K blocks, 7696 rounds)
- model_wmb_only.txt (41 MB): WebMainBench-only model (446K blocks, 3083 rounds)
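
As a rough illustration of how one of these files might be used, the sketch below loads a model with LightGBM's `Booster(model_file=...)` and turns per-block scores into keep/discard decisions. The helper names `load_booster` and `keep_blocks`, the 0.5 threshold, and the assumption that the feature matrix columns follow the order used in training are all hypothetical; the pipeline's own feature extraction and threshold are not described here.

```python
from typing import List, Sequence


def load_booster(path: str = "model.txt"):
    """Load a saved LightGBM model from its text file."""
    # Imported lazily so the pure-Python helper below works without lightgbm.
    import lightgbm as lgb

    return lgb.Booster(model_file=path)


def keep_blocks(
    blocks: Sequence[str],
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the pipeline
) -> List[str]:
    """Return the blocks whose predicted keep-probability meets the threshold."""
    if len(blocks) != len(scores):
        raise ValueError("blocks and scores must have the same length")
    return [b for b, s in zip(blocks, scores) if s >= threshold]


# Hypothetical usage, assuming `features` is a 2-D array with one row per
# block and 40 columns in the training feature order:
#
#   booster = load_booster("model.txt")
#   scores = booster.predict(features)
#   main_content = keep_blocks(block_texts, scores)
```

For a model trained with a binary objective, `Booster.predict` returns probabilities by default, which is why a simple threshold is enough in this sketch.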
Features
40 structural DOM features per block. See selected_features.json for the full list and importance scores.
Performance
ROUGE-5 F1 (WebMainBench, English, 6647 pages)
| Model | All | Simple | Mid | Hard |
|---|---|---|---|---|
| Combined (WMB+CC) | 0.808 | 0.885 | 0.805 | 0.740 |
| WMB-only | 0.806 | 0.884 | 0.806 | 0.733 |
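
For readers unfamiliar with the metric, here is a minimal sketch of an n-gram overlap F1, assuming "ROUGE-5" denotes ROUGE-N with 5-grams. Whitespace tokenization and the function names are assumptions for illustration; the benchmark's own tokenizer and scoring code may differ.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 5) -> float:
    """F1 of clipped n-gram overlap between extracted and reference text."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

An extraction that drops part of the reference loses recall but keeps precision, while one that includes boilerplate loses precision, so the F1 penalizes both kinds of error.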
Qrater Clean Rate (500 pages)
| Method | Clean% |
|---|---|
| Dripper 0.6B | 80.0% |
| Hummingbird GBM | 66.0% |
| Raw html2text | 5.6% |