dom-node-classifier
Model description
dom-node-classifier is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.
The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots — the DOM must be pre-extracted into the JSON format described below.
Architecture: GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.
Why GATv2 over GATv1? GATv1 computes static attention: the ranking of neighbors by attention score is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the attention vector after the non-linearity instead of before it, which makes the attention weights dynamic, i.e. dependent on the query node. This matters for DOM nodes whose relevance depends heavily on context.
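For reference, the two scoring functions as given in the GATv2 paper, where $h_i$ is the query node's feature vector, $h_j$ a neighbor's, $W$ a learned weight matrix, $a$ a learned attention vector, and $\Vert$ denotes concatenation. The notation follows the paper, not this repository's code:

```latex
\begin{align}
\text{GAT:}\quad   e(h_i, h_j) &= \mathrm{LeakyReLU}\!\left(a^{\top}\,[\,W h_i \,\Vert\, W h_j\,]\right) \\
\text{GATv2:}\quad e(h_i, h_j) &= a^{\top}\,\mathrm{LeakyReLU}\!\left(W\,[\,h_i \,\Vert\, h_j\,]\right) \\
\alpha_{ij} &= \frac{\exp\!\left(e(h_i, h_j)\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(e(h_i, h_k)\right)}
\end{align}
```

In the first form, $a$ and $W$ are applied back to back and collapse into a single linear map before the LeakyReLU, which is why the neighbor ranking cannot depend on $h_i$. In the second form the non-linearity sits between them.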
Intended uses
- Browser agent perception: replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
- DOM annotation: automatically labeling nodes in a page corpus for downstream ML tasks.
- Web research: studying element-type distributions across sites, languages, and page categories.
Out-of-scope uses
- Accessibility compliance: the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
- Production-critical UX automation without human oversight: F1 on thin classes (particularly action_input, action_select, structure_dismissible) is insufficient for fully unattended operation.
- Adversarial robustness: the model was not trained against adversarially obfuscated DOM structures.
How to use
from model.inference import DOMClassifier
from pathlib import Path
import json
# Load from HuggingFace weights (model.safetensors + config.json must be in the same directory)
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
# Or from a local .pt checkpoint: DOMClassifier.from_checkpoint("checkpoints_final/best.pt")
raw_page = json.loads(Path("examples/sample_page.json").read_text())
predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)
for p in predictions:
    print(f"[{p['class']:25s}] {p['confidence']:.2f} {p['selector']}")
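For the browser-agent use case, the predictions can be reduced to a short, confidence-ranked list of actionable elements before they reach an LLM prompt. The following is a minimal sketch, not part of the package: `top_actions` and the demo data are hypothetical, and it assumes only the `class`, `confidence`, and `selector` keys used in the loop above.

```python
from typing import Dict, List


def top_actions(predictions: List[Dict], k: int = 20) -> List[str]:
    """Keep action_* predictions, rank them by confidence, render one line each."""
    actions = [p for p in predictions if p["class"].startswith("action_")]
    actions.sort(key=lambda p: p["confidence"], reverse=True)
    return [f"{p['class']} {p['confidence']:.2f} {p['selector']}" for p in actions[:k]]


# Hypothetical predictions in the same shape as the classifier output.
demo = [
    {"class": "noise", "confidence": 0.99, "selector": "div.wrapper"},
    {"class": "action_button", "confidence": 0.91, "selector": "button#submit"},
    {"class": "action_link_internal", "confidence": 0.97, "selector": "a.nav-home"},
]

for line in top_actions(demo, k=2):
    print(line)
```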
Input format
raw_page is a dict with the following top-level keys:
| Key | Type | Description |
|---|---|---|
| url | string | Page URL (used for link feature computation) |
| viewport | dict {width, height} | Viewport dimensions in pixels |
| nodes | list of node dicts | One entry per DOM node |
| edges | list of [src_idx, dst_idx] pairs | Parent→child edges using node list indices |
Each node dict:
| Key | Required | Type | Description |
|---|---|---|---|
| id | yes | string | Unique node identifier |
| tag | yes | string | HTML tag name (e.g. "button", "div") |
| text | no | string | Visible text content (truncated to 200 chars) |
| selector | no | string | CSS selector (returned in predictions, not used as feature) |
| classes | no | list[str] | CSS class tokens |
| attrs | no | dict | HTML attributes (href, id, type, role, …) |
| css | no | dict | Computed CSS (display, position, visibility, opacity, cursor, font_size, font_weight, z_index) |
| bbox | no | dict {x, y, width, height} | Bounding box in pixels |
| depth | no | int | DOM depth from root |
| n_children | no | int | Number of direct children |
| is_visible | no | bool | Whether the node is visible |
| in_viewport | no | bool | Whether the node is in the initial viewport |
| has_listeners_heuristic | no | bool | Whether the node likely has JS event listeners |
Missing optional fields default to zero or empty values.
A complete example is in examples/sample_page.json.
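As an illustration of the format, here is a hand-written three-node page together with a small structural check. This is a sketch under assumptions: `validate_page` is a hypothetical helper, not part of the package, and it checks only what the tables above state (required keys, unique ids, edge indices within the node list).

```python
from typing import Any, Dict, List

REQUIRED_TOP = ("url", "viewport", "nodes", "edges")
REQUIRED_NODE = ("id", "tag")


def validate_page(page: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the structure looks usable."""
    problems = [f"missing top-level key: {k}" for k in REQUIRED_TOP if k not in page]
    if problems:
        return problems
    nodes, edges = page["nodes"], page["edges"]
    seen = set()
    for i, node in enumerate(nodes):
        for key in REQUIRED_NODE:
            if key not in node:
                problems.append(f"node {i}: missing required key {key!r}")
        node_id = node.get("id")
        if node_id is not None:
            if node_id in seen:
                problems.append(f"node {i}: duplicate id {node_id!r}")
            seen.add(node_id)
    for src, dst in edges:
        # Edges refer to positions in the node list, not to node ids.
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            problems.append(f"edge [{src}, {dst}]: index out of range")
    return problems


# Minimal hypothetical page: <body> with an <h1> and an <a> child.
page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        {"id": "n0", "tag": "body", "depth": 0, "n_children": 2},
        {"id": "n1", "tag": "h1", "text": "Welcome", "depth": 1},
        {"id": "n2", "tag": "a", "text": "Docs", "attrs": {"href": "/docs"}, "depth": 1},
    ],
    "edges": [[0, 1], [0, 2]],
}

print(validate_page(page))
```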
Training data
The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure — not by human annotators.
The training dataset is not publicly distributed.
Training procedure
Hardware: NVIDIA L40S (48 GB VRAM)
Hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 80 (early stopping, patience=15) |
| Batch size | 8 pages |
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| LR schedule | Cosine annealing |
| Weight decay | 1e-4 |
| Dropout | 0.3 |
| Hidden dim | 128 |
| Attention heads | 4 |
| GATv2 layers | 3 |
| Class weighting | sqrt-inverse frequency |
| Edge augmentation | Reverse edges + sibling edges |
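Two rows of the table lend themselves to a short illustration: sqrt-inverse frequency class weighting and edge augmentation. The sketch below is one plausible reading, not the training code. Both function names are hypothetical, the normalization of the weights to a mean of 1 is an assumption, and "sibling edges" is assumed to mean links between adjacent children of the same parent, in both directions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sqrt_inverse_frequency(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by 1/sqrt(count), rescaled so the weights average to 1."""
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items() if n > 0}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}


def augment_edges(edges: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Add child->parent edges plus links between adjacent siblings (assumption)."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    out = {(parent, child) for parent, child in edges}
    out.update((child, parent) for parent, child in edges)
    for kids in children.values():
        for left, right in zip(kids, kids[1:]):
            out.add((left, right))
            out.add((right, left))
    return sorted(out)


# Hypothetical class counts, chosen only to show the effect of the square root.
weights = sqrt_inverse_frequency({"noise": 10000, "action_button": 100, "action_select": 4})
augmented = augment_edges([[0, 1], [0, 2], [1, 3]])
print(weights)
print(augmented)
```

With the square root, a class that is 2500 times rarer receives a weight 50 times larger, which is gentler than plain inverse frequency.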
Feature vector (618 dims/node):
| Feature block | Dims | Notes |
|---|---|---|
| Tag one-hot | 51 | 50 tags + OOV bucket |
| Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
| Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
| Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
| Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
| Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
| Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
| Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |
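To make the layout concrete, the sketch below maps each block in the table to a column range and shows one way the hashing trick over CSS class tokens could work. It is illustrative only: the block order, the helper names, and the choice of CRC32 as the hash function are assumptions. Summing the block sizes exactly as listed gives 627, so the released code's sizes presumably differ somewhere from this table and the sketch does not attempt to reproduce the 618-dimensional vector.

```python
import zlib
from typing import Dict, List, Tuple

# Block sizes copied from the table above; the concatenation order is an assumption.
FEATURE_BLOCKS: List[Tuple[str, int]] = [
    ("tag_one_hot", 51), ("class_hash", 128), ("attr_presence", 17),
    ("computed_css", 28), ("bbox", 5), ("topology", 5),
    ("link_semantics", 9), ("text_embedding", 384),
]


def block_slices(blocks: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each block name to its (start, end) column range in the node vector."""
    slices, start = {}, 0
    for name, dim in blocks:
        slices[name] = (start, start + dim)
        start += dim
    return slices


def hash_classes(tokens: List[str], dim: int = 128) -> List[float]:
    """Hashing trick: every CSS class token increments one of `dim` buckets."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


slices = block_slices(FEATURE_BLOCKS)
vec = hash_classes(["flex", "px-4", "btn-primary"])
print(slices["class_hash"], sum(vec))
```

Because the bucket index depends only on the token string, unseen utility classes still land in a fixed slot without a vocabulary, which is the property the table describes as Tailwind-robust.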
Validation criterion: best checkpoint selected by macro-F1 on the validation split.
Data split: 70 / 15 / 15 train/val/test, stratified by page.
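A page-level split keeps all nodes of a page in the same partition, so no page contributes nodes to both training and evaluation. The sketch below shows a seeded 70 / 15 / 15 split over page identifiers. `split_pages` is a hypothetical helper, and it does not reproduce any stratification the actual pipeline may apply.

```python
import random
from typing import List, Sequence, Tuple


def split_pages(
    page_ids: Sequence[str], seed: int = 0, train: float = 0.70, val: float = 0.15
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle page ids with a fixed seed and cut them into train/val/test lists."""
    ids = list(page_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(train * len(ids))
    n_val = int(val * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Hypothetical identifiers for a corpus of 135 pages.
pages = [f"page_{i:03d}" for i in range(135)]
train_ids, val_ids, test_ids = split_pages(pages, seed=42)
print(len(train_ids), len(val_ids), len(test_ids))
```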
Evaluation results
Evaluated on a held-out test set (15% of pages, stratified split). Numbers reported as mean ± std across 5 independent training runs with different random seeds.
| Metric | Mean ± std | Min | Max |
|---|---|---|---|
| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
| Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |
Per-class F1, mean ± std across 5 seeds:
| Class | Mean F1 | Std | Test support (best seed) |
|---|---|---|---|
| action_input | 0.686 | 0.104 | 25 |
| action_select | 0.768 | 0.086 | 8 |
| action_button | 0.909 | 0.071 | 1 577 |
| action_link_internal | 0.996 | 0.004 | 3 119 |
| action_link_external | 0.996 | 0.003 | 327 |
| structure_navigation | 0.884 | 0.062 | 52 |
| structure_region | 0.770 | 0.140 | 52 |
| structure_dismissible | 0.363 | 0.073 | 158 |
| structure_card | 0.625 | 0.199 | 1 045 |
| structure_list_item | 0.974 | 0.015 | 3 885 |
| content_heading | 0.986 | 0.007 | 525 |
| content_text | 0.736 | 0.067 | 322 |
| content_media | 0.915 | 0.035 | 1 319 |
| noise | 0.938 | 0.022 | 18 345 |
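As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores. Averaging over seeds and averaging over classes commute, so the mean of the per-class means should match the Macro F1 row. The sketch below recomputes it from the 14 values in the table; the variable names are illustrative.

```python
from statistics import mean

# Per-class mean F1 across 5 seeds, copied from the table above.
PER_CLASS_F1 = {
    "action_input": 0.686, "action_select": 0.768, "action_button": 0.909,
    "action_link_internal": 0.996, "action_link_external": 0.996,
    "structure_navigation": 0.884, "structure_region": 0.770,
    "structure_dismissible": 0.363, "structure_card": 0.625,
    "structure_list_item": 0.974, "content_heading": 0.986,
    "content_text": 0.736, "content_media": 0.915, "noise": 0.938,
}

macro_f1 = mean(PER_CLASS_F1.values())
print(f"macro F1 = {macro_f1:.3f}")  # prints "macro F1 = 0.825"
```

Every class counts equally in this average, so the low score on structure_dismissible pulls macro F1 well below the weighted F1, which is dominated by the large noise and link classes.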
Limitations
- Low-support classes. action_input (n=25) and action_select (n=8) have very small test sets; F1 estimates for these classes have high variance and should not be over-interpreted.
- structure_dismissible is hard. Cookie banners and modal overlays vary enormously across sites. Mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
- Heuristic labels. Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative <button> vs. a functional one) may be mislabeled.
- No price class. Numerical price strings are classified as noise. This is a known gap.
- Static DOM only. The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
- Dataset size and diversity. ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.
Bias and ethical considerations
- The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
- The noise class is a catch-all for elements that don't fit other categories. Misclassified functional elements (e.g. a decorative-looking but important button) will be silently dropped in action_only=True mode. Always set a confidence threshold and review low-confidence predictions.
- The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
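One way to act on the advice about thresholds and review is to run with action_only=False, so that low-confidence noise predictions stay visible, and route anything below a cutoff to a human. The sketch below is a minimal illustration: `triage` is a hypothetical helper and the 0.8 cutoff is an arbitrary example value, not a recommendation derived from the evaluation.

```python
from typing import Dict, List


def triage(predictions: List[Dict], accept_at: float = 0.8) -> Dict[str, List[Dict]]:
    """Split predictions into auto-accepted ones and ones queued for human review."""
    buckets: Dict[str, List[Dict]] = {"accept": [], "review": []}
    for p in predictions:
        key = "accept" if p["confidence"] >= accept_at else "review"
        buckets[key].append(p)
    return buckets


# Hypothetical predictions; the second could be a real button mislabeled as noise.
demo_preds = [
    {"class": "action_button", "confidence": 0.93, "selector": "button.buy"},
    {"class": "noise", "confidence": 0.55, "selector": "div.cta"},
]

result = triage(demo_preds)
print(len(result["accept"]), "accepted,", len(result["review"]), "for review")
```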
License
Apache 2.0 — see LICENSE.
Citation
If you use this model in your work, a link back to this repository is appreciated.
Contact
Lucy Paureau · lmi.rest · lucy.paureau@gmail.com