dom-node-classifier

Model description

dom-node-classifier is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.

The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots — the DOM must be pre-extracted into the JSON format described below.

Architecture: GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.

Why GATv2 over GAT? In the original GAT, attention is static: the ranking of neighbours by attention score is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the attention vector after the non-linearity instead of before it, which makes the attention weights dynamic, i.e. dependent on the query node. This matters for DOM nodes, whose relevance depends heavily on context.
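
The difference between the two scoring functions can be illustrated in plain Python. The sketch below uses random weights and treats every node as a neighbour of every other node; both are assumptions made for illustration and say nothing about the trained model or its DOM graph. With the GAT score, every query selects the same top key, because LeakyReLU is monotonic and each key contributes a single scalar that is added to the query's scalar. The GATv2 score has no such constraint.

```python
import random

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def gat_score(h_i, h_j):
    # GAT: LeakyReLU(a^T [W h_i || W h_j]); the non-linearity is applied last.
    return leaky_relu(dot(A_SRC, matvec(W, h_i)) + dot(A_DST, matvec(W, h_j)))

def gatv2_score(h_i, h_j):
    # GATv2: a^T LeakyReLU(W [h_i || h_j]); the attention vector is applied last.
    hidden = [s + d for s, d in zip(matvec(W_SRC, h_i), matvec(W_DST, h_j))]
    return dot(A, [leaky_relu(x) for x in hidden])

def top_key_per_query(score, nodes):
    # For each query node, the index of the key node with the highest score.
    return [max(range(len(nodes)), key=lambda j: score(h_i, nodes[j])) for h_i in nodes]

# Random illustrative parameters (not the model's weights).
rng = random.Random(0)
DIM, N_NODES = 4, 6

def rand_vec():
    return [rng.gauss(0, 1) for _ in range(DIM)]

def rand_mat():
    return [rand_vec() for _ in range(DIM)]

NODES = [rand_vec() for _ in range(N_NODES)]
W, W_SRC, W_DST = rand_mat(), rand_mat(), rand_mat()
A_SRC, A_DST, A = rand_vec(), rand_vec(), rand_vec()

gat_top = top_key_per_query(gat_score, NODES)
gatv2_top = top_key_per_query(gatv2_score, NODES)
print("GAT   top key per query:", gat_top)    # one index repeated for every query
print("GATv2 top key per query:", gatv2_top)  # free to vary with the query
```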

Intended uses

Browser agent perception: replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
DOM annotation: automatically labeling nodes in a page corpus for downstream ML tasks.
Web research: studying element-type distributions across sites, languages, and page categories.

Out-of-scope uses

Accessibility compliance: the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
Production-critical UX automation without human oversight: F1 on low-support or hard classes (particularly action_input, action_select, structure_dismissible) is insufficient for fully unattended operation.
Adversarial robustness: the model was not trained against adversarially obfuscated DOM structures.

How to use

from model.inference import DOMClassifier
from pathlib import Path
import json

# Load from HuggingFace weights (model.safetensors + config.json must be in the same directory)
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
# Or from a local .pt checkpoint:  DOMClassifier.from_checkpoint("checkpoints_final/best.pt")

raw_page = json.loads(Path("examples/sample_page.json").read_text())
predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)

for p in predictions:
    print(f"[{p['class']:25s}] {p['confidence']:.2f}  {p['selector']}")
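
For the browser-agent use case, the predictions can be turned into a short, confidence-ranked element list before being passed to an LLM. The helper below is a minimal sketch, not part of the package: `to_prompt_lines` and `max_items` are hypothetical names, and the only assumption about the prediction format is the three keys used in the loop above (`class`, `confidence`, `selector`).

```python
def to_prompt_lines(predictions, max_items=20):
    """Render predictions as short text lines, highest confidence first."""
    ranked = sorted(predictions, key=lambda p: p["confidence"], reverse=True)
    return [
        f"{p['class']} ({p['confidence']:.2f}): {p['selector']}"
        for p in ranked[:max_items]
    ]

# Hypothetical predictions in the shape printed above, not real model output.
sample_predictions = [
    {"class": "action_button", "confidence": 0.91, "selector": "form button.submit"},
    {"class": "structure_dismissible", "confidence": 0.55, "selector": "div.cookie-banner"},
    {"class": "action_link_internal", "confidence": 0.98, "selector": "nav a.home"},
]

for line in to_prompt_lines(sample_predictions, max_items=2):
    print(line)
```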

Input format

raw_page is a dict with the following top-level keys:

Key	Type	Description
`url`	string	Page URL (used for link feature computation)
`viewport`	dict `{width, height}`	Viewport dimensions in pixels
`nodes`	list of node dicts	One entry per DOM node
`edges`	list of `[src_idx, dst_idx]` pairs	Parent→child edges using node list indices

Each node dict:

Key	Required	Type	Description
`id`	yes	string	Unique node identifier
`tag`	yes	string	HTML tag name (e.g. `"button"`, `"div"`)
`text`	no	string	Visible text content (truncated to 200 chars)
`selector`	no	string	CSS selector (returned in predictions, not used as feature)
`classes`	no	list[str]	CSS class tokens
`attrs`	no	dict	HTML attributes (`href`, `id`, `type`, `role`, …)
`css`	no	dict	Computed CSS (`display`, `position`, `visibility`, `opacity`, `cursor`, `font_size`, `font_weight`, `z_index`)
`bbox`	no	dict `{x, y, width, height}`	Bounding box in pixels
`depth`	no	int	DOM depth from root
`n_children`	no	int	Number of direct children
`is_visible`	no	bool	Whether the node is visible
`in_viewport`	no	bool	Whether the node is in the initial viewport
`has_listeners_heuristic`	no	bool	Whether the node likely has JS event listeners

Missing optional fields default to zeros or empty values.

A complete example is in examples/sample_page.json.
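
As a quick sanity check before inference, a page dict can be validated against the tables above. The function below is a sketch under two assumptions: that all four top-level keys are required (the tables only mark node fields as required or optional), and that edges must reference valid positions in the node list. `validate_raw_page` is a hypothetical helper, not part of the package.

```python
def validate_raw_page(raw_page):
    """Return a list of problems found in a raw_page dict (empty if none)."""
    problems = []
    for key in ("url", "viewport", "nodes", "edges"):
        if key not in raw_page:
            problems.append(f"missing top-level key: {key}")
    nodes = raw_page.get("nodes", [])
    for idx, node in enumerate(nodes):
        for key in ("id", "tag"):  # the two required node fields
            if key not in node:
                problems.append(f"node {idx}: missing required field {key}")
    ids = [node["id"] for node in nodes if "id" in node]
    if len(ids) != len(set(ids)):
        problems.append("node ids are not unique")
    for edge in raw_page.get("edges", []):
        # Each edge is a [parent_idx, child_idx] pair of positions in the node list.
        valid = len(edge) == 2 and all(
            isinstance(i, int) and 0 <= i < len(nodes) for i in edge
        )
        if not valid:
            problems.append(f"edge {edge}: expected two valid node indices")
    return problems

# Smallest page that satisfies the tables: two nodes, one parent->child edge.
minimal_page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 720},
    "nodes": [
        {"id": "n0", "tag": "body"},
        {"id": "n1", "tag": "button", "text": "Sign in", "attrs": {"type": "submit"}},
    ],
    "edges": [[0, 1]],
}

print(validate_raw_page(minimal_page))  # an empty list means no problems were found
```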

Training data

The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure — not by human annotators.

The training dataset is not publicly distributed.

Training procedure

Hardware: NVIDIA L40S (48 GB VRAM)

Hyperparameters:

Parameter	Value
Epochs	Up to 80 (early stopping, patience=15)
Batch size	8 pages
Optimizer	AdamW
Learning rate	1e-3
LR schedule	Cosine annealing
Weight decay	1e-4
Dropout	0.3
Hidden dim	128
Attention heads	4
GATv2 layers	3
Class weighting	sqrt-inverse frequency
Edge augmentation	Reverse edges + sibling edges
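
The last two rows of the table can be sketched as follows. The exact variants used in training are not specified here, so two assumptions are made: sibling edges connect only adjacent siblings, in both directions, and class weights are 1/sqrt(count) rescaled to average 1. Both function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

def augment_edges(edges):
    """Add reverse (child->parent) and adjacent-sibling edges to parent->child edges."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    augmented = [tuple(e) for e in edges]
    augmented += [(child, parent) for parent, child in edges]
    for kids in children.values():
        # Assumption: siblings are linked to their immediate neighbours, both ways.
        for left, right in zip(kids, kids[1:]):
            augmented += [(left, right), (right, left)]
    return augmented

def sqrt_inverse_frequency_weights(labels):
    """Weight each class by 1/sqrt(count), rescaled so the weights average to 1."""
    counts = Counter(labels)
    raw = {cls: 1.0 / sqrt(n) for cls, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {cls: w * scale for cls, w in raw.items()}

tree = [(0, 1), (0, 2), (0, 3)]  # node 0 with three children
print(augment_edges(tree))

toy_labels = ["noise"] * 100 + ["action_button"] * 4  # toy counts, not the real data
print(sqrt_inverse_frequency_weights(toy_labels))
```

With the square root, a class that is 25 times rarer gets 5 times the weight instead of 25 times, which dampens the effect of very rare classes on the loss.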

Feature vector (618 dims/node):

Feature block	Dims	Notes
Tag one-hot	51	50 tags + OOV bucket
Class hash	128	Hashing trick over CSS class tokens (no fixed vocabulary, so it copes with Tailwind-style utility classes)
Attribute presence	17	id, href, role, aria-*, type, placeholder, …
Computed CSS	28	display (11) + position (5) + 6 numeric CSS values
Bounding box	5	x, y, w, h, area (normalized by viewport)
Topology	5	depth, n_children, is_visible, in_viewport, has_listeners
Link semantics	9	absolute/relative/fragment/mailto, same-host/domain, path depth
Text embedding	384	MiniLM-L6-v2 sentence embedding (frozen)
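
The layout of the feature vector and the class-hash block can be sketched as below. The concatenation order and the hash function (MD5 here, chosen only because it is stable across runs) are assumptions, and both helper names are hypothetical. Note that the widths listed in the table add up to 627 rather than the 618 given in the heading, so the offsets in the released model may differ from the ones computed here.

```python
import hashlib

# (block name, width) in the order listed in the table above.
FEATURE_BLOCKS = [
    ("tag_one_hot", 51), ("class_hash", 128), ("attr_presence", 17),
    ("computed_css", 28), ("bbox", 5), ("topology", 5),
    ("link_semantics", 9), ("text_embedding", 384),
]

def block_slices(blocks):
    """Map each block name to its (start, end) column range."""
    slices, start = {}, 0
    for name, width in blocks:
        slices[name] = (start, start + width)
        start += width
    return slices

def hash_class_tokens(classes, n_buckets=128):
    """Multi-hot vector: each CSS class token sets one hashed bucket to 1."""
    vec = [0.0] * n_buckets
    for token in classes:
        digest = hashlib.md5(token.encode("utf-8")).digest()  # stable across runs
        vec[int.from_bytes(digest[:4], "big") % n_buckets] = 1.0
    return vec

slices = block_slices(FEATURE_BLOCKS)
print(slices["class_hash"])         # (51, 179)
print(slices["text_embedding"][1])  # 627, the sum of the listed widths
print(sum(hash_class_tokens(["btn", "btn-primary", "px-4"])))  # number of buckets set
```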

Validation criterion: best checkpoint selected by macro-F1 on the validation split.

Data split: 70 / 15 / 15 train/val/test, stratified by page.
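
One reading of this split is that pages, not individual nodes, are the unit of assignment, so that all nodes of a page land in the same subset and no page leaks between train and test. A minimal sketch of such a page-level split follows; the page ids, the seed, and the `split_pages` helper are hypothetical, and any additional stratification used for the actual split is not reproduced.

```python
import random

def split_pages(page_ids, train=0.70, val=0.15, seed=0):
    """Shuffle page ids and cut them into train/val/test lists."""
    ids = sorted(page_ids)  # fixed order so the seed fully determines the split
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

pages = [f"page_{i:03d}" for i in range(135)]  # hypothetical ids for ~135 pages
train_ids, val_ids, test_ids = split_pages(pages)
print(len(train_ids), len(val_ids), len(test_ids))
```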

Evaluation results

Evaluated on a held-out test set (15% of pages, stratified split). Numbers are reported as mean ± std across 5 independent training runs with different random seeds.

Metric	Mean ± std	Min	Max
Macro F1	0.825 ± 0.026	0.797	0.865
Weighted F1	0.917 ± 0.032	0.882	0.965
Action F1 (5 classes)	0.895 ± 0.036	0.818	0.917

Per-class F1, mean ± std across 5 seeds:

Class	Mean F1	Std	Test support (best seed)
`action_input`	0.686	0.104	25
`action_select`	0.768	0.086	8
`action_button`	0.909	0.071	1 577
`action_link_internal`	0.996	0.004	3 119
`action_link_external`	0.996	0.003	327
`structure_navigation`	0.884	0.062	52
`structure_region`	0.770	0.140	52
`structure_dismissible`	0.363	0.073	158
`structure_card`	0.625	0.199	1 045
`structure_list_item`	0.974	0.015	3 885
`content_heading`	0.986	0.007	525
`content_text`	0.736	0.067	322
`content_media`	0.915	0.035	1 319
`noise`	0.938	0.022	18 345
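
The headline macro F1 can be cross-checked against this table: averaging the 14 per-class means gives the same value as averaging the per-seed macro F1 scores, since both are plain means over the same numbers. The snippet below copies the Mean F1 column and also lists the three weakest classes.

```python
# Mean F1 per class, copied from the table above.
PER_CLASS_MEAN_F1 = {
    "action_input": 0.686, "action_select": 0.768, "action_button": 0.909,
    "action_link_internal": 0.996, "action_link_external": 0.996,
    "structure_navigation": 0.884, "structure_region": 0.770,
    "structure_dismissible": 0.363, "structure_card": 0.625,
    "structure_list_item": 0.974, "content_heading": 0.986,
    "content_text": 0.736, "content_media": 0.915, "noise": 0.938,
}

# Macro F1 is the unweighted mean over classes.
macro_f1 = sum(PER_CLASS_MEAN_F1.values()) / len(PER_CLASS_MEAN_F1)
weakest = sorted(PER_CLASS_MEAN_F1, key=PER_CLASS_MEAN_F1.get)[:3]

print(f"classes: {len(PER_CLASS_MEAN_F1)}, macro F1: {macro_f1:.3f}")  # classes: 14, macro F1: 0.825
print(weakest)  # ['structure_dismissible', 'structure_card', 'action_input']
```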

Limitations

Low-support classes. action_input (n=25) and action_select (n=8) have very small test sets — F1 estimates for these classes have high variance and should not be over-interpreted.
structure_dismissible is hard. Cookie banners and modal overlays vary enormously across sites. Mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
Heuristic labels. Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative <button> vs. a functional one) may be mislabeled.
No price class. Numerical price strings are classified as noise. This is a known gap.
Static DOM only. The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
Dataset size and diversity. ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.
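
To see why F1 on a low-support class is unstable, consider a class with 8 test examples, the support of action_select. The confusion counts below are hypothetical and chosen only to show the size of the effect: a single additional missed example moves F1 by almost nine points.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts for a class with 8 positive test examples.
before = f1_score(tp=6, fp=1, fn=2)  # 6 of 8 found
after = f1_score(tp=5, fp=1, fn=3)   # one more miss
print(f"{before:.3f} -> {after:.3f}")  # 0.800 -> 0.714
```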

Bias and ethical considerations

The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
The noise class is a catch-all for elements that don't fit other categories. Functional elements misclassified as noise (e.g. an important button that looks decorative) are silently dropped in action_only=True mode. Always set a confidence threshold and review low-confidence predictions.
The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
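
One way to act on this is to route low-confidence predictions, including low-confidence noise, to a review queue instead of dropping them. The sketch below assumes the predictions were obtained with action_only=False so that noise nodes are still present; the 0.8 threshold, the sample data, and the `needs_review` helper are all hypothetical.

```python
def needs_review(prediction, threshold=0.8):
    """Flag any prediction, including noise, whose confidence is below the threshold."""
    return prediction["confidence"] < threshold

# Hypothetical predictions, not real model output.
sample = [
    {"class": "noise", "confidence": 0.52, "selector": "div.cta > span"},
    {"class": "noise", "confidence": 0.99, "selector": "div.spacer"},
    {"class": "action_button", "confidence": 0.61, "selector": "button.buy"},
]

review_queue = [p["selector"] for p in sample if needs_review(p)]
print(review_queue)  # ['div.cta > span', 'button.buy']
```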

License

Apache 2.0 — see LICENSE.

Citation

If you use this model in your work, a link back to this repository is appreciated.

Contact

Lucy Paureau · lmi.rest · lucy.paureau@gmail.com

Safetensors · Model size: 1.27M params · Tensor type: F32