dom-node-classifier

Model description

dom-node-classifier is a GATv2 (Graph Attention Network v2) that classifies every node of an HTML DOM into one of 14 semantic classes. It is designed to serve as a perception layer for browser agents and web annotation pipelines.

The model takes a structured DOM representation (nodes with features + a tree edge index) and outputs a class label and confidence score per node. It does not process raw HTML or screenshots — the DOM must be pre-extracted into the JSON format described below.

Architecture: GATv2 with 3 message-passing layers, 4 attention heads, hidden dimension 128, and a learned input projection that mixes heterogeneous node features before graph propagation.

Why GATv2 over GATv1? GATv1 computes static attention: within a layer, the ranking of attention scores over neighbors is the same for every query node. GATv2 (Brody, Alon & Yahav, 2022) applies the attention vector after the LeakyReLU non-linearity instead of before it, which makes the attention weights dynamic, i.e. dependent on the query node. This matters for DOM nodes whose relevance depends heavily on context.
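The difference between the two scoring functions can be illustrated with a small NumPy sketch. It is not the model's code: it uses random toy weights, treats every node as a neighbor of every other node, and all variable names are illustrative. It only shows that the GAT score ranks keys identically for every query, while the GATv2 score is free to rank them differently.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
d_in, d_out, n_nodes = 4, 8, 6
h = rng.normal(size=(n_nodes, d_in))  # toy node features, one row per node

# GAT (v1): e_ij = LeakyReLU(a^T [W h_i || W h_j])
W = rng.normal(size=(d_in, d_out))
a_src, a_dst = rng.normal(size=d_out), rng.normal(size=d_out)
z = h @ W
gat_scores = leaky_relu((z @ a_src)[:, None] + (z @ a_dst)[None, :])

# GATv2: e_ij = a^T LeakyReLU(W_l h_i + W_r h_j)
W_l = rng.normal(size=(d_in, d_out))
W_r = rng.normal(size=(d_in, d_out))
a = rng.normal(size=d_out)
pre = (h @ W_l)[:, None, :] + (h @ W_r)[None, :, :]  # (query, key, d_out)
gatv2_scores = leaky_relu(pre) @ a

# Rows are query nodes; each row lists key nodes from highest to lowest score.
gat_rank = np.argsort(-gat_scores, axis=1)
gatv2_rank = np.argsort(-gatv2_scores, axis=1)

print("GAT: same ranking for every query:", bool((gat_rank == gat_rank[0]).all()))
print("GATv2: number of distinct rankings:", len({tuple(r) for r in gatv2_rank}))
```

In the GAT case the query only adds a constant inside a monotonic function, so it cannot change the order of the keys. In GATv2 the query and key are mixed before the non-linearity, so it can.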


Intended uses

  • Browser agent perception: replacing raw HTML with a typed, confidence-ranked element list to reduce LLM context usage.
  • DOM annotation: automatically labeling nodes in a page corpus for downstream ML tasks.
  • Web research: studying element-type distributions across sites, languages, and page categories.

Out-of-scope uses

  • Accessibility compliance: the model classifies semantic roles as observed in the wild, not as defined by WCAG or ARIA specifications. Do not use it for accessibility audits.
  • Production-critical UX automation without human oversight: F1 on low-support or hard classes (particularly action_input, action_select, structure_dismissible) is insufficient for fully unattended operation.
  • Adversarial robustness: the model was not trained against adversarially obfuscated DOM structures.

How to use

from model.inference import DOMClassifier
from pathlib import Path
import json

# Load from Hugging Face weights (model.safetensors and config.json must be in the same directory)
clf = DOMClassifier.from_checkpoint("checkpoints_final/model.safetensors")
# Or from a local .pt checkpoint:  DOMClassifier.from_checkpoint("checkpoints_final/best.pt")

raw_page = json.loads(Path("examples/sample_page.json").read_text())
predictions = clf.classify_page(raw_page, action_only=False, min_confidence=0.5)

for p in predictions:
    print(f"[{p['class']:25s}] {p['confidence']:.2f}  {p['selector']}")
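For the browser-agent use case, the prediction list can be reduced to a compact, confidence-ranked block of text before it is handed to an LLM. The sketch below assumes only what the loop above shows, namely that each prediction is a dict with `class`, `confidence`, and `selector` keys. The helper name `to_prompt_lines`, the `top_k` parameter, and the example predictions are hypothetical.

```python
def to_prompt_lines(predictions, top_k=20):
    """Keep action_* predictions, highest confidence first, one line each."""
    actions = [p for p in predictions if p["class"].startswith("action_")]
    actions.sort(key=lambda p: p["confidence"], reverse=True)
    return [
        f"{p['class']} ({p['confidence']:.2f}) {p['selector']}"
        for p in actions[:top_k]
    ]

# Made-up predictions in the same shape as the loop above consumes.
example = [
    {"class": "noise", "confidence": 0.97, "selector": "div.wrapper"},
    {"class": "action_button", "confidence": 0.91, "selector": "button#buy"},
    {"class": "action_link_internal", "confidence": 0.99, "selector": "a.nav-home"},
]
print("\n".join(to_prompt_lines(example)))
```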

Input format

raw_page is a dict with the following top-level keys:

| Key | Type | Description |
|---|---|---|
| url | string | Page URL (used for link feature computation) |
| viewport | dict {width, height} | Viewport dimensions in pixels |
| nodes | list of node dicts | One entry per DOM node |
| edges | list of [src_idx, dst_idx] pairs | Parent→child edges using node list indices |

Each node dict:

| Key | Required | Type | Description |
|---|---|---|---|
| id | yes | string | Unique node identifier |
| tag | yes | string | HTML tag name (e.g. "button", "div") |
| text | no | string | Visible text content (truncated to 200 chars) |
| selector | no | string | CSS selector (returned in predictions, not used as feature) |
| classes | no | list[str] | CSS class tokens |
| attrs | no | dict | HTML attributes (href, id, type, role, …) |
| css | no | dict | Computed CSS (display, position, visibility, opacity, cursor, font_size, font_weight, z_index) |
| bbox | no | dict {x, y, width, height} | Bounding box in pixels |
| depth | no | int | DOM depth from root |
| n_children | no | int | Number of direct children |
| is_visible | no | bool | Whether the node is visible |
| in_viewport | no | bool | Whether the node is in the initial viewport |
| has_listeners_heuristic | no | bool | Whether the node likely has JS event listeners |

Missing optional fields default to zero or empty values.

A complete example is in examples/sample_page.json.
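For orientation, a minimal three-node page in this format might look like the sketch below. The `validate_page` helper is hypothetical and not part of the package. It checks the two required node keys and that edge indices point into the node list, and it assumes all four top-level keys are expected to be present. The URL, node ids, and field values are made up.

```python
REQUIRED_NODE_KEYS = {"id", "tag"}

def validate_page(page):
    """Return a list of problems; an empty list means the page looks well-formed."""
    problems = []
    for key in ("url", "viewport", "nodes", "edges"):
        if key not in page:
            problems.append(f"missing top-level key: {key}")
    nodes = page.get("nodes", [])
    for i, node in enumerate(nodes):
        missing = REQUIRED_NODE_KEYS - node.keys()
        if missing:
            problems.append(f"node {i} missing {sorted(missing)}")
    for src, dst in page.get("edges", []):
        # Edges refer to positions in the node list, not to node ids.
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            problems.append(f"edge [{src}, {dst}] out of range")
    return problems

page = {
    "url": "https://example.com/",
    "viewport": {"width": 1280, "height": 800},
    "nodes": [
        {"id": "n0", "tag": "body", "depth": 0, "n_children": 2},
        {"id": "n1", "tag": "h1", "text": "Welcome", "depth": 1},
        {"id": "n2", "tag": "button", "text": "Sign up", "depth": 1,
         "attrs": {"type": "submit"}, "is_visible": True},
    ],
    "edges": [[0, 1], [0, 2]],  # body -> h1, body -> button
}

print(validate_page(page))
```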


Training data

The model was trained on a curated set of ~135 diverse web pages spanning e-commerce, SaaS, documentation, news, government, and forms, in English and French. Labels were generated by a deterministic heuristic pipeline based on HTML semantics, ARIA roles, CSS properties, and link structure — not by human annotators.

The training dataset is not publicly distributed.


Training procedure

Hardware: NVIDIA L40S (48 GB VRAM)

Hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 80 (early stopping, patience=15) |
| Batch size | 8 pages |
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| LR schedule | Cosine annealing |
| Weight decay | 1e-4 |
| Dropout | 0.3 |
| Hidden dim | 128 |
| Attention heads | 4 |
| GATv2 layers | 3 |
| Class weighting | sqrt-inverse frequency |
| Edge augmentation | Reverse edges + sibling edges |
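Two rows of this table, class weighting and edge augmentation, can be made concrete with a short sketch. Both functions are illustrative and rest on assumptions the card does not confirm: the weights are taken as 1/sqrt(count) and rescaled to average 1, and sibling edges connect every pair of children of the same parent in both directions (the actual pipeline may link only adjacent siblings). The class counts are made up.

```python
import math
from collections import defaultdict

def sqrt_inverse_frequency(counts):
    """w_c = 1 / sqrt(n_c), rescaled so the weights average to 1 (assumed rescaling)."""
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

def augment_edges(edges):
    """Add child->parent edges and edges between children of the same parent."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    out = {(p, c) for p, c in edges}          # original parent->child edges
    out |= {(c, p) for p, c in edges}         # reverse edges
    for sibs in children.values():            # sibling edges (all pairs, assumed)
        out |= {(a, b) for a in sibs for b in sibs if a != b}
    return sorted(out)

weights = sqrt_inverse_frequency(
    {"noise": 10000, "action_button": 400, "action_select": 4}  # hypothetical counts
)
print(weights)
print(augment_edges([[0, 1], [0, 2]]))
```

Compared with plain inverse frequency, the square root dampens the weights, so a class that is 2500 times rarer is up-weighted 50 times rather than 2500 times.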

Feature vector (618 dims/node):

| Feature block | Dims | Notes |
|---|---|---|
| Tag one-hot | 51 | 50 tags + OOV bucket |
| Class hash | 128 | Hashing trick over CSS class tokens (Tailwind-robust) |
| Attribute presence | 17 | id, href, role, aria-*, type, placeholder, … |
| Computed CSS | 28 | display (11) + position (5) + 6 numeric CSS values |
| Bounding box | 5 | x, y, w, h, area (normalized by viewport) |
| Topology | 5 | depth, n_children, is_visible, in_viewport, has_listeners |
| Link semantics | 9 | absolute/relative/fragment/mailto, same-host/domain, path depth |
| Text embedding | 384 | MiniLM-L6-v2 sentence embedding (frozen) |
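The class-hash block relies on the hashing trick, which maps an open-ended vocabulary of CSS class tokens (for example Tailwind utility classes) into a fixed number of buckets without maintaining a vocabulary. The sketch below is one possible realization, not the model's actual featurizer: the choice of MD5 as a stable hash, the use of raw counts rather than binary or signed values, and the function name are all assumptions.

```python
import hashlib

def hash_class_tokens(tokens, n_buckets=128):
    """Map CSS class tokens to a fixed-size count vector via the hashing trick."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        # A cryptographic digest is used only because it is stable across runs,
        # unlike Python's built-in hash() for strings.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % n_buckets
        vec[idx] += 1.0
    return vec

vec = hash_class_tokens(["flex", "items-center", "px-4", "btn-primary"])
print(len(vec), sum(vec))
```

Unseen tokens need no special handling: they land in some bucket like any other token, at the cost of occasional collisions.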

Validation criterion: best checkpoint selected by macro-F1 on the validation split.

Data split: 70 / 15 / 15 train/val/test, stratified by page.


Evaluation results

Evaluated on a held-out test set (15% of pages, stratified split). Numbers reported as mean ± std across 5 independent training runs with different random seeds.

| Metric | Mean ± std | Min | Max |
|---|---|---|---|
| Macro F1 | 0.825 ± 0.026 | 0.797 | 0.865 |
| Weighted F1 | 0.917 ± 0.032 | 0.882 | 0.965 |
| Action F1 (5 classes) | 0.895 ± 0.036 | 0.818 | 0.917 |

Per-class F1, mean ± std across 5 seeds:

| Class | Mean F1 | Std | Test support (best seed) |
|---|---|---|---|
| action_input | 0.686 | 0.104 | 25 |
| action_select | 0.768 | 0.086 | 8 |
| action_button | 0.909 | 0.071 | 1 577 |
| action_link_internal | 0.996 | 0.004 | 3 119 |
| action_link_external | 0.996 | 0.003 | 327 |
| structure_navigation | 0.884 | 0.062 | 52 |
| structure_region | 0.770 | 0.140 | 52 |
| structure_dismissible | 0.363 | 0.073 | 158 |
| structure_card | 0.625 | 0.199 | 1 045 |
| structure_list_item | 0.974 | 0.015 | 3 885 |
| content_heading | 0.986 | 0.007 | 525 |
| content_text | 0.736 | 0.067 | 322 |
| content_media | 0.915 | 0.035 | 1 319 |
| noise | 0.938 | 0.022 | 18 345 |
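To see how the aggregate metrics relate to this table, recall that macro F1 is the unweighted mean of the per-class F1 scores, while weighted F1 weights each class by its support. The sketch below copies the mean F1 and best-seed support columns from the table; the helper names are illustrative. Because the support column comes from a single seed, the support-weighted figure computed this way is not expected to match the reported weighted F1 exactly.

```python
# (mean F1 across seeds, test support for the best seed), copied from the table.
per_class = {
    "action_input": (0.686, 25),
    "action_select": (0.768, 8),
    "action_button": (0.909, 1577),
    "action_link_internal": (0.996, 3119),
    "action_link_external": (0.996, 327),
    "structure_navigation": (0.884, 52),
    "structure_region": (0.770, 52),
    "structure_dismissible": (0.363, 158),
    "structure_card": (0.625, 1045),
    "structure_list_item": (0.974, 3885),
    "content_heading": (0.986, 525),
    "content_text": (0.736, 322),
    "content_media": (0.915, 1319),
    "noise": (0.938, 18345),
}

def macro_f1(table):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(f1 for f1, _ in table.values()) / len(table)

def weighted_f1(table):
    """Support-weighted mean: dominated by large classes such as noise."""
    total = sum(n for _, n in table.values())
    return sum(f1 * n for f1, n in table.values()) / total

print(f"macro F1: {macro_f1(per_class):.3f}")  # 0.825, matching the reported mean
print(f"support-weighted F1: {weighted_f1(per_class):.3f}")
```

The gap between the two aggregates comes from the same effect the table shows: the low-scoring classes are mostly small, so they pull the macro average down far more than the weighted one.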

Limitations

  • Low-support classes. action_input (n=25) and action_select (n=8) have very small test sets — F1 estimates for these classes have high variance and should not be over-interpreted.
  • structure_dismissible is hard. Cookie banners and modal overlays vary enormously across sites. Mean F1 of 0.363 reflects genuine label ambiguity, not a model bug.
  • Heuristic labels. Training labels come from deterministic rules, not human annotation. Near-boundary elements (e.g. a decorative <button> vs. a functional one) may be mislabeled.
  • No price class. Numerical price strings are classified as noise. This is a known gap.
  • Static DOM only. The model operates on a single DOM snapshot. Dynamically loaded content, shadow DOM, and canvas elements are not modeled.
  • Dataset size and diversity. ~135 pages, English and French only. Sites in other languages or with highly unusual layouts are out-of-distribution.

Bias and ethical considerations

  • The model encodes statistical regularities of how web developers structure pages in the training data. Sites that deviate from common patterns (niche CMS, custom frameworks) may see lower accuracy.
  • The noise class is a catch-all for elements that don't fit other categories. Misclassified functional elements (e.g. a decorative-looking but important button) will be silently dropped in action_only=True mode. Always set a confidence threshold and review low-confidence predictions.
  • The model should not be used as the sole decision-maker for automated actions on behalf of users without oversight.
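One way to act on the advice about confidence thresholds is to run with action_only=False and route low-confidence predictions, including uncertain noise predictions that action_only=True would hide, to a review queue. The sketch below assumes the prediction dicts shown in "How to use". The helper name and the 0.8 threshold are hypothetical, and it assumes that min_confidence is set low enough for uncertain predictions to be returned at all.

```python
def split_for_review(predictions, threshold=0.8):
    """Separate confident predictions from ones a human should look at."""
    confident, uncertain = [], []
    for p in predictions:
        (confident if p["confidence"] >= threshold else uncertain).append(p)
    # Uncertain "noise" is the risky bucket: it may hide a functional element.
    uncertain_noise = [p for p in uncertain if p["class"] == "noise"]
    return confident, uncertain, uncertain_noise

# Made-up predictions for illustration.
preds = [
    {"class": "action_button", "confidence": 0.93, "selector": "button#buy"},
    {"class": "noise", "confidence": 0.55, "selector": "div.cta"},
    {"class": "structure_dismissible", "confidence": 0.61, "selector": "div.banner"},
]
confident, uncertain, uncertain_noise = split_for_review(preds)
print(len(confident), len(uncertain), [p["selector"] for p in uncertain_noise])
```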

License

Apache 2.0 — see LICENSE.

Citation

If you use this model in your work, a link back to this repository is appreciated.

Contact

Lucy Paureau · lmi.rest · lucy.paureau@gmail.com

Model size: 1.27M params (Safetensors, F32)