mind2web-candidate-ranker

A DeBERTa-v3-base cross-encoder fine-tuned for candidate generation in web agents: given a natural-language task (plus action history) and a DOM element, it scores how likely that element is to be the target of the next action. It is stage 1 of a three-stage computer-use agent (rank DOM elements → LLM picks the action → Playwright executes).

Trained on Mind2Web (2,000+ tasks across 137 real websites).

Results

Evaluated on the three official Mind2Web generalization splits:

| split | acc@1 | recall@3 | recall@5 | recall@10 | recall@20 | MRR |
| --- | --- | --- | --- | --- | --- | --- |
| test_task (unseen tasks) | 60.2% | 83.5% | 89.4% | 93.0% | 93.8% | 0.725 |
| test_website (unseen websites) | 50.9% | 76.1% | 85.3% | 92.6% | 95.3% | 0.656 |
| test_domain (unseen domains) | 55.4% | 79.8% | 87.1% | 92.8% | 94.5% | 0.689 |

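Recall@10 is about 93% on all three splits, including entirely unseen domains (92.8%). Going from k=10 to k=20 adds only 0.8 to 2.7 points, so the recall curve has largely flattened by k≈20, which makes top-20 a good cost/recall operating point for the downstream LLM.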
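The card does not say how these metrics were computed, so the following is only a minimal sketch of the usual definitions. It assumes one ground-truth element per action (in which case acc@1 equals recall@1) and breaks score ties in favour of the positive; the function names and toy scores are illustrative, not taken from the evaluation code.

```python
from typing import Dict, Sequence


def rank_of_positive(scores: Sequence[float], positive_index: int) -> int:
    """Return the 1-based rank of the ground-truth element under descending score."""
    target = scores[positive_index]
    # Assumption: ties are broken in favour of the positive element.
    return 1 + sum(1 for s in scores if s > target)


def summarize(ranks: Sequence[int], ks: Sequence[int] = (1, 3, 5, 10, 20)) -> Dict[str, float]:
    """Compute recall@k for each k and the mean reciprocal rank over all actions."""
    n = len(ranks)
    metrics = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    return metrics


# Three toy actions: (scores for every element on the page, index of the positive).
toy_actions = [
    ([0.1, 0.9, 0.4], 1),
    ([0.8, 0.2, 0.5], 2),
    ([0.3, 0.2, 0.1, 0.05], 3),
]
ranks = [rank_of_positive(scores, pos) for scores, pos in toy_actions]
metrics = summarize(ranks, ks=(1, 3))
print(ranks, metrics)
```

With real data, each entry would hold the ranker's scores for all candidate elements of one action.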

Usage

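from sentence_transformers import CrossEncoder

# num_labels=1 yields a single relevance score per (candidate, query) pair.
model = CrossEncoder("torontodeveloper/mind2web-candidate-ranker", num_labels=1, max_length=512)

# One serialized DOM element, and the task description plus action history.
candidate = "tag: input | aria_label: Destination | placeholder: To?"
query = "task is: Search for a flight from Toronto to New York. Type NYC into the destination field.\nPrevious actions: "

# predict() takes a list of pairs and returns one score per pair.
scores = model.predict([(candidate, query)])
print(scores[0])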

Candidates are serialized DOM elements (tag, aria_label, placeholder, name, is_clickable, bounding box, etc.); the query is the task description plus the last few actions. Score all interactable elements on a page and keep the top-k.
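As a rough sketch of "score all interactable elements and keep the top-k", the helpers below serialize attribute dictionaries into the `key: value | key: value` form shown above and rank them with any scoring function. The helper names, the `"; "` separator between previous actions, and the stub scorer are assumptions for illustration; with the real model, `score_fn` would be `model.predict`.

```python
from typing import Callable, List, Mapping, Sequence, Tuple


def serialize_element(attrs: Mapping[str, object]) -> str:
    """Join non-empty attributes as 'key: value | key: value'."""
    return " | ".join(f"{k}: {v}" for k, v in attrs.items() if v not in (None, ""))


def build_query(task: str, previous_actions: Sequence[str]) -> str:
    """Task description followed by the action history (separator is an assumption)."""
    return f"task is: {task}\nPrevious actions: " + "; ".join(previous_actions)


def top_k(
    elements: Sequence[Mapping[str, object]],
    query: str,
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    k: int = 20,
) -> List[Tuple[str, float]]:
    """Score every element against the query and keep the k highest-scoring ones."""
    candidates = [serialize_element(e) for e in elements]
    scores = score_fn([(c, query) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [(candidates[i], float(scores[i])) for i in order[:k]]


# Stub scorer standing in for model.predict, so the sketch runs without the model.
def stub_score(pairs: List[Tuple[str, str]]) -> List[float]:
    return [1.0 if "Destination" in cand else 0.0 for cand, _ in pairs]


elements = [
    {"tag": "a", "aria_label": "Help", "placeholder": ""},
    {"tag": "input", "aria_label": "Destination", "placeholder": "To?"},
]
query = build_query("Type NYC into the destination field.", [])
best = top_k(elements, query, stub_score, k=1)
print(best)
```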

Training

  • Base model: microsoft/deberta-v3-base (86M backbone params, about 184M including the embedding layer)
  • Data: Mind2Web train split — 7,775 actions, 1 positive + 3 sampled negatives each (31,100 pairs)
  • 3 epochs, batch size 8, lr 3e-5, BCEWithLogitsLoss, linear warmup
  • Single T4 GPU (~85 minutes), fp32
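The pair construction and loss listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the training script: the sampling strategy (uniform, without replacement) and the helper names are hypothetical, and `bce_with_logits` mirrors the numerically stable form of binary cross-entropy on logits for a single example.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def build_pairs(
    positive: str,
    others: Sequence[str],
    num_negatives: int = 3,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    """Return 1 positive (label 1.0) plus up to num_negatives sampled negatives (label 0.0)."""
    if rng is None:
        rng = random.Random(0)
    # Assumption: negatives are drawn uniformly without replacement from the other elements.
    negatives = rng.sample(list(others), min(num_negatives, len(others)))
    return [(positive, 1.0)] + [(neg, 0.0) for neg in negatives]


def bce_with_logits(logit: float, label: float) -> float:
    """Stable binary cross-entropy on a raw logit: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


pairs = build_pairs(
    "tag: input | aria_label: Destination",
    ["tag: a", "tag: button", "tag: div", "tag: span"],
)
total_pairs = 7775 * (1 + 3)  # 7,775 actions, 4 pairs each
print(len(pairs), total_pairs, bce_with_logits(0.0, 1.0))
```

The pair count reproduces the figure in the list: 7,775 actions × 4 pairs = 31,100.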

⚠️ Precision note

DeBERTa-v3's disentangled attention overflows fp16 (NaN logits). Load and run this model in fp32 (or bf16 on Ampere+). Newer transformers versions may default to fp16 on GPU — pass torch_dtype=torch.float32 explicitly if needed.
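One way to apply this advice is sketched below. It is a hedged example: it assumes the checkpoint loads through transformers' `AutoModelForSequenceClassification` (the class that `CrossEncoder` wraps), and the helper names are hypothetical. The dtype rule relies on Ampere and later GPUs reporting CUDA compute capability 8.0 or higher; a T4 reports 7.5.

```python
from typing import Tuple


def pick_dtype_name(compute_capability: Tuple[int, int]) -> str:
    """bf16 on Ampere or newer (compute capability >= 8.0), fp32 otherwise. Never fp16."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float32"


def load_ranker(model_id: str = "torontodeveloper/mind2web-candidate-ranker"):
    """Load tokenizer and model with an explicit, overflow-safe dtype."""
    # Imported lazily so the dtype helper can be used without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    if torch.cuda.is_available():
        device = "cuda"
        dtype = getattr(torch, pick_dtype_name(torch.cuda.get_device_capability()))
    else:
        device, dtype = "cpu", torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=dtype)
    return tokenizer, model.to(device).eval()


print(pick_dtype_name((7, 5)), pick_dtype_name((8, 0)))
```

Calling `load_ranker()` downloads the checkpoint, so it is left uncalled here.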

Pipeline context

This model is the candidate generator from mind2web-computer-use-agent: stage 2 (action prediction) uses FLAN-T5 / GPT-class models over the top-k candidates, and stage 3 executes via Playwright. Architecture follows MindAct (Deng et al., 2023).
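The three stages can be wired together roughly as follows. This is a schematic sketch with injected callables, not the code from mind2web-computer-use-agent: `run_step`, the stub functions, and the action string format are all hypothetical placeholders for the ranker, the LLM, and the Playwright executor.

```python
from typing import Callable, List, Sequence

Ranker = Callable[[str, Sequence[str], Sequence[str]], List[str]]
Chooser = Callable[[str, Sequence[str], Sequence[str]], str]
Executor = Callable[[str], None]


def run_step(
    task: str,
    history: Sequence[str],
    elements: Sequence[str],
    rank: Ranker,
    choose: Chooser,
    execute: Executor,
    k: int = 20,
) -> List[str]:
    """Run one agent step and return the updated action history."""
    shortlist = rank(task, history, elements)[:k]  # stage 1: candidate ranking
    action = choose(task, history, shortlist)      # stage 2: LLM picks the action
    execute(action)                                # stage 3: browser execution
    return list(history) + [action]


# Stubs standing in for the cross-encoder, the LLM, and Playwright.
executed: List[str] = []
new_history = run_step(
    task="Type NYC into the destination field",
    history=[],
    elements=["tag: a | text: Help", "tag: input | aria_label: Destination"],
    rank=lambda task, history, elements: sorted(
        elements, key=lambda e: "Destination" in e, reverse=True
    ),
    choose=lambda task, history, shortlist: f"TYPE [{shortlist[0]}] NYC",
    execute=executed.append,
    k=1,
)
print(new_history)
```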
