mind2web-candidate-ranker

A DeBERTa-v3-base cross-encoder fine-tuned for candidate generation in web agents: given a natural-language task (plus action history) and a DOM element, it scores how likely that element is the next action target. It is stage 1 of a three-stage computer-use agent (rank DOM elements → LLM picks the action → Playwright executes).

Trained on the train split of Mind2Web, a dataset of 2,000+ tasks across 137 real websites.

Results

Evaluated on the three official Mind2Web generalization splits:

split	acc@1	recall@3	recall@5	recall@10	recall@20	MRR
test_task (unseen tasks)	60.2%	83.5%	89.4%	93.0%	93.8%	0.725
test_website (unseen websites)	50.9%	76.1%	85.3%	92.6%	95.3%	0.656
test_domain (unseen domains)	55.4%	79.8%	87.1%	92.8%	94.5%	0.689

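Recall@10 stays near 93% on all three splits, including entirely unseen domains (92.8% on test_domain). Recall@20 ranges from 93.8% to 95.3%, only 0.8 to 2.7 points above recall@10, so the recall curve is nearly flat by k≈20. That makes top-20 a good cost/recall operating point for the downstream LLM.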
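The metrics in the table can all be derived from the rank of the correct element for each action. The following is a minimal sketch, assuming each test action contributes a single 1-based rank (for example, the best-ranked positive element); `ranking_metrics` and its input format are illustrative and are not the evaluation script that produced the numbers above.

```python
def ranking_metrics(positive_ranks, ks=(3, 5, 10, 20)):
    """Compute acc@1, recall@k and MRR from 1-based ranks of the correct element.

    positive_ranks: one integer per action, the position of the ground-truth
    element in the score-sorted candidate list (1 = ranked first).
    """
    n = len(positive_ranks)
    if n == 0:
        raise ValueError("need at least one ranked action")

    metrics = {"acc@1": sum(r == 1 for r in positive_ranks) / n}
    for k in ks:
        # recall@k: fraction of actions whose target is inside the top-k
        metrics[f"recall@{k}"] = sum(r <= k for r in positive_ranks) / n
    # MRR: mean of 1/rank over all actions
    metrics["mrr"] = sum(1.0 / r for r in positive_ranks) / n
    return metrics


# Toy example with four actions whose targets were ranked 1st, 2nd, 4th and 1st.
toy = ranking_metrics([1, 2, 4, 1])
```

With these toy ranks, half of the actions are solved at rank 1, three of four fall inside the top 3, and all fall inside the top 5.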

Usage

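from sentence_transformers import CrossEncoder

# Single-logit cross-encoder; inputs longer than 512 tokens are truncated.
model = CrossEncoder("torontodeveloper/mind2web-candidate-ranker", num_labels=1, max_length=512)

# One serialized DOM element, and the query (task description plus action history).
candidate = "tag: input | aria_label: Destination | placeholder: To?"
query = "task is: Search for a flight from Toronto to New York. Type NYC into the destination field.\nPrevious actions: "

# predict() takes a list of (candidate, query) pairs and returns one score per pair.
score = model.predict([(candidate, query)])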

Candidates are serialized DOM elements (tag, aria_label, placeholder, name, is_clickable, bounding box, etc.); the query is the task description plus the last few actions. Score all interactable elements on a page and keep the top-k.
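The serialize, score, and keep-top-k loop described above could look roughly like the sketch below. The helper names (`serialize_element`, `build_query`, `rank_candidates`), the default field order, the `"; "` separator between previous actions, and the three-action history window are all assumptions inferred from the single example in this card; check them against the training preprocessing before relying on them. The scorer is passed in as a callable so that `model.predict` can be plugged in.

```python
def serialize_element(attrs, fields=("tag", "aria_label", "placeholder", "name")):
    """Render a DOM element dict as 'key: value | key: value' (assumed format)."""
    parts = [f"{key}: {attrs[key]}" for key in fields
             if attrs.get(key) not in (None, "")]
    return " | ".join(parts)


def build_query(task, previous_actions, max_history=3):
    """Task description plus the last few actions (assumed separator and window)."""
    recent = previous_actions[-max_history:] if max_history > 0 else []
    return f"task is: {task}\nPrevious actions: {'; '.join(recent)}"


def rank_candidates(query, candidates, score_pairs, k=20):
    """Score every candidate against the query and return the top-k pairs.

    score_pairs: callable taking a list of (candidate, query) tuples and
    returning one score per tuple, e.g. the `predict` method of a CrossEncoder.
    """
    scores = score_pairs([(candidate, query) for candidate in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(candidate, float(score)) for candidate, score in ranked[:k]]


# Usage with the model from the snippet above (not executed here):
#   elements = [serialize_element(e) for e in interactable_elements]
#   top20 = rank_candidates(build_query(task, history), elements, model.predict, k=20)
```

Passing the scorer as an argument keeps the ranking logic testable without loading the model.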

Training

Base model: microsoft/deberta-v3-base (86M backbone params, about 184M including the embedding layer)
Data: Mind2Web train split — 7,775 actions, 1 positive + 3 sampled negatives each (31,100 pairs)
3 epochs, batch size 8, lr 3e-5, BCEWithLogitsLoss, linear warmup
Single T4 GPU (~85 minutes), fp32
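To make the data recipe concrete, here is a sketch of how one positive and three sampled negatives per action might be turned into labeled pairs, together with the per-pair binary cross-entropy on a single logit. The action dictionary layout (`query`, `positive`, `negatives`), the function names, and the fixed seed are hypothetical; the actual training script may sample differently, for example by resampling negatives each epoch.

```python
import math
import random


def build_training_pairs(actions, negatives_per_positive=3, seed=0):
    """Return (candidate, query, label) triples: 1 positive + N sampled negatives."""
    rng = random.Random(seed)
    pairs = []
    for action in actions:
        query = action["query"]
        pairs.append((action["positive"], query, 1.0))
        # Sample without replacement; pages with few elements yield fewer negatives.
        n_neg = min(negatives_per_positive, len(action["negatives"]))
        for negative in rng.sample(action["negatives"], n_neg):
            pairs.append((negative, query, 0.0))
    return pairs


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit (one pair)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


toy_actions = [
    {"query": "task is: A\nPrevious actions: ", "positive": "tag: a",
     "negatives": ["tag: b", "tag: c", "tag: d", "tag: e", "tag: f"]},
    {"query": "task is: B\nPrevious actions: ", "positive": "tag: g",
     "negatives": ["tag: h", "tag: i", "tag: j", "tag: k"]},
]
toy_pairs = build_training_pairs(toy_actions)

# Pair count reported above: 7,775 actions x (1 positive + 3 negatives).
expected_pairs = 7775 * (1 + 3)
```

The count works out to 31,100 pairs, matching the figure in the list above.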

⚠️ Precision note

DeBERTa-v3's disentangled attention overflows fp16 (NaN logits). Load and run this model in fp32 (or bf16 on Ampere+). Newer transformers versions may default to fp16 on GPU — pass torch_dtype=torch.float32 explicitly if needed.
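One way to pin the dtype when loading through sentence-transformers is sketched below. It assumes a recent sentence-transformers release in which `CrossEncoder` accepts `model_kwargs` and forwards it to transformers (older releases used a differently named argument), and a transformers version that still accepts `torch_dtype`. The helper names `pick_dtype_name` and `load_ranker` are illustrative. The bf16 branch relies on Ampere corresponding to CUDA compute capability 8.x.

```python
def pick_dtype_name(cuda_capability_major=None):
    """Choose a dtype that avoids fp16: bf16 on Ampere (8.x) or newer, else fp32."""
    if cuda_capability_major is not None and cuda_capability_major >= 8:
        return "bfloat16"
    return "float32"


def load_ranker(model_id="torontodeveloper/mind2web-candidate-ranker"):
    """Load the cross-encoder with an explicit non-fp16 dtype (assumed API usage)."""
    # Imported lazily so the dtype helper can be used without torch installed.
    import torch
    from sentence_transformers import CrossEncoder

    major = torch.cuda.get_device_capability()[0] if torch.cuda.is_available() else None
    dtype = getattr(torch, pick_dtype_name(major))
    return CrossEncoder(
        model_id,
        num_labels=1,
        max_length=512,
        model_kwargs={"torch_dtype": dtype},  # assumption: forwarded to from_pretrained
    )
```

On a T4 (compute capability 7.5) this selects fp32, which matches the training setup.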

Pipeline context

This model is the candidate generator from mind2web-computer-use-agent: stage 2 (action prediction) uses FLAN-T5 / GPT-class models over the top-k candidates, and stage 3 executes via Playwright. Architecture follows MindAct (Deng et al., 2023).
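How the three stages fit together for a single step can be sketched as follows. Every name here (`run_step`, `score_pairs`, `choose_action`, `execute`) is hypothetical and stands in for the ranker's predict call, the LLM action predictor, and the Playwright executor respectively; the query format mirrors the usage example above, and the `"; "` history separator is an assumption.

```python
def run_step(task, history, elements, score_pairs, choose_action, execute, k=20):
    """One agent step: rank elements, let the LLM pick an action, then execute it."""
    query = f"task is: {task}\nPrevious actions: {'; '.join(history)}"

    # Stage 1: score every serialized element and keep the top-k.
    scores = score_pairs([(element, query) for element in elements])
    ranked = sorted(zip(elements, scores), key=lambda pair: pair[1], reverse=True)
    shortlist = [element for element, _ in ranked[:k]]

    # Stage 2: the action predictor only sees the shortlist, not the full DOM.
    action = choose_action(task, history, shortlist)

    # Stage 3: hand the chosen action to the browser executor.
    execute(action)
    return action
```

Because the shortlist is the only view of the page that stage 2 receives, recall@k of the ranker is an upper bound on what the downstream model can get right.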
