mind2web-candidate-ranker
A DeBERTa-v3-base cross-encoder fine-tuned for candidate generation in web agents: given a natural-language task (plus action history) and a DOM element, it scores how likely that element is the next action target. It is stage 1 of a two-stage computer-use agent (rank DOM elements → LLM picks the action → Playwright executes).
Trained on Mind2Web (2,000+ tasks across 137 real websites).
Results
Evaluated on the three official Mind2Web generalization splits:
| split | acc@1 | recall@3 | recall@5 | recall@10 | recall@20 | MRR |
|---|---|---|---|---|---|---|
| test_task (unseen tasks) | 60.2% | 83.5% | 89.4% | 93.0% | 93.8% | 0.725 |
| test_website (unseen websites) | 50.9% | 76.1% | 85.3% | 92.6% | 95.3% | 0.656 |
| test_domain (unseen domains) | 55.4% | 79.8% | 87.1% | 92.8% | 94.5% | 0.689 |
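Recall@10 is about 93% on every split, including entirely unseen domains (92.8%). Going from k=10 to k=20 adds at most 2.7 points, so the recall curve plateaus around k≈20, which makes top-20 a good cost/recall operating point for the downstream LLM.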
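For readers who want to reproduce numbers of this kind, the metrics in the table can be sketched as follows. This is a minimal illustration, not the evaluation script used for this model: it assumes each action has exactly one ground-truth element (so acc@1 equals recall@1), and the function name `ranking_metrics` and the example ranks are hypothetical.

```python
from typing import Dict, Sequence


def ranking_metrics(
    ranks: Sequence[int], ks: Sequence[int] = (1, 3, 5, 10, 20)
) -> Dict[str, float]:
    """Compute recall@k and MRR.

    ranks[i] is the 1-based position of the ground-truth element for
    action i after sorting that page's candidates by score (best first).
    Assumption: exactly one positive element per action.
    """
    n = len(ranks)
    if n == 0:
        raise ValueError("need at least one ranked action")
    # recall@k: fraction of actions whose positive lands in the top k
    metrics = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # MRR: mean of 1/rank over all actions
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    return metrics


# Made-up ranks for four actions, purely for illustration.
example = ranking_metrics([1, 2, 4, 25])
```

An action whose positive is ranked 25th counts against every recall@k in the table, but still contributes 1/25 to MRR.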
Usage
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("torontodeveloper/mind2web-candidate-ranker", num_labels=1, max_length=512)

candidate = "tag: input | aria_label: Destination | placeholder: To?"
query = "task is: Search for a flight from Toronto to New York. Type NYC into the destination field.\nPrevious actions: "

score = model.predict([(candidate, query)])
```
Candidates are serialized DOM elements (tag, aria_label, placeholder, name,
is_clickable, bounding box, etc.); the query is the task description plus the last
few actions. Score all interactable elements on a page and keep the top-k.
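The "score everything, keep the top-k" step can be sketched like this. It is a minimal illustration under stated assumptions: `rank_candidates` and `toy_score_fn` are hypothetical names, the toy scorer is only a stand-in so the snippet runs without downloading the model, and the `(candidate, query)` pair order simply mirrors the usage snippet above.

```python
from typing import Callable, List, Sequence, Tuple


def rank_candidates(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    k: int = 20,
) -> List[Tuple[str, float]]:
    """Score every serialized DOM element against the query and keep the top k."""
    # Same (candidate, query) order as in the usage snippet.
    pairs = [(candidate, query) for candidate in candidates]
    scores = [float(s) for s in score_fn(pairs)]
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def toy_score_fn(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in for model.predict: counts words shared with the query."""
    scores = []
    for candidate, query in pairs:
        query_words = set(query.lower().split())
        scores.append(float(sum(w in query_words for w in candidate.lower().split())))
    return scores


page_elements = [
    "tag: button | text: Sign in",
    "tag: input | aria_label: Destination | placeholder: To?",
    "tag: input | aria_label: Origin | placeholder: From?",
]
query = "task is: Type NYC into the destination field.\nPrevious actions: "
top = rank_candidates(query, page_elements, toy_score_fn, k=2)
```

With the real model loaded as in the snippet above, passing `model.predict` as `score_fn` should work in place of the toy scorer, since it takes a list of text pairs and returns one score per pair.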
Training
- Base model: microsoft/deberta-v3-base (86M params)
- Data: Mind2Web train split, 7,775 actions with 1 positive + 3 sampled negatives each (31,100 pairs)
- 3 epochs, batch size 8, lr 3e-5, BCEWithLogitsLoss, linear warmup
- Single T4 GPU (~85 minutes), fp32
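The pair construction described in the list above might look roughly like the sketch below. The model card only says "3 sampled negatives"; uniform sampling from the other elements on the same page, the fixed seed, and the names `build_pairs` and `negatives_pool` are assumptions made for illustration, not the author's training code.

```python
import random
from typing import List, Optional, Sequence, Tuple


def build_pairs(
    query: str,
    positive: str,
    negatives_pool: Sequence[str],
    num_negatives: int = 3,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str, float]]:
    """Return (candidate, query, label) triples for one action.

    Label 1.0 marks the ground-truth element, 0.0 a sampled negative;
    float labels match what BCEWithLogitsLoss expects as targets.
    """
    rng = rng or random.Random(0)  # assumed seed, for reproducibility only
    n = min(num_negatives, len(negatives_pool))
    sampled = rng.sample(list(negatives_pool), n)  # assumed: uniform, without replacement
    pairs = [(positive, query, 1.0)]
    pairs += [(negative, query, 0.0) for negative in sampled]
    return pairs


pairs = build_pairs(
    query="task is: Type NYC into the destination field.\nPrevious actions: ",
    positive="tag: input | aria_label: Destination | placeholder: To?",
    negatives_pool=[
        "tag: button | text: Sign in",
        "tag: input | aria_label: Origin | placeholder: From?",
        "tag: a | text: Help",
        "tag: button | text: Search",
    ],
)
```

Each action yields 4 pairs, which is consistent with the stated totals: 7,775 actions × 4 = 31,100 pairs.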
⚠️ Precision note
DeBERTa-v3's disentangled attention overflows fp16 (NaN logits). Load and run this
model in fp32 (or bf16 on Ampere+). Newer transformers versions may default to fp16
on GPU — pass torch_dtype=torch.float32 explicitly if needed.
Pipeline context
This model is the candidate generator from mind2web-computer-use-agent: stage 2 (action prediction) uses FLAN-T5 / GPT-class models over the top-k candidates, and stage 3 executes via Playwright. Architecture follows MindAct (Deng et al., 2023).