
Model Card for Llama8b-NNetNav-WA

Llama8b-NNetNav-WA is a Llama-3.1-8B model that is instruct-tuned on NNetNav-WA data, collected via unsupervised exploration of WebArena websites using a larger Llama-3.1-70B model.

More details about this model can be found in our paper: NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild.


Model Details

This model is intended to be used as a web agent: given an instruction such as "Upvote the post by user smurty123 on subreddit r/LocalLLaMA" and a web URL such as reddit.com, the model performs the task by executing a sequence of actions.

The action space of the model is as follows:

Page Operation Actions:
`click [id]`: This action clicks on an element with a specific id on the webpage.
`type [id] [content] [press_enter_after=0|1]`: Use this to type the content into the field with the given id. By default, the "Enter" key is pressed after typing unless `press_enter_after` is set to 0.
`hover [id]`: Hover over an element with the given id.
`press [key_comb]`: Simulates pressing a key combination on the keyboard (e.g., Ctrl+v).
`scroll [down|up]`: Scroll the page up or down.

Tab Management Actions:
`new_tab`: Open a new, empty browser tab.
`tab_focus [tab_index]`: Switch the browser's focus to a specific tab using its index.
`close_tab`: Close the currently active tab.

URL Navigation Actions:
`goto [url]`: Navigate to a specific URL.
`go_back`: Navigate to the previously viewed page.
`go_forward`: Navigate to the next page (if a previous 'go_back' action was performed).

Completion Action:
`stop [answer]`: Issue this action when you believe the task is complete. If the objective is to find a text-based answer, provide the answer in the bracket. If you believe the task is impossible to complete, provide the answer as "N/A" in the bracket.
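Each action above follows a simple `name [arg] [arg] ...` grammar, so the model's raw output can be split into an action name and bracketed arguments. A minimal parser sketch (the helper names are illustrative, not from the released codebase):

```python
import re

# Matches an action name followed by zero or more [bracketed] arguments,
# e.g. `click [42]` or `type [15] [hello] [press_enter_after=0]`.
# Note: content containing a literal `]` would break this simple grammar.
ACTION_RE = re.compile(r"^(\w+)((?:\s*\[[^\]]*\])*)$")

def parse_action(action: str) -> dict:
    """Split an action string into its name and argument list."""
    m = ACTION_RE.match(action.strip())
    if m is None:
        raise ValueError(f"Unrecognized action: {action!r}")
    name = m.group(1)
    args = re.findall(r"\[([^\]]*)\]", m.group(2))
    return {"name": name, "args": args}
```

For example, `parse_action("stop [N/A]")` yields `{"name": "stop", "args": ["N/A"]}`, and argument-free actions such as `new_tab` yield an empty argument list.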

Results on Benchmarks

This model gets the following results on WebArena and WebVoyager:

| Model | WebArena (SR) | WebVoyager (SR) |
|---|---|---|
| GPT-4 | 14.1 | 33.5 |
| llama8b-nnetnav-wa | 16.3 | 28.1 |

Bias, Risks, and Limitations

Bias

As with all ML models, Llama8b-NNetNav-WA inherits biases from its training data. Since the dataset is collected via unsupervised exploration on self-hosted WebArena websites, it will reflect biases present in website structures, navigation flows, and content representations.

  • Selection Bias: The model is trained on self-hosted websites that mimic Reddit, GitHub, Google Maps, a simple e-commerce site, and a CMS. It is likely to struggle with the modern layouts found on live websites.
  • Demographic Bias: WebArena self-hosted websites over-represent Western English-speaking users, and the model may perform worse on non-English or culturally distinct websites.
    • Example: A model trained mostly on U.S. e-commerce sites may navigate amazon.com effectively but may struggle with Flipkart (India) or Rakuten (Japan).

If you are interested in training an NNetNav-based agent for your own domain, please check out our codebase. If you are interested in a model that has been shown to work well on a variety of live websites, please check out Llama8b-NNetNav-Live.

Risks

1. Unintended Actions

The model operates by executing web actions based on textual observation spaces, which may lead to unintended consequences when dealing with ambiguous or poorly structured websites.

  • If instructed to "delete all spam messages in my inbox," but the website has unusual button placement in the AXTree, the model might mistakenly delete important emails instead.
  • If asked to "buy the cheapest laptop on Amazon," the model might select an accessory instead of an actual laptop if the AXTree of the listing page has a misleading layout.

2. Security & Privacy Risks

Since the model interacts with external web content, there are significant risks related to unintentional data exposure, credential leaks, and interaction with harmful content.

  • If asked to "log into my Gmail and check unread emails," the model may type and submit credentials without realizing it, potentially exposing passwords.
  • A user asking the model to "search for free software downloads" might inadvertently lead to interactions with phishing or malware-hosting sites.

3. Adversarial Manipulation

Malicious websites can deceive the model by using dark patterns—UI/UX tricks that mislead users (or bots).

  • A fraudulent website may create fake "Close" buttons in the AXTree that actually trigger downloads or pop-ups. The model, thinking it's closing a window, may instead click a malicious link.
  • If asked to "unsubscribe from a newsletter," but the page uses misleading button labels in the AXTree (e.g., "Unsubscribe" actually means "Resubscribe"), the model could perform the opposite action.

4. Legal & Ethical Considerations

Web navigation often involves handling user-generated content, news, and e-commerce transactions, all of which pose ethical and legal challenges.

  • If instructed to "find the latest election results," the model might click on a misleading news source, potentially spreading misinformation.
  • If asked to "find the cheapest flight ticket," it could unintentionally violate terms of service by scraping restricted airline data.

Limitations

1. Generalization to Unseen Websites

This model is trained via interaction with 5 self-hosted WebArena websites and is known to struggle on real, live websites. Please check out Llama8b-NNetNav-Live for a model that performs better on live websites.

2. Instruction Sensitivity

Vague instructions can lead to unintended actions.

  • "Find me the best laptop for gaming" is subjective; the model might select an arbitrary option instead of applying concrete criteria (e.g., GPU, refresh rate).

3. Performance on Long-Horizon Tasks

The model may struggle when tasks require deep memory retention, complex multi-step planning, or backtracking.

  • Example: When booking a hotel on a travel website, the model might navigate through multiple filters and options but forget previous selections when reaching the checkout page.

4. Token Limitations

The model's maximum sequence length of 20k tokens limits its ability to handle long, continuous web interactions.

  • Example: When filling a very long multi-step form, the model might forget earlier responses, leading to errors.

How to Get Started with the Model

TODO
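In the meantime, here is a minimal, hedged sketch of loading the model with Hugging Face `transformers` and generating the next action. The prompt serialization below (`build_prompt`) is purely illustrative; the real agent formats the task instruction and accessibility-tree observation as described in the NNetNav paper and codebase.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanfordnlp/llama8b-nnetnav-wa"

def build_prompt(objective: str, observation: str) -> str:
    # Illustrative serialization only; follow the NNetNav codebase
    # for the exact prompt format the model was trained on.
    return f"OBJECTIVE: {objective}\nOBSERVATION:\n{observation}\nACTION:"

def next_action(objective: str, observation: str, max_new_tokens: int = 64) -> str:
    """Generate a single action string for the given task and observation."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(build_prompt(objective, observation), return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens (the action), not the prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

In a full agent loop, the returned action string would be parsed and executed against the browser, and the resulting observation fed back into the next call.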

Training Details

Training Data

This model was trained with SFT on the NNetNav-WA dataset, which consists entirely of synthetic demonstrations from self-hosted websites.

Training Procedure

This model was trained for 2 epochs (roughly 4k gradient steps) with a batch size of 128 and a maximum sequence length of 20,000 tokens.
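As a back-of-the-envelope consistency check on these numbers (assuming each gradient step processes one full batch of 128 demonstrations):

```python
# Rough check of the training setup stated above; "roughly 4k steps"
# makes the result approximate, not an official dataset size.
steps = 4000       # roughly 4k gradient steps
batch_size = 128
epochs = 2

sequences_processed = steps * batch_size      # total demonstrations seen
dataset_size = sequences_processed // epochs  # demonstrations per epoch
print(dataset_size)  # ~256,000
```

This suggests the SFT set contains on the order of 256k demonstrations per epoch, consistent with the two-epoch schedule.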

Environmental Impact

  • Hardware Type: 4 H100 GPUs (80G)
  • Hours used: Roughly 2 days.
  • Cloud Provider: Stanford compute.
  • Compute Region: Stanford energy grid.

Technical Specifications

Hardware

This model was trained on 4 H100s.

Software

This model was fine-tuned with Open-Instruct.

Model Card Authors

Shikhar Murty

Model Card Contact

smurty@cs.stanford.edu
