File size: 11,935 Bytes
8818829
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "is5mcO3am-g2"
      },
      "outputs": [],
      "source": [
        "# Installation\n",
        "! pip install smolagents\n",
        "# To install from source instead of the last release, comment the command above and uncomment the following one.\n",
        "# ! pip install git+https://github.com/huggingface/smolagents.git"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QmOAmqNSm-g3"
      },
      "source": [
        "# Web Browser Automation with Agents 🤖🌐"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8KNjj3xWm-g3"
      },
      "source": [
        "In this notebook, we'll create an **agent-powered web browser automation system**! This system can navigate websites, interact with elements, and extract information automatically.\n",
        "\n",
        "The agent will be able to:\n",
        "\n",
        "- [x] Navigate to web pages\n",
        "- [x] Click on elements\n",
        "- [x] Search within pages\n",
        "- [x] Handle popups and modals\n",
        "- [x] Extract information\n",
        "\n",
        "Let's set up this system step by step!\n",
        "\n",
        "First, run these lines to install the required dependencies:\n",
        "\n",
        "```bash\n",
        "pip install smolagents selenium helium pillow -q\n",
        "```\n",
        "\n",
        "Let's import our required libraries and set up environment variables:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nAKrZrxWm-g4"
      },
      "outputs": [],
      "source": [
        "from io import BytesIO\n",
        "from time import sleep\n",
        "\n",
        "import helium\n",
        "from dotenv import load_dotenv\n",
        "from PIL import Image\n",
        "from selenium import webdriver\n",
        "from selenium.webdriver.common.by import By\n",
        "from selenium.webdriver.common.keys import Keys\n",
        "\n",
        "from smolagents import CodeAgent, tool\n",
        "from smolagents.agents import ActionStep\n",
        "\n",
        "# Load environment variables\n",
        "load_dotenv()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wyEvPcbWm-g4"
      },
      "source": [
        "Now let's create our core browser interaction tools that will allow our agent to navigate and interact with web pages:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aFQ0HEwSm-g4"
      },
      "outputs": [],
      "source": [
        "@tool\n",
        "def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:\n",
        "    \"\"\"\n",
        "    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.\n",
        "    Args:\n",
        "        text: The text to search for\n",
        "        nth_result: Which occurrence to jump to (default: 1)\n",
        "    \"\"\"\n",
        "    elements = driver.find_elements(By.XPATH, f\"//*[contains(text(), '{text}')]\")\n",
        "    if nth_result > len(elements):\n",
        "        raise Exception(f\"Match n°{nth_result} not found (only {len(elements)} matches found)\")\n",
        "    result = f\"Found {len(elements)} matches for '{text}'.\"\n",
        "    elem = elements[nth_result - 1]\n",
        "    driver.execute_script(\"arguments[0].scrollIntoView(true);\", elem)\n",
        "    result += f\"Focused on element {nth_result} of {len(elements)}\"\n",
        "    return result\n",
        "\n",
        "@tool\n",
        "def go_back() -> None:\n",
        "    \"\"\"Goes back to previous page.\"\"\"\n",
        "    driver.back()\n",
        "\n",
        "@tool\n",
        "def close_popups() -> str:\n",
        "    \"\"\"\n",
        "    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!\n",
        "    This does not work on cookie consent banners.\n",
        "    \"\"\"\n",
        "    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YdEVE-Nlm-g5"
      },
      "source": [
        "Let's set up our browser with Chrome and configure screenshot capabilities:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5WBUdv5km-g5"
      },
      "outputs": [],
      "source": [
        "# Configure Chrome options\n",
        "chrome_options = webdriver.ChromeOptions()\n",
        "chrome_options.add_argument(\"--force-device-scale-factor=1\")\n",
        "chrome_options.add_argument(\"--window-size=1000,1350\")\n",
        "chrome_options.add_argument(\"--disable-pdf-viewer\")\n",
        "chrome_options.add_argument(\"--window-position=0,0\")\n",
        "\n",
        "# Initialize the browser\n",
        "driver = helium.start_chrome(headless=False, options=chrome_options)\n",
        "\n",
        "# Set up screenshot callback\n",
        "def save_screenshot(memory_step: ActionStep, agent: CodeAgent) -> None:\n",
        "    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot\n",
        "    driver = helium.get_driver()\n",
        "    current_step = memory_step.step_number\n",
        "    if driver is not None:\n",
        "        for previous_memory_step in agent.memory.steps:  # Remove previous screenshots for lean processing\n",
        "            if isinstance(previous_memory_step, ActionStep) and previous_memory_step.step_number <= current_step - 2:\n",
        "                previous_memory_step.observations_images = None\n",
        "        png_bytes = driver.get_screenshot_as_png()\n",
        "        image = Image.open(BytesIO(png_bytes))\n",
        "        print(f\"Captured a browser screenshot: {image.size} pixels\")\n",
        "        memory_step.observations_images = [image.copy()]  # Create a copy to ensure it persists\n",
        "\n",
        "    # Update observations with current URL\n",
        "    url_info = f\"Current url: {driver.current_url}\"\n",
        "    memory_step.observations = (\n",
        "        url_info if memory_step.observations is None else memory_step.observations + \"\\n\" + url_info\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kcFffSVJm-g5"
      },
      "source": [
        "Now let's create our web automation agent:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XdQciC-tm-g5"
      },
      "outputs": [],
      "source": [
        "from smolagents import InferenceClientModel\n",
        "\n",
        "# Initialize the model\n",
        "model_id = \"meta-llama/Llama-3.3-70B-Instruct\"  # You can change this to your preferred model\n",
        "model = InferenceClientModel(model_id=model_id)\n",
        "\n",
        "# Create the agent\n",
        "agent = CodeAgent(\n",
        "    tools=[go_back, close_popups, search_item_ctrl_f],\n",
        "    model=model,\n",
        "    additional_authorized_imports=[\"helium\"],\n",
        "    step_callbacks=[save_screenshot],\n",
        "    max_steps=20,\n",
        "    verbosity_level=2,\n",
        ")\n",
        "\n",
        "# Import helium for the agent\n",
        "agent.python_executor(\"from helium import *\", agent.state)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4IVaEozSm-g6"
      },
      "source": [
        "The agent needs instructions on how to use Helium for web automation. Here are the instructions we'll provide:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VlvuP6rem-g6"
      },
      "outputs": [],
      "source": [
        "helium_instructions = \"\"\"\n",
        "You can use helium to access websites. Don't bother about the helium driver, it's already managed.\n",
        "We've already ran \"from helium import *\"\n",
        "Then you can go to pages!\n",
        "Code:\n",
        "go_to('github.com/trending')\n",
        "```<end_code>\n",
        "\n",
        "You can directly click clickable elements by inputting the text that appears on them.\n",
        "Code:\n",
        "click(\"Top products\")\n",
        "```<end_code>\n",
        "\n",
        "If it's a link:\n",
        "Code:\n",
        "click(Link(\"Top products\"))\n",
        "```<end_code>\n",
        "\n",
        "If you try to interact with an element and it's not found, you'll get a LookupError.\n",
        "In general stop your action after each button click to see what happens on your screenshot.\n",
        "Never try to login in a page.\n",
        "\n",
        "To scroll up or down, use scroll_down or scroll_up with as an argument the number of pixels to scroll from.\n",
        "Code:\n",
        "scroll_down(num_pixels=1200) # This will scroll one viewport down\n",
        "```<end_code>\n",
        "\n",
        "When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).\n",
        "Just use your built-in tool `close_popups` to close them:\n",
        "Code:\n",
        "close_popups()\n",
        "```<end_code>\n",
        "\n",
        "You can use .exists() to check for the existence of an element. For example:\n",
        "Code:\n",
        "if Text('Accept cookies?').exists():\n",
        "    click('I accept')\n",
        "```<end_code>\n",
        "\"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FBr7mYAAm-g6"
      },
      "source": [
        "Now we can run our agent with a task! Let's try finding information on Wikipedia:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OvRfT4oWm-g6"
      },
      "outputs": [],
      "source": [
        "search_request = \"\"\"\n",
        "Please navigate to https://en.wikipedia.org/wiki/Chicago and give me a sentence containing the word \"1992\" that mentions a construction accident.\n",
        "\"\"\"\n",
        "\n",
        "agent_output = agent.run(search_request + helium_instructions)\n",
        "print(\"Final output:\")\n",
        "print(agent_output)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Utp3NWsBm-g6"
      },
      "source": [
        "You can run different tasks by modifying the request. For example, here's for me to know if I should work harder:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "S7z5mWCQm-g6"
      },
      "outputs": [],
      "source": [
        "github_request = \"\"\"\n",
        "I'm trying to find how hard I have to work to get a repo in github.com/trending.\n",
        "Can you navigate to the profile for the top author of the top trending repo, and give me their total number of commits over the last year?\n",
        "\"\"\"\n",
        "\n",
        "agent_output = agent.run(github_request + helium_instructions)\n",
        "print(\"Final output:\")\n",
        "print(agent_output)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3E520frtm-g6"
      },
      "source": [
        "The system is particularly effective for tasks like:\n",
        "- Data extraction from websites\n",
        "- Web research automation\n",
        "- UI testing and verification\n",
        "- Content monitoring"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}