What's next for Browser Agents? 🤔
I've been tinkering with browser automation recently (e.g., building a bot to search and buy on Amazon), and Operator’s release got me thinking about the future of these tools.
Here are 3 key challenges browser agents face today:
1️⃣ Moving from text-only to multi-modal AI models.
2️⃣ Solving authentication without being mistaken for bad bots.
3️⃣ Enabling human-in-the-loop collaboration that's seamless and smart.
In this post we unpack these challenges, share insights, and explore what’s next for browser agents. Would you trust browser agents with your day-to-day tasks? Let me know your thoughts! 👇
ChatGPT Operator is out -- what's next for browser agents?
AI start-ups building browser agents must be losing sleep over Operator’s release 😱. We recently tinkered with browser agents ourselves, automating a search on Amazon and the purchase of an item. We’ve also watched some of the Operator demos, and here are our three broad takeaways:
1/ The future of browser agents is multi-modal models
I tried building my Amazon browser agent with a “unimodal”, text-only LLM to really grok this point. (For the tech savvy: I used Browserbase as my headless browser to automate browser tasks with code.) Because a text-only LLM doesn’t understand web navigation, I had to figure out the exact structure of the webpage I wanted to automate and spoon-feed it to the model. Not only would a developer need to do this for every website, they’d need to revisit it every time a website’s structure changes. To make matters worse, my context window (the amount of data the LLM can hold in a given conversation) was constantly saturated by the sheer volume of HTML retrieved. I had to resort to all sorts of hacks, especially on websites like Amazon: filtering for specific tags, converting the HTML to plain text or Markdown, and so on (a concrete sketch follows below).

Operator, by contrast, is multi-modal, meaning it was trained on and processes both text and visual data. It can navigate a webpage dynamically and process elements visually rather than rely exclusively on verbose HTML dumps. There are a few other contenders in this space: I was able to successfully star a GitHub repository using the open-source `browser-use` framework, for example (star them on GitHub!). The GIF below shows how it interprets a web page’s layout visually. The accuracy of such models is improving fast, even though on some benchmarks they still score below 58% (see WebArena’s leaderboard).
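To make the pain of the text-only approach concrete, here is roughly what those hacks looked like. This is a minimal sketch using a local Playwright launch plus BeautifulSoup rather than our exact Browserbase setup, and the Amazon selector is illustrative:

```python
# A sketch of the HTML-trimming hacks described above. We used Browserbase
# as the headless browser; a local Playwright launch is substituted here to
# keep the example self-contained. The Amazon selector is illustrative.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def trimmed_page_text(url: str) -> str:
    """Fetch a page and strip it down to something an LLM context can hold."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")

    # Drop tags that add bulk but carry no meaning for the LLM.
    for tag in soup(["script", "style", "svg", "noscript", "iframe"]):
        tag.decompose()

    # Keep only the region we care about. This selector is site-specific
    # and breaks whenever the markup changes -- which is the core problem.
    region = soup.select_one("div.s-main-slot") or soup.body

    # Collapse to plain text to save even more tokens.
    return region.get_text(separator="\n", strip=True)

print(trimmed_page_text("https://www.amazon.com/s?k=usb-c+cable")[:2000])
```

Every selector in that snippet is a liability: the moment the site’s markup changes, the agent silently breaks.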
Multi-modal AI models are able to navigate the web based on their ability to interpret website content visually as well as textually.
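For comparison, a multi-modal framework needs almost no site-specific code. Here is a sketch along the lines of `browser-use`’s documented quick-start (the exact API may have changed since we tried it):

```python
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    # The agent pairs a multi-modal LLM with a browser: it decides what to
    # click by looking at the rendered page rather than relying on
    # hand-written selectors for each site.
    agent = Agent(
        task="Open github.com/browser-use/browser-use and star the repository",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()

asyncio.run(main())
```

No selectors, no HTML dumps: the model works out the page layout on its own.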
2/ There’s currently little to nothing distinguishing AI web agents (good bots!) from automated fraud (bad bots!)
In our attempts at building browser agents, we got blocked at various points in the browsing session. Occasionally we were blocked right at the start of a session and had to resort to proxies and other incognito methods. Incidentally, a demo shared on X suggests that Operator occasionally struggles with this as well. More importantly, we could not find a way to load or fill in the fields on login pages, even after having the browser agent hand the session over to us ☠️. Based on the demos we’ve seen, Operator solves authentication for some providers by handing control over to the user to complete a login (e.g. Booking.com, Thumbtack or Google Calendar access). Based on the press release, OpenAI is relying on bespoke partnerships with those domains to identify Operator-managed browsing sessions and allow them to proceed "while respecting established norms". So far, none of the other multi-modal frameworks we’ve seen have an easy answer to this problem.
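For the curious, those “incognito methods” look something like the following Playwright sketch. The proxy address and user agent are placeholders, and we’re not suggesting this as a sustainable answer:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Route traffic through a proxy and mimic a "normal" browser profile.
    # The proxy server and user agent below are placeholders.
    browser = p.chromium.launch(
        headless=False,  # headless browsers are a common bot-detection signal
        proxy={"server": "http://proxy.example.com:8080"},
    )
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1440, "height": 900},
        locale="en-US",
    )
    page = context.new_page()
    page.goto("https://www.amazon.com")
    print(page.title())
    browser.close()
```

That a legitimate agent has to dress up like a bad bot to get anything done is exactly the problem.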
The ecosystem needs a reliable security standard to establish this handshake between browser agents and websites. We envision something like an adapted version of delegated OAuth; done right, it would level the playing field for startups and make browser agents safer.
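To be explicit about what we’re imagining, here is a purely hypothetical sketch of that handshake. Every endpoint, header, and scope below is invented to illustrate the shape of the flow; no such standard exists today:

```python
# Purely hypothetical: what an OAuth-style "agent handshake" could look like.
# None of these endpoints, headers, or scopes exist today; this sketches the
# standard we wish existed, not any real API.
import requests

AGENT_CLIENT_ID = "acme-browser-agent"     # issued to the agent vendor by the website
AGENT_CLIENT_SECRET = "REPLACE_ME"         # kept server-side by the agent vendor

# 1) The user delegates a narrow scope to the agent, much like today's
#    "Sign in with ..." flows, and the site issues a short-lived agent token.
token = requests.post(
    "https://shop.example.com/oauth/agent-token",  # hypothetical endpoint
    data={
        "grant_type": "agent_delegation",
        "client_id": AGENT_CLIENT_ID,
        "client_secret": AGENT_CLIENT_SECRET,
        "scope": "browse search checkout:confirm-with-user",
        "user_consent_code": "code-from-user-approval-step",
    },
    timeout=10,
).json()["access_token"]

# 2) The agent's browser sends the token with every request, so the site can
#    tell a declared, accountable agent from an anonymous scraper.
session = requests.Session()
session.headers["Agent-Authorization"] = f"Bearer {token}"  # hypothetical header
resp = session.get("https://shop.example.com/search?q=usb-c+cable", timeout=10)
print(resp.status_code)
```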
3/ We’re missing a structured way to handle the back and forth between a web agent and a user

While frameworks for building AI agents are starting to support human intervention (aka “human in the loop”), none of the well-known browser agent frameworks offer a structured way to handle the back and forth between the agent and the human user. Operator definitely stands out in that regard, as you can see in this flight-booking example. It’s not clear, though, what the user experience would be if the user were not immediately available to guide the LLM. Would the agent be able to “save its progress” and resume once the user responds? Could the human pre-emptively define conditions under which the web agent should consult them, beyond the obvious ones like making a payment, e.g. “if you find any offers for travel insurance during the booking process, let me know so I can explore them before making a decision”?
Resilient and flexible human-in-the-loop support needs to be a core feature of browser agents, so that users keep as much control as possible over their agents’ actions.
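As a thought experiment, here is a minimal sketch of the primitive we’d like browser agent frameworks to expose: user-defined interrupt rules plus a checkpoint the agent can save and resume from. All names and structure here are invented for illustration:

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical primitives -- invented for illustration, not from any framework.

@dataclass
class Checkpoint:
    """Everything needed to pause an agent run and resume it later."""
    task: str
    history: list[str] = field(default_factory=list)

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump({"task": self.task, "history": self.history}, f)

    @staticmethod
    def load(path: str) -> "Checkpoint":
        with open(path) as f:
            data = json.load(f)
        return Checkpoint(task=data["task"], history=data["history"])

@dataclass
class InterruptRule:
    """A user-defined condition under which the agent must consult the human."""
    description: str
    triggered_by: Callable[[str], bool]  # inspects the agent's next action

def run_with_human_in_the_loop(
    checkpoint: Checkpoint,
    next_action: Callable[[Checkpoint], str],
    rules: list[InterruptRule],
) -> None:
    while True:
        action = next_action(checkpoint)
        if action == "DONE":
            return
        for rule in rules:
            if rule.triggered_by(action):
                # Pause durably: the user may answer seconds or hours later,
                # and the run can be resumed from Checkpoint.load("run.json").
                checkpoint.save("run.json")
                print(f"Paused ({rule.description}); about to: {action}")
                return
        checkpoint.history.append(action)

# Rules the user defines up front, beyond the obvious payment case:
rules = [
    InterruptRule("confirm any payment", lambda a: "pay" in a.lower()),
    InterruptRule(
        "surface travel-insurance offers",
        lambda a: "insurance" in a.lower(),
    ),
]
```

With something like this, “save its progress” stops being a UX nicety and becomes part of the agent’s contract with its user.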
How do you feel about browser agents? Would you trust them? Would you use them more broadly to assist you in your day-to-day life? Join us on Discord and share your thoughts 🙏
Join the conversation
- Join the conversation on our Discord channel.
- Watch us embarrass ourselves on our YouTube channel.
- Follow us on Product Hunt.