The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents

Latent Space: The AI Engineer Podcast

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

0:00

-1:11:09

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

Anthropic recently scored a huge win on OpenAI's turf by achieving SOTA on -their- SWE-Bench Verified benchmark, using an upgraded Claude 3.5 Sonnet. For the first time, they spill the beans.

Nov 28, 2024

We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate!

We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!

The vibe shift we observed in July - in favor of Claude 3.5 Sonnet, first introduced in June — has been remarkably long lived and persistent, surviving multiple subsequent updates of 4o, o1 and Gemini versions, for Anthropic’s Claude to end 2024 as the preferred model for AI Engineers and even being the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month1.

Menlo estimates Anthropic market share doubled this year, mostly due to Claude 3.5 Sonnet — though note that Menlo are Anthropic investors

Anthropic has now raised an additional $4b from Amazon and made an incredibly well received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors:

Annotated Claude 3→3.5 chart from Anthropic, from AINews. 3.5 Opus numbers are upper/lower bound extrapolations from Haiku/Sonnet improvements.

Solving SWE-Bench

As part of the October Sonnet release, Anthropic teased a blink-and-you’ll miss it result:

The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.

This was followed up by a blogpost a week later from today’s guest, Erik Schluntz, the engineer who implemented and scored this SOTA2 result using a simple, non-overengineered version of the SWE-Agent framework (you can see the submissions here). We have previously covered the SWE-Bench story extensively:

Speaking with SWEBench/SWEAgent authors at ICLR
Speaking with Cosine Genie, the previous SOTA (43.8%) on SWEBench Verified (with brief update at DevDay 2024)
Speaking with Shunyu Yao on SWEBench and the ReAct paradigm driving SWE-Agent

One of the notable inclusions in this blogpost are the tools that Erik decided to give Claude, e.g. the “Edit Tool”:

https://x.com/antonosika/status/1848845650697330826

The tools teased in the SWEBench submission/blogpost were then polished up and released with Computer Use…

And you can also see even more computer use tools given in the new Model Context Protocol servers:

Claude Computer Use

Because it is one of the best received AI releases of the year, we recommend watching the 2 minute Computer Use intro (and related demos) in its entirety:

https://x.com/AnthropicAI/status/1848742747626226146

Eric also worked on Claude’s function calling, tool use, and computer use APIs, so we discuss that in the episode.

Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.

As you’ll see, this is very top of mind for Erik as a former Robotics founder who’s company basically used robots to interface with human physical systems like elevators.

Full Video episode

Please like and subscribe!

Show Notes

Timestamps

[00:00:00] Introductions
[00:03:39] What is SWE-Bench?
[00:12:22] SWE-Bench vs HumanEval vs others
[00:15:21] SWE-Agent architecture and runtime
[00:21:18] Do you need code indexing?
[00:24:50] Giving the agent tools
[00:27:47] Sandboxing for coding agents
[00:29:16] Why not write tests?
[00:30:31] Redesigning engineering tools for LLMs
[00:35:53] Multi-agent systems
[00:37:52] Why XML so good?
[00:42:57] Thoughts on agent frameworks
[00:45:12] How many turns can an agent do?
[00:47:12] Using multiple model types
[00:51:40] Computer use and agent use cases
[00:59:04] State of AI robotics
[01:04:24] Robotics in manufacturing
[01:05:01] Hardware challenges in robotics
[01:09:21] Is self-driving a good business?

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners. And today we're in the new studio with my usual co-host, Shawn from Smol AI.

Swyx [00:00:14]: Hey, and today we're very blessed to have Erik Schluntz from Anthropic with us. Welcome.

Erik [00:00:19]: Hi, thanks very much. I'm Erik Schluntz. I'm a member of technical staff at Anthropic, working on tool use, computer use, and Swebench.

Swyx [00:00:27]: Yeah. Well, how did you get into just the whole AI journey? I think you spent some time at SpaceX as well? Yeah. And robotics. Yeah. There's a lot of overlap between like the robotics people and the AI people, and maybe like there's some interlap or interest between language models for robots right now. Maybe just a little bit of background on how you got to where you are. Yeah, sure.

Erik [00:00:50]: I was at SpaceX a long time ago, but before joining Anthropic, I was the CTO and co-founder of Cobalt Robotics. We built security and inspection robots. These are sort of five foot tall robots that would patrol through an office building or a warehouse looking for anything out of the ordinary. Very friendly, no tasers or anything. We would just sort of call a remote operator if we saw anything. We have about 100 of those out in the world, and had a team of about 100. We actually got acquired about six months ago, but I had left Cobalt about a year ago now, because I was starting to get a lot more excited about AI. I had been writing a lot of my code with things like Copilot, and I was like, wow, this is actually really cool. If you had told me 10 years ago that AI would be writing a lot of my code, I would say, hey, I think that's AGI. And so I kind of realized that we had passed this level, like, wow, this is actually really useful for engineering work. That got me a lot more excited about AI and learning about large language models. So I ended up taking a sabbatical and then doing a lot of reading and research myself and decided, hey, I want to go be at the core of this and joined Anthropic.

Alessio [00:01:53]: And why Anthropic? Did you consider other labs? Did you consider maybe some of the robotics companies?

Erik [00:02:00]: So I think at the time I was a little burnt out of robotics, and so also for the rest of this, any sort of negative things I say about robotics or hardware is coming from a place of burnout, and I reserve my right to change my opinion in a few years. Yeah, I looked around, but ultimately I knew a lot of people that I really trusted and I thought were incredibly smart at Anthropic, and I think that was the big deciding factor to come there. I was like, hey, this team's amazing. They're not just brilliant, but sort of like the most nice and kind people that I know, and so I just felt like I could be a really good culture fit. And ultimately, I do care a lot about AI safety and making sure that I don't want to build something that's used for bad purposes, and I felt like the best chance of that was joining Anthropic.

Alessio [00:02:39]: And from the outside, these labs kind of look like huge organizations that have these obscure

Swyx [00:02:44]: ways to organize.

Alessio [00:02:45]: How did you get, you joined Anthropic, did you already know you were going to work on of the stuff you publish or you kind of join and then you figure out where you land? I think people are always curious to learn more.

Erik [00:02:57]: Yeah, I've been very happy that Anthropic is very bottoms up and sort of very sort of receptive to whatever your interests are. And so I joined sort of being very transparent of like, hey, I'm most excited about code generation and AI that can actually go out and sort of touch the world or sort of help people build things. And, you know, those weren't my initial initial projects. I also came in and said, hey, I want to do the most valuable possible thing for this company and help Anthropic succeed. And, you know, like, let me find the balance of those. So I was working on lots of things at the beginning, you know, function calling, tool use. And then sort of as it became more and more relevant, I was like, oh, hey, like, let's it's time to go work on encoding agents and sort of started looking at SWE-Bench as sort of a really good benchmark for that.

Swyx [00:03:39]: So let's get right into SWE-Bench. That's one of the many claims to fame. I feel like there's just been a series of releases related with Cloud 3.5 Sonnet around about two or three months ago, 3.5 Sonnet came out and it was it was a step ahead in terms of a lot of people immediately fell in love with it for coding. And then last month you released a new updated version of Cloud Sonnet. We're not going to talk about the training for that because that's still confidential. But I think Anthropic's done a really good job, like applying the model to different things. So you took the lead on SWE-Bench, but then also we're going to talk a little bit about computer use later on. So maybe just give us a context about why you looked at SWE-Bench Verified and you actually came up with a whole system for building agents that would maximally use the model well. Yeah.

Erik [00:04:28]: So I'm on a sub team called Product Research. And basically the idea of product research is to really understand what end customers care about and want in the models and then work to try to make that happen. So we're not focused on sort of these more abstract general benchmarks like math problems or MMLU, but we really care about finding the things that are really valuable and making sure the models are great at those. And so because I've been interested in coding agents, I knew that this would be a really valuable thing. And I knew there were a lot of startups and our customers trying to build coding agents with our models. And so I said, hey, this is going to be a really good benchmark to be able to measure that and do well on it. And I wasn't the first person at Anthropic to find SWE-Bench, and there are lots of people that already knew about it and had done some internal efforts on it. It fell to me to sort of both implement the benchmark, which is very tricky, and then also to sort of make sure we had an agent and basically like a reference agent, maybe I'd call it, that could do very well on it. Ultimately, we want to provide how we implemented that reference agent so that people can build their own agents on top of our system and get sort of the most out of it as possible. So with this blog post we released on SWE-Bench, we released the exact tools and the prompt that we gave the model to be able to do well.

Swyx [00:05:46]: For people who don't know, who maybe haven't dived into SWE-Bench, I think the general perception is they're like tasks that a software engineer could do. I feel like that's an inaccurate description because it is basically, one, it's a subset of like 12 repos. It's everything they could find that every issue with like a matching commit that could be tested. So that's not every commit. And then SWE-Bench verified is further manually filtered by OpenAI. Is that an accurate description and anything you'd change about that? Yes.

Erik [00:06:14]: SWE-Bench is, it certainly is a subset of all tasks. It's first of all, it's only Python repos, so already fairly limited there. And it's just 12 of these popular open source repos. And yes, it's only ones where there were tests that passed at the beginning and also new tests that were introduced that test the new feature that's added. So it is, I think, a very limited subset of real engineering tasks. But I think it's also very valuable because even though it's a subset, it is true engineering tasks. And I think a lot of other benchmarks are really kind of these much more artificial setups of even if they're related to coding, they're more like coding interview style questions or puzzles that I think are very different from day-to-day what you end up doing. I don't know how frequently you all get to use recursion in your day-to-day job, but whenever I do, it's like a treat. And I think it's almost comical, and a lot of people joke about this in the industry, is how different interview questions are.

Swyx [00:07:13]: Dynamic programming. Yeah, exactly.

Erik [00:07:15]: Like, you code. From the day-to-day job. But I think one of the most interesting things about SWE-Bench is that all these other benchmarks are usually just isolated puzzles, and you're starting from scratch. Whereas SWE-Bench, you're starting in the context of an entire repository. And so it adds this entirely new dimension to the problem of finding the relevant files. And this is a huge part of real engineering, is it's actually pretty rare that you're starting something totally greenfield. You need to go and figure out where in a codebase you're going to make a change and understand how your work is going to interact with the rest of the systems. And I think SWE-Bench does a really good job of presenting that problem.

Alessio [00:07:51]: Why do we still use human eval? It's like 92%, I think. I don't even know if you can actually get to 100% because some of the data is not actually

Swyx [00:07:59]: solvable.

Alessio [00:08:00]: Do you see benchmarks like that, they should just get sunsetted? Because when you look at the model releases, it's like, oh, it's like 92% instead of like 89%, 90% on human eval versus, you know, SWE-Bench verified is you have 49%, right? Which is like, before 45% was state of the art, but maybe like six months ago it was like 30%, something like that. So is that a benchmark that you think is going to replace human eval, or do you think they're just going to run in parallel?

Erik [00:08:27]: I think there's still need for sort of many different varied evals. Like sometimes you do really care about just sort of greenfield code generation. And so I don't think that everything needs to go to sort of an agentic setup.

Swyx [00:08:39]: It would be very expensive to implement.

Erik [00:08:41]: The other thing I was going to say is that SWE-Bench is certainly hard to implement and expensive to run because each task, you have to parse, you know, a lot of the repo to understand where to put your code. And a lot of times you take many tries of writing code, running it, editing it. It can use a lot of tokens compared to something like human eval. So I think there's definitely a space for these more traditional coding evals that are sort of easy to implement, quick to run, and do get you some signal. Maybe hopefully there's just sort of harder versions of human eval that get created.

Alessio [00:09:14]: How do we get SWE-Bench verified to 92%? Do you think that's something where it's like line of sight to it, or it's like, you know, we need a whole lot of things to go right? Yeah, yeah.

Erik [00:09:23]: And actually, maybe I'll start with SWE-Bench versus SWE-Bench verified, which is I think something I missed earlier. So SWE-Bench is, as we described, this big set of tasks that were scraped.

Swyx [00:09:33]: Like 12,000 or something?

Erik [00:09:34]: Yeah, I think it's 2,000 in the final set. But a lot of those, even though a human did them, they're actually impossible given the information that comes with the task. The most classic example of this is the test looks for a very specific error string. You know, like assert message equals error, something, something, something. And unless you know that's exactly what you're looking for, there's no way the model is going to write that exact same error message, and so the tests are going to fail. So SWE-Bench verified was actually made in partnership with OpenAI, and they hired humans to go review all these tasks and pick out a subset to try to remove any obstacle like this that would make the tasks impossible. So in theory, all of these tasks should be fully doable by the model. And they also had humans grade how difficult they thought the problems would be. Between less than 15 minutes, I think 15 minutes to an hour, an hour to four hours, and greater than four hours. So that's kind of this interesting sort of how big the problem is as well. To get to SWE-Bench verified to 90%, actually, maybe I'll also start off with some of the remaining failures that I see when running our model on SWE-Bench. I'd say the biggest cases are the model sort of operates at the wrong level of abstraction. And what I mean by that is the model puts in maybe a smaller band-aid when really the task is asking for a bigger refactor. And some of those, you know, is the model's fault, but a lot of times if you're just sort of seeing the GitHub issue, it's not exactly clear which way you should do. So even though these tasks are possible, there's still some ambiguity in how the tasks are described. That being said, I think in general, language models frequently will produce a smaller diff when possible, rather than trying to do a big refactor. I think another area, at least the agent we created, didn't have any multimodal abilities, even though our models are very good at vision. So I think that's just a missed opportunity. And if I read through some of the traces, there's some funny things where, especially the tasks on matplotlib, which is a graphing library, the test script will save an image and the model will just say, okay, it looks great, you know, without looking at it. So there's certainly extra juice to squeeze there of just making sure the model really understands all the sides of the input that it's given, including multimodal. But yeah, I think like getting to 92%. So this is something that I have not looked at, but I'm very curious about. I want someone to look at, like, what is the union of all of the different tasks that have been solved by at least one attempt at SWE-Bench Verified. There's a ton of submissions to the benchmark, and so I'd be really curious to see how many of those 500 tasks at least someone has solved. And I think, you know, there's probably a bunch that none of the attempts have ever solved. And I think it'd be interesting to look at those and say, hey, is there some problem with these? Like, are these impossible? Or are they just really hard and only a human could do them?

Swyx [00:12:22]: Yeah, like specifically, is there a category of problems that are still unreachable by any LLM agent? Yeah, yeah. And I think there definitely are.

Erik [00:12:28]: The question is, are those fairly inaccessible or are they just impossible because of the descriptions? But I think certainly some of the tasks, especially the ones that the human graders reviewed as like taking longer than four hours are extremely difficult. I think we got a few of them right, but not very many at all in the benchmark.

Swyx [00:12:49]: And did those take less than four hours?

Erik [00:12:51]: They certainly did less than, yeah, than four hours.

Swyx [00:12:54]: Is there a correlation of length of time with like human estimated time? You know what I mean? Or do we have sort of more of X paradox type situations where it's something super easy for a model, but hard for a human?

Erik [00:13:06]: I actually haven't done the stats on that, but I think that'd be really interesting to see of like how many tokens does it take and how is that correlated with difficulty? What is the likelihood of success with difficulty? I think actually a really interesting thing that I saw, one of my coworkers who was also working on this named Simon, he was focusing just specifically on the very hard problems, the ones that are said to take longer than four hours. And he ended up sort of creating a much more detailed prompt than I used. And he got a higher score on the most difficult subset of problems, but a lower score overall on the whole benchmark. And the prompt that I made, which is sort of much more simple and bare bones, got a higher score on the overall benchmark, but lower score on the really hard problems. And I think some of that is the really detailed prompt made the model sort of overcomplicate a lot of the easy problems, because honestly, a lot of the suite bench problems, they really do just ask for a bandaid where it's like, hey, this crashes if this is none, and really all you need to do is put a check if none. And so sometimes trying to make the model think really deeply, it'll think in circles and overcomplicate something, which certainly human engineers are capable of as well. But I think there's some interesting thing of the best prompt for hard problems might not be the best prompt for easy problems.

Alessio [00:14:19]: How do we fix that? Are you supposed to fix it at the model level? How do I know what prompt I'm supposed to use?

Swyx [00:14:25]: Yeah.

Erik [00:14:26]: And I'll say this was a very small effect size, and so I think this isn't worth obsessing over. I would say that as people are building systems around agents, I think the more you can separate out the different kinds of work the agent needs to do, the better you can tailor a prompt for that task. And I think that also creates a lot of like, for instance, if you were trying to make an agent that could both solve hard programming tasks, and it could just write quick test files for something that someone else had already made, the best way to do those two tasks might be very different prompts. I see a lot of people build systems where they first sort of have a classification, and then route the problem to two different prompts. And that's sort of a very effective thing, because one, it makes the two different prompts much simpler and smaller, and it means you can have someone work on one of the prompts without any risk of affecting the other tasks. So it creates like a nice separation of concerns. Yeah.

Alessio [00:15:21]: And the other model behavior thing you mentioned, they prefer to generate like shorter diffs. Why is that? Like, is there a way? I think that's maybe like the lazy model question that people have is like, why are you not just generating the whole code instead of telling me to implement it?

Swyx [00:15:36]: Are you saving tokens? Yeah, exactly. It's like conspiracy theory. Yeah. Yeah.

Erik [00:15:41]: Yeah. So there's two different things there. One is like the, I'd say maybe like doing the easier solution rather than the hard solution. And I'd say the second one, I think what you're talking about is like the lazy model is like when the model says like dot, dot, dot, code remains the same.

Swyx [00:15:52]: Code goes here. Yeah. I'm like, thanks, dude.

Erik [00:15:55]: But honestly, like that just comes as like people on the internet will do stuff like that. And like, dude, if you're talking to a friend and you ask them like to give you some example code, they would definitely do that. They're not going to reroll the whole thing. And so I think that's just a matter of like, you know, sometimes you actually do just, just want like the relevant changes. And so I think it's, this is something where a lot of times like, you know, the models aren't good at mind reading of like which one you want. So I think that like the more explicit you can be in prompting to say, Hey, you know, give me the entire thing, no, no elisions versus just give me the relevant changes. And that's something, you know, we want to make the models always better at following those kinds of instructions.

Swyx [00:16:32]: I'll drop a couple of references here. We're recording this like a day after Dario, Lex Friedman just dropped his five hour pod with Dario and Amanda and the rest of the crew. And Dario actually made this interesting observation that like, we actually don't want, we complain about models being too chatty in text and then not chatty enough in code. And so like getting that right is kind of a awkward bar because, you know, you, you don't want it to yap in its responses, but then you also want it to be complete in, in code. And then sometimes it's not complete. Sometimes you just want it to diff, which is something that Enthopic has also released with a, you know, like the, the fast edit stuff that you guys did. And then the other thing I wanted to also double back on is the prompting stuff. You said, you said it was a small effect, but it was a noticeable effect in terms of like picking a prompt. I think we'll go into suite agent in a little bit, but I kind of reject the fact that, you know, you need to choose one prompt and like have your whole performance be predicated on that one prompt. I think something that Enthopic has done really well is meta prompting, prompting for a prompt. And so why can't you just develop a meta prompt for, for all the other prompts? And you know, if it's a simple task, make a simple prompt, if it's a hard task, make a hard prompt. Obviously I'm probably hand-waving a little bit, but I will definitely ask people to try the Enthopic Workbench meta prompting system if they haven't tried it yet. I went to the Build Day recently at Enthopic HQ, and it's the closest I've felt to an AGI, like learning how to operate itself that, yeah, it's, it's, it's really magical.

Erik [00:17:57]: Yeah, no, Claude is great at writing prompts for Claude.

Swyx [00:18:00]: Right, so meta prompting. Yeah, yeah.

Erik [00:18:02]: The way I think about this is that humans, even like very smart humans still use sort of checklists and use sort of scaffolding for themselves. Surgeons will still have checklists, even though they're incredible experts. And certainly, you know, a very senior engineer needs less structure than a junior engineer, but there still is some of that structure that you want to keep. And so I always try to anthropomorphize the models and try to think about for a human sort of what is the equivalent. And that's sort of, you know, how I think about these things is how much instruction would you give a human with the same task? And do you, would you need to give them a lot of instruction or a little bit of instruction?

Alessio [00:18:36]: Let's talk about the agent architecture maybe. So first, runtime, you let it run until it thinks it's done or it reaches 200k context window.

Swyx [00:18:45]: How did you come up? What's up with that?

Erik [00:18:47]: Yeah.

Swyx [00:18:48]: Yeah.

Erik [00:18:49]: I mean, this, so I'd say that a lot of previous agent work built sort of these very hard coded and rigid workflows where the model is sort of pushed through certain flows of steps. And I think to some extent, you know, that's needed with smaller models and models that are less smart. But one of the things that we really wanted to explore was like, let's really give Claude the reins here and not force Claude to do anything, but let Claude decide, you know, how it should approach the problem, what steps it should do. And so really, you know, what we did is like the most extreme version of this is just give it some tools that it can call and it's able to keep calling the tools, keep thinking, and then yeah, keep doing that until it thinks it's done. And that's sort of the most, the most minimal agent framework that we came up with. And I think that works very well. I think especially the new Sonnet 3.5 is very, very good at self-correction, has a lot of like grit. Claude will try things that fail and then try, you know, come back and sort of try different approaches. And I think that's something that you didn't see in a lot of previous models. Some of the existing agent frameworks that I looked at, they had whole systems built to try to detect loops and see, oh, is the model doing the same thing, you know, more than three times, then we have to pull it out. And I think like the smarter the models are, the less you need that kind of extra scaffolding. So yeah, just giving the model tools and letting it keep sample and call tools until it thinks it's done was the most minimal framework that we could think of. And so that's what we did.

Alessio [00:20:18]: So you're not pruning like bad paths from the context. If it tries to do something, it fails. You just burn all these tokens.

Swyx [00:20:25]: Yes.

Erik [00:20:26]: I would say the downside of this is that this is sort of a very token expensive way to do

Swyx [00:20:29]: this. But still, it's very common to prune bad paths because models get stuck. Yeah.

Erik [00:20:35]: But I'd say that, yeah, 3.5 is not getting stuck as much as previous models. And so, yeah, we wanted to at least just try the most minimal thing. Now, I would say that, you know, this is definitely an area of future research, especially if we talk about these problems that are going to take a human more than four hours. Those might be things where we're going to need to go prune bad paths to let the model be able to accomplish this task within 200k tokens. So certainly I think there's like future research to be done in that area, but it's not necessary to do well on these benchmarks.

Swyx [00:21:06]: Another thing I always have questions about on context window things, there's a mini cottage industry of code indexers that have sprung up for large code bases, like the ones in SweetBench. You didn't need them? We didn't.

Erik [00:21:18]: And I think I'd say there's like two reasons for this. One is like SweetBench specific and the other is a more general thing. The more general thing is that I think Sonnet is very good at what we call agentic search. And what this basically means is letting the model decide how to search for something. It gets the results and then it can decide, should it keep searching or is it done? Does it have everything it needs? So if you read through a lot of the traces of the SweetBench, the model is calling tools to view directories, list out things, view files. And it will do a few of those until it feels like it's found the file where the bug is. And then it will start working on that file. And I think like, again, this is all, everything we did was about just giving Claude the full reins. So there's no hard-coded system. There's no search system that you're relying on getting the correct files into context. This just totally lets Claude do it.

Swyx [00:22:11]: Or embedding things into a vector database. Exactly. Oops. No, no.

Erik [00:22:17]: This is very, very token expensive. And so certainly, and it also takes many, many turns. And so certainly if you want to do something in a single turn, you need to do RAG and just push stuff into the first prompt.

Alessio [00:22:28]: And just to make it clear, it's using the Bash tool, basically doing LS, looking at files and then doing CAD for the following context. It can do that.

Erik [00:22:35]: But it's file editing tool also has a command in it called view that can view a directory. It's very similar to LS, but it just sort of has some nice sort of quality of life improvements. So I think it'll only do an LS sort of two directories deep so that the model doesn't get overwhelmed if it does this on a huge file. I would say actually we did more engineering of the tools than the overall prompt. But the one other thing I want to say about this agentic search is that for SWE-Bench specifically, a lot of the tasks are bug reports, which means they have a stack trace in them. And that means right in that first prompt, it tells you where to go. And so I think this is a very easy case for the model to find the right files versus if you're using this as a general coding assistant where there isn't a stack trace or you're asking it to insert a new feature, I think there it's much harder to know which files to look at. And that might be an area where you would need to do more of this exhaustive search where an agentic search would take way too long.

Swyx [00:23:33]: As someone who spent the last few years in the JS world, it'd be interesting to see SWE-Bench JS because these stack traces are useless because of so much virtualization that we do. So they're very, very disconnected with where the code problems are actually appearing.

Erik [00:23:50]: That makes me feel better about my limited front-end experience, as I've always struggled with that problem.

Swyx [00:23:55]: It's not your fault. We've gotten ourselves into a very, very complicated situation. And I'm not sure it's entirely needed. But if you talk to our friends at Vercel, they will say it is.

Erik [00:24:04]: I will say SWE-Bench just released SWE-Bench Multimodal, which I believe is either entirely JavaScript or largely JavaScript. And it's entirely things that have visual components of them.

Swyx [00:24:15]: Are you going to tackle that? We will see.

Erik [00:24:17]: I think it's on the list and there's interest, but no guarantees yet.

Swyx [00:24:20]: Just as a side note, it occurs to me that every model lab, including Enthopic, but the others as well, you should have your own SWE-Bench, whatever your bug tracker tool. This is a general methodology that you can use to track progress, I guess.

Erik [00:24:34]: Yeah, sort of running on our own internal code base.

Swyx [00:24:36]: Yeah, that's a fun idea.

Alessio [00:24:37]: Since you spend so much time on the tool design, so you have this edit tool that can make changes and whatnot. Any learnings from that that you wish the AI IDEs would take in? Is there some special way to look at files, feed them in?

Erik [00:24:50]: I would say the core of that tool is string replace. And so we did a few different experiments with different ways to specify how to edit a file. And string replace, basically, the model has to write out the existing version of the string and then a new version, and that just gets swapped in. We found that to be the most reliable way to do these edits. Other things that we tried were having the model directly write a diff, having the model fully regenerate files. That one is actually the most accurate, but it takes so many tokens, and if you're in a very big file, it's cost prohibitive. There's basically a lot of different ways to represent the same task. And they actually have pretty big differences in terms of model accuracy. I think Eider, they have a really good blog where they explore some of these different methods for editing files, and they post results about them, which I think is interesting. But I think this is a really good example of the broader idea that you need to iterate on tools rather than just a prompt. And I think a lot of people, when they make tools for an LLM, they kind of treat it like they're just writing an API for a computer, and it's sort of very minimal. It's sort of just the bare bones of what you'd need, and honestly, it's so hard for the models to use those. Again, I come back to anthropomorphizing these models. Imagine you're a developer, and you just read this for the very first time, and you're trying to use it. You can do so much better than just sort of the bare API spec of what you'd often see. Include examples in the description. Include really detailed explanations of how things work. And I think that, again, also think about what is the easiest way for the model to represent the change that it wants to make. For file editing, as an example, writing a diff is actually... Let's take the most extreme example. You want the model to literally write a patch file. I think patch files have at the very beginning numbers of how many total lines change. That means before the model has actually written the edit, it needs to decide how many numbers or how many lines are going to change.

Swyx [00:26:52]: Don't quote me on that.

Erik [00:26:54]: I think it's something like that, but I don't know if that's exactly the diff format. But you can certainly have formats that are much easier to express without messing up than others. And I like to think about how much human effort goes into designing human interfaces for things. It's incredible. This is entirely what FrontEnd is about, is creating better interfaces to kind of do the same things. And I think that same amount of attention and effort needs to go into creating agent computer interfaces.

Swyx [00:27:19]: It's a topic we've discussed, ACI or whatever that looks like. I would also shout out that I think you released some of these toolings as part of computer use as well. And people really liked it. It's all open source if people want to check it out. I'm curious if there's an environment element that complements the tools. So how do you... Do you have a sandbox? Is it just Docker? Because that can be slow or resource intensive. Do you have anything else that you would recommend?

Erik [00:27:47]: I don't think I can talk about sort of public details or about private details about how we implement our sandboxing. But obviously, we need to have sort of safe, secure, and fast sandboxes for training for the models to be able to practice writing code and working in an environment.

Swyx [00:28:03]: I'm aware of a few startups working on agent sandboxing. E2B is a close friend of ours that Alessio has led around in, but also I think there's others where they're focusing on snapshotting memory so that it can do time travel for debugging. Computer use where you can control the mouse or keyboard or something like that. Whereas here, I think that the kinds of tools that we offer are very, very limited to coding agent work cases like bash, edit, you know, stuff like that. Yeah.

Erik [00:28:30]: I think the computer use demo that we released is an extension of that. It has the same bash and edit tools, but it also has the computer tool that lets it get screenshots and move the mouse and keyboard. Yeah. So I definitely think there's sort of more general tools there. And again, the tools we released as part of SweetBench were, I'd say they're very specific for like editing files and doing bash, but at the same time, that's actually very general if you think about it. Like anything that you would do on a command line or like editing files, you can do with those tools. And so we do want those tools to feel like any sort of computer terminal work could be done with those same tools rather than making tools that were like very specific for SweetBench like run tests as its own tool, for instance. Yeah.

Swyx [00:29:15]: You had a question about tests.

Alessio [00:29:16]: Yeah, exactly. I saw there's no test writer tool. Is it because it generates the code and then you're running it against SweetBench anyway, so it doesn't really need to write the test or?

Swyx [00:29:26]: Yeah.

Erik [00:29:27]: So this is one of the interesting things about SweetBench is that the tests that the model's output is graded on are hidden from it. That's basically so that the model can't cheat by looking at the tests and writing the exact solution. And I'd say typically the model, the first thing it does is it usually writes a little script to reproduce the error. And again, most SweetBench tasks are like, hey, here's a bug that I found. I run this and I get this error. So the first thing the model does is try to reproduce that. So it's kind of been rerunning that script as a mini test. But yeah, sometimes the model will like accidentally introduce a bug that breaks some other tests and it doesn't know about that.

Alessio [00:30:05]: And should we be redesigning any tools? We kind of talked about this and like having more examples, but I'm thinking even things of like Q as a query parameter in many APIs, it's like easier for the model to like re-query than read the Q. I'm sure it learned the Q by this point, but like, is there anything you've seen like building this where it's like, hey, if I were to redesign some CLI tools, some API tool, I would like change the way structure to make it better for LLMs?

Erik [00:30:31]: I don't think I've thought enough about that off the top of my head, but certainly like just making everything more human friendly, like having like more detailed documentation and examples. I think examples are really good in things like descriptions, like so many, like just using the Linux command line, like how many times I do like dash dash help or look at the man page or something. It's like, just give me one example of like how I actually use this. Like I don't want to go read through a hundred flags. Just give me the most common example. But again, so you know, things that would be useful for a human, I think are also very useful for a model.

Swyx [00:31:03]: Yeah. I mean, there's one thing that you cannot give to code agents that is useful for human is this access to the internet. I wonder how to design that in, because one of the issues that I also had with just the idea of a suite bench is that you can't do follow up questions. You can't like look around for similar implementations. These are all things that I do when I try to fix code and we don't do that. It's not, it wouldn't be fair, like it'd be too easy to cheat, but then also it's kind of not being fair to these agents because they're not operating in a real world situation. Like if I had a real world agent, of course I'm giving it access to the internet because I'm not trying to pass a benchmark. I don't have a question in there more, more just like, I feel like the most obvious tool access to the internet is not being used.

Erik [00:31:47]: I think that that's really important for humans, but honestly the models have so much general knowledge from pre-training that it's, it's like less important for them. I feel like versioning, you know, if you're working on a newer thing that was like, they came after the knowledge cutoff, then yes, I think that's very important. I think actually this, this is like a broader problem that there is a divergence between Sweebench and like what customers will actually care about who are working on a coding agent for real use. And I think one of those there is like internet access and being able to like, how do you pull in outside information? I think another one is like, if you have a real coding agent, you don't want to have it start on a task and like spin its wheels for hours because you gave it a bad prompt. You want it to come back immediately and ask follow up questions and like really make sure it has a very detailed understanding of what to do, then go off for a few hours and do work. So I think that like real tasks are going to be much more interactive with the agent rather than this kind of like one shot system. And right now there's no benchmark that, that measures that. And maybe I think it'd be interesting to have some benchmark that is more interactive. I don't know if you're familiar with TauBench, but it's a, it's a customer service benchmark where there's basically one LLM that's playing the user or the customer that's getting support and another LLM that's playing the support agent and they interact and try to resolve the issue.

Swyx [00:33:08]: Yeah. We talked to the LMSIS guys. Awesome. And they also did MTBench for people listening along. So maybe we need MTSWE-Bench. Sure. Yeah.

Erik [00:33:16]: So maybe, you know, you could have something where like before the SWE-Bench task starts, you have like a few back and forths with kind of like the, the author who can answer follow up questions about what they want the task to do. And of course you'd need to do that where it doesn't cheat and like just get the exact, the exact thing out of the human or out of the sort of user. But I think that would be a really interesting thing to see. If you look at sort of existing agent work, like a Repl.it's coding agent, I think one of the really great UX things they do is like first having the agent create a plan and then having the human approve that plan or give feedback. I think for agents in general, like having a planning step at the beginning, one, just having that plan will improve performance on the downstream task just because it's kind of like a bigger chain of thought, but also it's just such a better UX. It's way easier for a human to iterate on a plan with a model rather than iterating on the full task that sort of has a much slower time through each loop. If the human has approved this implementation plan, I think it makes the end result a lot more sort of auditable and trustable. So I think there's a lot of things sort of outside of SweetBench that will be very important for real agent usage in the world. Yeah.

Swyx [00:34:27]: I will say also, there's a couple of comments on names that you dropped. Copilot also does the plan stage before it writes code. I feel like those approaches have generally been less Twitter successful because it's not prompt to code, it's prompt plan code. You know, so there's a little bit of friction in there, but it's not much. Like it's, it actually, it's, it, you get a lot for what it's worth. I also like the way that Devin does it, where you can sort of edit the plan as it goes along. And then the other thing with Repl.it, we had a, we hosted a sort of dev day pregame with Repl.it and they also commented about multi-agents. So like having two agents kind of bounce off of each other. I think it's a similar approach to what you're talking about with kind of the few shot example, just as in the prompts of clarifying what the agent wants. But typically I think this would be implemented as a tool calling another agent, like a sub-agent I don't know if you explored that, do you like that idea?

Erik [00:35:20]: I haven't explored this enough, but I've definitely heard of people having good success with this. Of almost like basically having a few different sort of personas of agents, even if they're all the same LLM. I think this is one thing with multi-agent that a lot of people will kind of get confused by is they think it has to be different models behind each thing. But really it's sort of usually the same, the same model with different prompts. And yet having one, having them have different personas to kind of bring different sort of thoughts and priorities to the table. I've seen that work very well and sort of create a much more thorough and thought out

Swyx [00:35:53]: response.

Erik [00:35:53]: I think the downside is just that it adds a lot of complexity and it adds a lot of extra tokens. So I think it depends what you care about. If you want a plan that's very thorough and detailed, I think it's great. If you want a really quick, just like write this function, you know, you probably don't want to do that and have like a bunch of different calls before it does this.

Alessio [00:36:11]: And just talking about the prompt, why are XML tags so good in Cloud? I think initially people were like, oh, maybe you're just getting lucky with XML. But I saw obviously you use them in your own agent prompts, so they must work. And why is it so model specific to your family?

Erik [00:36:26]: Yeah, I think that there's, again, I'm not sure how much I can say, but I think there's historical reasons that internally we've preferred XML. I think also the one broader thing I'll say is that if you look at certain kinds of outputs, there is overhead to outputting in JSON. If you're trying to output code in JSON, there's a lot of extra escaping that needs to be done, and that actually hurts model performance across the board. Versus if you're in just a single XML tag, there's none of that sort of escaping that

Swyx [00:36:58]: needs to happen.

Erik [00:36:58]: That being said, I haven't tried having it write HTML and XML, which maybe then you start running into weird escaping things there. I'm not sure. But yeah, I'd say that's some historical reasons, and there's less overhead of escaping.

Swyx [00:37:12]: I use XML in other models as well, and it's just a really nice way to make sure that the thing that ends is tied to the thing that starts. That's the only way to do code fences where you're pretty sure example one start, example one end, that is one cohesive unit.

Alessio [00:37:30]: Because the braces are nondescriptive. Yeah, exactly.

Swyx [00:37:33]: That would be my simple reason. XML is good for everyone, not just Cloud. Cloud was just the first one to popularize it, I think.

Erik [00:37:39]: I do definitely prefer to read XML than read JSON.

Alessio [00:37:43]: Any other details that are maybe underappreciated? I know, for example, you had the absolute paths versus relative. Any other fun nuggets?

Erik [00:37:52]: I think that's a good sort of anecdote to mention about iterating on tools. Like I said, spend time prompt engineering your tools, and don't just write the prompt, but write the tool, and then actually give it to the model and read a bunch of transcripts about how the model tries to use the tool. I think by doing that, you will find areas where the model misunderstands a tool or makes mistakes, and then basically change the tool to make it foolproof. There's this Japanese term, pokayoke, about making tools mistake-proof. You know, the classic idea is you can have a plug that can fit either way, and that's dangerous, or you can make it asymmetric so that it can't fit this way, it has to go like this, and that's a better tool because you can't use it the wrong way. So for this example of absolute paths, one of the things that we saw while testing these tools is, oh, if the model has done CD and moved to a different directory, it would often get confused when trying to use the tool because it's now in a different directory, and so the paths aren't lining up. So we said, oh, well, let's just force the tool to always require an absolute path, and then that's easy for the model to understand. It knows sort of where it is. It knows where the files are. And then once we have it always giving absolute paths, it never messes up even, like, no matter where it is because it just, if you're using an absolute path, it doesn't matter where

Swyx [00:39:13]: you are.

Erik [00:39:13]: So iterations like that, you know, let us make the tool foolproof for the model. I'd say there's other categories of things where we see, oh, if the model, you know, opens vim, like, you know, it's never going to return. And so the tool is stuck.

Swyx [00:39:28]: Did it get stuck? Yeah. Get out of vim. What?

Erik [00:39:31]: Well, because the tool is, like, it just text in, text out. It's not interactive. So it's not like the model doesn't know how to get out of vim. It's that the way that the tool is, like, hooked up to the computer is not interactive. Yes, I mean, there is the meme of no one knows how to get out of vim. You know, basically, we just added instructions in the tool of, like, hey, don't launch commands that don't return.

Swyx [00:39:54]: Yeah, like, don't launch vim.

Erik [00:39:55]: Don't launch whatever. If you do need to do something, you know, put an ampersand after it to launch it in the background. And so, like, just, you know, putting kind of instructions like that just right in the description for the tool really helps the model. And I think, like, that's an underutilized space of prompt engineering, where, like, people might try to do that in the overall prompt, but just put that in the tool itself so the model knows that it's, like, for this tool, this is what's relevant.

Swyx [00:40:20]: You said you worked on the function calling and tool use before you actually started this vBench work, right? Was there any surprises? Because you basically went from creator of that API to user of that API. Any surprises or changes you would make now that you have extensively dog-fooded in a state-of-the-art agent?

Erik [00:40:39]: I want us to make, like, maybe, like, a little bit less verbose SDK. I think some way, like, right now, it just takes, I think we sort of force people to do the best practices of writing out sort of these full JSON schemas, but it would be really nice if you could just pass in a Python function as a tool. I think that could be something nice.

Swyx [00:40:58]: I think that there's a lot of, like, Python- There's helper libraries. ... structure, you know. I don't know if there's anyone else that is specializing for Anthropic. Maybe Jeremy Howard's and Simon Willis's stuff. They all have Cloud-specific stuff that they are working on. Cloudette. Cloudette, exactly. I also wanted to spend a little bit of time with SuiteAgent. It seems like a very general framework. Like, is there a reason you picked it apart from it's the same authors as vBench, or?

Erik [00:41:21]: The main thing we wanted to go with was the same authors as vBench, so it just felt sort of like the safest, most neutral option. And it was, you know, very high quality. It was very easy to modify, to work with. I would say it also actually, their underlying framework is sort of this, it's like, you

Swyx [00:41:39]: know, think, act, observe.

Erik [00:41:40]: That they kind of go through this loop, which is like a little bit more hard-coded than what we wanted to do, but it's still very close. That's still very general. So it felt like a good match as sort of the starting point for our agent. And we had already sort of worked with and talked with the SWE-Bench people directly, so it felt nice to just have, you know, we already know the authors. This will be easy to work with.

Swyx [00:42:00]: I'll share a little bit of like, this all seems disconnected, but once you figure out the people and where they go to school, it all makes sense. So it's all Princeton. Yeah, the SWE-Bench and SuiteAgent.

Erik [00:42:11]: It's a group out of Princeton.

Swyx [00:42:12]: Yeah, and we had Shun Yu on the pod, and he came up with the React paradigm, and that's think, act, observe. That's all React. So they're all friends. Yep, yeah, exactly.

Erik [00:42:22]: And you know, if you actually read our traces of our submission, you can actually see like think, act, observe in our logs. And we just didn't even change the printing code. So it's like doing still function calls under the hood, and the model can do sort of multiple function calls in a row without thinking in between if it wants to. But yeah, so a lot of similarities and a lot of things we inherited from SuiteAgent just as a starting point for the framework.

Alessio [00:42:47]: Any thoughts about other agent frameworks? I think there's, you know, the whole gamut from very simple to like very complex.

Swyx [00:42:53]: Autogen, CooEI, LandGraph. Yeah, yeah.

Erik [00:42:56]: I think I haven't explored a lot of them in detail. I would say with agent frameworks in general, they can certainly save you some like boilerplate. But I think there's actually this like downside of making agents too easy, where you end up very quickly like building a much more complex system than you need. And suddenly, you know, instead of having one prompt, you have five agents that are talking to each other and doing a dialogue. And it's like, because the framework made that 10 lines to do, you end up building something that's way too complex. So I think I would actually caution people to like try to start without these frameworks if you can, because you'll be closer to the raw prompts and be able to sort of directly understand what's going on. I think a lot of times these frameworks also, by trying to make everything feel really magical, you end up sort of really hiding what the actual prompt and output of the model is, and that can make it much harder to debug. So certainly these things have a place, and I think they do really help at getting rid of boilerplate, but they come with this cost of obfuscating what's really happening and making it too easy to very quickly add a lot of complexity. So yeah, I would recommend people to like try it from scratch, and it's like not that bad.

Alessio [00:44:08]: Would you rather have like a framework of tools? Do you almost see like, hey, it's maybe easier to get tools that are already well curated, like the ones that you build, if I had an easy way to get the best tool from you, and

Swyx [00:44:21]: like you maintain the definition?

Alessio [00:44:22]: Or yeah, any thoughts on how you want to formalize tool sharing?

Erik [00:44:26]: Yeah, I think that's something that we're certainly interested in exploring, and I think there is space for sort of these general tools that will be very broadly applicable. But at the same time, most people that are building on these, they do have much more specific things that they're trying to do. You know, I think that might be useful for hobbyists and demos, but the ultimate end applications are going to be bespoke. And so we just want to make sure that the model's great at any tool that it uses. But certainly something we're exploring.

Alessio [00:44:52]: So everything bespoke, no frameworks, no anything.

Swyx [00:44:55]: Just for now, for now.

Erik [00:44:56]: Yeah, I would say that like the best thing I've seen is people building up from like, build some good util functions, and then you can use those as building blocks. Yeah, yeah.

Alessio [00:45:05]: I have a utils folder, or like all these scripts. My framework is like def, call, and tropic. And then I just put all the defaults.

Swyx [00:45:12]: Yeah, exactly. There's a startup hidden in every utils folder, you know? No, totally not. Like, if you use it enough, like it's a startup, you know? At some point. I'm kind of curious, is there a maximum length of turns that it took? Like, what was the longest run? I actually don't.

Erik [00:45:27]: I mean, it had basically infinite turns until it ran into a 200k context. I should have looked this up. I don't know. And so for some of those failed cases where it eventually ran out of context, I mean, it was over 100 turns. I'm trying to remember like the longest successful run, but I think it was definitely over 100 turns that some of the times.

Swyx [00:45:48]: Which is not that much. It's a coffee break. Yeah.

Erik [00:45:52]: But certainly, you know, these things can be a lot of turns. And I think that's because some of these things are really hard, where it's going to take, you know, many tries to do it. And if you think about like, think about a task that takes a human four hours to do. Think about how many different files you read, and like times you edit a file in four hours. That's a lot more than 100.

Alessio [00:46:10]: How many times you open Twitter because you get distracted. But if you had a lot more compute, what's kind of like the return on the extra compute now? So like, you know, if you had thousands of turns or like whatever, like how much better would it get?

Erik [00:46:23]: Yeah, this I don't know. And I think this is, I think sort of one of the open areas of research in general with agents is memory and sort of how do you have something that can do work beyond its context length where you're just purely appending. So you mentioned earlier things like pruning bad paths. I think there's a lot of interesting work around there. Can you just roll back but summarize, hey, don't go down this path? There be dragons. Yeah, I think that's very interesting that you could have something that that uses way more tokens without ever using at a time more than 200k. So I think that's very interesting. I think the biggest thing is like, can you make the model sort of losslessly summarize what it's learned from trying different approaches and bring things back? I think that's sort of the big challenge.

Swyx [00:47:11]: What about different models?

Alessio [00:47:12]: So you have Haiku, which is like, you know, cheaper. So you're like, well, what if I have a Haiku to do a lot of these smaller things and then put it back up?

Erik [00:47:20]: I think Cursor might have said that they actually have a separate model for file editing.

Swyx [00:47:25]: I'm trying to remember.

Erik [00:47:25]: I think they were on maybe the Lex Fridman podcast where they said they have a bigger model, like write what the code should be and then a different model, like apply it. So I think there's a lot of interesting room for stuff like that. Yeah, fast supply.

Swyx [00:47:37]: We actually did a pod with Fireworks that they worked with on. It's speculative decoding.

Erik [00:47:41]: But I think there's also really interesting things about like, you know, paring down input tokens as well, especially sometimes the models trying to read like a 10,000 line file. That's a lot of tokens. And most of it is actually not going to be relevant. I think it'd be really interesting to like delegate that to Haiku. Haiku read this file and just pull out the most relevant functions. And then, you know, Sonnet reads just those and you save 90% on tokens. I think there's a lot of really interesting room for things like that. And again, we were just trying to do sort of the simplest, most minimal thing and show that it works. I'm really hoping that people, sort of the agent community builds things like that on top of our models. That's, again, why we released these tools. We're not going to go and do lots more submissions to SWE-Bench and try to prompt engineer this and build a bigger system. We want people to like the ecosystem to do that on top of our models. But yeah, so I think that's a really interesting one.

Swyx [00:48:32]: It turns out, I think you did do 3.5 Haiku with your tools and it scored a 40.6. Yes.

Erik [00:48:38]: So it did very well. It itself is actually very smart, which is great. But we haven't done any experiments with this combination of the two models. But yeah, I think that's one of the exciting things is that how well Haiku 3.5 did on SWE-Bench shows that sort of even our smallest, fastest model is very good at sort of thinking agentically and working on hard problems. Like it's not just sort of for writing simple text anymore.

Alessio [00:49:02]: And I know you're not going to talk about it, but like Sonnet is not even supposed to be the best model, you know? Like Opus, it's kind of like we left it at three back in the corner intro. At some point, I'm sure the new Opus will come out. And if you had Opus Plus on it, that sounds very, very good.

Swyx [00:49:19]: There's a run with SuiteAgent plus Opus, but that's the official SWE-Bench guys doing it.

Erik [00:49:24]: That was the older, you know, 3.0.

Swyx [00:49:25]: You didn't do yours. Yeah. Okay. Did you want to? I mean, you could just change the model name.

Erik [00:49:31]: I think we didn't submit it, but I think we included it in our model card.

Swyx [00:49:35]: Okay.

Erik [00:49:35]: We included the score as a comparison. Yeah.

Swyx [00:49:38]: Yeah.

Erik [00:49:38]: And Sonnet and Haiku, actually, I think the new ones, they both outperformed the original Opus. Yeah. I did see that.

Swyx [00:49:44]: Yeah. It's a little bit hard to find. Yeah.

Erik [00:49:47]: It's not an exciting score, so we didn't feel like they need to submit it to the benchmark.

Swyx [00:49:52]: We can cut over to computer use if we're okay with moving on to topics on this, if anything else. I think we're good.

Erik [00:49:58]: I'm trying to think if there's anything else SWE-Bench related.

Swyx [00:50:02]: It doesn't have to be also just specifically SWE-Bench, but just your thoughts on building agents, because you are one of the few people that have reached this leaderboard on building a coding agent. This is the state of the art. It's surprisingly not that hard to reach with some good principles. Right. There's obviously a ton of low-hanging fruit that we covered. Your thoughts on if you were to build a coding agent startup, what next?

Erik [00:50:24]: I think the really interesting question for me, for all the startups out there, is this kind of divergence between the benchmarks and what real customers will want. So I'm curious, maybe the next time you have a coding agent startup on the podcast, you should ask them that. What are the differences that they're starting to make? Tomorrow.

Swyx [00:50:40]: Oh, perfect, perfect. Yeah.

Erik [00:50:41]: I'm actually very curious what they will see, because I also have seen, I feel like it's slowed down a little bit if I don't see the startups submitting to SWE-Bench that much anymore.

Swyx [00:50:52]: Because of the traces, the trace. So we had Cosign on, they had a 50-something on full, on SWE-Bench full, which is the hardest one, and they were rejected because they didn't want to submit their traces. Yep. IP, you know? Yeah, that makes sense, that makes sense. Actually, tomorrow we're talking to Bolt, which is a cloud customer. You guys actually published a case study with them. I assume you weren't involved with that, but they were very happy with Cloud. Cool. One of the biggest launches of the year. Yeah, totally. We actually happened to be sitting in Adept's former office. My take on this is Anthropic shipped Adept as a feature. It's still a beta feature, but yes. What was it like when you tried it for the first time? Was it obvious that Cloud had reached that stage where you could do computer use? It was somewhat of a surprise to me.

Erik [00:51:40]: I had been on vacation, and I came back, and everyone's like, computer use works. So it was this very exciting moment. After the first go to Google, I think I tried to have it play Minecraft or something, and it actually installed and opened Minecraft.

Swyx [00:51:54]: I was like, wow, this is pretty cool.

Erik [00:51:55]: So I was like, wow, yeah, this thing can actually use a computer. And certainly, it is still beta. There's certain things that it's not very good at yet. But I'm really excited, I think, most broadly, not just for new things that weren't possible before, but as a much lower friction way to implement tool use. One anecdote from my days at Cobalt Robotics, we wanted our robots to be able to ride elevators, to go between floors and fully cover a building. The first way that we did this was doing API integrations with the elevator companies. Some of them actually had APIs. We could send a request, and it would move the elevator. Each new company we did took six months to do,

Swyx [00:52:37]: because they were very slow.

Erik [00:52:39]: They didn't really care.

Swyx [00:52:40]: Or an elevator, not an API.

Erik [00:52:42]: Even installing, once we had it with the company, they would have to literally go install an API box on the elevator that we wanted to use, and that would sometimes take six months.

Swyx [00:52:51]: So very slow.

Erik [00:52:52]: And eventually, we're like, okay, this is slowing down all of our customer deployments. And I was like, what if we just add an arm to the robot? And I added this little arm that could literally go and press the elevator buttons, and we use computer vision to do this. And we could deploy that in a single day, and have the robot being able to use the elevators. At the same time, it was slower than the API. It wasn't quite as reliable. Sometimes it would miss, and it would have to try to press it again.

Swyx [00:53:20]: But it would get there.

Erik [00:53:20]: But it was slower and a little bit less reliable. And I kind of see this as an analogy to computer use, of anything you can do with computer use today, you could probably write tool use and integrate it with APIs.

Swyx [00:53:33]: It's up to the language model.

Erik [00:53:34]: But that's going to take a bunch of software engineering to write those integrations.

Swyx [00:53:38]: You have to do all this stuff.

Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.

Alessio [00:54:20]: Or farming on World of Warcraft.

Swyx [00:54:23]: Yes, or that.

Erik [00:54:23]: Just go computer use.

Alessio [00:54:25]: Very high-value use cases.

Swyx [00:54:27]: I always say about this, this is the oldest question in robotics or self-driving, which is, do you drive by vision or do you have special tools? And vision is the universal tool to claim all tools. There's trade-offs, but there's situations in which that will come. But this week's podcast, the one that we just put out, had Stan Polu from Dust saying that he doesn't see a future where it's the significant workhorse. I think there could be a separation between maybe the high-volume use cases. You want APIs. And then the long tail, you want computer use. I totally agree. Right?

Erik [00:55:00]: Or you'll start, you'll prototype something with computer use. And then, hey, this is working. Customers have adopted this feature. OK, let's go turn it into an API. And it'll be faster and use less tokens.

Swyx [00:55:11]: I'd be interested to see a computer use agent replace itself by figuring out the API and then just dropping out of the equation altogether.

Erik [00:55:20]: Yeah, that's really fun, actually.

Swyx [00:55:22]: If I was running an RPA company, you would have the RPA scripting. RPA, for people listening, is robotic process automation, where you would script things that always show up in sequence. So you don't have an LLM in the loop. And so basically what you need to do is train an LLM to code that script. And then you can naturally hand off from computer use to non-computer use.

Erik [00:55:43]: Or have some way to turn Claude's actions of computer use into a saved script that you can then run repeatedly.

Swyx [00:55:49]: Yeah, it'd be interesting to record that.

Alessio [00:55:50]: Why did you decide to not ship any sandbox harness for computer use? It's kind of like, hey, peace.

Swyx [00:55:58]: Run at your own risk. It's Docker, right?

Erik [00:55:59]: No, no, we launched it with, I think, a VM or Docker, a Docker as system.

Alessio [00:56:03]: But it's not for your actual computer, right? The Docker instance runs in the Docker. It's not for...

Swyx [00:56:10]: Yeah, it runs its own browser.

Erik [00:56:13]: I mean, the main reason for that, one, is sort of security. We don't want... The model can do anything. So we wanted to give it a sandbox, not have people do their own computer. At least sort of for our default experience. We really care about providing a nice sort of... Making the default safe, I think, is the best way for us to do it. And I mean, very quickly, people made modifications to let you run it on your own desktop. And that's fine.

Swyx [00:56:37]: Someone else can do that.

Erik [00:56:37]: But we don't want that to be the official, anthropic thing to run. I would say also, from a product perspective, right now, because this is sort of still in beta, I think a lot of the most useful use cases are... Like, a sandbox is actually what you want. You want something where, hey, it can't mess up anything in here. It only has what I gave it. Also, if it's using your computer, you know, you can't use your computer at the same time. I think you actually want it to have its own screen. It's like you and a person pair programming, but only on one laptop versus you have two laptops.

Swyx [00:57:07]: Everyone should totally have a side laptop where the computer uses... Cloud is just doing its thing. Yeah, yeah.

Erik [00:57:11]: I think it's such a better experience. Unless there's something very explicit you want it to do for you on your own computer.

Swyx [00:57:17]: It becomes like you're sort of shelling into a remote machine and, you know, maybe checking in on it every now and then. Like, I have fond memories of... Half our audience is going to be too young to remember this, but Citrix desktop experience, like, you were sort of remote into a machine that someone else was operating. And for a long time, that would be how you did, like, enterprise computing. Yeah, yeah. It's coming back. Any other implications of computer use? You know, is it a fun demo or is it, like, the future of Anthropic? I'm very excited about it.

Erik [00:57:50]: I think that, like, there's a lot of sort of very repetitive work that, like, computer use will be great for. I think I've seen some examples of people build, like, coding agents that then also, like, test the front end that they made. So I think it's very cool to, like, use computer use to be able to close the loop on a lot of things that right now just a terminal-based agent can't do. So I think that's very exciting.

Swyx [00:58:11]: It's kind of like end-to-end testing. Exactly. Yeah, yeah.

Erik [00:58:14]: The end sort of front-end and web testing is something I'm very excited about.

Swyx [00:58:18]: Yeah, I've seen Amanda also talking... This would be Amanda Askell, the head of Cloud Character. She goes on a lunch break and it generates, you know, research ideas for her. Giving it a name like computer use is very practical. It's like you're supposed to do things, but maybe sometimes it's not about doing things, it's about thinking. And thinking... In the process of thinking, you're using the computer. In some way that's, you know, solving SweetBench, like, you should be allowed to use the internet or you should be allowed to use a computer to solve it and use your vision and use whatever. Like, we're just sort of shackling it with all these restrictions just because we want to play nice for a benchmark. But really, you know, a full AI will be able to do all these things. To think. Yeah, we'll definitely be able to. To reason. To Google and search for things.

Erik [00:58:58]: Yeah, yeah. Pull down inspiration.

Alessio [00:59:00]: Can we just do a... before we wrap, a robotics corner?

Swyx [00:59:03]: Oh, yeah, yeah.

Alessio [00:59:04]: People are always curious, especially with somebody that is not trying to hype their own company. What's the state of AI robotics? Under-hyped, over-hyped?

Erik [00:59:12]: Yeah, and I'll say, like, these are my opinions, not Anthropic's. And again, coming from a place of a burned-out robotics founder, so take everything with a grain of salt. I would say on the positives, like, there is really sort of incredible progress that's happened in the last five years that I think will be a big unlock for robotics. The first is just general purpose language models. I mean, there was an old saying in robotics that if to fully describe your task is harder than to just do the task, you can never automate it. Because, like, it's going to take more effort to even tell the robot how to do this thing than to me just do it itself. LLM solved that. I no longer need to go exhaustively program in every little thing I could do. The thing just has common sense. And it's going to know, how do I make a Reuben sandwich? I'm not going to have to go program that in. Whereas before, like, the idea of even, like, a cooking thing, it's like, oh god, like, we're gonna have the team of engineers that are hard coding recipes for the long tail of anything. It would be a disaster. So I think that's one thing, is that bringing common sense really is, like, solves this huge problem of describing tasks. The second big innovation has been diffusion models for path planning. A lot of this work came out of Toyota Research. There's a lot of startups now that are working on this, like Physical Intelligence Pi, Chelsea Finn's startup out of Stanford. And the basic idea here is using a little bit of the, I'd say maybe more inspiration from diffusion rather than diffusion models themselves. But they're a way to basically learn an end-to-end sort of motion control. Whereas previously, all of robotics motion control was sort of very hard-coded. You either, you know, you're programming in explicit motions, or you're programming in an explicit goal and using an optimization library to find the shortest path to it. This is now something where you just give it a bunch of demonstrations. And again, just like using learning, it's basically like learning from these examples. What does it mean to go pick up a cup? And doing these in a way just like diffusion models, where they are somewhat conditioned by text, you can have the same model learn many different tasks. And then the hope is that these start to generalize. That if you've trained it on picking up coffee cups and picking up books, then when I say pick up the backpack, it knows how to do that too. Even though you've never trained it on that. That's kind of the holy grail here, is that you train it on 500 different tasks, and then that's enough to really get it to generalize to do anything you would need. I think that's like still a big TBD. And these people are working, have like measured some degree of generalization. But at the end of the day, it's also like LLMs. Like, you know, do you really care about the thing, being able to do something that no one has ever shown in training data? People for like a home robot, there's going to be like a hundred things that people really wanted to do. And you can just make sure it has good training for those things. What you do care about then is like generalization within a task of, oh, I've never seen this particular coffee mug before. Can I still pick it up? And those, the models do seem very good at. So these kind of are the two big things that are going for robotics right now, is LLMs for common sense and diffusion-inspired path planning algorithms. I think this is very promising, but I think there's a lot of hype. And I think where we are right now is where self-driving cars were 10 years ago. I think we have very cool demos that work. I mean, 10 years ago, you had videos of people driving a car on the highway, driving a car, you know, on a street with a safety driver. But it's really taken a long time to go from there to, I took a Waymo here today. And even Waymo is only in SF and a few other cities. And I think it takes a long time for these things to actually get everywhere and to get all the edge cases covered. I think that for robotics, the limiting factor is going to be reliability, that these models are really good at doing these demos of doing laundry or doing dishes. If they only work 99% of the time, that sounds good, but that's actually really annoying. Humans are really good at these tasks. Imagine if one out of every 100 dishes, it washed, it breaks. You would not want that robot in your house, or you certainly wouldn't want that in your factory if one of every 100 boxes that it moves, it drops and breaks things inside it. So I think for these things to really be useful, they're going to have to hit a very, very high level of reliability, just like self-driving cars. And I don't know how hard it's going to be for these models to move from the 95% reliability to 99.9. I think that's going to be the big thing. And I think also, I'm a little skeptical of how good the unit economics of these things will be. These robots are going to be very expensive to build. And if you're just trying to replace labor, like a one-for-one purchase, it kind of sets an upper cap about how much you can charge. And so it seems like it's not that great a business. I'm also worried about that for the self-driving car industry.

Alessio [01:04:05]: Do you see most of the applications actually taking some of the older, especially manufacturing machinery, which needs to be very precise? Even if it's off by just a few millimeters, it cannot screw up the whole thing and be able to adjust at the edge? Or do you think the net new use cases may be more interesting?

Erik [01:04:24]: I think it'd be very hard to replace a lot of those traditional manufacturing robots because everything relies on that precision. If you have a model that can, again, only get there 99% of the time, you don't want 1% of your cars to have the weld in the wrong spot. That's going to be a disaster. And a lot of manufacturing is all about getting rid of as much variance and uncertainty as

Swyx [01:04:47]: possible.

Erik [01:04:47]: Yeah.

Swyx [01:04:48]: And what about the hardware?

Alessio [01:04:49]: A lot of my friends that work in robotics, one of their big issues is sometimes you just have a servo that fails, and it takes a bunch of time to fix that.

Swyx [01:04:57]: Is that holding back things?

Alessio [01:04:58]: Or is the software still, anyway, not that ready?

Swyx [01:05:01]: I think both.

Erik [01:05:01]: I think there's been a lot more progress in the software in the last few years. And I think a lot of the humanoid robot companies now are really trying to build amazing hardware. Hardware is just so hard. It's something where you build your first robot, and it works. You're like, great. Then you build 10 of them. Five of them work. Three of them work half the time. Two of them don't work. And you built them all the same, and you don't know why. And it's just like the real world has this level of detail and differences that software

Swyx [01:05:28]: doesn't have.

Erik [01:05:29]: Imagine if every for loop you wrote, some of them just didn't work. Some of them were slower than others. How do you deal with that? Imagine if every binary that you shipped to a customer, each of those four loops was a

Swyx [01:05:41]: little different.

Erik [01:05:41]: It becomes just so hard to scale and maintain quality of these things. And I think that's what makes hardware really hard. It's not building one of something, but repeatedly building something and making it work reliably. Where again, you'll buy a batch of 100 motors, and each of those motors will behave a little bit differently to the same input command.

Swyx [01:06:01]: This is your lived experience at Cobalt.

Erik [01:06:03]: And robotics is all about how do you build something that's robust despite these differences.

Swyx [01:06:08]: We can't get the tolerance of motors down to-

Erik [01:06:10]: It's just everything.

Swyx [01:06:13]: It's actually everything.

Alessio [01:06:14]: Yeah.

Erik [01:06:15]: No, I mean, one of my horror stories was that at Cobalt, this was many years ago, we had a thermal camera on the robot that had a USB connection to the computer inside, which is, first of all, is a big mistake. You're not supposed to use a USB. It is not a reliable protocol. It's designed that if there's mistakes, the user can just unplug it and plug it back in. I see. And so typically things that are USB, they're not designed to the same level of very high reliability you need. Again, because they assume someone will just unplug it and replug it back in. You just say someone sometime.

Swyx [01:06:46]: I heard this too, and I didn't listen to it.

Erik [01:06:47]: I really wish I had before. Anyway, at a certain point, a bunch of these thermal cameras started failing, and we couldn't figure out why. And I asked everyone on the team, like, hey, what's changed? Did the software change around this? Did the hardware design change around this? And I was investigating all this stuff, looking at kernel logs of what's happening with this

Swyx [01:07:07]: thing.

Erik [01:07:07]: And finally, the procurement person was like, oh, yeah, well, I found this new vendor for USB cables last summer.

Swyx [01:07:14]: And I'm like, what?

Erik [01:07:15]: You switched which vendor were buying USB cables? I'm like, yeah, it's the same exact cable. It's just a dollar cheaper. And it turns out this was the problem. This new cable had slightly worse resistance or slightly worse EMI interference. And it worked most of the time. But 1% of the time, these cameras would fail, and we'd need to reboot a big part of the system. And it was all just because the same exact spec, these two different USB cables, slightly different. And so these are the kind of things you deal with with hardware.

Swyx [01:07:45]: For listeners, we had an episode with Josh Albrecht in BU where he talked about buying tens of thousands of GPUs. And just some of them will just not do math. Yeah, that's the same thing. You run some tests to find the bad batch, and then you return it to sender because they just, GPUs won't do math, right? Yeah, yeah, this is the thing.

Erik [01:08:05]: The real world has this level of detail. Eric Jang, he did AI at Google.

Swyx [01:08:11]: Yeah, 1X. Yeah, and then joined 1X.

Erik [01:08:13]: I see him post on Twitter occasionally of complaints about hardware and supply chain. And we know each other, and we joke occasionally. I went from robotics into AI, and he went from AI into robotics.

Swyx [01:08:26]: I mean, look, very, very promising. The time of the real world is unlimited, right? But just also a lot harder. And yeah, I do think something I also tell people about for why working software agents is they're infinitely clonable. Yeah, they always work the same way. Mostly, unless you're using Python. And yeah, I mean, this is the whole thesis. I'm also interested, you dropped a little bit of alpha there. I don't want to make sure we don't lose it. Like, you're kind of skeptical about self-driving as a business. So I want to double click on this a little bit, because I mean, I think that shouldn't be taken away. We do have some public Waymo numbers. Read from Waymo is pretty public with their stats. They're exceeding 100 Waymo trips a week. If you assume a 25𝑟𝑖𝑑𝑒𝑎𝑣𝑒𝑟𝑎𝑔𝑒,𝑡ℎ𝑎𝑡′𝑠25rideaverage,that′s130 million revenue run rate. At some point, they will recoup their investment, right? Like, what are we talking about here? Way to skepticism.

Erik [01:09:21]: I think, and again, I'm not an expert. I don't know their financials. I would say the thing I'm worried about is compared to an Uber, I don't know how much an Uber driver takes home a year, but call that the revenue that a Waymo is going to be making in that same year. Those cars are expensive. It's not about if you can hit profitability, it's about your cash conversion cycles. Is building one Waymo, how cheap can you make that compared to how much you're earning as the equivalent of what an Uber driver would take home? Because remember, an Uber driver, you're not getting that whole revenue. You think about, for the Uber driver, the cost of the car, the depreciation of the car. I'm not convinced how much profit Waymo can actually make per car.

Swyx [01:10:02]: That's, I think, my skepticism.

Alessio [01:10:02]: Well, they need to pre-assess the run Waymo because the Class C is like $110 grand, something

Swyx [01:10:09]: like that, plus the LiDAR. That's many years of, yeah, yeah, yeah. Exactly, exactly. Anything else?

Alessio [01:10:14]: Parting thoughts? Call to action? Rants?

Swyx [01:10:18]: The floor is yours.

Erik [01:10:19]: I'm very excited to see a lot more LLM agents out there in the world doing things. And I think they'll be, the biggest limiting thing will start to become, do people trust the output of these agents? And how do you trust the output of an agent that did five hours of work for you and is coming back with something? And if you can't find some way to trust that agent's work, it kind of wasn't valuable at all. So I think that's going to be a really important thing, is not just doing the work, but doing the work in a trustable, auditable way where you can also explain to the human, hey, here's exactly how this works and why and how I came to it. I think that's going to be really important.

Swyx [01:10:54]: Thank you so much. Yeah, thanks. This was great.

and has now comfortably overtaken the previous leader in the prompt-to-app space, Vercel’s v0, which uses a custom finetuned model.

A surprisingly short lived one, lasting a whole 6 days.

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

Solving SWE-Bench

Claude Computer Use

Full Video episode

Show Notes

Timestamps

Transcript

Discussion about this episode

Ready for more?