How Cosine Genie reached 50% on SWE-Bench Lite, 30% on the full SWE-Bench, and 44% on OpenAI's new SWE-Bench Verified, all state-of-the-art results by the widest margin ever recorded.
An interesting observation from the talk is that SWE requires thinking and acting like a human engineer. I wonder if there are workflows that are very different from how humans work but turn out to be better?
Another good episode. I appreciate that it makes what an “agent” is a bit more scientific. I’m trying to expand into the space a bit in my work.