
AI Daily Brief Host

Appearances

The AI Daily Brief (Formerly The AI Breakdown): Artificial Intelligence News and Analysis

What AI Coding Agents Can Do Right Now

0.209

Today on the AI Daily Brief, OpenAI released a paper effectively seeking to test how competent their leading models are in real-world coding applications. Before that in the headlines, former OpenAI CTO Mira Murati has officially announced her new company, Thinking Machines. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

1013.45

For individual contributor tasks, o1 came in second place, earning $78,000, while GPT-4o performed less well, earning $29,000. As interesting as the results, though, was the analysis. The report explained, agents excel at localizing but fail to root cause, resulting in partial or flawed solutions.

1032.357

Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions, often far faster than a human would.

1041.88

However, they often exhibit a limited understanding of how the issue spans multiple components or files and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the right file or location to edit.

1057.869

For the managerial tasks, each model displayed better performance. Claude 3.5 Sonnet was again the best-performing model, earning $314,000 of a possible $585,000, completing 54% of tasks. o1 was hot on its heels, correctly completing 52% of tasks for a total of $302,000. And even GPT-4o, bringing up the rear, still managed 47% of tasks to earn $275,000.

1081.465

This showed that the models were all decent at choosing the right solution when presented with several options, but still have a long way to go until they can fully replace a technical lead. Overall, Claude 3.5 Sonnet won the day, earning $403,000 overall with a 40% completion rate.

1093.15

o1 earned $380,000 while completing 38% of the full set of tasks, and GPT-4o finished 30% of tasks, earning $304,000. Now, to be clear, no money was actually earned. These tasks were all simulated, but that's how much they would have earned had the AI actually been in charge of that job from Upwork or Expensify.
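
As a rough check on the figures quoted in this episode, the sketch below recomputes each model's share of the payout pool from the dollar amounts read out on air. The variable names are illustrative, the $1 million total is the rounded figure quoted for the benchmark, and treating overall earnings as the sum of individual-contributor and managerial earnings is an assumption rather than something the episode states.

```python
# Dollar figures as quoted in the episode (US dollars).
TOTAL_PAYOUT = 1_000_000   # rounded benchmark total quoted in the episode
MANAGER_POOL = 585_000     # possible earnings on the managerial tasks

overall = {"Claude 3.5 Sonnet": 403_000, "o1": 380_000, "GPT-4o": 304_000}
manager = {"Claude 3.5 Sonnet": 314_000, "o1": 302_000, "GPT-4o": 275_000}


def share(earned: int, pool: int) -> float:
    """Return earnings as a percentage of the pool, rounded to one decimal."""
    return round(100 * earned / pool, 1)


overall_share = {m: share(v, TOTAL_PAYOUT) for m, v in overall.items()}
manager_share = {m: share(v, MANAGER_POOL) for m, v in manager.items()}

# Assumption: overall earnings = individual-contributor + managerial earnings,
# so the individual-contributor amount is implied by subtraction.
ic_implied = {m: overall[m] - manager[m] for m in overall}

for m in overall:
    print(f"{m}: overall {overall_share[m]}%, managerial {manager_share[m]}%, "
          f"implied IC earnings ${ic_implied[m]:,}")
```

Under that assumption, the implied individual-contributor amounts for o1 and GPT-4o match the $78,000 and $29,000 quoted earlier, and the dollar shares line up with the completion percentages read out in the episode.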

1112.983

Part of what's so interesting about this, and we'll get to this in a moment in the commentary, is that this absolutely reflects the broad consensus that people have had for some time, which is that Claude 3.5 Sonnet is just by far and away the best coding model.

1124.813

We've even talked about how its ubiquity as a coding model created some challenges for Anthropic's economic report, given what a high percentage of Claude's use comes from those coding use cases. Now, in terms of commentary and the response to this so far, a lot of it is focusing on exactly this weird contrast that we've identified. Mihir Patel writes, As always, evals remain hard and messy.

1143.987

And still, somehow, Sonnet is the best code model. Benjamin De Kraker, who was previously on the team at xAI but fired for saying that Grok 3 wasn't the second coming, noted that it was bold of OpenAI to show that Claude 3.5 Sonnet outperformed o1 on their own benchmark. Synthetica Lab responded, I'm not benchmarking, but in a real project that I'm working on in C++, o1 was basically unusable.

1177.481

They then went on to share their experience with o1, Claude 3.5, and Grok 3, again pointing out that these benchmarks are really not necessarily useful for understanding how things are going to work in the real world. Another interesting comment came from Henry Shi, the founder of Super.com.

1192.13

He pointed out that in a previous experiment that he had run that was very similar, while they had reached the same conclusion that, quote, frontier models are still unable to solve the majority of tasks, he also wrote, what's interesting and underappreciated in the paper is that o1 is able to solve almost 50% of all IC SWE tasks on the Upwork benchmark.

1208.863

This makes sense as human freelancers rarely get the solution right on the first try. There's a lot of back and forth and clarification required with the client. If AI agents are able to effectively iterate on a problem, it should be able to drastically improve performance, just like humans and feedback in the workplace as well.

1223.756

In other words, for the sake of this benchmark, these model-powered agents were given a single chance to do it. That's not actually how it would work in the real world. And so as the user experience and interactive capabilities of agents go up, it's likely that in real-world settings, they'd be able to even outperform where they got during this test.

123.559

And despite their potential, these systems remain difficult for people to customize to their specific needs and values. To bridge the gaps, we're building Thinking Machines Lab to make AI systems more widely understood, customizable, and generally capable. Now, if you're sitting there thinking, boy, I have absolutely no idea what these folks are actually building, you, my friend, are not alone.

1240.58

Another thing that some pointed out was the likelihood that this means that OpenAI is actually building an end production coding agent. Developer Nick Dobos writes, if they took the time to build a benchmark, it means they are building a product to test an agent against it.

1253.834

We haven't talked about this all that much on this show, but I'm fairly certain that in a world where it's increasingly clear that the underlying models are going to be commoditized and that there's not going to be much moat when it comes to technology, I think OpenAI has a much stronger incentive to own the customer experience end to end.

1268.832

And my guess is that they are looking at agents in just about every key domain of work. Now, going back to this broader idea of vibe coding, I wanted to flag just how big a theme this has gotten to be. Like I said, I think that coding is one of the areas where agents are coming to production and actually being deployed for businesses most quickly.

1285.993

And I think that this whole idea of vibe coding is really fleshing out the spectrum of code creation from no code all the way to coding agents, all the way to traditional coding experiences. A16Z recently did a new market map of these types of tools.

1298.203

People like Riley Brown, who's the number one AI creator on TikTok, have gone all in on vibe coding, even working on some tools to improve how people do their vibe coding now. He also shared some interesting thoughts recently about how this might change the structure of the economy.

1312.233

Specifically, he points out that as creators can monetize their audiences with software rather than things like courses and ads, it creates a very different type of economic opportunity, one that's starting to be reflected in a new generation of VC creator funds. And speaking of VCs, it's very clear that there is lots of interest in this area.

1330.453

A16Z's Andrew Chen tweets, So you can, for example, make the signup flow the same as XYZ app, explainers on highlighted code diffs, etc., library of graphic assets, integrated logo creator. And Andrew points out all the PMs and ex-PMs like me will have a field day with this.

1355.498

Point being that when we look at coding right now, not only are we talking about disruption to the way that coding happens among traditional software engineers, we're also talking about totally different modalities and an expansion of who gets to actually push code.

1369.164

At the same time, even as all of these people get excited about what they can do that they couldn't do before because they weren't coders, that's not the same as these tools being able to be inserted willy-nilly into enterprise code processes.

1381.59

And so a lot of the work over the next couple of years is going to be to figure out how these experiences diverge and what type of coding agents are good for different settings. Still, it is an absolutely fascinating time and I am very excited to see what comes next. For now though, that is going to do it for today's AI Daily Brief. Appreciate you listening as always. Until next time, peace.

141.527

Cosmic Chaos writes, good luck, but I'm still not sure what exactly you are building. Is it one product that does all three or separately? Is it a service or a product? And what's your roadmap? William Wolfe writes, I'm rooting for Thinking Machines, but I wish projects like this had product to both engineering and design in their founding philosophies.

157.858

Otherwise, it kind of just feels like yet another group of world-class researchers vaguely gesticulating at the future. Where is the vision? Swyx pointed out what he called two notable omissions from the Thinking Machines manifesto. The website does not use the word reasoning or agent at all. So what are these folks building? I have absolutely no idea.

176.972

It does feel a little bit like the type of text that maybe in retrospect, when we learn what they're building, will make sense. Right now, I think vaguely gesticulating at the future is a pretty accurate way to describe it. At the end of the day, though, when it comes to things like potential for fundraising, the clarity of the description probably doesn't matter even a little bit.

19.46

To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. OpenAI has had a lot of talent departures over the last year and a half or so. In some cases, it's felt like a protest on how the direction of the company was going and indeed has explicitly been shared as such.

195.745

Currently, the 29 or so employees come from places like OpenAI, Meta, Character.AI, and Google DeepMind. Barret Zoph, OpenAI's former VP of post-training research, is taking on the CTO role, with OpenAI co-founder John Schulman serving as chief scientist.

208.733

And indeed, when it comes to people's interest in the company, it's best summed up by Andrej Karpathy, who writes, In other words, while this may be a situation where we don't have any idea what they're actually building, they're probably still worth paying attention to. Next up on the other end of the startup journey, less than a year after launch, the Humane pin is officially dead and gone.

233.432

Humane announced on Tuesday that their AI wearable startup has been acquired by HP. Customers have been given just 10 days notice that servers would be shut down, rendering the expensive device useless. In the FAQ, Humane noted the device could still be used for offline features like checking the battery level. So there's something there, I guess.

250.847

Now, of course, the Humane pin was a bold early attempt at creating a wearable AI assistant, but fell flat for a number of reasons, all of which have been endlessly discussed in retrospect. It was originally priced at $699, making it very inaccessible, really only for very high-end gizmo enthusiasts.

268.034

Initial reviews were universally terrible, the absolute apex of which was Marques Brownlee calling it the worst product I've ever reviewed, a review which has been seen 8.5 million times. Updates also couldn't save the device. At one point last summer, Humane was processing more returns than they had sales. Humane even told customers to stop using the charging case due to battery fire concerns.

289.365

As for the buyout, HP said they were acquiring the team and the company's AI operating system to help them create, quote, an intelligent ecosystem across all HP devices, from AI PCs to smart printers and connected conference rooms.

301.087

Gonzalo Nunez writes, the Humane founders having to go work for AI for OfficeJet printers at HP is the ultimate Sisyphusian punishment for the prototypical Steve Jobs LARPer founder. I cannot imagine anything more cruel. So is there anything to learn from the failure of Humane?

316.627

Investor Justin Duke doesn't think so, writing, Basically, Duke is arguing that Humane was very much a creature of the 2019, 2020, 2021 era of VC when massive checks were flying around Silicon Valley at the very end of ZIRP.

337.944

Entrepreneur Chris Bakke writes, Humane is the perfect cautionary tale of how talented people get completely distorted from reality by staying at large successful companies for too long. Are you really a great product designer or do you just work at Apple? Are you actually great at sales or do you just work at Google? Are you really an incredible growth marketer or do you just work at Instagram?

354.998

After a certain size, the brands sell themselves. The only way to test your abilities is to leave the shelter of these mega brands and go out and build something yourself from scratch. And usually throwing lots of money at the problem pre-launch isn't going to help you. Maybe the more pertinent question is what it means about the state of AI wearables in general.

371.059

One thing that makes it complicated to determine is the disconnect between when it was launched and how capabilities have changed. The Humane Pin was released in April 2024, a few months before Google released the first version of AI search that suggested eating rocks and using glue as a pizza topping.

385.006

Now, however, we're at a stage where leading AI models, even small ones designed for on-device use, are as good at coding as most junior programmers, although exactly how good they are we'll get into in the main episode. Still, at this point, it's not clear that people actually want an AI assistant in a standalone device. Newsletter writer Jack Appleby thinks that there's a form factor problem.

402.812

He writes, the future of AI isn't new hardware, it's upgrading existing software. Control-L Dwayne writes, the first AI hardware flop. I don't know a single person who bought a Humane AI Pin, but this is brutal. This is exactly why AI hardware will only succeed when it's 100% local with no cloud or API dependencies. I don't know, man. I'm not so sure that the lessons are as clear as people think.

41.342

In others, it's about people making a boatload of money and just wanting to do something different for a while. And then still in others, it's about building something new outside of the constraints of that company. And among that set, one of the most closely watched people has been former CTO Mira Murati.

422.086

People have loved to rip on Humane from the very beginning, and a lot of it is absolutely self-inflicted. The overwrought marketing videos that felt like they were trying too hard to live in Steve Jobs' shadow. The price point. The amount of money raised. There were plenty of red flags for even someone who was trying to go in unbiased.

438.815

It is going to be an extraordinary process of trial and error to figure out if and what sort of AI wearable experiences consumers are actually going to want. No one has a perfect crystal ball into that future, otherwise they'd be making a ton of money. I'm glad that there are experiments still happening.

454.082

I would say that Humane is a great reminder that extraordinarily well-funded startups tend not to be the ones to invent these sort of new experiences. But at the same time, there are some indicators of AI wearables actually getting some traction. Best example of that may be the Ray-Ban Meta AI glasses, which are an extremely popular product. So who knows?

472.333

All we know for sure is that Humane's part of the story is done for now, but I would be very surprised, ultimately, if that means the category of AI wearables is actually cooked. Anyways, guys, that's going to do it for today's AI Daily Brief. One new beginning, one ending. And next up, the main episode. Today's episode is brought to you by Vanta. That's where Vanta comes in.

506.433

Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. Centralized security workflows complete questionnaires up to 5x faster and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly.

528.678

Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time. For a limited time, this audience gets $1,000 off Vanta at vanta.com slash nlw. That's v-a-n-t-a dot com slash nlw for $1,000 off.

554.844

If there is one thing that's clear about AI in 2025, it's that the agents are coming, vertical agents by industry, horizontal agent platforms, agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode.

56.574

For months now, there have been rumors around what she's building, mostly fueled by departures and recruitment from OpenAI and Anthropic to join Murati on some as yet unrevealed company. Now, however, that company has been officially announced. Yesterday, Mira tweeted, I started Thinking Machines Lab alongside a remarkable team of scientists, engineers, and builders.

576.021

That's why Superintelligent is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit.

582.749

Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business.

597.685

If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw at bsuper.ai. Put the word agent in the subject line so I know what you're talking about, and let's have you be a leader in the most dynamic part of the AI market. Hey, listeners, are you tasked with the safe deployment and use of trustworthy AI?

616.709

KPMG has a first-of-its-kind AI risk and controls guide, which provides a structured approach for organizations to begin identifying AI risks and design controls to mitigate threats. What makes KPMG's AI Risks and Controls Guide different is that it outlines practical control considerations to help businesses manage risks and accelerate value. To learn more, go to www.kpmg.us slash AI Guide.

641.261

That's www.kpmg.us slash AI Guide. Welcome back to the AI Daily Brief. If you've been anywhere near AI Twitter slash X over the last few weeks, you've probably heard this term vibe coding. It was coined by OpenAI co-founder Andrej Karpathy, who said, there's a new kind of coding I call vibe coding, where you fully give into the vibes, embrace exponentials, and forget that the code even exists.

668.132

It's possible because the LLMs, e.g. Cursor Composer with Sonnet, are getting too good. Also, I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like decrease the padding on the sidebar by half because I'm too lazy to find it. I "Accept All" always. I don't read the diffs anymore.

685.182

When I get error messages, I just copy-paste them in with no comment. Usually that fixes it. The code grows beyond my usual comprehension. I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug, so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing.

702.418

I'm building a project or a web app, but it's not really coding. I just see stuff, say stuff, run stuff, and copy-paste stuff, and it mostly works. Now, this, as we will discuss, has begot an entire movement of vibe coders who are thinking about new categories of tools.

717.332

And it's predicated, as Karpathy points out, on the availability of a particular set of new coding tools that hit that line right between LLMs and agents in terms of how much they're being controlled by humans and how much they're actually doing for themselves. Indeed, I think part of what makes this area so interesting is that it is really at the forefront of agents in practice.

736.725

It demonstrates on the one hand how mushy some of this terminology is, but at the same time, how powerful these tools are likely to be in practice. All right, so part of the context for today's show is vibe coding. But then another little bit of background is the conversation we were having yesterday about Grok 3.0. When Grok 3 launched, it showed off how it had done on a bunch of benchmarks.

756.452

And I, like many people, found myself basically just having my eyes glaze over when it came to those benchmarks because they're so saturated at this point that it's really hard to actually get signal from them. As Ethan Mollick pointed out, public benchmarks are both meh and saturated, leaving a lot of AI testing to be like food reviews based on taste. If AI is critical to work, we need more.

77.851

We're building three things, helping people adapt AI systems to work for their specific needs, developing strong foundations to build more capable AI systems, fostering a culture of open science that helps the whole field understand and improve these systems.

775.41

He also pointed out that a lot of these benchmarks, quote, look nothing like actual work. And given that we spend all of our time over at Superintelligent on the actual deployment and practice of AI and agents at work, this is a particularly poignant problem. It's also not an easy one.

790.655

Another reminder from just this morning from Ethan, AI is so challenging to figure out because it's genuinely capable of doing PhD-level work in some areas while messing up basic tasks in closely related areas. And the abilities of AI are growing, but unevenly. All right, so all of this is background to our main topic today, which is a new benchmark from OpenAI called the SWE-Lancer benchmark.

811.924

The gist and the question that provoked the whole conversation was, can frontier LLMs earn $1 million from real-world freelance software engineering? Earlier this week, OpenAI released a paper effectively seeking to test how competent their leading models are in real world coding applications.

827.445

This new SWE-Lancer benchmark consists of, quote, over 1,400 freelance software engineering tasks from Upwork valued at $1 million in USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks ranging from $50 bug fixes to $32,000 feature implementations and managerial tasks where models choose between technical implementation proposals. So why is this important?

851.942

Well, this gets at exactly what we were just discussing. Until now, coding benchmarks have largely involved competitive coding problems. These are tests that assess models on tricky programming puzzles, but don't translate directly into practical real-world use cases.

866.386

On top of their inapplicability to the real world, they're also, as we just mentioned, becoming increasingly saturated, making it difficult to know whether a new model represents a significant improvement or was simply trained to perform well on a known set of questions. This benchmark then is much more focused on the real world.

882.694

And it actually harkens back to an idea that some like Microsoft's Mustafa Suleyman have proposed for a new type of Turing test based on how AI interacts with the real world. Back in the middle of 2023, Mustafa Suleyman proposed a Turing test of whether AI could make a million dollars.

898.241

Mustafa wrote, I think we're in a moment of genuine confusion or perhaps more charitably debate about what's really happening. Even as the Turing test fails, it doesn't leave us much clearer on where we are with AI or what it can actually achieve. It doesn't tell us what impact these systems will have on society or help us understand how that will play out.

90.899

Our goal is simple, advance AI by making it broadly useful and understandable through solid foundations, open science, and practical applications. Alongside it, they published a website, thinkingmachines.ai. They write, Knowledge of how these systems are trained is concentrated within the top research labs, limiting both the public discourse on AI and people's abilities to use AI effectively.

914.628

His proposal then for a modern Turing test would be to give AI the instruction, go make a million dollars on a retail web platform in a few months with just a $100,000 investment.

923.951

So this is a little bit different, obviously, from what OpenAI has done, in that OpenAI is specifically giving the model these 1,400 freelance tasks, rather than asking it to go be creative and figure out how to make that money. But the principle of getting benchmarks into the real world, plus this baselining to a million dollars, are obviously reminiscent of his proposal.

941.818

Getting back to SWE-Lancer, for the purposes of this paper, the researchers set three LLMs to the task. They tested OpenAI's GPT-4o and o1 alongside Anthropic's Claude 3.5 Sonnet. Each LLM was driving a basic coding agent capable of directly interacting with a codebase. The models were given one shot to complete each task.

960.552

Overall, researchers found that, quote, the results indicate that the real-world freelance work in our benchmarks remains challenging for frontier language models. Going even further in the abstract, they write, we find that frontier models are still unable to solve the majority of tasks.

975.339

Providing a little more clarity on the tasks themselves, they were scraped directly from Upwork and Expensify with no word changes or clarification, giving the models a taste of real-world freelancing work. The models were also denied internet access, including GitHub, ensuring that they were working based solely on what they had learned in pre-training.

991.025

However, they did have access to a snapshot of the codebases they were working on. The results found that none of the models had earned a million dollars as an automated freelancer. Interestingly, though, despite the fact that this research was from OpenAI, Claude 3.5 Sonnet performed the best, resolving 26% of individual contributor issues and earning $89,000 out of a possible $415,000.
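
To make the scoring idea concrete, here is a minimal sketch of how a payout-weighted score differs from a plain solve rate. It is an illustration only, not the paper's evaluation code: the `Task` class, the `score` function, and the four toy tasks are hypothetical. Only the $89,000 and $415,000 figures come from the episode.

```python
from dataclasses import dataclass


@dataclass
class Task:
    payout_usd: int  # price attached to the freelance task (toy values below)
    solved: bool     # whether the agent's single attempt passed


def score(tasks):
    """Return (share of tasks solved, dollars earned, share of dollars earned)."""
    total = sum(t.payout_usd for t in tasks)
    earned = sum(t.payout_usd for t in tasks if t.solved)
    solve_rate = sum(t.solved for t in tasks) / len(tasks)
    return solve_rate, earned, earned / total


# Hypothetical tasks: the two cheap ones are solved, the two expensive ones are not.
toy_tasks = [
    Task(50, True),
    Task(500, True),
    Task(1_000, False),
    Task(32_000, False),
]
toy_solve_rate, toy_earned, toy_dollar_share = score(toy_tasks)

# Figures quoted in the episode for Claude 3.5 Sonnet on individual contributor tasks.
quoted_dollar_share = 89_000 / 415_000  # roughly 0.214
```

In the toy data, half the tasks are solved but under 2% of the money is earned, because the solved tasks are the cheap ones. The quoted figures show a much smaller gap in the same direction: 26% of issues resolved against roughly 21% of the available dollars, which would be consistent with the solved tasks skewing somewhat toward lower payouts, though the episode does not say so directly.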