
Aman Sanger, Arvid Lunnemark, Michael Truell, and Sualeh Asif are creators of Cursor, a popular code editor that specializes in AI-assisted programming. Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep447-sc See below for timestamps, transcript, and to give feedback, submit questions, contact Lex, etc.
Transcript: https://lexfridman.com/cursor-team-transcript

CONTACT LEX:
Feedback - give feedback to Lex: https://lexfridman.com/survey
AMA - submit questions, videos or call-in: https://lexfridman.com/ama
Hiring - join our team: https://lexfridman.com/hiring
Other - other ways to get in touch: https://lexfridman.com/contact

EPISODE LINKS:
Cursor Website: https://cursor.com
Cursor on X: https://x.com/cursor_ai
Anysphere Website: https://anysphere.inc/
Aman's X: https://x.com/amanrsanger
Aman's Website: https://amansanger.com/
Arvid's X: https://x.com/ArVID220u
Arvid's Website: https://arvid.xyz/
Michael's Website: https://mntruell.com/
Michael's LinkedIn: https://bit.ly/3zIDkPN
Sualeh's X: https://x.com/sualehasif996
Sualeh's Website: https://sualehasif.me/

SPONSORS:
To support this podcast, check out our sponsors & get discounts:
Encord: AI tooling for annotation & data management. Go to https://encord.com/lex
MasterClass: Online classes from world-class experts. Go to https://masterclass.com/lexpod
Shopify: Sell stuff online. Go to https://shopify.com/lex
NetSuite: Business management software. Go to http://netsuite.com/lex
AG1: All-in-one daily nutrition drinks. Go to https://drinkag1.com/lex

OUTLINE:
(00:00) - Introduction
(09:25) - Code editor basics
(11:35) - GitHub Copilot
(18:53) - Cursor
(25:20) - Cursor Tab
(31:35) - Code diff
(39:46) - ML details
(45:20) - GPT vs Claude
(51:54) - Prompt engineering
(59:20) - AI agents
(1:13:18) - Running code in background
(1:17:57) - Debugging
(1:23:25) - Dangerous code
(1:34:35) - Branching file systems
(1:37:47) - Scaling challenges
(1:51:58) - Context
(1:57:05) - OpenAI o1
(2:08:27) - Synthetic data
(2:12:14) - RLHF vs RLAIF
(2:14:01) - Fields Medal for AI
(2:16:43) - Scaling laws
(2:25:32) - The future of programming

PODCAST LINKS:
- Podcast Website: https://lexfridman.com/podcast
- Apple Podcasts: https://apple.co/2lwqZIr
- Spotify: https://spoti.fi/2nEwCF8
- RSS: https://lexfridman.com/feed/podcast/
- Podcast Playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
- Clips Channel: https://www.youtube.com/lexclips
The following is a conversation with the founding members of the Cursor team, Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger. Cursor is a code editor based on VS Code that adds a lot of powerful features for AI-assisted coding. It has captivated the attention and excitement of the programming and AI communities.
So I thought this is an excellent opportunity to dive deep into the role of AI in programming. This is a super technical conversation that is bigger than just about one code editor. It's about the future of programming and in general, the future of human AI collaboration in designing and engineering complicated and powerful systems. And now a quick few second mention of each sponsor.
Check them out in the description. It's the best way to support this podcast. We've got Encord for unifying your machine learning stack, MasterClass for learning, Shopify for selling stuff online, NetSuite for your business, and AG1 for your health. Choose wisely, my friends.
Also, if you want to get in touch with me for whatever reason, or take a survey or send me questions for an AMA, all of that would be great. Go to lexfridman.com/contact. And now onto the full ad reads. I try to make them interesting, but if you skip them, please still check out our sponsors. I enjoy their stuff. Maybe you will too.
This episode is brought to you by Encord, a platform that provides data-focused AI tooling for data annotation, curation, management, and for model evaluation. One of the things I love about these guys is they have a great blog that describes things cleanly. I mean, it's technical, but it's not too technical, but it's sufficiently technical to where it's actually describing ideas, not BS.
blog posts on sort of the state-of-the-art, like the OpenAI o1 model that was just released. So sometimes they integrate it into why this is a part of Encord, why this makes sense, and sometimes not. And so I love that. I recommend their blog just in general. That said, when they are looking at state-of-the-art models, they are always looking for ways to integrate it into their platform.
Basically, it's a place to organize your data, and data is everything. This was true before the popularity and the explosion of attention methods of transformers. And it is still very much true now. Sort of the non-synthetic, the human generated data is extremely important.
How you generate that data, how you organize that data, how you leverage it, how you train on it, how you fine-tune on it, the pre-training, the post-training, all of it, the whole thing. Data is extremely, extremely important. And so Encord takes data very seriously. Anyway, go try out Encord to create, annotate, and manage your AI data at Encord.com slash Lex. That's Encord.com slash Lex.
This episode is also brought to you by Masterclass, where you can watch over 200 classes from the best people in the world in their respective disciplines. Carlos Santana on guitar, for example. I loved that one. There's a few guitar ones, Tom Morello too. Great, great, great stuff. But Carlos Santana, his instrumental Europa. I haven't quite tried to play that, but it's on my to-do list.
It's sort of one of those things, you know for sure this is a thing I will play because it's too beautiful. It's too soulful. It feels like once you play, you understand something about the guitar that you didn't before. It's not blues. It's not, I don't know what it is. It's some kind of dreamlike teleportation into a psychedelic world.
where the tone is warmer than anything else I've ever heard. And still, the guitar can cry. I don't know. I love it. He's a genius. So it's such a gift that you can get a genius like that to teach us about his secrets. Get unlimited access to every MasterClass and get an additional 15% off an annual membership at masterclass.com slash lexpod. That's masterclass.com slash lexpod.
This episode is also brought to you by Shopify, a platform designed for anyone to sell anywhere with a great-looking online store, or simple-looking online store, like the one I put together at lexfridman.com/store. I have a few shirts on there in case you're interested. And speaking of shirts, I'm reminded of thrift stores, which I very much loved for a long time. I still love thrift stores.
They're a nice place to get stuff. Like, I don't know, kitchen stuff and clothing. And the kind of clothing you get at thrift stores is actually pretty interesting because there's shirts there that are just unlike anything else you would get anywhere else. So if you're sort of selective and creative-minded, there's a lot of interesting fashion that's there.
And in terms of t-shirts, there's just like hilarious t-shirts. T-shirts that are very far away from the kind of trajectories you have taken in life, or are not, but you just haven't thought about it. Like a band that you love, but you never would have thought to wear their t-shirt. Anyway, a little bit, I think of Shopify as the internet's thrift store.
Of course, you can do super classy, you can do super fancy, or you can do super thrift. All of it is possible. Sign up for a $1 per month trial period at Shopify.com slash Lex. That's all lowercase. Go to Shopify.com slash Lex to take your business to the next level today. This episode is also brought to you by NetSuite, an all-in-one cloud business management system.
Sometimes I think that NetSuite is supporting this podcast because they're trolling me. They're saying, hey Lex, aren't you doing a little too much talking? Maybe you should be building more. I agree with you, NetSuite. I agree with you. And so every time I do an ad read for NetSuite, it is a chance for me to confront my Jungian shadow.
Some of the demons emerge from the subconscious and ask questions that I don't have answers to. Questions about one's mortality and that life is short and that one of the most fulfilling things in life is to have a family and kids and all of these things I would very much like to have. And also the reality that I love programming and I love building
I love creating cool things that people can use and share and that would make their life better. All of that. Of course, I also love listening to podcasts. And I kind of think of this podcast as me listening to a podcast where I can also maybe participate by asking questions. So all these things that you love, but you ask the hard question of like, okay, well, life is slipping away. It's short.
It really, really is short. What do you want to do with the rest of the minutes and the hours that make up your life? Yeah, so thank you for the existential crisis, NetSuite. I appreciate it. If you're running a business, if you have taken the leap into the unknown and started a company, then you should be using the right tools to manage that company.
In fact, over 37,000 companies have upgraded to NetSuite. Take advantage of NetSuite's flexible financing plan at netsuite.com slash lex. That's netsuite.com slash lex. This episode is also brought to you by the delicious, the delicious AG1. It's an all-in-one daily drink to support better health and peak performance.
It's basically a super awesome multivitamin that makes me feel like I have my life together. Even when everything else feels like it's falling apart, at least I have AG1. At least I have that nutritional foundation to my life.
So all the fasting I'm doing, all the carnivore diets, all the physical endurance events and the mental madness of staying up all night or just the stress of certain things I'm going through, all of that, AG1 is there. At least I have the vitamins. Also, I sometimes wonder, they used to be called Athletic Greens, and now they're called AG1. I always wonder, is AG2 coming? Why is it just one?
It's an interesting branding decision, like AG1. Me as an OCD kind of programmer type, it's like, okay, is this a versioning thing? Is this like AG 0.1 alpha? When's the final release? Anyway, the thing I like to say and to consume is AG1. They'll give you a one-month supply of fish oil when you sign up at drinkag1.com/lex. This is the Lex Fridman Podcast.
To support it, please check out our sponsors in the description. And now, dear friends, here's Michael, Sualeh, Arvid, and Aman. All right, this is awesome. We have Michael, Aman, Sualeh, Arvid here from the Cursor team. First up, big ridiculous question. What's the point of a code editor?
So the code editor is largely the place where you build software. And today, or for a long time, that's meant the place where you text edit a formal programming language. And for people who aren't programmers, the way to think of a code editor is like a really souped-up word processor for programmers, where the reason it's souped up is code has a lot of structure.
And so the quote-unquote word processor, the code editor, can actually do a lot for you that word processors sort of in the writing space haven't been able to do for people editing text there.
And so that's everything from giving you visual differentiation of the actual tokens in the code so you can scan it quickly, to letting you navigate around the code base, sort of like you're navigating around the internet with hyperlinks. You're going to sort of definitions of things you're using, to error checking to catch rudimentary bugs.
And so traditionally, that's what a code editor has meant. And I think that what a code editor is is going to change a lot over the next 10 years as what it means to build software maybe starts to look a bit different. I think also a code editor should just be fun.
Yes, that is very important. That is very important. And it's actually sort of an underrated aspect of how we decide what to build. Like a lot of the things that we build and then we try them out, we do an experiment and then we actually throw them out because they're not fun. And so a big part of being fun is like being fast a lot of the time. Fast is fun.
Yeah, that should be a t-shirt.
Like fundamentally, I think one of the things that draws a lot of people to building stuff on computers is this like insane iteration speed where, you know, in other disciplines, you might be sort of gated by resources or the ability, even the ability, you know, to get a large group together. And coding is this like amazing thing where it's you and the computer, and that alone, you can build really cool stuff really quickly.
So for people who don't know, Cursor is this super cool new editor that's a fork of VS Code. It would be interesting to get your kind of explanation of your own journey of editors. I think all of you were big fans of VS Code with Copilot. How did you arrive to VS Code and how did that lead to your journey with Cursor?
Yeah, so... I think a lot of us, well, all of us were originally Vim users.
Pure Vim.
Pure Vim, yeah. No NeoVim, just pure Vim in a terminal. And at least for myself, it was around the time that Copilot came out, so 2021, that I really wanted to try it. So I went into VS Code, the only platform, the only code editor in which it was available. And even though I really enjoyed using Vim, just the experience of Copilot with VS Code was more than good enough to convince me to switch.
And so that kind of was the default until we started working on Cursor.
And maybe we should explain what Copilot does. It's like a really nice autocomplete. It suggests, as you start writing a thing, it suggests one or two or three lines of how to complete the thing. And there's a fun experience in that, you know, like when you have a close friendship and your friend completes your sentences? Like when it's done well, there's an intimate feeling.
There's probably a better word than intimate, but there's a cool feeling of like, holy shit. It gets me. And then there's an unpleasant feeling when it doesn't get you. And so there's that kind of friction. But I would say for a lot of people, the feeling that it gets me overpowers that it doesn't.
And I think actually one of the underrated aspects of GitHub Copilot is that even when it's wrong, it's like a little bit annoying, but it's not that bad because you just type another character and then maybe then it gets you or you type another character and then it gets you. So even when it's wrong, it's not that bad.
Yeah, you can sort of iterate and fix it. I mean, the other underrated part of Copilot for me sort of was just the first real AI product. So the first language model consumer product.
So Copilot was kind of like the first killer app for LLMs. Yeah. And like the beta was out in 2021. Right.
Okay. So what's the origin story of Cursor? So around 2020, the scaling laws papers came out from OpenAI. And that was a moment where this looked like clear, predictable progress for the field, where even if we didn't have any more ideas, it looked like you could make these models a lot better if you had more compute and more data.
By the way, we'll probably talk for three to four hours on the topic of scaling laws. Just to summarize, it's a paper and a set of papers and a set of ideas that say bigger might be better for model size and data size in the realm of machine learning.
It's bigger and better, but predictably better. That's another topic of conversation.
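For readers who want the shape of that claim: the scaling-law papers fit loss curves of roughly the following form (a sketch in the style of the Kaplan and Chinchilla papers, not a formula discussed in this episode), where N is the parameter count, D is the amount of training data, and E, A, B, alpha, beta are empirically fitted constants:

```latex
% Rough scaling-law form (sketch; constants are fitted empirically, not values from this conversation)
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The point being that loss falls off predictably, as a power law, as you scale parameters and data together.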
So around that time, for some of us, there were a lot of conceptual conversations about what's this going to look like? What's the story going to be for all these different knowledge worker fields about how they're going to be made better by this technology getting better?
And then I think there were a couple of moments where the theoretical gains predicted in that paper started to feel really concrete. And it started to feel like a moment where you could actually go and not do a PhD if you wanted to work on, do useful work in AI. Actually felt like now there was this whole set of systems one could build that were really useful.
And I think that the first moment we already talked about a little bit, which was playing with the early bit of Copilot, that was awesome and magical. I think that the next big moment where everything kind of clicked together was actually getting early access to GPT-4. So it was sort of the end of 2022 when we were tinkering with that model. And the step up in capabilities felt enormous.
And previous to that, we had been working on a couple of different projects. We had been, because of Copilot, because of scaling laws, because of our prior interest in the technology, we had been tinkering around with tools for programmers, but things that are like very specific.
So, you know, we were building tools for financial professionals who have to work within a Jupyter notebook or, like, you know, playing around with, can you do static analysis with these models? And then the step up in GPT-4 felt like, look, that really made concrete the theoretical gains that we had predicted before. Felt like you could build a lot more just immediately at that point in time.
And also, if we were being consistent, it really felt like this wasn't just going to be a point solution thing. This was going to be all of programming was going to flow through these models. And it felt like that demanded a different type of programming environment, a different type of programming. And so we set off to build that sort of larger vision around that.
There's one that I distinctly remember. So my roommate is an IMO gold winner, and there's a competition in the U.S. called the Putnam, which is sort of the IMO for college people, and it's this math competition. He's exceptionally good. So Sheng Tong and Aman, I remember, it's sort of June of 2022.
had this bet on whether, in 2024, June or July, you were going to win a gold medal in the IMO with models.
IMO is International Math Olympiad.
Yeah, IMO is International Math Olympiad. And so Arvid and I both, you know, also competed in it. So it was sort of personal. And I remember thinking, man, this is not going to happen. This was like, even though I sort of believed in progress, I thought, you know, IMO gold, like, Aman is just, like, delusional.
That was the... And to be honest, I mean, I was, to be clear, very wrong, but that was maybe the most prescient bet in the group.
So the new results from DeepMind, it turned out that you were correct.
That's what the- Well, it was technically not.
Technically incorrect, but one point away. Aman was very enthusiastic about this stuff. Yeah. And before, Aman had this, like, scaling laws t-shirt that he would walk around with, where it had the, like, charts and, like, the formulas on it.
So you like felt the AGI or you felt the scaling?
Yeah, I distinctly remember there was this one conversation I had with Michael where before I hadn't thought super deeply and critically about scaling laws. And he kind of posed the question, why isn't scaling all you need or why isn't scaling going to result in massive gains in progress? And I think I went through like the stages of grief.
There is anger, denial, and then finally at the end, just thinking about it, acceptance. And I think I've been quite hopeful and optimistic about progress since. I think one thing I'll caveat is I think it also depends on like which domains you're going to see progress.
Like math is a great domain because especially like formal theorem proving because you get this fantastic signal of actually verifying if the thing was correct. And so this means something like RL can work really, really well. And I think like you could have systems that are perhaps very superhuman in math and still not technically have AGI.
Okay, so can we take it all the way to Cursor? And what is Cursor? It's a fork of VS Code. And VS Code is one of the most popular editors for a long time. Everybody fell in love with it. Everybody left Vim. I left Emacs for it. Sorry. So it unified in some fundamental way the developer community. And then you look at the space of things. You look at the scaling laws. AI is becoming amazing.
And you decided, okay, it's not enough to just write an extension for your VS Code because there's a lot of limitations to that. If AI is going to keep getting better and better and better, we need to really rethink how the AI is going to be part of the editing process. And so you decided to fork VS Code and start to build a lot of the amazing features we'll be able to talk about.
But what was that decision like? Because there's a lot of extensions, including Copilot.
of VS Code that are doing sort of AI-type stuff. What was the decision like to just fork VS Code? So the decision to do an editor seemed kind of self-evident to us, for at least what we wanted to do and achieve. Because when we started working on the editor, the idea was: these models are going to get much better, their capabilities are going to improve, and it's going to entirely change how you build software, both in that you will have big productivity gains, but also radical in that, like, the act of building software is going to change a lot.
And so you're very limited in the control you have over a code editor if you're a plugin to an existing coding environment. And we didn't want to get locked in by those limitations. We wanted to be able to just build the most useful stuff.
Okay, well then the natural question is, you know, VS Code is kind of with Copilot a competitor. So how do you win? Is it basically just the speed and the quality of the features?
Yeah, I mean, I think this is a space that is quite interesting, perhaps quite unique, where if you look at previous tech waves, maybe there's kind of one major thing that happened and it unlocked a new wave of companies.
But every single year, every single model capability or jump you get in model capabilities, you now unlock this new wave of features, things that are possible, especially in programming. And so I think in AI programming, being even just a few months ahead, let alone a year ahead, makes your product much, much, much more useful.
I think the cursor a year from now will need to make the cursor of today look obsolete. And I think, you know, Microsoft has done a number of like fantastic things, but I don't think they're in a great place to really keep innovating and pushing on this in the way that a startup can. Just rapidly implementing features.
And kind of doing the research experimentation necessary to really push the ceiling.
I don't know if I think of it in terms of features as I think of it in terms of capabilities for programmers. It's that as the new o1 model came out, and I'm sure there are going to be more models of different types, like longer context and maybe faster. There's all these... crazy ideas that you can try. And hopefully 10% of the crazy ideas will make it into something kind of cool and useful.
And we want people to have that sooner. To rephrase, it's like an underrated fact is we're making it for ourselves. When we started Cursor, you really felt this frustration that, you know, models, you could see models getting better. But the Copilot experience had not changed. It was like, man, these guys, the ceiling is getting higher. Why are they not making new things?
They should be making new things. Where's all the alpha features? There were no alpha features. It was like... I'm sure it was selling well. I'm sure it was a great business, but it didn't feel, I'm one of these people that really want to try and use new things. And it was just, there's no new thing for like a very long while.
Yeah, it's interesting. I don't know how you put that into words, but when you compare Cursor with Copilot, Copilot pretty quickly became, started to feel stale for some reason.
Yeah, I think one thing that I think helps us is that we're sort of doing it all in one, where we're developing the UX and the way you interact with the model at the same time as we're developing how we actually make the model give better answers. So we're like... how you build up the prompt or like how do you find the context and for a cursor tab, like how do you train the model?
So I think that helps us to have all of it like sort of like the same people working on the entire experience end-to-end.
Yeah, it's like the person making the UI and the person training the model sit 18 feet away.
Often the same person even.
Yeah, often even the same person. You can create things that are sort of not possible if you're not talking, you're not experimenting.
And you're using, like you said, Cursor to write Cursor. Of course. Oh, yeah. Well, let's talk about some of these features. Let's talk about the all-knowing, the all-powerful, praise be to the tab. You know, autocomplete on steroids. Basically. So how does tab work?
What is tab? To highlight and summarize at a high level, I'd say that there are two things that Cursor is pretty good at right now. There are other things that it does. But two things that it helps programmers with. One is this idea of looking over your shoulder and being like a really fast colleague who can kind of jump ahead of you and type and figure out what you're gonna do next.
And that was the original idea behind, that was kind of the kernel of the idea behind a good autocomplete was predicting what you're gonna do next. But you can make that concept even more ambitious by not just predicting the characters after your cursor, but actually predicting the next entire change you're gonna make, the next diff, next place you're gonna jump to.
And the second thing Cursor is pretty good at right now, too, is helping you sometimes jump ahead of the AI and tell it what to do and go from instructions to code. And on both of those, we've done a lot of work on making the editing experience for those things ergonomic and also making those things smart and fast.
One of the things we really wanted was we wanted the model to be able to edit code for us. That was kind of a wish, and we had multiple attempts at it before we had a good model that could edit code for you. Then after we had a good model, I think there'd been a lot of effort to make the inference fast for having a good experience.
And we've been starting to incorporate, I mean, Michael sort of mentioned this, like, ability to jump to different places. And that jump to different places, I think, came from a feeling of, you know, once you accept an edit, it's like, man, it should be just really obvious where to go next.
It's like, I made this change, the model should just know that, like, the next place to go to is, like, 18 lines down. Like, if you're a Vim user, you could press 18JJ or whatever.
But, like, why am I even doing this? The model should just know it. And so the idea was: you just press tab, it would go 18 lines down, and then show you the next edit, and you would press tab again. So as long as you could keep pressing tab. And so the internal competition was: how many tabs can we make someone press? Once you have the idea, more
abstractly, the thing to think about is how are the edits zero entropy? So once you've expressed your intent and the edit is... There's no new bits of information to finish your thought, but you still have to type some characters to make the computer understand what you're actually thinking. Then maybe the model should just read your mind and all the zero entropy bits should just be tabbed away.
Yeah, that was that was sort of the abstract.
There's this interesting thing where if you look at language model loss on different domains, I believe the bits per byte, which is kind of character normalized loss for code is lower than language, which means in general, there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable.
And this is, I think, even magnified when you're not just trying to autocomplete code, but predicting what the user is going to do next in their editing of existing code. And so, you know, the goal of Cursor Tab is: let's eliminate all the low-entropy actions you take inside of the editor. When the intent is effectively determined, let's just jump you forward in time, skip you forward.
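As a brief aside on the metric (my gloss, not the speakers'): bits per byte is just the model's cross-entropy converted to base 2 and normalized by bytes of text rather than tokens, which makes it comparable across tokenizers:

```latex
% Bits per byte from per-token cross-entropy losses (in nats)
\text{bits per byte} = \frac{\sum_i \ell_i}{(\ln 2)\, N_{\text{bytes}}}
```

Here \ell_i is the loss on token i and N_bytes is the number of bytes in the evaluated text. A lower value for code than for prose means code tokens are, on average, more predictable.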
Well, what's the intuition and what's the technical details of how to do next cursor prediction? That jump. That's not so intuitive, I think, to people.
Yeah. I think I can speak to a few of the details on how to make these things work. They're incredibly low latency, so you need to train small models on this task. In particular, they're incredibly prefill-token hungry. What that means is they have these really, really long prompts where they see a lot of your code and they're not actually generating that many tokens.
And so the perfect fit for that is using a sparse model, meaning an MOE model. Um, so that was kind of one, one breakthrough, one breakthrough we made that substantially improved performance at longer context. The other being, um, a variant of speculative decoding that we kind of built out called speculative edits.
These are two, I think, important pieces of what make it quite high quality and very fast.
Okay, so MOE, mixture of experts, the input is huge, the output is small. Okay, so what else can you say about how to make, does caching play a role in this particular?
Caching plays a huge role. Because you're dealing with this many input tokens, if every single keystroke that you're typing in a given line, you had to rerun the model on all of those tokens passed in, you're just going to, one, significantly degrade latency, two, you're going to kill your GPUs with load. So you need to design the actual prompts used for the model such that they're caching aware.
And then, yeah, you need to reuse the KV cache across requests just so that you're spending less work, less compute.
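To illustrate the caching-aware idea in a minimal way (a hypothetical sketch; the names and layout are invented, not Cursor's actual code): keep the large, slowly changing context in a byte-stable prefix so its KV cache can be reused across keystrokes, and put only the volatile part at the end.

```typescript
// Hypothetical sketch of a cache-aware prompt layout (not Cursor's real code).
// Content is ordered from most stable to most volatile, so the server-side
// KV cache for the long prefix can be reused across consecutive keystrokes.

interface TabPromptInput {
  repoContextFiles: string[];      // retrieved files; rarely change within a session
  recentEdits: string[];           // recent diffs; change occasionally
  currentFileBeforeCursor: string; // changes on every keystroke
  currentFileAfterCursor: string;
}

function buildTabPrompt(input: TabPromptInput): string {
  // Stable prefix: identical across consecutive requests -> KV cache hit.
  const stablePrefix = [
    "### Repository context",
    ...input.repoContextFiles,
    "### Recent edits",
    ...input.recentEdits,
  ].join("\n");

  // Volatile suffix: only this part invalidates the cache as the user types.
  const volatileSuffix = [
    "### Current file",
    input.currentFileBeforeCursor,
    "<|cursor|>",
    input.currentFileAfterCursor,
  ].join("\n");

  return `${stablePrefix}\n${volatileSuffix}`;
}
```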
Again, what are the things that tab is supposed to be able to do kind of in the near term? Just to like sort of linger on that. Generate code, like fill empty space, also edit code across multiple lines. Yeah. And then jump to different locations inside the same file.
Yeah. And then like launch. Hopefully jump to different files also. So if you make an edit in one file and... Maybe you have to go to another file to finish your thought. It should go to the second file also.
The full generalization is like next action prediction. Sometimes you need to run a command in the terminal, and it should be able to suggest the command based on the code that you wrote, too. Or sometimes you actually need... Like, it suggests something, but it's hard for you to know if it's correct because you actually need some more information to learn.
Like you need to know the type to be able to verify that it's correct. And so maybe it should actually take you to a place that's like the definition of something and then take you back so that you have all the requisite knowledge to be able to accept the next completion.
So providing the human the knowledge. Yes. Right. Can you integrate, like... I just got to know a guy named ThePrimeagen, who I believe has a setup where you can order coffee via SSH.
Oh, yeah. Oh, we did that. We did that.
So can also the model do that, like feed you and provide you with caffeine? Okay, so that's the general framework.
Yeah, and the magic moment would be... Programming is this weird discipline where sometimes the next five minutes, not always, but sometimes the next five minutes, what you're going to do is actually predictable from the stuff you've done recently.
And so can you get to a world where that next five minutes either happens by you disengaging and it taking you through, or maybe a little bit more of just you seeing next step, what it's going to do. And you're like, okay, that's good. That's good. That's good. That's good. And you can just sort of tap, tap, tap through these big changes.
As we're talking about this, I should mention that one of the really cool and noticeable things about cursor is that there's this whole diff interface situation going on. So like the model suggests with the red and the green of like, here's how we're going to modify the code. And in the chat window, you can apply and it shows you the diff and you can accept the diff.
So maybe can you speak to whatever direction of that?
We'll probably have like four or five different kinds of diffs. So we have optimized the diff for the autocomplete. So that has a different diff interface than when you're reviewing larger blocks of code. And then we're trying to optimize another diff thing for when you're doing multiple different files.
And sort of at a high level, the difference is for when you're doing autocomplete, it should be really, really fast to read. Actually, it should be really fast to read in all situations. But in autocomplete, it's sort of, you're really like your eyes focused in one area. You can't be in too many, the humans can't look in too many different places.
So you're talking about on the interface side?
On the interface side. So it currently has this box on the side. So we have the current box. And if it tries to delete code in some place and tries to add other code, it tries to show you a box on the side. You can maybe show it if we pull it up on cursor.com.
This is what we're talking about.
So that, that box, it was like three or four different attempts at trying to make this, this thing work where first the attempt was like these blue crossed out lines. So before it was a box on the side, it used to show you the code to delete by showing you like, uh, like Google doc style, you would see like a line through it. Then you would see the new code. That was super distracting.
And then we tried many different, you know, there was sort of deletions, there was trying to red highlight. Then the next iteration of it, which is sort of funny, you would hold on Mac the option button. So it would sort of highlight a region of code to show you that there might be something coming. So maybe in this example, like the input and the value would all get blue.
And the blue was to highlight that the AI had a suggestion for you. So instead of directly showing you the thing, it would show you that the AI, it would just hint that the AI had a suggestion. And if you really wanted to see it, you would hold the option button and then you would see the new suggestion. Then if you release the option button, you would then see your original code.
Mm-hmm. So that's, by the way, that's pretty nice, but you have to know to hold the option button.
Yeah.
By the way, I'm not a Mac user, but I got it. It's a button, I guess, you people have.
Again, it's just non-intuitive. I think that's the key thing.
And there's a chance this is also not the final version of it.
I am personally very excited for... making a lot of improvements in this area. We often talk about it as the verification problem, where these diffs are great for small edits. For large edits, or when it's multiple files or something, it's actually a little bit prohibitive to review these diffs. And so there are a couple of different ideas here.
One idea that we have is, okay, parts of the diffs are important. They have a lot of information. And then parts of the diff are just very low entropy. They're the same thing over and over again. And so maybe you can highlight the important pieces and then gray out the not so important pieces. Or maybe you can have a model that
looks at the diff and sees, oh, there's a likely bug here, I will mark this with a little red squiggly and say, you should probably review this part of the diff. And ideas in that vein, I think, are exciting.
Yeah, that's a really fascinating space of UX design engineering. So you're basically trying to guide the human programmer through all the things they need to read and nothing more.
Yeah.
Like optimally.
Yeah, and you want an intelligent model to do it. Like, currently, diff algorithms are just, like, normal algorithms. There is no intelligence. There's, like, intelligence that went into designing the algorithm, but then there's no... like, you don't care if it's about this thing or that thing, and you want a model to do this.
So I think the general question is, like, man, these models are going to get much smarter. As the models get much smarter, the changes they will be able to propose are much bigger. So as the changes get bigger and bigger and bigger, the humans have to do more and more and more verification work. It gets more and more and more hard. Like, it's just, you need to help them out.
It's sort of, I don't want to spend all my time reviewing code.
Can you say a little more about diffs across multiple files?
Yeah, I mean, so GitHub tries to solve this, right, with code review. When you're doing code review, you're reviewing multiple diffs across multiple files. But like Arvid said earlier, I think you can do much better than code review. You know, code review kind of sucks. Like, you spend a lot of time trying to grok this code that's often quite unfamiliar to you, and...
it often doesn't even actually catch that many bugs. And I think you can significantly improve that review experience using language models, for example, using the kinds of tricks that Arvid had described of maybe pointing you towards the regions that actually matter.
I think also if the code is produced by these language models and it's not produced by someone else... like, the code review experience is designed for both the reviewer and the person that produced the code. In the case where the person that produced the code is the language model, you don't have to care that much about their experience.
And you can design the entire thing around the reviewers such that the reviewer's job is as fun, as easy, as productive as possible. And I think that feels like the issue with just kind of naively trying to make these things look like code review. I think you can be a lot more creative and push the boundary on what's possible.
Just one idea there is I think ordering matters. Generally, when you review a PR, you have this list of files and you're reviewing them from top to bottom. But actually, you actually want to understand this part first, because that came logically first. And then you want to understand the next part. And you don't want to have to figure out that yourself.
You want a model to guide you through the thing.
And is the step of creation going to be more and more natural language is the goal versus with actual writing?
I think sometimes. I don't think it's going to be the case that all of programming will be natural language. And the reason for that is, you know, if I'm pair programming with Sualeh and Sualeh is at the computer and the keyboard. And sometimes, if I'm driving, I want to say to Sualeh, hey, implement this function. And that works.
And then sometimes it's just so annoying to explain to Sualeh what I want him to do. And so I actually take over the keyboard and I show him. I write part of the example. And then... it makes sense. And that's the easiest way to communicate. And so I think that's also the case for AI.
Sometimes the easiest way to communicate with the AI will be to show an example, and then it goes and does the thing everywhere else. Or sometimes if you're making a website, for example, the easiest way to show to the AI what you want is not to tell it what to do, but drag things around or draw things. And Yeah.
And like maybe eventually we will get to like brain machine interfaces or whatever and kind of like understand what you're thinking. And so I think natural language will have a place. I think it will not definitely not be the way most people program most of the time.
I'm really feeling the AGI with this editor. It feels like there's a lot of machine learning going on underneath. Tell me about some of the ML stuff that makes it all work.
Well, Cursor really works via this ensemble of custom models that we've trained alongside the frontier models that are fantastic at the reasoning-intense things. And so Cursor Tab, for example, is a great example of where you can specialize this model to be even better than even frontier models, if you look at evals on the task we set it at.
The other domain, which it's kind of surprising that it requires custom models, but it's kind of necessary and works quite well, is in apply. So I think these models, like the frontier models, are quite good at sketching out plans for code and generating, like, rough sketches of, like, the change. But actually creating diffs is quite hard for frontier models.
You try to do this with Sonnet, with O1, any frontier model, and it really messes up stupid things like counting line numbers, especially in super, super large files. And so what we've done to alleviate this is we let the model kind of sketch out this rough code block that indicates what the change will be. And we train a model to then apply that change to the file.
And we should say that apply is: the model looks at your code, it gives you a really damn good suggestion of what new things to do. And the seemingly, for humans, trivial step of combining the two, you're saying, is not so trivial.
Contrary to popular perception. It is not a deterministic algorithm.
Yeah. I think like you see shallow copies of apply, um, elsewhere and it just breaks like most of the time, because you think you can kind of try to do some deterministic matching and then it fails, you know, at least 40% of the time. And that just results in a terrible product experience. Um, I think in general, this regime of you are going to get smarter and smarter models.
So one other thing that Apply lets you do is it lets you use fewer tokens with the most intelligent models. This is both expensive in terms of latency for generating all these tokens and cost. So you can give this very, very rough sketch and then have your small models go and implement it because it's a much easier task to implement this very, very sketched out code.
And I think that this regime will continue where you can use smarter and smarter models to do the planning and then maybe the implementation details can be handled by the less intelligent ones. Perhaps you'll have, you know, maybe o1, maybe it'll be even more capable models, given an even higher-level plan that is kind of recursively implemented and applied by Sonnet and an apply model.
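To make that division of labor concrete, here is a hypothetical sketch of the plan-then-apply pattern being described. The model names and the callModel signature are placeholders for whatever inference API you use; this is not Cursor's implementation.

```typescript
// Hypothetical sketch of the "plan with a big model, apply with a small model"
// pattern described above. Nothing here is Cursor's actual code.

type CallModel = (model: string, prompt: string) => Promise<string>;

async function editFile(
  callModel: CallModel,
  fileContents: string,
  instruction: string
): Promise<string> {
  // 1. A frontier model sketches the change as a rough code block,
  //    without worrying about exact line numbers or the untouched regions.
  const sketch = await callModel(
    "frontier-model", // placeholder name
    `File:\n${fileContents}\n\nInstruction: ${instruction}\n` +
      `Write a rough code block showing only the changed region.`
  );

  // 2. A smaller, cheaper "apply" model merges the sketch into the full file,
  //    producing the complete rewritten file that can then be diffed.
  const newFile = await callModel(
    "apply-model", // placeholder name
    `Original file:\n${fileContents}\n\nProposed change:\n${sketch}\n` +
      `Rewrite the full file with this change applied.`
  );

  return newFile;
}
```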
Maybe we should talk about how to make it fast. Yeah.
Fast is always an interesting detail. Fast is good.
Yeah. How do you make it fast?
Yeah, so one big component of making it fast is speculative edits. So speculative edits are a variant of speculative decoding. And maybe it'd be helpful to briefly describe speculative decoding. With speculative decoding, what you do is you can kind of take advantage of the fact that most of the time, and I'll add the caveat that it would be when you're memory bound in language model generation.
If you... process multiple tokens at once, it is faster than generating one token at a time. So this is the same reason why if you look at tokens per second with prompt tokens versus generated tokens, it's much, much faster for prompt tokens.
So what we do is instead of using what speculative decoding normally does, which is using a really small model to predict these draft tokens that your larger model will then go in and verify, With code edits, we have a very strong prior of what the existing code will look like. And that prior is literally the same exact code.
So what you can do is you can just feed chunks of the original code back into the model. And then the model will just pretty much agree most of the time that, okay, I'm just going to spit this code back out. And so you can process all of those lines in parallel. And you just do this with sufficiently many chunks. And then eventually you'll reach a point of disagreement.
where the model will now predict text that is different from the ground truth original code. It'll generate those tokens, and then we kind of will decide after enough tokens match the original code to restart speculating in chunks of code. What this actually ends up looking like is just a much faster version of normal editing code.
So it looks like a much faster version of the model rewriting all the code. So we can use the same exact interface, that we use for diffs, but it will just stream down a lot faster.
And then the advantage is that while it's streaming, you can just also start reviewing the code before it's done, so there's no big loading screen. So maybe that is part of the advantage.
So the human can start reading before the thing is done.
I think the interesting riff here is something like, like speculation is a fairly common idea nowadays. It's like not only in language models, I mean, there's obviously speculation in CPUs and there's like speculation for databases and speculation all over the place.
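A rough sketch of the speculative-edits control flow described above, assuming a hypothetical batched verifyChunk / generateUntilMatch interface (neither is a real API): the original file itself serves as the draft, chunks of it are verified in parallel, and normal token-by-token decoding only happens where the edit actually diverges.

```typescript
// Hypothetical sketch of speculative edits as described above.
// verifyChunk stands in for a batched forward pass that checks how many of the
// draft tokens the model would emit verbatim; generateUntilMatch stands in for
// normal decoding that stops once the output re-converges with the original code
// (and is assumed to return at least one token when there is a divergence).

interface EditModel {
  verifyChunk(prefix: string[], draft: string[]): Promise<number>;
  generateUntilMatch(prefix: string[], original: string[]): Promise<string[]>;
}

async function speculativeEdit(
  model: EditModel,
  originalTokens: string[],
  chunkSize = 32
): Promise<string[]> {
  const output: string[] = [];
  let i = 0;

  while (i < originalTokens.length) {
    // Use the original code as the draft: most chunks are unchanged, so they
    // can be verified in one parallel pass instead of decoded one token at a time.
    const draft = originalTokens.slice(i, i + chunkSize);
    const agreed = await model.verifyChunk(output, draft);
    output.push(...draft.slice(0, agreed));
    i += agreed;

    if (agreed < draft.length) {
      // Point of disagreement: fall back to normal decoding until the model's
      // output lines up with the original code again, then resume speculating.
      const generated = await model.generateUntilMatch(output, originalTokens.slice(i));
      output.push(...generated);
      // Advancing by generated.length is approximate; a real implementation
      // would realign against the original tokens after re-convergence.
      i += generated.length;
    }
  }
  return output;
}
```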
Let me ask the ridiculous question of which LLM is better at coding. GPT, Claude, who wins in the context of programming? And I'm sure the answer is much more nuanced because it sounds like every single part of this involves a different model.
Yeah, I think there's no model that Pareto dominates others, meaning it is better in all categories that we think matter. The categories being speed, ability to edit code, ability to process lots of code, long context, you know, a couple of other things, and kind of coding capabilities. The one that I'd say right now is just kind of net best is Sonnet. I think this is a consensus opinion.
o1's really interesting and it's really good at reasoning. So if you give it really hard, uh, programming-interview-style problems or LeetCode problems, it can do quite, quite well on them. But it doesn't feel like it kind of understands your rough intent as well as Sonnet does.
Like, if you look at a lot of the other frontier models, one qualm I have is it feels like they're not necessarily... I'm not saying they train on benchmarks, but they perform really well in benchmarks relative to kind of everything that's kind of in the middle.
So if you try it in all these benchmarks and things that are in the distribution of the benchmarks they're evaluated on, you know, they'll do really well. But when you push them a little bit outside of that, Sonnet's I think the one that kind of does best at kind of maintaining that same capability.
Like you kind of have the same capability in the benchmark as when you try to instruct it to do anything with coding.
What, another ridiculous question, is the difference between the normal programming experience versus what benchmarks represent? Like where do benchmarks fall short, do you think, when we're evaluating these models?
By the way, that's a really, really hard and, like, critically important detail: how different benchmarks are versus real coding. Where real coding, it's not interview-style coding. You're doing these... you know, humans are saying, like, half-broken English sometimes, and sometimes you're saying, like, oh, do what I did before. Sometimes you're saying...
you know, go add this thing and then do this other thing for me and then make this UI element. And then, you know, it's just like a lot of things are sort of context dependent. You really want to like understand the human and then do what the human wants as opposed to sort of this, maybe the way to put it is sort of abstractly is the interview problems are very well-specified.
they lean a lot on specification while the human stuff is less specified. Yeah.
I think that this benchmark question is both complicated by what Sualeh just mentioned, and then also, to what Aman was getting into, there's this problem of, like, the skew between what you can actually model in a benchmark versus real programming. And that can sometimes be hard to encapsulate, because real programming is, like, very messy.
And sometimes things aren't super well specified what's correct or what isn't. But then it's also doubly hard because of this public benchmark problem. And that's both because public benchmarks are sometimes kind of hill-climbed on, but then it's really, really hard to also get the data from the public benchmarks out of the models.
And so, for instance, one of the most popular agent benchmarks, SWE-bench, is really, really contaminated in the training data of these foundation models. And so if you ask these foundation models to do a SWE-bench problem, but you actually don't give them the context of a code base, they can, like, hallucinate the right file paths, they can hallucinate the right function names.
And so it's also just the public aspect of these things is tricky.
Yeah, like in that case, it could be trained on the literal issues or pull requests themselves. And maybe the labs will start to do a better job, or they've already done a good job, at decontaminating those things. But they're not going to omit the actual training data of the repository itself. Like, these are all some of the most popular Python repositories. SymPy is one example.
I don't think they're going to handicap their models on SymPy and all these popular Python repositories in order to get true evaluation scores in these benchmarks.
I think that, given the dearth of benchmarks, there have been a few interesting crutches that places that build systems with these models, or build these models, actually use to get a sense of whether they're going in the right direction or not. And in a lot of places, people will actually just have humans play with the things and give qualitative feedback on these things.
like one or two of the foundation model companies, they have people who, that's a big part of their role. And, you know, internally we also, you know, qualitatively assess these models and actually lean on that a lot in addition to like private evals that we have. It's like the vibe. The vibe, yeah. It's like the vibe.
The vibe benchmark, human benchmark. Yeah. You pull in the humans to do a vibe check. Yeah. Okay. I mean, that's kind of what I do, like, just reading online forums and Reddit and X. Well, I don't know how to properly load in people's opinions, because they'll say things like, I feel like Claude or GPT has gotten dumber or something.
They'll say, I feel like, and then I sometimes feel like that too, but I wonder if it's the model's problem or mine.
Yeah, with Claude, there's an interesting take I heard where I think AWS has different chips. And I suspect they have slightly different numerics than NVIDIA GPUs. And someone speculated that Claude's degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus whatever was running on Anthropic's GPUs.
I interview a bunch of people that have conspiracy theories, so I'm glad you spoke to this conspiracy theory.
Well, it's not, like, conspiracy theory as much. They're just... you know, humans are humans, and there's these details, and, you know, you're doing, like, this crazy amount of flops, and, you know, chips are messy, and, man, you can just have bugs. Like, bugs are... it's hard to overstate how hard bugs are to avoid. Yeah.
What's the role of a good prompt in all of this? We mentioned that benchmarks have really structured, well-formulated prompts. What should a human be doing to maximize success? And what's the importance of what the human... You wrote a blog post on this; you called it prompt design.
Yeah, I think it depends on which model you're using. And all of them are slightly different and they respond differently to different prompts. But I think the original GPT-4 and the original sort of pre-double models last year, they were quite sensitive to the prompts. And they also had a very small context window.
And so we have all of these pieces of information around the code base that would maybe be relevant in the prompt. Like you have the docs, you have the files that you add, you have the conversation history. And then there's a problem like how do you decide what you actually put in the prompt and when you have a limited space.
And even for today's models, even when you have long context, filling out the entire context window means that it's slower. It means that sometimes the model actually gets confused and some models get more confused than others. And we have this one system internally that we call pre-empt, which helps us with that a little bit.
And I think it was built for the era before where we had 8,000 token context windows. And it's a little bit similar to when you're making a website. You want it to work on mobile. You want it to work on a desktop screen. And you have this dynamic information, which you don't have, for example, if you're designing a print magazine. You know exactly where you can put stuff.
But when you have a website or when you have a prompt, you have these inputs. And then you need to format them to always work. Even if the input is really big, then you might have to cut something down. And so the idea was, okay, let's take some inspiration. What's the best way to design websites?
Well, the thing that we really like is React and the declarative approach where you use JSX in JavaScript, and then you declare, this is what I want, and I think this has higher priority, or this has higher z-index than something else. And then you have this rendering engine. In web design, it's like Chrome, and in our case, it's a preempt renderer, which then fits everything onto the page.
And as you clearly decide what you want, and then it figures out what you want. And so we have found that to be quite helpful. And I think the role of it has sort of shifted over time, where initially it was to fit to these small context windows. Now it's really useful because it helps us with...
splitting up the data that goes into the prompt and the actual rendering of it. And so it's easier to debug, because you can change the rendering of the prompt and then try it on old prompts, because you have the raw data that went into that prompt, and then you can see, did my change actually improve it for, like, this entire eval set? So do you literally prompt with JSX? Yes, yes.
So it kind of looks like React. There are components. We have one component that's a file component, and it takes in the cursor. Usually there's one line where the cursor is in your file, and that's probably the most important line because that's the one you're looking at. And so then you can give priorities.
So that line has the highest priority, and then you subtract one for every line that is farther away. And then eventually when it's rendered, it figures out how many lines can actually fit, and it centers around that thing.
That's amazing. And you can do, like, other fancy things where if you have lots of code blocks from the entire code base, you could use retrieval and things like embedding and re-ranking scores to add priorities for each of these components.
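As a toy illustration of that declarative, priority-based rendering idea (my own sketch, not the actual preempt implementation): components emit pieces with priorities, and a renderer keeps the highest-priority pieces that fit the token budget, emitted in their original order.

```typescript
// Toy sketch of a priority-based prompt renderer in the spirit of what is
// described above (not the actual preempt implementation).

interface PromptPiece {
  text: string;
  priority: number; // higher = more important, like a z-index for prompt content
}

// Hypothetical file component: the cursor line gets the highest priority, and
// priority decays by one for each line farther away from the cursor.
function fileComponent(lines: string[], cursorLine: number, base = 100): PromptPiece[] {
  return lines.map((text, i) => ({
    text,
    priority: base - Math.abs(i - cursorLine),
  }));
}

// Renderer: greedily keep the highest-priority pieces that fit the budget,
// then emit the survivors in their original order.
function renderPrompt(pieces: PromptPiece[], tokenBudget: number): string {
  const estimateTokens = (s: string) => Math.ceil(s.length / 4); // crude heuristic
  const indexed = pieces.map((p, order) => ({ ...p, order }));
  indexed.sort((a, b) => b.priority - a.priority);

  const kept: { order: number; text: string }[] = [];
  let used = 0;
  for (const piece of indexed) {
    const cost = estimateTokens(piece.text);
    if (used + cost > tokenBudget) continue;
    used += cost;
    kept.push(piece);
  }
  kept.sort((a, b) => a.order - b.order);
  return kept.map((p) => p.text).join("\n");
}
```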
So should humans, when they ask questions, also try to use something like that? Like, would it be beneficial to write JSX in the problem? Or the whole idea is it should be loose and messy and...
I think our goal is kind of that you should just do whatever is the most natural thing for you, and then our job is to figure out how do we actually retrieve the relevant things so that your thing actually makes sense.
Well, this is sort of the discussion I had with Aravind of Perplexity. It's like his whole idea is, like, you should let the person be as lazy as he wants.
Yeah.
But like, yeah, that's a beautiful thing. But I feel like you're allowed to ask more of programmers, right? Yes. So like if you say just do what you want, I mean, humans are lazy. There's a kind of tension between just being lazy versus like provide more as –
be prompted, almost like the system pressuring you or inspiring you to be articulate, not in terms of the grammar of the sentences, but in terms of the depth of thoughts that you convey inside the prompts.
I think even as the system gets closer to some level of perfection, Often when you ask the model for something, not enough intent is conveyed to know what to do. And there are a few ways to resolve that intent. One is the simple thing of having the model just ask you, I'm not sure how to do these parts based on your query. Could you clarify that? I think the other could be maybe...
If there are five or six possible generations given the uncertainty present in your query so far, why don't we just actually show you all of those and let you pick them?
How hard is it for the model to choose to talk back, sort of, versus generating?
It's hard. It's sort of like, how do you deal with the uncertainty? Do I choose to ask for more information to reduce the ambiguity?
So, I mean, one of the things we do, it's a recent addition, is try to suggest files that you can add. So while you're typing, one can guess what the uncertainty is and maybe suggest something. Like, maybe you're writing your API, and we can guess, using the commits that you've made previously in the same file, that the client and the server would be super useful to include.
And there's a hard technical problem of how you resolve it across all commits: which files are the most important given your current prompt? A sort of initial version of it is rolled out, and I'm sure we can make it much more accurate. It's very experimental.
But then the idea is we show you: do you just want to add this file, this file, this file, and also tell the model to edit those files for you? Because maybe if you're making the API, you should also edit the client that is using the API and the server that is resolving it.
So that would be kind of cool, because there's the phase where you're writing the prompt, and before you even click enter, maybe we can help resolve some of the uncertainty.
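As an illustration of the commit-history idea (not Cursor's actual implementation; the types and function names are invented), a simple co-change heuristic could look like this:

```typescript
// Hypothetical sketch: suggest files that were frequently changed together
// with the file you're editing (e.g. a client and server touched in the same
// commits). This is an illustration of the idea, not Cursor's real system.

type Commit = { changedFiles: string[] };

function suggestRelatedFiles(history: Commit[], currentFile: string, topK = 3): string[] {
  const coChangeCounts = new Map<string, number>();
  for (const commit of history) {
    if (!commit.changedFiles.includes(currentFile)) continue;
    for (const file of commit.changedFiles) {
      if (file === currentFile) continue;
      coChangeCounts.set(file, (coChangeCounts.get(file) ?? 0) + 1);
    }
  }
  return [...coChangeCounts.entries()]
    .sort((a, b) => b[1] - a[1]) // most frequently co-changed first
    .slice(0, topK)
    .map(([file]) => file);
}

// e.g. suggestRelatedFiles(history, "api/routes.ts") might return files like
// a client and a server handler, which the UI can then offer to add to the
// prompt so the model can edit them too.
```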
To what degree do you use agentic approaches? How useful are agents?
We think agents are really, really cool. An agent resembles, sort of, a human. You can kind of feel that you're getting closer to AGI, because you see a demo where it acts as a human would, and it's really, really cool. I think agents are not yet super useful for many things, though I think we're getting close to where they will actually be useful.
And so I think there are certain types of tasks where having an agent would be really nice. I would love to have an agent. For example, we have a bug where you sometimes can't command C and command V inside our chat input box, and that's a task that's super well specified. I just want to say in two sentences, this does not work, please fix it.
And then I would love to have an agent that just goes off, does it, and then a day later I come back and I review the thing.
You mean it goes, finds the right file?
Yeah, it finds the right files, it tries to reproduce the bug, it fixes the bug, and then it verifies that it's correct. And this could be a process that takes a long time. And so I think I would love to have that. And then I think a lot of programming, there is often this belief that agents will take over all of programming.
I don't think we think that's the case, because with a lot of programming, a lot of the value is in iterating. You don't actually want to specify something upfront, because you don't really know what you want until you've seen an initial version; then you want to iterate on that, and then you provide more information.
And so for a lot of programming, I think you actually want a system that's instant that gives you an initial version instantly back and then you can iterate super, super quickly.
What about something like that recently came out, Replit Agent, that does also like setting up the development environment, installing software packages, configuring everything, configuring the databases, and actually deploying the app?
Yeah.
Is that also in the set of things you dream about?
I think so. I think that would be really cool. For certain types of programming, it would be really cool.
Is that within scope of Cursor?
Yeah, we aren't actively working on it right now. But it's definitely like, we want to make the programmer's life easier and more fun. And some things are just really tedious and you need to go through a bunch of steps and you want to delegate that to an agent. And then some things you can actually have an agent in the background while you're working.
Like, let's say you have a PR that's both backend and frontend, and you're working on the frontend; then you can have a background agent that does some work and figures out kind of what you're doing. And when you get to the backend part of your PR, you have some initial piece of code that you can iterate on. So that would also be really cool.
One of the things we already talked about is speed, but I wonder if we can just linger on that some more, and on the various technical details involved in making this thing really fast. Every single aspect of Cursor, well, most aspects of Cursor, feel really fast. Like I mentioned, the apply is probably the slowest thing. And for me, sorry, that's the pain.
It's a pain. It's a pain that we're feeling and we're working on fixing it.
Yeah. Yeah, I mean, it says something that a delay of, I don't know, one or two seconds feels slow. That actually shows that everything else is just really, really fast. So are there some technical details about how you keep some of these models hot, how you make the chat fast, how you make the diffs fast? Is there something that just jumps to mind?
Yeah, I mean, we can go over a lot of the strategies that we use. One interesting thing is cache warming. So what you can do is, as the user is typing, you know you're probably going to use some piece of context, and you can know that before the user is done typing. So, as we discussed before, reusing the KV cache results in lower latency and lower cost across requests.
So as the user starts typing, you can immediately warm the cache with, let's say, the current file contents. Then, when they press enter, there are very few tokens it actually has to prefill and compute before starting the generation. This will significantly lower TTFT (time to first token).
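Roughly, the client-side shape of cache warming might look like this; `warm`, `complete`, and `buildPrefix` are hypothetical stand-ins, not a real Cursor or inference-server API:

```typescript
// Hypothetical sketch of cache warming: as the user types, send the likely
// prompt prefix (e.g. the current file) so the server can prefill its KV cache.
// `warm` and `complete` are made-up stand-ins for whatever prefix-caching
// interface the inference server actually exposes.

interface InferenceClient {
  warm(prefix: string): Promise<void>;                        // prefill KV cache for this prefix
  complete(prefix: string, suffix: string): Promise<string>;  // reuse the cached prefix
}

function buildPrefix(fileContents: string): string {
  return `You are a coding assistant.\n\nCurrent file:\n${fileContents}\n\nUser request:\n`;
}

function onUserTyping(inference: InferenceClient, currentFileContents: string): void {
  // Fire-and-forget: by the time the user presses enter, the prefix is already
  // prefilled, so only the few new tokens need to be computed (lower TTFT).
  void inference.warm(buildPrefix(currentFileContents));
}

async function onSubmit(
  inference: InferenceClient,
  currentFileContents: string,
  userQuery: string
): Promise<string> {
  return inference.complete(buildPrefix(currentFileContents), userQuery);
}
```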
Can you explain how KV cache works?
Yeah. So, the way transformers work, the mechanism that allows transformers to not just independently look at each token, but to see previous tokens, are the keys and values in attention.
Generally, the way attention works is that at your current token you have some query, and then you have the keys and values of all your previous tokens, which are some kind of representation that the model stores internally of all the previous tokens in the prompt. By default, when you're doing a chat, the model has to, for every single token, do this forward pass through the entire model.
That's a lot of matrix multiplies that happen, and that is really, really slow. Instead, if you have already done that, and you stored the keys and values, and you keep that in the GPU...
Then, let's say I have stored the keys and values for the last n tokens. If I now want to compute the output for the n-plus-one-th token, I don't need to pass those first n tokens through the entire model, because I already have all those keys and values. So you just need to do the forward pass for that last token.
And then when you're doing attention, you're reusing those keys and values that have been computed, which is the only kind of sequential part or sequentially dependent part of the transformer.
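To make that concrete, here is a toy single-head attention decode step in TypeScript (scalar math on plain arrays, projections omitted), just to illustrate why, with cached keys and values, each new token only needs its own forward pass:

```typescript
// Toy single-head attention decode step with a KV cache.
// Vectors are plain number[] arrays; real models would apply learned key/value
// projections, which are omitted here for brevity.

type Vec = number[];

const dot = (a: Vec, b: Vec) => a.reduce((s, x, i) => s + x * b[i], 0);

function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - m));
  const sum = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / sum);
}

interface KVCache {
  keys: Vec[];
  values: Vec[];
}

// Process ONE new token: append its key/value to the cache, then attend over
// everything cached so far. Earlier tokens are never re-processed.
function decodeStep(cache: KVCache, newKey: Vec, newValue: Vec, query: Vec): Vec {
  cache.keys.push(newKey);
  cache.values.push(newValue);

  const scale = 1 / Math.sqrt(query.length);
  const scores = cache.keys.map((k) => dot(query, k) * scale);
  const weights = softmax(scores);

  // Weighted sum of cached values = the attention output for the new token.
  return cache.values.reduce(
    (out, v, i) => out.map((x, d) => x + weights[i] * v[d]),
    new Array<number>(query.length).fill(0)
  );
}
```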
Is there, like, higher-level caching, like caching of the prompts or that kind of stuff, that could help?
Yeah, there are other types of caching you can do. One interesting thing you can do for Cursor Tab is you can basically predict ahead, as if the user had accepted the suggestion, and then trigger another request. So then you've cached... it's a mix of speculation and caching, right?
Because you're speculating about what would happen if they accepted it. Then you have this value that is cached, this suggestion, and when they press tab, the next one is waiting for them immediately. It's a kind of clever heuristic-slash-trick that uses higher-level caching, and it feels fast despite there not actually being any changes in the model.
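A minimal sketch of that speculation-plus-caching trick, with a hypothetical `suggest` function standing in for the model call:

```typescript
// Hypothetical sketch of speculative caching for tab suggestions: while the
// current suggestion is on screen, pretend it was accepted and request the
// next one, so it is already waiting when the user actually presses tab.

type Suggest = (documentText: string) => Promise<string>; // the model call (assumed)

const suggestionCache = new Map<string, Promise<string>>();

function getSuggestion(suggest: Suggest, documentText: string): Promise<string> {
  let pending = suggestionCache.get(documentText);
  if (!pending) {
    pending = suggest(documentText);
    suggestionCache.set(documentText, pending);
  }
  return pending;
}

async function showSuggestion(suggest: Suggest, documentText: string): Promise<string> {
  const suggestion = await getSuggestion(suggest, documentText);
  // Speculate: assume the user accepts this suggestion and warm the cache for
  // that future document state, so the follow-up suggestion appears instantly.
  void getSuggestion(suggest, documentText + suggestion);
  return suggestion;
}

// On tab-accept, showSuggestion(suggest, documentText + accepted) usually
// resolves straight from the cache.
```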
And if you can make the KV cache smaller, one of the advantages you get is that maybe you can speculate even more. Maybe you can guess: here are the 10 things that could be useful. Predict the next 10, and it's possible the user hits one of those 10. That's a much higher chance than the user hitting the exact one that you show them.
Maybe they type another character, and we sort of hit something else in the cache. So there are all these tricks. The general phenomenon here, and I think it's also super useful for RL, is that maybe a single sample from the model isn't very good, but if you predict, like, 10 different things, the probability that one of the 10 is right is much higher.
There are these pass@k curves. And, you know, what RL lets you do is exploit this pass@k phenomenon by making many different predictions. One way to think about this: the model sort of knows internally, has some uncertainty over, which of the k things is correct, or which of the k things the human wants.
So when we RL our Cursor Tab model, one of the things we're doing is predicting which of the hundred different suggestions the model produces is more amenable to humans. Like, which of them do humans like more than the others? Maybe there's something where the model can predict very far ahead, versus a little bit, or maybe somewhere in the middle, and...
Then you can give a reward to the things that humans would like more, and sort of punish the things that they won't like, and then train the model to output the suggestions that humans would like more. You have these RL loops that are very useful, that exploit these pass@k curves. Aman can maybe go into even more detail.
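For reference, the standard unbiased pass@k estimator from the code-generation literature (given n samples, c of which are correct) can be computed like this; the example numbers are made up:

```typescript
// Unbiased pass@k estimator: given n samples of which c are correct, the
// probability that at least one of k randomly chosen samples is correct is
//   pass@k = 1 - C(n - c, k) / C(n, k)
// Computed as a running product to avoid huge binomial coefficients.

function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // not enough incorrect samples to fill all k picks
  let probAllWrong = 1.0;
  for (let i = 0; i < k; i++) {
    probAllWrong *= (n - c - i) / (n - i);
  }
  return 1 - probAllWrong;
}

// Example (made-up numbers): 100 samples, 5 correct.
console.log(passAtK(100, 5, 1));  // 0.05  -- a single sample is rarely right
console.log(passAtK(100, 5, 10)); // ~0.42 -- one of 10 is right far more often
```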
Yeah, it is a little different than speed, but technically you can tie it back in, because you can get away with a smaller model if you RL your smaller model and it gets the same performance as the bigger one. And while I was mentioning stuff about reducing the size of your KV cache, there are other techniques there as well that are really helpful for speed.
So kind of back in the day, like all the way two years ago, people mainly used multi-head attention. And I think there's been a migration toward more efficient attention schemes, like grouped-query or multi-query attention. This is really helpful, with larger batch sizes, for being able to generate the tokens much faster.
The interesting thing here is that this has no effect on the time-to-first-token, the prefill speed. The thing this matters for is generating tokens. And why is that? Because when you're generating tokens, instead of being bottlenecked by doing these super-parallelizable matrix multiplies across all your tokens...