
The Twenty Minute VC (20VC): Venture Capital | Startup Funding | The Pitch

20VC: Why Google Will Win the AI Arms Race & OpenAI Will Not | NVIDIA vs AMD: Who Wins and Why | The Future of Inference vs Training | The Economics of Compute & Why To Win You Must Have Product, Data & Compute with Steeve Morin @ ZML

Mon, 24 Feb 2025

Description

Steeve Morin is the Founder & CEO @ ZML, a next-generation inference engine enabling peak performance on a wide range of chips. Prior to founding ZML, Steeve was the VP Engineering at Zenly for 7 years, leading engineering to millions of users and an acquisition by Snap.

In Today's Episode We Discuss:

04:17 How Will Inference Change and Evolve Over the Next 5 Years
09:17 Challenges and Innovations in AI Hardware
15:38 The Economics of AI Compute
18:01 Training vs. Inference: Infrastructure Needs
25:08 The Future of AI Chips and Market Dynamics
34:43 NVIDIA's Market Position and Competitors
38:18 Challenges of Incremental Gains in the Market
39:12 The Zero Buy-In Strategy
39:34 Switching Between Compute Providers
40:40 The Importance of a Top-Down Strategy for Microsoft and Google
41:42 Microsoft's Strategy with AMD
45:50 Data Center Investments and Training
46:40 How to Succeed in AI: The Triangle of Products, Data, and Compute
48:25 Scaling Laws and Model Efficiency
49:52 Future of AI Models and Architectures
57:08 Retrieval Augmented Generation (RAG)
01:00:52 Why OpenAI's Position is Not as Strong as People Think
01:06:47 Challenges in AI Hardware Supply

Transcription

Chapter 1: Who is Steeve Morin and what does ZML do?

00:47 - 01:10 Harry Stebbings

Steeve is the founder of ZML, a next-generation inference engine enabling peak performance on a wide range of chips. Literally the perfect speaker for this topic. And this was a super nerdy show. It was probably the most information-dense episode we've done in a long time. So do slow it down, pause it, get a notebook out. But wow, there is so much gold in this one.

01:10 - 01:31 Harry Stebbings

But before we dive in today, turning your back of a napkin idea into a billion-dollar startup requires countless hours of collaboration and teamwork. It can be really difficult to build a team that's aligned on everything from values to workflow, but that's exactly what Coda was made to do. Coda is an all-in-one collaborative workspace that started as a napkin sketch.

01:31 - 01:55 Harry Stebbings

Now, just five years since launching in beta, Coda has helped 50,000 teams all over the world get on the same page. Now, at 20VC, we've used Coda to bring structure to our content planning and episode prep, and it's made a huge difference. Instead of bouncing between different tools, we can keep everything from guest research to scheduling and notes all in one place, which saves us so much time.

01:55 - 02:12 Harry Stebbings

With Coda, you get the flexibility of docs, the structure of spreadsheets, and the power of applications, all built for enterprise. And it's got the intelligence of AI, which makes it even more awesome. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time.

00:00 - 00:00 Harry Stebbings

To try it for yourself, go to Coda.io slash 20VC today and get six free months of the team plan for startups. That's Coda.io slash 20VC to get started for free and get six free months of the team plan. Now that your team is aligned and collaborating, let's tackle those messy expense reports.

00:00 - 00:00 Harry Stebbings

You know, those receipts that seem to multiply like rabbits in your wallet, the endless email chains asking, can you approve this? Don't even get me started on the month-end panic when you realize you have to reconcile it all. Well, PLEO offers smart company cards: physical, virtual, and vendor-specific, so teams can buy what they need while finance stays in control.

00:00 - 00:00 Harry Stebbings

Automate your expense reports, process invoices seamlessly, and manage reimbursements effortlessly, all in one platform. With integrations to tools like Xero, QuickBooks, and NetSuite, PLEO fits right into your workflow, saving time and giving you full visibility over every entity, payment, and subscription. Join over 37,000 companies already using PLEO to streamline their finances.

00:00 - 00:00 Harry Stebbings

Try PLEO today. It's like magic, but with fewer rabbits. Find out more at pleo.io forward slash 20VC. And don't forget to revolutionize how your team works together with Roam. A company of tomorrow runs at hyper speed with quick drop-in meetings. A company of tomorrow is globally distributed and fully digitized. The company of tomorrow instantly connects human and AI workers.

00:00 - 00:00 Harry Stebbings

A company of tomorrow is in a Roam virtual office. See a visualization of your whole company, the live presence, the drop-in meetings, the AI summaries, the chats. It's an incredible view to see. Roam is a breakthrough workplace experience loved by over 500 companies of tomorrow for a fraction of the cost of Zoom and Slack. Visit Roam, that's R-O dot A-M, for an instant demo of Roam today.

Chapter 2: How does NVIDIA's strategy differ from AMD's in AI?

05:49 - 06:05 Steeve Morin

Probably the number one, you know, I would say obvious thing would be that if you ask a model to generate an image, then it will, you know, switch to a diffusion model, right? Not an LLM. And there's many, many more tricks. The turbo models at OpenAI do that. There's a lot of tricks.

06:05 - 06:27 Steeve Morin

So definitely, models in the sense of getting, you know, weights and running them is something that is ultimately going away, you know, in favor of full-blown backends, right? You feel like you're talking to a model, but ultimately you're talking to an API. The thing is, that API will be running locally in your own cloud instances and so on.

06:28 - 06:44 Harry Stebbings

So we will have a world where we're switching between models, and there's kind of this trickery around them. OK, perfect. So we've got that at the top, then we've got ZML in the middle, and then you said, and then on any hardware. So will we be using multiple hardware providers at the same time, or will we be more rigid in our hardware usage?

06:45 - 07:11 Steeve Morin

No, absolutely. You can get probably an order of magnitude more efficiency depending on the hardware you run on. That is substantial. Not a lot of people have that problem at the moment. Things are getting built as we speak. But a simple example is if you switch from NVIDIA to AMD on a 70B model, you can get four times better efficiency in terms of spend. So that is substantial.

00:00 - 00:00 Steeve Morin

That is very much substantial. Now, the problem is getting some AMD GPUs, right? I'm really sorry.

00:00 - 00:00 Harry Stebbings

If there is such a cost efficiency four times, why does everyone not do that?

00:00 - 00:00 Steeve Morin

So there's a few reasons. Probably the most important one is the PyTorch-CUDA, I would say, duo. And that's very, very hard to break. These two are very much intertwined. Can you just explain to us what PyTorch and CUDA are? Oh, yes, absolutely. Yeah, yeah. PyTorch is the ML framework that people use to build and actually train models, right?

00:00 - 00:00 Steeve Morin

You can do inference with it, but by far the most successful framework for training is PyTorch. And PyTorch was very much built on top of CUDA, which is NVIDIA software, right? Let's just say the strengths of PyTorch make it ultimately very, very bound to CUDA. So of course it runs on, you know, it runs on AMD, it runs on, you know, even Apple and so on.

00:00 - 00:00 Steeve Morin

But there were always, you know, the tens of little details that don't run exactly like, you know, you would expect, and there's work involved. But then there's also supply. So probably that's the number one thing. The second thing is there's a lot of GPUs on the market. Pretty much all of them are NVIDIA.
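
As a rough illustration of that coupling (a generic PyTorch snippet, not ZML code): most PyTorch programs hard-wire CUDA as the fast path and treat everything else as a fallback.

```python
# Typical PyTorch pattern: CUDA is assumed to be the fast path.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(1, 4096, device=device)
y = model(x)  # dispatches to cuBLAS/cuDNN kernels on NVIDIA hardware

# Other backends exist ("mps" on Apple; ROCm builds even expose themselves through
# the torch.cuda namespace), but custom kernels, memory tricks, and library
# assumptions rarely port one-to-one, which is where the "little details" bite.
```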

Chapter 3: What challenges exist in AI hardware supply?

12:23 - 12:42 Steeve Morin

I mean, they're not idiots, right? How should agents change NVIDIA's strategy? Hard to say, because NVIDIA has a very, very vertical approach. They do more of more, right? Like if you look at Blackwell, it's actually crazy what they did for Blackwell. They assembled two chips.

12:43 - 13:05 Steeve Morin

But the surface was so big that the chip started to bend a bit, which further perpetuated the problem because it then didn't make contact with the heat sink and so on. And, you know, the power envelope, they push it to a thousand watts. It requires liquid cooling and so on. So they are very much in a very vertical, foot-on-the-pedal mode in terms of GPU scaling.

13:05 - 13:16 Steeve Morin

But the thing is, GPUs are a good trick for AI, but they're not built for AI. It's not a specialized chip. It is a specialization of a GPU, but it is not an AI chip.

13:18 - 13:25 Harry Stebbings

Forgive me for continuously asking stupid questions. Why are GPUs not built for AI? And if not, what is better?

00:00 - 00:00 Steeve Morin

So the way it worked is that you can think of a screen as a matrix. And if you have to render pixels on a screen, there's a lot of pixels and everything has to happen in parallel, right? So that you don't waste time. Turns out, you know, matrices are a very important thing in AI.

00:00 - 00:00 Steeve Morin

So there was this cool trick, back, that was probably 20 years ago, where we would trick the GPU into believing it was doing graphics rendering when actually we were making it do parallel work, right? It was called GPGPU at the time, right? So it was always a cool trick. But it was not dedicated for this.

00:00 - 00:00 Steeve Morin

The pioneers probably were, of course, Google with TPU, which are very much more advanced on the architectural level. But essentially, the way they work, it kind of works for AI. But for LLMs, that starts to crack because they're so big and there's a lot of memory transfers and so on.

00:00 - 00:00 Steeve Morin

Actually, that's why Groq, not Grok but Groq, Cerebras, and all these folks achieve very high single-stream performance: it's because the data is right in the chip. They don't have to get it from memory, which is slow, which a GPU has to do. So there's a lot of these things that ultimately make it a good trick but not a dedicated solution per se.

00:00 - 00:00 Steeve Morin

That said, though, the reason probably NVIDIA won, at least in the training space, is because of Mellanox, not because of the raw compute. Because you need to run lots of these GPUs in parallel. So the interconnect between them is ultimately what matters. How fast can they exchange data? Because remember,

Chapter 4: Why might NVIDIA not dominate the inference market?

17:14 - 17:20 Harry Stebbings

With increasing competitiveness within each of those layers, do we not see margin reduction?

17:21 - 17:45 Steeve Morin

Absolutely, yes. Here's the problem, though. Let's say you are on Google Cloud and you're on TPUs. Suddenly, you just removed that 90% chunk of the spend. The problem is that, for multiple software reasons, which we are solving at ZML, they're not really, I would say, a commercial success. They are very much successful inside of Google, but not much outside of Google.

17:45 - 18:02 Steeve Morin

Amazon, same, is pushing very, very hard for their, you know, Trainium chips. So the future I see is that you use whatever, you know, your provider has, because you don't want to pay, you know, a 90% outrageous margin and try to make, you know, a profit out of that. Okay.

18:02 - 18:15 Harry Stebbings

So when we move to actually inference and training, everyone's focused so much on training. I'd love to understand what are the fundamental differences in infrastructure needs when we think about training versus inference?

00:00 - 00:00 Steeve Morin

So these two obey fundamentally different, I would say, tectonic forces. So in training, more is better. You want more of everything, essentially. And the recipe for success is the speed of iteration. You change stuff, you see how it works, and you do it again. Hopefully it converges. And it's like changing the wheel of a moving car, so to speak. So that is training.

00:00 - 00:00 Steeve Morin

On inference, it's the complete reverse. Less is better. You want less headaches. You don't want to be woken up at night, because inference is production. You could say that training is research and inference is production. And it's fundamentally different. In terms of infra, probably the number one difference between these two is the need for interconnect.

00:00 - 00:00 Steeve Morin

So if you do production, if you can avoid having interconnect between, let's say, a cluster of GPUs, of course you will avoid that if you can. And this is why models have the sizes they have. It's so that people can run them without the need to connect multiple machines together. It's very constraining in terms of the environment.

00:00 - 00:00 Steeve Morin

So that is probably the fundamental difference, the need for interconnect. And number two is, ultimately, do you really care about what your model is running on as long as it's outputting whatever you want it to output?
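
To ground that point with rough numbers, a back-of-the-envelope sketch (assuming fp16/bf16 weights at 2 bytes per parameter and a common 8 x 80 GB single-node configuration; KV cache and activations need extra headroom on top of the weights):

```python
# Why popular model sizes are chosen so a single server can hold them
# without cross-node interconnect (rough, weights-only estimate).

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB at fp16/bf16 (ignores KV cache and activations)."""
    return params_billion * bytes_per_param

single_node_hbm_gb = 8 * 80  # e.g. 8 GPUs x 80 GB HBM each = 640 GB in one box

for size in (8, 70, 405):
    gb = weights_gb(size)
    verdict = "fits on one node" if gb < single_node_hbm_gb else "needs multi-node interconnect"
    print(f"{size:>4}B model: ~{gb:,.0f} GB of weights -> {verdict}")
```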

00:00 - 00:00 Harry Stebbings

Can you just help me understand, sorry, why in training more is more and that's great, and in inference less is more? Why do we have that difference?

Chapter 5: What is latent space reasoning and its implications?

25:29 - 25:31 Harry Stebbings

And actually, that market won't be won by NVIDIA.

25:32 - 25:49 Steeve Morin

Technically speaking, he is right. But realistically speaking, I'm not sure I agree. The thing is, these chips are on the market. They're here. I can open a tab in Chrome and get one. That is something that I don't take lightly. Availability, that is, right?

25:49 - 26:15 Steeve Morin

I think NVIDIA is here to stay, at least if only for the H100 bubble bust, because these chips are going to be on the market and people will buy them and do inference with them. It remains to be seen what the OPEX and the electricity look like, etc., but... The thing is, the only chips that are really frontier in that sense are probably TPUs and then the upcoming chips.

26:15 - 26:37 Steeve Morin

But the thing is, they're great chips, but they're not on the market. Or they're at outrageous prices, like millions of dollars to run a model. So what chips are great, and why aren't they on the market? Let's say, for instance, Cerebras: incredible technology, incredibly expensive. So how will the market value the premium of having single-stream, very high tokens per second?

00:00 - 00:00 Steeve Morin

There is a value in that, right? As we saw with Mistral and Perplexity. But I think there was a loss there. I don't know, I don't have the details, but I think Cerebras put it out at a loss. So today there are three actors on the market that can deliver this. I think this will be, I would say, the pushing force for change in the inference landscape: agents and reasoning.

00:00 - 00:00 Steeve Morin

So that is very high tokens per second only for you.

00:00 - 00:00 Harry Stebbings

What is forcing the price of a Cerebras to be so high? And then you heard Jonathan at Groq on the show say that they're 80% cheaper than NVIDIA.

00:00 - 00:00 Steeve Morin

So there's this trick. Because here's the thing, there's no magic. This little trick is called SRAM. SRAM is memory on the chip directly. So that is very, very fast memory. But here's the problem with SRAM: it consumes surface on the chip, which makes it a bigger chip, which is very hard in terms of yield, right? Because the chances of problems are higher and so on.

00:00 - 00:00 Steeve Morin

So SRAM is, I would say, very, very, very fast memory, which gives you a lot of advantage when you do very, very high inference, but it's terribly expensive. And if you look at, for instance, Groq, on this generation they have 230 megabytes of SRAM per chip. A 70B model is 140 gigabytes. So you do the math, right?
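
Doing that math explicitly, a quick back-of-the-envelope sketch (assuming fp16 weights at 2 bytes per parameter and counting only weight storage, not KV cache or activations):

```python
# How many SRAM-only chips would it take just to hold a 70B model's weights?
params = 70e9                    # 70B parameters
bytes_per_param = 2              # fp16/bf16
model_bytes = params * bytes_per_param        # ~140 GB, matching the figure above

sram_per_chip = 230e6            # 230 MB of on-chip SRAM (Groq figure cited above)
chips_needed = model_bytes / sram_per_chip

print(f"Model weights: ~{model_bytes / 1e9:.0f} GB")
print(f"Chips needed just for weights: ~{chips_needed:.0f}")  # roughly 600+ chips
```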

Chapter 6: How does memory technology impact AI chip performance?

33:38 - 33:41 Harry Stebbings

Do you think NVIDIA owns both of those markets in five years' time?

33:42 - 34:06 Steeve Morin

Depends on the supply. I think that there's a shot that they don't. Because here's the thing, you know, even if we take, you know, the same amount, you know, let's imagine we have a new chip from Amazon, right, that is the same. Oh, wait, we do. It's called Trainium. You know, why would I pay NVIDIA's 90% margin if I can freely change to Trainium? My whole production runs on AWS anyway, right?

34:07 - 34:32 Steeve Morin

If you run on the cloud and you're running on NVIDIA, you're getting squeezed out of your money, right? If you're in production on dedicated chips, of course. So maybe through commoditization, but hey, I'm on AWS, I can just click and boom, it runs on AWS's chips. Who cares, right? I just run my model like I did two minutes ago.

34:33 - 34:39 Harry Stebbings

With that realization, do you think we'll see NVIDIA move up the stack and also move into the cloud and models?

00:00 - 00:00 Steeve Morin

They are. They have a product, NIM, that sort of does that. The thing with NVIDIA is that they spend a lot of energy making you care about stuff you shouldn't care about. And they were very successful. Like, who gives a shit about CUDA? I'm sorry, but I don't want to care about that, right? I want to do my stuff.

00:00 - 00:00 Steeve Morin

And NVIDIA got me into saying, hey, you should care about this because there's nothing else on the market. Well, that's not true. But ultimately, this is the GPU I have in my machine. So, you know, off I go. If tomorrow that changes, why would I pay 90% margin on my compute? That's insane. This is why I believe it ultimately goes through the software. This is my entry point to the ecosystem.

00:00 - 00:00 Steeve Morin

If the software abstracts away those idiosyncrasies, as they do on CPUs, then the providers will compete on specs and not on fake moats or circumstantial moats, right? So this is where I think, you know, the market is going. And of course, there's the availability problem. There is, you know, if you, you know, piss off Jensen, you might need to kiss the ring, you know, to get back in line, right?

00:00 - 00:00 Steeve Morin

But I mean, ultimately, I don't see this as being sustainable.

00:00 - 00:00 Harry Stebbings

When we chatted before, you said about AMD. And I said, hey, I bought NVIDIA and I bought AMD. And NVIDIA, thanks, Jensen, I made a ton of money. And AMD, I think I'm up 1% versus the 20% gain I've had on NVIDIA. You said that AMD basically sold everything to Microsoft and Meta and had a GTM problem. Can you just unpack that for me?

Chapter 7: What role does compute in memory play in AI's future?

39:00 - 39:07 Harry Stebbings

What is the right sustainable strategy then? You don't want to go so heavy that you can't ever get out and you have that switching cost.

39:07 - 39:07 Steeve Morin

Right.

39:07 - 39:11 Harry Stebbings

But you also don't want to sprinkle it around and do, as you said, multiple.

39:11 - 39:29 Steeve Morin

Absolutely. The right approach to me is making the buy-in zero. If the buy-in is zero, you don't worry about this. You just buy whatever is best today. How do you do that? By renting? Oh, because this is what we do. This is our promise. Our thesis is that if the buy-in is zero, you know, you completely unlock that value.

00:00 - 00:00 Harry Stebbings

When you say the buy-in is zero, what does that actually mean?

00:00 - 00:00 Steeve Morin

It means that you can freely switch, you know, compute to compute, like freely, right? You just say, hey, now it's AMD, boom, it runs. You just say, oh, now it's Tenstorrent, and boom, it runs, right? How do you do that then?

00:00 - 00:00 Harry Stebbings

Do you have agreements with all the different providers?

00:00 - 00:00 Steeve Morin

Oh, yeah, yeah, yeah. Not agreements, but we work with them to support their chips. But the thing is, at least as a user myself of our tech, is that if it's free for me to switch or to choose whichever provider I want in terms of compute, AMD, Nvidia, whatever, then I can take whatever is best today, and I can take whatever is best tomorrow, and I can run both.

00:00 - 00:00 Steeve Morin

I can run three different platforms at the same time. I don't care. I only run what is good at the moment. And that unlocks, to me, a very cool thing, which is incremental improvement. If you are 30% better, I'll switch to you.
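
As a hedged illustration of what that zero buy-in could look like from the user's side (a minimal sketch with a hypothetical runner API, not ZML's actual interface): the model and the prompt stay the same, and the target hardware is just a configuration value.

```python
# Hypothetical sketch of "zero buy-in": hardware is a config switch, not a commitment.
# None of these names are ZML's real API; they only illustrate the idea.

MODEL = "llama-70b"
PROMPT = "Explain interconnect in one sentence."

def run_inference(model: str, prompt: str, backend: str) -> str:
    """Pretend runner that serves the same model on whichever accelerator is chosen."""
    supported = {"nvidia", "amd", "tpu", "tenstorrent"}
    if backend not in supported:
        raise ValueError(f"unknown backend: {backend}")
    # A real engine would compile and load the weights for the target chip here.
    return f"[{backend}] completion for: {prompt}"

# Switching providers is a one-line change, so you can chase whatever is best today,
# whatever is best tomorrow, or even run several platforms at once.
for backend in ("nvidia", "amd", "tpu"):
    print(run_inference(MODEL, PROMPT, backend))
```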

Chapter 8: What are the economic dynamics of AI compute?

41:02 - 41:28 Steeve Morin

And mind you, TPUs do training. They are the only ones. We're training them now. But AMD can do training, but in terms of maturity, by far the most mature software and compute is TPUs, and then it's NVIDIA, right? So the buy-in is so high that people are like, no, fuck. We'll see, right? I'm not on Google Cloud. I have to sign up. Oh my God, right? So these are tremendous chips.

41:28 - 41:51 Steeve Morin

These are tremendous assets. Now, in terms of the risk, I think if you want to do it, you have to do it top to bottom. You have to start with whatever it is you're going to build and then permeate downwards into the infrastructure. Take, for example, Microsoft with OpenAI. They just bought all of AMD's supply and they run ChatGPT on it. That's it. And that puts them in the green.

41:51 - 41:56 Steeve Morin

That's actually what makes them profitable on inference. Or at least, let's say, not lose money.

41:57 - 42:03 Harry Stebbings

I'm sorry, how does Microsoft buying all of AMD's supply make them not lose money on inference? Just help me understand that.

00:00 - 00:00 Steeve Morin

Because I can give you actual numbers. If you run eight H100s, you can put two 70B models on them because of the RAM. That's number one. Number two is, if you go from one GPU to two, you don't get twice the performance. Maybe you get 10% better performance. Yeah, that's the dirty secret nobody talks about. I'm talking inference, right?

00:00 - 00:00 Steeve Morin

So you go from, let's say, 100 to 110 by doubling the amount of GPUs. That is insane. So you'd rather have two by one than one by two, right? So with one machine of H100s, you can run two 70B models if you do four GPUs and four GPUs, right? That's number one. If you run on AMD, well, there's enough memory inside the GPU to run one model per card. So you get eight GPUs, eight times the throughput.

00:00 - 00:00 Steeve Morin

While on the other hand, you get eight GPUs, two, maybe two and a half times the throughput. So that is a 4x right there, just by virtue of this. So that is the compute part. But if you look at all of these things, there's a tremendous amount coming; we talked to companies who have chips upcoming with almost 300 gigabytes of memory on them, right?

00:00 - 00:00 Steeve Morin

So that is, you know, one chip per model. This is the best thing you want if you're on 70Bs, right? Which is, I would say, not the state of the art, but the regular stuff people will use for serving.

00:00 - 00:00 Steeve Morin

So if you look, you know, top to bottom, and you know what you're going to build with them, then it's a lot better to go for the efficiency gains, because four times is a big deal, right? And mind you, these chips are 30% cheaper than NVIDIA's. It's like a no-brainer. But if you go bottom up and say, I'm going to rent them out, people will not rent them. Simple.
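
A rough sketch of the arithmetic behind that claim, using only the figures quoted above (two 70B models per eight H100s at roughly 2 to 2.5x single-GPU throughput, versus one 70B model per card on an AMD node with enough HBM, plus roughly 30% cheaper chips); these are quoted ballpark numbers, not measured benchmarks:

```python
# Back-of-the-envelope comparison of 8-GPU nodes serving 70B models.

# NVIDIA H100 node: each 70B model is sharded across 4 GPUs, so 2 models per node,
# and doubling GPUs per model only adds ~10-25% throughput, per the quote.
h100_node_throughput_low, h100_node_throughput_high = 2.0, 2.5   # x single-GPU baseline

# AMD node with enough HBM per card: one 70B model per GPU, eight per node.
amd_node_throughput = 8.0                                        # x single-GPU baseline

ratio_low = amd_node_throughput / h100_node_throughput_high      # ~3.2x
ratio_high = amd_node_throughput / h100_node_throughput_low      # ~4.0x
print(f"Node throughput advantage: ~{ratio_low:.1f}x to ~{ratio_high:.1f}x")

# On top of that, the quoted ~30% lower chip price further improves cost per token.
print(f"Perf per dollar, upper bound: ~{ratio_high / 0.7:.1f}x")  # ~5.7x
```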
