Oxide and Friends

Holistic Engineering with Robert Mustacchi

Thu, 23 Jan 2025

Description

In addition to Bryan Cantrill and Adam Leventhal, we were joined by Oxide colleague, Robert Mustacchi.

Some of the topics we hit on, in the order that we hit them:

- Experiences Porting KVM to SmartOS
- Meltdown and Spectre
- Robert's "Big Theory Statement" for MAC
- Robert's "Big Theory Statement" for cpuid
- AGESA
- OxF: Put the OS back in OSDI
- Oxide RFD 63: Network Architecture
- Oxide RFD 82: Motivations and Principles for the Design of Operator Facilities
- Oxide RFD 88: Chassis Management Responsibility Allocation

If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!

Transcription

0.249 - 1.551 Bryan Cantrill

I think I've invited Robert to speak.

2.813 - 8.943 Adam Leventhal

And Craig and Chiark have deemed, now that I've given them, I've succumbed to their demands.

10.495 - 24.298 Bryan Cantrill

Listen, the podcast starts when Craig and GRX says it starts. I also do love this: we're getting a little bit of a Tomax and Xamot thing, in that, like, you would think you might have some redundancy with the twins here, but they actually apparently think with a like mind.

24.318 - 27.539 Adam Leventhal

At least a like set of permissions in Discord.

28.66 - 51.144 Bryan Cantrill

Listen, Tomax and Xamot shared the pain. These guys share permissions. It makes more sense. It's a little less fantastical. It's a little more pedestrian, a little more quotidian, but come on. I mean, this is just like, you know, we're just trying to be realistic over here. There you go. Robert, you're here. Yeah, good evening. Excellent. How are you?

51.164 - 60.265 Bryan Cantrill

You know, our former colleague, Alex Wilson, said, you know, it is really great to see you kicking off Robert Mustacchi Appreciation Week this way.

61.446 - 64.046 Robert Mustacchi

That's funny. I got a very different sounding message from him.

66.323 - 89.37 Bryan Cantrill

No. I don't know if you quite ate the hook the way I did, but I was just like, Oh, we, is that a, is that a thing we do? Is that like an annual thing? Because we have a, our colleague CJ has a kicks off and there is an Eric Anderson appreciation day here at Oxide for one of our employees. So like, it seems like a reasonable thing that we would do to have a Robert Mustacchi appreciation week.

89.45 - 112.483 Bryan Cantrill

I don't know. I'm like, have we, have we, so after like, I can't quite tell if he's joking or not. So I did ask him, like, are we, is that, is that a good thing? Have we done that? He's like, oh yeah, it's definitely a thing. Like we've got, it's documented in rm.c and there are, it's actually an extraordinary block comment complete with diagrams. And I'm like, okay, what, where, where's this?

113.163 - 129.795 Bryan Cantrill

Like, I've clearly missed something. I'm like, have I missed a like annual celebration that we do about, and then my mind's going to super weird places. Like, have I always been like out for our Mustacchi appreciation week? Is that a source of resentment that Robert, I mean, I just don't know, you know, I'm just like, and I'm like, where is rm.c? Um,

131.716 - 151.021 Bryan Cantrill

And so I did, I looked for, I went to some pretty obscure places looking for this thing. And I finally, I, I can't find it. And I'm looking at some, by the way, I do not recommend looking at some of these rm, rm.c. I mean, and Robert, I guess this is the curse of having a, a name that is a, an old Unix command, but hey, you know, I've got BMC over here, baseboard management controller.

151.041 - 156.722 Bryan Cantrill

I'm not, it's not, it's not good over here. Uh, have you ever been in rm.c?

158.322 - 159.283 Bryan Cantrill

Uh, I have a little bit.

160.103 - 160.743 Robert Mustacchi

It is not attractive.

162.079 - 183.757 Bryan Cantrill

Yeah, it is really not a good looking file, but I can't find this thing. And so finally, after like a half an hour, I'm like, okay, look, I'm going to have to ask the embarrassing question. Like, where is rm.c? He's like, oh, no, I just made all that up. He's like, but he said, but there should be a Robert Mustacchi Appreciation Week. I'm like, okay, all right, there we go. There we are.

183.837 - 185.178 Bryan Cantrill

There definitely should be.

185.478 - 187.12 Robert Mustacchi

I think better off without it.

189.618 - 214.431 Bryan Cantrill

Robert, we are very grateful for you joining us. And actually, it was funny because just earlier today, we had someone by the office who was asking about how have you managed at Oxide? How have you done hardware-software co-design? How are you able to do that? It's organizationally challenging. It's obviously technically challenging.

215.291 - 237.924 Bryan Cantrill

And I was like, you have to tune in to the podcast tonight when we have Robert on for holistic engineering. So Robert, you and I have had the blessing of working together for a really long time. For much of your professional career, and much of my professional career at this point too, but I guess a greater percentage of yours, just given our relative differences in age. Yeah.

238.565 - 246.353 Bryan Cantrill

But you came out to Sun as a, well, you originally did microelectronics, right? As an intern in 2008, maybe? When was that?

247.881 - 272.773 Robert Mustacchi

Yeah, yeah. I did a brief stint on the KT verification. And you got to see the glory of a microprocessor that would ship really soon. Not ship. So you kind of got to learn lessons that we're seeing today from when you see SPARC roadmaps from the other CPU manufacturers. Right, exactly.

275.325 - 289.433 Bryan Cantrill

I've lived that. Right. Okay. So then, and then the next summer you interned, that was the next summer that you interned with us at Fishworks in San Francisco. Right. Is that right? Yeah. Right. And that was great, obviously. And then we had, if I may, I'm sorry.

289.493 - 314.284 Adam Leventhal

So first of all, his internship was great. I, and I can't take anything away from that. I would say that the thing I remember from that internship was, is that Robert then and continues to be an amazing baker and maker of confections of all kinds. And I think it was that internship. I think it was that internship. Food for Money Friday? Yes, which spawned.

316.045 - 316.805 Bryan Cantrill

God, it is true.

316.845 - 320.086 Adam Leventhal

The cheesecake brownies that launched the thousand ships.

320.386 - 340.19 Bryan Cantrill

It really did launch a thousand ships. Adam, I'm really glad you mentioned this because people thought they were tuning in for holistic engineering and they're actually tuning in for the origin story of what we call Food for Money Friday. So do you, Adam, would you want to, I mean, Robert, may Adam do the honors to tell the origin story of Food for Money Friday? Because I do think it's important.

341.561 - 345.905 Robert Mustacchi

Sure. Yeah, I can add my commentary. You know, the background thoughts I had, all different things were going on.

347.086 - 369.109 Adam Leventhal

I mean, I do want to sort of get to the actual point of what we're talking about, but I would just say that first of all... The point of what we're talking about is Food for Money Friday, sir. Okay, good. Then allow me to digress. So Robert brought in these cheesecake brownies, which I think, Robert, you said took... like all the butter in every Safeway.

369.149 - 372.691 Adam Leventhal

You had to like smurf butter from all the different Safeways in San Francisco.

373.251 - 377.414 Robert Mustacchi

You know, there's no more. That's why there's an egg shortage now.

377.734 - 391.262 Bryan Cantrill

That's right. That's why butter is in this packaging that's really hard to open because Robert was smurfing butter back in 2009. And so they put a lot of legislation plates under lock and key. You got to ring the bell. Someone comes and locks it for you anyway.

391.442 - 407.087 Adam Leventhal

It's California. Uh, so, so Robert brings in this delicious cheesecake brownie with a billion calories. And I can't remember, I think we had already gotten to some like light food betting here and there. And I think I said something like, I don't, I'm not, okay.

407.467 - 409.948 Bryan Cantrill

I don't think that's part of it's correct, but okay.

410.388 - 422.494 Adam Leventhal

So I think I was like, Bryan, I'll, I'll pay you five bucks or whatever to eat the rest of it. And there was a lot, a lot left. And I think you like, so in a rare act of self-preservation.

423.074 - 426.177 Bryan Cantrill

So it is my recollection, Robert, go ahead. What is your recollection?

426.377 - 432.802 Robert Mustacchi

Basically there's, I think I was going to say there's about one inch of a nine by 13 on the nine of the, you know, on the 13 inch side that's been gone.

433.463 - 446.434 Bryan Cantrill

Yeah. It's been gone. And my recollection, and actually it's interesting that we've got like a, the difference of dollar figure. I recall it being $10. Okay. Okay. Well, maybe it was five inflation adjustment. I'll give you 10 bucks to eat the rest of it.

446.774 - 446.914 Adam Leventhal

Yeah.

448.198 - 481.81 Bryan Cantrill

It is like 10:30 in the morning. Which is going to be important later. I am extremely intrigued by this offer. This offer is extremely interesting to me. I feel like this is not something that had happened in the past because I just felt like this was like a bolt of lightning. Maybe you call back to my youth of like, I can earn money while doing what I'm best at, eating.

482.812 - 491.023 Bryan Cantrill

So this is like, okay, I am extremely interested, but I have to leave for a meeting. I have got like a...

491.664 - 494.887 Adam Leventhal

So it wasn't self-preservation. So the other thing I remember is- Oh, it was self-preservation.

495.107 - 504.274 Bryan Cantrill

I absolutely wanted to, I like, this, you have made an intriguing offer. One that it didn't even occur to me to counter offer.

504.414 - 517.425 Adam Leventhal

What I also remember is that even, I think even before you had declined the number, like as soon as I said, you know, five or 10 bucks, everyone else on the team was kicking in. It was up to 75 bucks almost immediately.

517.765 - 530.09 Bryan Cantrill

That is my recollection, too, and like an idea whose time had come, this just exploded. I mean, everybody, I mean, I had to run for the meeting, and it felt like within seconds the pot was at $75.

530.15 - 535.533 Adam Leventhal

Everyone's like, I, too, would like to see this person commit ritual suicide.

536.113 - 550.741 Bryan Cantrill

Right, and then our colleague Julie gets wind of this, and she's like, wait a minute, is someone like... is someone offering $75 for someone to eat all of that cheesecake? Like I am, I'm definitely in, I'm going to eat it all. Yeah. It's like, great. I leave.

551.401 - 558.126 Adam Leventhal

That's right. And we had, we had some time-based constraints that I won't go into the specifics of.

558.446 - 580.266 Bryan Cantrill

I don't want to go into, I don't want to go into the entire IOC rule book here in terms of the, you know, but with the, you know, was it a sanctioned event? You know, there's like, listen, there's a lot of, uh, And I came back to Julie looking in front. Robert, what is your recollection? I recall her having like one, one and a half inch by one and a half inch square remaining.

581.152 - 589.461 Robert Mustacchi

Something like that. I mean, I'll admit, I was kind of in shock the whole time. You know, you bring this in, you're kind of trying to sort of be like, hey, here's something nice for everyone. And then it's like, then it becomes food for money.

591.142 - 595.827 Adam Leventhal

This thing that I thought would be a delicious treat for everyone, I guess, has become your prop bet.

596.668 - 600.332 Bryan Cantrill

I guess, am I an accessory to a crime or merely to very poor judgment?

601.218 - 605.261 Robert Mustacchi

And definitely ruined cheesecake, I think, for Julie for some time.

606.302 - 610.524 Adam Leventhal

Oh, God. But yeah, Bryan, it was like an inch and a half by an inch and a half square. Yes.

611.405 - 624.974 Bryan Cantrill

And you're just like, Julie, well, just like, come on. I was like, just put it in your mouth and force yourself to eat it. She's like, that's what I've been doing. That's what I did for the last four of these squares. Like, I literally can't look at it without wanting to throw up.

625.394 - 641.787 Adam Leventhal

So now this is the part of the story when usually I tell it to folks and I'm like, and then Julie couldn't do it. So she didn't get the money and everyone is aghast. Wait, you didn't give her the money. And I think, no, I don't know. What kind of lesson are we teaching the children?

641.947 - 663.853 Bryan Cantrill

And this was, you know, and we had an arguable child with like we had Robert as a young adult, you know, Robert as a college student example to the intern who was shocked and aghast that we had taken his very kind gesture and turned it into something that, that was just depraved.

664.314 - 664.674 Adam Leventhal

That's right.

664.974 - 688.93 Bryan Cantrill

Um, But no, she didn't get the money. She got the cheesecake. So it's like, Robert, I can't believe you worked with us anyway. We're so grateful that you, that despite it all. So we will absolutely need to do a podcast episode only on Food for Money Friday. But let's just say we've had other episodes. There's a teaser episode.

689.351 - 712.811 Bryan Cantrill

We have had other food for money adventures over the years, and Robert was not for the last time with the intern involved, which has become increasingly ambiguous for me. Someone in the chat is asking, he's like, I don't know that we actually volunteered what we're doing to sound legal.

713.291 - 725.667 Adam Leventhal

Yeah, no, I mean, and no, we weren't like, you know, a ritual of, I don't know, whatever position was to like eat 90% of a cheesecake. No, this was all voluntary. These were all consulting consenting adults.

726.223 - 730.986 Bryan Cantrill

That's right. There are many attributes of hazing that are absent here. There is nothing. Yeah, exactly.

731.006 - 756.445 Adam Leventhal

There actually is another. So there's a kind of bait and switch follow-up to this internship, which is Robert interns with us. It was great. Cheesecake included, but not the only reason. It's a great internship. Yeah, yeah, great. Goes back to Brown University to finish out, gets his degree there. And we made Robert an offer to join us full time and he accepted.

758.086 - 769.834 Adam Leventhal

But in the meantime, we were off getting acquired by Oracle. Yes. And, and then, you know, you and I were off figuring out how not to work for Oracle.

770.675 - 788.725 Bryan Cantrill

Yeah. And I felt very, so we had, I and Robert, I left what was then Oracle before you joined. And my recollection is that I reached out to you as I was leaving. I think I did anyway. Yeah. I certainly felt very bad that you were going to be joining a bit of a ghost ship. Yeah.

789.285 - 807.98 Adam Leventhal

And Robert joined and I don't know if it was clear that I was leaving, but I was like, Robert, it's a great learning opportunity is me teaching you about all the stuff that I've done and everything that I've ever owned in case it comes up. And then I think I left two weeks later.

809.667 - 813.128 Bryan Cantrill

Something like that, yeah. It's a good learning exercise. There we go.

814.868 - 841.433 Bryan Cantrill

Yeah, it was nice. But fortunately, Robert, you joined me at Joyent, and you and I did a ton of things together at Joyent, but something that you and I worked closely on early on was the port of KVM to SmartOS, to our illumos derivative at the time. Um, and, uh, was that, I mean, what is your memory of those years?

841.453 - 857.034 Bryan Cantrill

I mean, is that as that, that was a, it was a bit of a terrifying project in that, uh, we really needed to succeed and it was really hard, but you, you and I worked very closely on KVM. Um, And, uh, which is great.

857.955 - 879.012 Bryan Cantrill

Um, and then, so what is also kind of my memory of those years is all of the, the kind of our early interactions with Intel and you really owning that kind of the, the, the interaction with Intel, um, that seemed to have happened pretty early on at Joyent. Anyway, is that, is that an accurate recollection?

879.881 - 900.727 Robert Mustacchi

Yeah, you know, I think we had some very early bits at 365 California. And then it was actually I think when Keith came on. Right. He kind of was really driving more of those programs and I was working with him there. And then I took over more and more of that as Keith retired to the farm for the first but not last time.

901.876 - 927.952 Bryan Cantrill

For the first, but not last time. And so the, and I mean, you'd always, I mean, I obviously always have had an interest in low level software and sharing, and I've always kind of been at that hardware software interface, but definitely found yourself there very much at, at Joyent. And the, the I mean, I, I, Well, do you have anything formative that you want to talk about?

927.992 - 947.078 Bryan Cantrill

Because there's definitely something formative at Joyent, obviously. And I know Alex is here in the chat, but we definitely had a, there were many episodes over the, how many years were we together at Joyent? Almost 10 years, nine years at Joyent. But at that kind of, that low-level system software interface.

947.098 - 967.474 Robert Mustacchi

I don't know. I think a lot of it is just learning how to learn and learning how to debug, especially in those early years. Because there's a lot of times where I might know the kind of question I wanted to ask, but not how to ask it. Or you or a lot of the rest of the team would kind of come in with some of the DTrace or other bits there.

968.214 - 992.4 Robert Mustacchi

And I feel like also where one of the great adages that I've kept in mind when being faced with gnarly problems, which is basically write debugging tools when you're stuck. Yeah, interesting. I don't know. I feel like that's a lot of... A lot of it is during the times of really just figuring out how do you go learn a new subsystem that no one's been in really before, and there's no one to go ask.

994.502 - 994.642 Bryan Cantrill

Right?

994.702 - 1005.169 Robert Mustacchi

It's not like people go ask, like, ah, how does this, like, how does the net stack work, or how does the USB stack work? It's like, oh, just go use some DTrace, go use some MDB, and start figuring it out.

1006.502 - 1024.63 Bryan Cantrill

Okay, so this is a really interesting point about kind of learning how to learn and help not being on the way. Like there's no one, you're in a subsystem where there is actually, like you are going to be, if you're not already, you're going to be the local domain expert. So do you have a particular methodology when you're,

1025.572 - 1044.863 Bryan Cantrill

When you're going into... I mean, one of the things that you, among your many special attributes, I mean, you're a terrific code reviewer. And I mean, how much do you go into a kind of a new subsystem just by looking at the actual code?

1047.224 - 1069.274 Robert Mustacchi

I'd say... Often there's some kind of group of questions I'm trying to ask or answer. And it'll be some combination of looking at code. Basically, I almost always have cscope. cscope with Vim integration that I've inherited from Dave and others over the years. Enough that I can set up Dave's keyboard and we can have the same key bindings, which is shockingly convenient. Wow.

1069.574 - 1071.036 Bryan Cantrill

Yeah, god. That's amazing.

1073.117 - 1106.415 Robert Mustacchi

But that and using that with a combination of DTrace. And this is where, like, oh, I'm trying to think of this. Some of the classic one-liners, like instrumenting a module, like all entry probes in a module and aggregating on probefunc is one. Aggregating on certain stacks and just seeing what happens, um, trying to trace control flow or data flow.

1107.255 - 1115.058 Bryan Cantrill

Are you, do you, when you're kind of in a new subsystem like that, trying to ramp up on it, are you like writing down questions that you're trying to answer?

1115.078 - 1137.956 Robert Mustacchi

I mean, how do you, that's actually a good point. Cause I think one thing that I've noticed that is having questions in the notebook and writing stuff down in there. Um, I think that's one of the other things, um, that I found really valuable, just trying to figure out, what are you trying to do? Or trying to diagram out on a whiteboard how the subsystem works and flows.

1137.996 - 1169.097 Robert Mustacchi

I think I remember we were debugging one of the, what was it? There was something with the x2APIC for the apix PSM driver. That block diagram is now in OSEnter.c, but it filled up two Joyent whiteboards. And a lot of it was just trying to understand, how can I understand this control flow well enough to know what's going on, where is everything flowing, et cetera.

1169.257 - 1190.872 Robert Mustacchi

And that's definitely a useful, just kind of, how can you understand it well enough to explain it to someone else? And I think that's the other thing. I was often sometimes sitting there talking with other folks in the office or on chat and using that as a way to kind of, like, have them ask. Sometimes they would ask me questions.

1191.653 - 1214.3 Robert Mustacchi

Like, I know sometimes Josh and I or Dave and I or Patrick and I would just go there and, like, or Alex, we probably did this a bunch at 655 Montgomery. I remember there was that little bench. behind your desk. And I feel like there would be a lot of kind of questions and back and forth there. I'm just like, how does this?

1214.32 - 1222.308 Robert Mustacchi

You know, kind of the old, like, one of the useful things about being a TA is by the time you can finally explain to someone else, you might start to have an idea of what's going on.

1223.697 - 1241.334 Bryan Cantrill

Yeah, that's really interesting because I mean, all three of us were TAs as undergraduates. It's very formative for all of us. And yeah, I mean, this is not a deep thought and anyone who's taught knows this, but boy, when you teach something, you have to learn it with a whole different level of mastery. It's not...

1242.054 - 1258.962 Bryan Cantrill

Um, like you even, you can even do well in the course and then you go to TA and you're like, Oh my God, how did I even like pass this thing? I barely like, I, I really, I'm learning so much more because when people ask you, sometimes when people will ask you a question and you'll give an answer, well, they'll be like, Oh yeah, that's like a convincing answer.

1258.982 - 1273.524 Bryan Cantrill

And you're like, that is not a commitment. Like you are convinced by this answer because you don't actually, you're asking me the question because you don't actually know, but I actually know the limitation of my own knowledge and I have not given you a very good answer actually. And I actually need to go research that because I'm not, So that's interesting.

1273.564 - 1286.067 Bryan Cantrill

And then, Robert, I also think it's interesting that you mentioned the block comment that actually gets... Because, I mean, your block comments are... I mean, ne plus ultra. The ASCII art in your block comments is the stuff of legend.

1286.948 - 1299.451 Bryan Cantrill

And I say this as someone who prides himself on his own ASCII art in my own block comments, but... Honestly, it's mostly just because I don't... If I don't do it, I'm going to forget it all.

1299.931 - 1308.435 Robert Mustacchi

And... having it written down. Like I've gone back to some of these and been like, Oh, good job, past me. I'm glad it's here. Cause I know I was just going to say, I forgot. Yeah.

1310.8 - 1327.951 Bryan Cantrill

Uh, well, the number of times we often talk about past Robert around here and you will be like, we'll have a question. You're like, God, I don't know. I'm like, actually past Robert knows the answer to that question. I actually know that. And then like, fortunately past me wrote this down. I have heard you say that so many times over our shared career.

1328.111 - 1338.678 Bryan Cantrill

And, um, but it's so, cause you mentioned that like when you're stuck, write a debugging, write debugging tooling. I definitely agree with that. But when you're stuck, you also write the, the,

1339.398 - 1366.087 Robert Mustacchi

Write a block comment. And you can go make an artifact better by just explaining how it works. Yeah. I think, as I've been flipping back through my notebook here for things that are, um, things that aren't meeting notes, a lot of it really is, uh, you know, just a whole bunch of things around, like, you know, if there are problems, it's like starting with questions I'm trying to answer and go figure out what those are. Um, you know, even on, uh,

1367.659 - 1391.37 Robert Mustacchi

Thursday's or Friday's LR DIMM hunt is the same thing. Yeah. Just like, what are some of the things, what are some of the observations, what's different? Um, you know, and I find that writing stuff down is a helpful focusing thing. And for me, I mean, everyone learns in different ways, so it's not going to be the same for everyone, but for me, uh, physically writing in mostly illegible cursive that looks good from a distance is, uh,

1393.146 - 1420.975 Bryan Cantrill

Robert's cursive is remarkable. It's fair to say. It gives me great... Robert, there have been times when I have asked you to go back and read your own cursive, and even you can struggle. It is beautiful, but it's just like... It's got this very... Obviously, you've seen plenty of Robert's cursive. I feel like I'm reading a letter that Alexander Hamilton wrote to me. You know what I mean?

1420.995 - 1446.366 Bryan Cantrill

It's got like, it's got that kind of like 18th century vibe to it, which is like, I, this is, it's gorgeous, but I just can't make it out. So, Robert, this was all, I mean, kind of as the years go by at Joyent, I mean, you're going into, like, kind of rappelling down into deeper and deeper caverns.

1459.591 - 1459.911 Robert Mustacchi

Indeed.

1461.115 - 1465.997 Bryan Cantrill

I assume. Do you want to describe what we saw on the internet and where that kind of led you?

1467.918 - 1486.406 Robert Mustacchi

Yeah, actually, I think my memory is that you or Alex probably saw it first. Probably with Spectre and Meltdown. This is being like the... It's put back in Linux in a way that starts leaking and everyone's kind of denying it until they stop denying it. I think is the...

1487.965 - 1507.785 Bryan Cantrill

Yes, and it has become a Hacker News story on January 1st, 2018, if memory serves. And... I think I am. I mean, certainly, Alex, I'm sure saw it first. I definitely DM'd you being like, are we? Do you think this affects us? You're like, I don't know. We're learning about this for the first time.

1507.825 - 1525.713 Bryan Cantrill

And this is us learning about Spectre and Meltdown and then discovering that we are vulnerable and we are running in production, in a public cloud, and Meltdown in particular was really, really acute. Jon Masters had done terrific work having a real vivid proof of concept of Meltdown.

1532.777 - 1534.278 Adam Leventhal

Yeah.

1535.139 - 1551.653 Bryan Cantrill

I... You know, it's funny. Like that is definitely like not a question we were going to go answer because it was like, it's not actionable. It's you're just like, how far is it down to the bottom? And it's like, it is enough. And you're at a height where you're going to die and nothing else actually matters. So it's like, beyond that, right.

1552.173 - 1552.754 Robert Mustacchi

More than 10. Yeah.

1556.553 - 1579.001 Bryan Cantrill

It was a lot. And Alex and Robert needed to go implement kernel page table isolation. And I mean, Robert, maybe that doesn't stand out for you as much as it does for me, but that is one of the singular engineering efforts I feel I've ever been kind of in the proximity of. I mean, it was really... extraordinary.

1579.441 - 1602.687 Bryan Cantrill

Maybe the pressure was clarifying because you're just like, well, I didn't create this problem. It's very clear what to go do. There's not a question about what is the highest priority thing to go do, which I feel is something as engineers we always grapple with. It's like the highest priority thing is to go do this neurosurgery on the VM system.

1604.505 - 1633.194 Robert Mustacchi

Yeah, there was a lot of different parts to it. And so it was a combination of Alex and also John Levon, who was very helpful to have while working on that. And obviously, lots of conversations with others. But yeah, we were able to kind of split up that work into a bunch of different pieces. I think I dealt with per-CPU page tables, which was an exciting thing in its own right.

1633.214 - 1647.879 Robert Mustacchi

I think Alex dealt with a lot of the trampoline assembly. But we also kind of settled on a somewhat unique solution, I feel like, that hadn't really been done by others with the per-CPU page tables.

1648.868 - 1661.096 Bryan Cantrill

But could you talk a little bit about the problem we were solving? Like what was the problem that we needed to go solve? What actually is this? And this is to be clear, this is for Meltdown, not for Spectre, but Meltdown was a much more acute.

1661.937 - 1690.01 Robert Mustacchi

Yeah. So yeah, that's a great question. So, um, on x86, ARM, RISC-V, and a bunch of other common CPUs, when you use the MMU, you have page tables that describe virtual to physical mappings. So every process has its own address space and maps to generally disjoint physical memory, and those page tables describe where they exist and different attributes.

1690.27 - 1713.239 Robert Mustacchi

So some of those attributes say this page can be read, this page can be written, this page can be executed. One set of those attributes is basically permissions, in terms of what privileges are required to read, write, or execute that page. So you can really think of this as that there's a whole bunch of memory that people sometimes call kernel memory and then memory for processes.
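
A minimal sketch of the attributes Robert is describing, assuming the x86-64 page-table-entry layout (present, read/write, and user/supervisor in the low bits, no-execute in bit 63). The function names are hypothetical, and the check is deliberately simplified: it ignores SMEP/SMAP, CR0.WP, and the permissions carried by upper-level table entries.

```python
# x86-64 page-table-entry flag bits (architectural bit positions).
PTE_PRESENT = 1 << 0    # the mapping is valid
PTE_WRITABLE = 1 << 1   # writes are allowed
PTE_USER = 1 << 2       # user-mode access allowed (clear = kernel only)
PTE_NX = 1 << 63        # no-execute


def describe_pte(pte):
    """Decode the permission attributes of one page-table entry."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_WRITABLE),
        "user": bool(pte & PTE_USER),
        "executable": not (pte & PTE_NX),
    }


def access_allowed(pte, user_mode, write=False, execute=False):
    """Simplified model of the permission check the MMU performs.

    Hypothetical helper for illustration only; real hardware also
    consults the upper-level entries and several control registers.
    """
    if not pte & PTE_PRESENT:
        return False
    if user_mode and not pte & PTE_USER:
        return False          # a user process touching a kernel page
    if write and not pte & PTE_WRITABLE:
        return False
    if execute and pte & PTE_NX:
        return False
    return True
```

In this model a "kernel page" is just an entry with `PTE_USER` clear: it is mapped in the process's address space, but a user-mode access to it fails the check and would take a page fault.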

1713.979 - 1738.737 Robert Mustacchi

But effectively, let's just use the 4 gigabyte 32-bit address space as a simple example for a second. Every process has a 4 gigabyte address space, or a 64-bit process has 64 bits with a bunch of holes. But the top gig in that 4 gig address space is always the kernel. And it's the same in every process.
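
The 32-bit example can be sketched as a simple address test. This assumes the split Robert describes (the top 1 GiB of a 4 GiB address space belongs to the kernel, which puts the boundary at 0xC0000000); the names are hypothetical, and real systems have used other splits.

```python
ADDRESS_SPACE_SIZE = 1 << 32                    # 4 GiB of virtual addresses
KERNEL_SIZE = 1 << 30                           # top 1 GiB (assumed split)
KERNEL_BASE = ADDRESS_SPACE_SIZE - KERNEL_SIZE  # 0xC0000000


def is_kernel_va(va):
    """Return True if a 32-bit virtual address falls in the kernel's range.

    The same range is mapped in every process, which is why a system call
    can start running kernel text without switching page tables.
    """
    if not 0 <= va < ADDRESS_SPACE_SIZE:
        raise ValueError("not a 32-bit virtual address")
    return va >= KERNEL_BASE
```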

1740.726 - 1763.039 Robert Mustacchi

But when you make a system call, you can start executing kernel text, and you don't have to go try and basically change the MMU context, change the page tables, because that's generally expensive and potentially causes cache invalidations and is the root of a lot of CPU performance challenges.
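
As a rough illustration of the 32-bit example above, here is a minimal sketch of an address space whose top gigabyte belongs to the kernel. The names `KERNEL_BASE`, `region`, and `shared_across_processes` are made up for this sketch, and the exact split is an assumption; real systems pick their own boundary.

```python
# Toy model of a 32-bit address space with the top 1 GiB reserved for the
# kernel. KERNEL_BASE is an assumed value for illustration only.
GIB = 1 << 30
ADDRESS_SPACE_SIZE = 4 * GIB
KERNEL_BASE = 3 * GIB  # assumed 3 GiB user / 1 GiB kernel split


def region(vaddr: int) -> str:
    """Return which part of the address space a virtual address falls in."""
    if not 0 <= vaddr < ADDRESS_SPACE_SIZE:
        raise ValueError(f"address {vaddr:#x} is outside the 32-bit address space")
    return "kernel" if vaddr >= KERNEL_BASE else "user"


def shared_across_processes(vaddr: int) -> bool:
    """Kernel addresses are mapped identically in every process in this model."""
    return region(vaddr) == "kernel"
```

Because the kernel range is mapped the same way everywhere, entering the kernel on a system call does not require switching page tables in this model, which is the performance property being described.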

1763.903 - 1771.468 Bryan Cantrill

And this has always been in x86. I mean, other architectures have done, like SPARC did things differently with address space identifiers.

1771.828 - 1790.7 Robert Mustacchi

Yeah, exactly. And ARM, through various incarnations these days, especially in the 64-bit ARMv8-A profile, looks very much like x86 does in that regard. But eventually, there's a bit in there, or a few bits, that say, should this page be a kernel page or a user page?
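
To make the "bit in there" concrete, here is a minimal sketch that decodes a few flag bits of a single x86-64 page-table entry. The bit positions (present, writable, user/supervisor, no-execute) follow the x86-64 architecture; the function names are hypothetical, and the sketch deliberately ignores the fact that real hardware combines permissions across every level of the page-table hierarchy.

```python
# Flag bits of an x86-64 page-table entry (single-entry simplification).
PTE_PRESENT = 1 << 0      # mapping is valid
PTE_WRITABLE = 1 << 1     # writes allowed
PTE_USER = 1 << 2         # user-mode access allowed; clear means kernel-only
PTE_NO_EXECUTE = 1 << 63  # instruction fetch disallowed (when NX is enabled)


def describe_pte(pte: int) -> dict:
    """Decode the permission-related bits of one entry into booleans."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_WRITABLE),
        "user_accessible": bool(pte & PTE_USER),
        "executable": not (pte & PTE_NO_EXECUTE),
    }


def user_read_allowed(pte: int) -> bool:
    """A user-mode read should fault unless the page is present and user-accessible."""
    return bool(pte & PTE_PRESENT) and bool(pte & PTE_USER)
```

A kernel page such as `PTE_PRESENT | PTE_WRITABLE | PTE_NO_EXECUTE` has the user bit clear, so `user_read_allowed` returns `False`: that is the check a user process is supposed to trip over.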

1791.539 - 1817.453 Robert Mustacchi

And effectively, what that's meant to say is that if you're a user process, even though those kernel pages and those kernel VAs exist, if you try to read them, you'll get a page fault. And then the kernel will come and drop a signal on you to basically say, you've been reading something that you can't. Unfortunately, through the power of speculation, what basically happened is that

1819.727 - 1843.439 Robert Mustacchi

that check happened, but only after all the side effects of doing the read were pretty much done. So, you know, it doesn't show up in your register, but it was loaded into all the caches and everything else, such that you could still see it. So basically, you could read any arbitrary piece of kernel memory you want.

1843.639 - 1855.498 Robert Mustacchi

So whether that was, you know, someone's packets, security keys, you know, someone else's file system cache data. It was, yeah.

1856.299 - 1879.795 Bryan Cantrill

And to be clear, the way you would do that is, what you are controlling, you should not have been able to control, but you are able to get this thing to do a load for you, but you can't see the results. So the trick is how do I exploit the side effects of that load, namely allocating in the cache. And then what can I go do to exploit those?

1879.815 - 1892.643 Bryan Cantrill

And the things you can go exploit, as it turns out, you can have a conditional branch that then also gets executed speculatively. And then that can do something else in the cache that you can then go observe the side effects of. So you can kind of chain together.
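
The chaining described above can be sketched as a toy simulation. This is not an exploit and does not model real hardware: `ToyCache`, `transient_access`, and `recover_byte` are hypothetical names, the cache is just a Python set, and the "is this line cached?" question that a real attacker answers by timing memory accesses is answered here by a direct lookup.

```python
class ToyCache:
    """A pretend cache: just the set of line indices currently resident."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def touch(self, line: int):
        self.lines.add(line)

    def is_cached(self, line: int) -> bool:
        return line in self.lines


def transient_access(cache: ToyCache, secret_byte: int):
    """Model the flawed behavior: the permission check fails, so no value is
    returned architecturally, but a dependent load indexed by the secret has
    already pulled one of 256 probe lines into the cache."""
    cache.touch(secret_byte)
    return None  # the register never sees the secret


def recover_byte(cache: ToyCache):
    """Probe all 256 candidate lines; exactly one hit reveals the byte."""
    hits = [line for line in range(256) if cache.is_cached(line)]
    return hits[0] if len(hits) == 1 else None
```

In this model, flushing the cache, calling `transient_access(cache, 0x5A)`, and then calling `recover_byte(cache)` yields `0x5A` even though `transient_access` itself returned nothing.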

1893.403 - 1913.977 Bryan Cantrill

People may want to know, because I think one question people have is like, well, wait a minute, why didn't someone discover this a lot earlier? And the answer is like, it's sophisticated in that you... You do have to kind of chain these things together. And I think that this had been kind of in the abstract. We knew that kind of speculative attacks could happen.

1914.337 - 1930.289 Bryan Cantrill

I don't think anyone thought that they were going to be this brazen. And in particular, it is really bad that it would speculatively execute on these addresses that you don't have the ability to read. The chip should never have done that.

1931.51 - 1955.952 Robert Mustacchi

Yeah, there had been a build-up in the literature of different L3 cache attacks, like Prime+Probe and other things, where you start using the L3 cache as a shared resource, but people didn't expect you could go through the page tables. Or my favorite one really is eager FPU, which is just a fun one of just like, oh, you really can speculate through everything.

1956.977 - 1972.674 Bryan Cantrill

You really can speculate. I mean, with eager FPU in particular, and this is like fast forward now, I don't know, maybe nine months or 10 months, and we're having kind of like these constant calls with Intel where they're reviewing yet more. Because it is also true that like once the group discovered Spectre and Meltdown,

1973.194 - 1994.795 Bryan Cantrill

And I mean, we knew this was going to be just like, oh, my God, it's going to be an absolute bonanza because now people know this is a target-rich environment and I can now go explore every unit in the part. And Robert, I think it was eager FPU when I had missed the call with Intel because of another conflict. And you said, well, we've got another disclosure.

1994.895 - 2004.717 Bryan Cantrill

What is the one unit we have not yet heard about from them? And I'm like, we've not heard anything about the FPU. You're like, yes, it's the FPU. It's like, oh my God, literally every unit.

2006.158 - 2032.545 Robert Mustacchi

Yeah. But yeah, I'd say that's been one of the small advantages of working for smaller companies, that you get to explore a lot more of this stuff than you necessarily do at kind of some of the larger places because there's just... We're a less segmented team, so often there's not a big kernel org sitting by. Even if you look at Apple, they have a lot of different groups there.

2032.665 - 2050.925 Robert Mustacchi

A bug comes up, you're going to pass it off to that group, and you're not going to chase it down or look at it or have to figure it out, which is sometimes both a blessing and a curse. It's great to have a lot of different colleagues, but sometimes it means there's less opportunities for you to learn or kind of move around in that regard.

2051.265 - 2064.152 Adam Leventhal

I'd also say, Robert, there's fewer opportunities for you to... I think one of the things that you do well, I'm sure we're going to get into this, is cast that eye that is so well-informed about disparate parts of the system. And I think that

2064.992 - 2087.109 Adam Leventhal

You're right that a strength of a large organization is that you can have experts narrowly focused, and sometimes it's a lot easier to be productive along those lines. But I think what they miss is the kind of system-wide insights that you've been able to make at Oxide, for example, because of diving deep into all these different areas.

2089.468 - 2112.148 Bryan Cantrill

Yeah, I think that the, because I agree, Adam, and I think that like, I don't know, Robert, sorry to speak about you in the third person, but I feel like that is something that I've kind of seen more at Oxide, where it's like at Joyent, like you dove deep into so many different domains and then coming to Oxide, maybe it's a good opportunity to kind of fast forward to Oxide a little bit.

2112.368 - 2136.821 Bryan Cantrill

But, you know, when we started the company, and were able to raise money and were able to convince you to join us, you'd been thinking about a lot of these, you'd gone deep in a lot of these disparate areas to the point that you were now really beginning to synthesize them. And maybe it was just that Oxide gave us the opportunity to do that synthesis.

2137.401 - 2144.967 Bryan Cantrill

But is that, I mean, as you kind of think about your own story, is that kind of when that synthesis really begins to come into focus?

2146.24 - 2170.128 Robert Mustacchi

I think it's when we're able to execute it most directly. I mean, so I think there's a, you remember Keith and I had, like Keith was driving, together we had kind of this Dogpatch pitch. Yes. Must have been 2015, 2014. Yes. Maybe 2016 at the latest. Can't be that much later. No, not 2016.

2170.448 - 2171.268 Bryan Cantrill

It would be like 2014, maybe even 2013.

2174.246 - 2200.159 Robert Mustacchi

Yeah, of kind of where we want to go, but with the constraints that, you know, effectively we're not doing our own boards, we're not really getting to that level of access. You know, how do we work with folks there? And at the same time, there's broader business constraints around, you know, who you have to work with, what's available on the market, you know, ultimately to be economically minded in different ways. So, but...

2200.379 - 2214.889 Adam Leventhal

Robert, can you talk more about, like, what problems you were trying to solve with Dogpatch and what were the solutions? But before you get there, I just want to say, Bryan, you mentioned hiring Robert. You failed to mention that he was actually your first hire. So he is employee number one.

2215.229 - 2243.096 Bryan Cantrill

First hire. And also I feel we definitely, I was so desperate for Robert that as soon as Robert kind of like tentatively said, like, I guess I work here now, we were both like, yes, yes, you work here now. Yes, yes, you work here. There was a moment where we're just like, okay, sorry, Robert, you can't take it back. Now you work here.

2243.116 - 2267.982 Bryan Cantrill

It takes a special... And Robert really, obviously, appreciate the... Because the company's nothing at that point. And again, I know vividly where I was anyway, Robert. I don't know if you remember where we were, but on Jess's little back deck there as... You'd obviously had some questions. I felt we kind of knocked down all the questions and you're like, all right, I guess I'm in. I don't know.

2270.424 - 2290.135 Adam Leventhal

I know, but it's like raising money certainly gives you that feel of tangibility, but it's when other people are now saying, people you've worked with, in Robert's case, for five, 10 years or whatever, saying, yeah, okay, I'm in. Let's do it. It's a new kind of reality setting in about the venture you're heading down.

2291.112 - 2305.698 Bryan Cantrill

Yes. Yeah. And not too long thereafter. And I want to get Robert back to Dogpatch in a second, but I'll definitely drop this photo into the chat because I've got, I mean, is it the whiteboard?

2305.858 - 2320.245 Robert Mustacchi

Yeah. 'Cause I was about to say, if you have that, you should drop that in. 'Cause I feel like that's emblematic of the discussion we had. It's like, ah, like let's drop these different pieces on the whiteboard. And then we go back and forth. Like, oh, let's talk about what's connected. You're like, they're just gonna be fully connected. And of course you were right.

2320.385 - 2320.505 Robert Mustacchi

But,

2321.127 - 2341.981 Bryan Cantrill

Yeah, I mean, it was really great because we've got all these kind of different elements of the system. And Robert is like trying to figure out like, okay, what are the kind of the connections between, you know, the root of trust and, you know, we've got the switch, we've got all these different elements.

2342.101 - 2362.54 Bryan Cantrill

And I'm like, this is going to be a fully connected graph. And sure enough, after not too long, it was. So yeah, I'll drop in that photo, Robert, but it's definitely a great one. But would you mind actually, just to go back to Adam's question, what was Dogpatch? What were the problems you were thinking about with respect to Dogpatch?

2363.481 - 2378.704 Robert Mustacchi

Yeah, so I think to really start to get at Dogpatch, I actually have to rewind back to Fishworks and AK, which was the appliance kit. So there was a lot of hardware-software integration we did there around manageability and serviceability in particular.

2380.445 - 2406.519 Robert Mustacchi

And if you're ever fortunate enough to use one of those, there were a lot of things that it did around just like, hey, detecting drive failure, indicating that by blinking an LED, and even being able to blink the LED to tell you where it was. It already started resilvering, swapped in spares, sent out emails, contacted support potentially to get replacement parts shipped, et cetera.

2408.04 - 2441.105 Robert Mustacchi

And that was kind of at a single box level. And that was really 2007, 2008. Then you kind of fast forward, and it's even 2013, 2014, and the ability to kind of deal with that data center management at scale, kind of getting to that warehouse-style computing, is really limited unless you're one of the really big players. You're making an LED blink reliably. Surprisingly challenging.

2441.365 - 2463.765 Robert Mustacchi

Actually, you didn't... Honestly, half the time, even after you figured it out, you learned that the server was built incorrectly. And a bunch of the cables to, you know, basically the SAS expander or maybe the drive, it's like the drive backplane cables were swapped. So what you thought was drive zero was drive eight.

2465.263 - 2468.684 Bryan Cantrill

And so you were literally lighting the wrong LED.

2468.704 - 2469.404 Robert Mustacchi

The wrong LED.

2470.644 - 2472.784 Bryan Cantrill

Hey, a drive failed. Come on.

2473.004 - 2473.385 Adam Leventhal

I don't know.

2473.805 - 2480.366 Bryan Cantrill

Exactly. Actually, like a drive failed. So I want you to remove one of the good ones. Did that help?

2481.206 - 2491.888 Robert Mustacchi

Did that help? Exactly. But also all of this had to be done very manually. And, you know, the Joyent operations team put a lot of work into dealing with and managing that and dealing with the kind of

2492.842 - 2521.794 Robert Mustacchi

lack of features. But, you know, the serviceability and manageability story, which worked well at the one-to-two-system level at Fishworks, we really wanted to bring together if you had multiple systems. And let me actually go see if I still have this deck somewhere. Yeah, the Dogpatch deck. Just to remind myself what else was actually in there. But...

2522.911 - 2544.691 Robert Mustacchi

Yeah, I think this is actually where, you know, I think the big thing is, you know, there's a lot of fights between what's being done by the OS and what's being done by the BMC in those systems. Because basically the BMC is basically where there's a whole bunch of value add from the vendor for varying degrees of value.

2544.711 - 2546.332 Bryan Cantrill

Oh my God, Diet Coke just almost came out of my nose. Yeah.

2547.386 - 2550.807 Adam Leventhal

I know. Those air quotes are just like screaming.

2551.587 - 2554.727 Bryan Cantrill

Oh, God. Yeah. I needed a little bit of a warning on that one.

2559.768 - 2565.85 Robert Mustacchi

Yeah. Just the features that the out-of-band management had were challenging.

2567.11 - 2572.251 Bryan Cantrill

Robert's being very forgiving to not mention this, but this is when we first heard about Redfish.

2573.553 - 2600.971 Bryan Cantrill

For whatever reason, I somehow got... I'm like, I think this can really help us. I'm like, Redfish, this seems really interesting. This is going to solve exactly this kind of problem that you mentioned. And I remember you being like, I don't think you understand what this is. Like, that's not gonna, you know, no, sorry, this is taking an HTTP layer and smearing it on top of the same garbage. No, this is not gonna help us at all. And of course everybody's like, no, wait a minute, I'm sorry...

2601.92 - 2607.385 Adam Leventhal

But that is the marketing of it, right? That is the ostensible value add of Redfish.

2607.745 - 2632.115 Robert Mustacchi

Just totally divorced from reality. And it's gotten slightly better than when it first came out as an empty schema. But yeah, as I'm trying to look through this to see some of the other things, like obviously some of the classics, dealing with firmware, you know, the different kind of architectural challenges we had there.

2632.515 - 2654.443 Robert Mustacchi

We actually were thinking about how do we eliminate the BIOS and UEFI and basically just let the bootloader take care of a whole bunch of stuff. That for us was iPXE at the time. So it's just like, how do we basically just get out of these different layers that are differently broken?

2656.018 - 2682.568 Robert Mustacchi

I think this is during the, as we were writing this, this is one of those times where we had, I think it was like a, oh, like a Dell, maybe like Haswell era server. And after we typed reboot, it would just hang in the BIOS. Like, after we, you know, like, we just reboot, like... And it would only happen on a warm reboot. And there's a lot of back and forth of being like, well, it's your fault.

2683.349 - 2710.155 Robert Mustacchi

You know, don't tell us this. And it's like, and you're trying to like, you know, some of us would say it less politely. You know, I think, you know, this is where we had good cop, bad cop, and psycho cop. But it's like, hey, the BIOS has just like erased all of our program text. And it's taken over. And its job is to restore it from an arbitrary state. So how exactly is it our fault?

2710.775 - 2721.18 Robert Mustacchi

But actually, that in and of itself is a lesson as to one of the things that we actually did in Oxide, which is getting rid of two different reboot paths to actually simplify the system and streamline

2722.976 - 2747.376 Robert Mustacchi

things. So that is, you know, in a standard system, if you type reboot, it's not going to do a full POST, it's not going to go reset everything. UEFI is going to kind of be clever, or the real reality is the CPU actually isn't going to erase everything. So even if you reassert the reset line, if you're in this ACPI S5 state, there's a whole bunch of state that stays across that. So

2749.278 - 2759.149 Robert Mustacchi

That all of a sudden means there's two different initialization paths. Some of this data you'll see in some of these data sheets described as being in certain power wells or as sticky across resets. And so we're just like,

2760.183 - 2783.592 Robert Mustacchi

have none of that. Let's kill all power. And that actually made it easier for us to build a more reliable system with, you know, fewer code paths to actually think about, because we can actually say there is no warm reset, there is no way there. Then at the end of the day, what that means for customers is that, hey, it works more reliably more of the time, because a lot of the challenge here is that you have all these different code paths and

2784.625 - 2801.158 Robert Mustacchi

it's hard to actually test reset in a bajillion different ways. Or what does it mean to do a warm reset when, you know, you've been up for two years versus you've been up for 30 seconds versus, you know, I've hot swapped all these drives umpteen times versus, yeah, I've done nothing because I've just been up for, you know, two minutes.

2801.578 - 2807.203 Robert Mustacchi

So it really got us out of that problem once we ironed out some of the bugs.
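
The argument for a single reset path can be sketched with a toy model. Everything here is hypothetical (`ToyMachine`, the `sticky` and `volatile` dictionaries, `boot_state`); it only illustrates why state that survives a warm reset creates a second initialization path that has to be tested separately.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMachine:
    # State cleared by any reset.
    volatile: dict = field(default_factory=dict)
    # State that survives a warm reset in this model ("sticky across resets").
    sticky: dict = field(default_factory=dict)

    def warm_reset(self):
        """Reset line asserted, power stays on: sticky state is left behind."""
        self.volatile.clear()

    def cold_power_cycle(self):
        """All power removed: nothing survives."""
        self.volatile.clear()
        self.sticky.clear()


def boot_state(machine: ToyMachine) -> str:
    """What early initialization finds when it starts running."""
    return "leftover-state" if machine.sticky else "clean"
```

If the only way to reboot is `cold_power_cycle`, `boot_state` is always `"clean"`, so initialization code has exactly one starting condition to handle, regardless of how long the machine was up or what happened before the reboot.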

2808.055 - 2845.69 Bryan Cantrill

And this is, again, this is at Oxide. I dropped the Dogpatch deck from Joyent into the chat, Robert, but circa 2014 is when I got it. And actually, 2014 on there. And you will see a lot of the Oxide vision there in terms of eliminating both BIOS and UEFI. Much easier said than done. Well, no one took us up on it, so... No one took us up on it. And so, I mean, a bunch of things.

2845.71 - 2866.237 Bryan Cantrill

I mean, one, I mean, I think this is the kind of thing that nobody disagreed with us, that this is the right thing to go do. But it just seemed like, God, it seems really, really, really, really hard. AMD thought it was impossible. And it was in order for us to, I mean, this is the part of the part that is not well-documented.

2867.217 - 2887.774 Bryan Cantrill

Um, oh, and I mean, do you want to describe some of your trials and tribulations about like, how did we do this? How did we pull off this? Because I mean, I think this is, I mean, unfortunately, like we had AMD's cooperation at the level of AMD was not going to like get in our way. Right. I mean, AMD is like, what you're doing is so hard.

2887.794 - 2902.051 Bryan Cantrill

We don't know how to help you, but we're also like, we're supportive of this effort in the abstract, which is extremely important. Um, but it's also not hugely, hugely helpful. How did we go about on that particular problem?

2902.371 - 2906.376 Adam Leventhal

They're helping by not actively sabotaging us, is what it sounds like.

2906.416 - 2927.65 Bryan Cantrill

Hey, listen. Sorry. You'll take it. You'll take what you get. That's ahead of the class for you. That's terrific. They did actually one better in a very important way. And this is why people may have seen the openSIL effort from AMD around open silicon initialization.

2928.69 - 2950.62 Bryan Cantrill

We are very, very supportive of that effort, even though we're not using it at all, because it was tacking in... AMD was going in a lot of the same directions that we were going. So it actually... But... What we were doing was far earlier than openSIL, when openSIL was really just kind of still in its kind of earliest phases. So we really were on our own.

2951.72 - 2964.067 Bryan Cantrill

And Robert, talk about, I mean, surely the work that you did there to understand AGESA has got to rank as one of your most challenging projects in terms of understanding a foreign code base.

2965.788 - 2979.729 Robert Mustacchi

Yeah, I'd say it's even more challenging as one that we, I mean, at least at the time, there was really no good way for me to build or run or instrument. So that was really all about code inspection.

2980.869 - 2987.251 Adam Leventhal

Robert, can you talk about what AGESA is and what we needed to do in that domain?

2987.731 - 3024.067 Robert Mustacchi

Yeah, Adam, that's a good question. So AGESA is effectively one part of AMD's boot software. So it contains all the kind of binary blobs that the PSP or other kind of hidden cores run, but also really contains a whole lot of all of the x86 initialization things. So how do memory mappings get set up? How do various pieces of the data fabric get set up? CPU initialization.

3024.528 - 3049.01 Robert Mustacchi

When you start, there's only one CPU turned on. And even though operating systems have this traditional IPI dance, there's a whole bunch of other stuff you have to do in advance to start this. The AMD SoCs, like others, have 128 PCIe lanes. But those PCIe lanes can be carved up into arbitrary different slots, depending on what board you have. So how do you communicate that?
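
One way to picture "communicating" how the lanes are carved up is a board-specific table that gets validated before it is handed to firmware. The sketch below is an assumption-laden illustration, not AMD's or Oxide's actual data format: the slot names, the tuple layout, and the set of allowed widths are all made up for the example; only the 128-lane total comes from the discussion.

```python
TOTAL_LANES = 128                 # per-SoC lane count mentioned above
ALLOWED_WIDTHS = {1, 2, 4, 8, 16}  # assumed set of link widths for this sketch


def validate_lane_map(slots):
    """Check a board description given as (name, first_lane, width) tuples.

    Returns the number of lanes in use, or raises ValueError if a slot has an
    unsupported width, runs past the last lane, or overlaps another slot.
    """
    used = set()
    for name, first_lane, width in slots:
        if width not in ALLOWED_WIDTHS:
            raise ValueError(f"{name}: unsupported width x{width}")
        if first_lane < 0 or first_lane + width > TOTAL_LANES:
            raise ValueError(f"{name}: lanes fall outside 0..{TOTAL_LANES - 1}")
        lanes = set(range(first_lane, first_lane + width))
        if used & lanes:
            raise ValueError(f"{name}: overlaps a previously assigned slot")
        used |= lanes
    return len(used)
```

For example, a hypothetical board with an x4 drive at lane 0 and an x16 NIC at lane 16 uses 20 lanes, while two slots that both claim lane 4 are rejected.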

3049.81 - 3076.033 Robert Mustacchi

So AGESA, which looks like, what does it stand for? AMD Generic Encapsulated Software Architecture. Only one of the letters is an acronym. So that whole bit is designed to tie into a separate, these days, UEFI code base. So it in and of itself is not the complete picture.

3079.455 - 3110.54 Robert Mustacchi

It's built up as a series of UEFI modules that run in the PEI and DXE phases, which are different phases of boot, and still means that you need TianoCore or another UEFI implementation to kind of fit alongside it. It will do anything and everything from setting up I2C devices to... That's where SMBIOS tables are created.

3113.161 - 3118.122 Bryan Cantrill

To just... It's presumably where ACPI tables are created as well, right?

3118.162 - 3122.684 Robert Mustacchi

ACPI tables are created there too. And just a whole bunch of stuff. So yeah, that was an effort where...

3128.143 - 3148.414 Bryan Cantrill

And it also should be added, Robert, that, so you're like, okay, so you've got this like early platform initialization thing. So you have to like follow all the, just the code flow through that. It's like, yeah, about the code flow, describe what makes following the code flow through here absolutely brutal.

3149.935 - 3171.56 Robert Mustacchi

Oh yeah. So as I said, UEFI is designed as a series of different modules. And they all basically rely on different callbacks firing. So because these modules are coming up and loading in probably somewhat defined but arbitrary orders, they'll often wait.

3172.64 - 3202.983 Robert Mustacchi

So before it begins, the PCIe module or the NBIO module might start loading, but it's going to register a callback that fires when all these different PPIs, which is a UEFI term, are provided. Lots of them. And there's not like a... Or at least in the things we had access to, there was no clear map of like, this is the expected ordering of these different sets of services.

3203.023 - 3226.568 Robert Mustacchi

So it's like, when does the logical SOC service begin? When does the memory map service start? What is it blocked on? So a lot of it is basically trying to come up with this effectively callback-driven control flow and trying to understand what is that just by purely reading, which is not straightforward and definitely not always correct.
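
The callback-driven flow described above can be modeled in a few lines. This is a loose sketch of the general "fire when these interfaces exist" pattern, not the UEFI PI API and not AGESA's code: `PpiRegistry`, `notify_when`, `install`, and the module and PPI names are all hypothetical.

```python
class PpiRegistry:
    """Toy registry: callbacks fire once every interface they need is installed."""

    def __init__(self):
        self.installed = set()
        self.pending = []   # (module name, required interfaces, callback)
        self.fired = []     # order in which callbacks actually ran
        self._dispatching = False

    def notify_when(self, name, required, callback):
        self.pending.append((name, frozenset(required), callback))
        self._dispatch()

    def install(self, ppi):
        self.installed.add(ppi)
        self._dispatch()

    def blocked(self):
        """Map each still-waiting module to the interfaces it is missing."""
        return {name: set(req - self.installed) for name, req, _ in self.pending}

    def _dispatch(self):
        if self._dispatching:
            return  # the outer loop picks up anything a callback just unblocked
        self._dispatching = True
        try:
            while True:
                ready = next((e for e in self.pending if e[1] <= self.installed), None)
                if ready is None:
                    break
                self.pending.remove(ready)
                self.fired.append(ready[0])
                ready[2](self)
        finally:
            self._dispatching = False


reg = PpiRegistry()
reg.notify_when("pcie_init", {"soc_logical", "memory_map"}, lambda r: r.install("pcie_ready"))
reg.notify_when("memory_map_init", {"soc_logical"}, lambda r: r.install("memory_map"))
reg.notify_when("late_init", {"never_installed"}, lambda r: None)
reg.install("soc_logical")
```

Here the actual order is only visible in `reg.fired` after running it, and `reg.blocked()` shows what a stuck module is waiting on. Working this out from source alone, without being able to run or instrument anything, means reconstructing that ordering by hand.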

3227.611 - 3248.059 Bryan Cantrill

Well, and so unlike with the kind of other things, like when you went into the page tables to understand that, to understand the VM system, you're armed with the source code and you also have some tools that you can use to actually observe the running dynamic system. You don't have any of that here. You have, or very, very little.

3248.079 - 3254.542 Bryan Cantrill

I mean, I guess the thing that you could, I mean, you can go dork with some of these attributes, right?

3254.642 - 3276.847 Bryan Cantrill

these APCB tokens, and you can, I guess, watch how this... And I guess you did do that, because I know at least once you're like, could you go over to this machine and see what's happening? Because I was in the office and you were not. And I'm like, this machine? Yeah, I'll say that was a very different... Yeah, that was dealing with one of the hidden cores.

3278.17 - 3285.033 Bryan Cantrill

Right, right, right. That was actually sending messages. Right, right. This is on our own software. You're not actually trying to understand their software. You're actually running our own software.

3285.353 - 3310.238 Robert Mustacchi

Yeah, that was in our software. We're trying to send a message to the hidden core, which is responsible for PCIe initialization. We're also trying to send it a data structure that has all the mapping of all of our lanes and all this fun stuff. And then you send a message and then wait a while. And surprise, it came back to the bootloader prompt, which is never a good sign. Never a good sign.

3310.579 - 3319.913 Robert Mustacchi

The kernel's up, so in theory, if I take a page fault or a double fault, you'd trap into KMDB and you could debug. And that's when I started asking Brian, like,

3321.4 - 3344.185 Bryan Cantrill

Going off the system, and, you know, it was powering off. Yes. Yeah. This is like you were putting a message in a bottle. There's a lot of messages in bottles, where you're putting a message in a bottle, you're sending it out to one of these other hidden cores on the die. And in this case, you're putting the message in the bottle, chucking it towards the island, and then the island was somehow bursting into flames and sinking into the ocean.

3344.465 - 3366.945 Bryan Cantrill

You're like, I don't send that one, I guess. Like, okay, that's a no thank you on that message. Yeah, the island launched global thermonuclear war. So like, whoa, okay. And it's brutal. Yeah, and that one came down to a...

3369.483 - 3395.816 Robert Mustacchi

missing write to one of the registers in, effectively, the NBIO, which is northbridge I/O, which has parts of the memory map. So we were basically trying to do DMA to an address, and it was missing information that told it whether that was DRAM or memory-mapped I/O.

3396.136 - 3417.2 Robert Mustacchi

And so it probably hit some internal error and, you know, especially since then we didn't have good observability, because we were still working on a reference platform. So we didn't have our service processor, our other stuff there. So we couldn't see if some of the low-level asserts were being fired that would trigger something on a pin. Right. We had nothing.

3417.7 - 3423.821 Adam Leventhal

So how on earth did you do it? Like, was it just back to the code and reading more?

3424.654 - 3448.4 Robert Mustacchi

Yeah, well, and I think this also gets to some of the tooling stuff. It's like we had actually built up a whole bunch of random dcmds in KMDB, because there was no userland. So it's all in the kernel debugger, to basically be able to read and write some of these different register spaces. So the system management network is one of them. And then the data fabric is another one.

3448.78 - 3468.938 Robert Mustacchi

And having that tooling be able to do that just let us kind of do some inspection. We used this in other problems because that is something that we could use without the Oxide architecture. So we actually sometimes would compare and contrast that to what we saw on i86pc, on a standard PC.

3469.799 - 3485.155 Robert Mustacchi

But for this one, it was really code inspection, double and triple checking, rereading, getting it wrong a lot. And I don't know. There was a bit of a... Yeah, a lot of the time it's kind of a blur, I'll be honest.

3485.475 - 3489.896 Adam Leventhal

Were the iteration loops on this just like, you know, 5, 10, 20 minutes?

3489.956 - 3498.418 Bryan Cantrill

I mean, it sounds... Depends. It's a lot of... I think we... Yeah, go ahead.

3498.678 - 3507.875 Robert Mustacchi

The actual boot and like load time isn't that bad. It's really more of the mental... the mental effort there, I'd say. Just knowing what to do next.

3507.915 - 3520.506 Bryan Cantrill

You would love to have so much to iterate on that you're blocked on the iteration loop, which, yes, is like 20 minutes, but that's not even the problem. You have the despair loop, which is actually much longer. Oh, yeah.

3520.646 - 3524.449 Adam Leventhal

I'm factoring in the trip to the therapist and then the drive back home.

3524.909 - 3553.564 Bryan Cantrill

That's right. Well, and I do wonder on some of that stuff, because Robert, so frequently you tackle these problems where the stakes are high. If we don't resolve this problem, we've got a serious issue. But you've always got a very cool head when you go to debug these. Do you just have an innate sense of confidence? Are you not as scared as I am? I'm terrified. Are you not terrified?

3556.775 - 3583.386 Robert Mustacchi

No, don't worry. I'm plenty terrified. Okay, that actually makes me feel better. You know, I think part of it is also that I actually am never doing this alone, no matter where I've been at. Even if other folks don't necessarily have all the background, you know, at that time, you know, Keith and I were working together a lot, but other folks would sit there, help listen to things.

3583.786 - 3601.837 Robert Mustacchi

And this has been true when I've been debugging up and down across the stack, whether it's software, whether it's hardware. It really is a team effort. And even though some of the debugging is sometimes a solitary activity at first,

3604.198 - 3628.591 Robert Mustacchi

a lot of the other pieces, like writing it up... You know, especially Keith and I, when we were doing bring-up, I think the fact that we were working together, even though we were tackling different parts, means we would write up different things in some of our bug reports, review each other's notes, ask questions of one another. And the act of asking questions and being forced to answer them is there. You know, that's been the same thing that's true when we've been debugging, you know,

3629.411 - 3646.764 Robert Mustacchi

some of the T6 stuff which Nathaniel and I were working on recently or other things that we're doing that are up and down across the stack. I think the first thing is that you're not alone and you have the expertise or even just the different perspectives of all of your colleagues. And I think that's really invaluable in and of itself.

3649.126 - 3653.749 Robert Mustacchi

Because they'll ask you questions that might make you think in a different way or prompt different thinking.

3655.214 - 3676.444 Robert Mustacchi

And, you know, it can never be just, you know, yes, sometimes you need some time where you kind of get off the world and just kind of stare at things. But really it is working with others, I think, that ends up being necessary to kind of get through some of these harder bits, because that's, I think, partly what helps you get through it.

3677.551 - 3695.013 Robert Mustacchi

Because, you know, if I'm stuck and I see Keith making progress and that helps there, or if I see us making progress, you know, on the board work or higher up in the, you know, the control plane or other parts of the product or, you know, down below, then that kind of helps, you know, motivate you to kind of keep going. You know, it's not all bad. Right.

3695.493 - 3716.13 Bryan Cantrill

Well, and I think that, I mean, because it's part of what is so kind of remarkable about the way you are able to approach the system is you do oscillate from these, the absolute lowest layers of implementation, like really stuff that often does not have a lot of eyes on it.

3716.17 - 3730.727 Bryan Cantrill

I mean, like the deepest aspects of system implementation, ones that are absolutely required for system correctness and liveness. And then you're able to oscillate back up to this really much broader view. I mean, we need to drop a link to RFD 63 in here.

3733.69 - 3755.671 Bryan Cantrill

but in terms of like, you know, just, so at the same time, you're kind of like in the, the kind of the deepest possible muck in the, the lowest levels of the system in terms of, of early boot, you're also like at the whiteboard in terms of conceiving of, of what the kind of the networking interfaces we want to have for customers and the way we kind of think of that problem. Um,

3757.489 - 3769.813 Bryan Cantrill

I mean, could you speak to that a little bit? Because I mean, just that ability to oscillate, I just feel is extraordinary. And it's been such an asset, I think to your teammates so many times over.

3771.553 - 3786.598 Robert Mustacchi

Sure. I can try. I mean, I think there's a lot of stuff that's just virtue of, you know, being at the company earlier, there wasn't really a lot of other folks to do stuff. Yes. So, you know, there, there, there is a lot of stuff where it's just like, Hey, how do we start thinking about this? And,

3788.872 - 3812.202 Robert Mustacchi

I think for me, one of the things that's there is that you'll see in that, actually, RFD 63 was like the last of a set of networking docs. And it really started from higher level product comparison, kind of feature use cases, user networking API that we wanted in another doc. And then with that in mind, how do we start building up the lower level stuff?

3814.187 - 3822.415 Bryan Cantrill

I know we go to this metaphor a lot, Adam, but I'm sorry. These are the Federalist Papers for Oxide. These early RFDs from Robert.

3823.516 - 3843.349 Robert Mustacchi

But I think a lot of it comes back to, you know, for some of these things, they're things I've been thinking about for a while, so that helps. You know, even in RFD 63, there's a lot of retrospective on past attempts there or goals or, you know, things that worked well, things that didn't work well, you know, what we learned from paper reading and other stuff.

3846.091 - 3863.896 Robert Mustacchi

And, you know, I think coming back to Dogpatch, you know, if you go back to those Dogpatch decks in 2014, a lot of the high-level goals and, you know, I'd say experience and kind of usability goals, you know, are still things that are always in the back of my head.

3863.916 - 3878.166 Robert Mustacchi

So I think that that's always helped to kind of anchor that or kind of raise questions about, you know, what does this future look like? What does it mean? How does it actually tie back to those goals that we have? And then I think sometimes, yeah, go for it.

3878.186 - 3878.806 Bryan Cantrill

No, no, go ahead.

3878.826 - 3901.812 Robert Mustacchi

Sorry, go ahead. I think sometimes, you know, you need to work being able to get, you know, well, maybe it's a little bit of thrashing, but, and sometimes you need to kind of really focus on one or the other, it helps to kind of, for me at least, to go to different things. Because sometimes some of these small bugs are kind of low-level details. You can kind of really just focus in on it.

3902.032 - 3911.682 Robert Mustacchi

But having the broader context of how that fits into the system is helpful or a good distraction from that. So I don't know. It's not really a good answer.

3912.386 - 3924.936 Bryan Cantrill

Yeah. I mean, no, that's a great answer. And I think it's, I mean, it is really, I mean, I, and I have always believed, I mean, Adam, you told me, this is like our, a unifying belief I would say across the company is that details really, really matter.

3925.696 - 3948.991 Bryan Cantrill

And that, if you want to understand things at the highest level of a system and understand how these big pieces go together, you really need to understand low-level details, because if you don't, that's where you get these kinds of emergent behavior that really run contrary to the goals you have as a system. So I think it's really, really important.

3949.111 - 3970.499 Bryan Cantrill

But I mean, your ability to do it, Robert, has just been extraordinary. And the thing that I want to ask you about, because as you say, like, okay, it's early, so you're putting together the kind of, as our John Jay, you're putting together the Federalist Papers of Oxide. But we are then adding people to the company that are able to go and pick up pieces of that.

3970.799 - 3987.807 Bryan Cantrill

And you've always been, I think, terrific about really enabling other people to go pick up these pieces. I don't think anyone has ever felt like, oh God, I can't touch networking because I can't touch this aspect of the system. Because you're always like, no, no, please, someone, please come on in. Water's warm.

3987.867 - 3999.094 Bryan Cantrill

We'd love to help you, ramp you up and help you understand what I understand so you can go tackle this. Can you speak to that a little bit as well? Because I think that's part of what has enabled you to be so effective.

3999.934 - 4017.552 Robert Mustacchi

Um, yeah, I mean, I think you're right. I think there's a, there's a lot to be said that, you know, I think the important thing is that as you go from, you know, the first couple of us that hire to 10 to 20 to 30, you know, and you continue to grow is how do you help, uh, teach other people what's going on? Um, and how do you help ramp them up?

4017.592 - 4047.093 Robert Mustacchi

Cause I think that's equally important because yeah, I mean, um, the amount that one person can do is bounded. And sure, you can work overtime or push yourself a bit, but that will never be as effective as actually getting more people together. So I think that, to me, is an important question of how do you help teach people? How do you help them learn? How do you try to be helpful and help

4049.312 - 4058.622 Robert Mustacchi

let folks be productive, um, and share knowledge because otherwise, um, yeah, I don't know. So I think that we got into this.

4061.164 - 4063.465 Bryan Cantrill

we got into this in the RFD episode that you joined us for.

4063.766 - 4081.235 Bryan Cantrill

I mean, but obviously, I mean, you are our most prolific RFD author and you know, I think it's been great for people to kind of be able to join the company and are actually also, I've had, I'm sure you've had this too, Adam, where like people who have not yet joined the company, like Robert, you're famous to people who have not yet joined the company.

4081.255 - 4086.658 Bryan Cantrill

Cause you're just like, I've read this guy's work. I want to meet this person. Like this is, I feel like I know their voice.

4088.139 - 4088.419 Robert Mustacchi

Yeah.

4088.88 - 4089.2 Bryan Cantrill

And, but it's,

4091.195 - 4109.692 Robert Mustacchi

Yeah. I think that's, you know, having the docs is good, but it's not, it's not, it's necessary, but not sufficient for that. Cause I think, well, I don't want to just show up and just be like, Hey, here you go. Please read this, you know, a hundred thousand word dissertation, uh, you know, 200,000 words and, you know, come up with a summary and, uh, you know, go from there.

4109.713 - 4129.153 Robert Mustacchi

Cause I think that, I think that can be very overwhelming. The same way, you know, Rome wasn't built in a day, this wasn't all written in a day. So, um, I think it's important to also figure out, you know, how do you kind of introduce these topics, kind of, you know, get to more and more detail.

4129.393 - 4143.564 Robert Mustacchi

And then I think a lot of it is also just not being, being really willing to, if people ask for, Hey, how does this work? Um, you know, being willing to put together an ad hoc presentation, whether it's formally with slides or not on how, how does this, you know, what's going on? You know, this is, uh,

4144.52 - 4164.401 Robert Mustacchi

Another example is a colleague was asking about, hey, I'm trying to get more into some of the PCIe hot plug stuff. Can you help me understand how this actually all fits together with this? Because the PCIe spec is long and involved and not the easiest thing to read. So, you know, then the answer to that is, you know, yeah, let's find a time that does this and, you know, get into it.

4164.802 - 4187.348 Robert Mustacchi

And I think that's partly how you do that. And then helping make sure folks feel that it's okay to ask those questions too is equally important, which, you know, is always an ongoing effort to kind of build that rapport, because you can't necessarily know someone's struggling. So all you can do is try to be open and willing to answer stuff and try to be helpful.

4189.778 - 4217.142 Bryan Cantrill

Right. Yeah. Well, and I think, I mean, just getting into PCIe, because I mean, it's another total, someone in the chat, PCIe is a plug. Yes. Oh, yes. I mean, another, and it feels like, you know, this is true for all of these domains where it is an entire universe of complexity. And I mean, having, you know, kind of helping people navigate it

4218.062 - 4237.431 Bryan Cantrill

And then I've always felt that like, you know, you're so charitable with your own knowledge. It definitely, I think I feel anyway that you inspire people like, oh yeah, I can learn this stuff too. Like I can actually, this stuff is actually learnable. This is, it feels like it's very dense and I don't understand it yet, but I just need to, I can do it.

4237.591 - 4245.235 Bryan Cantrill

I can actually learn this stuff, which is very inspiring because there's a lot to go learn. Yeah.

4246.454 - 4269.987 Bryan Cantrill

You want to talk a little bit about, like, just as a very concrete example, I'm not sure if you want to take this recent T6 problem and/or this recent LRDIMM problem, and/or both, as I kind of, I just say another. Actually, no, but before we do that. Actually, yes, or good. No, thank you. Actually, let me back up, because I actually have a different question I want to ask you, then we'll get to those.

4271.328 - 4286.487 Bryan Cantrill

Because one of the things that I definitely appreciate about, again, the way you think about the system is, and I know other people have appreciated it as well, is your ability to seemingly see around corners. And the number of times I will hit an issue

4287.208 - 4301.958 Bryan Cantrill

And then I go into an old RFD of yours where it's like, I look under my chair and like, oh my God, Robert has already written a dissertation on this exact issue that I've grappled with and that I even read at the time, but didn't really appreciate.

4303.815 - 4325.645 Bryan Cantrill

How do you – I mean, because I feel it's like a real – I mean, it's a tremendous gift, for lack of a better word, to be able to kind of project a system into the future and anticipate some of the issues that we're going to hit on a system that's not yet built. Do you have any tricks for doing that? I mean, is that – how do you do it?

4326.986 - 4332.981 Robert Mustacchi

I wish I could claim omnipotence or something like that, but that clearly – or something like that.

4333.321 - 4336.983 Bryan Cantrill

Um, it would make a lot of sense though, if you did like, you know, maybe that now might be the time.

4337.003 - 4359.688 Robert Mustacchi

Yeah. I mean, I think, um, that's a good question. I think a chunk of this I picked up from just working with Keith over the years. Um, yeah, Keith's also very, very good at it. There's a lot there, but, um, I, I think this sometimes gets back to, uh, you know, the earlier kind of a code review is really trying to ask why, um, Yeah.

4360.028 - 4380.325 Robert Mustacchi

And really trying to understand how does this fit into the system? How is someone going to do this? Or if someone wants to do X, what else does that mean they need to be able to do? Or how does it work? For some of this stuff that we're seeing around, I think some of it's cheating, in that we've been thinking about it for over a decade. Dogpatch is 2014. It's 2025. Yeah.

4380.645 - 4402.875 Robert Mustacchi

So some of it's just had a long time to just marinate in the back of the head with just different experiences. And I think part of that also is just a bit of, um, you know, paying attention to how folks are using things, how operators are using things. Um, I don't know. I don't know. I wish I had a better answer for you there. Uh, so I could, no, no, no, no, no.

4402.915 - 4418.028 Robert Mustacchi

I think a lot of it is just trying to, a lot of it's listening, and then a lot of just digging and really needing to understand the low-level details before you can kind of go and make the high-level answer and how the two inform one another.

4420.393 - 4444.873 Bryan Cantrill

Well, I also think that this is where the amount of time you spent in the details just really helps inform your engineering wisdom. I mean, just to give you a really concrete example of your wisdom and how we benefited from it recently, we had a Murata power shelf issue that Eric Aasen and our team had done a terrific job debugging. And Eric is like, I think this is a control loop issue.

4445.113 - 4468.721 Bryan Cantrill

And I think we're going to need, we're going to get a firmware drop from Murata that fixes this. And I remember thinking like, wow, that is, I mean, in some ways that'd be great. But this is a problem that we definitely needed to resolve. Murata did a great job, resolved it. And fortunately, we had a mechanism that we had not built anything on top of. We had not built any of the software.

4469.262 - 4481.516 Bryan Cantrill

But the actual power shelf itself had a mechanism to update its software. And that's something that you really believed in strongly when we were looking at power shelf selection. I just think it's an example of what I mean of where...

4482.857 - 4509.928 Bryan Cantrill

not just evaluating the artifact you have in front of you, but what could potentially go wrong with this where we would need to be able to update the firmware on a PSU that is not necessarily the first thing that a lot of people would think about. And You kind of have to have a certain kind of quantity of scar tissue to really be able to think about it. And for many years, we didn't need it.

4510.388 - 4527.184 Bryan Cantrill

For many years, it would have felt like, well, I don't know, maybe this firmware always does work. But now, of course, we do need it. But we were able to build on top of it, and it's all worked, and we were able to upgrade the firmware. It was terrific. But it was a great example of you really saw around a very important corner.

4529.591 - 4557.018 Robert Mustacchi

Yeah, and I think that often just comes back to some, let's say, just scar tissue and experience, right? Some of it was we've had times where we couldn't upgrade firmware or we couldn't get, you know, we didn't have the tools to or even get firmware updates. And so we've just been burnt by that over the years. And so I do think there's also just, you know...

4560.038 - 4588.366 Robert Mustacchi

You really want to think about how you approach and think about, uh, firmware just the same way as software. While it can be very hard to understand, just because of the specifics of the way you're working with vendors, you know, obviously, if you just get a binary blob, it's hard to really get into there. But, you know, it's a thing that can fail and needs to be updated just like anything else. So, um,

4590.597 - 4615.708 Robert Mustacchi

I suspect that if we probably go back to Dogpatch, we probably have some commentary on the firmware that we were dealing with and just the operational problems there and trying to figure out how do you get to a better model? Where even is all the firmware? If it's there, assume something's going to go wrong. Or if there's an EEPROM that has data, assume it has to be flashed.

4615.888 - 4634.981 Robert Mustacchi

We're going to need to flash it. And then once you start thinking about it in some portions, then it turns out that that's actually true in a lot more stuff. There's nothing that's just specific to... I think a lot of it is also just trying to take what are the things that you learn from one area and don't just... How can you apply them to others?

4636.182 - 4659.123 Robert Mustacchi

And sometimes that'll be right, and sometimes that'll misfire and maybe not be quite the right starting approach. But I think it lets you ask other questions and helps you think about that. So for some of us, it's like, no one would say, ah, we should never be able to upgrade the software in the product. Right. The same would kind of be true.

4659.143 - 4681.608 Robert Mustacchi

It's like, well, if someone comes back to you, what's the service experience? Okay, let's say we did have to do this. If there was no way to do online, to upgrade the firmware of the rectifier through software, then that means I'd have to send a tech out to every DC of every customer and have to pull one out one at a time and do this. And that's just not...

4682.926 - 4695.216 Robert Mustacchi

That works when your n is small and it's a bad day for someone, but it really doesn't scale. So that's another big part of just where are these things that are, what would be the remediation if it fails?

4697.794 - 4713.828 Bryan Cantrill

Just asking yourself that question for all these. Because, I mean, as a result, like we did, I mean, the fact that we did our own power shelf controller, which is not necessarily the first thing that I think other people would think about. And, I mean, Murata, I would say Murata, some other power shelf players,

4714.248 - 4730.415 Bryan Cantrill

manufacturers had different opinions about us doing that. And, uh, it was a real tribute. Part of the reason that we have a great relationship with Murata is because they were willing to accommodate the fact that we're like, no, we're chucking out your power shelf controller, we want to do our own.

4731.015 - 4744.583 Bryan Cantrill

And that was a very, very, very good decision because it gave us control over – and not just the ability to update the firmware, although also that too now, but just the level of observability into a part of the system that doesn't generally get that kind of observability.

4745.404 - 4762.264 Robert Mustacchi

Yeah, and I think what helped is, again, from the kind of RFDs or docs building up on top of one another. I think it's something like RFD 82, which is the one about kind of operator design principles and facilities for operation, you know, has something about firmware upgrade in it.

4762.304 - 4777.279 Robert Mustacchi

So then as you're going out and kind of doing, sending these questions out to different vendors, you can go back and say, okay, what are the things I need to think about from there? You know, what are the kind of the key things, you know, how does it apply to things, even just not just firmware, but, you know,

4778.846 - 4802.886 Robert Mustacchi

When a rectifier, we'll just pick on the power shelf, fails, how do you identify which slot it is? What serial number it was? Where is it? And how does that turn back into... you know, just different features that you need, you know, and, you know, how do you tie that into basically different operator stories? And then, you know, that same thing is true of failing disks, right?

4802.906 - 4820.218 Robert Mustacchi

I think it's much easier for us to think about, you know, a disk fails, I need to pull it, I want to blink the right slot. Well, the same is going to be true of a rectifier. The same is going to be true of a fan. The same is true of, you know, a transceiver. So, you know, there's a lot more similarity in some of these things, even though there are differences.
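
The point about disks, rectifiers, fans, and transceivers sharing the same operator story can be sketched in code. This is a minimal illustration, not anything from Oxide's software: the `Component` type, its fields, the `mark_for_service` helper, and the serial numbers are all hypothetical names made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    DISK = "disk"
    RECTIFIER = "rectifier"
    FAN = "fan"
    TRANSCEIVER = "transceiver"


@dataclass
class Component:
    # Hypothetical record: every replaceable part answers the same questions.
    kind: Kind
    slot: int
    serial: str
    healthy: bool = True
    locate_led: bool = False  # True while the slot's locate LED is blinking


def failed_components(inventory):
    """Return every unhealthy component, whatever its kind."""
    return [c for c in inventory if not c.healthy]


def mark_for_service(inventory):
    """Blink the locate LED on each failed slot and describe it for the operator."""
    report = []
    for c in failed_components(inventory):
        c.locate_led = True
        report.append(
            f"{c.kind.value} in slot {c.slot} (serial {c.serial}) needs service"
        )
    return report


inventory = [
    Component(Kind.DISK, slot=3, serial="D-0003"),
    Component(Kind.RECTIFIER, slot=1, serial="R-0001", healthy=False),
    Component(Kind.FAN, slot=2, serial="F-0002"),
]
for line in mark_for_service(inventory):
    print(line)
```

The operator-facing code never branches on what kind of part failed. Which slot, which serial number, and which LED to blink are the same questions for every part.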

4823.359 - 4851.191 Bryan Cantrill

Yeah, so I'm going to try to make RFD 82 public while we're in the podcast. So RFD 82 is not currently public, but I think it's a very good one to make public because that really captures, I think, an important aspect of your own approach and your thinking about the way we ask different questions of different parts of the system.

4852.858 - 4871.011 Robert Mustacchi

Yeah, and this isn't also, you know, there's a lot of different folks provided feedback on there, and there's a lot of different work there. So it's not, again, this is another one of those things that's not just, you know, a single person doing it.

4871.371 - 4885.266 Robert Mustacchi

You know, we're taking these ideas, you know, to colleagues to kind of get feedback on them, explore it more, get different perspectives. Make sure we're communicating it well. That makes it better.

4885.286 - 4894.409 Bryan Cantrill

So it's not, you know. I think I've just made 82 public. Sorry.

4894.47 - 4896.851 Robert Mustacchi

I made something public.

4897.191 - 4927.981 Bryan Cantrill

Definitely made something public. So let's hope it's RFD 82. So you may or may not, it may take a second for the, this is like us, this is a terrific RFD API. So that may take a second, but hopefully people will be able to see that at some point real soon. Okay, so the... And I think, Robert, 82 is a great one in terms of your, again, your own kind of disposition. So maybe to kind of...

4929.352 - 4948.171 Bryan Cantrill

just dive back down into some of these specific technical details of things that you've been dealing with recently. I mean, the T6 issue is a really interesting issue. I'm not sure if you're willing to get into the weeds on that one, but I do think that it's very concrete. How much do you know about this one, Adam? Have you...

4949.552 - 4963.544 Adam Leventhal

Uh, only Robert was describing a bit of it to me, uh, maybe a week or two ago, as I heard a lot of T6 buzz, but for, uh, T6 is the, the NIC silicon that we're using in the next generation server, right?

4964.024 - 4969.689 Bryan Cantrill

Uh, in the current generation, current generation, pardon me. So this is, um, in, in Gimlet.

4970.509 - 4973.873 Robert Mustacchi

Yeah, so we do something a little weird with the NIC.

4974.234 - 4986.769 Bryan Cantrill

So there's something... In stark contrast to the rest of the system where we are just like... Down the fairway. Down the fairway, exactly. It's like we also are a little weird with the NIC. So welcome to Oxide.

4986.789 - 5010.193 Robert Mustacchi

Yeah. We've gone our own way. One of the things that we were really trying to make sure we could do, because the NIC has its own firmware and configuration, which changes a whole bunch of different settings and things there, is we wanted to make sure we could validate slash attest that information. And we went through a bunch of different ways as a group to figure out how could we do that.

5012.494 - 5035.414 Robert Mustacchi

And what we settled on was that the NIC actually has its own manufacturing mode built in, which is useful, because some of this config file is things like, you know, a whole bunch of information for the PCIe SerDes, or, you know, it describes things about how Ethernet should work, you know, do you have I2C for transceivers? Do you have different PHYs, etc.

5035.994 - 5062.007 Robert Mustacchi

So this information is all very critical for the NIC to work. But, you know, we didn't want to have just a one time factory programming process here, because What if we need to update it? What if something got wrong? How do we deal with it? So we end up using this feature of the NIC called manufacturing mode, which basically has the NIC boot out of an internal mask ROM. It doesn't enable Ethernet.

5062.067 - 5077.182 Robert Mustacchi

It doesn't load any firmware that would run on some of the cores internally. It doesn't do a whole lot of stuff. But it gives us access to the hardware blocks for the NIC's own EEPROM and SPI flash.

5078.404 - 5099.657 Robert Mustacchi

So basically what this means is rather than basically creating a complex MUX system to change ownership of these devices like we have to for the host SPI flash, here we basically read and validate this through the NIC in its manufacturing mode. So what this means is that every time the server turns on... We're all booted.

5099.737 - 5110.085 Bryan Cantrill

Importantly, this means we're able to do this. This is not the SP running in some context before the host CPU is up. This is the host CPU is able to do this while it itself is up.

5110.806 - 5128.714 Robert Mustacchi

So the way we've rigged this up is that on the board, we have a GPIO strapped so the NIC always shows up in manufacturing mode. Right. Then, because we have power control over every device, we can basically validate all this and effectively, you know, we don't necessarily turn off all the power here.

5128.754 - 5134.718 Robert Mustacchi

We kind of reassert the reset of the device and then basically boot it back up into what they call mission mode.

5136 - 5148.77 Adam Leventhal

And so the way that normal folks would program their NIC is like having a little, like, a NOR flash or whatever kind of strapped to it that had this configuration information and it loaded up autonomously. Is that...

5149.601 - 5164.046 Robert Mustacchi

So yeah, the way mission mode works is exactly that way, Adam. They basically have a little SPI NOR flash, a little EEPROM, and it reads from that. And so we have those same things, it's just that in manufacturing mode, to basically bootstrap it, we just do it through the NIC.

5165.906 - 5177.11 Robert Mustacchi

And on a normal PCI card, this is a little jumper that's there for the factory, done once, and then you update this, you can update the firmware from the NIC while it's live, but...

5179.255 - 5201.136 Adam Leventhal

Right. And as you were saying, doing this every time allows us to put known, attested, valid bits there as opposed to having this question of what's living persistently and did some nefarious actor manage to wedge some bad bits down there?

5201.736 - 5210.453 Robert Mustacchi

It also solves the factory chicken-and-egg problem, like how do you get the initial version on there? It's just like, it doesn't matter if there ever was something there or not. We just always put what we think should be there.
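
The boot flow described over the last few turns (come up in manufacturing mode, make the NIC's flash and EEPROM match what is expected, then reset into mission mode) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Oxide implementation: `FakeNic`, `bring_up`, the region names, and the image contents are all hypothetical, and the sketch rewrites a region only when its digest differs, which the conversation does not specify.

```python
import hashlib
from enum import Enum


class Mode(Enum):
    MANUFACTURING = "manufacturing"
    MISSION = "mission"


class FakeNic:
    """Hypothetical stand-in for the NIC; real hardware access would go here."""

    def __init__(self, flash=b"", eeprom=b""):
        self.mode = Mode.MANUFACTURING  # strapped so it always starts here
        self.storage = {"flash": flash, "eeprom": eeprom}

    def read(self, region):
        return self.storage[region]

    def write(self, region, data):
        if self.mode is not Mode.MANUFACTURING:
            raise RuntimeError("storage is only programmed in manufacturing mode")
        self.storage[region] = data

    def reset_into_mission_mode(self):
        self.mode = Mode.MISSION


def bring_up(nic, expected):
    """Make the NIC's storage match `expected`, then reset into mission mode.

    Returns the names of the regions that had to be rewritten.
    """
    rewritten = []
    for region, image in expected.items():
        want = hashlib.sha256(image).digest()
        have = hashlib.sha256(nic.read(region)).digest()
        if have != want:
            nic.write(region, image)
            rewritten.append(region)
    nic.reset_into_mission_mode()
    return rewritten


expected = {"flash": b"firmware-v2", "eeprom": b"config-v2"}
nic = FakeNic(flash=b"firmware-v1")  # stale flash, blank EEPROM
print(bring_up(nic, expected), nic.mode)
```

A blank part fresh from the factory and a part holding stale or tampered contents take exactly the same path, so there is no separate initial-programming step to maintain.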

5211.495 - 5231.084 Bryan Cantrill

Right. And this is like having the recovery path be the primary path, Adam. It's like, this is so you don't have this kind of path that you rarely execute that, when you need to execute it, doesn't work. And also, because this is a real problem, like, NICs have gotten really, really complicated. And it's like, oh, by the way, there's another computer over there.

5231.345 - 5249.637 Bryan Cantrill

I mean, it's not really clear like who's in charge of the system anyway. And especially with like, you know, SmartNICs, they're being pretty overt about like, no, I've got my own CPU. I've got like, I load my own OS image. And it's like, no, no, no, we don't want any of that.

5249.797 - 5267.153 Bryan Cantrill

We want to have all this information flow in a way that it's attestable, that we know exactly what we're putting out there. So we don't want that level of autonomy out of the system. We don't want this thing for a bunch of reasons. Security, reliability, a bunch of things. So...

5269.797 - 5289.088 Bryan Cantrill

I mean, this felt like a huge, uh, win that we can use this manufacturing mode to do this. I mean, they're really a real important aspect. I mean, it kind of reminds me of the, our, the fact that we, we don't actually do warm resets. We only boot because we don't want to have this kind of like state accruing in places.

5289.569 - 5315.366 Robert Mustacchi

And it definitely simplified the, you know, we talk about hardware software co-design. This definitely simplified the electrical design. Definitely, the last thing you want are more SPI muxes and other things in the way, dealing with voltage translation, dealing with questions of even which SPI port would we have to connect it to, you know, on what device, what way would data flow?

5315.386 - 5317.568 Robert Mustacchi

So it really simplified a lot of stuff, which was.

5317.789 - 5331.538 Bryan Cantrill

Yeah, actually, this is a very good point, Robert, because it's like, I do feel that when we, you know, I had always thought that like, you know, there's kind of this back and forth over the what controls a certain aspect of the system, whether it's the SP or the kind of the host CPU.

5331.818 - 5337.499 Bryan Cantrill

And there's a kind of a Conway's law thing that, you know, you've got different orgs that are kind of dueling for power, which we don't have at Oxide.

5337.919 - 5356.283 Bryan Cantrill

But like, even if you are like, you have like total organizational harmony, just when you have something that is effectively dual ported, that is kind of owned by two things, you have all of these like electrical issues that are really thorny about, as you say, like level translation, which is a bunch of stuff that,

5357.056 - 5384.072 Robert Mustacchi

We want to... sorry, just to underscore the importance of this. Yeah, I mean, we have a team that has gotten it right time and time again. But also, you know, people joke that the best software is the code you didn't write: no bugs in the code you didn't write. You know, there are no electrical problems if you don't put those things down. And sometimes they're necessary, but if you can get away with it, it makes it a lot smoother. So anyways...

5385.815 - 5408.517 Robert Mustacchi

All that said, the way this whole problem comes about is that the mask ROM starts up in PCIe Gen 2. It's basically hard-coded in silicon to start as a Gen 2 by 8 device, as opposed to a Gen 3 by 16. And we had occasionally seen some failures.

5410.789 - 5434.58 Robert Mustacchi

And I think it was really Josh Clulow who insisted we do some of this boot loop testing, where we'd see devices that had a surprising, occasional failure to train in manufacturing mode. And if we came back to it later and tried to restart things or took another lap, it would often work. Or even if you tried to reset it after that, it would come up just fine.

5435.44 - 5440.222 Robert Mustacchi

But the first time on a cold boot, it wouldn't turn on.
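The boot loop testing described above can be sketched as a small harness that repeats a cold boot and tallies how often the link fails to come up. This is only an illustration: `boot_loop`, `simulated_cold_boot`, and the 2% failure rate are invented stand-ins, not the tooling or numbers from the show.

```python
import random
from collections import Counter


def boot_loop(cold_boot, iterations):
    """Run `cold_boot` repeatedly and tally the outcome it reports each time."""
    results = Counter()
    for _ in range(iterations):
        results[cold_boot()] += 1
    return results


def simulated_cold_boot(rng, failure_rate=0.02):
    """Hypothetical stand-in for a real power cycle; the failure rate is made up."""
    return "no_link" if rng.random() < failure_rate else "trained"


rng = random.Random(0)  # fixed seed so the simulated run is repeatable
tally = boot_loop(lambda: simulated_cold_boot(rng), 1000)
print(dict(tally))
```

The point of looping is that a rare, first-boot-only failure is easy to miss by hand but shows up quickly in a tally over many cold boots.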

5441.334 - 5448.02 Bryan Cantrill

And so when you say it does not train, what does training a link mean? Who's doing that training, and what does it mean when it fails to train?

5448.1 - 5478.423 Robert Mustacchi

Yeah, so that's a great question. So the way to think of this is, there's a complex state machine in the PCI docs that a PCIe device basically goes through to have a PCIe link come up. So when your operating system wants to go read or write a register, that ultimately gets transformed into a transaction on the PCIe bus, which is a point-to-point link between a port on the CPU

5479.415 - 5499.161 Robert Mustacchi

and the downstream device, so the NIC in our case. You may have switches or other things in more complex designs, but really you can think of PCIe as a bunch of point-to-point links, generally between something on your CPU, which they may call the root port or an upstream port, and a device, often called the downstream port.

5499.921 - 5521.393 Robert Mustacchi

And so link training is a process of basically figuring out, effectively, what are the shared ways we're going to operate. So for example, because PCIe is backwards compatible, you can take today's Gen 5 devices and put them in a PCIe Gen 1 board, and the link will train at PCIe Gen 1.

5523.174 - 5543.465 Robert Mustacchi

You might have a root port that supports up to 16 lanes, but you might put in a NIC that only has one lane, a small 1 gig link. So it will do that. Making these links work at high speeds is a very complex process, because you have to figure out a lot of equalization and tuning so that they can interact.
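The two examples above (a Gen 5 device in a Gen 1 board, and a one-lane NIC in a 16-lane port) boil down to each side settling on the best speed and width that both support. A minimal sketch of that outcome, assuming each side supports every generation and width below its maximum; `PortCaps` and `negotiate` are illustrative names, and real training also involves the equalization and tuning mentioned above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCaps:
    """Hypothetical summary of what one end of a link can do."""
    max_gen: int    # highest PCIe generation supported
    max_lanes: int  # widest link supported


def negotiate(root_port, device):
    """Return the speed and width both ends can share (simplified)."""
    return PortCaps(
        max_gen=min(root_port.max_gen, device.max_gen),
        max_lanes=min(root_port.max_lanes, device.max_lanes),
    )


# A Gen 5 device in a Gen 1 board trains at Gen 1.
print(negotiate(PortCaps(1, 16), PortCaps(5, 16)))
# A 16-lane root port with a one-lane NIC trains at one lane.
print(negotiate(PortCaps(4, 16), PortCaps(2, 1)))
```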

5543.985 - 5562.655 Robert Mustacchi

Effectively, link training is this process in this kind of large state machine that hopefully ends with the link trained, so basically successfully completing the state machine. And it's really done by the PCIe device that you're plugging in and really the PCIe root port, which is

5564.008 - 5571.633 Robert Mustacchi

generally a whole bunch of hardware in your CPU that probably itself has a secret core running stuff too that no one tells us about.

5571.653 - 5579.698 Adam Leventhal

Is this about figuring out analog kind of coefficients and constant values that make things flow with minimal errors?

5580.589 - 5601.068 Robert Mustacchi

Yeah, I think that's one part of it. And, you know, that's definitely a large part of it. And other parts are the digital protocol communication, you know, what features can be used. So, you know, the PCI-SIG has done a lot of work. And, you know, it's a testament that PCIe has been backwards compatible back to its first release.

5601.088 - 5602.329 Bryan Cantrill

It's amazing. Yeah.

5606.657 - 5619.946 Bryan Cantrill

So that is what it means to train. So when we're not training occasionally in manufacturing mode, we're just giving up on this. That device is just not working. I don't know why.

5620.987 - 5651.038 Robert Mustacchi

Yeah. Basically, we don't see a device come up. So there's a register in the root port that says, is there a PCI link established? And if you read it, it says no. There sure isn't. And you're like, well, that's very sad. So there was a whole bunch of stuff that we were trying to do to figure this out because, you know, we had some challenges with the T6 initially.
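The "is there a link established?" check can be illustrated with the standard PCIe Link Status register, where bits 3:0 hold the current link speed, bits 9:4 the negotiated width, bit 11 the Link Training flag, and bit 13 Data Link Layer Link Active. Which register was actually read on the show is not stated, so treating it as Link Status is an assumption, and `decode_link_status` is a made-up helper name.

```python
def decode_link_status(value):
    """Decode a 16-bit PCIe Link Status register value into named fields."""
    return {
        "speed_code": value & 0xF,                  # bits 3:0, 1 = 2.5 GT/s, 2 = 5 GT/s, ...
        "width": (value >> 4) & 0x3F,               # bits 9:4, negotiated lane count
        "training": bool(value & (1 << 11)),        # bit 11, training in progress
        "link_active": bool(value & (1 << 13)),     # bit 13, data link layer is up
    }


# A Gen 2 x8 link that is up, versus a port with nothing trained.
print(decode_link_status((1 << 13) | (8 << 4) | 2))
print(decode_link_status(0x0000))
```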

5653.584 - 5678.676 Robert Mustacchi

There are some errata we found the hard way around it needing some double resets and some other conditions. There's a lot of investigation that we kind of split up and took in a couple of different phases. Especially as, I think right as this was kind of kicking off, I was disappearing on vacation for a bit.

5679.536 - 5696.686 Robert Mustacchi

But I worked with Nathaniel and Josh, and Nathaniel started to basically go through a bunch of, you know, just different questions we had electrically. You know, is there a chance that this could be happening because we don't see the device coming out of reset?

5696.746 - 5714.738 Robert Mustacchi

So one of the first things we kind of looked at, and now you can correct me where I'm misremembering some of this, was trying to bifurcate: you know, did the device assert that it was coming out of reset? And did we ever try to even begin PCIe initialization or not? Because depending on the answer to that, that would take us down two very different paths.

5716.379 - 5745.586 Robert Mustacchi

And the chip itself has a little pin that says it came out of reset. So there's a bunch of stuff we looked at there. And secondarily, because of a whole bunch of the low-level work we had done to boot, we knew how to read out the state of the PCIe state machine that the root port thought it was in. So what this meant is that we could go look at the root port.

5746.847 - 5767.487 Robert Mustacchi

It has basically a ring buffer of the last 30-ish state transitions it's performed. So we can figure out what has it been doing, what has it seen, as kind of a guide, and you can compare that against the PCIe spec, and there's more or less a one-to-one correspondence between those states and

5768.957 - 5790.124 Bryan Cantrill

state machine. Um, and how are we able to query that? Is that because of our own holistic boot? Who's got access to that? So that, you could actually, uh, anyone who has access to the system management network. So it doesn't, strictly speaking, need to have holistic boot.
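The ring buffer of roughly the last 30 state transitions mentioned above behaves like a fixed-size history in which the oldest entries fall off as new ones arrive. A small model of that behavior, assuming a capacity of exactly 30; `TransitionLog` is an illustrative name, and in the real system the hardware keeps this history rather than software:

```python
from collections import deque


class TransitionLog:
    """Fixed-size history of link state transitions, oldest entries dropped first."""

    def __init__(self, capacity=30):
        self._entries = deque(maxlen=capacity)

    def record(self, state):
        """Append a newly entered state, evicting the oldest if the log is full."""
        self._entries.append(state)

    def history(self):
        """Return the retained states, oldest first."""
        return list(self._entries)


log = TransitionLog()
for i in range(35):
    log.record(f"state-{i}")

# Only the most recent 30 of the 35 recorded transitions survive.
print(len(log.history()), log.history()[0], log.history()[-1])
```

The practical consequence is that the buffer only explains the recent past, which is part of why capturing it on every boot loop iteration is useful.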

5791.63 - 5819.014 Robert Mustacchi

Asterisks. Right. Sorry. So anyone can read this. That is, whether you're in holistic boot, in the Oxide architecture, or just running illumos in general, or Linux or other systems, you can actually read from the system management network. Now, knowing what to read and what maps to what can be a little more challenging, because the big gotcha, as we said earlier, is that the CPU is flexible.

5819.983 - 5831.494 Robert Mustacchi

So it has 128 lanes. And there's a mapping, when you talk to that firmware, between some of those lanes and what underlying hardware

5832.954 - 5856.697 Adam Leventhal

resource they're using. I mean, so, Robert, you're saying that anyone can read it, but holistic boot allows us to read it coherently and know what the fuck is going on, whereas in other scenarios it's possible but kind of academic, right? Like, it's very hard to make sense of it. Yeah, yeah. I mean, you can, with enough expertise, you can build up a mapping.

5857.845 - 5878.573 Robert Mustacchi

It's just certainly a lot faster when we have a data structure that tells us this is what it is for this. This device, which should have the T6, read all these registers. It's certainly a lot simpler and certainly a lot easier because we also integrated a bunch of register grabbing initially into the actual boot and training path itself.

5879.707 - 5905.059 Robert Mustacchi

So we could actually just, on a debug build, for example, we'll just automatically collect a whole bunch of different registers from the PCIe core and the PCIe port, which corresponds to the root port, just by default. And so certainly that is where this is a lot easier because that's not as straightforward to do outside of building it really into the system itself.
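The idea of a data structure that says which lanes and registers belong to which device, plus automatic register capture on a debug build, might look roughly like this. Everything here is hypothetical: `PortMapping`, `capture_registers`, the lane numbers, the base address, and the offsets are invented for illustration and do not describe the real system management network layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortMapping:
    """Hypothetical record tying a named device to its lanes and register block."""
    name: str
    first_lane: int
    lane_count: int
    register_base: int


def find_port(mappings, name):
    """Look up the mapping for a device by name."""
    for mapping in mappings:
        if mapping.name == name:
            return mapping
    raise KeyError(name)


def capture_registers(read32, port, offsets):
    """Read each offset in the port's register block using the supplied reader."""
    return {offset: read32(port.register_base + offset) for offset in offsets}


# Invented mapping and a fake register reader standing in for real hardware access.
mappings = [PortMapping("nic", first_lane=16, lane_count=16, register_base=0x1000)]
fake_read32 = lambda address: address & 0xFF
snapshot = capture_registers(fake_read32, find_port(mappings, "nic"), [0x00, 0x04, 0x08])
print(snapshot)
```

With the mapping known up front, the question changes from "which of 128 lanes is this?" to a simple lookup, which is the difference between possible and practical described above.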

5907.315 - 5913.578 Bryan Cantrill

Yeah, totally. So you were able to go through that to figure out why is this thing not training?

5914.518 - 5929.405 Robert Mustacchi

Well, yeah. So we were able to eventually do that first kind of bifurcation and say, okay, the device is always coming out of reset. Then, you know, we can go through and figure out,

5931.038 - 5959.346 Robert Mustacchi

you know, why. Because basically we no longer have to go look at the... we had a lot of electrical questions, but that ruled out a whole huge class. And, uh, I guess Nathaniel's thankful it was back in my court a little bit. But yeah, so then we started looking at this, and, you know, we had this boot loop stuff that Josh had put together.

5959.366 - 5975.886 Robert Mustacchi

And we had a modified version that was grabbing out some of this register state at every loop. And actually, Andy had actually already gone through and analyzed a bunch of it prior to me coming back to this problem when I was coming back from being out for a little bit.

5976.641 - 5998.658 Robert Mustacchi

So once we kind of had the sense that these were all very similar, and they were all ending in a similar state, it was pretty suspicious, because of what we actually saw. And so, to understand how PCIe works: you always train a PCIe device to Gen 1, no matter what. Then from there, you're going to go to different speeds.

5999.279 - 6025.665 Robert Mustacchi

And basically, as part of the Gen 1 negotiation, you're saying what else you can support. And then the device will go and go to these higher speeds. 2 versus 3 is very different. And then 3, 4, 5 are kind of a different path. But we'd see that we basically got to Gen 2. We just successfully trained to Gen 1. We would go down the upgrade path to Gen 2. We would think we got to Gen 2. And then...

6028.962 - 6055.706 Robert Mustacchi

as you're kind of going through the state machine and you're waiting for the other side to acknowledge all of this in the Recovery.Speed substate or something like that, it'd just be like, ah, well, we timed out. And we're going back to Detect, which is basically the entry state that says, you know, start looking for something here. Well, actually, it wasn't quite nothing here.

6056.366 - 6076.276 Robert Mustacchi

But the whole point is basically you go back to detect to try to see, which is basically see, is anyone there? We would see someone is there and start going down the path again anew. You kind of do a whole fresh link training. And it would just stop replying at a certain transition point that was an indication to enter what's called compliance mode.
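The failure signature described above (train to Gen 1, attempt the speed change, time out back to Detect, then drift into compliance) can be checked mechanically once the state history is in hand. A sketch using standard LTSSM state names from the PCIe specification; the trace is an illustrative reconstruction of the narrative, not captured data, and `summarize` is a made-up helper:

```python
def summarize(trace):
    """Flag the notable events in an ordered list of LTSSM state names."""
    pairs = list(zip(trace, trace[1:]))
    return {
        "reached_l0": "L0" in trace,
        "attempted_speed_change": "Recovery.Speed" in trace,
        "fell_back_to_detect": any(
            prev.startswith("Recovery") and cur.startswith("Detect")
            for prev, cur in pairs
        ),
        "entered_compliance": "Polling.Compliance" in trace,
    }


# Illustrative trace only: Gen 1 comes up, the speed change fails, retraining hits compliance.
trace = [
    "Detect.Quiet", "Detect.Active", "Polling.Active", "Polling.Configuration",
    "Configuration", "L0", "Recovery.RcvrLock", "Recovery.RcvrCfg", "Recovery.Speed",
    "Detect.Quiet", "Detect.Active", "Polling.Active", "Polling.Compliance",
]
print(summarize(trace))
```

Running the same summary over every boot loop iteration makes it easy to see whether all failures end the same way, which is the "very similar, similar end state" observation made earlier.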

6078.134 - 6092.558 Robert Mustacchi

So normally compliance is something you only enter because you specifically requested it, because you're at one of these PCIe interop tests and you're trying to prove something, you know, basically to pass PCI-level compliance.

6093.438 - 6100.039 Bryan Cantrill

Mom is here right now. So I'm setting the mom-is-here bit; we need to actually do something slightly differently.

6103.646 - 6124.904 Robert Mustacchi

But normally you're not supposed to enter it by yourself. There are only a very few occasions where a device enters it of its own volition. So we would just find that we were entering it. And that was just kind of confusing. Especially because, while we had the ability to see what the host side saw, there's no good way to go answer what the T6 was seeing.

6126.045 - 6128.327 Robert Mustacchi

And getting a logic analyzer on that

6130.828 - 6159.001 Robert Mustacchi

was going to be very challenging, because we need to get that on there, and we need to get it on for all 16 lanes. And that's the one downside with the chip-down solution: it's a little bit hard to get the logic analyzer on there. Just a little bit. So then, you know, we started doing different experiments. I don't remember all the ones I did, but the one that surprisingly worked was, you know, we kind of said, hey,

6162.511 - 6176.296 Robert Mustacchi

everything's training to Gen 1 just fine. And then it's going to Gen 2 that's failing. So what if we just stopped at Gen 1? Yeah. Especially since we're only in this manufacturing mode for a little bit.

6177.396 - 6183.918 Bryan Cantrill

We actually don't care about the throughput in this manufacturing mode where we're just trying to... Yeah, the difference between Gen 1 and Gen 2 is not going to

6184.762 - 6205.139 Robert Mustacchi

We're really bound by the time it takes for it to talk over SPI more than anything else. The actual PCIe bandwidth is not going to make or break us. So, shockingly enough, that actually worked. Which is both great and a little dissatisfying, but...
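How the link was held at Gen 1 is not spelled out in the conversation. One standard mechanism is the Target Link Speed field (bits 3:0) of the PCIe Link Control 2 register, where a value of 1 means 2.5 GT/s; the sketch below assumes that mechanism purely for illustration, and the helper names are invented. It also works out the per-lane bandwidth being given up, using the 8b/10b encoding that Gen 1 and Gen 2 share.

```python
TARGET_LINK_SPEED_MASK = 0x000F  # Link Control 2, bits 3:0


def with_target_speed(link_control_2, gen):
    """Return a Link Control 2 value with the Target Link Speed field set to `gen`."""
    if not 1 <= gen <= 5:
        raise ValueError("generation out of range for this sketch")
    return (link_control_2 & 0xFFFF & ~TARGET_LINK_SPEED_MASK) | gen


def lane_bandwidth_mb_s(gen):
    """Approximate usable MB/s per lane for Gen 1 or Gen 2 (8b/10b encoding)."""
    giga_transfers = {1: 2.5, 2: 5.0}[gen]
    # 8 payload bits per 10 transferred bits, then bits to bytes.
    return giga_transfers * 1000 * 8 / 10 / 8


# Cap a register that currently targets Gen 3 (0x0043) down to Gen 1.
print(hex(with_target_speed(0x0043, 1)))
print(lane_bandwidth_mb_s(1), lane_bandwidth_mb_s(2))
```

Halving the per-lane rate matters little when, as noted above, the SPI transfer dominates the time spent in manufacturing mode.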

6207.864 - 6226.139 Bryan Cantrill

Dissatisfying that we don't know what's going on. We also learned in this that, as it turns out, we think it makes a lot of sense to use this manufacturing mode on every boot, but it's a little more unusual from Chelsio's perspective.

6226.579 - 6247.352 Robert Mustacchi

For them, it's a factory-only tool. So any issues that they might have had there, it's really just factory programming. If you end up needing to hit the thing twice... and it doesn't impact normal operation, like, you know, that's totally fine in a factory context, not fine in a product context. So for us, but.

6249.313 - 6271.618 Bryan Cantrill

This was hugely important because, I mean, in our testing in manufacturing, when these things were failing to come into manufacturing mode, we effectively would not ship that sled, because we didn't know what the issue was. I know Nathaniel was, I'm sure, very relieved to get it back into the software domain.

6271.638 - 6281.424 Bryan Cantrill

Because the nightmare is you've got a manufacturing defect, effectively, where it's like, yeah, that board needs to be scrapped. And that was not the case, which is a huge relief.

6283.075 - 6305.59 Robert Mustacchi

Yeah. And I think this is another one where we ended up there, but we definitely didn't start with just software only. And so it's really about working together, brainstorming different ideas, and trying to think about what could be going wrong. And then more so, how do you make hypotheses that you can disprove one way or the other to help narrow the solution space? Because otherwise,

6306.69 - 6335.228 Robert Mustacchi

you know, joker's wild on different things we thought about. And there was a bunch of other electrical stuff that we did investigate around, you know, were we seeing power, were we not ramping power correctly, other things. You know, could there be something where we're not draining enough while we're taking this A2 lap? And some of those were easy experiments to go run, so we did do some of them just to disprove that. And, you know, just the

6336.833 - 6338.214 Robert Mustacchi

Negative results actually are important.

6339.214 - 6360.3 Bryan Cantrill

Yeah. Well, this is a very important point you're making, Robert, in terms of holistic design also means holistic debugging. And when the system is not behaving correctly, you've got to get ready to cross a bunch of different boundaries that are historically difficult boundaries to cross. And you've got to be able to really get a team together from across the stack to go brainstorm an issue. Yeah.

6360.64 - 6383.86 Adam Leventhal

Not to quibble, Bryan, but I'd say it enables holistic debugging. No, no, as you're saying, holistic design. If this were a more traditional system, you'd chase it to the border, and that's the best you could do. That's right. You could sort of imply that there was some other component that now required examination, but there was really no...

6385.101 - 6409.675 Adam Leventhal

trivial way or straightforward way at all to communicate across those arbitrary boundaries. In fact, that Conway's law... I don't know what Conway's law is when it applies to an entire industry, but you've got that industrial super form of Conway's law applied to this aggregation that has become the modern server. Is that like a Conway's wall? Conway's, yeah. Mr. Conway, tear down this wall.

6411.778 - 6425.39 Robert Mustacchi

Yeah, and I think that's actually a good point, because in this, you know, we talked about the software changes in the host, but we also looked at stuff in the service processor ring buffers, you know, FPGA register readbacks, we added and captured a bit more state along the way.

6427.432 - 6448.392 Robert Mustacchi

And the fact that we could kind of use that to rule out things, the fact that we can easily see what is the state of this ExtResetL pin is hugely helpful. Or the fact that if we had had to do something much more invasive, we could have communicated to the service processor over the communication channel.

6451.411 - 6463.024 Bryan Cantrill

Right. I mean, I think having all those kind of options at our disposal really does allow us to... I mean, as you say, Robert, it gives us the opportunity to go holistically debug. Right. Yeah, that is awesome.

6463.444 - 6486.845 Robert Mustacchi

And I think that is actually the ultimate part of Dogpatch, taking it back there, is these different things actually working together towards solving our problems versus... you know, fighting with one another and kind of hoarding information. And, you know, because we're doing all sides together, you know, some of the things in Dogpatch that we're saying, you know, it can only be the OS.

6486.885 - 6508.021 Robert Mustacchi

Well, that's actually a reflection of the fact that, you know, that's because that was the only thing we could actually modify. And the fact that we can modify not just the service processor, not just the... host software, but actually the board design itself gives us a lot of different flexibility in how we can approach different problems.

6510.641 - 6528.845 Bryan Cantrill

That's right. And actually, RFD 88 is already public, which is great. So that one I don't need to make public, but that's another one to check out, which is an RFD that talks about exactly how we think about who does what in the system. And definitely a consequence of this, again, this kind of holistic thinking.

6531.418 - 6536.72 Bryan Cantrill

Well, Robert, I think this has been a very exciting way to kick off Robert Mustacchi Appreciation Week.

6537.82 - 6540.701 Robert Mustacchi

Yeah, let's nip that one in the bud.

6540.801 - 6555.166 Bryan Cantrill

But we really, really, really appreciate it. This is so great to kind of, obviously, walk through the origins of Food for Money Friday. I mean, clearly, let's not bury the lede here.

6555.186 - 6555.466 Unidentified Speaker (Brief Interjection)

But yeah,

6558.84 - 6565.823 Bryan Cantrill

Future historians will be very pleased that they have finally discovered the canonical origins of this cultural phenomenon known as Food for Money Friday.

6565.863 - 6585.351 Bryan Cantrill

But just getting kind of your perspective and kind of the way you've done what you've been able to do, which is with a lot of collaboration, a lot of hard work, and being willing to dive into new things all the time, and then also being able to come up and think about how does this impact the actual user of the system? I love your...

6585.951 - 6603.675 Bryan Cantrill

thinking about that kind of operator empathy as a guide in terms of a lot of what you've thought about in terms of the system. But really terrific. So thank you very much for joining and for indulging us and for taking us up, down, and all around.

6605.296 - 6606.816 Robert Mustacchi

No problem. Thanks for having me. It was my pleasure.

6608.49 - 6614.992 Bryan Cantrill

Well, good stuff. And then, Adam, next week, we've got a European-friendly time, I understand.

6615.292 - 6629.117 Adam Leventhal

Yep, 9 a.m. in a week, 9 a.m. Pacific, and we've got Orhun, and we're talking about Ratatui, one of our favorite crates. I know we talked about lots of favorite crates last week, but this is one of our favorite crates for writing TUIs. It's going to be really fun.

6629.758 - 6642.61 Bryan Cantrill

It's going to be a lot of fun. So join us next time. I am looking forward to having you. And Robert, thanks again for joining us. That was a great discussion. All right. Thanks, everyone.
