Robert Mustacchi
Appearances
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
Please make sure that your boards can actually support these chips.
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
Honestly, it would not surprise me if somebody did have a sense of humor.
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
I'm like, I couldn't really say anything. I was like, what's going on? And he's like, do you want to show the audience that we just broke a record? I'm like, okay, completely unplanned. I had no idea that was going to happen.
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
Noticed it was still booting, was like, okay, let me go feed my cats. Fed my cats, came back, was still booting. It's like, okay, let me go run to the bathroom. Come back, still booting. I'm like, is it actually booting? Like, what's taking it? And I was about to turn it off when I saw in the corner of my eye that my monitor flashed up. I'm like, okay, I'm finally booted. It's alive, it's
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
Yeah, I'm like, okay, thank God it's actually done. And I did screw something up because I had to update the BIOS, which is always a bit of a nerve-racking experience when you have to do it over just watching a flashing light go. And you're like, is it done? Are you done? Did you work? I hope it worked. And then you turn it on and you're just like, okay, hopefully this works.
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
Wait, there's literally, I pulled the Lisa Su. Wait, there's more.
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
The system was built in like the 1980s. It was not updated until 2020. Yeah. Wow. Yeah.
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
Oh no, this was no temporary fix. It was designed like that. The... So...
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
Exactly. It's all the EDA folks. It's all the EDA folks who are like.
Oxide and Friends
Unshrouding Turin (or Benvenuto a Torino)
That's for Sony and Microsoft to take up.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I'd say... Often there's some kind of group of questions I'm trying to ask or answer. And it'll be some combination of looking at code. Basically, I almost always have cscope. cscope with Vim integration that I've inherited from Dave and others over the years. Enough that I can sit down at Dave's keyboard and we can have the same key bindings, which is shockingly convenient. Wow.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But that and using that with a combination of DTrace. And this is where, like, oh, I'm trying to think of this. Some of the classic one-liners, like instrumenting a module, like all entry probes in a module and aggregating on probefunc is one, aggregating on certain stacks and just seeing what happens, um, trying to trace control flow or data flow.
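The one-liner described here has roughly the shape `dtrace -n 'fbt:<module>::entry { @[probefunc] = count(); }'`. As a rough sketch of what that aggregation computes, here is a small Python model. The event list and the function names in it are made up for illustration and do not come from any real module.

```python
from collections import Counter


def aggregate_by_probefunc(events):
    """Count probe firings per function name, like @[probefunc] = count()."""
    counts = Counter()
    for probefunc in events:
        counts[probefunc] += 1
    # DTrace prints aggregations sorted by ascending value, so the hottest
    # functions land at the bottom of the output.
    return sorted(counts.items(), key=lambda kv: kv[1])


# Hypothetical stream of entry-probe firings for one module.
events = ["foo_open", "foo_read", "foo_read", "foo_close", "foo_read"]
result = aggregate_by_probefunc(events)

for name, count in result:
    print(f"{name:12} {count}")
```

The point of the aggregation is that it answers "which functions in this module are actually being called, and how often" before you have read any of the code.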
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I mean, how do you, that's actually a good point. Cause I think one thing that I've noticed that is having questions in the notebook and writing stuff down in there. Um, I think that's one of the other things, um, that I found really valuable, just trying to figure out, what are you trying to do? Or trying to diagram out on a whiteboard how the subsystem works and flows.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I think I remember we were debugging one of the, what was it? There was something with the x2APIC for the apix PSM driver. That block diagram is now in OSEnter.c, but it filled up two joint whiteboards. And a lot of it was just trying to understand, how can I understand this control flow well enough to know what's going on, where is everything flowing, et cetera.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And that's definitely a useful, just kind of, how can you understand it well enough to explain it to someone else? And I think that's the other thing. I was often sometimes sitting there talking with other folks in the office or on chat and using that as a way to kind of, like, have them ask. Sometimes they would ask me questions.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Like, I know sometimes Josh and I or Dave and I or Patrick and I would just go there and, like, or Alex, we probably did this a bunch at 655 Montgomery. I remember there was that little bench behind your desk. And I feel like there would be a lot of kind of questions and back and forth there. I'm just like, how does this?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
You know, kind of the old, like, one of the useful things about being a TA is by the time you can finally explain to someone else, you might start to have an idea of what's going on.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And... having it written down. Like I've gone back to some of these and been like, oh, good job, past me. I'm glad it's here. Cause I know I was just going to say, I forgot. Yeah.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Write a block comment when... and you can go make an artifact better by just explaining how it works. Yeah, I think as I've been flipping back through my notebook here for things that aren't meeting notes, a lot of it really is, you know, just a whole bunch of things around, like, you know, if there are problems, it's like starting with questions I'm trying to answer and go figure out what those are. Um, you know, even on, uh
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Thursday's or Friday's LRDIMM hunt is the same thing. Yeah, just like, what are some of the things? What are some of the observations? What's different? Um, you know, and I find that writing stuff down is a helpful focusing thing. And that, for me, I learned a lot about... I mean, everyone learns in different ways, so it's not going to be the same for everyone. But for me, uh, physically writing in mostly illegible cursive that looks good from a distance is, uh
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, actually, I think my memory is that you or Alex probably saw it first. Probably with Spectre and Meltdown. This is being like the... It's put back in Linux in a way that starts leaking and everyone's kind of denying it until they stop denying it. I think is the...
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, there was a lot of different parts to it. And so it was a combination of Alex and also John Levon, who was very helpful to have while working on that. And obviously, lots of conversations with others. But yeah, we were able to kind of split up that work into a bunch of different pieces. I think I dealt with per-CPU page tables, which was an exciting thing in its own right.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I think Alex dealt with a lot of the trampoline assembly. But we also kind of settled on a somewhat unique solution, I feel like, that hadn't really been done by others with the per-CPU page tables.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. So yeah, that's a great question. So, um, on x86, ARM, RISC-V, and a bunch of other common CPUs, when you use the MMU, you have page tables that describe virtual to physical mappings. So every process has its own address space and maps to generally disjoint physical memory, and those page tables describe where they exist and different attributes.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So some of those attributes say this page can be read, this page can be written, this page can be executed. One of the attributes is basically permissions in terms of what privileges are required to read, write, or execute that page. So you can really think of this as that there's a whole bunch of memory that people sometimes call kernel memory and then memory for processes.
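To make those attributes concrete, here is a minimal sketch of a permission check against a single x86-64 page table entry. The bit positions (present, writable, user, no-execute) are the architectural ones; the function name and the example entries are illustrative. It is a deliberate simplification: it looks at one entry rather than walking the full hierarchy, it treats the write bit as applying to kernel accesses too (as if CR0.WP were set), and it ignores features like SMEP and SMAP.

```python
PTE_PRESENT = 1 << 0      # mapping exists
PTE_WRITABLE = 1 << 1     # page may be written
PTE_USER = 1 << 2         # page is accessible from user mode
PTE_NO_EXECUTE = 1 << 63  # page may not be executed


def access_allowed(pte, *, user_mode, write=False, execute=False):
    """Return True if the access would succeed, False if it would fault."""
    if not pte & PTE_PRESENT:
        return False
    if user_mode and not pte & PTE_USER:
        return False  # user code touching a kernel-only page
    if write and not pte & PTE_WRITABLE:
        return False
    if execute and pte & PTE_NO_EXECUTE:
        return False
    return True


# Illustrative entries: kernel data (no user bit) and user program text.
kernel_data = PTE_PRESENT | PTE_WRITABLE | PTE_NO_EXECUTE
user_text = PTE_PRESENT | PTE_USER

print(access_allowed(kernel_data, user_mode=True))            # user read of kernel page
print(access_allowed(kernel_data, user_mode=False))           # kernel read of kernel page
print(access_allowed(user_text, user_mode=True, execute=True))
```

The first call is the case that is supposed to end in a page fault: the mapping exists in the process's page tables, but the user bit is clear.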
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But effectively, if you take... let's just use the 4 gigabyte 32-bit address space as a simple example for a second. Every process has a 4 gigabyte address space, or a 64-bit process has 64 bits with a bunch of holes. But the top gig in that 4 gig address space is always the kernel. And it's the same in every process.
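As a small worked example of that 32-bit layout, the sketch below assumes the kernel occupies exactly the top 1 GiB, which puts the boundary at 0xC0000000. The constant and function names are illustrative, and real systems have used other splits.

```python
GIB = 1 << 30
ADDRESS_SPACE = 4 * GIB               # 32-bit virtual address space
KERNEL_BASE = ADDRESS_SPACE - 1 * GIB  # assumed split: top 1 GiB is kernel


def is_kernel_va(va):
    """True if a 32-bit virtual address falls in the shared kernel range."""
    if not 0 <= va < ADDRESS_SPACE:
        raise ValueError("not a 32-bit virtual address")
    return va >= KERNEL_BASE


print(hex(KERNEL_BASE))
print(is_kernel_va(0x08048000))  # a typical low user address
print(is_kernel_va(0xFE800000))  # somewhere in the top gigabyte
```

Because that top range is mapped identically in every process, a system call can start running kernel text without switching page tables.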
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But when you make a system call, you can start executing kernel text, and you don't have to go try and basically change the MMU context, change the page tables, because that's generally expensive and potentially causes cache invalidations and is the root of a lot of CPU performance challenges.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, exactly. And ARM, through various incarnations these days, especially in the 64-bit ARMv8-A profile, looks very much like x86 does in that regard. But eventually, there's a bit in there, or a few bits that say, should this page be a kernel page or a user page?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And effectively, what that's meant to say is that if you're a user process, even though those kernel pages and those kernel VAs exist, if you try to read them, you'll get a page fault. And then the kernel will come and drop a signal on you to basically say, you've been reading something that you can't. Unfortunately, through the power of speculation, what basically happened is that
Oxide and Friends
Holistic Engineering with Robert Mustacchi
That check happened, but after all the side effects of doing the read were pretty much done. So everything other than, you know, it doesn't show up in your register, but it's impacted, but it was loaded into all the caches and everything else, such that you could still see it. So basically, you could read any arbitrary piece of kernel memory you want.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So whether that was, you know, someone's packets, security keys, you know, someone else's file system cache data. It was, yeah.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, there has been a bunch of build-up in the literature of different L3 cache attacks, like Prime+Probe and other things, where you start using L3 cache as shared resources, but people didn't expect you could go through the page tables. Or my favorite one really is EagerFPU, which is just a fun one of just like, oh, you really can speculate through everything.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. But yeah, I'd say that's been one of the small advantages of working for smaller companies that you get to explore a lot more of this stuff than you necessarily do at kind of some of the larger places because there's just... We're a less cemented team, so often there's not a big kernel org sitting by. Even if you look at Apple, they have a lot of different groups there.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
A bug comes up, you're going to pass it off to that group, and you're not going to chase it down or look at it or have to figure it out, which is sometimes both a blessing and a curse. It's great to have a lot of different colleagues, but sometimes it means there's less opportunities for you to learn or kind of move around in that regard.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I think it's when we're able to execute it most directly. I mean, so I think there's a, you remember Keith and I had, like Keith was driving, together we had kind of this Dogpatch pitch. Yes. Must have been 2015, 2014. Yes. Maybe 2016 at the latest. Can't be that much later. No, not 2016.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, of kind of where we want to go, but with the constraints that, you know, effectively, we're not doing our own boards, we're not really getting to that level of access. You know, how do we work with folks there? And at the same time, there's broader business constraints around, you know, who you have to work with, um, what's available on the market, you know, ultimately to be economically minded in different ways. So, but
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. Cause I was about to say, if you have that, you should, you should drop that in. Cause I feel like that's, that's a emblematic of the, of the discussion we have. It's like, ah, like let's drop these different pieces on the whiteboard. And then we go back and forth. Like, oh, let's talk about what's connected. You're like, they're just gonna be fully connected. And of course you are right.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, so I think to really start to get at Dogpatch, I actually had to rewind to actually back to Fishworks and AK, which was the appliance kit. So there was a lot of hardware software integration we did there around manageability and serviceability in particular.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And if you're ever fortunate enough to use one of those, there were a lot of things it did around just like, hey, detecting drive failure, indicating that by blinking an LED, and even being able to blink the LED to tell you where it was, already started resilvering, swapped in spares, sent out emails, contacted support potentially to get replacement parts shipped, et cetera.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And that was kind of at a single box level. And that was really 2007, 2008. Then you kind of fast forward, and it's even 2013, 2014, and the ability to kind of deal with that data center management at scale, kind of getting to that warehouse-style computing, is really limited unless you're one of the really big players. You're making an LED blink reliably. Surprisingly challenging.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Actually, you didn't... Honestly, half the time, even after you figured it out, actually, you've learned that the server was built incorrectly. And a bunch of things to drive, you know, to the basically SAS expander or maybe the drive. It's like the drive backplane cables were swapped. So what you thought was drive zero was drive eight.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, yeah. I did a brief stint on the KT verification. And you got to see the glory of a microprocessor that would ship really soon. Not ship. So you kind of got to learn lessons that we're seeing today from when you see SPARC roadmaps from the other CPU manufacturers. Right, exactly.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Did that help? Exactly. But also all of this had to be done very manually. And, you know, the joint operations team put a lot of work into dealing with and managing that and dealing with the kind of
Oxide and Friends
Holistic Engineering with Robert Mustacchi
lack of features. But, um, you know, the serviceability and manageability story, which, you know, worked good at, you know, the one to two system or if you had multiple systems, you know, at Fishworks, we really wanted to bring, uh, together. Um, and you actually go see if I actually have this deck still somewhere. Um, yeah, the Dogpatch deck. We've got the dot... yeah, yeah, yeah. It's a... remind myself what else was actually in there, but
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, I think this is actually where, you know, I think the big thing is, you know, there's a lot of fights between what's being done by the OS and what's being done by the BMC in those systems. Because basically the BMC is basically where there's a whole bunch of value add from the vendor for varying degrees of value.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. Just the features of that, the out of building had were challenging.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Just totally divorced from reality. And it's gotten slightly better than when it first came out as an empty schema. But yeah, as I'm trying to look through this to see some of the other things, like obviously some of the classics, dealing with firmware, you know, the different kind of architectural challenges we had there.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
We actually were thinking about how do we eliminate the BIOS and UEFI and basically just do a small, basically let the bootloader take care of a whole bunch of stuff. That for us was iPXE at the time. So it's just like, how do we basically just get out of these different layers that are differently broken?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I think this is during the, as we were writing this, this is one of those times where we had, I think it was like a, oh, like a Dell, maybe like Haswell era server. And after we typed reboot, it would just hang in the BIOS. Like, after we, you know, like, we just reboot, like... And it would only happen on a warm reboot. And there's a lot of back and forth of being like, well, it's your fault.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
You know, don't tell us this. And it's like, and you're trying to like, you know, some of us would say it less politely. You know, I think, you know, this is where we had good cop, bad cop, and psycho cop. But it's like, hey, the BIOS has just like erased all of our program text. And it's taken over. And its job is to restore it from an arbitrary state. So how exactly is it our fault?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But actually, that in and of itself is a lesson as to one of the things that we actually did in Oxide, which is getting rid of two different reboot paths to actually simplify the system and streamline
Oxide and Friends
Holistic Engineering with Robert Mustacchi
streamline things. So that is, you know, in a standard system, if you type reboot, it's not going to do a full POST. It's not going to go reset everything. UEFI is going to kind of be clever and, or the real reality is, the CPU actually isn't going to erase everything. So even if you reassert the reset line, if you're in this ACPI S5 state, there's a whole bunch of state that stays across that. So
Oxide and Friends
Holistic Engineering with Robert Mustacchi
That all of a sudden means there's two different initialization paths. Some of this data you'll see in some of these data sheets described as being in certain power wells or as sticky across resets. And so we're just like,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
have none of that. Let's kill all power. And that actually made it easier for us to build a more reliable system with actually less, uh, you know, with less code paths to actually think about, because we can actually say there is no warm reset. There is no way there. Then at the end of the day, what that means for customers is that, hey, it works more reliably more of the time, because a lot of the challenges here is that you have all these different code paths and
Oxide and Friends
Holistic Engineering with Robert Mustacchi
it's hard to actually test reset in a bajillion different ways. Or what does it mean to do a warm reset when, you know, you've been up for two years versus you've been up for 30 seconds versus, you know, I've hot swapped all these drives umpteen times versus, yeah, I've done nothing because I've just been up for, you know, two minutes.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So it really got us out of that problem once we ironed out some of the bugs.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, I'd say it's even more challenging as one that we, I mean, at least at the time, there was really no good way for me to build or run or instrument. So that was really all about code inspection.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, Adam, that's a good question. So AGESA is effectively one part of AMD's boot software. So it contains both the... both all the kind of binary blobs that the PSP or other kind of hidden cores run, but also really contains a whole lot of all of the x86 initialization things. So how do memory mappings get set up? How do various... pieces of the data fabric, of CPU initialization.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Turning on, and when you start, there's only one CPU turned on. And even though operating systems have this traditional IPI dance, there's a whole bunch of other stuff you have to do in advance to start this. The AMD SoCs, like others, have 128 PCIe lanes. But those PCIe lanes can be carved up into arbitrary different slots, depending on what board you have. So how do you communicate that?
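One way to picture the "how do you communicate that" problem is as a board-specific table that says which lanes belong to which slot. The sketch below is a hypothetical descriptor format with a basic sanity check, not AMD's actual data structure. The `Slot` type, the slot names, and the set of widths are all assumptions, and real hardware adds further constraints on how lanes can be grouped.

```python
from dataclasses import dataclass

TOTAL_LANES = 128                # lane count mentioned above
VALID_WIDTHS = {1, 2, 4, 8, 16}  # common link widths (an assumption here)


@dataclass(frozen=True)
class Slot:
    name: str
    start_lane: int
    width: int


def validate_layout(slots):
    """Map each used lane to its slot, rejecting bad or overlapping slots."""
    used = {}
    for slot in slots:
        if slot.width not in VALID_WIDTHS:
            raise ValueError(f"{slot.name}: unsupported width x{slot.width}")
        end = slot.start_lane + slot.width
        if slot.start_lane < 0 or end > TOTAL_LANES:
            raise ValueError(f"{slot.name}: lanes out of range")
        for lane in range(slot.start_lane, end):
            if lane in used:
                raise ValueError(
                    f"{slot.name}: lane {lane} already used by {used[lane]}"
                )
            used[lane] = slot.name
    return used


# Hypothetical board: one x16 NIC slot and two x4 NVMe slots.
layout = [Slot("nic", 0, 16), Slot("nvme0", 16, 4), Slot("nvme1", 20, 4)]
used = validate_layout(layout)
print(f"{len(used)} of {TOTAL_LANES} lanes assigned")
```

The same silicon with a different table describes a completely different board, which is why this information has to be handed to whatever initializes PCIe rather than being discovered.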
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So AGESA, which looks like, what does it stand for? AMD Generic Encapsulated Software Architecture. Only one of the letters is an acronym. So that whole bit is designed and ties into a separate, these days, UEFI code base. So it in and of itself is not the complete picture.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
It's built up as a series of UEFI modules that run in the PEI and DXE phases, which are different phases of boot, and still means that you need a TianoCore or other UEFI implementation to kind of fit alongside it. It will do anything and everything from setting up I2C devices to... That's where SMBIOS tables are created.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
ACPI tables are created there too. And just a whole bunch of stuff. So yeah, that was an effort where...
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Oh yeah. So as I said, these are all... UEFI is designed in a series of different modules. And they all basically rely on different callbacks firing. So because these modules are coming up and loading in probably somewhat defined but arbitrary orders, they'll often wait.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So before I begin, the PCIe module or the NBIO module might start loading, but it's going to register a callback that fires when all these different PPIs, which is a UEFI term, are provided. Lots of them. And there's not like a... Or at least the things we had access to, there was no clear map of like, this is the expected ordering of these different... these different sets of services.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So it's like, when does the logical SoC service begin? When does the memory map service start? What is it blocked on? So a lot of it is basically trying to come up with this effectively callback-driven control flow and trying to understand what is that just by purely reading, which is not straightforward and definitely not always correct.
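To show why that kind of control flow is hard to recover by reading, here is a minimal sketch of callback-driven initialization. It is only loosely modeled on the idea of modules waiting for interfaces to be provided; the `Registry` class, the interface names, and the module names are all hypothetical and are not the UEFI API.

```python
class Registry:
    """Runs each callback once all the interfaces it waits on are installed."""

    def __init__(self):
        self.installed = set()
        self.pending = []   # (needed interfaces, name, callback)
        self.order = []     # names of callbacks in the order they ran
        self._dispatching = False

    def notify_when(self, needed, name, callback):
        self.pending.append((frozenset(needed), name, callback))
        self._dispatch()

    def install(self, interface):
        self.installed.add(interface)
        self._dispatch()

    def _dispatch(self):
        if self._dispatching:
            return  # the outer loop will see the new state on its next pass
        self._dispatching = True
        try:
            progress = True
            while progress:
                progress = False
                for entry in list(self.pending):
                    needed, name, callback = entry
                    if needed <= self.installed:
                        self.pending.remove(entry)
                        self.order.append(name)
                        callback(self)
                        progress = True
        finally:
            self._dispatching = False


reg = Registry()
# Modules register in an arbitrary order, each waiting on other services.
reg.notify_when({"memory_map", "soc"}, "pcie_init", lambda r: r.install("pcie"))
reg.notify_when({"soc"}, "memory_map_init", lambda r: r.install("memory_map"))
reg.notify_when(set(), "soc_init", lambda r: r.install("soc"))

print(reg.order)
```

Nothing in the registration order tells you what runs first. The real sequence only falls out of who waits on what, which is the map that was missing.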
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, that was in our software. We're trying to send a message to the hidden core, which is responsible for PCIe initialization. We're also trying to send it a data structure that has all the mapping of all of our lanes and all this fun stuff. And then you send a message and then wait a while. And surprise, it came back to the bootloader prompt, which is never a good sign. Never a good sign.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
The kernel's up, so in theory, if I take a page fault or a double fault, you'd trap into KMDB and you could debug. And that's when I started asking Brian, like,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
missing write to one of the registers in, effectively, the NBIO, which is northbridge IO, which has parts of what the memory map bar, so we were basically trying to do DMA to an address and, um, it was missing information that told it whether that was DRAM or memory-mapped, uh, IO.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And so it probably hit some internal error and, you know, uh, especially since then we didn't have good observability, but we were still working on a reference platform. So we didn't have our service processor, our other stuff there. So we couldn't see if there were, you know, some of the low level asserts being fired that would trigger something on a pin. Uh, right. We had nothing.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Sure. Yeah, I can add my commentary. You know, the background thoughts I had, all different things were going on.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, well, and I think this also gets to some of the tooling stuff. It's like we had actually built up a whole bunch of random dcmds in KMDB because there was no userland. So it's all in the kernel debugger to basically be able to read and write some of these different register spaces. So there's the system management network is one of them. And then the data fabric is another one.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And having that tooling be able to do that just let us kind of do some inspection. We use this in other problems because that is something that we could use without the Oxide architecture. So we actually sometimes would compare and contrast that to what we saw on an i86pc, on a standard PC.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But for this one, it was really code inspection, double and triple checking, rereading, getting it wrong a lot. And I don't know. There was a bit of a... Yeah, a lot of the time it's kind of a blur, I'll be honest.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
The actual boot and like load time isn't that bad. It's really more of the mental... the mental effort there, I'd say. Just knowing what to do next.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
No, don't worry. I'm plenty terrified. Okay, that actually makes you feel better. You know, I think part of it is also that I actually am never doing this alone, no matter where I've been at. Even if other folks don't necessarily have all the background, you know, that time, you know, Keith and I were working together a lot, but other folks would sit there, help listen to things.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And this has been true when I've been debugging up and down across the stack, whether it's software, whether it's hardware. It really is a team effort. And even though some of the debugging is there, is sometimes a solitary activity at first,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
A lot of the other pieces, like writing it up, um, having, you know, especially Keith and I when we were doing bring-up, I think the fact that we were working together, uh, even though we're tackling different parts, means we would write up different things in some of our bug reports, review each other's notes, uh, ask questions of one another. And the act of asking questions and being forced to answer them, uh, is there. You know, that's been the same thing that's true and we've been debugging, uh, you know
Oxide and Friends
Holistic Engineering with Robert Mustacchi
some of the T6 stuff which Nathaniel and I were working on recently or other things that we're doing that are up and down across the stack. I think the first thing is that you're not alone and you have the expertise or even just the different perspectives of all of your colleagues. And I think that's really invaluable in and of itself.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Because they'll ask you questions that might make you think in a different way or prompt different thinking.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And, you know, it can never be just... you know, yes, sometimes you need some time where you kind of, you get off the world and just kind of stare at things. Um, but really it is, uh, working with others, I think, that ends up being necessary to kind of get through some of these harder bits, because that's, I think, partly what helps, uh, helps you get through it. Um
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Because, you know, if I'm stuck and I see Keith making progress and that helps there, or if I see us making progress, you know, on the board work or higher up in the, you know, the control plane or other parts of the product or, you know, down below, then that kind of helps, you know, motivate you to kind of keep going. You know, it's not all bad. Right.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
You know, there's no more. That's why there's an egg shortage now.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Sure. I can try. I mean, I think there's a lot of stuff that's just virtue of, you know, being at the company earlier, there wasn't really a lot of other folks to do stuff. Yes. So, you know, there, there, there is a lot of stuff where it's just like, Hey, how do we start thinking about this? And,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I think for me, one of the things that's there is that you'll see in that, actually, RFD 63 was like the last of a set of networking docs. And it really started from higher level product comparison, kind of feature use cases, user networking API that we wanted in another doc. And then with that in mind, how do we start building up the lower level stuff?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But I think a lot of it comes back to, you know, for some of these things, they're things I've been thinking about for a while, so that helps. You know, even though RFD 63, there's a lot of retrospective and on past attempts there or goals or, you know, things that worked well, things that didn't work well, you know, what we learned from paper reading and other stuff.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And, you know, I think coming back to the dog, you know, if you go back to that Dogpatch deck in 2014, there's a lot of the high level goals and, you know, I'd say experience and kind of usability goals, you know, still kind of, you know, are things that are kind of at the, always in the back of my head.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So I think that that's always helped to kind of anchor that or kind of raise questions about, you know, what does this future look like? What does it mean? How does it actually tie back to those goals that we have? And then I think sometimes, yeah, go for it.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Sorry, go ahead. I think sometimes, you know, you need to work being able to get, you know, well, maybe it's a little bit of thrashing, but, and sometimes you need to kind of really focus on one or the other, it helps to kind of, for me at least, to go to different things. Because sometimes some of these small bugs are kind of low-level details. You can kind of really just focus in on it.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But having the broader context of how that fits into the system is helpful or a good distraction from that. So I don't know. It's not really a good answer.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Um, yeah, I mean, I think you're right. I think there's a, there's a lot to be said that, you know, I think the important thing is that as you go from, you know, the first couple of us that hire to 10 to 20 to 30, you know, and you continue to grow is how do you help, uh, teach other people what's going on? Um, and how do you help ramp them up?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Cause I think that's equally important because yeah, I mean, um, the amount that one person can do is bounded. And sure, you can work overtime or push yourself a bit, but that will never be as effective as actually getting more people together. So I think that, to me, is an important question of how do you help teach people? How do you help them learn? How do you try to be helpful and help
Oxide and Friends
Holistic Engineering with Robert Mustacchi
let folks be productive, um, and share knowledge because otherwise, um, yeah, I don't know. So I think that we got into this.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. I think that's, you know, having the docs is good, but it's not, it's not, it's necessary, but not sufficient for that. Cause I think, well, I don't want to just show up and just be like, Hey, here you go. Please read this, you know, a hundred thousand word dissertation, uh, you know, 200,000 words and, you know, come up with a summary and, uh, you know, go from there.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Cause I think that's, uh, that can be very overwhelming. The same way, you know, uh, Rome wasn't built in a day, this wasn't all written in a day. So, um, I think it's important to also figure out, you know, how do you kind of introduce these topics, kind of, you know, get to more and more detail.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And then I think a lot of it is also just not being, being really willing to, if people ask for, Hey, how does this work? Um, you know, being willing to put together an ad hoc presentation, whether it's formally with slides or not on how, how does this, you know, what's going on? You know, this is, uh,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Another example is a colleague was asking about, hey, I'm trying to get more into some of the PCIe hot plug stuff. Can you help me understand how this actually all fits together with this? Because the PCIe spec is long and involved and not the easiest thing to read. So, you know, then the answer to that is, you know, yeah, let's find a time that does this and, you know, get into it.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And I think that's partly how you do that. And then helping make sure folks feel that it's okay to ask those questions too is equally important, which, you know, is always an ongoing effort to kind of build that rapport, because you can't necessarily know someone's struggling. So all you can do is try to be open and willing to answer stuff and try to be helpful.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Basically there's, I think I was going to say there's about one inch of a nine by 13 on the nine of the, you know, on the 13 inch side that's been gone.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I wish I could claim omnipotence or something like that, but that clearly – or something like that.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. I mean, I think, um, that's a good question. I think a chunk of this I picked up from just working with Keith over the years. Um, yeah, Keith's also very, very good at it. There's a lot there, but, um, I, I think this sometimes gets back to, uh, you know, the earlier kind of a code review is really trying to ask why, um, Yeah.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And really trying to understand how does this fit into the system? How is someone going to do this? Or if someone wants to do X, what else does that mean they need to be able to do? Or how does it work? For some of this stuff that we're seeing around, I think some of it's cheating, in that we've been thinking about it for over a decade. Dogpatch is 2014. It's 2025. Yeah.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So some of it's just had a long time to marinate in the back of the head with just different experiences. And I think part of that also is just a bit of, um, you know, paying attention to how folks are using things. How are operators using things? Um, I don't know. I don't know. I wish I had a better answer for you there. Uh, so I could, no, no, no, no, no.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I think, I think a lot of it is just trying to, A lot of it's listening, and then a lot of just digging and really needing to understand the low-level details before you can kind of go and make the high-level answer and how the two inform one another.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, and I think that often just comes back to some, let's say, just scar tissue and experience, right? Some of it was we've had times where we couldn't upgrade firmware or we couldn't get, you know, we didn't have the tools to or even get firmware updates. And so we've just been burnt by that over the years. And so I do think there's also just, you know...
Oxide and Friends
Holistic Engineering with Robert Mustacchi
you really want to think about how you approach and think about, uh, firmware, just the same way as software. And while it can be very hard to understand, just because of the, um, specifics of the way you're working with vendors (you know, obviously, if you just get a binary blob, it's hard to really get into there), you know, it's a thing that can fail and needs to be updated just like anything else. So, um,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I suspect that if we probably go back to Dogpatch, we probably have some commentary on the firmware that we were dealing with and just the operational problems there and trying to figure out how do you get to a better model? Where even is all the firmware? If it's there, assume something's going to go wrong. Or if there's an EEPROM that has data, assume it has to be flashed.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
We're going to need to flash it and then Once you start thinking about it in some portions, then it turns out that that's actually true in a lot more stuff. There's nothing that's just specific to... I think a lot of it is also just trying to take what are the things that you learn from one area and don't just... How can you apply them to others?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And sometimes that'll be right, and sometimes that'll misfire and maybe not be quite the right starting approach. But I think it lets you ask other questions and helps you think about that. So for some of us, it's like, no one would say, ah, we should never be able to upgrade the software in the product. Right. The same would kind of be true.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
It's like, well, if someone comes back to you, what's the service experience? Okay, let's say we did have to do this. If there was no way to do online, to upgrade the firmware of the rectifier through software, then that means I'd have to send a tech out to every DC of every customer and have to pull one out one at a time and do this. And that's just not...
Oxide and Friends
Holistic Engineering with Robert Mustacchi
That works when your n is small and it's a bad day for someone, but it really doesn't scale. So that's another big part of just where are these things that are, what would be the remediation if it fails?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, and I think what helped is, again, from the kind of RFDs or docs building up on top of one another, I think it's something like RFD 82, which is the one about kind of operator design principles and facilities for operation, you know, has something about firmware upgrade in it.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So then as you're going out and kind of doing, sending these questions out to different vendors, you can go back and say, okay, what are the things I need to think about from there? You know, what are the kind of the key things, you know, how does it reply to things, even just not just firmware, but, you know,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
When a rectifier, we'll just pick on the power shelf, fails, how do you identify which slot it is? What serial number it was? Where is it? And how does that turn back into... you know, just different features that you need, you know, and, you know, how do you tie that into basically different operator stories? And then, you know, that same thing is true of failing disks, right?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I think it's much easier for us to think about, you know, a disk fails, I need to pull it, I want to blink the right slot. Well, the same is going to be true of a rectifier. The same is going to be true of a fan. The same is true of, you know, a transceiver. So, you know, there's a lot more similarity in some of these things, even though there are differences.
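To make the similarity concrete, here is a minimal sketch in which disks, rectifiers, fans, and transceivers all share the same "which slot, which serial, blink it" operator story. The `Component` type, its fields, and `flag_failures` are hypothetical names invented for illustration; nothing here reflects how the actual product models its hardware.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical model of any part an operator might have to pull."""
    kind: str            # e.g. "disk", "rectifier", "fan", "transceiver"
    slot: int            # physical slot the operator needs to find
    serial: str          # serial number for the service record
    failed: bool = False
    locate_led: bool = False

    def set_locate_led(self, on: bool) -> None:
        # In a real system this would drive hardware; here it only sets a flag.
        self.locate_led = on


def flag_failures(components):
    """Blink the slot of every failed part and report (kind, slot, serial)."""
    report = []
    for part in components:
        if part.failed:
            part.set_locate_led(True)
            report.append((part.kind, part.slot, part.serial))
    return report
```

The point of the sketch is only that the remediation path does not need to care what kind of part failed, which is the similarity being described.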
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, and this isn't also, you know, there's a lot of different folks provided feedback on there, and there's a lot of different work there. So it's not, again, this is another one of those things that's not just, you know, a single person doing it.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
You know, there's, you know, we're taking these ideas, you know, colleagues to kind of get feedback on them, explore it more, get different perspectives. Make sure we're communicating it well. That makes it better.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, so we do something a little weird with the NIC.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. We've gone our own way. One of the things that we were really trying to make sure we could do, because the NIC has its own firmware and configuration, which changes a whole bunch of different settings and things there, is we wanted to make sure we could validate slash attest that information. And we went through a bunch of different ways as a group to figure out how could we do that.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And what we settled on was that the NIC actually has its own manufacturing mode built in. Which is useful, because some of what's in this config file are things like, you know, a whole bunch of information for the PCIe SerDes, or, you know, it describes things about how Ethernet should work: you know, do you have I2C for transceivers? Do you have different PHYs, etc.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So this information is all very critical for the NIC to work. But, you know, we didn't want to have just a one time factory programming process here, because What if we need to update it? What if something got wrong? How do we deal with it? So we end up using this feature of the NIC called manufacturing mode, which basically has the NIC boot out of an internal mask ROM. It doesn't enable Ethernet.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
It doesn't load any firmware that would run on some of the cores internally. It doesn't do a whole lot of stuff. But it gives us access to the hardware blocks for the NIC's own... EEPROM and SPI flash.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So basically what this means is rather than creating a complex mux system to change ownership of these devices, like we have to for the host SPI flash, here we basically read and validate this through the NIC in its manufacturing mode. So what this means is that every time the server turns on... We're all booted.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So the way we've rigged this up is that on the board, we have a GPIO strapped so that the NIC always shows up in manufacturing mode. Right. Then, because we have power control over every device, we can basically validate all this and effectively, you know, we don't necessarily turn off all the power here.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
We kind of reassert the reset of the device and then basically boot it back up into what they call mission mode.
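The flow described here (come up in manufacturing mode, validate the NIC's flash contents, then reset into mission mode) can be sketched as follows. This is a toy model under stated assumptions: `FakeNic`, `bring_up`, and the SHA-256 comparison are all invented for illustration, and the conversation does not say how the real validation is performed or whether the contents are rewritten only on mismatch.

```python
import hashlib


class FakeNic:
    """Hypothetical stand-in for a NIC strapped to start in manufacturing mode."""

    def __init__(self, flash: bytes):
        self.flash = flash
        self.mode = "manufacturing"  # the board strap means it always starts here

    def _require_manufacturing(self) -> None:
        if self.mode != "manufacturing":
            raise RuntimeError("flash is only reachable in manufacturing mode")

    def read_flash(self) -> bytes:
        self._require_manufacturing()
        return self.flash

    def write_flash(self, image: bytes) -> None:
        self._require_manufacturing()
        self.flash = image

    def reset_into_mission_mode(self) -> None:
        # Models re-asserting reset and booting from the NIC's own flash.
        self.mode = "mission"


def bring_up(nic: FakeNic, expected: bytes) -> bool:
    """Validate the flash image, fix it if needed, then enter mission mode.

    Returns True if the image had to be rewritten.
    """
    want = hashlib.sha256(expected).digest()
    rewrote = False
    if hashlib.sha256(nic.read_flash()).digest() != want:
        nic.write_flash(expected)
        rewrote = True
    nic.reset_into_mission_mode()
    return rewrote
```

Because this runs on every power-on in the sketch, it does not matter whether anything valid was ever programmed before: the expected image is what ends up on the part either way.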
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So yeah, the way mission mode works is exactly that way, Adam. They basically have a little SPI NOR flash, a little EEPROM, and it reads from that. And so we have those same things, it's just that in manufacturing mode, to basically bootstrap it, we just do it through the NIC.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And on a normal PCI card, this is a little jumper that's there for the factory, done once, and then you update this, you can update the firmware from the NIC while it's live, but...
Oxide and Friends
Holistic Engineering with Robert Mustacchi
It also solves the factory chicken-and-egg problem, like how do you get the initial version on there? It doesn't matter if there ever was something there or not. We just always put what we think should be there.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And it definitely simplified the, you know, we talk about hardware-software co-design. This definitely simplified the electrical design. Um, definitely, the last thing you want are more SPI muxes and other things in the way, dealing with voltage translation, dealing with questions of even which SPI... port would we have to connect it to, you know, on what device, what way would data flow?
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So it really simplified a lot of stuff, which was.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
we want to... sorry, just to underscore the importance of this. Yeah, I mean, we have a team that has gotten it right time and time again. But also, um, you know, say, when people joke, you know, the best software is the code you didn't write, no bugs in the code you didn't write. You know, there's no electrical problems if you don't put those things down. So, um, and sometimes they're necessary, but, you know, if you can get away without it, it makes it a lot smoother. So anyways, um,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
All that said, the way this whole problem comes about is that the mask ROM starts up in PCIe Gen 2. It's basically hard-coded in silicon to start as a Gen 2 by 8 device, as opposed to a Gen 3 by 16. And we had occasionally seen some failures.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And I think it was really Josh Clulow who insisted we kind of do some of this boot loop testing, where we'd occasionally see devices that had a surprisingly occasional failure to train the device in manufacturing mode. And if we came back to it later and tried to restart things or took another lap, it would often work. Or even if you tried to reset it after that, it would come up just fine.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But the first time on a cold boot, it wouldn't turn on.
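Boot loop testing of this kind amounts to power cycling over and over and recording which iterations fail, so that a rare cold-boot failure shows up as a rate rather than an anecdote. A minimal sketch, assuming a hypothetical `power_cycle` callable that returns True when the link trained; the simulated flaky device is made up and stands in for real hardware:

```python
import random


def boot_loop(power_cycle, iterations):
    """Call power_cycle() repeatedly; return the iterations where it failed."""
    failures = []
    for i in range(iterations):
        if not power_cycle():
            failures.append(i)
    return failures


def make_flaky_device(failure_rate, seed):
    """Simulated device that fails to train with the given probability."""
    rng = random.Random(seed)
    return lambda: rng.random() >= failure_rate


if __name__ == "__main__":
    failed = boot_loop(make_flaky_device(failure_rate=0.02, seed=1), 1000)
    print(f"{len(failed)} failures in 1000 cold boots")
```

In practice the interesting part is what gets captured on each failing iteration, which is where the register collection discussed later comes in.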
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, so that's a great question. So the way to think of this is, there's a complex state machine in the PCIe docs that a PCIe device goes through to have a PCIe link come up. So when your operating system wants to go read or write from a register, that ultimately gets transformed into a transaction on the PCIe bus, which is a point-to-point link between a port on the CPU
Oxide and Friends
Holistic Engineering with Robert Mustacchi
and the downstream device, so the NIC in our case. You may have switches or other things in more complex designs, but really you can think of PCIe as a bunch of point-to-point links, generally between something on your CPU, which they may call the root port or an upstream port, and a device, often called the downstream port.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And so link training is a process of basically figuring out shared... Effectively, what are the shared... ways we're going to operate. So for example, because PCIe is backwards compatible, you can take today's Gen 5 devices and put them in a PCIe Gen 1 board, and the link will train at PCIe Gen 1.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
You might have a root port that supports up to 16 lanes, but you might put in a NIC that only has one lane, a small 1 gig link. So it will do that. To make these links work at high speeds, is a very complex process because you have to figure out a lot of equalization and tuning so that they can interact.
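As a rough illustration of the backwards-compatibility point, the outcome of negotiation can be modeled as both sides settling on the best speed and width they have in common. This is a deliberate simplification: real link training involves equalization, lane reversal, and a restricted set of legal widths, none of which is modeled here, and the function name is made up.

```python
def negotiate(root_max_gen, root_lanes, dev_max_gen, dev_lanes):
    """Return the (generation, lane count) both ends can support.

    Simplified model: the link runs at the lower of the two maximum
    generations and the narrower of the two widths.
    """
    return min(root_max_gen, dev_max_gen), min(root_lanes, dev_lanes)


if __name__ == "__main__":
    # A Gen 5 x16 device in a Gen 1 x16 slot trains at Gen 1.
    print(negotiate(1, 16, 5, 16))
    # A x16 root port with a single-lane NIC trains at x1.
    print(negotiate(5, 16, 3, 1))
```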
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Effectively, link training is this process in this kind of large state machine that basically hopefully ends with the link training, so basically successfully completing the state machine. And it's really done by the PCIe device that you're plugging in and really the PCIe root port, which is
Oxide and Friends
Holistic Engineering with Robert Mustacchi
generally a whole bunch of hardware in your CPU that probably itself has a secret core running stuff too that no one tells us about.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, I think that's one part of it. And, you know, that's definitely a large part of it. And other parts are like the digital protocol communication, you know, what features can be used. So, you know, the PCI-SIG has done a lot of work. And, you know, it's a testament that, you know, PCIe has been backwards compatible back to its first release.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. Basically, we don't see a device come up. So there's a register in the root port that says, is there a PCIe link established? And if you read it, it says no. There sure isn't. There sure isn't. And you're like, well, that's very sad. Um, so there was a whole bunch of stuff that we were trying to do to figure this out because, you know, we had some challenges with the T6 initially. Um,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
There's some erratum we found the hard way around it needing some double resets and some other conditions. There's a lot of investigation that we kind of split up and kind of took this in a couple different phases. Especially as I think right as this was kind of kicking off, I was disappearing on vacation for a bit.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But I worked with Nathaniel and Josh and Nathaniel started to basically go through a bunch of, you know, just different questions we had electrically. You know, is there a chance that this could be happening because... because we don't see the device coming out of reset.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So one of the first things we kind of looked at, and now you can correct me where I'm misremembering some of this, was trying to bifurcate, you know, did the device assert it coming out of reset? And did we ever try to even begin PCIe initialization or not? Because depending on the answer to that, that would take us down two very different paths.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And the chip itself has a little pin that says it came out of reset. So there's a bunch of stuff we looked at there. And secondarily, because of a whole bunch of the low-level work we had done to boot, we knew how to read out the state of the PCI state machine diagram that the root port thought it was in. So what this meant is that we could go look at the root port.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
It has basically a ring buffer of the last... 30-ish state transitions it's performed. So we can figure out what has it been doing, what has it seen as kind of a guide, and you can compare that against the PCIe spec, and there's more or less a one-to-one correspondence between those states and
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Asterisks. Right. Sorry. So anyone can read this. That is, whether you're in Holistic Boot or in the Oxide architecture, or just running illumos in general, or Linux or other systems, you can actually read from the system management network. Now, knowing what to read and what maps to what can be a little more challenging because the big gotcha, as we said earlier, is that the CPU is flexible.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Something like that. I mean, I'll admit, I was kind of in shock the whole time. You know, you bring this in, you're kind of trying to sort of be like, hey, here's something nice for everyone. And then it's like, then it becomes food for money.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So it has 128 lanes. And there's a mapping, when you talk to that firmware, between some of those lanes and what underlying hardware
Oxide and Friends
Holistic Engineering with Robert Mustacchi
It's just certainly a lot faster when we have a data structure that tells us this is what it is for this. This device, which should have the T6, read all these registers. It's certainly a lot simpler and certainly a lot easier because we also integrated a bunch of register grabbing initially into the actual boot and training path itself.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So we could actually just, on a debug build, for example, we'll just automatically collect a whole bunch of different registers from the PCIe core and the PCIe port, which corresponds to the root port, just by default. And so certainly that is where this is a lot easier because that's not as straightforward to do outside of building it really into the system itself.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Well, yeah. So that, that we were able to eventually kind of first do that first kind of bifurcation and say, okay, the device is always coming out of reset. Um, then, um, you know, then we can go through and figure out, um,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
know why, because basically we no longer have to go look at the... we had a lot of electrical questions, but, uh, that ruled out a whole huge class, and, uh, I guess Nathaniel's thankful it was back in my court a little bit. Um, but yeah, so then we started looking at this and, you know, we had, um, this, uh, boot loop stuff that Josh had put together.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And we had a modified version that was grabbing out some of this register state at every loop. And actually, Andy had actually already gone through and analyzed a bunch of it prior to me coming back to this problem when I was coming back from being out for a little bit.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So once we kind of had the sense of that these were all here, these were all very similar, and they were all ending in a similar state, it got us, it was pretty suspicious because what we actually saw, and so to understand how PCIe works, you always train a PCIe device to Gen 1, no matter what. Then from there, you're going to go to different speeds.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And basically, as part of the Gen 1 negotiation, you're saying what else you can support. And then the device will go and go to these higher speeds. 2 versus 3 is very different. And then 3, 4, 5 are kind of a different path. But we'd see that we basically got to Gen 2. We just successfully trained to Gen 1. We would go down the upgrade path to Gen 2. We would think we got to Gen 2. And then...
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And definitely ruined cheesecake, I think, for Julie for some time.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
as you're kind of going through the state machine and you're waiting for the other side to acknowledge all of this in the recovery.speed substate or something like that, it'd just be like, ah, well, we timed out. And we're going back to detect, which is basically the entry state that says, you know, Start looking for something here. Well, actually, it wasn't quite nothing here.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But the whole point is basically you go back to detect to try to see, which is basically see, is anyone there? We would see someone is there and start going down the path again anew. You kind of do a whole fresh link training. And it would just stop replying at a certain transition point that was an indication to enter what's called compliance mode.
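The pattern described here is the kind of thing that falls out of a short ring buffer of recent link state transitions. Below is a minimal sketch of such a buffer plus a check for "fell back from recovery to detect" and "ended in compliance". The state names follow the PCIe specification's link training state machine, but the trace in the usage example is illustrative only, not captured data, and `TransitionLog` and `diagnose` are hypothetical names.

```python
from collections import deque


class TransitionLog:
    """Keeps only the most recent `depth` state transitions."""

    def __init__(self, depth=30):
        self._states = deque(maxlen=depth)

    def record(self, state):
        self._states.append(state)  # oldest entry is dropped when full

    def states(self):
        return list(self._states)


def diagnose(states):
    """Return (fell_back_to_detect, ended_in_compliance) for a trace."""
    fell_back = any(
        prev.startswith("Recovery") and cur.startswith("Detect")
        for prev, cur in zip(states, states[1:])
    )
    ended_in_compliance = bool(states) and states[-1] == "Polling.Compliance"
    return fell_back, ended_in_compliance


if __name__ == "__main__":
    log = TransitionLog()
    for s in ["Detect.Quiet", "Detect.Active", "Polling.Active",
              "Polling.Configuration", "Configuration", "L0",
              "Recovery.RcvrLock", "Recovery.Speed",   # speed change attempt
              "Detect.Quiet", "Detect.Active",         # timed out, start over
              "Polling.Active", "Polling.Compliance"]:
        log.record(s)
    print(diagnose(log.states()))
```

Comparing a trace like this against the state machine in the spec is what lets you say where in training things went sideways, at least from the host's point of view.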
Oxide and Friends
Holistic Engineering with Robert Mustacchi
So normally compliance was something you only enter because you specifically requested it because you're at one of these like PCIe interop tests and you're trying to like prove something to you and the PCI, you know, basically to pass compliance, PCI level compliance.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
That's funny. I got a very different sounding message from him.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
But normally you're not supposed to enter it by yourself. There's only a very few occasions that enter it of its own volition. So we would just find that we were entering it. And that was just kind of confusing. Especially because we had the ability to see what the host side saw. It was not very easy to go, there's no good way to go answer what was the T6 seeing.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
going to be very challenging because we need to get that on there, we need to get it on for all 16 lanes, and that's the one downside with the chip-down solution: you know, it's a little bit hard to get the logic analyzer on there. Just a little bit. So, uh, then, you know, we started doing different experiments. Um, I don't remember all the ones I did, but the one that surprisingly worked was, you know, we kind of said, hey,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
everything's training to Gen 1 just fine. And then it's going to Gen 2 that's failing. So what if we just stopped at Gen 1? Yeah. Especially since we're only in this manufacturing mode for a little bit.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
We're really bound by the time it takes for it to talk over SPI more than anything else. The actual PCIe bandwidth is not going to make or break us. So, shockingly enough, that actually worked. Which is both great and a little dissatisfying, but...
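The conversation does not say how the link was held at Gen 1. One standard mechanism the PCIe specification provides is the Target Link Speed field (bits 3:0) of the Link Control 2 register in a port's PCI Express capability, which on a downstream-facing port sets an upper limit on link speed. The sketch below only shows the bit manipulation on a 16-bit register value; the helper name is made up, and whether this register or a platform-specific setting was actually used here is an assumption.

```python
# Target Link Speed occupies bits 3:0 of the 16-bit Link Control 2 register.
TARGET_LINK_SPEED_MASK = 0x000F

# Field encodings for each generation (1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s, ...).
GEN_TO_ENCODING = {1: 0x1, 2: 0x2, 3: 0x3, 4: 0x4, 5: 0x5}


def with_target_link_speed(link_control_2, gen):
    """Return the register value with Target Link Speed set to `gen`.

    All other bits of the register are preserved.
    """
    cleared = link_control_2 & ~TARGET_LINK_SPEED_MASK & 0xFFFF
    return cleared | GEN_TO_ENCODING[gen]


if __name__ == "__main__":
    # Cap a port currently targeting Gen 3 down to Gen 1.
    print(hex(with_target_link_speed(0x0003, 1)))
```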
Oxide and Friends
Holistic Engineering with Robert Mustacchi
For them, it's a factory-only tool. So any issues that they might have had there, it's really just factory programming. If you end up needing to hit the thing twice... and it doesn't impact normal operation, like, you know, that's totally fine in a factory context, not fine in a product context. So for us, but.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah. And I think this is another one where we ended up there, but we definitely didn't start with just software only. And so it's really about working together, brainstorming different ideas, and trying to think about what could be going wrong. And then more so, how do you make hypotheses that you can disprove one way or the other to help narrow the solution space? Because otherwise,
Oxide and Friends
Holistic Engineering with Robert Mustacchi
know, joker's wild on different things. We thought about... and there was a bunch of other electrical stuff that we did investigate around, you know, were we seeing power, you know, were we not ramping power correctly, other things. Um, you know, could there be something where we're not draining enough while we're taking this A2 lap? Um, you know, and some of those were easy experiments to go run, so we did do some of them just to disprove that. And, um, you know, just the
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Negative results actually are important.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, and I think that's actually a good point, because in this, you know, we talked about the software changes in the host, but we also looked at stuff in the service processor ring buffers, you know, FPGA register readbacks, we added and captured a bit more state along the way.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And the fact that we could kind of use that to rule out things, the fact that we can easily see what is the state of this ExtResetL pin is hugely helpful. Or the fact that if we had had to do something much more invasive, we could have communicated to the service processor over the communication channel.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And I think that is actually the ultimate part of Dogpatch, taking it back there, is these different things actually working together towards solving our problems versus... you know, fighting with one another and kind of hoarding information. And, you know, because we're doing all sides together, you know, some of the things in Dogpatch that we're saying, you know, it can only be the OS.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Well, that's actually a reflection of the fact that, you know, that's because that was the only thing we could actually modify. And the fact that we can modify not just the service processor, not just the... host software, but actually the board design itself gives us a lot of different flexibility in how we can approach different problems.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
No problem. Thanks for having me. It was my pleasure.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
Yeah, you know, I think we had some very early bits at 365 California. And then it was actually I think when Keith came on. Right. He kind of was really driving more of those programs and I was working with him there. And then I took over more and more of that as Keith retired to the farm for the first but not last time.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
I don't know. I think a lot of it is just learning how to learn and learning how to debug, especially in those early years. Because there's a lot of times where I might know the kind of question I wanted to ask, but not how to ask it. Or you or a lot of the rest of the team would kind of come in with some of the DTrace or other bits there.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
And I feel like also where one of the great adages that I've kept in mind when being faced with gnarly problems, which is basically write debugging tools when you're stuck. Yeah, interesting. I don't know. I feel like that's a lot of... A lot of it is during the times of really just figuring out how do you go learn a new subsystem that no one's been in really before, and there's no one to go ask.
Oxide and Friends
Holistic Engineering with Robert Mustacchi
It's not like people go ask, like, ah, how does the network stack work, or how does the USB stack work? It's like, oh, just go use some DTrace, go use some MDB, and start figuring it out.
Oxide and Friends
RFDs: The Backbone of Oxide
There was an initial version that already existed when I joined. And it was just a list of kind of RFDs that were rendered. I think... I think at that point, a lot of people were looking at RFDs directly on GitHub. I guess due to the deficiencies of the site. And then I don't remember exactly when we kind of relaunched sort of the version that we have now. We've kind of built upon it since.
Oxide and Friends
RFDs: The Backbone of Oxide
But then we kind of incrementally added things like the full text search. That was a huge pain point for me. I forgot what kind of command it was. There was some kind of grep thing, which would take... would take minutes to search the. It was a long time. It wasn't a great process. And now we have full text search.
Oxide and Friends
RFDs: The Backbone of Oxide
We're using some library that indexes all the RFDs and searches, which I think is a good, that's super useful. I can, to your point earlier, we're sort of at an inflection point where there's so much in there that even full-text search doesn't necessarily get you everything that you need. Someone's already mentioned LLM, so I think I'm free to do without kind of...
Oxide and Friends
RFDs: The Backbone of Oxide
Yeah, you don't have to feel like you're breaking the schedule. I'm not being accused of being a shill, but I think, on the roadmap, there's room for something that kind of summarizes all the RFDs and at the very least tells you where to look. And let's say you're joining Oxide.
Oxide and Friends
RFDs: The Backbone of Oxide
There are a bunch of RFDs that everyone should read, but for most people, there's a smaller group which are kind of really pertinent to their work.
Oxide and Friends
RFDs: The Backbone of Oxide
And I think that we can use something like that to generate reading lists for people that join, so they don't feel like they're just losing their mind, like drinking from the fire hose trying to read everything to get up and running.
Oxide and Friends
RFDs: The Backbone of Oxide
Yeah, and it's such an obvious thing, but then afterwards it felt kind of magical. There's a few of those moments where I kind of added something and it just sort of, I think, transformed at least the way I use the RFD site. Full text search is one of them; getting kind of discussion inline on the RFD site is another. Okay.
Oxide and Friends
RFDs: The Backbone of Oxide
There's been a few versions of it. And I think I kind of have a love-hate relationship with AsciiDoc. I think it's amazing because it is so versatile and it is so painful because it's so versatile. You can create so much and I'm aware of a small subset of features. The RFD site is the place to check
Oxide and Friends
RFDs: The Backbone of Oxide
like a renderer for AsciiDoc, because you will see every version of any way in which you can use it. That's good. It has been used on the RFD site and that will get kind of an issue now and again. And I'm just baffled, because I'm like, I didn't even know this was possible.
Oxide and Friends
RFDs: The Backbone of Oxide
You should see the stuff people are doing. But the thing is, it's really rewarding doing this stuff, but there's a balance, right? I don't want to spend all of my time working on like an AsciiDoc renderer and kind of the styling for this thing if it's just kind of used in one place. And I think one way in which we have worked around that is we've embraced AsciiDoc.
Oxide and Friends
RFDs: The Backbone of Oxide
I'm using AsciiDoc for everything. So our doc site is built on AsciiDoc, our RFD site, and the blog posts on the marketing site are all AsciiDoc.
Oxide and Friends
RFDs: The Backbone of Oxide
So what we have there is, and that in tandem with our design system, our colors, our typography, all of that stuff is unified across both kind of product and internal sites, which is unusual because usually you have your product design system and then you have some other designers who kind of are off like playing in the sandpit, doing like the website, doing whatever they want. I'm doing both.
Oxide and Friends
RFDs: The Backbone of Oxide
So I can have a unified system. And what it means is, when I do the styling for the kind of RFD site, it's shared across all these places. So I can justify spending a bit of time on these things because it's so useful.
Oxide and Friends
RFDs: The Backbone of Oxide
So you, Hey, don't, don't blame me. Don't blame me for that.
Oxide and Friends
RFDs: The Backbone of Oxide
Really briefly on the stuff that we've made for it. My main issue with AsciiDoc is that there is no kind of native, like, Rust or JavaScript library for processing it. There is a JavaScript library which is transpiled from Ruby. And so kind of making changes or seeing how it works under the hood is sometimes challenging.
Oxide and Friends
RFDs: The Backbone of Oxide
And then there's a bunch of assumptions on the way that it works. Like it assumes that you're going to process a whole AsciiDoc document in one go, top to bottom, right? Which makes sense if you're processing it locally. In my case, I kind of worked on a React renderer for AsciiDoc.
Oxide and Friends
RFDs: The Backbone of Oxide
And essentially what I do is I take this AsciiDoc tree and I work through it and I render kind of each part as kind of React components. If you're not doing that, you have to create a renderer where you're returning a string. So you want to render an image, you return a string, and you drop it in.
Oxide and Friends
RFDs: The Backbone of Oxide
You're accessing the attributes, and you're swapping stuff in and out, which isn't ideal, especially if you want to do some more interactive things like our images. It gets complicated because we're using signed images that come from GCP. We want a little lightbox. There's some other kind of features where if you want any kind of interactivity, that model doesn't work.
Oxide and Friends
RFDs: The Backbone of Oxide
So we wrote a React renderer for that. But then React doesn't run once, top to bottom, and you have issues. The big issue that I spent way too long on was that every time you render a piece of content with a footnote in it, it assumes that it's a new footnote.
Oxide and Friends
RFDs: The Backbone of Oxide
So every time it's re-rendering that piece of text with a footnote, your number of footnotes is increasing, increasing, increasing up on the page. So those things are kind of challenging. And we've found ways around them. And I think each one of these versions is motivated by kind of a piece of functionality that we want.
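The footnote problem described above can be sketched in a few lines. This is a minimal illustration, not the RFD site's code: `FootnoteRegistry`, `renderFootnoteMarker`, and the idea of keying each footnote by a stable string are all assumptions made for the example. The point is only that registration has to be idempotent if the same content may be rendered more than once.

```typescript
// Hypothetical sketch: a footnote registry that tolerates repeated renders.
type Footnote = { index: number; text: string };

class FootnoteRegistry {
  private byKey: Record<string, Footnote> = {};
  private ordered: Footnote[] = [];

  // Registering the same key again returns the existing footnote,
  // so re-rendering a block does not grow the footnote list.
  register(key: string, text: string): Footnote {
    if (Object.prototype.hasOwnProperty.call(this.byKey, key)) {
      return this.byKey[key];
    }
    const footnote: Footnote = { index: this.ordered.length + 1, text };
    this.byKey[key] = footnote;
    this.ordered.push(footnote);
    return footnote;
  }

  // Footnotes in first-seen order, for the list at the bottom of the page.
  list(): Footnote[] {
    return this.ordered.slice();
  }
}

// Stand-in for a component render: returns the marker shown in the text.
function renderFootnoteMarker(
  registry: FootnoteRegistry,
  key: string,
  text: string
): string {
  return `[${registry.register(key, text).index}]`;
}
```

A naive counter that increments on every render would hand out a new number each time; keying by something stable (here an assumed block identifier) keeps the numbering fixed across re-renders.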
Oxide and Friends
RFDs: The Backbone of Oxide
So one thing I've been playing with recently is creating this intermediary format for AsciiDoc. So I go through the tree, I process it, I turn it into like a JSON object that can be passed easily from the server to the client. And that means that you can kind of pre-process it on the server and then you're not shipping like these big
Oxide and Friends
RFDs: The Backbone of Oxide
this, like, two-megabyte client library just to handle it on the front end. And the reason why that is important is, let's say we're in the console. And this is a big aspiration of Robert's: if we want documentation that exists within the console, it doesn't feel justified to ship this big Asciidoctor.js library in the console.
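As a rough illustration of the intermediary format idea, the sketch below flattens a parsed tree into plain data that survives `JSON.stringify`. The `SourceNode` and `PlainNode` shapes are hypothetical stand-ins, not the types of any real AsciiDoc parser or of the RFD site; a real document tree will carry far more than this.

```typescript
// Hypothetical stand-in for a node in a parsed document tree.
interface SourceNode {
  kind: string;
  text?: string;
  attributes?: Record<string, string>;
  children?: SourceNode[];
  parent?: SourceNode; // back-reference: cyclic, so not directly serializable
}

// Plain data only: safe to send from the server to the client as JSON.
type PlainNode = {
  kind: string;
  text?: string;
  attributes: Record<string, string>;
  children: PlainNode[];
};

function toPlain(node: SourceNode): PlainNode {
  const attributes: Record<string, string> = node.attributes || {};
  const children: SourceNode[] = node.children || [];
  const plain: PlainNode = {
    kind: node.kind,
    attributes: { ...attributes },
    children: children.map((child) => toPlain(child)),
  };
  // Copy text only when present so the JSON stays compact.
  if (node.text !== undefined) {
    plain.text = node.text;
  }
  return plain;
}
```

With a shape like this, the parsing library stays on the server, and the client only needs a small function that maps each `kind` to a component.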
Oxide and Friends
RFDs: The Backbone of Oxide
Because that is kind of a bit bloated. But I think I'm kind of working on experiments so that we can have dynamic documentation. Maybe if you're kind of working on
Oxide and Friends
RFDs: The Backbone of Oxide
repairs and kind of swapping components, then we have that without needing to do all this stuff. So yeah, it's interesting. It's a bit of a challenge at times, but yeah, the versatility is the thing that's great and the thing that kind of trips me up sometimes.
Oxide and Friends
RFDs: The Backbone of Oxide
I guess, to begin, I think there's something around the accessibility of this stuff. Oxide has always been really kind of engineering driven. And engineers are really familiar with GitHub and that's really easy for them.
Oxide and Friends
RFDs: The Backbone of Oxide
But I think if you want this process to be used outside of that, in kind of sales and operations and design, then really, you have to make this stuff as accessible as possible. And we're not there. And I think we're working towards it. But getting GitHub discussions directly into the RFD site felt like a big step towards that.
Oxide and Friends
RFDs: The Backbone of Oxide
Essentially, what's nice about the GitHub discussions is you're leaving line comments. And one great feature of AsciiDoc is that, in the rendered document, you can get the associated line number of the source.
Oxide and Friends
RFDs: The Backbone of Oxide
So what it meant is we could query the GitHub API, get all of the comments, kind of collect them in,
Oxide and Friends
RFDs: The Backbone of Oxide
kind of a format that can be rendered easily, which is pretty tricky to begin with. And then we go through sort of line by line and we position them alongside the content directly if they're still relevant. And then we have this sidebar where you can see the kind of full canonical discussion and you can jump to the relevant part. But yeah, just that little bit of being able to associate the original
Oxide and Friends
RFDs: The Backbone of Oxide
document with the rendered document was enough that we could pull people's avatars from the API and have this little module that people can click on and view. And I think my first experiment was this super ugly way of getting a little avatar alongside the text, but yeah, that felt like a huge improvement.
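The line-matching step described above might look something like the following. This is a sketch under stated assumptions: the `ReviewComment` and `RenderedBlock` shapes, the `attachComments` name, and the use of a `null` line to mean "no longer relevant" are all invented for illustration and are not taken from the RFD site or from the GitHub API.

```typescript
// Hypothetical shapes: a comment pinned to a source line, and a rendered
// block that remembers the source line it started on.
type ReviewComment = { author: string; body: string; line: number | null };
type RenderedBlock = { id: string; startLine: number };

function attachComments(blocks: RenderedBlock[], comments: ReviewComment[]) {
  const sorted = blocks.slice().sort((a, b) => a.startLine - b.startLine);
  const inline: Record<string, ReviewComment[]> = {};
  const sidebarOnly: ReviewComment[] = [];

  for (const comment of comments) {
    const line = comment.line;
    let target: RenderedBlock | undefined;
    if (line !== null) {
      // Pick the last block that starts at or before the commented line.
      for (const block of sorted) {
        if (block.startLine <= line) target = block;
        else break;
      }
    }
    if (target === undefined) {
      // No usable line: keep it in the full discussion only.
      sidebarOnly.push(comment);
    } else {
      const list = inline[target.id] || [];
      list.push(comment);
      inline[target.id] = list;
    }
  }
  return { inline, sidebarOnly };
}
```

Comments that land in `inline` can be drawn next to their block, while everything, including `sidebarOnly`, still appears in the full discussion view.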
Oxide and Friends
RFDs: The Backbone of Oxide
Because one of the reasons to do that is the longer the discussion, the longer the response takes. So I think there's one RFD, something to do with time. And yeah, why would that have lots of comments? RFD 34. And it crashed. It crashes every time, I think, because I think it basically timed. Oh, no. Oh, wow, this worked. What a miracle.
Oxide and Friends
RFDs: The Backbone of Oxide
But essentially, the more comments, the longer the response takes. And so I think there's some kind of funkiness in the GitHub API, which is, yeah. We do have something where essentially we serve the regular rendered version of... And then David might want to talk a little bit about this, but there's a Remix feature.
Oxide and Friends
RFDs: The Backbone of Oxide
Remix is the framework that we're using to do all this, where you can stream data from the server later on. So what we do is we give you the main document first, because that's important. And then later on, asynchronously, we give you the comments, just so that's not holding up the...
Oxide and Friends
RFDs: The Backbone of Oxide
And then your mileage may vary on the actual GitHub portion.
Oxide and Friends
RFDs: The Backbone of Oxide
There was a file with an array of RFDs. Yeah, pretty much. And then initially it was semi-public, right? So anyone could log in with any GitHub login, I think. And then I think David really pushed to have it so anyone could access the public RFDs without logging in, which, yeah, was, I think, another big improvement.
Oxide and Friends
RFDs: The Backbone of Oxide
Well, I mean, the RFD site is kind of fake open source in that it's all... A message to our community. We've got the BSL license on it.
Oxide and Friends
RFDs: The Backbone of Oxide
Yeah, even at Oxide. Right, too real. So, because it's still... The RFD API is much broader. I think anyone can use that pretty easily. The RFD site is... it's very Oxide at the moment. It has all the styling, all of that. It's kind of directly in the repo.
Oxide and Friends
RFDs: The Backbone of Oxide
We're doing it kind of incrementally as well. And that's the beauty of working on something on a platform like GitHub is you can use the RFD API package, use GitHub, and you don't even need a front end, at least initially. You don't need to worry about that. And then eventually you can kind of add more as and when. But yeah, the accessibility thing I think is big.
Oxide and Friends
RFDs: The Backbone of Oxide
Augustus added an endpoint so that you can create RFDs from the API. And so yeah, I'm kind of working on a way that people can just create a new RFD from the website directly. And then a few months ago, I was working on a little experiment where it was a live AsciiDoc editor where you could kind of create and share notes, maybe before they're an RFD, or notes that aren't going to be an RFD at all.
Oxide and Friends
RFDs: The Backbone of Oxide
And within that, because of the API, I had an option where you could turn a note into an RFD, and it would kind of hit the API and take all of that content and throw it in an RFD. So I think that's what we're working towards eventually. Some basic authoring. There's a line though. It's really tempting to try and recreate GitHub and Google Docs.
Oxide and Friends
RFDs: The Backbone of Oxide
I think we've got enough on our plate, but it's tempting to do, for sure. And yeah, I've found personally that kind of working on internal tooling is really beneficial. Like, we get a lot out of it, especially as we get bigger.
Oxide and Friends
RFDs: The Backbone of Oxide
And we really benefit when more people can contribute to RFDs, especially engineers.
Oxide and Friends
RFDs: The Backbone of Oxide
If you spent any time in a European Airbnb, regardless of the country, you know that they always have a large canvas of a black and white scene of London with a colorized red bus. So I think that's the first clue that this is not actually Italy.