Anita Zhang

Appearances

The Changelog: Software Development, Open Source

Flavors of Ship It! (Interview)

2773.589

Well, I support a team that, well, my manager calls it Meta's Linux distribution team. I like to call it operating systems. Sounds better. We primarily contribute to systemd and to BPF-related projects, building out some of the common components at the OS layer that other infrastructure services build on top of.

2799.987

We have an actual kernel team to do the kernel, but we're one layer up, I guess.

2821.972

Yeah. I mean, we've been around a while. The company owns millions of hosts at this point, a mix of compute, storage, and now the AI fleet. Teams primarily work out of a shared pool. So we have a pool of machines called TW Shared where all of the container jobs run. There are a few services that run in their own set of host prefixes.

2848.823

But for the most part, the largest pool is TW Shared. A lot of our infrastructure to support this scale is homegrown.

2866.817

No, we actually use CentOS for production, all of our production hosts, and even inside the containers we're using CentOS. Desktops are primarily some flavor of Fedora, Windows, or macOS.

2914.009

You know, we've gotten a lot better at it over the years. When I started, we were doing CentOS 6 to 7. And I think that probably took a year or two to actually reach over 99% of the fleet. And there's always that trailing 1% that, for some reason, can't shut down their services or don't want to drain or lose traffic or things like that.

2935.442

But now we're able to complete, I'd say, 99% of the fleet in a year or less. We started doing a lot of validation sooner. So now we actually hook Fedora ELN into our testing pipeline, and we start deploying parts of Fedora ELN and running our internal container tests against them. And so that has caught a few system-wide distribution changes that we'll be ready for.

2962.552

Like once CentOS, I guess now CentOS Stream 10, is going to be released later this year.

2971.198

So Fedora ELN is, man, I don't know what exactly it stands for. It's Fedora something next. So it's going to be like the next release of Fedora that will eventually feed into things like CentOS Stream.

3006.912

Yeah, I'd say the change to Stream didn't really affect us much because we were already kind of doing rolling OS updates inside the fleet. So when new point releases get released, we have a system that syncs them to our internal repos and then updates the repositories.

3025.682

And then we have Chef running to actually pick up the new packages and things and just updates depending on what's in those repositories. So the change to Stream didn't really change that model at all. We're still doing that, picking up new packages on like a two-week cadence.

3044.286

Yeah, we kind of have to.

3094.401

Yep. So containers, they don't get the live updates that the bare metal hosts get. So users can define their jobs in a spec. And for the lifetime of the job, the packages and things that go into it don't change. I mean, there are certificates that are also used to identify the job. Those get renewed. But we have a big push to get every job updated at least every 90 days.
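
The 90-day goal can be pictured as a simple staleness check over job start times. This is only an illustrative sketch: the function names, the job names, and the idea of keying on a start timestamp are assumptions, not a description of Meta's tooling.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy constant, taken from the "at least every 90 days" goal.
MAX_JOB_AGE = timedelta(days=90)


def needs_refresh(started_at: datetime, now: datetime) -> bool:
    """Return True if a job has run on the same packages for longer than MAX_JOB_AGE."""
    return now - started_at > MAX_JOB_AGE


def stale_jobs(jobs: dict[str, datetime], now: datetime) -> list[str]:
    """Sorted names of jobs that should be restarted on a fresh container."""
    return sorted(name for name, started in jobs.items() if needs_refresh(started, now))


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    jobs = {
        "web": datetime(2024, 1, 1, tzinfo=timezone.utc),    # hypothetical job
        "cache": datetime(2024, 5, 1, tzinfo=timezone.utc),  # hypothetical job
    }
    print(stale_jobs(jobs, now))
```

A job flagged this way would pick up new images and packages only after it is shut down and restarted, since nothing inside a running container changes.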

3123.009

Most jobs update more frequently than that.

3129.889

Yeah, they'll actually have to shut down their job and restart it on a fresh container and they'll pick up any new changes to the images or any changes to the packages that have happened in that time.

3158.592

So I used to work on the containers team, the part that's actually on the host. The whole Twine team consists of the scheduler, and there are resource allocation teams to figure out which hosts we can actually use and how to allocate them between the teams that need them.

3175.29

And then on the actual container side, we have something called the agent that talks directly to the scheduler and translates the user specification into the actual code that needs to get run on the host. And that agent sets up a bunch of namespaces and starts systemd and basically just gets the job started.

3199.751

Yeah. So the bulk of the work that is done in the agent, at least for the systemd setup, is it translates the spec into systemd units that get run in the container. So if there are jobs, if there are commands that need to run before the main job, those get translated to different units. And then the main job is in its own unit as well.
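
A rough sketch of that translation step is shown here. It assumes a made-up `JobSpec` with a main command and a list of pre-commands; the field names, unit names, and the choice of directives are illustrative guesses rather than the agent's real output. Only standard systemd directives (`Type=oneshot`, `RemainAfterExit=`, `Requires=`, `After=`, `ExecStart=`, `KillMode=`) are used.

```python
from dataclasses import dataclass, field


@dataclass
class JobSpec:
    """Hypothetical job spec; the field names are illustrative only."""
    name: str
    command: str
    pre_commands: list[str] = field(default_factory=list)


def render_units(spec: JobSpec) -> dict[str, str]:
    """Map unit file names to unit text: one unit per pre-command, one for the main job."""
    units: dict[str, str] = {}
    pre_names: list[str] = []
    for i, cmd in enumerate(spec.pre_commands):
        unit_name = f"{spec.name}-pre-{i}.service"
        lines = ["[Unit]", f"Description=Pre-command {i} for {spec.name}"]
        if pre_names:
            # Run pre-commands one after another, in spec order.
            lines.append(f"After={pre_names[-1]}")
        lines += ["", "[Service]", "Type=oneshot", "RemainAfterExit=yes", f"ExecStart={cmd}"]
        units[unit_name] = "\n".join(lines) + "\n"
        pre_names.append(unit_name)

    main = ["[Unit]", f"Description=Main job {spec.name}"]
    if pre_names:
        deps = " ".join(pre_names)
        # The main job starts only after every pre-command unit has succeeded.
        main += [f"Requires={deps}", f"After={deps}"]
    # KillMode=control-group stops every process in the unit's cgroup on stop.
    main += ["", "[Service]", f"ExecStart={spec.command}", "KillMode=control-group"]
    units[f"{spec.name}.service"] = "\n".join(main) + "\n"
    return units


if __name__ == "__main__":
    spec = JobSpec("web", "/usr/bin/web --port 8080",
                   ["/usr/bin/fetch-config", "/usr/bin/warm-cache"])
    for name, text in render_units(spec).items():
        print(f"# {name}\n{text}")
```

Keeping each pre-command in its own unit means a failure shows up against a specific unit rather than somewhere inside one long startup script.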

3220.127

And then there's a bunch of different configuration to make sure the kill behavior for the container is the way we expect and things like that. There is a sidecar for the logs specifically. So logs are pretty important, as you'd imagine, to users being able to debug their jobs. There is a separate service that runs alongside the container to actually make sure that no logs get lost.

3244.424

And so those logs get preserved in the host somewhere.

3268.764

Yeah. So the container job, it will be one systemd unit, and you'll see a bunch of processes in it. And you'll also see a couple of agents that we run, but mostly just the usual systemd PID 1 inside the container and their own instance of journald, logind and all that stuff.

3300.059

Yeah.

3313.245

Yeah, pretty much.

3350.714

So that part's a little newer than when I was in containers. But you create a host profile; you work with the host management team to do that. And then you can, I believe, specify it in your job spec. And then when you need to either restart your job or move the job around, they actually have to drain the host.

3370.581

Most host profiles require a host restart, because for things like huge pages, you need to restart the host to apply them. And then the job gets started back up on the host with the host profile you're asking for.

3388.269

Not specifically, but they do. So the host agent actually builds a lot of its components on top of systemd as well. So they've been doing things like moving more configuration out of Chef into the host agent, where it's more predictable. So things like systemd-networkd configs or the sysctl configs that also go through systemd.
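
Kernel tunables of this kind are commonly written as `sysctl.d` fragments, which `systemd-sysctl` applies at boot. The sketch below shows one way to render such a fragment deterministically from a settings map. The function name and the example values are hypothetical, and nothing here reflects how the host agent is actually implemented.

```python
from collections.abc import Mapping


def render_sysctl_conf(settings: Mapping[str, object]) -> str:
    """Render settings as sysctl.d-style "key = value" lines, sorted by key.

    Sorting keeps the output stable, so the same input always yields the same
    file and changes between versions are easy to diff.
    """
    return "".join(f"{key} = {settings[key]}\n" for key in sorted(settings))


if __name__ == "__main__":
    # Example values are illustrative, not recommendations.
    print(render_sysctl_conf({"vm.nr_hugepages": 1024, "net.core.somaxconn": 4096}), end="")
```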

3416.19

Oh, yeah. The tux hoodies. This is the one that Justin was talking about. That is so cool.

3433.888

That's so cool. If anyone from SCaLE is listening, they probably have a hoodie.

3458.149

Yeah, I'd say at least the way the kernel team and our team operate is that we're mostly upstream first. So everything that we write, we write with the idea that we're going to be upstreaming it. And that's how we manage to keep our team size small, so that we don't have to maintain a bunch of backports, things like that.

3491.247

Yeah. So the kernel, we actually build and maintain internally. So we can kind of pull from the release whenever we want. And we can kind of do the same thing with CentOS too, because we all contribute to the CentOS Hyperscale SIG. Any bleeding-edge packages that we want to release immediately go into the Hyperscale SIG.

3525.693

I mean, I'd say Meta is super into, you know, releasing frequently. And so if we always stick to upstream, then we'll always get the newest stuff and we're less likely to run into some obscure bug from two years ago that is really hard to debug.

3565.375

Yeah, so it's mainly the major upgrades that take up to a year. So, you know, when we're about to go from CentOS Stream 9 to 10, that will probably take a longer time than if we were just doing our rolling OS upgrades. So the thing about CentOS is that we do maintain kind of like ABI boundaries. So we expect that the changes that, you know, Red Hat and CentOS are making

3593.502

to packages are mostly bug fixes that won't break compatibility in the program. And that's remained true. We haven't run into a lot of major issues with rolling OS upgrades. Most issues come when we personally are trying to pull in the latest version of systemd or something and we're rolling that out. Those we have to do with more intention.

3636.772

Yeah, I'm probably not the best person to ask about it, but we do have a pretty sizable team now of production engineers dedicated to supporting the AI fleet and making sure that it's stable and that our training jobs don't crash and things like that.

3659.533

It's more like the latter. So even though everything's in TW Shared, we know what kind of machine type they are. So you can specify what purpose you're using the machine for and things like that.

3673.936

Well, I'm a software engineer technically, I guess.

3689.166

I'd say production engineer and software engineer are the most similar, especially in infrastructure. When I was in the containers team, the production engineers and software engineers pretty much all just did the same stuff. Like we were all just focused on scaling and making the system more reliable.

3707.578

I'd say in like a product team, production engineers focus more on operationalizing and making the service production ready while the software engineer is kind of like creating new features and things like that.

3771.687

Yeah, I mean, I'd probably dispute the idea that people have to understand the internals of how the hosts and things are laid out. So the majority of services, and we're talking millions of hosts in TW Shared, they are running containers.

3787.51

And I'd say a lot of their knowledge about the infrastructure probably stops at when they write the job spec and to the point where they go into the UI and look at the logs. So if you're just writing like a service, a lot of that's abstracted away from you. You don't even have to handle like load balancing and stuff. There's like a whole separate team that deals with that as well.

3808.462

Yeah, but if you're on the infrastructure side, sometimes you need to maintain those widely distributed binaries on the bare metal hosts. So like us running systemd or the team at Siamat that does the load balancing, they also run a widely distributed binary across the fleet on bare metal.

3826.256

There's also another service that does specifically fetching packages or shipping out configuration files and things like that. But yeah, most of the services people write, they're running in containers. Databases, they have kind of their own separate thing going on as well.

3845.035

Most of them are moving more into TW Shared as well, but they have more specific requirements related to draining the host and making sure there's no data loss.

3858.642

Yeah, but they're one of those teams that just want their own set of bare metal hosts as well to do their own thing with. They don't care about running things in a container if they don't have to.

3879.951

The AI fleet's always a challenge, I guess, making sure jobs stay running for that long. I think we're... Every site event is like kind of an opportunity to see where we can make our infrastructure more stable, adding more validation in places and things like that. Just removing some of the clowniness that people who have been here a long time have kind of gotten used to.

3917.791

Yeah, making things more deterministic, removing cases where teams that don't need to have their own host, shifting them into TW shared so that they're on more common infrastructure, adding more safeguards in place so that we can't roll things out live and stuff like that.

3946.633

Yeah, that's probably not true anymore. But yeah, the majority of our compute fleet looks like that. Yeah.

3970.485

Yeah.

3973.886

Yeah. And we're trying to shift to a model now where, you know, we have bigger compute hosts so that we can run more jobs side by side. Stacking, because realistically, you know, one service isn't going to be able to scale to all the resources on the host forever. So yeah, we're getting into stacking now.
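
Stacking, in the sense of placing several jobs side by side on one larger host, can be illustrated with a first-fit-decreasing placement over a single resource. This is a textbook heuristic chosen for illustration, not a description of how the scheduler works. The job names, CPU counts, and the single-resource model are all assumptions.

```python
def stack_jobs(job_cpus: dict[str, int], host_cpus: int) -> list[list[str]]:
    """Place jobs onto identical hosts using first-fit decreasing on CPU count.

    Returns one list of job names per host, in the order hosts were opened.
    """
    hosts: list[list[str]] = []
    free: list[int] = []  # remaining CPUs on each opened host
    # Largest jobs first; ties broken by name so the result is deterministic.
    for name, cpus in sorted(job_cpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if cpus > host_cpus:
            raise ValueError(f"job {name!r} needs {cpus} CPUs, host has {host_cpus}")
        for i, remaining in enumerate(free):
            if cpus <= remaining:
                hosts[i].append(name)
                free[i] -= cpus
                break
        else:
            # No opened host has room, so open a new one.
            hosts.append([name])
            free.append(host_cpus - cpus)
    return hosts


if __name__ == "__main__":
    print(stack_jobs({"a": 48, "b": 32, "c": 16, "d": 16}, host_cpus=64))
```

With these made-up numbers, four jobs fit on two 64-CPU hosts instead of occupying one host each.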

4019.857

Oh, I mean, even in the past year, we've made a few notable infrastructure shifts to support the AI fleet. Yeah, it's not even just the different resources on the host, but all of the different components. A lot of them have additional network cards, and there's managing how the accelerators work and how to make sure they're healthy and things like that.

4055.676

Oh, yeah, for sure.

4112.338

You know, we are also working on our own ASICs to do inference and training. That's probably the place where we're going to see not just the monetary gains from developing in-house, but also, you know, gains on the power and resource side as well.

4128.288

That's starting to come out this year in production.

4142.506

Yeah, that's a better question for the silicon team. I only see the part where, you know, we actually get the completed chip, but I'm sure they're doing their development on FPGAs.

4160.764

Oh, yeah. We have a team that I actually work pretty closely with that writes FPGAs. We shifted to a user space driver. It just uses VFIO over the kernel. I think the accelerator is just over PCIe.

4185.433

Yeah, I'd say you can really go as deep as you want to here.

4205.841

Yeah. I mean, that's my favorite part. I mean, some people are just really into like developing C++ or the language. But then I'm on the infrastructure side. I just really like working directly with hosts.

4220.22

Almost eight and a half years at this point.

4242.029

Yeah, it's super nice to just feel free to ping anybody over Workchat, like literally anyone. If you have a question, everyone's super nice about helping you out as long as you're nice too.

4258.862

Yeah, I started here out of graduation. I did one internship before I started here full time.

4278.723

I mean, I'm always interested in doing more stuff with systemd. I think there's still a bunch of components internally that could be utilizing systemd in more ways, you know, making sure that we're all on the common base. That's kind of the main general goal that I'm always going to be focused on, I guess.

4298.618

There are also some bigger ones. I mean, with journald, I've been trying to get us to replace our syslog completely and move entirely to systemd-journald. That's an ongoing effort.

4332.75

Yeah, I want to be there.

4338.714

I mean, moving completely to systemd-networkd was pretty cool. But I mean, now that that's done, I can just be happy with it. There's probably some more stuff we're going to be doing with systemd-oomd, the out-of-memory killer. I think we're about ready to get Senpai upstreamed into systemd. Senpai is like a memory auto-resizer that we wrote.

4363.69

And I don't think that that's been open sourced in any way. I mean, we have an internal plugin to do that with the old FB oomd. I think it's time to get that into systemd-oomd as well.

4382.946

It's a way to kind of poke a process and make sure that it's only using the amount of memory that it actually needs. 'Cause a lot of, you know, services and things will allocate more memory than they need.
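
One way to picture a memory auto-resizer of this kind is as a feedback loop: tighten a job's memory limit while it shows little memory pressure, and loosen the limit once pressure appears. The sketch below is only an illustration under stated assumptions, not Senpai's actual code. The pressure metric, the target threshold, the step size, and the bounds are all hypothetical parameters.

```python
def next_limit(
    current_limit: int,
    pressure: float,
    *,
    target_pressure: float = 0.1,   # hypothetical threshold, in percent of time stalled
    step: float = 0.02,             # hypothetical adjustment per iteration (2%)
    min_limit: int = 64 * 2**20,    # never shrink below 64 MiB (assumed floor)
    max_limit: int = 2**40,         # never grow beyond 1 TiB (assumed ceiling)
) -> int:
    """Return the memory limit (bytes) to apply for the next interval.

    Below the target pressure, the workload is assumed to hold memory it does
    not need, so the limit shrinks. At or above the target, the limit grows.
    """
    if pressure < target_pressure:
        proposed = int(current_limit * (1 - step))
    else:
        proposed = int(current_limit * (1 + step))
    return max(min_limit, min(max_limit, proposed))


if __name__ == "__main__":
    limit = 4 * 2**30  # start at 4 GiB
    # Simulated pressure readings: idle at first, then the job starts stalling.
    for pressure in [0.0, 0.0, 0.0, 0.5, 0.5]:
        limit = next_limit(limit, pressure)
        print(f"pressure={pressure:.1f} -> limit={limit}")
```

Over many intervals, the limit settles near the point where the job just begins to feel pressure, which approximates the memory it actually needs.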

4401.212

A little bit.

4426.667

I mean, we're trying to shift to more of an immutable model. Internally, we have something called MetalOS. And right now, we're rolling out a variation of MetalOS called Maclassica. The goal is kind of an immutable file system, but it's making strides to get there. We still have to rely on Chef to do a lot of configuration, but a lot of it has shifted to a more static configuration.

4452.647

that is more deterministic and gets updated at a cadence where we can more clearly see what the changes are.

4476.508

Yeah, I haven't looked too deeply into what that team's been up to. But I do know that they did make use of some of the bleeding-edge systemd features to build these images and things like that. We're not using systemd-sysext just yet. I mean, I wouldn't count it out.

4562.438

I mean, I think from the engineering standpoint, we just kind of get the warm fuzzies when people actually use and like the stuff we write.

4642.098

Oh yeah. I really appreciate the academic side of things.

4656.561

All right. Looking forward to it. Thank you.