Paul Zaich

Ruby Rogues

The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

These are trying to address these issues head on, not try to find ways to kind of smooth them out under the surface.

Background checks tend to be a very important final safety step for a lot of these companies to make sure that their platform is going to be safe and secure for their customers. And so in 2014, they started an automated background check company.

1071.77

You are. I think at the end of the day, you're going to try to find the root cause, right? You're going to look for that commit. You're going to look for the log. Maybe it was a script that was logged into your logging system, whatever it is. You're going to look for that and look for the root.

1088.861

So honestly, a lot of times, if the issue was caused by something that was run by a specific person, they probably feel a little bit of guilt there, but there's no reason to lay on more there. And I think everyone, like you said, feels a lot of responsibility around the work that they're doing already. So there's no reason to overemphasize that.

1109.358

So what that looks like is typically the team that is impacted is really going to own that postmortem. And that's one way for you to feel like you're resolving the incident or the issue that caused the incident. This has definitely become a bit of a different process as the team is growing.

1128.729

When we were at 30, I think it's a little bit easier just to know exactly who should work on those types of mitigations. Typically, it's pretty isolated to a specific team. As the team is growing and the system is growing, that's definitely become more of a challenge because sometimes,

1145.493

happen because different issues that multiple teams have introduced, or maybe there's multiple teams that need to be involved in the mitigation. And for that, in that case, we've definitely been trying to evolve our postmortem process and the action items. So we have a program manager that one of her responsibilities is specifically around making sure that we are coordinating some of those out

1170.68

some additional rules and coordination around the process as we've started to grow. A lot of it was just on the individual teams initially, and now as we've grown, again, there's more process involved. I think that's a pretty common thing that you have to introduce as teams grow.

120.356

And initially, the biggest selling point was that Checkr abstracted away a lot of the complexity of the background check process: collecting candidate information, then executing that flow and exposing it via an API that was developed in a Sinatra app. And three years later, in 2017, I had just joined about four or five months before this particular incident happened.

1261.343

So we used a number of different types of monitoring. At the time, we were pretty heavily reliant on exception tracking, and we also had some application performance monitoring as well, commonly called APM. A couple of examples of that would be something like New Relic, and Datadog has a product as well now. And then we also used a StatsD cluster that sent metrics over to Datadog.

1289.316

And I think we had just started using that maybe a few months before this particular incident occurred. So like I alluded to before, we had some monitors for this particular issue, but they were pretty simplistic. They basically just looked for a minimum threshold on the number of reports that we were creating.

1308.927

And we had to set that threshold to be very low over like an hour period because traffic is variable. You never know exactly how many reports you're going to get created. There are times of day where we receive very few requests, and then there are other times where we see large spikes. So we just had very simplistic monitoring in place for some of these key metrics at that point.
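
To make the limitation concrete, here is a minimal Ruby sketch of that kind of minimum-threshold check. The method name, the threshold, and the traffic numbers are all hypothetical; only the shape of the check (a low hourly floor on reports created) comes from the discussion.

```ruby
# Returns true when the number of reports created in the last hour
# falls below the configured floor, meaning the monitor would page.
def below_minimum?(reports_created_last_hour, minimum)
  reports_created_last_hour < minimum
end

# Hypothetical floor, kept low so that quiet hours do not page anyone.
MINIMUM_REPORTS_PER_HOUR = 10

quiet_hour     = 25  # normal overnight traffic (made-up number)
partial_outage = 200 # busy hour in which about half of the requests fail
full_outage    = 0   # nothing is being created at all

puts below_minimum?(quiet_hour, MINIMUM_REPORTS_PER_HOUR)     # false
puts below_minimum?(partial_outage, MINIMUM_REPORTS_PER_HOUR) # false
puts below_minimum?(full_outage, MINIMUM_REPORTS_PER_HOUR)    # true
```

With a floor this low, only the complete outage trips the check. A busy hour that loses half of its traffic still sits far above the minimum.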

1331.622

At that point, we were still very heavily reliant on, like I said, exception tracking, using bug trackers like Sentry that could then alert if you hit certain thresholds for the number of errors over a period of time. In this particular case, exception tracking wasn't very useful because we were responding with a 404. There was an exception in the system.

1354.38

It was just an ActiveRecord record-not-found error, something like that, that was then handled automatically and turned into the 404 response. So it was an expected behavior, but there wasn't an exception that could have been caught.
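
A rough Ruby sketch of why that failure mode is invisible to exception tracking but visible to a status-code counter. The `RecordNotFound` class here is a stand-in, not the real ActiveRecord class, and the in-memory hash stands in for whatever metrics client you use; both are assumptions made for illustration.

```ruby
# Stand-in for an ORM "record not found" error (hypothetical class).
class RecordNotFound < StandardError; end

# Tiny in-memory counter standing in for a metrics client.
STATUS_COUNTS = Hash.new(0)

# Wraps a request handler. A missing record is rescued and turned into a 404,
# so nothing is re-raised and an exception tracker would never see it.
# The response status is still counted, which is what a monitor can watch.
def handle_request
  status = begin
    yield
    201
  rescue RecordNotFound
    404
  end
  STATUS_COUNTS[status] += 1
  status
end

handle_request { :created }
handle_request { raise RecordNotFound, "screening missing" }

puts STATUS_COUNTS.inspect
```

The second call never surfaces as an error anywhere, but the 404 count goes up, and that count is something you can alert on.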

1389.704

I think this was more of a monitoring problem overall. As Dave mentioned, there was a component where a page was snoozed, but I think that was still a failure on our monitoring system. Because in this case, that was just a signal of what the true issue was. It was a downstream client application that had a page earlier on. And it wasn't clear at all what the issue was.

1422.145

And I think when you're developing a system for alerting, you need to have clear action items. And that's where custom metrics, building application metrics as you grow, become really important: having a clear signal of what's wrong so that someone knows where to investigate. In this case, it was a client application in the browser. There's a lot of noise there.

1449.57

And I can easily understand why someone would just snooze something like that. In my opinion, it wasn't really a people issue in this particular case.

146.793

Fast forward to that point, we were running, I'd say, a few million checks a year for a variety of different customers. Most of those customers use our API, like I said before, to interface with the candidate on their side in their own application.

1582.771

It also depends on where you are in terms of your applications, use cases, what the customer profile looks like, how large the company has gotten, how many people are supporting it. When you're early on, when you're building a new application, new product. By definition, the developers on that are going to really understand the whole system very well.

1607.125

So essentially, exception tracking probably is going to be able to give you most of what you need to know in terms of being able to understand what's going on. As the system starts to grow, and especially as you have more discrete teams, I think that's where things like StatsD become more useful because use cases for core parts of your application.

1631.498

And I would maybe say that the bar there is when you start to hit the point where you have a significant number of paying customers using specific features. Maybe you need to start to home in on one or two key processes where, if they break, it's absolutely critical that you know immediately. That's kind of the point that Checkr was at in 2017. We really needed to have high intelligence...

1653.187

very clear intelligence and visibility into specific parts of our system. And we're trying to move in that direction when this incident happened. We've continued to invest in that area going forward. I think it's become even more important as we're getting larger because there's just...

1670.456

so many different systems that are interacting together that no one really understands the whole system at this point. And the only way to really know how the different systems are working together, and make sure everything's working properly, is to have some of these custom metrics defined for specific key processes.

1695.207

That's a good question. We're all remote now. So at this point, having had to experiment with that, we did have some of those in our office. I think I've been trying to find ways to make that more visible and make metrics more visible to our team as we've been shifted to 100% remote due to the pandemic. There's also a challenge for our business in particular where...

1717.498

Many of our processes are very asynchronous and they could take hours to days to fully execute. And so finding ways to short circuit and know that those things are broken can be challenging at times. So one of the things we have to do is we have to look at the data over time as well and not just look at real-time metrics.

1738.472

So one thing I've been experimenting with is trying to create more automated reports that go into sort of a Slack channel that we can look at. And so people can review that. And we've also implemented basically a bi-weekly review during our retro where we just look at our metrics and some of the longer running trends so that we can see if those look correct. Is there anything that's wrong?

1760.424

We can talk about it, see if there's things that we want to actually action on based on that review. So we're trying to find some ways to do check-ins that don't require us to be all in the office together.

1796.462

We have implemented what I consider custom metrics. We use Datadog. So a lot of this is out of the box. You can use their implementation, but you're adding some code to specific parts of your application. Maybe it's a callback on your ActiveRecord model. When something is created, you send a message to a queue, and then that triggers a message into StatsD that goes to Datadog.

1824.927

Anyways, it's a pretty lightweight implementation in terms of what you can do, but you're adding specific events that you want to track. And then you can create your own monitors and alerting around those or correlations between different events in your system.
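
As a rough illustration of how lightweight this can be, the Ruby sketch below emits a counter in the plain StatsD line protocol over UDP using only the standard library. It skips the queue mentioned above and sends directly. The metric name, the `report_created!` hook, and the host and port are assumptions for the example, not Checkr's actual setup; in practice you would more likely use a StatsD client library.

```ruby
require "socket"

# Formats a counter increment in the StatsD line protocol: "<name>:<value>|c".
def statsd_counter(name, value = 1)
  "#{name}:#{value}|c"
end

# Fire-and-forget UDP send. Host and port are assumptions (8125 is the
# conventional StatsD port). Errors are swallowed on purpose so that a
# metrics problem never breaks the request that triggered it.
def emit(payload, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(payload, 0, host, port)
rescue StandardError
  nil
ensure
  socket&.close
end

# Hypothetical hook you might call from a model's after-create callback.
def report_created!
  emit(statsd_counter("reports.created"))
end

puts statsd_counter("reports.created") # reports.created:1|c
report_created!
```

Once an event like this is flowing, monitors and correlations can be built on top of it without further application changes.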

183.715

I did not. There are a lot of very important guidelines and stipulations under the Fair Credit Reporting Act that require you to have a permissible purpose for running a background check. So in this case, most of our customers are using the permissible purpose around employment as the reason for actually running that check.

1841.161

So you could potentially look at a custom metric and then look at that compared to HTTP statuses that are coming through or the latency of an endpoint. And then you could correlate those two metrics as well. So there's some more advanced things you can do there as well if you need to. But again, it's not really a lot of custom work.

1861.078

It's just adding some specific points in your code base that you feel like are really important to track. And one example of this for Rails users is, I believe there's something like this already set up for Datadog for Sidekiq. So we instrument it on a lot of our

1875.692

Sidekiq jobs, and we can see when the lag is growing on one of those queues. We can see what the average completion time is and look at the p90 completion time for different types of jobs. So you get a lot of visibility into your Sidekiq workers and processes very easily, basically for free.
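
For readers unfamiliar with the term, p90 is the completion time that 90% of jobs finish within. The Ruby sketch below computes it with the nearest-rank method on made-up durations; a monitoring product may use a different estimator, so treat this as an illustration of the idea only.

```ruby
# Nearest-rank percentile: the smallest sample such that at least `pct`
# percent of all samples are less than or equal to it.
# `pct` is expected to be an integer between 1 and 100.
def percentile(values, pct)
  raise ArgumentError, "no samples" if values.empty?

  sorted = values.sort
  rank = (pct * sorted.length + 99) / 100 # integer ceiling of pct% of n
  sorted[rank - 1]
end

# Hypothetical job durations in seconds, with one slow outlier.
durations = [0.8, 1.1, 0.9, 1.0, 7.5, 1.2, 0.7, 1.3, 1.0, 0.9]

average = durations.sum / durations.length
p90     = percentile(durations, 90)

puts "average: #{average.round(2)}s"
puts "p90: #{p90}s"
```

The single slow job pulls the average well above what a typical job takes, while the p90 stays close to the normal range. Watching both side by side helps separate one outlier from a general slowdown.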

1920.747

Just to be clear, we capture all of our errors in Sentry. We do have some alerting that goes to Slack, but I would also want to emphasize that anything that truly has any chance of being a serious issue should never be either an email or a Slack alert.

1942.978

You really should have some kind of escalation via either maybe it's text, maybe it's an actual incident response system like PagerDuty where you can have an escalation policy. For us, that's what we're using. It should have this synchronous alerting that really forces someone to look at it. You can't rely on something asynchronous like Slack in this case for serious response on issues.

2010.858

You can actually do that, I believe, at least with iOS. You can set up an override where you snooze everything else and then you can set up and you have to just put it in your personal contacts, whatever numbers you think you're going to receive critical notification from. And then that'll actually ring through.

217.457

So, like I said, in 2017, Checkr at this point was a pretty important component of a number of customers' onboarding processes. But we had started off small and things grew quickly. In a lot of ways, we were just trying to keep the lights on and scale the system along with our customers as they continued to grow. On-demand was growing a lot in this time as well.

2176.261

So a composite monitor is basically a combination of several different metrics that you're measuring, tying those together with AND or OR statements. So maybe referencing what I was talking about before, you might want to have a custom metric that you're looking at and you want to look at how many of those are coming through, how many events are coming through.

2198.232

And then you might also want to look at, in this case, the error rate for HTTP status, maybe how many 400 errors you're getting relative to 200s. You could basically do something where you have an AND statement between those two different measures and those Boolean evaluations. Or you could do something where you have an OR.

2219.343

So you could say, these are basically signaling for the same type of issue that I want to alert on, but I'm going to look for these different conditions all in the same monitor.
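
The AND and OR combinations described here can be sketched in plain Ruby as Boolean checks over one evaluation window. Every field name, threshold, and traffic number below is hypothetical, and a real monitoring product would express this in its own configuration rather than in application code.

```ruby
# One evaluation window of metrics (hypothetical field names).
Window = Struct.new(:reports_created, :responses_2xx, :responses_4xx, keyword_init: true)

# Condition A: the custom metric is below a volume floor.
def low_volume?(window, minimum:)
  window.reports_created < minimum
end

# Condition B: 4xx responses make up too large a share of responses.
def high_client_error_ratio?(window, max_ratio:)
  total = window.responses_2xx + window.responses_4xx
  return false if total.zero?

  window.responses_4xx.fdiv(total) > max_ratio
end

# OR: either signal on its own is enough to alert.
def composite_or?(window)
  low_volume?(window, minimum: 10) || high_client_error_ratio?(window, max_ratio: 0.2)
end

# AND: a higher volume floor that only alerts when errors are elevated too.
def composite_and?(window)
  low_volume?(window, minimum: 250) && high_client_error_ratio?(window, max_ratio: 0.2)
end

partial_outage = Window.new(reports_created: 200, responses_2xx: 200, responses_4xx: 180)
quiet_hour     = Window.new(reports_created: 25, responses_2xx: 25, responses_4xx: 1)

puts composite_or?(partial_outage)  # true
puts composite_and?(partial_outage) # true
puts composite_or?(quiet_hour)      # false
puts composite_and?(quiet_hour)     # false
```

In this sketch the partial outage alerts under both combinations, while a quiet hour with almost no errors stays silent even against the higher volume floor.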

2275.923

Right. So I think it gives you the ability to tune things to potentially make something have a higher fidelity when it alerts.

2284.55

So you're not getting... One, you can set the thresholds actually higher and keep things... It depends how you want to use it, but you can... In this case, you could set the thresholds higher, but you could have something where it's like, well, if there aren't any errors coming through, then... Maybe we're okay with that, even though the number's a little bit lower.

2302.447

Or, again, you can also tune this to be more sensitive. In this particular incident, if we had had some error monitoring around 400s, in addition to the threshold that we had that was pretty low, I think we would have been alerted on that within maybe an hour.

2320.585

So you can do things there that give you more sensitivity without necessarily causing a lot more false alarms. And that's something that you have to just be really careful with any kind of monitor on a team. You really need to make sure that you're not creating false alarms. I'd say it's almost as important or equally important to the sensitivity of the alarm as well.

2343.83

Because if you're creating false alarms all the time, people are just not really going to give them the review that they need. So if you're doing that all the time, you're probably going to miss something inevitably when there's actually a real issue.

2370.745

I just want to emphasize that this is a growing process that I think every team should go through. It's something that is going to evolve over time. And as your product becomes more important to customers and continues to grow, you need to just be constantly reevaluating what your approach is to this.

2391.865

What's going to work for brand new product, brand new startup, brand new company isn't necessarily going to be the And it's something that you need to evaluate.

2401.78

And as your product starts to be something that's really a critical service for your customers or for other teams at your company, you just need to continually set the bar higher and make sure that you're continuing to grow observability across the stack.

241.141

So in 2017, we were doing some fairly routine changes to a data model. I wasn't directly involved with that, but we were changing something from an integer ID to a UUID and the references, and there were some backfills that needed to happen. And so an engineer executed a script on a Friday afternoon, which is always a great idea to do.

2423.812

You're welcome to reach out to me on Twitter at GitHub, GitHub at QZyche, or you can reach out to me on LinkedIn as well.

25.214

Zeich. Well done. Thank you.

266.635

And they actually ran the script at about 4:30 PM, probably went and grabbed something, had a little happy hour and then headed home. And about an hour later, we started to receive a few different pages to completely unrelated teams that didn't really know what was going on in terms of this backfill. And it didn't look like anything too serious.

2729.853

I really enjoyed something that was in the Ruby Weekly newsletter this last week. There's a Ruby one-liners cookbook. So it has a bunch of different one-liners you can actually just shell out to and make those calls. And it explains how you can do a lot of things you would do with a shell script very easily with Ruby.

2780.276

It's been a pleasure.

287.306

It was just an elevated number of exceptions in our client application that does some of the candidate PII collection. And so that team decided to snooze that and just kind of ignore it.

308.878

Thank you. Anyway, go ahead. So come Saturday morning, this has been going on for about 12 hours now, this exception comes in again. And at that point, someone on our team actually decided to escalate that and get more stakeholders involved. We had some variety of other issues going on. We just migrated from one deployment platform to Kubernetes. And so we had some issues getting onto the cluster.

334.846

There were too many of us trying to get on at the same time. So we ended up all having to actually go into the office to the physical intranet to finally get in and debug the issue. So we had a couple of other confounding issues come up at the same time that made the process of response even worse.

351.08

So finally, this is maybe 10 or 11 in the morning, after being able to take a look at that, we identified what the issue was. We were responding successfully to only about 50 to 60% of requests to one of the most critical endpoints on our system, which is the one that actually creates a report.

371.137

So after you've collected the candidate's information, you say, please execute this report so we can get that back. And that's a synchronous request that you make using our API. And when that request was failing, it was failing about 40 to 50% of the time with a 404 response, which isn't really expected. So at that point, we were finally able to pin down the issue and it came back to this script.

393.475

And it turned out that when you went to create this report, we would create these additional sub-objects called screenings. And due to the script, we had actually created an issue where validation would cause the reports to fail to create in this edge case. So there are some confounding issues with the way that we had set up the data

412.915

modeling to begin with that we were trying to work around, and this exception happened. But when we finally fixed the issue, that's where we shifted more into what actually went wrong and what were the real issues that caused this outage.

42.608

Sure. So I've been a software engineer for about 10 years. Recently, in the last year or so, I transitioned into an engineering management role. But I've worked at a number of different small startups.

450.931

Right. So I think the first most important thing that we did was that really from the beginning, we have had what we call a blameless culture. I think it's a common term now in the industry, but the idea there is to really focus on learning from issues, not trying to find who made the particular mistake and trying to look at what processes

477.443

as well, that would have prevented the problem from happening. So not trying to focus on the individual mistake that was made. So as part of that, we did a postmortem doc and we went through and identified things like, one, we should really have a dedicated script repository that goes through a code review process. So that's one thing we implemented. And we made some safeguards and

501.517

to address this particular issue with the data models as well. But I think for everyone, the bigger issue is really the fact that we missed the outage for so long. And we did actually have some monitoring in place for this particular issue that should have paged for the downtime that we were experiencing for some report creation.

521.513

But it turned out that our monitors were really just not set up in the most effective way to trigger for that particular type of outage. In this case, it was a partial outage, and that requires a much more sensitive monitor in order for us to detect. Everything we designed beforehand was much more targeted towards a complete failure of our system.

54.977

I joined Checkr in 2017 when the company was at about 100 employees, 30 engineers, contributed as an engineer for a couple of years to our team, and then have recently transitioned, like I said, into an engineering management role.

549.842

This particular issue most likely could not have been caught by an automated test because it was so outside of the norm of what we expected the data to look like. We had, of course, unit tests for everything that we were running and we had request specs as well. We did not have a staging environment set up where you could run these tests end-to-end.

576.711

But again, the data in this particular case was very old and it was essentially doing a migration in our code base at this point. So I'm not sure we could have anticipated this particular issue.

606.696

Right, especially as your application becomes more important to customers, the impact to customers is more and more extreme. And so in this case, I think this was a Friday night. It wasn't something where a lot of our customers were actively monitoring on their end as well.

626.294

Fortunately, we were able to see that retries were happening and many of our customers use a retry fallback mechanism. So they were able to just allow those to run through. But this is particularly tricky in this case because there wasn't actually a record ID for many of these particular responses. Fortunately, we do keep API logs.

650.535

So we were able to see exactly which requests failed for each of our customers. And so we were able to then reach out to our customer success team and they were able to start to share the impact with each of those customers pretty quickly. I will say that we've done a lot of work to make our customer communication a lot more polished since then.
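
A small Ruby sketch of that kind of impact analysis over API logs. The log entry shape, the `/reports` path, and the customer names are invented for the example; the only thing taken from the discussion is the idea of counting failed report-creation requests per customer.

```ruby
# Hypothetical log entry shape; real API logs would carry more fields.
LogEntry = Struct.new(:customer, :path, :status, keyword_init: true)

# Counts failed report-creation requests per customer.
# The path and status defaults are assumptions for illustration.
def failed_report_requests(entries, path: "/reports", status: 404)
  entries
    .select { |entry| entry.path == path && entry.status == status }
    .group_by(&:customer)
    .transform_values(&:count)
end

logs = [
  LogEntry.new(customer: "acme",   path: "/reports",    status: 404),
  LogEntry.new(customer: "acme",   path: "/reports",    status: 201),
  LogEntry.new(customer: "acme",   path: "/reports",    status: 404),
  LogEntry.new(customer: "globex", path: "/reports",    status: 404),
  LogEntry.new(customer: "globex", path: "/candidates", status: 404)
]

failed_report_requests(logs).each do |customer, count|
  puts "#{customer}: #{count} failed report requests"
end
```

A per-customer summary like this is the kind of thing a customer success team can act on directly.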

670.269

And that's something that we're really focusing on now as well, just being able to get more visibility to customers sooner. And one of the most important things when it comes to monitoring is that you really want to be able to find the issue and start to investigate it before... You don't want a customer to identify it first.

688.7

You should really understand what's happening in your system before anyone else detects that issue.

715.613

Exactly. If some of these requests were happening in the browser or were not set up to automatically retry, that could be a much worse impact on the customer.

789.346

I think it really started for us with our CTO and co-founder, Jonathan, making that a priority from pretty much day one. Basically from the beginning, when we've had issues or incidents, we've done a postmortem doc. We've had a process around that. And it's always been very forward-looking, very much about what could we have done better?

815.187

What can we improve? What are the things we should be doing going forward? So I think having that first touch point and really having that emphasis from the beginning was really important and cascades down. I think as you're building out a bigger engineering team, it's critical to be able to keep that culture going.

835.135

And I think that's definitely a challenge to continue doing. But I think as we've grown, we've been able to do that so far. So I think that was step number one. I think a second piece of it is understanding and trying to understand when it's more of a process issue versus something that someone particularly did wrong.

853.649

And I think a lot of incidents do occur because you're trying to make different prioritization decisions and you're trying to make sure that you anticipate failure moments in advance. And sometimes you just miss those. And those are particular cases where I think the management team needs to really take responsibility for it.

876.249

It's not an individual issue that caused that particular downtime, or it wasn't necessarily that one piece of code. I mean, this is an example, I think, actually, where we had some technical debt and we were trying to clean it up. And that was a good thing.

89.494

Sure. So Checkr was founded in 2014. Daniel and Jonathan are the founders. They had worked in the on-demand space at another company and had discovered that it was very difficult to integrate background checks into their onboarding process.

893.661

But I think we didn't necessarily have everything in place to be able to address that technical debt effectively. And that's not necessarily one engineer's responsibility to be out in front of.

963.052

Absolutely. And I think one other thing to highlight here is that when you don't have a blameless culture, folks are going to be very afraid to speak out when they do see an issue, whether they think it was their mistake or someone else's. They're not going to want to escalate that issue and make sure that it gets attention necessarily.

982.564

And so one of the best side effects of having a blameless culture is that you get a really engaged response and everyone's going to work together to try to address the issue. I think that even cascades down to customer communication as well, because when you're really engaged in trying to do that, then you're doing the best thing possible for the customers as well.

Appearances

Ruby Rogues
