Paul Zaich

👤 Person
168 total appearances

Podcast Appearances

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

And they actually used the script at about 4:30 PM, probably went and grabbed something, had a little happy hour and then headed home. And about an hour later, we started to receive a few different pages going to completely unrelated teams that didn't really know what was going on in terms of this backfill. And it didn't look like anything too serious.

It was just an elevated number of exceptions in our client application that does some of the candidate PII collection. And so that team decided to snooze that and just kind of ignore it.

Thank you. Anyway, go ahead. So come Saturday morning, this has been going on for about 12 hours now, and this exception comes in again. And at that point, someone on our team actually decided to escalate that and get more stakeholders involved. We had a variety of other issues going on. We had just migrated from one deployment platform to Kubernetes. And so we had some issues getting onto the cluster.

There were too many of us trying to get on at the same time. So we ended up all having to actually go into the office to the physical intranet to finally get in and debug the issue. So we had a couple of other confounding issues come up at the same time that made the process of response even worse.

So finally, at maybe 10 or 11 in the morning, after being able to take a look at that, we identified what the issue was. And we were responding successfully to only about 50 to 60% of requests to one of the most critical endpoints on our system, which is the one that actually creates requests to make a report.

So after you've collected the candidate's information, you say, please execute this report so we can get that back. And that's a synchronous request that you make using our API. And when that request was failing, it was failing about 40 to 50% of the time with a 404 response, which isn't really expected. So at that point, we were finally able to pin down the issue and it came back to this script.
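
Editor's note: the failure pattern described here (a synchronous create call returning 404 for roughly 40 to 50% of requests) is the kind of signal an error-rate check on a single endpoint can surface. The following is a minimal Ruby sketch under stated assumptions, not the team's actual monitoring: the names `failure_rate`, `page?`, and `PAGE_THRESHOLD`, the 5% threshold, and the choice to count every 4xx or 5xx status as a failure are all hypothetical.

```ruby
# Hypothetical sketch: names and thresholds are assumptions, not from the episode.
PAGE_THRESHOLD = 0.05 # assumed: page when more than 5% of requests fail

# Returns the fraction of responses in the window with a 4xx or 5xx status.
def failure_rate(statuses)
  return 0.0 if statuses.empty?

  failures = statuses.count { |code| code >= 400 }
  failures.to_f / statuses.size
end

# Decides whether the window warrants paging the endpoint's owners.
def page?(statuses, threshold: PAGE_THRESHOLD)
  failure_rate(statuses) > threshold
end

# A window resembling the incident: 45 of 100 create-report calls return 404.
window = [201] * 55 + [404] * 45
puts format("failure rate: %.0f%%", failure_rate(window) * 100)
puts "page on-call" if page?(window)
```

Measuring the rate on the report-creation endpoint itself, rather than relying on exception counts in a downstream client application, is one way a check like this could route the page to the team that owns the failing call.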

And it turned out that when you went to create this report, we would create these additional sub-objects called screenings. And due to the script, we had actually created an issue where validation would cause the reports to fail to create in this edge case. So there were some confounding issues with the way that we had set up the data

of modeling to begin with that we were trying to work around and this exception happened. But when we finally fixed the issue, that's where we shifted more into what actually went wrong and what were the real issues that caused this outage.

Right. So I think the first most important thing that we did was that really from the beginning, we have had what we call a blameless culture. I think it's a common term now in the industry, but the idea there is to really focus on learning from issues, not trying to find who made the particular mistake and trying to look at what processes

as well, that would have prevented the problem from happening. So not trying to focus on the individual mistake that was made. So as part of that, we did a postmortem doc and we went through and identified things like, one, we should really have a dedicated script repository that goes through a code review process. So that's one thing we implemented. And we made some safeguards and
