Paul Zaich

👤 Person
168 total appearances

Podcast Appearances

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

to address this particular issue with the data models as well. But I think for everyone, the bigger issue is really the fact that we missed the outage for so long. And we did actually have some monitoring in place for this particular issue that should have paged for the downtime that we were experiencing for some report creation.

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

But it turned out that our monitors were really just not set up in the most effective way to trigger for that particular type of outage. In this case, it was a partial outage, and that requires a much more sensitive monitor in order for us to detect. Everything we designed beforehand was much more targeted towards a complete failure of our system.
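The gap between a monitor tuned for complete failure and one sensitive enough for a partial outage can be sketched roughly as follows. This is only an illustration, not the monitoring setup described in the episode: the names `WindowStats`, `should_page`, and `only_total_outage`, and the 5% threshold and 20-request minimum, are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed in one monitoring window (hypothetical shape)."""
    total: int
    failed: int


def only_total_outage(stats: WindowStats) -> bool:
    # A monitor aimed at complete failure: fires only when every request fails.
    return stats.total > 0 and stats.failed == stats.total


def should_page(stats: WindowStats,
                max_error_rate: float = 0.05,
                min_requests: int = 20) -> bool:
    # A more sensitive monitor: fires when the failure ratio crosses a threshold.
    # The minimum request count avoids paging on a handful of requests.
    if stats.total < min_requests:
        return False
    return stats.failed / stats.total > max_error_rate


# A partial outage: 30 of 200 requests failing (assumed numbers).
partial = WindowStats(total=200, failed=30)
print(only_total_outage(partial), should_page(partial))
```

With these assumed numbers, the ratio-based check fires while the all-or-nothing check stays silent, which is the kind of miss the speaker describes.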

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

This particular issue most likely could not have been caught by an automated test because it was so far outside the norm of what we expected the data to look like. We had, of course, unit tests for everything that we were running, and we had request specs as well. We did not have a staging environment set up where you could run these tests end-to-end.

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

But again, the data in this particular case was very old and it was essentially doing a migration in our code base at this point. So I'm not sure we could have anticipated this particular issue.

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

Right. Especially as your application becomes more important to customers, the impact on customers becomes more and more extreme. In this case, I think this was a Friday night, so it wasn't something where a lot of our customers were actively monitoring on their end as well.

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

Fortunately, we were able to see that retries were happening and many of our customers use a retry fallback mechanism. So they were able to just allow those to run through. But this is particularly tricky in this case because there wasn't actually a record ID for many of these particular responses. Fortunately, we do keep API logs.

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

So we were able to see exactly which requests failed for each of our customers. And so we were able to then reach out to our customer success team and they were able to start to share the impact with each of those customers pretty quickly. I will say that we've done a lot of work to make our customer communication a lot more polished since then.
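As a rough illustration of how API logs can be used to work out which requests failed for each customer, the sketch below groups failed log entries by customer. The log entry fields (`customer_id`, `request_id`, `status`) and the function name are hypothetical; the episode does not describe the actual log format.

```python
from collections import defaultdict


def failed_requests_by_customer(log_entries):
    """Group the request IDs of server-side failures (5xx) by customer.

    Each entry is assumed to be a dict with 'customer_id', 'request_id',
    and 'status' keys; this shape is an assumption for illustration.
    """
    failures = defaultdict(list)
    for entry in log_entries:
        if entry["status"] >= 500:
            failures[entry["customer_id"]].append(entry["request_id"])
    return dict(failures)


# Made-up log entries for illustration only.
logs = [
    {"customer_id": "acme", "request_id": "r1", "status": 500},
    {"customer_id": "acme", "request_id": "r2", "status": 201},
    {"customer_id": "globex", "request_id": "r3", "status": 503},
    {"customer_id": "acme", "request_id": "r4", "status": 500},
]
print(failed_requests_by_customer(logs))
```

Keying on the request rather than on a created record matters here, since the speaker notes that many failed responses had no record ID to look up.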

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

And that's something that we're really focusing on now as well: being able to give customers more visibility sooner. One of the most important things when it comes to monitoring is that you really want to be able to find the issue and start to investigate it yourself. You don't want a customer to identify it first.

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

You should really understand what's happening in your system before anyone else detects that issue.

Ruby Rogues
The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

Exactly. If some of these requests were happening in the browser or were not set up to automatically retry, that could be a much worse impact on the customer.
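
The automatic retry behavior mentioned here might look something like the client-side sketch below. It is a generic retry loop with exponential backoff, not any particular customer's integration; `TransientError`, `call_with_retries`, and the delay values are assumptions.

```python
import time


class TransientError(Exception):
    """Stands in for a retryable failure, e.g. a 5xx response (hypothetical)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt; the final failure is re-raised.
    The sleep function is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: an operation that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_create_report():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporary failure")
    return "created"

print(call_with_retries(flaky_create_report, sleep=lambda seconds: None))
```

A caller without a loop like this, such as a one-shot request from a browser, would surface the first failure directly to the end user.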
