Paul Zaich
to address this particular issue with the data models as well. But I think for everyone, the bigger issue is really the fact that we missed the outage for so long. And we did actually have some monitoring in place for this particular issue that should have paged for the downtime we were experiencing for some report creations.
But it turned out that our monitors just weren't set up in the most effective way to trigger for that particular type of outage. In this case, it was a partial outage, and that requires a much more sensitive monitor for us to detect. Everything we had designed beforehand was targeted much more toward a complete failure of our system.
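The distinction between the two kinds of monitor could be sketched roughly like this: a hypothetical error-rate check with a low percentage threshold, rather than an all-or-nothing health check. The thresholds and function names here are illustrative, not the actual monitoring configuration from the incident.

```python
def partial_outage_alert(total_requests, failed_requests,
                         error_rate_threshold=0.05, min_requests=20):
    """Fire when a small but sustained fraction of requests fail.

    A monitor that only pages on a complete failure (100% of requests
    erroring, or zero successes) would miss a partial outage like the
    one described above. All thresholds here are illustrative.
    """
    if total_requests < min_requests:
        return False  # not enough traffic to judge an error rate
    error_rate = failed_requests / total_requests
    return error_rate >= error_rate_threshold

# A complete failure obviously pages:
assert partial_outage_alert(100, 100) is True
# But a partial outage (7% of report creations failing) also pages:
assert partial_outage_alert(100, 7) is True
# While normal background noise does not:
assert partial_outage_alert(100, 2) is False
```

A complete-failure monitor is effectively the special case where the threshold is near 100%; detecting a partial outage means lowering that threshold, which in turn requires a minimum-traffic guard so low-volume windows don't produce false pages.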
This particular issue most likely could not have been caught by an automated test because it was so far outside the norm of what we expected the data to look like. We had, of course, unit tests for everything that we were running, and we had request specs as well. But we did not have an end-to-end staging environment where you could run these tests end-to-end.
But again, the data in this particular case was very old, and the code was essentially performing a migration on it at that point. So I'm not sure we could have anticipated this particular issue.
Right, that's always one of the challenges: as your application becomes more important to customers, the impact of an outage on them becomes more and more extreme. In this case, I think this was a Friday night, so it wasn't something where a lot of our customers were actively monitoring on their end either.
Fortunately, we were able to see that retries were happening, and many of our customers use a retry fallback mechanism, so they were able to just allow those to run through. But this was particularly tricky because there wasn't actually a record ID for many of these responses. Fortunately, we do keep API logs.
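The retry fallback described here could be sketched as a client-side retry with exponential backoff. The function and parameter names are hypothetical, not any specific vendor's client library:

```python
import time

def create_report_with_retry(send_request, max_attempts=3, base_delay=1.0):
    """Retry a report-creation call on failure, with exponential backoff.

    `send_request` is a hypothetical callable that returns a response on
    success or raises on a failed request; the attempt counts and delays
    are illustrative.
    """
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # e.g. 1s, 2s, 4s, ...
```

With a fallback like this in place, transient server-side failures during the outage window resolve on a later attempt without any manual intervention from the customer.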
So we were able to see exactly which requests failed for each of our customers. And so we were able to then reach out to our customer success team and they were able to start to share the impact with each of those customers pretty quickly. I will say that we've done a lot of work to make our customer communication a lot more polished since then.
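Going from raw API logs to a per-customer impact report could be sketched like this. The log record fields (`customer_id`, `status`, `request_id`) are hypothetical; a real log schema will differ:

```python
from collections import defaultdict

def failed_requests_by_customer(api_logs):
    """Group failed request log entries by customer.

    Each entry is a dict with hypothetical `customer_id`, `status`, and
    `request_id` fields. Treats 5xx responses as failures, which is an
    assumption about the log format, not the actual pipeline used.
    """
    impact = defaultdict(list)
    for entry in api_logs:
        if entry["status"] >= 500:
            impact[entry["customer_id"]].append(entry["request_id"])
    return dict(impact)

logs = [
    {"customer_id": "acme", "status": 500, "request_id": "r1"},
    {"customer_id": "acme", "status": 200, "request_id": "r2"},
    {"customer_id": "globex", "status": 503, "request_id": "r3"},
]
# Each customer maps to exactly the requests that failed for them,
# which is what a customer success team needs to communicate impact.
assert failed_requests_by_customer(logs) == {"acme": ["r1"], "globex": ["r3"]}
```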
And that's something that we're really focusing on now as well: being able to give customers more visibility sooner. One of the most important things when it comes to monitoring is that you really want to be able to find the issue and start to investigate it first. You don't want a customer to be the one to identify it.
You should really understand what's happening in your system before anyone else detects that issue.
Exactly. If some of these requests were happening in the browser or were not set up to automatically retry, that could be a much worse impact on the customer.