Paul Zaich
Podcast Appearances
I think this was more of a monitoring problem overall. As Dave mentioned, there was a component where a page was snoozed, but I think that was still a failure of our monitoring system. Because in this case, that page was just a signal of the true issue. It came from a downstream client application that had paged earlier on, and it wasn't clear at all what the issue was.
And I think when you're developing a system for alerting clients, you need to have clear action items. That's where custom metrics come in: building application metrics becomes really important as you grow, so that you have a clear signal of what's wrong and someone knows where to investigate. In this case, it was a client application and browser. There's a lot of noise there.
And I can easily understand why someone would just snooze something like that. In my opinion, it wasn't really a people issue in this particular case.
It also depends on where you are in terms of your application's use cases, what the customer profile looks like, how large the company has gotten, and how many people are supporting it. When you're early on, when you're building a new application or a new product, the developers on it are, by definition, going to understand the whole system very well.
So essentially, exception tracking is probably going to give you most of what you need to understand what's going on. As the system starts to grow, and especially as you have more discrete teams, I think that's where things like StatsD become more useful, because you have specific use cases for core parts of your application.
And I would maybe say that the bar there is when you start to have a significant number of paying customers using specific features. Maybe you need to start to home in on one or two key processes where, if they break, it's absolutely critical that you know immediately. That's kind of the point that Checkr was at in 2017. We really needed to have high intelligence...
very clear intelligence and visibility into specific parts of our system. And we were trying to move in that direction when this incident happened. We've continued to invest in that area going forward. I think it's become even more important as we're getting larger because there's just...
so many different systems that are interacting together that no one really understands the whole system at this point. And the only way to really know how the different systems are working together, and to make sure everything's working properly, is to have some of these custom metrics defined for specific key processes.
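As an illustration of what custom metrics for a key process can look like, here is a minimal sketch that emits counters and timers in the StatsD plain-text line format (`name:value|c` and `name:value|ms`) over UDP. The metric names, the `StatsdEmitter` class, and the host and port defaults are assumptions made for this example; nothing here describes how Checkr actually instruments its systems.

```python
import socket


def format_counter(name: str, value: int = 1) -> str:
    """Build a StatsD counter line, e.g. 'key_process.completed:1|c'."""
    return f"{name}:{value}|c"


def format_timer(name: str, millis: float) -> str:
    """Build a StatsD timer line with the duration rounded to whole milliseconds."""
    return f"{name}:{int(round(millis))}|ms"


class StatsdEmitter:
    """Hypothetical best-effort emitter; 127.0.0.1:8125 is an assumed agent address."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, line: str) -> None:
        # Metrics must never break the key process itself, so swallow socket errors.
        try:
            self._sock.sendto(line.encode("ascii"), self._addr)
        except OSError:
            pass


def record_key_process(emitter: StatsdEmitter, succeeded: bool, millis: float) -> None:
    """Emit one outcome counter and one duration timer for a key process run."""
    outcome = "completed" if succeeded else "failed"
    emitter.send(format_counter(f"key_process.{outcome}"))
    emitter.send(format_timer("key_process.duration", millis))
```

An alert on `key_process.failed`, or on `key_process.completed` dropping to zero, points at a specific process, which gives whoever is paged a clearer place to start than a generic client-side error.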
That's a good question. We're all remote now. So at this point, having had to experiment with that, we did have some of those in our office. I think I've been trying to find ways to make that more visible and make metrics more visible to our team as we've shifted to 100% remote due to the pandemic. There's also a challenge for our business in particular where...
Many of our processes are very asynchronous, and they can take hours to days to fully execute. And so finding ways to short-circuit and know that those things are broken can be challenging at times. So one of the things we have to do is look at the data over time as well, and not just look at real-time metrics.
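One way to look at long-running asynchronous work over time, rather than only in real time, is sketched below. It assumes each job records a start time and an optional finish time; the `Job` type, the function names, and the idea of a fixed expected maximum duration are illustrative assumptions, not details from the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical record of one asynchronous process run."""
    job_id: str
    started_at: datetime
    finished_at: Optional[datetime] = None


def overdue_jobs(jobs: List[Job], now: datetime, expected_max: timedelta) -> List[Job]:
    """Return unfinished jobs that have been running longer than expected_max."""
    return [
        job for job in jobs
        if job.finished_at is None and now - job.started_at > expected_max
    ]


def completion_rate(jobs: List[Job], window_start: datetime,
                    window_end: datetime) -> Optional[float]:
    """Fraction of jobs started in [window_start, window_end) that have finished.

    Returns None when no jobs started in the window, so an empty window
    is not mistaken for a 0% completion rate.
    """
    started = [j for j in jobs if window_start <= j.started_at < window_end]
    if not started:
        return None
    finished = sum(1 for j in started if j.finished_at is not None)
    return finished / len(started)
```

Comparing the completion rate of an older window against its usual value can surface a broken process that no real-time metric would catch, because jobs that take days to finish only show up as missing well after they started.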