
Ruby Rogues

The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652

Mon, 23 Sep 2024

Description

Paul Zaich from Checkr tells us about a critical outage that occurred, what caused it, and how they tracked down and fixed the issue. The conversation ranges through troubleshooting complex systems, building team culture, blameless post-mortems, and monitoring the right things to make sure your applications don't fail, or alert you when they do.

Links

Paul's Twitter
Paul's LinkedIn

Picks

Blood Pressure Monitor - Dave
eft - Luke
Ruby one-liners cookbook - Paul
Podcast Growth Summit - Chuck
Most Valuable Dev - Chuck
Most Valuable Dev Summit - Chuck
Mushroom Wars - Chuck
Gmelius - Chuck

Become a supporter of this podcast: https://www.spreaker.com/podcast/ruby-rogues--6102073/support.

Transcription

5.742 - 24.793 Luke Stutters

Hey, everybody, and welcome to another episode of the Ruby Rogues podcast. This week on our panel, we have Luke Stutters. Hello. We have Dave Kimura. Hey, everyone. I'm Charles Max Wood from devchat.tv. Quick shout out about mostvaluable.dev. Go check it out. We have a special guest this week, and that is Paul Zaich.

25.214 - 27.015 Paul Zaich

Zaich. Well done. Thank you.

27.924 - 42.288 Luke Stutters

Now, you're here from Checkr. You gave a talk at RailsConf about how you broke stuff or somebody broke stuff. Do you want to just kind of give us a quick intro to who you are and what you do? And then we'll dive in and talk about what broke and how you figured it out?

42.608 - 54.957 Paul Zaich

Sure. So I've been a software engineer for about 10 years. Recently, in the last year or so, I transitioned into an engineering management role. But I've worked at a number of different small startups.

54.977 - 68.865 Paul Zaich

I joined Checkr in 2017 when the company was at about 100 employees, 30 engineers, contributed as an engineer for a couple of years to our team, and then have recently transitioned, like I said, into an engineering management role at the company.

69.221 - 89.054 Luke Stutters

Very cool. I actually have a Checkr t-shirt in my closet that I never wear. It's Checkr for those that are listening and not reading it. Yeah. So why don't you kind of tee us up for this as far as, yeah, what happened? What broke? Yeah. Give us a preliminary timeline and explain what Checkr does and why that matters.

89.494 - 104.825 Paul Zaich

Sure. So Checkr was founded in 2014. Daniel and Jonathan, our founders, had worked in the on-demand space at another company and had discovered that it was very difficult to integrate background checks into their onboarding process.

105.046 - 119.916 Paul Zaich

Background checks tend to be a very important final safety step for a lot of these companies to make sure that their platform is going to be safe and secure for their customers. And so in 2014, they started an automated background check company.

120.356 - 146.593 Paul Zaich

And initially, the biggest selling point was that Checkr abstracted away a lot of the complexity of the background check process, collecting candidate information, and then executing that flow and exposing that via an API that was developed in a Sinatra app. And three years later, in 2017, I had just joined about four or five months before this particular incident happened.

146.793 - 165.829 Paul Zaich

Fast forward to that point, we were running, I'd say, a few million checks a year for a variety of different customers. Most of those customers use our API, like I said before, to interface with the candidate on their side in their own application.

166.21 - 179.706 Luke Stutters

Oh, that's interesting. Yeah, I think a lot of the background check portals that I've seen, they're like the fully baked portal instead of being a background service that somebody else can integrate into their own app.

180.171 - 183.315 Dave Kimura

Did you do a background check on me before this episode?

183.715 - 205.775 Paul Zaich

I did not. There are a lot of very important guidelines and stipulations governed by the Fair Credit Reporting Act that make sure that you have to have a permissible purpose for running a background check. So in this case, most of our customers are using the permissible purpose around employment as the reason for actually running that check.

205.895 - 206.915 Chuck

Well, that's no fun.

206.935 - 217.097 Luke Stutters

I know, right? I want to know everybody's dirty secrets. Interesting. So, yeah, why don't you tell us a little bit about what went down with the app, right?

217.457 - 240.881 Paul Zaich

So, like I said, in 2017, Checkr at this point was a pretty important component of a number of customers' onboarding processes. But we had started off small and things grew quickly. In a lot of ways, we were just trying to keep the lights on and scale the system along with our customers as they continued to grow. On-demand was growing a lot at this time as well.

241.141 - 265.354 Paul Zaich

So in 2017, we were doing some fairly routine changes to a data model. I wasn't directly involved with that, but we were changing something from an integer ID to a UUID and the references, and there were some backfills that needed to happen. And so an engineer executed a script on a Friday afternoon, which is always a great idea to do.
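
Purely to make the scenario concrete, here is a minimal sketch of what a batched backfill of this kind often looks like; the model and column names below are invented for illustration, not Checkr's.

```ruby
# Hypothetical one-off script: copy a new UUID reference onto existing
# child records in batches, skipping validations and callbacks.
class BackfillScreeningReportUuid
  def self.run!
    Screening.where(report_uuid: nil).in_batches(of: 1_000) do |batch|
      batch.each do |screening|
        screening.update_columns(report_uuid: screening.report&.uuid)
      end
      sleep 0.5 # throttle so the backfill doesn't saturate the primary database
    end
  end
end
```
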

266.635 - 287.186 Paul Zaich

And they actually ran the script at about 4:30 PM, probably went and grabbed something, had a little happy hour and then headed home. And about an hour later, we started to receive a few different pages to completely unrelated teams that didn't really know what was going on in terms of this backfill. And it didn't look like anything too serious.

287.306 - 301.153 Paul Zaich

It was just an elevated number of exceptions in our client application that does some of the candidate PII collection. And so that team decided to snooze that and just kind of ignore it.

301.333 - 308.437 Luke Stutters

Yeah. So for people that aren't aware, PII is an acronym for Personally Identifiable Information, and is usually protected by law.

308.878 - 334.145 Paul Zaich

Thank you. Anyway, go ahead. So come Saturday morning, this has been going on for about 12 hours now, this exception comes in again. And at that point, someone on our team actually decided to escalate that and get more stakeholders involved. We had some variety of other issues going on. We just migrated from one deployment platform to Kubernetes. And so we had some issues getting onto the cluster.

334.846 - 350.7 Paul Zaich

There are too many of us trying to get on at the same time. So we ended up all having to actually go into the office to the physical intranet to finally get in and debug the issue. So we had a couple of other confounding issues come up at the same time that made the process of response even worse.

351.08 - 370.976 Paul Zaich

So finally, this is maybe 10 or 11 in the morning, after being able to take a look at that, we finally identified what the issue was. We were only responding successfully to about 50 to 60% of requests to one of the most critical endpoints on our system, which is the request to actually create a report.

371.137 - 393.175 Paul Zaich

So after you've collected the candidate's information, you say, please execute this report so we can get that back. And that's a synchronous request that you make using our API. And when that request was failing, it was failing about 40 to 50% of the time with a 404 response, which isn't really expected. So at that point, we were finally able to pin down the issue and it came back to this script.

393.475 - 412.875 Paul Zaich

And it turned out that when you went to create this report, we would create these additional sub-objects called screenings. And due to the script, we had actually created an issue where a validation would cause the reports to fail to create in this edge case. So there were some confounding issues with the way that we had set up the data

412.915 - 427.848 Paul Zaich

modeling to begin with that we were trying to work around, and this exception happened. But when we finally fixed the issue, that's where we shifted more into what actually went wrong and what the real issues were that caused the outage.

428.271 - 447.439 Luke Stutters

Gotcha. So I'm curious, as you work through this, what did you add to your workflow to make sure that this doesn't happen again? Because I mean, some of it's going to be technical, right? It's testing or, you know, maybe you set up a staging environment or something like that. And some of it is going to be, hey, when this kind of alert comes up, do this thing, right?

447.499 - 450.46 Luke Stutters

Because it sounded like you did have some early indication that this happened.

450.931 - 473.575 Paul Zaich

Right. So I think the first most important thing that we did was that really from the beginning, we have had what we call a blameless culture. I think it's a common term now in the industry, but the idea there is to really focus on learning from issues, not trying to find who made the particular mistake and trying to look at what processes

477.443 - 500.837 Paul Zaich

as well, that would have prevented the problem from happening. So not trying to focus on the individual mistake that was made. So as part of that, we did a postmortem doc and we went through and identified things like, one, we should really have a dedicated script repository that goes through a code review process. So that's one thing we implemented. And we put some safeguards in place

501.517 - 521.313 Paul Zaich

to address this particular issue with the data models as well. But I think for everyone, the bigger issue is really the fact that we missed the outage for so long. And we did actually have some monitoring in place for this particular issue that should have paged for the downtime that we were experiencing for some report creation.

521.513 - 544.493 Paul Zaich

But it turned out that our monitors were really just not set up in the most effective way to trigger for that particular type of outage. In this case, it was a partial outage, and that requires a much more sensitive monitor in order for us to detect. Everything we designed beforehand was much more targeted towards a complete failure of our system.

544.898 - 549.401 Chuck

And so was this something that could have been caught by automated tests?

549.842 - 576.111 Paul Zaich

This particular issue most likely could not have been caught by an automated test because it was so... so outside of the norm of what we expected the data to look like. We had, of course, unit tests for everything that we were running and we had request specs as well. We did not have an end-to-end environment set up for a staging environment where you could run these tests end-to-end.

576.711 - 592.452 Paul Zaich

But again, the data in this particular case was very old and it was essentially doing a migration in our code base at this point. So I'm not sure we could have anticipated this particular issue.

592.812 - 605.456 Dave Kimura

What was the fallout? Did everyone like phone up and get really angry? Oh, from a customer perspective? Yeah, yeah. This is the best bit of outage stories is the kind of the human cost of whoever has to answer the phone the next week.

605.736 - 606.376 Luke Stutters

Code drama.

606.696 - 626.114 Paul Zaich

Right, that's always one of the concerns. Especially as your application becomes more important to customers, the impact to customers is more and more extreme. And so in this case, I think this was a Friday night. It wasn't something where a lot of our customers were actively monitoring on their end as well.

626.294 - 650.195 Paul Zaich

Fortunately, we were able to see that retries were happening and many of our customers use a retry fallback mechanism. So they were able to just allow those to run through. But this is particularly tricky in this case because there wasn't actually a record ID for many of these particular responses. Fortunately, we do keep API logs.

650.535 - 670.068 Paul Zaich

So we were able to see exactly which requests failed for each of our customers. And so we were able to then reach out to our customer success team and they were able to start to share the impact with each of those customers pretty quickly. I will say that we've done a lot of work to make our customer communication a lot more polished since then.

670.269 - 688.58 Paul Zaich

And that's something that we're really focusing on now as well, just being able to get more visibility to customers sooner. And one of the most important things there, when it comes to monitoring, is that you really want to be able to find the issue and be able to start to investigate it before... You don't want a customer to identify it first.

688.7 - 693.683 Paul Zaich

You should really understand what's happening in your system before anyone else detects that issue.

693.863 - 715.093 Dave Kimura

And I guess for this specific, not this specific product, but kind of product where your customers are consuming your API, you're also at the mercy of their implementation too. So, you know, they're making a kind of call against you. And if that call is failing, you know, you've got to hope that their system can cope with that as well.

715.613 - 725.46 Paul Zaich

Exactly. If some of these requests were happening in the browser or were not set up to automatically retry, that could be a much worse impact on the customer.

725.78 - 746.976 Dave Kimura

Can we talk about the blameless culture for a bit? This is a new idea. And when I was managing engineering teams, I used to have what I called the finger of blame. So I used to do it the other way around. I would hold up my finger in a meeting and I'd introduce the finger as the finger of blame. And then we'd work out who the finger of blame should be pointing to.

746.996 - 767.135 Dave Kimura

Now, more often than not, of course, it was me. So the finger of blame was a double-edged finger. But it was a kind of way of, you know... people take it very seriously when they mess up that kind of stuff. So you kind of have to get your team back on board. So it was a way of kind of lightening the mood after that week's disaster.

767.355 - 788.986 Dave Kimura

But a blameless culture, as you said, is the kind of more sophisticated way of doing it instead of pointing a jovial finger at the person who messed up. What does that look like? I mean, you know, do you just go around telling people it's not their fault? Or, you know, how do you implement a blameless culture in what sounds like quite a big engineering team?

789.346 - 814.707 Paul Zaich

I think it starts for us with, it really started with our CTO and co-founder, Jonathan, making that a priority from pretty much day one, basically from the beginning. Whenever we've had issues or incidents, we've done a postmortem doc. We've had a process around that. And it's always been very forward-looking, very much about what could we have done better?

815.187 - 834.574 Paul Zaich

What can we improve? What are the things we should be doing going forward? So I think having that first touch point and really having that emphasis from the beginning was really important, and it cascades down. I think as you're building out a bigger engineering team, that's critical to be able to keep that culture going.

835.135 - 853.469 Paul Zaich

And I think that's definitely a challenge to continue doing. But I think as we've grown, we've been able to do that so far. So I think that was step number one. I think a second piece of it is understanding and trying to understand when it's more of a process issue versus something that someone particularly did wrong.

853.649 - 876.028 Paul Zaich

And I think a lot of the time, I think a lot of incidents do occur because you're trying to make different prioritization decisions and you're trying to make sure that you anticipate failure moments in advance. And sometimes you just miss those. And those are particular cases where I think the management team needs to really take responsibility for it.

876.249 - 893.581 Paul Zaich

It's not an individual issue that caused that particular downtime; it wasn't necessarily that one piece of code. And this is an example, I think, actually, where we had some technical debt we were trying to clean up. And that was a good thing.

893.661 - 903.687 Paul Zaich

But I think we didn't necessarily have everything in place to be able to address that technical debt effectively. And that's not necessarily one engineer's responsibility to be out in front of.

903.707 - 921.357 Luke Stutters

Yeah, one thing I just want to add is that I like the blameless culture just from the sense of unless somebody is either malicious, which I have never, ever, ever encountered, or is chronically reckless, which I've also never encountered, right? Everybody is usually trying to pull along in the same way.

921.417 - 941.965 Luke Stutters

You know, if somebody has that issue, you identify it pretty fast and you usually are able to counter it before it becomes a real problem. But yeah, just to put that together, then, you know, yeah, the rest of it, it's, hey, look, we're on the same team. We're all trying to get the same place. So let's talk about how we can do this better so that doesn't happen again.

942.205 - 962.532 Luke Stutters

Because next time it might be me, right? That misses a critical step. And I don't want you all fingering me either. I mean, I want to learn from it, but I, you know, we don't want people... walking around in fear. Instead, if somebody screws up, we want them to come forward and say, hey, I might have messed this up before it becomes an issue next time.

963.052 - 981.944 Paul Zaich

Absolutely. And I think one other thing to highlight here is that when you don't have a blameless culture, folks are going to be very afraid to speak out when they do see an issue, whether they think it was their mistake or someone else's. They're not going to want to escalate that issue and make sure that it gets attention necessarily.

982.564 - 1003.979 Paul Zaich

And so one of the best side effects of having a blameless culture is that you get really engaged response and everyone's going to work together to try to address the issue. I think that even cascades down to customer communication as well, because when you're really engaged in trying to do that, then you're doing the best thing possible. for the customers as well.

1004.019 - 1010.369 Paul Zaich

You're trying to address these issues head on, not trying to find ways to kind of smooth them over under the surface.

1010.849 - 1027.348 Luke Stutters

Yeah, it also, and this is important, and sometimes I think people hear this and they're going to go, That sounds a little scary. But you want people to take chances sometimes, right? You want people to kind of take a shot at making things better. That opens it up to them to do that, right?

1027.469 - 1048.558 Luke Stutters

It's, oh, well, you know, I tried this tweak on the Jenkins file or I tried this tweak on the Kubernetes setup or I tried this tweak on this other thing. And a lot of times those things pay off. But if you don't give people the freedom to go for it, a lot of times you're going to miss out on a lot of those benefits. And again, as long as they're not being reckless about it, right?

1048.618 - 1067.108 Luke Stutters

So they're taking the steps, they're verifying it on their own system and things like that, then you benefit much, much more from people being willing to take a shot. So yeah, so with the blameless culture, I'm curious. So you get together and you start identifying what the issue is. So what does that look like then as far as figuring out what's going on?

1067.128 - 1071.45 Luke Stutters

Because you're not pointing fingers, but you are looking for the commit that made the problem, right?

1071.77 - 1088.08 Paul Zaich

You are. I think at the end of the day, you're going to try to find the root cause, right? You're going to look for that commit. You're going to look for the log. Maybe it was a script that was logged into your logging system, whatever it is. You're going to look for that and look for the root cause.

1088.861 - 1109.158 Paul Zaich

So honestly, a lot of times, you know, if what caused the issue was something that was specifically run by a specific person, they probably feel a little bit of guilt there, but there's no reason to lay on more. And I think everyone, like you said, feels a lot of responsibility around the work that they're doing already. So there's no reason to overemphasize that.

1109.358 - 1128.649 Paul Zaich

So what that looks like is typically the team that is impacted is really going to own that postmortem. And that's one way for you to feel like you're resolving the incident or the issue that caused the incident. This has definitely become a bit of a different process as the team is growing.

1128.729 - 1144.753 Paul Zaich

When we were at 30 engineers, I think it was a little bit easier just to know exactly who should work on those types of mitigations. Typically, it's pretty isolated to a specific team. As the team is growing and the system is growing, that's definitely become more of a challenge because sometimes,

1145.493 - 1166.524 Paul Zaich

incidents happen because of different issues that multiple teams have introduced, or maybe there's multiple teams that need to be involved in the mitigation. And in that case, we've definitely been trying to evolve our postmortem process and the action items. So we have a program manager, and one of her responsibilities is specifically around making sure that we are coordinating some of those action items. We've added

1170.68 - 1187.503 Paul Zaich

some additional rules and coordination around the process as we've started to grow. A lot of it was just on the individual teams initially, and now as we've grown, again, there's more process involved. I think that's a pretty common thing that you have to introduce as teams grow.

1188.044 - 1211.602 Dave Kimura

I will say that if you've got relatives who are in the medical profession, especially if they're pathologists, even the use of the term post-mortem makes me uncomfortable because those are no fun at all. But, yeah, it's also a word that we use. So, yeah, it just makes me – oh, it's creepy – It's all zombies. I don't know.

1211.642 - 1220.684 Dave Kimura

Yeah, the post-mortem brings me flashbacks to episodes of the X-Files in the 90s when Dana Scully was taking an alien apart.

1220.944 - 1231.427 Luke Stutters

Yeah, but it does give you a little perspective too, right? Because usually in our post-mortems, we're talking about what went wrong with the system, not that somebody actually died because of this, right?

1231.727 - 1234.968 Dave Kimura

I just got a weird brain, all right? It's what my brain thinks of.

1235.729 - 1249.879 Luke Stutters

Well, some software is life-supporting, you know, a lot of the medical equipment and stuff out there. But, you know, in this case, yeah, we all want to keep our jobs as well. So, I mean, it's not like we can just blow it off either. So, yeah.

1249.959 - 1260.75 Luke Stutters

So I want to get back to the topic at hand, though, and talk a little bit about what kind of monitoring did you have before and what kind of monitoring you have now in order to catch this kind of thing.

1261.343 - 1288.955 Paul Zaich

So we used a number of different types of monitoring. At the time, we were pretty heavily reliant on exception tracking, and we also had some application performance monitoring as well, commonly called APM. A couple of examples of that would be something like New Relic, or Datadog has a product as well now. And then we did also use a StatsD cluster that sent metrics over to Datadog.

1289.316 - 1308.207 Paul Zaich

And I think we just had started using that maybe just a few months before this particular incident occurred. So like I alluded to before, we had some monitors for this particular issue, but they were pretty simplistic. They basically just looked for a minimum threshold of the number of reports that we're creating.

1308.927 - 1331.402 Paul Zaich

And we had to set that threshold to be very low over like an hour period because traffic is variable. You never know exactly how many reports you're going to get created. There are times of day where we receive very few requests, and then there are other times where we see large spikes. So we just had very simplistic monitoring in place for some of these key metrics at that point.

1331.622 - 1354.18 Paul Zaich

At that point, we were still very heavily reliant on, like I said, exception tracking using bug trackers like Sentry that could then alert if you had certain thresholds of errors over a period of time. In this particular case, exception tracking wasn't very useful because we were responding with a 404. There was an exception in the system.

1354.38 - 1366.988 Paul Zaich

It was just an ActiveRecord record-not-found, something like that, that was then handled automatically and turned into the 404 response. So it was expected behavior, but there wasn't an exception that could have been caught.
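
For readers unfamiliar with the pattern Paul is describing, here is a rough Rails-flavored sketch (Checkr's API was a Sinatra app, so the details differed) of how a rescued record-not-found turns into a clean 404 that never reaches an exception tracker like Sentry; the controller and model names are made up.

```ruby
class ApplicationController < ActionController::Base
  # The exception is raised, but it is rescued here and converted to a 404,
  # so nothing ever surfaces in exception tracking.
  rescue_from ActiveRecord::RecordNotFound do |_error|
    render json: { error: "not_found" }, status: :not_found
  end
end

class ReportsController < ApplicationController
  def create
    # If bad data makes this lookup unsatisfiable, the request "fails" as an
    # expected-looking 404 rather than a tracked exception.
    candidate = Candidate.find(params[:candidate_id])
    render json: Report.create!(candidate: candidate), status: :created
  end
end
```
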

1367.388 - 1373.891 Luke Stutters

Yeah, that makes sense. Somebody typed this question in. It was one of the panelists. Did you get that answered? I don't know if it was Luke or Dave.

1374.051 - 1385.858 Dave Kimura

It was me. Just to be clear, was this incident a monitoring problem or an alerting problem? Because it sounds like an alert did go off at some point.

1386.442 - 1389.204 Chuck

Sounds like it was a people problem because they snoozed the alert.

1389.704 - 1421.744 Paul Zaich

I think this was more of a monitoring problem overall. As Dave mentioned, there was a component where a page was snoozed, but I think that was still a failure on our monitoring system. Because in this case, that was just a signal of what the true issue was. It was a downstream client application that had a page earlier on. And it wasn't clear at all what the issue was.

1422.145 - 1448.669 Paul Zaich

And I think when you're developing a system for alerting, you need to have clear action items. And that's where custom metrics, building application metrics as you grow, become really important: having a clear signal of what's wrong so that someone knows where to investigate. In this case, it was a client application in the browser. There's a lot of noise there.

1449.57 - 1457.494 Paul Zaich

And I can easily understand why someone would just snooze something like that. In my opinion, it wasn't really a people issue in this particular case.

1458.291 - 1482.757 Chuck

Yeah, I think we've all been there before where we get an alert from whatever monitoring that we're doing and the error looks serious, but you kind of read it and like, oh, you know what, this is probably just a one-off situation. And then turns out it is actually a big deal that needs to be addressed as soon as possible. So I know I've been there before and

1483.697 - 1508.687 Chuck

And, you know, the hard thing is to really track this. I use Sentry for my error tracking, and so I get email and text notifications with that. And one of the nice things about it is that it'll show the number of occurrences, whether they are unique or not. So I can see if, okay, this particular error is only coming from one user.

1510.328 - 1533.261 Chuck

Or I could see we're getting 100 errors that's coming from 100 different users. So there's a more widespread problem. So I think definitely getting the notifications, but then having proper analytics on your errors so you can actually see the scope of how big this is can really kind of weigh in on the importance.

1533.841 - 1554.112 Dave Kimura

Yeah, makes sense. I imagine, Dave, you've been through, like me, many different monitoring platforms. Datadog, you said New Relic... which are the good monitoring platforms? Or which ones are you like, this is the platform that works really well for this API situation?

1554.574 - 1582.235 Chuck

I think it all depends on what you're doing. So if you have a heavy JavaScript front end kind of deal, and if you also have a lot of Ruby backend code, I know Sentry can handle both of those situations. Other people will go with another solution. So I personally found Sentry to be my flavor of choice, but mileage will vary based on what other people have.

1582.771 - 1606.545 Paul Zaich

It also depends on where you are in terms of your application's use cases, what the customer profile looks like, how large the company has gotten, how many people are supporting it. When you're early on, building a new application, a new product, by definition the developers on it are going to really understand the whole system very well.

1607.125 - 1631.478 Paul Zaich

So essentially, exception tracking probably is going to be able to give you most of what you need to know in terms of being able to understand what's going on. As the system starts to grow, and especially as you have more discrete teams, I think that's where things like StatsD become more useful, because you can instrument use cases for core parts of your application.

1631.498 - 1652.407 Paul Zaich

And I would maybe say that the bar there is maybe when you start to hit the point where you have a significant number of paying customers using specific features. Maybe you need to start to hone in on one or two key processes where, if they break, it's absolutely critical that you know immediately. That's kind of the point that Checkr was at in 2017. We really needed to have

1653.187 - 1669.716 Paul Zaich

very clear intelligence and visibility into specific parts of our system. And we're trying to move in that direction when this incident happened. We've continued to invest in that area going forward. I think it's become even more important as we're getting larger because there's just...

1670.456 - 1687.723 Paul Zaich

so many different systems that are interacting together that no one really understands the whole system at this point. And the only way to really know how the different systems are working together, and to make sure everything's working properly, is to have some of these custom metrics defined for specific key processes.

1688.123 - 1695.027 Dave Kimura

Do you find that putting really large screens on the office wall helps make your application more reliable?

1695.207 - 1716.598 Paul Zaich

That's a good question. We're all remote now, so at this point we've had to experiment with that. We did have some of those in our office. I've been trying to find ways to make metrics more visible to our team as we've shifted to 100% remote due to the pandemic. There's also a challenge for our business in particular where...

1717.498 - 1738.212 Paul Zaich

Many of our processes are very asynchronous, and they can take hours or days to fully execute. And so finding ways to short-circuit and know that those things are broken can be challenging at times. So one of the things we have to do is look at the data over time as well and not just look at real-time metrics.

1738.472 - 1760.364 Paul Zaich

So one thing I've been experimenting with is trying to create more automated reports that go into sort of a Slack channel that we can look at. And so people can review that. And we've also implemented basically a bi-weekly review during our retro where we just look at our metrics and some of the longer running trends so that we can see if those look correct. Is there anything that's wrong?

1760.424 - 1770.768 Paul Zaich

We can talk about it, see if there's things that we want to actually action on based on that review. So we're trying to find some ways to do check-ins that don't require us to be all in the office together.

1770.928 - 1795.541 Dave Kimura

The Slack channel truly is the giant performance monitor of 2020. That is literally what tells me whether stuff is working at the moment. I'm thinking there are a lot of people in the same boat. So it sounds like you're saying that once you get to a certain stage... then the off-the-shelf monitoring isn't really going to cut it. So you have written custom monitoring for your application.

1795.581 - 1796.142 Dave Kimura

Is that correct?

1796.462 - 1824.667 Paul Zaich

We have implemented what I consider custom metrics. We use Datadog, so a lot of this is out of the box. You can use their implementation, but you're adding some code to specific parts of your application. Maybe it's a callback on your ActiveRecord model. When something is created, you send a message to a queue, and that then triggers a message into StatsD that goes to Datadog.

1824.927 - 1840.74 Paul Zaich

Anyways, it's a pretty lightweight implementation in terms of what you can do, but you're adding specific events that you want to track. And then you can create your own monitors and alerting around those or correlations between different events in your system.
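
A minimal sketch of the callback-to-StatsD pattern described above, assuming the dogstatsd-ruby gem; the metric name, namespace, and tag are placeholders, and the intermediate queue Paul mentions is skipped for brevity.

```ruby
# Gemfile: gem "dogstatsd-ruby"
require "datadog/statsd"

STATSD = Datadog::Statsd.new("localhost", 8125, namespace: "myapp")

class Report < ApplicationRecord
  after_create_commit :emit_created_metric

  private

  # One event per successful report creation; a Datadog monitor can then
  # alert when the hourly count drops below an expected floor.
  def emit_created_metric
    STATSD.increment("reports.created", tags: ["package:#{package}"])
  end
end
```
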

1841.161 - 1861.038 Paul Zaich

So you could potentially look at a custom metric and then look at that compared to HTTP statuses that are coming through or the latency of an endpoint. And then you could correlate those two metrics as well. So there's some more advanced things you can do there as well if you need to. But again, it's not really a lot of custom work.

1861.078 - 1875.572 Paul Zaich

It's just adding some specific points in your code base that you feel like are really important to track. And one example of this for Rails users is, I believe there's something like this already set up for Datadog for Sidekiq. So we instrument it on a lot of our

1875.692 - 1893.76 Paul Zaich

Sidekiq jobs, and we can see when the lag is growing on one of those queues, we can see what the average completion time is, and we can look at the p90 completion time for different types of jobs. So you get a lot of visibility into your Sidekiq workers and processes very easily, basically for free.
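
For the Sidekiq visibility mentioned here, a hedged sketch of emitting queue lag and depth yourself using Sidekiq's public queue API; the metric names and StatsD setup are assumptions, and Datadog's own Sidekiq integration may already cover this for you.

```ruby
require "sidekiq/api"
require "datadog/statsd"

statsd = Datadog::Statsd.new("localhost", 8125, namespace: "myapp")

# Run periodically (cron, clockwork, or a scheduled job).
Sidekiq::Queue.all.each do |queue|
  tags = ["queue:#{queue.name}"]
  # latency: seconds the oldest job in the queue has been waiting
  statsd.gauge("sidekiq.queue.latency", queue.latency, tags: tags)
  statsd.gauge("sidekiq.queue.size", queue.size, tags: tags)
end
```
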

1893.98 - 1913.844 Chuck

And if you're going to use Slack for your error notification, I'm not dissing that at all. I have a few applications that actually do that. It just triggers a Slack notification. But if you're only capturing the error message and not a stack trace along with it, then that error message is pretty much useless.

1914.104 - 1920.307 Chuck

Because it tells you you have a problem somewhere in your millions of lines of code, but we're not going to tell you where it's at.

1920.747 - 1941.077 Paul Zaich

Just to be clear, we capture all of our errors in Sentry. We do have some alerting that goes to Slack, but I would also want to emphasize that anything that truly has any chance of being a serious issue should never be either an email or a Slack alert.

1942.978 - 1966.197 Paul Zaich

You really should have some kind of escalation via either maybe it's text, maybe it's an actual incident response system like PagerDuty where you can have an escalation policy. For us, that's what we're using. It should have this synchronous alerting that really forces someone to look at it. You can't rely on something asynchronous like Slack in this case for serious response on issues.
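
As a sketch of the synchronous escalation Paul describes, here is roughly what triggering a page through PagerDuty's Events API v2 can look like from Ruby; the routing key, source, and summary are placeholders.

```ruby
require "net/http"
require "json"
require "uri"

def page_on_call!(summary)
  body = {
    routing_key:  ENV.fetch("PAGERDUTY_ROUTING_KEY"), # placeholder integration key
    event_action: "trigger",
    payload: {
      summary:  summary,
      source:   "reports-api",
      severity: "critical"
    }
  }
  Net::HTTP.post(
    URI("https://events.pagerduty.com/v2/enqueue"),
    body.to_json,
    "Content-Type" => "application/json"
  )
end

page_on_call!("Report creation returning unexpected 404s")
```
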

1966.657 - 1987.984 Chuck

There's a little off topic, but you know what issue I found with that is I use my cell phone for everything. It's where I have my email, get my text messages, phone calls, and all that stuff. And so I would like to keep it on full volume late at night when I'm sleeping. So if a critical does arrive, then I can get notified.

1988.484 - 2010.338 Chuck

But my issue is that I would never get any sleep because my phone would just go off. So I need to figure out some way that I can set up for a particular phone number or something to override any kind of sleep mode or whatever that I have on my phone right now or get a different phone for that purpose. That seems a bit overkill.

2010.858 - 2027.465 Paul Zaich

You can actually do that, I believe, at least with iOS. You can set up an override where you snooze everything else and then you can set up and you have to just put it in your personal contacts, whatever numbers you think you're going to receive critical notification from. And then that'll actually ring through.

2027.485 - 2030.086 Chuck

All right. I need to quit being lazy then and just do that.

2030.546 - 2053.416 Dave Kimura

Back in 2015, I was working in the States and due to various issues, I was still responsible effectively for a bunch of servers in the UK. And I'd gone to see a film and put my phone on silent. And of course, all the servers melted halfway through Skyfall or whatever movie it was. Tom Cruise did not alert me of the impending server disaster while he was dealing with the aliens.

2053.656 - 2080.329 Dave Kimura

So I came out and everyone was very upset. So I ended up writing custom alerting with a custom app using the Android automator that when it received a text message with the magic string in it would actually like turn up volume and then play the Beatles help at full volume. And that worked. That worked very well.

2080.609 - 2094.985 Dave Kimura

But what it didn't have, which I like on the PagerDuty system, is the acknowledgement. So you can see, you know, yeah, I've sent the message. Has that person seen that message? And, you know, tapped the "yes, I am aware the service is melting" button.

2095.376 - 2113.643 Luke Stutters

Yeah, I've got, I think it's the bedtime settings in iOS. And yeah, I've just told it if it's a number in my contacts, then ring. And if it's not, then don't. So yeah, it'll go off, but it'll only go off if it's, yeah, if it's in my contacts. So yeah, then I just add whoever or whatever to my contacts and I'm set.

2113.883 - 2116.904 Chuck

Yeah, that should work well for my use case because no one ever calls me.

2116.964 - 2120.285 Dave Kimura

Yeah, just the spammers, right? That's a tragic thing to say, Dave.

2122.426 - 2132.081 Chuck

Now, I have the Verizon call filter, which actually works pretty well. It's reduced the 15, 20 phone calls I would get a day down to like one.

2132.362 - 2137.63 Luke Stutters

Yeah, the iPhone has that feature too, where you can essentially tell it, don't ring unless the number's in my contacts.

2137.89 - 2158.247 Chuck

Yeah, I got burned by that pretty bad one time. My wife was over at the pool. She had forgotten her phone or she had lost it. And so she borrowed someone's phone there. And because that random person wasn't in my contacts, I never got her phone call. My phone just stayed silent. So I had to disable that pretty quick. That'll teach you.

2158.447 - 2176.06 Dave Kimura

Can I ask you about composite monitors? Because that is a phrase I have not heard before. I'm familiar with a rate monitor. My understanding of that is if it drops really quick, it goes off. But if it drops slowly, it doesn't go off. But what is this composite monitor?

2176.261 - 2198.212 Paul Zaich

So a composite monitor is basically a combination of several different metrics that you're measuring, tying those together with AND or OR statements. So maybe referencing what I was talking about before, you might want to have a custom metric that you're looking at and you want to look at how many of those are coming through, how many events are coming through.

2198.232 - 2219.263 Paul Zaich

And then you might also want to look at, in this case, the error rate for HTTP status, maybe how many 400 errors you're getting relative to 200s. You could basically do something where you have an AND statement between those two different measures and those Boolean evaluations. Or you could do something where you have an OR.

2219.343 - 2229.79 Paul Zaich

So you could say, these are basically signaling for the same type of issue that I want to alert on, but I'm going to look for these different conditions, all in the same monitor.

2229.95 - 2254.029 Dave Kimura

So you're looking at multiple different things at once. Is that so that you could combine those to kind of set effectively a much lower threshold and get higher signal-to-noise? So you say something like, well, we'll allow this number of 404s or allow this number of server load this number of other errors. But if you get all three at the same time, then it triggers something different.

2254.629 - 2275.783 Dave Kimura

Or does it use a lower number? What's the result of that? The advantage of using that logic instead of just saying, here is the minimum number of 404s. Here is the maximum number of 404s. Here's the maximum number of errors. How does that actually translate into a better metric?

2275.923 - 2284.31 Paul Zaich

Right. So I think it gives you the ability to tune things to potentially make something have a higher fidelity when it alerts.

2284.55 - 2301.767 Paul Zaich

So you're not getting... One, you can set the thresholds actually higher and keep things... It depends how you want to use it, but you can... In this case, you could set the thresholds higher, but you could have something where it's like, well, if there aren't any errors coming through, then... Maybe we're okay with that, even though the number's a little bit lower.

2302.447 - 2319.547 Paul Zaich

Or you can do things where you can be more, and again, you can also tune this to be more sensitive. In this particular incident, if we had had some error monitoring around 400s, in addition to the threshold that we had that was pretty low, I think we would have been alerted on that within maybe an hour.
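
To make that concrete, here is an illustrative shape for such a composite, written as plain Ruby data for readability; the metric names, thresholds, and monitor IDs are invented, not Checkr's actual monitors.

```ruby
# Two ordinary Datadog-style metric monitors.
low_report_volume = {
  type:  "metric alert",
  query: "sum(last_1h):sum:reports.created{*}.as_count() < 10"
}

elevated_404_rate = {
  type:  "metric alert",
  query: "sum(last_15m):sum:api.requests{endpoint:create_report,status:404}.as_count() / " \
         "sum:api.requests{endpoint:create_report}.as_count() > 0.2"
}

# A composite monitor references the IDs of existing monitors and only
# alerts when both are triggered, so each threshold can stay sensitive
# without flooding the team with false alarms.
composite_query = "12345 && 67890" # placeholder monitor IDs
```
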

2320.585 - 2343.75 Paul Zaich

So you can do things there that give you more sensitivity without necessarily causing a lot more false alarms. And that's something that you have to just be really careful with any kind of monitor on a team. You really need to make sure that you're not creating false alarms. I'd say it's almost as important or equally important to the sensitivity of the alarm as well.

2343.83 - 2359.494 Paul Zaich

Because if you're creating false alarms all the time, people just aren't going to give them the review that they need. So if you're doing that all the time, you're probably going to miss something inevitably when there's actually a real issue.

2359.634 - 2370.124 Luke Stutters

Makes sense. All right, we're getting close to the end of our time. Are there any other stories or examples or lessons that you want to make sure somebody listening to this gets?

2370.745 - 2391.344 Paul Zaich

I just want to emphasize that this is a growing process that I think every team should go through. It's something that is going to evolve over time. And as your product becomes more important to customers and continues to grow, you need to just be constantly re-evaluating what your approach is to this.

2391.865 - 2401.62 Paul Zaich

What's going to work for a brand new product, a brand new startup, a brand new company isn't necessarily going to be what works later on. And it's something that you need to evaluate.

2401.78 - 2417.368 Paul Zaich

And as your product starts to be something that's really a critical service for your customers or for other teams at your company, you just need to continually set the bar higher and make sure that you're continuing to grow observability across the stack.

2417.648 - 2423.692 Luke Stutters

All right. Well, one more thing before we go to picks, and that is if people want to get in contact with you, how do they find you on the internet?

2423.812 - 2431.215 Paul Zaich

You're welcome to reach out to me on Twitter or GitHub, or you can reach out to me on LinkedIn as well.

2431.535 - 2437.177 Luke Stutters

Awesome. Yeah, we'll get links to those and we'll put them in the show notes. Let's go ahead and do some picks then. Dave, do you want to start us off with the picks?

2437.277 - 2463.158 Chuck

Yeah, sure. So went to the doctor the other week and they said I had high blood pressure, which I attribute to raising kids and them stressing me out. So I got this blood pressure monitor that syncs up with my iPhone. So it keeps a historical track of it. And it's been really nice. And I guess it's accurate. I don't know. It says it's high. So I guess it's doing something.

2463.819 - 2469.185 Chuck

So it is the Withings and it's a wireless rechargeable blood pressure monitor.

2469.545 - 2477.009 Dave Kimura

Cool. Luke, how about you? I just think it's really interesting. Is this something you wear all the time, Dave?

2477.269 - 2489.976 Chuck

No, it's just like the doctor's one where they put it, roll up your sleeve, put it on your arm and, you know, it starts to squeeze your arm. It's not like a wristwatch or anything. So I do it a couple of times a day. That'll raise your blood pressure.

2490.157 - 2503.853 Dave Kimura

Just kidding. Yeah, just checking it, just obsessing about it. I suppose that's good. It's not real time. Otherwise, that'd be even more stressful because you'd be sitting there and go off and say, yeah. Blood pressure's going up. Get caught in the feedback loop.

2504.133 - 2505.973 Luke Stutters

Cool. How about you, Luke? What are your picks?

2506.473 - 2532.562 Dave Kimura

I've been fighting the code this week, Chuck. I've been building strange command line interfaces in Ruby, and I've been using a little application which is installed by default on most Ubuntu-based systems called Whiptail. This is an old-school text-style interface for when you can't put a GUI on it for various reasons. So this is kind of like, it makes it look more professional.

2532.742 - 2553.543 Dave Kimura

It makes it look like a real piece of software. And using this from Ruby has been a real pain because you need to do funny things with file descriptors to get the user data out. So it turns out a very nice man by the name of Felix C. Sturgeman has written a gem to do it all for you natively. Way to go, Felix.

2553.923 - 2575.752 Dave Kimura

So, yeah, you know, all of that work I did was totally unnecessary. And you, too, can build amazing old-school ASCII-looking interfaces using the gem. It's called eft, and it's on GitHub under obfusc. And there's loads of really interesting utilities on his GitHub.

2575.792 - 2583.176 Dave Kimura

If you dig in, there's some interesting low-level stuff for when you want to kind of Ruby yourself up on the command line. So well worth a look.
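
For the file-descriptor dance Luke alludes to, a rough sketch of driving whiptail from Ruby without a wrapper gem: whiptail draws its dialog on the terminal and writes the user's answer to stderr, so only stderr is captured here (exit status 0 means OK, non-zero means Cancel or Esc).

```ruby
reader, writer = IO.pipe

# Leave stdout attached to the terminal so the dialog can render;
# redirect only stderr, where whiptail writes the user's input.
pid = Process.spawn("whiptail", "--inputbox", "What is your name?", "8", "40",
                    err: writer)
writer.close
_, status = Process.wait2(pid)
answer = reader.read
reader.close

puts status.success? ? "Hello, #{answer}!" : "Cancelled."
```
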

2583.456 - 2601.59 Luke Stutters

Awesome. All right, I'm going to throw out a couple of picks. The first one is I'm still working on this, so keep checking in. mostvaluable.dev and summit.mostvaluable.dev. I think I've mentioned it on the show before, but I'm talking to folks out there in the community. We've talked to a number of people that you've heard of that you know well, that you're excited to hear from.

2601.79 - 2621.125 Luke Stutters

But yeah, I'm going to be interviewing them and asking them what they would do if they woke up tomorrow as a mid-level developer. and felt like they didn't quite know where to go from there. So a lot of folks, that's where they kind of end up, right? They get to junior or mid-level developer and then it's, okay, I'm proficient, now what? Yeah, there are a lot of options, a lot of ways you can go.

2621.205 - 2640.817 Luke Stutters

I'm hoping to have people come talk about blogging, podcasting, speaking at conferences and all the other stuff. And then just how to stay current, how they keep up on what's going on out there. So I'm gonna pick that. I've been playing a game on my phone just when I have a minute And, you know, I want to sink a little bit of time into it. It's called Mushroom Wars 2. It's on the iPhone.

2640.957 - 2660.49 Luke Stutters

I don't know if it's on the Android phone. Yeah, liking that. And then, yeah, I'm also putting on a podcasting summit. So if you're interested in that, you can go to podcastgrowthsummit.co and we'll have all the information up there. If you listen to the Freelancers' Show, the first interview I did was with Petra Manos. She's in Australia.

2660.55 - 2677.822 Luke Stutters

So I was talking to her in the evening here in the morning there, which is always fun with all the time zone stuff. But she talked about basically how to measure your growth and then how to use Google's tools, not just to measure your growth, but then to figure out where to double down on it and get more traffic. So, um, It was awesome.

2677.862 - 2704.15 Luke Stutters

I'm talking to a bunch of other people that I've known for years and years in the podcasting space. And I'm super excited about it too. And I should probably throw out one more pick. So I'm going to throw out Gmelius. That's G-M-E-L-I-U-S. And what it is, is it's a tool. It's a CRM, but it also has like scheduling. So like ScheduleOnce or what's the other one? Calendly. It allows you to set up

2705.13 - 2729.18 Luke Stutters

a series of emails. It'll do automatic follow-up for you and stuff like that, so it just does a whole bunch of email automation, but it runs out of your email account, your Gmail account. That's the big nice thing about it: you don't get downgraded by SendGrid or something if your emails aren't landing. So that's another thing that I'm just really digging, so I'm going to shout out about that. Paul, what are your picks?

2729.853 - 2750.632 Paul Zaich

I really enjoyed something that was in the Ruby Weekly newsletter this last week. There's a Ruby one-liners cookbook. So it has a bunch of different one-liners you can actually just shell out to and make those calls. And it explains how you can do a lot of things you would do with a shell script very easily with Ruby.
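
A few examples of the kind of one-liners that cookbook covers, using standard ruby(1) flags (-e evaluates a script, -n and -p wrap it in a read loop, -a autosplits each line into $F, -i edits files in place); the file names are placeholders.

```
# grep-style: print lines matching a pattern
ruby -ne 'print if $_.match?(/ERROR/)' production.log

# sum the third whitespace-separated column of a file
ruby -ane 'BEGIN { $sum = 0 }; $sum += $F[2].to_f; END { puts $sum }' usage.txt

# replace text across a file in place, keeping a .bak backup
ruby -i.bak -pe '$_.gsub!(/staging/, "production")' config.yml
```
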

2750.993 - 2764.139 Luke Stutters

Awesome. I'll have to check that out. Sounds like a decent episode too, whether we just go through some of those and pick our favorites or whether we get whoever compiled it on. Thanks for coming, Paul. This was really helpful. And I think some folks are probably going to either encounter this and go...

2764.679 - 2780.016 Luke Stutters

Yeah, I wish we were doing that because the last time we ran into something like this, it was painful. Or some folks hopefully will be proactive and go out there and set things up so that they're watching things and communicating about the way that they handle issues and the way that they avoid them in the first place.

2780.276 - 2781.097 Paul Zaich

It's been a pleasure.

2781.438 - 2785.842 Luke Stutters

All right, we'll go ahead and wrap this up and we will be back next week. Till next time, Max out, everybody.
