What to Monitor When You’re Monitoring
Observability and production monitoring are increasingly part of an engineer’s general toolkit, and for good reason: observability tools are some of the best ways to gain confidence that your application is doing what you want it to do in production, and some of the best ways to understand what’s happening when it isn’t.
Yet despite the large amount of ink spilled on topics ranging from alerts to traces to logs, I’ve found few resources that actually spell out anything more than general principles. Of course, there’s good reason for this; what you observe is highly context-dependent: each application has its own mixture of brittle areas that need extra observation and highly complex processes that require hand-tuned monitoring.
At the same time, I’ve found myself rolling out some core monitoring with every application I’ve built. This article is a playbook for some of that core monitoring.
Observability vs monitoring?
Before diving into the playbook, it’s worth staking out my position in a debate in one corner of the observability world. There’s quite a bit of discussion about the relationship between observability and monitoring, with some claiming things like “monitoring is legacy”, “observability leaves little room for monitoring”, and more (1, 2, 3).
When I talk about monitoring here, I’m not referring to the supposedly outdated practice these authors have in mind. Instead, by monitoring I mean the suite of tools and strategies that give you proactive alerting about your application. In an overly simplified sense, monitoring tells you what is going on (“this system is down!”, “our error rates have increased!”), and observability tools tell you why it is happening, via an in-depth look into events, traces, logs, and more.
Ultimately, though, trying to break down the taxonomies between these tools isn’t particularly relevant. What is relevant is this: What’s the best way for me to be confident that my application is working in production?
The playbook
So, what is the best way? We’re going to cover four areas of monitoring: individual endpoint health and behavior, individual job/task health and behavior, infrastructure health, and domain data. Our focus is going to be on what to monitor rather than on the technical implementation of that monitoring, since it will vary widely from tool to tool.
Individual endpoint health and behavior
Monitors and alerts that operate on individual endpoints (e.g., individual controller actions in an MVC application) are some of the lowest-effort but highest-value tools in observability. Keeping the grain to an individual endpoint helps both with detecting issues in less-used parts of the application as well as with debugging problems, since they are naturally scoped to a very discrete unit in your application.
Status code monitoring
At a minimum, I recommend monitoring the returned status codes for each individual endpoint. One of the most common first signs of a problem is an influx of 5xx status codes, or 5xx responses exceeding even a small percentage of an endpoint’s overall responses. An alert like “Elevated 5xx status code rate for endpoint /api/users/signups” is exceedingly clear, and one that will show up quickly after a deploy with malfunctioning code.
Exactly which status codes you care about will be application-dependent: some applications may never expect known endpoints to return 404 or 422, while others may use both of those status codes to indicate expected errors to clients. I try to ensure that status codes are used consistently across an application, so that 422 either always indicates an unexpected problem or never does, and I don’t have to do a lot of per-endpoint configuration. In general, although it isn’t RESTful, I try to keep the set of status codes that don’t indicate a problem to a minimum: maybe just 200, 201, 301, and 302.
The mechanism you use for determining that something isn’t right will be influenced by the velocity of your application, or even of individual endpoints. If you only serve a few thousand requests a day, you might want to alert whenever there is a single unexpected error code. If you have an older application with known malfunctioning endpoints and with a large number of requests, you might need to reach for something more sophisticated, like anomaly detection.
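As a concrete illustration, here is a minimal sketch of emitting per-endpoint status code counts from a WSGI middleware. The `emit_count` helper is a hypothetical stand-in for whatever metrics client you use (StatsD, Datadog, etc.), and the alerting threshold itself would live in your monitoring tool rather than in application code.

```python
def emit_count(name, tags):
    """Hypothetical stand-in for your metrics client (StatsD, Datadog, etc.)."""
    print(name, tags)


class StatusCodeMiddleware:
    """WSGI middleware that counts responses per endpoint and status class."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}

        def capturing_start_response(status, headers, exc_info=None):
            # Status strings look like "500 Internal Server Error".
            captured["code"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        response = self.app(environ, capturing_start_response)
        code = captured.get("code", 500)
        emit_count(
            "http.responses",
            {
                # Tagging by raw path works for a sketch; in practice you'd
                # normalize to a route template to keep cardinality bounded.
                "endpoint": environ.get("PATH_INFO", "unknown"),
                "status": code,
                "status_class": f"{code // 100}xx",
            },
        )
        return response
```

With counts tagged this way, the “elevated 5xx rate” alert becomes a simple ratio of 5xx responses to total responses per endpoint in your monitoring tool.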
Latency
This one might be even more obvious: per-endpoint latency is a powerful monitor for improving your customer experience. In general, slow endpoints mean a slow user experience, so I try to monitor for any endpoint that takes more than a few hundred milliseconds to respond on average. You might need to opt out of latency monitors for certain complex endpoints, but in general I’d recommend looking for ways to decrease response time even for complex procedures, like moving slow work into asynchronous background jobs or switching a process to polling rather than holding a connection open for multiple seconds.
Understanding latency issues can be difficult, since there are naturally multiple sources of latency (at the very least, slow code and slow infrastructure). It’s important to make sure you can quickly understand which is the cause, since slow infrastructure could be a signal for more pressing problems about to surface, and slow code might require a nuanced look at the code paths to understand where the performance problem lies.
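As a rough sketch, per-endpoint latency can be recorded with a pair of request hooks. The example below assumes a Flask application and a hypothetical `emit_timing` helper standing in for your metrics client.

```python
import time

from flask import Flask, g, request

app = Flask(__name__)


def emit_timing(name, millis, tags):
    """Hypothetical stand-in for your metrics client."""
    print(name, millis, tags)


@app.before_request
def start_timer():
    g.request_start = time.perf_counter()


@app.after_request
def record_latency(response):
    start = getattr(g, "request_start", None)
    if start is not None:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Tag by route template (e.g. "/api/users/<id>") rather than the raw
        # path so that every user ID doesn't become its own metric series.
        endpoint = request.url_rule.rule if request.url_rule else "unmatched"
        emit_timing("http.request.duration_ms", elapsed_ms, {"endpoint": endpoint})
    return response
```

The alert itself (“average latency for this endpoint above a few hundred milliseconds”) is then configured in the monitoring tool, ideally with per-endpoint overrides for known-slow endpoints.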
Behavior
Behavior is a far more sophisticated and difficult monitor to get right, but it can be invaluable when it’s present. Behavior monitors look at the velocity and shape of incoming requests as well as the shape of outgoing responses. Data payloads changing unexpectedly can indicate a more subtle problem with how an application is functioning. Unfortunately, those payloads may well change simply over the course of routine development, especially in the early stages for an API that controls all of its clients, so it’s important to make sure you can filter out that noise or (at the very least) quickly identify which alerts are false positives.
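One simple way to approximate “shape” monitoring, sketched below under the assumption that your responses are JSON: fingerprint the top-level keys of each payload and count occurrences per endpoint, so a new or shifting shape shows up as a new tag value in your metrics. The helpers here are hypothetical, not part of any particular observability product.

```python
import hashlib
import json


def emit_count(name, tags):
    """Hypothetical stand-in for your metrics client."""
    print(name, tags)


def payload_shape(body: bytes) -> str:
    """Return a short, stable fingerprint of a JSON payload's top-level structure."""
    try:
        data = json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return "non-json"
    if isinstance(data, dict):
        description = ",".join(sorted(data.keys()))
    elif isinstance(data, list):
        description = "list"
    else:
        description = type(data).__name__
    return hashlib.sha1(description.encode()).hexdigest()[:8]


def record_response_shape(endpoint: str, body: bytes) -> None:
    emit_count(
        "http.response.shape",
        {"endpoint": endpoint, "shape": payload_shape(body)},
    )
```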
For a more detailed look at monitoring individual endpoints, check out Akita’s article on API-centric observability. Akita’s entire product is devoted to improving observability around individual endpoints, and they’re taking a unique approach to endpoint behavior monitoring in particular.
Job/task health and behavior
Much like monitoring for individual endpoints, monitoring individual background jobs provides a high signal-to-noise ratio on issues in applications, especially those that rely heavily on jobs for asynchronous operations. While some of the mechanics behind monitoring jobs are the same, there are some core differences between job monitoring and endpoint monitoring. For example, latency may be much less of an issue for jobs, since you might very intentionally relegate slow tasks to jobs.
Success monitoring
The most basic type of job monitoring is understanding the success rates of jobs. Ideally, jobs can be modeled in such a way that a successful job run always indicates things are going as expected and a failed job run always indicates something unexpected occurred. If your system is modeled this way, you can use the same tactics that we discussed in status code monitoring to quickly learn when jobs are failing.
Plenty of applications have different types of error states for jobs, however. For example, jobs that mediate between an application and a third party API might be expected to fail since the third party is notoriously unreliable. Where possible, I recommend using the following taxonomy to manage job success:
- Success: Report the job to your monitoring as a success.
- Failure, but with an expected error that cannot be retried or recovered from: Report the job to your monitoring as a success, but optionally record the failure in a domain-specific data store for future analysis.
- Failure, but with an expected error that can be retried: Don’t report the job to your monitoring until a retry succeeds or until all retries are exhausted, at which point report it as a failure to the monitoring.
- Unexpected failure: Report the job as a failure to your monitoring.
With this taxonomy in place, you’ll only be alerted when something unexpected occurs: either something you thought was recoverable could not be recovered from, or an invariant you rely on was violated.
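Here is a minimal sketch of that taxonomy in code. The exception classes, `report_*` helpers, and retry scheduling are all hypothetical placeholders for whatever your job framework and monitoring tool actually provide.

```python
class ExpectedUnrecoverableError(Exception):
    """A known failure that can't be retried (e.g. the third party rejected the request)."""


class ExpectedRetryableError(Exception):
    """A known failure worth retrying (e.g. the third party timed out)."""


def report_success(job):
    """Hypothetical: record a success with your monitoring tool."""


def report_failure(job):
    """Hypothetical: record a failure with your monitoring tool (this is what alerts)."""


def record_domain_failure(job, err):
    """Hypothetical: persist an expected failure to a domain-specific store for analysis."""


def schedule_retry(job, attempt):
    """Hypothetical: re-enqueue the job via your job framework's retry mechanism."""


def run_job(job, attempt, max_attempts):
    try:
        job.perform()
    except ExpectedUnrecoverableError as err:
        # Expected and unrecoverable: a success from the monitor's point of view,
        # but worth recording in a domain table for later analysis.
        record_domain_failure(job, err)
        report_success(job)
    except ExpectedRetryableError:
        if attempt >= max_attempts:
            # Retries exhausted: this is now unexpected, so alert.
            report_failure(job)
        else:
            # Stay silent until the retries resolve one way or the other.
            schedule_retry(job, attempt + 1)
    except Exception:
        # Unexpected failure: always alert.
        report_failure(job)
        raise
    else:
        report_success(job)
```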
Time in queue
It can be helpful to be alerted when an individual job has been stuck in a queue for a longer-than-expected period of time. In the past, I’ve used multiple queues to help determine when this is actually a problem: jobs in a high-priority queue should spend minimal time enqueued, while jobs in a slow batch queue might be expected to sit in the queue for hours before being processed. For that reason, I recommend using the queue (or some equivalent tag on your jobs) as the level of granularity for measuring time in queue, and alerting whenever that time exceeds a fixed threshold configured on a per-queue basis.
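As a sketch, and assuming each job carries the timestamp at which it was enqueued, time in queue can be recorded at the start of job execution and tagged by queue. The `emit_timing` helper and the thresholds below are hypothetical placeholders; the real thresholds belong in your alerting configuration.

```python
import time

# Hypothetical per-queue thresholds, in seconds.
QUEUE_LATENCY_THRESHOLDS = {
    "high_priority": 30,
    "default": 300,
    "slow_batch": 6 * 60 * 60,
}


def emit_timing(name, seconds, tags):
    """Hypothetical stand-in for your metrics client."""
    print(name, seconds, tags)


def record_time_in_queue(queue, enqueued_at):
    """Call at the start of job execution; enqueued_at is a Unix timestamp set at enqueue time."""
    waited = time.time() - enqueued_at
    emit_timing("jobs.time_in_queue_seconds", waited, {"queue": queue})
    if waited > QUEUE_LATENCY_THRESHOLDS.get(queue, 300):
        print(f"warning: job on {queue!r} waited {waited:.0f}s before starting")
```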
Queue size
This is closely related to time in queue but can detect issues more proactively, since time in queue alone may take hours to trigger an alert. If queue sizes are increasing over a period of time, you aren’t processing jobs as fast as they are coming in, which could mean you need to scale up your infrastructure or that a new performance issue in a job needs to be resolved. In some cases, it might even indicate that one or more of your job workers are stuck on some task and therefore not processing jobs correctly.
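A minimal sketch of reporting queue sizes, assuming Redis-list-backed queues with a `queue:<name>` key convention (an assumption; adjust to your job framework, which likely exposes its own queue-length API) and the redis-py client. Growth detection is then just a monitor on the trend of this gauge.

```python
import redis  # assumes the redis-py client package

QUEUES = ["high_priority", "default", "slow_batch"]  # assumed queue names


def emit_gauge(name, value, tags):
    """Hypothetical stand-in for your metrics client."""
    print(name, value, tags)


def report_queue_sizes(client):
    for queue in QUEUES:
        # Key naming is an assumption; prefer your job framework's own
        # queue-length API if it has one.
        size = client.llen(f"queue:{queue}")
        emit_gauge("jobs.queue_size", size, {"queue": queue})


if __name__ == "__main__":
    report_queue_sizes(redis.Redis(host="localhost", port=6379))
```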
Infrastructure health
For years, infrastructure health was the primary thing engineers were talking about when they referred to monitoring. It’s likely true that even today the word “monitoring” conjures up cryptic dashboards of infrastructure CPU usage, memory usage, and outgoing bytes for many engineers. Since infrastructure monitoring is widely known and discussed, we won’t dig too deeply into how to monitor your infrastructure. But here are some quick pointers to things that I’ve found useful to monitor.
Postgres
- CPU usage: If CPU usage is exceeding a normal range (or if it is creeping above 80%), you’re in danger of seeing performance degradation that could substantially affect your application.
- Storage: Both quick spikes in storage usage and gradual increases over time are important signals that you may need to either prune data or scale up your Postgres infrastructure.
- Connection count: If you aren’t using a tool like PgBouncer, an increasing number of connections may mean you’re about to hit a connection limit, and even short of the limit, a higher connection count correlates with lower performance given the work Postgres does on a per-connection basis.
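CPU usage typically comes from your hosting provider’s metrics rather than from SQL, but connection count and on-disk size can be polled from Postgres itself. A rough sketch using psycopg2, with `emit_gauge` as a hypothetical metrics helper and a placeholder DSN:

```python
import psycopg2  # assumes the psycopg2 driver


def emit_gauge(name, value):
    """Hypothetical stand-in for your metrics client."""
    print(name, value)


def report_postgres_health(dsn):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Connection count; compare against max_connections (or your
            # PgBouncer pool size) in the monitoring tool.
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            emit_gauge("postgres.connections", cur.fetchone()[0])

            # On-disk size of the current database, in bytes.
            cur.execute("SELECT pg_database_size(current_database())")
            emit_gauge("postgres.database_size_bytes", cur.fetchone()[0])


if __name__ == "__main__":
    report_postgres_health("postgresql://localhost/app_production")  # placeholder DSN
```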
Redis
- Memory usage: Especially if you use Redis to manage your job queue and have an influx of failures that include stack traces, memory usage increases can indicate your entire application is about to go down when Redis becomes unavailable.
- Latency: Redis can be surprisingly slow in some usage conditions, so it’s important to be notified quickly when the average time Redis is taking to respond to requests is increasing.
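A similar sketch for Redis, using the redis-py client: memory usage comes from the INFO command, and a timed PING gives a crude proxy for command latency. As before, `emit_gauge` is a hypothetical metrics helper.

```python
import time

import redis  # assumes the redis-py client package


def emit_gauge(name, value):
    """Hypothetical stand-in for your metrics client."""
    print(name, value)


def report_redis_health(client):
    info = client.info("memory")
    emit_gauge("redis.used_memory_bytes", info["used_memory"])

    # A timed PING is a crude but useful proxy for round-trip latency.
    start = time.perf_counter()
    client.ping()
    emit_gauge("redis.ping_ms", (time.perf_counter() - start) * 1000)


if __name__ == "__main__":
    report_redis_health(redis.Redis(host="localhost", port=6379))
```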
Domain data
We’ve saved the best — and most important — for last. When many people think of the question “What’s the best way for me to be confident that my application is working in production?”, they take a narrow, technically-focused view. They translate the question to some variation of “how do I know my system is running?” In fact, though, in order to know your application is working, you need to know that it is solving the problems it is intended to solve (or providing the functionality it is intended to provide), and the only way to know this in an automated fashion is to monitor the user behavior and data within the application.
It’s hard to give any sort of specific advice about what this looks like, since it’s almost entirely bespoke for each company. But there are a few general principles to help guide you:
Funnel progress
In a mature product-engineering organization, the product leaders likely already look at a variety of product funnels to understand conversion rates, drop-off, points of friction, and so on. These funnels and metrics are a great place to start with domain data monitoring. Setting up anomaly detection for the main conversion funnels will help you immediately tell if a recently-deployed change is subtly breaking the customer experience.
Even beyond just funnels, it can be incredibly valuable to check the performance of individual features in your application. For example, if you have an address autocomplete feature with a fallback to manual entry, putting monitoring around the percentage of customers who are manually entering their address will help you verify that your address autocomplete works and is in fact making customers’ lives easier. (Imagine a scenario in which the autocomplete is frequently returning wrong results, so even though it is “working” customers are just overriding the autocomplete anyway, causing additional friction).
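As one hedged example of what this can look like for the autocomplete scenario above: if address entries are recorded with their entry method, a periodic query can turn the manual-entry percentage into a gauge you can alert on. The `address_entries` table and its columns are hypothetical; substitute whatever your application or product analytics actually stores.

```python
import psycopg2  # assumes psycopg2 and a hypothetical address_entries(method, created_at) table


def emit_gauge(name, value):
    """Hypothetical stand-in for your metrics client."""
    print(name, value)


def report_manual_address_rate(dsn):
    """Percentage of recent address entries typed manually instead of autocompleted."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT 100.0 * count(*) FILTER (WHERE method = 'manual')
                   / NULLIF(count(*), 0)
            FROM address_entries
            WHERE created_at > now() - interval '1 hour'
            """
        )
        rate = cur.fetchone()[0]
        if rate is not None:
            emit_gauge("addresses.manual_entry_pct", float(rate))
```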
Database invariants
Especially in complex or eventually-consistent domains, applications tend to end up with a lot of assumptions about the shape of data in the database. Monitoring is a great place to encode these assumptions and verify they aren’t violated. As a simple example, consider an order in an ecommerce system. You might want to verify that “any order older than 72 hours has been shipped” to make sure that unshipped orders aren’t going undetected. Running a query that returns all unshipped orders older than 72 hours, and then putting monitoring on top of the results of that query, is an easy and nearly foolproof way to ensure this invariant holds in your system, and to get quick feedback when it doesn’t.
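A minimal sketch of that unshipped-orders check, with a hypothetical `orders(status, created_at)` schema and `emit_gauge` standing in for your metrics client; the alert fires whenever the gauge is above zero.

```python
import psycopg2  # assumes psycopg2 and a hypothetical orders(status, created_at) table


def emit_gauge(name, value):
    """Hypothetical stand-in for your metrics client."""
    print(name, value)


def check_unshipped_orders(dsn):
    """Invariant: every order older than 72 hours should have shipped."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT count(*)
            FROM orders
            WHERE status <> 'shipped'
              AND created_at < now() - interval '72 hours'
            """
        )
        emit_gauge("invariants.unshipped_orders_over_72h", cur.fetchone()[0])
```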