This post is about operational metrics and alerts for distributed software systems. What do I mean by that? I mean the metrics and alerts that allow operations personnel to detect the failure of a distributed software system and help them quickly diagnose what is wrong.
Metrics are measurements of characteristics of the system, collected at regular(ish) intervals and stored somewhere for processing: rendering into graphs, triggering alert notifications, etc. Metrics can be divided into 3 categories: input metrics, output metrics, and process metrics.
Input metrics are measures of the inputs to the system. Examples include the number of user requests and counts of particular characteristics of the requests: where they are from, how large the request data is, and which features appear in the request (for example, which resources/items/products are being asked for).
Output metrics are measures of the output of the system. Examples include orders successfully placed and counts of unsuccessful orders. Since users often care about it, the time taken to respond to a user request can also be considered an output metric. Good output metrics are a close proxy for dollars earned or saved by the system per minute.
Process metrics are measurements of the internal operation of the system. Examples include the standard host metrics, such as load average, free memory, and disk space or inodes free. Process metrics can also include application-specific internal measurements, such as the number of times an API call was retried before it succeeded.
Sometimes the lines between metric categories are blurry. For example, counts of the HTTP response codes sent back to the client can belong to any of the categories. Typically, 2xx and 5xx response counts are output metrics. 4xx responses are normally input metrics, though if the request is built from data included in the response to previous requests to the system, then a case can be made for including them in the output metric category. The category that 3xx responses fall into is entirely application-specific.
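The rules of thumb above for HTTP response codes can be written down as a small function. This is only a sketch: the names `categorize_status` and `count_by_category`, and the `requests_built_from_responses` flag, are made up for illustration, and the 3xx case is deliberately left undecided because it depends on the application.

```python
from collections import Counter


def categorize_status(status: int, requests_built_from_responses: bool = False) -> str:
    """Map an HTTP status code to a metric category using the rules of thumb above."""
    if 200 <= status < 300 or 500 <= status < 600:
        # Successes and server-side failures describe what the system produced.
        return "output"
    if 400 <= status < 500:
        # Client errors usually describe bad input, unless clients build their
        # requests from data the system itself returned earlier.
        return "output" if requests_built_from_responses else "input"
    if 300 <= status < 400:
        # Redirects have no general answer; decide per application.
        return "application-specific"
    return "unknown"


def count_by_category(statuses, requests_built_from_responses: bool = False) -> Counter:
    """Count a batch of status codes per metric category."""
    return Counter(
        categorize_status(s, requests_built_from_responses) for s in statuses
    )
```

For example, `count_by_category([200, 200, 404, 503, 301])` puts three responses in the output category, one in the input category, and leaves one as application-specific.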
In a large system composed of multiple modules, components, or services, each subcomponent can have metrics of each type. That is, each subcomponent or service can have its own input metrics, output metrics, and process metrics.
Each of these categories of metrics is useful in a different way. Output metrics are best for indicating the existence of a problem and its severity. Input metrics are good for indicating whether a problem exists in the system itself, or whether an upstream system is at fault. Process metrics are best for drilling down into what is wrong once the existence of a problem has been established.
Metrics should be gathered frequently enough to show changes quickly, and should be predictable enough that problems are easy to spot. The ideal metric's graph should look like a boring flat line when things are okay, and very definitely not be a boring flat line at the point where problems have started.
An alert is a notification that an unexpected negative situation has occurred. In practical terms, some metric has changed in a direction that indicates that bad things are happening. Traditionally, alerts have been categorized based on the severity of the underlying event:
- SEV 1 : The event is severe enough to threaten business continuity if nothing is done, e.g., through a significant loss of revenue or reputation or due to a violation of laws or regulations.
- SEV 2 : The event has a significant business impact, e.g., there is a spike in failing orders, the order rate has dropped by 10%, customer responses are taking 10 times longer than normal, or some employees are not able to do their jobs due to a failure in the system.
- SEV 3 : The system metrics indicate that something is seriously wrong, e.g., servers are very heavily loaded or some of the requests coming in are malformed, but the business is not affected and the output of the system looks normal.
- SEV 4 : Some unexpected but not particularly serious change has occurred in the metrics.
The typical responses to these events are:
- SEV 1 : Page everyone. Think of the scene in the movie Leon where Stansfield asks for everyone. This is likely to require quick, co-ordinated action, PR handling, frantic debugging, and possibly approval for significant expenses. In such situations, it is better to have people there and not needed than needed and not there.
- SEV 2 : Page someone (or multiple someones) with the ability and authority to fix the issue. Have fixing the issue be their highest priority.
- SEV 3 : Make a note in Slack or create a ticket in the ticketing system. The issue should be worked on in the near future, ideally before the end of the next sprint.
- SEV 4 : Unless the team is very proactive, don't bother creating these alerts. For very proactive teams, a notification on Slack, or a backlog item to investigate the error may be appropriate. Even for proactive teams, digging into the root cause of such items is often not the best use of the team's time. Creating process metrics on the number and frequency of such events is probably more appropriate. Getting lots more of these weird events, a lot more often, could then be categorized as a SEV 3 event.
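The severity levels and their typical responses can be captured in a small routing table. The sketch below makes several assumptions: the action names are placeholders rather than real integrations, and `escalate_at` is a hypothetical threshold for the point at which a flood of SEV 4 events is treated as a SEV 3.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Placeholder action names; wire these to your own paging and ticketing tools.
RESPONSES = {
    Severity.SEV1: "page_everyone",
    Severity.SEV2: "page_on_call",
    Severity.SEV3: "create_ticket",
    Severity.SEV4: "count_only",
}


def is_pageable(severity: Severity) -> bool:
    """Only SEV 1 and SEV 2 events should wake someone up."""
    return severity <= Severity.SEV2


def effective_severity(severity: Severity, recent_count: int, escalate_at: int) -> Severity:
    """Promote SEV 4 to SEV 3 once such events become frequent enough."""
    if severity == Severity.SEV4 and recent_count >= escalate_at:
        return Severity.SEV3
    return severity


def route(severity: Severity, recent_count: int = 0, escalate_at: int = 50) -> str:
    """Return the response for an event, given how many similar events were seen recently."""
    return RESPONSES[effective_severity(severity, recent_count, escalate_at)]
```

With these defaults, `route(Severity.SEV4)` only counts the event, while `route(Severity.SEV4, recent_count=80)` creates a ticket.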
Putting them together
You should identify at least one output metric for the overall system that is providing a service to customers. Ideally, that metric is a close proxy for dollars earned or saved per minute. Examples include ads served per minute, page impressions per minute, bytes streamed per minute, and successful uploads of customer pictures per hour. It is also good to include the latency of requests from the end customer as an output metric.
For aggregate metrics such as the sum or average of some values, e.g., the average latency on customer requests, it is good to generate a few more aggregates. Always include the count of the number of inputs in the aggregate. Consider also including quantiles (p0, p25, p50, p75, p90, p99, and p100 are useful). The mode and the median (which is the same as p50) are also helpful sometimes. If the input values are normally distributed, then the standard deviation should be included.
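As a rough illustration of the aggregates listed above, here is a sketch that summarizes one collection interval's worth of values using only Python's standard library. The function name `summarize` and the output keys are my own choices, and I have assumed the population standard deviation and the "inclusive" quantile method; a real metrics pipeline may well use different estimators.

```python
import statistics


def summarize(values):
    """Compute count, sum, mean, selected quantiles, mode, and standard deviation."""
    if not values:
        raise ValueError("cannot summarize an empty interval")
    ordered = sorted(values)
    if len(ordered) == 1:
        # Older Python versions reject single-point input to quantiles().
        cuts = [ordered[0]] * 99
    else:
        # n=100 yields 99 cut points, so cuts[k - 1] is the k-th percentile.
        cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "count": len(ordered),
        "sum": sum(ordered),
        "mean": statistics.fmean(ordered),
        "p0": ordered[0],
        "p25": cuts[24],
        "p50": cuts[49],
        "p75": cuts[74],
        "p90": cuts[89],
        "p99": cuts[98],
        "p100": ordered[-1],
        "mode": statistics.mode(ordered),
        # Only meaningful if the values are roughly normally distributed.
        "stdev": statistics.pstdev(ordered),
    }
```

Emitting the count alongside the average matters because an average over 3 requests and an average over 30,000 requests deserve very different levels of trust.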
Pageable alerts, i.e., for SEV 1 and SEV 2 events, should be on output metrics that meet the following criteria:
- The metric is clean, i.e., the signal in the metric is not swamped by random noise. If a suitable metric is noisy, it might be less noisy if averaged over a longer time period. Rolling averages can work well.
- There should be a significant negative change to the metric. The change should either be too large to explain as noise or last too long to be caused by noise.
- The problem should require human intervention to fix. There is no point in paging someone for a transient blip. It is better to let them sleep.
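The criteria above can be sketched as a rolling average combined with a "sustained breach" rule. Everything here is an assumption made for illustration: the class name `PageableAlert`, the fixed `baseline`, and the parameters `max_drop_fraction`, `window`, and `sustain` are hypothetical and would need tuning against real data.

```python
from collections import deque


class PageableAlert:
    """Page only when the rolling average of an output metric stays well below baseline."""

    def __init__(self, baseline: float, max_drop_fraction: float, window: int, sustain: int):
        self.threshold = baseline * (1 - max_drop_fraction)
        self.sustain = sustain                # consecutive breaches required to page
        self._window = deque(maxlen=window)   # most recent samples
        self._breaches = 0

    def observe(self, value: float) -> bool:
        """Record one sample and return True if someone should be paged."""
        self._window.append(value)
        rolling = sum(self._window) / len(self._window)
        if rolling < self.threshold:
            self._breaches += 1
        else:
            # A recovery resets the count, so transient blips never page.
            self._breaches = 0
        return self._breaches >= self.sustain
```

With `PageableAlert(baseline=100, max_drop_fraction=0.1, window=3, sustain=2)`, a single sample of 80 among samples of 100 does not page, because the rolling average stays above 90. A run of samples at 60 pages on the second one, once the rolling average has been below 90 for two consecutive observations.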
Other things worth paging people for are process metrics that correlate very strongly with a system failure in the near future. As an example, if your system uses MySQL, then a sustained and increasing history list length is almost certainly going to result in system failure within a few hours. However, the correlation needs to be very strong to avoid alert fatigue. If it is a 50/50 chance, then in most SEV 2 cases it is better to let the on-call engineer sleep until the system actually fails.
Related to this, if host metrics going into alarm (load average, CPU usage, disk space or memory free, etc., on a particular host) are a good predictor of system failure, then this is an indicator of architectural weakness. Rather than setting up a pageable alert, fix the redundancy and failover architecture.
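One way to express "sustained and increasing" is to require that every one of the last few samples is higher than the one before it, and that the total growth is too large to be noise. The sketch below assumes the metric samples (for example, history list length readings) have already been collected by some other means; the function name and the thresholds `min_samples` and `min_growth` are hypothetical.

```python
def is_sustained_increase(samples, min_samples: int = 6, min_growth: float = 10_000) -> bool:
    """Return True if the most recent samples rise steadily and by a large total amount."""
    if len(samples) < min_samples:
        # Not enough history to call the increase sustained.
        return False
    recent = samples[-min_samples:]
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return rising and (recent[-1] - recent[0]) >= min_growth
```

Requiring strictly rising samples is a deliberately strict choice that keeps false pages rare; a noisier metric may need a smoothed series or a fitted slope instead.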