Skip to main content

Operational Metrics and Alerts for Distributed Software Systems

This post will be about operational metrics and alerts for distributed software systems. What do I mean by that? I mean the metrics and alerts that allow operations personel to detect failure of of a distributed software system and helps them to quickly diagnose what is wrong.


The metrics are measurements of characteristics of the system collected at regular(ish) intervals and stored somewhere for processing - rendering into graphs, triggering alert notifications, etc. Metrics can be divided into 3 categories: input metrics, output metrics, and process metrics. Input metrics are measures of the inputs to the system, for example, the number of user requests, counts of particular characteristics of the requests - where they are from, how large the request data is, counts of particular features in the request (for example, which resources/items/products are being asked for). Output metrics are measures of the output of the system. Examples of these would include orders successfully placed, counts of unsuccessful orders, and, since users often care about it the time to respond to a user request can also be considered an output metric. Good output metrics are a close proxy for dollars earned or saved by the system per minute. Process metrics are measurements of internal operation of the system. Examples of this include the standard host metrics, such as load average, free memory, disk space or inodes free, etc. Process metrics can also include application specific internal measurements, such as the number of times an API call retried before it was successful.

Sometimes the lines between metric categories are blurry. For example, counts HTTP response codes sent back to the client can belong to each of the categories. Typically, 2xx and 5xx response counts are output metrics. 4xx responses are normally input metrics, though if the request is built from data included the response of previous requests to the system, then a case can be made for including them in the output metric category. The category that 3xx responses fall into is entirely application specific.

In a large system, composed of multiple modules, components, or services, then each subcomponent can have metrics of each type. That is each subcomponent or service can have its own input metrics, output metrics, and process metrics.

Each of these categories of metrics are useful in different ways. Output metrics are best for indicating the existence of a problem and its severity. Input metrics are good for indicating whether a problem exists in the system itself, or whether an upstream system is at fault. Process metrics are best for drilling down into what is wrong once the existence of a problem has been established.

Metrics should be gathered regularly enough to indicate changes quickly, and should be predictable enough to detect problems easily. The ideal metric's graph should look like a boring flat line when things are okay, and very definitely not be a boring flat line at the point where problems have started.


Alerts are a notification that a negative unexpected situation has occurred. In practical terms, some metric has changed in a direction that indicates that bad things are happening. Traditionally alerts have been categorized based on the severity of the underlying event.

  • SEV 1 : The event is severe enough to threaten business continuity if nothing is done, e.g., through a significant loss of revenue or reputation or due to a violation of laws or regulations.
  • SEV 2 : The event has a significant business impact, e.g., there is a spike in failing orders, the order rate has dropped by 10%, customer responses are taking 10 times longer than normal or some employees are not able to do their jobs due to a failure in the system.
  • SEV 3 : The system metrics indicate that something is seriously wrong, e.g., servers are very heavily loaded or some of the requests coming in are malformed, but the business is not affected and the output of the system looks normal.
  • SEV 4 : Some unexpected but not particularly serious change has occurred in the metrics.

The typical responses to these events are:

  • SEV 1 : Page everyone. Think of the scene in the movie Leon where Stansfield asks for everyone. This is likely to require quick, co-ordinated action, PR handling, frantic debugging, and possibly approval for significant expenses. It is better to have people not be needed and there than the opposite in such situations.
  • SEV 2 : Page someone (or multiple someones) with the ability and authority to fix the issue. Have fixing the issue be their highest priority.
  • SEV 3 : Make a note in Slack or create a ticket in the ticketing system. The issue should be worked on in the near future, ideally before the end of the next sprint.
  • SEV 4 : Unless the team is very proactive, don't bother creating these alerts. For very proactive teams, a notification on Slack, or a backlog item to investigate the error may be appropriate. Even for proactive teams, digging into the root cause of such items is often not the best use of the team's time. Creating process metrics on the number and frequency of such events is probably more appropriate. Getting lots more of these weird events, a lot more often, could then be categorized as a SEV 3 event.

Putting them together

You should identify at least one output metric for the overall system that is providing a service to customers. Ideally, that metric is a close proxy for dollars earned per minute earned or saved. Examples, ads served per minute, page impressions per minute, bytes streamed per minute, successful uploads of customer pictures per hour, etc. It is also good to include latency on requests to the end customer as an output metric.

For aggregate metrics such as sum or average of some values, e.g., the average latency on customer requests, it is good to generate a few more aggregates. Always include the count of the number of inputs int the aggregate. Consider also including quantiles (p0, p25, p50, p75, p90, p99, and p100 are useful). The modal number and median number are also helpful sometimes. If the input values are normally distributed then standard deviation should be included.

Pageable alerts, i.e., for SEV 1 and SEV 2 events, should be on output metrics that meet the following criteria:

  1. The metric is clean, i.e., the signal in the metric is not swamped by random noise. If a suitable metric is noisy, it might be less noisy if averaged over a longer time period. Rolling averages can work well.
  2. There should be a significant negative change to the metric. It should either be too large to explain as noise or too long in duration to be caused by noise.
  3. The problem should require human intervention to fix. There is no point in paging someone for a transient blip. It is better to let them sleep.

Other things worth paging people for are process metrics that correlate very strongly with a system failure in the near future. As an example, if your system uses MySQL, then a sustained and increasing history list length metric value, is almost certainly going to result in system failure in a few hours after it starts. However, the correlation needs to be very strong to avoid alert fatigue. If it is a 50/50 chance, then it is better to let the on-call engineer sleep until the system actually fails, in most SEV 2 cases.

Related to this, if host metrics going into alarm (load average, CPU usage, disk space or memory free, etc., on a particular host) are good predictor of system failure, then this is an indicator of architectural weakness. Instead of setting up a pageable alert, fix the redundancy and failover architecture instead.


Popular posts from this blog

Repost: ANTLR Trinity

This post is a repost of an article I had on a previous incarnation of this blog. I hadn't intended to transfer it over, as the technology is old now (ANTLR is on version 4), but I recently came acros a slide deck online, where the post was referenced, so I am reposting in case anyone was looking for it. There are 3 components to a really useful software development technology: innovative features, clear and comprehensive documentation, and solid tools. The recent release of ANTLR v3.0 is a perfect example of this. This parser generator tool has all 3 components and each component is done superbly. ANTLR is a parser generator tool that is capable of targeting multiple output languages. Out of the box it will generate Java, Python, C, C#, or Ruby code for parsers. Other target languages are possible if the code generators are written. Amongst its cool features are: LL(*) parsing: This is an extension to the normal, top down with looka

First Post

Hello and welcome to the inane ramblings of an Irish software developer. The title of the blog comes from Lewis Carroll's, Through the Looking Glass . In the book, Alice goes running with the Red Queen, but they don't seem to make any progress. Alice remarks on this, saying, "Well in our country, you'd generally get to somewhere else - if you ran very fast for a long time as we've been doing." The Red Queen replies, "A slow sort of country. Now, here, you see, it takes all the running you can do, to stay in the same place." The Red Queen Effect is quite applicable to the software industry, and as I probably will be talking quite a bit about the software industry, I thought it would be a good name for a blog. I have a few objectives for my new blog. By writing here, I hope to learn how to write well. That is, I hope to learn how to write clearly and concisely, and be interesting at the same time. I also hope that this blog will become a good prof

Useful Links: Logs and Metrics

This article explains why storing log messages alone is insufficient for robust operation of a software service. Metrics also need to be gathered and stored.…/logs-and-metrics-6d34d3026e38 tl;dr - Log volume can spike dramatically when user activity increases, especially when things go wrong. This makes it possible for an alerting system based on logs to be swamped. For a metrics system, volume increases with the number of metrics collected. This is stable and much less likely to fail or slow down during a crisis.