Sunday, July 23, 2017

Operational Metrics and Alerts for Distributed Software Systems

This post is about operational metrics and alerts for distributed software systems. What do I mean by that? I mean the metrics and alerts that allow operations personnel to detect the failure of a distributed software system and help them quickly diagnose what is wrong.


Metrics are measurements of characteristics of the system, collected at regular(ish) intervals and stored somewhere for processing: rendering into graphs, triggering alert notifications, etc. Metrics can be divided into three categories: input metrics, output metrics, and process metrics. Input metrics are measures of the inputs to the system, for example, the number of user requests and counts of particular characteristics of the requests, such as where they are from, how large the request data is, and which features appear in the request (for example, which resources/items/products are being asked for). Output metrics are measures of the output of the system. Examples include orders successfully placed and counts of unsuccessful orders. Since users often care about it, the time to respond to a user request can also be considered an output metric. Good output metrics are a close proxy for dollars earned or saved by the system per minute. Process metrics are measurements of the internal operation of the system. Examples include the standard host metrics, such as load average, free memory, and disk space or inodes free. Process metrics can also include application-specific internal measurements, such as the number of times an API call was retried before it succeeded.

Sometimes the lines between metric categories are blurry. For example, counts of HTTP response codes sent back to the client can belong to any of the categories. Typically, 2xx and 5xx response counts are output metrics. 4xx responses are normally input metrics, though if the request is built from data included in the response to previous requests to the system, then a case can be made for including them in the output metric category. The category that 3xx responses fall into is entirely application specific.
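
As a rough illustration of the default classification described above, here is a minimal Python sketch. The function names and category labels are my own for illustration, and the mapping is only the typical case; a real system would adjust it, especially for 3xx and 4xx responses.

```python
from collections import Counter


def metric_category(status: int) -> str:
    """Map an HTTP status code to a metric category (typical defaults only)."""
    if 200 <= status < 300 or 500 <= status < 600:
        return "output"  # successes and server-side failures
    if 400 <= status < 500:
        return "input"  # usually caused by what the client sent
    if 300 <= status < 400:
        return "application-specific"  # depends on how redirects are used
    return "unknown"


def count_by_category(statuses):
    """Count responses per metric category for one collection interval."""
    return Counter(metric_category(s) for s in statuses)


counts = count_by_category([200, 201, 404, 500, 302, 200])
```

Keeping the mapping in one small function makes it easy to move 4xx responses into the output category if requests turn out to be built from earlier responses.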

In a large system composed of multiple modules, components, or services, each subcomponent can have metrics of each type. That is, each subcomponent or service can have its own input metrics, output metrics, and process metrics.

Each of these categories of metrics is useful in a different way. Output metrics are best for indicating the existence of a problem and its severity. Input metrics are good for indicating whether a problem exists in the system itself, or whether an upstream system is at fault. Process metrics are best for drilling down into what is wrong once the existence of a problem has been established.

Metrics should be gathered regularly enough to indicate changes quickly, and should be predictable enough to detect problems easily. The ideal metric's graph should look like a boring flat line when things are okay, and very definitely not be a boring flat line at the point where problems have started.


An alert is a notification that an unexpected negative situation has occurred. In practical terms, some metric has changed in a direction that indicates that bad things are happening. Traditionally, alerts have been categorized based on the severity of the underlying event.

  • SEV 1 : The event is severe enough to threaten business continuity if nothing is done, e.g., through a significant loss of revenue or reputation or due to a violation of laws or regulations.
  • SEV 2 : The event has a significant business impact, e.g., there is a spike in failing orders, the order rate has dropped by 10%, customer responses are taking 10 times longer than normal, or some employees are not able to do their jobs due to a failure in the system.
  • SEV 3 : The system metrics indicate that something is seriously wrong, e.g., servers are very heavily loaded or some of the requests coming in are malformed, but the business is not affected and the output of the system looks normal.
  • SEV 4 : Some unexpected but not particularly serious change has occurred in the metrics.

The typical responses to these events are:

  • SEV 1 : Page everyone. Think of the scene in the movie Leon where Stansfield asks for everyone. This is likely to require quick, co-ordinated action, PR handling, frantic debugging, and possibly approval for significant expenses. In such situations, it is better to have people there and not needed than the opposite.
  • SEV 2 : Page someone (or multiple someones) with the ability and authority to fix the issue. Have fixing the issue be their highest priority.
  • SEV 3 : Make a note in Slack or create a ticket in the ticketing system. The issue should be worked on in the near future, ideally before the end of the next sprint.
  • SEV 4 : Unless the team is very proactive, don't bother creating these alerts. For very proactive teams, a notification on Slack, or a backlog item to investigate the error may be appropriate. Even for proactive teams, digging into the root cause of such items is often not the best use of the team's time. Creating process metrics on the number and frequency of such events is probably more appropriate. Getting lots more of these weird events, a lot more often, could then be categorized as a SEV 3 event.
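
The severity levels and their typical responses can be sketched as a small lookup. This is only an illustration in Python: the three boolean inputs and the response strings are simplifications I have assumed, not a real alerting API.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Typical responses, paraphrased from the list above.
RESPONSES = {
    Severity.SEV1: "page everyone",
    Severity.SEV2: "page someone with the ability and authority to fix it",
    Severity.SEV3: "post in chat or create a ticket",
    Severity.SEV4: "count the event; do not alert",
}


def classify(threatens_continuity: bool,
             significant_business_impact: bool,
             metrics_seriously_wrong: bool) -> Severity:
    """Pick the most severe level whose condition holds (hypothetical inputs)."""
    if threatens_continuity:
        return Severity.SEV1
    if significant_business_impact:
        return Severity.SEV2
    if metrics_seriously_wrong:
        return Severity.SEV3
    return Severity.SEV4


def respond(severity: Severity) -> str:
    """Look up the typical response for a severity level."""
    return RESPONSES[severity]
```

For example, respond(classify(False, True, True)) gives the SEV 2 response, because business impact outranks odd-looking system metrics.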

Putting them together

You should identify at least one output metric for the overall system that is providing a service to customers. Ideally, that metric is a close proxy for dollars earned or saved per minute. Examples include ads served per minute, page impressions per minute, bytes streamed per minute, and successful uploads of customer pictures per hour. It is also good to include latency on requests to the end customer as an output metric.

For aggregate metrics such as the sum or average of some values, e.g., the average latency on customer requests, it is good to generate a few more aggregates. Always include the count of inputs in the aggregate. Consider also including quantiles (p0, p25, p50, p75, p90, p99, and p100 are useful). The mode and median are also helpful sometimes. If the input values are normally distributed, then the standard deviation should be included.
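
A minimal Python sketch of such a summary follows. The function name and the output keys are my own, and the quantiles use the simple nearest-rank method, which is one of several reasonable definitions; a metrics library may interpolate instead.

```python
import statistics


def summarize(values):
    """Summarize one interval's samples: count, sum, mean, stdev, quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return {"count": 0}

    def percentile(p: int):
        # Nearest-rank method using integer arithmetic: rank = ceil(p * n / 100).
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    summary = {
        "count": n,
        "sum": sum(ordered),
        "mean": statistics.fmean(ordered),
        # Population standard deviation; only meaningful if roughly normal.
        "stdev": statistics.pstdev(ordered),
    }
    for p in (0, 25, 50, 75, 90, 99, 100):
        summary[f"p{p}"] = percentile(p)
    return summary


latencies_ms = [10, 20, 30, 40]
result = summarize(latencies_ms)
```

Emitting the count alongside the average matters because an average over three requests and an average over three thousand deserve very different levels of trust.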

Pageable alerts, i.e., for SEV 1 and SEV 2 events, should be on output metrics that meet the following criteria:

  1. The metric is clean, i.e., the signal in the metric is not swamped by random noise. If a suitable metric is noisy, it might be less noisy if averaged over a longer time period. Rolling averages can work well.
  2. There should be a significant negative change to the metric. It should either be too large to explain as noise or too long in duration to be caused by noise.
  3. The problem should require human intervention to fix. There is no point in paging someone for a transient blip. It is better to let them sleep.
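
The first two criteria can be sketched in code. The following Python function is an illustration under assumed parameters (window size, a 10% drop threshold, and a required number of consecutive breaches are all hypothetical values to tune); the third criterion, whether a human is needed, is a judgment that code like this cannot make.

```python
from collections import deque


def should_page(samples, baseline, window=5, drop_fraction=0.10, sustain=3):
    """Return True when the rolling average stays more than drop_fraction
    below the baseline for `sustain` consecutive full windows."""
    recent = deque(maxlen=window)
    breaches = 0
    for value in samples:
        recent.append(value)
        if len(recent) < window:
            continue  # not enough data for a full rolling average yet
        rolling = sum(recent) / window
        if rolling < baseline * (1 - drop_fraction):
            breaches += 1
            if breaches >= sustain:
                return True
        else:
            breaches = 0  # the dip recovered, so treat it as a transient blip
    return False


steady = [100] * 20
blip = [100] * 10 + [60] + [100] * 10
outage = [100] * 10 + [50] * 10
```

With these inputs, the steady series and the single-sample blip do not page, while the sustained drop does. The rolling average smooths noise and the consecutive-breach count filters short dips.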

Other things worth paging people for are process metrics that correlate very strongly with a system failure in the near future. As an example, if your system uses MySQL, then a sustained and increasing history list length metric value is almost certainly going to result in system failure within a few hours of when it starts. However, the correlation needs to be very strong to avoid alert fatigue. If it is a 50/50 chance, then it is better to let the on-call engineer sleep until the system actually fails, in most SEV 2 cases.
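
As a sketch of what "sustained and increasing" might mean in practice, the Python function below checks whether the most recent samples of a metric rise strictly from one sample to the next. The function name and the default of six samples are assumptions for illustration, and collecting the metric from the database is not shown.

```python
def is_sustained_increase(samples, min_points=6):
    """True if the last `min_points` samples are strictly increasing."""
    tail = list(samples)[-min_points:]
    if len(tail) < min_points:
        return False  # not enough history to call it sustained
    return all(earlier < later for earlier, later in zip(tail, tail[1:]))


growing = [1, 2, 3, 4, 5, 6, 7]
sawtooth = [1, 2, 3, 2, 3, 4, 5]
```

A strict check like this is deliberately conservative: a single dip resets it, which errs on the side of letting the on-call engineer sleep.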

Related to this, if host metrics going into alarm (load average, CPU usage, disk space or memory free, etc., on a particular host) are a good predictor of system failure, then this indicates an architectural weakness. Instead of setting up a pageable alert, fix the redundancy and failover architecture.

Saturday, March 28, 2015

With Great Power Comes Great Responsibility

Forth is used to implement the bootloader for SPARC-based machines. One feature of the SPARC-based machines made by Sun Microsystems was the ability to drop back to the bootloader's Forth interpreter by pressing the Stop-A key combination at the console. This suspended the operating system and gave the user an ok prompt to work at. Typically this was used to kick off a kernel debugger or to kick errant SCSI hardware back into line. In effect, the OpenBoot PROM (OBP), as the Forth-based bootloader was branded, was a very lightweight hypervisor.

A consequence of this was that while working at the ok prompt, the user wasn't subject to the privilege system of Solaris. People at the console could use this to gain root privileges. The method worked as follows:

  1. Find the address in memory where the proc structure of a shell that the user has open, i.e., where the shell's process resides in memory.

  2. Press Stop-A to drop to OBP.

  3. Write 0 to the cr_uid field of the process's cred structure. The location of this in memory is easily found from the process address.

  4. Type go to return to Solaris where there is a shell where the user now has an effective user id of 0, i.e., root privileges.

Full details can be found at Brendan Gregg's website. The option to ps that gave easy access to processes' addresses has since been removed to make this more difficult, but the address would still be easy to find with a debugger, for example.

There are a few things to be learned from this:

  • With great power comes great responsibility.

  • A hypervisor can completely bypass the security controls of its guest operating systems.

  • If an attacker has access to a machine physically or via a hypervisor, it is a matter of "when" and not "if" they gain control.

Sunday, January 25, 2015

Learning Forth

One of my side projects for this year is to learn the programming language Forth. Some people might consider this an odd language to learn. It is not a popular language. There are no hot startups using it (that I know of). It doesn't even show up in the top 100 languages in the TIOBE Index. However, I am convinced learning it is worthwhile. Some of my reasons for this are:

  • Forth is probably the most successful and widely deployed language that nobody has heard of. It is the language used to develop OpenFirmware. This boot loader is installed on the laptops of the One Laptop Per Child Project, on PowerPC-based Apple Mac computers, and on SPARC-based computers from Sun Microsystems. It has also been used to develop control software for the National Radio Astronomy Observatory, which is where it was developed.

    While not as widely used as C/C++, Forth is used a lot in embedded applications and has been ported to most micro-controllers. For example, the Forth, Inc. website has downloadable examples for Arduino and the TI LaunchPad development board. The website also lists a number of interesting applications built with Forth.

  • Forth is a concatenative, stack-based language. This makes it very different from most mainstream languages, which are based on the object-oriented (e.g., Java), imperative (e.g., C), or functional (e.g., Haskell) paradigms, or hybrid versions of these (e.g., Scala or Ruby). Learning this new paradigm opens up new approaches to solving programming problems and provides a new perspective on the art of programming. The stack programming paradigm is used in the JVM byte code interpreter and in the PostScript interpreter, so getting to grips with this programming model is helpful for understanding the low-level details of these widely used technologies. Due to its underlying philosophies, Forth is the most pared down and open of the concatenative languages.

  • The history of the language is interesting. For example, one of the first Forth primers was written by W. Richard Stevens.

  • Forth is an excellent language for interacting directly with hardware and exploring the features of hardware. For too long in my career I have been able to get away without knowing much about the underlying hardware that my code runs on. With the rise of the Internet of Things this is a handicap. Understanding of hardware and how to code on it efficiently will become more important. The hardware to software interface is becoming more fluid and that is where Forth lives, so it is ideal for exploring the trade-offs.

  • The primary reason I want to learn Forth is that it challenges conventional programming wisdom. Conventional wisdom says hardware can be abstracted away completely behind multiple layers of abstraction. With Forth it is one layer away. This does mean you can cause damage, like accidentally frying the rx/tx GPIO pins on your Raspberry Pi, to pick a totally random example. However, it also allows for very small and efficient code. Conventional wisdom says you should always use libraries and not reinvent the wheel. The philosophy of Forth says that you are not going to need most of the library and it probably won't meet all the requirements of your application anyway, so writing your own version should be considered. Additionally, how well do you know how the library works and what its tradeoffs are if you haven't tried to implement it? It's these little heresies that point out how much of programming wisdom is taken for granted.
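
To give a feel for the stack-based model mentioned in the list above, here is a toy evaluator written in Python rather than Forth. It is only a sketch of the idea: it supports a handful of words that I picked for illustration, and it is not how a real Forth system is implemented.

```python
def evaluate(source, stack=None):
    """Evaluate whitespace-separated tokens against a data stack.

    Numbers are pushed; known words pop their arguments from the stack
    and push their results back, as in postfix (reverse Polish) notation.
    """
    stack = [] if stack is None else stack
    words = {
        "+": lambda s: s.append(s.pop() + s.pop()),
        "*": lambda s: s.append(s.pop() * s.pop()),
        "dup": lambda s: s.append(s[-1]),               # ( a -- a a )
        "swap": lambda s: s.extend([s.pop(), s.pop()]),  # ( a b -- b a )
        "drop": lambda s: s.pop(),                      # ( a -- )
    }
    for token in source.split():
        if token in words:
            words[token](stack)
        else:
            stack.append(int(token))
    return stack


result = evaluate("2 3 + 4 *")
```

Here "2 3 + 4 *" leaves 20 on the stack: 2 and 3 are pushed, "+" replaces them with 5, 4 is pushed, and "*" replaces 5 and 4 with 20. Programs are built by writing words one after another, each one transforming the stack left by the previous one, which is where the term concatenative comes from.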

I am using the following resources to learn Forth:

  • Starting Forth, by Leo Brodie. This book is unfortunately out of print, but can be found online here. This is a beginner's introductory book, but, looking at the table of contents, it seems to sneak in some advanced topics, like metaprogramming, near the end.

  • Thinking Forth, again by Leo Brodie. I have already read this and it is the best book I have ever read on how to decompose a programming problem and how to structure the solution code. I'll be reading it again after I write a significant amount of Forth code.

  • GForth: this is the main Forth implementation I'll be using.

  • Pi Jones Forth: this is a very bare bones Forth implementation that runs, bare metal, on a Raspberry Pi.

Sunday, November 23, 2014

TIL: ARM Has Java Bytecode Execution in Hardware

I recently purchased a Raspberry Pi. While poking around in /proc I discovered that java is one of the features of the ARM processor in the Pi. It turns out that some ARM models have Java bytecode instructions implemented in hardware.
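
A quick way to check for this is to look at the Features line in /proc/cpuinfo. The Python sketch below parses that line from a block of text. The sample text is illustrative, written from memory of what an ARMv6 board typically reports, and is not copied from my Pi; the exact fields vary by kernel and processor.

```python
def cpu_features(cpuinfo_text):
    """Return the set of flags listed on any 'Features' line of cpuinfo text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


# Illustrative sample only; real output differs between machines.
SAMPLE = (
    "Processor\t: ARMv6-compatible processor rev 7 (v6l)\n"
    "Features\t: swp half thumb fastmult vfp edsp java tls\n"
)

has_java = "java" in cpu_features(SAMPLE)
```

On a Linux machine, passing the contents of /proc/cpuinfo (for example, open("/proc/cpuinfo").read()) to cpu_features gives the real list for that processor.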

Wednesday, November 19, 2014

On Writing Well

I like doing things that have a body of theory behind them. For example, I prefer taijiquan to kickboxing, as a martial art, as it has deeper theory. So when I started blogging, I went looking for its theoretical foundations. I found them in the principles of good nonfiction writing. I bought two books: The Elements of Style, by William Strunk Jr. and E. B. White, and On Writing Well, by William Zinsser.

On Writing Well is unusual for a writing guide. Firstly, it is a good read. It is actually hard to put down. Advice on writing is given clearly and simply. That is part of it. However, Mr. Zinsser illustrates his points with personal anecdotes, and this is what makes the book so interesting. In the chapter entitled A Writer's Decisions, for example, he uses an account of a trip he took to Timbuktu. He walks us through the article he wrote, paragraph by paragraph, explaining what he wrote and what he was thinking at the time. Between the travel piece and its explanation, you get an idea of the author's personality. He's interesting. It makes his book interesting.

He is passionate about the craft of writing, and that comes through in a few humorous digs at bad writing. For example:

He or she may think "sanguine" and "sanguinary" mean the same thing, but the difference is a bloody big one.

Humour livens up the advice too:

Don't get caught holding a bag full of abstract nouns. You'll sink to the bottom of the lake and never be seen again.

Secondly, the book covers more than just grammar and rules for composition. It covers the whole craft of writing nonfiction. There is a section on forms of nonfiction writing, such as travel writing, sports writing, biographies, and business writing. There are a few paragraphs on the relationship between an author and an editor. This is useful information for a professional, and interesting for an amateur blogger like myself. There is a chapter on interviewing people too. It explains how to conduct an interview, how to quote people, and the ethical responsibility that a writer has to be faithful when using a quotation. The author also explains why you would want to quote someone in the first place. He uses quotations effectively himself, and these make his point very clear. The author includes a story about an article he wrote about Mount Rushmore. Instead of describing the place himself, he interviewed the people that worked there. I cannot think of a more evocative way of describing Mount Rushmore than one of the quotations he got:

"In the afternoon when the sunlight throws shadows into that socket," one of the rangers, Fred Banks, said, "you feel that the eyes of those four men are looking right at you, no matter where you move. They're peering right into your mind, wondering what you're thinking, making you feel guilty: 'Are you doing your part?'"

In short, On Writing Well is an informative book. It covers the whole craft of nonfiction writing in about 300 pages, and it is written well.

Monday, November 17, 2014

First Post

Hello and welcome to the inane ramblings of an Irish software developer.

The title of the blog comes from Lewis Carroll's Through the Looking Glass. In the book, Alice goes running with the Red Queen, but they don't seem to make any progress. Alice remarks on this, saying, "Well in our country, you'd generally get to somewhere else - if you ran very fast for a long time as we've been doing." The Red Queen replies, "A slow sort of country. Now, here, you see, it takes all the running you can do, to stay in the same place." The Red Queen Effect is quite applicable to the software industry, and as I will probably be talking quite a bit about the software industry, I thought it would be a good name for a blog.

I have a few objectives for my new blog. By writing here, I hope to learn how to write well. That is, I hope to learn how to write clearly and concisely, and be interesting at the same time. I also hope that this blog will become a good professional advertisement for me - something that says, "Yup. That guy is a decent programmer."

Specific things I hope to talk about here will include my favourite programming languages, good books that I have read, and interesting things that I have learned. I hope you find something of value here.