Paper on Gray Failures

A great paper annotated by The Morning Paper, this time on the subject of gray failures. There are a handful of interesting takeaways from this one:

  • Monitoring should define failure the same way the system's clients do, not only the way the system reports on itself.
  • The cycle of failure is inevitable unless the proper root causes are identified.
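To make the first takeaway concrete, here is a minimal sketch of the gap between what a server reports and what its clients experience. The `Request` record, the field names, the sample numbers, and the 500 ms deadline are all assumptions made up for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    server_ok: bool     # the server returned a non-error response
    latency_ms: float   # how long the caller waited for it (hypothetical field)


def server_view_success_rate(requests):
    """Success rate as the server would report it: any non-error response counts."""
    return sum(r.server_ok for r in requests) / len(requests)


def client_view_success_rate(requests, deadline_ms):
    """Success rate as a client sees it: the response must also arrive in time."""
    ok = sum(r.server_ok and r.latency_ms <= deadline_ms for r in requests)
    return ok / len(requests)


# Made-up sample: every request "succeeds" on the server, but two are too slow.
requests = [
    Request(True, 120),
    Request(True, 900),
    Request(True, 80),
    Request(True, 1500),
]

server_rate = server_view_success_rate(requests)
client_rate = client_view_success_rate(requests, deadline_ms=500)
print(f"server view: {server_rate:.0%}, client view: {client_rate:.0%}")
```

A dashboard built only on the first number would show a healthy service while half of the callers are timing out, which is why the monitoring has to use the clients' definition.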

Some personal observations:

  • A potential observability gap is the difference between proximate and root cause. For example, a service may fail because a proxy server returns an error. That error may come from a request to an upstream server timing out. The timeout may in turn be caused by increased latency on the upstream server. The root cause could be that the upstream server's working set no longer fits fully in memory, so the server is swapping and its responses slow down. The server itself is not failing any requests, but the proxy is discarding the responses because they arrive too late.
  • The "masked failure" area is a great place to look for indicators of failure. If metrics can be found that correlate strongly with later gray failures, then remediation can take place before customers even notice. The trick is finding the correlating metrics.
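One way to go hunting for such a metric is to check how well an early signal lines up with failures a few intervals later. The sketch below is only an illustration under stated assumptions: the series are synthetic, the names `swap_rate` and `timeouts` are hypothetical, and Pearson correlation at a fixed lag is just one simple choice of measure. A strong correlation is a hint worth investigating, not proof of cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlation(metric, failures, lag):
    """Correlate metric[t] with failures[t + lag], for lag >= 0."""
    if lag < 0 or lag >= len(metric):
        raise ValueError("lag must be between 0 and len(metric) - 1")
    if lag == 0:
        return pearson(metric, failures)
    return pearson(metric[:-lag], failures[lag:])


def best_lag(metric, failures, max_lag):
    """Return the lag (0..max_lag) at which the metric best predicts failures."""
    return max(range(max_lag + 1),
               key=lambda lag: lagged_correlation(metric, failures, lag))


# Synthetic data: timeouts echo the swap rate two intervals later.
swap_rate = [0, 0, 0, 1, 4, 9, 2, 0, 0, 0, 5, 8]
timeouts = [0, 0] + swap_rate[:-2]

lag = best_lag(swap_rate, timeouts, max_lag=4)
print(f"best lag: {lag} intervals, "
      f"correlation: {lagged_correlation(swap_rate, timeouts, lag):.2f}")
```

A metric that scores highly at a positive lag is a candidate for alerting, because it moves before the customer-visible failures do.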

