A great paper annotated by The Morning Paper, this time on the subject of gray failures: https://blog.acolyer.org/2017/06/15/gray-failure-the-achilles-heel-of-cloud-scale-systems/ There are a handful of interesting takeaways from this one:
- Monitoring on a system should align with how the system's clients define failure.
- The cycle of failure is inevitable unless proper root causes are identified.
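
To make the monitoring takeaway concrete, the same requests can be scored twice: once as the server sees them and once as a client with a deadline sees them. This is only a rough sketch and not something taken from the paper; the `Request` record, the 200 ms deadline, and the sample latencies are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float  # time the server took to produce a response
    server_ok: bool    # True if the server itself returned a success


def server_success_rate(requests):
    """Success rate as the server's own monitoring would report it."""
    return sum(r.server_ok for r in requests) / len(requests)


def client_success_rate(requests, deadline_ms):
    """Success rate as the client sees it: a late answer counts as a failure."""
    ok = sum(r.server_ok and r.latency_ms <= deadline_ms for r in requests)
    return ok / len(requests)


# Hypothetical sample: every request succeeds on the server, but two are slow.
sample = [Request(latency_ms=ms, server_ok=True) for ms in (20, 35, 50, 250, 400)]

server_view = server_success_rate(sample)
client_view = client_success_rate(sample, deadline_ms=200)  # assumed deadline
print(f"server view: {server_view:.0%}, client view: {client_view:.0%}")
```

With these invented numbers the server-side metric looks perfect while the client-side one does not, which is the kind of disagreement that monitoring aligned with the client's definition of failure is meant to surface.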
Some personal observations:
- A potential observability gap is the difference between proximate and root cause. For example, a service may fail because a proxy server returns an error. The error may be due to a request to an upstream server timing out. The cause of that could be increased latency on the upstream server. The root cause of that could be that the server's working set no longer fully fits in memory, so the server is swapping, which adds latency. The server itself is not failing any requests, but the proxy is throwing the results away because they arrive too late.
- The "masked failure" area is a great place to search for indicators of failure. If metrics can be found that correlate strongly with later gray failures, then remediation can take place before customers even notice. The trick is finding the correlating metrics.
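
One possible way to hunt for such leading indicators is to correlate a candidate metric against the client-visible failure signal shifted forward in time, and see which shift gives the strongest relationship. The sketch below is an assumption-laden illustration, not a method from the paper: the function names are hypothetical, the series are invented (swap-ins per minute as the candidate, proxy timeouts per minute as the failure signal), and a strong correlation would still need checking before anyone treats it as a cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def lagged_correlation(candidate, failures, lag):
    """Correlate candidate[t] with failures[t + lag]."""
    if lag == 0:
        return pearson(candidate, failures)
    return pearson(candidate[:-lag], failures[lag:])


def best_lead(candidate, failures, max_lag):
    """Return (lag, correlation) for the lag with the strongest correlation."""
    scores = [(lag, lagged_correlation(candidate, failures, lag))
              for lag in range(max_lag + 1)]
    return max(scores, key=lambda pair: abs(pair[1]))


# Invented per-minute samples: timeouts echo the swap activity two minutes later.
swap_ins = [0, 0, 1, 4, 9, 3, 0, 0, 2, 7, 1, 0]
timeouts = [0, 0, 0, 0, 1, 4, 9, 3, 0, 0, 2, 7]

lag, corr = best_lead(swap_ins, timeouts, max_lag=4)
print(f"strongest correlation {corr:.2f} at a lead of {lag} samples")
```

A metric that scores well at a non-zero lag is a candidate early warning: it moves before the failures that customers would notice, which leaves a window for remediation.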