These two posts (https://blog.acolyer.org/2020/01/22/trade-offs-under-pressure-part-1/ and https://blog.acolyer.org/2020/01/24/trade-offs-under-pressure-part-2/) dive into the Master's thesis of John Allspaw (former Head of Engineering at Etsy) on the heuristics used for decision making under pressure, specifically when dealing with an outage of a software service. There are two noteworthy aspects to this. Firstly, the subject matter itself is useful: it identifies heuristics that engineers use to make trade-offs during outages. Secondly, the methodology is worth studying: it demonstrates an excellent approach to conducting incident reviews, and the visualization and classification of the timeline is very informative.
This post is about operational metrics and alerts for distributed software systems. What do I mean by that? I mean the metrics and alerts that allow operations personnel to detect a failure of a distributed software system and that help them quickly diagnose what is wrong.

Metrics

Metrics are measurements of characteristics of the system, collected at regular(ish) intervals and stored somewhere for processing: rendering into graphs, triggering alert notifications, etc. Metrics can be divided into 3 categories: input metrics, output metrics, and process metrics.

Input metrics are measures of the inputs to the system, for example, the number of user requests and counts of particular characteristics of those requests: where they come from, how large the request data is, and which features appear in the request (for example, which resources, items, or products are being asked for).

Output metrics are measures of the output of the system. Examples of these would include orders s