The Fast Country

Showing posts from April, 2020

Useful Links: Going faster with continuous delivery

Just thought I would share a blog post on how Amazon does continuous deployment. The title of the article highlights a key goal: faster deployment of completed features. Deployment latency, the time it takes a completed change to reach production, is a key metric that identifies high-performing teams. In the book Accelerate: The Science of Lean Software and DevOps, Nicole Forsgren and her co-authors identified it as one of four highly predictive metrics for high-performing software teams. The section on risk management is especially worthwhile. The risk reduction strategies mentioned in the article can be implemented with AWS CodePipeline and/or Kubernetes Deployments. https://aws.amazon.com/builders-library/going-faster-with-continuous-delivery/

Useful Links: Deploys - It’s Not Actually About Fridays

This: https://charity.wtf/2019/10/28/deploys-its-not-actually-about-fridays/ Read. Contemplate. Incorporate. Seriously, there are 4 metrics that reliably indicate a high-functioning software organization (see Accelerate, by Forsgren, et al., for details - https://www.amazon.com/Accelerate-Software-Performing-Technology-Organizations-ebook/dp/B07B9F83WM ): lead time for changes, deployment frequency, time to restore service, and change failure rate. This article addresses the 'change failure rate' one, by improving the first two with observability tooling.
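
To make the four metrics listed above concrete, here is a minimal Python sketch of how they could be computed from a list of deployment records. The `Deployment` record, its field names, and the `four_key_metrics` function are hypothetical and exist only for illustration; neither the article nor the book prescribes this data model, and using the median is an assumption made here, not a rule taken from either source.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record of one production deployment.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it degrade the service?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_key_metrics(deployments, period_days):
    """Summarize a non-empty list of deployments over a period of period_days days."""
    lead_times = [hours_between(d.committed_at, d.deployed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours": median(lead_times),
        "deploys_per_day": len(deployments) / period_days,
        "time_to_restore_hours": median(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deployments),
    }


# Made-up sample data: four deployments over two days, one of which failed.
sample = [
    Deployment(datetime(2020, 4, 1, 9), datetime(2020, 4, 1, 11)),
    Deployment(datetime(2020, 4, 1, 10), datetime(2020, 4, 1, 14),
               failed=True, restored_at=datetime(2020, 4, 1, 15)),
    Deployment(datetime(2020, 4, 2, 9), datetime(2020, 4, 2, 10)),
    Deployment(datetime(2020, 4, 2, 12), datetime(2020, 4, 2, 15)),
]
metrics = four_key_metrics(sample, period_days=2)
print(metrics)
```

In practice the records would come from a deployment pipeline and an incident tracker rather than being typed in by hand.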

Useful Links: Microservices Prerequisites

This is a great article describing what capabilities a team has to have in order to run a system which has a microservices architecture: https://martinfowler.com/bliki/MicroservicePrerequisites.html

Useful Links: Logs and Metrics

This article explains why storing log messages alone is insufficient for robust operation of a software service. Metrics also need to be gathered and stored. https://medium.com/@copyconst…/logs-and-metrics-6d34d3026e38 tl;dr - Log volume can spike dramatically when user activity increases, especially when things go wrong, so an alerting system based on logs can be swamped just when it is needed most. For a metrics system, volume grows with the number of metrics collected, not with traffic. It stays stable and is much less likely to fail or slow down during a crisis.
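
The volume argument can be illustrated with a small Python sketch. It assumes a toy service that writes one log line per request and also increments a counter per outcome. The `handle_requests` function and the counter names are hypothetical and are not taken from the linked article.

```python
from collections import Counter


def handle_requests(n_requests, error_every):
    """Simulate n_requests requests where every error_every-th request fails.

    Returns the log lines written and the counters an alerting system would
    read under a metrics-based approach.
    """
    log_lines = []
    counters = Counter()
    for i in range(n_requests):
        status = "error" if i % error_every == 0 else "ok"
        # Log-based approach: one record per request, so volume follows traffic.
        log_lines.append(f"request={i} status={status}")
        # Metrics-based approach: a fixed set of counters that only change value.
        counters[f"requests_{status}_total"] += 1
    return log_lines, counters


# A quiet period compared with a traffic spike in which half the requests fail.
quiet_logs, quiet_counters = handle_requests(100, error_every=10)
spike_logs, spike_counters = handle_requests(10_000, error_every=2)

print(len(quiet_logs), len(spike_logs))          # log lines grow with traffic
print(len(quiet_counters), len(spike_counters))  # the number of series does not
```

The spike multiplies the log volume by 100, while the metrics side still consists of the same two counters holding larger values.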

Useful Links: The Practice of Practice

This is a very interesting talk on practicing for operational events. The speaker draws parallels with musicians practicing for a performance: https://www.youtube.com/watch?v=87EhBrC2L1U

Useful Links: Logging Rules of Thumb

Some very useful advice in here for developers. https://engineering.hellofresh.com/logging-rules-of-thumb-f6c0f71a2351

Useful Links: Anatomy of Cascading Failure

An interesting article on cascading failures, aimed more at the Dev side of DevOps. The list of design anti-patterns is very useful: https://www.infoq.com/articles/anatomy-cascading-failure/

Useful Link: PagerDuty Incident Response Documentation

This documentation from PagerDuty on incident response is pretty good. It would need to be tailored for local conditions but it does highlight aspects of incident response that people might not be aware of (different roles during a major incident, for example). https://response.pagerduty.com/

Useful Links: AWS Cost Optimization 101

An interesting article on AWS cost optimization. I am not in 100% agreement with all of it (re-architecting apps to minimize inter-AZ traffic and not using AWS endpoints, for example), but there are some good tips in there: https://cloudonaut.io/aws-cost-optimization-101/

Useful Links: Trade-offs Under Pressure

These two posts dive into the master's thesis of John Allspaw (former CTO of Etsy) on the heuristics engineers use for decision making under pressure, specifically in the context of dealing with an outage to a software service: https://blog.acolyer.org/2020/01/22/trade-offs-under-pressure-part-1/ and https://blog.acolyer.org/2020/01/24/trade-offs-under-pressure-part-2/ There are two noteworthy aspects to this. The first is the subject matter itself: it identifies heuristics that engineers use to make trade-offs during outages. The second is the methodology used, which is an excellent model for conducting incident reviews. The visualization and classification of the timeline is very informative.