I find it quite instructive to look at what went wrong technically. The severe failure was caused by the combination of two factors.
The video (16 minutes long) presents Doug Seven's article "Knightmare: A DevOps Cautionary Tale" in an entertaining (but very loud!) way.
In short: they reused an old switch to activate new code. Unfortunately, the new code had not been installed on all servers. That alone would not have been so devastating, but in the meantime, the old code behind that switch had also changed…
“Any time your deployment process relies on humans reading and following instructions you are exposing yourself to risk. Humans make mistakes. The mistakes could be in the instructions, in the interpretation of the instructions, or in the execution of the instructions. … Deployments need to be automated and repeatable and as free from potential human error as possible.”
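To make the point about automation a little more concrete, here is a minimal sketch in Python of a gate that refuses to flip a switch unless every server reports the expected code version. It is only an illustration of the idea, not how the company in the article actually deployed: the `Server` class, the version strings, and the names `servers_out_of_date` and `enable_flag` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical record of one server and the code version it reports."""
    name: str
    deployed_version: str


def servers_out_of_date(servers, expected_version):
    """Return the names of servers that do not run the expected version."""
    return [s.name for s in servers if s.deployed_version != expected_version]


def enable_flag(servers, expected_version, flag_name):
    """Allow a switch to be enabled only if the whole fleet is up to date.

    Raises RuntimeError and names the stale servers otherwise, so a
    partially deployed release cannot be activated by accident.
    """
    stale = servers_out_of_date(servers, expected_version)
    if stale:
        raise RuntimeError(
            f"refusing to enable {flag_name!r}: stale servers {stale}"
        )
    return True
```

With a fleet in which one server still runs the old version, `enable_flag` raises an error instead of activating the switch. A check like this would not replace an automated deployment, but it removes one step that otherwise depends on a human noticing the missing server.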