Blameless Retrospective & Incident Analysis to Improve Go-live Practices

Blog By Aun Raza - VP Operations

February 2023

Building software is a complex process, and failures are bound to happen. However, when a production deployment goes wrong, it can be a painful and embarrassing experience for everyone involved. Rolling back a bad production deployment inside a window that clients pre-approved can be particularly awkward.

In such scenarios, the worst response is to point fingers and place blame on the developers, ops engineers, or site reliability engineers. It's human nature to look for someone to blame when things go wrong, but blame solves nothing. It is detrimental to productivity and growth. It is also addictive, and an easy way out.

Let's treat incidents as an opportunity to learn and improve, and talk openly about what went wrong in a Blameless Retrospective.

Cultural Shift from Blame to Accountability (Ownership?)

Yes, I believe "Ownership" is a much better word than "Accountability".

Instead of saying:
"Identify the person who forgot to update environment variables before running the deployment script"

Ask:
"Is there anything in our automation process that we can do to ensure the right environment variables are automatically injected before a production deployment kicks off?"

Instead of identifying the individuals, find the systems that need improving; focus on problems rather than people. A blameless environment discourages a "cover-up, no learning" attitude in your team members and encourages one of openness and learning.
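As one illustration of turning the environment-variable question into a system fix, here is a minimal sketch of a pre-deployment check that stops the pipeline when a required variable is missing. The variable names and the `preflight` function are hypothetical; a real pipeline would more likely inject values from a secrets manager or CI configuration, and this sketch only verifies that they are present.

```python
import os

# Hypothetical variable names, for illustration only.
REQUIRED_VARS = ("DATABASE_URL", "API_KEY", "DEPLOY_ENV")


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    if env is None:
        env = os.environ
    return [name for name in required if not env.get(name, "").strip()]


def preflight(required=REQUIRED_VARS, env=None):
    """Raise before the deployment starts if any required variable is missing."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError(
            "Deployment blocked; missing environment variables: "
            + ", ".join(missing)
        )
```

Calling `preflight()` as the first step of a deployment script makes the failure visible before anything reaches production, so the question shifts from who forgot a variable to why the check was not in place.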

Blameless Incident Analysis
Here are a few steps for successful blameless incident analysis and setting up a learning plan.

Start with the right tone - As a leader, it's important to create a positive environment that is conducive to learning. Emphasize that the goal is to identify areas for improvement in the go-live process and cycle.

Map the incident - Focus on the steps involved in the incident rather than the individuals involved. To gain a better understanding of the situation, ask the following questions:

  • Which systems were affected?
  • How did we become aware of the incident?
  • When did we start responding to the incident?
  • What temporary or permanent measures did we implement to mitigate the issue?
  • Have we encountered similar situations before?

Listen to the right people - To gather the right information during the meeting, hear from the individuals who were directly involved in the issue. This includes those who introduced, identified, responded to, debugged, or resolved it, as well as anyone with additional insights to contribute. Give each of them the opportunity to speak and share their perspective.

Identify the action items - After identifying a potential problem and discussing it with the relevant teams, conduct a root cause analysis (RCA) to identify specific action items. Prioritize these action items by importance (P0, P1, P2, etc.), categorize them by type (preventative, mitigative, or other), and assign them to appropriate team members. To ensure that they do not fall through the cracks, create a ticket for each item and do not close it until at least one production deployment has been executed based on the outcome of the RCA.
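The action-item rules above can be sketched as a small data structure. This is a minimal illustration, assuming an in-memory record rather than a real ticketing system; the `ActionItem` class, its field names, and the `by_priority` helper are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PREVENTATIVE = "preventative"
    MITIGATIVE = "mitigative"
    OTHER = "other"


@dataclass
class ActionItem:
    title: str
    priority: int  # 0 for P0, 1 for P1, and so on
    kind: Kind
    owner: str
    deployments_since_fix: int = 0  # production deployments run with the fix
    closed: bool = False

    def record_deployment(self):
        """Note that a production deployment ran with this fix in place."""
        self.deployments_since_fix += 1

    def close(self):
        """Close the item only after at least one production deployment."""
        if self.deployments_since_fix < 1:
            raise ValueError(
                f"'{self.title}' cannot be closed before a production deployment"
            )
        self.closed = True


def by_priority(items):
    """Return the items ordered from most to least urgent (P0 first)."""
    return sorted(items, key=lambda item: item.priority)
```

In a real setup the same rule would normally live in the ticketing tool's workflow, for example as a required field or transition condition, rather than in application code.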

Conclusion
DevOps is more than just a set of practices and tools - it's a mindset and a culture. Keep in mind that no matter how well-prepared you are, sometimes things can go wrong during your go-live attempts. The best DevOps teams are those that can react quickly to such incidents and learn from them. If your team is learning quickly from their mistakes, incidents are not being repeated, and a blameless learning process is in place, then you can rest easy knowing that your operations are in good hands. However, it's important to remember that there will always be surprises and failures, so it's crucial to foster a culture of learning and growth from such experiences.