post-mortems | Tags | PagerDuty

Learning from Major Incidents: The Opportunities We’re Missing by Nora Jones https://www.pagerduty.com/blog/learning-from-major-incidents-the-opportunities-were-missing/ Mon, 22 Jul 2024 16:47:26 +0000

The post Learning from Major Incidents: The Opportunities We’re Missing appeared first on PagerDuty.

While they are untimely, stressful, and likely to highlight communication breakdowns within an organization, incidents can be a powerful tool for learning and growth.

When an incident with a large impact occurs, which seems to make the news on a weekly basis, the focus is often on two things: stabilizing the situation and controlling the narrative. Organizations often miss the opportunity incidents present: learning.

While all organizations will say they support learning, many simply haven’t realized the expertise it takes to both unearth the necessary data points and to disseminate those insights so that employees (and executives) can use them for growth.

Most organizations rely on a small group of people to jump in and start fixing the situation—they are the experts and can often figure out what needs to be done and who they should call on.

One of the best opportunities you have after an incident? Building more experts.

As an industry, we know that we tend to over-rely on the expertise of a few engineers and individuals to help in a situation. In fact, I bet if I asked you now, you could easily rattle off the names of five people you would call on in a major incident. My concern is that we’re approaching this problem by replacing these humans with GenAI, rather than leveraging GenAI to teach more humans in our organizations and grow beyond this group of five.

Typically, the expert in resolving a situation doesn’t even realize why or how they’re doing what they’re doing – it’s second nature to them. If we can solicit the why and the how – we can use incidents to build a larger group of experts.

“Some developers of expert systems observed that highly skilled experts can carry out tasks without being aware of how or why they do what they do” 

– Minding the Weather, How Expert Forecasters Think

Here are some quick tips for organizations after a large incident (I highly recommend reading our HOWIE post-incident guide for thorough recommendations):

  1. Separate the public incident review (with the motive to show customer confidence) from the internal learning review. Yes, it’s important to share things right away, in a “day 1 flash” of sorts. However, there is also a “day 5 flash” and a “day 30 flash”, where you learn more (the day 30 flash can take insights from the internal post-incident review). It’s important not to make promises in the day 1 flash: you’re still in “learning mode”, and if you do, it can distract from how your organization improves.
  2. Have someone technical who did not participate in the incident conduct the internal incident review, interviews, and recommendations. This is important. Oftentimes the people who participated in the incident have too much tunnel vision to unearth the full picture. (With PagerDuty, you can work with us on this, no product attached: just incident analysis experts who can help you extract insights that might be hard to see when you’re in it.)
  3. Take the time to really understand the divide in perspectives between executives and technical employees on the front lines. This divide may be exacerbated after a large event, and it’s imperative that the person from item number 2 collect both perspectives through their cognitive interviews. Closely evaluating this relationship will allow the incident analyst to provide recommendations that shed light on the best ROI.

No organization asks for large-scale, public, costly incidents. As PagerDuty CEO and Chairperson Jennifer Tejada said on CNBC, software is not perfect. Ultimately, incidents and outages happen, and GenAI alone won’t fix them. These tools, combined with a deep, human-centric incident analysis process, can help employees learn and become more interested in evolving. Invest in your employees’ training and development, and they’re not only more likely to stay, they’re more likely to have the expertise needed to continue growing your business.

Better Incident Postmortems by Paul Rechsteiner https://www.pagerduty.com/blog/better-incident-post-mortems/ Tue, 09 May 2017 13:00:17 +0000

The post Better Incident Postmortems appeared first on PagerDuty.

While a major incident is ongoing, all of your focus is on restoring service: watch the smoke, figure out where the fire is, and put it out. But after service has been restored (the incident is resolved, the adrenaline has drained, and it’s peace time), that’s the time to learn from what happened and then use those learnings to get better at responding to, resolving, and preventing future incidents. The core best practice that enables this cycle of improvement is the postmortem process, and PagerDuty is pleased to introduce integrated support for postmortems in our full lifecycle incident management platform! Coupled with several other PagerDuty capabilities, such as system and operational efficiency analytics and the Operations Command Console, we now provide everything you need to learn and proactively improve both the resiliency of your infrastructure and your resolution process.

PagerDuty improves all parts of the postmortem process, from building the timeline all the way through to tracking the status of postmortems. Construct a timeline with relevant PagerDuty and chat activity in minutes instead of hours, then use that detailed breakdown to efficiently investigate root cause, assess response effectiveness, and determine the most important follow-up actions. We’ve taken the friction out of conducting effective postmortems, so that more of your postmortem time can be focused on learning and less on manual work. How easy can your postmortems be? Let’s take a look!

Now you can kick off the postmortem process for an incident in a single click:

Investigate

With the postmortem report created, it’s time to roll up our sleeves and start investigating what actually happened. We’ll want to pull in activity from our already existing sources of communication and incident response: chat and PagerDuty. Our PagerDuty incident information was automatically associated with our new postmortem, so let’s add in the relevant chat channels:

Now we can review the combined activity available from the incident and these chat rooms, and include in the postmortem timeline exactly those bits that are most relevant to understanding how the incident played out. We want to cover several aspects of the incident: the technology systems involved, our response effectiveness, and resolution steps.

Postmortem Timeline

Including an item in the postmortem timeline is also just a single click: no cut and paste, no switching between applications, no error-prone manual time-zone math. The full range of PagerDuty activity can be included: incident state changes, notes, escalations, notifications, when additional responders were requested, when status updates were dispatched to stakeholders, and more. Once the activity is in the timeline, you can also annotate it to describe its relevance to the incident.
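To make the manual work being replaced concrete, here is a minimal sketch of the time-zone math involved in merging activity from two sources into one chronological timeline. It is not PagerDuty's implementation or API: the `TimelineEntry` type, the `to_utc` and `build_timeline` helpers, the source labels, and the sample events are all hypothetical, and it assumes every timestamp is an ISO 8601 string with an explicit UTC offset.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    """One row of a merged timeline (hypothetical type, for illustration only)."""
    at: datetime     # timezone-aware timestamp, normalized to UTC
    source: str      # illustrative label such as "pagerduty" or "chat"
    summary: str
    note: str = ""   # optional annotation describing relevance to the incident


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # Refuse to guess the zone of a naive timestamp.
        raise ValueError(f"timestamp has no UTC offset: {stamp!r}")
    return parsed.astimezone(timezone.utc)


def build_timeline(*sources):
    """Merge (source_name, [(timestamp, summary), ...]) pairs into one sorted list."""
    entries = [
        TimelineEntry(to_utc(stamp), name, summary)
        for name, events in sources
        for stamp, summary in events
    ]
    return sorted(entries, key=lambda entry: entry.at)


# Hypothetical sample data: incident activity recorded at UTC-07:00, chat in UTC.
incident_log = [
    ("2017-05-09T06:02:00-07:00", "Incident triggered"),
    ("2017-05-09T06:20:00-07:00", "Incident resolved"),
]
chat_log = [
    ("2017-05-09T13:05:00+00:00", "Rolling back the last deploy"),
]

timeline = build_timeline(("pagerduty", incident_log), ("chat", chat_log))
for entry in timeline:
    print(entry.at.isoformat(), entry.source, entry.summary)
```

Normalizing everything to UTC before sorting is what places the chat message between the trigger and the resolution, even though its local clock time looks later than both.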

Analyze

With the timeline built out, we can continue on to the analysis phase. This consists of summarizing what happened, identifying the underlying root cause, calling out the path to resolution, and so on. This step is key because it lets the team reflect on what worked well and where we could have done better, then identify the most important improvements to pursue as action items. All of this is easy to capture within the postmortem editor, which also provides instructions for approaching each of these sections.

And it’s as simple as that!

Streamline Postmortem Management

Not only is individual postmortem construction easier and more effective, the overall process is also significantly streamlined. All postmortems are available in the catalog.

This makes it easy to locate postmortems, identify impactful long-running incidents, and see which postmortems are still in progress or already complete. Postmortems can also be exported as PDFs for distribution or archiving, and both the report template and per-section instructions for authors can be customized to fit the needs of your organization. Together, all of these tools provide a complete end-to-end postmortem process that is both easy to use and easy to manage.

This suite of functionality helps you get the most from postmortems:

  • Timeline building is faster, less painful, and enables broader insights.
  • It’s far easier to manage the postmortem process with a simplified toolchain.
  • Your team can accelerate continuous improvement by getting more and better learnings, while spending less time on the process.

We hope that this capability makes it as easy as possible for your team to facilitate a culture of shared learning. And if you’re interested in learning more, download our free post-mortem handbook for best practices on conducting effective postmortems.

PagerDuty Postmortems is included for all customers on our Standard and Enterprise plans. To get started, check out the support article here!

 

Where is the Modern Day Postmortem? by Priya Sony https://www.pagerduty.com/resources/webinar/google-modern-day-postmortem/ Wed, 22 Feb 2017 13:00:56 +0000

The post Where is the Modern Day Postmortem? appeared first on PagerDuty.

The Journey of Chaos Engineering | Webinar | PagerDuty by Priya Sony https://www.pagerduty.com/resources/webinar/twilio-chaos-engineering/ Wed, 22 Feb 2017 13:00:31 +0000

The post The Journey of Chaos Engineering | Webinar | PagerDuty appeared first on PagerDuty.
