As systems become more complex, failures are inevitable. No system has perfect uptime. We call the serious problems that affect our software incidents: events that prevent users from accessing your product or, even worse, alter data and information without the customer’s consent. As you can imagine, this is where things get serious!
In this post, I’ll tell you how Resultados Digitais handles an incident so that it has the least possible impact on customers and so that the team learns from its mistakes and keeps them from happening again.
An incident is nothing more than an unexpected interruption or reduction in the quality of an IT service.
Generally speaking, people don’t like problems, whether in personal or professional life. However, we need to learn to face these everyday situations.
“Everything fails all the time.” (Werner Vogels, CTO, Amazon)
Even IT industry giants like Facebook, Amazon and Google experience incidents.
You might be thinking: I use Google services and have never encountered a problem. I access Facebook every day and it works perfectly.
The companies I mentioned above resolve serious problems as quickly as possible, minimize the impact on customers and learn from their mistakes so as not to repeat them. They also release critical changes and new features gradually, making them available to only a portion of users at first. Here at Resultados Digitais we follow the same practice and call it Controlled Rollout.
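To make the idea of a controlled rollout concrete, here is a minimal sketch of one common way to expose a feature to a fixed percentage of users. It is not RD’s implementation: the function name `in_rollout`, the feature name and the user IDs are all hypothetical, and hashing the user ID into a bucket is just one possible approach.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage (0 to 100)."""
    # Hash feature and user together so each feature gets its own stable bucket
    # and the same user always receives the same answer.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # a number from 0 to 9999
    return bucket < percent * 100


# Hypothetical usage: release "new-editor" to about 10% of 1000 users.
users = [f"user-{i}" for i in range(1000)]
enabled = [u for u in users if in_rollout(u, "new-editor", 10)]
print(f"{len(enabled)} of {len(users)} users see the feature")
```

Because the bucket depends only on the feature and the user, raising the percentage later keeps everyone who already had the feature and simply adds more users.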
At the end of the day, what will make a difference to the business is how prepared the team is to handle its own incidents.
Incident handling plan
Having an incident handling plan helps overcome the tough days at technology companies and makes a significant impact on the customer experience.
At RD, our incident handling plan has four work fronts: identification, communication, solution and learning/action plan.
Identify the problem and have clear criteria for considering a problem as an incident. Each team must analyze its scenario and map out specific criteria.
Although each team has its own particularities, here is a set of general criteria for identifying an incident that can help you:
- System downtime
- Degradation of an indicator beyond the tolerated limit
- Data loss of any kind
- Unavailability of one of the third-party services connected to the system
- A blocking defect (bug) in the main flow
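The general criteria above can be turned into an explicit check, which helps a team agree in advance on what counts as an incident. The sketch below is only an illustration: the `HealthSnapshot` fields, the error-rate indicator and its limit are assumptions, and each team would substitute its own signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSnapshot:
    """One reading of the signals a team watches (all fields are hypothetical)."""
    system_up: bool
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    error_rate_limit: float    # tolerated limit agreed on by the team
    data_loss_detected: bool
    third_party_up: bool
    blocking_bug_in_main_flow: bool


def incident_reasons(snapshot: HealthSnapshot) -> List[str]:
    """Return every general criterion the snapshot meets; empty means no incident."""
    reasons = []
    if not snapshot.system_up:
        reasons.append("system downtime")
    if snapshot.error_rate > snapshot.error_rate_limit:
        reasons.append("indicator degraded beyond the tolerated limit")
    if snapshot.data_loss_detected:
        reasons.append("data loss")
    if not snapshot.third_party_up:
        reasons.append("third-party service unavailable")
    if snapshot.blocking_bug_in_main_flow:
        reasons.append("blocking defect in the main flow")
    return reasons


# Hypothetical reading: everything is up, but 8% of requests fail against a 2% limit.
snapshot = HealthSnapshot(True, 0.08, 0.02, False, True, False)
print(incident_reasons(snapshot))
```

Returning the list of reasons, rather than a plain yes or no, gives the person confirming the incident something concrete to pass on to the stakeholders.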
Upon confirming the incident, the person responsible must notify the stakeholders, passing on what is known so far, so that the solution front and the communication front can start.
But before that, get organized! Set up a communication channel in the tool your company uses. At RD, we create a channel in Slack. You can add to this channel, for example, the team responsible for the incident, the support team and the operations team.
It is also useful to open a shared document and record the investigation and the actions taken in real time, so that you can later discuss them and extract the lessons learned. The goal is to centralize and document the entire conversation, avoiding misalignment and rework.
Communication with the customer
Now that internal communication is underway, we need to start external communication with the customer.
During an incident, it’s easy to forget that your customers are often in an even worse situation. They are also impacted by the interruption, but they have less information about what is happening.
Keeping customers informed during system outages builds trust in the relationship. Confidence is the reflection of knowledge.
Customers need to know that the company is aware of the issue and is actively working to correct it.
At RD, the PM (Product Manager) of the team that triggered the problem must focus fully on providing assistance, whether that means alerting, reassuring or being available to customers and CS (Customer Success). The PM’s duty is to protect the customers’ interests by passing on, in a filtered way, the information coming from the team that is working to solve the problem.
- Gather data on the incident’s impact: based on the impact, we decide whether to open a status page.
  - High: affects more than 30% of the customer base
  - Medium: affects between 5% and 20% of the customer base
  - Low: affects less than 5% of the customer base
- Open the status page: the status page shows which features of the system are operating normally and which ones are having problems. When registering an incident, you must provide a detailed description of the problem, the affected features and the status (investigating, identified, monitoring, resolved).
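The impact levels and the status values above map naturally onto a small data model. The sketch below is an assumption-laden illustration, not RD’s tooling: the names `Severity`, `Status` and `classify_impact` are hypothetical, and because the levels as listed do not cover impacts above 20% and up to 30%, the sketch assumes that range is treated as high.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Status(Enum):
    """The four status values a status page entry moves through."""
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


def classify_impact(affected_customers: int, total_customers: int) -> Severity:
    """Classify an incident by the share of the customer base it affects."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    share = 100 * affected_customers / total_customers
    if share < 5:
        return Severity.LOW
    if share <= 20:
        return Severity.MEDIUM
    # ASSUMPTION: the listed levels leave the range above 20% and up to 30%
    # unassigned; this sketch takes the cautious route and treats it as HIGH.
    return Severity.HIGH


# Hypothetical usage: 1,200 of 10,000 customers affected, investigation just started.
print(classify_impact(1_200, 10_000).value, Status.INVESTIGATING.value)
```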
Make sure your customers know where to look for information. They need to understand that the situation is being taken seriously so that they can continue to use the other features of the system that are operating normally.
The solution front must be made up of the team that caused the incident, reinforced by as many additional people as the effort needed to fix the problem requires.
First, we must contain the incident, preventing more customers from being affected, even if that leaves the service partially or totally unavailable.
Once this is done, we correct or work around the problem so that the service can return to operating in conditions acceptable to the customer. In some cases, a palliative solution may be a good option.
A palliative is a measure that only masks the problem in a given situation without actually solving it.
Let’s say your system calls a URL provided by an external service to fetch some charts that are shown to customers. One day this URL changes and nobody tells you. The charts are no longer shown on screen, and we have the clear beginning of an incident!
To solve the customer’s problem as quickly as possible, the team contacts the external service provider to find the new URL and adjusts the system.
Okay, now everything is working normally again. However, if the URL changes again we will have another incident. In other words, the solution was a palliative. For this reason, after working around the situation, we must fix the problem properly and definitively, refactoring and rebuilding whatever is necessary.
Learn and act
Once the service has been restored, the company must have some kind of process to assess what happened. An issue is arguably not fully resolved until the team has documented it and thought it through.
This is where the post-mortem comes in. Its purpose is to create a positive culture in which people discuss problems and learn from them.
The post-mortem is a meeting of about one hour, which must take place no later than one week after the incident is resolved.
In this meeting, participants build a document telling a story that basically includes the following points:
- Summary of how the incident was identified
- Root cause analysis
- Steps taken to assess, diagnose and resolve
- Timeline of activities
- Learning and next steps
We need to come out of the post-mortem with an action plan containing the actions that will prevent the incident from happening again in the future.
These actions should rely on automation as much as possible: our memory is short, and new people who join the team will not have this knowledge.
Going back to the previous example, where the external service provider changes the URL, a useful preventive action would be to automate an alert that notifies the team when the URL changes.
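An automated check for the URL example could look like the sketch below. It assumes that a changed URL shows up as a failed request, an HTTP error or a redirect, which may not hold for every provider. The names `fetch` and `check_url` and the `example.com` addresses are hypothetical; in practice the check would run on a schedule and send its message to the team’s alert channel.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple


def fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Return (HTTP status, final URL after redirects); status 0 means unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx or 5xx status.
        return error.code, url
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar problems.
        return 0, url


def check_url(url: str, fetcher: Callable[[str], Tuple[int, str]] = fetch) -> Optional[str]:
    """Return an alert message if the URL looks changed or broken, else None."""
    status, final_url = fetcher(url)
    if status == 0:
        return f"{url} is unreachable"
    if status >= 400:
        return f"{url} returned HTTP {status}"
    if final_url != url:
        return f"{url} now redirects to {final_url}"
    return None


# Hypothetical usage with a fake fetcher that simulates the provider moving the URL.
def moved(url: str) -> Tuple[int, str]:
    return 200, "https://charts.example.com/v2/render"


print(check_url("https://charts.example.com/v1/render", fetcher=moved))
```

Passing the fetcher as a parameter keeps the decision logic testable without touching the network.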
Each action also needs one or more owners and a due date. You can log the tasks in a tool like Trello or Jira for easier management.
Finally, something we’ve learned over time is that there’s no point in having a complete process if people don’t know about it. So make the process visual!
You can draw a workflow of the incident process showing the high-level tasks and their owners, and then make it available somewhere accessible to everyone on the team.
In our experience, having the workflow of the incident handling plan posted on the wall made the process much smoother! <3
Demonstrating competence in recovering from a system incident can lead to a higher level of customer satisfaction than never having failed.
In addition, being able to investigate the reason for the incident, how it happened and what to do to avoid new occurrences is a learning experience for your team. A mistake becomes a good lesson if we choose to learn from it.
Are you handling your incidents carefully and efficiently? Do you have a different process from the one shown here? Tell us!