ChatOps & Major Incident Management

Imagine the scene: you are running a Major Incident, trying to coordinate teams who are based locally, working from home and offshore. You are trying to understand what went wrong, but you are getting different information from different sources, all of it verbal, and can’t get a clear picture of what is going on. One of the engineers suggests making a change that, based on the information you have, seems to make sense. You give the go-ahead and, a minute later, the whole network goes down, and the business is coming to you to find out what happened.

I’m sure that this is a scenario that many people will be able to relate to.

One innovation to come out of the emergence of Agile and DevOps that can help in this scenario is the concept of ChatOps.

ChatOps is a conversation-driven collaboration model that brings people, tools, process, and automation into a transparent workflow, connecting work needed, work happening, and work done, in a persistent location. This transparency tightens the feedback loop, improving information sharing, enhancing team collaboration, culture and cross-training. Leading players in this space include Slack and HipChat.

Atlassian HipChat (http://hipchat.com/)

Slack (https://slack.com/uk)

While in the chat room, team members can collaborate with each other, receive information from enabled tools, and submit commands that will be executed by ‘bots’ through custom scripts and plug-ins. The entire team is able to collaborate in real time and be aware of what has gone on, and is going on, at all times.
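To make the idea of a ‘bot’ executing commands a little more concrete, here is a minimal sketch in Python of how a bot might route chat messages to custom scripts. It is illustrative only: the `!` command prefix, the `HANDLERS` registry and the `status` command are assumptions made for this example, and no particular chat platform's API is used.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry mapping command names to handler functions.
HANDLERS: Dict[str, Callable[[List[str]], str]] = {}


def command(name: str):
    """Register a handler for a chat command such as '!status'."""
    def register(func: Callable[[List[str]], str]):
        HANDLERS[name] = func
        return func
    return register


@command("status")
def status(args: List[str]) -> str:
    # Placeholder: a real bot would query a monitoring tool here.
    service = args[0] if args else "all"
    return f"status requested for: {service}"


def handle_message(user: str, text: str) -> Optional[str]:
    """Return the bot's reply, or None if the message is not a command."""
    if not text.startswith("!"):
        return None  # ordinary conversation, nothing for the bot to do
    parts = text[1:].split()
    if not parts:
        return None
    name, args = parts[0], parts[1:]
    handler = HANDLERS.get(name)
    if handler is None:
        return f"@{user}: unknown command '{name}'"
    return f"@{user}: {handler(args)}"
```

Because the reply is posted back into the same room, everyone sees both the command and its result.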

Tasks that used to be done manually, and were often prone to human error, can now be automated. Almost any type of work can be driven from inside the chat, including server deployment, maintenance tasks and simple reboots, as long as the platform in question exposes an API.

Comprehensive sets of online chat rooms and bots can be created to drive organisational activity, putting them at the centre of team activity and essentially creating a real-time operations centre.

This doesn’t mean that we are replacing the requirement for a Service Management tool. Leveraging APIs allows the tool to integrate with the ChatOps tools, providing relevant information, such as an incident or change number, and taking a feed for relevant updates.
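As a rough sketch of what such an integration might pass into the chat room, the Python below turns an incident record into a JSON message body. The field names (`number`, `priority`, `status`, `summary`) and the `{"text": ...}` payload shape are assumptions for illustration; a real Service Management tool and chat platform will each define their own schemas.

```python
import json
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    # Hypothetical fields; a real tool will expose its own record format.
    number: str
    priority: str
    status: str
    summary: str


def to_chat_payload(update: IncidentUpdate) -> str:
    """Render an incident update as a JSON body for a chat webhook."""
    text = f"[{update.number}] {update.priority} {update.status}: {update.summary}"
    return json.dumps({"text": text})
```

Putting the incident number at the start of every message makes it easy to tie the chat record back to the ticket afterwards.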

Taking the ChatOps approach to our scenario, all teams work in a single chat session, with automated feeds providing information on the current system status. This gives everyone a common view of what is going on and facilitates informed discussion of the potential root cause.
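One way an automated status feed might work is to post only when something changes, so the room is not flooded with repeated "all OK" messages. The sketch below assumes status snapshots arrive as simple service-to-state dictionaries; the function name and the service names in the test data are hypothetical.

```python
from typing import Dict, List


def status_changes(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return one chat line per service whose status differs from the last snapshot."""
    lines = []
    for service in sorted(current):
        old = previous.get(service, "unknown")  # service not seen before
        new = current[service]
        if old != new:
            lines.append(f"{service}: {old} -> {new}")
    return lines
```

A bot could call this on each polling cycle and post the returned lines, if any, into the incident room.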

Once it is agreed that an action needs to be carried out, the command that the engineer submits to make the change will be visible to all. The team will be able to peer review the change before it is implemented and, if it does still cause a further issue, everyone will immediately know what change was carried out. The automated reporting feeds will also reflect whether or not the change fixed the issue.
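The peer review step can be sketched as a simple approval gate: a proposed command is recorded in the room, cannot be approved by the person who proposed it, and only runs once enough reviewers have signed it off. Everything here (`ChangeRoom`, `ProposedChange`, the `run` callback) is a hypothetical illustration rather than a feature of any specific ChatOps product.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class ProposedChange:
    proposer: str
    command: str
    required_approvals: int = 1
    approvers: Set[str] = field(default_factory=set)
    executed: bool = False


class ChangeRoom:
    """Holds a visible transcript and gates command execution on peer approval."""

    def __init__(self, run: Callable[[str], str]):
        self.run = run  # callback that actually performs the command
        self.transcript: List[str] = []

    def propose(self, proposer: str, command: str, required_approvals: int = 1) -> ProposedChange:
        self.transcript.append(f"{proposer} proposed: {command}")
        return ProposedChange(proposer, command, required_approvals)

    def approve(self, change: ProposedChange, reviewer: str) -> bool:
        if reviewer == change.proposer:
            self.transcript.append(f"{reviewer} cannot approve their own change")
            return False
        change.approvers.add(reviewer)
        self.transcript.append(f"{reviewer} approved: {change.command}")
        return True

    def execute(self, change: ProposedChange) -> bool:
        # Block repeat runs and anything without enough approvals.
        if change.executed or len(change.approvers) < change.required_approvals:
            self.transcript.append(f"blocked: {change.command}")
            return False
        result = self.run(change.command)
        change.executed = True
        self.transcript.append(f"executed: {change.command} -> {result}")
        return True
```

The transcript doubles as the record of who proposed, approved and ran each change, which is useful in the post-incident review.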

ChatOps increases the visibility and the sharing of information, provides an exact record of what was discussed and agreed, and reduces the risk of any misunderstanding, which often occurs when managing a Major Incident.

The collaboration and automation benefits of ChatOps can significantly reduce the cycle time for a Major Incident by providing everyone with a common view of the incident status, reducing the need for communications sessions and meetings to articulate status and progress. As a result, people can focus on resolving the Major Incident rather than spending time providing regular updates.

ChatOps technology isn’t limited to managing a Major Incident. It can be leveraged across the whole IT department and beyond, putting conversations to work: improving collaboration, breaking down silos and aiding a move to more agile ways of working.

If you would like to find out how iCore can help you with incident management then please contact us on 0207 868 2405 or email info@icore-ltd.com.