How to handle a major technical incident.
First, get the facts: who, what, where, and when.
Collect any error messages and diagnostic data.
Good places to start are monitoring tools and user problem reports.
When gathering diagnostics, check the server logs and the user's web console.
Record a recreation scenario: the steps that reproduce the problem. Try to boil the problem down to its essential elements.
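As an illustration of collecting error messages from a server log, here is a minimal sketch. It assumes log lines carry a level keyword such as ERROR or FATAL; the pattern, the function name extract_error_lines, and the sample log are all made up for this example and will need adjusting to your own log format.

```python
import re

# Assumption: log lines contain a level keyword. Adjust to your log format.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")


def extract_error_lines(log_text, context=1):
    """Return each error line plus `context` lines before and after it."""
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    # Preserve the original order of the log.
    return [lines[i] for i in sorted(keep)]


# Made-up sample log, for illustration only.
sample_log = """\
2024-01-01 10:00:00 INFO request started
2024-01-01 10:00:01 ERROR payment gateway timeout
2024-01-01 10:00:02 INFO request finished
2024-01-01 10:00:03 INFO heartbeat
"""

for line in extract_error_lines(sample_log):
    print(line)
```

Keeping a line or two of context around each error often makes it easier to boil the problem down later.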
Don’t forget to get detail on the business impact!
A user who can’t perform a financial transaction is a more significant problem than an informational feature that is not acting as expected.
The business impact should drive both the urgency and scope of the escalation.
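One way to make the link between business impact and urgency explicit is a small rule like the sketch below. The categories, the function name assess_urgency, and the threshold of 100 users are hypothetical choices for illustration; your organization's escalation policy should define the real ones.

```python
from enum import Enum


class Urgency(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def assess_urgency(blocks_financial_transaction, users_affected):
    """Map business impact to an urgency level (illustrative rules only)."""
    # A blocked financial transaction outranks everything else.
    if blocks_financial_transaction:
        return Urgency.HIGH
    # Hypothetical threshold: many affected users raises the urgency.
    if users_affected > 100:
        return Urgency.MEDIUM
    return Urgency.LOW


print(assess_urgency(blocks_financial_transaction=True, users_affected=1))
print(assess_urgency(blocks_financial_transaction=False, users_affected=5))
```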
Store all of this information (the facts, error messages, diagnostics, and business impact) in a cool, dry data repository for later reference and sharing.
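A structured record makes the collected information easier to store and share. The following is a minimal sketch; the class name IncidentRecord, its field names, and the example values are assumptions for illustration, not a required schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class IncidentRecord:
    # The basic facts: who, what, where, and when.
    who: str
    what: str
    where: str
    when: str
    business_impact: str
    error_messages: list = field(default_factory=list)
    diagnostics: list = field(default_factory=list)
    recreation_steps: list = field(default_factory=list)

    def to_json(self):
        """Serialize the record so it can be saved or attached to a ticket."""
        return json.dumps(asdict(self), indent=2)


# Made-up example values.
record = IncidentRecord(
    who="Example client, accounting team",
    what="Payment submission fails",
    where="Checkout page",
    when="2024-01-01 10:00 UTC",
    business_impact="Users cannot complete financial transactions",
    error_messages=["payment gateway timeout"],
    recreation_steps=["Log in", "Add item to cart", "Submit payment"],
)

print(record.to_json())
```

The JSON output can be saved to whatever repository the team already uses, then updated as findings and action items come in.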
Reach out to your management and describe the incident. Get their direction on who to contact next.
This may be the client's Account Manager, CS rep, Product Manager, or a Development Manager. Business impact and functional area will determine who gets involved.
Share the documents you have collected.
Be prepared to answer questions from the group you convene, such as which settings are active, what the log data shows, and how the system behaves in various what-if scenarios.
There is no way to anticipate all of these, so be prepared to do some more experimentation with your recreation scenario.
Be the liaison to the client.
Describe the actions that will be taken to address the incident and give them a preliminary timeline for resolution.
You may need to ask the user to try a few variations of the scenario, too.
Of course, this is all best done in writing. Remote control and virtual sessions are useful here, too.
Take the time to create additional data points as necessary, such as a new product Jira ticket or a knowledge base article citing the issue as a known problem.
Schedule status meetings internally on a regular basis. Update your documents with findings and action items.
Hold regular update meetings with the client until the issue is resolved.
The Support team assumes the burden of working with the client to try out and verify solutions and get more diagnostic data as appropriate.
Once the issue is resolved, meet again with the internal team to determine what went well and how the process could be improved.
Do not neglect to close out the Jira tickets and update the knowledge base with the root cause and resolution.