Outage Response Process
A standardized outage process ensures that incidents are handled properly, that users and stakeholders receive timely communication, and that recurrence is prevented through identified action items.
Note: This section provides a high-level summary of key steps during an outage. Always refer to the full Outage Process documentation for complete details on severity levels, communication cadences, and postmortem procedures.
When to Declare an Outage
- Bias towards using the outage process even if unsure whether something qualifies as an incident
- If uncertain about severity level (e.g., SEV-1 vs SEV-2), treat it as the higher severity
- An active incident is not the time to debate severity; assume the highest level and review it during the postmortem
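The "treat it as the higher severity" rule above can be sketched as a small helper. This is only an illustration, not part of the Incident Bot: the function name `pick_severity` is hypothetical, and it assumes severity labels of the form `SEV-<n>` where a lower number is more severe and SEV-1 is the highest level. Check the full Outage Process documentation for the actual severity definitions.

```python
def pick_severity(candidates):
    """Return the most severe level among the levels under debate.

    Assumption: labels look like "SEV-<n>" and a lower n is more severe.
    With no candidates at all, fall back to "SEV-1" (assumed highest).
    """
    if not candidates:
        return "SEV-1"
    # The smallest number wins, e.g. SEV-1 beats SEV-2.
    return min(candidates, key=lambda label: int(label.split("-")[1]))
```

For example, when responders are split between SEV-1 and SEV-2, `pick_severity(["SEV-2", "SEV-1"])` selects SEV-1, and the choice is revisited in the postmortem.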
Immediate Response Steps
- Start an Incident Bot workflow in the #support-outage channel: This creates a thread for the incident, sends a Care Team-facing message, and sends you a message with further instructions and resources.
- Live Call: Join the Google Meet call created by the Incident Bot.
- Assign Roles:
  - Incident Commander: If you triggered the workflow, you are the Incident Commander. If you must hand off this responsibility, identify a new Incident Commander and announce the handoff explicitly on the incident thread.
  - Response Team Engineers: Domain owners, service owners, or engineers who recently deployed to affected systems
  - Operators: Stakeholders of affected services/products
- Delegate Tasks: The Incident Commander should explicitly delegate investigation, communication, and mitigation tasks to specific response team members.
- Consider Immediate Rollback: If the outage started right after a deploy, trigger a rollback immediately, before investigating further. Rollbacks have minimal downside and buy time to investigate safely.
- Assess Product & Care Team Impact: Quickly understand the user-facing and Care Team member impact, not just the technical impact. Communicate this clearly to stakeholders.
- Create Postmortem Doc: Start a doc from the postmortem template for note-taking during the incident
- [SEV-1/SEV-2] Maintain Frequent Communication: Update the Care Team at a regular cadence and avoid large gaps between updates, even while the investigation is still ongoing
- Resolution Communication: Once resolved or mitigated, inform the Care Team and Operators
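The Consider Immediate Rollback step above comes down to a timing check: did the outage begin shortly after a deploy? A minimal sketch of that check follows. The function name `should_rollback_first` and the 30-minute window are assumptions made for illustration; the process itself does not define what counts as "right after" a deploy.

```python
from datetime import datetime, timedelta

# Hypothetical threshold for "right after a deploy"; tune to your deploy cadence.
ROLLBACK_WINDOW = timedelta(minutes=30)


def should_rollback_first(outage_start, last_deploy, window=ROLLBACK_WINDOW):
    """Return True when the outage began within `window` after the last deploy.

    `last_deploy` may be None when no recent deploy is known.
    """
    if last_deploy is None:
        return False
    elapsed = outage_start - last_deploy
    # An outage that started before the deploy cannot have been caused by it.
    return timedelta(0) <= elapsed <= window
```

For instance, with a deploy at 12:00 and an outage starting at 12:10, the check returns `True` and the rollback is triggered first. An outage starting two hours later returns `False` and goes straight to investigation.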
Post-Incident Actions
- Complete the Postmortem Doc: Fill out remaining details from the template
- Schedule Live Postmortem: Include response team, domain owners, and relevant external stakeholders
- Track Action Items: Document preventative action items on the team-r&d-operational-excellence Asana board
  - Progress will be reviewed during recurring team OE syncs
Resources
- Full Outage Process Documentation - Complete process including severity level definitions
- Postmortem Template - Template for incident documentation
- Slack Channels:
  - #support-outage for incident threads