Engineering On Call

Overview

To ensure system reliability and timely response to urgent issues, we’ve established an on-call rotation for the engineering team. This process defines responsibilities and expectations for engineers assigned to on-call duties, promoting accountability, learning, and continuous improvement.

⏳ Goals & Time Allocation

The primary goals and expected time allocation for the on-call engineer are:

  • Build prioritized care team features and backend “make eng life easier” features (~70%)
  • Resolve urgent bugs (~20%)
  • Support triage of feature issues (~10%)
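As a rough illustration of what these percentages mean in practice, the sketch below converts them into hours. The 40-hour week and the helper name `hours_per_goal` are assumptions for the example only, not part of the process.

```python
# Target shares taken from the goals listed above.
ALLOCATION = {
    "feature work": 0.70,
    "urgent bugs": 0.20,
    "feature-issue triage": 0.10,
}


def hours_per_goal(week_hours: float = 40.0) -> dict[str, float]:
    """Split a working week (assumed 40 hours by default) across the goals."""
    return {
        goal: round(share * week_hours, 1)
        for goal, share in ALLOCATION.items()
    }


if __name__ == "__main__":
    for goal, hours in hours_per_goal().items():
        print(f"{goal}: ~{hours} h")
```

With the assumed 40-hour week this works out to roughly 28 hours of feature work, 8 hours on urgent bugs, and 4 hours of triage support.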

🔄 Rotation Details

  • 📅 Duration: Each on-call shift lasts 1 week
  • 👥 Participants:
    • Triage Support: Generalist Ops support who performs the initial triage of incoming tickets.
    • Primary On-Call Engineer: Main point of contact for any alerts or support issues during the rotation.
    • Secondary On-Call Engineer: Backup support for the primary. The primary may delegate tasks to the secondary if needed (e.g., during time off or overlapping workload).

📋 On-Call Workflow & Ticket Management

A triage layer is in place to handle initial issues. As the on-call engineer, you only need to act on a ticket once you are both tagged in Slack and assigned the ticket.

Once you are tagged for support:

  1. Review the Ticket: Review the ticket within ~30 minutes of being tagged.
  2. Categorize the Request: Confirm which category the ticket falls into and update the “Request Type” field in the ticket.
    • Urgent Bug: If it's an urgent bug that should be fixed now, keep the Asana task assigned to yourself and close it when complete.
    • Feature Bug: If it's a feature bug that should be triaged, tag the pod's Product Manager (PM) and assign the Asana task to them for prioritization. You should then remove the task from the support Asana board.
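The categorization step above can be sketched as a small routing function. This is an in-memory model only: the `Ticket` class, its fields, and `categorize` are hypothetical names for illustration and do not call the Asana or Slack APIs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class RequestType(Enum):
    URGENT_BUG = "Urgent Bug"
    FEATURE_BUG = "Feature Bug"


@dataclass
class Ticket:
    """Hypothetical stand-in for an Asana support task."""
    title: str
    assignee: str
    request_type: Optional[RequestType] = None
    on_support_board: bool = True
    tagged: List[str] = field(default_factory=list)


def categorize(ticket: Ticket, request_type: RequestType,
               on_call: str, pod_pm: str) -> Ticket:
    """Apply the routing rules for the two request types."""
    # Always record the category in the "Request Type" field.
    ticket.request_type = request_type
    if request_type is RequestType.URGENT_BUG:
        # Urgent bug: the on-call engineer keeps the task and fixes it now.
        ticket.assignee = on_call
    else:
        # Feature bug: tag the pod's PM, hand the task over for
        # prioritization, and take it off the support board.
        ticket.tagged.append(pod_pm)
        ticket.assignee = pod_pm
        ticket.on_support_board = False
    return ticket


if __name__ == "__main__":
    t = Ticket(title="Export button misaligned", assignee="on-call-engineer")
    categorize(t, RequestType.FEATURE_BUG,
               on_call="on-call-engineer", pod_pm="pod-pm")
    print(t.assignee, t.on_support_board)
```

The point of the sketch is that the two paths differ only in who ends up owning the task and whether it stays on the support board; the "Request Type" field is updated in both cases.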

🚀 Handoff & Feature Prioritization

The majority of the on-call week is dedicated to building features. This work is planned as follows:

  • The Friday before On Call: The PM for your pod will share a list of proposed features to work on.
  • Monday Morning of On Call:
    • Meet with the previous week’s on-call engineer to go through any handoffs.
    • Review the list of proposed features and size each one.
    • Document the number of Care Team Features, Urgent Bugs, and Triage of Feature Issues items that you handled.
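One lightweight way to produce the counts is to keep a running list of category labels during the week and tally it at the end. The sketch below assumes such a list exists; `weekly_counts` and the sample entries are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# The three categories named in this process.
CATEGORIES = ("Care Team Features", "Urgent Bugs", "Triage of Feature Issues")


def weekly_counts(handled: Iterable[str]) -> Dict[str, int]:
    """Count handled items per category, reporting zero for unused ones."""
    counts = Counter(handled)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"Unrecognized categories: {sorted(unknown)}")
    return {category: counts[category] for category in CATEGORIES}


if __name__ == "__main__":
    # Made-up sample log, one label per item handled.
    sample_log = [
        "Care Team Features",
        "Urgent Bugs",
        "Care Team Features",
        "Triage of Feature Issues",
    ]
    print(weekly_counts(sample_log))
```

Rejecting unknown labels keeps typos from silently dropping items out of the tally.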

📝 Responsibilities

1️⃣ Triage Support

  • Provide initial triage of incoming tickets, assign each ticket to the appropriate on-call engineer, and tag that engineer in Slack.

2️⃣ Primary Engineer

  • Actively monitor and respond to user reports (support-arc-twilio tickets) and system alerts (ntfy-alerts) when tagged.
  • Coordinate with relevant teams to resolve high-priority issues.
  • Keep stakeholders informed as needed and document actions taken.
  • Share a summary of the week’s activities by EOD Monday after the on-call week, including handled tickets, learnings, fixes, and proactive suggestions.

3️⃣ Secondary Engineer

  • Stay informed and available in case backup is required.
  • Assist with ticket resolution if the primary delegates a task.
  • Optionally shadow the process for ramp-up or cross-training.