Degraded

Delayed Processing of Background Jobs

Dec 04 at 01:17pm EST
Affected services
Level App
Level API

Resolved
Dec 04 at 07:50pm EST

Retrospective Report - Queue Incident on December 4th

I am writing to provide a detailed account of the incident that occurred on December 4th. This event, which was first detected at 1:17 PM EST and fully resolved by 7:50 PM EST, led to a substantial delay in the processing of background jobs, impacting our system's overall performance. Despite this, I am relieved to report that our core remote control functionality remained operational throughout the incident.

Summary of Events:

  • Initial Detection: At 1:17 PM EST, our support team was alerted to widespread customer-reported issues with Level that were not tied to any recent deployment.
  • DevOps Team Involvement: Our DevOps team began investigating immediately and observed a massive surge in queued jobs, which had grown to over 700k.
  • Background Queue Impact: Our background queue, responsible for various essential tasks, experienced unprecedented growth, climbing from under 100 jobs to nearly 1 million jobs in just four hours.
  • Monitoring and Alert Failures: Notably, the queue began growing around 8 AM EST, yet our DevOps team received no alerts. Our APM (Application Performance Monitor) had in fact triggered an alert, but AlertOps failed to deliver the notification, leading to a delayed response.
  • Security Concern: Alongside these events, a brute force attack was detected but deemed unrelated to the queue issue.

Key Issues and Resolution Efforts:

  • Capacity Expansion: Initially, we lacked sufficient workers to keep pace with the rapidly growing queue. We promptly expanded server capacity and increased the number of workers.
  • Queue Growth Rate: Despite processing more than 17k jobs per minute, the queue continued to grow at an alarming rate of approximately 18k jobs per minute between 3:47 PM EST and 5:10 PM EST (the rough arithmetic behind these figures is sketched after this list).
  • Redis Configuration Error: An attempt to manage this growth led to a misconfiguration in Redis, exacerbating the problem.
  • Scaling Up Resources: By 5:44 PM EST, we had added five more servers dedicated to managing the background job queue.
  • Increased Processing Power: At 6:23 PM EST, we further increased our worker count to 800, enabling us to process around 34k jobs per minute.
  • Resolution: By 7:50 PM EST, the queue had returned to normal levels.
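
For readers who want a sense of the numbers involved, the rough arithmetic referenced above is sketched below. The inflow rate is inferred from the reported processing and growth figures rather than measured directly, and the post-spike inflow used in the drain-time example is a hypothetical illustration only.

```python
# Back-of-the-envelope queue math based on the figures reported above.
# The inflow rate is inferred, not a directly measured value.

PROCESSING_BEFORE = 17_000   # jobs/min processed between 3:47 PM and 5:10 PM EST
NET_GROWTH        = 18_000   # jobs/min the queue was still growing by in that window
PROCESSING_AFTER  = 34_000   # jobs/min with 800 workers from 6:23 PM EST

# If the queue grew by ~18k/min while we were already clearing ~17k/min,
# new jobs must have been arriving at roughly the sum of the two.
implied_inflow = PROCESSING_BEFORE + NET_GROWTH   # ~35k jobs/min

# Once capacity reached ~34k jobs/min, the backlog could only drain as the
# inflow subsided; e.g. a 1M-job backlog with a hypothetical post-spike
# inflow of ~10k jobs/min would take roughly 40 minutes to clear.
example_inflow = 10_000      # hypothetical post-spike inflow, for illustration only
backlog = 1_000_000
minutes_to_drain = backlog / (PROCESSING_AFTER - example_inflow)

print(f"Implied inflow during the spike: ~{implied_inflow:,} jobs/min")
print(f"Example drain time for a 1M backlog at ~{example_inflow:,} jobs/min inflow: "
      f"{minutes_to_drain:.0f} minutes")
```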

Preventative Measures and Future Plans:

  • Robust Monitoring: We are intensifying our monitoring capabilities with redundant alerting systems for early detection of similar issues (a simple illustration of this kind of redundant check follows this list).
  • Resource Scalability: Recognizing the need for flexible resources, we've doubled our server capacity and are planning more infrastructure enhancements.
  • Alert System Reliability: A thorough investigation into the AlertOps malfunction is underway to ensure reliable future notifications.
  • Configuration Management: We've heightened our vigilance around configuration changes, particularly those made under pressure, to prevent compounding issues.
  • Dependency Dashboards: Development of monitoring dashboards for key systems like Redis, PSQL, Nginx, and HAProxy is in progress, complete with sophisticated alert mechanisms.
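
As a minimal sketch of the kind of redundant, out-of-band check referenced above, the example below polls queue depth directly and fires a secondary notification channel if it crosses a threshold. It assumes the background queue is backed by a Redis list; the queue key, threshold, and webhook URL are hypothetical placeholders, not our production configuration.

```python
# Minimal sketch of an out-of-band queue-depth watchdog.
# Assumes the background queue is a Redis list; the queue key, threshold,
# and webhook URL below are hypothetical placeholders.
import time

import redis
import requests

QUEUE_KEY = "background_jobs"                 # hypothetical queue key
DEPTH_THRESHOLD = 50_000                      # hypothetical alert threshold
WEBHOOK_URL = "https://example.com/alert"     # hypothetical secondary alert channel
CHECK_INTERVAL_SECONDS = 60

def check_queue_depth(client: redis.Redis) -> None:
    depth = client.llen(QUEUE_KEY)
    if depth > DEPTH_THRESHOLD:
        # Fire the secondary channel regardless of what the primary APM does,
        # so a single alerting failure cannot silence the signal.
        requests.post(WEBHOOK_URL, json={
            "message": f"Background queue depth is {depth}, above {DEPTH_THRESHOLD}",
        }, timeout=10)

if __name__ == "__main__":
    client = redis.Redis(host="localhost", port=6379)
    while True:
        check_queue_depth(client)
        time.sleep(CHECK_INTERVAL_SECONDS)
```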

This incident has been instrumental in highlighting areas for improvement at Level. We are dedicated to learning from this experience and implementing the necessary changes to uphold the high standards of service you expect from us.

Thank you for your ongoing support and trust in Level.

Created
Dec 04 at 01:17pm EST

We are investigating reports of issues from customers.