Incidents | Level Incidents reported on status page for Level https://status.level.io/ DC Maintenance https://status.level.io/incident/555467 Fri, 02 May 2025 05:30:00 +0000 https://status.level.io/incident/555467#4eebad3591be20795b3a4b7b52ba494f2a1028944dcf955f7256a8729a9ca9ed Maintenance completed Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 05:20:22 -0000 https://status.level.io/incident/555272#cf15e8ef5d3885bfe867a8397526852ecbf262c954f7835432c447076e735811 Level API and Agents API recovered. Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 05:12:15 -0000 https://status.level.io/incident/555272#556ff03ebc2c0df7ebdaf26e34deb01187be49d2b5fc2d259407a300027c9f71 Level API and Agents API went down. Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 04:52:53 -0000 https://status.level.io/incident/555272#a2b5e93cfb8c65735fb3ef99ae7292f8945edb1d25a2434587e7c95c92a05b68 Level API and Agents API recovered.
Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 04:43:57 -0000 https://status.level.io/incident/555272#7426034c653aca8ff8e6be0a4398b7919aa586053985d0def13867d660c443dc Level API and Agents API went down. Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 04:39:24 -0000 https://status.level.io/incident/555272#6428eb9d8b2a934f8e0198d9bcb3e3ecbd20ab0cf04f356c79ad6c920a1345e6 Level API and Agents API recovered. Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 04:26:18 -0000 https://status.level.io/incident/555272#de68125c70f460d328cf5805fad466f3fa371897dcc033b0ea20cec59a744ae1 Level API and Agents API went down. Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 03:22:25 -0000 https://status.level.io/incident/555272#71c1ed5bc71dc4ca2c2e9da1b6e8480f619cecedc8416d3c29ec105406d918ef Level API and Agents API recovered.
Level API and Agents API are down https://status.level.io/incident/555272 Fri, 02 May 2025 03:19:18 -0000 https://status.level.io/incident/555272#6f264179751167318275c575ca5f1aab3f5c53a3b79d90f211ac3a8fe27b08ca Level API and Agents API went down. DC Maintenance https://status.level.io/incident/555467 Fri, 02 May 2025 03:00:00 -0000 https://status.level.io/incident/555467#865e291a75e958efc8903ae179511437cf2a1ccf4a11b8f67202be7d866b8ca6 We will be performing upgrades to our data center infrastructure. Minimal downtime is expected. Level API and Agents API are down https://status.level.io/incident/523219 Wed, 05 Mar 2025 17:44:31 -0000 https://status.level.io/incident/523219#5e38e0f768a37ecb07a747116c80c78d60a05b8e95f0b294e63999183ebeada1 Level API and Agents API recovered.
Level API and Agents API are down https://status.level.io/incident/523219 Wed, 05 Mar 2025 17:33:42 -0000 https://status.level.io/incident/523219#f2061e4c49965c1b71eeb57f7fabe3622985c4bc505b19cd77ff1eddafac9d7a Level API went down. Level API and Agents API are down https://status.level.io/incident/523219 Wed, 05 Mar 2025 17:33:21 -0000 https://status.level.io/incident/523219#9e429695bea9389420da71e257b23e661304a1cee531d4936434a0d038516a8e Agents API went down. Level API and Agents API are down https://status.level.io/incident/441171 Tue, 08 Oct 2024 12:29:11 -0000 https://status.level.io/incident/441171#f375746e0de19359e630015e0622482e5406ba2b9d3525e5b1fcc7e3f9e7a1bc Agents API recovered. Level API and Agents API are down https://status.level.io/incident/441171 Tue, 08 Oct 2024 12:28:28 -0000 https://status.level.io/incident/441171#2696219f2e72d9c54d0a61dda5fd9f0026ab9066ce9194216b5e503300e803cd Level API recovered. Level API and Agents API are down https://status.level.io/incident/441171 Tue, 08 Oct 2024 11:36:13 -0000 https://status.level.io/incident/441171#ea13142b61876bdb3696b8b6636c7618ceb0669620e6f51a3f17da814d332df7 Level API and Agents API went down. Level API and Agents API are down https://status.level.io/incident/436038 Sat, 28 Sep 2024 20:29:27 -0000 https://status.level.io/incident/436038#8217b91296680bb4c5f3752ad573b0e3ca629cc3f08e704e8fa9f388bf916eaf Level API recovered. Level API and Agents API are down https://status.level.io/incident/436038 Sat, 28 Sep 2024 19:53:31 -0000 https://status.level.io/incident/436038#e0412bbfa94574518e1b4dc6959fce489e60ac7e26d9b4bcf605003e2bd73b7a Agents API recovered.
Level API and Agents API are down https://status.level.io/incident/436038 Sat, 28 Sep 2024 19:49:03 -0000 https://status.level.io/incident/436038#2d3ea7d2f21d80819b1831b9a95a5160ab548899f821d5088c8da20f4d47a817 Agents API went down. Level API and Agents API are down https://status.level.io/incident/436038 Sat, 28 Sep 2024 19:38:31 -0000 https://status.level.io/incident/436038#dc338a9583a2ffb361e8ef1c192ee34863ce422be5a1687bdf0e5e9545d67aab Level API went down. Level API and Agents API are down https://status.level.io/incident/436038 Sat, 28 Sep 2024 12:29:56 -0000 https://status.level.io/incident/436038#c5ed240c94289df81e338465e8147685e81cbbb55a5bfb767fae2e656ca7da9d Level API and Agents API recovered. Level API and Agents API are down https://status.level.io/incident/436038 Sat, 28 Sep 2024 03:08:10 -0000 https://status.level.io/incident/436038#b390a02bfd5d8c99d3a11e13089890029f25ea6db6e2c1ea8c941f7f06c84966 Level API and Agents API went down. Agent Updates and Online Monitor are down https://status.level.io/incident/350540 Thu, 04 Apr 2024 11:51:53 -0000 https://status.level.io/incident/350540#05632a98874dd374169767dd9cb222c6bdf564603977183204b141b6b77c860a Online Monitor recovered. Agent Updates and Online Monitor are down https://status.level.io/incident/350540 Thu, 04 Apr 2024 02:55:23 -0000 https://status.level.io/incident/350540#a290795a58d96dfb68c55254bb6baf44298c9fa5c987537a1913de716576c3e1 Online Monitor went down.
Agent Updates and Online Monitor are down https://status.level.io/incident/350540 Thu, 04 Apr 2024 01:55:35 -0000 https://status.level.io/incident/350540#b25204d39814794bc48012ebbfcda71b2453ea6096b5a1a66a7bdcafd8e2b159 Agent Updates recovered. Agent Updates and Online Monitor are down https://status.level.io/incident/350540 Thu, 04 Apr 2024 01:45:40 -0000 https://status.level.io/incident/350540#f57bb915ba6f4df6af4f78780c4a11284b5473c84681ae68b38068f5bfbec437 Agent Updates went down. Level App and Marketing Website are down https://status.level.io/incident/314204 Fri, 19 Jan 2024 03:16:35 -0000 https://status.level.io/incident/314204#0730d3843347276406d307c028649ae90ce180fe0c4f52bf634c29cb8171c6a7 Level App recovered. Level App and Marketing Website are down https://status.level.io/incident/314204 Fri, 19 Jan 2024 01:29:30 -0000 https://status.level.io/incident/314204#c9bd561257a6f415058e36f8b616c138364016fe97fcdc1b88147d077f8f4299 Level App went down. Level App and Marketing Website are down https://status.level.io/incident/314204 Fri, 19 Jan 2024 01:11:19 -0000 https://status.level.io/incident/314204#c21b15b594aee867ce1fa268f16573287d7b1e20bc10b9e986da467dcbb9aee5 Level App recovered. Level App and Marketing Website are down https://status.level.io/incident/314204 Thu, 18 Jan 2024 22:50:19 -0000 https://status.level.io/incident/314204#599e76632f14d0c97288ece84d0dbf397bce75b5f83424a1af5cf45bd6f30276 Level App went down. Level App and Marketing Website are down https://status.level.io/incident/314204 Thu, 18 Jan 2024 22:42:44 -0000 https://status.level.io/incident/314204#94e318f5a68c605d1ab377e69eaca34e56776cecb5cb3fe612fe221f4d4b5390 Level App recovered. Level App and Marketing Website are down https://status.level.io/incident/314204 Thu, 18 Jan 2024 01:18:08 -0000 https://status.level.io/incident/314204#ed26b65bc52548474d007ea7df5bec60859ce299d944a0ffaa9848ea77b7ac81 Level App went down. 
Level App is down https://status.level.io/incident/314168 Wed, 17 Jan 2024 23:50:09 -0000 https://status.level.io/incident/314168#0fe551f019b8a231e51d94cf5a3454c7ac89bab0b4b7893ad389abd1213d0228 Level App recovered. Level App is down https://status.level.io/incident/314168 Wed, 17 Jan 2024 23:48:52 -0000 https://status.level.io/incident/314168#e8af8906e4214be7693d38f38781d6d5f8bc4c1ab484b0fbe388b84cc3771d8d Level App went down. Level App is down https://status.level.io/incident/299063 Tue, 12 Dec 2023 03:30:21 -0000 https://status.level.io/incident/299063#3fa5536383da89ad9a400589d4ecfa2bf4835201adad95fc11e5822e1203d9d6 Level App recovered. Level App is down https://status.level.io/incident/299063 Tue, 12 Dec 2023 03:29:00 -0000 https://status.level.io/incident/299063#050bb0a2e82a61bb43219629c7b5efbc3b119108beed7246915d45482cb595f8 Level App went down. Delayed Processing of Background Jobs https://status.level.io/incident/296444 Tue, 05 Dec 2023 00:50:00 -0000 https://status.level.io/incident/296444#084555bc95dad626d83e071087735cdcaacbeccf3229d93f58226caa92757b26

## Retrospective Report - Queue Incident on December 4th

I am writing to provide a detailed account of the incident that occurred on December 4th. This event, which lasted about four hours, led to a substantial delay in the processing of background jobs, impacting our system's overall performance. Despite this, I am relieved to report that our core remote control functionalities remained operational throughout the incident.

### Summary of Events:

- *Initial Detection:* At 1:17 PM EST, our support team was alerted to widespread customer-reported issues with Level, independent of any recent deployments.
- *DevOps Team Involvement:* Starting their investigation immediately, our team observed a massive surge in queued jobs, reaching over 700k.
- *Background Queue Impact:* Our background queue, responsible for various essential tasks, experienced an unprecedented growth from under 100 jobs to nearly 1 million jobs in just four hours.
- *Monitoring and Alert Failures:* Notably, the queue began growing around 8 AM EST, yet no alerts were received. Our APM (Application Performance Monitor) had indeed triggered an alert, but AlertOps failed to notify our DevOps team, leading to a delayed response.
- *Security Concern:* Alongside these events, a brute force attack was detected but deemed unrelated to the queue issue.

### Key Issues and Resolution Efforts:

- *Capacity Expansion:* Initially, we lacked sufficient workers to process the rapidly growing queue. We promptly expanded server capacities and increased the number of workers.
- *Queue Growth Rate:* Despite processing >17k jobs per minute, the queue continued to grow at an alarming rate of approximately 18k jobs per minute between 3:47 PM EST and 5:10 PM EST.
- *Redis Configuration Error:* An attempt to manage this growth led to a misconfiguration in Redis, exacerbating the problem.
- *Scaling Up Resources:* By 5:44 pm, we had added five more servers dedicated to managing the background job queue.
- *Increased Processing Power:* At 6:23 pm, we further increased our worker count to 800, enabling us to process around 34k jobs per minute.
- *Resolution:* By 7:50 pm, the queue had returned to normal levels.

### Preventative Measures and Future Plans:

- *Robust Monitoring:* We are intensifying our monitoring capabilities with redundant alerting systems for early detection of similar issues.
- *Resource Scalability:* Recognizing the need for flexible resources, we've doubled our server capacity and are planning more infrastructure enhancements.
- *Alert System Reliability:* A thorough investigation into the AlertOps malfunction is underway to ensure reliable future notifications.
- *Configuration Management:* We've heightened our vigilance in configuration changes, particularly under pressure, to prevent compounding issues.
- *Dependency Dashboards:* Development of monitoring dashboards for key systems like Redis, PSQL, Nginx, and HAProxy is in progress, complete with sophisticated alert mechanisms.

This incident has been instrumental in highlighting areas for improvement at Level. We are dedicated to learning from this experience and implementing the necessary changes to uphold the high standards of service you expect from us.

Thank you for your ongoing support and trust in Level.

Delayed Processing of Background Jobs https://status.level.io/incident/296444 Mon, 04 Dec 2023 18:17:00 -0000 https://status.level.io/incident/296444#0ebcb85bb840992df71c67b6ab62ee3c1337ffcce1c8bbce13ba05e7ab70a807 We are investigating reports of issues from customers. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:50:31 -0000 https://status.level.io/incident/279732#b82f47ef775779bc99026269c9b8c810b36d5a73df87aef2e5ff4f8acd24072c Level API recovered. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:49:59 -0000 https://status.level.io/incident/279732#68ae1f4b1ac8ffcdd42d20b6d909cff68e893cd48cf7fbdbf20e3c8089f4be78 Agents API and Marketing Website recovered. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:49:17 -0000 https://status.level.io/incident/279732#c28b22eff866ddbc8b1436c7d52ba1e746a5a8a356cc0a5031ad4b6183899692 Level API and Agents API went down.
Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:46:05 -0000 https://status.level.io/incident/279732#7fd6ce1b50e8d8d481532d8a47faf54104593c2a8e9b083c2d13da84b2491e68 Level API and Agents API recovered. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:45:03 -0000 https://status.level.io/incident/279732#0280a43dd6c9a560b1dba6ca6b7e12473f7f0cdc85e2c332cd523eda1cdfa892 Agents API went down. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:44:36 -0000 https://status.level.io/incident/279732#3c1609e30f04e6e3a698fd2f8803309fef3e01489abac2dc75af98d59864d3c9 Level API went down. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:39:50 -0000 https://status.level.io/incident/279732#eb8c781d0f4ca1267e20d058322136377e2c95275dc7cad986d7454d464486b8 Level API, Agents API, and Marketing Website recovered. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:38:41 -0000 https://status.level.io/incident/279732#64b5d81f19a80eefd425847a2aea8939d5242a6515727b248bf06bf130319e52 Level API went down.
Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/279732 Mon, 30 Oct 2023 14:36:23 -0000 https://status.level.io/incident/279732#f2965a24f8f39f282452d2711e4bcf11cd3aee53f8cb49d1af55aaa9da5c1b32 Agents API and Marketing Website went down. Emergency DB Maintenance https://status.level.io/incident/265126 Sat, 30 Sep 2023 19:00:00 +0000 https://status.level.io/incident/265126#09536580e26423dcd83c70c0b637472116b31cf86e6487acb87af1b18994d836 Maintenance completed Emergency DB Maintenance https://status.level.io/incident/265126 Sat, 30 Sep 2023 18:00:00 -0000 https://status.level.io/incident/265126#c0abf18116217ad8107422f16b3e43db7be096e238e407a9410ab2be7f2f4ec1 We've identified the need to replace our primary database cluster to ensure the highest level of performance and reliability for you. We've scheduled this critical maintenance for this coming Saturday (09/30/2023).
Level API and Agents API are down https://status.level.io/incident/201379 Fri, 28 Apr 2023 17:36:32 -0000 https://status.level.io/incident/201379#650f36d1237aa28e50d6ce0b68314554f067b91dd81645a0b097dada6961d1c4 Level API and Agents API recovered. Level API and Agents API are down https://status.level.io/incident/201379 Fri, 28 Apr 2023 17:26:17 -0000 https://status.level.io/incident/201379#f43d4a661a2f055cdd2f316672243fdc51d57406ae69cb59460534dcc235d33e Level API went down. Level API and Agents API are down https://status.level.io/incident/201379 Fri, 28 Apr 2023 17:25:53 -0000 https://status.level.io/incident/201379#756fe45dde44ab17c2f213b78c2266680de462f52c4c3e02e111d60cb51a4aa3 Agents API went down. Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/194871 Tue, 11 Apr 2023 00:19:19 -0000 https://status.level.io/incident/194871#f519fae8b6c0d2469c42dd964a114cf537136123c310161a50b8d8577128126e Level API and Agents API recovered.
Level API, Agents API, Marketing Website, and 1 other service are down https://status.level.io/incident/194871 Tue, 11 Apr 2023 00:03:07 -0000 https://status.level.io/incident/194871#10bc19662959633a911bf9ba5642fb0c7c1f5ab152b53ec23ee27e9c07a9634f Level API, Agents API, and Marketing Website went down. Level API and Agents API are down https://status.level.io/incident/191451 Fri, 31 Mar 2023 15:43:47 -0000 https://status.level.io/incident/191451#316312fe21b3c320fd9c8804ab8deee5e8a5e13ec7ff28e00a7a8102ad01a517 Level API and Agents API recovered. Level API and Agents API are down https://status.level.io/incident/191451 Fri, 31 Mar 2023 15:41:53 -0000 https://status.level.io/incident/191451#54f690f18a74d6bff5400a6ac45d280d2d6049c1ebca2b7a4a5b6dcb36f2a6d8 Level API and Agents API went down. Linux Endpoint Management https://status.level.io/incident/188776 Sat, 25 Mar 2023 19:00:00 -0000 https://status.level.io/incident/188776#03c0794ae0e77eca3cc413ef8d069f3d5563aea0f64d3238bece3052bd224ecc

# Retrospective

On Friday, March 24th Level had its first real product outage. For about six hours, all Linux devices went completely offline. The outage lasted from about 11 am to 5 pm EST.
Unfortunately, during the outage, the Level service also became hung on many existing Linux agents. Those agents are no longer reaching out to the Level API for instructions. The issue that caused them to go offline has been resolved, but they are unlikely to come back online without manually restarting the Level service. The easiest way to bring the agents back online is to run:

`sudo systemctl restart Level`

Otherwise, the service will start again the next time the machine is rebooted.

We understand that Level is the primary way that you access these machines, so bringing them back up without it is a significant burden. We feel absolutely terrible for the trouble that this is causing. We recognize that the uptime and reliability of your endpoint management tools are of the utmost importance. I’d like to walk you through how this happened and what we’re doing to keep it from happening again.

### What Happened?

Friday began innocently enough: we were excited to go live with a minor agent update in the morning. It was just a simple fix for an edge case that occurred if an agent’s database ever became corrupted. This fix had been running in our staging environment for about a week with zero issues, so we felt quite confident in it.

At 10:50 am EST, we pushed the button to go live with the change. It wasn’t long before we started seeing problems. Internally, we have Level agents installed on all of the servers that host the Level API, databases, etc. We quickly started getting alert emails that these servers had gone offline. We immediately jumped into these servers, only to discover that they were fine; it was the connection to Level that had been severed.

The obvious cause would be the agent patch that we had just released. It certainly didn’t seem like that small fix could cause this much trouble, but the timelines matched up. We rolled back to a previous agent version, but it didn't bring the devices back online. That’s when we realized something else was going on.
### Build Process & GitHub Actions

Level has a simple method of determining whether the agent binary should use a production config or a staging config. Essentially, it looks at the current git branch at build time. If the branch is “master”, it pulls in the production config and embeds those values into the binary; otherwise, it pulls in the staging config values. Most important for this bug, these values include the URL for downloading future updates.

On Friday morning, GitHub published a post announcing that, “out of an abundance of caution,” they were replacing their RSA SSH host key: https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/

As usual, we read and digested the announcement, our developers updated their known hosts, and everything seemed fine. The post does mention the checkout GitHub Action, which we use, but it indicates that our use case would not be impacted.

When we pushed out that update in the morning, it ran through our GitHub Actions pipeline as normal. We use these pipelines to build the agent binary, sign it, and ship it off to S3. Everything was green (everything should not have been green).

During the Linux build, the git clone didn’t behave as expected. Because of issues surrounding the SSH host key rotation, it seems to have fallen back to using the GitHub HTTP API to fetch the code. This works somewhat like the SSH clone that we were expecting, except that it only grabs the code; it doesn’t create a local .git repository. That turned out to be a big problem.

I mentioned earlier that we use git to determine which config file to build the binary with. During the build, we query the local git repository and check which branch we’re on. Unfortunately, because of the HTTP fetch, we were not actually on a branch. During our investigation we came across these helpful logs from the binary build:

```
Building Linux Agent...
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
Built cmd/agent/level-linux-amd64
```

Well, they would have been helpful, had we had something in place to act on them. Instead, as you can see, the Linux binary was built. This set off a catastrophic series of events:

- Because git wasn’t able to provide the branch name, our build process defaulted to using the staging binary config.
- The staging binary was shipped off to the production S3 buckets.
- Production agents saw that there was a new update available and downloaded this staging binary right away.
- Staging binaries cannot communicate with our production servers, so the agents immediately went offline.
- The updated agents were now looking to the staging S3 buckets for future updates, which added significant confusion about why we couldn’t get them back on the proper build.

It took a while to trace through these steps and figure out what had happened. Even after all of this, we should still have been able to recover the agents and bring them back online without user interaction. The plan became to reverse the process and put the production binaries on the staging S3 buckets. That way the Linux agents would see the available update, download it, and then be back on production.

Unfortunately, none of that happened. It seems that we have a bug on Linux that can occur if an update is applied to an agent with a bad API key (bad in this case meaning that it's using a production API key to talk to the staging API). The agent seems to hang completely as it restarts to apply the update. Systemd shows that it's active, but the logs (both local and API side) show that nothing is happening. Unfortunately, this is the state that many Linux agents are now in: completely cut off from anything that we could do to recover them.
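To make the failure mode above concrete, here is a minimal sketch of a branch-based config selection that refuses to guess when git cannot report a branch. It is not Level's actual build code: the function names `current_branch` and `config_for_branch`, the `BuildError` exception, and the config names `"production"` and `"staging"` are all hypothetical, chosen only for illustration. The one real external call is `git rev-parse --abbrev-ref HEAD`.

```python
import subprocess
from typing import Optional


class BuildError(RuntimeError):
    """Raised when the build cannot tell which config it should embed."""


def current_branch(repo_dir: str = ".") -> Optional[str]:
    """Return the checked-out branch name, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # git is not installed, or repo_dir does not exist.
        return None
    if result.returncode != 0:
        # Covers "fatal: not a git repository", the error seen in the build logs.
        return None
    branch = result.stdout.strip()
    # In a detached-HEAD checkout git prints the literal string "HEAD".
    return branch if branch and branch != "HEAD" else None


def config_for_branch(branch: Optional[str]) -> str:
    """Map a branch name to a config name (hypothetical names), never defaulting."""
    if branch is None:
        # Fail the build instead of silently falling back to staging.
        raise BuildError("could not determine git branch; refusing to build")
    return "production" if branch == "master" else "staging"
```

A build script written this way would call `config_for_branch(current_branch())` and let the `BuildError` abort the pipeline. The design choice is that an unknown branch is treated as an error rather than as "not master", so a missing `.git` directory turns the build red instead of producing a staging binary.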
### How are we going to prevent this in the future?

Our biggest blunder here was allowing a build with errors to appear successful. We cannot let that happen again, so here is what we’re going to do:

- We’re reworking our build pipelines to make sure they fail fast if anything isn’t right.
- We’re moving to self-hosted GitHub runners so that we have more control over our environment.
- We’re adding a final check outside of the build step, immediately prior to the binary upload, that verifies that the build is for production before shipping it.
- We already have "latest" and "stable" release channels for production. We will avoid releasing to the "stable" channel without letting the update run on "latest" for a longer period of time, even if the fix seems harmless.
- We're going to start using the watchdog support built into systemd. This is essentially a heartbeat that the agent needs to send to systemd every X minutes. If systemd doesn't hear from the agent in that time, it will forcefully restart the service (even if systemd shows "active").

Once again, we're really sorry for the disruption this has caused. We understand that our customers rely on Level to manage their endpoints and we take that responsibility very seriously. We are committed to doing everything in our power to prevent similar incidents from happening in the future.

We appreciate your patience and understanding as we work to implement the changes outlined above. Please do not hesitate to reach out to our [support team](mailto:support@level.io) if you have any further questions or concerns.

Thank you for your continued trust in Level.

Linux Endpoint Management https://status.level.io/incident/188776 Fri, 24 Mar 2023 19:00:00 -0000 https://status.level.io/incident/188776#098a191839426da4be3f869bbd1a4a61b0f8103ef6b0aaa8f2cb0299ade83601 Linux endpoint management is working on all newly created devices. Existing devices should be back shortly.
Linux Endpoint Management https://status.level.io/incident/188776 Fri, 24 Mar 2023 16:48:00 -0000 https://status.level.io/incident/188776#efa9d88ef4cf00eb6b78f5407c51631fd0d0ba7ae932b6516d2863193f2feaaf We have identified the root cause of the issue and are working toward a resolution. Thanks for your patience!

Linux Endpoint Management https://status.level.io/incident/188776 Fri, 24 Mar 2023 15:10:00 -0000 https://status.level.io/incident/188776#a384eddb8cf10293fdb3144a05763513895089a860eba03fe03a70d9ac1d4e35 We're aware of an issue impacting Linux endpoint management. We're working on a resolution and will keep everyone updated.