Degraded

Linux Endpoint Management

Mar 24 at 11:10am EDT
Affected services
Agents API

Resolved
Mar 25 at 03:00pm EDT

Retrospective

On Friday, March 24th, Level had its first real product outage. For about six hours, all Linux devices went completely offline. The outage lasted from about 11 am to 5 pm EDT. Unfortunately, during the outage, the Level service also became hung on many existing Linux agents. Those agents are no longer reaching out to the Level API for instructions. The issue that caused them to go offline has been resolved, but they are unlikely to come back online without manually restarting the Level service.

The easiest way to bring the agents back online would be to run:

sudo systemctl restart Level

Otherwise, the service will start again the next time the machine is rebooted. We understand that Level is the primary way that you access these machines, so bringing them back up without it is a significant burden.
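
If you still have SSH access to the affected machines from another system, one rough way to do this at scale is a simple loop over a host list. This is only a sketch: the hosts.txt file and the admin account are placeholders, and it assumes key-based SSH and passwordless sudo for that account.

# hosts.txt contains one hostname or IP per line (placeholder file)
while read -r host; do
  ssh -n -o ConnectTimeout=10 "admin@$host" 'sudo systemctl restart Level' \
    && echo "$host: restarted" \
    || echo "$host: FAILED"
done < hosts.txt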

We feel absolutely terrible for the trouble that this is causing. We recognize that the uptime and reliability of your endpoint management tools are of the utmost importance. I’d like to walk you through how this happened and what we’re doing to keep it from happening again.

What Happened?

Friday began innocently enough: we were excited to go live with a minor agent update in the morning. It was a simple fix for an edge case that occurred if an agent's database ever became corrupted. This fix had been running in our staging environment for about a week with zero issues, so we felt quite confident in it.

At 10:50 am EDT, we pushed the button to go live with the change. It wasn't long before we started seeing problems. Internally, we have Level agents installed on all of the servers that host the Level API, databases, etc. We quickly started getting alert emails that these servers had gone offline.

Immediately, we jumped into these servers only to discover that they were fine; it was the connection to Level that had been severed. The obvious suspect was the agent patch we had just released. It certainly didn't seem like that small fix could cause this much trouble, but the timelines matched up. We rolled back to a previous agent version, but it didn't bring the devices back online. That's when we realized something else was going on.

Build Process & GitHub Actions

Level has a simple method of determining whether the agent binary should use a production config or a staging config. Essentially, it looks at the current git branch at build time. If the branch is "master", it pulls in the production config and embeds those values into the binary; otherwise, it pulls in the staging config values. Most importantly for this incident, these values include the URL used to download future updates.
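
For illustration, the check is roughly equivalent to something like this in the build script (the file names here are illustrative, not our actual layout):

# Decide which config to embed based on the current branch
BRANCH=$(git rev-parse --abbrev-ref HEAD)
if [ "$BRANCH" = "master" ]; then
  CONFIG=configs/production.env   # production API and update URLs
else
  CONFIG=configs/staging.env      # staging API and update URLs
fi
# ...the values in $CONFIG are then embedded into the binary at build time...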

On Friday morning, GitHub published an announcement that, "out of an abundance of caution," they were invalidating their RSA SSH host key:

https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/

As usual, we read and digested the announcement, our developers updated their known_hosts entries, and everything seemed fine. The post does mention the checkout GitHub Action, which we use, but it indicated that our use case would not be impacted.

When we pushed out that update in the morning, it ran through our GitHub Actions pipeline as normal. We use these pipelines to build the agent binary, sign it, and ship it off to S3. Everything was green (everything should not have been green).

During the Linux build, the git clone didn't behave exactly as expected. Because of issues surrounding the SSH host key rotation, it seems to have fallen back to using the GitHub HTTP API to fetch the code. This works somewhat similarly to the SSH clone we were expecting, except that it only grabs the code: it doesn't create a local .git repository. That turned out to be a big problem.
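
In a workspace fetched that way, even a simple branch query (like the one in the sketch above) fails outright:

git rev-parse --abbrev-ref HEAD
fatal: not a git repository (or any of the parent directories): .git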

I mentioned earlier that we use git to determine which config file to build the binary with. During the build, we query the local git repository and check which branch we're on. Unfortunately, because of the HTTP fetch, there was no local repository to query at all, let alone a branch to be on. During our investigation, we came across these helpful logs from the binary build:

Building Linux Agent...
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
Built cmd/agent/level-linux-amd64

Well, they would have been helpful, had we had something in place to act on them. Instead, as you can see, the Linux binary was built anyway. This set off a catastrophic series of events:

  • Because git wasn’t able to provide the branch name, our build process defaulted to using the staging binary config.
  • The staging binary was shipped off to the production S3 buckets.
  • Production agents saw that there was a new update available and downloaded this staging binary right away.
  • Staging binaries cannot communicate with our production servers so the agents immediately went offline.
  • The updated agents were now looking to the staging S3 buckets for future updates, which added significant confusion about why we couldn't get them back onto the proper build.

It took a while to trace through these steps and figure out what had happened. Even after all of this, we should have still been able to recover the agents and bring them back online without user interaction. The plan became to reverse the process and put the production binaries on the staging S3 buckets. That way the Linux agents would see the available update, download it, and then be back on production. Unfortunately, none of that happened.

It seems that we have a bug on Linux that can occur if an update is applied to an agent with a bad API key (bad in this case meaning a production API key trying to talk to the staging API). The agent hangs completely as it restarts to apply the update. Systemd shows that it's active, but the logs (both local and API-side) show that nothing is happening. Unfortunately, this is the state that many Linux agents are now in: completely cut off from anything we could do to recover them.
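
If you want to check whether a particular machine is in this state, comparing the unit status with its recent journal output will show the mismatch (using the same service name as the restart command above):

sudo systemctl status Level                     # reports the unit as active (running)
sudo journalctl -u Level --since "1 hour ago"   # but no recent agent activity is logged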

How are we going to prevent this in the future?

Our biggest blunder here was allowing a build with errors to appear successful. We cannot let that happen again, so here is what we’re going to do:

  • We’re reworking our build pipelines to make sure they fail fast if anything isn’t right.
  • We’re moving to self-hosted Github runners so that we have more control over our environment.
  • We’re adding a final check outside of the build step, immediately prior to the binary upload, that verifies that the build is for production before shipping it.
  • We already have "latest" and "stable" release channels for production. We will avoid releasing to the "stable" channel without letting the update run on "latest" for a longer period of time - even if the fix seems harmless.
  • We're going to start using the Watchdog config built into systemd. This is essentially a heartbeat that the agent needs to send to systemd every X minutes. If systemd doesn't hear from the agent in that time, it will forcefully restart the service (even if systemd shows "active"). A minimal sketch follows this list.
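
As a rough sketch of that pre-upload gate (the STAGING_UPDATE_URL variable is a placeholder, not our actual tooling):

set -euo pipefail                                              # fail the step fast on any error
if strings level-linux-amd64 | grep "$STAGING_UPDATE_URL" > /dev/null; then
  echo "Staging config detected in a production artifact; refusing to upload"
  exit 1
fi
# ...only if we get here does the binary go to the production S3 bucket...

And for the watchdog, a minimal sketch of the systemd unit settings involved, assuming the agent sends periodic WATCHDOG=1 keep-alives via sd_notify (the interval shown is illustrative):

[Service]
Type=notify
WatchdogSec=300        # systemd expects a keep-alive at least every 5 minutes
Restart=on-watchdog    # force a restart if the keep-alives stop arriving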

Once again, we're really sorry for the disruption this has caused. We understand that our customers rely on Level to manage their endpoints and we take that responsibility very seriously. We are committed to doing everything in our power to prevent similar incidents from happening in the future.

We appreciate your patience and understanding as we work to implement the changes outlined above. Please do not hesitate to reach out to our support team if you have any further questions or concerns. Thank you for your continued trust in Level.

Updated
Mar 24 at 03:00pm EDT

Linux endpoint management is working on all newly created devices. Existing devices should be back shortly.

Updated
Mar 24 at 12:48pm EDT

We have identified the root cause of the issue and are working toward a resolution. Thanks for your patience!

Created
Mar 24 at 11:10am EDT

We're aware of an issue impacting Linux endpoint management. We're working on a resolution and will keep everyone updated.