Common practice used to be throwing issues over the fence and expecting someone else to figure them out. But root cause analysis (RCA) should be standard across all tech teams. In this post, we’ll take a look at RCA from an operational perspective.
Over the last twenty years, the way companies do business and the way they build applications have changed dramatically. With large monolithic applications, annual deployments and on-premises IT infrastructure, it was almost obvious where an issue originated.
Nowadays, we don’t have a big room with all of our servers to monitor or flashing red lights to tell us that a particular server is down, and we don’t have the luxury of pointing out that no code changes were deployed for the last year.
Instead, we have complex, distributed systems with microservices and cloud computing solutions and continuous deployments. We don’t have any visibility into our remote servers, and code is being deployed weekly or even daily.
An unfortunate side effect of this otherwise beneficial progression is that it’s become increasingly difficult to trace an error back to the specific change that caused it. How do we know if an operational issue is related to code or to infrastructure? That’s more or less where root cause analysis comes in.
Let’s take a look at what that means and how we can identify root cause in the world of IT Operations.
Anomaly Detection and Root Cause Analysis
Let’s start with a (very real) hypothetical. Imagine you have a mission-critical application throwing errors and shutting clients out of the system, but no one has changed any code. Nothing happened from a developer standpoint, but you see that some infrastructure component, a database let’s say, is unavailable. This is a classic operational issue.
The first question is, how did you even know about it in the first place? How does the IT Ops team become aware that something isn’t working as expected? The next question, of course, is how do you fix it?
There are two core capabilities that operations types (e.g. DevOps and SREs) are interested in for the purpose of answering these questions: anomaly detection and root cause analysis.
Anomaly detection is the capability that points us to the problem – all day, every day this database is called on for user login information and it connects with no problem, but this time it didn’t perform as expected. That’s an anomaly. Root cause analysis is the process of understanding the exact element or state that caused the unexpected behavior.
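To make the idea of an anomaly a little more concrete, here is a minimal Python sketch. It assumes we already collect a per-minute count of failed database connections (the numbers below are made up) and flags a value that sits far above the usual baseline. The function name, the threshold and the data are all hypothetical, and real monitoring tools use far more sophisticated models than this.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard
    deviations above the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: anything above it is unusual.
        return current > baseline
    return (current - baseline) / spread > threshold


# Hypothetical failed-connection counts per minute on a normal day.
failures_per_minute = [0, 1, 0, 0, 2, 1, 0, 1]

print(is_anomalous(failures_per_minute, 1))   # False
print(is_anomalous(failures_per_minute, 40))  # True
```

Note that a check like this only tells us that something is off. It says nothing about why, which is where root cause analysis takes over.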
In the past, there was a clearer split between these capabilities. Operations dealt with anomalies, developers dealt with root cause. That’s where the familiar “throwing issues over the fence” comes from.
But that’s not enough anymore. Nowadays, the line between infrastructure and code is not as clear-cut as it used to be. With this, we’ve seen the emergence of new functions – DevOps, Site Reliability Engineers (SREs) and the like – ready to investigate complex issues in distributed systems.
Understanding the Root of Root Cause
Applications are a complex combination of code and infrastructure, not unlike the human body. You can think of code as the brain and infrastructure as the limbs.
Now, let’s say that something is wrong with your left arm. How do you know? Unless you’re able to look down and physically see that you’ve been injured, the only true way to know is when your brain receives a signal from your arm (i.e. pain) or when your brain tries to send a signal that isn’t received correctly by your arm (i.e. paralysis, unexpected movement). At the end of the day, any error with a part of the body is identified by the brain.
Because, for the most part, we consider the human body to have a static structure, we expect our arms (and the rest of our bodies) to continue working normally unless something happens to them. We don’t immediately assume that the brain is the issue.
Applications behave similarly, though there are notable differences. As the code runs, it communicates with remote assets like the database, firewall, storage, etc. When something goes wrong, instead of sending out that signal and getting back the usual response, an error message comes back or some other response that tells the code that the action was not successful.
Unlike the human body, though, application code in the modern world does not remain unchanged for long. New features and fixes are being added constantly, so in applications, the first assumption when an error comes up is usually that a newly introduced piece of code isn’t performing as expected. Still, that’s only an assumption and in most cases, an investigation is required to identify the root cause of the issue.
Root Cause Analysis is “Rooted” in Code Signals
How do you know when something isn’t working as expected, then? You know by paying attention to code signal anomalies – whether they come in the form of log entries, slowdowns or poor customer experiences. Then you need to investigate the root cause based on what those signals contain.
In this sense, developers and operations teams should be looking at the same thing when it comes to investigating errors and exceptions – we all need to find traces of code signals that point to the root cause of the issue.
The first question we asked is how do we know that something is wrong in the first place. Next, we have to ask why. Did something change in the code that affected the connection or did something change in the “limb” that’s causing it to be unresponsive or to respond in unexpected ways?
Unfortunately, in complex, distributed systems it’s become increasingly difficult to trace an error back to the specific change that caused it, and the current tools and processes in place aren’t providing the granularity we need.
Which Code Signals Point to Root Cause?
LOG FILES –
We all use logs. We don’t always like it, but we do. Despite being manual, shallow and unstructured, log files do give us a sense of when something goes wrong and can even provide a small glimpse into where and why (depending on logging verbosity and, of course, if it was logged in the first place).
When we look at log files, what we’re looking at is simplified code signals. Every log statement ever written to a log file was written by developers as a part of their code.
The real challenge is to differentiate between logs that point to code issues versus logs that point to infrastructure issues. The problem is that the line between the two is almost never clear-cut.
If the code tries to read a file and the file can’t be found, is the problem that the code is trying to read from a file that doesn’t exist – maybe because the code calculates the filename dynamically and gets it wrong? Or is the problem that the file was not correctly configured into the deployment?
Either way, we see in the logs “File not found. Could not read from _____”.
At this point, root cause analysis comes in. Why were we trying to read from that file in the first place? What was the code doing when the error occurred? With that information, we should be able to understand the root cause of the issue.
Knowing that we aren’t able to read from a file or that we can’t connect to a database is the error, but not the root cause. The general nature of log files, while supplying basic code signals, doesn’t allow us to get enough information to understand specifically what the code was trying to do.
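As a rough illustration of the gap, consider the Python sketch below. The `report_path` and `read_report` helpers, the `reports` logger name and the filename pattern are all hypothetical. The point is only that the same “File not found” message becomes much more useful once it carries the state the code was working with, so we can tell a miscalculated filename apart from a file that was never deployed.

```python
import logging
from pathlib import Path

logger = logging.getLogger("reports")  # hypothetical logger name


def report_path(base_dir, customer_id):
    # Hypothetical dynamic filename calculation. A wrong customer_id
    # here produces the same "file not found" as a missing deployment.
    return Path(base_dir) / f"report_{customer_id}.csv"


def read_report(base_dir, customer_id):
    path = report_path(base_dir, customer_id)
    try:
        return path.read_text()
    except FileNotFoundError:
        # Log the inputs alongside the error, not just the final path.
        # base_dir_exists hints at a deployment problem when False and
        # at a filename problem when True.
        logger.error(
            "File not found. Could not read from %s "
            "(base_dir=%r, customer_id=%r, base_dir_exists=%s)",
            path, base_dir, customer_id, Path(base_dir).is_dir(),
        )
        return None
```

Even this is a manual approach: it only helps if someone thought to log the right variables ahead of time.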
PERFORMANCE METRICS –
Slowdowns and other performance metrics captured by Application Performance Monitoring (APM) tools are another form of code signals, though more indirectly.
The performance level of a piece of code, something that these tools track, can provide a sense of whether or not a problem exists. Many of these tools even work with an agent that allows them to detect where a slowdown occurred, down to the line of code. What these tools can’t provide is the state of the code, meaning – what the code was actually trying to do – at the time that the performance issue occurred.
Again, did the code slow down because it was trying to connect to a database with faulty configuration? Or did it slow down because the code relied on flawed logic?
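One hedged way to picture the difference between a timing and a timing with state is the small Python sketch below. The `timed` helper and the values passed to it are hypothetical and are not how any particular APM agent works. The sketch simply measures a block of code and, if it runs slower than a chosen limit, reports the state the caller handed in.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(operation, slow_after_seconds, **state):
    """Time a block of code and keep the caller-supplied state with it."""
    record = {"operation": operation, "state": state, "slow": False}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["elapsed"] = time.perf_counter() - start
        record["slow"] = record["elapsed"] > slow_after_seconds
        if record["slow"]:
            # The elapsed time says that it was slow; the state helps say why.
            print(f"SLOW {operation}: {record['elapsed']:.3f}s state={state}")


# Hypothetical usage: the sleep stands in for a slow database call.
with timed("load_user", 0.05, db_host="db.internal.example", user_id=42):
    time.sleep(0.1)
```

With the host and user ID attached to the slow call, we at least have a starting point for deciding whether to look at configuration or at logic.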
BAD CUSTOMER EXPERIENCES –
Arguably the most indirect form of code signal we can receive is the kind that comes from our customers. Some may argue that these aren’t code signals per se, but they’re certainly relevant to this conversation.
One way that we detect and resolve issues in our applications is by receiving and interpreting customer complaints, trying to convert those into an understanding of the code signals being sent to the customer and identifying the root cause of the issue based on that.
As you can see, all of the information that we can get about our application comes in the form of code signals – unless we missed it completely and our customers are the ones telling us about it… but even then.
What’s the big takeaway? The key to root cause, regardless of which room you sit in, dev or ops, is code signals. Unfortunately, most companies don’t have access to the level of code signals that they need to be able to identify and resolve issues in a timely manner and before they start to affect the company’s bottom line.
OverOps provides direct access to the code signals that are behind all application errors including uncaught and swallowed exceptions. From a high-level understanding of an application’s overall health down to the variable state and source code where an error occurred, OverOps helps DevOps engineers identify and investigate errors in production.
One final thought before you go. It’s important to care for our applications like we care for our health. If your arm hurts, do something about it. Be proactive. That might mean taking an anti-inflammatory, scheduling an appointment at the doctor, or even just looking down to make sure that on the outside it seems okay.
Just don’t ignore it, assuming it’ll be fine, and wait for somebody walking past to point out that your arm fell off completely and you’ve been leaving a trail of blood behind you.