Root Cause Analysis (RCA) or simply “Root Cause” are terms often used when troubleshooting enterprise application behavior.

A quick web search shows that “Root Cause” is a term that describes a wide range of approaches, tools and techniques to uncover the cause of an issue.

More specifically, the term describes the process of understanding the exact element or state that caused unexpected behavior. But what does this translate to in terms of content? What level of information is needed today in order to find the root cause of an issue? And is it enough? In this post, we will answer these questions and introduce what content is needed to understand the true root cause of any issue.

Understanding “Root Cause” in Today’s World

What is considered “Root Cause” today? Many think it is a stack trace or a slow method identified by an APM tool. Here are 4 common methods used by various APM and Log Analysis tools and their respective shortcomings.

  1. Using transaction variables written into a log file. While often helpful, this provides limited visibility. Variables not logged could be critical to determining the true root cause.
  2. Being a developer myself, back in the day I used to treat database queries as the “one source of truth”. For application troubleshooting, looking at the executed query often tells you a lot about the data returned to populate a screen. Some APM tools allow enabling a DB trace, and some introduce the concept of “DB bind variables”. This kind of debugging takes a hit with today’s schemaless persistence, where document storage takes precedence.
  3. Various APM tools help identify the right layer of code where an issue occurred. In most cases, this comes with a stack trace. Often, however, more information is needed to understand why there was an issue with that layer of code. In order to debug further, we have to go back to the logs.
  4. Is identifying the right “business logic location” considered finding root cause? APM tools allow you to identify when a method is slow. However, knowledge of the variables/data that caused the method slowdown is missing.
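The first shortcoming above is easy to see in a concrete sketch. The post doesn’t tie itself to a language, so this is a minimal Python illustration; the `apply_discount` function, the discount codes and the order values are all hypothetical. The developer logged the variable that seemed relevant at the time, while the unlogged input is the one that actually causes the failure:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")

def apply_discount(order_total, discount_code):
    # Only order_total was deemed worth logging when this was written.
    log.info("applying discount, order_total=%s", order_total)
    rates = {"SAVE10": 0.10, "SAVE20": 0.20}
    # The unlogged discount_code is what actually triggers the failure:
    # an unknown code raises a KeyError.
    return order_total * (1 - rates[discount_code])

try:
    apply_discount(100.0, "SAVE15")
except KeyError as exc:
    # The log captures the order total, but not the bad code that caused it.
    log.error("discount failed: %s", exc)
```

The log file tells you a discount failed for a $100 order, but the variable that would explain why was never written out. That is the “developer foresight” problem in miniature.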

Ultimately, many of these methods for root cause identification require developer foresight (writing variables to the logs) or require you to turn on specific tooling features such as “Data Collectors” or “DB Bind Variables”. Often, these practices are implemented sparingly due to latency and overhead costs.

This raises the question: is true root cause something you always have access to, or something you have to turn on in advance?

The 7 Components of “True Root Cause”

To get “True Root Cause”, we need access to additional content – we need the ability to go deep, down to the lowest level of detail – without dependency on developer or operational foresight.

Below are the 7 components needed to attain the true root cause of any issue:

1. Code Graph

Any code, module or component could be invoked in a variety of different ways. Microservices can be chained together to produce larger business flows. Understanding the code graph (i.e. the index of all possible execution paths) corresponding to an issue goes a long way in determining its root cause.
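One way to picture a code graph is as an index of call paths between components. The sketch below is a hypothetical Python illustration, not any particular tool’s data model: the service names and the adjacency structure are invented for the example. Given an error surfacing in one component, enumerating the paths that reach it narrows the search:

```python
# Hypothetical call graph between services/modules, as adjacency lists.
CALL_GRAPH = {
    "gateway": ["auth", "orders"],
    "auth": [],
    "orders": ["inventory", "billing"],
    "inventory": [],
    "billing": [],
}

def execution_paths(graph, node):
    """Enumerate every call path starting at `node`."""
    children = graph.get(node, [])
    if not children:
        return [[node]]
    return [[node] + path
            for child in children
            for path in execution_paths(graph, child)]

paths = execution_paths(CALL_GRAPH, "gateway")
# If an error surfaced in "billing", only the paths that actually reach it
# are relevant to the investigation.
suspect = [p for p in paths if "billing" in p]
```

Even in this toy graph, knowing the error came from “billing” eliminates two of the three possible execution paths before any deeper digging starts.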

2. The Source Code

The source code that was executing at the moment of the incident can provide additional context on application behavior.

3. Exact Line of Code

Pinpointing the exact line number where an incident occurs is important. In many cases, an exception is thrown in one area but caught or logged in a different area, so additional context might be needed.
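The throw-here, catch-there pattern is worth seeing concretely. In this hedged Python sketch (the function names are hypothetical), the failure originates in one function but is caught and reported two frames away; preserving the full traceback is what keeps the original line of failure visible:

```python
import traceback

def parse_amount(raw):
    # The failure actually originates here, on this line...
    return int(raw)

def handle_request(payload):
    try:
        return parse_amount(payload["amount"])
    except Exception:
        # ...but it is caught and logged here, in a different function.
        # traceback.format_exc() preserves the original failing frame.
        return traceback.format_exc()

report = handle_request({"amount": "ten"})
```

If only the catch site were logged, the error would appear to come from `handle_request`; the traceback is what points back to `parse_amount`.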

4. Data and Variables

Arguably, the most important component of “True Root Cause” – context or, in other words, the data and variables associated with the incident. In production especially, the code encounters many different scenarios. Code paths could present a “happy path” scenario for one dataset but a “failure path” for a different dataset. This gets even trickier with microservices, as these could be invoked from a variety of workflows.

For true root cause, it’s important to understand which specific data inputs caused the issue being investigated.
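A short Python sketch makes the happy-path/failure-path distinction concrete (the `normalize_scores` function and its inputs are hypothetical). The same line of code succeeds or crashes depending entirely on the data it receives, which is why the inputs themselves are part of the root cause:

```python
def normalize_scores(scores):
    # Happy path: scores with a non-zero total normalize cleanly.
    # Failure path: scores summing to zero hit a ZeroDivisionError
    # on the very same line of code.
    total = sum(scores)
    return [s / total for s in scores]

happy = normalize_scores([2, 3, 5])   # works: returns [0.2, 0.3, 0.5]

try:
    normalize_scores([1, -1])         # same code, different data: crashes
except ZeroDivisionError as exc:
    failure = exc
```

A stack trace alone would point at the division; only the captured input `[1, -1]` explains why that line failed here and nowhere else.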

5. Log Statements (including debug and trace level) in Production

Using logs is the most common troubleshooting mechanism. Logs provide system visibility by showing a sequence of events in chronological order. Unfortunately, logs reveal the consequence of what happened – not the underlying reason why it happened.

Plus, modern software logging has multiple levels – ERROR, WARN, INFO, DEBUG, TRACE, etc. Production systems usually limit visibility to ERROR. This makes it tougher to get to the true root cause because content available at lower log levels isn’t necessarily accessible.
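This visibility gap is easy to demonstrate with Python’s standard `logging` module (the logger name and messages below are hypothetical). With the logger set to a typical production level, the DEBUG line carrying the explanatory context is silently discarded:

```python
import logging

logger = logging.getLogger("orders")
logger.addHandler(logging.StreamHandler())

# Typical production configuration: only ERROR and above are emitted.
logger.setLevel(logging.ERROR)

order_id, retry_count = "A-1001", 3

# The context that would explain the failure lives at DEBUG level...
logger.debug("retrying order %s, attempt %d", order_id, retry_count)

# ...but only this line actually reaches the production log.
logger.error("order %s failed after retries", order_id)
```

The debug statement exists in the code, but in production it might as well not: the retry count that explains the failure never makes it into the log.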

6. System or Environment Variables

As a developer, I used to use System or Environment variables as a “switch” to control application behavior. The advantage of this was to have one deployable artifact but control application behavior based on environment. The flip side of this approach is that if these variables are not passed or are incorrectly passed, application behavior will suffer. The ability to investigate these is very helpful in troubleshooting issues.
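The environment-variable “switch” pattern, and its failure mode, can be sketched in a few lines of Python. The variable name `APP_ENV` and the endpoint URLs are hypothetical placeholders for this illustration:

```python
import os

def payment_endpoint():
    # One deployable artifact; behavior selected per environment.
    env = os.environ.get("APP_ENV")
    if env == "prod":
        return "https://payments.example.com/v2"
    if env == "staging":
        return "https://staging.payments.example.com/v2"
    # If the variable is missing or misspelled, the app silently falls
    # back to the sandbox -- exactly the failure mode that is hard to spot
    # without visibility into the environment at the time of the incident.
    return "https://sandbox.payments.example.com/v2"

os.environ["APP_ENV"] = "prod"
prod_url = payment_endpoint()

del os.environ["APP_ENV"]      # a missing variable changes behavior silently
fallback_url = payment_endpoint()
```

Nothing crashes when the variable goes missing; the application just quietly does the wrong thing, which is why capturing environment state matters during an investigation.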

7. Mapping Events to Specific Applications, Releases, Services, Etc.

Applications are built continuously. The ability to map anomalies to the corresponding artifact (including application, version, service, etc.) is essential in the process of finding, understanding and, eventually, fixing bugs.

Most engineers are familiar with the period of time immediately after deploying to production, when everyone is holding their breath and hoping that nothing goes wrong. When the servers crash within a few hours of deployment, it’s not hard to make the connection between the issue and the newest release, but what about when something goes wrong three months after deployment? Mapping issues to the relevant application, deployment and/or service can be crucial when trying to resolve them.
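One common way to make this mapping possible is to bake release metadata into the artifact and attach it to every captured error. The Python sketch below is purely illustrative: the `BUILD_INFO` fields, service name and version are invented, and real systems typically inject these values at build time:

```python
import platform

# Hypothetical release metadata, stamped into the artifact at build time.
BUILD_INFO = {
    "application": "billing-service",
    "version": "3.2.1",
}

def annotate_error(exc):
    # Attach artifact identity to every captured error, so an incident
    # three months after deployment can still be traced to its release.
    return {
        "error": type(exc).__name__,
        "message": str(exc),
        **BUILD_INFO,
        "python": platform.python_version(),
    }

try:
    raise ValueError("invalid invoice total")
except ValueError as exc:
    report = annotate_error(exc)
```

With the release identity traveling alongside the error itself, connecting a three-month-old anomaly to the deployment that introduced it no longer depends on anyone’s memory.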

Conclusion

Content and context are the keys to enterprise troubleshooting. Without them, you will spend more time trying to piece things together. True root cause not only involves casting a wide net; it involves going that extra mile deep into the various facets of code. Unfortunately, many organizations only have a subset of the above 7 key components. This results in limited visibility and, in turn, a dependency on a select few individuals who have intrinsic knowledge of the systems. Dependency on these “pockets of brilliance” does solve an immediate need, but can it scale? What happens when those pockets disappear over time?

While software application technology has evolved in leaps and bounds over the last 2 decades, troubleshooting is still in its infancy. Today’s top enterprises are realizing that to really be able to find True Root Cause for application issues within minutes, they must have these 7 key components to remove all the guesswork. Flying blind is not an option.

OverOps is the only tool that arms teams with all 7 components of True Root Cause for every error and slowdown across the software delivery lifecycle. Visit our website to learn more.

Happy troubleshooting! And best of luck identifying the “True Root Cause” of your production issues.

Karthik Lalithraj is a Principal Solutions Architect at OverOps, focused on code quality and application reliability for the IT services industry. With over two decades of software experience in a variety of roles and responsibilities, Karthik takes a holistic view of software architecture, with special emphasis on helping enterprise IT organizations improve their service availability, application performance and scale. He has helped recruit and build enterprise teams, and has architected, designed and implemented business and technical solutions for customers across numerous business verticals.