All the data in the world means nothing if it’s not the right data. But when it comes to delivering reliable software and troubleshooting issues, what is the right data?
To answer this question, we recently created a framework that helps organizations pinpoint critical gaps in data and metrics that are holding them back on their reliability journeys. At the foundation of this framework is the concept of Continuous Reliability (CR), or the notion of balancing speed, complexity and quality by taking a continuous, proactive approach to reliability across the SDLC. When it comes to CR, it’s not just about what data you can capture, but how you analyze and leverage it.
With increasingly complex systems and ever-growing expectations for digital customer experience, traditional tooling and the shallow data it provides are insufficient. To fully understand what’s going on inside your application and maintain stability, data must be collected at the code level.
One of the things that makes OverOps a powerful reliability tool is the way that we capture, analyze and present code-level data across the software delivery lifecycle. In this post, we’ll break down the four key types of data OverOps captures and why they’re critical to advancing in the journey towards Continuous Reliability.
1. Code Metrics
Capturing all the information about events occurring in your code is critical to deciphering which issues need to be addressed. Before you can effectively prioritize and fix critical code-level issues, you first need visibility into exactly which issues are occurring.
At the most basic level, OverOps automatically captures 100% of events happening within your application in both test and production – even those missed by your logging framework or APM tools. This includes:
- Logged errors and warnings
- Uncaught and swallowed exceptions
- Slowdowns and APM bottlenecks
With OverOps, you no longer need to rely on logs and foresight into which events to capture, what to include in a log statement, or how to analyze it.
On top of detecting every event, OverOps applies a layer of intelligence to automatically prioritize all events based on severity so your team can focus on the issues that matter most. It takes into account whether an error is new, when it was first and last seen, how many times it occurred and whether there has been a sudden increase. From there, OverOps marks errors as severe based on criteria such as whether a new or increasing error is uncaught, or whether its volume and rate exceed a certain threshold. It also considers established baselines and averages to pinpoint anomalies and notify DevOps and SRE teams of events that require immediate resolution.
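To make the severity criteria above more concrete, here is a minimal sketch in Python. It is not OverOps’ actual scoring logic: the `EventStats` fields, the threshold constants and the "twice the baseline" definition of a sudden increase are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EventStats:
    """Hypothetical per-error statistics for one time window."""
    is_new: bool          # first seen in the current release or window
    is_uncaught: bool     # the exception escaped every handler
    count: int            # occurrences in the current window
    calls: int            # invocations of the surrounding code in the window
    baseline_rate: float  # historical average failure rate (count / calls)


# Illustrative thresholds only; real values would be tuned per application.
MIN_VOLUME = 50      # minimum occurrences before volume counts as high
MIN_RATE = 0.05      # minimum failure rate (5% of calls)
SPIKE_FACTOR = 2.0   # "sudden increase" = more than 2x the baseline rate


def failure_rate(e: EventStats) -> float:
    return e.count / e.calls if e.calls else 0.0


def is_increasing(e: EventStats) -> bool:
    # Compare the current rate against the established baseline.
    return e.baseline_rate > 0 and failure_rate(e) > SPIKE_FACTOR * e.baseline_rate


def is_severe(e: EventStats) -> bool:
    # Only new or increasing errors are candidates for "severe".
    if not (e.is_new or is_increasing(e)):
        return False
    # A new or increasing error that is uncaught is severe outright.
    if e.is_uncaught:
        return True
    # Otherwise both volume and rate must exceed their thresholds.
    return e.count >= MIN_VOLUME and failure_rate(e) >= MIN_RATE
```

Under these assumptions, a brand-new uncaught exception is flagged after a single occurrence, while a long-standing error that keeps failing at its usual rate is not, however noisy it is.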
2. True Root Cause
Many APM vendors will tell you that they provide the root cause of an issue, including “code-level” insights. What they actually mean is that they provide you with a stack trace. Stack traces, while useful, only help identify the layer of code where an issue occurred. From there, you’re left to your own devices, including spending time manually digging through shallow log files to find context that can help you reproduce the issue.
OverOps helps you go beyond the stack trace, capturing deep data, down to the lowest level of detail – without dependency on developer or operational foresight. This includes:
- The source code executing at the moment of the incident, captured directly from the JVM
- The exact offending line of code
- Key data and variables associated with the incident
- DEBUG and TRACE log statements
- Environment and container variables
- The ability to map events to specific applications, releases, services, etc.
Check out our recent blog post from OverOps Principal Solutions Architect, Karthik Lalithraj, where we take a deep dive into the seven key components that make up True Root Cause and why OverOps is the only tool that can help you capture the context needed to effectively troubleshoot.
3. Transactions & Performance Metrics
In the context of software development and reliability, a transaction is a sequence of calls that are treated as a unit, often based on a user-facing function. When a transaction fails, customer experience is often impacted, so it’s important to be able to identify and prioritize these failures in the context of the transactions that they impact.
OverOps captures data about every transaction failure, ranging from how many times it happened, to how many transactions failed, to the response time of the transaction. Using insights from the code events we mentioned above, we can determine the success of a transaction by correlating errors, exceptions and slowdowns within a given timeframe and surface this data to our users.
These performance metrics include things like throughput, or the number of transactions that occur during a given period of time, and response time baselines. The ability to capture data about application performance is critical to understanding what your end users are experiencing, as well as correlating related events that may help with identifying the root cause.
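As a rough illustration of the metrics described above, the following Python sketch computes throughput, failure rate and average response time for one time window and compares the average against a response time baseline. The `Transaction` record, the `summarize` function and the choice of a simple mean are assumptions for this example, not a description of how OverOps calculates these values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Transaction:
    """Hypothetical record of one completed transaction."""
    name: str
    duration_ms: float
    failed: bool  # assumed to be set upstream by correlating errors and slowdowns


def summarize(transactions, window_seconds, baseline_ms):
    """Summarize the transactions observed in one time window."""
    total = len(transactions)
    failures = sum(1 for t in transactions if t.failed)
    avg_ms = mean(t.duration_ms for t in transactions) if total else 0.0
    return {
        # Throughput: transactions completed per second in the window.
        "throughput_per_s": total / window_seconds,
        # Share of transactions in the window that failed.
        "failure_rate": failures / total if total else 0.0,
        "avg_response_ms": avg_ms,
        # Flag the window when it is slower than the established baseline.
        "slower_than_baseline": avg_ms > baseline_ms,
    }
```

For example, four `checkout` transactions over a two-second window, one of which failed, would give a throughput of 2 per second and a failure rate of 25%.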
4. System Metrics
OverOps focuses on data at the code level of your application, but we recognize the importance of correlating code-level failures with other aspects of your system. For example, what impact did your latest deployment have on CPU/memory utilization? Are there any blocked threads related to this failure? Was this CPU spike caused by the application?
Through the OverOps reliability dashboards, you can correlate events, transactions and performance metrics to things like Garbage Collection, Threads, CPU, Class Loading and Memory Consumption, giving you a more comprehensive view into dependencies indirectly related to your application.
How Do We Do It?
What allows OverOps to capture this depth and breadth of data that other monitoring tools simply can’t? The not-so-secret secret to our unique capabilities is a combination of a few key elements:
- The OverOps Native Agent – Our agent operates between the JVM / .NET CLR and processor to capture real-time code and variable state from live microservices in production and pre-production environments. While traditional APM tools also rely on agents, the OverOps native agent operates at a lower level, allowing us to capture deeper, code-level insights about your application.
- Runtime Code Analysis & Code Graphs – As code is loaded during server startup, OverOps maps it and assigns unique fingerprints for every code instruction. Then at runtime, the resulting code graph is used to efficiently and securely access the memory and capture application state.
- Advanced Machine Learning – OverOps runs this high-fidelity data through machine learning algorithms for de-duplication, classification and anomaly detection and delivers Code Quality reports that are based on the code’s runtime behavior.
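To give a feel for what fingerprinting and de-duplication mean in practice, here is a deliberately simplified Python sketch. It is not OverOps’ implementation: the product is described as fingerprinting individual code instructions and using machine learning, whereas this example swaps in a plain hash of the exception type and code location. The `fingerprint` and `deduplicate` names and the tuple layout are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(exception_type, class_name, method_name, line):
    """Derive a stable ID for an error from where and how it occurred."""
    key = f"{exception_type}|{class_name}|{method_name}|{line}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def deduplicate(events):
    """Collapse raw events into one counted entry per fingerprint.

    Each event is a (exception_type, class_name, method_name, line) tuple.
    """
    counts = Counter()
    for event in events:
        counts[fingerprint(*event)] += 1
    return counts
```

With this approach, a thousand occurrences of the same exception at the same line collapse into a single entry with a count of 1,000, which is what makes later steps such as classification and anomaly detection tractable.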
To learn more about how OverOps can help you capture deeper data, schedule a call with one of our engineers here.
The powerful combination of data and analysis is the key to enterprise-scale observability and reliability. OverOps not only helps your team capture a complete picture of how your code is executing and the errors and slowdowns that occur, but also analyzes and adds meaning to that data so you know exactly which issues to prioritize.
Find out how these unique metrics will empower your team to deliver more reliable software in our new eBook about the Continuous Reliability Maturity Model.