New OverOps Reliability Dashboards score deployments, applications and infrastructure configurations so teams can drill down into operational issues and immediately see the root cause.
In previous posts, we’ve discussed the challenges we need to overcome when figuring out the root cause of issues (1), prioritizing events and anomalies (2), and blocking critical issues from reaching production (3).
A common theme in all of these posts is the need for us to access code signals that, until now, have been unavailable to us. In order to quickly identify and resolve issues, we should be able to see where in the code an error occurred, which inputs caused the error and (especially in the case of slowdowns) what the state of the JVM is at the time of the error.
This post will look at how OverOps presents code-level insights in a way that enables IT Ops and SRE teams to detect anomalous behavior, triage issues for specific releases, applications or components, and see the overall reliability of a deployment or application.
OverOps Reliability Dashboards
Upon installation, without any changes to code or build, OverOps creates reliability score cards for every deployment, application and code tier; for example, any package or component configuration that’s related to our newest deployment. These scores are determined using a machine learning algorithm that takes into account new, increasing and critical issues and slowdowns. This allows us to assess the quality of our releases over time and to see how they’re performing in pre-production and production environments.
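To make the idea concrete, here is a minimal sketch of how a reliability score could combine those four signals. The weights and the linear penalty are purely illustrative assumptions; OverOps' actual scoring is a proprietary machine learning model, not this formula.

```python
# Hypothetical reliability score: penalize a deployment for new, increasing
# and critical issues and for slowdowns. The weights below are illustrative
# assumptions, NOT OverOps' actual (proprietary, ML-based) model.
def reliability_score(new_issues, increasing_issues, critical_issues, slowdowns):
    """Return a 0-100 score; higher means a more reliable deployment."""
    penalty = (new_issues * 2.0
               + increasing_issues * 3.0
               + critical_issues * 10.0
               + slowdowns * 5.0)
    return max(0.0, 100.0 - penalty)

# A clean release keeps a perfect score; a release that introduces
# 12 new issues (1 critical), 2 increasing errors and 2 slowdowns
# drops sharply below it.
clean = reliability_score(0, 0, 0, 0)
risky = reliability_score(new_issues=12, increasing_issues=2,
                          critical_issues=1, slowdowns=2)
```

The key design point is that a single critical issue can outweigh many minor ones, which is why one Sev1 candidate can sink an otherwise healthy-looking release.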
So, let’s say we deployed a new version of our software. This is when everyone’s eyes are glued to their screens and dashboards, trying to understand if this version is stable and reliable, or if it introduced any anomalies or issues that we need to be aware of. The OverOps Reliability scorecard lets us see it all in one screen.
We can see here (in the lower left-hand corner) that something seems to be wrong with version 72 (v.4.0.2172): its score is 52.5, compared to previous versions, which scored higher. That means this version contains issues that risk affecting our customers.
More specifically, we see in the dashboard that this deployment introduced 12 new issues (1 of which is prioritized as a potential Sev1 issue), 2 increasing errors (meaning issues whose failure rate has increased compared to previous releases), and is causing 2 slowdowns. This is useful because we immediately know that there’s something wrong with this version. Now for the tricky part: finding WHERE the issues are coming from. Are they network, security, infrastructure or code issues? Which team should we alert? How can we avoid pulling the entire engineering org into a war-room call? This is where OverOps’ code-level insight helps us triage within minutes, without relying on logs. Let’s dive deeper into this version.
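An "increasing" error, as described above, is one whose failure rate has grown relative to previous releases. A simple sketch of that comparison might look like the following; the 2x threshold and the function name are assumptions for illustration, not OverOps' actual heuristic.

```python
# Illustrative check for an "increasing" error: compare an event's failure
# rate in the current release to its rate in a baseline of earlier releases.
# The 2x ratio threshold is an assumed value, not OverOps' real logic.
def is_increasing(curr_failures, curr_calls, base_failures, base_calls,
                  ratio_threshold=2.0):
    """Return True if the failure rate grew by ratio_threshold or more."""
    if curr_calls == 0 or base_calls == 0:
        return False  # not enough traffic to compare
    curr_rate = curr_failures / curr_calls
    base_rate = base_failures / base_calls
    if base_rate == 0:
        # Never failed before; in practice this would be tagged "new"
        # rather than "increasing", but any failure is an increase here.
        return curr_rate > 0
    return curr_rate / base_rate >= ratio_threshold
```

For example, an error that failed on 2 of 100 calls last release but 10 of 100 calls this release (a 5x jump) would be flagged, while one that crept from 2% to 3% would not.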
By filtering for only the most recent deployment, we can easily identify errors related directly to our application or to other operational components. Since OverOps analyzes the code as it’s running, in real time, it can break down the errors not only by application and version, but also by the code tier that caused them: third-party code, the database, and so on. In this case, we can see that 3 of the errors in this version are related to our AWS instance, including one critical issue that must be resolved ASAP (we can also see that the application layer is causing 9 issues).
Now, we know where we want to focus and which team needs to be alerted, but we also want to make sure they’re able to solve it quickly. To do that, we’ll zoom into the deployment itself. From here, we can see more information about the individual anomalies that make this version unreliable. In addition to new or increasing errors and slowdowns, which are the basis for the deployment’s reliability score, we see the percentage of failed transactions and the volume of unique errors.
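The two metrics mentioned above are easy to picture with a small sketch. This is not OverOps' implementation; it just shows, under the assumption that each transaction is recorded with an error fingerprint (or none on success), how the failed-transaction percentage and unique-error volume fall out of the raw events.

```python
from collections import Counter

# Illustrative sketch of the two deployment-view metrics: the percentage of
# failed transactions and the volume of each unique error. Each event is a
# (transaction_name, error_fingerprint) pair; fingerprint is None on success.
# The event shape and names are assumptions made for this example.
def deployment_stats(events):
    total = len(events)
    fingerprints = [fp for _, fp in events if fp is not None]
    failed_pct = 100.0 * len(fingerprints) / total if total else 0.0
    unique_errors = Counter(fingerprints)  # volume per unique error
    return failed_pct, unique_errors

events = [
    ("checkout", None),
    ("checkout", "NullPointerException@CartService:42"),
    ("login", None),
    ("login", None),
    ("checkout", "NullPointerException@CartService:42"),
]
pct, uniques = deployment_stats(events)
# 2 of 5 transactions failed, all on the same unique error.
```

Grouping failures by fingerprint rather than counting raw log lines is what keeps "2 increasing errors" meaningful even when each error fires thousands of times.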
Now, let’s filter for those AWS errors we saw earlier.
At this point, we want to know more information about a specific error. Clicking on it will open the Automated Root Cause (ARC) Screen. This screen reveals the True Root Cause of the error by showing a complete picture of the JVM when it happened.
With access to the source code and the full variable state at the time the error occurred, we can quite easily see that our code isn’t interacting properly with AWS code. Without OverOps, we might never have seen that this issue was happening at all. It never shows up in the logs, yet it has a fail rate of more than 2%, adding extra noise to the system.
Now we can send it to the AWS team, along with a probable solution. If needed, the team assigned to this issue can also examine the stack trace going back 10 levels into the heap, the previous 250 log statements (including DEBUG and TRACE level in production) and JVM metrics such as the state of memory, garbage collection, etc.
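One way to make "the previous 250 log statements" available at capture time is a fixed-size ring buffer of recent log records, so DEBUG and TRACE context exists in memory even when it is never written to disk. The sketch below is a hedged illustration of that idea; the class and method names are invented for this example and are not OverOps' agent internals.

```python
from collections import deque

# Hypothetical ring buffer of recent log statements. When an error is
# captured, a snapshot of the buffer provides the last N statements of
# context, including levels normally filtered out of production logs.
class RecentLogBuffer:
    def __init__(self, capacity=250):
        self._records = deque(maxlen=capacity)  # oldest entries fall off

    def log(self, level, message):
        self._records.append((level, message))

    def snapshot(self):
        """Return the last `capacity` statements, oldest first."""
        return list(self._records)

buf = RecentLogBuffer()
for i in range(300):
    buf.log("DEBUG", f"processing request {i}")
snap = buf.snapshot()  # only the most recent 250 statements survive
```

Because the buffer is bounded, this costs a constant amount of memory per thread of execution regardless of how chatty the application is.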
Why This Matters
It’s pretty much impossible to know how long resolving this error would take without OverOps; it depends on the company, its workflows and many other factors. For some issues, it can take weeks before a customer experiences the problem and reports it to the team. In the case of a severe issue, it may take days of back and forth to determine who is responsible, or it may require assembling a war room.
With OverOps, we identified, analyzed, prioritized and assigned an issue for resolution, all in less than an hour. The Jira ticket attached to our Reliability Dashboard included all the relevant information, so the person responsible for resolving the issue doesn’t have to try to reproduce it or dig through the logs.