Following on our latest launch of OverOps Reliability Dashboards, it’s time to take a closer look at the dashboards themselves and see what added value they can bring to DevOps/SRE, QA and dev teams.
OverOps’ new Reliability Dashboards give visibility across pre-production and production environments. These capabilities help developers, DevOps and QA teams identify and prioritize anomalies before a release, which can stop you from promoting bad code.
One of the ways we help teams prioritize anomalies is through our Jenkins integration. Jenkins is one of the more popular automation servers and is usually used to automate the Continuous Integration part of the software development process, which in turn helps improve the entire CI/CD workflow.
However, Jenkins on its own is only as good as the data it relies on, which is usually log files, meaning it can’t provide an informative feedback loop on what exactly caused a build to fail. That’s where OverOps fits in, providing insight into the functional quality of your applications, servers, machines and environments, helping automate the non-human part of your software development process.
To see how OverOps enriches Jenkins, let’s take an in-depth look at the integration and how it works:
OverOps and Jenkins
OverOps provides four quality criteria that can mark a build as unstable. These criteria, which can be configured according to each user’s own specific needs, are:
1. New Error – Any new error that enters the build.
2. Resurfaced Error – Whether an old error has resurfaced in the current build.
3. Total Error Volume – The total number of errors for the build.
4. Unique Error Volume – The total number of unique errors in this build.
There are two additional criteria that OverOps checks, which are:
1. Critical Exception Types – Identify new critical exceptions, according to what we define for our application (Values set in configuration – NullPointerException, IndexOutOfBoundsException, YourCustomException).
2. Increasing Errors – Compare the Active Time Window against a Baseline Time Window.
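To make the Critical Exception Types criterion more concrete, here is a minimal sketch of how such a check could work, assuming the new errors of a build are available as a list of exception type names. The class CriticalExceptionGate and the method criticalAmong are hypothetical names used for illustration only; they are not part of OverOps or its Jenkins plugin.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; not part of OverOps or its Jenkins plugin.
class CriticalExceptionGate {
    // Exception type names treated as critical, mirroring the example configuration in the text.
    static final Set<String> CRITICAL_TYPES =
            Set.of("NullPointerException", "IndexOutOfBoundsException", "YourCustomException");

    // Returns the new error types that appear in the critical list, in input order, without repeats.
    static List<String> criticalAmong(List<String> newErrorTypes) {
        return newErrorTypes.stream()
                .filter(CRITICAL_TYPES::contains)
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> found = criticalAmong(List.of("ParseException", "NullPointerException"));
        // Assumption: any critical match is enough to mark the build as unstable.
        System.out.println(found.isEmpty() ? "stable" : "unstable: " + found);
    }
}
```

In this sketch a single match is enough to flag the build; the real criterion may apply additional conditions.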
As we can see in the screenshot below, OverOps identified that build 248 is unstable. Let’s dive in and see why it was marked as such:
As we can see in the screenshot above, the build failed three of the criteria: Resurfaced Errors, Total Error Volume and Unique Error Volume.
For each criterion, we can see the number of errors that were detected, their type, which version first introduced them and their volume. For example, looking at the Unique Error Volume criterion, we can see that it failed due to 8 unique errors, while the max amount “allowed” is 1.
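The Unique Error Volume check in this example boils down to counting distinct errors and comparing the count with a configured maximum. The sketch below illustrates that idea using the numbers from the build above. The names UniqueErrorVolumeGate, countUnique and passes are hypothetical, and the assumption that the gate fails only when the count is strictly greater than the maximum is ours, not something the plugin documents here.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration; names are not part of OverOps or its Jenkins plugin.
class UniqueErrorVolumeGate {
    // Counts distinct error identifiers among the events recorded for a build.
    static int countUnique(List<String> errorEvents) {
        return new HashSet<>(errorEvents).size();
    }

    // Assumption: the gate fails only when the count is strictly greater than the allowed maximum.
    static boolean passes(int uniqueErrors, int maxAllowed) {
        return uniqueErrors <= maxAllowed;
    }

    public static void main(String[] args) {
        // Values from the example above: 8 unique errors against a maximum of 1.
        System.out.println(passes(8, 1) ? "stable" : "unstable"); // prints "unstable"
    }
}
```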
The OverOps Quality Dashboard can determine that the top error comes from a previous build rather than the current one, and can identify which previous build introduced it. This allows us to find and resolve the issue before the current version is deployed.
We know the build is unstable, and now it’s time to see why it failed to pass our quality gates. In other words, it’s time for us to start investigating why these errors occurred. In this case, we want to take a closer look at the error with the highest volume rate (28,099 events) – ParseException.
In Jenkins, we can see the general reason for this error: “Unparseable date: "2-17-1989"”. However, OverOps allows us to see its True Root Cause on the Automatic Root Cause (ARC) analysis screen:
The click leads us straight to the source of this issue: the line of code in which it happened and the variables that were sent during the specific transaction that caused it to fail. We can see the fire icon next to the line in which the error happened, and by hovering over dateString we can even see the value that caused it to fail.
We can see that the date format that was sent in this transaction, 9-21-2009, didn’t match the desired format, MM/dd/yyyy, which caused it to fail 37.48% of the times it ran.
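The failure itself is easy to reproduce outside the application. The sketch below assumes the code parses dates with java.text.SimpleDateFormat and the MM/dd/yyyy pattern; the application's actual code is not shown in this post, so the class DateParseDemo and its methods are hypothetical names used for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical reproduction; the application's real parsing code is not shown in the post.
class DateParseDemo {
    static final String EXPECTED_PATTERN = "MM/dd/yyyy";

    // Parses a date string strictly against the expected pattern.
    static Date parse(String dateString) throws ParseException {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat(EXPECTED_PATTERN);
        format.setLenient(false);
        return format.parse(dateString);
    }

    // Returns "ok" on success, or the exception message on failure.
    static String tryParse(String dateString) {
        try {
            parse(dateString);
            return "ok";
        } catch (ParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("9-21-2009"));  // prints: Unparseable date: "9-21-2009"
        System.out.println(tryParse("09/21/2009")); // prints: ok
    }
}
```

The dash-separated value fails because the pattern expects a slash after the month, which is the mismatch the ARC screen surfaces through the dateString variable.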
The OverOps ARC screen also allows us to examine the stack trace going back 10 levels into the heap, the previous 250 log statements (including DEBUG and TRACE-level in production) and JVM metrics such as the state of memory, garbage collection, etc. With this information, we can deduplicate the error and resolve it with just a few clicks.
Now that we know what happened, it’s easy to assign the error to the right developer, with all the information needed to solve it. Whether it is filed as a Jira ticket, sent as a Slack message, etc., the issue comes with the True Root Cause needed to resolve it, saving developers hours (and sometimes days) in the process.
Why is it Important?
Through the new Jenkins integration, QA teams can see all new anomalies introduced by any release in test or stage, and automatically assign them a severity based on potential impact to the code.
OverOps allows these teams to analyze the number of new errors, regressions and slowdowns introduced in each phase of pre-production and prioritize those with critical impact, enabling them to accurately assess the reliability of each deployment compared to previous releases. In other words, OverOps can automatically stop a bad release from being promoted, sending it back to the engineers with the enriched data needed to resolve the error.
Using these quality gates, organizations can certify releases to be moved through their delivery pipeline, or stop them in their tracks to proactively fix any issues. Our goal is to improve the continuous delivery pipeline, and allow both dev and ops to push better code to production.