You can figure out how much errors cost when they crash your application, but what about other errors and issues that are caught and known? They might end up costing you even more.


We’ve all been there – it’s Saturday night, we’re sitting at home enjoying the weekend when suddenly the phone starts buzzing. The application has crashed. We hop on a war-room conference call that includes developers, Ops and executives, all trying to figure out what happened and how to handle the situation. It takes a few hours (or even days) and some new grey hairs, but eventually the application is up and running again.

Now that the storm is over, it’s time to understand how much this downtime cost us. It’s easy to analyze and understand what we need to add to the formula – lost revenue from downtime and failed transactions, angry customers and brand tarnishment, dev time spent debugging and solving the issue, product roadmap delays, and so on.

These elements, or costs, are visible to us. The right person from the right team can know what happened and how much was spent on this single issue. But what about the elements that lie beneath the surface? The ones that are invisible, hurting us without our even being aware of it?

We’re talking about errors and issues that are caught by tools, log files, exception handling or any other method we might be using. While we tend to think that caught errors and exceptions = handled errors and exceptions, that’s not always the case. As a matter of fact, those issues might end up costing us more than war-room crises.
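To make that distinction concrete, here’s a minimal Java sketch (illustrative only – the class and scenario are hypothetical) of the difference between catching an error and actually handling it. In the first version the exception is swallowed, the failed transaction silently disappears and the cost moves below the surface; in the second it is at least recorded so it can be counted, prioritized and fixed:

```java
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: the same failure, swallowed vs. actually handled.
public class SwallowedExceptionDemo {

    private static final Logger LOG = Logger.getLogger(SwallowedExceptionDemo.class.getName());

    // Simulates a call that fails, e.g. a payment gateway timing out.
    static String chargeGateway(String orderId) {
        throw new IllegalStateException("gateway timeout for order " + orderId);
    }

    // "Caught" but not handled: the exception vanishes, and so does the transaction.
    static Optional<String> chargeSwallowed(String orderId) {
        try {
            return Optional.of(chargeGateway(orderId));
        } catch (Exception e) {
            return Optional.empty(); // nobody will ever know this failed
        }
    }

    // Caught *and* handled: the failure is logged with context so it can be
    // counted, prioritized and eventually fixed.
    static Optional<String> chargeHandled(String orderId) {
        try {
            return Optional.of(chargeGateway(orderId));
        } catch (Exception e) {
            LOG.log(Level.SEVERE, "Charge failed for order " + orderId, e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        chargeSwallowed("A-1001"); // silent failure – a hidden cost
        chargeHandled("A-1002");   // visible failure – measurable and fixable
    }
}
```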

Let’s see what the true cost of an error really is.

Discover the Hidden Costs of Your Errors – Try Out Our “Cost of an Error” Calculator

Pay Per Error

Before we can understand how much errors are costing us and our company, we first need to take a broader look and see how much they cost the entire industry. According to a report by Herb Krasner, titled “The Cost of Poor Quality Software in the US”, the US software industry is paying about $2.84 trillion(!) due to poor quality code.

As the report title suggests, the most significant cost for US companies is due to software failures (37.46%). But other factors make an impact as well, such as legacy system problems (21.42%), technical debt (18.22%), time wasted on finding and fixing issues (16.87%) and troubled or canceled projects (6.01%).

Legacy systems, technical debt, time spent finding and fixing issues and canceled projects are common in almost every company, but they’re often ignored or pushed aside due to lack of time, manpower or budget. However, as the industry moves towards CI/CD practices and deploys code faster than ever, and as we continue to push bad code to production, these issues and their costs will only grow over time.

Enter the Iceberg Model

Now that we know how much the entire industry is spending on errors, and understand that this is a significant issue hurting our applications, it’s time to see how much they cost us.

Inspired by Herb Krasner’s report, we decided to analyze the costs using the iceberg model. Coined by Ernest Hemingway, the iceberg theory describes writing that shows only the surface elements, leaving the deeper meaning implied beneath them. In other words – we have the known error costs vs. the hidden costs that lie beneath the surface.

We started pulling our own data by speaking to customers, developers and operations teams, trying to see how much money is being spent without anyone being aware of it.

The tip of the iceberg holds the issues we are aware of – the ones we can see, and whose impact on our application, brand and costs we can measure, such as:

  • Customer complaints
  • Dev time spent debugging and solving issues
  • Application downtime
  • Roadmap delays
  • SLA violation costs
  • Stack of tools to ensure reliability (and catch errors)

While there’s no doubt that these factors contribute to costs, they’re not telling the full story of how much you’re spending on your errors. Beneath the surface, we saw that companies are dealing with issues that they’re aware of but can’t quite quantify – they can’t measure how these issues affect the application or the customers, or how much they add to expenses. These include:

  • Technical debt
  • Excessive logging
  • CPU performance
  • Poor quality data
  • Memory leaks
  • Brand tarnishment
  • Lost market opportunities
  • Additional hardware/storage/network costs

The bottom part of the iceberg includes issues that are created when we promote poor quality (or simply bad) code down our application pipeline and don’t handle the consequences. In a survey we ran a few months back, titled “Dev vs. Ops: The State of Accountability”, we saw that the pressure to move quickly and meet deadlines when deploying features significantly impacts code quality and reliability. 38.2% of all respondents indicated that moving too quickly is a primary reason that errors make it into production.

In other words, we need to continue moving fast without breaking things. We need quality gates, feedback loops and criteria that will help the entire company, from Dev and QA to Ops teams, stop unstable builds from being deployed. We need to add a stamp of approval to the code before it’s pushed to production, as well as identify what happened even after the code is live and running.
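As a rough illustration of what such a quality gate could look like (a simplified sketch, not OverOps’ actual gating logic – the `ReleaseStats` record and the thresholds are hypothetical), a CI step might compare a candidate build’s error profile against agreed criteria and refuse to promote it:

```java
// Hypothetical quality-gate sketch (Java 16+): fail the pipeline step when a
// candidate build introduces new errors or regresses beyond agreed thresholds.
public class QualityGate {

    // Illustrative build metrics; in practice these would come from your
    // monitoring or log-analysis tooling.
    record ReleaseStats(int newErrorTypes, int totalErrorsPerMinute, double p95LatencyMs) {}

    // The thresholds below are assumptions for the sketch – each team sets its own criteria.
    static final int MAX_NEW_ERROR_TYPES = 0;
    static final int MAX_ERRORS_PER_MINUTE = 50;
    static final double MAX_P95_LATENCY_MS = 300.0;

    static boolean passes(ReleaseStats stats) {
        return stats.newErrorTypes() <= MAX_NEW_ERROR_TYPES
                && stats.totalErrorsPerMinute() <= MAX_ERRORS_PER_MINUTE
                && stats.p95LatencyMs() <= MAX_P95_LATENCY_MS;
    }

    public static void main(String[] args) {
        ReleaseStats candidate = new ReleaseStats(2, 35, 280.0);
        if (!passes(candidate)) {
            System.err.println("Quality gate failed - blocking promotion of this build");
            System.exit(1); // a non-zero exit code stops the CI/CD pipeline stage
        }
        System.out.println("Quality gate passed - build may be promoted");
    }
}
```

The interesting design decision is not the code itself but who owns the thresholds – they work best as shared criteria agreed on by Dev, QA and Ops, not as numbers one team imposes on another.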

Stop Promoting Bad Code; Say Goodbye to Sev1 Issues

Now that we are aware of what the hidden costs are, it’s time to handle them before they impact the amount of money the company is spending. For that to happen, teams need to know what is going on inside their applications – how many errors and log events are occurring every minute of every day – with the goal of driving that number down.
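One lightweight way to get that visibility (a minimal sketch using nothing beyond the JDK; in practice teams would typically rely on their monitoring or log-analysis tooling) is simply to count error events per minute, so the volume becomes a number you can track and reduce:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal per-minute error counter - just enough to make error volume visible.
public class ErrorRateMeter {

    private final Map<Instant, LongAdder> perMinute = new ConcurrentHashMap<>();

    // Call this from catch blocks / error handlers.
    public void recordError() {
        Instant minute = Instant.now().truncatedTo(ChronoUnit.MINUTES);
        perMinute.computeIfAbsent(minute, m -> new LongAdder()).increment();
    }

    public long errorsInMinute(Instant minute) {
        LongAdder counter = perMinute.get(minute.truncatedTo(ChronoUnit.MINUTES));
        return counter == null ? 0 : counter.sum();
    }

    public static void main(String[] args) {
        ErrorRateMeter meter = new ErrorRateMeter();
        for (int i = 0; i < 1_200; i++) {
            meter.recordError(); // simulate a noisy application
        }
        System.out.println("Errors this minute: " + meter.errorsInMinute(Instant.now()));
    }
}
```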

We can’t realistically expect to reach a point where we have zero errors, but we also can’t stay in the current state where we assume that as long as errors aren’t directly impacting performance or crashing the app, they have no impact on the company. To put it plainly, the current error volume is costing us more money than we realize.

By reducing error volume and error rates, we also shrink our log files, cutting ingestion and storage costs. Less CPU is spent on error handling, so applications scale better. Performance naturally improves, increasing customer satisfaction and reducing churn.
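For a rough sense of scale, here’s a back-of-the-envelope sketch of the logging side alone. Every number in it (errors per minute, bytes per logged error, price per GB ingested) is an illustrative assumption, not a measurement – plug in your own figures:

```java
// Back-of-the-envelope estimate of log ingestion cost driven by error volume.
// Every figure below is an assumption for illustration - substitute your own.
public class LogCostEstimate {
    public static void main(String[] args) {
        double errorsPerMinute = 1_000;       // assumed steady error volume
        double bytesPerLoggedError = 2_048;   // assumed ~2 KB per stack trace + context
        double dollarsPerGbIngested = 2.50;   // assumed log-ingestion price per GB

        double gbPerMonth = errorsPerMinute * bytesPerLoggedError
                * 60 * 24 * 30 / (1024.0 * 1024.0 * 1024.0);
        double costPerMonth = gbPerMonth * dollarsPerGbIngested;

        System.out.printf("~%.1f GB of error logs per month, roughly $%.0f in ingestion alone%n",
                gbPerMonth, costPerMonth);
        // Halving the error volume roughly halves this line item as well -
        // before counting storage, CPU and the engineering time spent wading through noise.
    }
}
```

Under these assumed numbers that is on the order of 80+ GB of error logs a month; the point is not the exact figure but that the cost scales directly with error volume.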

To help teams achieve this goal, we created the OverOps Reliability Dashboards, which show the current status of the application and identify and prioritize all errors across it. They detect issues, slowdowns and anomalies, allowing teams to quickly and easily resolve the errors that contribute most to high error volumes.

OverOps alerts you when new errors are introduced into your system, when there’s an increase in errors or when slowdowns occur, allowing you to stay on top of things – rather than waiting for your customers to complain about issues.
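As a simplified illustration of that kind of alerting (this is not OverOps’ detection logic – just a minimal sketch of the underlying idea, with an arbitrary threshold), an error spike can be flagged by comparing the current minute’s count against a recent baseline:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified spike detector: alert when the current minute's error count
// is well above the average of the previous few minutes.
public class ErrorSpikeDetector {

    private final Deque<Long> recentMinutes = new ArrayDeque<>();
    private final int windowSize;
    private final double spikeFactor;

    public ErrorSpikeDetector(int windowSize, double spikeFactor) {
        this.windowSize = windowSize;   // e.g. the last 5 minutes
        this.spikeFactor = spikeFactor; // e.g. alert at 3x the baseline
    }

    // Returns true if the latest count looks like a spike versus the baseline.
    public boolean record(long errorsThisMinute) {
        boolean spike = false;
        if (recentMinutes.size() == windowSize) {
            double baseline = recentMinutes.stream()
                    .mapToLong(Long::longValue).average().orElse(0);
            spike = baseline > 0 && errorsThisMinute > baseline * spikeFactor;
            recentMinutes.removeFirst();
        }
        recentMinutes.addLast(errorsThisMinute);
        return spike;
    }

    public static void main(String[] args) {
        ErrorSpikeDetector detector = new ErrorSpikeDetector(5, 3.0);
        long[] counts = {40, 45, 38, 50, 42, 400}; // the last minute jumps ~10x
        for (long c : counts) {
            if (detector.record(c)) {
                System.out.println("ALERT: error spike detected (" + c + " errors/min)");
            }
        }
    }
}
```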

With proper identification and classification of anomalies in code behavior and performance, you can preemptively block releases from being deployed to production if circumstances suggest that a Sev1 issue is likely to be introduced.

Want to see how OverOps can stop Sev1 issues, enforce quality gates to every build and help you save hundreds, or even thousands, of dollars each month? Visit our website or watch a live demo.

Discover the Hidden Costs of Your Errors – Try Out Our “Cost of an Error” Calculator

Henn is a content manager at OverOps covering topics related to Java, Scala and everything in between. She is a lover of gadgets, apps, technology and tea.