97% of Logged Errors are Caused by 10 Unique Errors

It’s 2016, and one thing hasn’t changed in 30 years: Dev and Ops teams still rely on log files to troubleshoot application issues. For some reason we trust log files implicitly, believing the truth is hidden within them. If you just grep hard enough, or write the perfect regex query, the answer will magically present itself.

Yep, tools like Splunk, ELK and Sumologic have made it faster to search logs, but they all suffer from one thing: operational noise. Operational noise is the silent killer of IT and of your business. It’s the reason application issues go undetected and take days to resolve.

[This blog post is included as chapter 3 of our free Guide to Solving Java Application Errors in Production. Download the full eBook here.]

Log Reality

Here’s a dose of reality: you only log what you think will break an application, and you’re constrained by how much you can log without incurring unnecessary overhead on your application. This is why debugging through logging doesn’t work in production, and why most application issues go undetected.

Let’s assume you do manage to find all the relevant log events. That’s still not the end of the story: the data you need usually isn’t there, leaving you to add more logging statements, create a new build, test it, deploy it, and hope the error happens again. Ouch.

Time for Some Analysis

At OverOps we capture and analyze every error and exception thrown by Java applications in production. Using some cheeky data science, here’s what I found from analyzing over 1,000 applications monitored by OverOps.

High-level aggregate findings:

  • Avg. Java application will throw 9.2 million errors/month
  • Avg. Java application generates about 2.7TB of storage/month
  • Avg. Java application contains 53 unique errors/month
  • Top 10 Java errors by frequency:
    • NullPointerException
    • NumberFormatException
    • IllegalArgumentException
    • RuntimeException
    • IllegalStateException
    • NoSuchMethodException
    • ClassCastException
    • Exception
    • ParseException
    • InvocationTargetException

So there you have it: the pesky NullPointerException is to blame for much of what’s broken in log files. Ironically, checking for null was the first feedback I got in my first code review back in 2004, when I was a Java developer.
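Most NullPointerExceptions boil down to a value that was assumed to be present. A minimal sketch of making that absence explicit with `Optional` instead of letting a null escape and blow up later (the `findUser` method and the user map here are hypothetical, purely for illustration):

```java
import java.util.Map;
import java.util.Optional;

public class NpeGuard {
    // Hypothetical lookup: a map standing in for any source that may return null.
    public static String findUser(Map<String, String> users, String id) {
        // Optional forces the "might be absent" case to be handled here,
        // instead of letting a null escape as a NullPointerException downstream.
        return Optional.ofNullable(users.get(id))
                       .orElse("unknown");
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("42", "alice");
        System.out.println(findUser(users, "42"));   // alice
        System.out.println(findUser(users, "7"));    // unknown
    }
}
```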

Right, here are some numbers from a randomly selected enterprise production application over the past 30 days:

  • 25 JVMs
  • 29,965,285 errors
  • ~8.7TB of storage
  • 353 unique errors
  • Top 10 Java errors by frequency:
    • NumberFormatException
    • NoSuchMethodException
    • Custom Exception
    • StringIndexOutOfBoundsException
    • IndexOutOfBoundsException
    • IllegalArgumentException
    • IllegalStateException
    • RuntimeException
    • Custom Exception
    • Custom Exception

Time for Trouble (shooting)

So, you work in development or operations and you’ve been asked to troubleshoot the above application, which generates roughly a million errors a day. What do you do? Well, let’s zoom in on the window when the application had an issue, right?

Let’s pick, say, a 15-minute period. That still leaves 10,416 errors to dig through in those 15 minutes. Now do you see the problem called operational noise? This is why humans struggle to detect and troubleshoot application issues today, and it’s not going to get any easier.
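The arithmetic behind that 10,416 is simple: a day holds 96 fifteen-minute windows, so a million daily errors divide out to roughly ten thousand per window:

```java
public class NoiseMath {
    public static void main(String[] args) {
        long errorsPerDay   = 1_000_000L;
        long windowsPerDay  = 24 * 60 / 15;              // 96 fifteen-minute windows
        long errorsPerWindow = errorsPerDay / windowsPerDay;
        System.out.println(errorsPerWindow);             // 10416
    }
}
```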

What if We Just Fixed 10 Errors?

Now, let’s say we fixed 10 errors in the above application. By what percentage do you think those 10 fixes would reduce the error count, storage and operational noise this application generates every month?

1%, 5%, 10%, 25%, 50%?

How about 97.3%? Yes, you read that right. Fixing just 10 errors in this application would reduce its error count, storage and operational noise by 97.3%.

The top 10 errors in this application by frequency are responsible for 29,170,210 errors out of the total 29,965,285 errors thrown over the past 30 days.
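That 97.3% figure falls straight out of the two counts above:

```java
public class TopTenShare {
    public static void main(String[] args) {
        long topTen = 29_170_210L;   // errors from the 10 most frequent types
        long total  = 29_965_285L;   // all errors thrown in the past 30 days
        double share = 100.0 * topTen / total;
        System.out.printf("%.1f%%%n", share);   // 97.3%
    }
}
```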

Take the Crap Out of Your App

The vast majority of application log files are full of duplicated crap that you’re paying to manage every single day in your IT environment.

You pay for:

  • Disk storage to host log files on servers
  • Log management software licenses to parse, transmit, index and store this data over your network
  • Servers to run your log management software
  • Humans to analyze and manage this operational noise

The easiest way to reduce operational noise is to fix application errors rather than ignore them. Not only will this dramatically improve your teams’ operational insight, it will help them detect more issues and troubleshoot much faster, because they’ll actually see the things that hurt your applications and business.

The Solution

If you want to identify and fix the top 10 errors in your application: download OverOps for free, stick it on a few production JVMs, wait a few hours, and sort the captured errors by frequency. In one click, OverOps will show you the exact source code, object and variable values that caused each of them. Within a few hours your developers should be able to make the needed fixes, and Bob’s your uncle.
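In plain Java, the “sort by frequency” idea looks roughly like this: count captured exceptions by type and list the noisiest first. This is an illustrative sketch only, not how OverOps (or any monitoring tool) works internally:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ErrorFrequency {
    // Count exceptions by class name and return entries sorted most-frequent-first.
    public static List<Map.Entry<String, Long>> topErrors(List<Throwable> captured) {
        return captured.stream()
                .collect(Collectors.groupingBy(
                        t -> t.getClass().getSimpleName(),
                        Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Throwable> captured = List.of(
                new NullPointerException(),
                new NullPointerException(),
                new NumberFormatException());
        topErrors(captured).forEach(e ->
                System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
```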

The next time you deploy code to production, OverOps will instantly notify you of any new errors introduced, and you can repeat the process. Here are two ways we use OverOps at OverOps to detect new errors in our SaaS platform:

Slack real-time notifications that inform our team of every new error introduced in production as soon as it’s thrown, with a one-click link to the exact root cause (source code, objects and variable values that caused the error).

[Image: Slack real-time notifications]

Email deployment digest reports showing the top 5 new errors introduced, with direct links to the exact root cause.

[Image: OverOps deployment digest report]

Final Thoughts

We see time and time again that the top few logged errors in production consume most of the troubleshooting time and logging resources. The damage these few events cause, each occurring millions of times, is disproportionate to the time and effort it takes to fix them.

For a deeper dive into each exception type of the top 10 list, check out the next post in this series.



Steve held executive marketing and evangelist positions at AppDynamics, Glassdoor and Moogsoft. He likes Formula 1 and is dangerous with a golf ball.
  • bytebuffer

    Well, this is some strong marketing you’ve got over here. Isn’t this combination of title and content what any non-marketoid would call an outright lie to your readers’ faces? Seriously, play fair please.

    • Stephen Burton

      This was originally intended as a two part blog post where the top 10 java errors were going to be published on Thursday of this week. Since you asked nicely, we’ll update the original blog above and play fair. Thanks for your feedback.

      • bytebuffer

        I’m sorry if that came out too strong, but I was really disappointed to go through the article and not find that “Here” from the title. I’m a regular reader and follower of your blog, so I know what Takipi is capable of, and for most readers like me the disappointment over the time wasted on this can result in this kind of frustration.
        Crunching billions of log entries is impressive by itself, and you’ve got some insights in the article, so why go as far as such a misleading title? It’s a matter of mutual respect, I guess.

        • Stephen Burton

          Nothing to apologize for, we made the edit just now based on reader feedback and we wouldn’t have it any other way. We’ll make sure all future blogs give you the meat 🙂

  • Markus A. Loisson

    “29,965,285 million errors”? 30 million millions?

  • Gareth Murfin

    A million errors a day? surely something is seriously wrong with the app.

    • Stephen Burton

      I wish this was the exception (no pun intended) but I’ve seen this quite a lot over the past ten years with customer applications. You just need one or two noisy errors/exceptions on a busy online application and they can easily throw a few million a day 🙁

      • Gareth Murfin

        Never heard anything quite so ridiculous as a million errors a day. That is simply awful coding, maybe the coding of a teenager just picking up Java. I find this whole article insulting to the language and profession. A billion errors? Get a proper coder!!!

  • dadimitrov

    I typically find that the reason for having null is an invalid assumption about the input data. I often see people defensively replacing null with null-object, or skipping part of the algorithm only to result in subtle data errors downstream. And I would take hundred NPEs than having to debug one of these…