The Future of Observability: Trends, Tools & Best Practices

Jim Pierson
Jun 20, 2023
Updated: Mar 4

Observability Defined

‘Observability’ gives insight into the inner workings of your enterprise’s web services, improving your ability to deliver high-quality service and reducing the risk of service interruptions. Observability features like Real User Monitoring (RUM) and Application Performance Monitoring (APM) can lead your enterprise to increased customer satisfaction and revenue. Observability is also becoming increasingly important in non-web industries such as manufacturing, where data-driven decision-making is critical.

Observability - Levels of Evolution

Let’s look at the current state and consider where the Observability industry is going with systems like APM and RUM, new technologies like Large Language Models (LLM), and the leaders for website observability.

While searching for new work in the Observability field, I realized that although I have 30 years of Product Management experience building monitoring and performance engineering systems, I have been heads down on a limited toolset for a few years. I needed to update my knowledge of what is new in a rapidly growing field. Over several weeks, I studied the top 10 players in the market, combing through hundreds of documents and videos.

I needed a framework with key criteria to compare current capabilities. In this first section, I will discuss how we can evaluate the Evolution Levels of Observability tools. By ‘Evolution level,’ I mean WHAT a tool can do for the user. A tool that requires the user to be an expert and set everything up themselves is less evolved than a tool that does everything for the user and automatically bubbles up valuable observations.

Across the industry, we hear the vernacular “Pillars of Observability”: Logs, Metrics, Traces, and, some argue, Events and Profiles. Although these are all technically logs, they use different formats and have evolved separately. Still, collecting logs, traces, metrics, and events, storing them, showing dashboards with the data, and even using ML anomaly detection to trigger Alerts are all Basic Observability features today. Every vendor does this. For a tool to rise above the rest, it needs to do more. As I looked through the tools available today, I saw several distinct levels of evolution that differentiate the Observability tools in use over the last decade.

Table of Evolutionary Levels Of Observability Tools

Below Basic Observability are the Component - level 1 (orange) tools. We use these tools hands-on to take measurements, generally of a single session from one point of view, without any statistical ability to look across many samples. Many of these tools are 20+ years old. They range from the simplest tools, like Ping, to much more complex tools like Wireshark, Chrome DevTools, or WebPageTest. They are limited to one area of expertise and have little to no logging. Often, these tools look at the raw source data that is logged into higher-level tools.

At the Basic - level 2 (yellow), you get logging capability with storage, the ability to write custom queries, build dashboards, and set up Machine Learning (ML) Anomaly Detection jobs. You, the user, must know which metrics to log and display. You get all the building blocks from the tool, but you need to put them together yourself, each and every iteration. Several vendors start with this basic feature set and do really well. Elasticsearch is one example: their effort to make DIY easy may give them a leg up on competitors in non-web industries.
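
As a rough illustration of what a Basic-level anomaly job does, here is a minimal Python sketch that checks one user-chosen metric against a rolling baseline. The function name, the window size, the threshold, and the sample page-load times are all assumptions made up for this example; no vendor's ML job is implemented this way as far as this article knows.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(samples, window=5, threshold=3.0):
    """Flag samples that deviate strongly from the preceding `window` samples.

    Returns a list of (index, value) pairs whose z-score against the
    rolling baseline exceeds `threshold`.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Skip flat baselines, where a z-score is undefined.
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append((i, samples[i]))
    return anomalies

# Hypothetical page-load times in milliseconds; the user had to know
# that this is the metric worth watching.
page_load_ms = [510, 495, 502, 488, 515, 499, 507, 1950, 503, 491]
anomalies = rolling_zscore_anomalies(page_load_ms)
print(anomalies)
```

The point of the sketch is the division of labor: the tool supplies the statistics, but the user still picks the metric, the window, and the threshold.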

Above Basic Observability is the Expert - level 3 (green). This tool level comes out of the box with specific metrics and UIs that are already prepared. The vendor or a 3rd party integration has identified key metrics for monitoring that can provide an all-up view of the service, stack ranks of leading indicators to look at before drilling into finer details, and predefined parent-child tracing across distributed systems. ML anomaly detection is available, but it runs automatically on only a few metrics. If the tool has prefabricated screens for a few key metrics, I call it an Expert. That said, tools vary widely in how well they perform the Expert function. The APM and RUM solutions from all the vendors I looked at have screens already laid out for you, so you can see high-level metrics and drill down from them.

Above the Expert level is Intelligent - level 4 (blue). We can point this tool at a raw data source and let it tell us what it discovers. You recognize this tool when you see it building the UI on the fly, bringing into focus only the items necessary. This tool incorporates all the lower-level data and expert knowledge. It is not limited to a small set of key metrics, nor to a decision tree approach. Instead, it discovers anomalies across all metrics simultaneously. The ability to discover at scale is a big difference from Expert tools. It skims across the top of the data, stack-ranks anomalies, and bubbles them up to the user. It automatically follows the clues connecting one service level to another across the entire stack, leading the user toward the root cause. Almost all of the top vendors I looked at have one tool that aspires to this level of Observability. The most Intelligent observational tools include AWS DevOps Guru, Splunk: IT Service Intelligence, Dynatrace: Davis, New Relic: AIOps, and Moogsoft AIOps.
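
The "scan everything, then stack-rank" idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: the metric names and values are invented, and a simple z-score stands in for whatever models the commercial tools named above actually use.

```python
from statistics import mean, stdev

def anomaly_score(history, latest):
    """Z-score of the latest sample against its own history (0.0 if flat)."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return abs(latest - mean(history)) / sigma

def rank_anomalies(metrics, min_score=3.0):
    """Score every metric, keep the anomalous ones, and sort worst-first.

    `metrics` maps a metric name to its samples; the last sample is
    treated as the latest observation.
    """
    scored = []
    for name, samples in metrics.items():
        if len(samples) < 3:
            continue  # not enough history to build a baseline
        score = anomaly_score(samples[:-1], samples[-1])
        if score >= min_score:
            scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical metrics from several layers of the stack.
metrics = {
    "checkout.latency_ms": [120, 118, 125, 121, 119, 410],
    "checkout.error_rate": [0.010, 0.012, 0.009, 0.011, 0.010, 0.090],
    "db.cpu_percent": [41, 43, 40, 42, 44, 43],
    "cdn.cache_hit_ratio": [0.93, 0.94, 0.92, 0.93, 0.94, 0.93],
}
ranked = rank_anomalies(metrics)
for name, score in ranked:
    print(f"{name}: score {score:.1f}")
```

Unlike the Basic level, nobody told this loop which metric mattered. It looks at all of them and surfaces only the ones that moved.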

The Assistant Operator – Level 5 (purple) is another level above. This tool is given ‘Operator’ access privileges to all lower-level tools and becomes like a junior dev/operations team member under supervision. It manages metric discovery, creates and tunes lower-level ML anomaly jobs, sifts through results, and predicts trends at a massive scale. Using Natural Language Processing (NLP) and Large Language Models (LLM) enables a much better “contextual” analysis of text. It can analyze customer calls and social media using sentiment analysis. This tool is collaborative and supportive. It can communicate directly with non-technical users, and it takes feedback and modifies its own behavior accordingly.

Looking Back In Time And Then Towards The Future, We See The Following Pattern

To close out the topic of Evolution Levels of Observability: the optimal tool does not need the user to know in advance what to look for. A tool requiring the user to pick and choose among all the possible measurements wastes our time. In the modern world, we should be able to fulfill the promise of “Log Everything” and now also “Analyze Everything.”

These comments are based on my research and experience and my personal opinions. How about your experience? Do these levels match your expectations for Observability tools? What’s missing?

Observability - The Service Path

In addition to looking at WHAT a tool can do, it is essential to look at WHERE a tool takes its measurements in the infrastructure. As in any murder-mystery plot, problems are easier to solve when there are paths to follow. Observation tools can collect data from every object in the path between the user/customer interaction and the developer checking in the code… the Service Path.

The focus here is on the Web Service Path across the ‘Full-Stack,’ the series of measurable objects between the user and the developer checking in the code. I’ll call these the “Layers of Observability.” Please note, Observability is used for non-website industries as well. I will cover those in another blog post.

The Purpose: If we ensure every object that participates in the transaction is being observed, then we are assured there will be data when we need it to troubleshoot. It is better to pay for logging upfront than to have a component down and not have data.

Here are the stack layers that I am using to evaluate Observability tools. These are sorted by proximity to the User on one end and the developer on the other. I have listed the key aspects of each layer that differentiate it from others.

If you have spent any time in the software industry shipping and supporting production websites, you know that the components you don’t have measurements on are the hardest to troubleshoot when they don’t work correctly. How long has it taken your team to investigate and find that one proxy server flailing because you didn’t have APM agents on it? Likewise, network ASN path information and DNS responses are often left to the network experts on-prem but may still need to be monitored in the Cloud.

Several of these layers have become de facto standards for Observability:

RUM - Real User Measurements from the browser collect metrics, e.g., Page Load Time and WebVitals.
Synthetics - Automated pollers around the world download web pages and call API endpoints, checking for availability (errors, downtime) and latency.
APM - Application Performance Monitoring of transactions on a server/service collects information about Latency, Throughput, and Errors.
Infrastructure - Monitors low-level server/host metrics like CPU, memory, I/O, and network, as well as how many servers/VM/pods are running.
Database - 3rd party integrations largely cover observability.
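
To make the APM entry in the list above concrete, here is a minimal Python sketch that derives latency, throughput, and error rate from raw transaction records. The `Transaction` type, its field names, the sample data, and the choice to count only 5xx status codes as errors are assumptions for illustration; real APM agents collect far richer data.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Transaction:
    duration_ms: float
    status_code: int

def summarize(transactions, window_seconds):
    """Compute the three classic APM signals for one time window."""
    durations = [t.duration_ms for t in transactions]
    errors = sum(1 for t in transactions if t.status_code >= 500)
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th
    # percentile. "inclusive" keeps the result within the observed range.
    p95 = quantiles(durations, n=20, method="inclusive")[-1]
    return {
        "throughput_rps": len(transactions) / window_seconds,
        "error_rate": errors / len(transactions),
        "latency_p95_ms": p95,
    }

# Hypothetical transactions observed during a 5-second window.
window = [
    Transaction(110, 200), Transaction(95, 200), Transaction(130, 200),
    Transaction(105, 200), Transaction(900, 500), Transaction(120, 200),
    Transaction(98, 200), Transaction(101, 200), Transaction(115, 200),
    Transaction(125, 200),
]
summary = summarize(window, window_seconds=5)
print(summary)
```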

What is considered Observability has expanded upstream from operational monitoring toward the developer. Collecting CI/CD data, such as deployment timestamps from GitHub events and actions, can be a huge time saver. If we know that new code was just pushed out, we can correlate any new behaviors measured by RUM and APM to that event.
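
A minimal Python sketch of that correlation, assuming deployment events have already been collected with timestamps: given the time of an anomaly, list the deployments that landed shortly before it. The record fields, service names, commit IDs, and the 30-minute lookback are hypothetical.

```python
from datetime import datetime, timedelta

def deployments_before(anomaly_time, deployments, lookback=timedelta(minutes=30)):
    """Return deployments that landed within `lookback` before the anomaly,
    most recent first."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= anomaly_time - d["deployed_at"] <= lookback
    ]
    return sorted(candidates, key=lambda d: d["deployed_at"], reverse=True)

# Hypothetical deployment events, e.g. gathered from a CI/CD pipeline.
deployments = [
    {"service": "checkout", "commit": "a1b2c3d", "deployed_at": datetime(2023, 6, 20, 14, 5)},
    {"service": "search", "commit": "e4f5a6b", "deployed_at": datetime(2023, 6, 20, 9, 30)},
]

# Time at which RUM or APM first reported the new behavior.
anomaly_time = datetime(2023, 6, 20, 14, 17)
suspects = deployments_before(anomaly_time, deployments)
for d in suspects:
    print(d["service"], d["commit"], d["deployed_at"].isoformat())
```

A deployment showing up in this list is a lead to investigate, not proof of cause.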

If you have Cloud services, you know that monitoring runaway Costs can be critical to your business. In addition to monitoring the bits and milliseconds going by, you also need to track the dollars spent for servers the auto-scaler has spun up.

Security in all its aspects, such as Firewalls and Access Control, can directly impact availability and latency observations. Checking whether the intrusion detection service is ‘mitigating traffic’ to a service endpoint is often a stop along the way when troubleshooting why a service is underperforming.

From the holistic perspective, one of the most important layers is the Correlation and Integration layer. Some vendors call this AIOps to reflect the new possibilities with AI, though this layer has existed for a long time under other names, including simply ‘Alerting.’ In most environments, an Operations Center console has always been a ‘buffer’ of all the alarms. In the past, this layer was relatively thin, more of an alert summary than any real detail, but it could contain alert info from any component in the entire infrastructure. The Operator still had to chase down the right people who knew how to log into the right servers to look at the right tool. Today, the Correlation tool is much more integrated into all other Observability layers, with the ability to link directly to the right charts and details on the fly. The Assistant Operator – Level 5 will play a major new role using the Large Language Model possibilities.
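
One of the simplest forms of alert correlation is grouping alerts that arrive close together in time into a single incident. The Python sketch below shows only that idea; the alert fields, the sample alerts, and the 120-second gap are assumptions, and real correlation engines also use topology, text similarity, and learned patterns.

```python
def group_alerts(alerts, max_gap_seconds=120):
    """Group alerts into incidents by time proximity.

    An alert joins the current incident if it arrives no more than
    `max_gap_seconds` after the previous alert; otherwise it starts a
    new incident.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if incidents and alert["ts"] - incidents[-1][-1]["ts"] <= max_gap_seconds:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents

# Hypothetical alerts from different layers; `ts` is seconds since some epoch.
alerts = [
    {"ts": 1000, "layer": "RUM", "msg": "Page load time up"},
    {"ts": 1045, "layer": "APM", "msg": "Checkout latency high"},
    {"ts": 1090, "layer": "Database", "msg": "Slow queries"},
    {"ts": 5000, "layer": "Infrastructure", "msg": "Disk 85% full"},
]
incidents = group_alerts(alerts)
for number, incident in enumerate(incidents, start=1):
    print(number, [a["layer"] for a in incident])
```

Here the first three alerts collapse into one incident spanning the RUM, APM, and Database layers, which is the kind of cross-layer view the Operator previously had to assemble by hand.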

Can you think of anything I have missed? Is there some other set of metrics that your team measures to keep the systems alive and running well?

The Results?

This last section will show an all-up study of Website Observability tools, rated by Evolution level and each layer in the service path.

I’m going to split this section into three parts:

Specialized tools that do one thing
Cloud providers who also provide Observability tools
End-to-end dedicated Observability vendors

First, here is a list of specialized tools for each layer in the service path between the user and the developer. Most of these tools are widely used and known for their excellence in investigating problems within a specific area or layer. These tools can be integrated into a platform but are not cohesive Observability platforms.

The color and the number on the right designate the Evolution Level. I labeled most of them as Component Level 1 (orange) on the Observability Evolutionary scale. The exceptions are the solutions specifically concerned with consolidating metrics and alarms across a whole industry. I label Moogsoft and BigPanda as Intelligent Level 4 (blue) due to their ability to build situational dashboards on the fly based on multiple sources. Note that they lack the same level of robust integration as the Cloud and end-to-end providers since they do not go beyond the integration layer.

BTW, I mentioned Catchpoint in four layers. This reflects their growth toward becoming an end-to-end Observability tool.

Top Three Cloud Providers

For most layers, the cloud providers already have screens for setup and editing configurations. They now have Expert level 3 (green) logging, alerting, and dashboards.

The cloud providers have been behind the dedicated providers on a Level 4 Observability solution (blue). Their solutions ask the user to tell the tool where and which metrics to measure. I expect the cloud providers to march forward and assimilate any new feature ideas that first appear in the 3rd party tools. AWS recently developed a top-level tool called DevOps Guru, which looks very promising as a Level 4 tool.

Comparing Top Providers For Observability

Top End-to-End Observability Providers

Here is the all-up view of the Observability End-to-End solution providers. My firsthand experience has been mostly with New Relic, Elasticsearch, AWS CloudWatch, and WebVitals. Still, the concepts and many of the names align across the industry.

Observations

Elasticsearch, while great for those of us who like to Do-It-Yourself, falls behind the others in the number of Expert level curated dashboards for website observability. The 3rd party Data Integrations enable logging data from these layers, but they are not an expert package with a UI. While Elasticsearch enables 3rd parties to integrate their data templates, it’s unclear how 3rd parties can define sharable UIs. My study is unfair to Elasticsearch in that it focuses only on the website industry. Elasticsearch’s DIY approach lends itself to industries such as manufacturing, where bespoke data needs custom solutions. A future blog post will cover Observability in other industries.
I couldn’t find a Cloud Cost Monitoring solution from Dynatrace.
AppDynamics had the coolest Network path visualizations.
AWS, New Relic, Splunk, and Dynatrace have tools that aspire to be Intelligent Level 4 (Blue) tools. They scan for issues across the whole stack, build UIs on the fly, and could be game changers.

All-Up Scoring For All The Website Observability Providers

Summing up the points across the full stack, we see the differences in current technology between the providers today are nuanced. That said, AWS and New Relic are the leaders with the most robust solutions, but Azure, Splunk, Datadog, and even GCP and Dynatrace are likely to catch up. Elasticsearch is furthest behind in the level of Expert UI features.

46 pts = AWS, New Relic
45 pts = Azure, Splunk, Datadog
44 pts = GCP, Dynatrace
42 pts = AppDynamics
39 pts = Elasticsearch

Most Evolved Website Observability Tools

Conclusion

My study so far looks only at the technology, or ‘Speed’ to market, aspect. It does not look at two other important dimensions, Quality and Cost. That said, I can draw a few conclusions based on the technology looked at here:

The Cloud Providers are catching up to the 3rd party end-to-end providers. Enterprises running primarily on one cloud provider may find it ideal to use that provider’s built-in Observability tool suite rather than risk paying extra to transport their data to one of the 3rd party providers.
Enterprises running on multiple clouds should strongly consider the top 3rd party providers.

Postscript: During the initial peer review of this paper, it was suggested that I add a reference to Gartner’s quadrant report. I looked it up and mainly found agreement. Where we differ appears to be in new market features that have matured since their 2022 report. My report is based primarily on actual screenshots of working tools posted within the last year.

About the Author

Jim is a Practitioner with RingStone and has 30+ years of experience in Product Management, Observability, and Quality Management, with Performance Engineering as a core focus. Jim has led the creation of Cloud-based tools resulting in products improving from ~99% availability to 99.99%+ and reducing Page Load Times by 200%, and has successfully pitched projects worth $100+ million to Executive Leadership. He has trained thousands of software engineers in the Best Practices of Performance and Quality Engineering. He has a military background in Special Operations and Electronic Communications.

Contact Jim at Jim.Pierson@ringstonetech.com