Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry

Endre Sara

February 10, 2024


Implementing OpenTelemetry at the core of our observability strategy for Causely’s SaaS product was a natural decision. In this article, I'll share some background on our rationale and how the combination of OpenTelemetry and causal reasoning addresses several critical requirements that allow us to scale our services more efficiently.

Avoiding common observability pitfalls

We already know, based on decades of experience working in and with operations teams in the most challenging environments, that bridging the gap between the vast ocean of observability data and actionable insights has been, and continues to be, a major pain point. This is especially true in the complex world of cloud-native applications.

Missing application insights

Application observability remains an elusive beast for many, especially in complex microservices architectures. While infrastructure monitoring has become readily available, neglecting application data paints an incomplete picture, hindering effective troubleshooting and operations.

Siloed solutions

Traditional observability solutions have relied on siloed, proprietary agents and data sources, leading to fragmented visibility across teams and technologies. This makes it difficult to understand the complete picture of service composition and dependencies.

To me, this is like trying to solve a puzzle with missing pieces. That is essentially the problem many DevOps teams face today: piecing together a picture of how microservices, serverless functions, databases, and other elements interact with one another and with the underlying infrastructure and cloud services they run on. This makes it hard to collaborate and troubleshoot; it's a struggle to pinpoint the root cause of performance issues or outages.

Vendor lock-in

Many vendors’ products also lock customers’ data into their cloud services. This can result in customers paying through the nose, because licensing costs are predicated on the volume of data that is being collected and stored in the service providers’ backend SaaS. It can also be very hard to exit these services once locked in.

These are all pitfalls we wanted to avoid at Causely as we set out to build our Causal Reasoning Platform.

The pillars of our observability architecture pointed us to OpenTelemetry

OpenTelemetry provides us with a path to break free from these limitations. It establishes a common framework that transcends the programming languages and platforms we use to build our services, and it satisfies the requirements laid out in the pillars of our observability architecture:

Precise instrumentation

OpenTelemetry offers automatic instrumentation options that minimize the manual code changes we need to make and streamline the integration of our internal observability capabilities into our chosen backend applications.

Unified picture

By providing a standardized data model powered by semantic conventions, OpenTelemetry enables us to paint an end-to-end picture of how all of our services are composed, including application and infrastructure dependencies. We can also gain access to critical telemetry information, utilizing this semantically consistent data across multiple backend microservices even when written in different languages.
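To illustrate why shared attribute names matter, here is a minimal sketch in plain Python, not the OpenTelemetry SDK. It assumes span records have already been exported and flattened into dictionaries. The keys `service.name`, `peer.service`, and `telemetry.sdk.language` follow OpenTelemetry semantic conventions; the service names, the record layout, and the `build_dependency_map` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical, flattened span records. Services written in different
# languages still report the same attribute keys.
spans = [
    {"service.name": "checkout", "peer.service": "payments", "telemetry.sdk.language": "go"},
    {"service.name": "checkout", "peer.service": "inventory", "telemetry.sdk.language": "go"},
    {"service.name": "payments", "peer.service": "postgres", "telemetry.sdk.language": "java"},
    {"service.name": "inventory", "peer.service": "postgres", "telemetry.sdk.language": "python"},
    {"service.name": "postgres"},  # leaf entity: no outbound calls recorded
]


def build_dependency_map(span_records):
    """Return {caller: sorted list of callees} using the shared attribute keys."""
    edges = defaultdict(set)
    for span in span_records:
        caller = span["service.name"]
        callee = span.get("peer.service")
        edges[caller]  # touch the key so leaf services still appear in the map
        if callee is not None:
            edges[caller].add(callee)
    return {service: sorted(deps) for service, deps in edges.items()}


dependency_map = build_dependency_map(spans)
for service, deps in dependency_map.items():
    print(f"{service} -> {deps}")
```

Because every service reports the same keys, one small function can assemble the dependency picture without per-language or per-vendor parsing.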

Vendor-neutral data management

OpenTelemetry allows us to avoid locking our application data into third-party vendors’ services by decoupling it from proprietary vendor formats. This gives us the freedom to choose the best tools on an ongoing basis based on the value they provide. If something new comes along that we want to exploit, we can easily plug it into our architecture.

Resource-optimized observability

With OpenTelemetry, we can take a top-down approach to data collection, starting with the problems we are looking to solve and eliminating unnecessary information. Doing so minimizes our storage costs and the compute resources we need to support our observability pipeline.
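As a rough sketch of what "top-down" can mean in practice, the snippet below starts from the attributes needed to answer a question and drops everything else. It is plain Python for illustration only. In a real pipeline this kind of reduction would typically be configured in the telemetry pipeline itself. `http.route` is an OpenTelemetry semantic convention; `duration_ms`, `error`, the dropped routes, and the `reduce_span` helper are assumptions made for this example.

```python
# Hypothetical policy: the questions we want to answer decide what we keep.
REQUIRED_ATTRIBUTES = {"service.name", "http.route", "duration_ms", "error"}
DROPPED_ROUTES = {"/healthz", "/metrics"}  # assumed low-value endpoints


def reduce_span(span):
    """Return a trimmed copy of the span, or None if it should be dropped."""
    if span.get("http.route") in DROPPED_ROUTES:
        return None
    return {key: value for key, value in span.items() if key in REQUIRED_ATTRIBUTES}


raw_spans = [
    {
        "service.name": "checkout",
        "http.route": "/cart",
        "duration_ms": 120,
        "error": False,
        "user_agent": "curl/8.0",      # not needed for the question at hand
        "debug.payload": "x" * 500,    # large and never queried
    },
    {"service.name": "checkout", "http.route": "/healthz", "duration_ms": 1, "error": False},
]

kept = [s for s in (reduce_span(r) for r in raw_spans) if s is not None]
print(f"kept {len(kept)} of {len(raw_spans)} spans: {kept}")
```

The health-check span is dropped entirely, and the remaining span carries only the fields the analysis needs.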

We believe that following these pillars and building our Causal Reasoning Platform on top of OpenTelemetry will propel our product’s performance, enable rock-solid reliability, and ensure consistent service experiences for our customers as we scale our business. We will also minimize our ongoing operational costs, creating a win-win for us and our customers.

OpenTelemetry + causal analysis: scaling for performance and cost efficiency

Ultimately, observability aims to illuminate the behavior of distributed systems, enabling proactive maintenance and swift troubleshooting. Yet isolated failures manifest as cascading symptoms across interconnected services.

While OpenTelemetry enables back-end applications to use this data to provide a unified picture in maps, graphs and dashboards, the job of figuring out the cause and effect in the correlated data still requires highly skilled resources. This process can also be very time-consuming, tying up personnel across multiple teams that own different elements of the overall service.

There is a lot of noise in the industry right now about how AI and LLMs are going to magically come to the rescue, but reality paints a different picture. All of the solutions available in the market today focus on correlating data versus uncovering a direct understanding of causal relationships between problems and the symptoms they cause, leaving DevOps teams with noise, not answers.

Traditional AI and LLMs also require massive amounts of data as input for training and learning behaviors on a continuous basis. This is data that ultimately ends up being transferred and stored in some form of SaaS. Processing these large datasets is very computationally intensive. This all translates into significant cost overheads for the SaaS providers as customer datasets grow over time, and those costs ultimately result in ever-increasing bills for customers.

At Causely, we're taking a different approach

Our causal reasoning software provides operations and engineering teams with an understanding of the “why”, which is crucial for effective and timely troubleshooting and decision-making.

Example causality chain: Database Connection Noisy Neighbor causing service and infrastructure symptoms

Our Causal Reasoning Platform uses predefined models of how problems behave and propagate. When combined with real-time information about a system’s specific structure, Causely computes a map linking all potential problems to their observable symptoms.

This map acts as a reference guide, eliminating the need to analyze massive datasets every time the platform encounters an issue. Think of it as checking a dictionary instead of reading an entire encyclopedia.
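The idea of a precomputed problem-to-symptom map can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Causely's actual models or algorithm: the model contents, the topology, the symptom names, and the Jaccard-similarity matching are all hypothetical choices made for the example.

```python
# Hypothetical causal models: for each entity type, each problem lists the
# symptoms it causes locally and the symptoms it propagates to dependents.
CAUSAL_MODELS = {
    "database": {
        "connection_noisy_neighbor": {
            "local": ["high_connection_count"],
            "propagated": ["high_latency"],
        },
    },
    "service": {
        "cpu_congestion": {
            "local": ["high_cpu", "high_latency"],
            "propagated": ["high_latency"],
        },
    },
}

# Hypothetical topology: entity -> (entity type, entities that depend on it).
TOPOLOGY = {
    "postgres": ("database", ["payments", "inventory"]),
    "payments": ("service", ["checkout"]),
    "inventory": ("service", ["checkout"]),
    "checkout": ("service", []),
}


def downstream(entity):
    """All entities that directly or transitively depend on `entity`."""
    seen, stack = set(), list(TOPOLOGY[entity][1])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TOPOLOGY[current][1])
    return seen


def build_causal_map():
    """Map each (entity, problem) to the set of (entity, symptom) it explains."""
    causal_map = {}
    for entity, (entity_type, _) in TOPOLOGY.items():
        for problem, effects in CAUSAL_MODELS[entity_type].items():
            symptoms = {(entity, s) for s in effects["local"]}
            for dependent in downstream(entity):
                symptoms |= {(dependent, s) for s in effects["propagated"]}
            causal_map[(entity, problem)] = symptoms
    return causal_map


def most_likely_cause(causal_map, observed):
    """Pick the problem whose expected symptoms best match the observed ones."""
    def score(item):
        expected = item[1]
        return len(expected & observed) / len(expected | observed)  # Jaccard similarity
    return max(causal_map.items(), key=score)[0]


CAUSAL_MAP = build_causal_map()  # computed once from models plus topology
observed = {
    ("postgres", "high_connection_count"),
    ("payments", "high_latency"),
    ("inventory", "high_latency"),
    ("checkout", "high_latency"),
}
print(most_likely_cause(CAUSAL_MAP, observed))
```

The map is built once from the models and the current topology. Diagnosing an incident is then a lookup against that map rather than a fresh pass over historical data, which is the dictionary-versus-encyclopedia point made above.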

The bottom line is that, in contrast to traditional AI, Causely operates on a much smaller dataset, requires far fewer resources for computation, and provides more meaningful, actionable insights, all of which translate into lower ongoing operational costs and profitable growth.

Summing it up

There’s massive potential for causal analysis and OpenTelemetry to come together to tackle the limitations of traditional AI to get to the “why.” This is what we’re building at Causely. Doing so will result in numerous benefits:

  • Less time on Ops, more time on Dev: OpenTelemetry provides standardized data while Causely analyzes it to automate the root cause analysis (RCA) process, which will significantly reduce the time our DevOps teams have to spend on troubleshooting.
  • Instant gratification, no training lag: We can eliminate AI’s slow learning curve, because Causely leverages OpenTelemetry’s semantic language and the Causal Reasoning Platform’s domain knowledge of cause and effect to deliver actionable results, right out of the box without massive amounts of data and with no training lag!
  • Small data, lean computation, big impact: Unlike traditional AI’s data gluttony and significant computational overheads, Causely thrives on targeted data streams. OpenTelemetry’s smart filtering keeps the information flow lean, allowing Causely to identify the root causes with a significantly smaller dataset and compute footprint.
  • Fast root cause identification: Traditional AI might tell us “ice cream sales and shark attacks rise together,” but causal reasoning reveals the truth: it’s the summer heat, not the sharks, driving both! By understanding cause-and-effect relationships, Causely cuts through the noise and identifies the root causes behind performance degradation and service malfunctions.
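The ice cream and shark attack example from the last bullet can be made concrete with a small simulation. This is a generic statistics sketch with synthetic data, not a description of how Causely works: the coefficients, noise levels, and sample size are arbitrary assumptions. Both series depend only on temperature, yet they correlate strongly until temperature is accounted for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, xs):
    """Remove the linear effect of xs from ys using simple least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable
temperature = [rng.uniform(10, 35) for _ in range(2000)]
# Synthetic data: each series is driven by temperature plus independent noise.
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

raw = pearson(ice_cream, shark_attacks)
controlled = pearson(residuals(ice_cream, temperature), residuals(shark_attacks, temperature))
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for temperature: {controlled:.2f}")
```

The raw correlation is high, while the correlation that remains after removing the temperature effect is close to zero. Correlation alone points at the wrong relationship, and knowing the causal structure tells us which variable to look at.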

Having these capabilities is critical if we want to move beyond the labor-intensive processes associated with how RCA is performed in DevOps today, and eventually achieve autonomous service reliability. This is why we are eating our own dog food and using Causely as part of our tech stack to manage the services we provide to customers.

Want to learn more about our experience with OpenTelemetry, or see if Causely can help you build better, more reliable cloud-native applications? Book a meeting with the Causely team. We'd love to chat!
