Demystifying Observability

Suren Raju
Jun 3, 2020

Microservices are hard

In the last decade, we saw a significant shift in how modern, internet-scale applications are built. Cloud computing and containerization technologies like Docker enabled a new breed of distributed system design commonly referred to as microservices. Large companies such as Uber, Google, Airbnb, Netflix, and Twitter have leveraged microservices to build highly scalable and reliable systems and to deliver features faster to their customers. Many organizations are moving to microservices so that their developers can develop and deploy their services independently, without having to plan or coordinate their activities with other teams.

Despite the benefits and eager adoption, microservices come with their own challenges and complexity.

Monolithic vs Microservice — courtesy of https://www.weave.works

The diagram above explains the difference between a monolithic architecture and a simple microservice architecture. With a monolithic architecture, you have one large server responsible for handling all the requests, and communication across multiple modules or packages happens in memory. When you split these monolithic modules or packages into microservices with bounded contexts, you end up with hundreds of microservices that depend on each other. Instead of memory, networks and machines are involved in the communication across these services. When we have a deep distributed system with a few hundred microservices, the developers probably know about their direct dependencies, but they may not always know about all the transitive dependencies.
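
To make the difference concrete, here is a minimal Go sketch (the function names, service name, and endpoint are hypothetical): in a monolith, the call is an in-process function call; between microservices, the same call becomes a network request that can fail or be slow for reasons outside your control.

```go
package main

import (
	"fmt"
	"net/http"
)

// Monolith: the "inventory" module is just another package. The call happens
// in memory and can only fail if the process itself fails.
func checkInventoryInProcess(sku string) bool {
	return true // placeholder business logic
}

// Microservices: the same call now crosses the network, so it can time out,
// return errors, or be slow because of any transitive dependency downstream.
func checkInventoryOverNetwork(sku string) (bool, error) {
	resp, err := http.Get("http://inventory-service/stock/" + sku) // hypothetical endpoint
	if err != nil {
		return false, err // network, DNS, or remote-service failure
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	fmt.Println(checkInventoryInProcess("sku-123"))
	ok, err := checkInventoryOverNetwork("sku-123")
	fmt.Println(ok, err)
}
```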

The idea with microservices was to create more autonomy and independence for microservice teams, but what we end up creating is a deep distributed system with complex dependencies that are difficult to understand. As in every distributed system, there is a higher chance of network, hardware, or application-level issues. Because of service dependencies, any component can become temporarily unavailable to its consumers.

As the number of microservices increases, the number of failure modes increases linearly, but what matters to the users does not change much.

Responsibility without Control

If you own a service, you’re focused on its health — latency, error rates, throughput, etc. — but that service can only be as fast and reliable as its slowest, most error-prone dependency.

For example, suppose you own customer-facing service A. Your service depends on service B, service B depends on service C, and the dependency chain extends all the way to service Z.

If a customer reports slowness, since you are on the front line, you might investigate the issue with the help of your logs, metrics, and dashboards, and identify that the problem is not in your service but in service B. Now you might decide to talk to service B’s team, and they might say: “Hey! It’s not our problem, it’s service C.”

Now assume service Z is the root cause of the slowness: consider how much effort you would need to coordinate with all these teams to identify it. In a distributed system, if a service you directly or indirectly depend on is having a bad day, you become responsible too. Basically, the discrepancy between what you control and what you are responsible for grows with the number of microservices.

Fundamentally, you only control your own service, but the scope and complexity of the services you are implicitly responsible for increase as each new service and dependency deploys to production. We need consistent monitoring across the distributed system; otherwise we are going to be missing pieces and have blind spots in our monitoring.

Observability addresses these problems by helping developers understand multi-layered distributed architectures: what’s slow, what’s broken, where, and what needs to be done to improve performance.

Observability in Control Theory

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

In control theory, your system can be represented using a state vector, a theoretical representation of everything inside the system. In software, this could be all the processes running inside it and the status of resources such as CPU and memory at a point in time. The system takes a set of inputs and produces a set of outputs. Observability describes the relationship between the outputs and the internal state of the system: how well can you infer the internal state using only the outputs? In the modern paradigm, the outputs are telemetry, and the internal state is the application state at a given point in time, which we try to infer from that telemetry. We use the telemetry and tools to figure out the internal state.
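
For readers who want the textbook version, here is a minimal sketch in standard linear state-space notation (this is general control theory, not something specific to this article): the system is observable when the internal state x can be reconstructed from the outputs y alone.

```latex
% Linear time-invariant system: x is the internal state,
% u the input, y the observable output.
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)

% The system is observable iff the observability matrix has full rank n,
% i.e. the state x can be reconstructed from the outputs alone:
\mathcal{O} =
\begin{pmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{\,n-1} \end{pmatrix},
\qquad \operatorname{rank}(\mathcal{O}) = n
```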

Observability in Software

Basically, observability is being able to instrument an application to produce different kinds of telemetry, use that telemetry to ask questions about what your software is doing, and get the right answers to those questions. Having an observable system means that when you run a service in production, you have all the instrumentation you need to understand what’s happening in your software, detect undesirable behaviors (e.g. service downtime, errors, slow responses), and have actionable information to pin down the root cause effectively. Developers make a service observable by instrumenting the application to output the most important telemetry data.

Monitoring vs Observability

Monitoring is the activity of observing the state of a system over time. When you monitor something, you are looking for something you already know about: when you monitor for heap failure, you know the application is going to fail after x% heap usage. This is the domain of known failures; monitoring will not tell us anything about failures that are unknown to us. Monitoring is biased towards alerts, and every alert must have an action.
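
As a toy illustration of this “known failure mode” style of monitoring, here is a minimal Go sketch (the threshold and the alert action are hypothetical): we check for exactly one condition we anticipated, and nothing else.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// A "known failure mode" monitor: alert when heap usage crosses a fixed
// threshold we decided on in advance, with a predefined action attached.
const heapAlertBytes = 512 << 20 // hypothetical 512 MiB threshold

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		if m.HeapAlloc > heapAlertBytes {
			fmt.Println("ALERT: heap above threshold; action: page on-call / restart")
		}
		time.Sleep(30 * time.Second)
	}
}
```

Anything this check was not written to look for stays invisible. That gap is where the concept of observability kicks in.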

Observability, by contrast, is about being able to ask questions about what your software is doing, get the right answers to those questions, and identify patterns. What you find may or may not be a problem, but you should be able to ask open-ended questions and find answers.

Three pillars of Observability

There are three primary types of telemetry data through which systems are made observable.

Logs

Logging is the act of recording discrete events in the system. These events can be structured (e.g. JSON-based application/system logs) or unstructured (text strings). Logs should also carry contextual information such as “what happened”, “who was accessing the system”, or any specific system attribute at a specific time. Logs are typically easy to generate and difficult to extract meaning from; their volume grows with the number of users, and they are expensive to store.
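
A minimal sketch of structured, contextual logging using Go’s standard log/slog package (the event and field names are made up for illustration):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Emit structured JSON logs so downstream tools can parse and query them.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Attach contextual fields: what happened and who was involved;
	// the handler adds the timestamp automatically.
	logger.Info("order placed",
		"user_id", "u-1042", // hypothetical identifiers
		"order_id", "o-77813",
		"duration_ms", 187,
	)
	logger.Error("payment gateway timeout",
		"user_id", "u-1042",
		"order_id", "o-77813",
		"upstream", "payments-service",
	)
}
```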

Metrics

Metrics are aggregatable measurements such as counters (e.g. HTTP requests), gauges (e.g. HTTP queue depth), and histograms. They are usually represented as counts or measures and are often aggregated over a period of time. Since metrics are pre-aggregated, their size typically stays constant as the number of users grows. Some metrics aggregation systems, however, suffer under high metric cardinality.
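
A minimal sketch using the Prometheus Go client (github.com/prometheus/client_golang); the metric names, labels, and port are illustrative:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: monotonically increasing count of handled requests.
	httpRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests handled.",
		},
		// Keep labels low-cardinality (method, status) — never user IDs.
		[]string{"method", "status"},
	)

	// Histogram: request latency distribution, aggregated into buckets.
	requestDuration = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency.",
			Buckets: prometheus.DefBuckets,
		},
	)
)

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	w.Write([]byte("ok"))
	httpRequests.WithLabelValues(r.Method, "200").Inc()
	requestDuration.Observe(time.Since(start).Seconds())
}

func main() {
	prometheus.MustRegister(httpRequests, requestDuration)
	http.HandleFunc("/", handler)
	// Prometheus scrapes the pre-aggregated values from /metrics.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```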

Traces

When you can see software requests end-to-end, troubleshooting is easier and faster. Distributed tracing does this by showing you the path a request follows as it travels through a distributed system. As the request travels between services, each segment is recorded as a span, which represents the time spent in a service or in a resource of that service. All the spans of a request are combined into a single distributed trace to give you a picture of the entire request.
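
A minimal sketch using the OpenTelemetry Go API (provider and exporter setup are omitted, so spans go to a no-op tracer unless one is configured; the service, span, and attribute names are illustrative):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// processOrder records one span; calls into other services carry the same
// trace context so their spans join the same distributed trace.
func processOrder(ctx context.Context, orderID string) {
	tracer := otel.Tracer("checkout-service")

	// Start a span for this unit of work; ctx now carries the trace context.
	ctx, span := tracer.Start(ctx, "ProcessOrder")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	chargeCard(ctx) // the child span below links to this one via ctx
}

func chargeCard(ctx context.Context) {
	_, span := otel.Tracer("checkout-service").Start(ctx, "ChargeCard")
	defer span.End()
	// ... call the payment service, propagating ctx over HTTP/gRPC ...
}

func main() {
	processOrder(context.Background(), "o-77813")
}
```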

Observability Driven Development

Developers who build observable systems understand their software as they write it, include instrumentation when they ship it, then check it regularly to make sure it looks as expected. They use observability:

  • to decide how to scope features and fixes
  • to verify correctness or completion of changes
  • to inform priorities around which features to build and bugs to fix
  • to deliver necessary visibility into the behaviors of the code they ship

As a developer, when you write software, you have some hypothesis about how it should work. You should be able to answer the question: is my software doing what it is intended to do? You can then take those answers and let them influence your development.

Activities like testing should not end in the DEV or QA environment, in CI/CD systems, or in staging; they should continue into the actual usage of the system. In production, if the software is not being used in the expected way, you should be able to get that feedback and change the software accordingly.

In simple terms, ODD is somewhat cyclical, looping through these steps:

  • Instrument to measure what your users care about
  • Now that you are collecting data on how well you do the things your users care about, use the data to get better at doing those things.

Observability Tools

Along with emitting the three types of telemetry (logs, metrics, and traces) to make your system observable, you need the right tooling. The following are some guidelines for selecting an observability tool.

Simplicity and Extensibility

An observability solution must be simple and extensible. It should support polyglot environments (whatever languages and frameworks you use), integrate with your service mesh or container platform, and connect to Slack, PagerDuty, or whatever system your on-call team prefers.

Usability

If the platform is too difficult to learn or use on a daily basis, it won’t become part of existing processes and workflows. Developers won’t feel comfortable turning to the tool during moments of high stress, and little improvement will be made in the health and reliability of the system.

High-quality Telemetry

Effective observability requires high-quality telemetry. Your platform should provide tools to collect telemetry data from any app and enable observability scenarios on top of it.

Real Time Analysis

Reports, dashboards, and queries need to provide insight into what’s going on “right now,” so that developers can understand the severity of an issue or the immediate impact of their performance optimizations.

Aggregation and Visualization

Microservices can produce an enormous amount of data, far more than humans can easily understand without some sort of guidance. For an observability tool to be effective, it needs to make insights obvious. This includes interactive visual summaries for incident resolution and clear dashboards that offer an at-a-glance understanding of any event.

Scalability

The tool should ingest, process, and analyze your data without latency. Additionally, it can’t be cost-prohibitive to do so as data grows from terabytes to petabytes and beyond.

Return on Investment

Ultimately, observability tools must improve the customer experience, increase developer velocity, and ensure a more reliable, resilient, and stable system at scale.

Summary

  • Microservices are complex. The number of failure modes increases linearly with the number of microservices. You or a service you depend on are going to have a bad day, and that is normal
  • Microservice teams make their services observable by instrumenting their code to emit important telemetry data by means of logs, metrics, and traces
  • It’s essential that microservice teams understand their dependencies. Make sure the services you depend on follow observability standards and have the necessary instrumentation, so that in case of issues you will be able to pinpoint the root cause. We need consistent monitoring across the stack
  • Observability addresses these problems by helping developers understand multi-layered distributed architectures: what’s slow, what’s broken, where, and what needs to be done to improve performance
  • Along with emitting the three types of telemetry (logs, metrics, and traces) to make your system observable, you need the right observability tool
