Microservice Observability Strategies and Best Practices
In the last post, I explained what observability is, how it differs from monitoring, and why it matters in a microservice-style architecture. In this post, I will explain some strategies and best practices for implementing observability.
To recap, observability is the practice of instrumenting systems with tools to gather actionable data that provides not only the when of an error or issue but, more importantly, the why. The latter is what teams need to respond quickly and resolve emergencies in modern software. Organizations need to adopt observability best practices to gain a much deeper, shared understanding of their data, systems, and customers.
The following are some of the challenges organizations face on their journey towards implementing observability.
Problem 1: Agent Overload
Microservice-based systems require different agents for log and metric collection, network monitoring, uptime monitoring, analytics aggregation, security scanning, APM runtime instrumentation, and so on. Ops teams are concerned about their machines becoming a dumping ground for bits of vendor code that can slow performance, conflict with services running on the machines, and cause huge management headaches when upgrades are needed. Each agent brings not just the vendor’s code into production but also all of its dependencies.
Problem 2: Data Overload
Collecting a lot of data can be overwhelming, especially for companies that rush into it without asking the right questions. With microservices, a new release can cause a surge in log volume that slows down log ingestion across all of your services. The more data sources we have, the harder it is to determine exactly what we want to do with the data. We can’t track everything; we need to focus on the data sources that hold the most value for the teams. Most data ingestion tools, such as Logstash, provide an ingestion pipeline that allows us to collect data from a variety of sources, transform it on the fly, and send it to the desired destination. Observability for microservice-style applications requires more control over data ingestion to avoid capacity anxiety.
Problem 3: Instrumentation
Current APM (Application Performance Monitoring) solutions on the market, such as New Relic and Dynatrace, rely heavily on different levels of instrumentation. This is why teams usually install vendor-specific agents to collect metrics into these products. Vendor-specific agents are not optimal from an instrumentation point of view, because the instrumentation is tied to a single product. To take microservice observability to the next level, a vendor-neutral instrumentation standard such as OpenTracing is needed.
Problem 4: Tools
Because of the problems discussed earlier, organizations generally settle on a limited set of tools, such as logging, metrics, and analytics systems. Adopting any open source or commercial observability tool and its specific components can result in vendor lock-in. Vendor lock-in and high switching costs make it difficult to use the right tool for the job.
The OpenTracing and OpenCensus projects were started to solve this problem. They provide vendor-neutral APIs that other frameworks and libraries can implement. This enables developers to add instrumentation to their application code that won’t lock them into any particular vendor. This low coupling, along with easy-to-use interfaces, makes these two projects very attractive.
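To make the idea of a vendor-neutral interface concrete, here is a minimal Python sketch. The names `Tracer`, `InMemoryTracer`, and `place_order` are hypothetical and invented for illustration; this is not the OpenTracing or OpenCensus API. The point is only that application code depends on an interface, so the backend behind it can be swapped.

```python
from contextlib import contextmanager
from typing import Iterator, List, Protocol


class Tracer(Protocol):
    """Hypothetical vendor-neutral interface the application codes against."""

    def start_span(self, name: str): ...


class InMemoryTracer:
    """Illustrative backend; a vendor would ship its own implementation."""

    def __init__(self) -> None:
        self.finished: List[str] = []

    @contextmanager
    def start_span(self, name: str) -> Iterator[None]:
        try:
            yield
        finally:
            # Record the span when the block exits, even on error.
            self.finished.append(name)


def place_order(tracer: Tracer, order_id: str) -> str:
    # The application only knows about the Tracer interface.
    with tracer.start_span("place_order"):
        return f"order {order_id} accepted"


tracer = InMemoryTracer()
result = place_order(tracer, "IMN3T64735")
print(result, tracer.finished)
```

Replacing `InMemoryTracer` with another class that offers the same `start_span` method would require no change to `place_order`.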
A well-instrumented application architecture consistently forwards logs, a rich set of metrics, and traces to an observability platform. This data enables SREs to triage issues faster by quickly identifying misbehaving services and drilling into the root cause.
The following best practices help implement observability effectively.
Best Practice #1 — Structured Data
Traditional logging with unstructured text data makes it hard to query the logs for any sort of useful information. It would be nice to be able to filter all logs by a certain customer or transaction. The goal of structuring the data is to solve these sorts of problems and allow additional analytics.
A key to improving system observability is structuring the data. When dealing with data, the first thing to understand is that application telemetry has two very different audiences: humans and machines. Machines are good at processing large amounts of structured data quickly and automatically.
Structured logging is useful for cases like:
1. Analytics — A good example of this would be processing web service access logs and doing some basic summarization and aggregates across the data.
2. Search — Being able to search and correlate log messages is very valuable to development teams during the development process and for troubleshooting production problems.
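As a rough illustration of both use cases, the sketch below aggregates and searches a few JSON log lines. The sample lines and their field names (`path`, `status`, `user`) are made up for this example.

```python
import json
from collections import Counter

# Hypothetical JSON access-log lines, one event per line.
raw_lines = [
    '{"path": "/order", "status": 200, "user": "surenraju"}',
    '{"path": "/order", "status": 500, "user": "surenraju"}',
    '{"path": "/cart", "status": 200, "user": "alice"}',
]

access_events = [json.loads(line) for line in raw_lines]

# Analytics: aggregate requests per HTTP status code.
status_counts = Counter(event["status"] for event in access_events)

# Search: select every event for one user by field, no regular expressions.
user_events = [e for e in access_events if e["user"] == "surenraju"]

print(dict(status_counts))
print(len(user_events))
```

Because every event is a set of named fields, both operations are simple lookups rather than text parsing.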
Each log line should represent a single event and contain at least the timestamp, the hostname, the service, and the logger name. Additionally, logs can carry contextual information such as thread or process ID, event ID, session ID, and user ID. Other important values may be environment metadata such as instance ID, deployment name, application version, or any other key-value pairs related to the event.
Consider a traditional, unstructured logging statement:
log.error("Order '{}' failed".format(orderID))
This log event would result in a log message like:
ERROR 2019-12-30 09:28.31 Order 'IMN3T64735' failed
When debugging order problems, we’d probably use a combination of grep and regular expressions to track down the affected orders and users. This approach makes logs extremely fragile. People begin to rely on the format of logs in ways that might even be unknown to the developers responsible for them.
With structured logs, the logging statement can be changed to something like:
log.error("Order failed", event=ORDER_FAILURE, order_id="IMN3T64735", user="surenraju", email="suren.1988@gmail.com", error=error)
The above structured log statement would result in a log message like:
{"timestamp": "2019-12-30 09:28.31", "level": "ERROR", "event": "order_failure", "order_id": "IMN3T64735", "user": "surenraju", "email": "suren.1988@gmail.com", "error": "Connection refused: Cannot connect to filfilment-svc", "message": "Order failed"}
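The logging calls above are written against an unnamed logging API. As one possible way to get similar output, the sketch below uses only Python’s standard `logging` module with a small JSON formatter. The class name `JsonFormatter`, the `fields` key passed through `extra=`, and the logger name are assumptions of this sketch.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record;
        # "fields" is a name chosen for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("order-svc")  # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.error(
    "Order failed",
    extra={"fields": {"event": "order_failure", "order_id": "IMN3T64735"}},
)

entry = json.loads(stream.getvalue())
print(entry["event"], entry["order_id"])
```

Each emitted line is valid JSON, so downstream tools can filter on `order_id` or `event` directly.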
Best Practice #2 — Contextual Logging and Tracing
In an elastic microservice architecture, a customer request might flow through several microservices, and it becomes essential to trace the request’s journey across them. We need distributed tracing along with structured logs. There is a pattern that originated in Go for passing request-scoped values, cancellation signals, and deadlines across API boundaries. It is also useful for observability, as a way to pass shared context between services as a request traverses the system.
A context object is simply a key-value data structure to store request metadata as a request passes through a service and is persisted through the entire execution path. OpenTracing refers to this as baggage. We can also include this context as part of our structured logs.
Contextual logging is an approach that encourages not just adding additional useful data to log events, but also sharing that data across related events. With contextual logging, data tokens are added to and removed from log events over the course of the application’s runtime. Depending on your application’s workflow, some of these tokens can be shared across multiple log events, or even across the entire application. In the meantime, your log events still retain core logging information such as method names and stack traces.
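One way to picture context travelling between services is to copy it into request headers on the way out and rebuild it on the way in. The sketch below does this with plain dictionaries. The `x-ctx-` header prefix and the `inject` and `extract` helpers are hypothetical and do not correspond to any particular tracing library.

```python
from typing import Dict

# Hypothetical header prefix used to carry context between services.
PREFIX = "x-ctx-"


def inject(context: Dict[str, str], headers: Dict[str, str]) -> Dict[str, str]:
    """Copy the context into a new set of outgoing request headers."""
    carrier = dict(headers)
    for key, value in context.items():
        carrier[PREFIX + key] = value
    return carrier


def extract(headers: Dict[str, str]) -> Dict[str, str]:
    """Rebuild the context from incoming request headers."""
    return {
        name[len(PREFIX):]: value
        for name, value in headers.items()
        if name.startswith(PREFIX)
    }


# Service A attaches the context to its outgoing call ...
outgoing = inject(
    {"order_id": "IMN3T64735", "user": "surenraju"},
    {"content-type": "application/json"},
)
# ... and service B recovers it on the other side.
received = extract(outgoing)
print(received)
```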
The logging in our order processing system from above would look something like this:
def processOrder(ctx, username, email, orderID):
    # Store request metadata once; later log calls reuse it.
    ctx.set(user=username, email=email, order_id=orderID)
    ...
    log.error("Order failed", event=ORDER_FAILURE, context=ctx, error=error)
    ...
The log statement above enriches the logs with metadata that is extremely useful for debugging, and the logs start evolving towards events. The context is also a convenient way to propagate tracing information, such as a span ID, between services. The resulting log message would look like:
{"timestamp": "2019-12-30 09:28.31", "level": "ERROR", "event": "order_failure", "context": {"id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/order", "user": "surenraju", "order_id": "IMN3T64735", "email": "suren.1988@gmail.com", "span_id": "34fe6cbf9556424092fb230eab6f4ea6"}, "error": "Connection refused: Cannot connect to filfilment-svc", "message": "Order failed"}
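For readers who want to try contextual logging without an explicit `ctx` argument, here is a minimal sketch built on Python’s standard `contextvars` and `logging` modules. It is one possible approach under stated assumptions, not the method used above: `request_ctx`, `ContextFilter`, `ContextJsonFormatter`, and the logger name are all invented for this example.

```python
import contextvars
import io
import json
import logging

# Request-scoped context; the variable name is an assumption of this sketch.
request_ctx = contextvars.ContextVar("request_ctx", default={})


class ContextFilter(logging.Filter):
    """Attach the current request context to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.context = dict(request_ctx.get())
        return True


class ContextJsonFormatter(logging.Formatter):
    """Emit the level, the shared context, and the message as JSON."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "context": record.context,
                "message": record.getMessage(),
            }
        )


ctx_stream = io.StringIO()  # stands in for stdout
ctx_handler = logging.StreamHandler(ctx_stream)
ctx_handler.setFormatter(ContextJsonFormatter())
ctx_handler.addFilter(ContextFilter())

ctx_log = logging.getLogger("order-svc.context")  # hypothetical logger name
ctx_log.addHandler(ctx_handler)
ctx_log.setLevel(logging.INFO)
ctx_log.propagate = False


def process_order(username: str, order_id: str) -> None:
    # Set the context once; every later log call in this request carries it.
    request_ctx.set({"user": username, "order_id": order_id})
    ctx_log.error("Order failed")


process_order("surenraju", "IMN3T64735")
ctx_entry = json.loads(ctx_stream.getvalue())
print(ctx_entry["context"])
```

The log call itself stays short, while the filter adds the shared context to each event.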
Best Practice #3 — Common Schema
With structured data and context in place, we need a consistent way to structure that data. We can introduce a common specification or schema for each data type we collect, such as logs, metrics, and traces.
Whether we are performing interactive analysis such as search, drill-down, pivoting, and visualization, or automated analysis such as alerting and machine learning-driven anomaly detection, we need to be able to examine the data consistently. But unless the data originates from only one source, we face formatting inconsistencies resulting from:
- Disparate data types (e.g., logs, metrics, APM, flows, contextual data)
- Heterogeneous environments with diverse vendor standards
- Similar-but-different data sources
Imagine searching for a specific user within data originating from multiple sources. Just to search for this one field, we would likely need to account for multiple field names, such as user, username, nginx.access.user_name, and login. Drilling into and pivoting around that data would present an even greater challenge. Now imagine developing analytics content, such as a visualization, alert, or machine learning job — each new data source would add either complexity or duplication.
These schemas also need libraries which implement the specifications and make it easy for developers to actually instrument their systems. There are existing libraries available for structured logging. For tracing and metrics, OpenTelemetry has emerged as a vendor-neutral API and forthcoming data specification.
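The user-field problem described above can be sketched as a small normalization step that maps source-specific names onto one common field. The target name `user.name` and the `FIELD_ALIASES` table are assumptions made for this example, not a prescribed schema.

```python
from typing import Any, Dict

# Source-specific field names mapped to one common field. The target name
# "user.name" is an assumption of this sketch, not a prescribed schema.
FIELD_ALIASES = {
    "user": "user.name",
    "username": "user.name",
    "nginx.access.user_name": "user.name",
    "login": "user.name",
}


def normalize(event: Dict[str, Any]) -> Dict[str, Any]:
    """Rename known aliases to the common field; keep other fields as-is."""
    return {FIELD_ALIASES.get(key, key): value for key, value in event.items()}


source_events = [
    {"username": "surenraju", "status": 200},
    {"nginx.access.user_name": "surenraju", "status": 500},
    {"login": "alice", "status": 200},
]
normalized = [normalize(event) for event in source_events]

# One field name is now enough to search across every source.
matches = [e for e in normalized if e.get("user.name") == "surenraju"]
print(len(matches))
```

With the mapping applied once at ingestion, dashboards, alerts, and searches only need to know the common field name.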
Best Practice #4 — Data Collectors
An observability solution should consolidate data collection and instrumentation and decouple data sources from data sinks. One way to reduce agent overload is to use a data collector that unifies the collection of the key pieces of observability data: logs, metrics, and traces. Collectors, as their name implies, collect metrics from our systems and services. Each collector runs once per collection interval to obtain system and service statistics; once collection is finished, the data is handed in bulk to the exporters, which send it to the observability pipeline for further processing. The collector commonly runs as an agent on the host. In Kubernetes, this might be a DaemonSet with an instance running on each node. On the application or container side, data is written to stdout/stderr or to a Unix domain socket that the collector reads. From there, the data is written to the pipeline, which I will explain in the next section.
Solutions such as Fluentd or Logstash, along with the Beats ecosystem, help collect application, Kubernetes, infrastructure, and cloud metrics. Avoid putting logic into the data collector; manage it centrally in the observability pipeline instead, because collectors have to be lightweight and run distributed and at scale.
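The collect-then-export cycle described in this section can be sketched in a few lines. Everything here is hypothetical: the `Collector` class, the two sample sources, and the list that stands in for the pipeline are illustrations, not the behaviour of Fluentd, Logstash, or Beats.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Source = Callable[[], List[Event]]
Exporter = Callable[[List[Event]], None]


class Collector:
    """Hypothetical lightweight collector: gather, batch, hand off."""

    def __init__(self, sources: List[Source], exporters: List[Exporter]) -> None:
        self.sources = sources
        self.exporters = exporters

    def run_once(self) -> int:
        """Run one collection interval and return the batch size."""
        batch: List[Event] = []
        for read in self.sources:
            batch.extend(read())
        # No transformation here: processing belongs in the central pipeline.
        for export in self.exporters:
            export(batch)
        return len(batch)


def cpu_source() -> List[Event]:
    return [{"metric": "cpu.load", "value": 0.42}]  # made-up sample value


def log_source() -> List[Event]:
    return [{"message": "Order failed", "level": "ERROR"}]


shipped: List[Event] = []  # stands in for the observability pipeline
collector = Collector([cpu_source, log_source], [shipped.extend])
count = collector.run_once()
print(count, len(shipped))
```

The collector knows nothing about what happens to the batch afterwards, which keeps it small and makes exporters interchangeable.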
Best Practice #5 — Data Pipeline
With collector agents running on thousands of hosts and collecting metrics from our systems and services, we need a scalable, fault-tolerant data stream to handle it all. By using a data pipeline, we can split our data processing into logical parts and move data between various sources and destinations. It’s common to see up to 1TB of logs indexed daily for a medium-sized application with a microservice-style architecture. This volume can be much greater for larger applications, and it can burst dramatically with the introduction of new services. As a result, decoupling sources and sinks becomes important for reducing capacity anxiety. The data pipeline can often be partitioned for horizontal scalability. A key reason for decoupling is that it also allows us to introduce or change sinks without impacting our production cluster. A further benefit is that we can evaluate and compare tools side by side, which helps reduce switching costs. There are a few open source solutions, such as Apache Kafka and Logstash, and on the cloud-managed services side, Amazon Kinesis, Google Cloud Pub/Sub, and Azure Event Hubs.
The following is an open source, Elastic Stack based pipeline.
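To illustrate partitioning and the decoupling of sources from sinks, here is a small in-memory sketch. The `Pipeline` class and `partition_for` helper are hypothetical stand-ins for a real event stream and do not reflect the API of Kafka, Kinesis, or any other product named here.

```python
import zlib
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Sink = Callable[[Event], None]


def partition_for(key: str, partitions: int) -> int:
    """Stable hash so events with the same key stay in one partition."""
    return zlib.crc32(key.encode("utf-8")) % partitions


class Pipeline:
    """Hypothetical in-memory stand-in for a partitioned event stream."""

    def __init__(self, partitions: int) -> None:
        self.partitions = partitions
        self.queues: Dict[int, List[Event]] = defaultdict(list)
        self.sinks: List[Sink] = []

    def publish(self, event: Event) -> None:
        # Sources only know the pipeline, never the sinks behind it.
        index = partition_for(event["service"], self.partitions)
        self.queues[index].append(event)

    def subscribe(self, sink: Sink) -> None:
        # Sinks can be added or swapped without touching any source.
        self.sinks.append(sink)

    def drain(self) -> None:
        """Deliver every queued event to every subscribed sink."""
        for queue in self.queues.values():
            while queue:
                event = queue.pop(0)
                for sink in self.sinks:
                    sink(event)


search_index: List[Event] = []
candidate_tool: List[Event] = []  # a second sink evaluated side by side

pipeline = Pipeline(partitions=4)
pipeline.subscribe(search_index.append)
pipeline.subscribe(candidate_tool.append)
for service in ["order-svc", "cart-svc", "order-svc"]:
    pipeline.publish({"service": service, "message": "Order failed"})
pipeline.drain()
print(len(search_index), len(candidate_tool))
```

Both sinks receive the same events, which is what makes comparing two tools against identical data possible.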