Wikipedia defines observability as “a measure of how well internal states of a system can be inferred from knowledge of its external outputs” – in other words, being able to determine what a system was doing purely by looking at the outputs available from it. We should be able to answer the question “Why did these users have problems at this particular time?” without getting the users to retry against a test system with a debugger attached.

While observability is essential to any system, modern cloud-native distributed architectures dramatically increase the complexity of the system in exchange for flexibility. Compared to a more traditional monolith running on a single server, modern architectures have far too many moving parts for any one person to hold in their head. They also make use of a wide variety of tooling and techniques, such as Infrastructure as Code, machine learning and a polyglot of programming languages.

It is critical to employ observability so we can comprehend both what the system is doing now, such as when there is an outage in production, and what the system was doing in the past, such as when investigating reports of problems from users after the fact.

Within computing, there are three main pillars of observability: logging, metrics and traces.
Logging
The pillar that is familiar to most people is logging. Logs are text-based, timestamped records of events happening over time – ideally with some sort of structure, so that information within each event can be more easily parsed and extracted. What gets logged will typically vary by level – with user-driven actions logged at info level and access, authorisation or system failures logged at error level.

Logging is simple to get started with, but you can quickly run into problems with both the retention of the logs and the discovery of information within them. This is all the more difficult when different parts of the system run on multiple machines, meaning the logs have to be pieced together to create a single coherent narrative.
{"@timestamp":"2020-10-09T07:39:28.609Z","@version":"1","message":"Initializing Spring DispatcherServlet 'dispatcherServlet'","logger_name":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/]","thread_name":"http-nio-8080-exec-1","level":"INFO","level_value":20000}
{"@timestamp":"2020-10-09T07:39:28.627Z","@version":"1","message":"Initializing Servlet 'dispatcherServlet'","logger_name":"org.springframework.web.servlet.DispatcherServlet","thread_name":"http-nio-8080-exec-1","level":"INFO","level_value":20000}
Example of structured logging (JSON Format)
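Structured logs like the ones above can be produced in most languages with a custom log formatter. The sketch below is a minimal Python illustration (the field names simply mirror the JSON example above, which comes from a Java/Logback setup):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "@version": "1",
            "message": record.getMessage(),
            "logger_name": record.name,
            "thread_name": record.threadName,
            "level": record.levelname,
            # Python's INFO level is 20; scaling by 1000 roughly matches
            # Logback's numeric levels (INFO = 20000) from the example above.
            "level_value": record.levelno * 1000,
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Initializing request handler")
```

Because each line is valid JSON, a log aggregator can index every field rather than having to parse free-form text.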
Metrics
A pillar that is familiar to some people is metrics. Metrics are numerical representations of the state of the system which should be recorded and stored so they can be analysed over time – such as calculating the rate of change of a metric over the last five minutes, or predicting a value based on what happened last year. Metrics can also be used to trigger alerts – providing helpful warnings of impending failure, rather than waiting for the failure to occur.

The metrics a system exposes should be the values it is able to measure at the time. Calculations such as moving averages or 95th percentiles are best performed by the system displaying the metrics, as this allows the greatest flexibility in how the metrics are used.

Precisely what a system should expose as metrics will depend on what kind of system it is. There are broadly two ideas about what to expose:
- USE – typically aimed towards infrastructure, as its metrics don’t make sense for most services
- Utilisation – what proportion of the resource is busy performing work
- Saturation – how much extra work has queued up that the resource cannot yet service
- Errors – Number of error events
- RED – aimed more towards services, a subset of the Four Golden Signals from Google’s Site Reliability Engineering book
- Rate – the number of requests handled by the service per second
- Error – the number of failed requests handled per second
- Duration – the amount of time required to process each request
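The RED values above can be derived from a stream of per-request observations. The following is a minimal, dependency-free sketch (a real service would more likely use a metrics library such as a Prometheus client; the class and method names here are purely illustrative):

```python
import time
from collections import deque

class RedMetrics:
    """Track Rate, Errors and Duration for a service over a sliding window."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, duration_seconds, is_error)

    def record(self, duration_seconds, is_error, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, duration_seconds, is_error))
        # Drop samples that have fallen out of the sliding window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def rate(self):
        """Requests handled per second, averaged over the window."""
        return len(self.samples) / self.window

    def error_rate(self):
        """Failed requests per second, averaged over the window."""
        return sum(1 for _, _, err in self.samples if err) / self.window

    def mean_duration(self):
        """Average time taken to process each request, in seconds."""
        if not self.samples:
            return 0.0
        return sum(d for _, d, _ in self.samples) / len(self.samples)
```

A service would call `record()` once per handled request; a monitoring system can then scrape the three derived values periodically.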
To reduce the cognitive workload of managing the monitoring and alerting around metrics, it is best to standardise the naming of metrics – making it easier to apply monitoring across many services. This doesn’t mean that other metrics – such as garbage collection statistics – are useless, but that they will typically only be of importance after the fact, rather than for monitoring the current state of a service.
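For instance, if every service exposes the same metric names and distinguishes itself through labels, one alert rule can cover the whole estate. The names below follow Prometheus conventions but are purely illustrative, not taken from any specific service:

```
# One alert on http_requests_total{status=~"5.."} covers every service:
http_requests_total{service="checkout", status="500"}   42
http_requests_total{service="payment",  status="500"}   7
http_request_duration_seconds_sum{service="checkout"}   103.4
```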

Example of CPU usage metrics using Prometheus
Traces
The pillar that is most unfamiliar to people is tracing. The design of modern tracing systems comes from a white paper published by Google in 2010, describing its internal Dapper system: specific headers are added to an incoming request to a service and propagated on the requests made to downstream services, which in turn reuse the headers present on the request. All services report the headers they receive to a central system, along with information such as how long the request took to process or what the status code was on the response.

The basic component of a trace is the ‘span’, which represents a single piece of work for a service. A span has a unique identifier, an operation name, a set of key-value tags, and can point to another span. Services are expected to create spans to represent handling a request, communicating with other services (including data stores), or any other significant piece of work that should be visualised.
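The span structure just described can be sketched as a small data class. The field names below are illustrative rather than taken from any particular tracing library:

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Span:
    """A single piece of work for a service: id, operation name, tags,
    and an optional pointer to the span that caused this work."""
    operation_name: str
    trace_id: str                          # shared by every span in one request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None   # the span this one points to, if any
    tags: Dict[str, str] = field(default_factory=dict)

# A root span for an incoming request, and a child span for a downstream call:
root = Span("GET /orders", trace_id=uuid.uuid4().hex)
child = Span("SELECT orders", trace_id=root.trace_id,
             parent_span_id=root.span_id, tags={"db.type": "postgresql"})
```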

1. The request comes in and gets given a unique trace identifier.
2. Service 1 will start a span that will represent the request received from the user – the trace assigned to the span will be the one from step 1.
3. Service 1 will start a span that will represent the request sent to service 2. This span will have the trace identifier from step 1 and will have its parent span set to the one from step 2.
4. Service 2 will start a span that will represent the request received from service 1. This span will have the trace identifier from step 1 and will have its parent span set to the one from step 3.
5. Once service 2 has sent the response back, it will close the span from step 4 and send the details to the tracing service.
6. After service 1 has received the response from service 2, it will close the span from step 3 and send the details to the tracing service.
7. Once service 1 has sent the response back to the user, it will close the span from step 2 and send the details to the tracing service.
8. From the information sent to the tracing service, it can produce a tree representing the request flowing through the services.
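The final step – turning reported spans into a tree – can be sketched as below. The dictionary keys are illustrative field names, not a specific tracing API:

```python
from collections import defaultdict

def build_trace_tree(spans):
    """Group reported spans by parent and render the tree as indented lines.

    Each span is a dict with 'span_id', 'parent_span_id' (None for the root)
    and 'operation' keys.
    """
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parent_span_id"] is None:
            roots.append(span)
        else:
            children[span["parent_span_id"]].append(span)

    def render(span, depth=0):
        lines = ["  " * depth + span["operation"]]
        for child in children[span["span_id"]]:
            lines.extend(render(child, depth + 1))
        return lines

    return [line for root in roots for line in render(root)]

# The spans reported in steps 5-7 above, in illustrative form:
reported = [
    {"span_id": "a", "parent_span_id": None, "operation": "service 1: handle user request"},
    {"span_id": "b", "parent_span_id": "a", "operation": "service 1: call service 2"},
    {"span_id": "c", "parent_span_id": "b", "operation": "service 2: handle request"},
]
print("\n".join(build_trace_tree(reported)))
```

This prints the request flow as a nested tree, with each child span indented beneath its parent – essentially what a tracing UI such as Jaeger visualises on a timeline.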
It is essential that requests to all downstream systems – including internal services, data stores and third parties – are traced. Having a single view of requests flowing through the whole system makes it possible to see how long each request took at each service and whether any service returned errors.

Example of a trace using Jaeger
Connecting the pillars
One feature common to many logging systems is the ability to include arbitrary key/value pairs with log statements. This feature can be used to ensure that all log statements include the identifier of the trace in progress. Connecting these two pillars (logging and tracing) increases the value of both, as it makes it trivial to see how a user’s request flowed when looking at the logs, and to find the logs for a user when viewing the trace.
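One way to attach the trace identifier to every log line is a logging filter. The sketch below uses Python's standard `logging` module; the hard-coded trace identifier is purely illustrative – in practice a tracing library would supply the active trace id from its own context:

```python
import logging
import sys

class TraceIdFilter(logging.Filter):
    """Stamp the current trace identifier onto every log record."""

    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id  # illustrative: normally read from tracing context

    def filter(self, record):
        record.trace_id = self.trace_id
        return True  # keep the record; we only enrich it

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("4bf92f3577b34da6"))
logger.setLevel(logging.INFO)

logger.info("Fetching orders for user")
# prints: INFO trace_id=4bf92f3577b34da6 Fetching orders for user
```

With the trace id on every line, searching the log store for one identifier reconstructs everything the system logged while handling that request.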

Example of log statements for a trace

Example of showing the key RED metrics alongside the service logs