Observability gives you debugging superpowers. It's x-ray vision into what's happening inside your live software. Knowing what happens when your users interact with your services (and when your services interact with each other) gives you confidence and helps you make wise choices.
It might sound like magic, but it's something you can totally build into your software.
Isn't my monitoring doing this?
Application Monitoring gives us a broad view of our systems. It's great for setting up alerts for when something breaks, but it doesn't capture the detail of what's at fault. Once it has told us there's a problem, it's often no further help.
For example, it might notify us that purchases aren't being dispatched, but it can't tell us which purchases have problems or what the underlying error is.
For this, we need to gain another perspective by inverting how monitoring works.
We're going to use Science!
Like many things in Software engineering, Observability is lifted from an actual Maths and Engineering discipline: control theory. It has a complex formal definition there, but for our purposes, the level of observability you have boils down to:
“The amount you can infer about the state of your system solely from its outputs”
This measurement is super useful. Our live systems are somewhat out of our control. We don't always know exactly who's using them or what they're doing.
How could we know? By inferring it from the system's outputs, as long as we output enough!
So...to get more insight I need All the Outputs?!
Ideally, yes. That would be great. Observability needs detail.
However, signalling every user interaction and each small system change would generate a mass of data. That would be a whole 'nother set of problems. Building Total Insight from all-the-outputs would be complex, and you'd probably end up back where you started.
Products like Honeycomb help you turn large amounts of system data into clear signal. If you have the budget and the need, tools like this can really deliver the goods. However, the modelling, selecting and shaping still need to be done.
A good alternative is to focus on what's key for your product and increase your observability there. Having your system output more detail in a key area can give massive observability gains at a lower investment.
Let's look at how we could do that.
High Cardinality
Is your new hero name
The secret to Observability is to provide the right details, the ones that let you work out what's been going on: identifying the who, the what and the how. To do this you need to be operating with High Cardinality data: data that captures richness and uniqueness.
A system's key uniqueness is often an identity field: an item, user or transaction id. Something from the heart of what the product does. The more unique and detailed the data, the better we can explore, understand and root out issues.
This covers off the who. It's also really useful to know the when and the why, so be sure to add in detail on what was going on and when it all happened.
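To make that concrete, here's a minimal sketch of what a high-cardinality event might look like, using Python's standard logging module. The event name and every field here are invented for illustration:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shop")

def log_event(event: str, **fields) -> None:
    """Emit one structured, high-cardinality event as a JSON log line."""
    record = {
        "event": event,
        "at": datetime.now(timezone.utc).isoformat(),  # the when
        **fields,
    }
    logger.info(json.dumps(record))

# A hypothetical dispatch failure: unique ids (the who) plus the
# state and reason (the why) make this event queryable later.
log_event(
    "purchase.dispatch_failed",
    purchase_id="pur_8f3c2a",  # high-cardinality identity field
    user_id="usr_41b9",
    order_state="payment_taken",
    reason="address_validation_failed",
)
```

The identity fields are the point: with them, "which purchases have problems?" becomes a query rather than a mystery.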
Your low-level Observability data should be accessible and queryable for its useful lifetime. Summarised data is super useful for spotting high-level patterns and abnormalities, but holding the raw form supports low-level exploration and helps you discover something new.
A Few Examples
Audits for Applications
I've built and tended a product to power the process of applying for new jobs. Here I found a lot of value in building an Observability store that contained the detail on the lifecycle of each application.
By storing each event that changed an application in the same system as the applications themselves, the owning team was able to observe the state of the system and how applications were changing. This gave us superpowers for spotting problems and their causes.
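As a rough sketch of the idea (not the actual system), an application-lifecycle audit store can be as simple as an append-only table. This assumes SQLite, and the table, event and id names are all invented:

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("applications.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS application_events (
        application_id TEXT NOT NULL,
        occurred_at    TEXT NOT NULL,
        event_type     TEXT NOT NULL,  -- e.g. 'submitted', 'reviewed'
        detail         TEXT            -- JSON blob of what changed
    )
""")

def record_event(application_id: str, event_type: str, detail: str) -> None:
    """Append one lifecycle event; the trail is only ever added to."""
    db.execute(
        "INSERT INTO application_events VALUES (?, ?, ?, ?)",
        (application_id, datetime.now(timezone.utc).isoformat(),
         event_type, detail),
    )
    db.commit()

# Replaying the trail for a single application shows exactly how it
# reached its current state -- the question you ask when debugging.
history = db.execute(
    "SELECT occurred_at, event_type, detail FROM application_events"
    " WHERE application_id = ? ORDER BY occurred_at",
    ("app_123",),
).fetchall()
```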
Logs for Log-in attempts
I've also managed an Authentication and Authorisation system. We ended up with key Observability data stored in different parts of our systems.
We needed to know about log-in attempts. The scale of this data could be huge, and its value was transient: we rarely cared about it after a few weeks. For these reasons we enhanced the detail we stored, but kept it within our structured logging system for as long as we could.
We treated the Observability data for successful authentications differently. It was of much more value to us in terms of longevity, observability and action, so we stored it in a database audit trail, which allowed the system owners to interact with and interrogate it in rich ways. This let us support users and investigate unexpected incidents well.
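A hedged sketch of that split treatment: every attempt becomes a structured log line, and successful log-ins also get a durable audit row. Again, all the names are invented, and SQLite stands in for the audit database:

```python
import json
import logging
import sqlite3

log = logging.getLogger("auth")
db = sqlite3.connect("auth.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS auth_audit (
        user_id      TEXT NOT NULL,
        logged_in_at TEXT NOT NULL,
        ip           TEXT,
        method       TEXT
    )
""")

def on_login_attempt(user_id: str, succeeded: bool, ip: str, method: str) -> None:
    # Every attempt becomes a structured log line: rich detail, but
    # retained only as long as the logging system keeps it.
    log.info(json.dumps({
        "event": "auth.login_attempt",
        "user_id": user_id,
        "succeeded": succeeded,
        "ip": ip,
        "method": method,  # e.g. 'password', 'sso'
    }))
    if succeeded:
        # Successful authentications are long-lived Observability data,
        # so they also get a durable audit-trail row for rich querying.
        db.execute(
            "INSERT INTO auth_audit VALUES (?, datetime('now'), ?, ?)",
            (user_id, ip, method),
        )
        db.commit()
```

The design choice is about lifetime: transient data lives (and dies) in the logs, while long-lived data earns a home the team can query for months.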
Observability contributes to Software Delivery Performance
Knowing what’s going on in Production gives us the confidence to ship swiftly & often. And if we do mess up, we’ll be able to sort it out, roll forward, and fix issues.
These are big contributors to key metrics in the high-performance delivery model outlined in the book Accelerate (and the DORA reports).
Shipping with confidence keeps Lead Time for Change low, and knowing what's happening contributes to a fast Mean Time to Recovery. The data in Accelerate and the DORA reports highlights how essential these are to effective software delivery teams.
You are probably doing Observability right now.
Storing data to help with debugging is not a new thing, but it can be a new way of thinking. As you build your next system, ask yourself: how will I have the x-ray superpowers I need when something goes wrong?
With these powers of Observability, you can worry less and listen to your system more.