The Challenge
I was questioned recently about why we have so many tools to do monitoring.
Ultimately I think there is not one tool but there is one approach that will improve your people and your systems.
Different groups in the organisation need to ask different questions about the system at different times.
With the system constantly changing, individuals will need different tools to improve their "Learning Efficiency", and their general understanding of the system.
The need for one-to-many tools must be balanced against people's ability to react to system failures. Increasing their Learning Efficiency will improve their awareness of the system, spread knowledge and reduce the Mean Time to Repair (MTTR).
Also everyone should read Release It!
Overview
There are a lot of aspects to performance and monitoring, and the concepts are frequently mixed up.
To keep it simple, let's categorize them like this:
- Current State - Instantaneous State based
- Alerting
- Historical Stats
- Trace (System > Component > Request) ~ Vertical Profiling through technologies
- Profiling - usually technology focused, like Java Profiling
- Vital Signs
These have varying levels of intrusiveness, stickiness and probability of working when needed. They also have varying stakeholders, users, requirements and permissions, which add up to red tape that slows you down when you most need the information.
There are lots of stakeholders that are usually focused on particular aspects at these times:
- Production - Good Times
- Production - Under Load Times (batch or interactive)
- Production - Under Load and component Failures
- Dev - Design/Architecture
- Dev - Impl time
- Test - Environment Issues, e.g. troubleshooting the integration
- Load Test Time
- Soak Test Time
Complications
- Heterogeneous Systems
- Complex Async Interactions
- Degradation of Monitored Metrics after installation
- Non-use of metrics in good times means the metrics may be misleading or untrusted in bad times
- Either Not Enough or Too Many Metrics
- When things start to fail, they do so in a non-linear fashion
I'm quite greedy when it comes to "Learning Efficiency", so I have many desired features of monitoring.
My Desirable Features of monitoring
- Simple
- Non-intrusive for standard operation
- Minimal "observer effect"
- Provide "Just Enough" Early Warning Alerts
- Maximal use of current tools - don't hang your hopes on a new silver bullet
- Choice in when to Increase accuracy by trading off intrusiveness
- Tracing aspects
- Decentralized - centralized lock down access inversely decreases the value and usefulness
- Alerting and Notifications
- Visual Trends
Simple is number one and, because some information is better than none, I can frequently get by with simple command line tools.
Which Metrics to choose?
How do you improve the signal-to-noise ratio when there are tens of thousands of metrics to choose from?
For this I think we just have to go back to basics.
Your site exists for a reason, to serve traffic, to process some type of request.
Sites don't have a reason to exist if they are not involved in input/output.
Consider your system as a "small world" network. Concentrate on the Hubs and Connectors.
Start with your key flows:
- end to end - human to human
Measure Requests/traffic:
- Requests (bytes/counts/response types ok/fail) between servers (os, then app/jvm)
- elapsed response times
- resource utilisation - cpu/mem/io
- queues/pools/quotas - finite resources and potential bottlenecks - more difficult, sensitive to changes.
- effectiveness - seo analytics etc.
Add:
- Alerts for outside boundaries
as well as the extra info that affects the collected data and its interpretation:
- Influences on metric collection/recording, bug in metric sensor
- Influences on interpretation and events - released xyz at this time...
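As a rough illustration of the request metrics and boundary alerts listed above, here is a minimal Python sketch. The `Request` record, the metric names and the `BOUNDS` values are all assumptions made up for the example; they are not taken from any particular tool, and real boundaries would come from watching your own system in good times.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One observed request (hypothetical record, not a real log format)."""
    ok: bool
    size_bytes: int
    elapsed_ms: float


def summarize(requests):
    """Roll a batch of requests up into the basic traffic metrics."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "count": total,
        "failed": failed,
        "fail_ratio": failed / total if total else 0.0,
        "bytes": sum(r.size_bytes for r in requests),
        "median_ms": median(r.elapsed_ms for r in requests) if total else 0.0,
        "max_ms": max((r.elapsed_ms for r in requests), default=0.0),
    }


# Assumed boundaries for the example: metric name -> (low, high).
BOUNDS = {"fail_ratio": (0.0, 0.05), "median_ms": (0.0, 500.0)}


def out_of_bounds(summary, bounds=BOUNDS):
    """Return the names of metrics that fall outside their boundaries."""
    return sorted(
        name
        for name, (low, high) in bounds.items()
        if not low <= summary[name] <= high
    )
```

Feeding it a batch per minute (or per log rotation) and alerting only on the names returned by `out_of_bounds` keeps the alert volume down to the few metrics you have decided matter.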
Why is System wide monitoring so hard?
It suffers from lots of things: Conway's Law, the Tragedy of the Commons.
Too many stakeholders wanting different things from monitoring and not valuing the effort put into it.
The value of information is usually interpreted differently by different people. Accuracy or correctness is not usually as important as age/timeliness and verifiability. Veterans usually place more importance on metrics; many juniors would not even consider them of any interest.
Value is usually only seen by juniors when there are problems and the metrics can be directly/indirectly used to gain insight into a situation.
It is a very "hard sell" to set up an automated historical metrics monitoring system.
Centralised management tools are just too fragile: they suffer from bugs in too many places and are too sensitive to change. Monitoring gets set up and people turn it off when it stops working. Alerts that generate spam get disabled...
When metrics fail or are erroneous, it requires discipline to prioritize fixing them ahead of other seemingly more important issues.
Active vs. Passive Monitoring
Select metrics that will survive changes such as software updates, source code releases and OS updates.
Select a process that will survive team changes and carry forward the underlying values and benefits of daily checklists.
Use distributed, not centralized, management first. Forget trying to get a single tool up and running with beautifully maintained stats. Such a tool is difficult to maintain, untrusted and not resilient to change. Sometimes it causes its own resource leaks.
The issue is more of a social problem; the technical problem is simple. Implement a daily checklist.
People have this amazing property of responding to change, computers less so. People attempting to program a response to change often overcomplicate the situation, achieving the opposite effect over the course of the project. As the brilliant mind of the original author is replaced by those less
Make Learning about your System a Cultural Trend
When Angry Monkeys are used for Good and not Evil. When used for good, it is usually referred to as following process or procedure.
When I was a young whippersnapper, bright eyed, enthused etc...
One of the best team leads I've ever had brought great discipline and process to our work.
It was the discipline of a daily checklist.
At first it was enforced by the team lead; then, as people rotated through the team, any of us would enforce that discipline.
Unfortunately, others do not always recognise the value.
Sometimes they may recognise the value when things go wrong. But then it is quickly forgotten.
When there are differences in the checklist or significant variations of some metrics, you need to be disciplined enough to track down the diffs.
Sometimes it may take weeks or months, but nonetheless, that level of knowledge is important for the team to learn.
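To make "track down the diffs" concrete, here is a small Python sketch that compares two daily checklist snapshots and lists what needs a closer look. It assumes each snapshot is simply a dictionary of metric name to number; the function name `checklist_diffs` and the 20% default tolerance are illustrative choices, not a recommendation.

```python
def checklist_diffs(previous, current, tolerance=0.2):
    """Compare two daily checklist snapshots (metric name -> value).

    Returns a list of (name, old, new) for metrics that appeared,
    disappeared, or moved by more than `tolerance`, expressed as a
    fraction of the old value.
    """
    diffs = []
    for name in sorted(set(previous) | set(current)):
        old = previous.get(name)
        new = current.get(name)
        if old is None or new is None:
            # Item present on only one of the two checklists.
            diffs.append((name, old, new))
        elif old == 0:
            # No relative change can be computed; flag any move off zero.
            if new != 0:
                diffs.append((name, old, new))
        elif abs(new - old) / abs(old) > tolerance:
            diffs.append((name, old, new))
    return diffs


yesterday = {"disk_used_pct": 60, "queue_depth": 10, "errors": 0}
today = {"disk_used_pct": 63, "queue_depth": 25, "errors": 0, "restarts": 1}

for name, old, new in checklist_diffs(yesterday, today):
    print(f"{name}: {old} -> {new}")
```

The code is the easy part. The discipline is in actually following up on every line it prints.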
Instill discipline into the culture; people can change faster than the technology.
The Way Forward
In our heterogeneous environment, drop back to the simplest thing possible. Each app/node will have text logs. Generate events based on those logs.
That is the intent; there are other tools as well, like splunk.com.
- Simple Event Based Correlation - sec.pl
- Provide "Just Enough" Early Warning Alerts
- Maximise use of current tools
- Monitor Rate of Change of metrics as well as State Based, as dramatic rates of change are a sign of pending doom.
- Ease comparison of metrics - Order of Magnitude trend differences, side by side comparisons, annotation of events of interest etc.
- Interconnecting this monitoring data with your domain. Multipliers.
- Cheap, ad hoc and non-destructive tools should never get replaced by one centralized monitor
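Two of the ideas above, generating events from plain text logs and watching the rate of change, fit in a few lines of Python. This is only a sketch in the spirit of threshold-style event correlation; it is not how sec.pl or Splunk work internally. It assumes every log line starts with a numeric epoch timestamp followed by a space, and the function names are made up for the example.

```python
import re
from collections import deque


def threshold_events(lines, pattern, threshold, window_s):
    """Yield (timestamp, count) whenever `pattern` has matched `threshold`
    times within `window_s` seconds.

    Assumes each line looks like "<epoch seconds> <message>".
    """
    regex = re.compile(pattern)
    recent = deque()  # timestamps of recent matching lines
    for line in lines:
        stamp, _, message = line.partition(" ")
        if not regex.search(message):
            continue
        now = float(stamp)
        recent.append(now)
        # Drop matches that have fallen out of the time window.
        while recent and now - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= threshold:
            yield now, len(recent)
            recent.clear()  # start counting again after raising an event


def rate_of_change(samples):
    """Per-second change between consecutive (timestamp, value) samples."""
    return [
        (v1 - v0) / (t1 - t0)
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
        if t1 > t0
    ]


log = [
    "100 ERROR db timeout",
    "101 INFO ok",
    "105 ERROR db timeout",
    "200 ERROR db timeout",
    "204 ERROR db timeout",
    "208 ERROR db timeout",
]
print(list(threshold_events(log, r"ERROR", threshold=3, window_s=10)))
print(rate_of_change([(0, 10), (10, 20), (20, 120)]))
```

Two scattered errors are ignored, while three errors inside ten seconds raise one event. Likewise, a queue that grows ten times faster than it did a moment ago is worth an alert even while its absolute value still looks fine.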
The "All singing all dancing Monitoring and Management System" may be possible in some companies, but for most I think it is better to stop chasing your tail and put "people and processes" over "tools and technologies".
Things will go wrong
So make sure learning about your system is a priority.
Besides Preventative Maintenance or Daily/Weekly Checklists, make time to learn your Fragile components or synchronous call chains, and your Rhythms and Multipliers.
See Release It! for the "Circuit Breaker" pattern and the like.
Multipliers within systems always crop up, especially as a balance between reaction time and cost to develop properly, for example Key Entities or Seasonal Customer Traffic flow.
All this will help reduce your Minimum Time To Learn and the Mean Time to Repair (MTTR) of your System.

