System Management Patterns

Recently the client I've been working with has been moving towards using web services with third parties, and this has partly prompted me to articulate a couple of potential System Management Patterns. I've asked Google and not found any related articles, so here goes.

System Management Patterns Overview

Collect Correlated System Metrics

Gather and store correlated system metrics.

The essence of this pattern is to enable the collection of multiple correlated data points over time, providing a picture of the system components that can be analysed at all interesting layers. This enables better informed business, architectural and design decisions about the system or its components.

The necessity of this pattern grows as we handle the web's bursty traffic and rely on more distributed services. I cannot overstate the value this provides, which will flow through the organisation and its partners.

How it Works

Identify what components make up your system, what dependencies there are and start from the bottom of the pyramid.
Correlated System Metrics

There are many mature tools and protocols that already provide monitoring at the Operating System level.

Depending on which container or virtual machine you use, it may already support the de facto industry standard for monitoring.

Track major data points within your application. For a web-based retailer, particular areas would be important, such as Inventory Searches, Purchasing, and third party dependencies like Payment System Providers.

Using It

  • Collect correlated system metrics in real time. Log file analysis has a longer feedback loop, can introduce correlation issues, relies on files that are easily lost or overwritten, and doesn't scale well as more system components are added.
  • Use it often, and resolve problems with collection as a high priority. The value it provides decreases as the reliability of the results or of the collection decreases.
  • Prefer the collection of raw data points over aggregated data where possible, because aggregation forces an arbitrary choice that may prevent the metric from being used with other metrics. If you were measuring response time for a service serving many concurrent clients, you could track hit count and total response time, or you could track hit count and average response time, but averaged over what: a period of time, or the last x hits?
  • Think about coverage. Do you have too much? Do you have too little? Don't collect too much information; avoid paralysis by analysis.
  • Think about your system: it is alive and it constantly changes, whether in code, hardware, OS or business usage. Every day is unique!
  • Don't blindly trust the statistics you have. Occasionally verify important statistics independently; this does not have to be exact, and back-of-the-envelope calculations can be enough.
  • Serious usage may warrant a separate network.
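
To make the point about raw versus aggregated data concrete, here is a minimal sketch in Python. It assumes each hit is stored as a raw (timestamp, response time) pair; the names `Sample` and `summarise` are made up for this illustration and are not from any monitoring tool. Because the raw points are kept, the hit count, total and average can be derived afterwards for whatever window turns out to be interesting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """One raw observation of a single hit (hypothetical record layout)."""
    timestamp: float    # seconds since the start of the run
    response_ms: float  # response time of that one hit


def summarise(samples: List[Sample], start: float, end: float) -> dict:
    """Aggregate raw samples over any half-open window [start, end)."""
    window = [s.response_ms for s in samples if start <= s.timestamp < end]
    hits = len(window)
    total = sum(window)
    avg: Optional[float] = total / hits if hits else None
    return {"hits": hits, "total_ms": total, "avg_ms": avg}


# Three raw hits; the aggregation window is chosen later, not at collection time.
samples = [Sample(0.0, 100.0), Sample(10.0, 300.0), Sample(70.0, 200.0)]

first_minute = summarise(samples, 0.0, 60.0)   # only the first two hits
whole_run = summarise(samples, 0.0, 120.0)     # all three hits
```

Had only a per-minute average been stored, the question "what was the average over the whole run?" could not be answered correctly without also knowing the hit count for each minute.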

Example

The technology to support the implementation of this has been around for years and is mature in the infrastructure layer of a system. There are several network management applications that build upon SNMP and RRDtool (the Round Robin Database Tool) to collect real-time data and specify a suitable granularity of aggregation to limit archive size, yet still allow queries to be performed.

SNMP monitoring has recently been added to the JVM itself. At the application level there are several SNMP implementations that could be used to enable a poller, or statistics collector, to query your app for certain metrics.

So it is possible to create a system view that spans the Operating System (CPU, disk, network adaptors), Virtual Machine, Container and your application.
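
The idea of one view spanning all layers might be sketched as follows. This is only an illustration under assumptions: the probe names and the fixed values they return are hypothetical stand-ins, and in a real setup each probe would wrap an SNMP query or an application counter. The point it shows is that every layer is read in the same pass and stamped with one shared timestamp, which is what makes the data points correlated.

```python
import time
from typing import Callable, Dict, List

# A probe is any zero-argument callable returning a number.
Probe = Callable[[], float]


def collect_row(probes: Dict[str, Probe],
                clock: Callable[[], float] = time.time) -> Dict[str, float]:
    """Read every probe once and stamp the results with a single timestamp."""
    row: Dict[str, float] = {"timestamp": clock()}
    for name, probe in probes.items():
        row[name] = probe()
    return row


# Hypothetical probes, one per layer, returning fixed values for illustration.
probes: Dict[str, Probe] = {
    "os.cpu_percent": lambda: 42.0,
    "vm.heap_used_mb": lambda: 512.0,
    "container.active_threads": lambda: 25.0,
    "app.inventory_searches": lambda: 1300.0,
}

# A fixed clock is injected here so the example is repeatable.
history: List[Dict[str, float]] = [collect_row(probes, clock=lambda: 1000.0)]
```

A poller would call `collect_row` on a schedule and append each row to a time series store; the in-memory list here just stands in for that store.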

Reap the Benefits

Knowledge of your approximate workload is the first step.

Assuming sufficient coverage has been implemented, the benefits flow from day-to-day operations all the way to the strategic end of town and the CIO. Operations and Production Support have an increased ability to diagnose issues: they can identify cause and effect immediately after a change instead of guessing. Better diagnosis may only extend to not wasting significant time looking in an area that is not causing the problem. In many cases it will highlight which component is or is not causing a problem, and further analysis using a Correlated Component Snapshot is then needed.

Developers can use those numbers to approximate workload and test large scale changes to the system, like changing operating systems, changing databases, doubling inventory etc.

  • CIO, can get more confident answers to:
    What is the load sensitivity of the system if we double inventory? What happens if we introduce a promotion that is likely to increase traffic 10 times for the length of the promotion? How much hardware do we need to purchase to support these loads?
  • Technical Operations Manager: Are my servers up? What happened at 13:00 to spike CPU usage by 20%? Is load within reasonable limits? Are there errors on any devices? We changed disk arrays; has performance decreased?
  • Production Support should have a dashboard of important data points, usually spanning all layers. They can ask questions like: Hey, we requested that package xyz be upgraded, and now the response time of service abc has doubled?
  • Development Team: The real workload is a lot different to what everyone anticipated. Do we need to focus on a different area? Do we need to add or tune caches?

Correlated Component Snapshot

Collect correlated detailed information of a System Component.

How it Works

Periodically collect detailed correlated information about your System Component. This lets you travel back in time to see what happened to a component, maybe even to see the cause of the resulting effect. It is useful for diagnosing system problems in production, as well as for determining and removing bottlenecks when Scaling Approximate Workload.

Using It

  • Gather and store detailed correlated information of a System Component.
    Collect it periodically, balancing component usage against gathering cost. A 20 minute interval is a good rule of thumb.
  • A listing of top running processes at that point in time.
  • A database server, may have information such as locking transactions, long running transactions, database specific internal statistic reports.
  • Typically recent data would be stored. Older data to be archived or removed.
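
A rough sketch of such a snapshot store is below. It assumes the detail is gathered by some caller-supplied function (here a hypothetical stand-in returning a fixed process list) and that "recent" means a fixed maximum age; `SnapshotStore` and its method names are invented for this example. Older snapshots are simply dropped here, where a real system might archive them instead.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Optional, Tuple

Snapshot = Tuple[float, Any]  # (timestamp, collected detail)


class SnapshotStore:
    """Keeps recent component snapshots and discards those past a maximum age."""

    def __init__(self, collect: Callable[[], Any], max_age_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._collect = collect
        self._max_age = max_age_seconds
        self._clock = clock
        self._snapshots: Deque[Snapshot] = deque()

    def take(self) -> None:
        """Collect one snapshot now, then prune anything older than max age."""
        now = self._clock()
        self._snapshots.append((now, self._collect()))
        while self._snapshots and now - self._snapshots[0][0] > self._max_age:
            self._snapshots.popleft()

    def at_or_before(self, when: float) -> Optional[Snapshot]:
        """Travel back in time: latest snapshot taken at or before `when`."""
        found: Optional[Snapshot] = None
        for snapshot in self._snapshots:
            if snapshot[0] > when:
                break
            found = snapshot
        return found

    def __len__(self) -> int:
        return len(self._snapshots)


# Simulate snapshots every 20 minutes with a fake clock and one hour retention.
fake_now = [0.0]
store = SnapshotStore(
    collect=lambda: {"top_processes": ["java", "postgres"]},  # hypothetical detail
    max_age_seconds=3600.0,
    clock=lambda: fake_now[0],
)
for minute in (0, 20, 40, 60, 80):
    fake_now[0] = minute * 60.0
    store.take()
```

After the loop, the snapshot taken at minute 0 has aged out, and asking for the state at minute 45 returns the snapshot from minute 40.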

Scaling Approximate Workload

Build tests that can approximate the real workload of your system.

How it Works

Having the ability to scale an approximate workload of your system that is monitored via Collect Correlated System Metrics provides the ability to really see what is happening to the system when changes occur. It provides the final metrics by which changes to components or workload can be analysed and simulated.

This enables developers to simulate and answer "What If?" type of questions.

Using It

  • Build read-only tests and read-write tests of vital functions. Read-only tests will provide a further level of approximation without the extent of data management imposed by the read-write tests.
  • The tests must be written in a loosely coupled way, to enable simple scaling up of workload. They are intended to be used by a tool that allows simple manipulation of scaling parameters, such as number of clients, number of hits per hour etc.
  • Approximate data usage as well as functionality usage. While randomness of data is important, significant data distributions should be reflected in your most important tests. Don't randomly choose products if a product category represents a significant proportion of browsing or searching. If 50% of products browsed are from 1 product category then weight it accordingly.
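
The weighting and scaling ideas above can be sketched like this. The category names and their shares are hypothetical, as is the function `build_plan`; the scaling parameters (number of clients, hits per client) are plain arguments so a driving tool could vary them. It uses the standard library's `random.choices`, which supports weighted selection.

```python
import random
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical browsing distribution: half of all browsing is in one category.
CATEGORY_WEIGHTS: Dict[str, int] = {"books": 50, "music": 30, "games": 20}


def build_plan(num_clients: int, hits_per_client: int,
               weights: Dict[str, int],
               seed: Optional[int] = None) -> List[List[str]]:
    """Return, per simulated client, the list of categories it will browse."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    categories = list(weights)
    category_weights = [weights[c] for c in categories]
    return [
        rng.choices(categories, weights=category_weights, k=hits_per_client)
        for _ in range(num_clients)
    ]


# Scaling up is just a matter of changing these two parameters.
plan = build_plan(num_clients=10, hits_per_client=1000,
                  weights=CATEGORY_WEIGHTS, seed=1)
counts = Counter(category for client in plan for category in client)
books_share = counts["books"] / sum(counts.values())
```

With enough hits, `books_share` should land close to 0.5, so the test traffic is random per request but still reflects the real distribution overall.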

Ping

Is your service platform alive, and what is the upper bound performance limit [Bulka] I could potentially achieve from a service on your platform at this time? This gives meaning to the efficiency of system components under real or approximate workloads when monitored via Collect Correlated System Metrics.

How it Works

A simple service on your service platform that does nothing but return immediately with a suitable response. This requires that a Ping service be implemented on the service platform.

In the case of a Servlet this may be an empty HTML form with a 200 response code. In the case of a web service it responds in a similar way, perhaps with application specific response codes as well. In the case of a virtual machine you may be pinging a synchronized resource, to check its liveness [Lea].
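
As a minimal sketch of the HTTP case, written in Python rather than as a Servlet, the following uses only the standard library's `http.server` and `http.client`. The `/ping` path, the `pong` body and the helper names `start_ping_server` and `ping` are assumptions made for this example. The handler does no work at all, so the measured round trip approximates the best response time any real service on the same platform could achieve at that moment.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


class PingHandler(BaseHTTPRequestHandler):
    """Does nothing except answer immediately with a 200 response."""

    def do_GET(self) -> None:
        body = b"pong"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep the example quiet; a real service would log


def start_ping_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Start the ping service in a background thread (port 0 = any free port)."""
    server = HTTPServer((host, port), PingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ping(host: str, port: int, timeout: float = 2.0) -> Tuple[int, float]:
    """Call the ping service; return (HTTP status, elapsed seconds)."""
    started = time.perf_counter()
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("GET", "/ping")
        response = connection.getresponse()
        response.read()
        status = response.status
    finally:
        connection.close()
    return status, time.perf_counter() - started
```

To try it, call `start_ping_server()`, read the chosen port from `server.server_address[1]`, pass it to `ping("127.0.0.1", port)`, and call `server.shutdown()` when done. Feeding the elapsed time into the collected metrics gives a baseline to compare real service response times against.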

Using It

  • This can be used for internal components and, more beneficially, for external third party components.
  • Very effective in helping to determine where limiting factors lie: virtual machines, operating systems, or the design or architecture of a system.
  • Dynamically adjust your system to reduce workload on third parties. As web services adoption grows, businesses will become even more reliant on the responsiveness of third party partners. In an ideal world any third party system will be able to handle any load your system can throw at it. But in a site that is growing exponentially, partners may not have the inclination, contractual obligation or capabilities to scale as fast as your organisation. It is most likely the case that some messages are more important than others, and that changing the workload you place on their system, or altering practices to accommodate cyclic degradation in their system, could be very beneficial.
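
One way the dynamic adjustment might look is sketched below. Everything specific here is an assumption for illustration: messages carry a numeric priority where 0 means vital, the partner is treated as degraded when its Ping time exceeds three times a known baseline, and the function name `select_messages` is made up. When the partner is degraded only vital messages are sent and the rest are deferred.

```python
from typing import List, Tuple

Message = Tuple[int, str]  # (priority, payload); priority 0 is the most important


def select_messages(messages: List[Message], ping_seconds: float,
                    baseline_seconds: float, slow_factor: float = 3.0
                    ) -> Tuple[List[Message], List[Message]]:
    """Split messages into (send now, defer) based on the partner's Ping time."""
    degraded = ping_seconds > baseline_seconds * slow_factor
    if not degraded:
        return list(messages), []
    send_now = [m for m in messages if m[0] == 0]
    deferred = [m for m in messages if m[0] != 0]
    return send_now, deferred


queue: List[Message] = [
    (0, "payment"),
    (1, "recommendation"),
    (0, "stock check"),
    (2, "analytics"),
]

# Partner is slow: Ping took 0.9s against a 0.2s baseline.
send_when_slow, deferred_when_slow = select_messages(queue, 0.9, 0.2)

# Partner is healthy: Ping took 0.25s, so everything goes out.
send_when_healthy, deferred_when_healthy = select_messages(queue, 0.25, 0.2)
```

The threshold and the priority scheme would need tuning against the metrics collected for the partner; the sketch only shows that the Ping result can drive the decision.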

References

[Bulka] Java Performance and Scalability, Volume 1

[Fowler] Patterns of Enterprise Application Architecture

[Lea] Concurrent Programming in Java(TM): Design Principles and Patterns (2nd Edition)

Posted in Patterns, Stability, Performance and Monitoring.