<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>brettdargan.com &#187; Stability, Performance and Monitoring</title>
	<atom:link href="http://brettdargan.com/blog/category/stability-performance-and-monitoring/feed/" rel="self" type="application/rss+xml" />
	<link>http://brettdargan.com/blog</link>
	<description>&#955; Thoughts and rants</description>
	<lastBuildDate>Fri, 28 May 2010 01:35:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Why is Performance Monitoring so hard?</title>
		<link>http://brettdargan.com/blog/2009/07/17/why-is-performance-and-monitoring-so-hard/</link>
		<comments>http://brettdargan.com/blog/2009/07/17/why-is-performance-and-monitoring-so-hard/#comments</comments>
		<pubDate>Fri, 17 Jul 2009 04:43:54 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Stability, Performance and Monitoring]]></category>

		<guid isPermaLink="false">http://brettdargan.com/blog/?p=470</guid>
		<description><![CDATA[The Challenge I was questioned recently about why we have so many tools to do monitoring. Ultimately I think there is not one tool but there is one approach that will improve your people and your systems. While it is true that different groups in the organisation need to ask different questions about the system [...]]]></description>
			<content:encoded><![CDATA[
<div class="document">


<!-- -*- mode: rst -*- -->
<div class="section" id="the-challenge">
<h3>The Challenge</h3>
<p>I was questioned recently about why we have so many tools to do monitoring.</p>
<p>Ultimately I think there is not one tool but there is one approach that will improve your people and your systems.</p>
<p>It is true that different groups in the organisation need to ask different questions about the system at different times.</p>
<p>With the system constantly changing, individuals will need different tools to improve their <a class="reference external" href="http://www.slideshare.net/ThoughtWorks/lean-times-require-lean-thinking">&quot;Learning Efficiency&quot;</a>, and their general understanding of the system.</p>
<p>The need for one to n tools must be balanced against people's ability to react to system failures.
Increasing their Learning Efficiency will improve their awareness of the system, and should spread knowledge and reduce the Mean Time to Repair (MTTR).</p>
<p>Also everyone should read <a class="reference external" href="http://www.pragprog.com/titles/mnee/release-it">Release It!</a></p>
</div>
<div class="section" id="overview">
<h3>Overview</h3>
<p>There are a lot of aspects to performance and monitoring and frequently concepts are mixed.</p>
<p>To keep it simple, let's categorize them like this:</p>
<ul class="simple">
<li>Current State - Instantaneous State based</li>
<li>Alerting</li>
<li>Historical Stats</li>
<li>Trace (System &gt; Component &gt; Request) ~ Vertical Profiling through technologies</li>
<li>Profiling - usually technology focused, like Java Profiling</li>
<li>Vital Signs</li>
</ul>
<p>These have varying levels of intrusiveness, stickiness and probability of working when needed.
They also have varying stakeholders, users, requirements and permissions, which add up to red tape that slows you down when you most need the information.</p>
<p>There are lots of stakeholders that are usually focused on particular aspects at these times:</p>
<ul class="simple">
<li>Production - Good Times</li>
<li>Production - Under Load Times (batch or interactive)</li>
<li>Production - Under Load and component Failures</li>
<li>Dev - Design/Architecture</li>
<li>Dev - Impl time</li>
<li>Test - Environment Issues - ala. Troubleshooting the Integration</li>
<li>Load Test Time</li>
<li>Soak Test Time</li>
</ul>
</div>
<div class="section" id="complications">
<h3>Complications</h3>
<ul class="simple">
<li>Heterogeneous Systems</li>
<li>Complex Async Interactions</li>
<li>Degradation of Monitored Metrics after installation</li>
<li>Metrics that go unused in good times may be misleading or untrusted in bad times</li>
<li>Either Not Enough or Too Many Metrics</li>
<li>When things start to fail, they do so in a non-linear fashion</li>
</ul>
<p>I'm quite greedy when it comes to <em>&quot;Learning Efficiency&quot;</em>, so I have many desired features of monitoring.</p>
</div>
<div class="section" id="my-desirable-features-of-monitoring">
<h3>My Desirable Features of monitoring</h3>
<ul class="simple">
<li>Simple</li>
<li>Non-intrusive for standard operation</li>
<li>Minimal &quot;observer effect&quot;</li>
<li>Provide &quot;Just Enough&quot; Early Warning Alerts</li>
<li>Maximal use of current tools - don't hang your hopes on a <em>new silver bullet</em></li>
<li>Choice in when to Increase accuracy by trading off intrusiveness</li>
<li>Tracing aspects</li>
<li>Decentralized - centrally locked-down access decreases the value and usefulness</li>
<li>Alerting and Notifications</li>
<li>Visual Trends</li>
</ul>
<p>Simple is number one and because some information is better than none, I can frequently get by with simple command line tools.</p>
</div>
<div class="section" id="which-metrics-to-choose">
<h3>Which Metrics to choose?</h3>
<p>How do you improve the signal-to-noise ratio when there are tens of thousands of metrics to choose from?</p>
<p>For this I think we just have to go back to basics.</p>
<p>Your site exists for a reason, to serve traffic, to process some type of request.</p>
<p>Sites don't have a reason to exist if they are not involved in input/output.</p>
<p>Consider your system as a &quot;small world&quot; network. Concentrate on the Hubs and Connectors.</p>
<p>Start with your key flows:</p>
<ul class="simple">
<li>end to end - human to human</li>
</ul>
<p>Measure Requests/traffic:</p>
<ul class="simple">
<li>Requests (bytes/counts/response types ok/fail) between servers (os, then app/jvm)</li>
<li>elapsed response times</li>
<li>resource utilisation - cpu/mem/io</li>
<li>queues/pools/quotas - finite resources and potential bottlenecks - more difficult, sensitive to changes.</li>
<li>effectiveness - seo analytics etc.</li>
</ul>
<p>Add:</p>
<ul class="simple">
<li>Alerts for outside boundaries</li>
</ul>
<p>as well as having the extra info that affects the collected data and its interpretation:</p>
<ul class="simple">
<li>Influences on metric collection/recording, bug in metric sensor</li>
<li>Influences on interpretation and events - released xyz at this time...</li>
</ul>
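<p>As a rough illustration of the list above, here is a minimal Python sketch that records ok/fail counts and elapsed response times for one key flow and raises alerts when they fall outside their boundaries. The names <code>FlowStats</code>, <code>boundary_alerts</code> and <code>timed_call</code>, and the default thresholds, are assumptions made up for this example, not part of any existing tool.</p>

```python
import time
from dataclasses import dataclass, field


@dataclass
class FlowStats:
    """Counts and timings for one key flow (all names here are hypothetical)."""
    ok: int = 0
    failed: int = 0
    elapsed: list = field(default_factory=list)

    def record(self, succeeded, seconds):
        # Track the response type (ok/fail) and the elapsed time together.
        if succeeded:
            self.ok += 1
        else:
            self.failed += 1
        self.elapsed.append(seconds)

    def failure_ratio(self):
        total = self.ok + self.failed
        return self.failed / total if total else 0.0

    def slowest(self):
        return max(self.elapsed, default=0.0)


def boundary_alerts(stats, max_failure_ratio=0.05, max_seconds=2.0):
    """Return alert messages for metrics that are outside their boundaries."""
    alerts = []
    if stats.failure_ratio() > max_failure_ratio:
        alerts.append("failure ratio %.2f above %.2f"
                      % (stats.failure_ratio(), max_failure_ratio))
    if stats.slowest() > max_seconds:
        alerts.append("slowest response %.2fs above %.2fs"
                      % (stats.slowest(), max_seconds))
    return alerts


def timed_call(stats, func):
    """Run func, recording whether it raised and how long it took."""
    start = time.monotonic()
    try:
        result = func()
    except Exception:
        stats.record(False, time.monotonic() - start)
        raise
    stats.record(True, time.monotonic() - start)
    return result
```

<p>Wrapping a request handler in <code>timed_call</code> keeps the instrumentation in one place, which makes it easier to notice when the sensor itself is the thing that is wrong.</p>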
</div>
<div class="section" id="why-is-system-wide-monitoring-so-hard">
<h3>Why is System wide monitoring so hard?</h3>
<p>It suffers from lots of things: <a class="reference external" href="http://en.wikipedia.org/wiki/Conways_Law">Conway's Law</a>, <a class="reference external" href="http://en.wikipedia.org/wiki/Tragedy_of_commons">Tragedy of the Commons</a>.</p>
<p>Too many stakeholders wanting different things from <em>monitoring</em> and not valuing the effort put into it.</p>
<p>The value of information is usually interpreted differently by different people. Accuracy or correctness is not usually as important as age/timeliness and verifiability. Veterans usually place more importance on metrics that many juniors would not even consider of interest.</p>
<p>Value is usually only seen by juniors when there are problems and the metrics can be directly or indirectly used to gain insight into a situation.</p>
<p>It is a very &quot;hard sell&quot; to set up an automated historical metrics monitoring system.</p>
<p>Centralised management tools are just too fragile: they suffer from bugs in too many places and are too sensitive to change. Monitoring gets set up, and people turn it off when it stops working.
Alerts that generate spam get disabled...</p>
<p>When metrics fail or are erroneous, it requires discipline to prioritize and fix ahead of other seemingly more important issues.</p>
</div>
<div class="section" id="active-vs-passive-monitoring">
<h3>Active vs. Passive Monitoring</h3>
<p>Select metrics that will survive software updates, source code releases, OS updates, etc.</p>
<p>Select a process that will survive team changes and carry forward the underlying values and benefits of daily checklists.</p>
<p>Use distributed, not centralized, management first.
Forget trying to get a single tool up and running with beautifully maintained stats.
It is difficult to maintain, untrusted and not resilient to change. Sometimes it causes its own resource leaks.</p>
<p>The issue is more of a social problem; the technical problem is simple. Implement a daily checklist.</p>
<p>People have this amazing property of responding to change, computers less so. People attempting to program a response to change often overcomplicate the situation, achieving the opposite effect over the course of the project. As the brilliant mind of the original author is replaced by those less</p>
</div>
<div class="section" id="make-learning-about-your-system-a-cultural-trend">
<h3>Make Learning about your System a Cultural Trend</h3>
<p>When <a class="reference external" href="http://rtpscrolls.blogspot.com/2006/11/angry-monkeys-and-cargo-cults.html">Angry Monkeys are used for Good and not Evil</a>.
When used for good, it is usually referred to as following process or procedure.</p>
<p>When I was a young whipper snapper, bright eyed, enthused etc...</p>
<p>One of the best team leads I've ever had brought great discipline and process to our work.</p>
<p>It was the discipline of a <em>daily checklist</em>.</p>
<p>At first it was enforced by the team lead; then, as people rotated through the team, either of us would enforce that discipline.</p>
<p>Unfortunately, others do not always recognise the value.</p>
<p>Sometimes they may recognise the value when things go wrong.
But then it is quickly forgotten.</p>
<p>When there are differences in the checklist or significant variations in some metrics, you need to be disciplined enough to track down the diffs.</p>
<p>Sometimes it may take weeks or months, but nonetheless, that level of knowledge is important for the team to learn.</p>
<p>Instill discipline into the culture; people can change faster than the technology.</p>
</div>
<div class="section" id="the-way-forward">
<h3>The Way Forward</h3>
<p>In our heterogeneous environment, drop back to the simplest thing possible.
Each app/node will have text logs.
Generate events based on those logs.</p>
<p>That is the intent; there are other tools as well, like splunk.com.</p>
<ul class="simple">
<li><a class="reference external" href="http://kodu.neti.ee/~risto/sec/">Simple Event Correlator - sec.pl</a></li>
<li>Provide &quot;Just Enough&quot; Early Warning Alerts</li>
<li>Maximise use of current tools</li>
<li>Monitor the rate of change of metrics as well as their state, as dramatic rates of change can be a sign of pending doom.</li>
<li>Ease comparison of metrics - Order of Magnitude trend differences, side by side comparisons, annotation of events of interest etc.</li>
<li>Interconnecting this monitoring data with your domain. Multipliers.</li>
<li>Cheap, ad hoc and non-destructive tools should never get replaced by <em>one centralized monitor</em></li>
</ul>
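<p>To make the &quot;text logs to events&quot; and &quot;rate of change&quot; points concrete, here is a minimal Python sketch. It assumes a made-up log format of one sample per line, <code>EPOCH_SECONDS NAME=NUMBER</code>; the function name <code>events_from_log</code> and both thresholds are hypothetical. It is not how sec.pl or Splunk work, only an illustration of the idea.</p>

```python
import re

# Hypothetical log format: "EPOCH_SECONDS NAME=NUMBER", one sample per line.
LINE = re.compile(r"^(\d+) (\w+)=(\d+(?:\.\d+)?)$")


def events_from_log(lines, metric, max_value, max_rate):
    """Yield (timestamp, kind, detail) events for one metric.

    kind is "state" when the value itself is out of bounds and "rate"
    when it is changing faster than max_rate units per second.
    """
    previous = None  # (timestamp, value) of the last sample seen
    for line in lines:
        match = LINE.match(line.strip())
        if not match or match.group(2) != metric:
            continue  # not a sample for this metric
        stamp, value = int(match.group(1)), float(match.group(3))
        if value > max_value:
            yield (stamp, "state", value)
        if previous is not None and stamp > previous[0]:
            rate = (value - previous[1]) / (stamp - previous[0])
            if abs(rate) > max_rate:
                yield (stamp, "rate", rate)
        previous = (stamp, value)
```

<p>With a queue that jumps from 12 to 62 in ten seconds, the rate event fires while the value is still well under a state limit of 100, which is the early warning the rate-of-change bullet is after.</p>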
<p>The <em>&quot;All singing all dancing Monitoring and Management System&quot;</em> may be possible in some companies, but for most I think it is better to stop chasing your tail and put <em>&quot;people and processes&quot; over &quot;tools and technologies&quot;</em>.</p>
</div>
<div class="section" id="things-will-go-wrong">
<h3>Things will go wrong</h3>
<p>So make sure learning about your system is a priority.</p>
<p>Besides Preventative Maintenance or Daily/Weekly Checklists, make time to learn your fragile components or synchronous call chains, your Rhythms and Multipliers.</p>
<p>See Release It! for the <a class="reference external" href="http://brettdargan.com/blog/2007/12/15/experimental-circuit-breaker-pattern-implementation/">&quot;Circuit Breaker pattern&quot;</a> and the like.</p>
<p>Multipliers within systems always crop up, especially as a balance between reaction time and cost to develop properly.
Examples are Key Entities or Seasonal Customer Traffic flow.</p>
<p>All this will help reduce your Minimum Time To Learn and your Mean Time to Repair (MTTR) for your System.</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://brettdargan.com/blog/2009/07/17/why-is-performance-and-monitoring-so-hard/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>JAOO 2009 &#8211; Brisbane</title>
		<link>http://brettdargan.com/blog/2009/05/19/jaoo-2009-brisbane/</link>
		<comments>http://brettdargan.com/blog/2009/05/19/jaoo-2009-brisbane/#comments</comments>
		<pubDate>Mon, 18 May 2009 14:12:05 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[design]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Stability, Performance and Monitoring]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://brettdargan.com/blog/?p=380</guid>
		<description><![CDATA[Another good event, great speakers, lots of language talks, agile, architecture, scalability, databases ala megastore. highlights for me, in no particular order. Avi Bryant's new project, great eye candy had the crowd go &#34;ooooooooohhh&#34;. When will an api be open to plug other data in??? Also nice talk on VM history, how we are working [...]]]></description>
			<content:encoded><![CDATA[
<div class="document">


<!-- -*- mode: rst -*- -->
<p>Another good event: great speakers, lots of language talks, agile, architecture, scalability, databases à la Megastore.
Highlights for me, in no particular order:</p>
<ul>
<li><dl class="first docutils">
<dt><a class="reference external" href="http://jaoo.com.au/brisbane-2009/speaker/Avi+Bryant">Avi Bryant's</a> new project, great eye candy had the crowd go &quot;ooooooooohhh&quot;.  When will an api be open to plug other data in???</dt>
<dd><ul class="first last simple">
<li>Also a nice talk on VM history: how we are working on VMs built on algorithms designed in the 80s, and how the features or standard of VMs aren't available for all our favourite languages.</li>
</ul>
</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt><a class="reference external" href="http://jaoo.com.au/brisbane-2009/speaker/Jon+Tirsen">Tirsen</a>, good intro to sharding and the Google approach to scaling out</dt>
<dd><ul class="first last simple">
<li>liked the probability graphs</li>
<li>read chubby, logserve paper</li>
<li>higher level than Jonas's discussion, but Jonas did refer to Tirsen's counter example</li>
<li>When the web came along, our apps were read mostly.</li>
<li>These days, social sites are moving away from this scheme to one of shared data. Some of that shared data is written to a lot, but it is rarely updated; the majority of writes are inserts only.</li>
<li>write fan outs.</li>
<li>combine the two operations in one, so write + read all counters at same time</li>
<li>counter example too simple, can bypass optimistic locking as counter should be monotonically increasing</li>
<li>combine websites with fragments of data/entities with different shards and shard policies.</li>
</ul>
</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt><a class="reference external" href="http://jaoo.com.au/brisbane-2009/speaker/Jonas+S+Karlsson">Jonas</a> - great talk re Megastore</dt>
<dd><ul class="first last simple">
<li>Treat everything as an insert; duplicates will occur, so make sure inserts are idempotent</li>
<li>Updates are a little harder, but what is an update really, when can you say it is actually done?</li>
<li>Confirmed for me that a current strategy I'm implementing is right and will work (well, for two, maybe three nodes anyway).</li>
<li>To get to more will need a decent <a class="reference external" href="http://en.wikipedia.org/wiki/Paxos_algorithm">Paxos implementation</a>, which takes smart people and time.</li>
<li>consistency, availability, entity groups and big table</li>
<li>web apps more read/write than before, lots of shared data, but mostly additions</li>
<li>consistency, reliability, storage, scalability and megastore scale, entity groups, transactionality, avoiding joins</li>
<li>consistency: user trust, none, eventual, entity group, global</li>
<li><strong>consistency: &quot;a harmonious uniformity of agreement among things or parts&quot;</strong></li>
<li>Sharding by entity groups.</li>
<li>storage vs. cost of loss</li>
<li>paxos vs. 2pc</li>
<li>papers jonas recommended: pat helland paper; jinquan dai, intel, james hamilton, MS</li>
<li>I have a lot more to say about this topic, as I cut my teeth on Oracle Performance Tuning and its architecture. Always providing a READ CONSISTENT transaction at a minimum was something I used to think was a great idea, until I wanted to scale it further.</li>
<li>Disk reads are about 10,000 times slower than memory access, but not if you have to manage a lot of versions of different blocks in memory. The overheads reduce the in-memory read to <strong>ONLY 10 to 100 times faster than a disk read</strong>. That just isn't enough. See the <a class="reference external" href="http://www.hotsos.com/e-library/abstract.php?id=7">Millsap papers on Oracle scaling</a> (there are a number of them) and <a class="reference external" href="http://www.scaleabilities.co.uk/book/scalingOracle8i.pdf">James Morle's book on Scaling Oracle 8i, which is a great book; the older print version can still be picked up as well</a></li>
<li>The approach of one single db instance creates new problems, now we need transaction logs and a hot standby and a DR data centre...</li>
</ul>
</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt><a class="reference external" href="http://jaoo.com.au/brisbane-2009/speaker/Michael+T.+Nygard">Nygard</a> (<a class="reference external" href="http://www.amazon.com/Release-Production-Ready-Software-Pragmatic-Programmers/dp/0978739213/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1242652387&amp;sr=1-1">Release It!</a>) - I missed the first one due to a conflict :(</dt>
<dd><ul class="first last simple">
<li>Great stuff from what I hear; I've been pushing for implementation of his patterns for some time.</li>
<li>motivated a number of devs at my company. Hopefully I'll see some badly behaving webservices and stability pattern implementations.</li>
<li>Here is a <a class="reference external" href="http://brettdargan.com/blog/2007/12/15/experimental-circuit-breaker-pattern-implementation/">simple Java Circuit Breaker Pattern Implementation from a ways back</a>.</li>
</ul>
</dd>
</dl>
</li>
<li><p class="first">Mike Cannon-Brookes - had to fill a tough set of constraints, but an interesting story about Atlassian.</p>
</li>
<li><dl class="first docutils">
<dt><a class="reference external" href="http://jaoo.com.au/brisbane-2009/speaker/Clemens+Szyperski">Clemens</a></dt>
<dd><ul class="first last simple">
<li>MS unified component thingy, similar to OSGi. Nice talk about issues with design, component composition and how all of IT boils down to composition of things at varying levels.</li>
<li>Discussion about component composition and how <strong>state is always a problem</strong>, yes.</li>
<li>Advocate of service use, <strong>not reuse</strong> as it should be used <strong>&quot;as is&quot;</strong></li>
<li>I prefer service use over code/component reuse, especially with services developed with RESTful intentions</li>
</ul>
</dd>
</dl>
</li>
<li><p class="first"><a class="reference external" href="http://dannorth.net">Dan North</a>, telling project war stories based on experience: observations of good architects, and SOA gone bad (WSDL).</p>
<ul class="simple">
<li>Listen, Listen, Listen</li>
<li>technical problems aren't the biggest issue; silo communication is</li>
<li>replacement of tools sqlserver to oracle, not solving the real problem</li>
<li>the nameless quality</li>
<li>vision, inspiration, enabler</li>
<li>project shaman</li>
<li>empathise</li>
<li>self belief, a sense of conviction and humility</li>
</ul>
</li>
<li><dl class="first docutils">
<dt>Joshua Bloch - great stuff</dt>
<dd><ul class="first last simple">
<li>checkout <a class="reference external" href="http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/MapMaker.html">MapMaker</a></li>
</ul>
</dd>
</dl>
</li>
<li><p class="first">Eastman - Apache Mahout. Dirichlet clustering, hmm, that algorithm wasn't in <a class="reference external" href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1242654567&amp;sr=1-1">Collective Intelligence</a></p>
</li>
<li><dl class="first docutils">
<dt>Douglas Crockford</dt>
<dd><ul class="first last simple">
<li>javascript inspired by self, scheme, perl and java</li>
<li>prefer === over ==, as === doesn't do type coercion</li>
<li>he no longer uses ++ and --, which he says are implicated in buffer overflow exploits</li>
<li>lambda, dyn objs, loose typing and object literals.</li>
<li>refreshing to discuss languages and language features again</li>
<li>the <a class="reference external" href="http://www.nczonline.net/blog/2009/01/27/speed-up-your-javascript-part-3/">memoizer example was good</a>, check the slides for recursion usage</li>
<li>use functions to make objects</li>
<li>functional inheritance</li>
<li>be rigorous and use jslint</li>
</ul>
</dd>
</dl>
</li>
</ul>
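<p>Jonas's point above about treating everything as an insert and keeping duplicates idempotent can be sketched in a few lines of Python. This is only an in-memory illustration under the assumption that every write carries a client-supplied id; the class name <code>InsertOnlyStore</code> is made up, and it says nothing about how Megastore itself does it.</p>

```python
class InsertOnlyStore:
    """Append-only store keyed by a client-supplied id (hypothetical API)."""

    def __init__(self):
        self._rows = {}

    def insert(self, row_id, payload):
        """Insert a row; replaying the same insert is a harmless no-op.

        Returns True if the row was new, False if it was a duplicate.
        """
        if row_id in self._rows:
            if self._rows[row_id] != payload:
                # Same id with different data is a real conflict, not a retry.
                raise ValueError("id %r reused with different payload" % (row_id,))
            return False
        self._rows[row_id] = payload
        return True

    def count(self):
        return len(self._rows)
```

<p>Because a retried insert leaves the store unchanged, a sender that is unsure whether a write arrived can simply send it again.</p>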
<div class="section" id="other-comments">
<h3>Other Comments:</h3>
<blockquote>
<ul class="simple">
<li>Lots of cloud talks, <strong>very fluffy</strong>, but where was the discussion about the <a class="reference external" href="http://kenai.com/projects/suncloudapis/pages/Home">SUN Cloud RESTful API</a>?</li>
<li>No REST, no erlang.</li>
</ul>
</blockquote>
</div>
<div class="section" id="books-to-checkout">
<h3>Books to checkout:</h3>
<blockquote>
<ul class="simple">
<li>Douglas Crockford, recommends <a class="reference external" href="http://www.amazon.com/JavaScript-Good-Parts-Douglas-Crockford/dp/0596517742/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1242693117&amp;sr=8-1">JavascriptTheGoodParts</a></li>
<li>Steve Hayes, recommends <a class="reference external" href="http://www.amazon.com/Brain-Rules-Principles-Surviving-Thriving/dp/0979777747/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1242652608&amp;sr=1-1">BrainRules</a></li>
<li>Linda Rising, recommends <a class="reference external" href="http://www.amazon.com/Strangers-Ourselves-Discovering-Adaptive-Unconscious/dp/0674013824/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1242652693&amp;sr=1-1">StrangersToOurselves</a> by Timothy Wilson</li>
<li>Dan North, recommends <a class="reference external" href="http://www.amazon.com/Timeless-Way-Building-Christopher-Alexander/dp/0195024028/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1242652771&amp;sr=1-1">TimelessWayOfBuilding</a>. Yeah, I have flicked through it; I may borrow it from a library or buy it from Amazon, because I haven't seen it on a shelf in Oz for less than $140.</li>
</ul>
</blockquote>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://brettdargan.com/blog/2009/05/19/jaoo-2009-brisbane/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Experimental Java Circuit Breaker Pattern Implementation</title>
		<link>http://brettdargan.com/blog/2007/12/15/experimental-circuit-breaker-pattern-implementation/</link>
		<comments>http://brettdargan.com/blog/2007/12/15/experimental-circuit-breaker-pattern-implementation/#comments</comments>
		<pubDate>Fri, 14 Dec 2007 15:24:46 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Patterns]]></category>
		<category><![CDATA[Stability, Performance and Monitoring]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[design]]></category>

		<guid isPermaLink="false">http://brettdargan.com/blog/2007/12/15/experimental-circuit-breaker-pattern-implementation/</guid>
		<description><![CDATA[I read "Release It" recently, not a bad book, some good stories, I wish more people would pay attention to things like Stability and Capacity. Anyway I liked the Circuit Breaker pattern, as I've seen a number of apps with highly coupled systems that barf once latency increases in the back ends then you get [...]]]></description>
			<content:encoded><![CDATA[<p>I read <a href="http://www.pragprog.com/titles/mnee" title="Release It">"Release It"</a> recently, not a bad book, some good stories, I wish more people would pay attention to things like Stability and Capacity.</p>
<p>Anyway, I liked the Circuit Breaker pattern, as I've seen a number of apps with highly coupled systems that barf once latency increases in the back ends. Then you get "Cascading Failures": the webapp's request handler thread pool is exhausted while every thread waits its mandatory 30 seconds for a response. Eventually the app either dies or gets taken out of a load-balanced farm, causing a "Chain Reaction".</p>
<p>Here is an <a href="http://brettdargan.com/blog/wp-content/uploads/2007/12/circuit_breaker.tgz" title="Experimental Circuit Breaker Implementation (circuit_breaker.tgz)">Experimental Circuit Breaker Implementation (circuit_breaker.tgz)</a>, <strong>NOT built or tested FOR PRODUCTION use in any way</strong>. There is explicitly no thread safety on the actual CircuitBreakerSimple class.</p>
<p>Download, extract and run ant.</p>
<p>Play around with the parameters to see the effect. Generally it seems to thrash more with increased concurrent threads and, interestingly, in jdk1.6 it sometimes reverts to the default timeout instead of the instance value.</p>
<p>Based roughly on the following logic:<br />
When Circuit is Closed:<br />
on call = pass through<br />
call succeeds = reset count<br />
call fails = count failure<br />
threshold reached = trip breaker. Open State</p>
<p>When Circuit is Half-Open:<br />
on call = pass through<br />
call succeeds = reset. Closed State<br />
call fails = trip breaker. Open State</p>
<p>When Circuit is Open:<br />
on call = return/fail<br />
on timeout = attempt reset. Half-Open State</p>
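<p>The state logic above can be sketched as a small state machine. This is a minimal Python illustration, not the Java code in the download: the class name <code>CircuitBreaker</code>, the parameters <code>threshold</code>, <code>timeout</code> and <code>clock</code>, and the exception <code>CircuitOpenError</code> are all assumptions made for this example. Like the experimental implementation, it makes no attempt at thread safety.</p>

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, threshold=3, timeout=30.0, clock=time.monotonic):
        self.threshold = threshold  # consecutive failures before tripping
        self.timeout = timeout      # seconds to stay open before a retry
        self.clock = clock          # injectable so tests can fake time
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.timeout:
                self.state = self.HALF_OPEN  # on timeout: attempt reset
            else:
                raise CircuitOpenError("circuit is open")  # fail fast
        try:
            result = func()  # closed or half-open: pass through
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the failure count and closes the circuit.
        self.failures = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.threshold:
            self.state = self.OPEN  # trip the breaker
            self.opened_at = self.clock()
```

<p>While the circuit is open, callers get an immediate <code>CircuitOpenError</code> instead of tying up a request handler thread for the full back-end timeout.</p>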
]]></content:encoded>
			<wfw:commentRss>http://brettdargan.com/blog/2007/12/15/experimental-circuit-breaker-pattern-implementation/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Don&#8217;t kill that runaway process without understanding what it was doing</title>
		<link>http://brettdargan.com/blog/2006/07/26/dont-kill-that-runaway-process-without-understanding-what-it-was-doing/</link>
		<comments>http://brettdargan.com/blog/2006/07/26/dont-kill-that-runaway-process-without-understanding-what-it-was-doing/#comments</comments>
		<pubDate>Wed, 26 Jul 2006 01:18:07 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Stability, Performance and Monitoring]]></category>
		<category><![CDATA[tips]]></category>
		<category><![CDATA[unix]]></category>

		<guid isPermaLink="false">http://brettdargan.com/blog/?p=74</guid>
		<description><![CDATA[Having problem with runaway processes on a shared box? Save some cycles, by using strace -p &#60;pid&#62;. Leave these till after you have done your stracing: Checking what releases have gone out recently Kernel upgrades/dmesg errors if you have filled a a filesystem, ok df -h is pretty quick as well kill -3 &#60;javapid&#62; Tweet [...]]]></description>
			<content:encoded><![CDATA[<p>Having problems with runaway processes on a shared box?</p>
<p>Save some cycles by using <a href="http://www.liacs.nl/~wichert/strace/">strace -p &lt;pid&gt;</a>.</p>
<p>Leave these till after you have done your stracing:</p>
<ul>
<li>Checking what releases have gone out recently</li>
<li>Kernel upgrades/dmesg errors</li>
<li>Checking whether you have filled a filesystem (OK, df -h is pretty quick as well)</li>
<li>kill -3 &lt;javapid&gt;</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://brettdargan.com/blog/2006/07/26/dont-kill-that-runaway-process-without-understanding-what-it-was-doing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using snmp through the layers</title>
		<link>http://brettdargan.com/blog/2005/10/18/using-snmp-through-the-layers/</link>
		<comments>http://brettdargan.com/blog/2005/10/18/using-snmp-through-the-layers/#comments</comments>
		<pubDate>Mon, 17 Oct 2005 15:54:51 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Stability, Performance and Monitoring]]></category>
		<category><![CDATA[Monitoring]]></category>

		<guid isPermaLink="false">http://brettdargan.com/blog/?p=54</guid>
		<description><![CDATA[For some time I've been working on a project where we need to improve monitoring of our web application. Recently that opportunity has been realised and I've been working with a few people to implement some of the patterns i discussed in System Management Patters. SNMP is a natural fit for publishing application statistics. The [...]]]></description>
			<content:encoded><![CDATA[<p>For some time I've been working on a project where we need to improve monitoring of our web application. Recently that opportunity has been realised and I've been working with a few people to implement some of the patterns I discussed in <a href="http://www.brettdargan.com/blog/archives/2005/07/performance_pat.html">System Management Patterns</a>.</p>
<p>SNMP is a natural fit for publishing application statistics. The thing I don't like about existing SNMP agent implementations is that they are all pretty language dependent and they are built in the old style: define a MIB, crank a wheel to generate a template, then you finally get to add code for your metrics.</p>
<p>So firstly work has been done to have an agent publish statistics based on results from an xml stream. This has been built on top of <a href="http://pysnmp.sf.net">pysnmp</a>.</p>
<p>This allows us to have an agent on each device/box that can publish stats, regardless of language as long as you instrument the device in question (and you have a python 2.4 environment).</p>
<p>Once your stats are published, you will need a manager to gather them.<br />
While any SNMP manager can be used, often, as with jffnms, you will need to write your own poller anyway. We've created easily configurable pollers so a manager can poll and update an rrdtool database.</p>
<p>While some of the leg work is alleviated, you will still need to do the following: instrument your app, define your rrd datasources and optionally design your mib.</p>
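<p>As a rough illustration of the "define your rrd datasources" step, the sketch below builds an <code>rrdtool create</code> command line with one COUNTER data source per metric. It is not the code from this project: the function name, the metric names and the file name are made up, and the heartbeat (twice the step) and the single AVERAGE archive are arbitrary assumptions.</p>

```python
def rrd_create_args(path, step, counters, rows=288):
    """Build the argument list for `rrdtool create`.

    Each name in `counters` becomes a COUNTER data source with a heartbeat
    of twice the step, a minimum of 0 and an unknown (U) maximum.
    """
    args = ["rrdtool", "create", path, "--step", str(step)]
    for name in counters:
        # DS:name:type:heartbeat:min:max
        args.append("DS:%s:COUNTER:%d:0:U" % (name, step * 2))
    # One archive keeping `rows` samples at full resolution
    # (RRA:consolidation:xff:steps:rows).
    args.append("RRA:AVERAGE:0.5:1:%d" % rows)
    return args


# Hypothetical metric names and file name, for illustration only.
print(" ".join(rrd_create_args("webapp.rrd", 300, ["searchHits", "purchases"])))
```

<p>The resulting list could be handed to <code>subprocess</code> on a box where rrdtool is installed; the poller would then feed the same data source names with <code>rrdtool update</code>.</p>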
<p>I'm currently looking at opensourcing this very shortly.</p>
]]></content:encoded>
			<wfw:commentRss>http://brettdargan.com/blog/2005/10/18/using-snmp-through-the-layers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>System Management Patterns</title>
		<link>http://brettdargan.com/blog/2005/07/01/system-management-patterns/</link>
		<comments>http://brettdargan.com/blog/2005/07/01/system-management-patterns/#comments</comments>
		<pubDate>Thu, 30 Jun 2005 21:13:44 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Patterns]]></category>
		<category><![CDATA[Stability, Performance and Monitoring]]></category>
		<category><![CDATA[design]]></category>

		<guid isPermaLink="false">http://brettdargan.com/blog/?p=49</guid>
		<description><![CDATA[Recently the client I've been working for has been moving towards using webservices with third parties, and this has partly prompted me to articulate a couple of potential System Management Patterns. I've asked google and not found any related articles, so here goes. System Management Patterns Overview Collect Correlated System Metrics Gather and store correlated system metrics. The essence [...]]]></description>
			<content:encoded><![CDATA[<p>Recently the client I've been working for has been moving towards using webservices with third parties, and this has partly prompted me to articulate a couple of potential System Management Patterns. I've asked google and not found any related articles, so here goes.</p>
<p>System Management Patterns Overview<br />
<a href="http://brettdargan.com/blog/wp-content/uploads/2007/08/patterns-overiew.png" title="System Management Patterns Overview"><img src="http://brettdargan.com/blog/wp-content/uploads/2007/08/patterns-overiew.png" alt="System Management Patterns Overview" border="0" height="412" width="686" /></a></p>
<h3><a title="CollectCorrelatedSystemMetrics" name="CollectCorrelatedSystemMetrics"></a>Collect Correlated System Metrics</h3>
<p>Gather and store correlated system metrics.</p>
<p>The essence of this pattern is to enable collection of multiple correlated data points over time, to provide a picture of the system components that can be analysed at all interesting layers. This enables better informed business, architectural and design decisions about the system or its components.</p>
<p>The necessity of this pattern grows as we handle the bursty nature of web traffic and rely on more distributed services. I cannot overstate the value this provides, which will flow through the organisation and its partners.</p>
<h4>How it Works</h4>
<p>Identify what components make up your system, what dependencies there are and start from the bottom of the pyramid.<br />
<a href="http://brettdargan.com/blog/wp-content/uploads/2007/08/pattern-ccsm.png" title="Correlated System Metrics"><img src="http://brettdargan.com/blog/wp-content/uploads/2007/08/pattern-ccsm.png" alt="Correlated System Metrics" border="0" height="259" width="364" /></a></p>
<p>There are many mature tools and protocols that already provide monitoring at the  Operating System level.</p>
<p>Depending on what container or virtual machine you have, it may support the de facto industry standard for monitoring.</p>
<p>Track major data points within your application. For a web based retailer there are particular areas that would be important, such as inventory searches, purchasing, and third party dependencies like Payment System Providers.</p>
<h4>Using It</h4>
<ul>
<li>Collect correlated system metrics in real-time. Log file analysis has a longer feedback loop, can introduce correlation issues, is easily lost or overwritten and doesn't scale well as more system components are added.</li>
<li>Use it often and resolve problems with collection as a high priority; the value it provides decreases as the reliability of the results or of the collection drops.</li>
<li>Prefer the collection of raw data points over aggregated data where possible, because aggregating forces an arbitrary choice that may prevent the metric being used with other metrics. If you were measuring response time for a service serving many concurrent clients, you could track hit count and total response time. If you tracked hit count and average response time instead, the average would be over what: a period of time, or the last x hits?</li>
<li>Think about coverage, do you have too much? Do you have too little? Don't get too much information, avoid paralysis by analysis.</li>
<li>Think about your system: it is alive and it constantly changes, whether that is code, hardware, OS or business usage. Every day is unique!</li>
<li>Don't blindly trust the statistics you have. Occasionally verify important statistics independently; this does not have to be exact, back of the envelope calculations will do.</li>
<li>Serious usage may warrant a separate network.</li>
</ul>
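<p>To make the raw-versus-aggregated point concrete, here is a minimal sketch (the class and function names are invented for the example). The service only ever publishes two ever-increasing raw values, hit count and total response time; whoever polls them can work out an average over any period afterwards by subtracting two samples.</p>

```python
import threading


class ServiceCounters:
    """Raw, ever-increasing counters for one service."""

    def __init__(self):
        self._lock = threading.Lock()
        self.hit_count = 0
        self.total_response_ms = 0

    def record(self, response_ms):
        """Called once per request with its response time in milliseconds."""
        with self._lock:
            self.hit_count += 1
            self.total_response_ms += response_ms

    def sample(self):
        """What a poller would read: (hit count, total response time)."""
        with self._lock:
            return (self.hit_count, self.total_response_ms)


def average_between(earlier, later):
    """Average response time (ms) for the hits between two samples."""
    hits = later[0] - earlier[0]
    if hits == 0:
        return None  # no traffic in this period, so no average
    return (later[1] - earlier[1]) / hits
```

<p>Because the choice of period is left to the reader of the data, the same two counters can be lined up against CPU, disk or database metrics sampled on any schedule.</p>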
<h4>Example</h4>
<p>The technology to support the implementation of this has been around for years<br />
and is mature in the infrastructure layer of a system. There are several network management applications that build upon <a href="http://www.ietf.org/rfc/rfc1157.txt" title="Simple Network Management Protocol">SNMP</a> and <a href="http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/">Round Robin Database Tool</a> to collect real time data and specify a suitable granularity of aggregation to limit archive size, yet still allow queries to be performed.</p>
<p>SNMP monitoring has recently been added to the JVM itself. At the application level there are several SNMP implementations that could be used to enable a poller, or statistics collector, to query your app for certain metrics.</p>
<p>So it is possible to create a system view that spans the Operating System (cpu, disk, network adaptors), Virtual Machine, Container and your application.</p>
<h4>Reap the Benefits</h4>
<p>Knowledge of your approximate workload is the first step.</p>
<p>The benefits flow from the day to day operational aspects through to the strategic end of town and the CIO, assuming sufficient coverage has been implemented. Operations/Production Support have an increased ability to diagnose issues: they can identify cause and effect immediately after a change instead of guessing. Better diagnosis of issues may only be to the extent that significant time is not wasted looking in an area that is not causing the problem. In many cases it will highlight which component is or is not causing a problem, and that further analysis of the <a href="#CorrelatedComponentSnapshot">Correlated Component Snapshot</a> is needed.</p>
<p>Developers can use those numbers to <a href="#ScalingApproximateWorkload">approximate workload and test large scale changes to the system</a>, like changing operating systems, changing databases, doubling inventory etc.</p>
<ul>
<li>CIO, can get more confident answers to:<br />
What is the <a href="#Fowler" title="see [Fowler] for performance terms">load sensitivity</a> of the system if we double inventory? What happens if we introduce a promotion that is likely to increase traffic 10 times for the length of the promotion? How much hardware do we need to purchase to support these loads?</li>
<li>Technical Operations Manager: are my servers up? What happened at 13:00 to spike CPU usage by 20%? Is load within reasonable limits? Are there errors on any devices? We changed disk arrays and performance has decreased?</li>
<li>Production Support should have a dashboard of important points, usually spanning all layers. They can ask questions like: Hey, we requested that package xyz be upgraded and the response time of service abc has doubled?</li>
<li>Development Team: the real workload is a lot different to what everyone anticipated. Do we need to focus on a different area? Do we need to add or tune caches?</li>
</ul>
<h3><a title="CorrelatedComponentSnapshot" name="CorrelatedComponentSnapshot"></a>Correlated Component Snapshot</h3>
<p>Collect correlated detailed information of a System Component.</p>
<h4>How it Works</h4>
<p>Periodically collect detailed correlated information about your System Component. This lets you travel back in time to see what happened to a component, maybe even to see the cause of the resulting effect. It is useful for diagnosing system problems in production, as well as for determining and removing bottlenecks when <a href="#ScalingApproximateWorkload">Scaling Approximate Workload</a>.</p>
<h4>Using It</h4>
<ul>
<li>Gather and store detailed correlated information of a System Component.<br />
Collect it periodically, based on a balance of component usage and gathering cost. A 20 minute interval is a good rule of thumb.</li>
<li>A listing of top running processes at that point in time.</li>
<li>A database server may have information such as locking transactions, long running transactions and database specific internal statistic reports.</li>
<li>Typically recent data would be stored. Older data to be archived or removed.</li>
</ul>
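<p>A minimal sketch of the idea, under a few stated assumptions: <code>SnapshotStore</code> and its methods are invented names, the collector is whatever callable produces your detailed report (a process listing, a database lock report), snapshots are taken in time order, and only the most recent ones are kept in memory rather than archived.</p>

```python
import collections
import time


class SnapshotStore:
    """Keep the most recent detailed snapshots of one system component."""

    def __init__(self, collect, keep=72):
        # 72 snapshots at a 20 minute interval covers one day; older
        # snapshots fall off the end automatically.
        self._collect = collect
        self._snapshots = collections.deque(maxlen=keep)

    def take(self, now=None):
        """Run the collector and store its output with a timestamp."""
        stamp = time.time() if now is None else now
        self._snapshots.append((stamp, self._collect()))

    def latest_before(self, when):
        """Return the newest (timestamp, data) taken at or before `when`."""
        found = None
        for stamp, data in self._snapshots:
            if when >= stamp:
                found = (stamp, data)
        return found
```

<p>Calling <code>take()</code> from a scheduler every 20 minutes and <code>latest_before()</code> with the time of an incident gives the "travel back in time" view described above.</p>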
<h3><a title="ScalingApproximateWorkload" name="ScalingApproximateWorkload"></a>Scaling Approximate Workload</h3>
<p>Build tests that can approximate the real workload of your system.</p>
<h4>How it Works</h4>
<p>Having the ability to scale an approximate workload of your system that is monitored via <a href="#CollectCorrelatedSystemMetrics">Collect Correlated System Metrics</a> provides the ability to really see what is happening to the system when changes occur. It provides the final metrics by which changes to components or workload can be analysed and simulated.</p>
<p>This enables developers to simulate and answer "What If?" type of questions.</p>
<h4>Using It</h4>
<ul>
<li>Build read only tests and read-write tests of vital functions. Read-only tests will provide a further level of approximation without the extent of data management imposed by the read-write test.</li>
<li>The tests must be written in a loosely coupled way, to enable simple scaling up of workload. They are intended to be used by a tool that allows simple manipulation of scaling parameters, such as number of clients, number of hits per hour etc.</li>
<li>Approximate data usage as well as functionality usage. While randomness of data is important, significant data distribution should be reflected in your most important tests. Don't randomly choose products if a product category represents significant proportions of browsing or searching. If 50% of products browsed are from 1 product category then weight it accordingly.</li>
</ul>
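<p>The weighting in the last point can be sketched in a few lines. The category names and shares below are hypothetical; in practice they would come from your collected metrics.</p>

```python
import random

# Hypothetical share of browsing per product category (percent).
CATEGORY_WEIGHTS = {"books": 50, "music": 30, "games": 20}


def pick_categories(weights, count, rng):
    """Pick `count` categories at random, in proportion to their weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=count)


# A seeded generator keeps a test run repeatable.
rng = random.Random(2005)
picks = pick_categories(CATEGORY_WEIGHTS, 1000, rng)
print(picks.count("books") / 1000.0)  # close to 0.5, not exactly 0.5
```

<p>The data is still random, but about half of the simulated browsing now lands in the category that gets half of the real traffic.</p>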
<h3><a title="Ping" name="Ping"></a>Ping</h3>
<p>Is your service platform alive, and what is the upper bound performance limit [Bulka] I could potentially achieve from a service on your platform at this time? This gives meaning to the efficiency of system components under real or approximate workloads when monitored via <a href="#CollectCorrelatedSystemMetrics">Collect Correlated System Metrics</a>.</p>
<h4>How it Works</h4>
<p>A simple service of your service platform that does nothing, but returns immediately with a suitable response. Requires that a Ping service on the service platform be implemented.</p>
<p>In the case of a Servlet this may be an empty html form with a 200 response code. In the case of a webservice it responds in a similar way, perhaps with application specific response codes as well. In the case of a virtual machine you may be pinging a synchronized resource, to check its liveness [Lea].</p>
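<p>As a hedged sketch of the servlet case, here is the same idea using Python's standard library instead of a Servlet container. The handler does no work and returns a 200 straight away; <code>timed_ping</code> measures one round trip, which approximates the best response time any real service on that platform could achieve at that moment. The names <code>PingHandler</code>, <code>start_ping_server</code> and <code>timed_ping</code>, and the <code>/ping</code> path, are made up for the example.</p>

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class PingHandler(BaseHTTPRequestHandler):
    """Does no work: answers every GET immediately with a 200."""

    def do_GET(self):
        body = b"pong"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # stay quiet; the ping should cost as little as possible


def start_ping_server(port=0):
    """Start the ping service on localhost in a background thread."""
    server = HTTPServer(("127.0.0.1", port), PingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def timed_ping(host, port):
    """Return (status, seconds) for one round trip to the ping service."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/ping")
        response = conn.getresponse()
        response.read()
        status = response.status
    finally:
        conn.close()
    return status, time.perf_counter() - start
```

<p>Polling <code>timed_ping</code> against a third party's ping service alongside your real calls to them shows how much of a slow response is their platform and how much is the service itself.</p>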
<h4>Using It</h4>
<ul>
<li>This can be used for internal components and more beneficially for external third party components.</li>
<li>Very effective in helping to determine where limiting factors lie: virtual machines, operating systems, design or architecture of a system.</li>
<li>Dynamically adjust your system to reduce workload on third parties. As webservices adoption grows, businesses will become even more reliant on the responsiveness of third party partners. In an ideal world any third party system will be able to handle any load your system can throw at it. But in a site that is growing exponentially, partners may not have the inclination, contractual obligation or capabilities to scale as fast as your organisation. It is most likely the case that some messages are more important than others, and that changing the workload you place on their system, or altering practices to accommodate cyclic degradation in their system, could be very beneficial.</li>
</ul>
<h4>References</h4>
<p>[Bulka] <a href="http://www.amazon.com/exec/obidos/tg/detail/-/0201704293/qid=1120167799/sr=8-2/ref=sr_8_xs_ap_i2_xgl14/104-7932080-7784733?v=glance&amp;s=books&amp;n=507846">Java Performance and Scalability, Volume 1</a></p>
<p><a title="Fowler" name="Fowler"></a>[Fowler] <a href="http://www.amazon.com/exec/obidos/tg/detail/-/0321127420/104-3299664-3518357?v=glance">Patterns of Enterprise Application Architecture</a></p>
<p>[Lea]<br />
<a href="http://www.amazon.com/exec/obidos/tg/detail/-/0201310090/qid=1120188315/sr=8-1/ref=pd_bbs_ur_1/104-3299664-3518357?v=glance&amp;s=books&amp;n=507846">Concurrent Programming in Java(TM): Design Principles and Patterns (2nd Edition)</a></p>
]]></content:encoded>
			<wfw:commentRss>http://brettdargan.com/blog/2005/07/01/system-management-patterns/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JVM Monitoring with SNMP</title>
		<link>http://brettdargan.com/blog/2005/04/01/jvm-monitoring-with-snmp/</link>
		<comments>http://brettdargan.com/blog/2005/04/01/jvm-monitoring-with-snmp/#comments</comments>
		<pubDate>Thu, 31 Mar 2005 19:30:36 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Stability, Performance and Monitoring]]></category>
		<category><![CDATA[Monitoring]]></category>

		<guid isPermaLink="false">http://brettdargan.com/blog/?p=42</guid>
		<description><![CDATA[One aspect of my described System Management Patterns includes gathering correlated metrics through the various layers of your system. Since SNMP is the de facto standard in monitoring, I used the SNMP aspects of the JVM. I never really got into SNMP before and since Tiger directly supports it, the time had come. The aim was [...]]]></description>
			<content:encoded><![CDATA[<p>One aspect of my described <a href="http://brettdargan.com/blog/2005/07/01/system-management-patterns/">System Management Patterns</a> includes gathering correlated metrics through the various layers of your system. Since SNMP is the de facto standard in monitoring, I used the SNMP aspects of the JVM.</p>
<p>I never really got into SNMP before and since Tiger directly supports it, the time had come.</p>
<p>The aim was to add our JVM to an existing monitoring system, <a href="http://jffnms.sf.net" title="Just for Fun Network Management System">jffnms</a>. </p>
<p>Being a newbie to SNMP I had a few more setbacks than anticipated.</p>
<p>While the <a href="http://java.sun.com/j2se/1.5.0/docs/guide/management/jconsole.html">jconsole</a> is cool and useful, I wanted application/JVM monitoring to fit in with our existing monitoring tools and also for collected data to be nicely correlated with OS and network stats.</p>
<p>Sun's <a href="http://java.sun.com/j2se/1.5.0/docs/guide/management/SNMP.html">SNMP guide</a> to enabling a port for monitoring is straightforward, but I found learning about SNMP from google painful, and the best bits were hard to find.</p>
<p>If you've got/done the following then it will be easy to get started:</p>
<ol>
<li><a href="http://net-snmp.sf.net">net-snmp</a> installed, and have read some doco on SNMP and MIBs. The net-snmp doco is better than most but has a lot of fluff.</li>
<li><a href="http://java.sun.com/j2se/1.5.0">tiger</a></li>
<li>have followed the <a href="http://java.sun.com/j2se/1.5.0/docs/guide/management/SNMP.html">snmp monitoring setup</a> so you have a listener and a configured ACL.</li>
<li>and some kind of server to run</li>
</ol>
<p>If any of the acl settings are wrong or file permissions aren't set correctly, the jvm won't start. </p>
<p>Once it is going you could double check, using netstat.</p>
<p>Now to use the snmp tools you will need to load in Sun's <a href="http://java.sun.com/j2se/1.5.0/docs/guide/management/JVM-MANAGEMENT-MIB.mib">JVM-MANAGEMENT-MIB</a></p>
<p>It's good practice to use <em>snmptranslate</em> first to see if there are any problems with the MIB definition, in particular any missing dependencies.</p>
<p><small><code>snmptranslate -M .:/usr/share/snmp/mibs -m JVM-MANAGEMENT-MIB -IR -Tp jvmMgtMIB</code></small></p>
<p>Assuming you are running this from the directory where the new MIB is, that command should work. -M gives a search path for MIBs and -m specifies which MIBs we want to load. The final parameter, <em>jvmMgtMIB</em>, is the module identity; you can see that inside the MIB file.</p>
<p>You should see a tree view of the mib starting like this:</p>
<pre>
+--jvmMgtMIB(1)
   |
   +--jvmMgtMIBObjects(1)
   |  |
   |  +--jvmClassLoading(1)
   |  |  |
   |  |  +-- -R-- Gauge     jvmClassesLoadedCount(1)
   |  |  +-- -R-- Counter64 jvmClassesTotalLoadedCount(2)
   |  |  +-- -R-- Counter64 jvmClassesUnloadedCount(3)
   |  |  +-- -RW- EnumVal   jvmClassesVerboseLevel(4)
...
</pre>
<p>And finally we can peek at some jvm stats, just plugin your port:</p>
<p><small><code>snmpwalk -v 2c -M .:/usr/share/snmp/mibs -m JVM-MANAGEMENT-MIB -c public localhost:port jvmMgtMIB</code></small></p>
<p>So in this snippet we are using snmp version 2c and still loading the sun mib. </p>
<p>You should see some results like this:</p>
<pre>
...
JVM-MANAGEMENT-MIB::jvmMemoryHeapInitSize.0 = Counter64: 838860800 bytes
JVM-MANAGEMENT-MIB::jvmMemoryHeapUsed.0 = Counter64: 39430568 bytes
JVM-MANAGEMENT-MIB::jvmMemoryHeapCommitted.0 = Counter64: 832438272 bytes
JVM-MANAGEMENT-MIB::jvmMemoryHeapMaxSize.0 = Counter64: 1664876544 bytes
JVM-MANAGEMENT-MIB::jvmMemoryNonHeapInitSize.0 = Counter64: 8552448 bytes
JVM-MANAGEMENT-MIB::jvmMemoryNonHeapUsed.0 = Counter64: 9260472 bytes
JVM-MANAGEMENT-MIB::jvmMemoryNonHeapCommitted.0 = Counter64: 9863168 bytes
JVM-MANAGEMENT-MIB::jvmMemoryNonHeapMaxSize.0 = Counter64: 100663296 bytes
...
</pre>
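<p>If you want to do something with those numbers before the monitoring system is wired up, the text output is easy to pick apart. This is only a sketch: the function names are invented, the regular expression assumes lines shaped exactly like the ones above, and the sample is two lines copied from that output.</p>

```python
import re

# Matches e.g. "JVM-MANAGEMENT-MIB::jvmMemoryHeapUsed.0 = Counter64: 39430568 bytes"
# Groups: MIB name, object name, instance index, SNMP type, integer value.
LINE = re.compile(r"^([\w-]+)::(\w+)\.(\d+) = (\w+): (\d+)")

SAMPLE = """\
JVM-MANAGEMENT-MIB::jvmMemoryHeapUsed.0 = Counter64: 39430568 bytes
JVM-MANAGEMENT-MIB::jvmMemoryHeapMaxSize.0 = Counter64: 1664876544 bytes
"""


def parse_snmpwalk(text):
    """Map object name to integer value for each line that matches."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            values[match.group(2)] = int(match.group(5))
    return values


def heap_used_fraction(values):
    """Fraction of the maximum heap that is currently in use."""
    return values["jvmMemoryHeapUsed"] / values["jvmMemoryHeapMaxSize"]


stats = parse_snmpwalk(SAMPLE)
print("heap used: %.1f%%" % (100 * heap_used_fraction(stats)))
```

<p>Lines that don't fit the pattern, such as error messages, are simply skipped, so check that the keys you expect are actually present before relying on the result.</p>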
<p>To make this more useful you should install the mib properly, put it in with the others and update your ~/.snmp/snmp.conf file to have "mibs +JVM-MANAGEMENT-MIB". </p>
<p>One of the things I hated about getting started with SNMP was the misleading errors, such as:</p>
<ul>
<li>snmpwalk: Timeout - can be caused by a bunch of things, like not specifying the correct version of snmp that you are trying to query!</li>
<li>SNMPv2-SMI::mib-2 = No Such Object available on this agent at this OID - can be caused when you don't specify the OID, eg. jvmMgtMIB</li>
</ul>
<p>Well writing this out takes too long, I'll have to figure out the <a href="http://jffnms.sf.net" title="Just for Fun Network Management System">jffnms</a> integration later.</p>
]]></content:encoded>
			<wfw:commentRss>http://brettdargan.com/blog/2005/04/01/jvm-monitoring-with-snmp/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

