[noctool-devel] compact configurations for identical machines

Thu May 22 13:09:18 UTC 2008

 >I LIKE BEING ABLE TO POINT THE BAZOOKA AT MY FOOT!  I'M WORKING ON A 
 >DOUBLE-BARRELLED BAZOOKA SO I CAN POINT IT AT *BOTH* FEET!! ;)

        :)

        I've spent years, on and off, worrying about system
monitoring.  Years ago I made available a giant Python system called
Mom (v3) that no one but me could use.  The publish-subscribe
mechanism required learning a tiny matching language to use.

        I had three successes with Mom.v3 which I really think deserve
consideration in any future monitoring systems.  My first minor
success of sorts was that, due to my desire to produce statistical
models of a bunch of system measures, I have years and years of
collected data to test new algorithms on.  RRD or similar graphs are a
loss for anything except visualization.  My second unalloyed success
was producing an algorithm that told me when my users were filling up
a disk partition *before* thresholds were crossed. [1]

        The final huge win of Mom.v3 was that difficult
publish-subscribe engine.  The landscape is full of single-purpose
monitoring tools that don't interact very well outside of their own
little world.  If, instead, you center a monitoring system around a
communications protocol (a not too difficult one, hopefully) then you
can plug in whatever you want.  Full-contact, improvisational bazooka
juggling can then be indulged in without necessarily endangering the
stability of the main system.  An example might be useful.

              The Life Cycle of a Disk Sample in Mom.v3

        Some dumb data collecting agent running locally runs 'df' and
yanks out the relevant numbers.  The data is packaged up into a
network Message format (a collection of property-value pairs,
including the agent type, the data points, a timestamp, etc.) and is
passed off to a forwarder agent running on the same host.  That agent
is the only program running locally that knows enough to speak to the
central publish-subscribe system, called the Kiosk.  If the forwarder
cannot talk to the Kiosk it caches messages.  Otherwise it opens an
authenticated and encrypted socket to the Kiosk.
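
        A rough Python sketch of the agent-to-forwarder handoff
described above.  Every name here (make_disk_message, Forwarder, the
property names) is made up for illustration and is not Mom.v3's actual
code.  It assumes shutil.disk_usage as a stand-in for parsing 'df',
and it leaves out the authenticated, encrypted socket entirely: the
send callable is whatever actually talks to the Kiosk.

```python
import shutil
import time
from collections import deque


def make_disk_message(path="/", host="localhost"):
    """Package one disk-usage sample as a flat property-value message."""
    usage = shutil.disk_usage(path)  # stand-in for running and parsing 'df'
    return {
        "agent": "disk",
        "host": host,
        "partition": path,
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "timestamp": time.time(),
    }


class Forwarder:
    """Hands messages to a send callable and caches them while it fails."""

    def __init__(self, send):
        # send(message) is assumed to raise OSError when the Kiosk
        # cannot be reached.
        self.send = send
        self.cache = deque()

    def forward(self, message):
        self.cache.append(message)
        self.flush()

    def flush(self):
        # Deliver oldest first; stop at the first failure and keep the rest.
        while self.cache:
            try:
                self.send(self.cache[0])
            except OSError:
                return False
            self.cache.popleft()
        return True
```

A message is only dropped from the cache after send returns without
error, which is the closest this sketch gets to the receipt mentioned
in the next paragraph.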

        Once the message is accepted by the Kiosk and a receipt goes
back to the forwarder, the new message is shoved into a processing
queue.  Now the entire set of subscription rules is run against the
new message.  A simple subscription rule might look like this:

    agent == 'disk'

a trickier one:

    DEFINED class AND class == 'security' AND DEFINED message

These property-checking subscriptions are paired with data sinks I
called 'transports' in Mom.v3.  In the case of this disk agent, we
have two transports attached to the subscription.  The first shunts
the disk use sample into a log file for later grovelling over.  The
second sends the sample into the diskwatcher transport, which keeps
enough disk samples around to run the impending disk doom algorithm.
That transport then *adds another message to the queue* with the
analysis.  You might attach this subscription to a transport that
sends out email or a page:

    agent == 'diskwatcher' AND class == 'notification' AND degree in 3 4

So, if the disk appears to be filling up fast, you'll hear about it.
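
        The cascade above can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not the Mom.v3
Kiosk: the matching language is replaced by plain Python predicates,
the Kiosk class and its method names are invented, the used_fraction
property is hypothetical, and the diskwatcher below is a crude
placeholder threshold rather than the impending-doom algorithm in [1].

```python
from collections import deque


class Kiosk:
    """Toy message router: subscriptions pair a predicate with transports."""

    def __init__(self):
        self.subscriptions = []  # list of (predicate, transports)
        self.queue = deque()

    def subscribe(self, predicate, *transports):
        self.subscriptions.append((predicate, transports))

    def publish(self, message):
        self.queue.append(message)

    def run(self):
        # Run every subscription against each queued message.  A transport
        # may return a new message, which is added back to the queue.
        while self.queue:
            message = self.queue.popleft()
            for predicate, transports in self.subscriptions:
                if not predicate(message):
                    continue
                for transport in transports:
                    result = transport(message)
                    if result is not None:
                        self.queue.append(result)


log = []    # stands in for the log-file transport
pages = []  # stands in for the email/page transport


def diskwatcher(msg):
    # Placeholder analysis only: flags a partition that is over 90% full.
    if msg["used_fraction"] > 0.9:
        return {"agent": "diskwatcher", "class": "notification",
                "degree": 3, "partition": msg["partition"]}
    return None


kiosk = Kiosk()
# agent == 'disk'
kiosk.subscribe(lambda m: m.get("agent") == "disk", log.append, diskwatcher)
# agent == 'diskwatcher' AND class == 'notification' AND degree in 3 4
kiosk.subscribe(
    lambda m: (m.get("agent") == "diskwatcher"
               and m.get("class") == "notification"
               and m.get("degree") in (3, 4)),
    pages.append,
)

kiosk.publish({"agent": "disk", "partition": "/home", "used_fraction": 0.95})
kiosk.run()
```

After run() returns, the one disk sample has been logged and has also
produced one diskwatcher notification, which was routed through the
queue like any other message.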

        Now, in Mom.v3 rather too much of this message cascade
happened in the same process.  If an analysis transport went bad it
could muck up the entire Kiosk.  Fortunately Python has enough
introspective abilities that I could deactivate really badly behaving
transports, but this isn't ideal.  This publish-subscribe message
routing really was just incredibly powerful - I had correlation
engines, time series models, logs, database sinks, etc.  A single data
sample message could result in a half-dozen messages being reinjected
into the system.  But because so much ran in the same process there
were certain things I couldn't try out live.  So my current focus -
when I have time to code on this - is to generalize a monitoring
protocol that'll let me plug in some experimental analysis engine
without endangering the other parts that are working correctly.  It
would also permit different ways of accessing the data so we're not
forced to pretend system monitoring maps well to web pages.
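
        To make the in-process problem concrete, here is one way a
misbehaving transport could be switched off from inside the same
process.  This is a guess at the general idea, not how Mom.v3 did it:
GuardedTransport and max_failures are hypothetical names, and the
wrapper only catches exceptions.

```python
class GuardedTransport:
    """Wraps a transport and deactivates it after repeated exceptions."""

    def __init__(self, transport, max_failures=3):
        self.transport = transport
        self.max_failures = max_failures
        self.failures = 0
        self.active = True

    def __call__(self, message):
        if not self.active:
            return None  # deactivated: silently skip this transport
        try:
            return self.transport(message)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.active = False
            return None
```

A wrapper like this cannot help with a transport that hangs or eats
memory, which is the argument for moving experimental engines out of
the main process and behind a protocol.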

        I'm thinking aloud here.  A few weeks before Ingvar announced
his common-lisp.net project and this mailing list I was thinking of
contacting him and Chun Tian (the author of cl-net-snmp) to see if
they thought we should create a "Common Lisp and Monitoring" mailing
list to discuss our separate projects, to share what works and what
doesn't. 

--
wm

[1] http://www.biostat.wisc.edu/~annis/granny/notes/impending-doom.html -
    lisp code available if anyone is curious