Big Picture

Overall, we want a general, scalable, full-featured monitoring system that provides the features
listed below and lets us potentially include data from existing monitoring, such as FEF/SSS's Ganglia, Datacomm's MRTG,
etc. (Merging in data from Zabbix is an open issue. We could treat it as another time-series data repository,
but unfortunately it doesn't share the common RRD toolset, so common graphing/browsing wouldn't necessarily be
possible.)

  • Easy direct collection of data from various sources, including:
      • SNMP data [routers, temp sensors, systems running net-snmp]
      • data handling systems
      • batch systems
      • arbitrary scripts or tools
  • Collection of log streams (particularly via syslog)
  • Alarms triggered by thresholds in the data streams above
  • Alarms triggered by log messages
  • Alarms suppressed by scheduled downtimes
  • Display/graphing of collected data
  • Easy tools to develop dashboard pages

The overall architecture involves:

  • a repository for time-series data (current plan: RRD files, with a simple CGI script to add data)
    Note, however, that any tool that generates RRD files (Ganglia, MRTG, existing local scripts) can put
    its RRD files in this repository, either via NFS mounts, via the rrdcached
    server mechanism, or even via a periodic rsync job.
  • a tool for logging failed threshold checks in the above (simple, short script using RRDTool)
  • a repository for log data (currently planned to use rsyslogd)
  • a tool for matching/alarming based on log data (swatch or a similar off-the-shelf tool)
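
The threshold-logging tool in the list above could be sketched roughly as follows. This is a
minimal illustration in Python, not the actual script. It assumes the rrdtool command-line tool
is on the PATH; the ALARM-REQUEST message layout and the function names are hypothetical.

```python
import math
import subprocess
import syslog


def parse_fetch(text):
    """Parse `rrdtool fetch` output into {data_source_name: [values]}.

    Rows whose value is NaN (no data collected yet) are skipped.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()          # header row lists the data source names
    series = {name: [] for name in names}
    for ln in lines[1:]:
        _timestamp, _, rest = ln.partition(":")
        for name, field in zip(names, rest.split()):
            value = float(field)      # float() accepts "nan" and "-nan"
            if not math.isnan(value):
                series[name].append(value)
    return series


def alarm_request(system, check, values, limit):
    """Return a log line if the latest value exceeds limit, else None."""
    if not values or values[-1] <= limit:
        return None
    # Hypothetical message format; the real standard format is not defined here.
    return (f"ALARM-REQUEST system={system} check={check} "
            f"value={values[-1]:g} limit={limit:g}")


def run_check(rrd_path, system, check, limit):
    """Fetch the last 10 minutes from an RRD file and log any threshold failure."""
    out = subprocess.run(
        ["rrdtool", "fetch", rrd_path, "AVERAGE", "--start", "-600"],
        check=True, capture_output=True, text=True,
    ).stdout
    message = alarm_request(system, check, parse_fetch(out).get(check, []), limit)
    if message:
        syslog.syslog(syslog.LOG_WARNING, message)
    return message
```

The script only writes a log line; deciding whether that line becomes an alarm is left to the
log-matching tool, so a cron job could call run_check() for each file and threshold of interest.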

Note that there is no tool for alarming on anything other than log data. This gives us a single
common point for logging that can handle things like suppressing alarms appropriately;
anything else that needs an alarm requests one for a given system using a standard log message format.
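
To illustrate why a single alarm point makes suppression simple, here is a minimal Python sketch
of matching alarm requests in a log stream and dropping those that fall inside a scheduled
downtime. It is a stand-in for illustration only, not a swatch configuration; the ALARM-REQUEST
layout, the downtime table, and the function names are all assumptions.

```python
import re
from datetime import datetime

# Hypothetical message layout; the real standard format is not specified here.
ALARM_RE = re.compile(r"ALARM-REQUEST system=(?P<system>\S+) check=(?P<check>\S+)")


def in_downtime(system, when, downtimes):
    """True if `when` falls inside a scheduled downtime window for `system`."""
    return any(start <= when < end for start, end in downtimes.get(system, []))


def alarms_for(lines, when, downtimes):
    """Return (system, check) pairs that should raise an alarm at time `when`."""
    alarms = []
    for line in lines:
        match = ALARM_RE.search(line)
        if match is None:
            continue                  # ordinary log line, not an alarm request
        system, check = match["system"], match["check"]
        if in_downtime(system, when, downtimes):
            continue                  # suppressed: system is in scheduled downtime
        alarms.append((system, check))
    return alarms


# Example: node2 is in a scheduled downtime from 00:00 to 06:00.
downtimes = {"node2": [(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 6, 0))]}
```

Because every alarm request passes through this one place, the downtime table only has to be
maintained once, regardless of which script or data repository produced the request.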

Another important idea is that as little code as possible should be written
to provide this service, and what scripts there are should be as simple
as possible; the work will largely consist of configuring existing tools.

Another important idea is that there can be multiple time-series data repositories
to spread the load; each would forward its threshold log messages, etc. to a central log
server, where the actual alarms are generated.

The current maaws implementation is an attempt to build the RRD file repository.
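
As a rough idea of what the "simple CGI script to add data" for such a repository might do, the
Python sketch below turns a request's query string into an rrdtool update call. It is not the
maaws code; the parameter names (host, metric, value), the RRD_DIR location, and the
one-file-per-metric layout are assumptions made for the example.

```python
import re
from urllib.parse import parse_qs

RRD_DIR = "/var/lib/rrd"                    # hypothetical repository location
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")  # no slashes, so no path traversal


def update_command(query_string):
    """Turn a query like 'host=node1&metric=temp&value=21.5' into an rrdtool call.

    Raises KeyError for a missing parameter and ValueError for a bad one.
    """
    params = parse_qs(query_string)
    host = params["host"][0]
    metric = params["metric"][0]
    value = float(params["value"][0])       # ValueError on non-numeric input
    for name in (host, metric):
        if not SAFE_NAME.fullmatch(name) or name.startswith("."):
            raise ValueError(f"unsafe name: {name!r}")
    # "N" tells rrdtool update to timestamp the sample with the current time.
    return ["rrdtool", "update", f"{RRD_DIR}/{host}/{metric}.rrd", f"N:{value}"]
```

The returned list can be handed to subprocess.run(cmd, check=True); the RRD file itself is
assumed to have been created beforehand with rrdtool create.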