Components and Description

Monitoring will be web based. The top web page will give overall
status information and links to component pages such as batch, data
handling, mass storage, etc. Each component page will have status
and history information as needed.


The core of the monitoring will be an rrd-based collector. Updating
and administration will be done via the web. Monitored variables are
stored at 5-minute intervals for one month (10,000 entries) and then
at lower resolution. Issues to look into are:

  • authentication/authorisation: We need some simple authentication
    for write access, a mechanism to authorise updating/reporting
    of variable values, and a way to delegate administration of
    parts of the variable tree.
  • admin delegation: What level of granularity do we need?
  • variable tree layout: The top part of the variable tree should
    be designed carefully; the lower parts can be left to each group.
  • display of graph: For debugging and navigation we need a
    browser for the variable tree.
  • JSON interface: In order to use Google graphs to plot the
    variables we need a JSON interface to the rrd files.
  • alarming: We need the ability to register alarms and notifications.
    Individual variables over threshold, sum of variables or
    multiple variables over threshold; email and paging;
  • scale testing: We will have O(10^5) worker nodes or jobs running
    and may receive updates from as many sources every 5 minutes. The
    current prototype is CGI based with an Apache web server. We need
    to scale test that this setup works.
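
The storage scheme above (full resolution for a month, consolidated
thereafter) is what rrd round-robin archives provide. A minimal sketch
in plain Python of the same consolidation idea, assuming 5-minute
primary points averaged 12:1 into hourly points (class and parameter
names are illustrative, not part of any rrd API):

```python
from collections import deque

class RoundRobinArchive:
    """Fixed-size archive: keeps the newest `size` points, like an rrd RRA."""

    def __init__(self, size, consolidate=12):
        self.primary = deque(maxlen=size)   # 5-minute primary data points
        self.hourly = deque(maxlen=size)    # lower-resolution consolidated points
        self._bucket = []
        self._consolidate = consolidate

    def update(self, value):
        self.primary.append(value)
        self._bucket.append(value)
        if len(self._bucket) == self._consolidate:
            # AVERAGE consolidation, as rrdtool does for an AVERAGE RRA
            self.hourly.append(sum(self._bucket) / self._consolidate)
            self._bucket.clear()

archive = RoundRobinArchive(size=10000)
for v in range(24):            # two hours of 5-minute updates
    archive.update(float(v))
```

Because the deques are bounded, memory use is fixed regardless of how
long the collector runs, which matters at the O(10^5) source scale
mentioned above.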
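
For the JSON interface, the collector could translate rrd fetch output
into the DataTable-style JSON that Google's chart tools consume. A
hedged sketch (the variable name, row format, and helper name are
illustrative; the real interface would parse actual rrd files):

```python
import json

def rrd_rows_to_datatable(variable, rows):
    """Convert (unix_time, value) pairs into a Google DataTable-style dict.

    `rows` would come from parsing `rrdtool fetch <file>.rrd AVERAGE`;
    here it is just a list of tuples for illustration.
    """
    return {
        "cols": [{"id": "t", "label": "time", "type": "number"},
                 {"id": "v", "label": variable, "type": "number"}],
        "rows": [{"c": [{"v": t}, {"v": v}]} for t, v in rows],
    }

# Hypothetical variable name used only for this example
payload = json.dumps(rrd_rows_to_datatable(
    "batch.running_jobs",
    [(1200000000, 42.0), (1200000300, 45.0)]))
```

A CGI script serving this payload would let the web pages draw
interactive history plots directly from the rrd data.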

Web Pages and Scripts

The user will see web pages, some static but most automatically
generated (periodically or on demand). Static web pages and scripts
will be separate from the rrd-collector. The web pages will provide
status information and include or link to the rrd-based history and
other monitoring.

  • CAFmon guided: The CDF CAFmon satisfies the monitoring needs of
    users quite well. We should use it to guide us in the design of
    the web pages for IF, but with the vision to include data handling,
    disk cache, mass storage and job execution status information.
  • use of client API: The scripts should use condor client calls
    for status information and not parse log files of condor
    daemons, etc.
  • Google graphs: We prefer the more interactive Google graphs over
    static JPG/PNG images to view history information.
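
The client-API point above can be sketched as follows: condor's client
tools can return job ClassAds in machine-readable form, which the
scripts can summarise directly instead of scraping daemon logs. The
sample data below is made up; in practice the list of ads would come
from the condor Python bindings or `condor_q` output:

```python
from collections import Counter

# Standard HTCondor JobStatus codes
STATUS = {1: "idle", 2: "running", 3: "removed", 4: "completed", 5: "held"}

def summarize_jobs(ads):
    """Count jobs per status from a list of job ClassAds (as dicts)."""
    return dict(Counter(STATUS.get(ad.get("JobStatus"), "unknown")
                        for ad in ads))

# Illustrative sample; real ads carry many more attributes
sample = [{"ClusterId": 1, "JobStatus": 2},
          {"ClusterId": 2, "JobStatus": 2},
          {"ClusterId": 3, "JobStatus": 1},
          {"ClusterId": 4, "JobStatus": 5}]
```

The resulting per-status counts are exactly the kind of variables the
batch info provider would push to the rrd collector every 5 minutes.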

Batch Info Provider

The plan is to keep all scripts/programs that gather information from
the batch system and provide updates to the rrd collector together in
one place outside the rrd collector itself.
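A provider could push name/value updates to the collector over HTTP.
The collector's actual update interface is not specified in this
document, so the endpoint URL and parameter names below are
hypothetical; the sketch only shows the shape of such an update:

```python
from urllib.parse import urlencode

# Hypothetical collector endpoint, for illustration only
COLLECTOR_URL = "http://collector.example.org/update"

def build_update(variable, value, timestamp=None):
    """Build the request URL for one variable update.

    Parameter names (`var`, `value`, `time`) are illustrative; the
    real collector would define its own update interface.
    """
    params = {"var": variable, "value": value}
    if timestamp is not None:
        params["time"] = timestamp
    return COLLECTOR_URL + "?" + urlencode(params)
```

A provider cron job would call this once per variable every 5 minutes
and issue the request (e.g. with urllib.request.urlopen), keeping all
batch-system knowledge on the provider side.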

Data Handling Info Provider

We need scripts/programs to gather data handling information and fill
the rrd collector.

Job Execution Info Provider

We need scripts/programs to gather job execution information and fill
the rrd collector.