Components and Description¶
Monitoring will be web based. The top web page will give overall
status information and links to component pages like batch, data
handling, mass storage, etc. Each component page will have status
and history information as needed.
The core of the monitoring will be an rrd-based collector. Updating
and administration will be done via the web. Monitored variables are
stored at 5 minute intervals for one month (about 10,000
entries) and at lower resolution thereafter. Issues to look into are:
- authentication/authorisation: We need some simple authentication
  for write access, a mechanism to authorise updating/reporting
  of variable values, and a way to delegate administration of parts
  of the variable tree.
- admin delegation: What level of granularity do we need?
- variable tree layout: We should design the top part of the variable
  tree carefully and leave the lower parts to each group.
- display of graphs: For debugging and navigation we need a
  browser for the variable tree.
- JSON interface: In order to use Google graphs to plot the
variables we need a JSON interface to the rrd files.
- alarming: We need the ability to register alarms and notifications.
  Individual variables over a threshold, or the sum of several
  variables or multiple variables over thresholds; notification
  by email and paging.
- scale testing: We will have O(10^5) worker nodes or jobs running
  and may get updates from as many sources every 5 minutes. The
  current prototype is CGI based with an Apache web server. We need
  scale tests to verify that this setup can handle the load.
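The retention figures above can be sanity-checked with a small
calculation; the 5-minute step and the one-month window are taken from
the text (a month is assumed to be 30 days), the rest is arithmetic.

```python
# Sanity check for the rrd retention figures quoted above:
# one sample every 5 minutes, kept at full resolution for one month.

STEP_SECONDS = 5 * 60            # 5-minute update interval
MONTH_SECONDS = 30 * 24 * 3600   # one month, taken as 30 days

def full_resolution_rows(step=STEP_SECONDS, window=MONTH_SECONDS):
    """Number of primary data points kept at full resolution."""
    return window // step

if __name__ == "__main__":
    # 8640 rows, i.e. the "(about 10,000 entries)" order of magnitude
    print(full_resolution_rows())
```

This is only a check on the order of magnitude; the actual round-robin
archive sizes would be set when creating the rrd files.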
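As a sketch of the JSON interface idea: the (timestamp, value) row
format below mimics what an rrd fetch produces, and the column layout
follows the Google Charts DataTable convention. The function and field
names are illustrative, not part of any existing interface.

```python
import json

def rows_to_datatable(variable, rows):
    """Convert (unix_timestamp, value) pairs, as an rrd fetch would
    produce, into a Google-Charts-style DataTable dictionary."""
    return {
        "cols": [
            {"id": "time", "label": "time", "type": "number"},
            {"id": variable, "label": variable, "type": "number"},
        ],
        "rows": [
            {"c": [{"v": ts}, {"v": val}]} for ts, val in rows
        ],
    }

if __name__ == "__main__":
    sample = [(1200000000, 42.0), (1200000300, 43.5)]
    print(json.dumps(rows_to_datatable("jobs_running", sample)))
```

In the CGI prototype a script like this would sit between the rrd
files and the browser, emitting JSON for the plotting page to consume.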
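The alarm cases listed above (an individual variable over its
threshold, or a sum of variables over a combined threshold) could be
evaluated by something like the following sketch; the variable names
and limits are made up for illustration, and the email/paging step is
left out.

```python
def check_alarms(values, single_limits, sum_limits):
    """Return alarm messages for individual variables over their
    threshold and for sums of variables over a combined threshold.

    values:        {variable_name: current_value}
    single_limits: {variable_name: threshold}
    sum_limits:    [([variable_name, ...], threshold), ...]
    """
    alarms = []
    for name, limit in single_limits.items():
        if values.get(name, 0) > limit:
            alarms.append("%s=%s over limit %s" % (name, values[name], limit))
    for names, limit in sum_limits:
        total = sum(values.get(n, 0) for n in names)
        if total > limit:
            alarms.append("sum(%s)=%s over limit %s"
                          % ("+".join(names), total, limit))
    return alarms

if __name__ == "__main__":
    values = {"queued_jobs": 1200, "held_jobs": 300}
    for alarm in check_alarms(values,
                              {"held_jobs": 100},
                              [(["queued_jobs", "held_jobs"], 1000)]):
        print(alarm)  # here an email/page would be triggered
```

A real implementation would also need registration of alarms via the
web interface and some rate limiting on notifications.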
Web Pages and Scripts¶
The user will see web pages, some static but most automatically
generated (periodically or on demand). Static web pages and scripts
will be kept separate from the rrd-collector. The web pages will
provide status information and include or link to the rrd-based
history and graphs.
- CAFmon guided: The CDF CAFmon satisfies the monitoring needs of
  users quite well. We should use it to guide us in the design of
the web pages for IF but with the vision to include data handling,
disk cache, mass storage and job execution status information.
- use of client API: The scripts should use condor client calls
  and not parse log files of condor daemons, etc. for status
  information.
- Google graphs: We prefer the more interactive Google graphs over
static JPG/PNG images to view history information.
Batch Info Provider¶
The plan is to keep all scripts/programs that gather information from
the batch system and provide updates to the rrd collector together in
one place outside the rrd collector itself.
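As a sketch of the idea, a provider could be a small standalone script
that collects its values and hands them to the collector in one
update. The record layout, the variable names, and the idea of a
collector update URL are all placeholders here, since the actual
collector interface is not specified yet.

```python
import json
import time

def build_update(source, values, timestamp=None):
    """Package a set of variable values as one update record for the
    rrd collector. The record layout is illustrative only."""
    return {
        "source": source,
        "time": int(timestamp if timestamp is not None else time.time()),
        "values": values,
    }

# In a real provider these numbers would come from batch-system
# client calls (e.g. condor queries) rather than being hard-coded,
# and the record would be sent to the collector's update URL.
if __name__ == "__main__":
    update = build_update("batch", {"jobs_running": 512, "jobs_idle": 48})
    print(json.dumps(update))
```

Keeping the gathering logic in this one place, outside the collector,
means the data handling and job execution providers below can follow
the same pattern with different sources.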
Data Handling Info Provider¶
We need scripts/programs to gather data handling information and fill
the rrd collector.
Job Execution Info Provider¶
We need scripts/programs to gather job execution information and fill
the rrd collector.