Email Reports

To help GPGrid users better understand their grid utilization, the FIFE group will send automatic notifications to all users through FIFEMon monitoring. The email notification is designed so that users can identify their resource utilization in terms of wall time, CPU time, data I/O, and memory usage, along with the efficiency of each of these elements. With this information, users will be able to tune the resource requests in their job submissions to match the requirements of their workflows. This has a twofold benefit: it increases the overall efficiency of GPGrid, and it allows user workflows to match resources more quickly and return results with less delay. Finally, the notifications will be sent to all users by default, with an opt-out policy. This choice is made in large part so that, when efficiency policy violations are enforced, users will have been fully informed about the situation before SCD takes action.

I got a warning email, what do I do?

Warning emails are triggered when the efficiency of a job cluster (over 500 cumulative wall hours) or of all jobs from the previous day is below the limits defined in the FIFE Efficiency Policy. There are many possible causes of low efficiency; for guidance, see: Job efficiency troubleshooting

Details of the notification emails:

  • notifications will be generated once all jobs within a cluster/DAG have either completed or been held
  • notifications will be sent to the <uid>@fnal.gov email address; users should be aware that this address exists and should monitor it
  • production users are not included in the automatic report generation
  • users can opt out of notification emails using the link at the bottom of the email, or subscribe to a daily digest
  • Cluster is the job ID used for fetching logs and monitoring with the jobsub command line tools (e.g. --jobid=<jobid>)
  • Wall Time is the min, max, and avg clock time of jobs in the cluster
  • CPU Time is the min, max, and avg CPU time of jobs in the cluster
  • CPU efficiency is the total CPU time of a job divided by the job's wall time
  • Memory efficiency is the memory usage of a job (reported by HTCondor as the maximum RSS) divided by the memory requested at job submission (default is 2 GB)
  • Disk efficiency is the scratch disk used by a job divided by the amount requested at job submission (default is 35 GB)
  • Time efficiency is the wall time of a job divided by the expected run time given at job submission (default is 8 hours)
  • the efficiency table shows the minimum efficiency reported within the cluster, the average of all values reported, and the maximum value reported by a job within the cluster
  • the efficiency policy will be based upon the job success rate, the average cluster CPU efficiency, the Max memory efficiency reported, and the ratio of Max Wall time to Req Wall Time
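
The efficiency definitions above can be illustrated with a short Python sketch. This is not the FIFEMon implementation; the `JobUsage` class, the function names, and the choice of units (seconds, MiB, GiB) are assumptions made for illustration. The default request values follow the list above, with 2 GB treated as 2048 MiB, which is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobUsage:
    """Resources used by one job (hypothetical container, not a FIFEMon type)."""
    wall_s: float       # wall clock time in seconds
    cpu_s: float        # total CPU time in seconds
    max_rss_mib: float  # maximum RSS in MiB
    disk_gib: float     # scratch disk used in GiB


def efficiencies(job, req_memory_mib=2048.0, req_disk_gib=35.0,
                 req_time_s=8 * 3600.0):
    """Return the four efficiency ratios for one job as fractions.

    Defaults mirror the documented request defaults (2 GB, 35 GB, 8 hours);
    reading 2 GB as 2048 MiB is an assumption.
    """
    return {
        # Guard against a zero wall time (a job that never ran).
        "cpu": job.cpu_s / job.wall_s if job.wall_s else 0.0,
        "memory": job.max_rss_mib / req_memory_mib,
        "disk": job.disk_gib / req_disk_gib,
        "time": job.wall_s / req_time_s,
    }


def summarize(jobs, **requested):
    """Min, max, and average of each efficiency over all jobs in a cluster."""
    per_job = [efficiencies(job, **requested) for job in jobs]
    summary = {}
    for name in ("cpu", "memory", "disk", "time"):
        values = [eff[name] for eff in per_job]
        summary[name] = {"min": min(values), "max": max(values),
                         "avg": mean(values)}
    return summary


if __name__ == "__main__":
    # Two illustrative jobs with a 500 MiB / 10 GiB / 30 minute request.
    jobs = [JobUsage(911, 489, 2.8, 0.0), JobUsage(911, 556, 6.7, 0.0)]
    report = summarize(jobs, req_memory_mib=500, req_disk_gib=10,
                       req_time_s=1800)
    for name, stats in report.items():
        print(name, {k: f"{100 * v:.1f}%" for k, v in stats.items()})
```

The average here is the mean of the per-job ratios, matching the description of the efficiency table as "the average of all values reported"; whether the service averages this way or divides the totals is not stated on this page.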

Subscription Options

The following subscription options are supported by the reporting service.

  • Receive one email per job cluster/DAG (default)
  • Receive a daily digest of all jobs run
  • Receive both
  • Receive none

Access the following link to change your subscription option (requires Services login): https://fifemon.fnal.gov/fifemail/subscription
Additional methods to control reporting options at submission time may be added in the future.

Example Emails

Cluster (plain)

FIFE Batch System Job Summary

Cluster:        20380298@fifebatch2.fnal.gov
Number of Jobs: 5
Submitted:      2017-05-26 15:22:47 -0500 CDT
Owner/Group:    kretzke / fermilab (kretzke@FNAL.GOV)
Command:        probe_20170526_152247_1357416_0_1_wrap.sh

Requested:
  Memory: 500 MiB
  Disk:   10.0 GiB
  Time:   30m0s

View this cluster on Fifemon:
  https://fifemon.fnal.gov/monitor/dashboard/db/job-cluster-summary?var-cluster=20380298&var-schedd=fifebatch2.fnal.gov&from=1495830167000

Average time waiting in queue: 1m11s

NOTE: Job statistics are collected every 10 minutes, and may not accurately
reflect the resources used when the job finished.

Used:              Min       Max       Avg
  Memory:          2.8       6.7       4.0 MiB
  Disk:            0.0       0.0       0.0 GiB
  Wall Time:    15m11s    15m11s    15m11s
  CPU Time:       8m9s     9m16s     8m51s

Efficiency:        Min       Max       Avg
  Memory:          0.6%      1.4%      0.8%
  Disk:            0.0%      0.0%      0.0%
  CPU:            53.7%     61.0%     58.3%
  Time:            0.0%     50.6%     40.5%

WARNING: 1 Job(s) are held for the following reasons:
       1 - SYSTEM_PERIODIC_HOLD: Job exceeded requested resources. (26)

--------------------------------------------------------------------------------
Opt-out or change your subscription preferences:
  https://fifemon-pp.fnal.gov/fifemail/subscription?report=1
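
As a cross-check, the CPU and Time efficiency figures in the example above can be reproduced from its Used and Requested values. The following Python sketch parses duration strings such as `15m11s` and computes the percentages. The `parse_duration` and `percent` helpers are hypothetical names, and support for an `h` (hours) unit is an assumption, since the example only shows minutes and seconds.

```python
import re

# Seconds per unit; "h" is assumed, the example email only shows "m" and "s".
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_duration(text):
    """Convert a string such as '15m11s' into a number of seconds."""
    total = 0.0
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # Unrecognized characters between tokens.
            raise ValueError(f"cannot parse duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        # Empty input or trailing characters that are not a valid token.
        raise ValueError(f"cannot parse duration: {text!r}")
    return total


def percent(used, requested):
    """Ratio of two duration strings as a percentage with one decimal."""
    return round(100.0 * parse_duration(used) / parse_duration(requested), 1)


if __name__ == "__main__":
    # CPU efficiency: CPU time divided by wall time.
    print(percent("8m9s", "15m11s"))    # Min CPU time over wall time
    print(percent("9m16s", "15m11s"))   # Max CPU time over wall time
    # Time efficiency: wall time divided by requested time.
    print(percent("15m11s", "30m0s"))
```

With the values from the example email, these calls give 53.7, 61.0, and 50.6, which agree with the Min and Max CPU efficiency and the Max Time efficiency shown in the report.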

Cluster (html)

Digest (plain)

FIFE Batch System Job Digest

Owner:    kretzke
Number of Jobs: 29
Average time waiting in queue: 7m29s

Requested:         Min       Max       Avg
  Memory:        100.0     500.0     362.1 MiB
  Disk:            1.0      10.0       6.9 GiB
  Time:          10m0s     30m0s     23m6s

Used:              Min       Max       Avg
  Memory:          2.8      17.7       7.0 MiB
  Disk:            0.0       0.0       0.0 GiB
  Wall Time:      8m4s    15m11s    11m23s
  CPU Time:         0s     9m16s     1m51s

Efficiency:        Min       Max       Avg
  Memory:          0.0%     18.1%      0.7%
  Disk:            0.0%      0.0%      0.0%
  CPU:             0.0%     61.0%     12.3%
  Time:            0.0%     50.6%     24.9%

WARNING: 1 Job(s) are held for the following reasons:
       1 - SYSTEM_PERIODIC_HOLD: Job exceeded requested resources. (26)
For more information see:
https://fifemon.fnal.gov/monitor/dashboard/db/why-are-my-jobs-held?var-user=kretzke

Clusters included in this digest (cluster_id - group - command):
  20374223@fifebatch2.fnal.gov - fermilab - probe_20170526_130355_453540_0_1_wrap.sh
  20374235@fifebatch2.fnal.gov - fermilab - probe_20170526_130422_454526_0_1_wrap.sh
  20376280@fifebatch2.fnal.gov - fermilab - probe_20170526_134201_711829_0_1_wrap.sh
  20380298@fifebatch2.fnal.gov - fermilab - probe_20170526_152247_1357416_0_1_wrap.sh
  20380940@fifebatch2.fnal.gov - fermilab - probe_20170526_160046_1464477_0_1_wrap.sh
  20381181@fifebatch2.fnal.gov - fermilab - probe_20170526_161605_1533481_0_1_wrap.sh

--------------------------------------------------------------------------------
Opt-out or change your subscription preferences:
  https://fifemon-pp.fnal.gov/fifemail/subscription?report=1

Digest (html)