To help GPGrid users better understand their grid utilization, the FIFE group will provide automatic notifications to all users through FIFEMon monitoring. The email notifications are designed so that users can identify their resource utilization in terms of wall time, CPU time, data I/O, memory usage, and the efficiency of each of these elements. With this information, users can tune the resource requests in their job submissions to match the requirements of their workflows. This has the twofold benefit of increasing the overall efficiency of GPGrid while also allowing user workflows to match resources more quickly and return results with less delay. Finally, the notifications will be sent to all users by default, with an opt-out policy. A major reason for this choice is that, when efficiency policy violations are enforced, users will have been fully informed about the situation before SCD takes action.
I got a warning email, what do I do?
Warning emails are triggered when the efficiency of a job cluster (over 500 cumulative wall hours) or of all jobs from the previous day is below the limits defined in the FIFE Efficiency Policy. Low efficiency has many possible causes; for guidance, see: Job efficiency troubleshooting
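The trigger condition described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the efficiency limit is passed in as a parameter because the actual limits are defined in the FIFE Efficiency Policy, not here. The only number taken from this page is the 500 wall-hour threshold for clusters.

```python
# Threshold stated in the text: a cluster is only considered once it has
# accumulated more than 500 wall hours.
CLUSTER_WALL_HOURS_THRESHOLD = 500.0


def cluster_triggers_warning(total_wall_hours, efficiency, limit):
    """Return True if a cluster is large enough and its efficiency is below the limit.

    `efficiency` and `limit` are fractions (0.35 means 35%). `limit` is an
    assumption supplied by the caller; the real value comes from the policy.
    """
    return total_wall_hours > CLUSTER_WALL_HOURS_THRESHOLD and efficiency < limit


def daily_triggers_warning(efficiency, limit):
    """Return True if the efficiency of all jobs from the previous day is below the limit."""
    return efficiency < limit
```

For example, with a made-up limit of 0.5, a cluster with 620 cumulative wall hours and an efficiency of 0.35 would trigger a warning, while a cluster with the same efficiency but only 100 wall hours would not.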
Details of the notification emails:
- notifications will be generated when all jobs within a cluster/DAG have either completed or gone held.
- notifications will be sent to the <uid>@fnal.gov email address; users should be aware that this address exists and should monitor it
- production users are not included in the automatic report generation
- users can opt out of notification emails using the link at the bottom of the email, or subscribe to a daily digest
- Cluster is the job id used for fetching logs and monitoring using jobsub command line tools (i.e. --jobid=<jobid>)
- Wall Time is the min, max, and avg clock time of jobs in the cluster
- CPU Time is the min, max, and avg CPU time of jobs in the cluster
- CPU efficiency is the total CPU time of a job divided by the job's wall time
- Memory efficiency is the memory usage of a job (reported by HTCondor as the maximum RSS) divided by the memory requested at job submission (default is 2 GB)
- Disk efficiency is the scratch disk used by a job divided by the amount requested at job submission (default is 35 GB)
- Time efficiency is the wall time of a job divided by the expected run time given at job submission (default is 8 hours)
- the efficiency table shows the minimum efficiency reported within the cluster, the average of all values reported, and the maximum value reported by a job within the cluster
- the efficiency policy will be based upon the job success rate, the average cluster CPU efficiency, the Max memory efficiency reported, and the ratio of Max Wall time to Req Wall Time
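The efficiency definitions in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the FIFEMon implementation: the `Job` class, its field names, and the choice of units are assumptions made for the example, and the default of 2 GB is treated here as 2048 MiB.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    """Usage and requests for one job (hypothetical field names)."""
    wall_s: float                  # wall-clock time used, in seconds
    cpu_s: float                   # CPU time used, in seconds
    mem_mib: float                 # maximum RSS, in MiB
    disk_gib: float                # scratch disk used, in GiB
    req_mem_mib: float = 2048.0    # default request of 2 GB (assumed 2048 MiB)
    req_disk_gib: float = 35.0     # default request of 35 GB
    req_time_s: float = 8 * 3600   # default expected run time of 8 hours


def efficiencies(job):
    """Return the four efficiencies of one job as fractions."""
    return {
        # Guard against a zero wall time to avoid dividing by zero.
        "cpu": job.cpu_s / job.wall_s if job.wall_s else 0.0,
        "memory": job.mem_mib / job.req_mem_mib,
        "disk": job.disk_gib / job.req_disk_gib,
        "time": job.wall_s / job.req_time_s,
    }


def summarize(jobs):
    """Return (min, max, avg) of each efficiency over all jobs in a cluster."""
    per_job = [efficiencies(j) for j in jobs]
    return {
        name: (
            min(e[name] for e in per_job),
            max(e[name] for e in per_job),
            mean(e[name] for e in per_job),
        )
        for name in ("cpu", "memory", "disk", "time")
    }
```

As a check against the sample cluster report on this page, a job with 15m11s (911 s) of wall time, 8m9s (489 s) of CPU time, and a 30-minute time request gives a CPU efficiency of 489/911, about 53.7%, and a time efficiency of 911/1800, about 50.6%, matching the values in the report.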
The following subscription options are supported by the reporting service.
- Receive one email per job cluster/DAG (default)
- Receive a daily digest of all jobs run
- Receive both
- Receive none
Access the following link to change your subscription option (requires Services login): https://fifemon.fnal.gov/fifemail/subscription
Additional methods to control reporting options at submission time may be added in the future.
FIFE Batch System Job Summary

Cluster: email@example.com
Number of Jobs: 5
Submitted: 2017-05-26 15:22:47 -0500 CDT
Owner/Group: kretzke / fermilab (kretzke@FNAL.GOV)
Command: probe_20170526_152247_1357416_0_1_wrap.sh

Requested:
Memory:        500 MiB
Disk:          10.0 GiB
Time:          30m0s

View this cluster on Fifemon: https://fifemon.fnal.gov/monitor/dashboard/db/job-cluster-summary?var-cluster=20380298&var-schedd=fifebatch2.fnal.gov&from=1495830167000

Average time waiting in queue: 1m11s

NOTE: Job statistics are collected every 10 minutes, and may not accurately reflect the resources used when the job finished.

Used:          Min        Max        Avg
Memory:        2.8        6.7        4.0 MiB
Disk:          0.0        0.0        0.0 GiB
Wall Time:     15m11s     15m11s     15m11s
CPU Time:      8m9s       9m16s      8m51s

Efficiency:    Min        Max        Avg
Memory:        0.6%       1.4%       0.8%
Disk:          0.0%       0.0%       0.0%
CPU:           53.7%      61.0%      58.3%
Time:          0.0%       50.6%      40.5%

WARNING: 1 Job(s) are held for the following reasons:
1 - SYSTEM_PERIODIC_HOLD: Job exceeded requested resources. (26)

--------------------------------------------------------------------------------
Opt-out or change your subscription preferences: https://fifemon-pp.fnal.gov/fifemail/subscription?report=1
FIFE Batch System Job Digest

Owner: kretzke
Number of Jobs: 29
Average time waiting in queue: 7m29s

Requested:     Min        Max        Avg
Memory:        100.0      500.0      362.1 MiB
Disk:          1.0        10.0       6.9 GiB
Time:          10m0s      30m0s      23m6s

Used:          Min        Max        Avg
Memory:        2.8        17.7       7.0 MiB
Disk:          0.0        0.0        0.0 GiB
Wall Time:     8m4s       15m11s     11m23s
CPU Time:      0s         9m16s      1m51s

Efficiency:    Min        Max        Avg
Memory:        0.0%       18.1%      0.7%
Disk:          0.0%       0.0%       0.0%
CPU:           0.0%       61.0%      12.3%
Time:          0.0%       50.6%      24.9%

WARNING: 1 Job(s) are held for the following reasons:
1 - SYSTEM_PERIODIC_HOLD: Job exceeded requested resources. (26)

For more information see: https://fifemon.fnal.gov/monitor/dashboard/db/why-are-my-jobs-held?var-user=kretzke

Clusters included in this digest (cluster_id - group - command):
firstname.lastname@example.org - fermilab - probe_20170526_130355_453540_0_1_wrap.sh
email@example.com - fermilab - probe_20170526_130422_454526_0_1_wrap.sh
firstname.lastname@example.org - fermilab - probe_20170526_134201_711829_0_1_wrap.sh
email@example.com - fermilab - probe_20170526_152247_1357416_0_1_wrap.sh
firstname.lastname@example.org - fermilab - probe_20170526_160046_1464477_0_1_wrap.sh
email@example.com - fermilab - probe_20170526_161605_1533481_0_1_wrap.sh

--------------------------------------------------------------------------------
Opt-out or change your subscription preferences: https://fifemon-pp.fnal.gov/fifemail/subscription?report=1
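The reports above print durations in a compact form such as 15m11s or 30m0s. For users who want to recompute an efficiency from the numbers in an email, the following sketch converts such strings to seconds. The function name `parse_duration` is hypothetical, and the parser assumes that only hour, minute, and second units appear, which is all that the sample reports on this page show.

```python
import re

# Seconds per unit; only h, m, and s are assumed to occur in the reports.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_PART = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_duration(text):
    """Convert a duration string such as '15m11s' to a number of seconds."""
    parts = _PART.findall(text)
    # Reject strings with anything left over after the recognized parts.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


# CPU efficiency of the job with the lowest CPU time in the sample cluster report.
cpu_efficiency = parse_duration("8m9s") / parse_duration("15m11s")
```

Here `cpu_efficiency` is 489/911, or about 53.7%, which agrees with the minimum CPU efficiency listed in the sample cluster report.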