Task #16430
Milestone #16428: Roll out FIFE efficiency policy
Design and implement email generation based on the policy
100%
History
#1 Updated by Kevin Retzke almost 4 years ago
- Status changed from Assigned to Work in progress
This was originally requested in RITM0533293. Suppose I'll close that. Summary of status:
Report triggering (from Condor EventLog in Elasticsearch) and generation (from Condor job info in Elasticsearch) is complete. Sample email:
FIFE Batch System Job Summary
Cluster: 16693618@fifebatch1.fnal.gov
Number of Jobs: 20
Submitted: 2017-05-04T00:30:18-05:00
Owner/Group: dpershey / nova (dpershey@FNAL.GOV)
Requested:
  Memory: 1900 MiB
  Disk: 2.0 GiB
  Time: 3h0m0s
Average time waiting in queue: 15h26m32s
Used:          Min       Max       Avg
  Memory:      26.9      548.8     174.4    MiB
  Disk:        0.0       0.0       0.0      GiB
  Wall Time:   48m38s    48m40s    48m38s
  CPU Time:    0s        28m8s     7m31s
Efficiency:    Min       Max       Avg
  Memory:      1.4%      28.9%     9.2%
  Disk:        0.0%      1.4%      0.3%
  CPU:         0.0%      57.8%     15.5%
  Time:        27.0%     27.0%     27.0%
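For reference, the efficiency figures in the sample email can be reproduced with a short sketch. The formulas here are an assumption inferred from the sample numbers, not taken from the fifemail code: memory and time efficiency are taken as used divided by requested, and CPU efficiency as CPU time divided by wall time. The function names `parse_duration` and `efficiency` are hypothetical.

```python
import re


def parse_duration(text):
    """Convert a duration such as '3h0m0s' or '48m38s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def efficiency(used, requested):
    """Return used/requested as a percentage rounded to one decimal place.

    Assumed formula; a non-positive request is reported as 0.0.
    """
    if requested <= 0:
        return 0.0
    return round(100.0 * used / requested, 1)


# Values copied from the sample email above.
memory_max = efficiency(548.8, 1900)  # max memory used vs. requested, MiB
time_avg = efficiency(parse_duration("48m38s"), parse_duration("3h0m0s"))
cpu_max = efficiency(parse_duration("28m8s"), parse_duration("48m40s"))

print(f"Memory max: {memory_max}%  Time avg: {time_avg}%  CPU max: {cpu_max}%")
```

Under these assumptions the three values come out to 28.9%, 27.0% and 57.8%, matching the sample. The disk row cannot be checked the same way, since the used values in the email are already rounded to 0.0 GiB.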
The program has been running (not actually sending emails) since yesterday to gather statistics. Monitoring at https://fifemon-pp.fnal.gov/dashboard/db/fifemail
Generating under one email per second, which I suppose is reasonable (the initial rate was much higher due to a bug).
Remaining parts:
- Opt-out link & handling.
- Actually sending email.
#2 Updated by Tanya Levshina almost 4 years ago
- Target version set to FIFE Roadmap for FY18
#3 Updated by Kevin Retzke almost 4 years ago
Some observations and feedback from the Lariat trial:
1. Due to the collection interval, some emails have been seen with an "exceeded resource request" hold notice (and the jobs actually held), but the utilization numbers don't reflect this. The issue will be solved by job history collection, but in the meantime the email should probably include a note that resource numbers may be from up to ten minutes before the job ended or was held.
2. Should add a link to the "why are my jobs held?" dashboard for further information on held jobs.
3. User has requested the option to send daily digests, as has already been discussed.
I'll add more as they come in.
#4 Updated by Kevin Retzke over 3 years ago
General summary emails were deployed in production 5/30. Still need to finalize the spec for efficiency policy notifications.
User documentation: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Email_Reports
Operations and development: https://cdcvs.fnal.gov/redmine/projects/discompsupp/wiki/Fifemail
#5 Updated by Tanya Levshina over 3 years ago
- % Done changed from 0 to 90
Emails are monitored here: https://fifemon-pp.fnal.gov/dashboard/db/fifemail?refresh=1m&orgId=1
#6 Updated by Tanya Levshina over 3 years ago
- Status changed from Work in progress to Resolved
- % Done changed from 90 to 100
#7 Updated by Tanya Levshina over 3 years ago
- Status changed from Resolved to Closed