Task #16430

Milestone #16428: Roll out FIFE efficiency policy

Design and implement email generation based on the policy

Added by Tanya Levshina over 2 years ago. Updated about 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 05/04/2017
Due date: 05/20/2017
% Done: 100%
Estimated time:
Duration: 17

History

#1 Updated by Kevin Retzke over 2 years ago

  • Status changed from Assigned to Work in progress

This was originally requested in RITM0533293, which I suppose I'll close. Summary of status:

Report triggering (from the Condor EventLog in Elasticsearch) and generation (from Condor job info in Elasticsearch) are complete. Sample email:

FIFE Batch System Job Summary

Cluster:        16693618@fifebatch1.fnal.gov
Number of Jobs: 20
Submitted:      2017-05-04T00:30:18-05:00
Owner/Group:    dpershey / nova (dpershey@FNAL.GOV)

Requested:
  Memory: 1900 MiB
  Disk:   2.0 GiB
  Time:   3h0m0s

Average time waiting in queue: 15h26m32s

Used:              Min       Max       Avg
  Memory:         26.9     548.8     174.4 MiB
  Disk:            0.0       0.0       0.0 GiB
  Wall Time:    48m38s    48m40s    48m38s
  CPU Time:         0s     28m8s     7m31s

Efficiency:        Min       Max       Avg
  Memory:         1.4%     28.9%      9.2%
  Disk:           0.0%      1.4%      0.3%
  CPU:            0.0%     57.8%     15.5%
  Time:          27.0%     27.0%     27.0%
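The efficiency figures in the sample above look consistent with "used divided by requested" for memory and time, and "CPU time divided by wall time" for CPU. The sketch below reproduces the Avg column under that assumption. The formulas are inferred from the sample numbers, not taken from the fifemail source, and the function names `parse_duration` and `efficiency` are hypothetical.

```python
import re

# Seconds per unit for durations formatted like "3h0m0s" or "48m38s".
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Convert a duration string such as '3h0m0s' to a number of seconds."""
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([hms])", text))


def efficiency(used, requested):
    """Return used/requested as a percentage (assumed formula).

    Returns 0.0 when nothing was requested, to avoid dividing by zero.
    """
    if requested <= 0:
        return 0.0
    return 100.0 * used / requested


# Values copied from the Avg column of the sample email.
mem_eff = efficiency(174.4, 1900)  # MiB used / MiB requested
time_eff = efficiency(parse_duration("48m38s"), parse_duration("3h0m0s"))
cpu_eff = efficiency(parse_duration("7m31s"), parse_duration("48m38s"))  # CPU / wall

print(f"Memory: {mem_eff:.1f}%  Time: {time_eff:.1f}%  CPU: {cpu_eff:.1f}%")
```

The Disk row cannot be checked the same way, since the usage values are rounded to 0.0 GiB in the email while the maximum efficiency shows 1.4%.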

The program has been running (not actually sending emails) since yesterday to gather statistics. Monitoring is at https://fifemon-pp.fnal.gov/dashboard/db/fifemail

It is generating under one email per second, which I suppose is reasonable (the initial rate looked much higher due to a bug).

Remaining parts:
  • Opt-out link & handling.
  • Actually sending email.

#2 Updated by Tanya Levshina over 2 years ago

  • Target version set to FIFE Roadmap for FY18

#3 Updated by Kevin Retzke over 2 years ago

Some observations and feedback from the Lariat trial:

1. Because of the collection interval, some emails have gone out with an "exceeded resource request" hold notice (and the jobs were actually held), but the utilization numbers don't reflect this. Job history collection will solve the issue, but in the meantime we should probably include a note that resource numbers may be from up to ten minutes before the job ended or was held.

2. We should add a link to the "why are my jobs held?" dashboard for further information on held jobs.

3. A user has requested a daily digest option, as has already been discussed.

I'll add more as they come in.
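One possible shape for items 1 and 2, as a minimal sketch: a helper that builds the footer notes for a report. The function name `report_notes`, its parameters, and the exact wording are hypothetical. The dashboard URL is deliberately left as an argument because it is not given here.

```python
# Upper bound on how stale the usage numbers can be, per item 1 above.
COLLECTION_LAG_MINUTES = 10


def report_notes(held_count, held_dashboard_url=None):
    """Return footer lines for a job summary email (hypothetical helper).

    The staleness caveat is always included; the held-jobs link is added
    only when at least one job was held and a URL was supplied.
    """
    notes = [
        "Note: resource usage figures may be from up to "
        f"{COLLECTION_LAG_MINUTES} minutes before the job ended or was held."
    ]
    if held_count > 0 and held_dashboard_url:
        notes.append(f"Why are my jobs held? See {held_dashboard_url}")
    return notes


# Example: a cluster with two held jobs and no dashboard URL configured.
for line in report_notes(held_count=2):
    print(line)
```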

#4 Updated by Kevin Retzke over 2 years ago

General summary emails were deployed in production on 5/30. We still need to finalize the spec for efficiency policy notifications.

User documentation: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Email_Reports
Operations and development: https://cdcvs.fnal.gov/redmine/projects/discompsupp/wiki/Fifemail

#5 Updated by Tanya Levshina about 2 years ago

  • % Done changed from 0 to 90

#6 Updated by Tanya Levshina about 2 years ago

  • Status changed from Work in progress to Resolved
  • % Done changed from 90 to 100

#7 Updated by Tanya Levshina about 2 years ago

  • Status changed from Resolved to Closed

