Date:
01/03/2017

Attendees:
Anna M., Vladimir P., Joe B., Steve W., Margherita V.W., Marc M., Vito D., Felipe

Agenda:
  1. Next release update
  2. Who is allowed to submit production jobs to POMS?
  3. Problems that occurred last week: do we need a shifter?
  4. Missing information in POMS: is it related to missing information from offsites?
  5. AOB

Discussion:

1. Next release update

Feature #12751: Ability to determine when tasks are really finished:
Marc is working on it; it will be done.

Feature #14659: Poms should keep a list of job launches requested while job launches are "held":
Steve and Marc are working on it; it will be done.

2. Who is allowed to submit production jobs to POMS?

Joe provides a list from VOMS which is passed to POMS, but once a user is added there is no tracking of whether that user is still authorized. Cases such as people
leaving an experiment or leaving the lab are not tracked.
Anna and Steve’s chat: how about having, for each experiment, a superuser who manages experimenters and their roles?
Currently, a user who is added to an experiment can do "anything". Steve will introduce another field besides "active"
in the users list. Maybe a "read-only" field instead of a superuser?
Outcome: Joe’s list is OK. Each experiment should have a reference person for POMS. If they ask to put some effort into POMS, we can add that person as a superuser in POMS for his/her experiment. Which kind of flag to use for this person is still up for discussion.

3. Problems that occurred last week: do we need a shifter?

Last week there was an "out of memory" issue. We need to be able to monitor jobs etc. Possibly each experiment could have someone who does the monitoring.
Or should we use some existing tools? Zabbix? Slack?
POMS is part of the FIFE toolkit; should the FIFE shifters' support be extended to include POMS?

4. Missing information in POMS: is it related to missing information from offsites?

Anna: Alex Himmel (NOvA) reported that information is missing from the POMS job reports.
Some numbers differ from other sources that provide info on jobs. POMS reliability is important if people are to use it.
Marc: The numbers can be wrong. It's technically hard to get things right sometimes; running on the grid is full of holes. The problem is that there are cases where POMS is "not allowed" to get all the needed information, since it exists on nodes we are not allowed to get info from. We need to work on it.
Anna: We need to coordinate with the OSG team for example and open GOC tickets when needed.
Marc: Maybe we need to make more extensive use of job log files, search the Condor logs, etc.
Anna: All this needs to be coordinated.