Fermilab/HTCondor Minutes March 20, 2015

Krista, Zack, Tony, Jaime, Parag, Marco, Zack

Agenda and Notes:

CCB_ADDRESS can be a list of comma-separated hosts. How do the different daemons handle it, especially a startd and its starters?

  • Does it use the first one and fall back to the following ones in case of problems?
  • Does it keep switching?
  • Does it use different ones for different daemons on the same installation (under the same master)?
  • Answer: it always takes the first one and tries to use it, failing over to
    the second one if there are problems. The choice is not random unless you
    specify "random_choice", in which case the random choice happens every
    time there is a CCB lookup.
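
The behavior described in the answer can be sketched roughly as follows. This is only an illustration of "first one, then fail over" versus a fresh random pick on every lookup; it is not HTCondor's actual code. The function name `pick_ccb`, the `is_reachable` callback, and the host names are all hypothetical.

```python
import random


def pick_ccb(brokers, is_reachable, random_choice=False):
    """Return the first reachable broker, or None if none is reachable.

    Hypothetical helper: with random_choice=False the list order is
    respected (first entry, then fail over to the next one); with
    random_choice=True the order is reshuffled on every call, mimicking
    a random choice made at every CCB lookup.
    """
    candidates = list(brokers)
    if random_choice:
        random.shuffle(candidates)
    for broker in candidates:
        if is_reachable(broker):
            return broker
    return None


# Made-up host names; the first broker is pretended to be down.
brokers = ["ccb1.example.org", "ccb2.example.org"]
down = {"ccb1.example.org"}
chosen = pick_ccb(brokers, lambda b: b not in down)
print(chosen)
```

With the first broker marked as down, the lookup falls through to the second entry; with every broker reachable and no random choice, the first entry is always returned.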

I was wondering if there is a way to save in the classad, and collect at the end, information from previous (e.g. preempted) executions of the same job. I’d like to learn more about job_machine_attrs/job_machine_attrs_history_length and SYSTEM_JOB_MACHINE_ATTRS/SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH, which I think could provide a solution. I think the last two are the configuration equivalents of the first two, which are set in the job submit file.

  • Can I count on the lists being consistent if I save multiple attributes, e.g. Machine and CommittedTime? That is:
  • on MachineAttrMachine0 the job used MachineAttrCommittedTime0, ...
  • on MachineAttrMachine1 the job used MachineAttrCommittedTime1, ...
  • How do they perform (how much overhead)?
  • Would you suggest a different solution?
  • Didn't follow all of the discussion, but you can put all the classad attributes in SYSTEM_JOB_MACHINE_ATTRS to keep history.
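
As a rough illustration of how the saved history could be collected at the end, the sketch below groups MachineAttr<Name><N> attributes from a job classad (represented here as a plain Python dict) into one record per index N. It assumes that equal indices refer to the same execution, which is exactly the consistency question raised above and was not confirmed in the meeting. The helper name `machine_attr_history` and the sample values are hypothetical.

```python
import re


def machine_attr_history(job_ad, names):
    """Group MachineAttr<Name><N> attributes into one record per index N.

    Assumption (not confirmed in the minutes): attributes sharing the
    same index N describe the same execution of the job.
    """
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"^MachineAttr(" + alternatives + r")(\d+)$")
    rows = {}
    for key, value in job_ad.items():
        match = pattern.match(key)
        if match:
            index = int(match.group(2))
            rows.setdefault(index, {})[match.group(1)] = value
    # One record per index, in ascending index order.
    return [rows[i] for i in sorted(rows)]


# Made-up job ad with a history length of two.
job_ad = {
    "Owner": "alice",
    "MachineAttrMachine0": "node2.example.org",
    "MachineAttrCommittedTime0": 120,
    "MachineAttrMachine1": "node1.example.org",
    "MachineAttrCommittedTime1": 45,
}
history = machine_attr_history(job_ad, ["Machine", "CommittedTime"])
print(history)
```

Attributes that do not match the pattern (such as Owner in the sample) are ignored, so the helper can be pointed at a full job ad.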

Status of nova gahp

  • No change, communication black hole

DES Use case
The Dark Energy Survey (DES) program uses a camera to survey the sky in an
attempt to understand the cause of the acceleration in the expansion rate of
the universe.  Another experiment, the Laser Interferometer Gravitational Wave
Observatory (LIGO), is searching for gravitational waves.  LIGO can narrow
signals down to a relatively small ring in the sky (still a very large section
of the sky).  A researcher has proposed that LIGO send the coordinates to DES,
which would then focus on this area of the sky and perform an in-depth analysis.
These events have a relatively short life span, so DES must be notified quickly
and must process its data quickly for further refinement and analysis.

These events are not anticipated to be regular, nor are they anticipated to be
frequent.  The desire is to allow DES to completely take over the GPGrid cluster
for a short time to accomplish their processing, but not to allow them to do
this whenever they choose.  Essentially, if using condor priorities, can we set
a half life for a specific accounting group that differs from the default half
life?

  • Custom classad functions? Anything is technically possible, but the current knobs probably don't support this.
  • Idea: if the custom classad function evaluates to true, then quota = 100% or prio = 1.
  • Will talk to the team to see whether it is possible with the current knobs.
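
To make the half-life question concrete, the sketch below models usage-based priority as an exponential decay toward current usage, with the weight of past usage halving every half life. This is a simplified model for illustration only, and the per-group half-life table is purely hypothetical: as noted above, the current knobs probably do not support a per-group half life. The names `decayed_priority` and `HALF_LIFE_BY_GROUP` and all numbers are made up.

```python
DAY_S = 86400

# Hypothetical per-group setting; not an existing HTCondor knob.
HALF_LIFE_BY_GROUP = {"default": DAY_S, "group_des": 3600}


def decayed_priority(previous, usage, elapsed_s, half_life_s):
    """Exponentially decay a usage-based priority toward current usage.

    Simplified model (an assumption, not HTCondor code): the weight of
    the previous value halves every half_life_s seconds.
    """
    beta = 0.5 ** (elapsed_s / half_life_s)
    return beta * previous + (1 - beta) * usage


# One day after a burst that pushed the priority value to 100, with no
# further usage in the meantime.
default_after_day = decayed_priority(100.0, 0.0, DAY_S, HALF_LIFE_BY_GROUP["default"])
des_after_day = decayed_priority(100.0, 0.0, DAY_S, HALF_LIFE_BY_GROUP["group_des"])
print(default_after_day, des_after_day)
```

Under this model a shorter half life forgets a burst of usage much faster, so a group with its own short half life would regain a good priority soon after a takeover, while the default half life would penalize it for roughly a day.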