Fermilab/HTCondor Minutes May 08, 2015

Krista, Zack, Tony, Steve, Parag, Marco

Agenda and Notes:

condor_off -peaceful sent to a remote machine does not shut down the daemons, and the Python bindings have a bug that prevents a peaceful shutdown.

  • Bug report has been filed: htcondor-admin 27971
  • No changes from last meeting

When using glexec, HTCondor tars up results as the job user, then pipes the tar'd content to the condor user and untars it as the HTCondor user. This effectively doubles the disk space used by the job.

  • There should be a solution shortly for this problem

A recent change to GWMS allows the CCBs to be set explicitly, and more than one can be given (for redundancy and load balancing). If the user specifies too many, the system seems not to work (I did not find the exact error, but communications are broken). Is there a limit to the number of CCBs specified in CCB_ADDRESS? What breaks? Where can I see it?

  • Should be no limit
  • Breaks somewhere between 10 and 20: 10 works, 20 fails
  • Turn up debugging and send error report to HTCondor team

Running on Gordon (SDSC), the STARTER started failing to write files with errno 28 (ENOSPC). This could be a file system error such as "no space", "no inodes", or too many files in a directory, but checks for all of these seemed negative when I looked. The error went away and things restarted. Any suggestion of what could have caused the error, and how to dig further if it happens again? The file system where the error was seen is Lustre.

  • From Zach: sometimes it may be on the Starter, sometimes on the Shadow. Maybe there was a user quota, or possibly a Lustre error
  • Sent error message to list - definitely happening on the Starter side
  • not much extra info available since it is a kernel message and HTCondor passes back all the info it has
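
If the error comes back, a quick check like the one below could be run on the affected node while it is still failing. This is only a sketch: the function name check_space is made up for illustration, and it assumes a Linux node where errno 28 is ENOSPC. It reports free space and free inodes as seen through statvfs, which does not show user quotas or a single full Lustre storage target, so the Lustre tools (lfs quota, lfs df) would still be needed for those.

```python
import errno
import os


def check_space(path):
    """Return free-space and free-inode figures for the file system holding path.

    check_space is a hypothetical helper, not part of HTCondor.
    """
    st = os.statvfs(path)
    return {
        # Symbolic name of the errno under investigation (28 on Linux).
        "errno_name": errno.errorcode[errno.ENOSPC],
        # Bytes available to unprivileged users.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Inodes available to unprivileged users, and total inodes.
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }


if __name__ == "__main__":
    # "." stands in for the job's scratch directory (an assumption).
    for key, value in check_space(".").items():
        print(f"{key}: {value}")
```

If both figures look healthy while writes still fail, that would point toward a quota or something specific to Lustre rather than a plainly full file system.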

Status of nova gahp

  • No change, communication black hole

Status of GCE gahp

  • Thought there was a gahp with the old API that broke when Google changed APIs
  • Will have to check on the status

Weird warning message when upgrading to HTCondor 8.3.5

  • Turns out to be harmless and safe to ignore. It is related to IPv6 code. The FNAL nodes do not have IPv6 enabled, which is the cause for the warning. Will be fixed in 8.3.6.