Fermilab/HTCondor Minutes July 10, 2015

Greg, Krista, Marco, Todd T., Tony

Agenda and Notes:

clarifications about the schedd_cron (daemon hooks) mechanism:

  • Q: are the jobs queued to execute on the available slots, executed in a separate slot, or run regardless of the slot state?
  • A: Not run in a slot, but on the side
  • Q: could more than one job run in parallel?
  • A: sequential, not parallel - but in parallel with the slot
  • Q: what is a benchmark?
  • A: The benchmark uses the slot, runs periodically when slot is idle (slot state - unclaimed, benchmark)
  • Q: if there is privsep they run as the schedd (glidein) user, correct?
  • A: yes
  • Q: is there a way to run on some event (but not reconfig, e.g. when a job ends)? Not as a job wrapper, because those run as the user and cannot affect the schedd classad. (Want to run a script right before (or right after) a job starts, i.e. run based on an event rather than periodically.)
  • A: Two ways. 1) simple mechanism - job wrappers, 2) Section 4.4 of the manual (Hooks); look at Starter Hooks.
  • NOTE: the scripts run by schedd_cron must be carefully crafted. Not under Condor protection
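
A minimal sketch of what "carefully crafted" could look like for such a script, assuming the cron mechanism picks up `Attribute = Value` lines from the script's standard output. The attribute names (`GLIDEIN_ProbeOK`, `GLIDEIN_ProbeNote`) and the `probe()` function are hypothetical, made up for illustration. The idea is to validate everything before printing and to print nothing on stdout if the probe fails, so a broken script does not push garbage into the daemon's classad.

```python
import re
import sys

# Conservative pattern for attribute names: letters, digits, underscore.
_ATTR_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def format_classad_lines(attrs):
    """Render a dict as 'Name = Value' lines, rejecting anything suspicious."""
    lines = []
    for name, value in attrs.items():
        if not _ATTR_NAME.fullmatch(name):
            raise ValueError("bad attribute name: %r" % (name,))
        if isinstance(value, bool):
            # bool must be checked before int (bool is a subclass of int)
            rendered = "True" if value else "False"
        elif isinstance(value, int):
            rendered = str(value)
        elif isinstance(value, str):
            if "\n" in value or "\r" in value:
                raise ValueError("newline in value of %s" % name)
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + escaped + '"'
        else:
            raise TypeError("unsupported type for %s" % name)
        lines.append("%s = %s" % (name, rendered))
    return lines


def probe():
    """Hypothetical probe; a real script would inspect the node here."""
    return {"GLIDEIN_ProbeOK": True, "GLIDEIN_ProbeNote": "scratch ok"}


def main():
    try:
        lines = format_classad_lines(probe())
    except Exception as exc:
        # Nothing goes to stdout on failure; the error goes to stderr only.
        print("probe failed: %s" % exc, file=sys.stderr)
        return 1
    sys.stdout.write("\n".join(lines) + "\n")
    return 0


if __name__ == "__main__":
    main()
```

All output is built first and written in one go, so a failure halfway through the probe cannot leave a partial set of attributes on stdout.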

clarification about CCB_ADDRESS, when multiple addresses are specified (Don't have notes for this question)

  • is it like COLLECTOR_HOST and all the addresses are used or only one is?
  • daemons connect and advertise to all the addresses, clients only to one (the first, falling back if there are problems); client functions invoked by daemons have a sinful string with the destination. Is this correct?
  • if I have multiple collectors (HA) and multiple CCBs, is there any way to bind them (use this ccb for this collector)?

how will a multiple collector or CCB configuration work with the shared port daemon?

  • Q: is there an advantage to running multiple collectors/ccbs on the same host?
  • A: Collectors, yes, probably can get away with one CCB
  • Q: will the shared port daemon be a bottleneck?
  • A: No
  • Q: could it assign the connection to a random collector/ccb from a pool?
  • A: No?? (I missed the answer to this one)
  • Q: if I specify the port in the ccb/collector configuration (e.g. secondary collectors), they are on separate ports and do not use the shared port daemon, correct?
  • A: No, pilots still need to know which child collector they need to talk to

some jobs show up in condor_q but cannot be found with condor_rm (I used the schedd name shown in condor_q -g). Is this a bug in condor?

    # condor_q -name
    -- Schedd: : <>
    753.0   frontend        6/10 17:27   0+00:05:02 R  0   0.1
    754.0   frontend        6/10 17:27   0+00:16:01 R  0   0.1
    755.0   frontend        6/10 17:38   0+00:00:00 I  0   0.1
    756.0   frontend        6/10 17:39   0+00:00:00 I  0   0.1

    4 jobs; 0 completed, 0 removed, 2 idle, 2 running, 0 held, 0 suspended
    # condor_rm -name 756.0
    Job 756.0 not found
  • We saw this bug earlier in the year with the CMS glideinWMS factory at FNAL. A few restarts fixed the issue, and we were not able to reproduce. Marco's situation is reproducible. It only happens on the schedd with the name that contains only the hostname. It also happens ~1 in 5 times.

which is the correct way to invalidate a classad?

  • Q: What is the difference? Which one works better? 3b is not scanning all the ads, so it should be more efficient (suggested by Brian). Why 3a then?
  • A: Whenever we have the exact ad, use 3b. Use 3a when you want to invalidate multiple ads.