Job update performance

Current situation:

Currently we have 1 "jobsub_q_scraper" agent running, with

  • 1 thread running condor_q, with a 2-minute delay between runs
  • 8 threads calling bulk_update in parallel to report data, partitioned by task_id (submission id)
  • the webservice bulk_update locks tasks to prevent database conflicts, so reports from the
    threads above can run in parallel.

Our reporting calls to the webservice generally take 25 seconds per call to update 256 jobs.

With 8 threads in parallel, that limits us to 2048 job updates every 25 seconds, or about 82 jobs/sec.

Note that we only report changed job status.

So when, for example, 100k jobs start or stop running in a short period, at 82 jobs/sec it takes roughly 20 minutes to get all the job updates through.
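The arithmetic behind those figures can be checked directly (constants taken from the numbers above):

```python
# Back-of-the-envelope throughput math for the current setup.
JOBS_PER_CALL = 256   # jobs updated per bulk_update call
CALL_SECONDS = 25     # observed time per bulk_update call
THREADS = 8           # parallel reporting threads

jobs_per_sec = THREADS * JOBS_PER_CALL / CALL_SECONDS
print(round(jobs_per_sec))      # ~82 jobs/sec

backlog = 100_000               # e.g. 100k jobs changing state at once
minutes = backlog / jobs_per_sec / 60
print(round(minutes))           # ~20 minutes to drain the backlog
```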

Investigations to make

Reporting threads

We could raise the number of reporting threads (and the number of threads on our poms service to match) and see what happens to our update-time histogram. If we can double or quadruple the reporting threads on both ends without significantly changing the time for a bulk_update call, we could clear large groups of job changes in half or a quarter of the time.
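A minimal sketch of the fan-out being scaled, with illustrative names rather than the actual scraper code. Updates are partitioned by task_id so each task's reports stay in one partition, matching the per-task locking that lets the webservice process partitions in parallel:

```python
# Sketch: scale the reporting fan-out by raising N_THREADS and measuring
# whether per-call bulk_update time stays flat. bulk_update here is a
# stand-in for the real webservice call.
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

N_THREADS = 16  # try 2x or 4x the current 8 and watch the histogram

def bulk_update(batch):
    # stand-in for the real webservice call; returns jobs reported
    return len(batch)

def report_changes(changes):
    # changes: list of (task_id, job_update) pairs
    partitions = defaultdict(list)
    for task_id, update in changes:
        partitions[task_id % N_THREADS].append(update)
    with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
        counts = pool.map(bulk_update, partitions.values())
    return sum(counts)

print(report_changes([(tid, {"status": "Running"}) for tid in range(100)]))
```

The partition-by-task_id step is what keeps the experiment safe: doubling threads only helps if the webservice's per-task locks, not a shared resource, are the serialization point.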

Split by experiment

We could run a separate jobsub_q_scraper per experiment, with condor_q queries limited to that experiment's group. This would keep, say, Nova launching 50k jobs from slowing down the updates for Dune or SBND.
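Each per-experiment instance would issue a constrained query; a sketch of building that command follows. The classad attribute name (Jobsub_Group) is an assumption here; substitute whatever attribute actually tags jobs by experiment:

```python
# Sketch: build a per-experiment condor_q command line. The attribute
# name Jobsub_Group is an assumption, not confirmed from the source.
def condor_q_cmd(experiment):
    return [
        "condor_q",
        "-constraint", f'Jobsub_Group == "{experiment}"',
        "-json",
    ]

# one scraper instance per experiment, each querying only its own jobs
for exp in ("nova", "dune", "sbnd"):
    print(" ".join(condor_q_cmd(exp)))
```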

Improving bulk_update speed

If we could knock the bulk_update time down, that would obviously improve all of these rates. We could review the current code and/or rework the tables. In particular, if we regrouped the data between the Jobs and JobHistories tables, we could make all updates be inserts into JobHistories and/or JobFiles, which could in turn be partitioned for speed.
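A toy sqlite3 sketch of the insert-only idea: instead of UPDATEing a row in Jobs, every status change becomes an INSERT into a history table, and the current status is just the most recent row per job. Table and column names here are illustrative, and real partitioning would be done in the production database, not sqlite:

```python
# Sketch: insert-only job history instead of update-in-place.
# Inserts never contend on an existing row the way updates do,
# and an append-only table is easy to partition (e.g. by time).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE job_histories (
    job_id INTEGER, status TEXT, created INTEGER)""")

# each status change is a new row, never a rewrite of an old one
rows = [(1, "Idle", 100), (1, "Running", 200), (2, "Idle", 150),
        (1, "Completed", 300)]
db.executemany("INSERT INTO job_histories VALUES (?, ?, ?)", rows)

# current status = latest history row per job
cur = db.execute("""SELECT job_id, status FROM job_histories h
    WHERE created = (SELECT MAX(created) FROM job_histories
                     WHERE job_id = h.job_id)
    ORDER BY job_id""")
print(cur.fetchall())   # [(1, 'Completed'), (2, 'Idle')]
```

The trade-off is that reads must find the latest row (or consult a small materialized "current" view), in exchange for write paths that are pure appends.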