Project

General

Profile

Reworking jobs

So currently the Jobs and JobsHistories class/tables:

  • duplicate some data (latest status)
  • are updated unevenly
  • are the most updated /inserted tables

and the issue is that there are really several kinds of updates:

  1. from jobsub_q_scraper
    • job creation with jobsub_job_id and initial status
    • updates with different status optional held reason
    • updates with wall clock and/or cpu time -- note that wall clock time from job isn't needed, we can subtract times from appropriate history timestamps.
  2. from joblog scraper
    • updates with detailed running status, user_exe_exit_code
    • updates with cpu_type, host_site, node_name
  3. from declared_files_watcher
    • output_files_declared flag (could be derived)

So based on this, it seems to me we should break up the Jobs / JobsHistories table so that
most of these are inserts which do not require updates or locks up the chain.

New tables

Batch_nodes:

  • batch_node_id
  • batch_node_name

[By royal decree, jobsub<n>.fnal.gov will have batch_node_id n, for n=1..20, so we can not even look it up]

Jobs

  • jobsub job id
  • task_id (foreign key to tasks/submissions)
  • created (approx start time of job)

Job completion

  • jobsub job id
  • user_exe_exit_code (NULLable - NuLL = user exe hasn't finished yet)
  • job_exit_code (Nullable - NULL = job hasn't finished yet)
  • finished (completion time of job)

JobHistory:

  • job_history_id
  • jobsub_job_id (foreign key to jobs)
  • Status (Idle,Running,Held,...) (integer)
  • AdditionalInfo (nullable text -- additional running info or held reason)
  • updated (time)

Job's current status is max(updated) from JobHistory

JobCPUHistory

  • cpu_history_id
  • jobsub_job_id (foreign key to jobs)
  • time (float)
  • type (cpu, wallclock)
  • updated (time)

JobNode:

  • jobsub_job_id (primary key, foreign key to Jobs)
  • node_id

where job last ran.

NodeInfo:

  • node_id
  • node_name, host_site (unique pair)
  • cpu_type

Need to be careful with the above, we get fake node names due to "container" setups.

In this model, most updates coming in will be inserts into JobHistory or JobCPUHistory.
At job startup, we will initially create the job, and when the job terminates we will
log its exit codes, and those will be the only writes to the Job table.

If we get node info, we will possibly insert a node at a site, and insert a JobNode entry.

Other updates will go straight to the appropriate History table, and not touch the Job itself.

We could provide a backwards compatability view...