Bug #17027

Unusual Completed but not Located job counts for NOvA

Added by Marc Mengel over 2 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version:
Start date: 06/26/2017
Due date:
% Done: 100%
Estimated time:
First Occurred:
Scope: Internal
Experiment: -
Stakeholders:
Duration:

Description

2) Check on the validity of numbers on the front page. We get tons of jobs listed as "completed, but producing no children."

These ought to be jobs that had output files which have not been declared.

This may be a failing in our output file patterns for NOvA: perhaps we treat log files and similar files that are never declared as output files(?).

History

#1 Updated by Marc Mengel over 2 years ago

  • Target version set to v2_1_3

#2 Updated by Marc Mengel over 2 years ago

So I've confirmed that we have (numerous) jobs whose output files all have declared dates but which are still listed as Completed, not Located. For example:

pomsprd=> select * from jobs where job_id = 1428670;
 job_id  | task_id |          jobsub_job_id           |     node_name      | cpu_type | host_site |  status   |            updated            | output_files_declared | user_exe_exit_code | input_file_names | reason_held | consumer_id | cpu_time | wall_time 
---------+---------+----------------------------------+--------------------+----------+-----------+-----------+-------------------------------+-----------------------+--------------------+------------------+-------------+-------------+----------+-----------
 1428670 |   27187 | 18466966.157@fifebatch2.fnal.gov | compute-20-2.tier2 | unknown  | Caltech   | Completed | 2017-04-01 02:47:58.185245-05 | f                     |                  0 |                  |             |             |        0 |          
(1 row)

pomsprd=> select * from job_files where job_id = 1428670;
 job_id  |                             file_name                             | file_type |            created            |           declared            
---------+-------------------------------------------------------------------+-----------+-------------------------------+-------------------------------
 1428670 | fardet_r00025768_s62_DDsnews.raw                                  | input     | 2017-04-01 02:19:27.211184-05 | 
 1428670 | fardet_r00025768_s62_ddsnews_S17-02-21_v1_data.artdaq.log.bz2     | log       | 2017-04-01 02:19:49.252711-05 | 
 1428670 | fardet_r00025769_s10_ddnumu_S17-02-21_v1_data.artdaq.log.bz2      | log       | 2017-04-01 02:19:00.641554-05 | 
 1428670 | fardet_r00025769_s34_ddcontained_S17-02-21_v1_data.artdaq.log.bz2 | log       | 2017-04-01 02:14:00.618194-05 | 
 1428670 | fardet_r00025770_s32_DaqStatus.raw                                | input     | 2017-04-01 02:07:44.375026-05 | 
 1428670 | fardet_r00025770_s32_daqstatus_S17-02-21_v1_data.artdaq.log.bz2   | log       | 2017-04-01 02:08:07.093811-05 | 
 1428670 | fardet_r00025770_s57_t05_S17-02-21_v1_data.artdaq.log.bz2         | log       | 2017-04-01 02:12:03.315091-05 | 
 1428670 | fardet_r00025771_s35_DaqStatus.raw                                | input     | 2017-04-01 02:41:18.405158-05 | 
 1428670 | fardet_r00025771_s35_daqstatus_S17-02-21_v1_data.artdaq.log.bz2   | log       | 2017-04-01 02:41:26.969502-05 | 
 1428670 | fardet_r00025771_s35_daqstatus_S17-02-21_v1_data.artdaq.root      | output    | 2017-04-01 02:41:20.613852-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025771_s45_ddsnews_S17-02-21_v1_data.artdaq.log.bz2     | log       | 2017-04-01 02:17:54.858054-05 | 
 1428670 | fardet_r00025771_s45_ddsnews_S17-02-21_v1_data.artdaq.root        | output    | 2017-04-01 02:17:48.318734-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025773_s27_DDsnews.raw                                  | input     | 2017-04-01 02:37:26.696232-05 | 
 1428670 | fardet_r00025773_s27_ddsnews_S17-02-21_v1_data.artdaq.log.bz2     | log       | 2017-04-01 02:38:01.678549-05 | 
 1428670 | fardet_r00025773_s27_ddsnews_S17-02-21_v1_data.artdaq.root        | output    | 2017-04-01 02:37:43.259168-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025773_s44_DDSun.raw                                    | input     | 2017-04-01 02:08:07.097101-05 | 
 1428670 | fardet_r00025773_s44_ddsun_S17-02-21_v1_data.artdaq.root          | output    | 2017-04-01 02:10:05.595465-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025775_s23_ddsnews_S17-02-21_v1_data.artdaq.log.bz2     | log       | 2017-04-01 02:19:22.466537-05 | 
 1428670 | fardet_r00025776_s02_DDslowmono.raw                               | input     | 2017-04-01 02:31:57.322435-05 | 
 1428670 | fardet_r00025776_s02_ddslowmono_S17-02-21_v1_data.artdaq.log.bz2  | log       | 2017-04-01 02:37:22.956597-05 | 
 1428670 | fardet_r00025776_s02_ddslowmono_S17-02-21_v1_data.artdaq.root     | output    | 2017-04-01 02:36:44.164921-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025776_s08_ddcontained_S17-02-21_v1_data.artdaq.log.bz2 | log       | 2017-04-01 02:31:53.983702-05 | 
 1428670 | fardet_r00025776_s08_ddcontained_S17-02-21_v1_data.artdaq.root    | output    | 2017-04-01 02:31:26.970422-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025776_s51_t05.raw                                      | input     | 2017-04-01 02:14:02.567417-05 | 
 1428670 | fardet_r00025776_s58_ddupmu_S17-02-21_v1_data.artdaq.log.bz2      | log       | 2017-04-01 02:24:06.691844-05 | 
 1428670 | fardet_r00025777_s10_ddenergy_S17-02-21_v1_data.artdaq.root       | output    | 2017-04-01 02:22:17.577415-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025777_s14_ddfastmono_S17-02-21_v1_data.artdaq.root     | output    | 2017-04-01 02:23:23.053015-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025777_s18_ddslowmono_S17-02-21_v1_data.artdaq.log.bz2  | log       | 2017-04-01 02:29:19.181997-05 | 
 1428670 | fardet_r00025777_s18_ddslowmono_S17-02-21_v1_data.artdaq.root     | output    | 2017-04-01 02:28:57.101057-05 | 2017-04-01 02:45:55.801477-05
 1428670 | fardet_r00025777_s35_t05_S17-02-21_v1_data.artdaq.log.bz2         | log       | 2017-04-01 02:40:43.693267-05 | 
 1428670 | log.txt.bz2                                                       | log       | 2017-04-01 02:08:07.094284-05 | 
(31 rows)

pomsprd=> \q

All of the output files have declared dates, but the job is still listed as Completed.

So doing a little digging:

pomsprd=> select count(*) from jobs where status = 'Completed' and (select count(*) from job_files where job_files.job_id = jobs.job_id and job_files.file_type = 'output' and job_files.declared is null) = 0;
 count  
--------
 137001
(1 row)
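
One caveat when reading that count: the condition "zero undeclared output files" is also satisfied by Completed jobs that have no output rows in job_files at all, so the figure is an upper bound on jobs whose outputs are all declared. The following is a minimal sketch of how the two cases could be separated. It uses Python's sqlite3 module and a toy copy of the two tables; the reduced column set, the made-up rows, and the helper names `count_completed` and `make_toy_db` are assumptions for illustration, not part of POMS.

```python
import sqlite3

def count_completed(conn):
    """Return (jobs matching the ticket's query, those that also have an output file)."""
    base = """
        SELECT count(*) FROM jobs
        WHERE status = 'Completed'
          AND (SELECT count(*) FROM job_files
               WHERE job_files.job_id = jobs.job_id
                 AND job_files.file_type = 'output'
                 AND job_files.declared IS NULL) = 0
    """
    # Extra condition: the job must have at least one output file row.
    with_outputs = base + """
          AND EXISTS (SELECT 1 FROM job_files
                      WHERE job_files.job_id = jobs.job_id
                        AND job_files.file_type = 'output')
    """
    return (conn.execute(base).fetchone()[0],
            conn.execute(with_outputs).fetchone()[0])

def make_toy_db():
    """Build an in-memory database with three hypothetical Completed jobs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE job_files (job_id INTEGER, file_name TEXT,
                                file_type TEXT, declared TEXT);
        INSERT INTO jobs VALUES (1, 'Completed'), (2, 'Completed'), (3, 'Completed');
        -- job 1: its only output file is declared
        INSERT INTO job_files VALUES (1, 'a.root', 'output', '2017-04-01');
        INSERT INTO job_files VALUES (1, 'a.log.bz2', 'log', NULL);
        -- job 2: its output file is still undeclared
        INSERT INTO job_files VALUES (2, 'b.root', 'output', NULL);
        -- job 3: only a log file, no output rows at all
        INSERT INTO job_files VALUES (3, 'log.txt.bz2', 'log', NULL);
    """)
    return conn
```

On the toy data, jobs 1 and 3 match the original condition, but only job 1 actually has declared output files.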


It looks like we have quite a few such jobs. I think I need to put an updater into one of our periodic cleanup calls, like wrapup_tasks... Ooh! We have one already; it's just missing the "declared is not null" bit. Sigh.
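
As a rough illustration of what such an updater might look like (the actual wrapup_tasks code is not shown in this ticket), here is a sketch using Python's sqlite3 module. It assumes that 'Located' is the status a Completed job should move to once every one of its output files has a declared date; the function name `mark_located` and the status value are assumptions taken from the ticket title, not from the POMS source.

```python
import sqlite3

# Assumption: a Completed job becomes 'Located' when it has at least one
# output file with a declared date and no output file that is undeclared.
MARK_LOCATED = """
    UPDATE jobs
    SET status = 'Located'
    WHERE status = 'Completed'
      AND EXISTS (SELECT 1 FROM job_files f
                  WHERE f.job_id = jobs.job_id
                    AND f.file_type = 'output'
                    AND f.declared IS NOT NULL)
      AND NOT EXISTS (SELECT 1 FROM job_files f
                      WHERE f.job_id = jobs.job_id
                        AND f.file_type = 'output'
                        AND f.declared IS NULL)
"""

def mark_located(conn):
    """Promote fully declared Completed jobs; return the number of rows updated."""
    cur = conn.execute(MARK_LOCATED)
    conn.commit()
    return cur.rowcount
```

Requiring at least one declared output keeps jobs that produced only log files from being promoted, which may or may not be the behavior wanted in production.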

#3 Updated by Marc Mengel over 2 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

#4 Updated by Anna Mazzacane over 2 years ago

  • Status changed from Resolved to Closed

