Project

General

Profile

Bug #12290

joblog scraper is missing input files sometimes

Added by Marc Mengel over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Target version:
Start date:
04/13/2016
Due date:
% Done:

100%

Estimated time:
First Occurred:
Scope:
Internal
Experiment:
-
Stakeholders:
Duration:

Description

Note for example lariat jobs -- need to check if update_job was called for them, or if we somehow lost the message...


Related issues

Is duplicate of Production Operations Management Service (POMS) - Bug #12732: joblog scraper losing messagesClosed05/19/2016

History

#1 Updated by Marc Mengel over 4 years ago

See for example:
https://pomsgpvm01.fnal.gov/poms/job_table?job_status=Completed&task_id=11445

joblog has:

2016-04-13T00:09:58-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0001.root lariat_r008989_sr0001.root);
2016-04-13T00:09:59-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: ifdh cp: transferred: 66935355 bytes in 7.73193 seconds
2016-04-13T00:09:59-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0002.root lariat_r008989_sr0002.root);
2016-04-13T00:10:00-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: ifdh cp: transferred: 67478044 bytes in 7.52841 seconds
2016-04-13T00:10:00-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0003.root lariat_r008989_sr0003.root);
2016-04-13T00:10:00-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: ifdh cp: transferred: 47822200 bytes in 5.93793 seconds
2016-04-13T00:10:00-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0004.root lariat_r008989_sr0004.root);
2016-04-13T00:10:01-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: ifdh cp: transferred: 39137298 bytes in 5.37007 seconds
2016-04-13T00:10:01-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0005.root lariat_r008989_sr0005.root);
2016-04-13T00:10:02-05:00 fnpc2090.fnal.gov lariatpro/lariat: 11445/9003088.0@fifebatch2.fnal.gov/v1_8_9/lariat[11441]: ifdh: ifdh cp: transferred: 73300194 bytes in 2.31078 seconds
...

all the way up to 016... but we only got the first 3 in the job table

#2 Updated by Marc Mengel over 4 years ago

Okay, so chasing further, sticking with this particular case, I see only the three input file reports in the webservice log on pomsgpvm01:

[poms@pomsgpvm01 poms]$ grep 'saw input.*008989' error_log.3
[13/Apr/2016:00:10:00]  saw input_file_names: lariat_r008989_sr0001.root
[13/Apr/2016:00:10:04]  saw input_file_names: lariat_r008989_sr0002.root
[13/Apr/2016:00:10:07]  saw input_file_names: lariat_r008989_sr0003.root

And the joblog scraper reports only those three copies as something it saw:
[poms@pomsgpvm01 poms]$ grep "::cp.*008989" poms_joblog_scraper.log | grep -v '[-]D'
report_item: message: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0001.root lariat_r008989_sr0001.root);
looking for input/output files in:  starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0001.root lariat_r008989_sr0001.root);
report_item: message: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0002.root lariat_r008989_sr0002.root);
looking for input/output files in:  starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0002.root lariat_r008989_sr0002.root);
report_item: message: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0003.root lariat_r008989_sr0003.root);
looking for input/output files in:  starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0003.root lariat_r008989_sr0003.root);

So since the copies are logged on fifelog, I conclude we must be losing them
in the transfer from fifelog -> pomsgpvm01. In fact, when I go check fermicloud045,
the development box, his joblog_scraper only saw one of the input files:

poms@fermicloud045 poms]$ grep "::cp.*008989" poms_joblog_scraper.log | grep -v '[-]D'
report_item: message: starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0001.root lariat_r008989_sr0001.Traceback (most recent call last):
looking for input/output files in:  starting ifdh::cp( /pnfs/lariat/raw/002/130/lariat_r008989_sr0001.root lariat_r008989_sr0001.root);

And, in fact, he got an error while reporting it... sigh. (fixing that...)

Okay, so currently we're using UDP (?) to forward from fifelog to pomsgpvm01 and to fermicloud045; we should change those links to TCP, I think. Then we'll see if our
reporting rate improves.

Of course, the other possiblity is that our log scraper is not keeping up with the rate of messages coming in; and rsyslogd is ending up throwing messages on the floor 'cause the named pipe would otherwise block, and we need to kick off our updates in other threads or some such... In fact that's the more likely possibilty.

So the first push then is to make the job_reporter asynchronous.
First stab: 19ce93f7b944a4

#3 Updated by Marc Mengel over 4 years ago

  • Is duplicate of Bug #12732: joblog scraper losing messages added

#4 Updated by Marc Mengel over 4 years ago

  • Status changed from Assigned to Resolved
  • Target version set to b0_5

#5 Updated by Marc Mengel over 4 years ago

  • % Done changed from 0 to 100

#6 Updated by Marc Mengel over 4 years ago

  • Status changed from Resolved to Closed


Also available in: Atom PDF