Support #22262

Keepup stopped since 2019 March 18

Added by Arthur Kreymer 6 months ago. Updated 5 months ago.

Status: Work in progress
Priority: High
Start date: 04/02/2019
Due date: 04/04/2019
% Done: 100%
Estimated time: 2.00 h
Duration: 3

Description

Minos ND keepup stopped March 18, per email from Mateus Carneiro da Silva.

History

#1 Updated by Arthur Kreymer 6 months ago

Date: Tue, 2 Apr 2019 13:18:09 +0000
From: Mateus Carneiro <>
To: Arthur Kreymer <>
Subject: MINOS keepup


Hi Art,

 Sorry to bother, but I really hope this is the last time. Apparently data files have not been declared since March 18, although they
are being moved to the official pnfs area. I could not find errors in the predator logs, and I can't tell how to run it by
hand to see the ongoing problem. Would you be able to help me with this?

Best,
Mateus

Mateus F. Carneiro
Oregon State University - Physics Department
Postdoc Scholar @ Fermilab / MINERvA Experiment
1 630 518 5047 / 1 630 840 2387

#2 Updated by Arthur Kreymer 6 months ago

See logs in https://minos.fnal.gov/data/computing/dh/predator/

In https://minos.fnal.gov/data/computing/dh/predator/log/predator/2019-03.log
genpy processing time increased from 3 minutes to 20+ minutes
in the March 18 02:06 run

STARTED Mon Mar 18 00:06:01 UTC 2019
predator.20180319
Mon Mar 18 00:06:01 UTC 2019 genpy -w -l " -r R3.01.00 --32bit " neardet_data/2019-03
Mon Mar 18 00:09:34 UTC 2019 sadd neardet_data 2019-03
FINISHED Mon Mar 18 00:10:04 UTC 2019

STARTED Mon Mar 18 02:06:01 UTC 2019
predator.20180319
Mon Mar 18 02:06:01 UTC 2019 genpy -w -l " -r R3.01.00 --32bit " neardet_data/2019-03
Mon Mar 18 02:26:41 UTC 2019 sadd neardet_data 2019-03
Mon Mar 18 02:27:01 UTC 2019 saddcache
FINISHED Mon Mar 18 02:27:06 UTC 2019
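A quick way to spot runs like this is to pair up the STARTED/FINISHED lines and print each run's elapsed time. A minimal sketch, assuming GNU date and the log format shown above:

# print elapsed seconds for each STARTED/FINISHED pair in a predator log
curl -s https://minos.fnal.gov/data/computing/dh/predator/log/predator/2019-03.log |
grep -E '^(STARTED|FINISHED)' |
while read TAG STAMP ; do
  T=$(date -d "${STAMP}" +%s)
  if [ "${TAG}" = "STARTED" ] ; then START=${T}
  else echo "run of $(date -u -d @${START}) took $(( T - START )) s"
  fi
done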

#3 Updated by Arthur Kreymer 6 months ago

  • % Done changed from 10 to 60

Reviewing
https://minos.fnal.gov/data/computing/dh/predator/log/genpy/neardet_data/2019-03.log

STARTING Mon Mar 18 02:06:04 UTC 2019
Treating 1037 files
Scanning 8 files
N00077550_0010.mdaq.root Mon Mar 18 02:06:09 UTC 2019
58238325 bytes in 420153 us ( 138 MB/sec )
OOPS - run_dbu is stuck for 118, killing it
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 S 3648 25204 25191 0 80 0 - 26573 do_wai ? 00:00:00 run_dbu
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 S 3648 25221 25204 12 80 0 - 28676 sk_wai ? 00:00:14 dbu
kill 25221
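The "OOPS - run_dbu is stuck" message comes from the wrun_dbu wrapper, whose source is not shown here. The watchdog pattern it suggests is roughly the following sketch (the 118-second limit is taken from the message above, the dbu command line is copied from the process listing in note #8, and the bash details are illustrative):

# illustrative watchdog: run dbu on one file, kill it if it exceeds the time limit
LIMIT=118
dbu -bq /minos/app/home/mindata/predator/dbu_sampy.C ${FILE} &
PID=$!
SECONDS=0
while kill -0 ${PID} 2>/dev/null ; do
  if [ ${SECONDS} -ge ${LIMIT} ] ; then
    echo "OOPS - dbu is stuck for ${LIMIT}, killing it"
    ps -lf -p ${PID}
    kill ${PID}
    break
  fi
  sleep 5
done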

Testing file access on minos-data:

time sum /pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root
^C

real 1m53.849s
user 0m0.000s
sys 0m0.002s
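To avoid hanging an interactive session on a stuck NFS read, the same test can be bounded with coreutils timeout; a minimal sketch:

# give the read at most 30 seconds; a timeout here points at a stuck NFS mount
timeout 30 sum /pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root ||
  echo 'read did not complete within 30 s'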

FILE=/pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root

ls -l ${FILE}
-rw-rw-r-- 1 minosraw e875 58238325 Mar 17 19:06 /pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root

file ${FILE}
^C

Similar results on minos-nearline and minosgpvm04.

MINOSGPVM04 > file ${FILE}
/pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root: ERROR: cannot read `/pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root' (Input/output error)

Trying to access these files with Gridftp, using ifdh cp

minos
setup_minos

FILE=/pnfs/minos/neardet_data/2019-03/N00077550_0001.mdaq.root
ifdh cp ${FILE} /var/tmp/ifout
ls -l /var/tmp/ifout
-rw-r--r-- 1 kreymer e875 55863844 Apr 2 10:07 /var/tmp/ifout

FILE=/pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root
time ifdh cp ${FILE} /var/tmp/ifout

real 4m16.904s
user 0m0.042s
sys 0m0.149s

ls -l /var/tmp/ifout
-rw-r--r-- 1 kreymer e875 58238325 Apr 2 10:14 /var/tmp/ifout

file ${FILE}
/pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root: ROOT file Version 51600 (Compression: 1)

time sum ${FILE}
30017 56874

real 0m0.591s
user 0m0.121s
sys 0m0.043s

This is acting as though these files are being restored from tape.
They should be on disk at all times, in the RawDataWritePools group.
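If the dCache NFS door supports the usual dot commands, the file locality can be checked directly instead of being inferred from timing. A sketch, assuming the '.(get)(name)(locality)' command is available on this mount:

# ONLINE means the file is on a disk pool; NEARLINE means a tape restore is needed
DIR=/pnfs/minos/neardet_data/2019-03
cat "${DIR}/.(get)(N00077550_0010.mdaq.root)(locality)"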

#4 Updated by Arthur Kreymer 6 months ago

INC000001047044 04/02 Reboot minosdatagpvm01 and minos-nearline for PNFS NFS reads

Since March 18 at 02:06 on node minosdatagpvm01
a few Minos raw data files cannot be read via NFS.

For example, /pnfs/minos/neardet_data/2019-04/N00077646_0000.mdaq.root

This file can be read from other clients like minosgpvm04.
Different files are not readable on minos-nearline.

The minosdatagpvm01 issue is holding up keepup processing.

Please reboot minosdatagpvm01 and minos-nearline at your next convenience,
to restore PNFS NFS read access.

#5 Updated by Arthur Kreymer 6 months ago

minos-data was rebooted, and all 2019-03 and 04 files can be read via NFS.
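A simple way to confirm that across both months is to time a bounded read of every file; a minimal sketch using coreutils timeout:

# flag any raw data file that cannot be read via NFS within 60 seconds
for FILE in /pnfs/minos/neardet_data/2019-03/*.mdaq.root \
            /pnfs/minos/neardet_data/2019-04/*.mdaq.root ; do
  timeout 60 sum ${FILE} > /dev/null || echo "SLOW/UNREADABLE ${FILE}"
done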

But predator's genpy is still getting stuck running dbu.
No CPU time is being used by the dbu processes, which eventually are being killed.

We will have to investigate further.

I have set the flag to stop predator until this is fixed.

${HOME}/predator/predator stop
SETTING STOP FLAG
-rw-r--r-- 1 mindata e875 0 Apr 3 04:32 /minos/app/home/mindata/predator/LOG/predator/STOP
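The predator script itself is not reproduced here, but the flag file shown above implies a guard of roughly this form at the top of each cron-driven run (a hypothetical sketch):

# hypothetical guard: exit quietly if the STOP flag file exists
STOPFLAG=/minos/app/home/mindata/predator/LOG/predator/STOP
if [ -f "${STOPFLAG}" ] ; then
  echo "STOP flag set, skipping this run"
  exit 0
fi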

#6 Updated by Arthur Kreymer 6 months ago

On Thu, Apr 4, 2019, 7:10 PM Olga Vlasova <> wrote:

Hi Jorge,       
We haven’t scheduled maintenance; otherwise we certainly would have let you know.
The Minos databases are not down, they have been up and running.
The databases are in a stale mode, though.
 
mariadb-prd1> select count(*), state from INFORMATION_SCHEMA.PROCESSLIST group by state order by state;
+----------+---------------------------------+
| count(*) | state                           |
+----------+---------------------------------+
|      304 | checking permissions            |
|        1 | committed 10052101              |
|        1 | committed 10052104              |
|        1 | Filling schema table            |
|      551 | query end                       |
|        1 | Waiting for table metadata lock |
|        1 | wsrep aborter idle              |
+----------+---------------------------------+

mariadb-prd2> select count(*), state from INFORMATION_SCHEMA.PROCESSLIST group by state order by state;
+----------+---------------------------------+
| count(*) | state                           |
+----------+---------------------------------+
|      297 | checking permissions            |
|        1 | committed 10049492              |
|        1 | Filling schema table            |
|      542 | query end                       |
|        1 | Waiting for table metadata lock |
|        1 | wsrep aborter idle              |
+----------+---------------------------------+

mariadb-prd3> select count(*), state from INFORMATION_SCHEMA.PROCESSLIST group by state order by state;
+----------+----------------------+
| count(*) | state                |
+----------+----------------------+
|      261 | checking permissions |
|        1 | committed 10052132   |
|        1 | committed 10052133   |
|        1 | Filling schema table |
|      603 | query end            |
|        1 | wsrep aborter idle   |
+----------+----------------------+
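With hundreds of sessions stuck in "query end" and "checking permissions", the natural follow-up is to look for the oldest active session, which is usually the one holding the metadata lock. A sketch of that query, wrapped in the mysql client (the host is taken from the prompt above; the user and credentials are placeholders):

# list the longest-running non-idle sessions on one node
mysql -h mariadb-prd1 -u admin -p -e "
  SELECT id, user, host, time, state, LEFT(info,60) AS query
  FROM INFORMATION_SCHEMA.PROCESSLIST
  WHERE command <> 'Sleep'
  ORDER BY time DESC
  LIMIT 20;"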
We had a similar situation in September 2018 (see INC000000992921),
and most likely we will need to restart all three nodes in order to fix this issue.
We need sysadmin assistance to bring the servers down, and approval from the Minos/MINERvA team
for the downtime.

Adding Art Kreymer to this email exchange.

Thanks,
Olga.

#7 Updated by Thomas Carroll 6 months ago

From discussions with Mateus and Deepika of Minerva, I think there are two pieces of information that might be missing so far.

1) In the above email from Olga to Jorge, mariadb-prd servers are mentioned. Does keepup use mariadb-prd or mariadb-dev?

2) When the beam was switched to FHC (neutrino mode), the MINOS near detector had its coil current reversed "manually". It is my understanding that, as far as ACNET is concerned, the MINOS near detector's coil current is still set for an antineutrino beam. Could such a mismatch prevent keepup from running properly?

#8 Updated by Arthur Kreymer 6 months ago

  • % Done changed from 60 to 80
  • Status changed from Assigned to Work in progress

After the database restart at 13:47 today,
I tested dbu as was done previously on 2018/10/24

mindata@minos-data

cd /minos/app/users/mindata/maint/predator/

# run test, with log to testdbu-db1.log

./testdbu-db1
mv testdbu-db1.log testdbu-db1.20190405.log

The log and interactive output look good now.
RESTARTED PREDATOR AND STARTED RUNNING ON MARCH DATA

${HOME}/predator/predator start
-rw-r--r-- 1 mindata e875 0 Apr 3 04:32 /minos/app/home/mindata/predator/LOG/predator/STOP
CLEARED PREDATOR STOP FLAG at Sat Apr 6 01:19:48 UTC 2019

set nohup ; ${HOME}/predator/predator 2019-03 &
[1] 2614

This is working, dbu is using CPU time now.

DATA > ps xf
PID TTY STAT TIME COMMAND
18832 ? S 0:00 sshd: mindata@pts/1
18833 pts/1 Ss 0:00 \_ -bash
2614 pts/1 S 0:00 \_ /bin/sh /minos/app/home/mindata/predator/predator 2019-03
2637 pts/1 S 0:03 | \_ /bin/sh /minos/app/home/mindata/predator/genpy -w -l -r R3.01.00 --32bit neardet_data/2019-03
13636 pts/1 S 0:00 | \_ /bin/sh /minos/app/home/mindata/predator/wrun_dbu /pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root 118
13649 pts/1 S 0:00 | \_ /bin/sh /minos/app/home/mindata/predator/run_dbu /pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq.root mi
13669 pts/1 R 0:03 | | \_ dbu -bq /minos/app/home/mindata/predator/dbu_sampy.C /pnfs/minos/neardet_data/2019-03/N00077550_0010.mdaq
13707 pts/1 S 0:00 | \_ sleep 5

It could take up to a day to catch up with the backlog.
Keep an eye on the logs in 
https://minos.fnal.gov/data/computing/dh/predator/log/predator/
As Tom notes, there still may be an issue with the coil direction.
I see reconstructed output from data after the switch,
as recently as N00077544_0000 on March 16.
Someone should check that the field was correct for that reco.
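One way to keep an eye on the catch-up without staying logged in is to poll the web copy of the log; a minimal sketch:

# refresh the tail of the current predator log every 5 minutes
watch -n 300 'curl -s https://minos.fnal.gov/data/computing/dh/predator/log/predator/2019-03.log | tail -n 20'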

#9 Updated by Arthur Kreymer 6 months ago

  • % Done changed from 80 to 90

March 2019 data was declared to SAM around 12:32 UTC.
Catch-up processing of this data will need to be run.

The next Predator run should pick up the April data within a couple of hours.
I will check around noon today.
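To double-check that the declarations covered the whole month, each raw file can be looked up in SAM by name; a sketch, assuming the standard samweb client is configured for the minos experiment:

# report any March raw data file that SAM does not know about
for FILE in /pnfs/minos/neardet_data/2019-03/*.mdaq.root ; do
  samweb -e minos get-metadata $(basename ${FILE}) > /dev/null 2>&1 ||
    echo "NOT DECLARED: $(basename ${FILE})"
done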

#10 Updated by Arthur Kreymer 6 months ago

  • % Done changed from 90 to 100

Predator is up to date, having run on the 2019-04 data via the usual crontab entry.
All data has been declared to SAM.
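The crontab entry itself is not reproduced in this ticket; a hypothetical sketch, with the schedule inferred from the 00:06 and 02:06 run times in the logs above and the path taken from the process listing in note #8:

# hypothetical mindata crontab entry for the 2-hourly predator run
06 */2 * * * ${HOME}/predator/predator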

https://minos.fnal.gov/data/computing/dh/predator/log/predator/2019-04.log

https://minos.fnal.gov/data/computing/dh/predator/log/samadd/neardet_data/2019-04.log

Keepup should run tonight.

I suggest that we verify that Keepup ran, then close this Issue.
We should open a separate Issue if there are problems with data monitoring.

#11 Updated by Arthur Kreymer 6 months ago

A few keepup output files have appeared, but some jobs are failing due to bfield database issues.

Experts will need to investigate.
I have added Donatella and Robert to the watch list of this Issue.

#12 Updated by Arthur Kreymer 5 months ago

What is the desired state of Minos ND keepup and the Data Quality plots?
The detector is still operating.

On Fri, May 3, 2019 at 7:07 AM Arthur E Kreymer <> wrote:
  Minervans -

I see that Minos Near Detector data continues to be taken in
/pnfs/minos/neardet_data/2019-05.
But the latest sntp file is dated Apr 16 in
/pnfs/minos/reco_near/elm6/sntp_data/2019-04
Is keepup still running?
-----------
Date: Fri, 3 May 2019 11:40:21 -0500
From: Mateus Carneiro <>

Hi Art,

 I was informed that we were shutting off the MINOS detector and that the keepup should not be necessary any more.
 I confess I'm not aware of the current operations situation, so I did not take action on turning anything off, but since I was
officially informed I have not been monitoring the keepup any longer.

Best,
Mateus

Mateus F. Carneiro
Oregon State University - Physics Department
Postdoc Scholar @ Fermilab / MINERvA Experiment
1 630 518 5047 / 1 630 840 2387


