Project

General

Profile

Support #16746

SPILLTIMEND bad entries since 2017/03/17

Added by Arthur Kreymer over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Start date:
06/01/2017
Due date:
07/12/2017
% Done:

100%

Estimated time:
5.00 h
Duration: 42

Description

There have been a few invalid SPILLTIMENDVLD entries
each week since March 2017, created Sunday around 05:35.

This causes problems with data processing, including Keepup.
Entries have been manually corrected each week.
We need to find and fix the root cause..

History

#1 Updated by Arthur Kreymer over 2 years ago

From: Cron Daemon <root@minos-nearline.fnal.gov>
Subject: Cron <rhatcher@minos-nearline> $HOME/check_minos_temp_db.sh
Date: May 14, 2017 at 5:50:25 AM CDT
To: <rhatcher@fnal.gov>

+-----------+---------------------+---------------------+---------------------+---------------------+
| SEQNO     | TIMESTART           | TIMEEND             | CREATIONDATE        | INSERTDATE          |
+-----------+---------------------+---------------------+---------------------+---------------------+
| 700047489 | 2033-05-18 03:33:20 | 1966-10-31 14:13:20 | 2033-05-18 03:33:20 | 2017-05-14 10:42:00 |
| 700047490 | 2033-05-18 03:33:20 | 1966-10-31 14:13:20 | 2033-05-18 03:33:20 | 2017-05-14 10:42:00 |
+-----------+---------------------+---------------------+---------------------+---------------------+
you might want to issue:
update  offline.SPILLTIMENDVLD set DETECTORMASK=0,SIMMASK=0,TASK=999 where TASK=3 and INSERTDATE > '2017-03-14' and ( TIMESTART
> '2033-05-17' or TIMEEND < '1970-00-00');

#2 Updated by Arthur Kreymer over 2 years ago

Date: Sun, 21 May 2017 12:24:53 -0500
From: Robert W Hatcher <rhatcher@fnal.gov>
To: Arthur E Kreymer <kreymer@fnal.gov>, minos_batch <minos_batch@fnal.gov>, Dipak Rimal <drimal@fnal.gov>,
    Jiyeon Han <jyhan@fnal.gov>, Donatella Torretta <torretta@fnal.gov>
Subject: Again with crazy SPILLTIMEND entry, re: check_minos_temp_db.sh

And again Sunday morning at about the same time:

On May 21, 2017, at 5:35 AM, Cron Daemon <root@minos-nearline.fnal.gov> wrote:

+-----------+---------------------+---------------------+---------------------+---------------------+
| SEQNO     | TIMESTART           | TIMEEND             | CREATIONDATE        | INSERTDATE          |
+-----------+---------------------+---------------------+---------------------+---------------------+
| 700047695 | 2033-05-18 03:33:20 | 1966-10-31 14:13:20 | 2033-05-18 03:33:20 | 2017-05-21 10:34:08 |
+-----------+---------------------+---------------------+---------------------+---------------------+
you might want to issue:
update  offline.SPILLTIMENDVLD set DETECTORMASK=0,SIMMASK=0,TASK=999 where TASK=3 and INSERTDATE > '2017-03-14' and ( TIMESTART
> '2033-05-17' or TIMEEND < '1970-00-00');

#3 Updated by Arthur Kreymer over 2 years ago

I have scheduled logging of minos-db1 connections tomorrow 06/04
every 5 seconds from 05:01 to 05:46.

LOG=/minos/data/web/database/topdb/spill
echo "/minos/app/home/mindata/topdb_spill minos-db1 ${LOG}" | at 05:01 Jun 4
job 601 at 2017-06-04 05:01

Logs will be under http://minos.fnal.gov/database/topdb/spill

A short test was done at
http://minos.fnal.gov/database/topdb/spill/2017/06/03/08.txt

Note that writer_old is still being used from minos-daq24-nd
We should change this to writer before moving to Mariadb.

#4 Updated by Arthur Kreymer over 2 years ago

The connections logs were taken as scheduled, see
http://minos.fnal.gov/database/topdb/spill/2017/06/04/05.txt

Several offline tables were locked around 22:27 Jun 3
Things started clearing up around 05:39:54.

A similar thing happened last week, May 27/28
http://minos.fnal.gov/database/topdb/minos-db1/2017/05/27/23.txt
and previous weeks since April 16
http://minos.fnal.gov/database/topdb/minos-db1/2017/05/27/23.txt

There were no such LOCK problems seen the previous week, April 8/9
http://minos.fnal.gov/database/topdb/minos-db1/2017/04/08/23.txt
http://minos.fnal.gov/database/topdb/minos-db1/2017/04/09/00.txt

Some interesting entries from the normal logs are :

http://minos.fnal.gov/database/topdb/minos-db1/2017/06/03/22.txt

Sat Jun  3 22:27:53 CDT 2017
Uptime: 1421967  Threads: 2  Questions: 135586737  Slow queries: 4405  Opens: 6704186  Flush tables: 1  Open tables: 512  Queries per second avg: 95.351
 27759459  writer_old  minos-daq24-nd.fnal.gov:49490   offline  Sleep    62                            
 27759461  writer      minosdatagpvm01.fnal.gov:47312           Query    0     init   show processlist 

Sat Jun  3 22:37:53 CDT 2017
Uptime: 1422567  Threads: 2  Questions: 135586812  Slow queries: 4406  Opens: 6704445  Flush tables: 1  Open tables: 512  Queries per second avg: 95.311
 27759462  mysql   localhost                       offline  Query    407   Writing to net  SELECT /*!40001 SQL_NO_CACHE */ * FROM `BEAMMONSPILL` 
 27759466  writer  minosdatagpvm01.fnal.gov:47370           Query    0     init            show processlist                                      

Sat Jun  3 22:47:53 CDT 2017
Uptime: 1423167  Threads: 3  Questions: 135586817  Slow queries: 4406  Opens: 6704446  Flush tables: 1  Open tables: 512  Queries per second avg: 95.271
 27759462  mysql       localhost                       offline  Query    1007  Writing to net                SELECT /*!40001 SQL_NO_CACHE */ * FROM `BEAMMONSPILL` 
 27759467  writer_old  minos-daq24-nd.fnal.gov:50634   offline  Query    361   Waiting for table level lock  delete from DBUSUBRUNSUMMARY where SEQNO=107280901    
 27759469  writer      minosdatagpvm01.fnal.gov:47430           Query    0     init                          show processlist                                      

http://minos.fnal.gov/database/topdb/minos-db1/2017/06/03/23.txt

Sat Jun  3 23:07:53 CDT 2017
Uptime: 1424368  Threads: 3  Questions: 135592031  Slow queries: 4406  Opens: 6704455  Flush tables: 1  Open tables: 512  Queries per second avg: 95.194
 27759462  mysql       localhost                       offline  Query    2208  Writing to net                SELECT /*!40001 SQL_NO_CACHE */ * FROM `BEAMMONSPILL` 
 27759467  writer_old  minos-daq24-nd.fnal.gov:50634   offline  Query    1562  Waiting for table level lock  delete from DBUSUBRUNSUMMARY where SEQNO=107280901    
 27760393  writer      minosdatagpvm01.fnal.gov:48008           Query    0     init                          show processlist                                      

Sat Jun  3 23:17:54 CDT 2017
Uptime: 1424968  Threads: 5  Questions: 135592400  Slow queries: 4406  Opens: 6704455  Flush tables: 1  Open tables: 512  Queries per second avg: 95.154
 27759462  mysql       localhost                       offline  Query    2808  Writing to net                   SELECT /*!40001 SQL_NO_CACHE */ * FROM `BEAMMONSPILL` 
 27759467  writer_old  minos-daq24-nd.fnal.gov:50634   offline  Query    2162  Waiting for table level lock     delete from DBUSUBRUNSUMMARY where SEQNO=107280901    
 27760457  writer      minos-nearline.fnal.gov:38098   offline  Query    461   Waiting for table metadata lock  LOCK TABLES GLOBALSEQNO WRITE, SPILLTIMENDVLD WRITE   
 27760459  reader      minos-nearline.fnal.gov:38116   offline  Execute  149   Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                   
 27760461  writer      minosdatagpvm01.fnal.gov:48720           Query    0     init                             show processlist                                      

http://minos.fnal.gov/database/topdb/minos-db1/2017/06/04/00.txt

Sun Jun  4 00:17:55 CDT 2017
Uptime: 1428569  Threads: 8  Questions: 135594867  Slow queries: 4409  Opens: 6704476  Flush tables: 1  Open tables: 512  Queries per second avg: 94.916
 27759462  mysql       localhost                       offline  Query    1066  Writing to net                   SELECT /*!40001 SQL_NO_CACHE */ * FROM `CALADCTOPES` 
 27759467  writer_old  minos-daq24-nd.fnal.gov:50634   offline  Query    5763  Waiting for table level lock     delete from DBUSUBRUNSUMMARY where SEQNO=107280901   
 27760457  writer      minos-nearline.fnal.gov:38098   offline  Query    4062  Waiting for table metadata lock  LOCK TABLES GLOBALSEQNO WRITE, SPILLTIMENDVLD WRITE  
 27760459  reader      minos-nearline.fnal.gov:38116   offline  Execute  3750  Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                  
 27761181  reader      minos-nearline.fnal.gov:38160   offline  Execute  2747  Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                  
 27761195  reader      minos-nearline.fnal.gov:38258   offline  Execute  770   Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                  
 27761198  writer      minos-nearline.fnal.gov:38266   offline  Execute  470   Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                  
 27761200  writer      minosdatagpvm01.fnal.gov:49098           Query    0     init                             show processlist                                     

Around 03:00 we start getting stuck reads from worker nodes

http://minos.fnal.gov/database/topdb/minos-db1/2017/06/04/03.txt

Sun Jun  4 03:37:58 CDT 2017
Uptime: 1440572  Threads: 73  Questions: 135597315  Slow queries: 4421  Opens: 6704553  Flush tables: 1  Open tables: 512  Queries per second avg: 94.127
 27759462  mysql       localhost                       offline  Query    7604   Writing to net                   SELECT /*!40001 SQL_NO_CACHE */ * FROM `DCS_HV`     
 27759467  writer_old  minos-daq24-nd.fnal.gov:50634   offline  Query    17766  Waiting for table level lock     delete from DBUSUBRUNSUMMARY where SEQNO=107280901  
 27760457  writer      minos-nearline.fnal.gov:38098   offline  Query    16065  Waiting for table metadata lock  LOCK TABLES GLOBALSEQNO WRITE, SPILLTIMENDVLD WRITE 
 27760459  reader      minos-nearline.fnal.gov:38116   offline  Execute  15753  Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                 
 27761181  reader      minos-nearline.fnal.gov:38160   offline  Execute  14750  Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                 
...
 27761332  reader      fnpc6000.fnal.gov:15166         offline  Execute  412    Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                 
 27761334  writer      minosdatagpvm01.fnal.gov:51182  offline  Execute  343    Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                 
 27761534  reader      fnpc9072.fnal.gov:65196         offline  Execute  280    Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                 
 27761535  reader      fnpc9038.fnal.gov:30178         offline  Execute  280    Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                 
 27761545  reader      fnpc5015.fnal.gov:22306         offline  Execute  280    Waiting for table metadata lock  Select * from GLOBALSEQNO where 1=0                 
...

#5 Updated by Arthur Kreymer over 2 years ago

Date: Sun, 4 Jun 2017 17:37:49 +0000
From: Robert W Hatcher <rhatcher@fnal.gov>
To: Arthur E Kreymer <kreymer@fnal.gov>, Donatella Torretta <torretta@fnal.gov>, Jiyeon Han <jyhan@fnal.gov>
Subject: SPILLTIMEND hiccup occurred later this week

After our discussion, and a fairly consistent 5:30-ish occurrence ... this week the it happened at 5:50am   [ugh]  

-robert

Begin forwarded message:

From: Cron Daemon <root@minos-nearline.fnal.gov>
Subject: Cron <rhatcher@minos-nearline> $HOME/check_minos_temp_db.sh
Date: June 4, 2017 at 5:50:21 AM CDT
To: <rhatcher@fnal.gov>

+-----------+---------------------+---------------------+---------------------+---------------------+
| SEQNO     | TIMESTART           | TIMEEND             | CREATIONDATE        | INSERTDATE          |
+-----------+---------------------+---------------------+---------------------+---------------------+
| 700048059 | 2033-05-18 03:33:20 | 1966-10-31 14:13:20 | 2033-05-18 03:33:20 | 2017-06-04 10:40:29 |
| 700048060 | 2033-05-18 03:33:20 | 1966-10-31 14:13:20 | 2033-05-18 03:33:20 | 2017-06-04 10:40:29 |
| 700048083 | 2033-05-18 03:33:20 | 1966-10-31 14:13:20 | 2033-05-18 03:33:20 | 2017-06-04 10:40:36 |
| 700048084 | 2033-05-18 03:33:20 | 1966-10-31 14:13:20 | 2033-05-18 03:33:20 | 2017-06-04 10:40:36 |
+-----------+---------------------+---------------------+---------------------+---------------------+
you might want to issue:
update  offline.SPILLTIMENDVLD set DETECTORMASK=0,SIMMASK=0,TASK=999 where TASK=3 and INSERTDATE > '2017-03-14' and ( TIMESTART
> '2033-05-17' or TIMEEND < '1970-00-00');

There up to 7 looters running Jun 5 at 05:39, 6 at 05:40.
I copied relevant prochistory files to
/minos/data/users/mindata/maint/dblock

The loot script needs to check that it is not already running.
We need to find out what locked the tables Saturday around 23:10:13
That time is that of the 23:17:51 process list minus the 461 seconds the LOCK command had run.

#6 Updated by Arthur Kreymer over 2 years ago

Since April 15, before the tables are locked around 23:10,
there is a query by local user mysql around 22:30, like

Sat Apr 15 22:48:54 CDT 2017
Uptime: 2644547  Threads: 4  Questions: 602882707  Slow queries: 10904  Opens: 89265828  Flush tables: 1  Open tables: 512  Queries per second avg: 227.972
 84283655  mysql       localhost                       offline  Query    1126  Writing to net                SELECT /*!40001 SQL_NO_CACHE */ * FROM `BEAMMONSPILL` 


See for example
http://minos.fnal.gov/database/topdb/minos-db1/2017/04/15/22.txt
http://minos.fnal.gov/database/topdb/minos-db1/2017/04/15/23.txt

This query comes in within about a minute of 22:30 each Saturday,
and is probably the root cause of our problems.

I created a couple of bash functions to calculate the query start time.

Here are the functions, and results :

datesec () 
{ 
    date -d "19700101 UTC + ${1} seconds" ${2} ${3} ${4}
}

seca() { TS=`date +%s -d "${1}"` ; (( TA = TS - ${2} )) ; datesec ${TA} ; }

Day                      query     last seen next day at
04/15 seca  22:38:54 526 22:30:08 05:49
04/22 seca  22:32:18 122 22:30:16 06:12
04/29 seca  22:35:50 336 22:30:14 05:45
05/06 seca  22:39:39 576 22:30:03 05:49
05/13 seca  22:32:54 171 22:30:03 05:33
05/20 seca  22:32:02 234 22:28:08 05:32
05/27 seca  22:35:00 285 22:30:15 05:25
06/03 seca  22:37:53 407 22:31:06 05:38


Do the DBAs understand what this Saturday 22:30 BEAMMONSPILL query is ?

What changed on April 15 ?

#7 Updated by Arthur Kreymer over 2 years ago

  • % Done changed from 10 to 70

I have deployed a revised TimeGoblinLooter/loot script
which prevents simultaneous running of loot.

    minos@minos-nearline

cd /minos/app/home/minos/TimeGoblinLooter

mv loot loot.20130724
ln -sf  loot.20170606 loot # was loot.20130724
date
Thu Jun  8 15:54:16 CDT 2017

I ran this by hand, and checked the log
${HOME}/log/TimeGoblinLooter/loot.201706.log

#8 Updated by Arthur Kreymer over 2 years ago

My guess is that the user mysql SELECT queries running
Sat 22:30 to Sun around 06:00 are weekly backups.

I do not know why these seem to have started locking tables around April 15.

Mariadb is likely to do backups differently.

I see the DBAs were not watchers of this Issue,
so did not see yesterday's updates to this Issue.

I have added Olga and Svetlana to the Watch list.

#10 Updated by Arthur Kreymer over 2 years ago

I have moved the 442 duplicate spilltime files to a DUPLICATE directory,
to help clean up the logs.

See details in https://cdcvs.fnal.gov/redmine/issues/16830

#11 Updated by Arthur Kreymer over 2 years ago

Sample activity from the Saturday 06/10 read access blockage

06/10 22:30:02 mysql  localhost  22:31:27 - 85
SELECT DISTINCT TABLESPACE_NAME, FILE_NAME, LOGFILE_GROUP_NAME, EXTENT_SIZE, INITIAL_SIZE,
  then
SELECT /*!40001 SQL_NO_CACHE */ * FROM `BEAMMONSPILL`

06/10 22:45 writer_old mios-daq24-nd  22:51:28 - 416
Waiting for table level lock  delete from DBUSUBRUNSUMMARY where SEQNO=107285322

06/10 23:10:09 writer minos_nearline:41390 23:11:23 - 74
LOCK TABLES GLOBALSEQNO WRITE, SPILLTIMENDVLD WRITE

Read access is denied because of the WRITE lock placed from minos-nearline.
A Mysql WRITE lock allows writes and blocks other processes from reading.
This happens weekly, typically for 7 hours.

There seem to be two things to understand
  1. The mysql activity being performed for 7 hours starting Saturday 22:30
  2. The issuance of a WRITE lock by writer from minos-nearline at 23:10

#13 Updated by Arthur Kreymer over 2 years ago

  • Due date changed from 07/04/2017 to 07/12/2017
  • Status changed from Work in progress to Resolved
  • % Done changed from 70 to 100

Full backups of the production offline database
seem to be suspended until migration to Mariadb.
They will be done differently under Mariadb.

Resolving this Issue.



Also available in: Atom PDF