Project

General

Profile

Bug #4371

factory log rotation appears strange

Added by John Weigand about 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
High
Category:
Factory
Target version:
Start date:
07/19/2013
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

This is occurring in the current master branch.
I will test the v2plus branch next.
This may or may not be related to Bug #4370 frontend log rotation is naming files incorrectly

It appears to have different behavior which is even stranger.

glideinWMS.xml
--------------

 <log_retention>
      <condor_logs max_days="14.0" max_mbytes="100.0" min_days="3.0"/>
      <job_logs max_days="7.0" max_mbytes="100.0" min_days="2.0"/>
      <summary_logs max_days="31.0" max_mbytes="100.0" min_days="3.0"/>
      <process_logs>
         <process_log extension="info" max_days="7.0" max_mbytes="100.0" min_days="3.0" msg_types="INFO,ERR,EXCEPTION,WARN"/>
      </process_logs>
   </log_retention>

Factory logs
-------------
Cannot tell from this, as yet, if the same problem with rotated log file
names are the same as the factory as the min_days has not been hit as yet.

[weigand@cms-xen21 glidein_master]$ ls -l *
entry_ress_ITB_INSTALL_TEST_1:
total 120
(completed_jobs logs looked fine)
  :
(removed the condor_activity logs as they looked fine)
  :
-rw-r--r-- 1 weigand weigand   523 Jul 19 13:10ress_ITB_INSTALL_TEST_1.info.log
-rw-r--r-- 1 weigand weigand  1097 Jul 17 14:08 ress_ITB_INSTALL_TEST_1.info.log.2013-07-17
-rw-r--r-- 1 weigand weigand  1111 Jul 19 13:19 ress_ITB_INSTALL_TEST_1.info.log.2013-07-18

factory:
(removed these logs as they looked fine)

In the entry log directory, the log files appear to be overwritten with each iteration of
the factory. In addition, some data goes in the *.info.log and some in the *.info.log.2013-07-18 log.

[weigand@cms-xen21 entry_ress_ITB_INSTALL_TEST_1]$ cat ress_ITB_INSTALL_TEST_1.info.log

[2013-07-19 13:10:13,002] INFO: Computing log_stats diff for ress_ITB_INSTALL_TEST_1
[2013-07-19 13:10:13,005] INFO: log_stats diff computed
[2013-07-19 13:10:13,005] INFO: Writing log_stats for ress_ITB_INSTALL_TEST_1
[2013-07-19 13:10:13,012] INFO: log_stats written
[2013-07-19 13:10:13,012] INFO: Writing qc_stats for ress_ITB_INSTALL_TEST_1
[2013-07-19 13:10:13,015] INFO: qc_stats written
[2013-07-19 13:10:13,015] INFO: Writing rrd_stats for ress_ITB_INSTALL_TEST_1
[2013-07-19 13:10:13,340] INFO: rrd_stats written

[weigand@cms-xen21 entry_ress_ITB_INSTALL_TEST_1]$ cat ress_ITB_INSTALL_TEST_1.info.log.2013-07-18

[2013-07-19 13:10:12,921] INFO: Checking downtime for frontend frontend_service security class: frontend (entry ress_ITB_INSTALL_TEST_1).
[2013-07-19 13:10:12,923] INFO: Client frontend_service-master.main (secid: frontend_service_frontend) requesting 10 glideins, max running 18, remove excess 'NO'
[2013-07-19 13:10:12,924] INFO:   Params: {'CONDOR_VERSION': 'default', u'GLIDECLIENT_ReqNode': u'cms-xen21.fnal.gov:9625', u'GLIDECLIENT_Rank': u'1', 'GLIDEIN_Report_Failed': 'NEVER', 'CONDOR_ARCH': 'default', u'USE_MATCH_AUTH': u'True', 'CONDOR_OS': 'default', u'GLIDEIN_Collector': u'cms-xen21.fnal.gov:9641-9650'}
[2013-07-19 13:10:12,924] INFO:   Decrypted Param Names: [u'SubmitProxy', u'SecurityName', u'SecurityClass']
[2013-07-19 13:10:12,927] INFO: Client frontend_service-master.main (secid: frontend_service_frontend) schedd status {1: 10, 1002: 10}
[2013-07-19 13:10:12,930] INFO: Using v3+ protocol and credential 885993
[2013-07-19 13:10:12,930] INFO: Have enough glideins: idle=10 req_idle=10, not submitting
[2013-07-19 13:10:12,931] INFO: Sanitizing glideins for entry ress_ITB_INSTALL_TEST_1

John Weigand

History

#1 Updated by John Weigand about 7 years ago

  • Target version changed from v3_x to v3_1

In addition, I can confirm that the factory log rotation naming has the same problem as seen in frontend bug #4370.

[weigand@cms-xen21 glidein_master]$ ls -l factory
total 9028
-rw-r--r-- 1 weigand weigand 2195077 Jul 22 08:46 factory.info.log
-rw-r--r-- 1 weigand weigand     423 Jul 17 13:15 factoryadmin.info.log
-rw-r--r-- 1 weigand weigand  716596 Jul 22 08:45 group_0.info.log
-rw-r--r-- 1 weigand weigand 3145674 Jul 18 23:15 group_0.info.log.2013-07-18
-rw-r--r-- 1 weigand weigand 3145693 Jul 21 17:37 group_0.info.log.2013-07-25

One difference between the frontend (#4370) and this is that with a 3 day threshold, the factory log rotation appears to be working at 3 days intervals (sort of) where the frontend rotation appears is not. HOWEVER... I am thinking that maybe this is a function of how the interval is determined. The date on the 07-25 log is not quite 3 days but IS close to the same size as the previous. Just a random thought.

John Weigand

#2 Updated by Parag Mhashilkar about 7 years ago

  • Assignee set to Parag Mhashilkar

#3 Updated by Parag Mhashilkar about 7 years ago

  • Assignee changed from Parag Mhashilkar to Anthony Tiradani

#4 Updated by Anthony Tiradani about 7 years ago

We now inherit directly from BaseRotatingHandler rather than TimedRotatingFileHandler since we are overriding all of the functions specified in the TimedRotatingFileHandler anyway.

init function:
  • The rolloverAt timestamp is calculated according to the python2.7 bug fix rules now:
    if os.path.exists(filename):
        begin_interval_time = os.stat(filename)[stat.ST_MTIME]
    else:
        begin_interval_time = int(time.time())
    self.rolloverAt = begin_interval_time + self.interval
    

shouldRollover function:
We now do away with the minimum days check. The file suffix includes hours and minutes to avoid rollover naming collisions. If you are rolling over logs in less than a minute, either you have a serious problem with your install or you need to adjust your max MB setting.

getFilesToDelete function:
This is a direct copy from the python 2.7 module. It removes some hard-coded assumptions about the file naming structure.

doRollover function:
This is a modified version of the function from the python 2.7 module. Based on feedback from the UCSD admins, the timestamp expected in the file name is NOT the begining of the interval (which the logging module assumes), but the time at which the log file was rotated. A bunch of unused functionality was removed in favor of readablility.

_open_new_log function:
This is a new helper function that bridges the gap between the differences in the way that the python 2.4 and python 2.7 modules opened new log files.

Unittests have been written to test the functionality.

#5 Updated by Anthony Tiradani about 7 years ago

  • Status changed from New to Feedback
  • Assignee changed from Anthony Tiradani to Parag Mhashilkar

Ready for review and testing

#6 Updated by John Weigand about 7 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from Parag Mhashilkar to Anthony Tiradani

Tested using v3_1_rc4 starting on 5/27 around 09:00.
Still does not look good.

These are the results for the factory.
cms-xen21:/opt/glideins/git-xen21-master-ps.ini/factory/logs/glidein_master

entry_ress_ITB_INSTALL_TEST_1:
total 296
-rw-r--r-- 1 weigand weigand   9947 Jul 27 23:51 completed_jobs_20130727.log
-rw-r--r-- 1 weigand weigand  16121 Jul 28 23:49 completed_jobs_20130728.log
-rw-r--r-- 1 weigand weigand   4802 Jul 29 07:31 completed_jobs_20130729.log
-rw-r--r-- 1 weigand weigand      0 Jul 29 07:35 ress_ITB_INSTALL_TEST_1.info.log
-rw-r--r-- 1 weigand weigand      0 Jul 29 07:31 ress_ITB_INSTALL_TEST_1.info.log.2013-07-29_07-31
-rw-r--r-- 1 weigand weigand      0 Jul 29 07:32 ress_ITB_INSTALL_TEST_1.info.log.2013-07-29_07-32
-rw-r--r-- 1 weigand weigand      0 Jul 29 07:33 ress_ITB_INSTALL_TEST_1.info.log.2013-07-29_07-33
-rw-r--r-- 1 weigand weigand      0 Jul 29 07:34 ress_ITB_INSTALL_TEST_1.info.log.2013-07-29_07-34
-rw-r--r-- 1 weigand weigand      0 Jul 29 07:35 ress_ITB_INSTALL_TEST_1.info.log.2013-07-29_07-35

factory:
total 2060
-rw-r--r-- 1 weigand weigand       0 Jul 27 22:51 factory.info.log
-rw-r--r-- 1 weigand weigand 1045064 Jul 27 22:51 factory.info.log.2013-07-27_22-51
-rw-r--r-- 1 weigand weigand     141 Jul 27 09:15 factoryadmin.info.log
-rw-r--r-- 1 weigand weigand       0 Jul 28 07:11 group_0.info.log
-rw-r--r-- 1 weigand weigand 1048555 Jul 28 07:11 group_0.info.log.2013-07-28_07-11

These are for the frontend.
cms-xen21:/opt/glideins/git-xen21-master-ps.ini/frontend/logs/frontend_frontend_service-master

frontend:
total 2044
-rw-r--r-- 1 weigand weigand       0 Jul 27 09:17 frontend.err.log
-rw-r--r-- 1 weigand weigand       0 Jul 27 17:42 frontend.info.log
-rw-r--r-- 1 weigand weigand 1046618 Jul 27 17:42 frontend.info.log.2013-07-27_17-42
-rw-r--r-- 1 weigand weigand       0 Jul 27 17:49 frontend.warn.log
-rw-r--r-- 1 weigand weigand 1038755 Jul 27 17:48 frontend.warn.log.2013-07-27_17-49

group_main:
total 1044
-rw-r--r-- 1 weigand weigand       0 Jul 29 04:20 main.err.log
-rw-r--r-- 1 weigand weigand   12730 Jul 28 04:00 main.err.log.2013-07-28_10-56
-rw-r--r-- 1 weigand weigand       0 Jul 28 10:56 main.err.log.2013-07-28_17-35
-rw-r--r-- 1 weigand weigand       0 Jul 28 17:35 main.err.log.2013-07-29_02-06
-rw-r--r-- 1 weigand weigand       0 Jul 29 02:06 main.err.log.2013-07-29_03-44
-rw-r--r-- 1 weigand weigand       0 Jul 29 03:44 main.err.log.2013-07-29_04-20
-rw-r--r-- 1 weigand weigand       0 Jul 27 16:07 main.info.log
-rw-r--r-- 1 weigand weigand 1048404 Jul 27 16:07 main.info.log.2013-07-27_16-07
-rw-r--r-- 1 weigand weigand       0 Jul 27 09:17 main.warn.log

John Weigand

#7 Updated by John Weigand about 7 years ago

Restarted the factory and frontend fresh with latest changes in my area on cms-xen21 at 7/29 14:56.
It is hard to tell on the factory as to the naming (timestamp on the rotated logs) but the strange behavior on the entry points is still there.

glideinWMS.xml

  <process_log extension="info" max_days="1.0" max_mbytes="1.0" min_days="1.0" msg_types="INFO,ERR,EXCEPTION,WARN"/>

/opt/glideins/git-xen21-master-ps.ini/factory/logs/glidein_master
[weigand@cms-xen21 glidein_master]$ ls -l *
entry_ress_ITB_INSTALL_TEST_1:
total 312
-rw-r--r-- 1 weigand weigand    523 Jul 30 08:20 ress_ITB_INSTALL_TEST_1.info.log
-rw-r--r-- 1 weigand weigand   1097 Jul 30 08:16 ress_ITB_INSTALL_TEST_1.info.log.2013-07-30_08-16
-rw-r--r-- 1 weigand weigand   1097 Jul 30 08:17 ress_ITB_INSTALL_TEST_1.info.log.2013-07-30_08-17
-rw-r--r-- 1 weigand weigand   1097 Jul 30 08:18 ress_ITB_INSTALL_TEST_1.info.log.2013-07-30_08-18
-rw-r--r-- 1 weigand weigand   1097 Jul 30 08:19 ress_ITB_INSTALL_TEST_1.info.log.2013-07-30_08-19
-rw-r--r-- 1 weigand weigand   1097 Jul 30 08:20 ress_ITB_INSTALL_TEST_1.info.log.2013-07-30_08-20

factory:
total 1076
-rw-r--r-- 1 weigand weigand 256087 Jul 30 08:20 factory.info.log
-rw-r--r-- 1 weigand weigand 835443 Jul 30 08:20 group_0.info.log

See #4370 for naming convention (on the timestamp) issue which is still not there.

John Weigand

#8 Updated by John Weigand about 7 years ago

  • Assignee changed from Anthony Tiradani to Parag Mhashilkar

The timestamp issue on rotated logs was resolved with #4370. Refer to that ticket for the results.
The changes to resolve that are documented in this ticket by Tony (see entry #4)

The strange log rotation (every minute) is the result of the performance improvements made to fork/exec a child for each entry.
- The parent process creates an instance of the logger.
- When a child entry is spawned it creates an instance of the logger as well.
- When the parent (every minute is my understanding)gets control again, it is still using its instance.

So everything goes along fine until the time for the 1st rotation. In this case it was set to rotate at 5000 bytes.

--------------------------------------
DOING ROLLOVER
msg: [2013-07-30 14:31:41,113] INFO: Computing log_stats diff for ress_ITB_INSTALL_TEST_1
self.stream: <open file u'.../ress_ITB_INSTALL_TEST_1.info.log', mode 'a' at 0x1f065dc8> self.stream.tell(): 4974 len(msg): 85
--------------------------------------
msg: [2013-07-30 14:31:41,116] INFO: log_stats diff computed
self.stream: <open file u'.../ress_ITB_INSTALL_TEST_1.info.log', mode 'w' at 0x1f06d0a8> self.stream.tell(): 85 len(msg): 56
... and on ..
--------------------------------------
msg: [2013-07-30 14:31:41,479] INFO: rrd_stats written
self.stream: <open file u'.../ress_ITB_INSTALL_TEST_1.info.log', mode 'w' at 0x1f06d0a8> self.stream.tell(): 473 len(msg): 50
--------------------------------------
DOING ROLLOVER
msg: [2013-07-30 14:32:41,611] INFO: Checking downtime for frontend frontend_service security class: frontend (entry ress_ITB_INSTALL_TEST_1).
self.stream: <open file u'.../ress_ITB_INSTALL_TEST_1.info.log', mode 'a' at 0x1f065dc8> self.stream.tell(): 4974 len(msg): 138
--------------------------------------
msg: [2013-07-30 14:32:41,612] INFO: Client frontend_service-master.main (secid: frontend_service_frontend) requesting 1 glideins, max running 4, remove excess 'NO'
self.stream: <open file u'.../ress_ITB_INSTALL_TEST_1.info.log', mode 'w' at 0x1f06d0a8> self.stream.tell(): 138 len(msg): 160

When the parent goes to write a log message (every minute), it is using the original object with the old log length (self.stream.tell()) and thinks it needs to rotate... and does.

This is my best guess at this point.

John Weigand

#9 Updated by John Weigand about 7 years ago

Ran tests over night using latest changes to master_4371.
Looks good.
Logs are rolling fine.
Timestamps reflect HH-MM when log is rolled.

John Weigand

#10 Updated by John Weigand about 7 years ago

  • Assignee changed from Parag Mhashilkar to Anthony Tiradani

#11 Updated by John Weigand about 7 years ago

Completed test of changes for
commit: 8017f187c179d58a461539e50e85180a59f77ce3
at 8/1 10:00

Looks good
John Weigand

#12 Updated by Parag Mhashilkar about 7 years ago

  • Status changed from Assigned to Closed

Tony implemented several fixes to address the issues, John tested it. Changes are now merged into master. Closing the ticket



Also available in: Atom PDF