Support #8404

Create check_mk sensor for glideinwms frontend that checks logfiles

Added by Anthony Tiradani about 4 years ago.

Status: New
Priority: Normal
Assignee: -
Start date: 04/23/2015
Due date:
% Done: 0%
Estimated time:
component: glideinWMS Frontend
Scope: Internal
Experiment: -
Stakeholders:
Co-Assignees:
Categorization: -
Duration:

Description

Gerard reports:

Looking for something else, I realized we have silent errors on the fifebatchhead frontends (which are actually hiding real issues).

We should have check_mk alarms that let us know about these things. My proposal is to make a check that looks at /var/log/gwms-frontend/*/*.err.log: if there are more than X (=2?) errors within the last hour, trigger a Warning; if there are more than 10 errors, trigger a Critical. We can tune this better down the road, but we need something that lets us know when there are issues on the GWMS frontends.
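A minimal sketch of what such a check could look like, written as a Check_MK local check in Python. The log path and the thresholds come from the proposal above; everything else is an assumption made for illustration: the service name GWMS_Frontend_Errors is made up, the script assumes every error entry starts with a "[YYYY-MM-DD HH:MM:SS" timestamp (the log excerpt in this ticket does not show one, so the pattern has to be checked against the real files), and it sums errors across all groups rather than alarming per group.

```python
#!/usr/bin/env python3
"""Sketch of a Check_MK local check that counts recent GWMS frontend errors."""
import glob
import re
from datetime import datetime, timedelta

LOG_GLOB = "/var/log/gwms-frontend/*/*.err.log"  # path taken from the ticket
WARN_ABOVE = 2    # proposed X (=2?), expected to be tuned later
CRIT_ABOVE = 10   # proposed critical threshold
WINDOW = timedelta(hours=1)

# ASSUMPTION: each error entry begins with "[YYYY-MM-DD HH:MM:SS".
# Verify this against the real log files before relying on it.
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def count_recent_errors(lines, now, window=WINDOW):
    """Count timestamped entries that are no older than now - window."""
    cutoff = now - window
    count = 0
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue  # traceback continuation lines carry no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if stamp >= cutoff:
            count += 1
    return count


def classify(count, warn_above=WARN_ABOVE, crit_above=CRIT_ABOVE):
    """Map an error count to a Check_MK state: 0=OK, 1=WARN, 2=CRIT."""
    if count > crit_above:
        return 2
    if count > warn_above:
        return 1
    return 0


def local_check_line(count):
    """Format one local-check line: <state> <service> <perfdata> <text>."""
    state = classify(count)
    label = ("OK", "WARNING", "CRITICAL")[state]
    return "%d GWMS_Frontend_Errors errors=%d;%d;%d %s - %d errors in the last hour" % (
        state, count, WARN_ABOVE, CRIT_ABOVE, label, count)


def main():
    now = datetime.now()
    total = 0
    for path in glob.glob(LOG_GLOB):
        with open(path, errors="replace") as handle:
            total += count_recent_errors(handle, now)
    print(local_check_line(total))


if __name__ == "__main__":
    main()
```

Dropped into the agent's local check directory, a script like this would report one service for the whole host. Counting per group instead would only require emitting one line per log file with a distinct service name.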

Poking around in these errors, I verified that most of them are due to hung condor schedds, maybe from large job submissions.

[root@fifebatchgpvmhead1 ~]# ll /var/log/gwms-frontend/group_*/FNAL_fermilab.err.log
-rw-r--r-- 1 frontend gwms 85690 Sep 16 11:55 /var/log/gwms-frontend/group_FNAL_fermilab/FNAL_fermilab.err.log
[root@fifebatchgpvmhead1 ~]# ll /var/log/gwms-frontend/group_*/*.err.log
-rw-r--r-- 1 frontend gwms 74494 Sep 14 16:30 /var/log/gwms-frontend/group_fermicloud/fermicloud.err.log
-rw-r--r-- 1 frontend gwms 74547 Sep 16 09:01 /var/log/gwms-frontend/group_fermicloud_pp/fermicloud_pp.err.log
-rw-r--r-- 1 frontend gwms 82869 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_argoneut/FNAL_argoneut.err.log
-rw-r--r-- 1 frontend gwms 90974 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_cdf/FNAL_cdf.err.log
-rw-r--r-- 1 frontend gwms 92800 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_cdf_slottest/FNAL_cdf_slottest.err.log
-rw-r--r-- 1 frontend gwms 88297 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_coupp/FNAL_coupp.err.log
-rw-r--r-- 1 frontend gwms 88015 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_darkside/FNAL_darkside.err.log
-rw-r--r-- 1 frontend gwms 94294 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_dzero/FNAL_dzero.err.log
-rw-r--r-- 1 frontend gwms 85690 Sep 16 11:55 /var/log/gwms-frontend/group_FNAL_fermilab/FNAL_fermilab.err.log
-rw-r--r-- 1 frontend gwms 81736 Sep 15 23:39 /var/log/gwms-frontend/group_FNAL_genie/FNAL_genie.err.log
-rw-r--r-- 1 frontend gwms 85404 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_gm2/FNAL_gm2.err.log
-rw-r--r-- 1 frontend gwms 82669 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_lar1/FNAL_lar1.err.log
-rw-r--r-- 1 frontend gwms 88768 Sep 15 23:14 /var/log/gwms-frontend/group_FNAL_lariat/FNAL_lariat.err.log
-rw-r--r-- 1 frontend gwms 83718 Sep 15 23:39 /var/log/gwms-frontend/group_FNAL_lbne/FNAL_lbne.err.log
-rw-r--r-- 1 frontend gwms 88664 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_lsst/FNAL_lsst.err.log
-rw-r--r-- 1 frontend gwms 88405 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_marsgm2/FNAL_marsgm2.err.log
-rw-r--r-- 1 frontend gwms 88872 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_marslbne/FNAL_marslbne.err.log
-rw-r--r-- 1 frontend gwms 88898 Sep 14 16:29 /var/log/gwms-frontend/group_FNAL_marsmu2e/FNAL_marsmu2e.err.log
-rw-r--r-- 1 frontend gwms 93868 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_minerva/FNAL_minerva.err.log
-rw-r--r-- 1 frontend gwms 88513 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_miniboone/FNAL_miniboone.err.log
-rw-r--r-- 1 frontend gwms 94320 Sep 16 10:47 /var/log/gwms-frontend/group_FNAL_minos/FNAL_minos.err.log
-rw-r--r-- 1 frontend gwms 88269 Sep 16 00:52 /var/log/gwms-frontend/group_FNAL_mu2e/FNAL_mu2e.err.log
-rw-r--r-- 1 frontend gwms 94238 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_nova/FNAL_nova.err.log
-rw-r--r-- 1 frontend gwms 193334 Sep 16 09:01 /var/log/gwms-frontend/group_FNAL_numi/FNAL_numi.err.log
-rw-r--r-- 1 frontend gwms 88485 Sep 16 04:22 /var/log/gwms-frontend/group_FNAL_seaquest/FNAL_seaquest.err.log
-rw-r--r-- 1 frontend gwms 91142 Sep 16 13:23 /var/log/gwms-frontend/group_FNAL_uboone/FNAL_uboone.err.log
-rw-r--r-- 1 frontend gwms 169469 Sep 14 16:29 /var/log/gwms-frontend/group_OSG_argoneut/OSG_argoneut.err.log
-rw-r--r-- 1 frontend gwms 150261 Sep 16 13:23 /var/log/gwms-frontend/group_OSG_fermilab/OSG_fermilab.err.log
-rw-r--r-- 1 frontend gwms 78682 Sep 16 10:47 /var/log/gwms-frontend/group_OSG_genie/OSG_genie.err.log
-rw-r--r-- 1 frontend gwms 90200 Sep 16 13:23 /var/log/gwms-frontend/group_OSG_gm2/OSG_gm2.err.log
-rw-r--r-- 1 frontend gwms 175639 Sep 16 13:23 /var/log/gwms-frontend/group_OSG_lbne/OSG_lbne.err.log
-rw-r--r-- 1 frontend gwms 70738 Sep 16 09:04 /var/log/gwms-frontend/group_OSG_lsst/OSG_lsst.err.log
-rw-r--r-- 1 frontend gwms 164829 Sep 14 16:30 /var/log/gwms-frontend/group_OSG_minerva/OSG_minerva.err.log
-rw-r--r-- 1 frontend gwms 162280 Sep 13 17:25 /var/log/gwms-frontend/group_OSG_miniboone/OSG_miniboone.err.log
-rw-r--r-- 1 frontend gwms 162675 Sep 16 10:47 /var/log/gwms-frontend/group_OSG_minos/OSG_minos.err.log
-rw-r--r-- 1 frontend gwms 175754 Sep 15 13:20 /var/log/gwms-frontend/group_OSG_mu2e/OSG_mu2e.err.log
-rw-r--r-- 1 frontend gwms 180764 Sep 16 09:01 /var/log/gwms-frontend/group_OSG_nova/OSG_nova.err.log
-rw-r--r-- 1 frontend gwms 166664 Sep 16 10:47 /var/log/gwms-frontend/group_OSG_numi/OSG_numi.err.log
-rw-r--r-- 1 frontend gwms 174008 Sep 15 16:31 /var/log/gwms-frontend/group_OSG_uboone/OSG_uboone.err.log
-rw-r--r-- 1 frontend gwms 74266 Sep 16 09:01 /var/log/gwms-frontend/group_paid_cloud/paid_cloud.err.log
-rw-r--r-- 1 frontend gwms 76986 Sep 16 13:23 /var/log/gwms-frontend/group_paid_cloud_test/paid_cloud_test.err.log
[root@fifebatchgpvmhead1 ~]# tail /var/log/gwms-frontend/group_FNAL_lbne/FNAL_lbne.err.log
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 272, in fetch
    xml_data = condorExe.exe_cmd(self.exe_name,"%s %s -xml %s %s"%(self.resource_str,format_str,self.pool_str,constraint_str),env=self.env);
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorExe.py", line 56, in exe_cmd
    return iexe_cmd(cmd,stdin_data,env)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorExe.py", line 93, in iexe_cmd
    raise ExeError, "Unexpected Error running '%s'. Details: %s" % (cmd, ex)
ExeError: Unexpected Error running '/usr/bin/condor_q -name fifebatch1.fnal.gov -format "%s" "AccountingGroup" -format "%s" "DESIRED_USAGE_MODEL" -format "%s" "DesiredOS" -format "%s" "x509UserProxyFirstFQAN" -format "%s" "x509UserProxyFQAN" -format "%s" "x509userproxy" -format "%i" "JobStatus" -format "%i" "EnteredCurrentStatus" -format "%i" "ServerTime" -format "%s" "RemoteHost" -format "%i" "ClusterId" -format "%i" "ProcId" -xml -constraint '((JobStatus=?=1)||(JobStatus=?=2)) &&
(((JobUniverse==5)&&(GLIDEIN_Is_Monitor =!= TRUE)&&(JOB_Is_Monitor =!= TRUE)) && (stringListIMember("group_lbne", AccountingGroup,".") && (isUndefined(DESIRED_USAGE_MODEL) || stringListsIntersect(toUpper(DESIRED_USAGE_MODEL),"DEDICATED,OPPORTUNISTIC",","))))''. Details: Command '/usr/bin/condor_q -name fifebatch1.fnal.gov -format "%s" "AccountingGroup" -format "%s" "DESIRED_USAGE_MODEL" -format "%s" "DesiredOS" -format "%s" "x509UserProxyFirstFQAN" -format "%s" "x509UserProxyFQAN" -format "%s" "x509userproxy" -format "%i" "JobStatus" -format "%i" "EnteredCurrentStatus" -format "%i" "ServerTime" -format "%s" "RemoteHost" -format "%i" "ClusterId" -format "%i" "ProcId" -xml -constraint '((JobStatus=?=1)||(JobStatus=?=2)) && (((JobUniverse==5)&&(GLIDEIN_Is_Monitor =!= TRUE)&&(JOB_Is_Monitor =!= TRUE)) && (stringListIMember("group_lbne", AccountingGroup,".") && (isUndefined(DESIRED_USAGE_MODEL) || stringListsIntersect(toUpper(DESIRED_USAGE_MODEL),"DEDICATED,OPPORTUNISTIC",","))))'' returned non-zero exit status 1:
- Failed to fetch ads from: <131.225.67.102:9615?sock=5407_df1c> : fifebatch1.fnal.gov
SECMAN:2007:Failed to end classad message.
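Since most of these errors reportedly trace back to hung schedds, it may help to tally the "Failed to fetch ads" lines by schedd host, so an alarm can say which schedd is failing. The following is only a sketch: the regular expression is modelled on the single log line shown above, and the helper name failed_schedds is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the one "Failed to fetch ads" line quoted in this ticket.
FETCH_FAIL_RE = re.compile(r"Failed to fetch ads from: <[^>]*> : (\S+)")


def failed_schedds(lines):
    """Return a Counter mapping schedd host name to number of fetch failures."""
    hosts = Counter()
    for line in lines:
        match = FETCH_FAIL_RE.search(line)
        if match:
            hosts[match.group(1)] += 1
    return hosts


if __name__ == "__main__":
    sample = [
        "- Failed to fetch ads from: <131.225.67.102:9615?sock=5407_df1c> : fifebatch1.fnal.gov",
        "SECMAN:2007:Failed to end classad message.",
    ]
    for host, failures in failed_schedds(sample).items():
        print("%s: %d" % (host, failures))
```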


