Feature #14473

Monitor internal components

Added by Marc Mengel almost 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version: -
Start date: 11/10/2016
Due date: 11/30/2016
% Done: 100%
Estimated time: 8.00 h
Scope: Internal
Experiment: -
Stakeholders:
Duration: 21

Description

We need to have something watching to make sure POMS is actually alive:

  • agents under supervisord running on pomsgpvm01 (and in dev on fermicloud045)
  • excessive memory growth in the above
  • exceptions/500 status codes in logfiles in private/logs/poms
  • report any such to an email list with repeat suppression
  • maybe get the main POMS webservice monitored in site central monitoring

I'm listing this as tied to v1_0_1, but it can be stood up at any time; it isn't necessarily tied to a release.
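
A minimal sketch of the sort of watchdog the list above describes, assuming supervisorctl is on the PATH; the alert address, SMTP host, and suppression window are placeholders, not values from any eventual POMS implementation:

    #!/usr/bin/env python
    # Watchdog sketch: check the supervisord-managed POMS agents and mail an
    # alert list, suppressing repeats of the same alert.
    import smtplib
    import subprocess
    import time
    from email.mime.text import MIMEText

    ALERT_ADDRESS = "poms-admins@example.gov"   # hypothetical list address
    SMTP_HOST = "localhost"                     # placeholder mail relay
    SUPPRESS_SECONDS = 4 * 3600                 # don't repeat the same alert for 4 hours

    _last_sent = {}                             # alert subject -> time it was last mailed

    def alert(subject, body):
        """Mail an alert, skipping it if the same subject went out recently."""
        now = time.time()
        if now - _last_sent.get(subject, 0) < SUPPRESS_SECONDS:
            return
        _last_sent[subject] = now
        msg = MIMEText(body)
        msg["Subject"] = subject
        msg["From"] = ALERT_ADDRESS
        msg["To"] = ALERT_ADDRESS
        server = smtplib.SMTP(SMTP_HOST)
        server.sendmail(ALERT_ADDRESS, [ALERT_ADDRESS], msg.as_string())
        server.quit()

    def check_agents():
        """Flag any supervisord-managed agent that is not RUNNING."""
        out = subprocess.Popen(["supervisorctl", "status"],
                               stdout=subprocess.PIPE).communicate()[0]
        for line in out.decode().splitlines():
            fields = line.split()
            if len(fields) >= 2 and fields[1] != "RUNNING":
                alert("POMS agent %s is %s" % (fields[0], fields[1]), line)

    if __name__ == "__main__":
        while True:
            check_agents()
            time.sleep(300)   # poll every five minutes

The same loop could grow checks for memory growth and for 500s in the logs; the repeat-suppression map keeps a flapping agent from flooding the list.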

History

#1 Updated by Anna Mazzacane almost 4 years ago

  • Target version deleted (v1_1_0)

#2 Updated by Joe Boyd almost 4 years ago

Yes, that sounds about right.

Vladimir.

On 12/19/2016 01:07 PM, Joe Boyd wrote:

Everything was restarted 44 minutes ago, wasn't it?

[poms@fermicloud045 ~]$ supervisorctl status
poms_declared_files_watcher RUNNING pid 16909, uptime 0:44:15
poms_fifemon_reader RUNNING pid 16903, uptime 0:44:15
poms_fts_scanner RUNNING pid 16904, uptime 0:44:15
poms_joblog_scraper RUNNING pid 16905, uptime 0:44:15
poms_jobsub_q_scraper RUNNING pid 16906, uptime 0:44:15
poms_status_scraper RUNNING pid 16902, uptime 0:44:15
poms_webservice_devel RUNNING pid 17200, uptime 0:43:14
rsyslogd RUNNING pid 16901, uptime 0:44:15

Is there a ticket open to investigate the memory leak in jobsub_q_scraper.py? I didn't find one in a quick search but maybe my search was wrong.

joe

On 12/19/2016 12:38 PM, Vladimir Podstavkov wrote:

Hi Anna,

It was not down, I believe. Around 12:17 the node was overloaded by
one of the daemons.

PID   USER PR NI VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
30752 poms 20  0 4051m 2.4g 960 D  2.3 86.7 93:45.30 python ./jobsub_q_scraper.py -d

It ate pretty much all the memory.

I have restarted the service, it seems to be OK now.

Vladimir.
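
The incident above, jobsub_q_scraper growing to about 2.4 GB resident, is exactly what the "excessive memory growth" item in the description should catch. A rough sketch of such a check, assuming psutil is available on the node; the 2 GB threshold is illustrative, not a tuned value:

    # Flag poms-owned processes whose resident set size exceeds a limit.
    import psutil

    RSS_LIMIT_BYTES = 2 * 1024 ** 3   # flag anything over ~2 GB resident

    def oversized_poms_processes():
        """Return (pid, command line, rss) for 'poms'-owned processes over the limit."""
        hits = []
        for p in psutil.process_iter():
            try:
                if p.username() != "poms":
                    continue
                rss = p.memory_info().rss
                if rss > RSS_LIMIT_BYTES:
                    hits.append((p.pid, " ".join(p.cmdline()), rss))
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        return hits

    if __name__ == "__main__":
        for pid, cmd, rss in oversized_poms_processes():
            print("PID %d at %.1f GB: %s" % (pid, rss / float(1024 ** 3), cmd))
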

#3 Updated by Joe Boyd over 3 years ago

  • Status changed from Assigned to Work in progress
  • % Done changed from 0 to 10

#4 Updated by Joe Boyd over 3 years ago

  • % Done changed from 10 to 40

With a couple of code additions we can have graphs like

https://fifemon-pp.fnal.gov/dashboard/db/joepromtest

for each of our probes, so we can see whether they're growing in memory, or track whatever else we want to send to Graphite. Will need to talk to Marc to find out how to release new probes into production with this instrumentation.
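
One plausible shape for those "couple of code additions" is each probe pushing its own resident memory to Graphite over the plaintext protocol; the host, port, and metric prefix below are placeholders, not the actual fifemon endpoint:

    # Sketch: report this probe's current RSS as a single Graphite datapoint.
    import socket
    import time

    import psutil

    GRAPHITE_HOST = "graphite.example.gov"          # placeholder, not the fifemon host
    GRAPHITE_PORT = 2003                            # Graphite plaintext protocol port
    METRIC_PREFIX = "poms.probes.jobsub_q_scraper"  # illustrative metric path

    def report_memory():
        """Send this process's current resident set size to Graphite."""
        rss = psutil.Process().memory_info().rss
        line = "%s.rss %d %d\n" % (METRIC_PREFIX, rss, int(time.time()))
        sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5)
        try:
            sock.sendall(line.encode())
        finally:
            sock.close()

A probe's main loop would call report_memory() once per polling cycle, alongside its existing work, so the dashboard shows growth over time.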

#5 Updated by Joe Boyd over 3 years ago

  • % Done changed from 40 to 90

The jobsub_q_scraper probe is instrumented and running in dev. It's generating the stats on this dashboard:

https://fifemon-pp.fnal.gov/dashboard/db/poms-probe-profiles

#6 Updated by Joe Boyd over 3 years ago

  • Status changed from Work in progress to Resolved
  • % Done changed from 90 to 100

The jobsub_q_scraper instrumentation has been implemented and the server infrastructure set up. I'll open new issues to decide what else to instrument and to add more monitoring.

#7 Updated by Joe Boyd over 3 years ago

  • Status changed from Resolved to Closed

