Monitor internal components
We need to have something watching to make sure POMS is actually alive:
- agents under supervisord running on pomsgpvm01 (and in dev on fermicloud045)
- excessive memory growth in the above
- exceptions/500 status codes in the log files under private/logs/poms
- report any such to the email list, with repeat suppression
- maybe get the main POMS webservice monitored in site central monitoring
I'm listing this as tied to v1_0_1, but it can be stood up at any time, not necessarily tied to a release.
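The watcher described above could be sketched roughly as follows. This is only an illustration, not the eventual implementation: the alert threshold, function names, and the in-memory suppression set are all assumptions; a real deployment would also mail the alerts and persist the suppression state.

```python
import re

# Sketch of a watchdog for the supervisord-managed POMS agents.
# The regex matches lines of `supervisorctl status` output like:
#   poms_fts_scanner    RUNNING   pid 16904, uptime 0:44:15
STATUS_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<state>\S+)(?:\s+pid (?P<pid>\d+), uptime \S+)?"
)

def parse_status(text):
    """Parse `supervisorctl status` output into {name: (state, pid)}."""
    procs = {}
    for line in text.splitlines():
        m = STATUS_RE.match(line.strip())
        if m:
            pid = int(m.group("pid")) if m.group("pid") else None
            procs[m.group("name")] = (m.group("state"), pid)
    return procs

def find_problems(procs, rss_mb, limit_mb=2048):
    """Return alert strings for non-RUNNING agents or excessive memory.
    `rss_mb` maps agent name -> resident memory in MB; limit is assumed."""
    alerts = []
    for name, (state, _pid) in sorted(procs.items()):
        if state != "RUNNING":
            alerts.append("%s is %s" % (name, state))
        elif rss_mb.get(name, 0) > limit_mb:
            alerts.append("%s using %d MB (limit %d MB)"
                          % (name, rss_mb[name], limit_mb))
    return alerts

_already_sent = set()

def report(alerts):
    """Pass along only alerts not seen before (simple repeat suppression)."""
    new = [a for a in alerts if a not in _already_sent]
    _already_sent.update(new)
    return new  # in production these would go to the email list
```

Running this from cron and feeding `report()` into mail would cover the "alive" and "memory growth" bullets; scanning the log files for exceptions/500s would be a similar loop over `private/logs/poms`.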
#2 Updated by Joe Boyd almost 4 years ago
Yes, that sounds about right.
On 12/19/2016 01:07 PM, Joe Boyd wrote:
Everything was restarted 44 minutes ago wasn't it?
[poms@fermicloud045 ~]$ supervisorctl status
poms_declared_files_watcher RUNNING pid 16909, uptime 0:44:15
poms_fifemon_reader RUNNING pid 16903, uptime 0:44:15
poms_fts_scanner RUNNING pid 16904, uptime 0:44:15
poms_joblog_scraper RUNNING pid 16905, uptime 0:44:15
poms_jobsub_q_scraper RUNNING pid 16906, uptime 0:44:15
poms_status_scraper RUNNING pid 16902, uptime 0:44:15
poms_webservice_devel RUNNING pid 17200, uptime 0:43:14
rsyslogd RUNNING pid 16901, uptime 0:44:15
Is there a ticket open to investigate the memory leak in jobsub_q_scraper.py? I didn't find one in a quick search but maybe my search was wrong.
On 12/19/2016 12:38 PM, Vladimir Podstavkov wrote:
It was not down, I believe. Around 12:17 the node was overloaded by
one of the daemons.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30752 poms 20 0 4051m 2.4g 960 D 2.3 86.7 93:45.30 python ./*jobsub_q_scraper.py* -d
It ate pretty much all the memory.
I have restarted the service, it seems to be OK now.
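A watchdog could catch this kind of growth before the node is starved by sampling each agent's resident set size from /proc. A minimal Linux-only sketch (the kB-to-MB rounding and the fallback value are my choices, not anything from the existing probes):

```python
def rss_mb(pid):
    """Resident set size in MB for a process, read from /proc/<pid>/status.

    /proc/<pid>/status contains a line like 'VmRSS:  2458624 kB';
    returns 0 if the field is absent (e.g. kernel threads).
    Linux-specific: there is no /proc/<pid>/status on other platforms.
    """
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) // 1024  # value is in kB
    return 0
```

Polling this per agent PID and alerting past a threshold would have flagged jobsub_q_scraper.py long before it reached 2.4g/86.7% of memory.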
#4 Updated by Joe Boyd over 3 years ago
- % Done changed from 10 to 40
With a couple of code additions we can have graphs like this for each of our probes, so we can see whether they're growing in memory, or whatever else we want to send to graphite. Will need to talk to Marc to find out how to release new probes into production with this instrumentation.
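For reference, shipping a datapoint to Graphite only takes its plaintext protocol: one `metric.path value unix-timestamp` line per datapoint over TCP (port 2003 by default). A hedged sketch, with a placeholder host and metric path rather than the real FIFE endpoints:

```python
import socket
import time

GRAPHITE_HOST = "graphite.example.com"  # placeholder, not the real host
GRAPHITE_PORT = 2003                    # Graphite plaintext-protocol port

def format_metric(path, value, timestamp):
    """Render one datapoint in Graphite's plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\n'."""
    return "%s %f %d\n" % (path, value, timestamp)

def send_metric(path, value):
    """Open a TCP connection and ship a single datapoint."""
    line = format_metric(path, value, int(time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT),
                                  timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A probe would then call something like `send_metric("poms.probes.jobsub_q_scraper.rss_mb", rss)` on each cycle, and the dashboard graphs fall out of that.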
#5 Updated by Joe Boyd over 3 years ago
- % Done changed from 40 to 90
The jobsub_q_scraper probe is instrumented and running in dev. It's generating the stats on this dashboard: