Feature #4557

Automatic restart of failed NCIS controller/workers

Added by Randy Reitz about 6 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Start date: 08/20/2013
Due date:
% Done: 100%
Estimated time: 1.00 h
Spent time:
Duration:

Description

The ncis_robots package starts and stops the NCIS robots. Each robot is a daemon of
the controller, worker, or cache type. For example, the ncis_robots configuration
shows which controllers should be running, with one port variable per controller:

[ncis@gibbs ~]$ setup ncis_robots
[ncis@gibbs ~]$ env | grep CONTROLLER
NCIS_SYSTEM_DISPATCH_CONTROLLER_PORT=7300
NCIS_SNAP_DISPATCH_CONTROLLER_PORT=7100
NCIS_POLL_DISPATCH_CONTROLLER_PORT=7500
NCIS_MIA_DEVICE_DISPATCH_CONTROLLER_PORT=7600
NCIS_DHCP_DISPATCH_CONTROLLER_PORT=7200
NCIS_DNS_DISPATCH_CONTROLLER_PORT=7400

The currently running processes can be checked for controllers:

[ncis@gibbs ~]$ ps -fwu ncis | egrep '\<controller\>'
ncis      9770     1  2 07:59 ?        00:01:40 python /fnal/ups/prd/ncis_robots/v3_0a/bin/NcisRobot.py end host dhcp controller --edPort=7950 --jobsPerWorker=1 --controllerPort=7200
ncis      9788     1  1 07:59 ?        00:01:01 python /fnal/ups/prd/ncis_robots/v3_0a/bin/NcisRobot.py end host system controller --edPort=7950 --jobsPerWorker=1 --controllerPort=7300
ncis      9798     1  1 07:59 ?        00:01:00 python /fnal/ups/prd/ncis_robots/v3_0a/bin/NcisRobot.py end host dnsnames controller --edPort=7950 --jobsPerWorker=1 --controllerPort=7400
ncis      9820     1  5 08:00 ?        00:04:30 python /fnal/ups/prd/ncis_robots/v3_0a/bin/NcisRobot.py snapshot controller --edPort=7950 --jobsPerWorker=5 --controllerPort=7100
ncis      9849     1 13 08:00 ?        00:10:24 python /fnal/ups/prd/ncis_robots/v3_0a/bin/NcisRobot.py network device poll controller --edPort=7950 --jobsPerWorker=11 --controllerPort=7500
ncis      9879     1  9 08:00 ?        00:07:02 python /fnal/ups/prd/ncis_robots/v3_0a/bin/NcisRobot.py mia device controller --edPort=7950 --jobsPerWorker=1 --controllerPort=7600
ncis     19146 17878  0 09:16 pts/0    00:00:00 egrep \<controller\>

If these two lists "match" (each configured controller port appears in the process list), then all is well.
If one or more controllers are missing from the running environment, then the ncis_robots suite can be
restarted automatically. An incident can also be created in ServiceNow.
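
The comparison described above can be sketched in Python. This is a minimal sketch, not the implemented check: it assumes that a controller is identified by its port, that the expected ports come from the NCIS_*_DISPATCH_CONTROLLER_PORT variables, and that the running ports come from the --controllerPort option in the ps output. The function names are hypothetical.

```python
import re

# Assumption: one environment variable per expected controller, as in the env listing.
ENV_RE = re.compile(r"^NCIS_(\w+)_DISPATCH_CONTROLLER_PORT=(\d+)$")
# Assumption: a running controller shows the word "controller" and a --controllerPort option.
PORT_RE = re.compile(r"\bcontroller\b.*--controllerPort=(\d+)")


def expected_ports(env_lines):
    """Map port -> controller name from lines like NCIS_DHCP_DISPATCH_CONTROLLER_PORT=7200."""
    ports = {}
    for line in env_lines:
        m = ENV_RE.match(line.strip())
        if m:
            ports[int(m.group(2))] = m.group(1)
    return ports


def running_ports(ps_lines):
    """Collect controller ports from ps output, skipping non-robot lines such as the egrep itself."""
    ports = set()
    for line in ps_lines:
        if "NcisRobot.py" not in line:
            continue
        m = PORT_RE.search(line)
        if m:
            ports.add(int(m.group(1)))
    return ports


def missing_controllers(env_lines, ps_lines):
    """Return the sorted names of configured controllers with no running process."""
    expected = expected_ports(env_lines)
    running = running_ports(ps_lines)
    return sorted(name for port, name in expected.items() if port not in running)
```

An empty result from missing_controllers corresponds to the two lists matching. Any names returned are the controllers that would trigger a restart.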

This is the basic idea for an automatic NCIS restart. Implementation can be one of:

  • A shell script run from cron. The shell script contains all the logic and uses email to the ServiceDesk
    to create an incident.
  • A Python script run from cron. The Python script can be more sophisticated and keep a history of the
    incidents created in the NCIS database.
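
As an illustration of the cron-driven variant, the restart-and-notify step might look like the sketch below. It is hedged throughout: the email addresses are placeholders, the function names are hypothetical, and the restart and send actions are passed in as callables because this ticket names neither the restart command nor the mail setup.

```python
from email.message import EmailMessage

# Hypothetical placeholders: the ticket does not give the real addresses.
SERVICE_DESK = "servicedesk@example.org"
SENDER = "ncis@example.org"


def build_incident_email(missing, host):
    """Build the message that asks the ServiceDesk to open an incident."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = SERVICE_DESK
    msg["Subject"] = "NCIS controllers restarted on %s" % host
    msg.set_content(
        "Missing controllers: %s\nAn automatic restart was triggered."
        % ", ".join(missing)
    )
    return msg


def check_and_restart(missing, host, restart, send):
    """Restart and notify only when controllers are missing; return True if a restart ran."""
    if not missing:
        return False
    restart()  # caller supplies the actual restart action
    send(build_incident_email(missing, host))
    return True
```

In a cron job, restart could wrap whatever command the ncis_robots package provides for restarting the suite, and send could be the send_message method of an smtplib.SMTP connection. Keeping both as parameters lets the logic be tested without touching the running daemons.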

The daemon processes are intended to run continuously, 24x7. Since the current SLA calls for
8x5 support, it is reasonable to monitor the running daemons automatically and restart them
as required. The ServiceNow incidents created by a restart can be pursued during
the normal 8x5 work week.

History

#1 Updated by Lauri Carpenter about 5 years ago

  • Status changed from New to Closed
  • Assignee set to Randy Reitz
  • % Done changed from 0 to 100
  • Estimated time set to 1.00 h

