Feature #14194

Feature #4586: Switch init script to use RHEL daemon function

write frontend and factory init scripts for sl7

Added by HyunWoo Kim almost 3 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
10/20/2016
Due date:
% Done:

0%

Estimated time:
Stakeholders:
Duration:

Description

Redmine ticket 4586 addresses the changes to the frontend and factory init scripts.
The changes involve the use of the daemon function provided by /etc/rc.d/init.d/functions.
While working on 4586, we realized that we now have to start writing sl7 versions of the init scripts.

I did initial research and testing in sl7, but it turns out we need to redesign our reconfig and upgrade functions
for the sl7 init environment.

History

#1 Updated by HyunWoo Kim almost 3 years ago

This is the comment that I made on October 12 2016 in the original ticket 4586:

I have a theory that I concluded on October 5 2016 Wednesday:

One obvious fact that I also learned is that the systemctl reload command is rejected
when the service (gwms-frontend) is not running.
I think we should NOT underestimate this fact, i.e. systemctl assumes that the underlying service
(gwms-frontend) is always running.
But as you all know, our upgrade or reconfig first kills the gwms-frontend service.
So, while the systemctl reload command is running, our upgrade runs and kills the gwms-frontend service,
and I believe this makes systemctl think that something has gone wrong.
systemctl then attempts to stop the service, which no longer exists in our case
because upgrade has already killed it at this point.

In order to test this theory of mine, I did some experiment:

Write the following file as /usr/sbin/hkbin

#!/usr/bin/env python                                                                                                            

import sys
import subprocess
import time

if sys.argv[1] == 'start':
   print 'regular start is being called'
   sys.stdout.flush()
   subprocess.call( 'nohup /usr/sbin/hksleep < /dev/null > /tmp/tmp.out 2>/tmp/tmp.err & ', shell=True )

elif sys.argv[1] == 'stop':
   print 'regular stop is being called'
   sys.stdout.flush()

   subprocess.call( 'killall python /usr/sbin/hksleep', shell=True )

elif sys.argv[1] == 'reload':
   print 'reload stop is being called'
   sys.stdout.flush()
   subprocess.call( 'killall python /usr/sbin/hksleep', shell=True )

   print 'pretend that reload is doing something'
   sys.stdout.flush()

   print 'reload start is being called'
   sys.stdout.flush()
   subprocess.call( 'nohup /usr/sbin/hksleep < /dev/null > /tmp/tmp.out 2>/tmp/tmp.err & ', shell=True )

Also write /etc/systemd/system/hktest.service
and then do

systemctl start hktest.service
systemctl reload hktest.service

Then you will see the following messages in /var/log/messages:
Oct  5 16:35:23 fermicloud363 systemd: Starting HKTESTSL7...
Oct  5 16:35:23 fermicloud363 hkbin: regular start is being called
Oct  5 16:35:23 fermicloud363 systemd: Started HKTESTSL7.

Oct  5 16:35:30 fermicloud363 hkbin: reload stop is being called
Oct  5 16:35:30 fermicloud363 hkbin: /usr/sbin/hksleep: no process found
Oct  5 16:35:30 fermicloud363 hkbin: regular stop is being called   **
Oct  5 16:35:30 fermicloud363 hkbin: /usr/sbin/hksleep: no process found
Oct  5 16:35:30 fermicloud363 systemd: Reloaded HKTESTSL7.

The line marked ** is the evidence that supports my theory.

So, as long as our upgrade() kills the gwms service, systemctl will intervene.

So, if this assertion is convincing enough for you two,
we should simply guide people to run /usr/sbin/gwms-frontend upgrade or reconfig directly
and tell them that systemctl reload gwms-frontend.service is not supported.

And this is Brian Bockelman's comment on the same day

The way you would do this in RHEL7 is send a signal to the frontend, 
have it 'exec' to the appropriate reconfig command, 
then have the reconfig 'exec' the frontend again.

That said: 
there is no support for custom verbs (reconfig, upgrade, etc) in the systemd model. 
In general, you want to do it in a standalone command as you outline above.

#2 Updated by HyunWoo Kim almost 3 years ago

  • Status changed from New to Assigned

I think I solved the reload issue in the Frontend code.
I added new lines to the /usr/sbin/glideinFrontend and /usr/sbin/reconfig_frontend files.

What happens is as follows:

We will have to add a new script in the Frontend which finds the PID of the glideinFrontend process and sends it a SIGHUP signal.
glideinFrontend will catch SIGHUP and call os.execv() to run /usr/sbin/reconfig_frontend with some default arguments.
reconfig_frontend will conduct the usual reconfig process and then also call os.execv() to go back to /usr/sbin/glideinFrontend.
glideinFrontend will start from scratch, but the same PID will be written to the lock file.
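
The claim that the PID survives the os.execv() round trip can be checked in isolation. The following is a minimal sketch, not the Frontend code: it assumes a POSIX system, and the helper name pids_before_and_after_exec is made up for illustration. A child interpreter prints its PID, replaces itself via os.execv(), and the replacement prints its PID again.

```python
import subprocess
import sys

# Source of a throwaway child program (illustrative only, not GlideinWMS code).
# It prints its PID, then replaces its own process image with a fresh
# interpreter that prints the PID once more.
CHILD_SOURCE = """
import os, sys
print(os.getpid())
sys.stdout.flush()
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
"""


def pids_before_and_after_exec():
    """Return the PID seen before os.execv() and the PID seen after it."""
    output = subprocess.check_output([sys.executable, "-c", CHILD_SOURCE])
    before, after = output.decode().split()
    return int(before), int(after)
```

On POSIX the two values are equal, which is why systemd keeps tracking the same main process across the reconfig.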

I tested this new structure in my test sl7 Frontend and it appears that the new code is working.

Then I added a set of similar lines to the Factory code,
but I cannot make it work on the Factory side yet.
I will need to investigate the Factory further.

#3 Updated by HyunWoo Kim almost 3 years ago

I did further investigation into why these new codes in the Factory do not work.
I found out which lines are throwing errors.
Now, glideFactory.py successfully does os.execv to reconfig_glidein and
in turn reconfig_glidein successfully does os.execv back to glideFactory.py.

But there is one fatal issue here which comes from the intrinsic Factory code.
When we return from reconfig_glidein back to glideFactory.py via os.execv,
glideFactory.py begins from scratch and tries to launch glideFactoryEntryGroup.py.
But this fails because there is already a running glideFactoryEntryGroup.py.

Note that we just jump from glideFactory.py to reconfig_glidein, leaving the existing glideFactoryEntryGroup.py running.

A solution might be to create a text file that keeps the list of running glideFactoryEntryGroup.py processes
and have glideFactory.py look at this list first and just adopt the running glideFactoryEntryGroup.py processes before attempting to launch new ones.
I am not sure how plausible this new scenario is.

#4 Updated by HyunWoo Kim almost 3 years ago

Today, I solved this issue in the Factory.
My solution is:
when glideFactory.py receives a SIGHUP signal,
it now kills the child processes (glideFactoryEntryGroup) before it becomes reconfig_glidein.
This way, when reconfig_glidein becomes glideFactory again,
the code will launch a new set of children (glideFactoryEntryGroup).
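
As a rough illustration of the "kill the children before exec" step, the sketch below terminates a list of child processes and waits for them. It assumes the children are held as subprocess.Popen objects; the function name terminate_children and the timeout value are illustrative and not taken from the Factory code.

```python
import subprocess


def terminate_children(children, timeout=5.0):
    """Ask every child to stop, wait for it, and return the exit codes.

    `children` is a list of subprocess.Popen objects (an assumption; the
    real Factory may track its entry groups differently).
    """
    # First pass: send SIGTERM to everything that is still alive.
    for child in children:
        if child.poll() is None:
            child.terminate()
    # Second pass: reap each child, escalating to SIGKILL if it lingers.
    for child in children:
        try:
            child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            child.kill()
            child.wait()
    return [child.returncode for child in children]
```

Only after this returns would the parent call os.execv(), so the restarted main process finds no leftover children.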

In summary, I now know how to keep the main processes of the Frontend and Factory
and still do the reconfig.

I now just have to write a new set of scripts for both the Frontend and Factory that will
simply send a SIGHUP signal to the main processes when we do
systemctl reload gwms(-frontend or -factory)
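
Such a signal-sending script could look roughly like the sketch below. It is only an illustration: it assumes the lock file holds the main process PID as a bare integer on its first line, which this ticket does not confirm, and the names read_pid and send_reload are hypothetical.

```python
import os
import signal


def read_pid(lock_file):
    """Read the main process PID from the lock file.

    Assumption: the first line is a bare integer PID. The real lock file
    format may differ.
    """
    with open(lock_file) as handle:
        return int(handle.readline().strip())


def send_reload(lock_file):
    """Send SIGHUP to the process recorded in the lock file; return its PID."""
    pid = read_pid(lock_file)
    os.kill(pid, signal.SIGHUP)
    return pid
```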

After this remaining development, I will assign this to someone for feedback..

#5 Updated by HyunWoo Kim almost 3 years ago

During yesterday's weekly gwms meeting, Parag pointed out the following 2 issues:
1. In glideinFrontend, before reconfig_frontend is executed via os.execv,
we have to make sure to terminate any running glideinFrontendElement child processes.
I inspected the code, and glideinFrontend's structure is

main()
  spawn()
    try:  # Service will exit on signal only.
      while 1:
        spawn_iteration()
          spawn_group()
    finally:  # We have been asked to terminate
      logSupport.log.info("Deadvertize my ads")
      spawn_cleanup(work_dir, frontendDescript, groups, frontendDescript.data['FrontendName'], mode)

def spawn_group(work_dir, group_name, action):
    command_list = [sys.executable, os.path.join(STARTUP_DIR, "glideinFrontendElement.py"), str(os.getpid()),
                    work_dir,  group_name, action]
    child = subprocess.Popen(command_list, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return child

When glideinFrontend catches a SIGHUP signal, it will raise an exception.
Inside spawn(), the execution will break out of the while loop immediately
and fall into the finally block, which runs spawn_cleanup() to deadvertize the classads.
What about the glideinFrontendElement children?
My guess is that when execution breaks out of the while loop, any remaining child processes (glideinFrontendElement)
will simply exit when their work is done.
This guess is based on the fact that when we issue stop to the init script, SIGTERM/SIGQUIT is sent
and the flow follows the same path: the code breaks out of the while loop of spawn() and
enters the finally block in the same spawn(), i.e. there is no code that explicitly terminates the child processes.
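
The control flow described here (a signal handler raises, the loop ends, the finally block cleans up) can be reduced to a few lines. This is a simplified sketch and not the glideinFrontend source: ReloadRequested, raise_reload and the two callables passed to spawn() are illustrative names.

```python
import signal


class ReloadRequested(Exception):
    """Raised by the signal handler to unwind out of the main loop."""


def raise_reload(signum, frame):
    # Would be installed with signal.signal(signal.SIGHUP, raise_reload).
    raise ReloadRequested("received signal %d" % signum)


def spawn(run_iteration, cleanup):
    """Loop until the handler raises; always run cleanup on the way out."""
    try:
        while True:
            run_iteration()
    except ReloadRequested:
        return "reload"
    finally:
        # Runs for SIGHUP and for any other exit path alike, which is why
        # stop and reload share the same cleanup code.
        cleanup()
```

Note that nothing in this pattern touches child processes; they are only cleaned up if cleanup() does so explicitly.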

2. The second question was: when execution returns from reconfig_frontend back to glideinFrontend,
does it begin from the very start of the code so that it goes through a normal initialization?
The answer is yes; I observed that the code starts from the beginning.

#6 Updated by HyunWoo Kim almost 3 years ago

Today, I have been struggling with a question of what if reconfig_frontend fails for any reason.
In sl6 case, when we do
service gwms-frontend reconfig
we can see the messages on the terminal that will say that the reconfig process went wrong
and the process started again with the same old configuration.
This way, we at least know that the reconfig failed and we are still using the same old configuration.

But in the sl7 case, when we use the systemctl command for the reconfig,
we are not going to use the reconfig function of gwms-frontend.
Instead, /etc/systemd/system/gwms-frontend.service will have

[Service]
Type=forking
ExecStart=/usr/sbin/gwms-frontend start
ExecStop=/usr/sbin/gwms-frontend stop
ExecReload=/bin/kill -HUP $MAINPID

Here $MAINPID is the PID of glideinFrontend.
When glideinFrontend receives this SIGHUP signal, it will invoke os.execv to become
/usr/sbin/reconfig_frontend, which at the end will become glideinFrontend again.
Here, when something goes wrong during reconfig_frontend, we can still come back to glideinFrontend without exiting.
This way, glideinFrontend will still be running, but with the old configuration.
But the problem is that the systemctl reload gwms-frontend.service command will not produce any message saying that something went wrong during the reconfig.
And systemctl status gwms-frontend.service will not show anything either.

So, my proposal is that reconfig_frontend should be modified to call os.execv only at the end, when the reconfig was successful,
and we must leave all the exception-handling code as it is so that on failure it exits with an error code and does not go back to glideinFrontend,
i.e. it is equivalent to glideinFrontend stopping.
But a caveat is that systemctl reload will not produce any error message even in this case.
Users will notice that glideinFrontend is not running only if they check, for example with the systemctl status command.

So, we should emphasize this in the user manual and recommend using the gwms-frontend script directly, even though we provide
an option to use systemctl for reconfig.

#7 Updated by HyunWoo Kim almost 3 years ago

Besides these not-so-neat changes that we have to remind users about,
the code development itself for this ticket is (almost) over,
although there might still be some room for improvement.
Let me review the code changes again myself and then assign this to someone else for feedback.

#8 Updated by HyunWoo Kim almost 3 years ago

reconfig_frontend (or reconfig_glidein) can be invoked in 2 different cases:
1. from glideinFrontend or glideFactory when a user does systemctl reload
2. from /usr/sbin/gwms-frontend (or gwms-factory) reconfig (or upgrade)

In case 1, the reconfig_frontend/glidein script must use os.execv to go back to the main frontend/factory process,
but in case 2, it should NOT.

So, I defined a new option for reconfig_frontend/glidein called sl7reload.
glideinFrontend or glideFactory will invoke reconfig_frontend/glidein with the sl7reload option,
and the usual gwms-frontend or gwms-factory script will not use sl7reload.
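
The decision this option controls can be sketched as follows. This is an assumption-laden illustration rather than the merged code: the ticket names the option sl7reload but does not show its command-line spelling, so --sl7reload is a guess, and parse_options and next_step are hypothetical helpers. The failure branch follows the proposal in comment #6 (exit with an error code instead of going back).

```python
import argparse

MAIN_PROGRAM = "/usr/sbin/glideinFrontend"  # path as given in this ticket


def parse_options(argv):
    """Parse only the reload flag; all other arguments are ignored here."""
    parser = argparse.ArgumentParser(prog="reconfig_frontend")
    parser.add_argument("--sl7reload", action="store_true",
                        help="invoked by the main process on systemctl reload "
                             "(hypothetical spelling)")
    options, _unused = parser.parse_known_args(argv)
    return options


def next_step(reconfig_ok, sl7reload):
    """Decide what to do once the reconfig work has finished."""
    if not reconfig_ok:
        return ("exit", 1)             # never exec back after a failure
    if sl7reload:
        return ("exec", MAIN_PROGRAM)  # case 1: become the main process again
    return ("exit", 0)                 # case 2: plain command-line run
```

In the "exec" case the caller would then run os.execv() on the returned path.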

I also modified reconfig_frontend/glidein to send messages to syslog so that the systemctl status command will show what happened inside reconfig_frontend/glidein
during the systemctl reload command.

Another big change is that there are now sl7 versions of the startup templates:
creation/templates/frontend_initd_startup_template_sl7
and
creation/templates/factory_initd_startup_template_sl7
The change in these files is that when reconfig or upgrade is invoked AND the main process is running, the script errors out
and tells the user to stop the main process via systemctl stop.
If the original startup script's reconfig/upgrade functions were used, they would stop the main process before the reconfig/upgrade and start it again afterwards.
This would confuse systemctl and make it think the main process is not running, because a new process has been started with a new PID.
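
The "error out if the main process is running" guard might be sketched like this. It assumes a POSIX system and that the startup script already knows the main PID (for example from the lock file); is_running and check_reconfig_allowed are illustrative names, not functions from the templates.

```python
import os


def is_running(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True


def check_reconfig_allowed(pid):
    """Refuse reconfig/upgrade while the main process is alive."""
    if pid is not None and is_running(pid):
        return (False, "main process is running; stop it first with systemctl stop")
    return (True, "")
```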

In summary
- I tested the new frontend changes in sl7 VM
- I have to test the newest factory changes in sl7 VM
- I need to make sure the RPM generation picks up the new startup script for sl7
- I need to update the documentation and emphasize to users the new usage of reload, reconfig and upgrade

And then I will assign this for feedback

#9 Updated by HyunWoo Kim almost 3 years ago

The following is a draft of the new instructions for starting, stopping and reloading the service in SL 7

To see the status of the service
systemctl status    gwms-frontend.service
or
systemctl is-active gwms-frontend.service

To start the service
systemctl start  gwms-frontend.service

To stop the service
systemctl stop gwms-frontend.service

To enable the service
systemctl enable gwms-frontend.service

To see if the service is enabled
systemctl is-enabled gwms-frontend.service

There are 2 options for reconfig/upgrade
1. when the main service is running:
systemctl reload gwms-frontend.service

2. when the main service is NOT running
   (if the main service is running, stop it first with systemctl stop gwms-frontend.service):
/usr/sbin/gwms-frontend reconfig
/usr/sbin/gwms-frontend upgrade

Note that if you use the systemctl reload command, you have to check the status afterwards
by using systemctl status gwms-frontend.service

This is because if something goes wrong during the reload, the main service which was running before the reload
will stop running, and the error messages can be found only in systemctl status (or /var/log/messages)
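
For operators who script this, the reload-then-check advice could be wrapped as in the sketch below. The helper name reload_and_check and its injectable run parameter are illustrative; the only commands used are systemctl reload and systemctl is-active, both of which appear in these instructions.

```python
import subprocess


def reload_and_check(unit, run=subprocess.call):
    """Reload the unit, then confirm that it is still active.

    `run` takes an argument list and returns an exit status; it defaults
    to subprocess.call and can be replaced for dry runs.
    """
    reload_rc = run(["systemctl", "reload", unit])
    # is-active exits with status 0 only when the unit is active.
    active_rc = run(["systemctl", "is-active", unit])
    return reload_rc == 0 and active_rc == 0
```

A False result means the logs (systemctl status or /var/log/messages) should be inspected.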

#10 Updated by HyunWoo Kim almost 3 years ago

Today, I looked more closely at /etc/systemd/system/gwms-frontend(factory).service
to determine which options should be used:

The following are the minimum options that we have to use
in gwms-frontend.service or gwms-factory.service:

We need to specify that when we start gwms-frontend or gwms-factory, condor should be started first

[Unit]
Requires=condor.service
After=condor.service

We need to use forking for the Type so that "/usr/sbin/gwms-frontend start" is NOT the parent of the main process

[Service]
Type=forking

We want gwms-frontend/factory.service to be in the same target (runlevel) as condor.service

[Install]
WantedBy=
should list the same units that condor.service lists:

cat condor.service
[Install]
WantedBy=multi-user.target
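
Putting the pieces from this ticket together, the frontend unit can be assembled and printed with a short script. This is only a sketch for checking the combined result: every option value is copied from the comments in this ticket, nothing else is added, and build_unit_text is an illustrative helper rather than part of the build.

```python
import configparser
import io


def build_unit_text():
    """Render the gwms-frontend unit options discussed in this ticket."""
    unit = configparser.ConfigParser()
    unit.optionxform = str  # keep the CamelCase option names systemd expects
    unit["Unit"] = {"Requires": "condor.service", "After": "condor.service"}
    unit["Service"] = {
        "Type": "forking",
        "ExecStart": "/usr/sbin/gwms-frontend start",
        "ExecStop": "/usr/sbin/gwms-frontend stop",
        "ExecReload": "/bin/kill -HUP $MAINPID",
    }
    unit["Install"] = {"WantedBy": "multi-user.target"}
    buffer = io.StringIO()
    # systemd unit files conventionally use KEY=VALUE without spaces.
    unit.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

Printing build_unit_text() gives the three sections in order; the factory unit would differ only in the script paths.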

#11 Updated by HyunWoo Kim almost 3 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from HyunWoo Kim to Marco Mambelli

Assigning this to Marco Mambelli for feedback

#12 Updated by HyunWoo Kim almost 3 years ago

I updated this with Parag's suggestion from 2 weeks ago.
Namely, the usage is as follows.

Users can do the following:
- systemctl {start|stop|reload} gwms-frontend.service
- /usr/sbin/gwms-frontend {reconfig|upgrade} without any arguments when the main service is running
- if the main service is NOT running, /usr/sbin/gwms-frontend {reconfig|upgrade} with arguments as specified in the help

The question here is: can a user run /usr/sbin/reconfig_{frontend|glidein} directly?
The answer is no, i.e. users should always go through systemctl reload gwms-main.service
or /usr/sbin/gwms-main reconfig|upgrade

During this new test, I realized one issue:
during rpm generation, reconfig_frontend is updated to have

STARTUP_DIR="/var/lib/gwms-frontend/web-base" 

where the original template has STARTUP_DIR = sys.path[0]

But reconfig_glidein(factory) still has

STARTUP_DIR = sys.path[0]

and I believe this should be replaced by
STARTUP_DIR = "/var/lib/gwms-factory/web-base/" 

during the rpm generation.
I will have to talk with Marco Mambelli tomorrow.

Another new finding is that
if we put gwms-frontend(or factory).service under /usr/lib/systemd/system
(not /etc/systemd/system, which I have been using so far)
and remove /etc/init.d/gwms-service (it must be moved to /usr/sbin),
the RHEL6 syntax service gwms-frontend {start|stop|etc.} is
redirected to the systemctl syntax.

#13 Updated by HyunWoo Kim almost 3 years ago

branch v3/14194 is updated with most recent changes.
Marco can continue his review.

Marco will have to ensure the following when building:
1. The following 2 files should be put in /usr/lib/systemd/system:
- creation/templates/gwms-factory.service
- creation/templates/gwms-frontend.service
This will enable the redirection of the service command to systemctl,
i.e. we can use service gwms-frontend start|stop|reload even in the absence of /etc/init.d/gwms-frontend,
and these will be translated (redirected) to systemctl start|stop|reload gwms-frontend.service

2. The following 2 files should be put in /usr/sbin, not /etc/init.d/ (see 1 above):
- creation/templates/frontend_initd_startup_template_sl7 as /usr/sbin/gwms-frontend
- creation/templates/factory_initd_startup_template_sl7 as /usr/sbin/gwms-factory

#14 Updated by HyunWoo Kim almost 3 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from Marco Mambelli to HyunWoo Kim

Today, Marco Mambelli gave me his review comments, which I have just finished implementing.
I believe this can be included in the first release candidate of 3.2.17.

#15 Updated by HyunWoo Kim almost 3 years ago

  • Status changed from Assigned to Resolved

Finally merged into branch_v3_2 after rpm building test in buildmaster-jenkins.
Resolved.

#16 Updated by Parag Mhashilkar over 2 years ago

  • Status changed from Resolved to Closed

