Bug #4043

create_glidein fails with error on missing clientlogs/glideinlogs/clientproxies directories

Added by Parag Mhashilkar about 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
John Weigand
Category:
Factory
Target version:
Start date:
06/10/2013
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

This bug sneaked in recently, sometime after 2.7 (2.6.2?), and affects both the 2.7.1 and 3.1 release candidates.

Error:

[factory@fermicloud399 ~]$ .  /local/home/factory/master/glideinsubmit/factory.sh;/local/home/factory/master/glideinwms/creation/create_glidein /local/home/factory/master/glideinsubmit/glidein_v1_0.cfg/glideinWMS.xml
create_glidein [-writeback yes|no] [-debug] cfg_fname | -help

Missing base clientlog directory /var/factory/master/clientlogs/user_vo1user/glidein_v1_0.
[factory@fermicloud399 ~]$ 

History

#1 Updated by Igor Sfiligoi about 7 years ago

This was never meant to work.

Is the change in behavior just the error message?
Or did it actually work in the past?

#2 Updated by Parag Mhashilkar about 7 years ago

As far as I can remember, it worked in the past.

#3 Updated by John Weigand about 7 years ago

Igor,

Why do you say this was never meant to work?
It always did.
But now with Condor 7.9.4+, it does not.
It worked in the Q/A installer.

John Weigand

#4 Updated by Igor Sfiligoi about 7 years ago

The idea was to create the needed (base) directories with the installer.
If they did not exist, create_glidein was supposed to fail.

If it did work (correctly), great, but I am definitely surprised.
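A minimal sketch of the contract Igor describes, assuming the installer is responsible for the base directories and create_glidein only refuses to run when one is missing. The function name `check_base_dirs` and its dict argument are hypothetical, and the message wording is borrowed from the error in the description purely for illustration; this is not the actual create_glidein code.

```python
import os
import tempfile


def check_base_dirs(base_dirs):
    """Return an error string for the first missing base directory, else None.

    base_dirs maps a label (e.g. "clientlog") to a path that the installer,
    not create_glidein, is expected to have created beforehand.
    """
    for label, path in base_dirs.items():
        if not os.path.isdir(path):
            return "Missing base %s directory %s." % (label, path)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        logs = os.path.join(root, "clientlogs")
        os.mkdir(logs)
        proxies = os.path.join(root, "clientproxies")  # deliberately not created
        # Reports the proxies directory; create_glidein would stop at this point.
        print(check_base_dirs({"clientlog": logs, "clientproxy": proxies}))
```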

#5 Updated by John Weigand about 7 years ago

Igor,

It appears to be using condor_root_switchboard to create the
directories. That is as far as I have gotten on this so far.
Something is different between 7.9.1 (where it worked) and 7.9.4
(where it doesn't).

Are you aware of anything that might have changed in Condor
privilege separation in that timeframe release-wise?

John Weigand

#6 Updated by Igor Sfiligoi about 7 years ago

OK, let me understand the problem:
is this a glideinWMS or a Condor problem?
Parag's description implied glideinWMS; John seems to imply it is a Condor problem.

That said, I am not aware of any changes in Condor that should affect this, which does not mean there are none ;)
Also, can you confirm that the base directories are in place, the whole tree is owned by root, and the switchboard has the setuid bit set?
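The checks Igor asks for can be scripted. A sketch, assuming "the whole tree" means the base directory and all of its parent directories (the per-user directories below it are owned by the VO user, as the listings later in this ticket show). The helper names are hypothetical and the switchboard path is left to the caller.

```python
import os
import stat

# Hypothetical helpers for the checks asked about above; not glideinWMS code.


def is_setuid_root(st_uid, st_mode):
    """True when a file is owned by root (uid 0) and has the setuid bit set."""
    return st_uid == 0 and bool(st_mode & stat.S_ISUID)


def check_switchboard(path):
    """Stat the switchboard binary at path and apply is_setuid_root."""
    st = os.stat(path)
    return is_setuid_root(st.st_uid, st.st_mode)


def non_root_ancestors(path):
    """Return path and every parent directory of it that root does not own."""
    path = os.path.abspath(path)
    bad = []
    while True:
        if os.stat(path).st_uid != 0:
            bad.append(path)
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root "/"
            return bad
        path = parent
```

An empty list from `non_root_ancestors(base_dir)` together with `check_switchboard(...)` returning True would cover both points.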

#7 Updated by John Weigand about 7 years ago

Here is what is happening.

In HTCondor 7.9.1, condor_root_switchboard was creating
the directories with these permissions:

logs:
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 11 12:44 user_cms_vo
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 11 12:44 glidein_v2plus
drwxr-xr-x 2 cms_vo cms_vo 4096 Jun 11 12:44 entry_ress_ITB_INSTALL_TEST_2

proxies:
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 11 12:44 user_cms_vo
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 11 12:44 glidein_v2plus
drwx------ 2 cms_vo cms_vo 4096 Jun 11 12:44 entry_ress_ITB_INSTALL_TEST_2

Now in 7.9.5, the directories are created thus:

logs:
total 4
drwx------ 3 cms_vo cms_vo 4096 Jun 11 14:30 user_cms_vo
drwxr-xr-x 2 cms_vo cms_vo 4096 Jun 11 14:30 glidein_v2plus

proxies:
total 4
drwx------ 2 cms_vo cms_vo 4096 Jun 11 14:30 user_cms_vo
drwxr-xr-x 2 cms_vo cms_vo 4096 Jun 11 14:30 glidein_v2plus

The create process, at some point, does a simple OS check for the
existence of one of the directories, and because it is 0700, the check fails.
It then does an 'rm -rf' on each level, apparently to clean
up. This clean-up made it nearly impossible to see what was really happening
until I had create_glidein take some naps.
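Why a 0700 directory breaks the check can be modelled with the standard POSIX search-permission rule. In this simplified sketch (the uids are hypothetical), the factory process is neither the owner nor in the group of a 0700 directory owned by cms_vo, so it cannot traverse it; any existence test on a path beneath that directory then fails (Python's `os.path.isdir`, for example, returns False instead of raising).

```python
import stat


def can_search(dir_mode, dir_uid, dir_gid, uid, gids):
    """Simplified POSIX rule: may a process with (uid, gids) traverse the dir?

    Ignores ACLs, Linux capabilities other than root's override, and LSMs.
    """
    if uid == 0:
        return True  # root may search any directory on a stock Linux kernel
    if uid == dir_uid:
        return bool(dir_mode & stat.S_IXUSR)
    if dir_gid in gids:
        return bool(dir_mode & stat.S_IXGRP)
    return bool(dir_mode & stat.S_IXOTH)


# Hypothetical ids: factory user 500, VO user cms_vo 600.
FACTORY_UID, FACTORY_GIDS = 500, {500}
CMS_VO_UID, CMS_VO_GID = 600, 600

print(can_search(0o755, CMS_VO_UID, CMS_VO_GID, FACTORY_UID, FACTORY_GIDS))  # True
print(can_search(0o700, CMS_VO_UID, CMS_VO_GID, FACTORY_UID, FACTORY_GIDS))  # False
```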

I have not figured out why the top-level "username" directory gets 0700
and the next level down gets 0755, but it is a change in behavior in
condor_root_switchboard. These tests with 7.9.1 and 7.9.5 were done on
the same node, so umask is not a factor.

It appears we will need to explicitly set permissions on every directory
we create, as is already done for the bottom-level proxy directory.
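A minimal sketch of the idea: compute the full list of directories together with the mode each one should end up with, using the modes seen in the 7.9.1 listings (0755 everywhere except 0700 for the per-entry proxy directories). The function name and the base paths in the example are hypothetical; the `user_`, `glidein_` and `entry_` prefixes follow the directory names in the listings.

```python
import os


def client_dir_plan(logs_base, proxies_base, username, glidein, entries):
    """Return (path, mode) pairs in creation order, parents before children.

    Every level gets an explicit mode instead of relying on a default.
    """
    plan = []
    for base, entry_mode in ((logs_base, 0o755), (proxies_base, 0o700)):
        user_dir = os.path.join(base, "user_" + username)
        glidein_dir = os.path.join(user_dir, "glidein_" + glidein)
        plan.append((user_dir, 0o755))
        plan.append((glidein_dir, 0o755))
        for entry in entries:
            entry_dir = os.path.join(glidein_dir, "entry_" + entry)
            plan.append((entry_dir, entry_mode))
    return plan


for path, mode in client_dir_plan("/var/logs", "/var/proxies", "cms_vo",
                                  "master", ["ress_ITB_INSTALL_TEST_2"]):
    print(oct(mode), path)
```

Keeping the plan separate from the code that executes it makes it easy to apply the same modes whichever tool actually creates the directories.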

However, at this point my brain is totally fried; I will look at it
again tomorrow.

John Weigand

#8 Updated by John Weigand about 7 years ago

Committed to master_4043
hash: c6cbd19bf29f3a6447c7377bb8f8abd9c7d37ebe

The creation of the necessary client directories, when privilege separation is
in effect, requires the use of HTCondor's condor_root_switchboard. It appears
that its behavior changed between v7.9.1 and v7.9.4. The switchboard is
not explicitly documented, so troubleshooting was done largely by trial and
error and observation.

In v7.9.1, the first-level client logs/proxies directory (user_USERNAME) is
created using the 'mkdir' argument and gets 0755 permissions by default.
The lower-level directories are created with the 'exec' argument running
'/bin/mkdir', also getting 0755 by default. At the lowest level of the client
proxies tree, a directory for each factory entry point is created with
permissions explicitly set to 0700 by running the switchboard with the 'exec'
argument executing '/bin/mkdir -m 0700'.

drwxr-xr-x 3 root   root   4096 Jun 13 08:50 logs
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 logs/user_cms_vo
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 logs/user_cms_vo/glidein_master
drwxr-xr-x 2 cms_vo cms_vo 4096 Jun 13 08:50 logs/user_cms_vo/glidein_master/entry_ress_ITB_INSTALL_TEST_2

drwxr-xr-x 3 root   root   4096 Jun 13 08:50 proxies
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 proxies/user_cms_vo
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 proxies/user_cms_vo/glidein_master
drwx------ 2 cms_vo cms_vo 4096 Jun 13 08:50 proxies/user_cms_vo/glidein_master/entry_ress_ITB_INSTALL_TEST_2
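For reading these listings alongside the octal modes in the text, here is a small sketch that converts an `ls -l` permission string back to octal. `ls_to_mode` is a hypothetical helper; `stat.filemode` is the standard-library function for the reverse direction.

```python
import stat


def ls_to_mode(perms):
    """Convert an `ls -l` string such as 'drwxr-xr-x' to its rwx bits.

    Assumes no setuid/setgid/sticky letters (s, S, t, T) appear.
    """
    bits = 0
    for ch in perms[1:10]:  # skip the file-type character
        bits = (bits << 1) | (ch != "-")
    return bits


print(oct(ls_to_mode("drwxr-xr-x")))  # 0o755
print(oct(ls_to_mode("drwx------")))  # 0o700
print(stat.filemode(stat.S_IFDIR | 0o700))  # drwx------
```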

In v7.9.4 (or some release after 7.9.1), this changed: the first-level client
logs/proxies (user_USERNAME) directories were getting set to 0700.
Since condor_root_switchboard does not always return meaningful exit codes, and
sometimes a message in stderr has to be checked instead, create_glidein does its
own verification that the directories were created; if they were not, it rolls
back the directory structure by running the switchboard with the 'rmdir' argument.
So, unless you force a sleep, nothing will appear to have been created.
At the bottom level of the proxies directory structure, the 'mkdir -m 0700' also
no longer appeared to take effect, and the entry-level proxy directories were
being set to 0755.

drwx------ 3 root   root   4096 Jun 13 08:50 logs
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 logs/user_cms_vo
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 logs/user_cms_vo/glidein_master
drwxr-xr-x 2 cms_vo cms_vo 4096 Jun 13 08:50 logs/user_cms_vo/glidein_master/entry_ress_ITB_INSTALL_TEST_2

drwx------ 3 root   root   4096 Jun 13 08:50 proxies
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 proxies/user_cms_vo
drwxr-xr-x 3 cms_vo cms_vo 4096 Jun 13 08:50 proxies/user_cms_vo/glidein_master
drwxr-xr-x 2 cms_vo cms_vo 4096 Jun 13 08:50 proxies/user_cms_vo/glidein_master/entry_ress_ITB_INSTALL_TEST_2
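The verify-then-roll-back behavior described above explains why nothing seemed to be created. A minimal sketch of that pattern, using local `os.mkdir`/`os.rmdir` as stand-ins for the switchboard's 'mkdir' and 'rmdir' operations; the function name and the `verify` hook are hypothetical and exist only to simulate a check that wrongly reports a directory as missing.

```python
import os
import tempfile


def create_with_rollback(paths, verify=os.path.isdir):
    """Create paths in order; on any failure remove what was created.

    Returns True on success, False after a rollback. os.mkdir/os.rmdir stand
    in for the switchboard 'mkdir'/'rmdir' calls (a simplification).
    """
    created = []
    for path in paths:
        try:
            os.mkdir(path)
        except OSError:
            ok = False
        else:
            created.append(path)
            ok = verify(path)
        if not ok:
            for done in reversed(created):  # children first, so rmdir succeeds
                os.rmdir(done)
            return False
    return True


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    user = os.path.join(root, "user_cms_vo")
    glidein = os.path.join(user, "glidein_master")
    # Simulate the false "missing" result caused by an unsearchable parent.
    print(create_with_rollback([user, glidein], verify=lambda p: p != glidein))
    print(os.listdir(root))  # nothing is left behind
    os.rmdir(root)
```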

Rather than depend on default behavior in condor_root_switchboard, the change
performs the same series of mkdirs, but I added an explicit execution of
the switchboard that does the appropriate chmod to set permissions.
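The essence of the fix, sketched locally under the assumption that a plain `os.mkdir` followed by `os.chmod` is a fair stand-in for the switchboard mkdir plus explicit chmod: the final mode is set by an explicit call instead of being inherited from whatever the creating tool does by default. The helper name is hypothetical; the real change is in creation/lib/cgWDictFile.py and runs through condor_root_switchboard.

```python
import os
import stat
import tempfile


def make_dir_with_mode(path, mode):
    """Create path, then set its mode explicitly; return the final mode bits."""
    os.mkdir(path)        # mode here depends on defaults (umask, creating tool)
    os.chmod(path, mode)  # explicit step; chmod is not affected by the umask
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    old_umask = os.umask(0o077)  # forces a non-0755 default (0700) for the demo
    try:
        with tempfile.TemporaryDirectory() as root:
            print(oct(make_dir_with_mode(os.path.join(root, "logs_dir"), 0o755)))
            print(oct(make_dir_with_mode(os.path.join(root, "proxy_dir"), 0o700)))
    finally:
        os.umask(old_umask)
```

Even with a restrictive default in place, both directories end up with exactly the requested modes.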

Of note, this may be of interest regarding the future of PrivilegeSeparation and
may have been a factor in the changed behavior:
https://www-auth.cs.wisc.edu/lists/htcondor-devel/2013-April/msg00008.shtml

John Weigand

#9 Updated by John Weigand about 7 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from John Weigand to Parag Mhashilkar

Parag,

For your review.
Advise as to how/if you want me to merge into master and v2plus.

John Weigand

#10 Updated by John Weigand about 7 years ago

Forgot to mention the code affected:
creation/lib/cgWDictFile.py

John Weigand

#11 Updated by Parag Mhashilkar about 7 years ago

  • Assignee changed from Parag Mhashilkar to John Weigand
  • Target version changed from v2_7_2 to v3_1

Looks OK to merge.

Since you branched off master, merge it into master and cherry-pick it into branch_v2plus.

#12 Updated by John Weigand about 7 years ago

For future reference, Burt found this HTCondor ticket that caused the behavior change.
It was introduced in v7.9.2.
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3315
https://github.com/htcondor/htcondor/commit/1bea9598

John Weigand

#13 Updated by John Weigand about 7 years ago

  • Status changed from Feedback to Resolved

./creation/lib/cgWDictFile.py
merged into master: 47b5cc4 c6cbd19
cherry-picked into branch_v2plus: 65af3fbdb5bb41b5a0d5925ba1ce7492cc4d2897

John Weigand

#14 Updated by Parag Mhashilkar about 7 years ago

  • Status changed from Resolved to Closed

