Bug #2970
after_filelist is not populated with correct condor tarball
Description
In an internal test running Condor 7.8.3 glideins on gWMS
branch_v2_6_1_gf1, we cannot get 7.8.3 glideins even though the frontend
requests only 7.8.3; we get 7.8.2 glideins instead. We were able to run
7.8.3 glideins only after removing all Condor 7.8.2 references from the
factories' glideinWMS.xml.
We saw this behavior on two separate instances of the factory. Here are
the logs from one of them:
from job.709.0.err showing 7.8.3 was selected:
CONDOR_PLATFORM_7.8.3-default-default 1
09/17/12 11:01:14 (pid:30720) **********************************************
09/17/12 11:01:14 (pid:30720) * condor_startd (CONDOR_STARTD) STARTING UP
09/17/12 11:01:14 (pid:30720) * /data2/condor_local/execute/dir_22474/glide_U22614/main/condor/sbin/condor_startd
09/17/12 11:01:14 (pid:30720) * SubsystemInfo: name=STARTD type=STARTD class=DAEMON
09/17/12 11:01:14 (pid:30720) * Configuration: subsystem:STARTD local:<NONE> class:DAEMON
09/17/12 11:01:14 (pid:30720) * $CondorVersion: 7.8.2 Aug 08 2012 $
09/17/12 11:01:14 (pid:30720) * $CondorPlatform: x86_rhap_5.8 $
09/17/12 11:01:14 (pid:30720) * PID = 30720
09/17/12 11:01:14 (pid:30720) * Log last touched time unavailable (No such file or directory)
09/17/12 11:01:14 (pid:30720)
from glideinWMS.xml:
<condor_tarball arch="default" os="default"
tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz"
version="default"/>
<condor_tarball arch="default" os="default"
tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz"
version="7.8.3"/>
<condor_tarball arch="default" os="default"
tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz"
version="7.8.2"/>
Tim Mortensen
OSG Glidein Factory Operations
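To see at a glance which version labels share a tarball in a configuration like the one above, the `<condor_tarball>` entries can be grouped by `tar_file`. The following is a minimal sketch using only the Python standard library. The function name `tarballs_by_file` and the constants are illustrative and not part of the glideinWMS tooling; the paths are copied from the excerpt above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Directory and file names copied from the glideinWMS.xml excerpt above.
PRESTAGE = "/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage"
TAR_782 = PRESTAGE + "/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz"
TAR_783 = PRESTAGE + "/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz"

SAMPLE = f"""
<condor_tarballs>
  <condor_tarball arch="default" os="default" tar_file="{TAR_782}" version="default"/>
  <condor_tarball arch="default" os="default" tar_file="{TAR_783}" version="7.8.3"/>
  <condor_tarball arch="default" os="default" tar_file="{TAR_782}" version="7.8.2"/>
</condor_tarballs>
"""


def tarballs_by_file(xml_text):
    """Map each tar_file path to the version labels that reference it."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for node in root.iter("condor_tarball"):
        groups[node.get("tar_file")].append(node.get("version"))
    return dict(groups)


if __name__ == "__main__":
    for tar_file, versions in tarballs_by_file(SAMPLE).items():
        print(versions, "->", tar_file)
```

In this configuration the `default` and `7.8.2` labels legitimately share one tarball, so only `7.8.3` should resolve to a different staged file.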
History
#1 Updated by Parag Mhashilkar over 8 years ago
Update on what I found (from an email I sent to Tim):
After some extensive analysis, here is what I could find.
[0937] gfactory@cabinet-10-10-6 /tmp/parag$ ls -al /home/gfactory/glideinsubmit/glidein_v1_2/client_log/user_fe7/entry_CMS_T2_US_UCSD_gw2/job.708.0.err
-rw-rw-r-- 1 fe7 fe7 50042 Sep 17 11:21 /home/gfactory/glideinsubmit/glidein_v1_2/client_log/user_fe7/entry_CMS_T2_US_UCSD_gw2/job.708.0.err
The error file was created at 11:21 and picked up the Condor tarball
information from after_file_list.c9haCf.lst, which maps every Condor
tarball, irrespective of version, to condor_bin.c9haCf.tgz.
That file is a Condor 7.8.2 tarball:
[0942] gfactory@cabinet-10-10-6 /tmp/parag$ strings -a sbin/condor_startd | grep -i condorversion
CondorVersion $CondorVersion: 7.8.2 Aug 08 2012 $
glideinWMS.c9haCf.xml points to the correct Condor version tarballs, so
it looks like something went wrong during the reconfig and the
after_filelist was not updated correctly.
The current after_file_list.c9hgtW.lst in the staging area is correct,
with each tarball entry pointing to the right version. So something you
did fixed the problem and recreated the after_filelist correctly.
As it is set up now, when CONDOR_VERSION=default the glidein fetches
7.8.3. When CONDOR_VERSION is set explicitly, it fetches the matching
tarball, either 7.8.3 or 7.8.2, depending on the version requested.
I would like to know what you did in the first place that caused this
issue, and which later steps resulted in it being fixed. Do you remember
whether tar_file in <condor_tarball> pointed to the wrong version at any
time?
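The selection described above can be sketched as a lookup keyed on the condition variable: the job's stderr shows `CONDOR_PLATFORM_7.8.3-default-default 1`, and each `condor_bin_*` row of the after_filelist names such a condition together with the staged file to untar. The code below is an assumption about that flow, not the actual glidein startup code; `condition_name`, `pick_tarball`, and the two lists are hypothetical names. The staged file names come from this ticket: `BROKEN` reflects after_file_list.c9haCf.lst as described above, and `AFTER_UPGRADE` reflects the post-upgrade list quoted later in the history.

```python
def condition_name(version, os_name="default", arch="default"):
    """Build the condition variable name used in the after_filelist rows."""
    return f"CONDOR_PLATFORM_{version}-{os_name}-{arch}"


def pick_tarball(file_list, enabled_conditions):
    """Return the staged files whose condition is enabled for this glidein."""
    return [infile for condition, infile in file_list
            if condition in enabled_conditions]


# Every version mapped to the same staged file (the faulty list).
BROKEN = [
    (condition_name("default"), "condor_bin.c9haCf.tgz"),
    (condition_name("7.8.3"), "condor_bin.c9haCf.tgz"),
    (condition_name("7.8.2"), "condor_bin.c9haCf.tgz"),
]

# 7.8.3 has its own staged file (the list produced by the upgrade).
AFTER_UPGRADE = [
    (condition_name("default"), "condor_bin.c9lfL5.tgz"),
    (condition_name("7.8.3"), "condor_bin.c9lfMe.tgz"),
    (condition_name("7.8.2"), "condor_bin.c9lfL5.tgz"),
]

if __name__ == "__main__":
    requested = {condition_name("7.8.3")}
    print(pick_tarball(BROKEN, requested))
    print(pick_tarball(AFTER_UPGRADE, requested))
```

Under this reading, the condition for 7.8.3 is selected correctly in both cases; what differs is whether the selected row points at a staged file that actually contains 7.8.3.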
#2 Updated by Parag Mhashilkar over 8 years ago
Steps to reproduce the problem (from Tim):
- Remove all tarball references in glideinWMS.xml except for the default, then
do a reconfig. The following is produced:
<condor_tarballs>
  <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz" version="default"/>
</condor_tarballs>
http://cabinet-10-10-6.t2.ucsd.edu/glidefactory/stage/glidein_v1_2/after_file_list.c9lfKq.lst
# File: after_file_list.c9lfKq.lst
#
# Outfile InFile Cache/exec Condition ConfigOut
##############################################################################
validate_node.sh validate_node.c96hTg.sh exec TRUE FALSE
condor_platform_select.sh condor_platform_select.c96hTg.sh exec TRUE FALSE
condor_bin_default-default-default.tgz condor_bin.c9lfKq.tgz untar CONDOR_PLATFORM_default-default-default CONDOR_DIR
create_mapfile.sh create_mapfile.c96hTg.sh exec TRUE FALSE
collector_setup.sh collector_setup.c96hTg.sh exec TRUE FALSE
gcb_setup.sh gcb_setup.c6le0z.sh exec TRUE FALSE
glexec_setup.sh glexec_setup.c96hTg.sh exec TRUE FALSE
java_setup.sh java_setup.c8gdK4.sh exec TRUE FALSE
glidein_memory_setup.sh glidein_memory_setup.c96hTg.sh exec TRUE FALSE
condor_startup.sh condor_startup.c96hTg.sh exec TRUE FALSE
- Then put the config back to original and do another reconfig:
<condor_tarballs>
  <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="default"/>
  <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz" version="7.8.3"/>
  <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="7.8.2"/>
</condor_tarballs>
http://cabinet-10-10-6.t2.ucsd.edu/glidefactory/stage/glidein_v1_2/after_file_list.c9lfL5.lst
# File: after_file_list.c9lfL5.lst
#
# Outfile InFile Cache/exec Condition ConfigOut
##############################################################################
validate_node.sh validate_node.c96hTg.sh exec TRUE FALSE
condor_platform_select.sh condor_platform_select.c96hTg.sh exec TRUE FALSE
condor_bin_default-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_default-default-default CONDOR_DIR
condor_bin_7.8.3-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_7.8.3-default-default CONDOR_DIR
condor_bin_7.8.2-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_7.8.2-default-default CONDOR_DIR
create_mapfile.sh create_mapfile.c96hTg.sh exec TRUE FALSE
collector_setup.sh collector_setup.c96hTg.sh exec TRUE FALSE
gcb_setup.sh gcb_setup.c6le0z.sh exec TRUE FALSE
glexec_setup.sh glexec_setup.c96hTg.sh exec TRUE FALSE
java_setup.sh java_setup.c8gdK4.sh exec TRUE FALSE
glidein_memory_setup.sh glidein_memory_setup.c96hTg.sh exec TRUE FALSE
condor_startup.sh condor_startup.c96hTg.sh exec TRUE FALSE
- Finally do an upgrade without changing anything:
<condor_tarballs>
  <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="default"/>
  <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz" version="7.8.3"/>
  <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="7.8.2"/>
</condor_tarballs>
http://cabinet-10-10-6.t2.ucsd.edu/glidefactory/stage/glidein_v1_2/after_file_list.c9lfMe.lst
# File: after_file_list.c9lfMe.lst
#
# Outfile InFile Cache/exec Condition ConfigOut
##############################################################################
validate_node.sh validate_node.c96hTg.sh exec TRUE FALSE
condor_platform_select.sh condor_platform_select.c96hTg.sh exec TRUE FALSE
condor_bin_default-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_default-default-default CONDOR_DIR
condor_bin_7.8.3-default-default.tgz condor_bin.c9lfMe.tgz untar CONDOR_PLATFORM_7.8.3-default-default CONDOR_DIR
condor_bin_7.8.2-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_7.8.2-default-default CONDOR_DIR
create_mapfile.sh create_mapfile.c96hTg.sh exec TRUE FALSE
collector_setup.sh collector_setup.c96hTg.sh exec TRUE FALSE
gcb_setup.sh gcb_setup.c6le0z.sh exec TRUE FALSE
glexec_setup.sh glexec_setup.c96hTg.sh exec TRUE FALSE
java_setup.sh java_setup.c8gdK4.sh exec TRUE FALSE
glidein_memory_setup.sh glidein_memory_setup.c96hTg.sh exec TRUE FALSE
condor_startup.sh condor_startup.c96hTg.sh exec TRUE FALSE
So apparently something is wrong with the reconfig. I think I ran an
upgrade (without retesting) before you started troubleshooting, which
"fixed" the problem.
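The difference between the reconfig and upgrade outputs above can be checked mechanically: two versions whose `tar_file` entries differ should never resolve to the same staged `condor_bin.*.tgz`. The following is a hedged sketch of such a check, not part of glideinWMS. The helper names (`condor_rows`, `conflicts`, `condition_name`) are hypothetical, the five-column layout is taken from the list header above, and only the `condor_bin_*` rows of each list are reproduced, with tarball basenames standing in for the full paths.

```python
from itertools import combinations

# version label -> source tarball (basenames from glideinWMS.xml above)
TAR_FILES = {
    "default": "gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz",
    "7.8.3": "gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz",
    "7.8.2": "gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz",
}

AFTER_RECONFIG = """\
# Outfile InFile Cache/exec Condition ConfigOut
condor_bin_default-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_default-default-default CONDOR_DIR
condor_bin_7.8.3-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_7.8.3-default-default CONDOR_DIR
condor_bin_7.8.2-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_7.8.2-default-default CONDOR_DIR
"""

AFTER_UPGRADE = """\
# Outfile InFile Cache/exec Condition ConfigOut
condor_bin_default-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_default-default-default CONDOR_DIR
condor_bin_7.8.3-default-default.tgz condor_bin.c9lfMe.tgz untar CONDOR_PLATFORM_7.8.3-default-default CONDOR_DIR
condor_bin_7.8.2-default-default.tgz condor_bin.c9lfL5.tgz untar CONDOR_PLATFORM_7.8.2-default-default CONDOR_DIR
"""


def condition_name(version):
    """Condition variable for a version with default os and arch."""
    return f"CONDOR_PLATFORM_{version}-default-default"


def condor_rows(list_text):
    """Return {condition: InFile} for the condor_bin rows of a file list."""
    rows = {}
    for line in list_text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].startswith("condor_bin_"):
            rows[fields[3]] = fields[1]
    return rows


def conflicts(tar_files, rows):
    """Version pairs with different source tarballs but one staged file."""
    found = []
    for a, b in combinations(sorted(tar_files), 2):
        staged_a = rows.get(condition_name(a))
        staged_b = rows.get(condition_name(b))
        if staged_a is None or staged_b is None:
            continue  # version not present in this file list
        if tar_files[a] != tar_files[b] and staged_a == staged_b:
            found.append((a, b))
    return found


if __name__ == "__main__":
    print("reconfig:", conflicts(TAR_FILES, condor_rows(AFTER_RECONFIG)))
    print("upgrade: ", conflicts(TAR_FILES, condor_rows(AFTER_UPGRADE)))
```

Note that `default` and `7.8.2` sharing condor_bin.c9lfL5.tgz is not flagged, because both labels reference the same source tarball in glideinWMS.xml.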
#3 Updated by Parag Mhashilkar over 8 years ago
- Status changed from Assigned to Feedback
- Assignee changed from Parag Mhashilkar to Douglas Strain
Committed to branch_v2plus_2970
I have now tested several combinations and tried to cover as many possibilities as I could.
#4 Updated by Douglas Strain over 8 years ago
- Assignee changed from Douglas Strain to Parag Mhashilkar
Parag walked me through the changes. I approve, and this can be merged into v2plus. It should probably be merged into master as well.
#5 Updated by Parag Mhashilkar over 8 years ago
- Status changed from Feedback to Resolved
branch_v2plus: commit:899fe76
master: commit:839165b
#6 Updated by Parag Mhashilkar about 8 years ago
- Status changed from Resolved to Closed