Bug #4657

Feature #2454: Advertise classad in case of glidein failure

Error classads still not working in v3_1

Added by Igor Sfiligoi about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assignee: Igor Sfiligoi
Category: Glidein
Target version:
Start date: 09/19/2013
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Jeff's tests show that the error classads still do not work.
The code committed for #2454 has major bugs in it.

History

#1 Updated by Igor Sfiligoi about 7 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar

Fixed in branch_v3_2plus_igor_4657.
Basically, we were missing a few attributes early on in the glidein setup that were needed for error classad generation.

Please review, and I will merge it back to v3_2 and master.

#2 Updated by Igor Sfiligoi about 7 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

Parag gave me the go ahead.

Merged back to both branch_v3_2 and master

#3 Updated by Igor Sfiligoi about 7 years ago

  • Status changed from Resolved to Assigned
  • Priority changed from Normal to High

The original 2454 patch also broke the condor binary selection.

The FE-provided attributes are not being honored.

#4 Updated by Igor Sfiligoi about 7 years ago

We have a major architectural problem here;
there is no good place to load the Condor binaries before all the FE scripts are run!

See the load procedure:

$ grep lst job.8397.0.err |grep file |grep Sign
Signature OK for main:file_list.d95ng5.lst.
Signature OK for client:preentry_file_list.d6afkj.lst.
Signature OK for client_group:preentry_file_list.d9bh1Q.lst.
Signature OK for client:aftergroup_preentry_file_list.d2phku.lst.
Signature OK for entry:file_list.d95ng5.lst.
Signature OK for client:file_list.d2phku.lst.
Signature OK for client_group:file_list.d2phku.lst.
Signature OK for client:aftergroup_file_list.d2phku.lst.
Signature OK for main:after_file_list.d95ng5.lst.

The factory main scripts and files are loaded only at the very start and at the very end.
In the first section, I don't have the necessary info yet, since the Condor version is often entry-specific.

And the last section is obviously too late if we want to return useful info to the FE.
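To make the ordering problem easier to see, here is a minimal Python sketch that parses the "Signature OK" lines from the grep output above and reports where each section is loaded. The helper names (`load_order`, `positions`) are hypothetical and not part of glideinWMS; the log text is copied verbatim from this comment.

```python
import re

# Log lines copied from the grep output in this comment.
LOG = """\
Signature OK for main:file_list.d95ng5.lst.
Signature OK for client:preentry_file_list.d6afkj.lst.
Signature OK for client_group:preentry_file_list.d9bh1Q.lst.
Signature OK for client:aftergroup_preentry_file_list.d2phku.lst.
Signature OK for entry:file_list.d95ng5.lst.
Signature OK for client:file_list.d2phku.lst.
Signature OK for client_group:file_list.d2phku.lst.
Signature OK for client:aftergroup_file_list.d2phku.lst.
Signature OK for main:after_file_list.d95ng5.lst.
"""

# "<section>:<file list>" followed by the trailing period of the log line.
SIG_RE = re.compile(r"^Signature OK for (?P<section>\w+):(?P<file>\S+)\.$")


def load_order(log_text):
    """Return (section, file list) pairs in the order they were verified."""
    order = []
    for line in log_text.splitlines():
        match = SIG_RE.match(line.strip())
        if match:
            order.append((match.group("section"), match.group("file")))
    return order


def positions(order, section):
    """Return the indices at which a given section appears in the load order."""
    return [i for i, (name, _) in enumerate(order) if name == section]


order = load_order(LOG)
for index, (section, file_list) in enumerate(order):
    print(index, section, file_list)
print("main loaded at positions:", positions(order, "main"))
print("entry loaded at positions:", positions(order, "entry"))
```

In this log the `main` section shows up only at the first and last positions, while the `entry` file list sits in the middle, which is the gap described above.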

#5 Updated by Igor Sfiligoi about 7 years ago

  • Priority changed from High to Normal

I was wrong... it does not break platform selection.
The platform attributes are pushed as parameters, so we have them from the start.

#6 Updated by Igor Sfiligoi about 7 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar

Found the bug... I was simply putting the condor binaries in the wrong dictionary.
Fix committed to branch_v3_2plus_igor_4657.

#7 Updated by Igor Sfiligoi about 7 years ago

Found and fixed another problem.

The split between condor_vars.lst and condor_vars.lst.entry was biting us.
I have moved CONDOR_DIR and the x509-related attributes from the entry file into the main factory file.

Changes are now in branch_v3_2igor_4657

#8 Updated by Parag Mhashilkar about 7 years ago

  • Assignee changed from Parag Mhashilkar to Burt Holzman

I reviewed it. This looks ok. Burt, you wanted to try this out first on your setup? Assigning it to you before merging. Feel free to assign it back to me when you are done, and I will take care of merging and tagging rc4.

#9 Updated by Burt Holzman about 7 years ago

Yes, I'll test this in all its permutations today.

#10 Updated by Burt Holzman about 7 years ago

  • Subject changed from Error classads still not workin gin v3_1 to Error classads still not working in v3_1
  • Assignee changed from Burt Holzman to Parag Mhashilkar

I tested this branch (branch_v3_2plus_igor_4657, commit:8b82cd4) with all four locations (FE inside and outside group, factory inside and outside entry).

Glidein failed while running entry/factory-bad-validation-script.sh. Keeping node busy until 1380047320 (Tue Sep 24 18:28:40 UTC 2013).
Glidein failed while running client_group/factory-bad-validation-script.sh. Keeping node busy until 1380047746 (Tue Sep 24 18:35:46 UTC 2013).
Glidein failed while running main/factory-bad-validation-script.sh. Keeping node busy until 1380047093 (Tue Sep 24 18:24:53 UTC 2013).
Glidein failed while running client/factory-bad-validation-script.sh. Keeping node busy until 1380047950 (Tue Sep 24 18:39:10 UTC 2013).

Looks good to me.
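For anyone repeating this check, the following is a rough Python sketch that parses the four failure lines above and confirms that each of the four script locations produced one. It assumes that the four locations correspond to the `main`, `entry`, `client` and `client_group` prefixes seen in the log; the function name `parse_failures` is hypothetical and not part of glideinWMS.

```python
import re
from datetime import datetime, timezone

# Failure lines copied from the test output in this comment.
LOG = """\
Glidein failed while running entry/factory-bad-validation-script.sh. Keeping node busy until 1380047320 (Tue Sep 24 18:28:40 UTC 2013).
Glidein failed while running client_group/factory-bad-validation-script.sh. Keeping node busy until 1380047746 (Tue Sep 24 18:35:46 UTC 2013).
Glidein failed while running main/factory-bad-validation-script.sh. Keeping node busy until 1380047093 (Tue Sep 24 18:24:53 UTC 2013).
Glidein failed while running client/factory-bad-validation-script.sh. Keeping node busy until 1380047950 (Tue Sep 24 18:39:10 UTC 2013).
"""

FAIL_RE = re.compile(
    r"^Glidein failed while running (?P<section>[^/]+)/(?P<script>\S+)\. "
    r"Keeping node busy until (?P<epoch>\d+) \((?P<stamp>[^)]*)\)\.$"
)

# Assumption: these four prefixes are the four locations that were tested.
EXPECTED_SECTIONS = {"main", "entry", "client", "client_group"}


def parse_failures(log_text):
    """Return one dict per failure line: section, script, busy-until time, stamp."""
    failures = []
    for line in log_text.splitlines():
        match = FAIL_RE.match(line.strip())
        if match:
            failures.append({
                "section": match.group("section"),
                "script": match.group("script"),
                "busy_until": datetime.fromtimestamp(
                    int(match.group("epoch")), tz=timezone.utc
                ),
                "stamp": match.group("stamp"),
            })
    return failures


failures = parse_failures(LOG)
seen = {failure["section"] for failure in failures}
print("sections with a reported failure:", sorted(seen))
print("missing sections:", sorted(EXPECTED_SECTIONS - seen))
for failure in failures:
    print(failure["section"], failure["busy_until"].isoformat())
```

The epoch value is converted to UTC so it can be compared against the human-readable timestamp printed in the same log line.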

#11 Updated by Parag Mhashilkar about 7 years ago

  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

Merged it to master.

#12 Updated by Parag Mhashilkar about 7 years ago

  • Status changed from Feedback to Closed
