Bug #20880

Collector and CCB strings cut at comma. Group separator character cannot be allowed in collector and CCB strings.

Added by Marco Mascheroni 9 months ago. Updated 8 months ago.

Status: Closed
Priority: High
Category: -
Target version:
Start date: 09/19/2018
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Despite what I said in the meeting, I verified that the issue should NOT be there if both factory and frontend are 3.4.1. My startds were failing to connect for other reasons, but the strings are not cut. Anyway, this is the issue with new frontend/old factory.

The CMS frontend setup looks like:

...
<ccbs>
<ccb DN="/DC=ch/DC=cern/OU=computers/CN=vocms0816.cern.ch" group="ccb1" node="vocms0816.cern.ch:9619-9644"/>
</ccbs>
<collectors>
<collector DN="/DC=ch/DC=cern/OU=computers/CN=vocms0804.cern.ch" group="c1" node="vocms0804.cern.ch" secondary="False"/>
<collector DN="/DC=ch/DC=cern/OU=computers/CN=vocms0804.cern.ch" group="c1" node="vocms0804.cern.ch:9620-9645" secondary="True"/>
<collector DN="/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=cmssrv215.fnal.gov" group="c2" node="cmssrv215.fnal.gov" secondary="False"/>
<collector DN="/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=cmssrv215.fnal.gov" group="c2" node="cmssrv215.fnal.gov:9621-9676" secondary="True"/>
</collectors>
...

And this is what the frontend was publishing to the factory collector (which is correct):

[mmascher@vocms0202 ~]$ condor_status -any 29094_CMSHTPC_T0_CH_CSCS_HPC_arc06@gfactory_instance@CERN -af GlideinParamGLIDEIN_Collector GlideinParamGLIDEIN_CCB
cmssrv215.fnal.gov:$RANDOM_INTEGER(9621,9676);vocms0804.cern.ch:$RANDOM_INTEGER(9620,9645) vocms0816.cern.ch:$RANDOM_INTEGER(9619,9644)

However, looking at a pilot, the string is cut after the comma:

[mmascher@vocms0204 ~]$ grep GLIDEIN_CCB /var/log/gwms-factory/client/user_fecmsglobalitb/glidein_gfactory_instance/entry_CMSHTPC_T3_CH_CERN_DOMA/job.24966.1.err
GLIDEIN_CCB vocms0816.cern.ch:$RANDOM_INTEGER(9619

[mmascher@vocms0204 ~]$ grep Collector /var/log/gwms-factory/client/user_fecmsglobalitb/glidein_gfactory_instance/entry_CMSHTPC_T3_CH_CERN_DOMA/job.24966.1.err
GLIDEIN_Site_Collector cet02.cern.ch:9619
GLIDEIN_Collector cmssrv215.fnal.gov:$RANDOM_INTEGER(9621,vocms0809.cern.ch:$RANDOM_INTEGER(9621
GLIDEIN_Master_Collector cmssrv215.fnal.gov:$RANDOM_INTEGER(9621,vocms0809.cern.ch:$RANDOM_INTEGER(9621
COLLECTOR_HOST = $(HEAD_NODE),$(GLIDEIN_Site_Collector)
MASTER.COLLECTOR_HOST = $(GLIDEIN_Master_Collector)
#MASTER.COLLECTOR_HOST = $(GLIDEIN_Master_Collector),$(GLIDEIN_Site_Collector)
GLIDEIN_Master_Collector=cmssrv215.fnal.gov:$RANDOM_INTEGER(9621,vocms0809.cern.ch:$RANDOM_INTEGER(9621
GLIDEIN_Site_Collector=cet02.cern.ch:9619

So this was less severe than I thought. I am not even sure we want to fix this; maybe we just want to say that the factory will need to be upgraded first this round? Assigning to Marco in case he wants to close this as a "false positive", or redirect to Lorena if it needs some work.

History

#1 Updated by Lorena Lobato Pardavila 9 months ago

  • Assignee changed from Marco Mambelli to Lorena Lobato Pardavila
  • Priority changed from Normal to High

#2 Updated by Marco Mambelli 9 months ago

  • Subject changed from Backward compatibility issue between frontend3.4.1/factory3.4: collector and CCB strings cut to Collector and CCB strings cut at comma. Group separator character cannot be allowed in collector and CCB strings.

The problem is not only backward compatibility.
The "," character is used as the separator when different (high availability) groups of collectors or CCBs are used.
So the comma in RANDOM_INTEGER(a,b) has to be escaped or substituted, otherwise things will not work.
I discussed different plans w/ Lorena:
1- specify all ranges as N1-N2 (ports and socks) and replace them w/ RANDOM_INTEGER in the glidein scripts (at the end): the only solution that would not force an upgrade order (as long as no shared port is used until the factory is updated), but too much work according to Lorena
2- replace the separator character with some other character not allowed in sinful strings, e.g. %: this would require a simultaneous upgrade of factory and frontends according to Marco
3- express the ranges as RANDOM_INTEGER(a-b) and replace the dash with a comma after the lists have been dealt with (shell scripts or condor_startup.sh): this will require the Factory to be upgraded before the frontend, but will work
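
To illustrate why the comma is a problem, here is a minimal Python sketch. It assumes the list is split naively on the "," separator; split_groups is a hypothetical helper for illustration, not the actual factory or glidein code.

```python
# Illustration only: a naive split on the "," separator.
# split_groups is a hypothetical helper, not GlideinWMS code.
published = "vocms0816.cern.ch:$RANDOM_INTEGER(9619,9644)"


def split_groups(value, sep=","):
    """Split a collector/CCB string on the separator character."""
    return value.split(sep)


groups = split_groups(published)
# The comma inside RANDOM_INTEGER(a,b) is treated as a separator,
# so the first element is the truncated string seen in the pilot log.
print(groups[0])
```

The first element is "vocms0816.cern.ch:$RANDOM_INTEGER(9619", which matches the truncated GLIDEIN_CCB value reported in the description.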

Changed title from "Backward compatibility issue between frontend3.4.1/factory3.4: collector and CCB strings cut"

#3 Updated by Marco Mambelli 9 months ago

With Lorena we decided to go for Solution 1.
This will also solve the problem reported by Marco Mascheroni w/ v3.4 Frontend and v3.4.1 factory, where the factory was unable to resolve the collector addresses including ranges.

The goal is not to add constraints to the upgrade order.
The only constraint is that a collector 3.4.1 will have to keep the old condor configuration if the factory is 3.4. It can switch to shared port only once the Factory is 3.4.1. All other functionalities will be OK.
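
A rough sketch of the idea behind Solution 1, assuming ranges travel as host:N1-N2 and are expanded to RANDOM_INTEGER only at the very end, after the list has been split. The real change lives in the glidein scripts; the function names, the regular expression, and the use of Python here are assumptions made for illustration only.

```python
import re

# Matches "host:lo-hi" where lo and hi are port numbers (assumed format).
_RANGE_RE = re.compile(r"^(?P<host>[^:]+):(?P<lo>\d+)-(?P<hi>\d+)$")


def expand_port_range(node):
    """Turn 'host:lo-hi' into 'host:$RANDOM_INTEGER(lo,hi)'.

    Entries without a port range are returned unchanged.
    """
    m = _RANGE_RE.match(node)
    if m is None:
        return node
    return "%s:$RANDOM_INTEGER(%s,%s)" % (m.group("host"), m.group("lo"), m.group("hi"))


def expand_list(value, sep=","):
    """Split on the separator first, then expand each entry.

    Because the split happens while ranges still use '-', no comma
    inside RANDOM_INTEGER can be mistaken for a separator.
    """
    return sep.join(expand_port_range(n) for n in value.split(sep))


print(expand_list("cmssrv215.fnal.gov:9621-9676,vocms0804.cern.ch:9620-9645"))
```

The point of the ordering is that the comma is introduced only after all list handling is done, so an older component that never sees RANDOM_INTEGER cannot cut the string.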

#4 Updated by Lorena Lobato Pardavila 9 months ago

  • Status changed from New to Resolved

Merged v34/20880 into master. Ticket can be resolved.

For the record:

We remember from one of our monthly calls that, given a list of CCBs in CCB_ADDRESS, HTCondor will connect to all of them to listen for communication requests, will start using one at random, and will fall back to the following ones in case of errors when sending messages. We are waiting for the HTCondor team to confirm this CCB behavior, so that we can be more precise in the documentation about CCBs.

#5 Updated by Lorena Lobato Pardavila 9 months ago

They have confirmed that when you configure a list of CCBs, each daemon will register with all of the CCB servers that you give and include all of them in its sinful string. When a client tool, or another daemon tries to contact that daemon, it will try the CCB servers in random order until it gets a successful connection.
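
The client-side behavior described above can be sketched roughly as follows. This is a toy model for the documentation discussion, not HTCondor code: contact_via_ccb and the try_connect callback are hypothetical names introduced here.

```python
import random


def contact_via_ccb(ccb_servers, try_connect, rng=random):
    """Try the CCB servers in random order.

    try_connect is a hypothetical callback returning True on success.
    Returns the first server that works, or None if all of them fail.
    """
    candidates = list(ccb_servers)
    rng.shuffle(candidates)  # random order, as described by the HTCondor team
    for server in candidates:
        if try_connect(server):
            return server
    return None


# Example: only one of the (made-up) servers is reachable.
servers = ["ccb1.example.org:9619", "ccb2.example.org:9619", "ccb3.example.org:9619"]
reachable = contact_via_ccb(servers, lambda s: s.startswith("ccb2"))
print(reachable)
```

Since the daemon registers with every CCB in the list, any single reachable server is enough for the client to get through.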

Updated collector_setup.sh and frontend/configuration.html with more precise information about how to deal with the CCB list. Already merged into master.

#6 Updated by Marco Mambelli 8 months ago

  • Status changed from Resolved to Closed

