Collector and CCB strings cut at comma. Group separator character cannot be allowed in collector and CCB strings.
Despite what I said in the meeting, I verified that the issue should NOT occur if both factory and frontend are 3.4.1. My startds were failing to connect for other reasons, but the strings are not cut. Anyway, this is an issue with new frontend/old factory.
The CMS frontend setup looks like:
<ccb DN="/DC=ch/DC=cern/OU=computers/CN=vocms0816.cern.ch" group="ccb1" node="vocms0816.cern.ch:9619-9644"/>
<collector DN="/DC=ch/DC=cern/OU=computers/CN=vocms0804.cern.ch" group="c1" node="vocms0804.cern.ch" secondary="False"/>
<collector DN="/DC=ch/DC=cern/OU=computers/CN=vocms0804.cern.ch" group="c1" node="vocms0804.cern.ch:9620-9645" secondary="True"/>
<collector DN="/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=cmssrv215.fnal.gov" group="c2" node="cmssrv215.fnal.gov" secondary="False"/>
<collector DN="/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=cmssrv215.fnal.gov" group="c2" node="cmssrv215.fnal.gov:9621-9676" secondary="True"/>
And this is what the frontend was publishing to the factory collector (which is correct):
[mmascher@vocms0202 ~]$ condor_status -any 29094_CMSHTPC_T0_CH_CSCS_HPC_arc06@gfactory_instance@CERN-ITB-Dev@CMSG-ITBDEV.test -af GlideinParamGLIDEIN_Collector GlideinParamGLIDEIN_CCB
However, looking at a pilot, the string is cut after the comma:
[mmascher@vocms0204 ~]$ grep GLIDEIN_CCB /var/log/gwms-factory/client/user_fecmsglobalitb/glidein_gfactory_instance/entry_CMSHTPC_T3_CH_CERN_DOMA/job.24966.1.err
[mmascher@vocms0204 ~]$ grep Collector /var/log/gwms-factory/client/user_fecmsglobalitb/glidein_gfactory_instance/entry_CMSHTPC_T3_CH_CERN_DOMA/job.24966.1.err
COLLECTOR_HOST = $(HEAD_NODE),$(GLIDEIN_Site_Collector)
MASTER.COLLECTOR_HOST = $(GLIDEIN_Master_Collector)
#MASTER.COLLECTOR_HOST = $(GLIDEIN_Master_Collector),$(GLIDEIN_Site_Collector)
So this was less severe than I thought. I am not even sure we need to fix this; maybe we just want to say that the factory will need to be upgraded first this round? Assigning to Marco in case he wants to close this as a "false positive", or redirect to Lorena if it needs some work.
#2 Updated by Marco Mambelli over 1 year ago
- Subject changed from Backward compatibility issue between frontend3.4.1/factory3.4: collector and CCB strings cut to Collector and CCB strings cut at comma. Group separator character cannot be allowed in collector and CCB strings.
The problem is not only backward compatibility.
The "," character is used as separator when different (high availability) groups of collectors or CCBs are used.
So the comma in RANDOM_INTEGER(a,b) has to be escaped or substituted, otherwise things will not work.
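To illustrate the clash, here is a minimal Python sketch. The value below is hypothetical (it is not copied from a real classad), and naive_split is an illustrative helper, not a GlideinWMS function. It only shows what happens when a list that uses "," as separator contains a RANDOM_INTEGER(a,b) expression:

```python
def naive_split(address_list: str) -> list[str]:
    """Split a collector/CCB list on the separator character, with no escaping."""
    return address_list.split(",")


# Hypothetical list of two addresses; the first one carries a port range
# already expressed as a RANDOM_INTEGER macro (assumed form, for illustration).
value = "vocms0816.cern.ch:$RANDOM_INTEGER(9619,9644),cmssrv215.fnal.gov:9618"

parts = naive_split(value)
for part in parts:
    print(part)
# The comma inside RANDOM_INTEGER(...) is treated as a list separator,
# so the first address is cut in two and three fragments come out.
```

The first fragment ends right after the first port number, which matches the "cut after the comma" symptom described above.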
I discussed different plans w/ Lorena:
1- specify all ranges as N1-N1 (ports and socks) and replace w/ RANDOM_INTEGER in the glidein scripts (at the end): the only solution that would not force an upgrade order (as long as no shared port is used until the factory is updated), but too much work according to Lorena
2- replace the separator character with some other character not allowed in sinful strings, e.g. "%": this would require a simultaneous upgrade of factory and frontends according to Marco
3- express the ranges as RANDOM_INTEGER(a-b) and replace the dash with a comma after the lists have been dealt with (shell scripts or condor_startup.sh): this will require the Factory to be upgraded before the frontend but will work
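A rough sketch of the idea behind option 1, written in Python for readability (the option itself refers to the glidein scripts, and this is not the merged code). The names expand_ranges and process_list and the ":low-high" port syntax handled by the regular expression are assumptions made for illustration. The point is only the ordering: the list is split first, while ranges still use a dash, and the comma is introduced afterwards.

```python
import re

# Assumed form of a port range inside an address: ":<low>-<high>"
RANGE_RE = re.compile(r":(\d+)-(\d+)")


def expand_ranges(address: str) -> str:
    """Turn 'host:low-high' into 'host:$RANDOM_INTEGER(low,high)'."""
    return RANGE_RE.sub(r":$RANDOM_INTEGER(\1,\2)", address)


def process_list(address_list: str) -> list[str]:
    """Split on the list separator first, then expand ranges in each entry."""
    entries = address_list.split(",")  # safe: no commas inside entries yet
    return [expand_ranges(entry) for entry in entries]


result = process_list("vocms0804.cern.ch:9620-9645,cmssrv215.fnal.gov:9621-9676")
for entry in result:
    print(entry)
```

Addresses without a range, such as a bare hostname, pass through expand_ranges unchanged.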
Changed title from "Backward compatibility issue between frontend3.4.1/factory3.4: collector and CCB strings cut"
#3 Updated by Marco Mambelli over 1 year ago
With Lorena we decided to go for Solution 1.
This will also solve the problem reported by Marco Mascheroni w/ v3.4 Frontend and v3.4.1 factory, where the factory was unable to resolve the collector addresses including ranges.
The goal is not to add constraints to the upgrade order.
The only constraint is that a collector 3.4.1 will have to keep the old condor configuration if the factory is 3.4. It can switch to shared port only once the Factory is 3.4.1. All other functionalities will be OK.
#4 Updated by Lorena Lobato Pardavila over 1 year ago
- Status changed from New to Resolved
Merged v34/20880 into master. Ticket can be resolved.
For the record:
We recall from one of our monthly calls that, given a list of CCBs in CCB_ADDRESS, HTCondor will connect to all of them to listen for communication requests; to send messages it will start with one at random and fall back to the following ones in case of errors. We are waiting for the HTCondor team to confirm this behavior so that the documentation about CCBs can be more precise.
#5 Updated by Lorena Lobato Pardavila over 1 year ago
They have confirmed that when you configure a list of CCBs, each daemon will register with all of the CCB servers that you give and include all of them in its sinful string. When a client tool or another daemon tries to contact that daemon, it will try the CCB servers in random order until it gets a successful connection.
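The client-side behavior described here can be modeled with a small Python sketch. This is only an illustration of "try the CCB servers in random order until one works"; it is not HTCondor code. connect_via_ccb and the try_connect callback are hypothetical names, and the server names in the usage lines are placeholders.

```python
import random
from typing import Callable, Optional, Sequence


def connect_via_ccb(
    ccb_servers: Sequence[str],
    try_connect: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Try each CCB server once, in random order; return the first that works."""
    order = list(ccb_servers)
    (rng or random).shuffle(order)  # random order, as described above
    for server in order:
        if try_connect(server):
            return server
    return None  # every listed CCB server failed


# Usage with a fake connection function: only "ccb2" is reachable here.
chosen = connect_via_ccb(["ccb1", "ccb2", "ccb3"], lambda s: s == "ccb2")
print(chosen)
```

Because the daemon registers with every CCB server in the list, any server that accepts the connection is good enough for the client.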
Updated collector_setup.sh and frontend/configuration.html with more precise information about how the CCB list is handled. Already merged into master.