Feature #7677

Separate the CCB collectors from User collectors

Added by Parag Mhashilkar about 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: High
Category: -
Target version:
Start date: 01/23/2015
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS
Duration:

Description

Provide a means for the frontend admins to specify CCB collectors, primary and secondary, in the same way as the user collectors.

Hi Brian, Parag,

I was talking with Edgar about the scaling limitations he ran into at ~150,000 jobs in his tests, and he said moving the CCBs off the Collector nodes was essential to getting to 200K. Before that, the collectors would drop ClassAds like crazy, which is perhaps similar to what we are now seeing above 90,000 jobs. Any thoughts on this? Might this be the next big thing we would need from glideinWMS in order to move safely above 100K?

Regards,

James

History

#1 Updated by Parag Mhashilkar almost 5 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

#2 Updated by Parag Mhashilkar almost 5 years ago

  • Priority changed from Normal to High

#3 Updated by Parag Mhashilkar almost 5 years ago

  • Target version changed from v3_2_9 to v3_2_10

#4 Updated by Marco Mambelli almost 5 years ago

  • Assignee changed from Marco Mambelli to Parag Mhashilkar
  • Status changed from New to Feedback

Code is in branch master_7899 (by mistake it ended up in the same branch).
Coded, documented and tested.

The ccb_collectors is a list similar to the collectors list but there is no difference between groups or primary/secondary.

#5 Updated by Parag Mhashilkar almost 5 years ago

  • Target version changed from v3_2_10 to v3_2_9

#6 Updated by Parag Mhashilkar almost 5 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

#7 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed

#8 Updated by Marco Mambelli over 4 years ago

Just adding a note about tests done to verify the feature and correctness of the implementation.
From an email sent to factory operators.

You should see the information in the logs, in the glidein configuration, and in MyAddress.
I think at the time I checked the communications as well (it is using UDP, so you have to inspect packets), but I'm not sure.
I repeated a test and the results are below.
I'm using 3.2.11-2 (latest), but it should also work with previous versions since 3.2.9 (there was an issue with 3.2.9 if the list of CCBs was too long).
  • vo frontend: fermicloud303.fnal.gov
  • factory: fermicloud309.fnal.gov
  • ccb: fermicloud083
  • worker node: fermicloud025

Added to the frontend.xml:

   <ccbs>
     <ccb DN="/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=fermicloud083.fnal.gov" group="default" node="fermicloud083.fnal.gov:9618" />
   </ccbs>
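
As a quick sanity check on the frontend.xml fragment above, the CCB entries can be listed with a few lines of Python. This is only a sketch using the standard library XML parser; the helper name `list_ccbs` is made up for illustration and is not part of glideinWMS, and it assumes the `<ccbs>` element has been extracted as a standalone string.

```python
import xml.etree.ElementTree as ET

# The <ccbs> fragment exactly as added to frontend.xml in the test above.
FRONTEND_XML_FRAGMENT = """
<ccbs>
  <ccb DN="/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=fermicloud083.fnal.gov" group="default" node="fermicloud083.fnal.gov:9618" />
</ccbs>
"""


def list_ccbs(xml_text):
    """Return (node, group, DN) for every <ccb> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return [(c.get("node"), c.get("group"), c.get("DN")) for c in root.iter("ccb")]


for node, group, dn in list_ccbs(FRONTEND_XML_FRAGMENT):
    print(f"node={node} group={group} DN={dn}")
```

The `node` value printed here is the one expected to show up as GLIDEIN_CCB and CCB_ADDRESS in the glidein configuration.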

I can see the use of the CCB in the glidein configuration (on the worker node):

[root@fermicloud025 glide_hBhGIp]# export CONDOR_CONFIG=/tmp/glide_hBhGIp/condor_config ; condor_config_val -dump | grep -i ccb
CCB_ADDRESS = fermicloud083.fnal.gov:9618
CCB_HEARTBEAT_INTERVAL = 300
CCB_POLLING_INTERVAL = 20
CCB_POLLING_MAX_INTERVAL = 600
CCB_POLLING_TIMESLICE = 0.05
CCB_RECONNECT_FILE =
CCB_SERVER_READ_BUFFER = 2048
CCB_SERVER_WRITE_BUFFER = 2048
CCB_SWEEP_INTERVAL = 1200
GLIDEIN_CCB = fermicloud083.fnal.gov:9618
GLIDEIN_VARIABLES = GLIDEIN_Expose_Grid_Env,GLIDEIN_Glexec_Use,GLIDECLIENT_ReqNode,GLIDEIN_Site,USE_CCB,GLIDEIN_REQUIRE_VOMS,GLIDEIN_REQUIRE_GLEXEC_USE,GLIDEIN_Max_Idle,GLIDEIN_Max_Tail,MASTER_GCB_RECONNECT_TIMEOUT,GLIDECLIENT_Name,GLIDECLIENT_Group,GLIDECLIENT_Signature,GLIDECLIENT_Group_Signature,GLIDEIN_Factory,GLIDEIN_Name,GLIDEIN_CredentialIdentifier,GLIDEIN_Signature,GLIDEIN_Description_File,GLIDEIN_Entry_Name,GLIDEIN_Entry_Signature,GLIDEIN_Description_Entry_File,GLIDEIN_ClusterId,GLIDEIN_ProcId,GLIDEIN_Schedd,GLIDEIN_Gatekeeper,GLIDEIN_GridType,GLIDEIN_GlobusRSL,GLIDEIN_X509_GRIDMAP_DNS,GLIDEIN_Tmp_Dir,GLIDEIN_Site,GLIDEIN_Job_Max_Time,GLIDEIN_Graceful_Shutdown,JOB_INHERITS_STARTER_ENVIRONMENT,GLIDEIN_Glexec_Use,GLIDEIN_Monitoring_Enabled,PREEMPT_GRACE_TIME,HOLD_GRACE_TIME,USE_CCB,GLEXEC_STARTER,GLEXEC_JOB,GLIDEIN_SiteWMS,GLIDEIN_SiteWMS_Slot,GLIDEIN_SiteWMS_JobId,GLIDEIN_SiteWMS_Queue
USE_CCB = "True" 

I can see the glidein in the CCB log file (/var/log/condor/CollectorLog):

10/05/15 12:46:40 DC_AUTHENTICATE: Success.
10/05/15 12:46:40 IPVERIFY: matched user vofrontend_service@fermicloud083.fnal.gov from * to allow list
10/05/15 12:46:40 IPVERIFY: checking fermicloud025.fnal.gov against 131.225.154.159
10/05/15 12:46:40 IPVERIFY: matched 131.225.154.159 to 131.225.154.159
10/05/15 12:46:40 IPVERIFY: ip found is 1
10/05/15 12:46:40 Adding to resolved authorization table: vofrontend_service@fermicloud083.fnal.gov/131.225.154.159: DAEMON
10/05/15 12:46:40 PERMISSION GRANTED to vofrontend_service@fermicloud083.fnal.gov from host 131.225.154.159 for command 67 (CCB_REGISTER), access level DAEMON: reason: DAEMON authorization policy allows IP address 131.225.154.159; identifiers used for this remote host: 131.225.154.159,fermicloud025.fnal.gov
10/05/15 12:46:40 Received TCP command 67 (CCB_REGISTER) from vofrontend_service@fermicloud083.fnal.gov <131.225.154.159:37410>, access level DAEMON
10/05/15 12:46:40 Calling HandleReq <CCBServer::HandleRegistration> (0) for command 67 (CCB_REGISTER) from vofrontend_service@fermicloud083.fnal.gov <131.225.154.159:37410>
10/05/15 12:46:40 Current Socket bufsize=85k
10/05/15 12:46:40 Current Socket bufsize=244k
10/05/15 12:46:40 CCB: registered target daemon MASTER <131.225.154.159:42884?noUDP> on <131.225.154.159:37410> with ccbid 1
10/05/15 12:46:40 Return from HandleReq <CCBServer::HandleRegistration> (handler: 0.039s, sec: 0.062s, payload: 0.000s)
10/05/15 12:46:40 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.1000s
10/05/15 12:46:40 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (4)
10/05/15 12:46:40 DC_AUTHENTICATE: received DC_AUTHENTICATE from <131.225.154.159:34401>
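
To find registrations like the one above in a long CollectorLog, a small filter can help. The sketch below assumes the log line format is exactly as in the excerpt ("CCB: registered target daemon ... with ccbid N"); the regular expression and the function name `find_ccb_registrations` are illustrative and not taken from HTCondor or glideinWMS.

```python
import re

# Pattern inferred from the single log line in the excerpt above (an assumption).
REGISTRATION_RE = re.compile(
    r"CCB: registered target daemon (\S+) <([^>]+)> on <([^>]+)> with ccbid (\d+)"
)


def find_ccb_registrations(lines):
    """Yield (daemon, target_address, socket_address, ccbid) per registration line."""
    for line in lines:
        match = REGISTRATION_RE.search(line)
        if match:
            daemon, target, sock, ccbid = match.groups()
            yield daemon, target, sock, int(ccbid)


sample = [
    "10/05/15 12:46:40 Current Socket bufsize=244k",
    "10/05/15 12:46:40 CCB: registered target daemon MASTER "
    "<131.225.154.159:42884?noUDP> on <131.225.154.159:37410> with ccbid 1",
]
for registration in find_ccb_registrations(sample):
    print(registration)
```

In practice the `lines` argument could be an open file handle on /var/log/condor/CollectorLog.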

And I do see the CCB also in the MyAddress:

-bash-4.1$ condor_status -l | grep -i ccb
StartdIpAddr = "<131.225.154.159:38936?CCBID=131.225.155.30:9618#7&noUDP>" 
USE_CCB = "True" 
MyAddress = "<131.225.154.159:38936?CCBID=131.225.155.30:9618#7&noUDP>" 
StartdIpAddr = "<131.225.154.159:53817?CCBID=131.225.155.30:9618#9&noUDP>" 
USE_CCB = "True" 
MyAddress = "<131.225.154.159:53817?CCBID=131.225.155.30:9618#9&noUDP>" 
-bash-4.1$ host 131.225.155.30
30.155.225.131.in-addr.arpa domain name pointer fermicloud083.fnal.gov.
-bash-4.1$ host 131.225.154.159
159.154.225.131.in-addr.arpa domain name pointer fermicloud025.fnal.gov.
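
The CCB part of these addresses can also be pulled out programmatically. The sketch below assumes the address layout seen in the output above (`CCBID=<host:port>#<id>` inside the angle brackets); the helper `parse_ccbid` is hypothetical, and real addresses may carry forms that this pattern does not cover.

```python
import re

# CCBID=<ccb address>#<ccb id>, as it appears in the MyAddress values above.
CCBID_RE = re.compile(r"CCBID=([^#&>]+)#(\d+)")


def parse_ccbid(address):
    """Return (ccb_address, ccb_id), or None if the address has no CCBID."""
    match = CCBID_RE.search(address)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


my_address = "<131.225.154.159:38936?CCBID=131.225.155.30:9618#7&noUDP>"
print(parse_ccbid(my_address))  # ('131.225.155.30:9618', 7)
```

Here 131.225.155.30 resolves to fermicloud083.fnal.gov, the CCB configured in frontend.xml, which is what the test is meant to confirm.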

And I see the CCB also in the parameters used in the glidein submission (-param_GLIDEIN_CCB):

condor_q -g -l | grep CCB
Arguments = "-v std -name gfactory_instance -entry ITB_FC_CE2 -clientname fermicloud303-fnal-gov_OSG_gWMSFrontend.main -schedd schedd_glideins3@fermicloud309.fnal.gov -proxy None -factory gfactory_service -web http://fermicloud309.fnal.gov/factory/stage -sign b15ef0881d070e95f08e0e4c7c2bf87215c80a6b -signentry 673d064f9b3c3864733b2aa1b4beebd567bea4f9 -signtype sha1 -descript description.f9id5t.cfg -descriptentry description.f9id5t.cfg -dir OSG -param_GLIDEIN_Client fermicloud303-fnal-gov_OSG_gWMSFrontend.main -submitcredid 199773 -slotslayout fixed -clientweb http://fermicloud303.fnal.gov/vofrontend/stage -clientsign aae13979c9a44452ce256996312c493ba00b12a7 -clientsigntype sha1 -clientdescript description.fa5c6h.cfg -clientgroup main -clientwebgroup http://fermicloud303.fnal.gov/vofrontend/stage/group_main -clientsigngroup 963a549b83c7b87345db262e478102c294e94768 -clientdescriptgroup description.fa5c6h.cfg -param_CONDOR_VERSION default -param_GLIDECLIENT_ReqNode fermicloud309.dot,fnal.dot,gov -param_GLIDECLIENT_Rank 1 -param_GLIDEIN_CCB fermicloud083.dot,fnal.dot,gov.colon,9618 -param_CONDOR_OS rhel6 -param_CONDOR_ARCH default -param_USE_MATCH_AUTH True -param_GLIDEIN_Report_Failed NEVER -param_GLIDEIN_Collector fermicloud303.dot,fnal.dot,gov.colon,9620.minus,9660 -cluster 8 -subcluster 0"
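
The parameter values in the Arguments string are escaped (`.dot,`, `.colon,`, `.minus,`), which makes them hard to compare by eye with the frontend.xml values. The sketch below reverses only the three escapes visible in this output; the mapping is inferred from the values shown here, the function name `decode_param` is made up, and glideinWMS may define further escapes that are not handled.

```python
# Escapes observed in the Arguments string above (inferred, possibly incomplete).
ESCAPES = {
    ".dot,": ".",
    ".colon,": ":",
    ".minus,": "-",
}


def decode_param(value):
    """Replace the observed escape tokens with the characters they stand for."""
    for token, char in ESCAPES.items():
        value = value.replace(token, char)
    return value


print(decode_param("fermicloud083.dot,fnal.dot,gov.colon,9618"))
# fermicloud083.fnal.gov:9618
print(decode_param("fermicloud303.dot,fnal.dot,gov.colon,9620.minus,9660"))
# fermicloud303.fnal.gov:9620-9660
```

The first decoded value matches the `node` attribute of the `<ccb>` element, and the second shows the port range of the user collector.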

I also ran a follow-up test with 2 entries in 2 groups, one with a port range, and the behavior was correct.
