Bug #7080

Changes to 01_gwms_collectors.config based on large scale testing from operations

Added by Parag Mhashilkar over 5 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 09/29/2014
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

On Sep 26, 2014, at 4:50 PM, Fajardo Hernandez, Edgar wrote:

Dear GlideIn WMS Team,

During the current round of large-scale testing, we found that the usual config here is not optimal:

https://github.com/holzman/glideinWMS/blob/master/install/templates/01_gwms_collectors.config#L44

CONDOR_VIEW_HOST = $(COLLECTOR_HOST)

We think this should be changed to:

CONDOR_VIEW_HOST = localhost

and

UDP_LOOPBACK_FRAGMENT_SIZE = 60000

That way, the communication between the secondary collectors and the main collector needs fewer packets than it currently uses.

Cheers,

-E

History

#1 Updated by Parag Mhashilkar over 5 years ago

Input from Brian

Hi Parag,

Actually, looking at Edgar's email, you don't need this one:

UDP_LOOPBACK_FRAGMENT_SIZE = 60000

(that's already the default)

Brian

#2 Updated by Marco Mambelli over 5 years ago

CONDOR_VIEW_HOST is used for the multiple collectors (there is no real condor_view with POOL_HISTORY_DIR or KEEP_POOL_HISTORY).
The rationale is that localhost connections are more efficient than using the FQDN.

Looking at the log files, though, I see authentication errors, probably because GSI certificates do not cover localhost.

10/28/14 10:49:31 Connecting to CONDOR_VIEW_HOST localhost
10/28/14 10:49:51 condor_read(): timeout reading 5 bytes from <127.0.0.1:9618>.
10/28/14 10:49:51 IO: Failed to read packet header
10/28/14 10:49:51 SECMAN: no classad from server, failing
10/28/14 10:49:51 ERROR: SECMAN:2004:Failed to create security session to <127.0.0.1:9618> with TCP.|SECMAN:2007:Failed to end classad message.
10/28/14 10:49:51 Can't send command 11 to View Collector localhost
10/28/14 10:49:51 condor_write(): Socket closed when trying to write 266 bytes to <127.0.0.1:50008>, fd is 6
10/28/14 10:49:51 Buf::write(): condor_write() failed
10/28/14 10:49:51 SECMAN: Error sending response classad to <127.0.0.1:50008>!
AuthMethods = "FS,GSI" 

This error does not occur when using COLLECTOR_HOST; in that case the condor_view is skipped for the main collector.

I tried the following:

CONDOR_VIEW_HOST = localhost
COLLECTOR.CONDOR_VIEW_HOST =
COLLECTOR1.CONDOR_VIEW_HOST = localhost
COLLECTOR2.CONDOR_VIEW_HOST = localhost
COLLECTOR3.CONDOR_VIEW_HOST = localhost
COLLECTOR4.CONDOR_VIEW_HOST = localhost
...

or
CONDOR_VIEW_HOST =
COLLECTOR1.CONDOR_VIEW_HOST = localhost
COLLECTOR2.CONDOR_VIEW_HOST = localhost
COLLECTOR3.CONDOR_VIEW_HOST = localhost
COLLECTOR4.CONDOR_VIEW_HOST = localhost
...

But neither seems to work. The collectors are not forwarding to CONDOR_VIEW_HOST.
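The per-collector overrides in the second attempt follow a regular pattern, so they can be generated rather than typed by hand. The following is only a sketch: the function name `view_host_overrides` and the number of secondary collectors are hypothetical, and the output mirrors the configuration tried above, which did not work on the HTCondor version under test at that point.

```python
def view_host_overrides(n_secondary, view_host="localhost"):
    """Build the config lines from the second attempt above.

    n_secondary is a hypothetical count of secondary collectors
    (COLLECTOR1 .. COLLECTORn); the ticket does not state the real number.
    """
    # Empty global value, so the main collector gets no view host.
    lines = ["CONDOR_VIEW_HOST ="]
    # One override per secondary collector.
    for i in range(1, n_secondary + 1):
        lines.append(f"COLLECTOR{i}.CONDOR_VIEW_HOST = {view_host}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(view_host_overrides(4))
```

Calling `view_host_overrides(4)` reproduces the four explicit COLLECTORn lines listed above, preceded by the empty global setting.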

#3 Updated by Marco Mambelli over 5 years ago

I re-did the test on HTCondor 8.2.3 (after checking Edgar's installation) and localhost is identified correctly (127.0.0.1, localhost.localdomain, localhost)
e.g.: 10/28/14 17:20:45 Not forwarding to View Server 127.0.0.1 - self referential

I kept the logs of the old test, and it was failing on HTCondor 8.0. On HTCondor 8.2 I tested only the second configuration.

I looked at the condor src and I found
Sinful::addressPointsToMe( Sinful const &addr ) const
using:
addr.getSinful() && addrsock.from_sinful(addr.getSinful()) && addrsock.is_loopback() )
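To illustrate the idea behind that check (not the HTCondor implementation itself), here is a minimal Python sketch that decides whether an address of the form seen in the logs, such as `<127.0.0.1:9618>`, is a loopback address. The helper names `parse_sinful` and `points_to_loopback` are made up for this example, and the parsing assumes a simplified sinful string of the form `<host:port>` with an optional `?params` suffix.

```python
import ipaddress


def parse_sinful(sinful):
    """Split a simplified sinful string like '<127.0.0.1:9618>' into (host, port).

    Assumption: the string is '<host:port>' or '<host:port?params>';
    real sinful strings may carry more structure than handled here.
    """
    if not (sinful.startswith("<") and sinful.endswith(">")):
        raise ValueError(f"not a sinful string: {sinful!r}")
    body = sinful[1:-1].split("?", 1)[0]  # drop optional '?params'
    host, _, port = body.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"cannot parse host and port from {sinful!r}")
    return host.strip("[]"), int(port)  # strip brackets around IPv6 literals


def points_to_loopback(sinful):
    """Return True if the address in the sinful string is a loopback IP."""
    host, _port = parse_sinful(sinful)
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Host is not an IP literal (e.g. a hostname); this sketch does not resolve names.
        return False


if __name__ == "__main__":
    for addr in ("<127.0.0.1:9618>", "<192.0.2.10:9618>"):
        print(addr, points_to_loopback(addr))
```

A collector that sees its view host resolve to a loopback address can treat the target as itself and skip forwarding, which matches the "self referential" log message quoted above.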

I also found this ticket, which confirms that this was added in 8.1/8.2:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4329

I have to confirm with the condor team when this was changed, and then I think we can make the change once HTCondor 8.2 from OSG goes into production.

PS
By the way, I also found this ticket, which relates to the other request and mentions the defaults for loopback and network UDP:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4321

#4 Updated by Marco Mambelli about 5 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar

I pushed the changes to v3/7080.
Ready for feedback.

This introduces a dependency on HTCondor 8.2.X for the factory.
These changes should remain on hold until HTCondor 8.2.X is in OSG production (sometime in November).

#5 Updated by Parag Mhashilkar about 5 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Changes look OK. Merged to branch_v3_2 and master. Resolving.

#6 Updated by Parag Mhashilkar about 5 years ago

  • Status changed from Resolved to Closed

