Changes to 01_gwms_collectors.config based on large scale testing from operations
On Sep 26, 2014, at 4:50 PM, Fajardo Hernandez, Edgar wrote:
Dear GlideIn WMS Team,
During the current round of large scale testing, we found an issue with the usual config here:
CONDOR_VIEW_HOST = $(COLLECTOR_HOST)
We think this should be changed to:
CONDOR_VIEW_HOST = localhost
UDP_LOOPBACK_FRAGMENT_SIZE = 60000
This way, the communication between the secondary Collectors and the main Collector needs fewer packets than it currently uses.
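The packet-count argument can be illustrated with a little arithmetic. The sketch below is only a rough model: the 20000-byte update size and the 1000-byte comparison fragment size are assumptions chosen for illustration, not values from this ticket, and per-fragment header overhead is ignored.

```python
import math


def fragments_needed(message_bytes: int, fragment_size: int) -> int:
    """Return how many UDP fragments carry a message of message_bytes.

    Simplified model: every fragment holds exactly fragment_size bytes of
    payload; header overhead is ignored.
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return max(1, math.ceil(message_bytes / fragment_size))


# Hypothetical update size and small fragment size, for comparison only.
update_bytes = 20000
small = fragments_needed(update_bytes, 1000)
# 60000 is the UDP_LOOPBACK_FRAGMENT_SIZE proposed in this ticket.
large = fragments_needed(update_bytes, 60000)
print(f"fragment size 1000:  {small} packets")
print(f"fragment size 60000: {large} packets")
```

Under these assumptions a single update that would be split into many small datagrams fits into one datagram on the loopback interface.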
#2 Updated by Marco Mambelli over 5 years ago
CONDOR_VIEW_HOST is used for the multiple collectors (there is no real condor_view with POOL_HISTORY_DIR or KEEP_POOL_HISTORY).
The rationale is that localhost connections are more efficient than using the FQDN.
Looking at the log files, though, I see authentication errors, probably because GSI certificates do not cover localhost.
10/28/14 10:49:31 Connecting to CONDOR_VIEW_HOST localhost
10/28/14 10:49:51 condor_read(): timeout reading 5 bytes from <127.0.0.1:9618>.
10/28/14 10:49:51 IO: Failed to read packet header
10/28/14 10:49:51 SECMAN: no classad from server, failing
10/28/14 10:49:51 ERROR: SECMAN:2004:Failed to create security session to <127.0.0.1:9618> with TCP.|SECMAN:2007:Failed to end classad message.
10/28/14 10:49:51 Can't send command 11 to View Collector localhost
10/28/14 10:49:51 condor_write(): Socket closed when trying to write 266 bytes to <127.0.0.1:50008>, fd is 6
10/28/14 10:49:51 Buf::write(): condor_write() failed
10/28/14 10:49:51 SECMAN: Error sending response classad to <127.0.0.1:50008>!
AuthMethods = "FS,GSI"
This error is not happening when using COLLECTOR_HOST and the condor_view is skipped for the main collector.
I tried the following:
CONDOR_VIEW_HOST = localhost
COLLECTOR.CONDOR_VIEW_HOST =
COLLECTOR1.CONDOR_VIEW_HOST = localhost
COLLECTOR2.CONDOR_VIEW_HOST = localhost
COLLECTOR3.CONDOR_VIEW_HOST = localhost
COLLECTOR4.CONDOR_VIEW_HOST = localhost
...

CONDOR_VIEW_HOST =
COLLECTOR1.CONDOR_VIEW_HOST = localhost
COLLECTOR2.CONDOR_VIEW_HOST = localhost
COLLECTOR3.CONDOR_VIEW_HOST = localhost
COLLECTOR4.CONDOR_VIEW_HOST = localhost
...
But neither seems to work: the collectors are not forwarding to CONDOR_VIEW_HOST.
#3 Updated by Marco Mambelli over 5 years ago
I re-did the test on HTCondor 8.2.3 (after checking Edgar's installation) and localhost is identified correctly (127.0.0.1, localhost.localdomain, localhost)
e.g.: 10/28/14 17:20:45 Not forwarding to View Server 127.0.0.1 - self referential
I kept the logs of the old test, and it was failing on HTCondor 8.0. On HTCondor 8.2 I tested only the second configuration.
I looked at the condor src and I found
Sinful::addressPointsToMe( Sinful const &addr ) const
addr.getSinful() && addrsock.from_sinful(addr.getSinful()) && addrsock.is_loopback() )
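To make the loopback part of that condition concrete, here is a minimal Python sketch of the idea. It is not HTCondor's implementation: the function name `sinful_is_loopback` is hypothetical, and the parsing is a simplification that only handles IPv4 strings of the form `<ip:port>` as they appear in the logs above.

```python
import ipaddress


def sinful_is_loopback(sinful: str) -> bool:
    """Return True if an '<ip:port>' string points at a loopback address.

    Simplified illustration: IPv4 only, any '?params' suffix is dropped,
    and anything that does not parse as an IP address yields False.
    """
    body = sinful.strip().strip("<>")
    # Drop optional parameters, then split off the port.
    host = body.split("?", 1)[0].rsplit(":", 1)[0]
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


print(sinful_is_loopback("<127.0.0.1:9618>"))
print(sinful_is_loopback("<192.168.1.10:9618>"))
```

A collector that recognizes its view host as a loopback address on its own port can skip forwarding to itself, which matches the "self referential" message seen in the 8.2.3 log.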
But I found also this ticket that confirms that this was added in 8.1/8.2:
I have to confirm with the condor team when this was changed; then I think we can make the change once HTCondor 8.2 from OSG goes into production.
By the way, I found also this that refers to the other requests and mentions the defaults for loopback and network UDP:
#4 Updated by Marco Mambelli about 5 years ago
- Status changed from New to Feedback
- Assignee changed from Marco Mambelli to Parag Mhashilkar
I pushed the changes to v3/7080
Ready for feedback.
This introduces a dependency on HTCondor 8.2.X for the factory.
These changes should remain on hold until HTCondor 8.2.X is in OSG production (sometime in November)