Fix fork.py behavior (was: reproduce crashes on glidein2.chtc.wisc.edu, provide fix)
glideinwms/lib/fork.py was changed in v3.2.20 to use epoll() instead of select() for #17067.
It was tested heavily on the factory but not well enough on the frontend side of things.
When glidein2.chtc.wisc.edu was upgraded to 3.2.20, it started throwing uncaught exceptions, eventually crashing the frontend. A temporary fix was made to roll back fork.py code to the previous release.
This urgently needs to be reproduced and understood.
As a side note, changes to the rrd files during the upgrade make rolling back to the previous release difficult. If there is a way to read/write the rrd files that tolerates new fields being tacked on to the end of the metadata, it should be adopted.
#1 Updated by Marco Mambelli over 1 year ago
Findings: In 3.2.20 the code was changed to use epoll instead of select to improve scalability, still falling back on select if epoll is not available.
It was also changed to catch specific exceptions instead of the generic "except:".
There was a bug in the new code: a function was returning only the first file descriptor instead of the expected list of file descriptors, so work backed up on loaded systems and an OSError was triggered down the road. Had that OSError been caught, the Frontend could have continued to operate, but after the switch to specific exceptions it was no longer caught.
In the new code I'm taking care of both: fixing the epoll behavior and catching the OSError.
I'm also optimizing epoll/poll by adding a timeout of 100 milliseconds.
Changes are in v3/18748 and attached to this ticket (new fork.py)
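For readers without access to the attachment, here is a minimal sketch of the pattern described above. It is not the attached fork.py: the names wait_for_ready and drain are hypothetical and chosen only for illustration. It assumes the fix amounts to returning every ready file descriptor (not just the first), falling back to select when epoll is unavailable, polling with a 100 ms timeout, and catching OSError around the read.

```python
import os
import select


def wait_for_ready(fds, timeout=0.1):
    """Return ALL descriptors in fds that are readable within timeout seconds.

    Hypothetical helper: the bug described in this ticket was a function
    handing back only the first ready descriptor instead of the whole list.
    """
    if not fds:
        return []
    if hasattr(select, "epoll"):
        poller = select.epoll()
        try:
            for fd in fds:
                poller.register(fd, select.EPOLLIN | select.EPOLLHUP)
            # epoll.poll() takes its timeout in seconds (0.1 s = 100 ms)
            events = poller.poll(timeout)
        finally:
            poller.close()
        # Keep every ready descriptor, not only events[0]
        return [fd for fd, _event in events]
    # Fallback for platforms without epoll
    readable, _, _ = select.select(fds, [], [], timeout)
    return list(readable)


def drain(fds):
    """Read each descriptor until EOF; return a dict mapping fd to bytes."""
    data = {fd: b"" for fd in fds}
    pending = set(fds)
    while pending:
        for fd in wait_for_ready(sorted(pending)):
            try:
                chunk = os.read(fd, 4096)
            except OSError:
                # Assumption: a failing descriptor is dropped so that the
                # caller keeps running instead of crashing
                chunk = b""
            if chunk:
                data[fd] += chunk
            else:
                pending.discard(fd)
    return data
```

With two pipes that both hold data, wait_for_ready returns both read ends in one call, which is the behavior the original epoll code was missing.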