Milestone #6351: Release JobSub v0.3.1
Possible race condition in proxy creation on the server side
On May 27, 2014, at 12:22 AM, Steven C Timm wrote:
Over Memorial Day weekend we had a repeated sequence of problems with user "ashley90"
of MINOS trying to run jobs without a VOMS proxy on fifebatch1. This stressed the GUMS server
to the breaking point on several occasions when she did a condor_rm on those jobs because
they appeared not to be working. I saw all of these incidents because I was at CERN and awake
during the hours they happened.
It should be impossible for her to run without a VOMS proxy, because those proxies are
automatically generated for her by the refresh-proxies crontab.
If I am reading the code of /opt/jobsub/server/webapp/auth.py correctly,
it is making the proxy in two stages: first running the kx509 command
and then the voms-proxy-init command, both using the same intermediate file.
This leaves a short window during which a bare kx509 credential, with no VOMS
attribute, sits in the file. Normally you would not hit this window, but when
50,000 jobs are in the queue the schedd is constantly reading the proxy file.
If my diagnosis is correct, the auth.py code needs to be modified to do all its work in a
different file than the /fife/local/data/rexbatch/proxies/minos/x509cc_ashley90_Analysis
file where the proxy is stored now, and only move the new proxy to the standard location
once it has been verified to be good.
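The fix described above is the standard write-to-temp, verify, then rename pattern. A minimal sketch in Python follows; the function name and the verify callback are illustrative, not the actual auth.py code, and it assumes a POSIX filesystem, where os.rename() within a single directory is atomic:

```python
import os
import tempfile

def write_proxy_atomically(proxy_path, proxy_bytes, verify):
    """Write a new proxy to a temp file in the same directory as the
    published proxy, verify it, and only then rename it into place.
    Readers (e.g. the schedd) see either the complete old proxy or the
    complete new one, never a bare kx509 intermediate credential.
    `verify` is a hypothetical callback, e.g. one that runs
    voms-proxy-info against the candidate file."""
    proxy_dir = os.path.dirname(proxy_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=proxy_dir, prefix=".proxy_tmp_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(proxy_bytes)
        os.chmod(tmp_path, 0o600)  # proxies must be owner-readable only
        if not verify(tmp_path):
            raise RuntimeError("new proxy failed verification; old proxy kept")
        os.rename(tmp_path, proxy_path)  # atomic replace on POSIX
    except Exception:
        # On any failure, discard the candidate and leave the old proxy alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temp file must live in the same directory as the published proxy, since os.rename() is only atomic within one filesystem; if verification fails, the old (good) proxy is left untouched.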
This change should be made quickly if possible; otherwise we will continue to see DOS attacks
against the GUMS server like the ones we saw over the weekend. It got to the point where not
only were ashley90's jobs failing, but some of our other monitoring was failing too.
We might also want to consider dialing back JOB_STOP_COUNT on fifebatch1 from 30 to some
smaller value; on gpsn01 it is 10 and that gives us no problems.
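Assuming this is the HTCondor schedd knob of the same name (which throttles how many jobs the schedd stops per interval), the change would be a one-line condor_config fragment like the following; the file location is illustrative:

```
# condor_config.local on fifebatch1 (hypothetical location)
# Match the value that has been trouble-free on gpsn01.
JOB_STOP_COUNT = 10
```

Followed by a condor_reconfig on the schedd for it to take effect.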