Bug #6354

Milestone #6351: Release JobSub v0.3.1

Possible race condition in proxy creation on the server side

Added by Parag Mhashilkar over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: Dennis Box
Category: -
Start date: 05/27/2014
Due date: 05/28/2014
% Done: 100%
Duration: 2

Description

On May 27, 2014, at 12:22 AM, Steven C Timm wrote:

Over Memorial Day weekend we had a repeated sequence of problems with user "ashley90"
of MINOS trying to run jobs without a VOMS proxy on fifebatch1. This stressed the GUMS server
to the breaking point on several occasions when she did a condor_rm on those jobs because
they appeared not to be working. I saw them all because I was at CERN and awake during
the hours they happened.

It should be impossible for her to run without a VOMS proxy, since those proxies are automatically
generated for her by the refresh-proxies crontab.

If I am reading the code of /opt/jobsub/server/webapp/auth.py correctly,
it appears to be making the proxy in two stages, first running the kx509 command
and then running the voms-proxy-init command, both using the same intermediate file.
This appears to leave a short window during which a bare kx509 credential, without a
VOMS attribute, is available. You normally might not hit this window, but when 50,000 jobs
are in the queue the schedd is always doing something with the proxy file.
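
For illustration only, a minimal sketch of the suspected two-stage pattern described above. The helper name, the command-line flags, and the VO name "fermilab" are assumptions made for the sketch and are not taken from the actual auth.py:

    import subprocess

    # Example proxy path quoted later in this report; hypothetical helper, not the real auth.py code.
    PROXY = "/fife/local/data/rexbatch/proxies/minos/x509cc_ashley90_Analysis"

    def make_proxy_in_place():
        # Stage 1: kx509 writes a bare credential straight to the published location.
        subprocess.check_call(["kx509", "-o", PROXY])
        # Race window: between the two calls the published file holds a kx509
        # credential with no VOMS attribute, and a busy schedd may read it right now.
        # Stage 2: voms-proxy-init adds the VOMS attribute, reusing the same file.
        subprocess.check_call(
            ["voms-proxy-init", "-noregen", "-voms", "fermilab",
             "-cert", PROXY, "-key", PROXY, "-out", PROXY])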

If my diagnosis is correct, the auth.py code needs to be modified to do all of its work in a file other
than the /fife/local/data/rexbatch/proxies/minos/x509cc_ashley90_Analysis
file where the proxy is stored now, and to move the new proxy to the standard location only once it has been
verified to be good.
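
One way to implement this, as a minimal sketch rather than the actual fix from branch 6261_6354: build the proxy in a temporary file in the same directory, check that the VOMS attribute is really present, and only then rename it over the published path. The helper name, flags, and VO name "fermilab" are again illustrative assumptions:

    import os
    import subprocess
    import tempfile

    PROXY = "/fife/local/data/rexbatch/proxies/minos/x509cc_ashley90_Analysis"

    def make_proxy_atomically():
        # Build in a temp file in the same directory so the final rename is a
        # same-filesystem, atomic operation (mkstemp also gives it 0600 perms).
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(PROXY), prefix=".x509cc_")
        os.close(fd)
        try:
            subprocess.check_call(["kx509", "-o", tmp])
            subprocess.check_call(
                ["voms-proxy-init", "-noregen", "-voms", "fermilab",
                 "-cert", tmp, "-key", tmp, "-out", tmp])
            # Publish only after verifying the VOMS attribute is attached.
            subprocess.check_call(
                ["voms-proxy-info", "-file", tmp, "-exists", "-acexists", "fermilab"])
            os.rename(tmp, PROXY)
        except Exception:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise

Because the rename stays within one directory on one filesystem, readers such as the schedd only ever see the old complete proxy or the new complete proxy, never a bare kx509 credential.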

This change should be made quickly if possible; otherwise we will continue to see DoS-like attacks against
the GUMS server like we saw over the weekend. It got to the point where not only ashley90's jobs were
failing but some other monitoring was failing too.

We might also want to consider dialing back JOB_STOP_COUNT on fifebatch1 from 30 to some smaller
value; on gpsn01 it is 10 and that does not cause any problems.

Steve Timm

History

#1 Updated by Dennis Box over 5 years ago

  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to Dennis Box
  • % Done changed from 0 to 90

The race condition caused numerous Condor and SAZ errors during stress testing; they are gone with the auth.py from branch 6261_6354.

#2 Updated by Dennis Box over 5 years ago

  • Status changed from Feedback to Resolved

#3 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 90 to 100

