
Bug #4623

Pilot should reload grid environment between jobs

Added by Anthony Tiradani almost 6 years ago. Updated almost 6 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
09/06/2013
Due date:
% Done:
0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

On Sept 6, 2013, OASIS went offline for about 24 hours. During that time, any pilots that had started before the outage and had been requested on behalf of VOs using OASIS potentially acted as black-hole nodes: jobs failed and restarted continuously.

To mitigate problems like these, we should either:

a) reload the grid environment between jobs (e.g. this would let a site switch the OSG_APP link from OASIS to local storage without every job on the pilot failing for the rest of the pilot's lifetime)
or
b) periodically run all the validation scripts, specifically checking OSG_APP availability, and fail the pilot when something goes wrong.
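A minimal sketch of option b), assuming the pilot can call a Python helper between jobs. The names `osg_app_available` and `monitor`, the 300-second default interval, and the "list the directory" test are hypothetical illustrations, not part of any existing pilot code. Only the OSG_APP check is modeled; the other validation scripts are not.

```python
import os
import time


def osg_app_available(env=None):
    """Return True if the directory named by OSG_APP exists and can be listed.

    Hypothetical check: listing the directory forces a real filesystem access,
    so an unmounted or missing OSG_APP area shows up as an OSError.
    """
    env = os.environ if env is None else env
    path = env.get("OSG_APP")
    if not path:
        return False  # variable unset or empty: treat as unavailable
    try:
        os.listdir(path)  # raises OSError if missing, not a directory, etc.
    except OSError:
        return False
    return True


def monitor(check=osg_app_available, interval=300, max_checks=None, sleep=time.sleep):
    """Run `check` every `interval` seconds.

    Returns False as soon as a check fails (the caller would then stop
    accepting jobs and fail the pilot), or True if `max_checks` checks
    all pass. `sleep` is injectable so the loop can be tested quickly.
    """
    n = 0
    while max_checks is None or n < max_checks:
        if not check():
            return False
        n += 1
        sleep(interval)
    return True
```

One caveat on the design: on a hung network mount the directory listing may block instead of raising, so a production check would also need a timeout (for example by running it in a child process).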

History

#1 Updated by Burt Holzman almost 6 years ago

  • Target version set to v3_x

#2 Updated by Burt Holzman almost 6 years ago

  • Assignee set to Burt Holzman

