Art and SAM meeting 1Sep2011

Marc's notes

We rely on having an externally-defined “data set”.
This means a name associated with a set of “files”, as SAM defines them.

The art framework will be given a “project”.
A “project” goes along with a group of jobs.
A project is a set of work that is going to be performed on a data set.
The task of the project is to deliver files in a reasonable order to a job.
Many jobs may be part of the same project.
Creating the project is a distinct step.
The project name is passed to each job that is part of the project.

The project is not connected to the batch system.
The jobs are created by the batch system.

Each job that wants to consume files must register itself with a project.
Each job can be given a maximum number of files it expects to receive.
Each job has an application; this is basically an arbitrary string.
When a job registers with a project, the job is assigned an identifier that is unique within the project; this is called the “consumer process id”. The job has to provide the “delivery node”, the place to which files will be delivered.
The combination of project name and consumer process id is a unique identifier.
A single consumer process can only have one file at a time. But a single Unix process can register as more than one consumer process.
To register, provide:
  • application name
  • application version
  • delivery location
  • optionally, a description of the process
  • a file limit
  • project name
  • user name or grid certificate id
Each call creates a new consumer process, and assigns it a new id.
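
A minimal sketch of the registration bookkeeping described above, as a toy in-memory model. It is not the SAM interface: the names Project, ConsumerProcess and register_consumer, all field names, and the choice to start ids at 1 are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class ConsumerProcess:
    """One registration with a project (all field names are hypothetical)."""
    project_name: str
    consumer_process_id: int
    application_name: str
    application_version: str
    delivery_location: str
    user: str                          # user name or grid certificate id
    file_limit: Optional[int] = None   # maximum number of files expected
    description: str = ""              # optional description of the process


@dataclass
class Project:
    """Toy stand-in for the server-side project; not the real SAM server."""
    name: str
    # Assumption: ids are small integers starting at 1.
    _next_id: count = field(default_factory=lambda: count(1), repr=False)

    def register_consumer(self, application_name, application_version,
                          delivery_location, user,
                          file_limit=None, description=""):
        """Each call creates a new consumer process with a new id."""
        return ConsumerProcess(
            project_name=self.name,
            consumer_process_id=next(self._next_id),
            application_name=application_name,
            application_version=application_version,
            delivery_location=delivery_location,
            user=user,
            file_limit=file_limit,
            description=description,
        )
```

In this sketch one Unix process may call register_consumer several times; each call yields a distinct (project name, consumer process id) pair.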

What can a job do?...
  • get next file -- returns a file name
    often, but not always, on the filesystem
    there is currently no good way to determine what kind of access is needed for the file (dcap, etc.)
    if the file is not immediately available, it returns HTTP/202, which means the request was received but cannot be met immediately; the response includes a suggestion for how long to wait
    if there are no more files available, HTTP/204 (no content) is returned
  • release file -- called when you are done with a file.
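
The “get next file” responses above could be handled by a loop like the following sketch. Assumptions, none of them from the meeting: success is reported as HTTP 200 with the file name in the body, the 202 body carries the suggested wait in seconds, and request is a hypothetical callable standing in for the real call to the server.

```python
import time
from typing import Callable, Optional, Tuple

# (HTTP status, body); the body format is an assumption of this sketch.
Response = Tuple[int, str]


def fetch_next_file(request: Callable[[], Response],
                    sleep: Callable[[float], None] = time.sleep,
                    max_attempts: int = 100) -> Optional[str]:
    """Return the next file name, or None when no more files are available."""
    for _ in range(max_attempts):
        status, body = request()
        if status == 200:
            # Assumed success code: the body is the file name.
            return body
        if status == 204:
            # No content: the project has no more files for this consumer.
            return None
        if status == 202:
            # Request received but not yet satisfiable: wait as suggested.
            sleep(float(body))
            continue
        # Anything else (e.g. a 500) is treated as unrecoverable.
        raise RuntimeError(f"unrecoverable server response: HTTP {status}")
    raise TimeoutError("file still not available after max_attempts requests")
```

Passing sleep as a parameter keeps the loop testable without real waiting; max_attempts is a local safety limit, not something the server defines.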

My mental model: treat these as files!
  • get next file > open
  • release file > close (needs the name of the file we release, the project name, the consumer process id, and a status of “ok” if we’re closing because all is well; on failure, don’t mark the status).
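
The open/close analogy maps naturally onto a context manager. A sketch, assuming two hypothetical callables, get_next_file and release_file, that already know the project name and consumer process id; neither is a real SAM or art function.

```python
from contextlib import contextmanager


@contextmanager
def delivered_file(get_next_file, release_file):
    """Hold one delivered file for the duration of a with-block.

    get_next_file() returns a file name, or None when no files remain.
    release_file(name, status=...) tells the project we are done with it.
    """
    name = get_next_file()              # "open"
    if name is None:
        yield None                      # nothing to process, nothing to release
        return
    try:
        yield name
    except BaseException:
        # Failure while processing: release without marking a status.
        release_file(name, status=None)
        raise
    else:
        # All is well: release with status "ok".
        release_file(name, status="ok")  # "close"
```

Because a consumer process holds only one file at a time, with-blocks for the same consumer should not be nested.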

For tracking purposes, we should call “set process status” to “completed”. This is to be done for each consumer process id. For a failed job, we can set the process to “bad”.
WE MIGHT WANT TO SET THE CONSUMER PROCESS STATUS at the time we close an output file, and start another one for another output.

What happens if the framework program finishes correctly, but files fail to copy (part of the batch system’s task)?

We need to make sure that there is a clear state machine describing the behavior of this system.
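
As a starting point for that discussion, here is one possible reading of the notes as a state machine for a single consumer process. The state and event names are invented for this sketch and the transition table is a guess, not an agreed design.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"            # registered, holding no file
    HOLDING = "holding"      # holding exactly one file
    COMPLETED = "completed"  # process status set to "completed"
    BAD = "bad"              # process status set to "bad"


# (current state, event) -> next state. Pairs not listed are illegal.
TRANSITIONS = {
    (State.IDLE, "get_next_file"): State.HOLDING,
    (State.HOLDING, "release_file"): State.IDLE,
    (State.IDLE, "set_completed"): State.COMPLETED,
    (State.IDLE, "set_bad"): State.BAD,
    (State.HOLDING, "set_bad"): State.BAD,
}


def step(state: State, event: str) -> State:
    """Apply one event, rejecting anything the table does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {event!r} in state {state.name}"
        ) from None
```

The table encodes the one-file-at-a-time rule: a second “get next file” while HOLDING is rejected. It leaves open the questions raised in these notes, such as what happens when the job succeeds but the file copy fails.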

When should “end consumer process” be called?
The final step is calling “stop project”, which frees resources on the server side. There is a timeout, but it is set on the server side. Typical values are 48 hours to 14 days.

The WORKFLOW creates the project, and stops the project. If a user stops the project from outside, the next request to get a file will announce there are no more files, and the jobs will think they have completed successfully.

There are no recoverable error conditions; any 500 error from the server is an indication that you’re permanently stuck.

Can art provide a means of adding a job-specific output filename?
Can we supply a project name on the art command line?

NOvA use case:

An expert defines SAM data sets for NOvA data.
A user comes in and submits “a job” for running on some defined data set.
The user specifies
  • project name
  • how many simultaneous jobs are to be run
  • the configuration for art
  • art will have to be told what project it is part of.

Jim's notes

To be added.