Meeting "Getting Nova MC on the OSG" 15-Jan-2013

Executive Summary

Action Items

  1. Provide links to Jobsub documentation


Attendees: A. Norman, Ruth Pordes, Marko Slyz, Denis Box, Margaret Votava


Notes on Jobsub and documentation can be found in Redmine under the IFront project.

Jobsub overview:

GPSN01 hosts a schedd, a frontend and a factory. The factory in use is a custom one, not the OSG one.

In the factory there are entry points for the different experiments (specifically NOvA, Minerva, etc.), and for the integration with SMU an entry was added for "Nova_SMU".

There are corresponding entry points on the frontend.

Jobsub can then be told which entry point to steer jobs through.

The factory is configured through a very large XML configuration file.
The configuration is documented on a website at UCSD.
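
As a rough illustration of how entry points in an XML configuration can be looked up by name, here is a minimal Python sketch. The XML layout, the element and attribute names, and the site values are all assumptions made for illustration; they are not the real factory schema, which is far larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the factory configuration.
# Element names, attribute names and site values are assumptions for
# illustration only; the real file uses its own (much larger) schema.
FACTORY_XML = """
<factory>
  <entries>
    <entry name="NOvA" site="FermiGrid"/>
    <entry name="Minerva" site="FermiGrid"/>
    <entry name="Nova_SMU" site="SMU"/>
  </entries>
</factory>
"""


def list_entry_points(xml_text):
    """Return a dict mapping each entry point name to its site."""
    root = ET.fromstring(xml_text)
    return {entry.get("name"): entry.get("site") for entry in root.iter("entry")}


def select_entry_point(xml_text, name):
    """Return (name, site) for the requested entry point, or raise KeyError."""
    entries = list_entry_points(xml_text)
    if name not in entries:
        raise KeyError(f"no entry point named {name!r}; known: {sorted(entries)}")
    return name, entries[name]
```

Calling `select_entry_point(FACTORY_XML, "Nova_SMU")` picks out the SMU entry in this toy configuration; an unknown name fails loudly rather than silently falling back to a default.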

The frontend uses a different configuration file (also XML), which lists all the sites. The VO's certs are attached on the IF frontend (one frontend for all the experiments), and jobs are then steered through it. This was done to satisfy FermiGrid.

  • Question: Where is the actual VO association done? Could we have a generic "SMU" entry point instead of a NOvA-specific one?

For OSG submission there are factories and entry points which are maintained. If a site is not yet supported, it has to be added, or the maintainers have to be asked to add it.

Jobsub then creates a condor_submit file, which is used to actually submit the job to the queue.
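
To make the "generate" half of this step concrete, the following Python sketch writes out a small HTCondor submit description. It is not Jobsub's actual output. The function name, the file naming scheme and the custom `+DesiredEntryPoint` job attribute are hypothetical; only the basic submit commands (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) are standard HTCondor.

```python
def make_submit_description(executable, arguments, entry_point, log_prefix="job"):
    """Build the text of a minimal HTCondor submit description.

    The '+DesiredEntryPoint' attribute is a hypothetical custom job
    attribute used here only to show how an entry point name could be
    attached to a job; it is not a documented Jobsub or glidein setting.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"output = {log_prefix}.$(Cluster).$(Process).out",
        f"error = {log_prefix}.$(Cluster).$(Process).err",
        f"log = {log_prefix}.$(Cluster).log",
        f'+DesiredEntryPoint = "{entry_point}"',
        "queue 1",
    ]
    return "\n".join(lines) + "\n"
```

The resulting text would be saved to a file and handed to `condor_submit`; the script name and arguments passed in (for example `nova_mc.sh` and `--events 100`) are placeholders.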

There are two modes of operation: one where the job wrapper script is generated by Jobsub, and one where the full script is passed in as is. Jobsub can also create the full set of DAGs that may be needed.
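
For readers unfamiliar with DAGs in this context, here is a minimal Python sketch of generating a DAGMan input file for a simple linear chain of jobs. The linear chain, the node names and the submit file names are assumptions for illustration; the DAGs Jobsub produces may be shaped quite differently. `JOB` and `PARENT ... CHILD` are standard DAGMan keywords.

```python
def make_dag(stages):
    """Build DAGMan input text for a linear chain of jobs.

    stages: ordered list of (node_name, submit_file) pairs. Each node
    runs only after the previous one in the list has finished.
    """
    # One JOB line per node, naming the submit description to use.
    lines = [f"JOB {name} {submit_file}" for name, submit_file in stages]
    # One dependency line per consecutive pair of nodes.
    for (parent, _), (child, _) in zip(stages, stages[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"
```

For example, `make_dag([("generate", "generate.sub"), ("simulate", "simulate.sub")])` yields two `JOB` lines followed by a single `PARENT generate CHILD simulate` line.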

For OSG the submit procedure uses a wrapper script (provided for each experiment by the grid group and customized to the experiment).

CMS has ProdAgent, DZero has SAMGrid, CDF has CAFSubmit, and ATLAS has PanDA.

We need to find out from Chander what else is done.

Client/Server Jobsub model

The current version looks like a client/server system to the user, but it is really a generate-and-submit system.

We really want to keep users off the submit machine so that they don't log in and do things (like run ROOT) that degrade the machine's performance or crash it.

Denis is aware of Bosco and has read some documentation on it.

The other reason to move to a client/server model is that each user currently needs a robot cert that they must keep alive (via a cron job).

The reason for having our own factory is to run on FermiGrid (Igor and Burt have discouraged us from doing it that way).

Deliverable for Client/Server is end of Feb. Development is being done in parallel on FermiCloud. The actual machine is in the process of being brought up by FEF.

SMU-specific work to get the glideins to work:

Miscellaneous Condor config files needed to be changed, and the XML had to be modified to pass the parameters correctly. We had to work with the site admin to get the parameters right.

How does OSG handle this?