Pegasus Workflow

Meeting held on 2009-09-08. Attendees: Adam L, Marc M, Eric N, Bert, Eileen, Gabrielle, Erik G, Ruth P, Jim K.

Agenda and announcement

Over the last nine months or so, at least three groups have expressed interest in using
Pegasus as a workflow engine. I would like to have a meeting sometime over the
next week to discuss Pegasus and workflow needs. Here is a draft agenda:

  • what is the problem that Pegasus is solving for you?
  • what is the larger problem that this is part of? (briefly stated)
      • if workflow is the problem, everyone will need to define what this means.
  • why is Pegasus a good solution for this problem? (why did you select it?)
  • what are your plans for evaluation and use?
  • should we be working together more closely? if so, how and on what?


The three groups will be called the glide-in group, the neutrino group, and the LQCD group.

What is the problem being addressed?

For the glide-in group, the problem is to add a glide-in service to Pegasus. This is in
support of Corral, a resource manager for TeraGrid, targeted at MPI jobs. They are hiring
a term employee to do this work.

Ruth mentioned OSG's interest in Pegasus. She noted that LIGO is using it, although they
were not sure of the actual benefit, and that the earthquake engineering people are
using it as well.

For the neutrino group, the problem is not understood yet. They are just exploring what Pegasus
could do for a user in the neutrino world. They have not talked with the users yet. Steve
Timm informed Marc M. of trouble in Pegasus with the number of jobs generated (too fine-grained). Jim mentioned
that when they talked with Ewa, she discussed a way to indicate that certain jobs should be
run together as a unit on one resource, as opposed to running each separately.
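
The idea of running several fine-grained jobs together as one unit can be sketched as follows. This is only an illustration of the grouping idea, not Pegasus code: the function name cluster_jobs, the cluster naming scheme, and the fixed-size policy are all assumptions made for the example.

```python
from typing import Dict, List


def cluster_jobs(jobs: List[str], max_per_cluster: int) -> Dict[str, List[str]]:
    """Group independent fine-grained jobs into larger units.

    Hypothetical policy: fill clusters in order, up to max_per_cluster
    members each. Each cluster would be submitted as a single job to one
    resource, and its members would run there one after another.
    """
    if max_per_cluster < 1:
        raise ValueError("max_per_cluster must be at least 1")
    clusters: Dict[str, List[str]] = {}
    for start in range(0, len(jobs), max_per_cluster):
        name = f"cluster_{start // max_per_cluster}"
        clusters[name] = jobs[start:start + max_per_cluster]
    return clusters
```

With ten jobs and a limit of four per cluster, this yields three submitted units instead of ten, so the scheduler sees fewer, longer-running jobs.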

For the LQCD group, the problem is to handle everything necessary to run LQCD campaigns (see attached slides).
This includes storing configuration, provenance, and job history in a database, having a workflow engine that
tracks progress, and getting feedback from the running jobs and from the user to make adjustments and
actively monitor progress. We also need a richer language for defining a workflow. We mentioned
the need for iterative constructs in this language, and also the need for different types of files and
other objects that come from databases. This group has been wondering why they would not just use DAGMan
(generating the DAG from their own structures and configuration).
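
As a rough illustration of generating DAGMan input directly from a campaign description, the sketch below unrolls a fixed number of iterations into a linear chain of nodes. The function name unroll_campaign, the step names, and the submit file names (for example generate.sub) are hypothetical; only the JOB, VARS, and PARENT ... CHILD keywords are standard DAGMan input syntax.

```python
from typing import List


def unroll_campaign(n_iterations: int, steps: List[str]) -> str:
    """Return DAGMan input text for a loop unrolled into a linear chain.

    Every (iteration, step) pair becomes one JOB node, a VARS line passes
    the iteration number to the submit file, and PARENT/CHILD lines chain
    the nodes in execution order. Submit file names are hypothetical.
    """
    nodes = [(f"{step}_{i}", step, i)
             for i in range(n_iterations) for step in steps]
    lines: List[str] = []
    for name, step, i in nodes:
        lines.append(f"JOB {name} {step}.sub")
        lines.append(f'VARS {name} iteration="{i}"')
    names = [name for name, _, _ in nodes]
    lines += [f"PARENT {a} CHILD {b}" for a, b in zip(names, names[1:])]
    return "\n".join(lines) + "\n"
```

Note that this only works when the iteration count is known before submission. A loop whose termination depends on results from running jobs cannot be expressed this way, which is the kind of iterative construct the group was asking about.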

What does Pegasus give you?

The question of what Pegasus does for you came up several times. Our understanding is that it is largely
a tool that transforms a DAX into a DAGMan DAG. This is similar to compiling a program: certain
constructs or uses of data in the DAX file cause a set of DAGMan nodes to be generated. We
believe that the API for adding code to do these finer-grained transformations is not necessarily open
or accessible to users. There is no runtime monitor included. If the work to be done depends on
a current job step, then that job step must itself invoke Pegasus and submit the results. This is not a good,
general solution. So Pegasus seems to have a set of built-in policies for how it does this transformation
from DAX to DAG, developed for the way their main customers do business, which does not seem to
match our needs very well.
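
To make the compiler analogy concrete, here is a minimal sketch of such a transformation under an assumed policy. It is not Pegasus's actual planner: the function name plan, the AbstractWorkflow type, and the stage_in_/stage_out_ node names are invented for illustration. The abstract description lists only jobs and the files they read and write; the "compile" step infers dependencies from those files and adds data-movement nodes.

```python
from typing import Dict, List, Tuple

# Abstract workflow: job name -> (input files, output files).
AbstractWorkflow = Dict[str, Tuple[List[str], List[str]]]
Edge = Tuple[str, str]


def plan(workflow: AbstractWorkflow) -> Tuple[List[str], List[Edge]]:
    """Expand an abstract workflow into concrete nodes and edges.

    Hypothetical built-in policy: a file that no job produces gets a
    stage-in node, and a file that no job consumes gets a stage-out node.
    """
    producer = {f: job for job, (_, outs) in workflow.items() for f in outs}
    consumed = {f for ins, _ in workflow.values() for f in ins}
    nodes: List[str] = list(workflow)
    edges: List[Edge] = []

    def add_edge(parent: str, child: str) -> None:
        # Register any new node and avoid duplicate edges.
        for node in (parent, child):
            if node not in nodes:
                nodes.append(node)
        if (parent, child) not in edges:
            edges.append((parent, child))

    for job, (ins, outs) in workflow.items():
        for f in ins:
            add_edge(producer.get(f, f"stage_in_{f}"), job)
        for f in outs:
            if f not in consumed:
                add_edge(job, f"stage_out_{f}")
    return nodes, edges
```

The policy is fixed inside plan, and the caller cannot change how files map to extra nodes without editing the function. That mirrors the concern above about built-in policies that users cannot easily extend.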

We discussed why we think Pegasus came about and why the newer experiments are not doing what the older,
large experiments did. First, the large experiments have the resources to build something
specialized for their problem and integrated down to the level of the running code and designed to
run in an environment (runtime area) suitable for the experiment. The smaller experiments cannot
afford to do this sort of thing, and could benefit from a tool that hides some of the complexities
of running a set of jobs.

A terminology problem

Workflow seems to be a bad term to describe the things we are talking about. For LQCD, Pegasus defines
workflow too narrowly. It is not an end-to-end solution for running scientific jobs; it only handles
one part of the problem. What we want is a system for running scientific jobs.

What next?

Several interesting action items came out of this meeting.

  1. CET can work with Adam's group to try to get some requirements for doing physics work from the experiments.
  2. CET will want to work with Adam's group to get access to machines where Pegasus is available and working as it should.
  3. CET needs to specify its requirements for a prototype cloud computing cluster, including what we would use it for.
  4. We will make a place to record information about Pegasus and have discussions about the topic.
  5. We will meet next month to discuss progress on each of the fronts.