Technical Consultation of the FIFE GPGrid

Goals of the meeting

- Improve fifebatch frontend config
- Discuss Condor/quota fairshare, etc.
- Understand differences with respect to the LHC
- Overview of team: interests, responsibilities, etc.
- Improve diagnostics
- Jobsub architecture
- Identify bottlenecks, scalability, points of failure
- Understand the FIFE/USDC/DCSO/HEPCloud divisions of responsibility

Management Questions

- What are the stated goals and objectives of this project?
- Who is the customer base / stakeholders? How do you determine whether their needs are being met?
- What is the current team makeup? How is this expected to shift (or not) in the nearterm?
A: See the diagram here:

- What are the external dependencies (in terms of teams and projects) of JobSub? How are these managed?
- How should the platform look in two years?
- What are the nearterm (within the next two years) milestones?
- What does the desired effort-level profile look like? What is the plan to “get there from here”?
A: Most development work is done. Some items that were lost in the gpsn01 transition still need to be added back. No major new changes are planned in the next 2 years; the project is moving to operations.
- Where does project management feel the greatest concerns lie? What areas should we focus on for the rest of the day?
- What resource provisioning, allocation, and fair-share policies exist? How are these decided on and changed?
A: Right now each experiment has a "block" of resources; size determined by the SC-PMT annual review. We are moving to a hierarchical system where the experiments can adjust the priorities of jobs within their block. Jobs that run offsite should not count against the experiment's block.
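One way the per-experiment "blocks" described above could be expressed at the HTCondor level is with group quotas in the negotiator configuration. This is only an illustrative sketch; the group names and numbers are hypothetical, not the actual FIFE values:

```
# Sketch: per-experiment quota "blocks" (names/values are illustrative)
GROUP_NAMES = group_nova, group_mu2e, group_minerva
GROUP_QUOTA_group_nova    = 2000
GROUP_QUOTA_group_mu2e    = 1200
GROUP_QUOTA_group_minerva = 800
# Let a group use idle cycles beyond its block when others are idle
GROUP_ACCEPT_SURPLUS = True
```

Offsite jobs not counting against the block would then be a matter of routing them through a separate group or pool rather than through these quotas.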

Development Questions

- Can you provide a component architecture diagram of:
  - The entire FIFE submission infrastructure?
  - An example of how one experiment’s components interact with this infrastructure?
  - A detailed view of the JobSub components.
  - Management of security credentials, from the submission point to the worker node.
  - Input data management.
  - Output data management.
- Given the chance, what architectural decisions would you revisit?

Recommendation: fix some input sanitization issues and hold periodic reviews with the security team.

- If forced to eliminate components, how would you simplify the architecture?
- What peer efforts exist? How do you compare to these other systems? How do you collaborate or stay informed with what others are doing?
- How are failures detected? What are common sources of problems?

Impossible-to-match jobs: add a warning attribute to the ClassAd when contradictory submit options are detected, or run condor_q -better-analyze on the whole pool every ~20 minutes and pick out certain use cases.
How are client failures detected? How are client tools deployed?
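The "add something to the classad when contradictory options" idea above could be sketched as a small pre-submission check. This is a hypothetical helper, not part of JobSub; the resource names and limits are illustrative:

```python
# Sketch (hypothetical helper, not part of JobSub): detect submit options
# that can never match any slot in the pool, so the submit tool can attach
# a warning attribute to the job ClassAd instead of letting the job idle.

def check_requirements(request, pool_limits):
    """Return human-readable contradictions between a request and the pool.

    request     -- dict of requested resources, e.g. {"memory_mb": 64000}
    pool_limits -- dict describing the largest slot, e.g. {"memory_mb": 32000}
    """
    problems = []
    for key, asked in request.items():
        limit = pool_limits.get(key)
        if limit is not None and asked > limit:
            problems.append(
                f"requested {key}={asked} exceeds largest slot ({limit})")
    return problems
```

If the returned list is non-empty, the submit wrapper could set something like a `JobsubMatchWarning` attribute (a hypothetical name) so that monitoring can surface it.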
- How does the system determine whether a task is making progress?

Q on tracking running jobs: do we use condor_tail? It is not available to users. Should it be? Add it to jobsub?
condor_chirp can send job attributes back to the schedd (number of events processed, etc.)
Could ifdh monitoring be improved (e.g. which jobs are waiting on cpn locks)?
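The chirp-based progress idea above could feed a simple stall detector on the monitoring side. A minimal sketch, assuming the job periodically chirps back an events-processed counter (attribute name and threshold are illustrative):

```python
# Sketch: decide whether a job is making progress from a counter the job
# chirps back to the schedd (e.g. via condor_chirp set_job_attr).
# The one-hour window is an illustrative threshold, not a FIFE policy.

def is_stalled(samples, window_sec=3600):
    """samples: list of (unix_time, events_processed), oldest first.

    Returns True if there is a sample at least window_sec old whose
    counter equals the most recent one, i.e. no progress over the window.
    """
    if not samples:
        return False
    latest_t, latest_n = samples[-1]
    old = [n for t, n in samples if latest_t - t >= window_sec]
    return bool(old) and old[-1] == latest_n
```

The same pattern would apply to ifdh-level signals, e.g. flagging jobs whose "waiting on cpn lock" state has not changed over the window.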

- How is high-availability provided?
- How is feedback gathered for the user interface? Can you give examples where, based on user feedback, the interface was improved?
- What does a sample user interaction look like? Can we walk through an example task in its entire lifecycle?
- One concern I have going in is that HTCondor is a very fast-moving target in some areas. How does development keep abreast of these changes? Are there fears / examples of a JobSub feature later being duplicated in an HTCondor release?
- How does the team maintain a relationship with the HTCondor team? Can you give 1-3 examples where a problem was identified within FIFE, communicated to the HTCondor team, and the HTCondor team delivered an enhancement / fix?

How do we get a better sense of when to upgrade, and stay aware of fixes that come from CMS? A monthly sysadmin call? Get some FIFE people in there?
- What is the release roadmap? What features are planned?
- What is the target scale in the next two years? What actions are being taken to prepare for this? How much ‘headroom’ do you currently believe is available?
- How are resource provisioning, allocation, and fair-share policies implemented? How is correctness determined?
A: See the answer above about resource provisioning as well. Are generic fermilab glideins better in case an experiment's glideins end up where the resources aren't sufficient? "Generic" glideins could be more efficient. Partitionable slots are also a complication. There is also a desire to make hierarchical groups within experiments. What about user fair-share and priorities? One possible solution: have three pools, one for local GPGrid, one for offsite opportunistic, and one for offsite dedicated (e.g. OSC/SMU). There seems to be general agreement that the main effort now should go to giving experiments the ability to adjust the priorities of their jobs and control what runs first. What are the pros/cons of having one pool controlling everything vs. giving the experiments their own pools?

There seems to be a need for a technical mechanism to let experiments manage the details of priorities themselves, as opposed to opening tickets all the time (we don't have the resources to react quickly enough).
The GWMS proxy selection plugin may help manage things as far as separate VO groups; it can also select by owner.
Recommendation: treat offsite opportunistic as generic opportunistic; use a separate group for offsite "dedicated" resources (the SMU/OSC type of thing).
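The hierarchical-groups idea could be sketched at the HTCondor negotiator level as nested accounting groups with dynamic shares the experiment controls. The group names and shares below are hypothetical, not the real FIFE layout:

```
# Sketch: hierarchical groups within one experiment (illustrative values)
GROUP_NAMES = group_nova, group_nova.prod, group_nova.analysis
GROUP_QUOTA_DYNAMIC_group_nova.prod     = 0.7
GROUP_QUOTA_DYNAMIC_group_nova.analysis = 0.3
GROUP_ACCEPT_SURPLUS = True
# Jobs then carry an accounting-group tag, e.g.:
#   +AccountingGroup = "group_nova.prod.someuser"
```

Giving the experiment write access to just its own subgroup shares is the kind of mechanism that would avoid the ticket round-trips described above.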

- If management wants to change an experiment's allocation from X to Y, how is this executed?
A: Formal request made in a SPPM meeting; upper management deliberates and issues a ruling. At that point we adjust the frontend accordingly.
- How is accounting implemented?
- What auditing capabilities exist?
- What are the current weaknesses and pain points of the system?
- Homework: Can you provide me with:
  - Send forward any pre-existing presentations along the lines of the above.
  - Links to any involved source code repositories.
  - Example user submit scripts.
  - Links to monitoring pages (my FNAL ID is if the links are not publicly viewable).
  - Release notes (if available) for the last 2 years?

Operations Questions

- How is the relationship with the development team managed? Can you give 1-3 examples where a problem was identified by operations, communicated to the development team, and the development team delivered a fix?

Weekly meeting; file an issue in the JobSub Redmine repository; file a SNOW request/incident.

USDC also determines the priorities for the next release based on:
  • problems that we have encountered helping users
  • valid requests from liaisons

Monthly meeting; file an issue (); send email (directly to developers); HTCondor Week.
28492 - 8.2 schedd troubles

Note: 8.3 allows a max-jobs-per-user setting (requested by Minerva). It also adds more Python bindings, etc.
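Assuming the setting referred to is the schedd knob MAX_JOBS_PER_OWNER, enabling it is a one-line config change (the value below is illustrative, not a recommendation):

```
# Sketch: cap jobs a single user may have in this schedd's queue
MAX_JOBS_PER_OWNER = 5000
```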

- Provide puppet modules and links to operational documentation

- Summary of available hardware and the mapping of components to hosts.

- What monitoring alerts and metrics are in place? Can you give an example where the monitoring worked correctly and when it failed?

Worked correctly: the killhard sensor (run as a script).

- How is availability measured?

- How do you know the HTCondor pool is working?

  • The check_mk sensors above
  • A (new) primary to monitor the check_mk dashboard
  • Incidents filed by users

- How are new JobSub releases validated? What’s the path from “git tag” to “production”?

  1. Developer tests on the dev setup
  2. Developer creates a SNOW request with USDC for deployment on preprod
  3. USDC deploys on preprod and does internal testing (there is a test suite for onsite and offsite job submission under each supported VO)
  4. Developer also helps test preprod
  5. If no issues, USDC requests OPOS to test
  6. If no issues with the OPOS tests, the general user community is requested to test (for 1 week)
  7. If everything is OK, the new release is deployed in production (usually during the 3rd Thursday GPGrid downtime)

All of this happens via SNOW (Request/Change management).

- How are new HTCondor releases handled?

We upgrade only when we have to, from the OSG 3.2 release repo (we plan to get HTCondor 8.3.6 from the 3.3.0 series, to help with the schedd issues).
First tested on pre prod setup
Deployment on production happens as part of regular 3rd Thursday GPGrid downtime

- How is scale testing performed?

In the past we had the CDF sleeper pool; not sure if we have anything now.

- What kind of testbed exists and how close is it to the production environment?

Every component of the production setup has its pre-production counterpart.
The VO Frontend and GWMS Factories have the same configuration, so the same VO groups are supported and jobs end up on the same resources.

We are trying to get the preprod VO Frontend to request glideins from the GOC ITB GWMS Factory.

Preprod nodes are FermiCloud VMs (so, unlike production, they are not managed by ECF).
Some manual setup is required (only if they get reinstalled for some reason).

Note: The preprod config may differ from prod (for a brief time period) only if operations is testing something new (like PS on the new GPGrid).

- Summarize the change management procedures in place.

Change Request in SNOW.
As part of the CHG, a Service Desk Communication Request goes to users and CS liaisons.
Example: CHG000000009763 - Upgrade jobsub from v1.1.3 to v1.1.4

- How are configuration changes handled? Are they primarily done by operations or by development?

Changes can be recommended by developers or operations.
Deployed on preprod/prod ONLY by operations (via a SNOW request).
RITM0234745 - Enforce memory limits with SYSTEM_PERIODIC_REMOVE
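A sketch of what such a memory-limit policy expression could look like; the exact expression deployed under RITM0234745 may differ:

```
# Sketch: remove running jobs whose measured memory exceeds the request
# (MemoryUsage and RequestMemory are both in megabytes; JobStatus 2 = running)
SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && (MemoryUsage > RequestMemory)
```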

- What is the process for bringing up a new schedd?

  1. File a request for a node with ECF
  2. Add the node to puppet
  3. Install the Puppet agent on the node and run it
  4. Run puppet on the VO Frontend node (to add the new schedd to the frontend config)

- When was the last time the HA setup was used in production?

It is currently in use.

- What is the growth plan for the infrastructure?

  • DCSO will eventually take over operations of this whole setup
  • Transition of USDC GWMS Factory -> CMS Factory is in progress
  • Better monitoring

- Provide an overview of the last serious unplanned downtime. What happened? How was the issue resolved? What post-incident actions were taken?

Cannot think of any

- What are the perceived bottlenecks or pain points of the system?

Dev nodes: we shouldn't have to maintain dev nodes.

Preprod nodes
  • They are not provided/managed by ECF
  • It was a pain to set up services once these were moved from the old to the new FermiCloud
  • Certs expire and we don't have any monitoring for them
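The cert-expiry gap above could be closed with a trivial check. A minimal sketch, assuming the cert's notAfter string is obtained externally (e.g. `openssl x509 -enddate -noout -in hostcert.pem`); the 14-day threshold is illustrative:

```python
# Sketch: warn before preprod host certs expire (threshold is illustrative).
import ssl
import time

def days_until_expiry(not_after, now=None):
    """not_after: OpenSSL-style time string, e.g. 'Jan 01 00:00:00 2030 GMT'."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0

def needs_warning(not_after, threshold_days=14, now=None):
    """True if the cert expires within threshold_days."""
    return days_until_expiry(not_after, now) < threshold_days
```

Wired into check_mk (or even cron plus email), this would cover the monitoring gap for the FermiCloud preprod nodes.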

VOFrontend config needs to be simplified

- How does operations determine that external sites are behaving appropriately? What is done when a site becomes problematic?

  • Daily Gratia reports sent by Tanya (for NOvA and mu2e) cover some sites
  • Query from user and/or Incidents filed by them
  • I don't think we monitor OFFSITE resources on a regular basis
Actions taken
  • File a GOC ticket, follow up as needed
  • Ask users to not submit to that site until issue is resolved

- How does operations determine they are getting an appropriate level of resources for a site? If I were to pick a site at random, could you give evidence that the level of use is appropriate for demand?
- Example from CMS: Suppose Nebraska is running 1k jobs. Should that number go up or down in the near future? Is that an appropriate number of running jobs based on demand? What is the demand?
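One crude way to answer the Nebraska-style question is to compare running jobs against the matching idle backlog. This is a hypothetical heuristic, not an existing FIFE tool; the 20% threshold is illustrative and the inputs would come from condor_q or Gratia:

```python
# Sketch: a crude per-site demand signal (thresholds are illustrative).

def demand_signal(running, idle_matching):
    """running       -- jobs currently running at the site
    idle_matching -- idle jobs whose requirements match this site
    Returns 'grow', 'shrink', or 'hold'."""
    if idle_matching > 0.2 * running:
        return "grow"    # sustained matching backlog: demand exceeds supply
    if idle_matching == 0 and running > 0:
        return "shrink"  # no backlog: current level may exceed demand
    return "hold"
```

Even this simple signal, trended over a week, would give the evidence the question asks for.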

- Homework:
  - Send forward any pre-existing presentations along the lines of the above

None exist (that I know of).

- Provide all HTCondor configurations

- Provide all GlideinWMS configurations and validation scripts


Project Info

Technical Documentation

Technical documents available at:

Covers following topics:

  • JobSub Architecture
  • JobSub Client-Server Communication API