December-07-2017

Slides: https://indico.fnal.gov/event/15797/

Meeting Notes


Present

Margaret Votava - FNAL Project Sponsor/Scientific Computing Services Associate Head
Parag Mhashilkar - Project Lead
Marco Mambelli - Technical Lead
Dennis Box - Project Member
James Letts - CMS L2 Manager
Tony Tiradani - HEP Cloud Technical Lead
Steve Timm - HEP Cloud Technical Advisor
Ken Herner - FIFE Support
Mike Kirby - FIFE Liaison
Jeff Dost - CMS/OSG Factory Operations
Brian Lin - OSG
Tanya Levshina
Dave Dykstra
Miron Livny
Burt Holzman


Communication

  • Adding CMS to the list of meetings to attend is good.
  • What about OSG? Attend the OSG software team meeting once a month. In terms of OSG operations: Jeff is happy with attending the GlideinWMS meeting; the stress is in turnaround.
  • FIFE – local. No communications issues there.
  • Condor & GlideinWMS – had one productive meeting in Madison, held because GlideinWMS is morphing. The HTCondor team is now working with Fermilab on a white paper on HEPCloud.
  • Good to get a core set of people together for face-to-face strategic discussions a few times a year. Condor Week, the last week of May (21-24), is the next good option.
  • People are encouraged to be frank. This is a senior set of people. We can take bluntness.
  • Next stakeholders meeting to be planned for February to address bigger-picture items.

Support

  • Redmine makes it hard for Fermilab people to contribute effectively. Should we just move it to GitHub? Rucio did it. GitHub is more modern, but we should make sure it is mirrored here.
  • The GlideinWMS Factory understands the heterogeneity of sites, but this is not directly accessible or very visible to the user. There is no way to “just start submitting to A, B, and C”. The decision is easy; the execution is hard. There is value in the factory understanding the heterogeneous nature, but ... Consider the IceCube example, which pulls rather than pushes glideins: e.g., it asks Wisconsin what’s available and then submits. Decision engine vs. acquisition engine. Follow up with David Schultz or Todd Tannenbaum at Wisconsin about PyGlidein.
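
The push vs. pull distinction raised above can be sketched roughly as follows. This is only an illustrative toy, not PyGlidein or GlideinWMS code: CentralQueue, whats_available, push_round, and pull_round are hypothetical names, and "what's available" is assumed here to mean the amount of work waiting centrally.

```python
from dataclasses import dataclass


@dataclass
class CentralQueue:
    """Hypothetical stand-in for the central service (assumption, not a real API)."""
    idle_jobs: int  # work waiting centrally

    def whats_available(self) -> int:
        # In a real system this would be a network query; here it is a plain lookup.
        return self.idle_jobs


def push_round(queue: CentralQueue, site_names: list[str]) -> dict[str, int]:
    """Push: a central component decides and submits glideins to every site."""
    per_site, extra = divmod(queue.idle_jobs, len(site_names))
    return {
        name: per_site + (1 if i < extra else 0)
        for i, name in enumerate(site_names)
    }


def pull_round(queue: CentralQueue, site_name: str, free_slots: int) -> dict[str, int]:
    """Pull: a client at the site asks the center what is available,
    then submits locally, capped by what the site can actually run."""
    wanted = queue.whats_available()
    return {site_name: min(wanted, free_slots)}


if __name__ == "__main__":
    queue = CentralQueue(idle_jobs=7)
    print(push_round(queue, ["A", "B", "C"]))
    print(pull_round(queue, "A", free_slots=4))
```

In the push case the center has to know how each site behaves; in the pull case that knowledge stays at the site, which is the trade-off the discussion was pointing at.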

Project Management

  • Is the development series dead?
    • No – there is just not a lot of time to devote to it with limited effort.
  • 4 releases a year – not very balanced. A 6-month gap between releases is too long. Had issues with deploying releases. 2016 was particularly unfriendly. 50% hit rate. Factory ops hit rate is lower.
  • Jeff pointed out that there are monitoring issues in 3.2.20. He will send more details to the GlideinWMS team. Changes will be included in v3.2.21 if the fixes are reasonably minor. Redmine ticket #17825 is not yet quite right.
  • Stakeholders would like to see time-based releases. Make sure things are kosher for factories. It's possible that factory operations could work on automating the release process. Should we have a bimonthly release?

Roadmap

  • Going off script from the slides.
  • Make GlideinWMS much smaller. Functionality merged in with the decision engine and HTCondor. Make it more modular, use modern tools, and clean up interfaces.
    • Maybe release the factory and frontend components separately as we head toward making things more modular. So maybe a first step is to break it up into pieces now and then work on the pieces.
    • Need to think about how that surgery would work – do it in small steps and integrate it into the release schedule. The roadmap needs to be cognizant of that.
    • Factory operations is waiting for fixes in the monitoring. They are planning to do their own factory monitoring using Grafana and InfluxDB. They would like the factory code to continue to provide XML. They are pushing this data to OSG Elasticsearch. Still need to get the core counts. Not ready to switch over yet.

Technical

  • The Singularity road is still long. Needs a root cause so we can do better in the future. OSG seems to be on the cutting edge; CMS is more stable. How best to keep up with continuous changes?
  • Are there experimental glideins for testing instead of production?
    • OSG and CMS have separate testing pools. A special flag needs to be set. May need the factory to update the instance as well. Can non-CMS users use it? Yes.
  • GlideinWMS needs to work on a replacement for the Condor switchboard. GlideinWMS team to pick it up and strip it to the bare minimum for their requirements.
  • Code modernization. What else besides Python is happening? What tools are we using to analyze code?
    • Futurize runs nightly. It needs to pass on pull requests. Can we make that visible to users? Use a commercial service that people will understand. Check with Wisconsin on the tools they use. Look at GitHub Marketplace.
  • Could use feedback from the development team on how much time each feature takes – may influence how important it is. 2 days. 2 weeks. 2 months.
  • Is it worth keeping two different releases?
    • 3.21 had a release candidate. Can we fit in monitoring cleanup? If we can get this going in the next week, yes. Factory operations would like the fixes in place for a new hire starting in January.
  • Can the pull request from Brian Lin make it in too?
  • What is the Singularity plan? FIFE is just starting testing now. Maybe they should be the first testers. Can start at OSC.