Parag Mhashilkar, 02/06/2019 10:51 AM


Weekly Meeting Notes



February 06, 2019

Marco Mascheroni, Dennis Box, Parag Mhashilkar, Lorena Lobato, Marco Mambelli, Dave Dykstra

  • Dave Dykstra
    • Singularity 3.0.3 is ready for OSG upcoming
      • Known issue with unprivileged mode: executing from Docker requires privilege. The Singularity dev team plans to fix it.
      • WLCG working group meeting: more testing is needed before rolling out unprivileged mode, on the order of 6 months. It takes this long because the Singularity audit is ongoing and scheduled to be done by mid-June; some members want to be able to point to the audit before making a recommendation.
      • ATLAS wants to be able to read from Docker on worker nodes, i.e. download the Docker containers on the WN. That's a lot of overhead and sounds crazy. They don't want to maintain an image repo.
      • Marco: CMS wants condor ssh-to-job to work, but that requires the startd to be started as root, which GlideinWMS cannot do.
        • Dave thinks he can provide some help in that direction
      • Travel to WLCG workgroup: they are asking SI lab, which may be able to pay for Dave's travel. CMS is already paying for the CVMFS workshop.
      • Dave and Marco to work together on providing solution for CMS
  • Dennis Box
    • Reviewing #21682. Will be done and go back to working on #2531
    • No progress on travis ci and getting artifacts
  • Marco Mascheroni
    • Couple of issues from CMS. The Frontend is crashing because one of the schedd attributes evaluated to error/undefined, causing an exception. We need to add more protection. There is also a leak in fork.py. Changes may not be propagating to the Frontend process.
    • A Factory operator added an entry. She couldn't get logs from the pilots because the pilots were removed at the Frontend's request. Mambelli added a disable to fix it. Getting the log back when you kill the job depends on the batch system: if the removal is translated to kill -9, you don't get it back.
    • CPU = auto and memory set to zero
    • Operations team meeting
      • Session on auto generation of config. Address problem at abstract level, trying to identify category of items required for config.
      • Topics based on migration of services. Not focused on different factory/services etc
  • Lorena
    • Testing GlideinWMS 3.4.3 + HTCondor 8.4.8 to identify black holes
  • Marco Mambelli
    • Working mainly on troubleshooting issues about frontend crashing.
    • HTCondor survives the glidein. Made changes to the glidein and condor startup. There is a trap in place to forward the signal. Glideins were killed right after starting, so the script is being made more responsive. Working with Diego and a sysadmin at Purdue to troubleshoot. Their PBS is sending SIGTERM and SIGKILL one after the other, so we don't get time to react. Working with the OSG team since their wrapper script is not forwarding signals correctly.
    • Release of 3.4.3 has been promoted to testing.
    • Started working on the multi node glidein ticket. Added an option as multi glidein.
    • glidein_off problem reported by Shreyas. Mascheroni to follow up.
  • Project News
    • There is a possibility of moving the project from Redmine to GitHub
    • Marco submitted 4 student requests.
      ----
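
The signal forwarding discussed under Marco Mambelli's item above can be sketched as follows. This is only an illustration in Python, not the actual glidein startup code; the function name `run_and_forward` and the choice of signals are assumptions. It also shows why back-to-back SIGTERM and SIGKILL leave no time to react: SIGKILL cannot be trapped, so only the signals listed here can be relayed to the child.

```python
import signal
import subprocess
import sys


def run_and_forward(cmd, signals=(signal.SIGTERM, signal.SIGINT)):
    """Start cmd and relay the listed signals to it; return its exit code.

    Hypothetical helper: a wrapper that does not do this leaves the child
    (for example a condor daemon) running after the wrapper is killed.
    """
    child = subprocess.Popen(cmd)

    def forward(signum, _frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handler and remember the previous handlers.
    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        # wait() resumes after the handler runs, so the wrapper exits
        # only once the child has actually terminated.
        return child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)


if __name__ == "__main__":
    sys.exit(run_and_forward(sys.argv[1:]) if len(sys.argv) > 1 else 0)
```

On POSIX systems a child terminated by a signal is reported with a negative return code, so a caller can tell a forwarded SIGTERM from a normal exit.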

January 30, 2019

Marco Mambelli, Dennis Box, Parag Mhashilkar

  • Marco Mambelli
    • Move code review to Thursday and Friday during OSG All hands meeting.
    • Talked to OSG. They just made an OSG release and will include GlideinWMS in the coming release, in 2-3 weeks
    • There are still issues with condor daemons surviving past the glidein startup script
    • Started working on Singularity to consider release distributed by OSG in CVMFS and consider it in the path.
    • Wrote possible projects for summer interns and there was some communication with Sandra
  • Dennis Box
    • Working on #2531: store the number of job restarts in the Frontend.

January 23, 2019

Marco Mambelli, Marco Mascheroni, Dennis Box, Parag Mhashilkar, Dave Dykstra

  • Singularity
    • OSG releasing singularity 3.0.2 in upcoming (current release in EPEL)
    • The problem seen at OSC w/ Singularity 3 (Too many symbolic links, was giving a permission error from the kernel to Singularity, was working w/ 2.6) seemed more a site problem: updating to RHEL 7.5 fixed the problem
    • Singularity 3.0.3 released and will be soon in EPEL
  • v3.4.3 Release Status
    • Mambelli:
      • RC2 out in osg-development, tests are OK so far
      • Release expected for Thursday or Friday
      • Still investigating some worker nodes where glidein is killed but condor keeps running and accepting jobs, moved the ticket to 3.5
  • Developers
    • Mascheroni
      • Busy w/ operations this past week
      • Will work more on interfacing with CRIC
      • Will check w/ Frank about skipping Thursday at OSG all-hands to do GlideinWMS code review then
    • Dennis
      • kicked off automated tests, so far all OK
    • Mambelli
      • Completed 3.4.3 tickets
      • Prepared RC and started tests
      • Troubleshooting HTCondor surviving glidein. Possible race condition?
  • Tentative code review dates: April 1, 2 or March 21, 22 (after OSG all-hands)

January 16, 2019

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Dennis Box, Parag Mhashilkar

  • v3.5 Release Status
    • Mambelli:
      • Waiting on feedback on a couple of tickets. Cut an RC, but it does not include those changes. It should be in osg-development soon. It is in minefield.
      • Need to check with Steve whether the ticket resolves what he needs.
      • There might be some worker nodes where the glidein is killed but condor keeps running and accepting jobs.
        • Singularity support added a process group, and there is a condor warning that this may prevent condor from being killed.
  • Developers
    • Lorena
      • Mainly working on ticket feedback and getting ready for the release candidate
    • Mambelli
      • Monitoring tickets and working with Thomas. Last week was his last week. Frontend was reporting and Factory had some problems.
      • Dennis interested in picking up the monitoring work from Thomas.
    • Dennis
      • One ticket for #21763. Parsing files into other config files. Not sure if it should go in this release? As per Marco some changes are necessary.
    • Mascheroni
      • Couple of fixes for the release
      • Looking at the process group issue on worker node
      • Working with the CRIC developers for interfacing with CRIC
  • Tentative code review dates: April 1, 2

December 19, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato

  • v3.5 Release status
    • Postponed after 3.4.3
  • v3.4.3
    • Will include bug fixes, not single use factory
  • Developers
    • Marco Mascheroni
      • Work on AttributeError: dirSummaryTimingsOut instance has no attribute 'data' [#21570] and [#21569]
      • Worked on Automatically remove glideins after walltime (possibly for 3.4.3)
    • Lorena
      • Closed #21325: Potential bug in 3.4.2 frontend--not recognizing entries in downtime.
      • Feedback on various tickets for 3.4.3
    • Marco
      • Troubleshoot Avoid glideFactoryEntryGroup process leaks [#21569]
      • coordinating releases and troubleshooting

Next meeting will be in 2019, no meeting on 12/26


December 12, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Dennis Box, Dave Dykstra

  • Singularity
    • Singularity 2.6.0 had a security issue; 2.6.1 was released today by OSG. The problem is related to setuid Singularity, affecting only privileged Singularity and only on EL7. Unprivileged Singularity is the recommendation, but sites that cannot use the EL7 features cannot use it, and not all VOs have switched.
    • Tried Singularity 3 in EPEL. There was only an old Go in the repo (too old); now they updated it (RHEL removed it from RHEL 7.6). It is failing on the regression test. Unprivileged Singularity 3 fails on the contain option (supporting pseudotty is failing); developers are working on it.
    • Singularity install instructions updated to tell to enable unprivileged singularity
    • When will GWMS support the changes to the search path to unprivileged singularity? As requested in a ticket: add OSG unprivileged singularity by default at the end of the searchpath. VO can override at the beginning. It will be in 3.5
  • v3.5 Release status
    • Postponed after 3.4.3
  • v3.4.3
    • Will include bug fixes, not single use factory
  • Developers
    • Marco Mascheroni
      • Tests for ticket 19949 Automatic generation of config
      • Worked on new ticket for 3.4.3
    • Lorena
      • Blacklist tickets
      • debugging environments
      • Ticket about not recognizing properly downtime entries
    • Dennis
      • Fixed problem that was breaking the unit tests
    • Marco
      • Worked w/ Lorena on the blacklist ticket
      • coordinating releases and troubleshooting

December 05, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar

  • v3.5 Release status
    • Waiting for unittests, and other pending issues before we can cut a release
    • Couple of branches Dennis is working on that are crashing the unittests.
  • Developers
    • Marco Mambelli
      • Working on multinode and 3.5 single user factory
      • Discussion last week about handling of singularity binary that osg is distributing.
      • Need to send the doodle poll today
      • Closed github pull request
    • Lorena Lobato
      • Working on blacklist script. Working with Shreyas
      • Testing
    • Marco Mascheroni
      • Configuration diff tool for 2 entries.
      • #19949 reconfig ticket. Both these tickets will go in 3.5
    • Dennis Box
      • Haven't spent much time on GlideinWMS; focusing on jobsub
      • Looking at issues reported by Marco
  • New request to include chirp binary but has issues with singularity platform
  • Marco Mascheroni will be in US so we can coordinate the code review accordingly.

November 28, 2018

Marco Mambelli, Marco Mascheroni, Dave Dykstra, Dennis Box

  • Singularity
    • RHEL/SL 7.6 is out and the current kernel supports unprivileged user mount namespaces. Starting conversations on how we will be using it, on the containers mailing list.
    • Already installed the latest Singularity with unprivileged support in CVMFS OASIS. It is not enabled by default and needs a sysctl. Working with Marco on a wrapper to figure out which binary to use. Will also sysctl for OSG.
    • Getting requests from LIGO for Debian versions. Wrapper scripts will simplify choosing which Singularity binary to use.
    • There is a new CVMFS repository in http://hsf.org. In the future it will always be available in the configuration.
  • v3.5 Release Status
    • Wrapping on couple of tickets. Multi node submission and downtime in factory.
    • Hope to have it done by end of the week for release candidate.
  • Developers
    • Marco Mascheroni
      • Working on items on list in factory. Not much progress last week
    • Dennis Box
      • Fixed SL7 scripts that were failing
      • Need to look at feedback ticket for next release
    • Marco Mambelli
      • Working on release related issues mentioned above
      • Coordinating Singularity related issues
      • Monitoring doodle poll(?)

November 21, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar

  • v3.5 Release Status
    • Marco: Couple of releases need to be reviewed. Working on multi node. Single user in feedback
    • Lorena: Working on downtime entries being ignored
    • Release candidate expected end of the next week
  • Developers
    • Marco Mascheroni
      • Working on finishing up tickets planning for next release
    • Marco Mambelli
      • Multi node submission
      • changed release script
    • Lorena
      • condor root switchboard review
      • testing and documentation
    • Dennis
      • Docker testing. Will try to run it in CI

October 31, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Dennis Box, Eric Vaandering

  • v3.4.2 Release status
    • Released last Friday. Will replace 3.4.1 (not released to OSG production) OSG Jira ticket 3449. Planning of osg-3.4.20. Currently it is in testing. Planned for this week.
      • Edgar did not like that enabling some features in the Frontend could fail the glideins if the Factory was not updated. So 3.4.2 adds a Frontend config test and does not allow new features if one of the connected Factories is old.
      • Pylint tests may be red because of a pylint error: external packages (htcondor, classad for us) must be whitelisted, otherwise methods are not found. Fixed in #21245; some branches may be red until this is merged.
  • v3.5 Release status
    • Next release, planned for mid-November (RC 2 weeks from yesterday)
    • Single user Factory and multi-node jobs are the drivers. Evaluate all other tickets and move to 3.5.1 if you think they will not fit the time frame
  • Developers
    • Dennis Box
      • Review pylint test ticket #21245
      • Ran Automated test for 3.4.2: SL6 OK, SL7 failing, probably fermicloud changes, will check
      • All tickets doable except old ticket on storing job restarts, #2531
    • Lorena
      • single user Factory ticket: removed GT2, removal of privsep code, interaction w/ condor team, #21247, #20215
      • will review and move tickets
    • Marco Mascheroni
      • Worked on the tickets in feedback
      • Work on the switchboard ticket, #20215
      • Will review what to do for the next release
    • Marco Mambelli
      • Ticket to test the Frontend configuration, #21241
      • Use systemctl for loading/unloading on EL7
      • Released and tested 3.4.2
      • fix pylint test problems #21245
      • Work on multi-node glideins
      • Will be on vacation next week, limited online availability
  • In 2 weeks there will be the stakeholders meeting
    • Parag will run next week's meeting
    • Prepare 1-2 slides for the stakeholders meeting w/ status and focus
    • Lorena will miss the meeting the next 2 weeks (overlapping trainings), will send the slides for the stakeholders meeting
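
The Frontend configuration test described under the v3.4.2 release status (do not allow new features if a connected Factory is old) can be sketched roughly as below. This is only an illustration; the function names, the dictionary of Factory versions, and the minimum version passed in are assumptions and not the actual GlideinWMS check.

```python
def parse_version(text):
    """Turn a dotted version such as '3.4.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def old_factories(factory_versions, minimum):
    """Return the names of Factories older than the minimum version.

    factory_versions is a hypothetical mapping of Factory name to the
    version string it advertises.
    """
    floor = parse_version(minimum)
    return sorted(
        name
        for name, version in factory_versions.items()
        if parse_version(version) < floor
    )


if __name__ == "__main__":
    connected = {"factory_a": "3.4.1", "factory_b": "3.4.2", "factory_c": "3.2.19"}
    blocked_by = old_factories(connected, "3.4.2")
    if blocked_by:
        print("New features disabled; old Factories:", ", ".join(blocked_by))
```

Comparing integer tuples rather than strings matters here: as strings, "3.4.10" would sort before "3.4.2".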

October 24, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar, Dennis Box

  • v3.4.1 Release status
    • Release earlier this week. OSG Jira ticket 3449. Planning of osg-3.4.19. Currently it is in testing.
      • Red herring from Edgar. He installed old version 3.4-1 and not 3.4.1. Resolved after some confusion.
      • Testing in VMs, there were some problems. yum upgrade had some issues because of changes to the fermicloud VMs. Need to remove ssi managed, and the fermicloud yum repos changed URLs. Puppet repos also changed. Since we do not use Puppet it is fine.
      • Developers were busy testing the release before it was made
  • Developers
    • Marco Mambelli
      • Topics covered in release status above
      • Activation of service in chkconfig --add.
      • Started working on 3.5 ticket multi node submission. It will go in upcoming repo. Timing is around November.
      • HTCondor 8.7.10 rc will be last release for 8.7 and will be renamed to 8.8 by mid November in osg upcoming
    • Code review
      • after 3.5 would be good time to do it since more code pruning is expected
      • Modular Monitoring
    • Marco Mascheroni
    • Lorena
      • Working with Marco on switchboard. Removal of gt5. Taking care of issues as we encounter. HTCondor is helping with testing and other useful scripts for transition.
    • Dennis Box
      • Ran smoke test on 3.4.1 and caught config related to singularity that was wrong in the test.

October 17, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar

  • v3.4.1 Release status
    • Marco will cut the release this afternoon. Everything is OK with the tests Edgar did for GlueX and GLOW.
    • Will go into next OSG release
  • Developers
    • Marco Mascheroni
      • CRIC: waiting on feedback. Rewrote the script on how to proceed based on their feedback. Iterating to come up with final version
      • for 3.5 working on script to change ownership of log files and others for migration to single user factory
      • Working with CERN operator for tools and training
      • Just started on CHEP paper
    • Marco Mambelli
      • Involved with testing of release and validation of frontend configuration
      • Following up with FIFE. They are running with hybrid version of glideinwms with scripts for singularity for Nova for SL7 nodes.
    • Lorena Lobato
      • Working on switchboard ticket
      • Removal of gram gt4 and gt5
      • one of the packages has error when upgrading the frontend
    • Dennis
      • Not much on Glideinwms this week. Moved some unittest tickets to 3.5

October 10, 2018

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar

  • v3.4.1 Release status
    • Marco's ticket about adding tests for frontend config. Tests for singularity functionalities. Marco is still working on it and expected to be done by the end of the day. Waiting on Marco Mascheroni and Edgar to hear back on tests.
      • Marco Mascheroni was able to get some testing started. Planning to run Singularity tests later today. Noticed that if you specify the collector address as a range of ports in the Frontend, it is not working. For RC3, reconfig should give you a reconfig error.
  • Developers
    • Marco Mascheroni
      • Working on generating configuration from CRIC, to have it for 3.5
    • Marco Mambelli
      • Testing of v3.4.1 and frontend configuration
    • Lorena Lobato
      • Working on 3.5 removal of the condor switchboard

September 26, 2018

Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato, Dave Dykstra

  • Dave Dykstra
    • Singularity application review - Sylabs people are the main writers; Dave and B. Bockelman made significant contributions to the document. This is an application for NSF funding (trusted app) with a rigorous 6 mo review by UW.
    • Singularity 3.0.9 getting released. Dave suggests waiting some more minor release before switching to Singularity 3.x. Developers have been rushing some major feature, with no time to get all bugs out. May go in EPEL testing
  • v3.4.1 release status
    • Marco Mascheroni will resume tests on itb-dev (was waiting for Edgar, he is on vacation this week)
    • Marco Mambelli will release RC3 w/ hybrid config (shared ports + different ports). This will ease upgrades from <=3.4. If you have RC2 installed, OK to continue to test w/ that.
    • Release by the end of the week
  • Developers
    • Marco Mascheroni
      • working on the automatic generation of the configuration from CRIC. Talking to CRIC developers
    • Dennis
      • finished unit test for validate_nodes, 20909
      • completed 20945, remove some soft link in unit tests
    • Lorena
      • finishing test RC2
      • ticket about shared port and transition w/ older factory
      • review validate_node unit test
      • review documentation
      • started to work on switchboard ticket for 3.5
      • Marco Mambelli clarified that Marco Mascheroni (as factory operator) will provide guidance to Lorena about this ticket.
    • Marco Mambelli
      • Whitepapers about condor_root_switchboard and Singularity
      • tests of RC2
      • tickets review
      • working on some extra validation of Singularity parameters in the Frontend
      • Will release RC3 today

September 26, 2018

Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar

  • v3.4.1 release status
    • rc2 released for internal testing. has changes based on findings from rc1.
    • Edgar & Marco will be testing
    • needs some minor work on the shared port ticket before releasing rc3
    • Shouldn't take longer for Marco to test it out but needs to coordinate with Edgar to test it
    • Dennis will fire up automated build and results will be out later today
  • Developers
    • Marco Mascheroni
      • Working on automatic factory regeneration from CRIC. Tool to generate xml file from CRIC. Want to make it a two step process where intermediate files generated can be update/appended by the operator.
    • Dennis Box
      • Automated testing of release candidate. Will kick off the build and look through the output and report back later today.
    • Lorena Lobato
      • Working on testing shared ports
    • Marco Mambelli
      • Working with Lorena on shared port testing
      • Helped Ken, Joe and Shreyas to run Nova job on OSC v3.2.19 frontend. Gave them singularity script that can be run as a VO script.
      • Explanation with Krista on glideinwms versions for switchboard and condor and requirements for upgrades

September 19, 2018

Dave Dykstra, Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar

  • Dave Dykstra
    • Singularity 3.0 getting more ready. Dave is submitting pull requests to get it in shape. Working pretty well, but many things are not compatible w/ 2.6
    • Currently, 3.0, there are a runtime and regular package (including image building that requires setuid processes). The new version will not require setuid, so it will be a single package.
    • Dave built 2 RPMs: singularity-unpriv, and singularity. Not official packages yet.
    • Only unprivileged Singularity is sufficient for OSG normally because we mount expanded images from CVMFS. Image file mounting still requires privileges
    • Dave and Marco will discuss offline the search path for Singularity off OASIS
  • v3.4.1 status
    • Marco did some tests in ITB-dev and had problems. Things were fine in ITB.
      Was testing Backward compatibility using new frontends and new factory, and a HA configuration. Collector and CCB strings were cut after a comma [#20880]
    • Problems discovered in RC1 and addressed:
      1. change the preset of variables that was conflicting w/ current CMS use and failing reconfig [20819]
      2. regular entry not used in entry_set testing was causing a crash [20785]
      3. shared port HTCondor was not configured correctly see in [7341]
      4. A space in collector string was tripping the parsing [20871]
    • Marco Mli should send a communication to the stakeholders if we delay further than this week.
    • Marco Mli About to meet w/ Joe about FIFE on OSC using Singularity
    • Marco Mascheroni will test w/Edgar by the end of the week Singularity capabilities for Gluex
    • CMS wants another VO to test and adopt first. Same for OSG
  • Developers
    • Dennis
      • Automated tests ran fine. New install and upgrade worked. Jobs ran. Touch base w/ Marco Mli offline because this should not have happened
      • new unit tests to review, in 3.5
    • Marco Mascheroni
      • a couple of tickets for 3.5
        • improvements to manual submission
        • tickets to add a scaling factor
        • periodic remove expression for long-running glidein (put an expression to clean up automatically and not to have to run cron jobs and manual cleanup)
    • Parag
    • Lorena
      • working w/ release candidate for upgrading and installation from scratch
      • tickets for 3.5
    • Marco Mambelli
      • Wrote documentation for Singularity and emails about what it will bring
      • Release and testing of RC1
      • Working w/ other developers for troubleshooting
  • Others
    • Input for the technical plan is needed. Marco Mli will send to Parag & Tanya
    • Marco Mli proposed a Release assessment in 2 weeks (after release, what worked, what not)
      • Lorena proposed to lead the session. Everyone will contribute.
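
Two of the RC1 problems above were parsing issues (collector and CCB strings cut after a comma, and a space in the collector string tripping the parsing). A minimal sketch of a tolerant parser is shown below. It is not the GlideinWMS code: the function name is hypothetical, and the `host:low-high` port-range form is an assumption about the configuration format. Only plain `host:port` entries are handled, not full sinful strings.

```python
def expand_collectors(value):
    """Split a comma-separated collector string into host:port endpoints.

    Whitespace around entries is ignored and an assumed 'host:low-high'
    port range is expanded into one endpoint per port.
    """
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled commas
        host, sep, ports = item.rpartition(":")
        if not sep:
            endpoints.append(item)  # no port given, keep as is
            continue
        low, dash, high = ports.partition("-")
        if dash:
            endpoints.extend(
                f"{host}:{port}" for port in range(int(low), int(high) + 1)
            )
        else:
            endpoints.append(f"{host}:{ports}")
    return endpoints


if __name__ == "__main__":
    print(expand_collectors("cm1.example.org:9618, cm2.example.org:9620-9622"))
```

Splitting on every comma and stripping each entry addresses both reported symptoms: nothing after the first comma is dropped, and a space after a comma does not end up inside a host name.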

September 5, 2018

Dave Dykstra, Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato, James Letts, Eric Vaandering

  • Dave Dykstra
    • Singularity 2.6 released and in EPEL and OSG
    • Alpha release of 3.0, a lot is still missing
    • RHEL 7.6 is out w/ unprivileged user mount namespaces, this will allow unprivileged Singularity (was technology preview in previous releases), now will have security updates and is easier to enable (no reboot)
  • v3.4.1 status
    • Waiting for singularity for singularity ticket. Marco Mambelli updated with changes in the current CMS and OSG scripts.
    • Few more feedbacks to resolve.
    • Working to get RC by end of the week.
    • Marco Mascheroni proposed to add improved glidein_manual_submit scripts. Will be included
  • Developers
    • Marco Mascheroni
      • Working on scripts glidein_manual_submit (3.4.1) and glidein_led_credential (3.5)
      • Meeting w/ Brian and Jeff about Factory tools: automatically generate configuration for the Factories from IS, tell the difference between 2 factory entries (glidein_factory_diff). Marco will open a ticket about these. Would be nice to have a tool to purge the files in the factory stage area especially if the reconfiguration turn-around is more frequent. There is already #19949 (3.5)
    • Lorena
      • Worked with M Mascheroni for feedback, will work on feedback for the Singularity ticket.
    • Dennis Box
      • Done w/ release tickets, 2531 moved to 3.5
    • Marco Mambelli
      • Worked on singularity scripts and testing them (out for feedback). Will work on documentation
      • Troubleshooting the library problem sometimes when using the HTCondor feature to transfer sandbox using squid proxy (condor curl plugin). It may be a misuse of the binaries (w/ wrong platform). Ticket created for 3.5, #20749.
  • Eric and James had no comments

Aug 29, 2018

Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato

  • v3.4.1 status
    • Waiting for singularity for singularity ticket. Marco Mambelli updated with changes in the current CMS and OSG scripts.
    • Few more feedbacks to resolve.
    • Working to get RC by end of the week.
  • Developers
    • Marco Mascheroni
      • Working on feedback and completing 20320, 20301
      • A race condition came out w/ the CMS singularity periodic validation script using add_config_line instead of add_config_line_safe (w/ lock)
      • There is a library problem sometimes when using the HTCondor feature to transfer sandbox using squid proxy (condor curl plugin), the solution depends on HTCondor. Changing LD_LIBRARY_PATH in glidein_startup instead of condor_startup would allow validation scripts to detect it. Marco will open a ticket
      • Testing ticket 20031 assigned to Marco, can slip to 3.5
    • Lorena
      • Working with M Mascheroni for feedback.
    • Dennis Box
      • Almost done on getting python3 unit tests up and running #20232
    • Marco Mambelli
      • Working on singularity scripts and testing them. Updated with changes in the current CMS and OSG scripts.
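
The race condition noted above (add_config_line versus add_config_line_safe with a lock) is the usual read-modify-write problem when several scripts update one config file. The sketch below illustrates the locking idea in Python; it is not the implementation of add_config_line_safe, and the function name, the `.lock` and `.tmp` file names, and the `KEY value` line format are assumptions.

```python
import fcntl
import os


def set_config_line_locked(config_path, key, value):
    """Replace or add 'key value' in config_path while holding a lock.

    An exclusive lock on a side file serializes concurrent writers, so
    one writer cannot overwrite the line another writer just added.
    """
    lock_path = config_path + ".lock"  # hypothetical lock file name
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            lines = []
            if os.path.exists(config_path):
                with open(config_path) as fh:
                    # Drop any previous line for this key.
                    lines = [ln for ln in fh if ln.split(" ", 1)[0] != key]
            lines.append(f"{key} {value}\n")
            tmp_path = config_path + ".tmp"
            with open(tmp_path, "w") as fh:
                fh.writelines(lines)
            os.replace(tmp_path, config_path)  # atomic swap on POSIX
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Without the lock, two scripts that read the file at the same time each write back their own copy, and the later write silently loses the other script's line.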

Aug 22, 2018

Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato, Parag Mhashilkar

  • v3.4.1 status
    • Waiting for singularity for singularity ticket. Marco Mambelli started testing and made some change to the script.
    • Few more feedbacks to resolve.
    • Working to get RC by end of the week.
    • Marco Mascheroni will work on his feedback tickets by tomorrow
  • Developers
    • Marco Mascheroni
      • Want to focus on auto generation of factory configuration
    • Marco Mambelli
      • Working on singularity scripts and testing them. Added util scripts for simplifying the code base.
    • Lorena
      • Finished shared port for secondary collectors and CCB
      • Working with M Mascheroni for feedback.
    • Dennis Box
      • Working on getting python3 unit tests up and running #20232
  • OSG started promoting OSG Glideinwms 3.4 to production
  • CMS using HTCondor feature to transfer sandbox using squid proxy

Aug 15, 2018

Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato, Jack Lundell

  • v3.4.1 status
    • Marco is working with singularity ticket
    • Lorena is almost done on CCB and shared port and testing configuration. Working on testing and documentation
  • Jack Lundell's last-week presentation (will write a report by the end of the week)
    • Has been profiling queries between GWMS Frontend and HTCondor
    • Added profiling output to GWMS monitoring (start-end of queries, constraints, projections)
    • Wrote a tool to parse the logs and collect statistics about similar (grouped by type/projection/constraint) queries: avg, max, min, count
    • Working on identifying queries that could be improved (redundant, w/o projection)
  • Developers
    • Marco Mascheroni
      • Busy w/ Factory operation
      • Troubleshooting problem reported by Edgar: Pilots not running at UWMad ITB site
    • Dennis
      • Worked on v34/19304-3
    • Lorena
      • Worked on shared port ticket
    • Marco Mambelli
      • Worked on Singularity ticket
      • Support to Lorena, Thomas and Jack
      • Thomas working on the monitoring improvement. Classes were redesigned to make it more modular: hardened the code for the frontend and made changes for the factory.
      • Jack is analyzing the condor logs and added logging to record all the queries we do. Found some unprojected queries. Looking into what is triggering them in the code.
  • Marco Mascheroni and Mambelli will follow-up offline on the problem reported by Edgar
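
The statistics tool described in Jack's presentation (group similar queries by type/projection/constraint and report avg, max, min, count) can be sketched as below. This is an illustration only, not Jack's tool: the record layout `(query_type, constraint, projection, seconds)` and both function names are assumptions, and the log parsing step is omitted.

```python
from collections import defaultdict


def summarize_queries(records):
    """Group query records and compute count, min, max and avg duration.

    Each record is an assumed tuple:
    (query_type, constraint, projection_list, seconds).
    """
    groups = defaultdict(list)
    for query_type, constraint, projection, seconds in records:
        groups[(query_type, constraint, tuple(projection))].append(seconds)
    return {
        key: {
            "count": len(durations),
            "min": min(durations),
            "max": max(durations),
            "avg": sum(durations) / len(durations),
        }
        for key, durations in groups.items()
    }


def unprojected(stats):
    """Return the groups whose queries requested no projection."""
    return [key for key in stats if not key[2]]


if __name__ == "__main__":
    sample = [
        ("schedd", "JobStatus==1", ["ClusterId"], 1.0),
        ("schedd", "JobStatus==1", ["ClusterId"], 3.0),
        ("collector", "True", [], 0.5),
    ]
    stats = summarize_queries(sample)
    for key, values in stats.items():
        print(key, values)
    print("unprojected:", unprojected(stats))
```

Queries without a projection return every attribute of each matching ad, which is why they are the first candidates for improvement.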

Aug 08, 2018

Parag Mhashilkar, Marco Mambelli, Dennis Box, Dave Dykstra, Lorena Lobato

  • Singularity Update
    • v2.6.0 is released upstream and built for EPEL and OSG. In testing. A disabled_by_default feature for overlay was added to the EPEL/OSG build.
    • There have been 2 security incidents with overlay, so people have been asked to turn it off and enable underlay instead. Overlay does not work for CVMFS. Overlay uses the overlayfs feature of the EL7 kernel. Underlay creates its own space and bind mounts into the image. Unlike overlay, underlay does not require setuid root.
    • v3 is still targeted for alpha release by end of the month. It does not have underlay feature. also not expecting to be stable for our use until end of the year. Not in roadmap for OSG yet, but will see how stability evolves.
    • Plan for UW Madison folks for doing full 5-month security review for v3.0. Would like to do it based on NSF funding.
  • v3.4.1 status
    • Marco is done with singularity one
    • Lorena is working on CCB and shared port and testing configuration. Sinful strings working for CCB. Need to make sure that CCB is correctly being expanded with collector function.
  • Developers
    • Marco
      • Thomas will be able to continue for another 1.5 weeks. Plan is to have the factory part completed, hardened the code and documentation.
      • Jack is analyzing the condor logs and added logging to record all the queries we do. Found some unprojected queries. Looking into what is triggering them in the code.
      • sinful string = url with sharedport and address. Example in https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4934
    • Dennis
      • Not much progress on glideinwms this week.

Aug 01, 2018

Parag Mhashilkar, Marco Mambelli, Dennis Box, Marco Mascheroni, Lorena Lobato

  • v3.4.1 status
    • Lorena and Marco behind on tickets and expect the release in next 10 days. Marco will send email about the delay to stakeholders mailing list.
      • Move to shared port
      • Singularity improvements
  • Developers
    • Marco Mambelli
      • Mascheroni & factory ops found bug in periodic script wrapper. Was detecting but not signaling if there were errors in one of the scripts. Fixed and committed changes to master. This was detected through black hole effect.
      • Parag: What is the status of moving CMS & OSG to common singularity solution?
        • They are waiting for v3.4.1 release. Edgar will start with GLOW VO.
        • Marco to follow up with Diego for CMS
    • Marco Mascheroni
      • Been busy with new CERN operator
      • Finished last ticket for 3.4.1 on improving sub entry for meta sites
      • Tools for factory operations
      • Will be on vacation from Aug 3 - Aug 13
      • Moved rest of tickets to v3.5
    • Dennis Box
      • Have 3 tickets still open for this release
      • 2 will be able to finish in time (unittest and libfork.py)
      • job restart ticket stuck on columns to put in rrd
      • Thomas is storing rrd info into influxdb which will allow info to be picked up with landscape and other platforms
    • Lorena Lobato
      • Shared port for collectors. Need to make sure sinful strings are correctly supported.
      • Working with Krista to help her debug her frontend. When running the frontend service, the owner is the frontend user and group. The Puppet config makes it use a different group, but it has issues writing to the monitor part.

July 18, 2018

Jeff Dost, Lorena Lobato, Marco Mambelli, Parag Mhashilkar, Dave Dykstra, Marco Mascheroni

  • Project status
    • Margaret and Brian liked the format of developers talking about their work
  • CHEP Status
    • Marco Mascheroni: How to integrate CRIC for automated site info. Factory ops will give requirements to CRIC and use auto factory generation.
    • Jeff Dost: Don't expect to automate right away but provide a stub that can be used to generate an entry. Much of the info is obtained via manual interaction with the site admins.
  • Singularity
    • v2.5.2 has security updates that were rushed out
    • v2.6.0 release candidate in osg 3.4 development and EPEL testing
      • Added underlay feature to it. It's not in upstream but in the v2.6 RC built by Dave for EPEL and osg
      • Singularity team is targeting v3.0 for summer release
  • OSG Operations
    • Jeff Dost & Marco Mascheroni
      • Monitoring of meta sites. Opened a new one #20320. Frontend calculates based on metasite and factory does last minute calculations to split the request. Issues with broken site. Need multiple algorithms.
      • Jeff: Scenario, when the number of sites is more than request
      • Come up with the 3.4 on factories. Marco will upgrade CERN ITB and then based on result next week upgrade SDSC factory.
  • Developers
    • Lorena
      • Working on shared port tickets. Trying to understand the macros for htcondor
      • Working on configuration variable to identify black holes. Met with htcondor team. Ticket assigned to TJ. Tracking average job busy time in an attribute.
    • Marco Mambelli
      • HTCondor will have an average runtime value to identify black holes. Question: if no jobs have been submitted yet, will it be zero or undefined?
      • Worked on singularity changes and what was discussed in the meeting, and on making files with different singularity scripts.
      • Everyone should look at CI emails for their branches and fix broken changes in their branches.
  • 3.4.1
    • Release will be delayed by at least 2 weeks. There are several development tasks still pending.

July 03, 2018

Marco Mascheroni, Lorena Lobato, Marco Mambelli, Parag Mhashilkar, Dennis Box, Thomas, Jack

  • Going through slides for the stakeholders meeting

June 27, 2018

Marco Mascheroni, Lorena Lobato, Marco Mambelli, Parag Mhashilkar, Dennis Box, Dave Dykstra

  • GlideinWMS 3.4
    • Will be released by OSG next week in upcoming repo and does not include switchboard and will require HTCondor 8.6 series and below. Brian Bockelman is against adding setuid software in OSG.
  • GlideinWMS 3.4.1
    • Planned date is end of July and will not include single user factory.
    • We need to be ready with a factory without switchboard before HTCondor 8.8 goes live by end of summer.
  • Python 3 support
    • We have to move to Python 3, as Python and module developers will stop supporting Python 2 by 2020.
    • Thomas did a prototype with python 3.
    • Parag: We should have a python3 branch forked off master and have it tested nightly and keep the python3 branch in sync with master.
  • Developers
    • Marco Mascheroni
      • Not much progress. Working on CHEP poster.
      • CMS is performing a scale test. Found limits on the frontend. Scaling up to 1M jobs, which is 5 times the current scale. The memory issue can be solved through the number of workers; another problem is do_match(), which takes up to 1 hour. Counting glideins takes a lot of time.
    • Lorena Lobato
      • Testing of glidein variables that do not do cast properly when not in variable list.
      • Problem with upgrade of frontend. When frontend_startup upgrade is rewriting the config
      • Waiting on hostname and DN of new factory at UCSD
      • Working with htcondor guys. shared port
      • Slides for glideinwms stakeholders meeting
    • Dennis Box
      • Increasing coverage of unittest. Generate reports that can be displayed through web server

June 20, 2018

Marco Mascheroni, Lorena Lobato, Marco Mambelli, Parag Mhashilkar, Dennis Box, Dave Dykstra

  • Singularity (Dave Dykstra)
    • It's in EPEL. Brian Bockelman got rights to update it and Dave sent a PR to update. Helpful for European folks with WLCG
    • Submitted a PR to allow any mount points to be created in the image. Singularity team is working on a new major change and the PR may not get into the next series, v2.6. Sent mail to the mailing list asking people to support it if they need to get it in 2.6.
    • Parag: How is this recognize this?
    • Marco: Working on feature to have more than one dir mount points (?) Need to add and run code without user touching it.
      • Dave: Can mount read only mount. Current code does not allow when unprivileged.
  • Release 3.4
    • 3.4 will be in upcoming. Discussions with Brian Bockelman, who doesn't want the switchboard rpm. So we will not package switchboard; we will depend on the HTCondor 8.6 series only and not support 8.7. He wants the factory to run as a single user. Brian may be oversimplifying it, and without GT support, the rest of the components supported by HTCondor are secure.
  • Developers
    • Marco Mascheroni
      • Wrapped up manual submit glideins
      • Fixed for meta sites. Rare occurrence
      • Issues with one of the sites where the glidein was triggering periodic remove but the glideins were not actually removed from the site. Suspicion is that it may be related to a lot of glideins being removed at a time.
      • CMS Frontend operator - Zombie pilots: Noticing glidein startup.sh does not wait for condor to exit when killed
      • Diego mentioned why not use https.
    • Lorena
      • Finished tickets on glideinwms doc, fully reviewed. Many missing links and inconsistencies; ticket is in review for feedback. Suggests a full migration to another repo and created a structure for the documentation. Make it clear and condensed.
      • Testing robustness of glidein variable.
    • Dennis
      • Reviewing singularity script from Marco. Seems fine. This was asked by Ken (FIFE) for a template for test scripts. To publish attributes when they run in singularity.
    • Marco Mambelli
      • single user factory is a big change and we should make it a big release.
  • CMS (Eric Vaandering)
    • Lack of factory auto configuration has come up. Will be asking for in coming year or so.
    • Morning meeting: the grid info system provided by CERN is ready for testing and is a fork of AGIS. It is supposed to be a one-stop shop to store info about grid systems, storage and compute. Krista complained that she is spending time on keeping CERN and FNAL factory config in sync.

June 13, 2018

Marco Mascheroni, Lorena Lobato, Marco Mambelli, Parag Mhashilkar, Dennis Box, Thomas Hein, Jack Lundell

  • Release 3.4
    • It is undergoing OSG integration testing to be promoted to osg-upcoming
    • If adopted in production and all OK, it will be promoted to OSG production
  • Stakeholders meeting
    • Will be in one month
    • Developers will provide a couple of slides about their focus
    • Week before SO meeting Parag would like to see slides from everyone
  • Summer interns
    • Thomas working on monitoring
    • Jack working on profiling GlideinWMS-HTCondor communication
    • They joined fnal.slack.org!
  • Developers
    • Marco Mascheroni
      • pylint - manual submit ticket.
      • Other ticket to keep track to glide_off
      • Fix for metasite
      • Number of factories
    • Marco Mambelli
      • multi-node submission
      • Singularity
    • Dennis
      • Customizing pylint to focus on important errors
    • Lorena
      • Work on OSG RPM documentation
      • Working on GlideinWMS documentation
      • switch collector to use shared port
      • Add a configurable limit to failure rate
        • HTCondor opened a ticket about this as well, exchanging emails w/ them
      • Testing robustness of Glidein variables handling

June 06, 2018

Marco Mascheroni, Lorena Lobato, Marco Mambelli, Parag Mhashilkar, Dave Dykstra

  • Singularity (Dave D)
    • Several security bugs in Singularity and everyone is asked to upgrade
    • v3.2.5.1 Minor bug fix release coming out tomorrow
    • Dave is working on a feature that allows any mount points to be created in the image on any system. Currently only overlay is allowed, and it has limitations.
    • Marco
      • Current changes, fixes auto discovery of modules
      • New request for the ability to pass options to singularity from both the factory and frontend side
      • htcondor team wanted containers in general as well and not just singularity
      • add variable for mount points for vo SINGULARITY_BIND_MOUNTS
      • In the glideinwms release, enabled secure CVMFS to be used by disabling session sharing across different users using the same pilot
  • v3.4
    • Release done yesterday
      • Delayed by couple of days because of some red herring issues in glideinwms and OSG koji downtime
      • Discussions with Brian Lin, who wants 3.4 in the upcoming repo; it is currently in upcoming-testing
  • Developers
    • Marco Mambelli
      • Started working on v3.4.1 ticket generation/assignment
      • One of the tickets is for changing attr variable to trigger error
      • Smoke tests are quick but release testing takes about 3 days
    • Marco Mascheroni
      • Did compatibility testing for v3.4 with metasites and found some issues. Have a fix for them
      • Working on manual_submit_glidein based on changes from the manual_glidein_startup
      • Factory ops: Some questions from production
      • Nothing much from CMS side as they are busy with CERN upgrades
    • Lorena
      • Testing 3.4 rc and working with Brian to solve issues with documentation and factory and frontend conflicts. Fixed conflicts. switch board packets.
    • Dennis
      • Working on unittests. Found a bug in coverage that does not look at files not updated for a while

May 30, 2018

Marco Mascheroni, Lorena Lobato, Marco Mambelli, Parag Mhashilkar

  • v3.4 Release Status
    • RC is released. Lorena will do the package and other testing. Production development is better place for this release. There are condor version dependencies as switchboard is not in higher version of condor and there is no easy way to trigger conditional dependency install. This will be tackled through the documentation.
    • 3.4 will go in production and development. Undergoing testing internally.
    • Various testing split across the team
  • Developers
    • Dennis Box
      • Plan is to do smoke testing
    • Marco Mascheroni
      • Working on manual submit glideins. Trying to review it as it stopped working. It takes an ini file as input and is awkward. Would like to simplify it.
      • Plan to remove commands for BDII and ReSS
    • Marco Mambelli
      • Commit done by Marco that works only for rpm and not tarball. Should we stop official support for tarball?
    • Lorena
      • Finished tickets for v3.4 and packaging of switchboard.
    • Parag
      • Discussions with Jaime and Zach and Brian on possibilities of getting rid of switchboard

May 16, 2018

Marco Mascheroni, Dennis Box, Lorena Lobato, Marco Mambelli, Parag Mhashilkar

  • v3.4 release status
    • Couple of tickets are in testing and review. Couple need code changes. Changes in singularity setup or put an expression in condor var list. Other is changes for tracking glidein requests.
    • Couple of tickets in feedback.
    • Estimated cpu usage review. Lorena will be reviewing it.
    • Given the pending work and HTCondor Week next week, May 24 seems very tight.
  • Developers
    • Lorena
      • Working on building and packing of condor switchboard in OSG. Need to contact the OSG for that.
      • Working on #17824, #16161
    • Dennis Box
      • Working on testing singularity ticket #19920 and found a problem. Fix provided by Marco works. There is a typo in the code path used when singularity is not in the path and module load is used.
  • Stakeholders meeting input
    • Marco Mascheroni, Lorena and Dennis should present the work they are doing as part of the stakeholders mailing list

May 9, 2018

Marco Mascheroni, Dennis Box, Lorena Lobato, Marco Mambelli, Parag Mhashilkar

  • Releases
    • v3.4 RC is planned in a week. Please, all close on the issues you think to complete
    • This meeting will focus on tickets for 3.4.
    • If a ticket will not fit (done by next week) should be moved to 3.4.1
  • 5/11 is the stakeholders meeting in the 9th-floor room. Developers are welcome to join, Marco will forward the coordinates
  • Marco Mascheroni
    • Factory monitoring ticket in Feedback to Dennis
    • Working on the tools review ticket, will move the others
  • Lorena
    • Tested and reviewed 17102
    • Working on Condor switchboard ticket
  • Dennis
    • Review of tickets
    • Will provide stats on increased unit tests coverage
  • Marco Mambelli
    • Most tickets ready
    • 16161, cores estimation, ready for testing and review
    • 19293, tracking jobs requests needs work
    • 19827 and 17662 need troubleshooting/testing
    • Will open a ticket about singularity bin discovery error (19920)
  • Parag
    • Tickets moved to 3.4.1

May 2, 2018

Marco Mascheroni, Dennis Box, Lorena Lobato, Marco Mambelli, Eric Vaandering, Dave Mason

  • Releases
    • v3.2.22.2-3 in osg-testing (OSG ITB already updated), expected in production in 1-2 weeks
    • v3.3.3-3 in osg-testing, expected in production in 1-2 weeks
    • Our next release will be 3.4.0, RC in about 2 weeks
  • GWMS, FIFE, Singularity testing
    • Coordinated w/ OSG to update and configure the dev Frontend
    • Coordinated w/ Factory operation to configure the entry used for the test
    • Local test ran on fermicloud to verify the configuration suggested
    • FIFE (Ken) will run the jobs and test that they will run in Singularity
  • OSG is going away [#19875]
    • Will remain in some form but most services currently provided will scale down or go away within May
    • Not sure if ticket.grid.iu will stay. It is a great source for troubleshooting/problem solving
    • OSG CAs will not provide or renew certificates any more after May -> We may have to reboot our machines in fermicloud (to pick up new certificates) and to update the certificates used in factories/frontends
    • The factories hosted at IU (prod/ITB) will go away on 5/23, UCSD will install an ITB factory. There will be only one production factory --> We'll have to update the default configuration shipped with GWMS and the configurations used in our setups and the DN/addresses referred in the documentation
  • GWMS 3.2 and 3.3 have been merged in v3.4
    • the new main branch is master
    • see email for changes in Redmine tickets and the git repository
  • There is a plan to move the main git repository to GitHub, but it is not immediate
  • 5/11 is the stakeholders meeting. Developers are welcome to join, Marco will forward the coordinates
  • Marco Mascheroni
    • 100% on factory ops during last week as well, should be back 50% this week
    • will complete tickets in progress and start the condor ticket
  • Lorena
    • Pull request to merge OSG documentation
    • Smoke test maintenance
    • Reviewing ticket about futurize test
    • Testing ticket about 509 user proxy expiration
    • Entries in downtime
    • Has been reviewing the documentation
      • Considering to remove corral documentation
      • Review, eliminate error and fix grammar in the documentation
    • Reported about cloud meeting at the department meeting
  • Dennis
    • 19830 - pylint errors in unit tests, later today
    • 3 tickets w/ feedback 17417
  • Marco Mambelli
    • Prepared v3.4 migration
    • Working on tickets for release
      • 509 user proxy expiration
      • tracking job requests
    • working on Singularity test
  • No comments from Eric and Dave
  • Marco Mambelli requested to focus on release tickets for the next 2 weeks

April 25, 2018

Marco Mascheroni, Parag Mhashilkar, Lorena Lobato, Marco Mambelli

  • v3.2.22.2
    • Upgrades not planned yet. Was initially planned for this week
  • Marco Mascheroni
    • Noticed that after adding an entry and changing the memory, changes did not propagate to the condor config. Will try to debug it.
    • CMS is busy with Singularity and T0 migration
  • Lorena
    • Fixing documentation
    • Working on #14164: problem does not seem to be persistent
  • Marco Mambelli
    • Plan to merge 3.2.23 into master and make a 3.4 release
    • condor_switchboard support. Once condor moves to 8.8 it won't be distributed by condor. We need to create a tarball and spec file to build it. We should eliminate the need for switchboard before we move to SL8. Work planned out with htcondor guys.
    • Testing 3.3.3 rc1 and looks fine. Plans to release final version today.
    • venv tarball issues with CI script. upstream changed the tarball location.
    • Some code was not future proof and it kept crashing on sl6

April 18, 2018

Marco Mascheroni, Parag Mhashilkar, Marco Mambelli

  • v3.2.22.2
    • Released yesterday
    • fixes singularity issues
    • fixes proxy renewal script
  • v3.3.3
    • Includes changes released until v3.2.22.2
    • Doing automated testing
  • Developers
    • Marco Mambelli
      • Auto CI testing did not run last night. Emailed Vito and he is checking it
      • Jenkins managed individual tests are running correctly
      • OSG is outsourcing JIRA and reduced number of accounts
      • Promote 3.3 to production. Have to do the comparison and figure out the differences.
      • HTCondor collaboration work. Haven't tested yet.
      • Want to start FIFE with singularity
    • Marco Mascheroni
      • tested v3.2.19 factory with v3.2.22 frontend and singularity default scripts worked. ITB factory at UCSD will be updated to v3.2.22.2. FIFE wanted to make sure v3.2.19 factory worked with latest frontend before upgrading.
      • OSG released v2.4.6 singularity that caused incompatibility and resulted in v3.2.22.2 factory and frontend.
      • Plans to upgrade factory at UCSD next week and then follow it up at IU.
      • Parag: Marco to create tickets for factory ops requests and work on them

April 11, 2018

Marco Mascheroni, Dennis Box, Lorena Lobato, Marco Mambelli

  • Released yesterday 3.2.22
  • Developers
    • Marco Mascheroni
      • Test Factory and Frontend 3.2.19 and new Frontend (3.2.21) against 3.2.19. Will test 3.2.22
      • Wiki instructions are incorrect, say to use osg-contrib repository and point to OSG twiki
      • Production Factories: SDSC 3.2.21 w/ fixes (Classad string+Singularity), CERN, GOC, 3.2.19
      • Final things to commit the monitoring for the metasites + documentation
      • Collected requests and comments from Factory Ops and surveyed about GWMS tools (see google doc in email): https://docs.google.com/document/d/1ANP80spS9so58OGPPt3JmlZKiw7fRFH4TSJLT1RvOAs/edit
    • Dennis
      • Working on ticket 2531
      • testing 3.2.22
    • Lorena
      • Updating Factory and Frontend to RC and now release
      • Work w/ Brian on fixing/testing script to update proxies
    • Marco Mambelli
      • Worked on RC, integrating fixes for proxy updates and release.

April 4, 2018

Marco Mascheroni, Dennis Box, Lorena Lobato, Marco Mambelli, Jeff Dost

  • Released RC1 for 3.2.22
    • All tests resulted OK, RC2 was added to fix the proxy renewal script
  • Jeff
    • saw the RC, will test it. Send email to factory ops w/ green light once internal tests are passed
  • Developers
    • Marco Mascheroni
      • the corner case generating the string is no more there. It is hard to replicate where the strings are coming from. The faulty Frontend was UCLAC (OSG 3.2 repo 3.2.16 GWMS ), now updated. Glow or LIGO Frontend. Anyway, Marco keeps monitoring the classad
      • Discussing w/ Jeff and other operators, working on a list of requests for GWMS developers. Mostly about monitoring and handling held glideins
      • Nothing else to report, Easter break and shift
      • 17-18 of May will be at Fermilab.
      • Interested in question discussed in the condor meeting (Marco and Parag in Madison)
    • Dennis
      • kicking off the RC tests.
      • Looking at 17417
    • Lorena
      • Familiarize w/ GWMS framework
      • Access to ITB factory. Problem w/ DN of pilot proxy was missing in the condor mapfile.
      • Started to work as developer.
      • Added to the OSG build
    • Marco Mambelli
      • Worked on last tickets, release of RC and testing

March 28, 2018

  • Jeff
    • OSG all hands last week.
  • Developers
    • Marco Mascheroni
      • done w/ monitoring of metasite, status-now page shows the breakdown for entry
      • Operations: removal of idle, periodic remove expression. When the site is at capacity, the glidein stays idle, then we remove.
        The right thing to do is to keep the idle, remove them after 8 hours, not 1 hour. The mechanism is backfiring when the site is at capacity
        Antonio and James have shown that global pool efficiency went up but when sites are at capacity then it starts backfiring.
        The frontend knows the pressure, so having it remove the excess glideins would be a better solution.
        How can the expression be changed to consider the site capacity? So glideins are not removed if the site is at capacity
      • Is there a feature in 3.2.21 that could help w/ Vanderbilt?
        • If a glidein gets preempted at a site, then the periodic removal removes it (after a couple of hours). Would it be better to prevent condor from resubmitting it?
          When a glidein is evicted (not held) could we prevent automatic requeue by condor?
          Generally makes no sense to have that glidein because pressure may have changed in the meantime
          Held glideins should be removed
          Condor release should never be called
          It will never release, it will resubmit from idle.
          The policy should try to remove held jobs.
          A release triggers a reschedule.
          Grid protocols are not sophisticated enough to recover a held job
          It always triggers a resubmission.
          Continuation may work only with Condor-CE.
        • Marco replied that sometime jobs are recovered (or recovery is attempted instead of resubmitting). There is an ongoing discussion w/ the condor team.
    • Dennis
      • ready to test RC
      • working on 19304
    • Lorena
      • started to look at the code and at the ticket that was assigned
      • Assuming that there are no free slots to run the pilot
      • Working on her test setup. Cannot see the frontend in the factory monitoring (ITB). Yesterday she could, not today: http://glidein-itb.grid.iu.edu/factory/monitor/factoryStatusNow.html
      • Off the call, Jeff will check collector auth errors. There is an ad in the collector, talking to the factory even if not in the monitoring
        63 fermicloud364-fnal-gov_OSG_gWMSFrontend.main
        condor_status -any -pool glidein-itb.grid.iu.edu -const 'mytype=?="glideclient"' -af clientname | sort | uniq -c
      • Marco: checking one glidein from yesterday. Validation time is negative, failing
    • Marco Mambelli
      • Working on 3.2.22
      • Singularity ticket
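
The idle-glidein removal discussion under Marco Mascheroni above can be sketched as a small decision function. This is only an illustration of the policy as discussed (keep idle glideins while the site is at capacity, otherwise remove them after 8 hours rather than 1, and always remove held glideins). The function name, the status labels and the `site_at_capacity` input are assumptions; how the factory would actually learn that a site is full is the open question in the notes.

```python
# Idle limit suggested in the notes: 8 hours rather than 1 hour.
IDLE_LIMIT_SECONDS = 8 * 3600


def should_remove(status, seconds_in_status, site_at_capacity):
    """Decide whether a glidein should be removed.

    Hypothetical policy sketch; status is one of "idle", "held" or
    "running" (simplified labels, not HTCondor job status codes).
    """
    if status == "held":
        # Notes: held glideins should be removed, never released.
        return True
    if status == "idle":
        if site_at_capacity:
            # Notes: do not remove idle glideins while the site is full.
            return False
        return seconds_in_status > IDLE_LIMIT_SECONDS
    # Running glideins are left alone in this sketch.
    return False


print(should_remove("idle", 9 * 3600, site_at_capacity=False))
print(should_remove("idle", 9 * 3600, site_at_capacity=True))
```

In production the same logic would have to be written as a periodic remove expression, and the capacity signal would need to come from somewhere the expression can see.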

March 21, 2018

Parag Mhashilkar, Marco Mascheroni, Dennis Box, Lorena Lobato, Marco Mambelli, Eric Vaandering

  • Developers
    • Marco Mascheroni
      • Factory operations. Two important fixes: Singularity script, exception w/ memory leak
        • A site asking about requesting the amount of disk to be used. Add something in GWMS to treat disk same as CPU and memory. The glidein is advertising all the disk is finding, 50 GB, the job was failing after 20 GB.
        • Marco will share more Factory ops requests at the next meeting (developing requests)
      • Finishing factory status page w/ entry breakdown
      • Will add the San Diego instructional documents, to the wiki in NewMemberOrientation
  • Discussion about the memory leak bug:
    1- who is putting the string instead of the int? Only a specific Frontend version?
    2- why is the exception not handled and causing a memory leak due to the hanging processes?
    There was an exception and the python processes were not cleaned up after the exception. We were multiplying a sequence and a float, getting an exception; the calling process was not getting proper output and was not able to collect, so the processes were hanging around. From Brian B: Whatever bug there is, there should be no memory leak. If an exception is raised in the subprocess, we should catch it and not leak python processes (e.g. global try-except)
  • Dennis:
    • Testing the fix for the Singularity bug
  • Parag:
  • Eric
  • Lorena:
    • Working connecting w/ the OSG Factory, now seems working, the Frontend is requesting glideins, jobs are still not running.
    • Documentation changes are ready and waiting for review: there will be some updates about proxy and certs
  • Marco Mambelli
    • Fixing the Singularity bug and working w/ Jeff about interim solution compatible w/ CMS and OSG custom solutions (DISABLE_GWMS option)
    • Working on 3.2.22 planning
    • Singularity meeting
    • Meeting w/ HTCondor team last Wednesday
  • 3.2.22 release
    • Urgent release w/ only critical bugs
    • RC by the end of the week
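
The memory leak discussion above (an exception in a forked worker leaving processes hanging) can be sketched as follows. This is a minimal illustration of the "global try-except" idea on a POSIX system, not the actual GlideinWMS fork library: the helper `run_in_child` and the `scale` function are hypothetical. The child always reports either a result or an error through a pipe and always exits, and the parent always reaps it.

```python
import os
import pickle


def run_in_child(func, *args):
    """Run func(*args) in a forked child; return ("ok", value) or ("error", message).

    Hypothetical helper for illustration; not the GlideinWMS fork library.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: a global try/except so every failure is reported to the parent.
        os.close(read_fd)
        exit_code = 1
        try:
            try:
                payload = pickle.dumps(("ok", func(*args)))
                exit_code = 0
            except Exception as err:
                payload = pickle.dumps(("error", repr(err)))
            with os.fdopen(write_fd, "wb") as pipe_out:
                pipe_out.write(payload)
        finally:
            # Always leave through os._exit so the child never outlives its task.
            os._exit(exit_code)
    # Parent: read the report, then reap the child so no process is left behind.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe_in:
        data = pipe_in.read()
    os.waitpid(pid, 0)
    return pickle.loads(data) if data else ("error", "child sent no data")


def scale(values, factor):
    # Multiplying a list by a float raises TypeError, like the bug discussed above.
    return values * factor


print(run_in_child(scale, [1, 2], 2))
print(run_in_child(scale, [1, 2], 1.5))
```

The design point is that the parent never waits on output that will not arrive: the child either sends a result or sends the error, and in both cases it terminates and is collected.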

March 14, 2018

Canceled, due to a meeting in Madison


March 7, 2018

Stakeholder meeting


February 28, 2018

Parag Mhashilkar, Marco Mascheroni, Jeff Dost, Dennis Box, Lorena Lobato, Marco Mambelli, Antonio

  • v3.2.21-2
    • Mambelli working with Brian Lin to fix the bug in Brian's script. Will be available later today.
  • Singularity
    • All the changes in scripts in Mats and CMS that apply to us are in v3.2.21
    • Jeff
      • Not using Glideinwms internal validation scripts
      • CMS is v3.2.19. Not ready for v3.2.21
      • Factory docs for singularity. Will enable singularity_bin at site one at a time
  • CMS
    • We are using scripts to check for singularity. Running 48 hr pilot and at times in between singularity. Can we do periodic validation? We will have a transition phase
      • If singularity not available exit
      • wrapper script to check for singularity at
    • v3.2.21 plans
      • reinstalling to and other things that is taking most of cycles
      • HTCondor and glideinwms will be looked at second half of March or later
    • Provisioning requests
      • Been in stable and saturating load for now. Job pressure trying to fluctuate a bit again. Can the frontend, access total amount of work needed? Number of CPUs * core hours
        • We are requesting more glideins
        • Multi core, some jobs are still running on glideins leaving them partially wasted
    • New T0 setup. Resources at CERN in slots of different sizes. Instead of multiple sites, trying auto feature in glideinwms
    • Also trying IO slot.
      • Marco: Use glidein resource slots. Can use resource slot with auto.
  • Developers
    • Marco Mascheroni
      • Feature to monitor meta-sites monitoring break down
      • Investigated glide_off failures: When executed on the collector node, the glidein gets killed but the client reports otherwise
    • Lorena Lobato
      • Studying Glideinwms
    • Dennis Box
      • Busy creating unit tests and integrating them in CI. Have 3 tickets for 3.2.22
    • Marco Mambelli
      • Busy with the release and working with Brian Lin auto-proxy renewal
      • Modernizing the code and fix CI tests
  • Stakeholders meeting on March 07, 2018

February 14, 2018

Parag Mhashilkar, Marco Mascheroni, Jeff Dost, Dennis Box, Lorena Lobato, Marco Mambelli

  • OSG Operations
    • Will upgrade when it is released in OSG production
    • Marco: Edgar has been assigned to OSG rpm testing
    • Will be at FNAL for OSG blue print meeting
  • CMS
    • Mascheroni: Nothing from CMS side
    • Mambelli: Nothing from the SI meeting last week
  • Developers
    • Mascheroni
      • Working on monitoring on Meta sites. Close to get the final version. Arriving on 20
    • Mambelli
      • Release v3.2.21 is now in OSG testing. All automated tests run by Dennis passed. Marco also did some manual tests and they passed.
      • OSG documentation. RPM installation docs for factory were not migrated from old systems.
      • On vacation from Feb 15 - Feb 20
      • Want to merge the futurize code to branch_v3_2. Dennis found few things to be corrected. Doing the rebase along with the fixes.
    • Lorena
      • Studying the documentation
      • Trying to install the RPMs and then will follow up with the testing.
    • Dennis
      • Looking at code modernization of code base. Found few things that slipped through. Upgrade process failed because of invalid imports. Found a tool that stubs out unit tests that assert false for every class. Very useful!
    • Parag
      • HTCondor trip in March.
      • OSG blue print meeting.

February 07, 2018

Parag Mhashilkar, Marco Mascheroni, Jeff Dost, Dennis Box, Lorena Lobato

  • Lorena Lobato started as of Monday
  • v3.2.21 Status
    • Dennis went through all the standard tests, automated testing, clean and upgrade, submitted jobs ran.
    • Marco Mambelli will most likely be cutting final release today. Jeff already upgraded to ITB and he found all the factory issues have been fixed. Haven't tested meta-sites.
  • OSG Factory Ops
    • Nothing from Jeff's side.
    • Plan to upgrade to production as soon as possible. Constraint is GOC factory will be upgraded only after rpms are available in the OSG production release.
    • Jeff will be here for blue print meeting.
  • Developers
    • Marco Mascheroni
      • Working on monitoring meta sites. Prototypes expected next week.
      • glidein_off is broken and looking at it
      • Covered in v3.2.21 testing
  • Parag: Can we have factory operations help with testing in future if required? We may need some help testing at scale at times. That should happen only when we are convinced after internal rc testing.
    • Jeff: It is still best effort. Best to send email to factory ops mailing list and see the availability.

January 31, 2018

Marco Mascheroni, Parag Mhashilkar, Marco Mambelli, James Letts, Dennis Box

  • CMS
    • Unsanitized MJF attributes were blowing up the HTCondor auto-clustering. Marco Mascheroni provided the fix. It's in place, it works, and it will be in the next release. Currently we are waiting on older glideins to clear up.
    • Short glideins. Working with factory ops to use these short glideins.
    • Meta site ticket status: It is in v3.2.21.
  • v3.2.21 Release status
    • Waiting on final feedback on the ticket. Will cut a RC today and final release by later this week after more testing.
    • Includes
      • Support for unprivileged singularity
      • MJF attributes fixes
      • Frontend epoll crashes fix
      • Meta sites.
      • [...]
  • Developers
    • Marco Mascheroni: Jeff asked for meta-sites. It needs to be documented. Working MJF attributes. Need to get back to monitoring of Meta-sites. Planning to arrive on Feb 20 and depart on Feb 28.
    • Dennis Box: Working on epoll feedback. Doing automated smoke tests.
    • Marco Mambelli: Mainly working on the epoll and release this week. Email from Tim Cartwright. Asked for hot fix. Brian Lin asked if alternatively we can get 3.2.21 if it is ready.

January 24, 2018

Marco Mascheroni, Parag Mhashilkar, Eric Vaandering, Jeff Dost, Marco Mambelli

  • CHEP Paper
    • Marco Mascheroni submitted the abstract
  • v3_2_21 release status
    • Marco Mambelli is working on feedback from Jeff and most of the changes are in place.
    • The GLOW frontend has other unrelated errors, which are usually transient and happen when other schedds are down. There are some changes to exception handling and to the way processes are forked. epoll() vs. select(): we are seeing errors related to closed file descriptors, which only happen with epoll().
    • v3.2.21 RC is on GOC ITB. Jeff plans to look at it and test out Meta sites.
  • Developers
    • Marco Mascheroni is working on improving monitoring for Meta sites

January 10, 2018

Marco Mambelli, Marco Mascheroni, Parag Mhashilkar, Dennis Box

  • CHEP Paper
    • Story: Minimizing Glidein wastage
    • Marco Mascheroni to talk to Antonio about his paper and see how much overlap we have with our paper
  • v3_2_21 release status
    • The last problem to address is the monitoring problem. Marco found that the problem is in the JavaScript; the data in the XML file is correct. The error shows up in table cells across multiple columns.
  • Discussions with Suchandra
    • Auto reconfig/upgrade after rpm upgrades
  • Developers
    • Marco Mambelli
      • Made changes to the OSG GitHub documentation
    • Marco Mascheroni
      • Started looking at properly monitoring Meta Sites. Will spend more time on it.
    • Parag Mhashilkar
      • DOE GitLab as a possible alternative to GitHub. Burt is looking into it. It is not clear whether it will be as easy for external contributors to contribute to the project as it is with GitHub.

January 3, 2018

Marco Mambelli, Marco Mascheroni, Dave Dykstra, Eric Vaandering

  • v3_2_21 status
    • Dennis talked w/ Marco yesterday
      • #17639: Dennis submitted it for review.
    • No progress on other tickets due to vacations
    • Marco plans on a new RC by the end of the week, probably a release by end of following week
  • Marco Mascheroni
    • Will look at monitoring breakdown for Meta Sites
    • Would like to propose an abstract on GlideinWMS progress for CHEP18. His previous proposal is too similar to a paper from Antonio and other CMS people.
  • Marco Mambelli