Weekly Meeting Notes 2017


December 20, 2017

Parag Mhashilkar, Marco Mascheroni, Dennis Box

  • v3_2_21 status
    • Dennis
      • #17639: Dennis is working on it. Encountering some issues debugging the script; trying to understand the problems and address them. Also working on integrating the OSG singularity scripts. Will check with Dave for expert help.
      • #17471: Not looked at it yet
    • Other tickets
      • Marco was working on ticket #18588 before he left on vacation. He has fixed one bug in the calculations and is working on fixing the other part.
      • #18522: Brian Lin is working on feedback given by Marco Mambelli
    • Marco Mascheroni
      • #13069 is in the master
      • #17858 will assign it for feedback
  • Marco Mascheroni
    • Will look at monitoring breakdown for Meta Sites

December 06, 2017

Marco Mambelli, Parag Mhashilkar, Marco Mascheroni, Dennis Box

  • Project News
    • Stakeholders meeting on Dec 7 3:30 - 4:30pm.
  • v3_2_21 status
    • It's possible to include these tickets if they get resolved by the end of this week.
      • Meta-sites ticket: Mambelli reviewed and gave feedback. Mascheroni is going over them.
      • Dennis got the condor ce issues resolved and can work on the unprivileged singularity.

November 29, 2017

Dave Dykstra, Dennis Box, Marco Mambelli, Marco Mascheroni, Dave Mason, Parag Mhashilkar

  • Project News
    • Stakeholders meeting on Dec 7 3:30 - 4:30pm
  • Singularity
    • Found the cause of bind mounts disappearing on SL7.4 with overlay turned on. A fix has been put in for CVMFS 2.4.3. Dave has built a new version 2.4.2-1.1 for OSG with the patch. Should be released soon.
    • Singularity 2.4.1 was released upstream and it broke EL6.
    • There is no quality control in Singularity. Not enough testing is being done on EL6.
    • Dave found how to attach to a singularity container to get access as a user and see processes.
    • Deployment at FNAL
      • Currently deployed at UNL and Tier-2
      • Beginning in February, CMS wants it installed on every EL6 node. CMS is requiring Singularity to be on SL6, but we may not go there. Working to have a test cluster deployed by the end of this year, a combination of T1 & LPC. The big transition is the move to containers (docker) based on SL7. There are internal CMS meetings on how to proceed with this and move forward. runc and charliecloud will run if unprivileged namespaces are available.
      • Dennis: Setup SL7 to try out singularity.
  • CMS
    • Nothing from Dave for today's meeting
  • v3_2_20
    • Is in OSG integration testing and should be released in OSG 3.2 & 3.3 (December)
    • OSG moved to a rolling release schedule.
  • v3_2_21
    • Mambelli: Should be done with #17221 development today and have the RC by Friday
  • Developers
    • Marco Mascheroni
      • Happy with the changes to #13069. Writing documentation and unittests and plan to commit changes tomorrow or Friday.
    • Dennis
      • #17639: looking into it and setting up a test environment. Have a CE that jobs can be submitted to, but for some reason the test glideinwms deployment is not submitting to it.

November 22, 2017

Dennis Box, Marco Mambelli, Marco Mascheroni, Jeff Dost

  • OSG Operations
    • Jeff plans to deploy and try out v3.2.20 soon
    • Provided feedback about factoryEntryStatusNow.html [#18371]
This is solution 1, a first step towards a possible multi-site configuration:
- to set limits across a set of Entries
- to avoid double counting because of multiple CEs pointing to the same cluster

Some notes about the solution proposed (beside what's in the presentation):

There is an Infosys section that may not be handled correctly: it is unclear how to correlate it with the info DB that tells how to create the entry in the first place. It is there to connect to BDII, but that was mostly a failed attempt:
- BDII is not updated
- not all info is in BDII, e.g. if a site prefers 8 cores

This solution is similar to a proposal from Jeff several years ago: one more level of the tree where there is a site level and entries underneath.
For that, you should be able to put the attributes in the entry set level or the entry level. Jeff will forward an old google doc about this tree-like structure.
Don't worry about these requests for this iteration, but they may help with decisions to leave options open for the future.

Monitoring has not been touched, so you will see only the aggregate numbers for the entry set. Entry sets will look like a new entry and no sub-entries underneath.
Jeff: For site debugging it would be important to see the single CEs, because one may misbehave and it would be difficult to troubleshoot if you cannot single it out. Would like to see both: the total and a drill-down to single CEs.

The configuration directory allows multiple entry files. The splitting of Factory entries is because of different groups owning and sharing these files/entries:
1. staff shared w/ CMS and non-CMS; 2. CMS only entries; 3. ITB factory has all of the production files plus one (or more) for testing
It is similar to condor config.d: files are read alphabetically and newer can overwrite entries in older ones
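
The config.d-style override described above can be sketched as follows. This is only an illustration, not the Factory's actual parser: `merge_entry_files`, the file names and the `limit` attribute are hypothetical, and it assumes "newer" means "later in alphabetical order".

```python
def merge_entry_files(parsed_files):
    """Merge per-file entry definitions, config.d style.

    parsed_files maps a file name to a dict {entry_name: entry_definition}.
    Files are visited in alphabetical order, and a later file replaces any
    entry that an earlier file already defined.
    """
    merged = {}
    for fname in sorted(parsed_files):
        merged.update(parsed_files[fname])
    return merged


# Hypothetical example: the CMS-only file overrides the shared definition.
example = {
    "20_cms.xml": {"CMS_T2_A": {"limit": 50}},
    "10_shared.xml": {"CMS_T2_A": {"limit": 10}, "OSG_B": {"limit": 5}},
}
merged_example = merge_entry_files(example)
```

With whole-entry replacement like this, an entry (or an entry set treated as one block) defined in two files ends up with the definition from the last file only; attributes are not merged across files.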

There are situations where entries from a set are in the same file. But sometimes entries are shared across multiple VOs, with a queue per VO or a queue accepting multiple VOs.
In an entry set all the entries have to have the same auth method and trust domain.
Double counting only matters per VO. We don't care about non-CMS, because when 2 different VOs are involved counting is different.
Could an entry be added to different entry sets (e.g. different entry sets for different VOs)? Yes, if different VOs are involved. There would be no double counting problem.

Monitoring is the most important missing feature. The other features mentioned here are only thought experiments and would need to be tested with the first version.

An entry set is all in one file in this first version. An entry set is a block; the last file that defines it overrides the entry set definition.

Marco's next steps will be:
- finish the test
- define what to add in the current version
- prepare a working version that can be released and tested
  • v3.2.20 has been released and is in OSG testing

November 15, 2017

Dennis Box, Marco Mambelli, Marco Mascheroni

Small attendance because of CMS week and Supercomputing. December 7 will be the next stakeholder meeting

  • v3.2.20 Status
    • RC4 has been tested successfully; the release will be cut after the meeting.
  • Developers
    • Marco Mascheroni
      • Almost done with Meta sites ticket #13069. Will push the code tomorrow and discuss it w/ Marco Mambelli.
      • Protocol stays the same. Only the attrs file generation and its usage by the factory change.
    • Marco Mambelli
      • Working on v3.2.20 RC and release. Did single core and multicore jobs testing.
    • Dennis Box
      • Been testing release candidate. No problems on SL6 and SL7 upgrades.

November 8, 2017

Dennis Box, Marco Mambelli, Marco Mascheroni, James Letts, Antonio, Jeff Dost, Parag Mhashilkar

  • CMS
    • James:
      • #17221: Marco Mambelli - Work has not started yet
      • Some sites are shortening the proxy lifetime to 24 hours. There is a feature in HTCondor that lets you refresh the proxy
      • Jeff: We only see this in the European sites and don't see this in
      • Antonio: KIT admins refused to do it saying that it is security measure.
      • Parag: This will have a bigger impact on VO.
  • OSG Factory Operations (Jeff)
    • Will respond to Marco's email.
    • Will try to put rc3 on ITB and try to send pilots/tests
  • v3.2.20 Status
    • RC3 is out. So far it is ok
    • Jenkins is reporting a unit test failing. Marco is investigating
  • Developers
    • Marco Mascheroni
      • Almost done with Meta sites ticket #13069. Will try to push for review by Friday.
      • Protocol stays the same. Only the attrs file generation and its usage by the factory change.
    • Marco Mambelli
      • Working on issues for v3.2.20 and its release.
      • Haven't updated the FactoryEntryStatusNow page. Waiting on Jeff's feedback.
    • Dennis Box
      • Been testing release candidate. No problems on SL6 and SL7 upgrades.
      • Nearly done with #2531

November 1, 2017

Dennis Box, Marco Mambelli, Dave Dykstra, Eric Vaandering, Dave Mason

  • Singularity (Dave Dykstra)
    • Problems with EL7.4 kernel (bind mounts not working) are not showing up in new installations

Hyunwoo Kim is moving to HEPcloud development, Dennis will increase effort
Lorena will likely start January 22

  • v3_2_20 status
    • Not much progress this week. Need to make changes and cut a release candidate.
  • Dennis Box
    • Working on 2531

October 25, 2017

Dennis Box, Marco Mascheroni, Marco Mambelli

  • v3_2_20 status
    • Not much progress this week. Need to make changes and cut a release candidate.
  • Dennis Box
    • Working on 2531, we talked about the categories (binned values with the name of the biggest value: JobsStart_0, 2, 5, 10, many) and adding them to the web monitoring
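
The binning discussed for 2531 could look like the sketch below. The bin bounds come from the category names in the notes; the exact attribute names (`JobsStart_2`, `JobsStart_many`, ...) and both helper functions are assumptions for illustration, not the names used in the ticket.

```python
# Upper bounds of the bins; each bin is named after its largest value.
BIN_UPPER_BOUNDS = (0, 2, 5, 10)


def jobs_start_category(count):
    """Return the category name for the number of jobs started by a glidein.

    Counts above the last bound fall into the open-ended "many" bin.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    for upper in BIN_UPPER_BOUNDS:
        if count <= upper:
            return "JobsStart_%d" % upper
    return "JobsStart_many"


def histogram(counts):
    """Count how many glideins fall into each category."""
    totals = {}
    for count in counts:
        name = jobs_start_category(count)
        totals[name] = totals.get(name, 0) + 1
    return totals
```

Under these assumptions a glidein that started 1 or 2 jobs lands in `JobsStart_2`, one with 6 to 10 jobs in `JobsStart_10`, and anything above 10 in `JobsStart_many`.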

October 18, 2017

Jeff Dost, Dave Dykstra, Parag Mhashilkar, Dennis Box, Marco Mascheroni, Marco Mambelli

  • Singularity (Dave Dykstra)
    • 1.4 released. Dave recommends that OSG wait a couple of months and meanwhile test it on a couple of big sites before releasing to all sites. Several of the pull requests from Dave are included.
    • Problems with EL7.4 kernel
    • With docker at UNL, all bind mount points disappear in the middle of jobs. It's not easily reproducible and UNL admins are trying to reproduce it.
    • Also issues with autofs. With the SL7.4 kernel and docker, FNAL notices that autofs crashes once in a while. The workaround is not to unmount.
    • When timeout is 0, it does not update the access time of the file. Known issue with the EL7 kernel. The workaround again is to set timeout > 0 and never unmount.
  • CMS
    • Talking to Jeff about the top collector in the global pool, trying to debug why factory records (??). A year ago started filtering non-essential classads. The frontend queries the collector for idle glideins and may get outdated idle counts. This will affect the frontend pressure. Claim idle, claim running. Query the top and secondary collectors and see how much they differ.
    • Third source: analyze_glideins. Where is the info coming from? The frontend, or stats scavenged from the glidein output? The formatting of analyze_glideins should consider multicore if it does not already.
  • OSG
    • voms-proxy-fake: Red herring for the glideinwms team. Main issue is the proxy format and condor submit crashes the schedd with proxy that is missing attributes. It has nothing to do with the condor_root_switchboard.
  • condor_root_switchboard
    • condor 8.7 does not support switchboard. Proposal from the htcondor team is to pass the code to us to maintain and they will help us trim it.
  • v3_2_20 status
    • Not much progress this week. Need to make changes and cut a release candidate.
  • Marco Mascheroni
    • Experiment in CMS, frontend match expression changed format from list to string and it crashed. The error handling can be improved
    • Need to work on the meta sites.

October 11, 2017

Jeff Dost, Dave Mason, Parag Mhashilkar, Dennis Box, Marco Mascheroni, Marco Mambelli.

  • CMS
    • Discussions with James
      • After a shutdown at CERN, CMS started up long lived pilots with OpenStack. They start and stop at the same time. It took about 24 hours to ramp up to speed. Jeff is waiting on info from CERN and can use the spread. CERN told to limit 1 VM per cycle. We may be able to increase the number of glideins submitted per loop but further control the rate of submission to OpenStack through HTCondor-G config knobs.
    • #13069 - Meta Sites: No progress from Marco Mascheroni last week. Plan to resume the work next week
    • #17221 - Glidein auto removal: No update. Working on v3.2.20 release.
  • v3_2_20 status
  • condor_switchboard support dropped by htcondor.
  • OSG
    • #17825: Marco working on it. Jeff can test rpms from osg development repos.

October 04, 2017

Jeff Dost, Dave Dykstra, Marco Mambelli, Dennis Box, Parag Mhashilkar, Marco Mascheroni, Dave Mason

  • Singularity
    • New release v2.4 is being prepared
    • Dave had to make a couple of changes to have it compiled for OSG
    • Brian has a security concern: mounting arbitrary image files as a whole file system. Can potential FS race conditions be exploited? There is a configuration option to turn it off. There is a FW attribute, pinned_only. Allows you to append to file but not modify it. Needs to be set by root. Works on EXT/local/Lustre FS but not on NFS/BeeGFS. Does not impact unprivileged singularity.
  • v3_2_20 status
    • Going through glideinwms and htcondor warn/error log messages for the RC.
    • Cleared some issues and found some more issues
    • factory is trying to qedit even if the config has it disabled. Working with Marco to fix it.
  • CMS
    • No updates
  • Developers
    • Marco Mascheroni
      • Working on meta sites
      • Focus on fixing the condor qedit
    • Dennis Box
      • Checked in automated tests and scripts.
    • Hyunwoo Kim
      • Busy with GPGrid/HEPCloud
    • Marco Mambelli
      • Testing release candidate
      • condor switchboard has been removed from htcondor 8.7.2. So far it is in 8.6.4. We need a solution for alternative ways of doing it.

September 27, 2017

Jeff Dost, Marco Mambelli, Dennis Box, Parag Mhashilkar, Marco Mascheroni, Dave Mason

  • v3_2_20
    • Tested rc2. No unexpected errors.
    • Cut release today
    • Next release to be shorter release cycle
      • Singularity
      • Jeff's monitoring request to be in this cycle
      • Futurize stage 1
  • CMS
    • James
      • Looking at the retire time of the glideins; studies show that the requested wall clock time in jobs is sufficiently accurate and can be used to drain. If cut it to half we wont. As per Jeff, James can override the jobs' lifetime
      • Request from Antonio: a solution that does not remove a bunch of glideins and cause churn. #17221
  • Developers
    • Marco Mascheroni
      • Meta sites. Done with configuration path. Working on making entry sets advertise as a single entry.
    • Dennis Box
      • Testing RC2 and automating tests.
      • Need to understand how to monitor logs for errors.
    • Marco Mambelli
      • Doing testing/install/upgrade for v3.2.20 rc

September 20, 2017

Dave Dykstra, Jeff Dost, Marco Mambelli, Dennis Box, Hyunwoo Kim, Antonio Perez, Parag Mhashilkar

  • Dave Dykstra
    • OSG has released singularity
    • Singularity is in CVMFS
    • Learned that the 7.4 kernel is the default kernel and that updating to the latest kernel should give unprivileged singularity access if you enable it.
    • New singularity v2.3.2
      • Has fixes to loading images from docker hub
      • Is holding references to calling process in image directory so it does not get unmounted.
  • v3_2_20
    • Frontend RRD bug has been fixed. Needs to be merged. In fix rrd option, some files were not fixed.
    • Build RC later today.
  • OSG Operation
    • #14559: To get the required fixes, need to make more changes.
    • Make 3.2.20 with current fixes and push #14559 out to v3.2.21
  • CMS
    • Meta Sites & Aliases: There has been recent discussion with CERN. CERN has deployed a large number (15) of different CEs and some of them are redundant. This is imposing limitations on the factory side. Marco Mascheroni has been working on this for the past couple of releases and is dedicating his effort to this issue.
    • The method for long queued pilots is helping a bit. There are still a number of sites complaining. Need to open a ticket for removing glideins in a smart way rather than auto-retiring old idle glideins
  • Developers
    • Dennis Box
      • Trying to go through testing everything as per wiki page.
    • Hyunwoo Kim
      • Working on #17570: There may be a simpler solution
    • Marco Mambelli
      • Reviewing change from Dennis
      • Writing the Frontend document

September 13, 2017

Marco Mambelli, Marco Mascheroni, Dennis Box, Hyunwoo Kim, Dave Mason, Parag Mhashilkar

  • v3_2_20
    • Still getting errors in logs when there is upgrade or new entry is added. In ticket for adding new monitoring info to factory. Sent mail to Jeff giving link asking for his feedback.
    • Other issues have been fixed
    • Feature change of loading singularity executable which is very specific with SL7 and kernel version
    • Additional testing
      • Looking at logs and found couple of issues with rc that were handled
  • Developers
    • Marco Mascheroni
      • Nothing much to report
    • Hyunwoo
      • Fixed issues found in rc last week
      • Working on init scripts where second execution of reconfig is blocked
      • Next, planning to work on two new singularity tickets
    • Dennis
      • Not much progress last week
    • Marco Mambelli
      • Fixed unittests for master branch

September 6, 2017

Marco Mambelli, Marco Mascheroni, Dennis Box, Hyunwoo Kim, Eric Vaandering, Dave Dykstra

  • Singularity
    • Singularity unprivileged installed in OASIS.
      • Would be nice to try unprivileged singularity bin before the other one (requested in ticket)
    • Brian brought up a new cvmfs feature (used by CMS, LIGO): access to cvmfs by voms proxy. It requires a different Linux session (setsid starts a new session and changes the parent to be init) so that the proxy is not shared.
      • Is condor using different sessions (startd, starter)? - Marco will ask Friday
      • It also affects signals: we may need a process to trap and send signals to the new session. Singularity is not changing session; inside singularity a script could do that (this way the new parent would be the Singularity process and not init).
      • Dave will open a new ticket for this
  • Marco Mascheroni, working on other cms duties mostly, and in part on 1st part of metasite ticket. Will move to 2nd part soon
  • Testing 3.2.20 RC1
    • Dennis observed an error about a missing key when accessing RRD data. The attribute is there (in the RRD) after a fresh install, but not after an upgrade. Investigating further.
    • Dennis did install and upgrade smoke tests and they work.
    • Marco did a manual install with smoke test and it works fine. Observed an error when a new entry is added, only visible at the first reconfig with the new entry.
    • Both Marco and Dennis are investigating further
    • HyunWoo - found minor bug in singularity, fixed already
  • No news from Eric

August 30, 2017

Parag Mhashilkar, Dennis Box, Jeff Dost, James Letts, Hyunwoo Kim

  • v3_2_20 rc1 testing
    • Dennis could not find v3_2_20 rc1
      • Basic smoke tests are ok
    • Hyunwoo will be testing it today
    • Jeff: It's in osg-3.4 and not in osg-3.3 development
  • CMS (James)
    • CPU efficiency in GlideinWMS and HTCondor. Cutting the queue time to 3 hours has been helpful.
    • Next wastage is draining of glideins. CMS will be looking at the configuration in case of multicore glideins
    • Feature request from computing operations
      • Production workflows would find it useful to be able to easily stop jobs from starting at a site (maybe it's broken)
        • Seems more like a condor_qedit for existing jobs + condor_config changes to add default start/site exclusion
  • OSG Operations (Jeff Dost)
    • Interested in #14559.
    • Jeff to upgrade to ITB and try it out
    • Jeff wants to know how many cores reported back to the VO collector
  • Developer
    • Dennis Box
      • Not much progress
    • Hyunwoo
      • Marco Mambelli reviewed singularity ticket. Made changes accordingly
      • Start working on asynchronous feature of SL7 sysctl.

August 23, 2017

Dave Dykstra, Marco Mambelli, Parag Mhashilkar, Marco Mascheroni, Hyunwoo Kim, Dennis Box

  • Singularity
    • Tried on one of the RHEL7.4 machines; it does not take advantage of one of the system calls.
    • Marco Mambelli: Have frontend that does forwarding. Will be testing the logging.
    • Singularity wrapper script - Hyunwoo
      • Got feedback from Brian and Mats Rynge. Reflected important comments in the code changes. Split main script into 2, validation script and script for startd cron. Hyunwoo's changes only take in singularity related changes.
  • v3_2_20
    • Working on #14559. Needs more changes and testing
    • #17343: Needs to be documented
    • Release candidate: pushing by end of this week
    • Dennis to help with RPM testing
  • Developer
    • Dennis
      • Haven't done much on GlideinWMS
    • Marco Mascheroni
      • Just came back from vacation. Will resume work on Meta sites.

August 16, 2017

Dave Dykstra, Marco Mambelli, Parag Mhashilkar, Marco Mascheroni, Hyunwoo Kim, Dennis Box

  • Singularity
    • Tried on one of the RHEL7.4 machines; it does not take advantage of one of the system calls.
    • Marco Mambelli: Have frontend that does forwarding. Will be testing the logging.
    • Singularity wrapper script - Hyunwoo
      • Got feedback from Brian and Mats Rynge. Reflected important comments in the code changes. Split main script into 2, validation script and script for startd cron. Hyunwoo's changes only take in singularity related changes.
  • v3_2_20
    • Working on #14559. Needs more changes and testing
    • #17343: Needs to be documented
    • Release candidate: pushing by end of this week
    • Dennis to help with RPM testing
  • Developer
    • Dennis
      • Haven't done much on GlideinWMS
    • Marco Mascheroni
      • Just came back from vacation. Will resume work on Meta sites.

August 09, 2017

Dave Dykstra, Marco Mambelli, Parag Mhashilkar, Jeff Dost

  • Singularity - Dave Dykstra
    • Logging for singularity is working in HTCondor 8.6.5, which is in the August OSG release. Waiting on Marco to test the bug fix.
    • Singularity seems to be deployed at CMS Tier-2. Maybe Atlas too (?)
    • Any job/glidein going to an HTCondor CE has logging
    • The RHEL 7.4 release has an optional feature which allows singularity to run as an unprivileged user. It needs a kernel boot parameter. Dave will be testing it. Need to verify that it works.
  • OSG operations. Jeff Dost.
    • Jeff sent email with details
  • v3_2_20
    • #13807: Hyunwoo looking at the singularity feedback from Brian and VOs. Should be done by end of this week.
    • #14559: Marco should be done by end of this week.
    • Release candidate end of the week
    • #17343: Issues with SL7 and reload.
      • Jeff: how is reload done? Cycle: STOP - RECONFIG - RESTART. At the scale of OSG, a reload can take up to 5 mins. Is there any protection against back-to-back quick reloads? Marco thinks that if a reload is in progress the second one will be lost. Jeff suggests having a warning message that a reload is in progress.
  • v3_3_3
    • This will follow after v3_2_20. Will include #15176
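
Jeff's suggestion for #17343 (warn when a reload is already in progress) could be sketched with an advisory lock file. This is not how the init script is implemented; the lock path, `ReloadInProgress` and both helper functions are hypothetical names used only for illustration.

```python
import fcntl
import os


class ReloadInProgress(Exception):
    """Raised when another reload already holds the lock."""


def acquire_reload_lock(lock_path):
    """Take an exclusive, non-blocking lock and return the file descriptor.

    The caller keeps the descriptor open for the whole
    STOP - RECONFIG - RESTART cycle and releases it afterwards.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise ReloadInProgress("a reload is already in progress (%s)" % lock_path)
    # Record the owner so an operator can see who holds the lock.
    os.ftruncate(fd, 0)
    os.write(fd, ("%d\n" % os.getpid()).encode())
    return fd


def release_reload_lock(fd):
    """Release the lock taken by acquire_reload_lock."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A second reload started while the first still holds the lock gets `ReloadInProgress` immediately, so it can print a warning instead of being silently dropped. The lock disappears with the process, so a crashed reload does not leave a stale lock behind.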

August 02, 2017

Marco Mambelli, Marco Mascheroni, Dennis Box, Antonio, Parag Mhashilkar, Hyunwoo Kim, James Letts, Eric Vaandering

  • v3_2_20
    • Marco Mambelli
      • Behind with the monitoring tickets. #14559
      • Spent time reviewing changes by Thomas
      • Jeff gave feedback to Marco on what is expected from the ticket. Changes to RRD. Expect to be done by end of the week and we will have RC next week.
    • Marco Mascheroni
      • #13069 will be pushed to either v3.3 series. Keep the code changes wrt to branch_v3_2 to give flexibility
  • CMS
    • Antonio
      • Continuing preparation for scale tests starting today. Got feedback from Edgar/Jeff/Marco/Parag on multi glidein slots. Running scale tests on grid sites in opportunistic & dedicated unused resources.
        • multiple p-slot
        • multiple p-slot per glidein
        • push collector & negotiator and multiple queries
        • intensity of jobs
        • timeline is dependent on several factors but focus in August or into September
      • Issue addressed in past about monitoring info into glidein job's classad through qedit which is expensive. Maybe a good idea to enable this with dedicated factory for above scale testing.
      • Tuning on queue limits
        • Reduced idle glidein time limits to 1 hour. Not cancelling held glideins yet. Maybe have a time limit for held glideins for their removal.
  • Hyunwoo
    • Writing up how our singularity support is different from what UNL is using
      • Our code requires new features in glideinwms just like factory-frontend
      • From CMS - frustration from sites setting up singularity. May have some issues with isolation of user environ and running CMS software.
    • Will resume "why my job is not running"
  • Thomas
    • Working on Jenkins
    • Document Work
    • documented coding guidelines
  • Marco Mambelli
    • Need to test changes with python 3 as well
  • Dennis Box
    • Did not do a whole lot on glideinwms.
    • automated testing script - found issue with EOF and named pipes. Removed tee and piping and it works.

July 26, 2017

Marco Mambelli, Marco Mascheroni, Dennis Box, Antonio, Parag Mhashilkar, Dave Mason

  • v3_2_20
    • There are 2 high priority tickets under progress
    • #14559 (Mambelli): Should be completed before the end of the week.
      • Jeff wants to make sure that monitoring is fixed as it is confusing and some of the lines show pilots while some show cores. When the info is overlaying in monitoring. On fixed slots case, glideins should match how man... In addition to cores and glideins and core requested also want to see partially used glideins and unused cores.. ONLY WANT TOTAL PILOTS and TOTAL CORES in the factory monitoring. SLOT info is not relevant. more info Nov 2016.
    • #13069 (Mascheroni)
      • Didn't get a chance to work on this and will try to work on it. May not be able to complete it for this release
        • There are two parts. Introduced new tags in factory configuration for entry set that can list entries. Entry sets are published differently.
        • Collection of sites have different number of weighting factor.
        • Grouping of entries only happens for similar or standard entries. Are we approaching the xrootd model of federated entries? Multiple factories with the same entries do not see each other's limits. Not easy to solve without making factories aware of each other. This is a rate problem, as entries in multiple factories fill up fast.
  • CMS
    • we are removing glideins every hour. Three hours was too long.
    • Met
  • Factory OPS: ITB & RC testing
    • Only testing on ITB when ready to upgrade Production.
    • Was ramping up a new hire but it didn't work out. Understaffed to help out with testing releases and RCs
  • Burt
    • Quality control issues in glideinwms have been bubbling up and are not reflecting well on the project, group and lab

July 19, 2017

Marco Mambelli, Marco Mascheroni, Hyunwoo Kim, Dennis Box, Eric Vaandering, Antonio, Parag Mhashilkar

  • Antonio
    • No new request compared to GWMS stakeholder meeting.
    • Main work is on improving CPU efficiency: condor issues, GWMS issues
    • Would like to check more the number of activations reported by Glideins. Number of pilots with no activation is an interesting metric.
    • Currently removing pilots from the queues if they are older than 3 hours, thinking to reduce this to 1 hour. This is going to cost a lot of pilot renewals: + It will keep all the pilots relevant, - It may cause some harm to the factory
      Things seem to be OK, we had this in place already for a couple of weeks. Applied this only to the main Frontend group, dedicated resources, no opportunistic (and HPC resources)
    • This has not been helping a lot because requests lately have been very spiky, so even 3 hours seem not enough. Whitelisting is also very specific on where some jobs can run, this increases spikiness.
  • Marco Mascheroni
    • Almost finished w/ ticket for configuration of the factory [#13069]. Ended up rewriting last week implementation. Hopefully tomorrow.
  • Dennis
    • Implementing feedback for the entry scaling ticket [#17067], all seems OK
  • Hyunwoo
    • Singularity branch [#13807] has been merged after testing that path names w/ spaces work.
  • Marco Mambelli
    • Fix unit tests
    • Working on 14559
    • Reminder to run unit tests
    • Reminder that futurize tests of the code will be added soon
  • Eric, Parag
    • No news

July 12, 2017

Marco Mambelli, Marco Mascheroni, Thomas Hein (Hyunwoo Kim, Dennis Box sent updates)

  • Hyunwoo
    • Worked on Singularity OSG scripts.
    • Working more on tests to verify that Singularity is OK
  • Dennis
    • Worked on entry scalability
    • Problem with select, there is a limit set at compile time for 1024 file descriptors, this would limit the Factory to ~510 entries: possible solutions (excluding recompiling the interpreter), use multiple select on segments of the file descriptor, use multiple entryGroup, use poll instead of select
  • Marco Mascheroni
    • Work on 13069: Balancing glidein pressure to sites that are aliases or Meta-Sites
    • Split the ticket in 2: 1. add entry sets with entries sharing common elements, 2. manage balancing within the set
  • Marco Mambelli
    • Troubleshoot and fix empty Factory job stats bug
    • Working on 14559
  • Python3
    • Marco forwarded to the list Thomas email to the GlideinWMS developers highlighting the changes in syntax and new suggested idioms
    • No comments on the suggestions
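
The select limit noted in Dennis's update above comes from the compile-time FD_SETSIZE; in Python, `select.select()` raises `ValueError` for a descriptor at or above that value. Of the options listed, switching to poll could look like the sketch below. `wait_for_readable` is a hypothetical helper, not Factory code, and it assumes the Factory only needs to know which descriptors are readable.

```python
import select


def wait_for_readable(fds, timeout_ms=1000):
    """Return the descriptors in fds that are readable (or hung up).

    poll() takes an explicit list of descriptors instead of a fixed-size
    bitmap, so it is not bound by the FD_SETSIZE limit of select().
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    ready = []
    for fd, event in poller.poll(timeout_ms):
        if event & (select.POLLIN | select.POLLHUP):
            ready.append(fd)
    return ready
```

`select.poll` is not available on every platform (for example Windows), which is fine for a Linux-only Factory but worth keeping in mind for the unit tests.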

July 5, 2017

Marco Mambelli, Hyunwoo Kim, Dennis Box, Thomas Hein

  • v3_2_20
    • We want to go in OSG's August Release; the deadline (dev freeze) is 7/24. We want to reserve about one week for RC testing, so we have almost 2 weeks
    • All 3.2.20 High priority tickets should go in, Marco, Dennis and HyunWoo agree
  • Python3
    • Introduced Thomas work: modernization of the code, getting it ready for Python 3, adding futurize tests in Jenkins
    • Thomas will send an email to the GlideinWMS developers highlighting the changes in syntax and new suggested idioms

June 14, 2017

Parag Mhashilkar, Marco Mambelli, Hyunwoo Kim, Dave Dykstra

  • Singularity (Dave)
    • Worked on logging. Got a plugin from Brian. Now can keep track of jobs that start, stop and are running. Configurable to clean up tracking if no info is there for more than x hours. Depends on the glideins reporting to the HTCondorCE Collector.
    • Will ask Ken Herner to configure some sites to send classads to Dave's test collector
    • Working with Hyunwoo for Singularity. Not have an option to disable singularity in job's classad.
    • EL6 limitation. Cannot automount inside a container a dir that's already mounted outside. So this script needs to mount the dirs on EL6.
  • v3_2_19
    • ITB should have v3.2.19 since Suchandra tested in ITB
    • condor view problem went away with collector use shared port
    • Brian and Tim Theisen have been pushing for collector use shared port
  • Hyunwoo
    • Worked on Singularity OSG scripts.
    • Struggling with the CVMFS issues. Should check on EL6 that mounting the same dir in multiple containers on the same worker node is ok
  • Marco Mambelli
    • Multiple node submission on Cori
    • Fermicloud shared home dir has issues. Will have to restart rpcbind and ypbind. Port mapper daemon is having some issues. We need a fix from Scientific linux.

June 07, 2017

Parag Mhashilkar, Marco Mambelli, Dennis Box, Marco Mascheroni, James Letts, Hyunwoo Kim

  • v3_2_19
    • Is in OSG testing repo
    • Should be in the June OSG release
    • v3_2_19-1 in osg-v3.3 and v3_2_19-2 in osg-v3.4 (removes redundant dependency in spec file)
  • CMS
    • No news
  • Developers
    • Marco Mascheroni
      • Will start working on the Meta sites.
    • Marco Mambelli
      • #15176 for OSG 3.3, multiple load submission accounting from Frontend
      • #14559 next one
    • Hyunwoo Kim
      • Singularity ticket: Will make more progress based on Brian's scripts
    • Dennis Box
      • Been busy with Jobsub
      • Tested upgrading to 3.2.19 and a fresh install; both worked

May 31, 2017

Parag Mhashilkar, Dave Dykstra, Eric Vaandering, Marco Mambelli, Dennis Box, Marco Mascheroni, James Letts, Hyunwoo Kim

  • Singularity (Dave)
    • Trying to demonstrate the logging feature when singularity runs jobs. Test setup with HTCondor CE. Can send ads to CE.
    • Brian has written a plugin that allows Python plugins. Can load a Python plugin now, but there is a problem with the shared object being loaded. Brian is trying to help with that.
    • Even Wisconsin builds don't help
    • Marco helping with site setting env variable to forward all glidein info to the collector
    • Marco talked to OSG and they will add singularity with next release. It is currently in OSG upcoming. Glideinwms will support singularity in next production release. Ticket is being worked upon.
    • Glideinwms - singularity status: Hyunwoo started with scripts from Brian Bockelman but they are tailored for their environment. Have simple scripts working.
    • Dave will need help from Glideinwms team to test it.
  • CMS
    • Upcoming release will include
      • Improve glideins scale down
      • Linking Frontend monitoring from Factory monitoring
      • Log number of activation/claims per glidein
      • Support for HTCondor v8.6 configuration
    • Requires upgrade to factory and frontend
  • Glideinwms v3.2.19
    • Final release done yesterday
    • Now in OSG production
  • Developers
    • Marco Mascheroni
      • Scale down ticket
      • Meta Sites
    • Dennis Box
      • Testing v3.2.19
      • Been busy with Jobsub
      • Automated deployment: Problems with logging
    • Hyunwoo Kim
      • Singularity Support: May need 2-3 weeks
    • Marco Mambelli
      • HTCondor 8.6 changes
      • Job being sent to default schedd. Changes to HTCondor configuration parsing along with how we do it was the issue. New settings are compatible with old HTCondor versions
      • condor_config_val now defaults to tool environment and based on the daemon calling it, dump will dump different info.
      • Glideinwms 3.3 will go in OSG-3.4-upcoming
      • Students coming June 19

May 24, 2017

Attending: Marco, Marco, Dennis
Parag started the meeting, HyunWoo sent email

  • Developers:
    • HyunWoo:
      • I have been working on this singularity issue and I am currently adding new code for this ticket.
    • Dennis:
      • Moved 2 tickets to 3.2.20, will complete the one in feedback
    • Marco Mambelli:
      • Will complete 15892 by tomorrow morning
    • Marco Mascheroni:
      • 16414: applied feedback, we discussed it in the meeting. Better start the timeout from queueing time than EnterCurrentStatus, to avoid resets in case of preemption or jumping through hold.
  • CMS
    Marco discussed at the CMS meeting and Antonio would prefer to have direct control of the complete periodic_remove expression, not just removing Idle glideins.
    To avoid jobs that go to idle after being held.
    These may be missed in the expression.
    Marco Msc will investigate the behavior of the factory with held jobs. According to Marco Mmb there is no periodic release, and
    there is no reason to automatically release glideins that had been held? (Correction: the release is needed for glideins already running jobs. There, a hold due to a temporary network problem could otherwise kill user jobs.)
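
The point made under #16414, that the timeout should start from queueing time rather than from EnteredCurrentStatus, can be illustrated with a minimal sketch. This is not GlideinWMS code: the function name, the sample glidein, and the 48-hour limit are assumptions chosen for illustration. Only QDate, EnteredCurrentStatus, and JobStatus are standard HTCondor job attributes.

```python
# Hypothetical sketch, not GlideinWMS code. QDate, EnteredCurrentStatus and
# JobStatus are standard HTCondor job attributes; everything else here
# (function name, sample values, the 48h limit) is illustrative.
IDLE, RUNNING, HELD = 1, 2, 5  # HTCondor JobStatus codes


def idle_too_long(ad, now, limit, clock_attr="QDate"):
    """Return True if an idle glidein exceeded `limit` seconds since `clock_attr`."""
    return ad["JobStatus"] == IDLE and now - ad[clock_attr] > limit


# A glidein queued 3 days ago that was held and then released 1 hour ago.
# The hold/release cycle reset EnteredCurrentStatus but left QDate untouched.
now = 1_000_000
glidein = {
    "JobStatus": IDLE,
    "QDate": now - 3 * 86400,
    "EnteredCurrentStatus": now - 3600,
}
limit = 2 * 86400  # 48 hours, an assumed threshold

# Measured from queueing time the glidein is overdue for removal.
by_qdate = idle_too_long(glidein, now, limit, "QDate")
# Measured from the last status change it looks only one hour old.
by_status = idle_too_long(glidein, now, limit, "EnteredCurrentStatus")
print(by_qdate, by_status)
```

Under the same assumptions, a removal expression along the lines of `JobStatus == 1 && (time() - QDate) > 172800` would behave like the QDate branch of the sketch. Whether the final periodic_remove expression should take this form was still under discussion.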

May 17, 2017

Parag Mhashilkar, Marco Mascheroni, Dave Dykstra, James Letts, Antonio, Hyunwoo Kim, Eric Vaandering, Marco Mambelli

  • Singularity
    • The HTCondor feature for logging who is using singularity is in 8.6.3 and is available in upcoming
    • Asking Tony to try it out. Dave is learning to install HTCondor CE
    • Can we send parameters to glidein -- yes
  • CMS
    • Removing from the factory queues any glideins older than x days. Marco Mascheroni is working on it and will have a prototype by today. Will work with Mambelli on how to test it. Question is how to pass the info from frontend to factory.
  • v3.2.19
    • Will try to make it by OSG release
    • Status didn't change much since last week
    • Updated the status of tickets that will be released in v3_2_19

May 10, 2017

Parag Mhashilkar, Marco Mascheroni, Dave Dykstra, Dennis Box, James Letts, Antonio

  • CMS
    • Trying to improve the efficiency of global pool
      • Checking age of pilot in the system. Noticed that we keep a lot of old pilots.
      • Marco: Talked to James during HTCondor week. Removing pilots idle in local queue but not in remote queue. Removing pilots in remote queue is tricky
      • Antonio: If we lose spot in queue, that's ok. If we are in fair share losing spot is ok. There are held pilots. Killing old pilots is ok. Also it is ok to remove remote pilots that are over a couple of days old. There are multiple CEs per site. There are single entries per CE. If we are over-provisioning, we start removing old pilots (older than 24 hours)
    • Check if we remove held glideins and also forcefully.
    • Start removing idle glideins that are local and also idle for more than x hours
    • Use periodic remove
    • Plots+Notes from Antonio
      • QDate for all the pilots in the CMS global pool for running, idle and held status in the factory schedds. As we discussed, the main observation is that we apparently have quite old stuff in the queues, even if the majority of pilots which are running are relatively "fresh" (they were requested within the last 2 to 3 days)

      • Looking at it from a site's perspective (PIC Tier 1), this plot shows pilots running and queued classified according to the date they entered the local queue. As it shows, in some cases such as this, we are running old pilots requested about a week ago. This is basically breaking the correspondence between workload pressure and glidein pressure in situations when we have fluctuating workloads in the global pool (e.g. requests that only take about a couple of days to complete, then pretty much nothing for the next few days). Hence the suggestion to fix it by trying to remove any pilots in the system that did not start running perhaps 48h (or 24h) after being requested by the FE, in order to recover the correlation to existing demand.
      • Then, looking at it as a function of time, this is the number of pilots classified as running, idle or held from factory perspective, for a couple of sites:

      • As I mentioned, there are clear differences in the mix amongst CMS sites. For PIC above, we have about 5x total more glideins than max running (=size of the CMS pledge at the site). Going through the full queue explains the picture in c): once the pilots get to run, they are already a week old. In contrast, the second site (UK Tier-1 at RAL) has much less in queue compared to running, so it's going to be running fresher pilots (=following more closely the actual demand for their resources).
  • Marco Mascheroni
    • Looking at removing idle glideins
    • Meta sites: lower in priority
  • Dennis Box
  • Hyunwoo Kim
  • Marco Mambelli
    • v3_2_19 Status - OSG dev freeze on 2017-05-30
      • RC next week. Some small bug fixes.
      • Singularity support should be done by end of this week
      • Changes in HTCondor for subsystem config changes. Impact is visible with HTCondor 8.6

April 26, 2017

Parag Mhashilkar, Marco Mascheroni, Dave Dykstra, Dennis Box, James Letts

  • glexec & singularity
    • Singularity taking over at OSG
    • HTCondor CE plugin to collect logs from glideinwms pilots to tell collector which VO users are running jobs etc. Brian wrote the plugin. This is a requirement for FNAL to install it here. Dave suspects other sites may be interested in running this plugin. Plugin will be part of HTCondor
    • We are waiting on FNAL for feature request
    • CILogon Basic CA is not recognized in Europe. FNAL uses this CA, so users can't run in Europe. Following options:
      • JNR is only allowing FNAL users. Mine is trying to see if this can be distributed
      • BNL has a package but will take up to a year
      • OSG can make agreement with CILogon to give a list of Identity providers that are approved to be Silver level which is approved in Europe
  • OSG does not plan to distribute glexec in OSG-3.4 which is expected in next few months
  • CMS:
  • Parag: Possible to kill a job if it's the only one in a multicore glidein
  • James: When trying to drain, in case of some sites it took upwards of 72 hours to drain them
    • take central manager to go away
    • wait for glideins to go away
    • request frontend has made
    • when job pressure is low we still get lot of glideins.
    • Can we automate the removal of glidein
  • Marco Mascheroni
    • Create and assign to yourself ticket on auto removal of glideins to help with faster draining
  • Dennis Box
    • Looking at #2531, but will have to add new fields to rrd. It's tricky and may break things. Parag suggested moving this ticket to the v3.3 series if there is a chance of breaking backwards compatibility.
  • No meeting next week. Most of us will be at HTCondor week

April 19, 2017

Notes from James Letts

Ad hoc meeting between the GlideinWMS Developers and the CMS Submission Infrastructure Group

Attending: Marco Mascheroni, Antonio, James, Marco Mambelli

  • Since the regular glideinWMS developers meeting has been cancelled for the past couple of weeks, CMS called an informal chat to discuss some issues we have been having with ramping down the Global Pool. Bursty job submission patterns are something SI cannot do anything about, but we noted that while we can ramp up the Global Pool very quickly and efficiently (with few idle cores), when job pressure becomes reduced then draining glideins and even new running glideins that were submitted to the sites' batch systems during high pressure waste a lot of CPU cores. Since this happens often, it has become visible to the sites and to their funding agencies. CMS SI have made fixing this problem a top priority.
  • After defining the problem, we proposed two areas where improvements might be made in glideinWMS, realizing that solutions may also need to come from HTCondor. One possibility would be cancelling no-longer-needed idle glideins when the job pressure is reduced. Firstly we need to understand the current logic inside glideinWMS.
  • During the switchover of the HTCondor Central Manager machine at CERN, it was observed that the idle glidein queue at PIC was over 48h long. This was during a period when the pool was contracting, so new glideins were not needed according to frontend pressure. Apparently there is functionality in glideinWMS to remove idle glideins at the site batch queues when the frontend pressure drops. Is this turned on?
  • ACTION ITEM: Marco Mambelli will investigate what the current mechanism is, regarding the aggressiveness of submission, investigate consequences for ramp up and ramp down, what the built-in delays are, etc.
  • CMS noted that the glideins at T2_CH_CERN ramp up as well as down really efficiently, unlike most other sites, e.g. T2_BE_IIHE, for which plots were compared.
  • ACTION ITEM: Marco Mambelli will investigate if there are any special factory settings for the entries at T2_CH_CERN responsible for this behaviour.
  • Everyone noted that the CMS Global Pool has a single frontend, that makes the request for resources effectively a single-user model. Could the frontend request removal of idle glideins when no longer needed? Tune retire time?
  • We also discussed controlling the pilot pressure, i.e. the number of idle glideins sent to individual sites. CMS observes that sites with multiple CEs get more pilot submissions. Can glidein factories be site aware and scale pressure appropriately? Are entries in fact being tuned to reflect changes in sites, i.e. addition of resources? In principle the tunings are in git.
  • On the HTCondor side, we are investigating depth-wise filling of multi-core glideins.

March 22, 2017

Parag Mhashilkar, Marco Mascheroni, Hyunwoo Kim, Dennis Box, James Letts

  • CMS:
    • Fighting scalability issues with HTCondor central managers when we get to 200K scale. Queries from different systems are blocking top level Collector. Currently, part of the problem is running out of memory on CERN Machine
  • v3.3.2 Status
    • v3.3.2 rc3 released yesterday
    • Issues with HTCondor 8.4.11 and 8.6.0 where new style config SCHEDD2.BLAH does not work correctly
      • With both 8.5 and 8.6 series you get warnings with old configurations ***
  • Dennis Box
    • Unable to reproduce #2081
    • Will be working on #2531

March 01, 2017

Parag Mhashilkar, Marco Mascheroni, Hyunwoo Kim, Dennis Box, Eric Vaandering, James Letts, Antonio Perez-Calero

  • Release Status
    • 3.2.18 released earlier this week.
    • 3.3.2 working on the release and bug fixes
  • CMS
    • Most changes we are working on are on the HTCondor collector and negotiator scalability issues
  • Developer updates
    • Marco Mascheroni: #11755: Will have update later this week
    • Dennis Box: Automated deployment of
      • #15074: Completed and will be assigned for review
    • Hyunwoo Kim
      • #15479: Assigned for feedback
      • #14501: Found the bug. Need to test the changes
      • #2858: Need to revisit the changes

February 22, 2017

Parag Mhashilkar, Marco Mascheroni, Hyunwoo Kim, Dennis Box, Eric Vaandering, Dave Dykstra

  • Dave Dykstra (glexec)
    • At JNR we were submitting jobs from here but they were failing because of the CILogon Basic CA. They get config from EGI. In Europe, anyone is allowed to join some VOs. At NIKEF they are working on functionality to list the VOs one can join
    • Brian has been asking users to use SL7 starting April (?) if site has singularity.
  • Release Status: v3.2.18
    • Hyunwoo: Testing his changes in rc1 rpm
    • Dennis: Will test v3.2.18 rc1
    • Marco Mascheroni: Will try to work on #11755 this week.

February 15, 2017

Parag Mhashilkar, Marco Mascheroni, Hyunwoo Kim, Jeff Dost, Dennis Box, James Letts, Marco Mascheroni, Antonio, Dave Mason

  • Project News
    • Marco Mambelli is taking over as the Technical Lead
    • Will be switching to zoom
  • Release Status:
    • v3.2.17 is in OSG release
      • Issues with use of daemon function and pid file for the glideinwms services
    • v3.2.18 is being worked on and will be in the next OSG release
  • CMS
    • Talking to Brian wrt singularity and GlideinWMS. He is ready to roll it out on his testbed so this will become priority for the GlideinWMS project
    • Discussions within CMS on controlling the pilot pressure
    • #14559 makes it difficult for debugging. We need to scale down all the number of cores to pilot
  • Factory Operations
    • Covered in topics above

January 18, 2017

Parag Mhashilkar, Marco Mascheroni, Hyunwoo Kim, Eric Vaandering, Jeff Dost, Dennis Box

  • v3.2.17 Release Status
    • Release is out in development repo and will be available in OSG Feb release
  • Factory Operations
    • Enables sha2
    • #14864. Should be 3.2.18
  • CMS
    • No Big News

January 18, 2017

Parag Mhashilkar, Marco Mambelli, Hyunwoo Kim, Dennis Box, Eric Vaandering, Jeff Dost

  • v3.2.17 Release Status
    • Marco tested rc1 on SL6 and SL7

January 11, 2017

Parag Mhashilkar, Marco Mambelli, Hyunwoo Kim, Dennis Box, Eric Vaandering, Jeff Dost

  • CMS
    • No new news
  • OSG Factory Operations
    • Jeff: Ken Herner reported issues after changing the 2MB to 2048 KB.
  • v3.2.17 Release Status
    • Release candidate delayed. Will be released today.
    • Marco Mascheroni to push other changes #13277
    • Dennis to upload his changes
    • Marco Mambelli: Found issues with condor_startup with an unmatched if block at a higher level.
    • Hyunwoo: Will start working on #2858

January 04, 2017

Parag Mhashilkar, Marco Mambelli, Hyunwoo Kim, Dennis Box

  • Release candidate tomorrow, so resolve all the pending tickets.