Parag Mhashilkar, 05/22/2019 10:19 AM

Weekly Meeting Notes


May 22, 2019

Marco Mambelli, Parag Mhashilkar, Lorena Lobato, Marco Mascheroni, Dave Dykstra, Dennis Box

  • Release status
    • pylint failures since the version was changed. Now we catch more SL7 errors. Need to confirm that it is not related to the pylint version
    • Lorena and Marco are testing the single user factory. Marco found one permissions issue, which was fixed with changes to the spec file. Also testing whether we can drop the OS users for each frontend since we are moving to a single factory user. Checking with HTCondor users on how to do it without different OS IDs
    • Mambelli will cut a release candidate later today
  • Dave Dykstra
    • Having several discussions related to singularity with Mambelli.
    • Mambelli: Everything works fine with system installed condor and tarball installed condor. Will need condor 8.8.2 for pilot

May 15, 2019

Marco Mambelli, Dennis Box, Parag Mhashilkar, Lorena Lobato, Dave Dykstra

  • Singularity
    • Security release announced yesterday. Building it. Released 3.2, which has major changes. Building a patched version. A few things should have been in EPEL testing but were not there; this release has those changes. Impacts unprivileged mode.
    • Will go in osg 3.2
    • Next will put singularity 3.2 in production osg
  • v3.5 Release Status
    • working on singularity tickets and
    • Need to wrap up.
    • Parag: Need to get release out right away to give users chance to try them out.
    • Mambelli: HTCondor with no switchboard support will not go into OSG production until June or so because of the delay
    • Transition of file and job ownerships in factory for switchboard changes. HTCondor team helped with the migration scripts and steps that are needed
  • Developers
    • Mascheroni
      • Working on testing script
    • Lorena
      • Get everything done for blackhole detection
      • Working with Diego and fixed periodic script
    • Dennis
      • Mostly on vacation
      • While testing found small bugs
    • Mambelli
      • Working on feedback tickets and assigned them. Submission of singularity jobs in HTCondor
      • Working on coordinating planning for summer students
      • Thomas, 2 target students, 1 quark net student and Italian student in Aug-Sep

May 01, 2019

Marco Mambelli, Dennis Box, Parag Mhashilkar, Lorena Lobato, Dave Dykstra

  • Singularity
    • 3.2 RC is ready and should be released any time
    • Singularity dev is working on the fuse command option that Dave proposed, which should work with CVMFS provided it is linked with the fuse3 library
    • Working on fuse3 in EPEL. Submitted a pull request. Asked the fuse3 dev for permission, which was given after several days.
    • Singularity wrapper: CMS is in discussions about switching to the glideinwms-provided wrapper instead of using their own.
  • Action items
    • Marco sent email to Edgar about students but the email thread died after that
    • Roadmap in Wiki
    • Working on moving artifacts to gitlab free account.
  • Stakeholder slides
    • Going through the slides

April 24, 2019

Marco Mambelli, Dennis Box, Parag Mhashilkar, Lorena Lobato, Marco Mascheroni

  • Release Status
    • 3.4.5 is out, Diego tested it, and it will be in the next OSG release, 3.4.28. Currently in OSG testing. We still support SL6.
    • Working on 3.5. Current list is long. Need to trim once single user is tested.
  • Marco Mascheroni
    • Nothing to report
  • Lorena Lobato
    • Talking with Krista on periodic scripts
    • Testing 3.4.5
  • Dennis
    • Not many cycles last week, closed
  • Marco Mambelli
    • Mainly on condor and singularity
  • Next week we need developers slides for stakeholders meeting

April 03, 2019

Marco Mambelli, Dennis Box, Dave Dykstra, Marco Mascheroni

  • Singularity report from Dave
    • Singularity 3.1.1 fully released in OSG-upcoming and epel testing, epel in 2 weeks
    • Singularity core team will add fuse3
    • Fuse3 will be supported in epel soon probably
    • From Dirk: @ TACC their worker nodes are running RH7 and allow fuse mounts. This means that CVMFS could be mounted as an unprivileged user. Mounted in a directory where you have write access and then bind mount in the right place when starting Singularity. Unprivileged namespaces would make it easier: we could start an unprivileged namespace, start CVMFS inside it and then run Singularity. Dave and Dirk will check if they can change the kernel option to enable that
  • Release:
    • GWMS 3.4.5 RC1 was released yesterday. Marco Mambelli's smoke tests are OK (SL6 and SL7 upgrades). Dennis will start his automated tests.
  • Developers
    • Marco Mascheroni
      • Fix for 3.4.5, boolean comparison more robust
      • Optimization of Frontend code for production; will push it to a branch. Also added the code that dumps the data. It can be enabled by uncommenting some lines in the code (will add a detailed explanation in the comments). Profiler code will be added to the unittest directory. Will add a comment with instructions to factor out the inner function to get detailed profiling, but will not integrate that in the code (it makes it less legible).
      • Since Krista added a new Frontend with a new DN, a script from Diego is not getting the status correctly: system control-status is reporting the frontend as inactive. This may be because of the behavior in SL7 (systemctl instead of system). Marco Mascheroni will investigate
    • Dennis Box
      • Finished 21940, unit testing
      • Will close the testing of incommon certificates ticket
    • Marco Mambelli
      • Released 3.4.4 last week and troubleshot the problem reported by OSG integration
      • Released and tested 3.4.5 RC1
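
The approach Dirk describes above (CVMFS mounted by an unprivileged user in a writable directory, then bind mounted in the right place when starting Singularity) could be sketched roughly as follows. This is only an illustration: the function name, image path and directories are hypothetical, the unprivileged FUSE mount step itself is not shown, and only the `--bind source:destination` option of `singularity exec` is relied upon.

```python
import shlex


def singularity_bind_cmd(image, cvmfs_mount_dir, job_cmd, target="/cvmfs"):
    """Build a `singularity exec` command that bind mounts a user-level CVMFS mount.

    cvmfs_mount_dir is a directory the user can write to, where CVMFS was
    already FUSE-mounted without privileges (that mount step is not shown).
    """
    if not job_cmd:
        raise ValueError("job_cmd must contain at least the program to run")
    # --bind source:destination makes the user-level mount visible at `target`
    # inside the container.
    return ["singularity", "exec", "--bind", f"{cvmfs_mount_dir}:{target}", image, *job_cmd]


cmd = singularity_bind_cmd(
    image="/path/to/image.sif",             # hypothetical image
    cvmfs_mount_dir="/scratch/user/cvmfs",  # hypothetical writable directory
    job_cmd=["ls", "/cvmfs"],
)
print(shlex.join(cmd))
```

Building the command as a list keeps the arguments unambiguous if it is later passed to a process launcher; `shlex.join` is used only to display it.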

April 10, 2019

Marco Mambelli, Lorena Lobato, Dennis Box, Parag Mhashilkar, Dave Dykstra, Marco Mascheroni

  • Release Status 3.4.4
    • In OSG testing
    • Edgar tested. Matches are not working and he confirmed that it is a 3.4.4 issue. Marco Mascheroni is looking into it. Last time it was caused by bool and string matching. Mascheroni tried Edgar's settings in testing and they worked fine. Needs more investigation.
  • Developers
    • Mascheroni
      • Disk filling up because of pilot stdout and stderr logging. Does glideinwms clean up the logs? If a frontend is not asking for an entry, will the logs get cleaned up for that entry?
        • Mambelli: Not sure about glidein logs.
      • Glidein off issue faced by FIFE. We don't have access to the credentials so we can't troubleshoot
    • Dennis
      • Working on #21940. Made progress on it last night
      • #21844 done
    • Lorena
      • Providing feedback on 3.4.4 and troubleshooting blacklist
    • Mambelli
      • Testing on pending issues on 3.4.4
      • Moving singularity wrapper
      • Get started container test for Nova
      • Python 3 migration: On hold until 3.5 finalized.
      • Need to fast track the glideinwms 3.5 if it needs to go in the upcoming.
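
The bool and string matching problem mentioned under Release Status above is the kind of issue where a value may arrive either as a real boolean or as a string such as "True" or "False". A minimal sketch of a tolerant comparison helper follows; the name `to_bool` and the accepted spellings are assumptions for illustration and are not taken from the GlideinWMS code.

```python
def to_bool(value, default=False):
    """Interpret bool-like configuration values in a tolerant way.

    Accepts real booleans and common string spellings; anything that is not
    recognized falls back to `default`.
    """
    if isinstance(value, bool):
        return value
    if value is None:
        return default
    text = str(value).strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    return default


# A naive truth test treats any non-empty string as true, including "False".
print(bool("False"))     # True
print(to_bool("False"))  # False
print(to_bool(" TRUE ")) # True
```

Normalizing both sides with a helper like this before comparing avoids mismatches such as `"True" == True`, which is always false in Python.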

April 03, 2019

Marco Mambelli, Lorena Lobato, Dennis Box, Parag Mhashilkar, Dave Dykstra, Marco Mascheroni

  • Singularity
    • 3.1.1 released in OSG upcoming-development, fedora and planning for EPEL as well
      • Fixes last known problem about incompatibility with 2.6
    • In the last few days figured out how to mount a fuse file system as privileged in an HPC system and run the fuse system inside the container, so CVMFS can run inside. This is a way to run CVMFS in HPC. It should avoid the need for huge containers with CVMFS inside them. It depends on libfuse-3. CVMFS developer Jacob managed to get it working in development mode.
    • Submitted a request to update the singularity installation documentation with how to install it and set it up unprivileged. Running of CVMFS is already in 3.4.4. Once experiments adopt it, we can start telling experiments to remove the singularity installation.
    • CMS is thinking about moving to unprivileged singularity. Brian pushing for pilot sites first before asking other sites.
  • Release Status v3.4.4
    • rc4 test all positive. If everything is ok Marco will release later today or tomorrow morning
  • Mambelli
    • Started working on condor invoking singularity for 3.5. Created branch for 3.5. Master -> 3.4
  • Mascheroni
    • Heard a talk about glideinwms and submission infrastructure from CMS side
    • Currently working on tests of improvements for count match function. Apply hotfix for cms frontend and test it.
    • Fix for downtime entries. Adding an option in the frontend to ignore entries in downtime and consider them for un-matched.
    • With Edgar found some problems related to schedd 8.8.1 and frontend communication. Frontend is running 8.6. Couple of options in ticket with glideinwms.
  • Dennis
    • Will work on smoke tests later today
  • Lorena
    • Working on a couple of tickets, providing feedback and testing. Fixing review errors on the branch used for code review. Troubleshooting FIFE script.
    • Will talk to the condor team about any problem related to the blacklist script.
    • Working on configuration on black hole detection

March 27, 2019

Marco Mambelli, Lorena Lobato, Dennis Box, Parag Mhashilkar

  • v3.4.4 release update
    • Features are almost done. Will have release candidate later today. #21916
    • One of the unit tests will be in the next release. #21940
  • 3.5
    • Plan is to get changes during the review and release them
    • Single user factory
    • Use of condor to start singularity. Everything will be done through condor. condor ssh to job will work by doing this.
    • #20799 will be done for v3.5
    • Will go in osg upcoming
  • FIFE & GCO: glidein_off is not working with the way the infrastructure is deployed and supported here. It's not a glideinwms problem, but we should help them arrive at an agreement and then close the ticket.

March 6, 2019

Marco Mascheroni, Marco Mambelli, Parag Mhashilkar, Maria Zvada

Dave Dykstra, Dennis Box, Marco Mascheroni, Marco Mambelli, Lorena Lobato

  • Singularity (Dave Dykstra)
    • Singularity 3.1.0 released and built for Fedora. Incompatibility w/ 2.6 (if there is a duplicate bind path behaves differently: 2.6 accepted it, 3.1 gives a fatal error)
    • The show stopper is an issue w/ unprivileged Singularity pulling from Docker (not working in 3.1), will be fixed soon, high priority
    • Future feature: Potential to be able to do nested Singularity (outside would need setuid root, inside could be unprivileged), will require the most recent kernel from EL7. Would allow using Singularity in the node and run it from the glidein
    • Next week will be at the Singularity users meeting
  • Release 3.4.4
    • Tickets halting:
      • Factory monitoring for HEPCloud
      • Unit test for boolean values (Lorena will work on it since Dennis is taking some days off)
    • There may be a ticket about parsing metasite configuration
  • Developers
    • Marco Mascheroni
      • Discussed w/ Factory operation prototype for configuration generation
        • Mostly happy, some small changes requested
        • Will start testing it in production for a small set of entries
    • Lorena
      • Mostly training and sick leave
      • Provided feedback to some tickets
      • Troubleshooting w/ Shreyas FIFE periodic script for black holes
      • Working on black hole ticket w/ condor team
    • Dennis
      • Checking CI infrastructure
    • Marco Mambelli
      • Working on Factory monitoring
      • Feedback to Brian Lin for a fix in the proxy renewal script
      • Troubleshooting a couple of factory issues
      • Meeting about containers in FIFE
  • Next week there is the stakeholders meeting
    • Marco gave feedback to Dennis's and Lorena's slides
    • Marco Mambelli and Marco Mascheroni will provide the slides to Parag within the day

February 27, 2019

Marco Mascheroni, Marco Mambelli, Parag Mhashilkar, Maria Zvada

  • Developers
    • Marco Mambelli
      • Made 3.4.4
      • Working on a problem where monitoring stats go to 0, but was not able to reproduce it
      • Schedd downtime is reported incorrectly in XML as it is updated only when there is work
      • Working on troubleshooting factory with Krista
      • Will adapt to singularity solution provided by HTCondor after 3.4.4
    • Marco Mascheroni
      • Manual glidein startup
      • Setting attr to constant = False prevents publishing to factory
      • Working on issues related to FIFE support and Shreyas about glidein shutdown
      • CRIC site config generation

February 20, 2019

Dave Dykstra, Dennis Box, Marco Mascheroni, Marco Mambelli, Parag Mhashilkar, Lorena Lobato

  • Singularity (Dave Dykstra)
    • 3.0.3 is released in Fedora now. Waiting on another fix before it will be in osg production. There are some features that Atlas needs but that do not work
    • It's currently in RC3. Only works for root users; planning for unprivileged users
  • v3.5
    • Couple tickets are in feedback mode
    • There are a couple of values that are still string/boolean for True/False
    • #21884 Testing is looking good. Will also test with new format
  • Developers
    • Dennis
      • One of the tickets (#20215) from Lorena broke unittests. Resulted in the improvement of the unittests!
    • Marco Mascheroni
      • #19949 Got feedback from Lorena should be done quickly
      • #21898 Lorena provided the fix
      • Will give feedback for #20861
      • New project TODAS working with CMS to launch pilot. They start glidein_startup script and connect to another pool. There are validation scripts that need to be tackled.
      • In 3.4.3 we could not change parameters that were const if attr is const in global and not in entry
    • Lorena

February 13, 2019

Marco Mascheroni, Dennis Box, Parag Mhashilkar, Lorena Lobato

  • Developers
    • Dennis Box
      • No progress on CI side
      • Testing new CAs on gws-dev factory and frontend and htcondor ce
      • Would like to handle monitoring
      • Gave feedback on #15176
    • Marco Mascheroni
      • Doing test of 3.4.3 and found couple of issues. Meta sites related issues.
      • Started working with factory operator who will have cycles for development on work related to CRIC.
      • Estimation of memory on sites where glidein cpus are auto
    • Lorena Lobato
      • Testing 3.4.3 with htcondor 8.8 and working with TJ for enabling statistics
      • Handling classad from frontend to factory glidein job classad. Publish is not available in frontend side

February 06, 2019

Marco Mascheroni, Dennis Box, Parag Mhashilkar, Lorena Lobato, Marco Mambelli, Dave Dykstra

  • Dave Dykstra
    • Singularity 3.0.3 is ready for osg upcoming
      • Known issue with unprivileged mode. Executing from docker requires privilege. Singularity dev team plans to fix it.
      • WLCG working group meeting. More testing before rolling out unprivileged mode, on the order of 6 months. It takes long because of the ongoing Singularity audit, scheduled to be done by mid June. Some members want to point to the audit before making a recommendation
      • Atlas wants to be able to read from docker on worker nodes, downloading the docker containers on the WN. That's a lot of overhead and sounds crazy. They don't want to maintain an image repo.
      • Marco: CMS wants condor ssh to job to work, but that requires startd to be started as root, which glideinwms cannot do.
        • Dave thinks he can provide some help in that direction
      • Travel to WLCG workgroup. They are asking SI lab and they may be able to pay for Dave's travel. CMS is already paying for CVMFS workshop.
      • Dave and Marco to work together on providing solution for CMS
  • Dennis Box
    • Reviewing #21682. Will be done and go back to working on #2531
    • No progress on travis ci and getting artifacts
  • Marco Mascheroni
    • Couple of issues from CMS. Frontend crashing because one of the attributes in the schedd evaluated to error/undefined, causing the exception. We need to add more protection. Leak in the Changes may not be propagating to the frontend process.
    • Factory operator added an entry. She couldn't get logs from pilots because the pilots were removed based on the frontend's request. Mambelli added a disable to fix it. Getting the log when you kill the job depends on the batch system: if it is translated to kill -9 you don't get it back.
    • CPU = auto and memory set to zero
    • Operations team meeting
      • Session on auto generation of config. Address problem at abstract level, trying to identify category of items required for config.
      • Topics based on migration of services. Not focused on different factory/services etc
  • Lorena
    • Testing 3.4.3 glideinwms + htcondor 8.4.8 identify black hole
  • Marco Mambelli
    • Working mainly on troubleshooting issues about frontend crashing.
    • HTCondor survives the glidein. Made changes to glidein and condor startup. There is a trap in place to forward the signal. Glideins were killed right after starting. Making the script more responsive. Working with Diego and the sysadmin at Purdue to troubleshoot. Their PBS is sending SIGTERM and SIGKILL one after the other, so we don't get time to react. Working with the OSG team since their wrapper script is not forwarding signals correctly.
    • Release of 3.4.3 has been promoted to testing.
    • Started working on the multi node glidein ticket. Added an option as multi glidein.
    • glidein_off problem reported by Shreyas. Mascheroni to follow up.
  • Project News
    • There is a possibility of moving the project from Redmine to GitHub
    • Marco submitted 4 student requests.
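
The signal forwarding discussed in Marco Mambelli's items above (a trap that relays the termination signal so that HTCondor does not survive the glidein) can be illustrated with a small wrapper. This is a sketch under stated assumptions: it is written in Python for illustration and is not taken from the glidein scripts, and `run_and_forward` is a hypothetical name.

```python
import signal
import subprocess
import sys


def run_and_forward(cmd):
    """Run cmd as a child process and forward SIGTERM/SIGINT to it.

    Returns the child's exit code. SIGKILL cannot be trapped, so if the batch
    system sends it right after SIGTERM the wrapper gets no time to react.
    """
    child = subprocess.Popen(cmd)

    def forward(signum, _frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handlers, remembering the previous ones.
    previous = {}
    for sig in (signal.SIGTERM, signal.SIGINT):
        previous[sig] = signal.signal(sig, forward)
    try:
        return child.wait()
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_and_forward(sys.argv[1:]))
```

A wrapper that does not relay signals in this way leaves its child running after the wrapper itself is terminated, which matches the symptom described for the wrapper script that was not forwarding signals correctly.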

January 30, 2019

Marco Mambelli, Dennis Box, Parag Mhashilkar

  • Marco Mambelli
    • Move code review to Thursday and Friday during OSG All hands meeting.
    • Talk to OSG. They released osg release. They will release glideinwms in the coming release in 2-3 weeks
    • There are still issues with condor daemons surviving past the glidein startup script
    • Started working on Singularity to consider release distributed by OSG in CVMFS and consider it in the path.
    • Wrote possible projects for summer interns and there was some communication with Sandra
  • Dennis Box
    • Working on #2531: store number of job restarts in frontend.

January 23, 2019

Marco Mambelli, Marco Mascheroni, Dennis Box, Parag Mhashilkar, Dave Dykstra

  • Singularity
    • OSG releasing singularity 3.0.2 in upcoming (current release in EPEL)
    • The problem seen at OSC w/ Singularity 3 (Too many symbolic links, was giving a permission error from the kernel to Singularity, was working w/ 2.6) seemed more a site problem: updating to RHEL 7.5 fixed the problem
    • Singularity 3.0.3 released and will be soon in EPEL
  • v3.4.3 Release Status
    • Mambelli:
      • RC2 out in osg-development, tests are OK so far
      • Release expected for Thursday or Friday
      • Still investigating some worker nodes where glidein is killed but condor keeps running and accepting jobs, moved the ticket to 3.5
  • Developers
    • Mascheroni
      • Busy w/ operations this past week
      • Will work more on interfacing with CRIC
      • Will check w/ Frank about skipping Thursday at OSG all-hands to do GlideinWMS code review then
    • Dennis
      • kicked off automated tests, so far all OK
    • Mambelli
      • Completed 3.4.3 tickets
      • Prepared RC and started tests
      • Troubleshooting HTCondor surviving glidein. Possible race condition?
  • Tentative code review dates: April 1, 2 or March 21, 22 (after OSG all-hands)

January 16, 2019

Marco Mambelli, Marco Mascheroni, Lorena Lobato, Dennis Box, Parag Mhashilkar

  • v3.5 Release Status
    • Mambelli:
    • Waiting on feedback on couple of tickets. Cut RC but does not include those changes. It should be in the osg-development soon. It is in minefield
    • Need to check with Steve whether the ticket resolves what he needs.
    • There might be some worker nodes where glidein is killed but condor keeps running and accepting jobs.
      • Singularity support added a process group, and there is a condor warning that it may prevent condor from being killed.
  • Developers
    • Lorena
      • Mainly working feedback of tickets and getting ready for release candidate
    • Mambelli
      • Monitoring tickets and working with Thomas. Last week for his last week. Frontend was reporting and Factory had some problems.
      • Dennis interested in picking up the monitoring work from Thomas.
    • Dennis
      • One ticket for #21763. Parsing files into other config files. Not sure if it should go in this release? As per Marco some changes are necessary.
    • Mascheroni
      • Couple of fixes for the release
      • Looking at the process group issue on worker node
      • Working with the CRIC developers for interfacing with CRIC
  • Tentative code review dates: April 1, 2