Weekly Meeting Notes 2016

December 21, 2016

Parag Mhashilkar, Jeff Dost, Marco Mascheroni, Marco Mambelli, Hyunwoo Kim, Dennis Box

  • Jeff Dost
    • Created a global attribute that sets GLIDEIN_CPUS to 1
    • Marco: Motivation is to take advantage of all the slots. On the CE side, unless there is RSL with xcount, the CE always allocates a 1-core WN slot.
    • 1MB = 1024KB vs. 1MB = 1000KB discussion
  • Marco Mascheroni
    • 13277: Assign for feedback for parts that are completed
  • Dennis Box
    • 14474: Was able to reproduce it
  • Hyunwoo
    • 14194: Worked on feedback
  • Marco Mambelli
    • User support and fixes
    • #12838
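
For reference, the 1MB = 1024KB vs. 1MB = 1000KB question from the December 21 notes comes down to simple arithmetic. The sketch below is illustrative only: the function names and the 2500 MB figure are hypothetical and do not come from GlideinWMS or from the meeting.

```python
# Illustrative only: the two conventions raised in the 1MB discussion.
KB_PER_MB_BINARY = 1024   # 1 MB = 1024 KB
KB_PER_MB_DECIMAL = 1000  # 1 MB = 1000 KB


def mb_to_kb(megabytes, kb_per_mb):
    """Convert megabytes to kilobytes under the given convention."""
    return megabytes * kb_per_mb


def convention_gap_kb(megabytes):
    """Difference in KB between the binary and the decimal reading."""
    binary = mb_to_kb(megabytes, KB_PER_MB_BINARY)
    decimal = mb_to_kb(megabytes, KB_PER_MB_DECIMAL)
    return binary - decimal


if __name__ == "__main__":
    request_mb = 2500  # hypothetical memory request, not from the notes
    print(mb_to_kb(request_mb, KB_PER_MB_BINARY))   # 2560000
    print(mb_to_kb(request_mb, KB_PER_MB_DECIMAL))  # 2500000
    print(convention_gap_kb(request_mb))            # 60000
```

The gap grows linearly with the request (24 KB per MB), so two components that disagree on the convention will disagree by about 2.4% on every value they exchange.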

December 14, 2016

Dennis Box, Parag Mhashilkar, James Letts, Jeff Dost, Eric Vaandering, Marco Mascheroni, Marco Mambelli, Hyunwoo Kim

  • CMS
    • Upgraded global pool to 3.2.16 and Tier-0 pool as well
    • So far no issues with the upgrades
    • Staffing is a bit slow too because of holidays
    • Most of the work is on the HTCondor side currently
    • #13277: Code will be reviewed and released in v3.2.17
  • Factory Operations
    • Nothing major
  • Dennis Box
    • 14558: Now in review
    • 14474: Unable to reproduce. Parag will help him with it.
  • Marco Mambelli
    • Helping Dirk & Krista troubleshoot problem with Cori and Edison
    • Dirk asked for customization to local resource manager parameter and multiple
    • Jeff: Disabled Edison so as to utilize the allocation.
    • Parag: Can we get a test allocation for glideinwms team? Apply for account and refer to repository: m2612
  • Hyunwoo
    • 4586:
  • Marco Mascheroni
    • 13277: Glidein job will be in queue in completed status for 12 hours and will be auto cleaned up.
      • Factory does condor_qedit through bindings as a single transaction. There is still a chance for scale issues, so we should give an option to operations to disable it.
  • No Meeting on December 28, 2016

December 07, 2016

Marco Mascheroni, Antonio, Dennis Box, Parag Mhashilkar, James Letts, Jeff Dost, Marco Mambelli, Eric Vaandering

  • CMS
    • Deployed 3.2.16 on global pool but have not reviewed the multicore changes
    • Discussed urgent issues last week
  • Jeff Dost
    • New issue: NERSC complaining about querying batch system too often. It's BOSCO.
    • Marco, Parag to schedule meeting with Jeff on Thursday after 1pm or any time on Friday
  • Marco Mascheroni
    • Found a bug with #13277, so testing more
    • #11755 taking back the ticket
  • Dennis Box
    • The glidein script checks for the existence of curl and wget and uses whichever is available. If wget fails, it tries curl.
  • Marco Mambelli
    • Followed up with Krista and Dirk: they changed some entries, but the submitted glideins were using the old values. This was because of the release.
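
The curl/wget behavior Dennis described on December 07 can be sketched as follows. This is a minimal Python illustration of the fallback logic only, not the glidein script's actual code; the function names `build_commands` and `fetch` are hypothetical, and the wget-before-curl order is taken from the note above.

```python
import shutil
import subprocess


def build_commands(url, dest, which=shutil.which):
    """Return the download commands to try, in order, for the tools found."""
    commands = []
    if which("wget"):
        # -q: quiet, -O: write the download to dest
        commands.append(["wget", "-q", "-O", dest, url])
    if which("curl"):
        # -f: fail on HTTP errors, -s: silent, -o: write the download to dest
        commands.append(["curl", "-f", "-s", "-o", dest, url])
    return commands


def fetch(url, dest, which=shutil.which, run=subprocess.call):
    """Try each available tool until one exits with status 0.

    Returns the name of the tool that succeeded, or None if all failed
    or neither tool is installed.
    """
    for cmd in build_commands(url, dest, which=which):
        if run(cmd) == 0:
            return cmd[0]
    return None
```

Passing `which` and `run` as parameters keeps the fallback order checkable without touching the network or requiring either tool to be installed.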

November 30, 2016

Marco Mascheroni, Antonio, Dennis Box, Parag Mhashilkar, Hyunwoo Kim, Marco Mambelli

  • CMS:
    • Marco Mascheroni: #13277 Working on it. Will take #11755 once done.
    • Marco Mambelli: #13069. Need to work with Jeff Dost
    • CMS started using JDL magic/resizable jobs. Need to understand how this will behave from the frontend. Currently we are facing high demand.
  • Dennis Box:
    • Working on #14558. Having trouble with testing and starting glideins. Will work with Marco.
  • Marco Mambelli:
  • Hyunwoo Kim:
    • Working on #14194: SL7 init scripts. In SL6, even today, they can use the reconfig command directly without using the service command.

November 22, 2016

Talked to individual developers separately

  • Marco Mascheroni
    • Have a prototype on #13277. Needs to test it before committing.
    • May want to take back #11755 from Marco Mambelli if he hasn't started working on it

November 09, 2016

Parag Mhashilkar, Hyunwoo Kim, Dennis Box, Marco Mambelli

  • Dennis Box
    • Got a working glideinwms setup. Now looking at first ticket to start working on.
  • Marco Mambelli
    • Worked on tickets
    • Documented testbed.

October 26, 2016

Parag Mhashilkar, Hyunwoo Kim, James Letts, Antonio Perez-Calero, Dave Mason, Eric Vaandering, Dennis Box

  • Project News
    • Release v3.3.1
    • Stakeholder's Meeting
  • CMS
    • Status of #13277 & #13069.
    • May need to release
    • Central Manager is not able to scale on CERN hardware. UDP packets routed within machines are backlogged. The HA backup at FNAL is a bit more powerful and does better. Requested a better machine/physical machine from CERN. CMS is trying to solve it through different channels.

October 19, 2016

Parag Mhashilkar, Hyunwoo Kim, Marco Mascheroni, James Letts, Antonio Perez-Calero, Dave Mason, Eric Vaandering, Marco Mambelli

  • CMS
    • Status of #13277 & #13069
    • Extra idle slot to prevent fragmentation. Farrukh is going to try something and check whether HTCondor would allow it. Request for disk reservation in GlideinWMS: open a ticket and see if we can make it easier in GlideinWMS. The use case is to place merge jobs. We should add this to GlideinWMS so that if there is not enough disk we do not create the extra slot.
  • HPC
    • For IF 1-2 million hours. In future around 10k-20k slots
  • JQuery
    • Fermilab security requested an upgrade

October 12, 2016

Parag Mhashilkar, Hyunwoo Kim, Marco Mascheroni, James Letts, Antonio Perez-Calero, Dave Mason

  • CMS
    • Setting up high-IO slots; it would be useful to partition the disk. This is a feature request for GlideinWMS. Marco and Farrukh are talking about this. Even if we configure it in the glidein, it's not clear whether HTCondor enforces it.
    • Holding 150K+ jobs over a week. Came close to 200K cores.
  • v3.2.16
    • Will have a release sooner rather than later
    • Fixes critical bugs
  • Marco Mascheroni
    • Change the script to glidein_libs.sh
  • Hyunwoo Kim
    • SL7 does not like processes being killed during the reload, as we do for the frontend and factory
    • #13881: Split into two tickets. SL7 and SL6 specific. SL6 work is almost done

October 05, 2016

Parag Mhashilkar, Dave Mason, Eric Vaandering, Hyunwoo Kim, Marco Mambelli, Dennis Box, Marco Mascheroni, James Letts, Dave Dykstra, Antonio Perez-Calero

  • Dave Dykstra
    • Singularity (glexec's replacement) progressing
    • Problem with EL7 crashing, as it was using a config option we do not need. It's not turned off.
    • It has chroot and bind mount functionality. Requires top level dir '/' to be in chroot. If it is not like /cvmfs you can't bind mount on top of it. Brian Bockelman plans to use same path inside the contained environment so everything like HTCondor will work as well. Dave thinks it may not be practical.
    • Timeline: Brian thinks mid next year in production.
  • GlideinWMS News & Issues
    • Issues with HA frontend & HTCondor python bindings. #14036
    • v3_2_15+ frontend requires v3_2_14_1+ factories
  • CMS
    • #13277: Marco Mascheroni, working on it. CHEP is taking priority, so progress is slow.
    • #13069: Parag needs to talk to Jeff, who is busy with CHEP. So it won't be in v3_2_16
    • #7922: Will be in v3_2_16
    • Dave Mason: Playing more with the I/O slot. CMS has workflows that are multi-core and single-core. Some subtasks in a workflow are single-core and higher priority, and CMS is trying to place them in the I/O slot. Brian volunteered Nebraska for use with the I/O slot and is working with Farrukh. There are slot partitioning concerns. Marco Mambelli is working with Farrukh to look for a solution. Marco: Instead of a separate slot, over-account CPU.
  • Marco Mambelli
    • development machines in fermicloud and reverse proxies. Machines were updated recently.
  • Hyunwoo Kim
    • #4586: Init scripts in RHEL7. The reload action is an issue. Current reload -> upgrade/reconfig does not work
    • #13881: No progress.
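
The compatibility note from October 05 (v3_2_15+ frontend requires v3_2_14_1+ factories) can be checked mechanically. The sketch below is an illustration under stated assumptions, not GlideinWMS code: the helper names are hypothetical, tags are assumed to follow the `v3_2_14_1` pattern used in these notes, and older frontends are treated as unconstrained because the notes state no rule for them.

```python
import re


def parse_version(tag):
    """Turn a tag such as 'v3_2_14_1' into a tuple of ints: (3, 2, 14, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def padded(version, length):
    """Right-pad a version tuple with zeros so tuples compare field by field."""
    return version + (0,) * (length - len(version))


def at_least(tag, minimum):
    """True if tag is the same as or newer than minimum."""
    a, b = parse_version(tag), parse_version(minimum)
    n = max(len(a), len(b))
    return padded(a, n) >= padded(b, n)


def factory_ok_for_frontend(frontend_tag, factory_tag):
    """Apply the rule from the notes: v3_2_15+ frontends need v3_2_14_1+ factories."""
    if at_least(frontend_tag, "v3_2_15"):
        return at_least(factory_tag, "v3_2_14_1")
    return True  # assumption: no constraint is stated for older frontends


if __name__ == "__main__":
    print(factory_ok_for_frontend("v3_2_15", "v3_2_14"))    # False
    print(factory_ok_for_frontend("v3_2_15", "v3_2_14_1"))  # True
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "v3_2_9" sorting after "v3_2_14".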

September 28, 2016

Parag Mhashilkar, Dave Mason, Eric Vaandering, Hyunwoo Kim, Marco Mambelli, James Letts

  • CMS
    • No progress on resizable jobs, but that's on the list of things to do next. Everyone seems to be busy with CHEP.
    • Marco Mambelli: #10910. Change in how CPUs are auto-discovered and in the defaults. The earlier default was fixed slots; the new default is partitionable. The auto-detection default is also node -> slot. More details are in the ticket. Dirk was using auto detection; glidein fork bombs.
  • v3_2_16
    • Schedule: Tentatively for mid-end of Oct
  • Hyunwoo
    • Working on #13881 & #4586
    • Parag: Let's first close the tickets in feedback mode so we can get them in the release

September 21, 2016

Antonio Perez-Calero, Parag Mhashilkar, Dave Mason, Eric Vaandering, Hyunwoo Kim, Marco Mascheroni, Marco Mambelli

  • CMS
    • #13807: Support for singularity
    • #7922: Manually starting glidein by hand. Submission infrastructure meeting
      Antonio: This does not seem to be a high priority.
      Parag: Raised concerns about this related to accounting and gaps with unknown info
      Antonio: This is LHCB, UK sites
    • #13277: Marco Mascheroni working on this issue
  • Marco Mascheroni:
    • #13277: Marco Mascheroni working on this issue
  • Marco Mambelli
    • #7186: Parag: Maybe we should support Google Cloud in the v3.2 series, so merge those changes into the BOSCO ticket so you get the related credential checks
  • Hyunwoo Kim:
    • #4586: Switch init script to use RHEL daemon function

September 14, 2016

Hyunwoo Kim, Parag Mhashilkar, Dave Mason, Marco Mambelli

  • Hyunwoo Kim
    • #4486: Not much progress. Have a working script for slf6, but we also need one working for slf7
  • CMS
    • Manually starting glidein. #7922. Parag will check with Jeff what his plans are to upgrade the factory
    • Request for making notes public
    • #13609: Parag will check with Jeff on factory ops perspective
    • #13807: