Weekly Meeting Notes

Apr 23, 2019

Present: Shreyas Nick Tanya Parag

  • Service Degradation After Wed Apr 17 Upgrade:
  • Testing 1.3 on SL7
    • we are ready to have users test on jobsubdevgpvm01.fnal.gov, but don't want to overload them with too many confusing and therefore ignorable emails that read something like 'test this for reasons', 'now test that for similar-sounding reasons'
    • The private /tmp area that gave us such trouble for SL7 migration needs to be documented in the release notes.
  • Issue #22383 bypass VOMS for global_superuser
    • Dennis thought this already happened in the @check_auth decorator, but clearly it doesn't
  • AOB
    • rapid code distribution is high priority for next release
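
Sketch for Issue #22383 (an illustration only, not the jobsub server code): one way a check_auth-style decorator could skip the VOMS lookup for global superusers. The names GLOBAL_SUPERUSERS, voms_check, AuthError and hold_jobs are hypothetical, and the real decorator's signature may differ.

```python
import functools

# Hypothetical superuser list; a real list would come from server configuration.
GLOBAL_SUPERUSERS = {"opsadmin"}


class AuthError(Exception):
    """Raised when a request cannot be authorized."""


def voms_check(username, group):
    """Stand-in for the VOMS membership lookup (hypothetical data)."""
    known_members = {("alice", "nova")}
    return (username, group) in known_members


def check_auth(func):
    """Authorize a call, skipping the VOMS lookup for global superusers."""
    @functools.wraps(func)
    def wrapper(username, group, *args, **kwargs):
        # Short-circuit: voms_check is never called for a global superuser.
        if username not in GLOBAL_SUPERUSERS and not voms_check(username, group):
            raise AuthError(f"{username} is not authorized for group {group}")
        return func(username, group, *args, **kwargs)
    return wrapper


@check_auth
def hold_jobs(username, group, job_ids):
    """Example protected entry point (hypothetical)."""
    return f"{username} held {len(job_ids)} job(s) in {group}"
```

With this shape the bypass lives in one place, so every decorated entry point gets the same behavior.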

Apr 9, 2019

Present: Tanya Dennis Shreyas Joe Nick Farrukh

  • SL7/Condor/Apache authentication issues
    • One workaround would be to set condor to look at grid-mapfile after condor-mapfile
    • Would be nice to understand why authentication is broken on SL7 in the first place
    • Dennis will attend the weekly FNAL-HTCondor conference call and ask questions
    • UPDATE: After consultation with HTCondor team, discovered that on SL7 and other systemd derivatives, apache has its own private /tmp area by default. Changing this allowed condor file authentication to start working again.
  • Dennis has tested InCommon RE's in cigetcertopts.txt but will revisit this after the meeting to make sure that a drain is not needed for planned maintenance on Apr 17
  • Lots of things will change during the downtime, so re-testing of RegExes gives us one less thing to worry about
    • kernel upgrades
    • most machine certs will have new issuer (OSG -> InCommon)
    • OSG certs and software stack goes from 3.3 to 3.4
    • new cvmfs client
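
Sketch for the private /tmp UPDATE above, assuming the setting involved is systemd's PrivateTmp= directive in the [Service] section of the apache unit (the notes do not name the directive). The helper below reads unit-file text and reports whether PrivateTmp is switched on; SAMPLE_UNIT is made-up input, not a copy of any installed unit.

```python
import configparser

# Made-up unit text for illustration only.
SAMPLE_UNIT = """\
[Unit]
Description=The Apache HTTP Server

[Service]
ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND
PrivateTmp=true
"""


def uses_private_tmp(unit_text):
    """Return True if the [Service] section sets PrivateTmp to a true value."""
    # strict=False tolerates repeated keys; interpolation=None leaves % signs alone.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key case as written in the unit file
    parser.read_string(unit_text)
    value = parser.get("Service", "PrivateTmp", fallback="false")
    return value.strip().lower() in {"1", "yes", "true", "on"}
```

A check like this could run as part of post-upgrade validation, since a private /tmp for apache is what broke condor file authentication here.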

Mar 26, 2019

Present: Shreyas Nick Tanya Joe Dennis

  • 1.3 release candidate 1 will be out in a couple of days
    • Nick will start paperwork to upgrade jobsubdevgpvm01 to SL7 for 1.3.rc1
    • Dennis is finishing up condor_prio wrapper tickets #20167 and #13853.
      • Condor_prio does not accept constraints as input, only --user or --jobid. Group opinion favored disabling --user capability as there doesn't seem to be a good way to prevent group superusers from affecting users who run jobs both in their own groups and in other groups
      • Tanya thinks we should open a ticket with condor to accept constraints for condor_prio and wait for it to be implemented prior to allowing --user to be exposed.
    • Shreyas is working on code that implements both #22164 and #21031, it may be ready for 1.3, otherwise it will be pushed back
  • Condor_ssh_to_job #10903 and condor_tail #5906 discussion:
    • Security has consistently said 'NO' to our requests for permission to do this on FNAL resources. It is possible that they may consent for non FNAL OSG resources or for jobs running inside containers, so we are pushing these requests out to 1.3.2 instead of closing them immediately
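
Sketch for the condor_prio wrapper discussion above: argument handling that accepts --jobid and rejects --user with an explanation, which is the option the group favored. The program name and the --prio flag are hypothetical; this is not the jobsub client code.

```python
import argparse


def parse_prio_args(argv):
    """Parse arguments for a hypothetical priority wrapper that only accepts --jobid."""
    parser = argparse.ArgumentParser(prog="prio_wrapper")
    parser.add_argument("--jobid", help="job whose priority should change, e.g. 1234.0")
    parser.add_argument("--user", help="recognized only so it can be rejected clearly")
    parser.add_argument("--prio", type=int, required=True, help="new job priority")
    args = parser.parse_args(argv)
    if args.user is not None:
        # Disabled until condor_prio can take constraints (see discussion above).
        parser.error("--user is disabled: it could affect a user's jobs in other groups")
    if args.jobid is None:
        parser.error("--jobid is required")
    return args
```

parser.error prints a usage message and exits with status 2, so a rejected --user never reaches condor_prio.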

Nov 20, 2018

Present: Bruno Shreyas Nick Tanya Dennis

  • Jobsub 1.2.9 feature list has been updated in redmine
  • The Ferry API requested for Shibboleth authentication is completed.
    • Dennis has basic Shibboleth Authentication working on SL7, but integration with jobsub server remains incomplete

Oct 9, 2018

Present: Bruno Shreyas Nick Farrukh Dennis

  • Port to SL7 underway. Dennis is currently working on a problem where condor_submit seg_faults, which did not happen in previous ports.
  • Discussion: priorities for next release
    • Single-Sign-On authentication for job sandboxes highly desirable
    • Request from S Timm: can jobsub report back how it maps (DN, FQAN) to username, perhaps through a new entry point that does nothing else?
      • sso DN could map to different username than (DN,fqan) mapping that submission used, this needs some more thought.
      • Leads to discussion of whether Ferry API should be doing this as well/instead of. See redmine ticket #20164
      • Implementation of this by ferry would simplify jobsub resolving issue #16072
  • AOB
    • Nick is moving jobsub-dev from OSG 3.3 to OSG 3.4 today. He will ping developers to make sure everything is OK after move

Sep 25, 2018

Present: Joe Shreyas Bruno Tanya Nick Kevin Dennis

  • jobsub 1.2.8.1 has a ferry/myproxy interaction problem, older certs not always notifying myproxy server to renew
    • Joe was able to demonstrate during meeting
  • Issue #16072 (failure to map to Production accounts for sandbox retrieval) is causing significant pain for fife-group, please fix it
    • #16072 is scheduled for next release.

Aug 14, 2018

Present: Nick Tanya Shreyas Bruno Dennis

  • Action Items
    • Nick will deploy latest release candidate to jobsub-dev for Fife and User testing of Ferry authentication
    • jobsub.ini files will be cleaned of no longer active experiments (Nick on production, Dennis/Shreyas in jobsub git repo and on dev instances)
    • Ferry integration testing will proceed ASAP on jobsub-dev (Dennis/Shreyas, then Fife group, then users)

Jun 19, 2018

Present: dbox, sbhat, perignow, tiradani, bruno

  1. Testing status - ran into collector config issue
    • Tony: We don't assume this is a user testing env.
    • Glenn and (Margaret? Tanya?) will be setting up a user testing area sooner rather than later
    • Nick: thinks he sees the issue and will test fix on collector config
  2. AOB Ferry integration - Dennis and Dave found that llrun-voms is not a drop-in replacement for llrun-gums
    • Dennis will document the cases where llrun works/doesn't work for the queries, send it to Tony
    • Tony: is the voms-mapfile correct? Is it from ferry? too wide open?
    • Nick - will update dev to rc4

Feb 27, 2018

Present: Bruno Shreyas Tanya Nick Farrukh Parag Dennis

  • Post-mortem of 2/15/18 v1_2_6 deployment/rollback
    • jobsub_client was tied to a specific version of ifdh
    • lots of experiment workflows were tied to other incompatible versions
    • submission failed for these experiments
    • a jobsub_client v1_2_6_2_rc2 has been deployed that is not tied to ifdh version.
    • it has been tested with a wide range of ifdh clients
    • Tanya has a list of users and ifdh versions that she just pulled from Kibana and is sending out; we need to notify them
    • shreyas will send an email to users urging them to test the new client on jobsub-dev
  • Other Items
    • Nick and Joe will coordinate a new release policy so we don't have to deploy on the third Thursday and roll back on a Friday under pressure
    • Ferry testing and schedule
      • Target deployment date is June 2018
      • There is no transition plan as of now
      • Dennis's plan is for the jobsub server to query the Ferry server once an hour or so and pickle the results for later queries
      • POMS is already using Ferry; they find it quite a bit faster than VOMS/GUMS
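
Sketch of the hourly query-and-pickle plan above, assuming a file-based cache whose age is judged by its modification time. The fetch callable stands in for the real Ferry query, which is not shown in these notes; load_cached and the cache path are hypothetical names.

```python
import os
import pickle
import time

CACHE_TTL_SECONDS = 3600  # "once an hour or so"


def load_cached(path, fetch, ttl=CACHE_TTL_SECONDS, now=None):
    """Return the pickled result at `path`, calling `fetch()` to refresh it when stale."""
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(path) < ttl:
            with open(path, "rb") as handle:
                return pickle.load(handle)
    except (OSError, pickle.UnpicklingError, EOFError):
        pass  # missing or unreadable cache: fall through and refetch
    data = fetch()
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(data, handle)
    os.replace(tmp_path, path)  # atomic swap so readers never see a partial file
    return data
```

Only pickles written by the server itself should be loaded this way, since unpickling untrusted data can run arbitrary code.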

Feb 13, 2018

Present: Dennis Bruno Nick Joe

  • jobsub 1.2.6 deployment to production is still scheduled for 2/15
  • condor,apache and jobsub will all be updated at the same time on production.
  • cdf testing: we need a way to send no-ticket jobs off site and keep ticketed jobs onsite.
    • Dennis will change requirements to --sendtkt jobs
    • Dennis will make sure to check checksum on worker node before untarring tarball
    • Shreyas will ask nick to add uboone and (other) resilient to jobsub.ini
  • Bruno reports testing of 1.2.6 client is OK besides ugly error message
  • Ferry testing: ferry is populating VOMS so any info in VOMS should be derivable from Ferry
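
Sketch for the "check checksum on worker node before untarring tarball" item above. The notes do not say which checksum is used, so SHA-256 is an assumption, as are the function names; the actual job wrapper may do this differently.

```python
import hashlib
import tarfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(tar_path, expected_sha256, dest_dir):
    """Extract the tarball only if its checksum matches the value sent with the job."""
    actual = sha256_of(tar_path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {tar_path}: got {actual}")
    # The archive is treated as trusted once the checksum matches.
    with tarfile.open(tar_path) as archive:
        archive.extractall(dest_dir)
```

Failing before extraction means a truncated or stale tarball produces one clear error instead of a confusing failure later in the job.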

Jan 30, 2018

Present: Shreyas Dennis Parag Tanya Bruno Nick Dmitri

  • Planned deployment for jobsub 1.2.6.rc1 to jobsub-dev is Feb 1, 2018
  • Discussion of issue #16775
    • if we can pin files in scratch space, we have a much lower chance of failures where jobs start after some time but LRU algorithm removes tarballs from scratch space
    • tarball location will be added to job classad to aid any pinning/cleanup scripts we come up with
    • an ifdh rpm would be very useful for cleanup/pinning on server side. Ask Marc if they exist
    • scratch space fills up all the time according to Dmitri
    • look in /nova/app to get a current feel for sizes of tarballs

Post meeting findings:

  • ifdh pin does not work any more due to srm interface to dcache being disabled
  • there is no ifdh rpm
  • server side pin/cleanup script will have to use globus commands
  • client side cannot pin either. Marc suggested reading back first 16 bytes with globus commands to game LRU algorithm

Jan 16, 2018

Present: Tanya, Joe, Bruno, Nick, Farrukh, Dennis

  • Target date to deploy jobsub 1.2.6 to jobsub-dev is Thurs Feb 1
    • Will be deployed to integration following week
    • Deployed to Production during scheduled downtime Feb 15
  • Important features to be included in 1.2.6
    • #16775 generated tarball initially to pnfs (and then elsewhere if needed)
    • #18290 modify CDF wrapperfile behavior (no krb5 ticket sent with job)

Jan 2, 2018

Present: Tanya, Nick, Dennis

  • Nick will ask his management for scheduling info regarding experiment-specific jobsub servers
  • The API for Ferry (GUMS replacement) is not finished. Tanya's best estimate is it will be ready for testing in March of this year.
  • An item that came up after the meeting is that #16775, move tarball submission to pnfs/dcache is suddenly a high priority

Nov 7, 2017

Present: dbox sbhat parag njp tlevshin coimbra fkhan

  • Personnel/effort changes.
    • Dennis moved to 25% effort on jobsub through Jan 2018
    • Shreyas is joining development team, also 25% effort through Jan 2018
      • Shreyas will start going through the developers quick-start guide today.
    • Given the resulting lack of resources, development priorities are (in order):
      • gums to ferry transition (gums de-supported January 2018)
      • critical security bugs, if any
      • experiment specific jobsub_server/schedd deployment
        • There are no requests in SNOW yet for this however

Sep 12, 2017

Present: Dennis Bruno Nick Tony

  • Remaining items to be checked for Sept 19 cutover
    • fifebatch-test haproxy server currently sits in front of new production nodes
      • DNS switching for above server happens Sep 19
    • Jobsub server appears to still have a ups/upd dependency, investigate
    • Re-test offsite submission
    • merge refresh-proxy cron script back into production server

Aug 29, 2017

Present: Dennis Nick Parag Tanya Joe Tony Ken Bruno

  • Focus is on GPGrid refactor. Goes live Sept 19
  • Still to test:
    • Offsite jobs
    • Monitoring
    • Experiments 'typical' jobs
    • Failure modes: server machine down, server up but condor down or httpd down etc

Aug 1, 2017

Present: Dennis Tanya Joe Shreyas Ken Tony Farrukh

  • Jobsub 1.2.4.rc3 has been deployed to fifebatch-dev and is being tested
    • Will be deployed to production during monthly downtime 8/17/17 if tests are OK
  • Dennis experiencing htcondor authentication problems on gpgrid refactor dev nodes htcjsdev01/2
    • action item: send query to Nick and Farrukh
  • add #14195 Add support for condor_qedit-like functionality to 1.2.5 feature list

Mar 28, 2017

Present: Dennis Parag Joe Nick Tony Tanya Ken Mike Shreyas

Jan 17, 2017

Present: Dennis Shreyas Ken Tanya Nick

  • 1.2.3 deployment status
    • installed on preprod, need to get people to test
      • Ken submitted his tests last night. About half of them ran.
      • Ken needs release notes to make announcement
      • Dennis will ask POMS to run tests
  • Last Friday's submission flood disaster
    • the user (from dune) has done it before and been asked not to do this
    • jobs were disconnecting as schedd was pegged, most reconnected
      • idea: compute the submission rate per user, deny new submission if too high
        • have to look if this would slow server too much doing the query
      • an alarm in monitoring sounds like a much better idea. Tanya will open a SNOW ticket for the Grafana/FIFEmon people
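
Sketch of the "compute the submission rate per user, deny new submission if too high" idea above, as a sliding-window counter kept in server memory. Keeping the counts in memory is an assumption made here to avoid the per-submission query that the notes worry could slow the server; the class name and limits are hypothetical.

```python
import time
from collections import defaultdict, deque


class SubmissionThrottle:
    """Sliding-window limit on accepted submissions per user."""

    def __init__(self, max_submissions, window_seconds):
        self.max_submissions = max_submissions
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # user -> timestamps of accepted submissions

    def allow(self, user, now=None):
        """Return True and record the submission if the user is under the limit."""
        now = time.monotonic() if now is None else now
        stamps = self._history[user]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_submissions:
            return False
        stamps.append(now)
        return True
```

The same counter could feed the monitoring alarm instead of denying submissions, which is the option the group preferred.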

Jan 3, 2017

Present: Ken Parag Shreyas Tanya Dennis

  • there are 2 RITs to deploy a new jobsub server pending
    • the deployment process has been hampered by personnel changes and Christmas
    • decision: cancel the 'emergency' patch to 1.2.2 and concentrate on releasing 1.2.3
  • In light of Dennis being only 25% on this project, consider slowing release schedule down to once every other month
  • Separating schedds from jobsub servers is prototyped, except for the problem of running DAGS on a remote schedd
    • Parag: ask Kent Wagner on condor team about ways to run DAGS on remote schedds
      • consider joining the monthly condor call to ask

Dec 6, 2016

Present: Dennis Parag Joe Ken Neha

  • new feature discussion v1.2.3.rc6
    • submission throttling - after discussion it seems the algorithm is not quite right. Dennis will do some tests and re-write if needed.
    • global_superuser list - general consensus is bring it.
  • overview of last night's INC000000791559: cigetcert commands failing when presenting managed proxies
    • cigetcertopts.txt was changed on fifebatch and on hepcjobsub01/2, the trusted retrievers parameter changed
    • To make mu2epro submission work, had to change its trustedreceiver field in the myproxy server.
    • Did not do this for all the other managed proxies and submission to fifebatch for them failed.
    • This side effect was overlooked by everyone. Setting up a realistic test would have been possible, but it would have been quite easy to accidentally set up a test that did not detect the problem prior to moving it to fifebatch.
    • Steve did a quick fix by updating trusted retrievers for all managed proxies in myproxy server
    • Ken is modifying the script that populates the myproxy server with managed proxies

Sep 13, 2016

Present: Ken, Dennis, Dave, Neha, Art

  • DCAFI post transition status
    • Dennis has to fix issue 12405
    • Discussion of stopping keytab distribution script - Operations does this
      • Really can't be done until CDF switches from scp to ifdh cp
      • Idea - make the scp/ifdh switch a configuration item in jobsub.ini

Aug 16, 2016

Present: Ken, Dennis, Tanya, Parag, Joe, Neha, Art, Kevin, Shreyas

  • v1_2_2 testing, release to production schedule
    • v1_2_2 is installed on preprod waiting to be tested.
    • There is no OPOS to ask to test, which is too bad. Shreyas ran Ken's 'representative experiment' test jobs; they were OK
    • Announcement and request to test now sent to fife-jobsub-announce and experiment liaisons; it does not generate a rush to get the new release tested.
    • v1_2_2 will be released to production on Aug 18
  • DCAFI transition status
    • moving to DCAFI on Production today: D0, argoneut, seaquest.
    • moving to DCAFI on Preprod today: coup, minerva
    • Everyone else will be transitioned on Aug 30 (lar1nd, patriot, others?)
  • AOB
    • Single Sign On: Tanya wants to know whether jobsub will support it vs certificate. Discussion ensued
      • Discussion Summary: it is possible. Dennis will start reading docs and planning
    • group_superuser cannot read others' logs by default. I think there is a ticket for this but if not will create one
    • global_superuser: there is a ticket for this, will be in v1_2_3
    • multiple schedds: There is a RIT for this but needs more discussion. Dennis will try to schedule a meeting with Joe, Parag, and Tony to flesh out requirements.

July 19, 2016

Present: Dave, Ken, Mike, Tanya, Mine, Philipe, Dennis

  • mu2e production, submission with managed service proxy and no kerberos ticket
    • Dave has proposed fixing this with a cigetcert change
  • mars subgroups problem
    • Dennis will fix this with a server upgrade
  • fifebatch-dev status
    • Neha reports via email that this will be available soon.

July 5, 2016

Present: Dennis, Anna, Neha, Tanya, Ken, Parag, Mine

  • Issues stemming from 6/30 upgrade of production to 1.2.1
    • Only known issue from upgrade is 1.2.1.5 client using wrong kx509, resulting in client side submission with cilogon DN. Server side is still Digicert for non transitioned groups.
    • Nova had to change some config files to keep Production submission working
  • DCAFI cutover: genie, annie, numix, chips, darkside, cdf, dune will switch tomorrow

June 7, 2016

Present: Parag, Neha, Dennis, Art, Mike, Ken, Tanya

  • Current status, tentative release schedule.
    • fifebatch2 is draining, will be upgraded thurs 6/9
      • Art thinks remaining jobs are jobs type that run until preempted, then restart again, repeat forever
      • He also thinks we need to enforce runtime, we hold after 10 restarts
      • Art will contact users and give them a chance to kill their own jobs
    • Re-visitation of condor tail or enable streaming of logs or whatever
      • Joe thinks there is a 'send me my logs' command in condor, we could add a jobsub client interface
    • condor classad setting change ON_EXIT to ON_EXIT_OR_EVICT
    • Parag suggestion: jobsub_tail invokes condor_tail, returns url to poll
    • Joe will open a ticket to see if Kevin's fifemon app can do this, will need to coordinate with Kevin on this.
  • Dennis released pycurl v7_19_5_3 that implements TLS on SL5 to /grid/fermiapp
    • discussion of whether to make current
      • Dennis will open a ticket to:
      • enable TLS on preprod
      • request minos to test
      • make current after testing
  • New release procedures, Prod/Preprod/Dev configuration etc
    • dev is currently broken, joe will look at it today
    • is dev going to be gco puppet? No according to Neha
    • procedure for configuration changes:
      • jobsub configuration changes have to come from fife not Dennis
      • Dennis asks fife_support, who asks GCO to deploy
      • Dennis will push to a secondary git repo, and file a snow ticket to deploy
        • this procedure is not in place, opportunities for miscommunication are abundant.
    • DCAFI: numix and genie are being changed to myproxy on preprod right now

May 24, 2016

Present: Felipe Alba, Dennis Box, Joe Boyd, Michele Fattoruso, Ken Herner, Anna Mazzacane, Parag Mhashilkar, Neha Sharma

  • There was a long hiatus in meetings due to illnesses, vacations, and reorgs
  • V1.2/Production status
    • fifebatch1 has been successfully transitioned from gcso puppet to gco/htc puppet configuration
    • fifebatch2 will begin draining June 2 in preparation for transition
      • a problem was discovered with jobsub_client v1_2_0_4: submissions that included a tarball would use round-robin DNS to choose a server and ignore the INDOWNTIME classad attribute, making the drain slower. Dennis has prepared a jobsub_client v1_2_0_5 that fixes this and will make a SNOW request that it be made 'current' prior to June 2
  • V1.2.1 Upgrade/deployment Status
  • DCAFI/cilogon/myproxy status
    • A reminder: Support for KCA stops about Oct 1, will continue to work until Jan 1
    • Felipe was able to submit production jobs to onsite/Fermigrid for Nova, they ran successfully.
    • He will now test offsite
    • Ken and Willis are proceeding with CDF testing on preprod
    • Beam shutdown for maintenance has been postponed to ~August 3
      • minos, annie, chips, uboone, and nova are reluctant to change production until this happens
      • we will try to keep to original test schedule on preprod and do an accelerated switch-over after shutdown.

Mar 29, 2016

Present: Dennis Box, Joe Boyd, Ken Herner, Mike Kirby, Anna Mazzacane, Parag Mhashilkar, Neha Sharma
  • Jobsub 1.2.1.rc3 deployment to dev is imminent
    • New features from rc2.
      • Trapping exit conditions and job cleanup works better in wrapper than previously
      • Logging has been improved. Errors in cron scripts detected and logged in particular.
  • Deploying 1.2.1.rc3 to preprod
    • this was where we planned to do cilogon user testing
    • a 'todo' list for operations
      • new rpms for preprod and dev with this release:
        • myproxy.x86_64 6.1.15-2.osg33.el6 @osg
        • myproxy-libs.x86_64 6.1.15-2.osg33.el6 @osg
      • new cilogon DNs still need to go into all the VOMS that jobsub uses. Currently only fermilab VOMS have these.
    • authentication method is per-group, controlled by jobsub.ini.
    • documented at https://cdcvs.fnal.gov/redmine/documents/1009
    • jobsub.ini lives in git project puppetrepo-jobsub, currently in branch dennis_jobsub_v1_2_1
      • location: puppetrepo-jobsub/templates/opt/jobsub/server/conf/jobsub.ini.erb
      • anyone who has root on a jobsub server can check this branch out, edit the jobsub.ini, and deploy it via puppet
    • cigetcertopts.txt lives in same git project
    • Testing cilogon, regression testing new release: There is a coordination problem here
      • need an easy switch back/forth that doesn't require root
      • one option: make jobsub.ini group writable, turn off puppet after deployment
      • After discussion, we think we will ask experiments to do preliminary myproxy/cilogon testing on dev while regression testing happens on preprod
  • AOB: 1.2.1 feature discussions
    • test queue, experiment queue, condor_ssh_to_job: these are more Operations config issues than developer issues
      • action item: Dennis will send ticket numbers to Joe
    • protection against denial of service attacks via jobsub_fetchlog
      • jobsub_fetchlog --partial : make it do the right thing
      • squid caching: seems like a good idea, no one present understands deployment/operations
      • need to make fetchlog urls cacheable first, so we can detect and make happen one at a time
      • another option: should tarballs go to dcache scratch? Need to think about this.

Mar 15, 2016

Present: Joe Boyd, Dennis Box, Vito Di Benedetto, Tanya Levshina, Anna Mazzacane, Parag Mhashilkar, Neha Sharma

  • Jobsub 1.2.1.rc1 is available for testing on fifebatch-dev
    • It turns out that Cilogon DN will not work for anyone that has a middle name in their DN
      • Box, Mengel, and Illingworth all managed to acquire a DN that works so this was not noticed till recently!
    • VOMS and GUMS will need to be updated before general testing can proceed
      • VOMS already has too many DN's to function quickly, so 'bad' ones have to be removed as 'good' ones are added.
    • Ken and Tanya can test with current configuration
  • AOB
    • request made to change name of jobsub error_log to debug_log and slim it down. #11964
    • Condor superusers capability is available but poorly advertised
    • Action Item: Dennis will contact Gabe and other super users, have them hold/release someone else's jobs
      • will verify that these get logged to condor_superusers logfile

Mar 1, 2016

Present: Joe Gabrielle Tanya Parag Dennis Neha

  • jobsub v1.2 status
    • testing underway, Vito's jobs are not starting
      • joe is looking at why this is
      • if we don't get slots we will just have to wait until the tests run
      • general discussion of how can we get a better test cluster
  • v.1.2.1 release
    • need this test environment even more for regression testing
    • target release was 3/3 this will be pushed back
  • AOB
    • a request for default usage_model = DEDICATED (comma separated list)

Feb 16, 2016

Present: Dennis Box, Joe Boyd, Tanya Levshina, Parag Mhashilkar, Neha Sharma

  • Status of 1.2 testing
    • On fifebatch-dev now. Scheduled to be deployed to production 2/25
    • Important to test group_superuser functionality. Ken has the deployment ticket, Joe is editing as we discuss
  • Status of 1.2.1
    • big feature is myproxy integration with client/server
    • being actively developed now, release to fifebatch-dev scheduled 3/10
  • AOB
    • DES wants their own schedd to submit to directly without using jobsub
    • this will affect load balancing queries on jobsub among other things.

Jan 19, 2016

Present: Dennis Box, Joe Boyd, Tanya Levshina, Ken Herner, Mike Kirby, Art Kreymer, Parag Mhashilkar, Neha Sharma

  • Status of 1.1.9.1 testing
    • 1.1.9.1.rc2 has been installed on fifebatch-dev
    • deployment to preprod is waiting for OSG upgrade testing to complete. Neha thinks this will happen today.
    • Art has been assigned the deployment ticket RITM0332868 . He thinks deployment to production may slip by one week to Jan 28 to allow more complete testing, given delays with preprod.
    • Art and Joe would like to slip issue #11437 into 1.1.9.1 if possible. Dennis thinks he can get it into rc3.
  • Status of 1.2.0
    • this is a major feature release, jobsub_super_users for each experiment that can hold, release, remove jobs of other users in their same VO subgroup
    • Do we need a GROUP_SUPER_USER role in VOMS? Not necessarily, but need to decide soon.
    • Experiment liaisons definitely need list of group_super_users to put in jobsub config file or in VOMS
    • Good logging of this feature is necessary. A separate log (admin.log?) is required.
    • Other important requirements:
      • super_users can only affect jobs in their own group
      • cannot submit jobs as other users.
  • Any Other Business
    • Tanya would like the DCAFI modifications using myproxy server deployed into production. The server modifications will be included into release 1.2.0
    • Some discussion of changing default JOB_EXPECTED_LIFETIME from 24 hours to 23 hours 40 minutes. No objections noted.
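
Sketch of the 1.2.0 super_user requirements listed above: a super user may hold, release or remove other users' jobs only inside a group they are listed for, may not submit as another user, and every decision is logged. The GROUP_SUPERUSERS mapping, the logger name and the function name are hypothetical; whether the list lives in the jobsub config file or in VOMS was still open at this meeting.

```python
import logging

admin_log = logging.getLogger("jobsub.admin")  # hypothetical name for the separate log

# Hypothetical data: group -> users with super_user rights in that group.
GROUP_SUPERUSERS = {"nova": {"alice"}, "minerva": {"bob"}}
SUPERUSER_ACTIONS = {"hold", "release", "remove"}


def may_act_on_job(actor, action, job_owner, job_group):
    """Decide whether `actor` may perform `action` on a job owned by `job_owner`."""
    if actor == job_owner:
        return True  # users always manage their own jobs
    allowed = (
        action in SUPERUSER_ACTIONS  # "submit" is deliberately absent
        and actor in GROUP_SUPERUSERS.get(job_group, set())
    )
    admin_log.info(
        "superuser request actor=%s action=%s owner=%s group=%s allowed=%s",
        actor, action, job_owner, job_group, allowed,
    )
    return allowed
```

Logging denied requests as well as granted ones keeps the audit trail complete.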

December 8, 2015

Present: Dennis Box, Joe Boyd, Ken Herner, Anna Mazzacane, Neha Sharma

  • Status of 1.1.9 testing
    • 1.1.9rc1 has been installed on fifebatch-dev
      • Ken has done light testing here.
    • Ken was assigned the ticket to deploy
  • Next Release
    • Roadmap shows it as 1.2, due to major feature release (condor superuser capability for selected users)
    • GCOS wants condor_tail, condor_user_log, condor_ssh_to_job functionality badly.
      • Authentication is configured at fermigrid to not allow this - right now. May change
      • If any of this can be set up for offsite nodes, GCOS wants it - HOWTO link from GCOS wiki sent to Dennis
  • Fermilab VO transition to CILogon HSM
    • Neha notes that Jan 26 is the switch-over date, new certs will be cilogon after that date.
    • Should be transparent to user but be on watch for problems.

November 24, 2015

Present: Neha Sharma, Anna Mazzacane, Dennis Box
  • attendance was sparse as it conflicted with DCAFI review
  • v1.1.9 Release discussion
    • feature set will likely be restricted as deployment to production will be week before Christmas, best to be conservative.
    • issue #10715 add --expected-lifetime=(short,medium,long) will definitely be in release.
    • workflow for sending email requesting tests is still being developed.

November 10, 2015

Present: Parag Mhashilkar, Ken Herner, Neha Sharma, Mike Kirby, Anna Mazzacane, Joe Boyd, Dennis Box

  • v1.1.8 Release
    • rc1 is deployed on dev. Art and Dennis have tested.
    • Notification for other users to test will be automated with a workflow. Neha will work with Dennis to get this done
    • Art reported via email that he plans to deploy to production on Wed Nov 18 if testing goes OK.
  • Other
    • Operations and users want condor_ssh_to_job for their own debugging. Some off-site CE's allow, fermigrid does not.
    • Development would like to put a wrapper around this for a client tool if security model allows.

October 27, 2015

Present: Parag Mhashilkar, Ken Herner, Neha Sharma, Mike Kirby, Anna Mazzacane, Joe Boyd, Dennis Box

  • v1.1.7 Release
    • It's deployed in production. No news so far.
  • v1.1.8 Status
    • Users would like to see jobsub_q --hold to give hold reason
    • OPOS want to send the users notifications if they have several jobs held
    • Allow super user to remove the jobs. Who and how to decide super users?
  • We may need to allow users to specify job length like short/medium/long. From the jobsub perspective we only need to put an attribute in job's JDF

October 13, 2015

Present: Parag Mhashilkar, Ken Herner, Neha Sharma, Dennis Box, Mike Kirby, Art Kreymer, Joe Boyd

  • Announcements
    • Change in meeting rooms for upcoming months
      • 10/27: WH8XO Hornets Nest
      • 11/10: Unable to find any rooms available at WH due to conf going on.
      • 11/24: WH8XO Hornets Nest
      • 12/08: WH8XO Hornets Nest
      • 12/22: WH8XO Hornets Nest
      • 01/05: WH8XO Hornets Nest
      • 01/19: WH2NW Black Hole
  • Transition of project leadership to Dennis Box
    • Parag and Dennis to workout a transition plan and expect it to be straightforward
  • v1.1.7 Release
    • RC is on fifebatch-dev
    • There were some issues with deploying RC on pre-prod related to GPGrid upgrade
  • Minerva
    • Can jobsub track what the job is doing to identify efficiency issues?
    • Art: Group managers can now log in to worker nodes and monitor what's happening
    • Wrapper can log steps initiated by it along with the time stamp. Should be simple task.
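
Sketch of the "wrapper can log steps along with the time stamp" idea above, written in Python for illustration; the actual job wrapper may be a shell script, and run_step is a hypothetical helper. Each step gets a START and END line with a UTC timestamp, the return code and the elapsed time.

```python
import subprocess
import sys
from datetime import datetime, timezone


def run_step(name, command, log=sys.stderr):
    """Run one wrapper step, logging UTC timestamps at start and end."""
    start = datetime.now(timezone.utc)
    print(f"{start.isoformat()} START {name}", file=log)
    result = subprocess.run(command)
    end = datetime.now(timezone.utc)
    elapsed = (end - start).total_seconds()
    print(
        f"{end.isoformat()} END {name} rc={result.returncode} elapsed={elapsed:.1f}s",
        file=log,
    )
    return result.returncode
```

Comparing elapsed times per step across jobs is one way to spot where an inefficient job spends its wall time.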

September 29, 2015

Present: Parag Mhashilkar, Ken Herner, Anna Mazzacane, Dennis Box

  • Specifying job type (long/medium/short)
    • We have two options: User can either use a bucket size (long/short/medium) or user can specify --max-job-lifetime=<number of hours>
    • --max-job-lifetime is a more preferred option.
  • 1.1.6 Upgrade
    • No issues so far. Was smooth.
  • OPOS
    • Can we have a Jobsub health monitor web page
  • Ken: Helpful page that gives site specific info for jobsub users
  • Partial sandbox download:
    • Marc Mengel wanted to add a feature to POMS where users can click the stdout/err link to see the job's output. We need an API and feature to download partial sandboxes.
    • Dennis: Moved it to 1.1.7
  • 1.1.7
    • We have a filter for next version

September 15, 2015

Present: Parag Mhashilkar, Joe Boyd, Neha Sharma, Tanya Levshina, Mike Kirby, Ken Herner, Anna Mazzacane, Dennis Box

  • Jobsub 1.1.5
    • No issues so far
  • Jobsub 1.1.6 rc4
    • It's on preprod and announcement will be sent out later this week
    • Dennis will test it using HTCondor 8.3.x. Dennis ran the tests last night. So this is addressed
    • Ken and OPOS will submit his tests today
    • Jobsub Logging: https://cdcvs.fnal.gov/redmine/issues/9711 Needs info from Operations/FIFE Support. Art is going to look at the upgrades.
  • OPOS
    • OPOS group is pushing experiments for test scripts for new jobsub.
  • KCA Shutdown
    • Need to start planning for KCA shutdown in September 2016.
    • Direct changes to jobsub may be minimal but there are other tools and infrastructure related stuff that needs to be addressed and thoroughly tested
  • FIFEMon
    • Everybody should look at new fifemon and see what changes are required
  • Minerva request for disconnected DAGS

September 01, 2015

Present: Dennis Box, Parag Mhashilkar, Joe Boyd, Neha Sharma, Tanya Levshina, Mike Kirby, Ken Herner

  • Jobsub 1.1.5
    • No issues so far
  • Jobsub 1.1.6
    • Dennis is trying to make the Thursday release date.
    • #9738 and #9711 are taking much longer than expected. #9711 can be pushed to next version
  • Minerva & --use_gftp flag
    • Go through Gabriel's latest emails with Dennis, Ken (and Marc?) and get back to him.
  • mu2e & jobsub_q issue
    • Parag to create a ticket and get back to them
  • Future meetings
    • Starting September we are moving to bi-weekly meetings
    • Next meeting will be on September 15, 2015

August 25, 2015

Present: Dennis Box, Parag Mhashilkar, Joe Boyd, Neha Sharma

  • Jobsub 1.1.5
    • No issues so far
  • Nova Collaboration
    • Ken and Neha will open jobsub tickets as required and take another look at the priorities for development issues based on feedback from Nova.

August 11, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Mike Kirby, Neha Sharma, Ken Herner, Anna Mazzacane

  • Jobsub 1.1.4
    • Deployed in production
    • No issues so far
  • Jobsub 1.1.5
    • Deployed in preprod
    • Tests by Dennis were successful
    • Ken is running his tests now. So far everything is looking ok except for a few jobs in the queue for 3 OSG sites
    • OPOS will run more tests after Ken's tests are successful
    • Wiki needs updating
  • OPOS Submission
    • There is currently no beam so activity is quite low
  • FIFE Support
    • Parag: All the changes in jobsub to support hierarchical quotas should be in place
    • HTCondor 8.2.9 update will be done by Sep 9 at the earliest.
    • Tanya: Do we want to upgrade to 8.2.9 or move to 8.3.x?
    • Some users from Manchester using Ganga to submit to local batch system
  • CDF
    • All the features from CDF are now in jobsub. Last issue #8278 was addressed in v1.1.4

August 04, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Mike Kirby, Neha Sharma, Ken Herner, Anna Mazzacane

  • Jobsub 1.1.4
    • Looks good and will be on prod servers this Thursday
  • CILogon
    • Parag talked to Dave on Monday and there is yet another planned proposal. Dave will get back to us when they have finalized the details.
  • Brian Bockelman to be in lab on August 17 & 18
    • We need a list of topics/questions to discuss with Brian
  • Jobsub 1.1.5
    • Plan is to release rc1 on Aug 6
    • Parag to review the issues in feedback status assigned to him

July 28, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Willis Sakumoto, Joe Boyd, Mike Kirby, Neha Sharma, Ken Herner

  • Jobsub 1.1.4
    • Ken ran some jobs and they ran with no issues
  • Jobsub 1.1.5
    • Went through the list and shrank it a bit.
    • Critical issues have been addressed
  • Discussion: Should we submit individual jobs without checking for how many files are cached? #9277
    • We need to identify if the problem is tape->sam cache or sam cache->worker node
    • When Parag last talked to Robert, there was room to increase the limits per experiment and per project
    • We can put checks in jobsub to not start the job unless a certain % of files are cached
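
A minimal sketch of the cached-fraction check proposed above. The function name, its parameters, and the 0.8 default threshold are all hypothetical; the notes do not say how jobsub would obtain the cached-file count or what percentage would be required.

```python
def should_start_job(cached_files, total_files, min_cached_fraction=0.8):
    """Return True when enough of the job's input files are already cached.

    cached_files and total_files are assumed to come from some external
    query (not shown); min_cached_fraction is an illustrative threshold.
    """
    if total_files <= 0:
        # No input files to stage, so there is nothing to wait for.
        return True
    return cached_files / total_files >= min_cached_fraction
```

A per-experiment or per-project threshold could be passed in as `min_cached_fraction` if the limits end up differing between groups.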

July 07, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Neha Sharma, Ken Herner, Joe Boyd, Willis Sakumoto, Anna Mazzacane

  • Jobsub 1.1.4
    • Dennis installed latest rc on dev. He will open a ticket.
    • Neha will install it on pre prod later today after she gets the ticket
    • We need users to test this release.
  • Nova deploys it in nova specific area.
  • Dave Dykstra wrote a script so it is deployed in Fermilab common CVMFS area.
  • CDF will look into using jobsub from CVMFS
  • SL5: How long should we support it? Some experiments have SL5 interactive nodes and may still be using them.
  • SSL handshake time out

June 30, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Neha Sharma, Ken Herner, Mike Kirby, Willis Sakumoto, Anna Mazzacane

  • Jobsub 1.1.4
    • Dennis: Will merge tickets currently in feedback and only keep tickets for feedback if he needs Parag to look at it. Will push some of the tickets to 1.1.5 based on how far we go.
  • FIFE-Support
    • Neha will be testing with CILogon certs this week
    • Dzero pro user workaround: use Joel Snow's ticket as the pro ticket

June 22, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Neha Sharma, Ken Herner, Mike Kirby, Willis Sakumoto

  • Jobsub 1.1.3
    • So far no issues reported. The new version seems to handle most of the threading issues, until we find some new ones
  • CDF
    • There are a few minor CDF issues to be addressed. Dennis to look into them for upcoming releases.
  • Jobsub Release schedule
    • Plan to move to monthly release schedule that works with the Thursday downtime
    • Release candidates every first Thursday of the month
    • After internal testing and testing by fife-support, deploy on pre prod on second Thursday of the month
    • Final tagging and release on Wednesday to be deployed on production on third Thursday of the month during the downtime
  • Discussion on prestaging jobs to reduce inefficiencies in jobs due to IO wait tape->cache
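
The monthly release schedule described above can be sketched as a small date calculation. The function and key names are hypothetical; the code only encodes the rules stated in the notes (release candidate on the first Thursday, pre prod on the second, final tag on the Wednesday before the third Thursday, production on the third Thursday).

```python
import datetime


def nth_thursday(year, month, n):
    """Return the date of the n-th Thursday (n = 1, 2, 3, ...) of a month."""
    first = datetime.date(year, month, 1)
    # date.weekday(): Monday is 0, so Thursday is 3.
    offset = (3 - first.weekday()) % 7
    return first + datetime.timedelta(days=offset + 7 * (n - 1))


def release_schedule(year, month):
    """Map each release milestone to its date for the given month."""
    production = nth_thursday(year, month, 3)
    return {
        "release_candidate": nth_thursday(year, month, 1),
        "preprod_deploy": nth_thursday(year, month, 2),
        # Final tagging happens the Wednesday before the production downtime.
        "final_tag": production - datetime.timedelta(days=1),
        "production_deploy": production,
    }
```

For example, `release_schedule(2015, 8)` places the release candidate on Aug 6, 2015 and the production deployment on Aug 20, 2015.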

June 16, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Willis Sakumoto, Neha Sharma, Ken Herner, Joe Boyd

  • CDF
    • Usually takes 8 hrs for jobs to start running on opportunistic
    • Glexec issues. Joe & Ken are working on it

June 09, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Willis Sakumoto, Neha Sharma, Anna Mazzacane

  • Jobsub 1.1.3
    • Dennis cuts a v1.1.3 rc1 today
    • Neha to deploy it to ITB today/tomorrow
  • CDF
    • Neha to check with Joe about issues with glexec

May 26, 2015

Present: Dennis Box, Mike Kirby, Tanya Levshina, Parag Mhashilkar, Willis Sakumoto, Vito Di Benedetto, Neha Sharma, Joe Boyd

  • FIFE Workshop
    • Best Practices talk (15 Min): What & How Jobsub can accomplish (More of question & answer)
      • How to get file to worker node
      • How to submit jobs to a specific site
  • v1.1.2 Release
    • In production now
    • Dennis: Andrei has a SNOW ticket that jobsub was trying to use someone else's proxy. It may be a threading issue we missed
    • Dennis: chmod issue cannot be reproduced reliably and happens in chunks
  • HTCondor complaining about the permission errors
    • FIFE Support to enable detailed logging and try to reproduce the problem
  • OPS Group
    • Nova started using jobsub_client for production. Until now they were using jobsub_tools
    • OPS Group helping with issues with jobsub for production in case NOVA is having some issues
  • Bye Bye gpsn01?
    • NOVA needs gpsn01 only for some specific job types
    • NOVA uses gpsn01 for crontabs for proxy renewal & generation & submission
    • Activity on gpsn01 not being reported to Gratia for a while (?). Tanya looking at the issue
  • CDF
    • Need to set up Lynn Garren for new jobsub announcements.
  • FIFE Support
    • Minerva has a ticket open that none of their jobs were running. Art thinks that dagman was held. INC000000545929

May 19, 2015

Present: Dennis Box, Tanya Levshina, Parag Mhashilkar, Willis Sakumoto, Vito Di Benedetto

  • v1.1.2 Release
    • RC3 was deployed in pre-prod today. If there are no major show stoppers, it will be retagged as final release and deployed on production servers on May 20.
  • HTCondor complaining about the permission errors
    • Known issue and suspected to be load issues on the machine. FIFE support is looking into this.
  • OPS Group
    • Client testing went ok.
    • Working on NOVA scripts for prod. Nova scripts are able to submit jobs using prod scripts.

May 12, 2015

Present: Dennis Box, Joe Boyd, Ken Herner, Mike Kirby, Tanya Levshina, Anna Mazzacane, Neha Sharma, Parag Mhashilkar, Willis Sakumoto

  • V1.1.1 Release
    • INC000000539199: #8681. Critical and should be in the next release (v1.1.2). It also includes a fix for fetchlog while the job is running.
  • There are two SNOW tickets complaining condor does not have read/write access to out/err files. Jobs do run to completion
    • Add retries in the next iteration. We need to make very sure that we don't do double submission in case of false errors
    • Also need to consider the use case of multiple schedds
  • CDF is seeing some jobs that don't get renewed tickets. Dennis to check whether held jobs are considered while refreshing the proxies.
  • Minos getting not writable error as well.
  • Zach Miller will be here on June 2nd during the FIFE workshop.
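
One way to add retries without risking double submission, as discussed above, is to check whether the failed attempt actually queued the job before trying again. This is only a sketch: `submit` and `already_queued` are hypothetical callables supplied by the caller, not jobsub or HTCondor APIs, and with multiple schedds `already_queued` is assumed to look at all of them.

```python
import time


def submit_with_retry(submit, already_queued, max_attempts=3, delay_seconds=0.0):
    """Call submit() until it succeeds, without submitting the same job twice.

    submit() is assumed to return a job id or raise on error.
    already_queued() is assumed to return the job id if an earlier attempt
    actually reached a schedd, or None otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception:
            # The error may be false: the job could have been queued anyway.
            job_id = already_queued()
            if job_id is not None:
                return job_id
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

The important design choice is that the queue check happens before every retry, so a false error never leads to a second copy of the job.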

May 5, 2015

Present: Dennis Box, Joe Boyd, Ken Herner, Mike Kirby, Tanya Levshina, Anna Mazzacane, Neha Sharma

  • V1.1.1 Release
    • one new issue, INC000000539199 which is a threading issue. Dennis has a fix for this checked into the repository for review.
    • Joe noted that the jobsub config file has parameters for processes and threads set by default to 2 processes and 25 threads. When he changed these during v1.1 we stopped getting new SNOW tickets related to threading.
    • There is a SNOW ticket assigned to Ken requesting the ability to retrieve other users' log tarballs for debugging purposes. Ken considers this low priority.
    • CDF (actually Willis) has run 40000 hours at MIT recently using v1.1.1. There are pending requests in v1.1.2 that are desirable for CDF but jobs are getting run.
  • V1.1.2 release
    • Everyone wants to know the timing of this. Unfortunately we are dependent on hierarchical quotas being implemented by grid admins that we have no control over. A meeting today is scheduled that may resolve or at least shed light on this issue.
    • Traceability for jobs from group accounts. It is unclear whether the proposed solutions achieve this. Q: Would the output of 'klist' in the classad be sufficient? A: Yes, for jobs submitted from the command line. For cron jobs, maybe not. They would be identifiable as cron-generated jobs; is it sufficient to say 'this person was on shift at that time, the jobs belong to them'?

April 28, 2015

Present: Parag Mhashilkar, Ken Herner, Neha Sharma, Tony Tiradani, Tanya Levshina, Anna Mazzacane, Mike Kirby

  • v1.1.1 Release
    • So far no issues in SNOW
  • Andrei reported some issues using jobsub through setup_mu2e_art script. Using jobsub by itself works fine.
  • Make jobsub_history not require group
  • Add a new option to jobsub_q and jobsub_history to show client dn & client host
  • jobsub & dags & procid with incremental numbers for consistency

April 21, 2015

  • It turns out there are two places we keep meeting notes in this wiki

April_21_2015

April 07, 2015

Present: Parag Mhashilkar, Ken Herner, Paola Buitrago, Neha Sharma, Jeny Teheran, Willis Sakumoto, Dennis Box, Tony Tiradani, Tanya Levshina, Mike Kirby

  • v1.1.1 Release
    • rc1 will be deployed on preprod later today
    • rc1 will be on pre prod for a week
    • Willis: Most things work. Could not test delayed starts
    • Oct 20 for release date in change request
  • Hierarchical Quotas
    • No progress from DSCO. Effort limitation.
    • Tony & Lisa to talk about this and define some timeline.

March 31, 2015

Present: Parag Mhashilkar, Ken Herner, Paola Buitrago, Anna Mazzacane, Neha Sharma, Jeny Teheran, Willis Sakumoto, Dennis Box, Joe Boyd, Tony Tiradani, Tanya Levshina, Mike Kirby, Kevin Retzke

  • Fifebatch disk space
    • Disk is filling up /var/log for jobsub/apache/HTCondor
    • logging is too verbose in jobsub.
    • There are check mk sensors but who monitors them?
  • KNOWN ISSUE: Fetchlog of
  • v1.1.1 Release
    • RC1: Dennis is testing it.
    • Deploy on preprod on April 1
    • Shooting for April 16 as final deployment schedule.
  • Jobsub offsite jobs
    • Jeny's jobs are staying idle for too long.
  • Hierarchical Quotas
    • Work in progress. High on Tony's priority list.

March 17, 2015

Present: Parag Mhashilkar, Ken Herner, Paola Buitrago, Anna Mazzacane, Neha Sharma, Jeny Teheran, Willis Sakumoto, Dennis Box, Joe Boyd, Tony Tiradani

  • v1.1.1 Issues
    • FIFE Support will send the proposed solution to the experiment and ask for more info if needed.
    • CDF krb5 issues to be addressed in next release
  • Need for service cert for production jobs
    • Need to document in jobsub wiki and send the link to the group.
  • Need to restart the schedd to update the fd limit -- issues reported by novapro.
  • Ken: Got MIT site working for FIFE experiments.

March 10, 2015

Present: Parag Mhashilkar, Ken Herner, Paola Buitrago, Anna Mazzacane, Neha Sharma, Jeny Teheran, Willis Sakumoto, Dennis Box, Joe Boyd, Tony Tiradani

  • v1.1 Issues
    • We need to clarify with the experiment who should get production role. Security group is ok with reduced list in group account. Experiments need to shrink the list.
  • CDF
    • tickets related to jobsub are now assigned to CDF Support which does not get anymore.
    • Joe will fix the service desk so they get assigned to fife support.
    • KRB tickets transferred to jobs are not being refreshed?
  • v1.1.1 Issues

March 03, 2015

Present: Parag Mhashilkar, Ken Herner, Paola Buitrago, Anna Mazzacane, Neha Sharma, Jeny Teheran, Willis Sakumoto, Dennis Box

  • v1.1 Issues
    • Submission using voms/grid proxy
    • Not all the users were aware of this upgrade -- Not really jobsub issue
    • Users were not aware of this upgrade and did not test it on pre prod
    • Discussion on group accounts and impact on hierarchical quota.

February 10, 2015

Present: Parag Mhashilkar, Tony Tiradani, Mike Kirby, Willis Sakumoto, Dennis Box, Ken Herner, Jeny Teheran, Paola Buitrago, Joe Boyd, Anna Mazzacane

  • Tony
    • Downtime for all the CMS resources for kernel updates. Scheduled for entire day.
    • Hierarchical quotas: Shooting for March 1st
    • GUMS server: Neha is communicating with Brian B because of potential issues with the new version of GUMS. As per Tony this GUMS is a pre req for gpgrid migration/hierarchical quotas.
  • v1.1
    • Changes are in place.
    • Dennis found some merge issues. 1) CDF submission 2) Need to exercise group mappings 3) Dropbox works. 4) Need to go through a rc build and deployment to figure out if we miss something.
    • Need to figure out if we need multiple service certs or one cert for group job submission
    • If we have to we can drain but we need to announce it upfront and not wait for the Thursday downtime.
  • CDF
    • Infrastructure with dcache. Yujin is working on addressing the issues. Once done Willis will work moving users over to the jobsub.
  • TRANSITION: gpsn01 -> fifebatch
    • Nova: There are many small users.
    • In bashrc of users people set jobsub_tools.
    • Email users nightly to stop using gpsn01
    • Ken will come up with an email for Liaisons to inform users about the migration and to stop using gpsn01
  • Kerberos accreditation

February 03, 2015

Present: Parag Mhashilkar, Neha Sharma, Tony Tiradani, Mike Kirby, Willis Sakumoto, Dennis Box, Ken Herner, Jeny Teheran, Paola Buitrago

  • v1.1
    • Parag: Spent almost 2 days trying to understand SSL v3 issues last week. This delays the release.
    • Parag to give list of things to do to Neha so she can puppetize it.
  • TRANSITION: gpsn01 -> fifebatch
    • mu2e: Analysis changed from gpsn01 to fifebatch. There may be some tail.
    • No Analysis activity on gpsn01 for last week.
    • Start count down for gm2. Two weeks from today.
  • CDF
    • Someone is having problems with krb ticket passing.
    • Willis trying to get more people to try it.
  • Operations
    • Nothing specific.
  • Priorities & Quotas
    • Changes to jobsub should be minimal. Plan is to have it available in jobsub v1.1.1

January 27, 2015

Present: Parag Mhashilkar, Tanya Levshina, Neha Sharma, Joe Boyd, Tony Tiradani, Mike Kirby, Willis Sakumoto, Dennis Box, Ken Herner, Anna Mazzacane

  • TRANSITION: gpsn01 -> fifebatch
    • mu2e is making progress.
    • Uboone
    • Start countdown to: gm2
    • Next Nova & LBNF analysis.
    • How to tackle the LBNE situation. It's unstructured and we are not in the loop on what direction they are taking. Check with Liz on working group meetings or other meetings that are meaningful to computing/workflow submission.
    • Send automated mail to Liaisons about the analysis still using gpsn01 and need to switch to fifebatch.
    • Migration: Tentative Feb end
  • CDF
    • Minor issues reported by Willis. Will be tackled in 1.1 if they are really quick to address.
    • End of Feb is fine.
    • Handful of users using CDF Grid
  • Sites & Supported VOs
    • Neha made the changes to the Fermilab entries.
    • Need to make a minor change to the jobsub server to get it working correctly.
  • Priorities & Quotas
    • Expected by Feb end on production in phases.
    • Jobsub and SCO need to work on making it simple for users to specify the accounting group
  • v1.1
    • Parag hopes to get the critical items, except dag, done by today.
    • Dennis: Should be good by end of this week
    • Will need more extensive manual testing.
  • Mike: Got request from uboone Herb. They need a condor_q -better-analyze
  • We need GUMS for this to work with group accounts. Marina and Nick are working on setting up the GUMS servers.
  • Nova: Dominique complaining jobs died for no reason.
  • Some users get better chance of running when they specify only on opportunistic.

January 13, 2015

Present: Parag Mhashilkar, Tanya Levshina, Ken Herner, Neha Sharma, Willis Sakumoto, Dennis Box, Joe Boyd, Tony Tiradani, Mike Kirby

  • Sites & Supported VOs
    • Sites should be configured correctly to list supported VOs correctly.
    • Parag needs to update the ticket
  • CDF
    • SL6 can be transitioned away from cdfgrid right away
    • SL5, we need another month or so to move other users.
  • v1.1
    • Parag: There are several changes all across the server functionality. Release will be delayed. New estimate is by end of the January.
  • TRANSITION: gpsn01 -> fifebatch
    • Tingjun wanted to use gpsn01 for some quick computing.
    • Ken/Mike sent email for two week count down for migrating some experiments from gpsn01 to fifebatch
    • Minerva is blocked on hierarchical quotas because of the way they have organized their work
    • Someone from coupp was trying gpsn01
    • To avoid future submissions,

January 06, 2015

Present: Parag Mhashilkar, Tanya Levshina, Ken Herner, Neha Sharma, Willis Sakumoto, Dennis Box, Joe Boyd

  • v1.0.4
    • Deployed on production.
    • No tickets so far.
    • Somehow the upgrade request from the dev team did not include a request to declare jobsub_client 1.0.4 current.
  • Sites & Supported VOs
    • Sites should be configured correctly to list supported VOs correctly.
  • CDF
    • Few users still using SL5
    • There will be a push to move users to jobsub
    • CDF allocation: 800 on gpgrid/fifebatch
  • v1.1
    • Parag: There are several changes all across the server functionality. Release will be delayed. We will have a better estimate by end of the week.
  • TRANSITION: gpsn01 -> fifebatch
    • Announcement was made earlier and the plan is to delay a bit until holidays are over.
    • Users will be able to submit the jobs but they won't run on gpsn01
    • Need to get Eric Flumerfelt from Nova to move to fifebatch
    • gm2 moved back to gpsn01? Why?
    • Target lbne as well

December 09, 2014

Present: Parag Mhashilkar, Tanya Levshina, Ken Herner, Neha Sharma, Willis, Dennis Box, Joe Boyd

  • v1.0.4
    • All tickets have been resolved
    • Dennis: had some issues with yum and puppet. Resolved by explicitly cleaning the yum repo
    • Ken will try running test jobs on site and Dennis will ask Willis to try it out for CDF
  • CDF
    • Joe will be moving CDF Kerberos principals to fifebatch -- Joe did that.
  • Jobsub Workshop
    • So far only two users registered. Need to advertise to more users, in CS liaison meeting. Ken to announce it on jobsub support.
  • Microboone:
    • Webpage or gui for job submission.
    • python gui -- same functionality -- CDF started it and then abandoned it.
    • Maybe production team
    • Long term goal -- low priority -- strategic target
  • Ken & Mike will identify environment variables for mu2e
    • Dennis will look at it and add some explanation and document in Jobsub wiki.
  • Mike to check with microboone if it is ok to cut their access to gpsn01

December 02, 2014

Present: Parag Mhashilkar, Tanya Levshina, Ken Herner, Neha Sharma, Joe Boyd, Willis, Dennis Box, Gerard Bernabeu Altayo

  • v1.0.4
    • Moved completed tasks listed for v1.1 to v1.0.4
    • Moved CDF requirements to v1.0.4
  • CDF
    • Joe will be moving CDF Kerberos principals to fifebatch
  • gm2
    • Started using fifebatch
    • Ken: Adam Lyon will update the wiki instructions so users will be using fifebatch
  • Jobsub Workshop
    • Monday, Tuesday: December 15/16 Currently planned in CDF Big Room.

November 25, 2014

Present: Parag Mhashilkar, Tanya Levshina, Ken Herner, Neha Sharma, Joe Boyd, Willis, Dennis Box

  • New Experiments
    • gm2
      • Adam Lyon will update the instructions to point to jobsub_client instead of gpsn01
  • Transition
    • Notice to move to gpsn01: Darkside along with the list of experiments from last week.
  • Deployment
    • Mail summary: Works for DAG
    • Email issues
  • v1.1 Status
    • Release may be delayed
  • CDF
    • jobsub_rm: Specifying constraints will be useful

November 18, 2014

Present: Parag Mhashilkar, Tanya Levshina, Ken Herner, Mike Kirby, Joe Boyd, Dennis Box, Neha Sharma

  • Deployment
    • v1.0.3
      • Ken: When you put minerva specific options, like -t option it does cd into a script but does not cd back to starting dir. Dennis to look at it.
      • Ken: From nova. site=SMU but resource=dedicated. This is inconsistent since SMU is OFFSITE. We need to get some checks in place to avoid this.
  • Hierarchical Quotas & Accounting Group
    • Gerard has a test cluster.
    • Test it with CMS infrastructure.
    • It did not work well for CMS and they moved to priorities only.
    • BNL has worked with quotas for the past two weeks so it should work now.
  • TRANSITION: gpsn01 -> fifebatch
    • lsst
      • Ran workflows for SC14 using fifebatch
    • nova
      • Dominick Rocco used around 20k hours using fifebatch. If accounting groups were set then they can move.
      • Satish Desai is trying to use fifebatch
    • microboone
      • Using only fifebatch since Nov 02
    • darkside
      • Using only fifebatch since Nov 05
    • seaquest
      • Working on switching to fifebatch
    • mu2e
      • Still gpsn01
    • coupp
      • In the process since Nov 11
    • dzero, fermilab, lariat, lsst, lar1nd
      • Switched completely
    • minerva
      • Got almost 100k hours. Moving since Nov 11.
    • lbne
      • Usage on both

November 11, 2014

Present: Parag Mhashilkar, Tanya Levshina, Ken Herner, Willis Sakumoto, Mike Kirby, Joe Boyd, Dennis Box

  • Deployment
    • v1.0.3
      • Already in pre prod. Joe found non-showstopper issues and opened redmine tickets
      • Joe will put this on prod today.
  • v1.1
    • We should be able to get close to the release or an rc by Thanksgiving.
  • TRANSITION: gpsn01 -> fifebatch
    • microboone
      • about 25K peak hours used
    • darkside
      • switching to fifebatch
    • lariat
      • about 9K peak hours used
    • minerva
      • about 19K peak hours used yesterday. Users: Jerome (most of the usage), Philip
    • lbne
      • Around 12K peak hours
      • Tom Junk gave a talk and that should get more lbne users moving to jobsub_client
    • mu2e & mars
      • Two step process. We have to go through Rob K. And getting hold of the users is difficult.
    • nova
      • Hierarchical quotas with surplus
    • cdf
      • Jobs were stuck on Sunday but started running on Monday. This affected dev and pre-prod(?). Neha did some maintenance (?). It's working again.
      • 1.0.3 works for cdf and he can get jobs running
    • minerva
      • Requested adding retry logic to jobsub tools. May have been gpsn overload
    • Jobsub Forums
      • Would be useful to have one. Tanya asked Kathrine to check if it's possible in SharePoint.

November 04, 2014

Present: Parag Mhashilkar, Neha Sharma, Tanya Levshina, Ken Herner, Willis Sakumoto, Mike Kirby, Joe Boyd, Dennis Box, Gerard Bernabeu Altayo

  • 1.0.3: RC to be released by Wed Oct 29 morning
    • It's on pre prod.
    • Joe plans to test the changes.
    • Joe will be testing --donot-drain option to prepare for the downtime.
    • Plan: If testing is done in a day or two, it can be put in production early next week
  • TRANSITION: gpsn01 -> fifebatch
    • gpsn01 had a melt down yesterday.
    • Joe: Not in favor of imposing the limits.
    • fermilab
      • Not all sites support fermilab
      • frontend policy changed so all the FIFE experiments request sites that support fermilab
    • Minos
      • Art is testing new jobsub on preprod for his recent feature requests.
    • Nova
      • They may be running cloud tests
    • MarsMu2e
      • Still using gpsn01.
      • SL6 Gridftp servers only accept 1024 bit proxies. So they are having issues with jobs failing to transfer files as HTCondor refreshes the proxies.
    • Uboone
      • Almost all the usage is fifebatch. Really tiny fraction using gpsn01 from a user, possibly old jobs
    • Minerva
      • Not transitioned
    • Mu2e
      • Not transitioned
    • Darkside
      • There are some critical workflows that haven't been moved to fifebatch. Ken to talk to users about moving to fifebatch
    • SeaQuest
      • Presentation is on Oct 28 7pm CDT
    • Coupp
      • One of the users will be switching to fifebatch
    • CDF
      • Gerard will be working on moving CDF krb principals to fifebatch. It is tabled because of other high priority issues.

October 28, 2014

Present: Parag Mhashilkar, Neha Sharma, Tanya Levshina, Ken Herner, Willis Sakumoto, Mike Kirby, Joe Boyd, Dennis Box

  • v1.0.2: Deployment Status -- DO NOT INSTALL ON PROD
    • Deployed in Pre-Production
    • There are some issues that need v1.0.3
    • #7226: jobsub_q --jobid is broken. Other options work
    • #7225: Most of the calls are already implemented for jobsub_history. Need to adapt them to jobsub_fetchlog --list-sandboxes
  • 1.0.3: RC to be released by Wed Oct 29 morning
    • We should make explicit that server HTTP return code of 200 is ok.
  • TRANSITION: gpsn01 -> fifebatch
    • fermilab
      • Not all sites support fermilab
      • frontend policy changed so all the FIFE experiments request sites that support fermilab
    • Minos
      • We should let Art know to use pre prod. Preprod upgrade announcement was sent to the mailing list.
    • Nova
      • They may be running cloud tests
    • Uboone
      • fifebatch usage is going up
    • Minerva
      • Not transitioned
    • Mu2e
      • Not transitioned
    • lar1nd
      • Started using fifebatch
    • lariat
      • Started using fifebatch
    • LBNF
      • Plots do not show LBNF
      • Ken met with LBNF. Went quite well and use standard process so far. Transitioning will be straightforward. Will be linking to uboone instructions. Have 7-8 users that will be contacted separately
      • Need to run workflows with 4G
      • Need to be less Fermilab-centric and be able to submit jobs from home institutions.
      • Need to support non KCA certificates.
    • DZero
      • All their requirements have been satisfied. Everything is working.
    • CDF
      • Started testing. Issues with kinit -R. Dennis will look into it. This is in the jobsub tools. Dennis may have identified the problem related to KRBCC_NAME variable.
      • Some other issues related to SAM
      • Still need to implement. No need for file:// for the tarball since exe is within it.
      • Dennis: We can check if the exe is in tarball.
      • email summary option is in v1.0.2
    • SeaQuest
      • Presentation is on Oct 28 7pm CDT

October 21, 2014

Present: Parag Mhashilkar, Neha Sharma, Tanya Levshina, Ken Herner, Gerard Bernabeu Altayo, Willis Sakumoto

  • v1.0.1: Deployment Status
    • Deployed in Production
  • TRANSITION: gpsn01 -> fifebatch
    • Minos:
      • Administrative accounts/controls
      • Set priorities within the production role.
      • Art is willing to run on pre prod and need to move to fifebatch.
    • Nova:
      • Can they move Analysis. They have ~70K hours doing Analysis
    • Uboone
      • Herb has changed the uboone code to use fifebatch for production
      • Analysis is still using gpsn01.
      • Ken to chase down the Analysis users (6 users)
    • Minerva
      • We need to work with Gabe
      • We need to make sure all the tickets from Heidi are addressed before we can ask them
    • Mu2e
      • Need to update all their scripts. Put it off until the DOE review this week.
      • We need to go after the analyzers and get them to switch to fifebatch.
    • LBNE
      • Is one of the experiments we would like to incorporate within FIFE
      • We need to inform Tom Junk and Maxim and ask them to move to fifebatch
    • CDF
      • email option
      • No need for file:// for the tarball since exe is within it.

September 30, 2014

Present: Parag Mhashilkar, Neha Sharma, Joe Boyd, Tanya Levshina, Ken Herner, Gerard Bernabeu Altayo, Willis Sakumoto

  • v1.0.1: Deployment Status
    • Scheduled for Thursday 16, 2014
      • Jobsub 1.0.1
      • Jobsub tools 1_3_4
      • Glidein HTCondor 8.0.7
      • Glideinwms 3.2.6
  • Usage Status
    • Useful to have side by side comparison of fifebatch v/s gpsn01. Tanya will send the link.
  • Ken
    • Mars - mu2e having some issues
    • Shawn - Some issues with his script
  • CDF
    • fifebatch Kerberos principals for CDF -- Gerard is working on them
    • Need other features before next round of testing.
    • email summaries. cdf submit with tarball so we don't need file://

September 30, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Dennis Box, Neha Sharma, Mike Kirby, Joe Boyd, Tanya Levshina, Ken Herner

  • v1.0: Deployment Status
    • No incidents so far
    • It would be useful to get a plot showing all the users that submitted test jobs to preprod. Can we use ITB or preprod Gratia server collector for this?
    • Jobsub v1.0 is on production servers (fifebatch.fnal.gov) as of Wed, 24 Sep 2014 around 10:00 am
    • On gpsn01: jobsub_tools v1_3_1_1_2 set to current & v1_3_2_0 as test 24 Sep 2014
    • jobsub_tools v1_3_2_0 is installed on dev & pre-prod
    • jobsub_tools v1_3_2_0 is current on the fifebatch.fnal.gov servers
    • Configure HTCondor on fifebatch with default require_disk/require_cpu/require_memory
  • Usage Status
    • Minos: 60Khrs/day
    • Nova: Max 200Khrs/day
    • Lariat:
  • CDF (Willis)
    • Seeing all dag jobs with different clusterid is confusing. We need a jobsub_q and jobsub_history -dag
  • Mapping username based on DN
    • Parag: Working with Mischa to understand the llrun usage but the gumsclient plugin may not have the required functionality. There are a few options to try, such as creating a local mapfile and using llrun.
    • Circulated initial draft on security in Jobsub.
  • Jobsub Tools:
    • We need an emergency release for the jobsub_tools that address several critical issues.
  • Nova has some glideinwms config issues on the UCSD factory side
  • genie may try it out.
  • uboone workflow all the except the last step.
  • Local batch operations were submitted with jobsub tools without the -g option. We should keep using the same scheme/infrastructure.

September 23, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Dennis Box, Neha Sharma, Kenneth Herner, Joe Boyd

  • v1.0: Deployment Status
    • It would be useful to get a plot showing all the users that submitted test jobs to preprod. Can we use ITB or preprod Gratia server collector for this?
    • Deployed on dev & pre prod.
    • Jobsub v1.0 will be on production servers (fifebatch.fnal.gov) on Wed, 24 Sep 2014 around 10:00 am
    • On gpsn01: jobsub_tools v1_3_1_1 will be current & v1_3_1_2 will be test
    • jobsub_tools v1_3_2_0 is installed on dev & pre-prod
    • jobsub_tools v1_3_2_0 will be current on the fifebatch.fnal.gov servers
    • Configure HTCondor on fifebatch with default require_disk/require_cpu/require_memory
  • Testing
    • Nova, Minerva, gm2, uboone have volunteered to test their workflows
  • CDF (Willis)
    • Seeing all dag jobs with different clusterid is confusing. We need a jobsub_q and jobsub_history -dag
  • Mapping username based on DN
    • Parag: Working with Mischa to understand the llrun usage but the gumsclient plugin may not have the required functionality. There are a few options to try, such as creating a local mapfile and using llrun.
  • Can we make it easier to find Jobsub Commands? Maybe little reorganizing on the front page?

September 16, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Dennis Box, Neha Sharma, Mike Kirby, Kenneth Herner, Gerard Bernabeu Altayo, Tanya Levshina, Joe Boyd

  • v1.0: Deployment Status
    • Released last week.
    • Deployed on dev & pre prod
    • On gpsn01: jobsub_tools v1_3_1_1 will be current & v1_3_1_2 will be test
    • jobsub_tools v1_3_2_0 is installed on dev & pre-prod
  • Testing
    • Nova, Minerva, gm2, uboone have volunteered to test their workflows
  • CDF (Willis)
    • Missing option to add email address to email summary emails. Maybe add two options --email-summary & --email
    • Working on wrapping caf-submit
    • Also need to work on caf-mon
  • Mapping username based on DN
    • Parag: Working with Mischa to understand the llrun usage, but the gumsclient plugin may not have the required functionality. We have a few options to try, such as creating a local mapfile and using llrun.
  • Experiment Onboarding
    • Genie is next in line.
  • Log file survey
    • Lariat: User or experiment readable.
    • Minerva: Experiment readable. Need initial environment and last ~5MB tail. Request for making the logs web viewable.
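
A minimal sketch of the "last ~5MB tail" request from the log file survey above. The function name `tail_bytes` and the 5 MB default are assumptions for illustration only; this is not how the jobsub server handles logs.

```python
import os


def tail_bytes(path, max_bytes=5 * 1024 * 1024):
    """Return at most the last max_bytes bytes of the file at path.

    Hypothetical helper: the name and the default size are illustrative.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        if size > max_bytes:
            # Skip everything except the final max_bytes bytes.
            handle.seek(size - max_bytes)
        return handle.read()
```

Reading in binary mode avoids decoding errors when the cut lands in the middle of a multi-byte character; a caller that wants text would still need to drop the first partial line.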

September 10, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Dennis Box, Neha Sharma, Joe Boyd, Mike Kirby, Kenneth Herner

  • v1.0: Release Status
    • Docs updated
    • Fixed pycurl for SL6 and ups python
  • CDF
    • Need to make sure that Robot DNs for existing users are allowed to scp files to the CDF disks
  • LSST
    • lsstana user did not exist. Neha resolved the issue.
  • Operations
    • Load balancing is not close to 50%
  • Roadmap for future releases
    • Start HTCondor as root
    • Support multiple users to submit production jobs. Requires user accounts to be created on the server machines + sudo access to the Jobsub server to perform certain tasks
    • Support users to authenticate using Non KCA credentials and use credentials mapped in GUMS (Requires llrun/GUMS client to map user name based on DN + FQAN)

September 02, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Dennis Box, Neha Sharma, Joe Boyd, Kenneth Herner

  • CDF: Willis
    • Dag features
    • When caf jobs finish we can have a post script mail the output. The post script filename can be random for security purposes.
  • v1.0 Release status
    • Dagnabbit features are completed. Ready for RC1
    • Cut RC after #6657
    • Deploy RC1 by today
    • Willis will try it out once deployed on the dev setup
    • Have Ken try out dag. Dzero, Lariat.
    • Have Herb try out v1.0
    • Have Kazu try out for uboone.
  • Can we gather statistics at the end of the jobsub job and put them in the sandbox, so they are available to users when they download the sandbox?
  • Microboone Onboarding
    • All issues should be resolved.
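
The CDF note above suggests giving the post script a random filename for security. A minimal sketch of one way to generate such a name, assuming Python's standard `secrets` module; the function name, prefix, and suffix are hypothetical and not taken from jobsub.

```python
import secrets


def random_post_script_name(prefix="post_", suffix=".sh", nbytes=16):
    """Build a hard-to-guess filename for a DAG post script.

    Hypothetical helper: prefix, suffix, and length are illustrative.
    """
    # token_hex(nbytes) returns 2 * nbytes hexadecimal characters
    # drawn from a cryptographically strong random source.
    return f"{prefix}{secrets.token_hex(nbytes)}{suffix}"
```

A random name only makes the file hard to guess; it does not replace file permissions on the directory that holds it.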

August 26, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Dennis Box, Tanya Levshina, Neha Sharma, Mike Kirby

  • v1.0 Release status
    • Should be on track to release the RC by the end of this week
  • Microboone Onboarding
    • HOME env variable issue.
    • Python env/ups setup

August 19, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Joe Boyd, Dennis Box, Kenneth Herner, Tanya Levshina

  • CDF: Willis
    • Willis's test jobs are stuck at sam begin. This job does not start as there are no glideins being submitted/running. We should check with Neha
    • Giving access to Developers will speed up the debugging process
    • Need the wrapper script to auto renew the krbcc
  • Jobsub Tools
    • Dennis is working on dagnabbit support in the jobsub server.
    • Heidi found some issues with the jobsub_tools v1_3_1_2 -- Dennis has instructions on how Minerva does business.
    • Current jobsub_tools breaks mu2e because of stdout/err redirection. Maybe we can use tee here and give proper recommendations to the user.
  • FIFE
    • Minos ran around 50k wall hours on Monday
    • Uboone seems to be happy so far but needs access to the cloud -- Joe to get back to Tanya/uboone
    • Meeting with Lariat today
  • v0.4 Release Status - Deployment
    • To avoid scalability issues, limit the number of concurrent queued jobs.
    • v0.4 deployed in fifebatch
  • Dzero
    • Joel Snow to run using samgrid on GP Grid using Jobsub
  • Tanya: Started working with Genie. Gabriel Perdue is the contact person for this and most neutrino experiments will be using it.

August 12, 2014

Present: Parag Mhashilkar, Willis Sakumoto, Neha Sharma, Joe Boyd, Mike Kirby, Gerard Bernabeu Altayo, Dennis Box, Kenneth Herner, Tanya Levshina, Alessio

  • Jobsub Tools
    • Heidi found some issues with jobsub_tools v1_3_1_2 that need to be investigated. Dennis will check with Heidi.
  • v0.4 Release Status - Deployment
    • v0.4 deployed on pre-prod
    • Willis tested running jobs and got jobs running. He wants to run more scale testing on GP Grid
    • Neha to configure Jobsub/glideinwms pre prod setup with Fermicloudpp
    • Plans to deploy it in fifebatch: Approx. this week
  • Yun-Tse sent requirements for MC5 Microboone production. SLAC, Tusker, FermiCloud. This is high priority.
  • Joel Snow to run using samgrid on GP Grid

August 05, 2014

Present: Tanya Levshina, Parag Mhashilkar, Dennis Box, Willis Sakumoto, Gerard Bernabeu Altayo, Neha Sharma, Mike Kirby, Kenneth Herner

  • Jobsub Tools
    • Heidi found some issues with jobsub_tools v1_3_1_2 that need to be investigated.
  • Ken: Catharine has some issues submitting/running jobs for Lariat
  • v0.4 Release Status
    • Neha to install jobsub server v0.4 on pre-prod.

July 29, 2014

Present: Tanya Levshina, Parag Mhashilkar, Dennis Box, Willis Sakumoto, Kenneth Herner, Gerard Bernabeu Altayo, Neha Sharma, Mike Kirby

  • v0.4 Release Status
    • Willis: Needs some changes to jobsub_tools to get the CDF job through. #6697
    • Ken: Still unable to submit to fifebatch-dev.fnal.gov
    • Mike:
      • Alex Drlica-Wagner from DES was interested in jobsub.
      • Anders from SLAC and SuperCDMS may be interested in deploying jobsub at SLAC for their experiments. They are looking at what they can get from FIFE.
      • Introduce LARIAT to jobsub. Parag: May as well use new jobsub since they are getting started. Neha: Supported for onsite submission.
    • We should have "Hello World" and "Hello World with ifdh" examples. The ifdh example should access dcache & bluearc through ifdh.

July 22, 2014

Present: Tanya Levshina, Parag Mhashilkar, Dennis Box, Willis Sakumoto, Kenneth Herner

  • v0.4 Release Status
    • Dennis: Doing a few last-minute tests. So far it is going well, and he will ask Neha to deploy on the ITB today.
    • Willis: There are some issues with connecting to SAM Web. It's not clear how jobsub is involved.
    • Changes need to be documented.

July 15, 2014

Present: Neha Sharma, Parag Mhashilkar, Dennis Box, Joe Boyd, Willis Sakumoto, Kenneth Herner, Mike Kirby

  • v0.4 Release Status
    • Connect to SAM -- working -- #6586
    • Handling tar files along with input sandbox -- #6562 -- Uses dropbox but Dennis is not happy with the solution. Maybe we can take care of the messy path with a wrapper script.
    • Need krb5 tickets passed along. #6541 -- Dennis and Parag to talk offline about this. In the case of CDF, users have their .k5login file populated with the robot principals.
  • Mike
    • Will talk with Lariat
  • Joe: Data preservation group is shrinking and we need to get this working. The deadline to get it working is in the next few weeks.
  • Neha
    • fifebatch1 will be behind HA in 2 weeks

July 08, 2014

Present: Neha Sharma, Parag Mhashilkar, Dennis Box, Tanya Levshina, Willis Sakumoto

  • Neha: Issues with frontend advertising to two factories.
  • Tanya: We cannot use fifebatch to submit outside Fermilab. This makes the new system unusable.
  • Neha: We are still waiting on networking folks to open up the http port for outside access.
  • Default SAM_STATION name is using group name.
  • Tanya: Concerned about moving to production to try out flux file and alien cache in Nebraska.
  • Willis: Kerberos ticket and shipping tarball to output location? Default output icaf. Get ifdh working with user-specified SAM_STATION. '@' sign in job parameter list.
  • Parag: Jobsub users meeting: Thursday, 2-3pm.

June 24, 2014

Present: Neha Sharma, Parag Mhashilkar, Dennis Box, Tanya Levshina, Willis Sakumoto, Joe Boyd

  • v0.4 Status
    • Updates in the ticket #5333

June 24, 2014

Present: Neha Sharma, Parag Mhashilkar, Dennis Box, Tanya Levshina, Willis Sakumoto, Mike Kirby, Joe Boyd, Ken Herner, Gerard Bernabeu Altayo

  • FIFE Workshop Feedback
    • Joe: Liz suggested DAG since we are using partitionable slots. Since all applications use same amount of memory, can we take feedback from the few jobs that are running and submit other jobs in the DAG requesting only that much memory. CHECK WITH HTCONDOR EXPERTS.
    • Mike: Give users a tool to understand their jobs' resource usage: disk, memory, cpu, etc. FIFEMON already provides this.
    • Parag: Discussions with Brian B. Use pool account to map users where we do not get KCA DN from the user.
    • Gerard: Data coming back from jobs can be 10M - 2G per job. This is essentially input sandbox. We need to handle this somehow and set proper system requirements. Analysis jobs will have this use case.
    • Ken: Need to transfer the robot principal to the WN. Jobs need this principal to be valid for at least 3 days.
    • Running dzero and cdf jobs outside FNAL will not be supported.
    • Dennis: Users want tarfiles. Dagnabbit needs to work with the jobsub server.
    • Willis: Args to jobs don't work. As per Dennis, this is an issue with the option when a tarfile is used.

June 10, 2014

Present: Neha Sharma, Parag Mhashilkar, Dennis Box, Tanya Levshina, Willis Sakumoto, Mike Kirby

  • Issues Art is facing
    • Authentication Error
    • Parag to Check with Joe on how to change the password for keytab and change it on all the known hosts.
  • FIFE Workshop
    • Willis's jobs are idle in the queue. -- CDF related
    • Neha's jobs using Nova group are idle. Glideins are requested by the frontend but factory does not submit any.
    • A single usage model works, but giving multiple usage models doesn't.
  • CDF/Dzero require auth tokens to be available on the worker nodes to transfer files back.

June 03, 2014

Present: Neha Sharma, Parag Mhashilkar, Joe Boyd, Dennis Box, Tanya Levshina, Willis Sakumoto

  • Announcements:
  • Jobsub v0.3 Release
    • No News. Not many production jobs.
  • Deployment status
    • Deployed in pre production. fifebatch-preprod.fnal.gov
      • Neha working on Keytab synching.
      • Dennis to send the work with GCSO to patch fifebatch1 for auth.py
      • Neha will check with Gerard to increase the disk space before asking Art -- DONE
      • Gerard: Working with FEF for fifebatch hardware. Neha to check about the port 80 exemptions.
      • Steve pointed out a possible race condition in the auth module with 1000s of requests. This is affecting GUMS servers. -- in v0.3.1
  • Experiment Status:
    • Mu2e:
    • Microboone: Grad students coming to lab. We can try to run 4G node on Fermicloud.
    • Nova:
    • Minos:
    • LBNE:
    • Minerva:
    • D0:
    • Dark side:
    • CDF: Joe will install jobsub client on CDF machines so Willis can start testing. Can CDF use ifdh instead of scp so we don't need to ship credentials?
  • Keytabs generated on a new machine invalidate the ones on the old machine. We need unique keytab files across all the nodes to avoid issues in the HA setup. -- Neha working on creating keytabs and a puppet module to sync them.

May 27, 2014

Present: Neha Sharma, Parag Mhashilkar, Joe Boyd, Dennis Box, Tanya Levshina, Gerard Bernabeu Altayo

  • Announcements:
  • Jobsub v0.3 Release
    • Released last week.
  • Deployment status
    • Deployed in pre production. fifebatch-preprod.fnal.gov
    • Issues with Fermicloud. Parag: URL limits reaching 8k in Apache.
    • Steps: See last week's meeting notes.
    • Neha will check with Gerard to increase the disk space before asking Art.
    • Gerard: Working with FEF for fifebatch hardware.
    • Steve pointed out a possible race condition in the auth module with 1000s of requests. This is affecting GUMS servers.
  • Experiment Status:
    • Mu2e:
    • Microboone: Grad students coming to lab. We can try to run 4G node on Fermicloud.
    • Nova:
    • Minos:
    • LBNE:
    • Minerva:
    • D0:
    • Dark side:
  • Keytabs generated on a new machine invalidate the ones on the old machine. We need unique keytab files across all the nodes to avoid issues in the HA setup.

May 20, 2014

Present: Neha Sharma, Parag Mhashilkar, Joe Boyd, Dennis Box, Tanya Levshina

  • Jobsub v0.2.1 Feedback
  • Jobsub v0.3 Release
    • #6279: Should be done today
    • #6261: Changes are done. Need to be tested and merged.
    • Final release schedule for Thursday if testing goes well and there are no show stoppers.
  • Deployment status
    • Will be deployed in pre-prod fifebatch2.fnal.gov. Alias is set to fifebatch.fnal.gov
    • Neha & Joe will test pre-prod this week. Next week Tanya will try it out. Thursday we can ask Art to try it out. June 2 on production nodes.
    • Steps:
      • We test in pre prod
      • Art tests pre prod
      • Once confident, move users to fifebatch.fnal.gov with fifebatch2.fnal.gov only
      • Phase out fifebatch1.fnal.gov, drain queues
      • Upgrade fifebatch1.fnal.gov and add it to DNS round robin
  • Experiment Status:
    • Nothing new since last week.
      • Mu2e:
      • Microboone:
      • Nova:
      • Minos: Getting resources
      • LBNE:
      • Minerva:
      • D0:
      • Dark side:

May 13, 2014

Present: Neha Sharma, Parag Mhashilkar, Joe Boyd, Gerard Bernabeu Altayo, Tanya Levshina

  • Announcements:
    • Users meeting tomorrow at 9:30 am in
  • Jobsub v0.2.1 Feedback
    • Users are using it for their testing. We need easy-to-access Gratia links to various plots
  • Jobsub v0.3 Release
    • If no issues found in testing, will be released this week or early next week.
  • Deployment status
    • HA setup for testbed. Neha working on puppetizing.
    • We need jobsub versions in job classads.
    • Minos would like to submit to local batch along with the Jobsub servers. What is local batch, and does GCSO plan to support it?
  • Experiment Status:
    • Mu2e: Are they still using Fermicloud (?)
    • Microboone: Has some issues with ifdh.
    • Nova: Bluearc going down on Thursday. Andrew can still use OSG resources
    • Minos: Running on fifebatch1.fnal.gov steadily
    • LBNE: Started using/testing off sites
    • Minerva:
    • D0: Might start using it. Ken will be the user.
    • Dark side: Ken Herner

April 15, 2014

Present: Dennis Box, Parag Mhashilkar, Joe Boyd, Mike Kirby, Tanya Levshina

  • Jobsub v0.2 Feedback
    • No noticeable issues
  • Jobsub v0.2.1 Release
    • Important features: condor_q and condor_history equivalent commands in jobsub -- Done
    • Dennis: Need to revert fetchlog to return just the output and not the job status. URL multipart is problematic.
    • Support for non-Fermilab VOMS.
  • Needs improvements
    • We need to look into the dispatcher to handle the URLs properly
  • Jobsub v0.3 Release
    • HA Support: Parag talked to Gerard. HA will be DNS round robin. Every JobSub server will only host one schedd.
  • Deployment status
    • HA setup for testbed
  • Experiment Status:
    • Mu2e: Is trying to run jobs on Fermicloud with large memory. They tried to run jobs using fifebatch1.
    • Microboone: Yuntse presented during OSG AHM.
    • Nova: No update. They want to run on AWS.
    • Minos: Possibly run in past week?
    • LBNE: condor_tail equivalent.
    • Minerva: May be next but after production release.

April 08, 2014

Present: Dennis Box, Parag Mhashilkar, Joe Boyd, Mike Kirby

  • Jobsub v0.2 Feedback
    • No noticeable issues
  • Jobsub v0.2.1 Release
    • Important features: condor_q and condor_history equivalent commands in jobsub. jobsub_q prototype.
    • Support for non-Fermilab VOMS.
  • Jobsub v0.3 Release
    • CDF: Willis Sakumoto will help with the testing. (Replacement for Bo)
    • HA Support: Parag talked to Gerard. HA will be DNS round robin. Every JobSub server will only host one schedd.
  • fifebatch1 status
    • HA setup for testbed
  • Experiment Status:
    • Microboone: Working with Yuntse for job submission. Will present during OSG AHM.
    • Nova: FIFE is requesting the tier based provisioning. Request will be coming to glideinwms support.
    • Minos: Minos got ~5K slots and used ~20K CPU hours using new jobsub.
    • LBNE: Authentication/VOMS will be in v0.2.1 (to be released sometime around mid next week)

April 01, 2014

Present: Dennis Box, Parag Mhashilkar, Joe Boyd, Mike Kirby

  • Jobsub v0.2 Release. Did we miss any mailing list? Client availability needs a broader audience. Server update will be announced by GCSO to fifebatch-jobsub-announce mailing list.
  • fifebatch1 status
    • SSS will be installing the jobsub server rpm.
    • HA setup for testbed. Joe will talk to Gerard about the scheme and we need to coordinate for this before starting the work.
  • Mike: Downloading user sandbox feedback. Job output used to go to bluearc area. With fifebatch1 they will stay on the server and we need better instructions on how to get the output logs. fifemon is also looking at the
  • Experiment Status:
    • Microboone: Working with Yuntse for job submission.
    • Nova: FIFE is requesting the tier based provisioning. Request will be coming to glideinwms support.
    • Minos: Joe talked to Art and he had 4500 jobs running.
    • LBNE: How critical is the authentication fix?

March 07, 2014

Present: Dennis Box, Parag Mhashilkar, Tanya Levshina, Joe Boyd, Mike Kirby, Stephan Lammel

(M) = Mandatory
(P) = Preferred
(N) = Not required

CAFSubmit

CAFExe:

--tarFile(M): Input the job. This needs to be extracted before executing the job. Jobsub supports it or can be made to change.

--outLocation(P): Tarred-up output directory. --depot in jobsub? If the output is bigger than a certain amount then only dirs are shipped back.
Come back to this one later for implementation details and available info

--procType (M): Is job long or short. We need to support in jobsub

--start, --end, --section (M): These are passed to the user's job and used for section and dag generation.

--dhaccess(N): How to handle the data handling.

--maxParallelSection(R): Run up to this many dag nodes at a time.

--email(PP): Email for job notification. Needs to be passed by wrapper

--farm(N): Equivalent --jobsub-server

--group(R): Already there in jobsub

--os: Already supported

--cdfsoft: Moot with cvmfs

--site: Already supported

--donotdrain(N)

command(R)

exp wrapper : CAFExe

February 25, 2014

Present: Nick Palombo, Dennis Box, Parag Mhashilkar, Tanya Levshina

  • Deployment status
    • Joe: Need to deploy 0.1.4
    • Tanya: Gratia probes now installed
  • JobSub Dropbox Server status
    • Update from Nick
      • #5484: Related to fetch log. Return job status along with the log. Should be a multipart response. Different browsers handle it differently. Client & server need to be coded accordingly.
      • #5485: Implementation is as per API docs.
      • #5509: Document the code using doc strings.
  • Update from Experiments
    • No News
  • JobSub Tickets
  • Version : v0.2
  • JobSub Tools
  • Version: v0.3
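
Ticket #5484 above talks about returning the job status together with the log as a multipart response. The following is only a sketch of that idea using Python's standard `email` package to build and parse a two-part body. The function names, the JSON status format, and the part order are assumptions and do not come from the jobsub API docs.

```python
import json
from email import message_from_string
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_fetchlog_response(job_status, log_text):
    """Pack a status dict and a log into one multipart/mixed body.

    Hypothetical server-side helper; not the jobsub implementation.
    """
    message = MIMEMultipart("mixed")
    # Part 1: job status as application/json.
    status_bytes = json.dumps(job_status).encode("utf-8")
    message.attach(MIMEApplication(status_bytes, _subtype="json"))
    # Part 2: the log itself as text/plain.
    message.attach(MIMEText(log_text, "plain", "utf-8"))
    return message.as_string()


def parse_fetchlog_response(raw):
    """Split the multipart body back into (status dict, log text).

    Hypothetical client-side helper matching the builder above.
    """
    message = message_from_string(raw)
    status_part, log_part = message.get_payload()
    status = json.loads(status_part.get_payload(decode=True).decode("utf-8"))
    log_text = log_part.get_payload(decode=True).decode("utf-8")
    return status, log_text
```

Because both ends must agree on the part order and content types, the client and server have to be written as a matched pair, which is the coordination cost the ticket points out.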

February 11, 2014

Present: Nick Palombo, Dennis Box, Parag Mhashilkar, Tanya Levshina, Mike Kirby

  • JobSub v0.1.3 Deployment status -- Joe
    • Check with Steve Timm in case of issues related to jobsub.
    • Need to deploy gratia probes
  • JobSub Dropbox Server status
    • Update from Nick
      • Almost done. Need to document.
      • Review the API docs
  • Update from Experiments
    • Nothing new
  • JobSub Tickets
  • Version : v0.2
  • Version: v0.3

February 04, 2014

Present: Nick Palombo, Dennis Box, Parag Mhashilkar, Tanya Levshina, Joe Boyd, Mike Kirby

  • JobSub v0.1.3 Deployment status -- Joe
    • Joe to deploy new rpms on the fifebatch1.fnal.gov
    • Parag to provide the glideinwms configuration changes so Nova can run their tests using new jobsub and fifebatch1
  • JobSub Beta testing update -- Parag/Dennis
    • Nova: Eric ran test jobs.
    • Larsoft: Mike to get back when they are ready to run using new jobsub.
  • JobSub Dropbox Server status
    • Update from Nick
      • #5233: Need couple of days to finish this.
      • Performance testing; have access to ganglia and will work on it.
      • Review the API docs.
    • HTCondor https curl-plugin?
  • Update from Experiments
    • Technology demonstration by LBNE for DOE review running jobs on OSG/Grid/Cloud. This is coming from LBNE and not directly from FIFE. They will be running on XSEDE (?) to start with.
    • LBNE will send requirements if they want to use jobsub and we are required to support them in jobsub.
    • DOE Review: May 12 - 16
    • Demonstration: At least a week before that.
    • LBNE is using jobsub for MC simulations but not everything.
  • JobSub Tickets
  • Version: v0.3
    • cdfcaf support
    • Investigate CILogon

January 28, 2014

Present: Nick Palombo, Dennis Box, Parag Mhashilkar, Tanya Levshina, Joe Boyd, Mike Kirby, Bo Jayatilaka

  • JobSub v0.1.2 Deployment status -- Joe
    • Joe deployed last Friday, but it is still pointing to the sleeper pool.
    • Eric to start beta testing in production after glideinwms is configured to go to OSG sites.
    • Tanya: We need Gratia probe on fifebatch1 since it has its own collector.
  • JobSub Beta testing update -- Parag
    • Nova: Eric to start beta.
    • Mu2e: Got back to me
    • Release jobsub_client as ups product and release it in ups common area
  • JobSub Dropbox Server status
    • Update from Nick
      • Might need changes to jobsub tools or plugins to handle input sandbox files.
      • Currently the server prints out both URIs, https:// and file path.
      • All other changes are done.
      • Work with Joe to get Ganglia setup for performance testing.
    • HTCondor https curl-plugin?
      • Parag: Asked HTCondor if they would like to support https as part of the curl transfer plugin. Haven't heard back so will follow up again.
  • JobSub Tickets

January 21, 2014

Present: Nick Palombo, Dennis Box, Parag Mhashilkar, Tanya Levshina

  • JobSub v0.1.2 Deployment status -- Joe
    • Try to get it deployed today or in a day or two.
    • Tests for v0.1.1 indicated heavy memory usage by HTCondor.
    • Joe to submit a few jobs after the upgrade to make sure everything is working.
  • JobSub Beta testing update -- Parag
    • Nova: Haven't heard back from Nova
    • Mu2e: Got back to me
    • Release jobsub_client as ups product and release it in ups common area
  • JobSub Dropbox Server status
    • Update from Nick
      • Retain original filename
      • Client changes to handle the custom files upload and add them to actual jobsub submission.
    • Need disks to be mounted across the submission nodes for the time being -- Joe
      • Machine name: fifedepot01.fnal.gov
      • To start with use fifebatch1. It has 2TB x 3 disks.
      • Also, use http instead of https so we can use the file transfer plugins
      • Change the server to return the file location and API access/URL in the JSON that's passed back.
    • HTCondor https curl-plugin?
      • Parag: Asked HTCondor if they would like to support https as part of the curl transfer plugin.
  • JobSub Tickets

January 14, 2014

Present: Nick Palombo, Dennis Box, Parag Mhashilkar, Tanya Levshina, Bo Jayatilaka, Mike Kirby

  • Discussed & Updated ticket status for JobSub v0.1.2. Once tested we should be ready to release new JobSub
  • Nick gave a demo of file upload mechanisms
    • We need to do integrity checks
    • We need to keep the filenames same.

    Question:

    • Should a user's files be read-protected against other users? As per Bo, no.
    • Files are read protected using group. Will that be enough?
    • HTCondor curl_plugin supports only http and not https. Nick to try it.
    • Will the dropbox disk be mounted on the submission nodes? Check with Joe.
  • As per Bo files can be accessible to everyone in the VO.
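
One common way to do the integrity check mentioned above is to compare a checksum computed by the client before upload with one computed by the server after upload. A minimal sketch, assuming SHA-256 and hypothetical function names; the notes do not say which method was chosen.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_is_intact(path, expected_hex):
    """Compare the stored file's digest with the one the client sent.

    Hypothetical helper: assumes the client sends a hex SHA-256 string.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat for large tarballs.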

January 07, 2014

Present: Nick Palombo, Dennis Box, Parag Mhashilkar

Discussed & Updated ticket status.

December 10, 2013

Present: Tanya Levshina, Nick Palombo, Dennis Box, Joe Boyd, Parag Mhashilkar

  • Improve the efficiency of the authorization. We do not need to run ktadd every time, just once. Also reuse the voms proxy if it exists.
  • Take care of the Proxy refresh in authorization.
  • In v0.2 we need to handle VOMS roles. Check with Mike Kirby to get these requirements from the experiments.
  • Tickets & Status updates
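
A rough sketch of the "reuse the voms proxy if it exists" idea from the list above. It assumes the proxy's age can be judged from the file's modification time and that the lifetime is known to the caller. A real check would inspect the proxy itself, so treat the function name and parameters as hypothetical.

```python
import os
import time


def proxy_is_reusable(proxy_path, lifetime_seconds, margin_seconds=3600, now=None):
    """Decide whether an existing proxy file can be reused.

    Assumption: the file's mtime marks when the proxy was created.
    Returns True only if the file exists and, by that estimate, still
    has more than margin_seconds of lifetime left.
    """
    if not os.path.isfile(proxy_path):
        return False
    current = time.time() if now is None else now
    age = current - os.path.getmtime(proxy_path)
    return age + margin_seconds < lifetime_seconds
```

With a check like this the server would regenerate the proxy only when the function returns False, instead of on every request.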

December 04, 2013

Present: Tanya Levshina, Nick Palombo, Dennis Box, Joe Boyd

  • Ticket & Status updates

November 26, 2013

Present: Tanya Levshina, Nick Palombo, Dennis Box, Parag Mhashilkar

  • Ticket & Status updates

November 19, 2013

Present: Parag Mhashilkar, Nick Palombo, Dennis Box

  • Nick: Demo with authentication using the GUMS server. Currently it takes up to 7 sec for the server to return the mapped user. This is quite slow. Also, we need to investigate whether http certs work instead of the host certs
  • Created several tickets, updated/closed the old ones.

November 12, 2013

Present: Parag Mhashilkar, Nick Palombo, Dennis Box, Tanya Levshina

  • Nick: Demo with support for executable uploaded to the server. Submits jobs
  • Basic integration with jobsub deployed on the server machine.
  • jobsub client taking shape

November 05, 2013

Present: Parag Mhashilkar, Joe Boyd, Nick Palombo, Dennis Box, Tanya Levshina

  • Nick gave a demo of simple jobsub submission using http/cherrypy

October 29, 2013

Present: Parag Mhashilkar, Joe Boyd, Nick Palombo, Dennis Box

  • Joe will check with Stephan about the keytab file to get the x509 robot certs
  • Dennis to replicate the development setup on the testbed.

Using jobsub on fcint076

  • ssh opennebula@fcint076
    [opennebula@fcint076 ~]$ source /fnal/ups/etc/setups.sh
    [opennebula@fcint076 ~]$ setup jobsub_tools
    [opennebula@fcint076 ~]$ ls $JOBSUB_TOOLS_DIR/test
    dagTest  Run_All_Tests.sh  run_grid_test.sh   Run_Unit_Tests.sh  test_local_env.sh
    README   run_dag_test.sh   run_local_test.sh  test_grid_env.sh
    [opennebula@fcint076 ~]$
    [opennebula@fcint076 ~]$ jobsub $JOBSUB_TOOLS_DIR/test/test_local_env.sh 600
    /grid/data/fermilab/condor_tmp//test_local_env.sh_20131030_151352_14312_0_1.cmd
    submitting....
    Submitting job(s).
    1 job(s) submitted to cluster 45.
    [opennebula@fcint076 ~]$ 
    [opennebula@fcint076 ~]$ condor_q
    
    -- Submitter: fcint076.fnal.gov : <131.225.64.239:56162> : fcint076.fnal.gov
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
      45.0   opennebula     10/30 15:13   0+00:00:19 R  0   0.0  test_local_env.sh_
    
    1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
    [opennebula@fcint076 ~]$