23-November-2015

Attending: Satish, Chris, Susan, Enrique, Justin, Tapasi, Sijith, Evan, Prabjot, Paul S, Gavin, Joe, Qiulan

News (Satish and Alex)

We have started a new production troubleshooting wiki. At present it has only one item, but all productioneers are encouraged to expand this document as they encounter and solve problems. Experts who keep hearing about the same problem are also asked to update the page.

Satish will be on Thanksgiving holiday Thursday through Monday, although he should have sporadic email availability, and we will have a meeting on Monday.

SW Tags (Paul S)

Paul cut a new branch (R15-11-17-miniprod-br) and patch release (R15-11-17-miniprod.a) last week.

The nightly build has been running without problems.

Evan has completed the updates necessary to include ANG for the build of development against ART 1.17.03. These include building CAFFE against both SLF5 and SLF6, and including builds with the e9 qualifier. He also needed to install a new package, Snappy, which is included as a new UPS product. These packages are available on scisoft. Chris has requested that the externals installation script be updated to ensure that the relevant packages will also get picked up for offsite locations. Paul will shortly be producing a new tag with this version of art, and development will also be built against the new art version.

Preparations for Dual Gain Simulation (Gavin)

There are various opinions on how to approach this. Gavin has decided to set up CRY jobs that run only up through the Geant4 level (producing g4 tier files). The plan is to then run the ReadoutSim and PhotonTransport modules in separate jobs. He just needs to know which configuration options are needed. Gavin has emailed Adam for clarification. It was observed that we should have separate configurations for each gain, as many parameters likely need to be changed.

Gavin will make separate job FCLs, to maximize flexibility, for example if simulations at other gain settings are needed.

Running LEM at FNAL (Chris)

DocDB 14420

This looks like a promising avenue, but the major concern is scalability. It appears that this scheme will start to break down with about 2k simultaneous jobs, possibly substantially more if LEM runs as part of a larger job that naturally throttles the rate of communication with the LEMServer. This should work for running at Fermilab and at offsite locations, although there may be network security considerations that complicate the picture. Testing would need a large, but low-priority sample.

Another possible way to throttle requests to the LEMServer is to issue a limited number of tokens, so that jobs know not to access the server if it is too busy.
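As an illustration of the token idea only, here is a minimal Python sketch. It is not the LEMServer interface: the names TokenPool and run_job are hypothetical, and the pool lives inside a single process, whereas a real deployment would need a token service shared across grid jobs. A job that cannot get a token skips the request and can retry later.

```python
import threading


class TokenPool:
    """Hypothetical pool with a fixed number of tokens.

    A job that holds no token should not contact the server.
    """

    def __init__(self, max_tokens):
        self._sem = threading.BoundedSemaphore(max_tokens)

    def try_acquire(self):
        # Non-blocking: True if a token was free, False if the server is "too busy".
        return self._sem.acquire(blocking=False)

    def release(self):
        # Hand the token back so another job can use it.
        self._sem.release()


def run_job(pool, query):
    """Call query() only while holding a token.

    Returns None when no token is available, so the caller can back off and retry.
    """
    if not pool.try_acquire():
        return None
    try:
        return query()
    finally:
        pool.release()
```

The number of tokens would be chosen from the observed limit on simultaneous connections; the back-off policy for jobs that are refused a token is left open here.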

Adding worker nodes at SMU should be relatively straightforward.

Offsite Status Report (Enrique)

DocDB 14314

The OSC cluster has had some issues, but is now performing the best. There was some discussion of how this could be, as it doesn’t appear to be running that well (very long idle times). This is evidently because no sites appear to be running spectacularly at the moment.

At some sites (MWT2 and Chicago), there appear to be a few issues with novasoft not being installed properly. Enrique is investigating.

Satish needs to cut a new version of NovaGridUtils to deal with kernel-distribution version mismatches. That may resolve some of the apparent mis-installations.

Processing Status

FA-style Horn-off MC (Enrique)

Enrique has been emailing back and forth with Paul and Gavin to set out the test jobs. Some changes are needed to make_sim_fcl_fastyle: he needs to turn off genie reweighting, which is not configurable with a command line option. Also it seems that the artdaq files are not picked up by the output SAM definition. Gavin claims that this is because the files are produced with the incorrect metadata, in particular the nova.special parameter. But this parameter isn’t needed as the horn current metadata parameter is enough to distinguish these files.

FA-style ND New Position MC (Paul R/Enrique/Chris)

Paul cannot attend today, but has sent an update by email. Generation is all done except for 88 files. Most have a common FTS error, which he will investigate later today.

Running command sam_metadata_dumper /pnfs/nova/scratch/fts/MCdaq_dropbox/b/e/2/neardet_genie_fhc_nonswap_ndnewpos-ndfluxv08_2000_r00010605_s01_c003_FA14-10-03x.d_v1_20151119_135356.sim.overlay.daq.root failed with timed out

stderr (Total 212 bytes):
151121 04:55:08 32752 Xrd: ReadPartialAnswer: Failed to read msg from connmgr (server [fndca4a.fnal.gov:1094]). Retrying ...
151121 04:55:08 32760 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
  

Reco is proceeding nicely, and should be caught up to sim today. LEM should be caught up in another two days.

FD New Position (Joe)

Joe is reading through the docs, and has been sending out a bunch of emails to get up to speed. He will shortly be running tests.

Raw2root Keepup (Qiulan)

This is running smoothly. Processed 14.9k files. One job failed, and Qiulan will send an email with details.

Reco Keepup (Paola)

Paola cannot attend, but reports that this is proceeding smoothly. Bruno had reported that new files have not been showing up in the past few days, so this needs to be investigated.

Mini-prod Calibration for FD (Qiulan)

All jobs have been failing with missing diblock masks and bad channel information. Satish will follow up to understand the problem.

Mini-prod Calibration for ND (Qiulan)

This is done for periods 1 and 2, except that a few files in each period are stuck because the corresponding jobs are idle.

Mini-Prod ND CRY (Bruno)

Bruno managed to get FCLs declared to SAM, but has had authentication issues running the jobs. Bruno will send another email, and possibly follow up with a service desk ticket.