
Bug #14190

BackTracker emits error "failed to get handle to simb::MCParticle from largeant, return"

Added by Gianluca Petrillo almost 4 years ago. Updated almost 4 years ago.

Status:
Closed
Priority:
Normal
Category:
Simulation
Target version:
-
Start date:
10/20/2016
Due date:
% Done:

100%

Estimated time:
Spent time:
Occurs In:
Experiment:
DUNE
Co-Assignees:
Duration:

Description

This is detached from the original report issue #14187, originating from Miriama Rajaoalisoa:

I am a member of the Young DUNE Collaboration and am currently working on a simulation using LArSoft. However, I have encountered some errors and failures while generating events using GENIE:
[...]
2) Also, when doing the detector simulation, using this command:
lar -c standard_detsim_dune10kt_1x2x6.fcl -n -1 -s ../g4/p2murho_2_g4.root
For each event, I get this error:

Begin processing the 9th record. run: 1 subRun: 0 event: 9 at 18-Oct-2016 06:57:38 CDT
%MSG-w BackTracker: PostSource 18-Oct-2016 06:57:38 CDT run: 1 subRun: 0 event: 9
failed to get handle to simb::MCParticle from largeant, return
%MSG

So, please, can you tell me what the origin of this error may be? My supervisor tried running the files from his account and everything worked fine, while I get these errors.
Despite these errors, we can still take the simulation to the end and get results if the number of events is fewer than 100, but whenever we try working with 1000 events, the detector simulation always runs into trouble after the 100th event, which is even stranger.

Screenshot from 2016-10-26 08-04-44.png (15.8 KB), "Broken pipe error", Miriama Rajaoalisoa, 10/26/2016 10:38 AM

History

#1 Updated by Gianluca Petrillo almost 4 years ago

  • Status changed from New to Feedback

Reading the error message tells us that:

  • BackTracker, a LArSoft service connecting truth information with readout and reconstruction data,
  • is unable to find a list of simb::MCParticle, the particles produced by event generators like GENIE (neutrino interactions) and also by Geant4 (during the propagation of the generated particles through the detector),
  • a list that is expected to be created by a module labelled largeant

So the most likely cause is that the input file you specified does not have Geant4 results in it.
If that is expected, because you only want to deal with GENIE output, then you should not load the BackTracker service in your job configuration, since it is of no use in that case.
If instead you thought Geant4 had been run, then you have to verify that you are picking the right file and that the Geant4 module was labelled largeant.
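One quick way to verify what the file actually contains is to dump its data products. This is only a sketch: it assumes your art release ships the standard eventdump.fcl configuration, and the input file path below is illustrative.

```shell
# Print the data products stored in the first event of the file
# (eventdump.fcl is a standard art configuration; adjust the input
# file path to your own):
lar -c eventdump.fcl -s ../g4/p2murho_2_g4.root -n 1
```

In the printout, look for an entry of type std::vector&lt;simb::MCParticle&gt;: if none appears, or if its module label is not largeant, that would explain the BackTracker warning.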

#2 Updated by Gianluca Petrillo almost 4 years ago

Also, can you expand on:

but whenever we try working with 1000 events, in the detector simulation we always have trouble after the 100th event

what kind of trouble do you have?

#3 Updated by Miriama Rajaoalisoa almost 4 years ago

Gianluca Petrillo wrote:

Reading the error message tells us that:

  • BackTracker, a LArSoft service connecting truth information with readout and reconstruction data,
  • is unable to find a list of simb::MCParticle, the particles produced by event generators like GENIE (neutrino interactions) and also by Geant4 (during the propagation of the generated particles through the detector),
  • a list that is expected to be created by a module labelled largeant

So the most likely cause is that the input file you specified does not have Geant4 results in it.
If that is expected, because you only want to deal with GENIE output, then you should not load the BackTracker service in your job configuration, since it is of no use in that case.
If instead you thought Geant4 had been run, then you have to verify that you are picking the right file and that the Geant4 module was labelled largeant.

Thank you very much for your answer. In our case, we have run Geant4, and we have an output file that is supposed to contain Geant4 results.

The thing is, we have had this failure since the GENIE simulation, right after running this command:
lar -c ../prodndkGolden_norm.fcl -n 1000 -o p2murho_2.root

The error was:
%MSG-w BackTracker: PostSource 20-Oct-2016 14:28:51 CDT run: 1 subRun: 0 event: 2
failed to get handle to simb::MCParticle from largeant, return
%MSG

When we ran Geant4 afterwards, we still got the same failure:
%MSG
Begin processing the 8th record. run: 1 subRun: 0 event: 8 at 20-Oct-2016 14:33:06 CDT
%MSG-w BackTracker: PostSource 20-Oct-2016 14:33:06 CDT run: 1 subRun: 0 event: 8
failed to get handle to simb::MCParticle from largeant, return
%MSG

And still the same error for the Detector simulation as above.

But the reason we have focused on the detector simulation error is exactly this trouble after the 100th event.
After running this command:
lar -c standard_detsim_dune10kt_1x2x6.fcl -n -1 -s ../g4/p2murho_2_g4.root

The records are processed as follows:
Begin processing the 1st record. run: 1 subRun: 0 event: 1 at 20-Oct-2016 14:43:25 CDT
...
But then, right after the 100th event is processed, there is no more output, just a blinking cursor, and after a few moments we get disconnected from the server. This happened 4 times, and the same trouble always appeared at the 100th event.

#4 Updated by Lynn Garren almost 4 years ago

I think we are missing information here. Would you tell us exactly what commands you run and in what order? What software are you running? Are you using your own software?

Would you please list all commands you run when starting from a fresh login?

#5 Updated by Miriama Rajaoalisoa almost 4 years ago

Lynn Garren wrote:

I think we are missing information here. Would you tell us exactly what commands you run and in what order? What software are you running? Are you using your own software?

Would you please list all commands you run when starting from a fresh login?

Sorry for the missing information. I am connecting to the dunegpvm04 machines and have used the following commands:

1. $source /grid/fermiapp/products/dune/setup_dune.sh

2. $cd /dune/app/users/hrazafin

3. $setup dunetpc v05_14_01 -q e9:prof

After that, I have a directory containing 5 subdirectories: genie, g4, detsim, reco and ana.

In the detsim directory, I have used the following commands:
4. $gevgen_ndcy -g 1000180400 -m 5 -n 1000 -o p2murho

5. $gevdump -f p2murho.1000.ghep.root > p2murho.out

6. $lar -c ../prodndkGolden_norm.fcl -n 1000 -o p2murho_2.root
It is after this command that I first get the error:
"%MSG-w BackTracker: PostSource 20-Oct-2016 14:28:51 CDT run: 1 subRun: 0 event: 2
failed to get handle to simb::MCParticle from largeant, return
%MSG"

But I kept going with the Geant4 simulation anyway, so I used the following command in the g4 directory:
7. $lar -c standard_g4_dune10kt_1x2x6.fcl -n -1 -s ../genie/p2murho_2.root
Here again, I got the failure message:
"%MSG
Begin processing the 8th record. run: 1 subRun: 0 event: 8 at 20-Oct-2016 14:33:06 CDT
%MSG-w BackTracker: PostSource 20-Oct-2016 14:33:06 CDT run: 1 subRun: 0 event: 8
failed to get handle to simb::MCParticle from largeant, return
%MSG"

Since I thought this might not be a critical error, I still went on with the detector simulation, so in the detsim directory I ran this command:
8. $lar -c standard_detsim_dune10kt_1x2x6.fcl -n -1 -s ../g4/p2murho_2_g4.root

In spite of the failures, the records are still being processed :
"Begin processing the 1st record. run: 1 subRun: 0 event: 1 at 20-Oct-2016 14:43:25 CDT ..."

But right after the 100th record, there is no more processing, no notification that it has finished, just a blinking cursor until, after several minutes of inactivity, we are cut off from the dunegpvm04 server.

And it went on like this at least 4 times in a row.

#6 Updated by Gianluca Petrillo almost 4 years ago

  • Status changed from Feedback to Assigned
  • Assignee set to Gianluca Petrillo
  • Occurs In v06_11_00 added
  • Experiment DUNE added
  • Experiment deleted (-)

I can reproduce the problem.

#7 Updated by Gianluca Petrillo almost 4 years ago

  • Status changed from Assigned to Resolved
  • % Done changed from 0 to 100

I suspect what you are seeing is just a feature of art.
Your console output is very terse: it contains only art's report that it is starting to process a new event.
art emits that message for each of the first 100 events, then one message every 100 events.
My output therefore looks like:

Begin processing the 96th record. run: 1 subRun: 0 event: 96 at 20-Oct-2016 18:11:55 CDT
Begin processing the 97th record. run: 1 subRun: 0 event: 97 at 20-Oct-2016 18:12:08 CDT
Begin processing the 98th record. run: 1 subRun: 0 event: 98 at 20-Oct-2016 18:12:21 CDT
Begin processing the 99th record. run: 1 subRun: 0 event: 99 at 20-Oct-2016 18:12:35 CDT
Begin processing the 100th record. run: 1 subRun: 0 event: 100 at 20-Oct-2016 18:12:48 CDT
Begin processing the 200th record. run: 1 subRun: 0 event: 200 at 21-Oct-2016 09:33:10 CDT
Begin processing the 300th record. run: 1 subRun: 0 event: 300 at 21-Oct-2016 09:55:55 CDT

In my test, one event took about 14 seconds; therefore, between the "100th record" and "200th record" messages, and between that one and the "300th record" message, 23 minutes pass. If you are waiting in front of the screen, that is more than enough to make you think the process is stuck. Yet your output file will still grow on roughly every event.
(The difference in time between the 100th and 200th event in my test is mostly due to the fact that I suspended the process in the evening and restarted it in the morning.)

Please try the suggestions from Tingjun Yang on issue #14187 and then try this one again, to see whether my guess is correct. If not, please reopen and update this ticket and change its status back to "Assigned".
About you being kicked out of dunegpvm04: that is strange to me. A LArSoft executable can't trigger a logout. Even in case of memory exhaustion the program would just crash (and anyway, the memory usage of this job is flat at less than 300 MB).
My best guess is either a sporadic network problem or an inactivity policy on dunegpvm04 (if I remember right, though, such inactivity policies exist only for the online servers). My suggestion is: learn how to use a terminal multiplexer (my recommendation goes to tmux), and never run a session outside of one again. This will solve both the connection and the inactivity-policy problems.

#8 Updated by Gianluca Petrillo almost 4 years ago

I have added a page about terminal multiplexers to the LArSoft wiki, which provides information to get started with them.

#9 Updated by Miriama Rajaoalisoa almost 4 years ago

Hi!

Thank you very much for your answer and for the LArSoft wiki page about tmux; they were really helpful. However, I have tried using a tmux session for the detector simulation several times, and I still get this error after the 100th event (as seen in the attached screenshot):

Write failed : Broken pipe

And then I am logged out of the Dunegpvm machines.

Afterwards, when I try to get back to the last tmux session using tmux attach, it seems like tmux just runs all the command lines from the beginning, and then I still get stuck after the 100th event.

I do not know whether it is because I am using tmux the wrong way, as this is just what I have done:
1. created a tmux session using tmux
2. ran all of the required commands I have cited before
3. got stuck after the 100th event and then got the broken pipe error
Afterwards, I used tmux attach, but then again I still got the broken pipe error.

And you are right about the network problem: indeed, I have a bad internet connection, and it takes about 70 seconds to run an event, so each batch of 100 events really takes time. However, even if it takes time, I still do not understand why I get the broken pipe error.

Thank you very much for all of your help.

#10 Updated by Gianluca Petrillo almost 4 years ago

Can you point me to the complete path of the input file you are using for the DetSim stage?
I can find only /dune/app/users/hrazafin/pdecay/g4/p2murho_2_g4.root (10 events) and /dune/app/users/hrazafin/pdecay2/g4/p2murho_2_g4.root (100 events).

As an aside, note that the speed of execution is slow because the GPVMs are slow, not because your internet connection is slow.

#11 Updated by Gianluca Petrillo almost 4 years ago

  • Status changed from Resolved to Feedback

#12 Updated by Miriama Rajaoalisoa almost 4 years ago

Hi!

This is the input file I have used:
/dune/app/users/mrajaoal/pdecay/g4/p2murho_2_g4.root

It is supposed to have 1000 events in it.

As for the speed of execution: does that mean that if I changed which GPVM machine I am using, the execution would be faster?

#13 Updated by Gianluca Petrillo almost 4 years ago

  • Status changed from Feedback to Accepted

Thank you. I am processing that file on dunegpvm06.fnal.gov and on my laptop. We'll see how it goes.

About GPVM speed: you can expect them to have roughly the same base performance (and not a bad one, either). But the real performance is determined by how many people are running on the same node, and by how busy the disk you are writing to is.
Neither is something that you can really check or control. Your job takes 20 hours, and the GPVM load changes much faster than that: you can't really do good planning in that situation.
(Incidentally, this job is taking 1'12"/event on a fast machine and about 1'30"/event on a GPVM: slower, but not spectacularly so.)

#14 Updated by Gianluca Petrillo almost 4 years ago

Miriama Rajaoalisoa wrote:

And you are right about the network problem, as indeed, I have a bad internet connexion and it takes about 70 seconds to run an event. Therefore, it really takes time to run each 100 other events. However, even if it takes time, I still do not understand why I got the Broken pipe error.

I realise that I took for granted a detail that is instead worth noting. The steps, starting from your local host, are:

  1. ssh to the GPVM:
    ssh -XY miriama@dunegpvm04.fnal.gov
  2. open tmux
    tmux
  3. in the window inside tmux, set up and run the command as you wrote in #14190#note-5

In other words, tmux must be run on the remote machine, not on the local one. If the connection drops, you have to SSH again into that same machine, and then run tmux attach.
I am running your command adding the --trace argument to the lar command, so that art prints each step it's doing. So far it processed up to event 133. My connection also broke once (so far), but there was no problem in reattaching the remote tmux session.
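The steps above can be sketched as a terminal transcript. The session name is an illustrative choice (any name works, and an unnamed session works too):

```shell
# 1. From the local host, SSH to the GPVM:
ssh -XY miriama@dunegpvm04.fnal.gov

# 2. On the remote machine, start a named tmux session:
tmux new -s detsim

# 3. Inside the tmux window, set up and launch the lar job as in note 5.
#    Detach at any time with Ctrl-b followed by d; the job keeps running.

# If the connection drops, SSH back into the SAME machine and reattach:
ssh -XY miriama@dunegpvm04.fnal.gov
tmux attach -t detsim
```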

#15 Updated by Miriama Rajaoalisoa almost 4 years ago

Hi!

Indeed, you are right, I have used tmux in my local terminal. I will try running it also in the remote machine, then, and I will tell you if I can reattach the session successfully or not.

Thank you very much for all of your help.

#16 Updated by Miriama Rajaoalisoa almost 4 years ago

So I have tried running tmux on the remote machine (after the ssh command), but it says "command not found". Probably I need to install tmux first, then? But my problem is that I do not have root privileges to run the yum install tmux command, nor to execute make install. Or is tmux already installed on the other GPVM machines? (It needs to be installed on the remote machine, doesn't it?)

#17 Updated by Lynn Garren almost 4 years ago

If tmux is not available, try screen. It should work in much the same way and I think it is available on the gpvm machines.
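For reference, the screen workflow mirrors the tmux one almost exactly, with the same caveat: run it on the remote machine, not locally.

```shell
# Start a screen session on the remote machine:
screen

# ... launch the lar job inside it ...

# Detach with Ctrl-a followed by d; the job keeps running.
# After reconnecting to the same machine, reattach with:
screen -r

# If several detached sessions exist, list them first:
screen -ls
```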

#18 Updated by Gianluca Petrillo almost 4 years ago

  • Status changed from Accepted to Resolved

Miriama Rajaoalisoa wrote:

So I have tried running tmux on the remote machine (after the ssh command), but it says "command not found". Probably I need to install tmux first, then? But my problem is that I do not have root privileges to run the yum install tmux command, nor to execute make install. Or is tmux already installed on the other GPVM machines? (It needs to be installed on the remote machine, doesn't it?)

That's true! I compiled my own: that's why it works for me. As Lynn says, basically the only advantage of GNU screen over tmux is that screen is pretty much everywhere. Study (or print) the key bindings from man screen, and you are in business.
But no computer worth the name comes without tmux. It is worth asking for it.
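A user-level tmux install needs no root privileges: the usual autoconf recipe accepts an installation prefix inside your home directory. This is only a sketch; it assumes the build dependencies (libevent, ncurses) are already present on the system, and $HOME/local is an arbitrary choice of prefix:

```shell
# From an unpacked tmux source tree, install under $HOME/local
# instead of the system directories (no root needed):
./configure --prefix=$HOME/local
make
make install

# Make the freshly built binary visible to the shell:
export PATH=$HOME/local/bin:$PATH
tmux -V    # check that the right tmux is picked up
```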

#19 Updated by Miriama Rajaoalisoa almost 4 years ago

Hi!

By using screen, I had no more problems with the detector simulation, and I was able to go through all of the 1000 events.
Also, I have now seen that tmux is available on the GPVM machines! I think my problem is entirely solved; once again, thank you very much for all of your help.

#20 Updated by Gianluca Petrillo almost 4 years ago

Great!
In fact, I had requested that tmux be installed on all GPVMs by default, and it has been done.

#21 Updated by Gianluca Petrillo almost 4 years ago

  • Status changed from Resolved to Closed

