Bug #24843

art job crashed but completed successfully when retried almost immediately

Added by Rob Kutschke about 2 months ago. Updated about 1 month ago.

Status: Assigned
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 08/23/2020
Due date:
% Done: 0%
Estimated time:
Occurs In:
Scope: Internal
Experiment: Mu2e
SSI Package:
Duration:

Description

Mu2e is looking for advice on how to proceed with the following.

Mu2e has a GitHub hook that runs CI on Jenkins as part of our process for reviewing Pull Requests. The Jenkins job does the requested merge into a temporary clone, builds the code, runs 6 short art jobs and runs a few other tests on the code base. For the 6 art runs we only check that art returns a good exit status.
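A minimal sketch of that kind of pass/fail check, assuming each art job is an ordinary command whose exit status signals success. The run_all helper and the stand-in commands are hypothetical; this is not the actual Jenkins script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: launch several commands in parallel, then check the
# exit status of each one. The commands are stand-ins, not the real job list.

run_all() {
  local pids=() names=() failed=0 cmd i
  for cmd in "$@"; do
    bash -c "$cmd" >/dev/null 2>&1 &   # start the job in the background
    pids+=("$!")
    names+=("$cmd")
  done
  for i in "${!pids[@]}"; do
    # wait returns the exit status of the given background job
    if ! wait "${pids[$i]}"; then
      echo "FAILED: ${names[$i]}"
      failed=$((failed + 1))
    fi
  done
  return "$failed"                     # number of failed jobs (0 = all good)
}

# Stand-in jobs: two succeed, one fails.
failed=0
run_all "true" "sleep 0.1" "false" || failed=$?
echo "jobs failed: $failed"
```

A check like this only sees exit codes, so a job that crashes intermittently passes whenever the crash does not happen to occur in that run.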

Last Friday we launched one of these and one of the 6 art jobs failed. Everything else was successful.

Soon after we resubmitted the Jenkins CI job and this time everything ran correctly. There were no intervening changes to the code.

Patrick Gartung checked the Jenkins logs and did not see any Jenkins issues around that time.

The job that failed on the first attempt was the only one of the 6 that runs art in MT mode; all others are sequential. This job asks for 5 schedules and 5 threads.

We will see if this is repeatable and report back.

In the meantime, here is a dump of background information that might prove useful.

The job ran on buildserver007, which has 32 hardware threads. It is an AMD Opteron(tm) Processor 6320 with 132 GB of RAM. The Jenkins jobs run on the bare machine and the job has the whole machine to itself.

We run all 6 art jobs in parallel. The 5 sequential jobs use one thread each and the MT job uses 5, so we had 10 threads active, well under 32.
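The thread tally can be written out as a small sketch. The total_threads helper is hypothetical, and the numbers are the ones quoted in this report.

```shell
#!/usr/bin/env bash
# Tally of threads in use by the CI art jobs, using the numbers quoted above.

total_threads() {
  local sequential_jobs=$1 mt_threads=$2
  # each sequential job uses one thread; the MT job uses mt_threads
  echo $((sequential_jobs + mt_threads))
}

hw_threads=32                      # hardware threads on buildserver007
active=$(total_threads 5 5)        # 5 sequential jobs plus one 5-thread MT job
echo "active threads: $active of $hw_threads"
```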

Patrick pointed out that the PR being tested removes many unused header files and adds explicit includes of other headers that were previously picked up indirectly through other headers. This could change the memory layout, thereby exposing a previously latent error.

The art job log for the failed job is: https://buildmaster.fnal.gov/buildmaster/job/GitHubPRTests/job/mu2e-offline-build-test//179/artifact/g4test_03MT.log

and for the successful job is: https://buildmaster.fnal.gov/buildmaster/job/GitHubPRTests/job/mu2e-offline-build-test//181/artifact/g4test_03MT.log

The full Jenkins logs are:

Failed: https://buildmaster.fnal.gov/buildmaster/job/GitHubPRTests/job/mu2e-offline-build-test//179/console
Successful: https://buildmaster.fnal.gov/buildmaster/job/GitHubPRTests/job/mu2e-offline-build-test//181/console

The PR conversation is at: https://github.com/Mu2e/Offline/pull/148

History

#1 Updated by Ryunosuke O'Neil about 2 months ago

I can reproduce this error at least once in about 20 tries on a mu2egpvm machine by running:

for i in {0..20}
do
  mu2e -n 10 -c Mu2eG4/fcl/g4test_03MT.fcl | tee g4test03mt$i.log
done

If you have access to the /mu2e/app mount, the build I used is located at /mu2e/app/users/roneil/test/Offline
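To estimate how often the failure occurs, the exit status of each run can be recorded. Note that with a pipe into tee, plain $? reports the status of tee, not of the job. The following is a minimal sketch, assuming the job returns a nonzero exit status when it crashes; count_failures, LOG_PREFIX and flaky are hypothetical names, and flaky is only a stand-in for the real command.

```shell
#!/usr/bin/env bash
# Sketch: repeat a command N times, log each run through tee, count failures.
# PIPESTATUS[0] holds the exit status of the first command in the pipeline;
# $? alone would give the status of tee.

count_failures() {
  local tries=$1; shift
  local failures=0 i status
  for ((i = 1; i <= tries; i++)); do
    "$@" 2>&1 | tee "${LOG_PREFIX:-run}_$i.log" >/dev/null
    status=${PIPESTATUS[0]}
    if [ "$status" -ne 0 ]; then
      echo "run $i failed with status $status"
      failures=$((failures + 1))
    fi
  done
  echo "failures: $failures / $tries"
}

# Stand-in for the real job: fails at random on some runs.
flaky() { [ $((RANDOM % 5)) -ne 0 ]; }

# Write the per-run logs into a scratch directory.
LOG_PREFIX="$(mktemp -d)/run" count_failures 20 flaky
```

For the real case, the stand-in would be replaced by the command from the loop above, for example count_failures 20 mu2e -n 10 -c Mu2eG4/fcl/g4test_03MT.fcl.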

#2 Updated by Kyle Knoepfel about 2 months ago

  • Assignee set to Kyle Knoepfel

We will investigate what is going on. It is not clear right now whether this is a problem with the framework, or whether it is an issue with the G4MT/TBB impedance mismatch.

#3 Updated by Kyle Knoepfel about 1 month ago

  • Status changed from New to Assigned

