Feature #15916

Generate and recover core files

Added by Gianluca Petrillo almost 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Start date: 03/20/2017
Due date:
% Done: 100%
Estimated time:
Duration:

Description

I would like the C.I. system to allow the generation of core dump files and their retrieval.
The relevant phases are both the unit test phase and the integration test phase.

The idea is that when a test fails, the submitter can be given access to the core file and can analyse it.
It is not clear to me yet whether such a core file will be useful [1]: that needs to be checked, possibly as a first step.

I am providing a unit test called segfault_test which will consistently produce a crash, in branch feature/gp_BreakingTest.
The test is located in larexamples/test and is executed during a regular ctest/mrb test in the directory ${MRB_BUILDDIR}/larexamples/test/segfault_test.d.
Setting ulimit -c unlimited, I get a core file of about 500 kB.
I am available to test any core file that you can produce, as a means to assess whether a foreign core file is of any use.

[1] Core files have a "special" relationship with the executable and libraries that produced them, and with the environment. I don't know to what extent a core file extracted from its environment can be used effectively.

UnrollCore.sh (2.38 KB): Script to extract information from a core file. Gianluca Petrillo, 03/23/2017 05:56 PM

History

#1 Updated by Vito Di Benedetto almost 4 years ago

  • Status changed from New to Assigned
  • Assignee set to Vito Di Benedetto

#2 Updated by Vito Di Benedetto almost 4 years ago

  • Status changed from Assigned to Work in progress
  • % Done changed from 0 to 20

I verified that I can set
ulimit -c unlimited
and I'm getting the core file generated by segfault_test unit test.

Next step is to have a procedure to copy the core file on a scratch dCache area.
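
A minimal sketch of the step verified above, assuming the limit should be raised only for the test process. The helper name run_with_cores and the test path in the usage comment are hypothetical; whether a core file actually appears also depends on the node's hard limit and core pattern settings.

```shell
#!/bin/sh
# run_with_cores is a hypothetical helper, not part of the C.I. code.
# It runs a command with the core file size limit raised, inside a
# subshell so that the caller's own limit is left untouched.
run_with_cores() {
  (
    # Raising the soft limit fails if the hard limit is lower: warn and go on.
    ulimit -c unlimited 2>/dev/null ||
      echo "warning: could not raise the core size limit" >&2
    "$@"
  )
}

# Example (illustrative path): run the crashing test.
# A shell reports death by SIGSEGV (signal 11) as exit status 139.
# run_with_cores ./segfault_test
```

The exit status of the wrapped command is passed through unchanged, so the test framework still sees the failure.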

#3 Updated by Gianluca Petrillo almost 4 years ago

I have received the core dump file of the test from Vito.
This is the backtrace which gdb prints when running locally:

#0  pleaseSegfault (value=1) at /scratch/petrillo/LArSoft/software/build/develop/prof/srcs/larexamples/test/segfault_test.cc:6
#1  Inner::result (this=<synthetic pointer>) at /scratch/petrillo/LArSoft/software/build/develop/prof/srcs/larexamples/test/segfault_test.cc:19
#2  Outer::compute (this=<synthetic pointer>) at /scratch/petrillo/LArSoft/software/build/develop/prof/srcs/larexamples/test/segfault_test.cc:32
#3  main (argc=<optimized out>, argv=<optimized out>) at /scratch/petrillo/LArSoft/software/build/develop/prof/srcs/larexamples/test/segfault_test.cc:49

Neat. Commands such as print and info locals, among others, return useful information.
The core file from the C.I. test, combined with my local test binary, confuses everything:
#0  0x0000000000400cbd in __libc_csu_init ()
#1  0x0000003f44a10758 in ?? ()
#2  0x00000000004003a8 in ?? ()
#3  0x0000000100000000 in ?? ()
#4  0x00000001000007f9 in ?? ()
#5  0x00007fec5ab19540 in ?? ()
#6  0x00007ffd30504dd0 in ?? ()
#7  0x0000003f448214e0 in ?? ()
#8  0x00007ffd30504ec0 in ?? ()
#9  0x00007ffd30504ee8 in ?? ()
#10 0x0000003f44821188 in ?? ()
#11 0x00007fec5a7d96c8 in ?? ()
#12 0x00000000f63d4e2e in ?? ()
#13 0x0000003f44609f0a in ?? ()
#14 0x0000000000000000 in ?? ()

Not neat. And of course no local symbol information.
Since the C.I. test is designed to build the libraries it is testing, and those binary files stay on the node, it is unlikely that we will have the right binaries available post mortem to interpret the core file correctly.

I can see two paths forward, and other paths might be possible that I am not aware of:
  1. transfer not only the core file, but also all the necessary binary files from the node; with my limited knowledge, "all the necessary" includes the whole $MRB_INSTALL area, which is large
  2. have gdb run on the node itself to extract some predefined information; this is not completely trivial, but probably possible.

#4 Updated by Vito Di Benedetto almost 4 years ago

The second option has the advantage of automating the core analysis and providing the results as part of the CI phase logs.

If you can provide the gdb commands that will produce the information you need, I'll implement this in the CI.

#5 Updated by Gianluca Petrillo almost 4 years ago

I am attaching a script that extracts information from every core file in the current (or specified) directory. For each core file, a file whose name starts with backtrace. is generated.

One tricky part is that we need to know the name of the executable. At least for the unit tests, that is extremely hard to discover. The script extracts that information from the core file itself. I don't know how it would perform when the executable does not exist.
The sequence of commands given to GDB is hard-coded in a function within the script.

This could be a starting point. The script would need to be run in any directory where there could be a core file.
I don't know whether GDB is available under OSX. Under Linux, the latest version should be set up, at least the v7_12 distributed via UPS. The system's own gdb is no good.
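
As an illustration of what a hard-coded GDB command sequence could look like, here is a minimal sketch. It is not the attached UnrollCore.sh: the function name core_report, the GDB variable, and the choice of commands (bt, info locals) are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch, not the attached UnrollCore.sh.
# Print the backtrace and the local variables of the crashing frame.
# Set GDB to choose a specific debugger (default: the gdb found in PATH).
core_report() {
  exe=$1    # executable that produced the core file
  core=$2   # core file to analyse
  "${GDB:-gdb}" --batch -ex 'bt' -ex 'info locals' "$exe" "$core"
}

# Write one report per core file in the current directory, assuming
# (for illustration only) that they all come from ./segfault_test.
# for core in core*; do
#   [ -f "$core" ] && core_report ./segfault_test "$core" > "backtrace.$core" 2>&1
# done
```

Running in batch mode makes gdb exit after the listed commands, so the output can be redirected to a log file without any interaction.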

#6 Updated by Vito Di Benedetto almost 4 years ago

  • % Done changed from 20 to 80

I opened a request to the Jenkins admin to have the core dump file name pattern changed to:
core.%p.%h.%t.%s.%e

where, according to the man page of core:
%p PID of dumped process
%s number of signal causing dump
%t time of dump, expressed as seconds since the Epoch (00:00h, 1 Jan 1970, UTC)
%h hostname (same as nodename returned by uname(2))
%e executable filename (without path prefix)
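
A minimal sketch of how a file name written with this pattern could be split back into its fields. The helper parse_core_name and the sample file name are hypothetical. Because the host name can contain dots, the numeric time and signal fields are used as anchors; an executable name that itself contains dot-separated numbers could still be ambiguous.

```shell
#!/bin/sh
# parse_core_name is a hypothetical helper, not part of UnrollCore.sh.
# It splits a core file name of the form core.%p.%h.%t.%s.%e into fields
# and prints nothing if the name does not follow that pattern.
parse_core_name() {
  # Drop any leading directory, then match the five dot-separated fields.
  printf '%s\n' "${1##*/}" | sed -nE \
    's/^core\.([0-9]+)\.(.+)\.([0-9]+)\.([0-9]+)\.(.+)$/pid=\1 host=\2 time=\3 signal=\4 exe=\5/p'
}

# Example with a made-up file name:
# parse_core_name /tmp/core.1234.node01.fnal.gov.1490288160.11.segfault_test
# prints: pid=1234 host=node01.fnal.gov time=1490288160 signal=11 exe=segfault_test
```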

I updated the UnrollCore.sh file so that it can retrieve the required information.

The results from a test CI build that ran the unit test with the segmentation fault are available here:

http://dbweb6.fnal.gov:8080/TestCI/app/build_detail/phase_details?build_id=lar_ci_test/323&platform=Linux%202.6.32-642.13.1.el6.x86_64&phase=unit_test&buildtype=slf6%20e10:prof

That page links to the backtrace produced by gdb.
The log appears to contain useful backtrace information.

#7 Updated by Gianluca Petrillo almost 4 years ago

Using the tag %E, which includes the full path, would make the job of UnrollCore.sh easier and safer.
We would still need a special case for OSX. Currently, UnrollCore.sh will not work on OSX because there is no gdb on that system. Once the procedures are established for Linux, UnrollCore.sh should be extended to support whatever debugger is available on those OSX machines (lldb, I would assume). But OSX does not seem to have the fancy core file naming (the documentation, in fact, asserts that all core files are placed in /cores, which I could confirm on an OSX 10.12 laptop).
Under OSX, the tag %E is not available, but there is an equivalent of %e. I could not figure out a way to leave the core file in the execution directory, so we would have to pick the core files up from /cores (or wherever the system setting places them).

#8 Updated by Vito Di Benedetto almost 4 years ago

Gianluca Petrillo wrote:

Using the tag %E, which includes the full path, would make the job of UnrollCore.sh easier and safer.
We would still need a special case for OSX. Currently, UnrollCore.sh will not work on OSX because there is no gdb under that system. When the procedures are established for Linux, then UnrollCore.sh should be extended to support whatever debugger is available on those OSX machines (lldb, I would assume). But OSX does not seem to have the fancy core file naming (the documentation, in fact, asserts all core files are placed in /cores, which I could confirm on a OSX 10.12 laptop).

Unfortunately, the tag %E is not available on SLF6, but it is on SLF7.
Once the CI build is migrated to SLF7, we can take advantage of the %E tag.

#9 Updated by Vito Di Benedetto almost 4 years ago

  • Target version set to v1_4_0_RC

#10 Updated by Vito Di Benedetto almost 4 years ago

  • Target version changed from v1_4_0_RC to v1_3_0_RC

#11 Updated by Vito Di Benedetto over 3 years ago

  • Target version changed from v1_3_0_RC to v1_3_0

#12 Updated by Vito Di Benedetto over 3 years ago

  • Status changed from Work in progress to Resolved
  • % Done changed from 80 to 100

Support for analyzing core dump files from tests crashing on Mac OSX has been added.
This will be available with the next CI release, v1_3_0.

#13 Updated by Vito Di Benedetto over 3 years ago

  • Status changed from Resolved to Closed
