Bug #16360

artdaq 2.x problems when running with mvapich2

Added by Kurt Biery over 2 years ago. Updated 6 months ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 04/28/2017
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

As part of trying to update ds50daq to use artdaq 2.02.01, I noticed that several artdaq unit tests fail when running on the DarkSide teststand at Fermilab in FCC (dsfr6, via ds50ws.fnal.gov) with an Infiniband build (mvapich2), as compared to an Ethernet build (mpich).

I'll use this issue to document what I see.


Related issues

Related to artdaq - Bug #16361: The v2_00_00 implementation of DataReceiverManager can drop fragments if the EndOfData fragment is not the last one received. (Closed, 04/28/2017)

History

#1 Updated by Kurt Biery over 2 years ago

Some high-level results:

  • with artdaq v1_13_03, quals "eth e10 s41 -p"...
    • all unit tests pass (using "buildtool -t")
Test project /home/biery/scratch/artdaqBuild
      Start  1: GenericFragmentSimulator_t
      Start  2: s_r_handles_t
      Start  4: FragCounter_t
      Start  5: EventStore_t
      Start  6: raw_event_queue_reader_t
      Start  7: read_files_t
      Start  8: driver_t
      Start  9: CommandableFragmentGenerator_t
      Start 10: config_t
      Start 11: config_with_art_t
      Start 12: genToArt_t
      Start 13: genToArt_outToBinaryFileOutput_t
      Start 14: genToArt_outToBinaryMPIOutput_t
 1/14 Test #11: config_with_art_t ..................   Passed    0.21 sec
 2/14 Test  #4: FragCounter_t ......................   Passed    0.42 sec
 3/14 Test  #1: GenericFragmentSimulator_t .........   Passed    0.43 sec
 4/14 Test  #2: s_r_handles_t ......................   Passed    0.43 sec
 5/14 Test #10: config_t ...........................   Passed    0.41 sec
 6/14 Test  #9: CommandableFragmentGenerator_t .....   Passed    0.72 sec
 7/14 Test  #6: raw_event_queue_reader_t ...........   Passed    1.12 sec
 8/14 Test  #5: EventStore_t .......................   Passed    1.52 sec
 9/14 Test  #8: driver_t ...........................   Passed    1.62 sec
10/14 Test  #7: read_files_t .......................   Passed    2.40 sec
11/14 Test #13: genToArt_outToBinaryFileOutput_t ...   Passed    2.39 sec
12/14 Test #12: genToArt_t .........................   Passed    2.39 sec
13/14 Test #14: genToArt_outToBinaryMPIOutput_t ....   Passed    2.39 sec
      Start  3: daqrate_gen_test
14/14 Test  #3: daqrate_gen_test ...................   Passed    1.40 sec

100% tests passed, 0 tests failed out of 14

Total Test time (real) =   3.83 sec
  • with artdaq v1_13_03, quals "ib e10 s41 -p"...
    • all unit tests pass (using "buildtool -t")
Test project /home/biery/scratch/artdaqBuild
      Start  1: GenericFragmentSimulator_t
      Start  2: s_r_handles_t
      Start  4: FragCounter_t
      Start  5: EventStore_t
      Start  6: raw_event_queue_reader_t
      Start  7: read_files_t
      Start  8: driver_t
      Start  9: CommandableFragmentGenerator_t
      Start 10: config_t
      Start 11: config_with_art_t
      Start 12: genToArt_t
      Start 13: genToArt_outToBinaryFileOutput_t
      Start 14: genToArt_outToBinaryMPIOutput_t
 1/14 Test  #9: CommandableFragmentGenerator_t .....   Passed    0.31 sec
 2/14 Test  #1: GenericFragmentSimulator_t .........   Passed    0.53 sec
 3/14 Test  #4: FragCounter_t ......................   Passed    0.52 sec
 4/14 Test  #2: s_r_handles_t ......................   Passed    1.03 sec
 5/14 Test #11: config_with_art_t ..................   Passed    1.31 sec
 6/14 Test  #5: EventStore_t .......................   Passed    1.32 sec
 7/14 Test  #6: raw_event_queue_reader_t ...........   Passed    1.42 sec
 8/14 Test #10: config_t ...........................   Passed    1.41 sec
 9/14 Test  #8: driver_t ...........................   Passed    2.02 sec
10/14 Test  #7: read_files_t .......................   Passed    2.62 sec
11/14 Test #12: genToArt_t .........................   Passed    5.59 sec
12/14 Test #13: genToArt_outToBinaryFileOutput_t ...   Passed    5.60 sec
13/14 Test #14: genToArt_outToBinaryMPIOutput_t ....   Passed    5.62 sec
      Start  3: daqrate_gen_test
14/14 Test  #3: daqrate_gen_test ...................   Passed    2.00 sec

100% tests passed, 0 tests failed out of 14

Total Test time (real) =   7.65 sec
  • with artdaq v2_00_00, quals "eth e10 s41 -p"...
    • all unit tests pass (using "buildtool -t")
Test project /home/biery/scratch/artdaqBuild
      Start  1: GenericFragmentSimulator_t
      Start  2: transfer_driver_t
      Start  4: FragCounter_t
      Start  5: EventStore_t
      Start  6: raw_event_queue_reader_t
      Start  7: daq_flow_t
      Start  8: read_files_t
      Start  9: driver_t
      Start 10: CommandableFragmentGenerator_t
      Start 11: config_t
      Start 12: config_with_art_t
      Start 13: genToArt_t
      Start 14: genToArt_outToBinaryFileOutput_t
 1/14 Test  #4: FragCounter_t ......................   Passed    0.44 sec
 2/14 Test  #1: GenericFragmentSimulator_t .........   Passed    0.44 sec
 3/14 Test #10: CommandableFragmentGenerator_t .....   Passed    0.72 sec
 4/14 Test #11: config_t ...........................   Passed    0.72 sec
 5/14 Test #12: config_with_art_t ..................   Passed    0.72 sec
 6/14 Test  #9: driver_t ...........................   Passed    1.43 sec
 7/14 Test  #6: raw_event_queue_reader_t ...........   Passed    1.64 sec
 8/14 Test  #5: EventStore_t .......................   Passed    1.64 sec
 9/14 Test  #2: transfer_driver_t ..................   Passed    2.14 sec
10/14 Test #14: genToArt_outToBinaryFileOutput_t ...   Passed    2.37 sec
11/14 Test #13: genToArt_t .........................   Passed    2.38 sec
12/14 Test  #8: read_files_t .......................   Passed    2.54 sec
13/14 Test  #7: daq_flow_t .........................   Passed    5.36 sec
      Start  3: daqrate_gen_test
14/14 Test  #3: daqrate_gen_test ...................   Passed    1.56 sec

100% tests passed, 0 tests failed out of 14

Total Test time (real) =   6.94 sec
  • with artdaq v2_00_00, quals "ib e10 s41 -p"...
    • one unit test failed and another one ran forever (using "buildtool -t")
Test project /home/biery/scratch/artdaqBuild
      Start  1: GenericFragmentSimulator_t
      Start  2: transfer_driver_t
      Start  4: FragCounter_t
      Start  5: EventStore_t
      Start  6: raw_event_queue_reader_t
      Start  7: daq_flow_t
      Start  8: read_files_t
      Start  9: driver_t
      Start 10: CommandableFragmentGenerator_t
      Start 11: config_t
      Start 12: config_with_art_t
      Start 13: genToArt_t
      Start 14: genToArt_outToBinaryFileOutput_t
 1/14 Test  #1: GenericFragmentSimulator_t .........   Passed    0.11 sec
 2/14 Test  #4: FragCounter_t ......................   Passed    0.51 sec
 3/14 Test #10: CommandableFragmentGenerator_t .....   Passed    0.90 sec
 4/14 Test #11: config_t ...........................   Passed    1.30 sec
 5/14 Test  #9: driver_t ...........................   Passed    1.50 sec
 6/14 Test  #6: raw_event_queue_reader_t ...........   Passed    1.51 sec
 7/14 Test  #5: EventStore_t .......................   Passed    1.81 sec
 8/14 Test #12: config_with_art_t ..................   Passed    1.90 sec
 9/14 Test  #8: read_files_t .......................   Passed    2.55 sec
10/14 Test #13: genToArt_t .........................   Passed    5.75 sec
11/14 Test #14: genToArt_outToBinaryFileOutput_t ...   Passed    5.74 sec
12/14 Test  #7: daq_flow_t .........................   Passed    6.48 sec

13/14 Test  #2: transfer_driver_t ..................***Failed   18.57 sec
      Start  3: daqrate_gen_test
^C

Of course, the set of unit tests changed with v2.x, but daqrate_gen_test is common to both 1.13 and 2.0.

#2 Updated by Kurt Biery over 2 years ago

Focusing on transfer_driver_t...

With an Ethernet build of v2_00_00...

cd $BUILD_DIR/test/DAQrate/transfer_driver_t.d
mpirun -hosts localhost -np 5 transfer_driver_mpi transfer_driver_mpi.fcl
...
Receiver 4 received fragment 1475 with seqID 951 from Sender 0 (Expecting 25 more)
Receiver 3 received fragment 1475 with seqID 998 from Sender 1 (Expecting 25 more)
Receiver 3 received fragment 1476 with seqID 950 from Sender 0 (Expecting 24 more)
Sender 1 sent fragment 999
Receiver 4 received fragment 1476 with seqID 999 from Sender 1 (Expecting 24 more)
Sender 0 sent fragment 952
Receiver 4 received EndOfData Fragment from Sender 1
Sent 1048576000 bytes in 1.46389 seconds ( 683.110992 MB/s ).
Receiver 3 received fragment 1477 with seqID 952 from Sender 0 (Expecting 23 more)
Receiver 3 received EndOfData Fragment from Sender 1
Sender 0 sent fragment 953
Receiver 4 received fragment 1477 with seqID 953 from Sender 0 (Expecting 23 more)
Sender 0 sent fragment 954
Receiver 3 received fragment 1478 with seqID 954 from Sender 0 (Expecting 22 more)
Sender 0 sent fragment 955
Receiver 4 received fragment 1478 with seqID 955 from Sender 0 (Expecting 22 more)
Sender 0 sent fragment 956
Receiver 3 received fragment 1479 with seqID 956 from Sender 0 (Expecting 21 more)
Sender 0 sent fragment 957
Receiver 4 received fragment 1479 with seqID 957 from Sender 0 (Expecting 21 more)
Sender 0 sent fragment 958
Receiver 3 received fragment 1480 with seqID 958 from Sender 0 (Expecting 20 more)
Sender 0 sent fragment 959
Receiver 4 received fragment 1480 with seqID 959 from Sender 0 (Expecting 20 more)
Sender 0 sent fragment 960
Receiver 3 received fragment 1481 with seqID 960 from Sender 0 (Expecting 19 more)
Sender 0 sent fragment 961
Receiver 4 received fragment 1481 with seqID 961 from Sender 0 (Expecting 19 more)
Sender 0 sent fragment 962
Receiver 3 received fragment 1482 with seqID 962 from Sender 0 (Expecting 18 more)
Sender 0 sent fragment 963
Receiver 4 received fragment 1482 with seqID 963 from Sender 0 (Expecting 18 more)
Sender 0 sent fragment 964
Receiver 3 received fragment 1483 with seqID 964 from Sender 0 (Expecting 17 more)
Sender 0 sent fragment 965
Receiver 4 received fragment 1483 with seqID 965 from Sender 0 (Expecting 17 more)
Sender 0 sent fragment 966
Receiver 3 received fragment 1484 with seqID 966 from Sender 0 (Expecting 16 more)
Sender 0 sent fragment 967
Receiver 4 received fragment 1484 with seqID 967 from Sender 0 (Expecting 16 more)
Sender 0 sent fragment 968
Receiver 3 received fragment 1485 with seqID 968 from Sender 0 (Expecting 15 more)
Sender 0 sent fragment 969
Receiver 4 received fragment 1485 with seqID 969 from Sender 0 (Expecting 15 more)
Sender 0 sent fragment 970
Receiver 3 received fragment 1486 with seqID 970 from Sender 0 (Expecting 14 more)
Sender 0 sent fragment 971
Receiver 4 received fragment 1486 with seqID 971 from Sender 0 (Expecting 14 more)
Sender 0 sent fragment 972
Receiver 3 received fragment 1487 with seqID 972 from Sender 0 (Expecting 13 more)
Sender 0 sent fragment 973
Receiver 4 received fragment 1487 with seqID 973 from Sender 0 (Expecting 13 more)
Sender 0 sent fragment 974
Receiver 3 received fragment 1488 with seqID 974 from Sender 0 (Expecting 12 more)
Sender 0 sent fragment 975
Receiver 4 received fragment 1488 with seqID 975 from Sender 0 (Expecting 12 more)
Sender 0 sent fragment 976
Receiver 3 received fragment 1489 with seqID 976 from Sender 0 (Expecting 11 more)
Sender 0 sent fragment 977
Receiver 4 received fragment 1489 with seqID 977 from Sender 0 (Expecting 11 more)
Sender 0 sent fragment 978
Receiver 3 received fragment 1490 with seqID 978 from Sender 0 (Expecting 10 more)
Sender 0 sent fragment 979
Receiver 4 received fragment 1490 with seqID 979 from Sender 0 (Expecting 10 more)
Sender 0 sent fragment 980
Receiver 3 received fragment 1491 with seqID 980 from Sender 0 (Expecting 9 more)
Sender 0 sent fragment 981
Receiver 4 received fragment 1491 with seqID 981 from Sender 0 (Expecting 9 more)
Sender 0 sent fragment 982
Receiver 3 received fragment 1492 with seqID 982 from Sender 0 (Expecting 8 more)
Sender 0 sent fragment 983
Receiver 4 received fragment 1492 with seqID 983 from Sender 0 (Expecting 8 more)
Sender 0 sent fragment 984
Receiver 3 received fragment 1493 with seqID 984 from Sender 0 (Expecting 7 more)
Sender 0 sent fragment 985
Receiver 4 received fragment 1493 with seqID 985 from Sender 0 (Expecting 7 more)
Sender 0 sent fragment 986
Receiver 3 received fragment 1494 with seqID 986 from Sender 0 (Expecting 6 more)
Sender 0 sent fragment 987
Receiver 4 received fragment 1494 with seqID 987 from Sender 0 (Expecting 6 more)
Sender 0 sent fragment 988
Receiver 3 received fragment 1495 with seqID 988 from Sender 0 (Expecting 5 more)
Sender 0 sent fragment 989
Receiver 4 received fragment 1495 with seqID 989 from Sender 0 (Expecting 5 more)
Sender 0 sent fragment 990
Receiver 3 received fragment 1496 with seqID 990 from Sender 0 (Expecting 4 more)
Sender 0 sent fragment 991
Receiver 4 received fragment 1496 with seqID 991 from Sender 0 (Expecting 4 more)
Sender 0 sent fragment 992
Receiver 3 received fragment 1497 with seqID 992 from Sender 0 (Expecting 3 more)
Sender 0 sent fragment 993
Receiver 4 received fragment 1497 with seqID 993 from Sender 0 (Expecting 3 more)
Sender 0 sent fragment 994
Receiver 3 received fragment 1498 with seqID 994 from Sender 0 (Expecting 2 more)
Sender 0 sent fragment 995
Receiver 4 received fragment 1498 with seqID 995 from Sender 0 (Expecting 2 more)
Sender 0 sent fragment 996
Receiver 3 received fragment 1499 with seqID 996 from Sender 0 (Expecting 1 more)
Sender 0 sent fragment 997
Receiver 4 received fragment 1499 with seqID 997 from Sender 0 (Expecting 1 more)
Sender 0 sent fragment 998
Receiver 3 received fragment 1500 with seqID 998 from Sender 0 (Expecting 0 more)
Sender 0 sent fragment 999
Receiver 4 received fragment 1500 with seqID 999 from Sender 0 (Expecting 0 more)
Receiver 3 received EndOfData Fragment from Sender 0
Receiver 4 received EndOfData Fragment from Sender 0
Sent 1048576000 bytes in 1.49716 seconds ( 667.929932 MB/s ).
Received 1572864000 bytes in 1.47178 seconds ( 1019.173926 MB/s ).
Received 1572864000 bytes in 1.47338 seconds ( 1018.065677 MB/s ).

With an Infiniband build of v2_00_00...

cd test/DAQrate/transfer_driver_t.d
mpirun -hosts localhost -np 5 transfer_driver_mpi transfer_driver_mpi.fcl
...
Receiver 4 received fragment 1493 with seqID 997 from Sender 1 (Expecting 7 more)
Sender 2 sent fragment 992
Receiver 3 received fragment 1496 with seqID 998 from Sender 1 (Expecting 4 more)
Sender 1 sent fragment 999
Receiver 3 received EndOfData Fragment from Sender 1
Sent 1048576000 bytes in 1.51016 seconds ( 662.183420 MB/s ).
Receiver 4 received fragment 1494 with seqID 991 from Sender 2 (Expecting 6 more)
Sender 2 sent fragment 993
Receiver 3 received fragment 1497 with seqID 992 from Sender 2 (Expecting 3 more)
Receiver 4 received fragment 1495 with seqID 999 from Sender 1 (Expecting 5 more)
Receiver 4 received EndOfData Fragment from Sender 1
Sender 2 sent fragment 994
Receiver 4 received fragment 1496 with seqID 993 from Sender 2 (Expecting 4 more)
Sender 2 sent fragment 995
Receiver 3 received fragment 1498 with seqID 994 from Sender 2 (Expecting 2 more)
Sender 2 sent fragment 996
Receiver 4 received fragment 1497 with seqID 995 from Sender 2 (Expecting 3 more)
Sender 2 sent fragment 997
Receiver 3 received fragment 1499 with seqID 996 from Sender 2 (Expecting 1 more)
Sender 2 sent fragment 998
Receiver 4 received fragment 1498 with seqID 997 from Sender 2 (Expecting 2 more)
Sender 2 sent fragment 999
Receiver 3 received fragment 1500 with seqID 998 from Sender 2 (Expecting 0 more)
Receiver 3 received EndOfData Fragment from Sender 2
Sent 1048576000 bytes in 1.51118 seconds ( 661.734969 MB/s ).
Received 1572864000 bytes in 1.57092 seconds ( 954.851698 MB/s ).
Receiver 4 received fragment 1499 with seqID 999 from Sender 2 (Expecting 1 more)
Receiver 4 received EndOfData Fragment from Sender 2
Received 1571815424 bytes in 1.51358 seconds ( 990.364217 MB/s ).

 *** Break *** segmentation violation

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x0000003ccecabfdd in __libc_waitpid (pid=<value optimized out>, stat_loc=<value optimized out>, options=<value optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:41
#1  0x0000003ccec3e899 in do_system (line=<value optimized out>) at ../sysdeps/posix/system.c:149
#2  0x0000003ccec3ebd0 in __libc_system (line=<value optimized out>) at ../sysdeps/posix/system.c:190
#3  0x00007fca777c51a4 in TUnixSystem::StackTrace() () from /products/root/v6_06_04b/Linux64bit+2.6-2.12-e10-prof/lib/libCore.so
#4  0x00007fca777c72bc in TUnixSystem::DispatchSignals(ESignals) () from /products/root/v6_06_04b/Linux64bit+2.6-2.12-e10-prof/lib/libCore.so
#5  <signal handler called>
#6  _int_free (av=0x7fca4c000020, mem=<value optimized out>) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:4404
#7  0x00007fca75eccf6b in free (mem=0x7fca4c0008c0) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3496
#8  0x0000003cce411199 in _dl_deallocate_tls (tcb=0x7fca5c676700, dealloc_tcb=false) at dl-tls.c:481
#9  0x0000003ccf0065dd in __free_stacks (limit=41943040) at allocatestack.c:278
#10 0x0000003ccf00776a in queue_stack (pd=0x7fca6eefc700) at allocatestack.c:306
#11 __deallocate_stack (pd=0x7fca6eefc700) at allocatestack.c:750
#12 __free_tcb (pd=0x7fca6eefc700) at pthread_create.c:222
#13 0x0000003ccf008074 in pthread_join (threadid=140507421329152, thread_return=0x0) at pthread_join.c:110
#14 0x00007fca73fb4697 in std::thread::join() () at /scratch/workspace/art-build-base/v4_9_3a/SLF6/build/gcc/v4_9_3a/build/Linux64bit+2.6-2.12/src/gcc-4.9.3/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:668
#15 0x00007fca74d0668b in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1e55790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:49
#16 0x00007fca74d06771 in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1e55790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:51
#17 0x0000003ccec35db2 in __run_exit_handlers (status=0) at exit.c:78
#18 exit (status=0) at exit.c:100
#19 0x0000003ccec1ece4 in __libc_start_main (main=0x40b420 <main(int, char**)>, argc=2, ubp_av=0x7fffc7b18848, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffc7b18838) at libc-start.c:258
#20 0x0000000000408c09 in _start ()
===========================================================

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  _int_free (av=0x7fca4c000020, mem=<value optimized out>) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:4404
#7  0x00007fca75eccf6b in free (mem=0x7fca4c0008c0) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3496
#8  0x0000003cce411199 in _dl_deallocate_tls (tcb=0x7fca5c676700, dealloc_tcb=false) at dl-tls.c:481
#9  0x0000003ccf0065dd in __free_stacks (limit=41943040) at allocatestack.c:278
#10 0x0000003ccf00776a in queue_stack (pd=0x7fca6eefc700) at allocatestack.c:306
#11 __deallocate_stack (pd=0x7fca6eefc700) at allocatestack.c:750
#12 __free_tcb (pd=0x7fca6eefc700) at pthread_create.c:222
#13 0x0000003ccf008074 in pthread_join (threadid=140507421329152, thread_return=0x0) at pthread_join.c:110
#14 0x00007fca73fb4697 in std::thread::join() () at /scratch/workspace/art-build-base/v4_9_3a/SLF6/build/gcc/v4_9_3a/build/Linux64bit+2.6-2.12/src/gcc-4.9.3/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:668
#15 0x00007fca74d0668b in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1e55790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:49
#16 0x00007fca74d06771 in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1e55790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:51
#17 0x0000003ccec35db2 in __run_exit_handlers (status=0) at exit.c:78
#18 exit (status=0) at exit.c:100
#19 0x0000003ccec1ece4 in __libc_start_main (main=0x40b420 <main(int, char**)>, argc=2, ubp_av=0x7fffc7b18848, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffc7b18838) at libc-start.c:258
#20 0x0000000000408c09 in _start ()
===========================================================

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x0000003ccecabfdd in __libc_waitpid (pid=<value optimized out>, stat_loc=<value optimized out>, options=<value optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:41
#1  0x0000003ccec3e899 in do_system (line=<value optimized out>) at ../sysdeps/posix/system.c:149
#2  0x0000003ccec3ebd0 in __libc_system (line=<value optimized out>) at ../sysdeps/posix/system.c:190
#3  0x00007f2f4b5f51a4 in TUnixSystem::StackTrace() () from /products/root/v6_06_04b/Linux64bit+2.6-2.12-e10-prof/lib/libCore.so
#4  0x00007f2f4b5f72bc in TUnixSystem::DispatchSignals(ESignals) () from /products/root/v6_06_04b/Linux64bit+2.6-2.12-e10-prof/lib/libCore.so
#5  <signal handler called>
#6  _int_free (av=0x7f2f28000020, mem=<value optimized out>) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:4404
#7  0x00007f2f49cfcf6b in free (mem=0x7f2f280008c0) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3496
#8  0x0000003cce411199 in _dl_deallocate_tls (tcb=0x7f2f303e1700, dealloc_tcb=false) at dl-tls.c:481
#9  0x0000003ccf0065dd in __free_stacks (limit=41943040) at allocatestack.c:278
#10 0x0000003ccf00776a in queue_stack (pd=0x7f2f42d2c700) at allocatestack.c:306
#11 __deallocate_stack (pd=0x7f2f42d2c700) at allocatestack.c:750
#12 __free_tcb (pd=0x7f2f42d2c700) at pthread_create.c:222
#13 0x0000003ccf008074 in pthread_join (threadid=139840961300224, thread_return=0x0) at pthread_join.c:110
#14 0x00007f2f47de4697 in std::thread::join() () at /scratch/workspace/art-build-base/v4_9_3a/SLF6/build/gcc/v4_9_3a/build/Linux64bit+2.6-2.12/src/gcc-4.9.3/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:668
#15 0x00007f2f48b3668b in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1b6f790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:49
#16 0x00007f2f48b36771 in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1b6f790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:51
#17 0x0000003ccec35db2 in __run_exit_handlers (status=0) at exit.c:78
#18 exit (status=0) at exit.c:100
#19 0x0000003ccec1ece4 in __libc_start_main (main=0x40b420 <main(int, char**)>, argc=2, ubp_av=0x7fffc129be88, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffc129be78) at libc-start.c:258
#20 0x0000000000408c09 in _start ()
===========================================================

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  _int_free (av=0x7f2f28000020, mem=<value optimized out>) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:4404
#7  0x00007f2f49cfcf6b in free (mem=0x7f2f280008c0) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3496
#8  0x0000003cce411199 in _dl_deallocate_tls (tcb=0x7f2f303e1700, dealloc_tcb=false) at dl-tls.c:481
#9  0x0000003ccf0065dd in __free_stacks (limit=41943040) at allocatestack.c:278
#10 0x0000003ccf00776a in queue_stack (pd=0x7f2f42d2c700) at allocatestack.c:306
#11 __deallocate_stack (pd=0x7f2f42d2c700) at allocatestack.c:750
#12 __free_tcb (pd=0x7f2f42d2c700) at pthread_create.c:222
#13 0x0000003ccf008074 in pthread_join (threadid=139840961300224, thread_return=0x0) at pthread_join.c:110
#14 0x00007f2f47de4697 in std::thread::join() () at /scratch/workspace/art-build-base/v4_9_3a/SLF6/build/gcc/v4_9_3a/build/Linux64bit+2.6-2.12/src/gcc-4.9.3/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:668
#15 0x00007f2f48b3668b in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1b6f790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:49
#16 0x00007f2f48b36771 in mf::service::MessageServicePresence::~MessageServicePresence (this=0x1b6f790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:51
#17 0x0000003ccec35db2 in __run_exit_handlers (status=0) at exit.c:78
#18 exit (status=0) at exit.c:100
#19 0x0000003ccec1ece4 in __libc_start_main (main=0x40b420 <main(int, char**)>, argc=2, ubp_av=0x7fffc129be88, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffc129be78) at libc-start.c:258
#20 0x0000000000408c09 in _start ()
===========================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

#3 Updated by Kurt Biery over 2 years ago

Something to note in the previous test results is that the program stops execution before all of the data has arrived. In additional tests, the number of dropped fragments has varied from 1 to 3 or 4. I'll file a separate issue for fixing that.

It should be noted that the dropped fragments do not seem to be related to the crash. I've made candidate changes to the code to avoid losing fragments, and I still see the crash.
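
To illustrate the fragment-loss behavior described above, here is a standalone sketch of the general hazard. It is not the actual DataReceiverManager code; the Fragment/FragType types and the receiveUntilAllEndOfData function are invented for the example. A receiver that stops reading once it has seen an EndOfData marker from every sender will silently miss any data fragments still in flight when, as in the logs above, a sender's EndOfData arrives before that sender's last data fragment.

#include <cstddef>
#include <iostream>
#include <set>
#include <vector>

// Illustrative sketch only (not the artdaq code).
enum class FragType { Data, EndOfData };
struct Fragment { FragType type; int sender; };

// Count data fragments, stopping as soon as EndOfData has been seen from
// every sender (the problematic policy).
size_t receiveUntilAllEndOfData(const std::vector<Fragment>& arrival_order,
                                int num_senders)
{
  std::set<int> finished;
  size_t data_fragments = 0;
  for (const auto& f : arrival_order) {
    if (f.type == FragType::EndOfData) {
      finished.insert(f.sender);  // sender marked "done", but its last data
                                  // fragments may still be behind this marker
    } else {
      ++data_fragments;
    }
    if (finished.size() == static_cast<size_t>(num_senders)) break;
  }
  return data_fragments;  // can be smaller than the number actually sent
}

int main()
{
  // Arrival order modeled on the logs above: EndOfData from sender 1 shows up
  // before that sender's final data fragment, so one fragment is never counted.
  std::vector<Fragment> arrivals = {
      {FragType::Data, 0}, {FragType::Data, 1},
      {FragType::EndOfData, 1},
      {FragType::Data, 0}, {FragType::EndOfData, 0},
      {FragType::Data, 1}};  // dropped: the loop has already stopped
  std::cout << "counted " << receiveUntilAllEndOfData(arrivals, 2)
            << " of 4 data fragments" << std::endl;
  return 0;
}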

#4 Updated by Kurt Biery over 2 years ago

  • Related to Bug #16361: The v2_00_00 implementation of DataReceiverManager can drop fragments if the EndOfData fragment is not the last one received. added

#5 Updated by Kurt Biery over 2 years ago

More trivia...

I've seen the following combinations of senders and receivers succeed (that is, the job finishes without crashing):
  • 1x1, 1x2, 2x1, 2x2, 2x3, 2x4, and 2x5
I've seen the following combinations fail:
  • 3x2, 3x1, and 4x1

It turns out that the combination that is used in this unit test is 3x2.
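
These by-hand variations were done by changing the sender/receiver counts in the test configuration. For reference, a sketch of what the 3x2 case might look like in transfer_driver_mpi.fcl, with parameter names and values taken from the ParameterSet dump in a later note; the exact layout of the file is an assumption:

# Illustrative sketch only; parameter names are from the configuration dump below.
transfer_plugin_type: "MPI"
num_senders: 3          # ranks 0-2 send
num_receivers: 2        # ranks 3-4 receive
sends_per_sender: 1000
fragment_size: 1.048576e6
buffer_count: 10
sources: {
  s0: { transferPluginType: "MPI" source_rank: 0 max_fragment_size_words: 1.048576e6 buffer_count: 10 }
  s1: { transferPluginType: "MPI" source_rank: 1 max_fragment_size_words: 1.048576e6 buffer_count: 10 }
  s2: { transferPluginType: "MPI" source_rank: 2 max_fragment_size_words: 1.048576e6 buffer_count: 10 }
}
destinations: {
  d3: { transferPluginType: "MPI" destination_rank: 3 max_fragment_size_words: 1.048576e6 buffer_count: 10 }
  d4: { transferPluginType: "MPI" destination_rank: 4 max_fragment_size_words: 1.048576e6 buffer_count: 10 }
}

Presumably, varying the combination just means adjusting these counts, the corresponding sources/destinations blocks, and the -np argument to mpirun to match the total number of ranks.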

#6 Updated by Kurt Biery over 2 years ago

The problem seems to be somehow related to (or tripped by?) MPI_Finalize. When I comment out that call in transfer_driver_mpi.cc, I don't see the crash in either a 3x2 or a 5x2 test.

I've seen instances in which the crash seems to happen after the processes all exit:

Sender 2 sent fragment 998
Sender 0 sent fragment 998
Receiver 3 received fragment 1496 with seqID 996 from Sender 0 (Expecting 4 more)
Sender 2 sent fragment 999
Receiver 4 received fragment 1497 with seqID 997 from Sender 1 (Expecting 3 more)Sender 0 sent fragment 
999
Sent 1048576000 bytes in 1.44204 seconds ( 693.460351 MB/s ).
Rank 2 before MPI_Finalize()
Sent 1048576000 bytes in 1.44206 seconds ( 693.452400 MB/s ).
Rank 0 before MPI_Finalize()
Sender 1 sent fragment 998
Receiver 3 received fragment 1497 with seqID 996 from Sender 1 (Expecting 3 more)
Sender 1 sent fragment 999
Sent 1048576000 bytes in 1.44285 seconds ( 693.074049 MB/s ).
Rank 1 before MPI_Finalize()
Receiver 4 received EndOfData Fragment from Sender 2
Receiver 4 received EndOfData Fragment from Sender 0
Receiver 3 received fragment 1498 with seqID 998 from Sender 2 (Expecting 2 more)
Receiver 4 received fragment 1498 with seqID 999 from Sender 2 (Expecting 2 more)
Receiver 3 received EndOfData Fragment from Sender Receiver 4 received fragment 1499 with seqID 999 from Sender 0 (Expecting 0
1 more)
Receiver 3 received EndOfData Fragment from Sender 2
Receiver 3 received fragment 1499 with seqID 998 from Sender 0 (Expecting 1 more)
Receiver 3 received EndOfData Fragment from Sender 1
Receiver 3 received fragment 1500 with seqID 998 from Sender 1 (Expecting 0 more)
Receiver 4 received EndOfData Fragment from Sender 1
Receiver 4 received fragment 1500 with seqID 999 from Sender 1 (Expecting 0 more)
Received 1572864000 bytes in 1.52817 seconds ( 981.564577 MB/s ).
Rank 3 before MPI_Finalize()
Received 1572864000 bytes in 1.52788 seconds ( 981.751414 MB/s ).
Rank 4 before MPI_Finalize()
Rank 1 after MPI_Finalize()
Rank 1 immediately before return
Rank 2 after MPI_Finalize()
Rank 2 immediately before return
Rank 4 after MPI_Finalize()
Rank 4 immediately before return
Rank 3 after MPI_Finalize()
Rank 3 immediately before return
Rank 0 after MPI_Finalize()
Rank 0 immediately before return

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x0000003ccecabfdd in __libc_waitpid (pid=<value optimized out>, stat_loc=<value optimized out>, options=<value optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:41
#1  0x0000003ccec3e899 in do_system (line=<value optimized out>) at ../sysdeps/posix/system.c:149
#2  0x0000003ccec3ebd0 in __libc_system (line=<value optimized out>) at ../sysdeps/posix/system.c:190
#3  0x00007f8b4e1841a4 in TUnixSystem::StackTrace() () from /products/root/v6_06_04b/Linux64bit+2.6-2.12-e10-prof/lib/libCore.so
#4  0x00007f8b4e1862bc in TUnixSystem::DispatchSignals(ESignals) () from /products/root/v6_06_04b/Linux64bit+2.6-2.12-e10-prof/lib/libCore.so
#5  <signal handler called>
#6  _int_free (av=0x7f8b2c000020, mem=<value optimized out>) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:4404
#7  0x00007f8b4c88bf6b in free (mem=0x7f8b2c0008c0) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3496
#8  0x0000003cce411199 in _dl_deallocate_tls (tcb=0x7f8b33078700, dealloc_tcb=false) at dl-tls.c:481
#9  0x0000003ccf0065dd in __free_stacks (limit=41943040) at allocatestack.c:278
#10 0x0000003ccf00776a in queue_stack (pd=0x7f8b458bb700) at allocatestack.c:306
#11 __deallocate_stack (pd=0x7f8b458bb700) at allocatestack.c:750
#12 __free_tcb (pd=0x7f8b458bb700) at pthread_create.c:222
#13 0x0000003ccf008074 in pthread_join (threadid=140236143965952, thread_return=0x0) at pthread_join.c:110
#14 0x00007f8b4a973697 in std::thread::join() () at /scratch/workspace/art-build-base/v4_9_3a/SLF6/build/gcc/v4_9_3a/build/Linux64bit+2.6-2.12/src/gcc-4.9.3/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:668
#15 0x00007f8b4b6c568b in mf::service::MessageServicePresence::~MessageServicePresence (this=0x859790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:49
#16 0x00007f8b4b6c5771 in mf::service::MessageServicePresence::~MessageServicePresence (this=0x859790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:51
#17 0x0000003ccec35db2 in __run_exit_handlers (status=0) at exit.c:78
#18 exit (status=0) at exit.c:100
#19 0x0000003ccec1ece4 in __libc_start_main (main=0x40b420 <main(int, char**)>, argc=2, ubp_av=0x7fffd91e0de8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffd91e0dd8) at libc-start.c:258
#20 0x0000000000408c09 in _start ()
===========================================================

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  _int_free (av=0x7f8b2c000020, mem=<value optimized out>) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:4404
#7  0x00007f8b4c88bf6b in free (mem=0x7f8b2c0008c0) at src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3496
#8  0x0000003cce411199 in _dl_deallocate_tls (tcb=0x7f8b33078700, dealloc_tcb=false) at dl-tls.c:481
#9  0x0000003ccf0065dd in __free_stacks (limit=41943040) at allocatestack.c:278
#10 0x0000003ccf00776a in queue_stack (pd=0x7f8b458bb700) at allocatestack.c:306
#11 __deallocate_stack (pd=0x7f8b458bb700) at allocatestack.c:750
#12 __free_tcb (pd=0x7f8b458bb700) at pthread_create.c:222
#13 0x0000003ccf008074 in pthread_join (threadid=140236143965952, thread_return=0x0) at pthread_join.c:110
#14 0x00007f8b4a973697 in std::thread::join() () at /scratch/workspace/art-build-base/v4_9_3a/SLF6/build/gcc/v4_9_3a/build/Linux64bit+2.6-2.12/src/gcc-4.9.3/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:668
#15 0x00007f8b4b6c568b in mf::service::MessageServicePresence::~MessageServicePresence (this=0x859790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:49
#16 0x00007f8b4b6c5771 in mf::service::MessageServicePresence::~MessageServicePresence (this=0x859790, __in_chrg=<value optimized out>) at /scratch/workspace/build-gallery/SLF6/prof/build/messagefacility/v1_17_01/src/messagefacility/MessageLogger/MessageServicePresence.cc:51
#17 0x0000003ccec35db2 in __run_exit_handlers (status=0) at exit.c:78
#18 exit (status=0) at exit.c:100
#19 0x0000003ccec1ece4 in __libc_start_main (main=0x40b420 <main(int, char**)>, argc=2, ubp_av=0x7fffd91e0de8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffd91e0dd8) at libc-start.c:258
#20 0x0000000000408c09 in _start ()
===========================================================

#7 Updated by Kurt Biery over 2 years ago

Here is the code snippet from the end of transfer_driver_mpi.cc that generated the previous result:

  theTest.runTest();

  std::cout << "Rank " << my_rank << " before MPI_Finalize()" << std::endl;
  rc = MPI_Finalize();
  std::cout << "Rank " << my_rank << " after MPI_Finalize()" << std::endl;
  assert(rc == 0);
  TRACE(TLVL_TRACE, "s_r_handles main return" );
  std::cout << "Rank " << my_rank << " immediately before return" << std::endl;
  return 0;
}
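
For comparison, the no-MPI_Finalize variant mentioned at the start of note #6 (the one that did not crash in either the 3x2 or the 5x2 test) would look roughly like the following. The exact form of the edit is my sketch; the only fact taken from note #6 is that the MPI_Finalize() call was commented out.

  theTest.runTest();

  std::cout << "Rank " << my_rank << " before MPI_Finalize()" << std::endl;
  // rc = MPI_Finalize();          // commented out for the test described in note #6;
                                   // with this call skipped, no crash was seen
                                   // in either the 3x2 or the 5x2 configuration
  std::cout << "Rank " << my_rank << " after MPI_Finalize()" << std::endl;
  // assert(rc == 0);              // also disabled, since rc is no longer set
  TRACE(TLVL_TRACE, "s_r_handles main return" );
  std::cout << "Rank " << my_rank << " immediately before return" << std::endl;
  return 0;
}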

#8 Updated by Kurt Biery over 2 years ago

For reference, here are the configuration strings from the most recent test (two entries ago...)

argc:2
argv[0]: transfer_driver_mpi
argv[1]: transfer_driver_mpi.fcl
Going to configure with ParameterSet: buffer_count:10 destinations:{d3:{buffer_count:10 destination_rank:3 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"} d4:{buffer_count:10 destination_rank:4 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"}} fragment_size:1.048576e6 num_receivers:2 num_senders:3 sends_per_sender:1000 sources:{s0:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:0 transferPluginType:"MPI"} s1:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:1 transferPluginType:"MPI"} s2:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:2 transferPluginType:"MPI"}} transfer_plugin_type:"MPI" 
Going to configure with ParameterSet: buffer_count:10 destinations:{d3:{buffer_count:10 destination_rank:3 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"} d4:{buffer_count:10 destination_rank:4 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"}} fragment_size:1.048576e6 num_receivers:2 num_senders:3 sends_per_sender:1000 sources:{s0:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:0 transferPluginType:"MPI"} s1:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:1 transferPluginType:"MPI"} s2:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:2 transferPluginType:"MPI"}} transfer_plugin_type:"MPI" 
Going to configure with ParameterSet: buffer_count:10 destinations:{d3:{buffer_count:10 destination_rank:3 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"} d4:{buffer_count:10 destination_rank:4 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"}} fragment_size:1.048576e6 num_receivers:2 num_senders:3 sends_per_sender:1000 sources:{s0:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:0 transferPluginType:"MPI"} s1:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:1 transferPluginType:"MPI"} s2:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:2 transferPluginType:"MPI"}} transfer_plugin_type:"MPI" 
Going to configure with ParameterSet: buffer_count:10 destinations:{d3:{buffer_count:10 destination_rank:3 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"} d4:{buffer_count:10 destination_rank:4 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"}} fragment_size:1.048576e6 num_receivers:2 num_senders:3 sends_per_sender:1000 sources:{s0:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:0 transferPluginType:"MPI"} s1:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:1 transferPluginType:"MPI"} s2:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:2 transferPluginType:"MPI"}} transfer_plugin_type:"MPI" 
Going to configure with ParameterSet: buffer_count:10 destinations:{d3:{buffer_count:10 destination_rank:3 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"} d4:{buffer_count:10 destination_rank:4 max_fragment_size_words:1.048576e6 transferPluginType:"MPI"}} fragment_size:1.048576e6 num_receivers:2 num_senders:3 sends_per_sender:1000 sources:{s0:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:0 transferPluginType:"MPI"} s1:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:1 transferPluginType:"MPI"} s2:{buffer_count:10 max_fragment_size_words:1.048576e6 source_rank:2 transferPluginType:"MPI"}} transfer_plugin_type:"MPI" 
%MSG-i DataReceiverManager:  transfer_driver_mpi transfer_driver_mpi
enabled_sources not specified, assuming all sources enabled.
%MSG
%MSG-i DataReceiverManager:  transfer_driver_mpi transfer_driver_mpi
enabled_sources not specified, assuming all sources enabled.
%MSG
%MSG-i DataReceiverManager:  transfer_driver_mpi transfer_driver_mpi
enabled_destinations not specified, assuming all destinations enabled.
%MSG
%MSG-i DataReceiverManager:  transfer_driver_mpi transfer_driver_mpi
enabled_destinations not specified, assuming all destinations enabled.
%MSG
%MSG-i DataReceiverManager:  transfer_driver_mpi transfer_driver_mpi
enabled_destinations not specified, assuming all destinations enabled.
%MSG
Sender 0 sent fragment 0
Sender 1 sent fragment 0
Sender 2 sent fragment 0
Sender 0 sent fragment 1
Sender 2 sent fragment 1
Sender 1 sent fragment 1
Sender 0 sent fragment 2
...

#9 Updated by Kurt Biery over 2 years ago

Earlier in my investigation of this issue, I talked with Chris Green about a possible problem with mvapich2. The version that is installed on the dsfr6 computer is 1.9. Since an artdaq 1.13.03 build works but a 2.00.00 build fails (other things being constant, including the gcc version), it seems unlikely that the problem is tied to the mvapich2 version, but we'll probably try a newer one at some point.

#10 Updated by Eric Flumerfelt 6 months ago

  • Status changed from New to Rejected

This issue has become rather stale. MPI transfers are currently unsupported in artdaq v3. Several of the issues documented here were resolved by later versions of artdaq v2.


