Bug #7576

test_space_charge_3d_open_hockney_mpi4 fails intermittently

Added by Eric Stern almost 5 years ago. Updated almost 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Start date: 12/30/2014
Due date:
% Done: 0%
Estimated time:
Duration:

Description

test_space_charge_3d_open_hockney_mpi4 fails intermittently. The failure is most prominent on the upgraded Wilson cluster, but I also see it on the SLF6.5 desktop machine panal7.

Two consecutive runs from ctest, the first passing and the second failing:

[egstern@panal7 synergia2]$ ctest --verbose -R test_space_charge_3d_open_hockney_mpi4
UpdateCTestConfiguration from :/home/egstern/syn2-dev/build/synergia2/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/egstern/syn2-dev/build/synergia2/DartConfiguration.tcl
Test project /home/egstern/syn2-dev/build/synergia2
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph...
Checking test dependency graph end
test 146
Start 146: test_space_charge_3d_open_hockney_mpi4

146: Test command: /usr/lib64/openmpi/bin/mpiexec "-np" "4" "/home/egstern/syn2-dev/build/synergia2/src/synergia/collective/tests/test_space_charge_3d_open_hockney_mpi"
146: Test timeout computed to be: 9.99988e+06
146: Running 17 test cases...
146: Running 17 test cases...
146: Running 17 test cases...
146: Running 17 test cases...
146:
146: *** No errors detected
146:
146: *** No errors detected
146:
146: *** No errors detected
146:
146: *** No errors detected
1/1 Test #146: test_space_charge_3d_open_hockney_mpi4 ... Passed 1.43 sec

The following tests passed:
test_space_charge_3d_open_hockney_mpi4

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1.44 sec
[egstern@panal7 synergia2]$ ctest --verbose -R test_space_charge_3d_open_hockney_mpi4
UpdateCTestConfiguration from :/home/egstern/syn2-dev/build/synergia2/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/egstern/syn2-dev/build/synergia2/DartConfiguration.tcl
Test project /home/egstern/syn2-dev/build/synergia2
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph...
Checking test dependency graph end
test 146
Start 146: test_space_charge_3d_open_hockney_mpi4

146: Test command: /usr/lib64/openmpi/bin/mpiexec "-np" "4" "/home/egstern/syn2-dev/build/synergia2/src/synergia/collective/tests/test_space_charge_3d_open_hockney_mpi"
146: Test timeout computed to be: 9.99988e+06
146: Running 17 test cases...
146: Running 17 test cases...
146: Running 17 test cases...
146: Running 17 test cases...
146: unknown location(0): fatal error in "auto_tune_comm_sptr": memory access violation at address: 0x01b33780: no mapping at fault address
146: /home/egstern/syn2-dev/build/synergia2/src/synergia/collective/tests/test_space_charge_3d_open_hockney_mpi.cc(92): last checkpoint

The same intermittent failure occurs when I run the test executable directly:

[egstern@panal7 synergia2]$ mpirun -np 4 /home/egstern/syn2-dev/build/synergia2/src/synergia/collective/tests/test_space_charge_3d_open_hockney_mpi
Running 17 test cases...
Running 17 test cases...
Running 17 test cases...
Running 17 test cases...

*** No errors detected
*** No errors detected
*** No errors detected
*** No errors detected
[egstern@panal7 synergia2]$ mpirun -np 4 /home/egstern/syn2-dev/build/synergia2/src/synergia/collective/tests/test_space_charge_3d_open_hockney_mpi
Running 17 test cases...
Running 17 test cases...
Running 17 test cases...
Running 17 test cases...
unknown location(0): fatal error in "auto_tune_comm_sptr": memory access violation at address: 0x021ff8b0: no mapping at fault address
/home/egstern/syn2-dev/build/synergia2/src/synergia/collective/tests/test_space_charge_3d_open_hockney_mpi.cc(92): last checkpoint
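
Because the failure is intermittent, a single passing run says little. The following is a minimal sketch, not part of the original report, of one way to run a command repeatedly and count how often it fails. The helper name `count_failures` and its parameters are hypothetical, and it assumes a failing run is reflected in a nonzero exit status from the launcher.

```python
import subprocess
import sys


def count_failures(cmd, runs=20, timeout=120):
    """Run `cmd` (a list of arguments) `runs` times.

    Returns (number_of_failures, [(run_index, captured_output), ...]).
    A run counts as failed if it exits nonzero or exceeds `timeout` seconds.
    Hypothetical helper for illustration; not part of synergia2.
    """
    failures = []
    for i in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep stdout and stderr together
                universal_newlines=True,
                timeout=timeout,
            )
            failed = result.returncode != 0
            output = result.stdout
        except subprocess.TimeoutExpired:
            failed = True
            output = "timed out after {} s".format(timeout)
        if failed:
            # Report progress on stderr so stdout stays clean.
            print("run {}: FAILED".format(i), file=sys.stderr)
            failures.append((i, output))
    return len(failures), failures
```

For the case above, one might call `count_failures(["mpirun", "-np", "4", "/home/egstern/syn2-dev/build/synergia2/src/synergia/collective/tests/test_space_charge_3d_open_hockney_mpi"], runs=50)` before and after a candidate fix and compare the failure counts.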

History

#1 Updated by Eric Stern almost 5 years ago

The memory access violation is apparently occurring in get_global_electric_field_component_allgatherv.

#2 Updated by Eric Stern almost 5 years ago

  • Status changed from New to Resolved

Removed the auto_tune_comm code and its use in the tests in commit a99f2e6302294a8084d8a382633e5f9a15f29516, as suggested by J. Amundson. The fix was tested on the platforms where the bug was most evident: panal7, tev, and an amd32 worker node.

#3 Updated by Eric Stern almost 5 years ago

  • Status changed from Resolved to Closed

