Necessary Maintenance #17047

Floating Point Exceptions

Added by Gianluca Petrillo about 2 years ago. Updated about 2 years ago.

Status: Assigned
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 06/27/2017
Due date:
% Done: 0%
Estimated time:
Experiment: -
Duration:

Description

After the LArSoft coordination meeting on June 27, there was consensus on taking action to extensively fix code that triggers floating point exceptions.

This ticket is some "glue" to keep track of all issues related to this problem and to the solution effort.

Each discovered instance of this type of error should have its own issue number.


Related issues

Blocked by LArSoft - Bug #17045: OpFastScintillation::GetMeanLifeTime sets up a floating-point overflow in G4 (Closed, 06/27/2017)

Blocked by LArSoft - Bug #17068: Floating point divide by zero in TrajClusterAlg::UpdateTraj (Resolved, 06/29/2017)

Blocked by LArSoft - Bug #17048: floating point divide by zero in tca:MCSMom (Resolved, 06/27/2017)

Blocked by LArSoft - Bug #17095: Floating point divide by zero in ClusterParamsAlg::RefineDirection (Assigned, 06/30/2017)

Blocked by LArSoft - Bug #17096: Floating point divide by zero in PmaTrack3D.cxx (Closed, 06/30/2017)

Blocked by LArSoft - Bug #17097: Floating point divide by zero in EMShowerAlg.cxx (Assigned, 06/30/2017)

Blocked by LArSoft - Bug #17117: prodsingle_sbnd.fcl crashes with larsoft v06_42_00 (Closed, 07/06/2017)

History

#1 Updated by Gianluca Petrillo about 2 years ago

  • Blocked by Bug #17045: OpFastScintillation::GetMeanLifeTime sets up a floating-point overflow in G4 added

#2 Updated by Gianluca Petrillo about 2 years ago

  • Related to Bug #17048: floating point divide by zero in tca:MCSMom added

#3 Updated by Thomas Junk about 2 years ago

A question about how to go about this campaign. There is low-hanging fruit: one just has to turn on exception signals as Herb showed on Tuesday, start the debugger on a typical job, and one will find some.

But once one is identified, you have to fix it before you can go on to the next one. Ideally, the author of the code should fix the problem. But does one then have to wait for a release? The person finding the exception may propose a fix and move on, but the proposed fix may cause other exceptions.

Regarding Herb's slides proposing how to turn on the FP signals: the module-level granularity is now gone as of art 2.07.01. See the breaking changes page:

https://cdcvs.fnal.gov/redmine/projects/art/wiki/List_of_breaking_changes

I guessed how to update the fcl control file, and this recipe seems to work. The underflow option also seems to be gone in art 2.07.01.

services.floating_point_control: {
  reportSettings: true
  enableDivByZeroEx: true
  enableInvalidEx: true
  enableOverFlowEx: true
  # underflow option appears to be gone in art 2.07.01
  # enableUnderFlowEx: false
}
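
To illustrate which conditions the three switches in the configuration above correspond to, here is a minimal standalone sketch using only the standard <cfenv> header. It is not how art implements the service: the fcl switches turn these conditions into signals, while this sketch only inspects the status flags after the fact. The helper raisedFPExceptions and the four small test functions are hypothetical names made up for this illustration.

```cpp
#include <cfenv>

// Hypothetical helper (not part of art or LArSoft): runs `op` and returns the
// subset of {FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW} status flags it raised.
template <typename Op>
int raisedFPExceptions(Op op) {
  std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean set of flags
  volatile double sink = op();         // volatile: keep the result from being optimized away
  (void)sink;
  return std::fetestexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
}

// Operands are volatile so that the arithmetic happens at run time
// instead of being folded by the compiler.
double divideByZero() { volatile double num = 1.0, den = 0.0; return num / den; }   // divide by zero
double zeroOverZero() { volatile double num = 0.0, den = 0.0; return num / den; }   // invalid operation
double overflowing()  { volatile double big = 1e308;          return big * 10.0; }  // overflow
double harmless()     { volatile double a = 1.0, b = 2.0;     return a / b; }       // no flag raised
```

With the default floating point environment these operations silently produce inf or NaN and only set a flag, which is why such code can go unnoticed until the signals are enabled. For example, raisedFPExceptions(divideByZero) reports FE_DIVBYZERO, the condition that enableDivByZeroEx would trap.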

#4 Updated by Gianluca Petrillo about 2 years ago

  • Related to Bug #17068: Floating point divide by zero in TrajClusterAlg::UpdateTraj added

#5 Updated by Gianluca Petrillo about 2 years ago

  • Related to deleted (Bug #17048: floating point divide by zero in tca:MCSMom)

#6 Updated by Gianluca Petrillo about 2 years ago

  • Blocks Bug #17048: floating point divide by zero in tca:MCSMom added

#7 Updated by Gianluca Petrillo about 2 years ago

  • Related to deleted (Bug #17068: Floating point divide by zero in TrajClusterAlg::UpdateTraj)

#8 Updated by Gianluca Petrillo about 2 years ago

  • Blocked by Bug #17068: Floating point divide by zero in TrajClusterAlg::UpdateTraj added

#9 Updated by Gianluca Petrillo about 2 years ago

  • Blocks deleted (Bug #17048: floating point divide by zero in tca:MCSMom)

#10 Updated by Gianluca Petrillo about 2 years ago

  • Blocked by Bug #17048: floating point divide by zero in tca:MCSMom added

#11 Updated by Gianluca Petrillo about 2 years ago

Thomas Junk wrote:

A question about how to go about this campaign. There is low-hanging fruit: one just has to turn on exception signals as Herb showed on Tuesday, start the debugger on a typical job, and one will find some.

But once one is identified, you have to fix it before you can go on to the next one. Ideally, the author of the code should fix the problem. But does one then have to wait for a release? The person finding the exception may propose a fix and move on, but the proposed fix may cause other exceptions.

I have no good solution for this workflow.
In principle the maintainer of the code would fix and test it, and push the fix to the develop branch. For this type of campaign, one has to live on the HEAD, and update often, too.
Another tricky part is the "testing": there are potentially plenty of unfixed errors, and the test will normally fail just because of those. Furthermore, errors may not appear on every event, so a module upstream of the one with the exposed failure may still have problems that only manifest once the exposed module is fixed for the triggering event.

#12 Updated by Gianluca Petrillo about 2 years ago

  • Blocked by Bug #17095: Floating point divide by zero in ClusterParamsAlg::RefineDirection added

#13 Updated by Gianluca Petrillo about 2 years ago

  • Blocked by Bug #17096: Floating point divide by zero in PmaTrack3D.cxx added

#14 Updated by Gianluca Petrillo about 2 years ago

  • Related to Bug #17097: Floating point divide by zero in EMShowerAlg.cxx added

#15 Updated by Gianluca Petrillo about 2 years ago

  • Related to deleted (Bug #17097: Floating point divide by zero in EMShowerAlg.cxx)

#16 Updated by Gianluca Petrillo about 2 years ago

  • Blocked by Bug #17097: Floating point divide by zero in EMShowerAlg.cxx added

#17 Updated by Gianluca Petrillo about 2 years ago

  • Blocked by Bug #17117: prodsingle_sbnd.fcl crashes with larsoft v06_42_00 added

#18 Updated by Lynn Garren about 2 years ago

  • Status changed from New to Assigned
  • Assignee set to Lynn Garren

