
Feature #24083

LibTorch v1.4 for DUNE reconstruction

Added by Andrew Chappell 8 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 02/24/2020
Due date:
% Done: 100%
Estimated time:
Spent time:
Experiment: DUNE
Co-Assignees:
Duration:

Description

I'd like to request that a build of LibTorch v1.4.0 be made available within LArSoft (this use case is for larpandoracontent in particular).

The Pandora team have developed a couple of algorithms that use PyTorch-based neural networks, which we'll be looking to make available in the near future. One of these algorithms makes use of torch.optim.lr_scheduler modules that are not available in early versions of LibTorch, and pending developments using sparse convolutional networks also require v1.3+.

It's my understanding that there is a longer-term program relating to the provision of PyTorch and TensorFlow interfaces to support machine learning solutions being developed for DUNE sim/reco, but in the interim, would it be possible to provide a v1.4.0 build of LibTorch to link against? Thanks very much.

History

#1 Updated by Lynn Garren 8 months ago

  • Assignee set to Lynn Garren

Note that we have concerns about which platforms and compilers can be supported. I see that we only built libtorch 1.0.1 for e17 on SLF7.

We strongly suggest that any interface using libtorch be modular and optional, since it may not be available for all compilers and platforms. We are currently sorting out how best to remove tensorflow from larreco. There will be a new github repository, name as yet unclear.

#2 Updated by Andrew Chappell 8 months ago

From the Pandora side, we will look to make the use of LibTorch optional. Default values will be assigned to the properties normally populated by the neural network, so that downstream algorithms that can make use of this information will continue to operate correctly if the network is unavailable.

Thanks very much for taking the time to look into this, it is greatly appreciated, as this will be very useful to us for both current and future development.

#3 Updated by Kyle Knoepfel 8 months ago

  • Status changed from New to Assigned

#4 Updated by Lynn Garren 4 months ago

Brought the build scripts up to date and then attempted building 1.4.1. The build flags that we used last time around are insufficient.

#5 Updated by Lynn Garren 4 months ago

  • Co-Assignees Christopher Green added


#6 Updated by Lynn Garren 4 months ago

  • % Done changed from 0 to 100
  • Status changed from Assigned to Resolved

libtorch v1_5_1 is available on SciSoft and larsoft cvmfs. For various reasons, this package is only available for e19 on SLF7. When larsoft moves to art 3.6, it should be available for all supported platforms and compilers.

To use libtorch:

setup libtorch v1_5_1 -q e19:eigen

#7 Updated by Lynn Garren 4 months ago

Sorry to be the bearer of bad news, but the libtorch cpu library is too big for cvmfs. So libtorch is not available on the larsoft cvmfs at this time. We are investigating.

#8 Updated by Lynn Garren 4 months ago

libtorch v1_5_1a is now available on larsoft cvmfs. Please let us know if this works for you.

setup libtorch v1_5_1a -q e19:eigen

#9 Updated by Andrew Chappell 4 months ago

Lynn Garren wrote:

libtorch v1_5_1a is now available on larsoft cvmfs. Please let us know if this works for you.

[...]

Thanks for this Lynn. I'm currently on vacation, so I will check this when I return, but I don't anticipate any issues with moving to 1.5.

#10 Updated by Kyle Knoepfel 3 months ago

  • Status changed from Resolved to Closed

#11 Updated by Andrew Chappell 3 months ago

Hi Lynn, Kyle,

I've now had the opportunity to test the LibTorch v1.5.1a build - thanks again for your efforts here. Although I've been able to get the network running and reproducing output consistent with what I was seeing with v1.4, the inference is taking much longer (per-event processing time has jumped from ~1.5 seconds on v1.4 to ~30 seconds on v1.5.1).

I've tested this both on my local system at Warwick (which is where the v1.4 baseline timing was established) and on dunegpvm02 and find similar runtime on each. I've also tested a network produced using 1.5.1 rather than 1.4 and, again, this did not improve the runtime.

Do you have any thoughts on possible reasons for this behaviour? Thanks again.

#12 Updated by Kyle Knoepfel 3 months ago

Andy, we have a few ideas about why you might be seeing this behavior. Please open another issue to address the efficiency side of the installation. Two requests:

  • We ask that you include the build flags you used for your personal installation of libtorch.
  • Can you build libtorch 1.5 yourself and reproduce the same difference in behavior?
