Support #24669

LibTorch v1.5.1 efficiency

Added by Andrew Chappell 17 days ago. Updated about 23 hours ago.

Status: Feedback
Priority: Normal
Category: External Packages
Target version: -
Start date: 07/27/2020
Due date: -
% Done: 0%
Estimated time: -
Experiment: -
Co-Assignees: -
Duration: -

Description

Upon testing the LibTorch 1.5.1 build provided in relation to issue https://cdcvs.fnal.gov/redmine/issues/24083#change-77911, I've found that equivalent network inference time is around a factor of 20 slower than with a standalone build of LibTorch 1.4.0. My standalone build of LibTorch was produced via:

git clone https://github.com/pytorch/pytorch.git --recursive
cd pytorch && git checkout v1.4.0
git submodule update --init --recursive
# exported so that build_libtorch.py picks these settings up
export USE_OPENCV=1
export BUILD_TORCH=ON
export USE_CUDA=0
export USE_NNPACK=0
export CMAKE_PREFIX_PATH="<python path>"
python tools/build_libtorch.py

I'd appreciate it if you could look into this efficiency issue. As requested, I'll look to produce a local build of LibTorch v1.5.1 to see if this changes the behaviour. Based on previous discussion with Lynn, it's my understanding that the above build procedure doesn't work for v1.5.1. Would you be able to let me know how you ultimately built 1.5.1 for the larsoft cvmfs? Thanks very much.
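For context, the kind of per-event timing comparison referred to above can be reproduced with a small standalone LibTorch program. The sketch below is illustrative only: the model path and input shape are placeholders, not the actual Pandora network or data.

// Illustrative timing sketch only: "model.pt" and the input shape are
// placeholders, not the actual Pandora network or data.
#include <torch/script.h>
#include <torch/torch.h>
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    // Load a serialised TorchScript module and switch to inference mode.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();

    // Dummy input tensor; the real network input will differ.
    std::vector<torch::jit::IValue> inputs;
    inputs.emplace_back(torch::rand({1, 1, 256, 256}));

    torch::NoGradGuard noGrad;
    const int nEvents = 100;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < nEvents; ++i)
        module.forward(inputs);
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << "Mean inference time: " << seconds / nEvents << " s/event" << std::endl;
    return 0;
}

Building this once against the cvmfs LibTorch and once against the standalone build (changing only CMAKE_PREFIX_PATH) would be one way to compare the two libraries in isolation from the rest of the Pandora setup.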

scripts.tar.gz (1.31 KB), Andrew Chappell, 08/12/2020 03:46 AM

History

#1 Updated by Andrew Chappell 17 days ago

  • Tracker changed from Feature to Support

#2 Updated by Andrew Chappell 17 days ago

I've now produced a local build of LibTorch 1.5.1. The inference time with this local build is comparable to the times I was seeing with LibTorch 1.4: 100 events were processed with a mean time of 1.4 seconds per event. The build was performed as follows (very similar to the approach used for 1.4):

git clone https://github.com/pytorch/pytorch.git --recursive
cd pytorch
git checkout v1.5.1
git submodule sync
git submodule update --init --recursive

pip install -r requirements.txt
pip install mkl
pip install mkl-include

export USE_OPENCV=1
export BUILD_TORCH=ON
export USE_CUDA=0
export USE_NNPACK=0
export CMAKE_PREFIX_PATH="<python path>"

python tools/build_libtorch.py

#3 Updated by Lynn Garren 10 days ago

  • Status changed from New to Feedback

Would you please try without the pip install mkl and mkl-include? Those are platform specific.

#4 Updated by Andrew Chappell 10 days ago

I've rebuilt without mkl and mkl-include; the inference performance using this new build is unchanged relative to my previous build.

#5 Updated by Christopher Green 2 days ago

  • Assignee set to Christopher Green
  • Category set to External Packages

Would it be possible for you to provide us with a way to reproduce both builds, and something we can use to observe the same scale of performance discrepancy you're seeing? I'm afraid that at the moment I don't have enough information to ascertain whether the performance discrepancy is a bug or a feature. To clarify: we have definitely compiled libtorch 1.5.1 differently than you, with more features enabled. However, it's unclear whether the performance difference is a side effect of "doing more," a consequence of a choice made to increase portability or avoid dependency clashes with other linked code, or an actual performance issue that can or should be rectified. The single command you list above:

pip install -r requirements.txt
is doing a lot of work, at least some of which is likely to be incompatible with (e.g.) C++ code compiled in the context of an experiment's "framework" software setup.
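In the meantime, one quick comparison that might be informative is the threading configuration each build reports at runtime. The following is only a sketch, and assumes at::get_parallel_info() is available in both 1.4 and 1.5.1 (I believe it is):

// Print the parallelisation settings reported by whichever LibTorch is linked.
// Diagnostic sketch only; compile once against each build and compare output.
#include <ATen/Parallel.h>
#include <iostream>

int main()
{
    std::cout << at::get_parallel_info() << std::endl;
    std::cout << "Intra-op threads: " << at::get_num_threads() << std::endl;
    return 0;
}

If the two builds were configured to parallelise differently (OpenMP versus the native thread pool, MKL versus no MKL, different default thread counts), that would show up here and could plausibly contribute to a large per-event difference.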

I'm sorry we can't give you a definitive solution immediately, but if you could provide the means for us to reproduce your experience ourselves (with as few experiment-specific dependencies as possible), we will work to get you the answers you need.

#6 Updated by Andrew Chappell 1 day ago

I've attached a tar.gz file containing a number of scripts to setup the two different configurations of Pandora that use the network, in a dunegpvm environment. This is the "standalone" setup of Pandora.

To build the version of Pandora that runs against the locally built LibTorch, as per the approach outlined above (minus the two mkl-related pip installs), the scripts should be run as follows. The one caveat is that in install_torch_local.sh, the -DCMAKE_PREFIX_PATH towards the end of line 23 will need to be updated to reflect the installation location of your local build of LibTorch; otherwise all files should run as-is:

source setup_torch_local.sh
source env.sh
source clone.sh
source install_torch_local.sh

To build against the SciSoft version of LibTorch on cvmfs the approach is similar, with the setup and install steps having the relevant modifications for the SciSoft LibTorch build:

source setup_torch_cvmfs.sh
source env.sh
source clone.sh
source install_torch_cvmfs.sh

Running is the same in each case (the Pandora pndr file specified is in my /dune/data folder, which I assume you can see; please let me know if that's not the case):

source infer.sh

This script is currently set up to run over 5 events in the pndr file, which should be sufficient to see the difference in computation time, though the file contains 500 events (controlled via the -n parameter in infer.sh). With the local LibTorch version the run should take just a few seconds, whereas the cvmfs LibTorch version will likely take a minute or so. Incidentally, the remove.sh script just clears out the build folders.

Additionally, it might be useful if you could let me know how you are building LibTorch, so that I can see how a local build using your configuration runs in my local environment. Thanks again.

#7 Updated by Christopher Green about 23 hours ago

Thank you for this, Andrew. It may be a few days before I am able to investigate further, but in the meantime I can point you to our build script for libtorch 1.5: build-framework:source:libtorch-ssi-build|build_libtorch.sh@v1_5_1a. Let me know if you have any questions or problems.


