Support #24669

LibTorch v1.5.1 efficiency

Added by Andrew Chappell 6 months ago. Updated about 2 months ago.

Status: Closed
Priority: Normal
Category: External Packages
Target version: -
Start date: 07/27/2020
Due date:
% Done: 100%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

Upon testing the LibTorch 1.5.1 build provided in relation to issue https://cdcvs.fnal.gov/redmine/issues/24083#change-77911, I've found that equivalent network inference time is around a factor of 20 slower than with a standalone build of LibTorch 1.4.0. My standalone build of LibTorch was produced via:

git clone https://github.com/pytorch/pytorch.git --recursive
cd pytorch && git checkout v1.4.0
git submodule update --init --recursive
USE_OPENCV=1 && BUILD_TORCH=ON
USE_CUDA=0 && USE_NNPACK=0
CMAKE_PREFIX_PATH="<python path>"
python tools/build_libtorch.py
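As comes up later in this ticket, plain shell assignments like these are only seen by the build if they are exported (or given on the same command line as the build step). A minimal sketch of the exported form, using the same settings:

export USE_OPENCV=1
export BUILD_TORCH=ON
export USE_CUDA=0
export USE_NNPACK=0
export CMAKE_PREFIX_PATH="<python path>"   # placeholder as above: the Python installation to build against
python tools/build_libtorch.py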

I'd appreciate it if you could look into this efficiency issue. As requested, I'll look to produce a local build of LibTorch v1.5.1 to see if this changes the behaviour. Based on previous discussion with Lynn, it's my understanding that the above build procedure doesn't work for v1.5.1; would you be able to let me know how you ultimately built 1.5.1 for larsoft cvmfs? Thanks very much.

scripts.tar.gz (1.31 KB) Andrew Chappell, 08/12/2020 03:46 AM
ldd.txt (6.7 KB) Andrew Chappell, 09/01/2020 05:05 AM
build.log (2.2 MB) Andrew Chappell, 09/01/2020 05:05 AM
ldd_pi.txt (9.74 KB) Andrew Chappell, 09/01/2020 05:20 AM
local_build.log (2.26 MB) Andrew Chappell, 09/18/2020 10:46 AM
scisoftbuild_build.log (2.27 MB) Andrew Chappell, 09/18/2020 10:46 AM
igprof.PandoraInterface.29425.1603876042.801249.txt (631 KB) Andrew Chappell, 10/28/2020 04:11 AM
igprof.PandoraInterface.6069.1603906474.595590.gz.txt (504 KB) Patrick Gartung, 10/28/2020 12:47 PM
igprof.PandoraInterface.26541.1603908951.445445.gz.txt (711 KB) Patrick Gartung, 10/28/2020 01:19 PM
igprof.PandoraInterface.41160.1603917887.536442.gz.txt (916 KB) Patrick Gartung, 10/28/2020 03:50 PM
igprof.PandoraInterface.18876.1603982349.801497.gz.txt (468 KB) VirtualBox with AVX and AVX2 - Patrick Gartung, 10/29/2020 09:43 AM
igprof.PandoraInterface.205094.1603982539.662027.gz.txt (982 KB) scisoftbuild01 - Patrick Gartung, 10/29/2020 09:43 AM
igprof.PandoraInterface.243133.1603986651.564849.gz.txt (1.13 MB) Patrick Gartung, 10/29/2020 10:52 AM
igprof.PandoraInterface.19770.1603987168.871718.gz.txt (499 KB) Patrick Gartung, 10/29/2020 11:01 AM
igprof.PandoraInterface.32575.1603989769.974901.gz.txt (826 KB) Patrick Gartung, 10/29/2020 11:45 AM
igprof.PandoraInterface.29425.1603876042.801249.txt (631 KB) VirtualBox VM with AVX and AVX2 disabled - Patrick Gartung, 10/29/2020 12:34 PM

History

#1 Updated by Andrew Chappell 6 months ago

  • Tracker changed from Feature to Support

#2 Updated by Andrew Chappell 6 months ago

I've now produced a local build of LibTorch 1.5.1. The inference time with this local build is now comparable to the times I was seeing with LibTorch 1.4, with 100 events processing with a mean time of 1.4 seconds per event. The build was performed as follows (very similar to the approach used for 1.4):

git clone https://github.com/pytorch/pytorch.git --recursive
cd pytorch
git checkout v1.5.1
git submodule sync
git submodule update --init --recursive

pip install -r requirements.txt
pip install mkl
pip install mkl-include

USE_OPENCV=1
BUILD_TORCH=ON
USE_CUDA=0
USE_NNPACK=0
CMAKE_PREFIX_PATH=“<python path>"

python tools/build_libtorch.py

#3 Updated by Lynn Garren 6 months ago

  • Status changed from New to Feedback

Would you please try without the pip install mkl and mkl-include? Those are platform specific.

#4 Updated by Andrew Chappell 6 months ago

I've rebuilt without mkl and mkl-include; the inference performance with this new build is unchanged relative to my previous build.

#5 Updated by Christopher Green 5 months ago

  • Assignee set to Christopher Green
  • Category set to External Packages

Would it be possible for you to provide us with a way to reproduce both builds, and something we can use to reproduce the same caliber of performance discrepancy you're seeing? I'm afraid that at the moment I don't have enough information to ascertain whether the performance discrepancy is a bug or a feature. To clarify: we have definitely compiled libtorch 1.5.1 differently than you, with more features. However, it's unclear whether the performance difference is a feature of "doing more," a consequence of a choice made to increase portability or avoid dependency clashes with other linked code, or an actual performance issue that can or should be rectified. The single command you list above:

pip install -r requirements.txt
is doing a lot of work, at least some of which is likely to be incompatible with (e.g.) C++ code compiled in the context of an experiment's "framework" software setup.

I'm sorry we can't give you a definitive solution immediately, but if you could provide the means for us to reproduce your experience ourselves (with as few experiment-specific dependencies as possible), we will work to get you the answers you need.

#6 Updated by Andrew Chappell 5 months ago

I've attached a tar.gz file containing a number of scripts to set up the two different configurations of Pandora that use the network, in a dunegpvm environment. This is the "standalone" setup of Pandora.

To build the version of Pandora that runs against the locally built LibTorch as per the approach outlined above (minus the two mkl-related pip installs) the scripts should be run as follows (with the caveat that in install_torch_local.sh, the -DCMAKE_PREFIX_PATH towards the end of line 23 will need to be updated to reflect the installation location of your local build of LibTorch - otherwise all files should run as-is):

source setup_torch_local.sh
source env.sh
source clone.sh
source install_torch_local.sh

To build against the SciSoft version of LibTorch on cvmfs the approach is similar, with the setup and install steps having the relevant modifications for the SciSoft LibTorch build:

source setup_torch_cvmfs.sh
source env.sh
source clone.sh
source install_torch_cvmfs.sh

Running in each case is the same (the Pandora pndr file specified is in my /dune/data folder, which I assume you can see; please let me know if that's not the case):

source infer.sh

This script is currently set up to run over 5 events in the pndr file, which should be sufficient to see the difference in computation time, but 500 events are contained in this file (the -n parameter in infer.sh). In the local LibTorch version the run should take just a few seconds, whereas the cvmfs LibTorch version will likely take a minute or so. Incidentally, the remove.sh script just clears out the build folders.
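For a like-for-like comparison between the two setups, the run can simply be timed; a minimal sketch (infer.sh and its -n parameter are as described above):

# time the default 5-event run for whichever LibTorch build is currently set up
time source infer.sh
# to average over more events, raise the -n value inside infer.sh (up to the 500 events in the file)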

Additionally, I wonder if it might be useful for you to let me know how you are building LibTorch, to see how a local build using your configuration runs in my local environment? Thanks again.

#7 Updated by Christopher Green 5 months ago

Thank you for this, Andrew. It might be a few days until I am able to investigate further, but in the meantime I can point you at our build script for libtorch 1.5: build-framework:source:libtorch-ssi-build|build_libtorch.sh@v1_5_1a. Let me know if you have any questions or problems.

#8 Updated by Christopher Green 5 months ago

  • Status changed from Feedback to Work in progress

#9 Updated by Christopher Green 5 months ago

  • Status changed from Work in progress to Feedback

Andrew, can you please send the output of `time source infer.sh` for each case you have under test?

#10 Updated by Andrew Chappell 5 months ago

Hi Chris,

For my local build the output is:

real 0m7.332s
user 0m23.280s
sys 0m0.676s

and for the build against CVMFS LibTorch I have:

real 2m14.093s
user 3m33.834s
sys 0m2.500s

I've not yet had the opportunity to test a build based on the build script you provided.

#11 Updated by Christopher Green 5 months ago

It looks like your build is utilizing multiple processing units (4 CPUs, if I had to guess). Can you tell me what mechanism is used for this: MPI/IPC, OpenMP, GPU, or something else?

For our UPS packages (OpenBLAS, FFTW, etc.), we generally disable OpenMP because in a batch environment, parallelism that is not planned and accounted for can cause disruption, unwanted charging and/or denial of service. At some sites, any use whatsoever of a second processor (even for a mostly-idle co-process) can result in double-charged hours. At this point, my working hypothesis is that the efficiency difference is a combination of the difference in parallelism and the lack of LAPACK in our most recent packaging consistent with the LArSoft stack currently in use.

If we get an understanding of just how you expect to make use of any parallelism, we can work to improve the way we package the software you need.
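One generic way to narrow down where any parallelism comes from, independent of how libtorch was built, is to pin the usual threading controls to a single thread and re-time the run; a sketch using standard environment variables (which of them actually matters depends on what the build linked against):

export OMP_NUM_THREADS=1        # OpenMP (libgomp) worker threads
export OPENBLAS_NUM_THREADS=1   # OpenBLAS's internal thread pool
export MKL_NUM_THREADS=1        # MKL threading, if MKL was picked up
time source infer.sh

If the gap between real and user time closes with these set, the extra CPU usage is coming from one of those libraries rather than from the inference code itself.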

#12 Updated by Andrew Chappell 5 months ago

Hi Chris, I'm actually not immediately sure where that parallelism is coming from - I'd speculate it's coming from the OpenCV component of LibTorch (I can try to test this by rebuilding without it - I can rule out GPU). In the interim, I have rerun the tests on the batch system we have at Warwick, which is explicitly single-threaded, and have the following figures:

For my local build of LibTorch the output is:

real 0m15.686s
user 0m12.254s
sys 0m0.814s

and for the build against CVMFS LibTorch I have:

real 2m40.964s
user 2m33.601s
sys 0m5.172s

#13 Updated by Andrew Chappell 5 months ago

Looks like LibTorch builds using OpenMP by default. I've switched off this option and am rebuilding at the moment.

#14 Updated by Andrew Chappell 5 months ago

I've now rebuilt LibTorch 1.5 using the following settings:

USE_ASAN=0 \
USE_CUDA=0 \
USE_CUDNN=0 \
USE_DISTRIBUTED=0 \
USE_FBGEMM=0 \
USE_FFMPEG=0 \
USE_GFLAGS=0 \
USE_GLOG=0 \
USE_LEVELDB=0 \
USE_LMDB=0 \
USE_MKLDNN=0 \
USE_MPI=0 \
USE_NCCL=0 \
USE_NNPACK=1 \
USE_NUMA=0 \
USE_NUMPY=1 \
USE_NVRTC=0 \
USE_OBSERVERS=1 \
USE_OPENCL=0 \
USE_OPENCV=0 \
USE_OPENMP=0 \
USE_PROF=0 \
USE_QNNPACK=1 \
USE_PYTORCH_QNNPACK=1 \
USE_REDIS=0 \
USE_ROCKSDB=0 \
USE_ROCM=0 \
USE_TBB=0 \
USE_TENSORRT=0 \
USE_XNNPACK=1 \
USE_ZSTD=0

I believe these settings should now be broadly consistent with those in the build script you linked above, with the exception of numpy (which seems to be overridden and reset at some stage of the build; I'm not sure where yet) and the fact that I now explicitly set USE_OPENMP to 0, whereas it still appears to be 1 in the script used for the CVMFS version, so presumably this was still a factor in my workstation runs with the CVMFS build.

This local LibTorch build now yields the following timing result (running on my workstation again, as in the original timings I quoted):

real 0m18.597s
user 0m17.254s
sys 0m0.964s

So, the updated build does appear to eliminate the parallelism in this case. It's not clear to me if/where LAPACK features in the LibTorch build with these settings.
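One way to check which of these settings the configure step actually honoured (and, for example, where numpy came from) is to inspect the CMake cache in the libtorch build tree; a sketch, assuming the default build directory used by tools/build_libtorch.py and that these variables appear in the cache:

# adjust the path if the build directory differs
grep -E '^(BLAS|ARCH_OPT_FLAGS|USE_OPENMP|USE_NUMPY)' pytorch/build/CMakeCache.txt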

#15 Updated by Christopher Green 5 months ago

I'm having real trouble reproducing your results. Can you tell me whether your examples above are an accurate representation of how you built libtorch? You don't appear to be exporting the configuration settings to the environment or setting them at the beginning of the command line; if that is the case, your settings won't have affected the build at all.

Could you please make another local build and test, this time with VERBOSE=1 along with everything else, and attach the log of the build along with the results of `ldd` for each dynamic library and executable?

#16 Updated by Andrew Chappell 5 months ago

Hi Chris. I'll try to provide those details at the earliest opportunity this week (the settings listed above are the settings reported back by CMake during the build process; in the latest build I explicitly exported all of those variables, whereas previously I had only specified a small subset of them).

#17 Updated by Andrew Chappell 5 months ago

CMake build log and ldd output for LibTorch related libraries attached.

#18 Updated by Andrew Chappell 5 months ago

And the corresponding ldd output for the PandoraInterface executable in LArReco.

#19 Updated by Christopher Green 5 months ago

Thanks for this, Andrew. I can see a couple of differences here:

First, your build is not finding any of the UPS packages we would normally use as dependencies: protobuf, pybind11, eigen or lapack—instead, it is finding OpenBLAS from your system in /lib64 and its internal copies of protobuf, pybind11 and eigen. numpy is also being picked up internally rather than via UPS.

In addition, given the long list of compile flags listed in the configuration summary in your build log, it appears that the specification of CXXFLAGS in the UPS build replaces rather than adds to that list, and some of them (-faligned-new, -fno-math-errno, -fno-trapping-math) might have a significant effect on performance. You also have a different Python (3.6) versus the one that would be used from UPS (3.7.2), but that's unlikely to be relevant here.

I will investigate the feasibility of merging the compiler flags rather than replacing them, and update this ticket with what I discover.
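A minimal sketch of the merge-rather-than-replace idea, assuming the UPS build script currently overwrites CXXFLAGS before invoking the libtorch build (the flags named are the ones from the configuration summary above):

# keep whatever CXXFLAGS the packaging requires, and append the defaults that would otherwise be lost
export CXXFLAGS="${CXXFLAGS} -faligned-new -fno-math-errno -fno-trapping-math"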

#20 Updated by Christopher Green 5 months ago

  • Status changed from Feedback to Work in progress

#21 Updated by Christopher Green 4 months ago

  • Status changed from Work in progress to Feedback

Andrew, I'm afraid I find myself completely unable to replicate your "good" results (<20s user time) with a local build on our main development machine, scisoftbuild01.fnal.gov (Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz—Skylake). I've given you an account on this machine, and I'd be grateful if you would be able to log in and run your local build and test of libtorch to verify that you can reproduce your results on this machine. This is a machine under sporadic heavy use, so you might need to wait a little while for a period with low load. Please put everything under /scratch/chappell to avoid overloading the TiBS backup of home directories.

I apologize that we don't have a solution for you yet, but this is extremely perplexing. Any light that can be shed by your taking the time to build on our machine would be appreciated.

#22 Updated by Andrew Chappell 4 months ago

Hi Chris. No problem, and thanks for looking into this as extensively as you have. I'll try to take a look at building on scisoftbuild01 later this week and let you know what I find.

#23 Updated by Andrew Chappell 4 months ago

Hi Chris, I finally got the opportunity to test on scisoftbuild01. Running the local build yielded these results (under apparently low load at the time):

real 2m49.115s
user 2m37.361s
sys 0m2.666s

Presumably this, at least approximately, replicates the kind of results you have been seeing? This would seem to rule out the possibility of specific LibTorch build settings being the source of the difference, as this used the same settings as were used in my local environment.

#24 Updated by Christopher Green 4 months ago

Andrew, thank you for this: if nothing else, it tells me I haven't been doing something completely silly—at least on this evidence.

Could you please attach full build logs for both your "good" build on your own machine and the one on scisoftbuild01? My best guess at the moment is that you're using a highly-optimized system library on your machine that simply isn't present on ours.

#25 Updated by Andrew Chappell 4 months ago

No problem. I've attached the build logs for the respective local and scisoftbuild01 builds.

#26 Updated by Patrick Gartung 3 months ago

  • Assignee changed from Christopher Green to Patrick Gartung

#27 Updated by Patrick Gartung 3 months ago

Running in VirtualBox using the local build, I get timings close to what Andrew reported:

time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m10.897s
user    1m2.769s
sys    0m2.405s

#28 Updated by Patrick Gartung 3 months ago

Running install_libtorch_local.sh in the VirtualBox VM gives an executable linked against the system libopenblas.so.

#29 Updated by Patrick Gartung 3 months ago

Using the ups libtorch built against ups openblas, the timing is slower and there is an error about unsupported hardware when running in a VirtualBox VM.

[vagrant@localhost issue-24669]$ time ./infer.sh 
Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 10.0.2.2:0.0
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopen//Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m40.258s
user    2m7.212s
sys    0m4.934s

#30 Updated by Patrick Gartung 3 months ago

Using the ups-built libtorch built against ups openblas on scisoftbuild01, there is no error but the timing is higher.

[gartung@scisoftbuild01 issue-24669]$ time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopen/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m48.957s
user    6m55.939s
sys    0m12.163s

#31 Updated by Patrick Gartung 3 months ago

I compared the CMakeCache.txt for the "local" build of libtorch versus the "ups" build. The local build automatically picks MKL, whereas the "ups" build has OpenBLAS set explicitly. Another difference is that the local build sets the optimization flag -msse4.

ARCH_OPT_FLAGS:STRING=-msse4                                  <
                            //Selected BLAS library                                         //Selected BLAS library
BLAS:STRING=MKL                                               | BLAS:STRING=OpenBLAS
                                                              >
                                                              > //Marks whether BLAS was manually set by user or auto-detecte
                                                              > BLAS_SET_BY_USER:STRING=TRUE

//Path to a library.                                            //Path to a library.
BLAS_Accelerate_LIBRARY:FILEPATH=BLAS_Accelerate_LIBRARY-NOTF | BLAS_goto2_LIBRARY:FILEPATH=BLAS_goto2_LIBRARY-NOTFOUND

//Marks whether BLAS was manually set by user or auto-detecte | //Path to a library.
BLAS_SET_BY_USER:STRING=FALSE                                 | BLAS_mkl_LIBRARY:FILEPATH=BLAS_mkl_LIBRARY-NOTFOUND
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_mkl_em64t_LIBRARY:FILEPATH=BLAS_mkl_em64t_LIBRARY-NOTFOU

//Path to a library.                                            //Path to a library.
BLAS_openblas_LIBRARY:FILEPATH=/usr/lib64/libopenblas.so      | BLAS_mkl_ia32_LIBRARY:FILEPATH=BLAS_mkl_ia32_LIBRARY-NOTFOUND

//Path to a library.                                            //Path to a library.
BLAS_vecLib_LIBRARY:FILEPATH=BLAS_vecLib_LIBRARY-NOTFOUND     | BLAS_mkl_intel_LIBRARY:FILEPATH=BLAS_mkl_intel_LIBRARY-NOTFOU
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_mkl_intel_lp64_LIBRARY:FILEPATH=BLAS_mkl_intel_lp64_LIBR
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_mkl_rt_LIBRARY:FILEPATH=BLAS_mkl_rt_LIBRARY-NOTFOUND
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_openblas_LIBRARY:FILEPATH=/scratch/gartung/products/open
                                  <

#32 Updated by Andrew Chappell 3 months ago

Hi Patrick,

Are you using the original build script that was sent through? Since that initial build, I added explicit exports for the various environment variables in https://cdcvs.fnal.gov/redmine/issues/24669#note-14 which should avoid the use of MKL and also OpenMP, for example (the build is otherwise unchanged).

#33 Updated by Patrick Gartung 3 months ago

I used the script linked below that Chris adapted from your original upload directly on scisoftbuild01.
https://cdcvs.fnal.gov/redmine/projects/build-framework/repository/libtorch-ssi-build/revisions/issue-24669/entry/issue-24669/install_libtorch_local.sh
It somehow found MKL but I am not sure from where.

#34 Updated by Patrick Gartung 3 months ago

Even running inside VirtualBox CMake automatically sets BLAS=MKL and ARCH_OPT_FLAGS:STRING=-msse4.
The -msse4 flag is probably causing the biggest speedup.

#35 Updated by Patrick Gartung 3 months ago

CMake selects BLAS=MKL automatically, and -msse4 is set when using gcc or clang. Because the ups build sets BLAS=OPEN or BLAS=Eigen, FindMKLDNN is never called and no optimization flag gets set. I can try setting the optimization flag on the cmake command line to see if that improves performance.

[gartung@scisoftbuild01 pytorch-c-build]$ git grep -C5 msse4
cmake/Modules/FindMKLDNN.cmake-SET(MKLDNN_LIBRARY_TYPE STATIC CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-IF(MKLDNN_USE_NATIVE_ARCH)  # Disable HostOpts in MKLDNN unless MKLDNN_USE_NATIVE_ARCH is set.
cmake/Modules/FindMKLDNN.cmake-  SET(ARCH_OPT_FLAGS "HostOpts" CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-ELSE()
cmake/Modules/FindMKLDNN.cmake-  IF(CMAKE_CXX_COMPILER_ID STREQUAL "GNU" OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
cmake/Modules/FindMKLDNN.cmake:    SET(ARCH_OPT_FLAGS "-msse4" CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-  ELSE()
cmake/Modules/FindMKLDNN.cmake-    SET(ARCH_OPT_FLAGS "" CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-  ENDIF()
cmake/Modules/FindMKLDNN.cmake-ENDIF()
cmake/Modules/FindMKLDNN.cmake-
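A sketch of how those cache variables could be forced for a test, assuming the configure step is re-run directly in an existing pytorch build tree (paths are illustrative):

cd pytorch-c-build/build                              # illustrative build directory
cmake -DBLAS=Open -DARCH_OPT_FLAGS:STRING=-msse4 .    # override the cached values
cmake --build . -- -j"$(nproc)"                       # rebuild with the new settings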

#36 Updated by Patrick Gartung 3 months ago

Adding -DARCH_OPT_FLAGS:STRING=-msse4 and -DBLAS=Open did not improve the performance. It also showed the same curious increase in timing when running outside the VM. Running inside the VM produced the unsupported-hardware error, but the time was faster.

[vagrant@localhost issue-24669]$ time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopensys//Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m42.209s
user    2m13.772s
sys    0m5.476s

[gartung@scisoftbuild01 issue-24669]$ time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopensys/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m48.831s
user    6m53.070s
sys    0m12.547s

#37 Updated by Patrick Gartung 3 months ago

I think I can explain the slowdown outside the VirtualBox VM. If I check /proc/cpuinfo in the VirtualBox VM I only see the avx2 instruction set, whereas /proc/cpuinfo on scisoftbuild01 shows the avx512 instruction set. I compiled libtorch on scisoftbuild01, so it added object code for AVX512. This would explain the error about unsupported hardware when running in the VirtualBox VM, because the AVX512 instruction set is not available there. The use of the AVX512 instruction set when running directly on scisoftbuild01 causes the CPU to clock down while performing these instructions.

Comparing the symbols in libXDNN.a, I see that different symbols per instruction set are included in the "local" build and the "ups" build. Both builds use CMAKE_BUILD_TYPE='Release'. The "local" build was done directly on scisoftbuild01 and the "ups" build was done inside the VirtualBox VM.
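For reference, the instruction sets visible to each host (or VM) can be listed directly; a generic sketch:

# list the AVX-related CPU flags the kernel reports (empty output means no AVX at all)
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u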

#38 Updated by Patrick Gartung 3 months ago

I built inside a VirtualBox VM where I disabled the AVX and AVX2 registers. I used install_libtorch_local.sh with the g variant. When I run the inference test I get a user time similar to that from running with the ups-built openblas. This build is using the system-installed openblas but cannot run the AVX2 version of its routines, since those registers are not available.


time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g

EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest-sseonly/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    1m1.690s

user    6m13.894s

sys     0m5.692s
[vagrant@localhost libtorchtest-sseonly]$ 

#39 Updated by Patrick Gartung 3 months ago

I profiled the two builds of libtorch using the install_libtorch_local.sh script. One was done on scisoftbuild01 and the other in a VirtualBox VM with AVX registers inaccessible.

The libtorch built in the VM showed these routines with the highest self cost:

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  61.89     193.98  @{libgomp.so.1.0.0+100778} [5]
  16.26      50.95  @{libgomp.so.1.0.0+100354} [7]
   7.48      23.44  sgemm_kernel_NEHALEM [8]
   2.37       7.43  @{libgomp.so.1.0.0+100782} [49]
   0.93       2.93  @{libgomp.so.1.0.0+100790} [67]
   0.93       2.91  @{libgomp.so.1.0.0+100787} [68]
   0.90       2.81  sgemm_itcopy_NEHALEM [69]
   0.63       1.97  @{libtorch_cpu.so+24536293} [72]
   0.63       1.96  @{libgomp.so.1.0.0+100358} [73]
   0.41       1.28  @{libgomp.so.1.0.0+100776} [83]
   0.24       0.76  @{libgomp.so.1.0.0+100365} [94]
   0.23       0.73  @{libgomp.so.1.0.0+101007} [95]
   0.23       0.72  @{libgomp.so.1.0.0+100363} [96]
   0.20       0.62  @{libtorch_cpu.so+26651262} [104]
   0.18       0.56  @{libtorch_cpu.so+24851925} [111]
   0.18       0.56  @{libtorch_cpu.so+26651275} [112]
   0.17       0.52  __expf_finite [113]

The libtorch built on scisoftbuild01 had these routines with the highest self cost:
NB I had to increase the number of events to 50 to get a better sample

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  11.17       1.21  @{libgomp.so.1+102010} [73]
   8.74       0.95  @{libgomp.so.1+101586} [85]
   7.40       0.80  brk [91]
   6.53       0.71  __libc_malloc [61]
   4.83       0.52  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [69]
   3.59       0.39  __sched_yield [147]
   1.98       0.21  __read [199]
   1.43       0.15  _ZN3c1010TensorType3getEv [233]
   1.29       0.14  @{libc-2.17.so+532348} [243]
   1.24       0.13  @{libc-2.17.so+527903} [249]
   1.20       0.13  _ZNK3c1010TensorType14isSubtypeOfExtESt10shared_ptrINS_4TypeEEPSo [175]
   1.15       0.12  @{libtorch_cpu.so+11554592} [263]
   1.01       0.11  @{libc-2.17.so+528006} [272]
   0.83       0.09  @{libgomp.so.1+102014} [316]
   0.78       0.08  @{libtorch_cpu.so+11413930} [338]
   0.74       0.08  @{libtorch_cpu.so+20130746} [346]
   0.69       0.07  @{libgomp.so.1+101590} [354]
   0.51       0.05  @{libgomp.so.1+101595} [414]
   0.51       0.05  sgemm_kernel_SKYLAKEX [418]

The biggest difference seems to be caused by the difference in performance between sgemm_kernel_NEHALEM and sgemm_kernel_SKYLAKEX.

Presumably the ups build of libtorch was done on an older AMD machine so the sgemm build chose to optimize for NEHALEM while the later builds chose to optimize for SKYLAKEX.

#40 Updated by Patrick Gartung 3 months ago

A Google search of sgemm_kernel_NEHALEM or sgemm_kernel_SKYLAKEX shows that these symbols come from libopenblas.so.
The kernel seems to be selected on the basis of what instruction sets are available on the machine where it runs.
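If this OpenBLAS was built with DYNAMIC_ARCH (which the runtime kernel selection suggests), the selection can usually be overridden with the OPENBLAS_CORETYPE environment variable, which would give a way to test the kernel hypothesis on scisoftbuild01; a sketch:

# force the Nehalem kernels even though AVX-512 is available (only applies to DYNAMIC_ARCH builds)
export OPENBLAS_CORETYPE=NEHALEM
time ./infer.sh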

#41 Updated by Patrick Gartung 3 months ago

I think I can draw the following conclusions.

Libtorch 1.6 built with openblas is faster than the older libtorch built with eigen.

Openblas can pick the kernel based on the registers available on the CPU. If the AVX512 registers are unavailable, the user time is shorter but the real time is slightly longer.

 time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest-sseonly/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m48.150s
user    6m44.052s
sys     0m13.108s
[gartung@scisoftbuild01 libtorchtest]$ 

time  /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest-sseonly/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m58.525s
user    2m52.739s
sys     0m7.781s
[vagrant@localhost libtorchtest-sseonly]$

#42 Updated by Patrick Gartung 3 months ago

The differences in libraries built by install_libtorch_local.sh for the c variant and the g variant are the inclusion of libasmjit.a, libfbgemm.a and libmkldnn.a and some test libraries. The differences in libtorch_cpu.so can be attributed to the inclusion of libfbgemm.a and libmkldnn.a.

[[vagrant@localhost libtorchtest-sseonly]$ ls -1s pytorch-*-build/build/lib/
pytorch-c-build/build/lib/:
total 194936
   476 libasmjit.a
   732 libbenchmark.a
     4 libbenchmark_main.a
   968 libc10d.a
   364 libc10.so
   892 libcaffe2_detectron_ops.so
    80 libcaffe2_module_test_dynamic.so
   300 libCaffe2_perfkernels_avx2.a
     4 libCaffe2_perfkernels_avx512.a
    12 libCaffe2_perfkernels_avx.a
  1100 libcaffe2_protos.a
     8 libclog.a
    92 libcpuinfo.a
    92 libcpuinfo_internals.a
 14004 libfbgemm.a
     4 libfoxi_loader.a
  1104 libgloo.a
   264 libgmock.a
     4 libgmock_main.a
   672 libgtest.a
     4 libgtest_main.a
 46040 libmkldnn.a
   248 libnnpack.a
    20 libnnpack_reference_layers.a
  6420 libonnx.a
   588 libonnx_proto.a
  5980 libprotobuf.a
   768 libprotobuf-lite.a
  5640 libprotoc.a
    24 libpthreadpool.a
   188 libpytorch_qnnpack.a
   152 libqnnpack.a
105296 libtorch_cpu.so
     8 libtorch_global_deps.so
     8 libtorch.so
  2376 libXNNPACK.a

pytorch-g-build/build/lib/:
total 92356
  364 libc10.so
  480 libcaffe2_detectron_ops.so
  120 libcaffe2_observers.so
  260 libCaffe2_perfkernels_avx2.a
    4 libCaffe2_perfkernels_avx512.a
   12 libCaffe2_perfkernels_avx.a
 1100 libcaffe2_protos.a
    8 libclog.a
   92 libcpuinfo.a
   88 libcpuinfo_internals.a
    4 libfoxi_loader.a
  248 libnnpack.a
   20 libnnpack_reference_layers.a
 6420 libonnx.a
  572 libonnx_proto.a
    0 libprotobuf-lite.so
  436 libprotobuf-lite.so.3.6.1
    0 libprotobuf.so
 3372 libprotobuf.so.3.6.1
    0 libprotoc.so
 3036 libprotoc.so.3.6.1
   24 libpthreadpool.a
  192 libpytorch_qnnpack.a
  156 libqnnpack.a
72980 libtorch_cpu.so
    8 libtorch_global_deps.so
    8 libtorch.so
 2352 libXNNPACK.a

#43 Updated by Patrick Gartung 3 months ago

A comparison of the symbols in each variant of libtorch_cpu.so shows the inclusion of libmkldnn and libfbgemm

b (anonymous namespace)::g_GradientRegistry2                  <
b (anonymous namespace)::g_GradientRegistry4                  <
b (anonymous namespace)::g_IDEEPOperatorRegistry1             <
b (anonymous namespace)::g_IDEEPOperatorRegistry2             <
b (anonymous namespace)::g_IDEEPOperatorRegistry3             <
b (anonymous namespace)::g_IDEEPOperatorRegistry4             <

b asmjit::cpuInfoGlobal                                       <
b asmjit::cpuInfoInitialized                                  <
b asmjit::VirtMem::info()::vmInfo                             <
b asmjit::VirtMem::info()::vmInfoInitialized                  <
b asmjit::VirtMem_openAnonymousMemory(int*, bool)::internalCo <
b asmjit::vm_memfd_create_not_supported                       <
b asmjit::vm_shm_strategy                                     <

B _end                                                          B _end
b fbgemm::(anonymous namespace)::fbgemmEnvGetIsa()::isaMap    | B FLAGS_caffe2_daz
b fbgemm::(anonymous namespace)::g_Avx512_Ymm_enabled         <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::g_forced_isa                 <
b fbgemm::(anonymous namespace)::isaSupportMap                <
b fbgemm::cblas_gemm_compute(fbgemm::matrix_op_t, int, float  <
b fbgemm::cblas_gemm_compute(fbgemm::matrix_op_t, int, float  <
b fbgemm::CodeGenBase<long, long, long, long>::codeCache_     <
b fbgemm::CodeGenBase<long, long, long, long>::rtMutex_       <
b fbgemm::CodeGenBase<long, long, long, long>::runtime()::rt  <
b fbgemm::CodeGenBase<unsigned char, signed char, int, int>:: <
b fbgemm::CodeGenBase<unsigned char, signed char, int, int>:: <
b fbgemm::CodeGenBase<unsigned char, signed char, int, int>:: <
b fbgemm::CodeGenBase<unsigned char, signed char, int, short> <
b fbgemm::CodeGenBase<unsigned char, signed char, int, short> <
b fbgemm::CodeGenBase<unsigned char, signed char, int, short> <
b fbgemm::fbgemmInstructionSet()::env_forced_isa              <
b fbgemm::fbgemmInstructionSet()::isAvx512_Ymm_enabled        <
b fbgemm::GenConvKernelBase<2, (fbgemm::inst_set_t)1>::codeCa <
b fbgemm::GenConvKernelBase<2, (fbgemm::inst_set_t)1>::rtMute <
b fbgemm::GenConvKernelBase<2, (fbgemm::inst_set_t)1>::runtim <
b fbgemm::GenConvKernelBase<3, (fbgemm::inst_set_t)1>::codeCa <
b guard variable for fbgemm::(anonymous namespace)::fbgemmEnv <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::cblas_gemm_compute(fbgemm::matri <
b guard variable for fbgemm::cblas_gemm_compute(fbgemm::matri <
b guard variable for fbgemm::CodeGenBase<long, long, long, lo <
b guard variable for fbgemm::CodeGenBase<long, long, long, lo <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<fl <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<fl <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::fbgemmInstructionSet()::env_forc <
b guard variable for fbgemm::fbgemmInstructionSet()::isAvx512 <
b guard variable for fbgemm::GenConvKernelBase<2, (fbgemm::in <
b guard variable for fbgemm::GenConvKernelBase<2, (fbgemm::in <
b guard variable for fbgemm::GenConvKernelBase<3, (fbgemm::in <
b guard variable for fbgemm::transpose_simd(int, int, float c <
b guard variable for gloo::infinibandToBusID(std::__cxx11::ba <
b guard variable for gloo::interfaceToBusID(std::__cxx11::bas <
b guard variable for gloo::kernelModules[abi:cxx11]()::module <
b guard variable for gloo::ReductionFunction<float>::sum      <
b guard variable for gloo::ReductionFunction<gloo::float16>:: <

b mkldnn::impl::cpu::(anonymous namespace)::abi_not_param1    <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param1        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param2        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param3        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param4        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param5        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param6        <
b mkldnn::impl::cpu::(anonymous namespace)::cpu               <
b mkldnn::impl::cpu::(anonymous namespace)::L1_cache_size     <
b mkldnn::impl::cpu::(anonymous namespace)::L2_cache_size     <
b mkldnn::impl::cpu::(anonymous namespace)::LLC_cache_size    <
b mkldnn::impl::cpu::(anonymous namespace)::LLC_data_size     <
b mkldnn::impl::cpu::avx512_common_gemm_f32::get_xbyak_gemm(b <
b mkldnn::impl::cpu::avx512_common_gemm_f32::get_xbyak_gemm(b <
b mkldnn::impl::cpu::avx_gemm_f32::get_xbyak_gemm(bool, bool, <
b mkldnn::impl::cpu::avx_gemm_f32::get_xbyak_gemm(bool, bool, <
b mkldnn::impl::cpu::bf16_cvt_utils::add_cvt_ps_to_bf16_      <
b mkldnn::impl::cpu::bf16_cvt_utils::cvt_bf16_to_ps_          <
b mkldnn::impl::cpu::bf16_cvt_utils::cvt_one_ps_to_bf16       <
b mkldnn::impl::cpu::bf16_cvt_utils::cvt_ps_to_bf16_          <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::jit_generator::dump_code(unsigned char c <
b mkldnn::impl::cpu::jit_uni_lrn_fwd_kernel_f32<(mkldnn::impl <
b mkldnn::impl::cpu::simple_barrier::barrier(mkldnn::impl::cp <
b mkldnn::impl::dump_jit_code                                 <
b mkldnn::impl::global_scratchpad_t::reference_count_         <
b mkldnn::impl::global_scratchpad_t::scratchpad_              <
b mkldnn::impl::global_scratchpad_t::size_                    <
b mkldnn::impl::initialized                                   <
b mkldnn::impl::mxcsr_save                                    <
b mkldnn::impl::verbose                                       <
b mkldnn::impl::version_printed                               <

#44 Updated by Patrick Gartung 3 months ago

The speed improvement comes from using dynamically generated code, as shown in this igprof profile of variant c, even when run in VirtualBox and using the NEHALEM kernel.

[vagrant@localhost libtorchtest]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh c
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m17.716s
user    1m57.102s
sys    0m2.413s


----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  42.40     440.86  @{libgomp.so.1.0.0+100354} [5]
  10.42     108.33  @{libgomp.so.1.0.0+100778} [6]
   2.81      29.21  sgemm_kernel_NEHALEM [24]
   1.62      16.84  @{libgomp.so.1.0.0+100358} [31]
   0.67       6.96  @{libtorch_cpu.so+26675187} [71]
   0.61       6.38  @{libgomp.so.1.0.0+100363} [75]
   0.60       6.27  @{libgomp.so.1.0.0+100365} [76]
   0.40       4.12  @{libgomp.so.1.0.0+100782} [94]
   0.38       3.93  __expf_finite [104]
   0.35       3.59  @{libgomp.so.1.0.0+100352} [105]
   0.25       2.64  @{libtorch_cpu.so+29938737} [106]
   0.25       2.62  @{libtorch_cpu.so+28878299} [107]
   0.20       2.03  @{libtorch_cpu.so+29938782} [108]
   0.19       1.93  @{libtorch_cpu.so+66220908} [111]
   0.18       1.88  sgemm_beta_NEHALEM [112]
   0.17       1.78  @{libgomp.so.1.0.0+100790} [116]
   0.17       1.76  @{libtorch_cpu.so+28878269} [123]
   0.16       1.68  @{libtorch_cpu.so+13640830} [132]
   0.16       1.68  @{libtorch_cpu.so+29246320} [133]
   0.14       1.44  @{libgomp.so.1.0.0+100787} [146]
   0.14       1.41  __read [150]
   0.14       1.41  @?0x7c642842{<dynamically-generated>} [153]
   0.13       1.35  @?0x7c57f070{<dynamically-generated>} [155]
   0.13       1.32  @{libtorch_cpu.so+29246353} [157]
   0.13       1.31  @{libtorch_cpu.so+29246331} [159]
   0.12       1.27  @{libtorch_cpu.so+29938054} [165]
   0.12       1.24  @{libtorch_cpu.so+29246342} [168]
   0.11       1.10  @{libtorch_cpu.so+63095556} [181]
   0.10       1.09  @{libtorch_cpu.so+70421375} [185]
   0.10       0.99  @{libtorch_cpu.so+29938014} [216]
   0.10       0.99  @{libtorch_cpu.so+29938000} [217]
   0.09       0.90  @{libc-2.17.so+530478} [226]
   0.09       0.89  @{libtorch_cpu.so+13640824} [230]
   0.08       0.88  @{libtorch_cpu.so+26675196} [231]
   0.08       0.86  @{libtorch_cpu.so+26675200} [235]
   0.08       0.81  @{libtorch_cpu.so+26675191} [239]
   0.07       0.76  @{libgomp.so.1.0.0+100776} [240]
   0.07       0.75  @{libtorch_cpu.so+29938392} [241]
   0.07       0.74  @{libtorch_cpu.so+70421371} [242]
   0.06       0.61  expf [93]
   0.06       0.61  @{libtorch_cpu.so+29938417} [259]

#45 Updated by Patrick Gartung 3 months ago

Running directly on scisoftbuild01, the dynamically generated code is either not used or is too fast to be sampled.

[gartung@scisoftbuild01 libtorchtest]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh c
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m14.486s
user    3m48.631s
sys    0m6.302s

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
   9.14       0.64  __libc_malloc [47]
   5.32       0.37  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [62]
   4.68       0.33  brk [108]
   2.91       0.20  @{libc-2.17.so+527903} [153]
   2.55       0.18  __read [161]
   2.55       0.18  __sched_yield [166]
   2.48       0.17  _ZN3c1010TensorType3getEv [167]
   1.91       0.13  @{libgomp.so.1+102010} [183]
   1.77       0.12  @{libc-2.17.so+528006} [188]
   1.63       0.11  @{libc-2.17.so+532348} [196]
   1.28       0.09  @{libtorch_cpu.so+20130746} [233]
   1.06       0.07  @{libtorch_cpu.so+11413930} [260]
   0.78       0.05  _ZNK3c1010TensorType14isSubtypeOfExtESt10shared_ptrINS_4TypeEEPSo [180]
   0.71       0.05  cfree [335]
   0.71       0.05  @{libtorch_cpu.so+11554592} [346]
   0.64       0.04  sgemm_kernel_SKYLAKEX [398]
   0.57       0.04  _ZdlPv [408]
   0.57       0.04  munmap [409]
   0.57       0.04  __libc_fork [415]
   0.50       0.03  _ZNK7pandora9MCManager24CreateUidToPfoTargetsMapERSt13unordered_mapIPKvS1_IPKNS_10MCParticleEfSt4hashIS6_ESt8equal_toIS6_ESaISt4pairIKS6_fEEES7_IS3_ES9_IS3_ESaISB_IKS3_SF_EEERKS1_IS3_S1_IS3_fSG_SH_SaISB_ISI_fEEESG_SH_SaISB_ISI_SP_EEE [173]
   0.43       0.03  _Znwm [45]
   0.43       0.03  @{libtorch_cpu.so+46306665} [513]
   0.43       0.03  _ZNKSt10_HashtableIPN5torch3jit4NodeES3_SaIS3_ENSt8__detail9_IdentityESt8equal_toIS3_ESt4hashIS3_ENS5_18_Mod_range_hashingENS5_20_Default_ranged_hashENS5_20_Prime_rehash_policyENS5_17_Hashtable_traitsILb0ELb1ELb1EEEE5countERKS3_ [515]
   0.43       0.03  __xstat64 [533]
   0.43       0.03  @{libtorch_cpu.so+30774291} [539]
   0.43       0.03  @{libtorch_cpu.so+46031330} [547]
   0.35       0.02  _ZNSt13basic_filebufIcSt11char_traitsIcEE6xsgetnEPcl [139]
   0.35       0.02  _ZNSt15basic_streambufIcSt11char_traitsIcEE6xsgetnEPcl [213]
   0.35       0.02  @{libc-2.17.so+527781} [555]
   0.35       0.02  @{libc-2.17.so+528016} [556]
   0.35       0.02  @{libc-2.17.so+527969} [561]
   0.35       0.02  _ZNSi6sentryC2ERSib [566]
   0.35       0.02  @{libc-2.17.so+525341} [577]
   0.35       0.02  @{libtorch_cpu.so+11552194} [597]
   0.35       0.02  @{libtorch_cpu.so+11413874} [605]
   0.35       0.02  llseek [608]
   0.35       0.02  @{libtorch_cpu.so+30774277} [620]
   0.35       0.02  sgemm_beta_SKYLAKEX [626]
   0.28       0.02  _ZN11lar_content22PreProcessingAlgorithm15ProcessCaloHitsEv [149]

#46 Updated by Patrick Gartung 3 months ago

Running on dunebuild01, which has an AMD processor, using the binaries built on scisoftbuild01, I do not get any invalid-instruction errors. It seems as though libtorch picks the correct code to use on the processor it runs on. In this case it uses libopenblas, which uses sgemm_kernel_PILEDRIVER.

time /build/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh c
Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 131.225.240.52:0.0
EventReadingAlgorithm: Processing event file: /build/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m18.074s
user    2m35.750s
sys    0m9.460s

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  19.76       2.12  @{libgomp.so.1.0.0+102010} [31]
   6.93       0.74  @{libgomp.so.1.0.0+101586} [77]
   3.11       0.33  @{libtorch_cpu.so+13640819} [114]
   2.93       0.31  @{libtorch_cpu.so+12481486} [130]
   2.42       0.26  @{libgomp.so.1.0.0+102008} [150]
   2.37       0.25  @{libtorch_cpu.so+23298013} [151]
   1.86       0.20  @{libtorch_cpu.so+23298018} [169]
   1.86       0.20  __sched_yield [170]
   1.67       0.18  @{libtorch_cpu.so+23297992} [173]
   1.58       0.17  @{libgomp.so.1.0.0+102019} [176]
   1.53       0.16  @{libgomp.so.1.0.0+102022} [178]
   1.12       0.12  __libc_malloc [161]
   1.12       0.12  @{libtorch_cpu.so+13640835} [190]
   1.07       0.11  @{libtorch_cpu.so+13640800} [195]
   1.02       0.11  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [132]
   0.93       0.10  @{libgomp.so.1.0.0+101726} [209]
   0.88       0.09  brk [218]
   0.88       0.09  @{libgomp.so.1.0.0+101584} [222]
   0.88       0.09  @{libtorch_cpu.so+13640817} [223]
   0.84       0.09  sgemm_kernel_PILEDRIVER [225]
   0.84       0.09  @{libtorch_cpu.so+12481464} [230]
   0.84       0.09  @{libtorch_cpu.so+23298005} [231]
   0.79       0.08  @{libtorch_cpu.so+13640808} [233]
   0.79       0.08  @{libgomp.so.1.0.0+101597} [234]
   0.74       0.08  @{libtorch_cpu.so+12481378} [235]
   0.70       0.07  @{libtorch_cpu.so+23298009} [238]
   0.65       0.07  @{libtorch_cpu.so+13640805} [241]
   0.60       0.06  munmap [242]
   0.51       0.05  @{libtorch_cpu.so+12481552} [256]
   0.51       0.05  @{libtorch_cpu.so+13640839} [257]
   0.46       0.05  @{libc-2.17.so+528006} [261]
   0.46       0.05  @{libtorch_cpu.so+12481346} [264]
   0.46       0.05  @{libtorch_cpu.so+12481481} [265]
   0.46       0.05  @{libc-2.17.so+588304} [266]
   0.46       0.05  @{libc-2.17.so+1373249} [268]
   0.42       0.04  @{libc-2.17.so+1373280} [282]
   0.37       0.04  @{libtorch_cpu.so+12481532} [287]
   0.33       0.03  __read [298]
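To confirm which kernel OpenBLAS actually selects on a given host, its verbose start-up output can be enabled; a sketch, assuming a reasonably recent OpenBLAS and the same variant argument to infer.sh as above:

# prints the detected core type (e.g. SkylakeX or Piledriver) when OpenBLAS initialises
OPENBLAS_VERBOSE=2 ./infer.sh c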

#47 Updated by Patrick Gartung 3 months ago

Built libtorch with the ups scripts, this time not setting MKLDNN=0. The performance looks roughly the same as variant c of the install_libtorch_local.sh script.

time bash -x /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh 
+ /scratch/gartung/products/install-a/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 5 -s 1
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m14.604s
user    1m49.215s
sys     0m3.682s
Flat profile (self >= 0.01%)

% total       Self  Function
  10.60       2.14  @{libgomp.so.1.0.0+102010} [58]
   6.74       1.36  __libc_malloc [28]
   5.38       1.08  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [75]
   2.82       0.57  brk [162]
   2.06       0.41  @{libc-2.17.so+527903} [199]
   1.81       0.36  __sched_yield [228]
   1.78       0.36  @{libc-2.17.so+528006} [229]
   1.64       0.33  @{libc-2.17.so+532348} [244]
   1.59       0.32  @{libtorch_cpu.so+21405818} [249]
   1.02       0.20  __read [311]
   0.82       0.16  _ZN3c1010TensorType3getEv [344]
   0.79       0.16  cfree [348]
   0.79       0.16  @{libtorch_cpu.so+11883058} [349]
   0.67       0.13  _ZNK3c1010TensorType14isSubtypeOfExtESt10shared_ptrINS_4TypeEEPSo [131]
   0.67       0.13  @{libgomp.so.1.0.0+102014} [376]
   0.55       0.11  _ZNKSt10_HashtableIPN5torch3jit4NodeES3_SaIS3_ENSt8__detail9_IdentityESt8equal_toIS3_ESt4hashIS3_ENS5_18_Mod_range_hashingENS5_20_Default_ranged_hashENS5_20_Prime_rehash_policyENS5_17_Hashtable_traitsILb0ELb1ELb1EEEE5countERKS3_ [422]
   0.52       0.10  _ZN5torch3jit9MemoryDAG12setWildcardsERKSt13unordered_setIPKNS0_5ValueESt4hashIS5_ESt8equal_toIS5_ESaIS5_EERKN3ska13flat_hash_mapIS5_PNS0_7ElementES7_S9_SaISt4pairIS5_SH_EEEERKSt8functionIFSH_S5_EE [136]
   0.50       0.10  _ZNKSt10_HashtableIN3c106SymbolESt4pairIKS1_St6vectorISt10shared_ptrIN5torch3jit8OperatorEESaIS9_EEESaISC_ENSt8__detail10_Select1stESt8equal_toIS1_ESt4hashIS1_ENSE_18_Mod_range_hashingENSE_20_Default_ranged_hashENSE_20_Prime_rehash_policyENSE_17_Hashtable_traitsILb1ELb0ELb1EEEE19_M_find_before_nodeEmRS3_m [448]
   0.50       0.10  _ZNKSt10_HashtableIN3c106SymbolES1_SaIS1_ENSt8__detail9_IdentityESt8equal_toIS1_ESt4hashIS1_ENS3_18_Mod_range_hashingENS3_20_Default_ranged_hashENS3_20_Prime_rehash_policyENS3_17_Hashtable_traitsILb1ELb1ELb1EEEE5countERKS1_ [449]
   0.47       0.09  __pthread_mutex_unlock [466]
   0.45       0.09  @{libtorch_cpu.so+52004917} [480]
   0.45       0.09  @{libtorch_cpu.so+11883386} [482]
   0.42       0.08  @{libtorch_cpu.so+12074178} [499]
   0.40       0.08  pthread_mutex_lock [510]
   0.37       0.07  __dynamic_cast [423]
   0.37       0.07  @{libtorch_cpu.so+12083408} [534]
   0.35       0.07  _ZNK5torch3jit4Node7matchesERKN3c1014FunctionSchemaE [263]
   0.35       0.07  _ZNKSt8__detail20_Prime_rehash_policy11_M_next_bktEm [540]
   0.32       0.06  _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPcEEvT_S7_St20forward_iterator_tag [332]
   0.32       0.06  _ZNKSt8__detail20_Prime_rehash_policy14_M_need_rehashEmmm [404]
   0.32       0.06  @{libc-2.17.so+528486} [559]
   0.32       0.06  @{libgomp.so.1.0.0+102019} [574]
   0.30       0.06  sgemm_beta_SKYLAKEX [552]
   0.27       0.05  _ZN5torch3jit4NodeC2EPNS0_5GraphEN3c106SymbolE [357]
   0.27       0.05  munmap [595]
   0.27       0.05  @{libtorch_cpu.so+51698538} [605]
   0.25       0.05  _Znwm [27]
   0.25       0.05  @{libc-2.17.so+533904} [621]
   0.25       0.05  @{libtorch_cpu.so+51854553} [630]

#48 Updated by Andrew Chappell 3 months ago

Built libtorch with ups scripts, this time not setting MKLDNN=0. The performance looks roughly the same as variant c of the install_libtorch_local.sh script.

Hi Patrick, is this still with OpenMP enabled? I notice that the user time is still quite high here, whereas in the ~20 second local build examples, OpenMP is off and the user time and real time are comparable.

#49 Updated by Patrick Gartung 3 months ago

The build_libtorch.sh script is linked below. Yes, USE_OpenMP=ON is set. I will try again with USE_OpenMP=OFF.

https://cdcvs.fnal.gov/redmine/projects/build-framework/repository/libtorch-ssi-build/revisions/issue-24669/entry/build_libtorch.sh

#50 Updated by Patrick Gartung 3 months ago

With USE_OpenMP=OFF the user time is longer. Libopenblas is linked to libgomp, which is probably where the multithreading is coming from, and it appears that thread locking is adding the extra overhead. I will see if I can build libopenblas without libgomp.
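
As a cross-check (a minimal sketch; the library path is a placeholder for whichever libopenblas the build actually links), the libgomp dependency and the thread count can be inspected without rebuilding:

# Check whether the OpenBLAS being linked pulls in the GNU OpenMP runtime
ldd /path/to/libopenblas.so | grep -i gomp

# Cap the BLAS/OpenMP thread pools at run time to see whether the locking overhead goes away
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1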

Fairly sure this was running lapack blas with openmp enabled.

time bash -x /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh 
+ /scratch/gartung/products/install-a/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 5 -s 1
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m16.915s
user    4m13.701s
sys     0m9.894s

#51 Updated by Patrick Gartung 3 months ago

Of note, /lib64/libopenblas.so does not depend on libgomp.so. It could be linking the static library but there is no way to tell because the symbols are stripped out.

lib64/libopenblas.so:
        linux-vdso.so.1 =>  (0x00007fff71f1d000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f389cca9000)
        libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f389c987000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f389c5b9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f389f49f000)
        libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f389c37d000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f389c167000)
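
Even with the symbols stripped, a rough check for a statically linked OpenMP runtime is still possible (a sketch; GOMP_* entries in the dynamic symbol table would indicate dynamic linking, while embedded strings can hint at a static copy):

# GOMP_* references in the dynamic symbol table survive stripping if libgomp is linked dynamically
nm -D /lib64/libopenblas.so | grep -i 'gomp\|omp_'

# A statically linked runtime may still leave recognizable strings behind
strings /lib64/libopenblas.so | grep -i gomp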

#52 Updated by Patrick Gartung 3 months ago

I am unable to reproduce the timings you got on your workstation. Re-enabling OpenMP gets me back to the user and real times seen before.

[gartung@scisoftbuild01 libtorch-ssi-build]$ time bash -x /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh 
+ /scratch/gartung/products/install-a/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 5 -s 1
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m13.624s
user    1m37.509s
sys     0m3.331s

#53 Updated by Patrick Gartung 3 months ago

Using variant e of install_libtorch_local.sh, which enables numpy but disables openmp, results in truly single-threaded running. Checking thread usage with top, I see only one thread in use, whereas previously there were as many threads as there are hyperthreads on the CPU. The timing is the highest by far; 93% of the time was spent in sgemm_.

This was running lapack blas.

time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh e
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    4m9.178s
user    3m59.706s
sys     0m9.503s
----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  93.00     226.45  sgemm_ [25]
   0.97       2.36  @{libc-2.17.so+1395691} [48]
   0.85       2.06  @{libc-2.17.so+1400238} [49]
   0.66       1.61  @{libc-2.17.so+1397166} [50]
   0.47       1.13  munmap [73]

#54 Updated by Patrick Gartung 3 months ago

Enabling OpenCV with OpenMP disabled results in the same slow time.
Fairly sure this was running lapack blas with openmp.

time  /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    4m7.287s
user    3m58.109s
sys     0m9.193s

Again spending all of its time in sgemm_

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  92.85     225.77  sgemm_ [25]
   0.98       2.38  @{libc-2.17.so+1400238} [48]
   0.97       2.37  @{libc-2.17.so+1395691} [49]
   0.75       1.81  @{libc-2.17.so+1397166} [50]
   0.53       1.28  munmap [63]

This was running lapack blas.

#55 Updated by Patrick Gartung 3 months ago

Hi Andy,

I am unable to reproduce the timings you get with your local installation. Can you please make a tarball of your pytorch installation on your workstation and upload it to scisoftbuild01 somewhere. I would like to look at the symbols in the libraries.

Patrick

#56 Updated by Andrew Chappell 3 months ago

Hi Patrick,

I've uploaded a torch_libs.tar.gz tarball to /scratch/chappell, which contains the lib and lib64 folders of the torch installation. I think this should have everything you need - please let me know if you need anything else.

Cheers,
Andy

#57 Updated by Patrick Gartung 3 months ago

Hi Andy,

Could you add the flag -frecord-gcc-switches to the compile options for libtorch? This will save the compile options as a string in the library. Although the flags are reported by cmake for sleef and libtorch, this flag will ensure that the flags are recorded for all of the libraries.

Patrick
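
For reference, -frecord-gcc-switches stores the command line of each translation unit in a .GNU.command.line section, so the flags can be read back from the finished library (a sketch; the library path is a placeholder):

# Dump the compiler switches recorded by -frecord-gcc-switches
readelf -p .GNU.command.line /path/to/libtorch_cpu.so | head -n 20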

#58 Updated by Patrick Gartung 3 months ago

Hi Andy,

Can you generate a profile for the executable running on your desktop? Assuming you have cvmfs available on your workstation you can setup igprof with

. /cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/etc/profile.d/init.sh
. /cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/igprof/5.9.16-gchjei/etc/profile.d/init.sh 

and then run it with

igprof -pp -- /scratch/gartung/products/install-g/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 50 -s 1

You can then generate the igprof report with

igprof-analyse -t igprof.PandoraInterface.80165.1603298873.251885.gz | c++filt > igprof.PandoraInterface.80165.1603298873.251885.txt

and view it or attach it to this issue.

#59 Updated by Andrew Chappell 3 months ago

Hi Patrick,

Sure, I'll try to have a look at this and the compilation flag tomorrow morning.

#60 Updated by Andrew Chappell 3 months ago

Hi Patrick,

I've replaced the tarball containing the libtorch libraries following the inclusion of the -frecord-gcc-switches flag. I'm also attaching the igprof output - I hope it's helpful. To the extent that I could interpret the output, the notable calls are sgemm_kernel_HASWELL and sgemm_itcopy_HASWELL, which I think are related to OpenBLAS. It will be interesting to see what the corresponding calls are on Scisoftbuild.

#61 Updated by Patrick Gartung 3 months ago

I re-enabled the OpenMP support and used these environment variables to control the number of threads used by OPENMP and OPENBLAS

export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

Running in the VirtualBox VM with AVX and AVX2 enabled, I get this timing from infer.sh:

time  /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh c
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m51.018s
user    0m48.280s
sys     0m2.119s

The attached igprof profile shows that the most time is spent in sgemm_.

The difference between scisoftbuild01 and your workstation might be attributable to a difference in CPU frequency: newer processors will boost the clock of individual cores if the system is not loaded.

#62 Updated by Patrick Gartung 3 months ago

Same environment settings but this time running directly on scisoftbuild01

[gartung@scisoftbuild01 SL7]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    4m6.573s
user    3m57.557s
sys     0m9.007s

The attached igprof profile shows that the most time is spent in sgemm_ again.

This was running lapack blas.

#63 Updated by Patrick Gartung 3 months ago

Hmm...

When the CMakeLists.txt is allowed to pick the BLAS automatically, the selected BLAS comes from lapack:

[gartung@scisoftbuild01 ~]$ ldd /scratch/gartung/libtorchtest/pytorch-g/lib/libtorch_cpu.so
        linux-vdso.so.1 =>  (0x00007fff8b7db000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f2152cdc000)
        libgcc_s.so.1 => /scratch/gartung/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libgcc_s.so.1 (0x00007f2152ac4000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f21528c0000)
        libblas.so.3 => /scratch/gartung/products/lapack/v3_8_0c/Linux64bit+3.10-2.17-e19-prof/lib64/libblas.so.3 (0x00007f215267c000)
        liblapack.so.3 => /scratch/gartung/products/lapack/v3_8_0c/Linux64bit+3.10-2.17-e19-prof/lib64/liblapack.so.3 (0x00007f2151ec9000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2151bc7000)
        libprotobuf.so.3.6.1 => /scratch/gartung/libtorchtest/pytorch-g/lib/libprotobuf.so.3.6.1 (0x00007f2151421000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2151205000)
        libc10.so => /scratch/gartung/libtorchtest/pytorch-g/lib/libc10.so (0x00007f2150fc2000)
        libstdc++.so.6 => /scratch/gartung/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6 (0x00007f2150c3e000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f2150870000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2157914000)
        libgfortran.so.5 => /scratch/gartung/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libgfortran.so.5 (0x00007f2150403000)
        libquadmath.so.0 => /scratch/gartung/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libquadmath.so.0 (0x00007f21501c3000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f214ffad000)
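
If the automatic selection keeps landing on the lapack-provided BLAS, the choice can be pinned when configuring the build (a minimal sketch, assuming the BLAS selection variable recognized by the upstream pytorch build; not taken from the scripts attached to this issue):

# Ask the pytorch/libtorch CMake configuration to prefer OpenBLAS explicitly
export BLAS=OpenBLAS
python tools/build_libtorch.py

# Afterwards confirm what actually got linked
ldd lib/libtorch_cpu.so | grep -iE 'blas|lapack'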

#64 Updated by Patrick Gartung 3 months ago

Built libtorch against the ups-built openblas using install_pytorch_local.sh with variant g and set OMP_NUM_THREADS=1. Timing is slower, probably because of system load and CPU frequency stepping:

[gartung@scisoftbuild01 products]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m54.062s
user    0m41.217s
sys     0m12.806s

Attached igprof report shows that most time is spent in sgemm_kernel_SKYLAKEX on scisoftbuild01.

#65 Updated by Patrick Gartung 3 months ago

Trying to run the same libtorch build in the VirtualBox VM results in an illegal instruction error

[vagrant@localhost products]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g

 *** Break *** illegal instruction

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007fcb4290046c in waitpid () from /lib64/libc.so.6
#1  0x00007fcb4287df62 in do_system () from /lib64/libc.so.6
#2  0x00007fcb478e9d40 in TUnixSystem::StackTrace() () from /products/root/v6_18_04d/Linux64bit+3.10-2.17-e19-prof/lib/libCore.so
#3  0x00007fcb478eb41a in TUnixSystem::DispatchSignals(ESignals) () from /products/root/v6_18_04d/Linux64bit+3.10-2.17-e19-prof/lib/libCore.so
#4  <signal handler called>
#5  0x00007fcb42403b39 in c10::Registerer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>::Registerer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Registry<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>*, std::function<std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .constprop.187] () from /scratch/gartung/products/pytorch-g/lib/libc10.so
#6  0x00007fcb42404276 in __static_initialization_and_destruction_0(int, int) [clone .constprop.186] () from /scratch/gartung/products/pytorch-g/lib/libc10.so
#7  0x00007fcb48ab99b3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fcb48aab17a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#9  0x000000000000000e in ?? ()
#10 0x00007ffc8080409f in ?? ()
#11 0x00007ffc808040e0 in ?? ()
#12 0x00007ffc808040e3 in ?? ()
#13 0x00007ffc808040ed in ?? ()
#14 0x00007ffc808040f0 in ?? ()
#15 0x00007ffc8080414f in ?? ()
#16 0x00007ffc80804152 in ?? ()
#17 0x00007ffc8080419e in ?? ()
#18 0x00007ffc808041a1 in ?? ()
#19 0x00007ffc808041d1 in ?? ()
#20 0x00007ffc808041d4 in ?? ()
#21 0x00007ffc808041d7 in ?? ()
#22 0x00007ffc808041d9 in ?? ()
#23 0x00007ffc808041dc in ?? ()
#24 0x0000000000000000 in ?? ()
===========================================================

The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00007fcb42403b39 in c10::Registerer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>::Registerer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Registry<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>*, std::function<std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .constprop.187] () from /scratch/gartung/products/pytorch-g/lib/libc10.so
#6  0x00007fcb42404276 in __static_initialization_and_destruction_0(int, int) [clone .constprop.186] () from /scratch/gartung/products/pytorch-g/lib/libc10.so
#7  0x00007fcb48ab99b3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fcb48aab17a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#9  0x000000000000000e in ?? ()
#10 0x00007ffc8080409f in ?? ()
#11 0x00007ffc808040e0 in ?? ()
#12 0x00007ffc808040e3 in ?? ()
#13 0x00007ffc808040ed in ?? ()
#14 0x00007ffc808040f0 in ?? ()
#15 0x00007ffc8080414f in ?? ()
#16 0x00007ffc80804152 in ?? ()
#17 0x00007ffc8080419e in ?? ()
#18 0x00007ffc808041a1 in ?? ()
#19 0x00007ffc808041d1 in ?? ()
#20 0x00007ffc808041d4 in ?? ()
#21 0x00007ffc808041d7 in ?? ()
#22 0x00007ffc808041d9 in ?? ()
#23 0x00007ffc808041dc in ?? ()
#24 0x0000000000000000 in ?? ()
===========================================================

real    0m7.662s
user    0m5.375s
sys    0m1.401s

Illegal instruction still occurs after rebuild.

#66 Updated by Patrick Gartung 3 months ago

When scisoftbuild01 was less loaded I get these timings.

[gartung@scisoftbuild01 products]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m42.830s
user    0m30.960s
sys     0m11.858s

#67 Updated by Patrick Gartung 3 months ago

Setting USE_NATIVE_ARCH=0 creates an executable that can be run inside VirtualBox without the illegal instruction error.

#68 Updated by Patrick Gartung 3 months ago

Running in the VirtualBox VM when the system is not loaded gives similar performance to running directly on scisoftbuild01

[vagrant@localhost libtorch]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorch/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m41.878s
user    0m37.439s
sys     0m3.999s

#69 Updated by Patrick Gartung 3 months ago

Running inside the VirtualBox VM with AVX and AVX2 enabled, the top function is sgemm_kernel_SANDYBRIDGE.
Running directly on scisoftbuild01, which has AVX512 instructions, the top function is sgemm_kernel_SKYLAKEX.
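
For a like-for-like comparison between the two machines, the dispatched OpenBLAS kernel can be pinned at run time (a sketch, assuming a DYNAMIC_ARCH OpenBLAS build that honours OPENBLAS_CORETYPE):

# Force both hosts onto the same micro-architecture kernels, e.g. Haswell,
# so that timing differences reflect the hardware rather than the kernel chosen
export OPENBLAS_CORETYPE=Haswell
time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g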

#70 Updated by Patrick Gartung 3 months ago

Running with the ups-provided libtorch and openblas with OpenMP disabled, I get these timings on scisoftbuild01; the attached igprof profile shows sgemm_kernel_SKYLAKEX as the leading function by CPU usage.

[gartung@scisoftbuild01 bin]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m46.764s
user    0m32.213s
sys     0m13.833s

#71 Updated by Patrick Gartung 3 months ago

Running in the VirtualBox VM with AVX and AVX2 enabled, I see the Unsupported hardware error again; the igprof profile shows sgemm_kernel_SANDYBRIDGE as the top function by CPU usage.

[vagrant@localhost products]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m43.901s
user    0m35.931s
sys     0m6.802s

#72 Updated by Patrick Gartung 3 months ago

Running on dunebuild01, I get the timing shown below; the attached igprof report shows that sgemm_itcopy_PILEDRIVER and sgemm_kernel_PILEDRIVER are the functions with the highest CPU usage.

bash-4.2$ time /build/gartung/libtorch-ssi-build/issue-24669/infer.sh a
Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 131.225.240.52:0.0
EventReadingAlgorithm: Processing event file: /build/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    1m20.450s
user    1m4.882s
sys     0m15.051s

#73 Updated by Patrick Gartung 3 months ago

Running in the VirtualBox VM with AVX and AVX2 disabled, I get timings similar to the AMD CPU.

NNPACK requires AVX2 to function which explains the Unsupported hardware message.

From the igprof profile sgemm_kernel_NEHALEM is the function where most of the time is spent.

[vagrant@localhost products]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m57.767s
user    0m50.897s
sys    0m5.724s

#75 Updated by Patrick Gartung 3 months ago

All of the Jenkins machines where libtorch 1.5.1 might have been built are AMD Opteron(tm) Processor 6320 or AMD Opteron(tm) Processor 6212 machines. These processors do not include the AVX2 instruction set, so if a check for the AVX2 registers was done before enabling AVX2 features, the libtorch build on the Jenkins machines would have had the AVX2 features disabled.
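
A quick way to confirm which instruction sets a given build machine advertises (a minimal sketch using /proc/cpuinfo):

# Print the CPU model and report whether the flags include AVX2
grep -m1 'model name' /proc/cpuinfo
grep -m1 -ow avx2 /proc/cpuinfo || echo "AVX2 not available"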

#76 Updated by Patrick Gartung 3 months ago

Hi Andy,

I have published on cvmfs a build of libtorch v1.6.0 that was built against openblas. You should be able to use it for testing using

setup libtorch v1_6_0 -q e19:open

This was built on scisoftbuild01 which has AVX2 available. I would be interested in seeing what kind of performance you get on your workstation.

Patrick

#77 Updated by Patrick Gartung 3 months ago

Hi Andy,

The cvmfs install will go away at some point because this is not an official build. To try it on your desktop you will need to copy the files from scisoftbuild01 and install a local products area.

First copy the files to your workstation:

mkdir products
cd products
scp -r scisoftbuild01.fnal.gov:/scratch/gartung/libtorch-ups/* .

then install a local products area with
pullProducts -l $PWD slf7 libtorch-v1_6_0 e19-open prof

Finally, set up the local products directory with
source $PWD/setup

#78 Updated by Andrew Chappell 3 months ago

Thanks Patrick, I'll try to test this out tomorrow.

#79 Updated by Andrew Chappell 3 months ago

Hi Patrick,

I tried to setup libtorch 1.6.0, but this failed with

ERROR: Action parsing failed on "unsetuprequired(mpfr v4_1_0)"

which I don't see via ups list. I tried going via the pullProducts route instead, but this also failed with

ERROR: pull of http://scisoft.fnal.gov/scisoft/packages/ups/v6_0_8/notar failed

#80 Updated by Patrick Gartung 3 months ago

Hi Andy,

I corrected the manifest file. Can you please retry the local install. The cvmfs install is still available.

#81 Updated by Andrew Chappell 3 months ago

Hi Patrick, unfortunately I still see the same errors for both the cvmfs and local installs.

#82 Updated by Patrick Gartung 3 months ago

So the pullProducts script will only install ups. You will have to untar the tar files with

cd products
for f in *.bz2;do
    tar xvjf $f 
done

#83 Updated by Andrew Chappell 3 months ago

Hi Patrick,

A brief update on the status. pullProducts now runs without reporting any issues, but when trying to build LArContent against this version of libtorch, the protobuf library isn't found, creating an issue for the Caffe2 library. The protobuf library is present in the products folder, so it's not clear to me why it's not finding this. I'll keep investigating the issue, though I'm unlikely to be able to make any progress for the next couple of days.

#84 Updated by Patrick Gartung 3 months ago

Hi Andy,

I would suggest using the CMAKE_PREFIX_PATH environment variable set by ups and setting this as a CMake define. See
https://cdcvs.fnal.gov/redmine/projects/build-framework/repository/libtorch-ssi-build/revisions/issue-24669/entry/issue-24669/install_libtorch_test.sh#L35

#85 Updated by Patrick Gartung 3 months ago

The illegal instruction hit when -march=native was used was vzeroupper, which is specific to AVX and higher instruction sets. Below I have dumped the disassembled function where the illegal instruction was reported. I suspect that -march=native may have been used for Andy's build, since that is the default, which might explain the better performance.

    c668:       0f 84 ae 00 00 00       je     c71c <c10::Registerer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> >, std::__cxx11::basic_string<
char, std::char_traits<char>, std::allocator<char> > const&>::Registerer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Registry<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, 
std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParser> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>*, std::function<std::unique_ptr<c10::C10FlagParser, std::default_delete<c10::C10FlagParse
r> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .constprop.187] [clone .cold.203]+0xd2>
    c66e:       ba 03 00 00 00          mov    $0x3,%edx
    c673:       4c 89 ee                mov    %r13,%rsi
    c676:       4c 89 ef                mov    %r13,%rdi
    c679:       c5 f8 77                vzeroupper 
    c67c:       ff d0                   callq  *%rax
    c67e:       48 89 df                mov    %rbx,%rdi
    c681:       e8 ca fa ff ff          callq  c150 <_Unwind_Resume@plt>
    c686:       48 89 c3                mov    %rax,%rbx
    c689:       c5 f8 77                vzeroupper 
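
A dump like the one above can be produced from the installed library with objdump (a sketch; the library path is the one reported in the crash, and the amount of grep context is arbitrary):

# Disassemble libc10.so with demangled names and show the code around each vzeroupper
objdump -d -C /scratch/gartung/products/pytorch-g/lib/libc10.so | grep -B4 -A4 'vzeroupper' | less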

#86 Updated by Patrick Gartung 3 months ago

This may have some bearing on why using vzeroupper yields faster execution.

https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake

#87 Updated by Andrew Chappell 2 months ago

Hi Patrick,

Here's what I see running against your libtorch 1.6 build:

real 0m22.155s
user 1m2.443s
sys 0m4.080s

#88 Updated by Patrick Gartung 2 months ago

Hi Andy,

Could you try again with

export OMP_NUM_THREADS=1

#89 Updated by Patrick Gartung 2 months ago

Andrew Chappell wrote:

Hi Patrick,

Here's what I see running against your libtorch 1.6 build:

real 0m22.155s
user 1m2.443s
sys 0m4.080s

That looks about the same as the timings here https://cdcvs.fnal.gov/redmine/issues/24669#note-52

#90 Updated by Andrew Chappell 2 months ago

Hi Patrick,

Restricting to a single thread yields

real 0m21.280s
user 0m16.676s
sys 0m3.758s

So that build is looking much better.

#91 Updated by Patrick Gartung 2 months ago

Running on a machine that supports AVX2 seems to be the key for performance. The only further optimization is the addition of -march=native, but this produces a binary that can only be run on machines with AVX or higher. One of the changes this optimization produces is the insertion of vzeroupper before/after every call to another function; clearing the upper registers allows out-of-order execution when SSE instructions are used. For example, here is the same function without and with the optimization:


000000000000db70 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]>:
    db70:    48 8b 44 24 20           mov    0x20(%rsp),%rax
    db75:    48 85 c0                 test   %rax,%rax
    db78:    74 0d                    je     db87 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x17>
    db7a:    ba 03 00 00 00           mov    $0x3,%edx
    db7f:    4c 89 f6                 mov    %r14,%rsi
    db82:    4c 89 f7                 mov    %r14,%rdi
    db85:    ff d0                    callq  *%rax
    db87:    e8 74 f5 ff ff           callq  d100 <std::terminate()@plt>
    db8c:    48 8b 44 24 30           mov    0x30(%rsp),%rax
    db91:    48 85 c0                 test   %rax,%rax
    db94:    74 20                    je     dbb6 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x46>
    db96:    48 83 7c 24 20 00        cmpq   $0x0,0x20(%rsp)
    db9c:    48 89 44 24 08           mov    %rax,0x8(%rsp)
    dba1:    75 05                    jne    dba8 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x38>
    dba3:    e8 58 f6 ff ff           callq  d200 <std::__throw_bad_function_call()@plt>
    dba8:    48 8d 74 24 08           lea    0x8(%rsp),%rsi
    dbad:    48 8d 7c 24 10           lea    0x10(%rsp),%rdi
    dbb2:    ff 54 24 28              callq  *0x28(%rsp)
    dbb6:    48 c7 44 24 30 00 00     movq   $0x0,0x30(%rsp)
    dbbd:    00 00 
    dbbf:    48 8b 44 24 20           mov    0x20(%rsp),%rax
    dbc4:    48 85 c0                 test   %rax,%rax
    dbc7:    74 0f                    je     dbd8 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x68>
    dbc9:    48 8d 7c 24 10           lea    0x10(%rsp),%rdi
    dbce:    ba 03 00 00 00           mov    $0x3,%edx
    dbd3:    48 89 fe                 mov    %rdi,%rsi
    dbd6:    ff d0                    callq  *%rax
    dbd8:    48 89 df                 mov    %rbx,%rdi
    dbdb:    e8 70 fe ff ff           callq  da50 <_Unwind_Resume@plt>

000000000000c240 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]>:
    c240:    48 8b 44 24 20           mov    0x20(%rsp),%rax
    c245:    48 85 c0                 test   %rax,%rax
    c248:    74 34                    je     c27e <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x3e>
    c24a:    ba 03 00 00 00           mov    $0x3,%edx
    c24f:    4c 89 f6                 mov    %r14,%rsi
    c252:    4c 89 f7                 mov    %r14,%rdi
    c255:    c5 f8 77                 vzeroupper 
    c258:    ff d0                    callq  *%rax
    c25a:    e8 c1 f5 ff ff           callq  b820 <std::terminate()@plt>
    c25f:    48 8b 44 24 30           mov    0x30(%rsp),%rax
    c264:    48 85 c0                 test   %rax,%rax
    c267:    74 1a                    je     c283 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x43>
    c269:    48 83 7c 24 20 00        cmpq   $0x0,0x20(%rsp)
    c26f:    48 89 44 24 08           mov    %rax,0x8(%rsp)
    c274:    75 3a                    jne    c2b0 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x70>
    c276:    c5 f8 77                 vzeroupper 
    c279:    e8 92 f6 ff ff           callq  b910 <std::__throw_bad_function_call()@plt>
    c27e:    c5 f8 77                 vzeroupper 
    c281:    eb d7                    jmp    c25a <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x1a>
    c283:    c5 f8 77                 vzeroupper 
    c286:    48 8b 44 24 20           mov    0x20(%rsp),%rax
    c28b:    48 c7 44 24 30 00 00     movq   $0x0,0x30(%rsp)
    c292:    00 00 
    c294:    48 85 c0                 test   %rax,%rax
    c297:    74 0f                    je     c2a8 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x68>
    c299:    48 8d 7c 24 10           lea    0x10(%rsp),%rdi
    c29e:    ba 03 00 00 00           mov    $0x3,%edx
    c2a3:    48 89 fe                 mov    %rdi,%rsi
    c2a6:    ff d0                    callq  *%rax
    c2a8:    48 89 df                 mov    %rbx,%rdi
    c2ab:    e8 a0 fe ff ff           callq  c150 <_Unwind_Resume@plt>
    c2b0:    48 8d 74 24 08           lea    0x8(%rsp),%rsi
    c2b5:    48 8d 7c 24 10           lea    0x10(%rsp),%rdi
    c2ba:    c5 f8 77                 vzeroupper 
    c2bd:    ff 54 24 28              callq  *0x28(%rsp)
    c2c1:    eb c3                    jmp    c286 <c10::InefficientStdFunctionContext::makeDataPtr(void*, std::function<void (void*)> const&, c10::Device) [clone .cold.35]+0x46>

#92 Updated by Patrick Gartung 2 months ago

It appears that the NNPACK static library always compiles the same functions whether or not the host has AVX2. Running with a libtorch compiled in a VirtualBox VM with AVX2 disabled, and then running the LArReco binary compiled against it, still produces the warning

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

This warning appears when trying to initialize/run xnn_create_convolution2d_nhwc_f32, which only works on CPUs with AVX2 registers. The function is then unregistered and libtorch does not try to use the convolution2d subgraph method, which makes the performance slower.

#93 Updated by Patrick Gartung 2 months ago

Running in a VirtualBox VM on my laptop, which has a Skylake CPU and supports AVX2, I get the unsupported hardware error. VirtualBox nominally supports AVX2 instructions, but apparently not in a way that NNPACK recognizes. Performance is still pretty good even without the convolution2d method being available.

[vagrant@localhost products-avx2]$ time /home/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /home/vagrant/products-avx2/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m26.150s
user    0m23.283s
sys    0m2.822s
[vagrant@localhost products-avx2]$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 158
model name    : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

#94 Updated by Patrick Gartung 2 months ago

On my Linux desktop, which has a Haswell CPU, I don't get the unsupported hardware error; the performance is shown below. I am running in an SL7 Singularity image because my desktop is running CentOS 8.

Singularity> time /build/gartung/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /build/gartung/products-avx2/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m25.951s
user    0m35.938s
sys    0m3.963s
Singularity> cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 60
model name    : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

#95 Updated by Patrick Gartung 2 months ago

The unsupported hardware message happens when running in a VirtualBox VM even if the host supports AVX and AVX2.
When running on dunebuild01, which only supports AVX, no unsupported hardware message is seen.

#96 Updated by Patrick Gartung 2 months ago

Running on cluck, which has neither AVX nor AVX2 instructions, I get the unsupported hardware warning.

NNPACK will produce that warning unless AVX or AVX2 instructions are available.

[gartung@cluck products]$ time ../libtorch-ssi-build/issue-24669/infer.sh 
Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 131.225.240.52:0.0
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    1m48.710s
user    1m34.403s
sys    0m13.697s

#97 Updated by Patrick Gartung 2 months ago

Hi Andy,

There is now an official build of libtorch v1_6_0 installed on larsoft cvmfs. I tried the build on scisoftbuild01 and got these timings. Can you give the cvmfs install a try on your local workstation?

Patrick

time ../srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m41.769s
user    0m31.177s
sys     0m10.371s

#98 Updated by Andrew Chappell 2 months ago

Hi Patrick,

Running against the e19 version (with OMP threads = 1) I find the following times:

real 0m17.380s
user 0m16.090s
sys 0m1.038s

without this OMP thread specification I get:

real 0m21.897s
user 0m57.590s
sys 0m1.134s

Thanks so much for your efforts on this, Patrick. That was a comprehensive set of tests and it's good to see the CVMFS LibTorch product performance has improved to this extent.

#99 Updated by Andrew Chappell 2 months ago

Hi Patrick,

One follow-up on the new libtorch build - was this built against a C++ standard of 17 or greater? With a complete rebuild of Pandora using the new CVMFS libtorch I'm running into issues with C++17 features that are used in larpandoracontent that I don't see when using the libtorch 1.5 CVMFS product. I've previously seen this issue occur when attempting to build against pre C++17 versions of ROOT.

Cheers,
Andy

#100 Updated by Patrick Gartung 2 months ago

The c7, e19 and e20 qualifiers are built with cxxflags += -std=cxx17 and cflags += -std=c11

#101 Updated by Lynn Garren 2 months ago

Andy, which qualified build of libtorch did you use in the build? Only the e19-lapack qualified build is compatible with the current release of larsoft. The other builds will be relevant when larsoft moves to art 3.06.

#102 Updated by Andrew Chappell 2 months ago

Hi Lynn,

This was Pandora's standalone build, rather than the build in the LArSoft context, which I've not tested yet. For this I tested the e19 qualified build.

#103 Updated by Kyle Knoepfel about 2 months ago

If you would like us to investigate, we will need more details to reproduce the problem.

#104 Updated by Andrew Chappell about 2 months ago

Hi Kyle,

The error I'm seeing is the following:

[ 9%] Building CXX object CMakeFiles/LArContent.dir/larpandoracontent/LArCheating/CheatingSliceSelectionTool.cc.o
/storage/epp2/phrdqd/Pandora.repos/TheBuild/LArContent/larpandoracontent/LArCheating/CheatingSliceSelectionTool.cc: In member function ‘virtual void lar_content::CheatingSliceSelectionTool::SelectSlices(const pandora::Algorithm*, const SliceVector&, lar_content::SliceVector&)’:
/storage/epp2/phrdqd/Pandora.repos/TheBuild/LArContent/larpandoracontent/LArCheating/CheatingSliceSelectionTool.cc:114:21: error: structured bindings only available with -std=c++17 or -std=gnu++17 [-Werror]
for (const auto [ cutVariable, index ] : reducedSliceVarIndexMap)

To produce this error I've modified the cvmfs variants of the scripts in the scripts.tar.gz file attached to this issue, in particular:

setup_torch_cvmfs.sh: I've modified the libtorch setup line to setup libtorch v1_6_0 -q e19 (attempting to use the e19:lapack qualified version fails with a no matching product error for lapack).
install_torch_cvmfs.sh: I've modified the -DCMAKE_PREFIX_PATH in the LArContent portion of the build to point to /cvmfs/larsoft.opensciencegrid.org/products/libtorch/v1_6_0/Linux64bit+3.10-2.17-e19/share/cmake

That's the totality of the changes made and produces this error, which is not seen when using the libtorch 1.5 CVMFS product. It appears that switching to libtorch 1.6 CVMFS is resulting in the c++17 standard specification in the Pandora build being ignored. Note that this only happens when the LArContent library is rebuilt, which is why I was able to successfully run with libtorch 1.6 as noted above, when I was only rebuilding LArDLContent. My initial thought was that libtorch 1.6 may have been built against C++11, because I've seen this error before in circumstances where external products used older standards, but based on Patrick's comment above, that doesn't appear to be the case here, so I'm a bit perplexed. Any thoughts you have would be appreciated. Thanks!
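
For concreteness, the two changes amount to the following (a sketch paraphrasing the modified lines rather than quoting the attached scripts verbatim):

# setup_torch_cvmfs.sh: switch the libtorch product to the new version
setup libtorch v1_6_0 -q e19

# install_torch_cvmfs.sh: point the LArContent cmake invocation at the matching config
-DCMAKE_PREFIX_PATH=/cvmfs/larsoft.opensciencegrid.org/products/libtorch/v1_6_0/Linux64bit+3.10-2.17-e19/share/cmake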

#105 Updated by Kyle Knoepfel about 2 months ago

Andy, can you make the following change in your install command:

  cmake -DCMAKE_MODULE_PATH="$MY_TEST_AREA/PandoraPFA/cmakemodules;$ROOTSYS/etc/cmake" -DPANDORA_MONITORING=ON  \
     -DPandoraSDK_DIR=$MY_TEST_AREA/PandoraSDK -DPandoraMonitoring_DIR=$MY_TEST_AREA/PandoraMonitoring \
     -DEigen3_DIR=/cvmfs/larsoft.opensciencegrid.org/products/eigen/v3_3_5/share/eigen3/cmake \
-    -DCMAKE_CXX_FLAGS=-std=c++17 \
     -DCMAKE_PREFIX_PATH="/cvmfs/larsoft.opensciencegrid.org/products/libtorch/v1_5_1a/Linux64bit+3.10-2.17-e19-eigen/share/cmake"  \
+    -DCMAKE_CXX_STANDARD=17 \
     -DTORCHLIB=ON $debug ..
  make -j4 install

The Torch properties in the libtorch CMake config files set the CMAKE_CXX_STANDARD value to 14, which I suspect is causing the problem. Let me know if that helps anything.
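
To see where the 14 comes from, the shipped CMake config can be grepped directly (a sketch; the exact file holding the property may differ between libtorch versions):

# Look for the C++ standard / compile-feature settings in the libtorch CMake config
grep -rn -iE 'cxx_standard|cxx_std_' /cvmfs/larsoft.opensciencegrid.org/products/libtorch/v1_6_0/Linux64bit+3.10-2.17-e19/share/cmake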

#106 Updated by Andrew Chappell about 2 months ago

That's solved the issue, thanks Kyle!

#107 Updated by Kyle Knoepfel about 2 months ago

  • % Done changed from 0 to 100
  • Status changed from Feedback to Resolved

Glad to hear it. Now marking this issue as resolved.

#108 Updated by Kyle Knoepfel about 2 months ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF