Support #24669

LibTorch v1.5.1 efficiency

Added by Andrew Chappell 3 months ago. Updated about 8 hours ago.

Status: Feedback
Priority: Normal
Category: External Packages
Target version: -
Start date: 07/27/2020
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

Upon testing the LibTorch 1.5.1 build provided in relation to issue https://cdcvs.fnal.gov/redmine/issues/24083#change-77911, I've found that equivalent network inference time is around a factor of 20 slower than with a standalone build of LibTorch 1.4.0. My standalone build of LibTorch was produced via:

git clone https://github.com/pytorch/pytorch.git --recursive
cd pytorch && git checkout v1.4.0
git submodule update --init --recursive
USE_OPENCV=1 && BUILD_TORCH=ON
USE_CUDA=0 && USE_NNPACK=0
CMAKE_PREFIX_PATH="<python path>"
python tools/build_libtorch.py

I'd appreciate it if you could look into this efficiency issue. As requested, I'll look to produce a local build of LibTorch v1.5.1 to see if this changes the behaviour. Based on previous discussion with Lynn, it's my understanding that the above build procedure doesn't work for v1.5.1; would you be able to let me know how you ultimately built 1.5.1 for larsoft cvmfs? Thanks very much.

scripts.tar.gz (1.31 KB) scripts.tar.gz Andrew Chappell, 08/12/2020 03:46 AM
ldd.txt (6.7 KB) ldd.txt Andrew Chappell, 09/01/2020 05:05 AM
build.log (2.2 MB) build.log Andrew Chappell, 09/01/2020 05:05 AM
ldd_pi.txt (9.74 KB) ldd_pi.txt Andrew Chappell, 09/01/2020 05:20 AM
local_build.log (2.26 MB) local_build.log Andrew Chappell, 09/18/2020 10:46 AM
scisoftbuild_build.log (2.27 MB) scisoftbuild_build.log Andrew Chappell, 09/18/2020 10:46 AM
igprof.PandoraInterface.29425.1603876042.801249.txt (631 KB) igprof.PandoraInterface.29425.1603876042.801249.txt Andrew Chappell, 10/28/2020 04:11 AM

History

#1 Updated by Andrew Chappell 3 months ago

  • Tracker changed from Feature to Support

#2 Updated by Andrew Chappell 3 months ago

I've now produced a local build of LibTorch 1.5.1. The inference time with this local build is now comparable to the times I was seeing with LibTorch 1.4, with 100 events processed at a mean time of 1.4 seconds per event. The build was performed as follows (very similar to the approach used for 1.4):

git clone https://github.com/pytorch/pytorch.git --recursive
cd pytorch
git checkout v1.5.1
git submodule sync
git submodule update --init --recursive

pip install -r requirements.txt
pip install mkl
pip install mkl-include

USE_OPENCV=1
BUILD_TORCH=ON
USE_CUDA=0
USE_NNPACK=0
CMAKE_PREFIX_PATH="<python path>"

python tools/build_libtorch.py

#3 Updated by Lynn Garren 3 months ago

  • Status changed from New to Feedback

Would you please try without the pip install mkl and mkl-include? Those are platform specific.

#4 Updated by Andrew Chappell 3 months ago

I've rebuilt without mkl and mkl-include; the inference performance using this new build is unchanged relative to my previous build.

#5 Updated by Christopher Green 3 months ago

  • Assignee set to Christopher Green
  • Category set to External Packages

Would it be possible for you to provide us with a way to reproduce both builds, along with something we can use to reproduce the same caliber of performance discrepancy you're seeing? I'm afraid that at the moment I don't have enough information to ascertain whether the performance discrepancy is a bug or a feature. To clarify: we have definitely compiled libtorch 1.5.1 differently than you, with more features. However, it's unclear whether the performance difference is a feature of "doing more," a consequence of a choice made to increase portability or avoid dependency clashes with other linked code, or an actual performance issue that can or should be rectified. The single command you list above:

pip install -r requirements.txt
is doing a lot of work, at least some of which is likely to be incompatible with (e.g.) C++ code compiled in the context of an experiment's "framework" software setup.

I'm sorry we can't give you a definitive solution immediately, but if you could provide the means for us to reproduce your experience ourselves (with as few experiment-specific dependencies as possible), we will work to get you the answers you need.

#6 Updated by Andrew Chappell 3 months ago

I've attached a tar.gz file containing a number of scripts to setup the two different configurations of Pandora that use the network, in a dunegpvm environment. This is the "standalone" setup of Pandora.

To build the version of Pandora that runs against the locally built LibTorch as per the approach outlined above (minus the two mkl-related pip installs), the scripts should be run as follows (with the caveat that in install_torch_local.sh, the -DCMAKE_PREFIX_PATH towards the end of line 23 will need to be updated to reflect the installation location of your local build of LibTorch; otherwise all files should run as-is):

source setup_torch_local.sh
source env.sh
source clone.sh
source install_torch_local.sh

To build against the SciSoft version of LibTorch on cvmfs the approach is similar, with the setup and install steps having the relevant modifications for the SciSoft LibTorch build:

source setup_torch_cvmfs.sh
source env.sh
source clone.sh
source install_torch_cvmfs.sh

Running in each case is the same (the Pandora pndr file specified is in my /dune/data folder, which I assume you can see; please let me know if that's not the case):

source infer.sh

This script is currently set up to run over 5 events in the pndr file, which should be sufficient to see the difference in computation time, but 500 events are contained in this file (the -n parameter in infer.sh). In the local LibTorch version the run should take just a few seconds, whereas the cvmfs LibTorch version will likely take a minute or so. Incidentally, the remove.sh script just clears out the build folders.
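
As a rough sketch of the timing comparison (assuming infer.sh as distributed takes no arguments, so the event count would be edited inside the script rather than passed on the command line):

time source infer.sh            # 5 events by default
# to average over more events, change "-n 5" to e.g. "-n 50" inside infer.sh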

Additionally, I wonder if it might be useful for you to let me know how you are building LibTorch, to see how a local build using your configuration runs in my local environment? Thanks again.

#7 Updated by Christopher Green 3 months ago

Thank you for this, Andrew. It might be a few days until I am able to investigate further, but in the meantime I can point you at our build script for libtorch 1.5: build-framework:source:libtorch-ssi-build|build_libtorch.sh@v1_5_1a. Let me know if you have any questions or problems.

#8 Updated by Christopher Green 2 months ago

  • Status changed from Feedback to Work in progress

#9 Updated by Christopher Green 2 months ago

  • Status changed from Work in progress to Feedback

Andrew, can you please send the output of `time source infer.sh` for each case you have under test?

#10 Updated by Andrew Chappell 2 months ago

Hi Chris,

For my local build the output is:

real 0m7.332s
user 0m23.280s
sys 0m0.676s

and for the build against CVMFS LibTorch I have:

real 2m14.093s
user 3m33.834s
sys 0m2.500s

I've not yet had the opportunity to test a build based on the build script you provided.

#11 Updated by Christopher Green 2 months ago

It looks like your build is utilizing multiple processing units (4 CPUs, if I had to guess). Can you tell me what mechanism is used for this: MPI/IPC, OpenMP, GPU or something else?

For our UPS packages (OpenBLAS, FFTW, etc.), we generally disable OpenMP because in a batch environment, parallelism that is not planned and accounted for can cause disruption, unwanted charging and/or denial of service. At some sites, any use whatsoever of a second processor (even for a mostly-idle co-process) can result in double-charged hours. At this point, my working hypothesis is that the efficiency difference is a combination of the difference in parallelism and the lack of LAPACK in our most recent packaging consistent with the LArSoft stack currently in use.

If we get an understanding of just how you expect to make use of any parallelism, we can work to improve the way we package the software you need.
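
As a sketch of the kind of control being described (standard environment variables rather than anything specific to this setup), run-time thread pools can be capped before deciding how the library should be packaged:

export OMP_NUM_THREADS=1        # caps OpenMP (libgomp) thread pools
export OPENBLAS_NUM_THREADS=1   # caps OpenBLAS's own threading, where applicable
source infer.sh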

#12 Updated by Andrew Chappell 2 months ago

Hi Chris, I'm actually not immediately sure where that parallelism is coming from - I'd speculate it's coming from the OpenCV component of LibTorch (I can try to test this by rebuilding without it - I can rule out GPU). In the interim, I have rerun the tests on the batch system we have at Warwick, which is explicitly single-threaded and have the following figures:

For my local build of LibTorch the output is:

real 0m15.686s
user 0m12.254s
sys 0m0.814s

and for the build against CVMFS LibTorch I have:

real 2m40.964s
user 2m33.601s
sys 0m5.172s

#13 Updated by Andrew Chappell 2 months ago

Looks like LibTorch builds using OpenMP by default. I've switched off this option and am rebuilding at the moment.

#14 Updated by Andrew Chappell 2 months ago

I've now rebuilt LibTorch 1.5 using the following settings:

USE_ASAN=0 \
USE_CUDA=0 \
USE_CUDNN=0 \
USE_DISTRIBUTED=0 \
USE_FBGEMM=0 \
USE_FFMPEG=0 \
USE_GFLAGS=0 \
USE_GLOG=0 \
USE_LEVELDB=0 \
USE_LMDB=0 \
USE_MKLDNN=0 \
USE_MPI=0 \
USE_NCCL=0 \
USE_NNPACK=1 \
USE_NUMA=0 \
USE_NUMPY=1 \
USE_NVRTC=0 \
USE_OBSERVERS=1 \
USE_OPENCL=0 \
USE_OPENCV=0 \
USE_OPENMP=0 \
USE_PROF=0 \
USE_QNNPACK=1 \
USE_PYTORCH_QNNPACK=1 \
USE_REDIS=0 \
USE_ROCKSDB=0 \
USE_ROCM=0 \
USE_TBB=0 \
USE_TENSORRT=0 \
USE_XNNPACK=1 \
USE_ZSTD=0

I believe these settings should now be broadly consistent with those in the build script you link above, with the exception of numpy (which seems to be overridden and reset to on at some stage of the build; I'm not sure where yet) and the fact that I now explicitly set USE_OPENMP to 0, whereas it still appears to be 1 in the script used for the CVMFS version (so presumably OpenMP was still a factor in my workstation runs with the CVMFS build).

This local LibTorch build now yields the following timing result (running on my workstation again, as in the original timings I quoted):

real 0m18.597s
user 0m17.254s
sys 0m0.964s

So, the updated build does appear to eliminate the parallelism in this case. It's not clear to me if/where LAPACK features in the LibTorch build with these settings.

#15 Updated by Christopher Green 2 months ago

I'm having real trouble reproducing your results. Can you tell me whether your examples above are an accurate representation of how you built libtorch? You don't appear to be exporting the configuration settings to the environment or setting them at the beginning of the command line; if that is the case, your settings won't have affected the build at all.
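
For illustration only, the two forms referred to here (exporting into the environment versus setting at the start of the command line) would look like this, using a couple of the settings from the list above:

# exported into the environment before invoking the build
export USE_OPENMP=0
export USE_CUDA=0
python tools/build_libtorch.py

# or set for that single command only
USE_OPENMP=0 USE_CUDA=0 python tools/build_libtorch.py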

Could you please make another local build and test, this time with VERBOSE=1 along with everything else, and attach the log of the build along with the results of `ldd` for each dynamic library and executable?

#16 Updated by Andrew Chappell about 2 months ago

Hi Chris. I'll try to provide those details at the earliest opportunity this week (the settings listed above are the settings reported back by CMake during the build process; in the latest build I explicitly exported all of those variables, whereas previously I had only specified a small subset of them).

#17 Updated by Andrew Chappell about 2 months ago

CMake build log and ldd output for LibTorch related libraries attached.

#18 Updated by Andrew Chappell about 2 months ago

And the corresponding ldd output for the PandoraInterface executable in LArReco.

#19 Updated by Christopher Green about 2 months ago

Thanks for this, Andrew. I can see a couple of differences here:

First, your build is not finding any of the UPS packages we would normally use as dependencies: protobuf, pybind11, eigen or lapack—instead, it is finding OpenBLAS from your system in /lib64 and its internal copies of protobuf, pybind11 and eigen. numpy is also being picked up internally rather than via UPS.

In addition, given the long list of compile flags listed in the configuration summary in your build log, it appears that the specification of CXXFLAGS in the UPS build replaces rather than adds to that list, and some of them (-faligned-new, -fno-math-errno, -fno-trapping-math) might have a significant effect on performance. You also have a different Python (3.6) versus the one that would be used from UPS (3.7.2), but that's unlikely to be relevant here.

I will investigate the feasibility of merging the compiler flags rather than replacing them, and update this ticket with what I discover.
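
As a minimal sketch of the "merge rather than replace" idea, using the flags named above (the exact variable handling in the UPS build scripts may differ):

# append the performance-relevant flags to whatever the build already sets,
# rather than overwriting the whole list
export CXXFLAGS="${CXXFLAGS:+${CXXFLAGS} }-faligned-new -fno-math-errno -fno-trapping-math"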

#20 Updated by Christopher Green about 2 months ago

  • Status changed from Feedback to Work in progress

#21 Updated by Christopher Green about 2 months ago

  • Status changed from Work in progress to Feedback

Andrew, I'm afraid I find myself completely unable to replicate your "good" results (<20s user time) with a local build on our main development machine, scisoftbuild01.fnal.gov (Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz—Skylake). I've given you an account on this machine, and I'd be grateful if you would be able to log in and run your local build and test of libtorch to verify that you can reproduce your results on this machine. This is a machine under sporadic heavy use, so you might need to wait a little while for a period with low load. Please put everything under /scratch/chappell to avoid overloading the TiBS backup of home directories.

I apologize that we don't have a solution for you yet, but this is extremely perplexing. Any light that can be shed by your taking the time to build on our machine would be appreciated.

#22 Updated by Andrew Chappell about 2 months ago

Hi Chris. No problem, and thanks for looking into this as extensively as you have. I'll try to take a look at building on scisoftbuild01 later this week and let you know what I find.

#23 Updated by Andrew Chappell about 1 month ago

Hi Chris, I finally got the opportunity to test on scisoftbuild01. Running the local build yielded these results (under apparently low load at the time):

real 2m49.115s
user 2m37.361s
sys 0m2.666s

Presumably this, at least approximately, replicates the kind of results you have been seeing? This would seem to rule out the possibility of specific LibTorch build settings being the source of the difference, as this used the same settings as were used in my local environment.

#24 Updated by Christopher Green about 1 month ago

Andrew, thank you for this: if nothing else, it tells me I haven't been doing something completely silly—at least on this evidence.

Could you please attach full build logs for both your "good" build on your own machine and the one on scisoftbuild01? My best guess at the moment is that you're using a highly-optimized system library on your machine that simply isn't present on ours.

#25 Updated by Andrew Chappell about 1 month ago

No problem. I've attached the build logs for the respective local and scisoftbuild01 builds.

#26 Updated by Patrick Gartung 22 days ago

  • Assignee changed from Christopher Green to Patrick Gartung

#27 Updated by Patrick Gartung 22 days ago

Running in VirtualBox using the local build, I get close to what Andrew got for timing:

time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m10.897s
user    1m2.769s
sys    0m2.405s

#28 Updated by Patrick Gartung 22 days ago

Running install_libtorch_local.sh in the VirtualBox gives an executable linked against the system libopenblas.so.
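
For reference, one way to confirm which BLAS (and threading runtime) an executable picked up; the path is illustrative:

ldd <install path>/LArReco/bin/PandoraInterface | grep -iE 'blas|lapack|gomp'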

#29 Updated by Patrick Gartung 21 days ago

Using the ups libtorch built against ups openblas, the timing is slower and there is an error about unsupported hardware when running in a VirtualBox VM.

[vagrant@localhost issue-24669]$ time ./infer.sh 
Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 10.0.2.2:0.0
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopen//Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m40.258s
user    2m7.212s
sys    0m4.934s

#30 Updated by Patrick Gartung 21 days ago

Using the ups-built libtorch built against ups openblas on scisoftbuild01, there is no error and the timing is higher.

[gartung@scisoftbuild01 issue-24669]$ time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopen/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m48.957s
user    6m55.939s
sys    0m12.163s

#31 Updated by Patrick Gartung 20 days ago

I compared the CMakeCache.txt of the "local" build of libtorch with that of the "ups" build. The local build automatically picks MKL, whereas the "ups" build has OpenBLAS set explicitly. Another difference is that the local build sets the optimization flag -msse4.

ARCH_OPT_FLAGS:STRING=-msse4                                  <
                            //Selected BLAS library                                         //Selected BLAS library
BLAS:STRING=MKL                                               | BLAS:STRING=OpenBLAS
                                                              >
                                                              > //Marks whether BLAS was manually set by user or auto-detecte
                                                              > BLAS_SET_BY_USER:STRING=TRUE

//Path to a library.                                            //Path to a library.
BLAS_Accelerate_LIBRARY:FILEPATH=BLAS_Accelerate_LIBRARY-NOTF | BLAS_goto2_LIBRARY:FILEPATH=BLAS_goto2_LIBRARY-NOTFOUND

//Marks whether BLAS was manually set by user or auto-detecte | //Path to a library.
BLAS_SET_BY_USER:STRING=FALSE                                 | BLAS_mkl_LIBRARY:FILEPATH=BLAS_mkl_LIBRARY-NOTFOUND
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_mkl_em64t_LIBRARY:FILEPATH=BLAS_mkl_em64t_LIBRARY-NOTFOU

//Path to a library.                                            //Path to a library.
BLAS_openblas_LIBRARY:FILEPATH=/usr/lib64/libopenblas.so      | BLAS_mkl_ia32_LIBRARY:FILEPATH=BLAS_mkl_ia32_LIBRARY-NOTFOUND

//Path to a library.                                            //Path to a library.
BLAS_vecLib_LIBRARY:FILEPATH=BLAS_vecLib_LIBRARY-NOTFOUND     | BLAS_mkl_intel_LIBRARY:FILEPATH=BLAS_mkl_intel_LIBRARY-NOTFOU
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_mkl_intel_lp64_LIBRARY:FILEPATH=BLAS_mkl_intel_lp64_LIBR
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_mkl_rt_LIBRARY:FILEPATH=BLAS_mkl_rt_LIBRARY-NOTFOUND
                                                              >
                                                              > //Path to a library.
                                                              > BLAS_openblas_LIBRARY:FILEPATH=/scratch/gartung/products/open
                                  <

#32 Updated by Andrew Chappell 20 days ago

Hi Patrick,

Are you using the original build script that was sent through? Since that initial build, I added explicit exports for the various environment variables in https://cdcvs.fnal.gov/redmine/issues/24669#note-14 which should avoid the use of MKL and also OpenMP, for example (the build is otherwise unchanged).

#33 Updated by Patrick Gartung 20 days ago

I used the script linked below that Chris adapted from your original upload directly on scisoftbuild01.
https://cdcvs.fnal.gov/redmine/projects/build-framework/repository/libtorch-ssi-build/revisions/issue-24669/entry/issue-24669/install_libtorch_local.sh
It somehow found MKL but I am not sure from where.

#34 Updated by Patrick Gartung 20 days ago

Even running inside VirtualBox, CMake automatically sets BLAS=MKL and ARCH_OPT_FLAGS:STRING=-msse4.
The -msse4 flag is probably causing the biggest speedup.

#35 Updated by Patrick Gartung 20 days ago

CMake selects BLAS=MKL automatically, and -msse4 is set when using gcc or clang. Because the ups build sets BLAS=OPEN or BLAS=Eigen, FindMKLDNN is never called and no optimization flag gets set. I can try setting the optimization flag on the cmake command line to see if that improves performance.

[gartung@scisoftbuild01 pytorch-c-build]$ git grep -C5 msse4
cmake/Modules/FindMKLDNN.cmake-SET(MKLDNN_LIBRARY_TYPE STATIC CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-IF(MKLDNN_USE_NATIVE_ARCH)  # Disable HostOpts in MKLDNN unless MKLDNN_USE_NATIVE_ARCH is set.
cmake/Modules/FindMKLDNN.cmake-  SET(ARCH_OPT_FLAGS "HostOpts" CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-ELSE()
cmake/Modules/FindMKLDNN.cmake-  IF(CMAKE_CXX_COMPILER_ID STREQUAL "GNU" OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
cmake/Modules/FindMKLDNN.cmake:    SET(ARCH_OPT_FLAGS "-msse4" CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-  ELSE()
cmake/Modules/FindMKLDNN.cmake-    SET(ARCH_OPT_FLAGS "" CACHE STRING "" FORCE)
cmake/Modules/FindMKLDNN.cmake-  ENDIF()
cmake/Modules/FindMKLDNN.cmake-ENDIF()
cmake/Modules/FindMKLDNN.cmake-

#36 Updated by Patrick Gartung 20 days ago

Adding -DARCH_OPT_FLAGS:STRING=-msse4 and -DBLAS=Open did not improve the performance. It also showed the same curious increase in timing when running outside the VM. Running inside the VM produced the unsupported-hardware error, but the time was faster.

[vagrant@localhost issue-24669]$ time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopensys//Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m42.209s
user    2m13.772s
sys    0m5.476s

[gartung@scisoftbuild01 issue-24669]$ time ./infer.sh 
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchopensys/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m48.831s
user    6m53.070s
sys    0m12.547s

#37 Updated by Patrick Gartung 14 days ago

I think I can explain the slowdown outside the VirtualBox VM. If I check /proc/cpuinfo in the VirtualBox VM I only see the avx2 instruction set. If I check /proc/cpuinfo on scisoftbuild01 I see the avx512 instruction set. I compiled libtorch on scisoftbuild01, so it added object code for avx512. This would explain the error about unsupported hardware when running in the VirtualBox VM, because the avx512 instruction set is not available there. The use of the avx512 instruction set when running directly on scisoftbuild01 causes the cpu to clock down while performing these instructions. Comparing the symbols in libXDNN.a, I see that different symbols per instruction set are included in the "local" build and the "ups" build. The "local" build and the "ups" build both use CMAKE_BUILD_TYPE='Release'. The "local" build was done directly on scisoftbuild01 and the "ups" build was done inside the VirtualBox VM.
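
For reference, a quick way to see which vector instruction sets a host or VM actually exposes (the /proc/cpuinfo check described above):

grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u    # e.g. avx, avx2, avx512f, avx512dq, ...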

#38 Updated by Patrick Gartung 13 days ago

I built inside a VirtualBox VM where I disabled the AVX and AVX2 registers. I used install_libtorch_local.sh with the g variant. When I run the inference test I get a user time similar to that from running with the ups-built openblas. This build is using the system-installed openblas but cannot run the AVX2 version of its routines since those registers are not available.


time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g

EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest-sseonly/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    1m1.690s

user    6m13.894s

sys     0m5.692s
[vagrant@localhost libtorchtest-sseonly]$ 

#39 Updated by Patrick Gartung 12 days ago

I profiled the two builds of libtorch using the install_libtorch_local.sh script. One was done on scisoftbuild01 and the other in a VirtualBox VM with the AVX registers inaccessible.

The libtorch built in the VM showed these routines with the highest self cost:

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  61.89     193.98  @{libgomp.so.1.0.0+100778} [5]
  16.26      50.95  @{libgomp.so.1.0.0+100354} [7]
   7.48      23.44  sgemm_kernel_NEHALEM [8]
   2.37       7.43  @{libgomp.so.1.0.0+100782} [49]
   0.93       2.93  @{libgomp.so.1.0.0+100790} [67]
   0.93       2.91  @{libgomp.so.1.0.0+100787} [68]
   0.90       2.81  sgemm_itcopy_NEHALEM [69]
   0.63       1.97  @{libtorch_cpu.so+24536293} [72]
   0.63       1.96  @{libgomp.so.1.0.0+100358} [73]
   0.41       1.28  @{libgomp.so.1.0.0+100776} [83]
   0.24       0.76  @{libgomp.so.1.0.0+100365} [94]
   0.23       0.73  @{libgomp.so.1.0.0+101007} [95]
   0.23       0.72  @{libgomp.so.1.0.0+100363} [96]
   0.20       0.62  @{libtorch_cpu.so+26651262} [104]
   0.18       0.56  @{libtorch_cpu.so+24851925} [111]
   0.18       0.56  @{libtorch_cpu.so+26651275} [112]
   0.17       0.52  __expf_finite [113]

The libtorch built on scisoftbuild01 had these routines with the highest self cost:
NB I had to increase the number of events to 50 to get a better sample

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  11.17       1.21  @{libgomp.so.1+102010} [73]
   8.74       0.95  @{libgomp.so.1+101586} [85]
   7.40       0.80  brk [91]
   6.53       0.71  __libc_malloc [61]
   4.83       0.52  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [69]
   3.59       0.39  __sched_yield [147]
   1.98       0.21  __read [199]
   1.43       0.15  _ZN3c1010TensorType3getEv [233]
   1.29       0.14  @{libc-2.17.so+532348} [243]
   1.24       0.13  @{libc-2.17.so+527903} [249]
   1.20       0.13  _ZNK3c1010TensorType14isSubtypeOfExtESt10shared_ptrINS_4TypeEEPSo [175]
   1.15       0.12  @{libtorch_cpu.so+11554592} [263]
   1.01       0.11  @{libc-2.17.so+528006} [272]
   0.83       0.09  @{libgomp.so.1+102014} [316]
   0.78       0.08  @{libtorch_cpu.so+11413930} [338]
   0.74       0.08  @{libtorch_cpu.so+20130746} [346]
   0.69       0.07  @{libgomp.so.1+101590} [354]
   0.51       0.05  @{libgomp.so.1+101595} [414]
   0.51       0.05  sgemm_kernel_SKYLAKEX [418]

The biggest difference seems to be caused by the difference in performance between sgemm_kernel_NEHALEM and sgemm_kernel_SKYLAKEX.

Presumably the ups build of libtorch was done on an older AMD machine, so the sgemm build chose to optimize for NEHALEM, while the later builds chose to optimize for SKYLAKEX.

#40 Updated by Patrick Gartung 12 days ago

A Google search of sgemm_kernel_NEHALEM or sgemm_kernel_SKYLAKEX shows that these symbols come from libopenblas.so.
The kernel seems to be selected on the basis of what instruction sets are available on the machine where it runs.
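
A sketch of how this can be confirmed or overridden at run time, assuming the OpenBLAS in use is a DYNAMIC_ARCH build that honours these environment variables:

OPENBLAS_VERBOSE=2 ./infer.sh           # reports the core OpenBLAS selects at load time
OPENBLAS_CORETYPE=NEHALEM ./infer.sh    # force a particular kernel set for comparison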

#41 Updated by Patrick Gartung 12 days ago

I think I can draw the following conclusions.

Libtorch 1.6 built with openblas is faster than the older libtorch built with eigen.

Openblas can pick the kernel based on the registers available on the CPU. If the AVX512 registers are unavailable the user time is shorter but the real time is slightly longer.

 time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest-sseonly/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m48.150s
user    6m44.052s
sys     0m13.108s
[gartung@scisoftbuild01 libtorchtest]$ 

time  /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh a
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest-sseonly/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m58.525s
user    2m52.739s
sys     0m7.781s
[vagrant@localhost libtorchtest-sseonly]$

#42 Updated by Patrick Gartung 9 days ago

The differences between the libraries built by install_libtorch_local.sh for the c variant and the g variant are the inclusion of libasmjit.a, libfbgemm.a and libmkldnn.a, along with some test libraries. The differences in libtorch_cpu.so can be attributed to the inclusion of libfbgemm.a and libmkldnn.a.

[vagrant@localhost libtorchtest-sseonly]$ ls -1s pytorch-*-build/build/lib/
pytorch-c-build/build/lib/:
total 194936
   476 libasmjit.a
   732 libbenchmark.a
     4 libbenchmark_main.a
   968 libc10d.a
   364 libc10.so
   892 libcaffe2_detectron_ops.so
    80 libcaffe2_module_test_dynamic.so
   300 libCaffe2_perfkernels_avx2.a
     4 libCaffe2_perfkernels_avx512.a
    12 libCaffe2_perfkernels_avx.a
  1100 libcaffe2_protos.a
     8 libclog.a
    92 libcpuinfo.a
    92 libcpuinfo_internals.a
 14004 libfbgemm.a
     4 libfoxi_loader.a
  1104 libgloo.a
   264 libgmock.a
     4 libgmock_main.a
   672 libgtest.a
     4 libgtest_main.a
 46040 libmkldnn.a
   248 libnnpack.a
    20 libnnpack_reference_layers.a
  6420 libonnx.a
   588 libonnx_proto.a
  5980 libprotobuf.a
   768 libprotobuf-lite.a
  5640 libprotoc.a
    24 libpthreadpool.a
   188 libpytorch_qnnpack.a
   152 libqnnpack.a
105296 libtorch_cpu.so
     8 libtorch_global_deps.so
     8 libtorch.so
  2376 libXNNPACK.a

pytorch-g-build/build/lib/:
total 92356
  364 libc10.so
  480 libcaffe2_detectron_ops.so
  120 libcaffe2_observers.so
  260 libCaffe2_perfkernels_avx2.a
    4 libCaffe2_perfkernels_avx512.a
   12 libCaffe2_perfkernels_avx.a
 1100 libcaffe2_protos.a
    8 libclog.a
   92 libcpuinfo.a
   88 libcpuinfo_internals.a
    4 libfoxi_loader.a
  248 libnnpack.a
   20 libnnpack_reference_layers.a
 6420 libonnx.a
  572 libonnx_proto.a
    0 libprotobuf-lite.so
  436 libprotobuf-lite.so.3.6.1
    0 libprotobuf.so
 3372 libprotobuf.so.3.6.1
    0 libprotoc.so
 3036 libprotoc.so.3.6.1
   24 libpthreadpool.a
  192 libpytorch_qnnpack.a
  156 libqnnpack.a
72980 libtorch_cpu.so
    8 libtorch_global_deps.so
    8 libtorch.so
 2352 libXNNPACK.a

#43 Updated by Patrick Gartung 9 days ago

A comparison of the symbols in each variant of libtorch_cpu.so shows the inclusion of libmkldnn and libfbgemm:

b (anonymous namespace)::g_GradientRegistry2                  <
b (anonymous namespace)::g_GradientRegistry4                  <
b (anonymous namespace)::g_IDEEPOperatorRegistry1             <
b (anonymous namespace)::g_IDEEPOperatorRegistry2             <
b (anonymous namespace)::g_IDEEPOperatorRegistry3             <
b (anonymous namespace)::g_IDEEPOperatorRegistry4             <

b asmjit::cpuInfoGlobal                                       <
b asmjit::cpuInfoInitialized                                  <
b asmjit::VirtMem::info()::vmInfo                             <
b asmjit::VirtMem::info()::vmInfoInitialized                  <
b asmjit::VirtMem_openAnonymousMemory(int*, bool)::internalCo <
b asmjit::vm_memfd_create_not_supported                       <
b asmjit::vm_shm_strategy                                     <

B _end                                                          B _end
b fbgemm::(anonymous namespace)::fbgemmEnvGetIsa()::isaMap    | B FLAGS_caffe2_daz
b fbgemm::(anonymous namespace)::g_Avx512_Ymm_enabled         <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<floa <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMLookup<unsi <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::GenEmbeddingSpMDMNBitLookup< <
b fbgemm::(anonymous namespace)::g_forced_isa                 <
b fbgemm::(anonymous namespace)::isaSupportMap                <
b fbgemm::cblas_gemm_compute(fbgemm::matrix_op_t, int, float  <
b fbgemm::cblas_gemm_compute(fbgemm::matrix_op_t, int, float  <
b fbgemm::CodeGenBase<long, long, long, long>::codeCache_     <
b fbgemm::CodeGenBase<long, long, long, long>::rtMutex_       <
b fbgemm::CodeGenBase<long, long, long, long>::runtime()::rt  <
b fbgemm::CodeGenBase<unsigned char, signed char, int, int>:: <
b fbgemm::CodeGenBase<unsigned char, signed char, int, int>:: <
b fbgemm::CodeGenBase<unsigned char, signed char, int, int>:: <
b fbgemm::CodeGenBase<unsigned char, signed char, int, short> <
b fbgemm::CodeGenBase<unsigned char, signed char, int, short> <
b fbgemm::CodeGenBase<unsigned char, signed char, int, short> <
b fbgemm::fbgemmInstructionSet()::env_forced_isa              <
b fbgemm::fbgemmInstructionSet()::isAvx512_Ymm_enabled        <
b fbgemm::GenConvKernelBase<2, (fbgemm::inst_set_t)1>::codeCa <
b fbgemm::GenConvKernelBase<2, (fbgemm::inst_set_t)1>::rtMute <
b fbgemm::GenConvKernelBase<2, (fbgemm::inst_set_t)1>::runtim <
b fbgemm::GenConvKernelBase<3, (fbgemm::inst_set_t)1>::codeCa <
b guard variable for fbgemm::(anonymous namespace)::fbgemmEnv <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::(anonymous namespace)::GenEmbedd <
b guard variable for fbgemm::cblas_gemm_compute(fbgemm::matri <
b guard variable for fbgemm::cblas_gemm_compute(fbgemm::matri <
b guard variable for fbgemm::CodeGenBase<long, long, long, lo <
b guard variable for fbgemm::CodeGenBase<long, long, long, lo <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::CodeGenBase<unsigned char, signe <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<fl <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<fl <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMKernelSignature<un <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::EmbeddingSpMDMRowWiseSparseKerne <
b guard variable for fbgemm::fbgemmInstructionSet()::env_forc <
b guard variable for fbgemm::fbgemmInstructionSet()::isAvx512 <
b guard variable for fbgemm::GenConvKernelBase<2, (fbgemm::in <
b guard variable for fbgemm::GenConvKernelBase<2, (fbgemm::in <
b guard variable for fbgemm::GenConvKernelBase<3, (fbgemm::in <
b guard variable for fbgemm::transpose_simd(int, int, float c <
b guard variable for gloo::infinibandToBusID(std::__cxx11::ba <
b guard variable for gloo::interfaceToBusID(std::__cxx11::bas <
b guard variable for gloo::kernelModules[abi:cxx11]()::module <
b guard variable for gloo::ReductionFunction<float>::sum      <
b guard variable for gloo::ReductionFunction<gloo::float16>:: <

b mkldnn::impl::cpu::(anonymous namespace)::abi_not_param1    <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param1        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param2        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param3        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param4        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param5        <
b mkldnn::impl::cpu::(anonymous namespace)::abi_param6        <
b mkldnn::impl::cpu::(anonymous namespace)::cpu               <
b mkldnn::impl::cpu::(anonymous namespace)::L1_cache_size     <
b mkldnn::impl::cpu::(anonymous namespace)::L2_cache_size     <
b mkldnn::impl::cpu::(anonymous namespace)::LLC_cache_size    <
b mkldnn::impl::cpu::(anonymous namespace)::LLC_data_size     <
b mkldnn::impl::cpu::avx512_common_gemm_f32::get_xbyak_gemm(b <
b mkldnn::impl::cpu::avx512_common_gemm_f32::get_xbyak_gemm(b <
b mkldnn::impl::cpu::avx_gemm_f32::get_xbyak_gemm(bool, bool, <
b mkldnn::impl::cpu::avx_gemm_f32::get_xbyak_gemm(bool, bool, <
b mkldnn::impl::cpu::bf16_cvt_utils::add_cvt_ps_to_bf16_      <
b mkldnn::impl::cpu::bf16_cvt_utils::cvt_bf16_to_ps_          <
b mkldnn::impl::cpu::bf16_cvt_utils::cvt_one_ps_to_bf16       <
b mkldnn::impl::cpu::bf16_cvt_utils::cvt_ps_to_bf16_          <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<float, float, float>::jit_in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, signed char, in <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<signed char, unsigned char,  <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::gemm_info_t<unsigned short, unsigned sho <
b mkldnn::impl::cpu::jit_generator::dump_code(unsigned char c <
b mkldnn::impl::cpu::jit_uni_lrn_fwd_kernel_f32<(mkldnn::impl <
b mkldnn::impl::cpu::simple_barrier::barrier(mkldnn::impl::cp <
b mkldnn::impl::dump_jit_code                                 <
b mkldnn::impl::global_scratchpad_t::reference_count_         <
b mkldnn::impl::global_scratchpad_t::scratchpad_              <
b mkldnn::impl::global_scratchpad_t::size_                    <
b mkldnn::impl::initialized                                   <
b mkldnn::impl::mxcsr_save                                    <
b mkldnn::impl::verbose                                       <
b mkldnn::impl::version_printed                               <

#44 Updated by Patrick Gartung 9 days ago

The speed improvement from the dynamically generated code shows up in this igprof profile of variant c, even when run in the VirtualBox VM and using the NEHALEM kernel.

[vagrant@localhost libtorchtest]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh c
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m17.716s
user    1m57.102s
sys    0m2.413s


----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  42.40     440.86  @{libgomp.so.1.0.0+100354} [5]
  10.42     108.33  @{libgomp.so.1.0.0+100778} [6]
   2.81      29.21  sgemm_kernel_NEHALEM [24]
   1.62      16.84  @{libgomp.so.1.0.0+100358} [31]
   0.67       6.96  @{libtorch_cpu.so+26675187} [71]
   0.61       6.38  @{libgomp.so.1.0.0+100363} [75]
   0.60       6.27  @{libgomp.so.1.0.0+100365} [76]
   0.40       4.12  @{libgomp.so.1.0.0+100782} [94]
   0.38       3.93  __expf_finite [104]
   0.35       3.59  @{libgomp.so.1.0.0+100352} [105]
   0.25       2.64  @{libtorch_cpu.so+29938737} [106]
   0.25       2.62  @{libtorch_cpu.so+28878299} [107]
   0.20       2.03  @{libtorch_cpu.so+29938782} [108]
   0.19       1.93  @{libtorch_cpu.so+66220908} [111]
   0.18       1.88  sgemm_beta_NEHALEM [112]
   0.17       1.78  @{libgomp.so.1.0.0+100790} [116]
   0.17       1.76  @{libtorch_cpu.so+28878269} [123]
   0.16       1.68  @{libtorch_cpu.so+13640830} [132]
   0.16       1.68  @{libtorch_cpu.so+29246320} [133]
   0.14       1.44  @{libgomp.so.1.0.0+100787} [146]
   0.14       1.41  __read [150]
   0.14       1.41  @?0x7c642842{<dynamically-generated>} [153]
   0.13       1.35  @?0x7c57f070{<dynamically-generated>} [155]
   0.13       1.32  @{libtorch_cpu.so+29246353} [157]
   0.13       1.31  @{libtorch_cpu.so+29246331} [159]
   0.12       1.27  @{libtorch_cpu.so+29938054} [165]
   0.12       1.24  @{libtorch_cpu.so+29246342} [168]
   0.11       1.10  @{libtorch_cpu.so+63095556} [181]
   0.10       1.09  @{libtorch_cpu.so+70421375} [185]
   0.10       0.99  @{libtorch_cpu.so+29938014} [216]
   0.10       0.99  @{libtorch_cpu.so+29938000} [217]
   0.09       0.90  @{libc-2.17.so+530478} [226]
   0.09       0.89  @{libtorch_cpu.so+13640824} [230]
   0.08       0.88  @{libtorch_cpu.so+26675196} [231]
   0.08       0.86  @{libtorch_cpu.so+26675200} [235]
   0.08       0.81  @{libtorch_cpu.so+26675191} [239]
   0.07       0.76  @{libgomp.so.1.0.0+100776} [240]
   0.07       0.75  @{libtorch_cpu.so+29938392} [241]
   0.07       0.74  @{libtorch_cpu.so+70421371} [242]
   0.06       0.61  expf [93]
   0.06       0.61  @{libtorch_cpu.so+29938417} [259]

#45 Updated by Patrick Gartung 9 days ago

Running directly on scisoftbuild01, the dynamically generated code is either not used or is too fast to be sampled.

[gartung@scisoftbuild01 libtorchtest]$ time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh c
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m14.486s
user    3m48.631s
sys    0m6.302s

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
   9.14       0.64  __libc_malloc [47]
   5.32       0.37  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [62]
   4.68       0.33  brk [108]
   2.91       0.20  @{libc-2.17.so+527903} [153]
   2.55       0.18  __read [161]
   2.55       0.18  __sched_yield [166]
   2.48       0.17  _ZN3c1010TensorType3getEv [167]
   1.91       0.13  @{libgomp.so.1+102010} [183]
   1.77       0.12  @{libc-2.17.so+528006} [188]
   1.63       0.11  @{libc-2.17.so+532348} [196]
   1.28       0.09  @{libtorch_cpu.so+20130746} [233]
   1.06       0.07  @{libtorch_cpu.so+11413930} [260]
   0.78       0.05  _ZNK3c1010TensorType14isSubtypeOfExtESt10shared_ptrINS_4TypeEEPSo [180]
   0.71       0.05  cfree [335]
   0.71       0.05  @{libtorch_cpu.so+11554592} [346]
   0.64       0.04  sgemm_kernel_SKYLAKEX [398]
   0.57       0.04  _ZdlPv [408]
   0.57       0.04  munmap [409]
   0.57       0.04  __libc_fork [415]
   0.50       0.03  _ZNK7pandora9MCManager24CreateUidToPfoTargetsMapERSt13unordered_mapIPKvS1_IPKNS_10MCParticleEfSt4hashIS6_ESt8equal_toIS6_ESaISt4pairIKS6_fEEES7_IS3_ES9_IS3_ESaISB_IKS3_SF_EEERKS1_IS3_S1_IS3_fSG_SH_SaISB_ISI_fEEESG_SH_SaISB_ISI_SP_EEE [173]
   0.43       0.03  _Znwm [45]
   0.43       0.03  @{libtorch_cpu.so+46306665} [513]
   0.43       0.03  _ZNKSt10_HashtableIPN5torch3jit4NodeES3_SaIS3_ENSt8__detail9_IdentityESt8equal_toIS3_ESt4hashIS3_ENS5_18_Mod_range_hashingENS5_20_Default_ranged_hashENS5_20_Prime_rehash_policyENS5_17_Hashtable_traitsILb0ELb1ELb1EEEE5countERKS3_ [515]
   0.43       0.03  __xstat64 [533]
   0.43       0.03  @{libtorch_cpu.so+30774291} [539]
   0.43       0.03  @{libtorch_cpu.so+46031330} [547]
   0.35       0.02  _ZNSt13basic_filebufIcSt11char_traitsIcEE6xsgetnEPcl [139]
   0.35       0.02  _ZNSt15basic_streambufIcSt11char_traitsIcEE6xsgetnEPcl [213]
   0.35       0.02  @{libc-2.17.so+527781} [555]
   0.35       0.02  @{libc-2.17.so+528016} [556]
   0.35       0.02  @{libc-2.17.so+527969} [561]
   0.35       0.02  _ZNSi6sentryC2ERSib [566]
   0.35       0.02  @{libc-2.17.so+525341} [577]
   0.35       0.02  @{libtorch_cpu.so+11552194} [597]
   0.35       0.02  @{libtorch_cpu.so+11413874} [605]
   0.35       0.02  llseek [608]
   0.35       0.02  @{libtorch_cpu.so+30774277} [620]
   0.35       0.02  sgemm_beta_SKYLAKEX [626]
   0.28       0.02  _ZN11lar_content22PreProcessingAlgorithm15ProcessCaloHitsEv [149]

#46 Updated by Patrick Gartung 8 days ago

Running on dunebuild01, which has an AMD processor, using the binaries built on scisoftbuild01, I do not get any invalid-instruction errors. It seems as though libtorch picks the correct code to use for the processor it runs on. In this case it uses libopenblas, which uses sgemm_kernel_PILEDRIVER.

time /build/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh c
Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 131.225.240.52:0.0
EventReadingAlgorithm: Processing event file: /build/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m18.074s
user    2m35.750s
sys    0m9.460s

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  19.76       2.12  @{libgomp.so.1.0.0+102010} [31]
   6.93       0.74  @{libgomp.so.1.0.0+101586} [77]
   3.11       0.33  @{libtorch_cpu.so+13640819} [114]
   2.93       0.31  @{libtorch_cpu.so+12481486} [130]
   2.42       0.26  @{libgomp.so.1.0.0+102008} [150]
   2.37       0.25  @{libtorch_cpu.so+23298013} [151]
   1.86       0.20  @{libtorch_cpu.so+23298018} [169]
   1.86       0.20  __sched_yield [170]
   1.67       0.18  @{libtorch_cpu.so+23297992} [173]
   1.58       0.17  @{libgomp.so.1.0.0+102019} [176]
   1.53       0.16  @{libgomp.so.1.0.0+102022} [178]
   1.12       0.12  __libc_malloc [161]
   1.12       0.12  @{libtorch_cpu.so+13640835} [190]
   1.07       0.11  @{libtorch_cpu.so+13640800} [195]
   1.02       0.11  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [132]
   0.93       0.10  @{libgomp.so.1.0.0+101726} [209]
   0.88       0.09  brk [218]
   0.88       0.09  @{libgomp.so.1.0.0+101584} [222]
   0.88       0.09  @{libtorch_cpu.so+13640817} [223]
   0.84       0.09  sgemm_kernel_PILEDRIVER [225]
   0.84       0.09  @{libtorch_cpu.so+12481464} [230]
   0.84       0.09  @{libtorch_cpu.so+23298005} [231]
   0.79       0.08  @{libtorch_cpu.so+13640808} [233]
   0.79       0.08  @{libgomp.so.1.0.0+101597} [234]
   0.74       0.08  @{libtorch_cpu.so+12481378} [235]
   0.70       0.07  @{libtorch_cpu.so+23298009} [238]
   0.65       0.07  @{libtorch_cpu.so+13640805} [241]
   0.60       0.06  munmap [242]
   0.51       0.05  @{libtorch_cpu.so+12481552} [256]
   0.51       0.05  @{libtorch_cpu.so+13640839} [257]
   0.46       0.05  @{libc-2.17.so+528006} [261]
   0.46       0.05  @{libtorch_cpu.so+12481346} [264]
   0.46       0.05  @{libtorch_cpu.so+12481481} [265]
   0.46       0.05  @{libc-2.17.so+588304} [266]
   0.46       0.05  @{libc-2.17.so+1373249} [268]
   0.42       0.04  @{libc-2.17.so+1373280} [282]
   0.37       0.04  @{libtorch_cpu.so+12481532} [287]
   0.33       0.03  __read [298]

#47 Updated by Patrick Gartung 8 days ago

Built libtorch with ups scripts, this time not setting MKLDNN=0. The performance looks roughly the same as variant c of the install_libtorch_local.sh script.

time bash -x /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh 
+ /scratch/gartung/products/install-a/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 5 -s 1
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m14.604s
user    1m49.215s
sys     0m3.682s
Flat profile (self >= 0.01%)

% total       Self  Function
  10.60       2.14  @{libgomp.so.1.0.0+102010} [58]
   6.74       1.36  __libc_malloc [28]
   5.38       1.08  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv [75]
   2.82       0.57  brk [162]
   2.06       0.41  @{libc-2.17.so+527903} [199]
   1.81       0.36  __sched_yield [228]
   1.78       0.36  @{libc-2.17.so+528006} [229]
   1.64       0.33  @{libc-2.17.so+532348} [244]
   1.59       0.32  @{libtorch_cpu.so+21405818} [249]
   1.02       0.20  __read [311]
   0.82       0.16  _ZN3c1010TensorType3getEv [344]
   0.79       0.16  cfree [348]
   0.79       0.16  @{libtorch_cpu.so+11883058} [349]
   0.67       0.13  _ZNK3c1010TensorType14isSubtypeOfExtESt10shared_ptrINS_4TypeEEPSo [131]
   0.67       0.13  @{libgomp.so.1.0.0+102014} [376]
   0.55       0.11  _ZNKSt10_HashtableIPN5torch3jit4NodeES3_SaIS3_ENSt8__detail9_IdentityESt8equal_toIS3_ESt4hashIS3_ENS5_18_Mod_range_hashingENS5_20_Default_ranged_hashENS5_20_Prime_rehash_policyENS5_17_Hashtable_traitsILb0ELb1ELb1EEEE5countERKS3_ [422]
   0.52       0.10  _ZN5torch3jit9MemoryDAG12setWildcardsERKSt13unordered_setIPKNS0_5ValueESt4hashIS5_ESt8equal_toIS5_ESaIS5_EERKN3ska13flat_hash_mapIS5_PNS0_7ElementES7_S9_SaISt4pairIS5_SH_EEEERKSt8functionIFSH_S5_EE [136]
   0.50       0.10  _ZNKSt10_HashtableIN3c106SymbolESt4pairIKS1_St6vectorISt10shared_ptrIN5torch3jit8OperatorEESaIS9_EEESaISC_ENSt8__detail10_Select1stESt8equal_toIS1_ESt4hashIS1_ENSE_18_Mod_range_hashingENSE_20_Default_ranged_hashENSE_20_Prime_rehash_policyENSE_17_Hashtable_traitsILb1ELb0ELb1EEEE19_M_find_before_nodeEmRS3_m [448]
   0.50       0.10  _ZNKSt10_HashtableIN3c106SymbolES1_SaIS1_ENSt8__detail9_IdentityESt8equal_toIS1_ESt4hashIS1_ENS3_18_Mod_range_hashingENS3_20_Default_ranged_hashENS3_20_Prime_rehash_policyENS3_17_Hashtable_traitsILb1ELb1ELb1EEEE5countERKS1_ [449]
   0.47       0.09  __pthread_mutex_unlock [466]
   0.45       0.09  @{libtorch_cpu.so+52004917} [480]
   0.45       0.09  @{libtorch_cpu.so+11883386} [482]
   0.42       0.08  @{libtorch_cpu.so+12074178} [499]
   0.40       0.08  pthread_mutex_lock [510]
   0.37       0.07  __dynamic_cast [423]
   0.37       0.07  @{libtorch_cpu.so+12083408} [534]
   0.35       0.07  _ZNK5torch3jit4Node7matchesERKN3c1014FunctionSchemaE [263]
   0.35       0.07  _ZNKSt8__detail20_Prime_rehash_policy11_M_next_bktEm [540]
   0.32       0.06  _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPcEEvT_S7_St20forward_iterator_tag [332]
   0.32       0.06  _ZNKSt8__detail20_Prime_rehash_policy14_M_need_rehashEmmm [404]
   0.32       0.06  @{libc-2.17.so+528486} [559]
   0.32       0.06  @{libgomp.so.1.0.0+102019} [574]
   0.30       0.06  sgemm_beta_SKYLAKEX [552]
   0.27       0.05  _ZN5torch3jit4NodeC2EPNS0_5GraphEN3c106SymbolE [357]
   0.27       0.05  munmap [595]
   0.27       0.05  @{libtorch_cpu.so+51698538} [605]
   0.25       0.05  _Znwm [27]
   0.25       0.05  @{libc-2.17.so+533904} [621]
   0.25       0.05  @{libtorch_cpu.so+51854553} [630]

#48 Updated by Andrew Chappell 8 days ago

> Built libtorch with ups scripts, this time not setting MKLDNN=0. The performance looks roughly the same as variant c of the install_libtorch_local.sh script.

Hi Patrick, is this still with OpenMP enabled? I notice that the user time is still quite high here, whereas in the ~20 second local build examples, OpenMP is off and the user time and real time are comparable.

#49 Updated by Patrick Gartung 8 days ago

The build_libtorch.sh script is linked below. Yes, USE_OpenMP=ON is set. I will try again with USE_OpenMP=OFF.

https://cdcvs.fnal.gov/redmine/projects/build-framework/repository/libtorch-ssi-build/revisions/issue-24669/entry/build_libtorch.sh
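
For reference, a minimal sketch of how the OpenMP toggle is typically passed to the upstream libtorch build; the exact variable handling inside build_libtorch.sh may well differ, so treat this as an assumption rather than a description of that script:

# Sketch only: the upstream pytorch build normally reads USE_OPENMP from the
# environment, or it can be passed to CMake directly.
export USE_OPENMP=0        # set to 1 (or ON) to re-enable OpenMP threading
python tools/build_libtorch.py
# or, when driving CMake by hand:
cmake -DUSE_OPENMP=OFF <other options> ../pytorch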

#50 Updated by Patrick Gartung 8 days ago

With USE_OpenMP=OFF the user time is longer. Libopenblas is linked against libgomp, which is probably where the multithreading is coming from, and thread locking appears to be adding the extra overhead. I will see if I can build libopenblas without libgomp.

time bash -x /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh 
+ /scratch/gartung/products/install-a/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 5 -s 1
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m16.915s
user    4m13.701s
sys     0m9.894s
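
For what it's worth, a minimal sketch of how OpenBLAS itself can be built without OpenMP using the upstream make variables, so that the resulting library carries no libgomp dependency; the clone URL and install prefix here are placeholders for whatever the SciSoft build actually uses:

# Sketch only: build OpenBLAS against its own pthread pool instead of OpenMP
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make USE_OPENMP=0 USE_THREAD=1      # USE_THREAD=0 would give a fully serial build
make PREFIX=<install prefix> install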

#51 Updated by Patrick Gartung 8 days ago

Of note, /lib64/libopenblas.so does not depend on libgomp.so. It could be linking libgomp statically, but there is no way to tell because the symbols have been stripped.

lib64/libopenblas.so:
        linux-vdso.so.1 =>  (0x00007fff71f1d000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f389cca9000)
        libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f389c987000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f389c5b9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f389f49f000)
        libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f389c37d000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f389c167000)
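
One way to probe whether OpenMP code was linked in statically despite the stripping is to inspect the dynamic symbol table, which stripping leaves intact; a sketch:

# NEEDED entries should match what ldd reports
readelf -d /lib64/libopenblas.so | grep NEEDED
# any GOMP_*/omp_* entries among the dynamic symbols hint at OpenMP code inside
nm -D /lib64/libopenblas.so | grep -iE 'gomp|omp_' || echo "no OpenMP symbols found"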

#52 Updated by Patrick Gartung 8 days ago

I am unable to reproduce the timings you got on your workstation. Re-enabling OpenMP gets me back to the same user and real times as before.

[gartung@scisoftbuild01 libtorch-ssi-build]$ time bash -x /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh 
+ /scratch/gartung/products/install-a/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 5 -s 1
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    0m13.624s
user    1m37.509s
sys     0m3.331s

#53 Updated by Patrick Gartung 7 days ago

Using variant e of install_libtorch_local.sh, which enables numpy but disables OpenMP, results in truly single-threaded running. When I checked thread usage with top I saw only one thread in use, whereas previously there were as many threads as there are hyperthreads on the CPU. The timing is the highest by far: 93% of the time was spent in sgemm_.

time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh e
EventReadingAlgorithm: Processing event file: /scratch/gartung/libtorchtest/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    4m9.178s
user    3m59.706s
sys     0m9.503s

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  93.00     226.45  sgemm_ [25]
   0.97       2.36  @{libc-2.17.so+1395691} [48]
   0.85       2.06  @{libc-2.17.so+1400238} [49]
   0.66       1.61  @{libc-2.17.so+1397166} [50]
   0.47       1.13  munmap [73]
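
As an aside, when the threading does come from libgomp or OpenBLAS's own thread pool, the thread count can usually be capped at run time rather than at build time, which may make the single-threaded and multi-threaded builds easier to compare; a sketch using the same infer.sh wrapper as above:

# Sketch: cap BLAS/OpenMP threading without rebuilding. OMP_NUM_THREADS is read
# by OpenMP runtimes, OPENBLAS_NUM_THREADS by OpenBLAS itself.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
time /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh e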

#54 Updated by Patrick Gartung 7 days ago

Enabling OpenCV with OpenMP disabled results in the same slow time.

time  /scratch/gartung/srcs/libtorch-ssi-build/issue-24669/infer.sh g
EventReadingAlgorithm: Processing event file: /scratch/gartung/products/Pandora_Events_0.pndr

   PROCESSING EVENT: 0

   PROCESSING EVENT: 1

   PROCESSING EVENT: 2

   PROCESSING EVENT: 3

   PROCESSING EVENT: 4

real    4m7.287s
user    3m58.109s
sys     0m9.193s

Again, it is spending almost all of its time in sgemm_.

----------------------------------------------------------------------
Flat profile (self >= 0.01%)

% total       Self  Function
  92.85     225.77  sgemm_ [25]
   0.98       2.38  @{libc-2.17.so+1400238} [48]
   0.97       2.37  @{libc-2.17.so+1395691} [49]
   0.75       1.81  @{libc-2.17.so+1397166} [50]
   0.53       1.28  munmap [63]

#55 Updated by Patrick Gartung 7 days ago

Hi Andy,

I am unable to reproduce the timings you get with your local installation. Can you please make a tarball of your pytorch installation on your workstation and upload it somewhere on scisoftbuild01? I would like to look at the symbols in the libraries.

Patrick

#56 Updated by Andrew Chappell 6 days ago

Hi Patrick,

I've uploaded a torch_libs.tar.gz tarball to /scratch/chappell, which contains the lib and lib64 folders of the torch installation. I think this should have everything you need - please let me know if you need anything else.

Cheers,
Andy

#57 Updated by Patrick Gartung 1 day ago

Hi Andy,

Could you add the flag -frecord-gcc-switches to the compile options for libtorch? This will save the compile options as strings in the library. Although the flags are reported by cmake for sleef and libtorch, this option will ensure that the flags are recorded for all of the libraries.

Patrick
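
For what it's worth, a sketch of how the flag could be injected and then read back afterwards; the exact place to hook it into the libtorch build may differ on your side:

# Pass the flag via the flag variables CMake picks up at configure time
export CFLAGS="-frecord-gcc-switches"
export CXXFLAGS="-frecord-gcc-switches"
python tools/build_libtorch.py
# The recorded switches end up in the .GCC.command.line section of each library
readelf -p .GCC.command.line libtorch_cpu.so | head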

#58 Updated by Patrick Gartung about 23 hours ago

Hi Andy,

Can you generate a profile for the executable running on your desktop? Assuming you have cvmfs available on your workstation, you can set up igprof with

. /cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/etc/profile.d/init.sh
. /cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/igprof/5.9.16-gchjei/etc/profile.d/init.sh 

and then run it with

igprof -pp -- /scratch/gartung/products/install-g/LArReco/bin/PandoraInterface -r AllHitsNu -i /scratch/gartung/products/LArReco/settings/PandoraSettings_Inference_DUNEFD_DeepLearningId.xml -g /scratch/gartung/products/LArReco/geometry/PandoraGeometry_DUNEFD_1x2x6.xml -e /scratch/gartung/products/Pandora_Events_0.pndr -N -n 50 -s 1

You can then generate the igprof report with

igprof-analyse -t igprof.PandoraInterface.80165.1603298873.251885.gz > igprof.PandoraInterface.80165.1603298873.251885.txt

and view it or attach it to this issue.

#59 Updated by Andrew Chappell about 23 hours ago

Hi Patrick,

Sure, I'll try to have a look at this and the compilation flag tomorrow morning.

#60 Updated by Andrew Chappell about 8 hours ago

Hi Patrick,

I've replaced the tarball containing the libtorch libraries now that the -frecord-gcc-switches flag has been included. I'm also attaching the igprof output - I hope it's helpful. As far as I could interpret it, the notable calls are sgemm_kernel_HASWELL and sgemm_itcopy_HASWELL, which I think are probably related to OpenBLAS. It will be interesting to see what the corresponding calls are on scisoftbuild01.
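
If it helps with the comparison, a DYNAMIC_ARCH build of OpenBLAS can usually be asked to report, or to override, the kernel family it selects at run time; a sketch, assuming the library in question honours these environment variables:

# Sketch: print the core type OpenBLAS selects at initialisation
OPENBLAS_VERBOSE=2 <PandoraInterface command as above> 2>&1 | grep -i core
# or pin the kernel family so both machines run the same sgemm_ variant
OPENBLAS_CORETYPE=Haswell <PandoraInterface command as above>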
