Support #21845

Request for build of pytorch 1.0 accessible within LArSoft for MicroBooNE's MCC9 candidate release

Added by Matthew Toups 5 months ago. Updated 4 months ago.

Status: Closed
Priority: High
Category: -
Target version: -
Start date: 02/05/2019
Due date: 02/22/2019
% Done: 100%
Estimated time:
Spent time:
Experiment: MicroBooNE
Duration: 18

Description

On Feb. 22, MicroBooNE will finalize a uboonecode release to support its 2019 physics results. A key piece of this release is the ability to use a trained semantic segmentation network (SSNet) to infer pixel track vs. shower labels as a part of production. We have implemented CPU-based SSNet inference using pytorch 1.0 and so would like this to be made accessible within LArSoft on the time scale listed above.

Taritree Wongjirad is the contact for this support request. He is very grateful for any help that he can receive on this. He writes:

'I need a build of pytorch 1.0. According to the installation instructions, glibc>=2.17 (satisfied by SL7 only) is required.

This is the "only" package that is needed. However, I imagine there are going to be a whole set of dependencies.
For example, one important dependency was MKL, Intel's linear algebra and machine learning library.
Alternatively, the Eigen package can be used as the backend for all of the network operations.

I will have a docker/singularity script that performs the build in that context. Hopefully it will help.

One way I can help is by trying to build on one of the UB machines. However, I wasn't sure how to set up a uboonecode environment for one of the builds using SL7.
It could just be that I don't know the right ups command to set up the right "flavor" along with the "qualifier".
(specifically, I can try to build after setting up: "uboonecode" "v08_00_00_02" "Linux64bit+3.10-2.17" "c2:prof" "")'

Thanks!

History

#2 Updated by Lynn Garren 5 months ago

Taritree will provide us with instructions. If we are able to provide a product, we propose that the product name will be libtorch.

#3 Updated by Taritree Wongjirad 5 months ago

Hi Everyone,

Thanks again for the discussion yesterday (and for helping to resolve my ticket with write-access to uboone/app).

Following up

1) I got write access to the /uboone/app
2) I was able to build pytorch on uboonebuild02 and link it against my copy of uboonecode (actually ubcv) which contains a larsoft producer module that uses the c++ api

(the branch with this module is ubcv:feature/tmw_ubdl_ssnetintegration)

So far, I just got it to build and link. I will see if I can actually run it on uboonebuild02 and have it produce an output that is sensible.

Some notes from the build:
1) I set up uboonecode v08_00_00_03 to get the environment we'll use
2) I also had to set up pyyaml, which was already in ups (setup pyyaml v3_12d -q p2714b)
3) I had to build the python module 'typing' as well (I built it and put the path to its library in my PYTHONPATH for now). Got it with:

wget https://files.pythonhosted.org/packages/bf/9b/2bf84e841575b633d8d91ad923e198a415e3901f228715524689495b4317/typing-3.6.6.tar.gz

4) pyyaml and typing, I think, are there to be able to run 'setup.py'
5) I cannot get things to build without going through setup.py (for example, just making a build folder and running `cmake ../` did not work). Since the pytorch repo is a repo of repos, I think there are variables and settings that need to be coordinated among subrepos, which requires using setup.py (but this is a guess)
6) To start with, I just wanted to see if a vanilla build would complete (it did). After setting up uboonecode, pyyaml, and typing, I cloned (with recursive) the pytorch repo (github.com/pytorch/pytorch), changed to tag v1.0.0, updated the submodules and then ran: python setup.py bdist_wheel
7) There is a flag to set the BLAS library used. I can set it to Eigen or OpenBLAS. For the above vanilla build, where I did not set it, no BLAS package was found, and so the default is to download some MKL libraries and use those. (Clearly, this is not the eventual behavior we want.)
8) Also, it was clear it was built with flags for SSE and AVX instructions (those were the ones I saw anyway).
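
For reference, the vanilla build sequence described in the notes above can be sketched as a list of commands. This is only a sketch: the function names and the local directory name `pytorch` are illustrative assumptions, and the uboonecode, pyyaml and typing setup is assumed to have been done in the calling shell already.

```python
# Sketch only: the clone / checkout / submodule / setup.py sequence from the notes.
# Function names and the "pytorch" directory name are illustrative assumptions.
import subprocess

PYTORCH_REPO = "https://github.com/pytorch/pytorch"


def vanilla_build_commands(tag="v1.0.0", repo_dir="pytorch"):
    """Return (argv, working_directory) pairs for a default build of one tag."""
    return [
        (["git", "clone", "--recursive", PYTORCH_REPO, repo_dir], None),
        (["git", "checkout", tag], repo_dir),
        (["git", "submodule", "update", "--init", "--recursive"], repo_dir),
        (["python", "setup.py", "bdist_wheel"], repo_dir),
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv, cwd in commands:
        subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print the plan instead of running it; call run_all(...) to execute.
    for argv, cwd in vanilla_build_commands():
        print(cwd or ".", ":", " ".join(argv))
```

Without a BLAS setting, this is the configuration that falls back to downloading MKL, so it is useful mainly as a first check that the toolchain works.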

Going forward I will:
1) try to run our network using the C++ api and see if something sensible comes out
2) once I do that I will try to modify the build to use Eigen or openBLAS -- which from our discussion seemed to be OK -- and then confirm I can still run the network
3) if the above goes well, I will try to package everything into a local products tarball in order to see if I can run on the grid (or more likely, discover potential issues)

Help I could use:
1) I tried to look for a build of openBLAS in ups but was not successful. From the discussion, I inferred that this was already packaged in UPS. If so, does anyone know the name for the library in the UPS system?

Many thanks for the help,
Taritree

#4 Updated by Taritree Wongjirad 5 months ago

Hi Everyone,

An update (below applies to code built and run on uboonebuild02):

1) building using the default settings led to a version of pytorch that installed some MKL library it downloaded. This version would crash when running the network (from the larsoft producer module)

2) I rebuilt, but this time specifying the BLAS to be OpenBLAS.

3) I built my own copy of OpenBLAS for this test.

4) Now it runs. I can produce an SSNet image and save it to a larcv root file. So it seems to work! The output seems fairly sensible.

To summarize the ups products/dependencies used:

(anything I built, I installed to my ${MRB_INSTALL} folder.)

1) uboonecode+ubcv: v08_00_00_03 -q e17:prof
2) pyyaml (already on ups): setup pyyaml v3_12d -q p2714b
3) typing (version 3.6.6): python module I had to build and set location in PYTHONPATH. Needed by pytorch's setup.py
4) numpy (on ups): setup numpy v1_14_3 -q e17:p2714b:prof (However, this seems to be incomplete. I cannot find numpy headers nor import numpy from python when I set this up)
5) numpy (v1_14_3): built myself with location set in PYTHONPATH
6) ZeroMQ (v4.3.1): built myself. Used by the ssnet interface module.
7) cppzmq (v4.3.0): built myself. C++ bindings to ZeroMQ. It is simply a header-only repo. The cmake build for this repo only sets the install location for the header and generates cmake config files that define a target which includes libzmq from ZeroMQ.
8) OpenBLAS (develop)

For the above, I always got the source using the git repo with the exception of 'typing' where I used a tarball.

Building pytorch:

1) git clone --recursive https://github.com/pytorch/pytorch
2) git checkout -b v1.0.1 v1.0.1
3) edited tools/build_pytorch_libs.sh: after line 244 added

-DBLAS="OpenBLAS" \
in order to set the BLAS library

4) set environment variables

export USE_MKLDNN=0
export USE_CUDA=0
export OpenBLAS_HOME=${MRB_INSTALL}/OpenBLAS # (needed) to detect OpenBLAS

5) build pytorch. 'build_deps' builds only the pieces for libtorch; it does not seem to build anything for the python package.

python setup.py build_deps

6) the above creates `tmp_install` in (pytorch repo root)/torch/lib/. So I copied it to my ${MRB_INSTALL} folder as libtorch:

cp -r torch/lib/tmp_install ${MRB_INSTALL}/libtorch

That was it. You can find log files and my CMakeCache for pytorch at:

/uboone/app/users/tmw/dev/v08_00_00_br_sl7_pytorch/pytorch/build/cmake
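
The environment used for the OpenBLAS build above can be collected in one place, which makes it easier to rerun or compare against later builds. The following is a minimal sketch: the helper name and the placeholder install path are assumptions, and it covers only the three variables listed in step 4 (the `-DBLAS="OpenBLAS"` edit to tools/build_pytorch_libs.sh is still needed separately).

```python
# Sketch only: environment for the 'python setup.py build_deps' step above.
# The helper name and the placeholder path are illustrative assumptions.
import os

BUILD_ARGV = ["python", "setup.py", "build_deps"]


def openblas_build_env(mrb_install, base_env=None):
    """Return a copy of the environment with the OpenBLAS build settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "USE_MKLDNN": "0",  # do not build against MKL-DNN
            "USE_CUDA": "0",  # CPU-only build
            "OpenBLAS_HOME": f"{mrb_install}/OpenBLAS",  # lets the build find OpenBLAS
        }
    )
    return env


if __name__ == "__main__":
    env = openblas_build_env(os.environ.get("MRB_INSTALL", "/path/to/mrb_install"))
    for key in ("USE_MKLDNN", "USE_CUDA", "OpenBLAS_HOME"):
        print(f"{key}={env[key]}")
    # To build, run from the pytorch repo root:
    #   subprocess.run(BUILD_ARGV, env=env, check=True)
```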

Remaining issues:
1) rebuild pytorch without special CPU instruction sets. I am not sure which ones need to be avoided: SSE, AVX and AVX2 are the ones I see. Do all of them need to be avoided? It would help to know the oldest type of Intel CPU we expect to run on. (Are they all Intel CPUs?)
2) test run on grid, bringing all the deps through a localproducts tarball

Bests,
Taritree

#5 Updated by Lynn Garren 5 months ago

  • Status changed from New to Feedback

AVX is only available on about 1/4 of the lab worker nodes. SSE should be OK. We know that SSE 4.2 is available at the lab.

Did you try building with cblas from lapack?
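
To see which of these instruction sets a given Linux node advertises, one option is to read the `flags` line of /proc/cpuinfo. The sketch below assumes an x86 Linux node, where the flag names are `sse4_2`, `avx` and `avx2`; the function names are illustrative.

```python
# Sketch only: report SSE 4.2 / AVX / AVX2 support from /proc/cpuinfo-style text.
# Assumes x86 Linux, where features appear on a line starting with "flags".


def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(cpuinfo_text, wanted=("sse4_2", "avx", "avx2")):
    """Map each wanted flag name to True/False for this CPU description."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(simd_support(handle.read()))
    except OSError:
        print("no /proc/cpuinfo on this system")
```

Running this inside a test grid job would show whether a particular worker node belongs to the AVX-capable fraction.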

#6 Updated by Taritree Wongjirad 5 months ago

Hi Lynn,

Thanks for the info! That's really helpful. I'm glad that the SSE instructions are OK. I saw flags for AVX, so I think there's hope that it won't be too bad to turn off.

I haven't tried to build against LAPACK, but it should support it. I will try that.

One other question I had: should I turn off OpenMPI support?

Bests,
Taritree

#7 Updated by Lynn Garren 5 months ago

Please try to turn off OpenMPI. We'd like to minimize extra products.

#8 Updated by Taritree Wongjirad 5 months ago

Hi Everyone:

An update and a question:

Update:

1) still preparing UPS products (in a development folder)
2) still working on a non-AVX build of libtorch
3) attempted to run on the FNAL grid: took the initial working build, put the software into a localProducts folder, tar'ed it up, shipped it out to the worker node, and used an initial source script to set the environment for those packages.

For (3), I got the software onto the worker node OK, I believe. However, I got an error because SL6 was loaded instead of SL7. I was using project.py to handle the job submission and set the required tag: <os>SL7</os>. However, it didn't seem to take: the UPS products that were set up are all SL6 flavored.

Are there any other steps I have to take to get the worker node to run SL7?
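
A quick way to tell whether a job landed on a suitable node is to compare the UPS flavor string against the glibc>=2.17 requirement from the original request. The sketch below assumes the flavor has the form `Linux64bit+<kernel>-<glibc>`, as in "Linux64bit+3.10-2.17"; the SL6-style flavor used in the demo ("Linux64bit+2.6-2.12") is an assumption for illustration, and the function names are hypothetical.

```python
# Sketch only: check a UPS flavor string against the glibc >= 2.17 requirement.
# Assumes flavors look like "Linux64bit+<kernel>-<glibc>".
import platform
import re

FLAVOR_RE = re.compile(r"^Linux64bit\+(\d+)\.(\d+)-(\d+)\.(\d+)$")


def glibc_from_flavor(flavor):
    """Return the glibc version encoded in a flavor string as (major, minor)."""
    match = FLAVOR_RE.match(flavor)
    if match is None:
        raise ValueError(f"unrecognized flavor: {flavor!r}")
    return int(match.group(3)), int(match.group(4))


def flavor_meets_glibc(flavor, minimum=(2, 17)):
    """True if the flavor's glibc version is at least the required minimum."""
    return glibc_from_flavor(flavor) >= minimum


if __name__ == "__main__":
    # The C library the running interpreter was linked against, for comparison.
    print("running libc:", platform.libc_ver())
    for flavor in ("Linux64bit+3.10-2.17", "Linux64bit+2.6-2.12"):
        print(flavor, flavor_meets_glibc(flavor))
```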

Bests,
Taritree

#9 Updated by Taritree Wongjirad 5 months ago

Hi Everyone,

I think I built libtorch without AVX instructions. (I turned off NNPACK which I think is the library pytorch uses for some of its optimized CPU tensor operations.)

My libtorch build along with the various dependencies I placed here:

/uboone/app/users/tmw/ups_dev/products

In this folder are the following:

cppzmq  libtorch  libzmq  numpydev  OpenBLAS  typing  ubdl

I attempted to create the folder structure and files needed to have them be a part of uboone's UPS products. However, I am new to this and might not have set them up properly.

However, when I prepend that location to my PRODUCTS env. variable, I can get ups list and setup commands to recognize them. For example:

[ tmw@uboonebuild02 ubcv ]$ ups list -aK+ libtorch
"libtorch" "v1_0_1" "Linux64bit+3.10-2.17" "e17:nonnpack:openblas:prof" ""

Is it just a matter of moving these to '/cvmfs/uboone.opensciencegrid.org/products/'?
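
When checking several products at once, the `ups list -aK+` output shown above can be split into fields. This sketch assumes each line holds five quoted fields in the order product, version, flavor, qualifiers, chain, which matches the example line; the type and function names are illustrative.

```python
# Sketch only: split one line of 'ups list -aK+' output into named fields.
# The field order (product, version, flavor, qualifiers, chain) is an assumption
# based on the example line in this thread.
import shlex
from typing import NamedTuple, Tuple


class UpsEntry(NamedTuple):
    product: str
    version: str
    flavor: str
    qualifiers: Tuple[str, ...]
    chain: str


def parse_ups_list_line(line):
    """Parse a line such as '"libtorch" "v1_0_1" "<flavor>" "<quals>" ""'."""
    fields = shlex.split(line)
    if len(fields) != 5:
        raise ValueError(f"expected 5 quoted fields, got {len(fields)}")
    product, version, flavor, quals, chain = fields
    qualifiers = tuple(q for q in quals.split(":") if q)
    return UpsEntry(product, version, flavor, qualifiers, chain)


if __name__ == "__main__":
    line = '"libtorch" "v1_0_1" "Linux64bit+3.10-2.17" "e17:nonnpack:openblas:prof" ""'
    print(parse_ups_list_line(line))
```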

Bests,
Taritree

#10 Updated by Lynn Garren 5 months ago

That's great news! We do need to get them into the structure needed for ongoing maintenance so that they can be added to SciSoft. I'll start work on that. We do have a numpy ups product already. There may be some changes as we pull the pieces together for ongoing support.

Does your product directory have a .upsfiles directory? You can get that by either building ups there or just copying, say, /cvmfs/larsoft.opensciencegrid.org/products/.upsfiles
The directory and its contents should be the same everywhere, so just cp -pr ...
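
The copy can also be scripted so that it is skipped when the directory already exists. This is a minimal sketch with a hypothetical helper name; `shutil.copytree` copies file metadata by default, which roughly corresponds to `cp -pr`.

```python
# Sketch only: copy a template .upsfiles directory into a products area once.
# The helper name is an illustrative assumption.
import shutil
import tempfile
from pathlib import Path


def ensure_upsfiles(products_dir, template_upsfiles):
    """Copy template_upsfiles to <products_dir>/.upsfiles unless it already exists.

    Returns True if a copy was made, False if .upsfiles was already present.
    """
    target = Path(products_dir) / ".upsfiles"
    if target.exists():
        return False
    shutil.copytree(template_upsfiles, target, symlinks=True)
    return True


if __name__ == "__main__":
    # Demonstrate on throwaway directories rather than a real products area.
    scratch = Path(tempfile.mkdtemp())
    template = scratch / "template" / ".upsfiles"
    template.mkdir(parents=True)
    (template / "placeholder.txt").write_text("example\n")
    products = scratch / "products"
    products.mkdir()
    print(ensure_upsfiles(products, template))
    shutil.rmtree(scratch)
```

For the case in this thread, the template would be /cvmfs/larsoft.opensciencegrid.org/products/.upsfiles and the products directory would be the development area.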

#11 Updated by Lynn Garren 5 months ago

  • Assignee set to Tanaz Mohayai
  • Status changed from Feedback to Assigned
  • Co-Assignees Taritree Wongjirad added

#12 Updated by Lynn Garren 5 months ago

  • Assignee changed from Tanaz Mohayai to Lynn Garren

#13 Updated by Taritree Wongjirad 5 months ago

Hi Lynn,

Thanks! My dev folder did not have the .upsfiles folder, so I have now copied it into the dev folder. I do not know if I did it correctly, nor what it does. But I suspect it provides the variables I need to specify some of the product folder locations in a nice relative way, instead of the hard-coded paths I am using now.

For example:

FLAVOR = "Linux64bit+3.10-2.17"
QUALIFIERS = "e17:prof"
DECLARER = tmw
DECLARED = 2019-02-18 05.06.28 GMT
MODIFIER = tmw
MODIFIED = 2019-02-18 05.06.28 GMT
PROD_DIR = /uboone/app/users/tmw/ups_dev/products/libzmq/v4_3_1
#PROD_DIR = libzmq/v4_3_1
UPS_DIR = ups
TABLE_FILE = libzmq.table

I figured that PROD_DIR should probably look like the commented line, but I couldn't get it to work before. But with the changes, now it does. I'll propagate that fix to the other products.

Bests,
Taritree

#14 Updated by Lynn Garren 5 months ago

Indeed, the .upsfiles subdirectory makes /uboone/app/users/tmw/ups_dev/products a functioning PRODUCTS directory. You can add /uboone/app/users/tmw/ups_dev/products to your PRODUCTS path and all setups should work.

Relative paths are required for our products. In this case, that happens when a ups declare command is issued. You just use the relative path instead of the full path.
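
Before moving products elsewhere, it can be useful to check that no version file still carries an absolute PROD_DIR. The sketch below treats a version file as simple `KEY = value` lines with `#` comments, as in the libzmq example earlier in this thread; that is a simplifying assumption (real files may hold several instances), and the function names are illustrative.

```python
# Sketch only: check whether a UPS version file uses a relative PROD_DIR.
# Assumes simple "KEY = value" lines; lines starting with "#" are skipped.
import posixpath


def parse_version_file(text):
    """Return a dict of KEY -> value, with surrounding double quotes removed."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip().strip('"')
    return entries


def has_relative_prod_dir(text):
    """True if PROD_DIR is present and is not an absolute path."""
    prod_dir = parse_version_file(text).get("PROD_DIR")
    return prod_dir is not None and not posixpath.isabs(prod_dir)


if __name__ == "__main__":
    example = "PROD_DIR = libzmq/v4_3_1\nUPS_DIR = ups\nTABLE_FILE = libzmq.table\n"
    print(has_relative_prod_dir(example))
```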

#15 Updated by Christopher Green 5 months ago

  • Assignee changed from Lynn Garren to Christopher Green

I have been working on producing a UPS package for libtorch based on Taritree's work. Note that no attempt has been made to build pytorch itself. The status of libtorch is as follows:

  • As currently configured, the UPS package for libtorch has the following external dependencies:
    • openblas
    • pybind11
    • protobuf
    • cmake (build-only)
    • python (build-only)
  • In addition, the Python packages pyyaml and typing are installed for the build only using pip install, and the libtorch build procedure builds internal copies of caffe2, onnx and a stub of c10.
  • Milestones achieved so far:
    • Successful checkout of the source and all external submodules.
    • Construction of a UPS source tarball.
    • Successful build of all the software ensuring that the external packages are actually found and used at header, binary and library level as appropriate.
  • Upon inspection, it appears those components that take advantage of chip-architecture-specific features utilize runtime interrogation of the cpuID to ascertain the correct code to execute, so compiling on a system which has (e.g.) AVX2 should not affect operation on other, less capable systems.
  • Goals not yet achieved:
    • A verified install stage with all headers and libraries in the correct place.
    • Production of a binary tarball for install and test.

Caveats:

  • Relocation and operation on older machines must be verified by the requestor.
  • The requestor must ensure that their whole system is consistent and correct, including possible clashes between rival BLAS, LAPACK, Caffe2, ONNX and C10 implementations. Note that Caffe does not clash with Caffe2 for these purposes. The UPS package `openblas`, however, will certainly clash with the UPS package `lapack`.
  • An analysis of the compiled code leads me to believe that, as built, libtorch and friends do not depend on the Python that was used during the build procedure. Several headers refer to pybind11, however, implying that a Python may need to be present while compiling code that uses certain components of libtorch and/or its bundled dependencies. The user is responsible for ensuring that the Python consistent with the rest of their code system is available and has been set up for use.
  • If versions of external products used are not compatible with the expected versions available in the frozen software stack, please let me know and adjustments will be made. Currently, libtorch uses:
    • pybind11 v2_2_4
    • OpenBLAS v0_3_5
    • protobuf v3_5_2
    • gcc v7_3_0

The two as-yet unachieved goals are expected to be met by COB Friday 2019-02-22. I will endeavor to make these available as early as possible in order to allow the maximum time for requestor validation.

Please let us know as soon as possible if there are any changes to be made to the configuration of features for this package. The build command is as follows:

 env CC=gcc REL_WITH_DEB_INFO=1 MAX_JOBS=<ncore> NO_CUDA=1 NO_CUDNN=1 NO_FBGEMM=1 NO_TEST=1 NO_MIOPEN=1 NO_MKLDNN=1 NO_NNPACK=1 NO_QNNPACK=1 NO_DISTRIBUTED=1 NO_SYSTEM_NCCL=1 USE_OPENCV=0 USE_FFMPEG=0 USE_LEVELDB=0 USE_LMDB=0 BUILD_BINARY=1 'EXTRA_CAFFE2_CMAKE_FLAGS=-DBLAS=OpenBLAS -DBUILD_CUSTOM_PROTOBUF=OFF -DUSE_NCCL=OFF' python <products-path>/libtorch/v1_0_1/src/libtorch-1.0.1/tools/build_libtorch.py
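
For readability, the same settings can be laid out as an environment mapping plus a command. This sketch only restates the command above; the function name is an assumption, and `ncore` and `products_path` stand in for the `<ncore>` and `<products-path>` placeholders.

```python
# Sketch only: the build command above expressed as (environment, argv).
# 'ncore' and 'products_path' replace the <ncore> and <products-path> placeholders.

NO_FLAGS = (
    "CUDA", "CUDNN", "FBGEMM", "TEST", "MIOPEN",
    "MKLDNN", "NNPACK", "QNNPACK", "DISTRIBUTED", "SYSTEM_NCCL",
)
USE_OFF = ("OPENCV", "FFMPEG", "LEVELDB", "LMDB")
EXTRA_CMAKE = "-DBLAS=OpenBLAS -DBUILD_CUSTOM_PROTOBUF=OFF -DUSE_NCCL=OFF"


def libtorch_build_invocation(ncore, products_path):
    """Return (env_overrides, argv) matching the build command in this comment."""
    env = {"CC": "gcc", "REL_WITH_DEB_INFO": "1", "MAX_JOBS": str(ncore)}
    env.update({f"NO_{name}": "1" for name in NO_FLAGS})  # NO_CUDA=1, NO_CUDNN=1, ...
    env.update({f"USE_{name}": "0" for name in USE_OFF})  # USE_OPENCV=0, ...
    env["BUILD_BINARY"] = "1"
    env["EXTRA_CAFFE2_CMAKE_FLAGS"] = EXTRA_CMAKE
    script = f"{products_path}/libtorch/v1_0_1/src/libtorch-1.0.1/tools/build_libtorch.py"
    return env, ["python", script]


if __name__ == "__main__":
    env, argv = libtorch_build_invocation(4, "/path/to/products")
    for key, value in env.items():
        print(f"{key}={value}")
    print(" ".join(argv))
    # To run: subprocess.run(argv, env={**os.environ, **env}, check=True)
```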

#16 Updated by Christopher Green 5 months ago

  • % Done changed from 0 to 90
  • Status changed from Assigned to Feedback
  • Due date set to 02/22/2019
  • Co-Assignees Lynn Garren added

I believe I have something for download / test at https://scisoft.fnal.gov/scisoft/packages/libtorch/v1_0_1/ (see also https://scisoft.fnal.gov/scisoft/packages/openblas/v0_3_5/). Please let us know ASAP if there are any deficiencies or requested changes.

#17 Updated by Taritree Wongjirad 5 months ago

Thank you! I'll test it ASAP.

#18 Updated by Christopher Green 5 months ago

I have just updated the libtorch source and binary packages due to a minor issue removing the hard-wired path from the ATen CMake config file. Please re-obtain the binary package if you encounter issues building against libtorch during your tests.

#19 Updated by Taritree Wongjirad 5 months ago

Hi,

I untarred the packages from scisoft (to my development ups area) and modified the code (in ubcv) to use these products. (One additional step was to set up pybind11 v2_2_1.)

The program runs as before. Thanks!

May we move these to a place visible on the grid?

Bests,
Taritree

#20 Updated by Kyle Knoepfel 5 months ago

Were there any UPS errors when you set up libtorch? We would not have expected you to have to set up pybind11 explicitly.

#21 Updated by Lynn Garren 5 months ago

We apologize for the confusion. pybind11 v2_2_1 is distributed along with larsoft. To avoid potential conflicts, we have rebuilt libtorch with pybind11 v2_2_1. This is available as libtorch v1_0_1a on SciSoft: http://scisoft.fnal.gov/scisoft/packages/libtorch/v1_0_1a/

Please use this release.

#22 Updated by Kyle Knoepfel 4 months ago

  • % Done changed from 90 to 100

Unless we hear from you in the next week, we will close this issue.

#23 Updated by Kyle Knoepfel 4 months ago

  • Status changed from Feedback to Closed

