Support #17004

make a tensorflow ups product

Added by Lynn Garren about 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
External Packages
Target version:
Start date:
08/15/2017
Due date:
% Done:

100%

Estimated time:
(Total: 24.00 h)
Spent time:
Experiment:
LArSoft
Co-Assignees:
Duration:

Description

After discussions with Robert Sulej and others, we have agreed to make TensorFlow available as a ups product.

Robert's notes are attached.

tf.txt (2.16 KB) tf.txt Lynn Garren, 06/22/2017 04:44 PM
BUILD (163 Bytes) BUILD Lynn Garren, 06/22/2017 04:44 PM
tf_graph.h (571 Bytes) tf_graph.h Lynn Garren, 06/22/2017 04:45 PM
tf_graph.cc (1.95 KB) tf_graph.cc Lynn Garren, 06/22/2017 04:45 PM
makegraph.py (388 Bytes) makegraph.py Lynn Garren, 06/22/2017 04:45 PM
hist_avg_max_old.png (12.9 KB) hist_avg_max_old.png Old code: average (blue) and max (red) time in the module applying CNN Robert Sulej, 10/10/2017 07:29 AM
hist_avg_max_TF.png (12.3 KB) hist_avg_max_TF.png Tensorflow: average (blue) and max (red) time in the module applying CNN Robert Sulej, 10/10/2017 07:29 AM
hist_cpu-real_old.png (12.1 KB) hist_cpu-real_old.png Old code: cpu / real time for entire event processing Robert Sulej, 10/10/2017 07:29 AM
hist_cpu-real_TF.png (10.4 KB) hist_cpu-real_TF.png Tensorflow: cpu / real time for entire event processing Robert Sulej, 10/10/2017 07:29 AM

Subtasks

Support #17503: identify necessary ups product dependencies (Closed)

Support #17504: understand the TensorFlow cmake build option (Closed, Patrick Gartung)

Support #17505: understand what it means to provide a shared library (Closed, Lynn Garren)

History

#1 Updated by Lynn Garren about 3 years ago

  • Due date set to 07/20/2017

#2 Updated by Robert Sulej about 3 years ago

Hi Lynn,

Do you have an estimate of when it may be completed? I could focus on adapting the CNN code to use TF inside LArSoft in August.

Thanks,
Robert

#3 Updated by Lynn Garren about 3 years ago

It turns out that building tensorflow is quite a rabbit hole. However, there is a contributed cmake build option. This may save us.

#4 Updated by Lynn Garren about 3 years ago

There is now a build of tensorflow v1_2_1_b3 which appears to work with feature/rsulej_tf. However, there is a problem where the build reports a missing destructor. I think the actual problem is that you have not called the session Close function.
http://scisoft.fnal.gov/scisoft/packages/tensorflow/v1_2_1_b3/

Patrick and I continue to try to find a better way to build tensorflow. This build is less than optimal and has its own internal copy of both protobuf and eigen. The header directories are such that you can't mix these up with our builds, but we would prefer to use the ups products for these dependencies. Also the cmake build which I am currently using produces a 2GB "shared" library!

#5 Updated by Robert Sulej about 3 years ago

I updated my code to use v1_2_1_b3 and started writing an actual interface class that can run the computation graph (so it has the session Close() in the destructor).

Compilation seems to work, however linking fails:

CMakeFiles/larreco_RecoAlg_ImagePatternAlgs_TF.dir/tf_graph.cc.o: In function `tf::Graph::Graph(char const*, bool&)':
/afs/cern.ch/work/r/rosulej/larsoft/tf_test/srcs/larreco/larreco/RecoAlg/ImagePatternAlgs/TF/tf_graph.cc:22: undefined reference to `tensorflow::SessionOptions::SessionOptions()'
/afs/cern.ch/work/r/rosulej/larsoft/tf_test/srcs/larreco/larreco/RecoAlg/ImagePatternAlgs/TF/tf_graph.cc:22: undefined reference to `tensorflow::NewSession(tensorflow::SessionOptions const&, tensorflow::Session**)'
CMakeFiles/larreco_RecoAlg_ImagePatternAlgs_TF.dir/tf_graph.cc.o: In function `_ZN10tensorflow14SessionOptionsD4Ev':
/afs/cern.ch/work/r/rosulej/larsoft/all_products/./tensorflow/v1_2_1_b3/Linux64bit+2.6-2.12-e14-p2713d-prof/include/tensorflow/core/public/session_options.h:28: undefined reference to `tensorflow::ConfigProto::~ConfigProto()'
CMakeFiles/larreco_RecoAlg_ImagePatternAlgs_TF.dir/tf_graph.cc.o: In function `tf::Graph::Graph(char const*, bool&)':
/afs/cern.ch/work/r/rosulej/larsoft/tf_test/srcs/larreco/larreco/RecoAlg/ImagePatternAlgs/TF/tf_graph.cc:25: undefined reference to `_ZNK10tensorflow6Status8ToStringB5cxx11Ev'
/afs/cern.ch/work/r/rosulej/larsoft/tf_test/srcs/larreco/larreco/RecoAlg/ImagePatternAlgs/TF/tf_graph.cc:29: undefined reference to `tensorflow::GraphDef::GraphDef()'
... ... ...

My CMakeLists.txt is quite simple; I only added ${TENSORFLOW}:

art_make(
  LIB_LIBRARIES
    ${FHICLCPP}
    cetlib cetlib_except
    ${TENSORFLOW}
)

install_headers()
install_source()

So far I have commented out everything except very basic code in the constructor/destructor. This is probably again a simple misconfiguration on my side. Please let me know if you see something obviously missing.

#6 Updated by Robert Sulej about 3 years ago

Hi Lynn,

Did you have a chance to look at this linking problem?

Thanks,
Robert

#7 Updated by Lynn Garren about 3 years ago

The TENSORFLOW variable was undefined. You should have been able to find and fix this problem yourself. I pushed a fix to the branch.
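For reference, the missing definition is typically supplied in the package's top-level CMakeLists.txt with something along these lines (a sketch using cetbuildtools conventions; the library name and environment variable are assumptions based on the ups product layout, not confirmed from the actual fix):

```cmake
# locate the ups product and define the TENSORFLOW variable used by art_make()
find_ups_product( tensorflow )
cet_find_library( TENSORFLOW NAMES tensorflow-cc PATHS ENV TENSORFLOW_LIB NO_DEFAULT_PATH )
```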

#8 Updated by Robert Sulej about 3 years ago

Hi Lynn,

Thank you so much! To be honest, I would never have guessed the line you added!

Yesterday, together with Piotr, we finished a helper class to run the graph. It is >>100x faster than the simple code we have been using (though still in a standalone program), and the .so size is ~170 MB. So whatever strategy we choose for the ups product, it is definitely worth doing.

Thanks,
Robert

#9 Updated by Lynn Garren almost 3 years ago

We have settled on a contributed makefile build option. With the makefile build, we are able to use our build of protobuf v3_3_1. However, tensorflow uses what appears to be an old fork of eigen. We suspect that they may have contributed the code, which contains explicit changes needed by tensorflow. This eigen release is incompatible with the tagged eigen releases used elsewhere by LArSoft. To avoid potential conflicts we have changed both the namespace and the Eigen directory from Eigen to Eigen_tf.

larreco feature/rsulej_tf has been updated to use tensorflow v1_3_0_b3.

Please test in your environment. You will need to download both tensorflow and protobuf from SciSoft.
http://scisoft.fnal.gov/scisoft/packages/protobuf/v3_3_1/
http://scisoft.fnal.gov/scisoft/packages/tensorflow/v1_3_0_b3/

#10 Updated by Robert Sulej almost 3 years ago

Hi All,

The new release compiles and runs well - the output is numerically identical to the previous release.
The ups product size is significantly smaller, but you know that best!

I guess the "core2" option was applied this time; it changed the messages about instruction sets, and it looks like SSE3 is now used. But the CPU time spent in my module is practically the same as with the previous release (it spends ~100% of its time in 2D convolutions and vector dot products, so vectorization should give a very visible reduction in time). There are also new warnings/errors (if W stands for warning and E for error). Below I attach the full output from initialization:

2017-09-22 12:06:30.299246: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-22 12:06:30.299344: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-22 12:06:30.299354: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-22 12:06:30.299361: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-22 12:06:30.299368: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-22 12:06:30.440044: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "Invert" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT16 } } }') for unknown op: Invert
2017-09-22 12:06:30.440113: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "Invert" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT8 } } }') for unknown op: Invert
2017-09-22 12:06:30.440130: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "Invert" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT64 } } }') for unknown op: Invert
2017-09-22 12:06:30.440159: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "Invert" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } }') for unknown op: Invert
2017-09-22 12:06:30.440176: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "Invert" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT16 } } }') for unknown op: Invert
2017-09-22 12:06:30.440188: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "Invert" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT8 } } }') for unknown op: Invert
2017-09-22 12:06:30.440215: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseXor" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT16 } } }') for unknown op: BitwiseXor
2017-09-22 12:06:30.440224: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseXor" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT8 } } }') for unknown op: BitwiseXor
2017-09-22 12:06:30.440233: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseXor" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT64 } } }') for unknown op: BitwiseXor
2017-09-22 12:06:30.440241: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseXor" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } }') for unknown op: BitwiseXor
2017-09-22 12:06:30.440250: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseXor" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT16 } } }') for unknown op: BitwiseXor
2017-09-22 12:06:30.440258: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseXor" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT8 } } }') for unknown op: BitwiseXor
2017-09-22 12:06:30.440286: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseOr" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT16 } } }') for unknown op: BitwiseOr
2017-09-22 12:06:30.440301: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseOr" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT8 } } }') for unknown op: BitwiseOr
2017-09-22 12:06:30.440311: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseOr" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT64 } } }') for unknown op: BitwiseOr
2017-09-22 12:06:30.440324: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseOr" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } }') for unknown op: BitwiseOr
2017-09-22 12:06:30.440333: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseOr" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT16 } } }') for unknown op: BitwiseOr
2017-09-22 12:06:30.440363: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseOr" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT8 } } }') for unknown op: BitwiseOr
2017-09-22 12:06:30.440378: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseAnd" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT16 } } }') for unknown op: BitwiseAnd
2017-09-22 12:06:30.440395: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseAnd" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_UINT8 } } }') for unknown op: BitwiseAnd
2017-09-22 12:06:30.440404: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseAnd" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT64 } } }') for unknown op: BitwiseAnd
2017-09-22 12:06:30.440413: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseAnd" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } }') for unknown op: BitwiseAnd
2017-09-22 12:06:30.440421: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseAnd" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT16 } } }') for unknown op: BitwiseAnd
2017-09-22 12:06:30.440435: E tensorflow/core/framework/op_kernel.cc:1142] OpKernel ('op: "BitwiseAnd" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT8 } } }') for unknown op: BitwiseAnd

Thanks!
Robert

#11 Updated by Robert Sulej almost 3 years ago

Just to add some numbers:

           TimeTracker  VmPeak      VmHWM
our c++:   20.3809 s    1863.71 MB  643.375 MB  <--- our hand-written CNN inference mode
tf, 1:     17.9962 s    2776.6 MB   760.721 MB  <--- serial processing of CNN inputs with tensorflow
tf, 64:    3.34498 s    2776.59 MB  761.504 MB  <--- 64 input vectors passed to the TF graph in each batch
tf, 128:   2.74058 s    2799.49 MB  786.604 MB  <--- 128 input vectors / batch

Time fluctuations are minor; an 8-core CPU was used, with small ProtoDUNE events (beam only, no cosmic background; PMA takes ~4 s on these events). The speedup looks to be entirely due to parallel processing of the graph (and this is what I can see with "top").
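The batching effect measured above can be mimicked with a plain NumPy sketch: one matrix-vector product per input versus a single matrix-matrix product over the whole batch. This is illustrative only (the real gain comes from TF parallelizing the graph over the batch), and the sizes are arbitrary stand-ins:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))        # stand-in for one CNN layer's weights
inputs = rng.standard_normal((4096, 512))  # stand-in for the CNN input vectors

t0 = time.perf_counter()
serial = np.stack([W @ x for x in inputs])  # one run per input, like "tf, 1"
t1 = time.perf_counter()
batched = inputs @ W.T                      # whole batch in one call, like "tf, 128"
t2 = time.perf_counter()

print(np.allclose(serial, batched))
print(f"serial {t1 - t0:.3f} s, batched {t2 - t1:.3f} s")
```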

#12 Updated by Lynn Garren almost 3 years ago

Indeed, this release was compiled with both -march=core2 and -std=c++14. Would you provide instructions to reproduce the warnings you see?

#13 Updated by Robert Sulej almost 3 years ago

Please try dunetpc on the data file and fcl in /pnfs/dune/scratch/users/rsulej/tf/

lar -c /pnfs/dune/scratch/users/rsulej/tf/reco_michelemtrackid.fcl /pnfs/dune/scratch/users/rsulej/tf/ProtoDUNE_pion_3p0GeV_mono_3ms_reco.root

There are comments in the fcl if you would like to change the size of the input batch, or to apply the code we were using without tensorflow. Both CNN models, the default and the new one, have the same architecture.

Thanks,
Robert

#14 Updated by Lynn Garren almost 3 years ago

"TF_CPP_MIN_LOG_LEVEL is a TensorFlow environment variable responsible for the logs, to silence INFO logs set it to 1, to filter out WARNING 2 and to additionally silence ERROR logs (not recommended) set it to 3"

To silence the warnings: export TF_CPP_MIN_LOG_LEVEL=2

However, the E lines concern us, and we will explore using a different march flag.

#15 Updated by Lynn Garren almost 3 years ago

tensorflow v1_3_0_b4 is now available. The tensorflow errors are no longer produced. However, we do see messages such as these below. I don't know if they are expected or not.

%MSG
%MSG-w DisambigAlg35t:  HitFinder35t:hitfd@ 25-Sep-2017 19:57:56 CDT  run: 1 subRun: 95 event: 9421
Could not find disambiguated hit for  C:0 T:5 P:0 W:169 3212.61
%MSG
%MSG-w DisambigAlg35t:  HitFinder35t:hitfd@ 25-Sep-2017 19:58:15 CDT  run: 1 subRun: 95 event: 9425
Could not find disambiguated hit for  C:0 T:0 P:0 W:398 4208.25
%MSG
%MSG-w DisambigAlg35t:  HitFinder35t:hitfd@ 25-Sep-2017 19:58:15 CDT  run: 1 subRun: 95 event: 9425
Could not find disambiguated hit for  C:0 T:0 P:1 W:322 4213.72
%MSG
%MSG-w DisambigAlg35t:  HitFinder35t:hitfd@ 25-Sep-2017 19:58:20 CDT  run: 1 subRun: 95 event: 9426
Could not find disambiguated hit for  C:0 T:0 P:0 W:151 4091.61
%MSG
%MSG-w DisambigAlg35t:  HitFinder35t:hitfd@ 25-Sep-2017 19:58:25 CDT  run: 1 subRun: 95 event: 9427
Could not find disambiguated hit for  C:0 T:5 P:1 W:17 4072.72
%MSG
%MSG-w DataPrepModule:  DataPrepModule:caldata@ 25-Sep-2017 19:58:26 CDT  run: 1 subRun: 95 event: 9428
No wires made for this event.

#16 Updated by Dorota Stefan almost 3 years ago

In the protodune geometry these warnings are expected.

#17 Updated by Robert Sulej almost 3 years ago

I updated the branch to use the new ups product. There are no error messages. Numerical results are identical. Time is also the same.

Do you think it would be possible to compile a test version with the --copt=-mavx and --copt=-mfma options, just to compare the speed? Since the results of the builds with/without -march=core2 are identical down to the last digit, I suspect the vector optimizations somehow did not make it into the library. Usually there is some difference in the last digits between SIMD and plain FPU implementations.
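The last-digit expectation above is easy to demonstrate: floating-point addition is not associative, so SIMD code, which accumulates partial sums in lanes, rounds differently from a plain serial loop. A minimal Python illustration (the values are contrived for clarity):

```python
# associativity failure: the grouping of additions changes the last bits
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False: 0.6000000000000001 vs 0.6

# SIMD-style accumulation: partial sums per "lane", then a final reduction
xs = [1e16, 1.0, -1e16, 1.0]
serial = 0.0
for x in xs:
    serial += x  # 1e16 + 1.0 rounds back to 1e16, so one 1.0 is lost
lanes = [xs[0] + xs[2], xs[1] + xs[3]]  # pairwise, as two vector lanes would
print(serial, sum(lanes))  # 1.0 vs 2.0
```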

Thanks,
Robert

#18 Updated by Patrick Gartung almost 3 years ago

CMS picked -march=core2 because some of the really old grid nodes do not even support SSE4.x. I think the vectorization is done with the AVX instruction set, which is available on AMD since -march=bdver1 and on Intel since -march=sandybridge. It might be worth building with individual instruction set flags like -mavx, since -march=native can include AMD- or Intel-specific instruction sets.

#19 Updated by Lynn Garren almost 3 years ago

Please try with tensorflow v1_3_0_b5, which is compiled with -mavx and without -march=core2.

#20 Updated by Robert Sulej almost 3 years ago

Thanks!

This time results are different on the last digit - so something changed in the code.

Time is systematically, but only slightly, shorter (~10-15%):

             TimeTracker  VmPeak      VmHWM
our c++:     20.3809 s    1863.71 MB  643.375 MB  <--- our hand-written CNN inference mode
tf, 1:       17.9962 s    2776.6 MB   760.721 MB  <--- serial processing of CNN inputs with tensorflow
tf+avx, 1:   16.3125 s                            <--- serial, TF with AVX
tf, 128:     2.74058 s    2799.49 MB  786.604 MB  <--- 128 input vectors / batch
tf+avx, 128: 2.13315 s                            <--- 128 batch, TF with AVX

I also started looking at the speed with different numbers of cores, which is strange at first look: 4 cores give the same real time as 8 cores, while CPU time is longer on 8 cores than on 4. I'm not sure whether this means there is significant overhead in parallelizing the job. I'll try 20 cores + hyperthreading and also try running on the grid.

#21 Updated by Robert Sulej almost 3 years ago

I have just discovered multiple levels of parallelization in TF, so my timing results for running inputs in sequence could be wrong... let me force TF to run in a truly fixed number of threads to see the difference with/without AVX.
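For reference, in the TF 1.x C++ API the thread pools can be pinned when the session is created. A fragment in the spirit of the tf::Graph constructor, shown only to name the relevant knobs (not a complete program; thread counts are example values):

```cpp
// sketch only: pin TensorFlow 1.x to fixed thread counts at session creation
tensorflow::SessionOptions options;
options.config.set_intra_op_parallelism_threads(4);  // threads inside one op (e.g. a convolution)
options.config.set_inter_op_parallelism_threads(1);  // independent ops run serially
tensorflow::Session* session = nullptr;
tensorflow::Status status = tensorflow::NewSession(options, &session);
```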

#22 Updated by Robert Sulej almost 3 years ago

Hi All,

We have run more tests. Vector instructions give a more visible speedup when fewer cores are used. One could expect that.

Being practical: Dorota ran tests on the grid with ProtoDUNE events. Jobs using the AVX version (b5) failed ~50% of the time; non-AVX (b4) jobs ran smoothly. We need to set up 4 GB memory / job anyway, and then CPU/real time is close to 4, so I guess each job gets 2 hyperthreaded cores by default. The speed is very reasonable for ProtoDUNE (and likely DUNE) production needs.

Would it be reasonable to ask to upload TensorFlow b4 to cvmfs and merge the feature branch into LArSoft develop at this point, and deal with more optimized versions later? People are starting to ask for this and some are already using TF in their local builds.

Thanks!
Robert

#23 Updated by Lynn Garren almost 3 years ago

Thank you for running the tests. We will tag and distribute an actual release of tensorflow (without the bN suffix). To be clear, are you asking to merge larreco feature/rsulej_tf with the weekly release? We will need to make sure that it does not break experiment code.

Lynn

#24 Updated by Robert Sulej almost 3 years ago

OK, that would be great! I can update the dependency in the feature branch as soon as this TF release is available.

Yes, I mean the larreco feature/rsulej_tf branch. The interface for the old models is still there, the two can coexist, and there are no breaking changes (including configuration parameters), so we can transition smoothly to the new code.

#25 Updated by Robert Sulej almost 3 years ago

Yesterday I updated the branch to the v06_52_00 release, all is ready to go from our side.

Just to make ourselves happier, we parsed the output of jobs running the old version of the code and the new one with TF b4 (so no AVX instructions). Both ran the same-size CNN model, on pi+ at 3 GeV/c in ProtoDUNE (but no cosmics in the background). I attach histograms:
- average (blue) and max (red) time in the module applying CNN
- cpu / real time for entire event processing, so also including other, not parallelized modules.

The old CNN code had some parts (the convolutional kernels) parallelized with TBB; that's why cpu/real is not 1.0.

Cheers,
Robert

#26 Updated by Lynn Garren almost 3 years ago

  • Status changed from Assigned to Resolved
  • Target version set to v06_53_00
  • Experiment LArSoft added
  • Experiment deleted (-)

tensorflow v1_3_0a is now released as part of the larsoft v06_53_00 distribution. Robert's feature branch has been merged with larreco. We were able to build tensorflow for macOS, but neither the macOS build nor the Ubuntu build is optimized for its platform.

Note that the beta releases (v1_x_y_bN) of tensorflow will be removed from SciSoft. Anyone who has been using these test releases is advised to use the official release.

#27 Updated by Lynn Garren almost 3 years ago

  • Status changed from Resolved to Closed
