Support #22504

Request to upgrade the tensorflow version

Added by Tingjun Yang over 1 year ago. Updated 7 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 05/02/2019
Due date:
% Done: 100%
Estimated time:
Experiment: ArgoNeut, DUNE
Co-Assignees:
Duration:

Description

Dear LArSoft experts,

Would it be possible to upgrade tensorflow in larsoft? The current version is v1_3_0 and it would be great if it could be upgraded to v1_8_0 or higher. This would help the deep learning development in DUNE and ArgoNeuT, and possibly other experiments.

Thanks,
Tingjun

errors.txt (49.4 KB) - Tingjun Yang, 07/12/2019 12:15 PM
error.txt (53.9 KB) - Aiden Reynolds, 08/28/2019 07:51 AM
tensorflow_stacks.txt (77.4 KB) - Kyle Knoepfel, 09/03/2019 04:08 PM
tensorflow_errors.txt (49.5 KB) - Saul Alonso Monsalve, 10/01/2019 05:07 AM
cnn.png (48.5 KB) - Tingjun Yang, 03/16/2020 05:46 PM
cvn.png (52.6 KB) - Tingjun Yang, 03/16/2020 05:46 PM

Related issues

Related to SciSoft - Support #23361: Abseil not c++17 compliant, causing problems in Tensorflow v1_12_0b e17 build (Closed, 10/02/2019)

History

#1 Updated by Kyle Knoepfel over 1 year ago

  • Status changed from New to Feedback

We are accepting this, but we would like assistance from your tensorflow experts. With whom should we set up a meeting?

#2 Updated by Tingjun Yang over 1 year ago

Hi Kyle,

You can set up a meeting with Saul and Leigh. I would like to join the meeting, but they are the tensorflow experts.

Thanks,
Tingjun

#3 Updated by Lynn Garren over 1 year ago

We also have a request from MicroBooNE for a working python interface. I would like to have a joint meeting. Tentatively scheduled for 10 am May 8.

#4 Updated by Tingjun Yang over 1 year ago

Lynn Garren wrote:

We also have a request from MicroBooNE for a working python interface. I would like to have a joint meeting. Tentatively scheduled for 10 am May 8.

This does not work for us as there will be a ProtoDUNE meeting at the same time.

#5 Updated by Tingjun Yang over 1 year ago

  • Status changed from Feedback to Work in progress

From Lynn's email:

I had a talk with Marc P about the tensorflow build. We came up with two possible options. Fortunately, the cleanest option has worked and I have provided new patches for the c++17 build. I also had a look at the headers that are installed in the tensorflow include directory and refined the set.

tensorflow v1_12_0b is available for testing on cvmfs.

Please let us know if you have problems using this release. We presume that you will supply a larsim feature branch for use with tensorflow v1_12_0b. We should alert the larsoft mailing list and make sure there are no objections before making a release with the new build.

This release is only available for e17 (and e17:py3).

I remain concerned about tensorflow going forward. With newer releases we will have to use the bazel build, which appears to be problematic for spack. Also, tensorflow seems to be making its own copy of some utilities that are usually provided by the system. As long as everything is completely contained, that should be fine. Given that tensorflow maintains its own ecosystem, it may be wise to consider running it inside a container and taking the output in some fashion.

Lynn

#6 Updated by Tingjun Yang over 1 year ago

I have started testing tensorflow v1_12_0b. The upgrade seems to be quite straightforward. The only issue is the following error:

/cvmfs/larsoft.opensciencegrid.org/products/tensorflow/v1_12_0b/Linux64bit+3.10-2.17-e17-prof/include/tensorflow/core/public/session.h:24,
                 from /data/tjyang/dune/larsoft_em/srcs/dunetpc/dune/CVN/tf/tf_graph.cc:12:
/cvmfs/larsoft.opensciencegrid.org/products/tensorflow/v1_12_0b/Linux64bit+3.10-2.17-e17-prof/include/tensorflow/core/lib/core/stringpiece.h:29:10: fatal error: absl/strings/string_view.h: No such file or directory
 #include "absl/strings/string_view.h" 
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

The solution is to add include_directories( $ENV{TENSORFLOW_INC}/absl ) to the CMakeLists.txt file wherever session.h is included.
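
For reference, a minimal sketch of what that change might look like in an affected package's CMakeLists.txt (assuming the usual UPS setup where TENSORFLOW_INC is already defined):

# The Abseil headers ship inside the tensorflow include area; add them
# explicitly so that "absl/strings/string_view.h" can be found.
include_directories( $ENV{TENSORFLOW_INC}/absl )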

I have created feature branches feature/team_for_tensorflow_v1_12_0b in both larreco and dunetpc. I am going to test if the tensorflow results remain the same for DUNE. I will deal with argoneutcode later.

#7 Updated by Tingjun Yang over 1 year ago

Tried to run tensorflow on an existing DUNE MC file:

lar -c select_ana_dune10kt_nu.fcl xroot://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/mcc11/protodune/mc/full-reconstructed/07/51/34/20/nue_dune10kt_1x2x6_12855888_0_20181104T211321_gen_g4_detsim_reco.root -n -1

There are lots of errors like this:
2019-07-12 12:04:58.201673: E tensorflow/core/framework/op_kernel.cc:1197] OpKernel ('op: "MutableDenseHashTableV2" device_type: "CPU" constraint { name: "key_dtype" allowed_values { list { type: DT_STRING } } } constraint { name: "value_dtype" allowed_values { list { type: DT_INT64 } } }') for unknown op: MutableDenseHashTableV2

The full error log is attached.
The program hangs on the first event:
Classifier summary: 
Output 0: 0.0257705, 
Output 1: 2.55598e-05, 0.999172, 0.000753308, 4.93325e-05, 
Output 2: 0.00586059, 0.367086, 0.627042, 1.20333e-05, 
Output 3: 6.90421e-05, 3.75733e-05, 0.000108019, 0.999785, 
Output 4: 0.997856, 0.0021328, 1.05676e-05, 3.6611e-07, 
Output 5: 0.999907, 9.18793e-05, 7.18959e-07, 3.51992e-08, 
Output 6: 0.999743, 0.000255833, 9.39033e-07, 3.14338e-07, 

I guess the conclusion is that we cannot use the old networks with the new version of tensorflow.

#8 Updated by Kyle Knoepfel over 1 year ago

There appears to be a conflict between the old and new tensorflow-generated data schema. Unless tensorflow supports schema evolution, you may need to regenerate your tensorflow-formatted data.

#9 Updated by Kyle Knoepfel over 1 year ago

  • Status changed from Work in progress to Feedback

Do you think the feature branches you are working on will be ready for this week's release?

#10 Updated by Tingjun Yang over 1 year ago

Kyle Knoepfel wrote:

Do you think the feature branches you are working on will be ready for this week's release?

Hi Kyle,
No, we need to train new networks in order to use the new version of tensorflow. By the way, does MicroBooNE use tensorflow in their code? If so, they will also need to retrain in order to use the new tensorflow.

#11 Updated by Lynn Garren over 1 year ago

MicroBooNE uses tensorflow via a container. They do not use our build. We expect to have a report from them at the next coordination meeting.

#12 Updated by Aiden Reynolds about 1 year ago

I tried to run a larsoft job doing the CNN-based hit tagging with a new network trained with tensorflow v1.12.

I get a lot of errors similar to those mentioned by Tingjun above, e.g.

2019-08-28 07:49:27.703868: E tensorflow/core/framework/op_kernel.cc:1197] OpKernel ('op: "MutableDenseHashTableV2" device_type: "CPU" constraint { name: "key_dtype" allowed_values { list { type: DT_STRING } } } constraint { name: "value_dtype" allowed_values { list { type: DT_INT32 } } }') for unknown op: MutableDenseHashTableV2

After opening the input file, the job stalls but does not return a failure code; the full log is attached.

#13 Updated by Lynn Garren about 1 year ago

Aiden, Tingjun is still working on the training to use tensorflow 1.12. The good news is that this might be ready next week.

#14 Updated by Tingjun Yang about 1 year ago

Lynn Garren wrote:

Aiden, Tingjun is still working on the training to use tensorflow 1.12. The good news is that this might be ready next week.

Lynn, Aiden is the person who is training the CNN using tensorflow 1.12 and he saw the same error with the retrained network.

#15 Updated by Kyle Knoepfel about 1 year ago

Aiden, can you give us a sample workflow to test, including setup instructions?

#16 Updated by Tingjun Yang about 1 year ago

I have created feature branch feature/team_for_tensorflow_v1_12_0b in larreco and dunetpc. It works with larsoft v08_29_00.

To build it, you need to unsetup tensorflow and protobuf first.
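
For example, in a UPS shell (assuming both products are currently set up), this would be:

unsetup tensorflow
unsetup protobuf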

After you build and setup dunetpc, you can run the following command to reproduce the problem:
lar -c protoDUNE_SP_keepup_decoder_reco.fcl xroot://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/07/74/37/39/np04_raw_run005809_0023_dl6.root -n 1

#17 Updated by Kyle Knoepfel about 1 year ago

I find that if I use the debug build, I am able to proceed without any issues, albeit more slowly. However, there does appear to be some type of hang in the prof build. That will take some more investigation.

#18 Updated by Kyle Knoepfel about 1 year ago

Using the 'pstack' command, I get the attached printout. Looking at the lowest stack frame for each of the threads launched by tensorflow:

$ grep "#0" tensorflow_stacks.txt
#0  0x00007f7df3cd7965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
[... the same pthread_cond_wait frame repeated for the remaining 47 threads (48 in total) ...]
#0  0x00007f7dd3861c88 in absl::InlinedVector<long long, 4ul, std::allocator<long long> >::EnlargeBy(unsigned long) () from /scratch/knoepfel/products/tensorflow/v1_12_0b/Linux64bit+3.10-2.17-e17-prof/lib/libtensorflow-core.so

The only thread attempting to make progress is the master thread (thread 1), by calling EnlargeBy.

#19 Updated by Kyle Knoepfel about 1 year ago

  • Assignee set to Kyle Knoepfel
  • Status changed from Feedback to Assigned

#20 Updated by Kyle Knoepfel about 1 year ago

  • Status changed from Assigned to Feedback

The job you provided (with e17:prof qualifiers) enters an infinite loop deep within TensorFlow code. Specifically, the absl::InlinedVector::EnlargeBy function is problematic (see comments prefaced with 'KJK'):

template <typename T, size_t N, typename A>
void InlinedVector<T, N, A>::EnlargeBy(size_type delta) {
  const size_type s = size();
  assert(s <= capacity());

  size_type target = std::max(static_cast<size_type>(N), s + delta);

  // Compute new capacity by repeatedly doubling current capacity
  // TODO(psrc): Check and avoid overflow?
  size_type new_capacity = capacity(); // KJK: new_capacity is 0
  while (new_capacity < target) {      // KJK: comparison returns true
    new_capacity <<= 1;                // KJK: new_capacity is shifted by 1 bit (i.e. doubled); still 0.
  }                                    // KJK: new_capacity still 0, repeat loop infinitely

  Allocation new_allocation(allocator(), new_capacity);

  UninitializedCopy(std::make_move_iterator(data()),
                    std::make_move_iterator(data() + s),
                    new_allocation.buffer());

  ResetAllocation(new_allocation, s);
}

Although this is a bug in Abseil's library, it's possible it is triggered by incorrect use of TensorFlow. I recommend two approaches:

  • Switch to using TensorFlow's newer C++ API (some of the TensorFlow code in LArSoft uses an older API); that may help get around the problem.
  • Switch to a newer version of TensorFlow, which does not use this fragile bit of code. It would take some investigation to figure out which version of TensorFlow would be adequate.

Please let us know how you would like to proceed.

#21 Updated by Tingjun Yang about 1 year ago

Dear Kyle,

Thank you for the investigation and for locating the problem. It is probably easier to update the C++ API, as building a newer version of tensorflow can be time-consuming. Do you expect us to update the API, or are you available to help us with that?

Thanks,
Tingjun

#22 Updated by Kyle Knoepfel about 1 year ago

We should probably do it together. Are you guys available sometime this week to meet?

#23 Updated by Kyle Knoepfel about 1 year ago

Pinging to keep this alive.

#24 Updated by Saul Alonso Monsalve about 1 year ago

I am getting similar errors when trying to run a network trained with Tensorflow 1.12.0 (I have attached a file with the errors). As Kyle said, it might be an issue related to the Tensorflow C++ API. I will investigate it.

#25 Updated by Tingjun Yang about 1 year ago

Hi Kyle and all,

We would like to set up a meeting with you next week to discuss a plan to upgrade the tensorflow C++ API. Could you please fill in this doodle poll?
https://doodle.com/poll/pt2fw2f26ntac5bw

Thanks,
Tingjun

#26 Updated by Tingjun Yang about 1 year ago

Lynn, Leigh, Saul, and I just had a meeting. We agreed to try ClientSession as suggested by Kyle and see if that solves the problem.

#27 Updated by Leigh Whitehead about 1 year ago

Saúl and I have done a bit of research and found the following: the C++ interface in larreco (and the one in dunetpc that we based on it), which was written by Robert Sulej, actually goes straight to the core of Tensorflow and bypasses the entire C++ API. This certainly doesn't sound like the best approach! As such, we will modify the interface in dunetpc to test things properly with the C++ API, using the ClientSession class suggested by Kyle and making some other necessary changes. Hopefully this will solve the issues we have seen.
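
For orientation, this is roughly the shape of the ClientSession pattern from TensorFlow's C++ client headers. It is only a sketch using a trivial constant graph, not the actual CVN/CNN interface code:

#include <vector>
#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"

int main() {
  // Build a tiny graph in C++ and run it through ClientSession.
  tensorflow::Scope root = tensorflow::Scope::NewRootScope();
  auto a = tensorflow::ops::Const(root, {{1.f, 2.f}});    // 1x2 constant
  auto b = tensorflow::ops::Const(root, {{3.f}, {4.f}});  // 2x1 constant
  auto product = tensorflow::ops::MatMul(root, a, b);

  tensorflow::ClientSession session(root);
  std::vector<tensorflow::Tensor> outputs;
  TF_CHECK_OK(session.Run({product}, &outputs));  // outputs[0] holds the 1x1 result
  return 0;
}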

#28 Updated by Kyle Knoepfel about 1 year ago

  • Related to Support #23361: Abseil not c++17 compliant, causing problems in Tensorflow v1_12_0b e17 build added

#29 Updated by Tingjun Yang 8 months ago

There is good progress on this. Pengfei made tensorflow 1.12.0b for SLF7 (both py3 and py2):
https://home.fnal.gov/~dingpf/tensorflow/

I tested the prof build and I was able to build larrecodnn and dunetpc against it with some minor changes. The results on the DUNE FD CVN and ProtoDUNE CNN are unchanged:

[cnn.png and cvn.png attached: left is with tensorflow 1.12.0b and right is with tensorflow v1.3.0i.]

Lynn, would it be OK to install the tensorflow 1.12.0b builds on larsoft cvmfs? I will provide a pull request to larrecodnn and feature branches for the experiment code.

Tingjun

#30 Updated by Lynn Garren 8 months ago

In order to install, we must have a source code tarball and be able to build tensorflow on jenkins. These are requirements. We already have a v1_12_0b tag for the tensorflow build that does not include any modifications from Pengfei. There will need to be a new tag (v1_12_0c) with an updated build script.

#31 Updated by Pengfei Ding 8 months ago

Here is the link to the source tarball:

https://home.fnal.gov/~dingpf/tensorflow/tensorflow-v1.12.0c-source.tar.bz2

It has the tag v1_12_0c and includes several fixes:

  1. the latest protobuf changed its package layout (using lib64 as opposed to lib);
  2. one of tensorflow's header files used a deprecated declaration; this did not stop the build of tensorflow itself, since "-Wno-deprecated-declarations" is set there, but it caused errors for dunetpc because that compiler flag is not set by default in e19 (see the illustrative snippet below).
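
For context only, this is the generic CMake way a consuming package could pass that flag; it illustrates the flag in question and is not the actual fix, which was made in the tensorflow header itself in the tarball:

# Illustrative only: silence deprecated-declaration warnings in a package
# that includes the offending tensorflow header.
add_compile_options(-Wno-deprecated-declarations)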

#32 Updated by Lynn Garren 8 months ago

Would you please commit the required changes to ssh:///cvs/projects/build-framework-tensorflow-ssi-build?

#33 Updated by Lynn Garren 8 months ago

And I have a question. There are old entries in the table file for the e17 qualifier. Is there any reason to keep these entries in the table file? In other words, are you intending to build and use this release of tensorflow with something other than the current head of larsoft?

#34 Updated by Tingjun Yang 8 months ago

We only need e19 builds.

#35 Updated by Lynn Garren 8 months ago

Thanks for clarifying that. How soon can you provide the PR and experiment code feature branches? I realize we can't run the CI tests until tensorflow v1_12_0c is on cvmfs.

#36 Updated by Tingjun Yang 8 months ago

I can provide them early tomorrow.

#37 Updated by Pengfei Ding 8 months ago

I've committed my changes to the repo. I forgot to mention that I added a "compile_nsync.sh" to the patch directory.

I also updated the source tarball linked here:

https://home.fnal.gov/~dingpf/tensorflow/tensorflow-v1.12.0c-source.tar.bz2

The build script in the source tarball is now in sync with the version in the git repo. The previous version was using `-std=c++14` for the e17 and e19 builds; this is fixed in the latest tarball. Please download the tarball again.

#38 Updated by Tingjun Yang 8 months ago

Hi Lynn,

I have submitted larrecodnn PR#1 and created the feature branch feature/team_for_tensorflow_v1_12_0b in dunetpc. Note that the branch name says v1_12_0b, but everything now depends on v1_12_0c.

Please let me know if you have questions.

#39 Updated by Tingjun Yang 8 months ago

I have renamed the dunetpc branch to feature/tjyang_tensorflow_v1_12_0

#40 Updated by Lynn Garren 8 months ago

tensorflow v1_12_0c is now on cvmfs. I'll give it time to propagate before triggering a CI.

#41 Updated by Lynn Garren 7 months ago

  • % Done changed from 0 to 100
  • Status changed from Feedback to Resolved

Thanks to everyone who contributed. larsoft v08_47_00 now uses tensorflow v1_12_0c.

#42 Updated by Kyle Knoepfel 7 months ago

  • Status changed from Resolved to Closed
