Support #22504

Request to upgrade the tensorflow version

Added by Tingjun Yang 6 months ago. Updated 1 day ago.

Status: Feedback
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 05/02/2019
Due date:
% Done: 0%
Estimated time:
Experiment: ArgoNeut, DUNE
Co-Assignees:
Duration:

Description

Dear LArSoft experts,

Would it be possible to upgrade tensorflow in larsoft? The current version is v1_3_0, and it would be great if it could be upgraded to v1_8_0 or higher. This would help the deep learning development in DUNE and ArgoNeuT, and possibly other experiments.

Thanks,
Tingjun

errors.txt (49.4 KB) - Tingjun Yang, 07/12/2019 12:15 PM
error.txt (53.9 KB) - Aiden Reynolds, 08/28/2019 07:51 AM
tensorflow_stacks.txt (77.4 KB) - Kyle Knoepfel, 09/03/2019 04:08 PM
tensorflow_errors.txt (49.5 KB) - Saul Alonso Monsalve, 10/01/2019 05:07 AM

History

#1 Updated by Kyle Knoepfel 5 months ago

  • Status changed from New to Feedback

We are accepting this, but we would like assistance from your tensorflow experts. With whom should we set up a meeting?

#2 Updated by Tingjun Yang 5 months ago

Hi Kyle,

You can set up a meeting with Saul and Leigh. I would like to join the meeting, but they are the tensorflow experts.

Thanks,
Tingjun

#3 Updated by Lynn Garren 5 months ago

We also have a request from MicroBooNE for a working python interface. I would like to have a joint meeting. Tentatively scheduled for 10 am May 8.

#4 Updated by Tingjun Yang 5 months ago

Lynn Garren wrote:

We also have a request from MicroBooNE for a working python interface. I would like to have a joint meeting. Tentatively scheduled for 10 am May 8.

This does not work for us as there will be a ProtoDUNE meeting at the same time.

#5 Updated by Tingjun Yang 3 months ago

  • Status changed from Feedback to Work in progress

From Lynn's email:

I had a talk with Marc P about the tensorflow build. We came up with two possible options. Fortunately, the cleanest option has worked and I have provided new patches for the c++17 build. I also had a look at the headers that are installed in the tensorflow include directory and refined the set.

tensorflow v1_12_0b is available for testing on cvmfs.

Please let us know if you have problems using this release. We presume that you will supply a larsim feature branch for use with tensorflow v1_12_0b. We should alert the larsoft mailing list and make sure there are no objections before making a release with the new build.

This release is only available for e17 (and e17:py3).

I remain concerned about tensorflow going forward. With newer releases we will have to use the bazel build, which appears to be problematic for spack. Also, tensorflow seems to be making its own copy of some utilities that are usually provided by the system. As long as everything is completely contained, that should be fine. Given that tensorflow maintains its own ecosystem, it may be wise to consider running it inside a container and taking the output in some fashion.

Lynn

#6 Updated by Tingjun Yang 3 months ago

I have started testing tensorflow v1_12_0b. The upgrade seems to be quite straightforward. The only issue is the following error:

/cvmfs/larsoft.opensciencegrid.org/products/tensorflow/v1_12_0b/Linux64bit+3.10-2.17-e17-prof/include/tensorflow/core/public/session.h:24,
                 from /data/tjyang/dune/larsoft_em/srcs/dunetpc/dune/CVN/tf/tf_graph.cc:12:
/cvmfs/larsoft.opensciencegrid.org/products/tensorflow/v1_12_0b/Linux64bit+3.10-2.17-e17-prof/include/tensorflow/core/lib/core/stringpiece.h:29:10: fatal error: absl/strings/string_view.h: No such file or directory
 #include "absl/strings/string_view.h" 
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

The solution is to add include_directories( $ENV{TENSORFLOW_INC}/absl ) to the CMakeLists.txt file wherever session.h is included.
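As a sketch, the workaround might look like this in the affected package's CMakeLists.txt (the surrounding lines here are illustrative, not taken from larreco or dunetpc; only the absl line is the fix described above):

```cmake
# tensorflow v1_12_0b headers pull in Abseil headers bundled under the
# tensorflow include area, which is not on the default include path.
# Add it in every CMakeLists.txt whose sources include session.h.
include_directories( $ENV{TENSORFLOW_INC}/absl )
```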

I have created feature branches feature/team_for_tensorflow_v1_12_0b in both larreco and dunetpc. I am going to test whether the tensorflow results remain the same for DUNE. I will deal with argoneutcode later.

#7 Updated by Tingjun Yang 3 months ago

Tried to run tensorflow on an existing DUNE MC file:

lar -c select_ana_dune10kt_nu.fcl xroot://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/mcc11/protodune/mc/full-reconstructed/07/51/34/20/nue_dune10kt_1x2x6_12855888_0_20181104T211321_gen_g4_detsim_reco.root -n -1

There are lots of errors like this:
2019-07-12 12:04:58.201673: E tensorflow/core/framework/op_kernel.cc:1197] OpKernel ('op: "MutableDenseHashTableV2" device_type: "CPU" constraint { name: "key_dtype" allowed_values { list { type: DT_STRING } } } constraint { name: "value_dtype" allowed_values { list { type: DT_INT64 } } }') for unknown op: MutableDenseHashTableV2

The full error log is attached.
The program hangs on the first event:
Classifier summary: 
Output 0: 0.0257705, 
Output 1: 2.55598e-05, 0.999172, 0.000753308, 4.93325e-05, 
Output 2: 0.00586059, 0.367086, 0.627042, 1.20333e-05, 
Output 3: 6.90421e-05, 3.75733e-05, 0.000108019, 0.999785, 
Output 4: 0.997856, 0.0021328, 1.05676e-05, 3.6611e-07, 
Output 5: 0.999907, 9.18793e-05, 7.18959e-07, 3.51992e-08, 
Output 6: 0.999743, 0.000255833, 9.39033e-07, 3.14338e-07, 

I guess the conclusion is that we cannot use the old networks with the new version of tensorflow.

#8 Updated by Kyle Knoepfel 3 months ago

There appears to be a conflict between the old and new tensorflow-generated data schema. Unless tensorflow supports schema evolution, you may need to regenerate your tensorflow-formatted data.

#9 Updated by Kyle Knoepfel 3 months ago

  • Status changed from Work in progress to Feedback

Do you think the feature branches you are working on will be ready for this week's release?

#10 Updated by Tingjun Yang 3 months ago

Kyle Knoepfel wrote:

Do you think the feature branches you are working on will be ready for this week's release?

Hi Kyle,
No, we need to train new networks in order to use the new version of tensorflow. By the way, does MicroBooNE use tensorflow in their code? If so, they will also need to retrain in order to use the new tensorflow.

#11 Updated by Lynn Garren 3 months ago

MicroBooNE uses tensorflow via a container. They do not use our build. We expect to have a report from them at the next coordination meeting.

#12 Updated by Aiden Reynolds about 2 months ago

Tried to run a larsoft job to do the CNN-based hit tagging with a new network trained with tensorflow v1.12.

I get a lot of errors similar to those mentioned by Tingjun above, e.g.

2019-08-28 07:49:27.703868: E tensorflow/core/framework/op_kernel.cc:1197] OpKernel ('op: "MutableDenseHashTableV2" device_type: "CPU" constraint { name: "key_dtype" allowed_values { list { type: DT_STRING } } } constraint { name: "value_dtype" allowed_values { list { type: DT_INT32 } } }') for unknown op: MutableDenseHashTableV2

After opening the input file, the job stalls but does not return a failure code; full log attached.

#13 Updated by Lynn Garren about 2 months ago

Aiden, Tingjun is still working on the training to use tensorflow 1.12. The good news is that this might be ready next week.

#14 Updated by Tingjun Yang about 2 months ago

Lynn Garren wrote:

Aiden, Tingjun is still working on the training to use tensorflow 1.12. The good news is that this might be ready next week.

Lynn, Aiden is the person who is training the CNN using tensorflow 1.12, and he saw the same error with the retrained network.

#15 Updated by Kyle Knoepfel about 2 months ago

Aiden, can you give us a sample workflow to test, including setup instructions?

#16 Updated by Tingjun Yang about 2 months ago

I have created the feature branch feature/team_for_tensorflow_v1_12_0b in both larreco and dunetpc. It works with larsoft v08_29_00.

To build it, you need to unsetup tensorflow and protobuf first.

After you build and setup dunetpc, you can run the following command to reproduce the problem:
lar -c protoDUNE_SP_keepup_decoder_reco.fcl xroot://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/07/74/37/39/np04_raw_run005809_0023_dl6.root -n 1

#17 Updated by Kyle Knoepfel about 1 month ago

I find that if I use the debug build, I am able to proceed without any issues, albeit more slowly. However, there does appear to be some type of hang in the prof build. That will take some more investigation.

#18 Updated by Kyle Knoepfel about 1 month ago

Using the 'pstack' command, I get the attached printout. Looking at the lowest stack frame for each of the threads launched by tensorflow:

$ grep "#0" tensorflow_stacks.txt
#0  0x00007f7df3cd7965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#0  0x00007f7df3cd7965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#0  0x00007f7df3cd7965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
[... 45 more identical pthread_cond_wait frames elided ...]
#0  0x00007f7dd3861c88 in absl::InlinedVector<long long, 4ul, std::allocator<long long> >::EnlargeBy(unsigned long) () from /scratch/knoepfel/products/tensorflow/v1_12_0b/Linux64bit+3.10-2.17-e17-prof/lib/libtensorflow-core.so

The only thread attempting to make progress is the master thread (thread 1), by calling EnlargeBy.

#19 Updated by Kyle Knoepfel about 1 month ago

  • Assignee set to Kyle Knoepfel
  • Status changed from Feedback to Assigned

#20 Updated by Kyle Knoepfel about 1 month ago

  • Status changed from Assigned to Feedback

The job you provided (with e17:prof qualifiers) enters an infinite loop deep within TensorFlow code. Specifically, the absl::InlinedVector::EnlargeBy function is problematic (see comments prefaced with 'KJK'):

template <typename T, size_t N, typename A>
void InlinedVector<T, N, A>::EnlargeBy(size_type delta) {
  const size_type s = size();
  assert(s <= capacity());

  size_type target = std::max(static_cast<size_type>(N), s + delta);

  // Compute new capacity by repeatedly doubling current capacity
  // TODO(psrc): Check and avoid overflow?
  size_type new_capacity = capacity(); // KJK: new_capacity is 0
  while (new_capacity < target) {      // KJK: comparison returns true
    new_capacity <<= 1;                // KJK: new_capacity is shifted by 1 bit (i.e. doubled); still 0.
  }                                    // KJK: new_capacity still 0, repeat loop infinitely

  Allocation new_allocation(allocator(), new_capacity);

  UninitializedCopy(std::make_move_iterator(data()),
                    std::make_move_iterator(data() + s),
                    new_allocation.buffer());

  ResetAllocation(new_allocation, s);
}

Although this is a bug in Abseil's library, it's possible it is triggered by incorrect use of TensorFlow. I recommend two approaches:

  • Switch to using TensorFlow's newer C++ API (some of the TensorFlow code in LArSoft uses the older API); that may help get around the problem.
  • Switch to a newer version of TensorFlow that does not use this fragile bit of code. It would take some investigation to figure out which version of TensorFlow would be adequate.

Please let us know how you would like to proceed.

#21 Updated by Tingjun Yang about 1 month ago

Dear Kyle,

Thank you for investigating and locating the problem. It is probably easier to update the C++ API, as building a newer version of tensorflow can be time-consuming. Do you expect us to update the API, or are you available to help us with that?

Thanks,
Tingjun

#22 Updated by Kyle Knoepfel about 1 month ago

We should probably do it together. Are you guys available sometime this week to meet?

#23 Updated by Kyle Knoepfel 16 days ago

Pinging to keep this alive.

#24 Updated by Saul Alonso Monsalve 15 days ago

I am getting similar errors when trying to run a network trained with Tensorflow 1.12.0 (I have attached a file with the errors). As Kyle said, it might be an issue related to the Tensorflow C++ API. I will investigate it.

#25 Updated by Tingjun Yang 6 days ago

Hi Kyle and all,

We would like to set up a meeting with you next week to discuss a plan to upgrade the tensorflow C++ API. Could you please fill in this doodle poll?
https://doodle.com/poll/pt2fw2f26ntac5bw

Thanks,
Tingjun

#26 Updated by Tingjun Yang 1 day ago

Lynn, Leigh, Saul and I just had a meeting. We agreed to try ClientSession, as suggested by Kyle, and see if that solves the problem.

#27 Updated by Leigh Whitehead 1 day ago

Saúl and I have done a bit of research and found the following: the C++ interface in larreco (and the one in dunetpc that we based on it), written by Robert Sulej, goes straight to the core of Tensorflow and bypasses the C++ API entirely. This certainly doesn't sound like the best approach! We will therefore modify the interface in dunetpc to use the C++ API properly, with the ClientSession class suggested by Kyle and some other necessary changes. Hopefully this will solve the issues that we have seen.



Also available in: Atom PDF