Abseil not c++17 compliant, causing problems in Tensorflow v1_12_0b e17 build
This ticket passes on what NOvA has just learned about the Tensorflow v1_12_0b e17 build, in case other experiments run into the same issue.
They made their own v1_12_0b Tensorflow e17 build with two changes:
1. build with gcc 7.3, but with c++14 instead of c++17;
2. build on a node with an Intel CPU so that "SSE" extensions were enabled in the resulting shared library.
NOvA first noticed trouble with the v1_12_0b Tensorflow e17 build on SciSoft when their code hung at:
absl::InlinedVector<long long, 4ul, std::allocator<long long> >::EnlargeBy(unsigned long) ()
It looked as though the third-party package Abseil, which Tensorflow uses, is not c++17 compliant. When they tested a build made with gcc 7.3 but with c++14, the problem seemed to go away.
Additionally, they found that the shared library in the current v1_12_0b on SciSoft does not seem to use Intel's "SSE" extensions, which made it very slow and inefficient to run. This might be because the Jenkins server distributed the build task to an AMD machine, so the build did not enable SSE extensions. Rebuilding the product on a machine with Intel CPUs mitigated the problem.
NOvA previously had issues when running grid jobs with the tensorflow product with SSE turned on. Those were caused by jobs landing on nodes with AMD CPUs, or on older nodes that do not support SSE. That problem was mitigated later, when they found that condor/jobsub can filter on CPU type at job submission.
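For anyone who wants a job to fail early and clearly on a node without the needed extensions, instead of crashing inside the library, here is a minimal sketch of a startup check. It is not what NOvA did. The function names are hypothetical, "sse4.2" stands in for whichever extension the build actually requires, and the host check assumes GCC or Clang on x86 (it simply reports false elsewhere).

```cpp
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole word in a whitespace-separated
// flags line, e.g. the "flags" line of /proc/cpuinfo on Linux.
// (Hypothetical helper, for illustration only.)
inline bool has_cpu_flag(const std::string& flags_line, const std::string& flag) {
    std::istringstream in(flags_line);
    std::string word;
    while (in >> word) {
        if (word == flag) return true;
    }
    return false;
}

// Asks the running CPU directly, using the GCC/Clang x86 builtins.
// Assumption: "sse4.2" is the extension of interest; adjust as needed.
inline bool host_supports_sse42() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.2") != 0;
#else
    return false;  // unknown compiler or architecture: be conservative
#endif
}
```

A job wrapper could call host_supports_sse42() before loading the tensorflow library and exit with a readable message if it returns false. Note that /proc/cpuinfo spells the flags with underscores (sse4_1, sse4_2), while the builtin takes "sse4.1" and "sse4.2".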
#3 Updated by Christopher Green 11 months ago
- % Done changed from 0 to 100
- Status changed from Assigned to Resolved
Thank you for this information. As a side note: generally speaking "standards compliance" issues do not cause run-time errors, although compiler-specific bugs could plausibly do so. In this case however, I understand from other sources that this was a bug in Abseil itself.
However, it is important specifically with Abseil (this is not the case in general) to ensure that everything is built to the same C++ standard, as it appears that in recent versions the headers may define some types (e.g. absl::string_view) differently depending on the standard selected. If the standard used to build the Abseil library differs from the one used to build the code that uses it, there may be issues.
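One way to catch such a mismatch is to have the library record the standard it was built with and have client code compare that against its own. The following is only a sketch of that idea, not something Abseil or Tensorflow provides: all names are hypothetical, and it relies on the __cplusplus macro, which gcc sets according to the -std option.

```cpp
#include <string>

// Map a __cplusplus value to a readable standard name.
inline std::string standard_name(long value) {
    if (value >= 201703L) return "c++17 or later";
    if (value >= 201402L) return "c++14";
    if (value >= 201103L) return "c++11";
    return "pre-c++11";
}

// Hypothetical hook. In a real setup this definition would live in a source
// file of the library, so that the returned value reflects the flags the
// library was built with, not those of the caller.
long library_standard() { return __cplusplus; }

// True when the library and the client translation unit were built to the
// same standard. The default argument is evaluated in the caller's build.
inline bool standards_match(long library_value, long client_value = __cplusplus) {
    return standard_name(library_value) == standard_name(client_value);
}
```

Client code could call standards_match(library_standard()) once at startup and refuse to continue on a mismatch. In this single-file form the two values are trivially equal; the check only becomes meaningful once library_standard() is compiled as part of the library.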