Potential Pitfalls and their Resolutions

Kernel version <2.6.21 + Unstable git libunwind version (April 20, 2010 and later)

  • Symptoms
    • Call paths are truncated on 32-bit systems (in particular, a chain of recursive calls is logged as a single call)
    • profdata_x_y files are empty on 64-bit systems (unw_local_init always returns -1)
  • Solutions
    • Change #define HAVE_MINCORE 1 to #define HAVE_MINCORE 0 in config.h in <libunwind_source_directory>/include and rerun make && make install (do not run configure)
    • Remove the word "mincore" from in <libunwind_source_directory> and run autoreconf <autoconf_options> --force && ./configure && make && make install
    • Modify in <libunwind_source_directory> to properly detect older kernels and disable HAVE_MINCORE and regenerate files with autoconf
    • Apply the attached configure_mincore_kernel_version_check.patch patch file.
    • Use the patch from and run configure with --enable-mincore=no and again autoreconf, configure, and make. This has the advantage that libraries compiled on kernels newer than 2.6.21 can be used on systems with older kernels.
  • Technical Details
    • Commit ee99dbec879212406d813b1bae56b988b4ab1e00 added the ability to use mincore() instead of msync() when determining if a page accessed anonymously is mapped.
    • Both msync() and mincore() can be used to check that an address libunwind is about to access is valid in the context of the process being profiled and lies in mapped memory.
    • msync() is a system call that synchronizes memory with physical storage. libunwind calls msync() with the MS_ASYNC flag which returns almost immediately. Until kernel version 2.5.67, I/O would be performed to synchronize memory with physical storage, but in newer kernel versions, I/O isn't started. Until kernel version 2.6.17, this function would also mark pages dirty, but in newer kernel versions, dirty pages are properly tracked, so this too no longer happens. On these more recent kernels, msync() only performs two tasks when MS_ASYNC is set: locks mapped pages temporarily in memory and returns an error on unmapped pages or bad address ranges. It is this second action that libunwind depends on when validating memory.
    • mincore() is a system call that determines if pages are resident in memory and whether they can be accessed without causing a page fault. As with msync(), libunwind uses this call simply to determine whether the address range is invalid or contains unmapped pages.
    • libunwind needs this information about the page it is about to access because attempting to read from an unmapped virtual address causes a segmentation fault.
    • mincore() is not supported on all kernels and configurations. In particular, on kernels older than 2.6.21 it does not return correct information for anonymous mappings, nonlinear mappings, and migration entries (this was patched on Feb 12, 2007).
    • Because most of the pages libunwind attempts to access belong to anonymous mappings, mincore() returns an error for many pages that are properly mapped. libunwind interprets the error to mean that the page is unmapped and, by default, makes no attempt to access potentially unmapped pages.
    • This causes unw_local_init to fail every time it is called on an x86_64 system
    • On i386 systems, the first call to mincore() when stepping through a chain of recursive calls succeeds, but subsequent mincore() calls fail. The recursive call chain therefore appears to collapse (a function that calls itself recursively shows up as a single call in the profdata_x_y_paths file), and call paths are greatly shortened (depth <= 3).
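
The validation described above can be sketched in C as follows. This is a minimal illustration and not libunwind's actual code: the function names (page_start, page_is_mapped_msync, page_is_mapped_mincore, release_predates_mincore_fix, page_is_mapped) are hypothetical, and the kernel check is done at run time here purely for demonstration, whereas the solutions above make the choice at build time. Both validators ignore everything except whether the system call succeeded.

```c
#define _DEFAULT_SOURCE /* exposes mincore() and MAP_ANONYMOUS in glibc headers */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Round addr down to the start of the page that contains it. */
static void *page_start(const void *addr)
{
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    return (void *)((uintptr_t)addr & ~(page_size - 1));
}

/* msync() variant: returns 1 if the page holding addr is mapped, else 0. */
static int page_is_mapped_msync(const void *addr)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    return msync(page_start(addr), page_size, MS_ASYNC) == 0;
}

/* mincore() variant: same contract; the residency vector itself is ignored. */
static int page_is_mapped_mincore(const void *addr)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char vec[1]; /* mincore() writes one byte per page queried */
    return mincore(page_start(addr), page_size, vec) == 0;
}

/* Returns 1 if a release string such as "2.6.18-194.el5" is older than
   2.6.21, the version the text above names as containing the fix. */
static int release_predates_mincore_fix(const char *release)
{
    int major = 0, minor = 0, patch = 0;
    if (sscanf(release, "%d.%d.%d", &major, &minor, &patch) < 2)
        return 1; /* unparsable: be conservative and avoid mincore() */
    if (major != 2)
        return major < 2;
    if (minor != 6)
        return minor < 6;
    return patch < 21;
}

/* Hypothetical dispatcher: fall back to msync() on older kernels. */
static int page_is_mapped(const void *addr)
{
    struct utsname u;
    if (uname(&u) != 0 || release_predates_mincore_fix(u.release))
        return page_is_mapped_msync(addr);
    return page_is_mapped_mincore(addr);
}
```

On a kernel with the fix, both variants should agree for a freshly mmap()ed anonymous page and for the same address after munmap(). On the affected kernels, the mincore() variant is the one expected to report properly mapped anonymous pages as unmapped.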

glibc version <= 2.5-24 + x86_64 OS + Executable compiled with optimization >= -O1

  • Symptoms
    • Profiling runs to completion, but the profdata_x_y_<etc> files are empty or contain very few lines, and the raw profdata_x_y file is mostly zeros.
    • total_empty_stacks in profdata_x_y_debugging is very high
  • Solutions
  • Technical Details
    • Versions of glibc prior to glibc-2.5-24 did not include unwind information for the function __restore_rt, which is used to return from a signal handler.
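    • libunwind depends on correctly detecting signal frames to locate unwind information, because the instruction pointer must be adjusted differently for the two kinds of frame: in some frames it points just past the instruction of interest, and in others it points at it.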
    • libunwind was incorrectly decrementing the address in its internal instruction pointer variable and, as a result, retrieving unwind info for the function located before __restore_rt (killpg in the case of glibc-2.5-24).
    • This causes libunwind to get an incorrect return address, fail to locate further unwind information, and return an error, which truncates call chains at the signal handler.
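
The off-by-one lookup can be illustrated with a toy model in C. Everything here is an assumption made for illustration: the addresses are invented, the symbol table is not glibc's, and find_function and unwind_lookup_address are hypothetical names rather than libunwind functions. The only detail taken from the text above is that killpg sits directly before __restore_rt.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy symbol table entry covering the address range [start, end). */
struct symbol {
    const char *name;
    uintptr_t start;
    uintptr_t end;
};

/* Made-up addresses (assumption): only the adjacency of killpg and
   __restore_rt reflects the text above. */
static const struct symbol symbols[] = {
    { "killpg",       0x1000, 0x1040 },
    { "__restore_rt", 0x1040, 0x1049 },
    { "main",         0x2000, 0x2100 },
};

/* Return the name of the function containing addr, or NULL if none does. */
static const char *find_function(uintptr_t addr)
{
    size_t count = sizeof(symbols) / sizeof(symbols[0]);
    for (size_t i = 0; i < count; i++) {
        if (addr >= symbols[i].start && addr < symbols[i].end)
            return symbols[i].name;
    }
    return NULL;
}

/* Address used to look up unwind info for a frame. An ordinary return
   address points just past the call instruction, so it is decremented to
   land inside the calling function. A signal frame's address is used as is. */
static uintptr_t unwind_lookup_address(uintptr_t ip, int is_signal_frame)
{
    return is_signal_frame ? ip : ip - 1;
}
```

With these invented addresses, a frame whose instruction pointer is 0x1040 resolves to __restore_rt when it is recognized as a signal frame, and to killpg when it is not, which is the misattribution described above.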

Segmentation fault after very long runs

  • Symptoms
    • Profiling runs that last several hours sometimes end in a segmentation fault.
  • Solutions
    • Upgrade to a later version of SimpleProfiler.
  • Technical Details
    • An internal vector was missing bounds checking for stacks with depth > 1000.
    • Normally, stacks this deep are very rare. With missing or incorrect unwind info, however, libunwind can occasionally attempt to unwind through invalid memory addresses (such as those of the heap and stack). The result is an essentially unending unwind that continues until resources are exhausted or variables overflow, causing the segmentation fault.
    • Over exceptionally long runs, these rare events become likely to occur at some point, although the moment at which they occur is unpredictable.
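
A bounds check of the kind that was missing might look like the following C sketch. It is not SimpleProfiler's actual code: the names (MAX_STACK_DEPTH, call_stack, call_stack_push, collect_frames) are hypothetical, and a fixed-size array stands in for the internal vector. The point is only that a runaway unwind is cut off at the depth limit instead of writing past the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_STACK_DEPTH 1000 /* depth limit from the text; the name is hypothetical */

/* Fixed-capacity record of one sampled call stack. */
struct call_stack {
    uintptr_t frames[MAX_STACK_DEPTH];
    size_t depth;
    int truncated; /* set when frames had to be dropped */
};

static void call_stack_init(struct call_stack *s)
{
    s->depth = 0;
    s->truncated = 0;
}

/* Append one frame. Returns 1 on success, or 0 if the stack is full, in
   which case the caller should stop unwinding. */
static int call_stack_push(struct call_stack *s, uintptr_t ip)
{
    if (s->depth >= MAX_STACK_DEPTH) {
        s->truncated = 1;
        return 0;
    }
    s->frames[s->depth++] = ip;
    return 1;
}

/* Simulate an unwind that yields `available` frames (fake addresses) and
   return how many were actually stored. */
static size_t collect_frames(struct call_stack *s, size_t available)
{
    call_stack_init(s);
    for (size_t i = 0; i < available; i++) {
        if (!call_stack_push(s, (uintptr_t)0x1000 + i))
            break; /* limit reached: abandon the runaway unwind */
    }
    return s->depth;
}
```

Recording the truncation in a flag, rather than silently dropping frames, lets the sample be counted separately from normal samples when the profile is post-processed.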