Hi Wenlei,

Thanks for the comments! David answered the first question, but I do have some comments on the second one.

Teresa

On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com> wrote:
> On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:
>> This sounds very useful. We've improved and used memoro <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and analysis, and we are also looking for ways to leverage memory profiles for PGO/FDO. I think having a common profiling infrastructure for analysis tooling as well as profile guided optimizations is a good design, and having it in LLVM is also helpful. Very interested in the tooling and optimizations that come after the profiler.
>>
>> Two questions:
>>
>> - How does the profiling overhead look? Is it similar to the ASAN overhead you've seen, which would be higher than PGO instrumentation? Asking because I'm wondering if any PGO training setup can be used directly for the new heap profiling.
>
> It is built on top of the ASAN runtime, but the overhead can be made much lower by using counter update consolidation -- all fields sharing the same shadow counter can be merged, and aggressive loop sinking/hoisting can be done.
>
> The goal is to integrate this with the PGO instrumentation. The PGO instrumentation overhead can be further reduced with a sampling technique (Rong Xu has a patch to be submitted).
>
>> - I'm not familiar with how the sanitizer handles stack traces, but for getting the most accurate calling context (using FP rather than DWARF), I guess frame pointer omission and tail call optimization etc. need to be turned off? Is that going to be implied by -fheapprof?
>
> Kostya can provide detailed answers to these questions.

I'm not aware that the -fsanitize* options disable these, but I know in our environment we do disable frame pointer omission when setting up ASAN builds, and I am arranging for heap profiling builds to do the same. I'm not sure whether we want to do this within clang itself; I would be interested in Kostya's opinion. I can't see anywhere that we are disabling tail call optimizations for ASAN, though I might have missed it.

Thanks,
Teresa

> David

>> Thanks,
>> Wenlei
>>
>> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa Johnson via llvm-dev <llvm-dev at lists.llvm.org>
>> *Reply-To: *Teresa Johnson <tejohnson at google.com>
>> *Date: *Wednesday, June 24, 2020 at 4:58 PM
>> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka <vitalybuka at google.com>
>> *Cc: *David Li <davidxl at google.com>
>> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>>
>> Hi all,
>>
>> I've included an RFC for a heap profiler design I've been working on in conjunction with David Li. Please send any questions or feedback. For sanitizer folks, one area of feedback is on refactoring some of the *ASAN shadow setup code (see the Shadow Memory section).
>>
>> Thanks,
>> Teresa
>>
>> RFC: Sanitizer-based Heap Profiler
>>
>> Summary
>>
>> This document provides an overview of an LLVM Sanitizer-based heap profiler design.
>>
>> Motivation
>>
>> The objective of heap memory profiling is to collect critical runtime information associated with heap memory references and information on heap memory allocations.
>> The profile information will be used first for tooling, and subsequently to guide the compiler optimizer and allocation runtime to lay out heap objects with improved spatial locality. As a result, DTLB and cache utilization will be improved, and program IPC (performance) will be increased due to reduced TLB and cache misses. More details on the heap profile guided optimizations will be shared in the future.
>>
>> Overview
>>
>> The profiler is based on compiler inserted instrumentation of load and store accesses, and utilizes runtime support to monitor heap allocations and profile data. The target consumer of the heap memory profile information is initially tooling, and ultimately automatic data layout optimizations performed by the compiler and/or allocation runtime (with the support of new allocation runtime APIs).
>>
>> Each memory address is mapped to Shadow Memory <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the approach used by the Address Sanitizer <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN). Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow location implements the profile counter (incremented on accesses to the corresponding memory). This granularity was chosen to help avoid counter overflow, but it may be possible to consider mapping 32 bytes to 4 bytes. To avoid aliasing of shadow memory for different allocations, we must choose a minimum alignment carefully. As discussed further below, we can attain a 32-byte minimum alignment, instead of a 64-byte alignment, by storing the necessary heap information for each allocation in a 32-byte header block.
>>
>> The compiler instruments each load and store to increment the associated shadow memory counter, in order to determine hotness.
>>
>> The heap profiler runtime is responsible for tracking allocations and deallocations, including the stack at each allocation, and information such as the allocation size and other statistics. I have implemented a prototype built using a stripped down and modified version of ASAN; however, this will be a separate library utilizing sanitizer_common components.
>>
>> Compiler
>>
>> A simple HeapProfiler instrumentation pass instruments interesting memory accesses (loads, stores, atomics) with a simple load, increment, store of the associated shadow memory location (computed via a mask and shift to do the mapping of 64 bytes to 8 bytes of shadow, and an add of the shadow offset). The handling is very similar to and based off of the ASAN instrumentation pass, with slightly different instrumentation code.
>>
>> Various techniques can be used to reduce the overhead, by aggressively coalescing counter updates (e.g. given the 32-byte alignment, accesses known to be in the same 32-byte block, or across possible aliases since we don't care about the dereferenced values).
>>
>> Additionally, the Clang driver needs to be set up to link with the runtime library, much as it does with the sanitizers.
>>
>> A -fheapprof option is added to enable the instrumentation pass and runtime library linking. Similar to -fprofile-generate, -fheapprof will accept an argument specifying the directory in which to write the profile.
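To make the 64-byte-to-8-byte shadow mapping and counter update concrete, here is a rough sketch of the inlined sequence the instrumentation pass might emit for a single access. This is illustrative only and not taken from the RFC: the mapping constants follow the text above, while the dynamic shadow base variable name is hypothetical.

#include <cstdint>

// Hypothetical dynamic shadow base, set up by the runtime at startup.
extern uintptr_t __heapprof_shadow_base;

// Conceptual equivalent of the instrumentation emitted for one access to Addr:
// map the containing 64-byte block to its 8-byte shadow counter, then do a
// plain (non-atomic) load, increment, store.
inline void RecordAccess(uintptr_t Addr) {
  // Clear the low 6 bits (64-byte block start), scale 64 -> 8 (>> 3), add base.
  uintptr_t Shadow = ((Addr & ~(uintptr_t)0x3F) >> 3) + __heapprof_shadow_base;
  uint64_t *Counter = reinterpret_cast<uint64_t *>(Shadow);
  *Counter = *Counter + 1;
}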
>> Runtime
>>
>> The heap profiler runtime is responsible for tracking and reporting information about heap allocations and accesses, aggregated by allocation calling context: for example, the hotness, lifetime, and cpu affinity.
>>
>> A new heapprof library will be created within compiler-rt. It will leverage support within sanitizer_common, which already contains facilities needed by the heap profiler, like stack context tracking.
>>
>> Shadow Memory
>>
>> There are some basic facilities in sanitizer_common for mmap'ing the shadow memory, but most of the existing setup lives in the ASAN and HWASAN libraries. In the case of ASAN, there is support for both statically assigned shadow offsets (the default on most platforms) and dynamically assigned shadow memory (implemented for Windows and currently also used for Android and iOS). According to kcc, recent experiments show that the performance with a dynamic shadow is close to that with a static mapping. In fact, that is the only approach currently used by HWASAN. Given the simplicity, the heap profiler will be implemented with a dynamic shadow as well.
>>
>> There are a number of functions in ASAN and HWASAN related to setup of the shadow that are duplicated but very nearly identical, at least for Linux (which seems to be the only OS flavor currently supported for HWASAN): e.g. ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN there is another nearly identical copy in PremapShadow, used by Android, whereas in HWASAN the premap handling is already commoned with the non-premap handling). Rather than make yet another copy of these mechanisms, I propose refactoring them into sanitizer_common versions. Like HWASAN, the initial version of the heap profiler will be supported for Linux only, but other OSes can be added as needed, similar to ASAN.
>>
>> StackTrace and StackDepot
>>
>> The sanitizer already contains support for obtaining and representing a stack trace in a StackTrace object, and for storing it in the StackDepot, which "efficiently stores huge amounts of stack traces". This is in the sanitizer_common subdirectory and the support is shared by ASAN and ThreadSanitizer. The StackDepot is essentially an unbounded hash table, where each StackTrace is assigned a unique id. ASAN stores this id in the alloc_context_id field in each ChunkHeader (in the redzone preceding each allocation). Additionally, there is support for symbolizing and printing StackTrace objects.
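As a rough illustration of how the allocation path could reuse this existing support, a sketch is shown below. It is not the prototype's actual code: the CaptureAllocationStack helper is hypothetical, and the StackDepotPut/StackDepotGet interfaces and header paths are the sanitizer_common ones as recalled from memory, simplified.

// Sketch only. Assumes sanitizer_common's stack depot interfaces, roughly:
//   u32 StackDepotPut(StackTrace stack);   // store a trace, returns its unique id
//   StackTrace StackDepotGet(u32 id);      // retrieve it for symbolization/printing
#include "sanitizer_common/sanitizer_stackdepot.h"
#include "sanitizer_common/sanitizer_stacktrace.h"

// Hypothetical helper: unwind the current (frame-pointer based) stack at an
// allocation site, e.g. via the GET_STACK_TRACE_* machinery ASAN uses.
__sanitizer::StackTrace CaptureAllocationStack();

// On allocation: record the stack context id, to be stored in the header block
// as alloc_context_id (see the ChunkHeader below).
__sanitizer::u32 RecordAllocationStack() {
  __sanitizer::StackTrace Stack = CaptureAllocationStack();
  return __sanitizer::StackDepotPut(Stack);
}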
>> ChunkHeader
>>
>> The heap profiler needs to track several pieces of information for each allocation. Given the mapping of 64 bytes to 8 bytes of shadow, we can achieve a minimum of 32-byte alignment by holding this information in a 32-byte header block preceding each allocation.
>>
>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains information about the current allocation state, the user requested size, the allocation and free thread ids, the allocation context id (representing the call stack at allocation, assigned by the StackDepot as described above), and misc other bookkeeping. For heap profiling, this will be converted to a 32-byte header block.
>>
>> Note that we could instead use the metadata section, similar to other sanitizers, which is stored in a separate location. However, as described above, storing the header block with each allocation enables 32-byte alignment without aliasing shadow counters for the same 64 bytes of memory.
>>
>> In the prototype heap profiler implementation, the header contains the following fields:
>>
>> // Should be 32 bytes
>> struct ChunkHeader {
>>   // 1-st 4 bytes
>>   // Carry over from ASAN (available, allocated, quarantined). Will be
>>   // reduced to 1 bit (available or allocated).
>>   u32 chunk_state : 8;
>>   // Carry over from ASAN. Used to determine the start of the user allocation.
>>   u32 from_memalign : 1;
>>   // 23 bits available
>>
>>   // 2-nd 4 bytes
>>   // Carry over from ASAN (comment copied verbatim).
>>   // This field is used for small sizes. For large sizes it is equal to
>>   // SizeClassMap::kMaxSize and the actual size is stored in the
>>   // SecondaryAllocator's metadata.
>>   u32 user_requested_size : 29;
>>
>>   // 3-rd 4 bytes
>>   u32 cpu_id; // Allocation cpu id
>>
>>   // 4-th 4 bytes
>>   // Allocation timestamp in ms from a baseline timestamp computed at
>>   // the start of profiling (to keep this within 32 bits).
>>   u32 timestamp_ms;
>>
>>   // 5-th and 6-th 4 bytes
>>   // Carry over from ASAN. Used to identify the allocation stack trace.
>>   u64 alloc_context_id;
>>
>>   // 7-th and 8-th 4 bytes
>>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>>   u64 data_type_id; // hash of type name
>> };
>>
>> As noted, the chunk state can be reduced to a single bit (there is no need for quarantined memory in the heap profiler). The header contains a placeholder for the data type hash, which is not yet implemented as it needs instrumentation and IR support.
>>
>> Heap Info Block (HIB)
>>
>> On a deallocation, information from the corresponding shadow block(s) and header is recorded in a Heap Info Block (HIB) object. The access count is computed from the shadow memory locations for the allocation, as is the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero 8-byte shadow locations for the whole allocation). Other information, such as the deallocation timestamp (for lifetime computation) and deallocation cpu id (to determine migrations), is recorded along with the information in the chunk header recorded at allocation.
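As a rough illustration of the computation just described (not the prototype's actual code), the following sketch summarizes the shadow counters covering a single allocation at deallocation time. The MemToShadow helper name is hypothetical, standing in for the mask-and-shift mapping described in the Compiler section.

#include <cstdint>

// Hypothetical helper: returns a pointer to the 8-byte shadow counter that
// covers the 64-byte block containing Addr (mask, shift, add shadow base).
uint64_t *MemToShadow(uintptr_t Addr);

struct ShadowSummary {
  uint64_t TotalAccessCount;  // sum of the counters covering the allocation
  uint32_t PercentUtilized;   // percentage of 64-byte blocks with a non-zero counter
};

// Sketch: walk the 64-byte blocks spanned by an allocation of Size bytes
// starting at Start (32-byte aligned, preceded by the 32-byte header).
ShadowSummary SummarizeShadow(uintptr_t Start, uintptr_t Size) {
  uint64_t Total = 0, NonZero = 0, Blocks = 0;
  for (uintptr_t Addr = Start & ~(uintptr_t)0x3F; Addr < Start + Size; Addr += 64) {
    uint64_t Count = *MemToShadow(Addr);
    Total += Count;
    if (Count != 0)
      ++NonZero;
    ++Blocks;
  }
  ShadowSummary S;
  S.TotalAccessCount = Total;
  S.PercentUtilized = Blocks ? (uint32_t)(NonZero * 100 / Blocks) : 0;
  return S;
}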
>> The prototyped HIB object tracks the following:
>>
>> struct HeapInfoBlock {
>>   // Total allocations at this stack context
>>   u32 alloc_count;
>>   // Access count computed from all allocated 64-byte blocks (track total
>>   // across all allocations, and the min and max).
>>   u64 total_access_count, min_access_count, max_access_count;
>>   // Allocated size (track total across all allocations, and the min and max).
>>   u64 total_size;
>>   u32 min_size, max_size;
>>   // Lifetime (track total across all allocations, and the min and max).
>>   u64 total_lifetime;
>>   u32 min_lifetime, max_lifetime;
>>   // Percent utilization of allocated 64-byte blocks (track total
>>   // across all allocations, and the min and max). The utilization is
>>   // defined as the percentage of 8-byte shadow counters corresponding to
>>   // the full allocation that are non-zero.
>>   u64 total_percent_utilized;
>>   u32 min_percent_utilized, max_percent_utilized;
>>   // Allocation and deallocation timestamps from the most recent merge into
>>   // the table with this stack context.
>>   u32 alloc_timestamp, dealloc_timestamp;
>>   // Allocation and deallocation cpu ids from the most recent merge into
>>   // the table with this stack context.
>>   u32 alloc_cpu_id, dealloc_cpu_id;
>>   // Count of allocations at this stack context that had a different
>>   // allocation and deallocation cpu id.
>>   u32 num_migrated_cpu;
>>   // Number of times the entry being merged had its lifetime overlap with the
>>   // previous entry merged with this stack context (by comparing the new
>>   // alloc/dealloc timestamps with the ones last recorded in the table entry).
>>   u32 num_lifetime_overlaps;
>>   // Number of times the alloc/dealloc cpu of the entry being merged was the
>>   // same as that of the previous entry merged with this stack context.
>>   u32 num_same_alloc_cpu;
>>   u32 num_same_dealloc_cpu;
>>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support and
>>   // possibly IR changes.
>>   u64 data_type_id;
>> };
>>
>> HIB Table
>>
>> The Heap Info Block Table, which is a multi-way associative cache, holds HIB objects from deallocated objects. It is indexed by the stack allocation context id from the chunk header, and currently utilizes a simple mod with a prime number close to a power of two as the hash (because of the way the stack context ids are assigned, a mod of a power of two performs very poorly). Thus far, only 4-way associativity has been evaluated.
>>
>> HIB entries are added or merged into the HIB Table on each deallocation. If an entry with a matching stack alloc context id is found in the table, the newly deallocated information is merged into the existing entry. Each HIB Table entry currently tracks the min, max and total value of the various fields, for use in computing and reporting the min, max and average when the table is ultimately dumped.
>>
>> If no entry with a matching stack alloc context id is found, a new entry is created. If this causes an eviction, the evicted entry is dumped immediately (by default to stderr, otherwise to a specified report file). Later post-processing can merge dumped entries with the same stack alloc context id.
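A rough sketch of the merge-or-insert behavior described above is shown below. It is illustrative only: the table geometry constants and the helper functions are assumptions, not the prototype's actual code.

#include <cstdint>

// Abbreviated stand-in for the HeapInfoBlock defined above (fields omitted).
struct HeapInfoBlock { uint32_t alloc_count; uint64_t total_access_count; /* ... */ };

// Assumed table geometry, for illustration only.
static const uint32_t kNumWays = 4;      // 4-way associativity, as evaluated so far
static const uint32_t kNumSets = 65521;  // a prime close to a power of two (2^16)

static uint64_t Ids[kNumSets][kNumWays];        // stack alloc context id per entry; 0 = empty
static HeapInfoBlock Table[kNumSets][kNumWays];

// Hypothetical helpers, not the prototype's actual functions.
void MergeInto(HeapInfoBlock &Existing, const HeapInfoBlock &New); // fold totals/min/max
uint32_t PickVictim(uint32_t Set);                                 // choose a way to evict
void DumpEntry(uint64_t Id, const HeapInfoBlock &HIB);             // write to the report file

// On deallocation: merge into a matching entry, else insert (evicting if needed).
void MergeOrInsert(uint64_t AllocContextId, const HeapInfoBlock &NewHIB) {
  uint32_t Set = AllocContextId % kNumSets;  // simple mod-with-prime hash of the stack id
  int FreeWay = -1;
  for (uint32_t Way = 0; Way < kNumWays; ++Way) {
    if (Ids[Set][Way] == AllocContextId) {   // match: merge into the existing entry
      MergeInto(Table[Set][Way], NewHIB);
      return;
    }
    if (Ids[Set][Way] == 0 && FreeWay < 0)
      FreeWay = (int)Way;                    // remember an empty slot
  }
  uint32_t Way = FreeWay >= 0 ? (uint32_t)FreeWay : PickVictim(Set);
  if (FreeWay < 0)
    DumpEntry(Ids[Set][Way], Table[Set][Way]); // evicted entries are dumped immediately
  Ids[Set][Way] = AllocContextId;
  Table[Set][Way] = NewHIB;
}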
>> Initialization
>>
>> For ASAN, an __asan_init function initializes the memory allocation tracking support, and the ASAN instrumentation pass in LLVM creates a global constructor to invoke it. The heap profiler prototype adds a new __heapprof_init function, which performs heap profile specific initialization, and the heap profiler instrumentation pass instead invokes this new init function via a generated global constructor. It currently additionally invokes __asan_init, since we are leveraging a modified ASAN runtime. Eventually, this should be changed to initialize the refactored common support.
>>
>> Note that __asan_init is also placed in the .preinit_array when it is available, so it is invoked even earlier than global constructors. Currently, it is not possible to do this for __heapprof_init, as it calls timespec_get in order to get a baseline timestamp (as described in the ChunkHeader comments, the timestamps (ms) are actually offsets from the baseline timestamp, in order to fit into 32 bits), and system calls cannot be made that early (dl_init is not complete). Since the constructor priority is 1, it should be executed early enough that there are very few allocations before it runs, and likely the best solution is to simply ignore any allocations before initialization.
>>
>> Dumping
>>
>> For the prototype, the profile is dumped as text in a compact raw format to limit its size. Ultimately it should be dumped in a more compact binary format (i.e. into a different section of the raw instrumentation based profile, with llvm-profdata performing post-processing), which is TBD.
>>
>> HIB Dumping
>>
>> As noted earlier, HIB Table entries are created as memory is deallocated. At the end of the run (or whenever dumping is requested, discussed later), HIB entries need to be created for allocations that are still live. Conveniently, the sanitizer allocator already contains a mechanism to walk through all chunks of memory it is tracking (ForEachChunk). The heap profiler simply looks for all chunks with a chunk state of allocated, and creates a HIB the same as would be done on deallocation, adding each to the table.
>>
>> A HIB Table mechanism for printing each entry is then invoked.
>>
>> By default, the dumping occurs:
>>
>> - on evictions
>> - full table at exit (when the static Allocator object is destructed)
>>
>> For running in a load testing scenario, we will want to add a mechanism to provoke finalization (merging currently live allocations) and dumping of the HIB Table before exit. This would be similar to the __llvm_profile_dump facility used for normal PGO counter dumping.
>>
>> Stack Trace Dumping
>>
>> There is existing support for dumping symbolized StackTrace objects. A wrapper to dump all StackTrace objects in the StackDepot will be added. This new interface is invoked just after the HIB Table is dumped (on exit or via the dumping interface).
>>
>> Memory Map Dumping
>>
>> In cases where we may want to symbolize as a post-processing step, we may need the memory map (from /proc/self/smaps). Specifically, this is needed to symbolize binaries using ASLR (Address Space Layout Randomization). There is already support for reading this file and dumping it to the specified report output file (DumpProcessMap()). This is invoked when the profile output file is initialized (HIB Table construction), so that the memory map is available at the top of the raw profile.
>>
>> Current Status and Next Steps
>>
>> As mentioned earlier, I have a working prototype based on a simplified, stripped down version of ASAN. My current plan is to do the following:
>>
>> 1. Refactor out some of the shadow setup code common between ASAN and HWASAN into sanitizer_common.
>> 2. Rework my prototype into a separate heapprof library in compiler-rt, using sanitizer_common support where possible, and send patches for review.
>> 3. Send patches for the heap profiler instrumentation pass and related clang options.
>> 4. Design/implement the binary profile format.
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson at google.com |

-- 
Teresa Johnson | Software Engineer | tejohnson at google.com |