> I'm not aware that -fsanitizer* options disable these, but I know in our environment we do disable frame pointer omission when setting up ASAN builds, and I am arranging for heap profiling builds to do the same. Not sure whether we want to do this within clang itself; I would be interested in Kostya's opinion. I can't see anywhere that we are disabling tail call optimizations for ASAN though, but I might have missed it.

We don't force frame pointers to be emitted with -fsanitize=address at least -- although we highly recommend it <https://clang.llvm.org/docs/AddressSanitizer.html#usage>, as the frame pointer unwinder is much faster than DWARF, which is particularly important for stack collection on malloc/free.

On Mon, Jul 6, 2020 at 7:59 AM Teresa Johnson via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Hi Wenlei,
>
> Thanks for the comments! David answered the first question, but I do have some comments on the second one.
> Teresa
>
> On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com> wrote:
>
>> On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:
>>
>>> This sounds very useful. We've improved and used memoro <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and analysis, and we are also looking for ways to leverage memory profiles for PGO/FDO. I think having a common profiling infrastructure for analysis tooling as well as profile guided optimizations is a good design, and having it in LLVM is also helpful. Very interested in the tooling and optimization that comes after the profiler.
>>>
>>> Two questions:
>>>
>>> - How does the profiling overhead look? Is it similar to ASAN overhead, from what you've seen, which would be higher than PGO instrumentation? Asking because I'm wondering if any PGO training setup can be used directly for the new heap profiling.
>>
>> It is built on top of the ASAN runtime, but the overhead can be made much lower by using counter update consolidation -- all fields sharing the same shadow counter can be merged, and aggressive loop sinking/hoisting can be done.
>>
>> The goal is to integrate this with the PGO instrumentation. The PGO instrumentation overhead can be further reduced with a sampling technique (Rong Xu has a patch to be submitted).
>>
>>> - I'm not familiar with how the sanitizer handles stack traces, but for getting the most accurate calling context (using FP rather than DWARF), I guess frame pointer omission and tail call optimization etc. need to be turned off? Is that going to be implied by -fheapprof?
>>
>> Kostya can provide detailed answers to these questions.
>
> I'm not aware that -fsanitizer* options disable these, but I know in our environment we do disable frame pointer omission when setting up ASAN builds, and I am arranging for heap profiling builds to do the same. Not sure whether we want to do this within clang itself; I would be interested in Kostya's opinion. I can't see anywhere that we are disabling tail call optimizations for ASAN though, but I might have missed it.
>
> Thanks,
> Teresa
>
>> David
>
>>> Thanks,
>>>
>>> Wenlei
>>>
>>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa Johnson via llvm-dev <llvm-dev at lists.llvm.org>
>>> Reply-To: Teresa Johnson <tejohnson at google.com>
>>> Date: Wednesday, June 24, 2020 at 4:58 PM
>>> To: llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka <vitalybuka at google.com>
>>> Cc: David Li <davidxl at google.com>
>>> Subject: [llvm-dev] RFC: Sanitizer-based Heap Profiler
>>>
>>> Hi all,
>>>
>>> I've included an RFC for a heap profiler design I've been working on in conjunction with David Li. Please send any questions or feedback. For sanitizer folks, one area of feedback is on refactoring some of the *ASAN shadow setup code (see the Shadow Memory section).
>>>
>>> Thanks,
>>> Teresa
>>>
>>> RFC: Sanitizer-based Heap Profiler
>>>
>>> Summary
>>>
>>> This document provides an overview of an LLVM Sanitizer-based heap profiler design.
>>>
>>> Motivation
>>>
>>> The objective of heap memory profiling is to collect critical runtime information associated with heap memory references, along with information on heap memory allocations. The profile information will be used first for tooling, and subsequently to guide the compiler optimizer and allocation runtime to lay out heap objects with improved spatial locality. As a result, DTLB and cache utilization will improve, and program IPC (performance) will increase due to reduced TLB and cache misses. More details on the heap profile guided optimizations will be shared in the future.
>>>
>>> Overview
>>>
>>> The profiler is based on compiler-inserted instrumentation of load and store accesses, and utilizes runtime support to monitor heap allocations and profile data. The target consumer of the heap memory profile information is initially tooling, and ultimately automatic data layout optimizations performed by the compiler and/or allocation runtime (with the support of new allocation runtime APIs).
>>>
>>> Each memory address is mapped to Shadow Memory <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the approach used by the Address Sanitizer <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN). Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow location implements the profile counter (incremented on accesses to the corresponding memory). This granularity was chosen to help avoid counter overflow, but it may be possible to consider mapping 32 bytes to 4 bytes. To avoid aliasing of shadow memory for different allocations, we must choose a minimum alignment carefully. As discussed further below, we can attain a 32-byte minimum alignment, instead of a 64-byte alignment, by storing the necessary heap information for each allocation in a 32-byte header block.
>>>
>>> The compiler instruments each load and store to increment the associated shadow memory counter, in order to determine hotness.
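>>>
>>> To make the mapping concrete, here is a minimal sketch of the shadow computation and counter update; the constant values and names are illustrative assumptions, not the prototype's actual parameters:
>>>
>>> #include <cstdint>
>>>
>>> // Illustrative values only -- with the dynamic shadow (see the Shadow
>>> // Memory section below) the offset is chosen at runtime.
>>> static uintptr_t shadow_offset = 0;       // set during initialization
>>> constexpr unsigned kGranularityShift = 6; // 64-byte memory granules
>>> constexpr unsigned kShadowShift = 3;      // 8-byte shadow counters
>>>
>>> // Map an application address to its 8-byte shadow counter: mask off
>>> // the low 6 bits, scale 64:8, and add the shadow offset.
>>> inline uint64_t *MemToShadow(uintptr_t addr) {
>>>   uintptr_t s = ((addr >> kGranularityShift) << kShadowShift) + shadow_offset;
>>>   return reinterpret_cast<uint64_t *>(s);
>>> }
>>>
>>> // What the instrumentation conceptually inserts for each load/store:
>>> inline void RecordAccess(uintptr_t addr) { *MemToShadow(addr) += 1; }
>>>
>>> The wide 8-byte counter per 64-byte granule is what the overflow-avoidance rationale above refers to; the possible 32:4 mapping would halve the shadow but overflow sooner.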
>>>
>>> The heap profiler runtime is responsible for tracking allocations and deallocations, including the stack at each allocation, and information such as the allocation size and other statistics. I have implemented a prototype built using a stripped down and modified version of ASAN; however, this will be a separate library utilizing sanitizer_common components.
>>>
>>> Compiler
>>>
>>> A simple HeapProfiler instrumentation pass instruments interesting memory accesses (loads, stores, atomics) with a simple load, increment, and store of the associated shadow memory location (computed via a mask and shift to do the mapping of 64 bytes to 8 bytes of shadow, and an add of the shadow offset). The handling is very similar to, and based on, the ASAN instrumentation pass, with slightly different instrumentation code.
>>>
>>> Various techniques can be used to reduce the overhead, by aggressively coalescing counter updates (e.g. given the 32-byte alignment, accesses known to be in the same 32-byte block, or even across possible aliases, since we don't care about the dereferenced values).
>>>
>>> Additionally, the Clang driver needs to be set up to link with the runtime library, much as it does with the sanitizers.
>>>
>>> A -fheapprof option is added to enable the instrumentation pass and runtime library linking. Similar to -fprofile-generate, -fheapprof will accept an argument specifying the directory in which to write the profile.
>>>
>>> Runtime
>>>
>>> The heap profiler runtime is responsible for tracking and reporting information about heap allocations and accesses, aggregated by allocation calling context: for example, the hotness, lifetime, and cpu affinity.
>>>
>>> A new heapprof library will be created within compiler-rt. It will leverage support within sanitizer_common, which already contains facilities like stack context tracking, needed by the heap profiler.
>>>
>>> Shadow Memory
>>>
>>> There are some basic facilities in sanitizer_common for mmap'ing the shadow memory, but most of the existing setup lives in the ASAN and HWASAN libraries. In the case of ASAN, there is support for both statically assigned shadow offsets (the default on most platforms) and dynamically assigned shadow memory (implemented for Windows and currently also used for Android and iOS). According to kcc, recent experiments show that the performance with a dynamic shadow is close to that with a static mapping. In fact, that is the only approach currently used by HWASAN. Given the simplicity, the heap profiler will be implemented with a dynamic shadow as well.
>>>
>>> There are a number of functions in ASAN and HWASAN related to setup of the shadow that are duplicated but very nearly identical, at least for linux (which seems to be the only OS flavor currently supported for HWASAN), e.g. ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN there is another nearly identical copy in PremapShadow, used by Android, whereas in HWASAN the premap handling is already commoned with the non-premap handling). Rather than make yet another copy of these mechanisms, I propose refactoring them into sanitizer_common versions. Like HWASAN, the initial version of the heap profiler will be supported for linux only, but other OSes can be added as needed, similar to ASAN.
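>>>
>>> As a rough sketch of the dynamic-shadow reservation such a refactored sanitizer_common helper would perform (linux only, error handling and the protected gap elided; the function name and interface here are assumptions for illustration):
>>>
>>> #include <sys/mman.h>
>>> #include <cstddef>
>>> #include <cstdint>
>>>
>>> // Reserve a contiguous shadow region and let the kernel pick the base
>>> // (dynamic shadow). MAP_NORESERVE avoids committing the whole range up
>>> // front; pages are materialized lazily as counters are first touched.
>>> static uint8_t *ReserveDynamicShadow(size_t shadow_size) {
>>>   void *p = mmap(nullptr, shadow_size, PROT_READ | PROT_WRITE,
>>>                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
>>>   return p == MAP_FAILED ? nullptr : static_cast<uint8_t *>(p);
>>> }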
>>>
>>> StackTrace and StackDepot
>>>
>>> The sanitizer already contains support for obtaining and representing a stack trace in a StackTrace object, and for storing it in the StackDepot, which "efficiently stores huge amounts of stack traces". This is in the sanitizer_common subdirectory and the support is shared by ASAN and ThreadSanitizer. The StackDepot is essentially an unbounded hash table, where each StackTrace is assigned a unique id. ASAN stores this id in the alloc_context_id field in each ChunkHeader (in the redzone preceding each allocation). Additionally, there is support for symbolizing and printing StackTrace objects.
>>>
>>> ChunkHeader
>>>
>>> The heap profiler needs to track several pieces of information for each allocation. Given the mapping of 64 bytes of memory to 8 bytes of shadow, we can achieve a minimum of 32-byte alignment by holding this information in a 32-byte header block preceding each allocation.
>>>
>>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains information about the current allocation state, the user requested size, allocation and free thread ids, the allocation context id (representing the call stack at allocation, assigned by the StackDepot as described above), and misc other bookkeeping. For heap profiling, this will be converted to a 32-byte header block.
>>>
>>> Note that we could instead use the metadata section, similar to other sanitizers, which is stored in a separate location. However, as described above, storing the header block with each allocation enables 32-byte alignment without aliasing shadow counters for the same 64 bytes of memory.
>>>
>>> In the prototype heap profiler implementation, the header contains the following fields:
>>>
>>> // Should be 32 bytes
>>> struct ChunkHeader {
>>>   // 1st 4 bytes
>>>   // Carry over from ASAN (available, allocated, quarantined). Will be
>>>   // reduced to 1 bit (available or allocated).
>>>   u32 chunk_state : 8;
>>>   // Carry over from ASAN. Used to determine the start of the user allocation.
>>>   u32 from_memalign : 1;
>>>   // 23 bits available
>>>
>>>   // 2nd 4 bytes
>>>   // Carry over from ASAN (comment copied verbatim).
>>>   // This field is used for small sizes. For large sizes it is equal to
>>>   // SizeClassMap::kMaxSize and the actual size is stored in the
>>>   // SecondaryAllocator's metadata.
>>>   u32 user_requested_size : 29;
>>>
>>>   // 3rd 4 bytes
>>>   u32 cpu_id; // Allocation cpu id
>>>
>>>   // 4th 4 bytes
>>>   // Allocation timestamp in ms from a baseline timestamp computed at
>>>   // the start of profiling (to keep this within 32 bits).
>>>   u32 timestamp_ms;
>>>
>>>   // 5th and 6th 4 bytes
>>>   // Carry over from ASAN. Used to identify the allocation stack trace.
>>>   u64 alloc_context_id;
>>>
>>>   // 7th and 8th 4 bytes
>>>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>>>   u64 data_type_id; // hash of type name
>>> };
>>>
>>> As noted, the chunk state can be reduced to a single bit (no need for quarantined memory in the heap profiler). The header contains a placeholder for the data type hash, which is not yet implemented as it needs instrumentation and IR support.
>>>
>>> Heap Info Block (HIB)
>>>
>>> On a deallocation, information from the corresponding shadow block(s) and header is recorded in a Heap Info Block (HIB) object. The access count is computed from the shadow memory locations for the allocation, as well as the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero 8-byte shadow locations for the whole allocation). Other information, such as the deallocation timestamp (for lifetime computation) and deallocation cpu id (to determine migrations), is recorded along with the information in the chunk header recorded on allocation.
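>>>
>>> A minimal sketch of that per-deallocation shadow scan, reusing the illustrative MemToShadow mapping from the earlier sketch (the struct, field, and function names here are assumptions, not the prototype's):
>>>
>>> #include <cstddef>
>>> #include <cstdint>
>>>
>>> uint64_t *MemToShadow(uintptr_t addr); // from the earlier sketch
>>>
>>> struct ShadowStats {
>>>   uint64_t access_count;     // sum of the 8-byte shadow counters
>>>   uint32_t percent_utilized; // % of 64-byte blocks with non-zero counters
>>> };
>>>
>>> static ShadowStats ScanShadow(uintptr_t user_beg, size_t user_size) {
>>>   uint64_t total = 0, nonzero = 0, blocks = 0;
>>>   for (uintptr_t p = user_beg; p < user_beg + user_size; p += 64, ++blocks) {
>>>     uint64_t c = *MemToShadow(p); // one counter per 64-byte granule
>>>     total += c;
>>>     if (c) ++nonzero;
>>>   }
>>>   return {total, blocks ? uint32_t(nonzero * 100 / blocks) : 0};
>>> }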
>>>
>>> The prototyped HIB object tracks the following:
>>>
>>> struct HeapInfoBlock {
>>>   // Total allocations at this stack context
>>>   u32 alloc_count;
>>>   // Access count computed from all allocated 64-byte blocks (track total
>>>   // across all allocations, and the min and max).
>>>   u64 total_access_count, min_access_count, max_access_count;
>>>   // Allocated size (track total across all allocations, and the min and max).
>>>   u64 total_size;
>>>   u32 min_size, max_size;
>>>   // Lifetime (track total across all allocations, and the min and max).
>>>   u64 total_lifetime;
>>>   u32 min_lifetime, max_lifetime;
>>>   // Percent utilization of allocated 64-byte blocks (track total
>>>   // across all allocations, and the min and max). The utilization is
>>>   // defined as the percentage of 8-byte shadow counters corresponding to
>>>   // the full allocation that are non-zero.
>>>   u64 total_percent_utilized;
>>>   u32 min_percent_utilized, max_percent_utilized;
>>>   // Allocation and deallocation timestamps from the most recent merge into
>>>   // the table with this stack context.
>>>   u32 alloc_timestamp, dealloc_timestamp;
>>>   // Allocation and deallocation cpu ids from the most recent merge into
>>>   // the table with this stack context.
>>>   u32 alloc_cpu_id, dealloc_cpu_id;
>>>   // Count of allocations at this stack context that had a different
>>>   // allocation and deallocation cpu id.
>>>   u32 num_migrated_cpu;
>>>   // Number of times the entry being merged had a lifetime that overlapped
>>>   // with that of the previous entry merged with this stack context (by
>>>   // comparing the new alloc/dealloc timestamps with the ones last recorded
>>>   // in the entry in the table).
>>>   u32 num_lifetime_overlaps;
>>>   // Number of times the alloc/dealloc cpu of the entry being merged was the
>>>   // same as that of the previous entry merged with this stack context.
>>>   u32 num_same_alloc_cpu;
>>>   u32 num_same_dealloc_cpu;
>>>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
>>>   // and possibly IR changes.
>>>   u64 data_type_id;
>>> };
>>>
>>> HIB Table
>>>
>>> The Heap Info Block Table, which is a multi-way associative cache, holds HIB objects from deallocated objects. It is indexed by the stack allocation context id from the chunk header, and currently utilizes a simple mod with a prime number close to a power of two as the hash (because of the way the stack context ids are assigned, a mod of a power of two performs very poorly). Thus far, only 4-way associativity has been evaluated.
>>>
>>> HIB entries are added or merged into the HIB Table on each deallocation. If an entry with a matching stack alloc context id is found in the Table, the newly deallocated information is merged into the existing entry. Each HIB Table entry currently tracks the min, max and total value of the various fields, for use in computing and reporting the min, max and average when the Table is ultimately dumped. A sketch of the table lookup follows below.
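>>>
>>> This sketch shows the indexing scheme just described; the set count, the eviction-free lookup shape, and the idea of storing the context id alongside each entry are illustrative assumptions:
>>>
>>> #include <cstdint>
>>>
>>> constexpr uint32_t kAssociativity = 4; // 4-way, as evaluated so far
>>> constexpr uint32_t kNumSets = 16381;   // a prime close to 2^14 (16384)
>>>
>>> struct HIBTableEntry {
>>>   uint64_t alloc_context_id; // key: StackDepot id from the chunk header
>>>   // ... plus the HeapInfoBlock fields defined above ...
>>> };
>>>
>>> static HIBTableEntry table[kNumSets][kAssociativity];
>>>
>>> // Return the matching entry for this allocation context, or nullptr if
>>> // the caller should insert a new entry (possibly evicting one).
>>> HIBTableEntry *Lookup(uint64_t alloc_context_id) {
>>>   HIBTableEntry *set = table[alloc_context_id % kNumSets];
>>>   for (uint32_t way = 0; way < kAssociativity; ++way)
>>>     if (set[way].alloc_context_id == alloc_context_id)
>>>       return &set[way];
>>>   return nullptr;
>>> }
>>>
>>> The mod-by-prime hash spreads the StackDepot-assigned ids across sets; as noted above, a mod by a power of two performs very poorly with these ids.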
>>>
>>> If no entry with a matching stack alloc context id is found, a new entry is created. If this causes an eviction, the evicted entry is dumped immediately (by default to stderr, otherwise to a specified report file). Later post-processing can merge dumped entries with the same stack alloc context id.
>>>
>>> Initialization
>>>
>>> For ASAN, an __asan_init function initializes the memory allocation tracking support, and the ASAN instrumentation pass in LLVM creates a global constructor to invoke it. The heap profiler prototype adds a new __heapprof_init function, which performs heap profile specific initialization, and the heap profile instrumentation pass calls this new init function instead, via a generated global constructor. It currently additionally invokes __asan_init, since we are leveraging a modified ASAN runtime. Eventually, this should be changed to initialize refactored common support.
>>>
>>> Note that __asan_init is also placed in the .preinit_array when it is available, so it is invoked even earlier than global constructors. Currently, it is not possible to do this for __heapprof_init, as it calls timespec_get in order to get a baseline timestamp (as described in the ChunkHeader comments, the timestamps (ms) are actually offsets from the baseline timestamp, in order to fit into 32 bits), and system calls cannot be made that early (dl_init is not complete). Since the constructor priority is 1, it should be executed early enough that there are very few allocations before it runs, and likely the best solution is to simply ignore any allocations before initialization.
>>>
>>> Dumping
>>>
>>> For the prototype, the profile is dumped as text with a compact raw format to limit its size. Ultimately it should be dumped in a more compact binary format (i.e. into a different section of the raw instrumentation based profile, with llvm-profdata performing post-processing), which is TBD.
>>>
>>> HIB Dumping
>>>
>>> As noted earlier, HIB Table entries are created as memory is deallocated. At the end of the run (or whenever dumping is requested, discussed later), HIB entries need to be created for allocations that are still live. Conveniently, the sanitizer allocator already contains a mechanism to walk through all chunks of memory it is tracking (ForEachChunk). The heap profiler simply looks for all chunks with a chunk state of allocated, and creates a HIB the same as would be done on deallocation, adding each to the table. A sketch of this finalization walk is shown below, after the dumping triggers.
>>>
>>> A HIB Table mechanism for printing each entry is then invoked.
>>>
>>> By default, the dumping occurs:
>>>
>>> - on evictions
>>> - full table at exit (when the static Allocator object is destructed)
>>>
>>> For running in a load testing scenario, we will want to add a mechanism to provoke finalization (merging currently live allocations) and dumping of the HIB Table before exit. This would be similar to the __llvm_profile_dump facility used for normal PGO counter dumping.
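>>>
>>> A minimal sketch of the finalization walk (the callback shape mirrors the sanitizer allocator's ForEachChunk; the helper names are assumptions for this sketch):
>>>
>>> #include <cstdint>
>>>
>>> using uptr = uintptr_t;  // sanitizer_common's pointer-sized integer
>>> struct ChunkHeader;      // the 32-byte header defined above
>>>
>>> // Provided elsewhere (names are illustrative, not the prototype's):
>>> ChunkHeader *GetChunkHeader(uptr chunk);      // header precedes user memory
>>> bool IsAllocated(const ChunkHeader *h);       // chunk_state == allocated
>>> void MergeIntoHIBTable(const ChunkHeader *h); // same path as deallocation
>>> void ForEachChunk(void (*cb)(uptr chunk, void *arg), void *arg);
>>>
>>> static void MergeLiveChunkCb(uptr chunk, void *) {
>>>   ChunkHeader *h = GetChunkHeader(chunk);
>>>   if (IsAllocated(h)) // only still-live allocations get a HIB here
>>>     MergeIntoHIBTable(h);
>>> }
>>>
>>> void FinalizeLiveAllocations() {
>>>   // Walk every chunk the allocator is tracking, as described above.
>>>   ForEachChunk(MergeLiveChunkCb, nullptr);
>>> }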
>>>
>>> Stack Trace Dumping
>>>
>>> There is existing support for dumping symbolized StackTrace objects. A wrapper to dump all StackTrace objects in the StackDepot will be added. This new interface is invoked just after the HIB Table is dumped (on exit or via the dumping interface).
>>>
>>> Memory Map Dumping
>>>
>>> In cases where we may want to symbolize as a post-processing step, we may need the memory map (from /proc/self/smaps). Specifically, this is needed to symbolize binaries using ASLR (Address Space Layout Randomization). There is already support for reading this file and dumping it to the specified report output file (DumpProcessMap()). This is invoked when the profile output file is initialized (HIB Table construction), so that the memory map is available at the top of the raw profile.
>>>
>>> Current Status and Next Steps
>>>
>>> As mentioned earlier, I have a working prototype based on a simplified, stripped down version of ASAN. My current plan is to do the following:
>>>
>>> 1. Refactor out some of the shadow setup code common between ASAN and HWASAN into sanitizer_common.
>>> 2. Rework my prototype into a separate heapprof library in compiler-rt, using sanitizer_common support where possible, and send patches for review.
>>> 3. Send patches for the heap profiler instrumentation pass and related clang options.
>>> 4. Design/implement the binary profile format.
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson at google.com |
>
> --
> Teresa Johnson | Software Engineer | tejohnson at google.com |