On Mon, Jul 6, 2020 at 11:49 AM Mitch Phillips via llvm-dev < llvm-dev at lists.llvm.org> wrote: > > I'm not aware that -fsanitizer* options disable these, but I know in our > environment we do disable frame pointer omission when setting up ASAN > builds, and I am arranging for heap profiling builds to do the same. Not > sure whether we want to do this within clang itself, would be interested in > Kostya's opinion. I can't see anywhere that we are disabling tail call > optimizations for ASAN though, but I might have missed it. > > We don't force frame pointers to be emitted with -fsanitize=address at > least -- although we highly recommend it > <https://clang.llvm.org/docs/AddressSanitizer.html#usage> as the frame > pointer unwinder is much faster than DWARF, particularly important for > stack collection on malloc/free. > Correct. Frame pointers are a must for fast collection of stack traces, but we decouple this requirement from ASAN. It is perfectly valid to run ASAN w/o frame pointers, but then you either have to use an exceptionally slow unwinder or you get garbled error messages. > > On Mon, Jul 6, 2020 at 7:59 AM Teresa Johnson via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> Hi Wenlei, >> >> Thanks for the comments! David answered the first question, I do have >> some comments on the second one though. >> Teresa >> >> On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com> >> wrote: >> >>> >>> >>> On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote: >>> >>>> This sounds very useful. We've improved and used memoro >>>> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and >>>> analysis, and we are also looking for ways to leverage memory profile for >>>> PGO/FDO. I think having a common profiling infrastructure for analysis >>>> tooling as well as profile guided optimizations is good design, and having >>>> it in LLVM is also helpful. 
Very interested in the tooling and optimization >>>> that comes after the profiler. >>>> >>>> >>>> >>>> Two questions: >>>> >>>> - How does the profiling overhead look? Is that similar to ASAN >>>> overhead from what you've seen, which would be higher than PGO >>>> instrumentation? Asking because I'm wondering if any PGO training setup can >>>> be used directly for the new heap profiling. >>>> >>>> >>> It is built on top of the ASAN runtime, but the overhead can be made much >>> lower by using counter update consolidation -- all fields sharing the same >>> shadow counter can be merged, and aggressive loop sinking/hoisting can be >>> done. >>> >>> The goal is to integrate this with the PGO instrumentation. The PGO >>> instrumentation overhead can be further reduced with a sampling technique >>> (Rong Xu has a patch to be submitted). >>> >>> >>>> - >>>> - I'm not familiar with how the sanitizer handles stack traces, but for >>>> getting the most accurate calling context (use FP rather than DWARF), I guess >>>> frame pointer omission and tail call opt etc. need to be turned off? Is >>>> that going to be implied by -fheapprof? >>>> >>>> >>> Kostya can provide detailed answers to these questions. >>> >> >> I'm not aware that -fsanitizer* options disable these, but I know in our >> environment we do disable frame pointer omission when setting up ASAN >> builds, and I am arranging for heap profiling builds to do the same. Not >> sure whether we want to do this within clang itself, would be interested in >> Kostya's opinion. I can't see anywhere that we are disabling tail call >> optimizations for ASAN though, but I might have missed it. 
>> >> Thanks, >> Teresa >> >> >>> >>> David >>> >>>> >>>> - >>>> >>>> >>>> >>>> Thanks, >>>> >>>> Wenlei >>>> >>>> >>>> >>>> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa >>>> Johnson via llvm-dev <llvm-dev at lists.llvm.org> >>>> *Reply-To: *Teresa Johnson <tejohnson at google.com> >>>> *Date: *Wednesday, June 24, 2020 at 4:58 PM >>>> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany < >>>> kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka < >>>> vitalybuka at google.com> >>>> *Cc: *David Li <davidxl at google.com> >>>> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler >>>> >>>> >>>> >>>> Hi all, >>>> >>>> >>>> >>>> I've included an RFC for a heap profiler design I've been working on in >>>> conjunction with David Li. Please send any questions or feedback. For >>>> sanitizer folks, one area of feedback is on refactoring some of the *ASAN >>>> shadow setup code (see the Shadow Memory section). >>>> >>>> >>>> >>>> Thanks, >>>> >>>> Teresa >>>> >>>> >>>> >>>> RFC: Sanitizer-based Heap Profiler >>>> Summary >>>> >>>> This document provides an overview of an LLVM Sanitizer-based heap >>>> profiler design. >>>> Motivation >>>> >>>> The objective of heap memory profiling is to collect critical runtime >>>> information associated with heap memory references and information on heap >>>> memory allocations. The profile information will be used first for tooling, >>>> and subsequently to guide the compiler optimizer and allocation runtime to >>>> layout heap objects with improved spatial locality. As a result, DTLB and >>>> cache utilization will be improved, and program IPC (performance) will be >>>> increased due to reduced TLB and cache misses. More details on the heap >>>> profile guided optimizations will be shared in the future. 
>>>> Overview >>>> >>>> The profiler is based on compiler inserted instrumentation of load and >>>> store accesses, and utilizes runtime support to monitor heap allocations >>>> and profile data. The target consumer of the heap memory profile >>>> information is initially tooling and ultimately automatic data layout >>>> optimizations performed by the compiler and/or allocation runtime (with the >>>> support of new allocation runtime APIs). >>>> >>>> >>>> >>>> Each memory address is mapped to Shadow Memory >>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Shadow-5Fmemory&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=KfYo542rDdZQGClmgz-RBw&m=f45oT3WLypO1yblv9KNkPd-rl8jlBp761Hhvev27S8M&s=iIirMZSYnDlGIjY8PZjJprWckHx7QhmKUQKcb1URBFY&e=>, >>>> similar to the approach used by the Address Sanitizer >>>> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN). >>>> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the >>>> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow >>>> location implements the profile counter (incremented on accesses to the >>>> corresponding memory). This granularity was chosen to help avoid counter >>>> overflow, but it may be possible to consider mapping 32-bytes to 4 bytes. >>>> To avoid aliasing of shadow memory for different allocations, we must >>>> choose a minimum alignment carefully. As discussed further below, we can >>>> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by >>>> storing necessary heap information for each allocation in a 32-byte header >>>> block. >>>> >>>> >>>> >>>> The compiler instruments each load and store to increment the >>>> associated shadow memory counter, in order to determine hotness. >>>> >>>> >>>> >>>> The heap profiler runtime is responsible for tracking allocations and >>>> deallocations, including the stack at each allocation, and information such >>>> as the allocation size and other statistics. 
I have implemented a prototype >>>> built using a stripped down and modified version of ASAN; however, this will >>>> be a separate library utilizing sanitizer_common components. >>>> Compiler >>>> >>>> A simple HeapProfiler instrumentation pass instruments interesting >>>> memory accesses (loads, stores, atomics), with a simple load, increment, >>>> store of the associated shadow memory location (computed via a mask and >>>> shift to do the mapping of 64 bytes to 8 byte shadow, and add of the shadow >>>> offset). The handling is very similar to and based off of the ASAN >>>> instrumentation pass, with slightly different instrumentation code. >>>> >>>> >>>> >>>> Various techniques can be used to reduce the overhead, by aggressively >>>> coalescing counter updates (e.g. given the 32-byte alignment, accesses >>>> known to be in the same 32-byte block, or across possible aliases since we >>>> don't care about the dereferenced values). >>>> >>>> >>>> >>>> Additionally, the Clang driver needs to be set up to link with the runtime >>>> library, much as it does with the sanitizers. >>>> >>>> >>>> >>>> A -fheapprof option is added to enable the instrumentation pass and >>>> runtime library linking. Similar to -fprofile-generate, -fheapprof >>>> will accept an argument specifying the directory in which to write the >>>> profile. >>>> Runtime >>>> >>>> The heap profiler runtime is responsible for tracking and reporting >>>> information about heap allocations and accesses, aggregated by allocation >>>> calling context. For example, the hotness, lifetime, and cpu affinity. >>>> >>>> >>>> >>>> A new heapprof library will be created within compiler-rt. It will >>>> leverage support within sanitizer_common, which already contains facilities >>>> like stack context tracking, needed by the heap profiler. 
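The mask-and-shift shadow mapping described above (64 bytes of memory to one 8-byte counter) can be sketched roughly as follows. This is an illustrative sketch, not the actual runtime code: the shadow base constant and the helper names (MemToShadow, RecordAccess) are assumptions, and the real runtime assigns the shadow region dynamically.

```cpp
#include <cassert>
#include <cstdint>

// Assumed shadow base for illustration only; the runtime picks this
// dynamically at startup.
constexpr uint64_t kShadowOffset = 0x100000000000ULL;
constexpr unsigned kGranuleShift = 6; // 64-byte memory granule
constexpr unsigned kCounterShift = 3; // 8-byte shadow counter

// Drop the low 6 bits to find the 64-byte granule, scale it to an
// 8-byte counter slot, and add the shadow base.
inline uint64_t MemToShadow(uint64_t addr) {
  return ((addr >> kGranuleShift) << kCounterShift) + kShadowOffset;
}

// The instrumentation pass emits the equivalent of this
// load/increment/store at each interesting memory access.
inline void RecordAccess(uint64_t addr) {
  uint64_t *counter = reinterpret_cast<uint64_t *>(MemToShadow(addr));
  *counter += 1;
}
```

Note that every access within the same 64-byte granule shares one counter, which is why the minimum allocation alignment discussed above matters for avoiding counter aliasing between neighboring allocations.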
>>>> Shadow Memory >>>> >>>> There are some basic facilities in sanitizer_common for mmap'ing the >>>> shadow memory, but most of the existing setup lives in the ASAN and HWASAN >>>> libraries. In the case of ASAN, there is support for both statically >>>> assigned shadow offsets (the default on most platforms), and for >>>> dynamically assigned shadow memory (implemented for Windows and currently >>>> also used for Android and iOS). According to kcc, recent experiments show >>>> that the performance with a dynamic shadow is close to that with a static >>>> mapping. In fact, that is the only approach currently used by HWASAN. Given >>>> the simplicity, the heap profiler will be implemented with a dynamic shadow >>>> as well. >>>> >>>> >>>> >>>> There are a number of functions in ASAN and HWASAN related to setup of >>>> the shadow that are duplicated but very nearly identical, at least for >>>> linux (which seems to be the only OS flavor currently supported for >>>> HWASAN). E.g. ReserveShadowMemoryRange, ProtectGap, and >>>> FindDynamicShadowStart (in ASAN there is another nearly identical copy in >>>> PremapShadow, used by Android, whereas in HW ASAN the premap handling is >>>> already commoned with the non-premap handling). Rather than make yet >>>> another copy of these mechanisms, I propose refactoring them into >>>> sanitizer_common versions. Like HWASAN, the initial version of the heap >>>> profiler will be supported for linux only, but other OSes can be added as >>>> needed similar to ASAN. >>>> StackTrace and StackDepot >>>> >>>> The sanitizer already contains support for obtaining and representing a >>>> stack trace in a StackTrace object, and storing it in the StackDepot which >>>> "efficiently stores huge amounts of stack traces". This is in the >>>> sanitizer_common subdirectory and the support is shared by ASAN and >>>> ThreadSanitizer. The StackDepot is essentially an unbounded hash table, >>>> where each StackTrace is assigned a unique id. 
ASAN stores this id in the >>>> alloc_context_id field in each ChunkHeader (in the redzone preceding each >>>> allocation). Additionally, there is support for symbolizing and printing >>>> StackTrace objects. >>>> ChunkHeader >>>> >>>> The heap profiler needs to track several pieces of information for each >>>> allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve >>>> a minimum of 32-byte alignment by holding this information in a 32-byte >>>> header block preceding each allocation. >>>> >>>> >>>> >>>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It >>>> contains information about the current allocation state, user requested >>>> size, allocation and free thread ids, the allocation context id >>>> (representing the call stack at allocation, assigned by the StackDepot as >>>> described above), and misc other bookkeeping. For heap profiling, this will >>>> be converted to a 32-byte header block. >>>> >>>> >>>> >>>> Note that we could instead use the metadata section, similar to other >>>> sanitizers, which is stored in a separate location. However, as described >>>> above, storing the header block with each allocation enables 32-byte >>>> alignment without aliasing shadow counters for the same 64 bytes of memory. >>>> >>>> >>>> >>>> In the prototype heap profiler implementation, the header contains the >>>> following fields: >>>> >>>> >>>> >>>> // Should be 32 bytes >>>> >>>> struct ChunkHeader { >>>> >>>> // 1-st 4 bytes >>>> >>>> // Carry over from ASAN (available, allocated, quarantined). Will be >>>> >>>> // reduced to 1 bit (available or allocated). >>>> >>>> u32 chunk_state : 8; >>>> >>>> // Carry over from ASAN. Used to determine the start of user >>>> allocation. >>>> >>>> u32 from_memalign : 1; >>>> >>>> // 23 bits available >>>> >>>> >>>> >>>> // 2-nd 4 bytes >>>> >>>> // Carry over from ASAN (comment copied verbatim). >>>> >>>> // This field is used for small sizes. 
For large sizes it is equal to >>>> >>>> // SizeClassMap::kMaxSize and the actual size is stored in the >>>> >>>> // SecondaryAllocator's metadata. >>>> >>>> u32 user_requested_size : 29; >>>> >>>> >>>> >>>> // 3-rd 4 bytes >>>> >>>> u32 cpu_id; // Allocation cpu id >>>> >>>> >>>> >>>> // 4-th 4 bytes >>>> >>>> // Allocation timestamp in ms from a baseline timestamp computed at >>>> >>>> // the start of profiling (to keep this within 32 bits). >>>> >>>> u32 timestamp_ms; >>>> >>>> >>>> >>>> // 5-th and 6-th 4 bytes >>>> >>>> // Carry over from ASAN. Used to identify allocation stack trace. >>>> >>>> u64 alloc_context_id; >>>> >>>> >>>> >>>> // 7-th and 8-th 4 bytes >>>> >>>> // UNIMPLEMENTED in prototype - needs instrumentation and IR support. >>>> >>>> u64 data_type_id; // hash of type name >>>> >>>> }; >>>> >>>> As noted, the chunk state can be reduced to a single bit (no need for >>>> quarantined memory in the heap profiler). The header contains a placeholder >>>> for the data type hash, which is not yet implemented as it needs >>>> instrumentation and IR support. >>>> Heap Info Block (HIB) >>>> >>>> On a deallocation, information from the corresponding shadow block(s) >>>> and header are recorded in a Heap Info Block (HIB) object. The access count >>>> is computed from the shadow memory locations for the allocation, as well as >>>> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero >>>> 8-byte shadow locations for the whole allocation). Other information such >>>> as the deallocation timestamp (for lifetime computation) and deallocation >>>> cpu id (to determine migrations) are recorded along with the information in >>>> the chunk header recorded on allocation. 
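The deallocation-time statistics just described might be computed along these lines. This is a sketch under assumptions, not the runtime's actual code; the function name and signature are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Given the 8-byte shadow counters covering one allocation, sum them to
// get the total access count, and report the percentage of 64-byte
// blocks whose counter is non-zero as the utilization.
std::pair<uint64_t, uint32_t>
ComputeAccessStats(const uint64_t *shadow_counters, uint64_t num_counters) {
  uint64_t total_access_count = 0, touched = 0;
  for (uint64_t i = 0; i < num_counters; ++i) {
    total_access_count += shadow_counters[i];
    if (shadow_counters[i] != 0)
      ++touched;
  }
  uint32_t percent_utilized =
      num_counters ? static_cast<uint32_t>(touched * 100 / num_counters) : 0;
  return {total_access_count, percent_utilized};
}
```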
>>>> >>>> >>>> >>>> The prototyped HIB object tracks the following: >>>> >>>> >>>> >>>> struct HeapInfoBlock { >>>> >>>> // Total allocations at this stack context >>>> >>>> u32 alloc_count; >>>> >>>> // Access count computed from all allocated 64-byte blocks (track >>>> total >>>> >>>> // across all allocations, and the min and max). >>>> >>>> u64 total_access_count, min_access_count, max_access_count; >>>> >>>> // Allocated size (track total across all allocations, and the min >>>> and max). >>>> >>>> u64 total_size; >>>> >>>> u32 min_size, max_size; >>>> >>>> // Lifetime (track total across all allocations, and the min and max). >>>> >>>> u64 total_lifetime; >>>> >>>> u32 min_lifetime, max_lifetime; >>>> >>>> // Percent utilization of allocated 64-byte blocks (track total >>>> >>>> // across all allocations, and the min and max). The utilization is >>>> >>>> // defined as the percentage of 8-byte shadow counters corresponding >>>> to >>>> >>>> // the full allocation that are non-zero. >>>> >>>> u64 total_percent_utilized; >>>> >>>> u32 min_percent_utilized, max_percent_utilized; >>>> >>>> // Allocation and deallocation timestamps from the most recent merge >>>> into >>>> >>>> // the table with this stack context. >>>> >>>> u32 alloc_timestamp, dealloc_timestamp; >>>> >>>> // Allocation and deallocation cpu ids from the most recent merge into >>>> >>>> // the table with this stack context. >>>> >>>> u32 alloc_cpu_id, dealloc_cpu_id; >>>> >>>> // Count of allocations at this stack context that had a different >>>> >>>> // allocation and deallocation cpu id. >>>> >>>> u32 num_migrated_cpu; >>>> >>>> // Number of times the lifetime of the entry being merged had its >>>> lifetime >>>> >>>> // overlap with the previous entry merged with this stack context (by >>>> >>>> // comparing the new alloc/dealloc timestamp with the one last >>>> recorded in >>>> >>>> // the entry in the table. 
>>>> >>>> u32 num_lifetime_overlaps; >>>> >>>> // Number of times the alloc/dealloc cpu of the entry being merged >>>> was the >>>> >>>> // same as that of the previous entry merged with this stack context >>>> >>>> u32 num_same_alloc_cpu; >>>> >>>> u32 num_same_dealloc_cpu; >>>> >>>> // Hash of type name (UNIMPLEMENTED). This needs instrumentation >>>> support and >>>> >>>> // possibly IR changes. >>>> >>>> u64 data_type_id; >>>> >>>> } >>>> HIB Table >>>> >>>> The Heap Info Block Table, which is a multi-way associative cache, >>>> holds HIB objects from deallocated objects. It is indexed by the stack >>>> allocation context id from the chunk header, and currently utilizes a >>>> simple mod with a prime number close to a power of two as the hash (because >>>> of the way the stack context ids are assigned, a mod of a power of two >>>> performs very poorly). Thus far, only 4-way associativity has been >>>> evaluated. >>>> >>>> >>>> >>>> HIB entries are added or merged into the HIB Table on each >>>> deallocation. If an entry with a matching stack alloc context id is found >>>> in the Table, the newly deallocated information is merged into the existing >>>> entry. Each HIB Table entry currently tracks the min, max and total value >>>> of the various fields for use in computing and reporting the min, max and >>>> average when the Table is ultimately dumped. >>>> >>>> >>>> >>>> If no entry with a matching stack alloc context id is found, a new >>>> entry is created. If this causes an eviction, the evicted entry is dumped >>>> immediately (by default to stderr, otherwise to a specified report file). >>>> Later post processing can merge dumped entries with the same stack alloc >>>> context id. >>>> Initialization >>>> >>>> >>>> >>>> For ASAN, an __asan_init function initializes the memory allocation >>>> tracking support, and the ASAN instrumentation pass in LLVM creates a >>>> global constructor to invoke it. 
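Returning briefly to the HIB Table indexing described above, a minimal sketch might look like the following. The table sizes here are assumptions chosen for illustration; as noted, only 4-way associativity has been evaluated so far.

```cpp
#include <cassert>
#include <cstdint>

// Set index: the stack allocation context id modulo a prime close to a
// power of two. Because of how stack context ids are assigned, a plain
// power-of-two mod performs very poorly.
constexpr uint32_t kNumSets = 65521; // largest prime below 2^16 (assumed size)
constexpr uint32_t kWays = 4;        // assumed associativity

inline uint32_t SetIndex(uint64_t alloc_context_id) {
  return static_cast<uint32_t>(alloc_context_id % kNumSets);
}

// A lookup probes the kWays entries of the chosen set for a matching
// alloc_context_id, merging on a hit; on a miss a new entry is inserted,
// and any evicted entry is dumped immediately.
```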
The heap profiler prototype adds a new >>>> __heapprof_init function, which performs heap profile specific >>>> initialization, and the heap profile instrumentation pass calls this new >>>> init function instead, via a generated global constructor. It currently >>>> additionally invokes __asan_init since we are leveraging a modified ASAN >>>> runtime. Eventually, this should be changed to initialize refactored common >>>> support. >>>> >>>> >>>> >>>> Note that __asan_init is also placed in the .preinit_array when it is >>>> available, so it is invoked even earlier than global constructors. >>>> Currently, it is not possible to do this for __heapprof_init, as it calls >>>> timespec_get in order to get a baseline timestamp (as described in the >>>> ChunkHeader comments, the timestamps (ms) are actually offsets from the >>>> baseline timestamp, in order to fit into 32 bits), and system calls cannot >>>> be made that early (dl_init is not complete). Since the constructor >>>> priority is 1, it should be executed early enough that there are very few >>>> allocations before it runs, and likely the best solution is to simply >>>> ignore any allocations before initialization. >>>> Dumping >>>> >>>> For the prototype, the profile is dumped as text with a compact raw >>>> format to limit its size. Ultimately it should be dumped in a more compact >>>> binary format (i.e. into a different section of the raw instrumentation >>>> based profile, with llvm-profdata performing post-processing) which is TBD. >>>> HIB Dumping >>>> >>>> As noted earlier, HIB Table entries are created as memory is >>>> deallocated. At the end of the run (or whenever dumping is requested, >>>> discussed later), HIB entries need to be created for allocations that are >>>> still live. Conveniently, the sanitizer allocator already contains a >>>> mechanism to walk through all chunks of memory it is tracking ( >>>> ForEachChunk). 
The heap profiler simply looks for all chunks with a >>>> chunk state of allocated, and creates a HIB the same as would be done on >>>> deallocation, adding each to the table. >>>> >>>> >>>> >>>> A HIB Table mechanism for printing each entry is then invoked. >>>> >>>> >>>> >>>> By default, the dumping occurs: >>>> >>>> - on evictions >>>> - full table at exit (when the static Allocator object is >>>> destructed) >>>> >>>> >>>> >>>> For running in a load testing scenario, we will want to add a mechanism >>>> to provoke finalization (merging currently live allocations) and dumping of >>>> the HIB Table before exit. This would be similar to the __llvm_profile_dump >>>> facility used for normal PGO counter dumping. >>>> Stack Trace Dumping >>>> >>>> There is existing support for dumping symbolized StackTrace objects. A >>>> wrapper to dump all StackTrace objects in the StackDepot will be added. >>>> This new interface is invoked just after the HIB Table is dumped (on exit >>>> or via dumping interface). >>>> Memory Map Dumping >>>> >>>> In cases where we may want to symbolize as a post processing step, we >>>> may need the memory map (from /proc/self/smaps). Specifically, this is >>>> needed to symbolize binaries using ASLR (Address Space Layout >>>> Randomization). There is already support for reading this file and dumping >>>> it to the specified report output file (DumpProcessMap()). This is invoked >>>> when the profile output file is initialized (HIB Table construction), so >>>> that the memory map is available at the top of the raw profile. >>>> Current Status and Next Steps >>>> >>>> >>>> >>>> As mentioned earlier, I have a working prototype based on a simplified >>>> stripped down version of ASAN. My current plan is to do the following: >>>> >>>> 1. Refactor out some of the shadow setup code common between ASAN >>>> and HWASAN into sanitizer_common. >>>> 2. 
Rework my prototype into a separate heapprof library in >>>> compiler-rt, using sanitizer_common support where possible, and send >>>> patches for review. >>>> 3. Send patches for the heap profiler instrumentation pass and >>>> related clang options. >>>> 4. Design/implement binary profile format >>>> >>>> >>>> >>>> -- >>>> >>>> Teresa Johnson | >>>> >>>> Software Engineer | >>>> >>>> tejohnson at google.com | >>>> >>>> >>>> >>> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson at google.com | >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >