Most Linux distributions still compile against the original x86-64
baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
EM64T compatibility).

There has been an attempt to use the existing AT_PLATFORM-based loading
mechanism in the glibc dynamic linker to enable a selection of optimized
libraries.  But the general selection mechanism in glibc is problematic:

  hwcaps subdirectory selection in the dynamic loader
  <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>

We also have the problem that the glibc version of "haswell" is distinct
from GCC's -march=haswell (and presumably other compilers):

  Definition of "haswell" platform is inconsistent with GCC
  <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>

And that the selection criteria are not what people expect:

  Epyc and other current AMD CPUs do not select the "haswell" platform
  subdirectory
  <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>

Since the hwcaps-based selection does not work well regardless of
architecture (even in cases the kernel provides glibc with data), I
worked on a new mechanism that does not have the problems associated
with the old mechanism:

  [PATCH 00/30] RFC: elf: glibc-hwcaps support
  <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>

(Don't be concerned that these patches have not been reviewed; we are
busy preparing the glibc 2.32 release, and these changes do not alter
the glibc ABI itself, so they do not have immediate priority.  I'm
fairly confident that a version of these changes will make it into glibc
2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat
Enterprise Linux 8.4.  Debian as well, but I have never done anything
like it there, so I don't know if the patches will be accepted.)

Out of the box, this should work fairly well for IBM POWER and Z, where
there is a clear progression of silicon versions (at least on paper;
virtualization may blur the picture somewhat).

However, for x86, we do not have such a clear progression of
micro-architecture versions.  This is not just as a result of the
AMD/Intel competition, but also due to ongoing product differentiation
within one chip vendor.  I think we need these levels broadly for the
following reasons:

* Selecting on individual CPU features (similar to the old hwcaps
  mechanism) in glibc has scalability issues, particularly for
  LD_LIBRARY_PATH processing.

* Developers need guidance about useful targets for optimization.  I
  think there is value in limiting the choices, in the sense that "if
  you are able to test three builds in total, these are the things you
  should build".

* glibc and the compilers should align in their definition of the
  levels, so that developers can use an -march= option to build for a
  particular level that is recognized by glibc.  This is why I think the
  description of the levels should go into the psABI supplement.

* A preference order for these levels avoids falling back to the K8
  baseline if the platform progresses to a new version due to
  glibc/kernel/hypervisor/hardware upgrades.

I'm including a proposal for the levels below.  I use single letters for
them, but I expect that the concrete implementation of this proposal
will use names like "x86-100", "x86-101", like in the glibc patch
referenced above.  (But we can discuss other approaches.)

I looked at various machines in the Red Hat labs and talked to Intel and
AMD engineers about this, but this concrete proposal is based on my own
analysis of the situation.  I excluded CPU features related to
cryptography and cache management, including hardware transactional
memory, and CPU timing.  I assume that we will see some of these
features being disabled by the firmware or the kernel over time.  That
would eliminate entire levels from selection, which is not desirable.
For cryptographic code, I expect that localized selection of an
optimized implementation works because such code tends to be isolated
blocks, running for dozens of cycles each time, not something that gets
scattered all over the place by the compiler.

We previously discussed not emitting VZEROUPPER at later levels, but I
don't think this is beneficial because the ABI does not have
callee-saved vector registers, so it can only be useful with local
functions (or whatever LTO considers local), where there is no ABI
impact anyway.

I did not include FSGSBASE because the FS base is already available at
%fs:0.  Changing the FS base in userspace breaks too much, so the main
benefit is the tighter encoding of rdfsbase, which seems very slim.

Not covered in this are tuning decisions.  I think we can benefit from
some variance in this area between implementations; it should not affect
correctness.

32-bit support is also a separate matter.

* Level A

  CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3

  This is one step above the K8 baseline and corresponds to a mainline
  CPU model ca. 2008 to 2011.  It is also implemented by recent-ish
  generations of Intel Atom server CPUs (although I haven't tested the
  latest version).  A 32-bit variant would have to list many additional
  CPU features here.

* Level B

  AVX, plus everything in level A.

  This step is so small that it probably can be dropped, unless the
  benefits from using VEX encoding are truly significant.

  For AVX and some of the following features, it is assumed that the
  run-time selection takes full support coverage (from silicon to the
  kernel) into account.

* Level C

  AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.

  This is close to what glibc currently calls "haswell".

* Level D

  AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in
  level C.

  This is the AVX-512 level implemented by Xeon Scalable Processors, not
  the Xeon Phi variant.
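
To make the cumulative structure of the levels concrete, here is a
minimal sketch in C of how a set of usable CPU features could be mapped
to the highest level it satisfies.  The feature bit names, the LEVEL_*
masks, and the function highest_level are made up for this example; they
are not glibc or psABI interfaces.  The sketch also assumes that the
caller has already worked out which features are usable end to end (from
silicon to the kernel), which is the hard part and is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits, invented for this sketch.  A real loader
   would derive them from CPUID plus what the kernel has enabled. */
enum {
  F_CMPXCHG16B = 1 << 0,  F_LAHF_SAHF = 1 << 1,  F_POPCNT   = 1 << 2,
  F_SSE3       = 1 << 3,  F_SSE4_1    = 1 << 4,  F_SSE4_2   = 1 << 5,
  F_SSSE3      = 1 << 6,  F_AVX       = 1 << 7,  F_AVX2     = 1 << 8,
  F_BMI1       = 1 << 9,  F_BMI2      = 1 << 10, F_F16C     = 1 << 11,
  F_FMA        = 1 << 12, F_LZCNT     = 1 << 13, F_MOVBE    = 1 << 14,
  F_AVX512F    = 1 << 15, F_AVX512BW  = 1 << 16, F_AVX512CD = 1 << 17,
  F_AVX512DQ   = 1 << 18, F_AVX512VL  = 1 << 19
};

/* Each level is the previous level plus the newly listed features. */
enum {
  LEVEL_A = F_CMPXCHG16B | F_LAHF_SAHF | F_POPCNT | F_SSE3
            | F_SSE4_1 | F_SSE4_2 | F_SSSE3,
  LEVEL_B = LEVEL_A | F_AVX,
  LEVEL_C = LEVEL_B | F_AVX2 | F_BMI1 | F_BMI2 | F_F16C | F_FMA
            | F_LZCNT | F_MOVBE,
  LEVEL_D = LEVEL_C | F_AVX512F | F_AVX512BW | F_AVX512CD
            | F_AVX512DQ | F_AVX512VL
};

/* Compile-time check that the masks really are cumulative. */
static_assert((LEVEL_B & LEVEL_A) == LEVEL_A, "B includes A");
static_assert((LEVEL_C & LEVEL_B) == LEVEL_B, "C includes B");
static_assert((LEVEL_D & LEVEL_C) == LEVEL_C, "D includes C");

/* True if every bit in mask is set in features. */
static int has_all(uint32_t features, uint32_t mask)
{
  return (features & mask) == mask;
}

/* Return 'D', 'C', 'B' or 'A' for the highest level whose features are
   all present, or '0' if only the K8 baseline can be assumed. */
char highest_level(uint32_t features)
{
  if (has_all(features, LEVEL_D)) return 'D';
  if (has_all(features, LEVEL_C)) return 'C';
  if (has_all(features, LEVEL_B)) return 'B';
  if (has_all(features, LEVEL_A)) return 'A';
  return '0';
}
```

Note that in this sketch a single missing feature from level A (for
example, POPCNT) drops the result to the baseline even if AVX2 is
present, because each level requires everything below it.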

glibc (or an alternative loader implementation) would search for
libraries starting at level D, going back to level A, and finally the
baseline implementation in the default library location.

I expect that some distributions will also use these levels to set a
baseline for the entire distribution (i.e., everything would be built to
level A or maybe even level C), and these libraries would then be
installed in the default location.

I'll be glad if I can get any feedback on this proposal.  I plan to turn
it into a merge request for the x86-64 psABI document eventually.

Thanks,
Florian