AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA), proposed by Intel in July 2013 and first implemented in the 2016 Intel Xeon Phi x200 (Knights Landing),[1] and later in a number of AMD and other Intel CPUs (see list below). AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F (AVX-512 Foundation) is required by all AVX-512 implementations.
Besides widening most 256-bit instructions, the extensions introduce various new operations, such as new data conversions, scatter operations, and permutations. The number of AVX registers is increased from 16 to 32, and eight new "mask registers" are added, which allow for variable selection and blending of the results of instructions. In CPUs with the vector length (VL) extension—included in most AVX-512-capable processors (see § CPUs with AVX-512)—these instructions may also be used on the 128-bit and 256-bit vector sizes.
AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.[1]
The successor to AVX-512 is AVX10, announced in July 2023.[3] AVX10 simplifies detection of supported instructions by introducing a version number for the instruction set, where each subsequent version includes all instructions from the previous one. In the initial revisions of the AVX10 specification, support for 512-bit vectors was made optional, which would have allowed Intel to implement AVX10 on their E-cores. In later revisions, Intel made 512-bit vectors mandatory, with the intention of supporting 512-bit vectors in both P- and E-cores. The initial version 1 of AVX10 does not add new instructions compared to AVX-512, and for processors supporting 512-bit vectors it is equivalent to AVX-512 (in the set supported by Intel Sapphire Rapids processors). Later AVX10 versions will introduce new features.
The AVX-512 instruction set consists of several separate sets, each with its own CPUID feature bit. However, they are typically grouped by the processor generation that implements them.
F, CD, ER, PF: introduced with Xeon Phi x200 (Knights Landing) and Xeon Scalable (Skylake SP "Purley"), with the last two (ER and PF) being specific to Knights Landing & Knights Mill.
4VNNIW, 4FMAPS: introduced with and specific to Knights Mill.[4][5]
VL, DQ, BW: introduced with Skylake-X/SP and Cannon Lake.
IFMA, VBMI: introduced with Cannon Lake.[7]
VNNI: introduced with Cascade Lake.
VPOPCNTDQ: Vector population count instruction. Introduced with Knights Mill and Ice Lake.[8]
VBMI2, BITALG: introduced with Ice Lake.[8]
VP2INTERSECT: introduced with Tiger Lake.
GFNI, VPCLMULQDQ, VAES: introduced with Ice Lake.[8]
The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This has led them to define a new prefix called EVEX.
Compared to VEX, EVEX adds the following benefits:[5]
access to the 32 vector registers and the eight opmask registers;
encoding of 512-bit operand lengths;
broadcasting of a scalar memory operand across a vector;
embedded rounding control and suppress-all-exceptions behavior for floating-point instructions;
compressed displacement encoding (disp8*N) for more compact memory operands.
The extended registers, SIMD width bit, and opmask registers of AVX-512 are mandatory and all require support from the OS.
The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. However, the AVX-512VL extension allows the use of AVX-512 instructions on the 128/256-bit registers XMM/YMM, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix, which gives access to new features such as the opmask and the additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, making the distinction between VEX- and EVEX-encoded versions of an instruction ambiguous in the source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension (byte & word support).[5]
Extended registers

[Register diagram: the x64 AVX-512 scheme extends the x64 AVX (YMM0–YMM15) and x64 SSE (XMM0–XMM15) registers to thirty-two 512-bit registers ZMM0–ZMM31, with YMM0–YMM31 aliasing bits 255–0 and XMM0–XMM31 aliasing bits 127–0 of each ZMM register.]

The width of the SIMD register file is increased from 256 bits to 512 bits, and it is expanded from 16 to a total of 32 registers, ZMM0–ZMM31. These registers can be addressed as 256-bit YMM registers from the AVX extensions and as 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31 when using the EVEX-encoded form.
AVX-512 vector instructions may specify an opmask register to control which result values are written to the destination. The instruction encoding supports the values 0–7 for this field, but only opmask registers k1–k7 (of k0–k7) can be used as the mask, corresponding to the values 1–7; the value 0 is reserved for indicating that no opmask register is used, i.e. a hardcoded constant (instead of k0) indicates an unmasked operation. The special opmask register k0 is still a functioning, valid register: it can be used in opmask-manipulation instructions or as the destination opmask register.[9] A flag controls the opmask behavior, which is either "zeroing", which zeroes everything not selected by the mask, or "merging", which leaves everything not selected untouched. The merging behavior is identical to that of the blend instructions.
The opmask registers are normally 16 bits wide, but can be up to 64 bits wide with the AVX-512BW extension.[5] How many of the bits are actually used depends on the element type of the masked instruction: for 32-bit single-precision floats or doublewords, 16 bits mask the 16 elements of a 512-bit register; for double-precision floats and quadwords, at most 8 mask bits are used.
The opmask register is the reason why several bitwise instructions that naturally have no element width had one added in AVX-512. For instance, bitwise AND, OR, and 128-bit shuffle now exist in both doubleword and quadword variants, the only difference being the final masking.
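Merging and zeroing can be illustrated with a scalar sketch in plain C (no intrinsics; the helper name, lane count, and MERGE/ZERO flag are illustrative, not part of any API):

```c
#include <stdint.h>

enum { MERGE, ZERO };

/* Scalar model of a masked 8-lane 64-bit add under an opmask. */
static void masked_add_epi64(int64_t dst[8], const int64_t a[8],
                             const int64_t b[8], uint8_t mask, int mode) {
    for (int i = 0; i < 8; i++) {
        if (mask & (1u << i))
            dst[i] = a[i] + b[i];   /* selected lane: write the result */
        else if (mode == ZERO)
            dst[i] = 0;             /* zeroing: clear unselected lanes */
        /* merging: leave unselected lanes untouched */
    }
}
```

With mask 0x05 (lanes 0 and 2 selected), merging preserves the old destination values in the other lanes, while zeroing clears them.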
New opmask instructions

The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit (word) versions. With AVX-512DQ, 8-bit (byte) versions were added to better match the needs of masking 8 64-bit values, and with AVX-512BW, 32-bit (doubleword) and 64-bit (quadword) versions were added so they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.
New instructions in AVX-512 foundation

Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, and old instructions that have been replaced with new AVX-512 versions. The new or heavily reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW, since those extensions merely add new versions of these instructions rather than new instructions.
There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.
Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.
AVX-512F has four new compare instructions. Like their XOP counterparts, they use the immediate field to select between eight different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for these instructions: one as the destination and one as a regular opmask.[5]
The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND, and then set the destination opmask based on whether the result values are zero or non-zero. Note that, like the comparison instructions, they take two opmask registers: one as the destination and one as a regular opmask.
VPTESTMD, VPTESTMQ (F): logical AND and set mask for 32- or 64-bit integers.
VPTESTNMD, VPTESTNMQ (F): logical NAND and set mask for 32- or 64-bit integers.
VPTESTMB, VPTESTMW (BW): logical AND and set mask for 8- or 16-bit integers.
VPTESTNMB, VPTESTNMW (BW): logical NAND and set mask for 8- or 16-bit integers.

Compress and expand
The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress only saves the values marked in the mask, but saves them compacted by skipping and not reserving space for unmarked values. Expand operates in the opposite way, by loading as many values as indicated in the mask and then spreading them to the selected positions.
VCOMPRESSPD, VCOMPRESSPS: store sparse packed double/single-precision floating-point values into dense memory.
VPCOMPRESSD, VPCOMPRESSQ: store sparse packed doubleword/quadword integer values into dense memory/register.
VEXPANDPD, VEXPANDPS: load sparse packed double/single-precision floating-point values from dense memory.
VPEXPANDD, VPEXPANDQ: load sparse packed doubleword/quadword integer values from dense memory/register.
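The semantics can be sketched in scalar C (illustrative helper names, not intrinsics; the real instructions also apply merge/zero masking to unselected destination lanes, which this sketch leaves untouched):

```c
#include <stdint.h>

/* Scalar model of VPCOMPRESSD: pack the lanes selected by the opmask
   into consecutive positions; returns how many lanes were stored. */
static int compress_epi32(int32_t dst[], const int32_t src[], int n, uint16_t mask) {
    int j = 0;
    for (int i = 0; i < n; i++)
        if (mask & (1u << i))
            dst[j++] = src[i];      /* no space reserved for unselected lanes */
    return j;
}

/* Scalar model of VPEXPANDD: read consecutive values and spread them
   to the lanes selected by the opmask. */
static void expand_epi32(int32_t dst[], const int32_t src[], int n, uint16_t mask) {
    int j = 0;
    for (int i = 0; i < n; i++)
        if (mask & (1u << i))
            dst[i] = src[j++];
}
```

Compressing {1,2,3,4} with mask 0b1010 yields the dense pair {2,4}; expanding {9,8} with mask 0b0101 places 9 in lane 0 and 8 in lane 2.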
A new set of permute instructions has been added for full two-input permutations. They all take three arguments: two source registers and one index register; the result is written over either the first source register or the index register. AVX-512BW extends the instructions to include 16-bit (word) versions, and the AVX-512_VBMI extension defines the byte versions of the instructions.
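The two-input permute can be sketched in scalar C (an 8-lane doubleword model in the spirit of VPERMT2D; the helper name is illustrative, and a 512-bit register would have 16 lanes and 5-bit indices):

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of an 8-lane two-source permute: dst is also the first
   source; each index selects from the 16-entry table formed by
   concatenating the first source and b. */
static void vpermt2d8(uint32_t dst[8], const uint32_t idx[8], const uint32_t b[8]) {
    uint32_t a[8];
    memcpy(a, dst, sizeof a);       /* first source is overwritten below */
    for (int i = 0; i < 8; i++) {
        unsigned k = idx[i] & 15;   /* low 4 bits pick one of 16 entries */
        dst[i] = (k < 8) ? a[k] : b[k - 8];
    }
}
```

Indices 0–7 select from the first source and 8–15 from the second, so a single instruction can interleave or gather across both inputs.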
VPERMB (VBMI): permute packed byte elements.
VPERMW (BW): permute packed word elements.
VPERMT2B (VBMI): full byte permute overwriting first source.
VPERMT2W (BW): full word permute overwriting first source.
VPERMI2PD, VPERMI2PS (F): full double/single floating-point permute overwriting the index.
VPERMI2D, VPERMI2Q (F): full doubleword/quadword permute overwriting the index.
VPERMI2B (VBMI): full byte permute overwriting the index.
VPERMI2W (BW): full word permute overwriting the index.
VPERMT2PS, VPERMT2PD (F): full single/double floating-point permute overwriting first source.
VPERMT2D, VPERMT2Q (F): full doubleword/quadword permute overwriting first source.
VSHUFF32x4, VSHUFF64x2, VSHUFI32x4, VSHUFI64x2 (F): shuffle four packed 128-bit lanes.
VPMULTISHIFTQB (VBMI): select packed unaligned bytes from quadword sources.

Bitwise ternary logic
Two new instructions can logically implement all possible bitwise operations between three inputs. They take three registers and an 8-bit immediate field as input. Each bit in the output is generated by using the three corresponding input bits as a 3-bit index to select one of the eight bits of the immediate. Since only eight combinations are possible with three bits, this allows all possible 3-input bitwise operations to be performed.[5] These are the only bitwise vector instructions in AVX-512F; EVEX versions of the two-source SSE and AVX bitwise vector instructions AND, ANDN, OR, and XOR were added in AVX-512DQ. The difference between the doubleword and quadword versions is only the application of the opmask.
VPTERNLOGD, VPTERNLOGQ: bitwise ternary logic.
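The lookup can be modeled in scalar C (illustrative helper name; the index bit order, with the first source as the most significant index bit, follows the Intel SDM description):

```c
#include <stdint.h>

/* Scalar model of VPTERNLOGQ: each result bit is looked up in the
   8-bit immediate using the three corresponding input bits as index. */
static uint64_t ternlog(uint64_t a, uint64_t b, uint64_t c, uint8_t imm) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2) |
                                  (((b >> i) & 1) << 1) |
                                  ((c >> i) & 1));
        r |= (uint64_t)((imm >> idx) & 1) << i;
    }
    return r;
}
```

Useful immediates include 0x96 (three-way XOR), 0x80 (three-way AND), and 0xCA (bitwise select: a ? b : c), the latter covering the cmov-style ternary mentioned above.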
A number of conversion or move instructions were added; these complete the set of conversion instructions available from SSE2.
VPMOVQD, VPMOVSQD, VPMOVUSQD, VPMOVQW, VPMOVSQW, VPMOVUSQW, VPMOVQB, VPMOVSQB, VPMOVUSQB, VPMOVDW, VPMOVSDW, VPMOVUSDW, VPMOVDB, VPMOVSDB, VPMOVUSDB (F): down-convert quadword or doubleword to doubleword, word, or byte; unsaturated, saturated, or saturated unsigned. The reverse of the sign/zero-extend instructions from SSE4.1.
VPMOVWB, VPMOVSWB, VPMOVUSWB (BW): down-convert word to byte; unsaturated, saturated, or saturated unsigned.
VCVTPS2UDQ, VCVTPD2UDQ, VCVTTPS2UDQ, VCVTTPD2UDQ (F): convert, with or without truncation, packed single- or double-precision floating point to packed unsigned doubleword integers.
VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI (F): convert, with or without truncation, scalar single- or double-precision floating point to unsigned doubleword integer.
VCVTPS2QQ, VCVTPD2QQ, VCVTPS2UQQ, VCVTPD2UQQ, VCVTTPS2QQ, VCVTTPD2QQ, VCVTTPS2UQQ, VCVTTPD2UQQ (DQ): convert, with or without truncation, packed single- or double-precision floating point to packed signed or unsigned quadword integers.
VCVTUDQ2PS, VCVTUDQ2PD (F): convert packed unsigned doubleword integers to packed single- or double-precision floating point.
VCVTUSI2PS, VCVTUSI2PD (F): convert scalar unsigned doubleword integers to single- or double-precision floating point.
VCVTUSI2SD, VCVTUSI2SS (F): convert scalar unsigned integers to single- or double-precision floating point.
VCVTUQQ2PS, VCVTUQQ2PD (DQ): convert packed unsigned quadword integers to packed single- or double-precision floating point.
VCVTQQ2PD, VCVTQQ2PS (F): convert packed quadword integers to packed single- or double-precision floating point.

Floating-point decomposition
Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.
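The core idea of GETEXP/GETMANT can be sketched with the C standard library (a scalar model for finite nonzero inputs; it ignores the sign/interval controls of the real instructions and their special-value behavior, e.g. GETEXP of zero returns negative infinity):

```c
#include <math.h>

/* Scalar sketch of VGETEXPSS: extract floor(log2 |x|) as a float. */
static float getexp_ss(float x) {
    int e;
    frexpf(fabsf(x), &e);       /* |x| = m * 2^e with m in [0.5, 1) */
    return (float)(e - 1);
}

/* Scalar sketch of VGETMANTSS: mantissa normalized into [1, 2). */
static float getmant_ss(float x) {
    int e;
    return 2.0f * frexpf(fabsf(x), &e);
}
```

For example, 8.0 decomposes into exponent 3.0 and mantissa 1.0, and 0.75 into exponent −1.0 and mantissa 1.5.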
VGETEXPPD, VGETEXPPS: convert exponents of packed floating-point values into floating-point values.
VGETEXPSD, VGETEXPSS: convert exponent of a scalar floating-point value into a floating-point value.
VGETMANTPD, VGETMANTPS: extract vector of normalized mantissas from float32/float64 vector.
VGETMANTSD, VGETMANTSS: extract float32/float64 of normalized mantissa from float32/float64 scalar.
VFIXUPIMMPD, VFIXUPIMMPS: fix up special packed float32/float64 values.
VFIXUPIMMSD, VFIXUPIMMSS: fix up special scalar float32/float64 value.

Floating-point arithmetic
This is the second set of new floating-point methods, which includes new scaling and approximate calculation of the reciprocal and of the reciprocal of the square root. The approximate reciprocal instructions guarantee a relative error of at most 2^−14.[5]
VRCP14PD, VRCP14PS: compute approximate reciprocals of packed float32/float64 values.
VRCP14SD, VRCP14SS: compute approximate reciprocal of a scalar float32/float64 value.
VRNDSCALEPS, VRNDSCALEPD: round packed float32/float64 values to include a given number of fraction bits.
VRNDSCALESS, VRNDSCALESD: round scalar float32/float64 value to include a given number of fraction bits.
VRSQRT14PD, VRSQRT14PS: compute approximate reciprocals of square roots of packed float32/float64 values.
VRSQRT14SD, VRSQRT14SS: compute approximate reciprocal of square root of a scalar float32/float64 value.
VSCALEFPS, VSCALEFPD: scale packed float32/float64 values with float32/float64 values.
VSCALEFSS, VSCALEFSD: scale scalar float32/float64 value with float32/float64 value.

Broadcast
VBROADCASTSS, VBROADCASTSD (F, VL): broadcast a single/double floating-point value.
VPBROADCASTB, VPBROADCASTW, VPBROADCASTD, VPBROADCASTQ (F, VL, DQ, BW): broadcast a byte/word/doubleword/quadword integer value.
VBROADCASTI32X2, VBROADCASTI64X2, VBROADCASTI32X4, VBROADCASTI32X8, VBROADCASTI64X4 (F, VL, DQ, BW): broadcast two or four doubleword/quadword integer values.

Miscellaneous
VALIGND, VALIGNQ (F, VL): align doubleword or quadword vectors.
VDBPSADBW (BW): double block packed sum of absolute differences (SAD) on unsigned bytes.
VPABSQ (F): packed absolute value of quadwords.
VPMAXSQ, VPMAXUQ (F): maximum of packed signed/unsigned quadwords.
VPMINSQ, VPMINUQ (F): minimum of packed signed/unsigned quadwords.
VPROLD, VPROLVD, VPROLQ, VPROLVQ, VPRORD, VPRORVD, VPRORQ, VPRORVQ (F): bit rotate left or right.
VPSCATTERDD, VPSCATTERDQ, VPSCATTERQD, VPSCATTERQQ (F): scatter packed doubleword/quadword with signed dword/qword indices.
VSCATTERDPS, VSCATTERDPD, VSCATTERQPS, VSCATTERQPD (F): scatter packed float32/float64 with signed dword/qword indices.
The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.[10]
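The conflict test at the heart of this extension can be modeled in scalar C (illustrative helper name; each lane of the result marks which earlier lanes hold the same value, so an all-zero result means the iteration has no duplicate indices and can be processed as one vector):

```c
#include <stdint.h>

/* Scalar model of VPCONFLICTD: for each element, build a bit vector
   of earlier elements that are equal to it. */
static void vpconflictd(uint32_t dst[], const uint32_t src[], int n) {
    for (int i = 0; i < n; i++) {
        uint32_t bits = 0;
        for (int j = 0; j < i; j++)
            if (src[j] == src[i])
                bits |= 1u << j;    /* conflict with earlier lane j */
        dst[i] = bits;
    }
}
```

For the index vector {1, 2, 1, 1}, lanes 2 and 3 conflict with lane 0 (and lane 3 also with lane 2), yielding the bit vectors {0, 0, 0b001, 0b101}.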
VPCONFLICTD, VPCONFLICTQ: detect conflicts within a vector of packed doubleword or quadword values; compares each element in the first source to all elements at the same or earlier positions in the second source and forms a bit vector of the results.
VPLZCNTD, VPLZCNTQ: count the number of leading zero bits for packed doubleword or quadword values; a vectorized LZCNT instruction.
VPBROADCASTMB2Q, VPBROADCASTMW2D: broadcast mask to vector register; either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector.

Exponential and reciprocal
AVX-512 exponential and reciprocal (AVX-512ER) instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation: the relative error is at most 2^−28. They also contain two new exponential functions with a relative error of at most 2^−23.[5]
VEXP2PD, VEXP2PS: compute approximate exponential 2^x of packed single- or double-precision floating-point values.
VRCP28PD, VRCP28PS: compute approximate reciprocals of packed single- or double-precision floating-point values.
VRCP28SD, VRCP28SS: compute approximate reciprocal of a scalar single- or double-precision floating-point value.
VRSQRT28PD, VRSQRT28PS: compute approximate reciprocals of square roots of packed single- or double-precision floating-point values.
VRSQRT28SD, VRSQRT28SS: compute approximate reciprocal of square root of a scalar single- or double-precision floating-point value.
AVX-512 prefetch (AVX-512PF) instructions contain new prefetch operations for the scatter and gather functionality introduced in AVX2 and AVX-512. A T0 prefetch means prefetching into the level 1 cache, and T1 means prefetching into the level 2 cache.
VGATHERPF0DPS, VGATHERPF0QPS, VGATHERPF0DPD, VGATHERPF0QPD: using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
VGATHERPF1DPS, VGATHERPF1QPS, VGATHERPF1DPD, VGATHERPF1QPD: using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
VSCATTERPF0DPS, VSCATTERPF0QPS, VSCATTERPF0DPD, VSCATTERPF0QPD: using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
VSCATTERPF1DPS, VSCATTERPF1QPS, VSCATTERPF1DPD, VSCATTERPF1QPD: using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T1 hint with intent to write.
These two sets of instructions (4FMAPS and 4VNNIW) perform multiple iterations of processing. They are generally found only in Xeon Phi products.
V4FMADDPS, V4FMADDSS (4FMAPS): packed/scalar single-precision floating-point fused multiply-add (4 iterations).
V4FNMADDPS, V4FNMADDSS (4FMAPS): packed/scalar single-precision floating-point fused multiply-add and negate (4 iterations).
VP4DPWSSD (4VNNIW): dot product of signed words with doubleword accumulation (4 iterations).
VP4DPWSSDS (4VNNIW): dot product of signed words with doubleword accumulation and saturation (4 iterations).
AVX-512DQ adds new doubleword and quadword instructions. AVX-512BW adds byte and word versions of the same instructions, and adds byte and word versions of doubleword/quadword instructions in AVX-512F. A few instructions that get only word forms with AVX-512BW acquire byte forms with the AVX-512_VBMI extension (VPERMB, VPERMI2B, VPERMT2B, VPMULTISHIFTQB).
Two new instructions were added to the mask instruction set: KADD and KTEST (B and W forms with AVX-512DQ, D and Q with AVX-512BW). The rest of the mask instructions, which had only word forms, got byte forms with AVX-512DQ and doubleword/quadword forms with AVX-512BW. KUNPCKBW was extended to KUNPCKWD and KUNPCKDQ by AVX-512BW.
Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F, including all the two-input bitwise instructions and the integer extract/insert instructions. Instructions that are completely new are covered below.
Floating-point instructions

Three new floating-point operations are introduced. Since they are entirely new to AVX-512, rather than EVEX promotions of existing instructions, they exist in both packed/SIMD and scalar versions.
The VFPCLASS instructions test whether the floating-point value belongs to one of eight special floating-point categories; which of the eight categories sets a bit in the output mask register is controlled by the immediate field. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which also controls whether the operation is performed on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source and subtract from it the integer part of the source value plus a number of bits of its fraction specified in the immediate field.
VFPCLASSPS, VFPCLASSPD (DQ): test types of packed single- and double-precision floating-point values.
VFPCLASSSS, VFPCLASSSD (DQ): test types of scalar single- and double-precision floating-point values.
VRANGEPS, VRANGEPD (DQ): range restriction calculation for packed floating-point values.
VRANGESS, VRANGESD (DQ): range restriction calculation for scalar floating-point values.
VREDUCEPS, VREDUCEPD (DQ): perform reduction transformation on packed floating-point values.
VREDUCESS, VREDUCESD (DQ): perform reduction transformation on scalar floating-point values.

Other instructions
VPMOVM2D, VPMOVM2Q (DQ): convert mask register to doubleword or quadword vector register.
VPMOVM2B, VPMOVM2W (BW): convert mask register to byte or word vector register.
VPMOVD2M, VPMOVQ2M (DQ): convert doubleword or quadword vector register to mask register.
VPMOVB2M, VPMOVW2M (BW): convert byte or word vector register to mask register.
VPMULLQ (DQ): multiply packed quadwords, store low result; a quadword version of VPMULLD.
AVX-512_VBMI2 extends VPCOMPRESS and VPEXPAND with byte and word variants; the concatenate-and-shift instructions are new.
VPCOMPRESSB, VPCOMPRESSW: store sparse packed byte/word integer values into dense memory/register.
VPEXPANDB, VPEXPANDW: load sparse packed byte/word integer values from dense memory/register.
VPSHLD: concatenate and shift packed data left logical.
VPSHLDV: concatenate and variable shift packed data left logical.
VPSHRD: concatenate and shift packed data right logical.
VPSHRDV: concatenate and variable shift packed data right logical.
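The concatenate-and-shift operation can be modeled on one doubleword lane in scalar C (illustrative helper name; operand order, with the destination forming the high half of the concatenation, follows the SDM description of VPSHLDD):

```c
#include <stdint.h>

/* Scalar model of one lane of VPSHLDD: concatenate the two 32-bit
   sources into 64 bits, shift left, and keep the high 32 bits. */
static uint32_t vpshldd(uint32_t a, uint32_t b, unsigned imm) {
    uint64_t concat = ((uint64_t)a << 32) | b;
    return (uint32_t)((concat << (imm & 31)) >> 32);
}
```

This gives a funnel shift: bits shifted out of the second source flow into the result, which is useful for unaligned bit-stream extraction.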
Vector Neural Network Instructions:[11] AVX512-VNNI adds EVEX-coded instructions described below. With AVX-512F, these instructions can operate on 512-bit vectors, and AVX-512VL further adds support for 128- and 256-bit vectors.
A later AVX-VNNI extension adds VEX encodings of these instructions, which can only operate on 128- or 256-bit vectors. AVX-VNNI is not part of the AVX-512 suite; it does not require AVX-512F and can be implemented independently.
VPDPBUSD: multiply and add unsigned and signed bytes.
VPDPBUSDS: multiply and add unsigned and signed bytes with saturation.
VPDPWSSD: multiply and add signed word integers.
VPDPWSSDS: multiply and add signed word integers with saturation.
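The byte dot product can be modeled on one doubleword lane in scalar C (illustrative helper name; the first source is treated as unsigned bytes and the second as signed bytes, per the VPDPBUSD description):

```c
#include <stdint.h>

/* Scalar model of one dword lane of VPDPBUSD: multiply four unsigned
   bytes of a by the corresponding signed bytes of b, and accumulate
   the sum of products into a signed 32-bit accumulator. */
static int32_t vpdpbusd_lane(int32_t acc, uint32_t a, uint32_t b) {
    for (int i = 0; i < 4; i++) {
        uint8_t au = (uint8_t)(a >> (8 * i));   /* unsigned byte of a */
        int8_t  bs = (int8_t)(b >> (8 * i));    /* signed byte of b */
        acc += (int32_t)au * bs;
    }
    return acc;
}
```

For example, bytes {1,2,3,4} against {1,1,1,−1} contribute 1+2+3−4 = 2 to the accumulator; a full VPDPBUSD does this for every doubleword lane at once.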
Integer fused multiply-add instructions. AVX512-IFMA adds EVEX-coded instructions described below.
A separate AVX-IFMA instruction set extension defines VEX encoding of these instructions. This extension is not part of the AVX-512 suite and can be implemented independently.
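The 52-bit multiply-add semantics can be sketched on one quadword lane in scalar C (illustrative helper names; uses the GCC/Clang __int128 extension to hold the 104-bit product):

```c
#include <stdint.h>

#define MASK52 ((1ULL << 52) - 1)

/* Scalar model of one lane of VPMADD52LUQ: add the low 52 bits of
   the 52x52-bit product to a 64-bit accumulator. */
static uint64_t vpmadd52luq(uint64_t acc, uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)(a & MASK52) * (b & MASK52);
    return acc + (uint64_t)(p & MASK52);
}

/* Scalar model of one lane of VPMADD52HUQ: add the high 52 bits. */
static uint64_t vpmadd52huq(uint64_t acc, uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)(a & MASK52) * (b & MASK52);
    return acc + (uint64_t)(p >> 52);
}
```

The low/high split lets multiprecision arithmetic (e.g. RSA-style bignum multiplication) keep partial products in 64-bit lanes with headroom for carries.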
VPMADD52LUQ (IFMA): packed multiply of unsigned 52-bit integers and add the low 52 bits of the products to 64-bit accumulators.
VPMADD52HUQ (IFMA): packed multiply of unsigned 52-bit integers and add the high 52 bits of the products to 64-bit accumulators.

VPOPCNTDQ and BITALG

VPOPCNTD, VPOPCNTQ (VPOPCNTDQ): return the number of bits set to 1 in doubleword/quadword.
VPOPCNTB, VPOPCNTW (BITALG): return the number of bits set to 1 in byte/word.
VPSHUFBITQMB (BITALG): shuffle bits from quadword elements using byte indexes into mask.

VP2INTERSECT

VP2INTERSECTD, VP2INTERSECTQ (VP2INTERSECT): compute intersection between doublewords/quadwords to a pair of mask registers.
The Galois field new instructions are useful for cryptography,[12] as they can be used to implement Rijndael-style S-boxes such as those used in AES, Camellia, and SM4.[13] These instructions may also be used for bit manipulation in networking and signal processing.[12]
GFNI is a standalone instruction set extension and can be enabled separately from AVX or AVX-512. Depending on whether AVX and AVX-512F support is indicated by the CPU, GFNI support enables legacy (SSE), VEX or EVEX-coded instructions operating on 128, 256 or 512-bit vectors.
VGF2P8AFFINEINVQB: Galois field affine transformation inverse.
VGF2P8AFFINEQB: Galois field affine transformation.
VGF2P8MULB: Galois field multiply bytes.
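The byte multiply can be modeled in scalar C (illustrative helper name; VGF2P8MULB multiplies in GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1, i.e. 0x11B, the same field AES uses):

```c
#include <stdint.h>

/* Scalar model of one byte lane of VGF2P8MULB. */
static uint8_t gf2p8mulb(uint8_t a, uint8_t b) {
    uint16_t p = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            p ^= (uint16_t)(a << i);            /* carry-less multiply */
    for (int i = 15; i >= 8; i--)
        if (p & (1u << i))
            p ^= (uint16_t)(0x11B << (i - 8));  /* reduce mod 0x11B */
    return (uint8_t)p;
}
```

A familiar sanity check from the AES literature: {53} and {CA} are multiplicative inverses in this field, and {02}·{80} reduces to {1B}.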
VPCLMULQDQ with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. With AVX-512VL, it adds EVEX-encoded 256- and 128-bit versions. VPCLMULQDQ alone (that is, on non-AVX-512 CPUs) adds only the VEX-encoded 256-bit version. (Availability of the VEX-encoded 128-bit version is indicated by different CPUID bits: PCLMULQDQ and AVX.) The wider-than-128-bit variations of the instruction perform the same operation on each 128-bit portion of the input registers, but do not extend it to select quadwords from different 128-bit fields (the meaning of the imm8 operand is the same: either the low or the high quadword of the 128-bit field is selected).
VPCLMULQDQ: carry-less multiplication of quadwords.
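The core 64×64 → 128-bit carry-less multiply can be modeled in scalar C (illustrative helper name; partial products are combined with XOR instead of addition, which is what "carry-less" means):

```c
#include <stdint.h>

/* Scalar model of the carry-less multiply at the heart of
   (V)PCLMULQDQ; out[0] is the low 64 bits, out[1] the high. */
static void clmul64(uint64_t a, uint64_t b, uint64_t out[2]) {
    uint64_t lo = 0, hi = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) {
            lo ^= a << i;               /* XOR in the shifted partial product */
            if (i)
                hi ^= a >> (64 - i);    /* bits shifted past bit 63 */
        }
    out[0] = lo;
    out[1] = hi;
}
```

For example, 0b101 carry-less-multiplied by 0b011 gives 0b1111, since the partial products 101 and 1010 are XORed rather than added.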
VEX- and EVEX-encoded AES instructions. The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of input registers. The VEX versions can be used without AVX-512 support.
VAESDEC: perform one round of an AES decryption flow.
VAESDECLAST: perform the last round of an AES decryption flow.
VAESENC: perform one round of an AES encryption flow.
VAESENCLAST: perform the last round of an AES encryption flow.
AI acceleration instructions operating on bfloat16 numbers.

VCVTNE2PS2BF16: convert two vectors of packed single-precision numbers into one vector of packed bfloat16 numbers.
VCVTNEPS2BF16: convert one vector of packed single-precision numbers into one vector of packed bfloat16 numbers.
VDPBF16PS: calculate the dot product of two bfloat16 pairs and accumulate the result into one packed single-precision number.
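The "NE" in the conversion mnemonics stands for round-to-nearest-even, which can be modeled on one element in scalar C (illustrative helper name; the real instruction's NaN handling is omitted in this sketch):

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one element of VCVTNEPS2BF16: keep the top 16 bits
   of an FP32, rounding to nearest with ties to even. */
static uint16_t f32_to_bf16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    x += 0x7FFFu + ((x >> 16) & 1);   /* round to nearest, ties to even */
    return (uint16_t)(x >> 16);
}
```

Because bfloat16 is just the top half of an FP32 bit pattern, the conversion is a rounding truncation: 1.0f (0x3F800000) becomes 0x3F80.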
An extension of the earlier F16C instruction set, adding comprehensive support for the binary16 floating-point numbers (also known as FP16, float16 or half-precision floating-point numbers). The new instructions implement most operations that were previously available for single and double-precision floating-point numbers and also introduce new complex number instructions and conversion instructions. Scalar and packed operations are supported.
Unlike the single- and double-precision format instructions, the half-precision operands are neither conditionally flushed to zero (FTZ) nor conditionally treated as zero (DAZ) based on MXCSR settings. Subnormal values are processed at full speed by the hardware, to facilitate using the full dynamic range of FP16 numbers. Instructions that create FP32 and FP64 numbers still respect the MXCSR.FTZ bit.[14]
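The point about subnormals carrying real values can be seen in a bit-level FP16 → FP32 conversion sketch in scalar C (illustrative helper name, modeling one element of the scalar VCVTSH2SS conversion; all FP16 values are exactly representable in FP32, so no rounding is needed):

```c
#include <stdint.h>
#include <string.h>

/* Bit-level FP16 -> FP32 conversion, including subnormal inputs. */
static float fp16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    unsigned exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;
    if (exp == 0x1F) {                          /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {                      /* normal number */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {                      /* signed zero */
        bits = sign;
    } else {                                    /* subnormal: renormalize */
        unsigned e = 0;
        while (!(man & 0x400)) { man <<= 1; e++; }
        bits = sign | ((113u - e) << 23) | ((man & 0x3FF) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The smallest FP16 subnormal, bit pattern 0x0001, is the genuine value 2^−24; flushing it to zero would discard a quarter of the format's dynamic range.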
VADDPH, VADDSH: add packed/scalar FP16 numbers.
VSUBPH, VSUBSH: subtract packed/scalar FP16 numbers.
VMULPH, VMULSH: multiply packed/scalar FP16 numbers.
VDIVPH, VDIVSH: divide packed/scalar FP16 numbers.
VSQRTPH, VSQRTSH: compute square root of packed/scalar FP16 numbers.
VFMADD{132, 213, 231}PH, VFMADD{132, 213, 231}SH: multiply-add packed/scalar FP16 numbers.
VFNMADD{132, 213, 231}PH, VFNMADD{132, 213, 231}SH: negated multiply-add packed/scalar FP16 numbers.
VFMSUB{132, 213, 231}PH, VFMSUB{132, 213, 231}SH: multiply-subtract packed/scalar FP16 numbers.
VFNMSUB{132, 213, 231}PH, VFNMSUB{132, 213, 231}SH: negated multiply-subtract packed/scalar FP16 numbers.
VFMADDSUB{132, 213, 231}PH: multiply-add (odd vector elements) or multiply-subtract (even vector elements) packed FP16 numbers.
VFMSUBADD{132, 213, 231}PH: multiply-subtract (odd vector elements) or multiply-add (even vector elements) packed FP16 numbers.
VREDUCEPH, VREDUCESH: perform reduction transformation of packed/scalar FP16 numbers.
VRNDSCALEPH, VRNDSCALESH: round packed/scalar FP16 numbers to a given number of fraction bits.
VSCALEFPH, VSCALEFSH: scale packed/scalar FP16 numbers by multiplying them by a power of two.

Complex arithmetic instructions

VFMULCPH, VFMULCSH: multiply packed/scalar complex FP16 numbers.
VFCMULCPH, VFCMULCSH: multiply packed/scalar complex FP16 numbers; complex conjugate form of the operation.
VFMADDCPH, VFMADDCSH: multiply-add packed/scalar complex FP16 numbers.
VFCMADDCPH, VFCMADDCSH: multiply-add packed/scalar complex FP16 numbers; complex conjugate form of the operation.

Approximate reciprocal instructions

VRCPPH, VRCPSH: compute approximate reciprocal of packed/scalar FP16 numbers; the maximum relative error of the approximation is less than 2^−11 + 2^−14.
VRSQRTPH, VRSQRTSH: compute approximate reciprocal square root of packed/scalar FP16 numbers; the maximum relative error of the approximation is less than 2^−14.

Comparison instructions

VCMPPH, VCMPSH: compare packed/scalar FP16 numbers and store the result in a mask register.
VCOMISH: compare scalar FP16 numbers and store the result in the flags register; signals an exception if a source operand is QNaN or SNaN.
VUCOMISH: compare scalar FP16 numbers and store the result in the flags register; signals an exception only if a source operand is SNaN.
VMAXPH, VMAXSH: select the maximum of each vertical pair of the source packed/scalar FP16 numbers.
VMINPH, VMINSH: select the minimum of each vertical pair of the source packed/scalar FP16 numbers.
VFPCLASSPH, VFPCLASSSH: test packed/scalar FP16 numbers for special categories (NaN, infinity, negative zero, etc.) and store the result in a mask register.

Conversion instructions

VCVTW2PH: convert packed signed 16-bit integers to FP16 numbers.
VCVTUW2PH: convert packed unsigned 16-bit integers to FP16 numbers.
VCVTDQ2PH: convert packed signed 32-bit integers to FP16 numbers.
VCVTUDQ2PH: convert packed unsigned 32-bit integers to FP16 numbers.
VCVTQQ2PH: convert packed signed 64-bit integers to FP16 numbers.
VCVTUQQ2PH: convert packed unsigned 64-bit integers to FP16 numbers.
VCVTPS2PHX: convert packed FP32 numbers to FP16 numbers. Unlike VCVTPS2PH from F16C, VCVTPS2PHX has a different encoding that also supports broadcasting.
VCVTPD2PH: convert packed FP64 numbers to FP16 numbers.
VCVTSI2SH: convert a scalar signed 32-bit or 64-bit integer to an FP16 number.
VCVTUSI2SH: convert a scalar unsigned 32-bit or 64-bit integer to an FP16 number.
VCVTSS2SH: convert a scalar FP32 number to an FP16 number.
VCVTSD2SH: convert a scalar FP64 number to an FP16 number.
VCVTPH2W, VCVTTPH2W: convert packed FP16 numbers to signed 16-bit integers; VCVTPH2W rounds according to the MXCSR register, VCVTTPH2W rounds toward zero.
VCVTPH2UW, VCVTTPH2UW: convert packed FP16 numbers to unsigned 16-bit integers; VCVTPH2UW rounds according to the MXCSR register, VCVTTPH2UW rounds toward zero.
VCVTPH2DQ, VCVTTPH2DQ: convert packed FP16 numbers to signed 32-bit integers; VCVTPH2DQ rounds according to the MXCSR register, VCVTTPH2DQ rounds toward zero.
VCVTPH2UDQ, VCVTTPH2UDQ: convert packed FP16 numbers to unsigned 32-bit integers; VCVTPH2UDQ rounds according to the MXCSR register, VCVTTPH2UDQ rounds toward zero.
VCVTPH2QQ, VCVTTPH2QQ: convert packed FP16 numbers to signed 64-bit integers; VCVTPH2QQ rounds according to the MXCSR register, VCVTTPH2QQ rounds toward zero.
VCVTPH2UQQ, VCVTTPH2UQQ: convert packed FP16 numbers to unsigned 64-bit integers; VCVTPH2UQQ rounds according to the MXCSR register, VCVTTPH2UQQ rounds toward zero.
VCVTPH2PSX: convert packed FP16 numbers to FP32 numbers. Unlike VCVTPH2PS from F16C, VCVTPH2PSX has a different encoding that also supports broadcasting.
VCVTPH2PD: convert packed FP16 numbers to FP64 numbers.
VCVTSH2SI, VCVTTSH2SI: convert a scalar FP16 number to a signed 32-bit or 64-bit integer; VCVTSH2SI rounds according to the MXCSR register, VCVTTSH2SI rounds toward zero.
VCVTSH2USI, VCVTTSH2USI: convert a scalar FP16 number to an unsigned 32-bit or 64-bit integer; VCVTSH2USI rounds the value according to the MXCSR
register. VCVTTSH2USI
rounds toward zero. VCVTSH2SS
Convert a scalar FP16 number to FP32 number. VCVTSH2SD
Convert a scalar FP16 number to FP64 number. Decomposition instructions[edit] Instruction Description VGETEXPPH
, VGETEXPSH
Extract exponent components of packed/scalar FP16 numbers as FP16 numbers. VGETMANTPH
, VGETMANTSH
Extract mantissa components of packed/scalar FP16 numbers as FP16 numbers. Instruction Description VMOVSH
Move scalar FP16 number to/from memory or between vector registers. VMOVW
Move scalar FP16 number to/from memory or general purpose register. Legacy instructions with EVEX-encoded versions[edit]
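The paired convert/truncate-convert forms in the conversion table above differ only in rounding: VCVTPH2W-style instructions honor the rounding mode in MXCSR (round to nearest, ties to even, by default), while the VCVTTPH2W-style forms always truncate toward zero. A small pure-Python sketch of the two behaviors (illustrative only — Python's built-in `round()` happens to use the same ties-to-even rule as the default MXCSR mode, and `math.trunc()` models truncation; no FP16 hardware is involved):

```python
import math

values = [2.5, -2.5, 2.7]

# VCVTTPH2W-style conversion: always truncate toward zero
truncated = [math.trunc(v) for v in values]

# VCVTPH2W-style conversion under the default MXCSR mode:
# round to nearest, ties to even (so 2.5 -> 2, not 3)
nearest = [round(v) for v in values]

print(truncated)  # [2, -2, 2]
print(nearest)    # [2, -2, 3]
```

The difference on exact halfway values (2.5 rounding to 2 rather than 3) is the usual source of surprises when porting code between the two conversion families.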
^Note 1: Intel does not officially support the AVX-512 family of instructions on Alder Lake microprocessors. In early 2022, Intel began disabling AVX-512 in silicon (fusing it off) in Alder Lake microprocessors to prevent customers from enabling it.[35] On older Alder Lake family CPUs with certain legacy combinations of BIOS and microcode revisions, it was possible to execute AVX-512 instructions by disabling all the efficiency cores, which do not contain the silicon for AVX-512.[36][37][24]
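Because AVX-512 availability varies by model — and, as the note above shows, can even be fused off — software is expected to check the CPUID feature flags before issuing these instructions; on Linux the same flags are exposed in /proc/cpuinfo. A minimal sketch (the helper name and the sample string are illustrative, not part of any standard API):

```python
def has_avx512f(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line lists avx512f,
    the mandatory Foundation subset of AVX-512."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx512f" in line.split()
    return False

# Typical use on Linux:
# with open("/proc/cpuinfo") as f:
#     print(has_avx512f(f.read()))

sample = "flags\t\t: fpu sse sse2 avx avx2 avx512f avx512vl"
print(has_avx512f(sample))  # True
```

Checking avx512f alone is only a baseline: since the extensions can be implemented independently, code using e.g. AVX-512VL or the FP16 instructions must test for those flags as well.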
Intel Vectorization Advisor (starting from version 2017) supports native AVX-512 performance and vector code quality analysis (for "Core", Xeon, and Intel Xeon Phi processors). Along with a traditional hotspot profile, Advisor Recommendations, and "seamless" integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. Scatter, Compress/Expand, and mask utilization.[38][39]
On some processors (mostly pre-Ice Lake Intel), AVX-512 instructions can cause greater frequency throttling than their predecessors, incurring a penalty for mixed workloads. The additional downclocking is triggered by the 512-bit width of the vectors and depends on the nature of the instructions being executed; using the 128-bit or 256-bit forms of AVX-512 (via AVX-512VL) does not trigger it. As a result, GCC and Clang default to preferring 256-bit vectors for Intel targets.[40][41][42]
C/C++ compilers also automatically handle loop unrolling and pipeline-stall avoidance in order to use AVX-512 most effectively, so a programmer who uses language intrinsics to force the use of AVX-512 can sometimes end up with worse performance than the code the compiler generates for loops plainly written in the source.[43] In other cases, using AVX-512 intrinsics in C/C++ code can yield a performance improvement relative to plainly written C/C++.[44]
There are many examples of AVX-512 applications, including media processing, cryptography, video games,[45] neural networks,[46] and even OpenJDK, which employs AVX-512 for sorting.[47]
In a much-cited quote from 2020, Linus Torvalds said "I hope AVX-512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on,"[48] stating that he would prefer the transistor budget be spent on additional cores and integer performance instead, and that he "detests" floating point benchmarks.[49]
Numenta touts their "highly sparse"[50] neural network technology, which they say obviates the need for GPUs because their algorithms run on CPUs with AVX-512.[51] They claim a tenfold speedup relative to Nvidia's A100, largely because their algorithms shrink the neural network while maintaining accuracy, using techniques such as the Sparse Evolutionary Training (SET) algorithm[52] and Foresight Pruning.[53]
Newer x86-64 processors also support the Galois Field New Instructions (GFNI), which allow the Camellia s-box to be implemented in a more straightforward manner and yield even better performance.
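GFNI's GF2P8MULB, for example, multiplies each pair of bytes in the finite field GF(2^8) reduced by the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) — the operation at the heart of such s-box constructions. A pure-Python model of that per-byte operation (a sketch for illustration, not a use of the instruction itself):

```python
def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES reduction
    polynomial 0x11B, mirroring what GF2P8MULB does per byte."""
    result = 0
    while b:
        if b & 1:        # add (XOR) a for each set bit of b
            result ^= a
        a <<= 1          # multiply a by x
        if a & 0x100:    # reduce back into 8 bits
            a ^= poly
        b >>= 1
    return result

# 0x53 and 0xCA are multiplicative inverses in the AES field
print(hex(gf256_mul(0x53, 0xCA)))  # 0x1
```

The hardware instruction performs this for all 16/32/64 byte lanes of a vector at once, which is what makes table-free s-box evaluation competitive.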