"RDNA3" Instruction Set Architecture Reference Guide 20-February-2023 "RDNA3" Instruction Set Architecture Chapter 1. Introduction This document describes the instruction set and shader program accessible state for RDNA3 devices. The AMD RDNA3 processor implements a parallel micro-architecture that provides a platform for computer graphics applications and also for general-purpose data parallel applications. 1.1. Terminology The following terminology and conventions are used in this document: Table 1. Conventions * Any number of alphanumeric characters in the name of a code format, parameter, or instruction. <> Angle brackets denote streams. [1,2) A range that includes the left-most value (in this case, 1), but excludes the right-most value (in this case, 2). [1,2] A range that includes both the left-most and right-most values. {x | y} or {x, y} One of the multiple options listed. In this case, X or Y. 0.0 A floating-point value. 1011b 'b0010 32’b0010 A binary value, in this example a 4-bit value. A binary value of unspecified size. A 32-bit binary value. Binary values may include underscores for readability and can be ignored when parsing the value. 0x1A 'h123 24’h01 A hexadecimal value. A hexadecimal value. A 24-bit hexadecimal value. 7:4 [7:4] A bit range, from bit 7 to bit 4, inclusive. The high-order bit is shown first. May be enclosed in brackets. italicized word or phrase The first use of a term or concept basic to the understanding of stream computing. Table 2. Basic Terms Term Description RDNA3 Processor The RDNA3 shader processor is a scalar and vector ALU with memory access designed to run complex programs on behalf of a wave. Kernel A program executed by the shader processor for each work item submitted to it. Shader Program Same meaning as "Kernel". The shader types are: CS (Compute Shader), and for graphics-capable devices, PS (Pixel Shader), GS (Geometry Shader), and HS (Hull Shader). Dispatch A dispatch launches a 1D, 2D, or 3D grid of work to the RDNA3 processor array. Work-group A work-group is a collection of waves that have the ability to synchronize with each other with barriers; they also can share data through the Local Data Share. Waves in a work-group all run on the same WGP. Wave A collection of 32 or 64 work-items that execute in parallel on a single RDNA3 processor. Work-item A single element of work: one element from the dispatch grid, or in graphics a pixel, vertex or primitive. Thread A synonym for "work-item". Lane A synonym for "work-item" typically used only when describing VALU operations. SA Shader Array. A collection of compute units. 1.1. Terminology 3 of 597 "RDNA3" Instruction Set Architecture Term Description SE Shader Engine. A collection of shader arrays. SGPR Scalar General Purpose Registers. 32-bit registers that are shared by work-items in each wave. VGPR Vector General Purpose Registers. 32-bit registers that are private to each work-items in a wave. LDS Local Data Share. A 32-bank scratch memory allocated to waves or work-groups GDS Global Data Share. A scratch memory shared by all shader engines. Similar to LDS but also supports append operations. VMEM Vector Memory. Refers to LDS, Texture, Global, Flat and Scratch memory. SIMD32 Single Instruction Multiple Data. In this document a SIMD refers to the Vector ALU unit that processes instructions for a single wave. Literal Constant A 32-bit integer or float constant that is placed in the instruction stream. Scalar ALU (SALU) The scalar ALU operates on one value per wave and manages all control flow. Vector ALU (VALU) The vector ALU maintains Vector GPRs that are unique for each work item and execute arithmetic operations uniquely on each work-item. Work-group Processor The basic unit of shader computation hardware, including scalar & vector ALU’s and memory, as (WGP) well as LDS and scalar caches. Compute Unit (CU) One half of a WGP. Contains 2 SIMD32’s that share one path to memory. Microcode format The microcode format describes the bit patterns used to encode instructions. Each instruction is 32-bits or more, in units of 32-bits. Instruction An instruction is the basic unit of the kernel. Instructions include: vector ALU, scalar ALU, memory transfer, and control flow operations. Quad A quad is a 2x2 group of screen-aligned pixels. This is relevant for sampling texture maps. Texture Sampler (S#) A texture sampler is a 128-bit entity that describes how the vector memory system reads and samples (filters) a texture map. Texture Resource (T#) A texture resource descriptor describes an image in memory: address, data format, width, height, depth, etc. Buffer Resource (V#) A buffer resource descriptor describes a buffer in memory: address, data format, stride, etc. NGG Next Generation Graphics pipeline DPP Data Parallel Primitives: VALU instructions which can pass data between work-items LSB Least Significant Bit MSB Most Significant Bit DWORD 32-bit data SHORT 16-bit data BYTE 8-bit data Table 3. Instruction suffixes have the following definitions: Format Meaning B32 binary (untyped data) 32-bit B64 binary (untyped data) 64-bit F16 floating-point 16-bit (sign + exp5 + mant10) F32 floating-point 32-bit (IEEE 754 single-precision float) (sign + exp8 + mant23) F64 floating-point 64-bit (IEEE 754 double-precision float) (sign + exp11 + mant52) BF16 floating-point 16-bit for machine learning ("bfloat16"). (sign + exp8 + mant7) I8 signed 8-bit integer I16 signed 16-bit integer I32 signed 32-bit integer I64 signed 64-bit integer U16 unsigned 16-bit integer U32 unsigned 32-bit integer 1.1. Terminology 4 of 597 "RDNA3" Instruction Set Architecture Format Meaning U64 unsigned 64-bit integer D.i Destination which is a signed integer D.u Destination which is an unsigned integer D.f Destination which is a float S*.i Source which is a signed integer S*.u Source which is an unsigned integer S*.f Source which is a float If an instruction has two suffixes (for example, _I32_F32), the first suffix indicates the destination type, the second the source type. The following abbreviations are used in instruction definitions: • D = destination • U = unsigned integer • S = source • SCC = scalar condition code • I = signed integer • B = bitfield Note: .u or .i specifies to interpret the argument as an unsigned or signed integer. 1.2. Hardware Overview The figure below shows a block diagram of the AMD RDNA3 Generation series processors: Figure 1. AMD RDNA3 Generation Series Block Diagram 1.2. Hardware Overview 5 of 597 "RDNA3" Instruction Set Architecture The RDNA3 device includes a data-parallel processor array, a command processor, a memory controller, and other logic (not shown). The command processor reads commands that the host has written to memorymapped registers in the system-memory address space. The command processor sends hardware-generated interrupts to the host when the command is completed. The memory controller has direct access to all device memory and the host-specified areas of system memory. To satisfy read and write requests, the memory controller performs the functions of a direct-memory access (DMA) controller, including computing memoryaddress offsets based on the format of the requested data in memory. In the RDNA3 environment, a complete application includes two parts: • a program running on the host processor, and • programs, called shader programs or kernels, running on the RDNA3 processor. The RDNA3 programs are controlled by a driver running on the host that: • sets internal base-address and other configuration registers, • specifies the data domain on which the GPU is to operate, • invalidates and flushes caches on the GPU, and • causes the GPU to begin execution of a program. 1.2.1. Work-group Processor (WGP) The processor array is the heart of the GPU. The array is organized as a set of work-group processor (WGP) pipelines, each independent from the others, that operate in parallel on streams of floating-point or integer data. The work-group processor pipelines can process data or, through the memory controller, transfer data to, or from, memory. Computation in a work-group processor pipeline can be made conditional. Outputs written to memory can also be made conditional. When it receives a request, the work-group processor pipeline loads instructions and data from memory, begins execution, and continues until the end of the kernel. As kernels are running, the GPU hardware automatically fetches instructions from memory into on-chip caches; software plays no role in this. Kernels can load data from off-chip memory into on-chip general-purpose registers (GPRs) and caches. The GPU devices can detect floating point exceptions and can generate interrupts to the host. In particular, they detect IEEE-754 floating-point exceptions in hardware; these can be recorded for post-execution analysis. The GPU hides memory latency by keeping track of potentially hundreds of work-items in various stages of execution, and by overlapping compute operations with memory-access operations. 1.2.2. Data Sharing The processors may share data between different work-items. Data sharing can boost performance. The figure below shows the memory hierarchy that is available to each work-item. The actual number of GPRs may differ from what is shown in the image below. 1.2. Hardware Overview 6 of 597 "RDNA3" Instruction Set Architecture Figure 2. Shared Memory Hierarchy 1.2.2.1. Local Data Share (LDS) Each work-group processor (WGP) has a 128kB memory space that enables low-latency communication between work-items within a work-group, or the work-items within a wave; this is the local data share (LDS). This memory is configured with 64 banks, each with 512 entries of 4 bytes. The shared memory contains 64 integer atomic units to enable fast, unordered atomic operations. This memory can be used as a software cache for predictable re-use of data, a data exchange machine for the work-items of a work-group, or as a cooperative way to enable efficient access to off-chip memory. A single work-group may allocate up to 64kB of LDS space. 1.2.2.2. Global Data Share (GDS) The AMD RDNA3 devices use a 4kB global data share (GDS) memory that can be used by waves of a kernel on all WGPs. This memory provides 128 bytes per cycle of memory access to all the processing elements. It provides full access to any location for any processor. The shared memory contains 2 integer atomic units to enable fast, unordered atomic operations. This memory can be used as a software cache to store important control data for compute kernels, reduction operations, or a small global shared surface. Data can be preloaded from memory prior to kernel launch and written to memory after kernel completion. The GDS block contains support logic for unordered append/consume and domain launch ordered append/consume operations to buffers in memory. These dedicated circuits enable fast compaction of data or the creation of complex data structures in memory. 1.2.3. Device Memory The AMD RDNA3 devices offer several methods for access to off-chip memory from the processing elements 1.2. Hardware Overview 7 of 597 "RDNA3" Instruction Set Architecture (PE) within each WGP. On the primary read path, the device consists of multiple channels of L2 cache that provides data to read-only L1 caches, and finally to L0 caches per WGP. Specific cache-less load instructions can force data to be retrieved from device memory during an execution of a load clause. Load requests that overlap within the clause are cached with respect to each other. The output cache is formed by two levels of cache: the first for write-combining cache (collect scatter and store operations and combine them to provide good access patterns to memory); the second is a read/write cache with atomic units that lets each processing element complete unordered atomic accesses that return the initial value. Each processing element provides the destination address on which the atomic operation acts, the data to be used in the atomic operation, and a return address for the read/write atomic unit to store the pre-op value in memory. Each store or atomic operation can be set up to return an acknowledgment to the requesting PE upon write confirmation of the return value (pre-atomic op value at destination) being stored to device memory. This acknowledgment has two purposes: • enabling a PE to recover the pre-op value from an atomic operation by performing a cache-less load from its return address after receipt of the write confirmation acknowledgment, and • enabling the system to maintain a relaxed consistency model. Each scatter write from a given PE to a given memory channel maintains order. The acknowledgment enables one processing element to implement a fence to maintain serial consistency by ensuring all writes have been posted to memory prior to completing a subsequent write. In this manner, the system can maintain a relaxed consistency model between all parallel work-items operating on the system. 1.2. Hardware Overview 8 of 597 "RDNA3" Instruction Set Architecture Chapter 2. Shader Concepts RDNA3 shader programs (kernels) are programs executed by the GPU processor. Conceptually, the shader program is executed independently on every work-item, but in reality the processor groups up to 32 or 64 work-items into a wave, which executes the shader program on all 32 or 64 work-items in one pass. The RDNA3 processor consists primarily of: • A scalar ALU, which operates on one value per wave (common to all work-items) • A vector ALU, which operates on unique values per work-item • Local data storage, which allows work-items within a work-group to communicate and share data • Scalar memory, which can transfer data between SGPRs and memory through a cache • Vector memory, which can transfer data between VGPRs and memory, including sampling texture maps • Exports which transfer data from the shader to dedicated rendering hardware Program control flow is handled using scalar ALU instructions. This includes if/else, branches and looping. Scalar ALU (SALU) and memory instructions work on an entire wave and operate on up to two SGPRs, as well as literal constants. Vector memory and ALU instructions operate on all work-items in the wave at one time. In order to support branching and conditional execute, every wave has an EXECute mask that determines which work-items are active at that moment, and which are dormant. Active work-items execute the vector instruction, and dormant ones treat the instruction as a NOP. The EXEC mask can be written at any time by Scalar ALU instructions or VALU comparisons. Vector ALU instructions can typically take up to three arguments, which can come from VGPRs, SGPRs, or literal constants that are part of the instruction stream. They operate on all work-items enabled by the EXEC mask. Vector compare and add-with-carry-out return a bit-per-work-item mask back to the SGPRs to indicate, per work-item, which had a "true" result from the compare or generated a carry-out. Vector memory instructions transfer data between VGPRs and memory. Each work-item supplies its own memory address and supplies or receives unique data. These instructions are also subject to the EXEC mask. 2.1. Wave32 and Wave64 The shader supports both waves of 32 work-items ("wave32") and waves of 64 work-items ("wave64"). Both wave sizes are supported for all operations, but shader programs must be compiled for and run as a particular wave size, regardless of how many work-items are active in any given wave. Wave32 waves issue each instruction at most once. Wave64 waves typically issue each instruction twice: once for the low half (work-items 31-0) and then again for the high half (work-items 63-32). This occurs only for VALU and VMEM (LDS, texture, buffer, flat) instructions; scalar ALU and memory as well as branch and messages are issued only once regardless of the wave size. Export requests also issue just once regardless of wave size. It is possible that instructions from other waves may be executed in between the low and high half of a given wave’s instructions. Hardware may choose to skip either half if the EXEC mask for that half is all zeros, but does not skip both halves for VMEM instructions as that would confuse the outstanding-memory-instruction counters, unless 2.1. Wave32 and Wave64 9 of 597 "RDNA3" Instruction Set Architecture there are no outstanding VMEM instructions from this wave. It also does not skip either half of a VALU instruction which writes an SGPR. See Instruction Skipping: EXEC==0 for details on instruction skipping rules. Hardware operates such that both passes of a wave64 use the state of the wave prior to instruction execution; the first pass of the wave64 does not affect the input to the second pass. In addition to the EXEC mask being different between the low and high half, scalar inputs may vary between the two passes. Both passes use the same constants, but different masks and carry-in/out. The differences in the second pass are: • Input increments: Carry-in, div-fmas and v_cndmask all use the next SGPR (SSRC + 1, or VCC_HI) • Output increments: Carry-out, div-scale and v_cmp all write to the next SGPR (SDST + 1, or VCC_HI) ◦ v_cmpx writes to EXEC_HI instead of EXEC_LO The upper 32-bits of EXEC and VCC are ignored for wave32 waves. VCCZ and EXECZ reflect the status of the lowest 32-bits of VCC and EXEC respectively for wave32 waves. 2.2. Shader Types 2.2.1. Compute Shaders Compute kernels (shaders) are generic programs that can run on the RDNA3 processor, taking data from memory, processing it, and writing results back to memory. Compute kernels are created by a dispatch, which causes the RDNA3 processors to run the kernel over all of the work-items in a 1D, 2D, or 3D grid of data. The RDNA3 processor walks through this grid and generates waves, which then run the compute kernel. Each work-item is initialized with its unique address (index) within the grid. Based on this index, the work-item computes the address of the data it is required to work on and what to do with the results. 2.2.2. Graphics Shaders The shader supports 3 types of graphics waves: PS, GS, and HS. Rendering modes (launch behavior): • Normal NGG - Geometry Engine (GE) sends info to wave launch hardware to init VGPRs for each element (prim) launched; GE fetches index and vertex buffer data and loads to VGPRs • Mesh shader - turns GS-launch into a CS-style launch, and wave launch hardware does unrolling into elements and generates element indices on the fly. The mesh shader program determines how to use this index value. 2.2. Shader Types 10 of 597 "RDNA3" Instruction Set Architecture The amplification shader decides how many mesh shader groups to launch. The mesh shader processes vertices and then primitives. 2.3. Work-groups A work-group is a collection of waves which can share data through LDS and can synchronize at a barrier. Waves in a work-group are all issued to the same WGP but can run on any of the 4 SIMD32’s and can share data through LDS. The WGP supports up to 32 work-groups with a maximum of 1024 work-items per work-group. Waves in a work-group may share up to 64kB of LDS space. Work-groups consisting of a single wave do not count against the limit of 32. They do not allocate a barrier resource, and barrier ops are treated as S_NOP. Each work-group or wave can operate in one of two modes, selectable per draw/dispatch at wave-create time: CU mode In this mode, the LDS is effectively split into a separate upper and lower LDS, each serving two SIMD32’s. Waves are allocated LDS space within the half of LDS which is associated with the SIMD the wave is running on. For work-groups, all waves are assigned to the pair of SIMD32’s. This mode may provide faster operation since both halves run in parallel, but limits data sharing (upper waves cannot read data in the lower half of LDS and vice versa). When in CU mode, all waves in the work-group are resident within the same CU. 2.3. Work-groups 11 of 597 "RDNA3" Instruction Set Architecture WGP mode In this mode, the LDS is one large contiguous memory that all waves on the WGP can access. In WGP mode, waves of a work-group may be distributed across both CU’s (all 4 SIMD32’s) in the WGP. LDS_PARAM_LOAD and LDS_DIRECT_LOAD are not supported in WGP mode. The WGP (and LDS) can simultaneously have some waves running in WGP mode and other waves in CU mode running. A barrier is a synchronization primitive which makes each wave reach a given point in the shader before any wave proceeds. 2.4. Shader Padding Requirement Due to aggressive instruction prefetching used in some graphics devices, the user must pad all shaders with 64 extra DWORDs (256 bytes) of data past the end of the shader. It is recommended to use the S_CODE_END instruction as padding. This ensures that if the instruction prefetch hardware goes beyond the end of the shader, it may not reach into uninitialized memory (or unmapped memory pages). The amount of shader padding required is related to how far the shader may prefetch ahead. The shader can be set to prefetch 1, 2 or 3 cachelines (64 bytes) ahead of the current program counter. This is controlled via a wave-launch state register, or by the shader program itself with S_SET_INST_PREFETCH_DISTANCE. 2.4. Shader Padding Requirement 12 of 597 "RDNA3" Instruction Set Architecture Chapter 3. Wave State This chapter describes the state variables visible to the shader program. Each wave has a private copy of this state unless otherwise specified. 3.1. State Overview The table below shows the hardware states readable or writable by a shader program. All registers below are unique to each wave except for TBA and TMA which are shared. Table 4. Readable and Writable Hardware States Abbrev. Name Size (bits) Description PC Program Counter 48 Points to the memory address of the next shader instruction to execute. Read/write only via scalar control flow instructions and indirectly using branch. The 2 LSB’s are forced to zero. V0-V255 VGPR 32 Vector general-purpose register. (32 bits per work-item x (32 or 64) work-items per wave). S0-S105 SGPR 32 Scalar general-purpose register. All waves are allocated 106 SGPRs + 16 TTMPs. LDS Local Data Share 64kB Local data share is a scratch RAM with built-in arithmetic capabilities that allow data to be shared between threads in a work-group. EXEC Execute Mask 64 A bit mask with one bit per thread, which is applied to vector instructions and controls which threads execute and which ignore the instruction. EXECZ EXEC is zero 1 A single bit flag indicating that the EXEC mask is all zeros. For wave32 it considers only EXEC[31:0]. VCC Vector Condition Code 64 A bit mask with one bit per thread; it holds the result of a vector compare operation or integer carry-out. Physically VCC is stored in specific SGPRs. VCCZ VCC is zero 1 A single-bit flag indicating that the VCC mask is all zeros. For wave32 it considers only VCC[31:0]. SCC Scalar Condition Code 1 Result from a scalar ALU comparison instruction. FLAT_SCRATCH Flat scratch address 48 The base address of scratch memory for this wave. Used by Flat and Scratch instructions. Read-only by user shader. STATUS Status 32 Read-only shader status bits. MODE Mode 32 Writable shader mode bits. M0 Misc Reg 32 A temporary register that has various uses, including GPR indexing and bounds checking. TRAPSTS Trap Status 32 Holds information about exceptions and pending traps. TBA Trap Base Address 48 Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit[62] is replicated into address bit[63]). Accessed via S_SENDMSG_RTN TMA Trap Memory Address 48 Temporary register for shader operations. For example, can hold a pointer to memory used by the trap handler. 3.1. State Overview 13 of 597 "RDNA3" Instruction Set Architecture Abbrev. Name Size (bits) Description TTMP0-TTMP15 Trap Temporary SGPRs 32 16 SGPRs available only to the Trap Handler for temporary storage. VMcnt Vector memory load instruction count 6 Counts the number of VMEM load and sample instructions issued but not yet completed. VScnt Vector memory store instruction count 6 Counts the number of VMEM store instructions issued but not yet completed. EXPcnt Export Count 3 Counts the number of Export and GDS instructions issued but not yet completed. Also counts VMEM writes that have not yet sent their write-data to the last level cache, and parameter loads outstanding. LGKMcnt LDS, GDS, Constant and Message count 6 Counts the number of LDS, GDS, constant-fetch (scalar memory read), and message instructions issued but not yet completed. 3.2. Control State: PC and EXEC 3.2.1. Program Counter (PC) The Program Counter is a DWORD-aligned byte address that points to the next instruction to execute. When a wave is created the PC is initialized to the first instruction in the program. There are a few instructions to interact directly with the PC: S_GETPC_B64, S_SETPC_B64, S_CALL_B64, S_RFE_B64 and S_SWAPPC_B64. These transfer the PC to and from an even-aligned SGPR pair (sign-extended). Branches jump to (PC_of_the_instruction_after_the_branch + offset*4). Branches, GET_PC and SWAP_PC are PCrelative to the next instruction, not the current one. S_TRAP, on the other hand, saves the PC of the S_TRAP instruction itself. During wave debugging, the program counter may be read. The PC points to the next instruction to issue. All prior instructions have been issued but may or may not have completed execution. 3.2.2. EXECute Mask The Execute mask (64-bit) controls which threads in the vector are executed. Each bit indicates how one thread behaves for vector instructions: 1 = execute, 0 = do not execute. EXEC can be read and written via scalar instructions, and can also be written as a result of a vector-alu compare. EXEC affects: vector-alu, vectormemory, LDS, GDS and export instructions. It does not affect scalar execution or branches. Wave64 uses all 64 bits of the exec mask. Wave32 waves use only bits 31:0 and hardware does not act upon the upper bits. There is a summary bit (EXECZ) that indicates that the entire execute mask is zero. It can be used as a condition for branches to skip code when EXEC is zero. For wave32, this reflects the state of EXEC[31:0]. 3.2. Control State: PC and EXEC 14 of 597 "RDNA3" Instruction Set Architecture 3.2.3. Instruction Skipping: EXEC==0 The shader hardware may skip vector instructions when EXEC==0. Instructions which may be skipped are: • VALU - skip if EXEC == 0 ◦ Not skipped if the instruction writes SGPRs/VCC ◦ Does not skip WMMA or SWMMA ◦ This skipping is opportunistic and may not occur depending on timing after a V_CMPX. • These are not skipped regardless of EXEC mask value, and are issued only once in wave64 ◦ V_NOP, V_PIPEFLUSH, V_READLANE, V_READFIRSTLANE, V_WRITELANE ◦ BUFFER_GL1_INV, BUFFER_GL0_INV • These are not skipped and are issued twice regardless of EXEC mask value in wave64 mode ◦ V_CMP which writes SGPR or VCC (not V_CMPX - may skip one pass but not both) ◦ Any VALU which writes an SGPR • Export Request - skip unless: Done==1 or if export target is POS0 ◦ Skipped if the wave was created with SKIP_EXPORT=1 • LDS_param_load / LDS-direct: are skipped when EXEC==0 • LDS, Memory, GDS - do not skip ◦ VMEM can be skipped only if: VMcnt/VScnt==0 and EXEC==0 ▪ otherwise for wave64 one pass can be skipped if EXEC==0 for that half, but not both halves. ◦ LDS can be skipped only if: LGKMcnt==0 and EXEC==0 ◦ Does not skip GDS or GWS 3.3. Storage State: SGPR, VGPR, LDS 3.3.1. SGPRs 3.3.1.1. SGPR Allocation and storage Every wave is allocated a fixed number of SGPRs: • 106 normal SGPRs • VCC_HI and VCC_LO (stored in SGPRs 106 and 107) • 16 Trap-temporary SGPRs, meant for use by the trap handler 3.3.1.2. VCC The Vector Condition Code (VCC) can be written by V_CMP and integer vector ADD/SUB instructions. VCC is implicitly read by V_ADD_CI, V_SUB_CI, V_CNDMASK and V_DIV_FMAS. VCC is a named SGPR-pair and is subject to the same dependency checks as any other SGPR. 3.3. Storage State: SGPR, VGPR, LDS 15 of 597 "RDNA3" Instruction Set Architecture 3.3.1.3. SGPR Alignment There are a few cases where even-aligned SGPRs are required: 1. any time 64-bit data is used a. this includes moves to/from 64-bit registers, including PC 2. Scalar memory reads when the address-base comes from an SGPR-pair Quad-alignment of SGPRs is required for operation on more than 64-bits, and for the data GPR when a scalar memory operation (read, write or atomic) operates on more than 2 DWORDs. Similarly, when a 64-bit SGPR data value is used as a source to a VALU op, it must be even aligned regardless of size. In contrast, when a 32bit SGPR data value is used as a source to a VALU op, it can be arbitrarily aligned regardless of wave size. When a 64-bit quantity is stored in SGPRs, the LSB’s are in SGPR[n], and the MSB’s are in SGPR[n+1]. It is illegal to use mis-aligned source or destination SGPRs for data larger than 32 bits and results are unpredictable. As an example, VALU ops with carry-in or carry-out: • When used with wave32, these are 32 bit values and may have any arbitrary alignment • When used with wave64, these are 64 bit values and must be aligned to an even SGPR address Hardware enforces SGPR alignment by ignoring LSB’s as necessary and treating them as zero. For *MOVREL*_B64, the LSB of the index is also ignored and treated as zero. 3.3.1.4. SGPR Out of Range Behavior Scalar sources and dests use a 7-bit encoding: Scalar 0-105=SGPR; 106,107=VCC, 108-123=TTMP0-15, and 124-127={NULL, M0, EXEC_LO, EXEC_HI}. It is illegal to use GPR indexing or a multi-DWORD operand to cross SGPR regions. The regions are: • SGPRs 0 - 107 (includes VCC) • Trap Temp SGPRs • All other SGPR & Scalar-source addresses must not be indexed and no single operand can reference multiple register ranges. General Rules: • Out of range source SGPRs return zero (using a TTMP when STATUS.PRIV=0, NULL, M0 or EXEC where not allowed) • Writes to an out of range SGPR are ignored TTMP0-15 can only be written while in the trap handler (STATUS.PRIV=1) but can be read by the user’s shader (STATUS.PRIV=0). Writes to TTMPs while outside the trap handler are ignored. SALU instructions which try but fail to write a TTMP also do not update SCC. • SALU: Above rules apply. ◦ WREXEC and SAVEEXEC write the EXEC mask even when the SDST is out-of-range • VALU: Above rules apply. • VMEM: S#, T#, V# must be contained within one region. 3.3. Storage State: SGPR, VGPR, LDS 16 of 597 "RDNA3" Instruction Set Architecture ◦ T# (128b), V# or S#: no possible range violation exists (forced alignment puts all in 1 range). ◦ T# (256b) starting at 104 and extending into TTMPs; or starting at TTMP12 and going past TTMP15 is a violation. If this occurs, force to use S0. • SMEM return data starting in SGPRs/VCC and extending into TTMPs, or starting in TTMPs and extending outside TTMPs becomes out of range. ◦ No data gets written to dest-SGPRs that are out-of-range ◦ Addr and write-data are aligned and so cannot go out of range, except: ▪ Referencing M0, NULL, or EXEC* returns zero, and SMEM loads cannot load into these registers. • S_MOVREL: ◦ Indexing is allowed only within SGPRs and TTMPs, and must not cross between the two. Indexing must stay within the "base" range (the operand type where index==0). The ranges are: [ SGPRs 0-105 and VCC_LO, VCC_HI ], [ Trap Temps 0-15 ], [ all other values ] ◦ Indexing must not reach M0, exec or inline constants, the rule is: ▪ Base is SGPR: addr > VCC_HI (or if 64-bit operand, addr > VCC_LO) ▪ Base is TTMP: addr > TTMP15 (or if B64 if addr > ttmp14) ◦ If the source is out of range, S0 is used. If the dest is out of range, nothing is written. 3.3.2. VGPRs 3.3.2.1. VGPR Allocation and Alignment VGPRs are allocated in blocks of 16 for wave32 or 8 for wave64, and a shader may have up to 256 VGPRs. In other words, VGPRs are allocated in units of (16*32 or 8*64 = 512 DWORDs). A wave may not be created with zero VGPRs. Devices which have 1536 VGPRs per SIMD allocate in blocks of 24 for wave32 and 12 for wave64. A wave may voluntarily deallocate all of its VGPRs via S_SENDMSG. Once this is done, the wave may not reallocate them and the only valid action is to terminate the wave. This can be useful if a wave has issued stores to memory and is waiting for the write-confirms before terminating. Releasing the VGPRs while waiting may allow a new wave to allocate them and start earlier. 3.3.2.2. VGPR Out of Range Behavior Given an instruction operand that uses one or more DWORDs of VGPR data: "V" Vs = the first VGPR DWORD (start) Ve = the last VGPR DWORD (end) For a 32-bit operand, Vs==Ve; for a 64-bit operand Ve=Vs+1, etc. Operand is out of range if: • Vs < 0 || Vs >= VGPR_SIZE • Ve < 0 || Ve >= VGPR_SIZE V_MOVREL indexed operand out of range if either: • Index > 255 3.3. Storage State: SGPR, VGPR, LDS 17 of 597 "RDNA3" Instruction Set Architecture • (Vs + M0) >= VGPR_SIZE • (Ve + M0) >= VGPR_SIZE Out of range consequences: • If a dest VGPR is out of range, the instruction is ignored (treat as NOP). • V_SWAP & V_SWAPREL : since both arguments are destinations, if either is out of range, discard the instruction. ◦ VALU instructions with multiple destination (e.g. VGPR and SGPR): nothing is written to any GPR • If a source VGPR is out of range in a VMEM or Export instruction: VGPR0 is used ◦ Memory instructions that use a group of consecutive VGPRs that are out of range: the group is forced to start at VGPR0. • If a source VGPR in a VALU instruction is out of range in a VALU instruction: VGPR0 ◦ VOPD has different rules: the source address forced to (VGPRaddr % 4). Instructions with multiple destinations (e.g. V_ADD_CO): if any destination is out of range, no results are written. 3.3.3. Memory Alignment and Out-of-Range Behavior This section defines the behavior when a source or destination GPR or memory address is outside the legal range for a wave. Except where noted, these rules apply to LDS, GDS, buffer, global, flat and scratch memory accesses. Memory, LDS & GDS: Reads and Atomics with return: • If any source VGPR or SGPR is out-of-range, the data value is undefined. • If any destination VGPR is out-of-range, the operation is nullified by issuing the instruction as if the EXEC mask were cleared to 0. ◦ This out-of-range test checks all VGPRs which could be returned (e.g. VDST to VDST+3 for a BUFFER_LOAD_B128) ◦ This check also includes the extra PRT (partially resident texture) VGPR and nullifies the fetch if this VGPR would be out of range no matter whether the texture system actually returns this value or not. ◦ Atomic operations with out-of-range destination VGPRs are nullified: issued, but with EXEC mask of zero. • Image loads and stores consider DMASK bits when making an out-of-bounds determination. • Note: VDST is only checked for lds/gds/mem-atomic that actually return a value. VMEM (texture) memory alignment rules are defined using the config register: SH_MEM_CONFIG.alignment_mode. This setting also affects LDS, Flat/Scratch/Global operations. DWORD Automatic alignment to multiple of the smaller of element size or a DWORD. UNALIGNED No alignment requirements. Formatted ops like buffer_load_format_* must be aligned to "min(DWORD, ElementSize)" where ElementSize is the size of one "thing" described by the FORMAT. E.g. format 5_6_6 = 2 bytes so it must be 4-byte (DWORD) aligned. Non-formatted ops follow the above alignment mode. Atomics must be aligned to the data size, or will trigger a MEMVIOL. 3.3. Storage State: SGPR, VGPR, LDS 18 of 597 "RDNA3" Instruction Set Architecture 3.3.4. LDS Waves may be allocated LDS memory, and waves in a work-group all share the same LDS memory allocation. A wave may have 0 - 64kbyte of LDS space allocated, and it is allocated in blocks of 1024 bytes. All accesses to LDS are restricted to the space allocated to that wave/work-group. Internally LDS is composed of two blocks of memory of 64kB each. Each one of these two blocks is affiliated with one CU or the other: byte addresses 0-65535 with CU0, 65536-131071 with CU1. Allocations of LDS space to a wave or work-group do not wrap around: the allocation starting address is less than the ending address. In CU mode, a wave’s entire LDS allocation resides in the same "side" of LDS as the wave is loaded. No access is allowed to cross over or wrap around to the other side. In WGP mode, a wave’s LDS allocation may be entirely in either the CU0 or CU1 part of LDS, or it may straddle the boundary and be partially in each CU. The location of the LDS storage is unrelated to which CU the wave is on. Pixel parameters are loaded into the same CU side as the wave resides and do not cross over into the other side of LDS storage. Pixel shaders are run only in CU mode. Pixel shader may request additional LDS space in addition to what is required for vertex parameters. 3.3.4.1. LDS/GDS Alignment and Out-of-Range Any DS_LOAD or DS_STORE of any size can be byte aligned if the alignment mode is set to "unaligned". For all other alignment modes, LDS forces alignment by zeroing out address least significant bits. • 32-bit Atomics must be aligned to a 4-byte address; 64-bit atomics to an 8-byte address. • LDS operations report MEMVIOL if the LDS-address is out of range and LDS_CONFIG.ADDR_OUT_OF_RANGE_REPORTING==1 • MEMVIOL is reported for misaligned LDS accesses when the alignment mode is set to STRICT or DWORD_STRICT. Out Of Range • If the LDS-ADDRESS is out of range (addr < 0 or >= LDS_size): ◦ Writes out-of-range are discarded. ◦ Reads return the value zero. For multi-DWORD reads, if any part of the LDS-address is out of range, the entire instruction returns zero. • If any source-VGPR is out of range, the value from VGPR0 is used to supply the LDS address or data. • If the dest-VGPR is out of range, nullify the instruction (issue with EXEC=0) "Native" Alignment in LDS & GDS is: B8: byte aligned B16 or D16: 2 byte aligned B32: 4 byte aligned B64: 8 byte aligned B128 and B96: 16 byte aligned If the alignment mode is set to "unaligned", the LDS disables its auto-alignment and doesn’t report error for misaligned reads & writes. 3.3. Storage State: SGPR, VGPR, LDS 19 of 597 "RDNA3" Instruction Set Architecture if (sh_alignment_mode == unaligned) align = 0xffff else if (B32) align = 0xfffC else if (B64) align = 0xfff8 else if (B96 or B128) align = 0xfff0 LDSaddr = (addr + offset) & align 3.4. Wave State Registers The following registers are accessed infrequently, and are only readable/writable via S_GETREG and S_SETREG instructions. Some of these registers are read-only, some are writable and others are writable only when in the trap handler ("PRIV"). Code Register 0 Reserved 1 MODE read / write 2 STATUS read / write. Only writable when priv=1 3 TRAPSTS read / write 14 FLUSH_IB write-only. Writing this causes all waves to flush their instruction buffers 15 SH_MEM_BASES read-only. Allows a wave to read the value of this register to do aperture checks and memory space conversions. Bits [15:0] = Private Base; [31:16] = Shared Base. 20 FLAT_SCRATCH_LO read only (writable only while in trap handler) 21 FLAT_SCRATCH_HI read only (writable only while in trap handler) 23 HW_ID1 read only. debug only - not predictable values 24 HW_ID2 read only. debug only - not predictable values 29 SHADER_CYCLES Get the current graphics clock counter value 3.4.1. Status register Status register fields can be read but not written to by the shader. While in the trap handler, certain STATUS fields can be written. These bits are initialized at wave-creation time. The table below describes the status register fields. Table 5. Status Register Fields Field Bit Write Description Pos when Priv? SCC 0 Y Scalar condition code. Used as a carry-out bit. For a comparison instruction, this bit indicates failure or success. For logical operations, this is 1 if the result is non-zero. SYS_PRIO 2:1 Y Wave priority set at wave creation time. See S_SETPRIO instruction for details. 0 is lowest, 3 is highest priority. USER_PRIO 4:3 Y Wave’s priority set by shader program itself. See S_SETPRIO instruction for details. PRIV 5 N Privileged mode. Indicates that the wave is in the trap handler. Gives write access to TTMP registers. TRAP_EN 6 N Indicates that a trap handler is present. When set to zero, traps are not taken. 3.4. Wave State Registers 20 of 597 "RDNA3" Instruction Set Architecture Field Bit Write Description Pos when Priv? EXPORT_RDY 8 Y This status bit indicates if export buffer space has been allocated. The shader stalls any export instruction until this bit becomes "1". It gets set to 1 when export buffer space has been allocated. Shader hardware checks this bit before executing any EXPORT instruction to Position, Z or MRT targets, and put the wave into a waiting state if the alloc has not yet been received. The alloc arrives eventually (unless SKIP_EXPORT is set) as a message and the shader then continues with the export. EXECZ 9 N Exec Mask is Zero. VCCZ 10 N Vector Condition Code is Zero. IN_WG 11 N Wave is a member of a work-group of more than one wave. IN_BARRIER 12 N Wave is waiting at a barrier. HALT 13 Y Wave is halted or scheduled to halt. HALT can be set by the host via wave-control messages, or by the shader. The HALT bit is ignored while in the trap handler (PRIV = 1). HALT is also ignored if a hostinitiated trap is received (request to enter the trap handler). TRAP 14 N Wave is flagged to enter the trap handler as soon as possible. VALID 16 N Wave is valid (has been created and not yet ended) SKIP_EXPORT 18 Y For Pixel and Vertex Shaders only. "1" means this shader is not allocated export buffer space, so export instructions are ignored (treated as NOPs). For pixel shaders, this is set to 1 when both the COL0_EXPORT_FORMAT and Z_EXPORT_FORMAT are set to ZERO. If SKIP_EXPORT==1, Must_export must be zero and vice versa. PERF_EN 19 N Performance counters are enabled for this wave CDBG_USER 20 Y User-controlled conditional debug. Set at wave-create time by a user register. Can be used in conditional branches. CDBG_SYS 21 Y System-controlled conditional debug. Set at wave-create time by a system register. Can be used in conditional branches. FATAL_HALT 23 N Indicates that the wave has halted due to a fatal error: illegal instruction . The difference between halt and fatal_halt is that fatal_halt stops waves even when PRIV=1. NO_VGPRS 24 N Indicates that this wave has released all of its VGPRs. LDS_PARAM_RDY 25 Y PS shaders only: indicates that LDS has been written with vertex attribute data and the shader may now execute LDS_PARAM_LOAD instructions. If the wave attempts to issue LDS_PARAM_LOAD before this bit is set, it stalls until the bit is set. MUST_GS_ALLOC 26 N GS shader must issue a GS_ALLOC_REQ message before terminating. Sending this message clears this bit. MUST_EXPORT 27 Y PS: this wave must export color ("export-done") before it terminates. Set to 1 for PS waves unless "skip_export==1". Cleared when PS exports data with export’s Done bit set to 1. GS: this wave must perform a GDS_ordered_count before terminating. Cleared when a GS shader issues a GDS_ordered_count. GS is initialized to 1 normally, but to zero for "no export" passes (stream-out only). IDLE 28 N Wave is idle (has no outstanding instructions). Used by the host (GRBM) to determine if a wave is valid, halted and idle - able to read other wave state. SCRATCH_EN 29 Y Indicate that the wave has scratch memory allocated. This bit gets set to 1 if the wave has FLAT_SCRATCH initialized; otherwise is zero. 3.4. Wave State Registers 21 of 597 "RDNA3" Instruction Set Architecture 3.4.2. Mode register Mode register fields can be read from, and written to, by the shader through scalar instructions. The table below describes the mode register fields. Table 6. Mode Register Fields Field Bit Pos Description FP_ROUND 3:0 Controls round modes for math operations [1:0] Single precision round mode [3:2] Double precision and half precision (FP16) round mode Round Modes: 0=nearest even, 1= +infinity, 2= -infinity, 3= toward zero Round mode affects float ops in VALU, but not LDS or memory. FP_DENORM 7:4 Controls whether floating point denormals are flushed or not. [5:4] Single precision denormal mode [7:6] Double precision and FP16 denormal mode Denormal modes: 2 bits = { allow_output_denorms, allow_input_denorms } 0 = flush input and output denorms 1 = allow input denorms, flush output denorms 2 = flush input denorms, allow output denorms 3 = allow input and output denorms Denorm mode affects float ops in: VALU, LDS, and VMEM atomics. Texture/Buffer/Flat considers only bits 4 and 6 (allowing mode control over input-denorm flushing, and not flushing output denorms), while LDS uses all bits for DS ops (but not for FLAT). DX10_CLAMP 8 Used by the vector ALU to force DX10 style treatment of NaN’s. When set, clamp NaN to zero, otherwise pass NaN thru and also suppress all VALU exceptions. The clamping only occurs when the instruction has the CLAMP bit set to 1, but exceptions are suppressed when DX10_CLAMP==1. IEEE 9 IEEE==0: IEEE-754-1985/DX10 behavior for Min and Max, pass signaling NaN. IEEE==1: IEEE-754-2008 behavior for Min and Max, quiet signaling NaN. When set to 1, floating point opcodes that support exception flag gathering quiet and propagate signaling NaN inputs per IEEE 754-2008. Min_f32/f64 and Max_f32/f64 become IEEE 754-2008 compliant due to signaling NaN propagation and quieting. When set to 1, MAX performs a ">" compare, but when set to zero (directX mode/IEEE 754-1985 mode) MAX performs a ">=" compare. This only affects results for +/-0 and input denormals which are flushed to zero. LOD_CLAMPED 10 Sticky status bit - indicates that one or more texture accesses had their LOD clamped. TRAP_AFTER_ INST 11 Forces the wave to jump to the exception handler after each instruction is executed (but not after ENDPGM). Only works if TRAP_EN = 1. EXCP_EN 21:12 Enable mask for exceptions. Enabled means if the exception occurs and if TRAP_EN==1, a trap may be taken. [12] : invalid [13] : inputDenormal [14] : float_div0 [15] : overflow [16] : underflow [17] : inexact [18] : int_div0 [19] : addr_watch - take exception when TC sees wave access an "address of interest" [21] : trap on wave end - h/w clears this upon entering trap handler for end-of-wave 3.4. Wave State Registers 22 of 597 "RDNA3" Instruction Set Architecture Field Bit Pos Description FP16_OVFL 23 If set, an overflowed FP16 VALU result is clamped to +/- MAX_FP16 regardless of round mode, while still preserving true INF values. (Inputs which are infinity may result in infinity, as does divide-by-zero). DISABLE_PERF 27 1 = disable performance counting for this wave. 3.4.3. M0 : Miscellaneous Register There is one 32-bit M0 register per wave and is it used for: Table 7. M0 Register Fields Operation M0 Contents Notes LDS_PARAM_LOAD { 1’b0, new_prim_mask[15:1], parameter_offset[15:0] } Offset is in bytes and offset[6:0] must be zero. Wave32: new_prim_mask is {8’b0, mask[7:1] } LDS_DIRECT_LOAD { 13’b0, DataType[2:0], LDS_address[15:0] } address is in bytes LDS ADDTID { 16’h0, lds_offset[15:0] } offset is in bytes, must be 4-byte aligned Global Data Share { base[15:0] , size[15:0] } base and size are in bytes GDS Ordered Count { base[15:0], 3’h0, logical_wave_id[12:0] } used for deferred attribute shading (split-GS) Global Wave Sync various uses see instruction definition S/V_MOVREL GPR index See S_MOVREL and V_MOVREL instructions S_SENDMSG / _RTN varies sendmsg data. See [Send_Message_Types] EXPORT Row number for mesh shader POS & Param exports See Export chapter SMEM address_offset[31:0] see SMEM section Temporary data[31:0] can be used as general temporary data storage M0 can only be written by the scalar ALU. 3.4.4. NULL NULL is a scalar source and destination. Reading NULL returns zero, writing to NULL has no effect (write data is discarded). NULL may be used anywhere scalar sources can normally be used: • When NULL is used as the destination of an SALU instruction, the instruction executes: SDST is not written but SCC is updated (if the instruction normally updates SCC). • NULL may not be used as an S#, V# or T#. 3.4.5. SCC: Scalar Condition Code Many scalar ALU instructions set the Scalar Condition Code (SCC) bit, indicating the result of the operation. Compare operations: 1 = true 3.4. Wave State Registers 23 of 597 "RDNA3" Instruction Set Architecture Arithmetic operations: 1 = carry out Bit/logical operations: 1 = result was not zero Move: does not alter SCC The SCC can be used as the carry-in for extended-precision integer arithmetic, as well as the selector for conditional moves and branches. 3.4.6. Vector Compares: VCC and VCCZ Vector ALU comparison instructions (V_CMP) compare two values and return a bit-mask of the result, where each bit represents one lane (work-item) where: 1= pass, 0 = fail. This result mask is the Vector Condition Code (VCC). VCC is also set for selected integer ALU operations (carry-out). These instructions write this mask either to VCC, an SGPR or to EXEC, but do not write to both EXEC and SGPRs. Wave32 writes only the low 32 bits of VCC, EXEC or a single SGPR; Wave64 writes 64-bits of VCC, EXEC or an aligned pair of SGPRs. Whenever any instruction writes a value to VCC, the hardware automatically updates a "VCC summary" bit called VCCZ. This bit indicates whether or not the entire VCC mask is zero for the current wave-size. Wave32 ignores VCC[63:32] and only bits[31:0] contribute to VCCZ. This is useful for early-exit branch tests. VCC is also set for certain integer ALU operations (carry-out). The EXEC mask determines which threads execute an instruction. The VCC indicates which executing threads passed the conditional test, or which threads generated a carry-out from an integer add or subtract. S_MOV_B64 EXEC, 0x00000001 // set just one thread active; others are inactive V_CMP_EQ_B32 VCC, V0, V0 // compare (V0 == V0) and write result to VCC (all bits in VCC are updated)  VCC physically resides in the SGPR register file in a specific pair of SGPRs, so when an instruction sources VCC, that counts against the limit on the total number of SGPRs that can be sourced for a given instruction. Wave32 waves may use any SGPR for mask/carry/borrow operations, but may not use VCC_HI or EXEC_HI. 3.4.7. FLAT_SCRATCH FLAT_SCRATCH is a 64-bit register that holds a pointer to the base of scratch memory for this wave. For waves that have scratch space allocated, wave-launch hardware initializes the FLAT_SCRATCH register with the scratch base address unique to this wave. This register is read-only, except while in the trap handler where it is writable. The value is a byte address and must be 256byte aligned. If the wave has no scratch space allocated, then reading FLAT_SCRATCH returns zero. The value for FLAT_SCRATCH is computed in hardware and initialized for any wave that has scratch space allocated: scratch_base = scratch_base[63:0] + spi_scratch_offset[31:0] 3.4. Wave State Registers 24 of 597 "RDNA3" Instruction Set Architecture FLAT_SCRATCH_LO = scratch_base [31:0] FLAT_SCRATCH_HI = scratch_base [63:32] 3.4.8. Hardware Internal Registers These registers are read-only and can be accessed by the S_GETREG instruction. They return information about hardware allocation and status. HW_ID and the various *_BASE values are not predictable and may change over the lifetime of a wave if context-switching can occur. HW_ID1 Field Bits Description WAVE_ID 4:0 Wave id within the SIMD. SIMD_ID 9:8 SIMD_ID within the WGP: [0] = row, [1] = column. WGP_ID 13:10 Physical WGP ID. SA_ID 16 Shader Array ID SE_ID 20:18 Shader Engine ID DP_RATE 31:29 Number of double-precision float units per SIMD. 1+log2(#DP-alu’s). 0=none, 1=1/32rate (1 dp lane/clk), 2=1/16 rate (2 dp lanes/clk), 3=1/8, 4=1/4, 5=1/2, 6=full rate (32 dp lanes per clock). HW_ID2 Field Bits Description QUEUE_ID 3:0 Queue_ID (also encodes shader stage) PIPE_ID 5:4 Pipeline ID ME_ID 9:8 MicroEngine ID: 0 = graphics, 1 & 2 = ACE compute STATE_ID 14:12 State context ID WG_ID 20:16 Work-group ID (0-31) within the WGP. VM_ID 27:24 Virtual Memory ID Other S_GETREG, S_SETREG targets: Register Bits Description FLUSH_IB 1 Writing this with bit[0]=1 flushes the instruction fetch buffers for the targeted wave. SH_MEM_BASES 16, 16 Per-VMID register, readable by the shader, which holds the private and shared apertures. PC_LO PC_HI 32 32 Program counter low and high halves. GETREG should not be used to read the PC use S_GETPC instead. FLAT_SCRATCH_HI FLAT_SCRATCH_LO 32 32 Flat scratch base address. Only writable when in trap handler Note: TMA and TBA are read using S_SENDMSG_RTN. 3.4.9. Trap and Exception registers Each type of exception can be enabled or disabled independently by setting, or clearing, bits in the TRAPSTS register’s EXCP_EN field. This section describes the registers that control and report shader exceptions. Trap temporary SGPRs (TTMP*) are privileged for writes - they can be written only when in the trap handler 3.4. Wave State Registers 25 of 597 "RDNA3" Instruction Set Architecture (STATUS.PRIV = 1). TTMPs can be read by the user shader. When the shader is not privileged (STATUS.PRIV==0), writes to these are ignored. TMA and TBA are read-only; they can be accessed through S_SENDMSG_RTN. When a trap is taken (either user initiated, exception or host initiated), the shader hardware generates an S_TRAP instruction. This loads trap information into a pair of SGPRS: {TTMP1, TTMP0} = {7'h0, HT[0],trapID[7:0], PC[47:0]}. HT is set to one for host initiated traps, and zero for user traps (s_trap) or exceptions. TRAP_ID is zero for exceptions, or the user/host trapID for those traps. STATUS . TRAP_EN This bit tells the shader whether or not a trap handler is present. When one is not present, traps are not taken no matter whether they’re floating point, user or host-initiated traps. When the trap handler is present, the wave uses an extra 16 SGPRs for trap processing. If trap_en == 0, all traps and exceptions are ignored, and s_trap is converted by hardware to NOP. MODE . EXCP_EN[8:0] Exception enable mask. Defines which of the sources of exception cause the shader to jump to the trap handler when the exception occurs. 1 = enable traps; 0 = disable traps. MEMVIOL and Illegal-Instruction jump to the trap handler and cannot be masked off. Bit Exception Cause Result 0 invalid operand is invalid for operation: 0 * inf, 0/0, sqrt(-x), any input QNaN is SNaN. 1 Input Denormal one or more operands was subnormal 2 Divide by zero Float X / 0 correct signed infinity 3 overflow The rounded result would be larger than the largest finite number. Depends on rounding mode. Signed max# or infinity. 4 underflow The exact or rounded result is less than the smallest normal (non-subnormal) representable number. subnormal or zero 5 inexact The rounded result of a valid operation is different from the infinitely precise result. Operation result 6 integer divide Integer X / 0 by zero 7 address watch VMEM or SMEM has witnessed a thread access an 'address of interest' 8 reserved ordinary result undefined TRAPSTS Register TRAPSTS contains information about traps and exceptions, and may be written by user shader or trap handler. 3.4. Wave State Registers 26 of 597 "RDNA3" Instruction Set Architecture Field Bit Pos Description EXCP 8:0 Status bits of which exceptions have occurred. These bits are sticky and accumulate results until the shader program clears them. These bits are accumulated regardless of the setting of EXCP_EN. These can be read or written without shader privilege. Bit Exception 0 invalid 1 Input Denormal 2 Divide by zero 3 overflow 4 underflow 5 inexact 6 integer divide by zero 7 address watch 8 memory violation SAVECTX 10 A bit set by the host command via GRBM (or context-save/restore unit) indicating that this wave must jump to its trap handler and save its context. This bit should be cleared by the trap handler using S_SETREG. ILLEGAL_INST 11 An illegal instruction has been detected. If a trap handler is present and the wave is not in the trap handler: jump to the trap handler; Otherwise, send an interrupt and halt. ADDR_WATCH1-3 14:12 Indicates that address watch 1, 2 or 3 have been hit. [12]=addr_watch1. Addr_watch0 is indicated by the existing bit TRAPSTS.EXCP[7]. BUFFER_OOB 15 Buffer Out Of Bounds indicator. Set when a buffer (MUBUF, MTBUF) instruction requests an address that is out of bounds. Does not cause a trap. Status bit is sticky. HOST_TRAP 16 Trap handler has been called to service a host trap. Trap may simultaneously have been called to handle other traps as well WAVE_START 17 Trap handler has been called before the first instruction of a new wave. WAVE_END 18 Trap handler has been called after the last instruction of a wave. TRAP_AFTER_INST 20 Trap handler has been called due to "trap after instruction" mode 3.4.10. Time There are two methods for measuring time in the shader: • "TIME" - measure cycles in graphics core clocks (20 bit counter) • "REALTIME" - measure time based on a fixed frequency, constantly running clock (typically 100MHz), providing a 64bit value. Shader programs have access to a free-running clock counter in order to measure the duration of portions of a wave’s execution. This counter can be read via: "S_GETREG S0, SHADER_CYCLES" and returns a 20-bit cycle counter value. This counter is not synchronized across different SIMDs and should only be used to measure time-delta within one wave. Reading the counter is handled through the SALU which has a typical latency of around 8 cycles. For measuring time between different waves or SIMDs, or to reference a clock that does not stop counting when the chip is idle, use "REALTIME". Real-time is a clock counter that comes from the clock-generator and runs at a constant speed, regardless of the shader or memory clock speeds. This counter can be read by: 3.4. Wave State Registers 27 of 597 "RDNA3" Instruction Set Architecture S_SENDMSG_RTN_B64 S[2:3] REALTIME S_WAITCNT LGKMcnt == 0 3.5. Initial Wave State Before a wave begins execution, some of the state registers including SGPRs and VGPRs are initialized with values derived either from state data, dynamic or derived data (e.g. interpolants or unique per-wave data). The values are derived from register state and dynamic wave-launch state. Note that some of this state is common across all waves in a draw call, and other state is unique per wave. This section describes what state is initialized per shader stage. Note that as usual in this spec, the shader stages refer to hardware shader stages and these often are not identical to software shader stages. State initialization is controlled by state registers which are defined in other documentation. 3.5.1. EXEC initialization Normally, EXEC is initialized with the mask of which threads are active in a wave. There are, however, cases where the EXEC mask is initialized to zero indicating that this wave should do no work and exit immediately. These are referred to as "Null waves" (EXEC==0) and exit immediately after starting execution. 3.5.2. FLAT_SCRATCH Initialization Waves that have scratch memory space allocated to them are initialized with their FLAT_SCRATCH register having a pointer to the address in global memory. Waves without scratch have this initialized to zero. 3.5.3. SGPR Initialization SGPRs are initialized based on various SPI_PGM_RSRC* or COMPUTE_PGM_* register settings. Note that only the enabled values are loaded, and they are packed into consecutive SGPRs, skipping over disabled values regardless of the number of user-constants loaded. No SGPRs are skipped for alignment. The tables below show how to control which values are initialized prior to shader launch. 3.5.3.1. Pixel Shader (PS) Table 8. PS SGPR Load SGPR Order Description Enable First 0..32 of User data registers SPI_SHADER_PGM_RSRC2_PS.user_sgpr then {bc_optimize, prim_mask[14:0], lds_offset[15:0]} N/A then {ps_wave_id[9:0], ps_wave_index[5:0]} SPI_SHADER_PGM_RSRC2_PS.wave_cnt_en 3.5. Initial Wave State 28 of 597 "RDNA3" Instruction Set Architecture SGPR Order Description Enable then Provoking Vtx Info: {prim15[1:0], prim14[1:0], …, prim0[1:0]} SPI_SHADER_PGM_RSRC1_PS . LOAD_PROVOKING_VTX PS_wave_index is (se_id[1:0] * GPU__GC__NUM_PACKER_PER_SE + packer_id). PS_wave_id is an index value which is incremented for every wave. There is a separate counter per packer, so the combination of { ps_wave_id, ps_wave_index } forms a unique ID for any wave on the chip. The wave-id counter wraps at SPI_PS_MAX_WAVE_ID. 3.5.3.2. Geometry Shader (GS) ES and GS are launched as a combined wave, of type GS. The shader is initialized as a GS wave type, with the PC pointing to the ES shader and with GS user-SGPRs preloaded, along with a memory pointer to more GS user SGPRs. The shader executes to the ES program first, then upon completion executes the GS shader. Once the ES shader completes, it may re-use the SGPRs which contain ES user data and the GS shader address. The first 8 SGPRs are automatically initialized - no values are skipped (unused ones are written with zero). State registers: • SPI_SHADER_PGM_{LO,HI}_ES : address of the GS shader • SPI_SHADER_PGM_RSRC1: resources of combined ES + GS shader ◦ GS_VGPR_COMP_CNT = # of GS VGPRs to load (2 bits) • SPI_SHADER_PGM_RSRC2: resources of combined ES + GS shader ◦ VGPR_COMP_CNT = # of VGPRs to load (2 bits) ◦ OC_LDS_EN • SPI_SHADER_PGM_RSRC{3,4}: resources of combined ES + GS shader Table 9. GS SGPR Load SGPR # GS with FAST_LAUNCH != 2 GS with FAST_LAUNCH == 2 Enable 0 GS Program Address [31:0] comes from: SPI_SHADER_PGM_LO_GS GS Program Address [31:0] comes from: SPI_SHADER_PGM_LO_GS automatically loaded 1 GS Program Address [63:32] comes from: SPI_SHADER_PGM_HI_GS GS Program Address [63:32] comes from: SPI_SHADER_PGM_HI_GS automatically loaded 2 {1’b0, gsAmpPrimPerGrp[8:0], 32’h0 1’b0, esAmpVertPerGrp[8:0], ordered_wave_id[11:0]} Must not be overwritten, in some cases listed below. 3 { TGsize[3:0], WaveInGroup[3:0], 8’h0, gsInputPrimCnt[7:0], esInputVertCnt[7:0] } { TGsize[3:0], WaveInGroup[3:0], 24’h0 } automatically loaded. 4 Off-chip LDS base [31:0] { TGID_Y[15:0], TGID_X[15:0] } SPI_SHADER_PGM_RSRC2_GS.oc_lds_en 5 { 17’h0, attrSgBase[14:0] } { TGID_Z[15:0], 1’b0, attrSgBase[14:0] } - SPI is loading flat_scratch[63:0] at this time - 6 7 3.5. Initial Wave State - 29 of 597 "RDNA3" Instruction Set Architecture SGPR # GS with FAST_LAUNCH != 2 GS with FAST_LAUNCH == 2 8 - (up to) 39 User data registers of GS shader User data registers of GS shader SPI_SHADER_PGM_RSRC2_GS.user_sgpr Enable When stream-out is used, SGPR[2] must not be modified or overwritten any time before the final stream out is issued (GDS ordered count with 'done' = 1). This is because the pipeline reset sequence which hardware automatically executes reads SGPR to fabricate a GDS-ordered-count instruction and relies on this value. 3.5.3.3. Front End Shader (HS) LS and HS are launched as a combined wave, of type HS. The shader is initialized as an HS wave type, with the PC pointing to the LS shader and with HS user-SGPRs preloaded, along with a memory pointer to more HS user SGPRs. The shader executes to the LS program first, then upon completion executes the HS shader. Once the LS shader completes, it may re-use the SGPRs which contain LS user data and the HS shader address. The first 8 SGPRs are automatically initialized - no values are skipped (unused ones are written with zero). Other registers: • SPI_SHADER_PGM_{LO,HI}_LS : address of the LS shader • SPI_SHADER_PGM_RSRC1: resources of combined LS + HS shader ◦ LS_VGPR_COMP_CNT = # of LS VGPRs to load (2 bits) • SPI_SHADER_PGM_RSRC{2,3,4}: resources of combined LS + HS shader Table 10. HS (LS) SGPR Load SGPR # Description Enable 0 HS Program Address Low ([31:0]) SPI_SHADER_USER_DATA_LO_HS 1 HS Program Address High ([63:32]) SPI_SHADER_USER_DATA_HI_HS 2 Off-chip LDS base [31:0] automatically loaded 3 {first_wave[0], lshs_TGsize[6:0], lshs_PatchCount[7:0], HS_vertCount[7:0], LS_vertCount[7:0]} automatically loaded 4 TF buffer base [15:0] automatically loaded 5 { 27’b0, wave_id_in_group[4:0] } SPI_SHADER_PGM_RSRC2_HS.scratch_en 8 - (up to) 39 User data registers of HS shader SPI_SHADER_PGM_RSRC2_HS.user_sgpr 3.5.3.4. Compute Shader (CS) Table 11. CS SGPR Load SGPR Order Description Enable First 0.. 16 of User data registers COMPUTE_PGM_RSRC2.user_sgpr then work_group_id0[31:0] COMPUTE_PGM_RSRC2.tgid_x_en then work_group_id1[31:0] COMPUTE_PGM_RSRC2.tgid_y_en then work_group_id2[31:0] COMPUTE_PGM_RSRC2.tgid_z_en then {first_wave, 6’h00, wave_id_in_group[4:0], 2’h0, ordered_append_term[11:0], workgroup_size_in_waves[5:0]} COMPUTE_PGM_RSRC2.tg_size_en 3.5. Initial Wave State 30 of 597 "RDNA3" Instruction Set Architecture 3.5.4. Which VGPRs Get Initialized The table shows the VGPRs which may be initialized prior to wave launch. COMPUTE_PGM_RSRC* or SPI_SHADER_PGM_RSRC* control registers can select a reduced set per shader stage. 3.5.4.1. Pixel Shader VGPR Input Control Pixel Shader VGPR input loading is quite a bit more complicated. There is a CAM which maps VS outputs to PS inputs. Of the PS inputs which need loading, they are loaded in this order: I persp sample J persp sample I persp center J persp center I persp centroid J persp centroid I/W J/W 1/W I linear sample J linear sample I linear center J linear center I linear centroid J linear centroid Line stipple X float Y float Z float W float Facedness Ancillary: RTA, ISN, PT, eye-id Sample mask X/Y fixed Two registers (SPI_PS_INPUT_ENA and SPI_PS_INPUT_ADDR) control the enabling of IJ calculations and specifying of VGPR initialization for PS waves. SPI_PS_INPUT_ENA is used to determine what gradients are enabled for setup, whether per-pixel Z is enabled, what terms are calculated and/or passed through the barycentric logic, and what is loaded into VGPR for PS. SPI_PS_INPUT_ADDR can be used to manipulate the VGPR destination of terms that are enabled by INPUT_ENA, typically providing a way to maintain consistent VGPR addressing when terms are removed from INPUT_ENA. It is valid to set a bit in ADDR when the corresponding bit in ENA is not set, but if the ENA bit is set then the corresponding bit in ADDR must also be set. The two Pixel Staging Register (PSR) control registers contain an identical set of fields and consist of the following: 3.5. Initial Wave State 31 of 597 "RDNA3" Instruction Set Architecture Field Name IJ / VGPR Terms BITS VGPR Dest with Full Load PERSP_SAMPLE_ENA PERSP_SAMPLE I 32 VGPR0 PERSP_SAMPLE J 32 VGPR1 PERSP_CENTER I 32 VGPR2 PERSP_CENTER J 32 VGPR3 PERSP_CENTROID I 32 VGPR4 PERSP_CENTROID J PERSP_CENTER_ENA PERSP_CENTROID_ENA 32 VGPR5 PERSP_PULL_MODEL_ENA PERSP_PULL_MODEL I/W 32 VGPR6 PERSP_PULL_MODEL J/W 32 VGPR7 PERSP_PULL_MODEL 1/W 32 VGPR8 LINEAR_SAMPLE I 32 VGPR9 LINEAR_SAMPLE J 32 VGPR10 LINEAR_CENTER I 32 VGPR11 LINEAR_CENTER J 32 VGPR12 LINEAR_CENTROID I 32 VGPR13 LINEAR_SAMPLE_ENA LINEAR_CENTER_ENA LINEAR_CENTROID_ENA LINEAR_CENTROID J 32 VGPR14 LINE_STIPPLE_TEX_ENA LINE_STIPPLE_TEX 32 VGPR15 POS_X_FLOAT_ENA POS_X_FLOAT 32 VGPR16 POS_Y_FLOAT_ENA POS_Y_FLOAT 32 VGPR17 POS_Z_FLOAT_ENA POS_Z_FLOAT 32 VGPR18 POS_W_FLOAT_ENA POS_W_FLOAT 32 VGPR19 FRONT_FACE_ENA FRONT_FACE 32 VGPR20 ANCILLARY_ENA RTA_Index[28:16], Sample_Num[11:8], Eye_id[7], VRSrateY[5:4], VRSrateX[3:2], Prim Typ[1:0] 29 VGPR21 SAMPLE_COVERAGE_ENA SAMPLE_COVERAGE 16 VGPR22 POS_FIXED_PT_ENA Position {Y[16], X[16]} 32 VGPR23 The above table shows VGPR destinations for PS when all possible terms are enabled. If PS_INPUT_ADDR == PS_INPUT_ENA, then PS VGPRs pack towards VGPR0 as terms are disabled, as shown in the table below: Field Name ENA ADDR IJ / VGPR Terms VGPR Dest PERSP_SAMPLE_ENA 1 1 PERSP_SAMPLE I VGPR0 PERSP_SAMPLE J VGPR1 PERSP_CENTER I VGPR2 PERSP_CENTER J VGPR3 PERSP_CENTROID I X PERSP_CENTROID J X PERSP_PULL_MODEL I/W X PERSP_PULL_MODEL J/W X PERSP_PULL_MODEL 1/W X LINEAR_SAMPLE I X LINEAR_SAMPLE J X LINEAR_CENTER I X LINEAR_CENTER J X PERSP_CENTER_ENA PERSP_CENTROID_ENA PERSP_PULL_MODEL_ENA LINEAR_SAMPLE_ENA LINEAR_CENTER_ENA 3.5. Initial Wave State 1 0 0 0 0 1 0 0 0 0 32 of 597 "RDNA3" Instruction Set Architecture Field Name ENA ADDR IJ / VGPR Terms VGPR Dest LINEAR_CENTROID_ENA 0 0 LINEAR_CENTROID I X LINEAR_CENTROID J X LINE_STIPPLE_TEX_ENA 0 0 LINE_STIPPLE_TEX X POS_X_FLOAT_ENA 1 1 POS_X_FLOAT VGPR4 POS_Y_FLOAT_ENA 1 1 POS_Y_FLOAT VGPR5 POS_Z_FLOAT_ENA 0 0 POS_Z_FLOAT X POS_W_FLOAT_ENA 0 0 POS_W_FLOAT X FRONT_FACE_ENA 0 0 FRONT_FACE X ANCILLARY_ENA 0 0 Ancil Data X SAMPLE_COVERAGE_ENA 0 0 SAMPLE_COVERAGE X POS_FIXED_PT_ENA 0 0 Position {Y[16], X[16]} X However, if PS_INPUT_ADDR != PS_INPUT_ENA then the VGPR destination of enabled terms can be manipulated. An example is this is shown in the table below: Field Name ENA ADDR IJ / VGPR Terms VGPR Dest PERSP_SAMPLE_ENA 1 1 PERSP_SAMPLE I VGPR0 PERSP_SAMPLE J VGPR1 PERSP_CENTER_ENA 1 1 PERSP_CENTER I VGPR2 PERSP_CENTER J VGPR3 PERSP_CENTROID I VGPR4 skipped PERSP_CENTROID J VGPR5 skipped PERSP_PULL_MODEL I/W VGPR6 skipped PERSP_PULL_MODEL J/W VGPR7 skipped PERSP_PULL_MODEL 1/W VGPR8 skipped LINEAR_SAMPLE I X LINEAR_SAMPLE J X LINEAR_CENTER I X LINEAR_CENTER J X LINEAR_CENTROID I VGPR9 skipped LINEAR_CENTROID J VGPR10 skipped PERSP_CENTROID_ENA PERSP_PULL_MODEL_ENA 0 0 1 1 LINEAR_SAMPLE_ENA 0 0 LINEAR_CENTER_ENA 0 0 LINEAR_CENTROID_ENA 0 1 LINE_STIPPLE_TEX_ENA 0 1 LINE_STIPPLE_TEX VGPR11 skipped POS_X_FLOAT_ENA 1 1 POS_X_FLOAT VGPR12 POS_Y_FLOAT_ENA 1 1 POS_Y_FLOAT VGPR13 POS_Z_FLOAT_ENA 0 0 POS_Z_FLOAT X POS_W_FLOAT_ENA 0 0 POS_W_FLOAT X FRONT_FACE_ENA 0 0 FRONT_FACE X ANCILLARY_ENA 0 0 Ancil Data X SAMPLE_COVERAGE_ENA 0 0 SAMPLE_COVERAGE X POS_FIXED_PT_ENA 0 0 Position {Y[16], X[16]} X 3.5.5. LDS Initialization Only pixel shader (PS) waves have LDS pre-initialized with data before the wave launches. For PS wave, LDS is preloaded with vertex parameter data that can be interpolated using barycentrics (I and J) to compute per-pixel parameters. 3.5. Initial Wave State 33 of 597 "RDNA3" Instruction Set Architecture Chapter 4. Shader Instruction Set This chapter describes the shader instruction set. Instructions are divided into the following groups: • Program Flow • Scalar ALU • Scalar memory read from constant cache • Vector ALU & Parameter-Interpolate • Vector Memory read/write : ◦ buffers ◦ Flat, Global and Scratch ◦ LDS • GDS • Misc: wait on counter, barrier, send message Instructions are encoded in various microcode formats. The formats are defined by a set of "encoding" bits (in red) that define the family of instructions and the meaning of the rest of the bits in the instruction. Not every instruction uses every field in its encoding. Fields which can specify an SGPR as a source or dest are typically set to NULL when unused; other fields are typically set to zero. 4.1. Common Instruction Fields "inline constant" - a constant specified in place of a source argument, # 128-248. E.g 1.0, -0.5, 32 etc. Float constants work with single, double and 16bit float instructions, and when used in non-float instructions, the data is not converted (remains a float). Float constants are encoded according to the size of the source operand. For 16-bit operations (both packed and non-packed), a float constant is treated as zero-extended 32-bit data, i.e. with the 16-bit floating point in the low bits and zeros in the high bits. Integer constants used with 32-bit or smaller operands are treated as 32-bit signed integers. Integer constants are signed extended for 64-bit sources. "literal constant" - a 32-bit constant in the instruction stream immediately after a 32- or 64-bit instruction. When used in a 64-bit signed integer operation, it is sign-extended to 64 bits. For unsigned 64-bit integer ops (and 64-bit binary ops) it is zero extended. When used in a double-float operation, the 32-bit literal is the most-significant bits, and the LSBs are zero. Other operations (32 bits or less, or packed math) treat it as 32-bit data. 4.1. Common Instruction Fields 34 of 597 "RDNA3" Instruction Set Architecture Vector Source (when 9 bits) Scalar Scalar Source (8 Dest (7 bits) bits) Code Meaning 0-105 SGPR 0 .. 105 SGPRs. One DWORD each. 106 VCC_LO VCC[31:0] 107 VCC_HI VCC[63:32] 108-123 ttmp0 .. ttmp15 Trap handler temporary SGPRs (privileged) 124 NULL Reads return zero, writes are ignored. When used as a destination, nullifies the instruction. 125 M0 Temporary register, use for a variety of functions 126 EXEC_LO EXEC[31:0] 127 EXEC_HI EXEC[63:32] 0 Inline constant zero int 1 .. 64 Integer inline constants Integer 128 Inline 129-192 Constants 193-208 int -1 .. -16 209-232 Reserved Reserved 233 DPP8 8-lane DPP (only valid as SRC0) 234 DPP8FI 8-lane DPP with Fetch-Invalid (only valid as SRC0) 235 SHARED_BASE Memory Aperture Definition 236 SHARED_LIMIT 237 PRIVATE_BASE 238 PRIVATE_LIMIT 239 Reserved Reserved Float 240 Inline 241 Constants 242 0.5 243 -1.0 Inline floating point constants. Can be used in 16, 32 and 64 bit floating point math. They may be used with non-float instructions but the value remains a float. 244 2.0 245 -2.0 246 4.0 247 -4.0 248 1.0 / (2 * PI) 249 Reserved Reserved 250 DPP16 data parallel primitive 251 Reserved Reserved 252 Reserved Reserved 253 SCC { 31’b0, SCC } 254 Reserved Reserved 255 Literal constant 32 bit constant from instruction stream 256 - 511 VGPR 0 .. 255 Vector GPRs. One DWORD each. Vector Src/Dst (8 bits) -0.5 1.0 1/(2*PI) is 0.15915494. The hex values are: half: 0x3118 single: 0x3e22f983 double: 0x3fc45f306dc9c882 4.1.1. Cache Controls: SLC, GLC and DLC Scalar and vector memory instructions contain bits that control cache behavior. The SLC, GLC and DLC instruction bits influence cache behavior for loads, stores, and atomics. GLC controls the graphics first-level cache 4.1. Common Instruction Fields 35 of 597 "RDNA3" Instruction Set Architecture SLC controls the graphics L2 cache DLC controls the Memory-Attached Last-Level cache (MALL) if it is present (ignored otherwise) Typically loads use GLC=0 (except for load-acquire). GLC=1 forces a miss in the first level cache and reads data rom the L2 cache. If there was a line in the GPU L0 that matched, it is invalidated; L2 is reread. Shader LOAD ops (load, sample, gather, etc…) SRD ISA Resulting Policy in Cache SCOPE llc_ noalloc DLC SLC GLC MALL GL2 (NOA) GL1 Tex(L0) 0 or 1 0 0 0 0 LRU HIT_LRU HIT_LRU 0 or 1 0 0 1 0 LRU MISS_EVICT 0 or 1 0 1 0 0 STREAM HIT_EVICT 0 or 1 0 1 1 0 STREAM 0 or 1 1 0 0 1 0 or 1 1 0 1 0 or 1 1 1 0 or 1 1 1 2 or 3 0 2 or 3 Non-Temporal Hint MALL GL2 GL1 Tex(L0) CU no no no no MISS_EVICT DEVICE no no _NA_ _NA_ HIT_LRU CU no yes yes no MISS_EVICT MISS_EVICT DEVICE no yes _NA_ _NA_ LRU HIT_LRU HIT_LRU CU yes no no no 1 LRU MISS_EVICT MISS_EVICT DEVICE yes no _NA_ _NA_ 0 1 STREAM HIT_EVICT HIT_LRU CU yes yes yes no 1 1 STREAM MISS_EVICT MISS_EVICT DEVICE yes yes _NA_ _NA_ 0 0 1 LRU HIT_LRU HIT_LRU CU no no no no 0 0 1 1 LRU MISS_EVICT MISS_EVICT DEVICE no no _NA_ _NA_ 2 or 3 0 1 0 1 STREAM HIT_EVICT HIT_LRU CU no yes yes no 2 or 3 0 1 1 1 STREAM MISS_EVICT MISS_EVICT DEVICE no yes _NA_ _NA_ 2 or 3 1 0 0 1 LRU HIT_LRU HIT_LRU CU yes no no no 2 or 3 1 0 1 1 LRU MISS_EVICT MISS_EVICT DEVICE yes no _NA_ _NA_ 2 or 3 1 1 0 1 STREAM HIT_EVICT HIT_LRU CU yes yes yes no 2 or 3 1 1 1 1 STREAM MISS_EVICT MISS_EVICT DEVICE yes yes _NA_ _NA_ • For S_BUFFER_LOAD instructions, LLC_NOALLOC comes from V#.LLC_noalloc. For S_LOAD, LLC_NOALLOC is zero. • SMEM operations have SLC set to zero. Shader STORE / ATOMIC ops (all are device scope) SRD ISA Policy in Cache Non-Temporal Hint llc_ noalloc DLC SLC MALL (NOA) GL2 MALL GL2 0 or 2 0 0 0 LRU no no 0 or 2 0 1 0 STREAM no yes 0 or 2 1 0 1 LRU yes no 0 or 2 1 1 1 STREAM yes yes 1 or 3 0 0 1 LRU no no 1 or 3 0 1 1 STREAM no yes 1 or 3 1 0 1 LRU no no 1 or 3 1 1 1 STREAM no yes "Temporal Hint" = expect data to have temporal reuse. "SRD" = Shader Resource Descriptor • ISA.GLC ⇒ this is a scope bit for load operations (including sample, gather, etc…) ◦ 0 : CU (work-group) scope ◦ 1 : DEVICE scope 4.1. Common Instruction Fields 36 of 597 "RDNA3" Instruction Set Architecture ◦ All stores/atomic ops are device scope (GLC has non-perf related functionality) • ISA.SLC ⇒ Temporal Hint for graphic client caches ◦ 0 : Regular ◦ 1 : Stream (non-temporal) • ISA.DLC ⇒ Temporal Hint for Infinity Cache ◦ 0 : Regular ◦ 1 : Non-temporal GLC is used by atomics to indicate: • 0: return nothing • 1: return pre-operation value from memory to VGPR 4.1. Common Instruction Fields 37 of 597 "RDNA3" Instruction Set Architecture Chapter 5. Program Flow Control Program flow control is programmed using scalar ALU instructions. This includes loops, branches, subroutine calls, and traps. The program uses SGPRs to store branch conditions and loop counters. Constants can be fetched from the scalar constant cache directly into SGPRs. 5.1. Program Control The instructions in the table below control the priority and termination of a shader program, as well as provide support for trap handlers. Table 12. Wave Termination and Traps Instructions Description S_ENDPGM Terminates the wave. It can appear anywhere in the shader program and can appear multiple times. S_ENDPGM_SAVED Terminates the wave due to context save. Intended for use only within the trap handler. S_TRAP Jump to the trap handler and pass in 8-bit TRAP id from SIMM[7:0]. It does not affect SCCZ. {TTMP1,TTMP0} = {7'h0,HT[0],trapID[7:0],PC[47:0]} PC = TBA (trap base address) PRIV = 1 "HT" : 1 = this is a host-initiated trap, 0 = user (s_trap). Host traps cause the shader hardware to generate an S_TRAP instruction. Note: the save-PC points to the S_TRAP instruction. TRAPID 0 is reserved for hardware use. S_RFE_B64 Return from exception (trap handler) and continue. Start executing at PC (trap handler must increment PC past the faulting instruction). MOVE PC, ; STATUS.PRIV = 0. This instruction may only be used within a trap handler. S_SETKILL Set the KILL bit to 1, causing the shader to s_endpgm immediately. Used primarily for debugging 'kill' wave-command behavior. S_SETHALT Set the HALT bit to the value of SIMM16[0]. Setting to 1 halts the shader when PRIV=0 (not in trap handler); setting to 0 resumes the shader (can only occur in trap handler). Fatal Halt control: SIMM16[2] 1 : set fatal halt; 0 : clear fatal halt. Table 13. Dependency, Delay and Scheduling Instructions Instructions Description S_NOP NOP. Repeat SIMM16[3:0] times. (1..16) Like a short version of S_SLEEP S_SLEEP Cause a wave to sleep for approx. 64*SIMM16[6:0] clocks. "s_sleep 0" sleeps the wave for 0 cycles. S_WAKEUP Causes one wave in a work-group to signal all other waves in the same work-group to wake up from S_SLEEP early. If waves are not sleeping, they are not affected by this instruction. S_SETPRIO Set 2-bits of USER_PRIO: user-settable wave priority. 0 = low, 3 = high. Overall wave priority is: {MIN(3,(SysPrio[1:0] + UserPrio[1:0])), WaveAge[3:0]} 5.1. Program Control 38 of 597 "RDNA3" Instruction Set Architecture Instructions Description S_CLAUSE Begin a clause consisting of instructions matching the instruction after the s_clause. The clause length is: (SIMM16[5:0] + 1), and clauses must be between 2 and 63 instructions. SIMM16[5:0] must be 1-62, not 0 or 63. The clause breaks after every N instructions, N = simm[11:8] (0 - 15; 0 = no breaks) S_BARRIER Synchronize waves within a work-group. If not all waves in group have been created yet, waits for entire group before proceeding. Waves that have ended do not prevent barriers from being satisfied. Waves not in a work-group (or work-group size = 1 wave), treat this as S_NOP. Table 14. Control Instructions Instructions Description S_VERSION Does nothing (treated as S_NOP), but can be used as a code comment to indicate the hardware version the shader is compiled for (using the SIMM16 field). S_CODE_END Treated as an illegal instruction. Used to pad past the end of shaders. S_SENDMSG Send a message upstream to the Interrupt handler or dedicated hardware. SIMM[9:0] is an immediate value holding the message type. There is no "s_waitcnt" enforced before this. S_SENDMSG_RTN_B32 S_SENDMSG_RTN_B64 Send a message upstream to that requests that some data be returned to an SGPR. Uses LGKMcnt to track when data is returned. (or an aligned SGPR-pair for "_B64"). SDST = SGPR to return to. SSRC0 = enum, not an SGPR with the code for what data is requested. (see the message table below). If this is used to write VCC, then VCCZ is undefined. S_SENDMSGHALT S_SENDMSG and then HALT. S_ICACHE_INV Invalidate first-level shader instruction cache for the WGP associated with this wave. 5.2. Instruction Clauses An instruction clause is a group of instructions of the same type that are to be executed in an uninterrupted sequence. Normally hardware may interleave instructions from different waves, but a clause can be used to override that behavior and force the hardware to service only one wave for a given instruction type for the duration of the clause, even if that leaves the execution hardware idle. Clauses are defined and started using the S_CLAUSE instruction, and must contain only a single type of instruction. The clause-type is implicitly defined by the type of instruction immediately following the clause. Clause Types are: • Image (no sampler) load • Image store • Image atomic • Image sample • Buffer / Global / Scratch load • Buffer / Global / Scratch store • Buffer / Global / Scratch atomic • Flat load • Flat store • Flat atomic • LDS load / store / atomic / bvh_stack • IMAGE_BVH 5.2. Instruction Clauses 39 of 597 "RDNA3" Instruction Set Architecture • SMEM • VALU May also be in a clause ("clause internal instructions"): • S_DELAY_ALU is legal inside a clause (internal) but is pointless. ◦ S_DELAY_ALU must not occur within a VALU clause. • S_NOP and S_SLEEP may be used inside a clause, but the first instruction of the clause must be the clausetype instruction (ALU, memory). Cannot be in a clause: • Instructions of a different type those of the clause type are illegal • S_CLAUSE • S_ENDPGM • SALU, Export, branch, message, GDS, lds_param_load, lds_direct_load • S_WAITCNT, S_WAIT_IDLE, S_WAIT_DEPCTR S_CLAUSE defines both the total length of the clause, and how often it should be broken to allow other waves a chance to go. For instance, it could say: clause of 16 instructions, but break after every 4th to allow a higher priority wave to get access to the execution unit. "clause internal instructions" count against this clause size. If a clause defines regular clause breaks (e.g. a clause of 16 instructions, but break every 4th), the first instruction of each sub-clause (every 4 instructions) must be of the clause-type, not a "clause internal instruction". Each group of instructions must have at least two of the clause-type of instructions. E.g. a clause of 12 VALU instructions broken up into 4 groups of 3 instructions - each group of 3 instructions must have at least two VALU instructions. Clause groups with only 1 VALU instruction per group make no sense - they are no longer a clause. If the first instruction in a VALU clause has EXEC==0, then the clause is ignored and instructions are issued as if there were no clause. If the VALU clause starts with EXEC!=0 but EXEC becomes zero in the middle of the clause, the clause continues until the last instruction of the specified clause. If an S_DELAY_ALU is needed before starting a clause, the order must be: S_DELAY_ALU // must not come immediately after S_CLAUSE - that inst declares clause type S_CLAUSE If the first instruction after S_CLAUSE is skipped (e.g. due to EXEC==0, or VMEM-load skipped due to EXEC==0 and VMcnt==0) then then a clause is not started. Subsequent instructions within what would have been the clause that are not skipped and are still executed but individually, not as part of a clause. 5.2.1. Clause Breaks The following conditions can break a clause: 1. VALU exception (trap) breaks a VALU clause 2. Host commands to wave (halt, resume, single step, etc) breaks all active clauses. Context-save breaks clauses of affected waves. This allows the host to read and write SGPRs & VGPRs while debugging. If clauses were not broken by host 5.2. Instruction Clauses 40 of 597 "RDNA3" Instruction Set Architecture commands, the GPRs could not be read from waves other than the one currently in a clause. If a wave halts or is kill, its clauses are ended. 3. Any action that cause a wave to jump to its trap handler breaks clause (includes context-save). A wave entering HALT (including for host-initiated single-step) may break clauses. 5.3. Send Message Types S_SENDMSG is used to send messages to fixed function hardware, the host, or to request that a value be returned to the wave. S_SENDMSG encodes the message type in the SIMM16 field and the message payload in M0. S_SENDMSG_RTN encodes the message type in the SSRC0 field (does not read an SGPR), the payload (if any) in M0, and the destination SGPR in SDST. Completion is tracked with LGKMcnt. The table below lists the messages that can be generated using the S_SENDMSG command. S_SENDMSG_RTN_B* instructions return data to the shader: increment LGKMcnt by 2, and then decrement by 1 when the messages goes out, and by another 1 when the data returns. This allows the user to simply use "s_waitcnt LGKMcnt==0" to wait for the data to be returned. All message codes not listed are reserved (illegal). Table 15. S_SENDMSG Messages Message SIMM16 [7:0] Payload Reserved 0x00 Reserved Interrupt 0x01 Software-generated interrupt. M0[23:0] carries user data. ID’s are also sent (wave_id, cu_id, etc.) HS TessFactor 0x02 Indicates HS tessellation factor is all zero or one for all patches in this HS work-group. Data from M0[0]: 1 = "all are zero or one". This message is optional, but do not send more than once or from any shader stage other than HS. Dealloc VGPRs 0x03 Deallocate all VGPRs for this wave, allowing another wave to allocate these VGPRs before this wave ends. Use only when next instruction is S_ENDPGM. Typically used when a shader is waiting memory-write-acknowledgments before ending. GS alloc req 0x09 Request GS space in parameter cache. M0[9:0] = number of vertices, M0[22:12] = number of primitives. Response: a GS-alloc response to non-zero requests (broadcast to work-group). S_SENDMSG_RTN is used to send messages that return a value to the wave. The instruction specifies which SGPR receives the data in SDST field. The message is encoded in SSRC0 (in the instruction field, not in an SGPR). Table 16. S_SENDMSG_RTN Messages Message SSRC0 Payload Get Doorbell ID 0x80 Get the doorbell ID associated with this wave. (does not exist for ME0. Return 0x0bad. Also returns 0x0bad for invalid pipeID or queueID). Get Draw ID 0x81 Get the Draw or dispatch ID associated with this wave. Get TMA 0x82 Get the Trap Memory Address: [31:0] or [63:0] depending on the request size. 5.3. Send Message Types 41 of 597 "RDNA3" Instruction Set Architecture Message SSRC0 Payload Get REALTIME 0x83 Get the value of the constant frequency (REFCLK) time counter: [31:0] or [63:0] depending on the request size. Save wave 0x84 Used in context switching in indicate this wave is ready to be context saved. Only the trap handler can send this message (user shaders have this converted to MSG_ILLEGAL_RTN). Get TBA 0x85 Gets the Trap Base Address [31:0] or [63:0] depending on request size MSG_ILLEGAL _RTN 0xFF Illegal message with data return to wave 5.4. Branching Branching is done using one of the following scalar ALU instructions. "SIMM16" is a sign-extended 16 bit integer constant, treated as a DWORD offset for branches. Table 17. Branch Instructions Instructions Description S_BRANCH Unconditional branch. PC = PC + (SIMM16 * 4) + 4 S_CBRANCH_ Conditional branch. Branch only if is true. if (cond) PC = PC + (SIMM16 *4) +4; else NOP; If SIMM16=0, the branch goes to the next instruction). : SCC1, SCC0, VCCZ, VCCNZ, EXECZ, EXECNZ (SCC==1, SCC==0, VCC==0, VCC!=0, EXEC==0, EXEC!=0) S_CBRANCH_CDBGSYS Conditional branch, taken if the COND_DBG_SYS status bit is set. if (cond) PC = PC + (SIMM16 *4) +4; else NOP; = SYS, USER, SYS_AND_USER, SYS_OR_USER. S_CBRANCH_CDBGUSER Conditional branch, taken if the COND_DBG_USER status bit is set. S_CBRANCH_CDBGSYS_AND Conditional branch, taken only if both COND_DBG_SYS and COND_DBG_USER are set. _USER S_CBRANCH_CDBGSYS_OR_U Conditional branch, taken if either COND_DBG_SYS or COND_DBG_USER is set. SER S_SETPC_B64 Directly set the PC from an SGPR pair: PC = SGPR-pair S_SWAPPC_B64 Swap the current PC with an address in an SGPR pair. SWAP (PC+4, SGPR-pair). (result is: PC of this instruction + 4, zero extended) S_GETPC_B64 Retrieve the current PC value (does not cause a branch). (SGPR-pair = PC of this instruction + 4, zero extended) S_CALL_B64 Jump to a subroutine, and save return address. SGPR_pair = PC+4; PC = PC+4+SIMM16*4. For conditional branches, the branch condition can be determined by either scalar or vector operations. A scalar compare operation sets the Scalar Condition Code (SCC) which then can be used as a conditional branch condition. Vector compare operations set the VCC mask, and VCCZ or VCCNZ then can be used to determine branching. 5.5. Work-groups and Barriers Work-groups are collections of waves running on the same work-group processor that can synchronize and share data. Up to 1024 work-items (16 wave64’s or 32 wave32’s) can be combined into a work-group. When multiple waves are in a work-group, the S_BARRIER instruction can be used to force each wave to wait until all other waves reach the same instruction; then, all waves continue. Work-groups of a single wave treat all 5.4. Branching 42 of 597 "RDNA3" Instruction Set Architecture barrier instructions as S_NOP. If a wave executes an S_BARRIER before all of the waves of the work-group have been created, the wave waits until the work-group is complete. Any wave may terminate early using S_ENDPGM, and the barrier is considered satisfied when the remaining live waves reach their barrier instruction. 5.6. Data Dependency Resolution Shader hardware can resolve most data dependencies, but a few cases must be explicitly handled by the shader program. In these cases, the program must insert S_WAITCNT instructions to ensure that previous operations have completed before continuing. The shader has four counters that track the progress of issued instructions. S_WAITCNT waits for the values of these counters to be at, or below, specified values before continuing. These allow the shader writer to schedule long-latency instructions, execute unrelated work, and specify when results of long-latency operations are needed. Inserting S_NOP is not required to achieve correct operation. Table 18. Data Dependency Instructions Instructions Description S_WAITCNT Wait for count of outstanding instruction counters to be less-than or equal-to all of these values before continuing. SIMM16 = { VMcnt[5:0], LGKMcnt[5:0], 1’b0, EXPcnt[2:0] } S_WAITCNT_VSCNT Wait for VSCNT, VMCNT, EXPCNT or LGKMcnt to be less-than or equal-to the count in SIMM16 before continuing. S_WAITCNT_LGKMCNT S_WAITCNT_EXPCNT S_WAITCNT_VMCNT S_WAIT_EVENT Wait for an event to occur before proceeding SIMM16[0] : 1=don’t wait, 0= wait for export-ready; other bits are reserved. Any exception waits for this to complete before being processed, including: KILL, savecontext, host trap, memviol and anything that causes a trap to be taken. S_DELAY_ALU Insert delay between dependent SALU/VALU instructions. SIMM16[3:0] = InstID0 SIMM16[6:4] = InstSkip SIMM16[10:7] = InstID1 This instruction describes dependencies for two instructions, directing the hardware to insert delay if the dependent instruction was issued too recently to forward data to the second. For details, see: S_DELAY_ALU. S_WAITCNT* waits for outstanding instructions that use the specified counter to complete. Instructions within a type often return in the order they were issued compared to other instructions of that type, but typically return out of order with respect to instructions of different types. These counters count instructions, not threads. These are the memory instruction groups - each returns out of order with respect to the others: • VMcnt: ◦ Texture SAMPLE ◦ Texture/Buffer/Global/Scratch/Flat Loads and atomic-with-return 5.6. Data Dependency Resolution 43 of 597 "RDNA3" Instruction Set Architecture • VScnt: ◦ Texture/Buffer/Global/Scratch/Flat Stores and atomic-without-return • LGKMcnt: ◦ LDS indexed operations ◦ SMEM: scalar memory loads may return completely out-of-order with respect to other scalar memory loads ◦ GDS & GWS ◦ FLAT instructions (uses both LGKMcnt and either VMcnt or VScnt) ◦ Messages • EXPcnt: ◦ LDS parameter-load and direct-load ◦ Exports: stay in order within a type (MRT, Z, position, primitive data) but out of order between types It is possible for data to be written to VGPRs out-of-order, but the counter-decrement still reflects in-order completion. Stores from a wave are not kept in order with stores from that same wave when they write to different addresses. Simple S_WAITCNT Example global_load_b32 V0, V[4:5], 0x0 // load memory[ {V5, V4} ] into V0 global_load_b32 V1, V[4:5], 0x8 // load memory[ {V5, V4} +8 ] into V1 s_waitcnt VMcnt <= 1 // wait for first global_load to have completed v_mov_b32 // move V0 into V9 V9, V0 5.7. ALU Instruction Software Scheduling The shader program may include instructions to delay ALU instructions from being issued in order to attempt to avoid pipeline stalls caused by issuing dependent instructions too closely together. This is accomplished with the: S_DELAY_ALU instruction: "insert delay with respect to a previous VALU instruction". The compiler may insert S_DELAY_ALU instructions to indicate data dependencies that might benefit from having extra idle cycles inserted between them. This instruction is inserted before the instruction which the user wants to delay, and it specifies which previous instructions this one is dependent on. The hardware then determines the number of cycles of delay to add. This instruction is optional - it is not necessary for correct operation. It should be inserted only when necessary to avoid dependency stalls. If enough independent instructions are between dependent ones then no delay is necessary. For wave64, the user may not know the status of the EXEC mask and hence not know if instructions take 1 or 2 passes to issue. The S_DELAY_ALU instruction says: wait for the VALU-Inst N ago to have completed. To reduce instruction stream overhead, the S_DELAY_ALU instructions packs two delay values into one instruction, with a "skip" indicator so the two delayed instructions don’t need to be back-to-back. S_DELAY_ALU may be executed in zero cycles - it may be executed in parallel with the instruction before it. This avoids extra delay if no delay is needed. 5.7. ALU Instruction Software Scheduling 44 of 597 "RDNA3" Instruction Set Architecture S_DELAY_ALU InstID1[4], Skip[3], InstID0[4] // packed into SIMM16 INSTID counts backwards N VALU instructions that were issued. This means it does not count instructions which were branched over. VALU instructions skipped due to EXEC==0 do count (scoreboard immediately marked 'ready'). SKIP counts the number of instructions skipped before the instruction which has the second dependency. Every instruction is counted for skipping - all types. If another S_DELAY_ALU is encountered before the info from the previous one is consumed, the current S_DELAY_ALU replaces any previous dependency info. This means if an instruction is dependent on two separate previous instructions, both of those dependencies can be expressed in a single S_DELAY_ALU op, but not in two separate S_DELAY_ALU ops. S_DELAY_ALU is applied to any type of opcode, even non-alu (but serves no purpose). S_DELAY_ALU should not be used within VALU clauses. Table 19. S_DELAY_ALU Instruction Codes DEP Code Dep Code Meaning SKIP Code SKIP Code Meaning 0 no dependency 0 Same op. Both DEP codes apply to the next instruction 1-4 dependent on previous VALU 1-4 back 1 No skip. Dep0 applies to the following instruction, and DEP1 applies to the instruction after that one. 5-7 dependent on previous trans. VALU 1-4 back 2 Skip 1. Dep0 applies to the following instruction. Dep1 applies to 2 instructions ahead (skip 1 instruction). 8 Reserved 3-5 Skip 2-4 instructions between Dep0 and Dep1. 9-11 Wait 1-3 cycles for previous SALU ops 6 Reserved Codes 9-11: SALU ops typically complete in a single cycle, so waiting for 1 cycle is roughly equivalent to waiting for 1 SALU op to execute before continuing. 5.7. ALU Instruction Software Scheduling 45 of 597 "RDNA3" Instruction Set Architecture Chapter 6. Scalar ALU Operations Scalar ALU (SALU) instructions operate on values that are common to all work-items in the wave. These operations consist of 32-bit integer or float arithmetic, and 32- or 64-bit bit-wise operations. The SALU also can perform operations directly on the Program Counter, allowing the program to create a call stack in SGPRs. Many operations also set the Scalar Condition Code bit (SCC) to indicate the result of a comparison, a carry-out, or whether the instruction result was zero. 6.1. SALU Instruction Formats SALU instructions are encoded in one of five microcode formats, shown below: Name Size Function SOP1 32 bit SALU op with 1 input SOP2 32 bit SALU op with 2 inputs SOPK 32 bit SALU op with 1 constant signed 16-bit integer input SOPC 32 bit SALU compare op SOPP 32 bit SALU program control op Each of these instruction formats uses some of these fields: Field Description OP Opcode: instruction to be executed. SDST Destination SGPR, M0, NULL or EXEC. SSRC0 First source operand. SSRC1 Second source operand. SIMM16 Signed immediate 16-bit integer constant. The lists of similar instructions sometimes use a condensed form using curly braces { } to express a list of possible names. For example, S_AND_{B32, B64} defines two legal instructions: S_AND_B32 and S_AND_B64. 6.2. Scalar ALU Operands Valid operands of SALU instructions are: 6.1. SALU Instruction Formats 46 of 597 "RDNA3" Instruction Set Architecture • SGPRs, including trap temporary SGPRs • Mode register • Status register (read-only) • M0 register • EXEC mask • VCC mask • SCC • Inline constants: integers from -16 to 64, and select floating point values • Hardware registers (at most 1 of: EXEC, M0, SCC) • One 32-bit literal constant • If the destination is NULL, the instruction does not execute: nothing is written and SCC is not modified In the table below, 0-127 can be used as scalar sources or destinations; 128-255 can only be used as sources. Table 20. Scalar Operands 6.2. Scalar ALU Operands 47 of 597 "RDNA3" Instruction Set Architecture Code Scalar Source (8 bits) Scalar Dest 0-105 (7 bits) 106 Meaning SGPR 0 .. 105 SGPRs. One DWORD each. VCC_LO VCC[31:0] 107 VCC_HI VCC[63:32] 108-123 ttmp0 .. ttmp15 Trap handler temporary SGPRs (privileged) 124 NULL Reads return zero, writes are ignored. When used as a destination, nullifies the instruction. 125 M0 Temporary register, use for a variety of functions 126 EXEC_LO EXEC[31:0] 127 EXEC_HI EXEC[63:32] 0 Inline constant zero int 1 .. 64 Integer inline constants Integer 128 Inline 129-192 Constants 193-208 int -1 .. -16 209-232 Reserved Reserved 233 DPP8 8-lane DPP (only valid as SRC0) 234 DPP8FI 8-lane DPP with Fetch-Invalid (only valid as SRC0) 235 SHARED_BASE Memory Aperture Definition 236 SHARED_LIMIT 237 PRIVATE_BASE 238 PRIVATE_LIMIT 239 Reserved Reserved Float 240 Inline 241 Constants 242 0.5 Inline floating point constants. Can be used in 16, 32 and 64 bit floating point math. They may be used with nonfloat instructions but the value remains a float. 243 -1.0 244 2.0 245 -2.0 246 4.0 247 -4.0 248 1.0 / (2 * PI) 249 Reserved Reserved 250 DPP16 data parallel primitive 251 Reserved Reserved 252 Reserved Reserved 253 SCC { 31’b0, SCC } 254 Reserved Reserved 255 Literal constant 32 bit constant from instruction stream -0.5 1.0 1/(2*PI) is 0.15915494. The hex values are: half: 0x3118 single: 0x3e22f983 double: 0x3fc45f306dc9c882 SALU destinations are in the range 0-127. SALU instructions can use a 32-bit literal constant. This constant is part of the instruction stream and is available to all SALU microcode formats except SOPP and SOPK (except literal is allowed in S_SETREG_IMM32_B32). Literal constants are used by setting the source instruction field to "literal" (255), and then the following instruction DWORD is used as the source value. If the destination SGPR is out-of-range, no SGPR is written with the result and SCC is not updated. If an instruction uses 64-bit data in SGPRs, the SGPR pair must be aligned to an even boundary. For example, it is legal to use SGPRs 2 and 3 or 8 and 9 (but not 11 and 12) to represent 64-bit data. 6.2. Scalar ALU Operands 48 of 597 "RDNA3" Instruction Set Architecture 6.3. Scalar Condition Code (SCC) The scalar condition code (SCC) is written as a result of executing most SALU instructions. For integer arithmetic it is used as carry/borrow in for extended integer arithmetic. The SCC is set by many instructions: • Compare operations: 1 = true. • Arithmetic operations: 1 = carry out. ◦ SCC = overflow for signed add and subtract operations. For add ops, overflow = both operands are of the same sign, and the MSB (sign bit) of the result is different than the sign of the operands. For subtract (A - B), overflow = A and B have opposite signs and the resulting sign is not the same as the sign of A. • Bit/logical operations: 1 = result was not zero. 6.4. Integer Arithmetic Instructions This section describes the arithmetic operations supplied by the SALU. The table below shows the scalar integer arithmetic instructions: Table 21. Integer Arithmetic Instructions Instruction Encoding Sets SCC? Operation S_ADD_I32 SOP2 Ovfl D = S0 + S1, SCC = overflow. S_ADD_U32 SOP2 Cout D = S0 + S1, SCC = carry out. S_ADDC_U32 SOP2 Cout D = S0 + S1 + SCC, SCC = overflow. S_SUB_I32 SOP2 Ovfl D = S0 - S1, SCC = overflow. S_SUB_U32 SOP2 Cout D = S0 - S1, SCC = carry out. S_SUBB_U32 SOP2 Cout D = S0 - S1 - SCC, SCC = carry out. S_ADD_LSH{1,2,3,4}_U32 SOP2 D!=0 D = S0 + (S1 << {1,2,3,4}) S_ABSDIFF_I32 SOP2 D!=0 D = abs (S0 - S1), SCC = result not zero. S_MIN_I32 S_MIN_U32 SOP2 D!=0 D = (S0 < S1) ? S0 : S1 SCC = (S0 < S1) S_MAX_I32 S_MAX_U32 SOP2 D!=0 D = (S0 > S1) ? S0 : S1 SCC = (S0 > S1) S_MUL_I32 SOP2 No D = S0 * S1 low 32bits of result works identically for unsigned data S_ADDK_I32 SOPK Ovfl D = D + simm16, SCC = overflow. Sign extended version of simm16. S_MULK_I32 SOPK No D = D * simm16. Return low 32bits. Sign extended version of simm16. S_ABS_I32 SOP1 D!=0 D.i = abs (S0.i). SCC=result not zero. S_SEXT_I32_I8 SOP1 No D = { 24{S0[7]}, S0[7:0] }. S_SEXT_I32_I16 SOP1 No D = { 16{S0[15]}, S0[15:0] }. S_MUL_HI_I32 SOP2 No D = S0 * S1 high 32bits of result S_MUL_HI_U32 SOP2 No D = S0 * S1 high 32bits of result S_PACK_LL_B32_B16 SOP2 No D = { S1[15:0], S0[15:0] } S_PACK_LH_B32_B16 SOP2 No D = { S1[31:16], S0[15:0] } S_PACK_HL_B32_B16 SOP2 No D = { S1[15:0], S0[31:16] } 6.3. Scalar Condition Code (SCC) 49 of 597 "RDNA3" Instruction Set Architecture Instruction Encoding Sets SCC? Operation S_PACK_HH_B32_B16 SOP2 No D = { S1[31:16], S0[31:16] } 6.5. Conditional Move Instructions Conditional instructions use the SCC flag to determine whether to perform the operation, or (for CSELECT) which source operand to use. Table 22. Conditional Instructions Instruction Encoding Sets SCC? Operation S_CSELECT_{B32, B64} SOP2 No D = SCC ? S0 : S1. S_CMOVK_I32 SOPK No if (SCC) D = signext(simm16). S_CMOV_{B32,B64} SOP1 No if (SCC) D = S0, else NOP. 6.6. Comparison Instructions These instructions compare two values and set the SCC to 1 if the comparison yielded a TRUE result. Table 23. Conditional Instructions Instruction Encoding Sets SCC? Operation S_CMP_EQ_U64, S_CMP_LG_U64 SOPC Test Compare two 64-bit source values. SCC = S0 S1. S_CMP_{EQ,LG,GT,GE,LE,LT}_{I32 SOPC ,U32} Test Compare two source values. SCC = S0 S1. S_BITCMP0_{B32,B64} SOPC Test Test for "is a bit zero". SCC = !S0[S1]. S_BITCMP1_{B32,B64} SOPC Test Test for "is a bit one". SCC = S0[S1]. 6.7. Bit-Wise Instructions Bit-wise instructions operate on 32- or 64-bit data without interpreting it has having a type. For bit-wise operations if noted in the table below, SCC is set if the result is nonzero. Table 24. Bit-Wise Instructions Instruction Encoding Sets SCC? Operation S_MOV_{B32,B64} SOP1 No D = S0 S_MOVK_I32 SOPK No D = signext(simm16) {S_AND,S_OR,S_XOR}_{B32,B64} SOP2 D!=0 D = S0 & S1, S0 OR S1, S0 XOR S1 {S_AND_NOT1,S_OR_NOT1}_{B32,B64} SOP2 D!=0 D = S0 & ~S1, S0 OR ~S1 {S_NAND,S_NOR,S_XNOR}_{B32,B64} SOP2 D!=0 D = ~(S0 & S1), ~(S0 OR S1), ~(S0 XOR S1) S_LSHL_{B32,B64} SOP2 D!=0 D = S0 << S1[4:0], [5:0] for B64. S_LSHR_{B32,B64} SOP2 D!=0 D = S0 >> S1[4:0], [5:0] for B64. S_ASHR_{I32,I64} SOP2 D!=0 D = sext(S0 >> S1[4:0]) ([5:0] for I64). S_BFM_{B32,B64} SOP2 No Bit field mask D = ( (1 << S0[4:0]) -1) << S1[4:0] (uses [5:0] for the B64 version) 6.5. Conditional Move Instructions 50 of 597 "RDNA3" Instruction Set Architecture Instruction Encoding Sets SCC? Operation S_BFE_U32, S_BFE_U64 S_BFE_I32, S_BFE_I64 (signed/unsigned) SOP2 D!=0 Bit Field Extract, then sign extend result for I32/64 instructions. S0 = data, S1[22:16]= width I32/U32: S1[4:0] = offset I64/U64: S1[5:0] = offset S_NOT_{B32,B64} SOP1 D!=0 D = ~S0. S_WQM_{B32,B64} SOP1 D!=0 D = wholeQuadMode(S0) Per quad (4 bits): set the result to 1111 if any of the 4 bits in the corresponding source mask are set to 1. D[n*4] = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] ) D[n*4+1] = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] ) D[n*4+2] = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] ) D[n*4+3] = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] ) S_QUADMASK_{B32,B64} SOP1 D!=0 Create a 1-bit per quad mask from a 1 bit per pixel mask. Creates an 8-bit mask from 32-bits, or 16 bits from 64. D[0] = (S0[3:0] != 0), D[1] = (S0[7:4] != 0), … S_BITREPLICATE_B64_B32 SOP1 No Replicate each bit in 32-bit S0 twice: D = { … S0[1], S0[1], S0[0], S0[0] }. Two of these instructions is the inverse of S_QUADMASK. Two of these instructions expands a quad mask into a thread-mask. S_BREV_{B32,B64} SOP1 No D = S0[0:31] are reverse bits. S_BCNT0_I32_{B32,B64} SOP1 D!=0 D = CountZeroBits(S0). S_BCNT1_I32_{B32,B64} SOP1 D!=0 D = CountOneBits(S0). S_CTZ_I32_{B32,B64} SOP1 No Count Trailing zeroes: Find-first One from LSB. D = Bit position of first one in S0 starting from LSB. -1 if not found S_CLZ_I32_{B32,B64} SOP1 No Count Leading zeroes. D = "how many zeros before the first one starting from the MSB". Returns -1 if none. S_CLS_I32_{B32,B64} SOP1 N Count Leading Sign-bits: Count how many bits in a row (from MSB to LSB) are the same as the sign bit. Return -1 if the input is zero or all 1’s (-1). 32-bit pseudo-code: if (S0 == 0 || S0 == -1) D = -1 else D = 0 for (I = 31 .. 0) if (S0[I] == S0[31]) D++ else break S_BITSET0_{B32,B64} SOP1 No D[S0[4:0], [5:0] for B64] = 0 S_BITSET1_{B32,B64} SOP1 No D[S0[4:0], [5:0] for B64] = 1 6.7. Bit-Wise Instructions 51 of 597 "RDNA3" Instruction Set Architecture Instruction Encoding Sets SCC? Operation S_{and, or, xor, and_not0, and_not1,or_not0, or_not1, nand, nor, xnor}_SAVEEXEC_{B32,B64} SOP1 D!=0 Save the EXEC mask, then apply a bit-wise operation to it. D = EXEC EXEC = S0 EXEC SCC = (EXEC != 0) ("not1" version inverts EXEC) ("not0" version inverts SGPR) S_{AND_NOT{0,1}_WREXEC_B{32,64} SOP1 D!=0 NOT0: EXEC, D = ~S0 & EXEC NOT1: EXEC, D = S0 & ~EXEC Both D and EXEC get the same result. SCC = (result != 0). D cannot be EXEC. S_MOVRELS_{B32,B64} S_MOVRELD_{B32,B64} SOP1 No Move a value into an SGPR relative to the value in M0. MOVRELS: D = SGPR[S0+M0] MOVRELD: SGPR[D+M0] = S0 Index must be even for B64. M0 is an unsigned index. 6.8. Access Instructions These instructions access hardware internal registers. Table 25. Hardware Internal Registers Instruction Encoding Sets SCC? Operation S_GETREG_B32 SOPK No Read a hardware register into the LSBs of SDST. S_SETREG_B32 SOPK No Write the LSBs of SDST into a hardware register. (Note that SDST is used as a source SGPR). S_SETREG_IMM32_B32 SOPK No S_SETREG where 32-bit data comes from a literal constant (so this is a 64-bit instruction format). GETREG/SETREG : #SIMM16 = { Size[4:0], Offset[4:0], hwRegId[5:0] } Offset is 0..31. Size is 1..32. S_ROUND_MODE SOPP No Set the round mode from an immediate: simm16[3:0] S_DENORM_MODE SOPP No Set the denorm mode from an immediate: simm16[3:0] For hardware register index values, see Hardware Registers . 6.9. Memory Aperture Query Shaders can query the memory aperture base and size for shared and private space through scalar operands: • PRIVATE_BASE • PRIVATE_LIMIT • SHARED_BASE • SHARED_LIMIT These values originate from the SH_MEM_BASES register ("SMB"), and are used primarily with FLAT memory instructions. Setting Shared Base or Private Base to zero disables that aperture. "PTR32" is short for "Address mode is 32bit", and "SMB" is short for "SH_MEM_BASES". These constants can be 6.8. Access Instructions 52 of 597 "RDNA3" Instruction Set Architecture used by SALU and VALU ops, and are 64-bit unsigned integers: SHARED_BASE = ptr32 ? {32’h0, SMB.shared_base[15:0], 16’h0000} : {SMB.shared_base[15:0], 48’h000000000000} SHARED_LIMIT = ptr32 ? {32’h0, SMB.shared_base[15:0], 16’hFFFF} : {SMB.shared_base[15:0], 48’h0000FFFFFFFF} PRIVATE_BASE = ptr32 ? {32’h0, SMB.private_base[15:0], 16’h0000} : {SMB.private_base[15:0], 48’h000000000000} PRIVATE_LIMIT =ptr32 ? {32’h0, SMB.private_base[15:0], 16’hFFFF} : {SMB.private_base[15:0], 48’h0000FFFFFFFF} "Hole" = (addr[63:47] != all zeros or all ones) and is the illegal address section of memory 6.9. Memory Aperture Query 53 of 597 "RDNA3" Instruction Set Architecture Chapter 7. Vector ALU Operations Vector ALU instructions (VALU) perform an arithmetic or logical operations on data for each of 32 or 64 threads and write results back to VGPRs, SGPRs or the EXEC mask. Parameter interpolation is a two step process involving an LDS instruction followed by a VALU instruction and is described in: Parameter Interpolation Vector ALU (VALU) instructions control the SIMD32’s math unit and operate on 32 work-items of data at a time. Each instruction may take input from either VGPRs, SGPRs or constants and typically returns results to VGPRs. Mask results and carry-out are returned to SGPRs. The ALU provides operations that work on 16, 32 and 64-bit data of both integer and float types. The ALU also supports "packed" data types that pack 2 16-bit values into one VGPR, or 4 8-bit values into a VGPR. 7.1. Microcode Encodings VALU instructions are encoded in one of these ways: Name Size Function Modifiers VOP1 32 bit VALU op with 1 input - VOP2 32 bit VALU op with 2 inputs - VOP3 64 bit VALU op with 3 inputs, or a VOP1,2,C instruction abs, neg, omod, clamp VOP3SD 64 bit VALU op with 3 inputs and SDST neg, omod, clamp VOPC 32 bit VALU compare op with 2 inputs, writes to VCC/EXEC - VOP3P 64 bit VALU op with 3 inputs using packed math neg, clamp VOPD 64 bit VALU dual opcode : 2 operations in one instruction - Many VALU instructions are available in two encodings: VOP3 that uses 64-bits of instruction, and one of three 32-bit encodings that offer a restricted set of capabilities but smaller code size. Some instructions are only available in the VOP3 encoding. When an instruction is available in two microcode formats, it is up to the user to decide which to use. It is recommended to use the 32-bit encoding whenever possible. VOP2 can also be used 7.1. Microcode Encodings 54 of 597 "RDNA3" Instruction Set Architecture for "ACCUM" type ops where the third input is implied to be the same as the dest. Advantages of using VOP3 include: • More flexibility in source addressing (all source fields are 9 bits) • NEG, ABS, and OMOD fields (for floating point only) • CLAMP field for output range limiting • Ability to select alternate source and destination registers for VCC (carry in and out) The following VOP1 and VOP2 instructions may not be promoted to VOP3: • swap and swaprel • fmamk, fmaak, pk_fmac The VOP3 encoding has two variants: • VOP3 - used for most instructions including V_CMP*; has OPSEL and ABS fields • VOP3SD - has an SDST field instead of OPSEL and ABS. This encoding is used only for: ◦ V_{ADD,SUB,SUBREV}_CO_CI_U32, V_{ADD,SUB,SUBREV}_CO_U32 (adds with carry-out) ◦ V_DIV_SCALE_{F32, F64}, V_MAD_U64_U32, V_MAD_I64_I32. ◦ V_DOT2ACC_F32_F16 ◦ VOP3SD is not used for V_CMP*. Any of the VALU microcode formats may use a 32-bit literal constant, as well VOP3. Note however that VOP3 plus a literal makes a 96-bit instruction and excessive use of this combination may reduce performance. VOP3P is for instructions that use "packed math": instructions that performs an operation on a pair of input values that are packed into the high and low 16-bits of each operand; the two 16-bit results are written to a single VGPR as two packed values. Field Size Description OP varies instruction opcode SRC0 9 first instruction argument. May come from: vgpr, sgpr, VCC, M0, EXEC, SCC, or a constant SRC1 9 second instruction argument. May come from: vgpr, sgpr, VCC, M0, EXEC, SCC, or a constant VSRC1 8 second instruction argument. May come from: vgpr only SRC2 9 third instruction argument. May come from: vgpr, sgpr, VCC, M0, EXEC, SCC, or a constant VDST 8 VGPR that takes the result. For V_READLANE and V_CMP, indicates the SGPR that receives the result. This cannot be M0 or EXEC. SDST 8 SGPR that takes the result of operations that produce a scalar output. Can’t be M0 or EXEC. Supports NULL to not write any SDST. Used for: V_{ADD,SUB,SUBREV}_CO_U32, V_{ADD,SUB,SUBREV}_CO_CI_U32, V_DIV_SCALE*; not used for V_CMP. OMOD 2 output modifier. for float results only. 0 = no modifier, 1=multiply result by 2, 2=multiply result by 4, 3=divide result by 2 NEG 3 negate the input (invert sign bit). float inputs only. bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2. ABS 3 apply absolute value on input. float inputs only. applied before 'neg'. bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2. 7.1. Microcode Encodings 55 of 597 "RDNA3" Instruction Set Architecture Field Size Description CLMP 1 clamp or compare-signal (depends on opcode): V_CMP: clmp=1 means signaling-compare when qNaN detected; 0 = non-signaling Float arithmetic: clamp result to [0, 1.0]; -0 is clamped to +0. Signed integer arithmetic: clamp result to [min_int, +max_int] Unsigned integer arithmetic: clamp result to [0, +max_uint] Where "min_int" and "max_int" are the largest negative and positive representable integers for the size of integer being used (16, 32 or 64 bit). "max_uint" is the largest unsigned int. OPSEL 4 Operation select for 16-bit math: 1=select high half, 0=select low half [0]=src0, [1]=src1, [2]=src2, [3]=dest For dest=0, dest_vgpr[31:0] = {prev_dst_vgpr[31:16], result[15:0] } For dest=1, dest_vgpr[31:0] = {result[15:0], prev_dst_vgpr[15:0] } OPSEL may only be used for 16-bit operands, and must be zero for any other operands/results. For V_PERMLANE*, OPSEL[0] is "fetch invalid"; OPSEL[1] is "bounds control" (like DPP8). DOT2_F16 and_BF16: src0 and src1 must have OPSEL[1:0] = 0 7.2. Operands Most VALU instructions take at least one input operand. The data-size of the operands is explicitly defined in the name of the instruction. For example, V_FMA_F32 operates on 32-bit floating point data. VGPR Alignment: there is no alignment restriction for single or double-float operations. Table 26. VALU Instruction Operands 7.2. Operands 56 of 597 "RDNA3" Instruction Set Architecture Vector Source (when 9 bits) Scalar Scalar Source (8 Dest (7 bits) bits) Code Meaning 0-105 SGPR 0 .. 105 SGPRs. One DWORD each. 106 VCC_LO VCC[31:0] 107 VCC_HI VCC[63:32] 108-123 ttmp0 .. ttmp15 Trap handler temporary SGPRs (privileged) 124 NULL Reads return zero, writes are ignored. When used as a destination, nullifies the instruction. 125 M0 Temporary register, use for a variety of functions 126 EXEC_LO EXEC[31:0] 127 EXEC_HI EXEC[63:32] 0 Inline constant zero int 1 .. 64 Integer inline constants Integer 128 Inline 129-192 Constants 193-208 int -1 .. -16 209-232 Reserved Reserved 233 DPP8 8-lane DPP (only valid as SRC0) 234 DPP8FI 8-lane DPP with Fetch-Invalid (only valid as SRC0) 235 SHARED_BASE Memory Aperture Definition 236 SHARED_LIMIT 237 PRIVATE_BASE 238 PRIVATE_LIMIT 239 Reserved Reserved Float 240 Inline 241 Constants 242 0.5 243 -1.0 Inline floating point constants. Can be used in 16, 32 and 64 bit floating point math. They may be used with non-float instructions but the value remains a float. 244 2.0 245 -2.0 246 4.0 247 -4.0 248 1.0 / (2 * PI) 249 Reserved Reserved 250 DPP16 data parallel primitive 251 Reserved Reserved 252 Reserved Reserved 253 SCC { 31’b0, SCC } 254 Reserved Reserved 255 Literal constant 32 bit constant from instruction stream 256 - 511 VGPR 0 .. 255 Vector GPRs. One DWORD each. Vector Src/Dst (8 bits) -0.5 1.0 1/(2*PI) is 0.15915494. The hex values are: half: 0x3118 single: 0x3e22f983 double: 0x3fc45f306dc9c882 7.2.1. Non-Standard Uses of Operand Fields A few instructions use the operand fields in non-standard ways: 7.2. Operands 57 of 597 "RDNA3" Instruction Set Architecture Opcode VDST SDST VSRC0 VSRC1 VSRC2 V_{ADD,SUB,SUBREV} VOP2 _CO_U32, V_{ADD,SUB,SUBREV} VOP3SD _CO_CI_U32 add result (VCC=carry-out) n/a in0 in1 unused (carry-in=VCC) add result carry-out in0 in1 carry-in V_DIV_SCALE VOP3SD result carry-out in0 in1 in2 V_READLANE VOP3 scalar dst (SGPR only) n/a vgpr# lane-sel: sgpr, M0, inline n/a V_READFIRSTLANE VOP1 scalar dst (SGPR only) n/a vgpr# n/a (lane-sel = exec) n/a V_WRITELANE VOP3 vgpr dst n/a sgpr#, const, lane-sel: sgpr, M0, M0 inline n/a V_CMP* VOPC "VCC" implied n/a in0 in1 n/a VOP3SD cmp-result (sgpr) unused in0 in1 unused VOP2 dest vgpr n/a in0 in1 unused (implied: VCC) VOP3 dest vgpr unused in0 in1 select sgpr (e.g. VCC) V_CNDMASK Encoding The readlane lane-select is limited to the valid range of lanes (0-31 for wave32, 0-63 for wave64) by ignoring upper bits of the lane number. Inline constants with DOT2_F16_F16 and DOT2_BF16_BF16 For these 2 instructions, the inline constant for sources 0 and 1 replicate the inline constant value into bits[31:16]. For source2, the OPSEL bit is used to control replication or not (gets zero if not replicating low bits). 7.2.2. Inputs Operands VALU instructions can use any of the following sources for input, subject to restrictions listed below: • VOP1, VOP2, VOPC: ◦ SRC0 is 9 bits and may be a VGPR, SGPR (including TTMPs and VCC), M0, EXEC, inline or literal constant. ◦ SRC1 is 8 bits and may specify only a VGPR • VOP3 : all 3 sources are 9 bits but still have restrictions: ◦ Not all VOPC/1/2 instructions are available in VOP3 (only those that benefit from VOP3 encoding). • See complete operand list: VALU Instruction Operands 7.2.2.1. Input Operand Modifiers The input modifiers ABS and NEG apply to floating point inputs and are undefined for any other type of input. In addition, input modifiers are supported for: V_MOV_B32, V_MOV_B16, V_MOVREL*_B32 and V_CNDMASK. ABS returns the absolute value, and NEG negates the input. Input modifiers are not supported for: • readlane, readfirstlane, writelane • integer arithmetic or bitwise operations • permlane 7.2. Operands 58 of 597 "RDNA3" Instruction Set Architecture • QSAD 7.2.2.2. Literal Expansion to 64 bits Literal constants are 32-bits, but they can be used as sources that normally require 64-bit data. They are expanded to 64 bits following these rules: • 64 bit float: the lower 32-bit are padded with zero • 64-bit unsigned integer: zero extended to 64 bits • 64-bit signed integer: sign extended to 64 bits 7.2.2.3. Source Operand Restrictions Not every combination of source operands that can be expressed in the microcode format is legal. This section describes the legal and illegal settings. Terminology for this section: "scalar value" = SGPR, EXEC, VCC, M0, SCC or literal constant; can be 32 or 64 bits. • Instructions may use at most two Scalar Values: SGPR, VCC, M0, EXEC, SCC, Literal • All instruction formats including VOP3 and VOP3P may use one literal constant ◦ Inline constants are free (do not count against 2 scalar value limit). ◦ Literals may not be used with DPP ◦ It is permissible for both scalar values to be SGPRs, although VCC counts as an SGPR. ▪ VCC when used implicitly counts against this limit: addci, subci, fmas, cndmask ◦ 64-bit shift instructions can use only one scalar value input, and can’t use the same one twice (inlines don’t count against this limit) ◦ Using the same scalar value twice only counts as a single scalar value, however using the same scalar value twice, but with different sizes has specific rules and limits: ▪ Using the same literal with different sizes counts as 2 scalar values, not 1. ▪ S[0] and S[0:1] can be considered as 1 scalar value, but S[1] and S[0:1] count as 2. In general, these rules apply to any S[2n] and S[2n:2n+1] count as one, but S[2n+1] and S[2n:2n+1] count as 2. • SGPR source rules must be met for both passes of a wave64, bearing in mind that sources that read a mask (bit-per-lane) increment the SGPR address for the second pass, and they may not be shared with other sources. 7.2.2.4. OPSEL Field Restrictions The OPSEL field (of VOP3) is usable only for a subset of VOP3 instructions, as well as VOP1/2/C instructions promoted to VOP3. Table 27. Opcodes usable with OPSEL 7.2. Operands V_MAD_I16 V_MAD_U16 V_FMA_F16 V_ADD_NC_U16 V_ADD_NC_I16 V_CVT_PKNORM_I16_F16 V_SUB_NC_U16 V_SUB_NC_I16 V_CVT_PKNORM_U16_F16 59 of 597 "RDNA3" Instruction Set Architecture V_MUL_LO_U16 V_MAD_U32_U16 V_MAD_I32_I16 V_LSHLREV_B16 V_LSHRREV_B16 V_ASHRREV_I16 V_ALIGNBIT_B32 V_ALIGNBYTE_B32 V_DIV_FIXUP_F16 V_MIN3_{F16,I16,U16} V_MAX3_{F16,I16,U16} V_MED3_{F16,I16,U16} V_MAX_{I16,U16} V_MIN_{I16,U16} V_PACK_B32_F16 V_MAXMIN_F16 V_MINMAX_F16 V_CNDMASK_B16 V_XOR_B16 V_AND_B16 V_OR_B16 V_DOT2_F16_F16 V_DOT2_BF16_BF16 V_INTERP_P10_RTZ_F16_F32 V_INTERP_P2_RTZ_F16_F32 V_INTERP_P2_F16_F32 V_INTERP_P10_F16_F32 7.2.3. Output Operands VALU instructions typically write their results to VGPRs specified in the VDST field of the microcode word. A thread only writes a result if the associated bit in the EXEC mask is set to 1. V_CMPX instructions write the result of their comparison (one bit per thread) to the EXEC mask. Instructions producing a carry-out (integer add and subtract) write their result to VCC when used in the VOP2 form, and to an arbitrary SGPR-pair when used in the VOP3 form. When the VOP3 form is used, instructions with a floating-point result may apply an output modifier (OMOD field) that multiplies the result by: 0.5, 2.0, or 4.0. Optionally, the result can be clamped (CLAMP field) to the min and max representable range (see next section). 7.2.3.1. Output Operand Modifiers Output modifiers (OMOD) apply to half, single and double floating point results only and scale the result by : 0.5, 2.0, 4.0 or do not scale. Integer and packed float 16 results ignore the omod setting. Output modifiers are not compatible with output denormals: if output denormals are enabled, then output modifiers are ignored. If output denormals are disabled, then the output modifier is applied and denormals are flushed to zero. These are not IEEE compatible: -0 is flushed to +0. Output modifiers are ignored if the IEEE mode bit is set to 1. A few opcodes force output denorms to be disabled. Output Modifiers are not supported for: • V_PERMLANE • DOT2_F16_F16 • DOT2_BF16_BF16 The clamp bit has multiple uses. For V_CMP instructions, setting the clamp bit to 1 indicates that the compare signals if a floating point exception occurs. For integer operations, it clamps the result to the largest and smallest representable value. For floating point operations, it clamps the result to the range: [0.0, 1.0]. Output Clamping: The clamp instruction bit applies to the following operations and data types: • Float clamp to [0.0, 1.0] • Signed Int [-max_int, +max_int] • Unsigned int [0, +max_int] • Bool (V_CMP) enables signaling compare 7.2. Operands 60 of 597 "RDNA3" Instruction Set Architecture The clamp bit is not supported for (ignored): V_PERMLANE* V_PERM_B32 Float DOT instructions V_SWAP and V_SWAPREL WMMA ops V_ADD3 V_ADD_LSHL V_ALIGN* Bitwise ops V_CMP*_CLASS V_CMP on integers V_READLANE V_READFIRSTLANE V_WRITELANE 7.2.3.2. Wave64 Destination Restrictions When a VALU instruction is issued from a wave64, it may issue twice as two wave32 instructions. While in most cases the programmer need not be aware of this, it does impose a prohibition on wave64 VALU instructions that both write and read the same SGPR value. Doing this may lead to unpredictable results. Specifically, the first pass of a wave64 VALU instruction may not overwrite a scalar value used by the second half. 7.2.4. Denormalized and Rounding Modes The shader program has explicit control over the rounding mode applied and the handling of denormalized inputs and results. The MODE register is set using the S_SETREG instruction; it has separate bits for controlling the behavior of single and double-precision floating-point numbers. Round and denormal modes can also be set using S_ROUND_MODE and S_DENORM_MODE which is the preferred method over using S_SETREG. 16-bit floats support denormals, infinity and NaN. Table 28. Round and Denormal Modes Field Bit Position Description FP_ROUND 3:0 [1:0] Single-precision round mode. [3:2] Double and Half-precision (FP16) round mode. Round Modes: 0=nearest even 1= +infinity 2= -infinity 3= toward zero FP_DENORM 7:4 [5:4] Single-precision denormal mode. [7:6] Double and Half-precision (FP16) denormal mode. Denormal modes: 0 = Flush input and output denorms 1 = Allow input denorms, flush output denorms 2 = Flush input denorms, allow output denorms 3 = Allow input and output denorms These mode bits do not affect rounding and denormal handling of F32 global memory atomics. DOT2_F16_F16 and DOT2_BF16_BF16 support round-to-nearest-even rounding. DOT2_F16_F16 supports denorms, and DOT2_BF16_BF16 disables all denorms. 7.2. Operands 61 of 597 "RDNA3" Instruction Set Architecture 7.2.5. Instructions using SGPRs as Mask or Carry Every VALU instruction can use SGPRs as a constant, but the following can read or write SGPRs as masks or carry: Read Mask or Carry in Write Carry out Implicitly Reads VCC Implicitly Writes VCC V_CNDMASK_B32 V_CMP* V_DIV_FMAS_F32 V_DIV_SCALE_F32 V_ADD_CO_CI_U32 V_ADD_CO_CI_U32 V_DIV_FMAS_F64 V_DIV_SCALE_F64 V_SUB_CO_CI_U32 V_SUB_CO_CI_U32 (fmas reads 3 operands + VCC) V_CMP (not V_CMPX) V_SUBREV_CO_CI_U32 V_SUBREV_CO_CI_U32 V_CNDMASK in VOP2 V_ADD_CO_U32 V_{ADD,SUB,SUBREV}_CO_CI_U 32 in VOP2 V_SUB_CO_U32 V_SUBREV_CO_U32 V_MAD_U64_U32 V_MAD_I64_I32 Write Data out (not carry) V_READLANE V_READFIRSTLANE "VCC" in the above table refers to VCC in a VOP2 or VOPC encoding, or any SGPR specified in the SRC2 or SDST field for VOP3 encoding, except for DIV_FMAS that implicitly reads VCC (no choice). V_CMPX is the only VALU instruction that writes EXEC. 7.2.6. Wave64 use of SGPRs VALU instructions may use SGPRs as a uniform input, shared by all work-items. If the value is used as simple data value, then the same SGPR is distributed to all 64 work-items. If, on the other hand, the data value represents a mask (e.g. carry-in, mask for CNDMASK), then each work-item receives a separate value, and two consecutive SGPRs are read. 7.2.7. Out-of-Range GPRs When a source VGPR is out-of-range, the instruction uses as input the value from VGPR0. When the destination GPR is out-of-range, the instruction executes but does not write the results. See VGPR Out Of Range Behavior for more information. 7.2.8. PERMLANE Specific Rules V_PERMLANE may not occur immediately after a V_CMPX. To prevent this, any other VALU opcode may be inserted (e.g. V_NOP). 7.2. Operands 62 of 597 "RDNA3" Instruction Set Architecture 7.3. Instructions The table below lists the complete VALU instruction set by microcode encoding, except for VOP3P instructions which are listed in a later section. VOP3 VOP3 - 2 operands VOP2 VOP1 V_ADD3_U32 V_ADD_CO_U32 V_ADD_CO_CI_U32 V_BFREV_B32 V_ADD_LSHL_U32 V_ADD_F64 V_ADD_F16 V_CEIL_F16 V_ALIGNBIT_B32 V_ADD_NC_I16 V_ADD_F32 V_CEIL_F32 V_ALIGNBYTE_B32 V_ADD_NC_I32 V_ADD_NC_U32 V_CEIL_F64 V_AND_OR_B32 V_ADD_NC_U16 V_AND_B32 V_CLS_I32 V_BFE_I32 V_AND_B16 V_ASHRREV_I32 V_CLZ_I32_U32 V_BFE_U32 V_ASHRREV_I16 V_CNDMASK_B32 V_COS_F16 V_BFI_B32 V_ASHRREV_I64 V_CVT_PK_RTZ_F16_F32 V_COS_F32 V_CNDMASK_B16 V_BCNT_U32_B32 V_DOT2ACC_F32_F16 V_CTZ_I32_B32 V_CUBEID_F32 V_BFM_B32 V_FMAAK_F16 V_CVT_F16_F32 V_CUBEMA_F32 V_CVT_PK_I16_F32 V_FMAAK_F32 V_CVT_F16_I16 V_CUBESC_F32 V_CVT_PK_I16_I32 V_FMAC_DX9_ZERO_F32 V_CVT_F16_U16 V_CUBETC_F32 V_CVT_PK_NORM_I16_F16 V_FMAC_F16 V_CVT_F32_F16 V_CVT_PK_U8_F32 V_CVT_PK_NORM_I16_F32 V_FMAC_F32 V_CVT_F32_F64 V_DIV_FIXUP_F16 V_CVT_PK_NORM_U16_F16 V_FMAMK_F16 V_CVT_F32_I32 V_DIV_FIXUP_F32 V_CVT_PK_NORM_U16_F32 V_FMAMK_F32 V_CVT_F32_U32 V_DIV_FIXUP_F64 V_CVT_PK_U16_F32 V_LDEXP_F16 V_CVT_F32_UBYTE0 V_DIV_FMAS_F32 V_CVT_PK_U16_U32 V_LSHLREV_B32 V_CVT_F32_UBYTE1 V_DIV_FMAS_F64 V_LDEXP_F32 V_LSHRREV_B32 V_CVT_F32_UBYTE2 V_DIV_SCALE_F32 V_LDEXP_F64 V_MAX_F16 V_CVT_F32_UBYTE3 V_DIV_SCALE_F64 V_LSHLREV_B16 V_MAX_F32 V_CVT_F64_F32 V_DOT2_BF16_BF16 V_LSHLREV_B64 V_MAX_I32 V_CVT_F64_I32 V_DOT2_F16_F16 V_LSHRREV_B16 V_MAX_U32 V_CVT_F64_U32 V_FMA_DX9_ZERO_F32 V_LSHRREV_B64 V_MIN_F16 V_CVT_FLOOR_I32_F32 V_FMA_F16 V_MAX_F64 V_MIN_F32 V_CVT_I16_F16 V_FMA_F32 V_MAX_I16 V_MIN_I32 V_CVT_I32_F32 V_FMA_F64 V_MAX_U16 V_MIN_U32 V_CVT_I32_F64 V_LERP_U8 V_MBCNT_HI_U32_B32 V_MUL_DX9_ZERO_F32 V_CVT_I32_I16 V_LSHL_ADD_U32 V_MBCNT_LO_U32_B32 V_MUL_F16 V_CVT_NEAREST_I32_F32 V_LSHL_OR_B32 V_MIN_F64 V_MUL_F32 V_CVT_NORM_I16_F16 V_MAD_I16 V_MIN_I16 V_MUL_HI_I32_I24 V_CVT_NORM_U16_F16 V_MAD_I32_I16 V_MIN_U16 V_MUL_HI_U32_U24 V_CVT_OFF_F32_I4 V_MAD_I32_I24 V_MUL_F64 V_MUL_I32_I24 V_CVT_U16_F16 V_MAD_I64_I32 V_MUL_HI_I32 V_MUL_U32_U24 V_CVT_U32_F32 V_MAD_U16 V_MUL_HI_U32 V_OR_B32 V_CVT_U32_F64 V_MAD_U32_U16 V_MUL_LO_U16 V_PK_FMAC_F16 V_CVT_U32_U16 V_MAD_U32_U24 V_MUL_LO_U32 V_SUBREV_CO_CI_U32 V_EXP_F16 V_MAD_U64_U32 V_OR_B16 V_SUBREV_F16 V_EXP_F32 V_MAX3_F16 V_PACK_B32_F16 V_SUBREV_F32 V_FLOOR_F16 V_MAX3_F32 V_READLANE_B32 V_SUBREV_NC_U32 V_FLOOR_F32 V_MAX3_I16 V_SUBREV_CO_U32 V_SUB_CO_CI_U32 V_FLOOR_F64 V_MAX3_I32 V_SUB_CO_U32 V_SUB_F16 V_FRACT_F16 V_MAX3_U16 V_SUB_NC_I16 V_SUB_F32 V_FRACT_F32 V_MAX3_U32 V_SUB_NC_I32 V_SUB_NC_U32 V_FRACT_F64 V_MAXMIN_F16 V_SUB_NC_U16 V_XNOR_B32 V_FREXP_EXP_I16_F16 V_MAXMIN_F32 V_TRIG_PREOP_F64 V_XOR_B32 V_FREXP_EXP_I32_F32 V_MAXMIN_I32 V_WRITELANE_B32 7.3. Instructions V_FREXP_EXP_I32_F64 63 of 597 "RDNA3" Instruction Set Architecture VOP3 VOP3 - 2 operands V_MAXMIN_U32 V_XOR_B16 VOP2 VOP1 V_FREXP_MANT_F16 V_MED3_F16 V_FREXP_MANT_F32 V_MED3_F32 V_FREXP_MANT_F64 V_MED3_I16 V_LOG_F16 V_MED3_I32 V_LOG_F32 V_MED3_U16 V_MOVRELD_B32 V_MED3_U32 V_MOVRELSD_2_B32 V_MIN3_F16 V_MOVRELSD_B32 V_MIN3_F32 V_MOVRELS_B32 V_MIN3_I16 V_MOV_B16 V_MIN3_I32 V_MOV_B32 V_MIN3_U16 V_NOP V_MIN3_U32 V_NOT_B16 V_MINMAX_F16 V_NOT_B32 V_MINMAX_F32 V_PERMLANE64_B32 V_MINMAX_I32 V_PIPEFLUSH V_MINMAX_U32 V_RCP_F16 V_MQSAD_PK_U16_U8 V_RCP_F32 V_MQSAD_U32_U8 V_RCP_F64 V_MSAD_U8 V_RCP_IFLAG_F32 V_MULLIT_F32 V_READFIRSTLANE_B32 V_OR3_B32 V_RNDNE_F16 V_PERMLANE16_B32 V_RNDNE_F32 V_PERMLANEX16_B32 V_RNDNE_F64 V_PERM_B32 V_RSQ_F16 V_QSAD_PK_U16_U8 V_RSQ_F32 V_SAD_HI_U8 V_RSQ_F64 V_SAD_U16 V_SAT_PK_U8_I16 V_SAD_U32 V_SIN_F16 V_SAD_U8 V_SIN_F32 V_XAD_U32 V_SQRT_F16 V_XOR3_B32 V_SQRT_F32 V_SQRT_F64 V_SWAPREL_B32 V_SWAP_B16 V_SWAP_B32 V_TRUNC_F16 V_TRUNC_F32 V_TRUNC_F64 VOPC - Compare Ops VOPC writes to VCC, VOP3 writes compare result to any SGPR V_CMP V_CMPX V_CMP I16, I32, I64, U16, U32, U64 F, LT, EQ, LE, GT, LG, GE, T V_CMPX_CLASS 7.3. Instructions write exec F16, F32, F64 F, LT, EQ, LE, GT, LG, GE, T, write VCC O, U, NGE, NLG, NGT, NLE, NEQ, NLT (T = True, F = False, O = total order, U = unordered, "N" write exec = Not (inverse) compare) F16, F32, F64 Test for any combination of: signaling-NaN, quiet-NaN, write VCC positive or negative: infinity, normal, subnormal, zero. write exec V_CMPX V_CMP_CLASS write VCC 64 of 597 "RDNA3" Instruction Set Architecture 7.4. 16-bit Math and VGPRs VALU instructions that operate on 16-bit data (non-packed) can separately address the two halves of a 32-bit VGPR. 16-bit VGPR-pairs are packed into a 32-bit VGPRs: the 32-bit VGPR "V0" contains two 16-bit VGPRs: "V0.L" representing V0[15:0] and "V0.H" representing V0[31:16]. How this addressing is encoded in the ISA varies by the instruction encoding: The 16-bit instructions can be encoded using VOP1/2/C as well as VOP3/VOP3P/VINTERP. 16bit VGPR Naming The 32-bit VGPR is "V0". The two halves are called "V0.L" and "V0.H". VOP1, VOP2, VOPC Encoding 16-bit VGPRs are encoded as: SRC/DST[6:0] = 32-bit VGPR address; SRC/DST[7] = (1=hi, 0=lo half) In this encoding, only 256 16-bit VGPRs can be addressed. VOP3, VOP3P, VINTERP 16-bit VGPRs are encoded as: SRC/DST[7:0] = 32-bit VGPR address, OPSEL = high/low. In this encoding, a wave can address 512 16-bit VGPRs. The packing shown below allows reading or writing in one cycle: • 32 lanes of one 32-bit VGPR: V0 • 64 lanes of one 16-bit VGPR: V0.L • 32 lanes of two 16-bit VGPRs (a pair, as used by packed math): V0.L and V0.H 7.5. Packed Math Packed math is a form of operation that accelerates arithmetic on two values packed into the same VGPR. It performs operations on two 16-bit values within a DWORD as if they were separate threads. For example, a packed add of V0=V1+V2 is really two separate adds: adding the low 16 bits of each DWORD and storing the result in the low 16 bits of V0, and adding the high halves and storing the result in the high 16 bits of V0. Packed math uses the instructions below and the microcode format "VOP3P". This format has OPSEL and NEG fields for both the low and high operands, and does not have ABS and OMOD. Table 29. Packed Math Opcodes: Packed Math ops V_PK_MUL_F16 V_PK_FMA_F16 V_PK_MIN_F16 V_PK_ADD_F16 V_PK_FMAC_F16 V_PK_MAX_F16 V_PK_ADD_I16 V_PK_MAD_I16 V_PK_MIN_I16 7.4. 16-bit Math and VGPRs V_PK_LSHLREV_B16 65 of 597 "RDNA3" Instruction Set Architecture Packed Math ops V_PK_ADD_U16 V_PK_MAD_U16 V_PK_MIN_U16 V_PK_LSHRREV_B16 V_PK_SUB_I16 V_PK_MUL_LO_U16 V_PK_MAX_I16 V_PK_ASHRREV_I16 V_PK_SUB_U16 V_PK_MAX_U16 V_FMA_MIX_F32 V_FMA_MIXLO_F16 V_FMA_MIXHI_F16 V_WMMA_F32_16X16X16_F16 V_DOT2_F32_BF16 V_WMMA_F32_16X16X16_BF16 V_DOT2_F32_F16 V_WMMA_F16_16X16X16_F16 V_DOT4_I32_IU8 V_WMMA_BF16_16X16X16_BF16 V_DOT4_U32_U8 V_WMMA_I32_16X16X16_IU8 V_DOT8_I32_IU4 V_WMMA_I32_16X16X16_IU4 V_DOT8_U32_U4 V_FMA_MIX_* and WMMA instructions are not packed math, but perform a single MAD operation on a mixture of 16- and 32-bit inputs. They are listed here because they use the VOP3P encoding.  VOP3P Instruction Fields Field Size Description OP 7 instruction opcode SRC0 9 first instruction argument. May come from: vgpr, sgpr, VCC, M0, exec or a constant WMMA: must be a VGPR SRC1 9 second instruction argument. May come from: vgpr, sgpr, VCC, M0, exec or a constant WMMA: must be a VGPR SRC2 9 third instruction argument. May come from: vgpr, sgpr, VCC, M0, exec or a constant VDST 8 vgpr that takes the result. For V_READLANE, indicates the SGPR that receives the result. NEG 3 negate the input (invert sign bit) for the lower-16bit operand. float inputs only. bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2. For V_FMA_MIX_* opcodes, this modifies all inputs. For DOT…IU… and WMMA…IU… NEG[1:0] = signed(1)/unsigned(0) for src0 and src1, and Neg[2] behavior is undefined. NEG_HI 3 negate the input (invert sign bit) for the higher-16bit operand. float inputs only. bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2. For V_FMA_MIX_* opcodes, this acts as an ABS (absolute value) modifier. For DOT…IU… and WMMA…IU… NEG_HI behavior is undefined. OPSEL [13:11] 3 Select the high (1) or low (0) operand as input to the operation that results in the lower-half of the destination. [0] = src0, [1] = src1, [2] = src2 If either the source operand or destination operand is 32bits, the corresponding OPSEL bit must set to zero. This rule does not apply to MIX instructions, which have a unique interpretation of OPSEL. See notes below. OPSEL works for 16-bit VGPR, SGPR and literal-constant sources; for inline constant sources OPSEL must be zero (value only exists in lower 16 bits). OPSEL[0] and [1] are unused for WMMA ops, and OPSEL[2] is used only with WMMA ops with 16-bit output to control whether the C matrix is read from upper or lower bits in the VGPR, and whether the D matrix is stored into upper or lower bits. 7.5. Packed Math 66 of 597 "RDNA3" Instruction Set Architecture Field Size Description OPSEL_HI 3 {[60:59],[14]} Select the high (1) or low (0) operand as input to the operation that results in the upper-half of the destination. [0] = src0, [1] = src1, [2] = src2. Concatenation of ISA fields { OPSLH, OPSLH0 }. If either the source operand or destination operand is 32bits or is a constant, the corresponding OPSEL_HI bit must set to zero. This rule does not apply to MIX instructions, which have a unique interpretation of OPSEL. See notes below. CLMP clamp result. Float arithmetic: clamp result to [0, 1.0]; -0 is clamped to +0. Signed integer arithmetic: clamp result to [min_int, +max_int] Unsigned integer arithmetic: clamp result to [0, +max_uint] Where "min_int" and "max_int" are the largest negative and positive representable integers for the size of integer being used (16, 32 or 64 bit). "max_uint" is the largest unsigned int. 1 OPSEL for MIX instructions MIX, MIXLO and MIXHI interpret OPSEL and OPSEL_HI as three 2-bit fields, one per source operand: { OPSEL_HI[0], OPSEL[0] } controls source0; { OPSEL_HI[1], OPSEL[1] } controls source1; { OPSEL_HI[2], OPSEL[2] } controls source2. These 2-bit fields control source-selection for each of the 3 source operands: 2’b00: Src[31:0] as FP32 2’b01: Src[31:0] as FP32 2’b10: Src[15:0] as FP16 2’b11: Src[31:16] as FP16 V_WMMA…IU… and V_DOT4…IU… with NEG:: These instructions use the NEG[1:0] bits to indicate signed (0=unsigned, 1=signed) per input source instead of meaning "negate". NEG[2] should be set to zero (behavior is undefined). NEG_HI must be zero. 7.5.1. Inline Constants with Packed Math Inline constants may be used with packed math, but they require the use of OPSEL. Inline constants produce a value in only the low 16-bits of the 32-bit constant value. Inline constants used with float 16-bit sources produce an F16 constant value. Without using OPSEL, only the lower half of the source would contain the constant. To use the inline constant in both halves, use OPSEL to select the lower input for both low and high sources. BF16 uses 32-bit float constants and then the BF16 operand selects the upper 16 bits of the FP32 constant (matches the definition of BF16). For the WMMA_F16_F16_16x16x16 or VOPD DOT2_F32_F16, hardware automatically selects the low 16 bits of the constant. Any packed math instructions that use data sizes less than 16 bits do not work with inline constants, other than the DOT instructions below: Opcode inline OPSEL DOT4_I32_IU8 use 32bit inline src0/1 (ignore OPSEL) OPSEL/OPSEL_HI on src0/1 DOT8_I32_IU4 use 32bit inline src0/1 (ignore OPSEL) OPSEL/OPSEL_HI on src0/1 DOT4_U32_U8 use 32bit inline src0/1 (ignore OPSEL) OPSEL/OPSEL_HI on src0/1 7.5. Packed Math 67 of 597 "RDNA3" Instruction Set Architecture Opcode inline OPSEL DOT8_U32_U4 use 32bit inline src0/1 (ignore OPSEL) OPSEL/OPSEL_HI on src0/1 DOT2_F32_F16 use FP32 inline, supports OPSEL OPSEL/OPSEL_HI on src0/1 DOT2_F32_BF16 upper16(FP32)/same as replicate (src0/1) ignore OPSEL OPSEL/OPSEL_HI on src0/1 DOT2ACC_F32_F16 Duplicate lo to hi, ignore OPSEL none DOT2ACC_F32_BF16 Duplicate lo to hi, ignore OPSEL none 7.6. Dual Issue VALU The VOPD instruction encoding allows a single shader instruction to encode two separate VALU operations that are executed in parallel. The two operations must be independent of each other. This instruction has certain restrictions that must be met - hardware does not function correctly if they are not. This instruction format is legal only for wave32. It must not be used by wave64’s. It is skipped for wave64. The instruction defines 2 operations, named "X" and "Y", each with their own sources and destination VGPRs. The two instructions packed into this one ISA are referred to as OpcodeX and OpcodeY. • OpcodeX sources data from SRC0X (a VGPR, SGPR or constant), and SRC1X (a VGPR); • OpcodeY sources data from SRC0Y (a VGPR, SGPR or constant), and SRC1Y (a VGPR). The two instructions in the VOPD are executed at the same time, so there are no races between them if one reads a VGPR and the other writes the same VGPR. The 'read' gets the old value. Restrictions: • Each of the two instructions may use up to 2 VGPRs • Each instruction in the pair may use at most 1 SGPR or they may share a single literal ◦ Legal combinations for the dual-op: at most 2 SGPRs, or 1 SGPR + 1 literal, or share a literal. • SRC0 can be either a VGPR or SGPR (or constant) • VSRC1 can only be a VGPR • Instructions must not exceed the VGPR source-cache port limits ◦ There are 4 VGPR banks (indexed by SRC[1:0]), and each bank has a cache ◦ Each cache has 3 read ports: one dedicated to SRC0, one dedicated to SRC1 and one for SRC2 ▪ A cache can read all 3 of them at once, but it can’t read two SRC0’s at once (or SRC1/2). ◦ SRCX0 and SRCY0 must use different VGPR banks; ◦ VSRCX1 and VSRCY1 must use different banks. ▪ FMAMK is an exception : V = S0 + K * S1 ("S1" uses the SRC2 read port) ◦ If both operations use the SRC2 input, then one SRC2 input must be even and the other SRC2 input must be odd. The following operations use SRC2: FMAMK_F32 (second input operand); DOT2ACC_F32_F16, DOT2ACC_F32_BF16, FMAC_F32 (destination operand). ◦ These are hard rules - the instruction does not function if these rules are broken • The pair of instructions combined have the following restrictions: ◦ At most one literal constant, or they may share the same literal ◦ Dest VGPRs: one must be even and the other odd ◦ The instructions must be independent of each other • Must not use DPP • Must be wave32. 7.6. Dual Issue VALU 68 of 597 "RDNA3" Instruction Set Architecture VOPD Instruction Fields Field Size Description opX 4 instruction opcode for the X operation opY 5 instruction opcode for the Y operation src0X 9 Source 0 for X operation. May be a VGPR, SGPR, exec, inline or literal constant src0Y 9 Source 0 for Y operation. May be a VGPR, SGPR, exec, inline or literal constant vsrc1X 8 Source 1 for X operation. Must be a VGPR. Ignored for V_MOV_B32 vsrc1Y 8 Source 1 for Y operation. Must be a VGPR. Ignored for V_MOV_B32 vdstX 8 Destination VGPR for X operation. vdstY 7 Destination VGPR for Y operation. vdstY specifies bits [7:1]. The LSB of the destination address is: !vdstX[0]. vdstX and vdstY: one must be even and the other is an odd VGPR. See VOPD for a list of opcodes usable in the X and Y opcode fields. V_CNDMASK_B32 is the "VOP2" form that uses VCC as the select. VCC counts as one SGPR read. VOPD instruction pairs generate only a single exception if either or both raise an exception. 7.7. Data Parallel Processing (DPP) Data Parallel Processing (DPP) operations allow VALU instruction to select operands from different lanes (threads) rather than just using a thread’s own data. DPP operations are indicated by the use of the inline constant: DPP8 or DPP16 in the SRC0 operand. Note that since SRC0 is set to the DPP value, the actual VGPR address for SRC0 comes from the DPP DWORD. One example of using DPP is for scan operations. A scan operation is one that computes a value per thread that is based on the values of the previous threads and possibly itself. E.g. a running sum is the sum of the values from previous threads in the vector. A reduction operation is essentially a scan that returns a single value from the highest numbered active thread. A scan operation requires that the EXEC mask to be set to all 1’s for proper operation. Unused threads (lanes) should be set to a value that does not change the result prior to the scan. There are two forms of the DPP instruction word: DPP8 allows arbitrary swizzling between groups of 8 lanes DPP16 allows a set of predefined swizzles between groups of 16 lanes DPP may be used only with: VOP1, VOP2, VOPC, VOP3 and VOP3P (but not "packed math" ops). DPP instructions incur an extra cycle of delay to execute. Table 30. Which instructions support DPP 7.7. Data Parallel Processing (DPP) 69 of 597 "RDNA3" Instruction Set Architecture Encoding Opcodes * Rule* Encoding Opcodes Rule VOP1 All 64-bit opcodes NO DPP VOP3 All 64bit opcodes NO DPP READFIRSTLANE_B32 NO DPP MUL_LO_U32 NO DPP SWAP_B32 NO DPP MUL_HI_U32 NO DPP PIPEFLUSH NO DPP MUL_HI_I32 NO DPP WRITELANE_REGWR_B32 NO DPP QSAD_PK_U16_U8 NO DPP PERMUTE64 NO DPP MQSAD_PK_U16_U8 NO DPP All Others Allow DPP MQSAD_U32_U8 NO DPP ALL 64bit opcodes NO DPP READLANE_REGRD_B32 NO DPP FMAMK/AD_F32/16 NO DPP READLANE_B32 NO DPP All Others Allow DPP WRITELANE_B32 NO DPP V_DOT4_I32_IU8 V_DOT4_U32_U8 V_DOT8_I32_IU4 V_DOT8_U32_U4 V_PK_* WMMA NO DPP PERMLANE16_B32 NO DPP ALL Others: V_FMA_MIX_* V_DOT2_F32_{BF16, F16} Allow DPP PERMLANEX16_B32 NO DPP ALL NO DPP The others Allow DPP All 64bit opcodes NO DPP The others Allow DPP VOP2 VOP3P VINTERP VOPD ALL NO DPP VOPC V_CMP and V_CMPX write the full mask, not a partial mask. When using DPP with V_CMP or V_CMPX and setting bound_ctrl=0, lanes that have their EXEC mask bit set to zero instead of not writing the bit, a zero bit is written. "FI" (Fetch Inactive) with DPP16 causes a lane to act as if it is active when supplying data, but the compare result for that lane will still be zero for V_CMPX (V_CMPX with FI=1 will not turn on a lane that was off). 7.7.1. DPP16 DPP16 allows selection of data within groups of 16 lanes with a fixed set of possible swizzle patterns. Both VOP3/VOP3P and DPP16 have ABS and NEG fields: • VOP3’s ABS & NEG fields are used, and DPP16’s are ignored • VOP3P’s NEG/NEG_HI fields are used and DPP16’s ABS & NEG are ignored. DPP16 Instruction Fields Field BITS Description row_mask 31:28 Applies to the VGPR destination write only, does not impact the thread mask when fetching source VGPR data. For VOPC, the SGPR/VCC bit associated with the disabled lane receives zero. 31==0: lanes[63:48] are disabled (wave 64 only) 30==0: lanes[47:32] are disabled (wave 64 only) 29==0: lanes[31:16] are disabled 28==0: lanes[15:0] are disabled 7.7. Data Parallel Processing (DPP) 70 of 597 "RDNA3" Instruction Set Architecture Field BITS Description bank_mask 27:24 Applies to the VGPR destination write only, does not impact the thread mask when fetching source VGPR data. For VOPC, the SGPR/VCC bit associated with the disabled lane receives zero. In wave32 mode: 27==0: lanes[12:15, 28:31] are disabled 26==0: lanes[8:11, 24:27 are disabled 25==0: lanes[4:7, 20:23] are disabled 24==0: lanes[0:3, 16:19] are disabled In wave64 mode: 27==0: lanes[12:15, 28:31, 44:47, 60:63] are disabled 26==0: lanes[8:11, 24:27, 40:43, 56:59] are disabled 25==0: lanes[4:7, 20:23, 36:39, 52:55] are disabled 24==0: lanes[0:3, 16:19, 32:35, 48:51] are disabled Notice: the term "bank" here is not the same as was used for the VGPR bank. src1_imod 23:22 23: Apply Absolute value to SRC1 22: Apply Negate to SRC1 (done after absolute value) src0_imod 21:20 21: Apply Absolute value to SRC0 20: Apply Negate to SRC0 (done after absolute value) BC 19 Bound_ctrl is used to determine what a thread should do if its source operand is from a disabled thread or invalid input: use the value zero, or disable the write. For example, a right shift into lane 0 is an invalid input, so the VALU uses Bound_ctrl to decide if lane 0’s src0 should be 0 or if it’s VGPR write enable should be disabled. 19==0: Do not write when source is invalid or out-of-range (DPP_BOUND_OFF) 19==1: User zero as input if source is invalid or out-of-range. (DPP_BOUND_ZERO) FI 18 Fetch inactive lane behavior: 18 == 0: If source lane is invalid (disabled thread or out-of-range), use "bound_ctrl" to determine the source value. 18 == 1: If the source lane is disabled, fetch the source value anyway (ignoring the bound_ctrl bit). If the source lane is out-of-range, behavior is decided by the bound_ctrl bit. rsvd 17 Reserved dpp_ctrl 16:8 Data Share control word. Src0 7:0 DPP_QUAD_PERM{00:FF} 000-0FF DPP_UNUSED 100 DPP_ROW_SL{1:15} 101-10F DPP_ROW_SR{1:15} 111-11F DPP_ROW_RR{1:15} 121-12F DPP_ROW_MIRROR 140 DPP_ROW_HALF_MIRROR 141 DPP_ROW_SHARE{0:15} 150 - 15F DPP_ROW_XMASK{0:15} 160 - 16F VGPR address of srcA operand Table 31. BC and FI Behavior BC FI Source lane out-ofrange Source lane in-range but disabled Source lane in-range and active 0 0 Disable write Disable write Normal 1 0 Src0 = 0 Src0 = 0 Normal 0 1 Src0 = 0 Normal Normal 1 1 Normal Normal Normal 7.7. Data Parallel Processing (DPP) 71 of 597 "RDNA3" Instruction Set Architecture Where "out of range" means the lane offset goes outside a group of 16 lanes (e.g. 0..15, or 16..31). 7.7.2. DPP8 DPP8 allows arbitrary cross-lane swizzling within groups of 8 lanes. There are two forms of DPP8: normal, which reads zero from lanes whose EXEC mask bit is zero, and DPP8FI, which fetches data from inactive lanes instead of using the value zero. DPP8 follows DPP16’s "BC = 1" behavior and assumes all source lanes are in-range. DPP8 Instruction Fields Field Size Description SRC 8 Source 0 (VGPR). Since the VOP1/VOP2 source0 slot was filled with the constant "DPP" or "DPPFI", this field provides the actual source-0 vgpr. SEL0 SEL1 SEL2 SEL3 SEL4 SEL5 SEL6 SEL7 3 Selects which lane to pull data from, within a group of 8 lanes. SEL0 selects which lane to read from to supply data into lane 0. SEL1 selects which lane to read from to supply data into lane 1. etc. 0 = read from lane 0, 1 = read from lane 1, … 7 = read from lane 7. Lanes 0-7 can pull from any of lanes 0-7; lanes 8-15 can pull from lanes 8-15, etc. 7.8. VGPR Indexing The VALU provides a set of instructions that move or swap VGPRs where the source, dest or both are indexed by a value in the M0 register. Indices are unsigned. Table 32. VGPR Indexing Instructions Instruction Index Function V_MOVRELD_B32 M0[31:0] Move with relative destination: VGPR[dst + M0[31:0]] = VGPR[src] V_MOVRELS_B32 Move with relative source: VGPR[dst] = VGPR[src + M0[31:0]] V_MOVRELSD_B32 Move with relative source and destination: VGPR[dst + M0[31:0]] = VGPR[src + M0[31:0]] V_MOVRELSD_2_B32 V_SWAPREL_B32 Src: M0[9:0] Dst: M0[25:16] Move with relative source and destination, each different: VGPR[dst + M0[25:16]] = VGPR[src + M0[9:0]] Swap two VGPRs, each relative to a separate index: tmp = VGPR[src + M0[9:0]] VGPR[src + M0[9:0]] = VGPR[dst + M0[25:16]] VGPR[dst + M0[25:16]] = tmp 7.9. Wave Matrix Multiply Accumulate (WMMA) Wave Matrix Multiply-Accumulate (WMMA) instructions provide acceleration for common matrix arithmetic operations. The instructions are encoded using the VOP3P encoding. 7.8. VGPR Indexing 72 of 597 "RDNA3" Instruction Set Architecture These perform: A * B + C ⇒ D, where A, B, C and D are matrices. WMMA does not generate any ALU exceptions. These are all encoded using VOP3P. The NEG[1:0] field is repurposed for the "IU" integer types to indicate whether the inputs are signed or not (0=unsigned, 1=signed). For WMMA_*UI8/UI4, NEG[1:0] indicates whether SRC0 and 1 are signed or unsigned, and NEG[2] and NEG_HI[2:0] must be zero. For WMMA*_F16/BF16, NEG[1:0] is applied on SRC1 and SRC0’s low 16bit. NEG_HI[1:0] is applied on SRC1 and SRC0’s high 16bit. {NEG_HI[2], NEG[2]} is applied on SRC2, act as {ABS, NEG}. The destination is signed for the integer types. Neg[0] applies to the A-matrix, and Neg[1] to the B-matrix. Neg[2] must be set to zero. Table 33. WMMA Instructions Instruction Matrix A Matrix B Matrix C Result Matrix V_WMMA_F32_16X16X16_F16 16x16 F16 16x16 F16 16x16 F32 16x16 F32 V_WMMA_F32_16X16X16_BF16 16x16 BF16 16x16 BF16 16x16 F32 16x16 F32 V_WMMA_F16_16X16X16_F16 16x16 F16 16x16 F16 V_WMMA_BF16_16X16X16_BF16 16x16 BF16 16x16 BF16 16x16 BF16 16x16 BF16 V_WMMA_I32_16X16X16_IU8 16x16 IU8 16x16 IU8 16x16 I32 16x16 I32 V_WMMA_I32_16X16X16_IU4 16x16 IU4 16x16 IU4 16x16 I32 16x16 I32 16x16 F16 16x16 F16 "IU4" and "IU8" mean that the operand is either signed or unsigned (4 or 8 bits) as indicate by the NEG bits. These instructions work over multiple cycles to compute the result matrix and internally use the DOT instructions. In order to achieve this performance, the user must arrange the data such that: • A and B matrices: lanes 0-15 data are replicated into lanes 16-31 (for wave64: also into lanes 32-47 and 4863). WMMA supports only round-to-nearest-even rounding for float types. Inline constants: can only be used for C-matrix. For F16 and BF16, the inline value is replicated into both low and high halves of the DWORD. Back-to-back dependent WMMA instructions require one V_NOP (or independent VALU op) between them if the first instruction’s matrix D is the same or overlaps with the second instruction’s matrices A or B. Matrix A/B can overlap C as long as C is distinct from D. The typical case is that C and D are the same. Simplified example of matrix multiplication on 4x4 matrices: This diagram below shows the A, B, C and D matrices in the traditional point of view: one row is a horizontal 7.9. Wave Matrix Multiply Accumulate (WMMA) 73 of 597 "RDNA3" Instruction Set Architecture strip of entries, and columns are a vertical strip. This is the linear algebra view, regardless of layout in memory or in VGPRs. The matrix operation is defined as: D = A * B + C. Each entry in D is the result of multiplication of a row from A with a column from B, added to the C value for that entry. This diagram below shows how the matrices are laid out in VGPRs when M = N = K = 16. Note that the A matrix is column-major while the others are in row-major order. 7.9. Wave Matrix Multiply Accumulate (WMMA) 74 of 597 "RDNA3" Instruction Set Architecture 7.9. Wave Matrix Multiply Accumulate (WMMA) 75 of 597 "RDNA3" Instruction Set Architecture Chapter 8. Scalar Memory Operations Scalar Memory Loads (SMEM) instructions allow a shader program to load data from memory into SGPRs through the Constant Cache ("Kcache"). Instructions can load from 1 to 16 DWORDs. Data is loaded directly into SGPRs without any format conversion. The scalar unit loads consecutive DWORDs from memory to the SGPRs. This is intended primarily for loading ALU constants and for indirect T#/S# lookup. No data formatting is supported, nor is byte or short data. Loads come in two forms: one that simply takes a base-address pointer, and the other that uses a vertex-buffer constant to provide: base, size and stride. 8.1. Microcode Encoding Scalar memory load instructions are encoded using the SMEM microcode format. The fields are described in the table below: Table 34. SMEM Encoding Field Descriptions Field Size Description OP 8 Opcode. See the next table. SDATA 7 SGPRs to return Load data to. • Loads of 2 DWORDs must have an even SDST-sgpr. • Loads of 4 or more DWORDs must have their DST-gpr aligned to a multiple of 4. • SDATA must be: SGPR or VCC. Not: EXEC, M0 or NULL except for instructions that return nothing: these may use NULL SBASE 6 SGPR-pair (SBASE has an implied LSB of zero) that provides a base address, or for BUFFER instructions, a set of 4 SGPRs (4-sgpr aligned) that hold the resource constant. For BUFFER instructions, the only resource fields used are: base, stride, num_records. OFFSET 21 Instruction Address Offset : An immediate signed byte offset. Negative offsets only work with S_LOAD; a negative offset applied to S_BUFFER results in a MEMVIOL. SOFFSET 7 SGPR that has the 32-bit unsigned byte offset. May only specify an SGPR, M0 or set to "NULL" to not use (offset=0). GLC 1 Globally Coherent. DLC 1 Device Coherent. Table 35. SMEM Instructions Opcode # Name Opcode # Name 0 S_LOAD_B32 9 S_BUFFER_LOAD_B64 1 S_LOAD_B64 10 S_BUFFER_LOAD_B128 2 S_LOAD_B128 11 S_BUFFER_LOAD_B256 3 S_LOAD_B256 12 S_BUFFER_LOAD_B512 4 S_LOAD_B512 32 S_GL1_INV 8 S_BUFFER_LOAD_B32 33 S_DCACHE_INV 8.1. Microcode Encoding 76 of 597 "RDNA3" Instruction Set Architecture These instructions load 1-16 DWORDs from memory. The data in SGPRs is specified in SDATA, and the address is composed of the SBASE, OFFSET, and SOFFSET fields. 8.1.1. Scalar Memory Addressing Non-buffer S_LOAD instructions use the following formula to calculate the memory address: ADDR = SGPR[base] + inst_offset + { M0 or SGPR[offset] or zero } All components of the address (base, offset, inst_offset, M0) are in bytes, but the two LSBs are ignored and treated as if they were zero. It is illegal and undefined for the inst_offset to be negative if the resulting (inst_offset + (M0 or SGPR[offset])) is negative. 8.1.2. Loads using Buffer Constant S_BUFFER_LOAD instructions use a similar formula, but the base address comes from the buffer constant’s base_address field. Buffer constant fields used: base_address, stride, num_records. Other fields are ignored. Scalar memory load does not support "swizzled" buffers. Stride is used only for memory address bounds checking, not for computing the address to access. The SMEM supplies only a SBASE address (byte) and an offset (byte or DWORD). Any "index * stride" must be calculated manually in shader code and added to the offset prior to the SMEM. Inst_offset must be nonnegative - a negative value of inst_offset results in a MEMVIOL. The two LSBs of V#.base and of the final address are ignored to force DWORD alignment. "m_*" components come from the buffer constant (V#): offset = OFFSET + SOFFSET (M0, SGPR or zero) m_base = { SGPR[SBASE * 2 +1][15:0], SGPR[SBASE*2] } m_stride = SGPR[SBASE * 2 +1][31:16] m_num_records = SGPR[SBASE * 2 + 2] m_size = (m_stride == 0 ? 1 : m_stride) * m_num_records addr = (m_base & ~3) + (offset & ~0x3) SGPR[SDST] = load_dword_from_dcache(addr, m_size) If more than 1 DWORD is being loaded, it is returned to SDST+1, SDST+2, etc, and the offset is incremented by 4 bytes per DWORD. 8.1.3. S_DCACHE_INV and S_GL1_INV This instruction invalidates the entire scalar cache or L1 cache. It does not return anything to SDST. 8.1. Microcode Encoding 77 of 597 "RDNA3" Instruction Set Architecture S_GL1_INV and S_DCACHE_INV do not have any address or data arguments. 8.2. Dependency Checking Scalar memory loads can return data out-of-order from how they were issued; they can return partial results at different times when the load crosses two cache lines. The shader program uses the LGKMcnt counter to determine when the data has been returned to the SDST SGPRs. This is done as follows. • LGKMcnt is incremented by 1 for every fetch of a single DWORD, or cache invalidates. • LGKMcnt is incremented by 2 for every fetch of two or more DWORDs. • LGKMcnt is decremented by an equal amount when each instruction completes. Because the instructions can return out-of-order, the only sensible way to use this counter is to implement "S_WAITCNT LGKMcnt 0"; this imposes a wait for all data to return from previous SMEMs before continuing. Cache invalidate instructions are not assured to have completed until the shader waits for LGKMcnt==0. 8.3. Scalar Memory Clauses and Groups A clause is a sequence of instructions starting with S_CLAUSE and continuing for 2-63 instructions. Clauses lock the instruction arbiter onto this wave until the clause completes. A group is a set of the same type of instruction that happen to occur in the code but are not necessarily executed as a clause. A group ends when a non-SMEM instruction is encountered. Scalar memory instructions are issued in groups. The hardware does not enforce that a single wave executes an entire group before issuing instructions from another wave. Group restrictions: • INV must be in a group by itself and may not be in a clause 8.4. Alignment and Bounds Checking SDST The value of SDST must be even for fetches of two DWORDs, or a multiple of four for larger fetches. If this rule is not followed, invalid data can result. SBASE The value of SBASE must be even for S_BUFFER_LOAD (specifying the address of an SGPR which is a multiple of four). If SBASE is out-of-range, the value from SGPR0 is used. OFFSET The value of OFFSET has no alignment restrictions. 8.2. Dependency Checking 78 of 597 "RDNA3" Instruction Set Architecture 8.4.1. Address and GPR Range Checking The hardware checks for both the address being out of range (BUFFER instructions only), and for the source or destination SGPRs being out of range. Address Out-of-Range if offset >= ( (stride==0 ? 1 : stride) * num_records). where "offset" is: inst_offset + {M0 or sgpr-offset} Any DWORDs that are out of range in memory from a buffer_load return zero. If a multi-DWORD request (e.g. S_BUFFER_LOAD_B256) is partially out of range, the DWORDs that are in range return data as normal, and the out-of-range DWORDs return zero. Source SGPR out of range If any source data is out of the range of SGPRs (either partially or completely), the value 'zero' is used instead. Destination SGPR out of range If the destination SGPR is partially or fully out of range, no data is written back to SGPRs for this instruction. 8.4. Alignment and Bounds Checking 79 of 597 "RDNA3" Instruction Set Architecture Chapter 9. Vector Memory Buffer Instructions Vector-memory (VM) buffer operations transfer data between the VGPRs and buffer objects in memory through the texture cache (TC). Vector means that one or more piece of data is transferred uniquely for every thread in the wave, in contrast to scalar memory loads that transfer only one value that is shared by all threads in the wave. The instruction defines which VGPR(s) supply the addresses for the operation, which VGPRs supply or receive data from the operation, and a series of SGPRs that contain the memory buffer descriptor (V#). Buffer atomics have the option of returning the pre-op memory value to VGPRs. Examples of buffer objects are vertex buffers, raw buffers, stream-out buffers, and structured buffers. Buffer objects support both homogeneous and heterogeneous data, but no filtering of load-data (no samplers). Buffer instructions are divided into two groups: MUBUF: Untyped buffer objects • Data format is specified in the resource constant. • Load, store, atomic operations, with or without data format conversion. MTBUF: Typed buffer objects • Data format is specified in the instruction. • The only operations are Load and Store, both with data format conversion. All buffer operations use a buffer resource constant (V#) that is a 128-bit value in SGPRs. This constant is sent to the texture cache when the instruction is executed. This constant defines the address and characteristics of the buffer in memory. Typically, these constants are fetched from memory using scalar memory loads prior to executing VM instructions, but these constants also can be generated within the shader. Memory operations of different types (loads, stores) can complete out of order with respect to each other. Simplified view of buffer addressing The equation below shows how the memory address is calculated for a buffer access: Memory instructions return MEMVIOL for any misaligned access when the alignment mode does not allow it. 9.1. Buffer Instructions Buffer instructions (MTBUF and MUBUF) allow the shader program to load from, and store to, linear buffers in memory. These operations can operate on data as small as one byte, and up to four DWORDs per work-item. Atomic operations take data from VGPRs and combine them arithmetically with data already in memory. Optionally, the value that was in memory before the operation took place can be returned to the shader. 9.1. Buffer Instructions 80 of 597 "RDNA3" Instruction Set Architecture The D16 instruction variants of buffer ops convert the results to and from packed 16-bit values. For example, BUFFER_LOAD_D16_FORMAT_XYZW stores two VGPRs with 4 16-bit values. Table 36. Buffer Instructions MTBUF Instructions TBUFFER_LOAD_FORMAT_{x,xy,xyz,xyzw} TBUFFER_STORE_FORMAT_{x,xy,xyz,xyzw} TBUFFER_LOAD_D16_FORMAT_{x,xy,xyz,xyzw} TBUFFER_STORE_D16_FORMAT_{x,xy,xyz,xyzw} Load from or store to a Typed buffer object. Convert data to 16-bits before loading into VGPRs. Convert data from 16-bits to tex-format before storing to memory MUBUF Instructions BUFFER_LOAD_FORMAT_{x,xy,xyz,xyzw} BUFFER_STORE_FORMAT_{x,xy,xyz,xyzw} BUFFER_LOAD_D16_FORMAT_{x,xy,xyz,xyzw} BUFFER_STORE_D16_FORMAT_{x,xy,xyz,xyzw} BUFFER_LOAD_ BUFFER_STORE_ BUFFER_{LOAD,STORE}_D16_FORMAT_X BUFFER_{LOAD,STORE}_D16_HI_FORMAT_X Load from or store to an Untyped Buffer object = I8, U8, I16, U16, B32, B64, B96, B128 BUFFER_ATOMIC_ Buffer object atomic operation. Automatically globally coherent. Operates on 32bit or 64bit values. BUFFER_GL{0,1}_INV Cache invalidate: either L0 or L1 cache for the CU (L0) and Shader Array (L1) associated with this wave. Table 37. Microcode Formats Field Bit Size Description OP 4 8 MTBUF: Opcode for Typed buffer instructions. MUBUF: Opcode for Untyped buffer instructions. VADDR 8 Address of VGPR to supply first component of address (offset or index). When both index and offset are used, index is in the first VGPR, offset in the second. VDATA 8 Address of VGPR to supply first component of store data or receive first component of load-data. SOFFSET 8 SGPR to supply unsigned byte offset. SGPR, M0, NULL, or inline constant. SRSRC Specifies which SGPR supplies V# (resource constant) in four consecutive SGPRs. This field is missing the two LSBs of the SGPR address, since this address is be aligned to a multiple of four SGPRs. 5 FORMA 7 T Data Format of data in memory buffer. See: Buffer Image Format Table OFFSET 12 Unsigned byte offset. OFFEN 1 1 = Supply an offset from VGPR (VADDR). 0 = Do not (offset = 0). IDXEN 1 1 = Supply an index from VGPR (VADDR). 0 = Do not (index = 0). GLC 1 Globally Coherent. Controls how loads and stores are handled by the L0 texture cache. ATOMIC GLC = 0 Previous data value is not returned. GLC = 1 Previous data value is returned. DLC 1 Device Level Coherent. SLC 1 System Level Coherent. 9.1. Buffer Instructions 81 of 597 "RDNA3" Instruction Set Architecture Field Bit Size Description TFE 1 Texel Fault Enable for PRT (partially resident textures). When set to 1 and fetch returns a NACK, status is written to the VGPR after the last fetch-dest VGPR. Table 38. MTBUF Instructions Opcode Description - all address components for buffer ops are uint TBUFFER_LOAD_FORMAT_X load X component w/ format convert TBUFFER_LOAD_FORMAT_XY load XY components w/ format convert TBUFFER_LOAD_FORMAT_XYZ load XYZ components w/ format convert TBUFFER_LOAD_FORMAT_XYZW load XYZW components w/ format convert TBUFFER_STORE_FORMAT_X store X component w/ format convert TBUFFER_STORE_FORMAT_XY store XY components w/ format convert TBUFFER_STORE_FORMAT_XYZ store XYZ components w/ format convert TBUFFER_STORE_FORMAT_XYZW store XYZW components w/ format convert TBUFFER_LOAD_D16_FORMAT_X load X component w/ format convert, 16bit TBUFFER_LOAD_D16_FORMAT_XY load XY components w/ format convert, 16bit TBUFFER_LOAD_D16_FORMAT_XYZ load XYZ components w/ format convert, 16bit TBUFFER_LOAD_D16_FORMAT_XYZW load XYZW components w/ format convert, 16bit TBUFFER_STORE_D16_FORMAT_X store X component w/ format convert, 16bit TBUFFER_STORE_D16_FORMAT_XY store XY components w/ format convert, 16bit TBUFFER_STORE_D16_FORMAT_XYZ store XYZ components w/ format convert, 16bit TBUFFER_STORE_D16_FORMAT_XYZW store XYZW components w/ format convert, 16bit • TBUFFER*_FORMAT instructions include a data-format conversion specified in the instruction. Table 39. MUBUF Instructions Opcode Description - all address components for buffer ops are uint BUFFER_LOAD_U8 load unsigned byte (extend 0’s to MSB’s of DWORD VGPR) BUFFER_LOAD_D16_U8 load unsigned byte into VGPR[15:0] BUFFER_LOAD_D16_HI_U8 load unsigned byte into VGPR[31:16] BUFFER_LOAD_I8 load signed byte (sign extend to MSB’s of DWORD VGPR) BUFFER_LOAD_D16_I8 load signed byte into VGPR[15:0] BUFFER_LOAD_D16_HI_I8 load signed byte into VGPR[31:16] BUFFER_LOAD_U16 load unsigned short (extend 0’s to MSB’s of DWORD VGPR) BUFFER_LOAD_I16 load signed short (sign extend to MSB’s of DWORD VGPR) BUFFER_LOAD_D16_B16 load short into VGPR[15:0] BUFFER_LOAD_D16_HI_B16 load short into VGPR[31:16] BUFFER_LOAD_B32 load DWORD BUFFER_LOAD_B64 load 2 DWORD per element BUFFER_LOAD_B96 load 3 DWORD per element BUFFER_LOAD_B128 load 4 DWORD per element BUFFER_LOAD_FORMAT_X load X component w/ format convert BUFFER_LOAD_FORMAT_XY load XY components w/ format convert BUFFER_LOAD_FORMAT_XYZ load XYZ components w/ format convert BUFFER_LOAD_FORMAT_XYZW load XYZW components w/ format convert BUFFER_LOAD_D16_FORMAT_X load X component w/ format convert, 16b BUFFER_LOAD_D16_HI_FORMAT_X load X component w/ format convert, 16b BUFFER_LOAD_D16_FORMAT_XY load XY components w/ format convert, 16b 9.1. Buffer Instructions 82 of 597 "RDNA3" Instruction Set Architecture Opcode Description - all address components for buffer ops are uint BUFFER_LOAD_D16_FORMAT_XYZ load XYZ components w/ format convert, 16b BUFFER_LOAD_D16_FORMAT_XYZW load XYZW components w/ format convert, 16b BUFFER_STORE_B8 store byte (ignore MSB’s of DWORD VGPR) BUFFER_STORE_D16_HI_B8 store byte from VGPR bits [23:16] BUFFER_STORE_B16 store short (ignore MSB’s of DWORD VGPR) BUFFER_STORE_D16_HI_B16 store short from VGPR bits [32:16] BUFFER_STORE_B32 store DWORD BUFFER_STORE_B64 store 2 DWORD per element BUFFER_STORE_B96 store 3 DWORD per element BUFFER_STORE_B128 store 4 DWORD per element BUFFER_STORE_FORMAT_X store X component w/ format convert BUFFER_STORE_FORMAT_XY store XY components w/ format convert BUFFER_STORE_FORMAT_XYZ store XYZ components w/ format convert BUFFER_STORE_FORMAT_XYZW store XYZW components w/ format convert BUFFER_STORE_D16_FORMAT_X store X component w/ format convert, 16b BUFFER_STORE_D16_HI_FORMAT_X store X component w/ format convert, 16b BUFFER_STORE_D16_FORMAT_XY store XY components w/ format convert, 16b BUFFER_STORE_D16_FORMAT_XYZ store XYZ components w/ format convert, 16b BUFFER_STORE_D16_FORMAT_XYZW store XYZW components w/ format convert, 16b BUFFER_ATOMIC_ADD_U32 32b , dst += src, returns previous value if glc==1 BUFFER_ATOMIC_ADD_F32 32b , dst += src, returns previous value if glc==1 BUFFER_ATOMIC_ADD_U64 64b , dst += src, returns previous value if glc==1 BUFFER_ATOMIC_AND_B32 32b , dst &= src, returns previous value if glc==1 BUFFER_ATOMIC_AND_B64 64b , dst &= src, returns previous value if glc==1 BUFFER_ATOMIC_CMPSWAP_B32 32b , dst = (dst == cmp) ? src : dst, returns previous value if glc==1. Src is from vdata, cmp from vdata+1 BUFFER_ATOMIC_CMPSWAP_B64 64b , dst = (dst == cmp) ? src : dst, returns previous value if glc==1 BUFFER_ATOMIC_CSUB_U32 32b , dst = if (src > dst) ? 0 : dst - src, returns previous . GLC must be set to 1. BUFFER_ATOMIC_DEC_U32 32b , dst = dst == 0) | (dst > src ? src : dst-1, returns previous value if glc==1 BUFFER_ATOMIC_DEC_U64 64b , dst = dst == 0) | (dst > src ? src : dst-1, returns previous value if glc==1 BUFFER_ATOMIC_CMPSWAP_F32 32b , dst = (dst == cmp) ? src : dst, returns previous value if glc==1. Src is from vdata, cmp from vdata+1 BUFFER_ATOMIC_MAX_F32 32b , dst = (src > dst) ? src : dst, (float) returns previous value if glc==1 BUFFER_ATOMIC_MIN_F32 32b , dst = (src < dst) ? src : dst, (float) returns previous value if glc==1 BUFFER_ATOMIC_INC_U32 32b , dst = (dst >= src) ? 0 : dst+1, returns previous value if glc==1 BUFFER_ATOMIC_INC_U64 64b , dst = (dst >= src) ? 0 : dst+1, returns previous value if glc==1 BUFFER_ATOMIC_OR_B32 32b , dst |= src, returns previous value if glc==1 BUFFER_ATOMIC_OR_B64 64b , dst |= src, returns previous value if glc==1 BUFFER_ATOMIC_MAX_I32 32b , dst = (src > dst) ? src : dst, (signed) returns previous value if glc==1 BUFFER_ATOMIC_MAX_I64 64b , dst = (src > dst) ? src : dst, (signed) returns previous value if glc==1 BUFFER_ATOMIC_MIN_I32 32b , dst = (src < dst) ? src : dst, (signed) returns previous value if glc==1 BUFFER_ATOMIC_MIN_I64 64b , dst = (src < dst) ? src : dst, (signed) returns previous value if glc==1 BUFFER_ATOMIC_SUB_U32 32b , dst -= src, returns previous value if glc==1 BUFFER_ATOMIC_SUB_U64 64b , dst -= src, returns previous value if glc==1 BUFFER_ATOMIC_SWAP_B32 32b , dst = src, returns previous value of dst if glc==1 BUFFER_ATOMIC_SWAP_B64 64b , dst = src, returns previous value of dst if glc==1 BUFFER_ATOMIC_MAX_U32 32b , dst = (src > dst) ? src : dst, (unsigned) returns previous value if glc==1 BUFFER_ATOMIC_MAX_U64 64b , dst = (src > dst) ? src : dst, (unsigned) returns previous value if glc==1 9.1. Buffer Instructions 83 of 597 "RDNA3" Instruction Set Architecture Opcode Description - all address components for buffer ops are uint BUFFER_ATOMIC_MIN_U32 32b , dst = (src < dst) ? src : dst, (unsigned) returns previous value if glc==1 BUFFER_ATOMIC_MIN_U64 64b , dst = (src < dst) ? src : dst, (unsigned) returns previous value if glc==1 BUFFER_ATOMIC_XOR_B32 32b , dst ^= src, returns previous value if glc==1 BUFFER_ATOMIC_XOR_B64 64b , dst ^= src, returns previous value if glc==1 BUFFER_GL0_INV invalidate the shader L0 cache (texture cache) associated with this wave. BUFFER_GL1_INV invalidate the GL1 (L1) cache associated with this wave, for this wave’s VMID • BUFFER*_FORMAT instructions include a data-format conversion specified in the resource constant (V#). • In the table above, "D16" means the data in the VGPR is 16-bits, not the usual 32 bits. "D16_HI" means that the upper 16-bits of the VGPR is used instead of "D16" that uses the lower 16 bits. 9.2. VGPR Usage VGPRs supply address and store-data, and they can be the destination for return data. Address Zero, one or two VGPRs are used, depending on the index-enable (IDXEN) and offset-enable (OFFEN) in the instruction word. These are unsigned ints. For 64-bit addresses the LSBs are in VGPRn and the MSBs are in VGPRn+1. Table 40. Address VGPRs IDXEN OFFEN VGPRn VGPRn+1 0 0 nothing 0 1 uint offset 1 0 uint index 1 1 uint index uint offset Store Data : N consecutive VGPRs, starting at VDATA. The data format specified in the instruction word’s opcode and D16 setting determines how many DWORDs the shader provides to store. Load Data : Same as stores. Data is returned to consecutive VGPRs. Load Data Format : Load data is 32 or 16 bits, based on the data format in the instruction or resource and D16. Float or normalized data is returned as floats; integer formats are returned as integers (signed or unsigned, same type as the memory storage format). Memory loads of data in memory that is 32 or 64 bits do not undergo any format conversion unless they return as 16-bit due to D16 being set to 1. Atomics with Return : Data is read out of the VGPR(s) starting at VDATA to supply to the atomic operation. If the atomic returns a value to VGPRs, that data is returned to those same VGPRs starting at VDATA. Table 41. Data format in VGPRs and Memory Instruction Memory Format VGPR Format BUFFER_LOAD_U8 ubyte V0[31:0] = {24’b0, byte} BUFFER_LOAD_D16_U8 ubyte V0[15:0] = {8’b0, byte} writes only 16 bits BUFFER_LOAD_D16_HI_U8 ubyte V0[31:16] = {8’h0, byte} writes only 16 bits BUFFER_LOAD_S8 sbyte V0[31:0] = { 24{sign}, byte} BUFFER_LOAD_D16_S8 sbyte V0[15:0] {8{sign}, byte} 9.2. VGPR Usage Notes writes only 16 bits 84 of 597 "RDNA3" Instruction Set Architecture Instruction Memory Format VGPR Format Notes BUFFER_LOAD_D16_HI_S8 sbyte V0[31:16] = {8{sign}, byte} writes only 16 bits BUFFER_LOAD_U16 ushort V0[31:0] = { 16’b0, short} BUFFER_LOAD_S16 sshort V0[31:0] = { 16{sign}, short} BUFFER_LOAD_D16_B16 short V0[15:0] = short writes only 16 bits BUFFER_LOAD_D16_HI_B16 short V0[31:16] = short writes only 16 bits BUFFER_LOAD_B32 DWORD DWORD BUFFER_LOAD_FORMAT_X FORMAT field float, uint or sint Load X into V0[31:0] BUFFER_LOAD_FORMAT_XY FORMAT field BUFFER_LOAD_FORMAT_XYZ FORMAT field data type in VGPR is based on FORMAT field. float, uint or sint Load X,Y into V0[31:0], V1[31:0] (D16_X and D16_HI_X write only 16 bits) float, uint or sint Load X,Y,Z into V0[31:0], V1[31:0], V2[31:0] BUFFER_LOAD_FORMAT_XYZW FORMAT field float, uint or sint Load X,Y,Z,W into V0[31:0], V1[31:0], V2[31:0], v3[31:0] BUFFER_LOAD_D16_FORMAT_X FORMAT field float, uint or sint Load X into in V0[15:0] BUFFER_LOAD_D16_HI_FORMAT_X FORMAT field float, ushort or sshort Load X into in V0[31:16] BUFFER_LOAD_D16_FORMAT_XY FORMAT field float, ushort or sshort Load X,Y into in V0[15:0], V0[31:16] BUFFER_LOAD_D16_FORMAT_XYZ FORMAT field float, ushort or sshort Load X,Y,Z into in V0[15:0], V0[31:16], V1[15:0] BUFFER_LOAD_D16_FORMAT_XYZW FORMAT field float, ushort or sshort Load X,Y,Z,W into in V0[15:0], V0[31:16], V1[15:0], V1[31:16] Where "V0" is the VDATA VGPR; V1 is the VDATA+1 VGPR, etc. Instruction VGPR Format Memory Format BUFFER_STORE_B8 byte in [7:0] byte BUFFER_STORE_D16_HI_B8 byte in [23:16] byte BUFFER_STORE_B16 short in [15:0] short BUFFER_STORE_D16_HI_B16 short in [31:16] short BUFFER_STORE_B32 data in [31:0] DWORD 9.2. VGPR Usage Notes 85 of 597 "RDNA3" Instruction Set Architecture Instruction VGPR Format Memory Format Notes BUFFER_STORE_FORMAT_X float, uint or sint data in V0[31:0] BUFFER_STORE_D16_FORMAT_X float, ushort or sshort data in V0[15:0] FORMAT field data type in VGPR is based on FORMAT field. BUFFER_STORE_D16_FORMAT_XY float, ushort or sshort data in V0[15:0], V0[31:16] BUFFER_STORE_D16_FORMAT_XYZ float, ushort or sshort data in V0[15:0], V0[31:16], V1[15:0] BUFFER_STORE_D16_FORMAT_XYZW float, ushort or sshort data in V0[15:0], V0[31:16], V1[15:0], V1[31:16] BUFFER_STORE_D16_HI_FORMAT_X float, ushort or sshort data in V0[31:16] 9.3. Buffer Data The amount and type of data that is loaded or stored is controlled by the following: the resource format field, destination-component-selects (dst_sel), and the opcode. Data-format can come from the resource, instruction fields, or the opcode itself. MTBUF derives data-format from the instruction, MUBUF-"format" instructions use format from the resource, and other MUBUF opcode derive data-format from the instruction itself. DST_SEL comes from the resource, but is ignored for many operations. Table 42. Buffer Instructions Instruction Data Format DST SEL TBUFFER_LOAD_FORMAT_* instruction identity TBUFFER_STORE_FORMAT_* instruction identity BUFFER_LOAD_ derived identity BUFFER_STORE_ derived identity BUFFER_LOAD_FORMAT_* resource resource BUFFER_STORE_FORMAT_* resource resource BUFFER_ATOMIC_* derived identity Instruction : The instruction’s format field is used instead of the resource’s fields. Data format derived : The data format is derived from the opcode and ignores the resource definition. For example, BUFFER_LOAD_U8 sets the data-format to uint-8.  The resource’s data format must not be INVALID; that format has specific meaning (unbound resource), and for that case the data format is not replaced by the instruction’s implied data format. DST_SEL identity : Depending on the number of components in the data-format, this is: X000, XY00, XYZ0, or XYZW. 9.3. Buffer Data 86 of 597 "RDNA3" Instruction Set Architecture 9.3.1. D16 Instructions Load-format and store-format instructions also come in a "D16" variant. The D16 buffer instructions allow a shader program to load or store just 16 bits per work-item between VGPRs and memory. For stores, each 32bit VGPR holds two 16bit data elements that are passed to the texture unit which in turn, converts to the texture format before writing to memory. For loads, data returned from the texture unit is converted to 16 bits and a pair of data are stored in each 32bit VGPR (LSBs first, then MSBs). Control over int vs. float is controlled by FORMAT. Conversion of float32 to float16 uses truncation; conversion of other input data formats uses roundto-nearest-even. There are two variants of these instructions: • D16 loads data into or stores data from the lower 16 bits of a VGPR. • D16_HI loads data into or stores data from the upper 16 bits of a VGPR. For example, BUFFER_LOAD_D16_U8 loads a byte per work-item from memory, converts it to a 16-bit integer, then loads it into the lower 16 bits of the data VGPR. 9.3.2. LOAD/STORE_FORMAT and DATA-FORMAT mismatches The "format" instructions specify a number of elements (x, xy, xyz or xyzw) and this could mismatch with the number of elements in the data format specified in the instruction’s or resource’s data-format field. When that happens. • buffer_load_format_x and dfmt is "32_32_32_32" : load 4 DWORDs from memory, but only load first into the shader • buffer_store_format_x and dfmt is "32_32_32_32" : stores 4 DWORDs to memory based on dst_sel • buffer_load_format_xyzw and dfmt is "32" : load 1 DWORD from memory, return 4 to shader (dst_sel) • buffer_store_format_xyzw and dfmt is "32" : store 1 DWORD (X) to memory, ignore YZW. 9.4. Buffer Addressing A buffer is a data structure in memory that is addressed with an index and an offset. The index points to a particular record of size stride bytes, and the offset is the byte-offset within the record. The stride comes from the resource, the index from a VGPR (or zero), and the offset from an SGPR or VGPR and also from the instruction itself. Table 43. BUFFER Instruction Fields for Addressing Field Size Description inst_offset 12 Literal byte offset from the instruction. inst_idxen 1 Boolean: get per-lane index from VGPR when true, or no index when false. inst_offen 1 Boolean: get per-lane offset from VGPR when true, or no offset when false. Note that inst_offset is present regardless of this bit. The "element size" for a buffer instruction is the amount of data the instruction transfers in bytes. It is determined by the FORMAT field for MTBUF instructions, or from the opcode for MUBUF instructions, and is: 1, 2, 4, 8, 12 or 16 bytes. For example, format "16_16" has an element size of 4-bytes. 9.4. Buffer Addressing 87 of 597 "RDNA3" Instruction Set Architecture Table 44. Buffer Resource Constant Fields for Addressing Field Size Description const_base 48 Base address of the buffer resource, in bytes. const_stride 14 Stride of the record in bytes (0 to 16,383 bytes). const_num_records 32 Number of records in the buffer. In units of: Bytes if: const_stride == 0 || const_swizzle_enable == false Otherwise, in units of "stride". const_add_tid_enable 1 Boolean. Add thread_ID within the wave to the index when true. const_swizzle_enable 2 Swizzle AOS according to stride, index_stride and element_size: 0: disabled 1: enabled with element_size = 4-byte 2: Reserved 3: enabled with element_size = 16-byte const_index_stride 2 Used only when const_swizzle_en = true. Number of contiguous indices for a single element (of const_element_size=4 or 16 bytes) before switching to the next element. 8, 16, 32 or 64 indices. Table 45. Address Components from GPRs Field Size Description SGPR_offset 32 An unsigned byte-offset to the address. Comes from an SGPR or M0. VGPR_offset 32 An optional unsigned byte-offset. It is per-thread, and comes from a VGPR. VGPR_index 32 An optional index value. It is per-thread and comes from a VGPR. The final buffer memory address is composed of three parts: • the base address from the buffer resource (V#), • the offset from the SGPR, and • a buffer-offset that is calculated differently, depending on whether the buffer is linearly addressed (a simple Array-of-Structures calculation) or is swizzled. Address Calculation for a Linear Buffer 9.4.1. Range Checking Buffer addresses are checked against the size of the memory buffer. Loads that are out of range return zero, and stores and atomics are dropped. Range checking is per-component for non-formatted loads and stores that are larger than one DWORD. Note that load/store_B64, B96 and B128 are considered "2-DWORD/3-DWORD/4DWORD load/store", and each DWORD is bounds checked separately. The method of clamping is controlled by 9.4. Buffer Addressing 88 of 597 "RDNA3" Instruction Set Architecture a 2-bit field in the buffer resource: OOB_SELECT (Out of Bounds select). Table 46. Buffer Out Of Bounds Selection OOB SELECT Out of Bounds Check Description or use 0 (index >= NumRecords) || (offset+payload > stride) structured buffers 1 (index >= NumRecords) Raw buffers 2 (NumRecords == 0) do not check bounds (except empty buffer) 3 Bounds check: Raw In this mode, "num_records" is reduced by "sgpr_offset" if (swizzle_en && const_stride != 0x0) OOB = (index >= NumRecords || (offset+payload > stride)) else OOB = (offset+payload > NumRecords) Where "payload" is the number of bytes the instruction transfers. Notes: 1. Loads that go out-of-range return zero (except for components with V#.dst_sel = SEL_1 that return 1). 2. Stores that are out-of-range do not store anything. 3. Load/store-format-* instruction and atomics are range-checked "all or nothing" - either entirely in or out. 4. Load/store-B{64,96,128} and range-check per component. For MTBUF, if any component of the thread is out of bounds, the whole thread is considered out of bounds and returns zero. For MUBUF, only the components that are out of bounds return zero. 9.4.1.1. Structured Buffer The address calculation for swizzle_en==0 is: (unswizzled structured buffer) ADDR = Base + baseOff + Ioff + V# SGPR INST Stride * Vidx V# + (OffEn ? Voff : 0) VGPR INST VGPR NumRecords for structured buffer is in units of stride. 9.4.1.2. Raw Buffer ADDR = Base + baseOff + Ioff + V# SGPR INST (OffEn ? Voff : 0) INST VGPR NumRecords for raw buffer is in units of bytes. This is an exact range check, meaning it includes the payload and handles multi-DWORD and unaligned correctly. The stride field is ignored. 9.4. Buffer Addressing 89 of 597 "RDNA3" Instruction Set Architecture 9.4.1.3. Scratch Buffer The address calculation for swizzle_en = 0 is…(unswizzled scratch buffer) ADDR = Base + baseOffset + Ioff + V# SGPR INST Stride * TID + V# (OffEn ? Voff : 0) 0..63 INST VGPR Swizzle of scratch buffer is also supported (and is typical). The MSBs of the TID (TID / 64) is folded into baseOffset. No range checking (using OOB mode 2). 9.4.1.4. Scalar Memory Scalar memory does the following, that works with RAW buffers and unswizzled structured buffers: Addr = Base + V# offset SGPR or Inst Address Out-of-Range if: offset >= ( (stride==0 ? 1 : stride) * num_records). Notes 1. Loads that go out-of-range return zero (except for components with V#.dst_sel = SEL_1 that return 1). Stores that are out of range do not write anything. 2. Load/store-format-* instruction and atomics are range-checked "all or nothing" - either entirely in or out. 3. Load/store-DWORD-x{2,3,4} perform range-check per component. 9.4.2. Swizzled Buffer Addressing Swizzled addressing rearranges the data in the buffer that may improve cache locality for arrays of structures. Swizzled addressing also requires DWORD-aligned accesses. A single fetch instruction must not fetch a unit larger than const_element_size. The buffer’s STRIDE must be a multiple of const_element_size. const_element_size is either 4 or 16 bytes, depending on the setting of V#.swizzle_enable Index = (inst_idxen ? vgpr_index : 0) + (const_add_tid_enable ? thread_id[5:0] : 0) Offset = (inst_offen ? vgpr_offset : 0) + inst_offset index_msb = index / const_index_stride index_lsb = index % const_index_stride offset_msb = offset / const_element_size offset_lsb = offset % const_element_size buffer_offset = (index_msb * const_stride + offset_msb * const_element_size) * const_index_stride + index_lsb * const_element_size + offset_lsb Final Address = const_base + sgpr_offset + buffer_offset The "sgpr_offset" is not a part of the "offset" term in the above equations - it's in the "base". 9.4. Buffer Addressing 90 of 597 "RDNA3" Instruction Set Architecture Example of Buffer Swizzling 9.5. Alignment Formatted ops such as BUFFER_LOAD_FORMAT_* must be aligned to element_size. Memory alignment enforcement for non-formatted ops is controlled by a configuration register: SH_MEM_CONFIG.alignment_mode. Options are: 0. : DWORD - hardware automatically aligns request to the smaller of: element-size or DWORD. For DWORD or larger loads or stores of non-formatted ops (such as BUFFER_LOAD_DWORD), the two LSBs of the byte-address are ignored, thus forcing DWORD alignment. 1. : DWORD_STRICT - must be aligned to the smaller of: element-size or DWORD. 2. : STRICT - access must be aligned to data size 9.5. Alignment 91 of 597 "RDNA3" Instruction Set Architecture 3. : UNALIGNED - any alignment is allowed Options 1 and 2 report MEMVIOL if a request is made with incorrect address alignment. In options 1 and 2, loads that are misaligned return zero, and stores that are misaligned are discarded. Note that in this context "element-size" refers to the size of the data transfer indicated by the instruction, not const_element_size. 9.6. Buffer Resource The buffer resource (V#) describes the location of a buffer in memory and the format of the data in the buffer. It is specified in four consecutive SGPRs (4-SGPR aligned) and sent to the texture cache with each buffer instruction. The table below details the fields that make up the buffer resource descriptor. Table 47. Buffer Resource Descriptor Bits Size Name Description 47:0 48 Base address Byte address. 61:48 14 Stride Bytes 0 to 16383 63:62 2 swizzle Enable Swizzle AOS according to stride, index_stride and element_size; otherwise linear. 0: disabled 1: enabled with element_size = 4byte 2: Reserved 3: enabled with element_size = 16byte 95:64 32 Num_records In units of stride if (stride >=1), else in bytes. 98:96 3 Dst_sel_x 101:99 3 Dst_sel_y Destination channel select: 0=0, 1=1, 4=R, 5=G, 6=B, 7=A 104:102 3 Dst_sel_z 107:105 3 Dst_sel_w 113:108 6 Format Memory data type. 118:117 2 Index stride 0:8, 1:16, 2:32, or 3:64. Used for swizzled buffer addressing. 119 1 Add tid enable Add thread ID to the index for to calculate the address. 123:122 2 LLC NoAlloc May become deprecated. Please use shader instruction fields instead. 0: NOALLOC = (PTE.NOALLOC | instruction.dlc) 1: NOALLOC = Read ? (PTE.NOALLOC | instruction.dlc) : 1 2: NOALLOC = Read ? 1 : (PTE.NOALLOC | instruction.dlc) 3: NOALLOC = 1 125:124 2 OOB_SELECT Out of bounds select. 127:126 2 Type Value == 0 for buffer. Overlaps upper two bits of four-bit TYPE field in 128-bit V# resource. Unbound Resources Setting the resource constant to all zeros has the effect of forcing any loads to return zero, and stores to be ignored. This is keyed off the "data-format" being set to zero (INVALID), and for MUBUF the "add_tid_en = false". Resource - Instruction mismatch If the resource type and instruction mismatch (e.g. a buffer constant with an image instruction, or an image 9.6. Buffer Resource 92 of 597 "RDNA3" Instruction Set Architecture resource with a buffer instruction), the instruction is ignored (loads return nothing and stores do not alter memory). 9.6. Buffer Resource 93 of 597 "RDNA3" Instruction Set Architecture Chapter 10. Vector Memory Image Instructions Vector Memory (VMEM) Image operations transfer data between the VGPRs and memory through the texture cache. Image operations support access to image objects such as texture maps and typed surfaces. Sample operations read multiple elements from a surface and combine them to produce a single result per lane. Image objects are accessed using from one to four dimensional addresses; they are composed of homogeneous samples, each sample containing one to four elements. These image objects are read from, or written to, using IMAGE_* or SAMPLE_* instructions, all of which use the MIMG instruction format. IMAGE_LOAD instructions load an element from the image buffer directly into VGPRS, and SAMPLE instructions use sampler constants (S#) and apply filtering to the data after it is read. IMAGE_ATOMIC instructions combine data from VGPRs with data already in memory, and optionally return the value that was in memory before the operation. VMEM image operations use an image resource constant (T#) that is a 128-bit or 256-bit value in SGPRs. This constant is sent to the texture cache when the instruction is executed. This constant defines the address, data format, and characteristics of the surface in memory. Some image instructions also use a sampler constant that is a 128-bit constant in SGPRs. Typically, these constants are fetched from memory using scalar memory loads prior to executing VM instructions, but these constants can also be generated within the shader. Texture fetch instructions have a data mask (DMASK) field. DMASK specifies how many data components it receives. If DMASK is less than the number of components in the texture, the texture unit only sends DMASK components, starting with R, then G, B, and A. if DMASK specifies more than the texture format specifies, the shader receives data based on T#.DST_SEL for the missing components. Image ops do not generate MemViol instead they apply clamp modes if the address goes out of range. Memory operations of different types (e.g. loads, stores and samples) can complete out of order with respect to each other. 10.1. Image Instructions This section describes the image instruction set, and the microcode fields available to those instructions. MIMG Instructions IMAGE_SAMPLE IMAGE_SAMPLE_G16 Load and filter data from a image object Sample with 16-bit gradients IMAGE_GATHER4 Load and return samples from 4 texels for software filtering. Returns a single component, starting with the lower-left texel and in counter-clockwise order. IMAGE_GATHER4H 4H: fetch 1 component per texel from 4x1 texels "DMASK" selects which component to load (R,G,B,A) and must have only one bit set to 1. IMAGE_LOAD_{-, PCK, PCK_SGN} Load data from an image object IMAGE_LOAD_MIP_{-, PCK, PCK_SGN } Load data from an image object from a specified mip level. IMAGE_MSAA_LOAD Load up to 4 samples of 1 component from an MSAA resource with a userspecified fragment ID. Uses DMASK as component select - it behaves like gather4 ops and returns 4 VGPR (2 if D16=1). IMAGE_STORE_{-, PCK } IMAGE_STORE_MIP_{-, PCK } Store data to an image object to a specific mipmap level 10.1. Image Instructions 94 of 597 "RDNA3" Instruction Set Architecture MIMG Instructions IMAGE_ATOMIC_{SWAP, CMPSWAP, Image atomic operations ADD, SUB, SMIN, UMIN, SMAX, UMAX, AND, OR, XOR, INC, DEC } IMAGE_GET_RESINFO Return resource info into 4 VGPRs for the MIP level specified. These are 32bit integer values: VDATA3-0 = { #mipLevels, depth, height, width } For cubemaps, depth = 6 * Number_of_array_faces. (DX expects the # of cubes, but gets # of faces instead) IMAGE_GET_LOD Return the calculated LOD. Treated as a Sample instruction. Returns the "raw" LOD and the "clamped" LOD into VDATA as two 32 bit floats: First VGPR = clampLOD Second VGPR = rawLOD Table 48. Instruction Fields Instruction Fields Field Size Description OP 8 Opcode VADDR 8 Address of VGPR to supply first component of address. VDATA 8 Address of VGPR to supply first component of store-data or receive first component of load-data. SSAMP 5 SGPR to supply S# (sampler constant) in 4 consecutive SGPRs. missing 2 LSB’s of SGPR-address since must be aligned to 4. SRSRC 5 SGPR to supply T# (resource constant) in 8 consecutive SGPRs. missing 2 LSB’s of SGPR-address since must be aligned to 4. UNRM 1 Force address to be un-normalized. Must be set to 1 for Image stores & atomics. 0: for image ops with samplers, S,T,R from [0.0, 1.0] span the entire texture map; 1: for image ops with samplers, S,T,R from [0.0 to N] span the texture map, where N is width, height or depth. Array/cube slice, lod, bias etc. are not affected. Image ops without sampler are not affected. UINT inputs are "unnormalized". This bit is logically OR’d with the S#.force_unnormalized bit. R128 1 Texture Resource Size: 1 = 128bits, 0 = 256bits A16 1 Address components are 16-bits (instead of the usual 32 bits). When set, all address components are 16 bits (packed into 2 per DWORD), except: Texel offsets (3 6bit UINT packed into 1 DWORD) PCF reference (for "_C" instructions) Address components are 16b uint for image ops without sampler; 16b float with sampler. DIM 3 Surface Dimension: 10.1. Image Instructions 0: 1D 4: 1d array 1: 2D 5: 2d array 2: 3D 6: 2d msaa 3: cube 7: 2d msaa array 95 of 597 "RDNA3" Instruction Set Architecture Instruction Fields DMASK 4 Data VGPR enable mask: 1 .. 4 consecutive VGPRs Loads: defines which components are returned: 0=red,1=green,2=blue,3=alpha Stores: defines which components are written with data from VGPRs (missing components get 0). Enabled components come from consecutive VGPRs. E.G. DMASK=1001 : Red is in VGPRn and alpha in VGPRn+1. For D16 loads, DMASK indicates which components to return; For D16 stores, the DMASK the mask indicates which components to store but has restrictions: Data is read out of consecutive VGPRs: LSB’s of VDATA, then MSB’s of VDATA then LSB’s of VDATA+1 and last if needed MSB’s of VDATA+1. This is regardless of which DMASK bits are set, only how many bits are set. The position of the DMASK bits controls which components are written in memory. If DMASK==0, the TA overrides DMASK=1 and puts zeros in VGPR followed by LWE status if exists. TFE status is not generated since the fetch is dropped. For IMAGE_GATHER4* instructions, DMASK indicates which component (RGBA), and the number of VGPRs to use is determined automatically by hardware (4 VGPRs when D16=0, and 2 VGPRs when D16=1). GLC 1 Group Level Coherent. Atomics: 1 = return the memory value before the atomic operation is performed. 0 = do not return anything. DLC 1 Device Level Coherent. Controls behavior of L1 cache (GL1). SLC 1 System Level Coherent. TFE 1 Texel Fault Enable for PRT (Partially Resident Textures). When set, fetch may return a NACK that causes a VGPR write into DST+1 (first GPR after all fetch-dest gprs). LWE 1 LOD Warning Enable. When set to 1, a texture fetch may return "LOD_CLAMPED = 1", and causes a VGPR write into DST+1 (first GPR after all fetch-dest gprs). LWE only works for sampler ops; LWE is ignored for non-sampler ops. D16 1 VGPR-Data-16bit. On loads, convert data in memory to 16-bit format before storing it in VGPRs. For stores, convert 16-bit data in VGPRs to the memory format before going to memory. Whether the data is treated as float or int is decided by NFMT. Allowed only with these opcodes: • IMAGE_SAMPLE* • IMAGE_GATHER4 • IMAGE_LOAD • IMAGE_LOAD_MIP • IMAGE_STORE • IMAGE_STORE_MIP NSA 1 Non-Sequential Address When NSA=0, the image addresses must be in sequential VGPRs starting at 'VADDR'. When NSA=1, the instruction encoding allows up to 5 address components to be specified separately by using an additional instruction DWORD. ADDR1-4 4x8 Four 8-bit VGPR address fields, used by NSA. The "VADDR" field provides ADDR0. 10.1.1. Texture Fault Enable (TFE) and LOD Warning Enable (LWE) This is related to "Partially Resident Textures". When either of these bits are set in the instruction, any texture fetch may return one extra VGPR after all of the data-return VGPRs. This data is returned uniquely to each thread and indicates the error / warning status of that thread. 10.1. Image Instructions 96 of 597 "RDNA3" Instruction Set Architecture The data returned is: TEXEL_FAIL | (LOD_WARNING << 1) | (LOD << 16) • TEXEL_FAIL : 1 bit indicating that 1 or more texels for this pixel produced a NACK. "failure" means accessing an unmapped page. ◦ TFE == 0 ▪ TD writes the data for threads that didn’t NACK to VGPR DST ▪ TD writes zeros or the result of blend using zeros for samples that NACKed to VGPR DST ◦ TFE == 1 ▪ VGPR DST is written similar to above ▪ TD writes to VGPR DST+1 with a status where the bits corresponding to threads that NACKed are set to 1 • LOD_WARNING : 1 bit indicating a that a pixel attempted to access a texel at too small a LOD: warn = ( LOD < T#.min_lod_warning) • LOD : indicates which LOD was attempted to be accessed that caused the NACK. Returns the floor of the requested LOD. A pixel cannot receive both TEXEL_FAIL and LOD_WARNING: TEXEL_FAIL takes precedence. 10.1.2. D16 Instructions Load-format and store-format instructions also come in a "d16" variant. For stores, each 32-bit VGPR holds two 16-bit data elements that are passed to the texture unit. The texture unit converts them to the texture format before writing to memory. For loads, data returned from the texture unit is converted to 16 bits, and a pair of data are stored in each 32- bit VGPR (LSBs first, then MSBs). The DMASK bit represents individual 16- bit elements; so, when DMASK=0011 for an image-load, two 16-bit components are loaded into a single 32-bit VGPR. 10.1.3. A16 Instructions The A16 instruction bit indicates that the address components are 16 bits instead of the usual 32 bits. Components are packed such that the first address component goes into the low 16 bits ([15:0]), and the next into the high 16 bits ([31:16]). 10.1.4. G16 Instructions The instructions with "G16" in the name mean the user provided derivatives are 16 bits instead of the usual 32 bits. Derivatives are packed such that the first derivative goes into the low 16 bits ([15:0]), and the next into the high 16 bits ([31:16]). 10.1.5. Image Non-Sequential Address (NSA) To avoid having many V_MOV instructions to pack image address VGPRs together, MIMG supports a "Non Sequential Address" version of the instruction where the VGPR of every address component is uniquely defined. Data components are still packed. This format creates a larger instruction word, which can be up to 3 DWORDs long. The first address goes in the VADDR field, and subsequent addresses go into ADDR1-4. This 3 DWORD form of the instruction can supply up to 5 addresses. 10.1. Image Instructions 97 of 597 "RDNA3" Instruction Set Architecture NSA allows an image instruction to specify up to 5 unique address VGPRs. These are the rules for how instructions requiring more than 5 addresses are handled with NSA. It is permissible to use non-NSA mode where all addresses are in sequential VGPRs. • VADDR provides the first address component • ADDR1 provides the second address component • ADDR2 provides the third address component • ADDR3 provides the fourth address component • ADDR4 provides all additional components in sequential VGPRs: VADDR4, VADDR4+1, etc. When using 16-bit addresses, each VGPR holds a pair of addresses and these cannot be located in different VGPRs. The lower numbered 16-bit value is in the LSBs of the VGPR. For Ray Tracing, the VGPRs are divided up into 5 groups of VGPRs. The VGPRs within each group must be contiguous, but the groups can be scattered. The packing is different when A16=1 because RayDir.Z and RayInvDir.x are in the same DWORD. In A16 mode, the RayDir and RayInvDir are merged into 3 VGPRs but in a different order: RayDir and RayInvDir per component share a VGPR. 10.2. Image Opcodes with No Sampler For image opcodes with no sampler, all VGPR address values are taken as uint. For cubemaps, face_id = slice * 6 + face. MSAA surfaces support only load, store and atomics; not load-mip or store-mip. The table below shows the contents of address VGPRs for the various image opcodes. Opcode a16[0] type acnt VGPRn[31:0] GET_RESINFO x Any 0 mipid MSAA_LOAD 0 2D MSAA 2 2D Array MSAA 3 2D MSAA 2D Array MSAA 1 10.2. Image Opcodes with No Sampler VGPRn+1[31:0] VGPRn+2[31:0] s t fragid s t slice 2 t, s -, fragid 3 t, s fragid, slice VGPRn+3[31:0] fragid 98 of 597 "RDNA3" Instruction Set Architecture Opcode a16[0] type acnt VGPRn[31:0] LOAD LOAD_PCK LOAD_PCK_SGN STORE STORE_PCK 0 1D 0 s 2D 1 s t 3D 2 s t r Cube/Cube Array 2 s t face 1D Array 1 s slice 2D Array 2 s t slice 2D MSAA 2 s t fragid 2D Array MSAA 3 s t slice 1D 0 -, s 2D 1 t, s 3D 1 ATOMIC 0 1 LOAD_MIP 0 LOAD_MIP_PCK LOAD_MIP_PCK_SGN STORE_MIP STORE_MIP_PCK 1 VGPRn+1[31:0] VGPRn+2[31:0] VGPRn+3[31:0] fragid 2 t, s -, r Cube/Cube Array 2 t, s -, face 1D Array 1 slice, s 2D Array 2 t, s -, slice 2D MSAA 2 t, s -, fragid 2D Array MSAA 3 t, s fragid, slice 1D 0 s 2D 1 s t 3D 2 s t 1D Array 1 s slice 2D Array 2 s t slice 2D MSAA 2 s t fragid 2D Array MSAA 3 s t slice 1D 0 -, s 2D 1 t, s 3D 2 t, s 1D Array 1 slice, s 2D Array 2 t, s -, slice 2D MSAA 2 t, s -, fragid 2D Array MSAA 3 t, s fragid, slice 1D 1 s mipid 2D 2 s t mipid 3D 3 s t r mipid Cube/Cube Array 3 s t face mipid 1D Array 2 s slice mipid 2D Array 3 s t slice 1D 1 mipid, s 2D 2 t, s -, mipid 3D r fragid -, r 3 t, s mipid, r Cube/Cube Array 3 t, s mipid, face 1D Array 2 slice, s -, mipid 2D Array 3 t, s mipid, slice mipid • Image_Load : image_load, image_load_mip, image_load_{pck, pck_sgn, mip_pck, mip_pck_sgn} • Image_Store: image_store, image_store_mip • Image_Atomic_*: swap, cmpswap, add, sub, {u,s}{min,max}, and, or, xor, inc, dec. "ACNT" is the Address Count: the number of VGPRs that supply the "body" of the address, derived from the 10.2. Image Opcodes with No Sampler 99 of 597 "RDNA3" Instruction Set Architecture instruction’s DIM field and the opcode. 10.3. Image Opcodes with a Sampler Opcodes with a sampler: all VGPR address values are taken as FLOAT except for Texel-offset which are UINT. For cubemaps, face_id = slice * 8 + face. (Note that the "*8" differs from the non-sampler case which is "*6"). Certain sample and gather opcodes require additional values from VGPRs beyond what is shown in the table below. These values are: offset, bias, z-compare and gradients. Please see the next section for details. MSAA surfaces do not support sample or gather4 operations. Opcode a16[0] acnt type VGPRn[31:0] Sample GetLod 0 1 Sample "_L": 0 1 Sample "_CL": 0 1 10.3. Image Opcodes with a Sampler VGPRn+1[31:0] VGPRn+2[31:0] VGPRn+3[31:0] 0 1D s 1 2D s t 2 3D s t r 2 Cube(Array) s t face 1 1D Array s slice 2 2D Array s t 0 1D -, s 1 2D t, s 2 3D t, s 2 Cube(Array) t, s 1 1D Array slice, s 2 2D Array t, s -, slice 1 1D s lod 2 2D s t lod 3 3D s t r lod 3 Cube(Array) s t face lod 2 1D Array s slice lod 3 2D Array s t slice 1 1D lod, s 2 2D t, s -, lod 3 3D t, s lod, r 3 Cube(Array) t, s lod, face 2 1D Array slice, s -, lod 3 2D Array t, s lod, slice 1 1D s clamp 2 2D s t clamp 3 3D s t r clamp 3 Cube(Array) s t face clamp 2 1D Array s slice clamp 3 2D Array s t slice 1 1D clamp, s 2 2D t, s -, clamp 3 3D t, s clamp, r 3 Cube(Array) t, s clamp, face 2 1D Array slice, s -, clamp 3 2D Array t, s clamp, slice slice -, r -, face lod clamp 100 of 597 "RDNA3" Instruction Set Architecture Opcode a16[0] acnt type VGPRn[31:0] VGPRn+1[31:0] Gather 0 1 Gather "_L" 0 1 Gather "_CL" 0 1 VGPRn+2[31:0] VGPRn+3[31:0] 1 2D s t 2 Cube(Array) s t face 2 2D Array s t slice 1 2D t, s 2 Cube(Array) t, s -, face 2 2D Array t, s -, slice 2 2D s t lod 3 Cube(Array) s t face lod 3 2D Array s t slice lod 2 2D t, s -, lod 3 Cube(Array) t, s lod, face 3 2D Array t, s lod, slice 2 2D s t clamp 3 Cube(Array) s t face clamp 3 2D Array s t slice clamp 2 2D t, s -, clamp 3 Cube(Array) t, s clamp, face 3 2D Array clamp, slice t, s The table below lists and briefly describes the legal suffixes for image instructions: Table 49. Sample Instruction Suffix Key Suffix Meaning Extra Addresses Description _L LOD - LOD is used instead of computed LOD. _B LOD BIAS 1: lod bias Add this BIAS to the computed LOD. _CL LOD CLAMP - Clamp the computed LOD to be no larger than this value. _D Derivative 2,4 or 6: slopes Send dx/dv, dx/dy, etc. slopes to be used in LOD computation. _LZ Level 0 - Force use of MIP level 0. _C PCF 1: z-comp Percentage closer filtering. _O Offset 1: offsets Send X, Y, Z integer offsets (packed into 1 DWORD) to offset XYZ address. _G16 Gradient 16b - Gradients are 16-bits instead of 32-bits, packed 2 gradients per VGPR (dX in low 16bits, dY in high 16bits). 10.4. VGPR Usage Address: The address consists of up to 5 parts: { offset } { bias } { z-compare } { derivative } { body } These are all packed into consecutive VGPRs, (may be non-consecutive if "NSA" is used), and can consist of up to 12 values. • Offset: SAMPLE*O*, GATHER*O* 1 DWORD of 'offset_xyz' . The offsets are 6-bit signed integers: X=[5:0], Y=[13:8], Z=[21:16] • Bias: SAMPLE*B*, GATHER*B*. 1 DWORD float. • Z-compare: SAMPLE*C*, GATHER*C*. 1 DWORD. • Derivatives (SAMPLE_D): 2,4 or 6 DWORDS - these packed 1 DWORD per derivative as shown below (F32). • Body: One to four DWORDs, as defined by the table: Image Opcodes with a Sampler Address components are X,Y,Z,W with X in VGPR[M], Y in VGPR[M]+1, etc. 10.4. VGPR Usage 101 of 597 "RDNA3" Instruction Set Architecture The number of components in "body" is the value of the ACNT field in the table, plus one. Address components are X,Y,Z,W with X in VGPR[M], Y in VGPR[M]+1, etc. Note: Bias and Derivatives are mutually exclusive - the shader can use one or the other, but not both. 32-bit derivatives: Image Dim VGPR N N+1 N+2 N+3 N+4 N+5 1D dx/dh dx/dv - - - - 2D/cube dx/dh dy/dh dx/dv dy/dv - — 3D dx/dh dy/dh dz/dh dx/dv dy/dv dz/dv 16-bit derivatives: Image Type VGPR_D 1 (1D, 1D Array) 16’hx, dx/dh 16’hx dx/dv VGPR_D+1 VGPR_D+2 VGPR_D+3 - - 2 (2D, 2D Array, Cubemap) dy/dh, dx/dh dy/dv, dx/dv - - 3 (3D) dy/dh, dx/dh 16’hx, dz/dh dy/dv, dx/dv 16’hx, dz/dv The "A16" instruction bit specifies that address components are 16 bits instead of the usual 32 bits. Data : data is stored from or returned to 1-4 consecutive VGPRs. The amount of data loaded or stored is completely determined by the DMASK field of the instruction. Loads DMASK specifies which elements of the resource are returned to consecutive VGPRs. The texture system loads data from memory and based on the data format expands it to a canonical RGBA form, filling in values for missing components based on T#.dst_sel. Then DMASK is applied and only those components selected are returned to the shader. Stores When writing an image object, it is only possible to write an entire element (all components) - not only individual components. The components come from consecutive VGPRs and the texture system fill in the value zero for any missing components of the image’s data format, and ignore any values that are not part of the stored data format. For example if the DMASK=1001, the shader sends Red from VGPR_N and Alpha from VGPR_N+1 to the texture unit. If the image object is RGB, the texel is overwritten with Red from the VGPR_N, Green and Blue set to zero, and Alpha from the shader ignored. For D16=1, the DMASK has 1 bit set per 16-bits of data to be written from VGPRs to memory. The position of the bits in DMASK is irrelevant, only the number of bits set to 1. "D16" instructions Load and store instructions also come in a "d16" variant. For stores, each 32bit VGPR holds two 16bit data elements that are passed to the texture unit which in turn, converts to the texture format before writing to memory. For loads, data returned from the texture unit is converted to 16 bits and a pair of data are stored in each 32bit VGPR (LSBs first, then MSBs). If there is only one component, the data goes into the lower half of the VGPR unless the "HI" instruction variant is used in which case the high-half of the VGPR is loaded with data. 10.4. VGPR Usage 102 of 597 "RDNA3" Instruction Set Architecture Atomics Image atomic operations are supported only on 32- and 64-bit-per-pixel surfaces. The surface data format is specified in the resource constant. Atomic operations treat the element as a single component of 32- or 64bits. For atomic operations, DMASK is set to the number of VGPRs (DWORDs) to send to the texture unit. DMASK legal values for atomic image operations: All other values of DMASK are illegal. • 0x1 = 32bit atomics except cmpswap • 0x3 = 32bit atomic cmpswap • 0x3 = 64bit atomics except cmpswap • 0xf = 64bit atomic cmpswap • Atomics with Return: Data is read out of the VGPR(s), starting at VDATA, to supply to the atomic operation. If the atomic returns a value to VGPRs, that data is returned to those same VGPRs starting at VDATA. The DMASK must be compatible with the resource’s data format. Denormals in Floats Sample ops flush denormals, and loads do not modify denormals. 10.4.1. Data format in VGPRs Data in VGPRs sent to texture (stores) or returned from texture (loads) is in one of a few standard formats, and the texture unit converts to/from the memory format. FORMAT VGPR data format If D16==1 SINT signed 32-bit integer 16 bit signed int UINT unsigned 32-bit integer 16 bit unsigned int others 32-bit float 16 bit float Atomics depends on opcode: uint or float - ASTC data formats 32-bit float - 10.5. Image Resource The image resource (also referred to as T#) defines the location of the image buffer in memory, its dimensions, tiling, and data format. These resources are stored in four or eight consecutive SGPRs and are read by MIMG instructions. All undefined or reserved bit must be set to zero unless otherwise specified. Table 50. Image Resource Definition Bits Size Name Comments 128-bit Resource: 1D-tex, 2d-tex, 2d-msaa (multi-sample anti-aliasing) 39:0 40 base address 256-byte aligned (represents bits 47:8). 47 1 Big Page 0 = No page size override, 1 = coalesce page translation requests to 64kB granularity. Use only when entire resource uses pages 64kB or greater. 51:48 4 max mip MSAA resources: holds Log2(number of samples); others holds: MipLevels-1. This describes the resource, not the resource view. 59:52 8 format Memory Data format 75:62 14 width width-1 of mip 0 in texels 10.5. Image Resource 103 of 597 "RDNA3" Instruction Set Architecture Bits Size Name Comments 91:78 14 height height-1 of mip 0 in texels 98:96 3 dst_sel_x 0 = 0, 1 = 1, 4 = R, 5 = G, 6 = B, 7 = A. 101:99 3 dst_sel_y 104:102 3 dst_sel_z 107:105 3 dst_sel_w 111:108 4 base level largest mip level in the resource view. For MSAA, this should be set to 0 115:112 4 last level smallest mip level in resource view. For MSAA, holds log2(number of samples). 123:121 3 BC Swizzle Specifies channel ordering for border color data independent of the T# dst_sel_*s. Internal xyzw channels get the following border color channels as stored in memory. 0=xyzw, 1=xwyz, 2=wzyx, 3=wxyz, 4=zyxw, 5=yxwz 127:124 4 type 0 = buf, 8 = 1d, 9 = 2d, 10 = 3d, 11 = cube, 12 = 1d-array, 13 = 2d-array, 14 = 2d-msaa, 15 = 2d-msaa-array. 1-7 are reserved. 256-bit Resource: 1d-array, 2d-array, 3d, cubemap, MSAA 140:128 13 depth Depth-1 of Mip0 for a 3D map; last array slice for a 2D-array or 1D-array or cube-map; (pitch-1)[12:0] of mip0 for 1D, 2D, 2D-MSAA resources if pitch > width. 141 1 Pitch[13] (pitch-1)[13] of mip0 for 1D, 2D and 2D-MSAA. 156:144 13 base array First slice in array of the resource view. 163:160 4 array pitch For 3D, bit 0 indicates SRV or UAV: 0: SRV (base_array ignored, depth w.r.t. base map) 1: UAV (base_array and depth are first and last layer in view, and w.r.t. mip level specified) 179:168 12 min lod warn feedback trigger for LOD, u4.8 format 183 1 corner samples mod Describes how texels were generated in the resource. 0=center sampled, 1 = corner sampled. 198:187 12 min_lod smallest LOD allowed for PRTs, U4.8 format 198:187 12 min LOD smallest LOD allowed for PRTs, u4.8 format. 202 1 Iterate 256 Indicates that compressed tiles in this surface have been flushed out to every 256B of the tile. Applies only to MSAA depth surfaces. 211 1 Meta Pipe Aligned Maintains pipe alignment in metadata addressing (DCC and tiling) 213 1 Compression Enable enable delta color compression (DCC) 214 1 Alpha is on MSB Set to 1 if the surface’s component swap is not reversed (DCC) 215 1 Color Transform Auto=0, none=1 (DCC) 255:216 40 Meta Data Address Upper bits of meta-data address (DCC) [47:8] A resource that is all zeros is treated as 'unbound': it returns all zeros and not generate a memory transaction. The "resource-level" field is ignored when checking for "all zeros". 10.6. Image Sampler The sampler resource (also referred to as S#) defines what operations to perform on texture map data loaded by sample instructions. These are primarily address clamping and filter options. Sampler resources are defined in four consecutive SGPRs and are supplied to the texture cache with every sample instruction. Table 51. Image Sampler Definition 10.6. Image Sampler 104 of 597 "RDNA3" Instruction Set Architecture Bits Size Name Description 2:0 3 clamp x 5:3 3 clamp y 8:6 3 clamp z Clamp/wrap mode: 0: Wrap 1: Mirror 2: ClampLastTexel 3: MirrorOnceLastTexel 4: ClampHalfBorder 5: MirrorOnceHalfBorder 6: ClampBorder 7: MirrorOnceBorder 11:9 3 max aniso ratio 0 = 1:1 1 = 2:1 2 = 4:1 3 = 8:1 4 = 16:1 14:12 3 depth compare func 0: Never 1: Less 2: Equal 3: Less than or equal 4: Greater 5: Not equal 6: Greater than or equal 7: Always 15 1 force unnormalized Force address cords to be unorm: 0 = address coordinates are normalized, in [0,1); 1 = address coordinates are unnormalized in the range [0,dim). 18:16 3 aniso threshold threshold under which floor(aniso ratio) determines number of samples and step size 19 1 mc coord trunc enables bilinear blend fraction truncation to 1 bit for motion compensation 20 1 force degamma force format to srgb if data_format allows 26:21 6 aniso bias 6 bits, in u1.5 format. 27 1 trunc coord selects texel coordinate rounding or truncation. 28 1 disable cube wrap disables seamless DX10 cubemaps, allows cubemaps to clamp according to clamp_x and clamp_y fields 30:29 2 filter_mode 0 = Blend (lerp); 1 = min, 2 = max. 31 1 skip degamma disabled degamma (sRGB→Linear) conversion. 43:32 12 min lod minimum LOD ins resource view space (0.0 = T#.base_level) u4.8. 55:44 12 max lod maximum LOD ins resource view space 77:64 14 lod bias LOD bias s6.8. 83:78 6 lod bias sec bias (s2.4) added to computed LOD 85:84 2 xy mag filter Magnification filter: 0=point, 1=bilinear, 2=aniso-point, 3=aniso-linear 87:86 2 xy min filter Minification filter: 0=point, 1=bilinear, 2=aniso-point, 3=aniso-linear 89:88 2 z filter Volume Filter: 0=none (use XY min/mag filter), 1=point, 2=linear 91:90 2 mip filter Mip level filter: 0=none (disable mipmapping,use base-leve), 1=point, 2=linear 94 1 Blend PRT For PRT fetches, bled the PRT_default valu for non-resident levels 107:96 12 border color ptr 10.6. Image Sampler 105 of 597 "RDNA3" Instruction Set Architecture Bits Size Name Description 127:126 2 border color type Opaque-black, transparent-black, white, use border color ptr. 0: Transparent Black 1: Opaque Black 2: Opaque White 3: Register (User border color, pointed to by border_color_ptr)" 10.7. Data Formats The table below details all the data formats that can be used by image and buffer resources. Table 52. Buffer and Image Data Formats # Buffer and Image Formats # Buffer and Image Formats # Image Formats 0 INVALID 31 11_11_10_FLOAT 64 8_SRGB 1 8_UNORM 32 10_10_10_2_UNORM 65 8_8_SRGB 2 8_SNORM 33 10_10_10_2_SNORM 66 8_8_8_8_SRGB 3 8_USCALED 34 10_10_10_2_UINT 67 5_9_9_9_FLOAT 4 8_SSCALED 35 10_10_10_2_SINT 68 5_6_5_UNORM 5 8_UINT 36 2_10_10_10_UNORM 69 1_5_5_5_UNORM 6 8_SINT 37 2_10_10_10_SNORM 70 5_5_5_1_UNORM 7 16_UNORM 38 2_10_10_10_USCALED 71 4_4_4_4_UNORM 8 16_SNORM 39 2_10_10_10_SSCALED 72 4_4_UNORM 9 16_USCALED 40 2_10_10_10_UINT 73 1_UNORM 10 16_SSCALED 41 2_10_10_10_SINT 74 1_REVERSED_UNORM 11 16_UINT 42 8_8_8_8_UNORM 75 32_FLOAT_CLAMP 12 16_SINT 43 8_8_8_8_SNORM 76 8_24_UNORM 13 16_FLOAT 44 8_8_8_8_USCALED 77 8_24_UINT 14 8_8_UNORM 45 8_8_8_8_SSCALED 78 24_8_UNORM 15 8_8_SNORM 46 8_8_8_8_UINT 79 24_8_UINT 16 8_8_USCALED 47 8_8_8_8_SINT 80 X24_8_32_UINT 17 8_8_SSCALED 48 32_32_UINT 81 X24_8_32_FLOAT 18 8_8_UINT 49 32_32_SINT 82 GB_GR_UNORM 19 8_8_SINT 50 32_32_FLOAT 83 GB_GR_SNORM 20 32_UINT 51 16_16_16_16_UNORM 84 GB_GR_UINT 21 32_SINT 52 16_16_16_16_SNORM 85 GB_GR_SRGB 22 32_FLOAT 53 16_16_16_16_USCALED 86 BG_RG_UNORM 23 16_16_UNORM 54 16_16_16_16_SSCALED 87 BG_RG_SNORM 24 16_16_SNORM 55 16_16_16_16_UINT 88 BG_RG_UINT 25 16_16_USCALED 56 16_16_16_16_SINT 89 BG_RG_SRGB 26 16_16_SSCALED 57 16_16_16_16_FLOAT 27 16_16_UINT 58 32_32_32_UINT 28 16_16_SINT 59 32_32_32_SINT 109 BC1_UNORM 29 16_16_FLOAT 60 32_32_32_FLOAT 110 BC1_SRGB 30 10_11_11_FLOAT 61 32_32_32_32_UINT 111 BC2_UNORM 62 32_32_32_32_SINT 112 BC2_SRGB 63 32_32_32_32_FLOAT 113 BC3_UNORM 114 BC3_SRGB 115 BC4_UNORM 10.7. Data Formats Compressed Formats 106 of 597 "RDNA3" Instruction Set Architecture # Buffer and Image Formats # Buffer and Image Formats # Image Formats 116 BC4_SNORM 117 BC5_UNORM 118 BC5_SNORM 119 BC6_UFLOAT 120 BC6_SFLOAT 121 BC7_UNORM 122 BC7_SRGB 205 YCBCR_UNORM 206 YCBCR_SRGB 227 6E4_FLOAT 228 7E3_FLOAT 10.8. Vector Memory Instruction Data Dependencies When a VM instruction is issued, it schedules the reads of address and store-data from VGPRs to be sent to the texture unit. Any ALU instruction that attempts to write this data before it has been sent to the texture unit is stalled. The shader developer’s responsibility to avoid data hazards associated with VMEM instructions include waiting for VMEM load instruction completion before reading data fetched from the cache (VMCNT and VSCNT). This is explained in the section: Data Dependency Resolution 10.9. Ray Tracing Ray Tracing support includes the following instructions: • IMAGE_BVH_INTERSECT_RAY • IMAGE_BVH64_INTERSECT_RAY These instructions receive ray data from the VGPRs and fetch BVH (Bounding Volume Hierarchy) from memory. • Box BVH nodes perform 4x Ray/Box intersection, sorts the 4 children based on intersection distance and returns the child pointers and hit status. • Triangle nodes perform 1 Ray/Triangle intersection test and returns the intersection point and triangle ID. The two instructions are identical, except that the "64" version supports a 64-bit address while the normal version supports only a 32bit address. Both instructions can use the "A16" instruction field to reduce some (but not all) of the address components to 16 bits (from 32). These addresses are: ray_dir and ray_inv_dir. 10.9.1. Instruction definition and fields image_bvh_intersect_ray vgpr_d[4], vgpr_a[11], sgpr_r[4] image_bvh_intersect_ray vgpr_d[4], vgpr_a[8], sgpr_r[4] A16=1 image_bvh64_intersect_ray vgpr_d[4], vgpr_a[12], sgpr_r[4] 10.8. Vector Memory Instruction Data Dependencies 107 of 597 "RDNA3" Instruction Set Architecture image_bvh64_intersect_ray vgpr_d[4], vgpr_a[9], sgpr_r[4] A16=1 Table 53. Ray Tracing VGPR Contents VGPR_ BVH A16=0 A BVH A16=1 BVH64 A16=0 BVH64 A16=1 0 node_pointer (u32) node_pointer (u32) node_pointer [31:0] (u32) node_pointer [31:0] (u32) 1 ray_extent (f32) ray_extent (f32) node_pointer [63:32] (u32) node_pointer [63:32] (u32) 2 ray_origin.x (f32) ray_origin.x (f32) ray_extent (f32) ray_extent (f32) 3 ray_origin.y (f32) ray_origin.y (f32) ray_origin.x (f32) ray_origin.x (f32) 4 ray_origin.z (f32) ray_origin.z (f32) ray_origin.y (f32) ray_origin.y (f32) 5 ray_dir.x (f32) [15:0] = ray_dir.x (f16) [31:16] = ray_inv_dir.x (f16) ray_origin.z (f32) ray_origin.z (f32) 6 ray_dir.y (f32) [15:0] = ray_dir.y (f16) [31:16] = ray_inv_dir.y(f16) ray_dir.x (f32) [15:0] = ray_dir.x (f16) [31:16] = ray_inv_dir.x (f16) 7 ray_dir.z (f32) [15:0] = ray_dir.z (f16) [31:16] = ray_inv_dir.z (f16) ray_dir.y (f32) [15:0] = ray_dir.y (f16) [31:16] = ray_inv_dir.y(f16) 8 ray_inv_dir.x (f32) unused ray_dir.z (f32) [15:0] = ray_dir.z (f16) [31:16] = ray_inv_dir.z (f16) 9 ray_inv_dir.y (f32) unused ray_inv_dir.x (f32) unused 10 ray_inv_dir.z (f32) unused ray_inv_dir.y (f32) unused 11 unused unused ray_inv_dir.z (f32) unused Vgpr_d[4] are the destination VGPRs of the results of intersection testing. The values returned here are different depending on the type of BVH node that was fetched. For box nodes the results contain the 4 pointers of the children boxes in intersection time sorted order. For triangle BVH nodes the results contain the intersection time and triangle ID of the triangle tested. Sgpr_r[4] is the texture descriptor for the operation. The instruction is encoded with use_128bit_resource=1. Restrictions on image_bvh instructions • DMASK must be set to 0xf (instruction returns all four DWORDs) • D16 must be set to 0 (16 bit return data is not supported) • R128 must be set to 1 (256 bit T#s are not supported) • UNRM must be set to 1 (only unnormalized coordinates are supported) • DIM must be set to 0 (BVH textures are 1D) • LWE must be set to 0 (LOD warn is not supported) • TFE must be set to 0 (no support for writing out the extra DWORD for the PRT hit status) • SSAMP must be set to 0 (just a placeholder, since samplers are not used by the instruction) The return order settings of the BVH ops are ignored instead they use the in-order load return queue. 10.9.2. Using BVH with NSA When using the BVH instruction with Non-Sequential Address, the BVH components fall into 5 groups each of which is specified by a NSA address VGPR. • node pointer : 1 vgpr • ray extent : 1 vgpr 10.9. Ray Tracing 108 of 597 "RDNA3" Instruction Set Architecture • ray origin : 3 consecutive vgprs • ray dir : 3 consecutive vgprs • ray inv dir : 3 consecutive vgprs (paired with ray-dir for 16-bit addresses) NSA and A16: • A16=0, MIMG-NSA specifies 5 groups of consecutive VGPRs: node_pointer, ray_extent, ray_origin, ray_dir and ray_inv_dir. • A16=1, MIMG-NSA specifies 4 groups. In the above set, ray_dir and ray_inv_dir are packed into 3 VGPRs. When using A16=1 mode, ray-dir and ray-inv-dir share the same vgprs and ADDR4 is unused. 10.9.3. Texture Resource Definition The T# used with these instructions is different from other image instructions. Table 54. BVH Resource Definition Field Bits Size Data Base Address 39:0 40 Base address of the BVH texture 256 byte aligned Reserved 54:40 15 Set to zero Box growing amount 62:55 8 Number of ULPs to be added during ray-box test, encoded as unsigned integer Box sorting enable 63 1 Whether the ray-box test result need to be sorted Size 105:64 42 Number of nodes minus 1 in the BVH texture used to enforce bounds checking Reserved 118:106 13 Set to zero Pointer Flags 119 1 0: Do not use pointer flags or features supported by point flags 1: Utilize pointer flags to enable HW winding, backface cull, opaque/non-opaque culling and primitive type-based culling. triangle_return 120 _mode 1 0: Return data for triangle tests are {0: t_num, 1: t_denom, 2: triangle_id, 3: hit_status} 1: Return data for triangle tests are {0: t_num, 1: t_denom, 2: I_num, 3: J_num} llc_stream or unused 122:121 2 0: use the LLC for load/store if enabled in Mtype 1: use the LLC for load, bypass for store/atomics (store/atomics probe-invalidate) 2: Reserved 3: bypass the LLC for all ops big_page 123 1 Describes resource page usage 0 : No page size override. 1 : Indicates when a whole resource is only using pages that are >= 64kB in size. Type 127:124 4 Set to 0x8 Barycentrics The ray-tracing hardware is designed to support computation of barycentric coordinates directly in hardware. This uses the "triangle_return_mode" in the table in the previous section (T# descriptor). Table 55. Ray Tracing Return Mode DWORD 0 Return Mode =0 Return Mode = 1 Field Name Type Field Name Type t_num float32 t_num float32 10.9. Ray Tracing 109 of 597 "RDNA3" Instruction Set Architecture DWORD Return Mode =0 1 t_denom float32 Return Mode = 1 t_denom float32 2 triangle_id uint32 I_num float32 3 hit_status uint32 (boolean value) J_num float32 10.10. Partially Resident Textures "Partially Resident Textures" provides support for texture maps in which not all levels of detail are resident in memory. The shader compiler declares the texture map as being P.R.T. in the resource, but the shader program must also be aware of this because if a texture fetch accesses a MIP level that is not present, the texture unit returns an extra DWORD of status into VGPRs indicating the fetch failure. If any of the texels are not present in memory, the texture cache returns NACK that causes a non-zero value to be written into DST_VGPR+1 for each failing thread. The value may represent the LOD requested. The shader program must allocate this extra VGPR for all PRT texture fetches and check that it is zero after the fetch. This PRT VGPR must have previously been initialized to zero by the shader. PRT is enabled when the texture resource MIN_LOD_WARN value is non-zero. Normal textures cannot NACK, so only PRT’s can get a NACK, and a NACK causes a write to DST_VGPR+Num_VGPRS. E.g. if a SAMPLE loads 4 values into 4 VGPRs: 4,5,6,7 then PRT may return NACK status into VGPR_8. 10.10. Partially Resident Textures 110 of 597 "RDNA3" Instruction Set Architecture Chapter 11. Global, Scratch and Flat Address Space Flat, Global and Scratch are a collection of VMEM instructions that allow per-thread access to global memory, shared memory and private memory. Unlike buffer and image instructions, these do not use an SRD (resource constant). Flat is the most generic of the 3 types where per-thread the address may map to global, private or shared memory. Memory is addressed as a single flat address space, where certain memory address apertures map these regions. The determination of the memory space to which an address maps is controlled by a set of "memory aperture" base and size registers. Flat load/store/atomic instructions are effectively a simultaneous issue of an LDS and GLOBAL instruction at the same time with the same address. The address per-thread is read from the ADDR VGPR and then tested to see in which address space the data exists. Flat Address Space ("flat") instructions allow load/store/atomic access to a generic memory address pointer that can resolve to any of the following physical memories: • Global memory • Scratch ("private") • LDS ("shared") • Invalid • But not to: GPRs, GDS or LDS-parameters. GLOBAL is used when all of the address fall into global memory, not LDS or Scratch. This should be used when possible (instead of "Flat") as Global does not tie up LDS resources. SCRATCH is similar, but is used to access scratch (private) memory space. Scratch (thread-private memory) is an area of memory defined by the aperture registers. When an address falls in scratch space, additional address computation is automatically performed by the hardware. For waves that are allocated scratch memory space, the 64-bit FLAT_SCRATCH register is initialized with the a pointer to that wave’s private scratch memory. Waves that have no scratch memory have FLAT_SCRATCH initialized to zero. FLAT_SCRATCH is a 64-bit byte address that is implicitly used by Flat and Scratch memory instructions, and can be manually read via S_GETREG. The instruction specifies which VGPR supplies the address (per work-item), and that address for each workitem may be in any one of those address spaces. Instruction Fields Field Size Description OP 8 Opcode: see next table ADDR 8 VGPR that holds address or offset. For 64-bit addresses, ADDR has the LSB’s and ADDR+1 has the MSBs. For offset a single VGPR has a 32 bit unsigned offset. For FLAT_*: specifies an address. For GLOBAL_* when SADDR is NULL: specifies an address. For GLOBAL_* when SADDR is not NULL: specifies an offset. For SCRATCH, specifies an offset if SVE=1 111 of 597 "RDNA3" Instruction Set Architecture Field Size Description DATA 8 VGPR that holds the first DWORD of store-data. Instructions can use 0-4 DWORDs. VDST 8 VGPR destination for data returned to the shader, either from LOADs or Atomics with GLC=1 (return pre-op value). SLC 1 System Level Coherent. Used in conjunction with GLC to determine cache policies. DLC 1 Device Level Coherent. Controls behavior of L1 cache (GL1). GLC 1 Group Level Coherent - controls behavior of L0 cache. Atomics: 1 = return the memory value before the atomic operation is performed. 0 = do not return anything. SEG 2 Memory Segment: 0=Flat, 1=Scratch, 2=GLOBAL, 3=Reserved OFFSET 13 Address offset: 13-bit signed byte offset (Must be positive for Flat; MSB is ignored and forced to zero) SADDR 7 Scalar SGPR that provides an address of offset (unsigned). To disable use, set this field to NULL. The meaning of this field is different for Scratch and Global. Flat: Unused Scratch: use an SGPR as part of the address Global: use the SGPR to provide a base address and the VGPR provides a 32-bit byte offset. SVE 1 Scratch VGPR Enable When set to 1, scratch instructions include a 32-bit offset from a VGPR; when set to 0, scratch instructions do not use a VGPR for addressing. Table 56. Instructions Flat GLOBAL Scratch FLAT_LOAD_U8 GLOBAL_LOAD_U8 SCRATCH_LOAD_U8 FLAT_LOAD_D16_U8 GLOBAL_LOAD_D16_U8 SCRATCH_LOAD_D16_U8 FLAT_LOAD_D16_HI_U8 GLOBAL_LOAD_D16_HI_U8 SCRATCH_LOAD_D16_HI_U8 FLAT_LOAD_I8 GLOBAL_LOAD_I8 SCRATCH_LOAD_I8 FLAT_LOAD_D16_I8 GLOBAL_LOAD_D16_I8 SCRATCH_LOAD_D16_I8 FLAT_LOAD_D16_HI_I8 GLOBAL_LOAD_D16_HI_I8 SCRATCH_LOAD_D16_HI_I8 FLAT_LOAD_U16 GLOBAL_LOAD_U16 SCRATCH_LOAD_U16 FLAT_LOAD_I16 GLOBAL_LOAD_I16 SCRATCH_LOAD_I16 FLAT_LOAD_D16_B16 GLOBAL_LOAD_D16_B16 SCRATCH_LOAD_D16_B16 FLAT_LOAD_D16_HI_B16 GLOBAL_LOAD_D16_HI_B16 SCRATCH_LOAD_D16_HI_B16 FLAT_LOAD_B32 GLOBAL_LOAD_B32 SCRATCH_LOAD_B32 FLAT_LOAD_B64 GLOBAL_LOAD_B64 SCRATCH_LOAD_B64 FLAT_LOAD_B96 GLOBAL_LOAD_B96 SCRATCH_LOAD_B96 FLAT_LOAD_B128 GLOBAL_LOAD_B128 SCRATCH_LOAD_B128 FLAT_STORE_B8 GLOBAL_STORE_B8 SCRATCH_STORE_B8 FLAT_STORE_D16_HI_B8 GLOBAL_STORE_D16_HI_B8 SCRATCH_STORE_D16_HI_B8 FLAT_STORE_B16 GLOBAL_STORE_B16 SCRATCH_STORE_B16 FLAT_STORE_D16_HI_B16 GLOBAL_STORE_D16_HI_B16 SCRATCH_STORE_D16_HI_B16 FLAT_STORE_B32 GLOBAL_STORE_B32 SCRATCH_STORE_B32 FLAT_STORE_B64 GLOBAL_STORE_B64 SCRATCH_STORE_B64 FLAT_STORE_B96 GLOBAL_STORE_B96 SCRATCH_STORE_B96 FLAT_STORE_B128 GLOBAL_STORE_B128 SCRATCH_STORE_B128 none GLOBAL_LOAD_ADDTID_B32 none none GLOBAL_STORE_ADDTID_B32 none FLAT_ATOMIC_SWAP_B32 GLOBAL_ATOMIC_SWAP_B32 none FLAT_ATOMIC_CMPSWAP_B32 GLOBAL_ATOMIC_CMPSWAP_B32 none 112 of 597 "RDNA3" Instruction Set Architecture Flat GLOBAL Scratch FLAT_ATOMIC_ADD_U32 GLOBAL_ATOMIC_ADD_U32 none FLAT_ATOMIC_ADD_F32 GLOBAL_ATOMIC_ADD_F32 none FLAT_ATOMIC_SUB_U32 GLOBAL_ATOMIC_SUB_U32 none FLAT_ATOMIC_MIN_I32 GLOBAL_ATOMIC_MIN_I32 none FLAT_ATOMIC_MIN_U32 GLOBAL_ATOMIC_MIN_U32 none FLAT_ATOMIC_MAX_I32 GLOBAL_ATOMIC_MAX_I32 none FLAT_ATOMIC_MAX_U32 GLOBAL_ATOMIC_MAX_U32 none FLAT_ATOMIC_AND_B32 GLOBAL_ATOMIC_AND_B32 none FLAT_ATOMIC_OR_B32 GLOBAL_ATOMIC_OR_B32 none FLAT_ATOMIC_XOR_B32 GLOBAL_ATOMIC_XOR_B32 none FLAT_ATOMIC_INC_U32 GLOBAL_ATOMIC_INC_U32 none FLAT_ATOMIC_DEC_U32 GLOBAL_ATOMIC_DEC_U32 none FLAT_ATOMIC_CMPSWAP_F32 GLOBAL_ATOMIC_CMPSWAP_F32 none FLAT_ATOMIC_MIN_F32 GLOBAL_ATOMIC_MIN_F32 none FLAT_ATOMIC_MAX_F32 GLOBAL_ATOMIC_MAX_F32 none FLAT_ATOMIC_SWAP_B64 GLOBAL_ATOMIC_SWAP_B64 none FLAT_ATOMIC_CMPSWAP_B64 GLOBAL_ATOMIC_CMPSWAP_B64 none FLAT_ATOMIC_ADD_U64 GLOBAL_ATOMIC_ADD_U64 none FLAT_ATOMIC_SUB_U64 GLOBAL_ATOMIC_SUB_U64 none FLAT_ATOMIC_MIN_I64 GLOBAL_ATOMIC_MIN_I64 none FLAT_ATOMIC_MIN_U64 GLOBAL_ATOMIC_MIN_U64 none FLAT_ATOMIC_MAX_I64 GLOBAL_ATOMIC_MAX_I64 none FLAT_ATOMIC_MAX_U64 GLOBAL_ATOMIC_MAX_U64 none FLAT_ATOMIC_AND_B64 GLOBAL_ATOMIC_AND_B64 none FLAT_ATOMIC_OR_B64 GLOBAL_ATOMIC_OR_B64 none FLAT_ATOMIC_XOR_B64 GLOBAL_ATOMIC_XOR_B64 none FLAT_ATOMIC_INC_U64 GLOBAL_ATOMIC_INC_U64 none FLAT_ATOMIC_DEC_U64 GLOBAL_ATOMIC_DEC_U64 none none GLOBAL_ATOMIC_CSUB_U32 (GLC must be set to 1) none 11.1. Instructions 11.1.1. FLAT The Flat instruction set is nearly identical to the BUFFER instruction set, minus the FORMAT loads & stores. Flat instructions do not use a resource constant (V#) or sampler (S#), but they do use a specific SGPR-pair (FLAT_SCRATCH) to hold scratch-space information in case any threads' address resolves to scratch space. See "Scratch" section below. Since Flat instruction are executed as both an LDS and a Global instruction, Flat instructions increment both VMcnt (or VScnt) and LGKMcnt and are not considered done until both have been decremented. There is no way a priori to determine whether a Flat instruction uses only LDS or Global memory space. When the address from a Flat instruction falls into scratch (private) space, a different addressing mechanism is 11.1. Instructions 113 of 597 "RDNA3" Instruction Set Architecture used. The address from the VGPR points to the memory space for a specific DWORD of scratch data owned by this thread. The hardware maps this address to the actual memory address that holds data for all of the threads in the wave. Flat atomics which map into scratch: 4-byte atomics are supported, and 8-byte atomics return MEMVIOL. The wave supplies the offset (for space allocated to this wave) with every Flat request. This is stored in a dedicated per-wave register: FLAT_SCRATCH, that holds a 64-bit byte address. The aperture check occurs when VGPRs are read, with invalid addresses being routed to the texture unit. The "aperture check" is performed before "inst_offset" is added into the address, so it is undefined what occurs if the addition of inst_offset pushes the address into a different memory aperture. (Hole) Addr[48] Addr[47] Addr[46] Aperture 0 x x Normal (global memory) 1 0 0 Potential Private (scratch) 1 0 1 Potential Shared (LDS) 1 1 x Invalid Ordering Flat instructions may complete out of order with each other. If one Flat instruction finds all of its data in Texture cache, and the next finds all of its data in LDS, the second instruction might complete first. If the two fetches return data to the same VGPR, the result is unknown (order is not deterministic). Flat instructions decrement VMcnt in order for the threads that went to global memory and those are in order with other scratch, global, texture and buffer instructions. Separately each Flat instruction increments and decrements LGKMcnt. This is out-of-order with the VMcnt path but is in-order with other DS (LDS) instructions. Since the data for a Flat load can come from either LDS or the texture cache, and because these units have different latencies, there is a potential race condition with respect to the VMcnt/VScnt and LGKMcnt counters. Because of this, the only sensible S_WAITCNT value to use after Flat instructions is zero. 11.1.2. Global Global operations transfer data between VGPR and global memory. Global instructions are similar to Flat, but the programmer is responsible to make sure that no threads access LDS or private space. Because of this, no LDS bandwidth is used by global instructions. Since these instructions do not access LDS, only VMcnt (or VScnt) is used, not LGKMcnt. If a global instruction does attempt to access LDS, the instruction returns MEMVIOL. Global includes two instructions which do not use any VGPRs for addressing, just SGPRs and INST_OFFSET: • GLOBAL_LOAD_ADDTID_B32 • GLOBAL_STORE_ADDTID_B32 11.1.3. Scratch Scratch instructions are similar to global but they access a private (per-thread) memory space that is swizzled. Because of this, no LDS bandwidth is used by scratch instructions. Scratch instructions also support multiDWORD access and mis-aligned access (although mis-aligned is slower). 11.1. Instructions 114 of 597 "RDNA3" Instruction Set Architecture Since these instructions do not access LDS, only VMcnt (or VScnt) is used, not LGKMcnt. It is not possible for a scratch instruction to access LDS, and so no error checking is done (and no aperture check is performed). 11.2. Addressing Global, Flat and Scratch each have their own addressing modes. Flat addressing is a subset of the global and scratch modes. 64-bit addresses are stored with the LSB’s in the VGPR at ADDR, and the MSBs in the VGPR at ADDR+1. There are 4 distinct shader instructions: • GLOBAL • SCRATCH • LDS • FLAT - based on per-thread address (VGPR), can load/store: global memory, LDS or scratch memory. Global Addressing GV mem_addr = VGPRU64 + INST_OFFSETI13 GVS mem_addr = SGPRU64 + VGPRU32 + INST_OFFSETI13 GT mem_addr = SGPRU64 + INST_OFFSETI13 + ThreadID*4 LDS Addressing (DS ops) LDS LDS_ADDR = VGPR_addrU32 + INST_OFFSETU16 LDS address is relative to the LDS space allocated to this wave. Scratch Addressing SV mem_addr = SCRATCH_BASEU64 + SWIZZLE(VGPR_offsetU32 + INST_OFFSETI13, ThreadID) SS mem_addr = SCRATCH_BASEU64 + SWIZZLE(SGPR_offsetU32 + INST_OFFSETI13, ThreadID) SVS mem_addr = SCRATCH_BASEU64 + SWIZZLE(SGPR_offsetU32 + VGPR_offsetU32 + INST_OFFSETI13, ThreadID) ST mem_addr = SCRATCH_BASEU64 + SWIZZLE(INST_OFFSETI13, ThreadID) SGPR_offset and VGPR_offset are 32 bits unsigned byte offsets. The combined offsets inside SWIZZLE() must result in a non-negative number. The value from an SGPR and VGPR are unsigned 32-bit byte offsets. Flat Addressing Aperture test on the address-VGPR value determines: Global/LDS/Scratch per thread (ignores INST_OFFSET). Use one of the 3 address equations per lane depending on which memory it maps to: GLOBAL (GV) mem_addr = VGPRU64 + INST_OFFSETI13 SCRATCH (SV) mem_addr = SCRATCH_BASE(sgpr:U64) + SWIZZLE(VGPR_offset + INST_OFFSETI13, ThreadID) LDS LDS_ADDR = VGPR(addr) + INST_OFFSET - sharedApertureBase If the address falls into LDS space, it is checked against the range: [0, LDS_allocated_size-1 ] There is no range checking on this address. Scratch Addressing Equation "SWIZZLE(offset,TID)" is hard coded based on wave size (32 or 64) Swizzle for Scratch is hard-coded to: elem_size=4bytes, const_index_stride=32 (wave32) or 64 (wave64). 11.2. Addressing 115 of 597 "RDNA3" Instruction Set Architecture Addr = SCRATCH_BASE + (offset / 4) * 4 * const_index_stride + (offset % 4) + TID*4 where "offset" = either "INST_OFFSET + SGPR_offset" or "INST_OFFSET + VGPR_offset". Restrictions: • Inst_offset : ◦ Flat and Scratch-ST mode: must not be negative ◦ Global and Scratch-SS and -SV modes: can be negative ◦ In Scratch SS mode, the inst_offset must be aligned to the payload size: 4 byte aligned for 1-DWORD, 16-byte aligned for 4-DWORD. ▪ Also (SADDR + INST_OFFSET) must be at least DWORD-aligned SADDR SVE MODE ==NULL 0 ST !=NULL 0 SS ==NULL 1 SV !=NULL 1 SVS Scratch Instruction Modes Indicated by SVE / SADDR SV Addr = FLAT_SCRATCH + swizzle(Voff + Ioff, TID) 1 / NULL SS Addr = FLAT_SCRATCH + swizzle(Soff + Ioff, TID) 0 / !NULL ST Addr = FLAT_SCRATCH + swizzle(0 + Ioff, TID) 0 / NULL SVS Addr = FLAT_SCRATCH + swizzle(Soff + Voff + Ioff, TID) 1 / !NULL BUFFER_ Addr = + LOAD T#.base + Soff + swizzle( (Vidx + TID) * stride + Ioff + Voff) Global Instruction Modes GV Addr = Vaddr64 GVS Addr = Saddr64 GT Addr = Saddr64 + Voff32 + Ioff x / NULL + Ioff x / !=NULL + Ioff + TID*4 x/x instruction + Ioff x/x instruction LDS Instruction Modes LDS Addr = Vaddr Flat Instruction Modes Scratch Addr = FLAT_SCRATCH swizzle (Voff + Ioff -privApertureBase, TID) // "SV" x / NULL LDS Addr = Vaddr + Ioff - sharedApertureBase // "LDS" x / NULL Global Addr = Vaddr + Ioff // "GV" x / NULL • Scratch: Voff and Soff are 32 bits, unsigned bytes. • Global: Addresses are 64 bits, offset is 32bits. • FLAT_SCRATCH is an SGPR-pair 64-bit address. • "Ioff" is the offset from the instruction field. • "x" = don’t care (either value works) 11.3. Memory Error Checking Both Texture and LDS can report that an error occurred due to a bad address. This can occur due to: • Invalid address (outside any aperture) • Write to read-only global memory address 11.3. Memory Error Checking 116 of 597 "RDNA3" Instruction Set Architecture • Misaligned data (scratch accesses may be misaligned) • Out-of-range address: ◦ LDS access with an address outside the range: [ 0, LDS_SIZE-1 ] The policy for threads with bad addresses is: stores outside this range do not write a value, and reads return zero. The aperture check for invalid address occurs before adding any address offsets - it is based only on the base address; the other checks are performed after adding the offsets. Addressing errors from either LDS or TA are returned on their respective "instruction done" busses as MEMVIOL. This sets the wave’s MEMVIOL TrapStatus bit, and also causes an exception (trap). 11.4. Data FLAT instructions can use from zero to four consecutive DWORDs of data in VGPRs and/or memory. The DATA field determines which VGPR(s) supply source data (if any) and the VDST VGPRs hold return data (if any). There is no data-format conversion performed. "D16" instructions use only 16-bit of the VGPR instead of the full 32bits. "D16_HI" instructions read or write only the high 16-bits, while "D16" use the low 16-bits. Scratch & Global D16 load instructions of the "_LDS_" type write the entire 32-bits of LDS. 11.4. Data 117 of 597 "RDNA3" Instruction Set Architecture Chapter 12. Data Share Operations Local data share (LDS) is a low-latency, RAM scratchpad for temporary data storage and for sharing data between threads within a work-group. Accessing data through LDS may be significantly lower latency and higher bandwidth than going through memory. For compute workloads, it allows a simple method to pass data between threads in different waves within the same work-group. For graphics, it is also used to hold vertex parameters for pixel shaders. LDS space is allocated per work-group or wave (when work-groups not used) and recorded in dedicated LDSbase/size (allocation) registers that are not writable by the shader. These restrict all LDS accesses to the space owned by the work-group or wave. 12.1. Overview The figure below shows how the LDS fits into the memory hierarchy of the GPU. Figure 3. High-Level Memory Configuration There are 128kB of memory per work-group processor split up into 64 banks of DWORD-wide RAMs. These 64 banks are further sub-divided into two sets of 32-banks each where 32 of the banks are affiliated with a pair of SIMD32’s, and the other 32 banks are affiliated with the other pair of SIMD32’s within the WGP. Each bank is a 512x32 two-port RAM (1R/1W per clock cycle). DWORDs are placed in the banks serially, but all banks can execute a store or load simultaneously. One work-group can request up to 64kB memory. The high bandwidth of the LDS memory is achieved not only through its proximity to the ALUs, but also through simultaneous access to its memory banks. Thus, it is possible to concurrently execute 32 store or load instructions, each nominally 32-bits; extended instructions, load_2addr/store_2addr, can be 64-bits each. If, 12.1. Overview 118 of 597 "RDNA3" Instruction Set Architecture however, more than one access attempt is made to the same bank at the same time, a bank conflict occurs. In this case, for indexed and atomic operations, the hardware is designed to prevent the attempted concurrent accesses to the same bank by turning them into serial accesses. This can decrease the effective bandwidth of the LDS. For increased throughput (optimal efficiency), therefore, it is important to avoid bank conflicts. A knowledge of request scheduling and address mapping can be key to help achieving this. 12.1.1. Dataflow in Memory Hierarchy The figure below is a conceptual diagram of the dataflow within the memory structure. Data can be loaded into LDS either by transferring it from VGPRs to LDS using "DS" instructions, or by loading in from memory. When loading from memory, the data may be loaded into VGPRs first or for some types of loads it may be loaded directly into LDS from memory. To store data from LDS to global memory, data is read from LDS and placed into the work-item’s VGPRs, then written out to global memory. To help make effective use of the LDS, a shader program must perform many operations on what is transferred between global memory and LDS. LDS atomics are performed in the LDS hardware. Although ALUs are not directly used for these operations, latency is incurred by the LDS executing this function. 12.1.2. LDS Modes and Allocation: CU vs. WGP Mode Work-groups of waves are dispatched in one of two modes: CU or WGP. See this section for details: WGP and CU Mode 12.1.3. LDS Access Methods There are 3 forms of Local Data Share access: 12.1. Overview 119 of 597 "RDNA3" Instruction Set Architecture Direct Load Loads a single DWORD from LDS and broadcasts the data to a VGPR across all lanes. Indexed load/store and Atomic ops Load/store address comes from a VGPR and data to/from VGPR. LDS-ops require up to 3 inputs: 2data+1addr and immediate return VGPR. Parameter Interpolation Load Reads pixel parameters from LDS per quad and loads them into one VGPR. Reads all 3 parameters per quad (P1, P1-P0 and P2-P0) and loads them into 3 lanes within the quad (the 4th lane receives zero). The following sections describe these methods. 12.2. Pixel Parameter Interpolation For pixel waves, vertex attribute data is preloaded into LDS and barycentrics (I, J) are preloaded into VGPRs before the wave starts. Parameter interpolation can be performed by loading attribute data from LDS into VGPRs using LDS_PARAM_LOAD and then using V_INTERP instructions to interpolate the value per pixel. LDS-Parameter loads are used to read vertex parameter data and store them in VGPRs to be used for parameter interpolation. These instructions operate like memory instructions except they use EXPcnt to track outstanding reads and decrement EXPCnt when they arrive in VGPRs. Pixel shaders can be launched before their parameter data has been written into LDS. Once the data is available in LDS, the wave’s STATUS register "LDS_READY" bit is set to 1. Pixel shader waves stall if an LDS_DIRECT_LOAD or LDS_PARAM_LOAD is to be issued before LDS_READY is set. The most common form of interpolation involves weighting vertex parameters by the barycentric coordinates "I" and "J". A common calculation is: Result = P0 + I * P10 + J * P20 where "P10" is (P1 - P0), and "P20" is (P2 - P0) Parameter interpolation involves two types of instructions: • LDS_PARAM_LOAD : to read packed parameter data from LDS into a VGPR (data packed per quad) • V_INTERP_* : VALU FMA instructions that unpack parameter data across lanes in a quad. 12.2.1. LDS Parameter Loads Parameter Loads are only available in LDS, not in GDS, and only in CU mode (not WGP mode). LDS_PARAM_LOAD reads three parameters (P0, P10, P20) of one 32-bit attribute or of two 16-bit attributes from LDS into VGPRs. The are 3 parameters (P0, P10 and P20) are the same for the 4 pixels within a quad. These values are spread out across VGPR lanes 0, 1 and 2 of each quad. Interpolation is then performed using FMA with DPP so each lane uses its I or J value with the quad’s shared P0, P10 and P20 values. 12.2. Pixel Parameter Interpolation 120 of 597 "RDNA3" Instruction Set Architecture Table 57. LDSDIR Instruction Fields Field Size Description OP 2 Opcode: 0: LDS_DIRECT_LOAD 1: LDS_PARAM_LOAD 2,3: Reserved WAITVDST 4 Wait for the number of previously issued still outstanding VALU instructions to be less than or equal to this number. Used to avoid Write-After-Read hazards on VGPRs. VDST 8 Destination VGPR ATTR_CHAN 2 Attribute channel: 0=X, 1=Y, 2=Z, 3=W. Unused for LDS_DIRECT_LOAD. ATTR 6 Attribute number: 0 - 32. Unused for LDS_DIRECT_LOAD. ( M0 ) 32 LDS_DIRECT_LOAD: { 13’b0, DataType[2:0], LDS_address[15:0] } //addr in bytes LDS_PARAM_LOAD: { 1’b0, new_prim_mask[15:1], lds_param_offset[15:0] } M0 is implicitly read for this instruction and must be initialized before these instructions. new_prim_mask a mask that has a bit per quad indicating that this quad begins a new primitive; zero indicates same primitive as previous quad. There is an implied "one" for the first quad in the wave (every wave begins a new primitive) and so bit[0] is omitted. lds_param_offset The parameter offset indicates the starting address of the parameters in LDS. Space before that can be used as temporary wave storage space. Lds_param_offset bits [6:0] must be set to zero. Example LDS_PARAM_LOAD (new_prim_mask[3:0] = 0110) LDS_ADDR = lds_base + param_offset + attr#*numPrimsInVector*12DWORDs + prim#*12 + attr_offset (attr_offset = 0..11 : 0 = P0.x, 1 = P0.Y, … 11 = P2.W) From NewPrimMask h/w derives NumPrimInVec and Prim# (0..15) If the dest-VGPR is out of range, the load is still performed but EXEC is forced to zero. LDS_PARAM_LOAD and LDS_DIRECT_LOAD use EXEC per quad (if any pixel is enabled in the quad, data is written to all 4 pixels/threads in the quad). 12.2. Pixel Parameter Interpolation 121 of 597 "RDNA3" Instruction Set Architecture 12.2.1.1. 16-bit Parameter Data 16-bit parameters are packed in LDS as pairs of attributes in DWORDs: ATTR0.X and ATTR1.X share a DWORD. There is an alternate packing mode where the parameters are not packed (one 16-bit param in low half of DWORD). These attributes can be read with the same LDS_PARAM_LOAD instruction, and returns the packed DWORD with 2 attributes (when they are packed). Interpolation can then be done using specific mixedprecision FMA opcodes, along with DPP (to select P0, P10 or P20) and OPSEL (to select upper or lower 16-bits). Barycentrics are 32-bits, not 16 bit. 12.2.1.2. Parameter Load Data Hazard Avoidance These data dependency rules apply to both parameter and direct loads. LDS_DIRECT_LOAD and LDS_PARAM_LOAD read data from LDS and write it into VGPRs, and they use EXPcnt to track when the instruction has completed and written the VGPRs. It is up to the shader program to ensure that data hazards are avoided. These instructions are issued along a different path from VALU instructions so it is possible that previous VALU instructions may still be reading from the VGPR that these LDS instructions are going to write and this could lead to a hazard. EXPcnt is used to track read-after-write hazards where LDS_PARAM_LOAD writes a value to a VGPR and another instruction reads it. The shader program uses "s_waitcnt EXPcnt" to wait for results from a LDS_DIRECT_LOAD or LDS_PARAM_LOAD to be available in VGPRs before consuming it in a subsequent instruction. The VINTERP instructions have a "wait_EXPcnt" field to assist in avoid this hazard. These are skipped when EXEC==0 and EXPCnt==0 (like memory ops). Mixed exports & LDS-direct/param instructions from the same wave might not complete in order (both use EXPcnt), requiring "s_waitcnt 0" if they are overlapped. LDS_PARAM_LOAD V2 S_WAITCNT EXPcnt 0 A potential Write-After-Read hazard exists if a VALU instruction reads a VGPR and then LDS_PARAM_LOAD writes that VGPR: It is possible the LDS_PARAM_LOAD overwrites the VALU’s source VGPR before it was read. The user must prevent this by using the "wait_Vdst" field of the LDS_PARAM_LOAD instruction. This field indicates the maximum number of uncompleted VALU instructions that may be outstanding when this LDS_PARAM_LOAD is issued. Use this to ensure any dependent VALU instructions have completed. Another potential data hazard involves LDS_PARAM_LOAD overwriting a VGPR that has not yet been read as a source by a previous VMEM (LDS, Texture, Buffer, Flat) instruction. To avoid this hazard, the user must ensure that the VMEM instruction has read its source VGPRs. This can be achieved by issuing any VALU or export instruction before the LDS_PARAM_LOAD. 12.2. Pixel Parameter Interpolation 122 of 597 "RDNA3" Instruction Set Architecture 12.3. VALU Parameter Interpolation Parameter interpolation is performed using an FMA operation that includes a built-in DPP operation to unpack the per-quad P0/P10/P20 values into per-lane values. Because this instruction reads data from neighboring lanes, the implicit DPP acts as if "fetch invalid = 1", so that the instruction can read data from neighboring lanes that have EXEC==0, rather than getting the value 0 from those. Standard interpolation is calculating: Per-Pixel-Parameter = P0 + I * P10 + J * P20 // I, J are per-pixel; P0/P10/P20 are per-primitive This parameter interpolation is realized using a pair of instructions: // V1 = I, V2 = J, V3 = result of LDS_PARAM_LOAD V_INTERP_P10_F32 V4, V3[1], V1, V3[0] // tmp = P0 + I*P10 V_INTERP_P20_F32 V5, V3[2], V2, V4 // uses DPP8=1,1,1,1,5,5,5,5; Src2(P0) uses DPP8=0,0,0,0,4,4,4,4 // dst = J*P20 + tmp uses DPP8=2,2,2,2,6,6,6,6 Table 58. Parameter Interpolation Instruction Fields Field Size Description OP 7 Instruction Opcode: V_INTERP_P10_F32 // tmp = P0 + I*P10. hardcoded DPP8 on 2 sources V_INTERP_P2_F32 // D = tmp + J*P20. hardcoded DPP8 on 1 source V_INTERP_P10_F16_F32 // tmp = P0 + I*P10. hardcoded DPP8 on 2 sources V_INTERP_P2_F16_F32 // D = tmp + J*P20. hardcoded DPP8 on 1 source V_INTERP_RTZ_P10_F16_F32 // same as above, but round-toward-zero V_INTERP_RTZ_P2_F16_F32 // same as above, but round-toward-zero SRC0 9 First argument VGPR: Parameter data (P0 or P20) from LDS stored in a VGPR. SRC1 9 Second argument VGPR: I or J barycentric SRC2 9 Third argument VGPR: "P10" ops holds P10 data; "P2" ops holds partial result from "P10" op. VDST 8 Destination VGPR NEG 3 Negate the input (invert sign bit). bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2. For 16-bit interpolation this applies to both low and high halves. WaitEXP 3 Wait for EXPcnt to be less than or equal to this value before issuing this instruction. Used to wait for a specific previous LDS_PARAM_LOAD to have completed. OPSEL 4 Operation select for 16-bit math: 1=select high half, 0=select low half [0]=src0, [1]=src1, [2]=src2, [3]=dest For dest=0, dest_vgpr[31:0] = {prev_dst_vgpr[31:16], result[15:0] } For dest=1, dest_vgpr[31:0] = {result[15:0], prev_dst_vgpr[15:0] } OPSEL may only be used for 16-bit operands, and must be zero for any other operands/results. CLMP 1 Clamp result to [0, 1.0] The VINTERP instructions include a builtin "s_waitcnt EXPcnt" to easily allow data hazard resolution for data produced by LDS_PARAM_LOAD. 12.3. VALU Parameter Interpolation 123 of 597 "RDNA3" Instruction Set Architecture Instructions Restrictions and Limitations: • V_INTERP instructions do not detect or report exceptions • V_INTERP instructions do not support data forwarding into inputs that would normally come from LDS data (sources A and C for V_INTERP_P10_* and source A for V_INTERP_P2_*). VGPRs are preloaded with some or all of: • I_persp_sample, J_persp_sample, I_persp_center, J_persp_center, • I_persp_centroid, J_persp_centroid, • I/W, J/W, 1.0/W, • I_linear_sample, J_linear_sample, • I_linear_center, J_linear_center, • I_linear_centroid, J_linear_centroid These instructions consume data that was supplied by LDS_PARAM_LOAD. These instructions contain a builtin "s_waitcnt EXPcnt <= N" capability to allow for efficient software pipelining. lds_param_load V0, attr0 lds_param_load V10, attr1 lds_param_load V20, attr2 lds_param_load V30, attr3 v_interp_p0 V1, v_interp_p0 V11, V10[1], Vi, V10[0] V0[1], s_waitcnt EXPcnt<=2 v_interp_p0 V21, V20[1], Vi, V20[0] s_waitcnt EXPcnt<=1 v_interp_p0 V31, V30[1], Vi, V30[0] s_waitcnt EXPcnt<=0 //Wait V30 v_interp_p2 V2, v_interp_p2 V12, V10[2], Vj, V11 v_interp_p2 V22, V20[2], Vj, V21 v_interp_p2 V32, V30[2], Vj, V31 V0[2], Vi, V0[0] s_waitcnt EXPcnt<=3 //Wait V0 Vj, V1 12.3.1. 16-bit Parameter Interpolation 16-bit interpolation operates on pairs of attribute values packed into a 16-bit VGPR. These use the same I and J values during interpolation. OPSEL is used to select the upper or lower portion of the data. There are variants of the 16-bit interpolation instructions that override the round mode to "round toward zero". V_INTERP_P10_F16_F32 dst.f32 = vgpr_hi/lo.f16 * vgpr.f32 + vgpr_hi/lo.f16 // tmp = P10 * I + P0 • allows OPSEL; Src0 uses DPP8=1,1,1,1,5,5,5,5; Src2 uses DPP8=0,0,0,0,4,4,4,4 V_INTERP_P2_F16_F32 dst.f16 = vgpr_hi/lo.f16 * vgpr.f32 + vgpr.f32 // dst = P2 * J + tmp • allows OPSEL; Src0 uses DPP8=2,2,2,2,6,6,6,6 12.4. LDS Direct Load Direct loads are only available in LDS, not in GDS. Direct access is allowed only in CU mode, not WGP mode. The LDS_DIRECT_LOAD instruction reads a single DWORD from LDS and returns it to a VGPR, broadcasting it 12.4. LDS Direct Load 124 of 597 "RDNA3" Instruction Set Architecture to all active lanes in the wave. M0 provides the address and data type. LDS_DIRECT_LOAD uses EXEC per quad, not per pixel: if any pixel in a quad is enabled then the data is written to all 4 pixels in the quad. LDS_DIRECT_LOAD uses EXPcnt to track completion. LDS_DIRECT_LOAD uses the same instruction format and fields as LDS_PARAM_LOAD. See Pixel Parameter Interpolation. LDS_addr = M0[15:0] (byte address and must be DWORD aligned) DataType = M0[18:16] 0 unsigned byte 1 unsigned short 2 DWORD 3 unused 4 signed byte 5 signed short 6,7 Reserved Example: LDS_DIRECT_LOAD V4 // load the value from LDS-address in M0[15:0] to V4 Signed byte and short data is sign-extend to 32 bits before writing the result to a VGPR; unsigned byte and short data is zero-extended to 32 bits before writing to a VGPR. 12.5. Data Share Indexed and Atomic Access Both LDS and GDS can perform indexed and atomic data share operations. For brevity, "LDS" is used in the text below and, except where noted, also applies to GDS. Indexed and atomic operations supply a unique address per work-item from the VGPRs to the LDS, and supply or return unique data per work-item back to VGPRs. Due to the internal banked structure of LDS, operations can complete in as little as one cycle (for wave32, or 2 cycles for wave64), or take as many 64 cycles, depending upon the number of bank conflicts (addresses that map to the same memory bank). Indexed operations are simple LDS load and store operations that read data from, and return data to, VGPRs. Atomic operations are arithmetic operations that combine data from VGPRs and data in LDS, and write the result back to LDS. Atomic operations have the option of returning the LDS "pre-op" value to VGPRs. LDS Indexed and atomic instructions use LGKMcnt to track when they have completed. LGKMcnt is incremented as each instruction is issued, and decremented when they have completed execution. LDS instructions stay in-order with other LDS instructions from the same wave. The table below lists and briefly describes the LDS instruction fields. Table 59. LDS Instruction Fields 12.5. Data Share Indexed and Atomic Access 125 of 597 "RDNA3" Instruction Set Architecture Field Size Description OP 8 LDS opcode. GDS 1 0 = LDS, 1 = GDS. OFFSET0 8 OFFSET1 8 Immediate address offset. Interpretation varies with opcode: Instructions with one address:: combine the offset fields into a 16-bit unsigned byte offset: {offset1, offset0}. Instructions that have 2 addresses (e.g. {LOAD, STORE, XCHG}_2ADDR):: use the offsets separately as 2 8bit unsigned offsets. Each offset is multiplied by 4 for 8, 16 and 32-bit data; multiplied by 8 for 64-bit data. VDST 8 VGPR to which result is written: either from LDS-load or atomic return value. ADDR 8 VGPR that supplies the byte address offset. DATA0 8 VGPR that supplies first data source. DATA1 8 VGPR that supplies second data source. M0 16 Unsigned byte Offset[15:0] used for: ds_load_addtid_b32, ds_write_addtid_b32 and for GDS-base/size The M0 register is not used for most LDS-indexed operations: only the "ADDTID" instructions read M0 and for these it represents a byte address. Table 60. LDS Indexed Load/Store Load / Store Description DS_LOAD_{B32,B64,B96,B128,U8,I8,U16,I16} Load one value per thread into VGPRs; if signed, sign extend to DWORD; zero e xtend if unsigned. DS_LOAD_2ADDR_{B32,B64} Load two values at unique addresses. DS_LOAD_2ADDR_STRIDE64_{B32,B64} Load 2 values at unique addresses; offset *= 64. DS_STORE_{B32,B64,B96,B128,B8,B16} Store one value from VGPR to LDS. DS_STORE_2ADDR_{B32,B64} Store two values. DS_STORE_2ADDR_STRIDE64_{B32,B64} Store two values, offset *= 64. DS_STOREXCHG_RTN_{B32,B64} Exchange GPR with LDS-memory. DS_STOREXCHG_2ADDR_RTN_{B32,B64} Exchange two separate GPRs with LDS-memory. DS_STOREXCHG_2ADDR_STRIDE64_RTN_{B32,B64} Exchange GPR with LDS-memory; offset *= 64. "D16 ops" - Load ops write only 16bits of VGPR, low or high; Store ops use 16bits of VGPR: DS_STORE_{B8, B16}_D16_HI Store 8 or 16 bits using high 16 bits of VGPR. DS_LOAD_{U8, I8, U16}_D16 Load unsigned or signed 8 or 16 bits into low-half of VGPR DS_LOAD_{U8, I8, U16}_D16_HI Load unsigned or signed 8 or 16 bits into high-half of VGPR DS_PERMUTE_B32 Forward permute. Does not write any LDS memory. See LDS Lanepermute Ops for details. DS_BPERMUTE_B32 Backward permute. Does not write any LDS memory. See LDS Lanepermute Ops for details. Single Address Instructions LDS_Addr = LDS_BASE + VGPR[ADDR] + {InstOffset1,InstOffset0} Double Address Instructions LDS_Addr0 = LDS_BASE + VGPR[ADDR] + InstOffset0*ADJ + LDS_Addr1 = LDS_BASE + VGPR[ADDR] + InstOffset1*ADJ + Where ADJ = 4 for 8, 16 and 32-bit data types; and ADJ = 8 for 64-bit. 12.5. Data Share Indexed and Atomic Access 126 of 597 "RDNA3" Instruction Set Architecture The double address instructions are: LOAD_2ADDR*, STORE_2ADDR*, and STOREXCHG_2ADDR_*. The address comes from VGPR, and both VGPR[ADDR] and InstOffset are byte addresses. At the time of wave creation, LDS_BASE is assigned to the physical LDS region owned by this wave or work-group. DS_{LOAD,STORE}_ADDTID Addressing LDS_Addr = LDS_BASE + {InstOffset1, InstOffset0} + TID(0..63)*4 + M0 Note: no part of the address comes from a VGPR. M0 must be DWORD-aligned. The "ADDTID" (add thread-id) is a separate form where the base address for the instruction is common to all threads, but then each thread has a fixed offset added in based on its thread-ID within the wave. This can allow a convenient way to quickly transfer data between VGPRs and LDS without having to use a VGPR to supply an address. LDS & GDS Opcodes Instruction Fields: op, gds, offset0, offset1, vdst, addr, data0, data1 32-bit no return 32-bit with return 64-bit no return ds_load_b{64,96,128} ds_store_b{64,96,128} ds_store_{b32,b16,b8} ds_store_b64 ds_load_addtid_b32 (LDS only) ds_permute_b32 (LDS only) ds_store_addtid_b32 (LDS only) ds_bpermute_b32 (LDS only) ds_store_2addr_b32 ds_store_2addr_b64 ds_store_2addr_stride64_b3 2 ds_store_2addr_stride64_ b64 64-bit with return ds_load_{b32, u8,i8,u16,i16} ds_load_b64 ds_store_b8_d16_hi ds_load_2addr_b32 ds_load_2addr_b64 ds_store_b16_d16_hi ds_load_2addr_stride64_b32 ds_load_2addr_stride64_b64 ds_load_u8_d16 ds_consume ds_load_u8_d16_hi ds_append ds_condxchg32_rtn_b64 ds_load_i8_d16 ds_load_i8_d16_hi ds_swizzle_b32 (LDS only) ds_load_u16_d16 ds_load_u16_d16_hi GDS-only Opcodes ds_ordered_count gws_init gws_sema_v gws_sema_bf gws_sema_p gws_barrier gws_sema_release_all ds_add_gs_reg_rtn ds_sub_gs_reg_rtn 12.5. Data Share Indexed and Atomic Access 127 of 597 "RDNA3" Instruction Set Architecture 12.5.1. LDS Atomic Ops Atomic ops combine data from a VGPR with data in LDS, write the result back to LDS memory and optionally return the "pre-op" value from LDS memory back to a VGPR. When multiple lanes in a wave access the same LDS location there it is not specified in which order the lanes will perform their operations, only that each lane will perform the complete read-modify-write operation before another lane operates on the data. LDS_Addr0 = LDS_BASE + VGPR[ADDR] + {InstOffset1,InstOffset0} VGPR[ADDR] is a byte address. VGPRs 0,1 and dst are double-GPRs for doubles data. VGPR data sources can only be VGPRs or constant values, not SGPRs. Floating point atomic ops use the MODE register to control denormal flushing behavior. LDS & GDS Atomic Opcodes Instruction Fields: op, gds, offset0, offset1, vdst, addr, data0, data1 32-bit no return 32-bit with return 64-bit no return 64-bit with return ds_add_u32 ds_add_rtn_u32 ds_add_u64 ds_add_rtn_u64 ds_sub_u32 ds_sub_rtn_u32 ds_sub_u64 ds_rsub_rtn_u64 ds_rsub_u32 ds_rsub_rtn_u32 ds_rsub_u64 ds_rsub_rtn_u64 ds_inc_u32 ds_inc_rtn_u32 ds_inc_u64 ds_inc_rtn_u64 ds_dec_u32 ds_dec_rtn_u32 ds_dec_u64 ds_dec_rtn_u64 ds_min_{u32,i32,f32} ds_min_rtn_{u32,i32,f32} ds_min_{u64,i64,f64} ds_min_rtn_{u64,i64,f64} ds_max_{u32,i32,f32} ds_max_rtn_{u32,i32,f32} ds_max_{u64,i64,f64} ds_max_rtn_{u64,i64,f64} ds_and_b32 ds_and_rtn_b32 ds_and_b64 ds_and_rtn_b64 ds_or_b32 ds_or_rtn_b32 ds_or_b64 ds_or_rtn_b64 ds_xor_b32 ds_xor_rtn_b32 ds_xor_b64 ds_xor_rtn_b64 ds_mskor_b32 ds_mskor_rtn_b32 ds_mskor_b64 ds_mskor_rtn_b64 ds_cmpstore_b32 ds_cmpstore_rtn_b32 ds_cmpstore_b64 ds_cmpstore_rtn_b64 ds_cmpstore_f32 ds_cmpstore_rtn_f32 ds_cmpstore_f64 ds_cmpstore_rtn_f64 ds_add_f32 ds_add_rtn_f32 ds_storexchg_rtn_b32 ds_storexchg_rtn_b64 ds_storexchg_2addr_rtn_b32 ds_storexchg_2addr_rtn_b64 ds_storexchg_2addr_stride64_rt n_b32 ds_storexchg_2addr_stride64_rt n_b64 12.5.2. LDS Lane-permute Ops DS_PERMUTE instructions allow data to be swizzled arbitrarily across 32 lanes. Two versions of the instruction are provided: forward (scatter) and backward (gather). These exist in LDS only, not GDS. Note that in wave64 mode the permute operates only across 32 lanes at a time on each half of a wave64. In other words, it executes as if were two independent wave32’s. Each half-wave can use indices in the range 0-31 to reference lanes in that same half-wave. These instructions use the LDS hardware but do not use any memory storage, and may be used by waves that have not allocated any LDS space. The instructions supply a data value from VGPRs and an index value per lane. 12.5. Data Share Indexed and Atomic Access 128 of 597 "RDNA3" Instruction Set Architecture • ds_permute_b32 : Dst[index[0..31]] = src[0..31] Where [0..31] is the lane number • ds_bpermute_b32 : Dst[0..31] = src[index[0..31]] The EXEC mask is honored for both reading the source and writing the destination. Index values out of range wrap around (only index bits [6:2] are used, the other bits of the index are ignored). Reading from disabled lanes returns zero. In the instruction word: VDST is the dest VGPR, ADDR is the index VGPR, and DATA0 is the source data VGPR. Note that index values are in bytes (so multiply by 4), and have the 'offset0' field added to them before use. 12.5.3. DS Stack Operations for Ray Tracing DS_BVH_STACK_RTN_B32 is an LDS instruction to manage a per-thread shallow stack in LDS used in ray tracing BVH traversal. BVH structures consist of box nodes and triangle nodes. A box node has up to four child node pointers that may all be returned to the shader (to VGPRs) for a given ray (thread). A traversal shader follows one pointer per ray per iteration, and extra pointers can be pushed to a per-thread stack in LDS. Note: the returned pointers are sorted. This "short stack" has a limited size beyond that the stack wraps around and overwrites older items. When the stack is exhausted, the shader should switch to a stackless mode where it looks up the parent of the current node from a table in memory. The shader program tracks the last visited address to avoid re-traversing subtrees. DS_BVH_STACK_RTN_B32 vgpr(dst), vgpr(stack_addr), vgpr(lvaddr), vgpr[4](data) Field Size Description OP 8 Instruction == DS_STORE_STACK (LDS only) GDS 1 1 = GDS, 0 = LDS (must be: 0 = LDS) OFFSET0 8 unused OFFSET1 8 bits[5:4] carry StackSize (8, 16, 32, 64) VDST 8 Destination VGPR for resulting address (e.g. X or top of stack) Returns the next "LV addr" 12.5. Data Share Indexed and Atomic Access 129 of 597 "RDNA3" Instruction Set Architecture Field Size Description ADDR 8 STACK_VGPR: Both a source and destination VGPR: supplies the LDS stack address and is written back with updated address. stack_addr[31:18] = stack_base[15:2] : stack base address (relative to allocated LDS space). stack_addr[17:16] = stack_size[1:0] : 0=8DWORDs, 1=16, 2=32, 3=64 DWORDs per thread stack_addr[15:0] = stack_index[15:0]. (bits [1:0] must be zero). DATA0 8 LVADDR: Last Visited Address. Is compared with data values (next field) to determine the next node to visit. DATA1 8 4 VGPRs (X,Y,Z,W). M0 16 Unused. 12.6. Global Data Share Global data Share is similar to LDS, but is a single memory accessible by all waves on the GPU. Global Data Share uses the same instruction format as local data share (indexed operations only - no interpolation or direct loads). Instructions increment the LGKMcnt for all loads, stores and atomics, and decrement LGKMcnt when the instruction completes. GDS instructions support only one active lane per instruction. The first active lane (based on EXEC) is used and others are ignored. M0 is used for: • [15:0] holds SIZE, in bytes • [31:16] holds BASE address in bytes 12.6.1. GS NGG Streamout Instructions The DS_ADD_GS_REG_RTN and DS_SUB_GS_REG_RTN instructions are used only by the GS stage, and are used for streamout. These instructions perform atomic add or sub operations to data in dedicated registers, not in GDS memory, and return the pre-op value. The source register is 32 bits and is an unsigned int. These 2 instructions increment the wave’s LGKMcnt, and decrement LGKMcnt when the instruction completes. Table 61. GDS Streamout Register Targets offset[5:2] Register 32-bit source, 32-bit dest & return value offset[5:2] Register 32-bit source, 64-bit dest & return value 0 GDS_STRMOUT_DWORDS_WRITTEN_0 8 GDS_STRMOUT_PRIMS_NEEDED_0 1 GDS_STRMOUT_DWORDS_WRITTEN_1 9 GDS_STRMOUT_PRIMS_WRITTEN_0 2 GDS_STRMOUT_DWORDS_WRITTEN_2 10 GDS_STRMOUT_PRIMS_NEEDED_1 3 GDS_STRMOUT_DWORDS_WRITTEN_3 11 GDS_STRMOUT_PRIMS_WRITTEN_1 4 GDS_GS_0 12 GDS_STRMOUT_PRIMS_NEEDED_2 5 GDS_GS_1 13 GDS_STRMOUT_PRIMS_WRITTEN_2 6 GDS_GS_2 14 GDS_STRMOUT_PRIMS_NEEDED_3 7 GDS_GS_3 15 GDS_STRMOUT_PRIMS_WRITTEN_3 Table 62. DS_ADD_GS_REG_RTN* and DS_SUB_GS_REG_RTN: Field Size Description OP 8 ds_add_gs_reg_rtn, ds_sub_gs_reg_rtn OFFSET0 8 gs_reg_index[3:0]=offset0[5:2] indexes the GS register array VDST 8 VGPR to write pre-op value to 12.6. Global Data Share 130 of 597 "RDNA3" Instruction Set Architecture Field Size Description DATA0 8 operand, from the first valid data; if no valid data (i.e., EXEC==0), the operand is 0. • The input comes from the first valid data of DATA0. • If offset[5:2] is 8-15: The operation is mapped to 64b operation to take 2 dst registers as a combined one. The source data is still 32b. The post-op result is 64b and store back to the 2 dst registers. The return value takes 2 VGPRs. • If offset[5:2] is 0-7: The operation is mapped to normal 32b operation. • For ds_add_gs_reg_rtn, the atomic add operation is ◦ VDST[0] = GS_REG[offset0[5:2]][31:0] ◦ If (offset0[5:2] >= 8) VDST[1] = GS_REG[offset0[5:2]][63:32] ◦ GS_REG[offset0[4:2]] += DATA0 • For ds_sub_gs_reg, the atomic sub operation is ◦ VDST[0] = GS_REG[offset0[5:2]][31:0] ◦ If (offset0[5:2] >= 8) VDST[1] = GS_REG[offset0[5:2]][63:32] ◦ GS_REG[offset0[4:2]] -= DATA0 12.7. Alignment and Errors GDS and LDS operations (both direct & indexed) report Memory Violation (memviol) for misaligned atomics. LDS handles misaligned indexed reads & writes, but only when SH_MEM_CONFIG. alignment_mode == UNALIGNED. Atomics must be aligned. LDS Alignment modes (config-reg controlled, in SH_MEM_CONFIG): • ALIGNMENT_MODE_DWORD: Automatic alignment to multiple of element size • ALIGNMENT_MODE_UNALIGNED: No alignment requirements. # LDS Access Type Source Inst Types Controls 1 Direct (Read Broadcast) ALU ops LDS_CONFIG.ADDR_OUT_ Out of range direct operations report memviol if OF_RANGE_REPORTING ADDR_OUT_OF_RANGE_REPORTING is true. 2 Indexed Atomic DS ops FLAT ops LDS_CONFIG.ADDR_OUT_ Out of range atomic operations report memviol if OF_RANGE_REPORTING ADDR_OUT_OF_RANGE_REPORTING is true. 3 Indexed Non- DS ops Atomic FLAT ops 12.7. Alignment and Errors Behavior LDS_CONFIG.ADDR_OUT_ the LSBs are ignored to force alignment. No memviol OF_RANGE_REPORTING is generated. Out of range indexed operations report memviol if ADDR_OUT_OF_RANGE_REPORTING is true. 131 of 597 "RDNA3" Instruction Set Architecture Chapter 13. Float Memory Atomics Floating point atomics can be issued as LDS, Buffer, and Flat/Global/Scratch instructions. 13.1. Rounding LDS and Memory atomics have the rounding mode for float-atomic-add fixed at "round to nearest even". The MODE.round bits are ignored. 13.2. Denormals When these operate on floating point data, there is the possibility of the data containing denormal numbers, or the operation producing a denormal. The floating point atomic instructions have the option of passing denormal values through, or flushing them to zero. LDS instructions allow denormals to be passed through or flushed to zero based on the MODE.denormal wavestate register. As with VALU ops, "denorm_single" affects F32 ops and "denorm_double" affects F64. LDS instructions use both FP_DENORM bits (allow_input_denormal, allow_output_denormal) to control flushing of inputs and outputs separately. • Float 32 bit adder uses both input and output denorm flush controls from MODE • Float CMP, MIN and MAX use only the "input denormal" flushing control ◦ Each input to the comparisons flushes the mantissa of both operands to zero before the compare if the exponent is zero and the flush denorm control is active. For Min and Max the actual result returned is the selected non-flushed input. ◦ CompareStore ("compare swap") flushes the result when input denormal flushing occurs. Cache Atomic Float Denormal (Buffer, Flat, Global, Scratch) Min/Max_F32 Mode CmpStore_F32, _F64 Mode Add_F32 Flush LDS Float Atomics Min/Max_F32 Mode CmpStore_F32, _F64 Mode Add_F32 Mode Min/Max_F64 Mode • "Flush" = flush all input denorm • "No Flush" = don’t flush input denorm • "Mode" = denormal flush controlled by bit from shader’s "MODE . fp_denorm" register Note that MIN and MAX when flushing denormals only do it for the comparison, but the result is an unmodified copy of one of the sources. CompareStore ("compare swap") flushes the result when input denormal flushing occurs. Memory Atomics: 13.1. Rounding 132 of 597 "RDNA3" Instruction Set Architecture The floating point atomic instructions (ds_{min,max,cmpst}_f32) have the option of passing denormal values through, or flushing them to zero. This is controlled with the MODE.fp_denorm bits that also control VALU denormal behavior. There is no separate input and output denormal control: only bit 0 of sp_denorm or bit 0 of dp_denorm is considered. The rest of the denormal rules are identical to LDS. Float atomic add is hardwired to flush input denormals - it does not use the MODE.fp_denorm bits. 13.3. NaN Handling Not A Number ("NaN") is a IEEE-754 value representing a result that cannot be computed. There two types of NaN: quiet and signaling • Quiet NaN Exponent=0xFF, Mantissa MSB=1 • Signaling NaN Exponent=0xFF, Mantissa MSB=0 and at least one other mantissa bit ==1 The LDS does not produce any exception or "signal" due to a signaling NaN. DS_ADD_F32 can create a quiet NaN, or propagate NaN from its inputs: if either input is a NaN, the output is that same NaN, and if both inputs are NaN, the NaN from the first input is selected as the output. Signaling NaN is converted to Quiet NaN. Floating point atomics (CMPSWAP, MIN, MAX) flush input denormals only when MODE (allow_input_denorm)=0, otherwise values are passed through without modification. When flushing, denorms are flushed before the operation (i.e. before the comparison). FP Max Selection Rules: if (src0 == SNaN) result = QNaN (src0) // bits of SRC0 are preserved but is a QNaN else if (src1 == SNaN) result = QNaN (src1) else result = larger of (src0, src1) "Larger" order from smallest to largest: QNaN, -inf, -float, -denorm, -0, +0, +denorm, +float, +inf FP Min Selection Rules: if (src0 == SNaN) result = QNaN (src0) else if (src1 == SNaN) result = QNaN (src1) else result = smaller of (src0, src1) "Smaller" order from smallest to largest: -inf, -float, -denorm, -0, +0, +denorm, +float, +inf, QNaN FP Compare Swap: only swap if the compare condition (==) is true, treating +0 and -0 as equal doSwap = (src0 != NaN) && (src1 != NaN) && (src0 == src1) // allow +0 == -0 Float Add rules: 1. -INF + INF = QNAN (mantissa is all zeros except MSB) 2. +/-INF + NAN = QNAN (NAN input is copied to output but made quiet NAN) 3. -INF + INF, or INF - INF = -QNAN 4. -0 + 0 = +0 13.3. NaN Handling 133 of 597 "RDNA3" Instruction Set Architecture 5. INF + (float, +0, -0) = INF, with infinity sign preserved 6. NaN + NaN = SRC0’s NaN, converted to QNaN 13.4. Global Wave Sync & Atomic Ordered Count Global Wave Sync (GWS) provides a capability to synchronize between different waves across the entire GPU. GWS instructions use LGKMcnt to determine when the operation has completed. 13.4.1. GWS and Ordered Count Programming Rule "GWS" instructions (ordered count and GWS*) must be issued as a single instruction clause of the form: S_WAITCNT LGKMcnt==0 // this is only necessary if there might be any outstanding GDS instructions GWS_instruction S_WAITCNT LGKMcnt==0 >1) & 0x5555_5555) 1 Pix1, MRT0 Pix1, MRT1 Pix3 MRT0 14.2. Primitive Shader Exports (From GS shader stage) The GS shader uses export instructions to output vertex position data, and memory stores for vertex parameter data. This data is passed on to subsequent pixel shaders. Every vertex shader must output at least one position vector (x, y, z; w is optional) to the POS0 target. The last position export must have the DONE bit set to 1. For optimized performance, it is recommended to output all position data as early as possible in the vertex shader. 14.3. Dependency Checking Export instructions are executed by the hardware in two phases. First, the instruction is selected to be executed, and EXPCNT is incremented by 1. At this time, the wave has made a request to export data, but the 14.1. Pixel Shader Exports 140 of 597 "RDNA3" Instruction Set Architecture data has not been exported yet. Later, when the export actually occurs the EXEC mask and VGPR data is read and the data is exported, and finally EXPcnt is decremented. Use S_WAITCNT on EXPcnt to prevent the shader program from overwriting EXEC or the VGPRs holding the data to be exported before the export operation has completed. Multiple export instructions can be outstanding at one time. Exports of the same type (for example: position) are completed in order, but exports of different types can be completed out of order. If the STATUS register’s SKIP_EXPORT bit is set to one, the hardware treats all EXPORT instructions as if they were NOPs. 14.3. Dependency Checking 141 of 597 "RDNA3" Instruction Set Architecture Chapter 15. Microcode Formats This section specifies the microcode formats. The definitions can be used to simplify compilation by providing standard templates and enumeration names for the various instruction formats. Endian Order - The RDNA3 architecture addresses memory and registers using little-endian byte-ordering and bit-ordering. Multi-byte values are stored with their least-significant (low-order) byte at the lowest byte address, and they are illustrated with their least-significant byte at the right side. Byte values are stored with their least-significant (low-order) bit (LSB) at the lowest bit address, and they are illustrated with their LSB at the right side. SALU and VALU instructions may optionally include a 32-bit literal constant, and some VALU instructions may include a 32-bit DPP control DWORD at the end of the instructions. No instruction may use both DPP and a literal constant. The table below summarizes the microcode formats and their widths, not including extra literal or DPP instruction words. The sections that follow provide details. Table 64. Summary of Microcode Formats Microcode Formats Reference Width (bits) SOP2 SOP2 32 SOP1 SOP1 SOPK SOPK SOPP SOPP SOPC SOPC Scalar ALU and Control Formats Scalar Memory Format SMEM SMEM 64 VOP1 VOP1 32 VOP2 VOP2 32 VOPC VOPC 32 VOP3 VOP3 64 VOP3SD VOP3SD 64 VOP3P VOP3P 64 VOPD VOPD 64 DPP16 DPP16 32 DPP8 DPP8 32 VINTERP 64 LDSDIR 32 DS 64 MTBUF MTBUF 64 MUBUF MUBUF 64 Vector ALU Format Vector Parameter Interpolation Format VINTERP LDS Parameter Load and Direct Load LDSDIR LDS/GDS Format DS Vector Memory Buffer Formats Vector Memory Image Format 142 of 597 "RDNA3" Instruction Set Architecture Microcode Formats Reference Width (bits) MIMG MIMG 64 or 96 EXP 64 FLAT FLAT 64 GLOBAL GLOBAL 64 SCRATCH SCRATCH 64 Export Format EXP Flat Formats  any instruction field marked as "Reserved" must be set to zero. Instruction Suffixes Most instructions include a suffix that indicates the data type the instruction handles. This suffix may also include a number that indicates the size of the data. For example: "F32" indicates "32-bit floating point data", or "B16" is "16-bit binary data". • B = binary • F = floating point • BF = "brain-float" floating point • U = unsigned integer • S = signed integer When more than one data-type specifier occurs in an instruction, the first one is the result type and size, and the later one(s) is/are input data type and size. E.g. V_CVT_F32_I32 reads an integer and writes a float. 143 of 597 "RDNA3" Instruction Set Architecture 15.1. Scalar ALU and Control Formats 15.1.1. SOP2 Description This is a scalar instruction with two inputs and one output. Can be followed by a 32-bit literal constant. Table 65. SOP2 Fields Field Name Bits Format or Description SSRC0 [7:0] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 - 252 253 254 255 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). Reserved. SCC. Reserved. Literal constant. SSRC1 [15:8] Second scalar source operand. Same codes as SSRC0, above. SDST [22:16] Scalar destination. Same codes as SSRC0, above except only codes 0-127 are valid. OP [29:23] See Opcode table below. ENCODING [31:30] 'b10 Table 66. SOP2 Opcodes 15.1. Scalar ALU and Control Formats 144 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 0 S_ADD_U32 27 S_XOR_B64 1 S_SUB_U32 28 S_NAND_B32 2 S_ADD_I32 29 S_NAND_B64 3 S_SUB_I32 30 S_NOR_B32 4 S_ADDC_U32 31 S_NOR_B64 5 S_SUBB_U32 32 S_XNOR_B32 6 S_ABSDIFF_I32 33 S_XNOR_B64 8 S_LSHL_B32 34 S_AND_NOT1_B32 9 S_LSHL_B64 35 S_AND_NOT1_B64 10 S_LSHR_B32 36 S_OR_NOT1_B32 11 S_LSHR_B64 37 S_OR_NOT1_B64 12 S_ASHR_I32 38 S_BFE_U32 13 S_ASHR_I64 39 S_BFE_I32 14 S_LSHL1_ADD_U32 40 S_BFE_U64 15 S_LSHL2_ADD_U32 41 S_BFE_I64 16 S_LSHL3_ADD_U32 42 S_BFM_B32 17 S_LSHL4_ADD_U32 43 S_BFM_B64 18 S_MIN_I32 44 S_MUL_I32 19 S_MIN_U32 45 S_MUL_HI_U32 20 S_MAX_I32 46 S_MUL_HI_I32 21 S_MAX_U32 48 S_CSELECT_B32 22 S_AND_B32 49 S_CSELECT_B64 23 S_AND_B64 50 S_PACK_LL_B32_B16 24 S_OR_B32 51 S_PACK_LH_B32_B16 25 S_OR_B64 52 S_PACK_HH_B32_B16 26 S_XOR_B32 53 S_PACK_HL_B32_B16 15.1.2. SOPK Description This is a scalar instruction with one 16-bit signed immediate (SIMM16) input and a single destination. Instructions that take 2 inputs use the destination as the first input and the SIMM16 as the second input. E.g. "S_CMPK_GT_I32 S0, 1" means "SCC = (s0 > 1)" Table 67. SOPK Fields Field Name Bits Format or Description SIMM16 [15:0] Signed immediate 16-bit value. 15.1. Scalar ALU and Control Formats 145 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SDST [22:16] 0-105 106 107 108-123 124 125 126 127 Scalar destination, and can provide second source operand. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. M0. Memory register 0. NULL EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. OP [27:23] See Opcode table below. ENCODING [31:28] 'b1011 Table 68. SOPK Opcodes Opcode # Name Opcode # Name 0 S_MOVK_I32 13 S_CMPK_LT_U32 1 S_VERSION 14 S_CMPK_LE_U32 2 S_CMOVK_I32 15 S_ADDK_I32 3 S_CMPK_EQ_I32 16 S_MULK_I32 4 S_CMPK_LG_I32 17 S_GETREG_B32 5 S_CMPK_GT_I32 18 S_SETREG_B32 6 S_CMPK_GE_I32 19 S_SETREG_IMM32_B32 7 S_CMPK_LT_I32 20 S_CALL_B64 8 S_CMPK_LE_I32 24 S_WAITCNT_VSCNT 9 S_CMPK_EQ_U32 25 S_WAITCNT_VMCNT 10 S_CMPK_LG_U32 26 S_WAITCNT_EXPCNT 11 S_CMPK_GT_U32 27 S_WAITCNT_LGKMCNT 12 S_CMPK_GE_U32 15.1.3. SOP1 Description This is a scalar instruction with two inputs and one output. Can be followed by a 32-bit literal constant. Table 69. SOP1 Fields 15.1. Scalar ALU and Control Formats 146 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SSRC0 [7:0] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 - 252 253 254 255 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). Reserved. SCC. Reserved. Literal constant. OP [15:8] See Opcode table below. SDST [22:16] Scalar destination. Same codes as SSRC0, above except only codes 0-127 are valid. ENCODING [31:23] 'b10_1111101 Table 70. SOP1 Opcodes Opcode # Name Opcode # Name 0 S_MOV_B32 35 S_OR_SAVEEXEC_B64 1 S_MOV_B64 36 S_XOR_SAVEEXEC_B32 2 S_CMOV_B32 37 S_XOR_SAVEEXEC_B64 3 S_CMOV_B64 38 S_NAND_SAVEEXEC_B32 4 S_BREV_B32 39 S_NAND_SAVEEXEC_B64 5 S_BREV_B64 40 S_NOR_SAVEEXEC_B32 8 S_CTZ_I32_B32 41 S_NOR_SAVEEXEC_B64 9 S_CTZ_I32_B64 42 S_XNOR_SAVEEXEC_B32 10 S_CLZ_I32_U32 43 S_XNOR_SAVEEXEC_B64 11 S_CLZ_I32_U64 44 S_AND_NOT0_SAVEEXEC_B32 12 S_CLS_I32 45 S_AND_NOT0_SAVEEXEC_B64 13 S_CLS_I32_I64 46 S_OR_NOT0_SAVEEXEC_B32 14 S_SEXT_I32_I8 47 S_OR_NOT0_SAVEEXEC_B64 15 S_SEXT_I32_I16 48 S_AND_NOT1_SAVEEXEC_B32 15.1. Scalar ALU and Control Formats 147 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 16 S_BITSET0_B32 49 S_AND_NOT1_SAVEEXEC_B64 17 S_BITSET0_B64 50 S_OR_NOT1_SAVEEXEC_B32 18 S_BITSET1_B32 51 S_OR_NOT1_SAVEEXEC_B64 19 S_BITSET1_B64 52 S_AND_NOT0_WREXEC_B32 20 S_BITREPLICATE_B64_B32 53 S_AND_NOT0_WREXEC_B64 21 S_ABS_I32 54 S_AND_NOT1_WREXEC_B32 22 S_BCNT0_I32_B32 55 S_AND_NOT1_WREXEC_B64 23 S_BCNT0_I32_B64 64 S_MOVRELS_B32 24 S_BCNT1_I32_B32 65 S_MOVRELS_B64 25 S_BCNT1_I32_B64 66 S_MOVRELD_B32 26 S_QUADMASK_B32 67 S_MOVRELD_B64 27 S_QUADMASK_B64 68 S_MOVRELSD_2_B32 28 S_WQM_B32 71 S_GETPC_B64 29 S_WQM_B64 72 S_SETPC_B64 30 S_NOT_B32 73 S_SWAPPC_B64 31 S_NOT_B64 74 S_RFE_B64 32 S_AND_SAVEEXEC_B32 76 S_SENDMSG_RTN_B32 33 S_AND_SAVEEXEC_B64 77 S_SENDMSG_RTN_B64 34 S_OR_SAVEEXEC_B32 15.1.4. SOPC Description This is a scalar instruction with two inputs that are compared and produces SCC as a result. Can be followed by a 32-bit literal constant. Table 71. SOPC Fields 15.1. Scalar ALU and Control Formats 148 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SSRC0 [7:0] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 - 252 253 254 255 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). Reserved. SCC. Reserved. Literal constant. SSRC1 [15:8] Second scalar source operand. Same codes as SSRC0, above. OP [22:16] See Opcode table below. ENCODING [31:23] 'b10_1111110 Table 72. SOPC Opcodes Opcode # Name Opcode # Name 0 S_CMP_EQ_I32 9 S_CMP_GE_U32 1 S_CMP_LG_I32 10 S_CMP_LT_U32 2 S_CMP_GT_I32 11 S_CMP_LE_U32 3 S_CMP_GE_I32 12 S_BITCMP0_B32 4 S_CMP_LT_I32 13 S_BITCMP1_B32 5 S_CMP_LE_I32 14 S_BITCMP0_B64 6 S_CMP_EQ_U32 15 S_BITCMP1_B64 7 S_CMP_LG_U32 16 S_CMP_EQ_U64 8 S_CMP_GT_U32 17 S_CMP_LG_U64 15.1. Scalar ALU and Control Formats 149 of 597 "RDNA3" Instruction Set Architecture 15.1.5. SOPP Description This is a scalar instruction with one 16-bit signed immediate (SIMM16) input. Table 73. SOPP Fields Field Name Bits Format or Description SIMM16 [15:0] Signed immediate 16-bit value. OP [22:16] See Opcode table below. ENCODING [31:23] 'b10_1111111 Table 74. SOPP Opcodes Opcode # Name Opcode # Name 0 S_NOP 36 S_CBRANCH_VCCNZ 1 S_SETKILL 37 S_CBRANCH_EXECZ 2 S_SETHALT 38 S_CBRANCH_EXECNZ 3 S_SLEEP 39 S_CBRANCH_CDBGSYS 4 S_SET_INST_PREFETCH_DISTANCE 40 S_CBRANCH_CDBGUSER 5 S_CLAUSE 41 S_CBRANCH_CDBGSYS_OR_USER 7 S_DELAY_ALU 42 S_CBRANCH_CDBGSYS_AND_USER 8 Reserved 48 S_ENDPGM 9 S_WAITCNT 49 S_ENDPGM_SAVED 10 S_WAIT_IDLE 50 S_ENDPGM_ORDERED_PS_DONE 11 S_WAIT_EVENT 52 S_WAKEUP 16 S_TRAP 53 S_SETPRIO 17 S_ROUND_MODE 54 S_SENDMSG 18 S_DENORM_MODE 55 S_SENDMSGHALT 31 S_CODE_END 56 S_INCPERFLEVEL 32 S_BRANCH 57 S_DECPERFLEVEL 33 S_CBRANCH_SCC0 60 S_ICACHE_INV 34 S_CBRANCH_SCC1 61 S_BARRIER 35 S_CBRANCH_VCCZ 15.1. Scalar ALU and Control Formats 150 of 597 "RDNA3" Instruction Set Architecture 15.2. Scalar Memory Format 15.2.1. SMEM Description Scalar Memory data load Table 75. SMEM Fields Field Name Bits Format or Description SBASE [5:0] SGPR-pair that provides base address or SGPR-quad that provides V#. (LSB of SGPR address is omitted). SDATA [12:6] SGPR that provides write data or accepts return data. DLC [14] Device level coherent. GLC [16] Globally memory Coherent. Force bypass of L1 cache, or for atomics, cause pre-op value to be returned. OP [25:18] See Opcode table below. ENCODING [31:26] 'b111101 OFFSET [52:32] An immediate signed byte offset. Ignored for cache invalidations. SOFFSET [63:57] SGPR that supplies an unsigned byte offset. Disabled if set to NULL. Table 76. SMEM Opcodes Opcode # Name Opcode # Name 0 S_LOAD_B32 9 S_BUFFER_LOAD_B64 1 S_LOAD_B64 10 S_BUFFER_LOAD_B128 2 S_LOAD_B128 11 S_BUFFER_LOAD_B256 3 S_LOAD_B256 12 S_BUFFER_LOAD_B512 4 S_LOAD_B512 32 S_GL1_INV 8 S_BUFFER_LOAD_B32 33 S_DCACHE_INV 15.2. Scalar Memory Format 151 of 597 "RDNA3" Instruction Set Architecture 15.3. Vector ALU Formats 15.3.1. VOP2 Description Vector ALU format with two input operands. Can be followed by a 32-bit literal constant or DPP instruction DWORD when the instruction allows it. Table 77. VOP2 Fields Field Name Bits Format or Description SRC0 [8:0] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 VSRC1 [16:9] VGPR that provides the second operand. VDST [24:17] Destination VGPR. OP [30:25] See Opcode table below. ENCODING [31] 'b0 Table 78. VOP2 Opcodes 15.3. Vector ALU Formats 152 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 1 V_CNDMASK_B32 29 V_XOR_B32 2 V_DOT2ACC_F32_F16 30 V_XNOR_B32 3 V_ADD_F32 32 V_ADD_CO_CI_U32 4 V_SUB_F32 33 V_SUB_CO_CI_U32 5 V_SUBREV_F32 34 V_SUBREV_CO_CI_U32 6 V_FMAC_DX9_ZERO_F32 37 V_ADD_NC_U32 7 V_MUL_DX9_ZERO_F32 38 V_SUB_NC_U32 8 V_MUL_F32 39 V_SUBREV_NC_U32 9 V_MUL_I32_I24 43 V_FMAC_F32 10 V_MUL_HI_I32_I24 44 V_FMAMK_F32 11 V_MUL_U32_U24 45 V_FMAAK_F32 12 V_MUL_HI_U32_U24 47 V_CVT_PK_RTZ_F16_F32 15 V_MIN_F32 50 V_ADD_F16 16 V_MAX_F32 51 V_SUB_F16 17 V_MIN_I32 52 V_SUBREV_F16 18 V_MAX_I32 53 V_MUL_F16 19 V_MIN_U32 54 V_FMAC_F16 20 V_MAX_U32 55 V_FMAMK_F16 24 V_LSHLREV_B32 56 V_FMAAK_F16 25 V_LSHRREV_B32 57 V_MAX_F16 26 V_ASHRREV_I32 58 V_MIN_F16 27 V_AND_B32 59 V_LDEXP_F16 28 V_OR_B32 60 V_PK_FMAC_F16 15.3.2. VOP1 Description Vector ALU format with one input operand. Can be followed by a 32-bit literal constant or DPP instruction DWORD when the instruction allows it. Table 79. VOP1 Fields 15.3. Vector ALU Formats 153 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SRC0 [8:0] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 OP [16:9] See Opcode table below. VDST [24:17] Destination VGPR. ENCODING [31:25] 'b0_111111 Table 80. VOP1 Opcodes Opcode # Name Opcode # Name 0 V_NOP 54 V_COS_F32 1 V_MOV_B32 55 V_NOT_B32 2 V_READFIRSTLANE_B32 56 V_BFREV_B32 3 V_CVT_I32_F64 57 V_CLZ_I32_U32 4 V_CVT_F64_I32 58 V_CTZ_I32_B32 5 V_CVT_F32_I32 59 V_CLS_I32 6 V_CVT_F32_U32 60 V_FREXP_EXP_I32_F64 7 V_CVT_U32_F32 61 V_FREXP_MANT_F64 8 V_CVT_I32_F32 62 V_FRACT_F64 10 V_CVT_F16_F32 63 V_FREXP_EXP_I32_F32 11 V_CVT_F32_F16 64 V_FREXP_MANT_F32 12 V_CVT_NEAREST_I32_F32 66 V_MOVRELD_B32 15.3. Vector ALU Formats 154 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 13 V_CVT_FLOOR_I32_F32 67 V_MOVRELS_B32 14 V_CVT_OFF_F32_I4 68 V_MOVRELSD_B32 15 V_CVT_F32_F64 72 V_MOVRELSD_2_B32 16 V_CVT_F64_F32 80 V_CVT_F16_U16 17 V_CVT_F32_UBYTE0 81 V_CVT_F16_I16 18 V_CVT_F32_UBYTE1 82 V_CVT_U16_F16 19 V_CVT_F32_UBYTE2 83 V_CVT_I16_F16 20 V_CVT_F32_UBYTE3 84 V_RCP_F16 21 V_CVT_U32_F64 85 V_SQRT_F16 22 V_CVT_F64_U32 86 V_RSQ_F16 23 V_TRUNC_F64 87 V_LOG_F16 24 V_CEIL_F64 88 V_EXP_F16 25 V_RNDNE_F64 89 V_FREXP_MANT_F16 26 V_FLOOR_F64 90 V_FREXP_EXP_I16_F16 27 V_PIPEFLUSH 91 V_FLOOR_F16 28 V_MOV_B16 92 V_CEIL_F16 32 V_FRACT_F32 93 V_TRUNC_F16 33 V_TRUNC_F32 94 V_RNDNE_F16 34 V_CEIL_F32 95 V_FRACT_F16 35 V_RNDNE_F32 96 V_SIN_F16 36 V_FLOOR_F32 97 V_COS_F16 37 V_EXP_F32 98 V_SAT_PK_U8_I16 39 V_LOG_F32 99 V_CVT_NORM_I16_F16 42 V_RCP_F32 100 V_CVT_NORM_U16_F16 43 V_RCP_IFLAG_F32 101 V_SWAP_B32 46 V_RSQ_F32 102 V_SWAP_B16 47 V_RCP_F64 103 V_PERMLANE64_B32 49 V_RSQ_F64 104 V_SWAPREL_B32 51 V_SQRT_F32 105 V_NOT_B16 52 V_SQRT_F64 106 V_CVT_I32_I16 53 V_SIN_F32 107 V_CVT_U32_U16 15.3.3. VOPC Description Vector instruction taking two inputs and producing a comparison result. Can be followed by a 32-bit literal constant or DPP control DWORD. Vector Comparison operations are divided into three groups: • those that can use any one of 16 comparison operations, • those that can use any one of 8, and • those that have a single comparison operation. The final opcode number is determined by adding the base for the opcode family plus the offset from the compare op. Compare instructions write a result to VCC (for VOPC) or an SGPR (for VOP3). Additionally, 15.3. Vector ALU Formats 155 of 597 "RDNA3" Instruction Set Architecture compare instructions have variants that writes to the EXEC mask instead of VCC or SGPR. The destination of the compare result is VCC or EXEC when encoded using the VOPC format, and can be an arbitrary SGPR (indicated in the VDST field) when only encoded in the VOP3 format. Comparison Operations Table 81. Comparison Operations Compare Operation Opcode Offset Description Sixteen Compare Operations (COMPF) F 0 D.u = 0 LT 1 D.u = (S0 < S1) EQ 2 D.u = (S0 == S1) LE 3 D.u = (S0 <= S1) GT 4 D.u = (S0 > S1) LG 5 D.u = (S0 <> S1) GE 6 D.u = (S0 >= S1) O 7 D.u = (!isNaN(S0) && !isNaN(S1)) U 8 D.u = (!isNaN(S0) || !isNaN(S1)) NGE 9 D.u = !(S0 >= S1) NLG 10 D.u = !(S0 <> S1) NGT 11 D.u = !(S0 > S1) NLE 12 D.u = !(S0 <= S1) NEQ 13 D.u = !(S0 == S1) NLT 14 D.u = !(S0 < S1) TRU 15 D.u = 1 Eight Compare Operations (COMPI) F 0 D.u = 0 LT 1 D.u = (S0 < S1) EQ 2 D.u = (S0 == S1) LE 3 D.u = (S0 <= S1) GT 4 D.u = (S0 > S1) LG 5 D.u = (S0 <> S1) GE 6 D.u = (S0 >= S1) TRU 7 D.u = 1 Table 82. VOPC Fields 15.3. Vector ALU Formats 156 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SRC0 [8:0] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 VSRC1 [16:9] VGPR that provides the second operand. OP [24:17] See Opcode table below. ENCODING [31:25] 'b0_111110 Table 83. VOPC Opcodes Opcode # Name Opcode # Name 0 V_CMP_F_F16 128 V_CMPX_F_F16 1 V_CMP_LT_F16 129 V_CMPX_LT_F16 2 V_CMP_EQ_F16 130 V_CMPX_EQ_F16 3 V_CMP_LE_F16 131 V_CMPX_LE_F16 4 V_CMP_GT_F16 132 V_CMPX_GT_F16 5 V_CMP_LG_F16 133 V_CMPX_LG_F16 6 V_CMP_GE_F16 134 V_CMPX_GE_F16 7 V_CMP_O_F16 135 V_CMPX_O_F16 8 V_CMP_U_F16 136 V_CMPX_U_F16 9 V_CMP_NGE_F16 137 V_CMPX_NGE_F16 10 V_CMP_NLG_F16 138 V_CMPX_NLG_F16 11 V_CMP_NGT_F16 139 V_CMPX_NGT_F16 15.3. Vector ALU Formats 157 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 12 V_CMP_NLE_F16 140 V_CMPX_NLE_F16 13 V_CMP_NEQ_F16 141 V_CMPX_NEQ_F16 14 V_CMP_NLT_F16 142 V_CMPX_NLT_F16 15 V_CMP_T_F16 143 V_CMPX_T_F16 16 V_CMP_F_F32 144 V_CMPX_F_F32 17 V_CMP_LT_F32 145 V_CMPX_LT_F32 18 V_CMP_EQ_F32 146 V_CMPX_EQ_F32 19 V_CMP_LE_F32 147 V_CMPX_LE_F32 20 V_CMP_GT_F32 148 V_CMPX_GT_F32 21 V_CMP_LG_F32 149 V_CMPX_LG_F32 22 V_CMP_GE_F32 150 V_CMPX_GE_F32 23 V_CMP_O_F32 151 V_CMPX_O_F32 24 V_CMP_U_F32 152 V_CMPX_U_F32 25 V_CMP_NGE_F32 153 V_CMPX_NGE_F32 26 V_CMP_NLG_F32 154 V_CMPX_NLG_F32 27 V_CMP_NGT_F32 155 V_CMPX_NGT_F32 28 V_CMP_NLE_F32 156 V_CMPX_NLE_F32 29 V_CMP_NEQ_F32 157 V_CMPX_NEQ_F32 30 V_CMP_NLT_F32 158 V_CMPX_NLT_F32 31 V_CMP_T_F32 159 V_CMPX_T_F32 32 V_CMP_F_F64 160 V_CMPX_F_F64 33 V_CMP_LT_F64 161 V_CMPX_LT_F64 34 V_CMP_EQ_F64 162 V_CMPX_EQ_F64 35 V_CMP_LE_F64 163 V_CMPX_LE_F64 36 V_CMP_GT_F64 164 V_CMPX_GT_F64 37 V_CMP_LG_F64 165 V_CMPX_LG_F64 38 V_CMP_GE_F64 166 V_CMPX_GE_F64 39 V_CMP_O_F64 167 V_CMPX_O_F64 40 V_CMP_U_F64 168 V_CMPX_U_F64 41 V_CMP_NGE_F64 169 V_CMPX_NGE_F64 42 V_CMP_NLG_F64 170 V_CMPX_NLG_F64 43 V_CMP_NGT_F64 171 V_CMPX_NGT_F64 44 V_CMP_NLE_F64 172 V_CMPX_NLE_F64 45 V_CMP_NEQ_F64 173 V_CMPX_NEQ_F64 46 V_CMP_NLT_F64 174 V_CMPX_NLT_F64 47 V_CMP_T_F64 175 V_CMPX_T_F64 49 V_CMP_LT_I16 177 V_CMPX_LT_I16 50 V_CMP_EQ_I16 178 V_CMPX_EQ_I16 51 V_CMP_LE_I16 179 V_CMPX_LE_I16 52 V_CMP_GT_I16 180 V_CMPX_GT_I16 53 V_CMP_NE_I16 181 V_CMPX_NE_I16 54 V_CMP_GE_I16 182 V_CMPX_GE_I16 57 V_CMP_LT_U16 185 V_CMPX_LT_U16 58 V_CMP_EQ_U16 186 V_CMPX_EQ_U16 59 V_CMP_LE_U16 187 V_CMPX_LE_U16 60 V_CMP_GT_U16 188 V_CMPX_GT_U16 61 V_CMP_NE_U16 189 V_CMPX_NE_U16 15.3. Vector ALU Formats 158 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 62 V_CMP_GE_U16 190 V_CMPX_GE_U16 64 V_CMP_F_I32 192 V_CMPX_F_I32 65 V_CMP_LT_I32 193 V_CMPX_LT_I32 66 V_CMP_EQ_I32 194 V_CMPX_EQ_I32 67 V_CMP_LE_I32 195 V_CMPX_LE_I32 68 V_CMP_GT_I32 196 V_CMPX_GT_I32 69 V_CMP_NE_I32 197 V_CMPX_NE_I32 70 V_CMP_GE_I32 198 V_CMPX_GE_I32 71 V_CMP_T_I32 199 V_CMPX_T_I32 72 V_CMP_F_U32 200 V_CMPX_F_U32 73 V_CMP_LT_U32 201 V_CMPX_LT_U32 74 V_CMP_EQ_U32 202 V_CMPX_EQ_U32 75 V_CMP_LE_U32 203 V_CMPX_LE_U32 76 V_CMP_GT_U32 204 V_CMPX_GT_U32 77 V_CMP_NE_U32 205 V_CMPX_NE_U32 78 V_CMP_GE_U32 206 V_CMPX_GE_U32 79 V_CMP_T_U32 207 V_CMPX_T_U32 80 V_CMP_F_I64 208 V_CMPX_F_I64 81 V_CMP_LT_I64 209 V_CMPX_LT_I64 82 V_CMP_EQ_I64 210 V_CMPX_EQ_I64 83 V_CMP_LE_I64 211 V_CMPX_LE_I64 84 V_CMP_GT_I64 212 V_CMPX_GT_I64 85 V_CMP_NE_I64 213 V_CMPX_NE_I64 86 V_CMP_GE_I64 214 V_CMPX_GE_I64 87 V_CMP_T_I64 215 V_CMPX_T_I64 88 V_CMP_F_U64 216 V_CMPX_F_U64 89 V_CMP_LT_U64 217 V_CMPX_LT_U64 90 V_CMP_EQ_U64 218 V_CMPX_EQ_U64 91 V_CMP_LE_U64 219 V_CMPX_LE_U64 92 V_CMP_GT_U64 220 V_CMPX_GT_U64 93 V_CMP_NE_U64 221 V_CMPX_NE_U64 94 V_CMP_GE_U64 222 V_CMPX_GE_U64 95 V_CMP_T_U64 223 V_CMPX_T_U64 125 V_CMP_CLASS_F16 253 V_CMPX_CLASS_F16 126 V_CMP_CLASS_F32 254 V_CMPX_CLASS_F32 127 V_CMP_CLASS_F64 255 V_CMPX_CLASS_F64 15.3.4. VOP3 Description Vector ALU format with three input operands. Can be followed by a 32-bit literal constant or DPP instruction DWORD when the instruction allows it. Table 84. VOP3 Fields 15.3. Vector ALU Formats 159 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description VDST [7:0] Destination VGPR ABS [10:8] Absolute value of input. [8] = src0, [9] = src1, [10] = src2 OPSEL [14:11] Operand select for 16-bit data. 0 = select low half, 1 = select high half. [11] = src0, [12] = src1, [13] = src2, [14] = dest. CLMP [15] Clamp output OP [25:16] Opcode. See next table. ENCODING [31:26] 'b110101 SRC0 [40:32] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 SRC1 [49:41] Second input operand. Same options as SRC0. SRC2 [58:50] Third input operand. Same options as SRC0. OMOD [60:59] Output Modifier: 0=none, 1=*2, 2=*4, 3=*0.5 NEG [63:61] Negate input. [61] = src0, [62] = src1, [63] = src2 Table 85. VOP3 Opcodes Opcode # Name Opcode # Name 384 V_NOP 803 V_CVT_PK_U16_U32 385 V_MOV_B32 804 V_CVT_PK_I16_I32 386 V_READFIRSTLANE_B32 805 V_SUB_NC_I32 387 V_CVT_I32_F64 806 V_ADD_NC_I32 15.3. Vector ALU Formats 160 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 388 V_CVT_F64_I32 807 V_ADD_F64 389 V_CVT_F32_I32 808 V_MUL_F64 390 V_CVT_F32_U32 809 V_MIN_F64 391 V_CVT_U32_F32 810 V_MAX_F64 392 V_CVT_I32_F32 811 V_LDEXP_F64 394 V_CVT_F16_F32 812 V_MUL_LO_U32 395 V_CVT_F32_F16 813 V_MUL_HI_U32 396 V_CVT_NEAREST_I32_F32 814 V_MUL_HI_I32 397 V_CVT_FLOOR_I32_F32 815 V_TRIG_PREOP_F64 398 V_CVT_OFF_F32_I4 824 V_LSHLREV_B16 399 V_CVT_F32_F64 825 V_LSHRREV_B16 400 V_CVT_F64_F32 826 V_ASHRREV_I16 401 V_CVT_F32_UBYTE0 828 V_LSHLREV_B64 402 V_CVT_F32_UBYTE1 829 V_LSHRREV_B64 403 V_CVT_F32_UBYTE2 830 V_ASHRREV_I64 404 V_CVT_F32_UBYTE3 864 V_READLANE_B32 405 V_CVT_U32_F64 865 V_WRITELANE_B32 406 V_CVT_F64_U32 866 V_AND_B16 407 V_TRUNC_F64 867 V_OR_B16 408 V_CEIL_F64 868 V_XOR_B16 409 V_RNDNE_F64 0 V_CMP_F_F16 410 V_FLOOR_F64 1 V_CMP_LT_F16 411 V_PIPEFLUSH 2 V_CMP_EQ_F16 412 V_MOV_B16 3 V_CMP_LE_F16 416 V_FRACT_F32 4 V_CMP_GT_F16 417 V_TRUNC_F32 5 V_CMP_LG_F16 418 V_CEIL_F32 6 V_CMP_GE_F16 419 V_RNDNE_F32 7 V_CMP_O_F16 420 V_FLOOR_F32 8 V_CMP_U_F16 421 V_EXP_F32 9 V_CMP_NGE_F16 423 V_LOG_F32 10 V_CMP_NLG_F16 426 V_RCP_F32 11 V_CMP_NGT_F16 427 V_RCP_IFLAG_F32 12 V_CMP_NLE_F16 430 V_RSQ_F32 13 V_CMP_NEQ_F16 431 V_RCP_F64 14 V_CMP_NLT_F16 433 V_RSQ_F64 15 V_CMP_T_F16 435 V_SQRT_F32 16 V_CMP_F_F32 436 V_SQRT_F64 17 V_CMP_LT_F32 437 V_SIN_F32 18 V_CMP_EQ_F32 438 V_COS_F32 19 V_CMP_LE_F32 439 V_NOT_B32 20 V_CMP_GT_F32 440 V_BFREV_B32 21 V_CMP_LG_F32 441 V_CLZ_I32_U32 22 V_CMP_GE_F32 442 V_CTZ_I32_B32 23 V_CMP_O_F32 443 V_CLS_I32 24 V_CMP_U_F32 444 V_FREXP_EXP_I32_F64 25 V_CMP_NGE_F32 445 V_FREXP_MANT_F64 26 V_CMP_NLG_F32 15.3. Vector ALU Formats 161 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 446 V_FRACT_F64 27 V_CMP_NGT_F32 447 V_FREXP_EXP_I32_F32 28 V_CMP_NLE_F32 448 V_FREXP_MANT_F32 29 V_CMP_NEQ_F32 450 V_MOVRELD_B32 30 V_CMP_NLT_F32 451 V_MOVRELS_B32 31 V_CMP_T_F32 452 V_MOVRELSD_B32 32 V_CMP_F_F64 456 V_MOVRELSD_2_B32 33 V_CMP_LT_F64 464 V_CVT_F16_U16 34 V_CMP_EQ_F64 465 V_CVT_F16_I16 35 V_CMP_LE_F64 466 V_CVT_U16_F16 36 V_CMP_GT_F64 467 V_CVT_I16_F16 37 V_CMP_LG_F64 468 V_RCP_F16 38 V_CMP_GE_F64 469 V_SQRT_F16 39 V_CMP_O_F64 470 V_RSQ_F16 40 V_CMP_U_F64 471 V_LOG_F16 41 V_CMP_NGE_F64 472 V_EXP_F16 42 V_CMP_NLG_F64 473 V_FREXP_MANT_F16 43 V_CMP_NGT_F64 474 V_FREXP_EXP_I16_F16 44 V_CMP_NLE_F64 475 V_FLOOR_F16 45 V_CMP_NEQ_F64 476 V_CEIL_F16 46 V_CMP_NLT_F64 477 V_TRUNC_F16 47 V_CMP_T_F64 478 V_RNDNE_F16 49 V_CMP_LT_I16 479 V_FRACT_F16 50 V_CMP_EQ_I16 480 V_SIN_F16 51 V_CMP_LE_I16 481 V_COS_F16 52 V_CMP_GT_I16 482 V_SAT_PK_U8_I16 53 V_CMP_NE_I16 483 V_CVT_NORM_I16_F16 54 V_CMP_GE_I16 484 V_CVT_NORM_U16_F16 57 V_CMP_LT_U16 489 V_NOT_B16 58 V_CMP_EQ_U16 490 V_CVT_I32_I16 59 V_CMP_LE_U16 491 V_CVT_U32_U16 60 V_CMP_GT_U16 257 V_CNDMASK_B32 61 V_CMP_NE_U16 259 V_ADD_F32 62 V_CMP_GE_U16 260 V_SUB_F32 64 V_CMP_F_I32 261 V_SUBREV_F32 65 V_CMP_LT_I32 262 V_FMAC_DX9_ZERO_F32 66 V_CMP_EQ_I32 263 V_MUL_DX9_ZERO_F32 67 V_CMP_LE_I32 264 V_MUL_F32 68 V_CMP_GT_I32 265 V_MUL_I32_I24 69 V_CMP_NE_I32 266 V_MUL_HI_I32_I24 70 V_CMP_GE_I32 267 V_MUL_U32_U24 71 V_CMP_T_I32 268 V_MUL_HI_U32_U24 72 V_CMP_F_U32 271 V_MIN_F32 73 V_CMP_LT_U32 272 V_MAX_F32 74 V_CMP_EQ_U32 273 V_MIN_I32 75 V_CMP_LE_U32 274 V_MAX_I32 76 V_CMP_GT_U32 275 V_MIN_U32 77 V_CMP_NE_U32 15.3. Vector ALU Formats 162 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 276 V_MAX_U32 78 V_CMP_GE_U32 280 V_LSHLREV_B32 79 V_CMP_T_U32 281 V_LSHRREV_B32 80 V_CMP_F_I64 282 V_ASHRREV_I32 81 V_CMP_LT_I64 283 V_AND_B32 82 V_CMP_EQ_I64 284 V_OR_B32 83 V_CMP_LE_I64 285 V_XOR_B32 84 V_CMP_GT_I64 286 V_XNOR_B32 85 V_CMP_NE_I64 293 V_ADD_NC_U32 86 V_CMP_GE_I64 294 V_SUB_NC_U32 87 V_CMP_T_I64 295 V_SUBREV_NC_U32 88 V_CMP_F_U64 299 V_FMAC_F32 89 V_CMP_LT_U64 303 V_CVT_PK_RTZ_F16_F32 90 V_CMP_EQ_U64 306 V_ADD_F16 91 V_CMP_LE_U64 307 V_SUB_F16 92 V_CMP_GT_U64 308 V_SUBREV_F16 93 V_CMP_NE_U64 309 V_MUL_F16 94 V_CMP_GE_U64 310 V_FMAC_F16 95 V_CMP_T_U64 313 V_MAX_F16 125 V_CMP_CLASS_F16 314 V_MIN_F16 126 V_CMP_CLASS_F32 315 V_LDEXP_F16 127 V_CMP_CLASS_F64 521 V_FMA_DX9_ZERO_F32 128 V_CMPX_F_F16 522 V_MAD_I32_I24 129 V_CMPX_LT_F16 523 V_MAD_U32_U24 130 V_CMPX_EQ_F16 524 V_CUBEID_F32 131 V_CMPX_LE_F16 525 V_CUBESC_F32 132 V_CMPX_GT_F16 526 V_CUBETC_F32 133 V_CMPX_LG_F16 527 V_CUBEMA_F32 134 V_CMPX_GE_F16 528 V_BFE_U32 135 V_CMPX_O_F16 529 V_BFE_I32 136 V_CMPX_U_F16 530 V_BFI_B32 137 V_CMPX_NGE_F16 531 V_FMA_F32 138 V_CMPX_NLG_F16 532 V_FMA_F64 139 V_CMPX_NGT_F16 533 V_LERP_U8 140 V_CMPX_NLE_F16 534 V_ALIGNBIT_B32 141 V_CMPX_NEQ_F16 535 V_ALIGNBYTE_B32 142 V_CMPX_NLT_F16 536 V_MULLIT_F32 143 V_CMPX_T_F16 537 V_MIN3_F32 144 V_CMPX_F_F32 538 V_MIN3_I32 145 V_CMPX_LT_F32 539 V_MIN3_U32 146 V_CMPX_EQ_F32 540 V_MAX3_F32 147 V_CMPX_LE_F32 541 V_MAX3_I32 148 V_CMPX_GT_F32 542 V_MAX3_U32 149 V_CMPX_LG_F32 543 V_MED3_F32 150 V_CMPX_GE_F32 544 V_MED3_I32 151 V_CMPX_O_F32 545 V_MED3_U32 152 V_CMPX_U_F32 546 V_SAD_U8 153 V_CMPX_NGE_F32 15.3. Vector ALU Formats 163 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 547 V_SAD_HI_U8 154 V_CMPX_NLG_F32 548 V_SAD_U16 155 V_CMPX_NGT_F32 549 V_SAD_U32 156 V_CMPX_NLE_F32 550 V_CVT_PK_U8_F32 157 V_CMPX_NEQ_F32 551 V_DIV_FIXUP_F32 158 V_CMPX_NLT_F32 552 V_DIV_FIXUP_F64 159 V_CMPX_T_F32 567 V_DIV_FMAS_F32 160 V_CMPX_F_F64 568 V_DIV_FMAS_F64 161 V_CMPX_LT_F64 569 V_MSAD_U8 162 V_CMPX_EQ_F64 570 V_QSAD_PK_U16_U8 163 V_CMPX_LE_F64 571 V_MQSAD_PK_U16_U8 164 V_CMPX_GT_F64 573 V_MQSAD_U32_U8 165 V_CMPX_LG_F64 576 V_XOR3_B32 166 V_CMPX_GE_F64 577 V_MAD_U16 167 V_CMPX_O_F64 580 V_PERM_B32 168 V_CMPX_U_F64 581 V_XAD_U32 169 V_CMPX_NGE_F64 582 V_LSHL_ADD_U32 170 V_CMPX_NLG_F64 583 V_ADD_LSHL_U32 171 V_CMPX_NGT_F64 584 V_FMA_F16 172 V_CMPX_NLE_F64 585 V_MIN3_F16 173 V_CMPX_NEQ_F64 586 V_MIN3_I16 174 V_CMPX_NLT_F64 587 V_MIN3_U16 175 V_CMPX_T_F64 588 V_MAX3_F16 177 V_CMPX_LT_I16 589 V_MAX3_I16 178 V_CMPX_EQ_I16 590 V_MAX3_U16 179 V_CMPX_LE_I16 591 V_MED3_F16 180 V_CMPX_GT_I16 592 V_MED3_I16 181 V_CMPX_NE_I16 593 V_MED3_U16 182 V_CMPX_GE_I16 595 V_MAD_I16 185 V_CMPX_LT_U16 596 V_DIV_FIXUP_F16 186 V_CMPX_EQ_U16 597 V_ADD3_U32 187 V_CMPX_LE_U16 598 V_LSHL_OR_B32 188 V_CMPX_GT_U16 599 V_AND_OR_B32 189 V_CMPX_NE_U16 600 V_OR3_B32 190 V_CMPX_GE_U16 601 V_MAD_U32_U16 192 V_CMPX_F_I32 602 V_MAD_I32_I16 193 V_CMPX_LT_I32 603 V_PERMLANE16_B32 194 V_CMPX_EQ_I32 604 V_PERMLANEX16_B32 195 V_CMPX_LE_I32 605 V_CNDMASK_B16 196 V_CMPX_GT_I32 606 V_MAXMIN_F32 197 V_CMPX_NE_I32 607 V_MINMAX_F32 198 V_CMPX_GE_I32 608 V_MAXMIN_F16 199 V_CMPX_T_I32 609 V_MINMAX_F16 200 V_CMPX_F_U32 610 V_MAXMIN_U32 201 V_CMPX_LT_U32 611 V_MINMAX_U32 202 V_CMPX_EQ_U32 612 V_MAXMIN_I32 203 V_CMPX_LE_U32 613 V_MINMAX_I32 204 V_CMPX_GT_U32 15.3. Vector ALU Formats 164 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 614 V_DOT2_F16_F16 205 V_CMPX_NE_U32 615 V_DOT2_BF16_BF16 206 V_CMPX_GE_U32 771 V_ADD_NC_U16 207 V_CMPX_T_U32 772 V_SUB_NC_U16 208 V_CMPX_F_I64 773 V_MUL_LO_U16 209 V_CMPX_LT_I64 774 V_CVT_PK_I16_F32 210 V_CMPX_EQ_I64 775 V_CVT_PK_U16_F32 211 V_CMPX_LE_I64 777 V_MAX_U16 212 V_CMPX_GT_I64 778 V_MAX_I16 213 V_CMPX_NE_I64 779 V_MIN_U16 214 V_CMPX_GE_I64 780 V_MIN_I16 215 V_CMPX_T_I64 781 V_ADD_NC_I16 216 V_CMPX_F_U64 782 V_SUB_NC_I16 217 V_CMPX_LT_U64 785 V_PACK_B32_F16 218 V_CMPX_EQ_U64 786 V_CVT_PK_NORM_I16_F16 219 V_CMPX_LE_U64 787 V_CVT_PK_NORM_U16_F16 220 V_CMPX_GT_U64 796 V_LDEXP_F32 221 V_CMPX_NE_U64 797 V_BFM_B32 222 V_CMPX_GE_U64 798 V_BCNT_U32_B32 223 V_CMPX_T_U64 799 V_MBCNT_LO_U32_B32 253 V_CMPX_CLASS_F16 800 V_MBCNT_HI_U32_B32 254 V_CMPX_CLASS_F32 801 V_CVT_PK_NORM_I16_F32 255 V_CMPX_CLASS_F64 802 V_CVT_PK_NORM_U16_F32 15.3.5. VOP3SD Description Vector ALU format with three operands and a scalar result. This encoding is used only for a few opcodes. Can be followed by a 32-bit literal constant or DPP instruction DWORD when the instruction allows it. This encoding allows specifying a unique scalar destination, and is used only for the opcodes listed below. All other opcodes use VOP3. Table 86. VOP3SD Fields Field Name Bits Format or Description VDST [7:0] Destination VGPR SDST [14:8] Scalar destination CLMP [15] Clamp result OP [25:16] Opcode. see next table. ENCODING [31:26] 'b110101 15.3. Vector ALU Formats 165 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SRC0 [40:32] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 SRC1 [49:41] Second input operand. Same options as SRC0. SRC2 [58:50] Third input operand. Same options as SRC0. OMOD [60:59] Output Modifier: 0=none, 1=*2, 2=*4, 3=*0.5 NEG [63:61] Negate input. [61] = src0, [62] = src1, [63] = src2 Table 87. VOP3SD Opcodes Opcode # Name Opcode # Name 288 V_ADD_CO_CI_U32 766 V_MAD_U64_U32 289 V_SUB_CO_CI_U32 767 V_MAD_I64_I32 290 V_SUBREV_CO_CI_U32 768 V_ADD_CO_U32 764 V_DIV_SCALE_F32 769 V_SUB_CO_U32 765 V_DIV_SCALE_F64 770 V_SUBREV_CO_U32 15.3.6. VOP3P 15.3. Vector ALU Formats 166 of 597 "RDNA3" Instruction Set Architecture Description Vector ALU format taking one, two or three pairs of 16 bit inputs and producing two 16-bit outputs (packed into 1 DWORD). WMMA instructions have larger input and output VGPR sets. Can be followed by a 32-bit literal constant or DPP instruction DWORD when the instruction allows it. Table 88. VOP3P Fields Field Name Bits Format or Description VDST [7:0] Destination VGPR NEG_HI [10:8] Negate sources 0,1,2 of the high 16-bits. OPSEL [13:11] Select low or high for low sources 0=[11], 1=[12], 2=[13]. OPSEL_HI2 [14] Select low or high for high sources 0=[14], 1=[60], 2=[59]. CLMP [15] 1 = clamp result. OP [22:16] Opcode. see next table. ENCODING [31:26] 'b11001100 SRC0 [40:32] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 SRC1 [49:41] Second input operand. Same options as SRC0. SRC2 [58:50] Third input operand. Same options as SRC0. 15.3. Vector ALU Formats 167 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description OPSEL_HI [60:59] See OPSEL_HI2. NEG [63:61] Negate input for low 16-bits of sources. [61] = src0, [62] = src1, [63] = src2 Table 89. VOP3P Opcodes Opcode # Name Opcode # Name 0 V_PK_MAD_I16 17 V_PK_MIN_F16 1 V_PK_MUL_LO_U16 18 V_PK_MAX_F16 2 V_PK_ADD_I16 19 V_DOT2_F32_F16 3 V_PK_SUB_I16 22 V_DOT4_I32_IU8 4 V_PK_LSHLREV_B16 23 V_DOT4_U32_U8 5 V_PK_LSHRREV_B16 24 V_DOT8_I32_IU4 6 V_PK_ASHRREV_I16 25 V_DOT8_U32_U4 7 V_PK_MAX_I16 26 V_DOT2_F32_BF16 8 V_PK_MIN_I16 32 V_FMA_MIX_F32 9 V_PK_MAD_U16 33 V_FMA_MIXLO_F16 10 V_PK_ADD_U16 34 V_FMA_MIXHI_F16 11 V_PK_SUB_U16 64 V_WMMA_F32_16X16X16_F16 12 V_PK_MAX_U16 65 V_WMMA_F32_16X16X16_BF16 13 V_PK_MIN_U16 66 V_WMMA_F16_16X16X16_F16 14 V_PK_FMA_F16 67 V_WMMA_BF16_16X16X16_BF16 15 V_PK_ADD_F16 68 V_WMMA_I32_16X16X16_IU8 16 V_PK_MUL_F16 69 V_WMMA_I32_16X16X16_IU4 15.3.7. VOPD Description Vector ALU format describing two instructions to be executed in parallel. Can be followed by a 32-bit literal constant, but not a DPP control DWORD. This instruction format describe two opcodes: X and Y. Table 90. VOPD Fields 15.3. Vector ALU Formats 168 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SRCX0 [8:0] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0 for opcode X. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 VSRCX1 [16:9] Source VGPR 1 for opcode X. OPY [21:17] Opcode Y. see next table. OPX [25:22] Opcode X. see next table. ENCODING [31:26] 'b110010 SRCY0 [40:32] Source 0 for opcode Y. See SRCX0 for enumerations VSRCY1 [48:41] Source VGPR 1 for opcode Y. VDSTY [55:49] Instruction Y destination VGPR, excluding LSB. LSB is the opposite of VDSTX[0]. VDSTX [63:56] Instruction X destination VGPR Table 91. VOPD X-Opcodes 0 V_DUAL_FMAC_F32 7 V_DUAL_MUL_DX9_ZERO_F32 1 V_DUAL_FMAAK_F32 8 V_DUAL_MOV_B32 2 V_DUAL_FMAMK_F32 9 V_DUAL_CNDMASK_B32 3 V_DUAL_MUL_F32 10 V_DUAL_MAX_F32 4 V_DUAL_ADD_F32 11 V_DUAL_MIN_F32 5 V_DUAL_SUB_F32 12 V_DUAL_DOT2ACC_F32_F16 6 V_DUAL_SUBREV_F32 13 V_DUAL_DOT2ACC_F32_BF16 15.3. Vector ALU Formats 169 of 597 "RDNA3" Instruction Set Architecture Table 92. VOPD Y-Opcodes 0 V_DUAL_FMAC_F32 9 1 V_DUAL_FMAAK_F32 10 V_DUAL_MAX_F32 V_DUAL_CNDMASK_B32 2 V_DUAL_FMAMK_F32 11 V_DUAL_MIN_F32 3 V_DUAL_MUL_F32 12 V_DUAL_DOT2ACC_F32_F16 4 V_DUAL_ADD_F32 13 V_DUAL_DOT2ACC_F32_BF16 5 V_DUAL_SUB_F32 16 V_DUAL_ADD_NC_U32 6 V_DUAL_SUBREV_F32 17 V_DUAL_LSHLREV_B32 7 V_DUAL_MUL_DX9_ZERO_F32 18 V_DUAL_AND_B32 8 V_DUAL_MOV_B32 15.3.8. DPP16 Description Data Parallel Primitives over 16 lanes. This is an additional DWORD that can follow VOP1, VOP2, VOPC, VOP3 or VOP3P instructions (in place of a literal constant) to control selection of data from other lanes. Table 93. DPP16 Fields Field Name Bits Format or Description SRC0 [39:32] Real SRC0 operand (VGPR). DPP_CTRL [48:40] See next table: "DPP_CTRL Enumeration" FI [50] Fetch invalid data: 0 = read zero for any inactive lanes; 1 = read VGPRs even for invalid lanes. BC [51] Bounds Control: 0 = do not write when source is out of range, 1 = write. SRC0_NEG [52] 1 = negate source 0. SRC0_ABS [53] 1 = Absolute value of source 0. SRC1_NEG [54] 1 = negate source 1. SRC1_ABS [55] 1 = Absolute value of source 1. BANK_MASK [59:56] Bank Mask Applies to the VGPR destination write only, does not impact the thread mask when fetching source VGPR data. 27==0: lanes[12:15, 28:31, 44:47, 60:63] are disabled 26==0: lanes[8:11, 24:27, 40:43, 56:59] are disabled 25==0: lanes[4:7, 20:23, 36:39, 52:55] are disabled 24==0: lanes[0:3, 16:19, 32:35, 48:51] are disabled Notice: the term "bank" here is not the same as was used for the VGPR bank. ROW_MASK [63:60] Row Mask Applies to the VGPR destination write only, does not impact the thread mask when fetching source VGPR data. 31==0: lanes[63:48] are disabled (wave 64 only) 30==0: lanes[47:32] are disabled (wave 64 only) 29==0: lanes[31:16] are disabled 28==0: lanes[15:0] are disabled Table 94. DPP_CTRL Enumeration 15.3. Vector ALU Formats 170 of 597 "RDNA3" Instruction Set Architecture DPP_Cntl Enumeration Hex Function Value Description DPP_QUAD_PE 000RM* 0FF pix[n].srca = pix[(n&0x3c)+ dpp_cntl[n%4*2+1 : n%4*2]].srca Permute of four threads. DPP_UNUSED Undefined Reserved. DPP_ROW_SL* 10110F 100 if ((n&0xf) < (16-cntl[3:0])) pix[n].srca = pix[n+ cntl[3:0]].srca else use bound_cntl Row shift left by 1-15 threads. DPP_ROW_SR* 11111F if ((n&0xf) >= cntl[3:0]) pix[n].srca = pix[n cntl[3:0]].srca else use bound_cntl Row shift right by 1-15 threads. DPP_ROW_RR* 12112F if ((n&0xf) >= cnt[3:0]) pix[n].srca = pix[n cntl[3:0]].srca else pix[n].srca = pix[n + 16 cntl[3:0]].srca Row rotate right by 1-15 threads. DPP_ROW_MIR 140 ROR* pix[n].srca = pix[15-(n&f)].srca Mirror threads within row. DPP_ROW_HA 141 LF_MIRROR* pix[n].srca = pix[7-(n&7)].srca Mirror threads within row (8 threads). DPP_ROW_SHA 150RE* 15F lanesel = DPP_CTRL & 0xf; lane[n].src0 = lane[(n & 0x30) + lanesel].src0. Select one lane within each row and share the result with all lanes in the row. DPP_ROW_XM 160ASK* 16F lane[n].src0 = lane[(n & 0x30) + ((n & 0xf) ^ mask)].src0. Fetch lane ID is the current lane ID XOR’d with a mask specified by DPP_CTRL[3:0]. 15.3.9. DPP8 Description Data Parallel Primitives over 8 lanes. This is a second DWORD that can follow VOP1, VOP2, VOPC, VOP3 or VOP3P instructions (in place of a literal constant) to control selection of data from other lanes. Table 95. DPP8 Fields Field Name Bits Format or Description SRC0 [39:32] Real SRC0 operand (VGPR). LANE_SEL0 [42:40] Which lane to read for 1st output lane per 8-lane group LANE_SEL1 [45:43] Which lane to read for 2nd output lane per 8-lane group LANE_SEL2 [48:46] Which lane to read for 3rd output lane per 8-lane group LANE_SEL3 [51:49] Which lane to read for 4th output lane per 8-lane group LANE_SEL4 [54:52] Which lane to read for 5th output lane per 8-lane group LANE_SEL5 [57:55] Which lane to read for 6th output lane per 8-lane group LANE_SEL6 [60:58] Which lane to read for 7th output lane per 8-lane group LANE_SEL7 [63:61] Which lane to read for 8th output lane per 8-lane group 15.3. Vector ALU Formats 171 of 597 "RDNA3" Instruction Set Architecture 15.4. Vector Parameter Interpolation Format 15.4.1. VINTERP Description Vector Parameter Interpolation. These opcodes perform parameter interpolation using vertex data in pixel shaders. Table 96. VINTERP Fields Field Name Bits Format or Description VDST [7:0] Destination VGPR WAITEXP [10:8] Wait for EXPcnt to be less-than or equal-to this value before issuing instruction. OPSEL [14:11] Select low or high for low sources 0=[11], 1=[12], 2=[13], dst=[14]. CLMP [15] 1 = clamp result. OP [22:16] Opcode. see next table. ENCODING [31:26] 'b11001101 SRC0 [40:32] Source 0. First operand for the instruction: VGPR 0-255. 15.4. Vector Parameter Interpolation Format 172 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SRC0 [40:32] 0-105 106 107 108-123 124 125 126 127 128 129-192 193-208 209-232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 250 253 254 255 256 - 511 Source 0. First operand for the instruction. SGPR0 - SGPR105: Scalar general-purpose registers. VCC_LO: VCC[31:0]. VCC_HI: VCC[63:32]. TTMP0 - TTMP15: Trap handler temporary register. NULL M0. Misc register 0. EXEC_LO: EXEC[31:0]. EXEC_HI: EXEC[63:32]. 0. Signed integer 1 to 64. Signed integer -1 to -16. Reserved. DPP8 DPP8FI SHARED_BASE (Memory Aperture definition). SHARED_LIMIT (Memory Aperture definition). PRIVATE_BASE (Memory Aperture definition). PRIVATE_LIMIT (Memory Aperture definition). Reserved. 0.5. -0.5. 1.0. -1.0. 2.0. -2.0. 4.0. -4.0. 1/(2*PI). DPP16 SCC. Reserved. Literal constant. VGPR 0 - 255 SRC1 [49:41] Second input operand. Same options as SRC0. SRC2 [58:50] Third input operand. Same options as SRC0. NEG [63:61] Negate input for low 16-bits of sources. [61] = src0, [62] = src1, [63] = src2 Table 97. VINTERP Opcodes Opcode # Name Opcode # Name 0 V_INTERP_P10_F32 3 V_INTERP_P2_F16_F32 1 V_INTERP_P2_F32 4 V_INTERP_P10_RTZ_F16_F32 2 V_INTERP_P10_F16_F32 5 V_INTERP_P2_RTZ_F16_F32 15.5. Parameter and Direct Load from LDS 15.5.1. LDSDIR 15.5. Parameter and Direct Load from LDS 173 of 597 "RDNA3" Instruction Set Architecture Description LDS Direct and Parameter Load. These opcodes read either pixel parameter data or individual DWORDs from LDS into VGPRs. Table 98. LDSDIR Fields Field Name Bits Format or Description VDST [7:0] Destination VGPR ATTR_CHAN [9:8] Attribute channel: 0=X, 1=Y, 2=Z, 3=W ATTR [15:10] Attribute number: 0 - 32. WAIT_VA [19:16] Wait for previous VALU instructions to complete to resolve data dependency. Value is the max number of VALU ops still outstanding when issuing this instruction. OP [21:20] Opcode: 0: LDS_DIRECT_LOAD 1: LDS_PARAM_LOAD 2, 3: Reserved. ENCODING [31:24] 'b11001110 15.5. Parameter and Direct Load from LDS 174 of 597 "RDNA3" Instruction Set Architecture 15.6. LDS and GDS Format 15.6.1. DS Description Local and Global Data Sharing instructions Table 99. DS Fields Field Name Bits Format or Description OFFSET0 [7:0] First address offset OFFSET1 [15:8] Second address offset. For some opcodes this is concatenated with OFFSET0. GDS [17] 1=GDS, 0=LDS operation. OP [25:18] See Opcode table below. ENCODING [31:26] 'b110110 ADDR [39:32] VGPR that supplies the address. DATA0 [47:40] First data VGPR. DATA1 [55:48] Second data VGPR. VDST [63:56] Destination VGPR when results returned to VGPRs. Table 100. DS Opcodes Opcode # Name Opcode # Name 0 DS_ADD_U32 65 DS_SUB_U64 1 DS_SUB_U32 66 DS_RSUB_U64 2 DS_RSUB_U32 67 DS_INC_U64 3 DS_INC_U32 68 DS_DEC_U64 4 DS_DEC_U32 69 DS_MIN_I64 5 DS_MIN_I32 70 DS_MAX_I64 6 DS_MAX_I32 71 DS_MIN_U64 7 DS_MIN_U32 72 DS_MAX_U64 8 DS_MAX_U32 73 DS_AND_B64 9 DS_AND_B32 74 DS_OR_B64 10 DS_OR_B32 75 DS_XOR_B64 11 DS_XOR_B32 76 DS_MSKOR_B64 12 DS_MSKOR_B32 77 DS_STORE_B64 13 DS_STORE_B32 78 DS_STORE_2ADDR_B64 14 DS_STORE_2ADDR_B32 79 DS_STORE_2ADDR_STRIDE64_B64 15 DS_STORE_2ADDR_STRIDE64_B32 80 DS_CMPSTORE_B64 16 DS_CMPSTORE_B32 81 DS_CMPSTORE_F64 17 DS_CMPSTORE_F32 82 DS_MIN_F64 18 DS_MIN_F32 83 DS_MAX_F64 19 DS_MAX_F32 96 DS_ADD_RTN_U64 20 DS_NOP 97 DS_SUB_RTN_U64 21 DS_ADD_F32 98 DS_RSUB_RTN_U64 24 Reserved 99 DS_INC_RTN_U64 15.6. LDS and GDS Format 175 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 25 Reserved 100 DS_DEC_RTN_U64 26 Reserved 101 DS_MIN_RTN_I64 27 Reserved 102 DS_MAX_RTN_I64 28 Reserved 103 DS_MIN_RTN_U64 29 Reserved 104 DS_MAX_RTN_U64 30 DS_STORE_B8 105 DS_AND_RTN_B64 31 DS_STORE_B16 106 DS_OR_RTN_B64 32 DS_ADD_RTN_U32 107 DS_XOR_RTN_B64 33 DS_SUB_RTN_U32 108 DS_MSKOR_RTN_B64 34 DS_RSUB_RTN_U32 109 DS_STOREXCHG_RTN_B64 35 DS_INC_RTN_U32 110 DS_STOREXCHG_2ADDR_RTN_B64 36 DS_DEC_RTN_U32 111 DS_STOREXCHG_2ADDR_STRIDE64_RTN_B64 37 DS_MIN_RTN_I32 112 DS_CMPSTORE_RTN_B64 38 DS_MAX_RTN_I32 113 DS_CMPSTORE_RTN_F64 39 DS_MIN_RTN_U32 114 DS_MIN_RTN_F64 40 DS_MAX_RTN_U32 115 DS_MAX_RTN_F64 41 DS_AND_RTN_B32 118 DS_LOAD_B64 42 DS_OR_RTN_B32 119 DS_LOAD_2ADDR_B64 43 DS_XOR_RTN_B32 120 DS_LOAD_2ADDR_STRIDE64_B64 44 DS_MSKOR_RTN_B32 121 DS_ADD_RTN_F32 45 DS_STOREXCHG_RTN_B32 122 DS_ADD_GS_REG_RTN 46 DS_STOREXCHG_2ADDR_RTN_B32 123 DS_SUB_GS_REG_RTN 47 DS_STOREXCHG_2ADDR_STRIDE64_RTN_B32 126 DS_CONDXCHG32_RTN_B64 48 DS_CMPSTORE_RTN_B32 160 DS_STORE_B8_D16_HI 49 DS_CMPSTORE_RTN_F32 161 DS_STORE_B16_D16_HI 50 DS_MIN_RTN_F32 162 DS_LOAD_U8_D16 51 DS_MAX_RTN_F32 163 DS_LOAD_U8_D16_HI 52 DS_WRAP_RTN_B32 164 DS_LOAD_I8_D16 53 DS_SWIZZLE_B32 165 DS_LOAD_I8_D16_HI 54 DS_LOAD_B32 166 DS_LOAD_U16_D16 55 DS_LOAD_2ADDR_B32 167 DS_LOAD_U16_D16_HI 56 DS_LOAD_2ADDR_STRIDE64_B32 173 DS_BVH_STACK_RTN_B32 57 DS_LOAD_I8 176 DS_STORE_ADDTID_B32 58 DS_LOAD_U8 177 DS_LOAD_ADDTID_B32 59 DS_LOAD_I16 178 DS_PERMUTE_B32 60 DS_LOAD_U16 179 DS_BPERMUTE_B32 61 DS_CONSUME 222 DS_STORE_B96 62 DS_APPEND 223 DS_STORE_B128 63 DS_ORDERED_COUNT 254 DS_LOAD_B96 64 DS_ADD_U64 255 DS_LOAD_B128 15.6. LDS and GDS Format 176 of 597 "RDNA3" Instruction Set Architecture 15.7. Vector Memory Buffer Formats There are two memory buffer instruction formats: MTBUF typed buffer access (data type is defined by the instruction) MUBUF untyped buffer access (data type is defined by the buffer / resource-constant) 15.7.1. MTBUF Description Memory Typed-Buffer Instructions Table 101. MTBUF Fields Field Name Bits Format or Description OFFSET [11:0] Address offset, unsigned byte. SLC [12] System Level Coherent. Used in conjunction with DLC to determine L2 cache policies. DLC [13] 0 = normal, 1 = Device Coherent GLC [14] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op value to VGPR. OP [18:15] Opcode. See table below. FORMAT [25:19] Data Format of data in memory buffer. See Buffer Image format Table ENCODING [31:26] 'b111010 VADDR [39:32] Address of VGPR to supply first component of address (offset or index). When both index and offset are used, index is in the first VGPR and offset in the second. VDATA [47:40] Address of VGPR to supply first component of write data or receive first component of read-data. SRSRC [52:48] SGPR to supply V# (resource constant) in 4 or 8 consecutive SGPRs. It is missing 2 LSB’s of SGPR-address since it is aligned to 4 SGPRs. TFE [53] Partially resident texture, texture fault enable. OFFEN [54] 1 = enable offset VGPR, 0 = use zero for address offset IDXEN [55] 1 = enable index VGPR, 0 = use zero for address index SOFFSET [63:56] Address offset, unsigned byte. Table 102. MTBUF Opcodes Opcode # Name Opcode # Name 0 TBUFFER_LOAD_FORMAT_X 8 TBUFFER_LOAD_D16_FORMAT_X 1 TBUFFER_LOAD_FORMAT_XY 9 TBUFFER_LOAD_D16_FORMAT_XY 2 TBUFFER_LOAD_FORMAT_XYZ 10 TBUFFER_LOAD_D16_FORMAT_XYZ 3 TBUFFER_LOAD_FORMAT_XYZW 11 TBUFFER_LOAD_D16_FORMAT_XYZW 4 TBUFFER_STORE_FORMAT_X 12 TBUFFER_STORE_D16_FORMAT_X 5 TBUFFER_STORE_FORMAT_XY 13 TBUFFER_STORE_D16_FORMAT_XY 15.7. Vector Memory Buffer Formats 177 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 6 TBUFFER_STORE_FORMAT_XYZ 14 TBUFFER_STORE_D16_FORMAT_XYZ 7 TBUFFER_STORE_FORMAT_XYZW 15 TBUFFER_STORE_D16_FORMAT_XYZW 15.7.2. MUBUF Description Memory Untyped-Buffer Instructions Table 103. MUBUF Fields Field Name Bits Format or Description OFFSET [11:0] Address offset, unsigned byte. SLC [12] System Level Coherent. Used in conjunction with DLC to determine L2 cache policies. DLC [13] 0 = normal, 1 = Device Coherent GLC [14] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op value to VGPR. OP [25:18] Opcode. See table below. ENCODING [31:26] 'b111000 VADDR [39:32] Address of VGPR to supply first component of address (offset or index). When both index and offset are used, index is in the first VGPR and offset in the second. VDATA [47:40] Address of VGPR to supply first component of write data or receive first component of read-data. SRSRC [52:48] SGPR to supply V# (resource constant) in 4 or 8 consecutive SGPRs. It is missing 2 LSB’s of SGPR-address since it is aligned to 4 SGPRs. TFE [53] Partially resident texture, texture fault enable. OFFEN [54] 1 = enable offset VGPR, 0 = use zero for address offset IDXEN [55] 1 = enable index VGPR, 0 = use zero for address index SOFFSET [63:56] Address offset, unsigned byte. Table 104. MUBUF Opcodes Opcode # Name Opcode # Name 0 BUFFER_LOAD_FORMAT_X 43 BUFFER_GL0_INV 1 BUFFER_LOAD_FORMAT_XY 44 BUFFER_GL1_INV 2 BUFFER_LOAD_FORMAT_XYZ 45 BUFFER_LOAD_LDS_U8 3 BUFFER_LOAD_FORMAT_XYZW 46 BUFFER_LOAD_LDS_I8 4 BUFFER_STORE_FORMAT_X 47 BUFFER_LOAD_LDS_U16 5 BUFFER_STORE_FORMAT_XY 48 BUFFER_LOAD_LDS_I16 6 BUFFER_STORE_FORMAT_XYZ 49 BUFFER_LOAD_LDS_B32 7 BUFFER_STORE_FORMAT_XYZW 50 BUFFER_LOAD_LDS_FORMAT_X 8 BUFFER_LOAD_D16_FORMAT_X 51 BUFFER_ATOMIC_SWAP_B32 9 BUFFER_LOAD_D16_FORMAT_XY 52 BUFFER_ATOMIC_CMPSWAP_B32 10 BUFFER_LOAD_D16_FORMAT_XYZ 53 BUFFER_ATOMIC_ADD_U32 11 BUFFER_LOAD_D16_FORMAT_XYZW 54 BUFFER_ATOMIC_SUB_U32 12 BUFFER_STORE_D16_FORMAT_X 55 BUFFER_ATOMIC_CSUB_U32 15.7. Vector Memory Buffer Formats 178 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 13 BUFFER_STORE_D16_FORMAT_XY 56 BUFFER_ATOMIC_MIN_I32 14 BUFFER_STORE_D16_FORMAT_XYZ 57 BUFFER_ATOMIC_MIN_U32 15 BUFFER_STORE_D16_FORMAT_XYZW 58 BUFFER_ATOMIC_MAX_I32 16 BUFFER_LOAD_U8 59 BUFFER_ATOMIC_MAX_U32 17 BUFFER_LOAD_I8 60 BUFFER_ATOMIC_AND_B32 18 BUFFER_LOAD_U16 61 BUFFER_ATOMIC_OR_B32 19 BUFFER_LOAD_I16 62 BUFFER_ATOMIC_XOR_B32 20 BUFFER_LOAD_B32 63 BUFFER_ATOMIC_INC_U32 21 BUFFER_LOAD_B64 64 BUFFER_ATOMIC_DEC_U32 22 BUFFER_LOAD_B96 65 BUFFER_ATOMIC_SWAP_B64 23 BUFFER_LOAD_B128 66 BUFFER_ATOMIC_CMPSWAP_B64 24 BUFFER_STORE_B8 67 BUFFER_ATOMIC_ADD_U64 25 BUFFER_STORE_B16 68 BUFFER_ATOMIC_SUB_U64 26 BUFFER_STORE_B32 69 BUFFER_ATOMIC_MIN_I64 27 BUFFER_STORE_B64 70 BUFFER_ATOMIC_MIN_U64 28 BUFFER_STORE_B96 71 BUFFER_ATOMIC_MAX_I64 29 BUFFER_STORE_B128 72 BUFFER_ATOMIC_MAX_U64 30 BUFFER_LOAD_D16_U8 73 BUFFER_ATOMIC_AND_B64 31 BUFFER_LOAD_D16_I8 74 BUFFER_ATOMIC_OR_B64 32 BUFFER_LOAD_D16_B16 75 BUFFER_ATOMIC_XOR_B64 33 BUFFER_LOAD_D16_HI_U8 76 BUFFER_ATOMIC_INC_U64 34 BUFFER_LOAD_D16_HI_I8 77 BUFFER_ATOMIC_DEC_U64 35 BUFFER_LOAD_D16_HI_B16 80 BUFFER_ATOMIC_CMPSWAP_F32 36 BUFFER_STORE_D16_HI_B8 81 BUFFER_ATOMIC_MIN_F32 37 BUFFER_STORE_D16_HI_B16 82 BUFFER_ATOMIC_MAX_F32 38 BUFFER_LOAD_D16_HI_FORMAT_X 86 BUFFER_ATOMIC_ADD_F32 39 BUFFER_STORE_D16_HI_FORMAT_X 15.7. Vector Memory Buffer Formats 179 of 597 "RDNA3" Instruction Set Architecture 15.8. Vector Memory Image Format 15.8.1. MIMG Description Memory Image Instructions Memory Image instructions (MIMG format) can be between 2 and 3 DWORDs. There are two variations of the instruction: • Normal, where the address VGPRs are specified in the "ADDR" field, and are a contiguous set of VGPRs. This is a 2-DWORD instruction. • Non-Sequential-Address (NSA), where each address VGPR is specified individually and the address VGPRs can be scattered. This version uses 1 extra DWORD to specify the individual address VGPRs. Table 105. MIMG Fields Field Name Bits Format or Description NSA [0] Non-sequential address. Specifies that an additional instruction DWORD exists holding up to 4 unique VGPR addresses. DIM [4:2] Dimensionality of the resource constant. Set to bits [3:1] of the resource type field. UNRM [7] Force address to be un-normalized. User must set to 1 for Image stores & atomics. DMASK [11:8] Data VGPR enable mask: 1 .. 4 consecutive VGPRs Reads: defines which components are returned: 0=red,1=green,2=blue,3=alpha Writes: defines which components are written with data from VGPRs (missing components get 0). Enabled components come from consecutive VGPRs. E.G. dmask=1001 : Red is in VGPRn and alpha in VGPRn+1. For D16 writes, DMASK is only used as a word count: each bit represents 16 bits of data to be written starting at the LSB’s of VDATA, then MSBs, then VDATA+1 etc. Bit position is ignored. SLC [12] System Level Coherent. Used in conjunction with DLC to determine L2 cache policies. DLC [13] 0 = normal, 1 = Device Coherent GLC [14] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op value to VGPR. R128 [15] Resource constant size: 1 = 128bit, 0 = 256bit A16 [16] Address components are 16-bits (instead of the usual 32 bits). When set, all address components are 16 bits (packed into 2 per DWORD), except: Texel offsets (3 6bit UINT packed into 1 DWORD) PCF reference (for "_C" instructions) Address components are 16b uint for image ops without sampler; 16b float with sampler. D16 [17] Data components are 16-bits (instead of the usual 32 bits). OP [25:18] Opcode. See table below. ENCODING [31:26] 'b111100 15.8. Vector Memory Image Format 180 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description VADDR [39:32] Address of VGPR to supply first component of address. VDATA [47:40] Address of VGPR to supply first component of write data or receive first component of read-data. SRSRC [52:48] SGPR to supply V# (resource constant) in 4 or 8 consecutive SGPRs. It is missing 2 LSB’s of SGPR-address since it is aligned to 4 SGPRs. TFE [53] Partially resident texture, texture fault enable. LWE [54] LOD Warning Enable. When set to 1, a texture fetch may return "LOD_CLAMPED = 1". SSAMP [62:58] SGPR to supply V# (resource constant) in 4 or 8 consecutive SGPRs. It is missing 2 LSB’s of SGPR-address since it is aligned to 4 SGPRs. ADDR1 [71:64] Second Address register or group. Present only when NSA=1. ADDR2 [79:72] Third Address register or group. Present only when NSA=1. Table 106. MIMG Opcodes Opcode # Name Opcode # Name 0 IMAGE_LOAD 42 IMAGE_SAMPLE_C_O 1 IMAGE_LOAD_MIP 43 IMAGE_SAMPLE_C_D_O 2 IMAGE_LOAD_PCK 44 IMAGE_SAMPLE_C_L_O 3 IMAGE_LOAD_PCK_SGN 45 IMAGE_SAMPLE_C_B_O 4 IMAGE_LOAD_MIP_PCK 46 IMAGE_SAMPLE_C_LZ_O 5 IMAGE_LOAD_MIP_PCK_SGN 47 IMAGE_GATHER4 6 IMAGE_STORE 48 IMAGE_GATHER4_L 7 IMAGE_STORE_MIP 49 IMAGE_GATHER4_B 8 IMAGE_STORE_PCK 50 IMAGE_GATHER4_LZ 9 IMAGE_STORE_MIP_PCK 51 IMAGE_GATHER4_C 10 IMAGE_ATOMIC_SWAP 52 IMAGE_GATHER4_C_LZ 11 IMAGE_ATOMIC_CMPSWAP 53 IMAGE_GATHER4_O 12 IMAGE_ATOMIC_ADD 54 IMAGE_GATHER4_LZ_O 13 IMAGE_ATOMIC_SUB 55 IMAGE_GATHER4_C_LZ_O 14 IMAGE_ATOMIC_SMIN 56 IMAGE_GET_LOD 15 IMAGE_ATOMIC_UMIN 57 IMAGE_SAMPLE_D_G16 16 IMAGE_ATOMIC_SMAX 58 IMAGE_SAMPLE_C_D_G16 17 IMAGE_ATOMIC_UMAX 59 IMAGE_SAMPLE_D_O_G16 18 IMAGE_ATOMIC_AND 60 IMAGE_SAMPLE_C_D_O_G16 19 IMAGE_ATOMIC_OR 64 IMAGE_SAMPLE_CL 20 IMAGE_ATOMIC_XOR 65 IMAGE_SAMPLE_D_CL 21 IMAGE_ATOMIC_INC 66 IMAGE_SAMPLE_B_CL 22 IMAGE_ATOMIC_DEC 67 IMAGE_SAMPLE_C_CL 23 IMAGE_GET_RESINFO 68 IMAGE_SAMPLE_C_D_CL 24 IMAGE_MSAA_LOAD 69 IMAGE_SAMPLE_C_B_CL 25 IMAGE_BVH_INTERSECT_RAY 70 IMAGE_SAMPLE_CL_O 26 IMAGE_BVH64_INTERSECT_RAY 71 IMAGE_SAMPLE_D_CL_O 27 IMAGE_SAMPLE 72 IMAGE_SAMPLE_B_CL_O 28 IMAGE_SAMPLE_D 73 IMAGE_SAMPLE_C_CL_O 29 IMAGE_SAMPLE_L 74 IMAGE_SAMPLE_C_D_CL_O 30 IMAGE_SAMPLE_B 75 IMAGE_SAMPLE_C_B_CL_O 31 IMAGE_SAMPLE_LZ 84 IMAGE_SAMPLE_C_D_CL_G16 32 IMAGE_SAMPLE_C 85 IMAGE_SAMPLE_D_CL_O_G16 15.8. Vector Memory Image Format 181 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 33 IMAGE_SAMPLE_C_D 86 IMAGE_SAMPLE_C_D_CL_O_G16 34 IMAGE_SAMPLE_C_L 95 IMAGE_SAMPLE_D_CL_G16 35 IMAGE_SAMPLE_C_B 96 IMAGE_GATHER4_CL 36 IMAGE_SAMPLE_C_LZ 97 IMAGE_GATHER4_B_CL 37 IMAGE_SAMPLE_O 98 IMAGE_GATHER4_C_CL 38 IMAGE_SAMPLE_D_O 99 IMAGE_GATHER4_C_L 39 IMAGE_SAMPLE_L_O 100 IMAGE_GATHER4_C_B 40 IMAGE_SAMPLE_B_O 101 IMAGE_GATHER4_C_B_CL 41 IMAGE_SAMPLE_LZ_O 144 IMAGE_GATHER4H 15.8. Vector Memory Image Format 182 of 597 "RDNA3" Instruction Set Architecture 15.9. Flat Formats Flat memory instructions come in three versions: FLAT memory address (per work-item) may be in global memory, scratch (private) memory or shared memory (LDS) GLOBAL same as FLAT, but assumes all memory addresses are global memory. SCRATCH same as FLAT, but assumes all memory addresses are scratch (private) memory. The microcode format is identical for each, and only the value of the SEG (segment) field differs. 15.9.1. FLAT Description FLAT Memory Access Table 107. FLAT Fields Field Name Bits Format or Description OFFSET [12:0] Address offset Scratch, Global: 13-bit signed byte offset FLAT: 12-bit unsigned offset (MSB is ignored) DLC [13] 0 = normal, 1 = Device Coherent GLC [14] 0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op value to VGPR. SLC [15] System Level Coherent. Used in conjunction with DLC to determine L2 cache policies. SEG [17:16] Memory Segment (instruction type): 0 = flat, 1 = scratch, 2 = global. OP [24:18] Opcode. See tables below for FLAT, SCRATCH and GLOBAL opcodes. ENCODING [31:26] 'b110111 ADDR [39:32] VGPR that holds address or offset. For 64-bit addresses, ADDR has the LSBs and ADDR+1 has the MSBs. For offset a single VGPR has a 32 bit unsigned offset. For FLAT_*: specifies an address. For GLOBAL_* and SCRATCH_* when SADDR is NULL or 0x7f: specifies an address. For GLOBAL_* and SCRATCH_* when SADDR is not NULL or 0x7f: specifies an offset. DATA [47:40] VGPR that supplies data. SADDR [54:48] Scalar SGPR that provides an address of offset (unsigned). Set this field to NULL or 0x7f to disable use. Meaning of this field is different for Scratch and Global: FLAT: Unused Scratch: use an SGPR for the address instead of a VGPR Global: use the SGPR to provide a base address and the VGPR provides a 32-bit byte offset. 15.9. Flat Formats 183 of 597 "RDNA3" Instruction Set Architecture Field Name Bits Format or Description SVE [55] Scratch VGPR Enable. 1 = scratch address includes a VGPR to provide an offset; 0 = no VGPR used. VDST [63:56] Destination VGPR for data returned from memory to VGPRs. Table 108. FLAT Opcodes Opcode # Name Opcode # Name 16 FLAT_LOAD_U8 56 FLAT_ATOMIC_MIN_I32 17 FLAT_LOAD_I8 57 FLAT_ATOMIC_MIN_U32 18 FLAT_LOAD_U16 58 FLAT_ATOMIC_MAX_I32 19 FLAT_LOAD_I16 59 FLAT_ATOMIC_MAX_U32 20 FLAT_LOAD_B32 60 FLAT_ATOMIC_AND_B32 21 FLAT_LOAD_B64 61 FLAT_ATOMIC_OR_B32 22 FLAT_LOAD_B96 62 FLAT_ATOMIC_XOR_B32 23 FLAT_LOAD_B128 63 FLAT_ATOMIC_INC_U32 24 FLAT_STORE_B8 64 FLAT_ATOMIC_DEC_U32 25 FLAT_STORE_B16 65 FLAT_ATOMIC_SWAP_B64 26 FLAT_STORE_B32 66 FLAT_ATOMIC_CMPSWAP_B64 27 FLAT_STORE_B64 67 FLAT_ATOMIC_ADD_U64 28 FLAT_STORE_B96 68 FLAT_ATOMIC_SUB_U64 29 FLAT_STORE_B128 69 FLAT_ATOMIC_MIN_I64 30 FLAT_LOAD_D16_U8 70 FLAT_ATOMIC_MIN_U64 31 FLAT_LOAD_D16_I8 71 FLAT_ATOMIC_MAX_I64 32 FLAT_LOAD_D16_B16 72 FLAT_ATOMIC_MAX_U64 33 FLAT_LOAD_D16_HI_U8 73 FLAT_ATOMIC_AND_B64 34 FLAT_LOAD_D16_HI_I8 74 FLAT_ATOMIC_OR_B64 35 FLAT_LOAD_D16_HI_B16 75 FLAT_ATOMIC_XOR_B64 36 FLAT_STORE_D16_HI_B8 76 FLAT_ATOMIC_INC_U64 37 FLAT_STORE_D16_HI_B16 77 FLAT_ATOMIC_DEC_U64 51 FLAT_ATOMIC_SWAP_B32 80 FLAT_ATOMIC_CMPSWAP_F32 52 FLAT_ATOMIC_CMPSWAP_B32 81 FLAT_ATOMIC_MIN_F32 53 FLAT_ATOMIC_ADD_U32 82 FLAT_ATOMIC_MAX_F32 54 FLAT_ATOMIC_SUB_U32 86 FLAT_ATOMIC_ADD_F32 15.9.2. GLOBAL Table 109. GLOBAL Opcodes Opcode # Name Opcode # Name 16 GLOBAL_LOAD_U8 52 GLOBAL_ATOMIC_CMPSWAP_B32 17 GLOBAL_LOAD_I8 53 GLOBAL_ATOMIC_ADD_U32 18 GLOBAL_LOAD_U16 54 GLOBAL_ATOMIC_SUB_U32 19 GLOBAL_LOAD_I16 55 GLOBAL_ATOMIC_CSUB_U32 20 GLOBAL_LOAD_B32 56 GLOBAL_ATOMIC_MIN_I32 21 GLOBAL_LOAD_B64 57 GLOBAL_ATOMIC_MIN_U32 22 GLOBAL_LOAD_B96 58 GLOBAL_ATOMIC_MAX_I32 23 GLOBAL_LOAD_B128 59 GLOBAL_ATOMIC_MAX_U32 24 GLOBAL_STORE_B8 60 GLOBAL_ATOMIC_AND_B32 15.9. Flat Formats 184 of 597 "RDNA3" Instruction Set Architecture Opcode # Name Opcode # Name 25 GLOBAL_STORE_B16 61 GLOBAL_ATOMIC_OR_B32 26 GLOBAL_STORE_B32 62 GLOBAL_ATOMIC_XOR_B32 27 GLOBAL_STORE_B64 63 GLOBAL_ATOMIC_INC_U32 28 GLOBAL_STORE_B96 64 GLOBAL_ATOMIC_DEC_U32 29 GLOBAL_STORE_B128 65 GLOBAL_ATOMIC_SWAP_B64 30 GLOBAL_LOAD_D16_U8 66 GLOBAL_ATOMIC_CMPSWAP_B64 31 GLOBAL_LOAD_D16_I8 67 GLOBAL_ATOMIC_ADD_U64 32 GLOBAL_LOAD_D16_B16 68 GLOBAL_ATOMIC_SUB_U64 33 GLOBAL_LOAD_D16_HI_U8 69 GLOBAL_ATOMIC_MIN_I64 34 GLOBAL_LOAD_D16_HI_I8 70 GLOBAL_ATOMIC_MIN_U64 35 GLOBAL_LOAD_D16_HI_B16 71 GLOBAL_ATOMIC_MAX_I64 36 GLOBAL_STORE_D16_HI_B8 72 GLOBAL_ATOMIC_MAX_U64 37 GLOBAL_STORE_D16_HI_B16 73 GLOBAL_ATOMIC_AND_B64 40 GLOBAL_LOAD_ADDTID_B32 74 GLOBAL_ATOMIC_OR_B64 41 GLOBAL_STORE_ADDTID_B32 75 GLOBAL_ATOMIC_XOR_B64 42 GLOBAL_LOAD_LDS_ADDTID_B32 76 GLOBAL_ATOMIC_INC_U64 45 GLOBAL_LOAD_LDS_U8 77 GLOBAL_ATOMIC_DEC_U64 46 GLOBAL_LOAD_LDS_I8 80 GLOBAL_ATOMIC_CMPSWAP_F32 47 GLOBAL_LOAD_LDS_U16 81 GLOBAL_ATOMIC_MIN_F32 48 GLOBAL_LOAD_LDS_I16 82 GLOBAL_ATOMIC_MAX_F32 49 GLOBAL_LOAD_LDS_B32 86 GLOBAL_ATOMIC_ADD_F32 51 GLOBAL_ATOMIC_SWAP_B32 15.9.3. SCRATCH Table 110. SCRATCH Opcodes Opcode # Name Opcode # Name 16 SCRATCH_LOAD_U8 30 SCRATCH_LOAD_D16_U8 17 SCRATCH_LOAD_I8 31 SCRATCH_LOAD_D16_I8 18 SCRATCH_LOAD_U16 32 SCRATCH_LOAD_D16_B16 19 SCRATCH_LOAD_I16 33 SCRATCH_LOAD_D16_HI_U8 20 SCRATCH_LOAD_B32 34 SCRATCH_LOAD_D16_HI_I8 21 SCRATCH_LOAD_B64 35 SCRATCH_LOAD_D16_HI_B16 22 SCRATCH_LOAD_B96 36 SCRATCH_STORE_D16_HI_B8 23 SCRATCH_LOAD_B128 37 SCRATCH_STORE_D16_HI_B16 24 SCRATCH_STORE_B8 45 SCRATCH_LOAD_LDS_U8 25 SCRATCH_STORE_B16 46 SCRATCH_LOAD_LDS_I8 26 SCRATCH_STORE_B32 47 SCRATCH_LOAD_LDS_U16 27 SCRATCH_STORE_B64 48 SCRATCH_LOAD_LDS_I16 28 SCRATCH_STORE_B96 49 SCRATCH_LOAD_LDS_B32 29 SCRATCH_STORE_B128 15.9. Flat Formats 185 of 597 "RDNA3" Instruction Set Architecture 15.10. Export Format 15.10.1. EXP Description EXPORT instructions The export format has only a single opcode, "EXPORT". Table 111. EXP Fields Field Name Bits Format or Description EN [3:0] VGPR Enables: [0] enables VSRC0, … [3] enables VSRC3. TARGET [9:4] Export destination: 0..7 MRT 0..7 8 Z 12-16 Position 0-4 20 Primitive data 21 Dual Source Blend Left 22 Dual Source Blend Right DONE [11] Indicates that this is the last export from the shader. Used only for Position and Pixel/color data. ROW [13] Row to export ENCODING [31:26] 'b111110 VSRC0 [39:32] VGPR for source 0. VSRC1 [47:40] VGPR for source 1. VSRC2 [55:48] VGPR for source 2. VSRC3 [63:56] VGPR for source 3. 15.10. Export Format 186 of 597 "RDNA3" Instruction Set Architecture Chapter 16. Instructions This chapter lists, and provides descriptions for, all instructions in the RDNA3 Generation environment. Instructions are grouped according to their format. Note: Rounding and Denormal modes apply to all floating-point operations unless otherwise specified in the instruction description. 187 of 597 "RDNA3" Instruction Set Architecture 16.1. SOP2 Instructions Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction. S_ADD_U32 0 Add two unsigned integers with carry-out. tmp = 64'U(S0.u) + 64'U(S1.u); SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U; // unsigned overflow or carry-out for S_ADDC_U32. D0.u = tmp.u S_SUB_U32 1 Subtract the second unsigned integer from the first with carry-out. tmp = S0.u - S1.u; SCC = S1.u > S0.u ? 1'1U : 1'0U; // unsigned overflow or carry-out for S_SUBB_U32. D0.u = tmp.u S_ADD_I32 2 Add two signed integers with carry-out. tmp = S0.i + S1.i; SCC = ((S0.u[31] == S1.u[31]) && (S0.u[31] != tmp.u[31])); // signed overflow. D0.i = tmp.i Notes This opcode is not suitable for use with S_ADDC_U32 for implementing 64-bit operations. S_SUB_I32 3 Subtract the second signed integer from the first with carry-out. 16.1. SOP2 Instructions 188 of 597 "RDNA3" Instruction Set Architecture tmp = S0.i - S1.i; SCC = ((S0.u[31] != S1.u[31]) && (S0.u[31] != tmp.u[31])); // signed overflow. D0.i = tmp.i Notes This opcode is not suitable for use with S_SUBB_U32 for implementing 64-bit operations. S_ADDC_U32 4 Add two unsigned integers with carry-in and carry-out. tmp = 64'U(S0.u) + 64'U(S1.u) + SCC.u64; SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U; // unsigned overflow. D0.u = tmp.u S_SUBB_U32 5 Subtract the second unsigned integer from the first with carry-in and carry-out. tmp = S0.u - S1.u - SCC.u; SCC = 64'U(S1.u) + SCC.u64 > 64'U(S0.u) ? 1'1U : 1'0U; // unsigned overflow. D0.u = tmp.u S_ABSDIFF_I32 6 Compute the absolute value of difference between two values. D0.i = S0.i - S1.i; if D0.i < 0 then D0.i = -D0.i endif; SCC = D0.i != 0 Notes Functional examples: S_ABSDIFF_I32(0x00000002, 0x00000005) ⇒ 0x00000003 16.1. SOP2 Instructions 189 of 597 "RDNA3" Instruction Set Architecture S_ABSDIFF_I32(0xffffffff, 0x00000000) ⇒ 0x00000001 S_ABSDIFF_I32(0x80000000, 0x00000000) ⇒ 0x80000000 // Note: result is negative! S_ABSDIFF_I32(0x80000000, 0x00000001) ⇒ 0x7fffffff S_ABSDIFF_I32(0x80000000, 0xffffffff) ⇒ 0x7fffffff S_ABSDIFF_I32(0x80000000, 0xfffffffe) ⇒ 0x7ffffffe S_LSHL_B32 8 Logical shift left. D0.u = S0.u << S1.u[4 : 0].u; SCC = D0.u != 0U S_LSHL_B64 9 Logical shift left. D0.u64 = S0.u64 << S1.u[5 : 0].u; SCC = D0.u64 != 0ULL S_LSHR_B32 10 Logical shift right. D0.u = S0.u >> S1.u[4 : 0].u; SCC = D0.u != 0U S_LSHR_B64 11 Logical shift right. D0.u64 = S0.u64 >> S1.u[5 : 0].u; SCC = D0.u64 != 0ULL S_ASHR_I32 12 Arithmetic shift right (preserve sign bit). 16.1. SOP2 Instructions 190 of 597 "RDNA3" Instruction Set Architecture D0.i = 32'I(signext(S0.i) >> S1.u[4 : 0].u); SCC = D0.i != 0 S_ASHR_I64 13 Arithmetic shift right (preserve sign bit). D0.i64 = signext(S0.i64) >> S1.u[5 : 0].u; SCC = D0.i64 != 0LL S_LSHL1_ADD_U32 14 Logical shift left by 1 bit and then add. tmp = (64'U(S0.u) << 1U) + 64'U(S1.u); SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U; // unsigned overflow. D0.u = tmp.u S_LSHL2_ADD_U32 15 Logical shift left by 2 bits and then add. tmp = (64'U(S0.u) << 2U) + 64'U(S1.u); SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U; // unsigned overflow. D0.u = tmp.u S_LSHL3_ADD_U32 16 Logical shift left by 3 bits and then add. tmp = (64'U(S0.u) << 3U) + 64'U(S1.u); SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U; // unsigned overflow. D0.u = tmp.u 16.1. SOP2 Instructions 191 of 597 "RDNA3" Instruction Set Architecture S_LSHL4_ADD_U32 17 Logical shift left by 4 bits and then add. tmp = (64'U(S0.u) << 4U) + 64'U(S1.u); SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U; // unsigned overflow. D0.u = tmp.u S_MIN_I32 18 Minimum of two signed integers. SCC = S0.i < S1.i; D0.i = SCC ? S0.i : S1.i S_MIN_U32 19 Minimum of two unsigned integers. SCC = S0.u < S1.u; D0.u = SCC ? S0.u : S1.u S_MAX_I32 20 Maximum of two signed integers. SCC = S0.i > S1.i; D0.i = SCC ? S0.i : S1.i S_MAX_U32 21 Maximum of two unsigned integers. SCC = S0.u > S1.u; D0.u = SCC ? S0.u : S1.u 16.1. SOP2 Instructions 192 of 597 "RDNA3" Instruction Set Architecture S_AND_B32 22 Bitwise AND. D0.u = (S0.u & S1.u); SCC = D0.u != 0U S_AND_B64 23 Bitwise AND. D0.u64 = (S0.u64 & S1.u64); SCC = D0.u64 != 0ULL S_OR_B32 24 Bitwise OR. D0.u = (S0.u | S1.u); SCC = D0.u != 0U S_OR_B64 25 Bitwise OR. D0.u64 = (S0.u64 | S1.u64); SCC = D0.u64 != 0ULL S_XOR_B32 26 Bitwise XOR. D0.u = (S0.u ^ S1.u); SCC = D0.u != 0U S_XOR_B64 16.1. SOP2 Instructions 27 193 of 597 "RDNA3" Instruction Set Architecture Bitwise XOR. D0.u64 = (S0.u64 ^ S1.u64); SCC = D0.u64 != 0ULL S_NAND_B32 28 Bitwise NAND. D0.u = ~(S0.u & S1.u); SCC = D0.u != 0U S_NAND_B64 29 Bitwise NAND. D0.u64 = ~(S0.u64 & S1.u64); SCC = D0.u64 != 0ULL S_NOR_B32 30 Bitwise NOR. D0.u = ~(S0.u | S1.u); SCC = D0.u != 0U S_NOR_B64 31 Bitwise NOR. D0.u64 = ~(S0.u64 | S1.u64); SCC = D0.u64 != 0ULL S_XNOR_B32 32 Bitwise XNOR. 16.1. SOP2 Instructions 194 of 597 "RDNA3" Instruction Set Architecture D0.u = ~(S0.u ^ S1.u); SCC = D0.u != 0U S_XNOR_B64 33 Bitwise XNOR. D0.u64 = ~(S0.u64 ^ S1.u64); SCC = D0.u64 != 0ULL S_AND_NOT1_B32 34 Bitwise AND with negated second argument. D0.u = (S0.u & ~S1.u); SCC = D0.u != 0U S_AND_NOT1_B64 35 Bitwise AND with negated second argument. D0.u64 = (S0.u64 & ~S1.u64); SCC = D0.u64 != 0ULL S_OR_NOT1_B32 36 Bitwise OR with negated second argument. D0.u = (S0.u | ~S1.u); SCC = D0.u != 0U S_OR_NOT1_B64 37 Bitwise OR with negated second argument. 16.1. SOP2 Instructions 195 of 597 "RDNA3" Instruction Set Architecture D0.u64 = (S0.u64 | ~S1.u64); SCC = D0.u64 != 0ULL S_BFE_U32 38 Bitfield extract. Extract unsigned bitfield from first operand using field offset and field size encoded in second operand. D0.u = ((S0.u >> S1.u[4 : 0].u) & 32'U((1 << S1.u[22 : 16].u) - 1)); SCC = D0.u != 0U S_BFE_I32 39 Bitfield extract. Extract signed bitfield from first operand using field offset and field size encoded in second operand. tmp = ((S0.i >> S1.u[4 : 0].u) & ((1 << S1.u[22 : 16].u) - 1)); D0.i = 32'I(signextFromBit(tmp, S1.i[22 : 16].i)); SCC = D0.i != 0 S_BFE_U64 40 Bitfield extract. Extract unsigned bitfield from first operand using field offset and field size encoded in second operand. D0.u64 = ((S0.u64 >> S1.u[5 : 0].u) & 64'U((1U << S1.u[22 : 16].u) - 1U)); SCC = D0.u64 != 0ULL S_BFE_I64 41 Bitfield extract. Extract signed bitfield from first operand using field offset and field size encoded in second operand. tmp = ((S0.i64 >> S1.u[5 : 0].u) & 64'I((1 << S1.u[22 : 16].u) - 1)); D0.i64 = signextFromBit(tmp, S1.i[22 : 16].i64); SCC = D0.i64 != 0LL 16.1. SOP2 Instructions 196 of 597 "RDNA3" Instruction Set Architecture S_BFM_B32 42 Bitfield mask. D0.u = 32'U((1 << S0.u[4 : 0].u) - 1 << S1.u[4 : 0].u) S_BFM_B64 43 Bitfield mask. D0.u64 = (1ULL << S0.u[5 : 0].u) - 1ULL << S1.u[5 : 0].u S_MUL_I32 44 Multiply two signed integers. D0.i = S0.i * S1.i S_MUL_HI_U32 45 Multiply two unsigned integers and store the high 32 bits. D0.u = 32'U(64'U(S0.u) * 64'U(S1.u) >> 32U) S_MUL_HI_I32 46 Multiply two signed integers and store the high 32 bits. D0.i = 32'I(64'I(S0.i) * 64'I(S1.i) >> 32U) S_CSELECT_B32 48 Conditional select based on scalar condition code. 16.1. SOP2 Instructions 197 of 597 "RDNA3" Instruction Set Architecture D0.u = SCC ? S0.u : S1.u S_CSELECT_B64 49 Conditional select based on scalar condition code. D0.u64 = SCC ? S0.u64 : S1.u64 S_PACK_LL_B32_B16 50 Pack two short values into the destination. D0 = { S1[15 : 0].u16, S0[15 : 0].u16 } S_PACK_LH_B32_B16 51 Pack two short values into the destination. D0 = { S1[31 : 16].u16, S0[15 : 0].u16 } S_PACK_HH_B32_B16 52 Pack two short values into the destination. D0 = { S1[31 : 16].u16, S0[31 : 16].u16 } S_PACK_HL_B32_B16 53 Pack two short values into the destination. D0 = { S1[15 : 0].u16, S0[31 : 16].u16 } 16.1. SOP2 Instructions 198 of 597 "RDNA3" Instruction Set Architecture 16.2. SOPK Instructions Instructions in this format may not use a 32-bit literal constant that occurs immediately after the instruction. S_MOVK_I32 0 Sign extension from a 16-bit constant. D0.i = 32'I(signext(SIMM16.i16)) S_VERSION 1 Do nothing. This opcode is used to specify the microcode version for tools that interpret shader microcode. Argument is ignored by hardware. This opcode is not designed for inserting wait states as the next instruction may issue in the same cycle. Do not use this opcode to resolve wait state hazards, use S_NOP instead. This opcode may also be used to validate microcode is running with the correct compatibility settings in drivers and functional models that support multiple generations. We strongly encourage this opcode be included at the top of every shader block to simplify debug and catch configuration errors. This opcode must appear in the first 16 bytes of a block of shader code in order to be recognized by external tools and functional models. Avoid placing opcodes > 32 bits or encodings that are not available in all versions of the microcode before the S_VERSION opcode. If this opcode is absent then tools are allowed to make an educated guess of the microcode version using cues from the environment; the guess may be incorrect and lead to an invalid decode. It is highly recommended that this be the first opcode of a shader block except for trap handlers, where it should be the second opcode (allowing the first opcode to be a 32-bit branch to accommodate context switch). SIMM16[7:0] specifies the microcode version. SIMM16[15:8] must be set to zero. nop(); // Do nothing - for use by tools only S_CMOVK_I32 2 Conditional move with sign extension. if SCC then 16.2. SOPK Instructions 199 of 597 "RDNA3" Instruction Set Architecture D0.i = 32'I(signext(SIMM16.i16)) endif S_CMPK_EQ_I32 3 Argument is equal to constant. SCC = 64'I(S0.i) == signext(SIMM16.i16) S_CMPK_LG_I32 4 Argument is not equal to constant. SCC = 64'I(S0.i) != signext(SIMM16.i16) S_CMPK_GT_I32 5 Argument is greater than constant. SCC = 64'I(S0.i) > signext(SIMM16.i16) S_CMPK_GE_I32 6 Argument is greater than or equal to constant. SCC = 64'I(S0.i) >= signext(SIMM16.i16) S_CMPK_LT_I32 7 Argument is less than constant. SCC = 64'I(S0.i) < signext(SIMM16.i16) 16.2. SOPK Instructions 200 of 597 "RDNA3" Instruction Set Architecture S_CMPK_LE_I32 8 Argument is less than or equal to constant. SCC = 64'I(S0.i) <= signext(SIMM16.i16) S_CMPK_EQ_U32 9 Argument is equal to constant. SCC = S0.u == 32'U(SIMM16.u16) S_CMPK_LG_U32 10 Argument is not equal to constant. SCC = S0.u != 32'U(SIMM16.u16) S_CMPK_GT_U32 11 Argument is greater than constant. SCC = S0.u > 32'U(SIMM16.u16) S_CMPK_GE_U32 12 Argument is greater than or equal to constant. SCC = S0.u >= 32'U(SIMM16.u16) S_CMPK_LT_U32 13 Argument is less than constant. 16.2. SOPK Instructions 201 of 597 "RDNA3" Instruction Set Architecture SCC = S0.u < 32'U(SIMM16.u16) S_CMPK_LE_U32 14 Argument is less than or equal to constant. SCC = S0.u <= 32'U(SIMM16.u16) S_ADDK_I32 15 Add a 16-bit signed constant to the destination with carry-out. tmp = D0.i; // save value so we can check sign bits for overflow later. D0.i = 32'I(64'I(D0.i) + signext(SIMM16.i16)); SCC = ((tmp[31] == SIMM16.i16[15]) && (tmp[31] != D0.i[31])); // signed overflow. S_MULK_I32 16 Multiply a 16-bit signed constant with the destination. D0.i = 32'I(64'I(D0.i) * signext(SIMM16.i16)) S_GETREG_B32 17 Read some or all of a hardware register into the LSBs of destination. The SIMM16 argument is encoded as follows: ID = SIMM16[5:0] ID of hardware register to access. OFFSET = SIMM16[10:6] LSB offset of register bits to access. SIZE = SIMM16[15:11] Size of register bits to access, minus 1. Set this field to 31 to read/write all bits of the hardware register. 16.2. SOPK Instructions 202 of 597 "RDNA3" Instruction Set Architecture hwRegId = SIMM16.u16[5 : 0]; offset = SIMM16.u16[10 : 6]; size = SIMM16.u16[15 : 11].u + 1U; // logical size is in range 1:32 value = HW_REGISTERS[hwRegId]; D0.u = 32'U(32'I(value >> offset.u) & ((1 << size) - 1)) S_SETREG_B32 18 Write some or all of the LSBs of source argument into a hardware register. The SIMM16 argument is encoded as follows: ID = SIMM16[5:0] ID of hardware register to access. OFFSET = SIMM16[10:6] LSB offset of register bits to access. SIZE = SIMM16[15:11] Size of register bits to access, minus 1. Set this field to 31 to read/write all bits of the hardware register. hwRegId = SIMM16.u16[5 : 0]; offset = SIMM16.u16[10 : 6]; size = SIMM16.u16[15 : 11].u + 1U; // logical size is in range 1:32 mask = (1 << size) - 1; mask = (mask & 32'I(writeableBitMask(hwRegId.u, WAVE_STATUS.PRIV))); // Mask of bits we are allowed to modify value = ((S0.u << offset.u) & mask.u); value = (value | 32'U(HW_REGISTERS[hwRegId].i & ~mask)); HW_REGISTERS[hwRegId] = value.b; // Side-effects may trigger here if certain bits are modified S_SETREG_IMM32_B32 19 Write some or all of the LSBs of a 32-bit literal constant into a hardware register; this instruction requires a 32bit literal constant. The SIMM16 argument is encoded as follows: ID = SIMM16[5:0] ID of hardware register to access. OFFSET = SIMM16[10:6] LSB offset of register bits to access. 16.2. SOPK Instructions 203 of 597 "RDNA3" Instruction Set Architecture SIZE = SIMM16[15:11] Size of register bits to access, minus 1. Set this field to 31 to read/write all bits of the hardware register. hwRegId = SIMM16.u16[5 : 0]; offset = SIMM16.u16[10 : 6]; size = SIMM16.u16[15 : 11].u + 1U; // logical size is in range 1:32 mask = (1 << size) - 1; mask = (mask & 32'I(writeableBitMask(hwRegId.u, WAVE_STATUS.PRIV))); // Mask of bits we are allowed to modify value = ((SIMM32.u << offset.u) & mask.u); value = (value | 32'U(HW_REGISTERS[hwRegId].i & ~mask)); HW_REGISTERS[hwRegId] = value.b; // Side-effects may trigger here if certain bits are modified S_CALL_B64 20 Short call to label. Implements a short call, where the return address (the next instruction after the S_CALL_B64) is saved to D. Long calls should consider S_SWAPPC_B64 instead. D0.i64 = PC + 4LL; PC = PC + signext(SIMM16.i16 * 16'4) + 4LL Notes This instruction must be 4 bytes. S_WAITCNT_VSCNT 24 Wait for the counts of outstanding vector store events -- vector memory stores and atomics that DO NOT return data -- to be at or below the specified level. This counter is not used in 'all-in-order' mode. Waits for the following condition to hold before continuing: vscnt <= S0.u[5:0] + S1.u[5:0]. // Comparison is 6 bits, no clamping is applied for add overflow To wait on a literal constant only, write 'null' for the GPR argument. This opcode may only appear inside a clause if the SGPR operand is set to NULL. See also S_WAITCNT. 16.2. SOPK Instructions 204 of 597 "RDNA3" Instruction Set Architecture S_WAITCNT_VMCNT 25 Wait for the counts of outstanding vector memory events -- everything except for memory stores and atomicswithout-return -- to be at or below the specified level. When in 'all-in-order' mode, wait for all vector memory events. Waits for the following condition to hold before continuing: vmcnt <= S0.u[5:0] + S1.u[5:0]. // Comparison is 6 bits, no clamping is applied for add overflow To wait on a literal constant only, write 'null' for the GPR argument or use S_WAITCNT. This opcode may only appear inside a clause if the SGPR operand is set to NULL. See also S_WAITCNT. S_WAITCNT_EXPCNT 26 Wait for the counts of outstanding export events to be at or below the specified level. Waits for the following condition to hold before continuing: expcnt <= S0.u[2:0] + S1.u[2:0]. // Comparison is 3 bits, no clamping is applied for add overflow To wait on a literal constant only, write 'null' for the GPR argument or use S_WAITCNT. This opcode may only appear inside a clause if the SGPR operand is set to NULL. See also S_WAITCNT. S_WAITCNT_LGKMCNT 27 Wait for the counts of outstanding DS (LG), scalar memory (K) and message (M) events to be at or below the specified level. Waits for the following condition to hold before continuing: lgkmcnt <= S0.u[5:0] + S1.u[5:0]. // Comparison is 6 bits, no clamping is applied for add overflow To wait on a literal constant only, write 'null' for the GPR argument or use S_WAITCNT. This opcode may only appear inside a clause if the SGPR operand is set to NULL. 16.2. SOPK Instructions 205 of 597 "RDNA3" Instruction Set Architecture See also S_WAITCNT. 16.2. SOPK Instructions 206 of 597 "RDNA3" Instruction Set Architecture 16.3. SOP1 Instructions Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction. S_MOV_B32 0 Move data to an SGPR. D0.b = S0.b S_MOV_B64 1 Move data to an SGPR. D0.b64 = S0.b64 S_CMOV_B32 2 Conditionally move data to an SGPR when scalar condition code is true. if SCC then D0.b = S0.b endif S_CMOV_B64 3 Conditionally move data to an SGPR when scalar condition code is true. if SCC then D0.b64 = S0.b64 endif S_BREV_B32 16.3. SOP1 Instructions 4 207 of 597 "RDNA3" Instruction Set Architecture Reverse bits. D0.u[31 : 0] = S0.u[0 : 31] S_BREV_B64 5 Reverse bits. D0.u64[63 : 0] = S0.u64[0 : 63] S_CTZ_I32_B32 8 Count trailing zeros. Returns the bit position of the first one from the LSB, or -1 if there are no ones. tmp = -1; // Set if no ones are found for i in 0 : 31 do // Search from LSB if S0.u[i] == 1'1U then tmp = i; break endif endfor; D0.i = tmp Notes Functional examples: S_CTZ_I32_B32(0xaaaaaaaa) ⇒ 1 S_CTZ_I32_B32(0x55555555) ⇒ 0 S_CTZ_I32_B32(0x00000000) ⇒ 0xffffffff S_CTZ_I32_B32(0xffffffff) ⇒ 0 S_CTZ_I32_B32(0x00010000) ⇒ 16 Compare with V_CTZ_I32_B32, which performs the equivalent operation in the vector ALU. S_CTZ_I32_B64 9 Count trailing zeros. 16.3. SOP1 Instructions 208 of 597 "RDNA3" Instruction Set Architecture Returns the bit position of the first one from the LSB, or -1 if there are no ones. tmp = -1; // Set if no ones are found for i in 0 : 63 do // Search from LSB if S0.u64[i] == 1'1U then tmp = i; break endif endfor; D0.i = tmp S_CLZ_I32_U32 10 Count leading zeros. Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones. tmp = -1; // Set if no ones are found for i in 0 : 31 do // Search from MSB if S0.u[31 - i] == 1'1U then tmp = i; break endif endfor; D0.i = tmp Notes Functional examples: S_CLZ_I32_U32(0x00000000) ⇒ 0xffffffff S_CLZ_I32_U32(0x0000cccc) ⇒ 16 S_CLZ_I32_U32(0xffff3333) ⇒ 0 S_CLZ_I32_U32(0x7fffffff) ⇒ 1 S_CLZ_I32_U32(0x80000000) ⇒ 0 S_CLZ_I32_U32(0xffffffff) ⇒ 0 Compare with V_CLZ_I32_U32, which performs the equivalent operation in the vector ALU. S_CLZ_I32_U64 11 Count leading zeros. Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones. 16.3. SOP1 Instructions 209 of 597 "RDNA3" Instruction Set Architecture tmp = -1; // Set if no ones are found for i in 0 : 63 do // Search from MSB if S0.u64[63 - i] == 1'1U then tmp = i; break endif endfor; D0.i = tmp S_CLS_I32 12 Count leading sign bits. Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the same. tmp = -1; // Set if all bits are the same for i in 1 : 31 do // Search from MSB if S0.u[31 - i] != S0.u[31] then tmp = i; break endif endfor; D0.i = tmp Notes Functional examples: S_CLS_I32(0x00000000) ⇒ 0xffffffff S_CLS_I32(0x0000cccc) ⇒ 16 S_CLS_I32(0xffff3333) ⇒ 16 S_CLS_I32(0x7fffffff) ⇒ 1 S_CLS_I32(0x80000000) ⇒ 1 S_CLS_I32(0xffffffff) ⇒ 0xffffffff Compare with S_CLS_I32, which performs the equivalent operation in the vector ALU. S_CLS_I32_I64 13 Count leading sign bits. Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the 16.3. SOP1 Instructions 210 of 597 "RDNA3" Instruction Set Architecture same. tmp = -1; // Set if all bits are the same for i in 1 : 63 do // Search from MSB if S0.u64[63 - i] != S0.u64[63] then tmp = i; break endif endfor; D0.i = tmp S_SEXT_I32_I8 14 Sign extension of a signed byte. D0.i = 32'I(signext(S0.i8)) S_SEXT_I32_I16 15 Sign extension of a signed short. D0.i = 32'I(signext(S0.i16)) S_BITSET0_B32 16 Set a specific bit to zero. D0.u[S0.u[4 : 0]] = 1'0U S_BITSET0_B64 17 Set a specific bit to zero. D0.u64[S0.u[5 : 0]] = 1'0U 16.3. SOP1 Instructions 211 of 597 "RDNA3" Instruction Set Architecture S_BITSET1_B32 18 Set a specific bit to one. D0.u[S0.u[4 : 0]] = 1'1U S_BITSET1_B64 19 Set a specific bit to one. D0.u64[S0.u[5 : 0]] = 1'1U S_BITREPLICATE_B64_B32 20 Replicate the low 32 bits of the argument by 'doubling' each bit. tmp = S0.u; for i in 0 : 31 do D0.u64[i * 2 + 0] = tmp[i]; D0.u64[i * 2 + 1] = tmp[i] endfor Notes This opcode can be used to convert a quad mask into a pixel mask; given quad mask in s0, the following sequence produces a pixel mask in s2: s_bitreplicate_b64 s2, s0 s_bitreplicate_b64 s2, s2 To perform the inverse operation see S_QUADMASK_B64. S_ABS_I32 21 Integer absolute value. D0.i = S0.i < 0 ? -S0.i : S0.i; SCC = D0.i != 0 Notes 16.3. SOP1 Instructions 212 of 597 "RDNA3" Instruction Set Architecture Functional examples: S_ABS_I32(0x00000001) ⇒ 0x00000001 S_ABS_I32(0x7fffffff) ⇒ 0x7fffffff S_ABS_I32(0x80000000) ⇒ 0x80000000 // Note this is negative! S_ABS_I32(0x80000001) ⇒ 0x7fffffff S_ABS_I32(0x80000002) ⇒ 0x7ffffffe S_ABS_I32(0xffffffff) ⇒ 0x00000001 S_BCNT0_I32_B32 22 Count number of bits that are zero. tmp = 0; for i in 0 : 31 do tmp += S0.u[i].u == 0U ? 1 : 0 endfor; D0.i = tmp; SCC = D0.u != 0U Notes Functional examples: S_BCNT0_I32_B32(0x00000000) ⇒ 32 S_BCNT0_I32_B32(0xcccccccc) ⇒ 16 S_BCNT0_I32_B32(0xffffffff) ⇒ 0 S_BCNT0_I32_B64 23 Count number of bits that are zero. tmp = 0; for i in 0 : 63 do tmp += S0.u64[i].u == 0U ? 1 : 0 endfor; D0.i = tmp; SCC = D0.u64 != 0ULL S_BCNT1_I32_B32 24 Count number of bits that are one. tmp = 0; 16.3. SOP1 Instructions 213 of 597 "RDNA3" Instruction Set Architecture for i in 0 : 31 do tmp += S0.u[i].u == 1U ? 1 : 0 endfor; D0.i = tmp; SCC = D0.u != 0U Notes Functional examples: S_BCNT1_I32_B32(0x00000000) ⇒ 0 S_BCNT1_I32_B32(0xcccccccc) ⇒ 16 S_BCNT1_I32_B32(0xffffffff) ⇒ 32 S_BCNT1_I32_B64 25 Count number of bits that are one. tmp = 0; for i in 0 : 63 do tmp += S0.u64[i].u == 1U ? 1 : 0 endfor; D0.i = tmp; SCC = D0.u64 != 0ULL S_QUADMASK_B32 26 Reduce a pixel mask to a quad mask. tmp = 0U; for i in 0 : 7 do tmp[i] = S0.u[i * 4 + 3 : i * 4] != 0U endfor; D0.u = tmp; SCC = D0.u != 0U Notes To perform the inverse operation see S_BITREPLICATE_B64_B32. S_QUADMASK_B64 27 Reduce a pixel mask to a quad mask. 16.3. SOP1 Instructions 214 of 597 "RDNA3" Instruction Set Architecture tmp = 0U; for i in 0 : 15 do tmp[i] = S0.u64[i * 4 + 3 : i * 4] != 0ULL endfor; D0.u = tmp; SCC = D0.u != 0U Notes To perform the inverse operation see S_BITREPLICATE_B64_B32. S_WQM_B32 28 Computes whole quad mode for an active/valid mask. If any pixel in a quad is active, all pixels of the quad are marked active. tmp = 0U; declare i : 6'U; for i in 6'0U : 6'31U do tmp[i] = S0.u[i | 6'3U : i & 6'60U] != 0U endfor; D0.u = tmp; SCC = D0.u != 0U S_WQM_B64 29 Computes whole quad mode for an active/valid mask. If any pixel in a quad is active, all pixels of the quad are marked active. tmp = 0ULL; declare i : 6'U; for i in 6'0U : 6'63U do tmp[i] = S0.u64[i | 6'3U : i & 6'60U] != 0ULL endfor; D0.u64 = tmp; SCC = D0.u64 != 0ULL S_NOT_B32 30 Bitwise negation. D0.u = ~S0.u; SCC = D0.u != 0U 16.3. SOP1 Instructions 215 of 597 "RDNA3" Instruction Set Architecture S_NOT_B64 31 Bitwise negation. D0.u64 = ~S0.u64; SCC = D0.u64 != 0ULL S_AND_SAVEEXEC_B32 32 Bitwise AND with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = (S0.u & EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_AND_SAVEEXEC_B64 33 Bitwise AND with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = (S0.u64 & EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_OR_SAVEEXEC_B32 34 Bitwise OR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = (S0.u | EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U 16.3. SOP1 Instructions 216 of 597 "RDNA3" Instruction Set Architecture S_OR_SAVEEXEC_B64 35 Bitwise OR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = (S0.u64 | EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_XOR_SAVEEXEC_B32 36 Bitwise XOR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = (S0.u ^ EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_XOR_SAVEEXEC_B64 37 Bitwise XOR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = (S0.u64 ^ EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_NAND_SAVEEXEC_B32 38 Bitwise NAND with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = ~(S0.u & EXEC.u); D0.u = saveexec.u; 16.3. SOP1 Instructions 217 of 597 "RDNA3" Instruction Set Architecture SCC = EXEC.u != 0U S_NAND_SAVEEXEC_B64 39 Bitwise NAND with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = ~(S0.u64 & EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_NOR_SAVEEXEC_B32 40 Bitwise NOR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = ~(S0.u | EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_NOR_SAVEEXEC_B64 41 Bitwise NOR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = ~(S0.u64 | EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_XNOR_SAVEEXEC_B32 42 Bitwise XNOR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. 16.3. SOP1 Instructions 218 of 597 "RDNA3" Instruction Set Architecture saveexec = EXEC.u; EXEC.u = ~(S0.u ^ EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_XNOR_SAVEEXEC_B64 43 Bitwise XNOR with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = ~(S0.u64 ^ EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_AND_NOT0_SAVEEXEC_B32 44 Bitwise AND with negated first argument with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = (~S0.u & EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_AND_NOT0_SAVEEXEC_B64 45 Bitwise AND with negated first argument with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = (~S0.u64 & EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_OR_NOT0_SAVEEXEC_B32 16.3. SOP1 Instructions 46 219 of 597 "RDNA3" Instruction Set Architecture Bitwise OR with negated first argument with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = (~S0.u | EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_OR_NOT0_SAVEEXEC_B64 47 Bitwise OR with negated first argument with EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = (~S0.u64 | EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_AND_NOT1_SAVEEXEC_B32 48 Bitwise AND with negated EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = (S0.u & ~EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_AND_NOT1_SAVEEXEC_B64 49 Bitwise AND with negated EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = (S0.u64 & ~EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL 16.3. SOP1 Instructions 220 of 597 "RDNA3" Instruction Set Architecture S_OR_NOT1_SAVEEXEC_B32 50 Bitwise OR with negated EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u; EXEC.u = (S0.u | ~EXEC.u); D0.u = saveexec.u; SCC = EXEC.u != 0U S_OR_NOT1_SAVEEXEC_B64 51 Bitwise OR with negated EXEC mask. The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed. saveexec = EXEC.u64; EXEC.u64 = (S0.u64 | ~EXEC.u64); D0.u64 = saveexec.u64; SCC = EXEC.u64 != 0ULL S_AND_NOT0_WREXEC_B32 52 Bitwise AND with negated first argument with EXEC mask. Unlike the SAVEEXEC series of opcodes, the value written to destination SGPRs is the result of the bitwise-op result. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is intended to help accelerate waterfalling. EXEC_LO.u = (~S0.u & EXEC_LO.u); D0.u = EXEC_LO.u; SCC = EXEC_LO.u != 0U S_AND_NOT0_WREXEC_B64 53 Bitwise AND with negated first argument with EXEC mask. Unlike the SAVEEXEC series of opcodes, the value written to destination SGPRs is the result of the bitwise-op result. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is intended to help accelerate waterfalling. 16.3. SOP1 Instructions 221 of 597 "RDNA3" Instruction Set Architecture EXEC.u64 = (~S0.u64 & EXEC.u64); D0.u64 = EXEC.u64; SCC = EXEC.u64 != 0ULL S_AND_NOT1_WREXEC_B32 54 Bitwise AND with negated EXEC mask. Unlike the SAVEEXEC series of opcodes, the value written to destination SGPRs is the result of the bitwise-op result. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is intended to help accelerate waterfalling. See S_AND_NOT1_WREXEC_B64 for example code. EXEC_LO.u = (S0.u & ~EXEC_LO.u); D0.u = EXEC_LO.u; SCC = EXEC_LO.u != 0U S_AND_NOT1_WREXEC_B64 55 Bitwise AND with negated EXEC mask. Unlike the SAVEEXEC series of opcodes, the value written to destination SGPRs is the result of the bitwise-op result. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is intended to help accelerate waterfalling. EXEC.u64 = (S0.u64 & ~EXEC.u64); D0.u64 = EXEC.u64; SCC = EXEC.u64 != 0ULL Notes In particular, the following sequence of waterfall code is optimized by using a WREXEC instead of two separate scalar ops: // V0 holds the index value per lane // save exec mask for restore at the end s_mov_b64 s2, exec // exec mask of remaining (unprocessed) threads s_mov_b64 s4, exec loop: // get the index value for the first active lane v_readfirstlane_b32 s0, v0 // find all other lanes with same index value v_cmpx_eq s0, v0 // do the operation using the current EXEC mask. S0 holds the index. // mask out thread that was just executed 16.3. SOP1 Instructions 222 of 597 "RDNA3" Instruction Set Architecture // s_andn2_b64 s4, s4, exec // s_mov_b64 exec, s4 s_andn2_wrexec_b64 s4, s4 // replaces above 2 ops // repeat until EXEC==0 s_cbranch_scc1 s_mov_b64 loop exec, s2 S_MOVRELS_B32 64 Move from a relative source address. addr = SRC0.u; // Raw value from instruction addr += M0.u[31 : 0]; D0.b = SGPR[addr].b Notes Example: The following instruction sequence performs the move s5 <= s17: s_mov_b32 m0, 10 s_movrels_b32 s5, s7 S_MOVRELS_B64 65 Move from a relative source address. The index in M0.u and the operand address in SRC0.u must be even for this operation. addr = SRC0.u; // Raw value from instruction addr += M0.u[31 : 0]; D0.b64 = SGPR[addr].b64 S_MOVRELD_B32 66 Move to a relative destination address. addr = DST.u; // Raw value from instruction addr += M0.u[31 : 0]; SGPR[addr].b = S0.b 16.3. SOP1 Instructions 223 of 597 "RDNA3" Instruction Set Architecture Notes Example: The following instruction sequence performs the move s15 <= s7: s_mov_b32 m0, 10 s_movreld_b32 s5, s7 S_MOVRELD_B64 67 Move to a relative destination address. The index in M0.u and the operand address in DST.u must be even for this operation. addr = DST.u; // Raw value from instruction addr += M0.u[31 : 0]; SGPR[addr].b64 = S0.b64 S_MOVRELSD_2_B32 68 Move from a relative source address to a relative destination address, with different offsets. addrs = SRC0.u; // Raw value from instruction addrd = DST.u; // Raw value from instruction addrs += M0.u[9 : 0].u; addrd += M0.u[25 : 16].u; SGPR[addrd].b = SGPR[addrs].b Notes Example: The following instruction sequence performs the move s25 <= s17: s_mov_b32 m0, ((20 << 16) | 10) s_movrelsd_2_b32 s5, s7 S_GETPC_B64 71 Save current program location. The byte address of the instruction immediately following the GETPC instruction is saved to the destination register D0. 16.3. SOP1 Instructions 224 of 597 "RDNA3" Instruction Set Architecture D0.i64 = PC + 4LL Notes This instruction must be 4 bytes. S_SETPC_B64 72 Jump to a new location. Argument is a byte address of the instruction to jump to. PC = S0.i64 S_SWAPPC_B64 73 Save current program location and jump to a new location. Argument is a byte address of the instruction to jump to. The byte address of the instruction immediately following the SWAPPC instruction is saved to the destination register D0. jump_addr = S0.i64; D0.i64 = PC + 4LL; PC = jump_addr.i64 Notes This instruction must be 4 bytes. S_RFE_B64 74 Return from exception handler and continue. This instruction may only be used within a trap handler. WAVE_STATUS.PRIV = 1'0U; PC = S0.i64 S_SENDMSG_RTN_B32 76 Send a message to upstream control hardware. 16.3. SOP1 Instructions 225 of 597 "RDNA3" Instruction Set Architecture SSRC[7:0] contains the message type encoded in the instruction directly (this instruction does not read an SGPR). The message is expected to return a response from the upstream control hardware and the result is written to SDST. Use s_waitcnt lgkmcnt(…) to wait for the response on the dependent instruction. S_SENDMSG_RTN* instructions return data in-order among themselves but out-of-order with other instructions that manipulate lgkmcnt (including S_SENDMSG and S_SENDMSGHALT). If the message returns a 64 bit value then only the lower 32 bits are written to SDST. If SDST is VCC then VCCZ is undefined. S_SENDMSG_RTN_B64 77 Send a message to upstream control hardware. SSRC[7:0] contains the message type encoded in the instruction directly (this instruction does not read an SGPR). The message is expected to return a response from the upstream control hardware and the result is written to SDST. Use s_waitcnt lgkmcnt(…) to wait for the response on the dependent instruction. S_SENDMSG_RTN* instructions return data in-order among themselves but out-of-order with other instructions that manipulate lgkmcnt (including S_SENDMSG and S_SENDMSGHALT). If the message returns a 32 bit value then this instruction fills the upper bits of SDST with zero. If SDST is VCC then VCCZ is undefined. 16.3. SOP1 Instructions 226 of 597 "RDNA3" Instruction Set Architecture 16.4. SOPC Instructions Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction. S_CMP_EQ_I32 0 A equal to B. SCC = S0.i == S1.i Notes Note that S_CMP_EQ_I32 and S_CMP_EQ_U32 are identical opcodes, but both are provided for symmetry. S_CMP_LG_I32 1 A not equal to B. SCC = S0.i <> S1.i Notes Note that S_CMP_LG_I32 and S_CMP_LG_U32 are identical opcodes, but both are provided for symmetry. S_CMP_GT_I32 2 A greater than B. SCC = S0.i > S1.i S_CMP_GE_I32 3 A greater than or equal to B. SCC = S0.i >= S1.i 16.4. SOPC Instructions 227 of 597 "RDNA3" Instruction Set Architecture S_CMP_LT_I32 4 A less than B. SCC = S0.i < S1.i S_CMP_LE_I32 5 A less than or equal to B. SCC = S0.i <= S1.i S_CMP_EQ_U32 6 A equal to B. SCC = S0.i == S1.i Notes Note that S_CMP_EQ_I32 and S_CMP_EQ_U32 are identical opcodes, but both are provided for symmetry. S_CMP_LG_U32 7 A not equal to B. SCC = S0.i <> S1.i Notes Note that S_CMP_LG_I32 and S_CMP_LG_U32 are identical opcodes, but both are provided for symmetry. S_CMP_GT_U32 8 A greater than B. SCC = S0.u > S1.u 16.4. SOPC Instructions 228 of 597 "RDNA3" Instruction Set Architecture Notes Unsigned integer comparison. S_CMP_GE_U32 9 A greater than or equal to B. SCC = S0.u >= S1.u Notes Unsigned integer comparison. S_CMP_LT_U32 10 A less than B. SCC = S0.u < S1.u Notes Unsigned integer comparison. S_CMP_LE_U32 11 A less than or equal to B. SCC = S0.u <= S1.u Notes Unsigned integer comparison. S_BITCMP0_B32 12 Test if a bit is unset. SCC = S0.u[S1.u[4 : 0]] == 1'0U 16.4. SOPC Instructions 229 of 597 "RDNA3" Instruction Set Architecture S_BITCMP1_B32 13 Test if a bit is set. SCC = S0.u[S1.u[4 : 0]] == 1'1U S_BITCMP0_B64 14 Test if a bit is unset. SCC = S0.u64[S1.u[5 : 0]] == 1'0U S_BITCMP1_B64 15 Test if a bit is set. SCC = S0.u64[S1.u[5 : 0]] == 1'1U S_CMP_EQ_U64 16 A equal to B. SCC = S0.u64 == S1.u64 S_CMP_LG_U64 17 A not equal to B. SCC = S0.u64 != S1.u64 16.4. SOPC Instructions 230 of 597 "RDNA3" Instruction Set Architecture 16.5. SOPP Instructions S_NOP 0 Do nothing. Delay issue of next instruction by a small, fixed amount. Insert 0..15 wait states based on SIMM16[3:0]. 0x0 means the next instruction can issue on the next clock, 0xf means the next instruction can issue 16 clocks later. for i in 0U : SIMM16.u16[3 : 0].u do nop() endfor Notes Examples: s_nop 0 // Wait 1 cycle. s_nop 0xf // Wait 16 cycles. S_SETKILL 1 Set KILL bit to value of SIMM16[0]. Used primarily for debugging kill wave host command behavior. S_SETHALT 2 Set or clear the HALT or FATAL_HALT status bits. The particular status bit is chosen by halt type control as indicated in SIMM16[2]; 0 = HALT bit select; 1 = FATAL_HALT bit select. When halt type control is set to 0 (HALT bit select): Set HALT bit to value of SIMM16[0]; 1 = halt, 0 = clear HALT bit. The halt flag is ignored while PRIV == 1 (inside trap handlers) but the shader halts after the handler returns if HALT is still set at that time. When halt type control is set to 1 (FATAL HALT bit select): Set FATAL_HALT bit to value of SIMM16[0]; 1 = fatal_halt, 0 = clear FATAL_HALT bit. Setting the fatal_halt flag halts the shader in or outside of the trap handlers. 16.5. SOPP Instructions 231 of 597 "RDNA3" Instruction Set Architecture S_SLEEP 3 Cause a wave to sleep for up to ~8000 clocks. The wave sleeps for (64*(SIMM16[6:0]-1) .. 64*SIMM16[6:0]) clocks. The exact amount of delay is approximate. Compare with S_NOP. When SIMM16[6:0] is zero then no sleep occurs. Notes Examples: s_sleep 0 // Wait for 0 clocks. s_sleep 1 // Wait for 1-64 clocks. s_sleep 2 // Wait for 65-128 clocks. S_SET_INST_PREFETCH_DISTANCE 4 Change instruction prefetch mode. This controls how many cachelines ahead of the current PC the shader will try to prefetch. SIMM16[1:0] specifies the prefetch mode to switch to. Prefetch modes are: PREFETCH_SAFE (0x0) Reserved, do not use. PREFETCH_1_LINE (0x1) Prefetch 1 cache line ahead of PC; keep 2 lines behind PC. PREFETCH_2_LINES (0x2) Prefetch 2 cache lines ahead of PC; keep 1 line behind PC. PREFETCH_3_LINES (0x3) Prefetch 3 cache lines ahead of PC; keep 0 lines behind PC. SIMM16[15:2] must be set to zero. S_CLAUSE 5 Mark the beginning of a clause. The next instruction determines the clause type, which may be one of the following types. • Image Load (non-sample instructions ) • Image Sample • Image Store • Image Atomic 16.5. SOPP Instructions 232 of 597 "RDNA3" Instruction Set Architecture • Buffer/Global/Scratch Load • Buffer/Global/Scratch Store • Buffer/Global/Scratch Atomic • Flat Load • Flat Store • Flat Atomic • LDS (loads, stores, atomics may be in same clause) • Scalar Memory • Vector ALU Once the clause type is determined, any instruction encountered within the clause that is not of the same type (and not an internal instruction described below) is illegal and may lead to undefined behaviour. Attempting to issue S_CLAUSE while inside a clause is also illegal. Instructions that are processed internally do not interrupt the clause. The following instructions are internal: • S_NOP, • S_WAITCNT and its variants, unless they read an SGPR, • S_SLEEP, • S_DELAY_ALU. Halting or killing a wave breaks the clause. VALU exceptions and other traps that cause the shader to enter its trap handler breaks the clause. The single-step debug mode breaks the clause. The clause length must be between 2 and 63 instructions, inclusive. Clause breaks may be from 1 to 15, or may be disabled entirely. Clause length and breaks are encoded in the SIMM16 argument as follows: LENGTH = SIMM16[5:0] This field is set to the logical number of instructions in the clause, minus 1 (e.g. if a clause has 4 instructions, program this field to 3). The minimum number of instructions required for a clause is 2 and the maximum number of instructions is 63, therefore this field must be programmed in the range [1, 62] inclusive. BREAK_SPAN = SIMM16[11:8] This field is set to the number of instructions to issue before each clause break. If set to zero then there are no clause breaks. If set to nonzero value then the maximum number of instructions between clause breaks is 15. The following instruction types cannot appear in a clause: • SALU • Export • Branch • Message • LDSDIR • VINTERP • GDS If you need to schedule an S_WAITCNT or S_DELAY_ALU instruction for the first instruction in the clause, the waitcnt/delay instruction must appear before the S_CLAUSE instruction so that S_CLAUSE can accurately determine the clause type. 16.5. SOPP Instructions 233 of 597 "RDNA3" Instruction Set Architecture S_DELAY_ALU must not appear inside a clause. The features are orthogonal; ALU clauses should be structured to avoid any stalling. S_DELAY_ALU 7 Insert delay between dependent SALU/VALU instructions. The SIMM16 argument is encoded as: INSTID0 = SIMM16[3:0] Hazard to delay for with the next VALU instruction. INSTSKIP = SIMM16[6:4] Identify the VALU instruction that the second delay condition applies to. INSTID1 = SIMM16[10:7] Hazard to delay for with the VALU instruction identified by INSTSKIP. Legal values for the InstID0 and InstID1 fields are: INSTID_NO_DEP (0x0) No dependency on any prior instruction. INSTID_VALU_DEP_1 (0x1) Dependent on previous VALU instruction, 1 instruction back. INSTID_VALU_DEP_2 (0x2) Dependent on previous VALU instruction, 2 instructions back. INSTID_VALU_DEP_3 (0x3) Dependent on previous VALU instruction, 3 instructions back. INSTID_VALU_DEP_4 (0x4) Dependent on previous VALU instruction, 4 instructions back. INSTID_TRANS32_DEP_1 (0x5) Dependent on previous TRANS32 instruction, 1 instruction back. INSTID_TRANS32_DEP_2 (0x6) Dependent on previous TRANS32 instruction, 2 instructions back. INSTID_TRANS32_DEP_3 (0x7) Dependent on previous TRANS32 instruction, 3 instructions back. INSTID_FMA_ACCUM_CYCLE_1 (0x8) Single cycle penalty for FMA accumulation (reserved for future use in architectures with fast FMA SRC-C accumulation). 16.5. SOPP Instructions 234 of 597 "RDNA3" Instruction Set Architecture INSTID_SALU_CYCLE_1 (0x9) 1 cycle penalty for a prior SALU instruction. INSTID_SALU_CYCLE_2 (0xa) 2 cycle penalty for a prior SALU instruction (reserved for future use in architectures with SALU floats). INSTID_SALU_CYCLE_3 (0xb) 3 cycle penalty for a prior SALU instruction (reserved for future use in architectures with SALU floats). Legal values for the InstSkip field are: INSTSKIP_SAME (0x0) Apply second dependency to same instruction (2 dependencies on one instruction). INSTSKIP_NEXT (0x1) Apply second dependency to next instruction (no skip). INSTSKIP_SKIP_1 (0x2) Skip 1 instruction then apply dependency. INSTSKIP_SKIP_2 (0x3) Skip 2 instructions then apply dependency. INSTSKIP_SKIP_3 (0x4) Skip 3 instructions then apply dependency. INSTSKIP_SKIP_4 (0x5) Skip 4 instructions then apply dependency. This instruction describes dependencies for two instructions, directing the hardware to insert delay if the dependent instruction was issued too recently to forward data to the second. S_DELAY_ALU instructions record the required delay with respect to a previous VALU instruction and indicate data dependencies that benefit from having extra idle cycles inserted between them. These instructions are optional: without them the program still functions correctly but performance may suffer when multiple waves are in flight; IB may issue dependent instructions that stall in the ALU, preventing those cycles from being utilized by other wavefronts. If enough independent instructions are between dependent ones then no delay is necessary and this instruction may be omitted. For wave64 the compiler may not know the status of the EXEC mask and hence does not know if instructions require 1 or 2 passes to issue. S_DELAY_ALU encodes the type of dependency so that hardware may apply the correct delay depending on the number of active passes. S_DELAY_ALU may execute in zero cycles. To reduce instruction stream overhead the S_DELAY_ALU instructions packs two delay values into one instruction, with a "skip" indicator so the two delayed instructions don't need to be back-to-back. S_DELAY_ALU is illegal inside of a clause created by S_CLAUSE. Example: 16.5. SOPP Instructions 235 of 597 "RDNA3" Instruction Set Architecture v_mov_b32 v3, v0 v_lshlrev_b32 v30, 1, v31 v_lshlrev_b32 v24, 1, v25 s_delay_alu instid0(INSTID_VALU_DEP_3) | instskip(INSTSKIP_SKIP_1) | instid1(INSTID_VALU_DEP_1) // 1 cycle delay here v_add_f32 v0, v1, v3 v_sub_f32 v11, v9, v9 // 2 cycles delay here v_mul_f32 v10, v13, v11 S_WAITCNT 9 Wait for the counts of outstanding lds, vector-memory and export/vmem-write-data to be at or below the specified levels. The SIMM16 argument is encoded as: EXP = SIMM16[2:0] Export wait count. 0x7 means do not wait on EXPCNT. LGKM = SIMM16[9:4] LGKM wait count. 0x3f means do not wait on LGKMCNT. VM = SIMM16[15:10] VM wait count. 0x3f means do not wait on VMCNT. Waits for all of the following conditions to hold before continuing: expcnt <= WaitEXPCNT lgkmcnt <= WaitLGKMCNT vmcnt <= WaitVMCNT VMCNT only counts vector memory loads, image sample instructions, and vector memory atomics that return data. Contrast with the VSCNT counter. See also S_WAITCNT_VSCNT. S_WAIT_IDLE 10 Wait for all activity in the wave to be complete (all dependency and memory counters at zero). S_WAIT_EVENT 11 Wait for an event to occur or a condition to be satisfied before continuing. The SIMM16 argument specifies 16.5. SOPP Instructions 236 of 597 "RDNA3" Instruction Set Architecture which event(s) to wait on. DONT_WAIT_EXPORT_READY = SIMM16[0] If this value is ZERO then sleep until the export_ready bit is 1. If the export_ready bit is already 1, no sleep occurs. Effect is the same as the export_ready check performed before issuing an export instruction. No wait occurs if this value is ONE. This wait can be broken or preempted by KILL, context-save, host trap, single-step or trap after instruction events. IB waits for the event to occur before processing internal exceptions which can delay entry to the trap handler for a significant amount of time. S_TRAP 16 Enter the trap handler. This instruction may be generated internally as well in response to a host trap (HT = 1) or an exception. TrapID 0 is reserved for hardware use and should not be used in a shader-generated trap. TrapID = SIMM16.u16[7 : 0]; "Wait for all instructions to complete"; // PC passed into trap handler points to S_TRAP itself, // *not* to the next instruction. { TTMP[1], TTMP[0] } = { 7'0, HT[0], TrapID[7 : 0], PC[47 : 0] }; PC = TBA.i64; // trap base address WAVE_STATUS.PRIV = 1'1U S_ROUND_MODE 17 Set floating point round mode using an immediate constant. Avoids wait state penalty that would be imposed by S_SETREG. S_DENORM_MODE 18 Set floating point denormal mode using an immediate constant. Avoids wait state penalty that would be imposed by S_SETREG. S_CODE_END 31 Generate an illegal instruction interrupt. This instruction is used to mark the end of a shader buffer for debug tools. 16.5. SOPP Instructions 237 of 597 "RDNA3" Instruction Set Architecture This instruction should not appear in typical shader code. It is used to pad the end of a shader program to make it easier for analysis programs to locate the end of a shader program buffer. Use of this opcode in an embedded shader block may cause analysis tools to fail. To unambiguously mark the end of a shader buffer, this instruction must be specified five times in a row (total of 20 bytes) and analysis tools must ensure the opcode occurs at least five times to be certain they are at the end of the buffer. This is because the bit pattern generated by this opcode could incidentally appear in a valid instruction's second dword, literal constant or as part of a multi-DWORD image instruction. In short: do not embed this opcode in the middle of a valid shader program. DO use this opcode 5 times at the end of a shader program to clearly mark the end of the program. Example: ... s_endpgm // last real instruction in shader buffer s_code_end // 1 s_code_end // 2 s_code_end // 3 s_code_end // 4 s_code_end // done! S_BRANCH 32 Unconditional short jump to label. PC = PC + signext(SIMM16.i16 * 16'4) + 4LL; // short jump. Notes For a long jump or an indirect jump use S_SETPC_B64. Examples: s_branch label s_nop 0 // Set SIMM16 = +4 = 0x0004 // 4 bytes label: s_nop 0 // 4 bytes s_branch label // Set SIMM16 = -8 = 0xfff8 S_CBRANCH_SCC0 33 Conditional short jump when SCC is zero. 16.5. SOPP Instructions 238 of 597 "RDNA3" Instruction Set Architecture if SCC.u == 0U then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_SCC1 34 Conditional short jump when SCC is one. if SCC.u == 1U then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_VCCZ 35 Conditional short jump when VCC is zero. if VCC == 0x0LL then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_VCCNZ 36 Conditional short jump when VCC is nonzero. if VCC != 0x0LL then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_EXECZ 37 Conditional short jump when EXEC is zero. 16.5. SOPP Instructions 239 of 597 "RDNA3" Instruction Set Architecture if EXEC == 0x0LL then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_EXECNZ 38 Conditional short jump when EXEC is nonzero. if EXEC != 0x0LL then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_CDBGSYS 39 Conditional short jump when the system debug flag is set. if WAVE_STATUS.COND_DBG_SYS.u != 0U then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_CDBGUSER 40 Conditional short jump when the user debug flag is set. if WAVE_STATUS.COND_DBG_USER.u != 0U then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_CDBGSYS_OR_USER 41 Conditional short jump when either the system or the user debug flag are set. 16.5. SOPP Instructions 240 of 597 "RDNA3" Instruction Set Architecture if (WAVE_STATUS.COND_DBG_SYS || WAVE_STATUS.COND_DBG_USER) then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_CBRANCH_CDBGSYS_AND_USER 42 Conditional short jump when both the system and the user debug flag are set. if (WAVE_STATUS.COND_DBG_SYS && WAVE_STATUS.COND_DBG_USER) then PC = PC + signext(SIMM16.i16 * 16'4) + 4LL else PC = PC + 4LL endif S_ENDPGM 48 End of program; terminate wavefront. The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0 before executing this instruction. See S_ENDPGM_SAVED for the context-switch version of this instruction and S_ENDPGM_ORDERED_PS_DONE for the POPS critical region version of this instruction. S_ENDPGM_SAVED 49 End of program; signal that a wave has been saved by the context-switch trap handler and terminate wavefront. The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0 before executing this instruction. See S_ENDPGM for additional variants. S_ENDPGM_ORDERED_PS_DONE 50 End of program; signal that a wave has exited its POPS critical section and terminate wavefront. The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0 before executing this instruction. This instruction is an optimization that combines S_SENDMSG(MSG_ORDERED_PS_DONE) and S_ENDPGM; there may be cases where you still need to send the message separately, in which case the shader must end with a regular S_ENDPGM instruction. See S_ENDPGM for additional variants. 16.5. SOPP Instructions 241 of 597 "RDNA3" Instruction Set Architecture S_WAKEUP 52 Allow a wave to 'ping' all the other waves in its threadgroup to force them to wake up early from an S_SLEEP instruction. The ping is ignored if the waves are not sleeping. This allows for efficient polling on a memory location. The waves which are polling can sit in a long S_SLEEP between memory reads, but the wave which writes the value can tell them all to wake up early now that the data is available. This method is also safe from races because if any wave misses the ping, everything is expected to work fine (waves which missed it just complete their S_SLEEP). If the wave executing S_WAKEUP is in a threadgroup (in_wg set), then it wakes up all waves associated with the same threadgroup ID. Otherwise, S_WAKEUP is treated as an S_NOP. S_SETPRIO 53 Change wave user priority. User settable wave priority is set to SIMM16[1:0]. 0 is the lowest priority and 3 is the highest. The overall wave priority is: Priority = {SysUserPrio[1:0], WaveAge[3:0]} SysUserPrio = MIN(3, SysPrio[1:0] + UserPrio[1:0]). The system priority cannot be modified from within the wave. S_SENDMSG 54 Send a message to upstream control hardware. SIMM16[7:0] contains the message type. Notes S_SENDMSGHALT 55 Send a message to upstream control hardware and then HALT the wavefront; see S_SENDMSG for details. S_INCPERFLEVEL 56 Increment performance counter specified in SIMM16[3:0] by 1. 16.5. SOPP Instructions 242 of 597 "RDNA3" Instruction Set Architecture S_DECPERFLEVEL 57 Decrement performance counter specified in SIMM16[3:0] by 1. S_ICACHE_INV 60 Invalidate entire first level instruction cache. S_BARRIER 61 Synchronize waves within a threadgroup. If not all waves of the threadgroup have been created yet, waits for entire group before proceeding. If some waves in the threadgroup have already terminated, this waits on only the surviving waves. Barriers are legal inside trap handlers. Barrier instructions do not wait for any counters to go to zero before issuing. If you need the barrier to protect an outstanding memory operation use the appropriate S_WAITCNT instruction before the barrier. 16.5. SOPP Instructions 243 of 597 "RDNA3" Instruction Set Architecture 16.6. SMEM Instructions S_LOAD_B32 0 Read 1 dword from scalar data cache. If the offset is specified as an SGPR, the SGPR contains an UNSIGNED BYTE offset (the 2 LSBs are ignored). If the offset is specified as an immediate 21-bit constant, the constant is a SIGNED BYTE offset. SDATA[31 : 0] = MEM[ADDR + 0U].b S_LOAD_B64 1 Read 2 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b Notes See S_LOAD_B32 for details on the offset input. S_LOAD_B128 2 Read 4 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b; SDATA[95 : 64] = MEM[ADDR + 8U].b; SDATA[127 : 96] = MEM[ADDR + 12U].b Notes See S_LOAD_B32 for details on the offset input. S_LOAD_B256 16.6. SMEM Instructions 3 244 of 597 "RDNA3" Instruction Set Architecture Read 8 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b; SDATA[95 : 64] = MEM[ADDR + 8U].b; SDATA[127 : 96] = MEM[ADDR + 12U].b; SDATA[159 : 128] = MEM[ADDR + 16U].b; SDATA[191 : 160] = MEM[ADDR + 20U].b; SDATA[223 : 192] = MEM[ADDR + 24U].b; SDATA[255 : 224] = MEM[ADDR + 28U].b Notes See S_LOAD_B32 for details on the offset input. S_LOAD_B512 4 Read 16 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b; SDATA[95 : 64] = MEM[ADDR + 8U].b; SDATA[127 : 96] = MEM[ADDR + 12U].b; SDATA[159 : 128] = MEM[ADDR + 16U].b; SDATA[191 : 160] = MEM[ADDR + 20U].b; SDATA[223 : 192] = MEM[ADDR + 24U].b; SDATA[255 : 224] = MEM[ADDR + 28U].b; SDATA[287 : 256] = MEM[ADDR + 32U].b; SDATA[319 : 288] = MEM[ADDR + 36U].b; SDATA[351 : 320] = MEM[ADDR + 40U].b; SDATA[383 : 352] = MEM[ADDR + 44U].b; SDATA[415 : 384] = MEM[ADDR + 48U].b; SDATA[447 : 416] = MEM[ADDR + 52U].b; SDATA[479 : 448] = MEM[ADDR + 56U].b; SDATA[511 : 480] = MEM[ADDR + 60U].b Notes See S_LOAD_B32 for details on the offset input. S_BUFFER_LOAD_B32 8 Read 1 dword from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b Notes 16.6. SMEM Instructions 245 of 597 "RDNA3" Instruction Set Architecture See S_LOAD_B32 for details on the offset input. S_BUFFER_LOAD_B64 9 Read 2 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b Notes See S_LOAD_B32 for details on the offset input. S_BUFFER_LOAD_B128 10 Read 4 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b; SDATA[95 : 64] = MEM[ADDR + 8U].b; SDATA[127 : 96] = MEM[ADDR + 12U].b Notes See S_LOAD_B32 for details on the offset input. S_BUFFER_LOAD_B256 11 Read 8 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b; SDATA[95 : 64] = MEM[ADDR + 8U].b; SDATA[127 : 96] = MEM[ADDR + 12U].b; SDATA[159 : 128] = MEM[ADDR + 16U].b; SDATA[191 : 160] = MEM[ADDR + 20U].b; SDATA[223 : 192] = MEM[ADDR + 24U].b; SDATA[255 : 224] = MEM[ADDR + 28U].b Notes See S_LOAD_B32 for details on the offset input. 16.6. SMEM Instructions 246 of 597 "RDNA3" Instruction Set Architecture S_BUFFER_LOAD_B512 12 Read 16 dwords from scalar data cache. SDATA[31 : 0] = MEM[ADDR + 0U].b; SDATA[63 : 32] = MEM[ADDR + 4U].b; SDATA[95 : 64] = MEM[ADDR + 8U].b; SDATA[127 : 96] = MEM[ADDR + 12U].b; SDATA[159 : 128] = MEM[ADDR + 16U].b; SDATA[191 : 160] = MEM[ADDR + 20U].b; SDATA[223 : 192] = MEM[ADDR + 24U].b; SDATA[255 : 224] = MEM[ADDR + 28U].b; SDATA[287 : 256] = MEM[ADDR + 32U].b; SDATA[319 : 288] = MEM[ADDR + 36U].b; SDATA[351 : 320] = MEM[ADDR + 40U].b; SDATA[383 : 352] = MEM[ADDR + 44U].b; SDATA[415 : 384] = MEM[ADDR + 48U].b; SDATA[447 : 416] = MEM[ADDR + 52U].b; SDATA[479 : 448] = MEM[ADDR + 56U].b; SDATA[511 : 480] = MEM[ADDR + 60U].b Notes See S_LOAD_B32 for details on the offset input. S_GL1_INV 32 Invalidate the GL1 cache only. S_DCACHE_INV 33 Invalidate the scalar data L0 cache. 16.6. SMEM Instructions 247 of 597 "RDNA3" Instruction Set Architecture 16.7. VOP2 Instructions Instructions in this format may use a 32-bit literal constant or DPP that occurs immediately after the instruction. V_CNDMASK_B32 1 Conditional mask on each thread. D0.u = VCC.u64[laneId] ? S1.u : S0.u Notes In VOP3 the VCC source may be a scalar GPR specified in S2.u. Floating-point modifiers are valid for this instruction if S0.u and S1.u are 32-bit floating point values. This instruction is suitable for negating or taking the absolute value of a floating-point value. V_DOT2ACC_F32_F16 2 Dot product of packed FP16 values, accumulate with destination. // Accumulate with destination D0.f += 32'F(S0[15 : 0].f16) * 32'F(S1[15 : 0].f16); D0.f += 32'F(S0[31 : 16].f16) * 32'F(S1[31 : 16].f16) V_ADD_F32 3 Add two single-precision values. D0.f = S0.f + S1.f Notes 0.5ULP precision, denormals are supported. V_SUB_F32 16.7. VOP2 Instructions 4 248 of 597 "RDNA3" Instruction Set Architecture Subtract the second single-precision input from the first input. D0.f = S0.f - S1.f V_SUBREV_F32 5 Subtract the first single-precision input from the second input. D0.f = S1.f - S0.f V_FMAC_DX9_ZERO_F32 6 Multiply two single-precision values and accumulate the result with the destination. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE compliant). if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then // DX9 rules, 0.0 * x = 0.0 D0.f = S2.f else D0.f = fma(S0.f, S1.f, D0.f) endif V_MUL_DX9_ZERO_F32 7 Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE compliant). if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then // DX9 rules, 0.0 * x = 0.0 D0.f = 0.0F else D0.f = S0.f * S1.f endif V_MUL_F32 8 Multiply two single-precision values. 16.7. VOP2 Instructions 249 of 597 "RDNA3" Instruction Set Architecture D0.f = S0.f * S1.f Notes 0.5ULP precision, denormals are supported. V_MUL_I32_I24 9 Multiply two signed 24-bit integers and store the result as a signed 32-bit integer. D0.i = 32'I(S0.i24) * 32'I(S1.i24) Notes This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. See also V_MUL_HI_I32_I24. V_MUL_HI_I32_I24 10 Multiply two signed 24-bit integers and store the high 32 bits of the result as a signed 32-bit integer. D0.i = 32'I(64'I(S0.i24) * 64'I(S1.i24) >> 32U) Notes See also V_MUL_I32_I24. V_MUL_U32_U24 11 Multiply two unsigned 24-bit integers and store the result as an unsigned 32-bit integer. D0.u = 32'U(S0.u24) * 32'U(S1.u24) Notes This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. See also V_MUL_HI_U32_U24. V_MUL_HI_U32_U24 12 16.7. VOP2 Instructions 250 of 597 "RDNA3" Instruction Set Architecture Multiply two unsigned 24-bit integers and store the high 32 bits of the result as an unsigned 32-bit integer. D0.u = 32'U(64'U(S0.u24) * 64'U(S1.u24) >> 32U) Notes See also V_MUL_U32_U24. V_MIN_F32 15 Compute the minimum of two floats. LT_NEG_ZERO = lambda(a, b) ( ((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b)))); // Version of comparison where -0.0 < +0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S0.f))) elsif isSignalNAN(64'F(S1.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S1.f))) elsif isQuietNAN(64'F(S1.f)) then D0.f = S0.f elsif isQuietNAN(64'F(S0.f)) then D0.f = S1.f elsif LT_NEG_ZERO(S0.f, S1.f) then // NOTE: -0<+0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif else if isNAN(64'F(S1.f)) then D0.f = S0.f elsif isNAN(64'F(S0.f)) then D0.f = S1.f elsif LT_NEG_ZERO(S0.f, S1.f) then // NOTE: -0<+0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so 16.7. VOP2 Instructions 251 of 597 "RDNA3" Instruction Set Architecture output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_MAX_F32 16 Compute the maximum of two floats. GT_NEG_ZERO = lambda(a, b) ( ((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b)))); // Version of comparison where +0.0 > -0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S0.f))) elsif isSignalNAN(64'F(S1.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S1.f))) elsif isQuietNAN(64'F(S1.f)) then D0.f = S0.f elsif isQuietNAN(64'F(S0.f)) then D0.f = S1.f elsif GT_NEG_ZERO(S0.f, S1.f) then // NOTE: +0>-0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif else if isNAN(64'F(S1.f)) then D0.f = S0.f elsif isNAN(64'F(S0.f)) then D0.f = S1.f elsif GT_NEG_ZERO(S0.f, S1.f) then // NOTE: +0>-0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_MIN_I32 16.7. VOP2 Instructions 17 252 of 597 "RDNA3" Instruction Set Architecture Compute the minimum of two signed integers. D0.i = S0.i < S1.i ? S0.i : S1.i V_MAX_I32 18 Compute the maximum of two signed integers. D0.i = S0.i >= S1.i ? S0.i : S1.i V_MIN_U32 19 Compute the minimum of two unsigned integers. D0.u = S0.u < S1.u ? S0.u : S1.u V_MAX_U32 20 Compute the maximum of two unsigned integers. D0.u = S0.u >= S1.u ? S0.u : S1.u V_LSHLREV_B32 24 Logical shift left with shift count in the first operand. D0.u = S1.u << S0[4 : 0].u V_LSHRREV_B32 25 Logical shift right with shift count in the first operand. D0.u = S1.u >> S0[4 : 0].u 16.7. VOP2 Instructions 253 of 597 "RDNA3" Instruction Set Architecture V_ASHRREV_I32 26 Arithmetic shift right (preserve sign bit) with shift count in the first operand. D0.i = S1.i >> S0[4 : 0].u V_AND_B32 27 Bitwise AND. D0.u = (S0.u & S1.u) Notes Input and output modifiers not supported. V_OR_B32 28 Bitwise OR. D0.u = (S0.u | S1.u) Notes Input and output modifiers not supported. V_XOR_B32 29 Bitwise XOR. D0.u = (S0.u ^ S1.u) Notes Input and output modifiers not supported. V_XNOR_B32 16.7. VOP2 Instructions 30 254 of 597 "RDNA3" Instruction Set Architecture Bitwise XNOR. D0.u = ~(S0.u ^ S1.u) Notes Input and output modifiers not supported. V_ADD_CO_CI_U32 32 Add two unsigned integers and a carry-in from VCC. Store the result and also save the carry-out to VCC. tmp = 64'U(S0.u) + 64'U(S1.u) + VCC.u64[laneId].u64; VCC.u64[laneId] = tmp >= 0x100000000ULL ? 1'1U : 1'0U; D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at S2.u. V_SUB_CO_CI_U32 33 Subtract the second unsigned integer from the first unsigned integer and then subtract a carry-in from VCC. Store the result and also save the carry-out to VCC. tmp = S0.u - S1.u - VCC.u64[laneId].u; VCC.u64[laneId] = 64'U(S1.u) + VCC.u64[laneId].u64 > 64'U(S0.u) ? 1'1U : 1'0U; D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at S2.u. V_SUBREV_CO_CI_U32 34 Subtract the first unsigned integer from the second unsigned integer and then subtract a carry-in from VCC. Store the result and also save the carry-out to VCC. tmp = S1.u - S0.u - VCC.u64[laneId].u; VCC.u64[laneId] = 64'U(S1.u) + VCC.u64[laneId].u64 > 64'U(S0.u) ? 1'1U : 1'0U; 16.7. VOP2 Instructions 255 of 597 "RDNA3" Instruction Set Architecture D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at S2.u. V_ADD_NC_U32 37 Add two unsigned integers. No carry-in or carry-out. D0.u = S0.u + S1.u V_SUB_NC_U32 38 Subtract the second unsigned integer from the first unsigned integer. No carry-in or carry-out. D0.u = S0.u - S1.u V_SUBREV_NC_U32 39 Subtract the first unsigned integer from the second unsigned integer. No carry-in or carry-out. D0.u = S1.u - S0.u V_FMAC_F32 43 Fused multiply-add of single-precision floats, accumulate with destination. D0.f = fma(S0.f, S1.f, D0.f) V_FMAMK_F32 44 Multiply a single-precision float with a literal constant and add a second single-precision float using fused multiply-add. 16.7. VOP2 Instructions 256 of 597 "RDNA3" Instruction Set Architecture D0.f = fma(S0.f, SIMM32.f, S1.f) Notes This opcode cannot use the VOP3 encoding and cannot use input/output modifiers. V_FMAAK_F32 45 Multiply two single-precision floats and add a literal constant using fused multiply-add. D0.f = fma(S0.f, S1.f, SIMM32.f) Notes This opcode cannot use the VOP3 encoding and cannot use input/output modifiers. V_CVT_PK_RTZ_F16_F32 47 Convert two single-precision floats into a packed FP16 result and round to zero (ignore the current rounding mode). D0[15 : 0].f16 = f32_to_f16(S0.f); D0[31 : 16].f16 = f32_to_f16(S1.f); // Round-toward-zero regardless of current round mode setting in hardware. Notes This opcode is intended for use with 16-bit compressed exports. See V_CVT_F16_F32 for a version that respects the current rounding mode. V_ADD_F16 50 Add two FP16 values. D0.f16 = S0.f16 + S1.f16 Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. 16.7. VOP2 Instructions 257 of 597 "RDNA3" Instruction Set Architecture V_SUB_F16 51 Subtract the second FP16 value from the first. D0.f16 = S0.f16 - S1.f16 Notes 0.5ULP precision, Supports denormals, round mode, exception flags and saturation. V_SUBREV_F16 52 Subtract the first FP16 value from the second. D0.f16 = S1.f16 - S0.f16 Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. V_MUL_F16 53 Multiply two FP16 values. D0.f16 = S0.f16 * S1.f16 Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. V_FMAC_F16 54 Fused multiply-add of FP16 values, accumulate with destination. D0.f16 = fma(S0.f16, S1.f16, D0.f16) Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. 16.7. VOP2 Instructions 258 of 597 "RDNA3" Instruction Set Architecture V_FMAMK_F16 55 Multiply a FP16 value with a literal constant and add a second FP16 value using fused multiply-add. D0.f16 = fma(S0.f16, SIMM32.f16, S1.f16) Notes This opcode cannot use the VOP3 encoding and cannot use input/output modifiers. Supports round mode, exception flags, saturation. V_FMAAK_F16 56 Multiply two FP16 values and add a literal constant using fused multiply-add. D0.f16 = fma(S0.f16, S1.f16, SIMM32.f16) Notes This opcode cannot use the VOP3 encoding and cannot use input/output modifiers. Supports round mode, exception flags, saturation. V_MAX_F16 57 Compute the maximum of two floats. GT_NEG_ZERO = lambda(a, b) ( ((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b)))); // Version of comparison where +0.0 > -0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16))) elsif isSignalNAN(64'F(S1.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16))) elsif isQuietNAN(64'F(S1.f16)) then D0.f16 = S0.f16 elsif isQuietNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif GT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: +0>-0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 endif else if isNAN(64'F(S1.f16)) then D0.f16 = S0.f16 16.7. VOP2 Instructions 259 of 597 "RDNA3" Instruction Set Architecture elsif isNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif GT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: +0>-0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_MIN_F16 58 Compute the minimum of two floats. LT_NEG_ZERO = lambda(a, b) ( ((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b)))); // Version of comparison where -0.0 < +0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16))) elsif isSignalNAN(64'F(S1.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16))) elsif isQuietNAN(64'F(S1.f16)) then D0.f16 = S0.f16 elsif isQuietNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif LT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: -0<+0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 endif else if isNAN(64'F(S1.f16)) then D0.f16 = S0.f16 elsif isNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif LT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: -0<+0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 16.7. VOP2 Instructions 260 of 597 "RDNA3" Instruction Set Architecture endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_LDEXP_F16 59 Load exponent. Multiply an FP16 value by an integral power of 2, compare with the ldexp() function in C. The second argument is an integer value. D0.f16 = S0.f16 * f32_to_f16(2.0F ** 32'I(S1.i16)) V_PK_FMAC_F16 60 Multiply packed FP16 values and accumulate with destination. D0[31 : 16].f16 = fma(S0[31 : 16].f16, S1[31 : 16].f16, D0[31 : 16].f16); D0[15 : 0].f16 = fma(S0[15 : 0].f16, S1[15 : 0].f16, D0[15 : 0].f16) Notes VOP2 version of V_PK_FMA_F16 with third source VGPR address is the destination. 16.7.1. VOP2 using VOP3 or VOP3SD encoding Instructions in this format may also be encoded as VOP3. VOP3 allows access to the extra control bits (e.g. ABS, OMOD) at the expense of a larger instruction word. The VOP3 opcode is: VOP2 opcode + 0x100. 16.7. VOP2 Instructions 261 of 597 "RDNA3" Instruction Set Architecture 16.7. VOP2 Instructions 262 of 597 "RDNA3" Instruction Set Architecture 16.8. VOP1 Instructions Instructions in this format may use a 32-bit literal constant or DPP that occurs immediately after the instruction. V_NOP 0 Do nothing. V_MOV_B32 1 Move data to a VGPR. D0.b = S0.b Notes Floating-point modifiers are valid for this instruction if S0.u is a 32-bit floating point value. This instruction is suitable for negating or taking the absolute value of a floating-point value. Functional examples: v_mov_b32 v0, v1 // Move v1 to v0 v_mov_b32 v0, -v1 // Set v1 to the negation of v0 v_mov_b32 v0, abs(v1) // Set v1 to the absolute value of v0 V_READFIRSTLANE_B32 2 Copy one VGPR value from the lowest active lane to one SGPR. declare lane : 32'U; if WAVE64 then // 64 lanes if EXEC == 0x0LL then lane = 0U; // Force lane 0 if all lanes are disabled else lane = 32'U(s_ff1_i32_b64(EXEC)); // Lowest active lane endif else 16.8. VOP1 Instructions 263 of 597 "RDNA3" Instruction Set Architecture // 32 lanes if EXEC_LO.i == 0 then lane = 0U; // Force lane 0 if all lanes are disabled else lane = 32'U(s_ff1_i32_b32(EXEC_LO)); // Lowest active lane endif endif; D0.b = VGPR[lane][SRC0.u] Notes Ignores EXEC mask for the VGPR read. Input and output modifiers not supported; this is an untyped operation. V_CVT_I32_F64 3 Convert from a double-precision float to a signed integer. D0.i = f64_to_i32(S0.f64) Notes 0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_F64_I32 4 Convert from a signed integer to a double-precision float. D0.f64 = i32_to_f64(S0.i) Notes 0ULP accuracy. V_CVT_F32_I32 5 Convert from a signed integer to a single-precision float. D0.f = i32_to_f32(S0.i) 16.8. VOP1 Instructions 264 of 597 "RDNA3" Instruction Set Architecture Notes 0.5ULP accuracy. V_CVT_F32_U32 6 Convert from an unsigned integer to a single-precision float. D0.f = u32_to_f32(S0.u) Notes 0.5ULP accuracy. V_CVT_U32_F32 7 Convert from a single-precision float to an unsigned integer. D0.u = f32_to_u32(S0.f) Notes 1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_I32_F32 8 Convert from a single-precision float to a signed integer. D0.i = f32_to_i32(S0.f) Notes 1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_F16_F32 16.8. VOP1 Instructions 10 265 of 597 "RDNA3" Instruction Set Architecture Convert from a single-precision float to an FP16 float. D0.f16 = f32_to_f16(S0.f) Notes 0.5ULP accuracy, supports input modifiers and creates FP16 denormals when appropriate. Flush denorms on output if specified based on DP denorm mode. Output rounding based on DP rounding mode. V_CVT_F32_F16 11 Convert from an FP16 float to a single-precision float. D0.f = f16_to_f32(S0.f16) Notes 0ULP accuracy, FP16 denormal inputs are accepted. Flush denorms on input if specified based on DP denorm mode. V_CVT_NEAREST_I32_F32 12 Convert from a single-precision float to a signed integer, round to nearest integer. D0.i = f32_to_i32(floor(S0.f + 0.5F)) Notes 0.5ULP accuracy, denormals are supported. V_CVT_FLOOR_I32_F32 13 Convert from a single-precision float to a signed integer, round down. D0.i = f32_to_i32(floor(S0.f)) Notes 1ULP accuracy, denormals are supported. 16.8. VOP1 Instructions 266 of 597 "RDNA3" Instruction Set Architecture V_CVT_OFF_F32_I4 14 4-bit signed int to 32-bit float. Used for interpolation in shader. Lookup table on S0[3:0]: S0 binary Result 1000 -0.5000f 1001 -0.4375f 1010 -0.3750f 1011 -0.3125f 1100 -0.2500f 1101 -0.1875f 1110 -0.1250f 1111 -0.0625f 0000 +0.0000f 0001 +0.0625f 0010 +0.1250f 0011 +0.1875f 0100 +0.2500f 0101 +0.3125f 0110 +0.3750f 0111 +0.4375f declare CVT_OFF_TABLE : 32'F[16]; D0.f = CVT_OFF_TABLE[S0.u[3 : 0]] V_CVT_F32_F64 15 Convert from a double-precision float to a single-precision float. D0.f = f64_to_f32(S0.f64) Notes 0.5ULP accuracy, denormals are supported. V_CVT_F64_F32 16 Convert from a single-precision float to a double-precision float. D0.f64 = f32_to_f64(S0.f) Notes 16.8. VOP1 Instructions 267 of 597 "RDNA3" Instruction Set Architecture 0ULP accuracy, denormals are supported. V_CVT_F32_UBYTE0 17 Convert an unsigned byte (byte 0) to a single-precision float. D0.f = u32_to_f32(S0.u[7 : 0].u) V_CVT_F32_UBYTE1 18 Convert an unsigned byte (byte 1) to a single-precision float. D0.f = u32_to_f32(S0.u[15 : 8].u) V_CVT_F32_UBYTE2 19 Convert an unsigned byte (byte 2) to a single-precision float. D0.f = u32_to_f32(S0.u[23 : 16].u) V_CVT_F32_UBYTE3 20 Convert an unsigned byte (byte 3) to a single-precision float. D0.f = u32_to_f32(S0.u[31 : 24].u) V_CVT_U32_F64 21 Convert from a double-precision float to an unsigned integer. D0.u = f64_to_u32(S0.f64) Notes 0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. 16.8. VOP1 Instructions 268 of 597 "RDNA3" Instruction Set Architecture Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_F64_U32 22 Convert from an unsigned integer to a double-precision float. D0.f64 = u32_to_f64(S0.u) Notes 0ULP accuracy. V_TRUNC_F64 23 Return integer part of a number with round-to-zero semantics. D0.f64 = trunc(S0.f64) V_CEIL_F64 24 Round up to next whole integer. D0.f64 = trunc(S0.f64); if ((S0.f64 > 0.0) && (S0.f64 != D0.f64)) then D0.f64 += 1.0 endif V_RNDNE_F64 25 Round-to-nearest-even semantics. D0.f64 = floor(S0.f64 + 0.5); if (isEven(floor(S0.f64)) && (fract(S0.f64) == 0.5)) then D0.f64 -= 1.0 endif V_FLOOR_F64 16.8. VOP1 Instructions 26 269 of 597 "RDNA3" Instruction Set Architecture Round down to previous whole integer. D0.f64 = trunc(S0.f64); if ((S0.f64 < 0.0) && (S0.f64 != D0.f64)) then D0.f64 += -1.0 endif V_PIPEFLUSH 27 Flush the VALU destination cache. V_MOV_B16 28 Move data to a VGPR. D0.b16 = S0.b16 Notes Floating-point modifiers are valid for this instruction if S0.u16 is a 16-bit floating point value. This instruction is suitable for negating or taking the absolute value of a floating-point value. V_FRACT_F32 32 Return fractional portion of a number. D0.f = S0.f + -floor(S0.f) Notes 0.5ULP accuracy, denormals are accepted. This complies with the DX specification of fract where the function behaves like an extension of integer modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8 in DX. Obey round mode, result clamped to 0x3f7fffff. V_TRUNC_F32 33 Return integer part of a number with round-to-zero semantics. 16.8. VOP1 Instructions 270 of 597 "RDNA3" Instruction Set Architecture D0.f = trunc(S0.f) V_CEIL_F32 34 Round up to next whole integer. D0.f = trunc(S0.f); if ((S0.f > 0.0F) && (S0.f != D0.f)) then D0.f += 1.0F endif V_RNDNE_F32 35 Round-to-nearest-even semantics. D0.f = floor(S0.f + 0.5F); if (isEven(64'F(floor(S0.f))) && (fract(S0.f) == 0.5F)) then D0.f -= 1.0F endif V_FLOOR_F32 36 Round down to previous whole integer. D0.f = trunc(S0.f); if ((S0.f < 0.0F) && (S0.f != D0.f)) then D0.f += -1.0F endif V_EXP_F32 37 Base 2 exponentiation. D0.f = pow(2.0F, S0.f) Notes 1ULP accuracy, denormals are flushed. 16.8. VOP1 Instructions 271 of 597 "RDNA3" Instruction Set Architecture Functional examples: V_EXP_F32(0xff800000) ⇒ 0x00000000 // exp(-INF) = 0 V_EXP_F32(0x80000000) ⇒ 0x3f800000 // exp(-0.0) = 1 V_EXP_F32(0x7f800000) ⇒ 0x7f800000 // exp(+INF) = +INF V_LOG_F32 39 Base 2 logarithm. D0.f = log2(S0.f) Notes 1ULP accuracy, denormals are flushed. Functional examples: V_LOG_F32(0xff800000) ⇒ 0xffc00000 // log(-INF) = NAN V_LOG_F32(0xbf800000) ⇒ 0xffc00000 // log(-1.0) = NAN V_LOG_F32(0x80000000) ⇒ 0xff800000 // log(-0.0) = -INF V_LOG_F32(0x00000000) ⇒ 0xff800000 // log(+0.0) = -INF V_LOG_F32(0x3f800000) ⇒ 0x00000000 // log(+1.0) = 0 V_LOG_F32(0x7f800000) ⇒ 0x7f800000 // log(+INF) = +INF V_RCP_F32 42 Compute reciprocal with IEEE rules. D0.f = 1.0F / S0.f Notes 1ULP accuracy. Accuracy converges to < 0.5ULP when using the Newton-Raphson method and 2 FMA operations. Denormals are flushed. Functional examples: V_RCP_F32(0xff800000) ⇒ 0x80000000 // rcp(-INF) = -0 V_RCP_F32(0xc0000000) ⇒ 0xbf000000 // rcp(-2.0) = -0.5 V_RCP_F32(0x80000000) ⇒ 0xff800000 // rcp(-0.0) = -INF V_RCP_F32(0x00000000) ⇒ 0x7f800000 // rcp(+0.0) = +INF V_RCP_F32(0x7f800000) ⇒ 0x00000000 // rcp(+INF) = +0 16.8. VOP1 Instructions 272 of 597 "RDNA3" Instruction Set Architecture V_RCP_IFLAG_F32 43 Compute reciprocal as part of integer divide. D0.f = 1.0F / S0.f; // Can only raise integer DIV_BY_ZERO exception Notes Can raise integer DIV_BY_ZERO exception but cannot raise floating-point exceptions. To be used in an integer reciprocal macro by the compiler with one of the sequences listed below (depending on signed or unsigned operation). Unsigned usage: CVT_F32_U32 RCP_IFLAG_F32 MUL_F32 (2**32 - 1) CVT_U32_F32 + Signed usage: CVT_F32_I32 RCP_IFLAG_F32 MUL_F32 (2**31 - 1) CVT_I32_F32 V_RSQ_F32 46 Reciprocal square root with IEEE rules. D0.f = 1.0F / sqrt(S0.f) Notes 1ULP accuracy, denormals are flushed. Functional examples: V_RSQ_F32(0xff800000) ⇒ 0xffc00000 // rsq(-INF) = NAN V_RSQ_F32(0x80000000) ⇒ 0xff800000 // rsq(-0.0) = -INF V_RSQ_F32(0x00000000) ⇒ 0x7f800000 // rsq(+0.0) = +INF V_RSQ_F32(0x40800000) ⇒ 0x3f000000 // rsq(+4.0) = +0.5 V_RSQ_F32(0x7f800000) ⇒ 0x00000000 // rsq(+INF) = +0 V_RCP_F64 16.8. VOP1 Instructions 47 273 of 597 "RDNA3" Instruction Set Architecture Reciprocal with IEEE rules. D0.f64 = 1.0 / S0.f64 Notes This opcode has (2**29)ULP accuracy and supports denormals. V_RSQ_F64 49 Reciprocal square root with IEEE rules. D0.f64 = 1.0 / sqrt(S0.f64) Notes This opcode has (2**29)ULP accuracy and supports denormals. V_SQRT_F32 51 Square root. D0.f = sqrt(S0.f) Notes 1ULP accuracy, denormals are flushed. Functional examples: V_SQRT_F32(0xff800000) ⇒ 0xffc00000 // sqrt(-INF) = NAN V_SQRT_F32(0x80000000) ⇒ 0x80000000 // sqrt(-0.0) = -0 V_SQRT_F32(0x00000000) ⇒ 0x00000000 // sqrt(+0.0) = +0 V_SQRT_F32(0x40800000) ⇒ 0x40000000 // sqrt(+4.0) = +2.0 V_SQRT_F32(0x7f800000) ⇒ 0x7f800000 // sqrt(+INF) = +INF V_SQRT_F64 52 Square root. D0.f64 = sqrt(S0.f64) 16.8. VOP1 Instructions 274 of 597 "RDNA3" Instruction Set Architecture Notes This opcode has (2**29)ULP accuracy and supports denormals. V_SIN_F32 53 Trigonometric sine. D0.f = 32'F(sin(64'F(S0.f) * 2.0 * PI)) Notes Denormals are supported. Full range input is supported. Functional examples: V_SIN_F32(0xff800000) ⇒ 0xffc00000 // sin(-INF) = NAN V_SIN_F32(0xff7fffff) ⇒ 0x00000000 // -MaxFloat, finite V_SIN_F32(0x80000000) ⇒ 0x80000000 // sin(-0.0) = -0 V_SIN_F32(0x3e800000) ⇒ 0x3f800000 // sin(0.25) = 1 V_SIN_F32(0x7f800000) ⇒ 0xffc00000 // sin(+INF) = NAN V_COS_F32 54 Trigonometric cosine. D0.f = 32'F(cos(64'F(S0.f) * 2.0 * PI)) Notes Denormals are supported. Full range input is supported. Functional examples: V_COS_F32(0xff800000) ⇒ 0xffc00000 // cos(-INF) = NAN V_COS_F32(0xff7fffff) ⇒ 0x3f800000 // -MaxFloat, finite V_COS_F32(0x80000000) ⇒ 0x3f800000 // cos(-0.0) = 1 V_COS_F32(0x3e800000) ⇒ 0x00000000 // cos(0.25) = 0 V_COS_F32(0x7f800000) ⇒ 0xffc00000 // cos(+INF) = NAN V_NOT_B32 55 Bitwise negation. 16.8. VOP1 Instructions 275 of 597 "RDNA3" Instruction Set Architecture D0.u = ~S0.u Notes Input and output modifiers not supported. V_BFREV_B32 56 Bitfield reverse. D0.u[31 : 0] = S0.u[0 : 31] Notes Input and output modifiers not supported. V_CLZ_I32_U32 57 Count leading zeros. Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones. D0.i = -1; // Set if no ones are found for i in 0 : 31 do // Search from MSB if S0.u[31 - i] == 1'1U then D0.i = i; break endif endfor Notes Compare with S_CLZ_I32_U32, which performs the equivalent operation in the scalar ALU. Functional examples: V_CLZ_I32_U32(0x00000000) ⇒ 0xffffffff V_CLZ_I32_U32(0x800000ff) ⇒ 0 V_CLZ_I32_U32(0x100000ff) ⇒ 3 V_CLZ_I32_U32(0x0000ffff) ⇒ 16 V_CLZ_I32_U32(0x00000001) ⇒ 31 16.8. VOP1 Instructions 276 of 597 "RDNA3" Instruction Set Architecture V_CTZ_I32_B32 58 Count trailing zeros. Returns the bit position of the first one from the LSB, or -1 if there are no ones. D0.i = -1; // Set if no ones are found for i in 0 : 31 do // Search from LSB if S0.u[i] == 1'1U then D0.i = i; break endif endfor Notes Compare with S_CTZ_I32_B32, which performs the equivalent operation in the scalar ALU. Functional examples: V_CTZ_I32_B32(0x00000000) ⇒ 0xffffffff V_CTZ_I32_B32(0xff000001) ⇒ 0 V_CTZ_I32_B32(0xff000008) ⇒ 3 V_CTZ_I32_B32(0xffff0000) ⇒ 16 V_CTZ_I32_B32(0x80000000) ⇒ 31 V_CLS_I32 59 Count leading sign bits. Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the same. D0.i = -1; // Set if all bits are the same for i in 1 : 31 do // Search from MSB if S0.i[31 - i] != S0.i[31] then D0.i = i; break endif endfor Notes Compare with S_CLS_I32, which performs the equivalent operation in the scalar ALU. 16.8. VOP1 Instructions 277 of 597 "RDNA3" Instruction Set Architecture Functional examples: V_CLS_I32(0x00000000) ⇒ 0xffffffff V_CLS_I32(0x40000000) ⇒ 1 V_CLS_I32(0x80000000) ⇒ 1 V_CLS_I32(0x0fffffff) ⇒ 4 V_CLS_I32(0xffff0000) ⇒ 16 V_CLS_I32(0xfffffffe) ⇒ 31 V_CLS_I32(0xffffffff) ⇒ 0xffffffff V_FREXP_EXP_I32_F64 60 Returns exponent of single precision float input. This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). See also V_FREXP_MANT_F64, which returns the significand. See the C library function frexp() for more information. if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then D0.i = 0 else D0.i = exponent(S0.f64) - 1023 + 1 endif V_FREXP_MANT_F64 61 Returns binary significand of double precision float input. This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). Result range is in (-1.0,-0.5][0.5,1.0) in normal cases. See also V_FREXP_EXP_I32_F64, which returns integer exponent. See the C library function frexp() for more information. if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then D0.f64 = S0.f64 else D0.f64 = mantissa(S0.f64) endif V_FRACT_F64 62 Return fractional portion of a number. D0.f64 = S0.f64 + -floor(S0.f64) 16.8. VOP1 Instructions 278 of 597 "RDNA3" Instruction Set Architecture Notes 0.5ULP accuracy, denormals are accepted. This complies with the DX specification of fract where the function behaves like an extension of integer modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8 in DX. Obey round mode, result clamped to 0x3fefffffffffffff. V_FREXP_EXP_I32_F32 63 Returns exponent of single precision float input. This operation satisfies the invariant S0.f = significand * (2 ** exponent). See also V_FREXP_MANT_F32, which returns the significand. See the C library function frexp() for more information. if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then D0.i = 0 else D0.i = exponent(S0.f) - 127 + 1 endif V_FREXP_MANT_F32 64 Returns binary significand of single precision float input. This operation satisfies the invariant S0.f = significand * (2 ** exponent). Result range is in (-1.0,-0.5][0.5,1.0) in normal cases. See also V_FREXP_EXP_I32_F32, which returns integer exponent. See the C library function frexp() for more information. if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then D0.f = S0.f else D0.f = mantissa(S0.f) endif V_MOVRELD_B32 66 Move to a relative destination address. addr = DST.u; // Raw value from instruction addr += M0.u[31 : 0]; 16.8. VOP1 Instructions 279 of 597 "RDNA3" Instruction Set Architecture VGPR[laneId][addr].b = S0.b Notes Example: The following instruction sequence performs the move v15 <= v7: s_mov_b32 m0, 10 v_movreld_b32 v5, v7 V_MOVRELS_B32 67 Move from a relative source address. addr = SRC0.u; // Raw value from instruction addr += M0.u[31 : 0]; D0.b = VGPR[laneId][addr].b Notes Example: The following instruction sequence performs the move v5 <= v17: s_mov_b32 m0, 10 v_movrels_b32 v5, v7 V_MOVRELSD_B32 68 Move from a relative source address to a relative destination address. addrs = SRC0.u; // Raw value from instruction addrd = DST.u; // Raw value from instruction addrs += M0.u[31 : 0]; addrd += M0.u[31 : 0]; VGPR[laneId][addrd].b = VGPR[laneId][addrs].b Notes Example: The following instruction sequence performs the move v15 <= v17: s_mov_b32 m0, 10 16.8. VOP1 Instructions 280 of 597 "RDNA3" Instruction Set Architecture v_movrelsd_b32 v5, v7 V_MOVRELSD_2_B32 72 Move from a relative source address to a relative destination address, with different relative offsets. addrs = SRC0.u; // Raw value from instruction addrd = DST.u; // Raw value from instruction addrs += M0.u[9 : 0].u; addrd += M0.u[25 : 16].u; VGPR[laneId][addrd].b = VGPR[laneId][addrs].b Notes Example: The following instruction sequence performs the move v25 <= v17: s_mov_b32 m0, ((20 << 16) | 10) v_movrelsd_2_b32 v5, v7 V_CVT_F16_U16 80 Convert from an unsigned short to an FP16 float. D0.f16 = u16_to_f16(S0.u16) Notes 0.5ULP accuracy, supports denormals, rounding, exception flags and saturation. V_CVT_F16_I16 81 Convert from a signed short to an FP16 float. D0.f16 = i16_to_f16(S0.i16) Notes 0.5ULP accuracy, supports denormals, rounding, exception flags and saturation. 16.8. VOP1 Instructions 281 of 597 "RDNA3" Instruction Set Architecture V_CVT_U16_F16 82 Convert from an FP16 float to an unsigned short. D0.u16 = f16_to_u16(S0.f16) Notes 1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion is done with truncation. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_I16_F16 83 Convert from an FP16 float to a signed short. D0.i16 = f16_to_i16(S0.f16) Notes 1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion is done with truncation. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_RCP_F16 84 Reciprocal with IEEE rules. D0.f16 = 16'1.0 / S0.f16 Notes 0.51ULP accuracy. Functional examples: V_RCP_F16(0xfc00) ⇒ 0x8000 // rcp(-INF) = -0 V_RCP_F16(0xc000) ⇒ 0xb800 // rcp(-2.0) = -0.5 V_RCP_F16(0x8000) ⇒ 0xfc00 // rcp(-0.0) = -INF 16.8. VOP1 Instructions 282 of 597 "RDNA3" Instruction Set Architecture V_RCP_F16(0x0000) ⇒ 0x7c00 // rcp(+0.0) = +INF V_RCP_F16(0x7c00) ⇒ 0x0000 // rcp(+INF) = +0 V_SQRT_F16 85 Square root. D0.f16 = sqrt(S0.f16) Notes 0.51ULP accuracy, denormals are supported. Functional examples: V_SQRT_F16(0xfc00) ⇒ 0xfe00 // sqrt(-INF) = NAN V_SQRT_F16(0x8000) ⇒ 0x8000 // sqrt(-0.0) = -0 V_SQRT_F16(0x0000) ⇒ 0x0000 // sqrt(+0.0) = +0 V_SQRT_F16(0x4400) ⇒ 0x4000 // sqrt(+4.0) = +2.0 V_SQRT_F16(0x7c00) ⇒ 0x7c00 // sqrt(+INF) = +INF V_RSQ_F16 86 Reciprocal square root with IEEE rules. D0.f16 = 16'1.0 / sqrt(S0.f16) Notes 0.51ULP accuracy, denormals are supported. Functional examples: V_RSQ_F16(0xfc00) ⇒ 0xfe00 // rsq(-INF) = NAN V_RSQ_F16(0x8000) ⇒ 0xfc00 // rsq(-0.0) = -INF V_RSQ_F16(0x0000) ⇒ 0x7c00 // rsq(+0.0) = +INF V_RSQ_F16(0x4400) ⇒ 0x3800 // rsq(+4.0) = +0.5 V_RSQ_F16(0x7c00) ⇒ 0x0000 // rsq(+INF) = +0 V_LOG_F16 87 Base 2 logarithm. 16.8. VOP1 Instructions 283 of 597 "RDNA3" Instruction Set Architecture D0.f16 = log2(S0.f16) Notes 0.51ULP accuracy, denormals are supported. Functional examples: V_LOG_F16(0xfc00) ⇒ 0xfe00 // log(-INF) = NAN V_LOG_F16(0xbc00) ⇒ 0xfe00 // log(-1.0) = NAN V_LOG_F16(0x8000) ⇒ 0xfc00 // log(-0.0) = -INF V_LOG_F16(0x0000) ⇒ 0xfc00 // log(+0.0) = -INF V_LOG_F16(0x3c00) ⇒ 0x0000 // log(+1.0) = 0 V_LOG_F16(0x7c00) ⇒ 0x7c00 // log(+INF) = +INF V_EXP_F16 88 Base 2 exponentiation. D0.f16 = pow(16'2.0, S0.f16) Notes 0.51ULP accuracy, denormals are supported. Functional examples: V_EXP_F16(0xfc00) ⇒ 0x0000 // exp(-INF) = 0 V_EXP_F16(0x8000) ⇒ 0x3c00 // exp(-0.0) = 1 V_EXP_F16(0x7c00) ⇒ 0x7c00 // exp(+INF) = +INF V_FREXP_MANT_F16 89 Returns binary significand of half precision float input. This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). Result range is in (-1.0,-0.5][0.5,1.0) in normal cases. See also V_FREXP_EXP_I16_F16, which returns integer exponent. See the C library function frexp() for more information. if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then D0.f16 = S0.f16 else D0.f16 = mantissa(S0.f16) endif 16.8. VOP1 Instructions 284 of 597 "RDNA3" Instruction Set Architecture V_FREXP_EXP_I16_F16 90 Returns exponent of half precision float input. This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). See also V_FREXP_MANT_F16, which returns the significand. See the C library function frexp() for more information. if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then D0.i = 0 else D0.i = exponent(S0.f16) - 15 + 1 endif V_FLOOR_F16 91 Round down to previous whole integer. D0.f16 = trunc(S0.f16); if ((S0.f16 < 16'0.0) && (S0.f16 != D0.f16)) then D0.f16 += -16'1.0 endif V_CEIL_F16 92 Round up to next whole integer. D0.f16 = trunc(S0.f16); if ((S0.f16 > 16'0.0) && (S0.f16 != D0.f16)) then D0.f16 += 16'1.0 endif V_TRUNC_F16 93 Return integer part of a number with round-to-zero semantics. D0.f16 = trunc(S0.f16) V_RNDNE_F16 16.8. VOP1 Instructions 94 285 of 597 "RDNA3" Instruction Set Architecture Round-to-nearest-even semantics. D0.f16 = floor(S0.f16 + 16'0.5); if (isEven(64'F(floor(S0.f16))) && (fract(S0.f16) == 16'0.5)) then D0.f16 -= 16'1.0 endif V_FRACT_F16 95 Return fractional portion of a number. D0.f16 = S0.f16 + -floor(S0.f16) Notes 0.5ULP accuracy, denormals are accepted. This complies with the DX specification of fract where the function behaves like an extension of integer modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8 in DX. V_SIN_F16 96 Trigonometric sine. D0.f16 = 16'F(sin(64'F(S0.f16) * 2.0 * PI)) Notes Denormals are supported. Full range input is supported. Functional examples: V_SIN_F16(0xfc00) ⇒ 0xfe00 // sin(-INF) = NAN V_SIN_F16(0xfbff) ⇒ 0x0000 // Most negative finite FP16 V_SIN_F16(0x8000) ⇒ 0x8000 // sin(-0.0) = -0 V_SIN_F16(0x3400) ⇒ 0x3c00 // sin(0.25) = 1 V_SIN_F16(0x7bff) ⇒ 0x0000 // Most positive finite FP16 V_SIN_F16(0x7c00) ⇒ 0xfe00 // sin(+INF) = NAN V_COS_F16 97 Trigonometric cosine. 16.8. VOP1 Instructions 286 of 597 "RDNA3" Instruction Set Architecture D0.f16 = 16'F(cos(64'F(S0.f16) * 2.0 * PI)) Notes Denormals are supported. Full range input is supported. Functional examples: V_COS_F16(0xfc00) ⇒ 0xfe00 // cos(-INF) = NAN V_COS_F16(0xfbff) ⇒ 0x3c00 // Most negative finite FP16 V_COS_F16(0x8000) ⇒ 0x3c00 // cos(-0.0) = 1 V_COS_F16(0x3400) ⇒ 0x0000 // cos(0.25) = 0 V_COS_F16(0x7bff) ⇒ 0x3c00 // Most positive finite FP16 V_COS_F16(0x7c00) ⇒ 0xfe00 // cos(+INF) = NAN V_SAT_PK_U8_I16 98 Packed 8-bit saturating of packed 16-bit integer values. Used for 4x16bit data packed as 4x8bit data. SAT8 = lambda(n) ( if n.i <= 0 then return 8'0U elsif n >= 16'I(0xff) then return 8'255U else return n[7 : 0].u8 endif); D0.b16 = { SAT8(S0[31 : 16].i16), SAT8(S0[15 : 0].i16) } V_CVT_NORM_I16_F16 99 Convert from an FP16 float to a signed normalized short. D0.i16 = f16_to_snorm(S0.f16) Notes 0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported. V_CVT_NORM_U16_F16 100 Convert from an FP16 float to an unsigned normalized short. 16.8. VOP1 Instructions 287 of 597 "RDNA3" Instruction Set Architecture D0.u16 = f16_to_unorm(S0.f16) Notes 0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported. V_SWAP_B32 101 Swap values of two operands. tmp = D0.b; D0.b = S0.b; S0.b = tmp Notes Input and output modifiers not supported; this is an untyped operation. V_SWAP_B16 102 Swap values of two operands. tmp = D0.b16; D0.b16 = S0.b16; S0.b16 = tmp Notes Input and output modifiers not supported; this is an untyped operation. V_PERMLANE64_B32 103 Perform a specific permutation across lanes where the high half and low half of a wave64 are swapped. Performs no operation in wave32 mode. declare tmp : 32'B[64]; declare lane : 32'U; if WAVE32 then // Supported in wave64 ONLY v_nop() else for lane in 0U : 63U do // Copy original S0 in case D==S0 16.8. VOP1 Instructions 288 of 597 "RDNA3" Instruction Set Architecture tmp[lane] = VGPR[lane][SRC0.u] endfor; for lane in 0U : 63U do altlane = { ~lane[5], lane[4 : 0] }; // 0<->32, ..., 31<->63 if EXEC[lane].u1 then VGPR[lane][VDST.u] = tmp[altlane] endif endfor endif Notes In wave32 mode this opcode is translated to V_NOP and performs no writes. In wave64 the EXEC mask of the destination lane is used as the read mask for the alternate lane; as a result this opcode may read values from disabled lanes. The source must be a VGPR and SVGPRs are not allowed for this opcode. ABS, NEG and OMOD modifiers should all be zeroed for this instruction. V_SWAPREL_B32 104 Swap values of two operands. The two addresses are relatively indexed using M0. addrs = SRC0.u; // Raw value from instruction addrd = DST.u; // Raw value from instruction addrs += M0.u[9 : 0].u; addrd += M0.u[25 : 16].u; tmp = VGPR[laneId][addrd].b; VGPR[laneId][addrd].b = VGPR[laneId][addrs].b; VGPR[laneId][addrs].b = tmp Notes Input and output modifiers not supported; this is an untyped operation. Example: The following instruction sequence swaps v25 and v17: s_mov_b32 m0, ((20 << 16) | 10) v_swaprel_b32 v5, v7 V_NOT_B16 16.8. VOP1 Instructions 105 289 of 597 "RDNA3" Instruction Set Architecture Bitwise negation. D0.u16 = ~S0.u16 Notes Input and output modifiers not supported. V_CVT_I32_I16 106 Convert from an 16-bit signed integer to a 32-bit signed integer, sign extending as needed. D0.i = 32'I(signext(S0.i16)) Notes To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16. V_CVT_U32_U16 107 Convert from an 16-bit unsigned integer to a 32-bit unsigned integer, zero extending as needed. D0 = { 16'0, S0.u16 } Notes To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16. 16.8.1. VOP1 using VOP3 encoding Instructions in this format may also be encoded as VOP3. VOP3 allows access to the extra control bits (e.g. ABS, OMOD) at the expense of a larger instruction word. The VOP3 opcode is: VOP2 opcode + 0x180. 16.8. VOP1 Instructions 290 of 597 "RDNA3" Instruction Set Architecture 16.9. VOPC Instructions The bitfield map for VOPC is: SRC0 = First operand for instruction. VSRC1 = Second operand for instruction. OP = Instruction opcode. All VOPC instructions can alternatively be encoded in the VOP3 format. Compare instructions perform the same compare operation on each lane (work-Item or thread) using that lane’s private data, and producing a 1 bit result per lane into VCC or EXEC. Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction. Most compare instructions fall into one of two categories: • Those which can use one of 16 compare operations (floating point types). "{COMPF}" • Those which can use one of 8 compare operations (integer types). "{COMPI}" The opcode number is such that for these the opcode number can be calculated from a base opcode number for the data type, plus an offset for the specific compare operation. Table 112. Sixteen Compare Operations Compare Operation Opcode Offset Description F 0 D.u = 0 LT 1 D.u = (S0 < S1) EQ 2 D.u = (S0 == S1) LE 3 D.u = (S0 <= S1) GT 4 D.u = (S0 > S1) LG 5 D.u = (S0 <> S1) GE 6 D.u = (S0 >= S1) O 7 D.u = (!isNaN(S0) && !isNaN(S1)) U 8 D.u = (!isNaN(S0) || !isNaN(S1)) NGE 9 D.u = !(S0 >= S1) NLG 10 D.u = !(S0 <> S1) NGT 11 D.u = !(S0 > S1) NLE 12 D.u = !(S0 <= S1) NEQ 13 D.u = !(S0 == S1) NLT 14 D.u = !(S0 < S1) TRU 15 D.u = 1 Table 113. Instructions with Sixteen Compare Operations Instruction Description Hex Range V_CMP_{COMPF}_F16 16-bit float compare. 0x20 to 0x2F V_CMPX_{COMPF}_F16 16-bit float compare. Also writes EXEC. 0x30 to 0x3F V_CMP_{COMPF}_F32 32-bit float compare. 0x40 to 0x4F 16.9. VOPC Instructions 291 of 597 "RDNA3" Instruction Set Architecture Instruction Description Hex Range V_CMPX_{COMPF}_F32 32-bit float compare. Also writes EXEC. 0x50 to 0x5F V_CMP_{COMPF}_F64 64-bit float compare. 0x60 to 0x6F V_CMPX_{COMPF}_F64 64-bit float compare. Also writes EXEC. 0x70 to 0x7F Table 114. Eight Compare Operations Compare Operation Opcode Offset Description F 0 D.u = 0 LT 1 D.u = (S0 < S1) EQ 2 D.u = (S0 == S1) LE 3 D.u = (S0 <= S1) GT 4 D.u = (S0 > S1) LG 5 D.u = (S0 <> S1) GE 6 D.u = (S0 >= S1) TRU 7 D.u = 1 Table 115. Instructions with Eight Compare Operations Instruction Description Hex Range V_CMP_{COMPI}_I16 16-bit signed integer compare. 0xA0 - 0xA7 V_CMP_{COMPI}_U16 16-bit signed integer compare. Also writes EXEC. 0xA8 - 0xAF V_CMPX_{COMPI}_I16 16-bit unsigned integer compare. 0xB0 - 0xB7 V_CMPX_{COMPI}_U16 16-bit unsigned integer compare. Also writes EXEC. 0xB8 - 0xBF V_CMP_{COMPI}_I32 32-bit signed integer compare. 0xC0 - 0xC7 V_CMP_{COMPI}_U32 32-bit signed integer compare. Also writes EXEC. 0xC8 - 0xCF V_CMPX_{COMPI}_I32 32-bit unsigned integer compare. 0xD0 - 0xD7 V_CMPX_{COMPI}_U32 32-bit unsigned integer compare. Also writes EXEC. 0xD8 - 0xDF V_CMP_{COMPI}_I64 64-bit signed integer compare. 0xE0 - 0xE7 V_CMP_{COMPI}_U64 64-bit signed integer compare. Also writes EXEC. 0xE8 - 0xEF V_CMPX_{COMPI}_I64 64-bit unsigned integer compare. 0xF0 - 0xF7 V_CMPX_{COMPI}_U64 64-bit unsigned integer compare. Also writes EXEC. 0xF8 - 0xFF V_CMP_F_F16 0 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_F16 1 A less than B. 16.9. VOPC Instructions 292 of 597 "RDNA3" Instruction Set Architecture D0.u64[laneId] = S0.f16 < S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_F16 2 A equal to B. D0.u64[laneId] = S0.f16 == S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_F16 3 A less than or equal to B. D0.u64[laneId] = S0.f16 <= S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_F16 4 A greater than B. D0.u64[laneId] = S0.f16 > S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 293 of 597 "RDNA3" Instruction Set Architecture V_CMP_LG_F16 5 A less than or greater than B. D0.u64[laneId] = S0.f16 <> S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_F16 6 A greater than or equal to B. D0.u64[laneId] = S0.f16 >= S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_O_F16 7 A orderable with B. D0.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16))); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_U_F16 8 A not orderable with B. D0.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16))); // D0 = VCC in VOPC encoding. Notes 16.9. VOPC Instructions 294 of 597 "RDNA3" Instruction Set Architecture Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGE_F16 9 A not greater than or equal to B. D0.u64[laneId] = !(S0.f16 >= S1.f16); // With NAN inputs this is not the same operation as < // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLG_F16 10 A not less than or greater than B. D0.u64[laneId] = !(S0.f16 <> S1.f16); // With NAN inputs this is not the same operation as == // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGT_F16 11 A not greater than B. D0.u64[laneId] = !(S0.f16 > S1.f16); // With NAN inputs this is not the same operation as <= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLE_F16 12 A not less than or equal to B. 16.9. VOPC Instructions 295 of 597 "RDNA3" Instruction Set Architecture D0.u64[laneId] = !(S0.f16 <= S1.f16); // With NAN inputs this is not the same operation as > // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NEQ_F16 13 A not equal to B. D0.u64[laneId] = !(S0.f16 == S1.f16); // With NAN inputs this is not the same operation as != // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLT_F16 14 A not less than B. D0.u64[laneId] = !(S0.f16 < S1.f16); // With NAN inputs this is not the same operation as >= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_F16 15 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 296 of 597 "RDNA3" Instruction Set Architecture V_CMP_F_F32 16 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_F32 17 A less than B. D0.u64[laneId] = S0.f < S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_F32 18 A equal to B. D0.u64[laneId] = S0.f == S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_F32 19 A less than or equal to B. D0.u64[laneId] = S0.f <= S1.f; // D0 = VCC in VOPC encoding. 16.9. VOPC Instructions 297 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_F32 20 A greater than B. D0.u64[laneId] = S0.f > S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LG_F32 21 A less than or greater than B. D0.u64[laneId] = S0.f <> S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_F32 22 A greater than or equal to B. D0.u64[laneId] = S0.f >= S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_O_F32 23 A orderable with B. 16.9. VOPC Instructions 298 of 597 "RDNA3" Instruction Set Architecture D0.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f))); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_U_F32 24 A not orderable with B. D0.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f))); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGE_F32 25 A not greater than or equal to B. D0.u64[laneId] = !(S0.f >= S1.f); // With NAN inputs this is not the same operation as < // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLG_F32 26 A not less than or greater than B. D0.u64[laneId] = !(S0.f <> S1.f); // With NAN inputs this is not the same operation as == // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 299 of 597 "RDNA3" Instruction Set Architecture V_CMP_NGT_F32 27 A not greater than B. D0.u64[laneId] = !(S0.f > S1.f); // With NAN inputs this is not the same operation as <= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLE_F32 28 A not less than or equal to B. D0.u64[laneId] = !(S0.f <= S1.f); // With NAN inputs this is not the same operation as > // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NEQ_F32 29 A not equal to B. D0.u64[laneId] = !(S0.f == S1.f); // With NAN inputs this is not the same operation as != // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLT_F32 30 A not less than B. D0.u64[laneId] = !(S0.f < S1.f); // With NAN inputs this is not the same operation as >= 16.9. VOPC Instructions 300 of 597 "RDNA3" Instruction Set Architecture // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_F32 31 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_F64 32 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_F64 33 A less than B. D0.u64[laneId] = S0.f64 < S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_F64 16.9. VOPC Instructions 34 301 of 597 "RDNA3" Instruction Set Architecture A equal to B. D0.u64[laneId] = S0.f64 == S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_F64 35 A less than or equal to B. D0.u64[laneId] = S0.f64 <= S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_F64 36 A greater than B. D0.u64[laneId] = S0.f64 > S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LG_F64 37 A less than or greater than B. D0.u64[laneId] = S0.f64 <> S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 302 of 597 "RDNA3" Instruction Set Architecture V_CMP_GE_F64 38 A greater than or equal to B. D0.u64[laneId] = S0.f64 >= S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_O_F64 39 A orderable with B. D0.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64)); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_U_F64 40 A not orderable with B. D0.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64)); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGE_F64 41 A not greater than or equal to B. D0.u64[laneId] = !(S0.f64 >= S1.f64); // With NAN inputs this is not the same operation as < // D0 = VCC in VOPC encoding. 16.9. VOPC Instructions 303 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLG_F64 42 A not less than or greater than B. D0.u64[laneId] = !(S0.f64 <> S1.f64); // With NAN inputs this is not the same operation as == // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGT_F64 43 A not greater than B. D0.u64[laneId] = !(S0.f64 > S1.f64); // With NAN inputs this is not the same operation as <= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLE_F64 44 A not less than or equal to B. D0.u64[laneId] = !(S0.f64 <= S1.f64); // With NAN inputs this is not the same operation as > // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NEQ_F64 16.9. VOPC Instructions 45 304 of 597 "RDNA3" Instruction Set Architecture A not equal to B. D0.u64[laneId] = !(S0.f64 == S1.f64); // With NAN inputs this is not the same operation as != // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLT_F64 46 A not less than B. D0.u64[laneId] = !(S0.f64 < S1.f64); // With NAN inputs this is not the same operation as >= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_F64 47 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_I16 49 A less than B. D0.u64[laneId] = S0.i16 < S1.i16; // D0 = VCC in VOPC encoding. Notes 16.9. VOPC Instructions 305 of 597 "RDNA3" Instruction Set Architecture Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_I16 50 A equal to B. D0.u64[laneId] = S0.i16 == S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_I16 51 A less than or equal to B. D0.u64[laneId] = S0.i16 <= S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_I16 52 A greater than B. D0.u64[laneId] = S0.i16 > S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_I16 53 A not equal to B. D0.u64[laneId] = S0.i16 <> S1.i16; 16.9. VOPC Instructions 306 of 597 "RDNA3" Instruction Set Architecture // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_I16 54 A greater than or equal to B. D0.u64[laneId] = S0.i16 >= S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_U16 57 A less than B. D0.u64[laneId] = S0.u16 < S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_U16 58 A equal to B. D0.u64[laneId] = S0.u16 == S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_U16 16.9. VOPC Instructions 59 307 of 597 "RDNA3" Instruction Set Architecture A less than or equal to B. D0.u64[laneId] = S0.u16 <= S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_U16 60 A greater than B. D0.u64[laneId] = S0.u16 > S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_U16 61 A not equal to B. D0.u64[laneId] = S0.u16 <> S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_U16 62 A greater than or equal to B. D0.u64[laneId] = S0.u16 >= S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 308 of 597 "RDNA3" Instruction Set Architecture V_CMP_F_I32 64 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_I32 65 A less than B. D0.u64[laneId] = S0.i < S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_I32 66 A equal to B. D0.u64[laneId] = S0.i == S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_I32 67 A less than or equal to B. D0.u64[laneId] = S0.i <= S1.i; // D0 = VCC in VOPC encoding. 16.9. VOPC Instructions 309 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_I32 68 A greater than B. D0.u64[laneId] = S0.i > S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_I32 69 A not equal to B. D0.u64[laneId] = S0.i <> S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_I32 70 A greater than or equal to B. D0.u64[laneId] = S0.i >= S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_I32 71 True. 16.9. VOPC Instructions 310 of 597 "RDNA3" Instruction Set Architecture D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_U32 72 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_U32 73 A less than B. D0.u64[laneId] = S0.u < S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_U32 74 A equal to B. D0.u64[laneId] = S0.u == S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 311 of 597 "RDNA3" Instruction Set Architecture V_CMP_LE_U32 75 A less than or equal to B. D0.u64[laneId] = S0.u <= S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_U32 76 A greater than B. D0.u64[laneId] = S0.u > S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_U32 77 A not equal to B. D0.u64[laneId] = S0.u <> S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_U32 78 A greater than or equal to B. D0.u64[laneId] = S0.u >= S1.u; // D0 = VCC in VOPC encoding. Notes 16.9. VOPC Instructions 312 of 597 "RDNA3" Instruction Set Architecture Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_U32 79 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_I64 80 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_I64 81 A less than B. D0.u64[laneId] = S0.i64 < S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_I64 82 A equal to B. D0.u64[laneId] = S0.i64 == S1.i64; 16.9. VOPC Instructions 313 of 597 "RDNA3" Instruction Set Architecture // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_I64 83 A less than or equal to B. D0.u64[laneId] = S0.i64 <= S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_I64 84 A greater than B. D0.u64[laneId] = S0.i64 > S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_I64 85 A not equal to B. D0.u64[laneId] = S0.i64 <> S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_I64 16.9. VOPC Instructions 86 314 of 597 "RDNA3" Instruction Set Architecture A greater than or equal to B. D0.u64[laneId] = S0.i64 >= S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_I64 87 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_U64 88 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_U64 89 A less than B. D0.u64[laneId] = S0.u64 < S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 315 of 597 "RDNA3" Instruction Set Architecture V_CMP_EQ_U64 90 A equal to B. D0.u64[laneId] = S0.u64 == S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_U64 91 A less than or equal to B. D0.u64[laneId] = S0.u64 <= S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_U64 92 A greater than B. D0.u64[laneId] = S0.u64 > S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_U64 93 A not equal to B. D0.u64[laneId] = S0.u64 <> S1.u64; // D0 = VCC in VOPC encoding. 16.9. VOPC Instructions 316 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_U64 94 A greater than or equal to B. D0.u64[laneId] = S0.u64 >= S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_U64 95 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_CLASS_F16 125 IEEE numeric class function specified in S1.u, performed on S0.f16. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. 16.9. VOPC Instructions 317 of 597 "RDNA3" Instruction Set Architecture declare result : 1'U; if isSignalNAN(64'F(S0.f16)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f16)) then result = S1.u[1] elsif exponent(S0.f16) == 31 then // +-INF result = S1.u[sign(S0.f16) ? 2 : 9] elsif exponent(S0.f16) > 0 then // +-normal value result = S1.u[sign(S0.f16) ? 3 : 8] elsif 64'F(abs(S0.f16)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f16) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f16) ? 5 : 6] endif; D0.u64[laneId] = result; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_CLASS_F32 126 IEEE numeric class function specified in S1.u, performed on S0.f. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(64'F(S0.f)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f)) then result = S1.u[1] elsif exponent(S0.f) == 255 then // +-INF result = S1.u[sign(S0.f) ? 2 : 9] 16.9. VOPC Instructions 318 of 597 "RDNA3" Instruction Set Architecture elsif exponent(S0.f) > 0 then // +-normal value result = S1.u[sign(S0.f) ? 3 : 8] elsif 64'F(abs(S0.f)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f) ? 5 : 6] endif; D0.u64[laneId] = result; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_CLASS_F64 127 IEEE numeric class function specified in S1.u, performed on S0.f64. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(S0.f64) then result = S1.u[0] elsif isQuietNAN(S0.f64) then result = S1.u[1] elsif exponent(S0.f64) == 1023 then // +-INF result = S1.u[sign(S0.f64) ? 2 : 9] elsif exponent(S0.f64) > 0 then // +-normal value result = S1.u[sign(S0.f64) ? 3 : 8] elsif abs(S0.f64) > 0.0 then // +-denormal value result = S1.u[sign(S0.f64) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f64) ? 5 : 6] 16.9. VOPC Instructions 319 of 597 "RDNA3" Instruction Set Architecture endif; D0.u64[laneId] = result; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_F16 128 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_F16 129 A less than B. EXEC.u64[laneId] = S0.f16 < S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_F16 130 A equal to B. EXEC.u64[laneId] = S0.f16 == S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_F16 16.9. VOPC Instructions 131 320 of 597 "RDNA3" Instruction Set Architecture A less than or equal to B. EXEC.u64[laneId] = S0.f16 <= S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_F16 132 A greater than B. EXEC.u64[laneId] = S0.f16 > S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LG_F16 133 A less than or greater than B. EXEC.u64[laneId] = S0.f16 <> S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_F16 134 A greater than or equal to B. EXEC.u64[laneId] = S0.f16 >= S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_O_F16 16.9. VOPC Instructions 135 321 of 597 "RDNA3" Instruction Set Architecture A orderable with B. EXEC.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16))) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_U_F16 136 A not orderable with B. EXEC.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16))) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGE_F16 137 A not greater than or equal to B. EXEC.u64[laneId] = !(S0.f16 >= S1.f16); // With NAN inputs this is not the same operation as < Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLG_F16 138 A not less than or greater than B. EXEC.u64[laneId] = !(S0.f16 <> S1.f16); // With NAN inputs this is not the same operation as == Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 322 of 597 "RDNA3" Instruction Set Architecture V_CMPX_NGT_F16 139 A not greater than B. EXEC.u64[laneId] = !(S0.f16 > S1.f16); // With NAN inputs this is not the same operation as <= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLE_F16 140 A not less than or equal to B. EXEC.u64[laneId] = !(S0.f16 <= S1.f16); // With NAN inputs this is not the same operation as > Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NEQ_F16 141 A not equal to B. EXEC.u64[laneId] = !(S0.f16 == S1.f16); // With NAN inputs this is not the same operation as != Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLT_F16 142 A not less than B. EXEC.u64[laneId] = !(S0.f16 < S1.f16); // With NAN inputs this is not the same operation as >= Notes 16.9. VOPC Instructions 323 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_F16 143 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_F32 144 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_F32 145 A less than B. EXEC.u64[laneId] = S0.f < S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_F32 146 A equal to B. EXEC.u64[laneId] = S0.f == S1.f Notes 16.9. VOPC Instructions 324 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_F32 147 A less than or equal to B. EXEC.u64[laneId] = S0.f <= S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_F32 148 A greater than B. EXEC.u64[laneId] = S0.f > S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LG_F32 149 A less than or greater than B. EXEC.u64[laneId] = S0.f <> S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_F32 150 A greater than or equal to B. EXEC.u64[laneId] = S0.f >= S1.f Notes 16.9. VOPC Instructions 325 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_O_F32 151 A orderable with B. EXEC.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f))) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_U_F32 152 A not orderable with B. EXEC.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f))) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGE_F32 153 A not greater than or equal to B. EXEC.u64[laneId] = !(S0.f >= S1.f); // With NAN inputs this is not the same operation as < Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLG_F32 154 A not less than or greater than B. EXEC.u64[laneId] = !(S0.f <> S1.f); // With NAN inputs this is not the same operation as == 16.9. VOPC Instructions 326 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGT_F32 155 A not greater than B. EXEC.u64[laneId] = !(S0.f > S1.f); // With NAN inputs this is not the same operation as <= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLE_F32 156 A not less than or equal to B. EXEC.u64[laneId] = !(S0.f <= S1.f); // With NAN inputs this is not the same operation as > Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NEQ_F32 157 A not equal to B. EXEC.u64[laneId] = !(S0.f == S1.f); // With NAN inputs this is not the same operation as != Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLT_F32 158 A not less than B. 16.9. VOPC Instructions 327 of 597 "RDNA3" Instruction Set Architecture EXEC.u64[laneId] = !(S0.f < S1.f); // With NAN inputs this is not the same operation as >= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_F32 159 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_F64 160 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_F64 161 A less than B. EXEC.u64[laneId] = S0.f64 < S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_F64 16.9. VOPC Instructions 162 328 of 597 "RDNA3" Instruction Set Architecture A equal to B. EXEC.u64[laneId] = S0.f64 == S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_F64 163 A less than or equal to B. EXEC.u64[laneId] = S0.f64 <= S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_F64 164 A greater than B. EXEC.u64[laneId] = S0.f64 > S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LG_F64 165 A less than or greater than B. EXEC.u64[laneId] = S0.f64 <> S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_F64 16.9. VOPC Instructions 166 329 of 597 "RDNA3" Instruction Set Architecture A greater than or equal to B. EXEC.u64[laneId] = S0.f64 >= S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_O_F64 167 A orderable with B. EXEC.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64)) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_U_F64 168 A not orderable with B. EXEC.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64)) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGE_F64 169 A not greater than or equal to B. EXEC.u64[laneId] = !(S0.f64 >= S1.f64); // With NAN inputs this is not the same operation as < Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9. VOPC Instructions 330 of 597 "RDNA3" Instruction Set Architecture V_CMPX_NLG_F64 170 A not less than or greater than B. EXEC.u64[laneId] = !(S0.f64 <> S1.f64); // With NAN inputs this is not the same operation as == Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGT_F64 171 A not greater than B. EXEC.u64[laneId] = !(S0.f64 > S1.f64); // With NAN inputs this is not the same operation as <= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLE_F64 172 A not less than or equal to B. EXEC.u64[laneId] = !(S0.f64 <= S1.f64); // With NAN inputs this is not the same operation as > Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NEQ_F64 173 A not equal to B. EXEC.u64[laneId] = !(S0.f64 == S1.f64); // With NAN inputs this is not the same operation as != Notes 16.9. VOPC Instructions 331 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLT_F64 174 A not less than B. EXEC.u64[laneId] = !(S0.f64 < S1.f64); // With NAN inputs this is not the same operation as >= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_F64 175 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_I16 177 A less than B. EXEC.u64[laneId] = S0.i16 < S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_I16 178 A equal to B. EXEC.u64[laneId] = S0.i16 == S1.i16 16.9. VOPC Instructions 332 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_I16 179 A less than or equal to B. EXEC.u64[laneId] = S0.i16 <= S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_I16 180 A greater than B. EXEC.u64[laneId] = S0.i16 > S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_I16 181 A not equal to B. EXEC.u64[laneId] = S0.i16 <> S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_I16 182 A greater than or equal to B. EXEC.u64[laneId] = S0.i16 >= S1.i16 16.9. VOPC Instructions 333 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_U16 185 A less than B. EXEC.u64[laneId] = S0.u16 < S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_U16 186 A equal to B. EXEC.u64[laneId] = S0.u16 == S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_U16 187 A less than or equal to B. EXEC.u64[laneId] = S0.u16 <= S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_U16 188 A greater than B. EXEC.u64[laneId] = S0.u16 > S1.u16 16.9. VOPC Instructions 334 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_U16 189 A not equal to B. EXEC.u64[laneId] = S0.u16 <> S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_U16 190 A greater than or equal to B. EXEC.u64[laneId] = S0.u16 >= S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_I32 192 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_I32 193 A less than B. EXEC.u64[laneId] = S0.i < S1.i 16.9. VOPC Instructions 335 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_I32 194 A equal to B. EXEC.u64[laneId] = S0.i == S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_I32 195 A less than or equal to B. EXEC.u64[laneId] = S0.i <= S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_I32 196 A greater than B. EXEC.u64[laneId] = S0.i > S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_I32 197 A not equal to B. EXEC.u64[laneId] = S0.i <> S1.i 16.9. VOPC Instructions 336 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_I32 198 A greater than or equal to B. EXEC.u64[laneId] = S0.i >= S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_I32 199 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_U32 200 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_U32 201 A less than B. EXEC.u64[laneId] = S0.u < S1.u 16.9. VOPC Instructions 337 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_U32 202 A equal to B. EXEC.u64[laneId] = S0.u == S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_U32 203 A less than or equal to B. EXEC.u64[laneId] = S0.u <= S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_U32 204 A greater than B. EXEC.u64[laneId] = S0.u > S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_U32 205 A not equal to B. EXEC.u64[laneId] = S0.u <> S1.u 16.9. VOPC Instructions 338 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_U32 206 A greater than or equal to B. EXEC.u64[laneId] = S0.u >= S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_U32 207 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_I64 208 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_I64 209 A less than B. EXEC.u64[laneId] = S0.i64 < S1.i64 16.9. VOPC Instructions 339 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_I64 210 A equal to B. EXEC.u64[laneId] = S0.i64 == S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_I64 211 A less than or equal to B. EXEC.u64[laneId] = S0.i64 <= S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_I64 212 A greater than B. EXEC.u64[laneId] = S0.i64 > S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_I64 213 A not equal to B. EXEC.u64[laneId] = S0.i64 <> S1.i64 16.9. VOPC Instructions 340 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_I64 214 A greater than or equal to B. EXEC.u64[laneId] = S0.i64 >= S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_I64 215 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_U64 216 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_U64 217 A less than B. EXEC.u64[laneId] = S0.u64 < S1.u64 16.9. VOPC Instructions 341 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_U64 218 A equal to B. EXEC.u64[laneId] = S0.u64 == S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_U64 219 A less than or equal to B. EXEC.u64[laneId] = S0.u64 <= S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_U64 220 A greater than B. EXEC.u64[laneId] = S0.u64 > S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_U64 221 A not equal to B. EXEC.u64[laneId] = S0.u64 <> S1.u64 16.9. VOPC Instructions 342 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_U64 222 A greater than or equal to B. EXEC.u64[laneId] = S0.u64 >= S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_U64 223 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_CLASS_F16 253 IEEE numeric class function specified in S1.u, performed on S0.f16. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; 16.9. VOPC Instructions 343 of 597 "RDNA3" Instruction Set Architecture if isSignalNAN(64'F(S0.f16)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f16)) then result = S1.u[1] elsif exponent(S0.f16) == 31 then // +-INF result = S1.u[sign(S0.f16) ? 2 : 9] elsif exponent(S0.f16) > 0 then // +-normal value result = S1.u[sign(S0.f16) ? 3 : 8] elsif 64'F(abs(S0.f16)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f16) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f16) ? 5 : 6] endif; EXEC.u64[laneId] = result Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_CLASS_F32 254 IEEE numeric class function specified in S1.u, performed on S0.f. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(64'F(S0.f)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f)) then result = S1.u[1] elsif exponent(S0.f) == 255 then // +-INF result = S1.u[sign(S0.f) ? 2 : 9] elsif exponent(S0.f) > 0 then // +-normal value result = S1.u[sign(S0.f) ? 3 : 8] 16.9. VOPC Instructions 344 of 597 "RDNA3" Instruction Set Architecture elsif 64'F(abs(S0.f)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f) ? 5 : 6] endif; EXEC.u64[laneId] = result Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_CLASS_F64 255 IEEE numeric class function specified in S1.u, performed on S0.f64. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(S0.f64) then result = S1.u[0] elsif isQuietNAN(S0.f64) then result = S1.u[1] elsif exponent(S0.f64) == 1023 then // +-INF result = S1.u[sign(S0.f64) ? 2 : 9] elsif exponent(S0.f64) > 0 then // +-normal value result = S1.u[sign(S0.f64) ? 3 : 8] elsif abs(S0.f64) > 0.0 then // +-denormal value result = S1.u[sign(S0.f64) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f64) ? 5 : 6] endif; EXEC.u64[laneId] = result 16.9. VOPC Instructions 345 of 597 "RDNA3" Instruction Set Architecture Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.9.1. VOPC using VOP3 encoding Instructions in this format may also be encoded as VOP3. VOP3 allows access to the extra control bits (e.g. ABS, OMOD) at the expense of a larger instruction word. The VOP3 opcode is: VOP2 opcode + 0x000. When the CLAMP microcode bit is set to 1, these compare instructions signal an exception when either of the inputs is NaN. When CLAMP is set to zero, NaN does not signal an exception. The second eight VOPC instructions have {OP8} embedded in them. This refers to each of the compare operations listed below. VDST = Destination for instruction in the VGPR. ABS = Floating-point absolute value. CLMP = Clamp output. OP = Instruction opcode. SRC0 = First operand for instruction. SRC1 = Second operand for instruction. SRC2 = Third operand for instruction. Unused in VOPC instructions. OMOD = Output modifier for instruction. Unused in VOPC instructions. NEG = Floating-point negation. 16.9. VOPC Instructions 346 of 597 "RDNA3" Instruction Set Architecture 16.10. VOP3P Instructions V_PK_MAD_I16 0 Packed multiply-add on signed shorts. D0[31 : 16].i16 = S0[31 : 16].i16 * S1[31 : 16].i16 + S2[31 : 16].i16; D0[15 : 0].i16 = S0[15 : 0].i16 * S1[15 : 0].i16 + S2[15 : 0].i16 V_PK_MUL_LO_U16 1 Packed multiply on unsigned shorts. D0[31 : 16].u16 = S0[31 : 16].u16 * S1[31 : 16].u16; D0[15 : 0].u16 = S0[15 : 0].u16 * S1[15 : 0].u16 V_PK_ADD_I16 2 Packed addition on signed shorts. D0[31 : 16].i16 = S0[31 : 16].i16 + S1[31 : 16].i16; D0[15 : 0].i16 = S0[15 : 0].i16 + S1[15 : 0].i16 V_PK_SUB_I16 3 Packed subtraction on signed shorts. The second operand is subtracted from the first. D0[31 : 16].i16 = S0[31 : 16].i16 - S1[31 : 16].i16; D0[15 : 0].i16 = S0[15 : 0].i16 - S1[15 : 0].i16 V_PK_LSHLREV_B16 4 Packed logical shift left. The shift count is in the first operand. 16.10. VOP3P Instructions 347 of 597 "RDNA3" Instruction Set Architecture D0[31 : 16].u16 = S1[31 : 16].u16 << S0.u[19 : 16].u; D0[15 : 0].u16 = S1[15 : 0].u16 << S0.u[3 : 0].u V_PK_LSHRREV_B16 5 Packed logical shift right. The shift count is in the first operand. D0[31 : 16].u16 = S1[31 : 16].u16 >> S0.u[19 : 16].u; D0[15 : 0].u16 = S1[15 : 0].u16 >> S0.u[3 : 0].u V_PK_ASHRREV_I16 6 Packed arithmetic shift right (preserve sign bit). The shift count is in the first operand. D0[31 : 16].i16 = S1[31 : 16].i16 >> S0.u[19 : 16].u; D0[15 : 0].i16 = S1[15 : 0].i16 >> S0.u[3 : 0].u V_PK_MAX_I16 7 Packed maximum of signed shorts. D0[31 : 16].i16 = S0[31 : 16].i16 >= S1[31 : 16].i16 ? S0[31 : 16].i16 : S1[31 : 16].i16; D0[15 : 0].i16 = S0[15 : 0].i16 >= S1[15 : 0].i16 ? S0[15 : 0].i16 : S1[15 : 0].i16 V_PK_MIN_I16 8 Packed minimum of signed shorts. D0[31 : 16].i16 = S0[31 : 16].i16 < S1[31 : 16].i16 ? S0[31 : 16].i16 : S1[31 : 16].i16; D0[15 : 0].i16 = S0[15 : 0].i16 < S1[15 : 0].i16 ? S0[15 : 0].i16 : S1[15 : 0].i16 V_PK_MAD_U16 9 Packed multiply-add on unsigned shorts. 16.10. VOP3P Instructions 348 of 597 "RDNA3" Instruction Set Architecture D0[31 : 16].u16 = S0[31 : 16].u16 * S1[31 : 16].u16 + S2[31 : 16].u16; D0[15 : 0].u16 = S0[15 : 0].u16 * S1[15 : 0].u16 + S2[15 : 0].u16 V_PK_ADD_U16 10 Packed addition on unsigned shorts. D0[31 : 16].u16 = S0[31 : 16].u16 + S1[31 : 16].u16; D0[15 : 0].u16 = S0[15 : 0].u16 + S1[15 : 0].u16 V_PK_SUB_U16 11 Packed subtraction on unsigned shorts. The second operand is subtracted from the first. D0[31 : 16].u16 = S0[31 : 16].u16 - S1[31 : 16].u16; D0[15 : 0].u16 = S0[15 : 0].u16 - S1[15 : 0].u16 V_PK_MAX_U16 12 Packed maximum of unsigned shorts. D0[31 : 16].u16 = S0[31 : 16].u16 >= S1[31 : 16].u16 ? S0[31 : 16].u16 : S1[31 : 16].u16; D0[15 : 0].u16 = S0[15 : 0].u16 >= S1[15 : 0].u16 ? S0[15 : 0].u16 : S1[15 : 0].u16 V_PK_MIN_U16 13 Packed minimum of unsigned shorts. D0[31 : 16].u16 = S0[31 : 16].u16 < S1[31 : 16].u16 ? S0[31 : 16].u16 : S1[31 : 16].u16; D0[15 : 0].u16 = S0[15 : 0].u16 < S1[15 : 0].u16 ? S0[15 : 0].u16 : S1[15 : 0].u16 V_PK_FMA_F16 14 Packed fused-multiply-add of FP16 values. 16.10. VOP3P Instructions 349 of 597 "RDNA3" Instruction Set Architecture D0[31 : 16].f16 = fma(S0[31 : 16].f16, S1[31 : 16].f16, S2[31 : 16].f16); D0[15 : 0].f16 = fma(S0[15 : 0].f16, S1[15 : 0].f16, S2[15 : 0].f16) V_PK_ADD_F16 15 Packed addition of FP16 values. D0[31 : 16].f16 = S0[31 : 16].f16 + S1[31 : 16].f16; D0[15 : 0].f16 = S0[15 : 0].f16 + S1[15 : 0].f16 V_PK_MUL_F16 16 Packed multiply of FP16 values. D0[31 : 16].f16 = S0[31 : 16].f16 * S1[31 : 16].f16; D0[15 : 0].f16 = S0[15 : 0].f16 * S1[15 : 0].f16 V_PK_MIN_F16 17 Packed minimum of FP16 values. D0[31 : 16].f16 = v_min_f16(S0[31 : 16].f16, S1[31 : 16].f16); D0[15 : 0].f16 = v_min_f16(S0[15 : 0].f16, S1[15 : 0].f16) V_PK_MAX_F16 18 Packed maximum of FP16 values. D0[31 : 16].f16 = v_max_f16(S0[31 : 16].f16, S1[31 : 16].f16); D0[15 : 0].f16 = v_max_f16(S0[15 : 0].f16, S1[15 : 0].f16) V_DOT2_F32_F16 19 Dot product of packed FP16 values. 16.10. VOP3P Instructions 350 of 597 "RDNA3" Instruction Set Architecture tmp = 32'F(S0[15 : 0].f16) * 32'F(S1[15 : 0].f16); tmp += 32'F(S0[31 : 16].f16) * 32'F(S1[31 : 16].f16); tmp += S2.f; D0.f = tmp V_DOT4_I32_IU8 22 Dot product of signed or unsigned bytes. declare A : 32'I[4]; declare B : 32'I[4]; // Figure out whether inputs are signed/unsigned. for i in 0 : 3 do A8 = S0[i * 8 + 7 : i * 8]; B8 = S1[i * 8 + 7 : i * 8]; A[i] = NEG[0].u1 ? 32'I(signext(A8.i8)) : 32'I(32'U(A8.u8)); B[i] = NEG[1].u1 ? 32'I(signext(B8.i8)) : 32'I(32'U(B8.u8)) endfor; C = S2.i; // Signed multiplier/adder. Extend unsigned inputs with leading 0. D0.i = A[0] * B[0]; D0.i += A[1] * B[1]; D0.i += A[2] * B[2]; D0.i += A[3] * B[3]; D0.i += C Notes This opcode does not depend on the inference or deep learning features being enabled. V_DOT4_U32_U8 23 Dot product of unsigned bytes. tmp = 32'U(S0[7 : 0].u8) * 32'U(S1[7 : 0].u8); tmp += 32'U(S0[15 : 8].u8) * 32'U(S1[15 : 8].u8); tmp += 32'U(S0[23 : 16].u8) * 32'U(S1[23 : 16].u8); tmp += 32'U(S0[31 : 24].u8) * 32'U(S1[31 : 24].u8); tmp += S2.u; D0.u = tmp Notes This opcode does not depend on the inference or deep learning features being enabled. 16.10. VOP3P Instructions 351 of 597 "RDNA3" Instruction Set Architecture V_DOT8_I32_IU4 24 Dot product of signed or unsigned nibbles. declare A : 32'I[8]; declare B : 32'I[8]; // Figure out whether inputs are signed/unsigned. for i in 0 : 7 do A4 = S0[i * 4 + 3 : i * 4]; B4 = S1[i * 4 + 3 : i * 4]; A[i] = NEG[0].u1 ? 32'I(signext(A4.i4)) : 32'I(32'U(A4.u4)); B[i] = NEG[1].u1 ? 32'I(signext(B4.i4)) : 32'I(32'U(B4.u4)) endfor; C = S2.i; // Signed multiplier/adder. Extend unsigned inputs with leading 0. D0.i = A[0] * B[0]; D0.i += A[1] * B[1]; D0.i += A[2] * B[2]; D0.i += A[3] * B[3]; D0.i += A[4] * B[4]; D0.i += A[5] * B[5]; D0.i += A[6] * B[6]; D0.i += A[7] * B[7]; D0.i += C V_DOT8_U32_U4 25 Dot product of unsigned nibbles. tmp = 32'U(S0[3 : 0].u4) * 32'U(S1[3 : 0].u4); tmp += 32'U(S0[7 : 4].u4) * 32'U(S1[7 : 4].u4); tmp += 32'U(S0[11 : 8].u4) * 32'U(S1[11 : 8].u4); tmp += 32'U(S0[15 : 12].u4) * 32'U(S1[15 : 12].u4); tmp += 32'U(S0[19 : 16].u4) * 32'U(S1[19 : 16].u4); tmp += 32'U(S0[23 : 20].u4) * 32'U(S1[23 : 20].u4); tmp += 32'U(S0[27 : 24].u4) * 32'U(S1[27 : 24].u4); tmp += 32'U(S0[31 : 28].u4) * 32'U(S1[31 : 28].u4); tmp += S2.u; D0.u = tmp V_DOT2_F32_BF16 26 Dot product of packed brain-float values. tmp = 32'F(S0[15 : 0].bf16) * 32'F(S1[15 : 0].bf16); tmp += 32'F(S0[31 : 16].bf16) * 32'F(S1[31 : 16].bf16); tmp += S2.f; 16.10. VOP3P Instructions 352 of 597 "RDNA3" Instruction Set Architecture D0.f = tmp V_FMA_MIX_F32 32 Fused-multiply-add of single-precision values with MIX encoding. Size and location of S0, S1 and S2 controlled by OPSEL: 0=src[31:0], 1=src[31:0], 2=src[15:0], 3=src[31:16]. Also, for FMA_MIX, the NEG_HI field acts instead as an absolute-value modifier. declare in : 32'F[3]; declare S : 32'B[3]; for i in 0 : 2 do if !OPSEL_HI.u3[i] then in[i] = S[i].f elsif OPSEL.u3[i] then in[i] = f16_to_f32(S[i][31 : 16].f16) else in[i] = f16_to_f32(S[i][15 : 0].f16) endif endfor; D0[31 : 0].f = fma(in[0], in[1], in[2]) V_FMA_MIXLO_F16 33 Fused-multiply-add of FP16 values with MIX encoding, result stored in low 16 bits of destination. Size and location of S0, S1 and S2 controlled by OPSEL: 0=src[31:0], 1=src[31:0], 2=src[15:0], 3=src[31:16]. Also, for FMA_MIX, the NEG_HI field acts instead as an absolute-value modifier. declare in : 32'F[3]; declare S : 32'B[3]; for i in 0 : 2 do if !OPSEL_HI.u3[i] then in[i] = S[i].f elsif OPSEL.u3[i] then in[i] = f16_to_f32(S[i][31 : 16].f16) else in[i] = f16_to_f32(S[i][15 : 0].f16) endif endfor; D0[15 : 0].f16 = f32_to_f16(fma(in[0], in[1], in[2])) V_FMA_MIXHI_F16 34 Fused-multiply-add of FP16 values with MIX encoding, result stored in HIGH 16 bits of destination. 16.10. VOP3P Instructions 353 of 597 "RDNA3" Instruction Set Architecture Size and location of S0, S1 and S2 controlled by OPSEL: 0=src[31:0], 1=src[31:0], 2=src[15:0], 3=src[31:16]. Also, for FMA_MIX, the NEG_HI field acts instead as an absolute-value modifier. declare in : 32'F[3]; declare S : 32'B[3]; for i in 0 : 2 do if !OPSEL_HI.u3[i] then in[i] = S[i].f elsif OPSEL.u3[i] then in[i] = f16_to_f32(S[i][31 : 16].f16) else in[i] = f16_to_f32(S[i][15 : 0].f16) endif endfor; D0[31 : 16].f16 = f32_to_f16(fma(in[0], in[1], in[2])) V_WMMA_F32_16X16X16_F16 64 WMMA matrix multiplication with F16 multiplicands and single precision result. saved_exec = EXEC; EXEC = 64'B(-1); eval "D0.f32(16x16) = S0.f16(16x16) * S1.f16(16x16) + S2.f32(16x16)"; EXEC = saved_exec V_WMMA_F32_16X16X16_BF16 65 WMMA matrix multiplication with brain float multiplicands and single precision result. saved_exec = EXEC; EXEC = 64'B(-1); eval "D0.f32(16x16) = S0.bf16(16x16) * S1.bf16(16x16) + S2.f32(16x16)"; EXEC = saved_exec V_WMMA_F16_16X16X16_F16 66 WMMA matrix multiplication with F16 multiplicands and F16 result. saved_exec = EXEC; EXEC = 64'B(-1); eval "D0.f16(16x16) = S0.f16(16x16) * S1.f16(16x16) + S2.f16(16x16)"; EXEC = saved_exec 16.10. VOP3P Instructions 354 of 597 "RDNA3" Instruction Set Architecture V_WMMA_BF16_16X16X16_BF16 67 WMMA matrix multiplication with brain float multiplicands and brain float result. saved_exec = EXEC; EXEC = 64'B(-1); eval "D0.bf16(16x16) = S0.bf16(16x16) * S1.bf16(16x16) + S2.bf16(16x16)"; EXEC = saved_exec V_WMMA_I32_16X16X16_IU8 68 WMMA matrix multiplication with 8-bit integer multiplicands and signed 32-bit integer result. saved_exec = EXEC; EXEC = 64'B(-1); eval "D0.i32(16x16) = S0.iu8(16x16) * S1.iu8(16x16) + S2.i32(16x16)"; EXEC = saved_exec V_WMMA_I32_16X16X16_IU4 69 WMMA matrix multiplication with 4-bit integer multiplicands and signed 32-bit integer result. saved_exec = EXEC; EXEC = 64'B(-1); eval "D0.i32(16x16) = S0.iu4(16x16) * S1.iu4(16x16) + S2.i32(16x16)"; EXEC = saved_exec 16.10. VOP3P Instructions 355 of 597 "RDNA3" Instruction Set Architecture 16.11. VOPD Instructions The VOPD encoded describes two VALU opcodes that are executed in parallel. For instruction definitions, refer to the VOP1, VOP2 and VOP3 sections. 16.11.1. VOPD X-Instructions V_DUAL_FMAC_F32 0 Dual-issue opcode of FMAC_F32. Fused multiply-add of single-precision floats, accumulate with destination. V_DUAL_FMAAK_F32 1 Dual-issue opcode of FMAAK_F32. Multiply two single-precision floats and add a literal constant using fused multiply-add. V_DUAL_FMAMK_F32 2 Dual-issue opcode of FMAMK_F32. Multiply a single-precision float with a literal constant and add a second single-precision float using fused multiply-add. V_DUAL_MUL_F32 3 Dual-issue opcode of MUL_F32. Multiply two single-precision values. V_DUAL_ADD_F32 4 Dual-issue opcode of ADD_F32. Add two single-precision values. 16.11. VOPD Instructions 356 of 597 "RDNA3" Instruction Set Architecture V_DUAL_SUB_F32 5 Dual-issue opcode of SUB_F32. Subtract the second single-precision input from the first input. V_DUAL_SUBREV_F32 6 Dual-issue opcode of SUBREV_F32. Subtract the first single-precision input from the second input. V_DUAL_MUL_DX9_ZERO_F32 7 Dual-issue opcode of MUL_DX9_ZERO_F32. Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE compliant). V_DUAL_MOV_B32 8 Dual-issue opcode of MOV_B32. Move data to a VGPR. V_DUAL_CNDMASK_B32 9 Dual-issue opcode of CNDMASK_B32. Conditional mask on each thread. V_DUAL_MAX_F32 10 Dual-issue opcode of MAX_F32. Compute the maximum of two floats. V_DUAL_MIN_F32 16.11. VOPD Instructions 11 357 of 597 "RDNA3" Instruction Set Architecture Dual-issue opcode of MIN_F32. Compute the minimum of two floats. V_DUAL_DOT2ACC_F32_F16 12 Dual-issue opcode of DOT2ACC_F32_F16. Dot product of packed FP16 values, accumulate with destination. The initial value in D is used as S2. V_DUAL_DOT2ACC_F32_BF16 13 Dual-issue opcode of DOT2ACC_F32_BF16. Dot product of packed brain-float values. The initial value in D is used as S2. 16.11.2. VOPD Y-Instructions V_DUAL_FMAC_F32 0 Dual-issue opcode of FMAC_F32. Fused multiply-add of single-precision floats, accumulate with destination. V_DUAL_FMAAK_F32 1 Dual-issue opcode of FMAAK_F32. Multiply two single-precision floats and add a literal constant using fused multiply-add. V_DUAL_FMAMK_F32 2 Dual-issue opcode of FMAMK_F32. Multiply a single-precision float with a literal constant and add a second single-precision float using fused multiply-add. V_DUAL_MUL_F32 3 Dual-issue opcode of MUL_F32. 16.11. VOPD Instructions 358 of 597 "RDNA3" Instruction Set Architecture Multiply two single-precision values. V_DUAL_ADD_F32 4 Dual-issue opcode of ADD_F32. Add two single-precision values. V_DUAL_SUB_F32 5 Dual-issue opcode of SUB_F32. Subtract the second single-precision input from the first input. V_DUAL_SUBREV_F32 6 Dual-issue opcode of SUBREV_F32. Subtract the first single-precision input from the second input. V_DUAL_MUL_DX9_ZERO_F32 7 Dual-issue opcode of MUL_DX9_ZERO_F32. Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE compliant). V_DUAL_MOV_B32 8 Dual-issue opcode of MOV_B32. Move data to a VGPR. V_DUAL_CNDMASK_B32 9 Dual-issue opcode of CNDMASK_B32. Conditional mask on each thread. 16.11. VOPD Instructions 359 of 597 "RDNA3" Instruction Set Architecture V_DUAL_MAX_F32 10 Dual-issue opcode of MAX_F32. Compute the maximum of two floats. V_DUAL_MIN_F32 11 Dual-issue opcode of MIN_F32. Compute the minimum of two floats. V_DUAL_DOT2ACC_F32_F16 12 Dual-issue opcode of DOT2ACC_F32_F16. Dot product of packed FP16 values, accumulate with destination. The initial value in D is used as S2. V_DUAL_DOT2ACC_F32_BF16 13 Dual-issue opcode of DOT2ACC_F32_BF16. Dot product of packed brain-float values. The initial value in D is used as S2. V_DUAL_ADD_NC_U32 16 Dual-issue opcode of ADD_NC_U32. Add two unsigned integers. No carry-in or carry-out. V_DUAL_LSHLREV_B32 17 Dual-issue opcode of LSHLREV_B32. Logical shift left with shift count in the first operand. V_DUAL_AND_B32 18 Dual-issue opcode of AND_B32. Bitwise AND. 16.11. VOPD Instructions 360 of 597 "RDNA3" Instruction Set Architecture 16.11. VOPD Instructions 361 of 597 "RDNA3" Instruction Set Architecture 16.12. VOP3 & VOP3SD Instructions VOP3 instructions use one of two encodings: VOP3SD this encoding allows specifying a unique scalar destination, and is used only for: V_ADD_CO_U32 V_SUB_CO_U32 V_SUBREV_CO_U32 V_ADDC_CO_U32 V_SUBB_CO_U32 V_SUBBREV_CO_U32 V_DIV_SCALE_F32 V_DIV_SCALE_F64 V_MAD_U64_U32 V_MAD_I64_I32 VOP3 all other VALU instructions use this encoding V_NOP 384 Do nothing. V_MOV_B32 385 Move data to a VGPR. D0.b = S0.b Notes Floating-point modifiers are valid for this instruction if S0.u is a 32-bit floating point value. This instruction is suitable for negating or taking the absolute value of a floating-point value. Functional examples: v_mov_b32 v0, v1 // Move v1 to v0 v_mov_b32 v0, -v1 // Set v1 to the negation of v0 v_mov_b32 v0, abs(v1) // Set v1 to the absolute value of v0 16.12. VOP3 & VOP3SD Instructions 362 of 597 "RDNA3" Instruction Set Architecture V_READFIRSTLANE_B32 386 Copy one VGPR value from the lowest active lane to one SGPR. declare lane : 32'U; if WAVE64 then // 64 lanes if EXEC == 0x0LL then lane = 0U; // Force lane 0 if all lanes are disabled else lane = 32'U(s_ff1_i32_b64(EXEC)); // Lowest active lane endif else // 32 lanes if EXEC_LO.i == 0 then lane = 0U; // Force lane 0 if all lanes are disabled else lane = 32'U(s_ff1_i32_b32(EXEC_LO)); // Lowest active lane endif endif; D0.b = VGPR[lane][SRC0.u] Notes Ignores EXEC mask for the VGPR read. Input and output modifiers not supported; this is an untyped operation. V_CVT_I32_F64 387 Convert from a double-precision float to a signed integer. D0.i = f64_to_i32(S0.f64) Notes 0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_F64_I32 388 Convert from a signed integer to a double-precision float. 16.12. VOP3 & VOP3SD Instructions 363 of 597 "RDNA3" Instruction Set Architecture D0.f64 = i32_to_f64(S0.i) Notes 0ULP accuracy. V_CVT_F32_I32 389 Convert from a signed integer to a single-precision float. D0.f = i32_to_f32(S0.i) Notes 0.5ULP accuracy. V_CVT_F32_U32 390 Convert from an unsigned integer to a single-precision float. D0.f = u32_to_f32(S0.u) Notes 0.5ULP accuracy. V_CVT_U32_F32 391 Convert from a single-precision float to an unsigned integer. D0.u = f32_to_u32(S0.f) Notes 1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. 16.12. VOP3 & VOP3SD Instructions 364 of 597 "RDNA3" Instruction Set Architecture V_CVT_I32_F32 392 Convert from a single-precision float to a signed integer. D0.i = f32_to_i32(S0.f) Notes 1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_F16_F32 394 Convert from a single-precision float to an FP16 float. D0.f16 = f32_to_f16(S0.f) Notes 0.5ULP accuracy, supports input modifiers and creates FP16 denormals when appropriate. Flush denorms on output if specified based on DP denorm mode. Output rounding based on DP rounding mode. V_CVT_F32_F16 395 Convert from an FP16 float to a single-precision float. D0.f = f16_to_f32(S0.f16) Notes 0ULP accuracy, FP16 denormal inputs are accepted. Flush denorms on input if specified based on DP denorm mode. V_CVT_NEAREST_I32_F32 396 Convert from a single-precision float to a signed integer, round to nearest integer. D0.i = f32_to_i32(floor(S0.f + 0.5F)) 16.12. VOP3 & VOP3SD Instructions 365 of 597 "RDNA3" Instruction Set Architecture Notes 0.5ULP accuracy, denormals are supported. V_CVT_FLOOR_I32_F32 397 Convert from a single-precision float to a signed integer, round down. D0.i = f32_to_i32(floor(S0.f)) Notes 1ULP accuracy, denormals are supported. V_CVT_OFF_F32_I4 398 4-bit signed int to 32-bit float. Used for interpolation in shader. Lookup table on S0[3:0]: S0 binary Result 1000 -0.5000f 1001 -0.4375f 1010 -0.3750f 1011 -0.3125f 1100 -0.2500f 1101 -0.1875f 1110 -0.1250f 1111 -0.0625f 0000 +0.0000f 0001 +0.0625f 0010 +0.1250f 0011 +0.1875f 0100 +0.2500f 0101 +0.3125f 0110 +0.3750f 0111 +0.4375f declare CVT_OFF_TABLE : 32'F[16]; D0.f = CVT_OFF_TABLE[S0.u[3 : 0]] V_CVT_F32_F64 16.12. VOP3 & VOP3SD Instructions 399 366 of 597 "RDNA3" Instruction Set Architecture Convert from a double-precision float to a single-precision float. D0.f = f64_to_f32(S0.f64) Notes 0.5ULP accuracy, denormals are supported. V_CVT_F64_F32 400 Convert from a single-precision float to a double-precision float. D0.f64 = f32_to_f64(S0.f) Notes 0ULP accuracy, denormals are supported. V_CVT_F32_UBYTE0 401 Convert an unsigned byte (byte 0) to a single-precision float. D0.f = u32_to_f32(S0.u[7 : 0].u) V_CVT_F32_UBYTE1 402 Convert an unsigned byte (byte 1) to a single-precision float. D0.f = u32_to_f32(S0.u[15 : 8].u) V_CVT_F32_UBYTE2 403 Convert an unsigned byte (byte 2) to a single-precision float. D0.f = u32_to_f32(S0.u[23 : 16].u) 16.12. VOP3 & VOP3SD Instructions 367 of 597 "RDNA3" Instruction Set Architecture V_CVT_F32_UBYTE3 404 Convert an unsigned byte (byte 3) to a single-precision float. D0.f = u32_to_f32(S0.u[31 : 24].u) V_CVT_U32_F64 405 Convert from a double-precision float to an unsigned integer. D0.u = f64_to_u32(S0.f64) Notes 0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_F64_U32 406 Convert from an unsigned integer to a double-precision float. D0.f64 = u32_to_f64(S0.u) Notes 0ULP accuracy. V_TRUNC_F64 407 Return integer part of a number with round-to-zero semantics. D0.f64 = trunc(S0.f64) V_CEIL_F64 408 Round up to next whole integer. 16.12. VOP3 & VOP3SD Instructions 368 of 597 "RDNA3" Instruction Set Architecture D0.f64 = trunc(S0.f64); if ((S0.f64 > 0.0) && (S0.f64 != D0.f64)) then D0.f64 += 1.0 endif V_RNDNE_F64 409 Round-to-nearest-even semantics. D0.f64 = floor(S0.f64 + 0.5); if (isEven(floor(S0.f64)) && (fract(S0.f64) == 0.5)) then D0.f64 -= 1.0 endif V_FLOOR_F64 410 Round down to previous whole integer. D0.f64 = trunc(S0.f64); if ((S0.f64 < 0.0) && (S0.f64 != D0.f64)) then D0.f64 += -1.0 endif V_PIPEFLUSH 411 Flush the VALU destination cache. V_MOV_B16 412 Move data to a VGPR. D0.b16 = S0.b16 Notes Floating-point modifiers are valid for this instruction if S0.u16 is a 16-bit floating point value. This instruction is suitable for negating or taking the absolute value of a floating-point value. 16.12. VOP3 & VOP3SD Instructions 369 of 597 "RDNA3" Instruction Set Architecture V_FRACT_F32 416 Return fractional portion of a number. D0.f = S0.f + -floor(S0.f) Notes 0.5ULP accuracy, denormals are accepted. This complies with the DX specification of fract where the function behaves like an extension of integer modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8 in DX. Obey round mode, result clamped to 0x3f7fffff. V_TRUNC_F32 417 Return integer part of a number with round-to-zero semantics. D0.f = trunc(S0.f) V_CEIL_F32 418 Round up to next whole integer. D0.f = trunc(S0.f); if ((S0.f > 0.0F) && (S0.f != D0.f)) then D0.f += 1.0F endif V_RNDNE_F32 419 Round-to-nearest-even semantics. D0.f = floor(S0.f + 0.5F); if (isEven(64'F(floor(S0.f))) && (fract(S0.f) == 0.5F)) then D0.f -= 1.0F endif 16.12. VOP3 & VOP3SD Instructions 370 of 597 "RDNA3" Instruction Set Architecture V_FLOOR_F32 420 Round down to previous whole integer. D0.f = trunc(S0.f); if ((S0.f < 0.0F) && (S0.f != D0.f)) then D0.f += -1.0F endif V_EXP_F32 421 Base 2 exponentiation. D0.f = pow(2.0F, S0.f) Notes 1ULP accuracy, denormals are flushed. Functional examples: V_EXP_F32(0xff800000) ⇒ 0x00000000 // exp(-INF) = 0 V_EXP_F32(0x80000000) ⇒ 0x3f800000 // exp(-0.0) = 1 V_EXP_F32(0x7f800000) ⇒ 0x7f800000 // exp(+INF) = +INF V_LOG_F32 423 Base 2 logarithm. D0.f = log2(S0.f) Notes 1ULP accuracy, denormals are flushed. Functional examples: V_LOG_F32(0xff800000) ⇒ 0xffc00000 // log(-INF) = NAN V_LOG_F32(0xbf800000) ⇒ 0xffc00000 // log(-1.0) = NAN V_LOG_F32(0x80000000) ⇒ 0xff800000 // log(-0.0) = -INF V_LOG_F32(0x00000000) ⇒ 0xff800000 // log(+0.0) = -INF V_LOG_F32(0x3f800000) ⇒ 0x00000000 // log(+1.0) = 0 V_LOG_F32(0x7f800000) ⇒ 0x7f800000 // log(+INF) = +INF 16.12. VOP3 & VOP3SD Instructions 371 of 597 "RDNA3" Instruction Set Architecture V_RCP_F32 426 Compute reciprocal with IEEE rules. D0.f = 1.0F / S0.f Notes 1ULP accuracy. Accuracy converges to < 0.5ULP when using the Newton-Raphson method and 2 FMA operations. Denormals are flushed. Functional examples: V_RCP_F32(0xff800000) ⇒ 0x80000000 // rcp(-INF) = -0 V_RCP_F32(0xc0000000) ⇒ 0xbf000000 // rcp(-2.0) = -0.5 V_RCP_F32(0x80000000) ⇒ 0xff800000 // rcp(-0.0) = -INF V_RCP_F32(0x00000000) ⇒ 0x7f800000 // rcp(+0.0) = +INF V_RCP_F32(0x7f800000) ⇒ 0x00000000 // rcp(+INF) = +0 V_RCP_IFLAG_F32 427 Compute reciprocal as part of integer divide. D0.f = 1.0F / S0.f; // Can only raise integer DIV_BY_ZERO exception Notes Can raise integer DIV_BY_ZERO exception but cannot raise floating-point exceptions. To be used in an integer reciprocal macro by the compiler with one of the sequences listed below (depending on signed or unsigned operation). Unsigned usage: CVT_F32_U32 RCP_IFLAG_F32 MUL_F32 (2**32 - 1) CVT_U32_F32 + Signed usage: CVT_F32_I32 RCP_IFLAG_F32 MUL_F32 (2**31 - 1) CVT_I32_F32 V_RSQ_F32 16.12. VOP3 & VOP3SD Instructions 430 372 of 597 "RDNA3" Instruction Set Architecture Reciprocal square root with IEEE rules. D0.f = 1.0F / sqrt(S0.f) Notes 1ULP accuracy, denormals are flushed. Functional examples: V_RSQ_F32(0xff800000) ⇒ 0xffc00000 // rsq(-INF) = NAN V_RSQ_F32(0x80000000) ⇒ 0xff800000 // rsq(-0.0) = -INF V_RSQ_F32(0x00000000) ⇒ 0x7f800000 // rsq(+0.0) = +INF V_RSQ_F32(0x40800000) ⇒ 0x3f000000 // rsq(+4.0) = +0.5 V_RSQ_F32(0x7f800000) ⇒ 0x00000000 // rsq(+INF) = +0 V_RCP_F64 431 Reciprocal with IEEE rules. D0.f64 = 1.0 / S0.f64 Notes This opcode has (2**29)ULP accuracy and supports denormals. V_RSQ_F64 433 Reciprocal square root with IEEE rules. D0.f64 = 1.0 / sqrt(S0.f64) Notes This opcode has (2**29)ULP accuracy and supports denormals. V_SQRT_F32 435 Square root. D0.f = sqrt(S0.f) 16.12. VOP3 & VOP3SD Instructions 373 of 597 "RDNA3" Instruction Set Architecture Notes 1ULP accuracy, denormals are flushed. Functional examples: V_SQRT_F32(0xff800000) ⇒ 0xffc00000 // sqrt(-INF) = NAN V_SQRT_F32(0x80000000) ⇒ 0x80000000 // sqrt(-0.0) = -0 V_SQRT_F32(0x00000000) ⇒ 0x00000000 // sqrt(+0.0) = +0 V_SQRT_F32(0x40800000) ⇒ 0x40000000 // sqrt(+4.0) = +2.0 V_SQRT_F32(0x7f800000) ⇒ 0x7f800000 // sqrt(+INF) = +INF V_SQRT_F64 436 Square root. D0.f64 = sqrt(S0.f64) Notes This opcode has (2**29)ULP accuracy and supports denormals. V_SIN_F32 437 Trigonometric sine. D0.f = 32'F(sin(64'F(S0.f) * 2.0 * PI)) Notes Denormals are supported. Full range input is supported. Functional examples: V_SIN_F32(0xff800000) ⇒ 0xffc00000 // sin(-INF) = NAN V_SIN_F32(0xff7fffff) ⇒ 0x00000000 // -MaxFloat, finite V_SIN_F32(0x80000000) ⇒ 0x80000000 // sin(-0.0) = -0 V_SIN_F32(0x3e800000) ⇒ 0x3f800000 // sin(0.25) = 1 V_SIN_F32(0x7f800000) ⇒ 0xffc00000 // sin(+INF) = NAN V_COS_F32 438 Trigonometric cosine. 16.12. VOP3 & VOP3SD Instructions 374 of 597 "RDNA3" Instruction Set Architecture D0.f = 32'F(cos(64'F(S0.f) * 2.0 * PI)) Notes Denormals are supported. Full range input is supported. Functional examples: V_COS_F32(0xff800000) ⇒ 0xffc00000 // cos(-INF) = NAN V_COS_F32(0xff7fffff) ⇒ 0x3f800000 // -MaxFloat, finite V_COS_F32(0x80000000) ⇒ 0x3f800000 // cos(-0.0) = 1 V_COS_F32(0x3e800000) ⇒ 0x00000000 // cos(0.25) = 0 V_COS_F32(0x7f800000) ⇒ 0xffc00000 // cos(+INF) = NAN V_NOT_B32 439 Bitwise negation. D0.u = ~S0.u Notes Input and output modifiers not supported. V_BFREV_B32 440 Bitfield reverse. D0.u[31 : 0] = S0.u[0 : 31] Notes Input and output modifiers not supported. V_CLZ_I32_U32 441 Count leading zeros. Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones. D0.i = -1; // Set if no ones are found 16.12. VOP3 & VOP3SD Instructions 375 of 597 "RDNA3" Instruction Set Architecture for i in 0 : 31 do // Search from MSB if S0.u[31 - i] == 1'1U then D0.i = i; break endif endfor Notes Compare with S_CLZ_I32_U32, which performs the equivalent operation in the scalar ALU. Functional examples: V_CLZ_I32_U32(0x00000000) ⇒ 0xffffffff V_CLZ_I32_U32(0x800000ff) ⇒ 0 V_CLZ_I32_U32(0x100000ff) ⇒ 3 V_CLZ_I32_U32(0x0000ffff) ⇒ 16 V_CLZ_I32_U32(0x00000001) ⇒ 31 V_CTZ_I32_B32 442 Count trailing zeros. Returns the bit position of the first one from the LSB, or -1 if there are no ones. D0.i = -1; // Set if no ones are found for i in 0 : 31 do // Search from LSB if S0.u[i] == 1'1U then D0.i = i; break endif endfor Notes Compare with S_CTZ_I32_B32, which performs the equivalent operation in the scalar ALU. Functional examples: V_CTZ_I32_B32(0x00000000) ⇒ 0xffffffff V_CTZ_I32_B32(0xff000001) ⇒ 0 V_CTZ_I32_B32(0xff000008) ⇒ 3 V_CTZ_I32_B32(0xffff0000) ⇒ 16 V_CTZ_I32_B32(0x80000000) ⇒ 31 16.12. VOP3 & VOP3SD Instructions 376 of 597 "RDNA3" Instruction Set Architecture V_CLS_I32 443 Count leading sign bits. Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the same. D0.i = -1; // Set if all bits are the same for i in 1 : 31 do // Search from MSB if S0.i[31 - i] != S0.i[31] then D0.i = i; break endif endfor Notes Compare with S_CLS_I32, which performs the equivalent operation in the scalar ALU. Functional examples: V_CLS_I32(0x00000000) ⇒ 0xffffffff V_CLS_I32(0x40000000) ⇒ 1 V_CLS_I32(0x80000000) ⇒ 1 V_CLS_I32(0x0fffffff) ⇒ 4 V_CLS_I32(0xffff0000) ⇒ 16 V_CLS_I32(0xfffffffe) ⇒ 31 V_CLS_I32(0xffffffff) ⇒ 0xffffffff V_FREXP_EXP_I32_F64 444 Returns exponent of single precision float input. This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). See also V_FREXP_MANT_F64, which returns the significand. See the C library function frexp() for more information. if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then D0.i = 0 else D0.i = exponent(S0.f64) - 1023 + 1 endif V_FREXP_MANT_F64 445 Returns binary significand of double precision float input. 16.12. VOP3 & VOP3SD Instructions 377 of 597 "RDNA3" Instruction Set Architecture This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). Result range is in (-1.0,-0.5][0.5,1.0) in normal cases. See also V_FREXP_EXP_I32_F64, which returns integer exponent. See the C library function frexp() for more information. if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then D0.f64 = S0.f64 else D0.f64 = mantissa(S0.f64) endif V_FRACT_F64 446 Return fractional portion of a number. D0.f64 = S0.f64 + -floor(S0.f64) Notes 0.5ULP accuracy, denormals are accepted. This complies with the DX specification of fract where the function behaves like an extension of integer modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8 in DX. Obey round mode, result clamped to 0x3fefffffffffffff. V_FREXP_EXP_I32_F32 447 Returns exponent of single precision float input. This operation satisfies the invariant S0.f = significand * (2 ** exponent). See also V_FREXP_MANT_F32, which returns the significand. See the C library function frexp() for more information. if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then D0.i = 0 else D0.i = exponent(S0.f) - 127 + 1 endif V_FREXP_MANT_F32 448 Returns binary significand of single precision float input. 16.12. VOP3 & VOP3SD Instructions 378 of 597 "RDNA3" Instruction Set Architecture This operation satisfies the invariant S0.f = significand * (2 ** exponent). Result range is in (-1.0,-0.5][0.5,1.0) in normal cases. See also V_FREXP_EXP_I32_F32, which returns integer exponent. See the C library function frexp() for more information. if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then D0.f = S0.f else D0.f = mantissa(S0.f) endif V_MOVRELD_B32 450 Move to a relative destination address. addr = DST.u; // Raw value from instruction addr += M0.u[31 : 0]; VGPR[laneId][addr].b = S0.b Notes Example: The following instruction sequence performs the move v15 <= v7: s_mov_b32 m0, 10 v_movreld_b32 v5, v7 V_MOVRELS_B32 451 Move from a relative source address. addr = SRC0.u; // Raw value from instruction addr += M0.u[31 : 0]; D0.b = VGPR[laneId][addr].b Notes Example: The following instruction sequence performs the move v5 <= v17: s_mov_b32 m0, 10 v_movrels_b32 v5, v7 16.12. VOP3 & VOP3SD Instructions 379 of 597 "RDNA3" Instruction Set Architecture V_MOVRELSD_B32 452 Move from a relative source address to a relative destination address. addrs = SRC0.u; // Raw value from instruction addrd = DST.u; // Raw value from instruction addrs += M0.u[31 : 0]; addrd += M0.u[31 : 0]; VGPR[laneId][addrd].b = VGPR[laneId][addrs].b Notes Example: The following instruction sequence performs the move v15 <= v17: s_mov_b32 m0, 10 v_movrelsd_b32 v5, v7 V_MOVRELSD_2_B32 456 Move from a relative source address to a relative destination address, with different relative offsets. addrs = SRC0.u; // Raw value from instruction addrd = DST.u; // Raw value from instruction addrs += M0.u[9 : 0].u; addrd += M0.u[25 : 16].u; VGPR[laneId][addrd].b = VGPR[laneId][addrs].b Notes Example: The following instruction sequence performs the move v25 <= v17: s_mov_b32 m0, ((20 << 16) | 10) v_movrelsd_2_b32 v5, v7 V_CVT_F16_U16 464 Convert from an unsigned short to an FP16 float. D0.f16 = u16_to_f16(S0.u16) 16.12. VOP3 & VOP3SD Instructions 380 of 597 "RDNA3" Instruction Set Architecture Notes 0.5ULP accuracy, supports denormals, rounding, exception flags and saturation. V_CVT_F16_I16 465 Convert from a signed short to an FP16 float. D0.f16 = i16_to_f16(S0.i16) Notes 0.5ULP accuracy, supports denormals, rounding, exception flags and saturation. V_CVT_U16_F16 466 Convert from an FP16 float to an unsigned short. D0.u16 = f16_to_u16(S0.f16) Notes 1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion is done with truncation. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. V_CVT_I16_F16 467 Convert from an FP16 float to a signed short. D0.i16 = f16_to_i16(S0.f16) Notes 1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion is done with truncation. Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this conversion iff CLAMP == 1. 16.12. VOP3 & VOP3SD Instructions 381 of 597 "RDNA3" Instruction Set Architecture V_RCP_F16 468 Reciprocal with IEEE rules. D0.f16 = 16'1.0 / S0.f16 Notes 0.51ULP accuracy. Functional examples: V_RCP_F16(0xfc00) ⇒ 0x8000 // rcp(-INF) = -0 V_RCP_F16(0xc000) ⇒ 0xb800 // rcp(-2.0) = -0.5 V_RCP_F16(0x8000) ⇒ 0xfc00 // rcp(-0.0) = -INF V_RCP_F16(0x0000) ⇒ 0x7c00 // rcp(+0.0) = +INF V_RCP_F16(0x7c00) ⇒ 0x0000 // rcp(+INF) = +0 V_SQRT_F16 469 Square root. D0.f16 = sqrt(S0.f16) Notes 0.51ULP accuracy, denormals are supported. Functional examples: V_SQRT_F16(0xfc00) ⇒ 0xfe00 // sqrt(-INF) = NAN V_SQRT_F16(0x8000) ⇒ 0x8000 // sqrt(-0.0) = -0 V_SQRT_F16(0x0000) ⇒ 0x0000 // sqrt(+0.0) = +0 V_SQRT_F16(0x4400) ⇒ 0x4000 // sqrt(+4.0) = +2.0 V_SQRT_F16(0x7c00) ⇒ 0x7c00 // sqrt(+INF) = +INF V_RSQ_F16 470 Reciprocal square root with IEEE rules. D0.f16 = 16'1.0 / sqrt(S0.f16) Notes 16.12. VOP3 & VOP3SD Instructions 382 of 597 "RDNA3" Instruction Set Architecture 0.51ULP accuracy, denormals are supported. Functional examples: V_RSQ_F16(0xfc00) ⇒ 0xfe00 // rsq(-INF) = NAN V_RSQ_F16(0x8000) ⇒ 0xfc00 // rsq(-0.0) = -INF V_RSQ_F16(0x0000) ⇒ 0x7c00 // rsq(+0.0) = +INF V_RSQ_F16(0x4400) ⇒ 0x3800 // rsq(+4.0) = +0.5 V_RSQ_F16(0x7c00) ⇒ 0x0000 // rsq(+INF) = +0 V_LOG_F16 471 Base 2 logarithm. D0.f16 = log2(S0.f16) Notes 0.51ULP accuracy, denormals are supported. Functional examples: V_LOG_F16(0xfc00) ⇒ 0xfe00 // log(-INF) = NAN V_LOG_F16(0xbc00) ⇒ 0xfe00 // log(-1.0) = NAN V_LOG_F16(0x8000) ⇒ 0xfc00 // log(-0.0) = -INF V_LOG_F16(0x0000) ⇒ 0xfc00 // log(+0.0) = -INF V_LOG_F16(0x3c00) ⇒ 0x0000 // log(+1.0) = 0 V_LOG_F16(0x7c00) ⇒ 0x7c00 // log(+INF) = +INF V_EXP_F16 472 Base 2 exponentiation. D0.f16 = pow(16'2.0, S0.f16) Notes 0.51ULP accuracy, denormals are supported. Functional examples: V_EXP_F16(0xfc00) ⇒ 0x0000 // exp(-INF) = 0 V_EXP_F16(0x8000) ⇒ 0x3c00 // exp(-0.0) = 1 V_EXP_F16(0x7c00) ⇒ 0x7c00 // exp(+INF) = +INF 16.12. VOP3 & VOP3SD Instructions 383 of 597 "RDNA3" Instruction Set Architecture V_FREXP_MANT_F16 473 Returns binary significand of half precision float input. This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). Result range is in (-1.0,-0.5][0.5,1.0) in normal cases. See also V_FREXP_EXP_I16_F16, which returns integer exponent. See the C library function frexp() for more information. if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then D0.f16 = S0.f16 else D0.f16 = mantissa(S0.f16) endif V_FREXP_EXP_I16_F16 474 Returns exponent of half precision float input. This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). See also V_FREXP_MANT_F16, which returns the significand. See the C library function frexp() for more information. if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then D0.i = 0 else D0.i = exponent(S0.f16) - 15 + 1 endif V_FLOOR_F16 475 Round down to previous whole integer. D0.f16 = trunc(S0.f16); if ((S0.f16 < 16'0.0) && (S0.f16 != D0.f16)) then D0.f16 += -16'1.0 endif V_CEIL_F16 476 Round up to next whole integer. D0.f16 = trunc(S0.f16); if ((S0.f16 > 16'0.0) && (S0.f16 != D0.f16)) then D0.f16 += 16'1.0 16.12. VOP3 & VOP3SD Instructions 384 of 597 "RDNA3" Instruction Set Architecture endif V_TRUNC_F16 477 Return integer part of a number with round-to-zero semantics. D0.f16 = trunc(S0.f16) V_RNDNE_F16 478 Round-to-nearest-even semantics. D0.f16 = floor(S0.f16 + 16'0.5); if (isEven(64'F(floor(S0.f16))) && (fract(S0.f16) == 16'0.5)) then D0.f16 -= 16'1.0 endif V_FRACT_F16 479 Return fractional portion of a number. D0.f16 = S0.f16 + -floor(S0.f16) Notes 0.5ULP accuracy, denormals are accepted. This complies with the DX specification of fract where the function behaves like an extension of integer modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8 in DX. V_SIN_F16 480 Trigonometric sine. D0.f16 = 16'F(sin(64'F(S0.f16) * 2.0 * PI)) Notes 16.12. VOP3 & VOP3SD Instructions 385 of 597 "RDNA3" Instruction Set Architecture Denormals are supported. Full range input is supported. Functional examples: V_SIN_F16(0xfc00) ⇒ 0xfe00 // sin(-INF) = NAN V_SIN_F16(0xfbff) ⇒ 0x0000 // Most negative finite FP16 V_SIN_F16(0x8000) ⇒ 0x8000 // sin(-0.0) = -0 V_SIN_F16(0x3400) ⇒ 0x3c00 // sin(0.25) = 1 V_SIN_F16(0x7bff) ⇒ 0x0000 // Most positive finite FP16 V_SIN_F16(0x7c00) ⇒ 0xfe00 // sin(+INF) = NAN V_COS_F16 481 Trigonometric cosine. D0.f16 = 16'F(cos(64'F(S0.f16) * 2.0 * PI)) Notes Denormals are supported. Full range input is supported. Functional examples: V_COS_F16(0xfc00) ⇒ 0xfe00 // cos(-INF) = NAN V_COS_F16(0xfbff) ⇒ 0x3c00 // Most negative finite FP16 V_COS_F16(0x8000) ⇒ 0x3c00 // cos(-0.0) = 1 V_COS_F16(0x3400) ⇒ 0x0000 // cos(0.25) = 0 V_COS_F16(0x7bff) ⇒ 0x3c00 // Most positive finite FP16 V_COS_F16(0x7c00) ⇒ 0xfe00 // cos(+INF) = NAN V_SAT_PK_U8_I16 482 Packed 8-bit saturating of packed 16-bit integer values. Used for 4x16bit data packed as 4x8bit data. SAT8 = lambda(n) ( if n.i <= 0 then return 8'0U elsif n >= 16'I(0xff) then return 8'255U else return n[7 : 0].u8 endif); D0.b16 = { SAT8(S0[31 : 16].i16), SAT8(S0[15 : 0].i16) } 16.12. VOP3 & VOP3SD Instructions 386 of 597 "RDNA3" Instruction Set Architecture V_CVT_NORM_I16_F16 483 Convert from an FP16 float to a signed normalized short. D0.i16 = f16_to_snorm(S0.f16) Notes 0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported. V_CVT_NORM_U16_F16 484 Convert from an FP16 float to an unsigned normalized short. D0.u16 = f16_to_unorm(S0.f16) Notes 0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported. V_NOT_B16 489 Bitwise negation. D0.u16 = ~S0.u16 Notes Input and output modifiers not supported. V_CVT_I32_I16 490 Convert from an 16-bit signed integer to a 32-bit signed integer, sign extending as needed. D0.i = 32'I(signext(S0.i16)) Notes To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16. 16.12. VOP3 & VOP3SD Instructions 387 of 597 "RDNA3" Instruction Set Architecture V_CVT_U32_U16 491 Convert from an 16-bit unsigned integer to a 32-bit unsigned integer, zero extending as needed. D0 = { 16'0, S0.u16 } Notes To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16. V_CNDMASK_B32 257 Conditional mask on each thread. D0.u = VCC.u64[laneId] ? S1.u : S0.u Notes In VOP3 the VCC source may be a scalar GPR specified in S2.u. Floating-point modifiers are valid for this instruction if S0.u and S1.u are 32-bit floating point values. This instruction is suitable for negating or taking the absolute value of a floating-point value. V_ADD_F32 259 Add two single-precision values. D0.f = S0.f + S1.f Notes 0.5ULP precision, denormals are supported. V_SUB_F32 260 Subtract the second single-precision input from the first input. D0.f = S0.f - S1.f 16.12. VOP3 & VOP3SD Instructions 388 of 597 "RDNA3" Instruction Set Architecture V_SUBREV_F32 261 Subtract the first single-precision input from the second input. D0.f = S1.f - S0.f V_FMAC_DX9_ZERO_F32 262 Multiply two single-precision values and accumulate the result with the destination. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE compliant). if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then // DX9 rules, 0.0 * x = 0.0 D0.f = S2.f else D0.f = fma(S0.f, S1.f, D0.f) endif V_MUL_DX9_ZERO_F32 263 Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE compliant). if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then // DX9 rules, 0.0 * x = 0.0 D0.f = 0.0F else D0.f = S0.f * S1.f endif V_MUL_F32 264 Multiply two single-precision values. D0.f = S0.f * S1.f Notes 0.5ULP precision, denormals are supported. 16.12. VOP3 & VOP3SD Instructions 389 of 597 "RDNA3" Instruction Set Architecture V_MUL_I32_I24 265 Multiply two signed 24-bit integers and store the result as a signed 32-bit integer. D0.i = 32'I(S0.i24) * 32'I(S1.i24) Notes This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. See also V_MUL_HI_I32_I24. V_MUL_HI_I32_I24 266 Multiply two signed 24-bit integers and store the high 32 bits of the result as a signed 32-bit integer. D0.i = 32'I(64'I(S0.i24) * 64'I(S1.i24) >> 32U) Notes See also V_MUL_I32_I24. V_MUL_U32_U24 267 Multiply two unsigned 24-bit integers and store the result as an unsigned 32-bit integer. D0.u = 32'U(S0.u24) * 32'U(S1.u24) Notes This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. See also V_MUL_HI_U32_U24. V_MUL_HI_U32_U24 268 Multiply two unsigned 24-bit integers and store the high 32 bits of the result as an unsigned 32-bit integer. D0.u = 32'U(64'U(S0.u24) * 64'U(S1.u24) >> 32U) Notes See also V_MUL_U32_U24. 16.12. VOP3 & VOP3SD Instructions 390 of 597 "RDNA3" Instruction Set Architecture V_MIN_F32 271 Compute the minimum of two floats. LT_NEG_ZERO = lambda(a, b) ( ((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b)))); // Version of comparison where -0.0 < +0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S0.f))) elsif isSignalNAN(64'F(S1.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S1.f))) elsif isQuietNAN(64'F(S1.f)) then D0.f = S0.f elsif isQuietNAN(64'F(S0.f)) then D0.f = S1.f elsif LT_NEG_ZERO(S0.f, S1.f) then // NOTE: -0<+0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif else if isNAN(64'F(S1.f)) then D0.f = S0.f elsif isNAN(64'F(S0.f)) then D0.f = S1.f elsif LT_NEG_ZERO(S0.f, S1.f) then // NOTE: -0<+0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_MAX_F32 272 Compute the maximum of two floats. 16.12. VOP3 & VOP3SD Instructions 391 of 597 "RDNA3" Instruction Set Architecture GT_NEG_ZERO = lambda(a, b) ( ((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b)))); // Version of comparison where +0.0 > -0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S0.f))) elsif isSignalNAN(64'F(S1.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S1.f))) elsif isQuietNAN(64'F(S1.f)) then D0.f = S0.f elsif isQuietNAN(64'F(S0.f)) then D0.f = S1.f elsif GT_NEG_ZERO(S0.f, S1.f) then // NOTE: +0>-0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif else if isNAN(64'F(S1.f)) then D0.f = S0.f elsif isNAN(64'F(S0.f)) then D0.f = S1.f elsif GT_NEG_ZERO(S0.f, S1.f) then // NOTE: +0>-0 is TRUE in this comparison D0.f = S0.f else D0.f = S1.f endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_MIN_I32 273 Compute the minimum of two signed integers. D0.i = S0.i < S1.i ? S0.i : S1.i V_MAX_I32 16.12. VOP3 & VOP3SD Instructions 274 392 of 597 "RDNA3" Instruction Set Architecture Compute the maximum of two signed integers. D0.i = S0.i >= S1.i ? S0.i : S1.i V_MIN_U32 275 Compute the minimum of two unsigned integers. D0.u = S0.u < S1.u ? S0.u : S1.u V_MAX_U32 276 Compute the maximum of two unsigned integers. D0.u = S0.u >= S1.u ? S0.u : S1.u V_LSHLREV_B32 280 Logical shift left with shift count in the first operand. D0.u = S1.u << S0[4 : 0].u V_LSHRREV_B32 281 Logical shift right with shift count in the first operand. D0.u = S1.u >> S0[4 : 0].u V_ASHRREV_I32 282 Arithmetic shift right (preserve sign bit) with shift count in the first operand. D0.i = S1.i >> S0[4 : 0].u 16.12. VOP3 & VOP3SD Instructions 393 of 597 "RDNA3" Instruction Set Architecture V_AND_B32 283 Bitwise AND. D0.u = (S0.u & S1.u) Notes Input and output modifiers not supported. V_OR_B32 284 Bitwise OR. D0.u = (S0.u | S1.u) Notes Input and output modifiers not supported. V_XOR_B32 285 Bitwise XOR. D0.u = (S0.u ^ S1.u) Notes Input and output modifiers not supported. V_XNOR_B32 286 Bitwise XNOR. D0.u = ~(S0.u ^ S1.u) Notes Input and output modifiers not supported. 16.12. VOP3 & VOP3SD Instructions 394 of 597 "RDNA3" Instruction Set Architecture V_ADD_CO_CI_U32 288 Add two unsigned integers and a carry-in from VCC. Store the result and also save the carry-out to VCC. tmp = 64'U(S0.u) + 64'U(S1.u) + VCC.u64[laneId].u64; VCC.u64[laneId] = tmp >= 0x100000000ULL ? 1'1U : 1'0U; D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at S2.u. V_SUB_CO_CI_U32 289 Subtract the second unsigned integer from the first unsigned integer and then subtract a carry-in from VCC. Store the result and also save the carry-out to VCC. tmp = S0.u - S1.u - VCC.u64[laneId].u; VCC.u64[laneId] = 64'U(S1.u) + VCC.u64[laneId].u64 > 64'U(S0.u) ? 1'1U : 1'0U; D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at S2.u. V_SUBREV_CO_CI_U32 290 Subtract the first unsigned integer from the second unsigned integer and then subtract a carry-in from VCC. Store the result and also save the carry-out to VCC. tmp = S1.u - S0.u - VCC.u64[laneId].u; VCC.u64[laneId] = 64'U(S1.u) + VCC.u64[laneId].u64 > 64'U(S0.u) ? 1'1U : 1'0U; D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at S2.u. V_ADD_NC_U32 16.12. VOP3 & VOP3SD Instructions 293 395 of 597 "RDNA3" Instruction Set Architecture Add two unsigned integers. No carry-in or carry-out. D0.u = S0.u + S1.u V_SUB_NC_U32 294 Subtract the second unsigned integer from the first unsigned integer. No carry-in or carry-out. D0.u = S0.u - S1.u V_SUBREV_NC_U32 295 Subtract the first unsigned integer from the second unsigned integer. No carry-in or carry-out. D0.u = S1.u - S0.u V_FMAC_F32 299 Fused multiply-add of single-precision floats, accumulate with destination. D0.f = fma(S0.f, S1.f, D0.f) V_CVT_PK_RTZ_F16_F32 303 Convert two single-precision floats into a packed FP16 result and round to zero (ignore the current rounding mode). D0[15 : 0].f16 = f32_to_f16(S0.f); D0[31 : 16].f16 = f32_to_f16(S1.f); // Round-toward-zero regardless of current round mode setting in hardware. Notes This opcode is intended for use with 16-bit compressed exports. See V_CVT_F16_F32 for a version that respects the current rounding mode. 16.12. VOP3 & VOP3SD Instructions 396 of 597 "RDNA3" Instruction Set Architecture V_ADD_F16 306 Add two FP16 values. D0.f16 = S0.f16 + S1.f16 Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. V_SUB_F16 307 Subtract the second FP16 value from the first. D0.f16 = S0.f16 - S1.f16 Notes 0.5ULP precision, Supports denormals, round mode, exception flags and saturation. V_SUBREV_F16 308 Subtract the first FP16 value from the second. D0.f16 = S1.f16 - S0.f16 Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. V_MUL_F16 309 Multiply two FP16 values. D0.f16 = S0.f16 * S1.f16 Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. 16.12. VOP3 & VOP3SD Instructions 397 of 597 "RDNA3" Instruction Set Architecture V_FMAC_F16 310 Fused multiply-add of FP16 values, accumulate with destination. D0.f16 = fma(S0.f16, S1.f16, D0.f16) Notes 0.5ULP precision. Supports denormals, round mode, exception flags and saturation. V_MAX_F16 313 Compute the maximum of two floats. GT_NEG_ZERO = lambda(a, b) ( ((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b)))); // Version of comparison where +0.0 > -0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16))) elsif isSignalNAN(64'F(S1.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16))) elsif isQuietNAN(64'F(S1.f16)) then D0.f16 = S0.f16 elsif isQuietNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif GT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: +0>-0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 endif else if isNAN(64'F(S1.f16)) then D0.f16 = S0.f16 elsif isNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif GT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: +0>-0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. 16.12. VOP3 & VOP3SD Instructions 398 of 597 "RDNA3" Instruction Set Architecture Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_MIN_F16 314 Compute the minimum of two floats. LT_NEG_ZERO = lambda(a, b) ( ((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b)))); // Version of comparison where -0.0 < +0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(64'F(S0.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16))) elsif isSignalNAN(64'F(S1.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16))) elsif isQuietNAN(64'F(S1.f16)) then D0.f16 = S0.f16 elsif isQuietNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif LT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: -0<+0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 endif else if isNAN(64'F(S1.f16)) then D0.f16 = S0.f16 elsif isNAN(64'F(S0.f16)) then D0.f16 = S1.f16 elsif LT_NEG_ZERO(S0.f16, S1.f16) then // NOTE: -0<+0 is TRUE in this comparison D0.f16 = S0.f16 else D0.f16 = S1.f16 endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. 16.12. VOP3 & VOP3SD Instructions 399 of 597 "RDNA3" Instruction Set Architecture V_LDEXP_F16 315 Load exponent. Multiply an FP16 value by an integral power of 2, compare with the ldexp() function in C. The second argument is an integer value. D0.f16 = S0.f16 * f32_to_f16(2.0F ** 32'I(S1.i16)) V_FMA_DX9_ZERO_F32 521 Multiply and add single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE compliant). if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then // DX9 rules, 0.0 * x = 0.0 D0.f = S2.f else D0.f = fma(S0.f, S1.f, S2.f) endif V_MAD_I32_I24 522 Multiply two signed 24-bit integers, add a signed 32-bit integer and store the result as a signed 32-bit integer. D0.i = 32'I(S0.i24) * 32'I(S1.i24) + S2.i Notes This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. V_MAD_U32_U24 523 Multiply two unsigned 24-bit integers, add an unsigned 32-bit integer and store the result as an unsigned 32-bit integer. D0.u = 32'U(S0.u24) * 32'U(S1.u24) + S2.u Notes 16.12. VOP3 & VOP3SD Instructions 400 of 597 "RDNA3" Instruction Set Architecture This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. V_CUBEID_F32 524 Cubemap Face ID determination. Result is a floating point face ID. // Set D0.f = cubemap face ID ({0.0, 1.0, ..., 5.0}). // XYZ coordinate is given in (S0.f, S1.f, S2.f). // S0.f = x // S1.f = y // S2.f = z if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then if S2.f < 0.0F then D0.f = 5.0F else D0.f = 4.0F endif elsif abs(S1.f) >= abs(S0.f) then if S1.f < 0.0F then D0.f = 3.0F else D0.f = 2.0F endif else if S0.f < 0.0F then D0.f = 1.0F else D0.f = 0.0F endif endif V_CUBESC_F32 525 Cubemap S coordinate. // D0.f = cubemap S coordinate. // XYZ coordinate is given in (S0.f, S1.f, S2.f). // S0.f = x // S1.f = y // S2.f = z if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then if S2.f < 0.0F then D0.f = -S0.f else D0.f = S0.f endif elsif abs(S1.f) >= abs(S0.f) then D0.f = S0.f else 16.12. VOP3 & VOP3SD Instructions 401 of 597 "RDNA3" Instruction Set Architecture if S0.f < 0.0F then D0.f = S2.f else D0.f = -S2.f endif endif V_CUBETC_F32 526 Cubemap T coordinate. // D0.f = cubemap T coordinate. // XYZ coordinate is given in (S0.f, S1.f, S2.f). // S0.f = x // S1.f = y // S2.f = z if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then D0.f = -S1.f elsif abs(S1.f) >= abs(S0.f) then if S1.f < 0.0F then D0.f = -S2.f else D0.f = S2.f endif else D0.f = -S1.f endif V_CUBEMA_F32 527 Determine cubemap major axis. // D0.f = 2.0 * cubemap major axis. // XYZ coordinate is given in (S0.f, S1.f, S2.f). // S0.f = x // S1.f = y // S2.f = z if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then D0.f = S2.f * 2.0F elsif abs(S1.f) >= abs(S0.f) then D0.f = S1.f * 2.0F else D0.f = S0.f * 2.0F endif V_BFE_U32 16.12. VOP3 & VOP3SD Instructions 528 402 of 597 "RDNA3" Instruction Set Architecture Bitfield extract. Extract unsigned bitfield from first operand using field offset in second operand and field size in third operand. D0.u = ((S0.u >> S1.u[4 : 0].u) & 32'U((1 << S2.u[4 : 0].u) - 1)) V_BFE_I32 529 Bitfield extract. Extract signed bitfield from first operand using field offset in second operand and field size in third operand. tmp = ((S0.i >> S1.u[4 : 0].u) & ((1 << S2.u[4 : 0].u) - 1)); D0.i = 32'I(signextFromBit(tmp.i, S2.i[4 : 0].i)) V_BFI_B32 530 Bitfield insert. Using a bitmask from the first operand, merge bitfield in second operand with packed value in third operand. D0.u = ((S0.u & S1.u) | (~S0.u & S2.u)) V_FMA_F32 531 Fused single precision multiply add. D0.f = fma(S0.f, S1.f, S2.f) Notes 0.5ULP accuracy, denormals are supported. V_FMA_F64 532 Fused double precision multiply add. D0.f64 = fma(S0.f64, S1.f64, S2.f64) Notes 16.12. VOP3 & VOP3SD Instructions 403 of 597 "RDNA3" Instruction Set Architecture 0.5ULP precision, denormals are supported. V_LERP_U8 533 Unsigned 8-bit pixel average on packed unsigned bytes (linear interpolation). Each byte in S2 acts as a round mode; if the LSB is set then 0.5 rounds up, otherwise 0.5 truncates. D0.u = 32'U(S0.u[31 : 24] + S1.u[31 : 24] + S2.u[24].u8 >> 1U << 24U); D0.u += 32'U(S0.u[23 : 16] + S1.u[23 : 16] + S2.u[16].u8 >> 1U << 16U); D0.u += 32'U(S0.u[15 : 8] + S1.u[15 : 8] + S2.u[8].u8 >> 1U << 8U); D0.u += 32'U(S0.u[7 : 0] + S1.u[7 : 0] + S2.u[0].u8 >> 1U) V_ALIGNBIT_B32 534 Align a value to the specified bit position. D0.u = 32'U(({ S0.u, S1.u } >> S2.u[4 : 0].u) & 0xffffffffLL) Notes  S0 carries the MSBs and S1 carries the LSBs of the value being aligned. V_ALIGNBYTE_B32 535 Align a value to the specified byte position. D0.u = 32'U(({ S0.u, S1.u } >> S2.u[1 : 0].u * 8U) & 0xffffffffLL) Notes  S0 carries the MSBs and S1 carries the LSBs of the value being aligned. V_MULLIT_F32 536 Multiply for lighting. Specific rules apply: 0.0 * x = 0.0; specific INF, NAN, overflow rules. if ((S1.f == -MAX_FLOAT_F32) || (64'F(S1.f) == -INF) || isNAN(64'F(S1.f)) || (S2.f <= 0.0F) || isNAN(64'F(S2.f))) then 16.12. VOP3 & VOP3SD Instructions 404 of 597 "RDNA3" Instruction Set Architecture D0.f = -MAX_FLOAT_F32 else D0.f = S0.f * S1.f endif Notes V_MIN3_F32 537 Return minimum single-precision value of three inputs. D0.f = v_min_f32(v_min_f32(S0.f, S1.f), S2.f) V_MIN3_I32 538 Return minimum signed integer value of three inputs. D0.i = v_min_i32(v_min_i32(S0.i, S1.i), S2.i) V_MIN3_U32 539 Return minimum unsigned integer value of three inputs. D0.u = v_min_u32(v_min_u32(S0.u, S1.u), S2.u) V_MAX3_F32 540 Return maximum single precision value of three inputs. D0.f = v_max_f32(v_max_f32(S0.f, S1.f), S2.f) V_MAX3_I32 541 Return maximum signed integer value of three inputs. 16.12. VOP3 & VOP3SD Instructions 405 of 597 "RDNA3" Instruction Set Architecture D0.i = v_max_i32(v_max_i32(S0.i, S1.i), S2.i) V_MAX3_U32 542 Return maximum unsigned integer value of three inputs. D0.u = v_max_u32(v_max_u32(S0.u, S1.u), S2.u) V_MED3_F32 543 Return median single precision value of three inputs. if (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f)) || isNAN(64'F(S2.f))) then D0.f = v_min3_f32(S0.f, S1.f, S2.f) elsif v_max3_f32(S0.f, S1.f, S2.f) == S0.f then D0.f = v_max_f32(S1.f, S2.f) elsif v_max3_f32(S0.f, S1.f, S2.f) == S1.f then D0.f = v_max_f32(S0.f, S2.f) else D0.f = v_max_f32(S0.f, S1.f) endif V_MED3_I32 544 Return median signed integer value of three inputs. if v_max3_i32(S0.i, S1.i, S2.i) == S0.i then D0.i = v_max_i32(S1.i, S2.i) elsif v_max3_i32(S0.i, S1.i, S2.i) == S1.i then D0.i = v_max_i32(S0.i, S2.i) else D0.i = v_max_i32(S0.i, S1.i) endif V_MED3_U32 545 Return median unsigned integer value of three inputs. if v_max3_u32(S0.u, S1.u, S2.u) == S0.u then D0.u = v_max_u32(S1.u, S2.u) 16.12. VOP3 & VOP3SD Instructions 406 of 597 "RDNA3" Instruction Set Architecture elsif v_max3_u32(S0.u, S1.u, S2.u) == S1.u then D0.u = v_max_u32(S0.u, S2.u) else D0.u = v_max_u32(S0.u, S1.u) endif V_SAD_U8 546 Sum of absolute differences with accumulation, overflow into upper bits is allowed. ABSDIFF = lambda(x, y) ( x > y ? x - y : y - x); // UNSIGNED comparison D0.u = S2.u; D0.u += 32'U(ABSDIFF(S0.u[31 : 24], S1.u[31 : 24])); D0.u += 32'U(ABSDIFF(S0.u[23 : 16], S1.u[23 : 16])); D0.u += 32'U(ABSDIFF(S0.u[15 : 8], S1.u[15 : 8])); D0.u += 32'U(ABSDIFF(S0.u[7 : 0], S1.u[7 : 0])) V_SAD_HI_U8 547 Sum of absolute differences with accumulation, accumulate from the higher-order bits of the third source operand. D0.u = (32'U(v_sad_u8(S0, S1, 0U)) << 16U) + S2.u V_SAD_U16 548 Short SAD with accumulation. ABSDIFF = lambda(x, y) ( x > y ? x - y : y - x); // UNSIGNED comparison D0.u = S2.u; D0.u += ABSDIFF(S0[31 : 16].u16, S1[31 : 16].u16); D0.u += ABSDIFF(S0[15 : 0].u16, S1[15 : 0].u16) V_SAD_U32 549 Dword SAD with accumulation. 16.12. VOP3 & VOP3SD Instructions 407 of 597 "RDNA3" Instruction Set Architecture ABSDIFF = lambda(x, y) ( x > y ? x - y : y - x); // UNSIGNED comparison D0.u = ABSDIFF(S0.u, S1.u) + S2.u V_CVT_PK_U8_F32 550 Packed float to byte conversion. Convert floating point value S0 to 8-bit unsigned integer and pack the result into byte S1 of dword S2. D0.u = (S2.u & 32'U(~(0xff << S1.u[1 : 0].u * 8U))); D0.u = (D0.u | ((32'U(f32_to_u8(S0.f)) & 255U) << S1.u[1 : 0].u * 8U)) V_DIV_FIXUP_F32 551 Single precision division fixup. S0 = Quotient, S1 = Denominator, S2 = Numerator. Given a numerator, denominator, and quotient from a divide, this opcode detects and applies specific case numerics, touching up the quotient if necessary. This opcode also generates invalid, denorm and divide by zero exceptions caused by the division. sign_out = (sign(S1.f) ^ sign(S2.f)); if isNAN(64'F(S2.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S2.f))) elsif isNAN(64'F(S1.f)) then D0.f = 32'F(cvtToQuietNAN(64'F(S1.f))) elsif ((64'F(S1.f) == 0.0) && (64'F(S2.f) == 0.0)) then // 0/0 D0.f = 32'F(0xffc00000) elsif ((64'F(abs(S1.f)) == +INF) && (64'F(abs(S2.f)) == +INF)) then // inf/inf D0.f = 32'F(0xffc00000) elsif ((64'F(S1.f) == 0.0) || (64'F(abs(S2.f)) == +INF)) then // x/0, or inf/y D0.f = sign_out ? -INF.f : +INF.f elsif ((64'F(abs(S1.f)) == +INF) || (64'F(S2.f) == 0.0)) then // x/inf, 0/y D0.f = sign_out ? -0.0F : 0.0F elsif exponent(S2.f) - exponent(S1.f) < -150 then D0.f = sign_out ? -UNDERFLOW_F32 : UNDERFLOW_F32 elsif exponent(S1.f) == 255 then D0.f = sign_out ? -OVERFLOW_F32 : OVERFLOW_F32 else D0.f = sign_out ? -abs(S0.f) : abs(S0.f) 16.12. VOP3 & VOP3SD Instructions 408 of 597 "RDNA3" Instruction Set Architecture endif V_DIV_FIXUP_F64 552 Double precision division fixup. S0 = Quotient, S1 = Denominator, S2 = Numerator. Given a numerator, denominator, and quotient from a divide, this opcode detects and applies specific case numerics, touching up the quotient if necessary. This opcode also generates invalid, denorm and divide by zero exceptions caused by the division. sign_out = (sign(S1.f64) ^ sign(S2.f64)); if isNAN(S2.f64) then D0.f64 = cvtToQuietNAN(S2.f64) elsif isNAN(S1.f64) then D0.f64 = cvtToQuietNAN(S1.f64) elsif ((S1.f64 == 0.0) && (S2.f64 == 0.0)) then // 0/0 D0.f64 = 64'F(0xfff8000000000000LL) elsif ((abs(S1.f64) == +INF) && (abs(S2.f64) == +INF)) then // inf/inf D0.f64 = 64'F(0xfff8000000000000LL) elsif ((S1.f64 == 0.0) || (abs(S2.f64) == +INF)) then // x/0, or inf/y D0.f64 = sign_out ? -INF : +INF elsif ((abs(S1.f64) == +INF) || (S2.f64 == 0.0)) then // x/inf, 0/y D0.f64 = sign_out ? -0.0 : 0.0 elsif exponent(S2.f64) - exponent(S1.f64) < -1075 then D0.f64 = sign_out ? -UNDERFLOW_F64 : UNDERFLOW_F64 elsif exponent(S1.f64) == 2047 then D0.f64 = sign_out ? -OVERFLOW_F64 : OVERFLOW_F64 else D0.f64 = sign_out ? -abs(S0.f64) : abs(S0.f64) endif V_DIV_FMAS_F32 567 Single precision FMA with fused scale. This opcode performs a standard Fused Multiply-Add operation and conditionally scales the resulting exponent if VCC is set. if VCC.u64[laneId] then D0.f = 2.0F ** 32 * fma(S0.f, S1.f, S2.f) else D0.f = fma(S0.f, S1.f, S2.f) 16.12. VOP3 & VOP3SD Instructions 409 of 597 "RDNA3" Instruction Set Architecture endif Notes Input denormals are not flushed, but output flushing is allowed. V_DIV_FMAS_F64 568 Double precision FMA with fused scale. This opcode performs a standard Fused Multiply-Add operation and conditionally scales the resulting exponent if VCC is set. if VCC.u64[laneId] then D0.f64 = 2.0 ** 64 * fma(S0.f64, S1.f64, S2.f64) else D0.f64 = fma(S0.f64, S1.f64, S2.f64) endif Notes Input denormals are not flushed, but output flushing is allowed. V_MSAD_U8 569 Masked sum of absolute differences with accumulation, overflow into upper bits is allowed. Components where the reference value in S1 is zero are not included in the sum. ABSDIFF = lambda(x, y) ( x > y ? x - y : y - x); // UNSIGNED comparison D0.u = S2.u; D0.u += S1.u[31 : 24] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[31 : 24], S1.u[31 : 24])); D0.u += S1.u[23 : 16] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[23 : 16], S1.u[23 : 16])); D0.u += S1.u[15 : 8] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[15 : 8], S1.u[15 : 8])); D0.u += S1.u[7 : 0] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[7 : 0], S1.u[7 : 0])) V_QSAD_PK_U16_U8 570 Quad-byte SAD with 16-bit packed accumulation. D0[63 : 48] = 16'B(v_sad_u8(S0[55 : 24], S1[31 : 0], S2[63 : 48].u)); D0[47 : 32] = 16'B(v_sad_u8(S0[47 : 16], S1[31 : 0], S2[47 : 32].u)); 16.12. VOP3 & VOP3SD Instructions 410 of 597 "RDNA3" Instruction Set Architecture D0[31 : 16] = 16'B(v_sad_u8(S0[39 : 8], S1[31 : 0], S2[31 : 16].u)); D0[15 : 0] = 16'B(v_sad_u8(S0[31 : 0], S1[31 : 0], S2[15 : 0].u)) V_MQSAD_PK_U16_U8 571 Quad-byte masked SAD with 16-bit packed accumulation. D0[63 : 48] = 16'B(v_msad_u8(S0[55 : 24], S1[31 : 0], S2[63 : 48].u)); D0[47 : 32] = 16'B(v_msad_u8(S0[47 : 16], S1[31 : 0], S2[47 : 32].u)); D0[31 : 16] = 16'B(v_msad_u8(S0[39 : 8], S1[31 : 0], S2[31 : 16].u)); D0[15 : 0] = 16'B(v_msad_u8(S0[31 : 0], S1[31 : 0], S2[15 : 0].u)) V_MQSAD_U32_U8 573 Quad-byte masked SAD with 32-bit packed accumulation. D0[127 : 96] = 32'B(v_msad_u8(S0[55 : 24], S1[31 : 0], S2[127 : 96].u)); D0[95 : 64] = 32'B(v_msad_u8(S0[47 : 16], S1[31 : 0], S2[95 : 64].u)); D0[63 : 32] = 32'B(v_msad_u8(S0[39 : 8], S1[31 : 0], S2[63 : 32].u)); D0[31 : 0] = 32'B(v_msad_u8(S0[31 : 0], S1[31 : 0], S2[31 : 0].u)) V_XOR3_B32 576 Bitwise XOR of three inputs. D0.u = (S0.u ^ S1.u ^ S2.u) Notes Input and output modifiers not supported. V_MAD_U16 577 Multiply and add three unsigned short values. D0.u16 = S0.u16 * S1.u16 + S2.u16 Notes 16.12. VOP3 & VOP3SD Instructions 411 of 597 "RDNA3" Instruction Set Architecture Supports saturation (unsigned 16-bit integer domain). V_PERM_B32 580 Byte permute. BYTE_PERMUTE = lambda(data, sel) ( declare in : 8'B[8]; for i in 0 : 7 do in[i] = data[i * 8 + 7 : i * 8].b8 endfor; if sel.u >= 13U then return 8'0xff elsif sel.u == 12U then return 8'0x0 elsif sel.u == 11U then return in[7][7].b8 * 8'0xff elsif sel.u == 10U then return in[5][7].b8 * 8'0xff elsif sel.u == 9U then return in[3][7].b8 * 8'0xff elsif sel.u == 8U then return in[1][7].b8 * 8'0xff else return in[sel] endif); D0[31 : 24] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[31 : 24]); D0[23 : 16] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[23 : 16]); D0[15 : 8] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[15 : 8]); D0[7 : 0] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[7 : 0]) Notes Selects 8 through 11 are useful in modeling sign extension of a smaller-precision signed integer to a largerprecision result. Note the MSBs of the 64-bit value being selected are stored in S0. This is counterintuitive for a little-endian architecture. V_XAD_U32 581 Bitwise XOR and then add. D0.u = (S0.u ^ S1.u) + S2.u Notes No carryin/carryout and no saturation. This opcode is designed to help accelerate the SHA256 hash algorithm. 16.12. VOP3 & VOP3SD Instructions 412 of 597 "RDNA3" Instruction Set Architecture V_LSHL_ADD_U32 582 Logical shift left and then add. D0.u = (S0.u << S1.u[4 : 0].u) + S2.u V_ADD_LSHL_U32 583 Add and then logical shift left the result. D0.u = S0.u + S1.u << S2.u[4 : 0].u V_FMA_F16 584 Fused half precision multiply add. D0.f16 = fma(S0.f16, S1.f16, S2.f16) Notes 0.5ULP accuracy, denormals are supported. V_MIN3_F16 585 Return minimum FP16 value of three inputs. D0.f16 = v_min_f16(v_min_f16(S0.f16, S1.f16), S2.f16) V_MIN3_I16 586 Return minimum signed short value of three inputs. D0.i16 = v_min_i16(v_min_i16(S0.i16, S1.i16), S2.i16) 16.12. VOP3 & VOP3SD Instructions 413 of 597 "RDNA3" Instruction Set Architecture V_MIN3_U16 587 Return minimum unsigned short value of three inputs. D0.u16 = v_min_u16(v_min_u16(S0.u16, S1.u16), S2.u16) V_MAX3_F16 588 Return maximum FP16 value of three inputs. D0.f16 = v_max_f16(v_max_f16(S0.f16, S1.f16), S2.f16) V_MAX3_I16 589 Return maximum signed short value of three inputs. D0.i16 = v_max_i16(v_max_i16(S0.i16, S1.i16), S2.i16) V_MAX3_U16 590 Return maximum unsigned short value of three inputs. D0.u16 = v_max_u16(v_max_u16(S0.u16, S1.u16), S2.u16) V_MED3_F16 591 Return median FP16 value of three inputs. if (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16)) || isNAN(64'F(S2.f16))) then D0.f16 = v_min3_f16(S0.f16, S1.f16, S2.f16) elsif v_max3_f16(S0.f16, S1.f16, S2.f16) == S0.f16 then D0.f16 = v_max_f16(S1.f16, S2.f16) elsif v_max3_f16(S0.f16, S1.f16, S2.f16) == S1.f16 then D0.f16 = v_max_f16(S0.f16, S2.f16) else D0.f16 = v_max_f16(S0.f16, S1.f16) endif 16.12. VOP3 & VOP3SD Instructions 414 of 597 "RDNA3" Instruction Set Architecture V_MED3_I16 592 Return median signed short value of three inputs. if v_max3_i16(S0.i16, S1.i16, S2.i16) == S0.i16 then D0.i16 = v_max_i16(S1.i16, S2.i16) elsif v_max3_i16(S0.i16, S1.i16, S2.i16) == S1.i16 then D0.i16 = v_max_i16(S0.i16, S2.i16) else D0.i16 = v_max_i16(S0.i16, S1.i16) endif V_MED3_U16 593 Return median unsigned short value of three inputs. if v_max3_u16(S0.u16, S1.u16, S2.u16) == S0.u16 then D0.u16 = v_max_u16(S1.u16, S2.u16) elsif v_max3_u16(S0.u16, S1.u16, S2.u16) == S1.u16 then D0.u16 = v_max_u16(S0.u16, S2.u16) else D0.u16 = v_max_u16(S0.u16, S1.u16) endif V_MAD_I16 595 Multiply and add three signed short values. D0.i16 = S0.i16 * S1.i16 + S2.i16 Notes Supports saturation (signed 16-bit integer domain). V_DIV_FIXUP_F16 596 Half precision division fixup. S0 = Quotient, S1 = Denominator, S2 = Numerator. Given a numerator, denominator, and quotient from a divide, this opcode detects and applies specific case numerics, touching up the quotient if necessary. This opcode also generates invalid, denorm and divide by zero exceptions caused by the division. 16.12. VOP3 & VOP3SD Instructions 415 of 597 "RDNA3" Instruction Set Architecture sign_out = (sign(S1.f16) ^ sign(S2.f16)); if isNAN(64'F(S2.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S2.f16))) elsif isNAN(64'F(S1.f16)) then D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16))) elsif ((64'F(S1.f16) == 0.0) && (64'F(S2.f16) == 0.0)) then // 0/0 D0.f16 = 16'F(0xfe00) elsif ((64'F(abs(S1.f16)) == +INF) && (64'F(abs(S2.f16)) == +INF)) then // inf/inf D0.f16 = 16'F(0xfe00) elsif ((64'F(S1.f16) == 0.0) || (64'F(abs(S2.f16)) == +INF)) then // x/0, or inf/y D0.f16 = sign_out ? -INF.f16 : +INF.f16 elsif ((64'F(abs(S1.f16)) == +INF) || (64'F(S2.f16) == 0.0)) then // x/inf, 0/y D0.f16 = sign_out ? -16'0.0 : 16'0.0 else D0.f16 = sign_out ? -abs(S0.f16) : abs(S0.f16) endif V_ADD3_U32 597 Add three unsigned integers. D0.u = S0.u + S1.u + S2.u V_LSHL_OR_B32 598 Logical shift left and then bitwise OR. D0.u = ((S0.u << S1.u[4 : 0].u) | S2.u) V_AND_OR_B32 599 Bitwise AND and then bitwise OR. D0.u = ((S0.u & S1.u) | S2.u) V_OR3_B32 16.12. VOP3 & VOP3SD Instructions 600 416 of 597 "RDNA3" Instruction Set Architecture Bitwise OR of three inputs. D0.u = (S0.u | S1.u | S2.u) V_MAD_U32_U16 601 Multiply and add unsigned values. D0.u = 32'U(S0.u16) * 32'U(S1.u16) + S2.u V_MAD_I32_I16 602 Multiply and add signed values. D0.i = 32'I(S0.i16) * 32'I(S1.i16) + S2.i V_PERMLANE16_B32 603 Perform arbitrary gather-style operation within a row (16 contiguous lanes). The first source must be a VGPR and the second and third sources must be scalar values; the second and third source are combined into a single 64-bit value representing lane selects used to swizzle within each row. OPSEL is not used in its typical manner for this instruction. For this instruction OPSEL[0] is overloaded to represent the DPP 'FI' (Fetch Inactive) bit and OPSEL[1] is overloaded to represent the DPP 'BOUND_CTRL' bit. The remaining OPSEL bits are reserved for this instruction. Compare with V_PERMLANEX16_B32. declare tmp : 32'B[64]; lanesel = { S2.u, S1.u }; // Concatenate lane select bits for i in 0 : WAVE32 ? 31 : 63 do // Copy original S0 in case D==S0 tmp[i] = VGPR[i][SRC0.u] endfor; for row in 0 : WAVE32 ? 1 : 3 do // Implement arbitrary swizzle within each row for i in 0 : 15 do if EXEC[row * 16 + i].u1 then VGPR[row * 16 + i][VDST.u] = tmp[64'B(row * 16) + lanesel[i * 4 + 3 : i * 4]] endif endfor 16.12. VOP3 & VOP3SD Instructions 417 of 597 "RDNA3" Instruction Set Architecture endfor Notes ABS, NEG and OMOD modifiers should all be zeroed for this instruction. Example implementing a rotation within each row: v_mov_b32 s0, 0x87654321; v_mov_b32 s1, 0x0fedcba9; v_permlane16_b32 v1, v0, s0, s1; // ROW 0: // v1.lane[0] <- v0.lane[1] // v1.lane[1] <- v0.lane[2] // ... // v1.lane[14] <- v0.lane[15] // v1.lane[15] <- v0.lane[0] // // ROW 1: // v1.lane[16] <- v0.lane[17] // v1.lane[17] <- v0.lane[18] // ... // v1.lane[30] <- v0.lane[31] // v1.lane[31] <- v0.lane[16] V_PERMLANEX16_B32 604 Perform arbitrary gather-style operation across two rows (each row is 16 contiguous lanes). The first source must be a VGPR and the second and third sources must be scalar values; the second and third source are combined into a single 64-bit value representing lane selects used to swizzle within each row. OPSEL is not used in its typical manner for this instruction. For this instruction OPSEL[0] is overloaded to represent the DPP 'FI' (Fetch Inactive) bit and OPSEL[1] is overloaded to represent the DPP 'BOUND_CTRL' bit. The remaining OPSEL bits are reserved for this instruction. Compare with V_PERMLANE16_B32. declare tmp : 32'B[64]; lanesel = { S2.u, S1.u }; // Concatenate lane select bits for i in 0 : WAVE32 ? 31 : 63 do // Copy original S0 in case D==S0 tmp[i] = VGPR[i][SRC0.u] endfor; for row in 0 : WAVE32 ? 1 : 3 do // Implement arbitrary swizzle across two rows altrow = { row[1], ~row[0] }; // 1<->0, 3<->2 for i in 0 : 15 do if EXEC[row * 16 + i].u1 then 16.12. VOP3 & VOP3SD Instructions 418 of 597 "RDNA3" Instruction Set Architecture VGPR[row * 16 + i][VDST.u] = tmp[64'B(altrow.i * 16) + lanesel[i * 4 + 3 : i * 4]] endif endfor endfor Notes ABS, NEG and OMOD modifiers should all be zeroed for this instruction. Example implementing a rotation across an entire wave32 wavefront: // Note for this to work, source and destination VGPRs must be different. // For this rotation, lane 15 gets data from lane 16, lane 31 gets data from lane 0. // These are the only two lanes that need to use v_permlanex16_b32. // Enable only the threads that get data from their own row. v_mov_b32 exec_lo, 0x7fff7fff; // Lanes getting data from their own row v_mov_b32 s0, 0x87654321; v_mov_b32 s1, 0x0fedcba9; v_permlane16_b32 v1, v0, s0, s1 fi; // FI bit needed for lanes 14 and 30 // ROW 0: // v1.lane[0] <- v0.lane[1] // v1.lane[1] <- v0.lane[2] // ... // v1.lane[14] <- v0.lane[15] (needs FI to read) // v1.lane[15] unset // // ROW 1: // v1.lane[16] <- v0.lane[17] // v1.lane[17] <- v0.lane[18] // ... // v1.lane[30] <- v0.lane[31] (needs FI to read) // v1.lane[31] unset // Enable only the threads that get data from the other row. v_mov_b32 exec_lo, 0x80008000; // Lanes getting data from the other row v_permlanex16_b32 v1, v0, s0, s1 fi; // FI bit needed for lanes 15 and 31 // v1.lane[15] <- v0.lane[16] // v1.lane[31] <- v0.lane[0] V_CNDMASK_B16 605 Conditional mask on each thread. D0.u16 = VCC.u1 ? S1.u16 : S0.u16 Notes In VOP3 the VCC source may be a scalar GPR specified in S2.u. 16.12. VOP3 & VOP3SD Instructions 419 of 597 "RDNA3" Instruction Set Architecture Floating-point modifiers are valid for this instruction if S0.u16 and S1.u16 are 16-bit floating point values. This instruction is suitable for negating or taking the absolute value of a floating-point value. V_MAXMIN_F32 606 Compute maximum of first two operands, followed by minimum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's maxBound < minBound. D0.f = v_min_f32(v_max_f32(S0.f, S1.f), S2.f) Notes Support input denorm control, allow output denorm value. Exceptions are supported. Note: +0.0 > -0.0 is true. V_MINMAX_F32 607 Compute minimum of first two operands, followed by maximum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's minBound > maxBound. D0.f = v_max_f32(v_min_f32(S0.f, S1.f), S2.f) Notes Support input denorm control, allow output denorm value. Exceptions are supported. Note: +0.0 > -0.0 is true. V_MAXMIN_F16 608 Compute maximum of first two operands, followed by minimum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's maxBound < minBound. D0.f16 = v_min_f16(v_max_f16(S0.f16, S1.f16), S2.f16) Notes Support input denorm control, allow output denorm value. Exceptions are supported. Note: +0.0 > -0.0 is true. 16.12. VOP3 & VOP3SD Instructions 420 of 597 "RDNA3" Instruction Set Architecture V_MINMAX_F16 609 Compute minimum of first two operands, followed by maximum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's maxBound < minBound. D0.f16 = v_max_f16(v_min_f16(S0.f16, S1.f16), S2.f16) Notes Support input denorm control, allow output denorm value. Exceptions are supported. Note: +0.0 > -0.0 is true. V_MAXMIN_U32 610 Compute maximum of first two operands, followed by minimum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's maxBound < minBound. D0.i = 32'I(v_min_u32(v_max_u32(S0.u, S1.u), S2.u)) V_MINMAX_U32 611 Compute minimum of first two operands, followed by maximum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's maxBound < minBound. D0.i = 32'I(v_max_u32(v_min_u32(S0.u, S1.u), S2.u)) V_MAXMIN_I32 612 Compute maximum of first two operands, followed by minimum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's maxBound < minBound. D0.i = v_min_i32(v_max_i32(S0.i, S1.i), S2.i) 16.12. VOP3 & VOP3SD Instructions 421 of 597 "RDNA3" Instruction Set Architecture V_MINMAX_I32 613 Compute minimum of first two operands, followed by maximum of that result and the third operand. This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the clamp's maxBound < minBound. D0.i = v_max_i32(v_min_i32(S0.i, S1.i), S2.i) V_DOT2_F16_F16 614 Dot product of packed FP16 values. tmp = S0[15 : 0].f16 * S1[15 : 0].f16; tmp += S0[31 : 16].f16 * S1[31 : 16].f16; tmp += S2.f16; D0.f16 = tmp Notes OPSEL[2] controls which half of S2 is read and OPSEL[3] controls which half of D is written; OPSEL[1:0] are ignored. V_DOT2_BF16_BF16 615 Dot product of packed brain-float values. tmp = S0[15 : 0].bf16 * S1[15 : 0].bf16; tmp += S0[31 : 16].bf16 * S1[31 : 16].bf16; tmp += S2.bf16; D0.bf16 = tmp Notes OPSEL[2] controls which half of S2 is read and OPSEL[3] controls which half of D is written; OPSEL[1:0] are ignored. V_DIV_SCALE_F32 764 Single precision division pre-scale. S0 = Input to scale (either denominator or numerator), S1 = Denominator, S2 = Numerator. 16.12. VOP3 & VOP3SD Instructions 422 of 597 "RDNA3" Instruction Set Architecture Given a numerator and denominator, this opcode appropriately scales inputs for division to avoid subnormal terms during Newton-Raphson correction method. S0 must be the same value as either S1 or S2. This opcode produces a VCC flag for post-scaling of the quotient (using V_DIV_FMAS_F32). VCC = 0x0LL; if ((64'F(S2.f) == 0.0) || (64'F(S1.f) == 0.0)) then D0.f = NAN.f elsif exponent(S2.f) - exponent(S1.f) >= 96 then // N/D near MAX_FLOAT_F32 VCC = 0x1LL; if S0.f == S1.f then // Only scale the denominator D0.f = ldexp(S0.f, 64) endif elsif S1.f == DENORM.f then D0.f = ldexp(S0.f, 64) elsif ((1.0 / 64'F(S1.f) == DENORM.f64) && (S2.f / S1.f == DENORM.f)) then VCC = 0x1LL; if S0.f == S1.f then // Only scale the denominator D0.f = ldexp(S0.f, 64) endif elsif 1.0 / 64'F(S1.f) == DENORM.f64 then D0.f = ldexp(S0.f, -64) elsif S2.f / S1.f == DENORM.f then VCC = 0x1LL; if S0.f == S2.f then // Only scale the numerator D0.f = ldexp(S0.f, 64) endif elsif exponent(S2.f) <= 23 then // Numerator is tiny D0.f = ldexp(S0.f, 64) endif V_DIV_SCALE_F64 765 Double precision division pre-scale. S0 = Input to scale (either denominator or numerator), S1 = Denominator, S2 = Numerator. Given a numerator and denominator, this opcode appropriately scales inputs for division to avoid subnormal terms during Newton-Raphson correction method. S0 must be the same value as either S1 or S2. This opcode produces a VCC flag for post-scaling of the quotient (using V_DIV_FMAS_F64). VCC = 0x0LL; if ((S2.f64 == 0.0) || (S1.f64 == 0.0)) then D0.f64 = NAN.f64 elsif exponent(S2.f64) - exponent(S1.f64) >= 768 then // N/D near MAX_FLOAT_F64 16.12. VOP3 & VOP3SD Instructions 423 of 597 "RDNA3" Instruction Set Architecture VCC = 0x1LL; if S0.f64 == S1.f64 then // Only scale the denominator D0.f64 = ldexp(S0.f64, 128) endif elsif S1.f64 == DENORM.f64 then D0.f64 = ldexp(S0.f64, 128) elsif ((1.0 / S1.f64 == DENORM.f64) && (S2.f64 / S1.f64 == DENORM.f64)) then VCC = 0x1LL; if S0.f64 == S1.f64 then // Only scale the denominator D0.f64 = ldexp(S0.f64, 128) endif elsif 1.0 / S1.f64 == DENORM.f64 then D0.f64 = ldexp(S0.f64, -128) elsif S2.f64 / S1.f64 == DENORM.f64 then VCC = 0x1LL; if S0.f64 == S2.f64 then // Only scale the numerator D0.f64 = ldexp(S0.f64, 128) endif elsif exponent(S2.f64) <= 53 then // Numerator is tiny D0.f64 = ldexp(S0.f64, 128) endif V_MAD_U64_U32 766 Multiply and add unsigned integers and produce a 64-bit result. { vcc_out.u1, D0.u64 } = 65'B(65'U(S0.u) * 65'U(S1.u) + 65'U(S2.u64)) V_MAD_I64_I32 767 Multiply and add signed integers and produce a 64-bit result. { vcc_out.u1, D0.i64 } = 65'B(65'I(S0.i) * 65'I(S1.i) + 65'I(S2.i64)) V_ADD_CO_U32 768 Add two unsigned integers with carry-out. tmp = 64'U(S0.u) + 64'U(S1.u); VCC.u64[laneId] = tmp >= 0x100000000ULL ? 1'1U : 1'0U; // VCC is an UNSIGNED overflow/carry-out for V_ADD_CO_CI_U32. 16.12. VOP3 & VOP3SD Instructions 424 of 597 "RDNA3" Instruction Set Architecture D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair. If clamp enabled, saturate output between 0 and maxUInt32. V_SUB_CO_U32 769 Subtract the second unsigned integer from the first with carry-out. tmp = S0.u - S1.u; VCC.u64[laneId] = S1.u > S0.u ? 1'1U : 1'0U; // VCC is an UNSIGNED overflow/carry-out for V_SUB_CO_CI_U32. D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair. If clamp enabled, saturate output between 0 and maxUInt32. V_SUBREV_CO_U32 770 Subtract the first unsigned integer from the second with carry-out. tmp = S1.u - S0.u; VCC.u64[laneId] = S0.u > S1.u ? 1'1U : 1'0U; // VCC is an UNSIGNED overflow/carry-out for V_SUB_CO_CI_U32. D0.u = tmp.u Notes In VOP3 the VCC destination may be an arbitrary SGPR-pair. If clamp enabled, saturate output between 0 and maxUInt32. V_ADD_NC_U16 771 Add two unsigned shorts. No carry-in or carry-out. D0.u16 = S0.u16 + S1.u16 16.12. VOP3 & VOP3SD Instructions 425 of 597 "RDNA3" Instruction Set Architecture Notes Supports saturation (unsigned 16-bit integer domain). V_SUB_NC_U16 772 Subtract the second unsigned short from the first. No carry-in or carry-out. D0.u16 = S0.u16 - S1.u16 Notes Supports saturation (unsigned 16-bit integer domain). V_MUL_LO_U16 773 Multiply two unsigned shorts. D0.u16 = S0.u16 * S1.u16 Notes Supports saturation (unsigned 16-bit integer domain). V_CVT_PK_I16_F32 774 Convert two single-precision floats into a packed value of signed words. D0[31 : 16] = 16'B(v_cvt_i16_f32(S1.f)); D0[15 : 0] = 16'B(v_cvt_i16_f32(S0.f)) V_CVT_PK_U16_F32 775 Convert two single-precision floats into a packed value of unsigned words. D0[31 : 16] = 16'B(v_cvt_u16_f32(S1.f)); D0[15 : 0] = 16'B(v_cvt_u16_f32(S0.f)) 16.12. VOP3 & VOP3SD Instructions 426 of 597 "RDNA3" Instruction Set Architecture V_MAX_U16 777 Maximum of two unsigned shorts. D0.u16 = S0.u16 >= S1.u16 ? S0.u16 : S1.u16 V_MAX_I16 778 Maximum of two signed shorts. D0.i16 = S0.i16 >= S1.i16 ? S0.i16 : S1.i16 V_MIN_U16 779 Minimum of two unsigned shorts. D0.u16 = S0.u16 < S1.u16 ? S0.u16 : S1.u16 V_MIN_I16 780 Minimum of two signed shorts. D0.i16 = S0.i16 < S1.i16 ? S0.i16 : S1.i16 V_ADD_NC_I16 781 Add two signed shorts. No carry-in or carry-out. D0.i16 = S0.i16 + S1.i16 Notes Supports saturation (signed 16-bit integer domain). V_SUB_NC_I16 16.12. VOP3 & VOP3SD Instructions 782 427 of 597 "RDNA3" Instruction Set Architecture Subtract the second signed short from the first. No carry-in or carry-out. D0.i16 = S0.i16 - S1.i16 Notes Supports saturation (signed 16-bit integer domain). V_PACK_B32_F16 785 Pack two FP16 values together. D0[31 : 16].f16 = S1.f16; D0[15 : 0].f16 = S0.f16 V_CVT_PK_NORM_I16_F16 786 Convert two FP16 values into packed signed normalized shorts. D0[15 : 0].i16 = f16_to_snorm(S0[15 : 0].f16); D0[31 : 16].i16 = f16_to_snorm(S1[15 : 0].f16) V_CVT_PK_NORM_U16_F16 787 Convert two FP16 values into packed unsigned normalized shorts. D0[15 : 0].u16 = f16_to_unorm(S0[15 : 0].f16); D0[31 : 16].u16 = f16_to_unorm(S1[15 : 0].f16) V_LDEXP_F32 796 Load exponent. Multiply a single-precision float by an integral power of 2, compare with the ldexp() function in C. D0.f = S0.f * 2.0F ** S1.i 16.12. VOP3 & VOP3SD Instructions 428 of 597 "RDNA3" Instruction Set Architecture V_BFM_B32 797 Bitfield modify. S0 is the bitfield width and S1 is the bitfield offset. D0.u = 32'U((1 << S0[4 : 0].u) - 1 << S1[4 : 0].u) V_BCNT_U32_B32 798 Bit count. D0.u = S1.u; for i in 0 : 31 do D0.u += S0[i].u; // count i'th bit endfor V_MBCNT_LO_U32_B32 799 Masked bit count. laneId is the position of this thread in the wavefront (in 0..63). See also V_MBCNT_HI_U32_B32. ThreadMask = (1LL << laneId.u) - 1LL; MaskedValue = (S0.u & ThreadMask[31 : 0].u); D0.u = S1.u; for i in 0 : 31 do D0.u += MaskedValue[i] == 1'1U ? 1U : 0U endfor V_MBCNT_HI_U32_B32 800 Masked bit count, high pass. laneId is the position of this thread in the wavefront (in 0..63). See also V_MBCNT_LO_U32_B32. ThreadMask = (1LL << laneId.u) - 1LL; MaskedValue = (S0.u & ThreadMask[63 : 32].u); D0.u = S1.u; for i in 0 : 31 do D0.u += MaskedValue[i] == 1'1U ? 1U : 0U 16.12. VOP3 & VOP3SD Instructions 429 of 597 "RDNA3" Instruction Set Architecture endfor Notes Note that in Wave32 mode ThreadMask[63:32] == 0 and this instruction simply performs a move from S1 to D. Example to compute each thread's position in 0..63: v_mbcnt_lo_u32_b32 v0, -1, 0 v_mbcnt_hi_u32_b32 v0, -1, v0 // v0 now contains laneId V_CVT_PK_NORM_I16_F32 801 Convert two single-precision floats into a packed signed normalized value. D0[15 : 0].i16 = f32_to_snorm(S0.f); D0[31 : 16].i16 = f32_to_snorm(S1.f) V_CVT_PK_NORM_U16_F32 802 Convert two single-precision floats into a packed unsigned normalized value. D0[15 : 0].u16 = f32_to_unorm(S0.f); D0[31 : 16].u16 = f32_to_unorm(S1.f) V_CVT_PK_U16_U32 803 Convert two unsigned integers into a packed unsigned short. D0[15 : 0].u16 = u32_to_u16(S0.u); D0[31 : 16].u16 = u32_to_u16(S1.u) V_CVT_PK_I16_I32 804 Convert two signed integers into a packed signed short. D0[15 : 0].i16 = i32_to_i16(S0.i); 16.12. VOP3 & VOP3SD Instructions 430 of 597 "RDNA3" Instruction Set Architecture D0[31 : 16].i16 = i32_to_i16(S1.i) V_SUB_NC_I32 805 Subtract the second signed integer from the first. No carry-in or carry-out. D0.i = S0.i - S1.i Notes Supports saturation (signed 32-bit integer domain). V_ADD_NC_I32 806 Add two signed integers. No carry-in or carry-out. D0.i = S0.i + S1.i Notes Supports saturation (signed 32-bit integer domain). V_ADD_F64 807 Add two double-precision values. D0.f64 = S0.f64 + S1.f64 Notes 0.5ULP precision, denormals are supported. V_MUL_F64 808 Multiply two double-precision values. D0.f64 = S0.f64 * S1.f64 16.12. VOP3 & VOP3SD Instructions 431 of 597 "RDNA3" Instruction Set Architecture Notes 0.5ULP precision, denormals are supported. V_MIN_F64 809 Compute the minimum of two floats. LT_NEG_ZERO = lambda(a, b) ( ((a < b) || ((abs(a) == 0.0) && (abs(b) == 0.0) && sign(a) && !sign(b)))); // Version of comparison where -0.0 < +0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(S0.f64) then D0.f64 = cvtToQuietNAN(S0.f64) elsif isSignalNAN(S1.f64) then D0.f64 = cvtToQuietNAN(S1.f64) elsif isQuietNAN(S1.f64) then D0.f64 = S0.f64 elsif isQuietNAN(S0.f64) then D0.f64 = S1.f64 elsif LT_NEG_ZERO(S0.f64, S1.f64) then // NOTE: -0<+0 is TRUE in this comparison D0.f64 = S0.f64 else D0.f64 = S1.f64 endif else if isNAN(S1.f64) then D0.f64 = S0.f64 elsif isNAN(S0.f64) then D0.f64 = S1.f64 elsif LT_NEG_ZERO(S0.f64, S1.f64) then // NOTE: -0<+0 is TRUE in this comparison D0.f64 = S0.f64 else D0.f64 = S1.f64 endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. 16.12. VOP3 & VOP3SD Instructions 432 of 597 "RDNA3" Instruction Set Architecture V_MAX_F64 810 Compute the maximum of two floats. GT_NEG_ZERO = lambda(a, b) ( ((a > b) || ((abs(a) == 0.0) && (abs(b) == 0.0) && !sign(a) && sign(b)))); // Version of comparison where +0.0 > -0.0, differs from IEEE if WAVE_MODE.IEEE then if isSignalNAN(S0.f64) then D0.f64 = cvtToQuietNAN(S0.f64) elsif isSignalNAN(S1.f64) then D0.f64 = cvtToQuietNAN(S1.f64) elsif isQuietNAN(S1.f64) then D0.f64 = S0.f64 elsif isQuietNAN(S0.f64) then D0.f64 = S1.f64 elsif GT_NEG_ZERO(S0.f64, S1.f64) then // NOTE: +0>-0 is TRUE in this comparison D0.f64 = S0.f64 else D0.f64 = S1.f64 endif else if isNAN(S1.f64) then D0.f64 = S0.f64 elsif isNAN(S0.f64) then D0.f64 = S1.f64 elsif GT_NEG_ZERO(S0.f64, S1.f64) then // NOTE: +0>-0 is TRUE in this comparison D0.f64 = S0.f64 else D0.f64 = S1.f64 endif endif; // Inequalities in the above pseudocode behave differently from IEEE // when both inputs are +-0. Notes IEEE compliant. Supports denormals, round mode, exception flags, saturation. Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can be a denorm and this operation outputs as a denorm regardless of output denorm mode. V_LDEXP_F64 811 Load exponent. Multiply a double-precision float by an integral power of 2, compare with the ldexp() function in C. 16.12. VOP3 & VOP3SD Instructions 433 of 597 "RDNA3" Instruction Set Architecture D0.f64 = S0.f64 * 2.0 ** S1.i V_MUL_LO_U32 812 Multiply two unsigned integers. D0.u = S0.u * S1.u Notes If you only need to multiply integers with small magnitudes consider V_MUL_U32_U24, which is a more efficient implementation. V_MUL_HI_U32 813 Multiply two unsigned integers and store the high 32 bits of the result. D0.u = 32'U(64'U(S0.u) * 64'U(S1.u) >> 32U) Notes If you only need to multiply integers with small magnitudes consider V_MUL_HI_U32_U24, which is a more efficient implementation. V_MUL_HI_I32 814 Multiply two signed integers and store the high 32 bits of the result. D0.i = 32'I(64'I(S0.i) * 64'I(S1.i) >> 32U) Notes If you only need to multiply integers with small magnitudes consider V_MUL_HI_I32_I24, which is a more efficient implementation. V_TRIG_PREOP_F64 815 Look Up 2/PI (S0.f64) with segment select S1.u[4:0]. 16.12. VOP3 & VOP3SD Instructions 434 of 597 "RDNA3" Instruction Set Architecture This operation returns an aligned, double precision segment of 2/PI needed to do range reduction on S0.f64 (double-precision value). Multiple segments can be specified through S1.u[4:0]. Rounding is round-to-zero. Large inputs (exp > 1968) are scaled to avoid loss of precision through denormalization. shift = 32'I(S1[4 : 0].u) * 53; if exponent(S0.f64) > 1077 then shift += exponent(S0.f64) - 1077 endif; // (2.0/PI) == 0.{b_1200, b_1199, b_1198, ..., b_1, b_0} // b_1200 is the MSB of the fractional part of 2.0/PI // Left shift operation indicates which bits are brought // into the whole part of the number. // Only whole part of result is kept. result = 64'F((1201'B(2.0 / PI)[1200 : 0] << shift.u) & 1201'0x1fffffffffffff); scale = -53 - shift; if exponent(S0.f64) >= 1968 then scale += 128 endif; D0.f64 = ldexp(result, scale) V_LSHLREV_B16 824 Logical shift left, count is in the first operand. D0.u[15 : 0] = S1.u[15 : 0] << S0.u[3 : 0].u V_LSHRREV_B16 825 Logical shift right, count is in the first operand. D0.u[15 : 0] = S1.u[15 : 0] >> S0.u[3 : 0].u V_ASHRREV_I16 826 Arithmetic shift right (preserve sign bit), count is in the first operand. D0.i[15 : 0] = 16'I(signext(S1.i[15 : 0]) >> S0.u[3 : 0].u) V_LSHLREV_B64 16.12. VOP3 & VOP3SD Instructions 828 435 of 597 "RDNA3" Instruction Set Architecture Logical shift left, count is in the first operand. D0.u64 = S1.u64 << S0.u[5 : 0].u Notes Only one scalar broadcast constant is allowed. V_LSHRREV_B64 829 Logical shift right, count is in the first operand. D0.u64 = S1.u64 >> S0.u[5 : 0].u Notes Only one scalar broadcast constant is allowed. V_ASHRREV_I64 830 Arithmetic shift right (preserve sign bit), count is in the first operand. D0.u64 = 64'U(signext(S1.u64) >> S0.u[5 : 0].u) Notes Only one scalar broadcast constant is allowed. V_READLANE_B32 864 Copy one VGPR value from a single lane to one SGPR. declare lane : 32'U; if WAVE32 then lane = S1.u[4 : 0].u; // Lane select for wave32 else lane = S1.u[5 : 0].u; // Lane select for wave64 endif; D0.b = VGPR[lane][SRC0.u] 16.12. VOP3 & VOP3SD Instructions 436 of 597 "RDNA3" Instruction Set Architecture Notes Ignores EXEC mask for the VGPR read. Input and output modifiers not supported; this is an untyped operation. V_WRITELANE_B32 865 Write scalar value into one VGPR in one lane. declare lane : 32'U; if WAVE32 then lane = S1.u[4 : 0].u; // Lane select for wave32 else lane = S1.u[5 : 0].u; // Lane select for wave64 endif; VGPR[lane][VDST.u] = S0.b Notes Ignores EXEC mask for the VGPR write. Input and output modifiers not supported; this is an untyped operation. V_AND_B16 866 Bitwise AND. D0.u16 = (S0.u16 & S1.u16) Notes Input and output modifiers not supported. V_OR_B16 867 Bitwise OR. D0.u16 = (S0.u16 | S1.u16) Notes Input and output modifiers not supported. 16.12. VOP3 & VOP3SD Instructions 437 of 597 "RDNA3" Instruction Set Architecture V_XOR_B16 868 Bitwise XOR. D0.u16 = (S0.u16 ^ S1.u16) Notes Input and output modifiers not supported. V_CMP_F_F16 0 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_F16 1 A less than B. D0.u64[laneId] = S0.f16 < S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_F16 2 A equal to B. D0.u64[laneId] = S0.f16 == S1.f16; // D0 = VCC in VOPC encoding. Notes 16.12. VOP3 & VOP3SD Instructions 438 of 597 "RDNA3" Instruction Set Architecture Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_F16 3 A less than or equal to B. D0.u64[laneId] = S0.f16 <= S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_F16 4 A greater than B. D0.u64[laneId] = S0.f16 > S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LG_F16 5 A less than or greater than B. D0.u64[laneId] = S0.f16 <> S1.f16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_F16 6 A greater than or equal to B. D0.u64[laneId] = S0.f16 >= S1.f16; 16.12. VOP3 & VOP3SD Instructions 439 of 597 "RDNA3" Instruction Set Architecture // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_O_F16 7 A orderable with B. D0.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16))); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_U_F16 8 A not orderable with B. D0.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16))); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGE_F16 9 A not greater than or equal to B. D0.u64[laneId] = !(S0.f16 >= S1.f16); // With NAN inputs this is not the same operation as < // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 440 of 597 "RDNA3" Instruction Set Architecture V_CMP_NLG_F16 10 A not less than or greater than B. D0.u64[laneId] = !(S0.f16 <> S1.f16); // With NAN inputs this is not the same operation as == // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGT_F16 11 A not greater than B. D0.u64[laneId] = !(S0.f16 > S1.f16); // With NAN inputs this is not the same operation as <= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLE_F16 12 A not less than or equal to B. D0.u64[laneId] = !(S0.f16 <= S1.f16); // With NAN inputs this is not the same operation as > // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NEQ_F16 13 A not equal to B. D0.u64[laneId] = !(S0.f16 == S1.f16); // With NAN inputs this is not the same operation as != 16.12. VOP3 & VOP3SD Instructions 441 of 597 "RDNA3" Instruction Set Architecture // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLT_F16 14 A not less than B. D0.u64[laneId] = !(S0.f16 < S1.f16); // With NAN inputs this is not the same operation as >= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_F16 15 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_F32 16 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 442 of 597 "RDNA3" Instruction Set Architecture V_CMP_LT_F32 17 A less than B. D0.u64[laneId] = S0.f < S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_F32 18 A equal to B. D0.u64[laneId] = S0.f == S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_F32 19 A less than or equal to B. D0.u64[laneId] = S0.f <= S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_F32 20 A greater than B. D0.u64[laneId] = S0.f > S1.f; // D0 = VCC in VOPC encoding. Notes 16.12. VOP3 & VOP3SD Instructions 443 of 597 "RDNA3" Instruction Set Architecture Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LG_F32 21 A less than or greater than B. D0.u64[laneId] = S0.f <> S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_F32 22 A greater than or equal to B. D0.u64[laneId] = S0.f >= S1.f; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_O_F32 23 A orderable with B. D0.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f))); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_U_F32 24 A not orderable with B. D0.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f))); 16.12. VOP3 & VOP3SD Instructions 444 of 597 "RDNA3" Instruction Set Architecture // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGE_F32 25 A not greater than or equal to B. D0.u64[laneId] = !(S0.f >= S1.f); // With NAN inputs this is not the same operation as < // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLG_F32 26 A not less than or greater than B. D0.u64[laneId] = !(S0.f <> S1.f); // With NAN inputs this is not the same operation as == // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGT_F32 27 A not greater than B. D0.u64[laneId] = !(S0.f > S1.f); // With NAN inputs this is not the same operation as <= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 445 of 597 "RDNA3" Instruction Set Architecture V_CMP_NLE_F32 28 A not less than or equal to B. D0.u64[laneId] = !(S0.f <= S1.f); // With NAN inputs this is not the same operation as > // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NEQ_F32 29 A not equal to B. D0.u64[laneId] = !(S0.f == S1.f); // With NAN inputs this is not the same operation as != // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLT_F32 30 A not less than B. D0.u64[laneId] = !(S0.f < S1.f); // With NAN inputs this is not the same operation as >= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_F32 31 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. 16.12. VOP3 & VOP3SD Instructions 446 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_F64 32 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_F64 33 A less than B. D0.u64[laneId] = S0.f64 < S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_F64 34 A equal to B. D0.u64[laneId] = S0.f64 == S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_F64 35 A less than or equal to B. 16.12. VOP3 & VOP3SD Instructions 447 of 597 "RDNA3" Instruction Set Architecture D0.u64[laneId] = S0.f64 <= S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_F64 36 A greater than B. D0.u64[laneId] = S0.f64 > S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LG_F64 37 A less than or greater than B. D0.u64[laneId] = S0.f64 <> S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_F64 38 A greater than or equal to B. D0.u64[laneId] = S0.f64 >= S1.f64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 448 of 597 "RDNA3" Instruction Set Architecture V_CMP_O_F64 39 A orderable with B. D0.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64)); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_U_F64 40 A not orderable with B. D0.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64)); // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGE_F64 41 A not greater than or equal to B. D0.u64[laneId] = !(S0.f64 >= S1.f64); // With NAN inputs this is not the same operation as < // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLG_F64 42 A not less than or greater than B. D0.u64[laneId] = !(S0.f64 <> S1.f64); // With NAN inputs this is not the same operation as == // D0 = VCC in VOPC encoding. 16.12. VOP3 & VOP3SD Instructions 449 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NGT_F64 43 A not greater than B. D0.u64[laneId] = !(S0.f64 > S1.f64); // With NAN inputs this is not the same operation as <= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLE_F64 44 A not less than or equal to B. D0.u64[laneId] = !(S0.f64 <= S1.f64); // With NAN inputs this is not the same operation as > // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NEQ_F64 45 A not equal to B. D0.u64[laneId] = !(S0.f64 == S1.f64); // With NAN inputs this is not the same operation as != // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NLT_F64 16.12. VOP3 & VOP3SD Instructions 46 450 of 597 "RDNA3" Instruction Set Architecture A not less than B. D0.u64[laneId] = !(S0.f64 < S1.f64); // With NAN inputs this is not the same operation as >= // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_F64 47 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_I16 49 A less than B. D0.u64[laneId] = S0.i16 < S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_I16 50 A equal to B. D0.u64[laneId] = S0.i16 == S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 451 of 597 "RDNA3" Instruction Set Architecture V_CMP_LE_I16 51 A less than or equal to B. D0.u64[laneId] = S0.i16 <= S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_I16 52 A greater than B. D0.u64[laneId] = S0.i16 > S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_I16 53 A not equal to B. D0.u64[laneId] = S0.i16 <> S1.i16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_I16 54 A greater than or equal to B. D0.u64[laneId] = S0.i16 >= S1.i16; // D0 = VCC in VOPC encoding. 16.12. VOP3 & VOP3SD Instructions 452 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_U16 57 A less than B. D0.u64[laneId] = S0.u16 < S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_U16 58 A equal to B. D0.u64[laneId] = S0.u16 == S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_U16 59 A less than or equal to B. D0.u64[laneId] = S0.u16 <= S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_U16 60 A greater than B. 16.12. VOP3 & VOP3SD Instructions 453 of 597 "RDNA3" Instruction Set Architecture D0.u64[laneId] = S0.u16 > S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_U16 61 A not equal to B. D0.u64[laneId] = S0.u16 <> S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_U16 62 A greater than or equal to B. D0.u64[laneId] = S0.u16 >= S1.u16; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_I32 64 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 454 of 597 "RDNA3" Instruction Set Architecture V_CMP_LT_I32 65 A less than B. D0.u64[laneId] = S0.i < S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_I32 66 A equal to B. D0.u64[laneId] = S0.i == S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_I32 67 A less than or equal to B. D0.u64[laneId] = S0.i <= S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_I32 68 A greater than B. D0.u64[laneId] = S0.i > S1.i; // D0 = VCC in VOPC encoding. Notes 16.12. VOP3 & VOP3SD Instructions 455 of 597 "RDNA3" Instruction Set Architecture Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_I32 69 A not equal to B. D0.u64[laneId] = S0.i <> S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_I32 70 A greater than or equal to B. D0.u64[laneId] = S0.i >= S1.i; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_I32 71 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_U32 72 False. D0.u64[laneId] = 1'0U; 16.12. VOP3 & VOP3SD Instructions 456 of 597 "RDNA3" Instruction Set Architecture // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_U32 73 A less than B. D0.u64[laneId] = S0.u < S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_U32 74 A equal to B. D0.u64[laneId] = S0.u == S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_U32 75 A less than or equal to B. D0.u64[laneId] = S0.u <= S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_U32 16.12. VOP3 & VOP3SD Instructions 76 457 of 597 "RDNA3" Instruction Set Architecture A greater than B. D0.u64[laneId] = S0.u > S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_U32 77 A not equal to B. D0.u64[laneId] = S0.u <> S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_U32 78 A greater than or equal to B. D0.u64[laneId] = S0.u >= S1.u; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_U32 79 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 458 of 597 "RDNA3" Instruction Set Architecture V_CMP_F_I64 80 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_I64 81 A less than B. D0.u64[laneId] = S0.i64 < S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_I64 82 A equal to B. D0.u64[laneId] = S0.i64 == S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LE_I64 83 A less than or equal to B. D0.u64[laneId] = S0.i64 <= S1.i64; // D0 = VCC in VOPC encoding. 16.12. VOP3 & VOP3SD Instructions 459 of 597 "RDNA3" Instruction Set Architecture Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_I64 84 A greater than B. D0.u64[laneId] = S0.i64 > S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_I64 85 A not equal to B. D0.u64[laneId] = S0.i64 <> S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_I64 86 A greater than or equal to B. D0.u64[laneId] = S0.i64 >= S1.i64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_I64 87 True. 16.12. VOP3 & VOP3SD Instructions 460 of 597 "RDNA3" Instruction Set Architecture D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_F_U64 88 False. D0.u64[laneId] = 1'0U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_LT_U64 89 A less than B. D0.u64[laneId] = S0.u64 < S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_EQ_U64 90 A equal to B. D0.u64[laneId] = S0.u64 == S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 461 of 597 "RDNA3" Instruction Set Architecture V_CMP_LE_U64 91 A less than or equal to B. D0.u64[laneId] = S0.u64 <= S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GT_U64 92 A greater than B. D0.u64[laneId] = S0.u64 > S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_NE_U64 93 A not equal to B. D0.u64[laneId] = S0.u64 <> S1.u64; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_GE_U64 94 A greater than or equal to B. D0.u64[laneId] = S0.u64 >= S1.u64; // D0 = VCC in VOPC encoding. Notes 16.12. VOP3 & VOP3SD Instructions 462 of 597 "RDNA3" Instruction Set Architecture Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_T_U64 95 True. D0.u64[laneId] = 1'1U; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_CLASS_F16 125 IEEE numeric class function specified in S1.u, performed on S0.f16. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(64'F(S0.f16)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f16)) then result = S1.u[1] elsif exponent(S0.f16) == 31 then // +-INF result = S1.u[sign(S0.f16) ? 2 : 9] elsif exponent(S0.f16) > 0 then // +-normal value result = S1.u[sign(S0.f16) ? 3 : 8] elsif 64'F(abs(S0.f16)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f16) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f16) ? 5 : 6] 16.12. VOP3 & VOP3SD Instructions 463 of 597 "RDNA3" Instruction Set Architecture endif; D0.u64[laneId] = result; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMP_CLASS_F32 126 IEEE numeric class function specified in S1.u, performed on S0.f. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(64'F(S0.f)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f)) then result = S1.u[1] elsif exponent(S0.f) == 255 then // +-INF result = S1.u[sign(S0.f) ? 2 : 9] elsif exponent(S0.f) > 0 then // +-normal value result = S1.u[sign(S0.f) ? 3 : 8] elsif 64'F(abs(S0.f)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f) ? 5 : 6] endif; D0.u64[laneId] = result; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 464 of 597 "RDNA3" Instruction Set Architecture V_CMP_CLASS_F64 127 IEEE numeric class function specified in S1.u, performed on S0.f64. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(S0.f64) then result = S1.u[0] elsif isQuietNAN(S0.f64) then result = S1.u[1] elsif exponent(S0.f64) == 1023 then // +-INF result = S1.u[sign(S0.f64) ? 2 : 9] elsif exponent(S0.f64) > 0 then // +-normal value result = S1.u[sign(S0.f64) ? 3 : 8] elsif abs(S0.f64) > 0.0 then // +-denormal value result = S1.u[sign(S0.f64) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f64) ? 5 : 6] endif; D0.u64[laneId] = result; // D0 = VCC in VOPC encoding. Notes Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_F16 128 False. 16.12. VOP3 & VOP3SD Instructions 465 of 597 "RDNA3" Instruction Set Architecture EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_F16 129 A less than B. EXEC.u64[laneId] = S0.f16 < S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_F16 130 A equal to B. EXEC.u64[laneId] = S0.f16 == S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_F16 131 A less than or equal to B. EXEC.u64[laneId] = S0.f16 <= S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_F16 132 A greater than B. 16.12. VOP3 & VOP3SD Instructions 466 of 597 "RDNA3" Instruction Set Architecture EXEC.u64[laneId] = S0.f16 > S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LG_F16 133 A less than or greater than B. EXEC.u64[laneId] = S0.f16 <> S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_F16 134 A greater than or equal to B. EXEC.u64[laneId] = S0.f16 >= S1.f16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_O_F16 135 A orderable with B. EXEC.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16))) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_U_F16 136 A not orderable with B. 16.12. VOP3 & VOP3SD Instructions 467 of 597 "RDNA3" Instruction Set Architecture EXEC.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16))) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGE_F16 137 A not greater than or equal to B. EXEC.u64[laneId] = !(S0.f16 >= S1.f16); // With NAN inputs this is not the same operation as < Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLG_F16 138 A not less than or greater than B. EXEC.u64[laneId] = !(S0.f16 <> S1.f16); // With NAN inputs this is not the same operation as == Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGT_F16 139 A not greater than B. EXEC.u64[laneId] = !(S0.f16 > S1.f16); // With NAN inputs this is not the same operation as <= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 468 of 597 "RDNA3" Instruction Set Architecture V_CMPX_NLE_F16 140 A not less than or equal to B. EXEC.u64[laneId] = !(S0.f16 <= S1.f16); // With NAN inputs this is not the same operation as > Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NEQ_F16 141 A not equal to B. EXEC.u64[laneId] = !(S0.f16 == S1.f16); // With NAN inputs this is not the same operation as != Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLT_F16 142 A not less than B. EXEC.u64[laneId] = !(S0.f16 < S1.f16); // With NAN inputs this is not the same operation as >= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_F16 143 True. EXEC.u64[laneId] = 1'1U Notes 16.12. VOP3 & VOP3SD Instructions 469 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_F32 144 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_F32 145 A less than B. EXEC.u64[laneId] = S0.f < S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_F32 146 A equal to B. EXEC.u64[laneId] = S0.f == S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_F32 147 A less than or equal to B. EXEC.u64[laneId] = S0.f <= S1.f Notes 16.12. VOP3 & VOP3SD Instructions 470 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_F32 148 A greater than B. EXEC.u64[laneId] = S0.f > S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LG_F32 149 A less than or greater than B. EXEC.u64[laneId] = S0.f <> S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_F32 150 A greater than or equal to B. EXEC.u64[laneId] = S0.f >= S1.f Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_O_F32 151 A orderable with B. EXEC.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f))) Notes 16.12. VOP3 & VOP3SD Instructions 471 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_U_F32 152 A not orderable with B. EXEC.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f))) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGE_F32 153 A not greater than or equal to B. EXEC.u64[laneId] = !(S0.f >= S1.f); // With NAN inputs this is not the same operation as < Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLG_F32 154 A not less than or greater than B. EXEC.u64[laneId] = !(S0.f <> S1.f); // With NAN inputs this is not the same operation as == Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGT_F32 155 A not greater than B. EXEC.u64[laneId] = !(S0.f > S1.f); 16.12. VOP3 & VOP3SD Instructions 472 of 597 "RDNA3" Instruction Set Architecture // With NAN inputs this is not the same operation as <= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLE_F32 156 A not less than or equal to B. EXEC.u64[laneId] = !(S0.f <= S1.f); // With NAN inputs this is not the same operation as > Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NEQ_F32 157 A not equal to B. EXEC.u64[laneId] = !(S0.f == S1.f); // With NAN inputs this is not the same operation as != Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLT_F32 158 A not less than B. EXEC.u64[laneId] = !(S0.f < S1.f); // With NAN inputs this is not the same operation as >= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_F32 16.12. VOP3 & VOP3SD Instructions 159 473 of 597 "RDNA3" Instruction Set Architecture True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_F64 160 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_F64 161 A less than B. EXEC.u64[laneId] = S0.f64 < S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_F64 162 A equal to B. EXEC.u64[laneId] = S0.f64 == S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_F64 16.12. VOP3 & VOP3SD Instructions 163 474 of 597 "RDNA3" Instruction Set Architecture A less than or equal to B. EXEC.u64[laneId] = S0.f64 <= S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_F64 164 A greater than B. EXEC.u64[laneId] = S0.f64 > S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LG_F64 165 A less than or greater than B. EXEC.u64[laneId] = S0.f64 <> S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_F64 166 A greater than or equal to B. EXEC.u64[laneId] = S0.f64 >= S1.f64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_O_F64 16.12. VOP3 & VOP3SD Instructions 167 475 of 597 "RDNA3" Instruction Set Architecture A orderable with B. EXEC.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64)) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_U_F64 168 A not orderable with B. EXEC.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64)) Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NGE_F64 169 A not greater than or equal to B. EXEC.u64[laneId] = !(S0.f64 >= S1.f64); // With NAN inputs this is not the same operation as < Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLG_F64 170 A not less than or greater than B. EXEC.u64[laneId] = !(S0.f64 <> S1.f64); // With NAN inputs this is not the same operation as == Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 476 of 597 "RDNA3" Instruction Set Architecture V_CMPX_NGT_F64 171 A not greater than B. EXEC.u64[laneId] = !(S0.f64 > S1.f64); // With NAN inputs this is not the same operation as <= Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLE_F64 172 A not less than or equal to B. EXEC.u64[laneId] = !(S0.f64 <= S1.f64); // With NAN inputs this is not the same operation as > Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NEQ_F64 173 A not equal to B. EXEC.u64[laneId] = !(S0.f64 == S1.f64); // With NAN inputs this is not the same operation as != Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NLT_F64 174 A not less than B. EXEC.u64[laneId] = !(S0.f64 < S1.f64); // With NAN inputs this is not the same operation as >= Notes 16.12. VOP3 & VOP3SD Instructions 477 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_F64 175 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_I16 177 A less than B. EXEC.u64[laneId] = S0.i16 < S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_I16 178 A equal to B. EXEC.u64[laneId] = S0.i16 == S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_I16 179 A less than or equal to B. EXEC.u64[laneId] = S0.i16 <= S1.i16 Notes 16.12. VOP3 & VOP3SD Instructions 478 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_I16 180 A greater than B. EXEC.u64[laneId] = S0.i16 > S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_I16 181 A not equal to B. EXEC.u64[laneId] = S0.i16 <> S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_I16 182 A greater than or equal to B. EXEC.u64[laneId] = S0.i16 >= S1.i16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_U16 185 A less than B. EXEC.u64[laneId] = S0.u16 < S1.u16 Notes 16.12. VOP3 & VOP3SD Instructions 479 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_U16 186 A equal to B. EXEC.u64[laneId] = S0.u16 == S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_U16 187 A less than or equal to B. EXEC.u64[laneId] = S0.u16 <= S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_U16 188 A greater than B. EXEC.u64[laneId] = S0.u16 > S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_U16 189 A not equal to B. EXEC.u64[laneId] = S0.u16 <> S1.u16 Notes 16.12. VOP3 & VOP3SD Instructions 480 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_U16 190 A greater than or equal to B. EXEC.u64[laneId] = S0.u16 >= S1.u16 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_I32 192 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_I32 193 A less than B. EXEC.u64[laneId] = S0.i < S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_I32 194 A equal to B. EXEC.u64[laneId] = S0.i == S1.i Notes 16.12. VOP3 & VOP3SD Instructions 481 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_I32 195 A less than or equal to B. EXEC.u64[laneId] = S0.i <= S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_I32 196 A greater than B. EXEC.u64[laneId] = S0.i > S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_I32 197 A not equal to B. EXEC.u64[laneId] = S0.i <> S1.i Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_I32 198 A greater than or equal to B. EXEC.u64[laneId] = S0.i >= S1.i Notes 16.12. VOP3 & VOP3SD Instructions 482 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_I32 199 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_U32 200 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_U32 201 A less than B. EXEC.u64[laneId] = S0.u < S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_U32 202 A equal to B. EXEC.u64[laneId] = S0.u == S1.u Notes 16.12. VOP3 & VOP3SD Instructions 483 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_U32 203 A less than or equal to B. EXEC.u64[laneId] = S0.u <= S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_U32 204 A greater than B. EXEC.u64[laneId] = S0.u > S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_U32 205 A not equal to B. EXEC.u64[laneId] = S0.u <> S1.u Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_U32 206 A greater than or equal to B. EXEC.u64[laneId] = S0.u >= S1.u Notes 16.12. VOP3 & VOP3SD Instructions 484 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_U32 207 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_I64 208 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_I64 209 A less than B. EXEC.u64[laneId] = S0.i64 < S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_I64 210 A equal to B. EXEC.u64[laneId] = S0.i64 == S1.i64 Notes 16.12. VOP3 & VOP3SD Instructions 485 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_I64 211 A less than or equal to B. EXEC.u64[laneId] = S0.i64 <= S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_I64 212 A greater than B. EXEC.u64[laneId] = S0.i64 > S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_I64 213 A not equal to B. EXEC.u64[laneId] = S0.i64 <> S1.i64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_I64 214 A greater than or equal to B. EXEC.u64[laneId] = S0.i64 >= S1.i64 Notes 16.12. VOP3 & VOP3SD Instructions 486 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_I64 215 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_F_U64 216 False. EXEC.u64[laneId] = 1'0U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LT_U64 217 A less than B. EXEC.u64[laneId] = S0.u64 < S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_EQ_U64 218 A equal to B. EXEC.u64[laneId] = S0.u64 == S1.u64 Notes 16.12. VOP3 & VOP3SD Instructions 487 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_LE_U64 219 A less than or equal to B. EXEC.u64[laneId] = S0.u64 <= S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GT_U64 220 A greater than B. EXEC.u64[laneId] = S0.u64 > S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_NE_U64 221 A not equal to B. EXEC.u64[laneId] = S0.u64 <> S1.u64 Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_GE_U64 222 A greater than or equal to B. EXEC.u64[laneId] = S0.u64 >= S1.u64 Notes 16.12. VOP3 & VOP3SD Instructions 488 of 597 "RDNA3" Instruction Set Architecture Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_T_U64 223 True. EXEC.u64[laneId] = 1'1U Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_CLASS_F16 253 IEEE numeric class function specified in S1.u, performed on S0.f16. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(64'F(S0.f16)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f16)) then result = S1.u[1] elsif exponent(S0.f16) == 31 then // +-INF result = S1.u[sign(S0.f16) ? 2 : 9] elsif exponent(S0.f16) > 0 then // +-normal value result = S1.u[sign(S0.f16) ? 3 : 8] elsif 64'F(abs(S0.f16)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f16) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f16) ? 5 : 6] endif; 16.12. VOP3 & VOP3SD Instructions 489 of 597 "RDNA3" Instruction Set Architecture EXEC.u64[laneId] = result Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. V_CMPX_CLASS_F32 254 IEEE numeric class function specified in S1.u, performed on S0.f. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(64'F(S0.f)) then result = S1.u[0] elsif isQuietNAN(64'F(S0.f)) then result = S1.u[1] elsif exponent(S0.f) == 255 then // +-INF result = S1.u[sign(S0.f) ? 2 : 9] elsif exponent(S0.f) > 0 then // +-normal value result = S1.u[sign(S0.f) ? 3 : 8] elsif 64'F(abs(S0.f)) > 0.0 then // +-denormal value result = S1.u[sign(S0.f) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f) ? 5 : 6] endif; EXEC.u64[laneId] = result Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 490 of 597 "RDNA3" Instruction Set Architecture V_CMPX_CLASS_F64 255 IEEE numeric class function specified in S1.u, performed on S0.f64. The function reports true if the floating point value is any of the numeric types selected in S1.u according to the following list: + S1.u[0] -- value is a signaling NAN. S1.u[1] -- value is a quiet NAN. S1.u[2] -- value is negative infinity. S1.u[3] -- value is a negative normal value. S1.u[4] -- value is a negative denormal value. S1.u[5] -- value is negative zero. S1.u[6] -- value is positive zero. S1.u[7] -- value is a positive denormal value. S1.u[8] -- value is a positive normal value. S1.u[9] -- value is positive infinity. declare result : 1'U; if isSignalNAN(S0.f64) then result = S1.u[0] elsif isQuietNAN(S0.f64) then result = S1.u[1] elsif exponent(S0.f64) == 1023 then // +-INF result = S1.u[sign(S0.f64) ? 2 : 9] elsif exponent(S0.f64) > 0 then // +-normal value result = S1.u[sign(S0.f64) ? 3 : 8] elsif abs(S0.f64) > 0.0 then // +-denormal value result = S1.u[sign(S0.f64) ? 4 : 7] else // +-0.0 result = S1.u[sign(S0.f64) ? 5 : 6] endif; EXEC.u64[laneId] = result Notes Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set. 16.12. VOP3 & VOP3SD Instructions 491 of 597 "RDNA3" Instruction Set Architecture 16.13. VINTERP Instructions Parameter interpolation VALU instructions. V_INTERP_P10_F32 0 Parameter interpolation, first pass. D0.f = S0[lane.i % 4 + 1].f * S1.f + S2[lane.i % 4].f Notes Performs a V_FMA_F32 operation using fixed DPP8 settings. S0 and S2 refer to a VGPR previously loaded with LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate. S0 uses a fixed DPP8 lane select of {1,1,1,1,5,5,5,5}. S2 uses a fixed DPP8 lane select of {0,0,0,0,4,4,4,4}. Example usage: s_mov_b32 m0, s0 // assume s0 contains newprim mask lds_param_load v0, attr0 // v0 is a temporary register v_interp_p10_f32 v3, v0, v1, v0 // v1 contains i coordinate v_interp_p2_f32 v3, v0, v2, v3 // v2 contains j coordinate V_INTERP_P2_F32 1 Parameter interpolation, second pass. D0.f = fma(S0[lane.i % 4 + 2].f, S1.f, S2.f) Notes Performs a V_FMA_F32 operation using fixed DPP8 settings. S0 refers to a VGPR previously loaded with LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate. S2 is the result of a previous V_INTERP_P10_F32 instruction. S0 uses a fixed DPP8 lane select of {2,2,2,2,6,6,6,6}. 16.13. VINTERP Instructions 492 of 597 "RDNA3" Instruction Set Architecture V_INTERP_P10_F16_F32 2 Parameter interpolation, first pass. D0.f = 32'F(S0[lane.i % 4 + 1].f16) * S1.f + 32'F(S2[lane.i % 4].f16) Notes Performs a hybrid 16/32-bit multiply-add operation using fixed DPP8 settings. S0 and S2 refer to a VGPR previously loaded with LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate. S0 uses a fixed DPP8 lane select of {1,1,1,1,5,5,5,5}. S2 uses a fixed DPP8 lane select of {0,0,0,0,4,4,4,4}. OPSEL is allowed for S0 and S2 to specify which half of the register to read from. Note that the I/J coordinate is 32-bit and the destination is also 32-bit. V_INTERP_P2_F16_F32 3 Parameter interpolation, second pass. D0.f16 = 16'F(32'F(S0[lane.i % 4 + 2].f16) * S1.f + S2.f) Notes Performs a hybrid 16/32-bit multiply-add operation using fixed DPP8 settings. S0 refers to a VGPR previously loaded with LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate. S2 is the result of a previous V_INTERP_P10_F16_F32 instruction. S0 uses a fixed DPP8 lane select of {2,2,2,2,6,6,6,6}. OPSEL is allowed for D and S0 to specify which half of the register to write to/read from. Note that the I/J coordinate is 32-bit and the accumulator input is also 32-bit. V_INTERP_P10_RTZ_F16_F32 4 Same as V_INTERP_P10_F16_F32 except rounding mode is overridden to round toward zero. V_INTERP_P2_RTZ_F16_F32 5 Same as V_INTERP_P2_F16_F32 except rounding mode is overridden to round toward zero. 16.13. VINTERP Instructions 493 of 597 "RDNA3" Instruction Set Architecture 16.13. VINTERP Instructions 494 of 597 "RDNA3" Instruction Set Architecture 16.14. Parameter and Direct Load from LDS Instructions These instructions load data from LDS into a VGPR where the LDS address is derived from wave state and the M0 register. LDS_PARAM_LOAD 0 Transfer parameter data from LDS to VGPRs and expand data in LDS using the NewPrimMask (provided in M0) to place per-quad data into lanes 0-3 of each quad as follows: {P0, P10, P20, 0.0} This data may be extracted using DPP8 for interpolation operations. The V_INTERP_* instructions unpack data automatically. When loading FP16 parameters, two attributes are loaded into a single VGPR: Attribute 2*ATTR is loaded into the low 16 bits and attribute 2*ATTR+1 is loaded into the high 16 bits. This instruction runs in whole quad mode: if any pixel of a quad is active then all 4 pixels of that quad are written. This is required for interpolation instructions to have all the parameter information available for the quad. LDS_DIRECT_LOAD 1 Read a single 32-bit value from LDS to all lanes. A single DWORD is read from LDS memory at ADDR[M0[15:0]], where M0[15:0] is a byte address and is dword-aligned. M0[18:16] specify the data type for the read and may be 0=UBYTE, 1=USHORT, 2=DWORD, 4=SBYTE, 5=SSHORT.  This instruction runs in whole quad mode: if any pixel of a quad is active then all 4 pixels of that quad are written. 16.14. Parameter and Direct Load from LDS Instructions 495 of 597 "RDNA3" Instruction Set Architecture 16.15. LDS & GDS Instructions This suite of instructions operates on data stored within the data share memory. The instructions transfer data between VGPRs and data share memory. The bitfield map for the LDS/GDS is: OFFSET0 = Unsigned byte offset added to the address from the ADDR VGPR. OFFSET1 = Unsigned byte offset added to the address from the ADDR VGPR. GDS = Set if GDS, cleared if LDS. OP = DS instruction opcode ADDR = Source LDS address VGPR 0 - 255. DATA0 = Source data0 VGPR 0 - 255. DATA1 = Source data1 VGPR 0 - 255. VDST = Destination VGPR 0- 255.  All instructions with RTN in the name return the value that was in memory before the operation was performed. DS_ADD_U32 0 Add data register to memory value. tmp = MEM[ADDR].u; MEM[ADDR].u += DATA.u; RETURN_DATA.u = tmp DS_SUB_U32 1 Subtract data register from memory value. tmp = MEM[ADDR].u; MEM[ADDR].u -= DATA.u; RETURN_DATA.u = tmp DS_RSUB_U32 2 Subtraction with reversed operands. 16.15. LDS & GDS Instructions 496 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].b; MEM[ADDR] = DATA.b - MEM[ADDR].b; RETURN_DATA = tmp DS_INC_U32 3 Increment memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U; RETURN_DATA.u = tmp DS_DEC_U32 4 Decrement memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U; RETURN_DATA.u = tmp DS_MIN_I32 5 Minimum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src < tmp ? src : tmp; RETURN_DATA.i = tmp DS_MAX_I32 6 Maximum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src > tmp ? src : tmp; RETURN_DATA.i = tmp 16.15. LDS & GDS Instructions 497 of 597 "RDNA3" Instruction Set Architecture DS_MIN_U32 7 Minimum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src < tmp ? src : tmp; RETURN_DATA.u = tmp DS_MAX_U32 8 Maximum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src > tmp ? src : tmp; RETURN_DATA.u = tmp DS_AND_B32 9 Bitwise AND of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp & DATA.b); RETURN_DATA.b = tmp DS_OR_B32 10 Bitwise OR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp | DATA.b); RETURN_DATA.b = tmp DS_XOR_B32 11 Bitwise XOR of register value and memory value. 16.15. LDS & GDS Instructions 498 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp ^ DATA.b); RETURN_DATA.b = tmp DS_MSKOR_B32 12 Masked dword OR, D0 contains the mask and D1 contains the new value. tmp = MEM[ADDR].b; MEM[ADDR].b = ((tmp & ~DATA.b) | DATA2.b); RETURN_DATA.b = tmp DS_STORE_B32 13 Write dword. MEM[ADDR] = DATA.b DS_STORE_2ADDR_B32 14 Write 2 dwords. MEM[ADDR_BASE.u + OFFSET0.u * 4U] = DATA.b; MEM[ADDR_BASE.u + OFFSET1.u * 4U] = DATA2.b DS_STORE_2ADDR_STRIDE64_B32 15 Write 2 dwords with larger stride. MEM[ADDR_BASE.u + OFFSET0.u * 4U * 64U] = DATA.b; MEM[ADDR_BASE.u + OFFSET1.u * 4U * 64U] = DATA2.b DS_CMPSTORE_B32 16 Compare and store. 16.15. LDS & GDS Instructions 499 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].b; src = DATA.b; cmp = DATA2.b; MEM[ADDR].b = tmp == cmp ? src : tmp; RETURN_DATA.b = tmp Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. DS_CMPSTORE_F32 17 Floating point compare and store that handles NAN/INF/denormal values. tmp = MEM[ADDR].f; src = DATA.f; cmp = DATA2.f; MEM[ADDR].f = tmp == cmp ? src : tmp; RETURN_DATA.f = tmp Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. DS_MIN_F32 18 Minimum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src < tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. DS_MAX_F32 19 Maximum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; 16.15. LDS & GDS Instructions 500 of 597 "RDNA3" Instruction Set Architecture MEM[ADDR].f = src > tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. DS_NOP 20 Do nothing. DS_ADD_F32 21 Add data register to floating-point memory value. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src + tmp; RETURN_DATA.f = tmp Notes Floating-point addition handles NAN/INF/denorm. DS_STORE_B8 30 Byte write. MEM[ADDR].b8 = DATA[7 : 0].b8 DS_STORE_B16 31 Short write. MEM[ADDR].b16 = DATA[15 : 0].b16 DS_ADD_RTN_U32 16.15. LDS & GDS Instructions 32 501 of 597 "RDNA3" Instruction Set Architecture Add data register to memory value. tmp = MEM[ADDR].u; MEM[ADDR].u += DATA.u; RETURN_DATA.u = tmp DS_SUB_RTN_U32 33 Subtract data register from memory value. tmp = MEM[ADDR].u; MEM[ADDR].u -= DATA.u; RETURN_DATA.u = tmp DS_RSUB_RTN_U32 34 Subtraction with reversed operands. tmp = MEM[ADDR].b; MEM[ADDR] = DATA.b - MEM[ADDR].b; RETURN_DATA = tmp DS_INC_RTN_U32 35 Increment memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U; RETURN_DATA.u = tmp DS_DEC_RTN_U32 36 Decrement memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U; RETURN_DATA.u = tmp 16.15. LDS & GDS Instructions 502 of 597 "RDNA3" Instruction Set Architecture DS_MIN_RTN_I32 37 Minimum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src < tmp ? src : tmp; RETURN_DATA.i = tmp DS_MAX_RTN_I32 38 Maximum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src > tmp ? src : tmp; RETURN_DATA.i = tmp DS_MIN_RTN_U32 39 Minimum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src < tmp ? src : tmp; RETURN_DATA.u = tmp DS_MAX_RTN_U32 40 Maximum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src > tmp ? src : tmp; RETURN_DATA.u = tmp DS_AND_RTN_B32 16.15. LDS & GDS Instructions 41 503 of 597 "RDNA3" Instruction Set Architecture Bitwise AND of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp & DATA.b); RETURN_DATA.b = tmp DS_OR_RTN_B32 42 Bitwise OR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp | DATA.b); RETURN_DATA.b = tmp DS_XOR_RTN_B32 43 Bitwise XOR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp ^ DATA.b); RETURN_DATA.b = tmp DS_MSKOR_RTN_B32 44 Masked dword OR, D0 contains the mask and D1 contains the new value. tmp = MEM[ADDR].b; MEM[ADDR].b = ((tmp & ~DATA.b) | DATA2.b); RETURN_DATA.b = tmp DS_STOREXCHG_RTN_B32 45 Write-exchange operation. tmp = MEM[ADDR].b; MEM[ADDR].b = DATA.b; RETURN_DATA.b = tmp 16.15. LDS & GDS Instructions 504 of 597 "RDNA3" Instruction Set Architecture DS_STOREXCHG_2ADDR_RTN_B32 46 Write-exchange 2 separate dwords. addr1 = ADDR_BASE.u + OFFSET0.u * 4U; addr2 = ADDR_BASE.u + OFFSET1.u * 4U; tmp1 = MEM[addr1].b; tmp2 = MEM[addr2].b; MEM[addr1].b = DATA.b; MEM[addr2].b = DATA2.b; // Note DATA2 can be any other register RETURN_DATA[31 : 0] = tmp1; RETURN_DATA[63 : 32] = tmp2 DS_STOREXCHG_2ADDR_STRIDE64_RTN_B32 47 Write-exchange 2 separate dwords with a stride of 64 dwords. addr1 = ADDR_BASE.u + OFFSET0.u * 4U * 64U; addr2 = ADDR_BASE.u + OFFSET1.u * 4U * 64U; tmp1 = MEM[addr1].b; tmp2 = MEM[addr2].b; MEM[addr1].b = DATA.b; MEM[addr2].b = DATA2.b; // Note DATA2 can be any other register RETURN_DATA[31 : 0] = tmp1; RETURN_DATA[63 : 32] = tmp2 DS_CMPSTORE_RTN_B32 48 Compare and store. tmp = MEM[ADDR].b; src = DATA.b; cmp = DATA2.b; MEM[ADDR].b = tmp == cmp ? src : tmp; RETURN_DATA.b = tmp Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. DS_CMPSTORE_RTN_F32 49 16.15. LDS & GDS Instructions 505 of 597 "RDNA3" Instruction Set Architecture Floating point compare and store that handles NAN/INF/denormal values. tmp = MEM[ADDR].f; src = DATA.f; cmp = DATA2.f; MEM[ADDR].f = tmp == cmp ? src : tmp; RETURN_DATA.f = tmp Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. DS_MIN_RTN_F32 50 Minimum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src < tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. DS_MAX_RTN_F32 51 Maximum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src > tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. DS_WRAP_RTN_B32 52 Wrap calculation. Intended for use in ring buffer management. tmp = MEM[ADDR].u; 16.15. LDS & GDS Instructions 506 of 597 "RDNA3" Instruction Set Architecture MEM[ADDR].u = tmp >= DATA.u ? tmp - DATA.u : tmp + DATA2.u; RETURN_DATA = tmp DS_SWIZZLE_B32 53 Dword swizzle, no data is written to LDS memory. Swizzles input thread data based on offset mask and returns; note does not read or write the DS memory banks. Note that reading from an invalid thread results in 0x0. This opcode supports two specific modes, FFT and rotate, plus two basic modes which swizzle in groups of 4 or 32 consecutive threads. The FFT mode (offset >= 0xe000) swizzles the input based on offset[4:0] to support FFT calculation. Example swizzles using input {1, 2, … 20} are: Offset[4:0]: Swizzle 0x00: {1,11,9,19,5,15,d,1d,3,13,b,1b,7,17,f,1f,2,12,a,1a,6,16,e,1e,4,14,c,1c,8,18,10,20} 0x10: {1,9,5,d,3,b,7,f,2,a,6,e,4,c,8,10,11,19,15,1d,13,1b,17,1f,12,1a,16,1e,14,1c,18,20} 0x1f: No swizzle The rotate mode (offset >= 0xc000 and offset < 0xe000) rotates the input either left (offset[10] == 0) or right (offset[10] == 1) a number of threads equal to offset[9:5]. The rotate mode also uses a mask value which can alter the rotate result. For example, mask == 1 swaps the odd threads across every other even thread (rotate left), or even threads across every other odd thread (rotate right). Offset[9:5]: Swizzle 0x01, mask=0, rotate left: {2,3,4,5,6,7,8,9,a,b,c,d,e,f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,1} 0x01, mask=0, rotate right: {20,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f} 0x01, mask=1, rotate left: {2,1,4,7,6,5,8,b,a,9,c,f,e,d,10,13,12,11,14,17,16,15,18,1b,1a,19,1c,1f,1e,1d,20,3} 0x01, mask=1, rotate right: {1e,1,4,3,2,5,8,7,6,9,c,b,a,d,10,f,e,11,14,13,12,15,18,17,16,19,1c,1b,1a,1d,20,1f} If offset < 0xc000, one of the basic swizzle modes is used based on offset[15]. If offset[15] == 1, groups of 4 consecutive threads are swizzled together. If offset[15] == 0, all 32 threads are swizzled together. The first basic swizzle mode (when offset[15] == 1) allows full data sharing between a group of 4 consecutive threads. Any thread within the group of 4 can get data from any other thread within the group of 4, specified by the corresponding offset bits --- [1:0] for the first thread, [3:2] for the second thread, [5:4] for the third thread, [7:6] for the fourth thread. Note that the offset bits apply to all groups of 4 within a wavefront; thus if offset[1:0] == 1, then thread0 grabs thread1, thread4 grabs thread5, etc. The second basic swizzle mode (when offset[15] == 0) allows limited data sharing between 32 consecutive threads. In this case, the offset is used to specify a 5-bit xor-mask, 5-bit or-mask, and 5-bit and-mask used to generate a thread mapping. Note that the offset bits apply to each group of 32 within a wavefront. The details of the thread mapping are listed below. Some example usages: SWAPX16 : xor_mask = 0x10, or_mask = 0x00, and_mask = 0x1f SWAPX8 : xor_mask = 0x08, or_mask = 0x00, and_mask = 0x1f 16.15. LDS & GDS Instructions 507 of 597 "RDNA3" Instruction Set Architecture SWAPX4 : xor_mask = 0x04, or_mask = 0x00, and_mask = 0x1f SWAPX2 : xor_mask = 0x02, or_mask = 0x00, and_mask = 0x1f SWAPX1 : xor_mask = 0x01, or_mask = 0x00, and_mask = 0x1f REVERSEX32 : xor_mask = 0x1f, or_mask = 0x00, and_mask = 0x1f REVERSEX16 : xor_mask = 0x0f, or_mask = 0x00, and_mask = 0x1f REVERSEX8 : xor_mask = 0x07, or_mask = 0x00, and_mask = 0x1f REVERSEX4 : xor_mask = 0x03, or_mask = 0x00, and_mask = 0x1f REVERSEX2 : xor_mask = 0x01 or_mask = 0x00, and_mask = 0x1f BCASTX32: xor_mask = 0x00, or_mask = thread, and_mask = 0x00 BCASTX16: xor_mask = 0x00, or_mask = thread, and_mask = 0x10 BCASTX8: xor_mask = 0x00, or_mask = thread, and_mask = 0x18 BCASTX4: xor_mask = 0x00, or_mask = thread, and_mask = 0x1c BCASTX2: xor_mask = 0x00, or_mask = thread, and_mask = 0x1e Pseudocode follows: offset = offset1:offset0; if (offset >= 0xe000) { // FFT decomposition mask = offset[4:0]; for (i = 0; i < 64; i++) { j = reverse_bits(i & 0x1f); j = (j >> count_ones(mask)); j |= (i & mask); j |= i & 0x20; thread_out[i] = thread_valid[j] ? thread_in[j] : 0; } } elsif (offset >= 0xc000) { // rotate rotate = offset[9:5]; mask = offset[4:0]; if (offset[10]) { rotate = -rotate; } for (i = 0; i < 64; i++) { j = (i & mask) | ((i + rotate) & ~mask); j |= i & 0x20; thread_out[i] = thread_valid[j] ? thread_in[j] : 0; 16.15. LDS & GDS Instructions 508 of 597 "RDNA3" Instruction Set Architecture } } elsif (offset[15]) { // full data sharing within 4 consecutive threads for (i = 0; i < 64; i+=4) { thread_out[i+0] = thread_valid[i+offset[1:0]]?thread_in[i+offset[1:0]]:0; thread_out[i+1] = thread_valid[i+offset[3:2]]?thread_in[i+offset[3:2]]:0; thread_out[i+2] = thread_valid[i+offset[5:4]]?thread_in[i+offset[5:4]]:0; thread_out[i+3] = thread_valid[i+offset[7:6]]?thread_in[i+offset[7:6]]:0; } } else { // offset[15] == 0 // limited data sharing within 32 consecutive threads xor_mask = offset[14:10]; or_mask = offset[9:5]; and_mask = offset[4:0]; for (i = 0; i < 64; i++) { j = (((i & 0x1f) & and_mask) | or_mask) ^ xor_mask; j |= (i & 0x20); // which group of 32 thread_out[i] = thread_valid[j] ? thread_in[j] : 0; } } DS_LOAD_B32 54 Read dword. RETURN_DATA = MEM[ADDR].b DS_LOAD_2ADDR_B32 55 Read 2 dwords. RETURN_DATA[31 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 4U].b; RETURN_DATA[63 : 32] = MEM[ADDR_BASE.u + OFFSET1.u * 4U].b DS_LOAD_2ADDR_STRIDE64_B32 56 Read 2 dwords with a larger stride. RETURN_DATA[31 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 4U * 64U].b; 16.15. LDS & GDS Instructions 509 of 597 "RDNA3" Instruction Set Architecture RETURN_DATA[63 : 32] = MEM[ADDR_BASE.u + OFFSET1.u * 4U * 64U].b DS_LOAD_I8 57 Signed byte read. RETURN_DATA.i = 32'I(signext(MEM[ADDR][7 : 0].i8)) DS_LOAD_U8 58 Unsigned byte read. RETURN_DATA.u = 32'U({ 24'0, MEM[ADDR][7 : 0].u8 }) DS_LOAD_I16 59 Signed short read. RETURN_DATA.i = 32'I(signext(MEM[ADDR][15 : 0].i16)) DS_LOAD_U16 60 Unsigned short read. RETURN_DATA.u = 32'U({ 16'0, MEM[ADDR][15 : 0].u16 }) DS_CONSUME 61 LDS & GDS. Subtract (count_bits(exec_mask)) from the value stored in DS memory at (M0.base + instr_offset). Return the pre-operation value to VGPRs. The DS subtracts count_bits(vector valid mask) from the value stored at address M0.base + instruction based offset and returns the pre-op value to all valid lanes. This op can be used in both the LDS and GDS. In the LDS this address is an offset to HWBASE and clamped by M0.size, but in the GDS the M0.base constant has the physical GDS address and the compiler must force offset to zero. In GDS it is for the traditional append buffer operations. In LDS it is for local thread group appends and can be used to regroup divergent threads. The use 16.15. LDS & GDS Instructions 510 of 597 "RDNA3" Instruction Set Architecture of the M0 register enables the compiler to do indexing of UAV append/consume counters. For GDS (system wide) consume, the compiler must use a zero for {offset1,offset0}, for LDS the compiler uses {offset1,offset0} to provide the relative address to the append counter in the LDS for runtime index offset or index. Inside DS, do one atomic add for first valid lane and broadcast result to all valid lanes. Offset = 0ffset1:offset0; Interpreted as byte offset. Only aligned atomics are supported, so 2 lsbs of offset must be set to zero. addr = M0.base + offset; // offset by LDS HWBASE, limit to M.size rtnval = LDS(addr); LDS(addr) = LDS(addr) - countbits(valid mask); GPR[VDST] = rtnval; // return to all valid threads DS_APPEND 62 LDS & GDS. Add (count_bits(exec_mask)) to the value stored in DS memory at (M0.base + instr_offset). Return the pre-operation value to VGPRs. The DS adds count_bits(vector valid mask) from the value stored at address M0.base + instruction based offset and return the pre-op value to all valid lanes. This op can be used in both the LDS and GDS. In the LDS this address is an offset to HWBASE and clamped by M0.size, but in the GDS the M0.base constant has the physical GDS address and the compiler must set offset to zero. In GDS it is for the traditional append buffer operations. In LDS it is for local thread group appends and can be used to regroup divergent threads. The use of the M0 register enables the compiler to do indexing of UAV append/consume counters. For GDS (system wide) consume, the compiler must use a zero for {offset1,offset0}, for LDS the compiler uses {offset1,offset0} to provide the relative address to the append counter in the LDS for runtime index offset or index. Inside DS, do one atomic add for first valid lane and broadcast result to all valid lanes. Offset = 0ffset1:offset0; Interpreted as byte offset. Only aligned atomics are supported, so 2 lsbs of offset must be set to zero. addr = M0.base + offset; // offset by LDS HWBASE, limit to M.size rtnval = LDS(addr); LDS(addr) = LDS(addr) + countbits(valid mask); GPR[VDST] = rtnval; // return to all valid threads DS_ORDERED_COUNT 63 GDS-only: Intercepted by GDS and processed by ordered append module. The ordered append module queues request until this request wave is the oldest in the queue at which time the oldest wave request is dispatched to the DS with an atomic opcode indicated by OFFSET1[5:4]. Unlike append/consume this operation is sent even if there are no valid lanes when it is issued. The GDS adds zero and advances the tracking walker that needs to match up with the dispatch counter. 16.15. LDS & GDS Instructions 511 of 597 "RDNA3" Instruction Set Architecture The following attributes are encoded in the instruction: • OFFSET0[7:2] contains the ordered_count_index (in dwords). • OFFSET1[0] contains the wave_release flag. • OFFSET1[1] contains the wave_done flag. • OFFSET1[5:4] contains the ord_idx_opcode: 2'b00 = DS_ADD_RTN_U32, 2'b01 = DS_STOREXCHG_RTN_B32, 2'b11 = DS_WRAP_RTN_B32. • VGPR_DST is the VGPR the result is written to. • VGPR_ADDR specifies the increment in the first valid lane. If no lanes are valid (EXEC = 0) then the increment is zero. • M0 normally carries {16'gds_base, 16'gds_size} for GDS usage. gds_base[15:2] is ordered_count_base[13:0] (in dwords) and gds_size is used to hold the logical_wave_id, the width is based on total number of waves in the chip. The wave type is determined automatically based on the ME_ID and QUEUE_ID of the wavefront. DS_ADD_U64 64 Add data register to 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 += DATA.u64; RETURN_DATA.u64 = tmp DS_SUB_U64 65 Subtract data register from 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 -= DATA.u64; RETURN_DATA.u64 = tmp DS_RSUB_U64 66 Subtraction with reversed operands. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = DATA.b64 - tmp; RETURN_DATA.b64 = tmp 16.15. LDS & GDS Instructions 512 of 597 "RDNA3" Instruction Set Architecture DS_INC_U64 67 Increment 64-bit memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL; RETURN_DATA.u64 = tmp DS_DEC_U64 68 Decrement 64-bit memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL; RETURN_DATA.u64 = tmp DS_MIN_I64 69 Minimum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src < tmp ? src : tmp; RETURN_DATA.i64 = tmp DS_MAX_I64 70 Maximum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src > tmp ? src : tmp; RETURN_DATA.i64 = tmp DS_MIN_U64 71 Minimum of two unsigned 64-bit integer values. 16.15. LDS & GDS Instructions 513 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src < tmp ? src : tmp; RETURN_DATA.u64 = tmp DS_MAX_U64 72 Maximum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src > tmp ? src : tmp; RETURN_DATA.u64 = tmp DS_AND_B64 73 Bitwise AND of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp & DATA.b64); RETURN_DATA.b64 = tmp DS_OR_B64 74 Bitwise OR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp | DATA.b64); RETURN_DATA.b64 = tmp DS_XOR_B64 75 Bitwise XOR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp ^ DATA.b64); RETURN_DATA.b64 = tmp 16.15. LDS & GDS Instructions 514 of 597 "RDNA3" Instruction Set Architecture DS_MSKOR_B64 76 Masked dword OR, D0 contains the mask and D1 contains the new value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = ((tmp & ~DATA.b64) | DATA2.b64); RETURN_DATA.b64 = tmp DS_STORE_B64 77 Write qword. MEM[ADDR].b64 = DATA.b64 DS_STORE_2ADDR_B64 78 Write 2 qwords. MEM[ADDR_BASE.u + OFFSET0.u * 8U].b64 = DATA.b64; MEM[ADDR_BASE.u + OFFSET1.u * 8U].b64 = DATA2.b64 DS_STORE_2ADDR_STRIDE64_B64 79 Write 2 qwords with a larger stride. MEM[ADDR_BASE.u + OFFSET0.u * 8U * 64U].b64 = DATA.b64; MEM[ADDR_BASE.u + OFFSET1.u * 8U * 64U].b64 = DATA2.b64 DS_CMPSTORE_B64 80 Compare and store. tmp = MEM[ADDR].b64; src = DATA.b64; cmp = DATA2.b64; MEM[ADDR].b64 = tmp == cmp ? src : tmp; RETURN_DATA.b64 = tmp 16.15. LDS & GDS Instructions 515 of 597 "RDNA3" Instruction Set Architecture Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. DS_CMPSTORE_F64 81 Floating point compare and store that handles NAN/INF/denormal values. tmp = MEM[ADDR].f64; src = DATA.f64; cmp = DATA2.f64; MEM[ADDR].f64 = tmp == cmp ? src : tmp; RETURN_DATA.f64 = tmp Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. DS_MIN_F64 82 Minimum of two floating-point values. tmp = MEM[ADDR].f64; src = DATA.f64; MEM[ADDR].f64 = src < tmp ? src : tmp; RETURN_DATA.f64 = tmp Notes Floating-point compare handles NAN/INF/denorm. DS_MAX_F64 83 Maximum of two floating-point values. tmp = MEM[ADDR].f64; src = DATA.f64; MEM[ADDR].f64 = src > tmp ? src : tmp; RETURN_DATA.f64 = tmp Notes Floating-point compare handles NAN/INF/denorm. 16.15. LDS & GDS Instructions 516 of 597 "RDNA3" Instruction Set Architecture DS_ADD_RTN_U64 96 Add data register to 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 += DATA.u64; RETURN_DATA.u64 = tmp DS_SUB_RTN_U64 97 Subtract data register from 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 -= DATA.u64; RETURN_DATA.u64 = tmp DS_RSUB_RTN_U64 98 Subtraction with reversed operands. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = DATA.b64 - tmp; RETURN_DATA.b64 = tmp DS_INC_RTN_U64 99 Increment 64-bit memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL; RETURN_DATA.u64 = tmp DS_DEC_RTN_U64 100 Decrement 64-bit memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u64; 16.15. LDS & GDS Instructions 517 of 597 "RDNA3" Instruction Set Architecture src = DATA.u64; MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL; RETURN_DATA.u64 = tmp DS_MIN_RTN_I64 101 Minimum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src < tmp ? src : tmp; RETURN_DATA.i64 = tmp DS_MAX_RTN_I64 102 Maximum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src > tmp ? src : tmp; RETURN_DATA.i64 = tmp DS_MIN_RTN_U64 103 Minimum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src < tmp ? src : tmp; RETURN_DATA.u64 = tmp DS_MAX_RTN_U64 104 Maximum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src > tmp ? src : tmp; RETURN_DATA.u64 = tmp 16.15. LDS & GDS Instructions 518 of 597 "RDNA3" Instruction Set Architecture DS_AND_RTN_B64 105 Bitwise AND of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp & DATA.b64); RETURN_DATA.b64 = tmp DS_OR_RTN_B64 106 Bitwise OR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp | DATA.b64); RETURN_DATA.b64 = tmp DS_XOR_RTN_B64 107 Bitwise XOR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp ^ DATA.b64); RETURN_DATA.b64 = tmp DS_MSKOR_RTN_B64 108 Masked dword OR, D0 contains the mask and D1 contains the new value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = ((tmp & ~DATA.b64) | DATA2.b64); RETURN_DATA.b64 = tmp DS_STOREXCHG_RTN_B64 109 Write-exchange operation. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = DATA.b64; 16.15. LDS & GDS Instructions 519 of 597 "RDNA3" Instruction Set Architecture RETURN_DATA.b64 = tmp DS_STOREXCHG_2ADDR_RTN_B64 110 Write-exchange 2 separate qwords. addr1 = ADDR_BASE.u + OFFSET0.u * 8U; addr2 = ADDR_BASE.u + OFFSET1.u * 8U; tmp1 = MEM[addr1].b64; tmp2 = MEM[addr2].b64; MEM[addr1].b64 = DATA.b64; MEM[addr2].b64 = DATA2.b64; // Note DATA2 can be any other register RETURN_DATA[63 : 0] = tmp1; RETURN_DATA[127 : 64] = tmp2 DS_STOREXCHG_2ADDR_STRIDE64_RTN_B64 111 Write-exchange 2 qwords with a stride of 64 qwords. addr1 = ADDR_BASE.u + OFFSET0.u * 8U * 64U; addr2 = ADDR_BASE.u + OFFSET1.u * 8U * 64U; tmp1 = MEM[addr1].b64; tmp2 = MEM[addr2].b64; MEM[addr1].b64 = DATA.b64; MEM[addr2].b64 = DATA2.b64; // Note DATA2 can be any other register RETURN_DATA[63 : 0] = tmp1; RETURN_DATA[127 : 64] = tmp2 DS_CMPSTORE_RTN_B64 112 Compare and store. tmp = MEM[ADDR].b64; src = DATA.b64; cmp = DATA2.b64; MEM[ADDR].b64 = tmp == cmp ? src : tmp; RETURN_DATA.b64 = tmp Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. 16.15. LDS & GDS Instructions 520 of 597 "RDNA3" Instruction Set Architecture DS_CMPSTORE_RTN_F64 113 Floating point compare and store that handles NAN/INF/denormal values. tmp = MEM[ADDR].f64; src = DATA.f64; cmp = DATA2.f64; MEM[ADDR].f64 = tmp == cmp ? src : tmp; RETURN_DATA.f64 = tmp Notes In this architecture the order of src and cmp agree with the BUFFER_ATOMIC_CMPSWAP opcode. DS_MIN_RTN_F64 114 Minimum of two floating-point values. tmp = MEM[ADDR].f64; src = DATA.f64; MEM[ADDR].f64 = src < tmp ? src : tmp; RETURN_DATA.f64 = tmp Notes Floating-point compare handles NAN/INF/denorm. DS_MAX_RTN_F64 115 Maximum of two floating-point values. tmp = MEM[ADDR].f64; src = DATA.f64; MEM[ADDR].f64 = src > tmp ? src : tmp; RETURN_DATA.f64 = tmp Notes Floating-point compare handles NAN/INF/denorm. DS_LOAD_B64 16.15. LDS & GDS Instructions 118 521 of 597 "RDNA3" Instruction Set Architecture Read 1 qword. RETURN_DATA = MEM[ADDR].b64 DS_LOAD_2ADDR_B64 119 Read 2 qwords. RETURN_DATA[63 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 8U].b64; RETURN_DATA[127 : 64] = MEM[ADDR_BASE.u + OFFSET1.u * 8U].b64 DS_LOAD_2ADDR_STRIDE64_B64 120 Read 2 qwords with a larger stride. RETURN_DATA[63 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 8U * 64U].b64; RETURN_DATA[127 : 64] = MEM[ADDR_BASE.u + OFFSET1.u * 8U * 64U].b64 DS_ADD_RTN_F32 121 Add data register to floating-point memory value. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src + tmp; RETURN_DATA.f = tmp Notes Floating-point addition handles NAN/INF/denorm. DS_ADD_GS_REG_RTN 122 Perform an atomic add to data in specific registers embedded in GDS rather than operating on GDS memory directly. This instruction returns the pre-op value. This instruction is only used by the GS stage and is used to facilitate streamout. The return value may be 32 bits or 64 bits depending on the GS register accessed. The data value is 32 bits. 16.15. LDS & GDS Instructions 522 of 597 "RDNA3" Instruction Set Architecture if OFFSET0[5:2] > 7 // 64-bit GS register access addr = (OFFSET0[5:2] - 8) * 2 + 8; VDST[0] = GS_REGS(addr + 0); VDST[1] = GS_REGS(addr + 1); {GS_REGS(addr + 1), GS_REGS(addr)} += DATA0[0]; // source is 32 bit else addr = OFFSET0[5:2]; VDST[0] = GS_REGS(addr); GS_REGS(addr) += DATA0[0]; endif. 32-bit GS registers: offset[5:2] Register 0 GDS_STRMOUT_BUFFER_FILLED_SIZE_0 1 GDS_STRMOUT_BUFFER_FILLED_SIZE_1 2 GDS_STRMOUT_BUFFER_FILLED_SIZE_2 3 GDS_STRMOUT_BUFFER_FILLED_SIZE_3 4 GDS_GS_0 5 GDS_GS_1 6 GDS_GS_2 7 GDS_GS_3 64-bit GS registers: offset[5:2] Register 8 GDS_STRMOUT_PRIMS_NEEDED_0 9 GDS_STRMOUT_PRIMS_WRITTEN_0 10 GDS_STRMOUT_PRIMS_NEEDED_1 11 GDS_STRMOUT_PRIMS_WRITTEN_1 12 GDS_STRMOUT_PRIMS_NEEDED_2 13 GDS_STRMOUT_PRIMS_WRITTEN_2 14 GDS_STRMOUT_PRIMS_NEEDED_3 15 GDS_STRMOUT_PRIMS_WRITTEN_3 DS_SUB_GS_REG_RTN 123 Perform an atomic subtraction from data in specific registers embedded in GDS rather than operating on GDS memory directly. This instruction returns the pre-op value. This instruction is only used by the GS stage and is used to facilitate streamout. The return value may be 32 bits or 64 bits depending on the GS register accessed. The data value is 32 bits. if OFFSET0[5:2] > 7 // 64-bit GS register access addr = (OFFSET0[5:2] - 8) * 2 + 8; VDST[0] = GS_REGS(addr + 0); VDST[1] = GS_REGS(addr + 1); 16.15. LDS & GDS Instructions 523 of 597 "RDNA3" Instruction Set Architecture {GS_REGS(addr + 1), GS_REGS(addr)} -= DATA0[0]; // source is 32 bit else addr = OFFSET0[5:2]; VDST[0] = GS_REGS(addr); GS_REGS(addr) -= DATA0[0]; endif. 32-bit GS registers: offset[5:2] Register 0 GDS_STRMOUT_BUFFER_FILLED_SIZE_0 1 GDS_STRMOUT_BUFFER_FILLED_SIZE_1 2 GDS_STRMOUT_BUFFER_FILLED_SIZE_2 3 GDS_STRMOUT_BUFFER_FILLED_SIZE_3 4 GDS_GS_0 5 GDS_GS_1 6 GDS_GS_2 7 GDS_GS_3 64-bit GS registers: offset[5:2] Register 8 GDS_STRMOUT_PRIMS_NEEDED_0 9 GDS_STRMOUT_PRIMS_WRITTEN_0 10 GDS_STRMOUT_PRIMS_NEEDED_1 11 GDS_STRMOUT_PRIMS_WRITTEN_1 12 GDS_STRMOUT_PRIMS_NEEDED_2 13 GDS_STRMOUT_PRIMS_WRITTEN_2 14 GDS_STRMOUT_PRIMS_NEEDED_3 15 GDS_STRMOUT_PRIMS_WRITTEN_3 DS_CONDXCHG32_RTN_B64 126 Conditional write exchange. declare OFFSET0 : 8'U; declare OFFSET1 : 8'U; declare RETURN_DATA : 32'U[2]; ADDR = S0.u; DATA = S1.u64; offset = { OFFSET1, OFFSET0 }; ADDR0 = ((ADDR + offset.u) & 0xfff8U); ADDR1 = ADDR0 + 4U; RETURN_DATA[0] = LDS[ADDR0].u; if DATA[31] then LDS[ADDR0] = { 1'0, DATA[30 : 0] } endif; RETURN_DATA[1] = LDS[ADDR1].u; if DATA[63] then LDS[ADDR1] = { 1'0, DATA[62 : 32] } 16.15. LDS & GDS Instructions 524 of 597 "RDNA3" Instruction Set Architecture endif DS_STORE_B8_D16_HI 160 Byte write in to high word. MEM[ADDR].b8 = DATA[23 : 16].b8 DS_STORE_B16_D16_HI 161 Short write in to high word. MEM[ADDR].b16 = DATA[31 : 16].b16 DS_LOAD_U8_D16 162 Unsigned byte read with masked return to lower word. RETURN_DATA[15 : 0].u16 = 16'U({ 8'0U, MEM[ADDR][7 : 0].u8 }) DS_LOAD_U8_D16_HI 163 Unsigned byte read with masked return to upper word. RETURN_DATA[31 : 16].u16 = 16'U({ 8'0U, MEM[ADDR][7 : 0].u8 }) DS_LOAD_I8_D16 164 Signed byte read with masked return to lower word. RETURN_DATA[15 : 0].i16 = 16'I(signext(MEM[ADDR][7 : 0].i8)) DS_LOAD_I8_D16_HI 16.15. LDS & GDS Instructions 165 525 of 597 "RDNA3" Instruction Set Architecture Signed byte read with masked return to upper word. RETURN_DATA[31 : 16].i16 = 16'I(signext(MEM[ADDR][7 : 0].i8)) DS_LOAD_U16_D16 166 Unsigned short read with masked return to lower word. RETURN_DATA[15 : 0].u16 = MEM[ADDR][15 : 0].u16 DS_LOAD_U16_D16_HI 167 Unsigned short read with masked return to upper word. RETURN_DATA[31 : 16].u16 = MEM[ADDR][15 : 0].u16 DS_BVH_STACK_RTN_B32 173 Ray tracing involves traversing a BVH which is a kind of tree where nodes have up to 4 children. Each shader thread processes one child at a time, and overflow nodes are stored temporarily in LDS using a stack. This instruction supports pushing/popping the stack to reduce the number of VALU instructions required per traversal and reduce VMEM bandwidth requirements. The LDS stack address is computed using values packed into ADDR and part of OFFSET1. ADDR carries the stack address for the lane. OFFSET1[5:4] contains stack_size[1:0] -- this value is constant for all lanes and is patched into the shader by software. Valid stack sizes are {8, 16, 32, 64}. A new stack address is returned to ADDR --- note that this VGPR is an in-out operand. DATA0 contains the last node pointer for BVH. DATA1 contains up to 4 valid data DWORDs for each thread. At a high level the first 3 DWORDs (DATA1[0:2]) is pushed to the stack if they are valid, and the last DWORD (DATA1[3]) is returned. If the last DWORD is invalid then pop the stack and return the value from memory. In general this instruction performs the following : (stack_base, stack_index) = DECODE_ADDR(ADDR, OFFSET1); last_node_ptr = DATA0; // First 3 passes: push data onto stack for i = 0..2 do if DATA_VALID(DATA1[i]) 16.15. LDS & GDS Instructions 526 of 597 "RDNA3" Instruction Set Architecture MEM[stack_base + stack_index] = DATA1[i]; Increment stack_index elsif DATA1[i] == last_node_ptr // Treat all further data as invalid as well. break endif endfor // Fourth pass: return data or pop if DATA_VALID(DATA1[3]) VGPR_RTN = DATA1[3] else VGPR_RTN = MEM[stack_base + stack_index]; MEM[stack_base + stack_index] = INVALID_NODE; Decrement stack_index endif ADDR = ENCODE_ADDR(stack_base, stack_index). function DATA_VALID(data): if data == INVALID_NODE return false elsif last_node_ptr != INVALID_NODE && data == last_node_ptr // Match last_node_ptr return false else return true endif endfunction. DS_STORE_ADDTID_B32 176 Write dword with thread ID offset. declare OFFSET0 : 8'U; declare OFFSET1 : 8'U; MEM[32'I({ OFFSET1, OFFSET0 } + M0[15 : 0]) + laneID.i * 4].u = DATA0.u DS_LOAD_ADDTID_B32 177 Read dword with thread ID offset. declare OFFSET0 : 8'U; declare OFFSET1 : 8'U; RETURN_DATA.u = MEM[32'I({ OFFSET1, OFFSET0 } + M0[15 : 0]) + laneID.i * 4].u DS_PERMUTE_B32 16.15. LDS & GDS Instructions 178 527 of 597 "RDNA3" Instruction Set Architecture Forward permute. This does not access LDS memory and may be called even if no LDS memory is allocated to the wave. It uses LDS to implement an arbitrary swizzle across threads in a wavefront. Note the address passed in is the thread ID multiplied by 4. If multiple sources map to the same destination lane, it is not deterministic which source lane writes to the destination lane. See also DS_BPERMUTE_B32. // VGPR[laneId][index] is the VGPR RAM // VDST, ADDR and DATA0 are from the microcode DS encoding declare tmp : 32'B[64]; declare OFFSET : 16'U; declare DATA0 : 32'U; declare VDST : 32'U; for i in 0 : WAVE64 ? 63 : 31 do tmp[i] = 0x0 endfor; for i in 0 : WAVE64 ? 63 : 31 do // If a source thread is disabled, it does not propagate data. if EXEC[i].u1 then // ADDR needs to be divided by 4. // High-order bits are ignored. // NOTE: destination lane is MOD 32 regardless of wave size. dst_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32; tmp[dst_lane] = VGPR[i][DATA0] endif endfor; // Copy data into destination VGPRs. If multiple sources // select the same destination thread, the highest-numbered // source thread wins. for i in 0 : WAVE64 ? 63 : 31 do if EXEC[i].u1 then VGPR[i][VDST] = tmp[i] endif endfor Notes Examples (simplified 4-thread wavefronts): VGPR[SRC0] = { A, B, C, D } VGPR[ADDR] = { 0, 0, 12, 4 } EXEC = 0xF, OFFSET = 0 VGPR[VDST] = { B, D, 0, C } VGPR[SRC0] = { A, B, C, D } VGPR[ADDR] = { 0, 0, 12, 4 } EXEC = 0xA, OFFSET = 0 VGPR[VDST] = { -, D, -, 0 } 16.15. LDS & GDS Instructions 528 of 597 "RDNA3" Instruction Set Architecture DS_BPERMUTE_B32 179 Backward permute. This does not access LDS memory and may be called even if no LDS memory is allocated to the wave. It uses LDS hardware to implement an arbitrary swizzle across threads in a wavefront. Note the address passed in is the thread ID multiplied by 4. Note that EXEC mask is applied to both VGPR read and write. If src_lane selects a disabled thread then zero is returned. See also DS_PERMUTE_B32. // VGPR[laneId][index] is the VGPR RAM // VDST, ADDR and DATA0 are from the microcode DS encoding declare tmp : 32'B[64]; declare OFFSET : 16'U; declare DATA0 : 32'U; declare VDST : 32'U; for i in 0 : WAVE64 ? 63 : 31 do tmp[i] = 0x0 endfor; for i in 0 : WAVE64 ? 63 : 31 do // ADDR needs to be divided by 4. // High-order bits are ignored. // NOTE: destination lane is MOD 32 regardless of wave size. src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32; // EXEC is applied to the source VGPR reads. if EXEC[src_lane].u1 then tmp[i] = VGPR[src_lane][DATA0] endif endfor; // Copy data into destination VGPRs. Some source // data may be broadcast to multiple lanes. for i in 0 : WAVE64 ? 63 : 31 do if EXEC[i].u1 then VGPR[i][VDST] = tmp[i] endif endfor Notes Examples (simplified 4-thread wavefronts): VGPR[SRC0] = { A, B, C, D } VGPR[ADDR] = { 0, 0, 12, 4 } EXEC = 0xF, OFFSET = 0 VGPR[VDST] = { A, A, D, B } VGPR[SRC0] = { A, B, C, D } VGPR[ADDR] = { 0, 0, 12, 4 } 16.15. LDS & GDS Instructions 529 of 597 "RDNA3" Instruction Set Architecture EXEC = 0xA, OFFSET = 0 VGPR[VDST] = { -, 0, -, B } DS_STORE_B96 222 Tri-dword write. MEM[ADDR + 0U].b = DATA[31 : 0]; MEM[ADDR + 4U].b = DATA[63 : 32]; MEM[ADDR + 8U].b = DATA[95 : 64] DS_STORE_B128 223 Quad-dword write. MEM[ADDR + 0U].b = DATA[31 : 0]; MEM[ADDR + 4U].b = DATA[63 : 32]; MEM[ADDR + 8U].b = DATA[95 : 64]; MEM[ADDR + 12U].b = DATA[127 : 96] DS_LOAD_B96 254 Tri-dword read. RETURN_DATA[31 : 0] = MEM[ADDR + 0U].b; RETURN_DATA[63 : 32] = MEM[ADDR + 4U].b; RETURN_DATA[95 : 64] = MEM[ADDR + 8U].b DS_LOAD_B128 255 Quad-dword read. RETURN_DATA[31 : 0] = MEM[ADDR + 0U].b; RETURN_DATA[63 : 32] = MEM[ADDR + 4U].b; RETURN_DATA[95 : 64] = MEM[ADDR + 8U].b; RETURN_DATA[127 : 96] = MEM[ADDR + 12U].b 16.15. LDS & GDS Instructions 530 of 597 "RDNA3" Instruction Set Architecture 16.15.1. LDS Instruction Limitations Some of the DS instructions are available only to GDS, not LDS. These are: • DS_GWS_SEMA_RELEASE_ALL • DS_GWS_INIT • DS_GWS_SEMA_V • DS_GWS_SEMA_BR • DS_GWS_SEMA_P • DS_GWS_BARRIER • DS_ORDERED_COUNT 16.15. LDS & GDS Instructions 531 of 597 "RDNA3" Instruction Set Architecture 16.16. MUBUF Instructions The bitfield map of the MUBUF format is: BUFFER_LOAD_FORMAT_X 0 Untyped buffer load 1 component with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format BUFFER_LOAD_FORMAT_XY 1 Untyped buffer load 2 components with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]) BUFFER_LOAD_FORMAT_XYZ 2 Untyped buffer load 3 components with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]); VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z]) BUFFER_LOAD_FORMAT_XYZW 3 Untyped buffer load 4 components with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]); VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z]); 16.16. MUBUF Instructions 532 of 597 "RDNA3" Instruction Set Architecture VDATA[127 : 96].b = ConvertFromFormat(MEM[TADDR.W]) BUFFER_STORE_FORMAT_X 4 Untyped buffer store 1 component with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format BUFFER_STORE_FORMAT_XY 5 Untyped buffer store 2 components with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b) BUFFER_STORE_FORMAT_XYZ 6 Untyped buffer store 3 components with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b); MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b) BUFFER_STORE_FORMAT_XYZW 7 Untyped buffer store 4 components with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b); MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b); MEM[TADDR.W] = ConvertToFormat(VDATA[127 : 96].b) BUFFER_LOAD_D16_FORMAT_X 16.16. MUBUF Instructions 8 533 of 597 "RDNA3" Instruction Set Architecture Untyped buffer load 1 component with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format // VDATA[31:16].b16 is preserved. BUFFER_LOAD_D16_FORMAT_XY 9 Untyped buffer load 2 components with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y])) BUFFER_LOAD_D16_FORMAT_XYZ 10 Untyped buffer load 3 components with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y])); VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z])); // VDATA[63:48].b16 is preserved. BUFFER_LOAD_D16_FORMAT_XYZW 11 Untyped buffer load 4 components with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y])); VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z])); VDATA[63 : 48].b16 = 16'B(ConvertFromFormat(MEM[TADDR.W])) BUFFER_STORE_D16_FORMAT_X 12 Untyped buffer store 1 component with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); 16.16. MUBUF Instructions 534 of 597 "RDNA3" Instruction Set Architecture // Mem access size depends on format BUFFER_STORE_D16_FORMAT_XY 13 Untyped buffer store 2 components with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16)) BUFFER_STORE_D16_FORMAT_XYZ 14 Untyped buffer store 3 components with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16)); MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16)) BUFFER_STORE_D16_FORMAT_XYZW 15 Untyped buffer store 4 components with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16)); MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16)); MEM[TADDR.W] = ConvertToFormat(32'B(VDATA[63 : 48].b16)) BUFFER_LOAD_U8 16 Untyped buffer load unsigned byte, zero extend in data register. VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 }) BUFFER_LOAD_I8 16.16. MUBUF Instructions 17 535 of 597 "RDNA3" Instruction Set Architecture Untyped buffer load signed byte, sign extend in data register. VDATA.i = 32'I(signext(MEM[ADDR].i8)) BUFFER_LOAD_U16 18 Untyped buffer load unsigned short, zero extend in data register. VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 }) BUFFER_LOAD_I16 19 Untyped buffer load signed short, sign extend in data register. VDATA.i = 32'I(signext(MEM[ADDR].i16)) BUFFER_LOAD_B32 20 Untyped buffer load dword. VDATA.b = MEM[ADDR].b BUFFER_LOAD_B64 21 Untyped buffer load 2 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b BUFFER_LOAD_B96 22 Untyped buffer load 3 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; 16.16. MUBUF Instructions 536 of 597 "RDNA3" Instruction Set Architecture VDATA[95 : 64] = MEM[ADDR + 8U].b BUFFER_LOAD_B128 23 Untyped buffer load 4 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; VDATA[95 : 64] = MEM[ADDR + 8U].b; VDATA[127 : 96] = MEM[ADDR + 12U].b BUFFER_STORE_B8 24 Untyped buffer store byte. MEM[ADDR].b8 = VDATA[7 : 0] BUFFER_STORE_B16 25 Untyped buffer store short. MEM[ADDR].b16 = VDATA[15 : 0] BUFFER_STORE_B32 26 Untyped buffer store dword. MEM[ADDR].b = VDATA[31 : 0] BUFFER_STORE_B64 27 Untyped buffer store 2 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32] 16.16. MUBUF Instructions 537 of 597 "RDNA3" Instruction Set Architecture BUFFER_STORE_B96 28 Untyped buffer store 3 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64] BUFFER_STORE_B128 29 Untyped buffer store 4 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64]; MEM[ADDR + 12U].b = VDATA[127 : 96] BUFFER_LOAD_D16_U8 30 Untyped buffer load unsigned byte, use low 16 bits of data register. VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[31:16] is preserved. BUFFER_LOAD_D16_I8 31 Untyped buffer load signed byte, use low 16 bits of data register. VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[31:16] is preserved. BUFFER_LOAD_D16_B16 32 Untyped buffer load short, use low 16 bits of data register. VDATA[15 : 0].b16 = MEM[ADDR].b16; 16.16. MUBUF Instructions 538 of 597 "RDNA3" Instruction Set Architecture // VDATA[31:16] is preserved. BUFFER_LOAD_D16_HI_U8 33 Untyped buffer load unsigned byte, use high 16 bits of data register. VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[15:0] is preserved. BUFFER_LOAD_D16_HI_I8 34 Untyped buffer load signed byte, use high 16 bits of data register. VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[15:0] is preserved. BUFFER_LOAD_D16_HI_B16 35 Untyped buffer load short, use high 16 bits of data register. VDATA[31 : 16].b16 = MEM[ADDR].b16; // VDATA[15:0] is preserved. BUFFER_STORE_D16_HI_B8 36 Untyped buffer store byte, use high 16 bits of data register. MEM[ADDR].b8 = VDATA[23 : 16].b8 BUFFER_STORE_D16_HI_B16 37 Untyped buffer store short, use high 16 bits of data register. MEM[ADDR].b16 = VDATA[31 : 16].b16 16.16. MUBUF Instructions 539 of 597 "RDNA3" Instruction Set Architecture BUFFER_LOAD_D16_HI_FORMAT_X 38 Untyped buffer load 1 dword with format conversion, use high 16 bits of data register. VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format // VDATA[15:0].b16 is preserved. BUFFER_STORE_D16_HI_FORMAT_X 39 Untyped buffer store 1 dword with format conversion, use high 16 bits of data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[31 : 16].b16)); // Mem access size depends on format BUFFER_GL0_INV 43 Write back and invalidate the shader L0. Returns ACK to shader. BUFFER_GL1_INV 44 Invalidate the GL1 cache only. Returns ACK to shader. BUFFER_LOAD_LDS_U8 45 Untyped buffer load unsigned byte, zero extend and store in LDS destination. BUFFER_LOAD_LDS_I8 46 Untyped buffer load signed byte, sign extend and store in LDS destination. BUFFER_LOAD_LDS_U16 47 Untyped buffer load unsigned short, zero extend and store in LDS destination. 16.16. MUBUF Instructions 540 of 597 "RDNA3" Instruction Set Architecture BUFFER_LOAD_LDS_I16 48 Untyped buffer load signed short, sign extend and store in LDS destination. BUFFER_LOAD_LDS_B32 49 Untyped buffer load dword, store in LDS destination. BUFFER_LOAD_LDS_FORMAT_X 50 Untyped buffer load 1 dword with format conversion, store in LDS destination. BUFFER_ATOMIC_SWAP_B32 51 Swap values in data register and memory. tmp = MEM[ADDR].b; MEM[ADDR].b = DATA.b; RETURN_DATA.b = tmp BUFFER_ATOMIC_CMPSWAP_B32 52 Compare and swap with memory value. tmp = MEM[ADDR].b; src = DATA[31 : 0].b; cmp = DATA[63 : 32].b; MEM[ADDR].b = tmp == cmp ? src : tmp; RETURN_DATA.b = tmp BUFFER_ATOMIC_ADD_U32 53 Add data register to memory value. tmp = MEM[ADDR].u; MEM[ADDR].u += DATA.u; RETURN_DATA.u = tmp 16.16. MUBUF Instructions 541 of 597 "RDNA3" Instruction Set Architecture BUFFER_ATOMIC_SUB_U32 54 Subtract data register from memory value. tmp = MEM[ADDR].u; MEM[ADDR].u -= DATA.u; RETURN_DATA.u = tmp BUFFER_ATOMIC_CSUB_U32 55 Subtract data register from memory value, clamp to zero. declare new_value : 32'U; old_value = MEM[ADDR].u; if old_value < DATA.u then new_value = 0U else new_value = old_value - DATA.u endif; MEM[ADDR].u = new_value; RETURN_DATA.u = old_value BUFFER_ATOMIC_MIN_I32 56 Minimum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src < tmp ? src : tmp; RETURN_DATA.i = tmp BUFFER_ATOMIC_MIN_U32 57 Minimum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src < tmp ? src : tmp; RETURN_DATA.u = tmp 16.16. MUBUF Instructions 542 of 597 "RDNA3" Instruction Set Architecture BUFFER_ATOMIC_MAX_I32 58 Maximum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src > tmp ? src : tmp; RETURN_DATA.i = tmp BUFFER_ATOMIC_MAX_U32 59 Maximum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src > tmp ? src : tmp; RETURN_DATA.u = tmp BUFFER_ATOMIC_AND_B32 60 Bitwise AND of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp & DATA.b); RETURN_DATA.b = tmp BUFFER_ATOMIC_OR_B32 61 Bitwise OR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp | DATA.b); RETURN_DATA.b = tmp BUFFER_ATOMIC_XOR_B32 62 Bitwise XOR of register value and memory value. tmp = MEM[ADDR].b; 16.16. MUBUF Instructions 543 of 597 "RDNA3" Instruction Set Architecture MEM[ADDR].b = (tmp ^ DATA.b); RETURN_DATA.b = tmp BUFFER_ATOMIC_INC_U32 63 Increment memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U; RETURN_DATA.u = tmp BUFFER_ATOMIC_DEC_U32 64 Decrement memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U; RETURN_DATA.u = tmp BUFFER_ATOMIC_SWAP_B64 65 Swap 64-bit values in data register and memory. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = DATA.b64; RETURN_DATA.b64 = tmp BUFFER_ATOMIC_CMPSWAP_B64 66 Compare and swap with 64-bit memory value. tmp = MEM[ADDR].b64; src = DATA[63 : 0].b64; cmp = DATA[127 : 64].b64; MEM[ADDR].b64 = tmp == cmp ? src : tmp; RETURN_DATA.b64 = tmp 16.16. MUBUF Instructions 544 of 597 "RDNA3" Instruction Set Architecture BUFFER_ATOMIC_ADD_U64 67 Add data register to 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 += DATA.u64; RETURN_DATA.u64 = tmp BUFFER_ATOMIC_SUB_U64 68 Subtract data register from 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 -= DATA.u64; RETURN_DATA.u64 = tmp BUFFER_ATOMIC_MIN_I64 69 Minimum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src < tmp ? src : tmp; RETURN_DATA.i64 = tmp BUFFER_ATOMIC_MIN_U64 70 Minimum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src < tmp ? src : tmp; RETURN_DATA.u64 = tmp BUFFER_ATOMIC_MAX_I64 71 Maximum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; 16.16. MUBUF Instructions 545 of 597 "RDNA3" Instruction Set Architecture src = DATA.i64; MEM[ADDR].i64 = src > tmp ? src : tmp; RETURN_DATA.i64 = tmp BUFFER_ATOMIC_MAX_U64 72 Maximum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src > tmp ? src : tmp; RETURN_DATA.u64 = tmp BUFFER_ATOMIC_AND_B64 73 Bitwise AND of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp & DATA.b64); RETURN_DATA.b64 = tmp BUFFER_ATOMIC_OR_B64 74 Bitwise OR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp | DATA.b64); RETURN_DATA.b64 = tmp BUFFER_ATOMIC_XOR_B64 75 Bitwise XOR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp ^ DATA.b64); RETURN_DATA.b64 = tmp 16.16. MUBUF Instructions 546 of 597 "RDNA3" Instruction Set Architecture BUFFER_ATOMIC_INC_U64 76 Increment 64-bit memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL; RETURN_DATA.u64 = tmp BUFFER_ATOMIC_DEC_U64 77 Decrement 64-bit memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL; RETURN_DATA.u64 = tmp BUFFER_ATOMIC_CMPSWAP_F32 80 Compare and swap with floating-point memory value. tmp = MEM[ADDR].f; src = DATA[31 : 0].f; cmp = DATA[63 : 32].f; MEM[ADDR].f = tmp == cmp ? src : tmp; RETURN_DATA.f = tmp Notes Floating-point compare handles NAN/INF/denorm. BUFFER_ATOMIC_MIN_F32 81 Minimum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src < tmp ? src : tmp; RETURN_DATA = tmp Notes 16.16. MUBUF Instructions 547 of 597 "RDNA3" Instruction Set Architecture Floating-point compare handles NAN/INF/denorm. BUFFER_ATOMIC_MAX_F32 82 Maximum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src > tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. BUFFER_ATOMIC_ADD_F32 86 Add data register to floating-point memory value. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src + tmp; RETURN_DATA.f = tmp Notes Floating-point addition handles NAN/INF/denorm. 16.16. MUBUF Instructions 548 of 597 "RDNA3" Instruction Set Architecture 16.17. MTBUF Instructions The bitfield map of the MTBUF format is: TBUFFER_LOAD_FORMAT_X 0 Typed buffer load 1 component with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format TBUFFER_LOAD_FORMAT_XY 1 Typed buffer load 2 components with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]) TBUFFER_LOAD_FORMAT_XYZ 2 Typed buffer load 3 components with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]); VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z]) TBUFFER_LOAD_FORMAT_XYZW 3 Typed buffer load 4 components with format conversion. VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]); // Mem access size depends on format VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]); VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z]); 16.17. MTBUF Instructions 549 of 597 "RDNA3" Instruction Set Architecture VDATA[127 : 96].b = ConvertFromFormat(MEM[TADDR.W]) TBUFFER_STORE_FORMAT_X 4 Typed buffer store 1 component with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format TBUFFER_STORE_FORMAT_XY 5 Typed buffer store 2 components with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b) TBUFFER_STORE_FORMAT_XYZ 6 Typed buffer store 3 components with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b); MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b) TBUFFER_STORE_FORMAT_XYZW 7 Typed buffer store 4 components with format conversion. MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b); MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b); MEM[TADDR.W] = ConvertToFormat(VDATA[127 : 96].b) TBUFFER_LOAD_D16_FORMAT_X 16.17. MTBUF Instructions 8 550 of 597 "RDNA3" Instruction Set Architecture Typed buffer load 1 component with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format // VDATA[31:16].b16 is preserved. TBUFFER_LOAD_D16_FORMAT_XY 9 Typed buffer load 2 components with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y])) TBUFFER_LOAD_D16_FORMAT_XYZ 10 Typed buffer load 3 components with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y])); VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z])); // VDATA[63:48].b16 is preserved. TBUFFER_LOAD_D16_FORMAT_XYZW 11 Typed buffer load 4 components with format conversion, packed 16-bit components in data register. VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X])); // Mem access size depends on format VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y])); VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z])); VDATA[63 : 48].b16 = 16'B(ConvertFromFormat(MEM[TADDR.W])) TBUFFER_STORE_D16_FORMAT_X 12 Typed buffer store 1 component with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); 16.17. MTBUF Instructions 551 of 597 "RDNA3" Instruction Set Architecture // Mem access size depends on format TBUFFER_STORE_D16_FORMAT_XY 13 Typed buffer store 2 components with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16)) TBUFFER_STORE_D16_FORMAT_XYZ 14 Typed buffer store 3 components with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16)); MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16)) TBUFFER_STORE_D16_FORMAT_XYZW 15 Typed buffer store 4 components with format conversion, packed 16-bit components in data register. MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16)); // Mem access size depends on format MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16)); MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16)); MEM[TADDR.W] = ConvertToFormat(32'B(VDATA[63 : 48].b16)) 16.17. MTBUF Instructions 552 of 597 "RDNA3" Instruction Set Architecture 16.18. MIMG Instructions The bitfield map of the MIMG format is: IMAGE_LOAD 0 Load element from largest miplevel in resource view, with format conversion specified in the resource constant. No sampler. IMAGE_LOAD_MIP 1 Load element from user-specified miplevel in resource view, with format conversion specified in the resource constant. No sampler. IMAGE_LOAD_PCK 2 Load element from largest miplevel in resource view, without format conversion. 8- and 16-bit elements are not sign-extended. No sampler. IMAGE_LOAD_PCK_SGN 3 Load element from largest miplevel in resource view, without format conversion. 8- and 16-bit elements are sign-extended. No sampler. IMAGE_LOAD_MIP_PCK 4 Load element from user-supplied miplevel in resource view, without format conversion. 8- and 16-bit elements are not sign-extended. No sampler. IMAGE_LOAD_MIP_PCK_SGN 5 Load element from user-supplied miplevel in resource view, without format conversion. 8- and 16-bit elements are sign-extended. No sampler. 16.18. MIMG Instructions 553 of 597 "RDNA3" Instruction Set Architecture IMAGE_STORE 6 Store element to largest miplevel in resource view, with format conversion specified in resource constant. No sampler. IMAGE_STORE_MIP 7 Store element to user-specified miplevel in resource view, with format conversion specified in resource constant. No sampler. IMAGE_STORE_PCK 8 Store element to largest miplevel in resource view, without format conversion. No sampler. IMAGE_STORE_MIP_PCK 9 Store element to user-specified miplevel in resource view, without format conversion. No sampler. IMAGE_ATOMIC_SWAP 10 Swap values in data register and memory. tmp = MEM[ADDR].b; MEM[ADDR].b = DATA.b; RETURN_DATA.b = tmp IMAGE_ATOMIC_CMPSWAP 11 Compare and swap with memory value. tmp = MEM[ADDR].b; src = DATA[31 : 0].b; cmp = DATA[63 : 32].b; MEM[ADDR].b = tmp == cmp ? src : tmp; RETURN_DATA.b = tmp IMAGE_ATOMIC_ADD 12 16.18. MIMG Instructions 554 of 597 "RDNA3" Instruction Set Architecture Add data register to memory value. tmp = MEM[ADDR].u; MEM[ADDR].u += DATA.u; RETURN_DATA.u = tmp IMAGE_ATOMIC_SUB 13 Subtract data register from memory value. tmp = MEM[ADDR].u; MEM[ADDR].u -= DATA.u; RETURN_DATA.u = tmp IMAGE_ATOMIC_SMIN 14 Minimum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src < tmp ? src : tmp; RETURN_DATA.i = tmp IMAGE_ATOMIC_UMIN 15 Minimum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src < tmp ? src : tmp; RETURN_DATA.u = tmp IMAGE_ATOMIC_SMAX 16 Maximum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src > tmp ? src : tmp; 16.18. MIMG Instructions 555 of 597 "RDNA3" Instruction Set Architecture RETURN_DATA.i = tmp IMAGE_ATOMIC_UMAX 17 Maximum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src > tmp ? src : tmp; RETURN_DATA.u = tmp IMAGE_ATOMIC_AND 18 Bitwise AND of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp & DATA.b); RETURN_DATA.b = tmp IMAGE_ATOMIC_OR 19 Bitwise OR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp | DATA.b); RETURN_DATA.b = tmp IMAGE_ATOMIC_XOR 20 Bitwise XOR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp ^ DATA.b); RETURN_DATA.b = tmp IMAGE_ATOMIC_INC 21 16.18. MIMG Instructions 556 of 597 "RDNA3" Instruction Set Architecture Increment memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U; RETURN_DATA.u = tmp IMAGE_ATOMIC_DEC 22 Decrement memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U; RETURN_DATA.u = tmp IMAGE_GET_RESINFO 23 Return resource info for a given mip level specified in the address vgpr. No sampler. Returns 4 integer values into VGPRs 3-0: {num_mip_levels, depth, height, width}. IMAGE_MSAA_LOAD 24 Load up to 4 samples of 1 component from an MSAA resource with a user-specified fragment ID. No sampler. IMAGE_BVH_INTERSECT_RAY 25 Intersection test on bound volume hierarchy nodes for ray tracing acceleration. 32-bit node pointer. No sampler. DATA: The destination VGPRs contain the results of intersection testing. The values returned here are different depending on the type of BVH node that was fetched. For box nodes the results contain the 4 pointers of the children boxes in intersection time sorted order. For triangle BVH nodes the results contain the intersection time and triangle ID of the triangle tested. The address GPR packing varies based on addressing mode (A16) and NSA mode. ADDR (A16 = 0): 16.18. MIMG Instructions 557 of 597 "RDNA3" Instruction Set Architecture 11 address VGPRs contain the ray data and BVH node pointer for the intersection test. The data is laid out as follows (dependent on NSA mode): • NSA=0 NSA=1 Value VADDR[0] VADDR[0] = node_pointer (uint32) VADDR[1] VADDRA[0] = ray_extent (float32) VADDR[2] VADDRB[0] = ray_origin.x (float32) VADDR[3] VADDRB[1] = ray_origin.y (float32) VADDR[4] VADDRB[2] = ray_origin.z (float32) VADDR[5] VADDRC[0] = ray_dir.x (float32) VADDR[6] VADDRC[1] = ray_dir.y (float32) VADDR[7] VADDRC[2] = ray_dir.z (float32) VADDR[8] VADDRD[0] = ray_inv_dir.x (float32) VADDR[9] VADDRD[1] = ray_inv_dir.y (float32) VADDR[10] VADDRD[2] = ray_inv_dir.z (float32) ADDR (A16 = 1): For performance and power optimization, the instruction can be encoded to use 16 bit floats for ray_dir and ray_inv_dir by setting A16 to 1. When the instruction is encoded with 16 bit addresses only 8 address VGPRs are used as follows (dependent on NSA mode): • NSA=0 NSA=1 Value VADDR[0] VADDR[0] = node_pointer (uint32) VADDR[1] VADDRA[0] = ray_extent (float32) VADDR[2] VADDRB[0] = ray_origin.x (float32) VADDR[3] VADDRB[1] = ray_origin.y (float32) VADDR[4] VADDRB[2] = ray_origin.z (float32) VADDR[5] VADDRC[0] = {ray_inv_dir.x, ray_dir.x} (2x float16) VADDR[6] VADDRC[1] = {ray_inv_dir.y, ray_dir.y} (2x float16) VADDR[7] VADDRC[2] = {ray_inv_dir.z, ray_dir.z} (2x float16) RSRC: The resource is the texture descriptor for the operation. The instruction must be encoded with r128=1. RESTRICTIONS: The image_bvh_intersect_ray and image_bvh64_intersect_ray opcode do not support all of the features of a standard MIMG instruction. This puts some restrictions on how the instruction is encoded: • DMASK must be set to 0xf (instruction returns all four DWORDs) • D16 must be set to 0 (16 bit return data is not supported) • R128 must be set to 1 (256 bit T#s are not supported) • UNRM must be set to 1 (only unnormalized coordinates are supported) • DIM must be set to 0 (BVH textures are 1D) • LWE must be set to 0 (LOD warn is not supported) • TFE must be set to 0 (no support for writing out the extra DWORD for the PRT hit status) These restrictions must be respected by the SW/compiler, and are not enforced by HW. HW is allowed to assume that these values are encoded according to the above restrictions, and ignore improper values, or do 16.18. MIMG Instructions 558 of 597 "RDNA3" Instruction Set Architecture any other undefined behavior, if the above fields do not match their specified values for these instructions. The HW also has some additional restrictions on the BVH instructions when they are issued: • The HW ignores the return order settings of the BVH ops and schedules them in the in order read return queue when fetching data from the texture pipe. IMAGE_BVH64_INTERSECT_RAY 26 Intersection test on bound volume hierarchy nodes for ray tracing acceleration. 64-bit node pointer. No sampler. This instruction allows support for very large BVHs (larger than 32 GBs) that may occur in workstation workloads. See IMAGE_BVH_INTERSECT_RAY for basic information including restrictions. Only differences are described here. ADDR (A16 = 0): 12 address VGPRs contain the ray data and BVH node pointer for the intersection test. The data is laid out as follows (dependent on NSA mode): • NSA=0 NSA=1 Value VADDR[0] VADDR[0] = node_pointer[31:0] (uint32) VADDR[1] VADDR[1] = node_pointer[63:32] (uint32) VADDR[2] VADDRA[0] = ray_extent (float32) VADDR[3] VADDRB[0] = ray_origin.x (float32) VADDR[4] VADDRB[1] = ray_origin.y (float32) VADDR[5] VADDRB[2] = ray_origin.z (float32) VADDR[6] VADDRC[0] = ray_dir.x (float32) VADDR[7] VADDRC[1] = ray_dir.y (float32) VADDR[8] VADDRC[2] = ray_dir.z (float32) VADDR[9] VADDRD[0] = ray_inv_dir.x (float32) VADDR[10] VADDRD[1] = ray_inv_dir.y (float32) VADDR[11] VADDRD[2] = ray_inv_dir.z (float32) ADDR (A16 = 1): When the instruction is encoded with 16 bit addresses only 9 address VGPRs are used as follows (dependent on NSA mode): • NSA=0 NSA=1 Value VADDR[0] VADDR[0] = node_pointer[31:0] (uint32) VADDR[1] VADDR[1] = node_pointer[63:32] (uint32) VADDR[2] VADDRA[0] = ray_extent (float32) VADDR[3] VADDRB[0] = ray_origin.x (float32) VADDR[4] VADDRB[1] = ray_origin.y (float32) VADDR[5] VADDRB[2] = ray_origin.z (float32) VADDR[6] VADDRC[0] = {ray_inv_dir.x, ray_dir.x} (2x float16) VADDR[7] VADDRC[1] = {ray_inv_dir.y, ray_dir.y} (2x float16) VADDR[8] VADDRC[2] = {ray_inv_dir.z, ray_dir.z} (2x float16) 16.18. MIMG Instructions 559 of 597 "RDNA3" Instruction Set Architecture IMAGE_SAMPLE 27 Sample texture map. IMAGE_SAMPLE_D 28 Sample texture map, with user derivatives. IMAGE_SAMPLE_L 29 Sample texture map, with user LOD. IMAGE_SAMPLE_B 30 Sample texture map, with lod bias. IMAGE_SAMPLE_LZ 31 Sample texture map, from level 0. IMAGE_SAMPLE_C 32 Sample texture map, with PCF. IMAGE_SAMPLE_C_D 33 SAMPLE_C, with user derivatives. IMAGE_SAMPLE_C_L 34 SAMPLE_C, with user LOD. IMAGE_SAMPLE_C_B 35 SAMPLE_C, with lod bias. 16.18. MIMG Instructions 560 of 597 "RDNA3" Instruction Set Architecture IMAGE_SAMPLE_C_LZ 36 SAMPLE_C, from level 0. IMAGE_SAMPLE_O 37 Sample texture map, with user offsets. IMAGE_SAMPLE_D_O 38 SAMPLE_O, with user derivatives. IMAGE_SAMPLE_L_O 39 SAMPLE_O, with user LOD. IMAGE_SAMPLE_B_O 40 SAMPLE_O, with lod bias. IMAGE_SAMPLE_LZ_O 41 SAMPLE_O, from level 0. IMAGE_SAMPLE_C_O 42 SAMPLE_C with user specified offsets. IMAGE_SAMPLE_C_D_O 43 SAMPLE_C_O, with user derivatives. IMAGE_SAMPLE_C_L_O 44 SAMPLE_C_O, with user LOD. 16.18. MIMG Instructions 561 of 597 "RDNA3" Instruction Set Architecture IMAGE_SAMPLE_C_B_O 45 SAMPLE_C_O, with lod bias. IMAGE_SAMPLE_C_LZ_O 46 SAMPLE_C_O, from level 0. IMAGE_GATHER4 47 Gather 4 single component elements (2x2). IMAGE_GATHER4_L 48 Gather 4 single component elements (2x2) with user LOD. IMAGE_GATHER4_B 49 Gather 4 single component elements (2x2) with user bias. IMAGE_GATHER4_LZ 50 Gather 4 single component elements (2x2) at level 0. IMAGE_GATHER4_C 51 Gather 4 single component elements (2x2) with PCF. IMAGE_GATHER4_C_LZ 52 Gather 4 single component elements (2x2) at level 0, with PCF. IMAGE_GATHER4_O 53 GATHER4, with user offsets. 16.18. MIMG Instructions 562 of 597 "RDNA3" Instruction Set Architecture IMAGE_GATHER4_LZ_O 54 GATHER4_LZ, with user offsets. IMAGE_GATHER4_C_LZ_O 55 GATHER4_C_LZ, with user offsets. IMAGE_GET_LOD 56 Return calculated LOD as two 32-bit floating point values. VDATA[0] = clampedLOD; VDATA[1] = rawLOD. IMAGE_SAMPLE_D_G16 57 SAMPLE_D with 16-bit floating point derivatives (gradients). IMAGE_SAMPLE_C_D_G16 58 SAMPLE_C_D with 16-bit floating point derivatives (gradients). IMAGE_SAMPLE_D_O_G16 59 SAMPLE_D_O with 16-bit floating point derivatives (gradients). IMAGE_SAMPLE_C_D_O_G16 60 SAMPLE_C_D_O with 16-bit floating point derivatives (gradients). IMAGE_SAMPLE_CL 64 Sample texture map, with LOD clamp specified in shader. 16.18. MIMG Instructions 563 of 597 "RDNA3" Instruction Set Architecture IMAGE_SAMPLE_D_CL 65 Sample texture map, with LOD clamp specified in shader, with user derivatives. IMAGE_SAMPLE_B_CL 66 Sample texture map, with LOD clamp specified in shader, with lod bias. IMAGE_SAMPLE_C_CL 67 SAMPLE_C, with LOD clamp specified in shader. IMAGE_SAMPLE_C_D_CL 68 SAMPLE_C, with LOD clamp specified in shader, with user derivatives. IMAGE_SAMPLE_C_B_CL 69 SAMPLE_C, with LOD clamp specified in shader, with lod bias. IMAGE_SAMPLE_CL_O 70 SAMPLE_O with LOD clamp specified in shader. IMAGE_SAMPLE_D_CL_O 71 SAMPLE_O, with LOD clamp specified in shader, with user derivatives. IMAGE_SAMPLE_B_CL_O 72 SAMPLE_O, with LOD clamp specified in shader, with lod bias. IMAGE_SAMPLE_C_CL_O 73 SAMPLE_C_O, with LOD clamp specified in shader. 16.18. MIMG Instructions 564 of 597 "RDNA3" Instruction Set Architecture IMAGE_SAMPLE_C_D_CL_O 74 SAMPLE_C_O, with LOD clamp specified in shader, with user derivatives. IMAGE_SAMPLE_C_B_CL_O 75 SAMPLE_C_O, with LOD clamp specified in shader, with lod bias. IMAGE_SAMPLE_C_D_CL_G16 84 SAMPLE_C_D_CL with 16-bit floating point derivatives (gradients). IMAGE_SAMPLE_D_CL_O_G16 85 SAMPLE_D_CL_O with 16-bit floating point derivatives (gradients). IMAGE_SAMPLE_C_D_CL_O_G16 86 SAMPLE_C_D_CL_O with 16-bit floating point derivatives (gradients). IMAGE_SAMPLE_D_CL_G16 95 SAMPLE_D_CL with 16-bit floating point derivatives (gradients). IMAGE_GATHER4_CL 96 Gather 4 single component elements (2x2) with user LOD clamp. IMAGE_GATHER4_B_CL 97 Gather 4 single component elements (2x2) with user bias and clamp. IMAGE_GATHER4_C_CL 98 Gather 4 single component elements (2x2) with user LOD clamp and PCF. 16.18. MIMG Instructions 565 of 597 "RDNA3" Instruction Set Architecture IMAGE_GATHER4_C_L 99 Gather 4 single component elements (2x2) with user LOD and PCF. IMAGE_GATHER4_C_B 100 Gather 4 single component elements (2x2) with user bias and PCF. IMAGE_GATHER4_C_B_CL 101 Gather 4 single component elements (2x2) with user bias, clamp and PCF. IMAGE_GATHER4H 144 Fetch 1 component per texel from 4x1 texels. DMASK selects which component to read (R,G,B,A) and must have only one bit set to 1. 16.18. MIMG Instructions 566 of 597 "RDNA3" Instruction Set Architecture 16.19. EXPORT Instructions Transfer vertex position, vertex parameter, pixel color, or pixel depth information to the output buffer. Every pixel shader must do at least one export to a color, depth or NULL target with the VM bit set to 1. This communicates the pixel-valid mask to the color and depth buffers. Every pixel does only one of the above export types with the DONE bit set to 1. Vertex shaders must do one or more position exports, and at least one parameter export. The final position export must have the DONE bit set to 1. 16.19. EXPORT Instructions 567 of 597 "RDNA3" Instruction Set Architecture 16.20. FLAT, Scratch and Global Instructions The bitfield map of the FLAT format is: 16.20.1. Flat Instructions Flat instructions look at the per work-item address and determine for each work-item if the target memory address is in global, private or scratch memory. FLAT_LOAD_U8 16 Untyped buffer load unsigned byte, zero extend in data register. VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 }) FLAT_LOAD_I8 17 Untyped buffer load signed byte, sign extend in data register. VDATA.i = 32'I(signext(MEM[ADDR].i8)) FLAT_LOAD_U16 18 Untyped buffer load unsigned short, zero extend in data register. VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 }) FLAT_LOAD_I16 19 Untyped buffer load signed short, sign extend in data register. VDATA.i = 32'I(signext(MEM[ADDR].i16)) 16.20. FLAT, Scratch and Global Instructions 568 of 597 "RDNA3" Instruction Set Architecture FLAT_LOAD_B32 20 Untyped buffer load dword. VDATA.b = MEM[ADDR].b FLAT_LOAD_B64 21 Untyped buffer load 2 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b FLAT_LOAD_B96 22 Untyped buffer load 3 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; VDATA[95 : 64] = MEM[ADDR + 8U].b FLAT_LOAD_B128 23 Untyped buffer load 4 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; VDATA[95 : 64] = MEM[ADDR + 8U].b; VDATA[127 : 96] = MEM[ADDR + 12U].b FLAT_STORE_B8 24 Untyped buffer store byte. MEM[ADDR].b8 = VDATA[7 : 0] 16.20. FLAT, Scratch and Global Instructions 569 of 597 "RDNA3" Instruction Set Architecture FLAT_STORE_B16 25 Untyped buffer store short. MEM[ADDR].b16 = VDATA[15 : 0] FLAT_STORE_B32 26 Untyped buffer store dword. MEM[ADDR].b = VDATA[31 : 0] FLAT_STORE_B64 27 Untyped buffer store 2 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32] FLAT_STORE_B96 28 Untyped buffer store 3 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64] FLAT_STORE_B128 29 Untyped buffer store 4 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64]; MEM[ADDR + 12U].b = VDATA[127 : 96] 16.20. FLAT, Scratch and Global Instructions 570 of 597 "RDNA3" Instruction Set Architecture FLAT_LOAD_D16_U8 30 Untyped buffer load unsigned byte, use low 16 bits of data register. VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[31:16] is preserved. FLAT_LOAD_D16_I8 31 Untyped buffer load signed byte, use low 16 bits of data register. VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[31:16] is preserved. FLAT_LOAD_D16_B16 32 Untyped buffer load short, use low 16 bits of data register. VDATA[15 : 0].b16 = MEM[ADDR].b16; // VDATA[31:16] is preserved. FLAT_LOAD_D16_HI_U8 33 Untyped buffer load unsigned byte, use high 16 bits of data register. VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[15:0] is preserved. FLAT_LOAD_D16_HI_I8 34 Untyped buffer load signed byte, use high 16 bits of data register. VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[15:0] is preserved. FLAT_LOAD_D16_HI_B16 16.20. FLAT, Scratch and Global Instructions 35 571 of 597 "RDNA3" Instruction Set Architecture Untyped buffer load short, use high 16 bits of data register. VDATA[31 : 16].b16 = MEM[ADDR].b16; // VDATA[15:0] is preserved. FLAT_STORE_D16_HI_B8 36 Untyped buffer store byte, use high 16 bits of data register. MEM[ADDR].b8 = VDATA[23 : 16].b8 FLAT_STORE_D16_HI_B16 37 Untyped buffer store short, use high 16 bits of data register. MEM[ADDR].b16 = VDATA[31 : 16].b16 FLAT_ATOMIC_SWAP_B32 51 Swap values in data register and memory. tmp = MEM[ADDR].b; MEM[ADDR].b = DATA.b; RETURN_DATA.b = tmp FLAT_ATOMIC_CMPSWAP_B32 52 Compare and swap with memory value. tmp = MEM[ADDR].b; src = DATA[31 : 0].b; cmp = DATA[63 : 32].b; MEM[ADDR].b = tmp == cmp ? src : tmp; RETURN_DATA.b = tmp FLAT_ATOMIC_ADD_U32 16.20. FLAT, Scratch and Global Instructions 53 572 of 597 "RDNA3" Instruction Set Architecture Add data register to memory value. tmp = MEM[ADDR].u; MEM[ADDR].u += DATA.u; RETURN_DATA.u = tmp FLAT_ATOMIC_SUB_U32 54 Subtract data register from memory value. tmp = MEM[ADDR].u; MEM[ADDR].u -= DATA.u; RETURN_DATA.u = tmp FLAT_ATOMIC_MIN_I32 56 Minimum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src < tmp ? src : tmp; RETURN_DATA.i = tmp FLAT_ATOMIC_MIN_U32 57 Minimum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src < tmp ? src : tmp; RETURN_DATA.u = tmp FLAT_ATOMIC_MAX_I32 58 Maximum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src > tmp ? src : tmp; 16.20. FLAT, Scratch and Global Instructions 573 of 597 "RDNA3" Instruction Set Architecture RETURN_DATA.i = tmp FLAT_ATOMIC_MAX_U32 59 Maximum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src > tmp ? src : tmp; RETURN_DATA.u = tmp FLAT_ATOMIC_AND_B32 60 Bitwise AND of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp & DATA.b); RETURN_DATA.b = tmp FLAT_ATOMIC_OR_B32 61 Bitwise OR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp | DATA.b); RETURN_DATA.b = tmp FLAT_ATOMIC_XOR_B32 62 Bitwise XOR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp ^ DATA.b); RETURN_DATA.b = tmp FLAT_ATOMIC_INC_U32 16.20. FLAT, Scratch and Global Instructions 63 574 of 597 "RDNA3" Instruction Set Architecture Increment memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U; RETURN_DATA.u = tmp FLAT_ATOMIC_DEC_U32 64 Decrement memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U; RETURN_DATA.u = tmp FLAT_ATOMIC_SWAP_B64 65 Swap 64-bit values in data register and memory. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = DATA.b64; RETURN_DATA.b64 = tmp FLAT_ATOMIC_CMPSWAP_B64 66 Compare and swap with 64-bit memory value. NOTE: RETURN_DATA[2:3] is not modified. tmp = MEM[ADDR].b64; src = DATA[63 : 0].b64; cmp = DATA[127 : 64].b64; MEM[ADDR].b64 = tmp == cmp ? src : tmp; RETURN_DATA.b64 = tmp FLAT_ATOMIC_ADD_U64 67 Add data register to 64-bit memory value. 16.20. FLAT, Scratch and Global Instructions 575 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].u64; MEM[ADDR].u64 += DATA.u64; RETURN_DATA.u64 = tmp FLAT_ATOMIC_SUB_U64 68 Subtract data register from 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 -= DATA.u64; RETURN_DATA.u64 = tmp FLAT_ATOMIC_MIN_I64 69 Minimum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src < tmp ? src : tmp; RETURN_DATA.i64 = tmp FLAT_ATOMIC_MIN_U64 70 Minimum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src < tmp ? src : tmp; RETURN_DATA.u64 = tmp FLAT_ATOMIC_MAX_I64 71 Maximum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src > tmp ? src : tmp; RETURN_DATA.i64 = tmp 16.20. FLAT, Scratch and Global Instructions 576 of 597 "RDNA3" Instruction Set Architecture FLAT_ATOMIC_MAX_U64 72 Maximum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src > tmp ? src : tmp; RETURN_DATA.u64 = tmp FLAT_ATOMIC_AND_B64 73 Bitwise AND of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp & DATA.b64); RETURN_DATA.b64 = tmp FLAT_ATOMIC_OR_B64 74 Bitwise OR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp | DATA.b64); RETURN_DATA.b64 = tmp FLAT_ATOMIC_XOR_B64 75 Bitwise XOR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp ^ DATA.b64); RETURN_DATA.b64 = tmp FLAT_ATOMIC_INC_U64 76 Increment 64-bit memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u64; 16.20. FLAT, Scratch and Global Instructions 577 of 597 "RDNA3" Instruction Set Architecture src = DATA.u64; MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL; RETURN_DATA.u64 = tmp FLAT_ATOMIC_DEC_U64 77 Decrement 64-bit memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL; RETURN_DATA.u64 = tmp FLAT_ATOMIC_CMPSWAP_F32 80 Compare and swap with floating-point memory value. tmp = MEM[ADDR].f; src = DATA[31 : 0].f; cmp = DATA[63 : 32].f; MEM[ADDR].f = tmp == cmp ? src : tmp; RETURN_DATA.f = tmp Notes Floating-point compare handles NAN/INF/denorm. FLAT_ATOMIC_MIN_F32 81 Minimum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src < tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. FLAT_ATOMIC_MAX_F32 16.20. FLAT, Scratch and Global Instructions 82 578 of 597 "RDNA3" Instruction Set Architecture Maximum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src > tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. FLAT_ATOMIC_ADD_F32 86 Add data register to floating-point memory value. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src + tmp; RETURN_DATA.f = tmp Notes Floating-point addition handles NAN/INF/denorm. 16.20.2. Scratch Instructions Scratch instructions are like Flat, but assume all work-item addresses fall in scratch (private) space. SCRATCH_LOAD_U8 16 Untyped buffer load unsigned byte, zero extend in data register. VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 }) SCRATCH_LOAD_I8 17 Untyped buffer load signed byte, sign extend in data register. 16.20. FLAT, Scratch and Global Instructions 579 of 597 "RDNA3" Instruction Set Architecture VDATA.i = 32'I(signext(MEM[ADDR].i8)) SCRATCH_LOAD_U16 18 Untyped buffer load unsigned short, zero extend in data register. VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 }) SCRATCH_LOAD_I16 19 Untyped buffer load signed short, sign extend in data register. VDATA.i = 32'I(signext(MEM[ADDR].i16)) SCRATCH_LOAD_B32 20 Untyped buffer load dword. VDATA.b = MEM[ADDR].b SCRATCH_LOAD_B64 21 Untyped buffer load 2 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b SCRATCH_LOAD_B96 22 Untyped buffer load 3 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; VDATA[95 : 64] = MEM[ADDR + 8U].b 16.20. FLAT, Scratch and Global Instructions 580 of 597 "RDNA3" Instruction Set Architecture SCRATCH_LOAD_B128 23 Untyped buffer load 4 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; VDATA[95 : 64] = MEM[ADDR + 8U].b; VDATA[127 : 96] = MEM[ADDR + 12U].b SCRATCH_STORE_B8 24 Untyped buffer store byte. MEM[ADDR].b8 = VDATA[7 : 0] SCRATCH_STORE_B16 25 Untyped buffer store short. MEM[ADDR].b16 = VDATA[15 : 0] SCRATCH_STORE_B32 26 Untyped buffer store dword. MEM[ADDR].b = VDATA[31 : 0] SCRATCH_STORE_B64 27 Untyped buffer store 2 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32] 16.20. FLAT, Scratch and Global Instructions 581 of 597 "RDNA3" Instruction Set Architecture SCRATCH_STORE_B96 28 Untyped buffer store 3 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64] SCRATCH_STORE_B128 29 Untyped buffer store 4 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64]; MEM[ADDR + 12U].b = VDATA[127 : 96] SCRATCH_LOAD_D16_U8 30 Untyped buffer load unsigned byte, use low 16 bits of data register. VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[31:16] is preserved. SCRATCH_LOAD_D16_I8 31 Untyped buffer load signed byte, use low 16 bits of data register. VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[31:16] is preserved. SCRATCH_LOAD_D16_B16 32 Untyped buffer load short, use low 16 bits of data register. VDATA[15 : 0].b16 = MEM[ADDR].b16; // VDATA[31:16] is preserved. 16.20. FLAT, Scratch and Global Instructions 582 of 597 "RDNA3" Instruction Set Architecture SCRATCH_LOAD_D16_HI_U8 33 Untyped buffer load unsigned byte, use high 16 bits of data register. VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[15:0] is preserved. SCRATCH_LOAD_D16_HI_I8 34 Untyped buffer load signed byte, use high 16 bits of data register. VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[15:0] is preserved. SCRATCH_LOAD_D16_HI_B16 35 Untyped buffer load short, use high 16 bits of data register. VDATA[31 : 16].b16 = MEM[ADDR].b16; // VDATA[15:0] is preserved. SCRATCH_STORE_D16_HI_B8 36 Untyped buffer store byte, use high 16 bits of data register. MEM[ADDR].b8 = VDATA[23 : 16].b8 SCRATCH_STORE_D16_HI_B16 37 Untyped buffer store short, use high 16 bits of data register. MEM[ADDR].b16 = VDATA[31 : 16].b16 SCRATCH_LOAD_LDS_U8 16.20. FLAT, Scratch and Global Instructions 45 583 of 597 "RDNA3" Instruction Set Architecture Untyped buffer load unsigned byte, zero extend and store in LDS destination. SCRATCH_LOAD_LDS_I8 46 Untyped buffer load signed byte, sign extend and store in LDS destination. SCRATCH_LOAD_LDS_U16 47 Untyped buffer load unsigned short, zero extend and store in LDS destination. SCRATCH_LOAD_LDS_I16 48 Untyped buffer load signed short, sign extend and store in LDS destination. SCRATCH_LOAD_LDS_B32 49 Untyped buffer load dword, store in LDS destination. 16.20.3. Global Instructions Global instructions are like Flat, but assume all work-item addresses fall in global memory space. GLOBAL_LOAD_U8 16 Untyped buffer load unsigned byte, zero extend in data register. VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 }) GLOBAL_LOAD_I8 17 Untyped buffer load signed byte, sign extend in data register. VDATA.i = 32'I(signext(MEM[ADDR].i8)) 16.20. FLAT, Scratch and Global Instructions 584 of 597 "RDNA3" Instruction Set Architecture GLOBAL_LOAD_U16 18 Untyped buffer load unsigned short, zero extend in data register. VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 }) GLOBAL_LOAD_I16 19 Untyped buffer load signed short, sign extend in data register. VDATA.i = 32'I(signext(MEM[ADDR].i16)) GLOBAL_LOAD_B32 20 Untyped buffer load dword. VDATA.b = MEM[ADDR].b GLOBAL_LOAD_B64 21 Untyped buffer load 2 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b GLOBAL_LOAD_B96 22 Untyped buffer load 3 dwords. VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; VDATA[95 : 64] = MEM[ADDR + 8U].b GLOBAL_LOAD_B128 23 Untyped buffer load 4 dwords. 16.20. FLAT, Scratch and Global Instructions 585 of 597 "RDNA3" Instruction Set Architecture VDATA[31 : 0] = MEM[ADDR + 0U].b; VDATA[63 : 32] = MEM[ADDR + 4U].b; VDATA[95 : 64] = MEM[ADDR + 8U].b; VDATA[127 : 96] = MEM[ADDR + 12U].b GLOBAL_STORE_B8 24 Untyped buffer store byte. MEM[ADDR].b8 = VDATA[7 : 0] GLOBAL_STORE_B16 25 Untyped buffer store short. MEM[ADDR].b16 = VDATA[15 : 0] GLOBAL_STORE_B32 26 Untyped buffer store dword. MEM[ADDR].b = VDATA[31 : 0] GLOBAL_STORE_B64 27 Untyped buffer store 2 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32] GLOBAL_STORE_B96 28 Untyped buffer store 3 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; 16.20. FLAT, Scratch and Global Instructions 586 of 597 "RDNA3" Instruction Set Architecture MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64] GLOBAL_STORE_B128 29 Untyped buffer store 4 dwords. MEM[ADDR + 0U].b = VDATA[31 : 0]; MEM[ADDR + 4U].b = VDATA[63 : 32]; MEM[ADDR + 8U].b = VDATA[95 : 64]; MEM[ADDR + 12U].b = VDATA[127 : 96] GLOBAL_LOAD_D16_U8 30 Untyped buffer load unsigned byte, use low 16 bits of data register. VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[31:16] is preserved. GLOBAL_LOAD_D16_I8 31 Untyped buffer load signed byte, use low 16 bits of data register. VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[31:16] is preserved. GLOBAL_LOAD_D16_B16 32 Untyped buffer load short, use low 16 bits of data register. VDATA[15 : 0].b16 = MEM[ADDR].b16; // VDATA[31:16] is preserved. GLOBAL_LOAD_D16_HI_U8 33 Untyped buffer load unsigned byte, use high 16 bits of data register. 16.20. FLAT, Scratch and Global Instructions 587 of 597 "RDNA3" Instruction Set Architecture VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 }); // VDATA[15:0] is preserved. GLOBAL_LOAD_D16_HI_I8 34 Untyped buffer load signed byte, use high 16 bits of data register. VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8)); // VDATA[15:0] is preserved. GLOBAL_LOAD_D16_HI_B16 35 Untyped buffer load short, use high 16 bits of data register. VDATA[31 : 16].b16 = MEM[ADDR].b16; // VDATA[15:0] is preserved. GLOBAL_STORE_D16_HI_B8 36 Untyped buffer store byte, use high 16 bits of data register. MEM[ADDR].b8 = VDATA[23 : 16].b8 GLOBAL_STORE_D16_HI_B16 37 Untyped buffer store short, use high 16 bits of data register. MEM[ADDR].b16 = VDATA[31 : 16].b16 GLOBAL_LOAD_ADDTID_B32 40 Untyped buffer load dword. No VGPR address is supplied in this instruction. TID is added to the address as shown below: memory_Addr = sgpr_addr(64) + inst_offset(12) + tid*4 16.20. FLAT, Scratch and Global Instructions 588 of 597 "RDNA3" Instruction Set Architecture GLOBAL_STORE_ADDTID_B32 41 Untyped buffer store dword. No VGPR address is supplied in this instruction. TID is added to the address as shown below: memory_Addr = sgpr_addr(64) + inst_offset(12) + tid*4 GLOBAL_LOAD_LDS_ADDTID_B32 42 Untyped buffer load dword, store results to LDS. No VGPR address is supplied in this instruction. TID is added to the address as shown below: memory_Addr = sgpr_addr(64) + inst_offset(12) + tid*4 GLOBAL_LOAD_LDS_U8 45 Untyped buffer load unsigned byte, zero extend and store in LDS destination. GLOBAL_LOAD_LDS_I8 46 Untyped buffer load signed byte, sign extend and store in LDS destination. GLOBAL_LOAD_LDS_U16 47 Untyped buffer load unsigned short, zero extend and store in LDS destination. GLOBAL_LOAD_LDS_I16 48 Untyped buffer load signed short, sign extend and store in LDS destination. GLOBAL_LOAD_LDS_B32 49 Untyped buffer load dword, store in LDS destination. GLOBAL_ATOMIC_SWAP_B32 51 Swap values in data register and memory. 16.20. FLAT, Scratch and Global Instructions 589 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].b; MEM[ADDR].b = DATA.b; RETURN_DATA.b = tmp GLOBAL_ATOMIC_CMPSWAP_B32 52 Compare and swap with memory value. tmp = MEM[ADDR].b; src = DATA[31 : 0].b; cmp = DATA[63 : 32].b; MEM[ADDR].b = tmp == cmp ? src : tmp; RETURN_DATA.b = tmp GLOBAL_ATOMIC_ADD_U32 53 Add data register to memory value. tmp = MEM[ADDR].u; MEM[ADDR].u += DATA.u; RETURN_DATA.u = tmp GLOBAL_ATOMIC_SUB_U32 54 Subtract data register from memory value. tmp = MEM[ADDR].u; MEM[ADDR].u -= DATA.u; RETURN_DATA.u = tmp GLOBAL_ATOMIC_CSUB_U32 55 Subtract data register from memory value, clamp to zero. declare new_value : 32'U; old_value = MEM[ADDR].u; if old_value < DATA.u then new_value = 0U else new_value = old_value - DATA.u 16.20. FLAT, Scratch and Global Instructions 590 of 597 "RDNA3" Instruction Set Architecture endif; MEM[ADDR].u = new_value; RETURN_DATA.u = old_value GLOBAL_ATOMIC_MIN_I32 56 Minimum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src < tmp ? src : tmp; RETURN_DATA.i = tmp GLOBAL_ATOMIC_MIN_U32 57 Minimum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src < tmp ? src : tmp; RETURN_DATA.u = tmp GLOBAL_ATOMIC_MAX_I32 58 Maximum of two signed integer values. tmp = MEM[ADDR].i; src = DATA.i; MEM[ADDR].i = src > tmp ? src : tmp; RETURN_DATA.i = tmp GLOBAL_ATOMIC_MAX_U32 59 Maximum of two unsigned integer values. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = src > tmp ? src : tmp; RETURN_DATA.u = tmp 16.20. FLAT, Scratch and Global Instructions 591 of 597 "RDNA3" Instruction Set Architecture GLOBAL_ATOMIC_AND_B32 60 Bitwise AND of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp & DATA.b); RETURN_DATA.b = tmp GLOBAL_ATOMIC_OR_B32 61 Bitwise OR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp | DATA.b); RETURN_DATA.b = tmp GLOBAL_ATOMIC_XOR_B32 62 Bitwise XOR of register value and memory value. tmp = MEM[ADDR].b; MEM[ADDR].b = (tmp ^ DATA.b); RETURN_DATA.b = tmp GLOBAL_ATOMIC_INC_U32 63 Increment memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u; src = DATA.u; MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U; RETURN_DATA.u = tmp GLOBAL_ATOMIC_DEC_U32 64 Decrement memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u; 16.20. FLAT, Scratch and Global Instructions 592 of 597 "RDNA3" Instruction Set Architecture src = DATA.u; MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U; RETURN_DATA.u = tmp GLOBAL_ATOMIC_SWAP_B64 65 Swap 64-bit values in data register and memory. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = DATA.b64; RETURN_DATA.b64 = tmp GLOBAL_ATOMIC_CMPSWAP_B64 66 Compare and swap with 64-bit memory value. tmp = MEM[ADDR].b64; src = DATA[63 : 0].b64; cmp = DATA[127 : 64].b64; MEM[ADDR].b64 = tmp == cmp ? src : tmp; RETURN_DATA.b64 = tmp GLOBAL_ATOMIC_ADD_U64 67 Add data register to 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 += DATA.u64; RETURN_DATA.u64 = tmp GLOBAL_ATOMIC_SUB_U64 68 Subtract data register from 64-bit memory value. tmp = MEM[ADDR].u64; MEM[ADDR].u64 -= DATA.u64; RETURN_DATA.u64 = tmp 16.20. FLAT, Scratch and Global Instructions 593 of 597 "RDNA3" Instruction Set Architecture GLOBAL_ATOMIC_MIN_I64 69 Minimum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src < tmp ? src : tmp; RETURN_DATA.i64 = tmp GLOBAL_ATOMIC_MIN_U64 70 Minimum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src < tmp ? src : tmp; RETURN_DATA.u64 = tmp GLOBAL_ATOMIC_MAX_I64 71 Maximum of two signed 64-bit integer values. tmp = MEM[ADDR].i64; src = DATA.i64; MEM[ADDR].i64 = src > tmp ? src : tmp; RETURN_DATA.i64 = tmp GLOBAL_ATOMIC_MAX_U64 72 Maximum of two unsigned 64-bit integer values. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = src > tmp ? src : tmp; RETURN_DATA.u64 = tmp GLOBAL_ATOMIC_AND_B64 73 Bitwise AND of register value and 64-bit memory value. 16.20. FLAT, Scratch and Global Instructions 594 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp & DATA.b64); RETURN_DATA.b64 = tmp GLOBAL_ATOMIC_OR_B64 74 Bitwise OR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp | DATA.b64); RETURN_DATA.b64 = tmp GLOBAL_ATOMIC_XOR_B64 75 Bitwise XOR of register value and 64-bit memory value. tmp = MEM[ADDR].b64; MEM[ADDR].b64 = (tmp ^ DATA.b64); RETURN_DATA.b64 = tmp GLOBAL_ATOMIC_INC_U64 76 Increment 64-bit memory value with wraparound to zero when incremented to register value. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL; RETURN_DATA.u64 = tmp GLOBAL_ATOMIC_DEC_U64 77 Decrement 64-bit memory value with wraparound to register value when decremented below zero. tmp = MEM[ADDR].u64; src = DATA.u64; MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL; RETURN_DATA.u64 = tmp 16.20. FLAT, Scratch and Global Instructions 595 of 597 "RDNA3" Instruction Set Architecture GLOBAL_ATOMIC_CMPSWAP_F32 80 Compare and swap with floating-point memory value. tmp = MEM[ADDR].f; src = DATA[31 : 0].f; cmp = DATA[63 : 32].f; MEM[ADDR].f = tmp == cmp ? src : tmp; RETURN_DATA.f = tmp Notes Floating-point compare handles NAN/INF/denorm. GLOBAL_ATOMIC_MIN_F32 81 Minimum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src < tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. GLOBAL_ATOMIC_MAX_F32 82 Maximum of two floating-point values. tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src > tmp ? src : tmp; RETURN_DATA = tmp Notes Floating-point compare handles NAN/INF/denorm. GLOBAL_ATOMIC_ADD_F32 86 Add data register to floating-point memory value. 16.20. FLAT, Scratch and Global Instructions 596 of 597 "RDNA3" Instruction Set Architecture tmp = MEM[ADDR].f; src = DATA.f; MEM[ADDR].f = src + tmp; RETURN_DATA.f = tmp Notes Floating-point addition handles NAN/INF/denorm. 16.20. FLAT, Scratch and Global Instructions 597 of 597