"RDNA3" Instruction Set Architecture

Reference Guide
20-February-2023

Chapter 1. Introduction
This document describes the instruction set and shader program accessible state for RDNA3 devices.
The AMD RDNA3 processor implements a parallel micro-architecture that provides a platform for computer
graphics applications and also for general-purpose data parallel applications.

1.1. Terminology
The following terminology and conventions are used in this document:
Table 1. Conventions
*                          Any number of alphanumeric characters in the name of a code format, parameter, or instruction.
<>                         Angle brackets denote streams.
[1,2)                      A range that includes the left-most value (in this case, 1), but excludes the right-most value (in this case, 2).
[1,2]                      A range that includes both the left-most and right-most values.
{x | y} or {x, y}          One of the multiple options listed. In this case, X or Y.
0.0                        A floating-point value.
1011b                      A binary value, in this example a 4-bit value.
'b0010                     A binary value of unspecified size.
32’b0010                   A 32-bit binary value. Binary values may include underscores for readability; the underscores can be ignored when parsing the value.
0x1A                       A hexadecimal value.
'h123                      A hexadecimal value.
24’h01                     A 24-bit hexadecimal value.
7:4 or [7:4]               A bit range, from bit 7 to bit 4, inclusive. The high-order bit is shown first. May be enclosed in brackets.
italicized word or phrase  The first use of a term or concept basic to the understanding of stream computing.
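
The numeric notations in Table 1 can be read mechanically. The following Python sketch is an illustration only: the names parse_literal and bit_range are hypothetical and are not part of any AMD tool. It assumes that both the ASCII apostrophe and the typographic apostrophe used in this document introduce a sized literal, strips underscores, and returns the value together with the declared width (None when the size is unspecified).

```python
import re

# Sized or unsized literal: [width]'b<bits> or [width]'h<hex digits>.
_SIZED = re.compile(r"^(\d*)['\u2019]([bh])([0-9a-fA-F_]+)$")

def parse_literal(text):
    """Return (value, width_in_bits or None) for the notations in Table 1."""
    text = text.strip()
    m = _SIZED.match(text)
    if m:
        width, base, digits = m.groups()
        value = int(digits.replace("_", ""), 2 if base == "b" else 16)
        return value, (int(width) if width else None)
    if text.lower().startswith("0x"):
        return int(text[2:].replace("_", ""), 16), None
    if re.fullmatch(r"[01_]+b", text):
        # Suffix form such as 1011b: the width is the number of digits written.
        digits = text[:-1].replace("_", "")
        return int(digits, 2), len(digits)
    raise ValueError(f"unrecognized literal: {text!r}")

def bit_range(value, high, low):
    """Extract bits [high:low], inclusive, high-order bit first."""
    if high < low:
        raise ValueError("the high-order bit must come first")
    return (value >> low) & ((1 << (high - low + 1)) - 1)
```

For example, bit_range(0xAB, 7, 4) selects the upper nibble of the byte.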

Table 2. Basic Terms
Term                        Description
RDNA3 Processor             The RDNA3 shader processor is a scalar and vector ALU with memory access designed to run complex programs on behalf of a wave.
Kernel                      A program executed by the shader processor for each work item submitted to it.
Shader Program              Same meaning as "Kernel". The shader types are: CS (Compute Shader), and for graphics-capable devices, PS (Pixel Shader), GS (Geometry Shader), and HS (Hull Shader).
Dispatch                    A dispatch launches a 1D, 2D, or 3D grid of work to the RDNA3 processor array.
Work-group                  A work-group is a collection of waves that have the ability to synchronize with each other with barriers; they also can share data through the Local Data Share. Waves in a work-group all run on the same WGP.
Wave                        A collection of 32 or 64 work-items that execute in parallel on a single RDNA3 processor.
Work-item                   A single element of work: one element from the dispatch grid, or in graphics a pixel, vertex or primitive.
Thread                      A synonym for "work-item".
Lane                        A synonym for "work-item" typically used only when describing VALU operations.
SA                          Shader Array. A collection of compute units.
SE                          Shader Engine. A collection of shader arrays.
SGPR                        Scalar General Purpose Registers. 32-bit registers that are shared by work-items in each wave.
VGPR                        Vector General Purpose Registers. 32-bit registers that are private to each work-item in a wave.
LDS                         Local Data Share. A 32-bank scratch memory allocated to waves or work-groups.
GDS                         Global Data Share. A scratch memory shared by all shader engines. Similar to LDS but also supports append operations.
VMEM                        Vector Memory. Refers to LDS, Texture, Global, Flat and Scratch memory.
SIMD32                      Single Instruction Multiple Data. In this document a SIMD refers to the Vector ALU unit that processes instructions for a single wave.
Literal Constant            A 32-bit integer or float constant that is placed in the instruction stream.
Scalar ALU (SALU)           The scalar ALU operates on one value per wave and manages all control flow.
Vector ALU (VALU)           The vector ALU maintains Vector GPRs that are unique for each work-item and executes arithmetic operations uniquely on each work-item.
Work-group Processor (WGP)  The basic unit of shader computation hardware, including scalar & vector ALU’s and memory, as well as LDS and scalar caches.
Compute Unit (CU)           One half of a WGP. Contains 2 SIMD32’s that share one path to memory.
Microcode format            The microcode format describes the bit patterns used to encode instructions. Each instruction is 32-bits or more, in units of 32-bits.
Instruction                 An instruction is the basic unit of the kernel. Instructions include: vector ALU, scalar ALU, memory transfer, and control flow operations.
Quad                        A quad is a 2x2 group of screen-aligned pixels. This is relevant for sampling texture maps.
Texture Sampler (S#)        A texture sampler is a 128-bit entity that describes how the vector memory system reads and samples (filters) a texture map.
Texture Resource (T#)       A texture resource descriptor describes an image in memory: address, data format, width, height, depth, etc.
Buffer Resource (V#)        A buffer resource descriptor describes a buffer in memory: address, data format, stride, etc.
NGG                         Next Generation Graphics pipeline
DPP                         Data Parallel Primitives: VALU instructions which can pass data between work-items
LSB                         Least Significant Bit
MSB                         Most Significant Bit
DWORD                       32-bit data
SHORT                       16-bit data
BYTE                        8-bit data

Table 3. Instruction suffixes have the following definitions:
Format  Meaning
B32     binary (untyped data) 32-bit
B64     binary (untyped data) 64-bit
F16     floating-point 16-bit (sign + exp5 + mant10)
F32     floating-point 32-bit (IEEE 754 single-precision float) (sign + exp8 + mant23)
F64     floating-point 64-bit (IEEE 754 double-precision float) (sign + exp11 + mant52)
BF16    floating-point 16-bit for machine learning ("bfloat16"). (sign + exp8 + mant7)
I8      signed 8-bit integer
I16     signed 16-bit integer
I32     signed 32-bit integer
I64     signed 64-bit integer
U16     unsigned 16-bit integer
U32     unsigned 32-bit integer
U64     unsigned 64-bit integer
D.i     Destination which is a signed integer
D.u     Destination which is an unsigned integer
D.f     Destination which is a float
S*.i    Source which is a signed integer
S*.u    Source which is an unsigned integer
S*.f    Source which is a float

If an instruction has two suffixes (for example, _I32_F32), the first suffix indicates the destination type, the
second the source type.
The following abbreviations are used in instruction definitions:
• D = destination
• U = unsigned integer
• S = source
• SCC = scalar condition code
• I = signed integer
• B = bitfield
Note: .u or .i specifies that the argument is interpreted as an unsigned or signed integer, respectively.
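
The suffix convention can be illustrated with a small Python sketch. It is not an official decoder: split_type_suffixes is a hypothetical helper, and treating a single suffix as both destination and source type is an assumption of this sketch. The field widths are copied from Table 3.

```python
# Bit widths of the type suffixes listed in Table 3.
SUFFIX_BITS = {
    "B32": 32, "B64": 64, "F16": 16, "F32": 32, "F64": 64, "BF16": 16,
    "I8": 8, "I16": 16, "I32": 32, "I64": 64, "U16": 16, "U32": 32, "U64": 64,
}

# (sign, exponent, mantissa) field widths for the floating-point suffixes.
FLOAT_FIELDS = {
    "F16": (1, 5, 10), "F32": (1, 8, 23), "F64": (1, 11, 52), "BF16": (1, 8, 7),
}

def split_type_suffixes(mnemonic):
    """Return (dest_type, src_type) from the trailing suffixes of a mnemonic.

    With two suffixes the first is the destination and the second the source.
    ASSUMPTION: with a single suffix, both operands use that one type.
    """
    parts = mnemonic.upper().split("_")
    types = []
    while parts and parts[-1] in SUFFIX_BITS:
        types.insert(0, parts.pop())
    if not types:
        raise ValueError(f"no type suffix in {mnemonic!r}")
    if len(types) == 1:
        return types[0], types[0]
    return types[-2], types[-1]
```

For a mnemonic such as V_CVT_I32_F32 the sketch reports a signed 32-bit integer destination and a 32-bit float source.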

1.2. Hardware Overview
The figure below shows a block diagram of the AMD RDNA3 Generation series processors:

Figure 1. AMD RDNA3 Generation Series Block Diagram

The RDNA3 device includes a data-parallel processor array, a command processor, a memory controller, and
other logic (not shown). The command processor reads commands that the host has written to memory-mapped registers in the system-memory address space. The command processor sends hardware-generated
interrupts to the host when the command is completed. The memory controller has direct access to all device
memory and the host-specified areas of system memory. To satisfy read and write requests, the memory
controller performs the functions of a direct-memory access (DMA) controller, including computing memory-address offsets based on the format of the requested data in memory.
In the RDNA3 environment, a complete application includes two parts:
• a program running on the host processor, and
• programs, called shader programs or kernels, running on the RDNA3 processor.
The RDNA3 programs are controlled by a driver running on the host that:
• sets internal base-address and other configuration registers,
• specifies the data domain on which the GPU is to operate,
• invalidates and flushes caches on the GPU, and
• causes the GPU to begin execution of a program.

1.2.1. Work-group Processor (WGP)
The processor array is the heart of the GPU. The array is organized as a set of work-group processor (WGP)
pipelines, each independent from the others, that operate in parallel on streams of floating-point or integer
data. The work-group processor pipelines can process data or, through the memory controller, transfer data to,
or from, memory. Computation in a work-group processor pipeline can be made conditional. Outputs written
to memory can also be made conditional.
When it receives a request, the work-group processor pipeline loads instructions and data from memory,
begins execution, and continues until the end of the kernel. As kernels are running, the GPU hardware
automatically fetches instructions from memory into on-chip caches; software plays no role in this. Kernels
can load data from off-chip memory into on-chip general-purpose registers (GPRs) and caches.
The GPU devices can detect floating point exceptions and can generate interrupts to the host. In particular,
they detect IEEE-754 floating-point exceptions in hardware; these can be recorded for post-execution analysis.
The GPU hides memory latency by keeping track of potentially hundreds of work-items in various stages of
execution, and by overlapping compute operations with memory-access operations.

1.2.2. Data Sharing
The processors may share data between different work-items. Data sharing can boost performance. The figure
below shows the memory hierarchy that is available to each work-item. The actual number of GPRs may differ
from what is shown in the image below.

Figure 2. Shared Memory Hierarchy

1.2.2.1. Local Data Share (LDS)
Each work-group processor (WGP) has a 128kB memory space that enables low-latency communication
between work-items within a work-group, or the work-items within a wave; this is the local data share (LDS).
This memory is configured with 64 banks, each with 512 entries of 4 bytes. The shared memory contains 64
integer atomic units to enable fast, unordered atomic operations. This memory can be used as a software cache
for predictable re-use of data, a data exchange machine for the work-items of a work-group, or as a cooperative
way to enable efficient access to off-chip memory. A single work-group may allocate up to 64kB of LDS space.
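
The LDS figures above are consistent with each other, as the following Python sketch checks. The constants come from the text; the bank-mapping function is an assumption of this sketch (consecutive DWORDs in consecutive banks) and is not stated in this section.

```python
LDS_BANKS = 64          # banks per WGP, from the text
ENTRIES_PER_BANK = 512  # entries per bank, from the text
BYTES_PER_ENTRY = 4     # one DWORD per entry

def lds_size_bytes():
    """Total LDS capacity per WGP implied by the bank geometry."""
    return LDS_BANKS * ENTRIES_PER_BANK * BYTES_PER_ENTRY

def lds_bank(byte_addr):
    """ASSUMPTION: consecutive DWORDs map to consecutive banks."""
    return (byte_addr // BYTES_PER_ENTRY) % LDS_BANKS

def fits_in_work_group(alloc_bytes, limit_bytes=64 * 1024):
    """A single work-group may allocate up to 64kB of LDS."""
    return 0 <= alloc_bytes <= limit_bytes
```

Under the assumed mapping, two addresses that are 256 bytes apart fall in the same bank.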

1.2.2.2. Global Data Share (GDS)
The AMD RDNA3 devices use a 4kB global data share (GDS) memory that can be used by waves of a kernel on
all WGPs. This memory provides 128 bytes per cycle of memory access to all the processing elements. It
provides full access to any location for any processor. The shared memory contains 2 integer atomic units to
enable fast, unordered atomic operations. This memory can be used as a software cache to store important
control data for compute kernels, reduction operations, or a small global shared surface. Data can be
preloaded from memory prior to kernel launch and written to memory after kernel completion. The GDS block
contains support logic for unordered append/consume and domain launch ordered append/consume
operations to buffers in memory. These dedicated circuits enable fast compaction of data or the creation of
complex data structures in memory.

1.2.3. Device Memory
The AMD RDNA3 devices offer several methods for access to off-chip memory from the processing elements
(PE) within each WGP. On the primary read path, the device consists of multiple channels of L2 cache that
provide data to read-only L1 caches, and finally to L0 caches per WGP. Specific cache-less load instructions
can force data to be retrieved from device memory during the execution of a load clause. Load requests that
overlap within the clause are cached with respect to each other. The output cache is formed by two levels of
cache: the first is a write-combining cache (it collects scatter and store operations and combines them to provide
good access patterns to memory); the second is a read/write cache with atomic units that lets each processing
element complete unordered atomic accesses that return the initial value. Each processing element provides
the destination address on which the atomic operation acts, the data to be used in the atomic operation, and a
return address for the read/write atomic unit to store the pre-op value in memory. Each store or atomic
operation can be set up to return an acknowledgment to the requesting PE upon write confirmation of the
return value (pre-atomic op value at destination) being stored to device memory.
This acknowledgment has two purposes:
• enabling a PE to recover the pre-op value from an atomic operation by performing a cache-less load from
its return address after receipt of the write confirmation acknowledgment, and
• enabling the system to maintain a relaxed consistency model.
Each scatter write from a given PE to a given memory channel maintains order. The acknowledgment enables
one processing element to implement a fence to maintain serial consistency by ensuring all writes have been
posted to memory prior to completing a subsequent write. In this manner, the system can maintain a relaxed
consistency model between all parallel work-items operating on the system.

Chapter 2. Shader Concepts
RDNA3 shader programs (kernels) are programs executed by the GPU processor. Conceptually, the shader
program is executed independently on every work-item, but in reality the processor groups up to 32 or 64
work-items into a wave, which executes the shader program on all 32 or 64 work-items in one pass.
The RDNA3 processor consists primarily of:
• A scalar ALU, which operates on one value per wave (common to all work-items)
• A vector ALU, which operates on unique values per work-item
• Local data storage, which allows work-items within a work-group to communicate and share data
• Scalar memory, which can transfer data between SGPRs and memory through a cache
• Vector memory, which can transfer data between VGPRs and memory, including sampling texture maps
• Exports, which transfer data from the shader to dedicated rendering hardware
Program control flow is handled using scalar ALU instructions. This includes if/else, branches and looping.
Scalar ALU (SALU) and memory instructions work on an entire wave and operate on up to two SGPRs, as well
as literal constants.
Vector memory and ALU instructions operate on all work-items in the wave at one time. In order to support
branching and conditional execute, every wave has an EXECute mask that determines which work-items are
active at that moment, and which are dormant. Active work-items execute the vector instruction, and dormant
ones treat the instruction as a NOP. The EXEC mask can be written at any time by Scalar ALU instructions or
VALU comparisons.
Vector ALU instructions can typically take up to three arguments, which can come from VGPRs, SGPRs, or
literal constants that are part of the instruction stream. They operate on all work-items enabled by the EXEC
mask. Vector compare and add-with-carry-out return a bit-per-work-item mask back to the SGPRs to indicate,
per work-item, which had a "true" result from the compare or generated a carry-out.
Vector memory instructions transfer data between VGPRs and memory. Each work-item supplies its own
memory address and supplies or receives unique data. These instructions are also subject to the EXEC mask.
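
The effect of the EXEC mask on vector instructions can be sketched in Python. This is a simplified model of a wave32, not a description of the hardware implementation; the function names are hypothetical, and writing 0 for inactive lanes in the compare result is an assumption of this sketch.

```python
WAVE_SIZE = 32  # this sketch models a wave32

def v_add(dst, src0, src1, exec_mask):
    """Per-lane 32-bit add; lanes whose EXEC bit is 0 keep their old value."""
    return [
        (a + b) & 0xFFFFFFFF if (exec_mask >> lane) & 1 else old
        for lane, (old, a, b) in enumerate(zip(dst, src0, src1))
    ]

def v_cmp_lt(src0, src1, exec_mask):
    """Return a bit-per-lane mask of active lanes where src0 < src1.

    ASSUMPTION: inactive lanes contribute 0 to the mask.
    """
    mask = 0
    for lane, (a, b) in enumerate(zip(src0, src1)):
        if (exec_mask >> lane) & 1 and a < b:
            mask |= 1 << lane
    return mask
```

A program can implement an if-block by writing the compare mask to EXEC, running the vector instructions of the block, and then restoring the previous EXEC value.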

2.1. Wave32 and Wave64
The shader supports both waves of 32 work-items ("wave32") and waves of 64 work-items ("wave64").
Both wave sizes are supported for all operations, but shader programs must be compiled for and run as a
particular wave size, regardless of how many work-items are active in any given wave.
Wave32 waves issue each instruction at most once. Wave64 waves typically issue each instruction twice: once
for the low half (work-items 31-0) and then again for the high half (work-items 63-32). This occurs only for
VALU and VMEM (LDS, texture, buffer, flat) instructions; scalar ALU and memory as well as branch and
messages are issued only once regardless of the wave size. Export requests also issue just once regardless of
wave size. It is possible that instructions from other waves may be executed in between the low and high half
of a given wave’s instructions.
Hardware may choose to skip either half if the EXEC mask for that half is all zeros, but does not skip both
halves for VMEM instructions as that would confuse the outstanding-memory-instruction counters, unless
there are no outstanding VMEM instructions from this wave. It also does not skip either half of a VALU
instruction which writes an SGPR. See Instruction Skipping: EXEC==0 for details on instruction skipping rules.
Hardware operates such that both passes of a wave64 use the state of the wave prior to instruction execution;
the first pass of the wave64 does not affect the input to the second pass.
In addition to the EXEC mask being different between the low and high half, scalar inputs may vary between
the two passes. Both passes use the same constants, but different masks and carry-in/out.
The differences in the second pass are:
• Input increments: Carry-in, div-fmas and v_cndmask all use the next SGPR (SSRC + 1, or VCC_HI)
• Output increments: Carry-out, div-scale and v_cmp all write to the next SGPR (SDST + 1, or VCC_HI)
◦ v_cmpx writes to EXEC_HI instead of EXEC_LO
The upper 32-bits of EXEC and VCC are ignored for wave32 waves. VCCZ and EXECZ reflect the status of the
lowest 32-bits of VCC and EXEC respectively for wave32 waves.
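
The two-pass behavior of wave64 can be sketched for an add with carry-out. The Python below is a simplified model under stated assumptions: wave64_add_co is a hypothetical name, sources are unsigned 32-bit values, and None marks a lane whose destination is left unchanged. Both passes read the inputs as they were before the instruction, and the second pass writes its carry mask to SDST + 1.

```python
def wave64_add_co(src0, src1, exec_mask):
    """Model an add with carry-out for a wave64 as two 32-lane passes.

    Returns (results, sdst, sdst_plus_1). The low pass (lanes 31-0) writes
    its carry mask to SDST; the high pass (lanes 63-32) writes to SDST + 1.
    """
    assert len(src0) == len(src1) == 64
    results = [None] * 64
    carry = [0, 0]  # [SDST, SDST + 1]
    for half in (0, 1):
        half_exec = (exec_mask >> (32 * half)) & 0xFFFFFFFF
        for i in range(32):
            lane = 32 * half + i
            if (half_exec >> i) & 1:
                total = src0[lane] + src1[lane]  # pre-instruction inputs
                results[lane] = total & 0xFFFFFFFF
                carry[half] |= (total >> 32) << i
    return results, carry[0], carry[1]
```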

2.2. Shader Types
2.2.1. Compute Shaders
Compute kernels (shaders) are generic programs that can run on the RDNA3 processor, taking data from
memory, processing it, and writing results back to memory. Compute kernels are created by a dispatch, which
causes the RDNA3 processors to run the kernel over all of the work-items in a 1D, 2D, or 3D grid of data. The
RDNA3 processor walks through this grid and generates waves, which then run the compute kernel. Each
work-item is initialized with its unique address (index) within the grid. Based on this index, the work-item
computes the address of the data it is required to work on and what to do with the results.
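
As an illustration of how a work-item might turn its grid index into a data address, consider the Python sketch below. The layout (x varies fastest, fixed element size) and the names flat_index and element_address are assumptions of this sketch; a real kernel is free to map indices to data in any way.

```python
def flat_index(x, y, z, dim_x, dim_y):
    """Linear index of grid point (x, y, z); x varies fastest (assumed)."""
    return x + dim_x * (y + dim_y * z)

def element_address(base, x, y, z, dims, element_bytes=4):
    """Byte address of the element owned by work-item (x, y, z)."""
    dim_x, dim_y, dim_z = dims
    if not (0 <= x < dim_x and 0 <= y < dim_y and 0 <= z < dim_z):
        raise IndexError("work-item index outside the dispatch grid")
    return base + flat_index(x, y, z, dim_x, dim_y) * element_bytes
```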

2.2.2. Graphics Shaders
The shader supports 3 types of graphics waves: PS, GS, and HS.
Rendering modes (launch behavior):
• Normal NGG - Geometry Engine (GE) sends info to wave launch hardware to init VGPRs for each element
(prim) launched; GE fetches index and vertex buffer data and loads to VGPRs
• Mesh shader - turns GS-launch into a CS-style launch, and wave launch hardware does unrolling into
elements and generates element indices on the fly. The mesh shader program determines how to use this
index value.

The amplification shader decides how many mesh shader groups to launch. The mesh shader processes vertices and then
primitives.

2.3. Work-groups
A work-group is a collection of waves which can share data through LDS and can synchronize at a barrier.
Waves in a work-group are all issued to the same WGP but can run on any of the 4 SIMD32’s and can share data
through LDS. The WGP supports up to 32 work-groups with a maximum of 1024 work-items per work-group.
Waves in a work-group may share up to 64kB of LDS space. Work-groups consisting of a single wave do not
count against the limit of 32. They do not allocate a barrier resource, and barrier ops are treated as S_NOP.
Each work-group or wave can operate in one of two modes, selectable per draw/dispatch at wave-create time:
CU mode
In this mode, the LDS is effectively split into a separate upper and lower LDS, each serving two SIMD32’s.
Waves are allocated LDS space within the half of LDS which is associated with the SIMD the wave is running
on. For work-groups, all waves are assigned to the pair of SIMD32’s. This mode may provide faster
operation since both halves run in parallel, but limits data sharing (upper waves cannot read data in the
lower half of LDS and vice versa). When in CU mode, all waves in the work-group are resident within the
same CU.

WGP mode
In this mode, the LDS is one large contiguous memory that all waves on the WGP can access. In WGP mode,
waves of a work-group may be distributed across both CU’s (all 4 SIMD32’s) in the WGP.
LDS_PARAM_LOAD and LDS_DIRECT_LOAD are not supported in WGP mode.
The WGP (and LDS) can simultaneously have some waves running in WGP mode and other waves running in
CU mode.
A barrier is a synchronization primitive which makes each wave reach a given point in the shader before any
wave proceeds.
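
The work-group limits in this section translate into simple arithmetic, sketched below in Python. The helper names are hypothetical; the limit of 1024 work-items and the single-wave barrier rule are taken from the text.

```python
MAX_WORK_ITEMS_PER_GROUP = 1024  # from the text

def waves_per_work_group(work_items, wave_size):
    """Number of waves needed to hold a work-group of the given size."""
    if wave_size not in (32, 64):
        raise ValueError("wave size must be 32 or 64")
    if not 1 <= work_items <= MAX_WORK_ITEMS_PER_GROUP:
        raise ValueError("work-group size out of range")
    return -(-work_items // wave_size)  # ceiling division

def needs_barrier_resource(work_items, wave_size):
    """Single-wave work-groups do not allocate a barrier resource."""
    return waves_per_work_group(work_items, wave_size) > 1
```

A full 1024 work-item work-group therefore occupies 32 waves in wave32 and 16 waves in wave64.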

2.4. Shader Padding Requirement
Due to aggressive instruction prefetching used in some graphics devices, the user must pad all shaders with 64
extra DWORDs (256 bytes) of data past the end of the shader. It is recommended to use the S_CODE_END
instruction as padding. This ensures that if the instruction prefetch hardware goes beyond the end of the
shader, it does not reach into uninitialized memory (or unmapped memory pages).
The amount of shader padding required is related to how far the shader may prefetch ahead. The shader can be
set to prefetch 1, 2 or 3 cachelines (64 bytes) ahead of the current program counter. This is controlled via a
wave-launch state register, or by the shader program itself with S_SET_INST_PREFETCH_DISTANCE.
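
A tool that emits shader binaries could apply the padding rule as in the Python sketch below. The function names are hypothetical. The 32-bit value used for S_CODE_END and the little-endian byte order are assumptions of this sketch and should be confirmed against the instruction encoding tables before use.

```python
import struct

PAD_DWORDS = 64  # 64 DWORDs = 256 bytes, from the text

# ASSUMPTION: encoding of S_CODE_END; confirm against the SOPP opcode table.
S_CODE_END = 0xBF9F0000

def pad_shader(code: bytes, pad_word: int = S_CODE_END) -> bytes:
    """Append the required padding after the last instruction."""
    if len(code) % 4:
        raise ValueError("shader code must be a whole number of DWORDs")
    # ASSUMPTION: instruction DWORDs are stored little-endian.
    return code + struct.pack("<I", pad_word) * PAD_DWORDS

def max_prefetch_bytes(distance_cachelines, cacheline_bytes=64):
    """Bytes the prefetcher may run ahead of the current program counter."""
    if distance_cachelines not in (1, 2, 3):
        raise ValueError("prefetch distance is 1, 2 or 3 cachelines")
    return distance_cachelines * cacheline_bytes
```

With the largest prefetch distance the prefetcher runs 192 bytes ahead, which stays inside the 256 bytes of padding.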

Chapter 3. Wave State
This chapter describes the state variables visible to the shader program. Each wave has a private copy of this
state unless otherwise specified.

3.1. State Overview
The table below shows the hardware states readable or writable by a shader program. All registers below are
unique to each wave except for TBA and TMA which are shared.
Table 4. Readable and Writable Hardware States
Abbrev.       Name                                   Size (bits)  Description
PC            Program Counter                        48           Points to the memory address of the next shader instruction to execute. Read/write only via scalar control flow instructions and indirectly using branch. The 2 LSB’s are forced to zero.
V0-V255       VGPR                                   32           Vector general-purpose register. (32 bits per work-item x (32 or 64) work-items per wave).
S0-S105       SGPR                                   32           Scalar general-purpose register. All waves are allocated 106 SGPRs + 16 TTMPs.
LDS           Local Data Share                       64kB         Local data share is a scratch RAM with built-in arithmetic capabilities that allow data to be shared between threads in a work-group.
EXEC          Execute Mask                           64           A bit mask with one bit per thread, which is applied to vector instructions and controls which threads execute and which ignore the instruction.
EXECZ         EXEC is zero                           1            A single bit flag indicating that the EXEC mask is all zeros. For wave32 it considers only EXEC[31:0].
VCC           Vector Condition Code                  64           A bit mask with one bit per thread; it holds the result of a vector compare operation or integer carry-out. Physically VCC is stored in specific SGPRs.
VCCZ          VCC is zero                            1            A single-bit flag indicating that the VCC mask is all zeros. For wave32 it considers only VCC[31:0].
SCC           Scalar Condition Code                  1            Result from a scalar ALU comparison instruction.
FLAT_SCRATCH  Flat scratch address                   48           The base address of scratch memory for this wave. Used by Flat and Scratch instructions. Read-only by user shader.
STATUS        Status                                 32           Read-only shader status bits.
MODE          Mode                                   32           Writable shader mode bits.
M0            Misc Reg                               32           A temporary register that has various uses, including GPR indexing and bounds checking.
TRAPSTS       Trap Status                            32           Holds information about exceptions and pending traps.
TBA           Trap Base Address                      48           Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit[62] is replicated into address bit[63]). Accessed via S_SENDMSG_RTN.
TMA           Trap Memory Address                    48           Temporary register for shader operations. For example, can hold a pointer to memory used by the trap handler.
TTMP0-TTMP15  Trap Temporary SGPRs                   32           16 SGPRs available only to the Trap Handler for temporary storage.
VMcnt         Vector memory load instruction count   6            Counts the number of VMEM load and sample instructions issued but not yet completed.
VScnt         Vector memory store instruction count  6            Counts the number of VMEM store instructions issued but not yet completed.
EXPcnt        Export Count                           3            Counts the number of Export and GDS instructions issued but not yet completed. Also counts VMEM writes that have not yet sent their write-data to the last level cache, and parameter loads outstanding.
LGKMcnt       LDS, GDS, Constant and Message count   6            Counts the number of LDS, GDS, constant-fetch (scalar memory read), and message instructions issued but not yet completed.

3.2. Control State: PC and EXEC
3.2.1. Program Counter (PC)
The Program Counter is a DWORD-aligned byte address that points to the next instruction to execute. When a
wave is created the PC is initialized to the first instruction in the program.
There are a few instructions to interact directly with the PC: S_GETPC_B64, S_SETPC_B64, S_CALL_B64,
S_RFE_B64 and S_SWAPPC_B64. These transfer the PC to and from an even-aligned SGPR pair (sign-extended).
Branches jump to (PC_of_the_instruction_after_the_branch + offset*4). Branches, GET_PC and SWAP_PC are PC-relative to the next instruction, not the current one. S_TRAP, on the other hand, saves the PC of the S_TRAP
instruction itself.
During wave debugging, the program counter may be read. The PC points to the next instruction to issue. All
prior instructions have been issued but may or may not have completed execution.
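
The branch-target rule can be written out as a short Python sketch. The helper names are hypothetical. Treating the offset as a signed 16-bit DWORD count and the branch as a single-DWORD instruction are assumptions of this sketch; the 48-bit PC width, the zeroed low bits, and the sign extension into an SGPR pair follow the text.

```python
PC_MASK = (1 << 48) - 1  # the PC is 48 bits wide

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def branch_target(branch_pc, simm16, inst_bytes=4):
    """Target = PC of the instruction after the branch + offset * 4.

    ASSUMPTION: the offset is a signed 16-bit DWORD count and the branch
    itself occupies one DWORD (inst_bytes=4).
    """
    next_pc = branch_pc + inst_bytes
    return (next_pc + sign_extend(simm16, 16) * 4) & PC_MASK & ~0x3

def pc_to_sgpr_pair(pc):
    """Sign-extend the 48-bit PC to 64 bits and split it into (low, high)."""
    value = sign_extend(pc & PC_MASK, 48) & 0xFFFFFFFFFFFFFFFF
    return value & 0xFFFFFFFF, value >> 32
```

Under these assumptions an offset of 0 falls through to the next instruction, and an offset of -1 branches back to the branch instruction itself.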

3.2.2. EXECute Mask
The Execute mask (64-bit) controls which threads in the vector are executed. Each bit indicates how one thread
behaves for vector instructions: 1 = execute, 0 = do not execute. EXEC can be read and written via scalar
instructions, and can also be written as a result of a vector-alu compare. EXEC affects: vector-alu, vector-memory, LDS, GDS and export instructions. It does not affect scalar execution or branches.
Wave64 uses all 64 bits of the exec mask. Wave32 waves use only bits 31:0 and hardware does not act upon the
upper bits.
There is a summary bit (EXECZ) that indicates that the entire execute mask is zero. It can be used as a condition
for branches to skip code when EXEC is zero. For wave32, this reflects the state of EXEC[31:0].
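
The EXECZ summary bit can be modeled in a few lines of Python. The function name execz is used here only for illustration.

```python
def execz(exec_mask, wave_size):
    """Return 1 if the part of EXEC used by this wave size is all zeros."""
    if wave_size == 32:
        return int((exec_mask & 0xFFFFFFFF) == 0)  # upper bits are ignored
    if wave_size == 64:
        return int((exec_mask & 0xFFFFFFFFFFFFFFFF) == 0)
    raise ValueError("wave size must be 32 or 64")
```

A wave32 with only upper bits set in EXEC therefore still reports EXECZ = 1.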

3.2.3. Instruction Skipping: EXEC==0
The shader hardware may skip vector instructions when EXEC==0. Instructions which may be skipped are:
• VALU - skip if EXEC == 0
◦ Not skipped if the instruction writes SGPRs/VCC
◦ Does not skip WMMA or SWMMA
◦ This skipping is opportunistic and may not occur depending on timing after a V_CMPX.
• These are not skipped regardless of EXEC mask value, and are issued only once in wave64
◦ V_NOP, V_PIPEFLUSH, V_READLANE, V_READFIRSTLANE, V_WRITELANE
◦ BUFFER_GL1_INV, BUFFER_GL0_INV
• These are not skipped and are issued twice regardless of EXEC mask value in wave64 mode
◦ V_CMP which writes SGPR or VCC (not V_CMPX - may skip one pass but not both)
◦ Any VALU which writes an SGPR
• Export Request - skip unless: Done==1 or if export target is POS0
◦ Skipped if the wave was created with SKIP_EXPORT=1
• LDS_param_load / LDS-direct: are skipped when EXEC==0
• LDS, Memory, GDS - do not skip
◦ VMEM can be skipped only if: VMcnt/VScnt==0 and EXEC==0
▪ otherwise for wave64 one pass can be skipped if EXEC==0 for that half, but not both halves.
◦ LDS can be skipped only if: LGKMcnt==0 and EXEC==0
◦ Does not skip GDS or GWS
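
Part of the skipping rules can be expressed as predicates. The Python sketch below covers only VALU and VMEM and models the most aggressive skipping the rules permit, although the hardware is free to skip less. The function names are hypothetical. Reading "VMcnt/VScnt==0" as both counters being zero, and issuing the low half when one pass is required but EXEC is zero, are assumptions of this sketch.

```python
def valu_may_skip(exec_mask, writes_sgpr=False, is_wmma=False):
    """VALU may be skipped when EXEC == 0, unless it writes SGPRs/VCC
    or is a WMMA/SWMMA instruction."""
    return exec_mask == 0 and not writes_sgpr and not is_wmma

def vmem_may_skip(exec_mask, vmcnt, vscnt):
    """VMEM can be skipped only if the counters are zero and EXEC == 0.

    ASSUMPTION: both VMcnt and VScnt must be zero.
    """
    return exec_mask == 0 and vmcnt == 0 and vscnt == 0

def wave64_vmem_passes(exec_mask, vmcnt, vscnt):
    """Return which halves ('lo', 'hi') of a wave64 VMEM instruction issue."""
    lo = exec_mask & 0xFFFFFFFF
    hi = (exec_mask >> 32) & 0xFFFFFFFF
    if vmem_may_skip(exec_mask, vmcnt, vscnt):
        return []
    passes = [name for name, half in (("lo", lo), ("hi", hi)) if half]
    # One half may be skipped, but not both.
    # ASSUMPTION: when neither half is active, the low half is issued.
    return passes or ["lo"]
```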

3.3. Storage State: SGPR, VGPR, LDS
3.3.1. SGPRs
3.3.1.1. SGPR Allocation and storage
Every wave is allocated a fixed number of SGPRs:
• 106 normal SGPRs
• VCC_HI and VCC_LO (stored in SGPRs 106 and 107)
• 16 Trap-temporary SGPRs, meant for use by the trap handler

3.3.1.2. VCC
The Vector Condition Code (VCC) can be written by V_CMP and integer vector ADD/SUB instructions. VCC is
implicitly read by V_ADD_CI, V_SUB_CI, V_CNDMASK and V_DIV_FMAS. VCC is a named SGPR-pair and is
subject to the same dependency checks as any other SGPR.

3.3.1.3. SGPR Alignment
There are a few cases where even-aligned SGPRs are required:
1. any time 64-bit data is used
a. this includes moves to/from 64-bit registers, including PC
2. Scalar memory reads when the address-base comes from an SGPR-pair
Quad-alignment of SGPRs is required for operation on more than 64-bits, and for the data GPR when a scalar
memory operation (read, write or atomic) operates on more than 2 DWORDs. Similarly, when a 64-bit SGPR
data value is used as a source to a VALU op, it must be even aligned regardless of size. In contrast, when a 32-bit SGPR data value is used as a source to a VALU op, it can be arbitrarily aligned regardless of wave size.
When a 64-bit quantity is stored in SGPRs, the LSB’s are in SGPR[n], and the MSB’s are in SGPR[n+1].
It is illegal to use mis-aligned source or destination SGPRs for data larger than 32 bits and results are
unpredictable.
As an example, VALU ops with carry-in or carry-out:
• When used with wave32, these are 32 bit values and may have any arbitrary alignment
• When used with wave64, these are 64 bit values and must be aligned to an even SGPR address
Hardware enforces SGPR alignment by ignoring LSB’s as necessary and treating them as zero. For
*MOVREL*_B64, the LSB of the index is also ignored and treated as zero.
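
The alignment enforcement described above amounts to clearing low bits of the SGPR index. A minimal Python sketch, with hypothetical helper names:

```python
def align_sgpr(index, dwords):
    """Return the SGPR index effectively used for an operand.

    64-bit operands are even-aligned; operands wider than 64 bits are
    quad-aligned. Low index bits are ignored (treated as zero).
    """
    if dwords <= 1:
        return index
    if dwords == 2:
        return index & ~0x1
    return index & ~0x3

def carry_operand_dwords(wave_size):
    """Carry-in/out masks are 32 bits in wave32 and 64 bits in wave64."""
    return 1 if wave_size == 32 else 2
```

For example, a carry mask named at S5 is used as written in wave32 but is treated as the pair starting at S4 in wave64.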

3.3.1.4. SGPR Out of Range Behavior
Scalar sources and dests use a 7-bit encoding:
Scalar 0-105=SGPR; 106,107=VCC, 108-123=TTMP0-15, and 124-127={NULL, M0, EXEC_LO, EXEC_HI}.
It is illegal to use GPR indexing or a multi-DWORD operand to cross SGPR regions. The regions are:
• SGPRs 0 - 107 (includes VCC)
• Trap Temp SGPRs
• All other SGPR & Scalar-source addresses must not be indexed and no single operand can reference
multiple register ranges.
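
The 7-bit scalar encoding and the region rule can be sketched in Python as follows. The function names are hypothetical, and the assignment of VCC_LO to 106 and VCC_HI to 107 is an assumption about the order within the pair. In this sketch NULL, M0 and EXEC each count as their own range, so a 64-bit EXEC operand is not treated as a crossing.

```python
def decode_scalar_operand(code):
    """Map a 7-bit scalar operand encoding (0-127) to a register name."""
    if not 0 <= code <= 127:
        raise ValueError("encoding must fit in 7 bits")
    if code <= 105:
        return f"S{code}"
    if code <= 107:
        return ("VCC_LO", "VCC_HI")[code - 106]  # ASSUMPTION: LO first
    if code <= 123:
        return f"TTMP{code - 108}"
    return ("NULL", "M0", "EXEC_LO", "EXEC_HI")[code - 124]

def region(code):
    """Range used for the crossing rule."""
    if code <= 107:
        return "sgpr"  # SGPRs 0-105 plus VCC
    if code <= 123:
        return "ttmp"
    return ("null", "m0", "exec", "exec")[code - 124]

def operand_crosses_regions(first_code, dwords):
    """True if a multi-DWORD operand would span more than one range."""
    last = first_code + dwords - 1
    return last > 127 or region(first_code) != region(last)
```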
General Rules:
• Out of range source SGPRs return zero (using a TTMP when STATUS.PRIV=0, NULL, M0 or EXEC where not
allowed)
• Writes to an out of range SGPR are ignored
TTMP0-15 can only be written while in the trap handler (STATUS.PRIV=1) but can be read by the user’s shader
(STATUS.PRIV=0). Writes to TTMPs while outside the trap handler are ignored. SALU instructions which try
but fail to write a TTMP also do not update SCC.
• SALU: Above rules apply.
◦ WREXEC and SAVEEXEC write the EXEC mask even when the SDST is out-of-range
• VALU: Above rules apply.
• VMEM: S#, T#, V# must be contained within one region.
◦ T# (128b), V# or S#: no possible range violation exists (forced alignment puts all in 1 range).
◦ T# (256b) starting at 104 and extending into TTMPs; or starting at TTMP12 and going past TTMP15 is a
violation. If this occurs, force to use S0.
• SMEM return data starting in SGPRs/VCC and extending into TTMPs, or starting in TTMPs and extending
outside TTMPs becomes out of range.
◦ No data gets written to dest-SGPRs that are out-of-range
◦ Addr and write-data are aligned and so cannot go out of range, except:
▪ Referencing M0, NULL, or EXEC* returns zero, and SMEM loads cannot load into these registers.
• S_MOVREL:
◦ Indexing is allowed only within SGPRs and TTMPs, and must not cross between the two. Indexing must
stay within the "base" range (the operand type where index==0).
The ranges are: [ SGPRs 0-105 and VCC_LO, VCC_HI ], [ Trap Temps 0-15 ], [ all other values ]
◦ Indexing must not reach M0, EXEC or inline constants. The indexed address is out of range if:
▪ Base is SGPR: addr > VCC_HI (or, for a 64-bit operand, addr > VCC_LO)
▪ Base is TTMP: addr > TTMP15 (or, for a 64-bit operand, addr > TTMP14)
◦ If the source is out of range, S0 is used.
If the dest is out of range, nothing is written.

3.3.2. VGPRs
3.3.2.1. VGPR Allocation and Alignment
VGPRs are allocated in blocks of 16 for wave32 or 8 for wave64, and a shader may have up to 256 VGPRs. In
other words, VGPRs are allocated in units of (16*32 or 8*64 = 512 DWORDs). A wave may not be created with zero
VGPRs. Devices which have 1536 VGPRs per SIMD allocate in blocks of 24 for wave32 and 12 for wave64.
A wave may voluntarily deallocate all of its VGPRs via S_SENDMSG. Once this is done, the wave may not
reallocate them and the only valid action is to terminate the wave. This can be useful if a wave has issued stores
to memory and is waiting for the write-confirms before terminating. Releasing the VGPRs while waiting may
allow a new wave to allocate them and start earlier.
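As a rough illustration of the allocation granularity described above, the sketch below rounds a requested VGPR count up to the next block. The function name and the `large_file` flag (standing in for devices with 1536 VGPRs per SIMD) are hypothetical, and the 256-VGPR upper limit is not enforced here.

```python
def allocated_vgprs(requested: int, wave_size: int, large_file: bool = False) -> int:
    """Round a requested VGPR count up to the allocation block size."""
    if requested < 1:
        raise ValueError("a wave may not be created with zero VGPRs")
    if wave_size not in (32, 64):
        raise ValueError("wave_size must be 32 or 64")
    if large_file:
        block = 24 if wave_size == 32 else 12   # 1536-VGPR-per-SIMD devices
    else:
        block = 16 if wave_size == 32 else 8
    return -(-requested // block) * block       # ceiling division, then scale
```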

3.3.2.2. VGPR Out of Range Behavior
Given an instruction operand that uses one or more DWORDs of VGPR data: "V"
Vs = the first VGPR DWORD (start)
Ve = the last VGPR DWORD (end)
For a 32-bit operand, Vs==Ve; for a 64-bit operand Ve=Vs+1, etc.
Operand is out of range if:
• Vs < 0 || Vs >= VGPR_SIZE
• Ve < 0 || Ve >= VGPR_SIZE
A V_MOVREL indexed operand is out of range if any of the following holds:
• Index > 255

• (Vs + M0) >= VGPR_SIZE
• (Ve + M0) >= VGPR_SIZE
Out of range consequences:
• If a dest VGPR is out of range, the instruction is ignored (treat as NOP).
• V_SWAP & V_SWAPREL : since both arguments are destinations, if either is out of range, discard the
instruction.
◦ VALU instructions with multiple destination (e.g. VGPR and SGPR): nothing is written to any GPR
• If a source VGPR is out of range in a VMEM or Export instruction: VGPR0 is used
◦ Memory instructions that use a group of consecutive VGPRs that are out of range: the group is forced to
start at VGPR0.
• If a source VGPR is out of range in a VALU instruction: VGPR0 is used
◦ VOPD has different rules: the source address is forced to (VGPRaddr % 4).
Instructions with multiple destinations (e.g. V_ADD_CO): if any destination is out of range, no results are
written.
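The range tests above can be written down directly. The following is a minimal sketch with hypothetical function names; `vgpr_size` stands for VGPR_SIZE, `m0_index` for the index held in M0, and the VOPD special case is deliberately left out.

```python
def vgpr_operand_out_of_range(vs: int, dwords: int, vgpr_size: int) -> bool:
    """Check Vs and Ve (= Vs + dwords - 1) against the wave's VGPR allocation."""
    ve = vs + dwords - 1
    return vs < 0 or vs >= vgpr_size or ve < 0 or ve >= vgpr_size


def movrel_operand_out_of_range(vs: int, dwords: int, m0_index: int, vgpr_size: int) -> bool:
    """V_MOVREL: the index itself and both indexed ends must stay in range."""
    ve = vs + dwords - 1
    return m0_index > 255 or vs + m0_index >= vgpr_size or ve + m0_index >= vgpr_size


def resolve_valu_source(vs: int, dwords: int, vgpr_size: int) -> int:
    """An out-of-range VALU source reads VGPR0 instead (VOPD not modeled)."""
    return 0 if vgpr_operand_out_of_range(vs, dwords, vgpr_size) else vs
```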

3.3.3. Memory Alignment and Out-of-Range Behavior
This section defines the behavior when a source or destination GPR or memory address is outside the legal
range for a wave. Except where noted, these rules apply to LDS, GDS, buffer, global, flat and scratch memory
accesses.
Memory, LDS & GDS: Reads and Atomics with return:
• If any source VGPR or SGPR is out-of-range, the data value is undefined.
• If any destination VGPR is out-of-range, the operation is nullified by issuing the instruction as if the EXEC
mask were cleared to 0.
◦ This out-of-range test checks all VGPRs which could be returned (e.g. VDST to VDST+3 for a
BUFFER_LOAD_B128)
◦ This check also includes the extra PRT (partially resident texture) VGPR and nullifies the fetch if this
VGPR would be out of range no matter whether the texture system actually returns this value or not.
◦ Atomic operations with out-of-range destination VGPRs are nullified: issued, but with EXEC mask of
zero.
• Image loads and stores consider DMASK bits when making an out-of-bounds determination.
• Note: VDST is only checked for lds/gds/mem-atomic that actually return a value.
VMEM (texture) memory alignment rules are defined using the config register:
SH_MEM_CONFIG.alignment_mode. This setting also affects LDS, Flat/Scratch/Global operations.
DWORD       Automatic alignment to multiple of the smaller of element size or a DWORD.
UNALIGNED   No alignment requirements.

Formatted ops like buffer_load_format_* must be aligned to "min(DWORD, ElementSize)" where ElementSize
is the size of one "thing" described by the FORMAT. E.g. format 5_6_5 = 2 bytes, so it must be 2-byte
aligned. Non-formatted ops follow the above alignment mode. Atomics must be aligned to the data size, or will
trigger a MEMVIOL.

3.3.4. LDS
Waves may be allocated LDS memory, and waves in a work-group all share the same LDS memory allocation. A
wave may have 0 - 64kbyte of LDS space allocated, and it is allocated in blocks of 1024 bytes. All accesses to LDS
are restricted to the space allocated to that wave/work-group.
Internally LDS is composed of two blocks of memory of 64kB each. Each one of these two blocks is affiliated
with one CU or the other: byte addresses 0-65535 with CU0, 65536-131071 with CU1. Allocations of LDS space to
a wave or work-group do not wrap around: the allocation starting address is less than the ending address.
In CU mode, a wave’s entire LDS allocation resides in the same "side" of LDS as the wave is loaded. No access is
allowed to cross over or wrap around to the other side.
In WGP mode, a wave’s LDS allocation may be entirely in either the CU0 or CU1 part of LDS, or it may straddle
the boundary and be partially in each CU. The location of the LDS storage is unrelated to which CU the wave is
on.
Pixel parameters are loaded into the same CU side as the wave resides and do not cross over into the other side
of LDS storage. Pixel shaders are run only in CU mode. A pixel shader may request LDS space in addition
to what is required for vertex parameters.

3.3.4.1. LDS/GDS Alignment and Out-of-Range
Any DS_LOAD or DS_STORE of any size can be byte aligned if the alignment mode is set to "unaligned". For all
other alignment modes, LDS forces alignment by zeroing out address least significant bits.
• 32-bit Atomics must be aligned to a 4-byte address; 64-bit atomics to an 8-byte address.
• LDS operations report MEMVIOL if the LDS-address is out of range and
LDS_CONFIG.ADDR_OUT_OF_RANGE_REPORTING==1
• MEMVIOL is reported for misaligned LDS accesses when the alignment mode is set to STRICT or
DWORD_STRICT.
Out Of Range
• If the LDS-ADDRESS is out of range (addr < 0 or >= LDS_size):
◦ Writes out-of-range are discarded.
◦ Reads return the value zero. For multi-DWORD reads, if any part of the LDS-address is out of range, the
entire instruction returns zero.
• If any source-VGPR is out of range, the value from VGPR0 is used to supply the LDS address or data.
• If the dest-VGPR is out of range, nullify the instruction (issue with EXEC=0)
"Native" Alignment in LDS & GDS is:
B8: byte aligned
B16 or D16: 2 byte aligned
B32: 4 byte aligned
B64: 8 byte aligned
B128 and B96: 16 byte aligned
If the alignment mode is set to "unaligned", the LDS disables its auto-alignment and does not report errors for
misaligned reads & writes.

if (sh_alignment_mode == unaligned)  align = 0xffff
else if (B32)                        align = 0xfffC
else if (B64)                        align = 0xfff8
else if (B96 or B128)                align = 0xfff0
LDSaddr = (addr + offset) & align
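The forced-alignment computation can be sketched in Python as follows. The masks for B32, B64, B96 and B128 are taken from the pseudocode above; the B8 and B16 entries are an assumption derived from the "native" alignment list, and the function name is hypothetical.

```python
# Address masks keyed by access size in bits.
# B32/B64/B96/B128 follow the pseudocode; B8/B16 are assumed from the
# native-alignment list (byte aligned and 2-byte aligned respectively).
_ALIGN_MASK = {8: 0xFFFF, 16: 0xFFFE, 32: 0xFFFC, 64: 0xFFF8, 96: 0xFFF0, 128: 0xFFF0}


def lds_effective_address(addr: int, offset: int, size_bits: int, unaligned: bool = False) -> int:
    """Return the LDS address after forced alignment (or none, in unaligned mode)."""
    mask = 0xFFFF if unaligned else _ALIGN_MASK[size_bits]
    return (addr + offset) & mask
```

For instance, a B32 access to address 0x1003 lands on 0x1000 unless the alignment mode is "unaligned".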

3.4. Wave State Registers
The following registers are accessed infrequently, and are only readable/writable via S_GETREG and S_SETREG
instructions. Some of these registers are read-only, some are writable and others are writable only when in the
trap handler ("PRIV").
Code  Register         Description
0     Reserved
1     MODE             read / write
2     STATUS           read / write. Only writable when priv=1
3     TRAPSTS          read / write
14    FLUSH_IB         write-only. Writing this causes all waves to flush their instruction buffers
15    SH_MEM_BASES     read-only. Allows a wave to read the value of this register to do aperture checks and
                       memory space conversions. Bits [15:0] = Private Base; [31:16] = Shared Base.
20    FLAT_SCRATCH_LO  read only (writable only while in trap handler)
21    FLAT_SCRATCH_HI  read only (writable only while in trap handler)
23    HW_ID1           read only. debug only - not predictable values
24    HW_ID2           read only. debug only - not predictable values
29    SHADER_CYCLES    Get the current graphics clock counter value

3.4.1. Status register
Status register fields can be read but not written to by the shader. While in the trap handler, certain STATUS fields
can be written. These bits are initialized at wave-creation time. The table below describes the status register
fields.
Table 5. Status Register Fields
Field          Bit    Write  Description
               Pos    when
                      Priv?
SCC            0      Y      Scalar condition code. Used as a carry-out bit. For a comparison instruction, this bit
                             indicates failure or success. For logical operations, this is 1 if the result is non-zero.
SYS_PRIO       2:1    Y      Wave priority set at wave creation time. See S_SETPRIO instruction for details. 0 is
                             lowest, 3 is highest priority.
USER_PRIO      4:3    Y      Wave’s priority set by shader program itself. See S_SETPRIO instruction for details.
PRIV           5      N      Privileged mode. Indicates that the wave is in the trap handler. Gives write access to
                             TTMP registers.
TRAP_EN        6      N      Indicates that a trap handler is present. When set to zero, traps are not taken.
EXPORT_RDY     8      Y      This status bit indicates if export buffer space has been allocated. The shader stalls
                             any export instruction until this bit becomes "1". It gets set to 1 when export buffer space
                             has been allocated.
                             Shader hardware checks this bit before executing any EXPORT instruction to
                             Position, Z or MRT targets, and puts the wave into a waiting state if the alloc has not
                             yet been received. The alloc arrives eventually (unless SKIP_EXPORT is set) as a
                             message and the shader then continues with the export.
EXECZ          9      N      Exec Mask is Zero.
VCCZ           10     N      Vector Condition Code is Zero.
IN_WG          11     N      Wave is a member of a work-group of more than one wave.
IN_BARRIER     12     N      Wave is waiting at a barrier.
HALT           13     Y      Wave is halted or scheduled to halt.
                             HALT can be set by the host via wave-control messages, or by the shader. The HALT
                             bit is ignored while in the trap handler (PRIV = 1). HALT is also ignored if a
                             host-initiated trap is received (request to enter the trap handler).
TRAP           14     N      Wave is flagged to enter the trap handler as soon as possible.
VALID          16     N      Wave is valid (has been created and not yet ended)
SKIP_EXPORT    18     Y      For Pixel and Vertex Shaders only.
                             "1" means this shader is not allocated export buffer space, so export instructions are
                             ignored (treated as NOPs). For pixel shaders, this is set to 1 when both the
                             COL0_EXPORT_FORMAT and Z_EXPORT_FORMAT are set to ZERO. If
                             SKIP_EXPORT==1, Must_export must be zero and vice versa.
PERF_EN        19     N      Performance counters are enabled for this wave
CDBG_USER      20     Y      User-controlled conditional debug. Set at wave-create time by a user register. Can be
                             used in conditional branches.
CDBG_SYS       21     Y      System-controlled conditional debug. Set at wave-create time by a system register.
                             Can be used in conditional branches.
FATAL_HALT     23     N      Indicates that the wave has halted due to a fatal error:
                             illegal instruction. The difference between halt and fatal_halt is that fatal_halt stops
                             waves even when PRIV=1.
NO_VGPRS       24     N      Indicates that this wave has released all of its VGPRs.
LDS_PARAM_RDY  25     Y      PS shaders only: indicates that LDS has been written with vertex attribute data and
                             the shader may now execute LDS_PARAM_LOAD instructions. If the wave attempts to
                             issue LDS_PARAM_LOAD before this bit is set, it stalls until the bit is set.
MUST_GS_ALLOC  26     N      GS shader must issue a GS_ALLOC_REQ message before terminating.
                             Sending this message clears this bit.
MUST_EXPORT    27     Y      PS: this wave must export color ("export-done") before it terminates.
                             Set to 1 for PS waves unless "skip_export==1". Cleared when PS exports data with
                             export’s Done bit set to 1.
                             GS: this wave must perform a GDS_ordered_count before terminating. Cleared when
                             a GS shader issues a GDS_ordered_count. GS is initialized to 1 normally, but to zero
                             for "no export" passes (stream-out only).
IDLE           28     N      Wave is idle (has no outstanding instructions). Used by the host (GRBM) to
                             determine if a wave is valid, halted and idle - able to read other wave state.
SCRATCH_EN     29     Y      Indicates that the wave has scratch memory allocated. This bit gets set to 1 if the wave
                             has FLAT_SCRATCH initialized; otherwise is zero.
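To make the bit layout of Table 5 concrete, the sketch below unpacks a 32-bit STATUS value into named fields. The helper names are hypothetical; the bit positions are copied from the table, and how the value was obtained (for example via S_GETREG) is outside the scope of this sketch.

```python
# (high bit, low bit) for each STATUS field, as listed in Table 5.
_STATUS_FIELDS = {
    "SCC": (0, 0), "SYS_PRIO": (2, 1), "USER_PRIO": (4, 3), "PRIV": (5, 5),
    "TRAP_EN": (6, 6), "EXPORT_RDY": (8, 8), "EXECZ": (9, 9), "VCCZ": (10, 10),
    "IN_WG": (11, 11), "IN_BARRIER": (12, 12), "HALT": (13, 13), "TRAP": (14, 14),
    "VALID": (16, 16), "SKIP_EXPORT": (18, 18), "PERF_EN": (19, 19),
    "CDBG_USER": (20, 20), "CDBG_SYS": (21, 21), "FATAL_HALT": (23, 23),
    "NO_VGPRS": (24, 24), "LDS_PARAM_RDY": (25, 25), "MUST_GS_ALLOC": (26, 26),
    "MUST_EXPORT": (27, 27), "IDLE": (28, 28), "SCRATCH_EN": (29, 29),
}


def decode_status(value: int) -> dict:
    """Split a raw STATUS value into its named fields."""
    fields = {}
    for name, (hi, lo) in _STATUS_FIELDS.items():
        width = hi - lo + 1
        fields[name] = (value >> lo) & ((1 << width) - 1)
    return fields


def ttmp_writable(status: int) -> bool:
    """TTMP registers are writable only while PRIV is set."""
    return bool(decode_status(status)["PRIV"])
```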

3.4.2. Mode register
Mode register fields can be read from, and written to, by the shader through scalar instructions. The table
below describes the mode register fields.
Table 6. Mode Register Fields
Field            Bit     Description
                 Pos
FP_ROUND         3:0     Controls round modes for math operations
                         [1:0] Single precision round mode
                         [3:2] Double precision and half precision (FP16) round mode
                         Round Modes: 0=nearest even, 1= +infinity, 2= -infinity, 3= toward zero
                         Round mode affects float ops in VALU, but not LDS or memory.
FP_DENORM        7:4     Controls whether floating point denormals are flushed or not.
                         [5:4] Single precision denormal mode
                         [7:6] Double precision and FP16 denormal mode
                         Denormal modes: 2 bits = { allow_output_denorms, allow_input_denorms }
                         0 = flush input and output denorms
                         1 = allow input denorms, flush output denorms
                         2 = flush input denorms, allow output denorms
                         3 = allow input and output denorms
                         Denorm mode affects float ops in: VALU, LDS, and VMEM atomics.
                         Texture/Buffer/Flat considers only bits 4 and 6 (allowing mode control over input-denorm
                         flushing, and not flushing output denorms), while LDS uses all bits for DS ops (but not for
                         FLAT).
DX10_CLAMP       8       Used by the vector ALU to force DX10 style treatment of NaNs. When set, NaN is clamped
                         to zero and all VALU exceptions are suppressed; otherwise NaN is passed through. The
                         clamping only occurs when the instruction has the CLAMP bit set to 1, but exceptions are
                         suppressed whenever DX10_CLAMP==1.
IEEE             9       IEEE==0: IEEE-754-1985/DX10 behavior for Min and Max, pass signaling NaN.
                         IEEE==1: IEEE-754-2008 behavior for Min and Max, quiet signaling NaN.
                         When set to 1, floating point opcodes that support exception flag gathering quiet and
                         propagate signaling NaN inputs per IEEE 754-2008. Min_f32/f64 and Max_f32/f64 become
                         IEEE 754-2008 compliant due to signaling NaN propagation and quieting. When set to 1,
                         MAX performs a ">" compare, but when set to zero (directX mode/IEEE 754-1985 mode)
                         MAX performs a ">=" compare. This only affects results for +/-0 and input denormals
                         which are flushed to zero.
LOD_CLAMPED      10      Sticky status bit - indicates that one or more texture accesses had their LOD clamped.
TRAP_AFTER_INST  11      Forces the wave to jump to the exception handler after each instruction is executed (but
                         not after ENDPGM). Only works if TRAP_EN = 1.
EXCP_EN          21:12   Enable mask for exceptions. Enabled means if the exception occurs and if TRAP_EN==1, a
                         trap may be taken.
                         [12] : invalid
                         [13] : inputDenormal
                         [14] : float_div0
                         [15] : overflow
                         [16] : underflow
                         [17] : inexact
                         [18] : int_div0
                         [19] : addr_watch - take exception when TC sees wave access an "address of interest"
                         [21] : trap on wave end - h/w clears this upon entering trap handler for end-of-wave
FP16_OVFL        23      If set, an overflowed FP16 VALU result is clamped to +/- MAX_FP16 regardless of round
                         mode, while still preserving true INF values. (Inputs which are infinity may result in infinity,
                         as does divide-by-zero).
DISABLE_PERF     27      1 = disable performance counting for this wave.
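The FP_ROUND and FP_DENORM encodings can be decoded as in the sketch below. The function name and the returned dictionary keys are hypothetical; the bit positions and mode meanings are those given in Table 6.

```python
ROUND_MODES = ("nearest even", "+infinity", "-infinity", "toward zero")


def _denorm(bits: int) -> dict:
    # 2 bits = { allow_output_denorms, allow_input_denorms }
    return {"allow_input_denorms": bool(bits & 1), "allow_output_denorms": bool(bits & 2)}


def decode_fp_modes(mode: int) -> dict:
    """Decode MODE[3:0] (FP_ROUND) and MODE[7:4] (FP_DENORM)."""
    fp_round = mode & 0xF
    fp_denorm = (mode >> 4) & 0xF
    return {
        "round_f32": ROUND_MODES[fp_round & 3],
        "round_f64_f16": ROUND_MODES[(fp_round >> 2) & 3],
        "denorm_f32": _denorm(fp_denorm & 3),
        "denorm_f64_f16": _denorm((fp_denorm >> 2) & 3),
    }
```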

3.4.3. M0 : Miscellaneous Register
There is one 32-bit M0 register per wave and it is used for:
Table 7. M0 Register Fields
Operation          M0 Contents                        Notes
LDS_PARAM_LOAD     { 1’b0, new_prim_mask[15:1],       Offset is in bytes and offset[6:0] must be zero.
                   parameter_offset[15:0] }           Wave32: new_prim_mask is {8’b0, mask[7:1] }
LDS_DIRECT_LOAD    { 13’b0, DataType[2:0],            address is in bytes
                   LDS_address[15:0] }
LDS ADDTID         { 16’h0, lds_offset[15:0] }        offset is in bytes, must be 4-byte aligned
Global Data Share  { base[15:0] , size[15:0] }        base and size are in bytes
GDS Ordered Count  { base[15:0], 3’h0,                used for deferred attribute shading (split-GS)
                   logical_wave_id[12:0] }
Global Wave Sync   various uses                       see instruction definition
S/V_MOVREL         GPR index                          See S_MOVREL and V_MOVREL instructions
S_SENDMSG / _RTN   varies                             sendmsg data. See [Send_Message_Types]
EXPORT             Row number for mesh shader POS     See Export chapter
                   & Param exports
SMEM               address_offset[31:0]               see SMEM section
Temporary          data[31:0]                         can be used as general temporary data storage

M0 can only be written by the scalar ALU.
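As an illustration of two of the M0 layouts in Table 7, the sketch below packs the fields into a 32-bit value. The function names are hypothetical, and the concatenation is read in the usual way (left-most field in the highest bits), which is an assumption about the notation rather than something stated here.

```python
def pack_m0_lds_direct_load(data_type: int, lds_address: int) -> int:
    """{ 13'b0, DataType[2:0], LDS_address[15:0] }"""
    if not 0 <= data_type < 8:
        raise ValueError("DataType is a 3-bit field")
    if not 0 <= lds_address < (1 << 16):
        raise ValueError("LDS_address is a 16-bit byte address")
    return (data_type << 16) | lds_address


def pack_m0_lds_param_load(new_prim_mask: int, parameter_offset: int) -> int:
    """{ 1'b0, new_prim_mask[15:1], parameter_offset[15:0] }"""
    if parameter_offset & 0x7F:
        raise ValueError("offset[6:0] must be zero")
    if not 0 <= parameter_offset < (1 << 16):
        raise ValueError("parameter_offset is a 16-bit byte offset")
    mask_bits = (new_prim_mask >> 1) & 0x7FFF   # bits [15:1] of the mask
    return (mask_bits << 16) | parameter_offset
```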

3.4.4. NULL
NULL is a scalar source and destination. Reading NULL returns zero, writing to NULL has no effect (write data
is discarded).
NULL may be used anywhere scalar sources can normally be used:
• When NULL is used as the destination of an SALU instruction, the instruction executes: SDST is not written
but SCC is updated (if the instruction normally updates SCC).
• NULL may not be used as an S#, V# or T#.

3.4.5. SCC: Scalar Condition Code
Many scalar ALU instructions set the Scalar Condition Code (SCC) bit, indicating the result of the operation.
Compare operations: 1 = true
Arithmetic operations: 1 = carry out
Bit/logical operations: 1 = result was not zero
Move: does not alter SCC
The SCC can be used as the carry-in for extended-precision integer arithmetic, as well as the selector for
conditional moves and branches.
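The carry-in use of SCC can be sketched with a 64-bit addition built from two 32-bit steps. The helper names are hypothetical and only model the arithmetic: the first step produces a carry-out in SCC, the second consumes it as carry-in.

```python
MASK32 = 0xFFFFFFFF


def add_u32(a: int, b: int):
    """32-bit add; returns (result, scc) where scc is the carry-out."""
    total = (a & MASK32) + (b & MASK32)
    return total & MASK32, total >> 32


def addc_u32(a: int, b: int, scc: int):
    """32-bit add with SCC as carry-in; returns (result, new scc)."""
    total = (a & MASK32) + (b & MASK32) + (scc & 1)
    return total & MASK32, total >> 32


def add_u64(a: int, b: int):
    """64-bit add composed of a low add and a high add-with-carry."""
    lo, scc = add_u32(a, b)
    hi, scc = addc_u32(a >> 32, b >> 32, scc)
    return (hi << 32) | lo, scc
```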

3.4.6. Vector Compares: VCC and VCCZ
Vector ALU comparison instructions (V_CMP) compare two values and return a bit-mask of the result, where
each bit represents one lane (work-item) where: 1= pass, 0 = fail. This result mask is the Vector Condition Code
(VCC). VCC is also set for selected integer ALU operations (carry-out).
These instructions write this mask either to VCC, an SGPR or to EXEC, but do not write to both EXEC and
SGPRs. Wave32 writes only the low 32 bits of VCC, EXEC or a single SGPR; Wave64 writes 64-bits of VCC, EXEC
or an aligned pair of SGPRs.
Whenever any instruction writes a value to VCC, the hardware automatically updates a "VCC summary" bit
called VCCZ. This bit indicates whether or not the entire VCC mask is zero for the current wave-size. Wave32
ignores VCC[63:32] and only bits[31:0] contribute to VCCZ. This is useful for early-exit branch tests. VCC is also set
for certain integer ALU operations (carry-out).
The EXEC mask determines which threads execute an instruction. The VCC indicates which executing threads
passed the conditional test, or which threads generated a carry-out from an integer add or subtract.
S_MOV_B64      EXEC, 0x00000001   // set just one thread active; others are inactive
V_CMP_EQ_B32   VCC, V0, V0        // compare (V0 == V0) and write result to VCC (all bits in VCC are updated)



VCC physically resides in the SGPR register file in a specific pair of SGPRs, so when an
instruction sources VCC, that counts against the limit on the total number of SGPRs that can
be sourced for a given instruction.

Wave32 waves may use any SGPR for mask/carry/borrow operations, but may not use VCC_HI or EXEC_HI.
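The VCCZ summary bit described in this section depends on the wave size. A minimal sketch, with a hypothetical function name:

```python
def vccz(vcc: int, wave_size: int) -> int:
    """Return 1 if the VCC mask is zero for the given wave size, else 0.

    Wave32 ignores VCC[63:32]; only bits [31:0] contribute.
    """
    if wave_size not in (32, 64):
        raise ValueError("wave_size must be 32 or 64")
    mask = 0xFFFFFFFF if wave_size == 32 else 0xFFFFFFFFFFFFFFFF
    return int((vcc & mask) == 0)
```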

3.4.7. FLAT_SCRATCH
FLAT_SCRATCH is a 64-bit register that holds a pointer to the base of scratch memory for this wave. For waves
that have scratch space allocated, wave-launch hardware initializes the FLAT_SCRATCH register with the
scratch base address unique to this wave. This register is read-only, except while in the trap handler where it is
writable. The value is a byte address and must be 256-byte aligned. If the wave has no scratch space allocated,
then reading FLAT_SCRATCH returns zero.
The value for FLAT_SCRATCH is computed in hardware and initialized for any wave that has scratch space
allocated:
scratch_base = scratch_base[63:0] + spi_scratch_offset[31:0]

FLAT_SCRATCH_LO = scratch_base [31:0]
FLAT_SCRATCH_HI = scratch_base [63:32]
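The computation above amounts to a 64-bit add followed by a split into two 32-bit halves. The sketch below models it; the function name is hypothetical, and truncating the sum to 64 bits is an assumption.

```python
def flat_scratch_registers(scratch_base: int, spi_scratch_offset: int):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for a wave with scratch allocated."""
    base = (scratch_base + (spi_scratch_offset & 0xFFFFFFFF)) & 0xFFFFFFFFFFFFFFFF
    return base & 0xFFFFFFFF, base >> 32
```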

3.4.8. Hardware Internal Registers
These registers are read-only and can be accessed by the S_GETREG instruction. They return information
about hardware allocation and status. HW_ID and the various *_BASE values are not predictable and may
change over the lifetime of a wave if context-switching can occur.
HW_ID1
Field     Bits    Description
WAVE_ID   4:0     Wave id within the SIMD.
SIMD_ID   9:8     SIMD_ID within the WGP: [0] = row, [1] = column.
WGP_ID    13:10   Physical WGP ID.
SA_ID     16      Shader Array ID
SE_ID     20:18   Shader Engine ID
DP_RATE   31:29   Number of double-precision float units per SIMD. 1+log2(#DP-alu’s). 0=none, 1=1/32 rate (1 dp
                  lane/clk), 2=1/16 rate (2 dp lanes/clk), 3=1/8, 4=1/4, 5=1/2, 6=full rate (32 dp lanes per clock).
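A short sketch of how the HW_ID1 fields might be unpacked, including the DP_RATE encoding (1 + log2 of the number of double-precision lanes). The names are hypothetical and, as noted above, the values are for debug use and are not predictable.

```python
# (high bit, low bit) per field, from the HW_ID1 table.
_HW_ID1_FIELDS = {
    "WAVE_ID": (4, 0), "SIMD_ID": (9, 8), "WGP_ID": (13, 10),
    "SA_ID": (16, 16), "SE_ID": (20, 18), "DP_RATE": (31, 29),
}


def decode_hw_id1(value: int) -> dict:
    """Split a raw HW_ID1 value into its named fields."""
    return {
        name: (value >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, (hi, lo) in _HW_ID1_FIELDS.items()
    }


def dp_lanes_per_clock(dp_rate: int) -> int:
    """DP_RATE = 1 + log2(lanes); 0 means no double-precision units."""
    return 0 if dp_rate == 0 else 1 << (dp_rate - 1)
```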

HW_ID2
Field      Bits    Description
QUEUE_ID   3:0     Queue_ID (also encodes shader stage)
PIPE_ID    5:4     Pipeline ID
ME_ID      9:8     MicroEngine ID: 0 = graphics, 1 & 2 = ACE compute
STATE_ID   14:12   State context ID
WG_ID      20:16   Work-group ID (0-31) within the WGP.
VM_ID      27:24   Virtual Memory ID

Other S_GETREG, S_SETREG targets:
Register          Bits     Description
FLUSH_IB          1        Writing this with bit[0]=1 flushes the instruction fetch buffers for the targeted wave.
SH_MEM_BASES      16, 16   Per-VMID register, readable by the shader, which holds the private and shared
                           apertures.
PC_LO             32       Program counter low and high halves. GETREG should not be used to read the PC;
PC_HI             32       use S_GETPC instead.
FLAT_SCRATCH_HI   32       Flat scratch base address. Only writable when in trap handler
FLAT_SCRATCH_LO   32

Note: TMA and TBA are read using S_SENDMSG_RTN.

3.4.9. Trap and Exception registers
Each type of exception can be enabled or disabled independently by setting, or clearing, bits in the MODE
register’s EXCP_EN field. This section describes the registers that control and report shader exceptions.
Trap temporary SGPRs (TTMP*) are privileged for writes - they can be written only when in the trap handler
(STATUS.PRIV = 1). TTMPs can be read by the user shader.
When the shader is not privileged (STATUS.PRIV==0), writes to these are ignored. TMA and TBA are read-only;
they can be accessed through S_SENDMSG_RTN.
When a trap is taken (either user initiated, exception or host initiated), the shader hardware generates an
S_TRAP instruction. This loads trap information into a pair of SGPRs:
{TTMP1, TTMP0} = {7'h0, HT[0], trapID[7:0], PC[47:0]}.

HT is set to one for host initiated traps, and zero for user traps (s_trap) or exceptions. TRAP_ID is zero for
exceptions, or the user/host trapID for those traps.
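A trap handler would typically pull these fields back apart. The sketch below shows one way to do that, reading the concatenation with PC in the low 48 bits; the function name and dictionary keys are hypothetical.

```python
def decode_trap_info(ttmp0: int, ttmp1: int) -> dict:
    """Unpack {TTMP1, TTMP0} = {7'h0, HT[0], trapID[7:0], PC[47:0]}."""
    packed = ((ttmp1 & 0xFFFFFFFF) << 32) | (ttmp0 & 0xFFFFFFFF)
    return {
        "pc": packed & ((1 << 48) - 1),      # PC[47:0]
        "trap_id": (packed >> 48) & 0xFF,    # trapID[7:0]
        "host_trap": (packed >> 56) & 1,     # HT[0]
    }
```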
STATUS . TRAP_EN
This bit tells the shader whether or not a trap handler is present. When one is not present, traps are not
taken no matter whether they’re floating point, user or host-initiated traps. When the trap handler is
present, the wave uses an extra 16 SGPRs for trap processing.
If trap_en == 0, all traps and exceptions are ignored, and s_trap is converted by hardware to NOP.
MODE . EXCP_EN[8:0]
Exception enable mask. Defines which of the sources of exception cause the shader to jump to the trap
handler when the exception occurs. 1 = enable traps; 0 = disable traps.
MEMVIOL and Illegal-Instruction jump to the trap handler and cannot be masked off.
Bit  Exception               Cause                                                         Result
0    invalid                 Operand is invalid for operation: 0 * inf, 0/0, sqrt(-x),     QNaN
                             any input is SNaN.
1    Input Denormal          one or more operands was subnormal                            ordinary result
2    Divide by zero          Float X / 0                                                   correct signed infinity
3    overflow                The rounded result would be larger than the largest finite    Depends on rounding mode.
                             number.                                                       Signed max# or infinity.
4    underflow               The exact or rounded result is less than the smallest normal  subnormal or zero
                             (non-subnormal) representable number.
5    inexact                 The rounded result of a valid operation is different from     Operation result
                             the infinitely precise result.
6    integer divide by zero  Integer X / 0                                                 undefined
7    address watch           VMEM or SMEM has witnessed a thread access an 'address of
                             interest'
8    reserved

TRAPSTS Register
TRAPSTS contains information about traps and exceptions, and may be written by user shader or trap handler.

Field            Bit     Description
                 Pos
EXCP             8:0     Status bits of which exceptions have occurred. These bits are sticky and
                         accumulate results until the shader program clears them. These bits are
                         accumulated regardless of the setting of EXCP_EN. These can be read or
                         written without shader privilege.
                         Bit  Exception
                         0    invalid
                         1    Input Denormal
                         2    Divide by zero
                         3    overflow
                         4    underflow
                         5    inexact
                         6    integer divide by zero
                         7    address watch
                         8    memory violation
SAVECTX          10      A bit set by the host command via GRBM (or context-save/restore unit)
                         indicating that this wave must jump to its trap handler and save its context.
                         This bit should be cleared by the trap handler using S_SETREG.
ILLEGAL_INST     11      An illegal instruction has been detected. If a trap handler is present and the
                         wave is not in the trap handler: jump to the trap handler; Otherwise, send an
                         interrupt and halt.
ADDR_WATCH1-3    14:12   Indicates that address watch 1, 2 or 3 have been hit. [12]=addr_watch1.
                         Addr_watch0 is indicated by the existing bit TRAPSTS.EXCP[7].
BUFFER_OOB       15      Buffer Out Of Bounds indicator.
                         Set when a buffer (MUBUF, MTBUF) instruction requests an address that is
                         out of bounds. Does not cause a trap. Status bit is sticky.
HOST_TRAP        16      Trap handler has been called to service a host trap. Trap may simultaneously
                         have been called to handle other traps as well
WAVE_START       17      Trap handler has been called before the first instruction of a new wave.
WAVE_END         18      Trap handler has been called after the last instruction of a wave.
TRAP_AFTER_INST  20      Trap handler has been called due to "trap after instruction" mode

3.4.10. Time
There are two methods for measuring time in the shader:
• "TIME" - measure cycles in graphics core clocks (20 bit counter)
• "REALTIME" - measure time based on a fixed frequency, constantly running clock (typically 100MHz),
providing a 64-bit value.
Shader programs have access to a free-running clock counter in order to measure the duration of portions of a
wave’s execution. This counter can be read via: "S_GETREG S0, SHADER_CYCLES" and returns a 20-bit cycle
counter value. This counter is not synchronized across different SIMDs and should only be used to measure
time-delta within one wave. Reading the counter is handled through the SALU which has a typical latency of
around 8 cycles.
For measuring time between different waves or SIMDs, or to reference a clock that does not stop counting
when the chip is idle, use "REALTIME". Real-time is a clock counter that comes from the clock-generator and
runs at a constant speed, regardless of the shader or memory clock speeds. This counter can be read by:

3.4. Wave State Registers

27 of 597

"RDNA3" Instruction Set Architecture

S_SENDMSG_RTN_B64 S[2:3] REALTIME
S_WAITCNT LGKMcnt == 0
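Because the SHADER_CYCLES counter is only 20 bits wide, a time delta has to account for wrap-around. The sketch below shows the usual modular subtraction; the function name is hypothetical and the result is only meaningful if the counter wrapped at most once between the two reads.

```python
COUNTER_BITS = 20  # width of the SHADER_CYCLES counter


def shader_cycles_delta(start: int, end: int) -> int:
    """Elapsed cycles between two 20-bit counter reads, tolerant of one wrap."""
    return (end - start) & ((1 << COUNTER_BITS) - 1)
```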

3.5. Initial Wave State
Before a wave begins execution, some of the state registers including SGPRs and VGPRs are initialized with
values derived either from state data, dynamic or derived data (e.g. interpolants or unique per-wave data). The
values are derived from register state and dynamic wave-launch state.
Note that some of this state is common across all waves in a draw call, and other state is unique per wave.
This section describes what state is initialized per shader stage. Note that as usual in this spec, the shader
stages refer to hardware shader stages and these often are not identical to software shader stages.
State initialization is controlled by state registers which are defined in other documentation.

3.5.1. EXEC initialization
Normally, EXEC is initialized with the mask of which threads are active in a wave. There are, however, cases
where the EXEC mask is initialized to zero indicating that this wave should do no work and exit immediately.
These are referred to as "Null waves" (EXEC==0) and exit immediately after starting execution.

3.5.2. FLAT_SCRATCH Initialization
Waves that have scratch memory space allocated to them are initialized with their FLAT_SCRATCH register
having a pointer to the address in global memory. Waves without scratch have this initialized to zero.

3.5.3. SGPR Initialization
SGPRs are initialized based on various SPI_PGM_RSRC* or COMPUTE_PGM_* register settings. Note that only
the enabled values are loaded, and they are packed into consecutive SGPRs, skipping over disabled values
regardless of the number of user-constants loaded. No SGPRs are skipped for alignment.
The tables below show how to control which values are initialized prior to shader launch.

3.5.3.1. Pixel Shader (PS)
Table 8. PS SGPR Load
SGPR Order      Description                                        Enable
First 0..32 of  User data registers                                SPI_SHADER_PGM_RSRC2_PS.user_sgpr
then            {bc_optimize, prim_mask[14:0], lds_offset[15:0]}   N/A
then            {ps_wave_id[9:0], ps_wave_index[5:0]}              SPI_SHADER_PGM_RSRC2_PS.wave_cnt_en
then            Provoking Vtx Info:                                SPI_SHADER_PGM_RSRC1_PS.LOAD_PROVOKING_VTX
                {prim15[1:0], prim14[1:0], …, prim0[1:0]}

PS_wave_index is (se_id[1:0] * GPU__GC__NUM_PACKER_PER_SE + packer_id).
PS_wave_id is an index value which is incremented for every wave. There is a separate counter per
packer, so the combination of { ps_wave_id, ps_wave_index } forms a unique ID for any wave on the
chip. The wave-id counter wraps at SPI_PS_MAX_WAVE_ID.
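The relationship between these values can be sketched as follows. The function names are hypothetical, `num_packer_per_se` stands in for GPU__GC__NUM_PACKER_PER_SE, and placing ps_wave_index in the low six bits is an assumption based on reading the concatenation {ps_wave_id[9:0], ps_wave_index[5:0]} left to right.

```python
def ps_wave_index(se_id: int, packer_id: int, num_packer_per_se: int) -> int:
    """(se_id[1:0] * GPU__GC__NUM_PACKER_PER_SE + packer_id)"""
    return (se_id & 0x3) * num_packer_per_se + packer_id


def unpack_ps_wave_sgpr(value: int) -> dict:
    """Split the SGPR holding {ps_wave_id[9:0], ps_wave_index[5:0]}."""
    return {"ps_wave_id": (value >> 6) & 0x3FF, "ps_wave_index": value & 0x3F}
```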

3.5.3.2. Geometry Shader (GS)
ES and GS are launched as a combined wave, of type GS. The shader is initialized as a GS wave type, with the PC
pointing to the ES shader and with GS user-SGPRs preloaded, along with a memory pointer to more GS user
SGPRs. The shader executes the ES program first, then upon completion executes the GS shader. Once the ES
shader completes, it may re-use the SGPRs which contain ES user data and the GS shader address.
The first 8 SGPRs are automatically initialized - no values are skipped (unused ones are written with zero).
State registers:
• SPI_SHADER_PGM_{LO,HI}_ES : address of the ES shader
• SPI_SHADER_PGM_RSRC1: resources of combined ES + GS shader
◦ GS_VGPR_COMP_CNT = # of GS VGPRs to load (2 bits)
• SPI_SHADER_PGM_RSRC2: resources of combined ES + GS shader
◦ VGPR_COMP_CNT = # of VGPRs to load (2 bits)
◦ OC_LDS_EN
• SPI_SHADER_PGM_RSRC{3,4}: resources of combined ES + GS shader
Table 9. GS SGPR Load
SGPR #        GS with FAST_LAUNCH != 2         GS with FAST_LAUNCH == 2         Enable
0             GS Program Address [31:0]        GS Program Address [31:0]        automatically loaded
              comes from:                      comes from:
              SPI_SHADER_PGM_LO_GS             SPI_SHADER_PGM_LO_GS
1             GS Program Address [63:32]       GS Program Address [63:32]       automatically loaded
              comes from:                      comes from:
              SPI_SHADER_PGM_HI_GS             SPI_SHADER_PGM_HI_GS
2             {1’b0, gsAmpPrimPerGrp[8:0],     32’h0                            Must not be overwritten, in some cases
              1’b0, esAmpVertPerGrp[8:0],                                       listed below.
              ordered_wave_id[11:0]}
3             { TGsize[3:0],                   { TGsize[3:0],                   automatically loaded.
              WaveInGroup[3:0], 8’h0,          WaveInGroup[3:0], 24’h0 }
              gsInputPrimCnt[7:0],
              esInputVertCnt[7:0] }
4             Off-chip LDS base [31:0]         { TGID_Y[15:0],                  SPI_SHADER_PGM_RSRC2_GS.oc_lds_en
                                               TGID_X[15:0] }
5             { 17’h0, attrSgBase[14:0] }      { TGID_Z[15:0], 1’b0,            -
                                               attrSgBase[14:0] }
6, 7          SPI is loading                   -                                -
              flat_scratch[63:0] at this time
8 - (up to)   User data registers of GS        User data registers of GS        SPI_SHADER_PGM_RSRC2_GS.user_sgpr
39            shader                           shader

When stream-out is used, SGPR[2] must not be modified or overwritten any time before the final stream out is
issued (GDS ordered count with 'done' = 1). This is because the pipeline reset sequence which hardware
automatically executes reads this SGPR to fabricate a GDS-ordered-count instruction and relies on its value.

3.5.3.3. Front End Shader (HS)
LS and HS are launched as a combined wave, of type HS. The shader is initialized as an HS wave type, with the
PC pointing to the LS shader and with HS user-SGPRs preloaded, along with a memory pointer to more HS user
SGPRs. The shader executes the LS program first, then upon completion executes the HS shader. Once the
LS shader completes, it may re-use the SGPRs which contain LS user data and the HS shader address.
The first 8 SGPRs are automatically initialized - no values are skipped (unused ones are written with zero).
Other registers:
• SPI_SHADER_PGM_{LO,HI}_LS : address of the LS shader
• SPI_SHADER_PGM_RSRC1: resources of combined LS + HS shader
◦ LS_VGPR_COMP_CNT = # of LS VGPRs to load (2 bits)
• SPI_SHADER_PGM_RSRC{2,3,4}: resources of combined LS + HS shader
Table 10. HS (LS) SGPR Load
SGPR # | Description | Enable
0 | HS Program Address Low ([31:0]) | SPI_SHADER_USER_DATA_LO_HS
1 | HS Program Address High ([63:32]) | SPI_SHADER_USER_DATA_HI_HS
2 | Off-chip LDS base [31:0] | automatically loaded
3 | {first_wave[0], lshs_TGsize[6:0], lshs_PatchCount[7:0], HS_vertCount[7:0], LS_vertCount[7:0]} | automatically loaded
4 | TF buffer base [15:0] | automatically loaded
5 | { 27’b0, wave_id_in_group[4:0] } | SPI_SHADER_PGM_RSRC2_HS.scratch_en
8 - (up to) 39 | User data registers of HS shader | SPI_SHADER_PGM_RSRC2_HS.user_sgpr

3.5.3.4. Compute Shader (CS)
Table 11. CS SGPR Load
SGPR Order | Description | Enable
First 0..16 of | User data registers | COMPUTE_PGM_RSRC2.user_sgpr
then | work_group_id0[31:0] | COMPUTE_PGM_RSRC2.tgid_x_en
then | work_group_id1[31:0] | COMPUTE_PGM_RSRC2.tgid_y_en
then | work_group_id2[31:0] | COMPUTE_PGM_RSRC2.tgid_z_en
then | {first_wave, 6’h00, wave_id_in_group[4:0], 2’h0, ordered_append_term[11:0], workgroup_size_in_waves[5:0]} | COMPUTE_PGM_RSRC2.tg_size_en
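The packed SGPR enabled by tg_size_en can be unpacked with plain shifts and masks. The following is a minimal sketch in Python, assuming the concatenation in Table 11 is written most-significant field first (so first_wave lands in bit 31); the function name and dictionary keys are illustrative and not part of the ISA.

```python
def decode_cs_tg_size_sgpr(value: int) -> dict:
    """Split the 32-bit tg_size SGPR into the fields listed in Table 11.

    Assumed layout (MSB first): first_wave[31], 6'h00[30:25],
    wave_id_in_group[24:20], 2'h0[19:18], ordered_append_term[17:6],
    workgroup_size_in_waves[5:0].
    """
    value &= 0xFFFFFFFF
    return {
        "first_wave": (value >> 31) & 0x1,
        "wave_id_in_group": (value >> 20) & 0x1F,
        "ordered_append_term": (value >> 6) & 0xFFF,
        "workgroup_size_in_waves": value & 0x3F,
    }


# Hypothetical register value: first wave, wave 3 of an 8-wave work-group.
example = (1 << 31) | (3 << 20) | (0x123 << 6) | 8
fields = decode_cs_tg_size_sgpr(example)
```

The field widths sum to 32 bits (1 + 6 + 5 + 2 + 12 + 6), which is a quick consistency check on the assumed bit positions.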

3.5.4. Which VGPRs Get Initialized
The table shows the VGPRs which may be initialized prior to wave launch. COMPUTE_PGM_RSRC* or
SPI_SHADER_PGM_RSRC* control registers can select a reduced set per shader stage.

3.5.4.1. Pixel Shader VGPR Input Control
Pixel Shader VGPR input loading is more involved than for the other stages. A CAM maps VS outputs to PS
inputs. The PS inputs which need loading are loaded in this order:
I persp sample
J persp sample
I persp center
J persp center
I persp centroid
J persp centroid
I/W
J/W
1/W
I linear sample
J linear sample
I linear center
J linear center
I linear centroid
J linear centroid
Line stipple
X float
Y float
Z float
W float
Facedness
Ancillary: RTA, ISN, PT, eye-id
Sample mask
X/Y fixed

Two registers (SPI_PS_INPUT_ENA and SPI_PS_INPUT_ADDR) control the enabling of IJ calculations and
specifying of VGPR initialization for PS waves. SPI_PS_INPUT_ENA is used to determine what gradients are
enabled for setup, whether per-pixel Z is enabled, what terms are calculated and/or passed through the
barycentric logic, and what is loaded into VGPR for PS. SPI_PS_INPUT_ADDR can be used to manipulate the
VGPR destination of terms that are enabled by INPUT_ENA, typically providing a way to maintain consistent
VGPR addressing when terms are removed from INPUT_ENA. It is valid to set a bit in ADDR when the
corresponding bit in ENA is not set, but if the ENA bit is set then the corresponding bit in ADDR must also be
set.
The two Pixel Staging Register (PSR) control registers contain an identical set of fields and consist of the
following:

Field Name | IJ / VGPR Terms | Bits | VGPR Dest with Full Load
PERSP_SAMPLE_ENA | PERSP_SAMPLE I | 32 | VGPR0
PERSP_SAMPLE_ENA | PERSP_SAMPLE J | 32 | VGPR1
PERSP_CENTER_ENA | PERSP_CENTER I | 32 | VGPR2
PERSP_CENTER_ENA | PERSP_CENTER J | 32 | VGPR3
PERSP_CENTROID_ENA | PERSP_CENTROID I | 32 | VGPR4
PERSP_CENTROID_ENA | PERSP_CENTROID J | 32 | VGPR5
PERSP_PULL_MODEL_ENA | PERSP_PULL_MODEL I/W | 32 | VGPR6
PERSP_PULL_MODEL_ENA | PERSP_PULL_MODEL J/W | 32 | VGPR7
PERSP_PULL_MODEL_ENA | PERSP_PULL_MODEL 1/W | 32 | VGPR8
LINEAR_SAMPLE_ENA | LINEAR_SAMPLE I | 32 | VGPR9
LINEAR_SAMPLE_ENA | LINEAR_SAMPLE J | 32 | VGPR10
LINEAR_CENTER_ENA | LINEAR_CENTER I | 32 | VGPR11
LINEAR_CENTER_ENA | LINEAR_CENTER J | 32 | VGPR12
LINEAR_CENTROID_ENA | LINEAR_CENTROID I | 32 | VGPR13
LINEAR_CENTROID_ENA | LINEAR_CENTROID J | 32 | VGPR14
LINE_STIPPLE_TEX_ENA | LINE_STIPPLE_TEX | 32 | VGPR15
POS_X_FLOAT_ENA | POS_X_FLOAT | 32 | VGPR16
POS_Y_FLOAT_ENA | POS_Y_FLOAT | 32 | VGPR17
POS_Z_FLOAT_ENA | POS_Z_FLOAT | 32 | VGPR18
POS_W_FLOAT_ENA | POS_W_FLOAT | 32 | VGPR19
FRONT_FACE_ENA | FRONT_FACE | 32 | VGPR20
ANCILLARY_ENA | RTA_Index[28:16], Sample_Num[11:8], Eye_id[7], VRSrateY[5:4], VRSrateX[3:2], Prim Typ[1:0] | 29 | VGPR21
SAMPLE_COVERAGE_ENA | SAMPLE_COVERAGE | 16 | VGPR22
POS_FIXED_PT_ENA | Position {Y[16], X[16]} | 32 | VGPR23

The above table shows VGPR destinations for PS when all possible terms are enabled. If PS_INPUT_ADDR ==
PS_INPUT_ENA, then PS VGPRs pack towards VGPR0 as terms are disabled, as shown in the table below:
Field Name | ENA | ADDR | IJ / VGPR Terms | VGPR Dest
PERSP_SAMPLE_ENA | 1 | 1 | PERSP_SAMPLE I | VGPR0
PERSP_SAMPLE_ENA | 1 | 1 | PERSP_SAMPLE J | VGPR1
PERSP_CENTER_ENA | 1 | 1 | PERSP_CENTER I | VGPR2
PERSP_CENTER_ENA | 1 | 1 | PERSP_CENTER J | VGPR3
PERSP_CENTROID_ENA | 0 | 0 | PERSP_CENTROID I | X
PERSP_CENTROID_ENA | 0 | 0 | PERSP_CENTROID J | X
PERSP_PULL_MODEL_ENA | 0 | 0 | PERSP_PULL_MODEL I/W | X
PERSP_PULL_MODEL_ENA | 0 | 0 | PERSP_PULL_MODEL J/W | X
PERSP_PULL_MODEL_ENA | 0 | 0 | PERSP_PULL_MODEL 1/W | X
LINEAR_SAMPLE_ENA | 0 | 0 | LINEAR_SAMPLE I | X
LINEAR_SAMPLE_ENA | 0 | 0 | LINEAR_SAMPLE J | X
LINEAR_CENTER_ENA | 0 | 0 | LINEAR_CENTER I | X
LINEAR_CENTER_ENA | 0 | 0 | LINEAR_CENTER J | X
LINEAR_CENTROID_ENA | 0 | 0 | LINEAR_CENTROID I | X
LINEAR_CENTROID_ENA | 0 | 0 | LINEAR_CENTROID J | X
LINE_STIPPLE_TEX_ENA | 0 | 0 | LINE_STIPPLE_TEX | X
POS_X_FLOAT_ENA | 1 | 1 | POS_X_FLOAT | VGPR4
POS_Y_FLOAT_ENA | 1 | 1 | POS_Y_FLOAT | VGPR5
POS_Z_FLOAT_ENA | 0 | 0 | POS_Z_FLOAT | X
POS_W_FLOAT_ENA | 0 | 0 | POS_W_FLOAT | X
FRONT_FACE_ENA | 0 | 0 | FRONT_FACE | X
ANCILLARY_ENA | 0 | 0 | Ancil Data | X
SAMPLE_COVERAGE_ENA | 0 | 0 | SAMPLE_COVERAGE | X
POS_FIXED_PT_ENA | 0 | 0 | Position {Y[16], X[16]} | X

However, if PS_INPUT_ADDR != PS_INPUT_ENA then the VGPR destination of enabled terms can be
manipulated. An example of this is shown in the table below:
Field Name | ENA | ADDR | IJ / VGPR Terms | VGPR Dest
PERSP_SAMPLE_ENA | 1 | 1 | PERSP_SAMPLE I | VGPR0
PERSP_SAMPLE_ENA | 1 | 1 | PERSP_SAMPLE J | VGPR1
PERSP_CENTER_ENA | 1 | 1 | PERSP_CENTER I | VGPR2
PERSP_CENTER_ENA | 1 | 1 | PERSP_CENTER J | VGPR3
PERSP_CENTROID_ENA | 0 | 1 | PERSP_CENTROID I | VGPR4 skipped
PERSP_CENTROID_ENA | 0 | 1 | PERSP_CENTROID J | VGPR5 skipped
PERSP_PULL_MODEL_ENA | 0 | 1 | PERSP_PULL_MODEL I/W | VGPR6 skipped
PERSP_PULL_MODEL_ENA | 0 | 1 | PERSP_PULL_MODEL J/W | VGPR7 skipped
PERSP_PULL_MODEL_ENA | 0 | 1 | PERSP_PULL_MODEL 1/W | VGPR8 skipped
LINEAR_SAMPLE_ENA | 0 | 0 | LINEAR_SAMPLE I | X
LINEAR_SAMPLE_ENA | 0 | 0 | LINEAR_SAMPLE J | X
LINEAR_CENTER_ENA | 0 | 0 | LINEAR_CENTER I | X
LINEAR_CENTER_ENA | 0 | 0 | LINEAR_CENTER J | X
LINEAR_CENTROID_ENA | 0 | 1 | LINEAR_CENTROID I | VGPR9 skipped
LINEAR_CENTROID_ENA | 0 | 1 | LINEAR_CENTROID J | VGPR10 skipped
LINE_STIPPLE_TEX_ENA | 0 | 1 | LINE_STIPPLE_TEX | VGPR11 skipped
POS_X_FLOAT_ENA | 1 | 1 | POS_X_FLOAT | VGPR12
POS_Y_FLOAT_ENA | 1 | 1 | POS_Y_FLOAT | VGPR13
POS_Z_FLOAT_ENA | 0 | 0 | POS_Z_FLOAT | X
POS_W_FLOAT_ENA | 0 | 0 | POS_W_FLOAT | X
FRONT_FACE_ENA | 0 | 0 | FRONT_FACE | X
ANCILLARY_ENA | 0 | 0 | Ancil Data | X
SAMPLE_COVERAGE_ENA | 0 | 0 | SAMPLE_COVERAGE | X
POS_FIXED_PT_ENA | 0 | 0 | Position {Y[16], X[16]} | X
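The interaction between SPI_PS_INPUT_ENA and SPI_PS_INPUT_ADDR can be modelled as a running VGPR counter: every term whose ADDR bit is set reserves a VGPR slot, and only terms whose ENA bit is also set are actually loaded. The sketch below is an illustrative Python model inferred from the example tables above, not a description of the hardware implementation; fields are identified by name because the register bit positions are not given here, and the helper name is hypothetical.

```python
# (field, terms) in load order; term names are shortened for illustration.
PS_INPUT_FIELDS = [
    ("PERSP_SAMPLE_ENA", ["PERSP_SAMPLE I", "PERSP_SAMPLE J"]),
    ("PERSP_CENTER_ENA", ["PERSP_CENTER I", "PERSP_CENTER J"]),
    ("PERSP_CENTROID_ENA", ["PERSP_CENTROID I", "PERSP_CENTROID J"]),
    ("PERSP_PULL_MODEL_ENA", ["PULL_MODEL I/W", "PULL_MODEL J/W", "PULL_MODEL 1/W"]),
    ("LINEAR_SAMPLE_ENA", ["LINEAR_SAMPLE I", "LINEAR_SAMPLE J"]),
    ("LINEAR_CENTER_ENA", ["LINEAR_CENTER I", "LINEAR_CENTER J"]),
    ("LINEAR_CENTROID_ENA", ["LINEAR_CENTROID I", "LINEAR_CENTROID J"]),
    ("LINE_STIPPLE_TEX_ENA", ["LINE_STIPPLE_TEX"]),
    ("POS_X_FLOAT_ENA", ["POS_X_FLOAT"]),
    ("POS_Y_FLOAT_ENA", ["POS_Y_FLOAT"]),
    ("POS_Z_FLOAT_ENA", ["POS_Z_FLOAT"]),
    ("POS_W_FLOAT_ENA", ["POS_W_FLOAT"]),
    ("FRONT_FACE_ENA", ["FRONT_FACE"]),
    ("ANCILLARY_ENA", ["ANCILLARY"]),
    ("SAMPLE_COVERAGE_ENA", ["SAMPLE_COVERAGE"]),
    ("POS_FIXED_PT_ENA", ["POS_FIXED_PT"]),
]


def ps_vgpr_layout(ena: set, addr: set) -> dict:
    """Return {term: VGPR index} for every term that is actually loaded."""
    missing = ena - addr
    if missing:
        # An ENA bit requires the matching ADDR bit to be set as well.
        raise ValueError(f"ENA set without ADDR: {sorted(missing)}")
    layout = {}
    next_vgpr = 0
    for field, terms in PS_INPUT_FIELDS:
        if field not in addr:
            continue  # no VGPR slot reserved for this field
        for term in terms:
            if field in ena:
                layout[term] = next_vgpr  # loaded at this slot
            next_vgpr += 1  # slot is consumed even when only ADDR is set
    return layout


ALL_FIELDS = {name for name, _ in PS_INPUT_FIELDS}
base = {"PERSP_SAMPLE_ENA", "PERSP_CENTER_ENA", "POS_X_FLOAT_ENA", "POS_Y_FLOAT_ENA"}
packed = ps_vgpr_layout(base, base)
padded = ps_vgpr_layout(base, base | {
    "PERSP_CENTROID_ENA", "PERSP_PULL_MODEL_ENA",
    "LINEAR_CENTROID_ENA", "LINE_STIPPLE_TEX_ENA",
})
```

With ADDR equal to ENA the position terms pack down to VGPR4 and VGPR5; with the extra ADDR bits they stay at VGPR12 and VGPR13, matching the two example tables.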

3.5.5. LDS Initialization
Only pixel shader (PS) waves have LDS pre-initialized with data before the wave launches. For PS waves, LDS is
preloaded with vertex parameter data that can be interpolated using barycentrics (I and J) to compute per-pixel
parameters.

Chapter 4. Shader Instruction Set
This chapter describes the shader instruction set. Instructions are divided into the following groups:
• Program Flow
• Scalar ALU
• Scalar memory read from constant cache
• Vector ALU & Parameter-Interpolate
• Vector Memory read/write :
◦ buffers
◦ Flat, Global and Scratch
◦ LDS
• GDS
• Misc: wait on counter, barrier, send message
Instructions are encoded in various microcode formats. The formats are defined by a set of "encoding" bits (in
red) that define the family of instructions and the meaning of the rest of the bits in the instruction. Not every
instruction uses every field in its encoding. Fields which can specify an SGPR as a source or dest are typically
set to NULL when unused; other fields are typically set to zero.

4.1. Common Instruction Fields
"inline constant" - a constant specified in place of a source argument (operand codes 128-248), e.g. 1.0, -0.5, 32.
Float constants work with single, double and 16bit float instructions, and when used in non-float
instructions, the data is not converted (remains a float).
Float constants are encoded according to the size of the source operand. For 16-bit operations (both
packed and non-packed), a float constant is treated as zero-extended 32-bit data, i.e. with the 16-bit
floating point in the low bits and zeros in the high bits.
Integer constants used with 32-bit or smaller operands are treated as 32-bit signed integers. Integer
constants are sign-extended for 64-bit sources.
"literal constant" - a 32-bit constant in the instruction stream immediately after a 32- or 64-bit instruction.
When used in a 64-bit signed integer operation, it is sign-extended to 64 bits. For unsigned 64-bit integer
ops (and 64-bit binary ops) it is zero extended. When used in a double-float operation, the 32-bit literal is
the most-significant bits, and the LSBs are zero. Other operations (32 bits or less, or packed math) treat it
as 32-bit data.
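The three ways a 32-bit literal is widened for 64-bit operations can be sketched as follows. This is an illustrative Python model of the rules stated above; the function names and the "kind" strings are assumptions made for this example only.

```python
import struct

MASK32 = 0xFFFFFFFF


def extend_literal(literal: int, kind: str) -> int:
    """Widen a 32-bit literal to the 64-bit pattern seen by the operation.

    kind: "i64" (signed integer), "u64" or "b64" (unsigned / binary),
          "f64" (double-precision float).
    """
    literal &= MASK32
    if kind == "i64":
        # Sign-extend: replicate bit 31 into the upper 32 bits.
        if literal & 0x80000000:
            return literal | 0xFFFFFFFF00000000
        return literal
    if kind in ("u64", "b64"):
        return literal  # zero-extended
    if kind == "f64":
        return literal << 32  # literal supplies the MSBs, LSBs are zero
    raise ValueError(f"unknown operand kind: {kind}")


def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

For example, the literal 0x3FF00000 used in a double-float operation yields the bit pattern 0x3FF0000000000000, which is 1.0.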

Operand field widths: Scalar Dest (7 bits) uses codes 0-127; Scalar Source (8 bits) uses codes 0-255; Vector Source (when 9 bits) uses codes 0-511; Vector Src/Dst (8 bits) addresses VGPR 0 .. 255.
Code | Name | Meaning
0-105 | SGPR 0 .. 105 | SGPRs. One DWORD each.
106 | VCC_LO | VCC[31:0]
107 | VCC_HI | VCC[63:32]
108-123 | ttmp0 .. ttmp15 | Trap handler temporary SGPRs (privileged)
124 | NULL | Reads return zero, writes are ignored. When used as a destination, nullifies the instruction.
125 | M0 | Temporary register, use for a variety of functions
126 | EXEC_LO | EXEC[31:0]
127 | EXEC_HI | EXEC[63:32]
128 | 0 | Integer inline constant: zero
129-192 | int 1 .. 64 | Integer inline constants
193-208 | int -1 .. -16 | Integer inline constants
209-232 | Reserved | Reserved
233 | DPP8 | 8-lane DPP (only valid as SRC0)
234 | DPP8FI | 8-lane DPP with Fetch-Invalid (only valid as SRC0)
235 | SHARED_BASE | Memory Aperture Definition
236 | SHARED_LIMIT | Memory Aperture Definition
237 | PRIVATE_BASE | Memory Aperture Definition
238 | PRIVATE_LIMIT | Memory Aperture Definition
239 | Reserved | Reserved
240 | 0.5 | Float inline constant (see note below)
241 | -0.5 | Float inline constant
242 | 1.0 | Float inline constant
243 | -1.0 | Float inline constant
244 | 2.0 | Float inline constant
245 | -2.0 | Float inline constant
246 | 4.0 | Float inline constant
247 | -4.0 | Float inline constant
248 | 1.0 / (2 * PI) | Float inline constant
249 | Reserved | Reserved
250 | DPP16 | data parallel primitive
251 | Reserved | Reserved
252 | Reserved | Reserved
253 | SCC | { 31’b0, SCC }
254 | Reserved | Reserved
255 | Literal constant | 32 bit constant from instruction stream
256 - 511 | VGPR 0 .. 255 | Vector GPRs. One DWORD each.
Float inline constants (240-248) can be used in 16, 32 and 64 bit floating point math. They may be used with non-float instructions but the value remains a float.

1/(2*PI) is 0.15915494. The hex values are:
half: 0x3118
single: 0x3e22f983
double: 0x3fc45f306dc9c882
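The operand code space above lends itself to a simple range-based decoder. The following Python sketch maps a 9-bit source code to a readable name; the "s<N>" and "v<N>" spellings and the function name are conventions chosen for this example, not something this section defines.

```python
NAMED_CODES = {
    106: "VCC_LO", 107: "VCC_HI", 124: "NULL", 125: "M0",
    126: "EXEC_LO", 127: "EXEC_HI", 233: "DPP8", 234: "DPP8FI",
    235: "SHARED_BASE", 236: "SHARED_LIMIT",
    237: "PRIVATE_BASE", 238: "PRIVATE_LIMIT",
    250: "DPP16", 253: "SCC", 255: "LITERAL",
}

FLOAT_CODES = {
    240: "0.5", 241: "-0.5", 242: "1.0", 243: "-1.0", 244: "2.0",
    245: "-2.0", 246: "4.0", 247: "-4.0", 248: "1.0/(2*PI)",
}


def decode_src_operand(code: int) -> str:
    """Translate a 9-bit source operand code into a descriptive string."""
    if not 0 <= code <= 511:
        raise ValueError("operand code must fit in 9 bits")
    if code <= 105:
        return f"s{code}"              # SGPR 0..105
    if code in NAMED_CODES:
        return NAMED_CODES[code]
    if 108 <= code <= 123:
        return f"ttmp{code - 108}"     # trap temporaries
    if code == 128:
        return "0"                     # inline constant zero
    if 129 <= code <= 192:
        return str(code - 128)         # integers 1..64
    if 193 <= code <= 208:
        return str(192 - code)         # integers -1..-16
    if code in FLOAT_CODES:
        return FLOAT_CODES[code]
    if code >= 256:
        return f"v{code - 256}"        # VGPR 0..255
    return "reserved"
```

An 8-bit scalar source field can reuse the same function, since it covers the lower 256 codes of the same table.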

4.1.1. Cache Controls: SLC, GLC and DLC
Scalar and vector memory instructions contain bits that control cache behavior. The SLC, GLC and DLC
instruction bits influence cache behavior for loads, stores, and atomics.
GLC

controls the graphics first-level cache

SLC

controls the graphics L2 cache

DLC

controls the Memory-Attached Last-Level cache (MALL) if it is present (ignored otherwise)

Typically loads use GLC=0 (except for load-acquire). GLC=1 forces a miss in the first level cache and reads data
from the L2 cache. If there was a line in the GPU L0 that matched, it is invalidated; L2 is reread.
Shader LOAD ops (load, sample, gather, etc…)
Column groups: SRD (llc_noalloc) | ISA (DLC, SLC, GLC) | Resulting Policy in Cache (MALL (NOA), GL2, GL1, Tex(L0)) | SCOPE | Non-Temporal Hint (MALL, GL2, GL1, Tex(L0))
llc_noalloc | DLC | SLC | GLC | MALL (NOA) | GL2 | GL1 | Tex(L0) | SCOPE | Hint MALL | Hint GL2 | Hint GL1 | Hint Tex(L0)
0 or 1 | 0 | 0 | 0 | 0 | LRU | HIT_LRU | HIT_LRU | CU | no | no | no | no
0 or 1 | 0 | 0 | 1 | 0 | LRU | MISS_EVICT | MISS_EVICT | DEVICE | no | no | _NA_ | _NA_
0 or 1 | 0 | 1 | 0 | 0 | STREAM | HIT_EVICT | HIT_LRU | CU | no | yes | yes | no
0 or 1 | 0 | 1 | 1 | 0 | STREAM | MISS_EVICT | MISS_EVICT | DEVICE | no | yes | _NA_ | _NA_
0 or 1 | 1 | 0 | 0 | 1 | LRU | HIT_LRU | HIT_LRU | CU | yes | no | no | no
0 or 1 | 1 | 0 | 1 | 1 | LRU | MISS_EVICT | MISS_EVICT | DEVICE | yes | no | _NA_ | _NA_
0 or 1 | 1 | 1 | 0 | 1 | STREAM | HIT_EVICT | HIT_LRU | CU | yes | yes | yes | no
0 or 1 | 1 | 1 | 1 | 1 | STREAM | MISS_EVICT | MISS_EVICT | DEVICE | yes | yes | _NA_ | _NA_
2 or 3 | 0 | 0 | 0 | 1 | LRU | HIT_LRU | HIT_LRU | CU | no | no | no | no
2 or 3 | 0 | 0 | 1 | 1 | LRU | MISS_EVICT | MISS_EVICT | DEVICE | no | no | _NA_ | _NA_
2 or 3 | 0 | 1 | 0 | 1 | STREAM | HIT_EVICT | HIT_LRU | CU | no | yes | yes | no
2 or 3 | 0 | 1 | 1 | 1 | STREAM | MISS_EVICT | MISS_EVICT | DEVICE | no | yes | _NA_ | _NA_
2 or 3 | 1 | 0 | 0 | 1 | LRU | HIT_LRU | HIT_LRU | CU | yes | no | no | no
2 or 3 | 1 | 0 | 1 | 1 | LRU | MISS_EVICT | MISS_EVICT | DEVICE | yes | no | _NA_ | _NA_
2 or 3 | 1 | 1 | 0 | 1 | STREAM | HIT_EVICT | HIT_LRU | CU | yes | yes | yes | no
2 or 3 | 1 | 1 | 1 | 1 | STREAM | MISS_EVICT | MISS_EVICT | DEVICE | yes | yes | _NA_ | _NA_

• For S_BUFFER_LOAD instructions, LLC_NOALLOC comes from V#.LLC_noalloc.
For S_LOAD, LLC_NOALLOC is zero.
• SMEM operations have SLC set to zero.
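Shader STORE / ATOMIC ops (all are device scope)
Column groups: SRD (llc_noalloc) | ISA (DLC, SLC) | Policy in Cache (MALL (NOA), GL2) | Non-Temporal Hint (MALL, GL2)
llc_noalloc | DLC | SLC | MALL (NOA) | GL2 | Hint MALL | Hint GL2
0 or 2 | 0 | 0 | 0 | LRU | no | no
0 or 2 | 0 | 1 | 0 | STREAM | no | yes
0 or 2 | 1 | 0 | 1 | LRU | yes | no
0 or 2 | 1 | 1 | 1 | STREAM | yes | yes
1 or 3 | 0 | 0 | 1 | LRU | no | no
1 or 3 | 0 | 1 | 1 | STREAM | no | yes
1 or 3 | 1 | 0 | 1 | LRU | no | no
1 or 3 | 1 | 1 | 1 | STREAM | no | yes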

"Temporal Hint" = expect data to have temporal reuse.
"SRD" = Shader Resource Descriptor
• ISA.GLC ⇒ this is a scope bit for load operations (including sample, gather, etc…)
◦ 0 : CU (work-group) scope
◦ 1 : DEVICE scope

◦ All stores/atomic ops are device scope (GLC has non-perf related functionality)
• ISA.SLC ⇒ Temporal Hint for graphic client caches
◦ 0 : Regular
◦ 1 : Stream (non-temporal)
• ISA.DLC ⇒ Temporal Hint for Infinity Cache
◦ 0 : Regular
◦ 1 : Non-temporal
GLC is used by atomics to indicate:
• 0: return nothing
• 1: return pre-operation value from memory to VGPR
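The load-policy table above follows a regular pattern that can be expressed as a few rules. The Python sketch below is derived purely from the rows of that table (it is a summary of the table, not a statement of how the hardware computes the policy), and the function name and dictionary keys are illustrative.

```python
def load_cache_policy(llc_noalloc: int, dlc: int, slc: int, glc: int) -> dict:
    """Resulting cache policy for a shader LOAD, as tabulated above."""
    if llc_noalloc not in (0, 1, 2, 3):
        raise ValueError("llc_noalloc is a 2-bit field")
    # MALL no-allocate: forced on for llc_noalloc 2 or 3, else follows DLC.
    noa = 1 if llc_noalloc >= 2 else int(bool(dlc))
    gl2 = "STREAM" if slc else "LRU"
    if glc:
        # DEVICE scope: first-level caches are forced to miss.
        gl1 = tex = "MISS_EVICT"
        scope = "DEVICE"
    else:
        # CU scope: hits are allowed; SLC makes GL1 evict after the hit.
        gl1 = "HIT_EVICT" if slc else "HIT_LRU"
        tex = "HIT_LRU"
        scope = "CU"
    return {"mall_noa": noa, "gl2": gl2, "gl1": gl1, "tex_l0": tex, "scope": scope}
```

For instance, a load with SLC=1 and GLC=0 streams through GL2, hits and evicts in GL1, and keeps normal LRU behavior in the texture L0.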

Chapter 5. Program Flow Control
Program flow control is programmed using scalar ALU instructions. This includes loops, branches, subroutine
calls, and traps. The program uses SGPRs to store branch conditions and loop counters. Constants can be
fetched from the scalar constant cache directly into SGPRs.

5.1. Program Control
The instructions in the table below control the priority and termination of a shader program, as well as provide
support for trap handlers.
Table 12. Wave Termination and Traps
Instructions

Description

S_ENDPGM

Terminates the wave. It can appear anywhere in the shader program and can appear
multiple times.

S_ENDPGM_SAVED

Terminates the wave due to context save. Intended for use only within the trap handler.

S_TRAP

Jump to the trap handler and pass in 8-bit TRAP id from SIMM[7:0].
It does not affect SCCZ.

<wait for outstanding instructions to finish>
{TTMP1,TTMP0} = {7'h0,HT[0],trapID[7:0],PC[47:0]}
PC = TBA (trap base address)
PRIV = 1

"HT" : 1 = this is a host-initiated trap, 0 = user (s_trap). Host traps cause the shader
hardware to generate an S_TRAP instruction. Note: the save-PC points to the S_TRAP
instruction. TRAPID 0 is reserved for hardware use.
S_RFE_B64

Return from exception (trap handler) and continue.
Start executing at PC (trap handler must increment PC past the faulting instruction).
MOVE PC, <src> ; STATUS.PRIV = 0.
This instruction may only be used within a trap handler.

S_SETKILL

Set the KILL bit to 1, causing the shader to s_endpgm immediately. Used primarily for
debugging 'kill' wave-command behavior.

S_SETHALT

Set the HALT bit to the value of SIMM16[0].
Setting to 1 halts the shader when PRIV=0 (not in trap handler);
setting to 0 resumes the shader (can only occur in trap handler).
Fatal Halt control: SIMM16[2] 1 : set fatal halt; 0 : clear fatal halt.
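The S_TRAP description packs the host-trap flag, trap ID and saved PC into the TTMP1:TTMP0 pair. The following Python sketch shows that packing and its inverse, assuming the concatenation {7'h0, HT[0], trapID[7:0], PC[47:0]} is written most-significant field first; the function names are hypothetical helpers for a trap-handler debugging tool, not ISA features.

```python
def pack_trap_state(pc: int, trap_id: int, host_trap: bool) -> tuple:
    """Return (TTMP1, TTMP0) as written on trap entry."""
    packed = (
        (int(host_trap) << 56)        # HT: bit 56
        | ((trap_id & 0xFF) << 48)    # trapID: bits 55:48
        | (pc & 0xFFFFFFFFFFFF)       # PC: bits 47:0
    )
    return (packed >> 32) & 0xFFFFFFFF, packed & 0xFFFFFFFF


def unpack_trap_state(ttmp1: int, ttmp0: int) -> dict:
    """Recover PC, trap ID and the host-trap flag from TTMP1:TTMP0."""
    packed = ((ttmp1 & 0xFFFFFFFF) << 32) | (ttmp0 & 0xFFFFFFFF)
    return {
        "pc": packed & 0xFFFFFFFFFFFF,
        "trap_id": (packed >> 48) & 0xFF,
        "host_trap": bool((packed >> 56) & 0x1),
    }


ttmp1, ttmp0 = pack_trap_state(pc=0x123456789ABC, trap_id=0x2A, host_trap=False)
```

Here TTMP0 holds PC[31:0], and TTMP1 holds PC[47:32] in its low half with the trap ID and HT bit above it.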

Table 13. Dependency, Delay and Scheduling Instructions
Instructions

Description

S_NOP

NOP. Repeat SIMM16[3:0] times. (1..16)
Like a short version of S_SLEEP

S_SLEEP

Cause a wave to sleep for approx. 64*SIMM16[6:0] clocks.
"s_sleep 0" sleeps the wave for 0 cycles.

S_WAKEUP

Causes one wave in a work-group to signal all other waves in the same work-group to wake
up from S_SLEEP early. If waves are not sleeping, they are not affected by this instruction.

S_SETPRIO

Set 2-bits of USER_PRIO: user-settable wave priority. 0 = low, 3 = high.
Overall wave priority is: {MIN(3,(SysPrio[1:0] + UserPrio[1:0])), WaveAge[3:0]}

S_CLAUSE

Begin a clause consisting of instructions matching the instruction after the s_clause. The
clause length is: (SIMM16[5:0] + 1), and clauses must be between 2 and 63 instructions.
SIMM16[5:0] must be 1-62, not 0 or 63. The clause breaks after every N instructions, N =
SIMM16[11:8] (0 - 15; 0 = no breaks)

S_BARRIER

Synchronize waves within a work-group. If not all waves in group have been created yet,
waits for entire group before proceeding. Waves that have ended do not prevent barriers
from being satisfied. Waves not in a work-group (or work-group size = 1 wave), treat this as
S_NOP.
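The S_CLAUSE immediate can be assembled from the clause length and break interval described above. This is a minimal Python sketch; the function name and argument names are chosen for illustration, and bits of SIMM16 not mentioned in the description are left at zero as an assumption.

```python
def encode_s_clause(length: int, break_every: int = 0) -> int:
    """Build the SIMM16 value for S_CLAUSE.

    length:      number of instructions in the clause (2..63)
    break_every: break after every N instructions (0 = no breaks, max 15)
    """
    if not 2 <= length <= 63:
        raise ValueError("clause length must be between 2 and 63")
    if not 0 <= break_every <= 15:
        raise ValueError("break interval must be between 0 and 15")
    # SIMM16[5:0] holds length - 1; SIMM16[11:8] holds the break interval.
    return (break_every << 8) | (length - 1)
```

A clause of 16 instructions that breaks after every 4th instruction encodes as 0x040F.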

Table 14. Control Instructions
Instructions

Description

S_VERSION

Does nothing (treated as S_NOP), but can be used as a code comment to indicate the
hardware version the shader is compiled for (using the SIMM16 field).

S_CODE_END

Treated as an illegal instruction. Used to pad past the end of shaders.

S_SENDMSG

Send a message upstream to the Interrupt handler or dedicated hardware. SIMM[9:0] is an
immediate value holding the message type. There is no "s_waitcnt" enforced before this.

S_SENDMSG_RTN_B32
S_SENDMSG_RTN_B64

Send a message upstream that requests that some data be returned to an SGPR (or an aligned
SGPR-pair for "_B64"). Uses LGKMcnt to track when data is returned.
SDST = SGPR to return to.
SSRC0 = an enum (not an SGPR) holding the code for what data is requested (see the message table
below).
If this is used to write VCC, then VCCZ is undefined.

S_SENDMSGHALT

S_SENDMSG and then HALT.

S_ICACHE_INV

Invalidate first-level shader instruction cache for the WGP associated with this wave.

5.2. Instruction Clauses
An instruction clause is a group of instructions of the same type that are to be executed in an uninterrupted
sequence. Normally hardware may interleave instructions from different waves, but a clause can be used to
override that behavior and force the hardware to service only one wave for a given instruction type for the
duration of the clause, even if that leaves the execution hardware idle.
Clauses are defined and started using the S_CLAUSE instruction, and must contain only a single type of
instruction. The clause-type is implicitly defined by the type of instruction immediately following the clause.
Clause Types are:
• Image (no sampler) load
• Image store
• Image atomic
• Image sample
• Buffer / Global / Scratch load
• Buffer / Global / Scratch store
• Buffer / Global / Scratch atomic
• Flat load
• Flat store
• Flat atomic
• LDS load / store / atomic / bvh_stack
• IMAGE_BVH

• SMEM
• VALU
May also be in a clause ("clause internal instructions"):
• S_DELAY_ALU is legal inside a clause (internal) but is pointless.
◦ S_DELAY_ALU must not occur within a VALU clause.
• S_NOP and S_SLEEP may be used inside a clause, but the first instruction of the clause must be the clause-type instruction (ALU, memory).
Cannot be in a clause:
• Instructions of a different type than those of the clause type are illegal
• S_CLAUSE
• S_ENDPGM
• SALU, Export, branch, message, GDS, lds_param_load, lds_direct_load
• S_WAITCNT, S_WAIT_IDLE, S_WAIT_DEPCTR
S_CLAUSE defines both the total length of the clause, and how often it should be broken to allow other waves a
chance to go. For instance, it could say: clause of 16 instructions, but break after every 4th to allow a higher
priority wave to get access to the execution unit. "clause internal instructions" count against this clause size.
If a clause defines regular clause breaks (e.g. a clause of 16 instructions, but break every 4th), the first
instruction of each sub-clause (every 4 instructions) must be of the clause-type, not a "clause internal
instruction". Each group of instructions must have at least two of the clause-type of instructions. E.g. a clause of
12 VALU instructions broken up into 4 groups of 3 instructions - each group of 3 instructions must have at least two
VALU instructions. Clause groups with only 1 VALU instruction per group make no sense - they are no longer a clause.
If the first instruction in a VALU clause has EXEC==0, then the clause is ignored and instructions are issued as
if there were no clause. If the VALU clause starts with EXEC!=0 but EXEC becomes zero in the middle of the
clause, the clause continues until the last instruction of the specified clause.
If an S_DELAY_ALU is needed before starting a clause, the order must be:
S_DELAY_ALU // must not come immediately after S_CLAUSE - that inst declares clause type
S_CLAUSE
<first instruction in clause>

If the first instruction after S_CLAUSE is skipped (e.g. due to EXEC==0, or VMEM-load skipped due to EXEC==0
and VMcnt==0) then a clause is not started. Subsequent instructions within what would have been the
clause that are not skipped are still executed, but individually, not as part of a clause.

5.2.1. Clause Breaks
The following conditions can break a clause:
1. VALU exception (trap) breaks a VALU clause.
2. Host commands to a wave (halt, resume, single step, etc.) break all active clauses.
Context-save breaks clauses of affected waves.
This allows the host to read and write SGPRs & VGPRs while debugging. If clauses were not broken by host
commands, the GPRs could not be read from waves other than the one currently in a clause.
If a wave halts or is killed, its clauses are ended.
3. Any action that causes a wave to jump to its trap handler breaks the clause (includes context-save).
A wave entering HALT (including for host-initiated single-step) may break clauses.

5.3. Send Message Types
S_SENDMSG is used to send messages to fixed function hardware, the host, or to request that a value be
returned to the wave. S_SENDMSG encodes the message type in the SIMM16 field and the message payload in
M0. S_SENDMSG_RTN encodes the message type in the SSRC0 field (does not read an SGPR), the payload (if
any) in M0, and the destination SGPR in SDST.
Completion is tracked with LGKMcnt.
The table below lists the messages that can be generated using the S_SENDMSG command.
S_SENDMSG_RTN_B* instructions return data to the shader: they increment LGKMcnt by 2, then decrement it by
1 when the message goes out, and by another 1 when the data returns. This allows the user to simply use
"s_waitcnt LGKMcnt==0" to wait for the data to be returned.
All message codes not listed are reserved (illegal).
Table 15. S_SENDMSG Messages
Message

SIMM16
[7:0]

Payload

Reserved

0x00

Reserved

Interrupt

0x01

Software-generated interrupt. M0[23:0] carries user data. IDs are also sent (wave_id,
cu_id, etc.)

HS TessFactor

0x02

Indicates HS tessellation factor is all zero or one for all patches in this HS work-group.
Data from M0[0]: 1 = "all are zero or one". This message is optional, but do not send
more than once or from any shader stage other than HS.

Dealloc VGPRs

0x03

Deallocate all VGPRs for this wave, allowing another wave to allocate these VGPRs
before this wave ends. Use only when the next instruction is S_ENDPGM. Typically used
when a shader is waiting for memory-write-acknowledgments before ending.

GS alloc req

0x09

Request GS space in parameter cache. M0[9:0] = number of vertices, M0[22:12] =
number of primitives. Response: a GS-alloc response to non-zero requests (broadcast to
work-group).
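The payload for the "GS alloc req" message is a packed M0 value. The Python sketch below builds it from the two counts using the bit ranges given in Table 15; the function name is hypothetical and the remaining M0 bits are assumed to be zero.

```python
def gs_alloc_req_m0(num_vertices: int, num_primitives: int) -> int:
    """Pack the M0 payload for the GS alloc req message (SIMM16 = 0x09)."""
    if not 0 <= num_vertices < (1 << 10):
        raise ValueError("vertex count must fit in M0[9:0]")
    if not 0 <= num_primitives < (1 << 11):
        raise ValueError("primitive count must fit in M0[22:12]")
    return (num_primitives << 12) | num_vertices


m0 = gs_alloc_req_m0(num_vertices=3, num_primitives=1)
```

The shader would move this value into M0 before issuing S_SENDMSG with message code 0x09.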

S_SENDMSG_RTN is used to send messages that return a value to the wave. The instruction specifies which
SGPR receives the data in SDST field. The message is encoded in SSRC0 (in the instruction field, not in an
SGPR).
Table 16. S_SENDMSG_RTN Messages
Message

SSRC0

Payload

Get Doorbell ID

0x80

Get the doorbell ID associated with this wave.
(Does not exist for ME0, where 0x0bad is returned. Also returns 0x0bad for an invalid pipeID or
queueID.)

Get Draw ID

0x81

Get the Draw or dispatch ID associated with this wave.

Get TMA

0x82

Get the Trap Memory Address: [31:0] or [63:0] depending on the request size.

Get REALTIME

0x83

Get the value of the constant frequency (REFCLK) time counter: [31:0] or [63:0]
depending on the request size.

Save wave

0x84

Used in context switching to indicate this wave is ready to be context saved.
Only the trap handler can send this message (user shaders have this converted to
MSG_ILLEGAL_RTN).

Get TBA

0x85

Gets the Trap Base Address [31:0] or [63:0] depending on request size

MSG_ILLEGAL_RTN

0xFF

Illegal message with data return to wave

5.4. Branching
Branching is done using one of the following scalar ALU instructions. "SIMM16" is a sign-extended 16 bit
integer constant, treated as a DWORD offset for branches.
Table 17. Branch Instructions
Instructions

Description

S_BRANCH

Unconditional branch. PC = PC + (SIMM16 * 4) + 4

S_CBRANCH_<test>

Conditional branch. Branch only if <cond> is true.
if (cond) PC = PC + (SIMM16 *4) +4; else NOP;
(If SIMM16=0, the branch goes to the next instruction.)
<cond> : SCC1, SCC0, VCCZ, VCCNZ, EXECZ, EXECNZ (SCC==1, SCC==0, VCC==0, VCC!=0,
EXEC==0, EXEC!=0)

S_CBRANCH_CDBGSYS

Conditional branch, taken if the COND_DBG_SYS status bit is set.
if (cond) PC = PC + (SIMM16 *4) +4; else NOP;
<cond> = SYS, USER, SYS_AND_USER, SYS_OR_USER.

S_CBRANCH_CDBGUSER

Conditional branch, taken if the COND_DBG_USER status bit is set.

S_CBRANCH_CDBGSYS_AND_USER

Conditional branch, taken only if both COND_DBG_SYS and COND_DBG_USER are set.

S_CBRANCH_CDBGSYS_OR_USER

Conditional branch, taken if either COND_DBG_SYS or COND_DBG_USER is set.

S_SETPC_B64

Directly set the PC from an SGPR pair: PC = SGPR-pair

S_SWAPPC_B64

Swap the current PC with an address in an SGPR pair. SWAP (PC+4, SGPR-pair).
(result is: PC of this instruction + 4, zero extended)

S_GETPC_B64

Retrieve the current PC value (does not cause a branch). (SGPR-pair = PC of this instruction
+ 4, zero extended)

S_CALL_B64

Jump to a subroutine, and save return address. SGPR_pair = PC+4; PC = PC+4+SIMM16*4.

For conditional branches, the branch condition can be determined by either scalar or vector operations. A
scalar compare operation sets the Scalar Condition Code (SCC) which then can be used as a conditional branch
condition. Vector compare operations set the VCC mask, and VCCZ or VCCNZ then can be used to determine
branching.
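The branch-target arithmetic in Table 17 can be checked with a small helper. This Python sketch sign-extends SIMM16 and applies PC = PC + (SIMM16 * 4) + 4; the function names are illustrative, and PC values are treated as plain byte addresses.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, simm16: int) -> int:
    """Target of S_BRANCH / a taken S_CBRANCH located at byte address pc."""
    return pc + sign_extend16(simm16) * 4 + 4


def branch_offset(pc: int, target: int) -> int:
    """SIMM16 field needed to branch from pc to target (both DWORD aligned)."""
    delta = target - pc - 4
    if delta % 4:
        raise ValueError("branch targets must be DWORD aligned")
    offset = delta // 4
    if not -0x8000 <= offset <= 0x7FFF:
        raise ValueError("target out of range for a 16-bit DWORD offset")
    return offset & 0xFFFF
```

Note that SIMM16 = 0xFFFF (that is, -1) branches back to the branch instruction itself, while SIMM16 = 0 falls through to the next instruction.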

5.5. Work-groups and Barriers
Work-groups are collections of waves running on the same work-group processor that can synchronize and
share data. Up to 1024 work-items (16 wave64’s or 32 wave32’s) can be combined into a work-group. When
multiple waves are in a work-group, the S_BARRIER instruction can be used to force each wave to wait until all
other waves reach the same instruction; then, all waves continue. Work-groups of a single wave treat all
barrier instructions as S_NOP.
If a wave executes an S_BARRIER before all of the waves of the work-group have been created, the wave waits
until the work-group is complete.
Any wave may terminate early using S_ENDPGM, and the barrier is considered satisfied when the remaining
live waves reach their barrier instruction.

5.6. Data Dependency Resolution
Shader hardware can resolve most data dependencies, but a few cases must be explicitly handled by the shader
program. In these cases, the program must insert S_WAITCNT instructions to ensure that previous operations
have completed before continuing.
The shader has four counters that track the progress of issued instructions. S_WAITCNT waits for the values of
these counters to be at, or below, specified values before continuing. These allow the shader writer to schedule
long-latency instructions, execute unrelated work, and specify when results of long-latency operations are
needed.
Inserting S_NOP is not required to achieve correct operation.
Table 18. Data Dependency Instructions
Instructions

Description

S_WAITCNT

Wait for count of outstanding instruction counters to be less-than or equal-to all of these
values before continuing.
SIMM16 = { VMcnt[5:0], LGKMcnt[5:0], 1’b0, EXPcnt[2:0] }

S_WAITCNT_VSCNT
S_WAITCNT_LGKMCNT
S_WAITCNT_EXPCNT
S_WAITCNT_VMCNT

Wait for VSCNT, VMCNT, EXPCNT or LGKMcnt to be less-than or equal-to the count in
SIMM16 before continuing.

S_WAIT_EVENT

Wait for an event to occur before proceeding
SIMM16[0] : 1=don’t wait, 0= wait for export-ready; other bits are reserved.
Any exception waits for this to complete before being processed, including: KILL, save-context, host trap, memviol and anything that causes a trap to be taken.

S_DELAY_ALU

Insert delay between dependent SALU/VALU instructions.
SIMM16[3:0] = InstID0
SIMM16[6:4] = InstSkip
SIMM16[10:7] = InstID1
This instruction describes dependencies for two instructions, directing the hardware to insert
delay if the dependent instruction was issued too recently to forward data to the second. For
details, see: S_DELAY_ALU.
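The S_WAITCNT immediate packs three counter thresholds into one 16-bit field. The Python sketch below encodes and decodes it, assuming the concatenation { VMcnt[5:0], LGKMcnt[5:0], 1'b0, EXPcnt[2:0] } is written most-significant field first. Defaulting each threshold to its field maximum, to mean "do not wait on this counter", is a convention adopted for this example.

```python
def encode_s_waitcnt(vmcnt: int = 63, lgkmcnt: int = 63, expcnt: int = 7) -> int:
    """Build SIMM16 for S_WAITCNT from the three counter thresholds."""
    if not 0 <= vmcnt <= 63:
        raise ValueError("VMcnt is a 6-bit field")
    if not 0 <= lgkmcnt <= 63:
        raise ValueError("LGKMcnt is a 6-bit field")
    if not 0 <= expcnt <= 7:
        raise ValueError("EXPcnt is a 3-bit field")
    # VMcnt -> bits 15:10, LGKMcnt -> bits 9:4, bit 3 = 0, EXPcnt -> bits 2:0
    return (vmcnt << 10) | (lgkmcnt << 4) | expcnt


def decode_s_waitcnt(simm16: int) -> dict:
    """Split an S_WAITCNT SIMM16 value back into its thresholds."""
    return {
        "vmcnt": (simm16 >> 10) & 0x3F,
        "lgkmcnt": (simm16 >> 4) & 0x3F,
        "expcnt": simm16 & 0x7,
    }
```

Under these assumptions, waiting only for VMcnt <= 1 would use encode_s_waitcnt(vmcnt=1), which leaves the other two thresholds at their maximum values.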

S_WAITCNT* waits for outstanding instructions that use the specified counter to complete. Instructions within
a type often return in the order they were issued compared to other instructions of that type, but typically
return out of order with respect to instructions of different types. These counters count instructions, not threads.
These are the memory instruction groups - each returns out of order with respect to the others:
• VMcnt:
◦ Texture SAMPLE
◦ Texture/Buffer/Global/Scratch/Flat Loads and atomic-with-return

• VScnt:
◦ Texture/Buffer/Global/Scratch/Flat Stores and atomic-without-return
• LGKMcnt:
◦ LDS indexed operations
◦ SMEM: scalar memory loads may return completely out-of-order with respect to other scalar memory
loads
◦ GDS & GWS
◦ FLAT instructions (uses both LGKMcnt and either VMcnt or VScnt)
◦ Messages
• EXPcnt:
◦ LDS parameter-load and direct-load
◦ Exports: stay in order within a type (MRT, Z, position, primitive data) but out of order between types
It is possible for data to be written to VGPRs out-of-order, but the counter-decrement still reflects in-order
completion. Stores from a wave are not kept in order with stores from that same wave when they write to
different addresses.
Simple S_WAITCNT Example
global_load_b32 V0, V[4:5], 0x0    // load memory[ {V5, V4} ] into V0
global_load_b32 V1, V[4:5], 0x8    // load memory[ {V5, V4} +8 ] into V1
s_waitcnt VMcnt <= 1               // wait for first global_load to have completed
v_mov_b32 V9, V0                   // move V0 into V9

5.7. ALU Instruction Software Scheduling
The shader program may include instructions to delay ALU instructions from being issued in order to attempt
to avoid pipeline stalls caused by issuing dependent instructions too closely together.
This is accomplished with the S_DELAY_ALU instruction: "insert delay with respect to a previous VALU
instruction". The compiler may insert S_DELAY_ALU instructions to indicate data dependencies that might
benefit from having extra idle cycles inserted between them.
This instruction is inserted before the instruction which the user wants to delay, and it specifies which
previous instructions this one is dependent on. The hardware then determines the number of cycles of delay to
add.
This instruction is optional - it is not necessary for correct operation. It should be inserted only when necessary
to avoid dependency stalls. If enough independent instructions are between dependent ones then no delay is
necessary. For wave64, the user may not know the status of the EXEC mask and hence not know if instructions
take 1 or 2 passes to issue.
The S_DELAY_ALU instruction says: wait for the VALU-Inst N ago to have completed. To reduce instruction
stream overhead, the S_DELAY_ALU instruction packs two delay values into one instruction, with a "skip"
indicator so the two delayed instructions don’t need to be back-to-back.
S_DELAY_ALU may be executed in zero cycles - it may be executed in parallel with the instruction before it.
This avoids extra delay if no delay is needed.


S_DELAY_ALU InstID1[4], Skip[3], InstID0[4] // packed into SIMM16

INSTID   counts backwards N VALU instructions that were issued. This means it does not count
         instructions which were branched over. VALU instructions skipped due to EXEC==0 do count
         (scoreboard immediately marked 'ready').
SKIP     counts the number of instructions skipped before the instruction which has the second
         dependency. Every instruction is counted for skipping - all types.

If another S_DELAY_ALU is encountered before the info from the previous one is consumed, the current
S_DELAY_ALU replaces any previous dependency info. This means if an instruction is dependent on two
separate previous instructions, both of those dependencies can be expressed in a single S_DELAY_ALU op, but
not in two separate S_DELAY_ALU ops.
S_DELAY_ALU may be applied to any type of opcode, including non-ALU opcodes, although it serves no purpose there.
S_DELAY_ALU should not be used within VALU clauses.
Table 19. S_DELAY_ALU Instruction Codes
DEP Code | DEP Code Meaning                           | SKIP Code | SKIP Code Meaning
0        | no dependency                              | 0         | Same op. Both DEP codes apply to the next instruction.
1-4      | dependent on previous VALU 1-4 back        | 1         | No skip. Dep0 applies to the following instruction, and Dep1 applies to the instruction after that one.
5-7      | dependent on previous trans. VALU 1-3 back | 2         | Skip 1. Dep0 applies to the following instruction. Dep1 applies to 2 instructions ahead (skip 1 instruction).
8        | Reserved                                   | 3-5       | Skip 2-4 instructions between Dep0 and Dep1.
9-11     | Wait 1-3 cycles for previous SALU ops      | 6         | Reserved

Codes 9-11: SALU ops typically complete in a single cycle, so waiting for 1 cycle is roughly equivalent to waiting
for 1 SALU op to execute before continuing.
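The SIMM16 packing can be sketched as follows. This is a minimal illustration, assuming the field order given above places InstID0 in bits [3:0], Skip in bits [6:4] and InstID1 in bits [10:7]; the function name and the treatment of unlisted codes as errors are assumptions, not part of the specification.

```python
def encode_s_delay_alu(instid0, skip=0, instid1=0):
    """Pack the S_DELAY_ALU fields into a SIMM16 value.

    Assumed layout (from the field order InstID1[4], Skip[3], InstID0[4]):
      bits [3:0]  = InstID0, bits [6:4] = Skip, bits [10:7] = InstID1.
    """
    for name, dep in (("instid0", instid0), ("instid1", instid1)):
        # DEP codes 0-7 and 9-11 are defined in Table 19; 8 is reserved.
        if not 0 <= dep <= 11 or dep == 8:
            raise ValueError(f"{name}: DEP code {dep} is reserved or out of range")
    # SKIP codes 0-5 are defined in Table 19; 6 is reserved.
    if not 0 <= skip <= 5:
        raise ValueError(f"SKIP code {skip} is reserved or out of range")
    return (instid1 << 7) | (skip << 4) | instid0


# The next instruction depends on the VALU op 1 back (DEP code 1); the
# instruction after it (SKIP code 1) depends on the VALU op 2 back (DEP code 2).
print(hex(encode_s_delay_alu(instid0=1, skip=1, instid1=2)))
```

Because a later S_DELAY_ALU replaces unconsumed dependency information, both dependencies of one instruction would be expressed in a single call with skip=0 rather than in two separate instructions.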


Chapter 6. Scalar ALU Operations
Scalar ALU (SALU) instructions operate on values that are common to all work-items in the wave. These
operations consist of 32-bit integer or float arithmetic, and 32- or 64-bit bit-wise operations. The SALU also can
perform operations directly on the Program Counter, allowing the program to create a call stack in SGPRs.
Many operations also set the Scalar Condition Code bit (SCC) to indicate the result of a comparison, a carry-out,
or whether the instruction result was zero.

6.1. SALU Instruction Formats
SALU instructions are encoded in one of five microcode formats, shown below:

Name | Size   | Function
SOP1 | 32 bit | SALU op with 1 input
SOP2 | 32 bit | SALU op with 2 inputs
SOPK | 32 bit | SALU op with 1 constant signed 16-bit integer input
SOPC | 32 bit | SALU compare op
SOPP | 32 bit | SALU program control op

Each of these instruction formats uses some of these fields:
Field  | Description
OP     | Opcode: instruction to be executed.
SDST   | Destination SGPR, M0, NULL or EXEC.
SSRC0  | First source operand.
SSRC1  | Second source operand.
SIMM16 | Signed immediate 16-bit integer constant.

The lists of similar instructions sometimes use a condensed form using curly braces { } to express a list of
possible names. For example, S_AND_{B32, B64} defines two legal instructions: S_AND_B32 and S_AND_B64.
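The condensed notation can be expanded mechanically. The helper below is a minimal sketch written for this guide's notation; the function name is hypothetical and nested braces are not handled.

```python
import itertools
import re


def expand_names(pattern):
    """Expand a condensed name such as 'S_AND_{B32, B64}' into full names."""
    # re.split with a capturing group alternates literal text and group bodies.
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])  # literal text between groups
        else:
            choices.append([alt.strip() for alt in part.split(",")])
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_names("S_AND_{B32, B64}"))
```

A pattern with two groups, such as {S_AND,S_OR,S_XOR}_{B32,B64}, expands to every combination of the listed alternatives.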

6.2. Scalar ALU Operands
Valid operands of SALU instructions are:


• SGPRs, including trap temporary SGPRs
• Mode register
• Status register (read-only)
• M0 register
• EXEC mask
• VCC mask
• SCC
• Inline constants: integers from -16 to 64, and select floating point values
• Hardware registers (at most 1 of: EXEC, M0, SCC)
• One 32-bit literal constant
• If the destination is NULL, the instruction does not execute: nothing is written and SCC is not modified
In the table below, 0-127 can be used as scalar sources or destinations; 128-255 can only be used as sources.
Table 20. Scalar Operands

Code    | Name            | Meaning
0-105   | SGPR 0 .. 105   | SGPRs. One DWORD each.
106     | VCC_LO          | VCC[31:0]
107     | VCC_HI          | VCC[63:32]
108-123 | ttmp0 .. ttmp15 | Trap handler temporary SGPRs (privileged)
124     | NULL            | Reads return zero, writes are ignored. When used as a destination, nullifies the instruction.
125     | M0              | Temporary register, use for a variety of functions
126     | EXEC_LO         | EXEC[31:0]
127     | EXEC_HI         | EXEC[63:32]
128     | 0               | Integer inline constant: zero
129-192 | int 1 .. 64     | Integer inline constants
193-208 | int -1 .. -16   | Integer inline constants
209-232 | Reserved        | Reserved
233     | DPP8            | 8-lane DPP (only valid as SRC0)
234     | DPP8FI          | 8-lane DPP with Fetch-Invalid (only valid as SRC0)
235     | SHARED_BASE     | Memory Aperture Definition
236     | SHARED_LIMIT    | Memory Aperture Definition
237     | PRIVATE_BASE    | Memory Aperture Definition
238     | PRIVATE_LIMIT   | Memory Aperture Definition
239     | Reserved        | Reserved
240     | 0.5             | Float inline constants (codes 240-248). Can be used in 16, 32 and 64 bit floating point math. They may be used with non-float instructions but the value remains a float.
241     | -0.5            |
242     | 1.0             |
243     | -1.0            |
244     | 2.0             |
245     | -2.0            |
246     | 4.0             |
247     | -4.0            |
248     | 1.0 / (2 * PI)  |
249     | Reserved        | Reserved
250     | DPP16           | data parallel primitive
251     | Reserved        | Reserved
252     | Reserved        | Reserved
253     | SCC             | { 31’b0, SCC }
254     | Reserved        | Reserved
255     | Literal constant | 32 bit constant from instruction stream

1/(2*PI) is 0.15915494. The hex values are:
half: 0x3118
single: 0x3e22f983
double: 0x3fc45f306dc9c882
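The operand codes in Table 20 can be turned into a small decoder. This is a minimal sketch for illustration: the function names and the lowercase register spellings (s0, ttmp0, vcc_lo) are display conventions assumed here, not defined by this table.

```python
NAMED_CODES = {
    106: "vcc_lo", 107: "vcc_hi", 124: "null", 125: "m0",
    126: "exec_lo", 127: "exec_hi",
    233: "DPP8", 234: "DPP8FI",
    235: "SHARED_BASE", 236: "SHARED_LIMIT",
    237: "PRIVATE_BASE", 238: "PRIVATE_LIMIT",
    240: "0.5", 241: "-0.5", 242: "1.0", 243: "-1.0",
    244: "2.0", 245: "-2.0", 246: "4.0", 247: "-4.0",
    248: "1/(2*PI)", 250: "DPP16", 253: "SCC", 255: "literal",
}


def decode_scalar_source(code):
    """Map an 8-bit scalar source code to a readable operand name."""
    if not 0 <= code <= 255:
        raise ValueError("scalar source codes are 8 bits")
    if code <= 105:
        return f"s{code}"            # SGPR 0 .. 105
    if 108 <= code <= 123:
        return f"ttmp{code - 108}"   # trap temporaries
    if 128 <= code <= 192:
        return str(code - 128)       # inline integers 0 .. 64
    if 193 <= code <= 208:
        return str(192 - code)       # inline integers -1 .. -16
    return NAMED_CODES.get(code, "reserved")


def is_valid_scalar_dest(code):
    """Only codes 0-127 may be used as a scalar destination."""
    return 0 <= code <= 127


print(decode_scalar_source(193), is_valid_scalar_dest(193))
```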

SALU destinations are in the range 0-127.
SALU instructions can use a 32-bit literal constant. This constant is part of the instruction stream and is
available to all SALU microcode formats except SOPP and SOPK; the one exception is S_SETREG_IMM32_B32, a
SOPK instruction that does take a literal. Literal constants are used by setting the source instruction field
to "literal" (255), and then the following instruction DWORD is used as the source value.
If the destination SGPR is out-of-range, no SGPR is written with the result and SCC is not updated.
If an instruction uses 64-bit data in SGPRs, the SGPR pair must be aligned to an even boundary. For example, it
is legal to use SGPRs 2 and 3 or 8 and 9 (but not 11 and 12) to represent 64-bit data.


6.3. Scalar Condition Code (SCC)
The scalar condition code (SCC) is written as a result of executing most SALU instructions. In integer
arithmetic it also serves as the carry/borrow input for extended-precision operations.
The SCC is set by many instructions:
• Compare operations: 1 = true.
• Arithmetic operations: 1 = carry out.
◦ SCC = overflow for signed add and subtract operations. For add ops, overflow = both operands are of
the same sign, and the MSB (sign bit) of the result is different than the sign of the operands. For
subtract (A - B), overflow = A and B have opposite signs and the resulting sign is not the same as the
sign of A.
• Bit/logical operations: 1 = result was not zero.
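The carry-out and overflow rules above can be checked with a small model. This is a sketch of the arithmetic only, with hypothetical helper names; each function returns the 32-bit result together with the SCC value.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def s_add_u32(s0, s1):
    total = (s0 & MASK32) + (s1 & MASK32)
    return total & MASK32, int(total > MASK32)       # SCC = carry out


def s_add_i32(s0, s1):
    total = to_signed32(s0) + to_signed32(s1)
    overflow = not (-(1 << 31) <= total < (1 << 31))
    return total & MASK32, int(overflow)             # SCC = signed overflow


def s_sub_i32(s0, s1):
    diff = to_signed32(s0) - to_signed32(s1)
    overflow = not (-(1 << 31) <= diff < (1 << 31))
    return diff & MASK32, int(overflow)              # SCC = signed overflow


# Same bit patterns, different SCC meaning: 0xFFFFFFFF + 1 carries out as
# an unsigned add but does not overflow as a signed add (-1 + 1 = 0).
print(s_add_u32(0xFFFFFFFF, 1), s_add_i32(0xFFFFFFFF, 1))
```

Testing whether the true result lies outside the signed 32-bit range is equivalent to the sign-based rule stated above for add and subtract.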

6.4. Integer Arithmetic Instructions
This section describes the arithmetic operations supplied by the SALU. The table below shows the scalar
integer arithmetic instructions:
Table 21. Integer Arithmetic Instructions
Instruction            | Encoding | Sets SCC? | Operation
S_ADD_I32              | SOP2     | Ovfl      | D = S0 + S1, SCC = overflow.
S_ADD_U32              | SOP2     | Cout      | D = S0 + S1, SCC = carry out.
S_ADDC_U32             | SOP2     | Cout      | D = S0 + S1 + SCC, SCC = carry out.
S_SUB_I32              | SOP2     | Ovfl      | D = S0 - S1, SCC = overflow.
S_SUB_U32              | SOP2     | Cout      | D = S0 - S1, SCC = carry out.
S_SUBB_U32             | SOP2     | Cout      | D = S0 - S1 - SCC, SCC = carry out.
S_ADD_LSH{1,2,3,4}_U32 | SOP2     | D!=0      | D = S0 + (S1 << {1,2,3,4})
S_ABSDIFF_I32          | SOP2     | D!=0      | D = abs (S0 - S1), SCC = result not zero.
S_MIN_I32, S_MIN_U32   | SOP2     | S0<S1     | D = (S0 < S1) ? S0 : S1, SCC = (S0 < S1)
S_MAX_I32, S_MAX_U32   | SOP2     | S0>S1     | D = (S0 > S1) ? S0 : S1, SCC = (S0 > S1)
S_MUL_I32              | SOP2     | No        | D = S0 * S1, low 32 bits of result. Works identically for unsigned data.
S_ADDK_I32             | SOPK     | Ovfl      | D = D + simm16, SCC = overflow. Uses the sign-extended version of simm16.
S_MULK_I32             | SOPK     | No        | D = D * simm16. Returns the low 32 bits. Uses the sign-extended version of simm16.
S_ABS_I32              | SOP1     | D!=0      | D.i = abs (S0.i). SCC = result not zero.
S_SEXT_I32_I8          | SOP1     | No        | D = { 24{S0[7]}, S0[7:0] }.
S_SEXT_I32_I16         | SOP1     | No        | D = { 16{S0[15]}, S0[15:0] }.
S_MUL_HI_I32           | SOP2     | No        | D = S0 * S1, high 32 bits of result
S_MUL_HI_U32           | SOP2     | No        | D = S0 * S1, high 32 bits of result
S_PACK_LL_B32_B16      | SOP2     | No        | D = { S1[15:0], S0[15:0] }
S_PACK_LH_B32_B16      | SOP2     | No        | D = { S1[31:16], S0[15:0] }
S_PACK_HL_B32_B16      | SOP2     | No        | D = { S1[15:0], S0[31:16] }
S_PACK_HH_B32_B16      | SOP2     | No        | D = { S1[31:16], S0[31:16] }
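The add-with-carry instructions in the table above are what make extended integer arithmetic possible. The sketch below models a 64-bit unsigned add built from S_ADD_U32 followed by S_ADDC_U32, assuming S_ADDC_U32 leaves a carry-out in SCC as its "Cout" entry indicates; the Python helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def s_add_u32(s0, s1):
    total = (s0 & MASK32) + (s1 & MASK32)
    return total & MASK32, total >> 32           # (D, SCC = carry out)


def s_addc_u32(s0, s1, scc):
    total = (s0 & MASK32) + (s1 & MASK32) + (scc & 1)
    return total & MASK32, total >> 32           # (D, SCC = carry out)


def add_u64(a, b):
    """Add two 64-bit values held as SGPR pairs: low DWORD first, then high."""
    lo, scc = s_add_u32(a & MASK32, b & MASK32)  # low halves set SCC
    hi, scc = s_addc_u32(a >> 32, b >> 32, scc)  # high halves consume SCC
    return (hi << 32) | lo, scc


print(hex(add_u64(0x00000000FFFFFFFF, 1)[0]))
```

The second instruction must follow the first without any intervening instruction that writes SCC, since SCC is the only link between the two halves.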

6.5. Conditional Move Instructions
Conditional instructions use the SCC flag to determine whether to perform the operation, or (for CSELECT)
which source operand to use.
Table 22. Conditional Instructions
Instruction          | Encoding | Sets SCC? | Operation
S_CSELECT_{B32, B64} | SOP2     | No        | D = SCC ? S0 : S1.
S_CMOVK_I32          | SOPK     | No        | if (SCC) D = signext(simm16).
S_CMOV_{B32,B64}     | SOP1     | No        | if (SCC) D = S0, else NOP.

6.6. Comparison Instructions
These instructions compare two values and set the SCC to 1 if the comparison yielded a TRUE result.
Table 23. Comparison Instructions
Instruction                             | Encoding | Sets SCC? | Operation
S_CMP_EQ_U64, S_CMP_LG_U64              | SOPC     | Test      | Compare two 64-bit source values. SCC = S0 <cond> S1.
S_CMP_{EQ,LG,GT,GE,LE,LT}_{I32,U32}     | SOPC     | Test      | Compare two source values. SCC = S0 <cond> S1.
S_BITCMP0_{B32,B64}                     | SOPC     | Test      | Test for "is a bit zero". SCC = !S0[S1].
S_BITCMP1_{B32,B64}                     | SOPC     | Test      | Test for "is a bit one". SCC = S0[S1].

6.7. Bit-Wise Instructions
Bit-wise instructions operate on 32- or 64-bit data without interpreting it as having a type. Where noted in
the table below, SCC is set if the result is nonzero.
Table 24. Bit-Wise Instructions
Instruction | Encoding | Sets SCC? | Operation
S_MOV_{B32,B64} | SOP1 | No | D = S0
S_MOVK_I32 | SOPK | No | D = signext(simm16)
{S_AND,S_OR,S_XOR}_{B32,B64} | SOP2 | D!=0 | D = S0 & S1, S0 OR S1, S0 XOR S1
{S_AND_NOT1,S_OR_NOT1}_{B32,B64} | SOP2 | D!=0 | D = S0 & ~S1, S0 OR ~S1
{S_NAND,S_NOR,S_XNOR}_{B32,B64} | SOP2 | D!=0 | D = ~(S0 & S1), ~(S0 OR S1), ~(S0 XOR S1)
S_LSHL_{B32,B64} | SOP2 | D!=0 | D = S0 << S1[4:0] ([5:0] for B64).
S_LSHR_{B32,B64} | SOP2 | D!=0 | D = S0 >> S1[4:0] ([5:0] for B64).
S_ASHR_{I32,I64} | SOP2 | D!=0 | D = sext(S0 >> S1[4:0]) ([5:0] for I64).
S_BFM_{B32,B64} | SOP2 | No | Bit field mask. D = ((1 << S0[4:0]) - 1) << S1[4:0] (uses [5:0] for the B64 version).
S_BFE_U32, S_BFE_U64, S_BFE_I32, S_BFE_I64 (signed/unsigned) | SOP2 | D!=0 | Bit Field Extract, then sign extend result for I32/64 instructions. S0 = data, S1[22:16] = width. I32/U32: S1[4:0] = offset. I64/U64: S1[5:0] = offset.
S_NOT_{B32,B64} | SOP1 | D!=0 | D = ~S0.
S_WQM_{B32,B64} | SOP1 | D!=0 | D = wholeQuadMode(S0). Per quad (4 bits): set the result to 1111 if any of the 4 bits in the corresponding source mask are set to 1. The per-bit definition follows the table.
S_QUADMASK_{B32,B64} | SOP1 | D!=0 | Create a 1-bit per quad mask from a 1 bit per pixel mask. Creates an 8-bit mask from 32-bits, or 16 bits from 64. D[0] = (S0[3:0] != 0), D[1] = (S0[7:4] != 0), …
S_BITREPLICATE_B64_B32 | SOP1 | No | Replicate each bit in 32-bit S0 twice: D = { … S0[1], S0[1], S0[0], S0[0] }. Applying this instruction twice is the inverse of S_QUADMASK: it expands a quad mask into a thread-mask.
S_BREV_{B32,B64} | SOP1 | No | D[31:0] = S0[0:31] (the bit order is reversed).
S_BCNT0_I32_{B32,B64} | SOP1 | D!=0 | D = CountZeroBits(S0).
S_BCNT1_I32_{B32,B64} | SOP1 | D!=0 | D = CountOneBits(S0).
S_CTZ_I32_{B32,B64} | SOP1 | No | Count Trailing zeroes: Find-first One from LSB. D = Bit position of first one in S0 starting from LSB. -1 if not found.
S_CLZ_I32_{B32,B64} | SOP1 | No | Count Leading zeroes. D = "how many zeros before the first one starting from the MSB". Returns -1 if none.
S_CLS_I32_{B32,B64} | SOP1 | No | Count Leading Sign-bits: Count how many bits in a row (from MSB to LSB) are the same as the sign bit. Return -1 if the input is zero or all 1’s (-1). The 32-bit pseudo-code follows the table.
S_BITSET0_{B32,B64} | SOP1 | No | D[S0[4:0]] = 0 ([5:0] for B64)
S_BITSET1_{B32,B64} | SOP1 | No | D[S0[4:0]] = 1 ([5:0] for B64)
S_{AND,OR,XOR,AND_NOT0,AND_NOT1,OR_NOT0,OR_NOT1,NAND,NOR,XNOR}_SAVEEXEC_{B32,B64} | SOP1 | D!=0 | Save the EXEC mask, then apply a bit-wise operation to it. D = EXEC; EXEC = S0 <op> EXEC; SCC = (EXEC != 0). ("not1" version inverts EXEC; "not0" version inverts SGPR)
S_AND_NOT{0,1}_WREXEC_B{32,64} | SOP1 | D!=0 | NOT0: EXEC, D = ~S0 & EXEC. NOT1: EXEC, D = S0 & ~EXEC. Both D and EXEC get the same result. SCC = (result != 0). D cannot be EXEC.
S_MOVRELS_{B32,B64}, S_MOVRELD_{B32,B64} | SOP1 | No | Move a value into an SGPR relative to the value in M0. MOVRELS: D = SGPR[S0+M0]. MOVRELD: SGPR[D+M0] = S0. Index must be even for B64. M0 is an unsigned index.

S_WQM per-bit definition, for each quad n:
    D[n*4]   = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] )
    D[n*4+1] = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] )
    D[n*4+2] = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] )
    D[n*4+3] = (S[n*4] || S[n*4+1] || S[n*4+2] || S[n*4+3] )

S_CLS 32-bit pseudo-code:
    if (S0 == 0 || S0 == -1) D = -1
    else
        D = 0
        for (I = 31 .. 0)
            if (S0[I] == S0[31])
                D++
            else break
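Several of the mask-oriented entries in Table 24 are easier to follow as executable models. The following is a minimal sketch of S_WQM, S_QUADMASK, S_BITREPLICATE_B64_B32 and the 32-bit S_CLS pseudo-code; the Python function names are illustrative, and only the result D is modeled (not SCC).

```python
def s_wqm(s0, bits=32):
    """Whole Quad Mode: a quad's 4 bits become 1111 if any of them is set."""
    d = 0
    for quad in range(bits // 4):
        if (s0 >> (quad * 4)) & 0xF:
            d |= 0xF << (quad * 4)
    return d


def s_quadmask(s0, bits=32):
    """One result bit per quad, set if any of the quad's 4 source bits is set."""
    d = 0
    for quad in range(bits // 4):
        if (s0 >> (quad * 4)) & 0xF:
            d |= 1 << quad
    return d


def s_bitreplicate_b64_b32(s0):
    """Replicate each of the 32 source bits twice, giving a 64-bit result."""
    d = 0
    for bit in range(32):
        if (s0 >> bit) & 1:
            d |= 0b11 << (2 * bit)
    return d


def s_cls_i32(s0):
    """Count leading sign bits, following the 32-bit pseudo-code."""
    s0 &= 0xFFFFFFFF
    if s0 in (0, 0xFFFFFFFF):
        return -1
    sign = s0 >> 31
    d = 0
    for i in range(31, -1, -1):
        if ((s0 >> i) & 1) == sign:
            d += 1
        else:
            break
    return d


# Expanding an 8-bit quad mask with two bit-replicate steps yields a
# 32-bit thread mask; S_QUADMASK recovers the original quad mask.
quad_mask = 0xA5
thread_mask = s_bitreplicate_b64_b32(s_bitreplicate_b64_b32(quad_mask))
print(hex(thread_mask), hex(s_quadmask(thread_mask)))
```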

6.8. Access Instructions
These instructions access hardware internal registers.
Table 25. Hardware Internal Registers
Instruction        | Encoding | Sets SCC? | Operation
S_GETREG_B32       | SOPK     | No        | Read a hardware register into the LSBs of SDST.
S_SETREG_B32       | SOPK     | No        | Write the LSBs of SDST into a hardware register. (Note that SDST is used as a source SGPR).
S_SETREG_IMM32_B32 | SOPK     | No        | S_SETREG where 32-bit data comes from a literal constant (so this is a 64-bit instruction format).
S_ROUND_MODE       | SOPP     | No        | Set the round mode from an immediate: simm16[3:0]
S_DENORM_MODE      | SOPP     | No        | Set the denorm mode from an immediate: simm16[3:0]

GETREG/SETREG: SIMM16 = { Size[4:0], Offset[4:0], hwRegId[5:0] }
Offset is 0..31. Size is 1..32.
For hardware register index values, see Hardware Registers.
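The SIMM16 layout for GETREG/SETREG can be sketched as a pair of helpers. This is an illustration under one stated assumption: because a size of 1..32 has to fit in the 5-bit Size field, the field is taken to hold size minus one. The function names and the register id used in the example are hypothetical.

```python
def encode_hwreg_simm16(hw_reg_id, offset=0, size=32):
    """Pack { Size[4:0], Offset[4:0], hwRegId[5:0] } into SIMM16."""
    if not 0 <= hw_reg_id <= 63:
        raise ValueError("hwRegId is a 6-bit field")
    if not 0 <= offset <= 31:
        raise ValueError("Offset is 0..31")
    if not 1 <= size <= 32:
        raise ValueError("Size is 1..32")
    # ASSUMPTION: the Size field stores size - 1 so that 1..32 fits in 5 bits.
    return ((size - 1) << 11) | (offset << 6) | hw_reg_id


def decode_hwreg_simm16(simm16):
    """Return (hw_reg_id, offset, size) from a packed SIMM16 value."""
    hw_reg_id = simm16 & 0x3F
    offset = (simm16 >> 6) & 0x1F
    size = ((simm16 >> 11) & 0x1F) + 1
    return hw_reg_id, offset, size


# Select a 4-bit field starting at bit 4 of a hypothetical register id 1.
packed = encode_hwreg_simm16(hw_reg_id=1, offset=4, size=4)
print(hex(packed), decode_hwreg_simm16(packed))
```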

6.9. Memory Aperture Query
Shaders can query the memory aperture base and size for shared and private space through scalar operands:
• PRIVATE_BASE
• PRIVATE_LIMIT
• SHARED_BASE
• SHARED_LIMIT
These values originate from the SH_MEM_BASES register ("SMB"), and are used primarily with FLAT memory
instructions. Setting Shared Base or Private Base to zero disables that aperture.
"PTR32" is short for "Address mode is 32bit", and "SMB" is short for "SH_MEM_BASES". These constants can be
used by SALU and VALU ops, and are 64-bit unsigned integers:
SHARED_BASE   = ptr32 ? {32’h0, SMB.shared_base[15:0], 16’h0000} : {SMB.shared_base[15:0], 48’h000000000000}
SHARED_LIMIT  = ptr32 ? {32’h0, SMB.shared_base[15:0], 16’hFFFF} : {SMB.shared_base[15:0], 48’h0000FFFFFFFF}
PRIVATE_BASE  = ptr32 ? {32’h0, SMB.private_base[15:0], 16’h0000} : {SMB.private_base[15:0], 48’h000000000000}
PRIVATE_LIMIT = ptr32 ? {32’h0, SMB.private_base[15:0], 16’hFFFF} : {SMB.private_base[15:0], 48’h0000FFFFFFFF}
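The concatenations above translate directly into shifts and ORs. The sketch below computes a base/limit pair from a 16-bit SMB field; the function name is illustrative, and the same computation applies to both the shared and the private aperture.

```python
def aperture_bounds(base16, ptr32):
    """Return (BASE, LIMIT) as 64-bit unsigned integers.

    base16: SMB.shared_base[15:0] or SMB.private_base[15:0].
    ptr32:  True when the address mode is 32-bit.
    """
    base16 &= 0xFFFF
    if ptr32:
        base = base16 << 16                      # {32'h0, base16, 16'h0000}
        limit = (base16 << 16) | 0xFFFF          # {32'h0, base16, 16'hFFFF}
    else:
        base = base16 << 48                      # {base16, 48'h000000000000}
        limit = (base16 << 48) | 0x0000FFFFFFFF  # {base16, 48'h0000FFFFFFFF}
    return base, limit


base, limit = aperture_bounds(0x0001, ptr32=False)
print(hex(base), hex(limit))
```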

"Hole" = (addr[63:47] != all zeros or all ones) and is the illegal address section of memory


Chapter 7. Vector ALU Operations
Vector ALU (VALU) instructions perform an arithmetic or logical operation on data for each of 32 or 64
threads and write results back to VGPRs, SGPRs or the EXEC mask.
Parameter interpolation is a two-step process involving an LDS instruction followed by a VALU instruction and
is described in: Parameter Interpolation.
Vector ALU (VALU) instructions control the SIMD32’s math unit and operate on 32 work-items of data at a time.
Each instruction may take input from either VGPRs, SGPRs or constants and typically returns results to VGPRs.
Mask results and carry-out are returned to SGPRs. The ALU provides operations that work on 16, 32 and 64-bit
data of both integer and float types. The ALU also supports "packed" data types that pack 2 16-bit values into
one VGPR, or 4 8-bit values into a VGPR.

7.1. Microcode Encodings
VALU instructions are encoded in one of these ways:

Name   | Size   | Function                                           | Modifiers
VOP1   | 32 bit | VALU op with 1 input                               | -
VOP2   | 32 bit | VALU op with 2 inputs                              | -
VOP3   | 64 bit | VALU op with 3 inputs, or a VOP1,2,C instruction   | abs, neg, omod, clamp
VOP3SD | 64 bit | VALU op with 3 inputs and SDST                     | neg, omod, clamp
VOPC   | 32 bit | VALU compare op with 2 inputs, writes to VCC/EXEC  | -
VOP3P  | 64 bit | VALU op with 3 inputs using packed math            | neg, clamp
VOPD   | 64 bit | VALU dual opcode : 2 operations in one instruction | -

Many VALU instructions are available in two encodings: VOP3 that uses 64-bits of instruction, and one of three
32-bit encodings that offer a restricted set of capabilities but smaller code size. Some instructions are only
available in the VOP3 encoding. When an instruction is available in two microcode formats, it is up to the user
to decide which to use. It is recommended to use the 32-bit encoding whenever possible. VOP2 can also be used
for "ACCUM" type ops where the third input is implied to be the same as the dest.
Advantages of using VOP3 include:
• More flexibility in source addressing (all source fields are 9 bits)
• NEG, ABS, and OMOD fields (for floating point only)
• CLAMP field for output range limiting
• Ability to select alternate source and destination registers for VCC (carry in and out)
The following VOP1 and VOP2 instructions may not be promoted to VOP3:
• swap and swaprel
• fmamk, fmaak, pk_fmac
The VOP3 encoding has two variants:
• VOP3 - used for most instructions including V_CMP*; has OPSEL and ABS fields
• VOP3SD - has an SDST field instead of OPSEL and ABS. This encoding is used only for:
◦ V_{ADD,SUB,SUBREV}_CO_CI_U32, V_{ADD,SUB,SUBREV}_CO_U32 (adds with carry-out)
◦ V_DIV_SCALE_{F32, F64}, V_MAD_U64_U32, V_MAD_I64_I32.
◦ V_DOT2ACC_F32_F16
◦ VOP3SD is not used for V_CMP*.
Any of the VALU microcode formats, including VOP3, may use a 32-bit literal constant. Note however that VOP3
plus a literal makes a 96-bit instruction and excessive use of this combination may reduce performance.
VOP3P is for instructions that use "packed math": instructions that perform an operation on a pair of input
values that are packed into the high and low 16-bits of each operand; the two 16-bit results are written to a
single VGPR as two packed values.
Field | Size   | Description
OP    | varies | instruction opcode
SRC0  | 9      | first instruction argument. May come from: vgpr, sgpr, VCC, M0, EXEC, SCC, or a constant
SRC1  | 9      | second instruction argument. May come from: vgpr, sgpr, VCC, M0, EXEC, SCC, or a constant
VSRC1 | 8      | second instruction argument. May come from: vgpr only
SRC2  | 9      | third instruction argument. May come from: vgpr, sgpr, VCC, M0, EXEC, SCC, or a constant
VDST  | 8      | VGPR that takes the result.
                 For V_READLANE and V_CMP, indicates the SGPR that receives the result. This cannot be M0 or EXEC.
SDST  | 8      | SGPR that takes the result of operations that produce a scalar output. Can’t be M0 or EXEC. Supports
                 NULL to not write any SDST.
                 Used for: V_{ADD,SUB,SUBREV}_CO_U32, V_{ADD,SUB,SUBREV}_CO_CI_U32, V_DIV_SCALE*; not
                 used for V_CMP.
OMOD  | 2      | output modifier. For float results only.
                 0 = no modifier, 1 = multiply result by 2, 2 = multiply result by 4, 3 = divide result by 2
NEG   | 3      | negate the input (invert sign bit). Float inputs only.
                 bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2.
ABS   | 3      | apply absolute value on input. Float inputs only. Applied before 'neg'.
                 bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2.
CLMP  | 1      | clamp or compare-signal (depends on opcode):
                 V_CMP: clmp=1 means signaling-compare when qNaN detected; 0 = non-signaling
                 Float arithmetic: clamp result to [0, 1.0]; -0 is clamped to +0.
                 Signed integer arithmetic: clamp result to [min_int, +max_int]
                 Unsigned integer arithmetic: clamp result to [0, +max_uint]
                 Where "min_int" and "max_int" are the largest negative and positive representable integers for the size
                 of integer being used (16, 32 or 64 bit). "max_uint" is the largest unsigned int.
OPSEL | 4      | Operation select for 16-bit math: 1=select high half, 0=select low half
                 [0]=src0, [1]=src1, [2]=src2, [3]=dest
                 For dest=0, dest_vgpr[31:0] = {prev_dst_vgpr[31:16], result[15:0] }
                 For dest=1, dest_vgpr[31:0] = {result[15:0], prev_dst_vgpr[15:0] }
                 OPSEL may only be used for 16-bit operands, and must be zero for any other operands/results.
                 For V_PERMLANE*, OPSEL[0] is "fetch invalid"; OPSEL[1] is "bounds control" (like DPP8).
                 DOT2_F16_F16 and DOT2_BF16_BF16: src0 and src1 must have OPSEL[1:0] = 0
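The OPSEL half-selection and the ABS/NEG input modifiers can be modeled on raw register bits. This is a minimal sketch with hypothetical function names: it operates on 32-bit integers that stand in for VGPR contents and does not model any arithmetic.

```python
def opsel_source(src, sel):
    """Select the 16-bit half of a 32-bit source: 1 = high half, 0 = low half."""
    return (src >> 16) & 0xFFFF if sel else src & 0xFFFF


def opsel_write_dest(prev_dst, result16, sel):
    """Merge a 16-bit result into the destination, preserving the other half."""
    result16 &= 0xFFFF
    if sel:
        return (result16 << 16) | (prev_dst & 0x0000FFFF)   # write high half
    return (prev_dst & 0xFFFF0000) | result16               # write low half


def apply_input_modifiers_f32(bits, abs_bit, neg_bit):
    """Apply ABS, then NEG, to the bit pattern of a 32-bit float input."""
    if abs_bit:
        bits &= 0x7FFFFFFF      # clear the sign bit
    if neg_bit:
        bits ^= 0x80000000      # invert the sign bit
    return bits & 0xFFFFFFFF


# ABS is applied before NEG, so |-1.0| negated is -1.0 again.
print(hex(apply_input_modifiers_f32(0xBF800000, abs_bit=1, neg_bit=1)))
print(hex(opsel_write_dest(0x11112222, 0xBEEF, sel=1)))
```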

7.2. Operands
Most VALU instructions take at least one input operand. The data-size of the operands is explicitly defined in
the name of the instruction. For example, V_FMA_F32 operates on 32-bit floating point data.
VGPR Alignment: there is no alignment restriction for single or double-float operations.
Table 26. VALU Instruction Operands

Codes 0-255 are the scalar source encodings (8 bits), of which 0-127 are also valid as a scalar destination
(7 bits). Codes 256-511 exist only in the 9-bit vector source field and select a VGPR (8-bit vector
source/destination number).

Code      | Name            | Meaning
0-105     | SGPR 0 .. 105   | SGPRs. One DWORD each.
106       | VCC_LO          | VCC[31:0]
107       | VCC_HI          | VCC[63:32]
108-123   | ttmp0 .. ttmp15 | Trap handler temporary SGPRs (privileged)
124       | NULL            | Reads return zero, writes are ignored. When used as a destination, nullifies the instruction.
125       | M0              | Temporary register, use for a variety of functions
126       | EXEC_LO         | EXEC[31:0]
127       | EXEC_HI         | EXEC[63:32]
128       | 0               | Integer inline constant: zero
129-192   | int 1 .. 64     | Integer inline constants
193-208   | int -1 .. -16   | Integer inline constants
209-232   | Reserved        | Reserved
233       | DPP8            | 8-lane DPP (only valid as SRC0)
234       | DPP8FI          | 8-lane DPP with Fetch-Invalid (only valid as SRC0)
235       | SHARED_BASE     | Memory Aperture Definition
236       | SHARED_LIMIT    | Memory Aperture Definition
237       | PRIVATE_BASE    | Memory Aperture Definition
238       | PRIVATE_LIMIT   | Memory Aperture Definition
239       | Reserved        | Reserved
240       | 0.5             | Float inline constants (codes 240-248). Can be used in 16, 32 and 64 bit floating point math. They may be used with non-float instructions but the value remains a float.
241       | -0.5            |
242       | 1.0             |
243       | -1.0            |
244       | 2.0             |
245       | -2.0            |
246       | 4.0             |
247       | -4.0            |
248       | 1.0 / (2 * PI)  |
249       | Reserved        | Reserved
250       | DPP16           | data parallel primitive
251       | Reserved        | Reserved
252       | Reserved        | Reserved
253       | SCC             | { 31’b0, SCC }
254       | Reserved        | Reserved
255       | Literal constant | 32 bit constant from instruction stream
256 - 511 | VGPR 0 .. 255   | Vector GPRs. One DWORD each.

1/(2*PI) is 0.15915494. The hex values are:
half: 0x3118
single: 0x3e22f983
double: 0x3fc45f306dc9c882
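The three hex encodings can be checked against 1/(2*PI) with the standard library. This is a small verification sketch; the helper names are illustrative.

```python
import math
import struct

INV_2PI = 1.0 / (2.0 * math.pi)


def half_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def single_from_bits(bits):
    """Decode a 32-bit pattern as an IEEE single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def double_from_bits(bits):
    """Decode a 64-bit pattern as an IEEE double-precision float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


for name, value in (
    ("half", half_from_bits(0x3118)),
    ("single", single_from_bits(0x3E22F983)),
    ("double", double_from_bits(0x3FC45F306DC9C882)),
):
    # Print each decoded constant and its distance from 1/(2*PI).
    print(name, value, abs(value - INV_2PI))
```

The decoded values differ from 1/(2*PI) only by the rounding error of each format, which grows as the format narrows.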

7.2.1. Non-Standard Uses of Operand Fields
A few instructions use the operand fields in non-standard ways:

Opcode | Encoding | VDST | SDST | VSRC0 | VSRC1 | VSRC2
V_{ADD,SUB,SUBREV}_CO_U32, V_{ADD,SUB,SUBREV}_CO_CI_U32 | VOP2 | add result (VCC=carry-out) | n/a | in0 | in1 | unused (carry-in=VCC)
V_{ADD,SUB,SUBREV}_CO_U32, V_{ADD,SUB,SUBREV}_CO_CI_U32 | VOP3SD | add result | carry-out | in0 | in1 | carry-in
V_DIV_SCALE | VOP3SD | result | carry-out | in0 | in1 | in2
V_READLANE | VOP3 | scalar dst (SGPR only) | n/a | vgpr# | lane-sel: sgpr, M0, inline | n/a
V_READFIRSTLANE | VOP1 | scalar dst (SGPR only) | n/a | vgpr# | n/a (lane-sel = exec) | n/a
V_WRITELANE | VOP3 | vgpr dst | n/a | sgpr#, const, M0 | lane-sel: sgpr, M0, inline | n/a
V_CMP* | VOPC | "VCC" implied | n/a | in0 | in1 | n/a
V_CMP* | VOP3 | cmp-result (sgpr) | unused | in0 | in1 | unused
V_CNDMASK | VOP2 | dest vgpr | n/a | in0 | in1 | unused (implied: VCC)
V_CNDMASK | VOP3 | dest vgpr | unused | in0 | in1 | select sgpr (e.g. VCC)

The readlane lane-select is limited to the valid range of lanes (0-31 for wave32, 0-63 for wave64) by ignoring
upper bits of the lane number.
Inline constants with DOT2_F16_F16 and DOT2_BF16_BF16
For these two instructions, the inline constant for sources 0 and 1 is replicated into bits [31:16]. For
source 2, the OPSEL bit controls whether the value is replicated (bits [31:16] get zero when the low bits are
not replicated).

7.2.2. Input Operands
VALU instructions can use any of the following sources for input, subject to restrictions listed below:
• VOP1, VOP2, VOPC:
◦ SRC0 is 9 bits and may be a VGPR, SGPR (including TTMPs and VCC), M0, EXEC, inline or literal
constant.
◦ SRC1 is 8 bits and may specify only a VGPR
• VOP3 : all 3 sources are 9 bits but still have restrictions:
◦ Not all VOPC/1/2 instructions are available in VOP3 (only those that benefit from VOP3 encoding).
• See complete operand list: VALU Instruction Operands

7.2.2.1. Input Operand Modifiers
The input modifiers ABS and NEG apply to floating point inputs and are undefined for any other type of input.
In addition, input modifiers are supported for: V_MOV_B32, V_MOV_B16, V_MOVREL*_B32 and V_CNDMASK.
ABS returns the absolute value, and NEG negates the input.
Input modifiers are not supported for:
• readlane, readfirstlane, writelane
• integer arithmetic or bitwise operations
• permlane


• QSAD

7.2.2.2. Literal Expansion to 64 bits
Literal constants are 32-bits, but they can be used as sources that normally require 64-bit data.
They are expanded to 64 bits following these rules:
• 64-bit float: the literal supplies the upper 32 bits, and the lower 32 bits are padded with zeros
• 64-bit unsigned integer: zero extended to 64 bits
• 64-bit signed integer: sign extended to 64 bits
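The three expansion rules can be written out as a short function. This is a sketch under one stated reading of the float rule: since the lower 32 bits are zero-padded, the literal is taken to occupy the upper 32 bits of the 64-bit float. The function name and the kind strings are illustrative.

```python
def expand_literal_to_64(literal32, kind):
    """Expand a 32-bit literal to 64 bits; kind is 'f64', 'u64' or 'i64'."""
    literal32 &= 0xFFFFFFFF
    if kind == "f64":
        # ASSUMPTION: the literal forms the upper half; lower half is zero.
        return literal32 << 32
    if kind == "u64":
        return literal32                              # zero extended
    if kind == "i64":
        if literal32 & 0x80000000:
            return literal32 | 0xFFFFFFFF00000000     # sign extended
        return literal32
    raise ValueError(f"unknown operand kind: {kind}")


# 0x3FF00000 is the upper half of the double-precision encoding of 1.0.
print(hex(expand_literal_to_64(0x3FF00000, "f64")))
```

One consequence of the float rule is that a 64-bit float literal can only express values whose low 32 mantissa bits are zero.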

7.2.2.3. Source Operand Restrictions
Not every combination of source operands that can be expressed in the microcode format is legal. This section
describes the legal and illegal settings.
Terminology for this section:
"scalar value" = SGPR, EXEC, VCC, M0, SCC or literal constant; can be 32 or 64 bits.
• Instructions may use at most two Scalar Values: SGPR, VCC, M0, EXEC, SCC, Literal
• All instruction formats including VOP3 and VOP3P may use one literal constant
◦ Inline constants are free (do not count against 2 scalar value limit).
◦ Literals may not be used with DPP
◦ It is permissible for both scalar values to be SGPRs, although VCC counts as an SGPR.
▪ VCC when used implicitly counts against this limit: addci, subci, fmas, cndmask
◦ 64-bit shift instructions can use only one scalar value input, and can’t use the same one twice
(inlines don’t count against this limit)
◦ Using the same scalar value twice only counts as a single scalar value, however using the same scalar
value twice, but with different sizes has specific rules and limits:
▪ Using the same literal with different sizes counts as 2 scalar values, not 1.
▪ S[0] and S[0:1] can be considered as 1 scalar value, but S[1] and S[0:1] count as 2.
In general, S[2n] and S[2n:2n+1] count as one scalar value, but S[2n+1] and S[2n:2n+1] count as two.
• SGPR source rules must be met for both passes of a wave64, bearing in mind that sources that read a mask
(bit-per-lane) increment the SGPR address for the second pass, and they may not be shared with other
sources.
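The counting rules above can be approximated by a small checker. This is a simplified sketch, not a complete legality test: the operand tuples are a hypothetical representation, and the DPP, 64-bit shift and wave64 second-pass rules are not modeled.

```python
def count_scalar_values(operands):
    """Count the scalar values used by one instruction.

    Each operand is a tuple (hypothetical representation):
      ("sgpr", first_index, dwords)   e.g. S[0] -> ("sgpr", 0, 1), S[0:1] -> ("sgpr", 0, 2)
      ("literal", value, bits)        bits is 32 or 64
      ("special", name)               VCC, M0, EXEC or SCC
      ("inline", value) / ("vgpr", index)   do not count against the limit
    """
    seen = set()
    for operand in operands:
        kind = operand[0]
        if kind in ("inline", "vgpr"):
            continue
        if kind == "sgpr":
            # S[2n] and S[2n:2n+1] share a first index and count once;
            # S[2n+1] has a different first index and counts separately.
            seen.add(("sgpr", operand[1]))
        elif kind == "literal":
            # The same literal used at two sizes counts as two values.
            seen.add(("literal", operand[1], operand[2]))
        else:
            seen.add(operand)
    return len(seen)


def within_scalar_limit(operands, limit=2):
    return count_scalar_values(operands) <= limit


print(count_scalar_values([("sgpr", 0, 1), ("sgpr", 0, 2)]))
print(count_scalar_values([("sgpr", 1, 1), ("sgpr", 0, 2)]))
```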

7.2.2.4. OPSEL Field Restrictions
The OPSEL field (of VOP3) is usable only for a subset of VOP3 instructions, as well as VOP1/2/C instructions
promoted to VOP3.
Table 27. Opcodes usable with OPSEL

V_MAD_I16                  V_MAD_U16                  V_FMA_F16
V_ADD_NC_U16               V_ADD_NC_I16               V_CVT_PKNORM_I16_F16
V_SUB_NC_U16               V_SUB_NC_I16               V_CVT_PKNORM_U16_F16
V_MUL_LO_U16               V_MAD_U32_U16              V_MAD_I32_I16
V_LSHLREV_B16              V_LSHRREV_B16              V_ASHRREV_I16
V_ALIGNBIT_B32             V_ALIGNBYTE_B32            V_DIV_FIXUP_F16
V_MIN3_{F16,I16,U16}       V_MAX3_{F16,I16,U16}       V_MED3_{F16,I16,U16}
V_MAX_{I16,U16}            V_MIN_{I16,U16}            V_PACK_B32_F16
V_MAXMIN_F16               V_MINMAX_F16               V_CNDMASK_B16
V_XOR_B16                  V_AND_B16                  V_OR_B16
V_DOT2_F16_F16             V_DOT2_BF16_BF16           V_INTERP_P10_RTZ_F16_F32
V_INTERP_P2_RTZ_F16_F32    V_INTERP_P2_F16_F32        V_INTERP_P10_F16_F32

7.2.3. Output Operands
VALU instructions typically write their results to VGPRs specified in the VDST field of the microcode word. A
thread only writes a result if the associated bit in the EXEC mask is set to 1.
V_CMPX instructions write the result of their comparison (one bit per thread) to the EXEC mask.
Instructions producing a carry-out (integer add and subtract) write their result to VCC when used in the VOP2
form, and to an arbitrary SGPR-pair when used in the VOP3 form.
When the VOP3 form is used, instructions with a floating-point result may apply an output modifier (OMOD
field) that multiplies the result by: 0.5, 2.0, or 4.0. Optionally, the result can be clamped (CLAMP field) to the
min and max representable range (see next section).

7.2.3.1. Output Operand Modifiers
Output modifiers (OMOD) apply to half, single and double floating point results only and scale the result by
0.5, 2.0 or 4.0, or leave it unscaled. Integer and packed float 16 results ignore the omod setting. Output modifiers are
not compatible with output denormals: if output denormals are enabled, then output modifiers are ignored. If
output denormals are disabled, then the output modifier is applied and denormals are flushed to zero. These
are not IEEE compatible: -0 is flushed to +0. Output modifiers are ignored if the IEEE mode bit is set to 1. A few
opcodes force output denorms to be disabled.
Output Modifiers are not supported for:
• V_PERMLANE
• DOT2_F16_F16
• DOT2_BF16_BF16
The clamp bit has multiple uses. For V_CMP instructions, setting the clamp bit to 1 indicates that the compare
signals if a floating point exception occurs. For integer operations, it clamps the result to the largest and
smallest representable value. For floating point operations, it clamps the result to the range: [0.0, 1.0].
Output Clamping: The clamp instruction bit applies to the following operations and data types:
• Float clamp to [0.0, 1.0]
• Signed Int [min_int, +max_int]
• Unsigned int [0, +max_uint]
• Bool (V_CMP) enables signaling compare
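The output modifier and clamp behaviors can be summarized in a few helper functions. This is a minimal sketch with illustrative names: it follows the OMOD codes and clamp ranges given in the CLMP and OMOD field descriptions, treats each step in isolation, and does not model NaN handling, denormal flushing or the IEEE mode bit.

```python
OMOD_SCALE = {0: 1.0, 1: 2.0, 2: 4.0, 3: 0.5}   # OMOD field code -> scale factor


def apply_omod(result, omod):
    """Scale a floating point result according to the 2-bit OMOD code."""
    return result * OMOD_SCALE[omod]


def clamp_float(result):
    """Clamp a float result to [0.0, 1.0]; -0 is clamped to +0."""
    clamped = min(max(result, 0.0), 1.0)
    return 0.0 if clamped == 0.0 else clamped


def clamp_signed(result, bits):
    """Clamp to [min_int, +max_int] for the given integer width."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return min(max(result, low), high)


def clamp_unsigned(result, bits):
    """Clamp to [0, +max_uint] for the given integer width."""
    return min(max(result, 0), (1 << bits) - 1)


print(apply_omod(1.5, 3), clamp_float(1.7), clamp_signed(40000, 16))
```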

The clamp bit is not supported for (ignored):
• V_PERMLANE*
• V_PERM_B32
• Float DOT instructions
• V_SWAP and V_SWAPREL
• WMMA ops
• V_ADD3
• V_ADD_LSHL
• V_ALIGN*
• Bitwise ops
• V_CMP*_CLASS
• V_CMP on integers
• V_READLANE
• V_READFIRSTLANE
• V_WRITELANE

7.2.3.2. Wave64 Destination Restrictions
When a VALU instruction is issued from a wave64, it may issue twice as two wave32 instructions. While in most
cases the programmer need not be aware of this, it does impose a prohibition on wave64 VALU instructions
that both write and read the same SGPR value. Doing this may lead to unpredictable results. Specifically, the first
pass of a wave64 VALU instruction may not overwrite a scalar value used by the second pass.

7.2.4. Denormalized and Rounding Modes
The shader program has explicit control over the rounding mode applied and the handling of denormalized
inputs and results. The MODE register is set using the S_SETREG instruction; it has separate bits for controlling
the behavior of single and double-precision floating-point numbers.
Round and denormal modes can also be set using S_ROUND_MODE and S_DENORM_MODE, which are preferred
over S_SETREG.
16-bit floats support denormals, infinity and NaN.
Table 28. Round and Denormal Modes
Field       Bit Position   Description
FP_ROUND    3:0            [1:0] Single-precision round mode.
                           [3:2] Double and Half-precision (FP16) round mode.
                           Round modes:
                           0 = nearest even
                           1 = +infinity
                           2 = -infinity
                           3 = toward zero
FP_DENORM   7:4            [5:4] Single-precision denormal mode.
                           [7:6] Double and Half-precision (FP16) denormal mode.
                           Denormal modes:
                           0 = Flush input and output denorms
                           1 = Allow input denorms, flush output denorms
                           2 = Flush input denorms, allow output denorms
                           3 = Allow input and output denorms
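
As an illustration of the field layout in Table 28, the following Python sketch decodes the low 8 bits of a MODE value into its round and denormal settings. The function and dictionary names are hypothetical; only the bit positions and mode meanings are taken from the table.

```python
ROUND_MODES = {0: "nearest even", 1: "+infinity", 2: "-infinity", 3: "toward zero"}
DENORM_MODES = {
    0: "flush input and output denorms",
    1: "allow input denorms, flush output denorms",
    2: "flush input denorms, allow output denorms",
    3: "allow input and output denorms",
}

def decode_fp_mode(mode_bits):
    """Decode FP_ROUND (bits 3:0) and FP_DENORM (bits 7:4) from a MODE value."""
    return {
        "round_f32": ROUND_MODES[mode_bits & 0x3],             # bits [1:0]
        "round_f64_f16": ROUND_MODES[(mode_bits >> 2) & 0x3],   # bits [3:2]
        "denorm_f32": DENORM_MODES[(mode_bits >> 4) & 0x3],     # bits [5:4]
        "denorm_f64_f16": DENORM_MODES[(mode_bits >> 6) & 0x3], # bits [7:6]
    }
```

With this sketch, a value of 0xF0 decodes to round-to-nearest-even for all precisions with input and output denormals allowed.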

These mode bits do not affect rounding and denormal handling of F32 global memory atomics.
DOT2_F16_F16 and DOT2_BF16_BF16 support round-to-nearest-even rounding. DOT2_F16_F16 supports
denorms, and DOT2_BF16_BF16 disables all denorms.

7.2.5. Instructions using SGPRs as Mask or Carry
Every VALU instruction can use SGPRs as a constant, but the following can read or write SGPRs as masks or
carry:
Read Mask or Carry in

Write Carry out

Implicitly Reads VCC

Implicitly Writes VCC

V_CNDMASK_B32

V_CMP*

V_DIV_FMAS_F32

V_DIV_SCALE_F32

V_ADD_CO_CI_U32

V_ADD_CO_CI_U32

V_DIV_FMAS_F64

V_DIV_SCALE_F64

V_SUB_CO_CI_U32

V_SUB_CO_CI_U32

(fmas reads 3 operands + VCC)

V_CMP (not V_CMPX)

V_SUBREV_CO_CI_U32

V_SUBREV_CO_CI_U32

V_CNDMASK in VOP2

V_ADD_CO_U32

V_{ADD,SUB,SUBREV}_CO_CI_U32 in VOP2

V_SUB_CO_U32
V_SUBREV_CO_U32
V_MAD_U64_U32
V_MAD_I64_I32
Write Data out (not carry)
V_READLANE
V_READFIRSTLANE

"VCC" in the above table refers to VCC in a VOP2 or VOPC encoding, or any SGPR specified in the SRC2 or SDST
field for VOP3 encoding, except for DIV_FMAS that implicitly reads VCC (no choice).
V_CMPX is the only VALU instruction that writes EXEC.

7.2.6. Wave64 use of SGPRs
VALU instructions may use SGPRs as a uniform input, shared by all work-items. If the value is used as a simple
data value, then the same SGPR is distributed to all 64 work-items. If, on the other hand, the data value
represents a mask (e.g. carry-in, mask for CNDMASK), then each work-item receives a separate value, and two
consecutive SGPRs are read.

7.2.7. Out-of-Range GPRs
When a source VGPR is out-of-range, the instruction uses as input the value from VGPR0.
When the destination GPR is out-of-range, the instruction executes but does not write the results.
See VGPR Out Of Range Behavior for more information.

7.2.8. PERMLANE Specific Rules
V_PERMLANE may not occur immediately after a V_CMPX. To prevent this, any other VALU opcode may be
inserted (e.g. V_NOP).

7.3. Instructions
The table below lists the complete VALU instruction set by microcode encoding, except for VOP3P instructions
which are listed in a later section.
VOP3

VOP3 - 2 operands

VOP2

VOP1

V_ADD3_U32

V_ADD_CO_U32

V_ADD_CO_CI_U32

V_BFREV_B32

V_ADD_LSHL_U32

V_ADD_F64

V_ADD_F16

V_CEIL_F16

V_ALIGNBIT_B32

V_ADD_NC_I16

V_ADD_F32

V_CEIL_F32

V_ALIGNBYTE_B32

V_ADD_NC_I32

V_ADD_NC_U32

V_CEIL_F64

V_AND_OR_B32

V_ADD_NC_U16

V_AND_B32

V_CLS_I32

V_BFE_I32

V_AND_B16

V_ASHRREV_I32

V_CLZ_I32_U32

V_BFE_U32

V_ASHRREV_I16

V_CNDMASK_B32

V_COS_F16

V_BFI_B32

V_ASHRREV_I64

V_CVT_PK_RTZ_F16_F32

V_COS_F32

V_CNDMASK_B16

V_BCNT_U32_B32

V_DOT2ACC_F32_F16

V_CTZ_I32_B32

V_CUBEID_F32

V_BFM_B32

V_FMAAK_F16

V_CVT_F16_F32

V_CUBEMA_F32

V_CVT_PK_I16_F32

V_FMAAK_F32

V_CVT_F16_I16

V_CUBESC_F32

V_CVT_PK_I16_I32

V_FMAC_DX9_ZERO_F32

V_CVT_F16_U16

V_CUBETC_F32

V_CVT_PK_NORM_I16_F16

V_FMAC_F16

V_CVT_F32_F16

V_CVT_PK_U8_F32

V_CVT_PK_NORM_I16_F32

V_FMAC_F32

V_CVT_F32_F64

V_DIV_FIXUP_F16

V_CVT_PK_NORM_U16_F16

V_FMAMK_F16

V_CVT_F32_I32

V_DIV_FIXUP_F32

V_CVT_PK_NORM_U16_F32

V_FMAMK_F32

V_CVT_F32_U32

V_DIV_FIXUP_F64

V_CVT_PK_U16_F32

V_LDEXP_F16

V_CVT_F32_UBYTE0

V_DIV_FMAS_F32

V_CVT_PK_U16_U32

V_LSHLREV_B32

V_CVT_F32_UBYTE1

V_DIV_FMAS_F64

V_LDEXP_F32

V_LSHRREV_B32

V_CVT_F32_UBYTE2

V_DIV_SCALE_F32

V_LDEXP_F64

V_MAX_F16

V_CVT_F32_UBYTE3

V_DIV_SCALE_F64

V_LSHLREV_B16

V_MAX_F32

V_CVT_F64_F32

V_DOT2_BF16_BF16

V_LSHLREV_B64

V_MAX_I32

V_CVT_F64_I32

V_DOT2_F16_F16

V_LSHRREV_B16

V_MAX_U32

V_CVT_F64_U32

V_FMA_DX9_ZERO_F32

V_LSHRREV_B64

V_MIN_F16

V_CVT_FLOOR_I32_F32

V_FMA_F16

V_MAX_F64

V_MIN_F32

V_CVT_I16_F16

V_FMA_F32

V_MAX_I16

V_MIN_I32

V_CVT_I32_F32

V_FMA_F64

V_MAX_U16

V_MIN_U32

V_CVT_I32_F64

V_LERP_U8

V_MBCNT_HI_U32_B32

V_MUL_DX9_ZERO_F32

V_CVT_I32_I16

V_LSHL_ADD_U32

V_MBCNT_LO_U32_B32

V_MUL_F16

V_CVT_NEAREST_I32_F32

V_LSHL_OR_B32

V_MIN_F64

V_MUL_F32

V_CVT_NORM_I16_F16

V_MAD_I16

V_MIN_I16

V_MUL_HI_I32_I24

V_CVT_NORM_U16_F16

V_MAD_I32_I16

V_MIN_U16

V_MUL_HI_U32_U24

V_CVT_OFF_F32_I4

V_MAD_I32_I24

V_MUL_F64

V_MUL_I32_I24

V_CVT_U16_F16

V_MAD_I64_I32

V_MUL_HI_I32

V_MUL_U32_U24

V_CVT_U32_F32

V_MAD_U16

V_MUL_HI_U32

V_OR_B32

V_CVT_U32_F64

V_MAD_U32_U16

V_MUL_LO_U16

V_PK_FMAC_F16

V_CVT_U32_U16

V_MAD_U32_U24

V_MUL_LO_U32

V_SUBREV_CO_CI_U32

V_EXP_F16

V_MAD_U64_U32

V_OR_B16

V_SUBREV_F16

V_EXP_F32

V_MAX3_F16

V_PACK_B32_F16

V_SUBREV_F32

V_FLOOR_F16

V_MAX3_F32

V_READLANE_B32

V_SUBREV_NC_U32

V_FLOOR_F32

V_MAX3_I16

V_SUBREV_CO_U32

V_SUB_CO_CI_U32

V_FLOOR_F64

V_MAX3_I32

V_SUB_CO_U32

V_SUB_F16

V_FRACT_F16

V_MAX3_U16

V_SUB_NC_I16

V_SUB_F32

V_FRACT_F32

V_MAX3_U32

V_SUB_NC_I32

V_SUB_NC_U32

V_FRACT_F64

V_MAXMIN_F16

V_SUB_NC_U16

V_XNOR_B32

V_FREXP_EXP_I16_F16

V_MAXMIN_F32

V_TRIG_PREOP_F64

V_XOR_B32

V_FREXP_EXP_I32_F32

V_MAXMIN_I32

V_WRITELANE_B32

V_FREXP_EXP_I32_F64

V_MAXMIN_U32

V_XOR_B16

V_FREXP_MANT_F16

V_MED3_F16

V_FREXP_MANT_F32

V_MED3_F32

V_FREXP_MANT_F64

V_MED3_I16

V_LOG_F16

V_MED3_I32

V_LOG_F32

V_MED3_U16

V_MOVRELD_B32

V_MED3_U32

V_MOVRELSD_2_B32

V_MIN3_F16

V_MOVRELSD_B32

V_MIN3_F32

V_MOVRELS_B32

V_MIN3_I16

V_MOV_B16

V_MIN3_I32

V_MOV_B32

V_MIN3_U16

V_NOP

V_MIN3_U32

V_NOT_B16

V_MINMAX_F16

V_NOT_B32

V_MINMAX_F32

V_PERMLANE64_B32

V_MINMAX_I32

V_PIPEFLUSH

V_MINMAX_U32

V_RCP_F16

V_MQSAD_PK_U16_U8

V_RCP_F32

V_MQSAD_U32_U8

V_RCP_F64

V_MSAD_U8

V_RCP_IFLAG_F32

V_MULLIT_F32

V_READFIRSTLANE_B32

V_OR3_B32

V_RNDNE_F16

V_PERMLANE16_B32

V_RNDNE_F32

V_PERMLANEX16_B32

V_RNDNE_F64

V_PERM_B32

V_RSQ_F16

V_QSAD_PK_U16_U8

V_RSQ_F32

V_SAD_HI_U8

V_RSQ_F64

V_SAD_U16

V_SAT_PK_U8_I16

V_SAD_U32

V_SIN_F16

V_SAD_U8

V_SIN_F32

V_XAD_U32

V_SQRT_F16

V_XOR3_B32

V_SQRT_F32
V_SQRT_F64
V_SWAPREL_B32
V_SWAP_B16
V_SWAP_B32
V_TRUNC_F16
V_TRUNC_F32
V_TRUNC_F64

VOPC - Compare Ops
VOPC writes to VCC, VOP3 writes compare result to any SGPR

V_CMP (write VCC), V_CMPX (write exec)
  Data types: I16, I32, I64, U16, U32, U64
  Compare operations: F, LT, EQ, LE, GT, LG, GE, T

V_CMP (write VCC), V_CMPX (write exec)
  Data types: F16, F32, F64
  Compare operations: F, LT, EQ, LE, GT, LG, GE, T, O, U, NGE, NLG, NGT, NLE, NEQ, NLT
  (T = True, F = False, O = total order, U = unordered, "N" = Not (inverse) compare)

V_CMP_CLASS (write VCC), V_CMPX_CLASS (write exec)
  Data types: F16, F32, F64
  Test for any combination of: signaling-NaN, quiet-NaN, positive or negative: infinity, normal,
  subnormal, zero.

7.4. 16-bit Math and VGPRs
VALU instructions that operate on 16-bit data (non-packed) can separately address the two halves of a 32-bit
VGPR.
Pairs of 16-bit VGPRs are packed into a 32-bit VGPR: the 32-bit VGPR "V0" contains two 16-bit VGPRs, "V0.L"
representing V0[15:0] and "V0.H" representing V0[31:16].
How this addressing is encoded in the ISA varies by the instruction encoding: The 16-bit instructions can be
encoded using VOP1/2/C as well as VOP3/VOP3P/VINTERP.
16bit VGPR Naming
The 32-bit VGPR is "V0". The two halves are called "V0.L" and "V0.H".
VOP1, VOP2, VOPC Encoding
16-bit VGPRs are encoded as:
SRC/DST[6:0] = 32-bit VGPR address;
SRC/DST[7] = (1=hi, 0=lo half)
In this encoding, only 256 16-bit VGPRs can be addressed.
VOP3, VOP3P, VINTERP
16-bit VGPRs are encoded as:
SRC/DST[7:0] = 32-bit VGPR address, OPSEL = high/low.
In this encoding, a wave can address 512 16-bit VGPRs.
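
A minimal sketch of the VOP1/VOP2/VOPC form of this addressing, in Python. The helper names are hypothetical and the sketch models only the 8-bit field described above (bit 7 selects the half, bits 6:0 give the 32-bit VGPR address); it does not model the rest of the source operand encoding.

```python
def encode_vgpr16_vop12c(vgpr, high):
    """Pack a 32-bit VGPR address (0..127) and a half select into the 8-bit field."""
    if not 0 <= vgpr <= 0x7F:
        raise ValueError("only VGPRs 0..127 are reachable (256 16-bit VGPRs in total)")
    return (int(bool(high)) << 7) | vgpr

def decode_vgpr16_vop12c(field):
    """Return (32-bit VGPR address, is_high_half) for an 8-bit field value."""
    return field & 0x7F, bool((field >> 7) & 1)
```

For example, V5.H would be encoded as 0x85 and V5.L as 0x05 under this model.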
The packing shown below allows reading or writing in one cycle:
• 32 lanes of one 32-bit VGPR: V0
• 64 lanes of one 16-bit VGPR: V0.L
• 32 lanes of two 16-bit VGPRs (a pair, as used by packed math): V0.L and V0.H

7.5. Packed Math
Packed math is a form of operation that accelerates arithmetic on two values packed into the same VGPR. It
performs operations on two 16-bit values within a DWORD as if they were separate threads. For example, a
packed add of V0=V1+V2 is really two separate adds: adding the low 16 bits of each DWORD and storing the
result in the low 16 bits of V0, and adding the high halves and storing the result in the high 16 bits of V0.
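
The packed add described above can be illustrated for one lane with a short Python sketch. The function name is hypothetical, and the sketch assumes an unsigned 16-bit add without clamping, so each half wraps modulo 2^16 and no carry crosses from the low half into the high half.

```python
def pk_add_u16(a, b):
    """Add the low and high 16-bit halves of two DWORDs independently (no clamp)."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF                    # low halves
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF    # high halves
    return (hi << 16) | lo
```

Note that `pk_add_u16(0x0001FFFF, 0x00010001)` yields 0x00020000 in this model: the low half wraps to zero without affecting the high half.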
Packed math uses the instructions below and the microcode format "VOP3P". This format has OPSEL and NEG
fields for both the low and high operands, and does not have ABS and OMOD.
Table 29. Packed Math Opcodes
Packed Math ops
V_PK_MUL_F16       V_PK_FMA_F16       V_PK_MIN_F16
V_PK_ADD_F16       V_PK_FMAC_F16      V_PK_MAX_F16
V_PK_ADD_I16       V_PK_MAD_I16       V_PK_MIN_I16       V_PK_LSHLREV_B16
V_PK_ADD_U16       V_PK_MAD_U16       V_PK_MIN_U16       V_PK_LSHRREV_B16
V_PK_SUB_I16       V_PK_MUL_LO_U16    V_PK_MAX_I16       V_PK_ASHRREV_I16
V_PK_SUB_U16       V_PK_MAX_U16

V_FMA_MIX_F32      V_DOT2_F32_BF16    V_WMMA_F32_16X16X16_F16
V_FMA_MIXLO_F16    V_DOT2_F32_F16     V_WMMA_F32_16X16X16_BF16
V_FMA_MIXHI_F16    V_DOT4_I32_IU8     V_WMMA_F16_16X16X16_F16
                   V_DOT4_U32_U8      V_WMMA_BF16_16X16X16_BF16
                   V_DOT8_I32_IU4     V_WMMA_I32_16X16X16_IU8
                   V_DOT8_U32_U4      V_WMMA_I32_16X16X16_IU4


V_FMA_MIX_* and WMMA instructions are not packed math, but perform a single MAD
operation on a mixture of 16- and 32-bit inputs. They are listed here because they use the
VOP3P encoding.



VOP3P Instruction Fields
Field      Size  Description
OP         7     Instruction opcode.
SRC0       9     First instruction argument. May come from: vgpr, sgpr, VCC, M0, exec or a constant.
                 WMMA: must be a VGPR.
SRC1       9     Second instruction argument. May come from: vgpr, sgpr, VCC, M0, exec or a constant.
                 WMMA: must be a VGPR.
SRC2       9     Third instruction argument. May come from: vgpr, sgpr, VCC, M0, exec or a constant.
VDST       8     VGPR that takes the result.
                 For V_READLANE, indicates the SGPR that receives the result.
NEG        3     Negate the input (invert sign bit) for the lower-16bit operand. Float inputs only.
                 Bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2.
                 For V_FMA_MIX_* opcodes, this modifies all inputs.
                 For DOT…IU… and WMMA…IU…, NEG[1:0] = signed(1)/unsigned(0) for src0 and src1,
                 and NEG[2] behavior is undefined.
NEG_HI     3     Negate the input (invert sign bit) for the higher-16bit operand. Float inputs only.
                 Bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2.
                 For V_FMA_MIX_* opcodes, this acts as an ABS (absolute value) modifier.
                 For DOT…IU… and WMMA…IU…, NEG_HI behavior is undefined.
OPSEL      3     Bits [13:11].
                 Select the high (1) or low (0) operand as input to the operation that results in the
                 lower half of the destination. [0] = src0, [1] = src1, [2] = src2.
                 If either the source operand or destination operand is 32 bits, the corresponding OPSEL
                 bit must be set to zero. This rule does not apply to MIX instructions, which have a
                 unique interpretation of OPSEL. See notes below.
                 OPSEL works for 16-bit VGPR, SGPR and literal-constant sources; for inline constant
                 sources OPSEL must be zero (value only exists in lower 16 bits).
                 OPSEL[0] and [1] are unused for WMMA ops, and OPSEL[2] is used only with WMMA ops with
                 16-bit output to control whether the C matrix is read from upper or lower bits in the
                 VGPR, and whether the D matrix is stored into upper or lower bits.
OPSEL_HI   3     Bits {[60:59],[14]}.
                 Select the high (1) or low (0) operand as input to the operation that results in the
                 upper half of the destination. [0] = src0, [1] = src1, [2] = src2.
                 Concatenation of ISA fields { OPSLH, OPSLH0 }.
                 If either the source operand or destination operand is 32 bits or is a constant, the
                 corresponding OPSEL_HI bit must be set to zero. This rule does not apply to MIX
                 instructions, which have a unique interpretation of OPSEL. See notes below.
CLMP       1     Clamp result.
                 Float arithmetic: clamp result to [0, 1.0]; -0 is clamped to +0.
                 Signed integer arithmetic: clamp result to [min_int, +max_int]
                 Unsigned integer arithmetic: clamp result to [0, +max_uint]
                 Where "min_int" and "max_int" are the largest negative and positive representable
                 integers for the size of integer being used (16, 32 or 64 bit). "max_uint" is the
                 largest unsigned int.


OPSEL for MIX instructions
MIX, MIXLO and MIXHI interpret OPSEL and OPSEL_HI as three 2-bit fields, one per source operand:
{ OPSEL_HI[0], OPSEL[0] } controls source0;
{ OPSEL_HI[1], OPSEL[1] } controls source1;
{ OPSEL_HI[2], OPSEL[2] } controls source2.
These 2-bit fields control source-selection for each of the 3 source operands:
2’b00: Src[31:0] as FP32
2’b01: Src[31:0] as FP32
2’b10: Src[15:0] as FP16
2’b11: Src[31:16] as FP16
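
The 2-bit selection above can be sketched in Python as follows. The function name is hypothetical; the sketch only combines the OPSEL_HI and OPSEL bits for one source operand and looks up the resulting selection from the list above.

```python
MIX_SOURCE_SELECT = {
    0b00: "Src[31:0] as FP32",
    0b01: "Src[31:0] as FP32",
    0b10: "Src[15:0] as FP16",
    0b11: "Src[31:16] as FP16",
}

def mix_source_select(opsel, opsel_hi, src_index):
    """Return the source selection for operand src_index (0, 1 or 2) of a MIX instruction."""
    if src_index not in (0, 1, 2):
        raise ValueError("src_index must be 0, 1 or 2")
    # The 2-bit field is { OPSEL_HI[src_index], OPSEL[src_index] }.
    field = (((opsel_hi >> src_index) & 1) << 1) | ((opsel >> src_index) & 1)
    return MIX_SOURCE_SELECT[field]
```

For example, OPSEL = 3'b001 with OPSEL_HI = 3'b011 selects Src[31:16] as FP16 for source0, Src[15:0] as FP16 for source1, and the full 32-bit FP32 value for source2.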
V_WMMA…IU… and V_DOT4…IU… with NEG:
These instructions use the NEG[1:0] bits to indicate signedness per input source (0=unsigned, 1=signed)
instead of meaning "negate". NEG[2] should be set to zero (its behavior is undefined). NEG_HI must be zero.

7.5.1. Inline Constants with Packed Math
Inline constants may be used with packed math, but they require the use of OPSEL. Inline constants produce a
value in only the low 16-bits of the 32-bit constant value. Inline constants used with float 16-bit sources produce
an F16 constant value. Without using OPSEL, only the lower half of the source would contain the constant. To
use the inline constant in both halves, use OPSEL to select the lower input for both low and high sources.
BF16 uses 32-bit float constants and then the BF16 operand selects the upper 16 bits of the FP32 constant
(matches the definition of BF16).
For the WMMA_F16_F16_16x16x16 or VOPD DOT2_F32_F16, hardware automatically selects the low 16 bits of
the constant.
Any packed math instructions that use data sizes less than 16 bits do not work with inline constants, other than
the DOT instructions below:
Opcode             Inline                                                   OPSEL
DOT4_I32_IU8       use 32bit inline src0/1 (ignore OPSEL)                   OPSEL/OPSEL_HI on src0/1
DOT8_I32_IU4       use 32bit inline src0/1 (ignore OPSEL)                   OPSEL/OPSEL_HI on src0/1
DOT4_U32_U8        use 32bit inline src0/1 (ignore OPSEL)                   OPSEL/OPSEL_HI on src0/1
DOT8_U32_U4        use 32bit inline src0/1 (ignore OPSEL)                   OPSEL/OPSEL_HI on src0/1
DOT2_F32_F16       use FP32 inline, supports OPSEL                          OPSEL/OPSEL_HI on src0/1
DOT2_F32_BF16      upper16(FP32)/same as replicate (src0/1) ignore OPSEL    OPSEL/OPSEL_HI on src0/1
DOT2ACC_F32_F16    Duplicate lo to hi, ignore OPSEL                         none
DOT2ACC_F32_BF16   Duplicate lo to hi, ignore OPSEL                         none

7.6. Dual Issue VALU
The VOPD instruction encoding allows a single shader instruction to encode two separate VALU operations
that are executed in parallel. The two operations must be independent of each other. This instruction has
certain restrictions that must be met; hardware does not function correctly if they are not. This instruction
format is legal only for wave32. It must not be used in wave64 mode, where it is skipped.
The instruction defines 2 operations, named "X" and "Y", each with their own sources and destination VGPRs.
The two operations packed into this one instruction are referred to as OpcodeX and OpcodeY.
• OpcodeX sources data from SRC0X (a VGPR, SGPR or constant), and SRC1X (a VGPR);
• OpcodeY sources data from SRC0Y (a VGPR, SGPR or constant), and SRC1Y (a VGPR).
The two instructions in the VOPD are executed at the same time, so there are no races between them if one
reads a VGPR and the other writes the same VGPR. The 'read' gets the old value.
Restrictions:
• Each of the two instructions may use up to 2 VGPRs
• Each instruction in the pair may use at most 1 SGPR or they may share a single literal
◦ Legal combinations for the dual-op: at most 2 SGPRs, or 1 SGPR + 1 literal, or share a literal.
• SRC0 can be either a VGPR or SGPR (or constant)
• VSRC1 can only be a VGPR
• Instructions must not exceed the VGPR source-cache port limits
◦ There are 4 VGPR banks (indexed by SRC[1:0]), and each bank has a cache
◦ Each cache has 3 read ports: one dedicated to SRC0, one dedicated to SRC1 and one for SRC2
▪ A cache can read all 3 of them at once, but it can’t read two SRC0’s at once (or SRC1/2).
◦ SRCX0 and SRCY0 must use different VGPR banks;
◦ VSRCX1 and VSRCY1 must use different banks.
▪ FMAMK is an exception : V = S0 + K * S1 ("S1" uses the SRC2 read port)
◦ If both operations use the SRC2 input, then one SRC2 input must be even and the other SRC2 input
must be odd. The following operations use SRC2: FMAMK_F32 (second input operand);
DOT2ACC_F32_F16, DOT2ACC_F32_BF16, FMAC_F32 (destination operand).
◦ These are hard rules - the instruction does not function if these rules are broken
• The pair of instructions combined have the following restrictions:
◦ At most one literal constant, or they may share the same literal
◦ Dest VGPRs: one must be even and the other odd
◦ The instructions must be independent of each other
• Must not use DPP
• Must be wave32.
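
A few of the register restrictions above can be checked mechanically. The Python sketch below is illustrative only: the function names are hypothetical, and it covers just the destination even/odd rule and the SRC0/VSRC1 bank rules (bank = low two bits of the VGPR address). It does not check the SGPR and literal limits, the SRC2 even/odd rule, the FMAMK exception, DPP use, or the independence of the two operations.

```python
def vgpr_bank(vgpr):
    """VGPR bank index: there are 4 banks, indexed by the low two address bits."""
    return vgpr & 0x3

def check_vopd_pair(vdst_x, vdst_y, src0_x, src0_y, vsrc1_x, vsrc1_y):
    """Return a list of violated rules. src0_x / src0_y are VGPR numbers,
    or None when that source is an SGPR or a constant."""
    problems = []
    if (vdst_x & 1) == (vdst_y & 1):
        problems.append("destination VGPRs must be one even and one odd")
    if src0_x is not None and src0_y is not None and vgpr_bank(src0_x) == vgpr_bank(src0_y):
        problems.append("SRC0X and SRC0Y must use different VGPR banks")
    if vgpr_bank(vsrc1_x) == vgpr_bank(vsrc1_y):
        problems.append("VSRC1X and VSRC1Y must use different VGPR banks")
    return problems
```

An empty list means none of the modeled rules is violated; it does not by itself show that a pair is legal.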

VOPD Instruction Fields
Field    Size  Description
opX      4     Instruction opcode for the X operation.
opY      5     Instruction opcode for the Y operation.
src0X    9     Source 0 for X operation. May be a VGPR, SGPR, exec, inline or literal constant.
src0Y    9     Source 0 for Y operation. May be a VGPR, SGPR, exec, inline or literal constant.
vsrc1X   8     Source 1 for X operation. Must be a VGPR. Ignored for V_MOV_B32.
vsrc1Y   8     Source 1 for Y operation. Must be a VGPR. Ignored for V_MOV_B32.
vdstX    8     Destination VGPR for X operation.
vdstY    7     Destination VGPR for Y operation. vdstY specifies bits [7:1]. The LSB of the destination
               address is !vdstX[0]. vdstX and vdstY: one must be an even and the other an odd VGPR.
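
The vdstY rule can be illustrated with a short Python sketch that reconstructs the full Y destination address from the 7-bit field and vdstX. The function name is hypothetical.

```python
def vopd_vdst_y(vdst_x, vdst_y_field):
    """Full Y destination VGPR: bits [7:1] come from the vdstY field, bit 0 is !vdstX[0]."""
    if not 0 <= vdst_x <= 0xFF:
        raise ValueError("vdstX is an 8-bit field")
    if not 0 <= vdst_y_field <= 0x7F:
        raise ValueError("vdstY is a 7-bit field")
    lsb = (vdst_x & 1) ^ 1  # inverse of the low bit of vdstX
    return (vdst_y_field << 1) | lsb
```

With vdstX = 4 (even) and a vdstY field of 3, the Y destination is VGPR 7; with vdstX = 5 (odd) it is VGPR 6.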

See VOPD for a list of opcodes usable in the X and Y opcode fields.
V_CNDMASK_B32 is the "VOP2" form that uses VCC as the select. VCC counts as one SGPR read.
VOPD instruction pairs generate only a single exception if either or both raise an exception.

7.7. Data Parallel Processing (DPP)
Data Parallel Processing (DPP) operations allow VALU instructions to select operands from different lanes
(threads) rather than just using a thread’s own data. DPP operations are indicated by the use of the inline
constant DPP8 or DPP16 in the SRC0 operand. Note that since SRC0 is set to the DPP value, the actual VGPR
address for SRC0 comes from the DPP DWORD.
One example of using DPP is for scan operations. A scan operation is one that computes a value per thread that
is based on the values of the previous threads and possibly itself. E.g. a running sum is the sum of the values
from previous threads in the vector. A reduction operation is essentially a scan that returns a single value from
the highest numbered active thread. A scan operation requires that the EXEC mask be set to all 1’s for proper
operation. Unused threads (lanes) should be set to a value that does not change the result prior to the scan.
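
To make the scan idea concrete, here is a small Python simulation of an inclusive running sum over one row of 16 lanes. It is a sketch under stated assumptions, not shader code: it models a row shift right with zero fill (the behavior described for bound_ctrl = 1) as a list operation, assumes all lanes are active, and uses a log-step (shift by 1, 2, 4, 8) scheme that is one common way to build a scan from lane shifts. The function names are hypothetical.

```python
ROW_SIZE = 16  # DPP16 swizzles operate within rows of 16 lanes

def row_shift_right(values, amount):
    """Each lane reads the lane `amount` positions below it; lanes with no source read 0."""
    return [values[lane - amount] if lane >= amount else 0 for lane in range(ROW_SIZE)]

def inclusive_row_scan(values):
    """Inclusive running sum across one 16-lane row using shifts of 1, 2, 4 and 8."""
    if len(values) != ROW_SIZE:
        raise ValueError("expected exactly 16 lane values")
    result = list(values)
    shift = 1
    while shift < ROW_SIZE:
        shifted = row_shift_right(result, shift)
        result = [a + b for a, b in zip(result, shifted)]
        shift *= 2
    return result
```

The last lane of the result holds the reduction (the sum of all 16 inputs), which matches the description of a reduction as a scan read from the highest numbered active thread.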
There are two forms of the DPP instruction word:
DPP8

allows arbitrary swizzling between groups of 8 lanes

DPP16

allows a set of predefined swizzles between groups of 16 lanes

DPP may be used only with: VOP1, VOP2, VOPC, VOP3 and VOP3P (but not "packed math" ops).
DPP instructions incur an extra cycle of delay to execute.
Table 30. Which instructions support DPP
Encoding   Opcodes                                   Rule
VOP1       All 64-bit opcodes                        NO DPP
           READFIRSTLANE_B32                         NO DPP
           SWAP_B32                                  NO DPP
           PIPEFLUSH                                 NO DPP
           WRITELANE_REGWR_B32                       NO DPP
           PERMUTE64                                 NO DPP
           All others                                Allow DPP
VOP2       All 64-bit opcodes                        NO DPP
           FMAMK/AD_F32/16                           NO DPP
           All others                                Allow DPP
VOP3       All 64-bit opcodes                        NO DPP
           MUL_LO_U32                                NO DPP
           MUL_HI_U32                                NO DPP
           MUL_HI_I32                                NO DPP
           QSAD_PK_U16_U8                            NO DPP
           MQSAD_PK_U16_U8                           NO DPP
           MQSAD_U32_U8                              NO DPP
           READLANE_REGRD_B32                        NO DPP
           READLANE_B32                              NO DPP
           WRITELANE_B32                             NO DPP
           PERMLANE16_B32                            NO DPP
           PERMLANEX16_B32                           NO DPP
           The others                                Allow DPP
VOP3P      V_DOT4_I32_IU8, V_DOT4_U32_U8,            NO DPP
           V_DOT8_I32_IU4, V_DOT8_U32_U4,
           V_PK_*, WMMA
           All others:                               Allow DPP
           V_FMA_MIX_*, V_DOT2_F32_{BF16, F16}
VINTERP    ALL                                       NO DPP
VOPD       ALL                                       NO DPP
VOPC       All 64-bit opcodes                        NO DPP
           The others                                Allow DPP

V_CMP and V_CMPX write the full mask, not a partial mask. When DPP is used with V_CMP or V_CMPX and
bound_ctrl=0, lanes that have their EXEC mask bit set to zero have a zero bit written, rather than the bit
being left unwritten. "FI" (Fetch Inactive) with DPP16 causes a lane to act as if it is active when supplying
data, but the compare result for that lane is still zero for V_CMPX (V_CMPX with FI=1 does not turn on a lane
that was off).

7.7.1. DPP16
DPP16 allows selection of data within groups of 16 lanes with a fixed set of possible swizzle patterns.
Both VOP3/VOP3P and DPP16 have ABS and NEG fields:
• VOP3’s ABS & NEG fields are used, and DPP16’s are ignored
• VOP3P’s NEG/NEG_HI fields are used and DPP16’s ABS & NEG are ignored.
DPP16 Instruction Fields
Field       Bits    Description
row_mask    31:28   Applies to the VGPR destination write only; does not impact the thread mask when
                    fetching source VGPR data. For VOPC, the SGPR/VCC bit associated with the disabled
                    lane receives zero.
                    31==0: lanes[63:48] are disabled (wave64 only)
                    30==0: lanes[47:32] are disabled (wave64 only)
                    29==0: lanes[31:16] are disabled
                    28==0: lanes[15:0] are disabled
bank_mask   27:24   Applies to the VGPR destination write only; does not impact the thread mask when
                    fetching source VGPR data. For VOPC, the SGPR/VCC bit associated with the disabled
                    lane receives zero.
                    In wave32 mode:
                    27==0: lanes[12:15, 28:31] are disabled
                    26==0: lanes[8:11, 24:27] are disabled
                    25==0: lanes[4:7, 20:23] are disabled
                    24==0: lanes[0:3, 16:19] are disabled
                    In wave64 mode:
                    27==0: lanes[12:15, 28:31, 44:47, 60:63] are disabled
                    26==0: lanes[8:11, 24:27, 40:43, 56:59] are disabled
                    25==0: lanes[4:7, 20:23, 36:39, 52:55] are disabled
                    24==0: lanes[0:3, 16:19, 32:35, 48:51] are disabled
                    Notice: the term "bank" here is not the same as was used for the VGPR bank.
src1_imod   23:22   23: Apply Absolute value to SRC1
                    22: Apply Negate to SRC1 (done after absolute value)
src0_imod   21:20   21: Apply Absolute value to SRC0
                    20: Apply Negate to SRC0 (done after absolute value)
BC          19      Bound_ctrl is used to determine what a thread should do if its source operand is
                    from a disabled thread or invalid input: use the value zero, or disable the write.
                    For example, a right shift into lane 0 is an invalid input, so the VALU uses
                    Bound_ctrl to decide if lane 0’s src0 should be 0 or if its VGPR write enable should
                    be disabled.
                    19==0: Do not write when source is invalid or out-of-range (DPP_BOUND_OFF)
                    19==1: Use zero as input if source is invalid or out-of-range (DPP_BOUND_ZERO)
FI          18      Fetch inactive lane behavior:
                    18==0: If source lane is invalid (disabled thread or out-of-range), use "bound_ctrl"
                    to determine the source value.
                    18==1: If the source lane is disabled, fetch the source value anyway (ignoring the
                    bound_ctrl bit). If the source lane is out-of-range, behavior is decided by the
                    bound_ctrl bit.
rsvd        17      Reserved
dpp_ctrl    16:8    Data Share control word:
                    DPP_QUAD_PERM{00:FF}      000-0FF
                    DPP_UNUSED                100
                    DPP_ROW_SL{1:15}          101-10F
                    DPP_ROW_SR{1:15}          111-11F
                    DPP_ROW_RR{1:15}          121-12F
                    DPP_ROW_MIRROR            140
                    DPP_ROW_HALF_MIRROR       141
                    DPP_ROW_SHARE{0:15}       150-15F
                    DPP_ROW_XMASK{0:15}       160-16F
Src0        7:0     VGPR address of srcA operand
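
The row_mask and bank_mask lane lists above follow a regular pattern, which the Python sketch below captures. The function name is hypothetical; the masks are passed as 4-bit values in which bit 0 corresponds to instruction bit 28 (row_mask) or bit 24 (bank_mask).

```python
def dpp16_write_enabled(lane, row_mask, bank_mask):
    """True if the destination write for `lane` is enabled by both masks."""
    row = lane // 16          # row 0 = lanes 0-15, row 1 = lanes 16-31, ...
    bank = (lane % 16) // 4   # bank 0 = lanes 0-3 of each row, bank 1 = lanes 4-7, ...
    return bool((row_mask >> row) & 1) and bool((bank_mask >> bank) & 1)
```

For example, with row_mask = 4'b0001 and bank_mask = 4'b0010, only lanes 4-7 have their destination write enabled.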

Table 31. BC and FI Behavior
BC  FI  Source lane out-of-range   Source lane in-range but disabled   Source lane in-range and active
0   0   Disable write              Disable write                       Normal
1   0   Src0 = 0                   Src0 = 0                            Normal
0   1   Src0 = 0                   Normal                              Normal
1   1   Normal                     Normal                              Normal

Where "out of range" means the lane offset goes outside a group of 16 lanes (e.g. 0..15, or 16..31).

7.7.2. DPP8
DPP8 allows arbitrary cross-lane swizzling within groups of 8 lanes. There are two forms of DPP8: normal,
which reads zero from lanes whose EXEC mask bit is zero, and DPP8FI, which fetches data from inactive lanes
instead of using the value zero.
DPP8 follows DPP16’s "BC = 1" behavior and assumes all source lanes are in-range.
DPP8 Instruction Fields
Field         Size  Description
SRC           8     Source 0 (VGPR). Since the VOP1/VOP2 source0 slot was filled with the constant "DPP"
                    or "DPPFI", this field provides the actual source-0 vgpr.
SEL0 - SEL7   3     (each) Selects which lane to pull data from, within a group of 8 lanes.
                    SEL0 selects which lane to read from to supply data into lane 0.
                    SEL1 selects which lane to read from to supply data into lane 1.
                    etc.
                    0 = read from lane 0, 1 = read from lane 1, … 7 = read from lane 7.
                    Lanes 0-7 can pull from any of lanes 0-7; lanes 8-15 can pull from lanes 8-15, etc.
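
The DPP8 selection can be simulated in a few lines of Python. This is a sketch with hypothetical names: it computes the source value each lane would see, returning zero for sources whose EXEC bit is zero in the normal form and the actual value in the fetch-inactive (DPP8FI) form. It does not model whether the destination lane itself is written.

```python
def dpp8_select(values, sel, exec_mask=None, fetch_inactive=False):
    """values: per-lane source data; sel: list of eight 3-bit selects (SEL0..SEL7)."""
    if len(values) % 8 != 0 or len(sel) != 8:
        raise ValueError("expected a multiple of 8 lanes and exactly 8 selects")
    result = []
    for lane in range(len(values)):
        group_base = lane - (lane % 8)               # first lane of this group of 8
        src_lane = group_base + (sel[lane % 8] & 0x7)
        inactive = exec_mask is not None and not exec_mask[src_lane]
        if inactive and not fetch_inactive:
            result.append(0)                         # normal DPP8 reads zero
        else:
            result.append(values[src_lane])          # DPP8FI fetches the data anyway
    return result
```

For example, selects of 7, 6, 5, 4, 3, 2, 1, 0 reverse the data within each group of 8 lanes.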

7.8. VGPR Indexing
The VALU provides a set of instructions that move or swap VGPRs where the source, dest or both are indexed
by a value in the M0 register. Indices are unsigned.
Table 32. VGPR Indexing Instructions
Instruction        Index             Function
V_MOVRELD_B32      M0[31:0]          Move with relative destination:
                                     VGPR[dst + M0[31:0]] = VGPR[src]
V_MOVRELS_B32      M0[31:0]          Move with relative source:
                                     VGPR[dst] = VGPR[src + M0[31:0]]
V_MOVRELSD_B32     M0[31:0]          Move with relative source and destination:
                                     VGPR[dst + M0[31:0]] = VGPR[src + M0[31:0]]
V_MOVRELSD_2_B32   Src: M0[9:0]      Move with relative source and destination, each different:
                   Dst: M0[25:16]    VGPR[dst + M0[25:16]] = VGPR[src + M0[9:0]]
V_SWAPREL_B32      Src: M0[9:0]      Swap two VGPRs, each relative to a separate index:
                   Dst: M0[25:16]    tmp = VGPR[src + M0[9:0]]
                                     VGPR[src + M0[9:0]] = VGPR[dst + M0[25:16]]
                                     VGPR[dst + M0[25:16]] = tmp
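
The two-index forms can be illustrated in Python by modeling the VGPR file of one lane as a list. The function names are hypothetical, and out-of-range handling is not modeled.

```python
def _split_m0(m0):
    """Return (source index, destination index) from M0[9:0] and M0[25:16]."""
    return m0 & 0x3FF, (m0 >> 16) & 0x3FF

def movrelsd_2(vgprs, dst, src, m0):
    """VGPR[dst + M0[25:16]] = VGPR[src + M0[9:0]]"""
    src_index, dst_index = _split_m0(m0)
    vgprs[dst + dst_index] = vgprs[src + src_index]

def swaprel(vgprs, dst, src, m0):
    """Swap VGPR[src + M0[9:0]] with VGPR[dst + M0[25:16]]."""
    src_index, dst_index = _split_m0(m0)
    a, b = src + src_index, dst + dst_index
    vgprs[a], vgprs[b] = vgprs[b], vgprs[a]
```

With M0 = (3 << 16) | 5, dst = 10 and src = 0, the move copies VGPR 5 into VGPR 13, and the swap exchanges those two registers.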

7.9. Wave Matrix Multiply Accumulate (WMMA)
Wave Matrix Multiply-Accumulate (WMMA) instructions provide acceleration for common matrix arithmetic
operations. The instructions are encoded using the VOP3P encoding.

These perform: A * B + C ⇒ D, where A, B, C and D are matrices.
WMMA does not generate any ALU exceptions.
The NEG[1:0] field is repurposed for the "IU" integer types to indicate whether the inputs are signed or not
(0=unsigned, 1=signed). For WMMA_*IU8/IU4, NEG[1:0] indicates whether SRC0 and SRC1 are signed or unsigned:
NEG[0] applies to the A-matrix and NEG[1] to the B-matrix. NEG[2] and NEG_HI[2:0] must be zero. The destination
is signed for the integer types. For WMMA*_F16/BF16, NEG[1:0] is applied to the low 16 bits of SRC0 and SRC1,
and NEG_HI[1:0] is applied to the high 16 bits of SRC0 and SRC1. {NEG_HI[2], NEG[2]} is applied to SRC2 and
acts as {ABS, NEG}.
Table 33. WMMA Instructions
Instruction                 Matrix A     Matrix B     Matrix C     Result Matrix
V_WMMA_F32_16X16X16_F16     16x16 F16    16x16 F16    16x16 F32    16x16 F32
V_WMMA_F32_16X16X16_BF16    16x16 BF16   16x16 BF16   16x16 F32    16x16 F32
V_WMMA_F16_16X16X16_F16     16x16 F16    16x16 F16    16x16 F16    16x16 F16
V_WMMA_BF16_16X16X16_BF16   16x16 BF16   16x16 BF16   16x16 BF16   16x16 BF16
V_WMMA_I32_16X16X16_IU8     16x16 IU8    16x16 IU8    16x16 I32    16x16 I32
V_WMMA_I32_16X16X16_IU4     16x16 IU4    16x16 IU4    16x16 I32    16x16 I32

"IU4" and "IU8" mean that the operand is either signed or unsigned (4 or 8 bits) as indicated by the NEG bits.
These instructions work over multiple cycles to compute the result matrix and internally use the DOT
instructions. In order to achieve this performance, the user must arrange the data such that:
• A and B matrices: lanes 0-15 data are replicated into lanes 16-31 (for wave64: also into lanes 32-47 and 48-63).
WMMA supports only round-to-nearest-even rounding for float types.
Inline constants: can only be used for C-matrix. For F16 and BF16, the inline value is replicated into both low
and high halves of the DWORD.
Back-to-back dependent WMMA instructions require one V_NOP (or independent VALU op) between them if
the first instruction’s matrix D is the same or overlaps with the second instruction’s matrices A or B. Matrix A/B
can overlap C as long as C is distinct from D. The typical case is that C and D are the same.
Simplified example of matrix multiplication on 4x4 matrices:

The diagram below shows the A, B, C and D matrices in the traditional point of view: one row is a horizontal
strip of entries, and columns are a vertical strip. This is the linear algebra view, regardless of layout in memory
or in VGPRs. The matrix operation is defined as: D = A * B + C. Each entry in D is the result of multiplication of
a row from A with a column from B, added to the C value for that entry.
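
The linear algebra definition D = A * B + C can be written out as a short Python reference. This sketch uses plain nested lists and a hypothetical function name; it says nothing about the VGPR layout, the data types or the rounding used by the WMMA instructions.

```python
def matrix_multiply_accumulate(a, b, c):
    """Return D where D[i][j] = sum over k of A[i][k] * B[k][j], plus C[i][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("A must have as many columns as B has rows")
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j] for j in range(cols)]
        for i in range(rows)
    ]
```

Each entry of the result is a row of A multiplied with a column of B, plus the matching entry of C, as described above.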

This diagram below shows how the matrices are laid out in VGPRs when M = N = K = 16. Note that the A matrix
is column-major while the others are in row-major order.

Chapter 8. Scalar Memory Operations
Scalar Memory Load (SMEM) instructions allow a shader program to load data from memory into SGPRs
through the Constant Cache ("Kcache"). Instructions can load from 1 to 16 DWORDs. Data is loaded directly
into SGPRs without any format conversion.
The scalar unit loads consecutive DWORDs from memory to the SGPRs. This is intended primarily for loading
ALU constants and for indirect T#/S# lookup. No data formatting is supported, nor is byte or short data.
Loads come in two forms: one that simply takes a base-address pointer, and the other that uses a vertex-buffer
constant to provide: base, size and stride.

8.1. Microcode Encoding
Scalar memory load instructions are encoded using the SMEM microcode format.

The fields are described in the table below:
Table 34. SMEM Encoding Field Descriptions
Field     Size  Description
OP        8     Opcode. See the next table.
SDATA     7     SGPRs to return load data to.
                • Loads of 2 DWORDs must have an even SDST-sgpr.
                • Loads of 4 or more DWORDs must have their DST-gpr aligned to a multiple of 4.
                • SDATA must be: SGPR or VCC. Not: EXEC, M0 or NULL except for instructions that return
                  nothing: these may use NULL.
SBASE     6     SGPR-pair (SBASE has an implied LSB of zero) that provides a base address, or for BUFFER
                instructions, a set of 4 SGPRs (4-sgpr aligned) that hold the resource constant.
                For BUFFER instructions, the only resource fields used are: base, stride, num_records.
OFFSET    21    Instruction Address Offset: an immediate signed byte offset.
                Negative offsets only work with S_LOAD; a negative offset applied to S_BUFFER results in
                a MEMVIOL.
SOFFSET   7     SGPR that has the 32-bit unsigned byte offset. May only specify an SGPR, M0 or set to
                "NULL" to not use (offset=0).
GLC       1     Globally Coherent.
DLC       1     Device Coherent.

Table 35. SMEM Instructions
Opcode #   Name
0          S_LOAD_B32
1          S_LOAD_B64
2          S_LOAD_B128
3          S_LOAD_B256
4          S_LOAD_B512
8          S_BUFFER_LOAD_B32
9          S_BUFFER_LOAD_B64
10         S_BUFFER_LOAD_B128
11         S_BUFFER_LOAD_B256
12         S_BUFFER_LOAD_B512
32         S_GL1_INV
33         S_DCACHE_INV


These instructions load 1-16 DWORDs from memory. The destination SGPRs are specified in SDATA, and the
address is composed of the SBASE, OFFSET, and SOFFSET fields.

8.1.1. Scalar Memory Addressing
Non-buffer S_LOAD instructions use the following formula to calculate the memory address:
ADDR = SGPR[base] + inst_offset + { M0 or SGPR[offset] or zero }
All components of the address (base, offset, inst_offset, M0) are in bytes, but the two LSBs are ignored and
treated as if they were zero.
A negative inst_offset is illegal, and the result is undefined, when the sum
(inst_offset + (M0 or SGPR[offset])) is negative.
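The S_LOAD address formula above can be sketched in a few lines of Python. This is only an illustrative model, not hardware behavior: the function name s_load_address is hypothetical, and the sketch assumes that the two ignored LSBs are cleared on the final sum and that the address wraps at 64 bits.

```python
MASK64 = (1 << 64) - 1


def s_load_address(base, inst_offset, soffset=0):
    """Model of ADDR = SGPR[base] + inst_offset + {M0 or SGPR[offset] or zero}.

    base        -- 64-bit value held in the SBASE SGPR pair
    inst_offset -- signed 21-bit immediate byte offset (OFFSET field)
    soffset     -- unsigned byte offset from M0 or an SGPR, 0 when NULL
    """
    if not -(1 << 20) <= inst_offset < (1 << 20):
        raise ValueError("inst_offset does not fit in a signed 21-bit field")
    if inst_offset < 0 and inst_offset + soffset < 0:
        # The text calls this case illegal and undefined; the model rejects it.
        raise ValueError("negative combined offset is undefined")
    addr = (base + inst_offset + soffset) & MASK64
    # Assumption: the two LSBs are dropped from the final address.
    return addr & ~0x3
```

For example, s_load_address(0x1000, 0x13) yields 0x1010 because the low two bits of the sum are discarded, and s_load_address(0x1000, -8, 16) yields 0x1008.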

8.1.2. Loads using Buffer Constant
S_BUFFER_LOAD instructions use a similar formula, but the base address comes from the buffer constant’s
base_address field.
Buffer constant fields used: base_address, stride, num_records. Other fields are ignored.
Scalar memory load does not support "swizzled" buffers. Stride is used only for memory address bounds
checking, not for computing the address to access.
SMEM supplies only an SBASE address (byte) and an offset (byte or DWORD). Any "index * stride" must be
calculated manually in shader code and added to the offset prior to the SMEM. Inst_offset must be
non-negative: a negative value of inst_offset results in a MEMVIOL.
The two LSBs of V#.base and of the final address are ignored to force DWORD alignment.
"m_*" components come from the buffer constant (V#):
offset        = OFFSET + SOFFSET (M0, SGPR or zero)
m_base        = { SGPR[SBASE * 2 + 1][15:0], SGPR[SBASE * 2] }
m_stride      = SGPR[SBASE * 2 + 1][31:16]
m_num_records = SGPR[SBASE * 2 + 2]
m_size        = (m_stride == 0 ? 1 : m_stride) * m_num_records
addr          = (m_base & ~0x3) + (offset & ~0x3)
SGPR[SDST]    = load_dword_from_dcache(addr, m_size)
If more than 1 DWORD is being loaded, it is returned to SDST+1, SDST+2, etc.,
and the offset is incremented by 4 bytes per DWORD.
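As a rough illustration of the field extraction above, the following Python sketch decodes the three V# fields that SMEM uses and lists the address of each DWORD fetched. The helper names (decode_vsharp, s_buffer_load_addresses) and the representation of the SGPR file as a Python list of 32-bit integers are assumptions made for this example only.

```python
def decode_vsharp(sgprs, sbase):
    """Return (m_base, m_stride, m_size) from the SGPRs selected by SBASE."""
    lo = sgprs[sbase * 2]
    hi = sgprs[sbase * 2 + 1]
    num_records = sgprs[sbase * 2 + 2]
    m_base = ((hi & 0xFFFF) << 32) | lo      # {SGPR[2n+1][15:0], SGPR[2n]}
    m_stride = (hi >> 16) & 0xFFFF           # SGPR[2n+1][31:16]
    m_size = (1 if m_stride == 0 else m_stride) * num_records
    return m_base, m_stride, m_size


def s_buffer_load_addresses(sgprs, sbase, inst_offset, soffset, num_dwords):
    """Return ([addr per DWORD], m_size) for an S_BUFFER_LOAD of num_dwords."""
    if inst_offset < 0:
        # The text states that a negative inst_offset results in a MEMVIOL.
        raise ValueError("MEMVIOL: negative inst_offset on S_BUFFER_LOAD")
    m_base, _, m_size = decode_vsharp(sgprs, sbase)
    offset = inst_offset + soffset
    # The offset advances by 4 bytes per DWORD; LSBs of base and offset are dropped.
    addrs = [(m_base & ~0x3) + ((offset + 4 * i) & ~0x3) for i in range(num_dwords)]
    return addrs, m_size
```

The returned m_size is the value the cache would use for bounds checking; the sketch does not perform the check itself.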

8.1.3. S_DCACHE_INV and S_GL1_INV
S_DCACHE_INV invalidates the entire scalar cache, and S_GL1_INV invalidates the L1 cache. Neither
instruction has any address or data arguments, and neither returns anything to SDST.

8.2. Dependency Checking
Scalar memory loads can return data out-of-order from how they were issued; they can return partial results at
different times when the load crosses two cache lines. The shader program uses the LGKMcnt counter to
determine when the data has been returned to the SDST SGPRs. This is done as follows.
• LGKMcnt is incremented by 1 for every fetch of a single DWORD, and for every cache invalidate.
• LGKMcnt is incremented by 2 for every fetch of two or more DWORDs.
• LGKMcnt is decremented by an equal amount when each instruction completes.
Because the instructions can return out-of-order, the only sensible way to use this counter is to issue
"S_WAITCNT LGKMcnt 0"; this imposes a wait for all data to return from previous SMEMs before continuing.
Cache invalidate instructions are not assured to have completed until the shader waits for LGKMcnt==0.
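The counting rules above can be modeled with a small Python sketch to show why only a wait for zero is meaningful. The function names are hypothetical and the model tracks nothing but the counter value.

```python
def lgkmcnt_increment(num_dwords):
    """Counter increment for one SMEM instruction (0 DWORDs = cache invalidate)."""
    return 1 if num_dwords <= 1 else 2


def outstanding_after(issued, completed):
    """LGKMcnt after issuing `issued` (DWORD count per instruction) and
    completing the instructions whose positions are listed in `completed`."""
    count = sum(lgkmcnt_increment(n) for n in issued)
    for position in completed:
        count -= lgkmcnt_increment(issued[position])
    return count
```

With two single-DWORD loads in flight, outstanding_after([1, 1], [0]) and outstanding_after([1, 1], [1]) both give 1, so a non-zero counter value does not identify which load has returned. Only a value of 0 shows that every SDST has been written.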

8.3. Scalar Memory Clauses and Groups
A clause is a sequence of instructions starting with S_CLAUSE and continuing for 2-63 instructions. Clauses
lock the instruction arbiter onto this wave until the clause completes.
A group is a set of instructions of the same type that happen to occur together in the code but are not
necessarily executed as a clause. A group ends when a non-SMEM instruction is encountered. Scalar memory
instructions are issued in groups. The hardware does not enforce that a single wave executes an entire group
before issuing instructions from another wave.
Group restrictions:
• INV must be in a group by itself and may not be in a clause.

8.4. Alignment and Bounds Checking
SDST
The value of SDST must be even for fetches of two DWORDs, or a multiple of four for larger fetches. If this
rule is not followed, invalid data can result.
SBASE
The value of SBASE must be even for S_BUFFER_LOAD (specifying the address of an SGPR which is a
multiple of four). If SBASE is out-of-range, the value from SGPR0 is used.
OFFSET
The value of OFFSET has no alignment restrictions.


8.4.1. Address and GPR Range Checking
The hardware checks for both the address being out of range (BUFFER instructions only), and for the source or
destination SGPRs being out of range.
Address out of range
The address is out of range if offset >= ((stride==0 ? 1 : stride) * num_records), where "offset" is:
inst_offset + {M0 or sgpr-offset}. Any DWORDs that are out of range in memory from a buffer_load return
zero. If a multi-DWORD request (e.g. S_BUFFER_LOAD_B256) is partially out of range, the DWORDs that are in
range return data as normal, and the out-of-range DWORDs return zero.
Source SGPR out of range
If any source data is out of the range of SGPRs (either partially or completely), the value 'zero' is used
instead.
Destination SGPR out of range
If the destination SGPR is partially or fully out of range, no data is written back to SGPRs for this
instruction.
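The per-DWORD behavior of a partially out-of-range buffer load can be sketched as follows. This is a simplified Python model under stated assumptions: memory is a hypothetical dict keyed by DWORD-aligned byte offset from the buffer base, and the function name s_buffer_load_data is invented for the example.

```python
def s_buffer_load_data(memory, offset, stride, num_records, num_dwords):
    """Return the DWORDs an S_BUFFER_LOAD would write, zero-filling any
    DWORD whose offset is at or beyond the buffer size."""
    size = (1 if stride == 0 else stride) * num_records
    result = []
    for i in range(num_dwords):
        dword_offset = offset + 4 * i
        if dword_offset >= size:
            result.append(0)                      # out of range: returns zero
        else:
            result.append(memory.get(dword_offset & ~0x3, 0))
    return result
```

For a buffer with stride 0 and num_records 8 (an 8-byte buffer), a four-DWORD load at offset 0 returns the first two DWORDs from memory and zero for the last two.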


Chapter 9. Vector Memory Buffer Instructions
Vector-memory (VM) buffer operations transfer data between the VGPRs and buffer objects in memory
through the texture cache (TC). Vector means that one or more pieces of data are transferred uniquely for
every thread in the wave, in contrast to scalar memory loads that transfer only one value that is shared by all
threads in the wave.
The instruction defines which VGPR(s) supply the addresses for the operation, which VGPRs supply or receive
data from the operation, and a series of SGPRs that contain the memory buffer descriptor (V#). Buffer atomics
have the option of returning the pre-op memory value to VGPRs.
Examples of buffer objects are vertex buffers, raw buffers, stream-out buffers, and structured buffers.
Buffer objects support both homogeneous and heterogeneous data, but no filtering of load-data (no samplers).
Buffer instructions are divided into two groups:
MUBUF: Untyped buffer objects
• Data format is specified in the resource constant.
• Load, store, atomic operations, with or without data format conversion.
MTBUF: Typed buffer objects
• Data format is specified in the instruction.
• The only operations are Load and Store, both with data format conversion.
All buffer operations use a buffer resource constant (V#) that is a 128-bit value in SGPRs. This constant is sent
to the texture cache when the instruction is executed. This constant defines the address and characteristics of
the buffer in memory. Typically, these constants are fetched from memory using scalar memory loads prior to
executing VM instructions, but these constants also can be generated within the shader.
Memory operations of different types (loads, stores) can complete out of order with respect to each other.
Simplified view of buffer addressing
The equation below shows how the memory address is calculated for a buffer access:

Memory instructions return MEMVIOL for any misaligned access when the alignment mode does not allow it.

9.1. Buffer Instructions
Buffer instructions (MTBUF and MUBUF) allow the shader program to load from, and store to, linear buffers in
memory. These operations can operate on data as small as one byte, and up to four DWORDs per work-item.
Atomic operations take data from VGPRs and combine them arithmetically with data already in memory.
Optionally, the value that was in memory before the operation took place can be returned to the shader.


The D16 instruction variants of buffer ops convert the results to and from packed 16-bit values. For example,
BUFFER_LOAD_D16_FORMAT_XYZW writes four 16-bit values into two VGPRs.
Table 36. Buffer Instructions
MTBUF Instructions
TBUFFER_LOAD_FORMAT_{x,xy,xyz,xyzw}         Load from or store to a Typed buffer object.
TBUFFER_STORE_FORMAT_{x,xy,xyz,xyzw}
TBUFFER_LOAD_D16_FORMAT_{x,xy,xyz,xyzw}     Convert data to 16-bits before loading into VGPRs.
TBUFFER_STORE_D16_FORMAT_{x,xy,xyz,xyzw}    Convert data from 16-bits to tex-format before storing to memory.

MUBUF Instructions
BUFFER_LOAD_FORMAT_{x,xy,xyz,xyzw}          Load from or store to an Untyped Buffer object.
BUFFER_STORE_FORMAT_{x,xy,xyz,xyzw}         <size> = I8, U8, I16, U16, B32, B64, B96, B128
BUFFER_LOAD_D16_FORMAT_{x,xy,xyz,xyzw}
BUFFER_STORE_D16_FORMAT_{x,xy,xyz,xyzw}
BUFFER_LOAD_<size> BUFFER_STORE_<size>
BUFFER_{LOAD,STORE}_D16_FORMAT_X
BUFFER_{LOAD,STORE}_D16_HI_FORMAT_X
BUFFER_ATOMIC_<op>                          Buffer object atomic operation. Automatically globally coherent.
                                            Operates on 32bit or 64bit values.
BUFFER_GL{0,1}_INV                          Cache invalidate: either L0 or L1 cache for the CU (L0) and Shader
                                            Array (L1) associated with this wave.

Table 37. Microcode Formats
Field    Bit Size  Description
OP       4         MTBUF: Opcode for Typed buffer instructions.
         8         MUBUF: Opcode for Untyped buffer instructions.
VADDR    8         Address of VGPR to supply first component of address (offset or index). When both index
                   and offset are used, index is in the first VGPR, offset in the second.
VDATA    8         Address of VGPR to supply first component of store data or receive first component of
                   load-data.
SOFFSET  8         SGPR to supply unsigned byte offset. SGPR, M0, NULL, or inline constant.
SRSRC    5         Specifies which SGPR supplies V# (resource constant) in four consecutive SGPRs. This field
                   is missing the two LSBs of the SGPR address, since this address must be aligned to a
                   multiple of four SGPRs.
FORMAT   7         Data Format of data in memory buffer. See: Buffer Image Format Table
OFFSET   12        Unsigned byte offset.
OFFEN    1         1 = Supply an offset from VGPR (VADDR). 0 = Do not (offset = 0).
IDXEN    1         1 = Supply an index from VGPR (VADDR). 0 = Do not (index = 0).
GLC      1         Globally Coherent. Controls how loads and stores are handled by the L0 texture cache.
                   ATOMIC
                   GLC = 0 Previous data value is not returned.
                   GLC = 1 Previous data value is returned.
DLC      1         Device Level Coherent.
SLC      1         System Level Coherent.
TFE      1         Texel Fault Enable for PRT (partially resident textures). When set to 1 and fetch returns a
                   NACK, status is written to the VGPR after the last fetch-dest VGPR.

Table 38. MTBUF Instructions
Opcode                           Description - all address components for buffer ops are uint
TBUFFER_LOAD_FORMAT_X            load X component w/ format convert
TBUFFER_LOAD_FORMAT_XY           load XY components w/ format convert
TBUFFER_LOAD_FORMAT_XYZ          load XYZ components w/ format convert
TBUFFER_LOAD_FORMAT_XYZW         load XYZW components w/ format convert
TBUFFER_STORE_FORMAT_X           store X component w/ format convert
TBUFFER_STORE_FORMAT_XY          store XY components w/ format convert
TBUFFER_STORE_FORMAT_XYZ         store XYZ components w/ format convert
TBUFFER_STORE_FORMAT_XYZW        store XYZW components w/ format convert
TBUFFER_LOAD_D16_FORMAT_X        load X component w/ format convert, 16bit
TBUFFER_LOAD_D16_FORMAT_XY       load XY components w/ format convert, 16bit
TBUFFER_LOAD_D16_FORMAT_XYZ      load XYZ components w/ format convert, 16bit
TBUFFER_LOAD_D16_FORMAT_XYZW     load XYZW components w/ format convert, 16bit
TBUFFER_STORE_D16_FORMAT_X       store X component w/ format convert, 16bit
TBUFFER_STORE_D16_FORMAT_XY      store XY components w/ format convert, 16bit
TBUFFER_STORE_D16_FORMAT_XYZ     store XYZ components w/ format convert, 16bit
TBUFFER_STORE_D16_FORMAT_XYZW    store XYZW components w/ format convert, 16bit

• TBUFFER*_FORMAT instructions include a data-format conversion specified in the instruction.
Table 39. MUBUF Instructions
Opcode                          Description - all address components for buffer ops are uint
BUFFER_LOAD_U8                  load unsigned byte (extend 0’s to MSB’s of DWORD VGPR)
BUFFER_LOAD_D16_U8              load unsigned byte into VGPR[15:0]
BUFFER_LOAD_D16_HI_U8           load unsigned byte into VGPR[31:16]
BUFFER_LOAD_I8                  load signed byte (sign extend to MSB’s of DWORD VGPR)
BUFFER_LOAD_D16_I8              load signed byte into VGPR[15:0]
BUFFER_LOAD_D16_HI_I8           load signed byte into VGPR[31:16]
BUFFER_LOAD_U16                 load unsigned short (extend 0’s to MSB’s of DWORD VGPR)
BUFFER_LOAD_I16                 load signed short (sign extend to MSB’s of DWORD VGPR)
BUFFER_LOAD_D16_B16             load short into VGPR[15:0]
BUFFER_LOAD_D16_HI_B16          load short into VGPR[31:16]
BUFFER_LOAD_B32                 load DWORD
BUFFER_LOAD_B64                 load 2 DWORD per element
BUFFER_LOAD_B96                 load 3 DWORD per element
BUFFER_LOAD_B128                load 4 DWORD per element
BUFFER_LOAD_FORMAT_X            load X component w/ format convert
BUFFER_LOAD_FORMAT_XY           load XY components w/ format convert
BUFFER_LOAD_FORMAT_XYZ          load XYZ components w/ format convert
BUFFER_LOAD_FORMAT_XYZW         load XYZW components w/ format convert
BUFFER_LOAD_D16_FORMAT_X        load X component w/ format convert, 16b
BUFFER_LOAD_D16_HI_FORMAT_X     load X component w/ format convert, 16b
BUFFER_LOAD_D16_FORMAT_XY       load XY components w/ format convert, 16b
BUFFER_LOAD_D16_FORMAT_XYZ      load XYZ components w/ format convert, 16b
BUFFER_LOAD_D16_FORMAT_XYZW     load XYZW components w/ format convert, 16b
BUFFER_STORE_B8                 store byte (ignore MSB’s of DWORD VGPR)
BUFFER_STORE_D16_HI_B8          store byte from VGPR bits [23:16]
BUFFER_STORE_B16                store short (ignore MSB’s of DWORD VGPR)
BUFFER_STORE_D16_HI_B16         store short from VGPR bits [31:16]
BUFFER_STORE_B32                store DWORD
BUFFER_STORE_B64                store 2 DWORD per element
BUFFER_STORE_B96                store 3 DWORD per element
BUFFER_STORE_B128               store 4 DWORD per element
BUFFER_STORE_FORMAT_X           store X component w/ format convert
BUFFER_STORE_FORMAT_XY          store XY components w/ format convert
BUFFER_STORE_FORMAT_XYZ         store XYZ components w/ format convert
BUFFER_STORE_FORMAT_XYZW        store XYZW components w/ format convert
BUFFER_STORE_D16_FORMAT_X       store X component w/ format convert, 16b
BUFFER_STORE_D16_HI_FORMAT_X    store X component w/ format convert, 16b
BUFFER_STORE_D16_FORMAT_XY      store XY components w/ format convert, 16b
BUFFER_STORE_D16_FORMAT_XYZ     store XYZ components w/ format convert, 16b
BUFFER_STORE_D16_FORMAT_XYZW    store XYZW components w/ format convert, 16b
BUFFER_ATOMIC_ADD_U32           32b, dst += src, returns previous value if glc==1
BUFFER_ATOMIC_ADD_F32           32b, dst += src, returns previous value if glc==1
BUFFER_ATOMIC_ADD_U64           64b, dst += src, returns previous value if glc==1
BUFFER_ATOMIC_AND_B32           32b, dst &= src, returns previous value if glc==1
BUFFER_ATOMIC_AND_B64           64b, dst &= src, returns previous value if glc==1
BUFFER_ATOMIC_CMPSWAP_B32       32b, dst = (dst == cmp) ? src : dst, returns previous value if glc==1.
                                Src is from vdata, cmp from vdata+1
BUFFER_ATOMIC_CMPSWAP_B64       64b, dst = (dst == cmp) ? src : dst, returns previous value if glc==1
BUFFER_ATOMIC_CSUB_U32          32b, dst = (src > dst) ? 0 : dst - src, returns previous value.
                                GLC must be set to 1.
BUFFER_ATOMIC_DEC_U32           32b, dst = ((dst == 0) || (dst > src)) ? src : dst-1, returns previous value if glc==1
BUFFER_ATOMIC_DEC_U64           64b, dst = ((dst == 0) || (dst > src)) ? src : dst-1, returns previous value if glc==1
BUFFER_ATOMIC_CMPSWAP_F32       32b, dst = (dst == cmp) ? src : dst, returns previous value if glc==1.
                                Src is from vdata, cmp from vdata+1
BUFFER_ATOMIC_MAX_F32           32b, dst = (src > dst) ? src : dst, (float) returns previous value if glc==1
BUFFER_ATOMIC_MIN_F32           32b, dst = (src < dst) ? src : dst, (float) returns previous value if glc==1
BUFFER_ATOMIC_INC_U32           32b, dst = (dst >= src) ? 0 : dst+1, returns previous value if glc==1
BUFFER_ATOMIC_INC_U64           64b, dst = (dst >= src) ? 0 : dst+1, returns previous value if glc==1
BUFFER_ATOMIC_OR_B32            32b, dst |= src, returns previous value if glc==1
BUFFER_ATOMIC_OR_B64            64b, dst |= src, returns previous value if glc==1
BUFFER_ATOMIC_MAX_I32           32b, dst = (src > dst) ? src : dst, (signed) returns previous value if glc==1
BUFFER_ATOMIC_MAX_I64           64b, dst = (src > dst) ? src : dst, (signed) returns previous value if glc==1
BUFFER_ATOMIC_MIN_I32           32b, dst = (src < dst) ? src : dst, (signed) returns previous value if glc==1
BUFFER_ATOMIC_MIN_I64           64b, dst = (src < dst) ? src : dst, (signed) returns previous value if glc==1
BUFFER_ATOMIC_SUB_U32           32b, dst -= src, returns previous value if glc==1
BUFFER_ATOMIC_SUB_U64           64b, dst -= src, returns previous value if glc==1
BUFFER_ATOMIC_SWAP_B32          32b, dst = src, returns previous value of dst if glc==1
BUFFER_ATOMIC_SWAP_B64          64b, dst = src, returns previous value of dst if glc==1
BUFFER_ATOMIC_MAX_U32           32b, dst = (src > dst) ? src : dst, (unsigned) returns previous value if glc==1
BUFFER_ATOMIC_MAX_U64           64b, dst = (src > dst) ? src : dst, (unsigned) returns previous value if glc==1
BUFFER_ATOMIC_MIN_U32           32b, dst = (src < dst) ? src : dst, (unsigned) returns previous value if glc==1
BUFFER_ATOMIC_MIN_U64           64b, dst = (src < dst) ? src : dst, (unsigned) returns previous value if glc==1
BUFFER_ATOMIC_XOR_B32           32b, dst ^= src, returns previous value if glc==1
BUFFER_ATOMIC_XOR_B64           64b, dst ^= src, returns previous value if glc==1
BUFFER_GL0_INV                  invalidate the shader L0 cache (texture cache) associated with this wave.
BUFFER_GL1_INV                  invalidate the GL1 (L1) cache associated with this wave, for this wave’s VMID

• BUFFER*_FORMAT instructions include a data-format conversion specified in the resource constant (V#).
• In the table above, "D16" means the data in the VGPR is 16 bits, not the usual 32 bits.
"D16_HI" means that the upper 16 bits of the VGPR are used, instead of the lower 16 bits that "D16" uses.

9.2. VGPR Usage
VGPRs supply address and store-data, and they can be the destination for return data.
Address
Zero, one or two VGPRs are used, depending on the index-enable (IDXEN) and offset-enable (OFFEN) in the
instruction word. These are unsigned ints.
For 64-bit addresses the LSBs are in VGPRn and the MSBs are in VGPRn+1.
Table 40. Address VGPRs
IDXEN  OFFEN  VGPRn        VGPRn+1
0      0      nothing
0      1      uint offset
1      0      uint index
1      1      uint index   uint offset
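The selection in Table 40 can be expressed as a short lookup. The sketch below is illustrative only; the function name address_vgprs is hypothetical, and it returns VGPR numbers rather than register contents.

```python
def address_vgprs(vaddr, idxen, offen):
    """Return (index_vgpr, offset_vgpr) for a buffer instruction.

    vaddr is the VGPR number in the VADDR field. A component that is not
    supplied by a VGPR is reported as None (its value is then zero).
    """
    if idxen and offen:
        return vaddr, vaddr + 1   # index in the first VGPR, offset in the second
    if idxen:
        return vaddr, None
    if offen:
        return None, vaddr
    return None, None
```

For instance, with VADDR = 10 and both IDXEN and OFFEN set, the index comes from VGPR 10 and the offset from VGPR 11.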

Store Data : N consecutive VGPRs, starting at VDATA. The data format specified in the instruction word’s
opcode and D16 setting determines how many DWORDs the shader provides to store.
Load Data : Same as stores. Data is returned to consecutive VGPRs.
Load Data Format : Load data is 32 or 16 bits, based on the data format in the instruction or resource and D16.
Float or normalized data is returned as floats; integer formats are returned as integers (signed or unsigned,
same type as the memory storage format). Memory loads of data in memory that is 32 or 64 bits do not undergo
any format conversion unless they return as 16-bit due to D16 being set to 1.
Atomics with Return : Data is read out of the VGPR(s) starting at VDATA to supply to the atomic operation. If
the atomic returns a value to VGPRs, that data is returned to those same VGPRs starting at VDATA.
Table 41. Data format in VGPRs and Memory
Instruction                   Memory Format  VGPR Format                       Notes
BUFFER_LOAD_U8                ubyte          V0[31:0] = {24’b0, byte}
BUFFER_LOAD_D16_U8            ubyte          V0[15:0] = {8’b0, byte}           writes only 16 bits
BUFFER_LOAD_D16_HI_U8         ubyte          V0[31:16] = {8’h0, byte}          writes only 16 bits
BUFFER_LOAD_I8                sbyte          V0[31:0] = {24{sign}, byte}
BUFFER_LOAD_D16_I8            sbyte          V0[15:0] = {8{sign}, byte}        writes only 16 bits
BUFFER_LOAD_D16_HI_I8         sbyte          V0[31:16] = {8{sign}, byte}       writes only 16 bits
BUFFER_LOAD_U16               ushort         V0[31:0] = {16’b0, short}
BUFFER_LOAD_I16               sshort         V0[31:0] = {16{sign}, short}
BUFFER_LOAD_D16_B16           short          V0[15:0] = short                  writes only 16 bits
BUFFER_LOAD_D16_HI_B16        short          V0[31:16] = short                 writes only 16 bits
BUFFER_LOAD_B32               DWORD          DWORD
BUFFER_LOAD_FORMAT_X          FORMAT field   float, uint or sint
                                             Load X into V0[31:0]
BUFFER_LOAD_FORMAT_XY         FORMAT field   float, uint or sint
                                             Load X,Y into V0[31:0], V1[31:0]
BUFFER_LOAD_FORMAT_XYZ        FORMAT field   float, uint or sint
                                             Load X,Y,Z into V0[31:0], V1[31:0], V2[31:0]
BUFFER_LOAD_FORMAT_XYZW       FORMAT field   float, uint or sint
                                             Load X,Y,Z,W into V0[31:0], V1[31:0], V2[31:0], V3[31:0]
BUFFER_LOAD_D16_FORMAT_X      FORMAT field   float, uint or sint
                                             Load X into V0[15:0]
BUFFER_LOAD_D16_HI_FORMAT_X   FORMAT field   float, ushort or sshort
                                             Load X into V0[31:16]
BUFFER_LOAD_D16_FORMAT_XY     FORMAT field   float, ushort or sshort
                                             Load X,Y into V0[15:0], V0[31:16]
BUFFER_LOAD_D16_FORMAT_XYZ    FORMAT field   float, ushort or sshort
                                             Load X,Y,Z into V0[15:0], V0[31:16], V1[15:0]
BUFFER_LOAD_D16_FORMAT_XYZW   FORMAT field   float, ushort or sshort
                                             Load X,Y,Z,W into V0[15:0], V0[31:16], V1[15:0], V1[31:16]
Note for the FORMAT rows: the data type in the VGPR is based on the FORMAT field (D16_X and D16_HI_X
write only 16 bits).

Where "V0" is the VDATA VGPR; V1 is the VDATA+1 VGPR, etc.
Instruction                    VGPR Format                               Memory Format  Notes
BUFFER_STORE_B8                byte in [7:0]                             byte
BUFFER_STORE_D16_HI_B8         byte in [23:16]                           byte
BUFFER_STORE_B16               short in [15:0]                           short
BUFFER_STORE_D16_HI_B16        short in [31:16]                          short
BUFFER_STORE_B32               data in [31:0]                            DWORD
BUFFER_STORE_FORMAT_X          float, uint or sint                       FORMAT field
                               data in V0[31:0]
BUFFER_STORE_D16_FORMAT_X      float, ushort or sshort                   FORMAT field
                               data in V0[15:0]
BUFFER_STORE_D16_FORMAT_XY     float, ushort or sshort                   FORMAT field
                               data in V0[15:0], V0[31:16]
BUFFER_STORE_D16_FORMAT_XYZ    float, ushort or sshort                   FORMAT field
                               data in V0[15:0], V0[31:16], V1[15:0]
BUFFER_STORE_D16_FORMAT_XYZW   float, ushort or sshort                   FORMAT field
                               data in V0[15:0], V0[31:16], V1[15:0],
                               V1[31:16]
BUFFER_STORE_D16_HI_FORMAT_X   float, ushort or sshort                   FORMAT field
                               data in V0[31:16]
Note for the FORMAT rows: the data type in the VGPR is based on the FORMAT field.

9.3. Buffer Data
The amount and type of data that is loaded or stored is controlled by the following: the resource format field,
destination-component-selects (dst_sel), and the opcode.
Data-format can come from the resource, instruction fields, or the opcode itself. MTBUF derives data-format
from the instruction, MUBUF-"format" instructions use the format from the resource, and other MUBUF opcodes
derive data-format from the opcode itself.
DST_SEL comes from the resource, but is ignored for many operations.
Table 42. Buffer Instructions
Instruction              Data Format  DST SEL
TBUFFER_LOAD_FORMAT_*    instruction  identity
TBUFFER_STORE_FORMAT_*   instruction  identity
BUFFER_LOAD_<type>       derived      identity
BUFFER_STORE_<type>      derived      identity
BUFFER_LOAD_FORMAT_*     resource     resource
BUFFER_STORE_FORMAT_*    resource     resource
BUFFER_ATOMIC_*          derived      identity

Instruction : The instruction’s format field is used instead of the resource’s fields.
Data format derived : The data format is derived from the opcode and ignores the resource definition. For
example, BUFFER_LOAD_U8 sets the data-format to uint-8.



The resource’s data format must not be INVALID; that format has specific meaning
(unbound resource), and for that case the data format is not replaced by the instruction’s
implied data format.

DST_SEL identity : Depending on the number of components in the data-format, this is: X000, XY00, XYZ0, or
XYZW.


9.3.1. D16 Instructions
Load-format and store-format instructions also come in a "D16" variant. The D16 buffer instructions allow a
shader program to load or store just 16 bits per work-item between VGPRs and memory. For stores, each 32-bit
VGPR holds two 16-bit data elements that are passed to the texture unit, which in turn converts them to the
texture format before writing to memory. For loads, data returned from the texture unit is converted to 16 bits
and a pair of data are stored in each 32-bit VGPR (LSBs first, then MSBs). The choice of int vs. float is
controlled by FORMAT. Conversion of float32 to float16 uses truncation; conversion of other input data formats
uses round-to-nearest-even.
There are two variants of these instructions:
• D16 loads data into or stores data from the lower 16 bits of a VGPR.
• D16_HI loads data into or stores data from the upper 16 bits of a VGPR.
For example, BUFFER_LOAD_D16_U8 loads a byte per work-item from memory, converts it to a 16-bit integer,
then loads it into the lower 16 bits of the data VGPR.
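The packing order described above (LSBs first, then MSBs) can be illustrated with a small Python sketch. The helper names pack_d16 and write_d16 are hypothetical. Filling an unused upper half with zero in pack_d16 is an assumption of this sketch, not something the text specifies.

```python
def pack_d16(components):
    """Pack 16-bit components into 32-bit VGPR images, low half first."""
    vgprs = []
    for i in range(0, len(components), 2):
        lo = components[i] & 0xFFFF
        # Assumption: an unused upper half (e.g. after Z in XYZ) is left as zero.
        hi = (components[i + 1] & 0xFFFF) if i + 1 < len(components) else 0
        vgprs.append((hi << 16) | lo)
    return vgprs


def write_d16(vgpr, value, hi=False):
    """Write one 16-bit value into the lower (D16) or upper (D16_HI) half of
    a 32-bit VGPR image, leaving the other half unchanged."""
    if hi:
        return (vgpr & 0x0000FFFF) | ((value & 0xFFFF) << 16)
    return (vgpr & 0xFFFF0000) | (value & 0xFFFF)
```

With four components X, Y, Z, W, the first VGPR holds {Y, X} and the second holds {W, Z}, matching the XYZW row of the load table.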

9.3.2. LOAD/STORE_FORMAT and DATA-FORMAT mismatches
The "format" instructions specify a number of elements (x, xy, xyz or xyzw), and this can mismatch with the
number of elements in the data format specified in the instruction’s or resource’s data-format field. When that
happens:
• buffer_load_format_x and dfmt is "32_32_32_32" : load 4 DWORDs from memory, but return only the first
to the shader
• buffer_store_format_x and dfmt is "32_32_32_32" : stores 4 DWORDs to memory based on dst_sel
• buffer_load_format_xyzw and dfmt is "32" : load 1 DWORD from memory, return 4 to shader (dst_sel)
• buffer_store_format_xyzw and dfmt is "32" : store 1 DWORD (X) to memory, ignore YZW.

9.4. Buffer Addressing
A buffer is a data structure in memory that is addressed with an index and an offset. The index points to a
particular record of size stride bytes, and the offset is the byte-offset within the record. The stride comes from
the resource, the index from a VGPR (or zero), and the offset from an SGPR or VGPR and also from the
instruction itself.
Table 43. BUFFER Instruction Fields for Addressing
Field        Size  Description
inst_offset  12    Literal byte offset from the instruction.
inst_idxen   1     Boolean: get per-lane index from VGPR when true, or no index when false.
inst_offen   1     Boolean: get per-lane offset from VGPR when true, or no offset when false. Note that
                   inst_offset is present regardless of this bit.

The "element size" for a buffer instruction is the amount of data the instruction transfers in bytes. It is
determined by the FORMAT field for MTBUF instructions, or from the opcode for MUBUF instructions, and is:
1, 2, 4, 8, 12 or 16 bytes. For example, format "16_16" has an element size of 4-bytes.
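As a rough sketch of how the element size follows from a format name or a MUBUF opcode suffix, consider the Python helpers below. The function names are hypothetical, and reading the size of a format as the sum of its numeric bit fields is an assumption that matches the "16_16" example but is not stated as a general rule in the text. The opcode helper covers only the BUFFER_LOAD_<size> and BUFFER_STORE_<size> forms.

```python
# Bytes transferred per MUBUF size suffix (from the opcode names in Table 39).
MUBUF_SIZE_BYTES = {
    "I8": 1, "U8": 1, "B8": 1,
    "I16": 2, "U16": 2, "B16": 2,
    "B32": 4, "B64": 8, "B96": 12, "B128": 16,
}


def element_size_from_format(fmt):
    """Sum the numeric bit-width fields of a format name such as "16_16"."""
    bits = sum(int(field) for field in fmt.split("_") if field.isdigit())
    if bits == 0 or bits % 8 != 0:
        raise ValueError("format does not describe a whole number of bytes")
    return bits // 8


def element_size_from_opcode(opcode):
    """Look up the element size from the last suffix of a MUBUF opcode name."""
    return MUBUF_SIZE_BYTES[opcode.rsplit("_", 1)[-1]]
```

For the example in the text, element_size_from_format("16_16") gives 4 bytes.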


Table 44. Buffer Resource Constant Fields for Addressing
Field                 Size  Description
const_base            48    Base address of the buffer resource, in bytes.
const_stride          14    Stride of the record in bytes (0 to 16,383 bytes).
const_num_records     32    Number of records in the buffer. In units of:
                            Bytes if: const_stride == 0 || const_swizzle_enable == false
                            Otherwise, in units of "stride".
const_add_tid_enable  1     Boolean. Add thread_ID within the wave to the index when true.
const_swizzle_enable  2     Swizzle AOS according to stride, index_stride and element_size:
                            0: disabled
                            1: enabled with element_size = 4-byte
                            2: Reserved
                            3: enabled with element_size = 16-byte
const_index_stride    2     Used only when const_swizzle_en = true. Number of contiguous indices for a
                            single element (of const_element_size=4 or 16 bytes) before switching to the
                            next element. 8, 16, 32 or 64 indices.

Table 45. Address Components from GPRs
Field        Size  Description
SGPR_offset  32    An unsigned byte-offset to the address. Comes from an SGPR or M0.
VGPR_offset  32    An optional unsigned byte-offset. It is per-thread, and comes from a VGPR.
VGPR_index   32    An optional index value. It is per-thread and comes from a VGPR.

The final buffer memory address is composed of three parts:
• the base address from the buffer resource (V#),
• the offset from the SGPR, and
• a buffer-offset that is calculated differently, depending on whether the buffer is linearly addressed (a
simple Array-of-Structures calculation) or is swizzled.
Address Calculation for a Linear Buffer

9.4.1. Range Checking
Buffer addresses are checked against the size of the memory buffer. Loads that are out of range return zero,
and stores and atomics are dropped. Range checking is per-component for non-formatted loads and stores that
are larger than one DWORD. Note that load/store_B64, B96 and B128 are considered "2-DWORD/3-DWORD/4-DWORD
load/store", and each DWORD is bounds checked separately. The method of clamping is controlled by a 2-bit
field in the buffer resource: OOB_SELECT (Out of Bounds select).
Table 46. Buffer Out Of Bounds Selection
OOB SELECT  Out of Bounds Check                                          Description or use
0           (index >= NumRecords) || (offset+payload > stride)           structured buffers
1           (index >= NumRecords)                                        Raw buffers
2           (NumRecords == 0)                                            do not check bounds (except
                                                                         empty buffer)
3           Bounds check:                                                Raw
            if (swizzle_en && const_stride != 0x0)                       In this mode, "num_records" is
              OOB = (index >= NumRecords || (offset+payload > stride))   reduced by "sgpr_offset"
            else
              OOB = (offset+payload > NumRecords)

Where "payload" is the number of bytes the instruction transfers.

Notes:
1. Loads that go out-of-range return zero (except for components with V#.dst_sel = SEL_1 that return 1).
2. Stores that are out-of-range do not store anything.
3. Load/store-format-* instructions and atomics are range-checked "all or nothing" - either entirely in or out.
4. Load/store-B{64,96,128} are range-checked per component.
For MTBUF, if any component of the thread is out of bounds, the whole thread is considered out of bounds
and returns zero. For MUBUF, only the components that are out of bounds return zero.
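The four OOB_SELECT modes can be written as a single predicate. The Python sketch below is a model under assumptions, not a description of the hardware: the function name is hypothetical, and clamping the reduced num_records at zero in mode 3 is an assumption made so the example stays well-defined.

```python
def is_out_of_bounds(oob_select, index, offset, payload, num_records, stride,
                     swizzle_en=False, sgpr_offset=0):
    """Return True when an access is out of bounds for the given OOB_SELECT.

    payload is the number of bytes the instruction transfers.
    """
    if oob_select == 0:      # structured buffers
        return index >= num_records or offset + payload > stride
    if oob_select == 1:      # raw buffers
        return index >= num_records
    if oob_select == 2:      # no bounds check, except for an empty buffer
        return num_records == 0
    if oob_select == 3:
        # "num_records" is reduced by "sgpr_offset" in this mode.
        # Assumption: the reduced value does not go below zero.
        reduced = max(num_records - sgpr_offset, 0)
        if swizzle_en and stride != 0:
            return index >= reduced or offset + payload > stride
        return offset + payload > reduced
    raise ValueError("OOB_SELECT is a 2-bit field")
```

In mode 3 with a 64-byte raw buffer, a 4-byte access at offset 60 is in bounds, but the same access becomes out of bounds once an sgpr_offset of 4 reduces the usable size to 60 bytes.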

9.4.1.1. Structured Buffer
The address calculation for swizzle_en==0 is: (unswizzled structured buffer)
ADDR = Base + baseOff + Ioff + Stride * Vidx + (OffEn ? Voff : 0)
where the source of each term is:
Base    V#
baseOff SGPR
Ioff    INST
Stride  V#
Vidx    VGPR
OffEn   INST
Voff    VGPR
NumRecords for structured buffer is in units of stride.

9.4.1.2. Raw Buffer
ADDR = Base + baseOff + Ioff + (OffEn ? Voff : 0)
where the source of each term is:
Base    V#
baseOff SGPR
Ioff    INST
OffEn   INST
Voff    VGPR

NumRecords for raw buffer is in units of bytes. This is an exact range check, meaning it includes the payload
and handles multi-DWORD and unaligned correctly. The stride field is ignored.
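The unswizzled structured and raw address calculations differ only in the stride term, which a single Python function can illustrate. The function name and keyword arguments are hypothetical; treating the index as zero when idxen is false follows Table 40.

```python
def linear_buffer_address(base, sgpr_offset, inst_offset, stride=0,
                          vgpr_index=0, vgpr_offset=0,
                          idxen=False, offen=False):
    """ADDR = Base + baseOff + Ioff + Stride * Vidx + (OffEn ? Voff : 0).

    A raw buffer access is the case with idxen=False, so the stride term
    contributes nothing.
    """
    addr = base + sgpr_offset + inst_offset
    if idxen:
        addr += stride * vgpr_index    # structured buffers only
    if offen:
        addr += vgpr_offset
    return addr
```

For a structured buffer with base 0x1000, SGPR offset 0x100, instruction offset 8, stride 16, index 3 and VGPR offset 4, the address is 0x113C. A raw access with base 0x1000, instruction offset 4 and VGPR offset 0x20 gives 0x1024.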


9.4.1.3. Scratch Buffer
The address calculation for swizzle_en = 0 is: (unswizzled scratch buffer)
ADDR = Base + baseOffset + Ioff + Stride * TID + (OffEn ? Voff : 0)
where the source of each term is:
Base       V#
baseOffset SGPR
Ioff       INST
Stride     V#
TID        0..63
OffEn      INST
Voff       VGPR
Swizzling of the scratch buffer is also supported (and is typical). The MSBs of the TID (TID / 64) are folded
into baseOffset. No range checking is performed (using OOB mode 2).

9.4.1.4. Scalar Memory
Scalar memory uses the following calculation, which works with raw buffers and unswizzled structured buffers:
Addr = Base + offset
where Base comes from the V# and offset comes from an SGPR or the instruction.
Address Out-of-Range if: offset >= ( (stride==0 ? 1 : stride) * num_records).
Notes
1. Loads that go out-of-range return zero (except for components with V#.dst_sel = SEL_1 that return 1).
Stores that are out of range do not write anything.
2. Load/store-format-* instructions and atomics are range-checked "all or nothing" - either entirely in or out.
3. Load/store-DWORD-x{2,3,4} perform range-check per component.

9.4.2. Swizzled Buffer Addressing
Swizzled addressing rearranges the data in the buffer in a way that may improve cache locality for arrays of
structures. Swizzled addressing also requires DWORD-aligned accesses. A single fetch instruction must not
fetch a unit larger than const_element_size. The buffer’s STRIDE must be a multiple of const_element_size.
const_element_size is either 4 or 16 bytes, depending on the setting of V#.swizzle_enable.
Index         = (inst_idxen ? vgpr_index : 0) + (const_add_tid_enable ? thread_id[5:0] : 0)
Offset        = (inst_offen ? vgpr_offset : 0) + inst_offset
index_msb     = index / const_index_stride
index_lsb     = index % const_index_stride
offset_msb    = offset / const_element_size
offset_lsb    = offset % const_element_size
buffer_offset = (index_msb * const_stride + offset_msb * const_element_size) * const_index_stride +
                index_lsb * const_element_size + offset_lsb
Final Address = const_base + sgpr_offset + buffer_offset
The "sgpr_offset" is not a part of the "offset" term in the above equations - it's in the "base".
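The swizzle equations above translate directly into integer arithmetic. The following Python sketch mirrors them term by term; the function names are hypothetical, and the divisions are assumed to be integer (floor) divisions on non-negative values.

```python
def swizzled_buffer_offset(index, offset, const_stride,
                           const_index_stride, const_element_size):
    """buffer_offset from the swizzled addressing equations."""
    index_msb, index_lsb = divmod(index, const_index_stride)
    offset_msb, offset_lsb = divmod(offset, const_element_size)
    return ((index_msb * const_stride + offset_msb * const_element_size)
            * const_index_stride
            + index_lsb * const_element_size + offset_lsb)


def swizzled_address(const_base, sgpr_offset, index, offset, const_stride,
                     const_index_stride, const_element_size):
    """Final Address = const_base + sgpr_offset + buffer_offset."""
    return const_base + sgpr_offset + swizzled_buffer_offset(
        index, offset, const_stride, const_index_stride, const_element_size)
```

With const_stride = 16, const_index_stride = 8 and const_element_size = 4, consecutive indices land 4 bytes apart, while the next element of the same record lands 32 bytes away. That is the array-of-structures to structure-of-arrays rearrangement within each group of 8 records.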


Example of Buffer Swizzling

9.5. Alignment
Formatted ops such as BUFFER_LOAD_FORMAT_* must be aligned to element_size.
Memory alignment enforcement for non-formatted ops is controlled by a configuration register:
SH_MEM_CONFIG.alignment_mode.
Options are:
0: DWORD - hardware automatically aligns the request to the smaller of: element-size or DWORD.
   For DWORD or larger loads or stores of non-formatted ops (such as BUFFER_LOAD_DWORD), the two
   LSBs of the byte-address are ignored, thus forcing DWORD alignment.
1: DWORD_STRICT - access must be aligned to the smaller of: element-size or DWORD.
2: STRICT - access must be aligned to data size.
3: UNALIGNED - any alignment is allowed.
Options 1 and 2 report MEMVIOL if a request is made with incorrect address alignment. In options 1 and 2,
loads that are misaligned return zero, and stores that are misaligned are discarded. Note that in this context
"element-size" refers to the size of the data transfer indicated by the instruction, not const_element_size.
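The four alignment modes can be modeled as follows. This is a sketch, not a description of the hardware: the function name and return convention are hypothetical, and it assumes the transfer size is a power of two in bytes (multi-DWORD sizes such as 12 bytes are not modeled).

```python
DWORD, DWORD_STRICT, STRICT, UNALIGNED = 0, 1, 2, 3


def apply_alignment_mode(address, size, mode):
    """Return (effective_address, memviol) for a non-formatted access.

    size is the transfer size in bytes ("element-size" in the text above).
    """
    if mode == UNALIGNED:
        return address, False
    # STRICT checks against the full data size; the DWORD modes check
    # against the smaller of element-size and a DWORD (4 bytes).
    required = size if mode == STRICT else min(size, 4)
    if mode == DWORD:
        # Hardware silently forces alignment by dropping the low address bits.
        return address - (address % required), False
    return address, address % required != 0
```

In this sketch a returned memviol of True corresponds to the case where the load returns zero or the store is discarded.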

9.6. Buffer Resource
The buffer resource (V#) describes the location of a buffer in memory and the format of the data in the buffer.
It is specified in four consecutive SGPRs (4-SGPR aligned) and sent to the texture cache with each buffer
instruction.
The table below details the fields that make up the buffer resource descriptor.
Table 47. Buffer Resource Descriptor
Bits | Size | Name | Description
47:0 | 48 | Base address | Byte address.
61:48 | 14 | Stride | Bytes 0 to 16383
63:62 | 2 | Swizzle enable | Swizzle AOS according to stride, index_stride and element_size; otherwise linear. 0: disabled; 1: enabled with element_size = 4 bytes; 2: Reserved; 3: enabled with element_size = 16 bytes
95:64 | 32 | Num_records | In units of stride if (stride >= 1), else in bytes.
98:96 | 3 | Dst_sel_x | Destination channel select: 0=0, 1=1, 4=R, 5=G, 6=B, 7=A
101:99 | 3 | Dst_sel_y | Same as Dst_sel_x.
104:102 | 3 | Dst_sel_z | Same as Dst_sel_x.
107:105 | 3 | Dst_sel_w | Same as Dst_sel_x.
113:108 | 6 | Format | Memory data type.
118:117 | 2 | Index stride | 0:8, 1:16, 2:32, or 3:64. Used for swizzled buffer addressing.
119 | 1 | Add tid enable | Add thread ID to the index to calculate the address.
123:122 | 2 | LLC NoAlloc | May become deprecated. Please use shader instruction fields instead. 0: NOALLOC = (PTE.NOALLOC | instruction.dlc); 1: NOALLOC = Read ? (PTE.NOALLOC | instruction.dlc) : 1; 2: NOALLOC = Read ? 1 : (PTE.NOALLOC | instruction.dlc); 3: NOALLOC = 1
125:124 | 2 | OOB_SELECT | Out of bounds select.
127:126 | 2 | Type | Value == 0 for buffer. Overlaps upper two bits of four-bit TYPE field in 128-bit V# resource.
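As an illustration of the bit layout in Table 47, the sketch below extracts the fields from a V# held as a single 128-bit Python integer. The helper names are hypothetical, and how the four SGPRs are concatenated into that integer is left to the caller (it is not specified here).

```python
def bits(value, hi, lo):
    """Extract the inclusive bit range [hi:lo] from an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_buffer_resource(v):
    """Decode a 128-bit buffer resource (V#) per Table 47."""
    return {
        "base_address": bits(v, 47, 0),
        "stride": bits(v, 61, 48),
        "swizzle_enable": bits(v, 63, 62),
        "num_records": bits(v, 95, 64),
        "dst_sel_x": bits(v, 98, 96),
        "dst_sel_y": bits(v, 101, 99),
        "dst_sel_z": bits(v, 104, 102),
        "dst_sel_w": bits(v, 107, 105),
        "format": bits(v, 113, 108),
        # Field encoding 0..3 maps to an index stride of 8, 16, 32 or 64.
        "index_stride": 8 << bits(v, 118, 117),
        "add_tid_enable": bits(v, 119, 119),
        "llc_noalloc": bits(v, 123, 122),
        "oob_select": bits(v, 125, 124),
        "type": bits(v, 127, 126),
    }
```

Note that index_stride is returned as the decoded stride (8 to 64), not the raw 2-bit field value.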

Unbound Resources
Setting the resource constant to all zeros has the effect of forcing any loads to return zero, and stores to be
ignored. This is keyed off the "data-format" being set to zero (INVALID), and for MUBUF the "add_tid_en =
false".
Resource - Instruction mismatch
If the resource type and instruction mismatch (e.g. a buffer constant with an image instruction, or an image
resource with a buffer instruction), the instruction is ignored (loads return nothing and stores do not alter
memory).


Chapter 10. Vector Memory Image Instructions
Vector Memory (VMEM) Image operations transfer data between the VGPRs and memory through the texture
cache. Image operations support access to image objects such as texture maps and typed surfaces. Sample
operations read multiple elements from a surface and combine them to produce a single result per lane.
Image objects are accessed using addresses of one to four dimensions; they are composed of homogeneous
samples, each sample containing one to four elements. These image objects are read from, or written to, using
IMAGE_* or SAMPLE_* instructions, all of which use the MIMG instruction format. IMAGE_LOAD instructions
load an element from the image buffer directly into VGPRs, and SAMPLE instructions use sampler constants
(S#) and apply filtering to the data after it is read. IMAGE_ATOMIC instructions combine data from VGPRs with
data already in memory, and optionally return the value that was in memory before the operation.
VMEM image operations use an image resource constant (T#) that is a 128-bit or 256-bit value in SGPRs. This
constant is sent to the texture cache when the instruction is executed. This constant defines the address, data
format, and characteristics of the surface in memory. Some image instructions also use a sampler constant that
is a 128-bit constant in SGPRs. Typically, these constants are fetched from memory using scalar memory loads
prior to executing VM instructions, but these constants can also be generated within the shader.
Texture fetch instructions have a data mask (DMASK) field. DMASK specifies how many data components the
shader receives. If DMASK is less than the number of components in the texture, the texture unit only sends DMASK
components, starting with R, then G, B, and A. If DMASK specifies more than the texture format specifies, the
shader receives data based on T#.DST_SEL for the missing components. Image ops do not generate MemViol;
instead, they apply clamp modes if the address goes out of range.
Memory operations of different types (e.g. loads, stores and samples) can complete out of order with respect to
each other.

10.1. Image Instructions
This section describes the image instruction set, and the microcode fields available to those instructions.
MIMG Instructions
Instruction | Description
IMAGE_SAMPLE | Load and filter data from an image object
IMAGE_SAMPLE_G16 | Sample with 16-bit gradients
IMAGE_GATHER4 | Load and return samples from 4 texels for software filtering. Returns a single component, starting with the lower-left texel and in counter-clockwise order.
IMAGE_GATHER4H | 4H: fetch 1 component per texel from 4x1 texels. "DMASK" selects which component to load (R,G,B,A) and must have only one bit set to 1.
IMAGE_LOAD_{-, PCK, PCK_SGN} | Load data from an image object
IMAGE_LOAD_MIP_{-, PCK, PCK_SGN } | Load data from an image object from a specified mip level.
IMAGE_MSAA_LOAD | Load up to 4 samples of 1 component from an MSAA resource with a user-specified fragment ID. Uses DMASK as component select - it behaves like gather4 ops and returns 4 VGPR (2 if D16=1).
IMAGE_STORE_{-, PCK }, IMAGE_STORE_MIP_{-, PCK } | Store data to an image object to a specific mipmap level
IMAGE_ATOMIC_{SWAP, CMPSWAP, ADD, SUB, SMIN, UMIN, SMAX, UMAX, AND, OR, XOR, INC, DEC } | Image atomic operations
IMAGE_GET_RESINFO | Return resource info into 4 VGPRs for the MIP level specified. These are 32-bit integer values: VDATA3-0 = { #mipLevels, depth, height, width }. For cubemaps, depth = 6 * Number_of_array_faces. (DX expects the # of cubes, but gets # of faces instead)
IMAGE_GET_LOD | Return the calculated LOD. Treated as a Sample instruction. Returns the "raw" LOD and the "clamped" LOD into VDATA as two 32-bit floats: First VGPR = clampLOD, Second VGPR = rawLOD

Table 48. Instruction Fields
Field | Size | Description
OP | 8 | Opcode
VADDR | 8 | Address of VGPR to supply first component of address.
VDATA | 8 | Address of VGPR to supply first component of store-data or receive first component of load-data.
SSAMP | 5 | SGPR to supply S# (sampler constant) in 4 consecutive SGPRs. The 2 LSBs of the SGPR address are omitted since it must be aligned to 4.
SRSRC | 5 | SGPR to supply T# (resource constant) in 8 consecutive SGPRs. The 2 LSBs of the SGPR address are omitted since it must be aligned to 4.
UNRM | 1 | Force address to be un-normalized. Must be set to 1 for image stores and atomics. 0: for image ops with samplers, S,T,R from [0.0, 1.0] span the entire texture map; 1: for image ops with samplers, S,T,R from [0.0, N] span the texture map, where N is width, height or depth. Array/cube slice, lod, bias etc. are not affected. Image ops without sampler are not affected. UINT inputs are "unnormalized". This bit is logically OR’d with the S#.force_unnormalized bit.
R128 | 1 | Texture Resource Size: 1 = 128 bits, 0 = 256 bits
A16 | 1 | Address components are 16 bits (instead of the usual 32 bits). When set, all address components are 16 bits (packed into 2 per DWORD), except: texel offsets (three 6-bit UINT packed into 1 DWORD) and PCF reference (for "_C" instructions). Address components are 16b uint for image ops without sampler; 16b float with sampler.
DIM | 3 | Surface Dimension: 0: 1D; 1: 2D; 2: 3D; 3: cube; 4: 1d array; 5: 2d array; 6: 2d msaa; 7: 2d msaa array
DMASK | 4 | Data VGPR enable mask: 1 .. 4 consecutive VGPRs. Loads: defines which components are returned: 0=red, 1=green, 2=blue, 3=alpha. Stores: defines which components are written with data from VGPRs (missing components get 0). Enabled components come from consecutive VGPRs. E.g. DMASK=1001: Red is in VGPRn and alpha in VGPRn+1. For D16 loads, DMASK indicates which components to return. For D16 stores, DMASK indicates which components to store but has restrictions: data is read out of consecutive VGPRs: LSBs of VDATA, then MSBs of VDATA, then LSBs of VDATA+1 and last, if needed, MSBs of VDATA+1. This is regardless of which DMASK bits are set, only how many bits are set. The position of the DMASK bits controls which components are written in memory. If DMASK==0, the TA overrides DMASK to 1 and puts zeros in the VGPR followed by LWE status if it exists. TFE status is not generated since the fetch is dropped. For IMAGE_GATHER4* instructions, DMASK indicates which component (RGBA), and the number of VGPRs to use is determined automatically by hardware (4 VGPRs when D16=0, and 2 VGPRs when D16=1).
GLC | 1 | Group Level Coherent. Atomics: 1 = return the memory value before the atomic operation is performed; 0 = do not return anything.
DLC | 1 | Device Level Coherent. Controls behavior of L1 cache (GL1).
SLC | 1 | System Level Coherent.
TFE | 1 | Texel Fault Enable for PRT (Partially Resident Textures). When set, fetch may return a NACK that causes a VGPR write into DST+1 (first GPR after all fetch-dest GPRs).
LWE | 1 | LOD Warning Enable. When set to 1, a texture fetch may return "LOD_CLAMPED = 1", and causes a VGPR write into DST+1 (first GPR after all fetch-dest GPRs). LWE only works for sampler ops; LWE is ignored for non-sampler ops.
D16 | 1 | VGPR-Data-16bit. On loads, convert data in memory to 16-bit format before storing it in VGPRs. For stores, convert 16-bit data in VGPRs to the memory format before going to memory. Whether the data is treated as float or int is decided by NFMT. Allowed only with these opcodes: IMAGE_SAMPLE*, IMAGE_GATHER4, IMAGE_LOAD, IMAGE_LOAD_MIP, IMAGE_STORE, IMAGE_STORE_MIP
NSA | 1 | Non-Sequential Address. When NSA=0, the image addresses must be in sequential VGPRs starting at 'VADDR'. When NSA=1, the instruction encoding allows up to 5 address components to be specified separately by using an additional instruction DWORD.
ADDR1-4 | 4x8 | Four 8-bit VGPR address fields, used by NSA. The "VADDR" field provides ADDR0.

10.1.1. Texture Fault Enable (TFE) and LOD Warning Enable (LWE)
This is related to "Partially Resident Textures".
When either of these bits is set in the instruction, any texture fetch may return one extra VGPR after all of the
data-return VGPRs. This data is returned uniquely to each thread and indicates the error / warning status of
that thread.


The data returned is: TEXEL_FAIL | (LOD_WARNING << 1) | (LOD << 16)
• TEXEL_FAIL : 1 bit indicating that 1 or more texels for this pixel produced a NACK.
"failure" means accessing an unmapped page.
◦ TFE == 0
▪ TD writes the data for threads that didn’t NACK to VGPR DST
▪ TD writes zeros or the result of blend using zeros for samples that NACKed to VGPR DST
◦ TFE == 1
▪ VGPR DST is written similar to above
▪ TD writes to VGPR DST+1 with a status where the bits corresponding to threads that NACKed are
set to 1
• LOD_WARNING : 1 bit indicating that a pixel attempted to access a texel at too small a LOD:
warn = ( LOD < T#.min_lod_warning)
• LOD : indicates which LOD was attempted to be accessed that caused the NACK. Returns the floor of the
requested LOD.
A pixel cannot receive both TEXEL_FAIL and LOD_WARNING: TEXEL_FAIL takes precedence.
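The status DWORD layout given above can be unpacked as in the sketch below. The function name is hypothetical, and treating the LOD field as occupying the full upper 16 bits is an assumption; the text only states that LOD starts at bit 16.

```python
def decode_prt_status(status):
    """Unpack TEXEL_FAIL | (LOD_WARNING << 1) | (LOD << 16)."""
    return {
        "texel_fail": status & 1,
        "lod_warning": (status >> 1) & 1,
        # Assumption: LOD uses bits [31:16]; the field width is not stated.
        "lod": (status >> 16) & 0xFFFF,
    }
```

Because TEXEL_FAIL takes precedence, a well-formed status word is not expected to have both low bits set.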

10.1.2. D16 Instructions
Load-format and store-format instructions also come in a "d16" variant. For stores, each 32-bit VGPR holds two
16-bit data elements that are passed to the texture unit. The texture unit converts them to the texture format
before writing to memory. For loads, data returned from the texture unit is converted to 16 bits, and a pair of
data are stored in each 32-bit VGPR (LSBs first, then MSBs). Each DMASK bit represents an individual 16-bit
element; so, when DMASK=0011 for an image-load, two 16-bit components are loaded into a single 32-bit
VGPR.

10.1.3. A16 Instructions
The A16 instruction bit indicates that the address components are 16 bits instead of the usual 32 bits.
Components are packed such that the first address component goes into the low 16 bits ([15:0]), and the next
into the high 16 bits ([31:16]).
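A minimal sketch of this packing, assuming the components are already available as raw 16-bit patterns (unsigned integers for ops without a sampler, or the bit pattern of a 16-bit float for sampler ops). The function name is hypothetical, and zero-filling an unused high half is an assumption made for illustration.

```python
def pack_a16(components):
    """Pack 16-bit address components two per DWORD, first one in [15:0]."""
    dwords = []
    for i in range(0, len(components), 2):
        lo = components[i] & 0xFFFF
        # Assumption: an unused upper half is filled with zero.
        hi = components[i + 1] & 0xFFFF if i + 1 < len(components) else 0
        dwords.append(lo | (hi << 16))
    return dwords
```

The same low-then-high ordering applies to 16-bit derivatives for the G16 instructions.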

10.1.4. G16 Instructions
The instructions with "G16" in the name mean the user provided derivatives are 16 bits instead of the usual 32
bits. Derivatives are packed such that the first derivative goes into the low 16 bits ([15:0]), and the next into the
high 16 bits ([31:16]).

10.1.5. Image Non-Sequential Address (NSA)
To avoid having many V_MOV instructions to pack image address VGPRs together, MIMG supports a "Non
Sequential Address" version of the instruction where the VGPR of every address component is uniquely
defined. Data components are still packed. This format creates a larger instruction word, which can be up to 3
DWORDs long. The first address goes in the VADDR field, and subsequent addresses go into ADDR1-4. This 3
DWORD form of the instruction can supply up to 5 addresses.


NSA allows an image instruction to specify up to 5 unique address VGPRs. These are the rules for how
instructions requiring more than 5 addresses are handled with NSA. It is permissible to use non-NSA mode
where all addresses are in sequential VGPRs.
• VADDR provides the first address component
• ADDR1 provides the second address component
• ADDR2 provides the third address component
• ADDR3 provides the fourth address component
• ADDR4 provides all additional components in sequential VGPRs: VADDR4, VADDR4+1, etc.
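The NSA rules listed above can be sketched as a mapping from the five instruction fields to the VGPR holding each address component. The function name and the list-based return value are hypothetical; VGPRs are represented by their register numbers.

```python
def nsa_address_vgprs(vaddr, addr1, addr2, addr3, addr4, num_addresses):
    """Return the VGPR number that supplies each address component under NSA."""
    if num_addresses < 1:
        raise ValueError("at least one address component is required")
    # The first four components each come from their own field.
    vgprs = [vaddr, addr1, addr2, addr3][:min(num_addresses, 4)]
    # The fifth and later components are sequential VGPRs starting at ADDR4.
    vgprs += [addr4 + i for i in range(max(num_addresses - 4, 0))]
    return vgprs
```

For example, an instruction needing six addresses with ADDR4 = 30 reads its fifth and sixth components from VGPR 30 and VGPR 31.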
When using 16-bit addresses, each VGPR holds a pair of addresses and these cannot be located in different
VGPRs. The lower numbered 16-bit value is in the LSBs of the VGPR.
For Ray Tracing, the VGPRs are divided up into 5 groups of VGPRs. The VGPRs within each group must be
contiguous, but the groups can be scattered. The packing is different when A16=1 because RayDir.Z and
RayInvDir.x are in the same DWORD. In A16 mode, the RayDir and RayInvDir are merged into 3 VGPRs but in a
different order: RayDir and RayInvDir per component share a VGPR.

10.2. Image Opcodes with No Sampler
For image opcodes with no sampler, all VGPR address values are taken as uint.
For cubemaps, face_id = slice * 6 + face.
MSAA surfaces support only load, store and atomics; not load-mip or store-mip.
The table below shows the contents of address VGPRs for the various image opcodes.
Opcode | a16[0] | type | acnt | VGPRn[31:0] | VGPRn+1[31:0] | VGPRn+2[31:0] | VGPRn+3[31:0]
GET_RESINFO | x | Any | 0 | mipid | | |
MSAA_LOAD | 0 | 2D MSAA | 2 | s | t | fragid |
 | 0 | 2D Array MSAA | 3 | s | t | slice | fragid
 | 1 | 2D MSAA | 2 | t, s | -, fragid | |
 | 1 | 2D Array MSAA | 3 | t, s | fragid, slice | |
LOAD, LOAD_PCK, LOAD_PCK_SGN, STORE, STORE_PCK | 0 | 1D | 0 | s | | |
 | 0 | 2D | 1 | s | t | |
 | 0 | 3D | 2 | s | t | r |
 | 0 | Cube/Cube Array | 2 | s | t | face |
 | 0 | 1D Array | 1 | s | slice | |
 | 0 | 2D Array | 2 | s | t | slice |
 | 0 | 2D MSAA | 2 | s | t | fragid |
 | 0 | 2D Array MSAA | 3 | s | t | slice | fragid
 | 1 | 1D | 0 | -, s | | |
 | 1 | 2D | 1 | t, s | | |
 | 1 | 3D | 2 | t, s | -, r | |
 | 1 | Cube/Cube Array | 2 | t, s | -, face | |
 | 1 | 1D Array | 1 | slice, s | | |
 | 1 | 2D Array | 2 | t, s | -, slice | |
 | 1 | 2D MSAA | 2 | t, s | -, fragid | |
 | 1 | 2D Array MSAA | 3 | t, s | fragid, slice | |
ATOMIC | 0 | 1D | 0 | s | | |
 | 0 | 2D | 1 | s | t | |
 | 0 | 3D | 2 | s | t | r |
 | 0 | 1D Array | 1 | s | slice | |
 | 0 | 2D Array | 2 | s | t | slice |
 | 0 | 2D MSAA | 2 | s | t | fragid |
 | 0 | 2D Array MSAA | 3 | s | t | slice | fragid
 | 1 | 1D | 0 | -, s | | |
 | 1 | 2D | 1 | t, s | | |
 | 1 | 3D | 2 | t, s | -, r | |
 | 1 | 1D Array | 1 | slice, s | | |
 | 1 | 2D Array | 2 | t, s | -, slice | |
 | 1 | 2D MSAA | 2 | t, s | -, fragid | |
 | 1 | 2D Array MSAA | 3 | t, s | fragid, slice | |
LOAD_MIP, LOAD_MIP_PCK, LOAD_MIP_PCK_SGN, STORE_MIP, STORE_MIP_PCK | 0 | 1D | 1 | s | mipid | |
 | 0 | 2D | 2 | s | t | mipid |
 | 0 | 3D | 3 | s | t | r | mipid
 | 0 | Cube/Cube Array | 3 | s | t | face | mipid
 | 0 | 1D Array | 2 | s | slice | mipid |
 | 0 | 2D Array | 3 | s | t | slice | mipid
 | 1 | 1D | 1 | mipid, s | | |
 | 1 | 2D | 2 | t, s | -, mipid | |
 | 1 | 3D | 3 | t, s | mipid, r | |
 | 1 | Cube/Cube Array | 3 | t, s | mipid, face | |
 | 1 | 1D Array | 2 | slice, s | -, mipid | |
 | 1 | 2D Array | 3 | t, s | mipid, slice | |

• Image_Load : image_load, image_load_mip, image_load_{pck, pck_sgn, mip_pck, mip_pck_sgn}
• Image_Store: image_store, image_store_mip
• Image_Atomic_*: swap, cmpswap, add, sub, {u,s}{min,max}, and, or, xor, inc, dec.
"ACNT" is the Address Count: the number of VGPRs that supply the "body" of the address, derived from the
instruction’s DIM field and the opcode.

10.3. Image Opcodes with a Sampler
Opcodes with a sampler: all VGPR address values are taken as FLOAT except for Texel-offset which are UINT.
For cubemaps, face_id = slice * 8 + face.
(Note that the "*8" differs from the non-sampler case which is "*6").
Certain sample and gather opcodes require additional values from VGPRs beyond what is shown in the table
below. These values are: offset, bias, z-compare and gradients. Please see the next section for details. MSAA
surfaces do not support sample or gather4 operations.
Opcode | a16[0] | acnt | type | VGPRn[31:0] | VGPRn+1[31:0] | VGPRn+2[31:0] | VGPRn+3[31:0]
Sample, GetLod | 0 | 0 | 1D | s | | |
 | 0 | 1 | 2D | s | t | |
 | 0 | 2 | 3D | s | t | r |
 | 0 | 2 | Cube(Array) | s | t | face |
 | 0 | 1 | 1D Array | s | slice | |
 | 0 | 2 | 2D Array | s | t | slice |
 | 1 | 0 | 1D | -, s | | |
 | 1 | 1 | 2D | t, s | | |
 | 1 | 2 | 3D | t, s | -, r | |
 | 1 | 2 | Cube(Array) | t, s | -, face | |
 | 1 | 1 | 1D Array | slice, s | | |
 | 1 | 2 | 2D Array | t, s | -, slice | |
Sample "_L" | 0 | 1 | 1D | s | lod | |
 | 0 | 2 | 2D | s | t | lod |
 | 0 | 3 | 3D | s | t | r | lod
 | 0 | 3 | Cube(Array) | s | t | face | lod
 | 0 | 2 | 1D Array | s | slice | lod |
 | 0 | 3 | 2D Array | s | t | slice | lod
 | 1 | 1 | 1D | lod, s | | |
 | 1 | 2 | 2D | t, s | -, lod | |
 | 1 | 3 | 3D | t, s | lod, r | |
 | 1 | 3 | Cube(Array) | t, s | lod, face | |
 | 1 | 2 | 1D Array | slice, s | -, lod | |
 | 1 | 3 | 2D Array | t, s | lod, slice | |
Sample "_CL" | 0 | 1 | 1D | s | clamp | |
 | 0 | 2 | 2D | s | t | clamp |
 | 0 | 3 | 3D | s | t | r | clamp
 | 0 | 3 | Cube(Array) | s | t | face | clamp
 | 0 | 2 | 1D Array | s | slice | clamp |
 | 0 | 3 | 2D Array | s | t | slice | clamp
 | 1 | 1 | 1D | clamp, s | | |
 | 1 | 2 | 2D | t, s | -, clamp | |
 | 1 | 3 | 3D | t, s | clamp, r | |
 | 1 | 3 | Cube(Array) | t, s | clamp, face | |
 | 1 | 2 | 1D Array | slice, s | -, clamp | |
 | 1 | 3 | 2D Array | t, s | clamp, slice | |
Gather | 0 | 1 | 2D | s | t | |
 | 0 | 2 | Cube(Array) | s | t | face |
 | 0 | 2 | 2D Array | s | t | slice |
 | 1 | 1 | 2D | t, s | | |
 | 1 | 2 | Cube(Array) | t, s | -, face | |
 | 1 | 2 | 2D Array | t, s | -, slice | |
Gather "_L" | 0 | 2 | 2D | s | t | lod |
 | 0 | 3 | Cube(Array) | s | t | face | lod
 | 0 | 3 | 2D Array | s | t | slice | lod
 | 1 | 2 | 2D | t, s | -, lod | |
 | 1 | 3 | Cube(Array) | t, s | lod, face | |
 | 1 | 3 | 2D Array | t, s | lod, slice | |
Gather "_CL" | 0 | 2 | 2D | s | t | clamp |
 | 0 | 3 | Cube(Array) | s | t | face | clamp
 | 0 | 3 | 2D Array | s | t | slice | clamp
 | 1 | 2 | 2D | t, s | -, clamp | |
 | 1 | 3 | Cube(Array) | t, s | clamp, face | |
 | 1 | 3 | 2D Array | t, s | clamp, slice | |


The table below lists and briefly describes the legal suffixes for image instructions:
Table 49. Sample Instruction Suffix Key
Suffix | Meaning | Extra Addresses | Description
_L | LOD | - | LOD is used instead of computed LOD.
_B | LOD BIAS | 1: lod bias | Add this BIAS to the computed LOD.
_CL | LOD CLAMP | - | Clamp the computed LOD to be no larger than this value.
_D | Derivative | 2,4 or 6: slopes | Send dx/dv, dx/dy, etc. slopes to be used in LOD computation.
_LZ | Level 0 | - | Force use of MIP level 0.
_C | PCF | 1: z-comp | Percentage closer filtering.
_O | Offset | 1: offsets | Send X, Y, Z integer offsets (packed into 1 DWORD) to offset XYZ address.
_G16 | Gradient 16b | - | Gradients are 16-bits instead of 32-bits, packed 2 gradients per VGPR (dX in low 16bits, dY in high 16bits).

10.4. VGPR Usage
Address: The address consists of up to 5 parts: { offset } { bias } { z-compare } { derivative } { body }
These are all packed into consecutive VGPRs, (may be non-consecutive if "NSA" is used), and can consist of up to
12 values.
• Offset: SAMPLE*O*, GATHER*O*
1 DWORD of 'offset_xyz' . The offsets are 6-bit signed integers: X=[5:0], Y=[13:8], Z=[21:16]
• Bias: SAMPLE*B*, GATHER*B*. 1 DWORD float.
• Z-compare: SAMPLE*C*, GATHER*C*. 1 DWORD.
• Derivatives (SAMPLE_D): 2, 4 or 6 DWORDs - these are packed 1 DWORD per derivative as shown below (F32).
• Body: One to four DWORDs, as defined by the table: Image Opcodes with a Sampler
Address components are X,Y,Z,W with X in VGPR[M], Y in VGPR[M]+1, etc.
The number of components in "body" is the value of the ACNT field in the table, plus one.
Note: Bias and Derivatives are mutually exclusive - the shader can use one or the other, but not both.
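The 'offset_xyz' DWORD layout given in the Offset bullet above (6-bit signed X in [5:0], Y in [13:8], Z in [21:16]) can be sketched as follows. The function name is hypothetical; the range check follows from the 6-bit two's-complement width.

```python
def pack_texel_offsets(x, y, z=0):
    """Pack three 6-bit signed texel offsets into one 'offset_xyz' DWORD."""
    for value in (x, y, z):
        if not -32 <= value <= 31:
            raise ValueError("texel offsets must fit in a 6-bit signed integer")
    # Masking with 0x3F yields the 6-bit two's-complement encoding.
    return (x & 0x3F) | ((y & 0x3F) << 8) | ((z & 0x3F) << 16)
```

Bits [7:6], [15:14] and [31:22] are left at zero in this sketch.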
32-bit derivatives:
Image Dim | VGPR N | N+1 | N+2 | N+3 | N+4 | N+5
1D | dx/dh | dx/dv | - | - | - | -
2D/cube | dx/dh | dy/dh | dx/dv | dy/dv | - | -
3D | dx/dh | dy/dh | dz/dh | dx/dv | dy/dv | dz/dv

16-bit derivatives:
Image Type | VGPR_D | VGPR_D+1 | VGPR_D+2 | VGPR_D+3
1 (1D, 1D Array) | 16’hx, dx/dh | 16’hx, dx/dv | - | -
2 (2D, 2D Array, Cubemap) | dy/dh, dx/dh | dy/dv, dx/dv | - | -
3 (3D) | dy/dh, dx/dh | 16’hx, dz/dh | dy/dv, dx/dv | 16’hx, dz/dv

The "A16" instruction bit specifies that address components are 16 bits instead of the usual 32 bits.
Data:
Data is stored from or returned to 1-4 consecutive VGPRs. The amount of data loaded or stored is completely
determined by the DMASK field of the instruction.
Loads
DMASK specifies which elements of the resource are returned to consecutive VGPRs. The texture system
loads data from memory and based on the data format expands it to a canonical RGBA form, filling in
values for missing components based on T#.dst_sel. Then DMASK is applied and only those components
selected are returned to the shader.
Stores
When writing an image object, it is only possible to write an entire element (all components) - not only
individual components. The components come from consecutive VGPRs, and the texture system fills in the
value zero for any missing components of the image’s data format and ignores any values that are not part
of the stored data format. For example, if the DMASK=1001, the shader sends Red from VGPR_N and Alpha from
VGPR_N+1 to the texture unit. If the image object is RGB, the texel is overwritten with Red from VGPR_N,
Green and Blue set to zero, and Alpha from the shader ignored. For D16=1, the DMASK has 1 bit set per 16 bits of
data to be written from VGPRs to memory. The position of the bits in DMASK is irrelevant, only the number
of bits set to 1.
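The DMASK=1001 example above generalizes to the sketch below, which lists the VGPR that carries each enabled component for the 32-bit (D16=0) case. The function name and the dictionary return value are hypothetical.

```python
def dmask_vgpr_layout(dmask, vdata):
    """Map each enabled component (bit 0=R .. bit 3=A) to its VGPR number.

    Enabled components occupy consecutive VGPRs starting at vdata (D16=0).
    """
    layout = {}
    reg = vdata
    for bit, name in enumerate("RGBA"):
        if dmask & (1 << bit):
            layout[name] = reg
            reg += 1
    return layout
```

With D16=1 two components share each VGPR, so this one-register-per-component mapping does not apply.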
"D16" instructions
Load and store instructions also come in a "d16" variant. For stores, each 32bit VGPR holds two 16bit data
elements that are passed to the texture unit which in turn, converts to the texture format before writing to
memory. For loads, data returned from the texture unit is converted to 16 bits and a pair of data are stored
in each 32bit VGPR (LSBs first, then MSBs). If there is only one component, the data goes into the lower half
of the VGPR unless the "HI" instruction variant is used in which case the high-half of the VGPR is loaded
with data.


Atomics
Image atomic operations are supported only on 32- and 64-bit-per-pixel surfaces. The surface data format is
specified in the resource constant. Atomic operations treat the element as a single component of 32 or 64
bits. For atomic operations, DMASK is set to the number of VGPRs (DWORDs) to send to the texture unit.
Legal DMASK values for atomic image operations are listed below; all other values of DMASK are illegal.
• 0x1 = 32bit atomics except cmpswap
• 0x3 = 32bit atomic cmpswap
• 0x3 = 64bit atomics except cmpswap
• 0xf = 64bit atomic cmpswap
Atomics with Return: Data is read out of the VGPR(s), starting at VDATA, to supply to the atomic
operation. If the atomic returns a value to VGPRs, that data is returned to those same VGPRs starting at
VDATA.
The DMASK must be compatible with the resource’s data format.
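The legal atomic DMASK values can be captured in a small lookup, sketched below with hypothetical names. The number of bits set in the returned mask equals the number of DWORDs sent to the texture unit.

```python
def atomic_dmask(bits_per_pixel, is_cmpswap):
    """Return the legal DMASK for an image atomic, per the list above."""
    table = {
        (32, False): 0x1,  # 32-bit atomics except cmpswap
        (32, True): 0x3,   # 32-bit cmpswap
        (64, False): 0x3,  # 64-bit atomics except cmpswap
        (64, True): 0xF,   # 64-bit cmpswap
    }
    try:
        return table[(bits_per_pixel, is_cmpswap)]
    except KeyError:
        raise ValueError("image atomics support only 32 or 64 bits per pixel")
```

Cmpswap needs twice as many DWORDs as the other atomics of the same width, which is why its mask has twice as many bits set.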
Denormals in Floats
Sample ops flush denormals, and loads do not modify denormals.

10.4.1. Data format in VGPRs
Data in VGPRs sent to texture (stores) or returned from texture (loads) is in one of a few standard formats, and
the texture unit converts to/from the memory format.
FORMAT | VGPR data format | If D16==1
SINT | signed 32-bit integer | 16 bit signed int
UINT | unsigned 32-bit integer | 16 bit unsigned int
others | 32-bit float | 16 bit float
Atomics | depends on opcode: uint or float | -
ASTC data formats | 32-bit float | -

10.5. Image Resource
The image resource (also referred to as T#) defines the location of the image buffer in memory, its dimensions,
tiling, and data format. These resources are stored in four or eight consecutive SGPRs and are read by MIMG
instructions. All undefined or reserved bits must be set to zero unless otherwise specified.
Table 50. Image Resource Definition
Bits | Size | Name | Comments
128-bit Resource: 1D-tex, 2d-tex, 2d-msaa (multi-sample anti-aliasing)
39:0 | 40 | base address | 256-byte aligned (represents bits 47:8).
47 | 1 | Big Page | 0 = No page size override, 1 = coalesce page translation requests to 64kB granularity. Use only when entire resource uses pages 64kB or greater.
51:48 | 4 | max mip | MSAA resources: holds Log2(number of samples); others holds: MipLevels-1. This describes the resource, not the resource view.
59:52 | 8 | format | Memory Data format
75:62 | 14 | width | width-1 of mip 0 in texels
91:78 | 14 | height | height-1 of mip 0 in texels
98:96 | 3 | dst_sel_x | 0 = 0, 1 = 1, 4 = R, 5 = G, 6 = B, 7 = A.
101:99 | 3 | dst_sel_y | Same as dst_sel_x.
104:102 | 3 | dst_sel_z | Same as dst_sel_x.
107:105 | 3 | dst_sel_w | Same as dst_sel_x.
111:108 | 4 | base level | largest mip level in the resource view. For MSAA, this should be set to 0
115:112 | 4 | last level | smallest mip level in resource view. For MSAA, holds log2(number of samples).
123:121 | 3 | BC Swizzle | Specifies channel ordering for border color data independent of the T# dst_sel_*s. Internal xyzw channels get the following border color channels as stored in memory. 0=xyzw, 1=xwyz, 2=wzyx, 3=wxyz, 4=zyxw, 5=yxwz
127:124 | 4 | type | 0 = buf, 8 = 1d, 9 = 2d, 10 = 3d, 11 = cube, 12 = 1d-array, 13 = 2d-array, 14 = 2d-msaa, 15 = 2d-msaa-array. 1-7 are reserved.
256-bit Resource: 1d-array, 2d-array, 3d, cubemap, MSAA
140:128 | 13 | depth | Depth-1 of Mip0 for a 3D map; last array slice for a 2D-array or 1D-array or cube-map; (pitch-1)[12:0] of mip0 for 1D, 2D, 2D-MSAA resources if pitch > width.
141 | 1 | Pitch[13] | (pitch-1)[13] of mip0 for 1D, 2D and 2D-MSAA.
156:144 | 13 | base array | First slice in array of the resource view.
163:160 | 4 | array pitch | For 3D, bit 0 indicates SRV or UAV: 0: SRV (base_array ignored, depth w.r.t. base map); 1: UAV (base_array and depth are first and last layer in view, and w.r.t. mip level specified)
179:168 | 12 | min lod warn | feedback trigger for LOD, u4.8 format
183 | 1 | corner samples mod | Describes how texels were generated in the resource. 0 = center sampled, 1 = corner sampled.
198:187 | 12 | min_lod | smallest LOD allowed for PRTs, u4.8 format
202 | 1 | Iterate 256 | Indicates that compressed tiles in this surface have been flushed out to every 256B of the tile. Applies only to MSAA depth surfaces.
211 | 1 | Meta Pipe Aligned | Maintains pipe alignment in metadata addressing (DCC and tiling)
213 | 1 | Compression Enable | enable delta color compression (DCC)
214 | 1 | Alpha is on MSB | Set to 1 if the surface’s component swap is not reversed (DCC)
215 | 1 | Color Transform | Auto=0, none=1 (DCC)
255:216 | 40 | Meta Data Address | Upper bits of meta-data address (DCC) [47:8]

A resource that is all zeros is treated as 'unbound': it returns all zeros and does not generate a memory
transaction. The "resource-level" field is ignored when checking for "all zeros".

10.6. Image Sampler
The sampler resource (also referred to as S#) defines what operations to perform on texture map data loaded
by sample instructions. These are primarily address clamping and filter options. Sampler resources are
defined in four consecutive SGPRs and are supplied to the texture cache with every sample instruction.
Table 51. Image Sampler Definition

10.6. Image Sampler

104 of 597

"RDNA3" Instruction Set Architecture

Bits

Size

Name

Description

2:0

3

clamp x

5:3

3

clamp y

8:6

3

clamp z

Clamp/wrap mode:
0: Wrap
1: Mirror
2: ClampLastTexel
3: MirrorOnceLastTexel
4: ClampHalfBorder
5: MirrorOnceHalfBorder
6: ClampBorder
7: MirrorOnceBorder

11:9

3

max aniso ratio

0 = 1:1
1 = 2:1
2 = 4:1
3 = 8:1
4 = 16:1

14:12

3

depth compare func

0: Never
1: Less
2: Equal
3: Less than or equal
4: Greater
5: Not equal
6: Greater than or equal
7: Always

15

1

force unnormalized

Force address cords to be unorm: 0 = address coordinates are
normalized, in [0,1); 1 = address coordinates are unnormalized in the
range [0,dim).

18:16

3

aniso threshold

threshold under which floor(aniso ratio) determines number of samples
and step size

19

1

mc coord trunc

enables bilinear blend fraction truncation to 1 bit for motion
compensation

20

1

force degamma

force format to srgb if data_format allows

26:21

6

aniso bias

6 bits, in u1.5 format.

27

1

trunc coord

selects texel coordinate rounding or truncation.

28

1

disable cube wrap

disables seamless DX10 cubemaps, allows cubemaps to clamp according
to clamp_x and clamp_y fields

30:29

2

filter_mode

0 = Blend (lerp); 1 = min, 2 = max.

31

1

skip degamma

disabled degamma (sRGB→Linear) conversion.

43:32

12

min lod

minimum LOD ins resource view space (0.0 = T#.base_level) u4.8.

55:44

12

max lod

maximum LOD ins resource view space

77:64

14

lod bias

LOD bias s6.8.

83:78

6

lod bias sec

bias (s2.4) added to computed LOD

85:84

2

xy mag filter

Magnification filter: 0=point, 1=bilinear, 2=aniso-point, 3=aniso-linear

87:86

2

xy min filter

Minification filter: 0=point, 1=bilinear, 2=aniso-point, 3=aniso-linear

89:88

2

z filter

Volume Filter: 0=none (use XY min/mag filter), 1=point, 2=linear

91:90

2

mip filter

Mip level filter: 0=none (disable mipmapping,use base-leve), 1=point,
2=linear

94

1

Blend PRT

For PRT fetches, bled the PRT_default valu for non-resident levels

107:96

12

border color ptr

10.6. Image Sampler

105 of 597

"RDNA3" Instruction Set Architecture

Bits

Size

Name

Description

127:126

2

border color type

Opaque-black, transparent-black, white, use border color ptr.
0: Transparent Black
1: Opaque Black
2: Opaque White
3: Register (User border color, pointed to by border_color_ptr)
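The fixed-point fields in the sampler table above can be decoded with a few lines of bit manipulation. The following is a minimal sketch, not a definitive implementation: it assumes the 128-bit sampler is held in one Python integer with bit 0 as the least-significant bit, and that the signed formats (s6.8, s2.4) are two's complement. The helper names (`bits`, `decode_sampler_fixed_point`) are hypothetical.

```python
def bits(value, hi, lo):
    """Return value[hi:lo] (inclusive bit range) as an unsigned integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def unsigned_fixed(raw, frac_bits):
    """Decode an unsigned fixed-point value with frac_bits fractional bits."""
    return raw / (1 << frac_bits)


def signed_fixed(raw, total_bits, frac_bits):
    """Decode a signed fixed-point value (two's complement is an assumption)."""
    if raw & (1 << (total_bits - 1)):
        raw -= 1 << total_bits
    return raw / (1 << frac_bits)


def decode_sampler_fixed_point(sampler):
    """Decode the fixed-point fields whose formats are stated in the table."""
    return {
        "aniso_bias": unsigned_fixed(bits(sampler, 26, 21), 5),       # u1.5
        "min_lod": unsigned_fixed(bits(sampler, 43, 32), 8),          # u4.8
        "lod_bias": signed_fixed(bits(sampler, 77, 64), 14, 8),       # s6.8
        "lod_bias_sec": signed_fixed(bits(sampler, 83, 78), 6, 4),    # s2.4
    }
```

For example, a raw min lod field of 0x180 decodes to 1.5, and a raw lod bias field of 0x3FFF decodes to -1/256 under the two's complement assumption.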

10.7. Data Formats
The table below details all the data formats that can be used by image and buffer resources.
Table 52. Buffer and Image Data Formats
#

Buffer and Image Formats

#

Buffer and Image Formats

#

Image Formats

0

INVALID

31

11_11_10_FLOAT

64

8_SRGB

1

8_UNORM

32

10_10_10_2_UNORM

65

8_8_SRGB

2

8_SNORM

33

10_10_10_2_SNORM

66

8_8_8_8_SRGB

3

8_USCALED

34

10_10_10_2_UINT

67

5_9_9_9_FLOAT

4

8_SSCALED

35

10_10_10_2_SINT

68

5_6_5_UNORM

5

8_UINT

36

2_10_10_10_UNORM

69

1_5_5_5_UNORM

6

8_SINT

37

2_10_10_10_SNORM

70

5_5_5_1_UNORM

7

16_UNORM

38

2_10_10_10_USCALED

71

4_4_4_4_UNORM

8

16_SNORM

39

2_10_10_10_SSCALED

72

4_4_UNORM

9

16_USCALED

40

2_10_10_10_UINT

73

1_UNORM

10

16_SSCALED

41

2_10_10_10_SINT

74

1_REVERSED_UNORM

11

16_UINT

42

8_8_8_8_UNORM

75

32_FLOAT_CLAMP

12

16_SINT

43

8_8_8_8_SNORM

76

8_24_UNORM

13

16_FLOAT

44

8_8_8_8_USCALED

77

8_24_UINT

14

8_8_UNORM

45

8_8_8_8_SSCALED

78

24_8_UNORM

15

8_8_SNORM

46

8_8_8_8_UINT

79

24_8_UINT

16

8_8_USCALED

47

8_8_8_8_SINT

80

X24_8_32_UINT

17

8_8_SSCALED

48

32_32_UINT

81

X24_8_32_FLOAT

18

8_8_UINT

49

32_32_SINT

82

GB_GR_UNORM

19

8_8_SINT

50

32_32_FLOAT

83

GB_GR_SNORM

20

32_UINT

51

16_16_16_16_UNORM

84

GB_GR_UINT

21

32_SINT

52

16_16_16_16_SNORM

85

GB_GR_SRGB

22

32_FLOAT

53

16_16_16_16_USCALED

86

BG_RG_UNORM

23

16_16_UNORM

54

16_16_16_16_SSCALED

87

BG_RG_SNORM

24

16_16_SNORM

55

16_16_16_16_UINT

88

BG_RG_UINT

25

16_16_USCALED

56

16_16_16_16_SINT

89

BG_RG_SRGB

26

16_16_SSCALED

57

16_16_16_16_FLOAT

27

16_16_UINT

58

32_32_32_UINT

28

16_16_SINT

59

32_32_32_SINT

109

BC1_UNORM

29

16_16_FLOAT

60

32_32_32_FLOAT

110

BC1_SRGB

30

10_11_11_FLOAT

61

32_32_32_32_UINT

111

BC2_UNORM

62

32_32_32_32_SINT

112

BC2_SRGB

63

32_32_32_32_FLOAT

113

BC3_UNORM

114

BC3_SRGB

115

BC4_UNORM

Compressed Formats

116

BC4_SNORM

117

BC5_UNORM

118

BC5_SNORM

119

BC6_UFLOAT

120

BC6_SFLOAT

121

BC7_UNORM

122

BC7_SRGB

205

YCBCR_UNORM

206

YCBCR_SRGB

227

6E4_FLOAT

228

7E3_FLOAT

10.8. Vector Memory Instruction Data Dependencies
When a VM instruction is issued, it schedules the reads of address and store-data from VGPRs to be sent to the
texture unit. Any ALU instruction that attempts to write this data before it has been sent to the texture unit is
stalled.
It is the shader developer’s responsibility to avoid data hazards associated with VMEM instructions. This includes
waiting for VMEM load instructions to complete before reading data fetched from the cache (VMCNT and VSCNT).
This is explained in the section: Data Dependency Resolution.

10.9. Ray Tracing
Ray Tracing support includes the following instructions:
• IMAGE_BVH_INTERSECT_RAY
• IMAGE_BVH64_INTERSECT_RAY
These instructions receive ray data from the VGPRs and fetch BVH (Bounding Volume Hierarchy) from
memory.
• Box BVH nodes perform 4x Ray/Box intersection, sort the 4 children based on intersection distance and
return the child pointers and hit status.
• Triangle nodes perform 1 Ray/Triangle intersection test and return the intersection point and triangle ID.
The two instructions are identical, except that the "64" version supports a 64-bit address while the normal
version supports only a 32-bit address. Both instructions can use the "A16" instruction field to reduce some (but
not all) of the address components to 16 bits (from 32). The components that are reduced are ray_dir and ray_inv_dir.

10.9.1. Instruction definition and fields
image_bvh_intersect_ray vgpr_d[4], vgpr_a[11], sgpr_r[4]
image_bvh_intersect_ray vgpr_d[4], vgpr_a[8], sgpr_r[4] A16=1
image_bvh64_intersect_ray vgpr_d[4], vgpr_a[12], sgpr_r[4]
image_bvh64_intersect_ray vgpr_d[4], vgpr_a[9], sgpr_r[4] A16=1

Table 53. Ray Tracing VGPR Contents
VGPR_A | BVH A16=0 | BVH A16=1 | BVH64 A16=0 | BVH64 A16=1
0 | node_pointer (u32) | node_pointer (u32) | node_pointer [31:0] (u32) | node_pointer [31:0] (u32)
1 | ray_extent (f32) | ray_extent (f32) | node_pointer [63:32] (u32) | node_pointer [63:32] (u32)
2 | ray_origin.x (f32) | ray_origin.x (f32) | ray_extent (f32) | ray_extent (f32)
3 | ray_origin.y (f32) | ray_origin.y (f32) | ray_origin.x (f32) | ray_origin.x (f32)
4 | ray_origin.z (f32) | ray_origin.z (f32) | ray_origin.y (f32) | ray_origin.y (f32)
5 | ray_dir.x (f32) | [15:0] = ray_dir.x (f16), [31:16] = ray_inv_dir.x (f16) | ray_origin.z (f32) | ray_origin.z (f32)
6 | ray_dir.y (f32) | [15:0] = ray_dir.y (f16), [31:16] = ray_inv_dir.y (f16) | ray_dir.x (f32) | [15:0] = ray_dir.x (f16), [31:16] = ray_inv_dir.x (f16)
7 | ray_dir.z (f32) | [15:0] = ray_dir.z (f16), [31:16] = ray_inv_dir.z (f16) | ray_dir.y (f32) | [15:0] = ray_dir.y (f16), [31:16] = ray_inv_dir.y (f16)
8 | ray_inv_dir.x (f32) | unused | ray_dir.z (f32) | [15:0] = ray_dir.z (f16), [31:16] = ray_inv_dir.z (f16)
9 | ray_inv_dir.y (f32) | unused | ray_inv_dir.x (f32) | unused
10 | ray_inv_dir.z (f32) | unused | ray_inv_dir.y (f32) | unused
11 | unused | unused | ray_inv_dir.z (f32) | unused
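The VGPR layouts in Table 53 can be illustrated by building the address DWORDs on the host. This is a sketch under stated assumptions: the function name `pack_ray_vgprs` and its parameters are hypothetical, floats are encoded as IEEE 754 binary32/binary16 via Python's `struct` module, and the result is only the list of used DWORDs in VGPR_A order (unused trailing registers are omitted).

```python
import struct


def f32_bits(x):
    """Raw 32-bit pattern of a float encoded as IEEE 754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def f16_bits(x):
    """Raw 16-bit pattern of a float encoded as IEEE 754 binary16."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def pack_ray_vgprs(node_pointer, ray_extent, origin, direction, inv_direction,
                   bvh64=False, a16=False):
    """Return the used address DWORDs in the order listed in Table 53."""
    dwords = [node_pointer & 0xFFFFFFFF]
    if bvh64:
        dwords.append((node_pointer >> 32) & 0xFFFFFFFF)
    dwords.append(f32_bits(ray_extent))
    dwords += [f32_bits(c) for c in origin]
    if a16:
        # [15:0] = ray_dir component, [31:16] = ray_inv_dir component
        dwords += [f16_bits(d) | (f16_bits(i) << 16)
                   for d, i in zip(direction, inv_direction)]
    else:
        dwords += [f32_bits(c) for c in direction]
        dwords += [f32_bits(c) for c in inv_direction]
    return dwords
```

The resulting lengths (11, 8, 12 and 9 DWORDs) match the vgpr_a[] sizes given in the instruction definitions.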

Vgpr_d[4] are the destination VGPRs of the results of intersection testing. The values returned here are
different depending on the type of BVH node that was fetched. For box nodes the results contain the 4 pointers
of the children boxes in intersection time sorted order. For triangle BVH nodes the results contain the
intersection time and triangle ID of the triangle tested.
Sgpr_r[4] is the texture descriptor for the operation. The instruction is encoded with use_128bit_resource=1.
Restrictions on image_bvh instructions
• DMASK must be set to 0xf (instruction returns all four DWORDs)
• D16 must be set to 0 (16 bit return data is not supported)
• R128 must be set to 1 (256 bit T#s are not supported)
• UNRM must be set to 1 (only unnormalized coordinates are supported)
• DIM must be set to 0 (BVH textures are 1D)
• LWE must be set to 0 (LOD warn is not supported)
• TFE must be set to 0 (no support for writing out the extra DWORD for the PRT hit status)
• SSAMP must be set to 0 (just a placeholder, since samplers are not used by the instruction)
The return order settings of the BVH ops are ignored; instead, they use the in-order load return queue.

10.9.2. Using BVH with NSA
When using the BVH instruction with Non-Sequential Address, the BVH components fall into 5 groups each of
which is specified by a NSA address VGPR.
• node pointer : 1 vgpr
• ray extent : 1 vgpr

• ray origin : 3 consecutive vgprs
• ray dir : 3 consecutive vgprs
• ray inv dir : 3 consecutive vgprs (paired with ray-dir for 16-bit addresses)
NSA and A16:
• A16=0, MIMG-NSA specifies 5 groups of consecutive VGPRs: node_pointer, ray_extent, ray_origin, ray_dir
and ray_inv_dir.
• A16=1, MIMG-NSA specifies 4 groups. In the above set, ray_dir and ray_inv_dir are packed into 3 VGPRs.
When using A16=1 mode, ray-dir and ray-inv-dir share the same vgprs and ADDR4 is unused.

10.9.3. Texture Resource Definition
The T# used with these instructions is different from other image instructions.
Table 54. BVH Resource Definition
Field

Bits

Size

Data

Base Address

39:0

40

Base address of the BVH texture 256 byte aligned

Reserved

54:40

15

Set to zero

Box growing
amount

62:55

8

Number of ULPs to be added during ray-box test, encoded as unsigned integer

Box sorting
enable

63

1

Whether the ray-box test results need to be sorted

Size

105:64

42

Number of nodes minus 1 in the BVH texture used to enforce bounds checking

Reserved

118:106

13

Set to zero

Pointer Flags

119

1

0: Do not use pointer flags or features supported by pointer flags
1: Utilize pointer flags to enable HW winding, backface cull, opaque/non-opaque
culling and primitive type-based culling.

triangle_return_mode

120

1

0: Return data for triangle tests are
{0: t_num, 1: t_denom, 2: triangle_id, 3: hit_status}
1: Return data for triangle tests are
{0: t_num, 1: t_denom, 2: I_num, 3: J_num}

llc_stream or
unused

122:121

2

0: use the LLC for load/store if enabled in Mtype
1: use the LLC for load, bypass for store/atomics (store/atomics probe-invalidate)
2: Reserved
3: bypass the LLC for all ops

big_page

123

1

Describes resource page usage
0 : No page size override.
1 : Indicates when a whole resource is only using pages that are >= 64kB in size.

Type

127:124

4

Set to 0x8
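The bit positions in Table 54 can be exercised by packing a descriptor on the host. This is a minimal sketch, not a driver implementation. The function and parameter names are hypothetical. One point is an assumption rather than something the table states: because the 40-bit Base Address field is 256-byte aligned, the sketch stores the byte address shifted right by 8 bits.

```python
def pack_bvh_descriptor(base_address, num_nodes, box_grow_ulps=0, box_sort=False,
                        pointer_flags=False, triangle_return_mode=0,
                        llc_stream=0, big_page=False):
    """Pack the Table 54 fields into one 128-bit integer (bit 0 = LSB)."""
    if base_address % 256:
        raise ValueError("base address must be 256-byte aligned")
    fields = [
        # (value, least-significant bit, width in bits)
        (base_address >> 8, 0, 40),      # ASSUMPTION: stored in 256-byte units
        (box_grow_ulps, 55, 8),
        (int(box_sort), 63, 1),
        (num_nodes - 1, 64, 42),         # "number of nodes minus 1"
        (int(pointer_flags), 119, 1),
        (triangle_return_mode, 120, 1),
        (llc_stream, 121, 2),
        (int(big_page), 123, 1),
        (0x8, 124, 4),                   # Type field is fixed at 0x8
    ]
    descriptor = 0
    for value, lsb, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its bit range")
        descriptor |= value << lsb
    return descriptor


def to_dwords(descriptor):
    """Split the 128-bit descriptor into four DWORDs, lowest first."""
    return [(descriptor >> (32 * i)) & 0xFFFFFFFF for i in range(4)]
```

Reserved ranges (54:40 and 118:106) are left at zero because no field is written into them.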

Barycentrics
The ray-tracing hardware is designed to support computation of barycentric coordinates directly in hardware.
This uses the "triangle_return_mode" in the table in the previous section (T# descriptor).
Table 55. Ray Tracing Return Mode
DWORD | Return Mode = 0: Field Name | Type | Return Mode = 1: Field Name | Type
0 | t_num | float32 | t_num | float32
1 | t_denom | float32 | t_denom | float32
2 | triangle_id | uint32 | I_num | float32
3 | hit_status | uint32 (boolean value) | J_num | float32

10.10. Partially Resident Textures
"Partially Resident Textures" provides support for texture maps in which not all levels of detail are resident in
memory. The shader compiler declares the texture map as being PRT in the resource, but the shader
program must also be aware of this because if a texture fetch accesses a MIP level that is not present, the
texture unit returns an extra DWORD of status into VGPRs indicating the fetch failure. If any of the texels are
not present in memory, the texture cache returns NACK that causes a non-zero value to be written into
DST_VGPR+1 for each failing thread. The value may represent the LOD requested. The shader program must
allocate this extra VGPR for all PRT texture fetches and check that it is zero after the fetch. This PRT VGPR
must have previously been initialized to zero by the shader.
PRT is enabled when the texture resource MIN_LOD_WARN value is non-zero. Normal textures cannot NACK,
so only PRTs can get a NACK, and a NACK causes a write to DST_VGPR+Num_VGPRS. For example, if a SAMPLE loads 4
values into 4 VGPRs: 4,5,6,7 then PRT may return NACK status into VGPR_8.
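The location of the PRT status register follows directly from the rule DST_VGPR+Num_VGPRS. A small sketch of that arithmetic, with hypothetical helper names:

```python
def prt_status_vgpr(dst_vgpr, num_data_vgprs):
    """Index of the VGPR that receives the PRT NACK status."""
    return dst_vgpr + num_data_vgprs


def fetch_fully_resident(status_value):
    """The status VGPR is pre-initialized to zero; non-zero means a NACK."""
    return status_value == 0
```

With the example above, a SAMPLE returning into VGPRs 4 through 7 gives `prt_status_vgpr(4, 4) == 8`.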

Chapter 11. Global, Scratch and Flat Address Space
Flat, Global and Scratch are a collection of VMEM instructions that allow per-thread access to global memory,
shared memory and private memory. Unlike buffer and image instructions, these do not use an SRD (resource
constant).
Flat is the most generic of the 3 types where per-thread the address may map to global, private or shared
memory. Memory is addressed as a single flat address space, where certain memory address apertures map
these regions. The determination of the memory space to which an address maps is controlled by a set of
"memory aperture" base and size registers. Flat load/store/atomic instructions are effectively a simultaneous
issue of an LDS and GLOBAL instruction at the same time with the same address. The address per-thread is
read from the ADDR VGPR and then tested to see in which address space the data exists.
Flat Address Space ("flat") instructions allow load/store/atomic access to a generic memory address pointer that
can resolve to any of the following physical memories:
• Global memory
• Scratch ("private")
• LDS ("shared")
• Invalid
• But not to: GPRs, GDS or LDS-parameters.
GLOBAL is used when all of the addresses fall into global memory, not LDS or Scratch. This should be used when
possible (instead of "Flat") as Global does not tie up LDS resources. SCRATCH is similar, but is used to access
scratch (private) memory space.
Scratch (thread-private memory) is an area of memory defined by the aperture registers. When an address
falls in scratch space, additional address computation is automatically performed by the hardware. For waves
that are allocated scratch memory space, the 64-bit FLAT_SCRATCH register is initialized with a pointer to
that wave’s private scratch memory. Waves that have no scratch memory have FLAT_SCRATCH initialized to
zero. FLAT_SCRATCH is a 64-bit byte address that is implicitly used by Flat and Scratch memory instructions,
and can be manually read via S_GETREG.
The instruction specifies which VGPR supplies the address (per work-item), and that address for each work-item may be in any one of those address spaces.
Instruction Fields

Field

Size Description

OP

8

Opcode: see next table

ADDR

8

VGPR that holds address or offset. For 64-bit addresses, ADDR has the LSB’s and ADDR+1 has the MSBs.
For offset a single VGPR has a 32 bit unsigned offset.
For FLAT_*: specifies an address.
For GLOBAL_* when SADDR is NULL: specifies an address.
For GLOBAL_* when SADDR is not NULL: specifies an offset.
For SCRATCH, specifies an offset if SVE=1

DATA

8

VGPR that holds the first DWORD of store-data. Instructions can use 0-4 DWORDs.

VDST

8

VGPR destination for data returned to the shader, either from LOADs or Atomics with GLC=1 (return
pre-op value).

SLC

1

System Level Coherent. Used in conjunction with GLC to determine cache policies.

DLC

1

Device Level Coherent. Controls behavior of L1 cache (GL1).

GLC

1

Group Level Coherent - controls behavior of L0 cache. Atomics: 1 = return the memory value before the
atomic operation is performed.
0 = do not return anything.

SEG

2

Memory Segment: 0=Flat, 1=Scratch, 2=GLOBAL, 3=Reserved

OFFSET

13

Address offset: 13-bit signed byte offset
(Must be positive for Flat; MSB is ignored and forced to zero)

SADDR

7

Scalar SGPR that provides an address or offset (unsigned). To disable use, set this field to NULL. The
meaning of this field is different for Scratch and Global.
Flat: Unused
Scratch: use an SGPR as part of the address
Global: use the SGPR to provide a base address and the VGPR provides a 32-bit byte offset.

SVE

1

Scratch VGPR Enable
When set to 1, scratch instructions include a 32-bit offset from a VGPR;
when set to 0, scratch instructions do not use a VGPR for addressing.
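The OFFSET field description above (13-bit signed, with the MSB ignored and forced to zero for Flat) can be sketched as a small decode routine. The function name and the string segment labels are hypothetical, and two's complement encoding of the signed field is an assumption.

```python
def decode_inst_offset(field, segment):
    """Decode the 13-bit OFFSET field into a signed byte offset.

    segment is one of "flat", "global", "scratch" (illustrative labels).
    """
    field &= 0x1FFF                      # keep the 13 field bits
    if segment == "flat":
        return field & 0x0FFF            # MSB ignored and forced to zero
    if field & 0x1000:                   # sign bit set: negative offset
        return field - 0x2000
    return field
```

Under these assumptions, a field of 0x1FFF decodes to -1 for Global and Scratch, and to 0xFFF for Flat.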

Table 56. Instructions
Flat

GLOBAL

Scratch

FLAT_LOAD_U8

GLOBAL_LOAD_U8

SCRATCH_LOAD_U8

FLAT_LOAD_D16_U8

GLOBAL_LOAD_D16_U8

SCRATCH_LOAD_D16_U8

FLAT_LOAD_D16_HI_U8

GLOBAL_LOAD_D16_HI_U8

SCRATCH_LOAD_D16_HI_U8

FLAT_LOAD_I8

GLOBAL_LOAD_I8

SCRATCH_LOAD_I8

FLAT_LOAD_D16_I8

GLOBAL_LOAD_D16_I8

SCRATCH_LOAD_D16_I8

FLAT_LOAD_D16_HI_I8

GLOBAL_LOAD_D16_HI_I8

SCRATCH_LOAD_D16_HI_I8

FLAT_LOAD_U16

GLOBAL_LOAD_U16

SCRATCH_LOAD_U16

FLAT_LOAD_I16

GLOBAL_LOAD_I16

SCRATCH_LOAD_I16

FLAT_LOAD_D16_B16

GLOBAL_LOAD_D16_B16

SCRATCH_LOAD_D16_B16

FLAT_LOAD_D16_HI_B16

GLOBAL_LOAD_D16_HI_B16

SCRATCH_LOAD_D16_HI_B16

FLAT_LOAD_B32

GLOBAL_LOAD_B32

SCRATCH_LOAD_B32

FLAT_LOAD_B64

GLOBAL_LOAD_B64

SCRATCH_LOAD_B64

FLAT_LOAD_B96

GLOBAL_LOAD_B96

SCRATCH_LOAD_B96

FLAT_LOAD_B128

GLOBAL_LOAD_B128

SCRATCH_LOAD_B128

FLAT_STORE_B8

GLOBAL_STORE_B8

SCRATCH_STORE_B8

FLAT_STORE_D16_HI_B8

GLOBAL_STORE_D16_HI_B8

SCRATCH_STORE_D16_HI_B8

FLAT_STORE_B16

GLOBAL_STORE_B16

SCRATCH_STORE_B16

FLAT_STORE_D16_HI_B16

GLOBAL_STORE_D16_HI_B16

SCRATCH_STORE_D16_HI_B16

FLAT_STORE_B32

GLOBAL_STORE_B32

SCRATCH_STORE_B32

FLAT_STORE_B64

GLOBAL_STORE_B64

SCRATCH_STORE_B64

FLAT_STORE_B96

GLOBAL_STORE_B96

SCRATCH_STORE_B96

FLAT_STORE_B128

GLOBAL_STORE_B128

SCRATCH_STORE_B128

none

GLOBAL_LOAD_ADDTID_B32

none

none

GLOBAL_STORE_ADDTID_B32

none

FLAT_ATOMIC_SWAP_B32

GLOBAL_ATOMIC_SWAP_B32

none

FLAT_ATOMIC_CMPSWAP_B32

GLOBAL_ATOMIC_CMPSWAP_B32

none

FLAT_ATOMIC_ADD_U32

GLOBAL_ATOMIC_ADD_U32

none

FLAT_ATOMIC_ADD_F32

GLOBAL_ATOMIC_ADD_F32

none

FLAT_ATOMIC_SUB_U32

GLOBAL_ATOMIC_SUB_U32

none

FLAT_ATOMIC_MIN_I32

GLOBAL_ATOMIC_MIN_I32

none

FLAT_ATOMIC_MIN_U32

GLOBAL_ATOMIC_MIN_U32

none

FLAT_ATOMIC_MAX_I32

GLOBAL_ATOMIC_MAX_I32

none

FLAT_ATOMIC_MAX_U32

GLOBAL_ATOMIC_MAX_U32

none

FLAT_ATOMIC_AND_B32

GLOBAL_ATOMIC_AND_B32

none

FLAT_ATOMIC_OR_B32

GLOBAL_ATOMIC_OR_B32

none

FLAT_ATOMIC_XOR_B32

GLOBAL_ATOMIC_XOR_B32

none

FLAT_ATOMIC_INC_U32

GLOBAL_ATOMIC_INC_U32

none

FLAT_ATOMIC_DEC_U32

GLOBAL_ATOMIC_DEC_U32

none

FLAT_ATOMIC_CMPSWAP_F32

GLOBAL_ATOMIC_CMPSWAP_F32

none

FLAT_ATOMIC_MIN_F32

GLOBAL_ATOMIC_MIN_F32

none

FLAT_ATOMIC_MAX_F32

GLOBAL_ATOMIC_MAX_F32

none

FLAT_ATOMIC_SWAP_B64

GLOBAL_ATOMIC_SWAP_B64

none

FLAT_ATOMIC_CMPSWAP_B64

GLOBAL_ATOMIC_CMPSWAP_B64

none

FLAT_ATOMIC_ADD_U64

GLOBAL_ATOMIC_ADD_U64

none

FLAT_ATOMIC_SUB_U64

GLOBAL_ATOMIC_SUB_U64

none

FLAT_ATOMIC_MIN_I64

GLOBAL_ATOMIC_MIN_I64

none

FLAT_ATOMIC_MIN_U64

GLOBAL_ATOMIC_MIN_U64

none

FLAT_ATOMIC_MAX_I64

GLOBAL_ATOMIC_MAX_I64

none

FLAT_ATOMIC_MAX_U64

GLOBAL_ATOMIC_MAX_U64

none

FLAT_ATOMIC_AND_B64

GLOBAL_ATOMIC_AND_B64

none

FLAT_ATOMIC_OR_B64

GLOBAL_ATOMIC_OR_B64

none

FLAT_ATOMIC_XOR_B64

GLOBAL_ATOMIC_XOR_B64

none

FLAT_ATOMIC_INC_U64

GLOBAL_ATOMIC_INC_U64

none

FLAT_ATOMIC_DEC_U64

GLOBAL_ATOMIC_DEC_U64

none

none

GLOBAL_ATOMIC_CSUB_U32
(GLC must be set to 1)

none

11.1. Instructions
11.1.1. FLAT
The Flat instruction set is nearly identical to the BUFFER instruction set, minus the FORMAT loads & stores.
Flat instructions do not use a resource constant (V#) or sampler (S#), but they do use a specific SGPR-pair
(FLAT_SCRATCH) to hold scratch-space information in case any threads' address resolves to scratch space. See
"Scratch" section below.
Since Flat instructions are executed as both an LDS and a Global instruction, Flat instructions increment both
VMcnt (or VScnt) and LGKMcnt and are not considered done until both have been decremented. There is no
way a priori to determine whether a Flat instruction uses only LDS or Global memory space.
When the address from a Flat instruction falls into scratch (private) space, a different addressing mechanism is
used. The address from the VGPR points to the memory space for a specific DWORD of scratch data owned by
this thread. The hardware maps this address to the actual memory address that holds data for all of the threads
in the wave. For Flat atomics that map into scratch, 4-byte atomics are supported and 8-byte atomics return
MEMVIOL.
The wave supplies the offset (for space allocated to this wave) with every Flat request. This is stored in a
dedicated per-wave register: FLAT_SCRATCH, that holds a 64-bit byte address.
The aperture check occurs when VGPRs are read, with invalid addresses being routed to the texture unit. The
"aperture check" is performed before "inst_offset" is added into the address, so it is undefined what occurs if
the addition of inst_offset pushes the address into a different memory aperture.
(Hole) Addr[48] | Addr[47] | Addr[46] | Aperture
0 | x | x | Normal (global memory)
1 | 0 | 0 | Potential Private (scratch)
1 | 0 | 1 | Potential Shared (LDS)
1 | 1 | x | Invalid
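The aperture table above can be read as a three-bit decision on the address held in the ADDR VGPR. The sketch below mirrors only that table; it does not model the aperture base and size registers that make the scratch and LDS results "potential", and the function name is hypothetical.

```python
def classify_flat_address(addr):
    """Classify a flat address using Addr[48], Addr[47] and Addr[46] only."""
    hole = (addr >> 48) & 1
    bit47 = (addr >> 47) & 1
    bit46 = (addr >> 46) & 1
    if hole == 0:
        return "global"                  # Normal (global memory)
    if bit47 == 1:
        return "invalid"
    return "lds" if bit46 else "scratch"  # Potential Shared / Potential Private
```

Because the check uses the VGPR value before inst_offset is added, the classification is applied to the base address, not to the final effective address.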

Ordering
Flat instructions may complete out of order with each other. If one Flat instruction finds all of its data in
Texture cache, and the next finds all of its data in LDS, the second instruction might complete first. If the
two fetches return data to the same VGPR, the result is unknown (order is not deterministic). Flat
instructions decrement VMcnt in order for the threads that went to global memory and those are in order
with other scratch, global, texture and buffer instructions. Separately each Flat instruction increments and
decrements LGKMcnt. This is out-of-order with the VMcnt path but is in-order with other DS (LDS)
instructions. Since the data for a Flat load can come from either LDS or the texture cache, and because
these units have different latencies, there is a potential race condition with respect to the VMcnt/VScnt and
LGKMcnt counters. Because of this, the only sensible S_WAITCNT value to use after Flat instructions is
zero.

11.1.2. Global
Global operations transfer data between VGPR and global memory. Global instructions are similar to Flat, but
the programmer is responsible to make sure that no threads access LDS or private space. Because of this, no
LDS bandwidth is used by global instructions.
Since these instructions do not access LDS, only VMcnt (or VScnt) is used, not LGKMcnt. If a global instruction
does attempt to access LDS, the instruction returns MEMVIOL.
Global includes two instructions which do not use any VGPRs for addressing, just SGPRs and INST_OFFSET:
• GLOBAL_LOAD_ADDTID_B32
• GLOBAL_STORE_ADDTID_B32

11.1.3. Scratch
Scratch instructions are similar to global but they access a private (per-thread) memory space that is swizzled.
Because of this, no LDS bandwidth is used by scratch instructions. Scratch instructions also support multi-DWORD access and misaligned access (although misaligned access is slower).

Since these instructions do not access LDS, only VMcnt (or VScnt) is used, not LGKMcnt. It is not possible for a
scratch instruction to access LDS, and so no error checking is done (and no aperture check is performed).

11.2. Addressing
Global, Flat and Scratch each have their own addressing modes. Flat addressing is a subset of the global and
scratch modes. 64-bit addresses are stored with the LSB’s in the VGPR at ADDR, and the MSBs in the VGPR at
ADDR+1.
There are 4 distinct shader instructions:
• GLOBAL
• SCRATCH
• LDS
• FLAT - based on per-thread address (VGPR), can load/store: global memory, LDS or scratch memory.
In the equations below, the parenthesized suffix gives the operand type: U = unsigned, I = signed, followed by the
width in bits.
Global Addressing
GV: mem_addr = VGPR(U64) + INST_OFFSET(I13)
GVS: mem_addr = SGPR(U64) + VGPR(U32) + INST_OFFSET(I13)
GT: mem_addr = SGPR(U64) + INST_OFFSET(I13) + ThreadID*4

LDS Addressing (DS ops)
LDS: LDS_ADDR = VGPR_addr(U32) + INST_OFFSET(U16)
LDS address is relative to the LDS space allocated to this wave.

Scratch Addressing
SV: mem_addr = SCRATCH_BASE(U64) + SWIZZLE(VGPR_offset(U32) + INST_OFFSET(I13), ThreadID)
SS: mem_addr = SCRATCH_BASE(U64) + SWIZZLE(SGPR_offset(U32) + INST_OFFSET(I13), ThreadID)
SVS: mem_addr = SCRATCH_BASE(U64) + SWIZZLE(SGPR_offset(U32) + VGPR_offset(U32) + INST_OFFSET(I13), ThreadID)
ST: mem_addr = SCRATCH_BASE(U64) + SWIZZLE(INST_OFFSET(I13), ThreadID)
SGPR_offset and VGPR_offset are unsigned 32-bit byte offsets.
The combined offsets inside SWIZZLE() must result in a non-negative number.

Flat Addressing
An aperture test on the address-VGPR value determines Global/LDS/Scratch per thread (ignoring
INST_OFFSET).
Use one of the 3 address equations per lane depending on which memory it maps to:
GLOBAL (GV): mem_addr = VGPR(U64) + INST_OFFSET(I13)
SCRATCH (SV): mem_addr = SCRATCH_BASE(sgpr:U64) + SWIZZLE(VGPR_offset + INST_OFFSET(I13), ThreadID)
LDS: LDS_ADDR = VGPR(addr) + INST_OFFSET - sharedApertureBase
If the address falls into LDS space, it is checked against the range: [0, LDS_allocated_size-1]

There is no range checking on this address.
Scratch Addressing Equation
"SWIZZLE(offset,TID)" is hard-coded based on wave size (32 or 64).
Swizzle for Scratch is hard-coded to: elem_size=4 bytes, const_index_stride=32 (wave32) or 64
(wave64).
Addr = SCRATCH_BASE + (offset / 4) * 4 * const_index_stride + (offset % 4) + TID*4
where "offset" = either "INST_OFFSET + SGPR_offset" or "INST_OFFSET + VGPR_offset".
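The scratch swizzle equation can be evaluated directly. The following is a minimal sketch of that equation with integer division, assuming elem_size = 4 bytes and const_index_stride equal to the wave size as stated above. The function name is hypothetical.

```python
def scratch_address(scratch_base, offset, tid, wave_size=32):
    """Evaluate the scratch swizzle equation for one thread.

    offset is the combined byte offset (INST_OFFSET plus SGPR and/or VGPR offset).
    """
    if wave_size not in (32, 64):
        raise ValueError("wave size must be 32 or 64")
    if offset < 0:
        raise ValueError("combined offset must be non-negative")
    const_index_stride = wave_size
    return (scratch_base
            + (offset // 4) * 4 * const_index_stride   # DWORD row for this offset
            + (offset % 4)                             # byte within the DWORD
            + tid * 4)                                 # this thread's DWORD in the row
```

The effect is that DWORD N of every thread in the wave is stored contiguously: in wave32, consecutive DWORDs of one thread are 128 bytes apart, and the same DWORD of adjacent threads is 4 bytes apart.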
Restrictions:
• Inst_offset :
◦ Flat and Scratch-ST mode: must not be negative
◦ Global and Scratch-SS and -SV modes: can be negative
◦ In Scratch SS mode, the inst_offset must be aligned to the payload size: 4 byte aligned for 1-DWORD,
16-byte aligned for 4-DWORD.
▪ Also (SADDR + INST_OFFSET) must be at least DWORD-aligned
SADDR | SVE | MODE
==NULL | 0 | ST
!=NULL | 0 | SS
==NULL | 1 | SV
!=NULL | 1 | SVS
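The SADDR/SVE table above selects one of four scratch modes, and each mode determines which terms contribute to the offset passed to the swizzle. A sketch of that selection, with hypothetical function names:

```python
# (SADDR is NULL, SVE) -> scratch addressing mode, taken from the table above
SCRATCH_MODES = {
    (True, 0): "ST",
    (False, 0): "SS",
    (True, 1): "SV",
    (False, 1): "SVS",
}


def scratch_mode(saddr_is_null, sve):
    """Return the scratch mode selected by the SADDR and SVE fields."""
    return SCRATCH_MODES[(bool(saddr_is_null), int(sve))]


def scratch_offset(mode, ioff, soff=0, voff=0):
    """Combined byte offset inside SWIZZLE() for the given mode."""
    terms = {
        "ST": ioff,
        "SS": soff + ioff,
        "SV": voff + ioff,
        "SVS": soff + voff + ioff,
    }
    offset = terms[mode]
    if offset < 0:
        raise ValueError("combined offset must be non-negative")
    return offset
```

A negative instruction offset is therefore acceptable in the modes that permit it only when the SGPR and VGPR terms bring the sum back to zero or above.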

Scratch Instruction Modes | Address | Indicated by SVE / SADDR
SV | Addr = FLAT_SCRATCH + swizzle(Voff + Ioff, TID) | 1 / NULL
SS | Addr = FLAT_SCRATCH + swizzle(Soff + Ioff, TID) | 0 / !NULL
ST | Addr = FLAT_SCRATCH + swizzle(0 + Ioff, TID) | 0 / NULL
SVS | Addr = FLAT_SCRATCH + swizzle(Soff + Voff + Ioff, TID) | 1 / !NULL
BUFFER_LOAD | Addr = T#.base + Soff + swizzle( (Vidx + TID) * stride + Ioff + Voff) |

Global Instruction Modes | Address | SVE / SADDR
GV | Addr = Vaddr64 + Ioff | x / NULL
GVS | Addr = Saddr64 + Voff32 + Ioff | x / !=NULL
GT | Addr = Saddr64 + Ioff + TID*4 | x/x instruction

LDS Instruction Modes | Address | SVE / SADDR
LDS | Addr = Vaddr + Ioff | x/x instruction
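The three global address forms differ only in which registers contribute. The sketch below evaluates them for one thread. Function names are hypothetical, and no 64-bit wraparound is modeled because the text does not describe overflow behavior.

```python
def global_gv(vaddr64, ioff):
    """GV: 64-bit address from the VGPR pair plus the instruction offset."""
    return vaddr64 + ioff


def global_gvs(saddr64, voff32, ioff):
    """GVS: SGPR base address plus an unsigned 32-bit VGPR offset."""
    return saddr64 + voff32 + ioff


def global_gt(saddr64, ioff, tid):
    """GT: used by the ADDTID instructions; each thread gets its own DWORD."""
    return saddr64 + ioff + tid * 4


def global_mode(saddr_is_null):
    """GV when SADDR is NULL, otherwise GVS (GT is selected by the opcode)."""
    return "GV" if saddr_is_null else "GVS"
```

For instance, `global_gt(0x2000, 0, 5)` yields 0x2014, the sixth DWORD after the SGPR base.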

Flat Instruction Modes | Address | SVE / SADDR
Scratch | Addr = FLAT_SCRATCH + swizzle(Voff + Ioff - privApertureBase, TID) // "SV" | x / NULL
LDS | Addr = Vaddr + Ioff - sharedApertureBase // "LDS" | x / NULL
Global | Addr = Vaddr + Ioff // "GV" | x / NULL


• Scratch: Voff and Soff are 32 bits, unsigned bytes.
• Global: Addresses are 64 bits, offset is 32 bits.
• FLAT_SCRATCH is an SGPR-pair 64-bit address.
• "Ioff" is the offset from the instruction field.
• "x" = don’t care (either value works)

11.3. Memory Error Checking
Both Texture and LDS can report that an error occurred due to a bad address. This can occur due to:
• Invalid address (outside any aperture)
• Write to read-only global memory address

• Misaligned data (scratch accesses may be misaligned)
• Out-of-range address:
◦ LDS access with an address outside the range: [ 0, LDS_SIZE-1 ]
The policy for threads with bad addresses is: stores outside this range do not write a value, and reads return
zero. The aperture check for invalid address occurs before adding any address offsets - it is based only on the
base address; the other checks are performed after adding the offsets.
Addressing errors from either LDS or TA are returned on their respective "instruction done" busses as
MEMVIOL. This sets the wave’s MEMVIOL TrapStatus bit, and also causes an exception (trap).

11.4. Data
FLAT instructions can use from zero to four consecutive DWORDs of data in VGPRs and/or memory. The DATA
field determines which VGPR(s) supply source data (if any) and the VDST VGPRs hold return data (if any).
There is no data-format conversion performed.
"D16" instructions use only 16 bits of the VGPR instead of the full 32 bits. "D16_HI" instructions read or write
only the high 16 bits, while "D16" instructions use the low 16 bits.
Scratch & Global D16 load instructions of the "_LDS_" type write the entire 32 bits of LDS.

Chapter 12. Data Share Operations
Local data share (LDS) is a low-latency, RAM scratchpad for temporary data storage and for sharing data
between threads within a work-group. Accessing data through LDS may be significantly lower latency and
higher bandwidth than going through memory.
For compute workloads, it allows a simple method to pass data between threads in different waves within the
same work-group. For graphics, it is also used to hold vertex parameters for pixel shaders.
LDS space is allocated per work-group or wave (when work-groups are not used) and recorded in dedicated LDS-base/size (allocation) registers that are not writable by the shader. These restrict all LDS accesses to the space
owned by the work-group or wave.

12.1. Overview
The figure below shows how the LDS fits into the memory hierarchy of the GPU.

Figure 3. High-Level Memory Configuration

There are 128kB of memory per work-group processor split up into 64 banks of DWORD-wide RAMs. These 64
banks are further sub-divided into two sets of 32-banks each where 32 of the banks are affiliated with a pair of
SIMD32’s, and the other 32 banks are affiliated with the other pair of SIMD32’s within the WGP. Each bank is a
512x32 two-port RAM (1R/1W per clock cycle). DWORDs are placed in the banks serially, but all banks can
execute a store or load simultaneously. One work-group can request up to 64kB memory.
The high bandwidth of the LDS memory is achieved not only through its proximity to the ALUs, but also
through simultaneous access to its memory banks. Thus, it is possible to concurrently execute 32 store or load
instructions, each nominally 32 bits; extended instructions, load_2addr/store_2addr, can be 64 bits each. If,
however, more than one access attempt is made to the same bank at the same time, a bank conflict occurs. In
this case, for indexed and atomic operations, the hardware is designed to prevent the attempted concurrent
accesses to the same bank by turning them into serial accesses. This can decrease the effective bandwidth of
the LDS. For increased throughput (optimal efficiency), therefore, it is important to avoid bank conflicts. A
knowledge of request scheduling and address mapping can be key to help achieving this.
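The cost of bank conflicts can be estimated with a simple model. The sketch below is illustrative only and rests on assumptions the text does not state: that a byte address maps to a bank as (address / 4) mod 32 within one 32-bank set, and that every access to an already-used bank costs one extra serialized pass (any merging of identical addresses by hardware is not modeled). The names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32            # one 32-bank set, per the description above
BANK_WIDTH_BYTES = 4      # banks are DWORD-wide
TOTAL_LDS_BYTES = 64 * 512 * 4   # 64 banks x 512 entries x 4 bytes = 128 kB


def bank_of(byte_address):
    """ASSUMED mapping: consecutive DWORDs go to consecutive banks."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def serialized_passes(byte_addresses):
    """Worst-case number of accesses that land on a single bank."""
    counts = Counter(bank_of(a) for a in byte_addresses)
    return max(counts.values(), default=0)
```

Under this model, 32 lanes reading consecutive DWORDs need one pass, a stride of 8 bytes needs two, and a stride of 128 bytes sends every lane to the same bank and needs 32.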

12.1.1. Dataflow in Memory Hierarchy
The figure below is a conceptual diagram of the dataflow within the memory structure.

Data can be loaded into LDS either by transferring it from VGPRs to LDS using "DS" instructions, or by loading
in from memory. When loading from memory, the data may be loaded into VGPRs first or for some types of
loads it may be loaded directly into LDS from memory. To store data from LDS to global memory, data is read
from LDS and placed into the work-item’s VGPRs, then written out to global memory. To help make effective
use of the LDS, a shader program must perform many operations on what is transferred between global
memory and LDS.
LDS atomics are performed in the LDS hardware. Although ALUs are not directly used for these operations,
latency is incurred by the LDS executing this function.

12.1.2. LDS Modes and Allocation: CU vs. WGP Mode
Work-groups of waves are dispatched in one of two modes: CU or WGP.
See this section for details: WGP and CU Mode

12.1.3. LDS Access Methods
There are 3 forms of Local Data Share access:

Direct Load
Loads a single DWORD from LDS and broadcasts the data to a VGPR across all lanes.
Indexed load/store and Atomic ops
Load/store address comes from a VGPR and data to/from VGPR.
LDS-ops require up to 3 inputs: 2data+1addr and immediate return VGPR.
Parameter Interpolation Load
Reads pixel parameters from LDS per quad and loads them into one VGPR.
Reads all 3 parameters per quad (P1, P1-P0 and P2-P0) and loads them into 3 lanes within the quad (the 4th
lane receives zero).
The following sections describe these methods.

12.2. Pixel Parameter Interpolation
For pixel waves, vertex attribute data is preloaded into LDS and barycentrics (I, J) are preloaded into VGPRs
before the wave starts. Parameter interpolation can be performed by loading attribute data from LDS into
VGPRs using LDS_PARAM_LOAD and then using V_INTERP instructions to interpolate the value per pixel.
LDS-Parameter loads are used to read vertex parameter data and store them in VGPRs to be used for parameter
interpolation. These instructions operate like memory instructions except they use EXPcnt to track outstanding
reads and decrement EXPcnt when the data arrives in VGPRs.
Pixel shaders can be launched before their parameter data has been written into LDS. Once the data is
available in LDS, the wave’s STATUS register "LDS_READY" bit is set to 1. Pixel shader waves stall if an
LDS_DIRECT_LOAD or LDS_PARAM_LOAD is to be issued before LDS_READY is set.
The most common form of interpolation involves weighting vertex parameters by the barycentric coordinates
"I" and "J". A common calculation is:
Result = P0 + I * P10 + J * P20
where "P10" is (P1 - P0), and "P20" is (P2 - P0)
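The interpolation formula can be checked numerically. This sketch computes it for a single pixel in plain floating point; it does not model how LDS_PARAM_LOAD and V_INTERP distribute the work across lanes, and the function name is hypothetical.

```python
def interpolate(p0, p1, p2, i, j):
    """Result = P0 + I * P10 + J * P20, with P10 = P1 - P0 and P20 = P2 - P0."""
    p10 = p1 - p0
    p20 = p2 - p0
    return p0 + i * p10 + j * p20
```

At (I, J) = (0, 0) the result is P0, at (1, 0) it is P1, and at (0, 1) it is P2, which is the expected behavior of barycentric weights at the triangle's vertices.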

Parameter interpolation involves two types of instructions:
• LDS_PARAM_LOAD : to read packed parameter data from LDS into a VGPR (data packed per quad)
• V_INTERP_* : VALU FMA instructions that unpack parameter data across lanes in a quad.

12.2.1. LDS Parameter Loads
Parameter Loads are only available in LDS, not in GDS, and only in CU mode (not WGP mode).
LDS_PARAM_LOAD reads three parameters (P0, P10, P20) of one 32-bit attribute or of two 16-bit attributes
from LDS into VGPRs. The 3 parameters (P0, P10 and P20) are the same for the 4 pixels within a quad.
These values are spread out across VGPR lanes 0, 1 and 2 of each quad. Interpolation is then performed using
FMA with DPP so each lane uses its I or J value with the quad’s shared P0, P10 and P20 values.

Table 57. LDSDIR Instruction Fields
Field      Size  Description
OP         2     Opcode:
                 0: LDS_DIRECT_LOAD
                 1: LDS_PARAM_LOAD
                 2,3: Reserved
WAITVDST   4     Wait for the number of previously issued, still outstanding VALU instructions to be less than
                 or equal to this number. Used to avoid Write-After-Read hazards on VGPRs.
VDST       8     Destination VGPR
ATTR_CHAN  2     Attribute channel: 0=X, 1=Y, 2=Z, 3=W. Unused for LDS_DIRECT_LOAD.
ATTR       6     Attribute number: 0 - 32. Unused for LDS_DIRECT_LOAD.
( M0 )     32    LDS_DIRECT_LOAD:
                 { 13’b0, DataType[2:0], LDS_address[15:0] } //addr in bytes
                 LDS_PARAM_LOAD:
                 { 1’b0, new_prim_mask[15:1], lds_param_offset[15:0] }

M0 is implicitly read for this instruction and must be initialized before these instructions.
new_prim_mask
a mask that has a bit per quad indicating that this quad begins a new primitive; zero indicates same
primitive as previous quad. There is an implied "one" for the first quad in the wave (every wave begins a
new primitive) and so bit[0] is omitted.
lds_param_offset
The parameter offset indicates the starting address of the parameters in LDS. Space before that can be used
as temporary wave storage space. Lds_param_offset bits [6:0] must be set to zero.
Example LDS_PARAM_LOAD (new_prim_mask[3:0] = 0110)

LDS_ADDR = lds_base + param_offset + attr#*numPrimsInVector*12DWORDs + prim#*12 + attr_offset
(attr_offset = 0..11 : 0 = P0.x, 1 = P0.Y, … 11 = P2.W)
From NewPrimMask h/w derives NumPrimInVec and Prim# (0..15)
If the dest-VGPR is out of range, the load is still performed but EXEC is forced to zero.
LDS_PARAM_LOAD and LDS_DIRECT_LOAD use EXEC per quad (if any pixel is enabled in the quad, data is
written to all 4 pixels/threads in the quad).
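
The M0 layout and the address formula above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `decode_param_m0` and `param_dword_index` are hypothetical, the primitive number of a quad is assumed to be the running count of "new primitive" bits, and the result is a DWORD index relative to (lds_base + param_offset), following the "12 DWORDs" unit in the formula.

```python
def decode_param_m0(m0, num_quads=16):
    """Split M0 into lds_param_offset and a per-quad primitive number.

    Assumed layout (Table 57): M0 = {1'b0, new_prim_mask[15:1], lds_param_offset[15:0]}.
    """
    lds_param_offset = m0 & 0xFFFF
    mask_bits = (m0 >> 16) & 0x7FFF  # bit (q-1) here holds new_prim_mask[q]
    prim_of_quad = []
    prim = -1
    for quad in range(num_quads):
        # Quad 0 always begins a new primitive (the implied "one").
        starts_new = quad == 0 or bool((mask_bits >> (quad - 1)) & 1)
        if starts_new:
            prim += 1
        prim_of_quad.append(prim)
    return lds_param_offset, prim_of_quad, prim + 1


def param_dword_index(attr, num_prims, prim, attr_offset):
    """DWORD index relative to (lds_base + param_offset).

    Each primitive stores 12 DWORDs per attribute (P0, P1, P2 times X, Y, Z, W).
    """
    return attr * num_prims * 12 + prim * 12 + attr_offset


# new_prim_mask[3:0] = 0110: quads 1 and 2 each start a new primitive.
m0 = (0b011 << 16) | 0x0080
offset, prims, count = decode_param_m0(m0, num_quads=4)
```

With the example mask, quads 0 to 3 map to primitives 0, 1, 2, 2, so three primitives are present in the four quads.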

12.2.1.1. 16-bit Parameter Data
16-bit parameters are packed in LDS as pairs of attributes in DWORDs: ATTR0.X and ATTR1.X share a DWORD.
There is an alternate packing mode where the parameters are not packed (one 16-bit param in the low half of
a DWORD). These attributes can be read with the same LDS_PARAM_LOAD instruction, which returns the packed
DWORD with 2 attributes (when they are packed). Interpolation can then be done using specific mixed-precision
FMA opcodes, along with DPP (to select P0, P10 or P20) and OPSEL (to select the upper or lower 16 bits).
Barycentrics are 32 bits, not 16 bits.

12.2.1.2. Parameter Load Data Hazard Avoidance
These data dependency rules apply to both parameter and direct loads.
LDS_DIRECT_LOAD and LDS_PARAM_LOAD read data from LDS and write it into VGPRs, and they use EXPcnt
to track when the instruction has completed and written the VGPRs.
It is up to the shader program to ensure that data hazards are avoided. These instructions are issued along a
different path from VALU instructions so it is possible that previous VALU instructions may still be reading
from the VGPR that these LDS instructions are going to write and this could lead to a hazard.
EXPcnt is used to track read-after-write hazards where LDS_PARAM_LOAD writes a value to a VGPR and
another instruction reads it. The shader program uses "s_waitcnt EXPcnt" to wait for results from a
LDS_DIRECT_LOAD or LDS_PARAM_LOAD to be available in VGPRs before consuming them in a subsequent
instruction. The VINTERP instructions have a "wait_EXPcnt" field to assist in avoiding this hazard.
These are skipped when EXEC==0 and EXPcnt==0 (like memory ops).
Mixed exports & LDS-direct/param instructions from the same wave might not complete in order (both use
EXPcnt), requiring "s_waitcnt 0" if they are overlapped.
LDS_PARAM_LOAD V2
S_WAITCNT EXPcnt 0

A potential Write-After-Read hazard exists if a VALU instruction reads a VGPR and then LDS_PARAM_LOAD
writes that VGPR: It is possible the LDS_PARAM_LOAD overwrites the VALU’s source VGPR before it was read.
The user must prevent this by using the "wait_Vdst" field of the LDS_PARAM_LOAD instruction. This field
indicates the maximum number of uncompleted VALU instructions that may be outstanding when this
LDS_PARAM_LOAD is issued. Use this to ensure any dependent VALU instructions have completed.
Another potential data hazard involves LDS_PARAM_LOAD overwriting a VGPR that has not yet been read as a
source by a previous VMEM (LDS, Texture, Buffer, Flat) instruction. To avoid this hazard, the user must ensure
that the VMEM instruction has read its source VGPRs. This can be achieved by issuing any VALU or export
instruction before the LDS_PARAM_LOAD.

12.3. VALU Parameter Interpolation
Parameter interpolation is performed using an FMA operation that includes a built-in DPP operation to unpack
the per-quad P0/P10/P20 values into per-lane values. Because this instruction reads data from neighboring
lanes, the implicit DPP acts as if "fetch invalid = 1", so that the instruction can read data from neighboring lanes
that have EXEC==0, rather than getting the value 0 from those. Standard interpolation is calculating:
Per-Pixel-Parameter = P0 + I * P10 + J * P20    // I, J are per-pixel; P0/P10/P20 are per-primitive

This parameter interpolation is realized using a pair of instructions:
// V1 = I, V2 = J, V3 = result of LDS_PARAM_LOAD
V_INTERP_P10_F32 V4, V3[1], V1, V3[0]  // tmp = P0 + I*P10
                                       // uses DPP8=1,1,1,1,5,5,5,5; Src2(P0) uses DPP8=0,0,0,0,4,4,4,4
V_INTERP_P2_F32  V5, V3[2], V2, V4     // dst = J*P20 + tmp
                                       // uses DPP8=2,2,2,2,6,6,6,6
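
The effect of the hardcoded DPP8 selects can be modelled in software. The following is a minimal sketch, not a description of the hardware datapath: `dpp8` and `interpolate` are hypothetical helper names, Python floats stand in for F32, and the rounding behaviour of the two fused multiply-adds is ignored.

```python
def dpp8(values, select):
    """Each lane reads the lane named by select[lane % 8] within its group of 8 lanes."""
    out = []
    for lane in range(len(values)):
        group_base = lane - lane % 8
        out.append(values[group_base + select[lane % 8]])
    return out


def interpolate(param, i, j):
    """param holds P0, P10, P20 in lanes 0, 1, 2 of each quad, as left by LDS_PARAM_LOAD."""
    lanes = range(len(param))
    # First instruction (P10 op): tmp = P0 + I * P10
    p10 = dpp8(param, [1, 1, 1, 1, 5, 5, 5, 5])
    p0 = dpp8(param, [0, 0, 0, 0, 4, 4, 4, 4])
    tmp = [p0[l] + i[l] * p10[l] for l in lanes]
    # Second instruction (P2 op): dst = tmp + J * P20
    p20 = dpp8(param, [2, 2, 2, 2, 6, 6, 6, 6])
    return [tmp[l] + j[l] * p20[l] for l in lanes]


# Two quads: (P0, P10, P20) = (1, 2, 4) and (10, 1, -1); the 4th lane of each quad is zero.
param = [1.0, 2.0, 4.0, 0.0, 10.0, 1.0, -1.0, 0.0]
i = [0.0, 1.0, 0.0, 0.5] * 2
j = [0.0, 0.0, 1.0, 0.5] * 2
result = interpolate(param, i, j)
```

Every lane ends up with its own interpolated value even though the three parameters are stored only once per quad.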

Table 58. Parameter Interpolation Instruction Fields
Field    Size  Description
OP       7     Instruction Opcode:
               V_INTERP_P10_F32          // tmp = P0 + I*P10. hardcoded DPP8 on 2 sources
               V_INTERP_P2_F32           // D = tmp + J*P20. hardcoded DPP8 on 1 source
               V_INTERP_P10_F16_F32      // tmp = P0 + I*P10. hardcoded DPP8 on 2 sources
               V_INTERP_P2_F16_F32       // D = tmp + J*P20. hardcoded DPP8 on 1 source
               V_INTERP_RTZ_P10_F16_F32  // same as above, but round-toward-zero
               V_INTERP_RTZ_P2_F16_F32   // same as above, but round-toward-zero
SRC0     9     First argument VGPR: Parameter data (P10 or P20) from LDS stored in a VGPR.
SRC1     9     Second argument VGPR: I or J barycentric
SRC2     9     Third argument VGPR: "P10" ops hold P0 data; "P2" ops hold the partial result from the "P10" op.
VDST     8     Destination VGPR
NEG      3     Negate the input (invert sign bit).
               bit 0 is for src0, bit 1 is for src1 and bit 2 is for src2.
               For 16-bit interpolation this applies to both low and high halves.
WaitEXP  3     Wait for EXPcnt to be less than or equal to this value before issuing this instruction.
               Used to wait for a specific previous LDS_PARAM_LOAD to have completed.
OPSEL    4     Operation select for 16-bit math: 1=select high half, 0=select low half
               [0]=src0, [1]=src1, [2]=src2, [3]=dest
               For dest=0, dest_vgpr[31:0] = {prev_dst_vgpr[31:16], result[15:0] }
               For dest=1, dest_vgpr[31:0] = {result[15:0], prev_dst_vgpr[15:0] }
               OPSEL may only be used for 16-bit operands, and must be zero for any other operands/results.
CLMP     1     Clamp result to [0, 1.0]

The VINTERP instructions include a builtin "s_waitcnt EXPcnt" to easily allow data hazard resolution for data
produced by LDS_PARAM_LOAD.

Instruction Restrictions and Limitations:
• V_INTERP instructions do not detect or report exceptions
• V_INTERP instructions do not support data forwarding into inputs that would normally come from LDS
data (sources A and C for V_INTERP_P10_* and source A for V_INTERP_P2_*).
VGPRs are preloaded with some or all of:
• I_persp_sample, J_persp_sample, I_persp_center, J_persp_center,
• I_persp_centroid, J_persp_centroid,
• I/W, J/W, 1.0/W,
• I_linear_sample, J_linear_sample,
• I_linear_center, J_linear_center,
• I_linear_centroid, J_linear_centroid
These instructions consume data that was supplied by LDS_PARAM_LOAD. These instructions contain a built-in
"s_waitcnt EXPcnt <= N" capability to allow for efficient software pipelining.
lds_param_load V0,  attr0
lds_param_load V10, attr1
lds_param_load V20, attr2
lds_param_load V30, attr3
v_interp_p10 V1,  V0[1],  Vi, V0[0]   s_waitcnt EXPcnt<=3 //Wait V0
v_interp_p10 V11, V10[1], Vi, V10[0]  s_waitcnt EXPcnt<=2
v_interp_p10 V21, V20[1], Vi, V20[0]  s_waitcnt EXPcnt<=1
v_interp_p10 V31, V30[1], Vi, V30[0]  s_waitcnt EXPcnt<=0 //Wait V30
v_interp_p2  V2,  V0[2],  Vj, V1
v_interp_p2  V12, V10[2], Vj, V11
v_interp_p2  V22, V20[2], Vj, V21
v_interp_p2  V32, V30[2], Vj, V31
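
The wait values in the pipelined sequence follow a simple pattern, sketched below. The sketch assumes that the parameter loads are issued back to back, return in issue order, and that nothing else from the wave (such as an export) is incrementing EXPcnt; `wait_exp_thresholds` is a hypothetical helper, not an assembler feature. Values are capped at 7 because the WaitEXP field is 3 bits wide; a smaller value than strictly needed only waits longer.

```python
def wait_exp_thresholds(num_loads, max_field=7):
    """Return, for each load in issue order, the 'EXPcnt <= N' value its first consumer needs."""
    thresholds = []
    for k in range(num_loads):
        still_allowed_outstanding = num_loads - 1 - k  # loads issued after load k
        thresholds.append(min(still_allowed_outstanding, max_field))
    return thresholds


four_loads = wait_exp_thresholds(4)
```

For four loads this reproduces the 3, 2, 1, 0 sequence used by the first four interpolation instructions.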

12.3.1. 16-bit Parameter Interpolation
16-bit interpolation operates on pairs of attribute values packed into a 32-bit VGPR. These use the same I and J
values during interpolation. OPSEL is used to select the upper or lower portion of the data.
There are variants of the 16-bit interpolation instructions that override the round mode to "round toward zero".
V_INTERP_P10_F16_F32 dst.f32 = vgpr_hi/lo.f16 * vgpr.f32 + vgpr_hi/lo.f16 // tmp = P10 * I + P0
• allows OPSEL; Src0 uses DPP8=1,1,1,1,5,5,5,5; Src2 uses DPP8=0,0,0,0,4,4,4,4
V_INTERP_P2_F16_F32 dst.f16 = vgpr_hi/lo.f16 * vgpr.f32 + vgpr.f32 // dst = P20 * J + tmp
• allows OPSEL; Src0 uses DPP8=2,2,2,2,6,6,6,6

12.4. LDS Direct Load
Direct loads are only available in LDS, not in GDS. Direct access is allowed only in CU mode, not WGP mode.
The LDS_DIRECT_LOAD instruction reads a single DWORD from LDS and returns it to a VGPR, broadcasting it
to all active lanes in the wave. M0 provides the address and data type. LDS_DIRECT_LOAD uses EXEC per
quad, not per pixel: if any pixel in a quad is enabled then the data is written to all 4 pixels in the quad.
LDS_DIRECT_LOAD uses EXPcnt to track completion.
LDS_DIRECT_LOAD uses the same instruction format and fields as LDS_PARAM_LOAD. See Pixel Parameter
Interpolation.
LDS_addr = M0[15:0] (byte address and must be DWORD aligned)
DataType = M0[18:16]
0 unsigned byte
1 unsigned short
2 DWORD
3 unused
4 signed byte
5 signed short
6,7 Reserved

Example:
LDS_DIRECT_LOAD V4    // load the value from LDS-address in M0[15:0] to V4

Signed byte and short data is sign-extended to 32 bits before writing the result to a VGPR; unsigned byte and short
data is zero-extended to 32 bits before writing to a VGPR.
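
The M0 decoding and the extension rules can be sketched as follows. This is an illustrative model only: `lds_direct_load` is a hypothetical name, LDS is modelled as a Python `bytes` object, and little-endian byte order with the byte or short taken from the lowest address is an assumption not stated in this section.

```python
# DataType code -> (size in bytes, signed?)
DATA_TYPES = {
    0: (1, False),  # unsigned byte
    1: (2, False),  # unsigned short
    2: (4, False),  # DWORD
    4: (1, True),   # signed byte
    5: (2, True),   # signed short
}


def lds_direct_load(lds, m0):
    """Return the 32-bit value that would be broadcast to the destination VGPR."""
    addr = m0 & 0xFFFF           # M0[15:0], byte address
    data_type = (m0 >> 16) & 7   # M0[18:16]
    if addr & 3:
        raise ValueError("LDS address must be DWORD aligned")
    if data_type not in DATA_TYPES:
        raise ValueError("unused or reserved DataType")
    size, signed = DATA_TYPES[data_type]
    value = int.from_bytes(lds[addr:addr + size], "little", signed=signed)
    return value & 0xFFFFFFFF    # sign or zero extension to 32 bits


lds = bytes([0x80, 0xFF, 0x34, 0x12])
as_u8 = lds_direct_load(lds, (0 << 16) | 0)
as_i8 = lds_direct_load(lds, (4 << 16) | 0)
```

The same LDS byte 0x80 is returned as 0x00000080 when read as an unsigned byte and as 0xFFFFFF80 when read as a signed byte.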

12.5. Data Share Indexed and Atomic Access
Both LDS and GDS can perform indexed and atomic data share operations. For brevity, "LDS" is used in the text
below and, except where noted, also applies to GDS.
Indexed and atomic operations supply a unique address per work-item from the VGPRs to the LDS, and supply
or return unique data per work-item back to VGPRs. Due to the internal banked structure of LDS, operations
can complete in as little as one cycle (for wave32, or 2 cycles for wave64), or take as many as 64 cycles, depending
upon the number of bank conflicts (addresses that map to the same memory bank).
Indexed operations are simple LDS load and store operations that read data from, and return data to, VGPRs.
Atomic operations are arithmetic operations that combine data from VGPRs and data in LDS, and write the
result back to LDS. Atomic operations have the option of returning the LDS "pre-op" value to VGPRs.
LDS Indexed and atomic instructions use LGKMcnt to track when they have completed. LGKMcnt is
incremented as each instruction is issued, and decremented when they have completed execution. LDS
instructions stay in-order with other LDS instructions from the same wave.
The table below lists and briefly describes the LDS instruction fields.

Table 59. LDS Instruction Fields

Field    Size  Description
OP       8     LDS opcode.
GDS      1     0 = LDS, 1 = GDS.
OFFSET0  8     Immediate address offset. Interpretation varies with opcode:
OFFSET1  8     Instructions with one address: combine the offset fields into a 16-bit unsigned byte offset:
               {offset1, offset0}.
               Instructions that have 2 addresses (e.g. {LOAD, STORE, XCHG}_2ADDR): use the offsets separately as
               two 8-bit unsigned offsets. Each offset is multiplied by 4 for 8, 16 and 32-bit data; multiplied by
               8 for 64-bit data.
VDST     8     VGPR to which result is written: either from LDS-load or atomic return value.
ADDR     8     VGPR that supplies the byte address offset.
DATA0    8     VGPR that supplies first data source.
DATA1    8     VGPR that supplies second data source.
M0       16    Unsigned byte Offset[15:0] used for: ds_load_addtid_b32, ds_store_addtid_b32 and for GDS-base/size

The M0 register is not used for most LDS-indexed operations: only the "ADDTID" instructions read M0 and for
these it represents a byte address.
Table 60. LDS Indexed Load/Store
Load / Store                                 Description
DS_LOAD_{B32,B64,B96,B128,U8,I8,U16,I16}     Load one value per thread into VGPRs; if signed, sign extend to
                                             DWORD; zero extend if unsigned.
DS_LOAD_2ADDR_{B32,B64}                      Load two values at unique addresses.
DS_LOAD_2ADDR_STRIDE64_{B32,B64}             Load 2 values at unique addresses; offset *= 64.
DS_STORE_{B32,B64,B96,B128,B8,B16}           Store one value from VGPR to LDS.
DS_STORE_2ADDR_{B32,B64}                     Store two values.
DS_STORE_2ADDR_STRIDE64_{B32,B64}            Store two values, offset *= 64.
DS_STOREXCHG_RTN_{B32,B64}                   Exchange GPR with LDS-memory.
DS_STOREXCHG_2ADDR_RTN_{B32,B64}             Exchange two separate GPRs with LDS-memory.
DS_STOREXCHG_2ADDR_STRIDE64_RTN_{B32,B64}    Exchange GPR with LDS-memory; offset *= 64.
"D16 ops" - Load ops write only 16 bits of VGPR, low or high; Store ops use 16 bits of VGPR:
DS_STORE_{B8, B16}_D16_HI                    Store 8 or 16 bits using high 16 bits of VGPR.
DS_LOAD_{U8, I8, U16}_D16                    Load unsigned or signed 8 or 16 bits into low-half of VGPR
DS_LOAD_{U8, I8, U16}_D16_HI                 Load unsigned or signed 8 or 16 bits into high-half of VGPR
DS_PERMUTE_B32                               Forward permute. Does not write any LDS memory. See LDS Lane-permute Ops
                                             for details.
DS_BPERMUTE_B32                              Backward permute. Does not write any LDS memory. See LDS Lane-permute Ops
                                             for details.

Single Address Instructions
LDS_Addr = LDS_BASE + VGPR[ADDR] + {InstOffset1,InstOffset0}

Double Address Instructions
LDS_Addr0 = LDS_BASE + VGPR[ADDR] + InstOffset0*ADJ
LDS_Addr1 = LDS_BASE + VGPR[ADDR] + InstOffset1*ADJ
Where ADJ = 4 for 8, 16 and 32-bit data types; and ADJ = 8 for 64-bit.
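
The two address forms can be written out as a short model. This is a sketch under stated assumptions: the function names are hypothetical, range checks and LDS size clamping are omitted, and the STRIDE64 variants are modelled by multiplying the scaled offset by a further 64, which is one reading of the "offset *= 64" note in Table 60.

```python
def ds_single_addr(lds_base, vgpr_addr, offset0, offset1):
    """Single-address form: the two 8-bit fields form one 16-bit byte offset."""
    return lds_base + vgpr_addr + ((offset1 << 8) | offset0)


def ds_double_addr(lds_base, vgpr_addr, offset0, offset1, data_bits=32, stride64=False):
    """Double-address form: each 8-bit offset is scaled by ADJ (4, or 8 for 64-bit data)."""
    adj = 8 if data_bits == 64 else 4
    if stride64:
        adj *= 64  # assumption: STRIDE64 multiplies the scaled offset by 64
    base = lds_base + vgpr_addr
    return base + offset0 * adj, base + offset1 * adj


single = ds_single_addr(0x100, 0x20, offset0=0x04, offset1=0x01)
pair_b32 = ds_double_addr(0, 16, offset0=1, offset1=2, data_bits=32)
pair_b64 = ds_double_addr(0, 16, offset0=1, offset1=2, data_bits=64)
```

For the same offset fields, a 2ADDR B64 access reaches twice as far from VGPR[ADDR] as a B32 access because ADJ doubles.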

The double address instructions are: LOAD_2ADDR*, STORE_2ADDR*, and STOREXCHG_2ADDR_*. The
address comes from VGPR, and both VGPR[ADDR] and InstOffset are byte addresses. At the time of wave
creation, LDS_BASE is assigned to the physical LDS region owned by this wave or work-group.
DS_{LOAD,STORE}_ADDTID Addressing
LDS_Addr = LDS_BASE + {InstOffset1, InstOffset0} + TID(0..63)*4 + M0
Note: no part of the address comes from a VGPR.

M0 must be DWORD-aligned.

The "ADDTID" (add thread-id) is a separate form where the base address for the instruction is common to all
threads, but then each thread has a fixed offset added in based on its thread-ID within the wave. This can allow
a convenient way to quickly transfer data between VGPRs and LDS without having to use a VGPR to supply an
address.
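
As an illustration of the ADDTID form, the per-thread address can be computed as below. `ds_addtid_addr` is a hypothetical helper; only M0[15:0] is used, matching the 16-bit M0 entry in Table 59, and no range checking is modelled.

```python
def ds_addtid_addr(lds_base, offset0, offset1, tid, m0):
    """LDS_Addr = LDS_BASE + {InstOffset1, InstOffset0} + TID*4 + M0 (no VGPR involved)."""
    m0_offset = m0 & 0xFFFF
    if m0_offset & 3:
        raise ValueError("M0 must be DWORD-aligned")
    return lds_base + ((offset1 << 8) | offset0) + tid * 4 + m0_offset


# Consecutive threads land on consecutive DWORDs.
addrs = [ds_addtid_addr(0x1000, 0x10, 0x00, tid, 0x20) for tid in range(4)]
```

Because the only per-thread term is TID*4, a wave writes or reads one contiguous block of DWORDs.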
LDS & GDS Opcodes
Instruction Fields: op, gds, offset0, offset1, vdst, addr, data0, data1
32-bit no return                32-bit with return             64-bit no return             64-bit with return
ds_store_{b32,b16,b8}           ds_load_{b32,u8,i8,u16,i16}    ds_store_b64                 ds_load_b64
ds_store_2addr_b32              ds_load_2addr_b32              ds_store_2addr_b64           ds_load_2addr_b64
ds_store_2addr_stride64_b32     ds_load_2addr_stride64_b32     ds_store_2addr_stride64_b64  ds_load_2addr_stride64_b64
ds_store_b8_d16_hi              ds_load_u8_d16                 ds_store_b{64,96,128}        ds_load_b{64,96,128}
ds_store_b16_d16_hi             ds_load_u8_d16_hi              -                            ds_condxchg32_rtn_b64
ds_store_addtid_b32 (LDS only)  ds_load_i8_d16                 -                            -
-                               ds_load_i8_d16_hi              -                            -
-                               ds_load_u16_d16                -                            -
-                               ds_load_u16_d16_hi             -                            -
-                               ds_load_addtid_b32 (LDS only)  -                            -
-                               ds_permute_b32 (LDS only)      -                            -
-                               ds_bpermute_b32 (LDS only)     -                            -
-                               ds_swizzle_b32 (LDS only)      -                            -
-                               ds_consume                     -                            -
-                               ds_append                      -                            -
GDS-only Opcodes
ds_ordered_count
gws_init
gws_sema_v
gws_sema_br
gws_sema_p
gws_barrier
gws_sema_release_all
ds_add_gs_reg_rtn
ds_sub_gs_reg_rtn

12.5.1. LDS Atomic Ops
Atomic ops combine data from a VGPR with data in LDS, write the result back to LDS memory and optionally
return the "pre-op" value from LDS memory back to a VGPR. When multiple lanes in a wave access the same
LDS location, it is not specified in which order the lanes perform their operations, only that each lane
performs the complete read-modify-write operation before another lane operates on the data.
LDS_Addr0 = LDS_BASE + VGPR[ADDR] + {InstOffset1,InstOffset0}
VGPR[ADDR] is a byte address. The DATA0, DATA1 and VDST VGPRs are double-GPRs (register pairs) for 64-bit data.
VGPR data sources can only be VGPRs or constant values, not SGPRs. Floating point atomic ops use the MODE
register to control denormal flushing behavior.
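
The read-modify-write guarantee can be illustrated with a small model of an add-with-return atomic. This is a sketch only: `ds_add_rtn_u32` here is a Python stand-in, LDS is modelled as a dictionary from byte address to 32-bit value, and the `lane_order` argument exists purely to show that the order in which colliding lanes are served is not specified.

```python
def ds_add_rtn_u32(lds, addrs, data, exec_mask, lane_order=None):
    """Perform one complete read-modify-write per active lane; return pre-op values by lane."""
    order = lane_order if lane_order is not None else range(len(addrs))
    pre_op = {}
    for lane in order:
        if not (exec_mask >> lane) & 1:
            continue  # inactive lanes do not access LDS
        old = lds.get(addrs[lane], 0)
        lds[addrs[lane]] = (old + data[lane]) & 0xFFFFFFFF
        pre_op[lane] = old
    return pre_op


# Four lanes all target address 0 with different addends.
lds_a = {0: 0}
ret_a = ds_add_rtn_u32(lds_a, [0, 0, 0, 0], [1, 2, 3, 4], 0b1111)
lds_b = {0: 0}
ret_b = ds_add_rtn_u32(lds_b, [0, 0, 0, 0], [1, 2, 3, 4], 0b1111, lane_order=[3, 2, 1, 0])
```

The final memory value is the same for both orders, but the returned pre-op values differ, which is why a shader should not depend on which lane observes which intermediate value.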
LDS & GDS Atomic Opcodes
Instruction Fields: op, gds, offset0, offset1, vdst, addr, data0, data1
32-bit no return      32-bit with return                   64-bit no return      64-bit with return
ds_add_u32            ds_add_rtn_u32                       ds_add_u64            ds_add_rtn_u64
ds_sub_u32            ds_sub_rtn_u32                       ds_sub_u64            ds_sub_rtn_u64
ds_rsub_u32           ds_rsub_rtn_u32                      ds_rsub_u64           ds_rsub_rtn_u64
ds_inc_u32            ds_inc_rtn_u32                       ds_inc_u64            ds_inc_rtn_u64
ds_dec_u32            ds_dec_rtn_u32                       ds_dec_u64            ds_dec_rtn_u64
ds_min_{u32,i32,f32}  ds_min_rtn_{u32,i32,f32}             ds_min_{u64,i64,f64}  ds_min_rtn_{u64,i64,f64}
ds_max_{u32,i32,f32}  ds_max_rtn_{u32,i32,f32}             ds_max_{u64,i64,f64}  ds_max_rtn_{u64,i64,f64}
ds_and_b32            ds_and_rtn_b32                       ds_and_b64            ds_and_rtn_b64
ds_or_b32             ds_or_rtn_b32                        ds_or_b64             ds_or_rtn_b64
ds_xor_b32            ds_xor_rtn_b32                       ds_xor_b64            ds_xor_rtn_b64
ds_mskor_b32          ds_mskor_rtn_b32                     ds_mskor_b64          ds_mskor_rtn_b64
ds_cmpstore_b32       ds_cmpstore_rtn_b32                  ds_cmpstore_b64       ds_cmpstore_rtn_b64
ds_cmpstore_f32       ds_cmpstore_rtn_f32                  ds_cmpstore_f64       ds_cmpstore_rtn_f64
ds_add_f32            ds_add_rtn_f32                       -                     -
-                     ds_storexchg_rtn_b32                 -                     ds_storexchg_rtn_b64
-                     ds_storexchg_2addr_rtn_b32           -                     ds_storexchg_2addr_rtn_b64
-                     ds_storexchg_2addr_stride64_rtn_b32  -                     ds_storexchg_2addr_stride64_rtn_b64

12.5.2. LDS Lane-permute Ops
DS_PERMUTE instructions allow data to be swizzled arbitrarily across 32 lanes. Two versions of the instruction
are provided: forward (scatter) and backward (gather). These exist in LDS only, not GDS.
Note that in wave64 mode the permute operates only across 32 lanes at a time on each half of a wave64. In
other words, it executes as if it were two independent wave32’s. Each half-wave can use indices in the range 0-31
to reference lanes in that same half-wave.
These instructions use the LDS hardware but do not use any memory storage, and may be used by waves that
have not allocated any LDS space. The instructions supply a data value from VGPRs and an index value per
lane.

• ds_permute_b32 : Dst[index[0..31]] = src[0..31]
• ds_bpermute_b32 : Dst[0..31] = src[index[0..31]]
Where [0..31] is the lane number.
The EXEC mask is honored for both reading the source and writing the destination. Index values out of range
wrap around (only index bits [6:2] are used, the other bits of the index are ignored). Reading from disabled
lanes returns zero.
In the instruction word: VDST is the dest VGPR, ADDR is the index VGPR, and DATA0 is the source data VGPR.
Note that index values are in bytes (so multiply by 4), and have the 'offset0' field added to them before use.
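
The forward and backward permutes can be modelled for a single wave32 as below. This is a sketch with labelled assumptions: `None` marks a destination lane whose VGPR is left untouched because the lane is disabled; for the forward permute, destination lanes that receive no data are assumed to get zero and a later source lane is assumed to overwrite an earlier one on a collision. Neither of those two behaviours is stated in this section.

```python
WAVE = 32


def lane_index(byte_index, offset0=0):
    """Index values are in bytes; only bits [6:2] select the lane."""
    return ((byte_index + offset0) >> 2) & 0x1F


def ds_bpermute_b32(src, index, exec_mask, offset0=0):
    """Backward permute (gather): dst[lane] = src[index[lane]]."""
    dst = [None] * WAVE
    for lane in range(WAVE):
        if not (exec_mask >> lane) & 1:
            continue
        source_lane = lane_index(index[lane], offset0)
        # Reading from a disabled lane returns zero.
        dst[lane] = src[source_lane] if (exec_mask >> source_lane) & 1 else 0
    return dst


def ds_permute_b32(src, index, exec_mask, offset0=0):
    """Forward permute (scatter): dst[index[lane]] = src[lane]."""
    staged = [0] * WAVE  # assumption: lanes that receive nothing read back zero
    for lane in range(WAVE):
        if (exec_mask >> lane) & 1:
            staged[lane_index(index[lane], offset0)] = src[lane]
    return [staged[lane] if (exec_mask >> lane) & 1 else None for lane in range(WAVE)]


src = list(range(100, 132))
reverse_index = [(31 - lane) * 4 for lane in range(WAVE)]  # lane numbers times 4
reversed_data = ds_bpermute_b32(src, reverse_index, 0xFFFFFFFF)
rotate_index = [((lane + 1) % WAVE) * 4 for lane in range(WAVE)]
rotated_data = ds_permute_b32(src, rotate_index, 0xFFFFFFFF)
```

The gather form reverses the wave when each lane names its mirror lane, and the scatter form rotates the data by one lane when each lane names its right-hand neighbour.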

12.5.3. DS Stack Operations for Ray Tracing
DS_BVH_STACK_RTN_B32 is an LDS instruction to manage a per-thread shallow stack in LDS used in ray
tracing BVH traversal. BVH structures consist of box nodes and triangle nodes. A box node has up to four child
node pointers that may all be returned to the shader (to VGPRs) for a given ray (thread). A traversal shader
follows one pointer per ray per iteration, and extra pointers can be pushed to a per-thread stack in LDS. Note:
the returned pointers are sorted.
This "short stack" has a limited size; beyond that size, the stack wraps around and overwrites older items. When the
stack is exhausted, the shader should switch to a stackless mode where it looks up the parent of the current
node from a table in memory. The shader program tracks the last visited address to avoid re-traversing
subtrees.
DS_BVH_STACK_RTN_B32 vgpr(dst), vgpr(stack_addr), vgpr(lvaddr), vgpr[4](data)
Field    Size  Description
OP       8     Instruction == DS_BVH_STACK_RTN_B32 (LDS only)
GDS      1     1 = GDS, 0 = LDS (must be: 0 = LDS)
OFFSET0  8     unused
OFFSET1  8     bits[5:4] carry StackSize (8, 16, 32, 64)
VDST     8     Destination VGPR for resulting address (e.g. X or top of stack)
               Returns the next "LV addr"
ADDR     8     STACK_VGPR: Both a source and destination VGPR:
               supplies the LDS stack address and is written back with updated address.
               stack_addr[31:18] = stack_base[15:2] : stack base address (relative to allocated LDS space).
               stack_addr[17:16] = stack_size[1:0] : 0=8DWORDs, 1=16, 2=32, 3=64 DWORDs per thread
               stack_addr[15:0] = stack_index[15:0]. (bits [1:0] must be zero).
DATA0    8     LVADDR: Last Visited Address. Is compared with data values (next field) to determine the next
               node to visit.
DATA1    8     4 VGPRs (X,Y,Z,W).
M0       16    Unused.
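
The packing of the STACK_VGPR value can be illustrated with a pair of helpers. This is a sketch of the bit layout listed for the ADDR field only; `pack_stack_addr` and `decode_stack_addr` are hypothetical names, and the push, pop and wrap-around behaviour of the instruction itself is not modelled.

```python
def pack_stack_addr(stack_base, size_code, stack_index):
    """Build stack_addr from a byte base address, a size code (0..3) and a stack index."""
    if stack_base & 3 or stack_index & 3:
        raise ValueError("base and index must have bits [1:0] equal to zero")
    return (((stack_base >> 2) & 0x3FFF) << 18) | ((size_code & 3) << 16) | (stack_index & 0xFFFF)


def decode_stack_addr(stack_addr):
    """Split stack_addr into base (bytes), stack size (DWORDs per thread) and index."""
    return {
        "stack_base": ((stack_addr >> 18) & 0x3FFF) << 2,       # stack_addr[31:18] = base[15:2]
        "stack_size_dwords": 8 << ((stack_addr >> 16) & 0x3),   # 0=8, 1=16, 2=32, 3=64
        "stack_index": stack_addr & 0xFFFF,
    }


packed = pack_stack_addr(stack_base=0x40, size_code=2, stack_index=8)
fields = decode_stack_addr(packed)
```

Keeping base, size and index in one register is what lets the instruction return an updated stack address in the same VGPR it read.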

12.6. Global Data Share
Global Data Share is similar to LDS, but is a single memory accessible by all waves on the GPU. Global Data
Share uses the same instruction format as local data share (indexed operations only - no interpolation or direct
loads). Instructions increment the LGKMcnt for all loads, stores and atomics, and decrement LGKMcnt when
the instruction completes. GDS instructions support only one active lane per instruction. The first active lane
(based on EXEC) is used and others are ignored.
M0 is used for:
• [15:0] holds SIZE, in bytes
• [31:16] holds BASE address in bytes

12.6.1. GS NGG Streamout Instructions
The DS_ADD_GS_REG_RTN and DS_SUB_GS_REG_RTN instructions are used only by the GS stage, and are
used for streamout. These instructions perform atomic add or sub operations to data in dedicated registers, not
in GDS memory, and return the pre-op value. The source register is 32 bits and is an unsigned int. These 2
instructions increment the wave’s LGKMcnt, and decrement LGKMcnt when the instruction completes.
Table 61. GDS Streamout Register Targets
32-bit source, 32-bit dest & return value    32-bit source, 64-bit dest & return value
offset[5:2]  Register                        offset[5:2]  Register
0            GDS_STRMOUT_DWORDS_WRITTEN_0    8            GDS_STRMOUT_PRIMS_NEEDED_0
1            GDS_STRMOUT_DWORDS_WRITTEN_1    9            GDS_STRMOUT_PRIMS_WRITTEN_0
2            GDS_STRMOUT_DWORDS_WRITTEN_2    10           GDS_STRMOUT_PRIMS_NEEDED_1
3            GDS_STRMOUT_DWORDS_WRITTEN_3    11           GDS_STRMOUT_PRIMS_WRITTEN_1
4            GDS_GS_0                        12           GDS_STRMOUT_PRIMS_NEEDED_2
5            GDS_GS_1                        13           GDS_STRMOUT_PRIMS_WRITTEN_2
6            GDS_GS_2                        14           GDS_STRMOUT_PRIMS_NEEDED_3
7            GDS_GS_3                        15           GDS_STRMOUT_PRIMS_WRITTEN_3

Table 62. DS_ADD_GS_REG_RTN and DS_SUB_GS_REG_RTN
Field    Size  Description
OP       8     ds_add_gs_reg_rtn, ds_sub_gs_reg_rtn
OFFSET0  8     gs_reg_index[3:0]=offset0[5:2] indexes the GS register array
VDST     8     VGPR to write pre-op value to
DATA0    8     operand, from the first valid data; if no valid data (i.e., EXEC==0), the operand
               is 0.

• The input comes from the first valid data of DATA0.
• If offset[5:2] is 8-15: The operation is mapped to a 64b operation that treats 2 dst registers as a combined one.
The source data is still 32b. The post-op result is 64b and is stored back to the 2 dst registers. The return value
takes 2 VGPRs.
• If offset[5:2] is 0-7: The operation is mapped to a normal 32b operation.
• For ds_add_gs_reg_rtn, the atomic add operation is
◦ VDST[0] = GS_REG[offset0[5:2]][31:0]
◦ If (offset0[5:2] >= 8) VDST[1] = GS_REG[offset0[5:2]][63:32]
◦ GS_REG[offset0[5:2]] += DATA0
• For ds_sub_gs_reg_rtn, the atomic sub operation is
◦ VDST[0] = GS_REG[offset0[5:2]][31:0]
◦ If (offset0[5:2] >= 8) VDST[1] = GS_REG[offset0[5:2]][63:32]
◦ GS_REG[offset0[5:2]] -= DATA0
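
The register selection and the 32-bit versus 64-bit behaviour can be sketched as follows. The model makes several assumptions: `ds_gs_reg_rtn` is a hypothetical name, the 16 GS registers are a plain Python list, the register index is taken from offset0[5:2] throughout, and results wrap modulo the register width, which this section does not state.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def ds_gs_reg_rtn(gs_reg, offset0, data0, subtract=False):
    """Atomically add or subtract data0; return the pre-op value as a list of VGPR DWORDs."""
    index = (offset0 >> 2) & 0xF       # gs_reg_index[3:0] = offset0[5:2]
    is_64bit = index >= 8              # registers 8..15 are 64-bit targets
    pre_op = gs_reg[index]
    operand = data0 & MASK32           # the source is always 32 bits, unsigned
    post_op = pre_op - operand if subtract else pre_op + operand
    gs_reg[index] = post_op & (MASK64 if is_64bit else MASK32)
    returned = [pre_op & MASK32]
    if is_64bit:
        returned.append((pre_op >> 32) & MASK32)
    return returned


gs_reg = [0] * 16
gs_reg[1] = 5
gs_reg[8] = 0xFFFFFFFF
ret32 = ds_gs_reg_rtn(gs_reg, offset0=1 << 2, data0=3)
ret64 = ds_gs_reg_rtn(gs_reg, offset0=8 << 2, data0=1)
```

The 64-bit targets carry into the upper DWORD, while the returned pre-op value occupies two VGPRs.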

12.7. Alignment and Errors
GDS and LDS operations (both direct & indexed) report Memory Violation (memviol) for misaligned atomics.
LDS handles misaligned indexed reads & writes, but only when SH_MEM_CONFIG.alignment_mode ==
UNALIGNED. Atomics must be aligned.
LDS Alignment modes (config-reg controlled, in SH_MEM_CONFIG):
• ALIGNMENT_MODE_DWORD: Automatic alignment to multiple of element size
• ALIGNMENT_MODE_UNALIGNED: No alignment requirements.
#  LDS Access Type          Source Inst Types  Controls                                Behavior
1  Direct (Read Broadcast)  ALU ops            LDS_CONFIG.ADDR_OUT_OF_RANGE_REPORTING  Out of range direct operations report memviol if
                                                                                       ADDR_OUT_OF_RANGE_REPORTING is true.
2  Indexed Atomic           DS ops, FLAT ops   LDS_CONFIG.ADDR_OUT_OF_RANGE_REPORTING  Out of range atomic operations report memviol if
                                                                                       ADDR_OUT_OF_RANGE_REPORTING is true.
3  Indexed Non-Atomic       DS ops, FLAT ops   LDS_CONFIG.ADDR_OUT_OF_RANGE_REPORTING  the LSBs are ignored to force alignment. No memviol
                                                                                       is generated.
                                                                                       Out of range indexed operations report memviol if
                                                                                       ADDR_OUT_OF_RANGE_REPORTING is true.

Chapter 13. Float Memory Atomics
Floating point atomics can be issued as LDS, Buffer, and Flat/Global/Scratch instructions.

13.1. Rounding
LDS and Memory atomics have the rounding mode for float-atomic-add fixed at "round to nearest even". The
MODE.round bits are ignored.

13.2. Denormals
When these operate on floating point data, there is the possibility of the data containing denormal numbers, or
the operation producing a denormal. The floating point atomic instructions have the option of passing
denormal values through, or flushing them to zero.
LDS instructions allow denormals to be passed through or flushed to zero based on the MODE.denormal wave-state
register. As with VALU ops, "denorm_single" affects F32 ops and "denorm_double" affects F64. LDS
instructions use both FP_DENORM bits (allow_input_denormal, allow_output_denormal) to control flushing of
inputs and outputs separately.
• Float 32 bit adder uses both input and output denorm flush controls from MODE
• Float CMP, MIN and MAX use only the "input denormal" flushing control
◦ Each input to the comparisons flushes the mantissa of both operands to zero before the compare if the
exponent is zero and the flush denorm control is active. For Min and Max the actual result returned is
the selected non-flushed input.
◦ CompareStore ("compare swap") flushes the result when input denormal flushing occurs.
Cache Atomic Float Denormal (Buffer, Flat, Global, Scratch)
Min/Max_F32         Mode
CmpStore_F32, _F64  Mode
Add_F32             Flush

LDS Float Atomics
Min/Max_F32         Mode
CmpStore_F32, _F64  Mode
Add_F32             Mode
Min/Max_F64         Mode

• "Flush" = flush all input denorm
• "No Flush" = don’t flush input denorm
• "Mode" = denormal flush controlled by bit from shader’s "MODE . fp_denorm" register
Note that MIN and MAX when flushing denormals only do it for the comparison, but the result is an
unmodified copy of one of the sources. CompareStore ("compare swap") flushes the result when input
denormal flushing occurs.
Memory Atomics:

The floating point atomic instructions (ds_{min,max,cmpst}_f32) have the option of passing denormal values
through, or flushing them to zero. This is controlled with the MODE.fp_denorm bits that also control VALU
denormal behavior. There is no separate input and output denormal control: only bit 0 of sp_denorm or bit 0 of
dp_denorm is considered. The rest of the denormal rules are identical to LDS.
Float atomic add is hardwired to flush input denormals - it does not use the MODE.fp_denorm bits.

13.3. NaN Handling
Not A Number ("NaN") is an IEEE-754 value representing a result that cannot be computed.
There are two types of NaN: quiet and signaling
• Quiet NaN Exponent=0xFF, Mantissa MSB=1
• Signaling NaN Exponent=0xFF, Mantissa MSB=0 and at least one other mantissa bit ==1
The LDS does not produce any exception or "signal" due to a signaling NaN.
DS_ADD_F32 can create a quiet NaN, or propagate NaN from its inputs: if either input is a NaN, the output is
that same NaN, and if both inputs are NaN, the NaN from the first input is selected as the output. Signaling NaN
is converted to Quiet NaN.
Floating point atomics (CMPSWAP, MIN, MAX) flush input denormals only when
MODE (allow_input_denorm)=0, otherwise values are passed through without modification. When flushing,
denorms are flushed before the operation (i.e. before the comparison).
FP Max Selection Rules:
if      (src0 == SNaN) result = QNaN (src0)   // bits of SRC0 are preserved but is a QNaN
else if (src1 == SNaN) result = QNaN (src1)
else                   result = larger of (src0, src1)

"Larger" order from smallest to largest: QNaN, -inf, -float, -denorm, -0, +0, +denorm, +float, +inf

FP Min Selection Rules:
if      (src0 == SNaN) result = QNaN (src0)
else if (src1 == SNaN) result = QNaN (src1)
else                   result = smaller of (src0, src1)

"Smaller" order from smallest to largest: -inf, -float, -denorm, -0, +0, +denorm, +float, +inf, QNaN

FP Compare Swap: only swap if the compare condition (==) is true, treating +0 and -0 as equal
doSwap = (src0 != NaN) && (src1 != NaN) && (src0 == src1) // allow +0 == -0
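
The selection rules above can be expressed on raw F32 bit patterns. This is an illustrative sketch, not the hardware algorithm: the function names are hypothetical, denormal flushing is left out, and when both inputs compare equal in the ordering (for example two quiet NaNs) the sketch returns src0, which is an assumption.

```python
import math

EXP_MASK = 0x7F800000
MANT_MASK = 0x007FFFFF
QUIET_BIT = 0x00400000  # mantissa MSB


def is_nan(bits):
    return (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0


def is_snan(bits):
    return is_nan(bits) and not (bits & QUIET_BIT)


def order_key(bits):
    """Integer key giving -inf < -float < -denorm < -0 < +0 < +denorm < +float < +inf."""
    magnitude = bits & 0x7FFFFFFF
    return -magnitude - 1 if bits & 0x80000000 else magnitude


def fp_max(src0, src1):
    if is_snan(src0):
        return src0 | QUIET_BIT
    if is_snan(src1):
        return src1 | QUIET_BIT
    k0 = -math.inf if is_nan(src0) else order_key(src0)  # QNaN ranks lowest for max
    k1 = -math.inf if is_nan(src1) else order_key(src1)
    return src0 if k0 >= k1 else src1


def fp_min(src0, src1):
    if is_snan(src0):
        return src0 | QUIET_BIT
    if is_snan(src1):
        return src1 | QUIET_BIT
    k0 = math.inf if is_nan(src0) else order_key(src0)   # QNaN ranks highest for min
    k1 = math.inf if is_nan(src1) else order_key(src1)
    return src0 if k0 <= k1 else src1


def do_swap(src0, src1):
    """Compare-swap condition: equal, neither a NaN, with +0 == -0."""
    if is_nan(src0) or is_nan(src1):
        return False
    both_zero = ((src0 | src1) & 0x7FFFFFFF) == 0
    return both_zero or src0 == src1


ONE, QNAN, SNAN = 0x3F800000, 0x7FC00000, 0x7F800001
POS_ZERO, NEG_ZERO = 0x00000000, 0x80000000
```

With these rules a quiet NaN never wins a min or max against a number, a signaling NaN is returned quieted, and the sign of zero decides between -0 and +0.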

Float Add rules:
1. -INF + INF = QNAN (mantissa is all zeros except MSB)
2. +/-INF + NAN = QNAN (NAN input is copied to output but made quiet NAN)
3. -INF + INF, or INF - INF = -QNAN
4. -0 + 0 = +0

5. INF + (float, +0, -0) = INF, with infinity sign preserved
6. NaN + NaN = SRC0’s NaN, converted to QNaN

13.4. Global Wave Sync & Atomic Ordered Count
Global Wave Sync (GWS) provides a capability to synchronize between different waves across the entire GPU.
GWS instructions use LGKMcnt to determine when the operation has completed.

13.4.1. GWS and Ordered Count Programming Rule
"GWS" instructions (ordered count and GWS*) must be issued as a single instruction clause of the form:
S_WAITCNT LGKMcnt==0 // this is only necessary if there might be any outstanding GDS instructions
GWS_instruction
S_WAITCNT LGKMcnt==0
<any instruction except S_ENDPGM> // pad with NOP if the next instruction is S_ENDPGM
Before issuing a GWS or Ordered Count instruction, the user must make sure that there are no outstanding GDS
instructions. Failure to do this may cause a "NACK" to arrive out of order.
Programming Rule: The source and destination VGPRs in a GWS or ordered count instruction must not
be the same. When an ordered count operation is NACK’d, the destination VGPR may be written with data.
If this VGPR is the same as the source VGPR, that prevents the instruction from being replayed later if
it was interrupted due to a context switch.

13.4.2. EXEC Handling
GDS / GWS is now only a single lane wide. If the EXEC mask has more than one bit set to 1, hardware behaves
as if EXEC had only one bit set: the least significant one. GDS / GWS opcodes are not skipped when
EXEC==0.
For these opcodes, if EXEC==0, the hardware acts as if EXEC==0…001 for the instruction:
ORDERED_COUNT / GWS_INIT / SEMA_BR / GWS_BARRIER
For other GDS / GWS opcodes, the instruction is sent with EXEC==0 and has no effect: logically, nothing is
sent to or returned from GDS/GWS. In hardware, data is sent and returned but ignored at both ends, in order to
keep LGKMcnt working.
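The EXEC rules above can be summarized in a small Python sketch. The function name and the opcode labels are hypothetical, chosen only for illustration, and the sketch does not model Append/Consume, which still consider the full EXEC mask.

```python
# Opcodes for which hardware acts as if lane 0 were enabled when EXEC == 0.
# The string labels are illustrative, not assembler mnemonics.
FORCE_LANE0_OPS = {"DS_ORDERED_COUNT", "GWS_INIT", "GWS_SEMA_BR", "GWS_BARRIER"}

def effective_exec(exec_mask: int, opcode: str) -> int:
    """Return the single-lane EXEC mask a GDS/GWS opcode behaves as if it had."""
    if exec_mask != 0:
        # Keep only the least significant set bit.
        return exec_mask & -exec_mask
    if opcode in FORCE_LANE0_OPS:
        # EXEC == 0 is treated as 0...001 for these opcodes.
        return 1
    # Other opcodes are sent with EXEC == 0 and have no effect.
    return 0
```

For example, `effective_exec(0b1100, "GWS_SEMA_V")` yields `0b0100`, and `effective_exec(0, "GWS_INIT")` yields `1`.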

13.4.3. Ordered Count
Ordered count generates a pointer in wave-creation order to an append buffer of unlimited size.
Ordered Alloc generates a pointer to a ring buffer of finite size which is returned to the wave in "VDST". The
ordered alloc counter can be issued up to 4 times from a shader. Ordered count and alloc use the same
instruction - the difference is in how the GDS counters are initialized with their config registers.

The GDS unit supports an instruction that operates on dedicated append/consume counters:
• DS_ORDERED_COUNT Takes one value from the first valid lane and sends it to GDS.
For shaders that use this function, this instruction must be issued once and only once per wave. The GDS
receives these in arbitrary order from different waves across the chip, but processes them in the order the
waves were created. The GDS contains a large FIFO to hold these pending requests.
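Instruction Fields

Field          Instruction type         Usage
OP             Normal GDS               any GDS op
               GDS Ordered Count        DS_ORDERED_COUNT*
               Global Wave Sync (GWS)   GWS_INIT, GWS_SEMA_V, GWS_SEMA_BR, GWS_SEMA_P,
                                        GWS_SEMA_RELEASE_ALL, GWS_BARRIER
GDS            Normal GDS               1
               GDS Ordered Count        1
               Global Wave Sync (GWS)   1
VDST           Normal GDS               VGPR to write result to
               GDS Ordered Count        VGPR to write result to
               Global Wave Sync (GWS)   unused
ADDR           Normal GDS               VGPR which supplies byte address offset
               GDS Ordered Count        Increment, from the first valid data. If no valid data, increment=0.
               Global Wave Sync (GWS)   Used for: barrier, init and sema_br; unused for others.
DATA0          Normal GDS               VGPR which supplies first data source
               GDS Ordered Count        unused
               Global Wave Sync (GWS)   unused
DATA1          Normal GDS               VGPR which supplies second data source
               GDS Ordered Count        unused
               Global Wave Sync (GWS)   unused
Offset0[7:0]   Normal GDS               Same usage as LDS
               GDS Ordered Count        Ordered Count Index. Must be multiple of 4 (2 LSB’s must be zero)
               Global Wave Sync (GWS)   { 0,0,resource_index[5:0] }
Offset1[0]     Normal GDS               Same usage as LDS
               GDS Ordered Count        wave_release
               Global Wave Sync (GWS)   unused
Offset1[1]     Normal GDS               Same usage as LDS
               GDS Ordered Count        wave_done
               Global Wave Sync (GWS)   unused
Offset1[3:2]   Normal GDS               Same usage as LDS
               GDS Ordered Count        unused
               Global Wave Sync (GWS)   unused
Offset1[5:4]   Normal GDS               Same usage as LDS
               GDS Ordered Count        ordered-index-opcode:
                                        0 = Add (ds_add_rtn_b32)
                                        1 = Exchange (ds_wrxchg_rtn_b32)
                                        2 = Reserved
                                        3 = Wrap (ds_wrap_rtn_b32)
               Global Wave Sync (GWS)   unused
Offset1[7:6]   Normal GDS               Same usage as LDS
               GDS Ordered Count        unused
               Global Wave Sync (GWS)   unused
M0[15:0]       Normal GDS               gds_size[15:0] in bytes
               GDS Ordered Count        { waveCrawlerInc[2:0], logicalWaveID[12:0] }
                                        In graphics pipe, logicalWaveID[2:0] is really packerID
               Global Wave Sync (GWS)   unused
M0[31:16]      Normal GDS               gds_base[15:0] in bytes
               GDS Ordered Count        orderedCntBase[15:0]
                                        Ordered count base is in DWORDs.
                                        (2 LSB’s are ignored, forced to zero - DWORD aligned)
               Global Wave Sync (GWS)   { 10'0, gds_base[5:0] }
                                        gdsBase = resourceBase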

ORDERED COUNT Targets
The OFFSET0[5:2] field of ordered-count instructions references one of 16 registers in GDS. These are listed
in the GDS section: GS NGG Streamout Instructions. Only the ADD instruction may be used on targets that are
64 bits (offset[5:2] = 8 - 15).
Exchange can only be used with offset[5:2] = 4 - 7.
APPEND and CONSUME
Append and Consume count bits in EXEC and add or subtract the count from the GDS stored value. GDS
now only operates on a single lane, but for Append & Consume the full EXEC mask is still considered.

13.4.4. Global Wave Sync
"Global Wave Sync" allows the waves running in different thread-groups, including across different CU’s and
SE’s to synchronize through barriers and semaphores.
The Global Wave Sync (GWS) unit contains 64 sync resources that are allocated by the Command Processor to
applications (VM_ID’s). These sync resources can be configured to act as counting semaphores or barriers.
• GWS registers must be configured before use via GRBM reg writes: gds_gws_resource_cntl,
gds_gws_resource
• GDS_GWS_RESOURCE: Flag, Counter (number of waves at resource), type, head_{queue, valid, flag}
• GDS_GWS_VMID: Per-VMID register identifying the range of GWS resources owned by each VMID (base &
size)
The GWS contains 64 sync resources, each of which contains the following state:
• 1-bit state flag: 0 or 1 - used to separate even & odd passes, distinguish entering waves from leaving.
• a 12-bit counter - unsigned int
• 1 byte Type: Semaphore or Barrier
• Head-of-queue + valid + flag (13 bits)
• Tail of Queue + flag (12 bits)
• FIFO - holds full wave-id and a 1-bit flag
When used by the shader, M0 supplies the "resource_base[5:0]" which is used to virtualize the resources.
The resource offset comes from the GDS/GWS instruction’s "offset0[5:0]" field and is added to M0 and also to a
base-address per VMID to get the final resource ID. Resource ID’s are clamped to the range owned by this
VMID. If clamping occurs, the GWS returns a NACK which causes the wave to rewind the PC and halt.
• GWS_resource_id = (GDS_GWS_VMID.BASE(vmid) + M0[21:16] + offset0[5:0]) % 64
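The resource-id formula above can be written out as a short Python sketch. The function and parameter names are hypothetical; `vmid_base` stands for the value of GDS_GWS_VMID.BASE for the current VMID. The clamping to the VMID-owned range (and the resulting NACK) is not modelled here.

```python
def gws_resource_id(vmid_base: int, m0: int, offset0: int) -> int:
    """Compute the final GWS resource id from the formula in the text."""
    resource_base = (m0 >> 16) & 0x3F   # M0[21:16]
    resource_offset = offset0 & 0x3F    # offset0[5:0]
    # Sum the per-VMID base, the M0 base and the instruction offset, modulo 64.
    return (vmid_base + resource_base + resource_offset) % 64
```

With `vmid_base = 8`, `M0[21:16] = 5` and `offset0 = 3` the result is 16; sums of 64 or more wrap around.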
Table 63. GWS Instructions
Opcode / Description

GWS_INIT (uint vsrc0, u8 offset0)
    Initialize GWS resource
    Initialize the global wave sync resource specified by the virtualized resource id OFFSET0[5:0] with a
    total wave count. This is most often intended to initialize a barrier resource for use by a later
    ds_gws_barrier to synchronize all waves associated with this resource, but is not type specific and
    can also be used to initialize a semaphore with an initial wave release count. The total wave count
    is provided by the lane of vsrc associated with the first active thread based on the current EXEC
    thread mask, interpreted as a 32-bit integer value.
    The resource id is also offset by the value of M0[21:16], allowing virtualization of global wave sync
    resource ids between draw contexts or based on other shader initialization state.
    This is primarily to be used via the GRBM.
    Operation:
        //Initialize GWS_RESOURCE for later gws commands:
        rid = (M0[21:16] + OFFSET0[5:0]) % 64
        GWS_RESOURCE[rid].counter = vsrc.lane[find_first(EXEC)].u
        GWS_RESOURCE[rid].flag = 0
        return //release calling wave immediately

GWS_SEMA_V (u8 offset0)
    Semaphore: Increment resource counter
    For the global wave sync resource specified by the virtualized resource id OFFSET0[5:0], releases
    one wave, immediately if already queued at this semaphore or once one arrives. Sets the resource
    to semaphore type.
    Operation:
        //Release waves queued by ds_gws_sema_p instructions:
        rid = (M0[21:16] + OFFSET0[5:0]) % 64
        GWS_RESOURCE[rid].counter++
        GWS_RESOURCE[rid].type = SEMAPHORE
        return //release calling wave immediately

GWS_SEMA_BR (uint vsrc0, u8 offset0)
    Semaphore Bulk Release
    For the global wave sync resource specified by the virtualized resource id OFFSET0[5:0], releases
    the number of waves specified as a 32-bit integer in the first active lane of vsrc, immediately if
    already queued at this semaphore or as they arrive. Sets the resource to semaphore type.
    Operation:
        //Release waves queued by ds_gws_sema_p instructions:
        rid = (M0[21:16] + OFFSET0[5:0]) % 64
        release_count = vsrc.lane[find_first(EXEC)].u
        GWS_RESOURCE[rid].counter += release_count
        GWS_RESOURCE[rid].type = SEMAPHORE
        return //release calling wave immediately
GWS_SEMA_P (u8 offset0)
    Semaphore acquire (wait)
    Queues this wave until the global wave sync resource specified by the virtualized resource id
    OFFSET0[5:0] indicates that it should be released, which may be immediately if another wave has
    already issued a ds_gws_sema_v or ds_gws_sema_br instruction to the resource. Sets the resource
    to semaphore type.
    Operation:
        //Queue this wave until released:
        rid = (M0[21:16] + OFFSET0[5:0]) % 64
        GWS_RESOURCE[rid].type = SEMAPHORE
        while (GWS_RESOURCE[rid].counter <= 0)
            WAIT_IN_QUEUE
        GWS_RESOURCE[rid].counter--
        return //release calling wave

GWS_SEMA_RELEASE_ALL (u8 offset0)
    Semaphore release all waves waiting at a semaphore
    Operation:
        //Release waves queued by ds_gws_sema_p instructions:
        rid = (M0[21:16] + OFFSET0[5:0]) % 64
        release_count = the number of waves currently enqueued at the semaphore
        GWS_RESOURCE[rid].counter += release_count
        GWS_RESOURCE[rid].type = SEMAPHORE
        return //release calling wave immediately
    This is typically used via the GRBM.

GWS_BARRIER (uint vsrc0, u8 offset0)
    Barrier wait
    Creates a global barrier for all waves associated with the global wave sync resource specified by a
    virtualized resource id OFFSET0[5:0], which causes all waves issuing a ds_gws_barrier on the same
    resource id to wait until a previously specified count of waves have also issued. Sets the resource to
    barrier type. This provides functionality similar to an s_barrier instruction for local waves, but
    allows synchronization of waves running on different compute units.
    The wave count for completion of the barrier is initially provided by a ds_gws_init instruction.
    Each subsequent ds_gws_barrier instruction may then provide the total wave count value for a
    following ds_gws_barrier instruction. The total wave count minus one is provided by the lane of
    vsrc associated with the first active thread based on the current EXEC thread mask, interpreted as a
    32-bit integer value.
    Operation:
        //On entry: GWS_RESOURCE[rid].counter previously initialized
        rid = (M0[21:16] + OFFSET0[5:0]) % 64
        count_next = vsrc.lane[find_first(EXEC)].u
        GWS_RESOURCE[rid].type = BARRIER
        GWS_RESOURCE[rid].counter--
        flag = GWS_RESOURCE[rid].flag
        if (GWS_RESOURCE[rid].counter <= 0) //last wave in group
            GWS_RESOURCE[rid].flag ^= 1 //release enqueued waves
            GWS_RESOURCE[rid].counter = count_next //init for next barrier
            return //release calling wave
        // Enqueue waves which enter until the last enters and releases them
        while (1)
            if (GWS_RESOURCE[rid].type == BARRIER && GWS_RESOURCE[rid].flag != flag)
                return //release calling wave
    The description of "flag" above is simplified. Every wave which enters is tagged with the
    current GWS_RESOURCE.flag value. When the barrier condition is met, all waves with that flag value are
    released, and GWS_RESOURCE.flag is inverted so any incoming waves are tagged with the opposite value
    of flag.
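The flag-tagging behaviour of the barrier can be illustrated with a sequential Python model. This is only a sketch that follows the Operation pseudocode literally; the class and method names are hypothetical, waves arrive one at a time, and the counter values are whatever the caller passes in (the model takes no position on "count" versus "count minus one").

```python
class GwsBarrierModel:
    """Toy model of one GWS resource used as a barrier (illustrative only)."""

    def __init__(self, init_count: int):
        self.counter = init_count  # value written by GWS_INIT
        self.flag = 0              # current pass flag
        self.queue = []            # (wave_id, tag) for waves still waiting

    def barrier(self, wave_id, count_next):
        """Process one arriving wave; return the list of waves released by it."""
        self.counter -= 1
        tag = self.flag            # the arriving wave is tagged with the current flag
        if self.counter <= 0:      # last wave in the group
            self.flag ^= 1         # incoming waves now get the opposite tag
            self.counter = count_next
            released = [w for w, t in self.queue if t == tag]
            self.queue = [(w, t) for w, t in self.queue if t != tag]
            return released + [wave_id]
        self.queue.append((wave_id, tag))
        return []
```

For instance, with `GwsBarrierModel(3)` the first two calls to `barrier` return an empty list, and the third call releases all three waves and inverts the flag.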

Chapter 14. Export: Position, Color/MRT
"Export" is the act of copying data from a VGPR to one of the export buffers (position, color or Z). Exports
use the EXEC mask and only output the enabled pixels or vertices. A shader may export to each target only
once. The last export from a pixel shader, or the last position export of a vertex shader, must indicate
"done": there will be no more pixel shader exports or vertex position exports. This allows the values to be
consumed by the Render back-end and Primitive Assembler respectively.
Exports can transfer 32-bit or 16-bit data per element. 16-bit exports occur in pairs: 32 bits are transferred
from one VGPR that holds two 16-bit values. The export instruction does not know or care about the difference
between the two - it just moves 32 bits of data per lane. 16-bit exports are a contract between the shader
program, which is responsible for converting and packing 16-bit data, and the receiving hardware, whose
configuration registers declare the exported data type. 16-bit data is packed into a VGPR, with the first
component in the lower 16 bits.
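The packing convention for 16-bit exports can be sketched in Python as follows. The helper name `pack_half2` is hypothetical, and the use of `struct`'s IEEE half-precision conversion (with its default rounding) is an assumption made for illustration; the rounding a real shader applies depends on the conversion instruction it chooses.

```python
import struct

def pack_half2(first: float, second: float) -> int:
    """Pack two values as 16-bit floats into one 32-bit word, first in the low half."""
    # Convert each value to its 16-bit IEEE half-precision bit pattern.
    lo, = struct.unpack("<H", struct.pack("<e", first))
    hi, = struct.unpack("<H", struct.pack("<e", second))
    # First component occupies bits [15:0], second component bits [31:16].
    return (hi << 16) | lo
```

For a color export this would mean packing {green, red} as `pack_half2(red, green)`, so that red sits in the low 16 bits.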
Instruction Fields

Field         Size     Description
Done          1        Indicates this is the last export from the shader. Used only for Pixel, Position and
                       Primitive data. Must be set for primitive export.
Target        6        Export Target:
                       0-7     MRT 0-7
                       8       Z
                       12-16   Position 0-4 (Pos4 is for stereo rendering)
                       20      NGG Primitive data (connectivity data)
                       21      Dual source blend Left
                       22      Dual source blend Right
EN            4        16-bit components: export half-DWORD enable. Valid values are: 0x0, 0x1, 0x3
                       [0] enables VSRC0 : R,G from one VGPR (R in low bits, G high)
                       [1] enables VSRC1 : B,A from one VGPR (B in low bits, A high)
                       32-bit components: [0-3] = enables for VSRC0-3.
VSRC0-VSRC3   8 each   VGPR to read data from.
                       Pos: vsrc0=X, 1=Y, 2=Z, 3=W
                       MRT: vsrc0=R, 1=G, 2=B, 3=A
ROW_EN        1        0 = normal mode; 1 = use M0 to provide the row number for mesh shader’s POS and PRIM
                       exports.
(M0)          8        Row number for mesh shader POS and PRIM exports

32-bit components
Enable         Source    Component
EN[0]          VSRC0     Red/X/…
EN[1]          VSRC1     Green/Y/…
EN[2]          VSRC2     Blue/Z/…
EN[3]          VSRC3     Alpha/W/…

16-bit components
Enable         Source    Component
EN[0]          VSRC0     {green, red} / {y, x}
EN[1]          VSRC1     {alpha, blue} / {w, z}
EN[2], EN[3]   ignored   unused

14.1. Pixel Shader Exports
Pixel Exports
Export instructions copy color data to the MRTs. Data has up to four components (R, G, B, A).
Optionally, export instructions also output depth (Z) data.
Every pixel shader must have at least one export instruction.
The last export instruction executed must have the DONE bit set to one.
The EXEC mask is applied to all exports. Only pixels with the corresponding EXEC bit set to 1 export data to
the output buffer.
Each export target must be exported to only once.
The shader program is responsible for conversion of data from 32b to 16b for 16-bit exports.
The shader program is responsible for alpha-test.
All data that can affect the sample mask must be sent on the first export from the shader. This means if depth
is being exported, it must be exported first. If alpha to mask is enabled, MRT0 must be exported first, unless
depth is also enabled, in which case, MRT0’s alpha value must be written to the depth export’s alpha value. If
alpha to mask and coverage to mask are both enabled, then the depth export’s alpha value will be set to the
minimum of the alpha to mask value (alpha of MRT0) and the coverage to mask value (alpha of what would
have been in the depth export). If the shader can kill a pixel, it must be determined before the first export.
Pixel Shader Dual-Source Blend
In this mode, alternating lanes (threads) hold MRT0 and MRT1, rather than all threads going to one MRT. Two
instructions are needed to complete a dual-source blend export. Exports to targets 21 and 22 must be issued
back-to-back, with no other export types in between them.
Export target 21 (MRT exported: 0)
    EXEC mask: exec_mask = (exec_mask & 0x5555_5555) | ((exec_mask << 1) & 0xAAAA_AAAA)
    Lane 0: Pix0, MRT0    Lane 1: Pix0, MRT1    Lane 2: Pix2, MRT0
Export target 22 (MRT exported: 1)
    EXEC mask: exec_mask = (exec_mask & 0xAAAA_AAAA) | ((exec_mask >> 1) & 0x5555_5555)
    Lane 0: Pix1, MRT0    Lane 1: Pix1, MRT1    Lane 2: Pix3, MRT0
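The two EXEC-mask expressions for dual-source blend exports can be evaluated with a short Python sketch. The function name is hypothetical, and the mask is treated as 32 bits wide because the constants in the expressions are 32-bit values.

```python
def dual_source_exec_masks(exec_mask: int):
    """Return the effective EXEC masks for export targets 21 and 22."""
    exec_mask &= 0xFFFFFFFF
    # Target 21: even lanes keep their bit; each odd lane takes the bit of the even lane below it.
    mask21 = (exec_mask & 0x55555555) | ((exec_mask << 1) & 0xAAAAAAAA)
    # Target 22: odd lanes keep their bit; each even lane takes the bit of the odd lane above it.
    mask22 = (exec_mask & 0xAAAAAAAA) | ((exec_mask >> 1) & 0x55555555)
    return mask21, mask22
```

With `exec_mask = 0b0110` (pixels 1 and 2 enabled) the function returns `(0b1100, 0b0011)`: target 21 exports both MRTs for pixel 2 from lanes 2 and 3, and target 22 exports both MRTs for pixel 1 from lanes 0 and 1.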

14.2. Primitive Shader Exports (From GS shader stage)
The GS shader uses export instructions to output vertex position data, and memory stores for vertex parameter
data. This data is passed on to subsequent pixel shaders.
Every vertex shader must output at least one position vector (x, y, z; w is optional) to the POS0 target. The last
position export must have the DONE bit set to 1. For optimized performance, it is recommended to output all
position data as early as possible in the vertex shader.

14.3. Dependency Checking
Export instructions are executed by the hardware in two phases. First, the instruction is selected to be
executed, and EXPcnt is incremented by 1. At this time, the wave has made a request to export data, but the
data has not been exported yet. Later, when the export actually occurs, the EXEC mask and VGPR data are read
and the data is exported, and finally EXPcnt is decremented.
Use S_WAITCNT on EXPcnt to prevent the shader program from overwriting EXEC or the VGPRs holding the
data to be exported before the export operation has completed.
Multiple export instructions can be outstanding at one time. Exports of the same type (for example: position)
are completed in order, but exports of different types can be completed out of order. If the STATUS register’s
SKIP_EXPORT bit is set to one, the hardware treats all EXPORT instructions as if they were NOPs.

Chapter 15. Microcode Formats
This section specifies the microcode formats. The definitions can be used to simplify compilation by providing
standard templates and enumeration names for the various instruction formats.
Endian Order - The RDNA3 architecture addresses memory and registers using little-endian byte-ordering and
bit-ordering. Multi-byte values are stored with their least-significant (low-order) byte at the lowest byte
address, and they are illustrated with their least-significant byte at the right side. Byte values are stored with
their least-significant (low-order) bit (LSB) at the lowest bit address, and they are illustrated with their LSB at
the right side.
SALU and VALU instructions may optionally include a 32-bit literal constant, and some VALU instructions may
include a 32-bit DPP control DWORD at the end of the instructions. No instruction may use both DPP and a
literal constant.
The table below summarizes the microcode formats and their widths, not including extra literal or DPP
instruction words. The sections that follow provide details.
Table 64. Summary of Microcode Formats
Microcode Formats                          Reference   Width (bits)

Scalar ALU and Control Formats
    SOP2                                   SOP2        32
    SOP1                                   SOP1        32
    SOPK                                   SOPK        32
    SOPP                                   SOPP        32
    SOPC                                   SOPC        32
Scalar Memory Format
    SMEM                                   SMEM        64
Vector ALU Format
    VOP1                                   VOP1        32
    VOP2                                   VOP2        32
    VOPC                                   VOPC        32
    VOP3                                   VOP3        64
    VOP3SD                                 VOP3SD      64
    VOP3P                                  VOP3P       64
    VOPD                                   VOPD        64
    DPP16                                  DPP16       32
    DPP8                                   DPP8        32
Vector Parameter Interpolation Format
    VINTERP                                VINTERP     64
LDS Parameter Load and Direct Load
    LDSDIR                                 LDSDIR      32
LDS/GDS Format
    DS                                     DS          64
Vector Memory Buffer Formats
    MTBUF                                  MTBUF       64
    MUBUF                                  MUBUF       64
Vector Memory Image Format
    MIMG                                   MIMG        64 or 96
Export Format
    EXP                                    EXP         64
Flat Formats
    FLAT                                   FLAT        64
    GLOBAL                                 GLOBAL      64
    SCRATCH                                SCRATCH     64

Any instruction field marked as "Reserved" must be set to zero.

Instruction Suffixes
Most instructions include a suffix that indicates the data type the instruction handles. This suffix may also
include a number that indicates the size of the data.
For example: "F32" indicates "32-bit floating point data", or "B16" is "16-bit binary data".
• B = binary
• F = floating point
• BF = "brain-float" floating point
• U = unsigned integer
• S = signed integer
When more than one data-type specifier occurs in an instruction, the first one is the result type and size, and
the later one(s) is/are input data type and size.
E.g. V_CVT_F32_I32 reads an integer and writes a float.
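The suffix convention can be illustrated with a small Python sketch that splits a mnemonic into its result type and input types. The helper names and the regular expression are hypothetical. The pattern also accepts an `I` prefix, as an assumption, because the example above uses `I32` even though the list names `S` for signed integers.

```python
import re

# A data-type token: a type letter (or BF) followed by a bit width, e.g. F32, B16.
_TYPE_TOKEN = re.compile(r"^(BF|B|F|U|S|I)(\d+)$")

def result_and_inputs(mnemonic: str):
    """Return (result_type, [input_types]) taken from the mnemonic's type suffixes."""
    types = [tok for tok in mnemonic.split("_") if _TYPE_TOKEN.match(tok)]
    if not types:
        return None, []
    # The first specifier is the result type; any later ones are input types.
    return types[0], types[1:]
```

For example, `result_and_inputs("V_CVT_F32_I32")` gives `("F32", ["I32"])`, matching the reading "reads an integer and writes a float".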

15.1. Scalar ALU and Control Formats
15.1.1. SOP2

Description

This is a scalar instruction with two inputs and one output. Can be followed by a 32-bit
literal constant.
Table 65. SOP2 Fields

Field Name   Bits      Format or Description
SSRC0        [7:0]     Source 0. First operand for the instruction.
                       0-105       SGPR0 - SGPR105: Scalar general-purpose registers.
                       106         VCC_LO: VCC[31:0].
                       107         VCC_HI: VCC[63:32].
                       108-123     TTMP0 - TTMP15: Trap handler temporary register.
                       124         NULL
                       125         M0. Misc register 0.
                       126         EXEC_LO: EXEC[31:0].
                       127         EXEC_HI: EXEC[63:32].
                       128         0.
                       129-192     Signed integer 1 to 64.
                       193-208     Signed integer -1 to -16.
                       209-234     Reserved.
                       235         SHARED_BASE (Memory Aperture definition).
                       236         SHARED_LIMIT (Memory Aperture definition).
                       237         PRIVATE_BASE (Memory Aperture definition).
                       238         PRIVATE_LIMIT (Memory Aperture definition).
                       239         Reserved.
                       240         0.5.
                       241         -0.5.
                       242         1.0.
                       243         -1.0.
                       244         2.0.
                       245         -2.0.
                       246         4.0.
                       247         -4.0.
                       248         1/(2*PI).
                       249 - 252   Reserved.
                       253         SCC.
                       254         Reserved.
                       255         Literal constant.
SSRC1        [15:8]    Second scalar source operand.
                       Same codes as SSRC0, above.
SDST         [22:16]   Scalar destination.
                       Same codes as SSRC0, above except only codes 0-127 are valid.
OP           [29:23]   See Opcode table below.
ENCODING     [31:30]   'b10
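As an illustration of the SOP2 layout, the following Python sketch splits a 32-bit word into the fields listed in the table and names a scalar operand code. The function names are hypothetical. The check on bits [31:30] is deliberately minimal: the SOPK, SOP1, SOPC and SOPP encodings also start with 'b10, so a real decoder would need to rule those out first.

```python
def decode_sop2(word: int) -> dict:
    """Split a 32-bit word into SOP2 fields using the bit ranges in the fields table."""
    word &= 0xFFFFFFFF
    if (word >> 30) != 0b10:
        raise ValueError("bits [31:30] are not 'b10")
    return {
        "ssrc0": word & 0xFF,          # [7:0]
        "ssrc1": (word >> 8) & 0xFF,   # [15:8]
        "sdst": (word >> 16) & 0x7F,   # [22:16]
        "op": (word >> 23) & 0x7F,     # [29:23]
    }

# Operand codes with a single fixed meaning, taken from the SSRC0 rows of the table.
_FIXED = {
    106: "VCC_LO", 107: "VCC_HI", 124: "NULL", 125: "M0",
    126: "EXEC_LO", 127: "EXEC_HI", 128: "0",
    235: "SHARED_BASE", 236: "SHARED_LIMIT",
    237: "PRIVATE_BASE", 238: "PRIVATE_LIMIT",
    240: "0.5", 241: "-0.5", 242: "1.0", 243: "-1.0",
    244: "2.0", 245: "-2.0", 246: "4.0", 247: "-4.0",
    248: "1/(2*PI)", 253: "SCC", 255: "literal",
}

def scalar_operand_name(code: int) -> str:
    """Return a readable name for an 8-bit scalar source operand code."""
    if 0 <= code <= 105:
        return f"SGPR{code}"
    if 108 <= code <= 123:
        return f"TTMP{code - 108}"
    if 129 <= code <= 192:
        return str(code - 128)         # inline integers 1 to 64
    if 193 <= code <= 208:
        return str(-(code - 192))      # inline integers -1 to -16
    return _FIXED.get(code, "reserved")
```

For example, a word assembled from OP=0, SDST=2, SSRC1=1, SSRC0=0 is `0x80020100`, and `decode_sop2(0x80020100)` returns those four field values.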

Table 66. SOP2 Opcodes

Opcode #  Name
  0  S_ADD_U32
  1  S_SUB_U32
  2  S_ADD_I32
  3  S_SUB_I32
  4  S_ADDC_U32
  5  S_SUBB_U32
  6  S_ABSDIFF_I32
  8  S_LSHL_B32
  9  S_LSHL_B64
 10  S_LSHR_B32
 11  S_LSHR_B64
 12  S_ASHR_I32
 13  S_ASHR_I64
 14  S_LSHL1_ADD_U32
 15  S_LSHL2_ADD_U32
 16  S_LSHL3_ADD_U32
 17  S_LSHL4_ADD_U32
 18  S_MIN_I32
 19  S_MIN_U32
 20  S_MAX_I32
 21  S_MAX_U32
 22  S_AND_B32
 23  S_AND_B64
 24  S_OR_B32
 25  S_OR_B64
 26  S_XOR_B32
 27  S_XOR_B64
 28  S_NAND_B32
 29  S_NAND_B64
 30  S_NOR_B32
 31  S_NOR_B64
 32  S_XNOR_B32
 33  S_XNOR_B64
 34  S_AND_NOT1_B32
 35  S_AND_NOT1_B64
 36  S_OR_NOT1_B32
 37  S_OR_NOT1_B64
 38  S_BFE_U32
 39  S_BFE_I32
 40  S_BFE_U64
 41  S_BFE_I64
 42  S_BFM_B32
 43  S_BFM_B64
 44  S_MUL_I32
 45  S_MUL_HI_U32
 46  S_MUL_HI_I32
 48  S_CSELECT_B32
 49  S_CSELECT_B64
 50  S_PACK_LL_B32_B16
 51  S_PACK_LH_B32_B16
 52  S_PACK_HH_B32_B16
 53  S_PACK_HL_B32_B16


15.1.2. SOPK

Description

This is a scalar instruction with one 16-bit signed immediate (SIMM16) input and a single
destination. Instructions that take 2 inputs use the destination as the first input and the
SIMM16 as the second input.
E.g. "S_CMPK_GT_I32 S0, 1" means "SCC = (s0 > 1)"
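The S_CMPK_GT_I32 example can be modelled in a few lines of Python. This is a sketch under the assumption that SIMM16 is sign-extended before a signed compare, as the "signed immediate" wording suggests; the helper names are hypothetical.

```python
def sext16(value: int) -> int:
    """Sign-extend a 16-bit immediate to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def to_i32(value: int) -> int:
    """Interpret a 32-bit register value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def s_cmpk_gt_i32(sdst_value: int, simm16: int) -> int:
    """Model of 'S_CMPK_GT_I32 sdst, simm16': returns the resulting SCC (0 or 1)."""
    # The destination register acts as the first input, SIMM16 as the second.
    return int(to_i32(sdst_value) > sext16(simm16))
```

With `s0 = 5`, `s_cmpk_gt_i32(5, 1)` returns 1, which corresponds to "SCC = (s0 > 1)".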
Table 67. SOPK Fields

Field Name   Bits      Format or Description
SIMM16       [15:0]    Signed immediate 16-bit value.
SDST         [22:16]   Scalar destination, and can provide second source operand.
                       0-105       SGPR0 - SGPR105: Scalar general-purpose registers.
                       106         VCC_LO: VCC[31:0].
                       107         VCC_HI: VCC[63:32].
                       108-123     TTMP0 - TTMP15: Trap handler temporary register.
                       124         NULL
                       125         M0. Misc register 0.
                       126         EXEC_LO: EXEC[31:0].
                       127         EXEC_HI: EXEC[63:32].
OP           [27:23]   See Opcode table below.
ENCODING     [31:28]   'b1011

Table 68. SOPK Opcodes
Opcode #  Name
  0  S_MOVK_I32
  1  S_VERSION
  2  S_CMOVK_I32
  3  S_CMPK_EQ_I32
  4  S_CMPK_LG_I32
  5  S_CMPK_GT_I32
  6  S_CMPK_GE_I32
  7  S_CMPK_LT_I32
  8  S_CMPK_LE_I32
  9  S_CMPK_EQ_U32
 10  S_CMPK_LG_U32
 11  S_CMPK_GT_U32
 12  S_CMPK_GE_U32
 13  S_CMPK_LT_U32
 14  S_CMPK_LE_U32
 15  S_ADDK_I32
 16  S_MULK_I32
 17  S_GETREG_B32
 18  S_SETREG_B32
 19  S_SETREG_IMM32_B32
 20  S_CALL_B64
 24  S_WAITCNT_VSCNT
 25  S_WAITCNT_VMCNT
 26  S_WAITCNT_EXPCNT
 27  S_WAITCNT_LGKMCNT

15.1.3. SOP1

Description

This is a scalar instruction with one input and one output. Can be followed by a 32-bit
literal constant.
Table 69. SOP1 Fields

Field Name   Bits      Format or Description
SSRC0        [7:0]     Source 0. First operand for the instruction.
                       0-105       SGPR0 - SGPR105: Scalar general-purpose registers.
                       106         VCC_LO: VCC[31:0].
                       107         VCC_HI: VCC[63:32].
                       108-123     TTMP0 - TTMP15: Trap handler temporary register.
                       124         NULL
                       125         M0. Misc register 0.
                       126         EXEC_LO: EXEC[31:0].
                       127         EXEC_HI: EXEC[63:32].
                       128         0.
                       129-192     Signed integer 1 to 64.
                       193-208     Signed integer -1 to -16.
                       209-234     Reserved.
                       235         SHARED_BASE (Memory Aperture definition).
                       236         SHARED_LIMIT (Memory Aperture definition).
                       237         PRIVATE_BASE (Memory Aperture definition).
                       238         PRIVATE_LIMIT (Memory Aperture definition).
                       239         Reserved.
                       240         0.5.
                       241         -0.5.
                       242         1.0.
                       243         -1.0.
                       244         2.0.
                       245         -2.0.
                       246         4.0.
                       247         -4.0.
                       248         1/(2*PI).
                       249 - 252   Reserved.
                       253         SCC.
                       254         Reserved.
                       255         Literal constant.
OP           [15:8]    See Opcode table below.
SDST         [22:16]   Scalar destination.
                       Same codes as SSRC0, above except only codes 0-127 are valid.
ENCODING     [31:23]   'b10_1111101

Table 70. SOP1 Opcodes
Opcode #  Name
  0  S_MOV_B32
  1  S_MOV_B64
  2  S_CMOV_B32
  3  S_CMOV_B64
  4  S_BREV_B32
  5  S_BREV_B64
  8  S_CTZ_I32_B32
  9  S_CTZ_I32_B64
 10  S_CLZ_I32_U32
 11  S_CLZ_I32_U64
 12  S_CLS_I32
 13  S_CLS_I32_I64
 14  S_SEXT_I32_I8
 15  S_SEXT_I32_I16
 16  S_BITSET0_B32
 17  S_BITSET0_B64
 18  S_BITSET1_B32
 19  S_BITSET1_B64
 20  S_BITREPLICATE_B64_B32
 21  S_ABS_I32
 22  S_BCNT0_I32_B32
 23  S_BCNT0_I32_B64
 24  S_BCNT1_I32_B32
 25  S_BCNT1_I32_B64
 26  S_QUADMASK_B32
 27  S_QUADMASK_B64
 28  S_WQM_B32
 29  S_WQM_B64
 30  S_NOT_B32
 31  S_NOT_B64
 32  S_AND_SAVEEXEC_B32
 33  S_AND_SAVEEXEC_B64
 34  S_OR_SAVEEXEC_B32
 35  S_OR_SAVEEXEC_B64
 36  S_XOR_SAVEEXEC_B32
 37  S_XOR_SAVEEXEC_B64
 38  S_NAND_SAVEEXEC_B32
 39  S_NAND_SAVEEXEC_B64
 40  S_NOR_SAVEEXEC_B32
 41  S_NOR_SAVEEXEC_B64
 42  S_XNOR_SAVEEXEC_B32
 43  S_XNOR_SAVEEXEC_B64
 44  S_AND_NOT0_SAVEEXEC_B32
 45  S_AND_NOT0_SAVEEXEC_B64
 46  S_OR_NOT0_SAVEEXEC_B32
 47  S_OR_NOT0_SAVEEXEC_B64
 48  S_AND_NOT1_SAVEEXEC_B32
 49  S_AND_NOT1_SAVEEXEC_B64
 50  S_OR_NOT1_SAVEEXEC_B32
 51  S_OR_NOT1_SAVEEXEC_B64
 52  S_AND_NOT0_WREXEC_B32
 53  S_AND_NOT0_WREXEC_B64
 54  S_AND_NOT1_WREXEC_B32
 55  S_AND_NOT1_WREXEC_B64
 64  S_MOVRELS_B32
 65  S_MOVRELS_B64
 66  S_MOVRELD_B32
 67  S_MOVRELD_B64
 68  S_MOVRELSD_2_B32
 71  S_GETPC_B64
 72  S_SETPC_B64
 73  S_SWAPPC_B64
 74  S_RFE_B64
 76  S_SENDMSG_RTN_B32
 77  S_SENDMSG_RTN_B64

15.1.4. SOPC

Description

This is a scalar instruction with two inputs that are compared and produces SCC as a
result. Can be followed by a 32-bit literal constant.
Table 71. SOPC Fields

Field Name   Bits      Format or Description
SSRC0        [7:0]     Source 0. First operand for the instruction.
                       0-105       SGPR0 - SGPR105: Scalar general-purpose registers.
                       106         VCC_LO: VCC[31:0].
                       107         VCC_HI: VCC[63:32].
                       108-123     TTMP0 - TTMP15: Trap handler temporary register.
                       124         NULL
                       125         M0. Misc register 0.
                       126         EXEC_LO: EXEC[31:0].
                       127         EXEC_HI: EXEC[63:32].
                       128         0.
                       129-192     Signed integer 1 to 64.
                       193-208     Signed integer -1 to -16.
                       209-234     Reserved.
                       235         SHARED_BASE (Memory Aperture definition).
                       236         SHARED_LIMIT (Memory Aperture definition).
                       237         PRIVATE_BASE (Memory Aperture definition).
                       238         PRIVATE_LIMIT (Memory Aperture definition).
                       239         Reserved.
                       240         0.5.
                       241         -0.5.
                       242         1.0.
                       243         -1.0.
                       244         2.0.
                       245         -2.0.
                       246         4.0.
                       247         -4.0.
                       248         1/(2*PI).
                       249 - 252   Reserved.
                       253         SCC.
                       254         Reserved.
                       255         Literal constant.
SSRC1        [15:8]    Second scalar source operand.
                       Same codes as SSRC0, above.
OP           [22:16]   See Opcode table below.
ENCODING     [31:23]   'b10_1111110

Table 72. SOPC Opcodes
Opcode #  Name
  0  S_CMP_EQ_I32
  1  S_CMP_LG_I32
  2  S_CMP_GT_I32
  3  S_CMP_GE_I32
  4  S_CMP_LT_I32
  5  S_CMP_LE_I32
  6  S_CMP_EQ_U32
  7  S_CMP_LG_U32
  8  S_CMP_GT_U32
  9  S_CMP_GE_U32
 10  S_CMP_LT_U32
 11  S_CMP_LE_U32
 12  S_BITCMP0_B32
 13  S_BITCMP1_B32
 14  S_BITCMP0_B64
 15  S_BITCMP1_B64
 16  S_CMP_EQ_U64
 17  S_CMP_LG_U64

15.1.5. SOPP

Description

This is a scalar instruction with one 16-bit signed immediate (SIMM16) input.
Table 73. SOPP Fields

Field Name   Bits      Format or Description
SIMM16       [15:0]    Signed immediate 16-bit value.
OP           [22:16]   See Opcode table below.
ENCODING     [31:23]   'b10_1111111

Table 74. SOPP Opcodes
Opcode #  Name
  0  S_NOP
  1  S_SETKILL
  2  S_SETHALT
  3  S_SLEEP
  4  S_SET_INST_PREFETCH_DISTANCE
  5  S_CLAUSE
  7  S_DELAY_ALU
  8  Reserved
  9  S_WAITCNT
 10  S_WAIT_IDLE
 11  S_WAIT_EVENT
 16  S_TRAP
 17  S_ROUND_MODE
 18  S_DENORM_MODE
 31  S_CODE_END
 32  S_BRANCH
 33  S_CBRANCH_SCC0
 34  S_CBRANCH_SCC1
 35  S_CBRANCH_VCCZ
 36  S_CBRANCH_VCCNZ
 37  S_CBRANCH_EXECZ
 38  S_CBRANCH_EXECNZ
 39  S_CBRANCH_CDBGSYS
 40  S_CBRANCH_CDBGUSER
 41  S_CBRANCH_CDBGSYS_OR_USER
 42  S_CBRANCH_CDBGSYS_AND_USER
 48  S_ENDPGM
 49  S_ENDPGM_SAVED
 50  S_ENDPGM_ORDERED_PS_DONE
 52  S_WAKEUP
 53  S_SETPRIO
 54  S_SENDMSG
 55  S_SENDMSGHALT
 56  S_INCPERFLEVEL
 57  S_DECPERFLEVEL
 60  S_ICACHE_INV
 61  S_BARRIER
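The ENCODING fields of the five scalar ALU and control formats overlap in their leading bits, so the order in which they are tested matters. The following Python sketch shows one way to tell them apart using only the ENCODING values given in the field tables of this section; the function name is hypothetical and other instruction families are simply reported as `None`.

```python
def classify_scalar_format(word: int):
    """Return 'SOP1', 'SOPC', 'SOPP', 'SOPK', 'SOP2', or None for a 32-bit word."""
    word &= 0xFFFFFFFF
    top9 = (word >> 23) & 0x1FF
    # Check the longest (9-bit) encodings first: they all begin with 'b10 and 'b1011.
    if top9 == 0b101111101:
        return "SOP1"
    if top9 == 0b101111110:
        return "SOPC"
    if top9 == 0b101111111:
        return "SOPP"
    # SOPK uses a 4-bit encoding in [31:28].
    if (word >> 28) == 0b1011:
        return "SOPK"
    # Anything else that starts with 'b10 is SOP2.
    if (word >> 30) == 0b10:
        return "SOP2"
    return None
```

Testing SOP2 first would misclassify every other scalar format, since all five encodings begin with 'b10.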

15.2. Scalar Memory Format
15.2.1. SMEM

Description

Scalar Memory data load
Table 75. SMEM Fields

Field Name   Bits      Format or Description
SBASE        [5:0]     SGPR-pair that provides base address or SGPR-quad that provides V#. (LSB of SGPR
                       address is omitted).
SDATA        [12:6]    SGPR that provides write data or accepts return data.
DLC          [14]      Device level coherent.
GLC          [16]      Globally memory Coherent. Force bypass of L1 cache, or for atomics, cause pre-op
                       value to be returned.
OP           [25:18]   See Opcode table below.
ENCODING     [31:26]   'b111101
OFFSET       [52:32]   An immediate signed byte offset. Ignored for cache invalidations.
SOFFSET      [63:57]   SGPR that supplies an unsigned byte offset. Disabled if set to NULL.
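As an illustration of a 64-bit format, the following Python sketch extracts the SMEM fields at the bit positions listed in the table above and sign-extends the 21-bit OFFSET. The function name and the returned dictionary keys are hypothetical; the instruction is passed as a single 64-bit integer whose low 32 bits are the first DWORD.

```python
def decode_smem(insn: int) -> dict:
    """Split a 64-bit SMEM instruction into fields, using the table's bit ranges."""
    insn &= 0xFFFFFFFFFFFFFFFF
    if (insn >> 26) & 0x3F != 0b111101:
        raise ValueError("bits [31:26] are not 'b111101")
    offset = (insn >> 32) & 0x1FFFFF        # [52:32], 21 bits
    if offset & 0x100000:                   # sign bit of the 21-bit field
        offset -= 0x200000
    sbase = insn & 0x3F                     # [5:0]
    return {
        "sbase": sbase,
        "sbase_first_sgpr": sbase << 1,     # the LSB of the SGPR address is omitted
        "sdata": (insn >> 6) & 0x7F,        # [12:6]
        "dlc": (insn >> 14) & 0x1,          # [14]
        "glc": (insn >> 16) & 0x1,          # [16]
        "op": (insn >> 18) & 0xFF,          # [25:18]
        "offset": offset,                   # signed byte offset
        "soffset": (insn >> 57) & 0x7F,     # [63:57]
    }
```

An SOFFSET value of 124 corresponds to NULL in the scalar operand tables of this chapter, which disables the SGPR offset.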

Table 76. SMEM Opcodes
Opcode #  Name
  0  S_LOAD_B32
  1  S_LOAD_B64
  2  S_LOAD_B128
  3  S_LOAD_B256
  4  S_LOAD_B512
  8  S_BUFFER_LOAD_B32
  9  S_BUFFER_LOAD_B64
 10  S_BUFFER_LOAD_B128
 11  S_BUFFER_LOAD_B256
 12  S_BUFFER_LOAD_B512
 32  S_GL1_INV
 33  S_DCACHE_INV

15.3. Vector ALU Formats
15.3.1. VOP2

Description

Vector ALU format with two input operands. Can be followed by a 32-bit literal constant
or DPP instruction DWORD when the instruction allows it.
Table 77. VOP2 Fields

Field Name   Bits      Format or Description
SRC0         [8:0]     Source 0. First operand for the instruction.
                       0-105       SGPR0 - SGPR105: Scalar general-purpose registers.
                       106         VCC_LO: VCC[31:0].
                       107         VCC_HI: VCC[63:32].
                       108-123     TTMP0 - TTMP15: Trap handler temporary register.
                       124         NULL
                       125         M0. Misc register 0.
                       126         EXEC_LO: EXEC[31:0].
                       127         EXEC_HI: EXEC[63:32].
                       128         0.
                       129-192     Signed integer 1 to 64.
                       193-208     Signed integer -1 to -16.
                       209-232     Reserved.
                       233         DPP8
                       234         DPP8FI
                       235         SHARED_BASE (Memory Aperture definition).
                       236         SHARED_LIMIT (Memory Aperture definition).
                       237         PRIVATE_BASE (Memory Aperture definition).
                       238         PRIVATE_LIMIT (Memory Aperture definition).
                       239         Reserved.
                       240         0.5.
                       241         -0.5.
                       242         1.0.
                       243         -1.0.
                       244         2.0.
                       245         -2.0.
                       246         4.0.
                       247         -4.0.
                       248         1/(2*PI).
                       250         DPP16
                       253         SCC.
                       254         Reserved.
                       255         Literal constant.
                       256 - 511   VGPR 0 - 255
VSRC1        [16:9]    VGPR that provides the second operand.
VDST         [24:17]   Destination VGPR.
OP           [30:25]   See Opcode table below.
ENCODING     [31]      'b0

Table 78. VOP2 Opcodes

Opcode #  Name
  1  V_CNDMASK_B32
  2  V_DOT2ACC_F32_F16
  3  V_ADD_F32
  4  V_SUB_F32
  5  V_SUBREV_F32
  6  V_FMAC_DX9_ZERO_F32
  7  V_MUL_DX9_ZERO_F32
  8  V_MUL_F32
  9  V_MUL_I32_I24
 10  V_MUL_HI_I32_I24
 11  V_MUL_U32_U24
 12  V_MUL_HI_U32_U24
 15  V_MIN_F32
 16  V_MAX_F32
 17  V_MIN_I32
 18  V_MAX_I32
 19  V_MIN_U32
 20  V_MAX_U32
 24  V_LSHLREV_B32
 25  V_LSHRREV_B32
 26  V_ASHRREV_I32
 27  V_AND_B32
 28  V_OR_B32
 29  V_XOR_B32
 30  V_XNOR_B32
 32  V_ADD_CO_CI_U32
 33  V_SUB_CO_CI_U32
 34  V_SUBREV_CO_CI_U32
 37  V_ADD_NC_U32
 38  V_SUB_NC_U32
 39  V_SUBREV_NC_U32
 43  V_FMAC_F32
 44  V_FMAMK_F32
 45  V_FMAAK_F32
 47  V_CVT_PK_RTZ_F16_F32
 50  V_ADD_F16
 51  V_SUB_F16
 52  V_SUBREV_F16
 53  V_MUL_F16
 54  V_FMAC_F16
 55  V_FMAMK_F16
 56  V_FMAAK_F16
 57  V_MAX_F16
 58  V_MIN_F16
 59  V_LDEXP_F16
 60  V_PK_FMAC_F16


15.3.2. VOP1

Description

Vector ALU format with one input operand. Can be followed by a 32-bit literal constant or
DPP instruction DWORD when the instruction allows it.
Table 79. VOP1 Fields

15.3. Vector ALU Formats

153 of 597

"RDNA3" Instruction Set Architecture

Field Name

Bits

Format or Description

SRC0

[8:0]

Source 0. First operand for the instruction.
0-105     SGPR0 - SGPR105: Scalar general-purpose registers.
106       VCC_LO: VCC[31:0].
107       VCC_HI: VCC[63:32].
108-123   TTMP0 - TTMP15: Trap handler temporary register.
124       NULL
125       M0. Misc register 0.
126       EXEC_LO: EXEC[31:0].
127       EXEC_HI: EXEC[63:32].
128       0.
129-192   Signed integer 1 to 64.
193-208   Signed integer -1 to -16.
209-232   Reserved.
233       DPP8
234       DPP8FI
235       SHARED_BASE (Memory Aperture definition).
236       SHARED_LIMIT (Memory Aperture definition).
237       PRIVATE_BASE (Memory Aperture definition).
238       PRIVATE_LIMIT (Memory Aperture definition).
239       Reserved.
240       0.5.
241       -0.5.
242       1.0.
243       -1.0.
244       2.0.
245       -2.0.
246       4.0.
247       -4.0.
248       1/(2*PI).
250       DPP16
253       SCC.
254       Reserved.
255       Literal constant.
256-511   VGPR 0 - 255

OP

[16:9]

See Opcode table below.

VDST

[24:17]

Destination VGPR.

ENCODING

[31:25]

'b0_111111
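
The VOP1 layout can be exercised with a small encoder and decoder. This is a minimal sketch based only on the bit ranges in Table 79; the function names are hypothetical and no operand legality checks are attempted.

```python
VOP1_ENCODING = 0b0111111  # ENCODING [31:25]


def encode_vop1(op, vdst, src0):
    """Pack VOP1 fields into one 32-bit word (hypothetical helper)."""
    assert 0 <= op < 256 and 0 <= vdst < 256 and 0 <= src0 < 512
    return (VOP1_ENCODING << 25) | (vdst << 17) | (op << 9) | src0


def decode_vop1(word):
    """Return the VOP1 fields, or None if the word is not VOP1-encoded."""
    if (word >> 25) & 0x7F != VOP1_ENCODING:
        return None
    return {
        "src0": word & 0x1FF,        # SRC0 [8:0]
        "op": (word >> 9) & 0xFF,    # OP   [16:9]
        "vdst": (word >> 17) & 0xFF, # VDST [24:17]
    }


# V_MOV_B32 (opcode 1) with VDST = v0 and SRC0 = v1 (operand value 256 + 1)
word = encode_vop1(op=1, vdst=0, src0=257)
print(hex(word), decode_vop1(word))
```
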

Table 80. VOP1 Opcodes
Opcode # Name

Opcode # Name

0

V_NOP

54

V_COS_F32

1

V_MOV_B32

55

V_NOT_B32

2

V_READFIRSTLANE_B32

56

V_BFREV_B32

3

V_CVT_I32_F64

57

V_CLZ_I32_U32

4

V_CVT_F64_I32

58

V_CTZ_I32_B32

5

V_CVT_F32_I32

59

V_CLS_I32

6

V_CVT_F32_U32

60

V_FREXP_EXP_I32_F64

7

V_CVT_U32_F32

61

V_FREXP_MANT_F64

8

V_CVT_I32_F32

62

V_FRACT_F64

10

V_CVT_F16_F32

63

V_FREXP_EXP_I32_F32

11

V_CVT_F32_F16

64

V_FREXP_MANT_F32

12

V_CVT_NEAREST_I32_F32

66

V_MOVRELD_B32

13

V_CVT_FLOOR_I32_F32

67

V_MOVRELS_B32

14

V_CVT_OFF_F32_I4

68

V_MOVRELSD_B32

15

V_CVT_F32_F64

72

V_MOVRELSD_2_B32

16

V_CVT_F64_F32

80

V_CVT_F16_U16

17

V_CVT_F32_UBYTE0

81

V_CVT_F16_I16

18

V_CVT_F32_UBYTE1

82

V_CVT_U16_F16

19

V_CVT_F32_UBYTE2

83

V_CVT_I16_F16

20

V_CVT_F32_UBYTE3

84

V_RCP_F16

21

V_CVT_U32_F64

85

V_SQRT_F16

22

V_CVT_F64_U32

86

V_RSQ_F16

23

V_TRUNC_F64

87

V_LOG_F16

24

V_CEIL_F64

88

V_EXP_F16

25

V_RNDNE_F64

89

V_FREXP_MANT_F16

26

V_FLOOR_F64

90

V_FREXP_EXP_I16_F16

27

V_PIPEFLUSH

91

V_FLOOR_F16

28

V_MOV_B16

92

V_CEIL_F16

32

V_FRACT_F32

93

V_TRUNC_F16

33

V_TRUNC_F32

94

V_RNDNE_F16

34

V_CEIL_F32

95

V_FRACT_F16

35

V_RNDNE_F32

96

V_SIN_F16

36

V_FLOOR_F32

97

V_COS_F16

37

V_EXP_F32

98

V_SAT_PK_U8_I16

39

V_LOG_F32

99

V_CVT_NORM_I16_F16

42

V_RCP_F32

100

V_CVT_NORM_U16_F16

43

V_RCP_IFLAG_F32

101

V_SWAP_B32

46

V_RSQ_F32

102

V_SWAP_B16

47

V_RCP_F64

103

V_PERMLANE64_B32

49

V_RSQ_F64

104

V_SWAPREL_B32

51

V_SQRT_F32

105

V_NOT_B16

52

V_SQRT_F64

106

V_CVT_I32_I16

53

V_SIN_F32

107

V_CVT_U32_U16

15.3.3. VOPC

Description

Vector instruction taking two inputs and producing a comparison result. Can be followed
by a 32-bit literal constant or DPP control DWORD. Vector Comparison operations are
divided into three groups:
• those that can use any one of 16 comparison operations,
• those that can use any one of 8, and
• those that have a single comparison operation.

The final opcode number is the base for the opcode family plus the offset of the compare operation.
Compare instructions write a result to VCC (for VOPC) or an SGPR (for VOP3). Additionally,
compare instructions have variants that write to the EXEC mask instead of VCC or an SGPR. The destination of
the compare result is VCC or EXEC when encoded using the VOPC format, and can be an arbitrary SGPR
(indicated in the VDST field) only when encoded in the VOP3 format.
Comparison Operations
Table 81. Comparison Operations
Compare Operation

Opcode
Offset

Description

Sixteen Compare Operations (COMPF)
F

0

D.u = 0

LT

1

D.u = (S0 < S1)

EQ

2

D.u = (S0 == S1)

LE

3

D.u = (S0 <= S1)

GT

4

D.u = (S0 > S1)

LG

5

D.u = (S0 <> S1)

GE

6

D.u = (S0 >= S1)

O

7

D.u = (!isNaN(S0) && !isNaN(S1))

U

8

D.u = (isNaN(S0) || isNaN(S1))

NGE

9

D.u = !(S0 >= S1)

NLG

10

D.u = !(S0 <> S1)

NGT

11

D.u = !(S0 > S1)

NLE

12

D.u = !(S0 <= S1)

NEQ

13

D.u = !(S0 == S1)

NLT

14

D.u = !(S0 < S1)

TRU

15

D.u = 1

Eight Compare Operations (COMPI)
F

0

D.u = 0

LT

1

D.u = (S0 < S1)

EQ

2

D.u = (S0 == S1)

LE

3

D.u = (S0 <= S1)

GT

4

D.u = (S0 > S1)

LG

5

D.u = (S0 <> S1)

GE

6

D.u = (S0 >= S1)

TRU

7

D.u = 1
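
The sixteen floating-point compare operations differ mainly in how they treat NaN inputs. The sketch below models them with Python floats; the function name compf is hypothetical, "<>" is read as "less than or greater than" (false when either input is NaN), and U (unordered) is taken to be the complement of O, which is an assumption about the intended meaning.

```python
import math


def compf(op, s0, s1):
    """Evaluate one of the 16 COMPF operations on two floats (sketch)."""
    unordered = math.isnan(s0) or math.isnan(s1)
    lg = s0 < s1 or s0 > s1          # "<>": false if either input is NaN
    results = {
        "F": False,
        "LT": s0 < s1,
        "EQ": s0 == s1,
        "LE": s0 <= s1,
        "GT": s0 > s1,
        "LG": lg,
        "GE": s0 >= s1,
        "O": not unordered,
        "U": unordered,              # assumed complement of O
        "NGE": not (s0 >= s1),
        "NLG": not lg,
        "NGT": not (s0 > s1),
        "NLE": not (s0 <= s1),
        "NEQ": not (s0 == s1),
        "NLT": not (s0 < s1),
        "TRU": True,
    }
    return results[op]


nan = float("nan")
print(compf("GE", nan, 1.0), compf("NGE", nan, 1.0))
```

The negated forms (NGE, NLG, and so on) are not redundant with their positive counterparts: with a NaN input, GE is false while NGE is true, whereas LT is also false.
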

Table 82. VOPC Fields

Field Name

Bits

Format or Description

SRC0

[8:0]

Source 0. First operand for the instruction.
0-105     SGPR0 - SGPR105: Scalar general-purpose registers.
106       VCC_LO: VCC[31:0].
107       VCC_HI: VCC[63:32].
108-123   TTMP0 - TTMP15: Trap handler temporary register.
124       NULL
125       M0. Misc register 0.
126       EXEC_LO: EXEC[31:0].
127       EXEC_HI: EXEC[63:32].
128       0.
129-192   Signed integer 1 to 64.
193-208   Signed integer -1 to -16.
209-232   Reserved.
233       DPP8
234       DPP8FI
235       SHARED_BASE (Memory Aperture definition).
236       SHARED_LIMIT (Memory Aperture definition).
237       PRIVATE_BASE (Memory Aperture definition).
238       PRIVATE_LIMIT (Memory Aperture definition).
239       Reserved.
240       0.5.
241       -0.5.
242       1.0.
243       -1.0.
244       2.0.
245       -2.0.
246       4.0.
247       -4.0.
248       1/(2*PI).
250       DPP16
253       SCC.
254       Reserved.
255       Literal constant.
256-511   VGPR 0 - 255

VSRC1

[16:9]

VGPR that provides the second operand.

OP

[24:17]

See Opcode table below.

ENCODING

[31:25]

'b0_111110

Table 83. VOPC Opcodes
Opcode # Name

Opcode # Name

0

V_CMP_F_F16

128

V_CMPX_F_F16

1

V_CMP_LT_F16

129

V_CMPX_LT_F16

2

V_CMP_EQ_F16

130

V_CMPX_EQ_F16

3

V_CMP_LE_F16

131

V_CMPX_LE_F16

4

V_CMP_GT_F16

132

V_CMPX_GT_F16

5

V_CMP_LG_F16

133

V_CMPX_LG_F16

6

V_CMP_GE_F16

134

V_CMPX_GE_F16

7

V_CMP_O_F16

135

V_CMPX_O_F16

8

V_CMP_U_F16

136

V_CMPX_U_F16

9

V_CMP_NGE_F16

137

V_CMPX_NGE_F16

10

V_CMP_NLG_F16

138

V_CMPX_NLG_F16

11

V_CMP_NGT_F16

139

V_CMPX_NGT_F16

12

V_CMP_NLE_F16

140

V_CMPX_NLE_F16

13

V_CMP_NEQ_F16

141

V_CMPX_NEQ_F16

14

V_CMP_NLT_F16

142

V_CMPX_NLT_F16

15

V_CMP_T_F16

143

V_CMPX_T_F16

16

V_CMP_F_F32

144

V_CMPX_F_F32

17

V_CMP_LT_F32

145

V_CMPX_LT_F32

18

V_CMP_EQ_F32

146

V_CMPX_EQ_F32

19

V_CMP_LE_F32

147

V_CMPX_LE_F32

20

V_CMP_GT_F32

148

V_CMPX_GT_F32

21

V_CMP_LG_F32

149

V_CMPX_LG_F32

22

V_CMP_GE_F32

150

V_CMPX_GE_F32

23

V_CMP_O_F32

151

V_CMPX_O_F32

24

V_CMP_U_F32

152

V_CMPX_U_F32

25

V_CMP_NGE_F32

153

V_CMPX_NGE_F32

26

V_CMP_NLG_F32

154

V_CMPX_NLG_F32

27

V_CMP_NGT_F32

155

V_CMPX_NGT_F32

28

V_CMP_NLE_F32

156

V_CMPX_NLE_F32

29

V_CMP_NEQ_F32

157

V_CMPX_NEQ_F32

30

V_CMP_NLT_F32

158

V_CMPX_NLT_F32

31

V_CMP_T_F32

159

V_CMPX_T_F32

32

V_CMP_F_F64

160

V_CMPX_F_F64

33

V_CMP_LT_F64

161

V_CMPX_LT_F64

34

V_CMP_EQ_F64

162

V_CMPX_EQ_F64

35

V_CMP_LE_F64

163

V_CMPX_LE_F64

36

V_CMP_GT_F64

164

V_CMPX_GT_F64

37

V_CMP_LG_F64

165

V_CMPX_LG_F64

38

V_CMP_GE_F64

166

V_CMPX_GE_F64

39

V_CMP_O_F64

167

V_CMPX_O_F64

40

V_CMP_U_F64

168

V_CMPX_U_F64

41

V_CMP_NGE_F64

169

V_CMPX_NGE_F64

42

V_CMP_NLG_F64

170

V_CMPX_NLG_F64

43

V_CMP_NGT_F64

171

V_CMPX_NGT_F64

44

V_CMP_NLE_F64

172

V_CMPX_NLE_F64

45

V_CMP_NEQ_F64

173

V_CMPX_NEQ_F64

46

V_CMP_NLT_F64

174

V_CMPX_NLT_F64

47

V_CMP_T_F64

175

V_CMPX_T_F64

49

V_CMP_LT_I16

177

V_CMPX_LT_I16

50

V_CMP_EQ_I16

178

V_CMPX_EQ_I16

51

V_CMP_LE_I16

179

V_CMPX_LE_I16

52

V_CMP_GT_I16

180

V_CMPX_GT_I16

53

V_CMP_NE_I16

181

V_CMPX_NE_I16

54

V_CMP_GE_I16

182

V_CMPX_GE_I16

57

V_CMP_LT_U16

185

V_CMPX_LT_U16

58

V_CMP_EQ_U16

186

V_CMPX_EQ_U16

59

V_CMP_LE_U16

187

V_CMPX_LE_U16

60

V_CMP_GT_U16

188

V_CMPX_GT_U16

61

V_CMP_NE_U16

189

V_CMPX_NE_U16

62

V_CMP_GE_U16

190

V_CMPX_GE_U16

64

V_CMP_F_I32

192

V_CMPX_F_I32

65

V_CMP_LT_I32

193

V_CMPX_LT_I32

66

V_CMP_EQ_I32

194

V_CMPX_EQ_I32

67

V_CMP_LE_I32

195

V_CMPX_LE_I32

68

V_CMP_GT_I32

196

V_CMPX_GT_I32

69

V_CMP_NE_I32

197

V_CMPX_NE_I32

70

V_CMP_GE_I32

198

V_CMPX_GE_I32

71

V_CMP_T_I32

199

V_CMPX_T_I32

72

V_CMP_F_U32

200

V_CMPX_F_U32

73

V_CMP_LT_U32

201

V_CMPX_LT_U32

74

V_CMP_EQ_U32

202

V_CMPX_EQ_U32

75

V_CMP_LE_U32

203

V_CMPX_LE_U32

76

V_CMP_GT_U32

204

V_CMPX_GT_U32

77

V_CMP_NE_U32

205

V_CMPX_NE_U32

78

V_CMP_GE_U32

206

V_CMPX_GE_U32

79

V_CMP_T_U32

207

V_CMPX_T_U32

80

V_CMP_F_I64

208

V_CMPX_F_I64

81

V_CMP_LT_I64

209

V_CMPX_LT_I64

82

V_CMP_EQ_I64

210

V_CMPX_EQ_I64

83

V_CMP_LE_I64

211

V_CMPX_LE_I64

84

V_CMP_GT_I64

212

V_CMPX_GT_I64

85

V_CMP_NE_I64

213

V_CMPX_NE_I64

86

V_CMP_GE_I64

214

V_CMPX_GE_I64

87

V_CMP_T_I64

215

V_CMPX_T_I64

88

V_CMP_F_U64

216

V_CMPX_F_U64

89

V_CMP_LT_U64

217

V_CMPX_LT_U64

90

V_CMP_EQ_U64

218

V_CMPX_EQ_U64

91

V_CMP_LE_U64

219

V_CMPX_LE_U64

92

V_CMP_GT_U64

220

V_CMPX_GT_U64

93

V_CMP_NE_U64

221

V_CMPX_NE_U64

94

V_CMP_GE_U64

222

V_CMPX_GE_U64

95

V_CMP_T_U64

223

V_CMPX_T_U64

125

V_CMP_CLASS_F16

253

V_CMPX_CLASS_F16

126

V_CMP_CLASS_F32

254

V_CMPX_CLASS_F32

127

V_CMP_CLASS_F64

255

V_CMPX_CLASS_F64
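
The "base plus offset" rule for VOPC opcode numbers can be checked against Table 83 with a short lookup. This is a sketch: the base values are read off Table 83, the +128 step for the V_CMPX (EXEC-writing) variants is inferred from the same table, and the function name vopc_opcode is hypothetical. The CLASS opcodes (125-127) do not follow this pattern and are not handled.

```python
VOPC_BASE = {"F16": 0, "F32": 16, "F64": 32, "I16": 48, "U16": 56,
             "I32": 64, "U32": 72, "I64": 80, "U64": 88}
COMPF_OFFSET = {name: i for i, name in enumerate(
    ["F", "LT", "EQ", "LE", "GT", "LG", "GE", "O",
     "U", "NGE", "NLG", "NGT", "NLE", "NEQ", "NLT", "T"])}
COMPI_OFFSET = {name: i for i, name in enumerate(
    ["F", "LT", "EQ", "LE", "GT", "NE", "GE", "T"])}
CMPX_STEP = 128  # V_CMPX_* sits 128 above the matching V_CMP_*


def vopc_opcode(cmp_op, dtype, writes_exec=False):
    """Return the VOPC opcode number for V_CMP[X]_<cmp_op>_<dtype>."""
    offsets = COMPF_OFFSET if dtype.startswith("F") else COMPI_OFFSET
    if dtype in ("I16", "U16") and cmp_op in ("F", "T"):
        raise ValueError("Table 83 lists no F/T compare for 16-bit integers")
    opcode = VOPC_BASE[dtype] + offsets[cmp_op]
    return opcode + CMPX_STEP if writes_exec else opcode


print(vopc_opcode("LT", "F32"), vopc_opcode("GE", "U32", writes_exec=True))
```
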

15.3.4. VOP3

Description

Vector ALU format with three input operands. Can be followed by a 32-bit literal constant
or DPP instruction DWORD when the instruction allows it.
Table 84. VOP3 Fields

Field Name

Bits

Format or Description

VDST

[7:0]

Destination VGPR

ABS

[10:8]

Absolute value of input. [8] = src0, [9] = src1, [10] = src2

OPSEL

[14:11]

Operand select for 16-bit data. 0 = select low half, 1 = select high half. [11] = src0,
[12] = src1, [13] = src2, [14] = dest.

CLMP

[15]

Clamp output

OP

[25:16]

Opcode. See next table.

ENCODING

[31:26]

'b110101

SRC0

[40:32]

Source 0. First operand for the instruction.
0-105     SGPR0 - SGPR105: Scalar general-purpose registers.
106       VCC_LO: VCC[31:0].
107       VCC_HI: VCC[63:32].
108-123   TTMP0 - TTMP15: Trap handler temporary register.
124       NULL
125       M0. Misc register 0.
126       EXEC_LO: EXEC[31:0].
127       EXEC_HI: EXEC[63:32].
128       0.
129-192   Signed integer 1 to 64.
193-208   Signed integer -1 to -16.
209-232   Reserved.
233       DPP8
234       DPP8FI
235       SHARED_BASE (Memory Aperture definition).
236       SHARED_LIMIT (Memory Aperture definition).
237       PRIVATE_BASE (Memory Aperture definition).
238       PRIVATE_LIMIT (Memory Aperture definition).
239       Reserved.
240       0.5.
241       -0.5.
242       1.0.
243       -1.0.
244       2.0.
245       -2.0.
246       4.0.
247       -4.0.
248       1/(2*PI).
250       DPP16
253       SCC.
254       Reserved.
255       Literal constant.
256-511   VGPR 0 - 255

SRC1

[49:41]

Second input operand. Same options as SRC0.

SRC2

[58:50]

Third input operand. Same options as SRC0.

OMOD

[60:59]

Output Modifier: 0=none, 1=*2, 2=*4, 3=*0.5

NEG

[63:61]

Negate input. [61] = src0, [62] = src1, [63] = src2
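
VOP3 is a 64-bit format, so an encoder naturally produces two DWORDs. The sketch below packs the fields of Table 84, with bits [63:32] expressed relative to the second DWORD (for example SRC0 [40:32] becomes bits [8:0] of the high word). The function name and argument names are hypothetical.

```python
VOP3_ENCODING = 0b110101  # ENCODING [31:26]


def encode_vop3(op, vdst, src0, src1, src2,
                abs_mask=0, neg_mask=0, opsel=0, clamp=0, omod=0):
    """Return (low_dword, high_dword) for a VOP3 instruction (sketch)."""
    assert 0 <= op < 1024 and 0 <= vdst < 256
    assert all(0 <= s < 512 for s in (src0, src1, src2))
    low = vdst                     # VDST  [7:0]
    low |= (abs_mask & 0x7) << 8   # ABS   [10:8]
    low |= (opsel & 0xF) << 11     # OPSEL [14:11]
    low |= (clamp & 0x1) << 15     # CLMP  [15]
    low |= op << 16                # OP    [25:16]
    low |= VOP3_ENCODING << 26
    high = src0                    # SRC0  [40:32]
    high |= src1 << 9              # SRC1  [49:41]
    high |= src2 << 18             # SRC2  [58:50]
    high |= (omod & 0x3) << 27     # OMOD  [60:59]
    high |= (neg_mask & 0x7) << 29 # NEG   [63:61]
    return low, high


# V_FMA_F32 (opcode 531): VDST = v0, sources v1, v2, v3
low, high = encode_vop3(531, 0, 257, 258, 259)
print(hex(low), hex(high))
```

Setting neg_mask=0b001 and omod=1 in the same call would negate src0 and multiply the result by 2, per the NEG and OMOD rows of the table.
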

Table 85. VOP3 Opcodes
Opcode # Name

Opcode # Name

384

V_NOP

803

V_CVT_PK_U16_U32

385

V_MOV_B32

804

V_CVT_PK_I16_I32

386

V_READFIRSTLANE_B32

805

V_SUB_NC_I32

387

V_CVT_I32_F64

806

V_ADD_NC_I32

388

V_CVT_F64_I32

807

V_ADD_F64

389

V_CVT_F32_I32

808

V_MUL_F64

390

V_CVT_F32_U32

809

V_MIN_F64

391

V_CVT_U32_F32

810

V_MAX_F64

392

V_CVT_I32_F32

811

V_LDEXP_F64

394

V_CVT_F16_F32

812

V_MUL_LO_U32

395

V_CVT_F32_F16

813

V_MUL_HI_U32

396

V_CVT_NEAREST_I32_F32

814

V_MUL_HI_I32

397

V_CVT_FLOOR_I32_F32

815

V_TRIG_PREOP_F64

398

V_CVT_OFF_F32_I4

824

V_LSHLREV_B16

399

V_CVT_F32_F64

825

V_LSHRREV_B16

400

V_CVT_F64_F32

826

V_ASHRREV_I16

401

V_CVT_F32_UBYTE0

828

V_LSHLREV_B64

402

V_CVT_F32_UBYTE1

829

V_LSHRREV_B64

403

V_CVT_F32_UBYTE2

830

V_ASHRREV_I64

404

V_CVT_F32_UBYTE3

864

V_READLANE_B32

405

V_CVT_U32_F64

865

V_WRITELANE_B32

406

V_CVT_F64_U32

866

V_AND_B16

407

V_TRUNC_F64

867

V_OR_B16

408

V_CEIL_F64

868

V_XOR_B16

409

V_RNDNE_F64

0

V_CMP_F_F16

410

V_FLOOR_F64

1

V_CMP_LT_F16

411

V_PIPEFLUSH

2

V_CMP_EQ_F16

412

V_MOV_B16

3

V_CMP_LE_F16

416

V_FRACT_F32

4

V_CMP_GT_F16

417

V_TRUNC_F32

5

V_CMP_LG_F16

418

V_CEIL_F32

6

V_CMP_GE_F16

419

V_RNDNE_F32

7

V_CMP_O_F16

420

V_FLOOR_F32

8

V_CMP_U_F16

421

V_EXP_F32

9

V_CMP_NGE_F16

423

V_LOG_F32

10

V_CMP_NLG_F16

426

V_RCP_F32

11

V_CMP_NGT_F16

427

V_RCP_IFLAG_F32

12

V_CMP_NLE_F16

430

V_RSQ_F32

13

V_CMP_NEQ_F16

431

V_RCP_F64

14

V_CMP_NLT_F16

433

V_RSQ_F64

15

V_CMP_T_F16

435

V_SQRT_F32

16

V_CMP_F_F32

436

V_SQRT_F64

17

V_CMP_LT_F32

437

V_SIN_F32

18

V_CMP_EQ_F32

438

V_COS_F32

19

V_CMP_LE_F32

439

V_NOT_B32

20

V_CMP_GT_F32

440

V_BFREV_B32

21

V_CMP_LG_F32

441

V_CLZ_I32_U32

22

V_CMP_GE_F32

442

V_CTZ_I32_B32

23

V_CMP_O_F32

443

V_CLS_I32

24

V_CMP_U_F32

444

V_FREXP_EXP_I32_F64

25

V_CMP_NGE_F32

445

V_FREXP_MANT_F64

26

V_CMP_NLG_F32

446

V_FRACT_F64

27

V_CMP_NGT_F32

447

V_FREXP_EXP_I32_F32

28

V_CMP_NLE_F32

448

V_FREXP_MANT_F32

29

V_CMP_NEQ_F32

450

V_MOVRELD_B32

30

V_CMP_NLT_F32

451

V_MOVRELS_B32

31

V_CMP_T_F32

452

V_MOVRELSD_B32

32

V_CMP_F_F64

456

V_MOVRELSD_2_B32

33

V_CMP_LT_F64

464

V_CVT_F16_U16

34

V_CMP_EQ_F64

465

V_CVT_F16_I16

35

V_CMP_LE_F64

466

V_CVT_U16_F16

36

V_CMP_GT_F64

467

V_CVT_I16_F16

37

V_CMP_LG_F64

468

V_RCP_F16

38

V_CMP_GE_F64

469

V_SQRT_F16

39

V_CMP_O_F64

470

V_RSQ_F16

40

V_CMP_U_F64

471

V_LOG_F16

41

V_CMP_NGE_F64

472

V_EXP_F16

42

V_CMP_NLG_F64

473

V_FREXP_MANT_F16

43

V_CMP_NGT_F64

474

V_FREXP_EXP_I16_F16

44

V_CMP_NLE_F64

475

V_FLOOR_F16

45

V_CMP_NEQ_F64

476

V_CEIL_F16

46

V_CMP_NLT_F64

477

V_TRUNC_F16

47

V_CMP_T_F64

478

V_RNDNE_F16

49

V_CMP_LT_I16

479

V_FRACT_F16

50

V_CMP_EQ_I16

480

V_SIN_F16

51

V_CMP_LE_I16

481

V_COS_F16

52

V_CMP_GT_I16

482

V_SAT_PK_U8_I16

53

V_CMP_NE_I16

483

V_CVT_NORM_I16_F16

54

V_CMP_GE_I16

484

V_CVT_NORM_U16_F16

57

V_CMP_LT_U16

489

V_NOT_B16

58

V_CMP_EQ_U16

490

V_CVT_I32_I16

59

V_CMP_LE_U16

491

V_CVT_U32_U16

60

V_CMP_GT_U16

257

V_CNDMASK_B32

61

V_CMP_NE_U16

259

V_ADD_F32

62

V_CMP_GE_U16

260

V_SUB_F32

64

V_CMP_F_I32

261

V_SUBREV_F32

65

V_CMP_LT_I32

262

V_FMAC_DX9_ZERO_F32

66

V_CMP_EQ_I32

263

V_MUL_DX9_ZERO_F32

67

V_CMP_LE_I32

264

V_MUL_F32

68

V_CMP_GT_I32

265

V_MUL_I32_I24

69

V_CMP_NE_I32

266

V_MUL_HI_I32_I24

70

V_CMP_GE_I32

267

V_MUL_U32_U24

71

V_CMP_T_I32

268

V_MUL_HI_U32_U24

72

V_CMP_F_U32

271

V_MIN_F32

73

V_CMP_LT_U32

272

V_MAX_F32

74

V_CMP_EQ_U32

273

V_MIN_I32

75

V_CMP_LE_U32

274

V_MAX_I32

76

V_CMP_GT_U32

275

V_MIN_U32

77

V_CMP_NE_U32

276

V_MAX_U32

78

V_CMP_GE_U32

280

V_LSHLREV_B32

79

V_CMP_T_U32

281

V_LSHRREV_B32

80

V_CMP_F_I64

282

V_ASHRREV_I32

81

V_CMP_LT_I64

283

V_AND_B32

82

V_CMP_EQ_I64

284

V_OR_B32

83

V_CMP_LE_I64

285

V_XOR_B32

84

V_CMP_GT_I64

286

V_XNOR_B32

85

V_CMP_NE_I64

293

V_ADD_NC_U32

86

V_CMP_GE_I64

294

V_SUB_NC_U32

87

V_CMP_T_I64

295

V_SUBREV_NC_U32

88

V_CMP_F_U64

299

V_FMAC_F32

89

V_CMP_LT_U64

303

V_CVT_PK_RTZ_F16_F32

90

V_CMP_EQ_U64

306

V_ADD_F16

91

V_CMP_LE_U64

307

V_SUB_F16

92

V_CMP_GT_U64

308

V_SUBREV_F16

93

V_CMP_NE_U64

309

V_MUL_F16

94

V_CMP_GE_U64

310

V_FMAC_F16

95

V_CMP_T_U64

313

V_MAX_F16

125

V_CMP_CLASS_F16

314

V_MIN_F16

126

V_CMP_CLASS_F32

315

V_LDEXP_F16

127

V_CMP_CLASS_F64

521

V_FMA_DX9_ZERO_F32

128

V_CMPX_F_F16

522

V_MAD_I32_I24

129

V_CMPX_LT_F16

523

V_MAD_U32_U24

130

V_CMPX_EQ_F16

524

V_CUBEID_F32

131

V_CMPX_LE_F16

525

V_CUBESC_F32

132

V_CMPX_GT_F16

526

V_CUBETC_F32

133

V_CMPX_LG_F16

527

V_CUBEMA_F32

134

V_CMPX_GE_F16

528

V_BFE_U32

135

V_CMPX_O_F16

529

V_BFE_I32

136

V_CMPX_U_F16

530

V_BFI_B32

137

V_CMPX_NGE_F16

531

V_FMA_F32

138

V_CMPX_NLG_F16

532

V_FMA_F64

139

V_CMPX_NGT_F16

533

V_LERP_U8

140

V_CMPX_NLE_F16

534

V_ALIGNBIT_B32

141

V_CMPX_NEQ_F16

535

V_ALIGNBYTE_B32

142

V_CMPX_NLT_F16

536

V_MULLIT_F32

143

V_CMPX_T_F16

537

V_MIN3_F32

144

V_CMPX_F_F32

538

V_MIN3_I32

145

V_CMPX_LT_F32

539

V_MIN3_U32

146

V_CMPX_EQ_F32

540

V_MAX3_F32

147

V_CMPX_LE_F32

541

V_MAX3_I32

148

V_CMPX_GT_F32

542

V_MAX3_U32

149

V_CMPX_LG_F32

543

V_MED3_F32

150

V_CMPX_GE_F32

544

V_MED3_I32

151

V_CMPX_O_F32

545

V_MED3_U32

152

V_CMPX_U_F32

546

V_SAD_U8

153

V_CMPX_NGE_F32

547

V_SAD_HI_U8

154

V_CMPX_NLG_F32

548

V_SAD_U16

155

V_CMPX_NGT_F32

549

V_SAD_U32

156

V_CMPX_NLE_F32

550

V_CVT_PK_U8_F32

157

V_CMPX_NEQ_F32

551

V_DIV_FIXUP_F32

158

V_CMPX_NLT_F32

552

V_DIV_FIXUP_F64

159

V_CMPX_T_F32

567

V_DIV_FMAS_F32

160

V_CMPX_F_F64

568

V_DIV_FMAS_F64

161

V_CMPX_LT_F64

569

V_MSAD_U8

162

V_CMPX_EQ_F64

570

V_QSAD_PK_U16_U8

163

V_CMPX_LE_F64

571

V_MQSAD_PK_U16_U8

164

V_CMPX_GT_F64

573

V_MQSAD_U32_U8

165

V_CMPX_LG_F64

576

V_XOR3_B32

166

V_CMPX_GE_F64

577

V_MAD_U16

167

V_CMPX_O_F64

580

V_PERM_B32

168

V_CMPX_U_F64

581

V_XAD_U32

169

V_CMPX_NGE_F64

582

V_LSHL_ADD_U32

170

V_CMPX_NLG_F64

583

V_ADD_LSHL_U32

171

V_CMPX_NGT_F64

584

V_FMA_F16

172

V_CMPX_NLE_F64

585

V_MIN3_F16

173

V_CMPX_NEQ_F64

586

V_MIN3_I16

174

V_CMPX_NLT_F64

587

V_MIN3_U16

175

V_CMPX_T_F64

588

V_MAX3_F16

177

V_CMPX_LT_I16

589

V_MAX3_I16

178

V_CMPX_EQ_I16

590

V_MAX3_U16

179

V_CMPX_LE_I16

591

V_MED3_F16

180

V_CMPX_GT_I16

592

V_MED3_I16

181

V_CMPX_NE_I16

593

V_MED3_U16

182

V_CMPX_GE_I16

595

V_MAD_I16

185

V_CMPX_LT_U16

596

V_DIV_FIXUP_F16

186

V_CMPX_EQ_U16

597

V_ADD3_U32

187

V_CMPX_LE_U16

598

V_LSHL_OR_B32

188

V_CMPX_GT_U16

599

V_AND_OR_B32

189

V_CMPX_NE_U16

600

V_OR3_B32

190

V_CMPX_GE_U16

601

V_MAD_U32_U16

192

V_CMPX_F_I32

602

V_MAD_I32_I16

193

V_CMPX_LT_I32

603

V_PERMLANE16_B32

194

V_CMPX_EQ_I32

604

V_PERMLANEX16_B32

195

V_CMPX_LE_I32

605

V_CNDMASK_B16

196

V_CMPX_GT_I32

606

V_MAXMIN_F32

197

V_CMPX_NE_I32

607

V_MINMAX_F32

198

V_CMPX_GE_I32

608

V_MAXMIN_F16

199

V_CMPX_T_I32

609

V_MINMAX_F16

200

V_CMPX_F_U32

610

V_MAXMIN_U32

201

V_CMPX_LT_U32

611

V_MINMAX_U32

202

V_CMPX_EQ_U32

612

V_MAXMIN_I32

203

V_CMPX_LE_U32

613

V_MINMAX_I32

204

V_CMPX_GT_U32

614

V_DOT2_F16_F16

205

V_CMPX_NE_U32

615

V_DOT2_BF16_BF16

206

V_CMPX_GE_U32

771

V_ADD_NC_U16

207

V_CMPX_T_U32

772

V_SUB_NC_U16

208

V_CMPX_F_I64

773

V_MUL_LO_U16

209

V_CMPX_LT_I64

774

V_CVT_PK_I16_F32

210

V_CMPX_EQ_I64

775

V_CVT_PK_U16_F32

211

V_CMPX_LE_I64

777

V_MAX_U16

212

V_CMPX_GT_I64

778

V_MAX_I16

213

V_CMPX_NE_I64

779

V_MIN_U16

214

V_CMPX_GE_I64

780

V_MIN_I16

215

V_CMPX_T_I64

781

V_ADD_NC_I16

216

V_CMPX_F_U64

782

V_SUB_NC_I16

217

V_CMPX_LT_U64

785

V_PACK_B32_F16

218

V_CMPX_EQ_U64

786

V_CVT_PK_NORM_I16_F16

219

V_CMPX_LE_U64

787

V_CVT_PK_NORM_U16_F16

220

V_CMPX_GT_U64

796

V_LDEXP_F32

221

V_CMPX_NE_U64

797

V_BFM_B32

222

V_CMPX_GE_U64

798

V_BCNT_U32_B32

223

V_CMPX_T_U64

799

V_MBCNT_LO_U32_B32

253

V_CMPX_CLASS_F16

800

V_MBCNT_HI_U32_B32

254

V_CMPX_CLASS_F32

801

V_CVT_PK_NORM_I16_F32

255

V_CMPX_CLASS_F64

802

V_CVT_PK_NORM_U16_F32
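
Comparing Table 85 with the VOP2, VOP1 and VOPC opcode tables suggests a simple numbering pattern: VOPC opcodes keep their number in VOP3, VOP2 opcodes appear at +256, and VOP1 opcodes at +384. The sketch below captures that observation; it is inferred from the tables rather than stated by them, the function name is hypothetical, and not every VOP1/VOP2 opcode has a VOP3 entry (V_FMAMK_F32 and V_SWAP_B32, for instance, are absent from Table 85).

```python
VOP3_OFFSET = {"VOPC": 0, "VOP2": 256, "VOP1": 384}


def vop3_opcode(fmt, op):
    """Map a VOPC/VOP2/VOP1 opcode number to its VOP3 number (inferred)."""
    return VOP3_OFFSET[fmt] + op


print(vop3_opcode("VOP1", 1))   # V_MOV_B32
print(vop3_opcode("VOP2", 3))   # V_ADD_F32
print(vop3_opcode("VOPC", 17))  # V_CMP_LT_F32
```
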

15.3.5. VOP3SD

Description

Vector ALU format with three operands and a scalar result. This encoding is used only for
a few opcodes. Can be followed by a 32-bit literal constant or DPP instruction DWORD
when the instruction allows it.

This encoding allows specifying a unique scalar destination, and is used only for the opcodes listed below. All
other opcodes use VOP3.
Table 86. VOP3SD Fields
Field Name

Bits

Format or Description

VDST

[7:0]

Destination VGPR

SDST

[14:8]

Scalar destination

CLMP

[15]

Clamp result

OP

[25:16]

Opcode. See next table.

ENCODING

[31:26]

'b110101

SRC0

[40:32]

Source 0. First operand for the instruction.
0-105     SGPR0 - SGPR105: Scalar general-purpose registers.
106       VCC_LO: VCC[31:0].
107       VCC_HI: VCC[63:32].
108-123   TTMP0 - TTMP15: Trap handler temporary register.
124       NULL
125       M0. Misc register 0.
126       EXEC_LO: EXEC[31:0].
127       EXEC_HI: EXEC[63:32].
128       0.
129-192   Signed integer 1 to 64.
193-208   Signed integer -1 to -16.
209-232   Reserved.
233       DPP8
234       DPP8FI
235       SHARED_BASE (Memory Aperture definition).
236       SHARED_LIMIT (Memory Aperture definition).
237       PRIVATE_BASE (Memory Aperture definition).
238       PRIVATE_LIMIT (Memory Aperture definition).
239       Reserved.
240       0.5.
241       -0.5.
242       1.0.
243       -1.0.
244       2.0.
245       -2.0.
246       4.0.
247       -4.0.
248       1/(2*PI).
250       DPP16
253       SCC.
254       Reserved.
255       Literal constant.
256-511   VGPR 0 - 255

SRC1

[49:41]

Second input operand. Same options as SRC0.

SRC2

[58:50]

Third input operand. Same options as SRC0.

OMOD

[60:59]

Output Modifier: 0=none, 1=*2, 2=*4, 3=*0.5

NEG

[63:61]

Negate input. [61] = src0, [62] = src1, [63] = src2

Table 87. VOP3SD Opcodes
Opcode # Name

Opcode # Name

288

V_ADD_CO_CI_U32

766

V_MAD_U64_U32

289

V_SUB_CO_CI_U32

767

V_MAD_I64_I32

290

V_SUBREV_CO_CI_U32

768

V_ADD_CO_U32

764

V_DIV_SCALE_F32

769

V_SUB_CO_U32

765

V_DIV_SCALE_F64

770

V_SUBREV_CO_U32
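
VOP3 and VOP3SD share the same ENCODING value, so a decoder has to look at the opcode to decide how to interpret bits [14:8]. The following is a minimal sketch of that decision using the opcode list in Table 87; the function name and the returned dictionary layout are hypothetical.

```python
VOP3_ENCODING = 0b110101
VOP3SD_OPCODES = {
    288: "V_ADD_CO_CI_U32", 289: "V_SUB_CO_CI_U32", 290: "V_SUBREV_CO_CI_U32",
    764: "V_DIV_SCALE_F32", 765: "V_DIV_SCALE_F64",
    766: "V_MAD_U64_U32", 767: "V_MAD_I64_I32",
    768: "V_ADD_CO_U32", 769: "V_SUB_CO_U32", 770: "V_SUBREV_CO_U32",
}


def decode_vop3_low(word):
    """Decode the first DWORD as VOP3 or VOP3SD, chosen by opcode."""
    if (word >> 26) != VOP3_ENCODING:
        return None
    op = (word >> 16) & 0x3FF
    fields = {"op": op, "vdst": word & 0xFF, "clamp": (word >> 15) & 1}
    if op in VOP3SD_OPCODES:
        fields["format"] = "VOP3SD"
        fields["sdst"] = (word >> 8) & 0x7F   # SDST [14:8]
    else:
        fields["format"] = "VOP3"
        fields["abs"] = (word >> 8) & 0x7     # ABS [10:8]
        fields["opsel"] = (word >> 11) & 0xF  # OPSEL [14:11]
    return fields


# V_ADD_CO_U32 with VDST = v5 and SDST = 106 (VCC_LO)
word = (VOP3_ENCODING << 26) | (768 << 16) | (106 << 8) | 5
print(decode_vop3_low(word))
```
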

15.3.6. VOP3P

Description

Vector ALU format taking one, two or three pairs of 16 bit inputs and producing two 16-bit
outputs (packed into 1 DWORD). WMMA instructions have larger input and output VGPR
sets. Can be followed by a 32-bit literal constant or DPP instruction DWORD when the
instruction allows it.
Table 88. VOP3P Fields

Field Name

Bits

Format or Description

VDST

[7:0]

Destination VGPR

NEG_HI

[10:8]

Negate sources 0,1,2 of the high 16-bits.

OPSEL

[13:11]

Select low or high for low sources 0=[11], 1=[12], 2=[13].

OPSEL_HI2

[14]

Select low or high for high sources 0=[14], 1=[60], 2=[59].

CLMP

[15]

1 = clamp result.

OP

[22:16]

Opcode. See next table.

ENCODING

[31:24]

'b11001100

SRC0

[40:32]

Source 0. First operand for the instruction.
0-105     SGPR0 - SGPR105: Scalar general-purpose registers.
106       VCC_LO: VCC[31:0].
107       VCC_HI: VCC[63:32].
108-123   TTMP0 - TTMP15: Trap handler temporary register.
124       NULL
125       M0. Misc register 0.
126       EXEC_LO: EXEC[31:0].
127       EXEC_HI: EXEC[63:32].
128       0.
129-192   Signed integer 1 to 64.
193-208   Signed integer -1 to -16.
209-232   Reserved.
233       DPP8
234       DPP8FI
235       SHARED_BASE (Memory Aperture definition).
236       SHARED_LIMIT (Memory Aperture definition).
237       PRIVATE_BASE (Memory Aperture definition).
238       PRIVATE_LIMIT (Memory Aperture definition).
239       Reserved.
240       0.5.
241       -0.5.
242       1.0.
243       -1.0.
244       2.0.
245       -2.0.
246       4.0.
247       -4.0.
248       1/(2*PI).
250       DPP16
253       SCC.
254       Reserved.
255       Literal constant.
256-511   VGPR 0 - 255

SRC1

[49:41]

Second input operand. Same options as SRC0.

SRC2

[58:50]

Third input operand. Same options as SRC0.

OPSEL_HI

[60:59]

See OPSEL_HI2.

NEG

[63:61]

Negate input for low 16-bits of sources. [61] = src0, [62] = src1, [63] = src2

Table 89. VOP3P Opcodes
Opcode # Name

Opcode # Name

0

V_PK_MAD_I16

17

V_PK_MIN_F16

1

V_PK_MUL_LO_U16

18

V_PK_MAX_F16

2

V_PK_ADD_I16

19

V_DOT2_F32_F16

3

V_PK_SUB_I16

22

V_DOT4_I32_IU8

4

V_PK_LSHLREV_B16

23

V_DOT4_U32_U8

5

V_PK_LSHRREV_B16

24

V_DOT8_I32_IU4

6

V_PK_ASHRREV_I16

25

V_DOT8_U32_U4

7

V_PK_MAX_I16

26

V_DOT2_F32_BF16

8

V_PK_MIN_I16

32

V_FMA_MIX_F32

9

V_PK_MAD_U16

33

V_FMA_MIXLO_F16

10

V_PK_ADD_U16

34

V_FMA_MIXHI_F16

11

V_PK_SUB_U16

64

V_WMMA_F32_16X16X16_F16

12

V_PK_MAX_U16

65

V_WMMA_F32_16X16X16_BF16

13

V_PK_MIN_U16

66

V_WMMA_F16_16X16X16_F16

14

V_PK_FMA_F16

67

V_WMMA_BF16_16X16X16_BF16

15

V_PK_ADD_F16

68

V_WMMA_I32_16X16X16_IU8

16

V_PK_MUL_F16

69

V_WMMA_I32_16X16X16_IU4

15.3.7. VOPD

Description

Vector ALU format describing two instructions to be executed in parallel. Can be followed
by a 32-bit literal constant, but not a DPP control DWORD.

This instruction format describes two opcodes: X and Y.
Table 90. VOPD Fields

Field Name

Bits

Format or Description

SRCX0

[8:0]

Source 0 for opcode X. First operand for the instruction.
0-105     SGPR0 - SGPR105: Scalar general-purpose registers.
106       VCC_LO: VCC[31:0].
107       VCC_HI: VCC[63:32].
108-123   TTMP0 - TTMP15: Trap handler temporary register.
124       NULL
125       M0. Misc register 0.
126       EXEC_LO: EXEC[31:0].
127       EXEC_HI: EXEC[63:32].
128       0.
129-192   Signed integer 1 to 64.
193-208   Signed integer -1 to -16.
209-232   Reserved.
233       DPP8
234       DPP8FI
235       SHARED_BASE (Memory Aperture definition).
236       SHARED_LIMIT (Memory Aperture definition).
237       PRIVATE_BASE (Memory Aperture definition).
238       PRIVATE_LIMIT (Memory Aperture definition).
239       Reserved.
240       0.5.
241       -0.5.
242       1.0.
243       -1.0.
244       2.0.
245       -2.0.
246       4.0.
247       -4.0.
248       1/(2*PI).
250       DPP16
253       SCC.
254       Reserved.
255       Literal constant.
256-511   VGPR 0 - 255

VSRCX1

[16:9]

Source VGPR 1 for opcode X.

OPY

[21:17]

Opcode Y. See Table 92.

OPX

[25:22]

Opcode X. See Table 91.

ENCODING

[31:26]

'b110010

SRCY0

[40:32]

Source 0 for opcode Y. See SRCX0 for enumerations.

VSRCY1

[48:41]

Source VGPR 1 for opcode Y.

VDSTY

[55:49]

Instruction Y destination VGPR, excluding LSB. LSB is the opposite of VDSTX[0].

VDSTX

[63:56]

Instruction X destination VGPR
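
Because VDSTY stores only the upper seven bits of the Y destination, the full register number has to be rebuilt from VDSTX. The sketch below shows one way to do that and the matching encoder-side check; both function names are hypothetical.

```python
def vopd_vdsty(vdsty_field, vdstx):
    """Rebuild the full Y destination VGPR from the 7-bit VDSTY field."""
    lsb = (vdstx & 1) ^ 1            # LSB is the opposite of VDSTX[0]
    return (vdsty_field << 1) | lsb


def vopd_vdsty_field(vdsty, vdstx):
    """Return the VDSTY field for an encoder; parities must differ."""
    if (vdsty & 1) == (vdstx & 1):
        raise ValueError("VDSTX and VDSTY must be one even, one odd VGPR")
    return vdsty >> 1


print(vopd_vdsty(3, 4), vopd_vdsty(3, 5))
```

One consequence visible in the code: the X and Y destinations of a VOPD instruction always have opposite parity, so an encoder has to reject register pairs that are both even or both odd.
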

Table 91. VOPD X-Opcodes
0 V_DUAL_FMAC_F32
1 V_DUAL_FMAAK_F32
2 V_DUAL_FMAMK_F32
3 V_DUAL_MUL_F32
4 V_DUAL_ADD_F32
5 V_DUAL_SUB_F32
6 V_DUAL_SUBREV_F32
7 V_DUAL_MUL_DX9_ZERO_F32
8 V_DUAL_MOV_B32
9 V_DUAL_CNDMASK_B32
10 V_DUAL_MAX_F32
11 V_DUAL_MIN_F32
12 V_DUAL_DOT2ACC_F32_F16
13 V_DUAL_DOT2ACC_F32_BF16

Table 92. VOPD Y-Opcodes
0 V_DUAL_FMAC_F32
1 V_DUAL_FMAAK_F32
2 V_DUAL_FMAMK_F32
3 V_DUAL_MUL_F32
4 V_DUAL_ADD_F32
5 V_DUAL_SUB_F32
6 V_DUAL_SUBREV_F32
7 V_DUAL_MUL_DX9_ZERO_F32
8 V_DUAL_MOV_B32
9 V_DUAL_CNDMASK_B32
10 V_DUAL_MAX_F32
11 V_DUAL_MIN_F32
12 V_DUAL_DOT2ACC_F32_F16
13 V_DUAL_DOT2ACC_F32_BF16
16 V_DUAL_ADD_NC_U32
17 V_DUAL_LSHLREV_B32
18 V_DUAL_AND_B32

15.3.8. DPP16

Description

Data Parallel Primitives over 16 lanes. This is an additional DWORD that can follow VOP1,
VOP2, VOPC, VOP3 or VOP3P instructions (in place of a literal constant) to control
selection of data from other lanes.
Table 93. DPP16 Fields

Field Name

Bits

Format or Description

SRC0

[39:32]

Real SRC0 operand (VGPR).

DPP_CTRL

[48:40]

See next table: "DPP_CTRL Enumeration"

FI

[50]

Fetch invalid data: 0 = read zero for any inactive lanes; 1 = read VGPRs even for
invalid lanes.

BC

[51]

Bounds Control: 0 = do not write when source is out of range, 1 = write.

SRC0_NEG

[52]

1 = negate source 0.

SRC0_ABS

[53]

1 = Absolute value of source 0.

SRC1_NEG

[54]

1 = negate source 1.

SRC1_ABS

[55]

1 = Absolute value of source 1.

BANK_MASK

[59:56]

Bank Mask. Applies to the VGPR destination write only; does not impact the thread
mask when fetching source VGPR data. Bit numbers below are positions within the
DPP DWORD (bit 24 is field bit [56]).
27==0: lanes[12:15, 28:31, 44:47, 60:63] are disabled
26==0: lanes[8:11, 24:27, 40:43, 56:59] are disabled
25==0: lanes[4:7, 20:23, 36:39, 52:55] are disabled
24==0: lanes[0:3, 16:19, 32:35, 48:51] are disabled
Note: the term "bank" here is not the same as the one used for the VGPR bank.

ROW_MASK

[63:60]

Row Mask. Applies to the VGPR destination write only; does not impact the thread
mask when fetching source VGPR data. Bit numbers below are positions within the
DPP DWORD (bit 28 is field bit [60]).
31==0: lanes[63:48] are disabled (wave 64 only)
30==0: lanes[47:32] are disabled (wave 64 only)
29==0: lanes[31:16] are disabled
28==0: lanes[15:0] are disabled
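
The BANK_MASK and ROW_MASK rules combine into a per-lane write enable. The sketch below derives the row (16 lanes) and bank (4 lanes within a row) indices from the lane number, following the lane lists in the table; the function name is hypothetical and both masks are passed as 4-bit values with bit 0 being the lowest bit of the field.

```python
def dpp16_write_enabled(lane, row_mask, bank_mask):
    """True if the destination write for this lane passes both masks."""
    row = lane >> 4          # 16 lanes per row: rows 0-3
    bank = (lane >> 2) & 3   # 4 lanes per bank: banks 0-3 within each row
    return bool((row_mask >> row) & 1) and bool((bank_mask >> bank) & 1)


# Clearing bank bit 1 disables lanes 4:7, 20:23, 36:39 and 52:55.
print([lane for lane in range(32)
       if not dpp16_write_enabled(lane, 0xF, 0b1101)])
```
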

Table 94. DPP_CTRL Enumeration

DPP_Cntl Enumeration

Hex Value

Function

Description

DPP_QUAD_PERM*

000-0FF

pix[n].srca = pix[(n&0x3c) + dpp_cntl[n%4*2+1 : n%4*2]].srca

Permute of four threads.

DPP_UNUSED

100

Undefined

Reserved.

DPP_ROW_SL*

101-10F

if ((n&0xf) < (16-cntl[3:0])) pix[n].srca = pix[n + cntl[3:0]].srca else use bound_cntl

Row shift left by 1-15 threads.

DPP_ROW_SR*

111-11F

if ((n&0xf) >= cntl[3:0]) pix[n].srca = pix[n - cntl[3:0]].srca else use bound_cntl

Row shift right by 1-15 threads.

DPP_ROW_RR*

121-12F

if ((n&0xf) >= cntl[3:0]) pix[n].srca = pix[n - cntl[3:0]].srca else pix[n].srca = pix[n + 16 - cntl[3:0]].srca

Row rotate right by 1-15 threads.

DPP_ROW_MIRROR*

140

pix[n].srca = pix[15-(n&0xf)].srca

Mirror threads within row.

DPP_ROW_HALF_MIRROR*

141

pix[n].srca = pix[7-(n&7)].srca

Mirror threads within row (8 threads).

DPP_ROW_SHARE*

150-15F

lanesel = DPP_CTRL & 0xf;
lane[n].src0 = lane[(n & 0x30) + lanesel].src0.

Select one lane within each row and share
the result with all lanes in the row.

DPP_ROW_XMASK*

160-16F

lane[n].src0 = lane[(n & 0x30) + ((n & 0xf) ^
mask)].src0.

Fetch lane ID is the current lane ID XOR’d
with a mask specified by DPP_CTRL[3:0].
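
The DPP_CTRL functions above can be read as a mapping from a destination lane n to the lane it fetches from. The sketch below implements that mapping for a 64-lane wave. The function name is hypothetical; it returns None where the table says "use bound_cntl", and for the two mirror modes it adds the row (or half-row) base to the table's formula, which is an assumption based on the "within row" descriptions.

```python
def dpp16_source_lane(ctrl, n):
    """Lane that lane n reads under DPP_CTRL value ctrl, or None when the
    source is out of range and bound control decides the result."""
    row_base = n & 0x30          # first lane of the 16-lane row
    in_row = n & 0xF             # position within the row
    amount = ctrl & 0xF          # cntl[3:0]
    if 0x000 <= ctrl <= 0x0FF:                   # DPP_QUAD_PERM*
        sel = (ctrl >> ((n % 4) * 2)) & 0x3
        return (n & 0x3C) + sel
    if 0x101 <= ctrl <= 0x10F:                   # DPP_ROW_SL*
        return n + amount if in_row < 16 - amount else None
    if 0x111 <= ctrl <= 0x11F:                   # DPP_ROW_SR*
        return n - amount if in_row >= amount else None
    if 0x121 <= ctrl <= 0x12F:                   # DPP_ROW_RR*
        return n - amount if in_row >= amount else n + 16 - amount
    if ctrl == 0x140:                            # DPP_ROW_MIRROR*
        return row_base + 15 - in_row            # row base: assumption
    if ctrl == 0x141:                            # DPP_ROW_HALF_MIRROR*
        return (n & 0x38) + 7 - (n & 0x7)        # half-row base: assumption
    if 0x150 <= ctrl <= 0x15F:                   # DPP_ROW_SHARE*
        return row_base + amount
    if 0x160 <= ctrl <= 0x16F:                   # DPP_ROW_XMASK*
        return row_base + (in_row ^ amount)
    raise ValueError("reserved or undefined DPP_CTRL value")


# quad_perm 0x1B selects lanes 3,2,1,0 within every group of four.
print([dpp16_source_lane(0x1B, n) for n in range(8)])
```
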

15.3.9. DPP8

Description

Data Parallel Primitives over 8 lanes. This is a second DWORD that can follow VOP1,
VOP2, VOPC, VOP3 or VOP3P instructions (in place of a literal constant) to control
selection of data from other lanes.
Table 95. DPP8 Fields

Field Name

Bits

Format or Description

SRC0

[39:32]

Real SRC0 operand (VGPR).

LANE_SEL0

[42:40]

Which lane to read for 1st output lane per 8-lane group

LANE_SEL1

[45:43]

Which lane to read for 2nd output lane per 8-lane group

LANE_SEL2

[48:46]

Which lane to read for 3rd output lane per 8-lane group

LANE_SEL3

[51:49]

Which lane to read for 4th output lane per 8-lane group

LANE_SEL4

[54:52]

Which lane to read for 5th output lane per 8-lane group

LANE_SEL5

[57:55]

Which lane to read for 6th output lane per 8-lane group

LANE_SEL6

[60:58]

Which lane to read for 7th output lane per 8-lane group

LANE_SEL7

[63:61]

Which lane to read for 8th output lane per 8-lane group
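
A DPP8 control DWORD is one VGPR number followed by eight 3-bit lane selectors. The sketch below packs such a DWORD and applies the selectors to a lane number. Bit positions are taken relative to the DPP8 DWORD itself (SRC0 [39:32] becomes bits [7:0]); the function names are hypothetical, and the source-lane formula assumes each selector is relative to the start of the lane's own 8-lane group.

```python
def encode_dpp8_dword(src0_vgpr, lane_sel):
    """Pack SRC0 and LANE_SEL0..7 into the DPP8 control DWORD (sketch)."""
    assert 0 <= src0_vgpr < 256 and len(lane_sel) == 8
    word = src0_vgpr                      # SRC0 [39:32]
    for i, sel in enumerate(lane_sel):
        assert 0 <= sel < 8
        word |= sel << (8 + 3 * i)        # LANE_SELi, 3 bits each
    return word


def dpp8_source_lane(n, lane_sel):
    """Lane that output lane n reads from within its 8-lane group."""
    return (n & ~7) + lane_sel[n & 7]


identity = list(range(8))
reverse = identity[::-1]
print(hex(encode_dpp8_dword(1, identity)))
print([dpp8_source_lane(n, reverse) for n in range(8, 16)])
```
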

15.4. Vector Parameter Interpolation Format
15.4.1. VINTERP

Description

Vector Parameter Interpolation.
These opcodes perform parameter interpolation using vertex data in pixel shaders.
Table 96. VINTERP Fields

Field Name

Bits

Format or Description

VDST

[7:0]

Destination VGPR

WAITEXP

[10:8]

Wait for EXPcnt to be less-than or equal-to this value before issuing instruction.

OPSEL

[14:11]

Select low or high for low sources 0=[11], 1=[12], 2=[13], dst=[14].

CLMP

[15]

1 = clamp result.

OP

[22:16]

Opcode. See next table.

ENCODING

[31:24]

'b11001101

SRC0

[40:32]

Source 0. First operand for the instruction: VGPR 0-255.

SRC0

[40:32]

Source 0. First operand for the instruction.
0-105     SGPR0 - SGPR105: Scalar general-purpose registers.
106       VCC_LO: VCC[31:0].
107       VCC_HI: VCC[63:32].
108-123   TTMP0 - TTMP15: Trap handler temporary register.
124       NULL
125       M0. Misc register 0.
126       EXEC_LO: EXEC[31:0].
127       EXEC_HI: EXEC[63:32].
128       0.
129-192   Signed integer 1 to 64.
193-208   Signed integer -1 to -16.
209-232   Reserved.
233       DPP8
234       DPP8FI
235       SHARED_BASE (Memory Aperture definition).
236       SHARED_LIMIT (Memory Aperture definition).
237       PRIVATE_BASE (Memory Aperture definition).
238       PRIVATE_LIMIT (Memory Aperture definition).
239       Reserved.
240       0.5.
241       -0.5.
242       1.0.
243       -1.0.
244       2.0.
245       -2.0.
246       4.0.
247       -4.0.
248       1/(2*PI).
250       DPP16
253       SCC.
254       Reserved.
255       Literal constant.
256-511   VGPR 0 - 255

SRC1

[49:41]

Second input operand. Same options as SRC0.

SRC2

[58:50]

Third input operand. Same options as SRC0.

NEG

[63:61]

Negate input for low 16-bits of sources. [61] = src0, [62] = src1, [63] = src2

Table 97. VINTERP Opcodes
Opcode #  Name
0         V_INTERP_P10_F32
1         V_INTERP_P2_F32
2         V_INTERP_P10_F16_F32
3         V_INTERP_P2_F16_F32
4         V_INTERP_P10_RTZ_F16_F32
5         V_INTERP_P2_RTZ_F16_F32

15.5. Parameter and Direct Load from LDS
15.5.1. LDSDIR

Description

LDS Direct and Parameter Load.
These opcodes read either pixel parameter data or individual DWORDs from LDS into
VGPRs.
Table 98. LDSDIR Fields

Field Name

Bits

Format or Description

VDST

[7:0]

Destination VGPR

ATTR_CHAN

[9:8]

Attribute channel: 0=X, 1=Y, 2=Z, 3=W

ATTR

[15:10]

Attribute number: 0 - 32.

WAIT_VA

[19:16]

Wait for previous VALU instructions to complete to resolve data dependency. Value
is the max number of VALU ops still outstanding when issuing this instruction.

OP

[21:20]

Opcode:
0: LDS_DIRECT_LOAD
1: LDS_PARAM_LOAD
2, 3: Reserved.

ENCODING

[31:24]

'b11001110
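
As a rough illustration of how the LDSDIR fields in Table 98 fit into one 32-bit word, the following sketch packs them by bit position. The function name `encode_ldsdir` and its argument order are hypothetical, and bits [23:22], which the table does not describe, are assumed to be zero.

```python
# Sketch only: packs the LDSDIR fields from Table 98 into a 32-bit word.
LDSDIR_ENCODING = 0b11001110  # ENCODING field, bits [31:24]


def encode_ldsdir(op: int, vdst: int, attr: int = 0,
                  attr_chan: int = 0, wait_va: int = 0) -> int:
    # (value, width in bits, position of the least significant bit)
    fields = [
        (vdst, 8, 0),        # VDST      [7:0]
        (attr_chan, 2, 8),   # ATTR_CHAN [9:8]
        (attr, 6, 10),       # ATTR      [15:10]
        (wait_va, 4, 16),    # WAIT_VA   [19:16]
        (op, 2, 20),         # OP        [21:20]
        (LDSDIR_ENCODING, 8, 24),
    ]
    word = 0
    for value, width, lsb in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word |= value << lsb
    return word
```

With these assumptions, an LDS_PARAM_LOAD (OP = 1) of attribute 2, channel Y, into VGPR 3 packs to `0xCE100903`.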

15.6. LDS and GDS Format
15.6.1. DS

Description

Local and Global Data Sharing instructions
Table 99. DS Fields

Field Name

Bits

Format or Description

OFFSET0

[7:0]

First address offset

OFFSET1

[15:8]

Second address offset. For some opcodes this is concatenated with OFFSET0.

GDS

[17]

1=GDS, 0=LDS operation.

OP

[25:18]

See Opcode table below.

ENCODING

[31:26]

'b110110

ADDR

[39:32]

VGPR that supplies the address.

DATA0

[47:40]

First data VGPR.

DATA1

[55:48]

Second data VGPR.

VDST

[63:56]

Destination VGPR when results returned to VGPRs.

Table 100. DS Opcodes
Opcode #  Name
0         DS_ADD_U32
1         DS_SUB_U32
2         DS_RSUB_U32
3         DS_INC_U32
4         DS_DEC_U32
5         DS_MIN_I32
6         DS_MAX_I32
7         DS_MIN_U32
8         DS_MAX_U32
9         DS_AND_B32
10        DS_OR_B32
11        DS_XOR_B32
12        DS_MSKOR_B32
13        DS_STORE_B32
14        DS_STORE_2ADDR_B32
15        DS_STORE_2ADDR_STRIDE64_B32
16        DS_CMPSTORE_B32
17        DS_CMPSTORE_F32
18        DS_MIN_F32
19        DS_MAX_F32
20        DS_NOP
21        DS_ADD_F32
24        Reserved
25        Reserved
26        Reserved
27        Reserved
28        Reserved
29        Reserved
30        DS_STORE_B8
31        DS_STORE_B16
32        DS_ADD_RTN_U32
33        DS_SUB_RTN_U32
34        DS_RSUB_RTN_U32
35        DS_INC_RTN_U32
36        DS_DEC_RTN_U32
37        DS_MIN_RTN_I32
38        DS_MAX_RTN_I32
39        DS_MIN_RTN_U32
40        DS_MAX_RTN_U32
41        DS_AND_RTN_B32
42        DS_OR_RTN_B32
43        DS_XOR_RTN_B32
44        DS_MSKOR_RTN_B32
45        DS_STOREXCHG_RTN_B32
46        DS_STOREXCHG_2ADDR_RTN_B32
47        DS_STOREXCHG_2ADDR_STRIDE64_RTN_B32
48        DS_CMPSTORE_RTN_B32
49        DS_CMPSTORE_RTN_F32
50        DS_MIN_RTN_F32
51        DS_MAX_RTN_F32
52        DS_WRAP_RTN_B32
53        DS_SWIZZLE_B32
54        DS_LOAD_B32
55        DS_LOAD_2ADDR_B32
56        DS_LOAD_2ADDR_STRIDE64_B32
57        DS_LOAD_I8
58        DS_LOAD_U8
59        DS_LOAD_I16
60        DS_LOAD_U16
61        DS_CONSUME
62        DS_APPEND
63        DS_ORDERED_COUNT
64        DS_ADD_U64
65        DS_SUB_U64
66        DS_RSUB_U64
67        DS_INC_U64
68        DS_DEC_U64
69        DS_MIN_I64
70        DS_MAX_I64
71        DS_MIN_U64
72        DS_MAX_U64
73        DS_AND_B64
74        DS_OR_B64
75        DS_XOR_B64
76        DS_MSKOR_B64
77        DS_STORE_B64
78        DS_STORE_2ADDR_B64
79        DS_STORE_2ADDR_STRIDE64_B64
80        DS_CMPSTORE_B64
81        DS_CMPSTORE_F64
82        DS_MIN_F64
83        DS_MAX_F64
96        DS_ADD_RTN_U64
97        DS_SUB_RTN_U64
98        DS_RSUB_RTN_U64
99        DS_INC_RTN_U64
100       DS_DEC_RTN_U64
101       DS_MIN_RTN_I64
102       DS_MAX_RTN_I64
103       DS_MIN_RTN_U64
104       DS_MAX_RTN_U64
105       DS_AND_RTN_B64
106       DS_OR_RTN_B64
107       DS_XOR_RTN_B64
108       DS_MSKOR_RTN_B64
109       DS_STOREXCHG_RTN_B64
110       DS_STOREXCHG_2ADDR_RTN_B64
111       DS_STOREXCHG_2ADDR_STRIDE64_RTN_B64
112       DS_CMPSTORE_RTN_B64
113       DS_CMPSTORE_RTN_F64
114       DS_MIN_RTN_F64
115       DS_MAX_RTN_F64
118       DS_LOAD_B64
119       DS_LOAD_2ADDR_B64
120       DS_LOAD_2ADDR_STRIDE64_B64
121       DS_ADD_RTN_F32
122       DS_ADD_GS_REG_RTN
123       DS_SUB_GS_REG_RTN
126       DS_CONDXCHG32_RTN_B64
160       DS_STORE_B8_D16_HI
161       DS_STORE_B16_D16_HI
162       DS_LOAD_U8_D16
163       DS_LOAD_U8_D16_HI
164       DS_LOAD_I8_D16
165       DS_LOAD_I8_D16_HI
166       DS_LOAD_U16_D16
167       DS_LOAD_U16_D16_HI
173       DS_BVH_STACK_RTN_B32
176       DS_STORE_ADDTID_B32
177       DS_LOAD_ADDTID_B32
178       DS_PERMUTE_B32
179       DS_BPERMUTE_B32
222       DS_STORE_B96
223       DS_STORE_B128
254       DS_LOAD_B96
255       DS_LOAD_B128

15.7. Vector Memory Buffer Formats
There are two memory buffer instruction formats:
MTBUF
typed buffer access (data type is defined by the instruction)
MUBUF
untyped buffer access (data type is defined by the buffer / resource-constant)

15.7.1. MTBUF

Description

Memory Typed-Buffer Instructions
Table 101. MTBUF Fields

Field Name

Bits

Format or Description

OFFSET

[11:0]

Address offset, unsigned byte.

SLC

[12]

System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.

DLC

[13]

0 = normal, 1 = Device Coherent

GLC

[14]

0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op
value to VGPR.

OP

[18:15]

Opcode. See table below.

FORMAT

[25:19]

Data Format of data in memory buffer. See Buffer Image format Table

ENCODING

[31:26]

'b111010

VADDR

[39:32]

Address of VGPR to supply first component of address (offset or index). When both
index and offset are used, index is in the first VGPR and offset in the second.

VDATA

[47:40]

Address of VGPR to supply first component of write data or receive first component
of read-data.

SRSRC

[52:48]

SGPR to supply V# (resource constant) in 4 or 8 consecutive SGPRs. It is missing 2
LSB’s of SGPR-address since it is aligned to 4 SGPRs.

TFE

[53]

Partially resident texture, texture fault enable.

OFFEN

[54]

1 = enable offset VGPR, 0 = use zero for address offset

IDXEN

[55]

1 = enable index VGPR, 0 = use zero for address index

SOFFSET

[63:56]

Address offset, unsigned byte.
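
The SRSRC description above says the field omits the two least significant bits of the SGPR address because the resource constant is aligned to 4 SGPRs. A minimal sketch of that relationship, with hypothetical helper names, might look like this:

```python
# Sketch only: converts between an SGPR index and the 5-bit SRSRC field,
# assuming the field simply stores the SGPR index divided by 4.

def encode_srsrc(first_sgpr: int) -> int:
    if first_sgpr % 4 != 0:
        raise ValueError("resource constant must start at a multiple of 4 SGPRs")
    field = first_sgpr >> 2          # drop the two LSBs
    if not 0 <= field < 32:          # SRSRC is 5 bits wide, [52:48]
        raise ValueError("SGPR index out of range for a 5-bit field")
    return field


def decode_srsrc(field: int) -> int:
    return (field & 0x1F) << 2       # restore the two implied zero LSBs
```

Under this assumption, a resource constant held in SGPR8 onward is encoded as SRSRC = 2.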

Table 102. MTBUF Opcodes
Opcode #  Name
0         TBUFFER_LOAD_FORMAT_X
1         TBUFFER_LOAD_FORMAT_XY
2         TBUFFER_LOAD_FORMAT_XYZ
3         TBUFFER_LOAD_FORMAT_XYZW
4         TBUFFER_STORE_FORMAT_X
5         TBUFFER_STORE_FORMAT_XY
6         TBUFFER_STORE_FORMAT_XYZ
7         TBUFFER_STORE_FORMAT_XYZW
8         TBUFFER_LOAD_D16_FORMAT_X
9         TBUFFER_LOAD_D16_FORMAT_XY
10        TBUFFER_LOAD_D16_FORMAT_XYZ
11        TBUFFER_LOAD_D16_FORMAT_XYZW
12        TBUFFER_STORE_D16_FORMAT_X
13        TBUFFER_STORE_D16_FORMAT_XY
14        TBUFFER_STORE_D16_FORMAT_XYZ
15        TBUFFER_STORE_D16_FORMAT_XYZW

15.7.2. MUBUF

Description

Memory Untyped-Buffer Instructions
Table 103. MUBUF Fields

Field Name

Bits

Format or Description

OFFSET

[11:0]

Address offset, unsigned byte.

SLC

[12]

System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.

DLC

[13]

0 = normal, 1 = Device Coherent

GLC

[14]

0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op
value to VGPR.

OP

[25:18]

Opcode. See table below.

ENCODING

[31:26]

'b111000

VADDR

[39:32]

Address of VGPR to supply first component of address (offset or index). When both
index and offset are used, index is in the first VGPR and offset in the second.

VDATA

[47:40]

Address of VGPR to supply first component of write data or receive first component
of read-data.

SRSRC

[52:48]

SGPR to supply V# (resource constant) in 4 or 8 consecutive SGPRs. It is missing 2
LSB’s of SGPR-address since it is aligned to 4 SGPRs.

TFE

[53]

Partially resident texture, texture fault enable.

OFFEN

[54]

1 = enable offset VGPR, 0 = use zero for address offset

IDXEN

[55]

1 = enable index VGPR, 0 = use zero for address index

SOFFSET

[63:56]

Address offset, unsigned byte.

Table 104. MUBUF Opcodes
Opcode #  Name
0         BUFFER_LOAD_FORMAT_X
1         BUFFER_LOAD_FORMAT_XY
2         BUFFER_LOAD_FORMAT_XYZ
3         BUFFER_LOAD_FORMAT_XYZW
4         BUFFER_STORE_FORMAT_X
5         BUFFER_STORE_FORMAT_XY
6         BUFFER_STORE_FORMAT_XYZ
7         BUFFER_STORE_FORMAT_XYZW
8         BUFFER_LOAD_D16_FORMAT_X
9         BUFFER_LOAD_D16_FORMAT_XY
10        BUFFER_LOAD_D16_FORMAT_XYZ
11        BUFFER_LOAD_D16_FORMAT_XYZW
12        BUFFER_STORE_D16_FORMAT_X
13        BUFFER_STORE_D16_FORMAT_XY
14        BUFFER_STORE_D16_FORMAT_XYZ
15        BUFFER_STORE_D16_FORMAT_XYZW
16        BUFFER_LOAD_U8
17        BUFFER_LOAD_I8
18        BUFFER_LOAD_U16
19        BUFFER_LOAD_I16
20        BUFFER_LOAD_B32
21        BUFFER_LOAD_B64
22        BUFFER_LOAD_B96
23        BUFFER_LOAD_B128
24        BUFFER_STORE_B8
25        BUFFER_STORE_B16
26        BUFFER_STORE_B32
27        BUFFER_STORE_B64
28        BUFFER_STORE_B96
29        BUFFER_STORE_B128
30        BUFFER_LOAD_D16_U8
31        BUFFER_LOAD_D16_I8
32        BUFFER_LOAD_D16_B16
33        BUFFER_LOAD_D16_HI_U8
34        BUFFER_LOAD_D16_HI_I8
35        BUFFER_LOAD_D16_HI_B16
36        BUFFER_STORE_D16_HI_B8
37        BUFFER_STORE_D16_HI_B16
38        BUFFER_LOAD_D16_HI_FORMAT_X
39        BUFFER_STORE_D16_HI_FORMAT_X
43        BUFFER_GL0_INV
44        BUFFER_GL1_INV
45        BUFFER_LOAD_LDS_U8
46        BUFFER_LOAD_LDS_I8
47        BUFFER_LOAD_LDS_U16
48        BUFFER_LOAD_LDS_I16
49        BUFFER_LOAD_LDS_B32
50        BUFFER_LOAD_LDS_FORMAT_X
51        BUFFER_ATOMIC_SWAP_B32
52        BUFFER_ATOMIC_CMPSWAP_B32
53        BUFFER_ATOMIC_ADD_U32
54        BUFFER_ATOMIC_SUB_U32
55        BUFFER_ATOMIC_CSUB_U32
56        BUFFER_ATOMIC_MIN_I32
57        BUFFER_ATOMIC_MIN_U32
58        BUFFER_ATOMIC_MAX_I32
59        BUFFER_ATOMIC_MAX_U32
60        BUFFER_ATOMIC_AND_B32
61        BUFFER_ATOMIC_OR_B32
62        BUFFER_ATOMIC_XOR_B32
63        BUFFER_ATOMIC_INC_U32
64        BUFFER_ATOMIC_DEC_U32
65        BUFFER_ATOMIC_SWAP_B64
66        BUFFER_ATOMIC_CMPSWAP_B64
67        BUFFER_ATOMIC_ADD_U64
68        BUFFER_ATOMIC_SUB_U64
69        BUFFER_ATOMIC_MIN_I64
70        BUFFER_ATOMIC_MIN_U64
71        BUFFER_ATOMIC_MAX_I64
72        BUFFER_ATOMIC_MAX_U64
73        BUFFER_ATOMIC_AND_B64
74        BUFFER_ATOMIC_OR_B64
75        BUFFER_ATOMIC_XOR_B64
76        BUFFER_ATOMIC_INC_U64
77        BUFFER_ATOMIC_DEC_U64
80        BUFFER_ATOMIC_CMPSWAP_F32
81        BUFFER_ATOMIC_MIN_F32
82        BUFFER_ATOMIC_MAX_F32
86        BUFFER_ATOMIC_ADD_F32

15.8. Vector Memory Image Format
15.8.1. MIMG

Description

Memory Image Instructions

Memory Image instructions (MIMG format) can be between 2 and 3 DWORDs. There are two variations of the
instruction:
• Normal, where the address VGPRs are specified in the "ADDR" field, and are a contiguous set of VGPRs.
This is a 2-DWORD instruction.
• Non-Sequential-Address (NSA), where each address VGPR is specified individually and the address VGPRs
can be scattered. This version uses 1 extra DWORD to specify the individual address VGPRs.
Table 105. MIMG Fields
Field Name

Bits

Format or Description

NSA

[0]

Non-sequential address. Specifies that an additional instruction DWORD exists
holding up to 4 unique VGPR addresses.

DIM

[4:2]

Dimensionality of the resource constant. Set to bits [3:1] of the resource type field.

UNRM

[7]

Force address to be un-normalized. User must set to 1 for Image stores & atomics.

DMASK

[11:8]

Data VGPR enable mask: 1 .. 4 consecutive VGPRs
Reads: defines which components are returned:
0=red,1=green,2=blue,3=alpha
Writes: defines which components are written with data from VGPRs (missing
components get 0).
Enabled components come from consecutive VGPRs.
E.G. dmask=1001 : Red is in VGPRn and alpha in VGPRn+1.
For D16 writes, DMASK is only used as a word count: each bit represents 16 bits of
data to be written starting at the LSB’s of VDATA, then MSBs, then VDATA+1 etc. Bit
position is ignored.

SLC

[12]

System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.

DLC

[13]

0 = normal, 1 = Device Coherent

GLC

[14]

0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op
value to VGPR.

R128

[15]

Resource constant size: 1 = 128bit, 0 = 256bit

A16

[16]

Address components are 16-bits (instead of the usual 32 bits).
When set, all address components are 16 bits (packed into 2 per DWORD), except:
Texel offsets (3 6bit UINT packed into 1 DWORD)
PCF reference (for "_C" instructions)
Address components are 16b uint for image ops without sampler; 16b float with
sampler.

D16

[17]

Data components are 16-bits (instead of the usual 32 bits).

OP

[25:18]

Opcode. See table below.

ENCODING

[31:26]

'b111100

VADDR

[39:32]

Address of VGPR to supply first component of address.

VDATA

[47:40]

Address of VGPR to supply first component of write data or receive first component
of read-data.

SRSRC

[52:48]

SGPR to supply V# (resource constant) in 4 or 8 consecutive SGPRs. It is missing 2
LSB’s of SGPR-address since it is aligned to 4 SGPRs.

TFE

[53]

Partially resident texture, texture fault enable.

LWE

[54]

LOD Warning Enable. When set to 1, a texture fetch may return "LOD_CLAMPED =
1".

SSAMP

[62:58]

SGPR to supply S# (sampler constant) in 4 consecutive SGPRs. It is missing 2
LSBs of SGPR-address since it is aligned to 4 SGPRs.

ADDR1

[71:64]

Second Address register or group. Present only when NSA=1.

ADDR2

[79:72]

Third Address register or group. Present only when NSA=1.
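
The DMASK rule in the field table above (enabled components occupy consecutive VGPRs, in red, green, blue, alpha order) can be sketched as follows. The helper name `dmask_layout` is hypothetical, and the sketch assumes 32-bit data, not the D16 word-count interpretation.

```python
# Sketch only: maps each enabled component to its VGPR offset from VDATA.
COMPONENTS = ("red", "green", "blue", "alpha")  # DMASK bits 0..3


def dmask_layout(dmask: int) -> dict:
    if not 0 <= dmask <= 0xF:
        raise ValueError("DMASK is a 4-bit field")
    enabled = [name for bit, name in enumerate(COMPONENTS) if dmask & (1 << bit)]
    # Enabled components are packed into consecutive VGPRs.
    return {name: offset for offset, name in enumerate(enabled)}
```

For the document's example, `dmask_layout(0b1001)` places red at offset 0 (VGPRn) and alpha at offset 1 (VGPRn+1).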

Table 106. MIMG Opcodes
Opcode #  Name
0         IMAGE_LOAD
1         IMAGE_LOAD_MIP
2         IMAGE_LOAD_PCK
3         IMAGE_LOAD_PCK_SGN
4         IMAGE_LOAD_MIP_PCK
5         IMAGE_LOAD_MIP_PCK_SGN
6         IMAGE_STORE
7         IMAGE_STORE_MIP
8         IMAGE_STORE_PCK
9         IMAGE_STORE_MIP_PCK
10        IMAGE_ATOMIC_SWAP
11        IMAGE_ATOMIC_CMPSWAP
12        IMAGE_ATOMIC_ADD
13        IMAGE_ATOMIC_SUB
14        IMAGE_ATOMIC_SMIN
15        IMAGE_ATOMIC_UMIN
16        IMAGE_ATOMIC_SMAX
17        IMAGE_ATOMIC_UMAX
18        IMAGE_ATOMIC_AND
19        IMAGE_ATOMIC_OR
20        IMAGE_ATOMIC_XOR
21        IMAGE_ATOMIC_INC
22        IMAGE_ATOMIC_DEC
23        IMAGE_GET_RESINFO
24        IMAGE_MSAA_LOAD
25        IMAGE_BVH_INTERSECT_RAY
26        IMAGE_BVH64_INTERSECT_RAY
27        IMAGE_SAMPLE
28        IMAGE_SAMPLE_D
29        IMAGE_SAMPLE_L
30        IMAGE_SAMPLE_B
31        IMAGE_SAMPLE_LZ
32        IMAGE_SAMPLE_C
33        IMAGE_SAMPLE_C_D
34        IMAGE_SAMPLE_C_L
35        IMAGE_SAMPLE_C_B
36        IMAGE_SAMPLE_C_LZ
37        IMAGE_SAMPLE_O
38        IMAGE_SAMPLE_D_O
39        IMAGE_SAMPLE_L_O
40        IMAGE_SAMPLE_B_O
41        IMAGE_SAMPLE_LZ_O
42        IMAGE_SAMPLE_C_O
43        IMAGE_SAMPLE_C_D_O
44        IMAGE_SAMPLE_C_L_O
45        IMAGE_SAMPLE_C_B_O
46        IMAGE_SAMPLE_C_LZ_O
47        IMAGE_GATHER4
48        IMAGE_GATHER4_L
49        IMAGE_GATHER4_B
50        IMAGE_GATHER4_LZ
51        IMAGE_GATHER4_C
52        IMAGE_GATHER4_C_LZ
53        IMAGE_GATHER4_O
54        IMAGE_GATHER4_LZ_O
55        IMAGE_GATHER4_C_LZ_O
56        IMAGE_GET_LOD
57        IMAGE_SAMPLE_D_G16
58        IMAGE_SAMPLE_C_D_G16
59        IMAGE_SAMPLE_D_O_G16
60        IMAGE_SAMPLE_C_D_O_G16
64        IMAGE_SAMPLE_CL
65        IMAGE_SAMPLE_D_CL
66        IMAGE_SAMPLE_B_CL
67        IMAGE_SAMPLE_C_CL
68        IMAGE_SAMPLE_C_D_CL
69        IMAGE_SAMPLE_C_B_CL
70        IMAGE_SAMPLE_CL_O
71        IMAGE_SAMPLE_D_CL_O
72        IMAGE_SAMPLE_B_CL_O
73        IMAGE_SAMPLE_C_CL_O
74        IMAGE_SAMPLE_C_D_CL_O
75        IMAGE_SAMPLE_C_B_CL_O
84        IMAGE_SAMPLE_C_D_CL_G16
85        IMAGE_SAMPLE_D_CL_O_G16
86        IMAGE_SAMPLE_C_D_CL_O_G16
95        IMAGE_SAMPLE_D_CL_G16
96        IMAGE_GATHER4_CL
97        IMAGE_GATHER4_B_CL
98        IMAGE_GATHER4_C_CL
99        IMAGE_GATHER4_C_L
100       IMAGE_GATHER4_C_B
101       IMAGE_GATHER4_C_B_CL
144       IMAGE_GATHER4H

15.9. Flat Formats
Flat memory instructions come in three versions:

FLAT
memory address (per work-item) may be in global memory, scratch (private) memory or shared memory
(LDS)
GLOBAL
same as FLAT, but assumes all memory addresses are global memory.
SCRATCH
same as FLAT, but assumes all memory addresses are scratch (private) memory.
The microcode format is identical for each, and only the value of the SEG (segment) field differs.

15.9.1. FLAT

Description

FLAT Memory Access
Table 107. FLAT Fields

Field Name

Bits

Format or Description

OFFSET

[12:0]

Address offset
Scratch, Global: 13-bit signed byte offset
FLAT: 12-bit unsigned offset (MSB is ignored)

DLC

[13]

0 = normal, 1 = Device Coherent

GLC

[14]

0 = normal, 1 = globally coherent (bypass L0 cache) or for atomics, return pre-op
value to VGPR.

SLC

[15]

System Level Coherent. Used in conjunction with DLC to determine L2 cache
policies.

SEG

[17:16]

Memory Segment (instruction type): 0 = flat, 1 = scratch, 2 = global.

OP

[24:18]

Opcode. See tables below for FLAT, SCRATCH and GLOBAL opcodes.

ENCODING

[31:26]

'b110111

ADDR

[39:32]

VGPR that holds address or offset. For 64-bit addresses, ADDR has the LSBs and
ADDR+1 has the MSBs. For offset a single VGPR has a 32 bit unsigned offset.
For FLAT_*: specifies an address.
For GLOBAL_* and SCRATCH_* when SADDR is NULL or 0x7f: specifies an address.
For GLOBAL_* and SCRATCH_* when SADDR is not NULL or 0x7f: specifies an
offset.

DATA

[47:40]

VGPR that supplies data.

SADDR

[54:48]

SGPR that provides an address or offset (unsigned). Set this field to NULL or
0x7f to disable use.
Meaning of this field is different for Scratch and Global:
FLAT: Unused
Scratch: use an SGPR for the address instead of a VGPR
Global: use the SGPR to provide a base address and the VGPR provides a 32-bit byte
offset.

SVE

[55]

Scratch VGPR Enable. 1 = scratch address includes a VGPR to provide an offset; 0 =
no VGPR used.

VDST

[63:56]

Destination VGPR for data returned from memory to VGPRs.
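
The OFFSET field in Table 107 is interpreted differently depending on SEG. The sketch below shows one way to read it; the function name `flat_offset` is hypothetical, and the SEG values follow the table (0 = flat, 1 = scratch, 2 = global).

```python
# Sketch only: interprets the 13-bit OFFSET field according to SEG.

def flat_offset(raw: int, seg: int) -> int:
    raw &= 0x1FFF                      # OFFSET occupies bits [12:0]
    if seg == 0:
        # FLAT: 12-bit unsigned offset, the MSB is ignored.
        return raw & 0x0FFF
    if seg in (1, 2):
        # Scratch / Global: 13-bit signed byte offset (two's complement).
        return raw - 0x2000 if raw & 0x1000 else raw
    raise ValueError("SEG must be 0 (flat), 1 (scratch) or 2 (global)")
```

The same raw field value `0x1FFF` therefore reads as 4095 for FLAT and as -1 for GLOBAL or SCRATCH.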

Table 108. FLAT Opcodes
Opcode #  Name
16        FLAT_LOAD_U8
17        FLAT_LOAD_I8
18        FLAT_LOAD_U16
19        FLAT_LOAD_I16
20        FLAT_LOAD_B32
21        FLAT_LOAD_B64
22        FLAT_LOAD_B96
23        FLAT_LOAD_B128
24        FLAT_STORE_B8
25        FLAT_STORE_B16
26        FLAT_STORE_B32
27        FLAT_STORE_B64
28        FLAT_STORE_B96
29        FLAT_STORE_B128
30        FLAT_LOAD_D16_U8
31        FLAT_LOAD_D16_I8
32        FLAT_LOAD_D16_B16
33        FLAT_LOAD_D16_HI_U8
34        FLAT_LOAD_D16_HI_I8
35        FLAT_LOAD_D16_HI_B16
36        FLAT_STORE_D16_HI_B8
37        FLAT_STORE_D16_HI_B16
51        FLAT_ATOMIC_SWAP_B32
52        FLAT_ATOMIC_CMPSWAP_B32
53        FLAT_ATOMIC_ADD_U32
54        FLAT_ATOMIC_SUB_U32
56        FLAT_ATOMIC_MIN_I32
57        FLAT_ATOMIC_MIN_U32
58        FLAT_ATOMIC_MAX_I32
59        FLAT_ATOMIC_MAX_U32
60        FLAT_ATOMIC_AND_B32
61        FLAT_ATOMIC_OR_B32
62        FLAT_ATOMIC_XOR_B32
63        FLAT_ATOMIC_INC_U32
64        FLAT_ATOMIC_DEC_U32
65        FLAT_ATOMIC_SWAP_B64
66        FLAT_ATOMIC_CMPSWAP_B64
67        FLAT_ATOMIC_ADD_U64
68        FLAT_ATOMIC_SUB_U64
69        FLAT_ATOMIC_MIN_I64
70        FLAT_ATOMIC_MIN_U64
71        FLAT_ATOMIC_MAX_I64
72        FLAT_ATOMIC_MAX_U64
73        FLAT_ATOMIC_AND_B64
74        FLAT_ATOMIC_OR_B64
75        FLAT_ATOMIC_XOR_B64
76        FLAT_ATOMIC_INC_U64
77        FLAT_ATOMIC_DEC_U64
80        FLAT_ATOMIC_CMPSWAP_F32
81        FLAT_ATOMIC_MIN_F32
82        FLAT_ATOMIC_MAX_F32
86        FLAT_ATOMIC_ADD_F32

15.9.2. GLOBAL
Table 109. GLOBAL Opcodes
Opcode #  Name
16        GLOBAL_LOAD_U8
17        GLOBAL_LOAD_I8
18        GLOBAL_LOAD_U16
19        GLOBAL_LOAD_I16
20        GLOBAL_LOAD_B32
21        GLOBAL_LOAD_B64
22        GLOBAL_LOAD_B96
23        GLOBAL_LOAD_B128
24        GLOBAL_STORE_B8
25        GLOBAL_STORE_B16
26        GLOBAL_STORE_B32
27        GLOBAL_STORE_B64
28        GLOBAL_STORE_B96
29        GLOBAL_STORE_B128
30        GLOBAL_LOAD_D16_U8
31        GLOBAL_LOAD_D16_I8
32        GLOBAL_LOAD_D16_B16
33        GLOBAL_LOAD_D16_HI_U8
34        GLOBAL_LOAD_D16_HI_I8
35        GLOBAL_LOAD_D16_HI_B16
36        GLOBAL_STORE_D16_HI_B8
37        GLOBAL_STORE_D16_HI_B16
40        GLOBAL_LOAD_ADDTID_B32
41        GLOBAL_STORE_ADDTID_B32
42        GLOBAL_LOAD_LDS_ADDTID_B32
45        GLOBAL_LOAD_LDS_U8
46        GLOBAL_LOAD_LDS_I8
47        GLOBAL_LOAD_LDS_U16
48        GLOBAL_LOAD_LDS_I16
49        GLOBAL_LOAD_LDS_B32
51        GLOBAL_ATOMIC_SWAP_B32
52        GLOBAL_ATOMIC_CMPSWAP_B32
53        GLOBAL_ATOMIC_ADD_U32
54        GLOBAL_ATOMIC_SUB_U32
55        GLOBAL_ATOMIC_CSUB_U32
56        GLOBAL_ATOMIC_MIN_I32
57        GLOBAL_ATOMIC_MIN_U32
58        GLOBAL_ATOMIC_MAX_I32
59        GLOBAL_ATOMIC_MAX_U32
60        GLOBAL_ATOMIC_AND_B32
61        GLOBAL_ATOMIC_OR_B32
62        GLOBAL_ATOMIC_XOR_B32
63        GLOBAL_ATOMIC_INC_U32
64        GLOBAL_ATOMIC_DEC_U32
65        GLOBAL_ATOMIC_SWAP_B64
66        GLOBAL_ATOMIC_CMPSWAP_B64
67        GLOBAL_ATOMIC_ADD_U64
68        GLOBAL_ATOMIC_SUB_U64
69        GLOBAL_ATOMIC_MIN_I64
70        GLOBAL_ATOMIC_MIN_U64
71        GLOBAL_ATOMIC_MAX_I64
72        GLOBAL_ATOMIC_MAX_U64
73        GLOBAL_ATOMIC_AND_B64
74        GLOBAL_ATOMIC_OR_B64
75        GLOBAL_ATOMIC_XOR_B64
76        GLOBAL_ATOMIC_INC_U64
77        GLOBAL_ATOMIC_DEC_U64
80        GLOBAL_ATOMIC_CMPSWAP_F32
81        GLOBAL_ATOMIC_MIN_F32
82        GLOBAL_ATOMIC_MAX_F32
86        GLOBAL_ATOMIC_ADD_F32

15.9.3. SCRATCH
Table 110. SCRATCH Opcodes
Opcode #  Name
16        SCRATCH_LOAD_U8
17        SCRATCH_LOAD_I8
18        SCRATCH_LOAD_U16
19        SCRATCH_LOAD_I16
20        SCRATCH_LOAD_B32
21        SCRATCH_LOAD_B64
22        SCRATCH_LOAD_B96
23        SCRATCH_LOAD_B128
24        SCRATCH_STORE_B8
25        SCRATCH_STORE_B16
26        SCRATCH_STORE_B32
27        SCRATCH_STORE_B64
28        SCRATCH_STORE_B96
29        SCRATCH_STORE_B128
30        SCRATCH_LOAD_D16_U8
31        SCRATCH_LOAD_D16_I8
32        SCRATCH_LOAD_D16_B16
33        SCRATCH_LOAD_D16_HI_U8
34        SCRATCH_LOAD_D16_HI_I8
35        SCRATCH_LOAD_D16_HI_B16
36        SCRATCH_STORE_D16_HI_B8
37        SCRATCH_STORE_D16_HI_B16
45        SCRATCH_LOAD_LDS_U8
46        SCRATCH_LOAD_LDS_I8
47        SCRATCH_LOAD_LDS_U16
48        SCRATCH_LOAD_LDS_I16
49        SCRATCH_LOAD_LDS_B32

15.10. Export Format
15.10.1. EXP

Description

EXPORT instructions

The export format has only a single opcode, "EXPORT".
Table 111. EXP Fields
Field Name

Bits

Format or Description

EN

[3:0]

VGPR Enables: [0] enables VSRC0, … [3] enables VSRC3.

TARGET

[9:4]

Export destination:
0..7      MRT 0..7
8         Z
12-16     Position 0-4
20        Primitive data
21        Dual Source Blend Left
22        Dual Source Blend Right

DONE

[11]

Indicates that this is the last export from the shader. Used only for Position and
Pixel/color data.

ROW

[13]

Row to export

ENCODING

[31:26]

'b111110

VSRC0

[39:32]

VGPR for source 0.

VSRC1

[47:40]

VGPR for source 1.

VSRC2

[55:48]

VGPR for source 2.

VSRC3

[63:56]

VGPR for source 3.
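
A small sketch of how the EN and TARGET fields of Table 111 might be interpreted by a disassembler follows. Both function names are hypothetical, and TARGET values that the table does not list are simply reported as unlisted.

```python
# Sketch only: readable interpretation of the EXP EN and TARGET fields.

def export_target(target: int) -> str:
    if 0 <= target <= 7:
        return f"MRT {target}"
    if target == 8:
        return "Z"
    if 12 <= target <= 16:
        return f"Position {target - 12}"
    named = {20: "Primitive data",
             21: "Dual Source Blend Left",
             22: "Dual Source Blend Right"}
    # Assumption: values not shown in the table are reported as unlisted.
    return named.get(target, "unlisted")


def enabled_sources(en: int) -> list:
    # Bit i of EN enables VSRCi.
    return [i for i in range(4) if en & (1 << i)]
```

For instance, EN = 0b0101 enables VSRC0 and VSRC2, and TARGET = 13 names Position 1.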

Chapter 16. Instructions
This chapter lists, and provides descriptions for, all instructions in the RDNA3 Generation environment.
Instructions are grouped according to their format.
Note: Rounding and Denormal modes apply to all floating-point operations unless otherwise specified in the
instruction description.

16.1. SOP2 Instructions

Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction.

S_ADD_U32

0

Add two unsigned integers with carry-out.
tmp = 64'U(S0.u) + 64'U(S1.u);
SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
// unsigned overflow or carry-out for S_ADDC_U32.
D0.u = tmp.u

S_SUB_U32

1

Subtract the second unsigned integer from the first with carry-out.
tmp = S0.u - S1.u;
SCC = S1.u > S0.u ? 1'1U : 1'0U;
// unsigned overflow or carry-out for S_SUBB_U32.
D0.u = tmp.u

S_ADD_I32

2

Add two signed integers with carry-out.
tmp = S0.i + S1.i;
SCC = ((S0.u[31] == S1.u[31]) && (S0.u[31] != tmp.u[31]));
// signed overflow.
D0.i = tmp.i

Notes
This opcode is not suitable for use with S_ADDC_U32 for implementing 64-bit operations.

S_SUB_I32

3

Subtract the second signed integer from the first with carry-out.
tmp = S0.i - S1.i;
SCC = ((S0.u[31] != S1.u[31]) && (S0.u[31] != tmp.u[31]));
// signed overflow.
D0.i = tmp.i

Notes
This opcode is not suitable for use with S_SUBB_U32 for implementing 64-bit operations.

S_ADDC_U32

4

Add two unsigned integers with carry-in and carry-out.
tmp = 64'U(S0.u) + 64'U(S1.u) + SCC.u64;
SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
// unsigned overflow.
D0.u = tmp.u
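
The carry chain formed by S_ADD_U32 and S_ADDC_U32 can be illustrated with a small emulation of the pseudocode above. The Python function names are hypothetical; the sketch models registers as plain integers and SCC as 0 or 1.

```python
# Sketch only: emulates S_ADD_U32 / S_ADDC_U32 and chains them into a 64-bit add.
MASK32 = 0xFFFFFFFF


def s_add_u32(s0: int, s1: int):
    tmp = (s0 & MASK32) + (s1 & MASK32)
    return tmp & MASK32, int(tmp >= 0x100000000)   # (D0, SCC)


def s_addc_u32(s0: int, s1: int, scc: int):
    tmp = (s0 & MASK32) + (s1 & MASK32) + scc      # carry-in from SCC
    return tmp & MASK32, int(tmp >= 0x100000000)


def add_u64(a: int, b: int):
    lo, scc = s_add_u32(a & MASK32, b & MASK32)    # low halves, sets carry
    hi, scc = s_addc_u32(a >> 32, b >> 32, scc)    # high halves consume carry
    return (hi << 32) | lo, scc
```

Adding 1 to `0x00000001FFFFFFFF` carries out of the low half and yields `0x0000000200000000` with a final SCC of 0.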

S_SUBB_U32

5

Subtract the second unsigned integer from the first with carry-in and carry-out.
tmp = S0.u - S1.u - SCC.u;
SCC = 64'U(S1.u) + SCC.u64 > 64'U(S0.u) ? 1'1U : 1'0U;
// unsigned overflow.
D0.u = tmp.u

S_ABSDIFF_I32

6

Compute the absolute value of difference between two values.
D0.i = S0.i - S1.i;
if D0.i < 0 then
D0.i = -D0.i
endif;
SCC = D0.i != 0

Notes
Functional examples:
S_ABSDIFF_I32(0x00000002, 0x00000005) ⇒ 0x00000003
S_ABSDIFF_I32(0xffffffff, 0x00000000) ⇒ 0x00000001
S_ABSDIFF_I32(0x80000000, 0x00000000) ⇒ 0x80000000 // Note: result is negative!
S_ABSDIFF_I32(0x80000000, 0x00000001) ⇒ 0x7fffffff
S_ABSDIFF_I32(0x80000000, 0xffffffff) ⇒ 0x7fffffff
S_ABSDIFF_I32(0x80000000, 0xfffffffe) ⇒ 0x7ffffffe
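
The functional examples above, including the negative result for `0x80000000`, follow from 32-bit two's-complement wraparound. A minimal emulation, with a hypothetical function name and registers modeled as unsigned 32-bit integers, is sketched here:

```python
# Sketch only: emulates S_ABSDIFF_I32 with explicit 32-bit wraparound.
MASK32 = 0xFFFFFFFF


def s_absdiff_i32(s0: int, s1: int):
    d0 = (s0 - s1) & MASK32          # D0.i = S0.i - S1.i (wraps at 32 bits)
    if d0 & 0x80000000:              # D0.i < 0
        d0 = (-d0) & MASK32          # negating 0x80000000 wraps to itself
    return d0, int(d0 != 0)          # (D0, SCC)
```

The wraparound in the negation step is why `s_absdiff_i32(0x80000000, 0)` still returns `0x80000000`.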

S_LSHL_B32

8

Logical shift left.
D0.u = S0.u << S1.u[4 : 0].u;
SCC = D0.u != 0U

S_LSHL_B64

9

Logical shift left.
D0.u64 = S0.u64 << S1.u[5 : 0].u;
SCC = D0.u64 != 0ULL

S_LSHR_B32

10

Logical shift right.
D0.u = S0.u >> S1.u[4 : 0].u;
SCC = D0.u != 0U

S_LSHR_B64

11

Logical shift right.
D0.u64 = S0.u64 >> S1.u[5 : 0].u;
SCC = D0.u64 != 0ULL

S_ASHR_I32

12

Arithmetic shift right (preserve sign bit).
D0.i = 32'I(signext(S0.i) >> S1.u[4 : 0].u);
SCC = D0.i != 0

S_ASHR_I64

13

Arithmetic shift right (preserve sign bit).
D0.i64 = signext(S0.i64) >> S1.u[5 : 0].u;
SCC = D0.i64 != 0LL

S_LSHL1_ADD_U32

14

Logical shift left by 1 bit and then add.
tmp = (64'U(S0.u) << 1U) + 64'U(S1.u);
SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
// unsigned overflow.
D0.u = tmp.u

S_LSHL2_ADD_U32

15

Logical shift left by 2 bits and then add.
tmp = (64'U(S0.u) << 2U) + 64'U(S1.u);
SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
// unsigned overflow.
D0.u = tmp.u

S_LSHL3_ADD_U32

16

Logical shift left by 3 bits and then add.
tmp = (64'U(S0.u) << 3U) + 64'U(S1.u);
SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
// unsigned overflow.
D0.u = tmp.u

S_LSHL4_ADD_U32

17

Logical shift left by 4 bits and then add.
tmp = (64'U(S0.u) << 4U) + 64'U(S1.u);
SCC = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
// unsigned overflow.
D0.u = tmp.u
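
The S_LSHLn_ADD_U32 family shares one pattern: shift the first operand left by a fixed amount, add the second, and report carry-out in SCC. One plausible use is scaling an index by an element size of 2, 4, 8 or 16 bytes and adding a base. The sketch below uses a hypothetical function name and takes the shift amount as a parameter.

```python
# Sketch only: emulates S_LSHL{1,2,3,4}_ADD_U32 with the shift as a parameter.
MASK32 = 0xFFFFFFFF


def s_lshl_add_u32(s0: int, s1: int, shift: int):
    if shift not in (1, 2, 3, 4):
        raise ValueError("only shifts of 1 to 4 bits have opcodes")
    tmp = ((s0 & MASK32) << shift) + (s1 & MASK32)   # computed in 64 bits
    return tmp & MASK32, int(tmp >= 0x100000000)     # (D0, SCC)
```

For example, index `0x10` scaled by 4 and added to base `0x1000` gives `0x1040` with SCC = 0.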

S_MIN_I32

18

Minimum of two signed integers.
SCC = S0.i < S1.i;
D0.i = SCC ? S0.i : S1.i

S_MIN_U32

19

Minimum of two unsigned integers.
SCC = S0.u < S1.u;
D0.u = SCC ? S0.u : S1.u

S_MAX_I32

20

Maximum of two signed integers.
SCC = S0.i > S1.i;
D0.i = SCC ? S0.i : S1.i

S_MAX_U32

21

Maximum of two unsigned integers.
SCC = S0.u > S1.u;
D0.u = SCC ? S0.u : S1.u

S_AND_B32

22

Bitwise AND.
D0.u = (S0.u & S1.u);
SCC = D0.u != 0U

S_AND_B64

23

Bitwise AND.
D0.u64 = (S0.u64 & S1.u64);
SCC = D0.u64 != 0ULL

S_OR_B32

24

Bitwise OR.
D0.u = (S0.u | S1.u);
SCC = D0.u != 0U

S_OR_B64

25

Bitwise OR.
D0.u64 = (S0.u64 | S1.u64);
SCC = D0.u64 != 0ULL

S_XOR_B32

26

Bitwise XOR.
D0.u = (S0.u ^ S1.u);
SCC = D0.u != 0U

S_XOR_B64

27

Bitwise XOR.
D0.u64 = (S0.u64 ^ S1.u64);
SCC = D0.u64 != 0ULL

S_NAND_B32

28

Bitwise NAND.
D0.u = ~(S0.u & S1.u);
SCC = D0.u != 0U

S_NAND_B64

29

Bitwise NAND.
D0.u64 = ~(S0.u64 & S1.u64);
SCC = D0.u64 != 0ULL

S_NOR_B32

30

Bitwise NOR.
D0.u = ~(S0.u | S1.u);
SCC = D0.u != 0U

S_NOR_B64

31

Bitwise NOR.
D0.u64 = ~(S0.u64 | S1.u64);
SCC = D0.u64 != 0ULL

S_XNOR_B32

32

Bitwise XNOR.
D0.u = ~(S0.u ^ S1.u);
SCC = D0.u != 0U

S_XNOR_B64

33

Bitwise XNOR.
D0.u64 = ~(S0.u64 ^ S1.u64);
SCC = D0.u64 != 0ULL

S_AND_NOT1_B32

34

Bitwise AND with negated second argument.
D0.u = (S0.u & ~S1.u);
SCC = D0.u != 0U

S_AND_NOT1_B64

35

Bitwise AND with negated second argument.
D0.u64 = (S0.u64 & ~S1.u64);
SCC = D0.u64 != 0ULL

S_OR_NOT1_B32

36

Bitwise OR with negated second argument.
D0.u = (S0.u | ~S1.u);
SCC = D0.u != 0U

S_OR_NOT1_B64

37

Bitwise OR with negated second argument.
D0.u64 = (S0.u64 | ~S1.u64);
SCC = D0.u64 != 0ULL

S_BFE_U32

38

Bitfield extract. Extract unsigned bitfield from first operand using field offset and field size encoded in second
operand.
D0.u = ((S0.u >> S1.u[4 : 0].u) & 32'U((1 << S1.u[22 : 16].u) - 1));
SCC = D0.u != 0U

S_BFE_I32

39

Bitfield extract. Extract signed bitfield from first operand using field offset and field size encoded in second
operand.
tmp = ((S0.i >> S1.u[4 : 0].u) & ((1 << S1.u[22 : 16].u) - 1));
D0.i = 32'I(signextFromBit(tmp, S1.i[22 : 16].i));
SCC = D0.i != 0
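
In both 32-bit bitfield-extract opcodes, the second operand packs the field offset in bits [4:0] and the field width in bits [22:16]. The sketch below emulates the pseudocode with hypothetical function names. It assumes that `signextFromBit(tmp, width)` treats bit `width - 1` as the sign bit and that a width of 0 yields 0; the source does not spell out either detail.

```python
# Sketch only: emulates S_BFE_U32 and S_BFE_I32 on unsigned 32-bit integers.
MASK32 = 0xFFFFFFFF


def _offset_width(s1: int):
    return s1 & 0x1F, (s1 >> 16) & 0x7F      # S1[4:0], S1[22:16]


def s_bfe_u32(s0: int, s1: int):
    offset, width = _offset_width(s1)
    d0 = ((s0 & MASK32) >> offset) & ((1 << width) - 1) & MASK32
    return d0, int(d0 != 0)                   # (D0, SCC)


def s_bfe_i32(s0: int, s1: int):
    offset, width = _offset_width(s1)
    signed = s0 - (1 << 32) if s0 & 0x80000000 else s0   # view S0 as signed
    tmp = (signed >> offset) & ((1 << width) - 1)         # arithmetic shift
    # Assumption: the sign bit of the extracted field is bit (width - 1).
    if width and tmp & (1 << (width - 1)):
        tmp -= 1 << width
    d0 = tmp & MASK32
    return d0, int(d0 != 0)
```

Extracting 8 bits at offset 4 from `0x12345678` uses `S1 = (8 << 16) | 4` and returns `0x67`.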

S_BFE_U64

40

Bitfield extract. Extract unsigned bitfield from first operand using field offset and field size encoded in second
operand.
D0.u64 = ((S0.u64 >> S1.u[5 : 0].u) & 64'U((1U << S1.u[22 : 16].u) - 1U));
SCC = D0.u64 != 0ULL

S_BFE_I64

41

Bitfield extract. Extract signed bitfield from first operand using field offset and field size encoded in second
operand.
tmp = ((S0.i64 >> S1.u[5 : 0].u) & 64'I((1 << S1.u[22 : 16].u) - 1));
D0.i64 = signextFromBit(tmp, S1.i[22 : 16].i64);
SCC = D0.i64 != 0LL

S_BFM_B32

42

Bitfield mask.
D0.u = 32'U((1 << S0.u[4 : 0].u) - 1 << S1.u[4 : 0].u)
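
Read with C-style precedence, the expression above builds a run of S0[4:0] one-bits and then shifts that run left by S1[4:0]. A short emulation, using a hypothetical function name, makes the grouping explicit:

```python
# Sketch only: emulates S_BFM_B32; S0 gives the mask width, S1 the mask offset.
MASK32 = 0xFFFFFFFF


def s_bfm_b32(s0: int, s1: int) -> int:
    width = s0 & 0x1F                 # S0.u[4:0]
    offset = s1 & 0x1F                # S1.u[4:0]
    return (((1 << width) - 1) << offset) & MASK32
```

A width of 4 at offset 8 gives the mask `0x00000F00`; bits shifted past bit 31 are discarded.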

S_BFM_B64

43

Bitfield mask.
D0.u64 = (1ULL << S0.u[5 : 0].u) - 1ULL << S1.u[5 : 0].u

S_MUL_I32

44

Multiply two signed integers.
D0.i = S0.i * S1.i

S_MUL_HI_U32

45

Multiply two unsigned integers and store the high 32 bits.
D0.u = 32'U(64'U(S0.u) * 64'U(S1.u) >> 32U)

S_MUL_HI_I32

46

Multiply two signed integers and store the high 32 bits.
D0.i = 32'I(64'I(S0.i) * 64'I(S1.i) >> 32U)

S_CSELECT_B32

48

Conditional select based on scalar condition code.
D0.u = SCC ? S0.u : S1.u

S_CSELECT_B64

49

Conditional select based on scalar condition code.
D0.u64 = SCC ? S0.u64 : S1.u64

S_PACK_LL_B32_B16

50

Pack two short values into the destination.
D0 = { S1[15 : 0].u16, S0[15 : 0].u16 }

S_PACK_LH_B32_B16

51

Pack two short values into the destination.
D0 = { S1[31 : 16].u16, S0[15 : 0].u16 }

S_PACK_HH_B32_B16

52

Pack two short values into the destination.
D0 = { S1[31 : 16].u16, S0[31 : 16].u16 }

S_PACK_HL_B32_B16

53

Pack two short values into the destination.
D0 = { S1[15 : 0].u16, S0[31 : 16].u16 }
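
The four pack opcodes differ only in which half of each source they take. The sketch below assumes the `{ a, b }` notation is a Verilog-style concatenation in which the left element lands in the high 16 bits of D0; the function names are hypothetical.

```python
# Sketch only: emulates the S_PACK_*_B32_B16 family.
# Assumption: in "{ a, b }" the left element occupies D0[31:16].

def _lo(x: int) -> int:
    return x & 0xFFFF


def _hi(x: int) -> int:
    return (x >> 16) & 0xFFFF


def s_pack_ll(s0: int, s1: int) -> int:
    return (_lo(s1) << 16) | _lo(s0)


def s_pack_lh(s0: int, s1: int) -> int:
    return (_hi(s1) << 16) | _lo(s0)


def s_pack_hh(s0: int, s1: int) -> int:
    return (_hi(s1) << 16) | _hi(s0)


def s_pack_hl(s0: int, s1: int) -> int:
    return (_lo(s1) << 16) | _hi(s0)
```

With `S0 = 0x1111AAAA` and `S1 = 0x2222BBBB`, the LL form yields `0xBBBBAAAA` and the HH form yields `0x22221111`.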

16.2. SOPK Instructions

Instructions in this format may not use a 32-bit literal constant that occurs immediately after the instruction.

S_MOVK_I32

0

Sign extension from a 16-bit constant.
D0.i = 32'I(signext(SIMM16.i16))

S_VERSION

1

Do nothing. This opcode is used to specify the microcode version for tools that interpret shader microcode.
Argument is ignored by hardware. This opcode is not designed for inserting wait states as the next instruction
may issue in the same cycle. Do not use this opcode to resolve wait state hazards, use S_NOP instead.
This opcode may also be used to validate microcode is running with the correct compatibility settings in
drivers and functional models that support multiple generations. We strongly encourage this opcode be
included at the top of every shader block to simplify debug and catch configuration errors.
This opcode must appear in the first 16 bytes of a block of shader code in order to be recognized by external
tools and functional models. Avoid placing opcodes > 32 bits or encodings that are not available in all versions
of the microcode before the S_VERSION opcode. If this opcode is absent then tools are allowed to make an
educated guess of the microcode version using cues from the environment; the guess may be incorrect and
lead to an invalid decode. It is highly recommended that this be the first opcode of a shader block except for
trap handlers, where it should be the second opcode (allowing the first opcode to be a 32-bit branch to
accommodate context switch).
SIMM16[7:0] specifies the microcode version.
SIMM16[15:8] must be set to zero.
nop();
// Do nothing - for use by tools only

S_CMOVK_I32

2

Conditional move with sign extension.
if SCC then
D0.i = 32'I(signext(SIMM16.i16))
endif

S_CMPK_EQ_I32

3

Argument is equal to constant.
SCC = 64'I(S0.i) == signext(SIMM16.i16)

S_CMPK_LG_I32

4

Argument is not equal to constant.
SCC = 64'I(S0.i) != signext(SIMM16.i16)

S_CMPK_GT_I32

5

Argument is greater than constant.
SCC = 64'I(S0.i) > signext(SIMM16.i16)

S_CMPK_GE_I32

6

Argument is greater than or equal to constant.
SCC = 64'I(S0.i) >= signext(SIMM16.i16)

S_CMPK_LT_I32

7

Argument is less than constant.
SCC = 64'I(S0.i) < signext(SIMM16.i16)

S_CMPK_LE_I32

8

Argument is less than or equal to constant.
SCC = 64'I(S0.i) <= signext(SIMM16.i16)

S_CMPK_EQ_U32

9

Argument is equal to constant.
SCC = S0.u == 32'U(SIMM16.u16)

S_CMPK_LG_U32

10

Argument is not equal to constant.
SCC = S0.u != 32'U(SIMM16.u16)

S_CMPK_GT_U32

11

Argument is greater than constant.
SCC = S0.u > 32'U(SIMM16.u16)

S_CMPK_GE_U32

12

Argument is greater than or equal to constant.
SCC = S0.u >= 32'U(SIMM16.u16)

S_CMPK_LT_U32

13

Argument is less than constant.
SCC = S0.u < 32'U(SIMM16.u16)

S_CMPK_LE_U32

14

Argument is less than or equal to constant.
SCC = S0.u <= 32'U(SIMM16.u16)
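
The signed and unsigned S_CMPK forms treat the same 16-bit immediate differently: the I32 forms sign-extend it, while the U32 forms zero-extend it. The sketch below illustrates this with the "greater than" pair. It is a Python model of the pseudocode only, and the function names are illustrative.

```python
def sext16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def to_i32(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def s_cmpk_gt_i32(s0, simm16):
    # Signed form: S0 is read as signed and the immediate is sign-extended
    return to_i32(s0) > sext16(simm16)

def s_cmpk_gt_u32(s0, simm16):
    # Unsigned form: S0 is read as unsigned and the immediate is zero-extended
    return (s0 & 0xFFFFFFFF) > (simm16 & 0xFFFF)
```

With SIMM16 = 0xFFFF, the signed form compares against -1 and the unsigned form compares against 65535, so the two opcodes can produce different SCC values for the same operands.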

S_ADDK_I32

15

Add a 16-bit signed constant to the destination with carry-out.
tmp = D0.i;
// save value so we can check sign bits for overflow later.
D0.i = 32'I(64'I(D0.i) + signext(SIMM16.i16));
SCC = ((tmp[31] == SIMM16.i16[15]) && (tmp[31] != D0.i[31]));
// signed overflow.
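
As a rough illustration of the pseudocode above, the following Python sketch computes the 32-bit result and the SCC flag, which per the pseudocode reports signed overflow. The function names are illustrative and not part of the ISA.

```python
def sext16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def s_addk_i32(d0, simm16):
    """Return (new D0 as a 32-bit pattern, SCC)."""
    d0 &= 0xFFFFFFFF
    imm = simm16 & 0xFFFF
    # Adding in wide precision and truncating matches two's complement addition
    result = (d0 + sext16(imm)) & 0xFFFFFFFF
    old_sign = d0 >> 31
    imm_sign = imm >> 15
    new_sign = result >> 31
    # Overflow: both inputs share a sign and the result has the other sign
    scc = (old_sign == imm_sign) and (old_sign != new_sign)
    return result, scc
```

Overflow can only occur when the destination and the constant have the same sign, which is why the pseudocode checks the sign bits of both inputs before comparing against the result.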

S_MULK_I32

16

Multiply a 16-bit signed constant with the destination.
D0.i = 32'I(64'I(D0.i) * signext(SIMM16.i16))

S_GETREG_B32

17

Read some or all of a hardware register into the LSBs of destination.
The SIMM16 argument is encoded as follows:
ID = SIMM16[5:0]
ID of hardware register to access.
OFFSET = SIMM16[10:6]
LSB offset of register bits to access.
SIZE = SIMM16[15:11]
Size of register bits to access, minus 1. Set this field to 31 to read/write all bits of the hardware register.
hwRegId = SIMM16.u16[5 : 0];
offset = SIMM16.u16[10 : 6];
size = SIMM16.u16[15 : 11].u + 1U;
// logical size is in range 1:32
value = HW_REGISTERS[hwRegId];
D0.u = 32'U(32'I(value >> offset.u) & ((1 << size) - 1))
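
The SIMM16 field layout and the extraction step can be sketched in Python as follows. The helper `encode_hwreg`, the dictionary used to stand in for HW_REGISTERS, and the register ID in the usage note are hypothetical and chosen only for illustration.

```python
def encode_hwreg(reg_id, offset, size):
    """Pack ID, OFFSET and SIZE-1 into a 16-bit SIMM16 value."""
    if not (0 <= reg_id < 64 and 0 <= offset < 32 and 1 <= size <= 32):
        raise ValueError("field out of range")
    return reg_id | (offset << 6) | ((size - 1) << 11)

def s_getreg_b32(hw_registers, simm16):
    """Extract the selected bit field; hw_registers maps ID -> 32-bit value."""
    reg_id = simm16 & 0x3F                  # SIMM16[5:0]
    offset = (simm16 >> 6) & 0x1F           # SIMM16[10:6]
    size = ((simm16 >> 11) & 0x1F) + 1      # SIMM16[15:11] + 1, range 1..32
    value = hw_registers[reg_id] & 0xFFFFFFFF
    return (value >> offset) & ((1 << size) - 1)
```

For example, reading 8 bits starting at bit 4 of a hypothetical register 1 that holds 0x12345678 uses `encode_hwreg(1, 4, 8)` and returns 0x67 in the low bits of the destination.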

S_SETREG_B32

18

Write some or all of the LSBs of source argument into a hardware register.
The SIMM16 argument is encoded as follows:
ID = SIMM16[5:0]
ID of hardware register to access.
OFFSET = SIMM16[10:6]
LSB offset of register bits to access.
SIZE = SIMM16[15:11]
Size of register bits to access, minus 1. Set this field to 31 to read/write all bits of the hardware register.
hwRegId = SIMM16.u16[5 : 0];
offset = SIMM16.u16[10 : 6];
size = SIMM16.u16[15 : 11].u + 1U;
// logical size is in range 1:32
mask = (1 << size) - 1;
mask = (mask & 32'I(writeableBitMask(hwRegId.u, WAVE_STATUS.PRIV)));
// Mask of bits we are allowed to modify
value = ((S0.u << offset.u) & mask.u);
value = (value | 32'U(HW_REGISTERS[hwRegId].i & ~mask));
HW_REGISTERS[hwRegId] = value.b;
// Side-effects may trigger here if certain bits are modified

S_SETREG_IMM32_B32

19

Write some or all of the LSBs of a 32-bit literal constant into a hardware register; this instruction requires a
32-bit literal constant.
The SIMM16 argument is encoded as follows:
ID = SIMM16[5:0]
ID of hardware register to access.
OFFSET = SIMM16[10:6]
LSB offset of register bits to access.
SIZE = SIMM16[15:11]
Size of register bits to access, minus 1. Set this field to 31 to read/write all bits of the hardware register.
hwRegId = SIMM16.u16[5 : 0];
offset = SIMM16.u16[10 : 6];
size = SIMM16.u16[15 : 11].u + 1U;
// logical size is in range 1:32
mask = (1 << size) - 1;
mask = (mask & 32'I(writeableBitMask(hwRegId.u, WAVE_STATUS.PRIV)));
// Mask of bits we are allowed to modify
value = ((SIMM32.u << offset.u) & mask.u);
value = (value | 32'U(HW_REGISTERS[hwRegId].i & ~mask));
HW_REGISTERS[hwRegId] = value.b;
// Side-effects may trigger here if certain bits are modified

S_CALL_B64

20

Short call to label.
Implements a short call, where the return address (the next instruction after the S_CALL_B64) is saved to D.
Long calls should consider S_SWAPPC_B64 instead.
D0.i64 = PC + 4LL;
PC = PC + signext(SIMM16.i16 * 16'4) + 4LL

Notes
This instruction must be 4 bytes.
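
The return address and branch target can be sketched in Python as below. This assumes the multiply by 4 is applied to the sign-extended immediate without truncation, so that the displacement is counted in 4-byte units relative to the instruction after the call. The function names are illustrative.

```python
def sext16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def s_call_b64(pc, simm16):
    """Return (return_address, new_pc) for a call located at byte address pc."""
    return_address = pc + 4                  # D0: instruction after the call
    new_pc = pc + sext16(simm16) * 4 + 4     # displacement in 4-byte units
    return return_address, new_pc
```

Under this assumption an immediate of 0 falls through to the next instruction, and an immediate of 0xFFFF (that is, -1) branches back to the call itself.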

S_WAITCNT_VSCNT

24

Wait for the counts of outstanding vector store events -- vector memory stores and atomics that DO NOT return
data -- to be at or below the specified level. This counter is not used in 'all-in-order' mode.
Waits for the following condition to hold before continuing:
vscnt <= S0.u[5:0] + S1.u[5:0].
// Comparison is 6 bits, no clamping is applied for add overflow

To wait on a literal constant only, write 'null' for the GPR argument.
This opcode may only appear inside a clause if the SGPR operand is set to NULL.
See also S_WAITCNT.

S_WAITCNT_VMCNT

25

Wait for the counts of outstanding vector memory events -- everything except for memory stores and
atomics-without-return -- to be at or below the specified level. When in 'all-in-order' mode, wait for all vector
memory events.
Waits for the following condition to hold before continuing:
vmcnt <= S0.u[5:0] + S1.u[5:0].
// Comparison is 6 bits, no clamping is applied for add overflow

To wait on a literal constant only, write 'null' for the GPR argument or use S_WAITCNT.
This opcode may only appear inside a clause if the SGPR operand is set to NULL.
See also S_WAITCNT.

S_WAITCNT_EXPCNT

26

Wait for the counts of outstanding export events to be at or below the specified level.
Waits for the following condition to hold before continuing:
expcnt <= S0.u[2:0] + S1.u[2:0].
// Comparison is 3 bits, no clamping is applied for add overflow

To wait on a literal constant only, write 'null' for the GPR argument or use S_WAITCNT.
This opcode may only appear inside a clause if the SGPR operand is set to NULL.
See also S_WAITCNT.

S_WAITCNT_LGKMCNT

27

Wait for the counts of outstanding DS (LG), scalar memory (K) and message (M) events to be at or below the
specified level.
Waits for the following condition to hold before continuing:
lgkmcnt <= S0.u[5:0] + S1.u[5:0].
// Comparison is 6 bits, no clamping is applied for add overflow

To wait on a literal constant only, write 'null' for the GPR argument or use S_WAITCNT.
This opcode may only appear inside a clause if the SGPR operand is set to NULL.
See also S_WAITCNT.

16.3. SOP1 Instructions

Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction.

S_MOV_B32

0

Move data to an SGPR.
D0.b = S0.b

S_MOV_B64

1

Move data to an SGPR.
D0.b64 = S0.b64

S_CMOV_B32

2

Conditionally move data to an SGPR when scalar condition code is true.
if SCC then
D0.b = S0.b
endif

S_CMOV_B64

3

Conditionally move data to an SGPR when scalar condition code is true.
if SCC then
D0.b64 = S0.b64
endif

S_BREV_B32

4

Reverse bits.
D0.u[31 : 0] = S0.u[0 : 31]

S_BREV_B64

5

Reverse bits.
D0.u64[63 : 0] = S0.u64[0 : 63]

S_CTZ_I32_B32

8

Count trailing zeros.
Returns the bit position of the first one from the LSB, or -1 if there are no ones.
tmp = -1;
// Set if no ones are found
for i in 0 : 31 do
// Search from LSB
if S0.u[i] == 1'1U then
tmp = i;
break
endif
endfor;
D0.i = tmp

Notes
Functional examples:
S_CTZ_I32_B32(0xaaaaaaaa) ⇒ 1
S_CTZ_I32_B32(0x55555555) ⇒ 0
S_CTZ_I32_B32(0x00000000) ⇒ 0xffffffff
S_CTZ_I32_B32(0xffffffff) ⇒ 0
S_CTZ_I32_B32(0x00010000) ⇒ 16
Compare with V_CTZ_I32_B32, which performs the equivalent operation in the vector ALU.
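
The search loop above can be transcribed into Python as a sketch. The function name mirrors the opcode for readability only, and the result is returned as a 32-bit pattern so that -1 appears as 0xffffffff, as in the functional examples.

```python
def s_ctz_i32_b32(s0):
    """Bit position of the lowest set bit, or 0xFFFFFFFF (-1) if none."""
    s0 &= 0xFFFFFFFF
    for i in range(32):          # search from the LSB upward
        if (s0 >> i) & 1:
            return i
    return 0xFFFFFFFF            # -1 as a 32-bit pattern
```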

S_CTZ_I32_B64

9

Count trailing zeros.
Returns the bit position of the first one from the LSB, or -1 if there are no ones.
tmp = -1;
// Set if no ones are found
for i in 0 : 63 do
// Search from LSB
if S0.u64[i] == 1'1U then
tmp = i;
break
endif
endfor;
D0.i = tmp

S_CLZ_I32_U32

10

Count leading zeros.
Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones.
tmp = -1;
// Set if no ones are found
for i in 0 : 31 do
// Search from MSB
if S0.u[31 - i] == 1'1U then
tmp = i;
break
endif
endfor;
D0.i = tmp

Notes
Functional examples:
S_CLZ_I32_U32(0x00000000) ⇒ 0xffffffff
S_CLZ_I32_U32(0x0000cccc) ⇒ 16
S_CLZ_I32_U32(0xffff3333) ⇒ 0
S_CLZ_I32_U32(0x7fffffff) ⇒ 1
S_CLZ_I32_U32(0x80000000) ⇒ 0
S_CLZ_I32_U32(0xffffffff) ⇒ 0
Compare with V_CLZ_I32_U32, which performs the equivalent operation in the vector ALU.

S_CLZ_I32_U64

11

Count leading zeros.
Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones.
tmp = -1;
// Set if no ones are found
for i in 0 : 63 do
// Search from MSB
if S0.u64[63 - i] == 1'1U then
tmp = i;
break
endif
endfor;
D0.i = tmp

S_CLS_I32

12

Count leading sign bits.
Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the
same.
tmp = -1;
// Set if all bits are the same
for i in 1 : 31 do
// Search from MSB
if S0.u[31 - i] != S0.u[31] then
tmp = i;
break
endif
endfor;
D0.i = tmp

Notes
Functional examples:
S_CLS_I32(0x00000000) ⇒ 0xffffffff
S_CLS_I32(0x0000cccc) ⇒ 16
S_CLS_I32(0xffff3333) ⇒ 16
S_CLS_I32(0x7fffffff) ⇒ 1
S_CLS_I32(0x80000000) ⇒ 1
S_CLS_I32(0xffffffff) ⇒ 0xffffffff
Compare with V_CLS_I32, which performs the equivalent operation in the vector ALU.
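
A Python transcription of the loop above is sketched below. Note that the count includes the sign bit itself, so the smallest non-negative result is 1. The function name mirrors the opcode for readability only.

```python
def s_cls_i32(s0):
    """Count of leading bits equal to the sign bit, or 0xFFFFFFFF (-1) if all match."""
    s0 &= 0xFFFFFFFF
    sign = s0 >> 31
    for i in range(1, 32):                   # start below the sign bit
        if ((s0 >> (31 - i)) & 1) != sign:
            return i
    return 0xFFFFFFFF                        # every bit equals the sign bit
```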

S_CLS_I32_I64

13

Count leading sign bits.
Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the
same.
tmp = -1;
// Set if all bits are the same
for i in 1 : 63 do
// Search from MSB
if S0.u64[63 - i] != S0.u64[63] then
tmp = i;
break
endif
endfor;
D0.i = tmp

S_SEXT_I32_I8

14

Sign extension of a signed byte.
D0.i = 32'I(signext(S0.i8))

S_SEXT_I32_I16

15

Sign extension of a signed short.
D0.i = 32'I(signext(S0.i16))

S_BITSET0_B32

16

Set a specific bit to zero.
D0.u[S0.u[4 : 0]] = 1'0U

S_BITSET0_B64

17

Set a specific bit to zero.
D0.u64[S0.u[5 : 0]] = 1'0U

S_BITSET1_B32

18

Set a specific bit to one.
D0.u[S0.u[4 : 0]] = 1'1U

S_BITSET1_B64

19

Set a specific bit to one.
D0.u64[S0.u[5 : 0]] = 1'1U

S_BITREPLICATE_B64_B32

20

Replicate the low 32 bits of the argument by 'doubling' each bit.
tmp = S0.u;
for i in 0 : 31 do
D0.u64[i * 2 + 0] = tmp[i];
D0.u64[i * 2 + 1] = tmp[i]
endfor

Notes
This opcode can be used to convert a quad mask into a pixel mask; given quad mask in s0, the following
sequence produces a pixel mask in s2:
s_bitreplicate_b64 s2, s0
s_bitreplicate_b64 s2, s2

To perform the inverse operation see S_QUADMASK_B64.
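
The doubling step, and the two-pass expansion of a quad mask into a pixel mask described in the note, can be sketched in Python as follows. The helper `quad_mask_to_pixel_mask` is a hypothetical name for the two-instruction sequence.

```python
def s_bitreplicate_b64_b32(s0):
    """Duplicate each of the low 32 bits of s0 into a 64-bit result."""
    s0 &= 0xFFFFFFFF
    d0 = 0
    for i in range(32):
        bit = (s0 >> i) & 1
        d0 |= bit << (2 * i)
        d0 |= bit << (2 * i + 1)
    return d0

def quad_mask_to_pixel_mask(quad_mask):
    # Two passes turn each quad bit into four pixel bits.
    # The second pass reads only the low 32 bits of the first result.
    return s_bitreplicate_b64_b32(s_bitreplicate_b64_b32(quad_mask))
```

Because each pass consumes 32 input bits, the two-pass sequence covers a 16-bit quad mask: 16 bits expand to 32 after the first pass and to 64 after the second.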

S_ABS_I32

21

Integer absolute value.
D0.i = S0.i < 0 ? -S0.i : S0.i;
SCC = D0.i != 0

Notes
Functional examples:
S_ABS_I32(0x00000001) ⇒ 0x00000001
S_ABS_I32(0x7fffffff) ⇒ 0x7fffffff
S_ABS_I32(0x80000000) ⇒ 0x80000000 // Note this is negative!
S_ABS_I32(0x80000001) ⇒ 0x7fffffff
S_ABS_I32(0x80000002) ⇒ 0x7ffffffe
S_ABS_I32(0xffffffff) ⇒ 0x00000001
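
The wraparound for the most negative input can be reproduced with a short Python sketch that works on 32-bit patterns. The function name mirrors the opcode for readability only.

```python
def s_abs_i32(s0):
    """Return (D0 as a 32-bit pattern, SCC)."""
    s0 &= 0xFFFFFFFF
    signed = s0 - 0x100000000 if s0 & 0x80000000 else s0
    # Negating -2**31 does not fit in 32 bits, so the mask wraps it back
    d0 = (-signed if signed < 0 else signed) & 0xFFFFFFFF
    return d0, d0 != 0
```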

S_BCNT0_I32_B32

22

Count number of bits that are zero.
tmp = 0;
for i in 0 : 31 do
tmp += S0.u[i].u == 0U ? 1 : 0
endfor;
D0.i = tmp;
SCC = D0.u != 0U

Notes
Functional examples:
S_BCNT0_I32_B32(0x00000000) ⇒ 32
S_BCNT0_I32_B32(0xcccccccc) ⇒ 16
S_BCNT0_I32_B32(0xffffffff) ⇒ 0

S_BCNT0_I32_B64

23

Count number of bits that are zero.
tmp = 0;
for i in 0 : 63 do
tmp += S0.u64[i].u == 0U ? 1 : 0
endfor;
D0.i = tmp;
SCC = D0.u64 != 0ULL

S_BCNT1_I32_B32

24

Count number of bits that are one.
tmp = 0;
for i in 0 : 31 do
tmp += S0.u[i].u == 1U ? 1 : 0
endfor;
D0.i = tmp;
SCC = D0.u != 0U

Notes
Functional examples:
S_BCNT1_I32_B32(0x00000000) ⇒ 0
S_BCNT1_I32_B32(0xcccccccc) ⇒ 16
S_BCNT1_I32_B32(0xffffffff) ⇒ 32

S_BCNT1_I32_B64

25

Count number of bits that are one.
tmp = 0;
for i in 0 : 63 do
tmp += S0.u64[i].u == 1U ? 1 : 0
endfor;
D0.i = tmp;
SCC = D0.u64 != 0ULL

S_QUADMASK_B32

26

Reduce a pixel mask to a quad mask.
tmp = 0U;
for i in 0 : 7 do
tmp[i] = S0.u[i * 4 + 3 : i * 4] != 0U
endfor;
D0.u = tmp;
SCC = D0.u != 0U

Notes
To perform the inverse operation see S_BITREPLICATE_B64_B32.

S_QUADMASK_B64

27

Reduce a pixel mask to a quad mask.
tmp = 0U;
for i in 0 : 15 do
tmp[i] = S0.u64[i * 4 + 3 : i * 4] != 0ULL
endfor;
D0.u = tmp;
SCC = D0.u != 0U

Notes
To perform the inverse operation see S_BITREPLICATE_B64_B32.

S_WQM_B32

28

Computes whole quad mode for an active/valid mask. If any pixel in a quad is active, all pixels of the quad are
marked active.
tmp = 0U;
declare i : 6'U;
for i in 6'0U : 6'31U do
tmp[i] = S0.u[i | 6'3U : i & 6'60U] != 0U
endfor;
D0.u = tmp;
SCC = D0.u != 0U
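
S_WQM_B32 and S_QUADMASK_B32 both test groups of four adjacent bits but differ in output width: the quad mask has one bit per quad, while whole quad mode keeps one bit per pixel. The Python sketch below models both from their pseudocode. The function names mirror the opcodes for readability only.

```python
def s_quadmask_b32(s0):
    """One output bit per group of four input bits. Returns (D0, SCC)."""
    s0 &= 0xFFFFFFFF
    d0 = 0
    for i in range(8):
        if (s0 >> (4 * i)) & 0xF:
            d0 |= 1 << i
    return d0, d0 != 0

def s_wqm_b32(s0):
    """Set all four bits of every quad that has any bit set. Returns (D0, SCC)."""
    s0 &= 0xFFFFFFFF
    d0 = 0
    for i in range(32):
        quad_base = i & 60               # lowest bit index of the quad holding bit i
        if (s0 >> quad_base) & 0xF:
            d0 |= 1 << i
    return d0, d0 != 0
```

For the input 0x00010020, the quad mask is 0x12 (quads 1 and 4 are live) and the whole-quad-mode mask is 0x000f00f0.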

S_WQM_B64

29

Computes whole quad mode for an active/valid mask. If any pixel in a quad is active, all pixels of the quad are
marked active.
tmp = 0ULL;
declare i : 6'U;
for i in 6'0U : 6'63U do
tmp[i] = S0.u64[i | 6'3U : i & 6'60U] != 0ULL
endfor;
D0.u64 = tmp;
SCC = D0.u64 != 0ULL

S_NOT_B32

30

Bitwise negation.
D0.u = ~S0.u;
SCC = D0.u != 0U

S_NOT_B64

31

Bitwise negation.
D0.u64 = ~S0.u64;
SCC = D0.u64 != 0ULL

S_AND_SAVEEXEC_B32

32

Bitwise AND with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = (S0.u & EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U
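
All SAVEEXEC opcodes follow the same pattern: capture the old EXEC value, combine S0 with EXEC, and set SCC from the new EXEC. The Python sketch below models the AND form and returns the three results as a tuple. The function name and return convention are illustrative.

```python
def s_and_saveexec_b32(s0, exec_mask):
    """Return (D0, new EXEC, SCC) for a 32-bit wave."""
    saved = exec_mask & 0xFFFFFFFF        # original EXEC goes to D0
    new_exec = s0 & saved & 0xFFFFFFFF    # lanes that pass the condition
    return saved, new_exec, new_exec != 0
```

With a condition mask in S0, D0 afterwards holds the value needed to restore EXEC, and SCC indicates whether any lane remains active.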

S_AND_SAVEEXEC_B64

33

Bitwise AND with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = (S0.u64 & EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_OR_SAVEEXEC_B32

34

Bitwise OR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = (S0.u | EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_OR_SAVEEXEC_B64

35

Bitwise OR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = (S0.u64 | EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_XOR_SAVEEXEC_B32

36

Bitwise XOR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = (S0.u ^ EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_XOR_SAVEEXEC_B64

37

Bitwise XOR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = (S0.u64 ^ EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_NAND_SAVEEXEC_B32

38

Bitwise NAND with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = ~(S0.u & EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_NAND_SAVEEXEC_B64

39

Bitwise NAND with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = ~(S0.u64 & EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_NOR_SAVEEXEC_B32

40

Bitwise NOR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = ~(S0.u | EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_NOR_SAVEEXEC_B64

41

Bitwise NOR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = ~(S0.u64 | EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_XNOR_SAVEEXEC_B32

42

Bitwise XNOR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = ~(S0.u ^ EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_XNOR_SAVEEXEC_B64

43

Bitwise XNOR with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = ~(S0.u64 ^ EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_AND_NOT0_SAVEEXEC_B32

44

Bitwise AND with negated first argument with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = (~S0.u & EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_AND_NOT0_SAVEEXEC_B64

45

Bitwise AND with negated first argument with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = (~S0.u64 & EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_OR_NOT0_SAVEEXEC_B32

46

Bitwise OR with negated first argument with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = (~S0.u | EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_OR_NOT0_SAVEEXEC_B64

47

Bitwise OR with negated first argument with EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = (~S0.u64 | EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_AND_NOT1_SAVEEXEC_B32

48

Bitwise AND with negated EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = (S0.u & ~EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_AND_NOT1_SAVEEXEC_B64

49

Bitwise AND with negated EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = (S0.u64 & ~EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_OR_NOT1_SAVEEXEC_B32

50

Bitwise OR with negated EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u;
EXEC.u = (S0.u | ~EXEC.u);
D0.u = saveexec.u;
SCC = EXEC.u != 0U

S_OR_NOT1_SAVEEXEC_B64

51

Bitwise OR with negated EXEC mask.
The original EXEC mask is saved to the destination SGPRs before the bitwise operation is performed.
saveexec = EXEC.u64;
EXEC.u64 = (S0.u64 | ~EXEC.u64);
D0.u64 = saveexec.u64;
SCC = EXEC.u64 != 0ULL

S_AND_NOT0_WREXEC_B32

52

Bitwise AND with negated first argument with EXEC mask.
Unlike the SAVEEXEC series of opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is
intended to help accelerate waterfalling.
EXEC_LO.u = (~S0.u & EXEC_LO.u);
D0.u = EXEC_LO.u;
SCC = EXEC_LO.u != 0U

S_AND_NOT0_WREXEC_B64

53

Bitwise AND with negated first argument with EXEC mask.
Unlike the SAVEEXEC series of opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is
intended to help accelerate waterfalling.
EXEC.u64 = (~S0.u64 & EXEC.u64);
D0.u64 = EXEC.u64;
SCC = EXEC.u64 != 0ULL

S_AND_NOT1_WREXEC_B32

54

Bitwise AND with negated EXEC mask.
Unlike the SAVEEXEC series of opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is
intended to help accelerate waterfalling. See S_AND_NOT1_WREXEC_B64 for example code.
EXEC_LO.u = (S0.u & ~EXEC_LO.u);
D0.u = EXEC_LO.u;
SCC = EXEC_LO.u != 0U

S_AND_NOT1_WREXEC_B64

55

Bitwise AND with negated EXEC mask.
Unlike the SAVEEXEC series of opcodes, the value written to the destination SGPRs is the result of the bitwise
operation. EXEC and the destination SGPRs have the same value at the end of this instruction. This instruction is
intended to help accelerate waterfalling.
EXEC.u64 = (S0.u64 & ~EXEC.u64);
D0.u64 = EXEC.u64;
SCC = EXEC.u64 != 0ULL

Notes
In particular, the following sequence of waterfall code is optimized by using a WREXEC instead of two separate
scalar ops:
// V0 holds the index value per lane
// save exec mask for restore at the end
s_mov_b64 s2, exec
// exec mask of remaining (unprocessed) threads
s_mov_b64 s4, exec
loop:
// get the index value for the first active lane
v_readfirstlane_b32 s0, v0
// find all other lanes with same index value
v_cmpx_eq s0, v0
<OP>                          // do the operation using the current EXEC mask. S0 holds the index.
// mask out thread that was just executed
// s_andn2_b64 s4, s4, exec
// s_mov_b64 exec, s4
s_andn2_wrexec_b64 s4, s4     // replaces above 2 ops
// repeat until EXEC==0
s_cbranch_scc1 loop
s_mov_b64 exec, s2
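
To make the control flow of the waterfall loop concrete, the following Python sketch simulates it on a list of per-lane index values. It assumes EXEC is nonzero on entry and that the compare step clears EXEC for inactive lanes. The function names and the returned list of groups are illustrative and are not part of the ISA.

```python
def first_active_lane(exec_mask):
    """Index of the lowest set bit (exec_mask must be nonzero)."""
    return (exec_mask & -exec_mask).bit_length() - 1

def waterfall(indices, exec_mask):
    """Return ([(uniform index, lane mask), ...], restored EXEC)."""
    saved_exec = exec_mask      # s2: value restored at the end
    remaining = exec_mask       # s4: lanes not yet processed
    groups = []
    while True:
        # v_readfirstlane_b32: index value of the first active lane
        s0 = indices[first_active_lane(exec_mask)]
        # v_cmpx_eq: keep only active lanes whose index matches s0
        exec_mask = sum(1 << lane for lane in range(len(indices))
                        if (exec_mask >> lane) & 1 and indices[lane] == s0)
        groups.append((s0, exec_mask))      # <OP> would run here
        # WREXEC step: EXEC = s4 & ~EXEC, s4 = EXEC, SCC = (EXEC != 0)
        exec_mask = remaining & ~exec_mask
        remaining = exec_mask
        if exec_mask == 0:                  # s_cbranch_scc1 not taken
            break
    return groups, saved_exec               # s_mov_b64 exec, s2
```

For four lanes with index values [7, 3, 7, 3] and all lanes active, the loop runs twice: once for lanes 0 and 2 with index 7, and once for lanes 1 and 3 with index 3.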

S_MOVRELS_B32

64

Move from a relative source address.
addr = SRC0.u;
// Raw value from instruction
addr += M0.u[31 : 0];
D0.b = SGPR[addr].b

Notes
Example: The following instruction sequence performs the move s5 <= s17:
s_mov_b32 m0, 10
s_movrels_b32 s5, s7

S_MOVRELS_B64

65

Move from a relative source address.
The index in M0.u and the operand address in SRC0.u must be even for this operation.
addr = SRC0.u;
// Raw value from instruction
addr += M0.u[31 : 0];
D0.b64 = SGPR[addr].b64

S_MOVRELD_B32

66

Move to a relative destination address.
addr = DST.u;
// Raw value from instruction
addr += M0.u[31 : 0];
SGPR[addr].b = S0.b

Notes
Example: The following instruction sequence performs the move s15 <= s7:
s_mov_b32 m0, 10
s_movreld_b32 s5, s7

S_MOVRELD_B64

67

Move to a relative destination address.
The index in M0.u and the operand address in DST.u must be even for this operation.
addr = DST.u;
// Raw value from instruction
addr += M0.u[31 : 0];
SGPR[addr].b64 = S0.b64

S_MOVRELSD_2_B32

68

Move from a relative source address to a relative destination address, with different offsets.
addrs = SRC0.u;
// Raw value from instruction
addrd = DST.u;
// Raw value from instruction
addrs += M0.u[9 : 0].u;
addrd += M0.u[25 : 16].u;
SGPR[addrd].b = SGPR[addrs].b

Notes
Example: The following instruction sequence performs the move s25 <= s17:
s_mov_b32 m0, ((20 << 16) | 10)
s_movrelsd_2_b32 s5, s7
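
The three relative-move forms differ only in which register number receives the M0 offset. The Python sketch below models the SGPR file as a list and reproduces the examples given for S_MOVRELS_B32, S_MOVRELD_B32 and S_MOVRELSD_2_B32. The list representation and the function signatures are assumptions made for illustration.

```python
def s_movrels_b32(sgpr, dst, src, m0):
    # Source register number is offset by M0
    sgpr[dst] = sgpr[src + (m0 & 0xFFFFFFFF)]

def s_movreld_b32(sgpr, dst, src, m0):
    # Destination register number is offset by M0
    sgpr[dst + (m0 & 0xFFFFFFFF)] = sgpr[src]

def s_movrelsd_2_b32(sgpr, dst, src, m0):
    # M0[9:0] offsets the source, M0[25:16] offsets the destination
    src_off = m0 & 0x3FF
    dst_off = (m0 >> 16) & 0x3FF
    sgpr[dst + dst_off] = sgpr[src + src_off]
```

With M0 set to ((20 << 16) | 10), `s_movrelsd_2_b32(sgpr, 5, 7, m0)` copies element 17 of the list into element 25, matching the s25 <= s17 example.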

S_GETPC_B64

71

Save current program location. The byte address of the instruction immediately following the GETPC
instruction is saved to the destination register D0.
D0.i64 = PC + 4LL

Notes
This instruction must be 4 bytes.

S_SETPC_B64

72

Jump to a new location. Argument is a byte address of the instruction to jump to.
PC = S0.i64

S_SWAPPC_B64

73

Save current program location and jump to a new location.
Argument is a byte address of the instruction to jump to. The byte address of the instruction immediately
following the SWAPPC instruction is saved to the destination register D0.
jump_addr = S0.i64;
D0.i64 = PC + 4LL;
PC = jump_addr.i64

Notes
This instruction must be 4 bytes.

S_RFE_B64

74

Return from exception handler and continue.
This instruction may only be used within a trap handler.
WAVE_STATUS.PRIV = 1'0U;
PC = S0.i64

S_SENDMSG_RTN_B32

76

Send a message to upstream control hardware.
SSRC[7:0] contains the message type encoded in the instruction directly (this instruction does not read an
SGPR). The message is expected to return a response from the upstream control hardware and the result is
written to SDST. Use s_waitcnt lgkmcnt(…) to wait for the response on the dependent instruction.
S_SENDMSG_RTN* instructions return data in-order among themselves but out-of-order with other
instructions that manipulate lgkmcnt (including S_SENDMSG and S_SENDMSGHALT).
If the message returns a 64 bit value then only the lower 32 bits are written to SDST.
If SDST is VCC then VCCZ is undefined.

S_SENDMSG_RTN_B64

77

Send a message to upstream control hardware.
SSRC[7:0] contains the message type encoded in the instruction directly (this instruction does not read an
SGPR). The message is expected to return a response from the upstream control hardware and the result is
written to SDST. Use s_waitcnt lgkmcnt(…) to wait for the response on the dependent instruction.
S_SENDMSG_RTN* instructions return data in-order among themselves but out-of-order with other
instructions that manipulate lgkmcnt (including S_SENDMSG and S_SENDMSGHALT).
If the message returns a 32 bit value then this instruction fills the upper bits of SDST with zero.
If SDST is VCC then VCCZ is undefined.

16.4. SOPC Instructions

Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction.

S_CMP_EQ_I32

0

A equal to B.
SCC = S0.i == S1.i

Notes
Note that S_CMP_EQ_I32 and S_CMP_EQ_U32 are identical opcodes, but both are provided for symmetry.

S_CMP_LG_I32

1

A not equal to B.
SCC = S0.i != S1.i

Notes
Note that S_CMP_LG_I32 and S_CMP_LG_U32 are identical opcodes, but both are provided for symmetry.

S_CMP_GT_I32

2

A greater than B.
SCC = S0.i > S1.i

S_CMP_GE_I32

3

A greater than or equal to B.
SCC = S0.i >= S1.i

S_CMP_LT_I32

4

A less than B.
SCC = S0.i < S1.i

S_CMP_LE_I32

5

A less than or equal to B.
SCC = S0.i <= S1.i

S_CMP_EQ_U32

6

A equal to B.
SCC = S0.i == S1.i

Notes
Note that S_CMP_EQ_I32 and S_CMP_EQ_U32 are identical opcodes, but both are provided for symmetry.

S_CMP_LG_U32

7

A not equal to B.
SCC = S0.i != S1.i

Notes
Note that S_CMP_LG_I32 and S_CMP_LG_U32 are identical opcodes, but both are provided for symmetry.

S_CMP_GT_U32

8

A greater than B.
SCC = S0.u > S1.u

Notes
Unsigned integer comparison.

S_CMP_GE_U32

9

A greater than or equal to B.
SCC = S0.u >= S1.u

Notes
Unsigned integer comparison.

S_CMP_LT_U32

10

A less than B.
SCC = S0.u < S1.u

Notes
Unsigned integer comparison.

S_CMP_LE_U32

11

A less than or equal to B.
SCC = S0.u <= S1.u

Notes
Unsigned integer comparison.

S_BITCMP0_B32

12

Test if a bit is unset.
SCC = S0.u[S1.u[4 : 0]] == 1'0U

S_BITCMP1_B32

13

Test if a bit is set.
SCC = S0.u[S1.u[4 : 0]] == 1'1U

S_BITCMP0_B64

14

Test if a bit is unset.
SCC = S0.u64[S1.u[5 : 0]] == 1'0U

S_BITCMP1_B64

15

Test if a bit is set.
SCC = S0.u64[S1.u[5 : 0]] == 1'1U

S_CMP_EQ_U64

16

A equal to B.
SCC = S0.u64 == S1.u64

S_CMP_LG_U64

17

A not equal to B.
SCC = S0.u64 != S1.u64

16.5. SOPP Instructions

S_NOP

0

Do nothing. Delay issue of next instruction by a small, fixed amount.
Insert 0..15 wait states based on SIMM16[3:0]. 0x0 means the next instruction can issue on the next clock, 0xf
means the next instruction can issue 16 clocks later.
for i in 0U : SIMM16.u16[3 : 0].u do
nop()
endfor

Notes
Examples:
s_nop 0      // Wait 1 cycle.
s_nop 0xf    // Wait 16 cycles.

S_SETKILL

1

Set KILL bit to value of SIMM16[0].
Used primarily for debugging kill wave host command behavior.

S_SETHALT

2

Set or clear the HALT or FATAL_HALT status bits.
The particular status bit is chosen by halt type control as indicated in SIMM16[2]; 0 = HALT bit select; 1 =
FATAL_HALT bit select.
When halt type control is set to 0 (HALT bit select): Set HALT bit to value of SIMM16[0]; 1 = halt, 0 = clear HALT
bit. The halt flag is ignored while PRIV == 1 (inside trap handlers) but the shader halts after the handler returns
if HALT is still set at that time.
When halt type control is set to 1 (FATAL HALT bit select): Set FATAL_HALT bit to value of SIMM16[0]; 1 =
fatal_halt, 0 = clear FATAL_HALT bit. Setting the fatal_halt flag halts the shader in or outside of the trap
handlers.

S_SLEEP

3

Cause a wave to sleep for up to ~8000 clocks.
The wave sleeps for (64*(SIMM16[6:0]-1) .. 64*SIMM16[6:0]) clocks. The exact amount of delay is approximate.
Compare with S_NOP. When SIMM16[6:0] is zero then no sleep occurs.
Notes
Examples:
s_sleep 0    // Wait for 0 clocks.
s_sleep 1    // Wait for 1-64 clocks.
s_sleep 2    // Wait for 65-128 clocks.
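
The sleep interval for a given immediate can be tabulated with a small Python helper. This is a sketch only: the bounds follow the examples above (1-64 clocks for an immediate of 1), the hardware delay is approximate, and the function name is illustrative.

```python
def s_sleep_range(simm16):
    """Return (min_clocks, max_clocks) for the approximate sleep duration."""
    n = simm16 & 0x7F                    # only SIMM16[6:0] is used
    if n == 0:
        return (0, 0)                    # no sleep occurs
    return (64 * (n - 1) + 1, 64 * n)
```

The largest immediate, 127, gives an upper bound of 8128 clocks, which is consistent with the "up to ~8000 clocks" figure in the description.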

S_SET_INST_PREFETCH_DISTANCE

4

Change instruction prefetch mode. This controls how many cachelines ahead of the current PC the shader will
try to prefetch.
SIMM16[1:0] specifies the prefetch mode to switch to. Prefetch modes are:
PREFETCH_SAFE (0x0)
Reserved, do not use.
PREFETCH_1_LINE (0x1)
Prefetch 1 cache line ahead of PC; keep 2 lines behind PC.
PREFETCH_2_LINES (0x2)
Prefetch 2 cache lines ahead of PC; keep 1 line behind PC.
PREFETCH_3_LINES (0x3)
Prefetch 3 cache lines ahead of PC; keep 0 lines behind PC.
SIMM16[15:2] must be set to zero.

S_CLAUSE

5

Mark the beginning of a clause.
The next instruction determines the clause type, which may be one of the following types.
• Image Load (non-sample instructions)
• Image Sample
• Image Store
• Image Atomic
• Buffer/Global/Scratch Load
• Buffer/Global/Scratch Store
• Buffer/Global/Scratch Atomic
• Flat Load
• Flat Store
• Flat Atomic
• LDS (loads, stores, atomics may be in same clause)
• Scalar Memory
• Vector ALU
Once the clause type is determined, any instruction encountered within the clause that is not of the same type
(and not an internal instruction described below) is illegal and may lead to undefined behaviour. Attempting to
issue S_CLAUSE while inside a clause is also illegal.
Instructions that are processed internally do not interrupt the clause. The following instructions are internal:
• S_NOP,
• S_WAITCNT and its variants, unless they read an SGPR,
• S_SLEEP,
• S_DELAY_ALU.
Halting or killing a wave breaks the clause. VALU exceptions and other traps that cause the shader to enter its
trap handler break the clause. The single-step debug mode breaks the clause.
The clause length must be between 2 and 63 instructions, inclusive. Clause breaks may be from 1 to 15, or
may be disabled entirely. Clause length and breaks are encoded in the SIMM16 argument as follows:
LENGTH = SIMM16[5:0]
This field is set to the logical number of instructions in the clause, minus 1 (e.g. if a clause has 4
instructions, program this field to 3). The minimum number of instructions required for a clause is 2 and
the maximum number of instructions is 63, therefore this field must be programmed in the range [1, 62]
inclusive.
BREAK_SPAN = SIMM16[11:8]
This field is set to the number of instructions to issue before each clause break. If set to zero then there are
no clause breaks. If set to nonzero value then the maximum number of instructions between clause breaks
is 15.
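The LENGTH and BREAK_SPAN fields described above can be packed with a few lines of code. The following is a minimal sketch in Python, not an AMD-provided tool; the helper name encode_s_clause_simm16 is hypothetical and only the bit positions come from the text above.

```python
def encode_s_clause_simm16(num_instructions: int, break_span: int = 0) -> int:
    """Pack LENGTH (SIMM16[5:0]) and BREAK_SPAN (SIMM16[11:8]) for S_CLAUSE.

    num_instructions is the logical clause length (2..63).
    break_span is 0 for "no clause breaks" or 1..15 instructions per break.
    """
    if not 2 <= num_instructions <= 63:
        raise ValueError("a clause must contain 2 to 63 instructions")
    if not 0 <= break_span <= 15:
        raise ValueError("BREAK_SPAN must be 0 (no breaks) or 1 to 15")
    length_field = num_instructions - 1  # the field holds the count minus 1
    return (break_span << 8) | length_field
```

For a four-instruction clause with no breaks this yields a LENGTH field of 3, matching the example in the LENGTH description.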
The following instruction types cannot appear in a clause:
• SALU
• Export
• Branch
• Message
• LDSDIR
• VINTERP
• GDS
If you need to schedule an S_WAITCNT or S_DELAY_ALU instruction for the first instruction in the clause, the
waitcnt/delay instruction must appear before the S_CLAUSE instruction so that S_CLAUSE can accurately
determine the clause type.

S_DELAY_ALU must not appear inside a clause. The features are orthogonal; ALU clauses should be structured
to avoid any stalling.

S_DELAY_ALU

7

Insert delay between dependent SALU/VALU instructions.
The SIMM16 argument is encoded as:
INSTID0 = SIMM16[3:0]
Hazard to delay for with the next VALU instruction.
INSTSKIP = SIMM16[6:4]
Identify the VALU instruction that the second delay condition applies to.
INSTID1 = SIMM16[10:7]
Hazard to delay for with the VALU instruction identified by INSTSKIP.
Legal values for the InstID0 and InstID1 fields are:
INSTID_NO_DEP (0x0)
No dependency on any prior instruction.
INSTID_VALU_DEP_1 (0x1)
Dependent on previous VALU instruction, 1 instruction back.
INSTID_VALU_DEP_2 (0x2)
Dependent on previous VALU instruction, 2 instructions back.
INSTID_VALU_DEP_3 (0x3)
Dependent on previous VALU instruction, 3 instructions back.
INSTID_VALU_DEP_4 (0x4)
Dependent on previous VALU instruction, 4 instructions back.
INSTID_TRANS32_DEP_1 (0x5)
Dependent on previous TRANS32 instruction, 1 instruction back.
INSTID_TRANS32_DEP_2 (0x6)
Dependent on previous TRANS32 instruction, 2 instructions back.
INSTID_TRANS32_DEP_3 (0x7)
Dependent on previous TRANS32 instruction, 3 instructions back.
INSTID_FMA_ACCUM_CYCLE_1 (0x8)
Single cycle penalty for FMA accumulation (reserved for future use in architectures with fast FMA SRC-C
accumulation).

INSTID_SALU_CYCLE_1 (0x9)
1 cycle penalty for a prior SALU instruction.
INSTID_SALU_CYCLE_2 (0xa)
2 cycle penalty for a prior SALU instruction (reserved for future use in architectures with SALU floats).
INSTID_SALU_CYCLE_3 (0xb)
3 cycle penalty for a prior SALU instruction (reserved for future use in architectures with SALU floats).
Legal values for the InstSkip field are:
INSTSKIP_SAME (0x0)
Apply second dependency to same instruction (2 dependencies on one instruction).
INSTSKIP_NEXT (0x1)
Apply second dependency to next instruction (no skip).
INSTSKIP_SKIP_1 (0x2)
Skip 1 instruction then apply dependency.
INSTSKIP_SKIP_2 (0x3)
Skip 2 instructions then apply dependency.
INSTSKIP_SKIP_3 (0x4)
Skip 3 instructions then apply dependency.
INSTSKIP_SKIP_4 (0x5)
Skip 4 instructions then apply dependency.
This instruction describes dependencies for two instructions, directing the hardware to insert delay if the
dependent instruction was issued too recently to forward data to the second.
S_DELAY_ALU instructions record the required delay with respect to a previous VALU instruction and indicate
data dependencies that benefit from having extra idle cycles inserted between them. These instructions are
optional: without them the program still functions correctly but performance may suffer when multiple waves
are in flight; IB may issue dependent instructions that stall in the ALU, preventing those cycles from being
utilized by other wavefronts.
If enough independent instructions are between dependent ones then no delay is necessary and this
instruction may be omitted. For wave64 the compiler may not know the status of the EXEC mask and hence
does not know if instructions require 1 or 2 passes to issue. S_DELAY_ALU encodes the type of dependency so
that hardware may apply the correct delay depending on the number of active passes.
S_DELAY_ALU may execute in zero cycles.
To reduce instruction stream overhead, the S_DELAY_ALU instruction packs two delay values into one
instruction, with a "skip" indicator so the two delayed instructions don't need to be back-to-back.
S_DELAY_ALU is illegal inside of a clause created by S_CLAUSE.
Example:
v_mov_b32      v3, v0
v_lshlrev_b32  v30, 1, v31
v_lshlrev_b32  v24, 1, v25
s_delay_alu instid0(INSTID_VALU_DEP_3) | instskip(INSTSKIP_SKIP_1) | instid1(INSTID_VALU_DEP_1)
// 1 cycle delay here
v_add_f32      v0, v1, v3
v_sub_f32      v11, v9, v9
// 2 cycles delay here
v_mul_f32      v10, v13, v11
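To make the field layout concrete, the sketch below packs INSTID0, INSTSKIP and INSTID1 into a SIMM16 value. It is an illustrative Python helper (the function and dictionary names are hypothetical); the numeric values are the ones listed in the tables for this instruction.

```python
# Field values copied from the INSTID and INSTSKIP tables for S_DELAY_ALU.
INSTID = {
    "NO_DEP": 0x0,
    "VALU_DEP_1": 0x1,
    "VALU_DEP_2": 0x2,
    "VALU_DEP_3": 0x3,
    "VALU_DEP_4": 0x4,
    "TRANS32_DEP_1": 0x5,
    "TRANS32_DEP_2": 0x6,
    "TRANS32_DEP_3": 0x7,
    "FMA_ACCUM_CYCLE_1": 0x8,
    "SALU_CYCLE_1": 0x9,
    "SALU_CYCLE_2": 0xA,
    "SALU_CYCLE_3": 0xB,
}
INSTSKIP = {"SAME": 0x0, "NEXT": 0x1, "SKIP_1": 0x2,
            "SKIP_2": 0x3, "SKIP_3": 0x4, "SKIP_4": 0x5}


def encode_s_delay_alu(instid0: str, instskip: str = "SAME",
                       instid1: str = "NO_DEP") -> int:
    """Return SIMM16 with INSTID0 in [3:0], INSTSKIP in [6:4], INSTID1 in [10:7]."""
    return INSTID[instid0] | (INSTSKIP[instskip] << 4) | (INSTID[instid1] << 7)


# The s_delay_alu line from the example above.
simm16 = encode_s_delay_alu("VALU_DEP_3", "SKIP_1", "VALU_DEP_1")
```

Under these assumptions the example instruction encodes to a SIMM16 of 0x00A3 (3, plus 2 shifted left by 4, plus 1 shifted left by 7).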

S_WAITCNT

9

Wait for the counts of outstanding lds, vector-memory and export/vmem-write-data to be at or below the
specified levels.
The SIMM16 argument is encoded as:
EXP = SIMM16[2:0]
Export wait count. 0x7 means do not wait on EXPCNT.
LGKM = SIMM16[9:4]
LGKM wait count. 0x3f means do not wait on LGKMCNT.
VM = SIMM16[15:10]
VM wait count. 0x3f means do not wait on VMCNT.
Waits for all of the following conditions to hold before continuing:
expcnt <= WaitEXPCNT
lgkmcnt <= WaitLGKMCNT
vmcnt <= WaitVMCNT

VMCNT only counts vector memory loads, image sample instructions, and vector memory atomics that return
data. Contrast with the VSCNT counter.
See also S_WAITCNT_VSCNT.
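As a minimal sketch of the SIMM16 layout above, the following Python helper (a hypothetical name, not an assembler API) packs the three wait counts. Each argument defaults to its "do not wait" value, so a caller only names the counters it cares about.

```python
def encode_s_waitcnt(vm: int = 0x3F, lgkm: int = 0x3F, exp: int = 0x7) -> int:
    """Pack S_WAITCNT fields: EXP in [2:0], LGKM in [9:4], VM in [15:10].

    The defaults (0x3f, 0x3f, 0x7) mean "do not wait" on that counter.
    """
    if not 0 <= vm <= 0x3F:
        raise ValueError("VM wait count must fit in 6 bits")
    if not 0 <= lgkm <= 0x3F:
        raise ValueError("LGKM wait count must fit in 6 bits")
    if not 0 <= exp <= 0x7:
        raise ValueError("EXP wait count must fit in 3 bits")
    return (vm << 10) | (lgkm << 4) | exp


wait_for_all_loads = encode_s_waitcnt(vm=0)  # wait only until VMCNT reaches 0
```

Bit 3 of SIMM16 is not assigned to any field in the description above, so this sketch leaves it at zero.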

S_WAIT_IDLE

10

Wait for all activity in the wave to be complete (all dependency and memory counters at zero).

S_WAIT_EVENT

11

Wait for an event to occur or a condition to be satisfied before continuing. The SIMM16 argument specifies
which event(s) to wait on.
DONT_WAIT_EXPORT_READY = SIMM16[0]
If this value is ZERO then sleep until the export_ready bit is 1. If the export_ready bit is already 1, no sleep
occurs. Effect is the same as the export_ready check performed before issuing an export instruction.
No wait occurs if this value is ONE.
This wait can be broken or preempted by KILL, context-save, host trap, single-step or trap after instruction
events. IB waits for the event to occur before processing internal exceptions which can delay entry to the trap
handler for a significant amount of time.

S_TRAP

16

Enter the trap handler.
This instruction may be generated internally as well in response to a host trap (HT = 1) or an exception. TrapID
0 is reserved for hardware use and should not be used in a shader-generated trap.
TrapID = SIMM16.u16[7 : 0];
"Wait for all instructions to complete";
// PC passed into trap handler points to S_TRAP itself,
// *not* to the next instruction.
{ TTMP[1], TTMP[0] } = { 7'0, HT[0], TrapID[7 : 0], PC[47 : 0] };
PC = TBA.i64; // trap base address
WAVE_STATUS.PRIV = 1'1U
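The concatenation written into { TTMP[1], TTMP[0] } can be illustrated with a short Python sketch. It assumes the leftmost element of the concatenation is the most significant (so the 7 zero bits occupy bits 63:57, HT is bit 56, TrapID is bits 55:48, and PC is bits 47:0); the function name is hypothetical.

```python
def pack_trap_ttmp(pc: int, trap_id: int, host_trap: bool = False):
    """Return (ttmp0, ttmp1) as S_TRAP would write them, per the pseudocode above."""
    value = ((int(host_trap) & 0x1) << 56) \
        | ((trap_id & 0xFF) << 48) \
        | (pc & ((1 << 48) - 1))
    ttmp0 = value & 0xFFFFFFFF          # low dword: PC[31:0]
    ttmp1 = (value >> 32) & 0xFFFFFFFF  # high dword: PC[47:32], TrapID, HT
    return ttmp0, ttmp1
```

A trap handler that wants to resume after the S_TRAP would read PC back from these registers and add 4, because the saved PC points at S_TRAP itself.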

S_ROUND_MODE

17

Set floating point round mode using an immediate constant.
Avoids wait state penalty that would be imposed by S_SETREG.

S_DENORM_MODE

18

Set floating point denormal mode using an immediate constant.
Avoids wait state penalty that would be imposed by S_SETREG.

S_CODE_END

31

Generate an illegal instruction interrupt. This instruction is used to mark the end of a shader buffer for debug
tools.

This instruction should not appear in typical shader code. It is used to pad the end of a shader program to make
it easier for analysis programs to locate the end of a shader program buffer. Use of this opcode in an embedded
shader block may cause analysis tools to fail.
To unambiguously mark the end of a shader buffer, this instruction must be specified five times in a row (total
of 20 bytes) and analysis tools must ensure the opcode occurs at least five times to be certain they are at the end
of the buffer. This is because the bit pattern generated by this opcode could incidentally appear in a valid
instruction's second dword, literal constant or as part of a multi-DWORD image instruction.
In short: do not embed this opcode in the middle of a valid shader program. DO use this opcode 5 times at the
end of a shader program to clearly mark the end of the program.
Example:
...
s_endpgm     // last real instruction in shader buffer
s_code_end   // 1
s_code_end   // 2
s_code_end   // 3
s_code_end   // 4
s_code_end   // done!
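An analysis tool can apply the five-in-a-row rule with a simple scan. The sketch below is illustrative Python, not part of any AMD tool. It assumes the S_CODE_END dword is 0xBF9F0000 (the SOPP encoding prefix with opcode 31 and SIMM16 = 0); that value is an assumption drawn from the SOPP format rather than from this section, so verify it against the encoding chapter before relying on it.

```python
S_CODE_END_DWORD = 0xBF9F0000  # assumed encoding: SOPP, opcode 31, SIMM16 = 0
REQUIRED_RUN = 5               # the text requires five markers in a row


def find_code_end(dwords):
    """Return the index of the first dword of the end marker, or None.

    A single matching dword is not trusted, because the same bit pattern can
    appear inside a literal constant or a multi-dword instruction.
    """
    run = 0
    for index, dword in enumerate(dwords):
        run = run + 1 if dword == S_CODE_END_DWORD else 0
        if run == REQUIRED_RUN:
            return index - REQUIRED_RUN + 1
    return None
```

The function reports the start of the first run of five markers and ignores shorter runs, which is the behaviour the paragraph above asks of analysis tools.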

S_BRANCH

32

Unconditional short jump to label.
PC = PC + signext(SIMM16.i16 * 16'4) + 4LL;
// short jump.

Notes
For a long jump or an indirect jump use S_SETPC_B64.
Examples:
s_branch label   // target is 4 bytes past the next instruction, so SIMM16 = +1 = 0x0001
s_nop 0          // 4 bytes
label:
s_nop 0          // 4 bytes
s_branch label   // target is 8 bytes before the next instruction, so SIMM16 = -2 = 0xfffe
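The relationship between a byte distance and the SIMM16 field follows directly from the formula PC = PC + signext(SIMM16 * 4) + 4. The Python sketch below (hypothetical helper name) computes the field from the byte addresses of the branch and its target; note that SIMM16 counts dwords relative to the instruction after the branch, so byte distances of +4 and -8 from that point correspond to field values 1 and 0xfffe.

```python
def branch_simm16(branch_pc: int, target_pc: int) -> int:
    """Return the 16-bit SIMM16 field for a short branch at branch_pc to target_pc."""
    delta_bytes = target_pc - (branch_pc + 4)  # relative to the next instruction
    if delta_bytes % 4 != 0:
        raise ValueError("branch targets must be dword aligned")
    simm = delta_bytes // 4
    if not -32768 <= simm <= 32767:
        raise ValueError("target out of short-jump range; use S_SETPC_B64")
    return simm & 0xFFFF  # two's-complement 16-bit field


# Layout used here: s_branch at 0, s_nop at 4, label at 8, s_branch at 12.
forward = branch_simm16(0, 8)
backward = branch_simm16(12, 8)
```

The same computation applies to every S_CBRANCH_* instruction, since they share the taken-branch formula.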

S_CBRANCH_SCC0

33

Conditional short jump when SCC is zero.
if SCC.u == 0U then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_SCC1

34

Conditional short jump when SCC is one.
if SCC.u == 1U then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_VCCZ

35

Conditional short jump when VCC is zero.
if VCC == 0x0LL then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_VCCNZ

36

Conditional short jump when VCC is nonzero.
if VCC != 0x0LL then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_EXECZ

37

Conditional short jump when EXEC is zero.
if EXEC == 0x0LL then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_EXECNZ

38

Conditional short jump when EXEC is nonzero.
if EXEC != 0x0LL then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_CDBGSYS

39

Conditional short jump when the system debug flag is set.
if WAVE_STATUS.COND_DBG_SYS.u != 0U then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_CDBGUSER

40

Conditional short jump when the user debug flag is set.
if WAVE_STATUS.COND_DBG_USER.u != 0U then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_CDBGSYS_OR_USER

41

Conditional short jump when either the system or the user debug flag is set.
if (WAVE_STATUS.COND_DBG_SYS || WAVE_STATUS.COND_DBG_USER) then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_CBRANCH_CDBGSYS_AND_USER

42

Conditional short jump when both the system and the user debug flags are set.
if (WAVE_STATUS.COND_DBG_SYS && WAVE_STATUS.COND_DBG_USER) then
    PC = PC + signext(SIMM16.i16 * 16'4) + 4LL
else
    PC = PC + 4LL
endif

S_ENDPGM

48

End of program; terminate wavefront.
The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0 before executing this instruction.
See S_ENDPGM_SAVED for the context-switch version of this instruction and
S_ENDPGM_ORDERED_PS_DONE for the POPS critical region version of this instruction.

S_ENDPGM_SAVED

49

End of program; signal that a wave has been saved by the context-switch trap handler and terminate
wavefront.
The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0 before executing this instruction.
See S_ENDPGM for additional variants.

S_ENDPGM_ORDERED_PS_DONE

50

End of program; signal that a wave has exited its POPS critical section and terminate wavefront.
The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0 before executing this instruction.
This instruction is an optimization that combines S_SENDMSG(MSG_ORDERED_PS_DONE) and S_ENDPGM;
there may be cases where you still need to send the message separately, in which case the shader must end
with a regular S_ENDPGM instruction.
See S_ENDPGM for additional variants.

S_WAKEUP

52

Allow a wave to 'ping' all the other waves in its threadgroup to force them to wake up early from an S_SLEEP
instruction.
The ping is ignored if the waves are not sleeping. This allows for efficient polling on a memory location. The
waves which are polling can sit in a long S_SLEEP between memory reads, but the wave which writes the value
can tell them all to wake up early now that the data is available. This method is also safe from races because if
any wave misses the ping, everything is expected to work fine (waves which missed it just complete their
S_SLEEP).
If the wave executing S_WAKEUP is in a threadgroup (in_wg set), then it wakes up all waves associated with the
same threadgroup ID. Otherwise, S_WAKEUP is treated as an S_NOP.

S_SETPRIO

53

Change wave user priority.
User settable wave priority is set to SIMM16[1:0]. 0 is the lowest priority and 3 is the highest. The overall wave
priority is:
Priority = {SysUserPrio[1:0], WaveAge[3:0]}
SysUserPrio = MIN(3, SysPrio[1:0] + UserPrio[1:0]).

The system priority cannot be modified from within the wave.
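The priority formula above can be evaluated as in the following Python sketch. The function name is hypothetical, and the packing of the result into a single 6-bit integer (SysUserPrio in the upper 2 bits, WaveAge in the lower 4) is an interpretation of the {SysUserPrio[1:0], WaveAge[3:0]} concatenation.

```python
def wave_priority(sys_prio: int, user_prio: int, wave_age: int) -> int:
    """Combine system priority, user priority (from S_SETPRIO) and wave age."""
    # The sum of the two 2-bit priorities saturates at 3.
    sys_user_prio = min(3, (sys_prio & 0x3) + (user_prio & 0x3))
    return (sys_user_prio << 4) | (wave_age & 0xF)
```

Because the sum saturates, raising the user priority has no further effect once the system priority alone already reaches 3.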

S_SENDMSG

54

Send a message to upstream control hardware.
SIMM16[7:0] contains the message type.
Notes

S_SENDMSGHALT

55

Send a message to upstream control hardware and then HALT the wavefront; see S_SENDMSG for details.

S_INCPERFLEVEL

56

Increment performance counter specified in SIMM16[3:0] by 1.

S_DECPERFLEVEL

57

Decrement performance counter specified in SIMM16[3:0] by 1.

S_ICACHE_INV

60

Invalidate entire first level instruction cache.

S_BARRIER

61

Synchronize waves within a threadgroup.
If not all waves of the threadgroup have been created yet, waits for the entire group before proceeding. If some
waves in the threadgroup have already terminated, this waits on only the surviving waves. Barriers are legal
inside trap handlers.
Barrier instructions do not wait for any counters to go to zero before issuing. If you need the barrier to protect
an outstanding memory operation use the appropriate S_WAITCNT instruction before the barrier.

16.6. SMEM Instructions

S_LOAD_B32

0

Read 1 dword from scalar data cache.
If the offset is specified as an SGPR, the SGPR contains an UNSIGNED BYTE offset (the 2 LSBs are ignored).
If the offset is specified as an immediate 21-bit constant, the constant is a SIGNED BYTE offset.
SDATA[31 : 0] = MEM[ADDR + 0U].b
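The two offset forms described above differ in signedness and in how the low bits are treated. The Python sketch below (hypothetical helper names) shows one way to interpret each form as a byte offset. It models only what the text states: the SGPR form is unsigned with its 2 LSBs ignored, and the immediate form is a 21-bit two's-complement value. How the offset is combined with the base to form ADDR is outside the scope of this sketch.

```python
def sgpr_byte_offset(sgpr_value: int) -> int:
    """Unsigned byte offset from an SGPR; the two least significant bits are ignored."""
    return (sgpr_value & 0xFFFFFFFF) & ~0x3


def imm21_byte_offset(field: int) -> int:
    """Signed byte offset from the 21-bit immediate field (two's complement)."""
    field &= 0x1FFFFF
    if field & (1 << 20):         # sign bit of the 21-bit field
        return field - (1 << 21)
    return field
```

For example, an SGPR value of 0x107 is treated as offset 0x104, while an immediate field of 0x1FFFFF denotes a byte offset of -1.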

S_LOAD_B64

1

Read 2 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_LOAD_B128

2

Read 4 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b;
SDATA[95 : 64] = MEM[ADDR + 8U].b;
SDATA[127 : 96] = MEM[ADDR + 12U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_LOAD_B256

3

Read 8 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b;
SDATA[95 : 64] = MEM[ADDR + 8U].b;
SDATA[127 : 96] = MEM[ADDR + 12U].b;
SDATA[159 : 128] = MEM[ADDR + 16U].b;
SDATA[191 : 160] = MEM[ADDR + 20U].b;
SDATA[223 : 192] = MEM[ADDR + 24U].b;
SDATA[255 : 224] = MEM[ADDR + 28U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_LOAD_B512

4

Read 16 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b;
SDATA[95 : 64] = MEM[ADDR + 8U].b;
SDATA[127 : 96] = MEM[ADDR + 12U].b;
SDATA[159 : 128] = MEM[ADDR + 16U].b;
SDATA[191 : 160] = MEM[ADDR + 20U].b;
SDATA[223 : 192] = MEM[ADDR + 24U].b;
SDATA[255 : 224] = MEM[ADDR + 28U].b;
SDATA[287 : 256] = MEM[ADDR + 32U].b;
SDATA[319 : 288] = MEM[ADDR + 36U].b;
SDATA[351 : 320] = MEM[ADDR + 40U].b;
SDATA[383 : 352] = MEM[ADDR + 44U].b;
SDATA[415 : 384] = MEM[ADDR + 48U].b;
SDATA[447 : 416] = MEM[ADDR + 52U].b;
SDATA[479 : 448] = MEM[ADDR + 56U].b;
SDATA[511 : 480] = MEM[ADDR + 60U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_BUFFER_LOAD_B32

8

Read 1 dword from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_BUFFER_LOAD_B64

9

Read 2 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_BUFFER_LOAD_B128

10

Read 4 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b;
SDATA[95 : 64] = MEM[ADDR + 8U].b;
SDATA[127 : 96] = MEM[ADDR + 12U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_BUFFER_LOAD_B256

11

Read 8 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b;
SDATA[95 : 64] = MEM[ADDR + 8U].b;
SDATA[127 : 96] = MEM[ADDR + 12U].b;
SDATA[159 : 128] = MEM[ADDR + 16U].b;
SDATA[191 : 160] = MEM[ADDR + 20U].b;
SDATA[223 : 192] = MEM[ADDR + 24U].b;
SDATA[255 : 224] = MEM[ADDR + 28U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_BUFFER_LOAD_B512

12

Read 16 dwords from scalar data cache.
SDATA[31 : 0] = MEM[ADDR + 0U].b;
SDATA[63 : 32] = MEM[ADDR + 4U].b;
SDATA[95 : 64] = MEM[ADDR + 8U].b;
SDATA[127 : 96] = MEM[ADDR + 12U].b;
SDATA[159 : 128] = MEM[ADDR + 16U].b;
SDATA[191 : 160] = MEM[ADDR + 20U].b;
SDATA[223 : 192] = MEM[ADDR + 24U].b;
SDATA[255 : 224] = MEM[ADDR + 28U].b;
SDATA[287 : 256] = MEM[ADDR + 32U].b;
SDATA[319 : 288] = MEM[ADDR + 36U].b;
SDATA[351 : 320] = MEM[ADDR + 40U].b;
SDATA[383 : 352] = MEM[ADDR + 44U].b;
SDATA[415 : 384] = MEM[ADDR + 48U].b;
SDATA[447 : 416] = MEM[ADDR + 52U].b;
SDATA[479 : 448] = MEM[ADDR + 56U].b;
SDATA[511 : 480] = MEM[ADDR + 60U].b

Notes
See S_LOAD_B32 for details on the offset input.

S_GL1_INV

32

Invalidate the GL1 cache only.

S_DCACHE_INV

33

Invalidate the scalar data L0 cache.

16.7. VOP2 Instructions

Instructions in this format may use a 32-bit literal constant or DPP that occurs immediately after the
instruction.

V_CNDMASK_B32

1

Conditional mask on each thread.
D0.u = VCC.u64[laneId] ? S1.u : S0.u

Notes
In VOP3 the VCC source may be a scalar GPR specified in S2.u.
Floating-point modifiers are valid for this instruction if S0.u and S1.u are 32-bit floating point values. This
instruction is suitable for negating or taking the absolute value of a floating-point value.
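The per-lane selection can be pictured with a small Python model. This is an illustrative sketch only: the function name is hypothetical, VGPRs are represented as Python lists with one entry per lane, and VCC is an integer bit mask whose bit laneId controls that lane.

```python
def v_cndmask_b32(s0, s1, vcc: int):
    """Per lane: take S1 where the VCC bit is set, otherwise S0."""
    return [
        s1[lane] if (vcc >> lane) & 1 else s0[lane]
        for lane in range(len(s0))
    ]


# Four lanes; VCC = 0b0101 selects S1 in lanes 0 and 2.
result = v_cndmask_b32([10, 11, 12, 13], [20, 21, 22, 23], 0b0101)
```

Note the operand order: a set mask bit picks the second source (S1), not the first.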

V_DOT2ACC_F32_F16

2

Dot product of packed FP16 values, accumulate with destination.
// Accumulate with destination
D0.f += 32'F(S0[15 : 0].f16) * 32'F(S1[15 : 0].f16);
D0.f += 32'F(S0[31 : 16].f16) * 32'F(S1[31 : 16].f16)

V_ADD_F32

3

Add two single-precision values.
D0.f = S0.f + S1.f

Notes
0.5ULP precision, denormals are supported.

V_SUB_F32

4

Subtract the second single-precision input from the first input.
D0.f = S0.f - S1.f

V_SUBREV_F32

5

Subtract the first single-precision input from the second input.
D0.f = S1.f - S0.f

V_FMAC_DX9_ZERO_F32

6

Multiply two single-precision values and accumulate the result with the destination. Follows DX9 rules where
0.0 times anything produces 0.0 (this is not IEEE compliant).
if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then
    // DX9 rules, 0.0 * x = 0.0
    D0.f = S2.f
else
    D0.f = fma(S0.f, S1.f, D0.f)
endif

V_MUL_DX9_ZERO_F32

7

Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE
compliant).
if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then
    // DX9 rules, 0.0 * x = 0.0
    D0.f = 0.0F
else
    D0.f = S0.f * S1.f
endif

V_MUL_F32

8

Multiply two single-precision values.
D0.f = S0.f * S1.f

Notes
0.5ULP precision, denormals are supported.

V_MUL_I32_I24

9

Multiply two signed 24-bit integers and store the result as a signed 32-bit integer.
D0.i = 32'I(S0.i24) * 32'I(S1.i24)

Notes
This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision
floating point multiplier. See also V_MUL_HI_I32_I24.

V_MUL_HI_I32_I24

10

Multiply two signed 24-bit integers and store the high 32 bits of the result as a signed 32-bit integer.
D0.i = 32'I(64'I(S0.i24) * 64'I(S1.i24) >> 32U)

Notes
See also V_MUL_I32_I24.
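The low and high halves produced by the two signed 24-bit multiplies can be checked against a simple model. The Python sketch below assumes that S0.i24 and S1.i24 denote the low 24 bits of each source, sign-extended; the function names are hypothetical and the results are returned as 32-bit register bit patterns.

```python
def sext24(value: int) -> int:
    """Interpret the low 24 bits of value as a signed integer."""
    value &= 0xFFFFFF
    return value - (1 << 24) if value & 0x800000 else value


def v_mul_i32_i24(s0: int, s1: int) -> int:
    """Low 32 bits of the signed 24-bit product."""
    return (sext24(s0) * sext24(s1)) & 0xFFFFFFFF


def v_mul_hi_i32_i24(s0: int, s1: int) -> int:
    """High 32 bits of the 64-bit signed 24-bit product."""
    return ((sext24(s0) * sext24(s1)) >> 32) & 0xFFFFFFFF
```

Used together, the two results form the full product; for (2^23 - 1) squared the low dword is 0xFF000001 and the high dword is 0x3FFF.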

V_MUL_U32_U24

11

Multiply two unsigned 24-bit integers and store the result as an unsigned 32-bit integer.
D0.u = 32'U(S0.u24) * 32'U(S1.u24)

Notes
This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision
floating point multiplier. See also V_MUL_HI_U32_U24.

V_MUL_HI_U32_U24

12

Multiply two unsigned 24-bit integers and store the high 32 bits of the result as an unsigned 32-bit integer.
D0.u = 32'U(64'U(S0.u24) * 64'U(S1.u24) >> 32U)

Notes
See also V_MUL_U32_U24.

V_MIN_F32

15

Compute the minimum of two floats.
LT_NEG_ZERO = lambda(a, b) (
    ((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b))));
// Version of comparison where -0.0 < +0.0, differs from IEEE
if WAVE_MODE.IEEE then
    if isSignalNAN(64'F(S0.f)) then
        D0.f = 32'F(cvtToQuietNAN(64'F(S0.f)))
    elsif isSignalNAN(64'F(S1.f)) then
        D0.f = 32'F(cvtToQuietNAN(64'F(S1.f)))
    elsif isQuietNAN(64'F(S1.f)) then
        D0.f = S0.f
    elsif isQuietNAN(64'F(S0.f)) then
        D0.f = S1.f
    elsif LT_NEG_ZERO(S0.f, S1.f) then
        // NOTE: -0<+0 is TRUE in this comparison
        D0.f = S0.f
    else
        D0.f = S1.f
    endif
else
    if isNAN(64'F(S1.f)) then
        D0.f = S0.f
    elsif isNAN(64'F(S0.f)) then
        D0.f = S1.f
    elsif LT_NEG_ZERO(S0.f, S1.f) then
        // NOTE: -0<+0 is TRUE in this comparison
        D0.f = S0.f
    else
        D0.f = S1.f
    endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.
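The NaN and signed-zero rules in the V_MIN_F32 pseudocode can be exercised with a bit-level model. The Python sketch below is an illustration only: it works on raw 32-bit patterns, assumes the common IEEE 754 convention that a signaling NaN has the top mantissa bit clear and is quieted by setting that bit, and does not model denormal flushing. All function names are hypothetical.

```python
import struct


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def is_nan(bits: int) -> bool:
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def is_signal_nan(bits: int) -> bool:
    return is_nan(bits) and not (bits & 0x00400000)


def lt_neg_zero(a: int, b: int) -> bool:
    """a < b, with the extra rule that -0.0 < +0.0."""
    fa, fb = bits_to_f32(a), bits_to_f32(b)
    both_zero = fa == 0.0 and fb == 0.0
    return fa < fb or (both_zero and (a >> 31) == 1 and (b >> 31) == 0)


def v_min_f32(s0: int, s1: int, ieee_mode: bool) -> int:
    if ieee_mode and is_signal_nan(s0):
        return s0 | 0x00400000      # quieted copy of S0
    if ieee_mode and is_signal_nan(s1):
        return s1 | 0x00400000      # quieted copy of S1
    if is_nan(s1):                  # remaining NaNs: return the other operand
        return s0
    if is_nan(s0):
        return s1
    return s0 if lt_neg_zero(s0, s1) else s1
```

With this model, the minimum of +0.0 and -0.0 is -0.0 in either operand order, and a quiet NaN paired with 1.0 returns 1.0.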

V_MAX_F32

16

Compute the maximum of two floats.
GT_NEG_ZERO = lambda(a, b) (
    ((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b))));
// Version of comparison where +0.0 > -0.0, differs from IEEE
if WAVE_MODE.IEEE then
    if isSignalNAN(64'F(S0.f)) then
        D0.f = 32'F(cvtToQuietNAN(64'F(S0.f)))
    elsif isSignalNAN(64'F(S1.f)) then
        D0.f = 32'F(cvtToQuietNAN(64'F(S1.f)))
    elsif isQuietNAN(64'F(S1.f)) then
        D0.f = S0.f
    elsif isQuietNAN(64'F(S0.f)) then
        D0.f = S1.f
    elsif GT_NEG_ZERO(S0.f, S1.f) then
        // NOTE: +0>-0 is TRUE in this comparison
        D0.f = S0.f
    else
        D0.f = S1.f
    endif
else
    if isNAN(64'F(S1.f)) then
        D0.f = S0.f
    elsif isNAN(64'F(S0.f)) then
        D0.f = S1.f
    elsif GT_NEG_ZERO(S0.f, S1.f) then
        // NOTE: +0>-0 is TRUE in this comparison
        D0.f = S0.f
    else
        D0.f = S1.f
    endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.

V_MIN_I32

17

Compute the minimum of two signed integers.
D0.i = S0.i < S1.i ? S0.i : S1.i

V_MAX_I32

18

Compute the maximum of two signed integers.
D0.i = S0.i >= S1.i ? S0.i : S1.i

V_MIN_U32

19

Compute the minimum of two unsigned integers.
D0.u = S0.u < S1.u ? S0.u : S1.u

V_MAX_U32

20

Compute the maximum of two unsigned integers.
D0.u = S0.u >= S1.u ? S0.u : S1.u

V_LSHLREV_B32

24

Logical shift left with shift count in the first operand.
D0.u = S1.u << S0[4 : 0].u

V_LSHRREV_B32

25

Logical shift right with shift count in the first operand.
D0.u = S1.u >> S0[4 : 0].u

V_ASHRREV_I32

26

Arithmetic shift right (preserve sign bit) with shift count in the first operand.
D0.i = S1.i >> S0[4 : 0].u

V_AND_B32

27

Bitwise AND.
D0.u = (S0.u & S1.u)

Notes
Input and output modifiers not supported.

V_OR_B32

28

Bitwise OR.
D0.u = (S0.u | S1.u)

Notes
Input and output modifiers not supported.

V_XOR_B32

29

Bitwise XOR.
D0.u = (S0.u ^ S1.u)

Notes
Input and output modifiers not supported.

V_XNOR_B32

30

Bitwise XNOR.
D0.u = ~(S0.u ^ S1.u)

Notes
Input and output modifiers not supported.

V_ADD_CO_CI_U32

32

Add two unsigned integers and a carry-in from VCC. Store the result and also save the carry-out to VCC.
tmp = 64'U(S0.u) + 64'U(S1.u) + VCC.u64[laneId].u64;
VCC.u64[laneId] = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at
S2.u.

V_SUB_CO_CI_U32

33

Subtract the second unsigned integer from the first unsigned integer and then subtract a carry-in from VCC.
Store the result and also save the carry-out to VCC.
tmp = S0.u - S1.u - VCC.u64[laneId].u;
VCC.u64[laneId] = 64'U(S1.u) + VCC.u64[laneId].u64 > 64'U(S0.u) ? 1'1U : 1'0U;
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at
S2.u.
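The carry and borrow rules for V_ADD_CO_CI_U32 and V_SUB_CO_CI_U32 can be modelled for a single lane as follows. This Python sketch uses hypothetical function names and passes the lane's VCC bit in and out as a plain 0 or 1.

```python
MASK32 = 0xFFFFFFFF


def v_add_co_ci_u32(s0: int, s1: int, carry_in: int):
    """Return (D0, carry_out) for one lane."""
    tmp = (s0 & MASK32) + (s1 & MASK32) + (carry_in & 1)
    return tmp & MASK32, int(tmp >= 0x100000000)


def v_sub_co_ci_u32(s0: int, s1: int, borrow_in: int):
    """Return (D0, borrow_out) for one lane, computing S0 - S1 - borrow_in."""
    s0, s1, borrow_in = s0 & MASK32, s1 & MASK32, borrow_in & 1
    result = (s0 - s1 - borrow_in) & MASK32
    return result, int(s1 + borrow_in > s0)


def add_u64(a: int, b: int) -> int:
    """Usage example: a 64-bit add built from two chained 32-bit adds."""
    lo, carry = v_add_co_ci_u32(a & MASK32, b & MASK32, 0)
    hi, _ = v_add_co_ci_u32(a >> 32, b >> 32, carry)
    return (hi << 32) | lo
```

Chaining the carry from the low dword into the high dword is the typical reason to use the carry-in forms instead of V_ADD_NC_U32.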

V_SUBREV_CO_CI_U32

34

Subtract the first unsigned integer from the second unsigned integer and then subtract a carry-in from VCC.
Store the result and also save the carry-out to VCC.
tmp = S1.u - S0.u - VCC.u64[laneId].u;
VCC.u64[laneId] = 64'U(S0.u) + VCC.u64[laneId].u64 > 64'U(S1.u) ? 1'1U : 1'0U;
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at
S2.u.

V_ADD_NC_U32

37

Add two unsigned integers. No carry-in or carry-out.
D0.u = S0.u + S1.u

V_SUB_NC_U32

38

Subtract the second unsigned integer from the first unsigned integer. No carry-in or carry-out.
D0.u = S0.u - S1.u

V_SUBREV_NC_U32

39

Subtract the first unsigned integer from the second unsigned integer. No carry-in or carry-out.
D0.u = S1.u - S0.u

V_FMAC_F32

43

Fused multiply-add of single-precision floats, accumulate with destination.
D0.f = fma(S0.f, S1.f, D0.f)

V_FMAMK_F32

44

Multiply a single-precision float with a literal constant and add a second single-precision float using fused
multiply-add.
D0.f = fma(S0.f, SIMM32.f, S1.f)

Notes
This opcode cannot use the VOP3 encoding and cannot use input/output modifiers.

V_FMAAK_F32

45

Multiply two single-precision floats and add a literal constant using fused multiply-add.
D0.f = fma(S0.f, S1.f, SIMM32.f)

Notes
This opcode cannot use the VOP3 encoding and cannot use input/output modifiers.

V_CVT_PK_RTZ_F16_F32

47

Convert two single-precision floats into a packed FP16 result and round to zero (ignore the current rounding
mode).
D0[15 : 0].f16 = f32_to_f16(S0.f);
D0[31 : 16].f16 = f32_to_f16(S1.f);
// Round-toward-zero regardless of current round mode setting in hardware.

Notes
This opcode is intended for use with 16-bit compressed exports. See V_CVT_F16_F32 for a version that respects
the current rounding mode.

V_ADD_F16

50

Add two FP16 values.
D0.f16 = S0.f16 + S1.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_SUB_F16

51

Subtract the second FP16 value from the first.
D0.f16 = S0.f16 - S1.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_SUBREV_F16

52

Subtract the first FP16 value from the second.
D0.f16 = S1.f16 - S0.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_MUL_F16

53

Multiply two FP16 values.
D0.f16 = S0.f16 * S1.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_FMAC_F16

54

Fused multiply-add of FP16 values, accumulate with destination.
D0.f16 = fma(S0.f16, S1.f16, D0.f16)

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_FMAMK_F16

55

Multiply an FP16 value with a literal constant and add a second FP16 value using fused multiply-add.
D0.f16 = fma(S0.f16, SIMM32.f16, S1.f16)

Notes
This opcode cannot use the VOP3 encoding and cannot use input/output modifiers. Supports round mode,
exception flags, saturation.

V_FMAAK_F16

56

Multiply two FP16 values and add a literal constant using fused multiply-add.
D0.f16 = fma(S0.f16, S1.f16, SIMM32.f16)

Notes
This opcode cannot use the VOP3 encoding and cannot use input/output modifiers. Supports round mode,
exception flags, saturation.

V_MAX_F16

57

Compute the maximum of two FP16 values.
GT_NEG_ZERO = lambda(a, b) (
    ((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b))));
// Version of comparison where +0.0 > -0.0, differs from IEEE
if WAVE_MODE.IEEE then
    if isSignalNAN(64'F(S0.f16)) then
        D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16)))
    elsif isSignalNAN(64'F(S1.f16)) then
        D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16)))
    elsif isQuietNAN(64'F(S1.f16)) then
        D0.f16 = S0.f16
    elsif isQuietNAN(64'F(S0.f16)) then
        D0.f16 = S1.f16
    elsif GT_NEG_ZERO(S0.f16, S1.f16) then
        // NOTE: +0>-0 is TRUE in this comparison
        D0.f16 = S0.f16
    else
        D0.f16 = S1.f16
    endif
else
    if isNAN(64'F(S1.f16)) then
        D0.f16 = S0.f16
    elsif isNAN(64'F(S0.f16)) then
        D0.f16 = S1.f16
    elsif GT_NEG_ZERO(S0.f16, S1.f16) then
        // NOTE: +0>-0 is TRUE in this comparison
        D0.f16 = S0.f16
    else
        D0.f16 = S1.f16
    endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.

V_MIN_F16

58

Compute the minimum of two floats.
LT_NEG_ZERO = lambda(a, b) (
((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b))));
// Version of comparison where -0.0 < +0.0, differs from IEEE
if WAVE_MODE.IEEE then
if isSignalNAN(64'F(S0.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16)))
elsif isSignalNAN(64'F(S1.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16)))
elsif isQuietNAN(64'F(S1.f16)) then
D0.f16 = S0.f16
elsif isQuietNAN(64'F(S0.f16)) then
D0.f16 = S1.f16
elsif LT_NEG_ZERO(S0.f16, S1.f16) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f16 = S0.f16
else
D0.f16 = S1.f16
endif
else
if isNAN(64'F(S1.f16)) then
D0.f16 = S0.f16
elsif isNAN(64'F(S0.f16)) then
D0.f16 = S1.f16
elsif LT_NEG_ZERO(S0.f16, S1.f16) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f16 = S0.f16
else
D0.f16 = S1.f16
endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode. If input denorms are
disabled, the internal result of a min/max operation cannot be a denorm value, so the output denorm mode is
irrelevant. If input denorms are enabled, the internal min/max result can be a denorm, and this operation
outputs it as a denorm regardless of the output denorm mode.
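The signed-zero ordering used by V_MIN_F16 can be illustrated with a minimal Python sketch. It mirrors only the WAVE_MODE.IEEE == 0 branch of the pseudocode above and uses Python floats (64-bit) in place of FP16 values. The function names are hypothetical and chosen for illustration.

```python
import math

def lt_neg_zero(a: float, b: float) -> bool:
    """a < b, except that -0.0 is treated as less than +0.0."""
    both_zero = (a == 0.0) and (b == 0.0)
    a_negative = math.copysign(1.0, a) < 0.0
    b_negative = math.copysign(1.0, b) < 0.0
    return (a < b) or (both_zero and a_negative and not b_negative)

def v_min_non_ieee(s0: float, s1: float) -> float:
    """Sketch of the non-IEEE branch: a NaN operand yields the other operand."""
    if math.isnan(s1):
        return s0
    if math.isnan(s0):
        return s1
    return s0 if lt_neg_zero(s0, s1) else s1

# The minimum of +0.0 and -0.0 is -0.0 regardless of operand order.
print(v_min_non_ieee(0.0, -0.0), v_min_non_ieee(-0.0, 0.0))
```

A plain `a < b` comparison would return the second operand whenever both inputs are zero, so the result would depend on operand order. The extra sign test removes that dependence.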

V_LDEXP_F16

59

Load exponent.
Multiply an FP16 value by an integral power of 2; compare with the ldexp() function in C. The second argument
is an integer value.
D0.f16 = S0.f16 * f32_to_f16(2.0F ** 32'I(S1.i16))

V_PK_FMAC_F16

60

Multiply packed FP16 values and accumulate with destination.
D0[31 : 16].f16 = fma(S0[31 : 16].f16, S1[31 : 16].f16, D0[31 : 16].f16);
D0[15 : 0].f16 = fma(S0[15 : 0].f16, S1[15 : 0].f16, D0[15 : 0].f16)

Notes
VOP2 version of V_PK_FMA_F16 in which the third source VGPR address is the destination.

16.7.1. VOP2 using VOP3 or VOP3SD encoding
Instructions in this format may also be encoded as VOP3. VOP3 allows access to the extra control bits (e.g. ABS,
OMOD) at the expense of a larger instruction word. The VOP3 opcode is: VOP2 opcode + 0x100.
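The opcode relationship stated above can be sketched in a few lines of Python. The opcode numbers are the VOP2 numbers listed in this section; the function and constant names are hypothetical.

```python
VOP3_OFFSET_FOR_VOP2 = 0x100  # from the text: VOP3 opcode = VOP2 opcode + 0x100

def vop2_to_vop3_opcode(vop2_opcode: int) -> int:
    """Return the VOP3 opcode that corresponds to a VOP2 opcode."""
    return vop2_opcode + VOP3_OFFSET_FOR_VOP2

# VOP2 opcode numbers as listed in this section.
for name, opcode in {"V_MIN_F16": 58, "V_LDEXP_F16": 59}.items():
    print(f"{name}: VOP2 {opcode} -> VOP3 {vop2_to_vop3_opcode(opcode):#x}")
```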

16.8. VOP1 Instructions

Instructions in this format may use a 32-bit literal constant or DPP that occurs immediately after the
instruction.

V_NOP

0

Do nothing.

V_MOV_B32

1

Move data to a VGPR.
D0.b = S0.b

Notes
Floating-point modifiers are valid for this instruction if S0.u is a 32-bit floating point value. This instruction is
suitable for negating or taking the absolute value of a floating-point value.
Functional examples:
v_mov_b32 v0, v1        // Move v1 to v0
v_mov_b32 v0, -v1       // Set v0 to the negation of v1
v_mov_b32 v0, abs(v1)   // Set v0 to the absolute value of v1

V_READFIRSTLANE_B32

2

Copy one VGPR value from the lowest active lane to one SGPR.
declare lane : 32'U;
if WAVE64 then
// 64 lanes
if EXEC == 0x0LL then
lane = 0U;
// Force lane 0 if all lanes are disabled
else
lane = 32'U(s_ff1_i32_b64(EXEC));
// Lowest active lane
endif
else
// 32 lanes
if EXEC_LO.i == 0 then
lane = 0U;
// Force lane 0 if all lanes are disabled
else
lane = 32'U(s_ff1_i32_b32(EXEC_LO));
// Lowest active lane
endif
endif;
D0.b = VGPR[lane][SRC0.u]

Notes
Ignores EXEC mask for the VGPR read. Input and output modifiers not supported; this is an untyped operation.
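The lane selection above can be modeled in Python as a find-first-one on the EXEC mask with a fallback to lane 0. This is a sketch under the assumption that the VGPR file is represented as a list indexed by lane and then by register number; that layout and the function names are illustrative only.

```python
def first_active_lane(exec_mask: int, wave64: bool) -> int:
    """Lowest set bit of EXEC, or lane 0 when every lane is disabled."""
    mask = exec_mask & (0xFFFFFFFFFFFFFFFF if wave64 else 0xFFFFFFFF)
    if mask == 0:
        return 0  # force lane 0 if all lanes are disabled
    return (mask & -mask).bit_length() - 1

def v_readfirstlane_b32(vgpr, src0: int, exec_mask: int, wave64: bool = False) -> int:
    """Read register src0 from the selected lane; the read itself ignores EXEC."""
    return vgpr[first_active_lane(exec_mask, wave64)][src0]

# Hypothetical register file: 32 lanes, one register each, lane n holds n * 10.
vgpr = [[lane * 10] for lane in range(32)]
print(v_readfirstlane_b32(vgpr, 0, exec_mask=0b100))
```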

V_CVT_I32_F64

3

Convert from a double-precision float to a signed integer.
D0.i = f64_to_i32(S0.f64)

Notes
0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_F64_I32

4

Convert from a signed integer to a double-precision float.
D0.f64 = i32_to_f64(S0.i)

Notes
0ULP accuracy.

V_CVT_F32_I32

5

Convert from a signed integer to a single-precision float.
D0.f = i32_to_f32(S0.i)

Notes
0.5ULP accuracy.

V_CVT_F32_U32

6

Convert from an unsigned integer to a single-precision float.
D0.f = u32_to_f32(S0.u)

Notes
0.5ULP accuracy.

V_CVT_U32_F32

7

Convert from a single-precision float to an unsigned integer.
D0.u = f32_to_u32(S0.f)

Notes
1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_I32_F32

8

Convert from a single-precision float to a signed integer.
D0.i = f32_to_i32(S0.f)

Notes
1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_F16_F32

10

Convert from a single-precision float to an FP16 float.
D0.f16 = f32_to_f16(S0.f)

Notes
0.5ULP accuracy, supports input modifiers and creates FP16 denormals when appropriate. Flush denorms on
output if specified based on DP denorm mode. Output rounding based on DP rounding mode.

V_CVT_F32_F16

11

Convert from an FP16 float to a single-precision float.
D0.f = f16_to_f32(S0.f16)

Notes
0ULP accuracy, FP16 denormal inputs are accepted. Flush denorms on input if specified based on DP denorm
mode.

V_CVT_NEAREST_I32_F32

12

Convert from a single-precision float to a signed integer, round to nearest integer.
D0.i = f32_to_i32(floor(S0.f + 0.5F))

Notes
0.5ULP accuracy, denormals are supported.

V_CVT_FLOOR_I32_F32

13

Convert from a single-precision float to a signed integer, round down.
D0.i = f32_to_i32(floor(S0.f))

Notes
1ULP accuracy, denormals are supported.

V_CVT_OFF_F32_I4

14

4-bit signed int to 32-bit float. Used for interpolation in shader.
Lookup table on S0[3:0]:
S0 binary Result
1000 -0.5000f
1001 -0.4375f
1010 -0.3750f
1011 -0.3125f
1100 -0.2500f
1101 -0.1875f
1110 -0.1250f
1111 -0.0625f
0000 +0.0000f
0001 +0.0625f
0010 +0.1250f
0011 +0.1875f
0100 +0.2500f
0101 +0.3125f
0110 +0.3750f
0111 +0.4375f
declare CVT_OFF_TABLE : 32'F[16];
D0.f = CVT_OFF_TABLE[S0.u[3 : 0]]
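The lookup table above is equivalent to sign-extending the low four bits of S0 and dividing by 16. The following Python sketch rebuilds the table that way; the function name is hypothetical.

```python
def v_cvt_off_f32_i4(s0: int) -> float:
    """Interpret S0[3:0] as a signed 4-bit integer and scale it by 1/16."""
    nibble = s0 & 0xF
    signed = nibble - 16 if nibble & 0x8 else nibble
    return signed / 16.0

# Rebuild the 16-entry table, indexed by the raw 4-bit value.
CVT_OFF_TABLE = [v_cvt_off_f32_i4(i) for i in range(16)]
for index, value in enumerate(CVT_OFF_TABLE):
    print(f"{index:04b} {value:+.4f}")
```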

V_CVT_F32_F64

15

Convert from a double-precision float to a single-precision float.
D0.f = f64_to_f32(S0.f64)

Notes
0.5ULP accuracy, denormals are supported.

V_CVT_F64_F32

16

Convert from a single-precision float to a double-precision float.
D0.f64 = f32_to_f64(S0.f)

Notes
0ULP accuracy, denormals are supported.

V_CVT_F32_UBYTE0

17

Convert an unsigned byte (byte 0) to a single-precision float.
D0.f = u32_to_f32(S0.u[7 : 0].u)

V_CVT_F32_UBYTE1

18

Convert an unsigned byte (byte 1) to a single-precision float.
D0.f = u32_to_f32(S0.u[15 : 8].u)

V_CVT_F32_UBYTE2

19

Convert an unsigned byte (byte 2) to a single-precision float.
D0.f = u32_to_f32(S0.u[23 : 16].u)

V_CVT_F32_UBYTE3

20

Convert an unsigned byte (byte 3) to a single-precision float.
D0.f = u32_to_f32(S0.u[31 : 24].u)
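The four V_CVT_F32_UBYTEn instructions differ only in which byte of S0 they select. A minimal Python sketch, with the byte index passed as a hypothetical parameter:

```python
def v_cvt_f32_ubyte(s0: int, byte_index: int) -> float:
    """Convert byte `byte_index` (0 = least significant) of a 32-bit value to float."""
    if not 0 <= byte_index <= 3:
        raise ValueError("byte_index must be in 0..3")
    return float((s0 >> (8 * byte_index)) & 0xFF)

packed = 0x11223344
print([v_cvt_f32_ubyte(packed, n) for n in range(4)])
```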

V_CVT_U32_F64

21

Convert from a double-precision float to an unsigned integer.
D0.u = f64_to_u32(S0.f64)

Notes
0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_F64_U32

22

Convert from an unsigned integer to a double-precision float.
D0.f64 = u32_to_f64(S0.u)

Notes
0ULP accuracy.

V_TRUNC_F64

23

Return integer part of a number with round-to-zero semantics.
D0.f64 = trunc(S0.f64)

V_CEIL_F64

24

Round up to next whole integer.
D0.f64 = trunc(S0.f64);
if ((S0.f64 > 0.0) && (S0.f64 != D0.f64)) then
D0.f64 += 1.0
endif

V_RNDNE_F64

25

Round-to-nearest-even semantics.
D0.f64 = floor(S0.f64 + 0.5);
if (isEven(floor(S0.f64)) && (fract(S0.f64) == 0.5)) then
D0.f64 -= 1.0
endif
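The V_RNDNE_F64 pseudocode can be transcribed into Python to check how it treats ties. This sketch handles finite inputs only and does not preserve the sign of a zero result. Python's built-in round(), which also rounds half to even, serves as a reference.

```python
import math

def v_rndne(s0: float) -> float:
    """Round to nearest, ties to even, following the pseudocode (finite inputs only)."""
    d0 = math.floor(s0 + 0.5)
    fract = s0 - math.floor(s0)
    # A tie with an even integer just below would have been rounded up; undo that.
    if math.floor(s0) % 2 == 0 and fract == 0.5:
        d0 -= 1
    return float(d0)

for value in (0.5, 1.5, 2.5, 3.5, -1.5, -2.5):
    print(value, v_rndne(value), round(value))
```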

V_FLOOR_F64

26

Round down to previous whole integer.
D0.f64 = trunc(S0.f64);
if ((S0.f64 < 0.0) && (S0.f64 != D0.f64)) then
D0.f64 += -1.0
endif

V_PIPEFLUSH

27

Flush the VALU destination cache.

V_MOV_B16

28

Move data to a VGPR.
D0.b16 = S0.b16

Notes
Floating-point modifiers are valid for this instruction if S0.u16 is a 16-bit floating point value. This instruction is
suitable for negating or taking the absolute value of a floating-point value.

V_FRACT_F32

32

Return fractional portion of a number.
D0.f = S0.f + -floor(S0.f)

Notes
0.5ULP accuracy, denormals are accepted.
This complies with the DX specification of fract where the function behaves like an extension of integer
modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8
in DX.
Obey round mode, result clamped to 0x3f7fffff.
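The DX-style definition and the clamp can be illustrated with a short Python sketch. It computes in 64-bit floats rather than FP32, so it shows the behavior rather than the exact rounding, and it handles finite inputs only. The helper names are hypothetical.

```python
import math
import struct

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

FRACT_MAX_F32 = bits_to_f32(0x3F7FFFFF)  # largest single-precision value below 1.0

def v_fract_f32(s0: float) -> float:
    """fract(x) = x - floor(x), clamped so the result never reaches 1.0."""
    return min(s0 - math.floor(s0), FRACT_MAX_F32)

print(v_fract_f32(-1.2))    # approximately 0.8, as in the DX example
print(v_fract_f32(-1e-30))  # clamped to FRACT_MAX_F32
```

Without the clamp, a tiny negative input would produce exactly 1.0 after rounding, which is outside the expected [0.0, 1.0) range.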

V_TRUNC_F32

33

Return integer part of a number with round-to-zero semantics.
D0.f = trunc(S0.f)

V_CEIL_F32

34

Round up to next whole integer.
D0.f = trunc(S0.f);
if ((S0.f > 0.0F) && (S0.f != D0.f)) then
D0.f += 1.0F
endif

V_RNDNE_F32

35

Round-to-nearest-even semantics.
D0.f = floor(S0.f + 0.5F);
if (isEven(64'F(floor(S0.f))) && (fract(S0.f) == 0.5F)) then
D0.f -= 1.0F
endif

V_FLOOR_F32

36

Round down to previous whole integer.
D0.f = trunc(S0.f);
if ((S0.f < 0.0F) && (S0.f != D0.f)) then
D0.f += -1.0F
endif

V_EXP_F32

37

Base 2 exponentiation.
D0.f = pow(2.0F, S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_EXP_F32(0xff800000) ⇒ 0x00000000 // exp(-INF) = 0
V_EXP_F32(0x80000000) ⇒ 0x3f800000 // exp(-0.0) = 1
V_EXP_F32(0x7f800000) ⇒ 0x7f800000 // exp(+INF) = +INF

V_LOG_F32

39

Base 2 logarithm.
D0.f = log2(S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_LOG_F32(0xff800000) ⇒ 0xffc00000 // log(-INF) = NAN
V_LOG_F32(0xbf800000) ⇒ 0xffc00000 // log(-1.0) = NAN
V_LOG_F32(0x80000000) ⇒ 0xff800000 // log(-0.0) = -INF
V_LOG_F32(0x00000000) ⇒ 0xff800000 // log(+0.0) = -INF
V_LOG_F32(0x3f800000) ⇒ 0x00000000 // log(+1.0) = 0
V_LOG_F32(0x7f800000) ⇒ 0x7f800000 // log(+INF) = +INF

V_RCP_F32

42

Compute reciprocal with IEEE rules.
D0.f = 1.0F / S0.f

Notes
1ULP accuracy. Accuracy converges to < 0.5ULP when using the Newton-Raphson method and 2 FMA
operations. Denormals are flushed.
Functional examples:
V_RCP_F32(0xff800000) ⇒ 0x80000000 // rcp(-INF) = -0
V_RCP_F32(0xc0000000) ⇒ 0xbf000000 // rcp(-2.0) = -0.5
V_RCP_F32(0x80000000) ⇒ 0xff800000 // rcp(-0.0) = -INF
V_RCP_F32(0x00000000) ⇒ 0x7f800000 // rcp(+0.0) = +INF
V_RCP_F32(0x7f800000) ⇒ 0x00000000 // rcp(+INF) = +0
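The Newton-Raphson refinement mentioned in the notes can be sketched as follows. This is one common two-step formulation and is not necessarily the exact sequence a compiler emits. Each line corresponds to one fused multiply-add on hardware, but it is written here with ordinary Python arithmetic on 64-bit floats.

```python
def refine_reciprocal(x: float, y0: float) -> float:
    """One Newton-Raphson step that improves y0 as an estimate of 1/x."""
    error = 1.0 - x * y0     # FMA 1: residual, i.e. fma(-x, y0, 1.0)
    return y0 + y0 * error   # FMA 2: corrected estimate, i.e. fma(y0, error, y0)

x = 3.0
y0 = (1.0 / x) * (1.0 + 1e-4)  # hypothetical initial estimate with a small relative error
y1 = refine_reciprocal(x, y0)
print(abs(y0 * x - 1.0), abs(y1 * x - 1.0))
```

Each step roughly squares the relative error, which is why a single refinement of a 1ULP estimate is enough to approach correctly rounded accuracy.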

V_RCP_IFLAG_F32

43

Compute reciprocal as part of integer divide.
D0.f = 1.0F / S0.f;
// Can only raise integer DIV_BY_ZERO exception

Notes
Can raise integer DIV_BY_ZERO exception but cannot raise floating-point exceptions. To be used in an integer
reciprocal macro by the compiler with one of the sequences listed below (depending on signed or unsigned
operation).
Unsigned usage:
CVT_F32_U32
RCP_IFLAG_F32
MUL_F32 (2**32 - 1)
CVT_U32_F32
Signed usage:
CVT_F32_I32
RCP_IFLAG_F32
MUL_F32 (2**31 - 1)
CVT_I32_F32

V_RSQ_F32

46

Reciprocal square root with IEEE rules.
D0.f = 1.0F / sqrt(S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_RSQ_F32(0xff800000) ⇒ 0xffc00000 // rsq(-INF) = NAN
V_RSQ_F32(0x80000000) ⇒ 0xff800000 // rsq(-0.0) = -INF
V_RSQ_F32(0x00000000) ⇒ 0x7f800000 // rsq(+0.0) = +INF
V_RSQ_F32(0x40800000) ⇒ 0x3f000000 // rsq(+4.0) = +0.5
V_RSQ_F32(0x7f800000) ⇒ 0x00000000 // rsq(+INF) = +0

V_RCP_F64

47

Reciprocal with IEEE rules.
D0.f64 = 1.0 / S0.f64

Notes
This opcode has (2**29)ULP accuracy and supports denormals.

V_RSQ_F64

49

Reciprocal square root with IEEE rules.
D0.f64 = 1.0 / sqrt(S0.f64)

Notes
This opcode has (2**29)ULP accuracy and supports denormals.

V_SQRT_F32

51

Square root.
D0.f = sqrt(S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_SQRT_F32(0xff800000) ⇒ 0xffc00000 // sqrt(-INF) = NAN
V_SQRT_F32(0x80000000) ⇒ 0x80000000 // sqrt(-0.0) = -0
V_SQRT_F32(0x00000000) ⇒ 0x00000000 // sqrt(+0.0) = +0
V_SQRT_F32(0x40800000) ⇒ 0x40000000 // sqrt(+4.0) = +2.0
V_SQRT_F32(0x7f800000) ⇒ 0x7f800000 // sqrt(+INF) = +INF

V_SQRT_F64

52

Square root.
D0.f64 = sqrt(S0.f64)

Notes
This opcode has (2**29)ULP accuracy and supports denormals.

V_SIN_F32

53

Trigonometric sine.
D0.f = 32'F(sin(64'F(S0.f) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_SIN_F32(0xff800000) ⇒ 0xffc00000 // sin(-INF) = NAN
V_SIN_F32(0xff7fffff) ⇒ 0x00000000 // -MaxFloat, finite
V_SIN_F32(0x80000000) ⇒ 0x80000000 // sin(-0.0) = -0
V_SIN_F32(0x3e800000) ⇒ 0x3f800000 // sin(0.25) = 1
V_SIN_F32(0x7f800000) ⇒ 0xffc00000 // sin(+INF) = NAN
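Because the input is multiplied by 2*PI, the operand is expressed in revolutions rather than radians. A minimal Python sketch follows. It returns NaN for infinite and NaN inputs as in the examples above, and it does not reproduce the listed results for very large magnitudes, where argument reduction differs.

```python
import math

def v_sin_f32(s0: float) -> float:
    """sin(2*pi*s0); the input is in revolutions, so 0.25 is a quarter turn."""
    if math.isinf(s0) or math.isnan(s0):
        return math.nan
    return math.sin(s0 * 2.0 * math.pi)

print(v_sin_f32(0.25))      # a quarter revolution gives approximately 1.0
print(v_sin_f32(math.inf))  # nan
```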

V_COS_F32

54

Trigonometric cosine.
D0.f = 32'F(cos(64'F(S0.f) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_COS_F32(0xff800000) ⇒ 0xffc00000 // cos(-INF) = NAN
V_COS_F32(0xff7fffff) ⇒ 0x3f800000 // -MaxFloat, finite
V_COS_F32(0x80000000) ⇒ 0x3f800000 // cos(-0.0) = 1
V_COS_F32(0x3e800000) ⇒ 0x00000000 // cos(0.25) = 0
V_COS_F32(0x7f800000) ⇒ 0xffc00000 // cos(+INF) = NAN

V_NOT_B32

55

Bitwise negation.
D0.u = ~S0.u

Notes
Input and output modifiers not supported.

V_BFREV_B32

56

Bitfield reverse.
D0.u[31 : 0] = S0.u[0 : 31]

Notes
Input and output modifiers not supported.
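The reversal maps bit i of the source to bit 31 - i of the destination. A small Python sketch with a hypothetical function name:

```python
def v_bfrev_b32(s0: int) -> int:
    """Reverse the order of the 32 bits in s0."""
    s0 &= 0xFFFFFFFF
    # Format as a 32-character binary string, reverse it, and parse it back.
    return int(f"{s0:032b}"[::-1], 2)

print(hex(v_bfrev_b32(0x00000001)))
print(hex(v_bfrev_b32(0x0000FFFF)))
```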

V_CLZ_I32_U32

57

Count leading zeros.
Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones.
D0.i = -1;
// Set if no ones are found
for i in 0 : 31 do
// Search from MSB
if S0.u[31 - i] == 1'1U then
D0.i = i;
break
endif
endfor

Notes
Compare with S_CLZ_I32_U32, which performs the equivalent operation in the scalar ALU.
Functional examples:
V_CLZ_I32_U32(0x00000000) ⇒ 0xffffffff
V_CLZ_I32_U32(0x800000ff) ⇒ 0
V_CLZ_I32_U32(0x100000ff) ⇒ 3
V_CLZ_I32_U32(0x0000ffff) ⇒ 16
V_CLZ_I32_U32(0x00000001) ⇒ 31
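The functional examples above can be reproduced with a direct Python transcription of the pseudocode. The result -1 corresponds to 0xffffffff when viewed as a 32-bit pattern.

```python
def v_clz_i32_u32(s0: int) -> int:
    """Count leading zeros of a 32-bit value; -1 if no bit is set."""
    for i in range(32):
        if (s0 >> (31 - i)) & 1:  # search from the MSB
            return i
    return -1

for value in (0x00000000, 0x800000FF, 0x100000FF, 0x0000FFFF, 0x00000001):
    print(f"{value:#010x} -> {v_clz_i32_u32(value)}")
```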

V_CTZ_I32_B32

58

Count trailing zeros.
Returns the bit position of the first one from the LSB, or -1 if there are no ones.
D0.i = -1;
// Set if no ones are found
for i in 0 : 31 do
// Search from LSB
if S0.u[i] == 1'1U then
D0.i = i;
break
endif
endfor

Notes
Compare with S_CTZ_I32_B32, which performs the equivalent operation in the scalar ALU.
Functional examples:
V_CTZ_I32_B32(0x00000000) ⇒ 0xffffffff
V_CTZ_I32_B32(0xff000001) ⇒ 0
V_CTZ_I32_B32(0xff000008) ⇒ 3
V_CTZ_I32_B32(0xffff0000) ⇒ 16
V_CTZ_I32_B32(0x80000000) ⇒ 31
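A Python sketch of the same operation is shown below. It isolates the lowest set bit with `x & -x` instead of looping; this is an implementation choice for the sketch and is not taken from the pseudocode. The result -1 corresponds to 0xffffffff as a 32-bit pattern.

```python
def v_ctz_i32_b32(s0: int) -> int:
    """Bit position of the lowest set bit of a 32-bit value; -1 if no bit is set."""
    s0 &= 0xFFFFFFFF
    if s0 == 0:
        return -1
    return (s0 & -s0).bit_length() - 1

for value in (0x00000000, 0xFF000001, 0xFF000008, 0xFFFF0000, 0x80000000):
    print(f"{value:#010x} -> {v_ctz_i32_b32(value)}")
```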

V_CLS_I32

59

Count leading sign bits.
Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the
same.
D0.i = -1;
// Set if all bits are the same
for i in 1 : 31 do
// Search from MSB
if S0.i[31 - i] != S0.i[31] then
D0.i = i;
break
endif
endfor

Notes
Compare with S_CLS_I32, which performs the equivalent operation in the scalar ALU.
Functional examples:
V_CLS_I32(0x00000000) ⇒ 0xffffffff
V_CLS_I32(0x40000000) ⇒ 1
V_CLS_I32(0x80000000) ⇒ 1
V_CLS_I32(0x0fffffff) ⇒ 4
V_CLS_I32(0xffff0000) ⇒ 16
V_CLS_I32(0xfffffffe) ⇒ 31
V_CLS_I32(0xffffffff) ⇒ 0xffffffff
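The examples above can be checked against a direct Python transcription of the pseudocode. The search starts at bit 30, so the returned count includes the sign bit itself.

```python
def v_cls_i32(s0: int) -> int:
    """Count leading bits equal to the sign bit; -1 if all 32 bits are the same."""
    s0 &= 0xFFFFFFFF
    sign = (s0 >> 31) & 1
    for i in range(1, 32):
        if ((s0 >> (31 - i)) & 1) != sign:  # first bit that differs from the sign
            return i
    return -1

for value in (0x00000000, 0x40000000, 0x80000000, 0x0FFFFFFF, 0xFFFF0000, 0xFFFFFFFE):
    print(f"{value:#010x} -> {v_cls_i32(value)}")
```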

V_FREXP_EXP_I32_F64

60

Returns exponent of double precision float input.
This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). See also V_FREXP_MANT_F64,
which returns the significand. See the C library function frexp() for more information.
if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then
D0.i = 0
else
D0.i = exponent(S0.f64) - 1023 + 1
endif

V_FREXP_MANT_F64

61

Returns binary significand of double precision float input.
This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). Result range is in (-1.0,-0.5] or [0.5,1.0)
in normal cases. See also V_FREXP_EXP_I32_F64, which returns integer exponent. See the C library function
frexp() for more information.
if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then
D0.f64 = S0.f64
else
D0.f64 = mantissa(S0.f64)
endif
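Python's math.frexp() follows the same convention as the C library function, so it can model the V_FREXP_EXP_I32_F64 and V_FREXP_MANT_F64 pair for normal inputs. The sketch below adds the INF/NAN special case from the pseudocode. Zero and denormal handling on hardware is not covered, and the function name is hypothetical.

```python
import math

def v_frexp_f64(s0: float):
    """Return (significand, exponent) with s0 == significand * 2**exponent."""
    if math.isinf(s0) or math.isnan(s0):
        return s0, 0  # significand passes the input through, exponent is 0
    return math.frexp(s0)

print(v_frexp_f64(8.0))
print(v_frexp_f64(-3.0))
```

For 8.0 the biased exponent is 1026, so the pseudocode yields 1026 - 1023 + 1 = 4, which agrees with the exponent returned by frexp().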

V_FRACT_F64

62

Return fractional portion of a number.
D0.f64 = S0.f64 + -floor(S0.f64)

Notes
0.5ULP accuracy, denormals are accepted.
This complies with the DX specification of fract where the function behaves like an extension of integer
modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8
in DX.
Obey round mode, result clamped to 0x3fefffffffffffff.

V_FREXP_EXP_I32_F32

63

Returns exponent of single precision float input.
This operation satisfies the invariant S0.f = significand * (2 ** exponent). See also V_FREXP_MANT_F32, which
returns the significand. See the C library function frexp() for more information.
if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then
D0.i = 0
else
D0.i = exponent(S0.f) - 127 + 1
endif

V_FREXP_MANT_F32

64

Returns binary significand of single precision float input.
This operation satisfies the invariant S0.f = significand * (2 ** exponent). Result range is in (-1.0,-0.5] or [0.5,1.0) in
normal cases. See also V_FREXP_EXP_I32_F32, which returns integer exponent. See the C library function
frexp() for more information.
if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then
D0.f = S0.f
else
D0.f = mantissa(S0.f)
endif

V_MOVRELD_B32

66

Move to a relative destination address.
addr = DST.u;
// Raw value from instruction
addr += M0.u[31 : 0];
VGPR[laneId][addr].b = S0.b

Notes
Example: The following instruction sequence performs the move v15 <= v7:
s_mov_b32 m0, 10
v_movreld_b32 v5, v7

V_MOVRELS_B32

67

Move from a relative source address.
addr = SRC0.u;
// Raw value from instruction
addr += M0.u[31 : 0];
D0.b = VGPR[laneId][addr].b

Notes
Example: The following instruction sequence performs the move v5 <= v17:
s_mov_b32 m0, 10
v_movrels_b32 v5, v7

V_MOVRELSD_B32

68

Move from a relative source address to a relative destination address.
addrs = SRC0.u;
// Raw value from instruction
addrd = DST.u;
// Raw value from instruction
addrs += M0.u[31 : 0];
addrd += M0.u[31 : 0];
VGPR[laneId][addrd].b = VGPR[laneId][addrs].b

Notes
Example: The following instruction sequence performs the move v15 <= v17:
s_mov_b32 m0, 10
v_movrelsd_b32 v5, v7

V_MOVRELSD_2_B32

72

Move from a relative source address to a relative destination address, with different relative offsets.
addrs = SRC0.u;
// Raw value from instruction
addrd = DST.u;
// Raw value from instruction
addrs += M0.u[9 : 0].u;
addrd += M0.u[25 : 16].u;
VGPR[laneId][addrd].b = VGPR[laneId][addrs].b

Notes
Example: The following instruction sequence performs the move v25 <= v17:
s_mov_b32 m0, ((20 << 16) | 10)
v_movrelsd_2_b32 v5, v7
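The example above can be traced with a small Python model of one lane's register file. The list representation and function name are assumptions made for illustration.

```python
def v_movrelsd_2_b32(vgprs: list, dst: int, src0: int, m0: int) -> None:
    """Move within one lane using separate source and destination offsets from M0."""
    addrs = src0 + (m0 & 0x3FF)          # source offset from M0[9:0]
    addrd = dst + ((m0 >> 16) & 0x3FF)   # destination offset from M0[25:16]
    vgprs[addrd] = vgprs[addrs]

# Hypothetical lane state: register vN initially holds 100 + N.
vgprs = [100 + n for n in range(32)]
v_movrelsd_2_b32(vgprs, dst=5, src0=7, m0=(20 << 16) | 10)
print(vgprs[25], vgprs[17])  # v25 now holds the value of v17
```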

V_CVT_F16_U16

80

Convert from an unsigned short to an FP16 float.
D0.f16 = u16_to_f16(S0.u16)

Notes
0.5ULP accuracy, supports denormals, rounding, exception flags and saturation.

V_CVT_F16_I16

81

Convert from a signed short to an FP16 float.
D0.f16 = i16_to_f16(S0.i16)

Notes
0.5ULP accuracy, supports denormals, rounding, exception flags and saturation.

V_CVT_U16_F16

82

Convert from an FP16 float to an unsigned short.
D0.u16 = f16_to_u16(S0.f16)

Notes
1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion
is done with truncation.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_I16_F16

83

Convert from an FP16 float to a signed short.
D0.i16 = f16_to_i16(S0.f16)

Notes
1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion
is done with truncation.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_RCP_F16

84

Reciprocal with IEEE rules.
D0.f16 = 16'1.0 / S0.f16

Notes
0.51ULP accuracy.
Functional examples:
V_RCP_F16(0xfc00) ⇒ 0x8000 // rcp(-INF) = -0
V_RCP_F16(0xc000) ⇒ 0xb800 // rcp(-2.0) = -0.5
V_RCP_F16(0x8000) ⇒ 0xfc00 // rcp(-0.0) = -INF
V_RCP_F16(0x0000) ⇒ 0x7c00 // rcp(+0.0) = +INF
V_RCP_F16(0x7c00) ⇒ 0x0000 // rcp(+INF) = +0

V_SQRT_F16

85

Square root.
D0.f16 = sqrt(S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_SQRT_F16(0xfc00) ⇒ 0xfe00 // sqrt(-INF) = NAN
V_SQRT_F16(0x8000) ⇒ 0x8000 // sqrt(-0.0) = -0
V_SQRT_F16(0x0000) ⇒ 0x0000 // sqrt(+0.0) = +0
V_SQRT_F16(0x4400) ⇒ 0x4000 // sqrt(+4.0) = +2.0
V_SQRT_F16(0x7c00) ⇒ 0x7c00 // sqrt(+INF) = +INF

V_RSQ_F16

86

Reciprocal square root with IEEE rules.
D0.f16 = 16'1.0 / sqrt(S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_RSQ_F16(0xfc00) ⇒ 0xfe00 // rsq(-INF) = NAN
V_RSQ_F16(0x8000) ⇒ 0xfc00 // rsq(-0.0) = -INF
V_RSQ_F16(0x0000) ⇒ 0x7c00 // rsq(+0.0) = +INF
V_RSQ_F16(0x4400) ⇒ 0x3800 // rsq(+4.0) = +0.5
V_RSQ_F16(0x7c00) ⇒ 0x0000 // rsq(+INF) = +0

V_LOG_F16

87

Base 2 logarithm.
D0.f16 = log2(S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_LOG_F16(0xfc00) ⇒ 0xfe00 // log(-INF) = NAN
V_LOG_F16(0xbc00) ⇒ 0xfe00 // log(-1.0) = NAN
V_LOG_F16(0x8000) ⇒ 0xfc00 // log(-0.0) = -INF
V_LOG_F16(0x0000) ⇒ 0xfc00 // log(+0.0) = -INF
V_LOG_F16(0x3c00) ⇒ 0x0000 // log(+1.0) = 0
V_LOG_F16(0x7c00) ⇒ 0x7c00 // log(+INF) = +INF

V_EXP_F16

88

Base 2 exponentiation.
D0.f16 = pow(16'2.0, S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_EXP_F16(0xfc00) ⇒ 0x0000 // exp(-INF) = 0
V_EXP_F16(0x8000) ⇒ 0x3c00 // exp(-0.0) = 1
V_EXP_F16(0x7c00) ⇒ 0x7c00 // exp(+INF) = +INF

V_FREXP_MANT_F16

89

Returns binary significand of half precision float input.
This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). Result range is in (-1.0,-0.5] or [0.5,1.0)
in normal cases. See also V_FREXP_EXP_I16_F16, which returns integer exponent. See the C library function
frexp() for more information.
if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then
D0.f16 = S0.f16
else
D0.f16 = mantissa(S0.f16)
endif

V_FREXP_EXP_I16_F16

90

Returns exponent of half precision float input.
This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). See also V_FREXP_MANT_F16,
which returns the significand. See the C library function frexp() for more information.
if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then
D0.i = 0
else
D0.i = exponent(S0.f16) - 15 + 1
endif

V_FLOOR_F16

91

Round down to previous whole integer.
D0.f16 = trunc(S0.f16);
if ((S0.f16 < 16'0.0) && (S0.f16 != D0.f16)) then
D0.f16 += -16'1.0
endif

V_CEIL_F16

92

Round up to next whole integer.
D0.f16 = trunc(S0.f16);
if ((S0.f16 > 16'0.0) && (S0.f16 != D0.f16)) then
D0.f16 += 16'1.0
endif

V_TRUNC_F16

93

Return integer part of a number with round-to-zero semantics.
D0.f16 = trunc(S0.f16)

V_RNDNE_F16

94

Round-to-nearest-even semantics.
D0.f16 = floor(S0.f16 + 16'0.5);
if (isEven(64'F(floor(S0.f16))) && (fract(S0.f16) == 16'0.5)) then
D0.f16 -= 16'1.0
endif

V_FRACT_F16

95

Return fractional portion of a number.
D0.f16 = S0.f16 + -floor(S0.f16)

Notes
0.5ULP accuracy, denormals are accepted.
This complies with the DX specification of fract where the function behaves like an extension of integer
modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8
in DX.

V_SIN_F16

96

Trigonometric sine.
D0.f16 = 16'F(sin(64'F(S0.f16) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_SIN_F16(0xfc00) ⇒ 0xfe00 // sin(-INF) = NAN
V_SIN_F16(0xfbff) ⇒ 0x0000 // Most negative finite FP16
V_SIN_F16(0x8000) ⇒ 0x8000 // sin(-0.0) = -0
V_SIN_F16(0x3400) ⇒ 0x3c00 // sin(0.25) = 1
V_SIN_F16(0x7bff) ⇒ 0x0000 // Most positive finite FP16
V_SIN_F16(0x7c00) ⇒ 0xfe00 // sin(+INF) = NAN

V_COS_F16

97

Trigonometric cosine.
D0.f16 = 16'F(cos(64'F(S0.f16) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_COS_F16(0xfc00) ⇒ 0xfe00 // cos(-INF) = NAN
V_COS_F16(0xfbff) ⇒ 0x3c00 // Most negative finite FP16
V_COS_F16(0x8000) ⇒ 0x3c00 // cos(-0.0) = 1
V_COS_F16(0x3400) ⇒ 0x0000 // cos(0.25) = 0
V_COS_F16(0x7bff) ⇒ 0x3c00 // Most positive finite FP16
V_COS_F16(0x7c00) ⇒ 0xfe00 // cos(+INF) = NAN

V_SAT_PK_U8_I16

98

Saturate packed signed 16-bit integer values to unsigned 8-bit values and pack the results. Used for 4x16bit data packed as 4x8bit data.
SAT8 = lambda(n) (
if n.i <= 0 then
return 8'0U
elsif n >= 16'I(0xff) then
return 8'255U
else
return n[7 : 0].u8
endif);
D0.b16 = { SAT8(S0[31 : 16].i16), SAT8(S0[15 : 0].i16) }
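The saturation and packing can be sketched in Python as follows. The helper names are hypothetical; the result is the 16-bit value written to D0.b16, with the upper source half in bits 15:8.

```python
def to_i16(bits: int) -> int:
    """Interpret the low 16 bits as a signed integer."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits

def sat8(n: int) -> int:
    """Clamp a signed value to the unsigned 8-bit range 0..255."""
    if n <= 0:
        return 0
    if n >= 0xFF:
        return 0xFF
    return n & 0xFF

def v_sat_pk_u8_i16(s0: int) -> int:
    """Saturate both 16-bit halves of s0 and pack them as { high, low } bytes."""
    high = sat8(to_i16(s0 >> 16))
    low = sat8(to_i16(s0))
    return (high << 8) | low

print(hex(v_sat_pk_u8_i16(0x01FFFFFF)))  # high 511 clamps to 255, low -1 clamps to 0
```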

V_CVT_NORM_I16_F16

99

Convert from an FP16 float to a signed normalized short.
D0.i16 = f16_to_snorm(S0.f16)

Notes
0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported.

V_CVT_NORM_U16_F16

100

Convert from an FP16 float to an unsigned normalized short.
D0.u16 = f16_to_unorm(S0.f16)

Notes
0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported.

V_SWAP_B32

101

Swap values of two operands.
tmp = D0.b;
D0.b = S0.b;
S0.b = tmp

Notes
Input and output modifiers not supported; this is an untyped operation.

V_SWAP_B16

102

Swap values of two operands.
tmp = D0.b16;
D0.b16 = S0.b16;
S0.b16 = tmp

Notes
Input and output modifiers not supported; this is an untyped operation.

V_PERMLANE64_B32

103

Perform a specific permutation across lanes where the high half and low half of a wave64 are swapped.
Performs no operation in wave32 mode.
declare tmp : 32'B[64];
declare lane : 32'U;
if WAVE32 then
// Supported in wave64 ONLY
v_nop()
else
for lane in 0U : 63U do
// Copy original S0 in case D==S0
tmp[lane] = VGPR[lane][SRC0.u]
endfor;
for lane in 0U : 63U do
altlane = { ~lane[5], lane[4 : 0] };
// 0<->32, ..., 31<->63
if EXEC[lane].u1 then
VGPR[lane][VDST.u] = tmp[altlane]
endif
endfor
endif

Notes
In wave32 mode this opcode is translated to V_NOP and performs no writes.
In wave64 the EXEC mask of the destination lane is used as the read mask for the alternate lane; as a result this
opcode may read values from disabled lanes.
The source must be a VGPR and SVGPRs are not allowed for this opcode.
ABS, NEG and OMOD modifiers should all be zeroed for this instruction.
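The half-wave swap can be modeled in Python with one list entry per lane. This is a sketch in which the source and destination registers are passed as separate 64-element lists; that layout and the function name are assumptions for illustration.

```python
def v_permlane64_b32(src: list, dst: list, exec_mask: int, wave64: bool = True) -> list:
    """Return the destination register after swapping the two halves of a wave64."""
    if not wave64:
        return list(dst)  # behaves as V_NOP in wave32 mode
    tmp = list(src)       # copy the source in case source and destination alias
    out = list(dst)
    for lane in range(64):
        altlane = lane ^ 32  # flip bit 5: 0<->32, ..., 31<->63
        if (exec_mask >> lane) & 1:
            # Only the destination lane must be enabled; the source lane may be disabled.
            out[lane] = tmp[altlane]
    return out

src = list(range(64))
out = v_permlane64_b32(src, [-1] * 64, exec_mask=(1 << 64) - 1)
print(out[0], out[31], out[32], out[63])
```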

V_SWAPREL_B32

104

Swap values of two operands. The two addresses are relatively indexed using M0.
addrs = SRC0.u;
// Raw value from instruction
addrd = DST.u;
// Raw value from instruction
addrs += M0.u[9 : 0].u;
addrd += M0.u[25 : 16].u;
tmp = VGPR[laneId][addrd].b;
VGPR[laneId][addrd].b = VGPR[laneId][addrs].b;
VGPR[laneId][addrs].b = tmp

Notes
Input and output modifiers not supported; this is an untyped operation.
Example: The following instruction sequence swaps v25 and v17:
s_mov_b32 m0, ((20 << 16) | 10)
v_swaprel_b32 v5, v7

V_NOT_B16

105

Bitwise negation.
D0.u16 = ~S0.u16

Notes
Input and output modifiers not supported.

V_CVT_I32_I16

106

Convert from a 16-bit signed integer to a 32-bit signed integer, sign extending as needed.
D0.i = 32'I(signext(S0.i16))

Notes
To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16.

V_CVT_U32_U16

107

Convert from a 16-bit unsigned integer to a 32-bit unsigned integer, zero extending as needed.
D0 = { 16'0, S0.u16 }

Notes
To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16.

16.8.1. VOP1 using VOP3 encoding
Instructions in this format may also be encoded as VOP3. VOP3 allows access to the extra control bits (e.g. ABS,
OMOD) at the expense of a larger instruction word. The VOP3 opcode is: VOP1 opcode + 0x180.
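The opcode relationship can be sketched in Python, assuming the 0x180 offset applies to the VOP1 opcode number as the title of this subsection indicates. The opcode numbers are the VOP1 numbers listed in this section; the function and constant names are hypothetical.

```python
VOP3_OFFSET_FOR_VOP1 = 0x180  # offset stated in the text for VOP1 instructions

def vop1_to_vop3_opcode(vop1_opcode: int) -> int:
    """Return the VOP3 opcode that corresponds to a VOP1 opcode."""
    return vop1_opcode + VOP3_OFFSET_FOR_VOP1

# VOP1 opcode numbers as listed in this section.
for name, opcode in {"V_NOP": 0, "V_MOV_B32": 1, "V_RCP_F32": 42}.items():
    print(f"{name}: VOP1 {opcode} -> VOP3 {vop1_to_vop3_opcode(opcode):#x}")
```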

16.9. VOPC Instructions
The bitfield map for VOPC is:

SRC0  = First operand for instruction.
VSRC1 = Second operand for instruction.
OP    = Instruction opcode.

All VOPC instructions can alternatively be encoded in the VOP3 format.

Compare instructions perform the same compare operation on each lane (work-item or thread) using that
lane’s private data, and produce a 1-bit result per lane in VCC or EXEC.
Instructions in this format may use a 32-bit literal constant that occurs immediately after the instruction.
Most compare instructions fall into one of two categories:
• Those which can use one of 16 compare operations (floating point types). "{COMPF}"
• Those which can use one of 8 compare operations (integer types). "{COMPI}"
For these instructions, the opcode number is the base opcode number for the data type plus the offset of the
specific compare operation.
Table 112. Sixteen Compare Operations
Compare Operation   Opcode Offset   Description
F                   0               D.u = 0
LT                  1               D.u = (S0 < S1)
EQ                  2               D.u = (S0 == S1)
LE                  3               D.u = (S0 <= S1)
GT                  4               D.u = (S0 > S1)
LG                  5               D.u = (S0 <> S1)
GE                  6               D.u = (S0 >= S1)
O                   7               D.u = (!isNaN(S0) && !isNaN(S1))
U                   8               D.u = (isNaN(S0) || isNaN(S1))
NGE                 9               D.u = !(S0 >= S1)
NLG                 10              D.u = !(S0 <> S1)
NGT                 11              D.u = !(S0 > S1)
NLE                 12              D.u = !(S0 <= S1)
NEQ                 13              D.u = !(S0 == S1)
NLT                 14              D.u = !(S0 < S1)
TRU                 15              D.u = 1
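The sixteen floating-point compare operations and the base-plus-offset opcode rule can be sketched in Python. The U operation follows the V_CMP_U_F16 pseudocode, (isNAN(S0) || isNAN(S1)). The base of 0 used in the usage line is taken from the V_CMP_F_F16 entry, which lists opcode 0. The names COMPF, compare_opcode and compare are hypothetical.

```python
import math

# Index in this list is the opcode offset of the compare operation.
COMPF = [
    ("F",   lambda a, b: False),
    ("LT",  lambda a, b: a < b),
    ("EQ",  lambda a, b: a == b),
    ("LE",  lambda a, b: a <= b),
    ("GT",  lambda a, b: a > b),
    ("LG",  lambda a, b: a < b or a > b),
    ("GE",  lambda a, b: a >= b),
    ("O",   lambda a, b: not math.isnan(a) and not math.isnan(b)),
    ("U",   lambda a, b: math.isnan(a) or math.isnan(b)),
    ("NGE", lambda a, b: not (a >= b)),
    ("NLG", lambda a, b: not (a < b or a > b)),
    ("NGT", lambda a, b: not (a > b)),
    ("NLE", lambda a, b: not (a <= b)),
    ("NEQ", lambda a, b: not (a == b)),
    ("NLT", lambda a, b: not (a < b)),
    ("TRU", lambda a, b: True),
]

def compare_opcode(base: int, op_name: str) -> int:
    """Opcode = base opcode for the data type + offset of the compare operation."""
    return base + [name for name, _ in COMPF].index(op_name)

def compare(op_name: str, s0: float, s1: float) -> bool:
    """Evaluate one compare operation on a single lane's operands."""
    return dict(COMPF)[op_name](s0, s1)

print(compare_opcode(0, "LT"))            # V_CMP_LT_F16 with an F16 base of 0
print(compare("GE", math.nan, 1.0), compare("NGE", math.nan, 1.0))
```

The negated forms are not redundant with their counterparts: with a NaN operand, GE is false and NGE is true, whereas LT is also false.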

Table 113. Instructions with Sixteen Compare Operations
Instruction            Description                                Hex Range
V_CMP_{COMPF}_F16      16-bit float compare.                      0x20 to 0x2F
V_CMPX_{COMPF}_F16     16-bit float compare. Also writes EXEC.    0x30 to 0x3F
V_CMP_{COMPF}_F32      32-bit float compare.                      0x40 to 0x4F
V_CMPX_{COMPF}_F32     32-bit float compare. Also writes EXEC.    0x50 to 0x5F
V_CMP_{COMPF}_F64      64-bit float compare.                      0x60 to 0x6F
V_CMPX_{COMPF}_F64     64-bit float compare. Also writes EXEC.    0x70 to 0x7F

Table 114. Eight Compare Operations
Compare Operation   Opcode Offset   Description
F                   0               D.u = 0
LT                  1               D.u = (S0 < S1)
EQ                  2               D.u = (S0 == S1)
LE                  3               D.u = (S0 <= S1)
GT                  4               D.u = (S0 > S1)
LG                  5               D.u = (S0 <> S1)
GE                  6               D.u = (S0 >= S1)
TRU                 7               D.u = 1

Table 115. Instructions with Eight Compare Operations
Instruction            Description                                           Hex Range
V_CMP_{COMPI}_I16      16-bit signed integer compare.                        0xA0 - 0xA7
V_CMP_{COMPI}_U16      16-bit unsigned integer compare.                      0xA8 - 0xAF
V_CMPX_{COMPI}_I16     16-bit signed integer compare. Also writes EXEC.      0xB0 - 0xB7
V_CMPX_{COMPI}_U16     16-bit unsigned integer compare. Also writes EXEC.    0xB8 - 0xBF
V_CMP_{COMPI}_I32      32-bit signed integer compare.                        0xC0 - 0xC7
V_CMP_{COMPI}_U32      32-bit unsigned integer compare.                      0xC8 - 0xCF
V_CMPX_{COMPI}_I32     32-bit signed integer compare. Also writes EXEC.      0xD0 - 0xD7
V_CMPX_{COMPI}_U32     32-bit unsigned integer compare. Also writes EXEC.    0xD8 - 0xDF
V_CMP_{COMPI}_I64      64-bit signed integer compare.                        0xE0 - 0xE7
V_CMP_{COMPI}_U64      64-bit unsigned integer compare.                      0xE8 - 0xEF
V_CMPX_{COMPI}_I64     64-bit signed integer compare. Also writes EXEC.      0xF0 - 0xF7
V_CMPX_{COMPI}_U64     64-bit unsigned integer compare. Also writes EXEC.    0xF8 - 0xFF

V_CMP_F_F16

0

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_F16

1

A less than B.
D0.u64[laneId] = S0.f16 < S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
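The pseudocode `D0.u64[laneId] = ...` writes one result bit per lane. A minimal Python sketch of that per-lane mask construction follows. The function name `v_cmp_lt_mask` is hypothetical, host floats stand in for f16 values, and every lane is assumed active (the treatment of inactive lanes is not modeled).

```python
def v_cmp_lt_mask(s0_lanes, s1_lanes):
    """Build a lane mask: bit laneId is set when s0_lanes[laneId] < s1_lanes[laneId]."""
    if len(s0_lanes) != len(s1_lanes) or len(s0_lanes) > 64:
        raise ValueError("expected matching source lists of at most 64 lanes")
    mask = 0
    for lane_id, (a, b) in enumerate(zip(s0_lanes, s1_lanes)):
        if a < b:
            mask |= 1 << lane_id
    return mask

# Four lanes for brevity.
vcc = v_cmp_lt_mask([1.0, 5.0, float("nan"), -2.0], [2.0, 4.0, 0.0, -1.0])
print(f"{vcc:#06b}")  # prints 0b1001: lanes 0 and 3 pass, the NaN lane does not
```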

V_CMP_EQ_F16

2

A equal to B.
D0.u64[laneId] = S0.f16 == S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_F16

3

A less than or equal to B.
D0.u64[laneId] = S0.f16 <= S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_F16

4

A greater than B.
D0.u64[laneId] = S0.f16 > S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LG_F16

5

A less than or greater than B.
D0.u64[laneId] = S0.f16 <> S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_F16

6

A greater than or equal to B.
D0.u64[laneId] = S0.f16 >= S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_O_F16

7

A orderable with B.
D0.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_U_F16

8

A not orderable with B.
D0.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGE_F16

9

A not greater than or equal to B.
D0.u64[laneId] = !(S0.f16 >= S1.f16);
// With NAN inputs this is not the same operation as <
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
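The comment about NaN inputs can be checked on any IEEE 754 host. The Python sketch below (helper names `nge` and `lt` are hypothetical) compares the negated form `!(S0 >= S1)` with the plain `S0 < S1`. The two agree for ordinary numbers and differ only when an input is NaN.

```python
nan = float("nan")

def nge(a, b):
    # V_CMP_NGE semantics: !(S0 >= S1)
    return not (a >= b)

def lt(a, b):
    # V_CMP_LT semantics: S0 < S1
    return a < b

for a, b in [(1.0, 2.0), (2.0, 1.0), (nan, 1.0)]:
    print(a, b, "NGE:", nge(a, b), "LT:", lt(a, b))
```

For the pair (NaN, 1.0), `nge` is true because `>=` is false on unordered inputs, while `lt` is false.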

V_CMP_NLG_F16

10

A not less than or greater than B.
D0.u64[laneId] = !(S0.f16 <> S1.f16);
// With NAN inputs this is not the same operation as ==
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGT_F16

11

A not greater than B.
D0.u64[laneId] = !(S0.f16 > S1.f16);
// With NAN inputs this is not the same operation as <=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLE_F16

12

A not less than or equal to B.
D0.u64[laneId] = !(S0.f16 <= S1.f16);
// With NAN inputs this is not the same operation as >
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NEQ_F16

13

A not equal to B.
D0.u64[laneId] = !(S0.f16 == S1.f16);
// With NAN inputs this is not the same operation as !=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLT_F16

14

A not less than B.
D0.u64[laneId] = !(S0.f16 < S1.f16);
// With NAN inputs this is not the same operation as >=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_F16

15

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_F32

16

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_F32

17

A less than B.
D0.u64[laneId] = S0.f < S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_F32

18

A equal to B.
D0.u64[laneId] = S0.f == S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_F32

19

A less than or equal to B.
D0.u64[laneId] = S0.f <= S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_F32

20

A greater than B.
D0.u64[laneId] = S0.f > S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LG_F32

21

A less than or greater than B.
D0.u64[laneId] = S0.f <> S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_F32

22

A greater than or equal to B.
D0.u64[laneId] = S0.f >= S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_O_F32

23

A orderable with B.
D0.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_U_F32

24

A not orderable with B.
D0.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGE_F32

25

A not greater than or equal to B.
D0.u64[laneId] = !(S0.f >= S1.f);
// With NAN inputs this is not the same operation as <
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLG_F32

26

A not less than or greater than B.
D0.u64[laneId] = !(S0.f <> S1.f);
// With NAN inputs this is not the same operation as ==
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGT_F32

27

A not greater than B.
D0.u64[laneId] = !(S0.f > S1.f);
// With NAN inputs this is not the same operation as <=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLE_F32

28

A not less than or equal to B.
D0.u64[laneId] = !(S0.f <= S1.f);
// With NAN inputs this is not the same operation as >
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NEQ_F32

29

A not equal to B.
D0.u64[laneId] = !(S0.f == S1.f);
// With NAN inputs this is not the same operation as !=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLT_F32

30

A not less than B.
D0.u64[laneId] = !(S0.f < S1.f);
// With NAN inputs this is not the same operation as >=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_F32

31

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_F64

32

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_F64

33

A less than B.
D0.u64[laneId] = S0.f64 < S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_F64

34

A equal to B.
D0.u64[laneId] = S0.f64 == S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_F64

35

A less than or equal to B.
D0.u64[laneId] = S0.f64 <= S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_F64

36

A greater than B.
D0.u64[laneId] = S0.f64 > S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LG_F64

37

A less than or greater than B.
D0.u64[laneId] = S0.f64 <> S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_F64

38

A greater than or equal to B.
D0.u64[laneId] = S0.f64 >= S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_O_F64

39

A orderable with B.
D0.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_U_F64

40

A not orderable with B.
D0.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGE_F64

41

A not greater than or equal to B.
D0.u64[laneId] = !(S0.f64 >= S1.f64);
// With NAN inputs this is not the same operation as <
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLG_F64

42

A not less than or greater than B.
D0.u64[laneId] = !(S0.f64 <> S1.f64);
// With NAN inputs this is not the same operation as ==
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGT_F64

43

A not greater than B.
D0.u64[laneId] = !(S0.f64 > S1.f64);
// With NAN inputs this is not the same operation as <=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLE_F64

44

A not less than or equal to B.
D0.u64[laneId] = !(S0.f64 <= S1.f64);
// With NAN inputs this is not the same operation as >
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NEQ_F64

45

A not equal to B.
D0.u64[laneId] = !(S0.f64 == S1.f64);
// With NAN inputs this is not the same operation as !=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLT_F64

46

A not less than B.
D0.u64[laneId] = !(S0.f64 < S1.f64);
// With NAN inputs this is not the same operation as >=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_F64

47

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
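The floating-point V_CMP opcodes listed above follow a regular pattern: each data type starts at the opcode of its V_CMP_F_* entry (0, 16, 32) and adds the compare-operation offset. A small Python sketch of that arithmetic, with hypothetical names `FLOAT_OFFSETS`, `V_CMP_FLOAT_BASE`, and `v_cmp_float_opcode`:

```python
# Compare operations in offset order, using the mnemonic spellings of the entries above.
FLOAT_OFFSETS = ["F", "LT", "EQ", "LE", "GT", "LG", "GE", "O", "U",
                 "NGE", "NLG", "NGT", "NLE", "NEQ", "NLT", "T"]

# Base opcodes read off the V_CMP_F_F16, V_CMP_F_F32, and V_CMP_F_F64 entries.
V_CMP_FLOAT_BASE = {"F16": 0, "F32": 16, "F64": 32}

def v_cmp_float_opcode(op, data_type):
    """Return the opcode of V_CMP_<op>_<data_type> as base + offset."""
    return V_CMP_FLOAT_BASE[data_type] + FLOAT_OFFSETS.index(op)

print(v_cmp_float_opcode("NGE", "F32"))  # prints 25, matching V_CMP_NGE_F32
```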

V_CMP_LT_I16

49

A less than B.
D0.u64[laneId] = S0.i16 < S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_I16

50

A equal to B.
D0.u64[laneId] = S0.i16 == S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_I16

51

A less than or equal to B.
D0.u64[laneId] = S0.i16 <= S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_I16

52

A greater than B.
D0.u64[laneId] = S0.i16 > S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_I16

53

A not equal to B.
D0.u64[laneId] = S0.i16 <> S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_I16

54

A greater than or equal to B.
D0.u64[laneId] = S0.i16 >= S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_U16

57

A less than B.
D0.u64[laneId] = S0.u16 < S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
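The I16 and U16 compares differ only in how the same 16 bits are interpreted. A minimal Python sketch (helper names `as_i16` and `as_u16` are hypothetical) shows one bit pattern giving opposite less-than results under the two views:

```python
def as_i16(bits):
    """Interpret the low 16 bits as a two's-complement signed value."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits

def as_u16(bits):
    """Interpret the low 16 bits as an unsigned value."""
    return bits & 0xFFFF

s0, s1 = 0xFFFF, 0x0001
print(as_i16(s0) < as_i16(s1))  # signed view: -1 < 1
print(as_u16(s0) < as_u16(s1))  # unsigned view: 65535 < 1
```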

V_CMP_EQ_U16

58

A equal to B.
D0.u64[laneId] = S0.u16 == S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_U16

59

A less than or equal to B.
D0.u64[laneId] = S0.u16 <= S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_U16

60

A greater than B.
D0.u64[laneId] = S0.u16 > S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_U16

61

A not equal to B.
D0.u64[laneId] = S0.u16 <> S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_U16

62

A greater than or equal to B.
D0.u64[laneId] = S0.u16 >= S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_I32

64

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_I32

65

A less than B.
D0.u64[laneId] = S0.i < S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_I32

66

A equal to B.
D0.u64[laneId] = S0.i == S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_I32

67

A less than or equal to B.
D0.u64[laneId] = S0.i <= S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_I32

68

A greater than B.
D0.u64[laneId] = S0.i > S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_I32

69

A not equal to B.
D0.u64[laneId] = S0.i <> S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_I32

70

A greater than or equal to B.
D0.u64[laneId] = S0.i >= S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_I32

71

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_U32

72

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_U32

73

A less than B.
D0.u64[laneId] = S0.u < S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_U32

74

A equal to B.
D0.u64[laneId] = S0.u == S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_U32

75

A less than or equal to B.
D0.u64[laneId] = S0.u <= S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_U32

76

A greater than B.
D0.u64[laneId] = S0.u > S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_U32

77

A not equal to B.
D0.u64[laneId] = S0.u <> S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_U32

78

A greater than or equal to B.
D0.u64[laneId] = S0.u >= S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_U32

79

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_I64

80

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_I64

81

A less than B.
D0.u64[laneId] = S0.i64 < S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_I64

82

A equal to B.
D0.u64[laneId] = S0.i64 == S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_I64

83

A less than or equal to B.
D0.u64[laneId] = S0.i64 <= S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_I64

84

A greater than B.
D0.u64[laneId] = S0.i64 > S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_I64

85

A not equal to B.
D0.u64[laneId] = S0.i64 <> S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_I64

86

A greater than or equal to B.
D0.u64[laneId] = S0.i64 >= S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_I64

87

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_U64

88

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_U64

89

A less than B.
D0.u64[laneId] = S0.u64 < S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_U64

90

A equal to B.
D0.u64[laneId] = S0.u64 == S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_U64

91

A less than or equal to B.
D0.u64[laneId] = S0.u64 <= S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_U64

92

A greater than B.
D0.u64[laneId] = S0.u64 > S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_U64

93

A not equal to B.
D0.u64[laneId] = S0.u64 <> S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_U64

94

A greater than or equal to B.
D0.u64[laneId] = S0.u64 >= S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_U64

95

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
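Going the other way, the opcode numbers of the V_CMP entries above (0 through 95) can be decoded back into an operation and a data type. The Python sketch below is one way to do it. The name `decode_v_cmp` is hypothetical, and treating the unlisted 16-bit integer F and T slots as undefined is an assumption based only on their absence from the entries above.

```python
FLOAT_OPS = ["F", "LT", "EQ", "LE", "GT", "LG", "GE", "O", "U",
             "NGE", "NLG", "NGT", "NLE", "NEQ", "NLT", "T"]
INT_OPS = ["F", "LT", "EQ", "LE", "GT", "NE", "GE", "T"]

def decode_v_cmp(opcode):
    """Return (operation, data_type) for a V_CMP opcode in 0..95, else None."""
    if 0 <= opcode < 48:
        # Three float types, sixteen operations each.
        return FLOAT_OPS[opcode % 16], ["F16", "F32", "F64"][opcode // 16]
    if 48 <= opcode < 96:
        # Six integer types, eight operation slots each.
        index = opcode - 48
        data_type = ["I16", "U16", "I32", "U32", "I64", "U64"][index // 8]
        op = INT_OPS[index % 8]
        # Assumption: no F or T entry is listed for the 16-bit integer types.
        if data_type in ("I16", "U16") and op in ("F", "T"):
            return None
        return op, data_type
    return None

print(decode_v_cmp(53))  # prints ('NE', 'I16')
```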

V_CMP_CLASS_F16

125

IEEE numeric class function specified in S1.u, performed on S0.f16.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f16)) then
result = S1.u[0]
elsif isQuietNAN(64'F(S0.f16)) then
result = S1.u[1]
elsif exponent(S0.f16) == 31 then
// +-INF
result = S1.u[sign(S0.f16) ? 2 : 9]
elsif exponent(S0.f16) > 0 then
// +-normal value
result = S1.u[sign(S0.f16) ? 3 : 8]
elsif 64'F(abs(S0.f16)) > 0.0 then
// +-denormal value
result = S1.u[sign(S0.f16) ? 4 : 7]
else
// +-0.0
result = S1.u[sign(S0.f16) ? 5 : 6]
endif;
D0.u64[laneId] = result;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
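Because S1.u is a bit mask, several classes can be tested in one instruction by OR-ing their bits. The following Python sketch builds a few such masks from the bit positions listed above. All constant names are hypothetical.

```python
# Bit positions in S1.u, in the order listed for the class instructions.
SNAN, QNAN, NEG_INF, NEG_NORMAL, NEG_DENORM, NEG_ZERO = (1 << i for i in range(6))
POS_ZERO, POS_DENORM, POS_NORMAL, POS_INF = (1 << i for i in range(6, 10))

# Combined masks that select more than one class.
ANY_NAN = SNAN | QNAN
ANY_INF = NEG_INF | POS_INF
ANY_ZERO = NEG_ZERO | POS_ZERO
ANY_DENORM = NEG_DENORM | POS_DENORM
FINITE = NEG_NORMAL | POS_NORMAL | ANY_DENORM | ANY_ZERO

print(hex(ANY_NAN), hex(ANY_INF), hex(FINITE))  # prints 0x3 0x204 0x1f8
```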

V_CMP_CLASS_F32

126

IEEE numeric class function specified in S1.u, performed on S0.f.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f)) then
result = S1.u[0]
elsif isQuietNAN(64'F(S0.f)) then
result = S1.u[1]
elsif exponent(S0.f) == 255 then
// +-INF
result = S1.u[sign(S0.f) ? 2 : 9]
elsif exponent(S0.f) > 0 then
// +-normal value
result = S1.u[sign(S0.f) ? 3 : 8]
elsif 64'F(abs(S0.f)) > 0.0 then
// +-denormal value
result = S1.u[sign(S0.f) ? 4 : 7]
else
// +-0.0
result = S1.u[sign(S0.f) ? 5 : 6]
endif;
D0.u64[laneId] = result;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
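The classification above can be mirrored on the host by inspecting the raw bits of a 32-bit float. The Python sketch below is an approximation with hypothetical names (`class_bit_f32`, `v_cmp_class_f32`). It assumes the IEEE 754-2008 convention that the most significant mantissa bit distinguishes quiet from signaling NaNs, and it does not model exception signaling.

```python
import struct

def class_bit_f32(bits):
    """Return the S1.u bit index (0..9) that describes a 32-bit float pattern."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 255 and mantissa != 0:
        # NaN: assumed quiet when the mantissa MSB is set, signaling otherwise.
        return 1 if mantissa & 0x400000 else 0
    if exponent == 255:
        return 2 if sign else 9   # -INF / +INF
    if exponent > 0:
        return 3 if sign else 8   # -normal / +normal
    if mantissa != 0:
        return 4 if sign else 7   # -denormal / +denormal
    return 5 if sign else 6       # -0.0 / +0.0

def v_cmp_class_f32(s0, s1_mask):
    """Return 1 if the class of s0 is selected in s1_mask, else 0."""
    bits = struct.unpack("<I", struct.pack("<f", s0))[0]
    return (s1_mask >> class_bit_f32(bits)) & 1

print(v_cmp_class_f32(float("-inf"), 0x204))  # prints 1: mask selects both infinities
```

`class_bit_f32` takes raw bits so that signaling NaN patterns can be tested directly; host float conversions may quiet them.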

V_CMP_CLASS_F64

127

IEEE numeric class function specified in S1.u, performed on S0.f64.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(S0.f64) then
result = S1.u[0]
elsif isQuietNAN(S0.f64) then
result = S1.u[1]
elsif exponent(S0.f64) == 2047 then
// +-INF
result = S1.u[sign(S0.f64) ? 2 : 9]
elsif exponent(S0.f64) > 0 then
// +-normal value
result = S1.u[sign(S0.f64) ? 3 : 8]
elsif abs(S0.f64) > 0.0 then
// +-denormal value
result = S1.u[sign(S0.f64) ? 4 : 7]
else
// +-0.0
result = S1.u[sign(S0.f64) ? 5 : 6]
endif;
D0.u64[laneId] = result;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
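The `isSignalNAN` and `isQuietNAN` tests in the pseudocode are not defined in this section. A plausible reading, assuming the IEEE 754-2008 encoding in which the most significant mantissa bit marks a quiet NaN, is sketched below for 64-bit patterns. The function name `nan_kind_f64` is hypothetical.

```python
def nan_kind_f64(bits):
    """Classify a 64-bit pattern as 'snan', 'qnan', or None when it is not a NaN."""
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    # A NaN has an all-ones exponent field and a non-zero mantissa.
    if exponent != 0x7FF or mantissa == 0:
        return None
    # Assumption: mantissa MSB set means quiet, clear means signaling.
    return "qnan" if mantissa & (1 << 51) else "snan"

print(nan_kind_f64(0x7FF8000000000000))  # prints qnan
print(nan_kind_f64(0x7FF0000000000001))  # prints snan
```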

V_CMPX_F_F16

128

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_F16

129

A less than B.
EXEC.u64[laneId] = S0.f16 < S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
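Since V_CMPX writes its per-lane result to EXEC, successive compares narrow the set of active lanes. The Python sketch below models that effect. The name `v_cmpx_lt` is hypothetical, and the sketch assumes that lanes already inactive do not execute the compare and therefore stay inactive.

```python
def v_cmpx_lt(exec_mask, s0_lanes, s1_lanes):
    """Return the new EXEC mask after a less-than compare on the active lanes."""
    new_exec = 0
    for lane_id, (a, b) in enumerate(zip(s0_lanes, s1_lanes)):
        active = (exec_mask >> lane_id) & 1
        # Assumption: a lane that is inactive on entry keeps its EXEC bit at 0.
        if active and a < b:
            new_exec |= 1 << lane_id
    return new_exec

exec_mask = 0b1111
exec_mask = v_cmpx_lt(exec_mask, [1, 5, 2, 7], [3, 3, 3, 3])  # lanes 0 and 2 pass
exec_mask = v_cmpx_lt(exec_mask, [1, 0, 9, 0], [2, 2, 2, 2])  # only lane 0 still passes
print(f"{exec_mask:#06b}")  # prints 0b0001
```

Lane 1 would pass the second compare on its values alone, but it was disabled by the first one, which is the behavior nested conditionals rely on.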

V_CMPX_EQ_F16

130

A equal to B.
EXEC.u64[laneId] = S0.f16 == S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_F16

131

A less than or equal to B.
EXEC.u64[laneId] = S0.f16 <= S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_F16

132

A greater than B.
EXEC.u64[laneId] = S0.f16 > S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LG_F16

133

A less than or greater than B.
EXEC.u64[laneId] = S0.f16 <> S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_F16

134

A greater than or equal to B.
EXEC.u64[laneId] = S0.f16 >= S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_O_F16

135

A orderable with B.
EXEC.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_U_F16

136

A not orderable with B.
EXEC.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGE_F16

137

A not greater than or equal to B.
EXEC.u64[laneId] = !(S0.f16 >= S1.f16);
// With NAN inputs this is not the same operation as <

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLG_F16

138

A not less than or greater than B.
EXEC.u64[laneId] = !(S0.f16 <> S1.f16);
// With NAN inputs this is not the same operation as ==

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGT_F16

139

A not greater than B.
EXEC.u64[laneId] = !(S0.f16 > S1.f16);
// With NAN inputs this is not the same operation as <=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLE_F16

140

A not less than or equal to B.
EXEC.u64[laneId] = !(S0.f16 <= S1.f16);
// With NAN inputs this is not the same operation as >

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NEQ_F16

141

A not equal to B.
EXEC.u64[laneId] = !(S0.f16 == S1.f16);
// With NAN inputs this is not the same operation as !=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLT_F16

142

A not less than B.
EXEC.u64[laneId] = !(S0.f16 < S1.f16);
// With NAN inputs this is not the same operation as >=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_F16

143

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_F32

144

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_F32

145

A less than B.
EXEC.u64[laneId] = S0.f < S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_F32

146

A equal to B.
EXEC.u64[laneId] = S0.f == S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_F32

147

A less than or equal to B.
EXEC.u64[laneId] = S0.f <= S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_F32

148

A greater than B.
EXEC.u64[laneId] = S0.f > S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LG_F32

149

A less than or greater than B.
EXEC.u64[laneId] = S0.f <> S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_F32

150

A greater than or equal to B.
EXEC.u64[laneId] = S0.f >= S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_O_F32

151

A orderable with B.
EXEC.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_U_F32

152

A not orderable with B.
EXEC.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGE_F32

153

A not greater than or equal to B.
EXEC.u64[laneId] = !(S0.f >= S1.f);
// With NAN inputs this is not the same operation as <

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLG_F32

154

A not less than or greater than B.
EXEC.u64[laneId] = !(S0.f <> S1.f);
// With NAN inputs this is not the same operation as ==

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGT_F32

155

A not greater than B.
EXEC.u64[laneId] = !(S0.f > S1.f);
// With NAN inputs this is not the same operation as <=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLE_F32

156

A not less than or equal to B.
EXEC.u64[laneId] = !(S0.f <= S1.f);
// With NAN inputs this is not the same operation as >

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NEQ_F32

157

A not equal to B.
EXEC.u64[laneId] = !(S0.f == S1.f);
// With NAN inputs this is not the same operation as !=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLT_F32

158

A not less than B.
EXEC.u64[laneId] = !(S0.f < S1.f);
// With NAN inputs this is not the same operation as >=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_F32

159

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_F64

160

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
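Comparing the opcode numbers listed so far (V_CMP_F_F16 = 0 and V_CMPX_F_F16 = 128, V_CMP_F_F32 = 16 and V_CMPX_F_F32 = 144, V_CMP_F_F64 = 32 and V_CMPX_F_F64 = 160) suggests that each floating-point V_CMPX opcode is its V_CMP counterpart plus 128. A small Python sketch of that inferred relationship, with hypothetical names:

```python
# Inferred from the listed opcodes, not stated as a rule in the text.
V_CMPX_OFFSET = 128

def v_cmpx_from_v_cmp(v_cmp_opcode):
    """Map a floating-point V_CMP opcode (0..47) to its V_CMPX counterpart."""
    if not 0 <= v_cmp_opcode < 48:
        raise ValueError("only the floating-point compares (0..47) are covered here")
    return v_cmp_opcode + V_CMPX_OFFSET

print(v_cmpx_from_v_cmp(30))  # prints 158, matching V_CMPX_NLT_F32
```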

V_CMPX_LT_F64

161

A less than B.
EXEC.u64[laneId] = S0.f64 < S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_F64

162

A equal to B.
EXEC.u64[laneId] = S0.f64 == S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_F64

163

A less than or equal to B.
EXEC.u64[laneId] = S0.f64 <= S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_F64

164

A greater than B.
EXEC.u64[laneId] = S0.f64 > S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LG_F64

165

A less than or greater than B.
EXEC.u64[laneId] = S0.f64 <> S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_F64

166

A greater than or equal to B.
EXEC.u64[laneId] = S0.f64 >= S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_O_F64

167

A orderable with B.
EXEC.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_U_F64

168

A not orderable with B.
EXEC.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGE_F64

169

A not greater than or equal to B.
EXEC.u64[laneId] = !(S0.f64 >= S1.f64);
// With NAN inputs this is not the same operation as <

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
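
The comment above, that a negated compare is not the same operation as its complement when NAN inputs are present, can be checked with a small host-side model. This is a minimal sketch in Python, not hardware behavior: the function names cmpx_nge and cmpx_lt are illustrative only, and exception signalling and the EXEC write are not modeled.

```python
import math

def cmpx_nge(s0: float, s1: float) -> bool:
    # !(S0 >= S1): an unordered (NaN) operand makes >= false, so the result is true.
    return not (s0 >= s1)

def cmpx_lt(s0: float, s1: float) -> bool:
    # S0 < S1: an unordered (NaN) operand makes the result false.
    return s0 < s1

# The two predicates agree for ordered inputs and differ only when a NaN is present.
for a, b in [(1.0, 2.0), (2.0, 1.0), (math.nan, 1.0)]:
    print(a, b, cmpx_nge(a, b), cmpx_lt(a, b))
```

The same reasoning applies to the other negated forms (NLG, NGT, NLE, NLT): each returns true for unordered inputs, while the corresponding positive compare returns false.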

V_CMPX_NLG_F64

170

A not less than or greater than B.
EXEC.u64[laneId] = !(S0.f64 <> S1.f64);
// With NAN inputs this is not the same operation as ==

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGT_F64

171

A not greater than B.
EXEC.u64[laneId] = !(S0.f64 > S1.f64);
// With NAN inputs this is not the same operation as <=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLE_F64

172

A not less than or equal to B.
EXEC.u64[laneId] = !(S0.f64 <= S1.f64);
// With NAN inputs this is not the same operation as >

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NEQ_F64

173

A not equal to B.
EXEC.u64[laneId] = !(S0.f64 == S1.f64);
// With NAN inputs this is not the same operation as !=

Notes

Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLT_F64

174

A not less than B.
EXEC.u64[laneId] = !(S0.f64 < S1.f64);
// With NAN inputs this is not the same operation as >=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_F64

175

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_I16

177

A less than B.
EXEC.u64[laneId] = S0.i16 < S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_I16

178

A equal to B.
EXEC.u64[laneId] = S0.i16 == S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_I16

179

A less than or equal to B.
EXEC.u64[laneId] = S0.i16 <= S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_I16

180

A greater than B.
EXEC.u64[laneId] = S0.i16 > S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_I16

181

A not equal to B.
EXEC.u64[laneId] = S0.i16 <> S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_I16

182

A greater than or equal to B.
EXEC.u64[laneId] = S0.i16 >= S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_U16

185

A less than B.
EXEC.u64[laneId] = S0.u16 < S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_U16

186

A equal to B.
EXEC.u64[laneId] = S0.u16 == S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_U16

187

A less than or equal to B.
EXEC.u64[laneId] = S0.u16 <= S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_U16

188

A greater than B.
EXEC.u64[laneId] = S0.u16 > S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_U16

189

A not equal to B.
EXEC.u64[laneId] = S0.u16 <> S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_U16

190

A greater than or equal to B.
EXEC.u64[laneId] = S0.u16 >= S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
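
The signed (I16) and unsigned (U16) compares differ only in how the same 16 bits are interpreted. The following is a minimal sketch of a lane-wise less-than that builds a new EXEC mask. The helper names are hypothetical, and the handling of lanes that are inactive on entry (their EXEC bit stays 0) is an assumption of this sketch rather than something stated in this section.

```python
def as_i16(bits: int) -> int:
    """Reinterpret the low 16 bits as a two's-complement signed value."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits

def cmpx_lt_16(src0, src1, exec_mask: int, signed: bool) -> int:
    """Return the new EXEC mask for a lane-wise 16-bit S0 < S1 compare."""
    new_exec = 0
    for lane, (a, b) in enumerate(zip(src0, src1)):
        if not (exec_mask >> lane) & 1:
            continue  # assumption: inactive lanes produce a 0 bit
        if signed:
            a, b = as_i16(a), as_i16(b)
        else:
            a, b = a & 0xFFFF, b & 0xFFFF
        if a < b:
            new_exec |= 1 << lane
    return new_exec

src0 = [0xFFFF, 0x0001, 0x0005, 0x0000]
src1 = [0x0000, 0x0002, 0x0005, 0xFFFF]
print(bin(cmpx_lt_16(src0, src1, 0b1111, signed=True)))
print(bin(cmpx_lt_16(src0, src1, 0b1111, signed=False)))
```

In lane 0 the value 0xFFFF is -1 when read as I16 and 65535 when read as U16, so the two variants disagree on that lane.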

V_CMPX_F_I32

192

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_I32

193

A less than B.
EXEC.u64[laneId] = S0.i < S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_I32

194

A equal to B.
EXEC.u64[laneId] = S0.i == S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_I32

195

A less than or equal to B.
EXEC.u64[laneId] = S0.i <= S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_I32

196

A greater than B.
EXEC.u64[laneId] = S0.i > S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_I32

197

A not equal to B.
EXEC.u64[laneId] = S0.i <> S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_I32

198

A greater than or equal to B.
EXEC.u64[laneId] = S0.i >= S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_I32

199

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_U32

200

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_U32

201

A less than B.
EXEC.u64[laneId] = S0.u < S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_U32

202

A equal to B.
EXEC.u64[laneId] = S0.u == S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_U32

203

A less than or equal to B.
EXEC.u64[laneId] = S0.u <= S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_U32

204

A greater than B.
EXEC.u64[laneId] = S0.u > S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_U32

205

A not equal to B.
EXEC.u64[laneId] = S0.u <> S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_U32

206

A greater than or equal to B.
EXEC.u64[laneId] = S0.u >= S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_U32

207

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_I64

208

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_I64

209

A less than B.
EXEC.u64[laneId] = S0.i64 < S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_I64

210

A equal to B.
EXEC.u64[laneId] = S0.i64 == S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_I64

211

A less than or equal to B.
EXEC.u64[laneId] = S0.i64 <= S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_I64

212

A greater than B.
EXEC.u64[laneId] = S0.i64 > S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_I64

213

A not equal to B.
EXEC.u64[laneId] = S0.i64 <> S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_I64

214

A greater than or equal to B.
EXEC.u64[laneId] = S0.i64 >= S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_I64

215

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_U64

216

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_U64

217

A less than B.
EXEC.u64[laneId] = S0.u64 < S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_U64

218

A equal to B.
EXEC.u64[laneId] = S0.u64 == S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_U64

219

A less than or equal to B.
EXEC.u64[laneId] = S0.u64 <= S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_U64

220

A greater than B.
EXEC.u64[laneId] = S0.u64 > S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_U64

221

A not equal to B.
EXEC.u64[laneId] = S0.u64 <> S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_U64

222

A greater than or equal to B.
EXEC.u64[laneId] = S0.u64 >= S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_U64

223

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_CLASS_F16

253

IEEE numeric class function specified in S1.u, performed on S0.f16.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f16)) then
result = S1.u[0]
elsif isQuietNAN(64'F(S0.f16)) then
result = S1.u[1]
elsif exponent(S0.f16) == 31 then
// +-INF
result = S1.u[sign(S0.f16) ? 2 : 9]
elsif exponent(S0.f16) > 0 then
// +-normal value
result = S1.u[sign(S0.f16) ? 3 : 8]
elsif 64'F(abs(S0.f16)) > 0.0 then
// +-denormal value
result = S1.u[sign(S0.f16) ? 4 : 7]
else
// +-0.0
result = S1.u[sign(S0.f16) ? 5 : 6]
endif;
EXEC.u64[laneId] = result

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_CLASS_F32

254

IEEE numeric class function specified in S1.u, performed on S0.f.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f)) then
result = S1.u[0]
elsif isQuietNAN(64'F(S0.f)) then
result = S1.u[1]
elsif exponent(S0.f) == 255 then
// +-INF
result = S1.u[sign(S0.f) ? 2 : 9]
elsif exponent(S0.f) > 0 then
// +-normal value
result = S1.u[sign(S0.f) ? 3 : 8]
elsif 64'F(abs(S0.f)) > 0.0 then
// +-denormal value
result = S1.u[sign(S0.f) ? 4 : 7]
else
// +-0.0
result = S1.u[sign(S0.f) ? 5 : 6]
endif;
EXEC.u64[laneId] = result

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_CLASS_F64

255

IEEE numeric class function specified in S1.u, performed on S0.f64.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(S0.f64) then
result = S1.u[0]
elsif isQuietNAN(S0.f64) then
result = S1.u[1]
elsif exponent(S0.f64) == 2047 then
// +-INF
result = S1.u[sign(S0.f64) ? 2 : 9]
elsif exponent(S0.f64) > 0 then
// +-normal value
result = S1.u[sign(S0.f64) ? 3 : 8]
elsif abs(S0.f64) > 0.0 then
// +-denormal value
result = S1.u[sign(S0.f64) ? 4 : 7]
else
// +-0.0
result = S1.u[sign(S0.f64) ? 5 : 6]
endif;
EXEC.u64[laneId] = result

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
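
The class test can be sketched on the host by decoding the raw IEEE binary64 fields. This is an illustrative model only: the function names are hypothetical, the quiet/signaling distinction assumes the usual convention that the most significant mantissa bit is set for a quiet NAN, and the EXEC write and exception signalling are not modeled.

```python
import struct

def f64_bits(x: float) -> int:
    """Raw 64-bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def class_index_f64(bits: int) -> int:
    """Return the S1 bit index (0..9) that matches the value's class."""
    sign = (bits >> 63) & 1
    exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0x7FF and mantissa != 0:  # NaN
        return 1 if (mantissa >> 51) & 1 else 0
    if exponent == 0x7FF:                    # +-INF
        return 2 if sign else 9
    if exponent > 0:                         # +-normal
        return 3 if sign else 8
    if mantissa != 0:                        # +-denormal
        return 4 if sign else 7
    return 5 if sign else 6                  # +-0.0

def cmp_class_f64(bits: int, s1_mask: int) -> int:
    """1 if the value's class is selected in the S1 mask, else 0."""
    return (s1_mask >> class_index_f64(bits)) & 1

# Mask selecting either infinity: bits 2 and 9.
print(cmp_class_f64(f64_bits(float("inf")), 0b1000000100))
```

A mask of 0x3 tests for any NAN, and a mask of 0x204 (bits 2 and 9) tests for either infinity.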

16.9.1. VOPC using VOP3 encoding
Instructions in this format may also be encoded as VOP3. VOP3 allows access to the extra control bits (e.g. ABS,
OMOD) at the expense of a larger instruction word. The VOP3 opcode is: VOPC opcode + 0x000.
When the CLAMP microcode bit is set to 1, these compare instructions signal an exception when either of the
inputs is NaN. When CLAMP is set to zero, NaN does not signal an exception. The second eight VOPC
instructions have {OP8} embedded in them. This refers to each of the compare operations listed below.

VDST   = Destination for instruction in the VGPR.
ABS    = Floating-point absolute value.
CLMP   = Clamp output.
OP     = Instruction opcode.
SRC0   = First operand for instruction.
SRC1   = Second operand for instruction.
SRC2   = Third operand for instruction. Unused in VOPC instructions.
OMOD   = Output modifier for instruction. Unused in VOPC instructions.
NEG    = Floating-point negation.

16.10. VOP3P Instructions

V_PK_MAD_I16

0

Packed multiply-add on signed shorts.
D0[31 : 16].i16 = S0[31 : 16].i16 * S1[31 : 16].i16 + S2[31 : 16].i16;
D0[15 : 0].i16 = S0[15 : 0].i16 * S1[15 : 0].i16 + S2[15 : 0].i16
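
The packed integer operations in this section apply the same 16-bit operation independently to the low and high halves of each 32-bit source. A minimal sketch for the multiply-add case follows. The function names are hypothetical, and the sketch assumes results wrap to 16 bits (no clamping is modeled).

```python
def to_i16(bits: int) -> int:
    """Reinterpret the low 16 bits as a signed value."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits

def pk_mad_i16(s0: int, s1: int, s2: int) -> int:
    """Per-half S0 * S1 + S2 on two packed signed 16-bit values."""
    out = 0
    for shift in (0, 16):
        a = to_i16(s0 >> shift)
        b = to_i16(s1 >> shift)
        c = to_i16(s2 >> shift)
        # Assumption: the result is truncated (wrapped) to 16 bits.
        out |= ((a * b + c) & 0xFFFF) << shift
    return out

# High halves: 3 * 4 + 1; low halves: -2 * 5 + 1.
print(hex(pk_mad_i16(0x0003FFFE, 0x00040005, 0x00010001)))
```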

V_PK_MUL_LO_U16

1

Packed multiply on unsigned shorts.
D0[31 : 16].u16 = S0[31 : 16].u16 * S1[31 : 16].u16;
D0[15 : 0].u16 = S0[15 : 0].u16 * S1[15 : 0].u16

V_PK_ADD_I16

2

Packed addition on signed shorts.
D0[31 : 16].i16 = S0[31 : 16].i16 + S1[31 : 16].i16;
D0[15 : 0].i16 = S0[15 : 0].i16 + S1[15 : 0].i16

V_PK_SUB_I16

3

Packed subtraction on signed shorts. The second operand is subtracted from the first.
D0[31 : 16].i16 = S0[31 : 16].i16 - S1[31 : 16].i16;
D0[15 : 0].i16 = S0[15 : 0].i16 - S1[15 : 0].i16

V_PK_LSHLREV_B16

4

Packed logical shift left. The shift count is in the first operand.

D0[31 : 16].u16 = S1[31 : 16].u16 << S0.u[19 : 16].u;
D0[15 : 0].u16 = S1[15 : 0].u16 << S0.u[3 : 0].u

V_PK_LSHRREV_B16

5

Packed logical shift right. The shift count is in the first operand.
D0[31 : 16].u16 = S1[31 : 16].u16 >> S0.u[19 : 16].u;
D0[15 : 0].u16 = S1[15 : 0].u16 >> S0.u[3 : 0].u

V_PK_ASHRREV_I16

6

Packed arithmetic shift right (preserve sign bit). The shift count is in the first operand.
D0[31 : 16].i16 = S1[31 : 16].i16 >> S0.u[19 : 16].u;
D0[15 : 0].i16 = S1[15 : 0].i16 >> S0.u[3 : 0].u
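
The packed shifts take the data from S1 and a 4-bit shift count per half from S0 (bits [3:0] for the low half, bits [19:16] for the high half). The sketch below contrasts the logical and arithmetic right shifts on the same input. The function names are illustrative and this is a host-side model, not a definition of hardware behavior.

```python
def pk_lshrrev_b16(s0: int, s1: int) -> int:
    """Packed logical shift right: zeros are shifted in from the top."""
    lo = (s1 & 0xFFFF) >> (s0 & 0xF)
    hi = ((s1 >> 16) & 0xFFFF) >> ((s0 >> 16) & 0xF)
    return (hi << 16) | lo

def pk_ashrrev_i16(s0: int, s1: int) -> int:
    """Packed arithmetic shift right: the sign bit is replicated."""
    def sar16(value: int, count: int) -> int:
        value &= 0xFFFF
        if value & 0x8000:
            value -= 0x10000
        return (value >> count) & 0xFFFF
    lo = sar16(s1, s0 & 0xF)
    hi = sar16(s1 >> 16, (s0 >> 16) & 0xF)
    return (hi << 16) | lo

# Shift the high half by 4 and the low half by 1.
print(hex(pk_lshrrev_b16(0x00040001, 0x80008000)))
print(hex(pk_ashrrev_i16(0x00040001, 0x80008000)))
```

Only the four count bits of each half are used, so S0 = 0x00140011 produces the same results as S0 = 0x00040001.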

V_PK_MAX_I16

7

Packed maximum of signed shorts.
D0[31 : 16].i16 = S0[31 : 16].i16 >= S1[31 : 16].i16 ? S0[31 : 16].i16 : S1[31 : 16].i16;
D0[15 : 0].i16 = S0[15 : 0].i16 >= S1[15 : 0].i16 ? S0[15 : 0].i16 : S1[15 : 0].i16

V_PK_MIN_I16

8

Packed minimum of signed shorts.
D0[31 : 16].i16 = S0[31 : 16].i16 < S1[31 : 16].i16 ? S0[31 : 16].i16 : S1[31 : 16].i16;
D0[15 : 0].i16 = S0[15 : 0].i16 < S1[15 : 0].i16 ? S0[15 : 0].i16 : S1[15 : 0].i16

V_PK_MAD_U16

9

Packed multiply-add on unsigned shorts.

D0[31 : 16].u16 = S0[31 : 16].u16 * S1[31 : 16].u16 + S2[31 : 16].u16;
D0[15 : 0].u16 = S0[15 : 0].u16 * S1[15 : 0].u16 + S2[15 : 0].u16

V_PK_ADD_U16

10

Packed addition on unsigned shorts.
D0[31 : 16].u16 = S0[31 : 16].u16 + S1[31 : 16].u16;
D0[15 : 0].u16 = S0[15 : 0].u16 + S1[15 : 0].u16

V_PK_SUB_U16

11

Packed subtraction on unsigned shorts. The second operand is subtracted from the first.
D0[31 : 16].u16 = S0[31 : 16].u16 - S1[31 : 16].u16;
D0[15 : 0].u16 = S0[15 : 0].u16 - S1[15 : 0].u16

V_PK_MAX_U16

12

Packed maximum of unsigned shorts.
D0[31 : 16].u16 = S0[31 : 16].u16 >= S1[31 : 16].u16 ? S0[31 : 16].u16 : S1[31 : 16].u16;
D0[15 : 0].u16 = S0[15 : 0].u16 >= S1[15 : 0].u16 ? S0[15 : 0].u16 : S1[15 : 0].u16

V_PK_MIN_U16

13

Packed minimum of unsigned shorts.
D0[31 : 16].u16 = S0[31 : 16].u16 < S1[31 : 16].u16 ? S0[31 : 16].u16 : S1[31 : 16].u16;
D0[15 : 0].u16 = S0[15 : 0].u16 < S1[15 : 0].u16 ? S0[15 : 0].u16 : S1[15 : 0].u16

V_PK_FMA_F16

14

Packed fused-multiply-add of FP16 values.

D0[31 : 16].f16 = fma(S0[31 : 16].f16, S1[31 : 16].f16, S2[31 : 16].f16);
D0[15 : 0].f16 = fma(S0[15 : 0].f16, S1[15 : 0].f16, S2[15 : 0].f16)

V_PK_ADD_F16

15

Packed addition of FP16 values.
D0[31 : 16].f16 = S0[31 : 16].f16 + S1[31 : 16].f16;
D0[15 : 0].f16 = S0[15 : 0].f16 + S1[15 : 0].f16

V_PK_MUL_F16

16

Packed multiply of FP16 values.
D0[31 : 16].f16 = S0[31 : 16].f16 * S1[31 : 16].f16;
D0[15 : 0].f16 = S0[15 : 0].f16 * S1[15 : 0].f16

V_PK_MIN_F16

17

Packed minimum of FP16 values.
D0[31 : 16].f16 = v_min_f16(S0[31 : 16].f16, S1[31 : 16].f16);
D0[15 : 0].f16 = v_min_f16(S0[15 : 0].f16, S1[15 : 0].f16)

V_PK_MAX_F16

18

Packed maximum of FP16 values.
D0[31 : 16].f16 = v_max_f16(S0[31 : 16].f16, S1[31 : 16].f16);
D0[15 : 0].f16 = v_max_f16(S0[15 : 0].f16, S1[15 : 0].f16)

V_DOT2_F32_F16

19

Dot product of packed FP16 values.

tmp = 32'F(S0[15 : 0].f16) * 32'F(S1[15 : 0].f16);
tmp += 32'F(S0[31 : 16].f16) * 32'F(S1[31 : 16].f16);
tmp += S2.f;
D0.f = tmp
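
The pseudocode above can be followed step by step on the host using Python's struct module to decode FP16 bit patterns. This is a sketch under stated assumptions: the helper names are hypothetical, and rounding each intermediate to single precision is an assumption of the sketch (the exact hardware rounding of intermediates is not specified here).

```python
import struct

def f16(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def round_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot2_f32_f16(s0: int, s1: int, s2: float) -> float:
    tmp = round_f32(f16(s0) * f16(s1))                     # low halves
    tmp = round_f32(tmp + f16(s0 >> 16) * f16(s1 >> 16))   # high halves
    return round_f32(tmp + s2)                             # accumulate S2

# S0 = {2.0, 1.0}, S1 = {3.0, 2.0}, S2 = 0.5
print(dot2_f32_f16(0x40003C00, 0x42004000, 0.5))
```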

V_DOT4_I32_IU8

22

Dot product of signed or unsigned bytes.
declare A : 32'I[4];
declare B : 32'I[4];
// Figure out whether inputs are signed/unsigned.
for i in 0 : 3 do
A8 = S0[i * 8 + 7 : i * 8];
B8 = S1[i * 8 + 7 : i * 8];
A[i] = NEG[0].u1 ? 32'I(signext(A8.i8)) : 32'I(32'U(A8.u8));
B[i] = NEG[1].u1 ? 32'I(signext(B8.i8)) : 32'I(32'U(B8.u8))
endfor;
C = S2.i;
// Signed multiplier/adder. Extend unsigned inputs with leading 0.
D0.i = A[0] * B[0];
D0.i += A[1] * B[1];
D0.i += A[2] * B[2];
D0.i += A[3] * B[3];
D0.i += C

Notes
This opcode does not depend on the inference or deep learning features being enabled.
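
The role of the NEG bits as per-source signedness selectors can be illustrated with a short model. In this sketch neg0 and neg1 stand for NEG[0] and NEG[1], the function name is hypothetical, and wrapping the accumulator to 32 bits is an assumption (clamping is not modeled).

```python
def dot4_i32_iu8(s0: int, s1: int, s2: int, neg0: bool, neg1: bool) -> int:
    """Dot product of four bytes; neg0/neg1 select signed interpretation."""
    def byte(value: int, index: int, signed: bool) -> int:
        b = (value >> (8 * index)) & 0xFF
        return b - 0x100 if signed and b & 0x80 else b

    acc = s2
    for i in range(4):
        acc += byte(s0, i, neg0) * byte(s1, i, neg1)
    # Assumption: the result wraps to a signed 32-bit integer.
    acc &= 0xFFFFFFFF
    return acc - (1 << 32) if acc & 0x80000000 else acc

# Byte 0 of S0 is 0xFF: -1 when treated as signed, 255 when unsigned.
print(dot4_i32_iu8(0x010203FF, 0x01010101, 10, neg0=True, neg1=False))
print(dot4_i32_iu8(0x010203FF, 0x01010101, 10, neg0=False, neg1=False))
```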

V_DOT4_U32_U8

23

Dot product of unsigned bytes.
tmp = 32'U(S0[7 : 0].u8) * 32'U(S1[7 : 0].u8);
tmp += 32'U(S0[15 : 8].u8) * 32'U(S1[15 : 8].u8);
tmp += 32'U(S0[23 : 16].u8) * 32'U(S1[23 : 16].u8);
tmp += 32'U(S0[31 : 24].u8) * 32'U(S1[31 : 24].u8);
tmp += S2.u;
D0.u = tmp

Notes
This opcode does not depend on the inference or deep learning features being enabled.

V_DOT8_I32_IU4

24

Dot product of signed or unsigned nibbles.
declare A : 32'I[8];
declare B : 32'I[8];
// Figure out whether inputs are signed/unsigned.
for i in 0 : 7 do
A4 = S0[i * 4 + 3 : i * 4];
B4 = S1[i * 4 + 3 : i * 4];
A[i] = NEG[0].u1 ? 32'I(signext(A4.i4)) : 32'I(32'U(A4.u4));
B[i] = NEG[1].u1 ? 32'I(signext(B4.i4)) : 32'I(32'U(B4.u4))
endfor;
C = S2.i;
// Signed multiplier/adder. Extend unsigned inputs with leading 0.
D0.i = A[0] * B[0];
D0.i += A[1] * B[1];
D0.i += A[2] * B[2];
D0.i += A[3] * B[3];
D0.i += A[4] * B[4];
D0.i += A[5] * B[5];
D0.i += A[6] * B[6];
D0.i += A[7] * B[7];
D0.i += C

V_DOT8_U32_U4

25

Dot product of unsigned nibbles.
tmp = 32'U(S0[3 : 0].u4) * 32'U(S1[3 : 0].u4);
tmp += 32'U(S0[7 : 4].u4) * 32'U(S1[7 : 4].u4);
tmp += 32'U(S0[11 : 8].u4) * 32'U(S1[11 : 8].u4);
tmp += 32'U(S0[15 : 12].u4) * 32'U(S1[15 : 12].u4);
tmp += 32'U(S0[19 : 16].u4) * 32'U(S1[19 : 16].u4);
tmp += 32'U(S0[23 : 20].u4) * 32'U(S1[23 : 20].u4);
tmp += 32'U(S0[27 : 24].u4) * 32'U(S1[27 : 24].u4);
tmp += 32'U(S0[31 : 28].u4) * 32'U(S1[31 : 28].u4);
tmp += S2.u;
D0.u = tmp

V_DOT2_F32_BF16

26

Dot product of packed brain-float values.
tmp = 32'F(S0[15 : 0].bf16) * 32'F(S1[15 : 0].bf16);
tmp += 32'F(S0[31 : 16].bf16) * 32'F(S1[31 : 16].bf16);
tmp += S2.f;
D0.f = tmp

V_FMA_MIX_F32

32

Fused-multiply-add of single-precision values with MIX encoding.
Size and location of S0, S1 and S2 controlled by OPSEL: 0=src[31:0], 1=src[31:0], 2=src[15:0], 3=src[31:16]. Also,
for FMA_MIX, the NEG_HI field acts instead as an absolute-value modifier.
declare in : 32'F[3];
declare S : 32'B[3];
for i in 0 : 2 do
if !OPSEL_HI.u3[i] then
in[i] = S[i].f
elsif OPSEL.u3[i] then
in[i] = f16_to_f32(S[i][31 : 16].f16)
else
in[i] = f16_to_f32(S[i][15 : 0].f16)
endif
endfor;
D0[31 : 0].f = fma(in[0], in[1], in[2])
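
The operand selection in the loop above depends on one OPSEL_HI bit and one OPSEL bit per source. The following sketch models only that selection step. The function names are hypothetical, and the fused multiply-add itself and the NEG_HI absolute-value modifier are not modeled.

```python
import struct

def f32_from_bits(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def f16_from_bits(bits: int) -> float:
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def mix_operand(src_bits: int, opsel_hi: int, opsel: int) -> float:
    """Select one MIX input: full FP32, or the high or low FP16 half."""
    if not opsel_hi:
        return f32_from_bits(src_bits)        # OPSEL is ignored in this case
    if opsel:
        return f16_from_bits(src_bits >> 16)  # src[31:16]
    return f16_from_bits(src_bits)            # src[15:0]

def fma_mix_inputs(srcs, opsel_hi: int, opsel: int):
    """Apply the per-source selection for S0, S1 and S2."""
    return [mix_operand(s, (opsel_hi >> i) & 1, (opsel >> i) & 1)
            for i, s in enumerate(srcs)]

# S0 as FP32, S1 as low FP16 half, S2 as high FP16 half.
print(fma_mix_inputs([0x3F800000, 0x40003C00, 0x40003C00], 0b110, 0b100))
```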

V_FMA_MIXLO_F16

33

Fused-multiply-add of FP16 values with MIX encoding, result stored in low 16 bits of destination.
Size and location of S0, S1 and S2 controlled by OPSEL: 0=src[31:0], 1=src[31:0], 2=src[15:0], 3=src[31:16]. Also,
for FMA_MIX, the NEG_HI field acts instead as an absolute-value modifier.
declare in : 32'F[3];
declare S : 32'B[3];
for i in 0 : 2 do
if !OPSEL_HI.u3[i] then
in[i] = S[i].f
elsif OPSEL.u3[i] then
in[i] = f16_to_f32(S[i][31 : 16].f16)
else
in[i] = f16_to_f32(S[i][15 : 0].f16)
endif
endfor;
D0[15 : 0].f16 = f32_to_f16(fma(in[0], in[1], in[2]))

V_FMA_MIXHI_F16

34

Fused-multiply-add of FP16 values with MIX encoding, result stored in HIGH 16 bits of destination.

Size and location of S0, S1 and S2 controlled by OPSEL: 0=src[31:0], 1=src[31:0], 2=src[15:0], 3=src[31:16]. Also,
for FMA_MIX, the NEG_HI field acts instead as an absolute-value modifier.
declare in : 32'F[3];
declare S : 32'B[3];
for i in 0 : 2 do
if !OPSEL_HI.u3[i] then
in[i] = S[i].f
elsif OPSEL.u3[i] then
in[i] = f16_to_f32(S[i][31 : 16].f16)
else
in[i] = f16_to_f32(S[i][15 : 0].f16)
endif
endfor;
D0[31 : 16].f16 = f32_to_f16(fma(in[0], in[1], in[2]))

V_WMMA_F32_16X16X16_F16

64

WMMA matrix multiplication with F16 multiplicands and single precision result.
saved_exec = EXEC;
EXEC = 64'B(-1);
eval "D0.f32(16x16) = S0.f16(16x16) * S1.f16(16x16) + S2.f32(16x16)";
EXEC = saved_exec
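
The eval string above describes an ordinary 16x16 matrix multiply-accumulate, D = A * B + C. A plain reference version is sketched below for checking results on the host. It is only a mathematical model: the function name is hypothetical, Python floats (binary64) are used throughout, and neither the distribution of matrix elements across lanes and VGPRs nor the hardware rounding of intermediates is modeled.

```python
def wmma_16x16x16(a, b, c):
    """Reference D = A * B + C for 16x16 matrices given as lists of rows."""
    n = 16
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

# Identity * B + ones gives B + 1 in every element.
identity = [[1.0 if i == j else 0.0 for j in range(16)] for i in range(16)]
b = [[float(i * 16 + j) for j in range(16)] for i in range(16)]
ones = [[1.0] * 16 for _ in range(16)]
d = wmma_16x16x16(identity, b, ones)
print(d[3][5])
```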

V_WMMA_F32_16X16X16_BF16

65

WMMA matrix multiplication with brain float multiplicands and single precision result.
saved_exec = EXEC;
EXEC = 64'B(-1);
eval "D0.f32(16x16) = S0.bf16(16x16) * S1.bf16(16x16) + S2.f32(16x16)";
EXEC = saved_exec

V_WMMA_F16_16X16X16_F16

66

WMMA matrix multiplication with F16 multiplicands and F16 result.
saved_exec = EXEC;
EXEC = 64'B(-1);
eval "D0.f16(16x16) = S0.f16(16x16) * S1.f16(16x16) + S2.f16(16x16)";
EXEC = saved_exec

V_WMMA_BF16_16X16X16_BF16

67

WMMA matrix multiplication with brain float multiplicands and brain float result.
saved_exec = EXEC;
EXEC = 64'B(-1);
eval "D0.bf16(16x16) = S0.bf16(16x16) * S1.bf16(16x16) + S2.bf16(16x16)";
EXEC = saved_exec

V_WMMA_I32_16X16X16_IU8

68

WMMA matrix multiplication with 8-bit integer multiplicands and signed 32-bit integer result.
saved_exec = EXEC;
EXEC = 64'B(-1);
eval "D0.i32(16x16) = S0.iu8(16x16) * S1.iu8(16x16) + S2.i32(16x16)";
EXEC = saved_exec

V_WMMA_I32_16X16X16_IU4

69

WMMA matrix multiplication with 4-bit integer multiplicands and signed 32-bit integer result.
saved_exec = EXEC;
EXEC = 64'B(-1);
eval "D0.i32(16x16) = S0.iu4(16x16) * S1.iu4(16x16) + S2.i32(16x16)";
EXEC = saved_exec

16.11. VOPD Instructions

The VOPD encoding describes two VALU opcodes that are executed in parallel.
For instruction definitions, refer to the VOP1, VOP2 and VOP3 sections.

16.11.1. VOPD X-Instructions
V_DUAL_FMAC_F32

0

Dual-issue opcode of FMAC_F32.
Fused multiply-add of single-precision floats, accumulate with destination.

V_DUAL_FMAAK_F32

1

Dual-issue opcode of FMAAK_F32.
Multiply two single-precision floats and add a literal constant using fused multiply-add.

V_DUAL_FMAMK_F32

2

Dual-issue opcode of FMAMK_F32.
Multiply a single-precision float with a literal constant and add a second single-precision float using fused
multiply-add.

V_DUAL_MUL_F32

3

Dual-issue opcode of MUL_F32.
Multiply two single-precision values.

V_DUAL_ADD_F32

4

Dual-issue opcode of ADD_F32.
Add two single-precision values.

V_DUAL_SUB_F32

5

Dual-issue opcode of SUB_F32.
Subtract the second single-precision input from the first input.

V_DUAL_SUBREV_F32

6

Dual-issue opcode of SUBREV_F32.
Subtract the first single-precision input from the second input.

V_DUAL_MUL_DX9_ZERO_F32

7

Dual-issue opcode of MUL_DX9_ZERO_F32.
Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE
compliant).

V_DUAL_MOV_B32

8

Dual-issue opcode of MOV_B32.
Move data to a VGPR.

V_DUAL_CNDMASK_B32

9

Dual-issue opcode of CNDMASK_B32.
Conditional mask on each thread.

V_DUAL_MAX_F32

10

Dual-issue opcode of MAX_F32.
Compute the maximum of two floats.

V_DUAL_MIN_F32

11

Dual-issue opcode of MIN_F32.
Compute the minimum of two floats.

V_DUAL_DOT2ACC_F32_F16

12

Dual-issue opcode of DOT2ACC_F32_F16.
Dot product of packed FP16 values, accumulate with destination. The initial value in D is used as S2.

V_DUAL_DOT2ACC_F32_BF16

13

Dual-issue opcode of DOT2ACC_F32_BF16.
Dot product of packed brain-float values. The initial value in D is used as S2.

16.11.2. VOPD Y-Instructions
V_DUAL_FMAC_F32

0

Dual-issue opcode of FMAC_F32.
Fused multiply-add of single-precision floats, accumulate with destination.

V_DUAL_FMAAK_F32

1

Dual-issue opcode of FMAAK_F32.
Multiply two single-precision floats and add a literal constant using fused multiply-add.

V_DUAL_FMAMK_F32

2

Dual-issue opcode of FMAMK_F32.
Multiply a single-precision float with a literal constant and add a second single-precision float using fused
multiply-add.

V_DUAL_MUL_F32

3

Dual-issue opcode of MUL_F32.
Multiply two single-precision values.

V_DUAL_ADD_F32

4

Dual-issue opcode of ADD_F32.
Add two single-precision values.

V_DUAL_SUB_F32

5

Dual-issue opcode of SUB_F32.
Subtract the second single-precision input from the first input.

V_DUAL_SUBREV_F32

6

Dual-issue opcode of SUBREV_F32.
Subtract the first single-precision input from the second input.

V_DUAL_MUL_DX9_ZERO_F32

7

Dual-issue opcode of MUL_DX9_ZERO_F32.
Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE
compliant).

V_DUAL_MOV_B32

8

Dual-issue opcode of MOV_B32.
Move data to a VGPR.

V_DUAL_CNDMASK_B32

9

Dual-issue opcode of CNDMASK_B32.
Conditional mask on each thread.

V_DUAL_MAX_F32

10

Dual-issue opcode of MAX_F32.
Compute the maximum of two floats.

V_DUAL_MIN_F32

11

Dual-issue opcode of MIN_F32.
Compute the minimum of two floats.

V_DUAL_DOT2ACC_F32_F16

12

Dual-issue opcode of DOT2ACC_F32_F16.
Dot product of packed FP16 values, accumulate with destination. The initial value in D is used as S2.

V_DUAL_DOT2ACC_F32_BF16

13

Dual-issue opcode of DOT2ACC_F32_BF16.
Dot product of packed brain-float values. The initial value in D is used as S2.

V_DUAL_ADD_NC_U32

16

Dual-issue opcode of ADD_NC_U32.
Add two unsigned integers. No carry-in or carry-out.

V_DUAL_LSHLREV_B32

17

Dual-issue opcode of LSHLREV_B32.
Logical shift left with shift count in the first operand.

V_DUAL_AND_B32

18

Dual-issue opcode of AND_B32.
Bitwise AND.

16.12. VOP3 & VOP3SD Instructions
VOP3 instructions use one of two encodings:

VOP3SD   This encoding allows specifying a unique scalar destination, and is used only for:
         V_ADD_CO_U32
         V_SUB_CO_U32
         V_SUBREV_CO_U32
         V_ADDC_CO_U32
         V_SUBB_CO_U32
         V_SUBBREV_CO_U32
         V_DIV_SCALE_F32
         V_DIV_SCALE_F64
         V_MAD_U64_U32
         V_MAD_I64_I32

VOP3     All other VALU instructions use this encoding.

V_NOP

384

Do nothing.

V_MOV_B32

385

Move data to a VGPR.
D0.b = S0.b

Notes
Floating-point modifiers are valid for this instruction if S0.u is a 32-bit floating point value. This instruction is
suitable for negating or taking the absolute value of a floating-point value.
Functional examples:
v_mov_b32 v0, v1        // Move v1 to v0
v_mov_b32 v0, -v1       // Set v0 to the negation of v1
v_mov_b32 v0, abs(v1)   // Set v0 to the absolute value of v1
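
Because the move is bitwise, the floating-point modifiers amount to operations on the sign bit of the 32-bit pattern. A minimal sketch follows. The function name mirrors the opcode for readability only, and applying the absolute value before the negation when both are set is an assumption of this sketch.

```python
def v_mov_b32(src_bits: int, neg: bool = False, abs_: bool = False) -> int:
    """Model a 32-bit move with optional FP32 input modifiers."""
    bits = src_bits & 0xFFFFFFFF
    if abs_:
        bits &= 0x7FFFFFFF   # clear the sign bit
    if neg:
        bits ^= 0x80000000   # flip the sign bit
    return bits

print(hex(v_mov_b32(0x3F800000, neg=True)))    # 1.0 -> -1.0
print(hex(v_mov_b32(0xBF800000, abs_=True)))   # -1.0 -> 1.0
```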

V_READFIRSTLANE_B32

386

Copy one VGPR value from the lowest active lane to one SGPR.
declare lane : 32'U;
if WAVE64 then
// 64 lanes
if EXEC == 0x0LL then
lane = 0U;
// Force lane 0 if all lanes are disabled
else
lane = 32'U(s_ff1_i32_b64(EXEC));
// Lowest active lane
endif
else
// 32 lanes
if EXEC_LO.i == 0 then
lane = 0U;
// Force lane 0 if all lanes are disabled
else
lane = 32'U(s_ff1_i32_b32(EXEC_LO));
// Lowest active lane
endif
endif;
D0.b = VGPR[lane][SRC0.u]

Notes
Ignores EXEC mask for the VGPR read. Input and output modifiers not supported; this is an untyped operation.
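
The lane selection above reduces to a find-first-set on the EXEC mask with a fallback to lane 0. A host-side sketch is given below. The function names are illustrative, and the VGPR is modeled as a simple list with one entry per lane.

```python
def first_active_lane(exec_mask: int, wave64: bool) -> int:
    """Index of the lowest set EXEC bit, or 0 if no lane is active."""
    mask = exec_mask & (0xFFFFFFFFFFFFFFFF if wave64 else 0xFFFFFFFF)
    if mask == 0:
        return 0  # force lane 0 if all lanes are disabled
    return (mask & -mask).bit_length() - 1  # isolate lowest set bit

def v_readfirstlane_b32(vgpr_lanes, exec_mask: int, wave64: bool = False) -> int:
    """Return the value held by the lowest active lane."""
    return vgpr_lanes[first_active_lane(exec_mask, wave64)]

lanes = list(range(100, 132))  # lane i holds 100 + i
print(v_readfirstlane_b32(lanes, 0b1100))
```

In wave32 mode only EXEC_LO is examined, so a mask with bits set only above bit 31 selects lane 0 in this model.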

V_CVT_I32_F64

387

Convert from a double-precision float to a signed integer.
D0.i = f64_to_i32(S0.f64)

Notes
0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_F64_I32

388

Convert from a signed integer to a double-precision float.
D0.f64 = i32_to_f64(S0.i)

Notes
0ULP accuracy.

V_CVT_F32_I32

389

Convert from a signed integer to a single-precision float.
D0.f = i32_to_f32(S0.i)

Notes
0.5ULP accuracy.

V_CVT_F32_U32

390

Convert from an unsigned integer to a single-precision float.
D0.f = u32_to_f32(S0.u)

Notes
0.5ULP accuracy.

V_CVT_U32_F32

391

Convert from a single-precision float to an unsigned integer.
D0.u = f32_to_u32(S0.f)

Notes
1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_I32_F32

392

Convert from a single-precision float to a signed integer.
D0.i = f32_to_i32(S0.f)

Notes
1ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_F16_F32

394

Convert from a single-precision float to an FP16 float.
D0.f16 = f32_to_f16(S0.f)

Notes
0.5ULP accuracy, supports input modifiers and creates FP16 denormals when appropriate. Flush denorms on
output if specified based on DP denorm mode. Output rounding based on DP rounding mode.

V_CVT_F32_F16

395

Convert from an FP16 float to a single-precision float.
D0.f = f16_to_f32(S0.f16)

Notes
0ULP accuracy, FP16 denormal inputs are accepted. Flush denorms on input if specified based on DP denorm
mode.

V_CVT_NEAREST_I32_F32

396

Convert from a single-precision float to a signed integer, round to nearest integer.
D0.i = f32_to_i32(floor(S0.f + 0.5F))

Notes
0.5ULP accuracy, denormals are supported.

V_CVT_FLOOR_I32_F32

397

Convert from a single-precision float to a signed integer, round down.
D0.i = f32_to_i32(floor(S0.f))

Notes
1ULP accuracy, denormals are supported.

V_CVT_OFF_F32_I4

398

4-bit signed int to 32-bit float. Used for interpolation in shader.
Lookup table on S0[3:0]:
S0 binary Result
1000 -0.5000f
1001 -0.4375f
1010 -0.3750f
1011 -0.3125f
1100 -0.2500f
1101 -0.1875f
1110 -0.1250f
1111 -0.0625f
0000 +0.0000f
0001 +0.0625f
0010 +0.1250f
0011 +0.1875f
0100 +0.2500f
0101 +0.3125f
0110 +0.3750f
0111 +0.4375f
declare CVT_OFF_TABLE : 32'F[16];
D0.f = CVT_OFF_TABLE[S0.u[3 : 0]]
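
Every entry in the lookup table equals the low four bits of S0, read as a two's-complement integer, divided by 16. The sketch below reproduces the table that way; the function name is hypothetical and the closed form is an observation about the listed values, not a statement about how hardware implements it.

```python
def cvt_off_f32_i4(s0: int) -> float:
    """Model of V_CVT_OFF_F32_I4: signed 4-bit value scaled by 1/16."""
    nibble = s0 & 0xF                                  # only S0[3:0] is used
    signed = nibble - 16 if nibble & 0x8 else nibble   # two's-complement decode
    return signed / 16.0
```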

V_CVT_F32_F64

399

Convert from a double-precision float to a single-precision float.
D0.f = f64_to_f32(S0.f64)

Notes
0.5ULP accuracy, denormals are supported.

V_CVT_F64_F32

400

Convert from a single-precision float to a double-precision float.
D0.f64 = f32_to_f64(S0.f)

Notes
0ULP accuracy, denormals are supported.

V_CVT_F32_UBYTE0

401

Convert an unsigned byte (byte 0) to a single-precision float.
D0.f = u32_to_f32(S0.u[7 : 0].u)

V_CVT_F32_UBYTE1

402

Convert an unsigned byte (byte 1) to a single-precision float.
D0.f = u32_to_f32(S0.u[15 : 8].u)

V_CVT_F32_UBYTE2

403

Convert an unsigned byte (byte 2) to a single-precision float.
D0.f = u32_to_f32(S0.u[23 : 16].u)

V_CVT_F32_UBYTE3

404

Convert an unsigned byte (byte 3) to a single-precision float.
D0.f = u32_to_f32(S0.u[31 : 24].u)
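
The four V_CVT_F32_UBYTEn opcodes differ only in which byte of S0 they select. A minimal Python sketch, with a hypothetical helper that takes the byte index as a parameter:

```python
def cvt_f32_ubyte(s0: int, n: int) -> float:
    """Model of V_CVT_F32_UBYTE<n>: convert byte n (0 = least significant) to float."""
    return float((s0 >> (8 * n)) & 0xFF)
```

For S0 = 0x11223344, bytes 0 through 3 give 68.0, 51.0, 34.0 and 17.0.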

V_CVT_U32_F64

405

Convert from a double-precision float to an unsigned integer.
D0.u = f64_to_u32(S0.f64)

Notes
0.5ULP accuracy, out-of-range floating point values (including infinity) saturate. NAN is converted to 0.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_F64_U32

406

Convert from an unsigned integer to a double-precision float.
D0.f64 = u32_to_f64(S0.u)

Notes
0ULP accuracy.

V_TRUNC_F64

407

Return integer part of a number with round-to-zero semantics.
D0.f64 = trunc(S0.f64)

V_CEIL_F64

408

Round up to next whole integer.
D0.f64 = trunc(S0.f64);
if ((S0.f64 > 0.0) && (S0.f64 != D0.f64)) then
D0.f64 += 1.0
endif

V_RNDNE_F64

409

Round-to-nearest-even semantics.
D0.f64 = floor(S0.f64 + 0.5);
if (isEven(floor(S0.f64)) && (fract(S0.f64) == 0.5)) then
D0.f64 -= 1.0
endif
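
The round-to-nearest-even pseudocode can be transcribed directly into Python for finite inputs. This is a sketch in double precision; the function name is hypothetical, and infinities and NaN are not handled because `math.floor` rejects them.

```python
import math


def rndne(x: float) -> float:
    """Transcription of the V_RNDNE pseudocode for finite inputs."""
    result = math.floor(x + 0.5)
    frac = x - math.floor(x)
    # An exact tie whose floor is even must round down to that even value.
    if math.floor(x) % 2 == 0 and frac == 0.5:
        result -= 1
    return float(result)
```

For the values tested here, the result agrees with Python's built-in `round()`, which also rounds ties to even.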

V_FLOOR_F64

410

Round down to previous whole integer.
D0.f64 = trunc(S0.f64);
if ((S0.f64 < 0.0) && (S0.f64 != D0.f64)) then
D0.f64 += -1.0
endif

V_PIPEFLUSH

411

Flush the VALU destination cache.

V_MOV_B16

412

Move data to a VGPR.
D0.b16 = S0.b16

Notes
Floating-point modifiers are valid for this instruction if S0.u16 is a 16-bit floating point value. This instruction is
suitable for negating or taking the absolute value of a floating-point value.

V_FRACT_F32

416

Return fractional portion of a number.
D0.f = S0.f + -floor(S0.f)

Notes
0.5ULP accuracy, denormals are accepted.
This complies with the DX specification of fract where the function behaves like an extension of integer
modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8
in DX.
Obey round mode, result clamped to 0x3f7fffff.
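
The difference between the DX definition and a truncation-based fractional part can be seen in a short Python sketch. The function name is hypothetical, the arithmetic is in double precision, and the clamp mentioned above is not modelled.

```python
import math


def fract(x: float) -> float:
    """DX-style fract: x - floor(x), always in [0, 1) for finite x."""
    return x - math.floor(x)
```

Here `fract(-1.2)` is approximately 0.8, whereas `math.modf(-1.2)[0]` (truncation-based) is approximately -0.2.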

V_TRUNC_F32

417

Return integer part of a number with round-to-zero semantics.
D0.f = trunc(S0.f)

V_CEIL_F32

418

Round up to next whole integer.
D0.f = trunc(S0.f);
if ((S0.f > 0.0F) && (S0.f != D0.f)) then
D0.f += 1.0F
endif

V_RNDNE_F32

419

Round-to-nearest-even semantics.
D0.f = floor(S0.f + 0.5F);
if (isEven(64'F(floor(S0.f))) && (fract(S0.f) == 0.5F)) then
D0.f -= 1.0F
endif

V_FLOOR_F32

420

Round down to previous whole integer.
D0.f = trunc(S0.f);
if ((S0.f < 0.0F) && (S0.f != D0.f)) then
D0.f += -1.0F
endif

V_EXP_F32

421

Base 2 exponentiation.
D0.f = pow(2.0F, S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_EXP_F32(0xff800000) ⇒ 0x00000000 // exp(-INF) = 0
V_EXP_F32(0x80000000) ⇒ 0x3f800000 // exp(-0.0) = 1
V_EXP_F32(0x7f800000) ⇒ 0x7f800000 // exp(+INF) = +INF

V_LOG_F32

423

Base 2 logarithm.
D0.f = log2(S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_LOG_F32(0xff800000) ⇒ 0xffc00000 // log(-INF) = NAN
V_LOG_F32(0xbf800000) ⇒ 0xffc00000 // log(-1.0) = NAN
V_LOG_F32(0x80000000) ⇒ 0xff800000 // log(-0.0) = -INF
V_LOG_F32(0x00000000) ⇒ 0xff800000 // log(+0.0) = -INF
V_LOG_F32(0x3f800000) ⇒ 0x00000000 // log(+1.0) = 0
V_LOG_F32(0x7f800000) ⇒ 0x7f800000 // log(+INF) = +INF

V_RCP_F32

426

Compute reciprocal with IEEE rules.
D0.f = 1.0F / S0.f

Notes
1ULP accuracy. Accuracy converges to < 0.5ULP when using the Newton-Raphson method and 2 FMA
operations. Denormals are flushed.
Functional examples:
V_RCP_F32(0xff800000) ⇒ 0x80000000 // rcp(-INF) = -0
V_RCP_F32(0xc0000000) ⇒ 0xbf000000 // rcp(-2.0) = -0.5
V_RCP_F32(0x80000000) ⇒ 0xff800000 // rcp(-0.0) = -INF
V_RCP_F32(0x00000000) ⇒ 0x7f800000 // rcp(+0.0) = +INF
V_RCP_F32(0x7f800000) ⇒ 0x00000000 // rcp(+INF) = +0
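
One Newton-Raphson step for a reciprocal can be written with two multiply-add operations, as the note above suggests. The sketch below shows only the shape of the iteration. It is an assumption-laden illustration: the helper names are hypothetical, the arithmetic is Python double precision rather than FP32, and `a * b + c` stands in for a true fused multiply-add.

```python
def fma_approx(a: float, b: float, c: float) -> float:
    # Stand-in for a fused multiply-add; Python rounds twice here, hardware FMA rounds once.
    return a * b + c


def refine_reciprocal(x: float, r0: float) -> float:
    """One Newton-Raphson step: r1 = r0 + r0 * (1 - x * r0)."""
    e = fma_approx(-x, r0, 1.0)   # residual e = 1 - x*r0
    return fma_approx(r0, e, r0)  # corrected estimate
```

Starting from an estimate of 1/3 with a relative error of about 1e-7, one step reduces the residual `x * r - 1` by many orders of magnitude.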

V_RCP_IFLAG_F32

427

Compute reciprocal as part of integer divide.
D0.f = 1.0F / S0.f;
// Can only raise integer DIV_BY_ZERO exception

Notes
Can raise integer DIV_BY_ZERO exception but cannot raise floating-point exceptions. To be used in an integer
reciprocal macro by the compiler with one of the sequences listed below (depending on signed or unsigned
operation).
Unsigned usage:
CVT_F32_U32
RCP_IFLAG_F32
MUL_F32 (2**32 - 1)
CVT_U32_F32
Signed usage:
CVT_F32_I32
RCP_IFLAG_F32
MUL_F32 (2**31 - 1)
CVT_I32_F32

V_RSQ_F32

430

Reciprocal square root with IEEE rules.
D0.f = 1.0F / sqrt(S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_RSQ_F32(0xff800000) ⇒ 0xffc00000 // rsq(-INF) = NAN
V_RSQ_F32(0x80000000) ⇒ 0xff800000 // rsq(-0.0) = -INF
V_RSQ_F32(0x00000000) ⇒ 0x7f800000 // rsq(+0.0) = +INF
V_RSQ_F32(0x40800000) ⇒ 0x3f000000 // rsq(+4.0) = +0.5
V_RSQ_F32(0x7f800000) ⇒ 0x00000000 // rsq(+INF) = +0

V_RCP_F64

431

Reciprocal with IEEE rules.
D0.f64 = 1.0 / S0.f64

Notes
This opcode has (2**29)ULP accuracy and supports denormals.

V_RSQ_F64

433

Reciprocal square root with IEEE rules.
D0.f64 = 1.0 / sqrt(S0.f64)

Notes
This opcode has (2**29)ULP accuracy and supports denormals.

V_SQRT_F32

435

Square root.
D0.f = sqrt(S0.f)

Notes
1ULP accuracy, denormals are flushed.
Functional examples:
V_SQRT_F32(0xff800000) ⇒ 0xffc00000 // sqrt(-INF) = NAN
V_SQRT_F32(0x80000000) ⇒ 0x80000000 // sqrt(-0.0) = -0
V_SQRT_F32(0x00000000) ⇒ 0x00000000 // sqrt(+0.0) = +0
V_SQRT_F32(0x40800000) ⇒ 0x40000000 // sqrt(+4.0) = +2.0
V_SQRT_F32(0x7f800000) ⇒ 0x7f800000 // sqrt(+INF) = +INF

V_SQRT_F64

436

Square root.
D0.f64 = sqrt(S0.f64)

Notes
This opcode has (2**29)ULP accuracy and supports denormals.

V_SIN_F32

437

Trigonometric sine.
D0.f = 32'F(sin(64'F(S0.f) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_SIN_F32(0xff800000) ⇒ 0xffc00000 // sin(-INF) = NAN
V_SIN_F32(0xff7fffff) ⇒ 0x00000000 // -MaxFloat, finite
V_SIN_F32(0x80000000) ⇒ 0x80000000 // sin(-0.0) = -0
V_SIN_F32(0x3e800000) ⇒ 0x3f800000 // sin(0.25) = 1
V_SIN_F32(0x7f800000) ⇒ 0xffc00000 // sin(+INF) = NAN
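
As the pseudocode shows, the input is multiplied by 2*PI before the sine is taken, so the operand is expressed in revolutions rather than radians. A minimal Python sketch (the function name is hypothetical and the arithmetic is double precision):

```python
import math


def v_sin(x_revolutions: float) -> float:
    """Model of V_SIN_F32 for finite inputs: the argument is in units of full turns."""
    return math.sin(x_revolutions * 2.0 * math.pi)
```

An angle held in radians would therefore be divided by 2*PI before being passed in, for example `v_sin(angle / (2.0 * math.pi))`.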

V_COS_F32

438

Trigonometric cosine.
D0.f = 32'F(cos(64'F(S0.f) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_COS_F32(0xff800000) ⇒ 0xffc00000 // cos(-INF) = NAN
V_COS_F32(0xff7fffff) ⇒ 0x3f800000 // -MaxFloat, finite
V_COS_F32(0x80000000) ⇒ 0x3f800000 // cos(-0.0) = 1
V_COS_F32(0x3e800000) ⇒ 0x00000000 // cos(0.25) = 0
V_COS_F32(0x7f800000) ⇒ 0xffc00000 // cos(+INF) = NAN

V_NOT_B32

439

Bitwise negation.
D0.u = ~S0.u

Notes
Input and output modifiers not supported.

V_BFREV_B32

440

Bitfield reverse.
D0.u[31 : 0] = S0.u[0 : 31]

Notes
Input and output modifiers not supported.
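
A bitfield reverse maps bit i of the source to bit 31 - i of the result. The following Python sketch (hypothetical function name) does this through a 32-character binary string:

```python
def bfrev_b32(s0: int) -> int:
    """Model of V_BFREV_B32: reverse the order of the 32 bits in s0."""
    s0 &= 0xFFFFFFFF
    # Format as exactly 32 binary digits, reverse the string, parse it back.
    return int(f"{s0:032b}"[::-1], 2)
```

Applying the function twice returns the original value.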

V_CLZ_I32_U32

441

Count leading zeros.
Counts how many zeros before the first one starting from the MSB. Returns -1 if there are no ones.
D0.i = -1;
// Set if no ones are found
for i in 0 : 31 do
// Search from MSB
if S0.u[31 - i] == 1'1U then
D0.i = i;
break
endif
endfor

Notes
Compare with S_CLZ_I32_U32, which performs the equivalent operation in the scalar ALU.
Functional examples:
V_CLZ_I32_U32(0x00000000) ⇒ 0xffffffff
V_CLZ_I32_U32(0x800000ff) ⇒ 0
V_CLZ_I32_U32(0x100000ff) ⇒ 3
V_CLZ_I32_U32(0x0000ffff) ⇒ 16
V_CLZ_I32_U32(0x00000001) ⇒ 31
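
The functional examples above can be reproduced with a short Python model. The function name is hypothetical, and the result is returned as a signed Python integer, so -1 corresponds to the bit pattern 0xffffffff.

```python
def clz_i32_u32(s0: int) -> int:
    """Model of V_CLZ_I32_U32: leading zero count, or -1 when no bit is set."""
    s0 &= 0xFFFFFFFF
    if s0 == 0:
        return -1
    # bit_length() is the index of the highest set bit plus one.
    return 32 - s0.bit_length()
```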

V_CTZ_I32_B32

442

Count trailing zeros.
Returns the bit position of the first one from the LSB, or -1 if there are no ones.
D0.i = -1;
// Set if no ones are found
for i in 0 : 31 do
// Search from LSB
if S0.u[i] == 1'1U then
D0.i = i;
break
endif
endfor

Notes
Compare with S_CTZ_I32_B32, which performs the equivalent operation in the scalar ALU.
Functional examples:
V_CTZ_I32_B32(0x00000000) ⇒ 0xffffffff
V_CTZ_I32_B32(0xff000001) ⇒ 0
V_CTZ_I32_B32(0xff000008) ⇒ 3
V_CTZ_I32_B32(0xffff0000) ⇒ 16
V_CTZ_I32_B32(0x80000000) ⇒ 31
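
A matching Python model for the trailing-zero count, again with a hypothetical function name and a signed return value:

```python
def ctz_i32_b32(s0: int) -> int:
    """Model of V_CTZ_I32_B32: index of the lowest set bit, or -1 when no bit is set."""
    s0 &= 0xFFFFFFFF
    if s0 == 0:
        return -1
    # (s0 & -s0) keeps only the lowest set bit.
    return (s0 & -s0).bit_length() - 1
```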

V_CLS_I32

443

Count leading sign bits.
Counts how many bits in a row (from MSB to LSB) are the same as the sign bit. Returns -1 if all bits are the
same.
D0.i = -1;
// Set if all bits are the same
for i in 1 : 31 do
// Search from MSB
if S0.i[31 - i] != S0.i[31] then
D0.i = i;
break
endif
endfor

Notes
Compare with S_CLS_I32, which performs the equivalent operation in the scalar ALU.
Functional examples:
V_CLS_I32(0x00000000) ⇒ 0xffffffff
V_CLS_I32(0x40000000) ⇒ 1
V_CLS_I32(0x80000000) ⇒ 1
V_CLS_I32(0x0fffffff) ⇒ 4
V_CLS_I32(0xffff0000) ⇒ 16
V_CLS_I32(0xfffffffe) ⇒ 31
V_CLS_I32(0xffffffff) ⇒ 0xffffffff
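
The loop in the pseudocode starts at i = 1, so the sign bit itself is counted. A Python transcription (hypothetical function name) that reproduces the functional examples:

```python
def cls_i32(s0: int) -> int:
    """Model of V_CLS_I32: position of the first bit that differs from the sign bit."""
    s0 &= 0xFFFFFFFF
    sign = s0 >> 31
    for i in range(1, 32):               # search from the bit below the MSB
        if ((s0 >> (31 - i)) & 1) != sign:
            return i
    return -1                            # all 32 bits are identical
```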

V_FREXP_EXP_I32_F64

444

Returns exponent of double precision float input.
This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). See also V_FREXP_MANT_F64,
which returns the significand. See the C library function frexp() for more information.
if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then
D0.i = 0
else
D0.i = exponent(S0.f64) - 1023 + 1
endif

V_FREXP_MANT_F64

445

Returns binary significand of double precision float input.
This operation satisfies the invariant S0.f64 = significand * (2 ** exponent). The result lies in (-1.0, -0.5] or
[0.5, 1.0) in normal cases. See also V_FREXP_EXP_I32_F64, which returns the integer exponent. See the C library
function frexp() for more information.
if ((S0.f64 == +INF) || (S0.f64 == -INF) || isNAN(S0.f64)) then
D0.f64 = S0.f64
else
D0.f64 = mantissa(S0.f64)
endif
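
For normal inputs the exponent and significand pair matches what the C library frexp() returns, which Python exposes as `math.frexp`. The sketch below combines both opcodes into one hypothetical helper and applies the special-case rules from the pseudocode for infinities and NaN. Zero and denormal inputs are outside what this sketch claims to model.

```python
import math


def frexp_f64(x: float):
    """Return (significand, exponent) as V_FREXP_MANT_F64 / V_FREXP_EXP_I32_F64 would."""
    if math.isinf(x) or math.isnan(x):
        return x, 0  # significand passes the input through, exponent is 0
    return math.frexp(x)
```

For example, 8.0 decomposes into 0.5 * 2**4, and `math.ldexp` reassembles the original value.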

V_FRACT_F64

446

Return fractional portion of a number.
D0.f64 = S0.f64 + -floor(S0.f64)

Notes
0.5ULP accuracy, denormals are accepted.
This complies with the DX specification of fract where the function behaves like an extension of integer
modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8
in DX.
Obey round mode, result clamped to 0x3fefffffffffffff.

V_FREXP_EXP_I32_F32

447

Returns exponent of single precision float input.
This operation satisfies the invariant S0.f = significand * (2 ** exponent). See also V_FREXP_MANT_F32, which
returns the significand. See the C library function frexp() for more information.
if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then
D0.i = 0
else
D0.i = exponent(S0.f) - 127 + 1
endif

V_FREXP_MANT_F32

448

Returns binary significand of single precision float input.
This operation satisfies the invariant S0.f = significand * (2 ** exponent). The result lies in (-1.0, -0.5] or
[0.5, 1.0) in normal cases. See also V_FREXP_EXP_I32_F32, which returns the integer exponent. See the C library
function frexp() for more information.
if ((64'F(S0.f) == +INF) || (64'F(S0.f) == -INF) || isNAN(64'F(S0.f))) then
D0.f = S0.f
else
D0.f = mantissa(S0.f)
endif

V_MOVRELD_B32

450

Move to a relative destination address.
addr = DST.u;
// Raw value from instruction
addr += M0.u[31 : 0];
VGPR[laneId][addr].b = S0.b

Notes
Example: The following instruction sequence performs the move v15 <= v7:
s_mov_b32 m0, 10
v_movreld_b32 v5, v7

V_MOVRELS_B32

451

Move from a relative source address.
addr = SRC0.u;
// Raw value from instruction
addr += M0.u[31 : 0];
D0.b = VGPR[laneId][addr].b

Notes
Example: The following instruction sequence performs the move v5 <= v17:
s_mov_b32 m0, 10
v_movrels_b32 v5, v7

V_MOVRELSD_B32

452

Move from a relative source address to a relative destination address.
addrs = SRC0.u;
// Raw value from instruction
addrd = DST.u;
// Raw value from instruction
addrs += M0.u[31 : 0];
addrd += M0.u[31 : 0];
VGPR[laneId][addrd].b = VGPR[laneId][addrs].b

Notes
Example: The following instruction sequence performs the move v15 <= v17:
s_mov_b32 m0, 10
v_movrelsd_b32 v5, v7

V_MOVRELSD_2_B32

456

Move from a relative source address to a relative destination address, with different relative offsets.
addrs = SRC0.u;
// Raw value from instruction
addrd = DST.u;
// Raw value from instruction
addrs += M0.u[9 : 0].u;
addrd += M0.u[25 : 16].u;
VGPR[laneId][addrd].b = VGPR[laneId][addrs].b

Notes
Example: The following instruction sequence performs the move v25 <= v17:
s_mov_b32 m0, ((20 << 16) | 10)
v_movrelsd_2_b32 v5, v7
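
The two offsets packed into M0 can be unpacked as shown in this Python sketch. The function name is hypothetical and register numbers are plain integers.

```python
def movrelsd_2_targets(dst: int, src0: int, m0: int):
    """Return (destination VGPR index, source VGPR index) for V_MOVRELSD_2_B32."""
    src_offset = m0 & 0x3FF          # M0[9:0]
    dst_offset = (m0 >> 16) & 0x3FF  # M0[25:16]
    return dst + dst_offset, src0 + src_offset
```

With M0 = ((20 << 16) | 10), DST = 5 and SRC0 = 7, the helper returns (25, 17), matching the move v25 <= v17 in the example above.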

V_CVT_F16_U16

464

Convert from an unsigned short to an FP16 float.
D0.f16 = u16_to_f16(S0.u16)

Notes
0.5ULP accuracy, supports denormals, rounding, exception flags and saturation.

V_CVT_F16_I16

465

Convert from a signed short to an FP16 float.
D0.f16 = i16_to_f16(S0.i16)

Notes
0.5ULP accuracy, supports denormals, rounding, exception flags and saturation.

V_CVT_U16_F16

466

Convert from an FP16 float to an unsigned short.
D0.u16 = f16_to_u16(S0.f16)

Notes
1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion
is done with truncation.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_CVT_I16_F16

467

Convert from an FP16 float to a signed short.
D0.i16 = f16_to_i16(S0.f16)

Notes
1ULP accuracy, supports rounding, exception flags and saturation. FP16 denormals are accepted. Conversion
is done with truncation.
Generation of the INEXACT exception is controlled by the CLAMP bit. INEXACT exceptions are enabled for this
conversion iff CLAMP == 1.

V_RCP_F16

468

Reciprocal with IEEE rules.
D0.f16 = 16'1.0 / S0.f16

Notes
0.51ULP accuracy.
Functional examples:
V_RCP_F16(0xfc00) ⇒ 0x8000 // rcp(-INF) = -0
V_RCP_F16(0xc000) ⇒ 0xb800 // rcp(-2.0) = -0.5
V_RCP_F16(0x8000) ⇒ 0xfc00 // rcp(-0.0) = -INF
V_RCP_F16(0x0000) ⇒ 0x7c00 // rcp(+0.0) = +INF
V_RCP_F16(0x7c00) ⇒ 0x0000 // rcp(+INF) = +0

V_SQRT_F16

469

Square root.
D0.f16 = sqrt(S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_SQRT_F16(0xfc00) ⇒ 0xfe00 // sqrt(-INF) = NAN
V_SQRT_F16(0x8000) ⇒ 0x8000 // sqrt(-0.0) = -0
V_SQRT_F16(0x0000) ⇒ 0x0000 // sqrt(+0.0) = +0
V_SQRT_F16(0x4400) ⇒ 0x4000 // sqrt(+4.0) = +2.0
V_SQRT_F16(0x7c00) ⇒ 0x7c00 // sqrt(+INF) = +INF

V_RSQ_F16

470

Reciprocal square root with IEEE rules.
D0.f16 = 16'1.0 / sqrt(S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_RSQ_F16(0xfc00) ⇒ 0xfe00 // rsq(-INF) = NAN
V_RSQ_F16(0x8000) ⇒ 0xfc00 // rsq(-0.0) = -INF
V_RSQ_F16(0x0000) ⇒ 0x7c00 // rsq(+0.0) = +INF
V_RSQ_F16(0x4400) ⇒ 0x3800 // rsq(+4.0) = +0.5
V_RSQ_F16(0x7c00) ⇒ 0x0000 // rsq(+INF) = +0

V_LOG_F16

471

Base 2 logarithm.
D0.f16 = log2(S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_LOG_F16(0xfc00) ⇒ 0xfe00 // log(-INF) = NAN
V_LOG_F16(0xbc00) ⇒ 0xfe00 // log(-1.0) = NAN
V_LOG_F16(0x8000) ⇒ 0xfc00 // log(-0.0) = -INF
V_LOG_F16(0x0000) ⇒ 0xfc00 // log(+0.0) = -INF
V_LOG_F16(0x3c00) ⇒ 0x0000 // log(+1.0) = 0
V_LOG_F16(0x7c00) ⇒ 0x7c00 // log(+INF) = +INF

V_EXP_F16

472

Base 2 exponentiation.
D0.f16 = pow(16'2.0, S0.f16)

Notes
0.51ULP accuracy, denormals are supported.
Functional examples:
V_EXP_F16(0xfc00) ⇒ 0x0000 // exp(-INF) = 0
V_EXP_F16(0x8000) ⇒ 0x3c00 // exp(-0.0) = 1
V_EXP_F16(0x7c00) ⇒ 0x7c00 // exp(+INF) = +INF

V_FREXP_MANT_F16

473

Returns binary significand of half precision float input.
This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). The result lies in (-1.0, -0.5] or
[0.5, 1.0) in normal cases. See also V_FREXP_EXP_I16_F16, which returns the integer exponent. See the C library
function frexp() for more information.
if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then
D0.f16 = S0.f16
else
D0.f16 = mantissa(S0.f16)
endif

V_FREXP_EXP_I16_F16

474

Returns exponent of half precision float input.
This operation satisfies the invariant S0.f16 = significand * (2 ** exponent). See also V_FREXP_MANT_F16,
which returns the significand. See the C library function frexp() for more information.
if ((64'F(S0.f16) == +INF) || (64'F(S0.f16) == -INF) || isNAN(64'F(S0.f16))) then
D0.i = 0
else
D0.i = exponent(S0.f16) - 15 + 1
endif

V_FLOOR_F16

475

Round down to previous whole integer.
D0.f16 = trunc(S0.f16);
if ((S0.f16 < 16'0.0) && (S0.f16 != D0.f16)) then
D0.f16 += -16'1.0
endif

V_CEIL_F16

476

Round up to next whole integer.
D0.f16 = trunc(S0.f16);
if ((S0.f16 > 16'0.0) && (S0.f16 != D0.f16)) then
D0.f16 += 16'1.0
endif

V_TRUNC_F16

477

Return integer part of a number with round-to-zero semantics.
D0.f16 = trunc(S0.f16)

V_RNDNE_F16

478

Round-to-nearest-even semantics.
D0.f16 = floor(S0.f16 + 16'0.5);
if (isEven(64'F(floor(S0.f16))) && (fract(S0.f16) == 16'0.5)) then
D0.f16 -= 16'1.0
endif

V_FRACT_F16

479

Return fractional portion of a number.
D0.f16 = S0.f16 + -floor(S0.f16)

Notes
0.5ULP accuracy, denormals are accepted.
This complies with the DX specification of fract where the function behaves like an extension of integer
modulus; be aware this may differ from how fract() is defined in other domains. For example: fract(-1.2) = 0.8
in DX.

V_SIN_F16

480

Trigonometric sine.
D0.f16 = 16'F(sin(64'F(S0.f16) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_SIN_F16(0xfc00) ⇒ 0xfe00 // sin(-INF) = NAN
V_SIN_F16(0xfbff) ⇒ 0x0000 // Most negative finite FP16
V_SIN_F16(0x8000) ⇒ 0x8000 // sin(-0.0) = -0
V_SIN_F16(0x3400) ⇒ 0x3c00 // sin(0.25) = 1
V_SIN_F16(0x7bff) ⇒ 0x0000 // Most positive finite FP16
V_SIN_F16(0x7c00) ⇒ 0xfe00 // sin(+INF) = NAN

V_COS_F16

481

Trigonometric cosine.
D0.f16 = 16'F(cos(64'F(S0.f16) * 2.0 * PI))

Notes
Denormals are supported. Full range input is supported.
Functional examples:
V_COS_F16(0xfc00) ⇒ 0xfe00 // cos(-INF) = NAN
V_COS_F16(0xfbff) ⇒ 0x3c00 // Most negative finite FP16
V_COS_F16(0x8000) ⇒ 0x3c00 // cos(-0.0) = 1
V_COS_F16(0x3400) ⇒ 0x0000 // cos(0.25) = 0
V_COS_F16(0x7bff) ⇒ 0x3c00 // Most positive finite FP16
V_COS_F16(0x7c00) ⇒ 0xfe00 // cos(+INF) = NAN

V_SAT_PK_U8_I16

482

Saturate each of two packed signed 16-bit integers to the unsigned 8-bit range [0, 255] and pack the two results. Used for 4x16bit data packed as 4x8bit data.
SAT8 = lambda(n) (
if n.i <= 0 then
return 8'0U
elsif n >= 16'I(0xff) then
return 8'255U
else
return n[7 : 0].u8
endif);
D0.b16 = { SAT8(S0[31 : 16].i16), SAT8(S0[15 : 0].i16) }
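
The saturation and packing can be modelled in Python as follows. All names are hypothetical; S0 is a 32-bit integer holding two signed 16-bit fields, and the result is the 16-bit value written to D0.

```python
def sat8(n: int) -> int:
    """Clamp a signed value to the unsigned 8-bit range [0, 255]."""
    return max(0, min(255, n))


def to_i16(bits: int) -> int:
    """Interpret the low 16 bits as a two's-complement signed integer."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits


def sat_pk_u8_i16(s0: int) -> int:
    hi = sat8(to_i16(s0 >> 16))  # S0[31:16]
    lo = sat8(to_i16(s0))        # S0[15:0]
    return (hi << 8) | lo
```

For S0 = 0x01000010 the high field (256) saturates to 255 and the low field (16) is unchanged, giving 0xFF10.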

V_CVT_NORM_I16_F16

483

Convert from an FP16 float to a signed normalized short.
D0.i16 = f16_to_snorm(S0.f16)

Notes
0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported.

V_CVT_NORM_U16_F16

484

Convert from an FP16 float to an unsigned normalized short.
D0.u16 = f16_to_unorm(S0.f16)

Notes
0.5ULP accuracy, supports rounding, exception flags and saturation, denormals are supported.

V_NOT_B16

489

Bitwise negation.
D0.u16 = ~S0.u16

Notes
Input and output modifiers not supported.

V_CVT_I32_I16

490

Convert from a 16-bit signed integer to a 32-bit signed integer, sign extending as needed.
D0.i = 32'I(signext(S0.i16))

Notes
To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16.

V_CVT_U32_U16

491

Convert from a 16-bit unsigned integer to a 32-bit unsigned integer, zero extending as needed.
D0 = { 16'0, S0.u16 }

Notes
To convert in the other direction (from 32-bit to 16-bit integer) use V_MOV_B16.

V_CNDMASK_B32

257

Conditional mask on each thread.
D0.u = VCC.u64[laneId] ? S1.u : S0.u

Notes
In VOP3 the VCC source may be a scalar GPR specified in S2.u.
Floating-point modifiers are valid for this instruction if S0.u and S1.u are 32-bit floating point values. This
instruction is suitable for negating or taking the absolute value of a floating-point value.

V_ADD_F32

259

Add two single-precision values.
D0.f = S0.f + S1.f

Notes
0.5ULP precision, denormals are supported.

V_SUB_F32

260

Subtract the second single-precision input from the first input.
D0.f = S0.f - S1.f

V_SUBREV_F32

261

Subtract the first single-precision input from the second input.
D0.f = S1.f - S0.f

V_FMAC_DX9_ZERO_F32

262

Multiply two single-precision values and accumulate the result with the destination. Follows DX9 rules where
0.0 times anything produces 0.0 (this is not IEEE compliant).
if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then
// DX9 rules, 0.0 * x = 0.0
D0.f = S2.f
else
D0.f = fma(S0.f, S1.f, D0.f)
endif

V_MUL_DX9_ZERO_F32

263

Multiply two single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not IEEE
compliant).
if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then
// DX9 rules, 0.0 * x = 0.0
D0.f = 0.0F
else
D0.f = S0.f * S1.f
endif
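
The practical effect of the DX9 rule shows up when a zero is multiplied by infinity or NaN. A Python sketch in double precision (hypothetical function name):

```python
import math


def mul_dx9_zero(s0: float, s1: float) -> float:
    """Model of V_MUL_DX9_ZERO_F32: a zero operand forces a 0.0 result."""
    if s0 == 0.0 or s1 == 0.0:
        return 0.0
    return s0 * s1
```

An IEEE multiply gives NaN for `0.0 * math.inf`, whereas `mul_dx9_zero(0.0, math.inf)` returns 0.0.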

V_MUL_F32

264

Multiply two single-precision values.
D0.f = S0.f * S1.f

Notes
0.5ULP precision, denormals are supported.

V_MUL_I32_I24

265

Multiply two signed 24-bit integers and store the result as a signed 32-bit integer.
D0.i = 32'I(S0.i24) * 32'I(S1.i24)

Notes
This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision
floating point multiplier. See also V_MUL_HI_I32_I24.

V_MUL_HI_I32_I24

266

Multiply two signed 24-bit integers and store the high 32 bits of the result as a signed 32-bit integer.
D0.i = 32'I(64'I(S0.i24) * 64'I(S1.i24) >> 32U)

Notes
See also V_MUL_I32_I24.
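
Together, V_MUL_I32_I24 and V_MUL_HI_I32_I24 yield the low and high 32 bits of the product of two signed 24-bit values. The Python sketch below uses hypothetical function names and returns raw 32-bit patterns.

```python
def sext24(x: int) -> int:
    """Sign-extend the low 24 bits of x."""
    x &= 0xFFFFFF
    return x - 0x1000000 if x & 0x800000 else x


def mul_i32_i24(s0: int, s1: int) -> int:
    return (sext24(s0) * sext24(s1)) & 0xFFFFFFFF          # low 32 bits


def mul_hi_i32_i24(s0: int, s1: int) -> int:
    # Python's >> on a negative int is arithmetic, matching a signed 64-bit shift.
    return ((sext24(s0) * sext24(s1)) >> 32) & 0xFFFFFFFF  # high 32 bits
```

For the largest positive operands, 0x7FFFFF * 0x7FFFFF = 0x3FFFFF000001, so the high word is 0x3FFF and the low word is 0xFF000001.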

V_MUL_U32_U24

267

Multiply two unsigned 24-bit integers and store the result as an unsigned 32-bit integer.
D0.u = 32'U(S0.u24) * 32'U(S1.u24)

Notes
This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision
floating point multiplier. See also V_MUL_HI_U32_U24.

V_MUL_HI_U32_U24

268

Multiply two unsigned 24-bit integers and store the high 32 bits of the result as an unsigned 32-bit integer.
D0.u = 32'U(64'U(S0.u24) * 64'U(S1.u24) >> 32U)

Notes
See also V_MUL_U32_U24.

V_MIN_F32

271

Compute the minimum of two floats.
LT_NEG_ZERO = lambda(a, b) (
((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b))));
// Version of comparison where -0.0 < +0.0, differs from IEEE
if WAVE_MODE.IEEE then
if isSignalNAN(64'F(S0.f)) then
D0.f = 32'F(cvtToQuietNAN(64'F(S0.f)))
elsif isSignalNAN(64'F(S1.f)) then
D0.f = 32'F(cvtToQuietNAN(64'F(S1.f)))
elsif isQuietNAN(64'F(S1.f)) then
D0.f = S0.f
elsif isQuietNAN(64'F(S0.f)) then
D0.f = S1.f
elsif LT_NEG_ZERO(S0.f, S1.f) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f = S0.f
else
D0.f = S1.f
endif
else
if isNAN(64'F(S1.f)) then
D0.f = S0.f
elsif isNAN(64'F(S0.f)) then
D0.f = S1.f
elsif LT_NEG_ZERO(S0.f, S1.f) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f = S0.f
else
D0.f = S1.f
endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.
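
The non-IEEE branch of the pseudocode (WAVE_MODE.IEEE clear) can be transcribed into Python to show the signed-zero ordering and the NaN handling. This is a sketch in double precision with hypothetical function names; signalling-NaN quieting and denormal modes are not modelled.

```python
import math


def lt_neg_zero(a: float, b: float) -> bool:
    """Less-than in which -0.0 is ordered before +0.0."""
    both_zero = a == 0.0 and b == 0.0
    a_negative = math.copysign(1.0, a) < 0
    b_negative = math.copysign(1.0, b) < 0
    return a < b or (both_zero and a_negative and not b_negative)


def v_min_f32_non_ieee(s0: float, s1: float) -> float:
    if math.isnan(s1):
        return s0
    if math.isnan(s0):
        return s1
    return s0 if lt_neg_zero(s0, s1) else s1
```

The minimum of +0.0 and -0.0 is -0.0 regardless of operand order, and a single NaN operand is ignored in favour of the other input.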

V_MAX_F32

272

Compute the maximum of two floats.
GT_NEG_ZERO = lambda(a, b) (
((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b))));
// Version of comparison where +0.0 > -0.0, differs from IEEE
if WAVE_MODE.IEEE then
if isSignalNAN(64'F(S0.f)) then
D0.f = 32'F(cvtToQuietNAN(64'F(S0.f)))
elsif isSignalNAN(64'F(S1.f)) then
D0.f = 32'F(cvtToQuietNAN(64'F(S1.f)))
elsif isQuietNAN(64'F(S1.f)) then
D0.f = S0.f
elsif isQuietNAN(64'F(S0.f)) then
D0.f = S1.f
elsif GT_NEG_ZERO(S0.f, S1.f) then
// NOTE: +0>-0 is TRUE in this comparison
D0.f = S0.f
else
D0.f = S1.f
endif
else
if isNAN(64'F(S1.f)) then
D0.f = S0.f
elsif isNAN(64'F(S0.f)) then
D0.f = S1.f
elsif GT_NEG_ZERO(S0.f, S1.f) then
// NOTE: +0>-0 is TRUE in this comparison
D0.f = S0.f
else
D0.f = S1.f
endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.

V_MIN_I32

273

Compute the minimum of two signed integers.
D0.i = S0.i < S1.i ? S0.i : S1.i

V_MAX_I32

274

Compute the maximum of two signed integers.
D0.i = S0.i >= S1.i ? S0.i : S1.i

V_MIN_U32

275

Compute the minimum of two unsigned integers.
D0.u = S0.u < S1.u ? S0.u : S1.u

V_MAX_U32

276

Compute the maximum of two unsigned integers.
D0.u = S0.u >= S1.u ? S0.u : S1.u

V_LSHLREV_B32

280

Logical shift left with shift count in the first operand.
D0.u = S1.u << S0[4 : 0].u

V_LSHRREV_B32

281

Logical shift right with shift count in the first operand.
D0.u = S1.u >> S0[4 : 0].u

V_ASHRREV_I32

282

Arithmetic shift right (preserve sign bit) with shift count in the first operand.
D0.i = S1.i >> S0[4 : 0].u
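
In the three "REV" shifts the shift count is the first operand and only its low five bits are used. A Python sketch with hypothetical function names, returning raw 32-bit patterns:

```python
MASK32 = 0xFFFFFFFF


def lshlrev_b32(s0: int, s1: int) -> int:
    return ((s1 & MASK32) << (s0 & 31)) & MASK32


def lshrrev_b32(s0: int, s1: int) -> int:
    return (s1 & MASK32) >> (s0 & 31)


def ashrrev_i32(s0: int, s1: int) -> int:
    value = s1 & MASK32
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> (s0 & 31)) & MASK32  # arithmetic shift keeps the sign bit
```

Because of the five-bit mask, a shift count of 36 behaves like a count of 4, and a count of 32 leaves the value unchanged.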

V_AND_B32

283

Bitwise AND.
D0.u = (S0.u & S1.u)

Notes
Input and output modifiers not supported.

V_OR_B32

284

Bitwise OR.
D0.u = (S0.u | S1.u)

Notes
Input and output modifiers not supported.

V_XOR_B32

285

Bitwise XOR.
D0.u = (S0.u ^ S1.u)

Notes
Input and output modifiers not supported.

V_XNOR_B32

286

Bitwise XNOR.
D0.u = ~(S0.u ^ S1.u)

Notes
Input and output modifiers not supported.

V_ADD_CO_CI_U32

288

Add two unsigned integers and a carry-in from VCC. Store the result and also save the carry-out to VCC.
tmp = 64'U(S0.u) + 64'U(S1.u) + VCC.u64[laneId].u64;
VCC.u64[laneId] = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at
S2.u.
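
A typical use of the carry-in and carry-out is to chain 32-bit additions into a wider one. The Python sketch below models a single lane; the function names are hypothetical and the carry is passed as a plain integer rather than a VCC bit.

```python
MASK32 = 0xFFFFFFFF


def add_co_ci_u32(s0: int, s1: int, carry_in: int):
    """Return (32-bit result, carry-out) for one lane of V_ADD_CO_CI_U32."""
    tmp = (s0 & MASK32) + (s1 & MASK32) + (carry_in & 1)
    return tmp & MASK32, 1 if tmp >= 0x100000000 else 0


def add_u64(a: int, b: int):
    """64-bit add built from two chained 32-bit adds."""
    lo, carry = add_co_ci_u32(a, b, 0)
    hi, carry_out = add_co_ci_u32(a >> 32, b >> 32, carry)
    return (hi << 32) | lo, carry_out
```

Adding 1 to 0x00000001FFFFFFFF carries out of the low half and produces 0x0000000200000000.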

V_SUB_CO_CI_U32

289

Subtract the second unsigned integer from the first unsigned integer and then subtract a carry-in from VCC.
Store the result and also save the carry-out to VCC.
tmp = S0.u - S1.u - VCC.u64[laneId].u;
VCC.u64[laneId] = 64'U(S1.u) + VCC.u64[laneId].u64 > 64'U(S0.u) ? 1'1U : 1'0U;
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at
S2.u.

V_SUBREV_CO_CI_U32

290

Subtract the first unsigned integer from the second unsigned integer and then subtract a carry-in from VCC.
Store the result and also save the carry-out to VCC.
tmp = S1.u - S0.u - VCC.u64[laneId].u;
VCC.u64[laneId] = 64'U(S0.u) + VCC.u64[laneId].u64 > 64'U(S1.u) ? 1'1U : 1'0U;
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair, and the VCC source comes from the SGPR-pair at
S2.u.

V_ADD_NC_U32

293

Add two unsigned integers. No carry-in or carry-out.
D0.u = S0.u + S1.u

V_SUB_NC_U32

294

Subtract the second unsigned integer from the first unsigned integer. No carry-in or carry-out.
D0.u = S0.u - S1.u

V_SUBREV_NC_U32

295

Subtract the first unsigned integer from the second unsigned integer. No carry-in or carry-out.
D0.u = S1.u - S0.u

V_FMAC_F32

299

Fused multiply-add of single-precision floats, accumulate with destination.
D0.f = fma(S0.f, S1.f, D0.f)

V_CVT_PK_RTZ_F16_F32

303

Convert two single-precision floats into a packed FP16 result and round to zero (ignore the current rounding
mode).
D0[15 : 0].f16 = f32_to_f16(S0.f);
D0[31 : 16].f16 = f32_to_f16(S1.f);
// Round-toward-zero regardless of current round mode setting in hardware.

Notes
This opcode is intended for use with 16-bit compressed exports. See V_CVT_F16_F32 for a version that respects
the current rounding mode.

V_ADD_F16

306

Add two FP16 values.
D0.f16 = S0.f16 + S1.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_SUB_F16

307

Subtract the second FP16 value from the first.
D0.f16 = S0.f16 - S1.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_SUBREV_F16

308

Subtract the first FP16 value from the second.
D0.f16 = S1.f16 - S0.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_MUL_F16

309

Multiply two FP16 values.
D0.f16 = S0.f16 * S1.f16

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_FMAC_F16

310

Fused multiply-add of FP16 values, accumulate with destination.
D0.f16 = fma(S0.f16, S1.f16, D0.f16)

Notes
0.5ULP precision. Supports denormals, round mode, exception flags and saturation.

V_MAX_F16

313

Compute the maximum of two floats.
GT_NEG_ZERO = lambda(a, b) (
((a > b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && !sign(a) && sign(b))));
// Version of comparison where +0.0 > -0.0, differs from IEEE
if WAVE_MODE.IEEE then
if isSignalNAN(64'F(S0.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16)))
elsif isSignalNAN(64'F(S1.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16)))
elsif isQuietNAN(64'F(S1.f16)) then
D0.f16 = S0.f16
elsif isQuietNAN(64'F(S0.f16)) then
D0.f16 = S1.f16
elsif GT_NEG_ZERO(S0.f16, S1.f16) then
// NOTE: +0>-0 is TRUE in this comparison
D0.f16 = S0.f16
else
D0.f16 = S1.f16
endif
else
if isNAN(64'F(S1.f16)) then
D0.f16 = S0.f16
elsif isNAN(64'F(S0.f16)) then
D0.f16 = S1.f16
elsif GT_NEG_ZERO(S0.f16, S1.f16) then
// NOTE: +0>-0 is TRUE in this comparison
D0.f16 = S0.f16
else
D0.f16 = S1.f16
endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.
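
The signed-zero and NaN rules in the listing above can be checked with a small model. The following Python sketch covers only the branch taken when WAVE_MODE.IEEE is clear, and it uses Python floats (double precision) rather than FP16; both simplifications, and the function names, are assumptions for illustration.

```python
import math

def gt_neg_zero(a, b):
    """a > b, except that +0.0 is treated as greater than -0.0."""
    both_zero = a == 0.0 and b == 0.0
    a_negative = math.copysign(1.0, a) < 0
    b_negative = math.copysign(1.0, b) < 0
    return a > b or (both_zero and not a_negative and b_negative)

def v_max_non_ieee(s0, s1):
    """Model of the WAVE_MODE.IEEE == 0 branch of the maximum."""
    if math.isnan(s1):
        return s0
    if math.isnan(s0):
        return s1
    return s0 if gt_neg_zero(s0, s1) else s1
```

In this model a single NaN input is dropped in favour of the other operand, and the maximum of +0.0 and -0.0 is +0.0 in either operand order.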

V_MIN_F16

314

Compute the minimum of two floats.
LT_NEG_ZERO = lambda(a, b) (
((a < b) || ((64'F(abs(a)) == 0.0) && (64'F(abs(b)) == 0.0) && sign(a) && !sign(b))));
// Version of comparison where -0.0 < +0.0, differs from IEEE
if WAVE_MODE.IEEE then
if isSignalNAN(64'F(S0.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S0.f16)))
elsif isSignalNAN(64'F(S1.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16)))
elsif isQuietNAN(64'F(S1.f16)) then
D0.f16 = S0.f16
elsif isQuietNAN(64'F(S0.f16)) then
D0.f16 = S1.f16
elsif LT_NEG_ZERO(S0.f16, S1.f16) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f16 = S0.f16
else
D0.f16 = S1.f16
endif
else
if isNAN(64'F(S1.f16)) then
D0.f16 = S0.f16
elsif isNAN(64'F(S0.f16)) then
D0.f16 = S1.f16
elsif LT_NEG_ZERO(S0.f16, S1.f16) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f16 = S0.f16
else
D0.f16 = S1.f16
endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.

V_LDEXP_F16

315

Load exponent.
Multiply an FP16 value by an integral power of 2 (compare with the ldexp() function in C). The second argument
is an integer value.
D0.f16 = S0.f16 * f32_to_f16(2.0F ** 32'I(S1.i16))

V_FMA_DX9_ZERO_F32

521

Multiply and add single-precision values. Follows DX9 rules where 0.0 times anything produces 0.0 (this is not
IEEE compliant).
if ((64'F(S0.f) == 0.0) || (64'F(S1.f) == 0.0)) then
// DX9 rules, 0.0 * x = 0.0
D0.f = S2.f
else
D0.f = fma(S0.f, S1.f, S2.f)
endif
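
The effect of the DX9 zero rule is easiest to see when a zero meets an infinity. The following Python sketch illustrates only that rule: it uses double-precision Python floats and an unfused multiply followed by an add, so rounding does not match the hardware, and the function names are assumptions.

```python
import math

def v_fma_dx9_zero(s0, s1, s2):
    """0.0 times anything yields 0.0, so the result is just the addend."""
    if s0 == 0.0 or s1 == 0.0:
        return s2
    return s0 * s1 + s2          # not fused in this model (two roundings)

def ieee_mul_add(s0, s1, s2):
    """Plain IEEE behaviour for comparison: 0.0 * inf is NaN."""
    return s0 * s1 + s2
```

With inputs (0.0, +INF, 1.0) the DX9 rule returns 1.0, while the plain IEEE multiply-add yields NaN.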

V_MAD_I32_I24

522

Multiply two signed 24-bit integers, add a signed 32-bit integer and store the result as a signed 32-bit integer.
D0.i = 32'I(S0.i24) * 32'I(S1.i24) + S2.i

Notes
This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision
floating point multiplier.

V_MAD_U32_U24

523

Multiply two unsigned 24-bit integers, add an unsigned 32-bit integer and store the result as an unsigned 32-bit
integer.
D0.u = 32'U(S0.u24) * 32'U(S1.u24) + S2.u

Notes
This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision
floating point multiplier.
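
Both 24-bit multiply-add opcodes read only the low 24 bits of each multiplicand. The following Python sketch shows the difference between the unsigned and signed interpretations; the helper sext24 and the function names are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF

def sext24(x):
    """Interpret the low 24 bits of x as a signed integer."""
    x &= 0xFFFFFF
    return x - (1 << 24) if x & 0x800000 else x

def v_mad_u32_u24(s0, s1, s2):
    """D0.u = S0.u24 * S1.u24 + S2.u, kept to 32 bits."""
    return ((s0 & 0xFFFFFF) * (s1 & 0xFFFFFF) + s2) & MASK32

def v_mad_i32_i24(s0, s1, s2):
    """D0.i = S0.i24 * S1.i24 + S2.i, returned as a 32-bit pattern."""
    return (sext24(s0) * sext24(s1) + s2) & MASK32
```

Bits 31:24 of the multiplicands do not affect the result, and 0x00FFFFFF is read as -1 by the signed form.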

V_CUBEID_F32

524

Cubemap Face ID determination. Result is a floating point face ID.
// Set D0.f = cubemap face ID ({0.0, 1.0, ..., 5.0}).
// XYZ coordinate is given in (S0.f, S1.f, S2.f).
// S0.f = x
// S1.f = y
// S2.f = z
if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then
if S2.f < 0.0F then
D0.f = 5.0F
else
D0.f = 4.0F
endif
elsif abs(S1.f) >= abs(S0.f) then
if S1.f < 0.0F then
D0.f = 3.0F
else
D0.f = 2.0F
endif
else
if S0.f < 0.0F then
D0.f = 1.0F
else
D0.f = 0.0F
endif
endif

V_CUBESC_F32

525

Cubemap S coordinate.
// D0.f = cubemap S coordinate.
// XYZ coordinate is given in (S0.f, S1.f, S2.f).
// S0.f = x
// S1.f = y
// S2.f = z
if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then
if S2.f < 0.0F then
D0.f = -S0.f
else
D0.f = S0.f
endif
elsif abs(S1.f) >= abs(S0.f) then
D0.f = S0.f
else
if S0.f < 0.0F then
D0.f = S2.f
else
D0.f = -S2.f
endif
endif

V_CUBETC_F32

526

Cubemap T coordinate.
// D0.f = cubemap T coordinate.
// XYZ coordinate is given in (S0.f, S1.f, S2.f).
// S0.f = x
// S1.f = y
// S2.f = z
if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then
D0.f = -S1.f
elsif abs(S1.f) >= abs(S0.f) then
if S1.f < 0.0F then
D0.f = -S2.f
else
D0.f = S2.f
endif
else
D0.f = -S1.f
endif

V_CUBEMA_F32

527

Determine cubemap major axis.
// D0.f = 2.0 * cubemap major axis.
// XYZ coordinate is given in (S0.f, S1.f, S2.f).
// S0.f = x
// S1.f = y
// S2.f = z
if ((abs(S2.f) >= abs(S0.f)) && (abs(S2.f) >= abs(S1.f))) then
D0.f = S2.f * 2.0F
elsif abs(S1.f) >= abs(S0.f) then
D0.f = S1.f * 2.0F
else
D0.f = S0.f * 2.0F
endif
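
The four cubemap opcodes share the same major-axis selection, so they can be modelled together. The following Python sketch combines the four listings into one function; the function name and the tuple return value are assumptions for illustration, and Python floats stand in for single-precision values.

```python
def cube_coords(x, y, z):
    """Return (face_id, sc, tc, ma) following the four pseudocode listings."""
    ax, ay, az = abs(x), abs(y), abs(z)
    if az >= ax and az >= ay:            # Z is the major axis
        face = 5.0 if z < 0.0 else 4.0
        sc = -x if z < 0.0 else x
        tc = -y
        ma = 2.0 * z
    elif ay >= ax:                       # Y is the major axis
        face = 3.0 if y < 0.0 else 2.0
        sc = x
        tc = -z if y < 0.0 else z
        ma = 2.0 * y
    else:                                # X is the major axis
        face = 1.0 if x < 0.0 else 0.0
        sc = z if x < 0.0 else -z
        tc = -y
        ma = 2.0 * x
    return face, sc, tc, ma
```

Ties are resolved in favour of Z first and then Y, because both comparisons use >=.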

V_BFE_U32

528

Bitfield extract. Extract unsigned bitfield from first operand using field offset in second operand and field size
in third operand.
D0.u = ((S0.u >> S1.u[4 : 0].u) & 32'U((1 << S2.u[4 : 0].u) - 1))

V_BFE_I32

529

Bitfield extract. Extract signed bitfield from first operand using field offset in second operand and field size in
third operand.
tmp = ((S0.i >> S1.u[4 : 0].u) & ((1 << S2.u[4 : 0].u) - 1));
D0.i = 32'I(signextFromBit(tmp.i, S2.i[4 : 0].i))

V_BFI_B32

530

Bitfield insert. Using a bitmask from the first operand, merge bitfield in second operand with packed value in
third operand.
D0.u = ((S0.u & S1.u) | (~S0.u & S2.u))
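
The bitfield opcodes above can be exercised with a short model. The following Python sketch follows the three listings, with 32-bit values held in Python integers; the function names are assumptions, and the signed extract returns its result as a 32-bit pattern.

```python
MASK32 = 0xFFFFFFFF

def v_bfe_u32(s0, s1, s2):
    """Extract an unsigned field: offset in S1[4:0], width in S2[4:0]."""
    offset, width = s1 & 31, s2 & 31
    return ((s0 & MASK32) >> offset) & ((1 << width) - 1)

def v_bfe_i32(s0, s1, s2):
    """Extract a field and sign-extend it from its top bit."""
    offset, width = s1 & 31, s2 & 31
    s0 &= MASK32
    signed = s0 - (1 << 32) if s0 & 0x80000000 else s0
    field = (signed >> offset) & ((1 << width) - 1)   # arithmetic shift
    if width and field & (1 << (width - 1)):
        field -= 1 << width
    return field & MASK32

def v_bfi_b32(s0, s1, s2):
    """Take bits from S1 where the mask S0 is 1, otherwise from S2."""
    return ((s0 & s1) | (~s0 & s2)) & MASK32
```

A width of zero yields zero in both extract forms, since the mask (1 << 0) - 1 is empty.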

V_FMA_F32

531

Fused single precision multiply add.
D0.f = fma(S0.f, S1.f, S2.f)

Notes
0.5ULP accuracy, denormals are supported.

V_FMA_F64

532

Fused double precision multiply add.
D0.f64 = fma(S0.f64, S1.f64, S2.f64)

Notes
0.5ULP precision, denormals are supported.

V_LERP_U8

533

Unsigned 8-bit pixel average on packed unsigned bytes (linear interpolation).
Each byte in S2 acts as a round mode; if the LSB is set then 0.5 rounds up, otherwise 0.5 truncates.
D0.u = 32'U(S0.u[31 : 24] + S1.u[31 : 24] + S2.u[24].u8 >> 1U << 24U);
D0.u += 32'U(S0.u[23 : 16] + S1.u[23 : 16] + S2.u[16].u8 >> 1U << 16U);
D0.u += 32'U(S0.u[15 : 8] + S1.u[15 : 8] + S2.u[8].u8 >> 1U << 8U);
D0.u += 32'U(S0.u[7 : 0] + S1.u[7 : 0] + S2.u[0].u8 >> 1U)
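
The per-byte average can be modelled in a few lines. The following Python sketch assumes that, within each byte, the two inputs and the rounding bit are summed before the right shift by one; that reading of the operator precedence in the listing, and the function name, are assumptions.

```python
def v_lerp_u8(s0, s1, s2):
    """Average four packed bytes; the LSB of each S2 byte selects rounding."""
    result = 0
    for shift in (0, 8, 16, 24):
        a = (s0 >> shift) & 0xFF
        b = (s1 >> shift) & 0xFF
        round_up = (s2 >> shift) & 1      # LSB of the matching S2 byte
        result |= ((a + b + round_up) >> 1) << shift
    return result
```

Each byte is averaged independently, so a sum that exceeds 255 in one byte does not spill into its neighbour.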

V_ALIGNBIT_B32

534

Align a value to the specified bit position.
D0.u = 32'U(({ S0.u, S1.u } >> S2.u[4 : 0].u) & 0xffffffffLL)

Notes
S0 carries the MSBs and S1 carries the LSBs of the value being aligned.

V_ALIGNBYTE_B32

535

Align a value to the specified byte position.
D0.u = 32'U(({ S0.u, S1.u } >> S2.u[1 : 0].u * 8U) & 0xffffffffLL)

Notes
S0 carries the MSBs and S1 carries the LSBs of the value being aligned.
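
Both align opcodes select a 32-bit window from the 64-bit concatenation of S0 (high) and S1 (low). The following Python sketch models them; the function names are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF

def v_alignbit_b32(s0, s1, s2):
    """Shift {S0, S1} right by S2[4:0] bits and keep the low 32 bits."""
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)
    return (combined >> (s2 & 31)) & MASK32

def v_alignbyte_b32(s0, s1, s2):
    """Shift {S0, S1} right by S2[1:0] bytes and keep the low 32 bits."""
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)
    return (combined >> ((s2 & 3) * 8)) & MASK32
```

Passing the same value in S0 and S1 makes the bit form act as a 32-bit rotate right in this model.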

V_MULLIT_F32

536

Multiply for lighting. Specific rules apply: 0.0 * x = 0.0; specific INF, NAN, overflow rules.
if ((S1.f == -MAX_FLOAT_F32) || (64'F(S1.f) == -INF) || isNAN(64'F(S1.f)) || (S2.f <= 0.0F) ||
isNAN(64'F(S2.f))) then
D0.f = -MAX_FLOAT_F32
else
D0.f = S0.f * S1.f
endif

V_MIN3_F32

537

Return minimum single-precision value of three inputs.
D0.f = v_min_f32(v_min_f32(S0.f, S1.f), S2.f)

V_MIN3_I32

538

Return minimum signed integer value of three inputs.
D0.i = v_min_i32(v_min_i32(S0.i, S1.i), S2.i)

V_MIN3_U32

539

Return minimum unsigned integer value of three inputs.
D0.u = v_min_u32(v_min_u32(S0.u, S1.u), S2.u)

V_MAX3_F32

540

Return maximum single precision value of three inputs.
D0.f = v_max_f32(v_max_f32(S0.f, S1.f), S2.f)

V_MAX3_I32

541

Return maximum signed integer value of three inputs.
D0.i = v_max_i32(v_max_i32(S0.i, S1.i), S2.i)

V_MAX3_U32

542

Return maximum unsigned integer value of three inputs.
D0.u = v_max_u32(v_max_u32(S0.u, S1.u), S2.u)

V_MED3_F32

543

Return median single precision value of three inputs.
if (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f)) || isNAN(64'F(S2.f))) then
D0.f = v_min3_f32(S0.f, S1.f, S2.f)
elsif v_max3_f32(S0.f, S1.f, S2.f) == S0.f then
D0.f = v_max_f32(S1.f, S2.f)
elsif v_max3_f32(S0.f, S1.f, S2.f) == S1.f then
D0.f = v_max_f32(S0.f, S2.f)
else
D0.f = v_max_f32(S0.f, S1.f)
endif

V_MED3_I32

544

Return median signed integer value of three inputs.
if v_max3_i32(S0.i, S1.i, S2.i) == S0.i then
D0.i = v_max_i32(S1.i, S2.i)
elsif v_max3_i32(S0.i, S1.i, S2.i) == S1.i then
D0.i = v_max_i32(S0.i, S2.i)
else
D0.i = v_max_i32(S0.i, S1.i)
endif

V_MED3_U32

545

Return median unsigned integer value of three inputs.
if v_max3_u32(S0.u, S1.u, S2.u) == S0.u then
D0.u = v_max_u32(S1.u, S2.u)
elsif v_max3_u32(S0.u, S1.u, S2.u) == S1.u then
D0.u = v_max_u32(S0.u, S2.u)
else
D0.u = v_max_u32(S0.u, S1.u)
endif
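
The integer median listings above all follow the same pattern: find the largest input, then return the larger of the other two. The following Python sketch models that pattern on plain integers; the function names, including the clamp wrapper, are assumptions for illustration.

```python
def v_med3(s0, s1, s2):
    """Median of three integers, following the listing's branch order."""
    biggest = max(s0, s1, s2)
    if biggest == s0:
        return max(s1, s2)
    if biggest == s1:
        return max(s0, s2)
    return max(s0, s1)

def clamp(x, lo, hi):
    """Hypothetical wrapper: clamp via median, meaningful when lo <= hi."""
    return v_med3(x, lo, hi)
```

The median of a value and two ordered bounds is the value clamped to those bounds, which is why a median opcode can serve as a clamp when the bounds are ordered.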

V_SAD_U8

546

Sum of absolute differences with accumulation, overflow into upper bits is allowed.
ABSDIFF = lambda(x, y) (
x > y ? x - y : y - x);
// UNSIGNED comparison
D0.u = S2.u;
D0.u += 32'U(ABSDIFF(S0.u[31 : 24], S1.u[31 : 24]));
D0.u += 32'U(ABSDIFF(S0.u[23 : 16], S1.u[23 : 16]));
D0.u += 32'U(ABSDIFF(S0.u[15 : 8], S1.u[15 : 8]));
D0.u += 32'U(ABSDIFF(S0.u[7 : 0], S1.u[7 : 0]))
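
The byte-wise sum of absolute differences can be modelled directly. The following Python sketch follows the listing above, with the accumulator wrapping at 32 bits; the function name is an assumption for illustration.

```python
MASK32 = 0xFFFFFFFF

def v_sad_u8(s0, s1, s2):
    """Add |a - b| for each of the four byte pairs to the accumulator S2."""
    total = s2
    for shift in (0, 8, 16, 24):
        a = (s0 >> shift) & 0xFF
        b = (s1 >> shift) & 0xFF
        total += abs(a - b)
    return total & MASK32
```

The four differences are summed into one 32-bit accumulator, so the total can exceed 255.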

V_SAD_HI_U8

547

Sum of absolute differences with accumulation, accumulate from the higher-order bits of the third source
operand.
D0.u = (32'U(v_sad_u8(S0, S1, 0U)) << 16U) + S2.u

V_SAD_U16

548

Short SAD with accumulation.
ABSDIFF = lambda(x, y) (
x > y ? x - y : y - x);
// UNSIGNED comparison
D0.u = S2.u;
D0.u += ABSDIFF(S0[31 : 16].u16, S1[31 : 16].u16);
D0.u += ABSDIFF(S0[15 : 0].u16, S1[15 : 0].u16)

V_SAD_U32

549

Dword SAD with accumulation.
ABSDIFF = lambda(x, y) (
x > y ? x - y : y - x);
// UNSIGNED comparison
D0.u = ABSDIFF(S0.u, S1.u) + S2.u

V_CVT_PK_U8_F32

550

Packed float to byte conversion.
Convert floating point value S0 to 8-bit unsigned integer and pack the result into byte S1 of dword S2.
D0.u = (S2.u & 32'U(~(0xff << S1.u[1 : 0].u * 8U)));
D0.u = (D0.u | ((32'U(f32_to_u8(S0.f)) & 255U) << S1.u[1 : 0].u * 8U))

V_DIV_FIXUP_F32

551

Single precision division fixup.
S0 = Quotient, S1 = Denominator, S2 = Numerator.
Given a numerator, denominator, and quotient from a divide, this opcode detects and applies specific case
numerics, touching up the quotient if necessary. This opcode also generates invalid, denorm and divide by
zero exceptions caused by the division.
sign_out = (sign(S1.f) ^ sign(S2.f));
if isNAN(64'F(S2.f)) then
D0.f = 32'F(cvtToQuietNAN(64'F(S2.f)))
elsif isNAN(64'F(S1.f)) then
D0.f = 32'F(cvtToQuietNAN(64'F(S1.f)))
elsif ((64'F(S1.f) == 0.0) && (64'F(S2.f) == 0.0)) then
// 0/0
D0.f = 32'F(0xffc00000)
elsif ((64'F(abs(S1.f)) == +INF) && (64'F(abs(S2.f)) == +INF)) then
// inf/inf
D0.f = 32'F(0xffc00000)
elsif ((64'F(S1.f) == 0.0) || (64'F(abs(S2.f)) == +INF)) then
// x/0, or inf/y
D0.f = sign_out ? -INF.f : +INF.f
elsif ((64'F(abs(S1.f)) == +INF) || (64'F(S2.f) == 0.0)) then
// x/inf, 0/y
D0.f = sign_out ? -0.0F : 0.0F
elsif exponent(S2.f) - exponent(S1.f) < -150 then
D0.f = sign_out ? -UNDERFLOW_F32 : UNDERFLOW_F32
elsif exponent(S1.f) == 255 then
D0.f = sign_out ? -OVERFLOW_F32 : OVERFLOW_F32
else
D0.f = sign_out ? -abs(S0.f) : abs(S0.f)
endif

V_DIV_FIXUP_F64

552

Double precision division fixup.
S0 = Quotient, S1 = Denominator, S2 = Numerator.
Given a numerator, denominator, and quotient from a divide, this opcode detects and applies specific case
numerics, touching up the quotient if necessary. This opcode also generates invalid, denorm and divide by
zero exceptions caused by the division.
sign_out = (sign(S1.f64) ^ sign(S2.f64));
if isNAN(S2.f64) then
D0.f64 = cvtToQuietNAN(S2.f64)
elsif isNAN(S1.f64) then
D0.f64 = cvtToQuietNAN(S1.f64)
elsif ((S1.f64 == 0.0) && (S2.f64 == 0.0)) then
// 0/0
D0.f64 = 64'F(0xfff8000000000000LL)
elsif ((abs(S1.f64) == +INF) && (abs(S2.f64) == +INF)) then
// inf/inf
D0.f64 = 64'F(0xfff8000000000000LL)
elsif ((S1.f64 == 0.0) || (abs(S2.f64) == +INF)) then
// x/0, or inf/y
D0.f64 = sign_out ? -INF : +INF
elsif ((abs(S1.f64) == +INF) || (S2.f64 == 0.0)) then
// x/inf, 0/y
D0.f64 = sign_out ? -0.0 : 0.0
elsif exponent(S2.f64) - exponent(S1.f64) < -1075 then
D0.f64 = sign_out ? -UNDERFLOW_F64 : UNDERFLOW_F64
elsif exponent(S1.f64) == 2047 then
D0.f64 = sign_out ? -OVERFLOW_F64 : OVERFLOW_F64
else
D0.f64 = sign_out ? -abs(S0.f64) : abs(S0.f64)
endif

V_DIV_FMAS_F32

567

Single precision FMA with fused scale.
This opcode performs a standard Fused Multiply-Add operation and conditionally scales the resulting exponent
if VCC is set.
if VCC.u64[laneId] then
D0.f = 2.0F ** 32 * fma(S0.f, S1.f, S2.f)
else
D0.f = fma(S0.f, S1.f, S2.f)
endif

Notes
Input denormals are not flushed, but output flushing is allowed.

V_DIV_FMAS_F64

568

Double precision FMA with fused scale.
This opcode performs a standard Fused Multiply-Add operation and conditionally scales the resulting exponent
if VCC is set.
if VCC.u64[laneId] then
D0.f64 = 2.0 ** 64 * fma(S0.f64, S1.f64, S2.f64)
else
D0.f64 = fma(S0.f64, S1.f64, S2.f64)
endif

Notes
Input denormals are not flushed, but output flushing is allowed.

V_MSAD_U8

569

Masked sum of absolute differences with accumulation, overflow into upper bits is allowed.
Components where the reference value in S1 is zero are not included in the sum.
ABSDIFF = lambda(x, y) (
x > y ? x - y : y - x);
// UNSIGNED comparison
D0.u = S2.u;
D0.u += S1.u[31 : 24] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[31 : 24], S1.u[31 : 24]));
D0.u += S1.u[23 : 16] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[23 : 16], S1.u[23 : 16]));
D0.u += S1.u[15 : 8] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[15 : 8], S1.u[15 : 8]));
D0.u += S1.u[7 : 0] == 8'0U ? 0U : 32'U(ABSDIFF(S0.u[7 : 0], S1.u[7 : 0]))
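
The masking rule differs from the plain SAD only in skipping bytes whose reference value is zero. The following Python sketch models the listing above; the function name is an assumption for illustration.

```python
MASK32 = 0xFFFFFFFF

def v_msad_u8(s0, s1, s2):
    """SAD over four byte pairs, skipping bytes where the reference S1 is 0."""
    total = s2
    for shift in (0, 8, 16, 24):
        ref = (s1 >> shift) & 0xFF
        if ref == 0:
            continue                      # masked-out component
        total += abs(((s0 >> shift) & 0xFF) - ref)
    return total & MASK32
```

A reference of all zeros leaves the accumulator unchanged, whatever S0 holds.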

V_QSAD_PK_U16_U8

570

Quad-byte SAD with 16-bit packed accumulation.
D0[63 : 48] = 16'B(v_sad_u8(S0[55 : 24], S1[31 : 0], S2[63 : 48].u));
D0[47 : 32] = 16'B(v_sad_u8(S0[47 : 16], S1[31 : 0], S2[47 : 32].u));
D0[31 : 16] = 16'B(v_sad_u8(S0[39 : 8], S1[31 : 0], S2[31 : 16].u));
D0[15 : 0] = 16'B(v_sad_u8(S0[31 : 0], S1[31 : 0], S2[15 : 0].u))
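
The quad SAD slides a four-byte window over the 64-bit first operand in one-byte steps, comparing each window with the same 32-bit reference. The following Python sketch models that; the helper and function names are assumptions, and each result is truncated to 16 bits as in the listing.

```python
def sad_u8(s0, s1, acc):
    """Byte-wise sum of absolute differences plus an accumulator."""
    for shift in (0, 8, 16, 24):
        acc += abs(((s0 >> shift) & 0xFF) - ((s1 >> shift) & 0xFF))
    return acc

def v_qsad_pk_u16_u8(s0, s1, s2):
    """Four SADs at byte offsets 0..3 of S0, packed as 16-bit results."""
    result = 0
    for k in range(4):
        window = (s0 >> (8 * k)) & 0xFFFFFFFF    # S0[8k+31 : 8k]
        acc = (s2 >> (16 * k)) & 0xFFFF          # S2[16k+15 : 16k]
        result |= (sad_u8(window, s1, acc) & 0xFFFF) << (16 * k)
    return result
```

With S0 holding the bytes 0 through 7 and a reference of bytes 0 through 3, the four packed results in this model are 0, 4, 8 and 12.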

V_MQSAD_PK_U16_U8

571

Quad-byte masked SAD with 16-bit packed accumulation.
D0[63 : 48] = 16'B(v_msad_u8(S0[55 : 24], S1[31 : 0], S2[63 : 48].u));
D0[47 : 32] = 16'B(v_msad_u8(S0[47 : 16], S1[31 : 0], S2[47 : 32].u));
D0[31 : 16] = 16'B(v_msad_u8(S0[39 : 8], S1[31 : 0], S2[31 : 16].u));
D0[15 : 0] = 16'B(v_msad_u8(S0[31 : 0], S1[31 : 0], S2[15 : 0].u))

V_MQSAD_U32_U8

573

Quad-byte masked SAD with 32-bit packed accumulation.
D0[127 : 96] = 32'B(v_msad_u8(S0[55 : 24], S1[31 : 0], S2[127 : 96].u));
D0[95 : 64] = 32'B(v_msad_u8(S0[47 : 16], S1[31 : 0], S2[95 : 64].u));
D0[63 : 32] = 32'B(v_msad_u8(S0[39 : 8], S1[31 : 0], S2[63 : 32].u));
D0[31 : 0] = 32'B(v_msad_u8(S0[31 : 0], S1[31 : 0], S2[31 : 0].u))

V_XOR3_B32

576

Bitwise XOR of three inputs.
D0.u = (S0.u ^ S1.u ^ S2.u)

Notes
Input and output modifiers not supported.

V_MAD_U16

577

Multiply and add three unsigned short values.
D0.u16 = S0.u16 * S1.u16 + S2.u16

Notes
Supports saturation (unsigned 16-bit integer domain).

V_PERM_B32

580

Byte permute.
BYTE_PERMUTE = lambda(data, sel) (
declare in : 8'B[8];
for i in 0 : 7 do
in[i] = data[i * 8 + 7 : i * 8].b8
endfor;
if sel.u >= 13U then
return 8'0xff
elsif sel.u == 12U then
return 8'0x0
elsif sel.u == 11U then
return in[7][7].b8 * 8'0xff
elsif sel.u == 10U then
return in[5][7].b8 * 8'0xff
elsif sel.u == 9U then
return in[3][7].b8 * 8'0xff
elsif sel.u == 8U then
return in[1][7].b8 * 8'0xff
else
return in[sel]
endif);
D0[31 : 24] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[31 : 24]);
D0[23 : 16] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[23 : 16]);
D0[15 : 8] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[15 : 8]);
D0[7 : 0] = BYTE_PERMUTE({ S0.u, S1.u }, S2.u[7 : 0])

Notes
Selects 8 through 11 are useful in modeling sign extension of a smaller-precision signed integer to a larger-precision result.
Note the MSBs of the 64-bit value being selected are stored in S0. This is counterintuitive for a little-endian
architecture.
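
The selector encoding is easier to follow with concrete values. The following Python sketch models the byte permute listing above, with S0 supplying bytes 4 through 7 and S1 supplying bytes 0 through 3 of the 64-bit source; the function names are assumptions for illustration.

```python
def byte_permute(data, sel):
    """Return one output byte chosen from the 64-bit value by selector sel."""
    src = [(data >> (8 * i)) & 0xFF for i in range(8)]
    if sel >= 13:
        return 0xFF
    if sel == 12:
        return 0x00
    if sel >= 8:
        # Selects 8..11 replicate the sign bit of bytes 1, 3, 5 and 7.
        sign_source = src[2 * (sel - 8) + 1]
        return 0xFF if sign_source & 0x80 else 0x00
    return src[sel]

def v_perm_b32(s0, s1, s2):
    """Build D0 from four selector bytes in S2; S0 holds the MSBs."""
    data = ((s0 & 0xFFFFFFFF) << 32) | (s1 & 0xFFFFFFFF)
    result = 0
    for i in range(4):
        sel = (s2 >> (8 * i)) & 0xFF
        result |= byte_permute(data, sel) << (8 * i)
    return result
```

In this model the selector 0x00010203 byte-swaps S1, and 0x08080100 sign-extends the low 16 bits of S1 to 32 bits.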

V_XAD_U32

581

Bitwise XOR and then add.
D0.u = (S0.u ^ S1.u) + S2.u

Notes
No carryin/carryout and no saturation. This opcode is designed to help accelerate the SHA256 hash algorithm.

V_LSHL_ADD_U32

582

Logical shift left and then add.
D0.u = (S0.u << S1.u[4 : 0].u) + S2.u

V_ADD_LSHL_U32

583

Add and then logical shift left the result.
D0.u = (S0.u + S1.u) << S2.u[4 : 0].u

V_FMA_F16

584

Fused half precision multiply add.
D0.f16 = fma(S0.f16, S1.f16, S2.f16)

Notes
0.5ULP accuracy, denormals are supported.

V_MIN3_F16

585

Return minimum FP16 value of three inputs.
D0.f16 = v_min_f16(v_min_f16(S0.f16, S1.f16), S2.f16)

V_MIN3_I16

586

Return minimum signed short value of three inputs.
D0.i16 = v_min_i16(v_min_i16(S0.i16, S1.i16), S2.i16)

V_MIN3_U16

587

Return minimum unsigned short value of three inputs.
D0.u16 = v_min_u16(v_min_u16(S0.u16, S1.u16), S2.u16)

V_MAX3_F16

588

Return maximum FP16 value of three inputs.
D0.f16 = v_max_f16(v_max_f16(S0.f16, S1.f16), S2.f16)

V_MAX3_I16

589

Return maximum signed short value of three inputs.
D0.i16 = v_max_i16(v_max_i16(S0.i16, S1.i16), S2.i16)

V_MAX3_U16

590

Return maximum unsigned short value of three inputs.
D0.u16 = v_max_u16(v_max_u16(S0.u16, S1.u16), S2.u16)

V_MED3_F16

591

Return median FP16 value of three inputs.
if (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16)) || isNAN(64'F(S2.f16))) then
D0.f16 = v_min3_f16(S0.f16, S1.f16, S2.f16)
elsif v_max3_f16(S0.f16, S1.f16, S2.f16) == S0.f16 then
D0.f16 = v_max_f16(S1.f16, S2.f16)
elsif v_max3_f16(S0.f16, S1.f16, S2.f16) == S1.f16 then
D0.f16 = v_max_f16(S0.f16, S2.f16)
else
D0.f16 = v_max_f16(S0.f16, S1.f16)
endif

V_MED3_I16

592

Return median signed short value of three inputs.
if v_max3_i16(S0.i16, S1.i16, S2.i16) == S0.i16 then
D0.i16 = v_max_i16(S1.i16, S2.i16)
elsif v_max3_i16(S0.i16, S1.i16, S2.i16) == S1.i16 then
D0.i16 = v_max_i16(S0.i16, S2.i16)
else
D0.i16 = v_max_i16(S0.i16, S1.i16)
endif

V_MED3_U16

593

Return median unsigned short value of three inputs.
if v_max3_u16(S0.u16, S1.u16, S2.u16) == S0.u16 then
D0.u16 = v_max_u16(S1.u16, S2.u16)
elsif v_max3_u16(S0.u16, S1.u16, S2.u16) == S1.u16 then
D0.u16 = v_max_u16(S0.u16, S2.u16)
else
D0.u16 = v_max_u16(S0.u16, S1.u16)
endif

V_MAD_I16

595

Multiply and add three signed short values.
D0.i16 = S0.i16 * S1.i16 + S2.i16

Notes
Supports saturation (signed 16-bit integer domain).

V_DIV_FIXUP_F16

596

Half precision division fixup.
S0 = Quotient, S1 = Denominator, S2 = Numerator.
Given a numerator, denominator, and quotient from a divide, this opcode detects and applies specific case
numerics, touching up the quotient if necessary. This opcode also generates invalid, denorm and divide by
zero exceptions caused by the division.
sign_out = (sign(S1.f16) ^ sign(S2.f16));
if isNAN(64'F(S2.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S2.f16)))
elsif isNAN(64'F(S1.f16)) then
D0.f16 = 16'F(cvtToQuietNAN(64'F(S1.f16)))
elsif ((64'F(S1.f16) == 0.0) && (64'F(S2.f16) == 0.0)) then
// 0/0
D0.f16 = 16'F(0xfe00)
elsif ((64'F(abs(S1.f16)) == +INF) && (64'F(abs(S2.f16)) == +INF)) then
// inf/inf
D0.f16 = 16'F(0xfe00)
elsif ((64'F(S1.f16) == 0.0) || (64'F(abs(S2.f16)) == +INF)) then
// x/0, or inf/y
D0.f16 = sign_out ? -INF.f16 : +INF.f16
elsif ((64'F(abs(S1.f16)) == +INF) || (64'F(S2.f16) == 0.0)) then
// x/inf, 0/y
D0.f16 = sign_out ? -16'0.0 : 16'0.0
else
D0.f16 = sign_out ? -abs(S0.f16) : abs(S0.f16)
endif

V_ADD3_U32

597

Add three unsigned integers.
D0.u = S0.u + S1.u + S2.u

V_LSHL_OR_B32

598

Logical shift left and then bitwise OR.
D0.u = ((S0.u << S1.u[4 : 0].u) | S2.u)

V_AND_OR_B32

599

Bitwise AND and then bitwise OR.
D0.u = ((S0.u & S1.u) | S2.u)

V_OR3_B32

600

Bitwise OR of three inputs.
D0.u = (S0.u | S1.u | S2.u)

V_MAD_U32_U16

601

Multiply and add unsigned values.
D0.u = 32'U(S0.u16) * 32'U(S1.u16) + S2.u

V_MAD_I32_I16

602

Multiply and add signed values.
D0.i = 32'I(S0.i16) * 32'I(S1.i16) + S2.i

V_PERMLANE16_B32

603

Perform arbitrary gather-style operation within a row (16 contiguous lanes).
The first source must be a VGPR and the second and third sources must be scalar values; the second and third
sources are combined into a single 64-bit value representing lane selects used to swizzle within each row.
OPSEL is not used in its typical manner for this instruction. For this instruction OPSEL[0] is overloaded to
represent the DPP 'FI' (Fetch Inactive) bit and OPSEL[1] is overloaded to represent the DPP 'BOUND_CTRL' bit.
The remaining OPSEL bits are reserved for this instruction.
Compare with V_PERMLANEX16_B32.
declare tmp : 32'B[64];
lanesel = { S2.u, S1.u };
// Concatenate lane select bits
for i in 0 : WAVE32 ? 31 : 63 do
// Copy original S0 in case D==S0
tmp[i] = VGPR[i][SRC0.u]
endfor;
for row in 0 : WAVE32 ? 1 : 3 do
// Implement arbitrary swizzle within each row
for i in 0 : 15 do
if EXEC[row * 16 + i].u1 then
VGPR[row * 16 + i][VDST.u] = tmp[64'B(row * 16) + lanesel[i * 4 + 3 : i * 4]]
endif
endfor
endfor

Notes
ABS, NEG and OMOD modifiers should all be zeroed for this instruction.
Example implementing a rotation within each row:
s_mov_b32 s0, 0x87654321;
s_mov_b32 s1, 0x0fedcba9;
v_permlane16_b32 v1, v0, s0, s1;
// ROW 0:
// v1.lane[0] <- v0.lane[1]
// v1.lane[1] <- v0.lane[2]
// ...
// v1.lane[14] <- v0.lane[15]
// v1.lane[15] <- v0.lane[0]
//
// ROW 1:
// v1.lane[16] <- v0.lane[17]
// v1.lane[17] <- v0.lane[18]
// ...
// v1.lane[30] <- v0.lane[31]
// v1.lane[31] <- v0.lane[16]
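
The row rotation in the example above can be reproduced with a small lane-level model. The following Python sketch represents a VGPR as a list with one entry per lane; the function signature is an assumption for illustration, and the FI and BOUND_CTRL bits are not modelled, so every source lane is treated as readable.

```python
def v_permlane16_b32(src, s1, s2, exec_mask, dst=None, wave_size=32):
    """Return the destination lanes after a gather within each 16-lane row.

    src and dst are per-lane lists; s1 and s2 are the two scalar selects.
    """
    lanesel = ((s2 & 0xFFFFFFFF) << 32) | (s1 & 0xFFFFFFFF)
    tmp = list(src)                       # snapshot in case dst aliases src
    out = list(dst) if dst is not None else [0] * wave_size
    for row in range(wave_size // 16):
        for i in range(16):
            lane = row * 16 + i
            if (exec_mask >> lane) & 1:
                sel = (lanesel >> (4 * i)) & 0xF   # 4-bit select for lane i
                out[lane] = tmp[row * 16 + sel]
    return out
```

Calling it with s1 = 0x87654321, s2 = 0x0fedcba9 and all lanes enabled gives each lane the value of its right-hand neighbour within the row, with lane 15 wrapping to lane 0 and lane 31 wrapping to lane 16.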

V_PERMLANEX16_B32

604

Perform arbitrary gather-style operation across two rows (each row is 16 contiguous lanes).
The first source must be a VGPR and the second and third sources must be scalar values; the second and third
sources are combined into a single 64-bit value representing lane selects used to swizzle across the two rows of each pair.
OPSEL is not used in its typical manner for this instruction. For this instruction OPSEL[0] is overloaded to
represent the DPP 'FI' (Fetch Inactive) bit and OPSEL[1] is overloaded to represent the DPP 'BOUND_CTRL' bit.
The remaining OPSEL bits are reserved for this instruction.
Compare with V_PERMLANE16_B32.
declare tmp : 32'B[64];
lanesel = { S2.u, S1.u };
// Concatenate lane select bits
for i in 0 : WAVE32 ? 31 : 63 do
// Copy original S0 in case D==S0
tmp[i] = VGPR[i][SRC0.u]
endfor;
for row in 0 : WAVE32 ? 1 : 3 do
// Implement arbitrary swizzle across two rows
altrow = { row[1], ~row[0] };
// 1<->0, 3<->2
for i in 0 : 15 do
if EXEC[row * 16 + i].u1 then
VGPR[row * 16 + i][VDST.u] = tmp[64'B(altrow.i * 16) + lanesel[i * 4 + 3 : i * 4]]
endif
endfor
endfor

Notes
ABS, NEG and OMOD modifiers should all be zeroed for this instruction.
Example implementing a rotation across an entire wave32 wavefront:
// Note for this to work, source and destination VGPRs must be different.
// For this rotation, lane 15 gets data from lane 16, lane 31 gets data from lane 0.
// These are the only two lanes that need to use v_permlanex16_b32.
// Enable only the threads that get data from their own row.
s_mov_b32 exec_lo, 0x7fff7fff; // Lanes getting data from their own row
s_mov_b32 s0, 0x87654321;
s_mov_b32 s1, 0x0fedcba9;
v_permlane16_b32 v1, v0, s0, s1 fi; // FI bit needed for lanes 14 and 30
// ROW 0:
// v1.lane[0] <- v0.lane[1]
// v1.lane[1] <- v0.lane[2]
// ...
// v1.lane[14] <- v0.lane[15] (needs FI to read)
// v1.lane[15] unset
//
// ROW 1:
// v1.lane[16] <- v0.lane[17]
// v1.lane[17] <- v0.lane[18]
// ...
// v1.lane[30] <- v0.lane[31] (needs FI to read)
// v1.lane[31] unset
// Enable only the threads that get data from the other row.
s_mov_b32 exec_lo, 0x80008000; // Lanes getting data from the other row
v_permlanex16_b32 v1, v0, s0, s1 fi; // FI bit needed for lanes 15 and 31
// v1.lane[15] <- v0.lane[16]
// v1.lane[31] <- v0.lane[0]
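
The two-step wave32 rotation above can be checked with a lane-level model. The following Python sketch uses one helper for both opcodes, selected by a cross_row flag; the helper, its signature and the flag are assumptions for illustration, and FI is treated as always set, so disabled source lanes remain readable.

```python
def permlane(src, dst, s1, s2, exec_mask, cross_row):
    """Model of the same-row (cross_row=False) and other-row gathers."""
    lanesel = ((s2 & 0xFFFFFFFF) << 32) | (s1 & 0xFFFFFFFF)
    out = list(dst)
    for lane in range(32):
        if (exec_mask >> lane) & 1:
            row, i = divmod(lane, 16)
            src_row = row ^ 1 if cross_row else row   # 1<->0 when crossing
            out[lane] = src[src_row * 16 + ((lanesel >> (4 * i)) & 0xF)]
    return out

def rotate_wave32(v0):
    """Rotate a 32-lane register by one lane using both gathers."""
    s1, s2 = 0x87654321, 0x0FEDCBA9
    v1 = [None] * 32
    # Step 1: lanes that read from their own row (v_permlane16_b32).
    v1 = permlane(v0, v1, s1, s2, 0x7FFF7FFF, cross_row=False)
    # Step 2: lanes 15 and 31 read from the other row (v_permlanex16_b32).
    v1 = permlane(v0, v1, s1, s2, 0x80008000, cross_row=True)
    return v1
```

Because the second step writes only lanes 15 and 31, the source and destination registers must differ, as the example notes.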

V_CNDMASK_B16

605

Conditional mask on each thread.
D0.u16 = VCC.u1 ? S1.u16 : S0.u16

Notes
In VOP3 the VCC source may be a scalar GPR specified in S2.u.
Floating-point modifiers are valid for this instruction if S0.u16 and S1.u16 are 16-bit floating point values. This
instruction is suitable for negating or taking the absolute value of a floating-point value.

V_MAXMIN_F32

606

Compute maximum of first two operands, followed by minimum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's maxBound < minBound.
D0.f = v_min_f32(v_max_f32(S0.f, S1.f), S2.f)

Notes
Supports input denorm control and allows output denorm values. Exceptions are supported. Note: +0.0 > -0.0 is true.

V_MINMAX_F32

607

Compute minimum of first two operands, followed by maximum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's minBound > maxBound.
D0.f = v_max_f32(v_min_f32(S0.f, S1.f), S2.f)

Notes
Supports input denorm control and allows output denorm values. Exceptions are supported. Note: +0.0 > -0.0 is true.

V_MAXMIN_F16

608

Compute maximum of first two operands, followed by minimum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's maxBound < minBound.
D0.f16 = v_min_f16(v_max_f16(S0.f16, S1.f16), S2.f16)

Notes
Supports input denorm control and allows output denorm values. Exceptions are supported. Note: +0.0 > -0.0 is true.

V_MINMAX_F16

609

Compute minimum of first two operands, followed by maximum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's maxBound < minBound.
D0.f16 = v_max_f16(v_min_f16(S0.f16, S1.f16), S2.f16)

Notes
Supports input denorm control and allows output denorm values. Exceptions are supported. Note: +0.0 > -0.0 is true.

V_MAXMIN_U32

610

Compute maximum of first two operands, followed by minimum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's maxBound < minBound.
D0.i = 32'I(v_min_u32(v_max_u32(S0.u, S1.u), S2.u))

V_MINMAX_U32

611

Compute minimum of first two operands, followed by maximum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's maxBound < minBound.
D0.i = 32'I(v_max_u32(v_min_u32(S0.u, S1.u), S2.u))

V_MAXMIN_I32

612

Compute maximum of first two operands, followed by minimum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's maxBound < minBound.
D0.i = v_min_i32(v_max_i32(S0.i, S1.i), S2.i)

V_MINMAX_I32

613

Compute minimum of first two operands, followed by maximum of that result and the third operand.
This instruction can emulate an API-level "clamp"; unlike MED3 this correctly handles the case where the
clamp's maxBound < minBound.
D0.i = v_max_i32(v_min_i32(S0.i, S1.i), S2.i)
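
The difference between these clamps and a median shows up only when the bounds are inverted. The following Python sketch compares the two on plain integers; the function names are assumptions for illustration, and the median is computed by sorting rather than by the branch sequence in the MED3 listings.

```python
def v_maxmin_i32(s0, s1, s2):
    """min(max(S0, S1), S2): the upper bound S2 is applied last."""
    return min(max(s0, s1), s2)

def v_minmax_i32(s0, s1, s2):
    """max(min(S0, S1), S2): the lower bound S2 is applied last."""
    return max(min(s0, s1), s2)

def v_med3_i32(s0, s1, s2):
    """Median of three values, for comparison."""
    return sorted((s0, s1, s2))[1]
```

With ordered bounds (lower 0, upper 10) all three agree on an input of 5. With inverted bounds (lower 10, upper 0), the median still returns 5, while each clamp form returns the bound it applies last.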

V_DOT2_F16_F16

614

Dot product of packed FP16 values.
tmp = S0[15 : 0].f16 * S1[15 : 0].f16;
tmp += S0[31 : 16].f16 * S1[31 : 16].f16;
tmp += S2.f16;
D0.f16 = tmp

Notes
OPSEL[2] controls which half of S2 is read and OPSEL[3] controls which half of D is written; OPSEL[1:0] are
ignored.

V_DOT2_BF16_BF16

615

Dot product of packed brain-float values.
tmp = S0[15 : 0].bf16 * S1[15 : 0].bf16;
tmp += S0[31 : 16].bf16 * S1[31 : 16].bf16;
tmp += S2.bf16;
D0.bf16 = tmp

Notes
OPSEL[2] controls which half of S2 is read and OPSEL[3] controls which half of D is written; OPSEL[1:0] are
ignored.

V_DIV_SCALE_F32

764

Single precision division pre-scale.
S0 = Input to scale (either denominator or numerator), S1 = Denominator, S2 = Numerator.
Given a numerator and denominator, this opcode appropriately scales inputs for division to avoid subnormal
terms during Newton-Raphson correction method. S0 must be the same value as either S1 or S2.
This opcode produces a VCC flag for post-scaling of the quotient (using V_DIV_FMAS_F32).
VCC = 0x0LL;
if ((64'F(S2.f) == 0.0) || (64'F(S1.f) == 0.0)) then
D0.f = NAN.f
elsif exponent(S2.f) - exponent(S1.f) >= 96 then
// N/D near MAX_FLOAT_F32
VCC = 0x1LL;
if S0.f == S1.f then
// Only scale the denominator
D0.f = ldexp(S0.f, 64)
endif
elsif S1.f == DENORM.f then
D0.f = ldexp(S0.f, 64)
elsif ((1.0 / 64'F(S1.f) == DENORM.f64) && (S2.f / S1.f == DENORM.f)) then
VCC = 0x1LL;
if S0.f == S1.f then
// Only scale the denominator
D0.f = ldexp(S0.f, 64)
endif
elsif 1.0 / 64'F(S1.f) == DENORM.f64 then
D0.f = ldexp(S0.f, -64)
elsif S2.f / S1.f == DENORM.f then
VCC = 0x1LL;
if S0.f == S2.f then
// Only scale the numerator
D0.f = ldexp(S0.f, 64)
endif
elsif exponent(S2.f) <= 23 then
// Numerator is tiny
D0.f = ldexp(S0.f, 64)
endif

V_DIV_SCALE_F64

765

Double precision division pre-scale.
S0 = Input to scale (either denominator or numerator), S1 = Denominator, S2 = Numerator.
Given a numerator and denominator, this opcode appropriately scales inputs for division to avoid subnormal
terms during Newton-Raphson correction method. S0 must be the same value as either S1 or S2.
This opcode produces a VCC flag for post-scaling of the quotient (using V_DIV_FMAS_F64).
VCC = 0x0LL;
if ((S2.f64 == 0.0) || (S1.f64 == 0.0)) then
D0.f64 = NAN.f64
elsif exponent(S2.f64) - exponent(S1.f64) >= 768 then
// N/D near MAX_FLOAT_F64
VCC = 0x1LL;
if S0.f64 == S1.f64 then
// Only scale the denominator
D0.f64 = ldexp(S0.f64, 128)
endif
elsif S1.f64 == DENORM.f64 then
D0.f64 = ldexp(S0.f64, 128)
elsif ((1.0 / S1.f64 == DENORM.f64) && (S2.f64 / S1.f64 == DENORM.f64)) then
VCC = 0x1LL;
if S0.f64 == S1.f64 then
// Only scale the denominator
D0.f64 = ldexp(S0.f64, 128)
endif
elsif 1.0 / S1.f64 == DENORM.f64 then
D0.f64 = ldexp(S0.f64, -128)
elsif S2.f64 / S1.f64 == DENORM.f64 then
VCC = 0x1LL;
if S0.f64 == S2.f64 then
// Only scale the numerator
D0.f64 = ldexp(S0.f64, 128)
endif
elsif exponent(S2.f64) <= 53 then
// Numerator is tiny
D0.f64 = ldexp(S0.f64, 128)
endif

V_MAD_U64_U32

766

Multiply and add unsigned integers and produce a 64-bit result.
{ vcc_out.u1, D0.u64 } = 65'B(65'U(S0.u) * 65'U(S1.u) + 65'U(S2.u64))
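
As an illustration of the 65-bit result, here is a minimal Python sketch of the pseudocode above. The function name `mad_u64_u32` and the tuple return order (carry bit, 64-bit result) are assumptions made for this example.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mad_u64_u32(s0, s1, s2):
    """Return (vcc_out bit, D0.u64) for one lane.

    s0 and s1 are treated as unsigned 32-bit values, s2 as unsigned 64-bit.
    """
    full = (s0 & MASK32) * (s1 & MASK32) + (s2 & MASK64)  # fits in 65 bits
    carry = (full >> 64) & 1
    return carry, full & MASK64


carry, result = mad_u64_u32(0xFFFFFFFF, 0xFFFFFFFF, MASK64)
print(carry, hex(result))
```

The largest possible sum is (2^32 - 1)^2 + (2^64 - 1), which is below 2^65, so one carry bit is sufficient.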

V_MAD_I64_I32

767

Multiply and add signed integers and produce a 64-bit result.
{ vcc_out.u1, D0.i64 } = 65'B(65'I(S0.i) * 65'I(S1.i) + 65'I(S2.i64))

V_ADD_CO_U32

768

Add two unsigned integers with carry-out.
tmp = 64'U(S0.u) + 64'U(S1.u);
VCC.u64[laneId] = tmp >= 0x100000000ULL ? 1'1U : 1'0U;
// VCC is an UNSIGNED overflow/carry-out for V_ADD_CO_CI_U32.
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair.
If clamp is enabled, the output saturates to the range [0, maxUInt32].

V_SUB_CO_U32

769

Subtract the second unsigned integer from the first with carry-out.
tmp = S0.u - S1.u;
VCC.u64[laneId] = S1.u > S0.u ? 1'1U : 1'0U;
// VCC is an UNSIGNED overflow/carry-out for V_SUB_CO_CI_U32.
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair.
If clamp is enabled, the output saturates to the range [0, maxUInt32].

V_SUBREV_CO_U32

770

Subtract the first unsigned integer from the second with carry-out.
tmp = S1.u - S0.u;
VCC.u64[laneId] = S0.u > S1.u ? 1'1U : 1'0U;
// VCC is an UNSIGNED overflow/carry-out for V_SUB_CO_CI_U32.
D0.u = tmp.u

Notes
In VOP3 the VCC destination may be an arbitrary SGPR-pair.
If clamp is enabled, the output saturates to the range [0, maxUInt32].
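
The carry-out and borrow behaviour of the three opcodes above can be sketched for a single lane in Python. This is a simplified model: the clamp modifier is not modelled, and the function names and the (result, carry) return order are assumptions for this example.

```python
MASK32 = 0xFFFFFFFF


def add_co_u32(s0, s1):
    """Return (D0.u, carry-out bit) for one lane of V_ADD_CO_U32."""
    tmp = (s0 & MASK32) + (s1 & MASK32)
    return tmp & MASK32, int(tmp >= 0x100000000)


def sub_co_u32(s0, s1):
    """Return (D0.u, borrow bit) for one lane of V_SUB_CO_U32 (S0 - S1)."""
    s0 &= MASK32
    s1 &= MASK32
    return (s0 - s1) & MASK32, int(s1 > s0)


def subrev_co_u32(s0, s1):
    """Return (D0.u, borrow bit) for one lane of V_SUBREV_CO_U32 (S1 - S0)."""
    return sub_co_u32(s1, s0)


print(add_co_u32(0xFFFFFFFF, 1))  # wraps to 0 and sets the carry bit
print(sub_co_u32(0, 1))           # wraps to 0xFFFFFFFF and sets the borrow bit
```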

V_ADD_NC_U16

771

Add two unsigned shorts. No carry-in or carry-out.
D0.u16 = S0.u16 + S1.u16

Notes
Supports saturation (unsigned 16-bit integer domain).

V_SUB_NC_U16

772

Subtract the second unsigned short from the first. No carry-in or carry-out.
D0.u16 = S0.u16 - S1.u16

Notes
Supports saturation (unsigned 16-bit integer domain).

V_MUL_LO_U16

773

Multiply two unsigned shorts.
D0.u16 = S0.u16 * S1.u16

Notes
Supports saturation (unsigned 16-bit integer domain).

V_CVT_PK_I16_F32

774

Convert two single-precision floats into a packed value of signed words.
D0[31 : 16] = 16'B(v_cvt_i16_f32(S1.f));
D0[15 : 0] = 16'B(v_cvt_i16_f32(S0.f))

V_CVT_PK_U16_F32

775

Convert two single-precision floats into a packed value of unsigned words.
D0[31 : 16] = 16'B(v_cvt_u16_f32(S1.f));
D0[15 : 0] = 16'B(v_cvt_u16_f32(S0.f))

V_MAX_U16

777

Maximum of two unsigned shorts.
D0.u16 = S0.u16 >= S1.u16 ? S0.u16 : S1.u16

V_MAX_I16

778

Maximum of two signed shorts.
D0.i16 = S0.i16 >= S1.i16 ? S0.i16 : S1.i16

V_MIN_U16

779

Minimum of two unsigned shorts.
D0.u16 = S0.u16 < S1.u16 ? S0.u16 : S1.u16

V_MIN_I16

780

Minimum of two signed shorts.
D0.i16 = S0.i16 < S1.i16 ? S0.i16 : S1.i16

V_ADD_NC_I16

781

Add two signed shorts. No carry-in or carry-out.
D0.i16 = S0.i16 + S1.i16

Notes
Supports saturation (signed 16-bit integer domain).

V_SUB_NC_I16

782

Subtract the second signed short from the first. No carry-in or carry-out.
D0.i16 = S0.i16 - S1.i16

Notes
Supports saturation (signed 16-bit integer domain).

V_PACK_B32_F16

785

Pack two FP16 values together.
D0[31 : 16].f16 = S1.f16;
D0[15 : 0].f16 = S0.f16
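
A minimal Python sketch of the packing layout follows. The instruction itself only moves 16-bit patterns; the conversion from Python floats through `struct` format `e` (IEEE half precision) is an addition for illustration, and the helper names are hypothetical.

```python
import struct


def f16_bits(x):
    """Return the 16-bit IEEE half-precision pattern of a Python float."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def pack_b32_f16(s0, s1):
    """Place S1 in bits 31:16 and S0 in bits 15:0 of a 32-bit word."""
    return (f16_bits(s1) << 16) | f16_bits(s0)


print(hex(pack_b32_f16(1.0, 2.0)))  # 0x40003c00
```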

V_CVT_PK_NORM_I16_F16

786

Convert two FP16 values into packed signed normalized shorts.
D0[15 : 0].i16 = f16_to_snorm(S0[15 : 0].f16);
D0[31 : 16].i16 = f16_to_snorm(S1[15 : 0].f16)

V_CVT_PK_NORM_U16_F16

787

Convert two FP16 values into packed unsigned normalized shorts.
D0[15 : 0].u16 = f16_to_unorm(S0[15 : 0].f16);
D0[31 : 16].u16 = f16_to_unorm(S1[15 : 0].f16)

V_LDEXP_F32

796

Load exponent.
Multiply a single-precision float by an integral power of 2; compare with the ldexp() function in C.
D0.f = S0.f * 2.0F ** S1.i

V_BFM_B32

797

Bitfield modify.
S0 is the bitfield width and S1 is the bitfield offset.
D0.u = 32'U(((1 << S0[4 : 0].u) - 1) << S1[4 : 0].u)
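
As a worked example, the mask computation can be sketched in Python. The function name `bfm_b32` is hypothetical; only the low 5 bits of each operand are used, as in the pseudocode.

```python
def bfm_b32(s0, s1):
    """Build a mask of `width` ones starting at bit `offset`."""
    width = s0 & 0x1F   # S0[4:0]
    offset = s1 & 0x1F  # S1[4:0]
    return (((1 << width) - 1) << offset) & 0xFFFFFFFF


print(hex(bfm_b32(4, 8)))  # 0xf00: four ones starting at bit 8
print(hex(bfm_b32(32, 0)))  # 0x0: a width of 32 wraps to 0
```

Because the width is taken from 5 bits, a full 32-bit mask cannot be produced by this operation alone.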

V_BCNT_U32_B32

798

Bit count.
D0.u = S1.u;
for i in 0 : 31 do
D0.u += S0[i].u;
// count i'th bit
endfor
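
The loop above is a population count of S0 accumulated onto S1. A minimal Python sketch, with a hypothetical function name and the result wrapped to 32 bits:

```python
def bcnt_u32_b32(s0, s1):
    """Count the set bits in the low 32 bits of s0 and add s1."""
    ones = bin(s0 & 0xFFFFFFFF).count("1")
    return (s1 + ones) & 0xFFFFFFFF


print(bcnt_u32_b32(0xF0F0, 3))  # 11: eight set bits plus 3
```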

V_MBCNT_LO_U32_B32

799

Masked bit count.
laneId is the position of this thread in the wavefront (in 0..63). See also V_MBCNT_HI_U32_B32.
ThreadMask = (1LL << laneId.u) - 1LL;
MaskedValue = (S0.u & ThreadMask[31 : 0].u);
D0.u = S1.u;
for i in 0 : 31 do
D0.u += MaskedValue[i] == 1'1U ? 1U : 0U
endfor

V_MBCNT_HI_U32_B32

800

Masked bit count, high pass.
laneId is the position of this thread in the wavefront (in 0..63). See also V_MBCNT_LO_U32_B32.
ThreadMask = (1LL << laneId.u) - 1LL;
MaskedValue = (S0.u & ThreadMask[63 : 32].u);
D0.u = S1.u;
for i in 0 : 31 do
D0.u += MaskedValue[i] == 1'1U ? 1U : 0U
endfor

Notes
Note that in Wave32 mode ThreadMask[63:32] == 0 and this instruction simply performs a move from S1 to D.
Example to compute each thread's position in 0..63:
v_mbcnt_lo_u32_b32 v0, -1, 0
v_mbcnt_hi_u32_b32 v0, -1, v0
// v0 now contains laneId
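
The two-instruction sequence above can be checked with a small Python model of both opcodes. The function names are hypothetical, the lane ID is passed explicitly, and the inline constant -1 is represented as 0xFFFFFFFF.

```python
def mbcnt_lo(s0, s1, lane_id):
    """Count bits of s0 below lane_id within lanes 0..31, then add s1."""
    thread_mask = (1 << lane_id) - 1
    masked = s0 & thread_mask & 0xFFFFFFFF          # ThreadMask[31:0]
    return (s1 + bin(masked).count("1")) & 0xFFFFFFFF


def mbcnt_hi(s0, s1, lane_id):
    """Count bits of s0 below lane_id within lanes 32..63, then add s1."""
    thread_mask = (1 << lane_id) - 1
    masked = s0 & (thread_mask >> 32) & 0xFFFFFFFF  # ThreadMask[63:32]
    return (s1 + bin(masked).count("1")) & 0xFFFFFFFF


def lane_id_via_mbcnt(lane_id):
    """Mirror the v_mbcnt_lo / v_mbcnt_hi pair with S0 = -1."""
    v0 = mbcnt_lo(0xFFFFFFFF, 0, lane_id)
    return mbcnt_hi(0xFFFFFFFF, v0, lane_id)


print([lane_id_via_mbcnt(lane) for lane in (0, 31, 32, 63)])
```

For lanes 0..31 the high pass adds nothing, which matches the Wave32 note above.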

V_CVT_PK_NORM_I16_F32

801

Convert two single-precision floats into a packed signed normalized value.
D0[15 : 0].i16 = f32_to_snorm(S0.f);
D0[31 : 16].i16 = f32_to_snorm(S1.f)

V_CVT_PK_NORM_U16_F32

802

Convert two single-precision floats into a packed unsigned normalized value.
D0[15 : 0].u16 = f32_to_unorm(S0.f);
D0[31 : 16].u16 = f32_to_unorm(S1.f)

V_CVT_PK_U16_U32

803

Convert two unsigned integers into a packed unsigned short.
D0[15 : 0].u16 = u32_to_u16(S0.u);
D0[31 : 16].u16 = u32_to_u16(S1.u)

V_CVT_PK_I16_I32

804

Convert two signed integers into a packed signed short.
D0[15 : 0].i16 = i32_to_i16(S0.i);
D0[31 : 16].i16 = i32_to_i16(S1.i)

V_SUB_NC_I32

805

Subtract the second signed integer from the first. No carry-in or carry-out.
D0.i = S0.i - S1.i

Notes
Supports saturation (signed 32-bit integer domain).

V_ADD_NC_I32

806

Add two signed integers. No carry-in or carry-out.
D0.i = S0.i + S1.i

Notes
Supports saturation (signed 32-bit integer domain).

V_ADD_F64

807

Add two double-precision values.
D0.f64 = S0.f64 + S1.f64

Notes
0.5 ULP precision; denormals are supported.

V_MUL_F64

808

Multiply two double-precision values.
D0.f64 = S0.f64 * S1.f64

Notes
0.5 ULP precision; denormals are supported.

V_MIN_F64

809

Compute the minimum of two double-precision floats.
LT_NEG_ZERO = lambda(a, b) (
((a < b) || ((abs(a) == 0.0) && (abs(b) == 0.0) && sign(a) && !sign(b))));
// Version of comparison where -0.0 < +0.0, differs from IEEE
if WAVE_MODE.IEEE then
if isSignalNAN(S0.f64) then
D0.f64 = cvtToQuietNAN(S0.f64)
elsif isSignalNAN(S1.f64) then
D0.f64 = cvtToQuietNAN(S1.f64)
elsif isQuietNAN(S1.f64) then
D0.f64 = S0.f64
elsif isQuietNAN(S0.f64) then
D0.f64 = S1.f64
elsif LT_NEG_ZERO(S0.f64, S1.f64) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f64 = S0.f64
else
D0.f64 = S1.f64
endif
else
if isNAN(S1.f64) then
D0.f64 = S0.f64
elsif isNAN(S0.f64) then
D0.f64 = S1.f64
elsif LT_NEG_ZERO(S0.f64, S1.f64) then
// NOTE: -0<+0 is TRUE in this comparison
D0.f64 = S0.f64
else
D0.f64 = S1.f64
endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.
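
The signed-zero and NaN handling of the non-IEEE path (WAVE_MODE.IEEE == 0) can be sketched in Python. This is a simplified model under stated assumptions: denorm flushing, saturation and the signaling-NaN path are not modelled, and the function names are hypothetical.

```python
import math


def lt_neg_zero(a, b):
    """Less-than comparison in which -0.0 < +0.0 is true."""
    a_neg = math.copysign(1.0, a) < 0
    b_neg = math.copysign(1.0, b) < 0
    return a < b or (a == 0.0 and b == 0.0 and a_neg and not b_neg)


def min_f64_non_ieee(s0, s1):
    """Model of the non-IEEE branch of the V_MIN_F64 pseudocode."""
    if math.isnan(s1):
        return s0
    if math.isnan(s0):
        return s1
    return s0 if lt_neg_zero(s0, s1) else s1


print(min_f64_non_ieee(0.0, -0.0))       # -0.0
print(min_f64_non_ieee(math.nan, 2.0))   # 2.0
```

A NaN result is only returned when both inputs are NaN.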

V_MAX_F64

810

Compute the maximum of two double-precision floats.
GT_NEG_ZERO = lambda(a, b) (
((a > b) || ((abs(a) == 0.0) && (abs(b) == 0.0) && !sign(a) && sign(b))));
// Version of comparison where +0.0 > -0.0, differs from IEEE
if WAVE_MODE.IEEE then
if isSignalNAN(S0.f64) then
D0.f64 = cvtToQuietNAN(S0.f64)
elsif isSignalNAN(S1.f64) then
D0.f64 = cvtToQuietNAN(S1.f64)
elsif isQuietNAN(S1.f64) then
D0.f64 = S0.f64
elsif isQuietNAN(S0.f64) then
D0.f64 = S1.f64
elsif GT_NEG_ZERO(S0.f64, S1.f64) then
// NOTE: +0>-0 is TRUE in this comparison
D0.f64 = S0.f64
else
D0.f64 = S1.f64
endif
else
if isNAN(S1.f64) then
D0.f64 = S0.f64
elsif isNAN(S0.f64) then
D0.f64 = S1.f64
elsif GT_NEG_ZERO(S0.f64, S1.f64) then
// NOTE: +0>-0 is TRUE in this comparison
D0.f64 = S0.f64
else
D0.f64 = S1.f64
endif
endif;
// Inequalities in the above pseudocode behave differently from IEEE
// when both inputs are +-0.

Notes
IEEE compliant. Supports denormals, round mode, exception flags, saturation.
Denorm flushing for this operation is effectively controlled by the input denorm mode control: If input
denorm mode is disabling denorm, the internal result of a min/max operation cannot be a denorm value, so
output denorm mode is irrelevant. If input denorm mode is enabling denorm, the internal min/max result can
be a denorm and this operation outputs as a denorm regardless of output denorm mode.

V_LDEXP_F64

811

Load exponent.
Multiply a double-precision float by an integral power of 2; compare with the ldexp() function in C.
D0.f64 = S0.f64 * 2.0 ** S1.i

V_MUL_LO_U32

812

Multiply two unsigned integers.
D0.u = S0.u * S1.u

Notes
If you only need to multiply integers with small magnitudes consider V_MUL_U32_U24, which is a more
efficient implementation.

V_MUL_HI_U32

813

Multiply two unsigned integers and store the high 32 bits of the result.
D0.u = 32'U(64'U(S0.u) * 64'U(S1.u) >> 32U)

Notes
If you only need to multiply integers with small magnitudes consider V_MUL_HI_U32_U24, which is a more
efficient implementation.

V_MUL_HI_I32

814

Multiply two signed integers and store the high 32 bits of the result.
D0.i = 32'I(64'I(S0.i) * 64'I(S1.i) >> 32U)

Notes
If you only need to multiply integers with small magnitudes consider V_MUL_HI_I32_I24, which is a more
efficient implementation.
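
The difference between the unsigned and signed high-half multiplies can be shown with a minimal Python sketch. The function names are hypothetical, and the signed variant returns a Python signed integer rather than a 32-bit pattern.

```python
MASK32 = 0xFFFFFFFF


def to_i32(x):
    """Reinterpret the low 32 bits of x as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul_hi_u32(s0, s1):
    """High 32 bits of the unsigned 64-bit product."""
    return ((s0 & MASK32) * (s1 & MASK32)) >> 32


def mul_hi_i32(s0, s1):
    """High 32 bits of the signed 64-bit product (arithmetic shift)."""
    return (to_i32(s0) * to_i32(s1)) >> 32


# The same bit patterns give different high halves.
print(hex(mul_hi_u32(0xFFFFFFFF, 0xFFFFFFFF)))  # 0xfffffffe
print(mul_hi_i32(0xFFFFFFFF, 0xFFFFFFFF))       # 0, since (-1) * (-1) = 1
```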

V_TRIG_PREOP_F64

815

Look Up 2/PI (S0.f64) with segment select S1.u[4:0].
This operation returns an aligned, double precision segment of 2/PI needed to do range reduction on S0.f64
(double-precision value). Multiple segments can be specified through S1.u[4:0]. Rounding is round-to-zero.
Large inputs (exp > 1968) are scaled to avoid loss of precision through denormalization.
shift = 32'I(S1[4 : 0].u) * 53;
if exponent(S0.f64) > 1077 then
shift += exponent(S0.f64) - 1077
endif;
// (2.0/PI) == 0.{b_1200, b_1199, b_1198, ..., b_1, b_0}
// b_1200 is the MSB of the fractional part of 2.0/PI
// Left shift operation indicates which bits are brought
// into the whole part of the number.
// Only whole part of result is kept.
result = 64'F((1201'B(2.0 / PI)[1200 : 0] << shift.u) & 1201'0x1fffffffffffff);
scale = -53 - shift;
if exponent(S0.f64) >= 1968 then
scale += 128
endif;
D0.f64 = ldexp(result, scale)

V_LSHLREV_B16

824

Logical shift left, count is in the first operand.
D0.u[15 : 0] = S1.u[15 : 0] << S0.u[3 : 0].u

V_LSHRREV_B16

825

Logical shift right, count is in the first operand.
D0.u[15 : 0] = S1.u[15 : 0] >> S0.u[3 : 0].u

V_ASHRREV_I16

826

Arithmetic shift right (preserve sign bit), count is in the first operand.
D0.i[15 : 0] = 16'I(signext(S1.i[15 : 0]) >> S0.u[3 : 0].u)
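
The three 16-bit shifts above differ only in direction and sign handling. A minimal Python sketch, with hypothetical function names; results are returned as 16-bit patterns and the shift count uses only S0[3:0].

```python
def lshlrev_b16(s0, s1):
    """Logical shift left of S1[15:0] by S0[3:0]."""
    return ((s1 & 0xFFFF) << (s0 & 0xF)) & 0xFFFF


def lshrrev_b16(s0, s1):
    """Logical shift right of S1[15:0] by S0[3:0]."""
    return (s1 & 0xFFFF) >> (s0 & 0xF)


def ashrrev_i16(s0, s1):
    """Arithmetic shift right of S1[15:0] by S0[3:0], preserving the sign."""
    value = s1 & 0xFFFF
    if value & 0x8000:
        value -= 0x10000  # sign-extend before shifting
    return (value >> (s0 & 0xF)) & 0xFFFF


print(hex(lshrrev_b16(4, 0x8000)))  # 0x800
print(hex(ashrrev_i16(4, 0x8000)))  # 0xf800
```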

V_LSHLREV_B64

828

Logical shift left, count is in the first operand.
D0.u64 = S1.u64 << S0.u[5 : 0].u

Notes
Only one scalar broadcast constant is allowed.

V_LSHRREV_B64

829

Logical shift right, count is in the first operand.
D0.u64 = S1.u64 >> S0.u[5 : 0].u

Notes
Only one scalar broadcast constant is allowed.

V_ASHRREV_I64

830

Arithmetic shift right (preserve sign bit), count is in the first operand.
D0.u64 = 64'U(signext(S1.u64) >> S0.u[5 : 0].u)

Notes
Only one scalar broadcast constant is allowed.

V_READLANE_B32

864

Copy one VGPR value from a single lane to one SGPR.
declare lane : 32'U;
if WAVE32 then
lane = S1.u[4 : 0].u;
// Lane select for wave32
else
lane = S1.u[5 : 0].u;
// Lane select for wave64
endif;
D0.b = VGPR[lane][SRC0.u]

Notes
Ignores EXEC mask for the VGPR read. Input and output modifiers not supported; this is an untyped operation.

V_WRITELANE_B32

865

Write scalar value into one VGPR in one lane.
declare lane : 32'U;
if WAVE32 then
lane = S1.u[4 : 0].u;
// Lane select for wave32
else
lane = S1.u[5 : 0].u;
// Lane select for wave64
endif;
VGPR[lane][VDST.u] = S0.b

Notes
Ignores EXEC mask for the VGPR write. Input and output modifiers not supported; this is an untyped
operation.
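
The lane selection used by V_READLANE_B32 and V_WRITELANE_B32 can be sketched in Python. The register file is modelled as a hypothetical list indexed as `vgpr[lane][register]`, following the pseudocode; EXEC is not consulted, and all names are assumptions for this example.

```python
def select_lane(s1, wave32):
    """Use S1[4:0] in wave32 mode and S1[5:0] in wave64 mode."""
    return s1 & (0x1F if wave32 else 0x3F)


def readlane_b32(vgpr, src0, s1, wave32=True):
    """Return the value of register src0 in the selected lane."""
    return vgpr[select_lane(s1, wave32)][src0]


def writelane_b32(vgpr, vdst, s0, s1, wave32=True):
    """Write the scalar s0 into register vdst of the selected lane only."""
    vgpr[select_lane(s1, wave32)][vdst] = s0 & 0xFFFFFFFF


# 32 lanes with 4 registers each, all zero.
vgpr = [[0] * 4 for _ in range(32)]
writelane_b32(vgpr, 2, 0xDEADBEEF, 37)   # 37 & 0x1F selects lane 5
print(hex(readlane_b32(vgpr, 2, 5)))
```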

V_AND_B16

866

Bitwise AND.
D0.u16 = (S0.u16 & S1.u16)

Notes
Input and output modifiers not supported.

V_OR_B16

867

Bitwise OR.
D0.u16 = (S0.u16 | S1.u16)

Notes
Input and output modifiers not supported.

V_XOR_B16

868

Bitwise XOR.
D0.u16 = (S0.u16 ^ S1.u16)

Notes
Input and output modifiers not supported.

V_CMP_F_F16

0

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_F16

1

A less than B.
D0.u64[laneId] = S0.f16 < S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
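
Each compare writes one bit per lane into the destination mask. A minimal Python sketch of how D0.u64[laneId] accumulates across lanes, assuming every lane is active; the function name and the list-of-lanes representation are hypothetical.

```python
def v_cmp_lt(s0_lanes, s1_lanes):
    """Return a bit mask with bit laneId set where s0 < s1 in that lane."""
    d0 = 0
    for lane_id, (a, b) in enumerate(zip(s0_lanes, s1_lanes)):
        if a < b:
            d0 |= 1 << lane_id
    return d0


# Four lanes shown for brevity: lanes 0 and 2 satisfy the comparison.
print(bin(v_cmp_lt([1.0, 5.0, 2.0, 9.0], [3.0, 3.0, 3.0, 3.0])))  # 0b101
```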

V_CMP_EQ_F16

2

A equal to B.
D0.u64[laneId] = S0.f16 == S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_F16

3

A less than or equal to B.
D0.u64[laneId] = S0.f16 <= S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_F16

4

A greater than B.
D0.u64[laneId] = S0.f16 > S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LG_F16

5

A less than or greater than B.
D0.u64[laneId] = S0.f16 <> S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_F16

6

A greater than or equal to B.
D0.u64[laneId] = S0.f16 >= S1.f16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_O_F16

7

A orderable with B.
D0.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_U_F16

8

A not orderable with B.
D0.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGE_F16

9

A not greater than or equal to B.
D0.u64[laneId] = !(S0.f16 >= S1.f16);
// With NAN inputs this is not the same operation as <
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLG_F16

10

A not less than or greater than B.
D0.u64[laneId] = !(S0.f16 <> S1.f16);
// With NAN inputs this is not the same operation as ==
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGT_F16

11

A not greater than B.
D0.u64[laneId] = !(S0.f16 > S1.f16);
// With NAN inputs this is not the same operation as <=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLE_F16

12

A not less than or equal to B.
D0.u64[laneId] = !(S0.f16 <= S1.f16);
// With NAN inputs this is not the same operation as >
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NEQ_F16

13

A not equal to B.
D0.u64[laneId] = !(S0.f16 == S1.f16);
// With NAN inputs this is not the same operation as !=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLT_F16

14

A not less than B.
D0.u64[laneId] = !(S0.f16 < S1.f16);
// With NAN inputs this is not the same operation as >=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
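
The comments on the negated comparisons note that they differ from the complementary operator when an input is NaN. A minimal Python sketch of that difference, relying on Python floats following IEEE 754 comparison rules; the function names are hypothetical and exception signalling is not modelled.

```python
import math


def cmp_lt(a, b):
    return a < b


def cmp_nge(a, b):
    return not (a >= b)


def cmp_lg(a, b):
    # "<>" in the pseudocode: less than or greater than.
    return a < b or a > b


def cmp_neq(a, b):
    return not (a == b)


nan = math.nan
print(cmp_lt(nan, 1.0), cmp_nge(nan, 1.0))   # False True
print(cmp_lg(nan, 1.0), cmp_neq(nan, 1.0))   # False True
print(cmp_lt(0.5, 1.0), cmp_nge(0.5, 1.0))   # True True
```

For ordered inputs each negated form matches its complementary operator; the results only diverge when at least one input is NaN.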

V_CMP_T_F16

15

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_F32

16

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_F32

17

A less than B.
D0.u64[laneId] = S0.f < S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_F32

18

A equal to B.
D0.u64[laneId] = S0.f == S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_F32

19

A less than or equal to B.
D0.u64[laneId] = S0.f <= S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_F32

20

A greater than B.
D0.u64[laneId] = S0.f > S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LG_F32

21

A less than or greater than B.
D0.u64[laneId] = S0.f <> S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_F32

22

A greater than or equal to B.
D0.u64[laneId] = S0.f >= S1.f;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_O_F32

23

A orderable with B.
D0.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_U_F32

24

A not orderable with B.
D0.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f)));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGE_F32

25

A not greater than or equal to B.
D0.u64[laneId] = !(S0.f >= S1.f);
// With NAN inputs this is not the same operation as <
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLG_F32

26

A not less than or greater than B.
D0.u64[laneId] = !(S0.f <> S1.f);
// With NAN inputs this is not the same operation as ==
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGT_F32

27

A not greater than B.
D0.u64[laneId] = !(S0.f > S1.f);
// With NAN inputs this is not the same operation as <=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLE_F32

28

A not less than or equal to B.
D0.u64[laneId] = !(S0.f <= S1.f);
// With NAN inputs this is not the same operation as >
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NEQ_F32

29

A not equal to B.
D0.u64[laneId] = !(S0.f == S1.f);
// With NAN inputs this is not the same operation as !=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLT_F32

30

A not less than B.
D0.u64[laneId] = !(S0.f < S1.f);
// With NAN inputs this is not the same operation as >=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_F32

31

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_F64

32

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_F64

33

A less than B.
D0.u64[laneId] = S0.f64 < S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_F64

34

A equal to B.
D0.u64[laneId] = S0.f64 == S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_F64

35

A less than or equal to B.
D0.u64[laneId] = S0.f64 <= S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_F64

36

A greater than B.
D0.u64[laneId] = S0.f64 > S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LG_F64

37

A less than or greater than B.
D0.u64[laneId] = S0.f64 <> S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_F64

38

A greater than or equal to B.
D0.u64[laneId] = S0.f64 >= S1.f64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_O_F64

39

A orderable with B.
D0.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_U_F64

40

A not orderable with B.
D0.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64));
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGE_F64

41

A not greater than or equal to B.
D0.u64[laneId] = !(S0.f64 >= S1.f64);
// With NAN inputs this is not the same operation as <
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLG_F64

42

A not less than or greater than B.
D0.u64[laneId] = !(S0.f64 <> S1.f64);
// With NAN inputs this is not the same operation as ==
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NGT_F64

43

A not greater than B.
D0.u64[laneId] = !(S0.f64 > S1.f64);
// With NAN inputs this is not the same operation as <=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLE_F64

44

A not less than or equal to B.
D0.u64[laneId] = !(S0.f64 <= S1.f64);
// With NAN inputs this is not the same operation as >
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NEQ_F64

45

A not equal to B.
D0.u64[laneId] = !(S0.f64 == S1.f64);
// With NAN inputs this is not the same operation as !=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NLT_F64

46

A not less than B.
D0.u64[laneId] = !(S0.f64 < S1.f64);
// With NAN inputs this is not the same operation as >=
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_F64

47

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_I16

49

A less than B.
D0.u64[laneId] = S0.i16 < S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_I16

50

A equal to B.
D0.u64[laneId] = S0.i16 == S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_I16

51

A less than or equal to B.
D0.u64[laneId] = S0.i16 <= S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_I16

52

A greater than B.
D0.u64[laneId] = S0.i16 > S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_I16

53

A not equal to B.
D0.u64[laneId] = S0.i16 <> S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_I16

54

A greater than or equal to B.
D0.u64[laneId] = S0.i16 >= S1.i16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_U16

57

A less than B.
D0.u64[laneId] = S0.u16 < S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_U16

58

A equal to B.
D0.u64[laneId] = S0.u16 == S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_U16

59

A less than or equal to B.
D0.u64[laneId] = S0.u16 <= S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_U16

60

A greater than B.
D0.u64[laneId] = S0.u16 > S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_U16

61

A not equal to B.
D0.u64[laneId] = S0.u16 <> S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_U16

62

A greater than or equal to B.
D0.u64[laneId] = S0.u16 >= S1.u16;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_I32

64

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_I32

65

A less than B.
D0.u64[laneId] = S0.i < S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_I32

66

A equal to B.
D0.u64[laneId] = S0.i == S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_I32

67

A less than or equal to B.
D0.u64[laneId] = S0.i <= S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_I32

68

A greater than B.
D0.u64[laneId] = S0.i > S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_I32

69

A not equal to B.
D0.u64[laneId] = S0.i <> S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_I32

70

A greater than or equal to B.
D0.u64[laneId] = S0.i >= S1.i;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_I32

71

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_U32

72

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_U32

73

A less than B.
D0.u64[laneId] = S0.u < S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_U32

74

A equal to B.
D0.u64[laneId] = S0.u == S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_U32

75

A less than or equal to B.
D0.u64[laneId] = S0.u <= S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_U32

76

A greater than B.
D0.u64[laneId] = S0.u > S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_U32

77

A not equal to B.
D0.u64[laneId] = S0.u <> S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_U32

78

A greater than or equal to B.
D0.u64[laneId] = S0.u >= S1.u;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_U32

79

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_I64

80

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_I64

81

A less than B.
D0.u64[laneId] = S0.i64 < S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_I64

82

A equal to B.
D0.u64[laneId] = S0.i64 == S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_I64

83

A less than or equal to B.
D0.u64[laneId] = S0.i64 <= S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_I64

84

A greater than B.
D0.u64[laneId] = S0.i64 > S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_I64

85

A not equal to B.
D0.u64[laneId] = S0.i64 <> S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_I64

86

A greater than or equal to B.
D0.u64[laneId] = S0.i64 >= S1.i64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_I64

87

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_F_U64

88

False.
D0.u64[laneId] = 1'0U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LT_U64

89

A less than B.
D0.u64[laneId] = S0.u64 < S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_EQ_U64

90

A equal to B.
D0.u64[laneId] = S0.u64 == S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_LE_U64

91

A less than or equal to B.
D0.u64[laneId] = S0.u64 <= S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GT_U64

92

A greater than B.
D0.u64[laneId] = S0.u64 > S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_NE_U64

93

A not equal to B.
D0.u64[laneId] = S0.u64 <> S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_GE_U64

94

A greater than or equal to B.
D0.u64[laneId] = S0.u64 >= S1.u64;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_T_U64

95

True.
D0.u64[laneId] = 1'1U;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMP_CLASS_F16

125

IEEE numeric class function specified in S1.u, performed on S0.f16.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f16)) then
    result = S1.u[0]
elsif isQuietNAN(64'F(S0.f16)) then
    result = S1.u[1]
elsif exponent(S0.f16) == 31 then
    // +-INF
    result = S1.u[sign(S0.f16) ? 2 : 9]
elsif exponent(S0.f16) > 0 then
    // +-normal value
    result = S1.u[sign(S0.f16) ? 3 : 8]
elsif 64'F(abs(S0.f16)) > 0.0 then
    // +-denormal value
    result = S1.u[sign(S0.f16) ? 4 : 7]
else
    // +-0.0
    result = S1.u[sign(S0.f16) ? 5 : 6]
endif;
D0.u64[laneId] = result;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
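The classification above can be sketched directly on the raw 16-bit pattern. This is an illustrative model only: classify_f16 and v_cmp_class_f16 are hypothetical helper names, and the sketch assumes the usual IEEE 754 convention that a set most-significant mantissa bit marks a quiet NAN.

```python
def classify_f16(bits):
    """Return the S1 bit index (0..9) selected by a raw IEEE binary16 pattern."""
    sign = (bits >> 15) & 1
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    if exponent == 31 and mantissa != 0:
        # Assumption: top mantissa bit set means quiet NAN, clear means signaling.
        return 1 if mantissa & 0x200 else 0
    if exponent == 31:
        return 2 if sign else 9   # +-INF
    if exponent > 0:
        return 3 if sign else 8   # +-normal value
    if mantissa != 0:
        return 4 if sign else 7   # +-denormal value
    return 5 if sign else 6       # +-0.0


def v_cmp_class_f16(bits, s1_mask):
    """Per-lane result: 1 if the class of the value is selected in s1_mask."""
    return (s1_mask >> classify_f16(bits)) & 1
```

For example, a mask with bits 2 and 9 set (0x204) reports true only for the two infinities.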

V_CMP_CLASS_F32

126

IEEE numeric class function specified in S1.u, performed on S0.f.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f)) then
    result = S1.u[0]
elsif isQuietNAN(64'F(S0.f)) then
    result = S1.u[1]
elsif exponent(S0.f) == 255 then
    // +-INF
    result = S1.u[sign(S0.f) ? 2 : 9]
elsif exponent(S0.f) > 0 then
    // +-normal value
    result = S1.u[sign(S0.f) ? 3 : 8]
elsif 64'F(abs(S0.f)) > 0.0 then
    // +-denormal value
    result = S1.u[sign(S0.f) ? 4 : 7]
else
    // +-0.0
    result = S1.u[sign(S0.f) ? 5 : 6]
endif;
D0.u64[laneId] = result;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
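Because S1 is a bit mask, several classes can be tested in one instruction. The sketch below shows how common predicates might be expressed as S1 masks. The constant names NAN_MASK, INF_MASK and FINITE_MASK and the helper functions are hypothetical, and the quiet/signaling split assumes the usual IEEE 754 convention (top mantissa bit set means quiet).

```python
import struct

NAN_MASK = 0b00_0000_0011     # bits 0 and 1: signaling or quiet NAN
INF_MASK = 0b10_0000_0100     # bits 2 and 9: negative or positive infinity
FINITE_MASK = 0b01_1111_1000  # bits 3..8: normal, denormal and zero values


def f32_bits(x):
    """Round a Python float to binary32 and return its raw 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def class_index_f32(bits):
    """Return the S1 bit index (0..9) selected by a raw binary32 pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 255 and mantissa:
        return 1 if mantissa & 0x400000 else 0  # assumption: top bit = quiet
    if exponent == 255:
        return 2 if sign else 9
    if exponent > 0:
        return 3 if sign else 8
    if mantissa:
        return 4 if sign else 7
    return 5 if sign else 6


def v_cmp_class_f32(x, s1_mask):
    """Per-lane result: 1 if the class of x is selected in s1_mask."""
    return (s1_mask >> class_index_f32(f32_bits(x))) & 1
```

With these masks, v_cmp_class_f32(x, NAN_MASK) behaves like an isnan test and v_cmp_class_f32(x, INF_MASK) like an isinf test.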

V_CMP_CLASS_F64

127

IEEE numeric class function specified in S1.u, performed on S0.f64.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(S0.f64) then
    result = S1.u[0]
elsif isQuietNAN(S0.f64) then
    result = S1.u[1]
elsif exponent(S0.f64) == 2047 then
    // +-INF
    result = S1.u[sign(S0.f64) ? 2 : 9]
elsif exponent(S0.f64) > 0 then
    // +-normal value
    result = S1.u[sign(S0.f64) ? 3 : 8]
elsif abs(S0.f64) > 0.0 then
    // +-denormal value
    result = S1.u[sign(S0.f64) ? 4 : 7]
else
    // +-0.0
    result = S1.u[sign(S0.f64) ? 5 : 6]
endif;
D0.u64[laneId] = result;
// D0 = VCC in VOPC encoding.

Notes
Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_F16

128

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_F16

129

A less than B.
EXEC.u64[laneId] = S0.f16 < S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
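Unlike V_CMP, the V_CMPX forms write the per-lane result into EXEC itself, so lanes that fail the compare are disabled for subsequent instructions. A minimal sketch of that effect follows. The function name v_cmpx_lt is hypothetical, and the model relies on the observation that a lane which was already inactive keeps an EXEC bit of 0.

```python
def v_cmpx_lt(s0, s1, exec_mask):
    """Return the new EXEC mask after a per-lane 'less than' compare."""
    new_exec = 0
    for lane, (a, b) in enumerate(zip(s0, s1)):
        active = (exec_mask >> lane) & 1
        if active and a < b:
            new_exec |= 1 << lane  # lane stays enabled only if the compare passes
    return new_exec


s0 = [1.0, 5.0, 2.0, 0.0]
s1 = [2.0, 2.0, 3.0, 1.0]
# Lane 2 is inactive on entry, so it stays disabled even though 2.0 < 3.0.
new_exec = v_cmpx_lt(s0, s1, 0b1011)
```

In this model the new EXEC value is always a subset of the old one.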

V_CMPX_EQ_F16

130

A equal to B.
EXEC.u64[laneId] = S0.f16 == S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_F16

131

A less than or equal to B.
EXEC.u64[laneId] = S0.f16 <= S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_F16

132

A greater than B.
EXEC.u64[laneId] = S0.f16 > S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LG_F16

133

A less than or greater than B.
EXEC.u64[laneId] = S0.f16 <> S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_F16

134

A greater than or equal to B.
EXEC.u64[laneId] = S0.f16 >= S1.f16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_O_F16

135

A orderable with B.
EXEC.u64[laneId] = (!isNAN(64'F(S0.f16)) && !isNAN(64'F(S1.f16)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_U_F16

136

A not orderable with B.
EXEC.u64[laneId] = (isNAN(64'F(S0.f16)) || isNAN(64'F(S1.f16)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
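The O (orderable) and U (not orderable) tests are exact complements: a pair is orderable only when neither operand is a NAN. A small sketch, using hypothetical helper names and assuming all lanes are active:

```python
import math


def v_cmp_o(a, b):
    """True when neither input is a NAN."""
    return not math.isnan(a) and not math.isnan(b)


def v_cmp_u(a, b):
    """True when at least one input is a NAN."""
    return math.isnan(a) or math.isnan(b)


def lane_mask(pred, s0, s1):
    """Collect one result bit per lane (all lanes assumed active)."""
    mask = 0
    for lane, (a, b) in enumerate(zip(s0, s1)):
        if pred(a, b):
            mask |= 1 << lane
    return mask


s0 = [1.0, math.nan, math.inf, 0.0]
s1 = [2.0, 1.0, -math.inf, math.nan]
ordered_mask = lane_mask(v_cmp_o, s0, s1)
unordered_mask = lane_mask(v_cmp_u, s0, s1)
```

Infinities are orderable, so lane 2 is set in the ordered mask; only lanes 1 and 3, which hold a NAN, are set in the unordered mask.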

V_CMPX_NGE_F16

137

A not greater than or equal to B.
EXEC.u64[laneId] = !(S0.f16 >= S1.f16);
// With NAN inputs this is not the same operation as <

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLG_F16

138

A not less than or greater than B.
EXEC.u64[laneId] = !(S0.f16 <> S1.f16);
// With NAN inputs this is not the same operation as ==

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGT_F16

139

A not greater than B.
EXEC.u64[laneId] = !(S0.f16 > S1.f16);
// With NAN inputs this is not the same operation as <=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLE_F16

140

A not less than or equal to B.
EXEC.u64[laneId] = !(S0.f16 <= S1.f16);
// With NAN inputs this is not the same operation as >

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NEQ_F16

141

A not equal to B.
EXEC.u64[laneId] = !(S0.f16 == S1.f16);
// With NAN inputs this is not the same operation as !=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLT_F16

142

A not less than B.
EXEC.u64[laneId] = !(S0.f16 < S1.f16);
// With NAN inputs this is not the same operation as >=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.
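The negated compares (NGE, NLG, NGT, NLE, NEQ, NLT) agree with their positive counterparts on ordered inputs and differ only when an input is a NAN, because every ordinary comparison with a NAN is false. The sketch below checks this for two of the pairs using Python floats, which follow IEEE 754 comparison rules; the function names are hypothetical.

```python
import math


def lt(a, b):
    return a < b


def nge(a, b):
    return not (a >= b)


def lg(a, b):
    return a < b or a > b


def neq(a, b):
    return not (a == b)


ordered = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
unordered = [(math.nan, 1.0), (1.0, math.nan), (math.nan, math.nan)]

# On ordered inputs NGE matches LT and NEQ matches LG.
agree_on_ordered = all(
    lt(a, b) == nge(a, b) and lg(a, b) == neq(a, b) for a, b in ordered
)
# With a NAN input the negated forms are true and the positive forms are false.
differ_on_unordered = all(
    nge(a, b) and not lt(a, b) and neq(a, b) and not lg(a, b)
    for a, b in unordered
)
```

This is why a compiler cannot replace !(a >= b) with a < b unless it can prove that neither input is a NAN.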

V_CMPX_T_F16

143

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_F32

144

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_F32

145

A less than B.
EXEC.u64[laneId] = S0.f < S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_F32

146

A equal to B.
EXEC.u64[laneId] = S0.f == S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_F32

147

A less than or equal to B.
EXEC.u64[laneId] = S0.f <= S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_F32

148

A greater than B.
EXEC.u64[laneId] = S0.f > S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LG_F32

149

A less than or greater than B.
EXEC.u64[laneId] = S0.f <> S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_F32

150

A greater than or equal to B.
EXEC.u64[laneId] = S0.f >= S1.f

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_O_F32

151

A orderable with B.
EXEC.u64[laneId] = (!isNAN(64'F(S0.f)) && !isNAN(64'F(S1.f)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_U_F32

152

A not orderable with B.
EXEC.u64[laneId] = (isNAN(64'F(S0.f)) || isNAN(64'F(S1.f)))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGE_F32

153

A not greater than or equal to B.
EXEC.u64[laneId] = !(S0.f >= S1.f);
// With NAN inputs this is not the same operation as <

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLG_F32

154

A not less than or greater than B.
EXEC.u64[laneId] = !(S0.f <> S1.f);
// With NAN inputs this is not the same operation as ==

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGT_F32

155

A not greater than B.
EXEC.u64[laneId] = !(S0.f > S1.f);
// With NAN inputs this is not the same operation as <=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLE_F32

156

A not less than or equal to B.
EXEC.u64[laneId] = !(S0.f <= S1.f);
// With NAN inputs this is not the same operation as >

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NEQ_F32

157

A not equal to B.
EXEC.u64[laneId] = !(S0.f == S1.f);
// With NAN inputs this is not the same operation as !=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLT_F32

158

A not less than B.
EXEC.u64[laneId] = !(S0.f < S1.f);
// With NAN inputs this is not the same operation as >=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_F32

159

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_F64

160

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_F64

161

A less than B.
EXEC.u64[laneId] = S0.f64 < S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_F64

162

A equal to B.
EXEC.u64[laneId] = S0.f64 == S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_F64

163

A less than or equal to B.
EXEC.u64[laneId] = S0.f64 <= S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_F64

164

A greater than B.
EXEC.u64[laneId] = S0.f64 > S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LG_F64

165

A less than or greater than B.
EXEC.u64[laneId] = S0.f64 <> S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_F64

166

A greater than or equal to B.
EXEC.u64[laneId] = S0.f64 >= S1.f64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_O_F64

167

A orderable with B.
EXEC.u64[laneId] = (!isNAN(S0.f64) && !isNAN(S1.f64))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_U_F64

168

A not orderable with B.
EXEC.u64[laneId] = (isNAN(S0.f64) || isNAN(S1.f64))

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGE_F64

169

A not greater than or equal to B.
EXEC.u64[laneId] = !(S0.f64 >= S1.f64);
// With NAN inputs this is not the same operation as <

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLG_F64

170

A not less than or greater than B.
EXEC.u64[laneId] = !(S0.f64 <> S1.f64);
// With NAN inputs this is not the same operation as ==

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NGT_F64

171

A not greater than B.
EXEC.u64[laneId] = !(S0.f64 > S1.f64);
// With NAN inputs this is not the same operation as <=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLE_F64

172

A not less than or equal to B.
EXEC.u64[laneId] = !(S0.f64 <= S1.f64);
// With NAN inputs this is not the same operation as >

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NEQ_F64

173

A not equal to B.
EXEC.u64[laneId] = !(S0.f64 == S1.f64);
// With NAN inputs this is not the same operation as !=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NLT_F64

174

A not less than B.
EXEC.u64[laneId] = !(S0.f64 < S1.f64);
// With NAN inputs this is not the same operation as >=

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_F64

175

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_I16

177

A less than B.
EXEC.u64[laneId] = S0.i16 < S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_I16

178

A equal to B.
EXEC.u64[laneId] = S0.i16 == S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_I16

179

A less than or equal to B.
EXEC.u64[laneId] = S0.i16 <= S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_I16

180

A greater than B.
EXEC.u64[laneId] = S0.i16 > S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_I16

181

A not equal to B.
EXEC.u64[laneId] = S0.i16 <> S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_I16

182

A greater than or equal to B.
EXEC.u64[laneId] = S0.i16 >= S1.i16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_U16

185

A less than B.
EXEC.u64[laneId] = S0.u16 < S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_U16

186

A equal to B.
EXEC.u64[laneId] = S0.u16 == S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_U16

187

A less than or equal to B.
EXEC.u64[laneId] = S0.u16 <= S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_U16

188

A greater than B.
EXEC.u64[laneId] = S0.u16 > S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_U16

189

A not equal to B.
EXEC.u64[laneId] = S0.u16 <> S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_U16

190

A greater than or equal to B.
EXEC.u64[laneId] = S0.u16 >= S1.u16

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_I32

192

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_I32

193

A less than B.
EXEC.u64[laneId] = S0.i < S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_I32

194

A equal to B.
EXEC.u64[laneId] = S0.i == S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_I32

195

A less than or equal to B.
EXEC.u64[laneId] = S0.i <= S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_I32

196

A greater than B.
EXEC.u64[laneId] = S0.i > S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_I32

197

A not equal to B.
EXEC.u64[laneId] = S0.i <> S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_I32

198

A greater than or equal to B.
EXEC.u64[laneId] = S0.i >= S1.i

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_I32

199

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_U32

200

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_U32

201

A less than B.
EXEC.u64[laneId] = S0.u < S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_U32

202

A equal to B.
EXEC.u64[laneId] = S0.u == S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_U32

203

A less than or equal to B.
EXEC.u64[laneId] = S0.u <= S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_U32

204

A greater than B.
EXEC.u64[laneId] = S0.u > S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_U32

205

A not equal to B.
EXEC.u64[laneId] = S0.u <> S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_U32

206

A greater than or equal to B.
EXEC.u64[laneId] = S0.u >= S1.u

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_U32

207

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_I64

208

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_I64

209

A less than B.
EXEC.u64[laneId] = S0.i64 < S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_I64

210

A equal to B.
EXEC.u64[laneId] = S0.i64 == S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_I64

211

A less than or equal to B.
EXEC.u64[laneId] = S0.i64 <= S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_I64

212

A greater than B.
EXEC.u64[laneId] = S0.i64 > S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_I64

213

A not equal to B.
EXEC.u64[laneId] = S0.i64 <> S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_I64

214

A greater than or equal to B.
EXEC.u64[laneId] = S0.i64 >= S1.i64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_I64

215

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_F_U64

216

False.
EXEC.u64[laneId] = 1'0U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LT_U64

217

A less than B.
EXEC.u64[laneId] = S0.u64 < S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_EQ_U64

218

A equal to B.
EXEC.u64[laneId] = S0.u64 == S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_LE_U64

219

A less than or equal to B.
EXEC.u64[laneId] = S0.u64 <= S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GT_U64

220

A greater than B.
EXEC.u64[laneId] = S0.u64 > S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_NE_U64

221

A not equal to B.
EXEC.u64[laneId] = S0.u64 <> S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_GE_U64

222

A greater than or equal to B.
EXEC.u64[laneId] = S0.u64 >= S1.u64

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_T_U64

223

True.
EXEC.u64[laneId] = 1'1U

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_CLASS_F16

253

IEEE numeric class function specified in S1.u, performed on S0.f16.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f16)) then
    result = S1.u[0]
elsif isQuietNAN(64'F(S0.f16)) then
    result = S1.u[1]
elsif exponent(S0.f16) == 31 then
    // +-INF
    result = S1.u[sign(S0.f16) ? 2 : 9]
elsif exponent(S0.f16) > 0 then
    // +-normal value
    result = S1.u[sign(S0.f16) ? 3 : 8]
elsif 64'F(abs(S0.f16)) > 0.0 then
    // +-denormal value
    result = S1.u[sign(S0.f16) ? 4 : 7]
else
    // +-0.0
    result = S1.u[sign(S0.f16) ? 5 : 6]
endif;
EXEC.u64[laneId] = result

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_CLASS_F32

254

IEEE numeric class function specified in S1.u, performed on S0.f.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(64'F(S0.f)) then
    result = S1.u[0]
elsif isQuietNAN(64'F(S0.f)) then
    result = S1.u[1]
elsif exponent(S0.f) == 255 then
    // +-INF
    result = S1.u[sign(S0.f) ? 2 : 9]
elsif exponent(S0.f) > 0 then
    // +-normal value
    result = S1.u[sign(S0.f) ? 3 : 8]
elsif 64'F(abs(S0.f)) > 0.0 then
    // +-denormal value
    result = S1.u[sign(S0.f) ? 4 : 7]
else
    // +-0.0
    result = S1.u[sign(S0.f) ? 5 : 6]
endif;
EXEC.u64[laneId] = result

Notes
Write only EXEC. SDST must be set to EXEC_LO. Signal 'invalid' on sNAN's, and also on qNAN's if clamp is set.

V_CMPX_CLASS_F64

255

IEEE numeric class function specified in S1.u, performed on S0.f64.
The function reports true if the floating point value is any of the numeric types selected in S1.u according to the
following list:
S1.u[0] -- value is a signaling NAN.
S1.u[1] -- value is a quiet NAN.
S1.u[2] -- value is negative infinity.
S1.u[3] -- value is a negative normal value.
S1.u[4] -- value is a negative denormal value.
S1.u[5] -- value is negative zero.
S1.u[6] -- value is positive zero.
S1.u[7] -- value is a positive denormal value.
S1.u[8] -- value is a positive normal value.
S1.u[9] -- value is positive infinity.
declare result : 1'U;
if isSignalNAN(S0.f64) then
    result = S1.u[0]
elsif isQuietNAN(S0.f64) then
    result = S1.u[1]
elsif exponent(S0.f64) == 2047 then
    // +-INF
    result = S1.u[sign(S0.f64) ? 2 : 9]
elsif exponent(S0.f64) > 0 then
    // +-normal value
    result = S1.u[sign(S0.f64) ? 3 : 8]
elsif abs(S0.f64) > 0.0 then
    // +-denormal value
    result = S1.u[sign(S0.f64) ? 4 : 7]
else
    // +-0.0
    result = S1.u[sign(S0.f64) ? 5 : 6]
endif;
EXEC.u64[laneId] = result

Notes
Writes only EXEC. SDST must be set to EXEC_LO. Signals 'invalid' on sNANs, and also on qNANs if clamp is set.
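The class test above can be sketched in Python on the raw 64-bit pattern. This is an illustration only, assuming the standard IEEE-754 binary64 layout (1 sign bit, 11 exponent bits, 52 mantissa bits, quiet NANs marked by the top mantissa bit). The function names and the constant names are hypothetical and are not part of the ISA.

```python
import struct

# Bit positions in the S1 class mask, in the order listed above.
(SNAN, QNAN, NEG_INF, NEG_NORMAL, NEG_DENORM,
 NEG_ZERO, POS_ZERO, POS_DENORM, POS_NORMAL, POS_INF) = range(10)


def class_bit_f64(bits: int) -> int:
    """Return the class bit index (0..9) for a raw 64-bit pattern."""
    sign = (bits >> 63) & 1
    exponent = (bits >> 52) & 0x7FF        # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:                  # all ones: INF or NAN
        if mantissa == 0:
            return NEG_INF if sign else POS_INF
        # Assumption: top mantissa bit set means quiet, clear means signaling.
        return QNAN if (mantissa >> 51) & 1 else SNAN
    if exponent > 0:
        return NEG_NORMAL if sign else POS_NORMAL
    if mantissa != 0:
        return NEG_DENORM if sign else POS_DENORM
    return NEG_ZERO if sign else POS_ZERO


def cmp_class_f64(bits: int, mask: int) -> int:
    """Return 1 if the value's class is selected in mask, else 0."""
    return (mask >> class_bit_f64(bits)) & 1


def f64_bits(value: float) -> int:
    """Helper: raw bit pattern of a Python float (an IEEE double)."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]
```

For example, a mask with bits 3 and 8 set selects any normal value regardless of sign, so `cmp_class_f64(f64_bits(-2.5), (1 << 3) | (1 << 8))` yields 1. The sketch does not model the 'invalid' signaling or the EXEC write.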

16.13. VINTERP Instructions
Parameter interpolation VALU instructions.

V_INTERP_P10_F32

0

Parameter interpolation, first pass.
D0.f = S0[lane.i % 4 + 1].f * S1.f + S2[lane.i % 4].f

Notes
Performs a V_FMA_F32 operation using fixed DPP8 settings. S0 and S2 refer to a VGPR previously loaded with
LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate.
S0 uses a fixed DPP8 lane select of {1,1,1,1,5,5,5,5}.
S2 uses a fixed DPP8 lane select of {0,0,0,0,4,4,4,4}.
Example usage:
s_mov_b32 m0, s0                  // assume s0 contains newprim mask
lds_param_load v0, attr0          // v0 is a temporary register
v_interp_p10_f32 v3, v0, v1, v0   // v1 contains i coordinate
v_interp_p2_f32 v3, v0, v2, v3    // v2 contains j coordinate

V_INTERP_P2_F32

1

Parameter interpolation, second pass.
D0.f = fma(S0[lane.i % 4 + 2].f, S1.f, S2.f)

Notes
Performs a V_FMA_F32 operation using fixed DPP8 settings. S0 refers to a VGPR previously loaded with
LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate. S2 is the result of a
previous V_INTERP_P10_F32 instruction.
S0 uses a fixed DPP8 lane select of {2,2,2,2,6,6,6,6}.
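The two-pass sequence can be modeled in Python as a rough sketch. It assumes that a DPP8 select makes lane i read lane sel[i % 8] within its own group of 8 lanes, and that each quad of the parameter VGPR holds {P0, P10, P20, 0.0} as left by LDS_PARAM_LOAD. The function names are hypothetical. Python floats are double precision and the multiply-add here is not fused, so rounding may differ from V_FMA_F32.

```python
# Fixed DPP8 lane selects quoted in the notes above.
P10_S0_SEL = (1, 1, 1, 1, 5, 5, 5, 5)   # picks P10 of the lane's quad
P10_S2_SEL = (0, 0, 0, 0, 4, 4, 4, 4)   # picks P0 of the lane's quad
P2_S0_SEL = (2, 2, 2, 2, 6, 6, 6, 6)    # picks P20 of the lane's quad


def dpp8_read(vgpr, lane, sel):
    """Read the lane chosen by a DPP8 select within the lane's group of 8."""
    group_base = lane & ~7
    return vgpr[group_base + sel[lane & 7]]


def interp_p10(param, i_coord):
    """First pass: P10 * I + P0 for every lane."""
    return [
        dpp8_read(param, lane, P10_S0_SEL) * i_coord[lane]
        + dpp8_read(param, lane, P10_S2_SEL)
        for lane in range(len(param))
    ]


def interp_p2(param, j_coord, acc):
    """Second pass: P20 * J + (result of the first pass)."""
    return [
        dpp8_read(param, lane, P2_S0_SEL) * j_coord[lane] + acc[lane]
        for lane in range(len(param))
    ]


# Two quads, each holding {P0, P10, P20, 0.0}.
param = [1.0, 2.0, 3.0, 0.0, 10.0, 20.0, 30.0, 0.0]
i_coord = [0.5] * 8
j_coord = [0.25] * 8
result = interp_p2(param, j_coord, interp_p10(param, i_coord))
```

Every lane of a quad ends up with P0 + P10 * I + P20 * J for that quad, evaluated at the lane's own I/J coordinates.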

V_INTERP_P10_F16_F32

2

Parameter interpolation, first pass.
D0.f = 32'F(S0[lane.i % 4 + 1].f16) * S1.f + 32'F(S2[lane.i % 4].f16)

Notes
Performs a hybrid 16/32-bit multiply-add operation using fixed DPP8 settings. S0 and S2 refer to a VGPR
previously loaded with LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate.
S0 uses a fixed DPP8 lane select of {1,1,1,1,5,5,5,5}.
S2 uses a fixed DPP8 lane select of {0,0,0,0,4,4,4,4}.
OPSEL is allowed for S0 and S2 to specify which half of the register to read from.
Note that the I/J coordinate is 32-bit and the destination is also 32-bit.

V_INTERP_P2_F16_F32

3

Parameter interpolation, second pass.
D0.f16 = 16'F(32'F(S0[lane.i % 4 + 2].f16) * S1.f + S2.f)

Notes
Performs a hybrid 16/32-bit multiply-add operation using fixed DPP8 settings. S0 refers to a VGPR previously
loaded with LDS_PARAM_LOAD that contains packed interpolation data. S1 is the I/J coordinate. S2 is the
result of a previous V_INTERP_P10_F16_F32 instruction.
S0 uses a fixed DPP8 lane select of {2,2,2,2,6,6,6,6}.
OPSEL is allowed for D and S0 to specify which half of the register to write to/read from.
Note that the I/J coordinate is 32-bit and the accumulator input is also 32-bit.

V_INTERP_P10_RTZ_F16_F32

4

Same as V_INTERP_P10_F16_F32 except rounding mode is overridden to round toward zero.

V_INTERP_P2_RTZ_F16_F32

5

Same as V_INTERP_P2_F16_F32 except rounding mode is overridden to round toward zero.

16.14. Parameter and Direct Load from LDS Instructions
These instructions load data from LDS into a VGPR where the LDS address is derived from wave state and the
M0 register.

LDS_PARAM_LOAD

0

Transfer parameter data from LDS to VGPRs and expand data in LDS using the NewPrimMask (provided in M0)
to place per-quad data into lanes 0-3 of each quad as follows:
{P0, P10, P20, 0.0}
This data may be extracted using DPP8 for interpolation operations. The V_INTERP_* instructions unpack data
automatically.
When loading FP16 parameters, two attributes are loaded into a single VGPR: Attribute 2*ATTR is loaded into
the low 16 bits and attribute 2*ATTR+1 is loaded into the high 16 bits.
This instruction runs in whole quad mode: if any pixel of a quad is active then all 4 pixels of that quad are
written. This is required for interpolation instructions to have all the parameter information available for the
quad.

LDS_DIRECT_LOAD

1

Read a single 32-bit value from LDS to all lanes. A single DWORD is read from LDS memory at ADDR[M0[15:0]],
where M0[15:0] is a byte address and is dword-aligned. M0[18:16] specify the data type for the read and may be
0=UBYTE, 1=USHORT, 2=DWORD, 4=SBYTE, 5=SSHORT.



This instruction runs in whole quad mode: if any pixel of a quad is active then all 4 pixels of
that quad are written.
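As an illustration of the M0 layout used by LDS_DIRECT_LOAD, the following Python sketch splits M0 into its address and data-type fields. The function name is hypothetical, and raising an error for an unlisted type code is a choice made for this sketch; the text above does not say how hardware treats such codes.

```python
# Data type codes for M0[18:16] as listed above.
DATA_TYPES = {0: "UBYTE", 1: "USHORT", 2: "DWORD", 4: "SBYTE", 5: "SSHORT"}


def decode_direct_load_m0(m0: int):
    """Split M0 into (byte_address, data_type_name)."""
    byte_address = m0 & 0xFFFF       # M0[15:0]
    type_code = (m0 >> 16) & 0x7     # M0[18:16]
    if type_code not in DATA_TYPES:
        # Assumption for this sketch only: reject codes not in the list.
        raise ValueError(f"unlisted data type code {type_code}")
    return byte_address, DATA_TYPES[type_code]
```

For instance, `decode_direct_load_m0(0x00020040)` gives byte address 0x40 with data type DWORD.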

16.15. LDS & GDS Instructions
This suite of instructions operates on data stored within the data share memory. The instructions transfer data
between VGPRs and data share memory.
The bitfield map for the LDS/GDS is:

OFFSET0 = Unsigned byte offset added to the address from the ADDR VGPR.
OFFSET1 = Unsigned byte offset added to the address from the ADDR VGPR.
GDS     = Set if GDS, cleared if LDS.
OP      = DS instruction opcode.
ADDR    = Source LDS address VGPR 0 - 255.
DATA0   = Source data0 VGPR 0 - 255.
DATA1   = Source data1 VGPR 0 - 255.
VDST    = Destination VGPR 0 - 255.



All instructions with RTN in the name return the value that was in memory before the
operation was performed.

DS_ADD_U32

0

Add data register to memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u += DATA.u;
RETURN_DATA.u = tmp

DS_SUB_U32

1

Subtract data register from memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u -= DATA.u;
RETURN_DATA.u = tmp

DS_RSUB_U32

2

Subtraction with reversed operands.
tmp = MEM[ADDR].b;
MEM[ADDR] = DATA.b - MEM[ADDR].b;
RETURN_DATA = tmp

DS_INC_U32

3

Increment memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U;
RETURN_DATA.u = tmp

DS_DEC_U32

4

Decrement memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U;
RETURN_DATA.u = tmp
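The wraparound rules of DS_INC_U32 and DS_DEC_U32 are easy to misread, so here is a small Python sketch of both updates. The function names are hypothetical, values are masked to 32 bits to imitate unsigned registers, and atomicity is not modeled.

```python
MASK32 = 0xFFFFFFFF


def ds_inc_u32(mem_value: int, data: int):
    """Return (new_memory_value, returned_pre_op_value)."""
    tmp = mem_value & MASK32
    src = data & MASK32
    new = 0 if tmp >= src else tmp + 1
    return new, tmp


def ds_dec_u32(mem_value: int, data: int):
    """Return (new_memory_value, returned_pre_op_value)."""
    tmp = mem_value & MASK32
    src = data & MASK32
    new = src if (tmp == 0 or tmp > src) else tmp - 1
    return new, tmp


# A counter cycling through 0..3 when DATA = 3.
value, seen = 0, []
for _ in range(6):
    value, old = ds_inc_u32(value, 3)
    seen.append(old)
```

With DATA = 3 the memory value walks 0, 1, 2, 3 and then wraps to 0, so the register value acts as an inclusive upper bound. The decrement form also resets to the register value when memory holds something larger than it.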

DS_MIN_I32

5

Minimum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src < tmp ? src : tmp;
RETURN_DATA.i = tmp

DS_MAX_I32

6

Maximum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src > tmp ? src : tmp;
RETURN_DATA.i = tmp

DS_MIN_U32

7

Minimum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src < tmp ? src : tmp;
RETURN_DATA.u = tmp

DS_MAX_U32

8

Maximum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src > tmp ? src : tmp;
RETURN_DATA.u = tmp

DS_AND_B32

9

Bitwise AND of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp & DATA.b);
RETURN_DATA.b = tmp

DS_OR_B32

10

Bitwise OR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp | DATA.b);
RETURN_DATA.b = tmp

DS_XOR_B32

11

Bitwise XOR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp ^ DATA.b);
RETURN_DATA.b = tmp

DS_MSKOR_B32

12

Masked dword OR. D0 (DATA) contains the mask and D1 (DATA2) contains the new value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = ((tmp & ~DATA.b) | DATA2.b);
RETURN_DATA.b = tmp
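A minimal Python sketch of the masked OR, with a hypothetical function name and 32-bit masking added to imitate register width:

```python
MASK32 = 0xFFFFFFFF


def ds_mskor_b32(mem_value: int, mask: int, new_bits: int):
    """Return (new_memory_value, returned_pre_op_value)."""
    tmp = mem_value & MASK32
    # Clear the bits selected by mask, then OR in the new value.
    new = ((tmp & ~mask) | new_bits) & MASK32
    return new, tmp
```

Replacing bits 23:16 of 0xFFFF0000 with 0x12, for example, uses mask 0x00FF0000 and new value 0x00120000 and leaves 0xFF120000 in memory. Note that the new value is not itself masked, so any bits it sets outside the mask are ORed in as well.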

DS_STORE_B32

13

Write dword.
MEM[ADDR] = DATA.b

DS_STORE_2ADDR_B32

14

Write 2 dwords.
MEM[ADDR_BASE.u + OFFSET0.u * 4U] = DATA.b;
MEM[ADDR_BASE.u + OFFSET1.u * 4U] = DATA2.b

DS_STORE_2ADDR_STRIDE64_B32

15

Write 2 dwords with larger stride.
MEM[ADDR_BASE.u + OFFSET0.u * 4U * 64U] = DATA.b;
MEM[ADDR_BASE.u + OFFSET1.u * 4U * 64U] = DATA2.b
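The address arithmetic shared by the 2ADDR and 2ADDR_STRIDE64 forms can be sketched as follows. The helper name and its keyword parameters are hypothetical; `elem_bytes` is 4 for the B32 forms shown here, and the B64 forms scale by 8 instead.

```python
def two_addr(addr_base: int, offset0: int, offset1: int,
             elem_bytes: int = 4, stride64: bool = False):
    """Return the two byte addresses touched by a 2ADDR access."""
    # Offsets count elements, or blocks of 64 elements in the STRIDE64 form.
    scale = elem_bytes * (64 if stride64 else 1)
    return addr_base + offset0 * scale, addr_base + offset1 * scale
```

With a base of 0x100 and offsets 1 and 2, the plain form touches 0x104 and 0x108, while the STRIDE64 form touches 0x200 and 0x300.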

DS_CMPSTORE_B32

16

Compare and store.
tmp = MEM[ADDR].b;
src = DATA.b;
cmp = DATA2.b;
MEM[ADDR].b = tmp == cmp ? src : tmp;
RETURN_DATA.b = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_CMPSTORE_F32

17

Floating point compare and store that handles NAN/INF/denormal values.
tmp = MEM[ADDR].f;
src = DATA.f;
cmp = DATA2.f;
MEM[ADDR].f = tmp == cmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_MIN_F32

18

Minimum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src < tmp ? src : tmp;
RETURN_DATA = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_MAX_F32

19

Maximum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src > tmp ? src : tmp;
RETURN_DATA = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_NOP

20

Do nothing.

DS_ADD_F32

21

Add data register to floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src + tmp;
RETURN_DATA.f = tmp

Notes
Floating-point addition handles NAN/INF/denorm.

DS_STORE_B8

30

Byte write.
MEM[ADDR].b8 = DATA[7 : 0].b8

DS_STORE_B16

31

Short write.
MEM[ADDR].b16 = DATA[15 : 0].b16

DS_ADD_RTN_U32

32

Add data register to memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u += DATA.u;
RETURN_DATA.u = tmp

DS_SUB_RTN_U32

33

Subtract data register from memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u -= DATA.u;
RETURN_DATA.u = tmp

DS_RSUB_RTN_U32

34

Subtraction with reversed operands.
tmp = MEM[ADDR].b;
MEM[ADDR] = DATA.b - MEM[ADDR].b;
RETURN_DATA = tmp

DS_INC_RTN_U32

35

Increment memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U;
RETURN_DATA.u = tmp

DS_DEC_RTN_U32

36

Decrement memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U;
RETURN_DATA.u = tmp

DS_MIN_RTN_I32

37

Minimum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src < tmp ? src : tmp;
RETURN_DATA.i = tmp

DS_MAX_RTN_I32

38

Maximum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src > tmp ? src : tmp;
RETURN_DATA.i = tmp

DS_MIN_RTN_U32

39

Minimum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src < tmp ? src : tmp;
RETURN_DATA.u = tmp

DS_MAX_RTN_U32

40

Maximum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src > tmp ? src : tmp;
RETURN_DATA.u = tmp

DS_AND_RTN_B32

41

Bitwise AND of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp & DATA.b);
RETURN_DATA.b = tmp

DS_OR_RTN_B32

42

Bitwise OR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp | DATA.b);
RETURN_DATA.b = tmp

DS_XOR_RTN_B32

43

Bitwise XOR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp ^ DATA.b);
RETURN_DATA.b = tmp

DS_MSKOR_RTN_B32

44

Masked dword OR. D0 (DATA) contains the mask and D1 (DATA2) contains the new value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = ((tmp & ~DATA.b) | DATA2.b);
RETURN_DATA.b = tmp

DS_STOREXCHG_RTN_B32

45

Write-exchange operation.
tmp = MEM[ADDR].b;
MEM[ADDR].b = DATA.b;
RETURN_DATA.b = tmp

DS_STOREXCHG_2ADDR_RTN_B32

46

Write-exchange 2 separate dwords.
addr1 = ADDR_BASE.u + OFFSET0.u * 4U;
addr2 = ADDR_BASE.u + OFFSET1.u * 4U;
tmp1 = MEM[addr1].b;
tmp2 = MEM[addr2].b;
MEM[addr1].b = DATA.b;
MEM[addr2].b = DATA2.b;
// Note DATA2 can be any other register
RETURN_DATA[31 : 0] = tmp1;
RETURN_DATA[63 : 32] = tmp2

DS_STOREXCHG_2ADDR_STRIDE64_RTN_B32

47

Write-exchange 2 separate dwords with a stride of 64 dwords.
addr1 = ADDR_BASE.u + OFFSET0.u * 4U * 64U;
addr2 = ADDR_BASE.u + OFFSET1.u * 4U * 64U;
tmp1 = MEM[addr1].b;
tmp2 = MEM[addr2].b;
MEM[addr1].b = DATA.b;
MEM[addr2].b = DATA2.b;
// Note DATA2 can be any other register
RETURN_DATA[31 : 0] = tmp1;
RETURN_DATA[63 : 32] = tmp2

DS_CMPSTORE_RTN_B32

48

Compare and store.
tmp = MEM[ADDR].b;
src = DATA.b;
cmp = DATA2.b;
MEM[ADDR].b = tmp == cmp ? src : tmp;
RETURN_DATA.b = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_CMPSTORE_RTN_F32

49

Floating point compare and store that handles NAN/INF/denormal values.
tmp = MEM[ADDR].f;
src = DATA.f;
cmp = DATA2.f;
MEM[ADDR].f = tmp == cmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_MIN_RTN_F32

50

Minimum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src < tmp ? src : tmp;
RETURN_DATA = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_MAX_RTN_F32

51

Maximum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src > tmp ? src : tmp;
RETURN_DATA = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_WRAP_RTN_B32

52

Wrap calculation. Intended for use in ring buffer management.
tmp = MEM[ADDR].u;
MEM[ADDR].u = tmp >= DATA.u ? tmp - DATA.u : tmp + DATA2.u;
RETURN_DATA = tmp
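The wrap update can be sketched in Python as below. The function name is hypothetical, and the 32-bit masking of the result is an assumption added to imitate register width.

```python
MASK32 = 0xFFFFFFFF


def ds_wrap_rtn_b32(mem_value: int, data: int, data2: int):
    """Return (new_memory_value, returned_pre_op_value)."""
    tmp = mem_value & MASK32
    sub = data & MASK32
    if tmp >= sub:
        new = tmp - sub                # enough room: subtract DATA
    else:
        new = (tmp + data2) & MASK32   # otherwise add DATA2
    return new, tmp
```

For example, with DATA = 8 and DATA2 = 3, a memory value of 10 becomes 2, while a memory value of 5 becomes 8.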

DS_SWIZZLE_B32

53

Dword swizzle, no data is written to LDS memory.
Swizzles input thread data based on the offset mask and returns the result; note that this does not read or write the DS memory banks.
Note that reading from an invalid thread results in 0x0.
This opcode supports two specific modes, FFT and rotate, plus two basic modes which swizzle in groups of 4 or
32 consecutive threads.
The FFT mode (offset >= 0xe000) swizzles the input based on offset[4:0] to support FFT calculation. Example
swizzles using input {1, 2, … 20} (values shown in hexadecimal) are:
Offset[4:0]: Swizzle
0x00: {1,11,9,19,5,15,d,1d,3,13,b,1b,7,17,f,1f,2,12,a,1a,6,16,e,1e,4,14,c,1c,8,18,10,20}
0x10: {1,9,5,d,3,b,7,f,2,a,6,e,4,c,8,10,11,19,15,1d,13,1b,17,1f,12,1a,16,1e,14,1c,18,20}
0x1f: No swizzle
The rotate mode (offset >= 0xc000 and offset < 0xe000) rotates the input either left (offset[10] == 0) or right
(offset[10] == 1) a number of threads equal to offset[9:5]. The rotate mode also uses a mask value which can
alter the rotate result. For example, mask == 1 swaps the odd threads across every other even thread (rotate
left), or even threads across every other odd thread (rotate right).
Offset[9:5]: Swizzle
0x01, mask=0, rotate left: {2,3,4,5,6,7,8,9,a,b,c,d,e,f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,1}
0x01, mask=0, rotate right: {20,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f}
0x01, mask=1, rotate left: {2,1,4,7,6,5,8,b,a,9,c,f,e,d,10,13,12,11,14,17,16,15,18,1b,1a,19,1c,1f,1e,1d,20,3}
0x01, mask=1, rotate right: {1e,1,4,3,2,5,8,7,6,9,c,b,a,d,10,f,e,11,14,13,12,15,18,17,16,19,1c,1b,1a,1d,20,1f}
If offset < 0xc000, one of the basic swizzle modes is used based on offset[15]. If offset[15] == 1, groups of 4
consecutive threads are swizzled together. If offset[15] == 0, all 32 threads are swizzled together.
The first basic swizzle mode (when offset[15] == 1) allows full data sharing between a group of 4 consecutive
threads. Any thread within the group of 4 can get data from any other thread within the group of 4, specified by
the corresponding offset bits --- [1:0] for the first thread, [3:2] for the second thread, [5:4] for the third thread,
[7:6] for the fourth thread. Note that the offset bits apply to all groups of 4 within a wavefront; thus if offset[1:0]
== 1, then thread0 grabs thread1, thread4 grabs thread5, etc.
The second basic swizzle mode (when offset[15] == 0) allows limited data sharing between 32 consecutive
threads. In this case, the offset is used to specify a 5-bit xor-mask, 5-bit or-mask, and 5-bit and-mask used to
generate a thread mapping. Note that the offset bits apply to each group of 32 within a wavefront. The details of
the thread mapping are listed below. Some example usages:
SWAPX16 : xor_mask = 0x10, or_mask = 0x00, and_mask = 0x1f
SWAPX8 : xor_mask = 0x08, or_mask = 0x00, and_mask = 0x1f
SWAPX4 : xor_mask = 0x04, or_mask = 0x00, and_mask = 0x1f
SWAPX2 : xor_mask = 0x02, or_mask = 0x00, and_mask = 0x1f
SWAPX1 : xor_mask = 0x01, or_mask = 0x00, and_mask = 0x1f
REVERSEX32 : xor_mask = 0x1f, or_mask = 0x00, and_mask = 0x1f
REVERSEX16 : xor_mask = 0x0f, or_mask = 0x00, and_mask = 0x1f
REVERSEX8 : xor_mask = 0x07, or_mask = 0x00, and_mask = 0x1f
REVERSEX4 : xor_mask = 0x03, or_mask = 0x00, and_mask = 0x1f
REVERSEX2 : xor_mask = 0x01, or_mask = 0x00, and_mask = 0x1f
BCASTX32: xor_mask = 0x00, or_mask = thread, and_mask = 0x00
BCASTX16: xor_mask = 0x00, or_mask = thread, and_mask = 0x10
BCASTX8: xor_mask = 0x00, or_mask = thread, and_mask = 0x18
BCASTX4: xor_mask = 0x00, or_mask = thread, and_mask = 0x1c
BCASTX2: xor_mask = 0x00, or_mask = thread, and_mask = 0x1e
Pseudocode follows:
offset = offset1:offset0;
if (offset >= 0xe000) {
    // FFT decomposition
    mask = offset[4:0];
    for (i = 0; i < 64; i++) {
        j = reverse_bits(i & 0x1f);
        j = (j >> count_ones(mask));
        j |= (i & mask);
        j |= i & 0x20;
        thread_out[i] = thread_valid[j] ? thread_in[j] : 0;
    }
} elsif (offset >= 0xc000) {
    // rotate
    rotate = offset[9:5];
    mask = offset[4:0];
    if (offset[10]) {
        rotate = -rotate;
    }
    for (i = 0; i < 64; i++) {
        j = (i & mask) | ((i + rotate) & ~mask);
        j |= i & 0x20;
        thread_out[i] = thread_valid[j] ? thread_in[j] : 0;
    }
} elsif (offset[15]) {
    // full data sharing within 4 consecutive threads
    for (i = 0; i < 64; i+=4) {
        thread_out[i+0] = thread_valid[i+offset[1:0]]?thread_in[i+offset[1:0]]:0;
        thread_out[i+1] = thread_valid[i+offset[3:2]]?thread_in[i+offset[3:2]]:0;
        thread_out[i+2] = thread_valid[i+offset[5:4]]?thread_in[i+offset[5:4]]:0;
        thread_out[i+3] = thread_valid[i+offset[7:6]]?thread_in[i+offset[7:6]]:0;
    }
} else { // offset[15] == 0
    // limited data sharing within 32 consecutive threads
    xor_mask = offset[14:10];
    or_mask = offset[9:5];
    and_mask = offset[4:0];
    for (i = 0; i < 64; i++) {
        j = (((i & 0x1f) & and_mask) | or_mask) ^ xor_mask;
        j |= (i & 0x20); // which group of 32
        thread_out[i] = thread_valid[j] ? thread_in[j] : 0;
    }
}
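The pseudocode above can be turned into a runnable Python sketch for experimenting with offset values. The function names are hypothetical. One assumption is made beyond the pseudocode: the computed source lane is wrapped to 5 bits within its group of 32, so that rotations wrap around as the mask = 0 rotate examples show. This sketch is not a statement of exact hardware behavior.

```python
def reverse_bits5(x: int) -> int:
    """Reverse the low 5 bits of x."""
    return int(f"{x & 0x1F:05b}"[::-1], 2)


def swizzle_source_lane(i: int, offset: int) -> int:
    """Return the lane that lane i reads from for a 16-bit swizzle offset."""
    group = i & 0x20                       # which group of 32 lanes
    lane = i & 0x1F
    if offset >= 0xE000:                   # FFT mode
        mask = offset & 0x1F
        j = reverse_bits5(lane) >> bin(mask).count("1")
        j |= lane & mask
    elif offset >= 0xC000:                 # rotate mode
        rotate = (offset >> 5) & 0x1F
        mask = offset & 0x1F
        if (offset >> 10) & 1:
            rotate = -rotate
        j = (lane & mask) | ((lane + rotate) & ~mask)
    elif offset & 0x8000:                  # full sharing within 4 lanes
        sel = (offset >> (2 * (lane & 3))) & 3
        j = (lane & ~3) + sel
    else:                                  # bitmask mode within 32 lanes
        xor_mask = (offset >> 10) & 0x1F
        or_mask = (offset >> 5) & 0x1F
        and_mask = offset & 0x1F
        j = ((lane & and_mask) | or_mask) ^ xor_mask
    # Assumption: wrap the source lane to 5 bits inside its group of 32.
    return group | (j & 0x1F)


def ds_swizzle(thread_in, offset, thread_valid=None):
    """Apply the swizzle; invalid or out-of-range source lanes read as 0."""
    n = len(thread_in)
    if thread_valid is None:
        thread_valid = [True] * n
    out = []
    for i in range(n):
        j = swizzle_source_lane(i, offset)
        out.append(thread_in[j] if j < n and thread_valid[j] else 0)
    return out


lanes = list(range(1, 33))                 # input {1, 2, ..., 0x20}
swapx1 = ds_swizzle(lanes, (0x01 << 10) | (0x00 << 5) | 0x1F)
```

The bitmask-mode offset is packed as xor_mask in bits 14:10, or_mask in bits 9:5 and and_mask in bits 4:0, so SWAPX1 corresponds to offset 0x041F and exchanges each pair of neighboring lanes.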

DS_LOAD_B32

54

Read dword.
RETURN_DATA = MEM[ADDR].b

DS_LOAD_2ADDR_B32

55

Read 2 dwords.
RETURN_DATA[31 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 4U].b;
RETURN_DATA[63 : 32] = MEM[ADDR_BASE.u + OFFSET1.u * 4U].b

DS_LOAD_2ADDR_STRIDE64_B32

56

Read 2 dwords with a larger stride.
RETURN_DATA[31 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 4U * 64U].b;
RETURN_DATA[63 : 32] = MEM[ADDR_BASE.u + OFFSET1.u * 4U * 64U].b

DS_LOAD_I8

57

Signed byte read.
RETURN_DATA.i = 32'I(signext(MEM[ADDR][7 : 0].i8))

DS_LOAD_U8

58

Unsigned byte read.
RETURN_DATA.u = 32'U({ 24'0, MEM[ADDR][7 : 0].u8 })

DS_LOAD_I16

59

Signed short read.
RETURN_DATA.i = 32'I(signext(MEM[ADDR][15 : 0].i16))

DS_LOAD_U16

60

Unsigned short read.
RETURN_DATA.u = 32'U({ 16'0, MEM[ADDR][15 : 0].u16 })

DS_CONSUME

61

LDS & GDS. Subtract (count_bits(exec_mask)) from the value stored in DS memory at (M0.base + instr_offset).
Return the pre-operation value to VGPRs.
The DS subtracts count_bits(vector valid mask) from the value stored at address M0.base + instruction based
offset and returns the pre-op value to all valid lanes. This op can be used in both the LDS and GDS. In the LDS
this address is an offset to HWBASE and clamped by M0.size, but in the GDS the M0.base constant has the
physical GDS address and the compiler must force offset to zero. In GDS it is for the traditional append buffer
operations. In LDS it is for local thread group appends and can be used to regroup divergent threads. The use
of the M0 register enables the compiler to do indexing of UAV append/consume counters.
For GDS (system-wide) consume, the compiler must use zero for {offset1,offset0}; for LDS, the compiler uses
{offset1,offset0} to provide the relative address of the append counter in the LDS for a runtime index offset or
index.
Inside the DS, one atomic subtract is performed for the first valid lane and the result is broadcast to all valid lanes.
Offset = offset1:offset0, interpreted as a byte offset. Only aligned atomics are supported, so the 2 LSBs of the
offset must be set to zero.
addr = M0.base + offset; // offset by LDS HWBASE, limit to M0.size
rtnval = LDS(addr);
LDS(addr) = LDS(addr) - countbits(valid mask);
GPR[VDST] = rtnval; // return to all valid threads

DS_APPEND

62

LDS & GDS. Add (count_bits(exec_mask)) to the value stored in DS memory at (M0.base + instr_offset). Return
the pre-operation value to VGPRs.
The DS adds count_bits(vector valid mask) to the value stored at address M0.base + instruction based offset
and returns the pre-op value to all valid lanes. This op can be used in both the LDS and GDS. In the LDS this
address is an offset to HWBASE and clamped by M0.size, but in the GDS the M0.base constant has the physical
GDS address and the compiler must set offset to zero. In GDS it is for the traditional append buffer operations.
In LDS it is for local thread group appends and can be used to regroup divergent threads. The use of the M0
register enables the compiler to do indexing of UAV append/consume counters.
For GDS (system-wide) append, the compiler must use zero for {offset1,offset0}; for LDS, the compiler uses
{offset1,offset0} to provide the relative address of the append counter in the LDS for a runtime index offset or
index.
Inside the DS, one atomic add is performed for the first valid lane and the result is broadcast to all valid lanes.
Offset = offset1:offset0, interpreted as a byte offset. Only aligned atomics are supported, so the 2 LSBs of the
offset must be set to zero.
addr = M0.base + offset; // offset by LDS HWBASE, limit to M0.size
rtnval = LDS(addr);
LDS(addr) = LDS(addr) + countbits(valid mask);
GPR[VDST] = rtnval; // return to all valid threads
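The append and consume counters can be modeled with a short Python sketch. Everything here is illustrative: the dictionary standing in for DS memory, the function names, and the use of None for lanes that are not written are all assumptions of the sketch, and the M0.base/M0.size address clamping is not modeled.

```python
def _popcount(x: int) -> int:
    return bin(x).count("1")


def ds_append(lds: dict, addr: int, exec_mask: int, wave_size: int = 32):
    """Add popcount(exec_mask) to lds[addr]; return per-lane pre-op values."""
    active = exec_mask & ((1 << wave_size) - 1)
    pre_op = lds[addr]
    lds[addr] = (pre_op + _popcount(active)) & 0xFFFFFFFF
    # Valid lanes all receive the same pre-op value; others stay unwritten.
    return [pre_op if (active >> lane) & 1 else None
            for lane in range(wave_size)]


def ds_consume(lds: dict, addr: int, exec_mask: int, wave_size: int = 32):
    """Subtract popcount(exec_mask) from lds[addr]; return pre-op values."""
    active = exec_mask & ((1 << wave_size) - 1)
    pre_op = lds[addr]
    lds[addr] = (pre_op - _popcount(active)) & 0xFFFFFFFF
    return [pre_op if (active >> lane) & 1 else None
            for lane in range(wave_size)]


lds = {0x40: 100}
appended = ds_append(lds, 0x40, 0b1011, wave_size=4)
```

In this example three lanes are active, so the counter moves from 100 to 103 and each active lane sees 100. Because every valid lane gets the same pre-op value, a program that needs a distinct slot per lane has to derive it separately.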

DS_ORDERED_COUNT

63

GDS-only: Intercepted by GDS and processed by the ordered append module. The ordered append module queues
requests until the requesting wave is the oldest in the queue, at which time the oldest wave's request is dispatched to
the DS with the atomic opcode indicated by OFFSET1[5:4].
Unlike append/consume, this operation is sent even if there are no valid lanes when it is issued. In that case the GDS
adds zero and advances the tracking walker that needs to match up with the dispatch counter.

The following attributes are encoded in the instruction:
• OFFSET0[7:2] contains the ordered_count_index (in dwords).
• OFFSET1[0] contains the wave_release flag.
• OFFSET1[1] contains the wave_done flag.
• OFFSET1[5:4] contains the ord_idx_opcode: 2'b00 = DS_ADD_RTN_U32, 2'b01 = DS_STOREXCHG_RTN_B32,
2'b11 = DS_WRAP_RTN_B32.
• VGPR_DST is the VGPR the result is written to.
• VGPR_ADDR specifies the increment in the first valid lane. If no lanes are valid (EXEC = 0) then the
increment is zero.
• M0 normally carries {16'gds_base, 16'gds_size} for GDS usage. gds_base[15:2] is ordered_count_base[13:0]
(in dwords) and gds_size is used to hold the logical_wave_id, the width is based on total number of waves
in the chip.
The wave type is determined automatically based on the ME_ID and QUEUE_ID of the wavefront.
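The OFFSET0/OFFSET1 attribute fields listed above can be unpacked as in this Python sketch. The function name and the dictionary keys are hypothetical; the encoding 2'b10, which is not listed, is reported as None here rather than guessed.

```python
# ord_idx_opcode encodings for OFFSET1[5:4] as listed above.
ORD_IDX_OPCODES = {
    0b00: "DS_ADD_RTN_U32",
    0b01: "DS_STOREXCHG_RTN_B32",
    0b11: "DS_WRAP_RTN_B32",
}


def decode_ordered_count(offset0: int, offset1: int) -> dict:
    """Unpack the DS_ORDERED_COUNT attributes from the two offset fields."""
    return {
        "ordered_count_index": (offset0 >> 2) & 0x3F,    # OFFSET0[7:2], dwords
        "wave_release": bool(offset1 & 1),               # OFFSET1[0]
        "wave_done": bool((offset1 >> 1) & 1),           # OFFSET1[1]
        "ord_idx_opcode": ORD_IDX_OPCODES.get((offset1 >> 4) & 0x3),
    }
```

For instance, OFFSET0 = 0x0C and OFFSET1 = 0x13 decode to index 3 with both wave_release and wave_done set and the DS_STOREXCHG_RTN_B32 opcode.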

DS_ADD_U64

64

Add data register to 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 += DATA.u64;
RETURN_DATA.u64 = tmp

DS_SUB_U64

65

Subtract data register from 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 -= DATA.u64;
RETURN_DATA.u64 = tmp

DS_RSUB_U64

66

Subtraction with reversed operands.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = DATA.b64 - tmp;
RETURN_DATA.b64 = tmp

DS_INC_U64

67

Increment 64-bit memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL;
RETURN_DATA.u64 = tmp

DS_DEC_U64

68

Decrement 64-bit memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL;
RETURN_DATA.u64 = tmp

DS_MIN_I64

69

Minimum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src < tmp ? src : tmp;
RETURN_DATA.i64 = tmp

DS_MAX_I64

70

Maximum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src > tmp ? src : tmp;
RETURN_DATA.i64 = tmp

DS_MIN_U64

71

Minimum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src < tmp ? src : tmp;
RETURN_DATA.u64 = tmp

DS_MAX_U64

72

Maximum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src > tmp ? src : tmp;
RETURN_DATA.u64 = tmp

DS_AND_B64

73

Bitwise AND of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp & DATA.b64);
RETURN_DATA.b64 = tmp

DS_OR_B64

74

Bitwise OR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp | DATA.b64);
RETURN_DATA.b64 = tmp

DS_XOR_B64

75

Bitwise XOR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp ^ DATA.b64);
RETURN_DATA.b64 = tmp

DS_MSKOR_B64

76

Masked qword OR. D0 (DATA) contains the mask and D1 (DATA2) contains the new value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = ((tmp & ~DATA.b64) | DATA2.b64);
RETURN_DATA.b64 = tmp

DS_STORE_B64

77

Write qword.
MEM[ADDR].b64 = DATA.b64

DS_STORE_2ADDR_B64

78

Write 2 qwords.
MEM[ADDR_BASE.u + OFFSET0.u * 8U].b64 = DATA.b64;
MEM[ADDR_BASE.u + OFFSET1.u * 8U].b64 = DATA2.b64

DS_STORE_2ADDR_STRIDE64_B64

79

Write 2 qwords with a larger stride.
MEM[ADDR_BASE.u + OFFSET0.u * 8U * 64U].b64 = DATA.b64;
MEM[ADDR_BASE.u + OFFSET1.u * 8U * 64U].b64 = DATA2.b64

DS_CMPSTORE_B64

80

Compare and store.
tmp = MEM[ADDR].b64;
src = DATA.b64;
cmp = DATA2.b64;
MEM[ADDR].b64 = tmp == cmp ? src : tmp;
RETURN_DATA.b64 = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_CMPSTORE_F64

81

Floating point compare and store that handles NAN/INF/denormal values.
tmp = MEM[ADDR].f64;
src = DATA.f64;
cmp = DATA2.f64;
MEM[ADDR].f64 = tmp == cmp ? src : tmp;
RETURN_DATA.f64 = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_MIN_F64

82

Minimum of two floating-point values.
tmp = MEM[ADDR].f64;
src = DATA.f64;
MEM[ADDR].f64 = src < tmp ? src : tmp;
RETURN_DATA.f64 = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_MAX_F64

83

Maximum of two floating-point values.
tmp = MEM[ADDR].f64;
src = DATA.f64;
MEM[ADDR].f64 = src > tmp ? src : tmp;
RETURN_DATA.f64 = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_ADD_RTN_U64

96

Add data register to 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 += DATA.u64;
RETURN_DATA.u64 = tmp

DS_SUB_RTN_U64

97

Subtract data register from 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 -= DATA.u64;
RETURN_DATA.u64 = tmp

DS_RSUB_RTN_U64

98

Subtraction with reversed operands.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = DATA.b64 - tmp;
RETURN_DATA.b64 = tmp

DS_INC_RTN_U64

99

Increment 64-bit memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL;
RETURN_DATA.u64 = tmp

DS_DEC_RTN_U64

100

Decrement 64-bit memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL;
RETURN_DATA.u64 = tmp

DS_MIN_RTN_I64

101

Minimum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src < tmp ? src : tmp;
RETURN_DATA.i64 = tmp

DS_MAX_RTN_I64

102

Maximum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src > tmp ? src : tmp;
RETURN_DATA.i64 = tmp

DS_MIN_RTN_U64

103

Minimum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src < tmp ? src : tmp;
RETURN_DATA.u64 = tmp

DS_MAX_RTN_U64

104

Maximum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src > tmp ? src : tmp;
RETURN_DATA.u64 = tmp

DS_AND_RTN_B64

105

Bitwise AND of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp & DATA.b64);
RETURN_DATA.b64 = tmp

DS_OR_RTN_B64

106

Bitwise OR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp | DATA.b64);
RETURN_DATA.b64 = tmp

DS_XOR_RTN_B64

107

Bitwise XOR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp ^ DATA.b64);
RETURN_DATA.b64 = tmp

DS_MSKOR_RTN_B64

108

Masked qword OR. D0 (DATA) contains the mask and D1 (DATA2) contains the new value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = ((tmp & ~DATA.b64) | DATA2.b64);
RETURN_DATA.b64 = tmp

DS_STOREXCHG_RTN_B64

109

Write-exchange operation.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = DATA.b64;
RETURN_DATA.b64 = tmp

DS_STOREXCHG_2ADDR_RTN_B64

110

Write-exchange 2 separate qwords.
addr1 = ADDR_BASE.u + OFFSET0.u * 8U;
addr2 = ADDR_BASE.u + OFFSET1.u * 8U;
tmp1 = MEM[addr1].b64;
tmp2 = MEM[addr2].b64;
MEM[addr1].b64 = DATA.b64;
MEM[addr2].b64 = DATA2.b64;
// Note DATA2 can be any other register
RETURN_DATA[63 : 0] = tmp1;
RETURN_DATA[127 : 64] = tmp2

DS_STOREXCHG_2ADDR_STRIDE64_RTN_B64

111

Write-exchange 2 qwords with a stride of 64 qwords.
addr1 = ADDR_BASE.u + OFFSET0.u * 8U * 64U;
addr2 = ADDR_BASE.u + OFFSET1.u * 8U * 64U;
tmp1 = MEM[addr1].b64;
tmp2 = MEM[addr2].b64;
MEM[addr1].b64 = DATA.b64;
MEM[addr2].b64 = DATA2.b64;
// Note DATA2 can be any other register
RETURN_DATA[63 : 0] = tmp1;
RETURN_DATA[127 : 64] = tmp2

DS_CMPSTORE_RTN_B64

112

Compare and store.
tmp = MEM[ADDR].b64;
src = DATA.b64;
cmp = DATA2.b64;
MEM[ADDR].b64 = tmp == cmp ? src : tmp;
RETURN_DATA.b64 = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_CMPSTORE_RTN_F64

113

Floating point compare and store that handles NAN/INF/denormal values.
tmp = MEM[ADDR].f64;
src = DATA.f64;
cmp = DATA2.f64;
MEM[ADDR].f64 = tmp == cmp ? src : tmp;
RETURN_DATA.f64 = tmp

Notes
In this architecture the order of src and cmp agrees with the BUFFER_ATOMIC_CMPSWAP opcode.

DS_MIN_RTN_F64

114

Minimum of two floating-point values.
tmp = MEM[ADDR].f64;
src = DATA.f64;
MEM[ADDR].f64 = src < tmp ? src : tmp;
RETURN_DATA.f64 = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_MAX_RTN_F64

115

Maximum of two floating-point values.
tmp = MEM[ADDR].f64;
src = DATA.f64;
MEM[ADDR].f64 = src > tmp ? src : tmp;
RETURN_DATA.f64 = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

DS_LOAD_B64

118

Read 1 qword.
RETURN_DATA = MEM[ADDR].b64

DS_LOAD_2ADDR_B64

119

Read 2 qwords.
RETURN_DATA[63 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 8U].b64;
RETURN_DATA[127 : 64] = MEM[ADDR_BASE.u + OFFSET1.u * 8U].b64

DS_LOAD_2ADDR_STRIDE64_B64

120

Read 2 qwords with a larger stride.
RETURN_DATA[63 : 0] = MEM[ADDR_BASE.u + OFFSET0.u * 8U * 64U].b64;
RETURN_DATA[127 : 64] = MEM[ADDR_BASE.u + OFFSET1.u * 8U * 64U].b64
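
The 2ADDR and 2ADDR_STRIDE64 forms differ only in how the two offsets are scaled. The following Python sketch is an illustrative model of that address arithmetic, not part of the ISA definition; the function name two_addr_b64 is hypothetical.

```python
def two_addr_b64(addr_base, offset0, offset1, stride64=False):
    """Return the two LDS byte addresses used by a 2ADDR B64 access.

    Offsets count qwords (8 bytes). The STRIDE64 forms scale them by a
    further factor of 64, as in the pseudocode above.
    """
    scale = 8 * 64 if stride64 else 8
    return addr_base + offset0 * scale, addr_base + offset1 * scale
```

For example, with a base of 0x100 and offsets 1 and 2, the plain form addresses 0x108 and 0x110, while the STRIDE64 form addresses 0x300 and 0x500.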

DS_ADD_RTN_F32

121

Add data register to floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src + tmp;
RETURN_DATA.f = tmp

Notes
Floating-point addition handles NAN/INF/denorm.

DS_ADD_GS_REG_RTN

122

Perform an atomic add to data in specific registers embedded in GDS rather than operating on GDS memory
directly. This instruction returns the pre-op value. This instruction is only used by the GS stage and is used to
facilitate streamout.
The return value may be 32 bits or 64 bits depending on the GS register accessed. The data value is 32 bits.

if OFFSET0[5:2] > 7
// 64-bit GS register access
addr = (OFFSET0[5:2] - 8) * 2 + 8;
VDST[0] = GS_REGS(addr + 0);
VDST[1] = GS_REGS(addr + 1);
{GS_REGS(addr + 1), GS_REGS(addr)} += DATA0[0]; // source is 32 bit
else
addr = OFFSET0[5:2];
VDST[0] = GS_REGS(addr);
GS_REGS(addr) += DATA0[0];
endif.

32-bit GS registers:
offset[5:2] Register
0 GDS_STRMOUT_BUFFER_FILLED_SIZE_0
1 GDS_STRMOUT_BUFFER_FILLED_SIZE_1
2 GDS_STRMOUT_BUFFER_FILLED_SIZE_2
3 GDS_STRMOUT_BUFFER_FILLED_SIZE_3
4 GDS_GS_0
5 GDS_GS_1
6 GDS_GS_2
7 GDS_GS_3
64-bit GS registers:
offset[5:2] Register
8 GDS_STRMOUT_PRIMS_NEEDED_0
9 GDS_STRMOUT_PRIMS_WRITTEN_0
10 GDS_STRMOUT_PRIMS_NEEDED_1
11 GDS_STRMOUT_PRIMS_WRITTEN_1
12 GDS_STRMOUT_PRIMS_NEEDED_2
13 GDS_STRMOUT_PRIMS_WRITTEN_2
14 GDS_STRMOUT_PRIMS_NEEDED_3
15 GDS_STRMOUT_PRIMS_WRITTEN_3
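
The register indexing above can be sketched in Python as follows. This is a minimal model under stated assumptions, not a hardware description: the GS registers are represented as a hypothetical list of 24 dwords, the 32-bit source is assumed to be zero-extended for 64-bit registers, and arithmetic is assumed to wrap modulo the register width.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def ds_add_gs_reg_rtn(gs_regs, offset0, data0):
    """Model DS_ADD_GS_REG_RTN on a list of 32-bit GS register dwords.

    Returns the pre-op value as a tuple of VDST dwords (one or two).
    """
    index = (offset0 >> 2) & 0xF              # OFFSET0[5:2]
    if index > 7:
        # 64-bit register: occupies two consecutive dwords.
        addr = (index - 8) * 2 + 8
        old = gs_regs[addr] | (gs_regs[addr + 1] << 32)
        new = (old + (data0 & MASK32)) & MASK64   # assumed zero-extended source
        gs_regs[addr] = new & MASK32
        gs_regs[addr + 1] = new >> 32
        return (old & MASK32, old >> 32)
    # 32-bit register.
    old = gs_regs[index]
    gs_regs[index] = (old + data0) & MASK32       # assumed modulo-2^32 wrap
    return (old,)
```

With this indexing, offset[5:2] = 8 (GDS_STRMOUT_PRIMS_NEEDED_0) maps to dwords 8 and 9, and offset[5:2] = 15 maps to dwords 22 and 23.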

DS_SUB_GS_REG_RTN

123

Perform an atomic subtraction from data in specific registers embedded in GDS rather than operating on GDS
memory directly. This instruction returns the pre-op value. This instruction is only used by the GS stage and is
used to facilitate streamout.
The return value may be 32 bits or 64 bits depending on the GS register accessed. The data value is 32 bits.
if OFFSET0[5:2] > 7
// 64-bit GS register access
addr = (OFFSET0[5:2] - 8) * 2 + 8;
VDST[0] = GS_REGS(addr + 0);
VDST[1] = GS_REGS(addr + 1);
{GS_REGS(addr + 1), GS_REGS(addr)} -= DATA0[0]; // source is 32 bit
else
addr = OFFSET0[5:2];
VDST[0] = GS_REGS(addr);
GS_REGS(addr) -= DATA0[0];
endif.

32-bit GS registers:
offset[5:2] Register
0 GDS_STRMOUT_BUFFER_FILLED_SIZE_0
1 GDS_STRMOUT_BUFFER_FILLED_SIZE_1
2 GDS_STRMOUT_BUFFER_FILLED_SIZE_2
3 GDS_STRMOUT_BUFFER_FILLED_SIZE_3
4 GDS_GS_0
5 GDS_GS_1
6 GDS_GS_2
7 GDS_GS_3
64-bit GS registers:
offset[5:2] Register
8 GDS_STRMOUT_PRIMS_NEEDED_0
9 GDS_STRMOUT_PRIMS_WRITTEN_0
10 GDS_STRMOUT_PRIMS_NEEDED_1
11 GDS_STRMOUT_PRIMS_WRITTEN_1
12 GDS_STRMOUT_PRIMS_NEEDED_2
13 GDS_STRMOUT_PRIMS_WRITTEN_2
14 GDS_STRMOUT_PRIMS_NEEDED_3
15 GDS_STRMOUT_PRIMS_WRITTEN_3

DS_CONDXCHG32_RTN_B64

126

Conditional write exchange.
declare OFFSET0 : 8'U;
declare OFFSET1 : 8'U;
declare RETURN_DATA : 32'U[2];
ADDR = S0.u;
DATA = S1.u64;
offset = { OFFSET1, OFFSET0 };
ADDR0 = ((ADDR + offset.u) & 0xfff8U);
ADDR1 = ADDR0 + 4U;
RETURN_DATA[0] = LDS[ADDR0].u;
if DATA[31] then
LDS[ADDR0] = { 1'0, DATA[30 : 0] }
endif;
RETURN_DATA[1] = LDS[ADDR1].u;
if DATA[63] then
LDS[ADDR1] = { 1'0, DATA[62 : 32] }
endif
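
The conditional exchange can be modeled with the short Python sketch below. It is illustrative only: LDS is represented as a hypothetical dict mapping byte addresses to dword values, and unwritten locations are assumed to read as zero.

```python
def ds_condxchg32_rtn_b64(lds, addr, offset0, offset1, data):
    """Model DS_CONDXCHG32_RTN_B64; returns (RETURN_DATA[0], RETURN_DATA[1])."""
    offset = ((offset1 & 0xFF) << 8) | (offset0 & 0xFF)   # { OFFSET1, OFFSET0 }
    addr0 = (addr + offset) & 0xFFF8                      # qword-aligned
    addr1 = addr0 + 4
    ret0 = lds.get(addr0, 0)
    if (data >> 31) & 1:                  # bit 31 enables the low dword write
        lds[addr0] = data & 0x7FFFFFFF    # stored with bit 31 cleared
    ret1 = lds.get(addr1, 0)
    if (data >> 63) & 1:                  # bit 63 enables the high dword write
        lds[addr1] = (data >> 32) & 0x7FFFFFFF
    return ret0, ret1
```

Each dword of DATA carries its own write-enable in its top bit, so either half, both, or neither may be exchanged by a single instruction.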

DS_STORE_B8_D16_HI

160

Byte write from the high word of the data register.
MEM[ADDR].b8 = DATA[23 : 16].b8

DS_STORE_B16_D16_HI

161

Short write from the high word of the data register.
MEM[ADDR].b16 = DATA[31 : 16].b16

DS_LOAD_U8_D16

162

Unsigned byte read with masked return to lower word.
RETURN_DATA[15 : 0].u16 = 16'U({ 8'0U, MEM[ADDR][7 : 0].u8 })

DS_LOAD_U8_D16_HI

163

Unsigned byte read with masked return to upper word.
RETURN_DATA[31 : 16].u16 = 16'U({ 8'0U, MEM[ADDR][7 : 0].u8 })

DS_LOAD_I8_D16

164

Signed byte read with masked return to lower word.
RETURN_DATA[15 : 0].i16 = 16'I(signext(MEM[ADDR][7 : 0].i8))

DS_LOAD_I8_D16_HI

165

Signed byte read with masked return to upper word.
RETURN_DATA[31 : 16].i16 = 16'I(signext(MEM[ADDR][7 : 0].i8))

DS_LOAD_U16_D16

166

Unsigned short read with masked return to lower word.
RETURN_DATA[15 : 0].u16 = MEM[ADDR][15 : 0].u16

DS_LOAD_U16_D16_HI

167

Unsigned short read with masked return to upper word.
RETURN_DATA[31 : 16].u16 = MEM[ADDR][15 : 0].u16
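
The D16 and D16_HI loads above write only one 16-bit half of the destination. The Python sketch below models that merge. It assumes the other half of the register is preserved, which is how "masked return" is read here; the function name load_d16 is hypothetical.

```python
def load_d16(vgpr, mem_value, bits=8, signed=False, hi=False):
    """Merge an 8- or 16-bit memory value into one half of a 32-bit VGPR."""
    value = mem_value & ((1 << bits) - 1)
    if signed and value & (1 << (bits - 1)):
        # Sign-extend to 16 bits.
        value |= 0xFFFF & ~((1 << bits) - 1)
    if hi:
        return (vgpr & 0x0000FFFF) | (value << 16)   # write upper word
    return (vgpr & 0xFFFF0000) | value               # write lower word
```

For instance, a signed byte 0x80 loaded with the HI variant into a register holding 0x12345678 yields 0xFF805678 under this model.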

DS_BVH_STACK_RTN_B32

173

Ray tracing involves traversing a BVH (bounding volume hierarchy), a kind of tree in which nodes have up to 4 children. Each shader
thread processes one child at a time, and overflow nodes are stored temporarily in LDS using a stack. This
instruction supports pushing/popping the stack to reduce the number of VALU instructions required per
traversal and reduce VMEM bandwidth requirements.
The LDS stack address is computed using values packed into ADDR and part of OFFSET1. ADDR carries the
stack address for the lane. OFFSET1[5:4] contains stack_size[1:0] -- this value is constant for all lanes and is
patched into the shader by software. Valid stack sizes are {8, 16, 32, 64}.
A new stack address is returned to ADDR --- note that this VGPR is an in-out operand.
DATA0 contains the last node pointer for BVH.
DATA1 contains up to 4 valid data DWORDs for each thread. At a high level, the first 3 DWORDs (DATA1[0:2]) are
pushed to the stack if they are valid, and the last DWORD (DATA1[3]) is returned. If the last DWORD is invalid,
the stack is popped and the value from memory is returned instead.
In general this instruction performs the following:
(stack_base, stack_index) = DECODE_ADDR(ADDR, OFFSET1);
last_node_ptr = DATA0;
// First 3 passes: push data onto stack
for i = 0..2 do
if DATA_VALID(DATA1[i])
MEM[stack_base + stack_index] = DATA1[i];
Increment stack_index
elsif DATA1[i] == last_node_ptr
// Treat all further data as invalid as well.
break
endif
endfor
// Fourth pass: return data or pop
if DATA_VALID(DATA1[3])
VGPR_RTN = DATA1[3]
else
VGPR_RTN = MEM[stack_base + stack_index];
MEM[stack_base + stack_index] = INVALID_NODE;
Decrement stack_index
endif
ADDR = ENCODE_ADDR(stack_base, stack_index).
function DATA_VALID(data):
if data == INVALID_NODE
return false
elsif last_node_ptr != INVALID_NODE && data == last_node_ptr
// Match last_node_ptr
return false
else
return true
endif
endfunction.
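
The push and pop decisions above can be sketched in Python as follows. This is a simplified model under several assumptions that are not stated in the text: the value of INVALID_NODE is a placeholder sentinel, the per-lane LDS stack and the DECODE_ADDR/ENCODE_ADDR steps are replaced by an unbounded Python list, and popping an empty stack is assumed to return INVALID_NODE.

```python
INVALID_NODE = 0xFFFFFFFF  # assumed sentinel; the encoding is not given here

def data_valid(data, last_node_ptr):
    """Mirror of DATA_VALID in the pseudocode."""
    if data == INVALID_NODE:
        return False
    if last_node_ptr != INVALID_NODE and data == last_node_ptr:
        return False
    return True

def bvh_stack_step(stack, data1, last_node_ptr):
    """Apply one DS_BVH_STACK_RTN_B32 step to a list-based stack.

    stack: Python list standing in for the per-lane LDS stack (top at end).
    data1: four node dwords. Returns the modeled VGPR_RTN value.
    """
    # First 3 passes: push valid nodes, stop at a last_node_ptr match.
    for i in range(3):
        if data_valid(data1[i], last_node_ptr):
            stack.append(data1[i])
        elif data1[i] == last_node_ptr:
            break
    # Fourth pass: return DATA1[3] if valid, otherwise pop.
    if data_valid(data1[3], last_node_ptr):
        return data1[3]
    return stack.pop() if stack else INVALID_NODE  # empty case is assumed
```

The model does not reproduce the bounded stack sizes {8, 16, 32, 64} or any overflow behavior; it only illustrates which nodes are pushed and which value is returned.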

DS_STORE_ADDTID_B32

176

Write dword with thread ID offset.
declare OFFSET0 : 8'U;
declare OFFSET1 : 8'U;
MEM[32'I({ OFFSET1, OFFSET0 } + M0[15 : 0]) + laneID.i * 4].u = DATA0.u

DS_LOAD_ADDTID_B32

177

Read dword with thread ID offset.
declare OFFSET0 : 8'U;
declare OFFSET1 : 8'U;
RETURN_DATA.u = MEM[32'I({ OFFSET1, OFFSET0 } + M0[15 : 0]) + laneID.i * 4].u
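
The address used by the two ADDTID instructions can be sketched in Python as below. This is an illustrative helper with a hypothetical name; it only restates the address expression from the pseudocode.

```python
def addtid_address(offset0, offset1, m0, lane_id):
    """Byte address used by DS_STORE_ADDTID_B32 / DS_LOAD_ADDTID_B32."""
    offset = ((offset1 & 0xFF) << 8) | (offset0 & 0xFF)   # { OFFSET1, OFFSET0 }
    return offset + (m0 & 0xFFFF) + lane_id * 4           # one dword per lane
```

Consecutive lanes therefore address consecutive dwords, starting at the 16-bit instruction offset plus M0[15:0].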

DS_PERMUTE_B32

178

Forward permute. This does not access LDS memory and may be called even if no LDS memory is allocated to
the wave. It uses LDS hardware to implement an arbitrary swizzle across threads in a wavefront.
Note the address passed in is the thread ID multiplied by 4.
If multiple sources map to the same destination lane, it is not deterministic which source lane writes to the
destination lane.
See also DS_BPERMUTE_B32.
// VGPR[laneId][index] is the VGPR RAM
// VDST, ADDR and DATA0 are from the microcode DS encoding
declare tmp : 32'B[64];
declare OFFSET : 16'U;
declare DATA0 : 32'U;
declare VDST : 32'U;
for i in 0 : WAVE64 ? 63 : 31 do
tmp[i] = 0x0
endfor;
for i in 0 : WAVE64 ? 63 : 31 do
// If a source thread is disabled, it does not propagate data.
if EXEC[i].u1 then
// ADDR needs to be divided by 4.
// High-order bits are ignored.
// NOTE: destination lane is MOD 32 regardless of wave size.
dst_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
tmp[dst_lane] = VGPR[i][DATA0]
endif
endfor;
// Copy data into destination VGPRs. If multiple sources
// select the same destination thread, the highest-numbered
// source thread wins.
for i in 0 : WAVE64 ? 63 : 31 do
if EXEC[i].u1 then
VGPR[i][VDST] = tmp[i]
endif
endfor

Notes
Examples (simplified 4-thread wavefronts):
VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xF, OFFSET = 0
VGPR[VDST] = { B, D, 0, C }

VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xA, OFFSET = 0
VGPR[VDST] = { -, D, -, 0 }
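
The two examples above can be reproduced with the following Python sketch of the forward permute. It is a model for illustration only: registers are plain lists, lane_mod is a hypothetical parameter standing in for the MOD 32 in the pseudocode (set to 4 for the simplified 4-thread wavefronts), and source lanes are processed in ascending order, so the highest-numbered source wins on a collision, as in the pseudocode comment.

```python
def ds_permute_b32(data, addr, exec_mask, vdst, offset=0, lane_mod=32):
    """Forward permute: each active lane i pushes data[i] to lane addr[i] / 4."""
    n = len(data)
    tmp = [0] * max(n, lane_mod)
    for i in range(n):
        if (exec_mask >> i) & 1:              # disabled sources push nothing
            dst_lane = ((addr[i] + offset) // 4) % lane_mod
            tmp[dst_lane] = data[i]
    out = list(vdst)
    for i in range(n):
        if (exec_mask >> i) & 1:              # disabled lanes keep old VDST
            out[i] = tmp[i]
    return out
```

Calling it with data = ["A", "B", "C", "D"], addr = [0, 0, 12, 4], an initial VDST of ["-", "-", "-", "-"] and lane_mod = 4 gives ["B", "D", 0, "C"] for EXEC = 0xF and ["-", "D", "-", 0] for EXEC = 0xA.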

DS_BPERMUTE_B32

179

Backward permute. This does not access LDS memory and may be called even if no LDS memory is allocated to
the wave. It uses LDS hardware to implement an arbitrary swizzle across threads in a wavefront.
Note the address passed in is the thread ID multiplied by 4.
Note that EXEC mask is applied to both VGPR read and write. If src_lane selects a disabled thread then zero is
returned.
See also DS_PERMUTE_B32.
// VGPR[laneId][index] is the VGPR RAM
// VDST, ADDR and DATA0 are from the microcode DS encoding
declare tmp : 32'B[64];
declare OFFSET : 16'U;
declare DATA0 : 32'U;
declare VDST : 32'U;
for i in 0 : WAVE64 ? 63 : 31 do
tmp[i] = 0x0
endfor;
for i in 0 : WAVE64 ? 63 : 31 do
// ADDR needs to be divided by 4.
// High-order bits are ignored.
// NOTE: source lane is MOD 32 regardless of wave size.
src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
// EXEC is applied to the source VGPR reads.
if EXEC[src_lane].u1 then
tmp[i] = VGPR[src_lane][DATA0]
endif
endfor;
// Copy data into destination VGPRs. Some source
// data may be broadcast to multiple lanes.
for i in 0 : WAVE64 ? 63 : 31 do
if EXEC[i].u1 then
VGPR[i][VDST] = tmp[i]
endif
endfor

Notes
Examples (simplified 4-thread wavefronts):
VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xF, OFFSET = 0
VGPR[VDST] = { A, A, D, B }

VGPR[SRC0] = { A, B, C, D }
VGPR[ADDR] = { 0, 0, 12, 4 }
EXEC = 0xA, OFFSET = 0
VGPR[VDST] = { -, 0, -, B }
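
The backward permute examples can be reproduced with the Python sketch below. As with any such model, it is illustrative only: registers are plain lists and lane_mod is a hypothetical parameter standing in for the MOD 32 in the pseudocode (set to 4 for the simplified 4-thread wavefronts).

```python
def ds_bpermute_b32(data, addr, exec_mask, vdst, offset=0, lane_mod=32):
    """Backward permute: each active lane i pulls data from lane addr[i] / 4."""
    n = len(data)
    out = list(vdst)
    for i in range(n):
        src_lane = ((addr[i] + offset) // 4) % lane_mod
        # A disabled (or out-of-range) source lane reads as zero.
        enabled = src_lane < n and (exec_mask >> src_lane) & 1
        value = data[src_lane] if enabled else 0
        if (exec_mask >> i) & 1:              # disabled lanes keep old VDST
            out[i] = value
    return out
```

Calling it with data = ["A", "B", "C", "D"], addr = [0, 0, 12, 4], an initial VDST of ["-", "-", "-", "-"] and lane_mod = 4 gives ["A", "A", "D", "B"] for EXEC = 0xF and ["-", 0, "-", "B"] for EXEC = 0xA.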

DS_STORE_B96

222

Tri-dword write.
MEM[ADDR + 0U].b = DATA[31 : 0];
MEM[ADDR + 4U].b = DATA[63 : 32];
MEM[ADDR + 8U].b = DATA[95 : 64]

DS_STORE_B128

223

Quad-dword write.
MEM[ADDR + 0U].b = DATA[31 : 0];
MEM[ADDR + 4U].b = DATA[63 : 32];
MEM[ADDR + 8U].b = DATA[95 : 64];
MEM[ADDR + 12U].b = DATA[127 : 96]

DS_LOAD_B96

254

Tri-dword read.
RETURN_DATA[31 : 0] = MEM[ADDR + 0U].b;
RETURN_DATA[63 : 32] = MEM[ADDR + 4U].b;
RETURN_DATA[95 : 64] = MEM[ADDR + 8U].b

DS_LOAD_B128

255

Quad-dword read.
RETURN_DATA[31 : 0] = MEM[ADDR + 0U].b;
RETURN_DATA[63 : 32] = MEM[ADDR + 4U].b;
RETURN_DATA[95 : 64] = MEM[ADDR + 8U].b;
RETURN_DATA[127 : 96] = MEM[ADDR + 12U].b

16.15.1. LDS Instruction Limitations
Some of the DS instructions are available only to GDS, not LDS. These are:
• DS_GWS_SEMA_RELEASE_ALL
• DS_GWS_INIT
• DS_GWS_SEMA_V
• DS_GWS_SEMA_BR
• DS_GWS_SEMA_P
• DS_GWS_BARRIER
• DS_ORDERED_COUNT

16.16. MUBUF Instructions
The bitfield map of the MUBUF format is:

BUFFER_LOAD_FORMAT_X

0

Untyped buffer load 1 component with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format

BUFFER_LOAD_FORMAT_XY

1

Untyped buffer load 2 components with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format
VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y])

BUFFER_LOAD_FORMAT_XYZ

2

Untyped buffer load 3 components with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format
VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]);
VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z])

BUFFER_LOAD_FORMAT_XYZW

3

Untyped buffer load 4 components with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format
VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]);
VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z]);
VDATA[127 : 96].b = ConvertFromFormat(MEM[TADDR.W])

BUFFER_STORE_FORMAT_X

4

Untyped buffer store 1 component with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format

BUFFER_STORE_FORMAT_XY

5

Untyped buffer store 2 components with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b)

BUFFER_STORE_FORMAT_XYZ

6

Untyped buffer store 3 components with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b);
MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b)

BUFFER_STORE_FORMAT_XYZW

7

Untyped buffer store 4 components with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b);
MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b);
MEM[TADDR.W] = ConvertToFormat(VDATA[127 : 96].b)

BUFFER_LOAD_D16_FORMAT_X

8

Untyped buffer load 1 component with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
// VDATA[31:16].b16 is preserved.

BUFFER_LOAD_D16_FORMAT_XY

9

Untyped buffer load 2 components with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y]))

BUFFER_LOAD_D16_FORMAT_XYZ

10

Untyped buffer load 3 components with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y]));
VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z]));
// VDATA[63:48].b16 is preserved.

BUFFER_LOAD_D16_FORMAT_XYZW

11

Untyped buffer load 4 components with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y]));
VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z]));
VDATA[63 : 48].b16 = 16'B(ConvertFromFormat(MEM[TADDR.W]))

BUFFER_STORE_D16_FORMAT_X

12

Untyped buffer store 1 component with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format

BUFFER_STORE_D16_FORMAT_XY

13

Untyped buffer store 2 components with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16))

BUFFER_STORE_D16_FORMAT_XYZ

14

Untyped buffer store 3 components with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16));
MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16))

BUFFER_STORE_D16_FORMAT_XYZW

15

Untyped buffer store 4 components with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16));
MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16));
MEM[TADDR.W] = ConvertToFormat(32'B(VDATA[63 : 48].b16))

BUFFER_LOAD_U8

16

Untyped buffer load unsigned byte, zero extend in data register.
VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 })

BUFFER_LOAD_I8

17

Untyped buffer load signed byte, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i8))

BUFFER_LOAD_U16

18

Untyped buffer load unsigned short, zero extend in data register.
VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 })

BUFFER_LOAD_I16

19

Untyped buffer load signed short, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i16))

BUFFER_LOAD_B32

20

Untyped buffer load dword.
VDATA.b = MEM[ADDR].b

BUFFER_LOAD_B64

21

Untyped buffer load 2 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b

BUFFER_LOAD_B96

22

Untyped buffer load 3 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b

BUFFER_LOAD_B128

23

Untyped buffer load 4 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b;
VDATA[127 : 96] = MEM[ADDR + 12U].b

BUFFER_STORE_B8

24

Untyped buffer store byte.
MEM[ADDR].b8 = VDATA[7 : 0]

BUFFER_STORE_B16

25

Untyped buffer store short.
MEM[ADDR].b16 = VDATA[15 : 0]

BUFFER_STORE_B32

26

Untyped buffer store dword.
MEM[ADDR].b = VDATA[31 : 0]

BUFFER_STORE_B64

27

Untyped buffer store 2 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32]

BUFFER_STORE_B96

28

Untyped buffer store 3 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64]

BUFFER_STORE_B128

29

Untyped buffer store 4 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64];
MEM[ADDR + 12U].b = VDATA[127 : 96]

BUFFER_LOAD_D16_U8

30

Untyped buffer load unsigned byte, use low 16 bits of data register.
VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[31:16] is preserved.

BUFFER_LOAD_D16_I8

31

Untyped buffer load signed byte, use low 16 bits of data register.
VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[31:16] is preserved.

BUFFER_LOAD_D16_B16

32

Untyped buffer load short, use low 16 bits of data register.
VDATA[15 : 0].b16 = MEM[ADDR].b16;
// VDATA[31:16] is preserved.

BUFFER_LOAD_D16_HI_U8

33

Untyped buffer load unsigned byte, use high 16 bits of data register.
VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[15:0] is preserved.

BUFFER_LOAD_D16_HI_I8

34

Untyped buffer load signed byte, use high 16 bits of data register.
VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[15:0] is preserved.

BUFFER_LOAD_D16_HI_B16

35

Untyped buffer load short, use high 16 bits of data register.
VDATA[31 : 16].b16 = MEM[ADDR].b16;
// VDATA[15:0] is preserved.

BUFFER_STORE_D16_HI_B8

36

Untyped buffer store byte, use high 16 bits of data register.
MEM[ADDR].b8 = VDATA[23 : 16].b8

BUFFER_STORE_D16_HI_B16

37

Untyped buffer store short, use high 16 bits of data register.
MEM[ADDR].b16 = VDATA[31 : 16].b16

BUFFER_LOAD_D16_HI_FORMAT_X

38

Untyped buffer load 1 dword with format conversion, use high 16 bits of data register.
VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
// VDATA[15:0].b16 is preserved.

BUFFER_STORE_D16_HI_FORMAT_X

39

Untyped buffer store 1 dword with format conversion, use high 16 bits of data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[31 : 16].b16));
// Mem access size depends on format

BUFFER_GL0_INV

43

Write back and invalidate the shader L0. Returns ACK to shader.

BUFFER_GL1_INV

44

Invalidate the GL1 cache only. Returns ACK to shader.

BUFFER_LOAD_LDS_U8

45

Untyped buffer load unsigned byte, zero extend and store in LDS destination.

BUFFER_LOAD_LDS_I8

46

Untyped buffer load signed byte, sign extend and store in LDS destination.

BUFFER_LOAD_LDS_U16

47

Untyped buffer load unsigned short, zero extend and store in LDS destination.

BUFFER_LOAD_LDS_I16

48

Untyped buffer load signed short, sign extend and store in LDS destination.

BUFFER_LOAD_LDS_B32

49

Untyped buffer load dword, store in LDS destination.

BUFFER_LOAD_LDS_FORMAT_X

50

Untyped buffer load 1 dword with format conversion, store in LDS destination.

BUFFER_ATOMIC_SWAP_B32

51

Swap values in data register and memory.
tmp = MEM[ADDR].b;
MEM[ADDR].b = DATA.b;
RETURN_DATA.b = tmp

BUFFER_ATOMIC_CMPSWAP_B32

52

Compare and swap with memory value.
tmp = MEM[ADDR].b;
src = DATA[31 : 0].b;
cmp = DATA[63 : 32].b;
MEM[ADDR].b = tmp == cmp ? src : tmp;
RETURN_DATA.b = tmp
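
The operand packing for the compare-and-swap can be sketched in Python as follows. This is an illustrative model, with memory represented as a hypothetical dict of dword values; it shows that the swap value sits in the low dword of DATA and the compare value in the high dword.

```python
def buffer_atomic_cmpswap_b32(mem, addr, data):
    """Model BUFFER_ATOMIC_CMPSWAP_B32; returns the pre-op memory value."""
    tmp = mem[addr]
    src = data & 0xFFFFFFFF            # DATA[31:0]: value to store
    cmp = (data >> 32) & 0xFFFFFFFF    # DATA[63:32]: value to compare against
    if tmp == cmp:
        mem[addr] = src
    return tmp
```

The caller can tell whether the swap happened by comparing the returned value with the compare value it supplied.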

BUFFER_ATOMIC_ADD_U32

53

Add data register to memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u += DATA.u;
RETURN_DATA.u = tmp

BUFFER_ATOMIC_SUB_U32

54

Subtract data register from memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u -= DATA.u;
RETURN_DATA.u = tmp

BUFFER_ATOMIC_CSUB_U32

55

Subtract data register from memory value, clamp to zero.
declare new_value : 32'U;
old_value = MEM[ADDR].u;
if old_value < DATA.u then
new_value = 0U
else
new_value = old_value - DATA.u
endif;
MEM[ADDR].u = new_value;
RETURN_DATA.u = old_value
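
A minimal Python sketch of the clamped subtract, assuming memory is a hypothetical dict of unsigned dword values:

```python
def buffer_atomic_csub_u32(mem, addr, data):
    """Model BUFFER_ATOMIC_CSUB_U32; returns the pre-op memory value."""
    old_value = mem[addr]
    # Clamp to zero instead of wrapping when the result would be negative.
    mem[addr] = 0 if old_value < data else old_value - data
    return old_value
```

Unlike BUFFER_ATOMIC_SUB_U32, the stored result never wraps around below zero.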

BUFFER_ATOMIC_MIN_I32

56

Minimum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src < tmp ? src : tmp;
RETURN_DATA.i = tmp

BUFFER_ATOMIC_MIN_U32

57

Minimum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src < tmp ? src : tmp;
RETURN_DATA.u = tmp

BUFFER_ATOMIC_MAX_I32

58

Maximum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src > tmp ? src : tmp;
RETURN_DATA.i = tmp

BUFFER_ATOMIC_MAX_U32

59

Maximum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src > tmp ? src : tmp;
RETURN_DATA.u = tmp

BUFFER_ATOMIC_AND_B32

60

Bitwise AND of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp & DATA.b);
RETURN_DATA.b = tmp

BUFFER_ATOMIC_OR_B32

61

Bitwise OR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp | DATA.b);
RETURN_DATA.b = tmp

BUFFER_ATOMIC_XOR_B32

62

Bitwise XOR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp ^ DATA.b);
RETURN_DATA.b = tmp

BUFFER_ATOMIC_INC_U32

63

Increment memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U;
RETURN_DATA.u = tmp

BUFFER_ATOMIC_DEC_U32

64

Decrement memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U;
RETURN_DATA.u = tmp
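
The wraparound rules of the increment and decrement atomics can be sketched in Python as below. This is an illustrative model with memory represented as a hypothetical dict of unsigned dword values; the register value acts as an inclusive upper bound in both cases.

```python
def buffer_atomic_inc_u32(mem, addr, data):
    """Increment, wrapping to 0 once the value reaches the limit in data."""
    tmp = mem[addr]
    mem[addr] = 0 if tmp >= data else tmp + 1
    return tmp

def buffer_atomic_dec_u32(mem, addr, data):
    """Decrement, wrapping to data when at 0 or above the limit."""
    tmp = mem[addr]
    mem[addr] = data if (tmp == 0 or tmp > data) else tmp - 1
    return tmp
```

With a limit of 3, repeated increments starting from 0 cycle through 0, 1, 2, 3, 0, and repeated decrements cycle through 0, 3, 2, 1, 0.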

BUFFER_ATOMIC_SWAP_B64

65

Swap 64-bit values in data register and memory.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = DATA.b64;
RETURN_DATA.b64 = tmp

BUFFER_ATOMIC_CMPSWAP_B64

66

Compare and swap with 64-bit memory value.
tmp = MEM[ADDR].b64;
src = DATA[63 : 0].b64;
cmp = DATA[127 : 64].b64;
MEM[ADDR].b64 = tmp == cmp ? src : tmp;
RETURN_DATA.b64 = tmp

BUFFER_ATOMIC_ADD_U64

67

Add data register to 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 += DATA.u64;
RETURN_DATA.u64 = tmp

BUFFER_ATOMIC_SUB_U64

68

Subtract data register from 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 -= DATA.u64;
RETURN_DATA.u64 = tmp

BUFFER_ATOMIC_MIN_I64

69

Minimum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src < tmp ? src : tmp;
RETURN_DATA.i64 = tmp

BUFFER_ATOMIC_MIN_U64

70

Minimum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src < tmp ? src : tmp;
RETURN_DATA.u64 = tmp

BUFFER_ATOMIC_MAX_I64

71

Maximum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src > tmp ? src : tmp;
RETURN_DATA.i64 = tmp

BUFFER_ATOMIC_MAX_U64

72

Maximum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src > tmp ? src : tmp;
RETURN_DATA.u64 = tmp

BUFFER_ATOMIC_AND_B64

73

Bitwise AND of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp & DATA.b64);
RETURN_DATA.b64 = tmp

BUFFER_ATOMIC_OR_B64

74

Bitwise OR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp | DATA.b64);
RETURN_DATA.b64 = tmp

BUFFER_ATOMIC_XOR_B64

75

Bitwise XOR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp ^ DATA.b64);
RETURN_DATA.b64 = tmp

BUFFER_ATOMIC_INC_U64

76

Increment 64-bit memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL;
RETURN_DATA.u64 = tmp

BUFFER_ATOMIC_DEC_U64

77

Decrement 64-bit memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL;
RETURN_DATA.u64 = tmp

BUFFER_ATOMIC_CMPSWAP_F32

80

Compare and swap with floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA[31 : 0].f;
cmp = DATA[63 : 32].f;
MEM[ADDR].f = tmp == cmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

BUFFER_ATOMIC_MIN_F32

81

Minimum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src < tmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

BUFFER_ATOMIC_MAX_F32

82

Maximum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src > tmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

BUFFER_ATOMIC_ADD_F32

86

Add data register to floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src + tmp;
RETURN_DATA.f = tmp

Notes
Floating-point addition handles NAN/INF/denorm.

16.17. MTBUF Instructions
The bitfield map of the MTBUF format is:

TBUFFER_LOAD_FORMAT_X

0

Typed buffer load 1 component with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format

TBUFFER_LOAD_FORMAT_XY

1

Typed buffer load 2 components with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format
VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y])

TBUFFER_LOAD_FORMAT_XYZ

2

Typed buffer load 3 components with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format
VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]);
VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z])

TBUFFER_LOAD_FORMAT_XYZW

3

Typed buffer load 4 components with format conversion.
VDATA[31 : 0].b = ConvertFromFormat(MEM[TADDR.X]);
// Mem access size depends on format
VDATA[63 : 32].b = ConvertFromFormat(MEM[TADDR.Y]);
VDATA[95 : 64].b = ConvertFromFormat(MEM[TADDR.Z]);
VDATA[127 : 96].b = ConvertFromFormat(MEM[TADDR.W])

TBUFFER_STORE_FORMAT_X

4

Typed buffer store 1 component with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format

TBUFFER_STORE_FORMAT_XY

5

Typed buffer store 2 components with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b)

TBUFFER_STORE_FORMAT_XYZ

6

Typed buffer store 3 components with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b);
MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b)

TBUFFER_STORE_FORMAT_XYZW

7

Typed buffer store 4 components with format conversion.
MEM[TADDR.X] = ConvertToFormat(VDATA[31 : 0].b);
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(VDATA[63 : 32].b);
MEM[TADDR.Z] = ConvertToFormat(VDATA[95 : 64].b);
MEM[TADDR.W] = ConvertToFormat(VDATA[127 : 96].b)

TBUFFER_LOAD_D16_FORMAT_X

8

Typed buffer load 1 component with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
// VDATA[31:16].b16 is preserved.

TBUFFER_LOAD_D16_FORMAT_XY

9

Typed buffer load 2 components with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y]))

TBUFFER_LOAD_D16_FORMAT_XYZ

10

Typed buffer load 3 components with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y]));
VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z]));
// VDATA[63:48].b16 is preserved.

TBUFFER_LOAD_D16_FORMAT_XYZW

11

Typed buffer load 4 components with format conversion, packed 16-bit components in data register.
VDATA[15 : 0].b16 = 16'B(ConvertFromFormat(MEM[TADDR.X]));
// Mem access size depends on format
VDATA[31 : 16].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Y]));
VDATA[47 : 32].b16 = 16'B(ConvertFromFormat(MEM[TADDR.Z]));
VDATA[63 : 48].b16 = 16'B(ConvertFromFormat(MEM[TADDR.W]))

TBUFFER_STORE_D16_FORMAT_X

12

Typed buffer store 1 component with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format

TBUFFER_STORE_D16_FORMAT_XY

13

Typed buffer store 2 components with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16))

TBUFFER_STORE_D16_FORMAT_XYZ

14

Typed buffer store 3 components with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16));
MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16))

TBUFFER_STORE_D16_FORMAT_XYZW

15

Typed buffer store 4 components with format conversion, packed 16-bit components in data register.
MEM[TADDR.X] = ConvertToFormat(32'B(VDATA[15 : 0].b16));
// Mem access size depends on format
MEM[TADDR.Y] = ConvertToFormat(32'B(VDATA[31 : 16].b16));
MEM[TADDR.Z] = ConvertToFormat(32'B(VDATA[47 : 32].b16));
MEM[TADDR.W] = ConvertToFormat(32'B(VDATA[63 : 48].b16))

16.18. MIMG Instructions
The bitfield map of the MIMG format is:

IMAGE_LOAD

0

Load element from largest miplevel in resource view, with format conversion specified in the resource
constant. No sampler.

IMAGE_LOAD_MIP

1

Load element from user-specified miplevel in resource view, with format conversion specified in the resource
constant. No sampler.

IMAGE_LOAD_PCK

2

Load element from largest miplevel in resource view, without format conversion. 8- and 16-bit elements are
not sign-extended. No sampler.

IMAGE_LOAD_PCK_SGN

3

Load element from largest miplevel in resource view, without format conversion. 8- and 16-bit elements are
sign-extended. No sampler.

IMAGE_LOAD_MIP_PCK

4

Load element from user-supplied miplevel in resource view, without format conversion. 8- and 16-bit elements
are not sign-extended. No sampler.

IMAGE_LOAD_MIP_PCK_SGN

5

Load element from user-supplied miplevel in resource view, without format conversion. 8- and 16-bit elements
are sign-extended. No sampler.

IMAGE_STORE

6

Store element to largest miplevel in resource view, with format conversion specified in resource constant. No
sampler.

IMAGE_STORE_MIP

7

Store element to user-specified miplevel in resource view, with format conversion specified in resource
constant. No sampler.

IMAGE_STORE_PCK

8

Store element to largest miplevel in resource view, without format conversion. No sampler.

IMAGE_STORE_MIP_PCK

9

Store element to user-specified miplevel in resource view, without format conversion. No sampler.

IMAGE_ATOMIC_SWAP

10

Swap values in data register and memory.
tmp = MEM[ADDR].b;
MEM[ADDR].b = DATA.b;
RETURN_DATA.b = tmp

IMAGE_ATOMIC_CMPSWAP

11

Compare and swap with memory value.
tmp = MEM[ADDR].b;
src = DATA[31 : 0].b;
cmp = DATA[63 : 32].b;
MEM[ADDR].b = tmp == cmp ? src : tmp;
RETURN_DATA.b = tmp
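The compare-and-swap pseudocode packs two operands into one 64-bit DATA value. A minimal Python sketch of that layout follows; `atomic_cmpswap_b32` and the dictionary used as memory are hypothetical stand-ins, and atomicity itself is not modeled.

```python
MASK32 = 0xFFFFFFFF


def atomic_cmpswap_b32(mem, addr, data):
    """Model of the 32-bit compare-and-swap pseudocode.

    data is a 64-bit value: bits [31:0] hold the new value (src) and
    bits [63:32] hold the comparison value (cmp).
    Returns the pre-operation memory value (RETURN_DATA).
    """
    tmp = mem[addr] & MASK32
    src = data & MASK32
    cmp = (data >> 32) & MASK32
    mem[addr] = src if tmp == cmp else tmp
    return tmp


mem = {0x40: 5}
old = atomic_cmpswap_b32(mem, 0x40, (5 << 32) | 9)   # cmp = 5, src = 9
print(old, mem[0x40])  # 5 9
```

A second call with the same DATA would leave memory unchanged, because the stored value (9) no longer matches cmp (5).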

IMAGE_ATOMIC_ADD

12

Add data register to memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u += DATA.u;
RETURN_DATA.u = tmp

IMAGE_ATOMIC_SUB

13

Subtract data register from memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u -= DATA.u;
RETURN_DATA.u = tmp

IMAGE_ATOMIC_SMIN

14

Minimum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src < tmp ? src : tmp;
RETURN_DATA.i = tmp

IMAGE_ATOMIC_UMIN

15

Minimum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src < tmp ? src : tmp;
RETURN_DATA.u = tmp
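SMIN and UMIN share the same pseudocode and differ only in how the 32-bit patterns are interpreted. The Python sketch below makes that difference concrete; `atomic_min` and `to_signed32` are hypothetical names and the dictionary stands in for memory.

```python
def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def atomic_min(mem, addr, data, signed):
    """Store min(memory, data) and return the old memory pattern."""
    tmp_bits = mem[addr] & 0xFFFFFFFF
    src_bits = data & 0xFFFFFFFF
    if signed:
        take_src = to_signed32(src_bits) < to_signed32(tmp_bits)
    else:
        take_src = src_bits < tmp_bits
    mem[addr] = src_bits if take_src else tmp_bits
    return tmp_bits


# 0xFFFFFFFF is -1 when signed, but the largest value when unsigned.
for signed in (True, False):
    mem = {0: 0x00000001}
    atomic_min(mem, 0, 0xFFFFFFFF, signed)
    print(signed, hex(mem[0]))
```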

IMAGE_ATOMIC_SMAX

16

Maximum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src > tmp ? src : tmp;
RETURN_DATA.i = tmp

IMAGE_ATOMIC_UMAX

17

Maximum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src > tmp ? src : tmp;
RETURN_DATA.u = tmp

IMAGE_ATOMIC_AND

18

Bitwise AND of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp & DATA.b);
RETURN_DATA.b = tmp

IMAGE_ATOMIC_OR

19

Bitwise OR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp | DATA.b);
RETURN_DATA.b = tmp

IMAGE_ATOMIC_XOR

20

Bitwise XOR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp ^ DATA.b);
RETURN_DATA.b = tmp

IMAGE_ATOMIC_INC

21

Increment memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U;
RETURN_DATA.u = tmp

IMAGE_ATOMIC_DEC

22

Decrement memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U;
RETURN_DATA.u = tmp
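The wraparound rules for INC and DEC are easiest to see by stepping a counter through a few operations. This Python sketch transcribes the two pseudocode blocks directly; the function names and the dictionary used as memory are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def atomic_inc_u32(mem, addr, data):
    """Increment, wrapping to 0 once the old value has reached data."""
    tmp = mem[addr] & MASK32
    src = data & MASK32
    mem[addr] = 0 if tmp >= src else tmp + 1
    return tmp


def atomic_dec_u32(mem, addr, data):
    """Decrement, wrapping to data when the old value is 0 or above data."""
    tmp = mem[addr] & MASK32
    src = data & MASK32
    mem[addr] = src if (tmp == 0 or tmp > src) else tmp - 1
    return tmp


mem = {0: 0}
up = []
for _ in range(5):
    atomic_inc_u32(mem, 0, 3)
    up.append(mem[0])
print(up)    # [1, 2, 3, 0, 1]

mem = {0: 0}
down = []
for _ in range(5):
    atomic_dec_u32(mem, 0, 3)
    down.append(mem[0])
print(down)  # [3, 2, 1, 0, 3]
```

With a register value of 3, the memory value cycles through the four values 0 to 3 in either direction.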

IMAGE_GET_RESINFO

23

Return resource info for a given mip level specified in the address vgpr. No sampler. Returns 4 integer values
into VGPRs 3-0: {num_mip_levels, depth, height, width}.

IMAGE_MSAA_LOAD

24

Load up to 4 samples of 1 component from an MSAA resource with a user-specified fragment ID. No sampler.

IMAGE_BVH_INTERSECT_RAY

25

Intersection test on bounding volume hierarchy nodes for ray tracing acceleration. 32-bit node pointer. No
sampler.
DATA:
The destination VGPRs contain the results of intersection testing. The values returned depend on the type of
BVH node that was fetched.
For box nodes, the results contain the four child-box pointers, sorted by intersection time.
For triangle nodes, the results contain the intersection time and triangle ID of the triangle tested.
The address GPR packing varies based on addressing mode (A16) and NSA mode.
ADDR (A16 = 0):
11 address VGPRs contain the ray data and BVH node pointer for the intersection test. The data is laid out as
follows (dependent on NSA mode):
NSA=0        NSA=1        Value
VADDR[0]     VADDR[0]     = node_pointer (uint32)
VADDR[1]     VADDRA[0]    = ray_extent (float32)
VADDR[2]     VADDRB[0]    = ray_origin.x (float32)
VADDR[3]     VADDRB[1]    = ray_origin.y (float32)
VADDR[4]     VADDRB[2]    = ray_origin.z (float32)
VADDR[5]     VADDRC[0]    = ray_dir.x (float32)
VADDR[6]     VADDRC[1]    = ray_dir.y (float32)
VADDR[7]     VADDRC[2]    = ray_dir.z (float32)
VADDR[8]     VADDRD[0]    = ray_inv_dir.x (float32)
VADDR[9]     VADDRD[1]    = ray_inv_dir.y (float32)
VADDR[10]    VADDRD[2]    = ray_inv_dir.z (float32)
ADDR (A16 = 1):
For performance and power optimization, the instruction can be encoded to use 16-bit floats for ray_dir and
ray_inv_dir by setting A16 to 1. When the instruction is encoded with 16-bit addresses, only 8 address VGPRs
are used, as follows (dependent on NSA mode):
NSA=0        NSA=1        Value
VADDR[0]     VADDR[0]     = node_pointer (uint32)
VADDR[1]     VADDRA[0]    = ray_extent (float32)
VADDR[2]     VADDRB[0]    = ray_origin.x (float32)
VADDR[3]     VADDRB[1]    = ray_origin.y (float32)
VADDR[4]     VADDRB[2]    = ray_origin.z (float32)
VADDR[5]     VADDRC[0]    = {ray_inv_dir.x, ray_dir.x} (2x float16)
VADDR[6]     VADDRC[1]    = {ray_inv_dir.y, ray_dir.y} (2x float16)
VADDR[7]     VADDRC[2]    = {ray_inv_dir.z, ray_dir.z} (2x float16)
RSRC:
The resource is the texture descriptor for the operation. The instruction must be encoded with r128=1.
RESTRICTIONS:
The image_bvh_intersect_ray and image_bvh64_intersect_ray opcodes do not support all of the features of a
standard MIMG instruction. This puts some restrictions on how the instruction is encoded:
• DMASK must be set to 0xf (instruction returns all four DWORDs)
• D16 must be set to 0 (16 bit return data is not supported)
• R128 must be set to 1 (256 bit T#s are not supported)
• UNRM must be set to 1 (only unnormalized coordinates are supported)
• DIM must be set to 0 (BVH textures are 1D)
• LWE must be set to 0 (LOD warn is not supported)
• TFE must be set to 0 (no support for writing out the extra DWORD for the PRT hit status)
These restrictions must be respected by the SW/compiler and are not enforced by HW. HW is allowed to
assume that these fields are encoded according to the above restrictions; if they are not, HW may ignore the
improper values or exhibit any other undefined behavior.
The HW also has some additional restrictions on the BVH instructions when they are issued:
• The HW ignores the return order settings of the BVH ops and schedules them in the in-order read return
queue when fetching data from the texture pipe.
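Because the encoding restrictions listed above are not enforced by hardware, a software-side check is one way to catch bad encodings early. The Python sketch below assumes a hypothetical dictionary of already-decoded MIMG field values keyed by lowercase field name; it is not part of any AMD tool.

```python
# Required field values for the BVH intersect-ray opcodes, taken from
# the restriction list above.
REQUIRED_BVH_FIELDS = {
    "dmask": 0xF,  # all four DWORDs are returned
    "d16": 0,      # 16-bit return data is not supported
    "r128": 1,     # 256-bit T#s are not supported
    "unrm": 1,     # only unnormalized coordinates are supported
    "dim": 0,      # BVH textures are 1D
    "lwe": 0,      # LOD warn is not supported
    "tfe": 0,      # no extra DWORD for the PRT hit status
}


def bvh_encoding_violations(fields):
    """Return the sorted names of fields that break the restrictions.

    A missing field counts as a violation.
    """
    return sorted(name for name, want in REQUIRED_BVH_FIELDS.items()
                  if fields.get(name) != want)


good = dict(REQUIRED_BVH_FIELDS)
bad = dict(good, d16=1, tfe=1)
print(bvh_encoding_violations(good))  # []
print(bvh_encoding_violations(bad))   # ['d16', 'tfe']
```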

IMAGE_BVH64_INTERSECT_RAY

26

Intersection test on bounding volume hierarchy nodes for ray tracing acceleration. 64-bit node pointer. No
sampler.
This instruction allows support for very large BVHs (larger than 32 GB) that may occur in workstation
workloads. See IMAGE_BVH_INTERSECT_RAY for basic information including restrictions. Only differences
are described here.
ADDR (A16 = 0):
12 address VGPRs contain the ray data and BVH node pointer for the intersection test. The data is laid out as
follows (dependent on NSA mode):
NSA=0        NSA=1        Value
VADDR[0]     VADDR[0]     = node_pointer[31:0] (uint32)
VADDR[1]     VADDR[1]     = node_pointer[63:32] (uint32)
VADDR[2]     VADDRA[0]    = ray_extent (float32)
VADDR[3]     VADDRB[0]    = ray_origin.x (float32)
VADDR[4]     VADDRB[1]    = ray_origin.y (float32)
VADDR[5]     VADDRB[2]    = ray_origin.z (float32)
VADDR[6]     VADDRC[0]    = ray_dir.x (float32)
VADDR[7]     VADDRC[1]    = ray_dir.y (float32)
VADDR[8]     VADDRC[2]    = ray_dir.z (float32)
VADDR[9]     VADDRD[0]    = ray_inv_dir.x (float32)
VADDR[10]    VADDRD[1]    = ray_inv_dir.y (float32)
VADDR[11]    VADDRD[2]    = ray_inv_dir.z (float32)
ADDR (A16 = 1):
When the instruction is encoded with 16-bit addresses, only 9 address VGPRs are used, as follows (dependent
on NSA mode):
NSA=0        NSA=1        Value
VADDR[0]     VADDR[0]     = node_pointer[31:0] (uint32)
VADDR[1]     VADDR[1]     = node_pointer[63:32] (uint32)
VADDR[2]     VADDRA[0]    = ray_extent (float32)
VADDR[3]     VADDRB[0]    = ray_origin.x (float32)
VADDR[4]     VADDRB[1]    = ray_origin.y (float32)
VADDR[5]     VADDRB[2]    = ray_origin.z (float32)
VADDR[6]     VADDRC[0]    = {ray_inv_dir.x, ray_dir.x} (2x float16)
VADDR[7]     VADDRC[1]    = {ray_inv_dir.y, ray_dir.y} (2x float16)
VADDR[8]     VADDRC[2]    = {ray_inv_dir.z, ray_dir.z} (2x float16)
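The A16 = 1 address layouts for the two BVH instructions can be sketched as a packing routine. The Python below is illustrative only: `pack_bvh_addr_a16` is a hypothetical helper, it produces the NSA = 0 register order, and it assumes that in `{ray_inv_dir, ray_dir}` the first element occupies the high 16 bits, following the concatenation convention used elsewhere in this document (for example `{ 24'0, MEM[ADDR].u8 }`).

```python
import struct


def f32_bits(x):
    """Bit pattern of x as an IEEE float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def f16_bits(x):
    """Bit pattern of x as an IEEE float16 (raises OverflowError if x
    is finite but too large for float16)."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def pack_bvh_addr_a16(node_pointer, ray_extent, origin, direction,
                      inv_direction, pointer_bits=32):
    """Return the list of address dwords VADDR[0..] for A16 = 1."""
    if pointer_bits == 32:
        dwords = [node_pointer & 0xFFFFFFFF]
    elif pointer_bits == 64:
        dwords = [node_pointer & 0xFFFFFFFF,
                  (node_pointer >> 32) & 0xFFFFFFFF]
    else:
        raise ValueError("pointer_bits must be 32 or 64")
    dwords.append(f32_bits(ray_extent))
    dwords.extend(f32_bits(c) for c in origin)
    # Assumption: ray_inv_dir in bits [31:16], ray_dir in bits [15:0].
    for d, inv in zip(direction, inv_direction):
        dwords.append((f16_bits(inv) << 16) | f16_bits(d))
    return dwords


regs = pack_bvh_addr_a16(0x1234, 100.0, (0.0, 0.0, 0.0),
                         (2.0, 1.0, 0.5), (0.5, 1.0, 2.0))
print(len(regs), [hex(r) for r in regs])
```

The routine yields 8 dwords for a 32-bit node pointer and 9 for a 64-bit one, matching the register counts stated above.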

IMAGE_SAMPLE

27

Sample texture map.

IMAGE_SAMPLE_D

28

Sample texture map, with user derivatives.

IMAGE_SAMPLE_L

29

Sample texture map, with user LOD.

IMAGE_SAMPLE_B

30

Sample texture map, with lod bias.

IMAGE_SAMPLE_LZ

31

Sample texture map, from level 0.

IMAGE_SAMPLE_C

32

Sample texture map, with PCF.

IMAGE_SAMPLE_C_D

33

SAMPLE_C, with user derivatives.

IMAGE_SAMPLE_C_L

34

SAMPLE_C, with user LOD.

IMAGE_SAMPLE_C_B

35

SAMPLE_C, with lod bias.

IMAGE_SAMPLE_C_LZ

36

SAMPLE_C, from level 0.

IMAGE_SAMPLE_O

37

Sample texture map, with user offsets.

IMAGE_SAMPLE_D_O

38

SAMPLE_O, with user derivatives.

IMAGE_SAMPLE_L_O

39

SAMPLE_O, with user LOD.

IMAGE_SAMPLE_B_O

40

SAMPLE_O, with lod bias.

IMAGE_SAMPLE_LZ_O

41

SAMPLE_O, from level 0.

IMAGE_SAMPLE_C_O

42

SAMPLE_C with user specified offsets.

IMAGE_SAMPLE_C_D_O

43

SAMPLE_C_O, with user derivatives.

IMAGE_SAMPLE_C_L_O

44

SAMPLE_C_O, with user LOD.

IMAGE_SAMPLE_C_B_O

45

SAMPLE_C_O, with lod bias.

IMAGE_SAMPLE_C_LZ_O

46

SAMPLE_C_O, from level 0.

IMAGE_GATHER4

47

Gather 4 single component elements (2x2).

IMAGE_GATHER4_L

48

Gather 4 single component elements (2x2) with user LOD.

IMAGE_GATHER4_B

49

Gather 4 single component elements (2x2) with user bias.

IMAGE_GATHER4_LZ

50

Gather 4 single component elements (2x2) at level 0.

IMAGE_GATHER4_C

51

Gather 4 single component elements (2x2) with PCF.

IMAGE_GATHER4_C_LZ

52

Gather 4 single component elements (2x2) at level 0, with PCF.

IMAGE_GATHER4_O

53

GATHER4, with user offsets.

IMAGE_GATHER4_LZ_O

54

GATHER4_LZ, with user offsets.

IMAGE_GATHER4_C_LZ_O

55

GATHER4_C_LZ, with user offsets.

IMAGE_GET_LOD

56

Return calculated LOD as two 32-bit floating point values.
VDATA[0] = clampedLOD;
VDATA[1] = rawLOD

IMAGE_SAMPLE_D_G16

57

SAMPLE_D with 16-bit floating point derivatives (gradients).

IMAGE_SAMPLE_C_D_G16

58

SAMPLE_C_D with 16-bit floating point derivatives (gradients).

IMAGE_SAMPLE_D_O_G16

59

SAMPLE_D_O with 16-bit floating point derivatives (gradients).

IMAGE_SAMPLE_C_D_O_G16

60

SAMPLE_C_D_O with 16-bit floating point derivatives (gradients).

IMAGE_SAMPLE_CL

64

Sample texture map, with LOD clamp specified in shader.

IMAGE_SAMPLE_D_CL

65

Sample texture map, with LOD clamp specified in shader, with user derivatives.

IMAGE_SAMPLE_B_CL

66

Sample texture map, with LOD clamp specified in shader, with lod bias.

IMAGE_SAMPLE_C_CL

67

SAMPLE_C, with LOD clamp specified in shader.

IMAGE_SAMPLE_C_D_CL

68

SAMPLE_C, with LOD clamp specified in shader, with user derivatives.

IMAGE_SAMPLE_C_B_CL

69

SAMPLE_C, with LOD clamp specified in shader, with lod bias.

IMAGE_SAMPLE_CL_O

70

SAMPLE_O with LOD clamp specified in shader.

IMAGE_SAMPLE_D_CL_O

71

SAMPLE_O, with LOD clamp specified in shader, with user derivatives.

IMAGE_SAMPLE_B_CL_O

72

SAMPLE_O, with LOD clamp specified in shader, with lod bias.

IMAGE_SAMPLE_C_CL_O

73

SAMPLE_C_O, with LOD clamp specified in shader.

IMAGE_SAMPLE_C_D_CL_O

74

SAMPLE_C_O, with LOD clamp specified in shader, with user derivatives.

IMAGE_SAMPLE_C_B_CL_O

75

SAMPLE_C_O, with LOD clamp specified in shader, with lod bias.

IMAGE_SAMPLE_C_D_CL_G16

84

SAMPLE_C_D_CL with 16-bit floating point derivatives (gradients).

IMAGE_SAMPLE_D_CL_O_G16

85

SAMPLE_D_CL_O with 16-bit floating point derivatives (gradients).

IMAGE_SAMPLE_C_D_CL_O_G16

86

SAMPLE_C_D_CL_O with 16-bit floating point derivatives (gradients).

IMAGE_SAMPLE_D_CL_G16

95

SAMPLE_D_CL with 16-bit floating point derivatives (gradients).

IMAGE_GATHER4_CL

96

Gather 4 single component elements (2x2) with user LOD clamp.

IMAGE_GATHER4_B_CL

97

Gather 4 single component elements (2x2) with user bias and clamp.

IMAGE_GATHER4_C_CL

98

Gather 4 single component elements (2x2) with user LOD clamp and PCF.

IMAGE_GATHER4_C_L

99

Gather 4 single component elements (2x2) with user LOD and PCF.

IMAGE_GATHER4_C_B

100

Gather 4 single component elements (2x2) with user bias and PCF.

IMAGE_GATHER4_C_B_CL

101

Gather 4 single component elements (2x2) with user bias, clamp and PCF.

IMAGE_GATHER4H

144

Fetch 1 component per texel from 4x1 texels. DMASK selects which component to read (R,G,B,A) and must
have only one bit set to 1.
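The one-bit DMASK requirement for IMAGE_GATHER4H can be checked with a small helper. In this Python sketch, `gather4h_component` is a hypothetical name, and the mapping of DMASK bit 0 to R through bit 3 to A is an assumption based on the (R,G,B,A) ordering in the description.

```python
def gather4h_component(dmask):
    """Return 'R', 'G', 'B' or 'A' for a valid one-hot 4-bit DMASK."""
    if dmask not in (0x1, 0x2, 0x4, 0x8):
        raise ValueError("DMASK must have exactly one of its four bits set")
    # Assumed mapping: bit 0 -> R, bit 1 -> G, bit 2 -> B, bit 3 -> A.
    return "RGBA"[dmask.bit_length() - 1]


print(gather4h_component(0x4))  # B
```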

16.19. EXPORT Instructions
Transfer vertex position, vertex parameter, pixel color, or pixel depth information to the output buffer. Every
pixel shader must do at least one export to a color, depth or NULL target with the VM bit set to 1. This
communicates the pixel-valid mask to the color and depth buffers. Every pixel does only one of the above
export types with the DONE bit set to 1. Vertex shaders must do one or more position exports, and at least one
parameter export. The final position export must have the DONE bit set to 1.

16.20. FLAT, Scratch and Global Instructions
The bitfield map of the FLAT format is:

16.20.1. Flat Instructions
Flat instructions look at the per work-item address and determine for each work-item if the target memory
address is in global, scratch (private) or LDS (shared) memory.

FLAT_LOAD_U8

16

Untyped buffer load unsigned byte, zero extend in data register.
VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 })

FLAT_LOAD_I8

17

Untyped buffer load signed byte, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i8))

FLAT_LOAD_U16

18

Untyped buffer load unsigned short, zero extend in data register.
VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 })

FLAT_LOAD_I16

19

Untyped buffer load signed short, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i16))
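The four narrow loads above differ only in access size and in whether the value is zero- or sign-extended to 32 bits. The Python sketch below models that with a `bytes` object as memory; `flat_load_narrow` is a hypothetical helper, and little-endian byte order is an assumption of the model.

```python
def flat_load_narrow(mem, addr, size, signed):
    """Return the 32-bit VDATA pattern for an 8- or 16-bit load.

    size is 1 (U8/I8) or 2 (U16/I16); signed selects sign extension.
    """
    if size not in (1, 2):
        raise ValueError("size must be 1 or 2 bytes")
    raw = mem[addr:addr + size]
    if len(raw) != size:
        raise IndexError("load runs past the end of the buffer")
    value = int.from_bytes(raw, "little", signed=signed)
    return value & 0xFFFFFFFF   # two's-complement pattern in 32 bits


mem = bytes([0x80, 0xFF])
print(hex(flat_load_narrow(mem, 0, 1, False)))  # 0x80        (LOAD_U8)
print(hex(flat_load_narrow(mem, 0, 1, True)))   # 0xffffff80  (LOAD_I8)
print(hex(flat_load_narrow(mem, 0, 2, False)))  # 0xff80      (LOAD_U16)
print(hex(flat_load_narrow(mem, 0, 2, True)))   # 0xffffff80  (LOAD_I16)
```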

FLAT_LOAD_B32

20

Untyped buffer load dword.
VDATA.b = MEM[ADDR].b

FLAT_LOAD_B64

21

Untyped buffer load 2 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b

FLAT_LOAD_B96

22

Untyped buffer load 3 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b

FLAT_LOAD_B128

23

Untyped buffer load 4 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b;
VDATA[127 : 96] = MEM[ADDR + 12U].b
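The B64, B96 and B128 loads follow one pattern: dword i is read from ADDR + 4*i and placed in VDATA bits [32*i + 31 : 32*i]. A minimal Python sketch, assuming a `bytes` object as memory, little-endian byte order, and a hypothetical helper name:

```python
def flat_load_dwords(mem, addr, count):
    """Load count dwords (1 to 4) and return VDATA as one integer."""
    if not 1 <= count <= 4:
        raise ValueError("count must be between 1 and 4")
    vdata = 0
    for i in range(count):
        start = addr + 4 * i
        raw = mem[start:start + 4]
        if len(raw) != 4:
            raise IndexError("load runs past the end of the buffer")
        vdata |= int.from_bytes(raw, "little") << (32 * i)
    return vdata


mem = bytes(range(16))
print(hex(flat_load_dwords(mem, 0, 2)))  # 0x706050403020100
```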

FLAT_STORE_B8

24

Untyped buffer store byte.
MEM[ADDR].b8 = VDATA[7 : 0]

FLAT_STORE_B16

25

Untyped buffer store short.
MEM[ADDR].b16 = VDATA[15 : 0]

FLAT_STORE_B32

26

Untyped buffer store dword.
MEM[ADDR].b = VDATA[31 : 0]

FLAT_STORE_B64

27

Untyped buffer store 2 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32]

FLAT_STORE_B96

28

Untyped buffer store 3 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64]

FLAT_STORE_B128

29

Untyped buffer store 4 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64];
MEM[ADDR + 12U].b = VDATA[127 : 96]
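The multi-dword stores mirror the loads: VDATA bits [32*i + 31 : 32*i] are written to ADDR + 4*i. The Python sketch below uses a `bytearray` as memory; the helper name and the little-endian byte order are assumptions of the model.

```python
def flat_store_dwords(mem, addr, vdata, count):
    """Store the low count dwords (1 to 4) of vdata starting at addr."""
    if not 1 <= count <= 4:
        raise ValueError("count must be between 1 and 4")
    if addr < 0 or addr + 4 * count > len(mem):
        raise IndexError("store runs past the end of the buffer")
    for i in range(count):
        dword = (vdata >> (32 * i)) & 0xFFFFFFFF
        start = addr + 4 * i
        mem[start:start + 4] = dword.to_bytes(4, "little")


mem = bytearray(16)
flat_store_dwords(mem, 4, 0x333333332222222211111111, 3)   # STORE_B96
print(mem.hex())
```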

FLAT_LOAD_D16_U8

30

Untyped buffer load unsigned byte, use low 16 bits of data register.
VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[31:16] is preserved.

FLAT_LOAD_D16_I8

31

Untyped buffer load signed byte, use low 16 bits of data register.
VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[31:16] is preserved.

FLAT_LOAD_D16_B16

32

Untyped buffer load short, use low 16 bits of data register.
VDATA[15 : 0].b16 = MEM[ADDR].b16;
// VDATA[31:16] is preserved.

FLAT_LOAD_D16_HI_U8

33

Untyped buffer load unsigned byte, use high 16 bits of data register.
VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[15:0] is preserved.

FLAT_LOAD_D16_HI_I8

34

Untyped buffer load signed byte, use high 16 bits of data register.
VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[15:0] is preserved.

FLAT_LOAD_D16_HI_B16

35

Untyped buffer load short, use high 16 bits of data register.
VDATA[31 : 16].b16 = MEM[ADDR].b16;
// VDATA[15:0] is preserved.
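The D16 and D16_HI loads write one 16-bit half of the data register and preserve the other. The following Python sketch models only the register update; `write_d16` and `sign_extend_8_to_16` are hypothetical helpers, and the memory access itself is left out.

```python
def sign_extend_8_to_16(byte):
    """Sign-extend an 8-bit pattern to a 16-bit pattern."""
    byte &= 0xFF
    return (byte | 0xFF00) if (byte & 0x80) else byte


def write_d16(vdata, value16, hi):
    """Insert value16 into the low or high half of a 32-bit VDATA.

    hi=False models the D16 forms (bits [15:0] written),
    hi=True models the D16_HI forms (bits [31:16] written).
    """
    value16 &= 0xFFFF
    if hi:
        return (vdata & 0x0000FFFF) | (value16 << 16)
    return (vdata & 0xFFFF0000) | value16


old = 0xDEADBEEF
print(hex(write_d16(old, 0x80, hi=False)))                      # 0xdead0080
print(hex(write_d16(old, sign_extend_8_to_16(0x80), hi=True)))  # 0xff80beef
```

The first call corresponds to LOAD_D16_U8 of the byte 0x80, the second to LOAD_D16_HI_I8 of the same byte.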

FLAT_STORE_D16_HI_B8

36

Untyped buffer store byte, use high 16 bits of data register.
MEM[ADDR].b8 = VDATA[23 : 16].b8

FLAT_STORE_D16_HI_B16

37

Untyped buffer store short, use high 16 bits of data register.
MEM[ADDR].b16 = VDATA[31 : 16].b16

FLAT_ATOMIC_SWAP_B32

51

Swap values in data register and memory.
tmp = MEM[ADDR].b;
MEM[ADDR].b = DATA.b;
RETURN_DATA.b = tmp

FLAT_ATOMIC_CMPSWAP_B32

52

Compare and swap with memory value.
tmp = MEM[ADDR].b;
src = DATA[31 : 0].b;
cmp = DATA[63 : 32].b;
MEM[ADDR].b = tmp == cmp ? src : tmp;
RETURN_DATA.b = tmp

FLAT_ATOMIC_ADD_U32

53

Add data register to memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u += DATA.u;
RETURN_DATA.u = tmp

FLAT_ATOMIC_SUB_U32

54

Subtract data register from memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u -= DATA.u;
RETURN_DATA.u = tmp

FLAT_ATOMIC_MIN_I32

56

Minimum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src < tmp ? src : tmp;
RETURN_DATA.i = tmp

FLAT_ATOMIC_MIN_U32

57

Minimum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src < tmp ? src : tmp;
RETURN_DATA.u = tmp

FLAT_ATOMIC_MAX_I32

58

Maximum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src > tmp ? src : tmp;
RETURN_DATA.i = tmp

FLAT_ATOMIC_MAX_U32

59

Maximum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src > tmp ? src : tmp;
RETURN_DATA.u = tmp

FLAT_ATOMIC_AND_B32

60

Bitwise AND of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp & DATA.b);
RETURN_DATA.b = tmp

FLAT_ATOMIC_OR_B32

61

Bitwise OR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp | DATA.b);
RETURN_DATA.b = tmp

FLAT_ATOMIC_XOR_B32

62

Bitwise XOR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp ^ DATA.b);
RETURN_DATA.b = tmp

FLAT_ATOMIC_INC_U32

63

Increment memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U;
RETURN_DATA.u = tmp

FLAT_ATOMIC_DEC_U32

64

Decrement memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U;
RETURN_DATA.u = tmp

FLAT_ATOMIC_SWAP_B64

65

Swap 64-bit values in data register and memory.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = DATA.b64;
RETURN_DATA.b64 = tmp

FLAT_ATOMIC_CMPSWAP_B64

66

Compare and swap with 64-bit memory value.
NOTE: RETURN_DATA[2:3] is not modified.
tmp = MEM[ADDR].b64;
src = DATA[63 : 0].b64;
cmp = DATA[127 : 64].b64;
MEM[ADDR].b64 = tmp == cmp ? src : tmp;
RETURN_DATA.b64 = tmp

FLAT_ATOMIC_ADD_U64

67

Add data register to 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 += DATA.u64;
RETURN_DATA.u64 = tmp

FLAT_ATOMIC_SUB_U64

68

Subtract data register from 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 -= DATA.u64;
RETURN_DATA.u64 = tmp

FLAT_ATOMIC_MIN_I64

69

Minimum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src < tmp ? src : tmp;
RETURN_DATA.i64 = tmp

FLAT_ATOMIC_MIN_U64

70

Minimum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src < tmp ? src : tmp;
RETURN_DATA.u64 = tmp

FLAT_ATOMIC_MAX_I64

71

Maximum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src > tmp ? src : tmp;
RETURN_DATA.i64 = tmp

FLAT_ATOMIC_MAX_U64

72

Maximum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src > tmp ? src : tmp;
RETURN_DATA.u64 = tmp

FLAT_ATOMIC_AND_B64

73

Bitwise AND of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp & DATA.b64);
RETURN_DATA.b64 = tmp

FLAT_ATOMIC_OR_B64

74

Bitwise OR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp | DATA.b64);
RETURN_DATA.b64 = tmp

FLAT_ATOMIC_XOR_B64

75

Bitwise XOR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp ^ DATA.b64);
RETURN_DATA.b64 = tmp

FLAT_ATOMIC_INC_U64

76

Increment 64-bit memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL;
RETURN_DATA.u64 = tmp

FLAT_ATOMIC_DEC_U64

77

Decrement 64-bit memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL;
RETURN_DATA.u64 = tmp

FLAT_ATOMIC_CMPSWAP_F32

80

Compare and swap with floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA[31 : 0].f;
cmp = DATA[63 : 32].f;
MEM[ADDR].f = tmp == cmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

FLAT_ATOMIC_MIN_F32

81

Minimum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src < tmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

FLAT_ATOMIC_MAX_F32

82

Maximum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src > tmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

FLAT_ATOMIC_ADD_F32

86

Add data register to floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src + tmp;
RETURN_DATA.f = tmp

Notes
Floating-point addition handles NAN/INF/denorm.
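A floating-point atomic add rounds its result to 32-bit precision, so small addends can be absorbed entirely. The Python sketch below illustrates this with the `struct` module; the helper names are hypothetical, and the model assumes round-to-nearest-even with overflow to infinity. The hardware rounding mode and denormal handling are not described here and are not modeled.

```python
import math
import struct


def round_to_f32(x):
    """Round a Python float to the nearest float32 value."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # Finite but beyond float32 range: model as signed infinity.
        return math.copysign(math.inf, x)


def atomic_add_f32(mem, addr, data):
    """MEM[ADDR].f = src + tmp; returns the old value (RETURN_DATA)."""
    tmp = mem[addr]
    mem[addr] = round_to_f32(round_to_f32(data) + tmp)
    return tmp


mem = {0: 16777216.0}          # 2**24, where float32 spacing becomes 2
atomic_add_f32(mem, 0, 1.0)
print(mem[0])                  # 16777216.0, the 1.0 is rounded away
```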

16.20.2. Scratch Instructions
Scratch instructions are like Flat, but assume all work-item addresses fall in scratch (private) space.

SCRATCH_LOAD_U8

16

Untyped buffer load unsigned byte, zero extend in data register.
VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 })

SCRATCH_LOAD_I8

17

Untyped buffer load signed byte, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i8))

SCRATCH_LOAD_U16

18

Untyped buffer load unsigned short, zero extend in data register.
VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 })

SCRATCH_LOAD_I16

19

Untyped buffer load signed short, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i16))

SCRATCH_LOAD_B32

20

Untyped buffer load dword.
VDATA.b = MEM[ADDR].b

SCRATCH_LOAD_B64

21

Untyped buffer load 2 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b

SCRATCH_LOAD_B96

22

Untyped buffer load 3 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b

SCRATCH_LOAD_B128

23

Untyped buffer load 4 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b;
VDATA[127 : 96] = MEM[ADDR + 12U].b

SCRATCH_STORE_B8

24

Untyped buffer store byte.
MEM[ADDR].b8 = VDATA[7 : 0]

SCRATCH_STORE_B16

25

Untyped buffer store short.
MEM[ADDR].b16 = VDATA[15 : 0]

SCRATCH_STORE_B32

26

Untyped buffer store dword.
MEM[ADDR].b = VDATA[31 : 0]

SCRATCH_STORE_B64

27

Untyped buffer store 2 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32]

SCRATCH_STORE_B96

28

Untyped buffer store 3 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64]

SCRATCH_STORE_B128

29

Untyped buffer store 4 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64];
MEM[ADDR + 12U].b = VDATA[127 : 96]

SCRATCH_LOAD_D16_U8

30

Untyped buffer load unsigned byte, use low 16 bits of data register.
VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[31:16] is preserved.

SCRATCH_LOAD_D16_I8

31

Untyped buffer load signed byte, use low 16 bits of data register.
VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[31:16] is preserved.

SCRATCH_LOAD_D16_B16

32

Untyped buffer load short, use low 16 bits of data register.
VDATA[15 : 0].b16 = MEM[ADDR].b16;
// VDATA[31:16] is preserved.

SCRATCH_LOAD_D16_HI_U8

33

Untyped buffer load unsigned byte, use high 16 bits of data register.
VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[15:0] is preserved.

SCRATCH_LOAD_D16_HI_I8

34

Untyped buffer load signed byte, use high 16 bits of data register.
VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[15:0] is preserved.

SCRATCH_LOAD_D16_HI_B16

35

Untyped buffer load short, use high 16 bits of data register.
VDATA[31 : 16].b16 = MEM[ADDR].b16;
// VDATA[15:0] is preserved.

SCRATCH_STORE_D16_HI_B8

36

Untyped buffer store byte, use high 16 bits of data register.
MEM[ADDR].b8 = VDATA[23 : 16].b8

SCRATCH_STORE_D16_HI_B16

37

Untyped buffer store short, use high 16 bits of data register.
MEM[ADDR].b16 = VDATA[31 : 16].b16

SCRATCH_LOAD_LDS_U8

45

Untyped buffer load unsigned byte, zero extend and store in LDS destination.

SCRATCH_LOAD_LDS_I8

46

Untyped buffer load signed byte, sign extend and store in LDS destination.

SCRATCH_LOAD_LDS_U16

47

Untyped buffer load unsigned short, zero extend and store in LDS destination.

SCRATCH_LOAD_LDS_I16

48

Untyped buffer load signed short, sign extend and store in LDS destination.

SCRATCH_LOAD_LDS_B32

49

Untyped buffer load dword, store in LDS destination.

16.20.3. Global Instructions
Global instructions are like Flat, but assume all work-item addresses fall in global memory space.

GLOBAL_LOAD_U8

16

Untyped buffer load unsigned byte, zero extend in data register.
VDATA.u = 32'U({ 24'0, MEM[ADDR].u8 })

GLOBAL_LOAD_I8

17

Untyped buffer load signed byte, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i8))

GLOBAL_LOAD_U16

18

Untyped buffer load unsigned short, zero extend in data register.
VDATA.u = 32'U({ 16'0, MEM[ADDR].u16 })

GLOBAL_LOAD_I16

19

Untyped buffer load signed short, sign extend in data register.
VDATA.i = 32'I(signext(MEM[ADDR].i16))

GLOBAL_LOAD_B32

20

Untyped buffer load dword.
VDATA.b = MEM[ADDR].b

GLOBAL_LOAD_B64

21

Untyped buffer load 2 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b

GLOBAL_LOAD_B96

22

Untyped buffer load 3 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b

GLOBAL_LOAD_B128

23

Untyped buffer load 4 dwords.
VDATA[31 : 0] = MEM[ADDR + 0U].b;
VDATA[63 : 32] = MEM[ADDR + 4U].b;
VDATA[95 : 64] = MEM[ADDR + 8U].b;
VDATA[127 : 96] = MEM[ADDR + 12U].b

GLOBAL_STORE_B8

24

Untyped buffer store byte.
MEM[ADDR].b8 = VDATA[7 : 0]

GLOBAL_STORE_B16

25

Untyped buffer store short.
MEM[ADDR].b16 = VDATA[15 : 0]

GLOBAL_STORE_B32

26

Untyped buffer store dword.
MEM[ADDR].b = VDATA[31 : 0]

GLOBAL_STORE_B64

27

Untyped buffer store 2 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32]

GLOBAL_STORE_B96

28

Untyped buffer store 3 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64]

GLOBAL_STORE_B128

29

Untyped buffer store 4 dwords.
MEM[ADDR + 0U].b = VDATA[31 : 0];
MEM[ADDR + 4U].b = VDATA[63 : 32];
MEM[ADDR + 8U].b = VDATA[95 : 64];
MEM[ADDR + 12U].b = VDATA[127 : 96]

GLOBAL_LOAD_D16_U8

30

Untyped buffer load unsigned byte, use low 16 bits of data register.
VDATA[15 : 0].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[31:16] is preserved.

GLOBAL_LOAD_D16_I8

31

Untyped buffer load signed byte, use low 16 bits of data register.
VDATA[15 : 0].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[31:16] is preserved.

GLOBAL_LOAD_D16_B16

32

Untyped buffer load short, use low 16 bits of data register.
VDATA[15 : 0].b16 = MEM[ADDR].b16;
// VDATA[31:16] is preserved.

GLOBAL_LOAD_D16_HI_U8

33

Untyped buffer load unsigned byte, use high 16 bits of data register.
VDATA[31 : 16].u16 = 16'U({ 8'0, MEM[ADDR].u8 });
// VDATA[15:0] is preserved.

GLOBAL_LOAD_D16_HI_I8

34

Untyped buffer load signed byte, use high 16 bits of data register.
VDATA[31 : 16].i16 = 16'I(signext(MEM[ADDR].i8));
// VDATA[15:0] is preserved.

GLOBAL_LOAD_D16_HI_B16

35

Untyped buffer load short, use high 16 bits of data register.
VDATA[31 : 16].b16 = MEM[ADDR].b16;
// VDATA[15:0] is preserved.

GLOBAL_STORE_D16_HI_B8

36

Untyped buffer store byte, use high 16 bits of data register.
MEM[ADDR].b8 = VDATA[23 : 16].b8

GLOBAL_STORE_D16_HI_B16

37

Untyped buffer store short, use high 16 bits of data register.
MEM[ADDR].b16 = VDATA[31 : 16].b16
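
The D16_HI stores take their source from the upper half of the data register: bits 23:16 for the byte store and bits 31:16 for the short store. A minimal Python sketch, assuming a bytearray memory model and little-endian byte order for the 16-bit store (an assumption, not stated here); the function names are hypothetical.

```python
def store_d16_hi_b8(mem: bytearray, addr: int, vdata: int) -> None:
    """MEM[ADDR].b8 = VDATA[23:16]"""
    mem[addr] = (vdata >> 16) & 0xFF


def store_d16_hi_b16(mem: bytearray, addr: int, vdata: int) -> None:
    """MEM[ADDR].b16 = VDATA[31:16]"""
    # Assumption: little-endian byte order in memory.
    mem[addr : addr + 2] = ((vdata >> 16) & 0xFFFF).to_bytes(2, "little")


mem = bytearray(4)
store_d16_hi_b8(mem, 0, 0x12345678)   # writes 0x34
store_d16_hi_b16(mem, 2, 0x12345678)  # writes 0x1234 as bytes 0x34, 0x12
```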

GLOBAL_LOAD_ADDTID_B32

40

Untyped buffer load dword. No VGPR address is supplied in this instruction. TID is added to the address as
shown below:
memory_Addr = sgpr_addr(64) + inst_offset(12) + tid*4

GLOBAL_STORE_ADDTID_B32

41

Untyped buffer store dword. No VGPR address is supplied in this instruction. TID is added to the address as
shown below:
memory_Addr = sgpr_addr(64) + inst_offset(12) + tid*4

GLOBAL_LOAD_LDS_ADDTID_B32

42

Untyped buffer load dword, store results to LDS. No VGPR address is supplied in this instruction. TID is added
to the address as shown below:
memory_Addr = sgpr_addr(64) + inst_offset(12) + tid*4
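
The ADDTID address formula gives each work-item its own dword, so consecutive TIDs touch consecutive 4-byte slots. A short Python sketch of the calculation; the function name `addtid_address` is hypothetical, and both the wrap to 64 bits and the use of small non-negative offsets are assumptions made for the example.

```python
def addtid_address(sgpr_addr: int, inst_offset: int, tid: int) -> int:
    """memory_Addr = sgpr_addr + inst_offset + tid * 4"""
    # Assumption: the sum is truncated to a 64-bit address.
    return (sgpr_addr + inst_offset + tid * 4) & 0xFFFFFFFFFFFFFFFF


# Addresses for the first four work-items with base 0x1000 and offset 16.
addrs = [addtid_address(0x1000, 16, tid) for tid in range(4)]
```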

GLOBAL_LOAD_LDS_U8

45

Untyped buffer load unsigned byte, zero extend and store in LDS destination.

GLOBAL_LOAD_LDS_I8

46

Untyped buffer load signed byte, sign extend and store in LDS destination.

GLOBAL_LOAD_LDS_U16

47

Untyped buffer load unsigned short, zero extend and store in LDS destination.

GLOBAL_LOAD_LDS_I16

48

Untyped buffer load signed short, sign extend and store in LDS destination.

GLOBAL_LOAD_LDS_B32

49

Untyped buffer load dword, store in LDS destination.

GLOBAL_ATOMIC_SWAP_B32

51

Swap values in data register and memory.
tmp = MEM[ADDR].b;
MEM[ADDR].b = DATA.b;
RETURN_DATA.b = tmp

GLOBAL_ATOMIC_CMPSWAP_B32

52

Compare and swap with memory value.
tmp = MEM[ADDR].b;
src = DATA[31 : 0].b;
cmp = DATA[63 : 32].b;
MEM[ADDR].b = tmp == cmp ? src : tmp;
RETURN_DATA.b = tmp
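
In the compare-and-swap above, the 64-bit DATA operand carries the new value in bits 31:0 and the comparison value in bits 63:32. The sketch below mirrors the pseudocode in Python; it models memory as a dict of 32-bit values and does not model atomicity. The function name `atomic_cmpswap_b32` is hypothetical.

```python
def atomic_cmpswap_b32(mem: dict, addr: int, data: int) -> int:
    """Store src if memory equals cmp; always return the old memory value."""
    tmp = mem[addr]
    src = data & 0xFFFFFFFF          # DATA[31:0]
    cmp = (data >> 32) & 0xFFFFFFFF  # DATA[63:32]
    mem[addr] = src if tmp == cmp else tmp
    return tmp


mem = {0x100: 7}
ret_hit = atomic_cmpswap_b32(mem, 0x100, (7 << 32) | 42)   # compare matches
after_hit = mem[0x100]
ret_miss = atomic_cmpswap_b32(mem, 0x100, (7 << 32) | 99)  # compare fails
```

A caller can tell whether the swap happened by comparing the returned value with the comparison value it supplied.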

GLOBAL_ATOMIC_ADD_U32

53

Add data register to memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u += DATA.u;
RETURN_DATA.u = tmp

GLOBAL_ATOMIC_SUB_U32

54

Subtract data register from memory value.
tmp = MEM[ADDR].u;
MEM[ADDR].u -= DATA.u;
RETURN_DATA.u = tmp

GLOBAL_ATOMIC_CSUB_U32

55

Subtract data register from memory value, clamp to zero.
declare new_value : 32'U;
old_value = MEM[ADDR].u;
if old_value < DATA.u then
    new_value = 0U
else
    new_value = old_value - DATA.u
endif;
MEM[ADDR].u = new_value;
RETURN_DATA.u = old_value
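
The difference between GLOBAL_ATOMIC_SUB_U32 and GLOBAL_ATOMIC_CSUB_U32 only shows when the data value exceeds the memory value. The Python sketch below contrasts the two under the assumption that the plain subtract wraps modulo 2**32, as is usual for 32-bit unsigned arithmetic. Memory is a dict, atomicity is not modelled, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def atomic_sub_u32(mem: dict, addr: int, data: int) -> int:
    old = mem[addr]
    mem[addr] = (old - data) & MASK32  # assumption: wraps modulo 2**32
    return old


def atomic_csub_u32(mem: dict, addr: int, data: int) -> int:
    old = mem[addr]
    mem[addr] = 0 if old < data else old - data  # clamp to zero
    return old


mem = {0: 5, 4: 5}
ret_sub = atomic_sub_u32(mem, 0, 8)    # 5 - 8 wraps
ret_csub = atomic_csub_u32(mem, 4, 8)  # 5 - 8 clamps
```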

GLOBAL_ATOMIC_MIN_I32

56

Minimum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src < tmp ? src : tmp;
RETURN_DATA.i = tmp

GLOBAL_ATOMIC_MIN_U32

57

Minimum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src < tmp ? src : tmp;
RETURN_DATA.u = tmp

GLOBAL_ATOMIC_MAX_I32

58

Maximum of two signed integer values.
tmp = MEM[ADDR].i;
src = DATA.i;
MEM[ADDR].i = src > tmp ? src : tmp;
RETURN_DATA.i = tmp

GLOBAL_ATOMIC_MAX_U32

59

Maximum of two unsigned integer values.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = src > tmp ? src : tmp;
RETURN_DATA.u = tmp
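
The signed and unsigned MIN/MAX variants apply the same comparison to different interpretations of the same 32 bits. A Python sketch of the two MIN forms, with memory modelled as a dict of raw 32-bit patterns; `as_i32` and the function names are hypothetical helpers, and atomicity is not modelled.

```python
def as_i32(bits: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits


def atomic_min_i32(mem: dict, addr: int, data: int) -> int:
    tmp = mem[addr]
    mem[addr] = data if as_i32(data) < as_i32(tmp) else tmp
    return tmp


def atomic_min_u32(mem: dict, addr: int, data: int) -> int:
    tmp = mem[addr]
    mem[addr] = data if data < tmp else tmp
    return tmp


mem = {0: 1, 4: 1}
atomic_min_i32(mem, 0, 0xFFFFFFFF)  # signed: -1 < 1, so memory is replaced
atomic_min_u32(mem, 4, 0xFFFFFFFF)  # unsigned: 4294967295 > 1, memory kept
```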

GLOBAL_ATOMIC_AND_B32

60

Bitwise AND of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp & DATA.b);
RETURN_DATA.b = tmp

GLOBAL_ATOMIC_OR_B32

61

Bitwise OR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp | DATA.b);
RETURN_DATA.b = tmp

GLOBAL_ATOMIC_XOR_B32

62

Bitwise XOR of register value and memory value.
tmp = MEM[ADDR].b;
MEM[ADDR].b = (tmp ^ DATA.b);
RETURN_DATA.b = tmp

GLOBAL_ATOMIC_INC_U32

63

Increment memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = tmp >= src ? 0U : tmp + 1U;
RETURN_DATA.u = tmp

GLOBAL_ATOMIC_DEC_U32

64

Decrement memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u;
src = DATA.u;
MEM[ADDR].u = ((tmp == 0U) || (tmp > src)) ? src : tmp - 1U;
RETURN_DATA.u = tmp
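
INC and DEC use the data register as an inclusive upper bound, so the memory value cycles through 0..DATA. The Python sketch below transcribes the two pseudocode expressions and runs each a few times to show the cycle. Memory is a dict, atomicity is not modelled, and the function names are hypothetical.

```python
def atomic_inc_u32(mem: dict, addr: int, data: int) -> int:
    tmp = mem[addr]
    mem[addr] = 0 if tmp >= data else tmp + 1
    return tmp


def atomic_dec_u32(mem: dict, addr: int, data: int) -> int:
    tmp = mem[addr]
    mem[addr] = data if (tmp == 0 or tmp > data) else tmp - 1
    return tmp


mem_inc = {0: 0}
inc_seq = [atomic_inc_u32(mem_inc, 0, 3) for _ in range(6)]

mem_dec = {0: 0}
dec_seq = [atomic_dec_u32(mem_dec, 0, 3) for _ in range(5)]
```

With a bound of 3, the values returned by INC count up to 3 and then restart at 0, while DEC starting from 0 jumps to 3 and counts down.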

GLOBAL_ATOMIC_SWAP_B64

65

Swap 64-bit values in data register and memory.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = DATA.b64;
RETURN_DATA.b64 = tmp

GLOBAL_ATOMIC_CMPSWAP_B64

66

Compare and swap with 64-bit memory value.
tmp = MEM[ADDR].b64;
src = DATA[63 : 0].b64;
cmp = DATA[127 : 64].b64;
MEM[ADDR].b64 = tmp == cmp ? src : tmp;
RETURN_DATA.b64 = tmp
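
For the 64-bit compare-and-swap, DATA is 128 bits wide: the new value occupies bits 63:0 and the comparison value bits 127:64. The sketch below shows one way to pack and unpack that operand in Python. `pack_cmpswap_data` and `atomic_cmpswap_b64` are hypothetical names, memory is a dict of 64-bit values, and atomicity is not modelled.

```python
MASK64 = (1 << 64) - 1


def pack_cmpswap_data(src: int, cmp: int) -> int:
    """Build the 128-bit DATA operand: src in bits 63:0, cmp in bits 127:64."""
    return ((cmp & MASK64) << 64) | (src & MASK64)


def atomic_cmpswap_b64(mem: dict, addr: int, data: int) -> int:
    tmp = mem[addr]
    src = data & MASK64          # DATA[63:0]
    cmp = (data >> 64) & MASK64  # DATA[127:64]
    mem[addr] = src if tmp == cmp else tmp
    return tmp


mem = {0x200: 0x1_0000_0000}
data = pack_cmpswap_data(src=0xDEAD_BEEF_0000_0001, cmp=0x1_0000_0000)
returned = atomic_cmpswap_b64(mem, 0x200, data)
```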

GLOBAL_ATOMIC_ADD_U64

67

Add data register to 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 += DATA.u64;
RETURN_DATA.u64 = tmp

GLOBAL_ATOMIC_SUB_U64

68

Subtract data register from 64-bit memory value.
tmp = MEM[ADDR].u64;
MEM[ADDR].u64 -= DATA.u64;
RETURN_DATA.u64 = tmp

GLOBAL_ATOMIC_MIN_I64

69

Minimum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src < tmp ? src : tmp;
RETURN_DATA.i64 = tmp

GLOBAL_ATOMIC_MIN_U64

70

Minimum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src < tmp ? src : tmp;
RETURN_DATA.u64 = tmp

GLOBAL_ATOMIC_MAX_I64

71

Maximum of two signed 64-bit integer values.
tmp = MEM[ADDR].i64;
src = DATA.i64;
MEM[ADDR].i64 = src > tmp ? src : tmp;
RETURN_DATA.i64 = tmp

GLOBAL_ATOMIC_MAX_U64

72

Maximum of two unsigned 64-bit integer values.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = src > tmp ? src : tmp;
RETURN_DATA.u64 = tmp

GLOBAL_ATOMIC_AND_B64

73

Bitwise AND of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp & DATA.b64);
RETURN_DATA.b64 = tmp

GLOBAL_ATOMIC_OR_B64

74

Bitwise OR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp | DATA.b64);
RETURN_DATA.b64 = tmp

GLOBAL_ATOMIC_XOR_B64

75

Bitwise XOR of register value and 64-bit memory value.
tmp = MEM[ADDR].b64;
MEM[ADDR].b64 = (tmp ^ DATA.b64);
RETURN_DATA.b64 = tmp

GLOBAL_ATOMIC_INC_U64

76

Increment 64-bit memory value with wraparound to zero when incremented to register value.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = tmp >= src ? 0ULL : tmp + 1ULL;
RETURN_DATA.u64 = tmp

GLOBAL_ATOMIC_DEC_U64

77

Decrement 64-bit memory value with wraparound to register value when decremented below zero.
tmp = MEM[ADDR].u64;
src = DATA.u64;
MEM[ADDR].u64 = ((tmp == 0ULL) || (tmp > src)) ? src : tmp - 1ULL;
RETURN_DATA.u64 = tmp

GLOBAL_ATOMIC_CMPSWAP_F32

80

Compare and swap with floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA[31 : 0].f;
cmp = DATA[63 : 32].f;
MEM[ADDR].f = tmp == cmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.
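
Because the comparison in GLOBAL_ATOMIC_CMPSWAP_F32 is performed on floating-point values rather than raw bits, it can behave differently from the B32 form. The Python sketch below assumes IEEE 754 equality (NaN never compares equal, and -0.0 equals +0.0) and does not model denormal handling; the hardware behavior should be confirmed separately. The helper names are hypothetical and memory is a dict of raw 32-bit patterns.

```python
import struct


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def atomic_cmpswap_f32(mem: dict, addr: int, src_bits: int, cmp_bits: int) -> int:
    tmp_bits = mem[addr]
    # Assumption: IEEE 754 equality on the decoded values.
    if bits_to_f32(tmp_bits) == bits_to_f32(cmp_bits):
        mem[addr] = src_bits
    return tmp_bits


NAN, NEG_ZERO, POS_ZERO, ONE = 0x7FC00000, 0x80000000, 0x00000000, 0x3F800000
mem = {0: NAN, 4: NEG_ZERO}
atomic_cmpswap_f32(mem, 0, ONE, NAN)       # NaN != NaN: no swap
atomic_cmpswap_f32(mem, 4, ONE, POS_ZERO)  # -0.0 == +0.0: swap happens
```

Under these assumptions a location holding NaN can never be replaced by the F32 compare-and-swap, whereas the B32 form would match on identical bit patterns.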

GLOBAL_ATOMIC_MIN_F32

81

Minimum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src < tmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

GLOBAL_ATOMIC_MAX_F32

82

Maximum of two floating-point values.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src > tmp ? src : tmp;
RETURN_DATA.f = tmp

Notes
Floating-point compare handles NAN/INF/denorm.

GLOBAL_ATOMIC_ADD_F32

86

Add data register to floating-point memory value.
tmp = MEM[ADDR].f;
src = DATA.f;
MEM[ADDR].f = src + tmp;
RETURN_DATA.f = tmp

Notes
Floating-point addition handles NAN/INF/denorm.
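
The sum written by GLOBAL_ATOMIC_ADD_F32 is a single-precision value, so small addends can be lost to rounding. The Python sketch below illustrates this by adding in double precision and rounding the result back to 32 bits. It assumes round-to-nearest-even and no denormal flushing, neither of which is specified in this section, and it does not handle results that overflow single precision. All helper names are hypothetical.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Round a Python float to single precision and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def atomic_add_f32(mem: dict, addr: int, data_bits: int) -> int:
    tmp_bits = mem[addr]
    total = bits_to_f32(data_bits) + bits_to_f32(tmp_bits)
    # Assumption: result rounded to nearest-even single precision.
    mem[addr] = f32_to_bits(total)
    return tmp_bits


mem = {0: f32_to_bits(16777216.0)}  # 2**24, where float32 spacing becomes 2
returned = atomic_add_f32(mem, 0, f32_to_bits(1.0))
```

Here the memory value is unchanged after the add, because 16777217 is not representable in single precision and rounds back to 16777216.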
