Microkernel naming conventions
This documents deciphers XNNPACK's microkernels naming convention.
General conventions
Microkernel function names follow this convention:
xnn_<datatype>_<microkernel><activation?>_ukernel_<parameters>__<arch>
Where <datatype>
can be:
cs16
f16
- 16-bit half precision floatf32
- 32-bit single precision floatqc8
qs8
- quantized signed 8 bitqu8
- quantized unsigned 8 bits16
u32
x8
x16
x24
x32
xx
<microkernel>
is the type of microkernel, such as:
gemm
igemm
avgpool
<activation>
if supported for the microkernel is activation that is fused into
the microkernel:
linear
minmax
relu
<parameters>
are microkernel specific, and can mean different things depending
on the microkernel (see below for details).
<arch>
is the architecture the microkernel is optimized for, and can contain
further subdivisions for additional instruction sets supported on the specified
architecture, or processor information:
scalar
aarch32_neon_cortex_a55
neonv8_mlal
wasm
avx512
avx512skx
GEMM and IGEMM microkernels
The <parameters>
for GEMM and IGEMM microkernels represent the mr
and nr
of the microkernel. You can think of it as the number of rows and columns of the
output calculated by the microkernel.
E.g. xnn_f32_gemm_minmax_ukernel_4x8__aarch32_neon_cortex_a7
processes 32
elements of the output matrix.
DWCONV microkernels
These microkernels come in 2 varieties, uni-pass and multi-pass.
Uni-pass have XpYc
in their name, where X
is the kernel tile, and Y
is the
channel tile. p
stands for primary, c
for channel.
Multi-pass have UfVmWlXcYsZr
in their name, where U
is the first pass tile,
V
is the middle pass tile, W
is the last pass tile, X
is the channel tile,
Y
is the channel subtile, and Z
is the channel round. f
stands for first,
m
for middle, l
for last, c
for channel, s
for subtile, r
for round.
The kernel size must be at least W+1
, the middle pass runs for as many
iterations as possible, and the last pass handles the remainder (at least 1).
c
, s
, r
, affects the tiling of channels. We run as many tiles of c
as
possible, followed by rounds of s
. We determine how many tiles of c
to run
based on rounding the number of channels up to r
. r
is determined based on
the natural tiling size of the microarchitecture (e.g. SSE/AVX) and the number
of elements we can read OOB (XNN_EXTRA_BYTES
).
Average Pooling and Global Average Pooling
These microkernels come in 2 varieties, uni-pass and multi-pass.
Uni-pass have Cx
in their name, where C
is a number. This microkernel
processes up to and including C
elements.
Multi-pass have CpDx
in their name, where C
and D
are numbers. This
microkernel processes D
elements in the first pass, and middle pass (which can
run multiple times), and up to C
elements in the last pass.
E.g. xnn_f32_avgpool_minmax_ukernel_9x__neon_c4
can process up to 9 elements.