# Microkernel naming conventions

This document deciphers XNNPACK's microkernel naming convention.

## General conventions

Microkernel function names follow this convention:

`xnn_<datatype>_<microkernel><activation?>_ukernel_<parameters>__<arch>`

Where `<datatype>` can be:

-   `cs16`
-   `f16` - 16-bit half precision float
-   `f32` - 32-bit single precision float
-   `qc8`
-   `qs8` - quantized signed 8 bit
-   `qu8` - quantized unsigned 8 bit
-   `s16`
-   `u32`
-   `x8`
-   `x16`
-   `x24`
-   `x32`
-   `xx`

`<microkernel>` is the type of microkernel, such as:

-   `gemm`
-   `igemm`
-   `avgpool`

`<activation>`, if supported by the microkernel, is the activation that is fused
into the microkernel:

-   `linear`
-   `minmax`
-   `relu`

`<parameters>` are microkernel specific, and can mean different things depending
on the microkernel (see below for details).

`<arch>` is the architecture the microkernel is optimized for, and can contain
further subdivisions for additional instruction sets supported on the specified
architecture, or processor information:

-   `scalar`
-   `aarch32_neon_cortex_a55`
-   `neonv8_mlal`
-   `wasm`
-   `avx512`
-   `avx512skx`
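
As an illustration of the pattern above, here is a minimal sketch that splits a
microkernel name into its fields. The helper `parse_ukernel_name` and the
`UkernelName` type are hypothetical names made up for this example and are not
part of XNNPACK. The sketch assumes the datatype is a single
underscore-separated token and only recognizes the three activations listed
above.

```python
from typing import NamedTuple, Optional

# Activations listed in this document; anything else is treated as part of
# the microkernel type.
ACTIVATIONS = {"linear", "minmax", "relu"}


class UkernelName(NamedTuple):
    datatype: str
    microkernel: str
    activation: Optional[str]
    parameters: str
    arch: str


def parse_ukernel_name(name: str) -> UkernelName:
    """Split xnn_<datatype>_<microkernel><activation?>_ukernel_<parameters>__<arch>."""
    if not name.startswith("xnn_"):
        raise ValueError(f"missing 'xnn_' prefix: {name!r}")
    head, sep, tail = name[len("xnn_"):].partition("_ukernel_")
    if not sep:
        raise ValueError(f"missing '_ukernel_' separator: {name!r}")
    # The first double underscore separates the parameters from the arch.
    parameters, sep, arch = tail.partition("__")
    if not sep:
        raise ValueError(f"missing '__<arch>' suffix: {name!r}")
    # Assumption: the datatype is the first underscore-separated token.
    datatype, _, rest = head.partition("_")
    tokens = rest.split("_")
    activation = None
    if len(tokens) > 1 and tokens[-1] in ACTIVATIONS:
        activation = tokens.pop()
    return UkernelName(datatype, "_".join(tokens), activation, parameters, arch)


parsed = parse_ukernel_name(
    "xnn_f32_gemm_minmax_ukernel_4x8__aarch32_neon_cortex_a7")
print(parsed)
```

Everything after the first `__` is kept as one `arch` string, since the
subdivisions within it vary from microkernel to microkernel.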

## GEMM and IGEMM microkernels

The `<parameters>` for GEMM and IGEMM microkernels represent the `mr` and `nr`
of the microkernel. You can think of them as the number of rows and columns of
the output tile calculated by the microkernel.

E.g. `xnn_f32_gemm_minmax_ukernel_4x8__aarch32_neon_cortex_a7` has `mr = 4` and
`nr = 8`, so it computes a 4x8 tile (32 elements) of the output matrix.
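
The arithmetic in the example can be sketched as follows. The helper
`gemm_tile` is a hypothetical name, and the sketch assumes the parameter field
has exactly the `<mr>x<nr>` form described here, with no further suffix.

```python
import re
from typing import Tuple


def gemm_tile(parameters: str) -> Tuple[int, int]:
    """Return (mr, nr) from a GEMM/IGEMM parameter field such as '4x8'."""
    match = re.fullmatch(r"(\d+)x(\d+)", parameters)
    if match is None:
        raise ValueError(f"not an <mr>x<nr> field: {parameters!r}")
    return int(match.group(1)), int(match.group(2))


mr, nr = gemm_tile("4x8")
print(mr * nr)  # 32 output elements per tile
```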

## DWCONV microkernels

These microkernels come in 2 varieties, uni-pass and multi-pass.

Uni-pass microkernels have `XpYc` in their name, where `X` is the kernel tile
and `Y` is the channel tile. `p` stands for primary, `c` for channel.
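
A minimal sketch of reading a uni-pass parameter field is shown below. The
helper `parse_dwconv_unipass` and the sample field `9p8c` are illustrative
assumptions, not names taken from XNNPACK.

```python
import re
from typing import Tuple


def parse_dwconv_unipass(parameters: str) -> Tuple[int, int]:
    """Return (kernel_tile, channel_tile) from an 'XpYc' field."""
    match = re.fullmatch(r"(\d+)p(\d+)c", parameters)
    if match is None:
        raise ValueError(f"not an XpYc field: {parameters!r}")
    return int(match.group(1)), int(match.group(2))


# Illustrative field: kernel tile 9, channel tile 8.
kernel_tile, channel_tile = parse_dwconv_unipass("9p8c")
print(kernel_tile, channel_tile)
```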

Multi-pass microkernels have `UfVmWlXcYsZr` in their name, where `U` is the
first pass tile, `V` is the middle pass tile, `W` is the last pass tile, `X` is
the channel tile, `Y` is the channel subtile, and `Z` is the channel round. `f`
stands for first, `m` for middle, `l` for last, `c` for channel, `s` for
subtile, `r` for round.

The kernel size must be at least `W+1`, the middle pass runs for as many
iterations as possible, and the last pass handles the remainder (at least 1).

`c`, `s`, and `r` affect the tiling of channels. We run as many tiles of `c` as
possible, followed by rounds of `s`. We determine how many tiles of `c` to run
based on rounding the number of channels up to `r`. `r` is determined based on
the natural tiling size of the microarchitecture (e.g. SSE/AVX) and the number
of elements we can read OOB (`XNN_EXTRA_BYTES`).
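
The multi-pass field can be decoded the same way. The sketch below only
extracts the six numbers; it does not model how XNNPACK schedules the passes or
tiles the channels. `DwconvMultipass`, `parse_dwconv_multipass`, and the sample
field `5f5m5l8c4s4r` are hypothetical names chosen for illustration.

```python
import re
from typing import NamedTuple


class DwconvMultipass(NamedTuple):
    first_pass_tile: int   # U, suffix 'f'
    middle_pass_tile: int  # V, suffix 'm'
    last_pass_tile: int    # W, suffix 'l'
    channel_tile: int      # X, suffix 'c'
    channel_subtile: int   # Y, suffix 's'
    channel_round: int     # Z, suffix 'r'


def parse_dwconv_multipass(parameters: str) -> DwconvMultipass:
    """Decode a 'UfVmWlXcYsZr' field into its six numbers."""
    match = re.fullmatch(r"(\d+)f(\d+)m(\d+)l(\d+)c(\d+)s(\d+)r", parameters)
    if match is None:
        raise ValueError(f"not a UfVmWlXcYsZr field: {parameters!r}")
    return DwconvMultipass(*(int(group) for group in match.groups()))


tiles = parse_dwconv_multipass("5f5m5l8c4s4r")
print(tiles)
```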

## Average Pooling and Global Average Pooling

These microkernels come in 2 varieties, uni-pass and multi-pass.

Uni-pass microkernels have `Cx` in their name, where `C` is a number. This
microkernel processes up to and including `C` elements.

Multi-pass microkernels have `CpDx` in their name, where `C` and `D` are
numbers. This microkernel processes `D` elements in the first pass and in each
middle pass (which can run multiple times), and up to `C` elements in the last
pass.

E.g. `xnn_f32_avgpool_minmax_ukernel_9x__neon_c4` can process up to 9 elements.
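
A minimal sketch that distinguishes the two varieties from the parameter field
is shown below. The helper `parse_pooling_params` is a hypothetical name, and
the multi-pass field `9p8x` is used purely as an illustration of the `CpDx`
form.

```python
import re
from typing import Optional, Tuple


def parse_pooling_params(parameters: str) -> Tuple[int, Optional[int]]:
    """Return (C, D) for a 'Cx' or 'CpDx' field; D is None for uni-pass."""
    match = re.fullmatch(r"(\d+)(?:p(\d+))?x", parameters)
    if match is None:
        raise ValueError(f"not a Cx or CpDx field: {parameters!r}")
    c = int(match.group(1))
    d = int(match.group(2)) if match.group(2) is not None else None
    return c, d


print(parse_pooling_params("9x"))    # uni-pass: (9, None)
print(parse_pooling_params("9p8x"))  # multi-pass: (9, 8)
```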