# Depthwise convolution microkernels

This document describes how depthwise convolution (DWCONV) microkernels work.

All depthwise convolution microkernels live in `src/*-dwconv`, e.g.
[`src/f32-dwconv`](https://github.com/google/XNNPACK/tree/master/src/f32-dwconv).

The simplest microkernel to look at is probably
[`f32-dwconv-up2x3-scalar.c`](../src/f32-dwconv/gen/f32-dwconv-up2x3-scalar.c).

Key parameters:

- channel tile, how many channels the microkernel processes in each iteration
- kernel tile, how many kernel elements the microkernel reads in each
  iteration, where each kernel element holds one weight per channel. This can
  be greater than the actual number of kernel elements.

## High level description

Each call to the DWCONV microkernel produces 1 row of output.

For each element of this row of output, DWCONV produces `channel_tile`
outputs per iteration of the main loop, with a separate loop (the remainder
loop) to handle the channels left over.

In each iteration of the main loop, the microkernel reads `channel_tile` biases, `channel_tile * kernel_tile`
inputs, `channel_tile * kernel_tile` weights, and, optionally, `channel_tile` per-channel scales,
performs the convolution, then writes `channel_tile` outputs.

In the remainder loop, the microkernel reads `remainder_channels` biases,
`remainder_channels * kernel_tile` inputs, `remainder_channels * kernel_tile`
weights, performs the convolution, and writes `remainder_channels` outputs.

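The main loop and remainder loop described above can be sketched for a single output element as follows. This is a minimal illustration, not the generated XNNPACK code: the function name `dwconv_pixel_2x3` is hypothetical, `channel_tile` is fixed at 2 and `kernel_tile` at 3, and the weights are assumed to be stored per channel tile as 2 biases followed by 3 groups of 2 weights (kernel element major, channel minor). Per-channel scales and output clamping are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of XNNPACK): computes one output element of a
 * depthwise convolution with channel_tile = 2 and kernel_tile = 3.
 * taps[k] points to the input channels seen by kernel element k.
 * Assumed weight layout per channel tile: 2 biases, then for each of the 3
 * kernel elements, 2 weights (one per channel of the tile). */
static void dwconv_pixel_2x3(size_t channels,
                             const float** taps,
                             const float* weights,
                             float* output) {
  assert(channels != 0);
  const float* i0 = taps[0];
  const float* i1 = taps[1];
  const float* i2 = taps[2];
  const float* w = weights;
  size_t c = channels;

  /* Main loop: channel_tile (2) channels per iteration. */
  for (; c >= 2; c -= 2) {
    float acc0 = w[0];  /* bias of first channel in the tile */
    float acc1 = w[1];  /* bias of second channel in the tile */
    acc0 += i0[0] * w[2];
    acc1 += i0[1] * w[3];
    acc0 += i1[0] * w[4];
    acc1 += i1[1] * w[5];
    acc0 += i2[0] * w[6];
    acc1 += i2[1] * w[7];
    i0 += 2;
    i1 += 2;
    i2 += 2;
    w += 8;  /* 2 biases + 2 * 3 weights */
    output[0] = acc0;
    output[1] = acc1;
    output += 2;
  }

  /* Remainder loop: with channel_tile = 2 at most one channel is left.
   * It reads only its own lane of the (padded) last tile. */
  if (c != 0) {
    float acc = w[0];
    acc += i0[0] * w[2];
    acc += i1[0] * w[4];
    acc += i2[0] * w[6];
    output[0] = acc;
  }
}
```

With 3 channels, the first two are handled by one main-loop iteration and the third by the remainder loop, which skips the padded second lane of the last tile.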
## Microkernel arguments

```
void xnn_f32_dwconv_ukernel_up2x3__scalar(
    size_t channels,
    size_t output_width,
    const float** input,
    const float* weights,
    float* output,
    size_t input_stride,
    size_t output_increment,
    size_t input_offset,
    const float* zero,
    const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)])
```

- `channels`, number of output channels to compute
- `output_width`, number of produced pixels
- `input`, pointer to input indirection buffer
- `weights`, pointer to weights
- `output`, pointer to output
- `input_stride`, number of bytes to add to the indirection buffer to advance to the input pointers corresponding to the
  next output element
- `output_increment`, number of bytes to get to the next output element
- `input_offset`, offset to add to pointers from indirection buffer, unless these pointers match the zero pointer
- `zero`, pointer to zero buffer
- `params`, min/max values for clamping the output

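To make the roles of `input_stride`, `output_increment`, `input_offset`, and `zero` more concrete, here is a simplified sketch of the loop over output pixels. It rests on several assumptions: the name `dwconv_row_sketch` is hypothetical, there is a single channel and 3 kernel elements, all weights are 1 so that only the pointer handling remains, and `output_increment` is applied after the channel writes have already advanced the output pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified row driver (not the XNNPACK microkernel):
 * one channel, 3 kernel elements, unit weights. */
static void dwconv_row_sketch(size_t output_width,
                              const float** input,
                              float* output,
                              size_t input_stride,
                              size_t output_increment,
                              size_t input_offset,
                              const float* zero) {
  assert(output_width != 0);
  do {
    float acc = 0.0f;
    for (size_t k = 0; k < 3; k++) {
      const float* i = input[k];
      /* Pointers to the zero buffer are used as-is; every other pointer is
       * shifted by input_offset bytes. */
      if (i != zero) {
        i = (const float*) ((uintptr_t) i + input_offset);
      }
      acc += i[0];
    }
    /* Advance to the pointers of the next output pixel (a byte count). */
    input = (const float**) ((uintptr_t) input + input_stride);

    /* The channel loop advances the output by `channels` floats (1 here). */
    *output++ = acc;
    /* Assumption: the increment covers the rest of the distance (in bytes)
     * to the next output pixel. */
    output = (float*) ((uintptr_t) output + output_increment);
  } while (--output_width != 0);
}
```

Because `input_offset` is added at run time, the same indirection buffer can be reused for another image that sits at a fixed byte offset from the first, while padding pixels keep pointing at the zero buffer.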
## Packing

Based on the high level description of the microkernel, we have to pack the
weights such that, for each channel tile, we have:

- `channel_tile` biases
- `channel_tile * kernel_tile` weights

This group is repeated once per channel tile, i.e.
`round_up(channels, channel_tile) / channel_tile` times.

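A packing routine for this layout might look like the sketch below. It is not XNNPACK's packing code: the name `pack_dwconv_weights_sketch` is hypothetical, the unpacked kernel is assumed to be stored as `kernel[k * channels + c]`, the weights inside a tile are assumed to be ordered kernel element first and channel second, and padded channels or kernel elements are filled with zeros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packing helper (not XNNPACK's packing routine).
 * bias[c] and kernel[k * channels + c] are the unpacked parameters.
 * kernel_size is the actual number of kernel elements (<= kernel_tile).
 * packed must hold
 *   ceil(channels / channel_tile) * channel_tile * (1 + kernel_tile) floats.
 * Returns the number of floats written. */
static size_t pack_dwconv_weights_sketch(size_t channels,
                                         size_t channel_tile,
                                         size_t kernel_size,
                                         size_t kernel_tile,
                                         const float* bias,
                                         const float* kernel,
                                         float* packed) {
  assert(channel_tile != 0);
  assert(kernel_size <= kernel_tile);
  size_t n = 0;
  for (size_t c0 = 0; c0 < channels; c0 += channel_tile) {
    /* channel_tile biases, zero-padded past the last channel. */
    for (size_t i = 0; i < channel_tile; i++) {
      const size_t c = c0 + i;
      packed[n++] = (c < channels) ? bias[c] : 0.0f;
    }
    /* channel_tile * kernel_tile weights, zero-padded the same way. */
    for (size_t k = 0; k < kernel_tile; k++) {
      for (size_t i = 0; i < channel_tile; i++) {
        const size_t c = c0 + i;
        const int valid = (c < channels) && (k < kernel_size);
        packed[n++] = valid ? kernel[k * channels + c] : 0.0f;
      }
    }
  }
  return n;
}
```

For 3 channels with `channel_tile = 2` and `kernel_tile = 3`, this writes 2 groups of 8 floats, and the second group has its second lane padded with zeros.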
## Indirection buffer

The indirection buffer is packed such that the `kernel_tile` input pointers
required for computing a single output element are adjacent to each
other. A simple way to pack it is:

```
input   kernel   output

ABC     ab       WX
DEF     cd       YZ
GHI

uncompressed indirection buffer for first row of output
ABDEBCEF
```

This requires `kernel_tile * output_width` pointers for each row of output.

We can compress this if we pack the input pointers column first:

```
column first uncompressed:
ADBEBECF
```

Notice that `BE` is repeated. We can elide the repetition, provided that we
tell the microkernel how far to skip to get to the input pointers for the next
output element. That is what `input_stride` is for: the skip is no longer
`kernel_tile` pointers. In this example it is one kernel column (2 pointers)
instead of `kernel_tile` (4 pointers).

```
column first compressed:
ADBECF
```

The weights similarly have to be packed column first.
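The compressed, column-first layout from the example can be reproduced with a short sketch. The helper name `build_indirection_row_sketch` is hypothetical, characters stand in for input pixels (a real buffer would hold pointers to the channel data of each pixel), and the convolution is assumed to have stride 1 and no padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of XNNPACK): builds the compressed,
 * column-first indirection buffer for one row of output.
 * pixels is a row-major image with input_width columns.
 * Writes (output_width + kernel_width - 1) * kernel_height pointers into
 * buffer and returns that count. */
static size_t build_indirection_row_sketch(const char* pixels,
                                           size_t input_width,
                                           size_t output_y,
                                           size_t output_width,
                                           size_t kernel_height,
                                           size_t kernel_width,
                                           const char** buffer) {
  assert(output_width + kernel_width - 1 <= input_width);
  size_t n = 0;
  /* Walk the input columns touched by this output row once each; adjacent
   * output pixels share all but one of their columns. */
  for (size_t x = 0; x < output_width + kernel_width - 1; x++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      buffer[n++] = &pixels[(output_y + ky) * input_width + x];
    }
  }
  return n;
}
```

For the 3x3 input `ABCDEFGHI` and the 2x2 kernel above, the first output row yields the 6 pointers `ADBECF`. The pointers for output pixel `x` start at `buffer + x * kernel_height`, so under these assumptions `input_stride` would be `kernel_height * sizeof(void*)` bytes.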