Depthwise convolution microkernels
This document describes how depthwise convolution (DWCONV) microkernels work.
All depthwise convolution microkernels live in src/*-dwconv, e.g. src/f32-dwconv.
The simplest microkernel to look at is probably f32-dwconv-up2x3-scalar.c.
Key parameters:
- channel tile, how many channels the microkernel can process in each iteration
- kernel tile, how many kernel elements (weights) the microkernel reads in each iteration, where each kernel element holds one value per channel. This can be greater than the actual number of kernel elements.
High level description
Each call to the DWCONV microkernel produces 1 row of output.
For each element of this row of output, DWCONV produces channel_tile outputs
per iteration of the main loop, with a separate loop to handle remainders
(remainder loop).
In each iteration of the main loop, the microkernel reads channel_tile biases,
channel_tile * kernel_tile inputs, channel_tile * kernel_tile weights, and,
optionally, channel_tile per-channel scales, performs the convolution, then
writes channel_tile outputs.
In the remainder loop, the microkernel reads remainder_channels biases,
remainder_channels * kernel_tile inputs, remainder_channels * kernel_tile
weights, performs the convolution, and writes remainder_channels outputs.
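The loop structure above can be sketched in plain C for a single output element, with channel_tile = 2 and kernel_tile = 3. This is an illustration, not the actual microkernel: the function name dwconv_one_element_sketch is hypothetical, clamping and per-channel scales are omitted, the input pointers are assumed to be already adjusted, and the packed weights are assumed to store, for each channel tile, channel_tile biases followed by kernel_tile groups of channel_tile weights.

```c
#include <stddef.h>

enum { CHANNEL_TILE = 2, KERNEL_TILE = 3 };

/* Hypothetical helper: computes all channels of ONE output element.
 * input   : KERNEL_TILE pointers, each to `channels` floats (one input pixel)
 * weights : packed as [CHANNEL_TILE biases][KERNEL_TILE x CHANNEL_TILE weights]
 *           per channel tile (assumed layout)
 * output  : `channels` floats */
static void dwconv_one_element_sketch(
    size_t channels,
    const float** input,
    const float* weights,
    float* output)
{
  const float* w = weights;
  size_t c = 0;

  /* Main loop: CHANNEL_TILE channels per iteration. */
  for (; c + CHANNEL_TILE <= channels; c += CHANNEL_TILE) {
    float acc[CHANNEL_TILE];
    for (size_t t = 0; t < CHANNEL_TILE; t++) {
      acc[t] = w[t];  /* start from the bias */
    }
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      for (size_t t = 0; t < CHANNEL_TILE; t++) {
        acc[t] += input[k][c + t] * w[CHANNEL_TILE * (1 + k) + t];
      }
    }
    for (size_t t = 0; t < CHANNEL_TILE; t++) {
      output[c + t] = acc[t];
    }
    w += CHANNEL_TILE * (1 + KERNEL_TILE);
  }

  /* Remainder loop: fewer than CHANNEL_TILE channels are left. The packed
   * tile still has CHANNEL_TILE slots, so the indexing is unchanged. */
  const size_t remainder_channels = channels - c;
  for (size_t t = 0; t < remainder_channels; t++) {
    float acc = w[t];
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      acc += input[k][c + t] * w[CHANNEL_TILE * (1 + k) + t];
    }
    output[c + t] = acc;
  }
}
```

The real microkernel repeats this for every element of the output row, advancing the indirection buffer and the output pointer between elements.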
Microkernel arguments
void xnn_f32_dwconv_ukernel_up2x3__scalar(
    size_t channels,
    size_t output_width,
    const float** input,
    const float* weights,
    float* output,
    size_t input_stride,
    size_t output_increment,
    size_t input_offset,
    const float* zero,
    const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)])
- channels, number of output channels to compute
- output_width, number of produced pixels
- input, pointer to input indirection buffer
- weights, pointer to weights
- output, pointer to output
- input_stride, number of bytes to add to the indirection buffer to advance to the input pointers corresponding to the next output element
- output_increment, number of bytes to get to the next output element
- input_offset, offset to add to pointers from indirection buffer, unless these pointers match the zero pointer
- zero, pointer to zero buffer
- params, min/max values for clamping the output
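As a rough illustration of how input_offset, zero, and input_stride are meant to be used, the sketch below shows the pointer arithmetic these descriptions imply. The helper names apply_input_offset_sketch and advance_indirection_sketch are hypothetical and exist only for this example. Both quantities are treated as byte counts, as stated in the argument list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a pointer read from the indirection buffer is shifted
 * by input_offset bytes, unless it is the zero buffer, which is used as-is. */
static const float* apply_input_offset_sketch(
    const float* pointer, size_t input_offset, const float* zero)
{
  if (pointer != zero) {
    pointer = (const float*) ((uintptr_t) pointer + input_offset);
  }
  return pointer;
}

/* Hypothetical helper: after one output element is produced, the indirection
 * buffer pointer moves forward by input_stride bytes. */
static const float** advance_indirection_sketch(
    const float** input, size_t input_stride)
{
  return (const float**) ((uintptr_t) input + input_stride);
}
```

Leaving the zero pointer untouched lets padding pixels share one zero buffer no matter which offset is applied to the real input.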
Packing
Based on the high level description of the microkernel, we have to pack the weights such that we have:
- channel_tile biases
- channel_tile * kernel_tile weights
This group is repeated round_up(channels, channel_tile) / channel_tile times, once per channel tile.
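A minimal packing routine matching this layout might look like the sketch below. The name pack_dwconv_weights_sketch is hypothetical. The sketch assumes the unpacked kernel is stored as [kernel_size][channels] with kernel elements already in the order the microkernel consumes them, and it fills unused slots (channels past the end, kernel elements past kernel_size) with zeros, which is a choice made for this example only.

```c
#include <stddef.h>

/* Hypothetical helper. `packed` must hold
 * round_up(channels, channel_tile) * (1 + kernel_tile) floats. */
static void pack_dwconv_weights_sketch(
    size_t channels,
    size_t kernel_size,
    size_t channel_tile,
    size_t kernel_tile,
    const float* bias,    /* [channels] */
    const float* kernel,  /* [kernel_size][channels], assumed layout */
    float* packed)
{
  for (size_t c0 = 0; c0 < channels; c0 += channel_tile) {
    /* channel_tile biases */
    for (size_t t = 0; t < channel_tile; t++) {
      const size_t c = c0 + t;
      *packed++ = (c < channels) ? bias[c] : 0.0f;
    }
    /* channel_tile * kernel_tile weights, one kernel element at a time */
    for (size_t k = 0; k < kernel_tile; k++) {
      for (size_t t = 0; t < channel_tile; t++) {
        const size_t c = c0 + t;
        const int in_range = (k < kernel_size) && (c < channels);
        *packed++ = in_range ? kernel[k * channels + c] : 0.0f;
      }
    }
  }
}
```

With channels = 3 and channel_tile = 2 this writes two groups of 2 + 2 * kernel_tile floats, so the remainder loop can read its weights with the same indexing as the main loop.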
Indirection buffer
The indirection buffer is packed such that the kernel_tile input pointers
required for computing a single output element are adjacent to each other.
A simple way to pack it is then:
input   kernel   output
ABC     ab       WX
DEF     cd       YZ
GHI
uncompressed indirection buffer for first row of output
ABDEBCEF
This requires kernel_tile * output_width pointers.
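The simple packing above can be reproduced with a short sketch. To keep the example readable, each input pixel is represented by a single character rather than a block of channel values, and the convolution is assumed to have unit stride and no padding. The function name fill_indirection_simple_sketch is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: fills the uncompressed indirection buffer for one row
 * of output. For each output element, the kernel window is walked row by
 * row, so kernel_height * kernel_width pointers are written per element. */
static void fill_indirection_simple_sketch(
    const char* input,
    size_t input_width,
    size_t kernel_height,
    size_t kernel_width,
    size_t output_y,
    size_t output_width,
    const char** indirection)
{
  for (size_t ox = 0; ox < output_width; ox++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      for (size_t kx = 0; kx < kernel_width; kx++) {
        *indirection++ = &input[(output_y + ky) * input_width + (ox + kx)];
      }
    }
  }
}
```

For the 3x3 input ABCDEFGHI and a 2x2 kernel, the first output row yields pointers to A, B, D, E for W and B, C, E, F for X, which is the sequence ABDEBCEF shown above.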
We can compress this if we pack the input pointers column first:
column first uncompressed:
ADBEBECF
Notice that BE is repeated, so we can elide it, provided that we tell the
microkernel how far to skip to get to the input pointers for the next output
element (it is not just kernel_tile). That is what input_stride is for.
column first compressed:
ADBECF
The weights similarly have to be packed column first.
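The compressed column-first packing can be sketched in the same spirit. As before, one character stands in for one input pixel, unit stride and no padding are assumed, and the function name fill_indirection_compressed_sketch is hypothetical. Under these assumptions each input column is written once, and consecutive output elements start kernel_height pointers apart, so the function returns that distance in bytes as the input_stride.

```c
#include <stddef.h>

/* Hypothetical helper: fills the compressed, column-first indirection buffer
 * for one row of output and returns the input_stride in bytes.
 * `indirection` must hold (output_width + kernel_width - 1) * kernel_height
 * pointers. */
static size_t fill_indirection_compressed_sketch(
    const char* input,
    size_t input_width,
    size_t kernel_height,
    size_t kernel_width,
    size_t output_y,
    size_t output_width,
    const char** indirection)
{
  /* Input columns touched by this output row (unit stride, no padding). */
  const size_t columns = output_width + kernel_width - 1;
  for (size_t x = 0; x < columns; x++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      *indirection++ = &input[(output_y + ky) * input_width + x];
    }
  }
  /* The next output element starts one column, i.e. kernel_height pointers,
   * further into the buffer. */
  return kernel_height * sizeof(const char*);
}
```

For the example above this writes ADBECF. The first output element reads the four pointers ADBE, and after skipping input_stride bytes the second one reads BECF.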