# Depthwise convolution microkernels

This document describes how depthwise convolution (DWCONV) microkernels work.

All depthwise convolution microkernels live in `src/*-dwconv`, e.g. [`src/f32-dwconv`](https://github.com/google/XNNPACK/tree/master/src/f32-dwconv). The simplest microkernel to look at is probably [`f32-dwconv-up2x3-scalar.c`](../src/f32-dwconv/gen/f32-dwconv-up2x3-scalar.c).

Key parameters:

- channel tile, how many channels the microkernel can process in each iteration
- kernel tile, how many weights (kernel elements, each element is # channels values) the microkernel reads in each iteration. This can be greater than the actual number of kernel elements.

## High level description

Each call to the DWCONV microkernel will produce 1 row of output.

For each element of this row of output, DWCONV will produce `channel_tile` number of outputs in the main loop, with a separate loop to handle remainders (remainder loop).

In each iteration of the main loop, the microkernel will read `channel_tile` biases, `channel_tile * kernel_tile` inputs, `channel_tile * kernel_tile` weights, and, optionally, `channel_tile` of per-channel scales, perform the convolution, then write `channel_tile` outputs.

In the remainder loop, the microkernel will read `remainder_channels` biases, `remainder_channels * kernel_tile` inputs, `remainder_channels * kernel_tile` weights, perform the convolution, and write `remainder_channels` outputs.
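The main loop and remainder loop described above can be sketched in plain C. This is a simplified illustration, not the generated XNNPACK source: the function name `dwconv_up2x3_sketch` and the constants `CHANNEL_TILE` and `KERNEL_TILE` are hypothetical, clamping and per-channel scales are left out, and the sketch assumes that `up2x3` means a channel tile of 2 and a kernel tile of 3, that the weights for a partial channel tile are padded to `CHANNEL_TILE`, and that `output_increment` is added after the `channels` outputs of one element have been written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHANNEL_TILE = 2, KERNEL_TILE = 3 };  // assumed meaning of "up2x3"

// Hypothetical, simplified DWCONV sketch: no clamping, no per-channel scales.
// Packed weights, per group of CHANNEL_TILE channels:
//   CHANNEL_TILE biases, then KERNEL_TILE * CHANNEL_TILE weights.
void dwconv_up2x3_sketch(
    size_t channels, size_t output_width,
    const float** input, const float* weights, float* output,
    size_t input_stride, size_t output_increment,
    size_t input_offset, const float* zero)
{
  assert(channels != 0);
  assert(output_width != 0);
  do {
    // Fetch the KERNEL_TILE input pointers for this output element.
    // Pointers equal to the zero buffer are used as-is (no offset).
    const float* i[KERNEL_TILE];
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      i[k] = input[k];
      if (i[k] != zero) {
        i[k] = (const float*) ((uintptr_t) i[k] + input_offset);
      }
    }
    // Advance to the pointers of the next output element (stride in bytes).
    input = (const float**) ((uintptr_t) input + input_stride);

    const float* w = weights;
    size_t c = channels;

    // Main loop: CHANNEL_TILE channels per iteration.
    for (; c >= CHANNEL_TILE; c -= CHANNEL_TILE) {
      float acc[CHANNEL_TILE];
      for (size_t t = 0; t < CHANNEL_TILE; t++) acc[t] = w[t];  // biases
      w += CHANNEL_TILE;
      for (size_t k = 0; k < KERNEL_TILE; k++) {
        for (size_t t = 0; t < CHANNEL_TILE; t++) acc[t] += i[k][t] * w[t];
        i[k] += CHANNEL_TILE;
        w += CHANNEL_TILE;
      }
      for (size_t t = 0; t < CHANNEL_TILE; t++) *output++ = acc[t];
    }

    // Remainder loop: fewer than CHANNEL_TILE channels are left.
    if (c != 0) {
      float acc[CHANNEL_TILE];
      for (size_t t = 0; t < c; t++) acc[t] = w[t];  // biases
      for (size_t k = 0; k < KERNEL_TILE; k++) {
        for (size_t t = 0; t < c; t++) {
          acc[t] += i[k][t] * w[(k + 1) * CHANNEL_TILE + t];
        }
      }
      for (size_t t = 0; t < c; t++) *output++ = acc[t];
    }

    output = (float*) ((uintptr_t) output + output_increment);
  } while (--output_width != 0);
}
```

With 3 channels, the sketch runs the main loop once (channels 0 and 1) and the remainder loop once (channel 2) for every output element. The zero buffer has to hold at least as many floats as the microkernel reads per pointer.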
## Microkernel arguments

```
void xnn_f32_dwconv_ukernel_up2x3__scalar(
    size_t channels,
    size_t output_width,
    const float** input,
    const float* weights,
    float* output,
    size_t input_stride,
    size_t output_increment,
    size_t input_offset,
    const float* zero,
    const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)])
```

- `channels`, number of output channels to compute
- `output_width`, number of produced pixels
- `input`, pointer to input indirection buffer
- `weights`, pointer to weights
- `output`, pointer to output
- `input_stride`, number of bytes to add to the indirection buffer to advance to the input pointers corresponding to the next output element
- `output_increment`, number of bytes to get to the next output element
- `input_offset`, offset to add to pointers from indirection buffer, unless these pointers match the zero pointer
- `zero`, pointer to zero buffer
- `params`, min/max values for clamping the output

## Packing

Based on the high level description of the microkernel, we have to pack the weights such that we have:

- `channel_tile` biases
- `channel_tile * kernel_tile` weights

This group is repeated `round_up(channels, channel_tile) / channel_tile` times, once per channel tile.

## Indirection buffer

The indirection buffer is packed such that the `kernel_tile` pointers to input required for computing a single output element are adjacent to each other. Each pointer refers to one input pixel, that is, to all channels of that pixel.

A simple way to pack it is then:

```
input   kernel  output
ABC     ab      WX
DEF     cd      YZ
GHI

uncompressed indirection buffer for first row of output
ABDEBCEF
```

This requires `kernel_tile * output_width` pointers. We can compress this if we pack the input pointers column first:

```
column first uncompressed:
ADBEBECF
```

Notice that `BE` is repeated. We can elide the repeat, provided that we tell the microkernel how far to skip to get to the input pointers for the next output element. That distance is not just `kernel_tile` pointers (in this example it is 2 pointers rather than 4); this is what `input_stride` is for.
```
column first compressed:
ADBECF
```

The weights similarly have to be packed column first.
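The column-first layouts above can be reproduced with a short C sketch. Both helpers are hypothetical and are not XNNPACK functions. `build_indirection_row_sketch` assumes a stride-1 convolution without padding or dilation, and `pack_kernel_column_first_sketch` shows only the column-first element order for a single channel; the real packed layout also interleaves biases and channel tiles as described under Packing.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: fill the compressed, column-first indirection buffer
// for one output row (assumes stride 1, no padding, no dilation).
// pixels[y * input_width + x] points at the channels of input pixel (x, y).
// Returns the number of pointers written:
//   kernel_height * (output_width + kernel_width - 1).
size_t build_indirection_row_sketch(
    const float** indirection, const float* const* pixels,
    size_t input_width, size_t output_y, size_t output_width,
    size_t kernel_height, size_t kernel_width)
{
  assert(kernel_height != 0 && kernel_width != 0 && output_width != 0);
  const size_t columns = output_width + kernel_width - 1;
  size_t n = 0;
  for (size_t x = 0; x < columns; x++) {            // one input column at a time
    for (size_t ky = 0; ky < kernel_height; ky++) { // walk down the column
      indirection[n++] = pixels[(output_y + ky) * input_width + x];
    }
  }
  return n;
}

// Under the same assumptions, consecutive output elements start
// kernel_height pointers apart in the compressed buffer.
size_t input_stride_bytes_sketch(size_t kernel_height)
{
  return kernel_height * sizeof(const float*);
}

// Hypothetical helper: reorder a row-major kernel (a b / c d) into
// column-first order (a c b d) for a single channel.
void pack_kernel_column_first_sketch(
    float* packed, const float* kernel,
    size_t kernel_height, size_t kernel_width)
{
  for (size_t kx = 0; kx < kernel_width; kx++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      packed[kx * kernel_height + ky] = kernel[ky * kernel_width + kx];
    }
  }
}
```

For the 3x3 input and 2x2 kernel of the example, the first helper writes 6 pointers in the order `ADBECF`. The pointers for `W` start at index 0 and those for `X` at index 2, so under these assumptions `input_stride` is 2 pointers expressed in bytes.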