# Depthwise convolution microkernels

This document describes how depthwise convolution (DWCONV) microkernels work.

All depthwise convolution microkernels live in `src/*-dwconv`, e.g.
[`src/f32-dwconv`](https://github.com/google/XNNPACK/tree/master/src/f32-dwconv).

The simplest microkernel to look at is probably
[`f32-dwconv-up2x3-scalar.c`](../src/f32-dwconv/gen/f32-dwconv-up2x3-scalar.c).

Key parameters:

- channel tile, the number of channels the microkernel processes in each iteration
- kernel tile, the number of kernel elements (weights) the microkernel reads in each iteration, where each kernel
  element holds one value per channel. This can be greater than the actual number of kernel elements.

## High level description

Each call to the DWCONV microkernel will produce 1 row of output.

For each element of this output row, DWCONV produces `channel_tile` outputs per
iteration of the main loop, with a separate loop (the remainder loop) handling
the channels left over when `channels` is not a multiple of `channel_tile`.

In each iteration of the main loop, the microkernel reads `channel_tile` biases, `channel_tile * kernel_tile`
inputs, `channel_tile * kernel_tile` weights, and, optionally, `channel_tile` per-channel scales,
performs the convolution, then writes `channel_tile` outputs.

In the remainder loop, the microkernel reads `remainder_channels` biases,
`remainder_channels * kernel_tile` inputs, and `remainder_channels * kernel_tile`
weights, performs the convolution, then writes `remainder_channels` outputs.
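The main loop and remainder loop described above can be sketched in plain C for a single output element. This is a minimal illustration, not the generated XNNPACK code: the function name `dwconv_one_pixel_sketch` is hypothetical, the tile sizes are fixed to those of an `up2x3` kernel, and the weight layout (per channel tile: `CHANNEL_TILE` biases followed by `KERNEL_TILE` groups of `CHANNEL_TILE` weights) is an assumption. Clamping and per-channel scales are left out.

```c
#include <assert.h>
#include <stddef.h>

#define CHANNEL_TILE ((size_t) 2)
#define KERNEL_TILE ((size_t) 3)

// Hypothetical sketch: computes all channels of ONE output element.
// input[k] points at the `channels` values read for kernel element k.
// Assumed weight layout, repeated per channel tile:
//   CHANNEL_TILE biases, then KERNEL_TILE groups of CHANNEL_TILE weights.
void dwconv_one_pixel_sketch(
    size_t channels, const float** input, const float* weights, float* output)
{
  assert(channels != 0);
  size_t c = 0;

  // Main loop: one full channel tile per iteration.
  for (; c + CHANNEL_TILE <= channels; c += CHANNEL_TILE) {
    float acc[CHANNEL_TILE];
    for (size_t i = 0; i < CHANNEL_TILE; i++) {
      acc[i] = weights[i];  // start from the bias
    }
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      const float* w = weights + CHANNEL_TILE * (1 + k);
      for (size_t i = 0; i < CHANNEL_TILE; i++) {
        acc[i] += input[k][c + i] * w[i];
      }
    }
    for (size_t i = 0; i < CHANNEL_TILE; i++) {
      output[c + i] = acc[i];
    }
    weights += CHANNEL_TILE * (1 + KERNEL_TILE);
  }

  // Remainder loop: fewer than CHANNEL_TILE channels are left. The packed
  // tile still has CHANNEL_TILE lanes, only the first `remainder` are used.
  const size_t remainder = channels - c;
  for (size_t i = 0; i < remainder; i++) {
    float acc = weights[i];
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      acc += input[k][c + i] * weights[CHANNEL_TILE * (1 + k) + i];
    }
    output[c + i] = acc;
  }
}
```

With 3 channels, the main loop runs once for channels 0 and 1, and the remainder loop handles channel 2 using the first lane of the second (padded) tile.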

## Microkernel arguments

```
void xnn_f32_dwconv_ukernel_up2x3__scalar(
    size_t channels,
    size_t output_width,
    const float** input,
    const float* weights,
    float* output,
    size_t input_stride,
    size_t output_increment,
    size_t input_offset,
    const float* zero,
    const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)])
```

- `channels`, number of output channels to compute
- `output_width`, number of produced pixels
- `input`, pointer to input indirection buffer
- `weights`, pointer to weights
- `output`, pointer to output
- `input_stride`, number of bytes to add to the indirection buffer to advance to the input pointers corresponding to the
  next output element
- `output_increment`, number of bytes to get to the next output element
- `input_offset`, offset to add to pointers from indirection buffer, unless these pointers match the zero pointer
- `zero`, pointer to zero buffer
- `params`, min/max values for clamping the output
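To make the pointer arithmetic behind these arguments concrete, here is a simplified sketch of how one row of output might be produced. It is not the actual microkernel: the name `dwconv_row_sketch` is hypothetical, it uses a channel tile of 1 (weights assumed packed per channel as 1 bias followed by `KERNEL_TILE` weights), it omits `params` and clamping, and it assumes `output_increment` is added after the output pointer has already moved past the channels just written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KERNEL_TILE ((size_t) 3)

// Hypothetical simplified kernel: channel tile of 1, no clamping.
// Assumed weight layout per channel: 1 bias, then KERNEL_TILE weights.
void dwconv_row_sketch(
    size_t channels, size_t output_width,
    const float** input, const float* weights, float* output,
    size_t input_stride, size_t output_increment,
    size_t input_offset, const float* zero)
{
  assert(channels != 0);
  assert(output_width != 0);
  do {
    // Resolve the KERNEL_TILE input pointers for this output element.
    // Pointers equal to `zero` (padding) are used as-is, all others are
    // shifted by input_offset bytes.
    const float* in[KERNEL_TILE];
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      in[k] = input[k];
      if (in[k] != zero) {
        in[k] = (const float*) ((uintptr_t) in[k] + input_offset);
      }
    }
    // Move to the input pointers of the next output element (in bytes).
    input = (const float**) ((uintptr_t) input + input_stride);

    const float* w = weights;
    for (size_t c = 0; c < channels; c++) {
      float acc = *w++;  // bias
      for (size_t k = 0; k < KERNEL_TILE; k++) {
        acc += in[k][c] * (*w++);
      }
      *output++ = acc;
    }
    // Assumption: output_increment counts bytes from the end of this
    // element's channels to the start of the next output element.
    output = (float*) ((uintptr_t) output + output_increment);
  } while (--output_width != 0);
}
```

Keeping `input_offset` separate from the indirection buffer means the same buffer can be reused for a different input (for example another batch element) by changing only the offset, while padding entries keep pointing at the zero buffer.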

## Packing

Based on the high level description of the microkernel, we will have to pack the
weights such that we have:

- `channel_tile` biases
- `channel_tile * kernel_tile` weights

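Repeated `round_up(channels, channel_tile) / channel_tile` times, i.e. once per
channel tile, with the last tile padded when `channels` is not a multiple of
`channel_tile`.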
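A packing routine following this layout could look like the sketch below. The function names are hypothetical and this is not XNNPACK's packing code. It assumes the unpacked kernel is stored as `kernel[k * channels + c]`, with kernel elements `k` already in the order the microkernel reads its input pointers, and it zero-fills lanes that fall outside `channels` or outside the actual kernel size.

```c
#include <assert.h>
#include <stddef.h>

// Number of floats in the packed buffer: one group of
// channel_tile * (1 + kernel_tile) floats per channel tile.
size_t dwconv_packed_floats_sketch(
    size_t channels, size_t channel_tile, size_t kernel_tile)
{
  assert(channel_tile != 0);
  const size_t num_tiles = (channels + channel_tile - 1) / channel_tile;
  return num_tiles * channel_tile * (1 + kernel_tile);
}

// Hypothetical packing sketch. For each channel tile, writes channel_tile
// biases followed by kernel_tile groups of channel_tile weights.
// Assumes kernel[k * channels + c] holds the weight of kernel element k
// for channel c, and kernel_size <= kernel_tile.
void dwconv_pack_sketch(
    size_t channels, size_t channel_tile,
    size_t kernel_size, size_t kernel_tile,
    const float* kernel, const float* bias, float* packed)
{
  assert(channel_tile != 0);
  assert(kernel_size <= kernel_tile);
  for (size_t c0 = 0; c0 < channels; c0 += channel_tile) {
    // Channels actually present in this tile (smaller for the last tile).
    const size_t valid =
        (channels - c0 < channel_tile) ? channels - c0 : channel_tile;
    for (size_t i = 0; i < channel_tile; i++) {
      *packed++ = (i < valid) ? bias[c0 + i] : 0.0f;
    }
    for (size_t k = 0; k < kernel_tile; k++) {
      for (size_t i = 0; i < channel_tile; i++) {
        const int in_range = (i < valid) && (k < kernel_size);
        *packed++ = in_range ? kernel[k * channels + c0 + i] : 0.0f;
      }
    }
  }
}
```

For `channels = 3` and `channel_tile = 2`, this produces two tiles, and the second lane of the second tile is padding.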

## Indirection buffer

The indirection buffer is packed such that the `kernel_tile` input pointers
required for computing a single output element are adjacent to each other (each
pointer refers to the `channels` values of one input pixel). A simple way to
pack it is then:

```
input  kernel  output

ABC    ab      WX
DEF    cd      YZ
GHI

uncompressed indirection buffer for first row of output
ABDEBCEF
```

This requires `kernel_tile * output_width` pointers.
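As an illustration, the uncompressed layout for the example above can be built with a few nested loops. This sketch uses a hypothetical function name, represents each input pixel as a single `char` so the result is easy to read, and assumes stride 1, no padding, and no dilation.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical sketch: fills `indirection` with
// kernel_height * kernel_width pointers per output element, row first,
// for one row of output. Assumes stride 1, no padding, no dilation.
// Each input pixel is modelled as one char for readability.
void build_indirection_row_sketch(
    const char* input, size_t input_width,
    size_t kernel_height, size_t kernel_width,
    size_t output_y, size_t output_width,
    const char** indirection)
{
  assert(kernel_width != 0 && output_width + kernel_width - 1 <= input_width);
  for (size_t ox = 0; ox < output_width; ox++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      for (size_t kx = 0; kx < kernel_width; kx++) {
        *indirection++ = &input[(output_y + ky) * input_width + ox + kx];
      }
    }
  }
}
```

Called with the 3x3 input `"ABCDEFGHI"`, a 2x2 kernel, and an output width of 2, the pointers written for output row 0 refer to `A B D E B C E F`, matching the diagram.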

We can compress this if we pack the input pointers column first:

```
column first uncompressed:
ADBEBECF
```

Notice that `BE` is repeated, so we can elide the second copy, provided that we
tell the microkernel how far to skip to reach the input pointers for the next
output element. This distance is no longer just `kernel_tile` pointers, which is
what `input_stride` is for.

```
column first compressed:
ADBECF
```

The weights similarly have to be packed column first.
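The compressed column-first layout and the matching stride can be sketched as follows. The function names are hypothetical, each input pixel is again modelled as a single `char`, and the sketch assumes stride 1, no padding, and no dilation, in which case consecutive output elements are `kernel_height` pointers apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch: column-first, compressed indirection buffer for one
// row of output. Each input column touched by the row is stored once, so
// (output_width + kernel_width - 1) * kernel_height pointers are written.
void build_compressed_indirection_sketch(
    const char* input, size_t input_width,
    size_t kernel_height, size_t kernel_width,
    size_t output_y, size_t output_width,
    const char** indirection)
{
  assert(kernel_width != 0);
  const size_t columns = output_width + kernel_width - 1;
  assert(columns <= input_width);
  for (size_t x = 0; x < columns; x++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      *indirection++ = &input[(output_y + ky) * input_width + x];
    }
  }
}

// Under the stride-1 assumption, the next output element starts one input
// column later, i.e. kernel_height pointers further, expressed in bytes.
size_t input_stride_bytes_sketch(size_t kernel_height) {
  return kernel_height * sizeof(const char*);
}

// Advances the indirection pointer by input_stride bytes.
const char** advance_indirection_sketch(
    const char** indirection, size_t input_stride)
{
  return (const char**) ((uintptr_t) indirection + input_stride);
}
```

For the example above this writes 6 pointers (`A D B E C F`) instead of 8. The first output element reads the 4 pointers starting at the beginning of the buffer, and the second reads the 4 pointers starting `input_stride` bytes later, so the two windows overlap on `B E`.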