# Depthwise convolution microkernels
This document describes how depthwise convolution (DWCONV) microkernels work.
All depthwise convolution microkernels live in `src/*-dwconv`, e.g.
[`src/f32-dwconv`](https://github.com/google/XNNPACK/tree/master/src/f32-dwconv).
The simplest microkernel to look at is probably
[`f32-dwconv-up2x3-scalar.c`](../src/f32-dwconv/gen/f32-dwconv-up2x3-scalar.c).
Key parameters:
- channel tile: the number of channels the microkernel processes in each iteration.
- kernel tile: the number of kernel elements the microkernel reads in each iteration, where each kernel element holds
  one weight per channel. This can be greater than the actual number of kernel elements.
## High level description
Each call to the DWCONV microkernel produces 1 row of output.
For each element (pixel) of this row, the main loop produces `channel_tile`
outputs per iteration, and a separate remainder loop handles the channels left
over when `channels` is not a multiple of `channel_tile`.
In each iteration of the main loop, the microkernel will read `channel_tile` biases, `channel_tile * kernel_tile`
inputs, `channel_tile * kernel_tile` weights, and, optionally, `channel_tile` of per-channel scales,
perform the convolution, then write `channel_tile` outputs.
In the remainder loop, the microkernel will read `remainder_channels` biases,
`remainder_channels * kernel_tile` inputs, `remainder_channels * kernel_tile`
weights, perform the convolution, and write `remainder_channels` outputs.
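The two loops can be sketched in scalar C for a single output pixel. This is a simplified illustration, not the generated XNNPACK code: the function name `dwconv_one_pixel`, the fixed tile sizes, and the ordering of weights inside each packed group (biases first, then one run of `CHANNEL_TILE` weights per kernel element) are assumptions. Clamping and per-channel scales are left out.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical tile sizes, chosen to mirror the "up2x3" kernel name.
enum { CHANNEL_TILE = 2, KERNEL_TILE = 3 };

// Computes every channel of ONE output pixel.
// input:   KERNEL_TILE pointers, each to `channels` floats.
// weights: assumed packed per channel group as
//          [CHANNEL_TILE biases][KERNEL_TILE x CHANNEL_TILE weights].
void dwconv_one_pixel(size_t channels, const float** input,
                      const float* weights, float* output) {
  assert(channels != 0);
  size_t c = 0;
  // Main loop: one full channel tile per iteration.
  for (; c + CHANNEL_TILE <= channels; c += CHANNEL_TILE) {
    float acc[CHANNEL_TILE];
    for (size_t t = 0; t < CHANNEL_TILE; t++) {
      acc[t] = weights[t];  // start from the bias
    }
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      for (size_t t = 0; t < CHANNEL_TILE; t++) {
        acc[t] += input[k][c + t] * weights[CHANNEL_TILE + k * CHANNEL_TILE + t];
      }
    }
    for (size_t t = 0; t < CHANNEL_TILE; t++) {
      output[c + t] = acc[t];
    }
    weights += CHANNEL_TILE * (1 + KERNEL_TILE);  // next packed group
  }
  // Remainder loop: fewer than CHANNEL_TILE channels are left. The packed
  // group is still full-sized, so the same indexing applies.
  const size_t remainder_channels = channels - c;
  for (size_t t = 0; t < remainder_channels; t++) {
    float acc = weights[t];
    for (size_t k = 0; k < KERNEL_TILE; k++) {
      acc += input[k][c + t] * weights[CHANNEL_TILE + k * CHANNEL_TILE + t];
    }
    output[c + t] = acc;
  }
}
```

With 3 channels, the main loop runs once for channels 0 and 1, and the remainder loop handles channel 2.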
## Microkernel arguments
```
void xnn_f32_dwconv_ukernel_up2x3__scalar(
    size_t channels,
    size_t output_width,
    const float** input,
    const float* weights,
    float* output,
    size_t input_stride,
    size_t output_increment,
    size_t input_offset,
    const float* zero,
    const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)])
```
- `channels`, number of output channels to compute
- `output_width`, number of produced pixels
- `input`, pointer to input indirection buffer
- `weights`, pointer to weights
- `output`, pointer to output
- `input_stride`, number of bytes to advance the indirection buffer pointer to reach the input pointers for the
  next output element
- `output_increment`, number of bytes to get to the next output element
- `input_offset`, offset to add to pointers from indirection buffer, unless these pointers match the zero pointer
- `zero`, pointer to zero buffer
- `params`, min/max values for clamping the output
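As a rough sketch of how `input_offset`, `zero`, and `input_stride` fit together, the helpers below resolve one pointer read from the indirection buffer and step the buffer to the next output element. The helper names are hypothetical, and treating `input_offset` as a byte offset is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: applies input_offset (assumed to be in bytes) to a
// pointer taken from the indirection buffer, unless it is the zero pointer.
const float* resolve_input(const float* ptr, size_t input_offset,
                           const float* zero) {
  assert(ptr != NULL);
  if (ptr != zero) {
    ptr = (const float*) ((uintptr_t) ptr + input_offset);
  }
  return ptr;
}

// Hypothetical helper: advances the indirection buffer by input_stride bytes
// to reach the input pointers of the next output element.
const float** next_output_element(const float** input, size_t input_stride) {
  return (const float**) ((uintptr_t) input + input_stride);
}
```

Leaving the zero pointer untouched lets padding entries keep pointing at the zero buffer while every real input pointer is shifted by the same offset.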
## Packing
Based on the high level description of the microkernel, we will have to pack the
weights such that we have:
- `channel_tile` biases
- `channel_tile * kernel_tile` weights
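This group is repeated `round_up(channels, channel_tile) / channel_tile` times, once per channel tile, so the packed
buffer covers `round_up(channels, channel_tile)` channels in total.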
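The layout above can be illustrated with a small packing routine. This is a sketch under stated assumptions, not XNNPACK's packing code: the name `pack_dwconv_weights`, the source layout `kernel[k * channels + c]`, the order of weights inside a group (one run of `channel_tile` values per kernel element), and zero padding for missing channels or kernel elements are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

// Rounds n up to the next multiple of q.
size_t round_up(size_t n, size_t q) { return (n + q - 1) / q * q; }

// Hypothetical packing routine.
// bias:   `channels` values.
// kernel: kernel_size x channels values, kernel[k * channels + c] (assumed).
// packed: round_up(channels, channel_tile) * (1 + kernel_tile) floats.
void pack_dwconv_weights(size_t channels, size_t kernel_size,
                         size_t channel_tile, size_t kernel_tile,
                         const float* bias, const float* kernel,
                         float* packed) {
  assert(channel_tile != 0);
  assert(kernel_size <= kernel_tile);
  for (size_t c0 = 0; c0 < channels; c0 += channel_tile) {
    // channel_tile biases, zero-padded past the last real channel.
    for (size_t t = 0; t < channel_tile; t++) {
      const size_t c = c0 + t;
      *packed++ = c < channels ? bias[c] : 0.0f;
    }
    // channel_tile * kernel_tile weights, zero-padded in both dimensions.
    for (size_t k = 0; k < kernel_tile; k++) {
      for (size_t t = 0; t < channel_tile; t++) {
        const size_t c = c0 + t;
        const int valid = c < channels && k < kernel_size;
        *packed++ = valid ? kernel[k * channels + c] : 0.0f;
      }
    }
  }
}
```

For 3 channels with `channel_tile = 2`, this writes two groups: the first covers channels 0 and 1, and the second covers channel 2 plus one padding channel.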
## Indirection buffer
The indirection buffer is packed such that the `channel_tile * kernel_tile`
pointers to input required for computing a single output are adjacent to each
other. A simple way to pack it is:
```
input   kernel   output
ABC     ab       WX
DEF     cd       YZ
GHI

uncompressed indirection buffer for first row of output
ABDEBCEF
```
This requires `kernel_tile * output_width` pointers.
We can compress this if we pack the input pointers column first:
```
column first uncompressed:
ADBEBECF
```
Notice that `BE` is repeated, so we can elide the second copy, provided that we
tell the microkernel how far to skip to reach the input pointers for the next
output element. That distance is no longer just `kernel_tile` pointers, and
passing it to the microkernel is what `input_stride` is for.
```
column first compressed:
ADBECF
```
The weights similarly have to be packed column first.
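To make the compressed column-first layout concrete, here is a minimal sketch that builds the indirection buffer for one output row and returns the matching `input_stride`. It assumes unit stride, no padding, and a row-major image with `channels` floats per pixel. The function name and parameters are hypothetical and do not correspond to an XNNPACK API.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical builder for one output row (unit stride, no padding assumed).
// input:       pixel (0, 0) of a row-major image, `channels` floats per pixel.
// indirection: receives (output_width + kernel_width - 1) * kernel_height
//              pointers, stored column first.
// Returns the input_stride in bytes for this layout.
size_t build_indirection_row(const float* input, size_t input_width,
                             size_t channels, size_t output_y,
                             size_t output_width, size_t kernel_height,
                             size_t kernel_width, const float** indirection) {
  const size_t columns = output_width + kernel_width - 1;
  assert(columns <= input_width);
  for (size_t x = 0; x < columns; x++) {
    for (size_t ky = 0; ky < kernel_height; ky++) {
      const size_t y = output_y + ky;
      *indirection++ = input + (y * input_width + x) * channels;
    }
  }
  // Moving one output element to the right skips one input column,
  // which is kernel_height pointers rather than kernel_tile pointers.
  return kernel_height * sizeof(const float*);
}
```

For the 3x3 input and 2x2 kernel shown above, this produces the six pointers `ADBECF` and a stride of two pointers, so the second output element reads `BECF`.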