[![PyPI](https://img.shields.io/pypi/v/spatial-correlation-sampler.svg)](https://pypi.org/project/spatial-correlation-sampler/)

# Pytorch Correlation module

This is a custom C++/CUDA implementation of the Correlation module, used e.g. in [FlowNetC](https://arxiv.org/abs/1504.06852).

This [tutorial](http://pytorch.org/tutorials/advanced/cpp_extension.html) was used as a basis for the implementation, as well as [NVIDIA's CUDA code](https://github.com/NVIDIA/flownet2-pytorch/tree/master/networks/correlation_package).

- Build and install the C++ and CUDA extensions by executing `python setup.py install`,
- Benchmark C++ vs. CUDA by running `python benchmark.py {cpu, cuda}`,
- Run gradient checks on the code by running `python grad_check.py --backend {cpu, cuda}`.

# Requirements

This module is expected to compile for PyTorch `1.6`.

# Installation

This module is available on pip:

`pip install spatial-correlation-sampler`

For a CPU-only version, you can install from source with

`python setup_cpu.py install`

# Known Problems

This module needs compatible versions of gcc and CUDA to be compiled.
Namely, CUDA 9.1 and below need gcc5, while CUDA 9.2 and 10.0 need gcc7.
See [this issue](https://github.com/ClementPinard/Pytorch-Correlation-extension/issues/1) for more information.

# Usage

The API has a few differences from NVIDIA's module:

* The output is now a 5D tensor, which reflects the horizontal and vertical shifts.

  ```
  input (B x C x H x W) -> output (B x PatchH x PatchW x oH x oW)
  ```

* Output sizes `oH` and `oW` no longer depend on patch size, but only on kernel size and padding.
* Patch size `patch_size` is now the whole patch, and not only the radii.
* `stride1` is now `stride` and `stride2` is `dilation_patch`, which behaves like dilation in dilated convolutions.
* The equivalent `max_displacement` is then `dilation_patch * (patch_size - 1) / 2`.
* `dilation` is a new parameter; it acts on the correlation kernel the same way as dilation in a dilated convolution.
* To get the right parameters for FlowNetC, you would have

  ```
  kernel_size=1,
  patch_size=21,
  stride=1,
  padding=0,
  dilation=1,
  dilation_patch=2
  ```

## Example

```python
import torch
from spatial_correlation_sampler import SpatialCorrelationSampler, spatial_correlation_sample

device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32

input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)

# You can either use the function or the module.
# Note that the module doesn't contain any parameter tensor.

# function
out = spatial_correlation_sample(input1,
                                 input2,
                                 kernel_size=3,
                                 patch_size=1,
                                 stride=2,
                                 padding=0,
                                 dilation=2,
                                 dilation_patch=1)

# module
correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)
out = correlation_sampler(input1, input2)
```

# Benchmark

* Default parameters are those of `benchmark.py`. FlowNetC parameters are the same as those used in `FlowNetC` with a batch size of 4, described in [this paper](https://arxiv.org/abs/1504.06852), implemented [here](https://github.com/lmb-freiburg/flownet2) and [here](https://github.com/NVIDIA/flownet2-pytorch/blob/master/networks/FlowNetC.py).
* Feel free to file an issue to add entries for your hardware!

## CUDA Benchmark

* See [here](https://gist.github.com/ClementPinard/270e910147119831014932f67fb1b5ea) for a benchmark script working with [NVIDIA](https://github.com/NVIDIA/flownet2-pytorch/tree/master/networks/correlation_package)'s code and PyTorch.
* Benchmarks are launched with the environment variable `CUDA_LAUNCH_BLOCKING` set to `1`.
* Only `float32` is benchmarked.
* FlowNetC correlation parameters were benchmarked with the following commands:

```bash
CUDA_LAUNCH_BLOCKING=1 python benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256 cuda -d float

CUDA_LAUNCH_BLOCKING=1 python NV_correlation_benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256
```

| implementation | Correlation parameters | device  | pass     | min time       | avg time       |
| -------------- | ---------------------- | ------- | -------- | -------------: | -------------: |
| ours           | default                | 980 GTX | forward  | **5.745 ms**   | **5.851 ms**   |
| ours           | default                | 980 GTX | backward | 77.694 ms      | 77.957 ms      |
| NVIDIA         | default                | 980 GTX | forward  | 13.779 ms      | 13.853 ms      |
| NVIDIA         | default                | 980 GTX | backward | **73.383 ms**  | **73.708 ms**  |
|                |                        |         |          |                |                |
| ours           | FlowNetC               | 980 GTX | forward  | **26.102 ms**  | **26.179 ms**  |
| ours           | FlowNetC               | 980 GTX | backward | **208.091 ms** | **208.510 ms** |
| NVIDIA         | FlowNetC               | 980 GTX | forward  | 35.363 ms      | 35.550 ms      |
| NVIDIA         | FlowNetC               | 980 GTX | backward | 283.748 ms     | 284.346 ms     |

### Notes

* The overhead of our implementation during the backward pass when `kernel_size` > 1 needs some investigation. Feel free to dive into the code to improve it!
* The backward pass of NVIDIA's implementation is not entirely correct when `stride1` > 1 and `kernel_size` > 1, because not everything is computed, see [here](https://github.com/NVIDIA/flownet2-pytorch/blob/master/networks/correlation_package/src/correlation_cuda_kernel.cu#L120).

## CPU Benchmark

* No other implementation is available on CPU.
* Running it on CPU is not recommended if you have a GPU.


| Correlation parameters | device               | pass     | min time    | avg time    |
| ---------------------- | -------------------- | -------- | ----------: | ----------: |
| default                | E5-2630 v3 @ 2.40GHz | forward  | 159.616 ms  | 188.727 ms  |
| default                | E5-2630 v3 @ 2.40GHz | backward | 282.641 ms  | 294.194 ms  |
| FlowNetC               | E5-2630 v3 @ 2.40GHz | forward  | 2.138 s     | 2.144 s     |
| FlowNetC               | E5-2630 v3 @ 2.40GHz | backward | 7.006 s     | 7.075 s     |
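
# Parameter arithmetic (sketch)

The parameter relations listed under Usage can be checked with a little arithmetic. The sketch below is plain Python and does not import the extension. The helper names (`max_displacement`, `conv_output_size`, `expected_output_shape`) are hypothetical and not part of the package. It assumes that `oH` and `oW` follow the usual dilated-convolution output-size formula, which the Usage notes suggest but do not state outright, so treat the result as an expectation to verify against the real output.

```python
def max_displacement(patch_size: int, dilation_patch: int) -> int:
    """Largest shift covered by the patch, using the formula from the Usage section."""
    return dilation_patch * (patch_size - 1) // 2


def conv_output_size(size: int, kernel_size: int, stride: int,
                     padding: int, dilation: int) -> int:
    """Standard dilated-convolution output size (ASSUMED to apply to oH and oW)."""
    dilated_kernel = (kernel_size - 1) * dilation + 1
    return (size + 2 * padding - dilated_kernel) // stride + 1


def expected_output_shape(batch: int, height: int, width: int, kernel_size: int,
                          patch_size: int, stride: int, padding: int,
                          dilation: int) -> tuple:
    """Expected 5D shape (B, PatchH, PatchW, oH, oW) under the assumption above."""
    o_h = conv_output_size(height, kernel_size, stride, padding, dilation)
    o_w = conv_output_size(width, kernel_size, stride, padding, dilation)
    return (batch, patch_size, patch_size, o_h, o_w)


# FlowNetC parameters from the Usage section, with the benchmark input size.
flownetc_disp = max_displacement(patch_size=21, dilation_patch=2)
flownetc_shape = expected_output_shape(batch=4, height=48, width=64, kernel_size=1,
                                       patch_size=21, stride=1, padding=0, dilation=1)

# Parameters from the Example section (10 x 10 input).
example_shape = expected_output_shape(batch=1, height=10, width=10, kernel_size=3,
                                      patch_size=1, stride=2, padding=0, dilation=2)

print(flownetc_disp, flownetc_shape, example_shape)
```

With the FlowNetC parameters this gives a maximum displacement of 20. Under the stated assumption, comparing `expected_output_shape(...)` with `out.shape` from the Example section is a quick way to confirm the formula on your installed version.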