This is an implementation of [1] using TensorFlow. No attempt has been made to actually train the network; the code performs a forward and backward pass on a batch of images from the KITTI dataset.
pip3 install pykitti
python3 main.py --frames_stop=50 --frame_step=5 --learning_rate=0.005 --corres=1000
pykitti is used to load the images from both the left and right cameras, and the Velodyne points are projected into the images using pykitti. Note that it is not clear to me how pixel coordinates should be extracted from the returned values; for now I simply round to the nearest integer, which should be fixed before actually training the network.
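A minimal sketch of how the Velodyne points can be projected into the left colour image with pykitti, assuming the raw loader and its `K_cam2` / `T_cam2_velo` calibration fields; the paths and drive identifiers are placeholders, and the final rounding mirrors the current nearest-integer approach:

```python
import numpy as np
import pykitti

# Placeholder paths/identifiers; adjust to your local KITTI raw download.
basedir = './kitti'
date = '2011_09_26'
drive = '0001'

data = pykitti.raw(basedir, date, drive)

velo = data.get_velo(0)                                       # (N, 4): x, y, z, reflectance
pts = np.hstack([velo[:, :3], np.ones((velo.shape[0], 1))])   # homogeneous coordinates

# Transform the Velodyne points into the left colour camera frame (cam2).
cam_pts = (data.calib.T_cam2_velo @ pts.T)[:3]

# Keep points in front of the camera and project with the camera intrinsics.
in_front = cam_pts[2] > 0
proj = data.calib.K_cam2 @ cam_pts[:, in_front]
u = proj[0] / proj[2]
v = proj[1] / proj[2]

# Current approach in this repo: round to the nearest integer pixel.
px = np.round(np.stack([u, v], axis=1)).astype(np.int32)
```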
The first part of the network follows the GoogLeNet architecture up to the inception (4a) layer, except that the local response normalisation layers are omitted, as suggested in [2].
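For reference, a hedged sketch of a generic GoogLeNet-style inception block in TensorFlow 1.x; the filter counts are whatever the GoogLeNet table specifies for each module (3a, 3b, 4a, ...), and, as noted above, no local response normalisation is applied:

```python
import tensorflow as tf

def inception_module(x, f1, f3r, f3, f5r, f5, fp, name):
    """Generic GoogLeNet-style inception block (sketch only; filter counts
    come from the GoogLeNet architecture table for each module)."""
    with tf.variable_scope(name):
        b1 = tf.layers.conv2d(x, f1, 1, padding='same', activation=tf.nn.relu)
        b3 = tf.layers.conv2d(x, f3r, 1, padding='same', activation=tf.nn.relu)
        b3 = tf.layers.conv2d(b3, f3, 3, padding='same', activation=tf.nn.relu)
        b5 = tf.layers.conv2d(x, f5r, 1, padding='same', activation=tf.nn.relu)
        b5 = tf.layers.conv2d(b5, f5, 5, padding='same', activation=tf.nn.relu)
        bp = tf.layers.max_pooling2d(x, 3, 1, padding='same')
        bp = tf.layers.conv2d(bp, fp, 1, padding='same', activation=tf.nn.relu)
        return tf.concat([b1, b3, b5, bp], axis=-1)
```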
To build the convolutional spatial transformer, a purely convolutional localisation network returns the theta values to apply to each patch. Patches of the given kernel size are generated at each input location, transformed using David Dao's spatial transformer implementation, and then merged back together before a convolution with the same kernel size is applied.
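The repo's localisation network is not reproduced here, but a purely convolutional localisation network could look like the following sketch: a few convolutions followed by a 1x1 convolution that predicts six affine parameters per spatial location, initialised to the identity transform. Layer sizes and names are illustrative assumptions, not taken from the code:

```python
import tensorflow as tf

def conv_localisation_net(features):
    """Sketch of a purely convolutional localisation network: it predicts the
    six affine parameters (theta) for every spatial location of the feature
    map, so each patch gets its own transformation."""
    x = tf.layers.conv2d(features, 32, 3, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 32, 3, padding='same', activation=tf.nn.relu)
    # Initialise the last layer so that the predicted transform starts as the identity.
    identity = tf.constant_initializer([1, 0, 0, 0, 1, 0])
    theta = tf.layers.conv2d(x, 6, 1, padding='same',
                             kernel_initializer=tf.zeros_initializer(),
                             bias_initializer=identity)
    return theta  # shape: (batch, H, W, 6)
```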
The output feature map is projected back to the input resolution using bilinear interpolation, and the feature values at the given correspondence points are then read out.
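A hedged sketch of this step, assuming the correspondence points are already integer (batch, row, col) indices as produced by the rounding above; names are illustrative:

```python
import tensorflow as tf

def sample_at_points(feature_map, points, image_size):
    """Upsample the feature map to the input resolution with bilinear
    interpolation, then read out the feature vectors at integer
    correspondence coordinates. `points` is an (N, 3) int32 tensor of
    (batch_index, row, col) indices."""
    upsampled = tf.image.resize_bilinear(feature_map, image_size)
    return tf.gather_nd(upsampled, points)  # (N, channels)
```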
For each feature extracted at a correspondence point in the first image, we find its nearest neighbour in the output feature map of the second network. If that nearest neighbour does not lie at the corresponding point in the second image, it is used as a mined negative.
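A sketch of this mining step, under the assumption that the second image's feature map has been flattened to (H*W, C) and that the true correspondences are given as flattened int64 indices; all names are illustrative:

```python
import tensorflow as tf

def mine_hard_negatives(feats1, feats2_flat, true_flat_idx):
    """feats1: (N, C) features sampled at correspondence points in image 1.
    feats2_flat: (H*W, C) flattened feature map of image 2.
    true_flat_idx: (N,) int64 flattened index of the true correspondence."""
    # Squared L2 distance between every query feature and every location
    # in the second feature map.
    d = (tf.reduce_sum(tf.square(feats1), axis=1, keepdims=True)
         - 2.0 * tf.matmul(feats1, feats2_flat, transpose_b=True)
         + tf.reduce_sum(tf.square(feats2_flat), axis=1))
    nn_idx = tf.argmin(d, axis=1)

    # A nearest neighbour that is not the true correspondence is a mined negative.
    is_negative = tf.not_equal(nn_idx, true_flat_idx)
    negative_feats = tf.gather(feats2_flat, nn_idx)
    return negative_feats, is_negative
```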
Using both the features extracted at correspondence points (positive pairs) and the negative pairs mined as described above, the loss is computed over all images.
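A sketch of a correspondence contrastive loss in the spirit of [1], pulling positive pairs together and pushing mined negatives apart up to a margin; the margin value and function names here are assumptions, not necessarily what the repo uses:

```python
import tensorflow as tf

def correspondence_contrastive_loss(f1, f2, is_negative, margin=1.0):
    """f1, f2: (N, C) paired features; is_negative: (N,) bool mask marking
    mined negative pairs. Positives are pulled together, negatives pushed
    apart up to `margin`."""
    dist = tf.norm(f1 - f2, axis=1)
    pos_loss = tf.square(dist)
    neg_loss = tf.square(tf.maximum(0.0, margin - dist))
    per_pair = tf.where(is_negative, neg_loss, pos_loss)
    return 0.5 * tf.reduce_mean(per_pair)
```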
The PCK accuracy metric has not been implemented yet.
[1] Choy et al., Universal Correspondence Network.
[2] Choy et al., Supplemental Materials for Universal Correspondence Network.