Can low-level features be learned (or at least pre-trained) from an unlabeled image dataset? This project explores using spatial context to learn feature representations from unlabeled images. It works by dividing an image into nine sub-crops, as seen in the image below. These crops are then shuffled according to one of 100 permutations, selected from the full permutation set so as to maximize the Hamming distance between its elements. A CNN based on the ResNet-34 architecture is then tasked with reassembling the pieces into the correct order, i.e. predicting which permutation was applied.
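One way to precompute such a permutation set is a greedy max-min selection: repeatedly add the permutation farthest (in Hamming distance) from everything chosen so far. The sketch below is a minimal illustration of that idea, not necessarily the exact selection procedure used here; the function name and parameters are illustrative.

```python
import itertools
import numpy as np

def build_permutation_set(n_perms=100, n_tiles=9, seed=0):
    """Greedily pick permutations so each new one is as far as possible
    (in Hamming distance) from all permutations already selected."""
    rng = np.random.default_rng(seed)
    # All 9! = 362,880 orderings of the tile indices.
    all_perms = np.array(list(itertools.permutations(range(n_tiles))))
    selected = [all_perms[rng.integers(len(all_perms))]]
    # Minimum Hamming distance from every candidate to the selected set.
    min_dists = (all_perms != selected[0]).sum(axis=1)
    while len(selected) < n_perms:
        idx = int(min_dists.argmax())  # farthest remaining candidate
        selected.append(all_perms[idx])
        min_dists = np.minimum(min_dists,
                               (all_perms != all_perms[idx]).sum(axis=1))
    return np.stack(selected)  # shape: (n_perms, n_tiles)
```

Each training example can then be assigned a random row of this table: the row reorders the tiles, and the row index doubles as the classification label the network must predict.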
This project was inspired by the paper Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. It modifies the original work by using a ResNet-34-style CNN in lieu of the AlexNet architecture used in the original implementation. Additionally, this project was trained on the unlabeled portion of the COCO 2017 dataset, consisting of only ~123k images, whereas the original paper trained on ImageNet with the labels removed.
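As a rough sketch of what a ResNet-34-based puzzle solver can look like (the shared-trunk design follows the original paper's context-free network; the head sizes here are assumptions, not this project's exact configuration), each tile is encoded by a shared ResNet-34 trunk with its classification layer removed, and a small head classifies which of the 100 permutations was applied:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class JigsawNet(nn.Module):
    """Illustrative puzzle solver: a shared ResNet-34 trunk encodes each
    of the 9 tiles; a small MLP head predicts the permutation class."""
    def __init__(self, n_perms=100, n_tiles=9):
        super().__init__()
        trunk = resnet34(weights=None)
        trunk.fc = nn.Identity()  # keep the 512-d pooled features
        self.trunk = trunk
        self.head = nn.Sequential(
            nn.Linear(512 * n_tiles, 1024),  # hidden size is an assumption
            nn.ReLU(inplace=True),
            nn.Linear(1024, n_perms),
        )

    def forward(self, tiles):  # tiles: (batch, 9, 3, H, W)
        b, t = tiles.shape[:2]
        feats = self.trunk(tiles.flatten(0, 1))  # (batch * 9, 512)
        return self.head(feats.view(b, -1))      # (batch, n_perms) logits
```

Training then reduces to ordinary cross-entropy classification over the permutation index, after which the trunk's weights can be reused as a pre-trained feature extractor.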