Spatial Pyramid Pooling in Deep Convolutional Networks

Kaiming He, Microsoft Research Asia
Xiangyu Zhang, Xi'an Jiaotong University
Shaoqing Ren, University of Science and Technology of China
Jian Sun, Microsoft Research Asia

ILSVRC 2014: we ranked 2nd in object detection and 3rd in image classification among 38 teams.

Abstract

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning.
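The pooling idea above can be illustrated with a minimal NumPy sketch. It assumes a single (C, H, W) feature map, max pooling, and hypothetical pyramid levels of 1x1, 2x2 and 4x4 bins. The bin-boundary rule (floor for the start index, ceil for the end) is an assumption of this sketch, not necessarily the exact rule SPP-net uses.

```python
import numpy as np


def _ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a fixed-length vector.

    For each pyramid level n, the H x W map is split into an n x n grid of
    bins and every bin is max-pooled per channel. The output length is
    C * sum(n * n for n in levels), whatever the values of H and W.
    """
    _, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Assumed bin rule: floor for the start, ceil for the end, so
            # bins are never empty (neighbours may overlap by one cell).
            r0, r1 = (i * h) // n, _ceil_div((i + 1) * h, n)
            for j in range(n):
                c0, c1 = (j * w) // n, _ceil_div((j + 1) * w, n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)


# Two feature maps of different spatial size yield vectors of the same length.
rng = np.random.default_rng(0)
small = rng.standard_normal((256, 13, 13))
large = rng.standard_normal((256, 20, 27))
print(spatial_pyramid_pool(small).shape)  # (5376,)
print(spatial_pyramid_pool(large).shape)  # (5376,)
```

With 256 channels and 1 + 4 + 16 = 21 bins, both inputs give 256 x 21 = 5376 values. A fixed length like this is what allows fully connected layers to follow the convolutional layers without fixing the input image size.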

SPP-net is also effective for object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This avoids repeatedly computing the convolutional features. When processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007.
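The detection workflow above (one convolutional pass per image, then pooling per region) can be sketched as follows. This assumes a precomputed (C, H, W) feature map for the whole image, a hypothetical total stride of 16 image pixels per feature cell, and a simplified box projection (floor the top-left corner, ceil the bottom-right); SPP-net's actual coordinate mapping may differ. All names and values are illustrative.

```python
import numpy as np

STRIDE = 16  # hypothetical: image pixels per feature-map cell


def _ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def box_to_window(box, map_h, map_w, stride=STRIDE):
    """Project an image box (x0, y0, x1, y1) in pixels onto feature-map
    cells (r0, r1, c0, c1), ends exclusive, clipped to the map.
    Simplified projection: floor the top-left, ceil the bottom-right."""
    x0, y0, x1, y1 = box
    r0, c0 = min(y0 // stride, map_h - 1), min(x0 // stride, map_w - 1)
    r1 = min(max(_ceil_div(y1, stride), r0 + 1), map_h)
    c1 = min(max(_ceil_div(x1, stride), c0 + 1), map_w)
    return r0, r1, c0, c1


def pool_region(feature_map, box, levels=(1, 2, 4)):
    """Spatial-pyramid max pooling over one region of a shared (C, H, W) map."""
    _, map_h, map_w = feature_map.shape
    r0, r1, c0, c1 = box_to_window(box, map_h, map_w)
    window = feature_map[:, r0:r1, c0:c1]
    _, h, w = window.shape
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                cell = window[:, (i * h) // n:_ceil_div((i + 1) * h, n),
                              (j * w) // n:_ceil_div((j + 1) * w, n)]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)


# Conv features are computed once per image (random values stand in here for
# a roughly 600x800 image under the assumed stride)...
rng = np.random.default_rng(0)
conv_features = rng.standard_normal((256, 38, 50))
# ...and every candidate region reuses them.
proposals = [(0, 0, 200, 150), (320, 96, 800, 600), (10, 500, 90, 590)]
features = np.stack([pool_region(conv_features, b) for b in proposals])
print(features.shape)  # (3, 5376)
```

Here each region costs only a slice plus pooling over the shared map, rather than a separate forward pass on a cropped and warped sub-image. Avoiding those repeated passes is where the speedup over R-CNN comes from.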

Publications:

"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition"
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, in ECCV 2014.
"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition"
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, in TPAMI 2014.
arXiv

Resources

Slides for ILSVRC 2014 talk
Poster
Code released: fast object detection for VOC