Fast Fourier Convolution
NeurIPS 2020
Vanilla convolutions in modern deep networks are known to operate locally and at a fixed scale (e.g., the widely adopted 3 × 3 kernels in image-oriented tasks). This causes low efficacy in connecting two distant locations in the network. In this work, we propose a novel convolutional operator dubbed fast Fourier convolution (FFC), which …
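The global branch that FFC builds on exploits the convolution theorem: multiplying a feature map's spectrum elementwise is equivalent to a circular convolution with a full-image kernel, so a single such step already has a non-local receptive field. A minimal NumPy sketch of that principle follows; the per-frequency filter is a random placeholder, and the learned channel mixing that FFC performs with a 1 × 1 convolution in the spectral domain is omitted:

```python
import numpy as np

def spectral_conv(x, w_freq):
    """Global convolution via the frequency domain.

    x:      real feature map, shape (C, H, W)
    w_freq: complex per-frequency filter, shape (C, H, W//2 + 1)

    By the convolution theorem, elementwise multiplication of the
    spectrum equals a circular convolution with an H x W kernel,
    so the receptive field spans the whole image in one step.
    (No channel mixing here, unlike FFC's spectral 1x1 conv.)
    """
    X = np.fft.rfft2(x)                        # half-plane spectrum of a real input
    return np.fft.irfft2(X * w_freq, s=x.shape[-2:])

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w_freq = rng.standard_normal((3, 8, 5)) + 1j * rng.standard_normal((3, 8, 5))

y1 = spectral_conv(x, w_freq)
x2 = x.copy()
x2[0, 0, 0] += 1.0                             # perturb a single input pixel
y2 = spectral_conv(x2, w_freq)

print(y1.shape)                                # (3, 8, 8)
```

Comparing `y2` with `y1` shows the perturbation of one pixel spreads across the entire output plane of that channel, which is exactly the non-local behavior a stack of small spatial kernels can only approximate.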
- Deep neural networks have been the prominent driving force for recent dramatic progress in several research domains.
- A majority of modern networks have adopted the architecture of deeply stacking many convolutions with small receptive fields (3 × 3 in ResNet for images, or 3 × 3 × 3 in C3D for videos).
- Stacking still ensures that all image parts are visible to high layers, since depth expands the receptive field either linearly (for plain stacks) or exponentially (when strided or dilated convolutions are interleaved).
- Recent endeavors to enlarge the receptive field include deformable convolution and non-local neural networks.
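The receptive-field growth mentioned above follows a simple recurrence: each layer adds (k − 1) times the accumulated stride ("jump") to the previous receptive field. A small sketch of this standard textbook formula (not code from the paper):

```python
def receptive_field(layers):
    """Theoretical receptive field after a stack of conv layers.

    layers: list of (kernel_size, stride) pairs, ordered input to output.
    Uses the recurrence r_l = r_{l-1} + (k_l - 1) * jump_{l-1},
    where jump_l = jump_{l-1} * s_l is the accumulated stride.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Ten stride-1 3x3 layers: linear growth, 2 pixels per layer.
print(receptive_field([(3, 1)] * 10))  # 21
# Five stride-2 3x3 layers: the accumulated stride doubles every
# layer, so the receptive field grows geometrically.
print(receptive_field([(3, 2)] * 5))   # 63
```

This is why very deep stacks (or strided stages) are needed before a plain CNN can relate two distant image locations, which is precisely the inefficiency FFC targets.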
- The goal of this paper is the exposition of a novel convolutional unit codenamed fast Fourier convolution (FFC)
- Receptive field refers to the image part that is accessible by one filter
- We validate FFC by replacing convolutions used in a variety of modern networks
- We have proposed a novel convolutional operator dubbed as FFC
- Our comprehensive experiments on three representative computer vision tasks consistently exhibit large performance improvement that is clearly attributed to FFC
- Compared baselines include A2-Net, Oct-ResNet-50, DenseNet-201, ResNeXt-50 (32 × 4d), and Res2Net-50 (14w × 8s).
- FFC-ResNet-50 shows 0.4% better accuracy than ResNet-101 while using only 60% of the parameters.
- FFC remains effective for deeper networks (+1.4% for ResNet-101 and +0.6% for ResNet-152). Since these networks can already achieve large receptive fields by stacking many convolutional layers, this shows that the method is complementary to traditional convolution.
- ImageNet  is widely adopted to pre-train network backbones for generalization to other more complex tasks.
- Following typical settings in prior work, the input size of all the models is 224 × 224.
- Maximal training epochs are set to 90.
- Linear warm-up strategy is adopted in the first 5 epochs.
- All the networks are optimized by SGD with a batch size of 256 on 4 GPUs; common data augmentation is applied.
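The warm-up schedule above can be sketched as follows; the base learning rate and step-decay milestones are illustrative defaults from common ImageNet recipes, not values stated in this summary:

```python
def learning_rate(epoch, base_lr=0.1, warmup_epochs=5, milestones=(30, 60), gamma=0.1):
    """Per-epoch learning rate: linear warm-up, then step decay.

    The 5-epoch linear warm-up matches the setting above; base_lr,
    milestones, and gamma are hypothetical defaults from typical
    ImageNet training, not values given in this summary.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # ramp up to base_lr
    decay_steps = sum(epoch >= m for m in milestones)  # how many milestones passed
    return base_lr * gamma ** decay_steps

print([round(learning_rate(e), 3) for e in (0, 4, 29, 30, 60)])
# [0.02, 0.1, 0.1, 0.01, 0.001]
```

Warm-up avoids divergence in the first epochs of large-batch SGD (here, batch size 256), after which the schedule reduces to ordinary step decay over the 90-epoch budget.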
- The authors have proposed a novel convolutional operator dubbed as FFC.
- It harnesses the Fourier spectral theory for achieving non-local receptive fields in deep models.
- The proposed operator is carefully designed to implement cross-scale fusion.
- The authors' comprehensive experiments on three representative computer vision tasks consistently exhibit large performance improvement that is clearly attributed to FFC.
- The authors strongly believe that FFC paves a new research front for designing non-local, scale-fused neural networks
- Table 1: Parameter counts and FLOPs for a vanilla convolution, for each separate component of FFC, and for the entire FFC. C1 and C2 are the numbers of input and output channels respectively; H and W define the spatial resolution; K is the convolutional kernel size. For clarity, stride and padding are not considered, and α_in = α_out = α, where α is a parameter in [0, 1].
- Table 2: Top-1 accuracy of FFC under different ratios α on ImageNet. All models use ResNet-50 as their backbone. Note that α = 0 is equivalent to using vanilla convolutions.
- Table 3: Investigation of LFU on ImageNet. ResNet-50 serves as the backbone for all models.
- Table 4: Investigation of plugging FFC into more state-of-the-art networks on ImageNet. The first two sets are top-1 accuracy scores obtained by various state-of-the-art methods, transcribed from the corresponding papers; deeper models are listed in the second set. The last set reports the performance of plugging FFC into specific models (e.g., FFC-ResNet-50 implies the use of ResNet-50 as the base model).
- Table 5: Experimental results on Kinetics-400. Three sets from top to bottom: recent state-of-the-art video models, our re-implemented base models, and models enhanced with FFC. All models adopt ResNet-50 as backbones and read 8-frame input. "†" indicates the model is finetuned with the TSN framework [30].
- Table 6: Comparisons on the COCO val2017 dataset for human keypoint detection. OHKM means Online Hard Keypoints Mining.
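Table 1's accounting for the α-split can be approximated with a small calculator. This is a sketch of the four inter-path 2D convolutions only (local→local, local→global, global→local, global→global); it deliberately omits the spectral transform inside the global path, so it understates the full FFC cost:

```python
def conv_cost(c_in, c_out, k, h, w):
    """Parameters and multiply-FLOPs of a vanilla k x k convolution
    (stride 1, bias ignored, padding preserving H x W, as in Table 1)."""
    params = k * k * c_in * c_out
    return params, params * h * w

def ffc_paths_cost(c_in, c_out, k, h, w, alpha):
    """Cost of FFC's four inter-path convolutions under
    alpha_in = alpha_out = alpha. Deliberate simplification:
    the spectral transform in the global path is not counted."""
    lin, gin = round((1 - alpha) * c_in), round(alpha * c_in)
    lout, gout = round((1 - alpha) * c_out), round(alpha * c_out)
    params = flops = 0
    for ci, co in [(lin, lout), (lin, gout), (gin, lout), (gin, gout)]:
        p, f = conv_cost(ci, co, k, h, w)
        params += p
        flops += f
    return params, flops

# alpha = 0 routes everything through the local path: identical to a
# single vanilla convolution, as Table 2 notes.
print(conv_cost(64, 128, 3, 56, 56))
print(ffc_paths_cost(64, 128, 3, 56, 56, 0.0))
print(ffc_paths_cost(64, 128, 3, 56, 56, 0.5))
```

Up to channel rounding, the four paths together use exactly K²·C1·C2 parameters for any α, since (1 − α)C1·(1 − α)C2 + (1 − α)C1·αC2 + αC1·(1 − α)C2 + αC1·αC2 = C1·C2; the extra cost of FFC relative to a plain convolution therefore comes from the spectral transform itself.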
- Non-local neural networks. The theory of effective receptive fields revealed that convolutions tend to contract toward central regions, which questions the necessity of large convolutional kernels. Besides, small-kernel convolutions are also favored in CNNs for mitigating the risk of over-fitting. Recently, researchers have gradually realized that linking two arbitrarily distant neurons in a layer is crucial for many context-sensitive tasks, such as classifying the action type in a spatio-temporal video tube or jointly inferring the precise locations of human keypoints. This is addressed by recent research on non-local networks. Early methods rely on expensive all-pairs self-attention, which has spurred a series of follow-up research seeking acceleration. Nonetheless, the current paradigm is to sparsely insert non-local operators into network pipelines; how they can be densely knitted into a network remains an unexplored research problem.
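The pairwise-affinity computation that makes such blocks expensive can be sketched in a few lines of NumPy. The weights below are random placeholders standing in for the learned 1 × 1 projections of a real non-local block; the residual connection and output projection are omitted:

```python
import numpy as np

def non_local(x, w_theta, w_phi, w_g):
    """Minimal embedded-Gaussian non-local operation.

    x: (N, C) feature map with N = H * W flattened positions.
    The (N, N) affinity matrix is what makes the block expensive:
    every position attends to every other position in one step.
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    logits = theta @ phi.T                          # (N, N) pairwise affinities
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over all positions
    return attn @ g                                 # aggregate the whole map

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16))                   # e.g. an 8 x 8 map, 16 channels
w_theta, w_phi, w_g = (rng.standard_normal((16, 8)) for _ in range(3))
print(non_local(x, w_theta, w_phi, w_g).shape)      # (64, 8)
```

The quadratic (N, N) matrix is why such operators are inserted only sparsely into pipelines, whereas FFC's spectral path offers global connectivity at a cost linear in N (up to the FFT's log factor).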
- This work is supported by the National Key R&D Program of China (2020AAA0104400), the National Natural Science Foundation of China (61772037), and the Beijing Natural Science Foundation (Z190001).
- Modern neural networks have evolved for decades, from the early LeNet to the recent ResNet, DenseNet, etc.
- G. D. Bergland. A guided tour of the fast Fourier transform. IEEE Spectrum, 6(7):41–52, July 1969.
- Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.
- Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
- Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In ICCV, 2019.
- Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-nets: Double attention networks. In NeurIPS, 2018.
- Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In CVPR, 2019.
- Lu Chi, Guiyu Tian, Yadong Mu, Lingxi Xie, and Qi Tian. Fast non-local neural networks with spectral residual learning. In ACM Multimedia, 2019.
- James Cooley and John Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.
- Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
- Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. IEEE TPAMI, 2020.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
- Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
- Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
- Yitzhak Katznelson. An introduction to harmonic analysis, 1976.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
- Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Fei-Fei Li. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
- Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard S. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In NIPS, 2016.
- Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
- Oren Rippel, Jasper Snoek, and Ryan P. Adams. Spectral representations for convolutional neural networks. In NIPS, 2015.
- Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- Kai Su, Dongdong Yu, Zhenqi Xu, Xin Geng, and Changhu Wang. Multi-person pose estimation with enhanced channel-wise and spatial information. In CVPR, 2019.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
- Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, 2017.
- Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. CoRR, abs/1908.07919, 2019.
- Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
- Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
- Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
- Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
- Zhisheng Zhong, Tiancheng Shen, Yibo Yang, Zhouchen Lin, and Chao Zhang. Joint sub-bands learning with clique structures for wavelet domain super-resolution. In NeurIPS, 2018.