
Fast Fourier Convolution

NeurIPS 2020

Abstract

Vanilla convolutions in modern deep networks are known to operate locally and at fixed scale (e.g., the widely-adopted 3 × 3 kernels in image-oriented tasks). This causes low efficacy in connecting two distant locations in the network. In this work, we propose a novel convolutional operator dubbed as fast Fourier convolution (FFC), which ...
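
The mechanism behind the non-local receptive field is the spectral convolution theorem: a pointwise update on a feature map's 2-D FFT mixes all spatial positions at once. Below is a minimal PyTorch sketch of such a Fourier unit, assuming the common stack-real-and-imaginary-as-channels design; it illustrates the idea only and is not the authors' released code (FFC itself additionally couples this with a local branch and cross-scale fusion).

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """Pointwise convolution in the frequency domain: every output position
    depends on the whole input map, i.e., an image-wide receptive field."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv mixing the stacked (real, imag) halves of the spectrum
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels * 2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        # real 2-D FFT: (b, c, h, w) -> complex (b, c, h, w // 2 + 1)
        spec = torch.fft.rfft2(x, norm="ortho")
        # stack real and imaginary parts as channels so a real conv can mix them
        spec = torch.cat([spec.real, spec.imag], dim=1)
        spec = self.relu(self.bn(self.conv(spec)))
        real, imag = spec.chunk(2, dim=1)
        # inverse FFT restores the original spatial resolution
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
```

A quick shape check: `FourierUnit(64)(torch.randn(2, 64, 32, 32)).shape` gives `torch.Size([2, 64, 32, 32])`, so the unit is a drop-in, resolution-preserving layer.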

Introduction
  • Deep neural networks have been the prominent driving force for recent dramatic progress in several research domains.
  • A majority of modern networks have adopted the architecture of deeply stacking many convolutions with small receptive field (3 × 3 in ResNet [11] for images or 3 × 3 × 3 in C3D [27] for videos).
  • Such stacking still ensures that all image parts are visible to high layers, since adding convolutional layers increases the receptive field either linearly or exponentially (see the sketch after this list).
  • Recent endeavors on enlarging the receptive field include deformable convolution [9] and non-local neural networks [31].
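
To make the linear-vs-exponential growth concrete, here is a small illustrative helper (not from the paper) applying the standard receptive-field recurrence: each layer widens the field by (K − 1) × jump, where the jump is the product of all earlier strides.

```python
def receptive_field(kernel, strides):
    """Receptive field of a stack of convolutions (standard recurrence)."""
    rf, jump = 1, 1
    for s in strides:
        rf += (kernel - 1) * jump  # each layer extends the field by (K-1)*jump
        jump *= s                  # strides compound the per-layer extension
    return rf

# 16 stride-1 3x3 layers: linear growth, RF = 1 + 16 * 2 = 33
print(receptive_field(3, [1] * 16))  # 33
# 8 stride-2 3x3 layers: geometric growth
print(receptive_field(3, [2] * 8))   # 511
```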
Highlights
  • Deep neural networks have been the prominent driving force for recent dramatic progress in several research domains
  • The goal of this paper is the exposition of a novel convolutional unit codenamed fast Fourier convolution (FFC)
  • Receptive field refers to the image part that is accessible by one filter
  • We validate FFC by replacing convolutions used in a variety of modern networks
  • We have proposed a novel convolutional operator dubbed as FFC
  • Our comprehensive experiments on three representative computer vision tasks consistently exhibit large performance improvement that is clearly attributed to FFC
Methods
  • Compared baselines on ImageNet include A2-Net [5], Oct-ResNet-50 [4], DenseNet-201 [13], ResNeXt-50 (32 × 4d) [33], and Res2Net-50 (14w × 8s) [10].
  • FFC-ResNet-50 shows 0.4% better accuracy than ResNet-101 while costing only 60% of the parameters.
  • FFC is also effective for deeper networks (+1.4% for ResNet-101 and +0.6% for ResNet-152); since such networks can already achieve large receptive fields by stacking many convolutional layers, this shows that the method is complementary to traditional convolution (see the two-branch sketch after this list).
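
As a rough picture of how FFC can replace a vanilla convolution, the sketch below wires up the local/global channel split described in the paper: a fraction α of channels takes a global (spectral) path, the rest a vanilla 3 × 3 convolution, with four cross-paths fusing the two. The exact path composition is an assumption for illustration (0 < α < 1), reusing the FourierUnit sketched under the abstract.

```python
class FFC(nn.Module):
    """Hedged sketch of FFC's two-branch layout (details assumed): channels
    are split by ratio alpha into local/global parts updated by four
    cross-paths, with the non-local spectral step on the global->global path."""
    def __init__(self, in_ch, out_ch, alpha=0.5):
        super().__init__()
        in_g, out_g = int(in_ch * alpha), int(out_ch * alpha)  # assumes 0 < alpha < 1
        in_l, out_l = in_ch - in_g, out_ch - out_g
        self.l2l = nn.Conv2d(in_l, out_l, 3, padding=1, bias=False)
        self.l2g = nn.Conv2d(in_l, out_g, 3, padding=1, bias=False)
        self.g2l = nn.Conv2d(in_g, out_l, 3, padding=1, bias=False)
        # global->global: project channels, then apply the spectral update
        self.g2g = nn.Sequential(nn.Conv2d(in_g, out_g, 1, bias=False),
                                 FourierUnit(out_g))

    def forward(self, x_l, x_g):
        # each output branch sums a within-branch path and a cross-branch path
        return self.l2l(x_l) + self.g2l(x_g), self.l2g(x_l) + self.g2g(x_g)
```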
Results
  • ImageNet [16] is widely adopted to pre-train network backbones for generalization to other more complex tasks.
  • Following typical settings in prior work, the input size of all the models is 224 × 224.
  • Maximal training epochs are set to 90.
  • A linear warm-up strategy is adopted in the first 5 epochs (see the schedule sketch after this list).
  • All the networks are optimized by SGD with a batch size of 256 on 4 GPUs. Common data ...
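
The warm-up bullet can be read as the usual linear ramp. The sketch below assumes a base learning rate of 0.1 and a 10× step decay at epochs 30/60/80, both common ImageNet defaults rather than values stated on this page.

```python
def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=5):
    """Linear warm-up over the first epochs; base LR and step-decay
    milestones are assumed defaults, not values reported here."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs  # linear ramp 0 -> base_lr
    milestones = (30, 60, 80)
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)

print([round(lr_at_epoch(e), 3) for e in (0, 2, 4, 10, 35, 85)])
# [0.02, 0.06, 0.1, 0.1, 0.01, 0.0]
```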
Conclusion
  • The authors have proposed a novel convolutional operator dubbed as FFC.
  • It harnesses the Fourier spectral theory for achieving non-local receptive fields in deep models.
  • The proposed operator is carefully designed to implement cross-scale fusion.
  • The authors' comprehensive experiments on three representative computer vision tasks consistently exhibit large performance improvement that is clearly attributed to FFC.
  • The authors strongly believe that FFC paves a new research front for designing non-local, scale-fused neural networks.
Tables
  • Table 1: Parameter counts and FLOPs for a vanilla convolution, each separate component of FFC, and the entire FFC, respectively. C1 and C2 are the numbers of input and output channels; H and W define the spatial resolution; K is the convolutional kernel size. For clarity, stride and padding are not considered. α_in = α_out = α, where α is a parameter in [0, 1] (see the worked count after this list).
  • Table 2: Top-1 accuracy of FFC under different ratios α on ImageNet. All models use ResNet-50 as the backbone. Note that α = 0 is equivalent to using vanilla convolutions.
  • Table 3: Investigation of the local Fourier unit (LFU) on ImageNet. ResNet-50 serves as the backbone for all models.
  • Table 4: Investigation of plugging FFC into more state-of-the-art networks on ImageNet. The first two sets are top-1 accuracy scores obtained by various state-of-the-art methods, transcribed from the corresponding papers; deeper models are listed in the second set. The last set reports the performance of plugging FFC into specific models (e.g., FFC-ResNet-50 denotes ResNet-50 as the base model).
  • Table 5: Experimental results on Kinetics-400. Three sets from top to bottom: recent state-of-the-art video models, our re-implemented base models, and models enhanced with FFC. All the models adopt ResNet-50 as the backbone and read 8-frame input. “†” indicates the model is fine-tuned with the TSN framework [30].
  • Table 6: Comparisons on the COCO val2017 dataset for human keypoint detection. OHKM means Online Hard Keypoints Mining.
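
The vanilla-convolution column of Table 1 follows directly from the caption's definitions: K² · C1 · C2 parameters, and one multiply-accumulate per weight per output position, i.e., H · W · K² · C1 · C2 MACs (often reported as FLOPs). A quick check with an illustrative helper, not taken from the paper:

```python
def vanilla_conv_cost(c1, c2, h, w, k):
    """Parameter and MAC counts for a plain KxK convolution, matching the
    Table 1 setup (bias, stride, and padding ignored)."""
    params = k * k * c1 * c2  # K^2 * C1 * C2 weights
    macs = h * w * params     # one MAC per weight per output position
    return params, macs

# Example: a 3x3 conv, 256 -> 256 channels, on a 56x56 feature map
print(vanilla_conv_cost(256, 256, 56, 56, 3))  # (589824, 1849688064)
```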
Related Work
  • Non-local neural networks. The theory of effective receptive fields [21] revealed that convolutions tend to contract toward central regions, which questions the necessity of large convolutional kernels; small-kernel convolutions are also favored in CNNs for mitigating the risk of over-fitting. Recently, researchers have realized that linking two arbitrarily distant neurons in a layer is crucial for many context-sensitive tasks, such as classifying the action type in a spatio-temporal video tube or jointly inferring the precise locations of human keypoints. This is addressed by recent research on non-local networks. Early methods such as [31] rely on expensive self-attention, which has incurred a series of follow-up work seeking acceleration (e.g., [14]). Nonetheless, the current paradigm sparsely inserts non-local operators into network pipelines; how they can be densely knitted into a network remains an unexplored research problem (an illustrative cost count follows).
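
To see why dense non-local operators are costly, count the multiply-accumulates of the pairwise part of a non-local block [31], using its standard θ/φ/g factorization; the helper below is illustrative only. The affinity matrix is (H·W) × (H·W), so the cost grows quadratically with spatial size.

```python
def nonlocal_block_macs(h, w, c):
    """Illustrative MAC count for the pairwise (self-attention) part of a
    non-local block [31]; projection convs are ignored."""
    n = h * w
    affinity = n * n * c   # theta(x) @ phi(x)^T
    aggregate = n * n * c  # softmax(affinity) @ g(x)
    return affinity + aggregate

# One block on a 56x56, 256-channel map already needs ~5e9 MACs
print(nonlocal_block_macs(56, 56, 256))  # 5035261952
```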
Funding
  • This work is supported by the National Key R&D Program of China (2020AAA0104400), the National Natural Science Foundation of China (61772037), and the Beijing Natural Science Foundation (Z190001).
References
  • [1] G. D. Bergland. A guided tour of the fast Fourier transform. IEEE Spectrum, 6(7):41–52, July 1969.
  • [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.
  • [3] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
  • [4] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In ICCV, 2019.
  • [5] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-Nets: Double attention networks. In NIPS, 2018.
  • [6] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In CVPR, 2019.
  • [7] Lu Chi, Guiyu Tian, Yadong Mu, Lingxi Xie, and Qi Tian. Fast non-local neural networks with spectral residual learning. In ACM Multimedia, 2019.
  • [8] James Cooley and John Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.
  • [9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
  • [10] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2Net: A new multi-scale backbone architecture. IEEE TPAMI, 2020.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
  • [13] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [14] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
  • [15] Yitzhak Katznelson. An Introduction to Harmonic Analysis, 1976.
  • [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [17] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
  • [18] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [19] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Fei-Fei Li. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
  • [20] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [21] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard S. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In NIPS, 2016.
  • [22] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [23] Oren Rippel, Jasper Snoek, and Ryan P. Adams. Spectral representations for convolutional neural networks. In NIPS, 2015.
  • [24] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [25] Kai Su, Dongdong Yu, Zhenqi Xu, Xin Geng, and Changhu Wang. Multi-person pose estimation with enhanced channel-wise and spatial information. In CVPR, 2019.
  • [26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [27] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [28] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, 2017.
  • [29] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. CoRR, abs/1908.07919, 2019.
  • [30] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [31] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
  • [32] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
  • [33] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [34] Zhisheng Zhong, Tiancheng Shen, Yibo Yang, Zhouchen Lin, and Chao Zhang. Joint sub-bands learning with clique structures for wavelet domain super-resolution. In NeurIPS, 2018.
Authors
Lu Chi
Borui Jiang