Non-local Neural Networks

CVPR, 2018.

TL;DR: We show the significance of non-local modeling for the tasks of video classification, object detection and segmentation, and pose estimation.

Abstract:

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes …
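The generic non-local operation the abstract describes, y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j), can be sketched in a few lines of NumPy. This is a minimal illustration using the embedded-Gaussian form of f; the weight names (`theta`, `phi`, `g_w`) follow the paper's notation, but the shapes and setup are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def nonlocal_op(x, theta, phi, g_w):
    """x: (N, C) features at N positions; theta, phi, g_w: (C, Cp) embedding weights."""
    logits = (x @ theta) @ (x @ phi).T           # (N, N) pairwise similarities f(x_i, x_j)
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability of exp
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)      # C(x): softmax normalization over j
    return attn @ (x @ g_w)                      # every position attends to all positions

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))                  # 6 positions, 8 channels (toy sizes)
theta, phi, g_w = (rng.standard_normal((8, 4)) for _ in range(3))
y = nonlocal_op(x, theta, phi, g_w)              # (6, 4) non-local responses
```

Unlike a convolution, every output position here depends on all input positions in a single step, which is the long-range property the paper exploits.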

Introduction
  • Capturing long-range dependencies is of central importance in deep neural networks. For sequential data (e.g., in speech, language), recurrent operations [38, 23] are the dominant solution to long-range dependency modeling.
  • Convolutional and recurrent operations both process a local neighborhood, either in space or time; long-range dependencies can only be captured when these operations are applied repeatedly, propagating signals progressively through the data.
  • Repeated application of local operations causes optimization difficulties that need to be carefully addressed [23, 21].
  • These challenges make multi-hop dependency modeling, e.g., when messages need to be delivered back and forth between distant positions, difficult.
Highlights
  • Capturing long-range dependencies is of central importance in deep neural networks.
  • We present non-local operations as an efficient, simple, and generic component for capturing long-range dependencies with deep neural networks.
  • In Table 2d we study the effect of non-local blocks applied along space, time, or spacetime.
  • We presented a new class of neural networks which capture long-range dependencies via non-local operations.
  • Our non-local blocks can be combined with any existing architectures.
  • We show the significance of non-local modeling for the tasks of video classification, object detection and segmentation, and pose estimation.
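The reason non-local blocks combine with any existing architecture is the paper's residual formulation, z_i = W_z y_i + x_i: with W_z initialized to zero, an inserted block is an identity mapping and does not disturb a pretrained network. A minimal NumPy sketch under illustrative shapes (the embedded-Gaussian similarity and weight names follow the paper's notation, but this is not the authors' code):

```python
import numpy as np

def nonlocal_block(x, theta, phi, g_w, w_z):
    """Residual non-local block: z = x + W_z * (non-local response of x)."""
    logits = (x @ theta) @ (x @ phi).T           # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)      # normalize over all positions j
    y = attn @ (x @ g_w)                         # non-local response at every position
    return x + y @ w_z                           # residual connection back to the input

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))
theta, phi, g_w = (rng.standard_normal((8, 4)) for _ in range(3))
z = nonlocal_block(x, theta, phi, g_w, np.zeros((4, 8)))  # zero-init W_z => identity
```

With the zero-initialized `w_z`, `z` equals `x` exactly, so the block can be dropped into a pretrained model without breaking its initial behavior.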
Methods
  • Experiments on Video Classification

    The authors perform comprehensive studies on the challenging Kinetics dataset [27].
  • Kinetics [27] contains ∼246k training videos and 20k validation videos.
  • It is a classification task involving 400 human action categories.
  • Charades [44] is a video dataset with ∼8k training, ∼1.8k validation, and ∼2k testing videos.
  • It is a multi-label classification task with 157 action categories.
  • Unlike the original paper [19], which adopted stage-wise training of the RPN, the authors use an improved implementation with end-to-end joint training similar to [37], leading to higher baselines than [19].
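Because Charades is multi-label (157 categories, and more than one can be active per clip), prediction uses independent per-class scores rather than a single softmax over classes. A minimal sketch with per-class sigmoids; the threshold and head shape are illustrative assumptions, not the authors' exact classifier:

```python
import numpy as np

def multilabel_predict(logits, threshold=0.5):
    """Independent sigmoid per class; any subset of classes may fire."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-class probability, no softmax coupling
    return probs, probs >= threshold       # scores and binary per-class predictions

# toy logits for 3 of the 157 classes
probs, preds = multilabel_predict(np.array([2.0, -1.0, 0.0]))
```

Classification mAP, the metric reported for Charades, is then computed per class from these independent scores.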
Conclusion
  • The authors presented a new class of neural networks which capture long-range dependencies via non-local operations.
  • The authors' non-local blocks can be combined with any existing architectures.
  • The authors show the significance of non-local modeling for the tasks of video classification, object detection and segmentation, and pose estimation.
  • A simple addition of non-local blocks provides solid improvement over baselines.
  • The authors hope non-local layers will become an important component of future network architectures.
Tables
  • Table1: Our baseline ResNet-50 C2D model for video. The dimensions of 3D output maps and filter kernels are in T×H×W (2D kernels in H×W), with the number of channels following. The input is 32×224×224. Residual blocks are shown in brackets
  • Table2: Ablations on Kinetics action classification. We show top-1 and top-5 classification accuracy (%)
  • Table3: Comparisons with state-of-the-art results in Kinetics, reported on the val and test sets. We include the Kinetics 2017 competition winner’s results [3], but their best results exploited audio signals (marked in gray) so were not vision-only solutions. †: “avg” is the average of top-1 and top-5 accuracy; individual top-1 or top-5 numbers are not available from the test server at the time of submitting this manuscript
  • Table4: Classification mAP (%) in the Charades dataset [44], on the train/val split and the trainval/test split. Our results are based on ResNet-101. Our NL I3D uses 5 non-local blocks
  • Table5: Adding 1 non-local block to Mask R-CNN for COCO object detection and instance segmentation. The backbone is ResNet-50/101 or ResNeXt-152 [53], both with FPN [32]
  • Table6: Adding non-local blocks to Mask R-CNN for COCO keypoint detection. The backbone is ResNet-101 with FPN [32]
Related work
  • Non-local image processing. Non-local means [4] is a classical filtering algorithm that computes a weighted mean of all pixels in an image. It allows distant pixels to contribute to the filtered response at a location based on patch appearance similarity. This non-local filtering idea was later developed into BM3D (block-matching 3D) [10], which performs filtering on a group of similar, but non-local, patches. BM3D is a solid image denoising baseline even compared with deep neural networks [5]. Block matching was used with neural networks for image denoising [6, 31]. Non-local matching is also the essence of successful texture synthesis [12], super-resolution [16], and inpainting [1] algorithms.
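The patch-similarity weighting described above can be illustrated on a 1-D signal. This is a hedged toy version of the non-local means idea, not the exact algorithm of [4] (no search-window restriction; `patch` and `h` are illustrative parameters):

```python
import numpy as np

def nonlocal_means_1d(x, patch=3, h=0.1):
    """Denoise a 1-D signal: each output sample is a weighted mean of ALL
    samples, weighted by Gaussian similarity of their surrounding patches."""
    n = len(x)
    padded = np.pad(x, patch, mode="reflect")
    # patch of radius `patch` centered on every sample
    patches = np.stack([padded[i:i + 2 * patch + 1] for i in range(n)])
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).mean(axis=-1)
    w = np.exp(-d2 / h ** 2)            # patch-appearance similarity weights
    return (w @ x) / w.sum(axis=1)      # normalized non-local average

clean = nonlocal_means_1d(np.ones(16))  # a constant signal is a fixed point
```

The non-local neural operation in this paper generalizes exactly this structure: a normalized, similarity-weighted sum over all positions, with the similarity and the transformed values both learned.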
Funding
  • This work was partially supported by ONR MURI N000141612007, Sloan, Okawa Fellowship to AG and NVIDIA Fellowship to XW
References
  • C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. In Proceedings of SIGGRAPH, ACM Transactions on Graphics, 2009. 2
  • P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. Interaction networks for learning about objects, relations and physics. In Neural Information Processing Systems (NIPS), 2016. 2
  • Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. arXiv:1708.03805, 2017. 7
  • A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005. 1, 2, 3
  • H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Computer Vision and Pattern Recognition (CVPR), 2012. 2
  • H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising with multi-layer perceptrons, part 2: training trade-offs and analysis of their mechanisms. arXiv:1211.1552, 2012. 2
  • J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2, 4, 6, 7, 8
  • S. Chandra, N. Usunier, and I. Kokkinos. Dense and low-rank Gaussian CRFs using deep embeddings. In International Conference on Computer Vision (ICCV), 2017. 2
  • L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062, 2014. 2
  • K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. Transactions on Image Processing (TIP), 2007. 2
  • J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Computer Vision and Pattern Recognition (CVPR), 2015. 2
  • A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In International Conference on Computer Vision (ICCV), 1999. 2
  • C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In Neural Information Processing Systems (NIPS), 2016. 2, 4
  • K. Fukushima and S. Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets. Springer, 1982. 1
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), 2017. 2
  • D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In Computer Vision and Pattern Recognition (CVPR), 2009. 2
  • P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017. 5
  • A. Harley, K. Derpanis, and I. Kokkinos. Segmentation-aware convolutional networks using local attention masks. In International Conference on Computer Vision (ICCV), 2017. 2
  • K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), 2017. 2, 8
  • K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision (ICCV), 2015. 5
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016. 1, 4, 5
  • G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012. 5
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997. 1
  • Y. Hoshen. Multi-agent predictive modeling with attentional CommNets. In Neural Information Processing Systems (NIPS), 2017. 2
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015. 5
  • S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. In International Conference on Machine Learning (ICML), 2010. 2
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv:1705.06950, 2017. 1, 5
  • P. Krahenbuhl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Neural Information Processing Systems (NIPS), 2011. 2
  • J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001. 2
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989. 1
  • S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017. 2
  • T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2017. 8
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014. 2, 8
  • S. Liu, S. De Mello, J. Gu, G. Zhong, M.-H. Yang, and J. Kautz. Learning affinity via spatial propagation networks. In Neural Information Processing Systems (NIPS), 2017. 2
  • V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010. 3
  • A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016. 2
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017. 8
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986. 1
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015. 5
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Neural Information Processing Systems (NIPS), 2017. 2, 3
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009. 2
  • A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015. 2
  • G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In Computer Vision and Pattern Recognition (CVPR), 2017. 8
  • G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision (ECCV), 2016. 1, 5, 8
  • K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems (NIPS), 2014. 2
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015. 5
  • C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In International Conference on Computer Vision (ICCV), 1998. 3
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In International Conference on Computer Vision (ICCV), 2015. 1, 2, 4
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017. 2, 3, 6
  • H. Wang and C. Schmid. Action recognition with improved trajectories. In International Conference on Computer Vision (ICCV), 2013. 2
  • L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Computer Vision and Pattern Recognition (CVPR), 2015. 2
  • N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. In Neural Information Processing Systems (NIPS), 2017. 2
  • S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017. 8
  • W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 conversational speech recognition system. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017. 2
  • J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition (CVPR), 2015. 2
  • S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015. 2