MatchNet: Unifying feature and metric learning for patch-based matching

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3279–3286


Abstract

Motivated by recent successes on learning feature representations and on learning feature comparison functions, we propose a unified approach to combining both for training a patch matching system. Our system, dubbed MatchNet, consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes a similarity between the extracted features.

Introduction
  • Patch-based image matching is used extensively in computer vision. Finding accurate correspondences between patches is instrumental in a broad variety of applications including wide-baseline stereo (e.g., [14]), object instance recognition (e.g., [13]), fine-grained classification (e.g., [36]), multi-view reconstruction (e.g., [20]), image stitching (e.g., [4]), and structure from motion (e.g., [17]).

    Since the advent of the influential SIFT descriptor [13] in 1999, research on patch-based matching has attempted to improve both accuracy and speed.
  • The greater availability of labeled training data and increased computational resources have recently reversed this trend, leading to a new generation of learned descriptors [3, 22, 27, 28] and comparison metrics [9].
  • These approaches typically train a nonlinear model discriminatively using large datasets of patches with known ground-truth matches, and they serve as motivation for this work.
Highlights
  • Patch-based image matching is used extensively in computer vision
  • We propose a unified approach that jointly learns a deep network for patch representation as well as a network for robust feature comparison
  • In our preliminary experiments we found that normalized SIFT, which is raw SIFT scaled so its L2-norm is 1, gives slightly better performance than SIFT, so normalized SIFT is used for all our baseline experiments
  • In Yosemite-Liberty, normalized SIFT concat.+NNet performs better than normalized SIFT+L2 by 7.61% in absolute error rate, and MatchNet with the same feature dimension (128) and fully-connected layer dimension (512) achieves a further improvement of 6.7% in absolute error rate
  • We propose and evaluate a unified approach for patch-based image matching that jointly learns a deep convolutional neural network for local patch representation as well as a network for robust feature comparison (a minimal architecture sketch follows this list)
  • Our system trains models that achieve state-of-the-art performance on a standard dataset for patch matching
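
To make the jointly learned architecture concrete, the following is a minimal PyTorch sketch of a MatchNet-style model, not the authors' released implementation: a convolutional feature tower shared by the two patches, an optional fully-connected bottleneck, and a metric network of fully-connected layers ending in a two-way softmax over match / non-match. The class names, filter counts, and kernel sizes are illustrative assumptions; only the overall two-part structure follows the description in the paper (see the Table 1 caption under Tables).

```python
# Minimal MatchNet-style sketch (illustrative hyperparameters, not Table 1's exact values).
import torch
import torch.nn as nn

class FeatureTower(nn.Module):
    """Convolutional network mapping a 64x64 grayscale patch to a feature vector."""
    def __init__(self, bottleneck_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 24, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(24, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(64, 96, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(96, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # Optional bottleneck layer (the "B" dimension) controlling feature size.
        self.bottleneck = nn.Linear(64 * 8 * 8, bottleneck_dim)

    def forward(self, patch):                      # patch: (N, 1, 64, 64)
        x = self.conv(patch)
        return self.bottleneck(torch.flatten(x, 1))

class MetricNetwork(nn.Module):
    """Three fully-connected layers that score a pair of patch features."""
    def __init__(self, feat_dim=512, fc_dim=512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * feat_dim, fc_dim), nn.ReLU(),   # FC1 (size F)
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),         # FC2 (size F)
            nn.Linear(fc_dim, 2),                         # FC3: match / non-match
        )

    def forward(self, feat_a, feat_b):
        logits = self.fc(torch.cat([feat_a, feat_b], dim=1))
        return torch.softmax(logits, dim=1)               # softmax-normalized output
```

At matching time the two parts can be applied in separate stages: the feature tower is run once per patch, and the metric network is then evaluated only on candidate feature pairs, analogous to how the "concat.+NNet" baseline in the highlights pairs precomputed SIFT features with a learned comparison network.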
Methods
  • The dataset includes three subsets with a total of more than 1.5 million patches.
  • It is suitable for discriminative descriptor or metric learning, and has been used as a standard benchmark dataset by many [3, 9, 22, 27, 28].
  • The dataset comes with patches extracted using either the Difference of Gaussian (DoG) interest point detector or the multi-scale Harris corner detector.
Results
  • The authors follow the standard evaluation protocol and evaluate MatchNet along with several SIFT baselines and other learned floating-point descriptors (the Error@95% metric is sketched after this list).
  • Results for SIFT baselines and MatchNet with floating point features are listed in Table 2.
  • The authors' best model, 4096-512×512, achieves the best performance over all evaluation pairs.
  • It achieves a 7.75% average error rate vs. 10.38% for the <80f setting of [22].
  • With a bottleneck of 64d, the 64-1024×1024 model achieves a 10.94% average error rate vs. [22]'s 10.75%, using features of about the same dimension.
  • Not surprisingly, increasing F and B leads to a lower error rate, but the absolute gain diminishes rapidly.
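
The Error@95% numbers above and in Table 2 follow the usual convention for this benchmark: the percentage of non-matching pairs that are incorrectly accepted at the score threshold where 95% of the true matching pairs are accepted (the error rate at 95% recall). A minimal sketch under that assumption, where higher scores mean "more likely a match":

```python
# Error@95%: false-positive rate (in percent) at the threshold giving 95% recall.
import numpy as np

def error_at_95(scores, labels, recall=0.95):
    """scores: match scores (higher = more likely a match); labels: 1 = match, 0 = non-match."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos_scores = np.sort(scores[labels == 1])
    # Pick the threshold that still accepts `recall` of the true matching pairs.
    cutoff_index = int(np.floor((1.0 - recall) * len(pos_scores)))
    threshold = pos_scores[cutoff_index]
    false_positives = np.sum((labels == 0) & (scores >= threshold))
    return 100.0 * false_positives / np.sum(labels == 0)
```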
Conclusion
  • The authors' baseline experiments with SIFT features confirm that a better metric can significantly improve performance.
  • The authors' best model is trained without a bottleneck and it learns a high-dimensional patch representation coupled with a discriminatively trained metric.
  • It improves on the previous state-of-the-art performance across all train-test pairs by up to 3.4% in absolute error rate.
  • With a 512d bottleneck and quantization, MatchNet still outperforms [22]'s PR (<640d) results in 4 out of 6 train-test pairs, with up to 7% improvement in absolute error rate. The authors propose and evaluate a unified approach for patch-based image matching that jointly learns a deep convolutional neural network for local patch representation as well as a network for robust feature comparison.
Tables
  • Table 1: Layer parameters of MatchNet. The output dimension is given by (height × width × depth). PS: patch size for convolution and pooling layers; S: stride. Layer types: C: convolution, MP: max-pooling, FC: fully-connected. We always pad the convolution and pooling layers so the output height and width are those of the input divided by the stride. For the FC layers, the sizes B and F are chosen from B ∈ {64, 128, 256, 512} and F ∈ {128, 256, 512, 1024}. All convolution and FC layers use ReLU activation except FC3, whose output is normalized with Softmax (Equation 2).
  • Table 2: UBC matching results. Numbers are Error@95%, in percent. See Section 5 for descriptions of the different settings. F and B are the dimensions of the fully-connected and bottleneck layers in Table 1. Bold numbers are the best results across all conditions. Underlined numbers are better than the previous state-of-the-art results with similar feature dimension.
  • Table 3: Accuracy vs. quantization tradeoff for the 64-1024×1024 network. In the first column, the first value is bits per dimension and the second value is average bits per feature vector, computed as 64 + 64 × 0.679 × b, where b is the number of bits per dimension and the average density (non-zeros) of the feature vector is 67.9%. Numbers in the middle are Error@95%. The first row is for the unquantized features. A worked example of this formula follows the list.
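
The storage figures described in the Table 3 caption follow directly from the stated formula, so they can be reproduced in a few lines; the bit widths chosen below are illustrative and not necessarily the exact rows of Table 3:

```python
# Average bits per feature vector for the 64-1024x1024 network (Table 3 caption):
# 64 bits mark which of the 64 dimensions are non-zero, plus b quantized bits for
# each non-zero dimension, with an average density (non-zeros) of 67.9%.
def avg_bits_per_feature(b, dims=64, density=0.679):
    return dims + dims * density * b

for b in (2, 4, 8):                                       # illustrative bit widths
    print(f"b={b}: {avg_bits_per_feature(b):.1f} bits")   # e.g., b=4 -> ~237.8 bits
```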
Related Work
  • Much previous work considers improving some components in the detector-descriptor-similarity pipeline for matching patches. Here we address the most related work that considers learning descriptors or similarities, organized by goal and the types of non-linearity used.

    Feature learning methods such as [3], [28] and [22] encode non-linearity into the procedure for mapping intensity patches to descriptors. Their goal is to learn descriptors whose similarity with respect to a chosen distance metric matches the ground truth. For [3] and [22], the procedure includes multiple parameterized blocks of gradient computation, spatial pooling, feature normalization and dimension reduction. [28] uses boosting with weak learners consisting of a family of functions parameterized by gradient orientations and spatial location. Each weak learner represents the result of feature normalization, orientation pooling and thresholding in its +1/−1 output. Weighting and combining multiple weak learners builds a highly non-linear mapping from gradients to robust descriptors. Different types of learning algorithms are proposed to find the optimal parameters: Powell minimization, boosting and convex optimization for [3], [28] and [22], respectively. In [3] and [22] the similarity functions are simply the Euclidean distance. [28] uses a Mahalanobis distance and jointly learns the descriptors and the metric. In comparison, our proposed feature extraction uses a deep convolutional network with multiple convolutional and spatial pooling layers plus an optional bottleneck layer to obtain feature vectors, followed by a similarity measure also based on neural nets.
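
For contrast with the learned metric, the normalized-SIFT baseline mentioned in the highlights pairs unit-normalized SIFT descriptors with plain Euclidean distance, the same kind of fixed similarity used in [3] and [22]. A minimal sketch, assuming 128-dimensional SIFT descriptors have already been extracted (for example with VLFeat); the function names are illustrative:

```python
# Normalized-SIFT + L2 baseline: rescale raw SIFT to unit L2 norm, then compare
# descriptors with Euclidean distance (smaller distance = better match).
import numpy as np

def normalize_sift(desc, eps=1e-8):
    """Scale a raw 128-d SIFT descriptor so its L2 norm is 1."""
    desc = np.asarray(desc, dtype=float)
    return desc / (np.linalg.norm(desc) + eps)

def l2_match_score(desc_a, desc_b):
    """Negative Euclidean distance between normalized descriptors (higher = better)."""
    return -np.linalg.norm(normalize_sift(desc_a) - normalize_sift(desc_b))
```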
Funding
  • Berg’s was supported in part by NSF IIS:1452851 and a Google faculty research award. Pool0 activations Input patch Pool1 activations Conv2 activations UBC
References

  • [1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
  • [2] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah. Signature verification using a "Siamese" time delay neural network. In NIPS, 1994.
  • [3] M. Brown, G. Hua, and S. A. J. Winder. Discriminative learning of local image descriptors. IEEE TPAMI, 33(1):43–57, 2011.
  • [4] M. Brown and D. Lowe. Automatic panoramic image stitching using invariant features. IJCV, 74(1), 2007.
  • [5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
  • [6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
  • [7] J. Heinly, E. Dunn, and J.-M. Frahm. Comparative evaluation of binary features. In ECCV, 2012.
  • [8] P. Jain, B. Kulis, J. V. Davis, and I. S. Dhillon. Metric and kernel learning using a linear transformation. Journal of Machine Learning Research, 13:519–547, 2012.
  • [9] Y. Jia and T. Darrell. Heavy-tailed distances for gradient based image descriptors. In NIPS, pages 397–405, 2011.
  • [10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [11] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In CVPR, 2004.
  • [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [13] D. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
  • [14] J. Matas and O. Chum. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10), 2004.
  • [15] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. TPAMI, 27(10):1615–1630, 2005.
  • [16] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. J. V. Gool. A comparison of affine region detectors. IJCV, 65, 2005.
  • [17] N. Molton, A. Davison, and I. Reid. Locally planar patch features for real-time structure from motion. In BMVC, 2004.
  • [18] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
  • [19] R. Salakhutdinov and G. E. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, 2007.
  • [20] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, 2006.
  • [21] G. Shakhnarovich. Learning Task-Specific Similarity. PhD thesis, MIT, 2006.
  • [22] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. TPAMI, 2014.
  • [23] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199, 2014.
  • [24] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. IJCV, 80(2):189–210, 2008.
  • [25] C. Strecha, A. A. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. TPAMI, 34(1):66–78, 2012.
  • [26] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
  • [27] T. Trzcinski, C. M. Christoudias, P. Fua, and V. Lepetit. Boosting binary keypoint descriptors. In CVPR, 2013.
  • [28] T. Trzcinski, C. M. Christoudias, V. Lepetit, and P. Fua. Learning image descriptors with the boosting-trick. In NIPS, pages 278–286, 2012.
  • [29] LIBLINEAR. http://www.csie.ntu.edu.tw/~cjlin/liblinear/
  • [30] UBC patch dataset. http://www.cs.ubc.ca/~mbrown/patchdata/patchdata.html
  • [31] VLFeat SIFT. http://www.vlfeat.org/overview/sift.html
  • [32] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.
  • [33] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
  • [34] Y. Weiss, A. B. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
  • [35] S. A. J. Winder, G. Hua, and M. Brown. Picking the best DAISY. In CVPR, 2009.
  • [36] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
  • [37] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. arXiv preprint arXiv:1409.4326, 2014.