Cross-Domain Document Object Detection: Benchmark Suite and Method

CVPR, pp. 12912-12921, 2020.

Keywords: PDF document, object detection, PDF file, Document Object Model, feature pyramid

Abstract:

Decomposing images of document pages into high-level semantic regions (e.g., figures, tables, paragraphs), document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding. DOD remains a challenging problem as document objects vary significantly in layout, size, aspect ratio, texture…

Introduction
  • Document Object Detection (DOD) is the task of automatically decomposing a document page image into its structural and logical units.
  • DOD is critical for a variety of document image analysis applications, such as document editing, document structure analysis and content understanding [31, 1, 30].
  • Document objects are more diverse in aspect ratio and scale than natural scene objects: tables may occupy a whole page, page numbers can be as small as a single digit, and a single line of text spanning the page has an extreme aspect ratio.
  • Missing parts of natural objects and scenes can be reasonably in-painted based on surrounding context [26].
Highlights
  • Document Object Detection (DOD) is the task of automatically decomposing a document page image into its structural and logical units
  • To address the domain shift problem, we propose three novel modules on top of the standard object detection model, namely, the Feature Pyramid Alignment (FPA) module, the Region Alignment (RA) module, and the Rendering Layer Alignment (RLA) module
  • We investigate cross-domain document object detection by proposing a benchmark suite and a novel method
  • We provide the essential components (page images and bounding box annotations) and auxiliary components (raw PDF files and the PDF rendering layers)
  • The proposed model builds upon the standard object detection model with three novel domain alignment modules, namely, the Feature Pyramid Alignment (FPA) module, the Region Alignment (RA) module, and the Rendering Layer Alignment (RLA) module
  • Experiments on the benchmark suite confirm the effectiveness of the proposed novel components and that the proposed method significantly outperforms the baseline methods
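The alignment modules listed above are adversarial: a domain classifier is trained to tell source features from target features, while a gradient reversal layer (as in Ganin and Lempitsky [6]) makes the detector's features hard to classify, i.e., domain-invariant. A minimal numpy sketch of that mechanism follows; the logistic classifier, the loss, and the `lam` coefficient are illustrative assumptions, not the paper's exact per-module design:

```python
import numpy as np

def grl_backward(grad_out, lam=1.0):
    """Gradient Reversal Layer: identity in the forward pass,
    multiplies the gradient by -lam in the backward pass."""
    return -lam * grad_out

class DomainClassifier:
    """Tiny logistic discriminator over a pooled feature vector
    (illustrative; real per-level discriminators are convolutional)."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.01, size=dim)
        self.b = 0.0

    def prob_target(self, feat):
        z = float(feat @ self.w + self.b)
        return 1.0 / (1.0 + np.exp(-z))  # P(feature comes from target domain)

    def feature_grad(self, feat, is_target, lam=1.0):
        # Binary cross-entropy gradient w.r.t. the input feature,
        # reversed before it flows back into the detector backbone.
        p = self.prob_target(feat)
        grad_feat = (p - float(is_target)) * self.w
        return grl_backward(grad_feat, lam)

# For a source-domain feature, the reversed gradient nudges the backbone
# toward features the classifier cannot attribute to either domain.
feat = np.ones(8)
g = DomainClassifier(8).feature_grad(feat, is_target=False, lam=0.1)
```

The same pattern is applied at each location a discriminator is attached; the paper's FPA, RA, and RLA modules differ in *where* the alignment signal is computed (pyramid levels, region proposals, rendering layers), not in this basic adversarial mechanism.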
Methods
  • Figure 2 illustrates the proposed method.
  • It is based on Feature Pyramid Networks (FPN) and includes three novel domain alignment modules, namely, the Feature Pyramid Alignment (FPA) module, the Region Alignment (RA) module, and the Rendering Layer Alignment (RLA) module.
  • FPN exploits the pyramidal feature hierarchy of convolutional neural networks and builds a feature pyramid of high-level semantics for all the layers.
  • It is independent of the backbone convolutional architecture.
  • This iterative process outputs a feature pyramid {P1, P2, P3, P4}.
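The top-down construction described above can be sketched in a few lines of numpy: the 1x1 lateral convolutions are modelled as per-pixel channel projections and upsampling is nearest-neighbour. The channel width of 256 and the merge order follow the standard FPN [21] and are assumptions, not details confirmed by this summary:

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x upsampling of a (C, H, W) feature map
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lateral(x, w):
    # a 1x1 convolution is a per-pixel channel projection: (Cout, Cin) x (Cin, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def build_pyramid(c_feats, weights):
    """Top-down FPN pass: backbone maps C1..C4 (fine to coarse) -> P1..P4,
    all with the same channel width."""
    p = lateral(c_feats[-1], weights[-1])        # start at the coarsest level
    pyramid = [p]
    for c, w in zip(reversed(c_feats[:-1]), reversed(weights[:-1])):
        p = upsample2x(p) + lateral(c, w)        # merge top-down + lateral signals
        pyramid.append(p)
    return pyramid[::-1]                          # [P1, P2, P3, P4]

rng = np.random.default_rng(0)
chans, size = [64, 128, 256, 512], 32
c_feats = [rng.normal(size=(c, size >> i, size >> i)) for i, c in enumerate(chans)]
weights = [rng.normal(scale=0.01, size=(256, c)) for c in chans]
pyramid = build_pyramid(c_feats, weights)
```

Because every Pi has the same channel width, a single detection head (and, in the proposed method, a single FPA discriminator design) can be applied at every pyramid level.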
Results
  • The authors conduct cross-domain evaluation among the three datasets, Chn, Legal and PubMed. The first is a Chinese document dataset and the latter two are English datasets.
  • The authors first conduct cross-lingual performance evaluation between Chn and Legal, and between Chn and PubMed. Table 3 and Table 4 show the experimental results.
  • Since Legal and PubMed belong to different English document categories, there is a domain gap between them.
  • The authors conduct cross-category detection evaluations between these two datasets.
Conclusion
  • The authors investigate cross-domain document object detection by proposing a benchmark suite and a novel method.
  • The benchmark suite includes different types of datasets on which cross-domain document object detectors can be trained and evaluated.
  • The proposed model builds upon the standard object detection model with three novel domain alignment modules, namely, the Feature Pyramid Alignment (FPA) module, the Region Alignment (RA) module, and the Rendering Layer Alignment (RLA) module.
  • The proposed method improves over the state-of-the-art method for cross-domain object detection on natural scene images.
Tables
  • Table1: Impact of adding the proposed RLA module on an existing work. The best results are in bold. (Rendering layers are available for both source and target images, so they can be used as an additional supervision cue to bridge domain gaps; RLA utilizes them to generate for each page a mask specifying the drawing type each pixel belongs to. Figure 3 illustrates this process.)
  • Table2: Ablation study about the effectiveness of the proposed components. The best results are in bold
  • Table3: Cross-domain detection results between Legal and Chn. The “Oracle” results are obtained by FPN trained with labeled training data of the target domain. The best results are in bold
  • Table4: Cross-domain detection results between PubMed and Chn
  • Table5: Cross-domain detection results between Legal and PubMed
  • Table6: Cross-domain detection results for natural scene images
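As a concrete, purely illustrative example of the mask RLA builds from PDF rendering layers (see the Table 1 note), one can rasterize each layer's coverage and fuse the layers into a per-pixel drawing-type label map. The label set and the rule that text takes precedence where layers overlap are assumptions of this sketch, not the paper's confirmed procedure:

```python
import numpy as np

# hypothetical drawing-type encoding for the per-pixel mask
BACKGROUND, TEXT, IMAGE, VECTOR = 0, 1, 2, 3

def drawing_type_mask(text_layer, image_layer, vector_layer):
    """Fuse boolean (H, W) coverage masks of the PDF rendering layers
    into a single label map; later assignments win where layers overlap."""
    mask = np.full(text_layer.shape, BACKGROUND, dtype=np.uint8)
    mask[vector_layer] = VECTOR
    mask[image_layer] = IMAGE
    mask[text_layer] = TEXT   # assumed precedence: text over image over vector
    return mask

# tiny 2x2 page: text at (0,0), an image across row 0, a vector mark at (1,1)
page = np.zeros((2, 2), dtype=bool)
text = page.copy();   text[0, 0] = True
image = page.copy();  image[0, :] = True
vector = page.copy(); vector[1, 1] = True
mask = drawing_type_mask(text, image, vector)
```

Such a mask exists for source and target pages alike, which is what lets RLA use it as an extra, annotation-free supervision signal during adaptation.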
Related work
  • Our work is related to document object detection and cross-domain object detection for natural scene images.

    2.1. Document Object Detection

    Most existing approaches to document object detection focus on certain types of objects, e.g., tables, figures, or mathematical formulas. Early works rely on various heuristic rules to extract and identify these objects from document images [24, 9, 32]. These approaches often involve a set of hyper-parameters that are difficult to adapt to new document domains. Recent works are usually data-driven and approach the problem with machine learning techniques, or a hybrid of heuristic rules and learning models. Taking advantage of the impressive progress of object detection on natural scene images, many works adapt natural-image object detectors to the uniqueness of document images [30, 11]. He et al. [13] propose a two-stage approach to detect tables and figures: in the first stage, the class label for each pixel is predicted using a multi-scale, multi-task fully convolutional neural network; in the second stage, heuristic rules are applied to the pixel-wise class predictions to obtain the object boxes. Gao et al. [8] utilize metadata of PDF files and detect formulas using a model that combines a CNN and an RNN. A few works detect multiple types of document objects jointly in a single framework [7]. Yi et al. [34] adapt the region proposal approach and redesign the CNN architecture of common object detectors for the uniqueness of document objects. Li et al. [20] first perform deep structured prediction to get primitive region proposals from each column region; the primitive proposals are then clustered, and those within the same cluster are merged into a single object instance.
Funding
  • This work was partially done during the internship of the first author at Adobe Research and was partially supported by Adobe Research funding
Reference
  • [1] Roldano Cattoni, Tarcisio Coianiz, Stefano Messelodi, and Carla Maria Modena. Geometric layout analysis techniques for document image understanding: a review. ITC-irst Technical Report, 9703(09), 1998.
  • [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
  • [3] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive Faster R-CNN for object detection in the wild. In CVPR, 2018.
  • [4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [5] Jing Fang, Xin Tao, Zhi Tang, Ruiheng Qiu, and Ying Liu. Dataset, ground-truth and performance metrics for table detection evaluation. In DAS, 2012.
  • [6] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
  • [7] Liangcai Gao, Xiaohan Yi, Zhuoren Jiang, Leipeng Hao, and Zhi Tang. ICDAR2017 competition on page object detection. In ICDAR, 2017.
  • [8] Liangcai Gao, Xiaohan Yi, Yuan Liao, Zhuoren Jiang, Zuoyu Yan, and Zhi Tang. A deep learning-based formula detection method for PDF documents. In ICDAR, 2017.
  • [9] Utpal Garain. Identification of mathematical expressions in document images. In ICDAR, 2009.
  • [10] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [11] Azka Gilani, Shah Rukh Qasim, Imran Malik, and Faisal Shafait. Table detection using deep learning. In ICDAR, 2017.
  • [12] Max Gobel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. ICDAR 2013 table competition. In ICDAR, 2013.
  • [13] Dafang He, Scott Cohen, Brian Price, Daniel Kifer, and C. Lee Giles. Multi-scale multi-task FCN for semantic page segmentation and table detection. In ICDAR, 2017.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [15] Zhenwei He and Lei Zhang. Multi-adversarial Faster-RCNN for unrestricted object detection. In ICCV, 2019.
  • [16] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR, 2018.
  • [17] Mehran Khodabandeh, Arash Vahdat, Mani Ranjbar, and William G. Macready. A robust learning approach to domain adaptive object detection. In ICCV, 2019.
  • [18] Seunghyeon Kim, Jaehoon Choi, Taekyung Kim, and Changick Kim. Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. In ICCV, 2019.
  • [19] Taekyung Kim, Minki Jeong, Seunghyeon Kim, Seokeon Choi, and Changick Kim. Diversify and match: A domain adaptive representation learning paradigm for object detection. In CVPR, 2019.
  • [20] Xiao-Hui Li, Fei Yin, and Cheng-Lin Liu. Page object detection from PDF document images by deep structured prediction and supervised clustering. In ICPR, 2018.
  • [21] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In ICCV, 2017.
  • [23] Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xiaofan Lin, and Xuan Hu. Mathematical formula identification in PDF documents. In ICDAR, 2011.
  • [24] Ning Liu, Dongxiang Zhang, Xing Xu, Long Guo, Lijiang Chen, Wenju Liu, and Dengfeng Ke. Robust math formula recognition in degraded Chinese document images. In ICDAR, 2017.
  • [25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • [26] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • [27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [28] Aruni RoyChowdhury, Prithvijit Chakrabarty, Ashish Singh, SouYoung Jin, Huaizu Jiang, Liangliang Cao, and Erik Learned-Miller. Automatic adaptation of object detectors to new domains using self-training. In CVPR, 2019.
  • [29] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Strong-weak distribution alignment for adaptive object detection. In CVPR, 2019.
  • [30] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. DeepDeSRT: Deep learning for detection and structure recognition of tables in document images. In ICDAR, 2017.
  • [31] Peter W. J. Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In KDD, 2018.
  • [32] Dieu Ni Tran, Tuan Anh Tran, Aran Oh, Soo Hyung Kim, and In Seop Na. Table detection from document image using vertical arrangement of text blocks. International Journal of Contents, 11(4):77–85, 2015.
  • [33] Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In CVPR, 2017.
  • [34] Xiaohan Yi, Liangcai Gao, Yuan Liao, Xiaode Zhang, Runtao Liu, and Zhuoren Jiang. CNN based page object detection in document images. In ICDAR, 2017.
  • [35] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: largest dataset ever for document layout analysis. arXiv preprint arXiv:1908.07836, 2019.
  • [36] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
  • [37] Xinge Zhu, Jiangmiao Pang, Ceyuan Yang, Jianping Shi, and Dahua Lin. Adapting object detectors via selective cross-domain alignment. In CVPR, 2019.