Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020, pp. 2367–2376.

DOI: https://doi.org/10.1145/3397271.3401442

Abstract:

Many business documents processed in modern NLP and IR pipelines are visually rich: in addition to text, their semantics can also be captured by visual traits such as layout, format, and fonts. We study the problem of information extraction from visually rich documents (VRDs) and present a model that combines the power of large pre-trained …

Introduction
  • Information extraction (IE) is the process of identifying within text instances of specified classes of entities as well as relations and events involving these entities [10].
  • Consider the examples in Figure 1: the layouts of invoices, resumes, and job ads carry important information. Section titles in resumes and job ads are often set in fonts different from the body text, and prices in invoices are often listed in the same column under the "Amount" header
  • Such information is ignored by models that rely solely on text, and IE performance is hindered as a result
Highlights
  • Information extraction (IE) is the process of identifying within text instances of specified classes of entities as well as relations and events involving these entities [10]
  • We introduce two fine-tuning objectives, Sequence Positional Relationship Classification (SPRC) and Masked Language Modeling (MLM), to fine-tune the models on unlabeled in-domain data (an illustrative SPRC sketch follows this list)
  • The results indicate that models utilizing the layout information with graph neural network modules outperform every baseline by significant margins: adding the Graph Convolutional Network (GCN) module without fine-tuning improves F1 from 89.58 to 94.37
  • Our model significantly improves performance on extracting position-sensitive entities such as School Duration and Section Title
  • We introduced a novel approach for structure-aware IE from visually rich documents, with fine-tuning objectives that prove both effective and robust
  • Experimental results on two datasets and on the few-shot setting suggest that incorporating rich layout information and expressive text representation significantly improves extraction performance and reduces annotation cost for information extraction from visually rich documents
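
The page does not spell out how SPRC is implemented, so the following is only a plausible sketch: assume the objective asks the model to classify the spatial relationship between two encoded text segments. The relation label set, mean pooling, and layer sizes below are illustrative assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn

# Hypothetical relation label set; the paper's actual SPRC classes are not
# given on this page.
RELATIONS = ["left-of", "right-of", "above", "below"]

class SPRCHead(nn.Module):
    """Classify the positional relationship between two text segments."""
    def __init__(self, hidden=768, n_rel=len(RELATIONS)):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_rel)
        )

    def forward(self, seg_a, seg_b):
        # seg_a, seg_b: (batch, seq_len, hidden) contextual token embeddings.
        pooled = torch.cat([seg_a.mean(dim=1), seg_b.mean(dim=1)], dim=-1)
        return self.cls(pooled)               # (batch, n_rel) relation logits

logits = SPRCHead()(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 3]))
```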
Methods
  • A graph convolutional network (GCN) module [16] models the complicated visual layout of a document page: text boxes become graph nodes and their spatial relationships become edges (a minimal GCN sketch follows this list)
  • Two fine-tuning objectives, Sequence Positional Relationship Classification (SPRC) and Masked Language Modeling (MLM), adapt the pre-trained language model (RoBERTa [20]) to unlabeled in-domain documents
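
As a concrete reference point for the graph module, the GCN propagation rule of Kipf and Welling [16] can be written in a few lines of PyTorch. This is a minimal sketch: the hidden sizes and the toy chain graph are assumptions, and the paper's actual graph construction and node features are not given on this page.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer [16]: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # A_hat = D^-1/2 (A + I) D^-1/2: add self-loops, then symmetrically
        # normalize by node degree.
        a = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)
        a_hat = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(a_hat @ h))

# Toy usage: four text boxes with 768-d contextual embeddings, chain-connected.
h = torch.randn(4, 768)
adj = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                    [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
out = GCNLayer(768, 256)(h, adj)   # -> (4, 256) layout-aware node features
```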
Results
  • The authors apply the graph module to model the complicated visual layout of a document page and two fine-tuning objectives to improve the performance of the language model.
  • The results indicate that models utilizing the layout information with graph neural network modules outperform every baseline by significant margins: adding the GCN module without fine-tuning improves F1 from 89.58 to 94.37.
  • As illustrated in Table 6, both training objectives improve model performance, and fine-tuning with MLM and SPRC obtains the highest F1 score of 72.13 on this dataset (an illustrative MLM fine-tuning sketch follows)
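
Fine-tuning a masked language model on unlabeled in-domain text is standard practice; a minimal sketch using the HuggingFace Transformers library the authors cite [37] might look as follows. The base checkpoint, example strings, and hyperparameters are placeholders, not the paper's configuration.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Unlabeled in-domain text, e.g. lines taken from unannotated invoices.
lines = ["Invoice No. 12345", "Amount due: $1,200.00", "Payment terms: NET 30"]

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
dataset = [tok(line, truncation=True, max_length=64) for line in lines]

# The collator masks 15% of tokens at random and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-finetune", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```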
Conclusion
  • The authors introduced a novel approach for structure-aware IE from visually rich documents, with fine-tuning objectives that prove both effective and robust.
  • The authors design two fine-tuning objectives to fully utilize unlabeled data and reduce annotation cost.
  • Experimental results on two datasets and on the few-shot setting suggest that incorporating rich layout information and expressive text representation significantly improves extraction performance and reduces annotation cost for information extraction from visually rich documents.
  • Possible directions include incorporating more spatial and formatting features and better modeling of the relationships between text boxes
Summary
  • Objectives:

A large portion of today's business documents are born digital and offer richer and more accurate layout information than OCR.
  • The authors aim to better utilize such information in this work
Tables
  • Table 1: Invoice Dataset statistics
  • Table 2: Model accuracy on the Invoice Dataset
  • Table 3: Per-entity statistics on the labeled Invoice Dataset (columns: entity, count, RoBERTa F1, RoBERTa+GCN F1)
  • Table 4: Fine-tuning task scores on the Invoice Dataset
  • Table 5: Model accuracy on the unseen Invoice Dataset
  • Table 6: Model accuracy on the Resume Dataset
  • Table 7: Statistics and accuracy of the tags in the Resume Dataset
  • Table 8: Ablation study on the graph module
Related work
  • 2.1 IE for Visually Rich Documents

    Our work falls under the scope of information extraction from visually rich documents, a relatively new research topic. We roughly divide current approaches to this problem into three categories. The first category is rule-based systems. [30] describes an invoice extraction system using a number of rules and empirical features, and [5] builds a more stable system using the tf-idf algorithm and a large number of human-designed features. Other document-level entity extraction systems combine rules and statistical models [2, 32]. Although it is possible to craft high-precision rules in some closed-domain applications, rule-based systems usually require extensive human effort and cannot be rapidly adapted to new domains.

    The second category is graph-based statistical models, where graphs model the relationships between layout components such as text boxes. [31] first performs graph mining in a document with a set of key-fields selected by clients, in order to learn patterns for extracting information in the clients' absence. More recently, graph neural networks have been used to capture structural information in visually rich documents: [19] applies graph modules to encode visual information with deep neural networks, and GraphIE [28] likewise assumes that graph structure is ubiquitous in text, applying a GCN between a BiLSTM encoder and decoder to model document layout. The limitation of these methods is that they lack access to pre-trained language models such as BERT and have not explored rich visual information (e.g., font and weight) beyond the positions of text boxes. One common way such models define the document graph is sketched below.
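
For illustration, a common way such graph-based approaches define the document graph is to treat each text box as a node and connect spatially nearby boxes. The k-nearest-neighbor rule below is one plausible construction, not the exact scheme of [19] or [28].

```python
import numpy as np

def box_graph(boxes, k=4):
    """Adjacency over text boxes: connect each box to its k nearest
    neighbors by center distance. `boxes` is an (n, 4) array of
    (x0, y0, x1, y1) page coordinates; an illustrative construction only."""
    boxes = np.asarray(boxes, dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self-edges
    adj = np.zeros((len(boxes), len(boxes)))
    for i, row in enumerate(dist):
        for j in np.argsort(row)[:k]:
            adj[i, j] = adj[j, i] = 1.0       # undirected edge
    return adj

# Example: two header boxes above two body boxes on a page.
print(box_graph([[0, 0, 100, 20], [120, 0, 220, 20],
                 [0, 40, 100, 60], [120, 40, 220, 60]], k=2))
```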
References
  • [1] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for Document Classification. CoRR abs/1904.08398 (2019). arXiv:1904.08398 http://arxiv.org/abs/1904.08398
  • [2] Abdel Belaïd, Yolande Belaïd, Late N Valverde, and Saddok Kebairi. 2001. Adaptive technology for mail-order form segmentation. In Proceedings of the Sixth International Conference on Document Analysis and Recognition. IEEE, 689–693.
  • [3] Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics. https://www.aclweb.org/anthology/volumes/N19-1/
  • [4] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1511.07289
  • [5] Vincent Poulain d'Andecy, Emmanuel Hartmann, and Marçal Rusinol. 2018. Field extraction by hybrid incremental and a-priori structural templates. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 251–256.
  • [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
  • [7] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4 (2006), 594–611.
  • [8] Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 1409–1418.
  • [9] Ruiying Geng, Binhua Li, Yongbin Li, Yuxiao Ye, Ping Jian, and Jian Sun. 2019. Few-shot text classification with induction network. arXiv preprint arXiv:1902.10482 (2019).
  • [10] Ralph Grishman. 2012. Information extraction: Capabilities and challenges. (2012).
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 770–778. https://doi.org/10.1109/CVPR.2016.90
  • [12] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
  • [13] Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., USA.
  • [14] Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2d documents. arXiv preprint arXiv:1809.08799 (2018).
  • [15] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [16] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SJU4ayYgl
  • [17] Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text Generation from Knowledge Graphs with Graph Transformers. See [3], 2284–2293. https://doi.org/10.18653/v1/n19-1238
  • [18] Bang Liu, Ting Zhang, Di Niu, Jinghong Lin, Kunfeng Lai, and Yu Xu. 2018. Matching long text documents via graph convolutional networks. arXiv preprint arXiv:1802.07459 (2018).
  • [19] Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. 2019. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 2 (Industry Papers), Anastassia Loukina, Michelle Morales, and Rohit Kumar (Eds.). Association for Computational Linguistics, 32–39. https://doi.org/10.18653/v1/n19-2005
  • [20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
  • [21] Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826 (2017).
  • [22] Stephen Mayhew, Nitish Gupta, and Dan Roth. 2019. Robust Named Entity Recognition with Truecasing Pretraining. arXiv preprint arXiv:1912.07095 (2019).
  • [23] Shikib Mehri, Evgeniia Razumovsakaia, Tiancheng Zhao, and Maxine Eskenazi. 2019. Pretraining methods for dialog context representation learning. arXiv preprint arXiv:1906.00414 (2019).
  • [24] Erik G Miller, Nicholas E Matsakis, and Paul A Viola. 2000. Learning from one example through shared densities on transforms. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000 (Cat. No. PR00662), Vol. 1. IEEE, 464–471.
  • [25] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).
  • [26] Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474 (2019).
  • [27] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
  • [28] Yujie Qian, Enrico Santus, Zhijing Jin, Jiang Guo, and Regina Barzilay. 2019. GraphIE: A Graph-Based Framework for Information Extraction. See [3], 761. https://doi.org/10.18653/v1/n19-1082
  • [29] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openaiassets/researchcovers/languageunsupervised/language understanding paper.
  • [30] Marçal Rusinol, Tayeb Benkhelfallah, and Vincent Poulain d'Andecy. 2013. Field extraction from administrative documents by incremental structural templates. In 2013 12th International Conference on Document Analysis and Recognition. IEEE, 1100–1104.
  • [31] KC Santosh and Abdel Belaïd. 2013. Pattern-based approach to table extraction.
  • [32] Daniel Schuster, Klemens Muthmann, Daniel Esser, Alexander Schill, Michael Berger, Christoph Weidling, Kamil Aliyev, and Andreas Hofmeier. 2013. Intellix – End-User Trained Information Extraction for Document Archiving. In 2013 12th International Conference on Document Analysis and Recognition. IEEE.
  • [33] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 11 (1997), 2673–2681. https://doi.org/10.
  • [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need
  • [35] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.
  • [36] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems 29.
  • [37] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019).
  • [38] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2019. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. CoRR abs/1912.13318 (2019). arXiv:1912.13318 http://arxiv.org/abs/1912.13318
  • [39] Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse Few-Shot Text Classification with Multiple Metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, 1206–1215. https://doi.org/10.18653/v1/n18-1109