Table Search Using a Deep Contextualized Language Model

Trabelsi Mohamed
Xu Yinan

SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020, pp. 589-598.


Abstract:

Pretrained contextualized language models such as BERT have achieved impressive results on various natural language processing benchmarks. Benefiting from multiple pretraining tasks and large-scale training corpora, pretrained models can capture complex syntactic word relations. In this paper, we use the deep contextualized language model…

Introduction
  • As an efficient way to organize and display data, tables are broadly used in different applications: researchers use tables to present their experimental results; companies store information about customers and products in spreadsheets; and flight information display systems in airports show flight schedules to passengers in tables.
  • With structural information and metadata, tables store factual knowledge and are used to build question answering (QA) systems [34].
Highlights
  • As an efficient way to organize and display data, tables are broadly used in different applications: researchers use tables to present their experimental results; companies store information about customers and products in spreadsheets; and flight information display systems in airports show flight schedules to passengers in tables
  • In ad hoc table retrieval, given a query q ∈ Q usually consisting of several keywords q = {k1, k2, ..., kl}, our goal is to rank a set of tables T = {t1, t2, ..., tn} in descending order of their relevance scores with respect to q (a minimal sketch of this setup follows this list)
  • Even without encoding the tables, Hybrid-BERT-Text can still outperform semantic table retrieval (STR), which demonstrates that BERT can extract informative features from the context fields for ad hoc table retrieval
  • We have addressed the problem of ad hoc table retrieval with the deep contextualized language model BERT
  • We find that using the max salience selector with row items is the best strategy to construct BERT input
  • In experiments on public datasets, we show that our best approach can outperform the previous state-of-the-art method and BERT baselines by a large margin under different evaluation metrics
  • We further show that the feature-based approach of BERT is better than jointly training BERT with a feature fusion component
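As a concrete illustration of the ranking task above, here is a minimal sketch: every candidate table is scored by some query-table relevance function and the tables are sorted in descending score order. The `score_table` argument is a hypothetical placeholder (a trivial keyword-overlap scorer is used below) standing in for the BERT-based rankers studied in the paper.

```python
from typing import Callable, Dict, List


def rank_tables(query_keywords: List[str],
                tables: List[Dict],
                score_table: Callable[[List[str], Dict], float]) -> List[Dict]:
    """Return candidate tables sorted in descending order of relevance to the query."""
    scored = [(score_table(query_keywords, t), t) for t in tables]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored]


# Toy usage with a keyword-overlap scorer (illustration only).
query = ["fastest", "animals"]
tables = [
    {"caption": "List of fastest animals", "rows": [["Cheetah", "120 km/h"]]},
    {"caption": "World population by country", "rows": [["China", "1.4B"]]},
]
overlap = lambda kws, t: float(sum(k.lower() in t["caption"].lower() for k in kws))
print([t["caption"] for t in rank_tables(query, tables, overlap)])
```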
Methods
  • Each table could have context fields {p1, ..., pk } depending on the source of the table.
  • The settings of the proposed methods combine an item type with a content selector (Table 2); a minimal sketch of the max salience selector over row items follows this list:
    Hybrid-BERT-Row-Sum: Row items, Sum salience selector
    Hybrid-BERT-Row-Mean: Row items, Mean salience selector
    Hybrid-BERT-Row-Max: Row items, Max salience selector
    Hybrid-BERT-Col-Sum: Column items, Sum salience selector
    Hybrid-BERT-Col-Mean: Column items, Mean salience selector
    Hybrid-BERT-Col-Max: Column items, Max salience selector
    Hybrid-BERT-Cell-Sum: Cell items, Sum salience selector
    Hybrid-BERT-Cell-Mean: Cell items, Mean salience selector
    Hybrid-BERT-Cell-Max: Cell items, Max salience selector
  • The authors observe that the first row of a table usually contains some high-level concepts and provides informative context.
  • The authors consider the table header as a context field.
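The following is a hedged, hypothetical sketch of how a max salience content selector over row items could pick table content for the BERT input. The salience score here is a simple query-term overlap stand-in, not the exact definition used in the paper, and the token budget is an assumed parameter.

```python
from typing import List


def cell_salience(query_terms: List[str], cell: str) -> float:
    """Toy salience: number of query terms that appear in the cell (stand-in score)."""
    tokens = cell.lower().split()
    return float(sum(q.lower() in tokens for q in query_terms))


def select_rows_max_salience(query_terms: List[str],
                             rows: List[List[str]],
                             token_budget: int = 64) -> List[List[str]]:
    """Rank rows by the max salience over their cells, then keep rows until the budget."""
    ranked = sorted(rows,
                    key=lambda row: max(cell_salience(query_terms, c) for c in row),
                    reverse=True)
    selected, used = [], 0
    for row in ranked:
        cost = sum(len(c.split()) for c in row)  # rough token count before WordPiece
        if used + cost > token_budget:
            break
        selected.append(row)
        used += cost
    return selected
```

The selected rows would then be concatenated with the query and the context fields into a single BERT input sequence (e.g., "[CLS] query [SEP] context fields [SEP] selected rows [SEP]").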
Results
  • The authors summarize the experimental results in Table 3.
  • The authors can see that all BERT-based models achieve better results than semantic table retrieval (STR).
  • Adding rows and cells yields a marginal improvement over Hybrid-BERT-Text, indicating that encoding the table content has the potential to further boost performance.
  • The differences in performance among randomly selecting columns, rows and cells are not statistically significant.
  • Although the performance gain is statistically significant at the p = 0.005 level, BERT makes the main contribution, since encoding only the context fields already achieves impressive results.
Conclusion
  • The authors continue the discussion of the proposed methods.

    6.1 Ranking Only with BERT

    To answer RQ2, the authors run experiments that use only BERT features, which means f equals f_bert in Equation 3 (a minimal sketch of this setting follows this list).
  • In answer to RQ2, even without additional features, all the proposed methods, including the baselines, can outperform STR.
  • The conclusions are consistent with Section 5.4: the sum salience selector is the best for cell items, and the max salience selector with row items still performs the best when only BERT features are used.
  • The authors have addressed the problem of ad hoc table retrieval with the deep contextualized language model BERT.
  • The authors find that using the max salience selector with row items is the best strategy to construct the BERT input.
  • The authors conduct experiments on the WebQueryTable dataset and demonstrate that the method generalizes to other domains.
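Below is a hedged sketch of this scoring setup: the BERT representation f_bert is optionally concatenated with additional ranking features before a final scoring layer, so that passing no extra features recovers the BERT-only case (f = f_bert). The class name, dimensions, and the concrete form of Equation 3 are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class HybridScorer(nn.Module):
    """Hypothetical scorer: concatenate f_bert with extra features, then a linear layer."""

    def __init__(self, bert_dim: int = 768, extra_dim: int = 0):
        super().__init__()
        self.out = nn.Linear(bert_dim + extra_dim, 1)

    def forward(self, f_bert: torch.Tensor, f_extra: torch.Tensor = None) -> torch.Tensor:
        # With extra_dim == 0 and f_extra == None, the score depends on f_bert alone.
        f = f_bert if f_extra is None else torch.cat([f_bert, f_extra], dim=-1)
        return self.out(f).squeeze(-1)


# BERT-only ranking: one relevance score per query-table pair.
scorer = HybridScorer(bert_dim=768, extra_dim=0)
scores = scorer(torch.randn(4, 768))
```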
Tables
  • Table1: The length statistics of data provided by Zhang and Balog [49]. The length is calculated after WordPiece tokenization
  • Table2: The settings of all proposed methods, which use different item types and content selectors
  • Table3: Experimental results. The superscript † shows statistically significant improvements for the method compared with all other methods
  • Table4: The setting of our methods where only BERT features are used
  • Table5: Results using feature-based approaches. The superscript ‡ denotes statistically significant improvements over all baseline methods
  • Table6: Results on WebQueryTable dataset
Related work
  • 2.1 Table Search

    Zhang and Balog [49] propose a semantic table retrieval (STR) method for ad hoc table retrieval. They first map queries and tables into a set of word embeddings or graph embeddings. Four ways to calculate query-table similarity based on embeddings are then proposed. In the end, the resulting four semantic similarity features are combined with other features in a learning-to-rank framework. Table2Vec [48] obtains semantic features in a similar way but uses embeddings trained from different fields. This method is built upon STR and does not outperform it, so we only compare our methods with STR instead of Table2Vec.
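As a rough illustration of the embedding-based similarities used in STR (not the exact formulation of Zhang and Balog [49]), the sketch below computes one "early fusion" style signal: average the word vectors of the query and of the table text, then take their cosine similarity. The embedding lookup is a plain dictionary placeholder.

```python
import numpy as np


def avg_embedding(tokens, embeddings, dim=300):
    """Average the available word vectors; zero vector if nothing is in vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


def early_fusion_similarity(query_tokens, table_tokens, embeddings, dim=300):
    """Cosine similarity between the averaged query and table embeddings."""
    q = avg_embedding(query_tokens, embeddings, dim)
    t = avg_embedding(table_tokens, embeddings, dim)
    denom = np.linalg.norm(q) * np.linalg.norm(t)
    return float(q @ t / denom) if denom else 0.0
```

In STR, several such semantic similarity scores are combined with other features in a learning-to-rank framework, as described above.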
Funding
  • This material is based upon work supported by the National Science Foundation under Grant No. IIS-1816325.
Reference
  • Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The semantic web (ISWC). Springer, 722–735.
  • Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. 2015. TabEL: entity linking in web tables. In Proc. Int’l Semantic Web Conf. (ISWC). 425–441.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
  • Michael J Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. Proc. of the VLDB Endowment 2, 1 (2009), 1090–1101.
  • Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment 1, 1 (2008), 538–549.
  • Zhangming Chan, Xiuying Chen, Yongliang Wang, Juntao Li, Zhiqiang Zhang, Kun Gai, Dongyan Zhao, and Rui Yan. 2019. Stick to the Facts: Learning towards a Fidelity-oriented E-Commerce Product Description Generation. In Proceedings of the 2019 Conference on EMNLP and the 9th IJCNLP. 4958–4967.
  • Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit Dhillon. 2019. X-BERT: eXtreme Multi-label Text Classification with using Bidirectional Encoder Representations from Transformers. In Proceedings of NeurIPS Science Meets Engineering of Deep Learning Workshop.
  • Zhiyu Chen, Haiyan Jia, Jeff Heflin, and Brian D. Davison. 2020. Leveraging Schema Labels to Enhance Dataset Search. In European Conference on Information Retrieval. Springer, 267–280.
  • Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look At? An Analysis of BERT’s Attention. In BlackBoxNLP@ACL.
  • Eric Crestan and Patrick Pantel. 2011. Web-scale table census and classification. In Proceedings 4th ACM International Conference on Web Search and Data Mining (WSDM). ACM, 545–554.
  • Zhuyun Dai and Jamie Callan. 2019. Deeper Text Understanding for IR with Contextual Neural Language Modeling. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 985–988. https://doi.org/10.1145/3331184.3331303
  • Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2978–2988.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
  • Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019.
  • Jinyoung Kim, Xiaobing Xue, and W Bruce Croft. 2009. A probabilistic retrieval model for semistructured data. In Proc. European Conference on Information Retrieval (ECIR). Springer, 228–239.
  • Jin Young Kim and W Bruce Croft. 2012. A field relevance model for structured document retrieval. In Proc. European Conf. on Info. Retrieval. Springer, 97–108.
  • Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on EMNLP and the 9th IJCNLP (EMNLP-IJCNLP). 4364–4373.
  • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Research. TACL 7 (2019), 453–466.
  • Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6086–6096. https://doi.org/10.18653/v1/P19-1612
  • Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In Proc. 56th Annual Meeting of the Assoc. for Computational Linguistics (Vol. 1: Long Papers). 1736–1745.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. (2019). arXiv preprint arXiv:1907.11692.
  • Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2019. Universal Text Representation from BERT: An Empirical Study. arXiv preprint arXiv:1910.07973 (2019).
  • Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized Embeddings for Document Ranking. In Proc. 42nd Int’l ACM SIGIR Conference on Research and Development in Information Retrieval. 1101–1104.
  • Yosi Mass, Haggai Roitman, Shai Erera, Or Rivlin, Bar Weiner, and David Konopnicki. 2019. A Study of BERT for Non-Factoid Question-Answering under Passage Length Constraints. arXiv preprint arXiv:1908.06780 (2019).
  • Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2020).
  • Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424 (2019).
  • Paul Ogilvie and Jamie Callan. 2003. Combining document representations for known-item search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 143–150.
  • Harshith Padigela, Hamed Zamani, and W Bruce Croft. 2019. Investigating the Successes and Failures of BERT for Passage Re-Ranking. arXiv preprint arXiv:1905.01758 (2019).
  • Benjamin Piwowarski and Patrick Gallinari. 2003. A machine learning model for information retrieval with structured documents. In International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 425–438.
  • Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2019. Understanding the Behaviors of BERT in Ranking. arXiv preprint arXiv:1904.07531 (2019).
  • Stephen Robertson, Hugo Zaragoza, and Michael Taylor. 2004. Simple BM25 extension to multiple weighted fields. In Proc. 13th ACM International Conference on Information and Knowledge Management (CIKM). 42–49.
  • Wataru Sakata, Tomohide Shibata, Ribeka Tanaka, and Sadao Kurohashi. 2019. FAQ Retrieval Using Query-Question Similarity and BERT-Based Query-Answer Relevance. In Proc. 42nd Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval (Paris, France). 1113–1116. https://doi.org/10.1145/3331184.3331326
  • Huan Sun, Hao Ma, Xiaodong He, Wen-tau Yih, Yu Su, and Xifeng Yan. 2016. Table cell search for question answering. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 771–782.
  • Yibo Sun, Zhao Yan, Duyu Tang, Nan Duan, and Bing Qin. 2019. Content-based table retrieval for web queries. Neurocomputing 349 (2019), 183–189.
  • Krysta M Svore and Christopher JC Burges. 2009. A machine learning approach for improved BM25 retrieval. In Proc. 18th ACM Conf. on Information and Knowledge Management (CIKM). 1811–1814.
  • Mohamed Trabelsi, Brian D. Davison, and Jeff Heflin. 2019. Improved Table Retrieval Using Multiple Context Embeddings for Attributes. In Proc. IEEE Big Data. 1238–1244.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu. 2011. Recovering semantics of tables on the web. Proceedings of the VLDB Endowment 4, 9 (2011), 528–538.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of ICLR.
  • Zhen Wang, Jiachen Liu, Xinyan Xiao, Yajuan Lyu, and Tian Wu. 2018. Joint Training of Candidate Extraction and Answer Selection for Reading Comprehension. In Proceedings of the 56th Annual Meeting of the ACL. 1715–1724.
  • Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering. In EMNLP-IJCNLP 2019. ACL, Hong Kong, China, 5877– 5881. https://doi.org/10.18653/v1/D19-1599
  • Ross Wilkinson. 1994. Effective retrieval of structured documents. In Proc. ACM SIGIR Int’l Conf. on Research and Dev. in Information Retrieval. Springer, 311–317.
  • Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052 (2017).
  • Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations). 72–77.
  • Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972 (2019).
  • Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. 2018. Neural ranking models with multiple document fields. In Proc. 11th ACM Int’l Conf. on Web Search and Data Mining (WSDM). 700–708.
  • Li Zhang, Shuo Zhang, and Krisztian Balog. 2019. Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval. In Proc. 42nd Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval. 1029–1032.
  • Shuo Zhang and Krisztian Balog. 2018. Ad hoc table retrieval using semantic similarity. In Proc. World Wide Web Conference (TheWebConf). 1553–1562.