MetaPAD: Meta Pattern Discovery from Massive Text Corpora

KDD '17: The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 2017, pp. 877-886.

Keywords:
EMNLP, certain context, Area Under the precision-recall Curve, Inverse-Document Frequency, efficient framework

Abstract:

Mining textual patterns in news, tweets, papers, and many other kinds of text corpora has been an active theme in text mining and NLP research. Previous studies adopt a dependency parsing-based pattern discovery approach. However, the parsing results lose rich context around entities in the patterns, and the process is costly for a corpus...

Introduction
  • Discovering textual patterns from text data is an active research theme [4, 7, 10, 12, 28], with broad applications such as attribute extraction [11, 30, 32, 33], aspect mining [8, 15, 19], and slot filling [40, 41].
  • A data-driven exploration of efficient textual pattern mining may have strong implications on the development of efficient methods for NLP tasks on massive text corpora.
Highlights
  • Discovering textual patterns from text data is an active research theme [4, 7, 10, 12, 28], with broad applications such as attribute extraction [11, 30, 32, 33], aspect mining [8, 15, 19], and slot filling [40, 41]
  • The major contributions of this paper are as follows: (1) we propose a new definition of typed textual pattern, called meta pattern, which is more informative, precise, and efficient in discovery than the SOL pattern; (2) we develop an efficient meta-pattern mining framework, MetaPAD, which consists of three components: generating quality meta patterns by context-aware segmentation, grouping synonymous meta patterns, and adjusting entity-type levels for appropriate granularity in the pattern groups; and (3) our experiments on news and tweet text datasets demonstrate that MetaPAD not only generates high-quality patterns but also achieves significant improvement over the state of the art in information extraction
  • The meta patterns are of high quality in terms of informativeness, completeness, and so on, and practitioners can tell why the patterns are extracted as an integral semantic unit. Third, though the patterns like "$P
  • We proposed a novel typed textual pattern structure, called meta pattern, which is extended to a frequent, complete, informative, and precise subsequence pattern in certain context, compared with the SOL pattern
  • We developed an efficient framework, MetaPAD, to discover the meta patterns from massive corpora with three techniques, including (1) a context-aware segmentation method to carefully determine the boundaries of the patterns with a learnt pattern quality assessment function, which avoids costly dependency parsing and generates high-quality patterns (a minimal sketch follows this list), (2) a clustering method to group synonymous meta patterns with integrated information of types, context, and instances, and (3) top-down and bottom-up schemes to adjust the levels of entity types in the meta patterns by examining the type distributions of entities in the instances
  • Experiments demonstrated that MetaPAD efficiently discovered a large collection of high-quality typed textual patterns to facilitate challenging NLP tasks like tuple information extraction
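The context-aware segmentation with a pattern quality function can be pictured as a dynamic-programming pass over each typed sentence: the sentence (with entity mentions already replaced by type placeholders) is split into contiguous segments so that the total log-quality of the segments is maximized, and the resulting segment counts rectify pattern frequency. The code below is a minimal sketch under stated assumptions: the quality function is a toy frequency lookup, not MetaPAD's trained assessment function, and the placeholder names ($Country, $Person) are illustrative.

from math import log

def segment(tokens, quality, max_len=6, min_quality=1e-6):
    """Split `tokens` into contiguous segments maximizing total log-quality."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)   # best[i]: best score for tokens[:i]
    back = [0] * (n + 1)               # back-pointers for reconstruction
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            cand = tuple(tokens[j:i])
            score = best[j] + log(max(quality(cand), min_quality))
            if score > best[i]:
                best[i], back[i] = score, j
    segments, i = [], n
    while i > 0:                        # walk back-pointers to recover segments
        segments.append(tuple(tokens[back[i]:i]))
        i = back[i]
    return list(reversed(segments))

# Hypothetical usage with a toy frequency-based quality function.
counts = {("$Country", "president", "$Person"): 120, ("said",): 300}
toy_quality = lambda seg: counts.get(seg, 1) / 1000.0
print(segment(["$Country", "president", "$Person", "said"], toy_quality))
# -> [('$Country', 'president', '$Person'), ('said',)]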
Results
  • The authors' proposed MetaPAD discovers high-quality meta patterns by context-aware segmentation from massive text corpora with a pattern quality assessment function.
  • It further organizes them into synonymous groups (a grouping sketch follows this list).
  • Table 4 presents the groups of synonymous meta patterns that express attribute types country:president and company:ceo.
  • Note that for the smaller news data, which have many long sentences, PATTY takes even more time, 10.1 hours.
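The grouping of synonymous meta patterns can be sketched as clustering over simple feature sets. The code below is an illustrative simplification, assuming each pattern is represented by the union of its entity types, context words, and extracted instances, and using single-link merging with a Jaccard threshold; MetaPAD's actual clustering integrates these signals differently.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def group_patterns(patterns, threshold=0.4):
    """patterns: dict mapping pattern string -> set of features.
    Returns a list of groups (lists of pattern strings)."""
    groups = []
    for name, feats in patterns.items():
        for group in groups:
            # Join an existing group if similar enough to any member.
            if any(jaccard(feats, patterns[m]) >= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical feature sets (types, context words, and instance tuples).
pats = {
    "$Country president $Person": {"country", "person", "president", ("USA", "Obama")},
    "$Person , president of $Country": {"country", "person", "president", "of", ("USA", "Obama")},
    "$Company CEO $Person": {"company", "person", "ceo", ("Yahoo", "Mayer")},
}
print(group_patterns(pats))
# -> [['$Country president $Person', '$Person , president of $Country'], ['$Company CEO $Person']]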
Conclusion
  • The authors proposed a novel typed textual pattern structure, called meta pattern, which is extended to a frequent, complete, informative, and precise subsequence pattern in certain context, compared with the SOL pattern (a sketch of one possible representation follows this list).
  • Experiments demonstrated that MetaPAD efficiently discovered a large collection of high-quality typed textual patterns to facilitate challenging NLP tasks like tuple information extraction
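As a concrete picture of the structure restated above, a meta pattern can be held in memory as a token sequence in which some tokens are entity-type placeholders, together with its rectified frequency, quality score, and extracted instances. The field names below are illustrative assumptions, not taken from the MetaPAD implementation.

from dataclasses import dataclass, field

@dataclass
class MetaPattern:
    tokens: tuple                 # e.g. ("$Country", "president", "$Person")
    frequency: int = 0            # rectified count after segmentation
    quality: float = 0.0          # learned pattern quality score
    instances: list = field(default_factory=list)  # extracted entity tuples

p = MetaPattern(("$Country", "president", "$Person"))
p.instances.append(("United States", "Barack Obama"))
print(p)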
Tables
  • Table 1: Issues of quality over-/under-estimation can be fixed when the segmentation rectifies pattern frequency
  • Table 2: Two datasets we use in the experiments
  • Table 3: Entity-Attribute-Value tuples as ground truth
  • Table 4: Synonymous meta patterns and their extractions that MetaPAD generates from the news corpus APR on country:president, company:ceo, and person:date of birth
  • Table 5: Reporting F1, AUC, and number of true positives (TP) on tuple extraction from news and tweets data (an evaluation sketch follows this list)
  • Table 6: Efficiency: time complexity is linear in corpus size
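The metrics of Table 5 can be reproduced in spirit with a short evaluation routine over scored extractions and the ground-truth Entity-Attribute-Value tuples of Table 3. The sketch below (requiring scikit-learn) assumes exact tuple matching and computes the precision-recall AUC only over the extracted candidates; the paper's exact scoring protocol may differ.

from sklearn.metrics import precision_recall_curve, auc

def evaluate(extractions, gold):
    """extractions: list of ((e, a, v), confidence); gold: set of (e, a, v)."""
    labels = [1 if t in gold else 0 for t, _ in extractions]
    scores = [s for _, s in extractions]
    tp = sum(labels)
    precision = tp / len(labels) if labels else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    p, r, _ = precision_recall_curve(labels, scores)
    return {"F1": f1, "PR-AUC": auc(r, p), "TP": tp}

# Hypothetical gold tuples and scored extractions.
gold = {("USA", "president", "Barack Obama"), ("Yahoo", "ceo", "Marissa Mayer")}
preds = [(("USA", "president", "Barack Obama"), 0.9),
         (("Yahoo", "ceo", "Tim Cook"), 0.4)]
print(evaluate(preds, gold))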
Related work
  • In this section, we summarize existing systems and methods that are related to the topic of this paper.

    TextRunner [4] extracts strings of words between entities in a text corpus, and clusters and simplifies these word strings to produce relation-strings. ReVerb [10] constrains patterns to verbs or verb phrases that end with prepositions. However, the methods in the TextRunner/ReVerb family generate patterns of frequent relational strings/phrases without entity information. Another line of work, open information extraction systems [2, 22, 36, 39], is supposed to extract verbal expressions for identifying arguments. This is less related to our task of discovering textual patterns.

    Google’s Biperpedia [12, 13] generates E-A patterns (e.g., “A of E” and “E ’s A”) from users’ fact-seeking queries (e.g., “president of united states” and “barack obama’s wife”) by replacing the entity with “E” and the noun-phrase attribute with “A”. ReNoun [40] generates S-A-O patterns (e.g., “S’s A is O” and “O, A of S,”) from human-annotated
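As a concrete illustration of the E-A pattern idea above, the transformation from a fact-seeking query to a pattern is essentially a span replacement, assuming the entity and attribute spans are already identified (e.g., by an entity linker). This is a sketch of the pattern shape only, not Biperpedia's pipeline.

def ea_pattern(query, entity, attribute):
    """Replace the entity span with E and the attribute span with A."""
    return query.replace(entity, "E").replace(attribute, "A")

print(ea_pattern("president of united states", "united states", "president"))  # -> "A of E"
print(ea_pattern("barack obama's wife", "barack obama", "wife"))               # -> "E's A"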
Funding
  • Research was sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS-1320617 and IIS 16-18481, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053 (the ARL Network Science CTA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. This research was supported by grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov)
References
  • [1] Rakesh Agrawal and Ramakrishnan Srikant. 1995. Mining sequential patterns. In ICDE. 3–14.
  • [2] Gabor Angeli, Melvin Johnson Premkumar, and Christopher D Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015).
  • [3] Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: a nucleus for a web of open data. In The Semantic Web. 722–735.
  • [4] Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI, Vol. 7. 2670–2676.
  • [5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD. 1247–1250.
  • [6] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
  • [7] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, Vol. 5. 3.
  • [8] Zhiyuan Chen, Arjun Mukherjee, and Bing Liu. 2014. Aspect extraction with automated prior knowledge learning. In ACL.
  • [9] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, and others. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, Vol. 6.
  • [10] Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In EMNLP. 1535–1545.
  • [11] Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema, and Andrew Fano. 2006. Text mining for product attribute extraction. SIGKDD Explorations 8, 1 (2006), 41–48.
  • [12] Rahul Gupta, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu. 2014. Biperpedia: an ontology for search applications. PVLDB 7, 7 (2014), 505–516.
  • [13] Alon Halevy, Natalya Noy, Sunita Sarawagi, Steven Euijong Whang, and Xiao Yu. 2016. Discovering structure in the universe of attribute names. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 939–949.
  • [14] Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2. Association for Computational Linguistics, 539–545.
  • [15] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 168–177.
  • [16] Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. 2010. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC), Vol. 3. 3–3.
  • [17] Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, and Shiqiang Yang. 2016. Inferring lockstep behavior from connectivity pattern in large graphs. Knowledge and Information Systems 48, 2 (2016), 399–428.
  • [18] Meng Jiang, Christos Faloutsos, and Jiawei Han. 2016. CatchTartan: representing and summarizing dynamic multicontextual behaviors. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
  • [19] Anitha Kannan, Inmar E Givoni, Rakesh Agrawal, and Ariel Fuxman. 2011. Matching unstructured product offers to structured product specifications. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 404–412.
  • [20] Xiao Ling and Daniel S Weld. 2012. Fine-grained entity recognition. In AAAI.
  • [21] Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. 2015. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 1729–1744.
  • [22] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations). 55–60.
  • [23] Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 91–98.
  • [24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
  • [25] Thahir P Mohamed, Estevam R Hruschka Jr, and Tom M Mitchell. 2011. Discovering relations between noun categories. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1447–1455.
  • [26] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30, 1 (2007), 3–26.
  • [27] Ndapandula Nakashole, Tomasz Tylenda, and Gerhard Weikum. 2013. Fine-grained semantic typing of emerging entities. In ACL. 1488–1497.
  • [28] Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: a taxonomy of relational patterns with semantic types. In EMNLP. 1135–1145.
  • [29] Vivi Nastase, Michael Strube, Benjamin Borschinger, Cacilia Zirn, and Anas Elghafari. 2010. WikiNet: a very large scale multi-lingual concept network. In LREC.
  • [30] Marius Pasca and Benjamin Van Durme. 2008. Weakly-supervised acquisition of open-domain classes and class attributes from web documents and query logs. In ACL. 19–27.
  • [31] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu. 2004. Mining sequential patterns by pattern-growth: the PrefixSpan approach. TKDE 16, 11 (2004), 1424–1440.
  • [32] Katharina Probst, Rayid Ghani, Marko Krema, Andrew Fano, and Yan Liu. 2007.
  • [33] Sujith Ravi and Marius Pasca. 2008. Using structured text for large-scale attribute extraction. In Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, 1183–1192.
  • [34] Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R Voss, and Jiawei Han. 2015. ClusType: effective entity recognition and typing by relation phrase-based clustering. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 995–1004.
  • [35] Xiang Ren, Wenqi He, Meng, Clare R Voss, Heng Ji, and Jiawei Han. 2016.
  • [36] Michael Schmitz, Robert Bart, Stephen Soderland, Oren Etzioni, and others. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 523–534.
  • [37] Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, and Jiawei Han. 2017. Automated phrase mining from massive text corpora. arXiv preprint arXiv:1702.04457 (2017).
  • [38] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP, Vol. 15. 1499–1509.
  • [39] Fei Wu and Daniel S Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 118–127.
  • [40] Mohamed Yahya, Steven Whang, Rahul Gupta, and Alon Y Halevy. 2014. ReNoun: fact extraction for nominal attributes. In EMNLP. 325–335.
  • [41] Dian Yu and Heng Ji. 2016. Unsupervised person slot filling based on graph mining. In ACL.
  • [42] Ning Zhong, Yuefeng Li, and Sheng-Tang Wu. 2012. Effective pattern discovery for text mining. IEEE Transactions on Knowledge and Data Engineering 24, 1 (2012), 30–44.