Partially-Aligned Data-to-Text Generation with Distant Supervision

EMNLP 2020, pp. 9183–9193 (2020)

Abstract

The Data-to-Text task aims to generate human-readable text describing some given structured data, enabling more interpretability. However, the typical generation task is confined to a few particular domains, since it requires well-aligned data which is difficult and expensive to obtain. Using partially-aligned data is an alternative way…

Introduction
  • The Data-to-Text generation task focuses on generating human-readable text corresponding to some given structured data.

    [Figure: a model is trained on triples such as ⟨Age of Empires, genre, strategy video game⟩ paired with description text]
  • Many works have been proposed to give impetus to the Data-to-Text generation task.
  • Gardent et al (2017a; 2017b) proposed the WebNLG task aiming at generating description text of the given KB triples.
  • Novikova et al (2017) proposed the E2E task aiming at generating restaurant reviews according to the given restaurant attributes.
  • Lebret et al (2016) proposed the WikiBio task in which the biography of each person is generated according to the given Wikipedia infobox
Highlights
  • The Data-to-Text generation task focuses on generating human-readable text corresponding to some given structured data.

    [Figure: a model is trained on triples such as ⟨Age of Empires, genre, strategy video game⟩ paired with description text]
  • We propose a distant supervision generation framework that can tackle the challenges of the new task, including the over-generation problem
  • The superior performance of our Distant Supervision Generation (DSG) model shows that the supportiveness scores do help alleviate the over-generation problem
  • We propose a new task, namely, partially-aligned Data-to-Text generation, in which we generate human-readable text based on automatically produced training data
  • We contribute a partially-aligned dataset, WITA, produced by our novel automatic annotation framework, which is suitable for this new task
Methods
  • The target of the task is to train a model that generates text T that exactly describes the KB triples in K.
  • As illustrated in Fig. 2, in the SE training procedure, the authors first pre-train the SE component to estimate a supportiveness vector s ∈ Rm indicating whether each target word wi ∈ T is describing the input triples in K.
  • The pre-trained SE component is utilized to estimate a supportiveness vector s in both S2SG Training and S2SG Generation.
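In the paper the SE component is a learned model, but its role can be illustrated with a crude lexical-overlap heuristic (a sketch only, not the authors' method; the function name and scoring rule here are assumptions): a target word receives a high supportiveness score when it is covered by the words of the input triples, and a low score otherwise.

```python
def estimate_supportiveness(target_words, triples):
    """Heuristic stand-in for the SE component: a target word counts as
    'supported' (score 1.0) if it appears in any input triple's
    subject, relation, or object; otherwise it scores 0.0."""
    triple_words = {w.lower() for t in triples for field in t for w in field.split()}
    return [1.0 if w.lower() in triple_words else 0.0 for w in target_words]

words = "Age of Empires is a strategy video game".split()
triples = [("Age of Empires", "genre", "strategy video game")]
s = estimate_supportiveness(words, triples)
# "is" and "a" are judged unsupported by this heuristic
```

A learned estimator would instead produce soft scores in [0, 1], which is what allows the soft adaptor discussed in the results.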
Results
  • (1) The superior performance of the DSG model shows that the supportiveness scores do help alleviate the over-generation problem.
  • (2) The DSG-A model outperforms models without any adaptor, but it fails to exceed the DSG model on all metrics.
  • This shows that attention can be used to alleviate the over-generation problem, but it does not perform as well as the supportiveness scores.
  • (3) The DSG model outperforms the DSG-H model, illustrating that the soft adaptor is better than the hard adaptor.
  • (4) The ablation experiments show that both the RBS and …
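The soft/hard adaptor contrast above can be sketched as follows (illustrative Python, not the authors' implementation; function names and the threshold value are assumptions): the soft adaptor scales each target token's loss by its supportiveness score, while the hard adaptor keeps or drops each token's loss outright based on a threshold.

```python
def soft_adaptor(token_losses, scores):
    """Soft adaptor: scale each token's loss by its supportiveness score,
    so weakly supported (likely over-generated) tokens contribute less."""
    return [l * s for l, s in zip(token_losses, scores)]

def hard_adaptor(token_losses, scores, threshold=0.5):
    """Hard adaptor: keep a token's loss only if its score passes a threshold."""
    return [l if s >= threshold else 0.0 for l, s in zip(token_losses, scores)]

losses = [1.0, 2.0]
scores = [0.9, 0.2]   # second token judged weakly supported
soft = soft_adaptor(losses, scores)   # [0.9, 0.4]
hard = hard_adaptor(losses, scores)   # [1.0, 0.0]
```

The soft version preserves a gradient signal even for borderline tokens, which is consistent with the result that the soft adaptor beats the hard one.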
Conclusion
  • The authors propose a new task, namely, partially-aligned Data-to-Text generation, in which human-readable text is generated based on automatically produced training data.
  • This task is more practical and extensible to arbitrary domains.
  • The authors contribute a partially-aligned dataset, WITA, produced by their novel automatic annotation framework, which is suitable for this new task.
Tables
  • Table1: Statistics of WITA and WebNLG. For the text length and KB number, the values reported are mean, median, min, and max, respectively
  • Table2: Main results
  • Table3: N-gram statistics for over-generation error analysis
  • Table4: Dataset size analysis
  • Table5: Case study. The red font stands for over-generated words while the blue underline indicates incoherent parts
  • Table6: Human evaluation
Related work
  • During the past few years, many tasks have been proposed to generate human-readable text from structured data. WebNLG (Gardent et al, 2017a,b; Ferreira et al, 2019) is proposed to describe KB triples sampled from DBPedia (Auer et al, 2007).

    The E2E task (Novikova et al, 2017; Dusek et al, 2020) is proposed for generating restaurant reviews based on given attributes. Lebret et al (2016) propose the WikiBio task to generate people's biographies based on a given Wikipedia infobox. Fu et al (2020a) propose to generate text based on event chains. Moreover, Liang et al (2009) propose to generate weather reports from weather records, and Wiseman et al (2017), Chen and Mooney (2008), and Puduppully et al (2019) propose to generate match reports from match briefings. All these datasets are restricted to a few domains where well-aligned data happens to be available; no existing work focuses on handling partially-aligned data. To address the dataset scarcity problem, Fu et al (2020c) propose to use dual learning to train generation models on unaligned text and knowledge triples: the model generates text from input triples and then predicts the input triples back with a dual extraction model, and the two models are trained alternately. Although Cheng et al (2020) propose the ENT-DESC task, which aims to generate better text descriptions for a few entities by exploiting knowledge from a KB, their focus is more on distilling the useful part of the input knowledge.
Funding
  • The work described in this paper is substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Codes: 14204418) and the National Natural Science Foundation of China (NSFC No 61532010, 61732008)
Study subjects and analysis
records: 400
4.1 Experimental Setup. We split WITA into a training set, a development set, and a testing set of 50,000, 5,000, and 400 records respectively. For the purpose of evaluating the performance of the models, we ask human helpers to annotate the testing set sentences
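The 50,000 / 5,000 / 400 split described above can be reproduced with a simple partitioning sketch (the function name and seeding are assumptions; the paper does not specify the shuffling procedure):

```python
import random

def split_wita(records, seed=0):
    """Shuffle and split records into train/dev/test sets of
    50,000 / 5,000 / 400, matching the reported experimental setup."""
    rng = random.Random(seed)
    records = records[:]          # avoid mutating the caller's list
    rng.shuffle(records)
    return records[:50_000], records[50_000:55_000], records[55_000:55_400]

train, dev, test_set = split_wita(list(range(55_400)))
```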

samples: 130
We conduct a human evaluation to assess the generation performance. We sample 130 sentences from each model's generated output and ask human helpers to give an overall score and a match score with respect to the target sentences, each ranging from 1 to 10. The results are…
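The evaluation protocol above amounts to sampling and averaging ratings; a minimal sketch (names and the data layout are assumptions, not the authors' tooling):

```python
import random
import statistics

def human_eval(generated, n=130, seed=0):
    """Sample n generated sentences for one model (the paper uses 130)
    and average the 1-10 'overall' and 'match' ratings from annotators."""
    batch = random.Random(seed).sample(generated, n)
    return {
        "overall": statistics.mean(s["overall"] for s in batch),
        "match": statistics.mean(s["match"] for s in batch),
    }

# Toy pool where every sentence received the same ratings
pool = [{"overall": 7, "match": 8}] * 200
scores = human_eval(pool)
```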

References
  • Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, pages 722–735.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  • David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning, pages 128–135.
  • Liying Cheng, Dekun Wu, Lidong Bing, Yan Zhang, Zhanming Jie, Wei Lu, and Luo Si. 2020. ENT-DESC: Entity description generation by exploring knowledge graph. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1724–1734.
  • George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145.
  • Ondrej Dusek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. Computer Speech & Language, 59:123–156.
  • Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.
  • Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, and Emiel Krahmer. 2019. Neural data-to-text generation: A comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 552–562.
  • Zihao Fu, Lidong Bing, and Wai Lam. 2020a. Open domain event text generation. In Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 7748–7755.
  • Zihao Fu, Lidong Bing, Wai Lam, and Shoaib Jameel. 2020b. Dynamic topic tracker for KB-to-text generation. In Proceedings of the 28th International Conference on Computational Linguistics (COLING).
  • Zihao Fu, Bei Shi, Lidong Bing, and Wai Lam. 2020c. Unsupervised KB-to-text generation with auxiliary triple extraction using dual learning. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing.
  • Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017a. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188.
  • Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017b. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133.
  • Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640.
  • Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1203–1213.
  • Joël Legrand, Michael Auli, and Ronan Collobert. 2016. Neural network-based word alignment through score aggregation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pages 66–73.
  • Percy Liang, Michael Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 91–99.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
  • Jekaterina Novikova, Ondrej Dusek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with entity modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2023–2035.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
  • Anastasia Shimorina and Claire Gardent. 2018. Handling rare items in data-to-text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 360–370.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
  • Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2253–2263.