
APE: Argument Pair Extraction from Peer Review and Rebuttal via Multi-task Learning

EMNLP 2020


Abstract

Peer review and rebuttal, with rich interactions and argumentative discussions in between, are naturally a good resource to mine arguments. However, few works study both of them simultaneously. In this paper, we introduce a new argument pair extraction (APE) task on peer review and rebuttal in order to study the contents, the structure and the connections between them. […]

Introduction
Highlights
  • Argument mining is an important research field that has attracted growing attention in recent years (Lawrence and Reed, 2019)
  • After the argument mining step, we evaluate the mining result by checking the correctness of each argument span, consisting of one or a few sentences labeled with the IOBES scheme (see the sketch after this list)
  • Although PL-H-LSTM-CRF achieves competitive performance on the argument mining task and the highest F1 score on the sentence pairing task, its overall extraction performance is much worse than that of our multi-task model MT-H-LSTM-CRF (multi-task hierarchical long short-term memory (LSTM) network with a conditional random field (CRF))
  • To further verify the effectiveness of our multi-task learning through the hierarchical LSTM design, we investigate another baseline, namely Hybrid-MT-H-LSTM-CRF, which uses the sentence embeddings from the token-level biLSTM (T-LSTM) as the features for the sentence pairing classification and uses the embeddings from the sentence-level LSTM (S-LSTM) for labeling argument sentences
  • We introduce a new task of extracting argument pairs from review and rebuttal passages, which explores a new domain for the argument mining research field
  • It is clear that the model performs significantly better on the rebuttal passages than on the review passages
  • We propose a multi-task learning approach based on hierarchical LSTM networks to tackle this problem
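The span-level IOBES evaluation mentioned in the second bullet above can be made concrete with a short Python sketch. Decoding IOBES sentence tags into spans is standard; the exact-match scoring and function names below are illustrative assumptions, not the authors' released scorer.

```python
# Minimal sketch: decode IOBES sentence tags into argument spans and score
# predictions against gold spans by exact match. Illustrative only.

def iobes_to_spans(tags):
    """Convert a list of IOBES tags (one per sentence) into (start, end)
    sentence spans, end-inclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "S":                    # single-sentence argument
            spans.append((i, i))
            start = None
        elif tag == "B":                  # beginning of a multi-sentence argument
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i))      # end of the current argument
            start = None
        elif tag == "O":                  # outside any argument
            start = None
        # "I" continues the current span: nothing to do
    return spans

def span_f1(gold, pred):
    """Exact-match span precision, recall, and F1."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

gold = iobes_to_spans(["B", "I", "E", "O", "S"])   # [(0, 2), (4, 4)]
pred = iobes_to_spans(["B", "E", "O", "O", "S"])   # [(0, 1), (4, 4)]
print(span_f1(gold, pred))                         # (0.5, 0.5, 0.5)
```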
Methods
  • The authors first split the RR dataset randomly at the review-rebuttal passage-pair level by a ratio of 8:1:1 for training, development and testing, namely RR-passage.
  • Since different reviewers may discuss similar issues for one submission, different review-rebuttal passage pairs of the same submission may share similar context information.
  • To alleviate this effect, the authors prepare another dataset version split on the submission level, namely RR-submission.
  • In RR-submission, multiple review-rebuttal passage pairs of the same submission are kept in the same set (see the sketch below)
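To make the context-leakage concern concrete, the following hedged Python sketch contrasts the two splits: RR-passage shuffles passage pairs directly, while RR-submission groups pairs by submission first. The `submission_id` field and data layout are illustrative assumptions.

```python
import random
from collections import defaultdict

def split_by_submission(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """RR-submission-style split: group review-rebuttal passage pairs by
    submission, then split submissions 8:1:1 so that all pairs of one
    submission land in the same set."""
    by_sub = defaultdict(list)
    for pair in pairs:
        by_sub[pair["submission_id"]].append(pair)  # illustrative field name
    subs = sorted(by_sub)
    random.Random(seed).shuffle(subs)
    n_train = int(ratios[0] * len(subs))
    n_dev = int(ratios[1] * len(subs))
    buckets = (subs[:n_train], subs[n_train:n_train + n_dev], subs[n_train + n_dev:])
    return [sum((by_sub[s] for s in b), []) for b in buckets]

# An RR-passage-style split would instead shuffle `pairs` directly, so pairs
# from the same submission can end up in different sets and share context.
pairs = [{"submission_id": i // 3, "text": f"pair-{i}"} for i in range(30)]
train, dev, test = split_by_submission(pairs)
print(len(train), len(dev), len(test))  # 24 3 3
```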
Results
  • Table 3 shows the performance on both subtasks as well as the overall extraction performance on the RR-passage dataset, where the authors compare the proposed multi-task model (MT-H-LSTM-CRF) with several strong baselines.
  • PL-H-LSTM-CRF is a pipeline approach that trains argument mining and sentence pairing modules independently and pipes them together to extract argument pairs (see the sketch after this list).
  • The authors observe that the proposed MT-H-LSTM-CRF consistently outperforms the baseline models.
  • It performs slightly worse on RR-submission than on RR-passage, plausibly because no context information is shared between different passage pairs under the submission-level split.
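A hedged sketch of how the two subtask outputs can be combined into argument pairs, as the pipeline described above must do: link a review span to a reply span if any of their sentences were predicted as paired. This decision rule is an illustrative assumption, not necessarily the authors' exact one.

```python
def extract_argument_pairs(review_spans, reply_spans, paired_sentences):
    """Combine subtask outputs into argument pairs.

    review_spans / reply_spans: (start, end) sentence spans, end-inclusive,
    from the argument mining subtask.
    paired_sentences: set of (review_idx, reply_idx) tuples predicted as
    aligned by the sentence pairing subtask.
    """
    pairs = []
    for rv in review_spans:
        for rp in reply_spans:
            # Link two spans if any of their sentences are paired
            # (voting or score thresholds would be plausible alternatives).
            if any((i, j) in paired_sentences
                   for i in range(rv[0], rv[1] + 1)
                   for j in range(rp[0], rp[1] + 1)):
                pairs.append((rv, rp))
    return pairs

print(extract_argument_pairs([(0, 2), (4, 4)], [(0, 1)], {(1, 0)}))
# [((0, 2), (0, 1))]
```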
Conclusion
  • The authors introduce a new task of extracting argument pairs from review and rebuttal passages, which explores a new domain for the argument mining research field.
  • The authors propose a multi-task learning approach based on hierarchical LSTM networks to tackle this problem.
  • The authors will explore the latent information between peer reviews and author responses to improve argument pair extraction.
Summary
  • Introduction:

    Argument mining is an important research field that has attracted growing attention in recent years (Lawrence and Reed, 2019).
  • It is applied in real-world scenarios such as legal documents (Mochales and Moens, 2011; Poudyal, 2015), online debates (Boltuzic and Snajder, 2015; Abbott et al., 2016), persuasive essays (Stab and Gurevych, 2014; Persing and Ng, 2016), etc.
  • The rebuttal part, which is often ignored, is an indispensable and interesting part of the peer review process
  • Objectives:

    The authors aim to automatically extract the argument pairs from reviews and rebuttals by studying review-rebuttal pairs together in this work.
  • For the argument mining subtask, a linear-chain CRF scores a label sequence $y = (y_1, \dots, y_n)$ for the sentence sequence $s$ as
    $$\mathrm{Score}(s, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} F_{\theta_1}(s, y_i),$$
    where $A_{y_i, y_{i+1}}$ represents the transition parameters between two labels, and $F_{\theta_1}(s, y_i)$ indicates the score of $y_i$ obtained from the neural network encoder parameterized by $\theta_1$; $y_0$ and $y_{n+1}$ represent the "START" and "END" labels, respectively (see the numeric sketch after this summary).
  • The authors aim to minimize the negative log-likelihood over the dataset $D_1$, i.e. $\mathcal{L}_1 = -\sum_{(s, y) \in D_1} \log p(y \mid s)$ with $p(y \mid s) \propto \exp\big(\mathrm{Score}(s, y)\big)$.
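For concreteness, here is a minimal numpy sketch of the linear-chain CRF score defined in the Objectives above. Padding the label sequence with START and END through an enlarged transition matrix is common practice; the label indexing and matrix sizes are illustrative assumptions.

```python
import numpy as np

def crf_score(emissions, transitions, labels):
    """Score(s, y) = sum_{i=0..n} A[y_i, y_{i+1}] + sum_{i=1..n} F(s, y_i).

    emissions: (n, num_labels) encoder scores F, one row per sentence.
    transitions: (num_labels + 2, num_labels + 2) matrix A; index 0 is START,
    the last index is END, and real labels occupy indices 1..num_labels.
    labels: gold label indices in 0..num_labels-1 (emission indexing).
    """
    end = transitions.shape[0] - 1
    y = [0] + [l + 1 for l in labels] + [end]          # pad with START/END
    trans = sum(transitions[y[i], y[i + 1]] for i in range(len(y) - 1))
    emit = sum(emissions[i, l] for i, l in enumerate(labels))
    return trans + emit

rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 5))    # 4 sentences, 5 IOBES labels
transitions = rng.normal(size=(7, 7))  # 5 labels + START + END
print(crf_score(emissions, transitions, labels=[0, 1, 1, 2]))  # e.g. B I I E
```

Training would then minimize the negative log-likelihood above, with the partition function computed by the usual forward algorithm.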
Tables
  • Table1: An example of review-rebuttal passage pair. REVIEW-1 and REPLY-1 indicate the first argument in the review and its reply in the rebuttal
  • Table2: Overall statistics of RR dataset
  • Table3: Main results on RR-passage dataset
  • Table4: Performance on argument mining task
  • Table5: Argument pair extraction results under different negative samples (NS)
  • Table6: Main results on RR-submission dataset
  • Table7: Performance on RR-passage to compare across different numbers of T-LSTM layers
  • Table8: Performance on RR-passage to compare across different numbers of S-LSTM layers
  • Table9: Performance of MT-H-LSTM-CRF(LT =1,LS=1) when using different numbers of linear layers for sentence pairing task
  • Table10: Performance of MT-H-LSTM-CRF when applying different weight ratios between two losses
  • Table11: Performance of MT-H-LSTM-CRF when using different operations to obtain pair representations
Related work
  • Argument Mining. There has been an increasing number of works in the computational argumentation research field in recent years, such as argument mining (Shnarch et al., 2018; Trautmann et al., 2020), argument relation detection (Rocha et al., 2018; Hou and Jochim, 2017), argument quality assessment (Wachsmuth et al., 2017; Gleize et al., 2019; Toledo et al., 2019; Gretz et al., 2019), and argument generation (Hua and Wang, 2018; Hua et al., 2019a; Schiller et al., 2020).

    Stab and Gurevych (2014) and Persing and Ng (2016) both propose pipeline approaches to identify argumentative discourse structures in persuasive essays, which mainly include two steps: extracting argument components and identifying relations. In terms of the task and the dataset, we intend to extract argument pairs from two passages simultaneously, while they focus on a single passage. In addition, we present a multi-task framework to tackle our proposed task instead of using a pipeline approach.

    Swanson et al. (2015) aim to extract arguments from a large corpus of posts from online forums. They frame their problem as two separate subtasks: argument extraction and argument facet similarity. Chakrabarty et al. (2019) focus on argument mining in online persuasive discussion forums based on the CMV dataset. Compared to both of their datasets, our dataset's contents are more academic and are strongly inclined to attack the other argument. Scale-wise, our dataset consists of 156K sentences, while the dataset of Chakrabarty et al. (2019) has 2,756 sentences.
Findings
  • The performance of the sentence pairing task is sacrificed for a better overall extraction result, as the best model is selected based on the F1 score of argument pair extraction
Study subjects and analysis
review-rebuttal passage pairs: 4764
We create RR by first collecting all useful information related to the submissions from ICLR in recent years. In total, we select a corpus of 4,764 review-rebuttal passage pairs from over 22K reviews and author responses that are collected from the website. Then we annotate all these review and rebuttal passage pairs following a set of carefully defined annotation guidelines

pairs: 4764
First, instead of only looking at peer reviews, we propose a task to study the reviews and rebuttals jointly. Another difference lies in the dataset: ours is fully annotated with 4,764 pairs of reviews and rebuttals, while only 400 reviews are annotated in AMPERE. As mentioned, few works touch on the rebuttal part, which is the other important element of the peer review process.

review-rebuttal passage pairs: 4764
In total, 22,127 reviews and rebuttals are collected. After excluding those reviews receiving no reply, we extract 4,764 review-rebuttal passage pairs for data annotation. For those reviews with multiple rounds of rebuttals, only the first rebuttal is selected

passage pairs: 4764
In this work, we concentrate on argument pair extraction from review-rebuttal passage pairs. In total, 5 professional data annotators are hired to annotate these 4,764 passage pairs based on the unified guidelines described below. Firstly, for the argument mining part, the annotators segment reviews and rebuttals by labeling the arguments

pairs: 252
We label 40,831 arguments in total, with 23,150 arguments from reviews and 17,681 arguments from rebuttals. We assess the annotation quality based on random sampling from the full dataset: 5% of the original review-rebuttal passage pairs (252 pairs) are checked, and the labels of 39 out of 2,417 sampled arguments are missing or have marginal errors. The annotation accuracy of the RR dataset therefore reaches (2,417 − 39)/2,417 ≈ 98.4%.

pairs: 3119
When there are clear signals (e.g., line breaks or indices) separating the arguments, a passage pair is considered easy. In total, 65.5% of the data (3,119 pairs) are classified as easy. A pair is marked difficult when the review and rebuttal passages do not have a clear structure (see the sketch below).
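Purely to illustrate what "clear signals" might look like, here is a hedged sketch of a heuristic that flags a passage as easy when most of its paragraphs open with an index marker or bullet; the regex and threshold are assumptions, not the annotation team's actual criteria.

```python
import re

# Hypothetical "easy" detector: arguments already separated by line breaks
# and leading indices such as "1.", "2)", "Q3:", or bullets.
INDEX_RE = re.compile(r"^\s*(?:\d+[.):]|[-*•]|Q\d+[:.]?)", re.IGNORECASE)

def looks_easy(passage, threshold=0.5):
    paragraphs = [p for p in passage.split("\n") if p.strip()]
    if len(paragraphs) < 2:
        return False
    indexed = sum(bool(INDEX_RE.match(p)) for p in paragraphs)
    return indexed / len(paragraphs) >= threshold

review = "1. The novelty is limited.\n2. Baselines are missing.\n3. Typos in Sec 3."
print(looks_easy(review))  # True
```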

negative samples: 5
The number of linear layers for the binary classification task is 3. We randomly select 5 negative samples for the classification task during training in the main experiments. We use Adam (Kingma and Ba, 2014) with an initial learning rate of 0.01 and update parameters with a batch size of 10 (see the sketch below).
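The reported settings translate into a short setup like the hedged PyTorch sketch below. The classifier head is a placeholder (input and hidden sizes are illustrative), and the way non-aligned sentence pairs are drawn is an assumption consistent with the description above.

```python
import random
import torch

def sample_negatives(aligned_pairs, n_review, n_reply, k=5, seed=0):
    """Draw k random non-aligned (review_sent, reply_sent) index pairs for
    the binary sentence-pairing task."""
    aligned = set(aligned_pairs)
    assert k <= n_review * n_reply - len(aligned)
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < k:
        cand = (rng.randrange(n_review), rng.randrange(n_reply))
        if cand not in aligned:
            negatives.add(cand)
    return sorted(negatives)

# Placeholder pairing head with 3 linear layers, matching the reported
# setting; the sizes are illustrative assumptions.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # reported initial lr
batch_size = 10                                            # reported batch size

print(sample_negatives([(0, 0), (1, 2)], n_review=6, n_reply=4, k=5))
```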

non-aligned argument pairs: 5
Generally speaking, as the ratio becomes more balanced, the overall extraction performance improves and the training is more efficient. In particular, when 5 non-aligned argument pairs are selected (5Argu), our model achieves the best F1 score (29.81) for argument pair extraction.

Reference
  • Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. 2016. Internet argument corpus 2.0: An SQL schema for dialogic social media and the corpora to go with it. In Proceedings of LREC.
  • Filip Boltuzic and Jan Snajder. 2015. Identifying prominent arguments in online debates using semantic textual similarity. In Proceedings of the 2nd Workshop on Argumentation Mining.
  • Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathy McKeown, and Alyssa Hwang. 2019. AMPERSAND: Argument mining for PERSuAsive oNline discussions. In Proceedings of EMNLP-IJCNLP.
  • Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. TACL.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
  • Laura J. Falkenberg and Patricia A. Soranno. 2018. Reviewing reviews: An evaluation of peer reviews of journal article submissions. Limnology and Oceanography Bulletin.
  • Yang Gao, Steffen Eger, Ilia Kuznetsov, Iryna Gurevych, and Yusuke Miyao. 2019. Does my rebuttal matter? Insights from a major NLP conference. In Proceedings of NAACL.
  • Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. 2019. Are you convinced? Choosing the more convincing evidence with a Siamese network. In Proceedings of ACL.
  • Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, and Noam Slonim. 2019. A large-scale dataset for argument quality ranking: Construction and analysis. In Proceedings of AAAI.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Yufang Hou and Charles Jochim. 2017. Argument relation classification using a joint inference model. In Proceedings of the 4th Workshop on Argument Mining.
  • Xinyu Hua, Zhe Hu, and Lu Wang. 2019a. Argument generation with retrieval, planning, and realization. In Proceedings of ACL.
  • Xinyu Hua, Mitko Nikolov, Nikhil Badugu, and Lu Wang. 2019b. Argument mining for understanding peer reviews. In Proceedings of NAACL.
  • Xinyu Hua and Lu Wang. 2018. Neural argument generation augmented with externally retrieved evidence. In Proceedings of ACL.
  • Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, and Roy Schwartz. 2018. A dataset of peer reviews (PeerRead): Collection, insights and NLP applications. In Proceedings of NAACL.
  • Jacalyn Kelly, Tara Sadeghieh, and Khosrow Adeli. 2014. Peer review in scientific publications: Benefits, critiques, & a survival guide. EJIFCC.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Technical report.
  • John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL.
  • John Lawrence and Chris Reed. 2019. Argument mining: A survey. Computational Linguistics, 45(4).
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.
  • Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law.
  • Isaac Persing and Vincent Ng. 2016. End-to-end argumentation mining in student essays. In Proceedings of NAACL.
  • Prakash Poudyal. 2015. A machine learning approach to argument mining in legal documents. In AI Approaches to the Complexity of Legal Systems. Springer.
  • Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora.
  • Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of CoNLL.
  • Gil Rocha, Christian Stab, Henrique Lopes Cardoso, and Iryna Gurevych. 2018. Cross-lingual argumentative relation identification: From English to Portuguese. In Proceedings of the 5th Workshop on Argument Mining.
  • Benjamin Schiller, Johannes Daxenberger, and Iryna Gurevych. 2020. Aspect-controlled neural argument generation. CoRR.
  • Eyal Shnarch, Carlos Alzate, Lena Dankin, Martin Gleize, and Noam Slonim. 2018. Will it blend? Blending weak and strong labeled data in a neural network for argumentation mining. In Proceedings of ACL.
  • Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of EMNLP.
  • Reid Swanson, Brian Ecker, and Marilyn Walker. 2015. Argument mining: Extracting arguments from online dialogue. In Proceedings of SIGDIAL.
  • Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019. Automatic argument quality assessment: New datasets and methods. In Proceedings of EMNLP.
  • Dietrich Trautmann, Johannes Daxenberger, Christian Stab, Hinrich Schütze, and Iryna Gurevych. 2020. Fine-grained argument unit recognition and classification. In Proceedings of AAAI.
  • Henning Wachsmuth, Nona Naderi, Ivan Habernal, Yufang Hou, Graeme Hirst, Iryna Gurevych, and Benno Stein. 2017. Argumentation quality assessment: Theory vs. practice. In Proceedings of ACL.
  • Han Xiao. 2018. bert-as-service. https://github.com/hanxiao/bert-as-service.
  • Wenting Xiong and Diane Litman. 2011. Automatically predicting peer-review helpfulness. In Proceedings of ACL.