Transformer-based Argument Mining for Healthcare Applications

Tobias Mayer

ECAI, pp. 2108-2115, 2020.

DOI: https://doi.org/10.3233/FAIA200334

Abstract:

Argument(ation) Mining (AM) typically aims at identifying argumentative components in text and predicting the relations among them. Evidence-based decision making in the healthcare domain aims to support clinicians in their deliberation process to establish the best course of action for the case under evaluation. Although the reaso...

Introduction
  • There is an increasing interest in the development of intelligent systems able to support and ease clinicians’ everyday activities.
  • These systems apply to clinical trials, clinical guidelines, and electronic health records, and their solutions range from the automated detection of PICO elements [19] in health records to evidence-based reasoning for decision making [18, 8, 24, 35].
  • Because AM can automatically detect in text the argumentative structures that underlie evidence-based reasoning applications, it represents a potentially valuable contribution to the healthcare domain.
Highlights
  • In the healthcare domain, there is an increasing interest in the development of intelligent systems able to support and ease clinicians’ everyday activities
  • We present a complete AM pipeline for clinical trials relying on deep bidirectional transformers combined with different neural networks, i.e., Long Short-Term Memory (LSTM) networks, Gated Recurrent Unit (GRU) networks, and Conditional Random Fields (CRFs)
  • For the dynamic embeddings coming from language models (LM), the ones trained on the medical domain corpus, i.e., FlairPM and Embeddings from Language Models (ELMo), perform similarly, with a macro F1-score of .68 on the neoplasm test set
  • Evidence scores are higher than claim scores, leading to the conclusion that claims are more diverse than evidence
  • To support clinicians in decision making or in automatically filling evidence tables for systematic reviews in evidence-based medicine, we propose a complete argument mining pipeline for the healthcare domain
  • Example 2 [True acupuncture was associated with 0.8 fewer hot flashes per day than sham at 6 weeks,]1 [but the difference did not reach statistical significance (95% CI, -0.7 to 2.4; P = .3).]2
  • We show that state-of-the-art argument mining systems are unable to satisfactorily tackle the two tasks of argument component detection and relation prediction on this kind of text, given its peculiar features
Methods
  • The Tree-LSTM based end-to-end system performed the worst, with an F1-score of .37.
  • This can be explained by the positional encoding being more relevant in the persuasive essay dataset than in the RCT dataset.
  • There, components are likely to link to a neighboring component, whereas in the RCT dataset the position of a component plays only a partial role, and the distance in the dependency tree is not a meaningful feature.
  • The main problem here is that it learns several objectives at once (link prediction, relation classification, and type classification for the source and target components), where the last classification step is already covered by the sequence tagger and is unnecessary at this step.
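
The division of labor described above (a sequence tagger assigns component types, and a later step only has to decide relations between components) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the BIO tag scheme, the label names `Claim` and `Evidence` as tag suffixes, and the helper names `bio_to_spans` and `candidate_pairs` are assumptions introduced here.

```python
from itertools import permutations


def bio_to_spans(tags):
    """Return (label, start, end) spans from BIO tags; `end` is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An open span continues only on an "I-" tag carrying the same label.
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        # Open a span on "B-", or on a stray "I-" when no span is open.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


def candidate_pairs(spans):
    """Enumerate ordered (source, target) index pairs for relation classification."""
    return list(permutations(range(len(spans)), 2))


# Hypothetical tagger output for a seven-token sentence.
tags = ["O", "B-Evidence", "I-Evidence", "O", "B-Claim", "I-Claim", "I-Claim"]
spans = bio_to_spans(tags)
pairs = candidate_pairs(spans)
```

Because the component types already come from the tagger, a relation classifier fed with `pairs` only needs to predict the relation for each ordered pair, which is the redundancy the last bullet points out.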
Results
  • The authors present and discuss the empirical results of the AM pipeline for RCTs.
  • Sequence tagging: Table 2 reports the results for the best-performing RNN models and the best-performing embedding combinations.
  • For the dynamic embeddings coming from LMs, the ones trained on the medical domain corpus, i.e., FlairPM and ELMo, perform similarly, with a macro F1-score of .68 on the neoplasm test set.
  • They have the edge over non-specialized LMs such as BERT, with .66, and FlairMulti, with .63 macro F1-score.
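
Since the results are reported as macro and micro F1-scores, a small worked example may help to show how the two averages differ. The sketch below is a plain-Python illustration under assumed toy labels; it is not the evaluation code used in the paper, and the function name `f1_scores` is hypothetical.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (per-label F1, macro F1, micro F1) for single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it does not belong
            fn[g] += 1  # missed the gold label g

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    # Macro F1: unweighted mean of the per-label scores.
    macro = sum(per_label.values()) / len(labels) if labels else 0.0
    # Micro F1: one score computed from the pooled counts.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_label, macro, micro


# Toy token labels, invented for illustration only.
gold = ["Claim", "Claim", "Evidence", "Evidence", "O", "O"]
pred = ["Claim", "Evidence", "Evidence", "Evidence", "O", "Claim"]
per_label, macro, micro = f1_scores(gold, pred)
```

Macro F1 weights every label equally, so a weak score on a rare or diverse class such as claims lowers it more than it lowers micro F1, which pools all decisions.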
Conclusion
  • To support clinicians in decision making or in automatically filling evidence tables for systematic reviews in evidence-based medicine, the authors propose a complete argument mining pipeline for the healthcare domain.
  • To this aim, the authors built a novel corpus of healthcare texts (i.e., RCT abstracts) from the MEDLINE database, annotated with argumentative components and relations.
  • In an extensive evaluation on this newly annotated AM dataset of RCTs, the authors investigate the use of different neural transformer architectures and pre-trained models in the pipeline, showing improved results compared with standard baselines and state-of-the-art AM systems.
Tables
  • Table 1: Statistics of the extended dataset, showing the numbers of evidence, claims, major claims, and supporting and attacking relations for each disease-based subset
  • Table 2: Results of the multi-class sequence tagging task, given in micro F1 (f1) and macro F1 (F1). The binary F1 is reported as C-F1 for claims and as E-F1 for evidence. Best scores in each column are marked in bold; significance was tested with a two-sided Wilcoxon signed-rank test
  • Table 3: Results of the relation classification task, given in macro F1-score
Related work
  • One of the latest advances in artificial argumentation [2] is so-called Argument(ation) Mining [30, 22, 7]. Argument mining consists of two standard tasks: (i) the identification of arguments within the text, which may be further split into the detection of argument components (e.g., claims, evidence) and the identification of their textual boundaries; and (ii) the prediction of the relations holding between the arguments identified in the first stage. These relations are used to build argument graphs, in which the relations connecting the retrieved argumentative components correspond to the edges. Different methods have been employed to address these tasks, from Support Vector Machines (SVMs) and Naïve Bayes classifiers to Neural Networks (NNs). AM methods have been applied to heterogeneous types of textual documents, e.g., persuasive essays [38], scientific articles [39], Wikipedia articles [4], political speeches and debates [27], and peer reviews [17]. However, only a few approaches [42, 14, 25, 26] have focused on automatically detecting argumentative structures in textual documents from the medical domain, such as clinical trials, clinical guidelines, and electronic health records.
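
The argument graph mentioned above can be pictured as a small data structure whose nodes are the detected components and whose edges are the predicted relations. The following is a minimal sketch under stated assumptions: the class name `ArgumentGraph`, its methods, the component identifiers, and the sample relation are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class ArgumentGraph:
    """Components are nodes; typed relations are directed edges."""
    components: dict = field(default_factory=dict)  # id -> (type, text)
    edges: list = field(default_factory=list)       # (source, target, relation)

    def add_component(self, cid, ctype, text):
        self.components[cid] = (ctype, text)

    def add_relation(self, source, target, relation):
        # Both endpoints must be known components.
        for cid in (source, target):
            if cid not in self.components:
                raise KeyError(f"unknown component: {cid}")
        self.edges.append((source, target, relation))

    def incoming(self, target, relation):
        """Ids of components linked to `target` by the given relation."""
        return [s for s, t, r in self.edges if t == target and r == relation]


# Illustrative content only; the relation label is an assumption.
graph = ArgumentGraph()
graph.add_component("E1", "evidence", "Mean weight changes differed between groups.")
graph.add_component("C1", "claim", "ATP has beneficial effects on weight.")
graph.add_relation("E1", "C1", "support")
supporters = graph.incoming("C1", "support")
```

Keeping the relation type on the edge lets the same structure hold both supporting and attacking relations, the two relation kinds counted in Table 1.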
Funding
  • This work is partly funded by the French government PIA program under its IDEX UCA JEDI project (ANR-15-IDEX-0001)
  • This work has been supported by the French government, through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002
Study subjects and analysis
patients: 58
Example 1 Extracellular adenosine 5’-triphosphate (ATP) is involved in the regulation of a variety of biologic processes, including neurotransmission, muscle contraction, and liver glucose metabolism, via purinergic receptors. [In nonrandomized studies involving patients with different tumor types including non-small-cell lung cancer (NSCLC), ATP infusion appeared to inhibit loss of weight and deterioration of quality of life (QOL) and performance status]. We conducted a randomized clinical trial to evaluate the effects of ATP in patients with advanced NSCLC (stage IIIB or IV). [...] Fifty-eight patients were randomly assigned to receive either 10 intravenous 30-hour ATP infusions, with the infusions given at 2- to 4-week intervals, or no ATP. Outcome parameters were assessed every 4 weeks until 28 weeks

patients: 28
Between-group differences were tested for statistical significance by use of repeated-measures analysis, and reported P values are two-sided. Twenty-eight patients were allocated to receive ATP treatment and 30 received no ATP. [Mean weight changes per 4-week period were -1.0 kg (95% confidence interval [CI] = -1.5 to -0.5) in the control group and 0.2 kg (95% CI = -0.2 to +0.6) in the ATP group (P = .002)]1. [Serum albumin concentration declined by -1.2 g/L (95% CI = -2.0 to -0.4) per 4 weeks in the control group but remained stable (0.0 g/L; 95% CI = -0.3 to +0.3) in the ATP group (P = .006)]2. [Elbow flexor muscle strength declined by -5.5% (95% CI = -9.6% to -1.4%) per 4 weeks in the control group but remained stable (0.0%; 95% CI = -1.4% to +1.4%) in the ATP group (P = .01)]3. [A similar pattern was observed for knee extensor muscles (P = .02)]4. [The effects of ATP on body weight, muscle strength, and albumin concentration were especially marked in cachectic patients (P = .0002, P = .0001, and P = .0001, respectively, for ATP versus no ATP)]5. [...] This randomized trial demonstrates that [ATP has beneficial effects on weight, muscle strength, and QOL in patients with advanced NSCLC]1.

Reference
  • Alan Akbik, Duncan Blythe, and Roland Vollgraf, ‘Contextual string embeddings for sequence labeling’, in Proc. of COLING 2018, pp. 1638–1649, (2018).
  • Katie Atkinson, Pietro Baroni, Massimiliano Giacomin, Anthony Hunter, Henry Prakken, Chris Reed, Guillermo Ricardo Simari, Matthias Thimm, and Serena Villata, ‘Towards artificial argumentation’, AI Magazine, 38(3), 25–36, (2017).
  • Ivana Balazevic, Carl Allen, and Timothy Hospedales, ‘TuckER: Tensor factorization for knowledge graph completion’, in Proc. of EMNLP-IJCNLP 2019, pp. 5185–5194, (2019).
  • Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim, ‘Stance classification of context-dependent claims’, in Proc. of EACL 2017, pp. 251–261, (2017).
  • Iz Beltagy, Kyle Lo, and Arman Cohan, ‘SciBERT: A pretrained language model for scientific text’, in Proc. of EMNLP-IJCNLP 2019, pp. 3615–3620, (2019).
  • Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko, ‘Translating embeddings for modeling multi-relational data’, in Proc. of NIPS 2013, pp. 2787–2795, (2013).
  • Elena Cabrio and Serena Villata, ‘Five years of argument mining: a data-driven analysis’, in Proc. of IJCAI 2018, pp. 5427–5433, (2018).
  • Robert Craven, Francesca Toni, Cristian Cadar, Adrian Hadad, and Matthew Williams, ‘Efficient argumentation for medical decision-making’, in Proc. of KR 2012, pp. 598–602, (2012).
  • Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel, ‘Convolutional 2D knowledge graph embeddings’, in Proc. of AAAI 2018, pp. 1811–1818, (February 2018).
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, ‘BERT: Pre-training of deep bidirectional transformers for language understanding’, in Proc. of NAACL-HLT 2019, pp. 4171–4186, (2019).
  • Steffen Eger, Johannes Daxenberger, and Iryna Gurevych, ‘Neural end-to-end learning for computational argumentation mining’, in Proc. of ACL 2017, pp. 11–22, (2017).
  • Andrea Galassi, Marco Lippi, and Paolo Torroni, ‘Argumentative link prediction using residual networks and multi-objective learning’, in Proc. of ArgMining 2018 workshop, pp. 1–10, (2018).
  • Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov, ‘Learning word vectors for 157 languages’, in Proc. of LREC 2018, pp. 3483–3487, (2018).
  • Nancy Green, ‘Argumentation for scientific claims in a biomedical research article’, in Proc. of ArgNLP 2014 workshop, (2014).
  • Nancy Green, ‘Annotating evidence-based argumentation in biomedical text’, IEEE BIBM 2015, 922–929, (2015).
  • Benjamin Heinzerling and Michael Strube, ‘Bpemb: Tokenization-free pre-trained subword embeddings in 275 languages’, in Proc. of LREC 2018, pp. 2989–2993, (2018).
  • Xinyu Hua, Mitko Nikolov, Nikhil Badugu, and Lu Wang, ‘Argument mining for understanding peer reviews’, in Proc. of NAACL-HLT 2019, pp. 2131–2137, (2019).
  • Anthony Hunter and Matthew Williams, ‘Aggregating evidence about the positive and negative effects of treatments’, Artificial Intelligence in Medicine, 56(3), 173–190, (2012).
  • Di Jin and Peter Szolovits, ‘PICO element detection in medical text via long short-term memory neural networks’, in Proc. of BioNLP 2018 workshop, pp. 67–75, (2018).
  • Alexandros Komninos and Suresh Manandhar, ‘Dependency based embeddings for sentence classification tasks’, in Proc. of NAACL-HLT 2016, pp. 1490–1500, (2016).
  • Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang, ‘BioBERT: a pre-trained biomedical language representation model for biomedical text mining’, Bioinformatics, (2019).
  • Marco Lippi and Paolo Torroni, ‘Argumentation mining: State of the art and emerging trends’, ACM Trans. Internet Techn., 16(2), 10, (2016).
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, ‘Roberta: A robustly optimized BERT pretraining approach’, CoRR, abs/1907.11692, (2019).
  • Luca Longo and Lucy Hederman, ‘Argumentation theory for decision support in health-care: A comparison with machine learning’, in Proc. of BHI 2013, pp. 168–180, (2013).
  • Tobias Mayer, Elena Cabrio, Marco Lippi, Paolo Torroni, and Serena Villata, ‘Argument mining on clinical trials’, in Proc. of COMMA 2018, pp. 137–148, (2018).
  • Tobias Mayer, Elena Cabrio, and Serena Villata, ‘ACTA a tool for argumentative clinical trial analysis’, in Proc. of IJCAI 2019, pp. 6551–6553, (2019).
  • Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata, ‘Never retreat, never retract: Argumentation analysis for political speeches’, in Proc. of AAAI 2018, pp. 4889–4896, (2018).
  • Makoto Miwa and Mohit Bansal, ‘End-to-end relation extraction using lstms on sequences and tree structures’, in Proc. of ACL 2016, pp. 1105–1116, (2016).
  • Vlad Niculae, Joonsuk Park, and Claire Cardie, ‘Argument mining with structured SVMs and RNNs’, in Proc. of ACL 2017, pp. 985–995, (2017).
  • Andreas Peldszus and Manfred Stede, ‘From argument diagrams to argumentation mining in texts: A survey’, Int. J. Cogn. Inform. Nat. Intell., 7(1), 1–31, (2013).
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning, ‘Glove: Global vectors for word representation’, in Proc. of EMNLP 2014, pp. 1532–1543, (2014).
  • Isaac Persing and Vincent Ng, ‘End-to-end argumentation mining in student essays’, in Proc. of NAACL-HLT 2016, pp. 1384–1394, (2016).
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, ‘Deep contextualized word representations’, in Proc. of NAACL-HLT 2018, pp. 2227–2237, (2018).
  • Peter Potash, Alexey Romanov, and Anna Rumshisky, ‘Here’s my point: Joint pointer architecture for argument mining’, in Proc. of EMNLP 2017, pp. 1364–1373, (2017).
  • Malik Al Qassas, Daniela Fogli, Massimiliano Giacomin, and Giovanni Guida, ‘Analysis of clinical discussions based on argumentation schemes’, Procedia Computer Science, 64, 282–289, (2015).
  • Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych, ‘Classification and clustering of arguments with contextualized word embeddings’, in Proc. of ACL 2019, pp. 567–578, (2019).
  • Anders Søgaard and Yoav Goldberg, ‘Deep multi-task learning with low level tasks supervised at lower layers’, in Proc. of ACL 2016, pp. 231–235, (2016).
  • Christian Stab and Iryna Gurevych, ‘Parsing argumentation structures in persuasive essays’, Comput. Linguist., 43(3), 619–659, (2017).
  • Simone Teufel, Advaith Siddharthan, and Colin Batchelor, ‘Towards domain-independent argumentative zoning: Evidence from chemistry and computational linguistics’, in Proc. of EMNLP 2009, pp. 1493–1502, (2009).
  • Antonio Trenta, Anthony Hunter, and Sebastian Riedel, ‘Extraction of evidence tables from abstracts of randomized clinical trials using a maximum entropy classifier and global constraints’, CoRR, abs/1509.05209, (2015).
  • Theo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard, ‘Complex embeddings for simple link prediction’, in Proc. of ICML 2016, pp. 2071–2080, (2016).
  • Jure Zabkar, Martin Mozina, Jerneja Videcnik, and Ivan Bratko, ‘Argument based machine learning in a medical domain’, in Proc. of COMMA 2006, pp. 59–70, (2006).
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi, ‘SWAG: A large-scale adversarial dataset for grounded commonsense inference’, in Proc. of EMNLP 2018, pp. 93–104, (2018).