AnswerFact: Fact Checking in Product Question Answering

EMNLP 2020.

Keywords:
e-commerce, QA setting, product-related community question answering, product information, question answering forum

Abstract:

Product-related question answering platforms nowadays are widely employed in many E-commerce sites, providing a convenient way for potential customers to address their concerns during online shopping. However, the misinformation in the answers on those platforms poses unprecedented challenges for users to obtain reliable and truthful produ…

Introduction
  • The ability to ask questions during online shopping is found to be a key factor for customers to make purchase decisions (Smith and Anderson, 2016).
  • The user-provided answers on PQA platforms vary significantly in quality (Zhang et al., 2020b) and, more seriously, in veracity, due to the lack of systematic quality control (Mihaylova et al., 2018).
  • Such untruthful answers may be attributed to multiple factors, such as misunderstanding of the question, improper wording during answer composition, and even intentional malicious attacks from competitors (Carmel et al., 2018).
  • To predict the veracity of an answer in the QA setting, it is insufficient to consider the answer alone, since the question text provides the context needed to interpret it.
Highlights
  • The ability to ask questions during online shopping is found to be a key factor for customers to make purchase decisions (Smith and Anderson, 2016)
  • To tackle the aforementioned issues, we introduce a large-scale fact checking dataset called AnswerFact for investigating the answer veracity in product question answering forums
  • We conjecture that DeClarE treats each claim-evidence pair as one training instance without considering the relations between evidence sentences
  • We investigate the fact checking problem in product question answering forums, aiming to predict the answer veracity so as to provide a more reliable online shopping environment
  • We introduce AnswerFact, an evidence-based fact checking dataset in the QA setting
  • Extensive experiments show that our proposed method outperforms various established baselines
Methods
  • 5.1 Experimental Setup

    Dataset: As introduced in Section 3, AnswerFact has 60,864 QA pairs in total.
Results
  • An exception is the DeClarE model, which only achieves performance comparable to the CNN-claim method.
  • The authors conjecture that DeClarE treats each claim-evidence pair as one training instance without considering the relations between evidence sentences.
  • As a result, the model can be misled by conflicting evidence sentences and make nearly random predictions.
  • This further indicates the necessity of selecting and ranking the evidence sentences by their importance to the prediction (an illustrative per-pair sketch follows this list)
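To make this conjecture concrete, the following is an illustrative per-pair setup in the spirit of treating every claim-evidence pair as an independent training instance (a simplified stand-in, not DeClarE's actual architecture): because each pair is scored in isolation and the scores are averaged, contradictory evidence sentences can largely cancel each other out.

    import torch
    import torch.nn as nn

    class PerPairVeracityScorer(nn.Module):
        """Illustrative per-pair scheme: every (claim, evidence) pair is scored
        independently and the per-pair logits are simply averaged, so relations
        such as agreement or conflict between evidence sentences are never modeled."""

        def __init__(self, dim: int, num_classes: int = 3):
            super().__init__()
            self.scorer = nn.Linear(2 * dim, num_classes)

        def forward(self, claim: torch.Tensor, evidence: torch.Tensor) -> torch.Tensor:
            # claim: (dim,) embedding of the answer; evidence: (k, dim) sentence embeddings
            k = evidence.size(0)
            pairs = torch.cat([claim.unsqueeze(0).expand(k, -1), evidence], dim=-1)
            per_pair_logits = self.scorer(pairs)   # (k, num_classes)
            # conflicting evidence pushes the average toward an uninformative prediction
            return per_pair_logits.mean(dim=0)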
Conclusion
  • The authors conduct a detailed analysis of the proposed evidence ranking module, which plays an important role in finding more helpful and reliable evidence sentences for the subsequent veracity prediction.

    Impact of Evidence Ranking Strategies

    To investigate the effectiveness of the proposed evidence ranking strategy, the authors substitute it with two possible alternatives and present the results in Table 5.
  • This is likely because it is difficult for the model to implicitly learn the relations among the claim-evidence pairs given only the veracity label
  • The authors alleviate this issue by first conducting agreement matching among the evidence sentences and then computing a combined evidence embedding to assist the prediction (see the sketch below). In this paper, the authors investigate the fact checking problem in product question answering forums, aiming to predict the answer veracity so as to provide a more reliable online shopping environment.
  • Extensive experiments show that the proposed method outperforms various established baselines
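A minimal sketch of the two ideas just described: evidence sentences are scored against the claim (the attention weights rank them by importance) and matched against each other (agreement matching) before being pooled into a combined evidence embedding. The sentence embeddings are assumed to be pre-computed by some sentence encoder, and the bilinear scorers below are illustrative choices, not the authors' released AVER implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EvidenceAggregator(nn.Module):
        """Sketch of attention-based evidence ranking plus agreement matching:
        each evidence sentence is scored against the claim, matched against the
        other evidence sentences, and pooled into one combined embedding."""

        def __init__(self, dim: int, num_classes: int = 3):
            super().__init__()
            self.claim_att = nn.Bilinear(dim, dim, 1)   # claim-vs-evidence relevance
            self.agree = nn.Bilinear(dim, dim, 1)       # evidence-vs-evidence agreement
            self.classifier = nn.Linear(2 * dim, num_classes)

        def forward(self, claim: torch.Tensor, evidence: torch.Tensor):
            # claim: (dim,) embedding of the answer; evidence: (k, dim) sentence embeddings
            k, dim = evidence.shape
            claim_exp = claim.unsqueeze(0).expand(k, dim)
            # attention weights rank evidence sentences by importance to the claim
            att = F.softmax(self.claim_att(claim_exp, evidence).squeeze(-1), dim=0)  # (k,)
            # agreement matching: how strongly each sentence is supported by the others
            rows = evidence.unsqueeze(1).expand(k, k, dim).reshape(k * k, dim)
            cols = evidence.unsqueeze(0).expand(k, k, dim).reshape(k * k, dim)
            agree = torch.sigmoid(self.agree(rows, cols).view(k, k))                 # (k, k)
            supported = agree.mean(dim=1, keepdim=True) * evidence                   # (k, dim)
            combined = (att.unsqueeze(-1) * supported).sum(dim=0)                    # (dim,)
            logits = self.classifier(torch.cat([claim, combined], dim=-1))
            return logits, att   # att can be inspected to rank evidence, as in Table 7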
Summary
  • Objectives:

    Given an answer a to its corresponding question q, the aim is to predict the answer veracity, which falls into one of the predefined veracity types, with the help of k relevant evidence sentences s1, s2, ..., sk.
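A minimal sketch of this formulation as a data structure and predictor interface (the field names and the three-way TRUE/MIXED/FALSE label set follow the paper's evaluation in Table 4; the interface itself is illustrative, not the authors' code):

    from dataclasses import dataclass
    from typing import List

    # The veracity types used in the AnswerFact evaluation (cf. Tables 2 and 4).
    VERACITY_TYPES = ("TRUE", "MIXED", "FALSE")

    @dataclass
    class AnswerFactInstance:
        question: str        # question q
        answer: str          # answer a, i.e., the claim to be verified
        evidence: List[str]  # k relevant evidence sentences s1, ..., sk
        veracity: str        # gold label, one of VERACITY_TYPES

    def predict_veracity(question: str, answer: str, evidence: List[str]) -> str:
        """Illustrative interface: a fact checking model for this task maps
        (q, a, s1..sk) to one of the predefined veracity types."""
        raise NotImplementedError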
Tables
  • Table 1: An example instance in AnswerFact, where the answer is the claim to be verified. The relevant product information is provided as evidence sentences
  • Table 2: Veracity labels from community votes. n_up, n_down, and n_total refer to the number of upvotes, downvotes, and total votes of the answer, respectively (see the hypothetical labeling sketch after this list)
  • Table 3: Summary statistics of the AnswerFact dataset
  • Table 4: Performance of various methods for answer veracity prediction on the AnswerFact dataset. F_TRUE, F_MIXED, and F_FALSE denote the F1 scores for the TRUE, MIXED, and FALSE classes, respectively
  • Table 5: Comparison of different evidence ranking strategies; Macro-F1 scores are reported
  • Table 6: Ablation studies on AVER. Mac/Mic refer to Macro/Micro F1 scores
  • Table 7: A sample case of the prediction, where the evidence sentences are ranked by their attention weights
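As a rough illustration of how vote-based labels of this kind could be derived from n_up, n_down, and n_total, the following sketch uses entirely hypothetical thresholds; the actual mapping used to build AnswerFact is the one defined in Table 2.

    from typing import Optional

    def label_from_votes(n_up: int, n_down: int, n_total: int) -> Optional[str]:
        """Hypothetical mapping from community votes to a veracity label.
        All thresholds are illustrative placeholders, not the rules in Table 2."""
        if n_total < 5:              # assumed: too few votes to assign a label
            return None              # such answers would simply be left unlabeled
        if n_up >= 4 * n_down:       # assumed: overwhelming agreement
            return "TRUE"
        if n_down >= 4 * n_up:       # assumed: overwhelming disagreement
            return "FALSE"
        return "MIXED"               # assumed: mixed community opinion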
Related work
  • 2.1 Community Question Answering
Funding
  • ∗ The work described in this paper is substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Code: 14200719)
Study subjects and analysis
QA pairs: 60,864
Extensive experiments are conducted with our proposed model and various existing fact checking methods, showing that our method outperforms all baselines on this task.

Dataset: As introduced in Section 3, AnswerFact has 60,864 QA pairs in total. We randomly split them into a training set and a test set with a 90:10 ratio.
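A minimal sketch of such a split, assuming the QA pairs can be shuffled freely; the random seed is an arbitrary choice for reproducibility and is not taken from the paper.

    import random
    from typing import List, Tuple

    def split_qa_pairs(qa_pairs: List, train_ratio: float = 0.9, seed: int = 42) -> Tuple[List, List]:
        """Randomly split QA pairs into a training set and a test set (90:10 here)."""
        pairs = list(qa_pairs)
        random.Random(seed).shuffle(pairs)
        cut = int(len(pairs) * train_ratio)
        return pairs[:cut], pairs[cut:]

    # With the 60,864 AnswerFact QA pairs this yields 54,777 training and 6,087 test pairs.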

QA pairs: 495
However, only QA pairs are given in this shared task for predicting the answer veracity, making it less practical due to the lack of evidence sources. Moreover, the small amount of training data, consisting of only 495 QA pairs, restricts the possibility of applying powerful machine learning models such as deep neural networks. (Footnote 1: http://www.qatarliving.com)


Reference
  • Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob Grue Simonsen. 2019. MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pages 4684–4696.
  • Ramy Baly, Mitra Mohtarami, James Glass, Lluís Màrquez, Alberto Barrón-Cedeño, and Preslav Nakov. 2018. Integrating stance detection and fact checking in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 21–27.
  • David Carmel, Liane Lewin-Eytan, and Yoelle Maarek. 2018. Product question answering using customer generated content-research challenges. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1349–1350.
  • Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for english. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, pages 169–174.
  • Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
  • Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. Tabfact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations ICLR 2020.
  • Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen, Yaliang Li, Min Yang, and Ying Shen. 2020. Joint learning of answer selection and answer summary generation in community question answering. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 7651–7658.
  • Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, pages 69–76.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pages 4171–4186.
  • Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In 5th International Conference on Learning Representations, ICLR 2017.
  • Jing Ma, Wei Gao, Shafiq R. Joty, and Kam-Fai Wong. 2019. Sentence-level evidence embedding for claim verification with hierarchical attention networks. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pages 2561–2571.
  • Tsvetomila Mihaylova, Georgi Karadzhov, Pepa Atanasova, Ramy Baly, Mitra Mohtarami, and Preslav Nakov. 2019. Semeval-2019 task 8: Fact checking in community question answering forums. In Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2019, pages 860–869.
  • Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, Mitra Mohtarami, Georgi Karadzhov, and James R. Glass. 2018. Fact checking in community forums. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), pages 5309–5316.
  • Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 130–136.
  • Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. 2017. Semeval-2017 task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, pages 27–48.
  • Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, pages 6859–6866.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pages 1532–1543.
  • Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. 2018. Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, pages 3391–3401.
  • Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. Declare: Debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 22–32.
  • Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, pages 2931–2937.
  • Karishma Sharma, Feng Qian, He Jiang, Natali Ruchansky, Ming Zhang, and Yan Liu. 2019. Combating fake news: A survey on identification and mitigation techniques. ACM TIST, 10(3):21:1–21:42.
  • Aaron Smith and Monica Anderson. 2016. Online shopping and e-commerce.
  • Bakhtiyar Syed, Vijayasaradhi Indurthi, Manish Shrivastava, Manish Gupta, and Vasudeva Varma. 2019. Fermi at semeval-2019 task 8: An elementary but effective approach to question discernment in community QA forums. In Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2019, pages 1160–1164.
  • Yi Tay, Minh C. Phan, Anh Tuan Luu, and Siu Cheung Hui. 2017. Learning to rank question answer pairs with holographic dual LSTM architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 695–704.
  • James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, pages 3346–3359.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, pages 809–819.
  • James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9.
  • Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the Workshop on Language Technologies and Computational Social Science@ACL 2014, pages 18–22.
  • Mengting Wan and Julian J. McAuley. 2016. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In IEEE 16th International Conference on Data Mining, ICDM 2016, pages 489–498.
  • William Yang Wang. 2017. ”liar, liar pants on fire”: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pages 422–426.
  • Penghui Wei, Nan Xu, and Wenji Mao. 2019. Modeling conversation structure and temporal dynamics for jointly predicting rumor stance and veracity. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4787– 4798.
  • Runqi Yang, Jianhai Zhang, Xing Gao, Feng Ji, and Haiqing Chen. 2019a. Simple and effective text matching with richer alignment features. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4699– 4709.
  • Xiao Yang, Madian Khabsa, Miaosen Wang, Wei Wang, Ahmed Hassan Awadallah, Daniel Kifer, and C. Lee Giles. 2019b. Adversarial training for community question answer selection based on multiscale matching. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pages 395–402.
  • Wenxuan Zhang, Yang Deng, and Wai Lam. 2020a. Answer ranking for product-related questions via multiple semantic relations modeling. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 569–578.
  • Wenxuan Zhang, Wai Lam, Yang Deng, and Jing Ma. 2020b. Review-guided helpful answer identification in e-commerce. In WWW ’20: The Web Conference 2020, pages 2620–2626.