Learning with Weak Supervision for Email Intent Detection

SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020, pp. 1051–1060.

Keywords:
request information, deep neural network, promise action, user interaction, user behavior
TL;DR:
We develop Hydra, a robust end-to-end neural network model that jointly learns from a small amount of clean labels and a large amount of weakly labeled instances derived from user interactions.

Abstract:

Email remains one of the most frequently used means of online communication. People spend a significant amount of time every day on email to exchange information, manage tasks and schedule events. Previous work has studied different ways of improving email productivity by prioritizing emails, suggesting automatic replies or identifying in…

Introduction
  • Email has continued to be a major tool for communication and collaboration over the past decades.
  • Recent studies show that communicating with colleagues and customers takes up to 28% of information workers’ time, second only to role-specific tasks at 39% [9].
  • Such widespread use and significant amount of time spent on email have motivated researchers to study how people use email and how intelligent experiences could assist them to be more productive [12, 24, 28].
  • One of the earliest works to characterize the main purposes email serves in work settings is that of Dabbish et al. [12].
  • They conducted a survey of 124 participants to characterize different aspects of email usage.
  • More recent work [50]
Highlights
  • Email has continued to be a major tool for communication and collaboration over the past decades
  • We show that weak supervision from user interactions is effective in the presence of a limited amount of annotated data for the task of email intent identification
  • Thereafter we describe in detail the task, the dataset, and how weak supervision can be leveraged from user interactions to help with the task
  • We report results on the test set with the model parameters picked with the best validation accuracy on the dev set (Table 3 shows the data splits)
  • We develop Hydra, a robust end-to-end neural network model that jointly learns from a small amount of clean labels and a large amount of weakly labeled instances derived from user interactions
  • Understanding the nature of different sources of weak supervision is valuable for learning in different application domains
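The joint learning from clean and weak labels highlighted above can be illustrated with a toy combined objective. This is a minimal sketch with an assumed trade-off weight `alpha` and function name, not the authors' actual Hydra objective:

```python
import numpy as np

def joint_loss(clean_losses, weak_losses, alpha=0.7):
    """Illustrative combined objective: a weighted average of the mean
    loss on clean (annotated) instances and on weak (interaction-derived)
    instances. `alpha` is a hypothetical trade-off weight that favors
    the clean labels; it is not a parameter from the paper."""
    clean_term = np.mean(clean_losses) if len(clean_losses) else 0.0
    weak_term = np.mean(weak_losses) if len(weak_losses) else 0.0
    return alpha * clean_term + (1.0 - alpha) * weak_term
```

In practice such a weight would be tuned on the dev set; setting `alpha = 1.0` recovers training on clean labels only.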
Methods
  • The authors present the experiments to evaluate the effectiveness of Hydra.

    4.1 Experimental Settings

    4.1.1 Datasets.
  • The authors primarily perform experiments on the Avocado email collection.
  • Section 2.4 discusses in detail the data annotation process used to obtain the clean labels and how user interactions are harnessed to obtain the weakly labeled instances.
  • In addition to Avocado, the authors perform an experiment to show the generalizability of the approach when transferring to another domain, namely the Enron email collection.
  • Note that Avocado is the only public email collection with user interaction logs available
Results
  • Evaluation metric

    The authors pose the task as a binary classification problem for each of the intents.
  • The authors use accuracy as the evaluation metric.
  • The authors report results on the test set with the model parameters picked with the best validation accuracy on the dev set (Table 3 shows the data splits).
  • All runs are repeated 5 times and the average is reported.
  • The authors report performance under two settings with different clean-data ratios:
  • All: The setting where all available clean labels are used.
  • According to the dataset construction process in Section 2.4, this corresponds to a clean ratio of 10%.
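The evaluation protocol above can be sketched as follows; `accuracy` and `averaged_accuracy` are assumed helper names, not code from the paper:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correct binary intent predictions for one run."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def averaged_accuracy(runs):
    """Average accuracy over repeated runs, mirroring the 5-run
    averaging described in the results. `runs` is a list of
    (y_true, y_pred) pairs, one per repetition."""
    return float(np.mean([accuracy(t, p) for t, p in runs]))
```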
Conclusion
  • The authors leverage weak supervision signals from user interactions to improve intent detection for emails.
  • The authors can extend the framework to multi-task learning where all of the above intent classification tasks can be learned jointly along with multiple sources of weak supervision.
  • It is worth exploring combining label correction and multi-source learning jointly instead of a two-stage approach.
  • In web search, user clicks can be a relatively accurate source of weak supervision but may suffer from presentation bias; for email data, user interactions are less accurate but may not suffer from the same biases.
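The label correction mentioned above (GLC in the ablations of Table 6) can be sketched as estimating a label-corruption matrix from instances that carry both a clean and a weak label. The counting estimator below is an illustrative simplification under assumed names, not the paper's implementation (GLC proper averages model posteriors on the trusted set):

```python
import numpy as np

def estimate_corruption_matrix(clean_labels, weak_labels, n_classes=2):
    """Estimate C[i, j] ~= p(weak label = j | true label = i) from
    instances with both a clean and a weak label. A GLC-style correction
    would then train on weak data through this matrix instead of trusting
    the weak labels directly."""
    counts = np.zeros((n_classes, n_classes))
    for t, w in zip(clean_labels, weak_labels):
        counts[t, w] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for unseen classes
    return counts / row_sums
```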
Summary
  • Objectives:

    The authors' objective is to build a framework that leverages signals from both sources of supervision and learns an underlying common representation from the context.
Tables
  • Table 1: Examples of different intent types in enterprise emails with weak labeling rules derived from user interactions
  • Table 2: Confusion matrix for human evaluation of weak labeling functions
  • Table 3: Email datasets with metadata and user interactions. Clean refers to manually annotated emails, whereas weak refers to the ones obtained by leveraging user interactions
  • Table 4: Notation table
  • Table 5: Performance of the proposed approach compared to several baselines. Clean ratio denotes the ratio of clean labels to all available labels (clean and weak) used to train the corresponding models. We show results for 10% (All) and 1% (Tiny) clean ratios. Hydra outperforms all the baselines in all settings
  • Table 6: Ablation analysis for Hydra. The first row in each section shows Hydra with self-paced learning and GLC. Results are averaged across all tasks and encoders for a given clean ratio
  • Table 7: Hydra (enc = BiLSTM) on the RI task with a fixed amount of clean data (i.e., all 1800 clean instances) and a varying percentage of weak labels
  • Table 8: Domain transfer for Hydra (enc = BiLSTM) on the SM task. Av. → En. denotes Hydra trained on clean data in Enron and weak data in Avocado, and tested on Enron, whereas En. → En. denotes the model trained on only the clean data in Enron, and tested on Enron
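The weak labeling rules of Table 1 can be thought of as small functions that map a user interaction to an intent label or abstain. The rule, field names, and label encoding below are hypothetical illustrations, not the paper's actual labeling functions:

```python
def weak_label_reply_with_attachment(email, reply):
    """Hypothetical labeling rule in the spirit of Table 1: if a reply
    to an email carries an attachment, weakly label the original email
    with a request-information intent (1); otherwise abstain (None).
    Both the rule and the dict field `has_attachment` are illustrative
    assumptions, not the paper's exact labeling functions."""
    if reply is not None and reply.get("has_attachment"):
        return 1
    return None
```

Abstaining (returning `None`) lets low-coverage rules contribute weak labels only where the interaction signal fires, which is the usual convention in weak-supervision pipelines.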
Related work
  • In this section, we briefly review related work on email intent detection, weak supervision, and learning from user interactions in other applications such as web search.

    5.1 Email Intent Classification

    Email understanding and intent classification have attracted increasing attention recently. Dabbish et al. conduct a survey of 124 participants to characterize email usage [13]. They highlight four distinct email intents: project management, information exchange, scheduling and planning, and social communication. Detecting user intents, especially action-item intents [7], can help service providers enhance the user experience. Recent research focuses on predicting actionable email intent from email contents [31, 51], and on identifying related user actions such as reply [52], deferral [47], and re-finding [33]. Wang et al. model the contextual information in email text to identify sentence-level user intents. Lin et al. build a reparameterized recurrent neural network to model cross-domain information and identify actionable email intents. At a finer-grained level, Lampert et al. [28] study the problem of detecting emails that contain a request-for-information intent, and propose to segment email contents into different functional zones. More recently, Azarbonyad et al. [4] utilize domain adaptation for commitment detection in emails. They demonstrate superior performance using autoencoders to capture both feature- and sample-level adaptation across domains. In contrast to all these models trained on manually annotated clean labels, we develop a framework, Hydra, that leverages weak supervision signals from user interactions in addition to a small amount of clean labels for intent classification.
References
  • Email statistics report. The Radicati Group, Inc., 2015.
  • Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In SIGIR, 2006.
  • Qingyao Ai, Susan T. Dumais, Nick Craswell, and Dan Liebling. Characterizing email search using large-scale behavioral logs and surveys. In WWW, 2017.
  • Hosein Azarbonyad, Robert Sim, and Ryen W. White. Domain adaptation for commitment detection in email. In WSDM, 2019.
  • Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
  • Paul N. Bennett and Jaime Carbonell. Detecting action-items in e-mail. In SIGIR, 2005.
  • Paul N. Bennett and Jaime G. Carbonell. Detecting action items in email. 2005.
  • Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In STOC, 2017.
  • Michael Chui, James Manyika, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Hugo Sarrazin, Geoffrey Sands, and Magdalena Westergren. The social economy: Unlocking value and productivity through social technologies. McKinsey Global Institute, 2012.
  • Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
  • William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. Learning to classify email into speech acts. In EMNLP, 2004.
  • Laura A. Dabbish, Robert E. Kraut, Susan Fussell, and Sara Kiesler. Understanding email use: Predicting action on a message. In CHI, 2005.
  • Laura A. Dabbish, Robert E. Kraut, Susan Fussell, and Sara Kiesler. Understanding email use: Predicting action on a message. In CHI, 2005.
  • Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. Neural ranking models with weak supervision. In SIGIR, 2017.
  • Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2013.
  • Liang Ge, Jing Gao, Xiaoyi Li, and Aidong Zhang. Multi-source deep learning for information trustworthiness estimation. In KDD, 2013.
  • Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
  • Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. Beyond DCG: User behavior as a predictor of a successful search. In WSDM, 2010.
  • Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS, 2018.
  • Eric Horvitz. Principles of mixed-initiative user interfaces. In CHI, 1999.
  • Eric Horvitz, Andy Jacobs, and David Hovel. Attention-sensitive alerting. In UAI, 1999.
  • Meng Jiang, Peng Cui, Rui Liu, Qiang Yang, Fei Wang, Wenwu Zhu, and Shiqiang Yang. Social contextual recommendation. In CIKM, 2012.
  • Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. TOIS, 25(2):7, 2007.
  • Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter Young, et al. Smart Reply: Automated response suggestion for email. arXiv preprint arXiv:1606.04870, 2016.
  • Bryan Klimt and Yiming Yang. The Enron corpus: A new dataset for email classification research. In ECML, 2004.
  • Farshad Kooti, Luca Maria Aiello, Mihajlo Grbovic, Kristina Lerman, and Amin Mantrach. Evolution of conversations in the age of email overload. In WWW, 2015.
  • M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NeurIPS, 2010.
  • Andrew Lampert, Robert Dale, and Cecile Paris. Detecting emails containing requests for action. In HLT, 2010.
  • Andrew Lampert, Robert Dale, and Cecile Paris. Detecting emails containing requests for action. In NAACL, 2010.
  • Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In ICCV, 2017.
  • Chu-Cheng Lin, Dongyeop Kang, Michael Gamon, and Patrick Pantel. Actionable email intent modeling with reparametrized RNNs. In AAAI, 2018.
  • Cheng Luo, Yukun Zheng, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. Training deep ranking models with weak relevance labels. In Databases Theory and Applications, pages 205–216. Springer, 2017.
  • Joel Mackenzie, Kshitiz Gupta, Fang Qiao, Ahmed Hassan Awadallah, and Milad Shokouhi. Exploring user behavior in email re-finding tasks. In WWW, 2019.
  • Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. Weakly-supervised hierarchical text classification. In AAAI, 2019.
  • Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
  • David F. Nettleton, Albert Orriols-Puig, and Albert Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 33(4):275–306, 2010.
  • Douglas Oard, William Webber, David Kirsch, and Sergey Golitsynskiy. Avocado research email collection. Philadelphia: Linguistic Data Consortium, 2015.
  • Wanli Ouyang, Xiao Chu, and Xiaogang Wang. Multi-source deep learning for human pose estimation. In CVPR, 2014.
  • Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
  • Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. In VLDB, 2017.
  • Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, and Christopher Ré. Training complex models with multi-task weak supervision. arXiv preprint arXiv:1810.02840, 2018.
  • Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
  • Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.
  • Maya Sappelli, Gabriella Pasi, Suzan Verberne, Maaike de Boer, and Wessel Kraaij. Assessing e-mail intent and tasks in e-mail messages. Information Sciences, 358:1–17, 2016.
  • Bahareh Sarrafzadeh, Ahmed Hassan Awadallah, Christopher H. Lin, Chia-Jung Lee, Milad Shokouhi, and Susan T. Dumais. Characterizing and predicting email deferral behavior. In WSDM, 2019.
  • Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
  • Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, and Christopher Ré. Learning dependency structures for weak supervision models. In ICML, 2019.
  • Wei Wang, Saghar Hosseini, Ahmed Hassan Awadallah, Paul N. Bennett, and Chris Quirk. Context-aware intent identification in email conversations. In SIGIR, 2019.
  • Wei Wang, Saghar Hosseini, Ahmed Hassan Awadallah, Paul N. Bennett, and Chris Quirk. Context-aware intent identification in email conversations. In SIGIR, 2019.
  • Liu Yang, Susan T. Dumais, Paul N. Bennett, and Ahmed Hassan Awadallah. Characterizing and predicting enterprise email reply behavior. In SIGIR, 2017.
  • Xiao Yang, Ahmed Hassan Awadallah, Madian Khabsa, Wei Wang, and Miaosen Wang. Characterizing and supporting question answering in human-to-human communication. In SIGIR, 2018.
  • Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2016.