Demographics Should Not Be the Reason of Toxicity: Mitigating Discrimination in Text Classifications with Instance Weighting

ACL, pp. 4134-4145, 2020.


Abstract:

With the recent proliferation of the use of text classifications, researchers have found that there are certain unintended biases in text classification datasets. For example, texts containing some demographic identity-terms (e.g., "gay", "black") are more likely to be abusive in existing abusive language detection datasets. As a result...
Introduction
  • With the development of Natural Language Processing (NLP) techniques, Machine Learning (ML) models are being applied in continuously expanding areas, and they are affecting many aspects of everyday life.
  • This growth has drawn attention to the discrimination problem in NLP models (Sun et al., 2019).
  • Text classification is one of the fundamental tasks in NLP.
  • It aims at assigning any given sentence to a specific class.
  • In this task, models are expected to make predictions based on the semantic information rather than on the demographic group identity information (e.g., “gay”, “black”) contained in the sentences.
Highlights
  • With the development of Natural Language Processing (NLP) techniques, Machine Learning (ML) models are being applied in continuously expanding areas, and they are affecting many aspects of everyday life.
  • We propose a model-agnostic debiasing training framework that does not require any extra resources or annotations, apart from a pre-defined set of demographic identity-terms.
  • As expected, training with the calculated weights effectively mitigates the impact of the unintended bias in the datasets.
  • Similar to the results on Toxicity Comments, we find that both Weight and Supplement perform significantly better than Baseline in terms of Identity Phrase Templates Test Set (IPTTS) AUC and False Positive Equality Difference (FPED), and the results of Weight and Supplement are comparable.
  • We focus on the unintended discrimination bias in existing text classification datasets.
  • It’s worth mentioning that our method is general enough to be applied to other tasks, as the key idea is to obtain the loss on the non-discrimination distribution through instance weighting (a minimal weighting sketch follows this list); we leave this to future work.
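    To make the weighted-training idea concrete, here is a minimal sketch of instance-weighted training for a binary toxicity classifier in PyTorch. The helper names (estimate_weights, weighted_bce_loss) are illustrative, and the particular weighting w = P(y) / P(y | z), where z marks whether a sentence contains an identity term, is an assumption made here for concreteness rather than the authors' exact formula; the shared idea is that the reweighted loss approximates the loss on the non-discrimination distribution.

        # A minimal sketch, not the authors' released code.
        from collections import Counter
        import torch
        import torch.nn.functional as F

        def estimate_weights(labels, has_identity):
            """Per-example weights removing the dependence between the label y and
            the identity-term indicator z: w_i = P(y_i) / P(y_i | z_i).
            (One plausible instantiation, assumed here for illustration.)"""
            n = len(labels)
            p_y = Counter(labels)
            p_z = Counter(has_identity)
            p_yz = Counter(zip(labels, has_identity))
            return [(p_y[y] / n) / (p_yz[(y, z)] / p_z[z])
                    for y, z in zip(labels, has_identity)]

        def weighted_bce_loss(logits, labels, weights):
            """Cross-entropy on the observed (biased) data, reweighted so that its
            expectation approximates the loss on the non-discrimination distribution."""
            per_example = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
            return (weights * per_example).mean()

        # Inside a standard training loop (model, optimizer, loader are assumed):
        #   logits = model(texts).squeeze(-1)
        #   loss = weighted_bce_loss(logits, labels.float(), weights)
        #   optimizer.zero_grad(); loss.backward(); optimizer.step()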
Methods
  • The authors present the experimental setup for non-discrimination learning.
  • The Sexist Tweets dataset is reported to have an unintended gender bias, so that models trained on it may label “You are a good woman.” as sexist. The authors randomly split this dataset in a ratio of 8:1:1 for training, validation, and testing (a split sketch follows this list) and use it to evaluate the method’s effectiveness at mitigating gender discrimination.
  • For the Toxicity Comments data, the authors adopt the split released by Dixon et al. (2018) and use this dataset to evaluate the method’s effectiveness at mitigating discrimination towards minority groups.
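    For reference, a minimal sketch of the 8:1:1 random split described above; the helper name, the seed, and the tooling are assumptions, since the summary does not specify them.

        # Minimal 8:1:1 train/validation/test split (seed and tooling assumed).
        import random

        def split_8_1_1(samples, seed=0):
            rng = random.Random(seed)
            samples = list(samples)
            rng.shuffle(samples)
            n_train = int(0.8 * len(samples))
            n_val = int(0.1 * len(samples))
            return (samples[:n_train],
                    samples[n_train:n_train + n_val],
                    samples[n_train + n_val:])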
Results
  • The authors present and discuss the experimental results. As expected, training with the calculated weights effectively mitigates the impact of the unintended bias in the datasets.
  • Sexist Tweets: Table 3 reports the results on the Sexist Tweets dataset.
  • Swap refers to models trained and validated with 2723 additional gender-swapped samples that balance the identity-terms across labels (Park et al., 2018).
  • Weight refers to models trained and validated with the calculated weights, and “+” marks models using debiased word embeddings. (A hedged sketch of the FPED fairness metric used in these comparisons follows this list.)
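    The fairness comparisons are reported in terms of IPTTS AUC and False Positive Equality Difference (FPED). Below is a hedged sketch of FPED following the definition of Dixon et al. (2018): the sum, over identity terms, of the gap between the overall false positive rate and the per-term false positive rate on the synthetic template test set. The names preds, labels, and terms are illustrative, not the authors' evaluation code.

        # Hedged sketch of FPED (Dixon et al., 2018). `preds` and `labels` are
        # 0/1 arrays over IPTTS examples; `terms` gives the identity term
        # slotted into each template.
        import numpy as np

        def false_positive_rate(preds, labels):
            negatives = labels == 0
            return preds[negatives].mean() if negatives.any() else 0.0

        def fped(preds, labels, terms):
            preds, labels, terms = map(np.asarray, (preds, labels, terms))
            overall = false_positive_rate(preds, labels)
            return sum(abs(overall - false_positive_rate(preds[terms == t], labels[terms == t]))
                       for t in np.unique(terms))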
Conclusion
  • The authors focus on the unintended discrimination bias in existing text classification datasets.
  • The authors formalize the problem as a kind of selection bias from the non-discrimination distribution to the discrimination distribution and propose a debiasing training framework that does not require any extra resources or annotations.
  • Experiments show that the method can effectively alleviate discrimination.
  • It’s worth mentioning that the method is general enough to be applied to other tasks, as the key idea is to obtain the loss on the non-discrimination distribution, and the authors leave this to future work.
Tables
  • Table 1: Percentage of toxic comments containing specific demographic identity-terms in the dataset released by Dixon et al. (2018)
  • Table 2: Statistics of the three datasets used for evaluation
  • Table 3: Experimental results on the Sexist Tweets dataset. “+” refers to models using debiased word embeddings
  • Table 4: Experimental results on the Toxicity Comments dataset
  • Table 5: Experimental results on the Jigsaw Toxicity dataset
  • Table 6: Templates used to generate IPTTS
  • Table 7: Examples of slotted words used to generate IPTTS
  • Table 8: Frequency of a selection of identity-terms in toxic samples and overall in the Jigsaw Toxicity dataset (% omitted)
Related work
  • Non-discrimination and Fairness: Non-discrimination focuses on a number of protected demographic groups and asks for parity of some statistical measures across these groups (Chouldechova, 2017). As noted by Friedler et al. (2016), non-discrimination can be achieved only if all groups have similar abilities w.r.t. the task in the constructed space containing the features on which we would like to base a decision. There are various definitions of non-discrimination corresponding to different statistical measures; popular choices include the raw positive classification rate (Calders and Verwer, 2010), the false positive and false negative rates (Hardt et al., 2016), and the positive predictive value (Chouldechova, 2017). Methods such as adversarial training (Beutel et al., 2017; Zhang et al., 2018) and fine-tuning (Park et al., 2018) have been applied to remove bias.

    In the NLP area, fairness and discrimination problems have also gained tremendous attention. Caliskan-Islam et al. (2016) show that semantics derived automatically from language corpora contain human biases. Bolukbasi et al. (2016) show that pre-trained word embeddings trained on large-scale corpora can exhibit gender prejudices, and provide a methodology for removing these prejudices from the embeddings by learning a gender subspace (a simplified sketch of this idea follows this paragraph). Zhao et al. (2018) introduce the gender bias problem in coreference resolution and propose a general-purpose debiasing method.
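    As a rough illustration of the gender-subspace idea, here is a simplified sketch in the spirit of Bolukbasi et al. (2016): estimate a gender direction from definitional word pairs and remove each vector's projection onto it. This is not the full hard-debiasing algorithm (which additionally equalizes specific word pairs), and emb is an assumed mapping from words to vectors (e.g., loaded GloVe embeddings).

        # Simplified, hedged sketch of gender-direction estimation and neutralization.
        import numpy as np

        def gender_direction(emb, pairs=(("he", "she"), ("man", "woman"), ("his", "her"))):
            diffs = np.stack([emb[a] - emb[b] for a, b in pairs])
            # The dominant direction of the difference vectors (their top right
            # singular vector) serves as the estimated gender direction.
            _, _, vt = np.linalg.svd(diffs, full_matrices=False)
            return vt[0]

        def neutralize(vec, direction):
            direction = direction / np.linalg.norm(direction)
            return vec - np.dot(vec, direction) * direction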
Funding
  • Conghui Zhu and Tiejun Zhao are supported by National Key R&D Program of China (Project No 2017YFB1002102)
References
  • Peter C Austin and Elizabeth A Stuart. 2015. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in medicine, 34(28):3661–3679.
  • Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. 2007. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144.
  • Aylin Caliskan-Islam, Joanna J. Bryson, and Arvind Narayanan. 2016. Semantics derived automatically from language corpora necessarily contain human biases. Science, 356(6334):183–186.
  • Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163.
  • Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73. ACM.
  • Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM.
  • Wei Fan, Ian Davidson, Bianca Zadrozny, and Philip S Yu. 2005. An improved categorization of classifier’s sensitivity on sample selection bias. In Fifth IEEE International Conference on Data Mining (ICDM’05), 4 pp. IEEE.
  • Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2016. On the (im) possibility of fairness. arXiv preprint arXiv:1609.07236.
  • Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323.
  • James J Heckman. 1979. Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pages 153–161.
  • Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 264–271.
  • Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 781–789. ACM.
  • Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53.
  • Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Donald B Rubin. 1976. Inference and missing data. Biometrika, 63(3):581–592.
  • Matthias Schonlau, Arthur Van Soest, Arie Kapteyn, and Mick Couper. 2009. Selection bias in web surveys and the use of propensity scores. Sociological Methods & Research, 37(3):291–318.
  • Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244.
  • Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640.
  • Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 610–618. ACM.
  • Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on twitter. In Proceedings of the first workshop on NLP and computational social science, pages 138–142.
  • Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In Proceedings of the NAACL student research workshop, pages 88–93.
  • Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, page 114. ACM.
  • Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340. ACM.
  • Guanhua Zhang, Bing Bai, Jian Liang, Kun Bai, Shiyu Chang, Mo Yu, Conghui Zhu, and Tiejun Zhao. 2019. Selection bias explorations and debias methods for natural language sentence matching datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4418–4429.
  • Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989.
  • Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 2.
  • Given the result P*(y|x) = Q(y|x), consistent learners should be asymptotically immune to different assumptions regarding Q(Z), where a learner is defined as consistent if the learning algorithm can find a model θ that is equivalent to the true model at producing class conditional probabilities given an exhaustive training data set (Fan et al., 2005). In practice, however, as these requirements are often hard to meet, we note that models may still be affected by the deviation between P*(x) and Q(x), which is widely studied as the covariate shift problem (Shimodaira, 2000; Ben-David et al., 2007; Jiang and Zhai, 2007). In our paper, as we do not assume the availability of extra resources or prior knowledge, we simply set P(Z) = Q(Z); the importance-weighting identity behind this reweighting is sketched below. We leave further exploration of this assumption to future work.
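    The weighting behind this choice follows the standard importance-weighting identity under covariate shift; a sketch in LaTeX, assuming P*(y|x) = Q(y|x) and Q(x) > 0 wherever P*(x) > 0:

        % Importance-weighting identity under covariate shift:
        \mathbb{E}_{(x,y)\sim P^{*}}\left[\ell(x,y;\theta)\right]
          = \mathbb{E}_{(x,y)\sim Q}\left[\frac{P^{*}(x)}{Q(x)}\,\ell(x,y;\theta)\right],
        \qquad w(x) = \frac{P^{*}(x)}{Q(x)}

    so minimizing the weighted empirical loss on samples drawn from Q approximates minimizing the expected loss under the non-discrimination distribution P*.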