Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

ACL 2020, pp. 7237–7256


Abstract:

The ability to control for the kinds of information encoded in neural representations has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we aim to remove, followed by projection of the representations on their null-space. By doing so, the classifiers become oblivious to that target property, making it hard to linearly separate the data according to it. We evaluate our method on bias and fairness use-cases, and show that it is able to mitigate bias in word embeddings, as well as to increase fairness in a setting of multi-class classification.
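The core loop described in the abstract is simple enough to sketch in a few lines. The following is a minimal illustration of the idea, not the authors' reference implementation; the choice of scikit-learn's LogisticRegression as the linear classifier, the SVD tolerance, and the iteration count are assumptions made here for concreteness:

```python
# Iterative Nullspace Projection (INLP), minimal sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_projection(W):
    """Orthogonal projection onto the nullspace of the rows of W (shape k x d)."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[s > 1e-10]                  # orthonormal basis of W's row space
    return np.eye(W.shape[1]) - basis.T @ basis

def inlp(X, y, n_iters=10):
    """Return a projection P such that y is hard to predict linearly from X @ P."""
    P = np.eye(X.shape[1])
    directions = []
    for _ in range(n_iters):
        # Train a linear classifier for the protected attribute on the
        # projected data, then null out every direction found so far.
        clf = LogisticRegression(max_iter=1000).fit(X @ P, y)
        directions.append(clf.coef_)
        P = nullspace_projection(np.vstack(directions))
    return P
```

Applying the result is a single matrix multiplication, X_guarded = X @ inlp(X, y). Projecting onto the nullspace of the stacked directions in one step is one standard way to realize the intersection of the individual nullspaces (cf. Ben-Israel, 2015, cited below).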
Introduction
  • Pre-trained language models and, more generally, deep learning methods have emerged as very effective techniques for text classification.
  • They are increasingly being used for predictions in real-world situations.
  • A large part of this success is due to the models’ ability to perform representation learning: coming up with effective feature representations for the prediction task at hand.
  • These learned representations, while effective, are notoriously opaque: the authors do not know what is encoded in them.
  • There is evidence that they capture a lot of information regarding the demographics of the author of the text (Blodgett et al., 2016; Elazar and Goldberg, 2018)
Highlights
  • What is encoded in vector representations of textual data, and can we control it? Word embeddings, pre-trained language models, and, more generally, deep learning methods have emerged as very effective techniques for text classification
  • We present a novel method for removing linearly-represented information from neural representations
  • We focus on bias and fairness as case studies, and demonstrate that across increasingly complex settings, our method is capable of attenuating societal biases that are expressed in representations learned from data
  • While this work focuses on societal bias and fairness, Iterative Nullspace Projection has broader possible use-cases, and can be utilized to remove specific components from a representation, in a controlled and deterministic manner
  • This method can be applicable to other end goals, such as style transfer, disentanglement of neural representations, and increasing their interpretability
Methods
  • Experiments and Analysis, 6.1 “Debiasing” Word Embeddings: In the first set of experiments, the authors evaluate the ability of the INLP method to debias word embeddings (Bolukbasi et al., 2016).
  • The data is randomly divided into a test set (30%) and training and development sets (70%, further divided into 70% training and 30% development examples); a scikit-learn sketch of such a split appears after this list
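For readers who want to reproduce the split described above, here is one way it could be written. This is an illustrative sketch, not the authors' code: it assumes X and y hold the representations and protected-attribute labels (as in the INLP sketch earlier), and train_test_split with a fixed random_state is a choice made here, not specified in the paper.

```python
# 30% of the data held out as a test set; the remaining 70% is split
# again into 70% training / 30% development, as described above.
from sklearn.model_selection import train_test_split

X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)
```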
Results
  • INLP-based debiasing results in a very substantial drop in classification accuracy (54.4%), while the removal of the predefined directions only moderately decreases accuracy (80.7%).
  • This shows that data-driven identification of gender directions outperforms manually selected directions: there are many subtle ways in which gender is encoded, which are hard for people to imagine.
  • INLP significantly outperforms this baseline, while maintaining all explicit gender markers in the input
Conclusion
  • Both the previous method and this method start with the main gender direction, the difference vector between the embeddings of “he” and “she”. However, while previous attempts take this direction as the information that needs to be neutralized, the method instead considers the labeling induced by this gender direction, and iteratively finds and neutralizes directions that correlate with this labeling (see the projection formula after this list).
  • While this work focuses on societal bias and fairness, Iterative Nullspace Projection has broader possible use-cases, and can be utilized to remove specific components from a representation, in a controlled and deterministic manner.
  • This method can be applicable to other end goals, such as style transfer, disentanglement of neural representations, and increasing their interpretability.
  • The authors aim to explore those directions in future work
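For concreteness, the contrast drawn above can be written compactly. This is a standard formulation of direction removal and nullspace projection consistent with the summary (notation introduced here: w is a unit direction, W stacks the directions found so far, and W⁺ is the Moore-Penrose pseudo-inverse):

```latex
% Removing a single direction w (e.g., the normalized he - she vector),
% versus nulling every direction correlating with the labeling w induces:
P_w = I - w w^{\top},
\qquad \hat{y}(x) = \operatorname{sign}(w \cdot x) \;\text{(labeling induced by } w\text{)},
\qquad P = I - W^{+} W .
```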
Tables
  • Table1: The Sentiment scores (in accuracy, higher is better) and TPR differences (lower is better) as a function of the ratio of tweets written by black individuals in the positive-sentiment class
  • Table2: Fair classification on the Biographies corpus
  • Table3: Word similarity scores on GloVe embeddings, before and after INLP. The scores are Spearman correlation coefficients between the embedding-derived similarity scores and human similarity judgments
  • Table4: Top 100 words influenced by INLP projection (BOW representation, biographies dataset)
  • Table6: Words by gender norm
Related work
  • The objective of controlled removal of specific types of information from neural representation is tightly related to the task of disentanglement of the representations (Bengio et al, 2013; Mathieu et al, 2016), that is, controlling and separating the different kinds of information encoded in them. In the context of transfer learning, previous methods have pursued representations which are invariant to some properties of the input, such as genre or topic, in order to ease domain transfer (Ganin and Lempitsky, 2015). Those methods mostly rely on adding an adversarial component (Goodfellow et al, 2014; Ganin and Lempitsky, 2015; Xie et al, 2017; Zhang et al, 2018) to the main task objective: the representation is regularized by an adversary network, that competes against the encoder, trying to extract the protected information from its representation.

    While adversarial methods have shown impressive performance in various machine learning tasks, and have been applied to the removal of sensitive information (Elazar and Goldberg, 2018; Coavoux et al., 2018; Resheff et al., 2019; Barrett et al., 2019), they are notoriously hard to train. Elazar and Goldberg (2018) evaluated adversarial methods for the removal of demographic information from representations. They showed that the complete removal of the protected information is non-trivial: even when the attribute seems protected, different classifiers of the same architecture can often still succeed in extracting it. Another drawback of these methods is their reliance on a main-task loss in addition to the adversarial loss, making them less suitable for tasks such as debiasing pre-trained word embeddings.
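As a rough illustration of the setup these adversarial approaches share (notation mine, not drawn from any single cited paper): the encoder and main-task head minimize the task loss while an adversary predicting the protected attribute z is trained against them, e.g., via a gradient-reversal layer (Ganin and Lempitsky, 2015):

```latex
% Saddle-point objective: the inner max trains the adversary to predict the
% protected attribute z (maximizing -lambda * L_adv minimizes L_adv), while
% the outer min trains the encoder/task head to perform well on the main
% label y yet fool the adversary, traded off by lambda.
\min_{\theta_{enc},\, \theta_{task}} \; \max_{\theta_{adv}} \;
\mathcal{L}_{task}\big(f_{task}(f_{enc}(x)),\, y\big)
\;-\; \lambda\, \mathcal{L}_{adv}\big(f_{adv}(f_{enc}(x)),\, z\big)
```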
Funding
  • This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT)
Reference
  • Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27.
  • Matthew Barker and William Rayens. 2003. Partial least squares for discrimination. Journal of Chemometrics: A Journal of the Chemometrics Society, 17(3):166–173.
  • Maria Barrett, Yova Kementchedjhieva, Yanai Elazar, Desmond Elliott, and Anders Søgaard. 2019. Adversarial removal of demographic attributes revisited. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6331–6336.
  • Adi Ben-Israel. 2015. Projectors on intersections of subspaces. Contemporary Mathematics, pages 41–50.
  • Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828.
  • Su Lin Blodgett, Lisa Green, and Brendan O’Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130.
  • Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.
  • Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.
  • Maximin Coavoux, Shashi Narayan, and Shay B. Cohen. 2018. Privacy-preserving neural representations of text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1–10.
  • Maria De-Arteaga, Alexey Romanov, Hanna M. Wallach, Jennifer T. Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Cem Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019, pages 120–128. ACM.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21. Association for Computational Linguistics.
  • Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding undesirable word embedding associations. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1696–1705. Association for Computational Linguistics.
  • Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Yaroslav Ganin and Victor S. Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1180–1189.
  • Paul Geladi and Bruce R. Kowalski. 1986. Partial least-squares regression: A tutorial. Analytica Chimica Acta, 185:1–17.
  • Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
  • Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 609–614. Association for Computational Linguistics.
  • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 2672–2680.
  • Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1406–1414.
  • Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3315–3323.
  • Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Schölkopf. 1998. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28.
  • John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4129–4138. Association for Computational Linguistics.
  • Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 427–431. Association for Computational Linguistics.
  • Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL, 4:521–535.
  • Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
  • David Madras, Elliot Creager, Toniann Pitassi, and Richard S. Zemel. 2019. Fairness through causal awareness: Learning causal latent-variable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019, pages 349–358. ACM.
  • Michael Mathieu, Junbo Jake Zhao, Pablo Sprechmann, Aditya Ramesh, and Yann LeCun. 2016. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 5041–5049.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2463–2473. Association for Computational Linguistics.
  • Yehezkel Resheff, Yanai Elazar, Moni Shahar, and Oren Shalom. 2019. Privacy and fairness in recommender systems via adversarial training of user representations. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, pages 476–482. INSTICC, SciTePress.
  • Alexey Romanov, Maria De-Arteaga, Hanna M. Wallach, Jennifer T. Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Cem Geyik, Krishnaram Kenthapadi, Anna Rumshisky, and Adam Kalai. 2019. What’s in a name? Reducing bias in bios without access to protected attributes. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4187–4195. Association for Computational Linguistics.
  • Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 410–420. ACL.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4593–4601. Association for Computational Linguistics.
  • Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. 2017. Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA.
  • Ke Xu, Tongyi Cao, Swair Shah, Crystal Maung, and Haim Schweitzer. 2017. Cleaning the null space: A privacy mechanism for predictors. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2789–2795. AAAI Press.
  • Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340. ACM.
  • Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4847–4853. Association for Computational Linguistics.