WILDS: A Benchmark of in-the-Wild Distribution Shifts
Distribution shifts can cause significant degradation in a broad range of machine learning (ML) systems deployed in the wild. However, many widely-used datasets in the ML community today were not designed for evaluating distribution shifts. These datasets typically have training and test sets drawn from the same distribution, and prior ...
- Distribution shifts—mismatches in data distributions between training and test time—pose significant challenges for machine learning (ML) systems deployed in the wild.
- In contrast to general-purpose ML research, domain experts applying ML in their respective areas are often forced to grapple with distribution shifts in order to make progress on real-world problems
- As a result, these application areas are rich sources of datasets with distribution shifts that arise in the wild, e.g., in medicine (Chen et al, 2020), computational biology (Leek et al, 2010), wildlife conservation (Beery et al, 2018), satellite imagery (Jean et al, 2016), and so on.
- As we show in Appendix B.1, we find that there is a large label distribution shift between non-African regions and Africa, suggesting that the drop in performance may be in some part due to label shift
- There is drastic variation in illumination, camera angle, background, vegetation, and color. This variation, coupled with considerable differences in the distribution of animals between camera traps, likely encourages the model to overfit to specific animal species appearing in specific locations, which may account for the performance drop
- Histopathology datasets can be unwieldy for ML models, as individual images can be several gigabytes large; extracting patches involves many design choices; the classes are typically very unbalanced; and evaluation often relies on more complex slide-level measures such as the free-response receiver operating characteristic (FROC) (Gurcan et al, 2009)
- Prior work has shown that differences in staining between hospitals are the primary source of variation in this dataset, and that specialized stain augmentation methods can close the in- and out-of-distribution accuracy gap on a variant of the dataset based on the same underlying slides (Tellez et al, 2019)
- We show that Deep CORAL and invariant risk minimization (IRM) fail to improve performance on unseen users
- Performance disparities across individuals have been observed in a wide range of tasks and applications, including in natural language processing (Geva et al, 2019), automatic speech recognition (Koenecke et al, 2020; Tatman, 2017), federated learning (Li et al, 2019; Caldas et al, 2018), and medical imaging (Badgeley et al, 2019)
- The authors evaluate models by their average and worst-region OOD accuracies. The former measures the ability of the model to generalize across time, while the latter measures how well models do across different regions/subpopulations under a time shift.
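The Deep CORAL baseline mentioned above penalizes differences between the feature statistics of different domains. A minimal NumPy sketch of a CORAL-style penalty follows; the function name and the inclusion of a mean term are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def coral_penalty(h_src, h_tgt):
    """CORAL-style alignment penalty: squared differences between the
    feature means and covariances of two domains' representations."""
    def cov(h):
        hc = h - h.mean(axis=0, keepdims=True)
        return hc.T @ hc / (h.shape[0] - 1)
    mean_diff = np.sum((h_src.mean(axis=0) - h_tgt.mean(axis=0)) ** 2)
    cov_diff = np.sum((cov(h_src) - cov(h_tgt)) ** 2)
    return mean_diff + cov_diff
```

In training, a penalty like this would be added to the task loss, encouraging the featurizer to produce domain-invariant representations.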
- In evaluating the trained models, the authors consider average accuracy across the binary classification tasks, averaged over each of the validation and test sets separately.
- To assess whether models generalize to unseen categories, the authors evaluate models by their average accuracy on each of the categories in the OOD test set.
- A BERT-base-uncased model trained with the standard ERM objective performs well on the OOD test set, achieving 76.0% accuracy on average and 75.4% on the worst year (Table 29).
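These evaluation criteria, average accuracy together with accuracy on the worst domain (worst region or worst year), reduce to a simple computation. A sketch, with an illustrative function name:

```python
import numpy as np

def average_and_worst_group_accuracy(y_true, y_pred, groups):
    """Average accuracy over all examples, plus the minimum per-group
    accuracy (e.g., the worst geographic region or worst year)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    correct = (y_true == y_pred)
    avg = correct.mean()
    worst = min(correct[groups == g].mean() for g in np.unique(groups))
    return float(avg), float(worst)
```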
- The authors associate each dataset in Wilds with the problem setting that the authors believe best reflects the real-world challenges in the corresponding application area.
- Prior work has shown that there is often insufficient information at training time to distinguish models that would generalize well under a particular distribution shift; many models that perform comparably in-distribution (ID) can vary substantially out-of-distribution (OOD)
- This instability in OOD performance has been reported in natural language processing settings (McCoy et al, 2019; Kim and Linzen, 2020) and in vision and healthcare applications (D’Amour et al, 2020).
- The authors speculate that the relatively similar ID and OOD variances in the other datasets could be in part because of this, and in part because the authors select models based on their OOD validation performance, but further investigation is required
- Table1: The Wilds benchmark contains 7 datasets across a diverse set of application areas and data modalities. Each dataset comprises data from different domains, and the benchmark is set up to evaluate models on distribution shifts across these domains
- Table2: In both domain generalization and subpopulation shift settings, domain information is available to the model at training time. At test time, domain information can be available in domain generalization, but not in subpopulation shift
- Table3: Time shift and worst-region accuracies (%) for models trained on data before 2013 and tested on held-out locations from in-distribution (ID) or out-of-distribution (OOD) test sets in FMoW-wilds. The models are early-stopped with respect to OOD validation accuracy. Standard deviations over 3 trials are in parentheses
- Table4: Performance drops for ERM models on FMoW-wilds. In the standard split, we train on data from 2002–2013, whereas in the mixed split, we train on the same amount of data but half from 2002–2013 and half from 2013–2018. In both cases, we test on data from 2016–2018. Models trained on the standard split degrade in performance under the time shift, especially on the last year (2017) of the test data, and also fare poorly on the subpopulation shift, with low worst-region accuracy. Models trained on the mixed split have higher OOD average and last year accuracy and much higher OOD worst-region accuracy
- Table5: Region shift results (accuracy, %) for models trained on data before 2013 and tested on held-out locations from ID (< 2013) or OOD (≥ 2016) test sets in FMoW-wilds. One standard deviation is shown in parentheses
- Table6: Pearson correlation r (higher is better) on in-distribution and out-of-distribution (unseen countries) held-out sets in PovertyMap-wilds, including results on rural or urban subpopulations. All results are averaged over 5 different OOD country folds taken from Yeh et al (2020), with standard deviations across different folds in parentheses. All models are early-stopped with respect to OOD validation MSE
- Table7: Performance drops for ERM models on PovertyMap-wilds. In the standard split, we train on data from one set of countries, and then test on a different set of countries. In the mixed split, we train on the same amount of data but sampled uniformly from all countries. Models trained on the standard split degrade in performance, especially on rural subpopulations, while models trained on the mixed split do not
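The fold-averaged Pearson r reported for PovertyMap-wilds is computed within each held-out country fold and then averaged, with the standard deviation taken across folds. A minimal sketch (function name illustrative):

```python
import numpy as np

def pearson_over_folds(preds_by_fold, targets_by_fold):
    """Pearson correlation r computed within each held-out country fold,
    then averaged, with the standard deviation across folds."""
    rs = [np.corrcoef(p, t)[0, 1]
          for p, t in zip(preds_by_fold, targets_by_fold)]
    return float(np.mean(rs)), float(np.std(rs))
```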
- Table8: Baseline results on iWildCam2020-wilds
- Table9: Baseline results on Camelyon17-wilds. Parentheses show standard deviation across 10 replicates
- Table10: Performance drops for ERM models on Camelyon17-wilds. In the standard split, we train on data from three hospitals and evaluate on a different test hospital, whereas in the mixed split, we add data from one extra slide from the test hospital to the training set. The original test set has data from 10 slides; here, we report performance for both splits on 9 slides (without the slide that was moved to the training set). This makes the numbers (74.1 vs. 73.3) for the standard split slightly different from Table 9. Parentheses show standard deviation across 10 replicates
- Table11: Baseline results on OGB-MolPCBA. Parentheses show standard deviation across 3 replicates
- Table12: Out-of-distribution vs. in-distribution performance for ERM models on OGB-MolPCBA. In the standard split, we train on molecules from some scaffolds and evaluate on molecules from different scaffolds, whereas in the mixed split, we randomly divide molecules into training and test sets without using scaffold information
- Table13: Baseline results on Amazon-wilds. We report the accuracy of models trained using three baseline algorithms: ERM, DeepCORAL, and IRM. In addition to the average accuracy across all reviews, we compute the accuracy for each reviewer and report the performance for the reviewer in the 10th percentile
- Table14: Comparison with in-distribution baselines on Amazon-wilds. To demonstrate that the poor out-of-distribution (OOD) performance of the ERM model (Table 13) stems from the distribution shift, we compare with in-distribution (ID) baseline models, which are oracle models finetuned on each reviewer. We report the average accuracy on a fixed set of 10 reviewers that are in the 10th percentile or below for the ERM model. Despite being trained on data that are orders of magnitude smaller (less than 1,000 reviews per user, compared to the full training set of 1 million reviews), the oracle baseline models outperform the ERM models
- Table15: Baseline results on CivilComments-wilds. The reweighted (label) algorithm samples equally from the positive and negative class; the group DRO (label) algorithm additionally weights these classes so as to minimize the maximum of the average positive training loss and average negative training loss. Similarly, the reweighted (label × Black) and group DRO (label × Black) algorithms sample equally from the four groups corresponding to all combinations of class and whether there is a mention of Black identity. We show standard deviation across random seeds in parentheses
- Table16: Accuracies on each subpopulation in CivilComments-wilds, averaged over models trained by group DRO (label)
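Group DRO, as used in Tables 15 and 16, minimizes the maximum per-group loss. One common way to implement this is an online exponentiated-gradient update of per-group mixture weights, in which groups with higher average loss are upweighted; the sketch below shows one such update and is not the paper's exact implementation.

```python
import numpy as np

def group_dro_weights(group_losses, weights, step_size=0.01):
    """One exponentiated-gradient update of group DRO mixture weights:
    groups with higher average loss get upweighted, so minimizing the
    weighted loss approximately minimizes the worst-group loss."""
    w = np.asarray(weights) * np.exp(step_size * np.asarray(group_losses))
    return w / w.sum()
```

At each training step, the model's loss would then be computed as the weighted sum of the per-group average losses under these weights.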
- Table17: Wilds focuses on two specific settings of domain shift: domain generalization and subpopulation shift. These two settings vary only in whether the test domains are seen or unseen. Other problem settings that can apply to Wilds datasets include test-time adaptation and unsupervised domain adaptation
- Table18: Time shift accuracies (%) for models trained on data before 2013 and tested on held-out locations from in-distribution (ID) or out-of-distribution (OOD) test sets in FMoW-wilds. The accuracy of ERM drops significantly in the last year of the dataset. The models are early-stopped with respect to OOD validation accuracy. Standard deviations over 3 trials are in parentheses. Mixed split models use both ID + OOD training examples
- Table19: Pearson correlation r (higher is better) on in-distribution and out-of-distribution (unseen countries) held-out sets in PovertyMap-wilds, including results on rural or urban subpopulations. All results are averaged over 5 different OOD country folds taken from Yeh et al (2020), with standard deviations across different folds in parentheses. All models are early-stopped with respect to OOD validation MSE. (- NL) models do not use nighttime light as input. Mixed split models use both ID + OOD examples as training data
- Table20: Mean squared error (MSE) on in-distribution and out-of-distribution (unseen countries) held-out sets in PovertyMap-wilds. All results are averaged over 5 folds taken from Yeh et al (2020). All models are early-stopped with respect to OOD validation MSE. (- NL) models do not use nighttime light as input. Mixed split models use both ID + OOD examples as training data
- Table21: Dataset details for iWildCam2020-wilds
- Table22: Dataset details for Amazon-wilds
- Table23: Additional results of baseline models on Amazon-wilds
- Table24: Additional results of in-distribution baseline models on Amazon-wilds
- Table25: Group sizes in the test data for CivilComments-wilds. The training and validation data follow similar proportions
- Table26: CivilComments-wilds results for the Group DRO (label × Black) model with early stopping on accuracy on comments that mention the Black identity. Compared to the Group DRO (label) model in Table 15, accuracy on Black comments is higher but accuracy on LGBTQ comments is lower. We show standard deviation across random seeds in parentheses
- Table27: Average multi-task classification accuracy of ERM trained models on BDD100K. All results are reported across 3 random seeds, with standard deviation in parentheses. We observe no substantial drops in the presence of test time distribution shifts
- Table28: Baseline results on category shifts on the Amazon Reviews Dataset. We report the accuracy of models trained using ERM on a single category versus four categories. Across many categories unseen at training time, the latter model modestly but consistently outperforms the former
- Table29: Baseline results on time shifts on the Amazon Reviews Dataset. We report the accuracy of models trained using ERM. In addition to the average accuracy across all years in each split, we report the accuracy for the worst-case year
- Table30: Comparison with in-distribution baselines for time shifts on Amazon Reviews Dataset. We observe only modest performance drops due to time shifts
- Table31: Baseline results on time shifts on the Yelp Open Dataset. We report the accuracy of models trained using ERM
- Table32: Comparison with in-distribution baselines for time shifts on Yelp Open Dataset. We observe only modest performance drops due to time shifts
- Table33: Baseline results on user shifts on the Yelp Open Dataset. We report the accuracy of models trained using ERM. In addition to the average accuracy across all reviews, we compute the accuracy for each reviewer and report the performance for the reviewer in the 10th percentile
- We are grateful for all of the helpful suggestions and constructive feedback from: Aditya Khosla, Andreas Schlueter, Annie Chen, Alexander D’Amour, Allison Koenecke, Alyssa Lees, Ananya Kumar, Andrew Beck, Behzad Haghgoo, Charles Sutton, Christopher Yeh, Cody Coleman, Dan Jurafsky, Daniel Levy, Daphne Koller, David Tellez, Erik Jones, Evan Liu, Fisher Yu, Georgi Marinov, Irena Gao, Irene Chen, Jacky Kang, Jacob Schreiber, Jacob Steinhardt, Jared Dunnmon, Jean Feng, Jeffrey Sorensen, Jianmo Ni, John Hewitt, Kate Saenko, Kelly Cochran, Kensen Shi, Kyle Loh, Li Jiang, Lucy Vasserman, Ludwig Schmidt, Luke Oakden-Rayner, Marco Tulio Ribeiro, Matthew Lungren, Megha Srivastava, Nimit Sohoni, Pranav Rajpurkar, Robin Jia, Rohan Taori, Sarah Bird, Sharad Goel, Sherrie Wang, Stefano Ermon, Steve Yadlowsky, Tatsunori Hashimoto, Vincent Hellendoorn, Yair Carmon, Zachary Lipton, and Zhenghao Chen. The design of the WILDS benchmark was inspired by the Open Graph Benchmark (Hu et al, 2020), and we are grateful to the Open Graph Benchmark team for their advice and help in setting up our benchmark. This project was funded by an Open Philanthropy Project Award and NSF Award Grant No 1805310
- Sagawa was supported by the Herbert Kunzel Stanford Graduate Fellowship
- Marklund was supported by the Dr Tech
- Zhang were supported by NDSEG Graduate Fellowships
- Hu was supported by the Funai Overseas Scholarship and the Masason Foundation Fellowship
- Beery was supported by an NSF Graduate Research Fellowship and is a PIMCO Fellow in Data Science
Wilds datasets span a diverse array of societally-important applications with natural distribution shifts: poverty mapping (Yeh et al, 2020), building and land use classification (Christie et al, 2018), animal species categorization (Beery et al, 2020), predicting text toxicity (Borkan et al, 2019), sentiment analysis (Ni et al, 2019), tumor identification (Bandi et al, 2018), and bioassay prediction (Hu et al, 2020). At present, there are 7 datasets in Wilds (Table 1), reflecting distribution shifts arising from different demographics, users, hospitals, camera locations, countries, time periods, and molecular scaffolds. Wilds builds on top of extensive data-collection efforts by domain experts
Wilds datasets. We now discuss the 7 datasets in the Wilds benchmark, summarized in Table 1. For each dataset, we first describe the task, the distribution shift, and the evaluation criteria
We model toxicity classification as a binary task. Toxicity labels were obtained in the original dataset via crowdsourcing and majority vote, with each comment being reviewed by at least 10 crowdworkers. Annotations of demographic mentions were similarly obtained through crowdsourcing and majority vote
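The label aggregation described above, majority vote over at least 10 crowdworker reviews per comment, can be sketched as follows; the helper name and data layout are illustrative, not the dataset's actual pipeline.

```python
from collections import Counter

def majority_vote(annotations, min_annotators=10):
    """Aggregate per-comment crowdworker labels by majority vote.
    `annotations` maps a comment id to its list of binary toxicity votes;
    comments with fewer than `min_annotators` votes are skipped."""
    labels = {}
    for comment_id, votes in annotations.items():
        if len(votes) < min_annotators:
            continue  # too few reviews to aggregate reliably
        labels[comment_id] = Counter(votes).most_common(1)[0][0]
    return labels
```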
- B. Abelson, K. R. Varshney, and J. Sun. Targeting direct cash transfers to the extremely poor. In International Conference on Knowledge Discovery and Data Mining (KDD), 2014.
- R. Adragna, E. Creager, D. Madras, and R. Zemel. Fairness and robustness in invariant learning: A case study in toxicity classification. arXiv preprint arXiv:2011.06485, 2020.
- A. Ahadi, R. Lister, H. Haapala, and A. Vihavainen. Exploring machine learning methods to automatically identify students in need of assistance. In Proceedings of the Eleventh Annual International Conference on International Computing Education Research, pages 121–130, 2015.
- Jorge A Ahumada, Eric Fegraus, Tanya Birch, Nicole Flores, Roland Kays, Timothy G O’Brien, Jonathan Palmer, Stephanie Schuttler, Jennifer Y Zhao, Walter Jetz, Margaret Kinnaird, Sayali Kulkarni, Arnaud Lyet, David Thau, Michelle Duong, Ruth Oliver, and Anthony Dancer. Wildlife insights: A platform to maximize the potential of camera trap and other passive sensor wildlife data for the planet. Environmental Conservation, 47(1):1–6, 2020.
- E. AlBadawy, A. Saha, and M. Mazurowski. Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing. Med Phys., 45, 2018.
- A. Alexandari, A. Kundaje, and A. Shrikumar. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (ICML), pages 222–232, 2020.
- Miltiadis Allamanis and Marc Brockschmidt. Smartpaste: Learning to adapt source code. arXiv preprint arXiv:1705.07867, 2017.
- Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 38–49, 2015.
- E. Amorim, M. Cançado, and A. Veloso. Automated essay scoring in the presence of biased ratings. In Association for Computational Linguistics (ACL), pages 229–237, 2018.
- R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Language Resources and Evaluation Conference (LREC), pages 4218–4222, 2020.
- M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
- Arthur Asuncion and David Newman. UCI Machine Learning Repository, 2007.
- M. S. Attene-Ramos, N. Miller, R. Huang, S. Michael, M. Itkin, R. J. Kavlock, C. P. Austin, P. Shinn, A. Simeonov, R. R. Tice, et al. The tox21 robotic platform for the assessment of environmental chemicals–from vision to reality. Drug discovery today, 18(15):716–723, 2013.
- J. Atwood, Y. Halpern, P. Baljekar, E. Breck, D. Sculley, P. Ostyakov, S. I. Nikolenko, I. Ivanov, R. Solovyev, W. Wang, et al. The Inclusive Images competition. In Advances in Neural Information Processing Systems (NeurIPS), pages 155–186, 2020.
- A. Regev, S. A. Teichmann, E. S. Lander, I. Amit, C. Benoist, E. Birney, B. Bodenmiller, P. Campbell, P. Carninci, M. Clatworthy, et al. The human cell atlas. Elife, 6, 2017.
- Žiga Avsec, M. Weilert, A. Shrikumar, A. Alexandari, S. Krueger, K. Dalal, R. Fropf, C. McAnany, J. Gagneur, A. Kundaje, and J. Zeitlinger. Deep learning at base-resolution reveals motif syntax of the cis-regulatory code. bioRxiv, 2019.
- K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations (ICLR), 2019.
- M. A. Badgeley, J. R. Zech, L. Oakden-Rayner, B. S. Glicksberg, M. Liu, W. Gale, M. V. McConnell, B. Percha, T. M. Snyder, and J. T. Dudley. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Medicine, 2, 2019.
- P. Bandi, O. Geessink, Q. Manson, M. V. Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee, K. Paeng, A. Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE transactions on medical imaging, 38(2):550–560, 2018.
- A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems (NeurIPS), pages 9453–9463, 2019.
- P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research (JMLR), 9:1823–1840, 2008.
- T. Baumann, A. Köhn, and F. Hennig. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening. Language Resources and Evaluation, 53(2): 303–329, 2019.
- BBC. A-levels and GCSEs: How did the exam algorithm work? The British Broadcasting Corporation, 2020. URL https://www.bbc.com/news/explainers-53807730.
- A. H. Beck, A. R. Sangoi, S. Leung, R. J. Marinelli, T. O. Nielsen, M. J. V. D. Vijver, R. B. West, M. V. D. Rijn, and D. Koller. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Science, 3(108), 2011.
- Axel D Becke. Perspective: Fifty years of density-functional theory in chemical physics. The Journal of Chemical Physics, 140(18):18A301, 2014.
- E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamviboonsuk, and L. M. Vardoulakis. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Conference on Human Factors in Computing Systems (CHI), pages 1–12, 2020.
- S. Beery, G. V. Horn, and P. Perona. Recognition in terra incognita. In European Conference on Computer Vision (ECCV), pages 456–473, 2018.
- S. Beery, E. Cole, and A. Gjoka. The iWildCam 2020 competition dataset. arXiv preprint arXiv:2004.10340, 2020.
- Sara Beery, Dan Morris, and Siyu Yang. Efficient pipeline for camera trap image review. arXiv preprint arXiv:1907.06772, 2019.
- Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, and Jonathan Huang. Context r-cnn: Long term temporal context for per-camera object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13075–13085, 2020.
- B. E. Bejnordi, M. Veta, P. J. V. Diest, B. V. Ginneken, N. Karssemeijer, G. Litjens, J. A. V. D. Laak, M. Hermsen, Q. F. Manson, M. Balkenhol, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22):2199–2210, 2017.
- D. Bellamy, L. Celi, and A. L. Beam. Evaluating progress on machine learning for longitudinal electronic healthcare data. arXiv preprint arXiv:2010.01149, 2020.
- M. G. Bellemare, S. Candido, P. S. Castro, J. Gong, M. C. Machado, S. Moitra, S. S. Ponda, and Z. Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588, 2020.
- S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS), pages 137–144, 2006.
- A. BenTaieb and G. Hamarneh. Adversarial stain transfer for histopathology image analysis. IEEE transactions on medical imaging, 37(3):792–802, 2017.
- A. A. Beyene, T. Welemariam, M. Persson, and N. Lavesson. Improved concept drift handling in surgery prediction and other applications. Knowledge and Information Systems, 44(1): 177–196, 2015.
- G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems, pages 2178–2186, 2011.
- J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447, 2007.
- S. L. Blodgett and B. O’Connor. Racial disparity in natural language processing: A case study of social media African-American English. arXiv preprint arXiv:1707.00061, 2017.
- S. L. Blodgett, L. Green, and B. O’Connor. Demographic dialectal variation in social media: A case study of African-American English. In Empirical Methods in Natural Language Processing (EMNLP), pages 1119–1130, 2016.
- J. Blumenstock, G. Cadamuro, and R. On. Predicting poverty and wealth from mobile phone metadata. Science, 350, 2015.
- R. S. Bohacek, C. McMartin, and W. C. Guida. The art and practice of structure-based drug design: a molecular modeling perspective. Medicinal Research Reviews, 16(1):3–50, 1996.
- D. Borkan, L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Limitations of pinned auc for measuring unintended bias. arXiv preprint arXiv:1903.02088, 2019.
- D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In WWW, pages 491–500, 2019.
- L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research (JMLR), 14:3207–3260, 2013.
- New York Times, 2020. URL https://www.nytimes.com/2020/09/08/opinion/
- L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):770–787, 2009.
- D. Bug, S. Schneider, A. Grote, E. Oswald, F. Feuerhake, J. Schüler, and D. Merhof. Contextbased normalization of histological stains using deep convolutional features. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 135–142, 2017.
- Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations (ICLR), 2018.
- J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
- M. Burke, S. Heft-Neal, and E. Bendavid. Sources of variation in under-5 mortality across sub-Saharan Africa: a spatial analysis. Lancet Global Health, 4, 2016.
- J. Byrd and Z. Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning (ICML), pages 872–881, 2019.
- S. Caldas, P. Wu, T. Li, J. Konečny, H. B. McMahan, V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
- G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8):1301–1309, 2019.
- K. Cao, Y. Chen, J. Lu, N. Arechiga, A. Gaidon, and T. Ma. Heteroskedastic and imbalanced deep learning with adaptive regularization. arXiv preprint arXiv:2006.15766, 2020.
- L. Chanussot, A. Das, S. Goyal, T. Lavril, M. Shuaibi, M. Riviere, K. Tran, J. Heras-Domingo, C. Ho, W. Hu, A. Palizhati, A. Sriram, B. Wood, J. Yoon, D. Parikh, C. L. Zitnick, and Z. Ulissi. The Open Catalyst 2020 (oc20) dataset and community challenges. arXiv preprint arXiv:2010.09990, 2020.
- I. Y. Chen, P. Szolovits, and M. Ghassemi. Can AI help reduce disparities in general medical and mental health care? AMA Journal of Ethics, 21(2):167–179, 2019.
- I. Y. Chen, E. Pierson, S. Rose, S. Joshi, K. Ferryman, and M. Ghassemi. Ethical machine learning in health care. arXiv preprint arXiv:2009.10576, 2020.
- V. Chen, S. Wu, A. J. Ratner, J. Weng, and C. Ré. Slice-based learning: A programming model for residual learning in critical data slices. In Advances in Neural Information Processing Systems, pages 9397–9407, 2019.
- T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P. Agapow, M. Zietz, M. M. Hoffman, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141), 2018.
- G. Christie, N. Fendley, J. Wilson, and R. Mukherjee. Functional map of the world. In Computer Vision and Pattern Recognition (CVPR), 2018.
- J. S. Chung, A. Nagrani, and A. Zisserman. Voxceleb2: Deep speaker recognition. Proc. Interspeech, pages 1086–1090, 2018.
- J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. arXiv preprint arXiv:2003.05002, 2020.
- N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019.
- A. Conneau and G. Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems (NeurIPS), pages 7059–7069, 2019.
- A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov. Xnli: Evaluating cross-lingual sentence representations. In Empirical Methods in Natural Language Processing (EMNLP), pages 2475–2485, 2018.
- ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, 2012.
- GTEx Consortium et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509):1318–1330, 2020.
- HuBMAP Consortium et al. The human body at cellular resolution: the NIH human biomolecular atlas program. Nature, 574(7777), 2019.
- L. P. Cordella, C. D. Stefano, F. Tortorella, and M. Vento. A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 6(5):1140–1147, 1995.
- P. Courtiol, C. Maussion, M. Moarii, E. Pronier, S. Pilcer, M. Sefta, P. Manceron, S. Toldo, M. Zaslavskiy, N. L. Stang, et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nature Medicine, 25(10):1519–1525, 2019.
- F. Croce, M. Andriushchenko, V. Sehwag, N. Flammarion, M. Chiang, P. Mittal, and M. Hein. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
- Anne-Sophie Crunchant, David Borchers, Hjalmar Kühl, and Alex Piel. Listening and watching: Do camera traps or acoustic sensors more efficiently detect wild chimpanzees in an open habitat? Methods in Ecology and Evolution, 11(4):542–552, 2020.
- M. F. Cuccarese, B. A. Earnshaw, K. Heiser, B. Fogelson, C. T. Davis, P. F. McLean, H. B. Gordon, K. Skelly, F. L. Weathersby, V. Rodic, et al. Functional immune mapping with deep-learning enabled phenomics applied to immunomodulatory and COVID-19 drug discovery. bioRxiv, 2020.
- Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. In Computer Vision and Pattern Recognition (CVPR), pages 9268–9277, 2019.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
- D. Dai and L. Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In International Conference on Intelligent Transportation Systems (ITSC), 2018.
- A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020.
- A. D’Amour, H. Srinivasan, J. Atwood, P. Baljekar, D. Sculley, and Y. Halpern. Fairness is not static: deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 525–534, 2020.
- H. Daumé III. Frustratingly easy domain adaptation. In Association for Computational Linguistics (ACL), 2007.
- S. E. Davis, T. A. Lasko, G. Chen, E. D. Siew, and M. E. Matheny. Calibration drift in regression and machine learning models for acute kidney injury. Journal of the American Medical Informatics Association, 24(6):1052–1061, 2017.
- A. J. DeGrave, J. D. Janizek, and S. Lee. AI for radiographic COVID-19 detection selects shortcuts over signal. medRxiv, 2020.
- M. C. Desmarais and R. Baker. A review of recent advances in learner and skill modeling in intelligent learning environments. User Modeling and User-Adapted Interaction, 22(1):9–38, 2012.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), pages 4171–4186, 2019.
- DigitalGlobe and CosmiQ Works. SpaceNet. https://aws.amazon.com/publicdatasets/spacenet/, 2016.
- K. A. Dill and J. L. MacCallum. The protein-folding problem, 50 years on. Science, 338(6110): 1042–1046, 2012.
- L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and mitigating unintended bias in text classification. In Association for the Advancement of Artificial Intelligence (AAAI), pages 67–73, 2018.
- J. Djolonga, J. Yung, M. Tschannen, R. Romijnders, L. Beyer, A. Kolesnikov, J. Puigcerver, M. Minderer, A. D’Amour, D. Moldovan, et al. On robustness and transferability of convolutional neural networks. arXiv preprint arXiv:2007.08558, 2020.
- S. Dodge and L. Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In International Conference on Computer Communication and Networks (ICCCN), pages 1–7, 2017.
- Q. Dou, D. Castro, K. Kamnitsas, and B. Glocker. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- J. Dressel and H. Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1), 2018.
- J. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses against mixture covariate shifts. https://cs.stanford.edu/~thashim/assets/publications/condrisk.pdf, 2019.
- C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pages 214–226, 2012.
- C. D. Elvidge, P. C. Sutton, T. Ghosh, B. T. Tuttle, K. E. Baugh, B. Bhaduri, and E. Bright. A global poverty map derived from satellite data. Computers and Geosciences, 35, 2009.
- G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis. Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20(7):389–403, 2019.
- J. Espey, E. Swanson, S. Badiee, Z. Christensen, A. Fischer, M. Levy, G. Yetman, A. de Sherbinin, R. Chen, Y. Qiu, G. Greenwell, T. Klein, J. Jutting, M. Jerven, G. Cameron, A. M. A. Rivera, V. C. Arias, S. L. Mills, and A. Motivans. Data for development: A needs assessment for SDG monitoring and statistical capacity development. Sustainable Development Solutions Network, 2015.
- A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542 (7639):115–118, 2017.
- OpenAI et al. Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
- C. Fang, Y. Xu, and D. N. Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In International Conference on Computer Vision (ICCV), pages 1657–1664, 2013.
- J. Feng, A. Sondhi, J. Perry, and N. Simon. Selective prediction-set models with coverage guarantees. arXiv preprint arXiv:1906.05473, 2019.
- D. Filmer and K. Scott. Assessing asset indices. Demography, 49, 2011.
- J. Futoma, M. Simons, T. Panch, F. Doshi-Velez, and L. A. Celi. The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health, 2(9):e489–e492, 2020.
- Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.
- Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pages 1180–1189, 2015.
- Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR), 17, 2016.
- S. Garg, Y. Wu, S. Balakrishnan, and Z. C. Lipton. A unified view of label shift estimation. arXiv preprint arXiv:2003.07554, 2020.
- Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Y. Geifman and R. El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In International Conference on Machine Learning (ICML), 2019.
- Y. Geifman, G. Uziel, and R. El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. In International Conference on Learning Representations (ICLR), 2018.
- R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
- R. Geirhos, C. R. Temme, J. Rauber, H. H. Schütt, M. Bethge, and F. A. Wichmann. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 7538–7550, 2018.
- R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
- M. Geva, Y. Goldberg, and J. Berant. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
- J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), pages 1263–1272, 2017.
- K. Goel, A. Gu, Y. Li, and C. Ré. Model patching: Closing the subgroup performance gap with data augmentation. arXiv preprint arXiv:2008.06775, 2020.
- B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 2066–2073, 2012.
- I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
- N. Graetz, J. Friedman, A. Osgood-Zimmerman, R. Burstein, M. H. Biehl, C. Shields, J. F. Mosser, D. C. Casey, A. Deshpande, L. Earl, R. C. Reiner, S. E. Ray, N. Fullman, A. J. Levine, R. W. Stubbs, B. K. Mayala, J. Longbottom, A. J. Browne, S. Bhatt, D. J. Weiss, P. W. Gething, A. H. Mokdad, S. S. Lim, C. J. L. Murray, E. Gakidou, and S. I. Hay. Mapping local variation in educational attainment across Africa. Nature, 555, 2018.
- M. Grooten, T. Peterson, and R. E. A. Almond. Living Planet Report 2020: Bending the curve of biodiversity loss. WWF, Gland, Switzerland, 2020.
- S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation (ICRA), 2017.
- I. Gulrajani and D. Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
- A. Gupta, A. Murali, D. Gandhi, and L. Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. In Advances in Neural Information Processing Systems (NIPS), 2018.
- M. N. Gurcan, L. E. Boucheron, A. Can, A. Madabhushi, N. M. Rajpoot, and B. Yener. Histopathological image analysis: A review. IEEE Reviews in Biomedical Engineering, 2:147–171, 2009.
- M. C. Hansen, P. V. Potapov, R. Moore, M. Hancher, S. A. Turubanova, A. Tyukavina, D. Thau, S. V. Stehman, S. J. Goetz, T. R. Loveland, A. Kommareddy, A. Egorov, L. Chini, C. O. Justice, and J. R. G. Townshend. High-resolution global maps of 21st-century forest cover change. Science, 342, 2013.
- T. B. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning (ICML), 2018.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
- Y. He, Z. Shen, and P. Cui. Towards non-IID image classification: A dataset and baselines. Pattern Recognition, 110, 2020.
- V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli. When code completion fails: A case study on real-world completions. In International Conference on Software Engineering (ICSE), pages 960–970, 2019.
- B. E. Henderson, N. H. Lee, V. Seewaldt, and H. Shen. The influence of race and ethnicity on the biology of cancer. Nature Reviews Cancer, 12(9):648–653, 2012.
- D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019.
- D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR), 2017.
- D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, and D. Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132, 2020.
- D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, D. Song, J. Steinhardt, and J. Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241, 2020.
- J. W. Ho, Y. L. Jung, T. Liu, B. H. Alver, S. Lee, K. Ikegami, K. Sohn, A. Minoda, M. Y. Tolstorukov, A. Appert, et al. Comparative analysis of metazoan chromatin organization. Nature, 512(7515):449–452, 2014.
- J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.
- D. Hovy and S. L. Spruit. The social impact of natural language processing. In Association for Computational Linguistics (ACL), pages 591–598, 2016.
- J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080, 2020.
- W. Hu, G. Niu, I. Sato, and M. Sugiyama. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning (ICML), 2018.
- W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.
- J. P. Hughes, S. Rees, S. B. Kalindjian, and K. L. Philpott. Principles of early drug discovery. British Journal of Pharmacology, 162(6):1239–1249, 2011.
- H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
- K. Jaganathan, S. K. Panagiotopoulou, J. F. McRae, S. F. Darbandi, D. Knowles, Y. I. Li, J. A. Kosmicki, J. Arbelaez, W. Cui, G. B. Schwartz, et al. Predicting splicing from primary sequence with deep learning. Cell, 176(3):535–548, 2019.
- N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353, 2016.
- N. Jean, S. M. Xie, and S. Ermon. Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- W. Jin, R. Barzilay, and T. Jaakkola. Enforcing predictive invariance across structured biomedical domains. arXiv preprint arXiv:2006.03908, 2020.
- A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
- J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017.
- J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, K. Tunyasuvunakool, O. Ronneberger, R. Bates, A. Žídek, A. Bridgland, C. Meyer, S. A. A. Kohl, A. Potapenko, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, M. Steinegger, M. Pacholska, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis. High accuracy protein structure prediction using deep learning. Fourteenth Critical Assessment of Techniques for Protein Structure Prediction, 2020.
- J. Jung, S. Goel, J. Skeem, et al. The limits of human predictions of recidivism. Science Advances, 6(7), 2020.
- A. K. Jørgensen, D. Hovy, and A. Søgaard. Challenges of studying and processing dialects in social media. In ACL Workshop on Noisy User-generated Text, pages 9–18, 2015.
- G. Kahn, P. Abbeel, and S. Levine. BADGR: An autonomous self-supervised learning-based navigation system. arXiv preprint arXiv:2002.05700, 2020.
- A. Kamath, R. Jia, and P. Liang. Selective question answering under domain shift. In Association for Computational Linguistics (ACL), 2020.
- Z. Katona, M. Painter, P. N. Patatoukas, and J. Zeng. On the capital market consequences of alternative data: Evidence from outer space. Miami Behavioral Finance Conference, 2018.
- D. Kaushik, E. Hovy, and Z. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations (ICLR), 2019.
- M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning (ICML), pages 2564–2572, 2018.
- J. Keilwagen, S. Posch, and J. Grau. Accurate prediction of cell type-specific transcription factor binding. Genome Biology, 20(1), 2019.
- D. R. Kelley, J. Snoek, and J. L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7):990–999, 2016.
- J. H. Kim, M. Xie, N. Jean, and S. Ermon. Incorporating spatial context and fine-grained detail from satellite imagery to predict poverty. Stanford University, 2016.
- N. Kim and T. Linzen. COGS: A compositional generalization challenge based on semantic interpretation. arXiv preprint arXiv:2010.05465, 2020.
- S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, J. Wang, B. Yu, J. Zhang, and S. H. Bryant. PubChem substance and compound databases. Nucleic Acids Research, 44(D1):D1202–D1213, 2016.
- A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020.
- P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In International Conference on Machine Learning (ICML), 2020.
- B. Kompa, J. Snoek, and A. Beam. Empirical frequentist coverage of deep learning uncertainty quantification procedures. arXiv preprint arXiv:2010.03039, 2020.
- D. Komura and S. Ishikawa. Machine learning methods for histopathological image analysis. Computational and Structural Biotechnology Journal, 16:34–42, 2018.
- S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. Liang. SPoC: Search-based pseudocode to code. In Advances in Neural Information Processing Systems (NeurIPS), pages 11906–11917, 2019.
- C. Kulkarni, P. W. Koh, H. Le, D. Chia, K. Papadopoulos, J. Cheng, D. Koller, and S. R. Klemmer. Peer and self assessment in massive online classes. Design Thinking Research, pages 131–168, 2015.
- C. E. Kulkarni, R. Socher, M. S. Bernstein, and S. R. Klemmer. Scaling short-answer grading by combining peer assessment with algorithmic scoring. In Proceedings of the First ACM Conference on Learning@Scale, pages 99–108, 2014.
- A. Kumar, T. Ma, and P. Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning (ICML), 2020.
- A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.
- B. Lake and M. Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning (ICML), 2018.
- B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- G. Landrum et al. RDKit: Open-source cheminformatics, 2006.
- A. J. Larrazabal, N. Nieto, V. Peterson, D. H. Milone, and E. Ferrante. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences, 117(23):12592–12594, 2020.
- J. Larson, S. Mattu, L. Kirchner, and J. Angwin. How we analyzed the COMPAS recidivism algorithm. ProPublica, 9(1), 2016.
- R. Y. Lau, C. Li, and S. S. Liao. Social analytics: Learning fuzzy product ontologies for aspect-oriented sentiment analysis. Decision Support Systems, 65:80–94, 2014.
- Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10), 2010.
- D. Li, Y. Yang, Y. Song, and T. M. Hospedales. Deeper, broader and artier domain generalization. In International Conference on Computer Vision (ICCV), pages 5542–5550, 2017.
- H. Li and Y. Guan. Leopard: fast decoding cell type-specific transcription factor binding landscape at single-nucleotide resolution. bioRxiv, 2019.
- H. Li, D. Quang, and Y. Guan. Anchor: trans-cell type prediction of transcription factor binding sites. Genome Research, 29(2):281–292, 2019.
- J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Dialogue learning with human-in-the-loop. In International Conference on Learning Representations (ICLR), 2017.
- J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Learning through dialogue interactions by asking questions. In International Conference on Learning Representations (ICLR), 2017.
- T. Li, M. Sanjabi, A. Beirami, and V. Smith. Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497, 2019.
- Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. Revisiting batch normalization for practical domain adaptation. In International Conference on Learning Representations Workshop (ICLRW), 2017.
- S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations (ICLR), 2018.
- M. W. Libbrecht and W. S. Noble. Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6):321–332, 2015.
- Z. Lipton, Y. Wang, and A. Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning (ICML), 2018.
- L. T. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning (ICML), 2018.
- Y. Liu, K. Gadepalli, M. Norouzi, G. E. Dahl, T. Kohlberger, A. Boyko, S. Venugopalan, A. Timofeev, P. Q. Nelson, G. S. Corrado, et al. Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442, 2017.
- M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), pages 97–105, 2015.
- J. Lyu, S. Wang, T. E. Balius, I. Singh, A. Levit, Y. S. Moroz, M. J. O’Meara, T. Che, E. Algaa, K. Tolmachova, et al. Ultra-large library docking for discovering new chemotypes. Nature, 566 (7743):224–229, 2019.
- M. Macenko, M. Niethammer, J. S. Marron, D. Borland, J. T. Woosley, X. Guan, C. Schmitt, and N. E. Thomas. A method for normalizing histology slides for quantitative analysis. In 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 1107–1110, 2009.
- B. A. Malloy and J. F. Power. Quantifying the transition from Python 2 to 3: an empirical study of Python applications. In ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 314–323, 2017.
- M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330, 1993.
- K. McCloskey, E. A. Sigel, S. Kearnes, L. Xue, X. Tian, D. Moccia, D. Gikunju, S. Bazzaz, B. Chan, M. A. Clark, et al. Machine learning on DNA-encoded libraries: A new paradigm for hit finding. Journal of Medicinal Chemistry, 2020.
- R. T. McCoy, J. Min, and T. Linzen. Berts of a feather do not generalize together: Large variability in generalization across models with similar test set performance. arXiv preprint arXiv:1911.02969, 2019.
- R. T. McCoy, E. Pavlick, and T. Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Association for Computational Linguistics (ACL), 2019.
- S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. C. Corrado, A. Darzi, et al. International evaluation of an AI system for breast cancer screening. Nature, 577(7788):89–94, 2020.
- N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
- J. Miller, K. Krauth, B. Recht, and L. Schmidt. The effect of natural distribution shift on question answering models. arXiv preprint arXiv:2004.14444, 2020.
- P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell. Learning to navigate in complex environments. In International Conference on Learning Representations (ICLR), 2017.
- J. E. Moore, M. J. Purcaro, H. E. Pratt, C. B. Epstein, N. Shoresh, J. Adrian, T. Kawli, C. A. Davis, A. Dobin, R. Kaul, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 583(7818):699–710, 2020.
- J. Moult, J. T. Pedersen, R. Judson, and K. Fidelis. A large-scale experiment to assess protein structure prediction methods. Proteins: Structure, Function, and Bioinformatics, 23(3):ii–iv, 1995.
- K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning (ICML), pages 10–18, 2013.
- W. Nekoto, V. Marivate, T. Matsila, T. Fasubaa, T. Kolawole, T. Fagbohungbe, S. O. Akinola, S. H. Muhammad, S. Kabongo, S. Osei, S. Freshia, R. A. Niyongabo, R. Macharm, P. Ogayo, O. Ahia, M. Meressa, M. Adeyemi, M. Mokgesi-Selinga, L. Okegbemi, L. J. Martinus, K. Tajudeen, K. Degila, K. Ogueji, K. Siminyu, J. Kreutzer, J. Webster, J. T. Ali, J. Abbott, I. Orife, I. Ezeani, I. A. Dangana, H. Kamper, H. Elsahar, G. Duru, G. Kioko, E. Murhabazi, E. van Biljon, D. Whitenack, C. Onyefuluchi, C. Emezue, B. Dossou, B. Sibanda, B. I. Bassey, A. Olabiyi, A. Ramkilowan, A. Öktem, A. Akinfaderin, and A. Bashir. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of Empirical Methods in Natural Language Processing (Findings of EMNLP), 2020.
- B. Nestor, M. McDermott, W. Boag, G. Berner, T. Naumann, M. C. Hughes, A. Goldenberg, and M. Ghassemi. Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks. arXiv preprint arXiv:1908.00690, 2019.
- J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Empirical Methods in Natural Language Processing (EMNLP), pages 188–197, 2019.
- M. Nita and D. Notkin. Using twinning to adapt programs to alternative APIs. In International Conference on Software Engineering (ICSE), volume 1, pages 205–214, 2010.
- A. Noor, V. Alegana, P. Gething, A. Tatem, and R. Snow. Using remotely sensed night-time light as a proxy for poverty in Africa. Population Health Metrics, 6, 2008.
- M. S. Norouzzadeh, D. Morris, S. Beery, N. Joshi, N. Jojic, and J. Clune. A deep active learning system for species identification and counting in camera trap images. arXiv preprint arXiv:1910.09716, 2019.
- NYTimes. The Times is partnering with Jigsaw to expand comment capabilities. The New York Times, 2016. URL https://www.nytco.com/press/the-times-is-partnering-with-jigsaw-to-expand-comment-capabilities/.
- Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
- Y. Oren, S. Sagawa, T. Hashimoto, and P. Liang. Distributionally robust language modeling. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
- A. Osgood-Zimmerman, A. I. Millear, R. W. Stubbs, C. Shields, B. V. Pickering, L. Earl, N. Graetz, D. K. Kinyoki, S. E. Ray, S. Bhatt, A. J. Browne, R. Burstein, E. Cameron, D. C. Casey, A. Deshpande, N. Fullman, P. W. Gething, H. S. Gibson, N. J. Henry, M. Herrero, L. K. Krause, I. D. Letourneau, A. J. Levine, P. Y. Liu, J. Longbottom, B. K. Mayala, J. F. Mosser, A. M. Noor, D. M. Pigott, E. G. Piwoz, P. Rao, R. Rawat, R. C. Reiner, D. L. Smith, D. J. Weiss, K. E. Wiens, A. H. Mokdad, S. S. Lim, C. J. L. Murray, N. J. Kassebaum, and S. I. Hay. Mapping child growth failure in Africa between 2000 and 2015. Nature, 555, 2018.
- Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- S. J. Pan, X. Ni, J. Sun, Q. Yang, and Z. Chen. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International World Wide Web Conference, pages 751–760, 2010.
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5206–5210, 2015.
- J. Parham, J. Crall, C. Stewart, T. Berger-Wolf, and D. I. Rubenstein. Animal population censusing at scale with citizen science and photographic identification. In AAAI Spring Symposium - Technical Report, 2017.
- J. H. Park, J. Shin, and P. Fung. Reducing gender bias in abusive language detection. In Empirical Methods in Natural Language Processing (EMNLP), pages 2799–2804, 2018.
- G. K. Patro, A. Biswas, N. Ganguly, K. P. Gummadi, and A. Chakraborty. FairRec: Two-sided fairness for personalized recommendations in two-sided platforms. In Proceedings of The Web Conference 2020, pages 1194–1204, 2020.
- X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, and K. Saenko. VisDA: A synthetic-to-real benchmark for visual domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 2021–2026, 2018.
- X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In International Conference on Computer Vision (ICCV), 2019.
- X. Peng, E. Coumans, T. Zhang, T. Lee, J. Tan, and S. Levine. Learning agile robotic locomotion skills by imitating animals. In Robotics: Science and Systems (RSS), 2020.
- L. Perelman. When “the state of the art” is counting words. Assessing Writing, 21:104–111, 2014.
- N. A. Phillips, P. Rajpurkar, M. Sabini, R. Krishnan, S. Zhou, A. Pareek, N. M. Phu, C. Wang, A. Y. Ng, and M. P. Lungren. CheXphoto: 10,000+ smartphone photos and synthetic photographic transformations of chest X-rays for benchmarking deep learning robustness. arXiv preprint arXiv:2007.06199, 2020.
- C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller. Tuned models of peer assessment in MOOCs. Educational Data Mining, 2013.
- M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
- K. A. Pipal, J. J. Notch, S. A. Hayes, and P. B. Adams. Estimating escapement for a low-abundance steelhead population using dual-frequency identification sonar (DIDSON). North American Journal of Fisheries Management, 32(5):880–893, 2012.
- W. N. Price and I. G. Cohen. Privacy in the age of medical big data. Nature Medicine, 25(1):37–43, 2019.
- D. Quang and X. Xie. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods, 166:40–47, 2019.
- J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
- V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 419–428, 2014.
- C. Ré, F. Niu, P. Gudipati, and C. Srisuwananukorn. Overton: A data system for monitoring and improving machine-learned products. arXiv preprint arXiv:1909.05372, 2019.
- B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019.
- R. C. Reiner, N. Graetz, D. C. Casey, C. Troeger, G. M. Garcia, J. F. Mosser, A. Deshpande, S. J. Swartz, S. E. Ray, B. F. Blacker, P. C. Rao, A. Osgood-Zimmerman, R. Burstein, D. M. Pigott, I. M. Davis, I. D. Letourneau, L. Earl, J. M. Ross, I. A. Khalil, T. H. Farag, O. J. Brady, M. U. Kraemer, D. L. Smith, S. Bhatt, D. J. Weiss, P. W. Gething, N. J. Kassebaum, A. H. Mokdad, C. J. Murray, and S. I. Hay. Variation in childhood diarrheal morbidity and mortality in Africa, 2000–2015. New England Journal of Medicine, 379, 2018.
- D. Reker. Practical considerations for active machine learning in drug discovery. Drug Discovery Today: Technologies, 2020.
- M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Association for Computational Linguistics (ACL), pages 4902–4912, 2020.
- S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118, 2016.
- M. Rigaki and S. Garcia. Bringing a GAN to a knife-fight: Adapting malware communication to avoid detection. In 2018 IEEE Security and Privacy Workshops (SPW), pages 70–75, 2018.
- E. Rolf, M. I. Jordan, and B. Recht. Post-estimation smoothing: A simple baseline for learning with side information. In Artificial Intelligence and Statistics (AISTATS), 2020.
- G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.
- Amir Rosenfeld, Richard Zemel, and John K Tsotsos. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
- F. Sadeghi and S. Levine. CAD2RL: Real single-image flight without a single real image. In Robotics: Science and Systems (RSS), 2017.
- K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226, 2010.
- M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.
- S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.
- S. Sagawa, A. Raghunathan, P. W. Koh, and P. Liang. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning (ICML), 2020.
- D. E. Sahn and D. Stifel. Exploring alternative measures of welfare in the absence of expenditure data. The Review of Income and Wealth, 49, 2003.
- S. Santurkar, D. Tsipras, and A. Madry. BREEDS: Benchmarks for subpopulation shift. arXiv, 2020.
- Stefan Schneider and Alex Zhuang. Counting fish and dolphins in sonar images using deep learning. arXiv preprint arXiv:2007.12808, 2020.
- L. Seyyed-Kalantari, G. Liu, M. McDermott, and M. Ghassemi. CheXclusion: Fairness gaps in deep chest X-ray classifiers. arXiv preprint arXiv:2003.00827, 2020.
- Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. Advances in Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning for the Developing World, 2017.
- V. Shankar, A. Dave, R. Roelofs, D. Ramanan, B. Recht, and L. Schmidt. Do image classifiers generalize across time? arXiv preprint arXiv:1906.02168, 2019.
- J. Shen, Y. Qu, W. Zhang, and Y. Yu. Wasserstein distance guided representation learning for domain adaptation. In Association for the Advancement of Artificial Intelligence (AAAI), 2018.
- M. D. Shermis. State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20:53–76, 2014.
- Rakshith Shetty, Bernt Schiele, and Mario Fritz. Not using the car to see the sidewalk: quantifying and controlling the effects of context in classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8218–8226, 2019.
- H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90:227–244, 2000.
- Richard Shin, Neel Kant, Kavi Gupta, Christopher Bender, Brandon Trabucco, Rishabh Singh, and Dawn Song. Synthetic datasets for neural program synthesis. In International Conference on Learning Representations (ICLR), 2019.
- Yu Shiu, KJ Palmer, Marie A Roch, Erica Fleishman, Xiaobai Liu, Eva-Marie Nosal, Tyler Helble, Danielle Cholewiak, Douglas Gillespie, and Holger Klinck. Deep neural networks for automated detection of marine mammal species. Scientific Reports, 10(1):1–12, 2020.
- Brian K Shoichet. Virtual screening of chemical libraries. Nature, 432(7019):862–865, 2004.
- N. Sohoni, J. Dunnmon, G. Angus, A. Gu, and C. Ré. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- D. Srivastava and S. Mahony. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 1863(6), 2020.
- Teague Sterling and John J. Irwin. ZINC 15 – ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337, 2015. doi: 10.1021/acs.jcim.5b00559. PMID: 26479676.
- Dan Stowell, Michael D Wood, Hanna Pamuła, Yannis Stylianou, and Hervé Glotin. Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge. Methods in Ecology and Evolution, 10(3):368–380, 2019.
- A. Subbaswamy, R. Adams, and S. Saria. Evaluating model robustness to dataset shift. arXiv preprint arXiv:2010.15100, 2020.
- B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European conference on computer vision, pages 443–450, 2016.
- B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In Association for the Advancement of Artificial Intelligence (AAAI), 2016.
- P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, S. Zhao, S. Cheng, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning (ICML), 2020.
- Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2727–2735, 2019.
- Michael A Tabak, Mohammad S Norouzzadeh, David W Wolfson, Steven J Sweeney, Kurt C VerCauteren, Nathan P Snow, Joseph M Halseth, Paul A Di Salvo, Jesse S Lewis, Michael D White, et al. Machine learning to classify animal species in camera trap images: Applications in ecology. Methods in Ecology and Evolution, 10(4):585–590, 2019.
- K. Taghipour and H. T. Ng. A neural approach to automated essay scoring. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1882–1891, 2016.
- R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt. Measuring robustness to natural distribution shifts in image classification. arXiv preprint arXiv:2007.00644, 2020.
- R. Tatman. Gender and dialect bias in YouTube's automatic captions. In Workshop on Ethics in Natural Language Processing, volume 1, pages 53–59, 2017.
- D. Tellez, M. Balkenhol, I. Otte-Höller, R. van de Loo, R. Vogels, P. Bult, C. Wauters, W. Vreuls, S. Mol, N. Karssemeijer, et al. Whole-slide mitosis detection in H&E breast histology using PHH3 as a reference to train distilled stain-invariant convolutional networks. IEEE Transactions on Medical Imaging, 37(9):2126–2136, 2018.
- D. Tellez, G. Litjens, P. Bándi, W. Bulten, J. Bokhorst, F. Ciompi, and J. van der Laak. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Medical Image Analysis, 58, 2019.
- Dogancan Temel, Jinsol Lee, and Ghassan AlRegib. Cure-or: Challenging unreal and real environments for object recognition. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 137–144. IEEE, 2018.
- T. G. Tiecke, X. Liu, A. Zhang, A. Gros, N. Li, G. Yetman, T. Kilic, S. Murray, B. Blankespoor, E. B. Prydz, and H. H. Dang. Mapping the world population one building at a time. arXiv, 2017.
- J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems (IROS), 2017.
- A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), pages 1521–1528, 2011.
- E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.
- B. Uzkent and S. Ermon. Learning when and where to zoom with deep reinforcement learning. In Computer Vision and Pattern Recognition (CVPR), 2020.
- Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. Neural program repair by jointly learning to localize and repair. In International Conference on Learning Representations (ICLR), 2019.
- Sindre Vatnehol, Hector Peña, and Nils Olav Handegard. A method to automatically detect fish aggregations using horizontally scanning sonar. ICES Journal of Marine Science, 75(5):1803–1812, 2018.
- B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling. Rotation equivariant CNNs for digital pathology. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 210–218, 2018.
- H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 5018–5027, 2017.
- M. Veta, P. J. V. Diest, M. Jiwa, S. Al-Janabi, and J. P. Pluim. Mitosis counting in breast cancer: Object-level interobserver agreement and comparison to an automatic method. PloS one, 11(8), 2016.
- M. Veta, Y. J. Heng, N. Stathonikos, B. E. Bejnordi, F. Beca, T. Wollmann, K. Rohr, M. A. Shah, D. Wang, M. Rousson, et al. Predicting breast tumor proliferation from whole-slide images: the TUPAC16 challenge. Medical Image Analysis, 54:111–121, 2019.
- R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell. Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
- H. Wang, S. Ge, Z. Lipton, and E. P. Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun. TorontoCity: Seeing the world with a million eyes. In International Conference on Computer Vision (ICCV), 2017.
- S. Wang, W. Chen, S. M. Xie, G. Azzari, and D. B. Lobell. Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sensing, 12, 2020.
- OR Wearn and P Glover-Kapfer. Camera-trapping for conservation: a guide to best-practices. WWF conservation technology series, 1(1):2019–04, 2017.
- S. Weinberger. Speech accent archive. George Mason University, 2015.
- Ben G Weinstein. A computer vision for animal ecology. Journal of Animal Ecology, 87(3):533–545, 2018.
- J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, C. G. A. R. Network, et al. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10), 2013.
- R. West, H. S. Paskov, J. Leskovec, and C. Potts. Exploiting social network structure for person-to-person sentiment analysis. Transactions of the Association for Computational Linguistics (TACL), 2:297–310, 2014.
- G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine learning, 23(1):69–101, 1996.
- J. J. Williams, J. Kim, A. Rafferty, S. Maldonado, K. Z. Gajos, W. S. Lasecki, and N. Heffernan. Axis: Generating explanations at scale with learnersourcing and machine learning. In Proceedings of the Third (2016) ACM Conference on Learning@Scale, pages 379–388, 2016.
- Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. Predictive inequity in object detection. arXiv preprint arXiv:1902.11097, 2019.
- T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. HuggingFace’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- Ho Yuen Frank Wong, Hiu Yin Sonia Lam, Ambrose Ho-Tung Fong, Siu Ting Leung, Thomas Wing-Yan Chin, Christine Shing Yen Lo, Macy Mei-Sze Lui, Jonan Chun Yin Lee, Keith Wan-Hang Chiu, Tom Chung, et al. Frequency and distribution of chest radiographic findings in COVID-19 positive patients. Radiology, page 201160, 2020.
- D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Computer Vision and Pattern Recognition (CVPR), pages 5028–5037, 2017.
- M. Wu, M. Mosse, N. Goodman, and C. Piech. Zero shot learning for code education: Rubric sampling with deep learning inference. In Association for the Advancement of Artificial Intelligence (AAAI), volume 33, pages 782–790, 2019.
- M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. International Conference on Educational Data Mining, 2020.
- Y. Wu, E. Winston, D. Kaushik, and Z. Lipton. Domain adaptation with asymmetrically-relaxed distribution alignment. In International Conference on Machine Learning (ICML), pages 6872–6881, 2019.
- Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
- M. Wulfmeier, A. Bewley, and I. Posner. Incremental adversarial domain adaptation for continually changing environments. In International Conference on Robotics and Automation (ICRA), 2018.
- K. Xiao, L. Engstrom, A. Ilyas, and A. Madry. Noise or signal: The role of image backgrounds in object recognition. arXiv preprint arXiv:2006.09994, 2020.
- M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon. Transfer learning from deep features for remote sensing and poverty mapping. In Association for the Advancement of Artificial Intelligence (AAAI), 2016.
- S. M. Xie, A. Kumar, R. Jones, F. Khani, T. Ma, and P. Liang. In-N-Out: Pre-training and self-training using auxiliary information for out-of-distribution robustness. arXiv, 2020.
- K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2018.
- Y. Yang and S. Newsam. Bag-of-visual-words and spatial extensions for land-use classification. Geographic Information Systems, 2010.
- Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani. Data efficient reinforcement learning for legged robots. In Conference on Robot Learning (CoRL), 2019.
- Michihiro Yasunaga and Percy Liang. Graph-based, self-supervised program repair from diagnostic feedback. In International Conference on Machine Learning (ICML), 2020.
- C. Yeh, A. Perez, A. Driscoll, G. Azzari, Z. Tang, D. Lobell, S. Ermon, and M. Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications, 11, 2020.
- J. You, X. Li, M. Low, D. Lobell, and S. Ermon. Deep gaussian process for crop yield prediction based on remote sensing data. In Association for the Advancement of Artificial Intelligence (AAAI), 2017.
- F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 2018.
- K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning (ICML), pages 819–827, 2013.
- M. Zhang, H. Marklund, N. Dhawan, A. Gupta, S. Levine, and C. Finn. Adaptive risk minimization: A meta-learning approach for tackling group shift. arXiv preprint arXiv:2007.02931, 2020.
- Y. Zhang, J. Baldridge, and L. He. Paws: Paraphrase adversaries from word scrambling. In North American Association for Computational Linguistics (NAACL), 2019.
- J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning– based sequence model. Nature Methods, 12(10):931–934, 2015.
- X. Zhou, Y. Nie, H. Tan, and M. Bansal. The curse of performance instability in analysis datasets: Consequences, source, and suggestions. arXiv preprint arXiv:2004.13606, 2020.
- C. L. Zitnick, L. Chanussot, A. Das, S. Goyal, J. Heras-Domingo, C. Ho, W. Hu, T. Lavril, A. Palizhati, M. Riviere, M. Shuaibi, A. Sriram, K. Tran, B. Wood, J. Yoon, D. Parikh, and Z. Ulissi. An introduction to electrocatalyst design using machine learning for renewable energy storage. arXiv preprint arXiv:2010.09435, 2020.
- 2. Label × Black: 4 subsets, one for each combination of label and Black.
- 2. Validation (OOD): reviews in categories unseen during training.
- 3. Test (OOD): reviews in categories unseen during training.
- 4. Validation (ID): reviews in training categories.
- 5. Test (ID): reviews in training categories.
- 2. Validation (OOD): 20,000 reviews written in years 2014 to 2018.
- 3. Test (OOD): 20,000 reviews written in years 2014 to 2018.
- 2. Validation (OOD): 20,000 reviews written in years 2014 to 2019.
- 3. Test (OOD): 20,000 reviews written in years 2014 to 2019.
- 2. Validation (OOD): 40,000 reviews from another set of 1,600 reviewers, distinct from training and test (OOD).
- 3. Test (OOD): 40,000 reviews from another set of 1,600 reviewers, distinct from training and validation (OOD).
- 4. Validation (ID): 40,000 reviews from 1,600 of the 11,856 reviewers in the training set.
- 5. Test (ID): 40,000 reviews from 1,600 of the 11,856 reviewers in the training set.
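The splits above all follow the same pattern: entire groups (product categories, years, or reviewers) are held out, so that the OOD validation and test sets share no group with the training set. The following minimal sketch, not the actual WILDS code, illustrates that mechanism for the reviewer-based split; the `reviewer` field name is hypothetical.

```python
import random

def split_by_group(examples, group_key, n_ood_val, n_ood_test, seed=0):
    """Assign whole groups (e.g., reviewers) to OOD splits so that no
    group appears in more than one of train / val (OOD) / test (OOD)."""
    rng = random.Random(seed)
    groups = sorted({ex[group_key] for ex in examples})
    rng.shuffle(groups)
    ood_val = set(groups[:n_ood_val])
    ood_test = set(groups[n_ood_val:n_ood_val + n_ood_test])
    splits = {"train": [], "val_ood": [], "test_ood": []}
    for ex in examples:
        g = ex[group_key]
        if g in ood_val:
            splits["val_ood"].append(ex)
        elif g in ood_test:
            splits["test_ood"].append(ex)
        else:
            splits["train"].append(ex)
    return splits

# Example: 100 reviews from 10 reviewers; hold out 2 reviewers each
# for OOD validation and OOD test.
reviews = [{"reviewer": i % 10, "text": f"review {i}"} for i in range(100)]
splits = split_by_group(reviews, "reviewer", n_ood_val=2, n_ood_test=2)
```

ID validation and test sets would instead be sampled from the held-in groups, which is why reviewers in splits 4 and 5 above overlap with the training reviewers while those in splits 2 and 3 do not.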