
WILDS: A Benchmark of in-the-Wild Distribution Shifts

Cited by 71 | Viewed 238

Abstract

Distribution shifts can cause significant degradation in a broad range of machine learning (ML) systems deployed in the wild. However, many widely-used datasets in the ML community today were not designed for evaluating distribution shifts. These datasets typically have training and test sets drawn from the same distribution, and prior ...

Introduction
  • Distribution shifts—mismatches in data distributions between training and test time—pose significant challenges for machine learning (ML) systems deployed in the wild.
  • In contrast to general-purpose ML research, domain experts applying ML in their respective areas are often forced to grapple with distribution shifts in order to make progress on real-world problems
  • As a result, these application areas are rich sources of datasets with distribution shifts that arise in the wild, e.g., in medicine (Chen et al, 2020), computational biology (Leek et al, 2010), wildlife conservation (Beery et al, 2018), satellite imagery (Jean et al, 2016), and so on.
Highlights
  • Distribution shifts—mismatches in data distributions between training and test time—pose significant challenges for machine learning (ML) systems deployed in the wild
  • As we show in Appendix B.1, we find that there is a large label distribution shift between non-African regions and Africa, suggesting that the drop in performance may be due in part to label shift
  • There is drastic variation in illumination, camera angle, background, vegetation, and color. This variation, coupled with considerable differences in the distribution of animals between camera traps, likely encourages the model to overfit to specific animal species appearing in specific locations, which may account for the performance drop
  • Histopathology datasets can be unwieldy for ML models, as individual images can be several gigabytes large; extracting patches involves many design choices; the classes are typically very unbalanced; and evaluation often relies on more complex slide-level measures such as the free-response receiver operating characteristic (FROC) (Gurcan et al, 2009)
  • Prior work has shown that differences in staining between hospitals are the primary source of variation in this dataset, and that specialized stain augmentation methods can close the in- and out-of-distribution accuracy gap on a variant of the dataset based on the same underlying slides (Tellez et al, 2019)
  • We show that deep CORAL and invariant risk minimization (IRM) fail to improve performance on unseen users
  • Performance disparities across individuals have been observed in a wide range of tasks and applications, including in natural language processing (Geva et al, 2019), automatic speech recognition (Koenecke et al, 2020; Tatman, 2017), federated learning (Li et al, 2019; Caldas et al, 2018), and medical imaging (Badgeley et al, 2019)
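The label-shift observation in the highlights above can be checked by comparing per-domain label frequencies directly. A minimal sketch, using hypothetical class labels for two regions (in FMoW-wilds these would be land-use classes grouped by geographic region), and total variation distance as the shift measure:

```python
from collections import Counter

def label_distribution(labels):
    """Empirical label frequencies: label -> proportion."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical labels from two regions (not real FMoW-wilds data).
labels_other = ["crop", "crop", "road", "crop", "water", "road"]
labels_africa = ["water", "water", "crop", "water", "road", "water"]

dist_other = label_distribution(labels_other)
dist_africa = label_distribution(labels_africa)

# Total variation distance between the two label distributions;
# a large value indicates label shift between regions.
support = set(dist_other) | set(dist_africa)
tv = 0.5 * sum(abs(dist_other.get(c, 0.0) - dist_africa.get(c, 0.0)) for c in support)
print(f"TV distance between regional label distributions: {tv:.3f}")
```

If the label distributions differed only by sampling noise, the TV distance would be near zero; a value like the one above signals that some of the OOD accuracy drop can be attributed to label shift rather than covariate shift alone.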
Results
  • The authors evaluate models by their average and worst-region OOD accuracies. The former measures the ability of the model to generalize across time, while the latter measures how well models do across different regions/subpopulations under a time shift.

  • In evaluating the trained models, the authors consider average accuracy across the binary classification tasks, averaged over each of the validation and test sets separately.
  • To assess whether models generalize to unseen categories, the authors evaluate models by their average accuracy on each of the categories in the OOD test set.
  • A BERT-base-uncased model trained with the standard ERM objective performs well on the OOD test set, achieving 76.0% accuracy on average and 75.4% on the worst year (Table 29).
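The average and worst-region evaluation described above reduces to computing overall accuracy plus the minimum per-group accuracy. A minimal sketch with hypothetical per-example correctness indicators and region labels (not drawn from the actual benchmark):

```python
def average_and_worst_group_accuracy(correct, groups):
    """Overall accuracy and minimum per-group accuracy.

    correct: 0/1 indicators (1 = prediction correct)
    groups:  parallel list of group labels (e.g., geographic regions)
    """
    overall = sum(correct) / len(correct)
    per_group = {}
    for c, g in zip(correct, groups):
        per_group.setdefault(g, []).append(c)
    group_accs = {g: sum(v) / len(v) for g, v in per_group.items()}
    return overall, min(group_accs.values()), group_accs

# Hypothetical per-example correctness over three regions.
correct = [1, 1, 0, 1, 0, 0, 1, 1]
groups = ["americas", "americas", "americas",
          "europe", "europe",
          "africa", "africa", "africa"]
overall, worst, per_region = average_and_worst_group_accuracy(correct, groups)
# overall = 5/8 = 0.625; the worst region here is "europe" at 0.5
```

Average accuracy can look healthy while the worst-region number reveals a subpopulation on which the model fails, which is exactly the gap the benchmark's worst-region metric is designed to surface.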
Conclusion
  • The authors associate each dataset in Wilds with the problem setting that the authors believe best reflects the real-world challenges in the corresponding application area.
  • Prior work has shown that there is often insufficient information at training time to distinguish models that would generalize well under a particular distribution shift; many models that perform similarly in-distribution (ID) can vary substantially out-of-distribution (OOD)
  • This instability in OOD performance has been reported in natural language processing settings (McCoy et al, 2019; Kim and Linzen, 2020) and in vision and healthcare applications (D’Amour et al, 2020).
  • The authors speculate that the relatively similar ID and OOD variances in the other datasets could be in part because of this, and in part because the authors select models based on their OOD validation performance, but further investigation is required
Tables
  • Table1: The Wilds benchmark contains 7 datasets across a diverse set of application areas and data modalities. Each dataset comprises data from different domains, and the benchmark is set up to evaluate models on distribution shifts across these domains
  • Table2: In both domain generalization and subpopulation shift settings, domain information is available to the model at training time. At test time, domain information can be available in domain generalization, but not in subpopulation shift
  • Table3: Time shift and worst-region accuracies (%) for models trained on data before 2013 and tested on held-out locations from in-distribution (ID) or out-of-distribution (OOD) test sets in FMoW-wilds. The models are early-stopped with respect to OOD validation accuracy. Standard deviations over 3 trials are in parentheses
  • Table4: Performance drops for ERM models on FMoW-wilds. In the standard split, we train on data from 2002–2013, whereas in the mixed split, we train on the same amount of data but half from 2002–2013 and half from 2013–2018. In both cases, we test on data from 2016–2018. Models trained on the standard split degrade in performance under the time shift, especially on the last year (2017) of the test data, and also fare poorly on the subpopulation shift, with low worst-region accuracy. Models trained on the mixed split have higher OOD average and last year accuracy and much higher OOD worst-region accuracy
  • Table5: Region shift results (accuracy, %) for models trained on data before 2013 and tested on held-out locations from ID (< 2013) or OOD (≥ 2016) test sets in FMoW-wilds. One standard deviation shown in parentheses
  • Table6: Pearson correlation r (higher is better) on in-distribution and out-of-distribution (unseen countries) held-out sets in PovertyMap-wilds, including results on rural or urban subpopulations. All results are averaged over 5 different OOD country folds taken from Yeh et al. (2020), with standard deviations across different folds in parentheses. All models are early-stopped with respect to OOD validation MSE
  • Table7: Performance drops for ERM models on PovertyMap-wilds. In the standard split, we train on data from one set of countries, and then test on a different set of countries. In the mixed split, we train on the same amount of data but sampled uniformly from all countries. Models trained on the standard split degrade in performance, especially on rural subpopulations, while models trained on the mixed split do not
  • Table8: Baseline results on iWildCam2020-wilds
  • Table9: Baseline results on Camelyon17-wilds. Parentheses show standard deviation across 10 replicates
  • Table10: Performance drops for ERM models on Camelyon17-wilds. In the standard split, we train on data from three hospitals and evaluate on a different test hospital, whereas in the mixed split, we add data from one extra slide from the test hospital to the training set. The original test set has data from 10 slides; here, we report performance for both splits on 9 slides (without the slide that was moved to the training set). This makes the numbers (74.1 vs. 73.3) for the standard split slightly different from Table 9. Parentheses show standard deviation across 10 replicates
  • Table11: Baseline results on OGB-MolPCBA. Parentheses show standard deviation across 3 replicates
  • Table12: Out-of-distribution vs. in-distribution performance for ERM models on OGB-MolPCBA. In the standard split, we train on molecules from some scaffolds and evaluate on molecules from different scaffolds, whereas in the mixed split, we randomly divide molecules into training and test sets without using scaffold information
  • Table13: Baseline results on Amazon-wilds. We report the accuracy of models trained using three baseline algorithms: ERM, DeepCORAL, and IRM. In addition to the average accuracy across all reviews, we compute the accuracy for each reviewer and report the performance for the reviewer in the 10th percentile
  • Table14: Comparison with in-distribution baselines on Amazon-wilds. To demonstrate that the poor out-of-distribution (OOD) performance of the ERM model (Table 13) stems from the distribution shift, we compare with in-distribution (ID) baseline models, which are oracle models finetuned on each reviewer. We report the average accuracy on a fixed set of 10 reviewers that are in the 10th percentile or below for the ERM model. Despite being trained on data that are orders of magnitude smaller (less than 1,000 reviews per user, compared to the full training set of 1 million reviews), the oracle baseline models outperform the ERM models
  • Table15: Baseline results on CivilComments-wilds. The reweighted (label) algorithm samples equally from the positive and negative class; the group DRO (label) algorithm additionally weights these classes so as to minimize the maximum of the average positive training loss and average negative training loss. Similarly, the reweighted (label × Black) and group DRO (label × Black) algorithms sample equally from the four groups corresponding to all combinations of class and whether there is a mention of Black identity. We show standard deviation across random seeds in parentheses
  • Table16: Accuracies on each subpopulation in CivilComments-wilds, averaged over models trained by group DRO (label)
  • Table17: Wilds focuses on two specific settings of domain shift: domain generalization and subpopulation shift. These two settings vary only in whether the test domains are seen or unseen. Other problem settings that can apply to Wilds datasets include test-time adaptation and unsupervised domain adaptation
  • Table18: Time shift accuracies (%) for models trained on data before 2013 and tested on held-out locations from in-distribution (ID) or out-of-distribution (OOD) test sets in FMoW-wilds. The accuracy of ERM drops significantly in the last year of the dataset. The models are early-stopped with respect to OOD validation accuracy. Standard deviations over 3 trials are in parentheses. Mixed split models use both ID + OOD training examples
  • Table19: Pearson correlation r (higher is better) on in-distribution and out-of-distribution (unseen countries) held-out sets in PovertyMap-wilds, including results on rural or urban subpopulations. All results are averaged over 5 different OOD country folds taken from Yeh et al. (2020), with standard deviations across different folds in parentheses. All models are early-stopped with respect to OOD validation MSE. (- NL) models do not use nighttime light as input. Mixed split models use both ID + OOD examples as training data
  • Table20: Mean squared error (MSE) on in-distribution and out-of-distribution (unseen countries) held-out sets in PovertyMap-wilds. All results are averaged over 5 folds taken from Yeh et al. (2020). All models are early-stopped with respect to OOD validation MSE. (- NL) models do not use nighttime light as input. Mixed split models use both ID + OOD examples as training data
  • Table21: Dataset details for iWildCam2020-wilds
  • Table22: Dataset details for Amazon-wilds
  • Table23: Additional results of baseline models on Amazon-wilds
  • Table24: Additional results of in-distribution baseline models on Amazon-wilds
  • Table25: Group sizes in the test data for CivilComments-wilds. The training and validation data follow similar proportions
  • Table26: CivilComments-wilds results for the Group DRO (label × Black) model with early stopping on accuracy on comments that mention the Black identity. Compared to the Group DRO (label) model in Table 15, accuracy on Black comments is higher but accuracy on LGBTQ comments is lower. We show standard deviation across random seeds in parentheses
  • Table27: Average multi-task classification accuracy of ERM trained models on BDD100K. All results are reported across 3 random seeds, with standard deviation in parentheses. We observe no substantial drops in the presence of test time distribution shifts
  • Table28: Baseline results on category shifts on the Amazon Reviews Dataset. We report the accuracy of models trained using ERM on a single category versus four categories. Across many categories unseen at training time, the latter model modestly but consistently outperforms the former
  • Table29: Baseline results on time shifts on the Amazon Reviews Dataset. We report the accuracy of models trained using ERM. In addition to the average accuracy across all years in each split, we report the accuracy for the worst-case year
  • Table30: Comparison with in-distribution baselines for time shifts on Amazon Reviews Dataset. We observe only modest performance drops due to time shifts
  • Table31: Baseline results on time shifts on the Yelp Open Dataset. We report the accuracy of models trained using ERM
  • Table32: Comparison with in-distribution baselines for time shifts on Yelp Open Dataset. We observe only modest performance drops due to time shifts
  • Table33: Baseline results on user shifts on the Yelp Open Dataset. We report the accuracy of models trained using ERM. In addition to the average accuracy across all reviews, we compute the accuracy for each reviewer and report the performance for the reviewer in the 10th percentile
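The group DRO objective described in the Table 15 caption (minimizing the maximum of the per-group average losses) is commonly implemented with an exponentiated-gradient update on the group weights. A minimal sketch with hypothetical, fixed per-group losses, not the actual training procedure used in the benchmark:

```python
import math

def group_dro_step(weights, group_losses, eta=0.1):
    """One exponentiated-gradient ascent step on the group weights.

    Each weight is scaled up in proportion to its group's average
    loss, so the weighted objective tracks the worst-off group.
    """
    updated = [w * math.exp(eta * loss) for w, loss in zip(weights, group_losses)]
    total = sum(updated)
    return [w / total for w in updated]

# Two groups, e.g. positive vs. negative class as in reweighted (label).
weights = [0.5, 0.5]
group_losses = [1.2, 0.4]  # hypothetical: the positive class is harder
for _ in range(100):
    weights = group_dro_step(weights, group_losses)

# Nearly all weight shifts onto the harder group, so minimizing the
# weighted loss approximates minimizing the maximum per-group loss.
```

In practice the group losses change as the model trains, so the weight update is interleaved with gradient steps on the model; this toy loop holds the losses fixed only to show the weight dynamics.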
Funding
  • We are grateful for all of the helpful suggestions and constructive feedback from: Aditya Khosla, Andreas Schlueter, Annie Chen, Alexander D’Amour, Allison Koenecke, Alyssa Lees, Ananya Kumar, Andrew Beck, Behzad Haghgoo, Charles Sutton, Christopher Yeh, Cody Coleman, Dan Jurafsky, Daniel Levy, Daphne Koller, David Tellez, Erik Jones, Evan Liu, Fisher Yu, Georgi Marinov, Irena Gao, Irene Chen, Jacky Kang, Jacob Schreiber, Jacob Steinhardt, Jared Dunnmon, Jean Feng, Jeffrey Sorensen, Jianmo Ni, John Hewitt, Kate Saenko, Kelly Cochran, Kensen Shi, Kyle Loh, Li Jiang, Lucy Vasserman, Ludwig Schmidt, Luke Oakden-Rayner, Marco Tulio Ribeiro, Matthew Lungren, Megha Srivastava, Nimit Sohoni, Pranav Rajpurkar, Robin Jia, Rohan Taori, Sarah Bird, Sharad Goel, Sherrie Wang, Stefano Ermon, Steve Yadlowsky, Tatsunori Hashimoto, Vincent Hellendoorn, Yair Carmon, Zachary Lipton, and Zhenghao Chen. The design of the WILDS benchmark was inspired by the Open Graph Benchmark (Hu et al, 2020), and we are grateful to the Open Graph Benchmark team for their advice and help in setting up our benchmark. This project was funded by an Open Philanthropy Project Award and NSF Award Grant No 1805310
  • Sagawa was supported by the Herbert Kunzel Stanford Graduate Fellowship
  • Marklund was supported by the Dr Tech
  • Zhang were supported by NDSEG Graduate Fellowships
  • Hu was supported by the Funai Overseas Scholarship and the Masason Foundation Fellowship
  • Beery was supported by an NSF Graduate Research Fellowship and is a PIMCO Fellow in Data Science
Study subjects and analysis
datasets: 7
Wilds datasets span a diverse array of societally-important applications with natural distribution shifts: poverty mapping (Yeh et al, 2020), building and land use classification (Christie et al, 2018), animal species categorization (Beery et al, 2020), predicting text toxicity (Borkan et al, 2019), sentiment analysis (Ni et al, 2019), tumor identification (Bandi et al, 2018), and bioassay prediction (Hu et al, 2020). At present, there are 7 datasets in Wilds (Table 1), reflecting distribution shifts arising from different demographics, users, hospitals, camera locations, countries, time periods, and molecular scaffolds. Wilds builds on top of extensive data-collection efforts by domain experts

datasets: 7
Wilds datasets. We now discuss the 7 datasets in the Wilds benchmark, summarized in Table 1. For each dataset, we first describe the task, the distribution shift, and the evaluation criteria

crowdworkers: 10
We model toxicity classification as a binary task. Toxicity labels were obtained in the original dataset via crowdsourcing and majority vote, with each comment being reviewed by at least 10 crowdworkers. Annotations of demographic mentions were similarly obtained through crowdsourcing and majority vote
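The majority-vote aggregation described above can be sketched in a few lines. A toy illustration with hypothetical crowdworker votes; ties resolve to the non-toxic label here, which is a simplifying assumption rather than the dataset's documented tie-breaking rule:

```python
from collections import Counter

def majority_vote(votes):
    """Aggregate binary crowdworker votes (1 = toxic) by majority.

    Assumption for this sketch: ties resolve to the non-toxic label.
    """
    counts = Counter(votes)
    return 1 if counts[1] > counts[0] else 0

# Ten hypothetical crowdworker votes on a single comment.
votes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
label = majority_vote(votes)  # 7 of 10 workers vote toxic -> label 1
```

The same aggregation applies to the demographic-mention annotations, which were likewise collected by crowdsourcing and majority vote.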

References
  • B. Abelson, K. R. Varshney, and J. Sun. Targeting direct cash transfers to the extremely poor. In International Conference on Knowledge Discovery and Data Mining (KDD), 2014.
  • R. Adragna, E. Creager, D. Madras, and R. Zemel. Fairness and robustness in invariant learning: A case study in toxicity classification. arXiv preprint arXiv:2011.06485, 2020.
  • A. Ahadi, R. Lister, H. Haapala, and A. Vihavainen. Exploring machine learning methods to automatically identify students in need of assistance. In Proceedings of the Eleventh Annual International Conference on International Computing Education Research, pages 121–130, 2015.
  • Jorge A Ahumada, Eric Fegraus, Tanya Birch, Nicole Flores, Roland Kays, Timothy G O’Brien, Jonathan Palmer, Stephanie Schuttler, Jennifer Y Zhao, Walter Jetz, Margaret Kinnaird, Sayali Kulkarni, Arnaud Lyet, David Thau, Michelle Duong, Ruth Oliver, and Anthony Dancer. Wildlife insights: A platform to maximize the potential of camera trap and other passive sensor wildlife data for the planet. Environmental Conservation, 47(1):1–6, 2020.
  • E. AlBadawy, A. Saha, and M. Mazurowski. Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing. Med Phys., 45, 2018.
  • A. Alexandari, A. Kundaje, and A. Shrikumar. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (ICML), pages 222–232, 2020.
  • Miltiadis Allamanis and Marc Brockschmidt. Smartpaste: Learning to adapt source code. arXiv preprint arXiv:1705.07867, 2017.
  • Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 38–49, 2015.
  • E. Amorim, M. Cançado, and A. Veloso. Automated essay scoring in the presence of biased ratings. In Association for Computational Linguistics (ACL), pages 229–237, 2018.
  • R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Language Resources and Evaluation Conference (LREC), pages 4218–4222, 2020.
  • M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Arthur Asuncion and David Newman. UCI Machine Learning Repository, 2007.
  • M. S. Attene-Ramos, N. Miller, R. Huang, S. Michael, M. Itkin, R. J. Kavlock, C. P. Austin, P. Shinn, A. Simeonov, R. R. Tice, et al. The tox21 robotic platform for the assessment of environmental chemicals–from vision to reality. Drug discovery today, 18(15):716–723, 2013.
  • J. Atwood, Y. Halpern, P. Baljekar, E. Breck, D. Sculley, P. Ostyakov, S. I. Nikolenko, I. Ivanov, R. Solovyev, W. Wang, et al. The Inclusive Images competition. In Advances in Neural Information Processing Systems (NeurIPS), pages 155–186, 2020.
  • R. Aviv, S. A. Teichmann, E. S. Lander, A. Ido, B. Christophe, B. Ewan, B. Bernd, P. Campbell, C. Piero, C. Menna, et al. The human cell atlas. Elife, 6, 2017.
  • Žiga Avsec, M. Weilert, A. Shrikumar, A. Alexandari, S. Krueger, K. Dalal, R. Fropf, C. McAnany, J. Gagneur, A. Kundaje, and J. Zeitlinger. Deep learning at base-resolution reveals motif syntax of the cis-regulatory code. bioRxiv, 2019.
  • K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations (ICLR), 2019.
  • M. A. Badgeley, J. R. Zech, L. Oakden-Rayner, B. S. Glicksberg, M. Liu, W. Gale, M. V. McConnell, B. Percha, T. M. Snyder, and J. T. Dudley. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Medicine, 2, 2019.
  • P. Bandi, O. Geessink, Q. Manson, M. V. Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee, K. Paeng, A. Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE transactions on medical imaging, 38(2):550–560, 2018.
  • A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems (NeurIPS), pages 9453–9463, 2019.
  • P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research (JMLR), 9(0):1823–1840, 2008.
  • T. Baumann, A. Köhn, and F. Hennig. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening. Language Resources and Evaluation, 53(2): 303–329, 2019.
  • BBC. A-levels and GCSEs: How did the exam algorithm work? The British Broadcasting Corporation, 2020. URL https://www.bbc.com/news/explainers-53807730.
  • A. H. Beck, A. R. Sangoi, S. Leung, R. J. Marinelli, T. O. Nielsen, M. J. V. D. Vijver, R. B. West, M. V. D. Rijn, and D. Koller. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Science, 3(108), 2011.
  • Axel D Becke. Perspective: Fifty years of density-functional theory in chemical physics. The Journal of Chemical Physics, 140(18):18A301, 2014.
  • E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamviboonsuk, and L. M. Vardoulakis. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Conference on Human Factors in Computing Systems (CHI), pages 1–12, 2020.
  • S. Beery, G. V. Horn, and P. Perona. Recognition in terra incognita. In European Conference on Computer Vision (ECCV), pages 456–473, 2018.
  • S. Beery, E. Cole, and A. Gjoka. The iWildCam 2020 competition dataset. arXiv preprint arXiv:2004.10340, 2020.
  • Sara Beery, Dan Morris, and Siyu Yang. Efficient pipeline for camera trap image review. arXiv preprint arXiv:1907.06772, 2019.
  • Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, and Jonathan Huang. Context r-cnn: Long term temporal context for per-camera object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13075–13085, 2020.
  • B. E. Bejnordi, M. Veta, P. J. V. Diest, B. V. Ginneken, N. Karssemeijer, G. Litjens, J. A. V. D. Laak, M. Hermsen, Q. F. Manson, M. Balkenhol, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22):2199–2210, 2017.
  • D. Bellamy, L. Celi, and A. L. Beam. Evaluating progress on machine learning for longitudinal electronic healthcare data. arXiv preprint arXiv:2010.01149, 2020.
  • M. G. Bellemare, S. Candido, P. S. Castro, J. Gong, M. C. Machado, S. Moitra, S. S. Ponda, and Z. Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588, 2020.
  • S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS), pages 137–144, 2006.
  • A. BenTaieb and G. Hamarneh. Adversarial stain transfer for histopathology image analysis. IEEE transactions on medical imaging, 37(3):792–802, 2017.
  • A. A. Beyene, T. Welemariam, M. Persson, and N. Lavesson. Improved concept drift handling in surgery prediction and other applications. Knowledge and Information Systems, 44(1): 177–196, 2015.
  • G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Inormation Processing Systems, pages 2178–2186, 2011.
  • J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447, 2007.
  • S. L. Blodgett and B. O’Connor. Racial disparity in natural language processing: A case study of social media African-American English. arXiv preprint arXiv:1707.00061, 2017.
  • S. L. Blodgett, L. Green, and B. O’Connor. Demographic dialectal variation in social media: A case study of African-American English. In Empirical Methods in Natural Language Processing (EMNLP), pages 1119–1130, 2016.
  • J. Blumenstock, G. Cadamuro, and R. On. Predicting poverty and wealth from mobile phone metadata. Science, 350, 2015.
  • R. S. Bohacek, C. McMartin, and W. C. Guida. The art and practice of structure-based drug design: a molecular modeling perspective. Medicinal Research Reviews, 16(1):3–50, 1996.
  • D. Borkan, L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Limitations of pinned auc for measuring unintended bias. arXiv preprint arXiv:1903.02088, 2019.
  • D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In WWW, pages 491–500, 2019.
  • L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research (JMLR), 14:3207–3260, 2013.
  • New York Times, 2020. URL https://www.nytimes.com/2020/09/08/opinion/
  • L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):770–787, 2009.
  • D. Bug, S. Schneider, A. Grote, E. Oswald, F. Feuerhake, J. Schüler, and D. Merhof. Contextbased normalization of histological stains using deep convolutional features. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 135–142, 2017.
  • Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations (ICLR), 2018.
  • J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
  • M. Burke, S. Heft-Neal, and E. Bendavid. Sources of variation in under-5 mortality across sub-Saharan Africa: a spatial analysis. Lancet Global Health, 4, 2016.
  • J. Byrd and Z. Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning (ICML), pages 872–881, 2019.
  • S. Caldas, P. Wu, T. Li, J. Konečny, H. B. McMahan, V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
  • G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8):1301–1309, 2019.
  • K. Cao, Y. Chen, J. Lu, N. Arechiga, A. Gaidon, and T. Ma. Heteroskedastic and imbalanced deep learning with adaptive regularization. arXiv preprint arXiv:2006.15766, 2020.
  • L. Chanussot, A. Das, S. Goyal, T. Lavril, M. Shuaibi, M. Riviere, K. Tran, J. Heras-Domingo, C. Ho, W. Hu, A. Palizhati, A. Sriram, B. Wood, J. Yoon, D. Parikh, C. L. Zitnick, and Z. Ulissi. The Open Catalyst 2020 (oc20) dataset and community challenges. arXiv preprint arXiv:2010.09990, 2020.
  • I. Y. Chen, P. Szolovits, and M. Ghassemi. Can AI help reduce disparities in general medical and mental health care? AMA Journal of Ethics, 21(2):167–179, 2019.
  • I. Y. Chen, E. Pierson, S. Rose, S. Joshi, K. Ferryman, and M. Ghassemi. Ethical machine learning in health care. arXiv preprint arXiv:2009.10576, 2020.
  • V. Chen, S. Wu, A. J. Ratner, J. Weng, and C. Ré. Slice-based learning: A programming model for residual learning in critical data slices. In Advances in Neural Information Processing Systems, pages 9397–9407, 2019.
  • T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P. Agapow, M. Zietz, M. M. Hoffman, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141), 2018.
  • G. Christie, N. Fendley, J. Wilson, and R. Mukherjee. Functional map of the world. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • J. S. Chung, A. Nagrani, and A. Zisserman. Voxceleb2: Deep speaker recognition. Proc. Interspeech, pages 1086–1090, 2018.
  • J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. arXiv preprint arXiv:2003.05002, 2020.
  • N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368, 2019.
  • A. Conneau and G. Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems (NeurIPS), pages 7059–7069, 2019.
  • A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Empirical Methods in Natural Language Processing (EMNLP), pages 2475–2485, 2018.
  • ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, 2012.
  • GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509):1318–1330, 2020.
  • HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature, 574(7777), 2019.
  • L. P. Cordella, C. D. Stefano, F. Tortorella, and M. Vento. A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 6(5):1140–1147, 1995.
  • P. Courtiol, C. Maussion, M. Moarii, E. Pronier, S. Pilcer, M. Sefta, P. Manceron, S. Toldo, M. Zaslavskiy, N. L. Stang, et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nature Medicine, 25(10):1519–1525, 2019.
  • F. Croce, M. Andriushchenko, V. Sehwag, N. Flammarion, M. Chiang, P. Mittal, and M. Hein. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
  • A. Crunchant, D. Borchers, H. Kühl, and A. Piel. Listening and watching: Do camera traps or acoustic sensors more efficiently detect wild chimpanzees in an open habitat? Methods in Ecology and Evolution, 11(4):542–552, 2020.
  • M. F. Cuccarese, B. A. Earnshaw, K. Heiser, B. Fogelson, C. T. Davis, P. F. McLean, H. B. Gordon, K. Skelly, F. L. Weathersby, V. Rodic, et al. Functional immune mapping with deep-learning enabled phenomics applied to immunomodulatory and COVID-19 drug discovery. bioRxiv, 2020.
  • Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. In Computer Vision and Pattern Recognition (CVPR), pages 9268–9277, 2019.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • D. Dai and L. Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In International Conference on Intelligent Transportation Systems (ITSC), 2018.
  • A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020.
  • A. D’Amour, H. Srinivasan, J. Atwood, P. Baljekar, D. Sculley, and Y. Halpern. Fairness is not static: deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 525–534, 2020.
  • H. Daumé III. Frustratingly easy domain adaptation. In Association for Computational Linguistics (ACL), 2007.
  • S. E. Davis, T. A. Lasko, G. Chen, E. D. Siew, and M. E. Matheny. Calibration drift in regression and machine learning models for acute kidney injury. Journal of the American Medical Informatics Association, 24(6):1052–1061, 2017.
  • A. J. DeGrave, J. D. Janizek, and S. Lee. AI for radiographic COVID-19 detection selects shortcuts over signal. medRxiv, 2020.
  • M. C. Desmarais and R. Baker. A review of recent advances in learner and skill modeling in intelligent learning environments. User Modeling and User-Adapted Interaction, 22(1):9–38, 2012.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Association for Computational Linguistics (ACL), pages 4171–4186, 2019.
  • DigitalGlobe, NVIDIA, and CosmiQ Works. SpaceNet. https://aws.amazon.com/publicdatasets/spacenet/, 2016.
  • K. A. Dill and J. L. MacCallum. The protein-folding problem, 50 years on. Science, 338(6110): 1042–1046, 2012.
  • L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and mitigating unintended bias in text classification. In Association for the Advancement of Artificial Intelligence (AAAI), pages 67–73, 2018.
  • J. Djolonga, J. Yung, M. Tschannen, R. Romijnders, L. Beyer, A. Kolesnikov, J. Puigcerver, M. Minderer, A. D’Amour, D. Moldovan, et al. On robustness and transferability of convolutional neural networks. arXiv preprint arXiv:2007.08558, 2020.
  • S. Dodge and L. Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In International Conference on Computer Communication and Networks (ICCCN), pages 1–7, 2017.
  • Q. Dou, D. Castro, K. Kamnitsas, and B. Glocker. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • J. Dressel and H. Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1), 2018.
  • J. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses against mixture covariate shifts. https://cs.stanford.edu/~thashim/assets/publications/condrisk.pdf, 2019.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pages 214–226, 2012.
  • C. D. Elvidge, P. C. Sutton, T. Ghosh, B. T. Tuttle, K. E. Baugh, B. Bhaduri, and E. Bright. A global poverty map derived from satellite data. Computers and Geosciences, 35, 2009.
  • G. Eraslan, Žiga Avsec, J. Gagneur, and F. J. Theis. Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20(7):389–403, 2019.
  • J. Espey, E. Swanson, S. Badiee, Z. Chistensen, A. Fischer, M. Levy, G. Yetman, A. de Sherbinin, R. Chen, Y. Qiu, G. Greenwell, T. Klein, J. Jutting, M. Jerven, G. Cameron, A. M. A. Rivera, V. C. Arias, S. L. Mills, and A. Motivans. Data for development: A needs assessment for SDG monitoring and statistical capacity development. Sustainable Development Solutions Network, 2015.
  • A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542 (7639):115–118, 2017.
  • OpenAI et al. Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • C. Fang, Y. Xu, and D. N. Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In International Conference on Computer Vision (ICCV), pages 1657–1664, 2013.
  • J. Feng, A. Sondhi, J. Perry, and N. Simon. Selective prediction-set models with coverage guarantees. arXiv preprint arXiv:1906.05473, 2019.
  • D. Filmer and K. Scott. Assessing asset indices. Demography, 49, 2011.
  • J. Futoma, M. Simons, T. Panch, F. Doshi-Velez, and L. A. Celi. The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health, 2(9):e489–e492, 2020.
  • Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.
  • Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pages 1180–1189, 2015.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR), 17, 2016.
  • S. Garg, Y. Wu, S. Balakrishnan, and Z. C. Lipton. A unified view of label shift estimation. arXiv preprint arXiv:2003.07554, 2020.
  • Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • Y. Geifman and R. El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In International Conference on Machine Learning (ICML), 2019.
  • Y. Geifman, G. Uziel, and R. El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. In International Conference on Learning Representations (ICLR), 2018.
  • R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
  • R. Geirhos, C. R. Temme, J. Rauber, H. H. Schütt, M. Bethge, and F. A. Wichmann. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 31:7538–7550, 2018.
  • R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
  • M. Geva, Y. Goldberg, and J. Berant. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), pages 1263–1272, 2017.
  • K. Goel, A. Gu, Y. Li, and C. Ré. Model patching: Closing the subgroup performance gap with data augmentation. arXiv preprint arXiv:2008.06775, 2020.
  • B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 2066–2073, 2012.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
  • N. Graetz, J. Friedman, A. Osgood-Zimmerman, R. Burstein, M. H. Biehl, C. Shields, J. F. Mosser, D. C. Casey, A. Deshpande, L. Earl, R. C. Reiner, S. E. Ray, N. Fullman, A. J. Levine, R. W. Stubbs, B. K. Mayala, J. Longbottom, A. J. Browne, S. Bhatt, D. J. Weiss, P. W. Gething, A. H. Mokdad, S. S. Lim, C. J. L. Murray, E. Gakidou, and S. I. Hay. Mapping local variation in educational attainment across Africa. Nature, 555, 2018.
  • M. Grooten, T. Peterson, and R. E. A. Almond. Living Planet Report 2020 - Bending the curve of biodiversity loss. WWF, Gland, Switzerland, 2020.
  • S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation (ICRA), 2017.
  • I. Gulrajani and D. Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
  • A. Gupta, A. Murali, D. Gandhi, and L. Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • M. N. Gurcan, L. E. Boucheron, A. Can, A. Madabhushi, N. M. Rajpoot, and B. Yener. Histopathological image analysis: A review. IEEE reviews in biomedical engineering, 2:147–171, 2009.
  • M. C. Hansen, P. V. Potapov, R. Moore, M. Hancher, S. A. Turubanova, A. Tyukavina, D. Thau, S. V. Stehman, S. J. Goetz, T. R. Loveland, A. Kommareddy, A. Egorov, L. Chini, C. O. Justice, and J. R. G. Townshend. High-resolution global maps of 21st-century forest cover change. Science, 342, 2013.
  • T. B. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning (ICML), 2018.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • Y. He, Z. Shen, and P. Cui. Towards non-IID image classification: A dataset and baselines. Pattern Recognition, 110, 2020.
  • V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli. When code completion fails: A case study on real-world completions. In International Conference on Software Engineering (ICSE), pages 960–970, 2019.
  • B. E. Henderson, N. H. Lee, V. Seewaldt, and H. Shen. The influence of race and ethnicity on the biology of cancer. Nature Reviews Cancer, 12(9):648–653, 2012.
  • D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019.
  • D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR), 2017.
  • D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, and D. Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132, 2020.
  • D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, D. Song, J. Steinhardt, and J. Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241, 2020.
  • J. W. Ho, Y. L. Jung, T. Liu, B. H. Alver, S. Lee, K. Ikegami, K. Sohn, A. Minoda, M. Y. Tolstorukov, A. Appert, et al. Comparative analysis of metazoan chromatin organization. Nature, 512(7515):449–452, 2014.
  • J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.
  • D. Hovy and S. L. Spruit. The social impact of natural language processing. In Association for Computational Linguistics (ACL), pages 591–598, 2016.
  • J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080, 2020.
  • W. Hu, G. Niu, I. Sato, and M. Sugiyama. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning (ICML), 2018.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.
  • J. P. Hughes, S. Rees, S. B. Kalindjian, and K. L. Philpott. Principles of early drug discovery. British Journal of Pharmacology, 162(6):1239–1249, 2011.
  • H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
  • K. Jaganathan, S. K. Panagiotopoulou, J. F. McRae, S. F. Darbandi, D. Knowles, Y. I. Li, J. A. Kosmicki, J. Arbelaez, W. Cui, G. B. Schwartz, et al. Predicting splicing from primary sequence with deep learning. Cell, 176(3):535–548, 2019.
  • N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353, 2016.
  • N. Jean, S. M. Xie, and S. Ermon. Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • W. Jin, R. Barzilay, and T. Jaakkola. Enforcing predictive invariance across structured biomedical domains. arXiv preprint arXiv:2006.03908, 2020.
  • A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, K. Tunyasuvunakool, O. Ronneberger, R. Bates, A. Žídek, A. Bridgland, C. Meyer, S. A A Kohl, A. Potapenko, A. J Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, M. Steinegger, M. Pacholska, D. Silver, O. Vinyals, A. W Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis. High accuracy protein structure prediction using deep learning. Fourteenth Critical Assessment of Techniques for Protein Structure Prediction, 2020.
  • J. Jung, S. Goel, J. Skeem, et al. The limits of human predictions of recidivism. Science Advances, 6(7), 2020.
  • A. K. Jørgensen, D. Hovy, and A. Søgaard. Challenges of studying and processing dialects in social media. In ACL Workshop on Noisy User-generated Text, pages 9–18, 2015.
  • G. Kahn, P. Abbeel, and S. Levine. BADGR: An autonomous self-supervised learning-based navigation system. arXiv preprint arXiv:2002.05700, 2020.
  • A. Kamath, R. Jia, and P. Liang. Selective question answering under domain shift. In Association for Computational Linguistics (ACL), 2020.
  • Z. Katona, M. Painter, P. N. Patatoukas, and J. Zeng. On the capital market consequences of alternative data: Evidence from outer space. Miami Behavioral Finance Conference, 2018.
  • D. Kaushik, E. Hovy, and Z. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations (ICLR), 2019.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning (ICML), pages 2564–2572, 2018.
  • J. Keilwagen, S. Posch, and J. Grau. Accurate prediction of cell type-specific transcription factor binding. Genome Biology, 20(1), 2019.
  • D. R. Kelley, J. Snoek, and J. L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7):990–999, 2016.
  • J. H. Kim, M. Xie, N. Jean, and S. Ermon. Incorporating spatial context and fine-grained detail from satellite imagery to predict poverty. Stanford University, 2016.
  • N. Kim and T. Linzen. COGS: A compositional generalization challenge based on semantic interpretation. arXiv preprint arXiv:2010.05465, 2020.
  • S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, J. Wang, B. Yu, J. Zhang, and S. H. Bryant. Pubchem substance and compound databases. Nucleic Acids Research, 44(D1):D1202–D1213, 2016.
  • A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020.
  • P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In International Conference on Machine Learning (ICML), 2020.
  • B. Kompa, J. Snoek, and A. Beam. Empirical frequentist coverage of deep learning uncertainty quantification procedures. arXiv preprint arXiv:2010.03039, 2020.
  • D. Komura and S. Ishikawa. Machine learning methods for histopathological image analysis. Computational and Structural Biotechnology Journal, 16:34–42, 2018.
  • S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. Liang. SPoC: Search-based pseudocode to code. In Advances in Neural Information Processing Systems (NeurIPS), pages 11906–11917, 2019.
  • C. Kulkarni, P. W. Koh, H. Huy, D. Chia, K. Papadopoulos, J. Cheng, D. Koller, and S. R. Klemmer. Peer and self assessment in massive online classes. Design Thinking Research, pages 131–168, 2015.
  • C. E. Kulkarni, R. Socher, M. S. Bernstein, and S. R. Klemmer. Scaling short-answer grading by combining peer assessment with algorithmic scoring. In Proceedings of the first ACM conference on Learning@Scale conference, pages 99–108, 2014.
  • A. Kumar, T. Ma, and P. Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning (ICML), 2020.
  • A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.
  • B. Lake and M. Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning (ICML), 2018.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • G. Landrum et al. RDKit: Open-source cheminformatics, 2006.
  • A. J. Larrazabal, N. Nieto, V. Peterson, D. H. Milone, and E. Ferrante. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences, 117(23):12592–12594, 2020.
  • J. Larson, S. Mattu, L. Kirchner, and J. Angwin. How we analyzed the COMPAS recidivism algorithm. ProPublica, 9(1), 2016.
  • R. Y. Lau, C. Li, and S. S. Liao. Social analytics: Learning fuzzy product ontologies for aspect-oriented sentiment analysis. Decision Support Systems, 65:80–94, 2014.
  • Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10), 2010.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
  • H. Li and Y. Guan. Leopard: fast decoding cell type-specific transcription factor binding landscape at single-nucleotide resolution. bioRxiv, 2019.
  • H. Li, D. Quang, and Y. Guan. Anchor: trans-cell type prediction of transcription factor binding sites. Genome Research, 29(2):281–292, 2019.
  • J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Dialogue learning with human-in-the-loop. In International Conference on Learning Representations (ICLR), 2017.
  • J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Learning through dialogue interactions by asking questions. In International Conference on Learning Representations (ICLR), 2017.
  • T. Li, M. Sanjabi, A. Beirami, and V. Smith. Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497, 2019.
  • Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. Revisiting batch normalization for practical domain adaptation. In International Conference on Learning Representations Workshop (ICLRW), 2017.
  • S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations (ICLR), 2018.
  • M. W. Libbrecht and W. S. Noble. Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6):321–332, 2015.
  • Z. Lipton, Y. Wang, and A. Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning (ICML), 2018.
  • L. T. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning (ICML), 2018.
  • Y. Liu, K. Gadepalli, M. Norouzi, G. E. Dahl, T. Kohlberger, A. Boyko, S. Venugopalan, A. Timofeev, P. Q. Nelson, G. S. Corrado, et al. Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442, 2017.
  • M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.
  • J. Lyu, S. Wang, T. E. Balius, I. Singh, A. Levit, Y. S. Moroz, M. J. O’Meara, T. Che, E. Algaa, K. Tolmachova, et al. Ultra-large library docking for discovering new chemotypes. Nature, 566 (7743):224–229, 2019.
  • M. Macenko, M. Niethammer, J. S. Marron, D. Borland, J. T. Woosley, X. Guan, C. Schmitt, and N. E. Thomas. A method for normalizing histology slides for quantitative analysis. In 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 1107–1110, 2009.
  • Brian A Malloy and James F Power. Quantifying the transition from python 2 to 3: an empirical study of python applications. In 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 314–323. IEEE, 2017.
  • M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330, 1993.
  • K. McCloskey, E. A. Sigel, S. Kearnes, L. Xue, X. Tian, D. Moccia, D. Gikunju, S. Bazzaz, B. Chan, M. A. Clark, et al. Machine learning on DNA-encoded libraries: A new paradigm for hit finding. Journal of Medicinal Chemistry, 2020.
  • R. T. McCoy, J. Min, and T. Linzen. Berts of a feather do not generalize together: Large variability in generalization across models with similar test set performance. arXiv preprint arXiv:1911.02969, 2019.
  • R. T. McCoy, E. Pavlick, and T. Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Association for Computational Linguistics (ACL), 2019.
  • S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. C. Corrado, A. Darzi, et al. International evaluation of an AI system for breast cancer screening. Nature, 577(7788):89–94, 2020.
  • Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
  • J. Miller, K. Krauth, B. Recht, and L. Schmidt. The effect of natural distribution shift on question answering models. arXiv preprint arXiv:2004.14444, 2020.
  • P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, C. Kavukcuoglu, D. Kumaran, and R. Hadsell. Learning to navigate in complex environments. In International Conference on Learning Representations (ICLR), 2017.
  • J. E. Moore, M. J. Purcaro, H. E. Pratt, C. B. Epstein, N. Shoresh, J. Adrian, T. Kawli, C. A. Davis, A. Dobin, R. Kaul, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 583(7818):699–710, 2020.
  • J. Moult, J. T Pedersen, R. Judson, and K. Fidelis. A large-scale experiment to assess protein structure prediction methods. Proteins: Structure, Function, and Bioinformatics, 23(3):ii–iv, 1995.
  • K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning (ICML), pages 10–18, 2013.
  • W. Nekoto, V. Marivate, T. Matsila, T. Fasubaa, T. Kolawole, T. Fagbohungbe, S. O. Akinola, S. H. Muhammad, S. Kabongo, S. Osei, S. Freshia, R. A. Niyongabo, R. Macharm, P. Ogayo, O. Ahia, M. Meressa, M. Adeyemi, M. Mokgesi-Selinga, L. Okegbemi, L. J. Martinus, K. Tajudeen, K. Degila, K. Ogueji, K. Siminyu, J. Kreutzer, J. Webster, J. T. Ali, J. Abbott, I. Orife, I. Ezeani, I. A. Dangana, H. Kamper, H. Elsahar, G. Duru, G. Kioko, E. Murhabazi, E. van Biljon, D. Whitenack, C. Onyefuluchi, C. Emezue, B. Dossou, B. Sibanda, B. I. Bassey, A. Olabiyi, A. Ramkilowan, A. Öktem, A. Akinfaderin, and A. Bashir. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of Empirical Methods in Natural Language Processing (Findings of EMNLP), 2020.
  • B. Nestor, M. McDermott, W. Boag, G. Berner, T. Naumann, M. C. Hughes, A. Goldenberg, and M. Ghassemi. Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks. arXiv preprint arXiv:1908.00690, 2019.
  • J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Empirical Methods in Natural Language Processing (EMNLP), pages 188–197, 2019.
• Marius Nita and David Notkin. Using twinning to adapt programs to alternative APIs. In 2010 ACM/IEEE 32nd International Conference on Software Engineering, volume 1, pages 205–214. IEEE, 2010.
  • A. Noor, V. Alegana, P. Gething, A. Tatem, and R. Snow. Using remotely sensed night-time light as a proxy for poverty in africa. Population Health Metrics, 6, 2008.
  • Mohammad Sadegh Norouzzadeh, Dan Morris, Sara Beery, Neel Joshi, Nebojsa Jojic, and Jeff Clune. A deep active learning system for species identification and counting in camera trap images. arXiv preprint arXiv:1910.09716, 2019.
  • NYTimes. The Times is partnering with Jigsaw to expand comment capabilities. The New York Times, 2016. URL https://www.nytco.com/press/the-times-is-partnering-with-jigsaw-to-expand-comment-capabilities/.
  • Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
  • Y. Oren, S. Sagawa, T. Hashimoto, and P. Liang. Distributionally robust language modeling. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
  • A. Osgood-Zimmerman, A. I. Millear, R. W. Stubbs, C. Shields, B. V. Pickering, L. Earl, N. Graetz, D. K. Kinyoki, S. E. Ray, S. Bhatt, A. J. Browne, R. Burstein, E. Cameron, D. C. Casey, A. Deshpande, N. Fullman, P. W. Gething, H. S. Gibson, N. J. Henry, M. Herrero, L. K. Krause, I. D. Letourneau, A. J. Levine, P. Y. Liu, J. Longbottom, B. K. Mayala, J. F. Mosser, A. M. Noor, D. M. Pigott, E. G. Piwoz, P. Rao, R. Rawat, R. C. Reiner, D. L. Smith, D. J. Weiss, K. E. Wiens, A. H. Mokdad, S. S. Lim, C. J. L. Murray, N. J. Kassebaum, and S. I. Hay. Mapping child growth failure in africa between 2000 and 2015. Nature, 555, 2018.
  • Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • S. J. Pan, X. Ni, J. Sun, Q. Yang, and Z. Chen. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International World Wide Web Conference, pages 751–760, 2010.
  • V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5206–5210, 2015.
  • Jason Parham, Jonathan Crall, Charles Stewart, Tanya Berger-Wolf, and Daniel I Rubenstein. Animal population censusing at scale with citizen science and photographic identification. In AAAI Spring Symposium-Technical Report, 2017.
  • J. H. Park, J. Shin, and P. Fung. Reducing gender bias in abusive language detection. In Empirical Methods in Natural Language Processing (EMNLP), pages 2799–2804, 2018.
  • G. K. Patro, A. Biswas, N. Ganguly, K. P. Gummadi, and A. Chakraborty. Fairrec: Two-sided fairness for personalized recommendations in two-sided platforms. In Proceedings of The Web Conference 2020, pages 1194–1204, 2020.
  • X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, and K. Saenko. VisDA: A synthetic-toreal benchmark for visual domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 2021–2026, 2018.
  • X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In International Conference on Computer Vision (ICCV), 2019.
  • X. Peng, E. Coumans, T. Zhang, T. Lee, J. Tan, and S. Levine. Learning agile robotic locomotion skills by imitating animals. In Robotics: Science and Systems (RSS), 2020.
  • L. Perelman. When “the state of the art” is counting words. Assessing Writing, 21:104–111, 2014.
  • N. A. Phillips, P. Rajpurkar, M. Sabini, R. Krishnan, S. Zhou, A. Pareek, N. M. Phu, C. Wang, A. Y. Ng, and M. P. Lungren. Chexphoto: 10,000+ smartphone photos and synthetic photographic transformations of chest x-rays for benchmarking deep learning robustness. arXiv preprint arXiv:2007.06199, 2020.
• C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller. Tuned models of peer assessment in MOOCs. Educational Data Mining, 2013.
  • M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
• Kerrie A Pipal, Jeremy J Notch, Sean A Hayes, and Peter B Adams. Estimating escapement for a low-abundance steelhead population using dual-frequency identification sonar (DIDSON). North American Journal of Fisheries Management, 32(5):880–893, 2012.
  • W Nicholson Price and I Glenn Cohen. Privacy in the age of medical big data. Nature Medicine, 25(1):37–43, 2019.
• D. Quang and X. Xie. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods, 166:40–47, 2019.
  • J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
  • Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 419–428, 2014.
  • C. Ré, F. Niu, P. Gudipati, and C. Srisuwananukorn. Overton: A data system for monitoring and improving machine-learned products. arXiv preprint arXiv:1909.05372, 2019.
  • B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019.
  • R. C. Reiner, N. Graetz, D. C. Casey, C. Troeger, G. M. Garcia, J. F. Mosser, A. Deshpande, S. J. Swartz, S. E. Ray, B. F. Blacker, P. C. Rao, A. Osgood-Zimmerman, R. Burstein, D. M. Pigott, I. M. Davis, I. D. Letourneau, L. Earl, J. M. Ross, I. A. Khalil, T. H. Farag, O. J. Brady, M. U. Kraemer, D. L. Smith, S. Bhatt, D. J. Weiss, P. W. Gething, N. J. Kassebaum, A. H. Mokdad, C. J. Murray, and S. I. Hay. Variation in childhood diarrheal morbidity and mortality in africa, 2000–2015. New England Journal of Medicine, 379, 2018.
  • D. Reker. Practical considerations for active machine learning in drug discovery. Drug Discovery Today: Technologies, 2020.
  • M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Association for Computational Linguistics (ACL), pages 4902–4912, 2020.
  • S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118, 2016.
  • M. Rigaki and S. Garcia. Bringing a GAN to a knife-fight: Adapting malware communication to avoid detection. In 2018 IEEE Security and Privacy Workshops (SPW), pages 70–75, 2018.
  • E. Rolf, M. I. Jordan, and B. Recht. Post-estimation smoothing: A simple baseline for learning with side information. In Artificial Intelligence and Statistics (AISTATS), 2020.
  • G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.
  • Amir Rosenfeld, Richard Zemel, and John K Tsotsos. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
  • F. Sadeghi and S. Levine. CAD2RL: Real single-image flight without a single real image. In Robotics: Science and Systems (RSS), 2017.
  • K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226, 2010.
  • M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.
  • S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.
  • S. Sagawa, A. Raghunathan, P. W. Koh, and P. Liang. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning (ICML), 2020.
  • D. E. Sahn and D. Stifel. Exploring alternative measures of welfare in the absence of expenditure data. The Review of Income and Wealth, 49, 2003.
• S. Santurkar, D. Tsipras, and A. Madry. BREEDS: Benchmarks for subpopulation shift. arXiv, 2020.
  • Stefan Schneider and Alex Zhuang. Counting fish and dolphins in sonar images using deep learning. arXiv preprint arXiv:2007.12808, 2020.
  • L. Seyyed-Kalantari, G. Liu, M. McDermott, and M. Ghassemi. Chexclusion: Fairness gaps in deep chest X-ray classifiers. arXiv preprint arXiv:2003.00827, 2020.
  • Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. Advances in Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning for the Developing World, 2017.
  • V. Shankar, A. Dave, R. Roelofs, D. Ramanan, B. Recht, and L. Schmidt. Do image classifiers generalize across time? arXiv preprint arXiv:1906.02168, 2019.
  • J. Shen, Y. Qu, W. Zhang, and Y. Yu. Wasserstein distance guided representation learning for domain adaptation. In Association for the Advancement of Artificial Intelligence (AAAI), 2018.
  • M. D. Shermis. State-of-the-art automated essay scoring: Competition, results, and future directions from a united states demonstration. Assessing Writing, 20:53–76, 2014.
• Rakshith Shetty, Bernt Schiele, and Mario Fritz. Not using the car to see the sidewalk: quantifying and controlling the effects of context in classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8218–8226, 2019.
  • H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90:227–244, 2000.
  • Richard Shin, Neel Kant, Kavi Gupta, Christopher Bender, Brandon Trabucco, Rishabh Singh, and Dawn Song. Synthetic datasets for neural program synthesis. In International Conference on Learning Representations (ICLR), 2019.
  • Yu Shiu, KJ Palmer, Marie A Roch, Erica Fleishman, Xiaobai Liu, Eva-Marie Nosal, Tyler Helble, Danielle Cholewiak, Douglas Gillespie, and Holger Klinck. Deep neural networks for automated detection of marine mammal species. Scientific Reports, 10(1):1–12, 2020.
  • Brian K Shoichet. Virtual screening of chemical libraries. Nature, 432(7019):862–865, 2004.
  • N. Sohoni, J. Dunnmon, G. Angus, A. Gu, and C. Ré. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • D. Srivastava and S. Mahony. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 1863(6), 2020.
• Teague Sterling and John J. Irwin. ZINC 15 – ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337, 2015. doi: 10.1021/acs.jcim.5b00559. PMID: 26479676.
  • Dan Stowell, Michael D Wood, Hanna Pamuła, Yannis Stylianou, and Hervé Glotin. Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge. Methods in Ecology and Evolution, 10(3):368–380, 2019.
  • A. Subbaswamy, R. Adams, and S. Saria. Evaluating model robustness to dataset shift. arXiv preprint arXiv:2010.15100, 2020.
  • B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European conference on computer vision, pages 443–450, 2016.
  • B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In Association for the Advancement of Artificial Intelligence (AAAI), 2016.
  • P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, S. Zhao, S. Cheng, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning (ICML), 2020.
• Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2727–2735, 2019.
  • Michael A Tabak, Mohammad S Norouzzadeh, David W Wolfson, Steven J Sweeney, Kurt C VerCauteren, Nathan P Snow, Joseph M Halseth, Paul A Di Salvo, Jesse S Lewis, Michael D White, et al. Machine learning to classify animal species in camera trap images: Applications in ecology. Methods in Ecology and Evolution, 10(4):585–590, 2019.
  • K. Taghipour and H. T. Ng. A neural approach to automated essay scoring. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1882–1891, 2016.
  • R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt. Measuring robustness to natural distribution shifts in image classification. arXiv preprint arXiv:2007.00644, 2020.
  • R. Tatman. Gender and dialect bias in YouTube’s automatic captions. In Workshop on Ethics in Natural Langauge Processing, volume 1, pages 53–59, 2017.
• D. Tellez, M. Balkenhol, I. Otte-Höller, R. van de Loo, R. Vogels, P. Bult, C. Wauters, W. Vreuls, S. Mol, N. Karssemeijer, et al. Whole-slide mitosis detection in H&E breast histology using PHH3 as a reference to train distilled stain-invariant convolutional networks. IEEE Transactions on Medical Imaging, 37(9):2126–2136, 2018.
  • D. Tellez, G. Litjens, P. Bándi, W. Bulten, J. Bokhorst, F. Ciompi, and J. van der Laak. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Medical Image Analysis, 58, 2019.
  • Dogancan Temel, Jinsol Lee, and Ghassan AlRegib. Cure-or: Challenging unreal and real environments for object recognition. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 137–144. IEEE, 2018.
  • T. G. Tiecke, X. Liu, A. Zhang, A. Gros, N. Li, G. Yetman, T. Kilic, S. Murray, B. Blankespoor, E. B. Prydz, and H. H. Dang. Mapping the world population one building at a time. arXiv, 2017.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems (IROS), 2017.
  • A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), pages 1521–1528, 2011.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • B. Uzkent and S. Ermon. Learning when and where to zoom with deep reinforcement learning. In Computer Vision and Pattern Recognition (CVPR), 2020.
  • Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. Neural program repair by jointly learning to localize and repair. In International Conference on Learning Representations (ICLR), 2019.
  • Sindre Vatnehol, Hector Peña, and Nils Olav Handegard. A method to automatically detect fish aggregations using horizontally scanning sonar. ICES Journal of Marine Science, 75(5): 1803–1812, 2018.
  • B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling. Rotation equivariant cnns for digital pathology. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 210–218, 2018.
  • H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), pages 5018–5027, 2017.
  • M. Veta, P. J. V. Diest, M. Jiwa, S. Al-Janabi, and J. P. Pluim. Mitosis counting in breast cancer: Object-level interobserver agreement and comparison to an automatic method. PloS one, 11(8), 2016.
• M. Veta, Y. J. Heng, N. Stathonikos, B. E. Bejnordi, F. Beca, T. Wollmann, K. Rohr, M. A. Shah, D. Wang, M. Rousson, et al. Predicting breast tumor proliferation from whole-slide images: the TUPAC16 challenge. Medical Image Analysis, 54:111–121, 2019.
  • R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell. Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
  • H. Wang, S. Ge, Z. Lipton, and E. P. Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun. Torontocity: Seeing the world with a million eyes. In International Conference on Computer Vision (ICCV), 2017.
  • S. Wang, W. Chen, S. M. Xie, G. Azzari, and D. B. Lobell. Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sensing, 12, 2020.
  • OR Wearn and P Glover-Kapfer. Camera-trapping for conservation: a guide to best-practices. WWF conservation technology series, 1(1):2019–04, 2017.
  • S. Weinberger. Speech accent archive. George Mason University, 2015.
  • Ben G Weinstein. A computer vision for animal ecology. Journal of Animal Ecology, 87(3): 533–545, 2018.
  • J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, C. G. A. R. Network, et al. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10), 2013.
  • R. West, H. S. Paskov, J. Leskovec, and C. Potts. Exploiting social network structure for personto-person sentiment analysis. Transactions of the Association for Computational Linguistics (TACL), 2:297–310, 2014.
  • G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine learning, 23(1):69–101, 1996.
  • J. J. Williams, J. Kim, A. Rafferty, S. Maldonado, K. Z. Gajos, W. S. Lasecki, and N. Heffernan. Axis: Generating explanations at scale with learnersourcing and machine learning. In Proceedings of the Third (2016) ACM Conference on Learning@Scale, pages 379–388, 2016.
  • Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. Predictive inequity in object detection. arXiv preprint arXiv:1902.11097, 2019.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. HuggingFace’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Ho Yuen Frank Wong, Hiu Yin Sonia Lam, Ambrose Ho-Tung Fong, Siu Ting Leung, Thomas Wing-Yan Chin, Christine Shing Yen Lo, Macy Mei-Sze Lui, Jonan Chun Yin Lee, Keith Wan-Hang Chiu, Tom Chung, et al. Frequency and distribution of chest radiographic findings in covid-19 positive patients. Radiology, page 201160, 2020.
  • D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Computer Vision and Pattern Recognition (CVPR), pages 5028–5037, 2017.
  • M. Wu, M. Mosse, N. Goodman, and C. Piech. Zero shot learning for code education: Rubric sampling with deep learning inference. In Association for the Advancement of Artificial Intelligence (AAAI), volume 33, pages 782–790, 2019.
  • M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. International Conference on Educational Data Mining, 2020.
  • Y. Wu, E. Winston, D. Kaushik, and Z. Lipton. Domain adaptation with asymmetricallyrelaxed distribution alignment. In International Conference on Machine Learning (ICML), pages 6872–6881, 2019.
  • Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
  • M. Wulfmeier, A. Bewley, and I. Posner. Incremental adversarial domain adaptation for continually changing environments. In International Conference on Robotics and Automation (ICRA), 2018.
  • K. Xiao, L. Engstrom, A. Ilyas, and A. Madry. Noise or signal: The role of image backgrounds in object recognition. arXiv preprint arXiv:2006.09994, 2020.
  • M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon. Transfer learning from deep features for remote sensing and poverty mapping. In Association for the Advancement of Artificial Intelligence (AAAI), 2016.
  • S. M. Xie, A. Kumar, R. Jones, F. Khani, T. Ma, and P. Liang. In-N-out: Pre-training and self-training using auxiliary information for out-of-distribution robustness. arXiv, 2020.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2018.
  • Y. Yang and S. Newsam. Bag-of-visual-words and spatial extensions for land-use classification. Geographic Information Systems, 2010.
  • Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani. Data efficient reinforcement learning for legged robots. In Conference on Robot Learning (CoRL), 2019.
  • Michihiro Yasunaga and Percy Liang. Graph-based, self-supervised program repair from diagnostic feedback. In International Conference on Machine Learning (ICML), 2020.
  • C. Yeh, A. Perez, A. Driscoll, G. Azzari, Z. Tang, D. Lobell, S. Ermon, and M. Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in africa. Nature Communications, 11, 2020.
  • J. You, X. Li, M. Low, D. Lobell, and S. Ermon. Deep gaussian process for crop yield prediction based on remote sensing data. In Association for the Advancement of Artificial Intelligence (AAAI), 2017.
  • F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
    Google ScholarLocate open access versionFindings
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. In PLOS Medicine, 2018.
  • K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning (ICML), pages 819–827, 2013.
  • M. Zhang, H. Marklund, N. Dhawan, A. Gupta, S. Levine, and C. Finn. Adaptive risk minimization: A meta-learning approach for tackling group shift. arXiv preprint arXiv:2007.02931, 2020.
  • Y. Zhang, J. Baldridge, and L. He. PAWS: Paraphrase adversaries from word scrambling. In North American Association for Computational Linguistics (NAACL), 2019.
  • J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods, 12(10):931–934, 2015.
  • X. Zhou, Y. Nie, H. Tan, and M. Bansal. The curse of performance instability in analysis datasets: Consequences, source, and suggestions. arXiv preprint arXiv:2004.13606, 2020.
  • C. L. Zitnick, L. Chanussot, A. Das, S. Goyal, J. Heras-Domingo, C. Ho, W. Hu, T. Lavril, A. Palizhati, M. Riviere, M. Shuaibi, A. Sriram, K. Tran, B. Wood, J. Yoon, D. Parikh, and Z. Ulissi. An introduction to electrocatalyst design using machine learning for renewable energy storage. arXiv preprint arXiv:2010.09435, 2020.
  • 2. Label × Black: 4 subsets, 1 for each combination of class and Black.
  • 2. Validation (OOD): reviews in categories unseen during training.
  • 3. Test (OOD): reviews in categories unseen during training.
  • 4. Validation (ID): reviews in training categories.
  • 5. Test (ID): reviews in training categories.
  • 2. Validation (OOD): 20,000 reviews written in years 2014 to 2018.
  • 3. Test (OOD): 20,000 reviews written in years 2014 to 2018.
  • 2. Validation (OOD): 20,000 reviews written in years 2014 to 2019.
  • 3. Test (OOD): 20,000 reviews written in years 2014 to 2019.
  • 2. Validation (OOD): 40,000 reviews from another set of 1,600 reviewers, distinct from training and test (OOD).
  • 3. Test (OOD): 40,000 reviews from another set of 1,600 reviewers, distinct from training and validation (OOD).
  • 4. Validation (ID): 40,000 reviews from 1,600 of the 11,856 reviewers in the training set.
  • 5. Test (ID): 40,000 reviews from 1,600 of the 11,856 reviewers in the training set.
Authors
Shiori Sagawa
Henrik Marklund
Michihiro Yasunaga