Preference learning along multiple criteria: A game-theoretic perspective

NeurIPS 2020

Abstract

The literature on ranking from ordinal data is vast, and there are several ways to aggregate overall preferences from pairwise comparisons between objects. In particular, it is well known that any Nash equilibrium of the zero-sum game induced by the preference matrix defines a natural solution concept (winning distribution over objects)...

Introduction
  • Economists, social scientists, engineers, and computer scientists have long studied models for human preferences, under the broad umbrella of social choice theory [10, 7].
  • Learning from human preferences has found applications in interactive robotics for learning reward functions [45, 39], in medical domains for personalizing assistive devices [59, 9], and in recommender systems for optimizing search engines [15, 28].
  • An object could correspond to a product in a search query, or a policy or reward function in reinforcement learning.
  • A vast body of classical work dating back to Condorcet and Borda [17, 12] has focused on defining and producing a “winning” object from the results of a set of pairwise comparisons.
Highlights
  • We introduced the notion of a Blackwell winner, which generalizes many known winning solution concepts.
  • We showed that the Blackwell winner was efficiently computable from samples with a simple and optimal procedure, and that it outperformed the von Neumann winner in a user study on autonomous driving.
  • Our work raises many interesting follow-up questions: How does the sample complexity vary as a function of the preference tensor P? Can the process of choosing a good target set be automated? What are the analogs of our results in the setting where pairwise comparisons can be elicited actively?
  • As a possible negative consequence, getting this choice wrong could lead to incorrect inferences and unexpected behavior in the real world.
Results
  • 60% of the people preferred Policy A over Policy B, making it the von Neumann winner.
  • Set S1 requires feasible score vectors to satisfy 40% of the population along both comfort and speed.
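The von Neumann winner referenced above is the maximin strategy of the zero-sum game induced by the pairwise preference matrix: a distribution over objects that is preferred to every other distribution with probability at least 1/2. As a minimal sketch (the function name and the use of SciPy's LP solver are our choices, not the paper's procedure), it can be computed by solving the standard maximin linear program:

```python
import numpy as np
from scipy.optimize import linprog

def von_neumann_winner(M):
    """Maximin distribution of the zero-sum game with payoff A = M - 1/2.

    M[i, j] is the probability that object i is preferred to object j, so
    M + M.T == 1 and A is skew-symmetric with game value 0.
    """
    n = M.shape[0]
    A = M - 0.5
    # Variables: p_1, ..., p_n, v.  Maximize v  <=>  minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every pure opponent column j:  v - (A^T p)_j <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # p must be a probability distribution.
    A_eq = np.zeros((1, n + 1))
    A_eq[0, :n] = 1.0
    b_eq = np.ones(1)
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]
```

For the two-policy example above, M = [[0.5, 0.6], [0.4, 0.5]] puts all mass on Policy A; for a cyclic "rock-paper-scissors" matrix the winner is the uniform distribution, which is exactly the case where randomized solution concepts are needed.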
Conclusion
  • Discussion and future work: In this paper, the authors considered the problem of eliciting and learning from preferences along multiple criteria, as a way to obtain rich feedback under weaker assumptions.
  • An important step towards deploying AI systems in the real world involves aligning their objectives with human values.
  • Examples of such objectives include safety for autonomous vehicles, fairness for recommender systems, and effectiveness of assistive medical devices.
  • The authors' paper takes a step towards accomplishing this goal by providing a framework to aggregate human preferences along such subjective criteria, which are often hard to encode mathematically.
  • As a possible negative consequence, getting this choice wrong could lead to incorrect inferences and unexpected behavior in the real world.
Related work
  • Most closely related to our work is the field of computational social choice, which has focused on defining notions of winners from overall pairwise comparisons (see the survey [37] for a review). Amongst them, three deterministic notions of a winner—the Condorcet [17], Borda [12], and Copeland [18] winners—have been widely studied. In addition, Dudík et al. [22] recently introduced the notion of a (randomized) von Neumann winner. Starting with the work of Yue et al. [57], there have been several research papers studying an online version of preference learning, called the Dueling Bandits problem. Algorithms have been proposed to compete with Condorcet [60, 62, 4], Copeland [61, 56], Borda [30] and von Neumann [22] winners.

    The theoretical foundations of decision making based on multiple criteria have been widely studied within the operations research community. This sub-field—called multiple-criteria decision analysis—has focused largely on scoring, classification, and sorting based on multiple-criteria feedback. See the surveys [44, 63] for thorough overviews of existing methods and their associated guarantees. The problem of eliciting the user’s relative weighting of the various criteria has also been considered [20]. However, relatively less attention has been paid to the study of randomized decisions and statistical inference, both of which form the focus of our work. From an applied perspective, the combination of multi-criteria assessments has received attention in disparate fields such as psychometrics [40, 35], healthcare [50], and recidivism prediction [55]. In many of these cases, a variety of approaches—both linear and non-linear—have been empirically evaluated [19]. Justification for non-linear aggregation of scores along the criteria has a long history in psychology and the behavioral sciences [27, 24, 54].
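The three deterministic winners surveyed above can each be read off a pairwise preference matrix directly. A small illustrative sketch (the function name is ours, not from any of the cited works):

```python
import numpy as np

def classical_winners(M):
    """Condorcet, Borda, and Copeland winners from a pairwise preference matrix.

    M[i, j] is the probability that object i is preferred to object j.
    Returns (condorcet, borda, copeland); the Condorcet winner is None
    when no object beats all others head-to-head.
    """
    n = M.shape[0]
    wins = M > 0.5                 # head-to-head victories
    np.fill_diagonal(wins, False)
    # Condorcet winner: beats every other object (need not exist).
    condorcet = next((i for i in range(n) if wins[i].sum() == n - 1), None)
    # Borda winner: highest total preference probability against the rest.
    borda = int(np.argmax(M.sum(axis=1)))
    # Copeland winner: most head-to-head victories.
    copeland = int(np.argmax(wins.sum(axis=1)))
    return condorcet, borda, copeland
```

On a cyclic preference matrix the Condorcet winner does not exist and Borda/Copeland break the cycle arbitrarily, which is the failure mode that motivates randomized concepts such as the von Neumann and Blackwell winners.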
Funding
  • AP is supported by a Swiss Re research fellowship at the Simons Institute for the Theory of Computing and KB is supported by a JP Morgan AI Fellowship
  • This work was partially supported by an Office of Naval Research Young Investigator Award and an AFOSR grant to ADD, and by Office of Naval Research Grant DOD ONR-N00014-18-1-2640 to MJW
Study subjects and analysis
subjects: 50
The cumulative comparison data is given in Appendix D, and the average weight vector elicited from the users was found to be w1 = [0.21, 0.19, 0.20, 0.18, 0.22]. We ran this study with 50 subjects. In the overall preference elicitation, we saw an approximate ordering amongst the base policies: C ≻ E ≻ D ≻ B ≻ A

subjects: 41
The Blackwell winners R1 and R2 for the sets S1 and S2 with the ℓ∞ distance were found to be R1 = [0.09, 0.15, 0.30, 0.15, 0.31] and R2 = [0.01, 0.01, 0.31, 0.02, 0.65]. In the second phase, we obtained preferences from a set of 41 subjects comparing the randomized policies R1 and R2 with the baseline policies A–E. The results are aggregated in Table 1 in Appendix D
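The Blackwell winner computed in the study minimizes the worst-case distance of the expected multi-criteria score vector to a target set S. The following is a rough sketch for box-shaped target sets S = {z : z ≥ τ} (like S1, which demands at least a 40% satisfaction rate along each criterion) under the ℓ∞ distance; the function name and the use of a generic SLSQP solver are our assumptions, not the paper's estimation procedure:

```python
import numpy as np
from scipy.optimize import minimize

def blackwell_winner(P, tau):
    """Sketch: mixture minimizing worst-case l-inf distance to S = {z : z >= tau}.

    P has shape (n, n, d): P[i, j, k] = fraction preferring object i over
    object j along criterion k.  For the box set S,
        dist_inf(z, S) = max_k max(tau[k] - z[k], 0),
    and since this distance is convex in the opponent's mixture, the worst
    case is attained at a pure opponent j.
    """
    n = P.shape[0]

    def worst_case_dist(x):
        # Z[j] = expected score vector of mixture x against pure opponent j.
        Z = np.einsum("i,ijk->jk", x, P)
        return float(np.max(np.maximum(tau - Z, 0.0)))

    x0 = np.full(n, 1.0 / n)  # start from the uniform mixture
    res = minimize(worst_case_dist, x0,
                   bounds=[(0.0, 1.0)] * n,
                   constraints=({"type": "eq", "fun": lambda x: x.sum() - 1.0},),
                   method="SLSQP")
    return res.x, res.fun
```

This is only a local-search sketch: the objective is piecewise-linear and nonsmooth, so a careful implementation would use the sample-based procedure and guarantees developed in the paper rather than an off-the-shelf smooth optimizer.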

References
  • J. Abernethy, P. L. Bartlett, and E. Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pages 27–46, 2011.
  • A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, 2010.
  • M. Aghassi and D. Bertsimas. Robust game theory. Mathematical Programming, 107(1-2):231–273, 2006.
  • N. Ailon, Z. Karnin, and T. Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
  • A. R. Alimov and I. Tsar’kov. Connectedness and other geometric properties of suns and Chebyshev sets. Fundamentalnaya i Prikladnaya Matematika, 19(4):21–91, 2014.
  • S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza. Power to the people: The role of humans in interactive machine learning. AI Magazine, 35(4):105–120, 2014.
  • K. J. Arrow et al. Social Choice and Individual Values. 1951.
  • V. Balestro, H. Martini, and R. Teixeira. Convex analysis in normed spaces and metric projections onto convex bodies. arXiv preprint arXiv:1908.08742, 2019.
  • E. Bıyık, N. Huynh, M. J. Kochenderfer, and D. Sadigh. Active preference-based Gaussian process regression for reward learning. arXiv preprint arXiv:2005.02575, 2020.
  • D. Black. On the rationale of group decision-making. Journal of Political Economy, 56(1):23–34, 1948.
  • D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
  • J. d. Borda. Mémoire sur les élections au scrutin. Histoire de l’Academie Royale des Sciences pour 1781 (Paris, 1784), 1784.
  • R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8, 2015.
  • O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1):1–41, 2012.
  • P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
  • M. d. Condorcet. Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. 1785.
  • A. H. Copeland. A reasonable social welfare function. Technical report, University of Michigan, 1951.
  • K. M. Douglas and R. J. Mislevy. Estimating classification accuracy for complex decision rules based on multiple scores. Journal of Educational and Behavioral Statistics, 35(3):280–306, 2010.
  • M. Doumpos and C. Zopounidis. Regularized estimation for preference disaggregation in multiple criteria decision making. Computational Optimization and Applications, 38(1):61–80, 2007.
  • J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5), 2015.
  • M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. In Conference on Learning Theory, 2015.
  • A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2005.
  • D. Frisch and R. T. Clemen. Beyond expected utility: Rethinking behavioral decision research. Psychological Bulletin, 116(1):46, 1994.
  • D. Fudenberg and D. K. Levine. Self-confirming equilibrium. Econometrica: Journal of the Econometric Society, pages 523–545, 1993.
  • S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2013.
  • W. M. Goldstein and J. Beattie. Judgments of relative importance in decision making: The importance of interpretation and the interpretation of importance. In Frontiers of Mathematical Psychology, pages 110–137.
  • K. Hofmann, S. Whiteson, and M. De Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 249–258, 2011.
  • E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17), 2008.
  • K. G. Jamieson, S. Katariya, A. Deshpande, and R. D. Nowak. Sparse dueling bandits. 2015.
  • V. Kuleshov and S. Ermon. Estimating uncertainty online against an adversary. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • E. Lehrer. Partially specified probabilities: Decisions and games. American Economic Journal: Microeconomics, 4(1):70–100, 2012.
  • R. D. Luce. Individual Choice Behavior. 1959.
  • S. Mannor, V. Perchet, and G. Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Conference on Learning Theory, pages 339–355, 2014.
  • M. T. McBee, S. J. Peters, and C. Waterman. Combining scores in multiple-criteria assessment systems: The impact of combination rule. Gifted Child Quarterly, 58(1):69–89, 2014.
  • S. Miryoosefi, K. Brantley, H. Daumé III, M. Dudík, and R. E. Schapire. Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems, pages 14070–14079, 2019.
  • H. Moulin. Handbook of Computational Social Choice. Cambridge University Press, 2016.
  • Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 2017.
  • M. Palan, N. C. Landolfi, G. Shevchuk, and D. Sadigh. Learning reward functions by integrating human demonstrations and preferences. arXiv preprint arXiv:1906.08928, 2019.
  • J. P. Papay. Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1):163–193, 2011.
  • J.-P. Penot and R. Ratsimahalo. Characterizations of metric projections in Banach spaces and applications. In Abstract and Applied Analysis, volume 3, 1970.
  • V. Perchet. Approachability, regret and calibration; implications and equivalences. arXiv preprint arXiv:1301.2663, 2013.
  • V. Perchet. A note on robust Nash equilibria with uncertainties. RAIRO-Operations Research, 48(3):365–371, 2014.
  • J.-C. Pomerol and S. Barba-Romero. Multicriterion Decision in Management: Principles and Practice, volume 25. Springer Science & Business Media, 2012.
  • D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems, 2017.
  • W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2067–2069, 2018.
  • M. Schulze. A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36(2):267–303, 2011.
  • O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, 2013.
  • O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1), 2017.
  • A. Teixeira-Pinto and S.-L. T. Normand. Statistical methodology for classifying units on the basis of multiple-related measures. Statistics in Medicine, 27(9):1329–1350, 2008.
  • L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
  • A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
  • A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.
  • A. Tversky and D. Kahneman. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.
  • G. D. Walters. Taking the next step: Combining incrementally valid indicators to improve recidivism prediction. Assessment, 18(2):227–233, 2011.
  • H. Wu and X. Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
  • Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
  • L. Zajíček. On the Fréchet differentiability of distance functions. Proceedings of the 12th Winter School on Abstract Analysis, pages 161–165, 1984.
  • J. Zhang, P. Fiers, K. A. Witte, R. W. Jackson, K. L. Poggensee, C. G. Atkeson, and S. H. Collins. Human-in-the-loop optimization of exoskeleton assistance during walking. Science, 356(6344):1280–1284, 2017.
  • M. Zoghi, S. Whiteson, R. Munos, and M. De Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2013.
  • M. Zoghi, Z. S. Karnin, S. Whiteson, and M. De Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015.
  • M. Zoghi, S. Whiteson, and M. de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17–26, 2015.
  • C. Zopounidis and M. Doumpos. Multicriteria classification and sorting methods: A literature review. European Journal of Operational Research, 138(2):229–246, 2002.