Preference learning along multiple criteria: A game-theoretic perspective
NeurIPS 2020
The literature on ranking from ordinal data is vast, and there are several ways to aggregate overall preferences from pairwise comparisons between objects. In particular, it is well known that any Nash equilibrium of the zero-sum game induced by the preference matrix defines a natural solution concept (winning distribution over objects)...
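To make the zero-sum-game view concrete, here is a minimal sketch (not code from the paper) of computing a von Neumann winner from a pairwise preference matrix: the maximin distribution of the game whose payoff is the preference probability shifted so that a tie is worth zero, found by solving a small linear program. The example matrix is illustrative, not data from the study.

```python
import numpy as np
from scipy.optimize import linprog

def von_neumann_winner(P):
    """Maximin distribution of the zero-sum game induced by the preference
    matrix P, where P[i, j] is the probability that object i beats object j.

    Assuming P[i, j] + P[j, i] = 1, the returned distribution x satisfies
    x^T P y >= 1/2 against every opponent distribution y.
    """
    n = P.shape[0]
    M = P - 0.5                              # zero-sum payoffs; a tie is worth 0
    # Decision variables z = (x_1, ..., x_n, v); maximize v  <=>  minimize -v.
    c = np.append(np.zeros(n), -1.0)
    # Require (M^T x)_j >= v for every pure opponent strategy j.
    A_ub = np.hstack([-M.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # x must lie on the probability simplex.
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

# A cyclic preference matrix (A beats B, B beats C, C beats A) has no
# deterministic Condorcet winner; its von Neumann winner is the uniform mix.
P = np.array([[0.5, 0.6, 0.4],
              [0.4, 0.5, 0.6],
              [0.6, 0.4, 0.5]])
print(np.round(von_neumann_winner(P), 3))    # approximately [0.333, 0.333, 0.333]
```

The same maximin formulation reappears in the multi-criteria setting discussed below, where the scalar payoff is replaced by a vector of per-criterion preference probabilities.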
- Economists, social scientists, engineers, and computer scientists have long studied models for human preferences, under the broad umbrella of social choice theory [10, 7].
- Learning from human preferences has found applications in interactive robotics for learning reward functions [45, 39], in medical domains for personalizing assistive devices [59, 9], and in recommender systems for optimizing search engines [15, 28].
- An object could correspond to a product in a search query, or a policy or reward function in reinforcement learning.
- A vast body of classical work dating back to Condorcet and Borda [17, 12] has focused on defining and producing a “winning” object from the result of a set of pairwise comparisons.
- We introduced the notion of a Blackwell winner, which generalizes many known winning solution concepts
- We showed that the Blackwell winner was efficiently computable from samples with a simple and optimal procedure, and that it outperformed the von Neumann winner in a user study on autonomous driving
- Our work raises many interesting follow-up questions: How does the sample complexity vary as a function of the preference tensor P? Can the process of choosing a good target set be automated? What are the analogs of our results in the setting where pairwise comparisons can be elicited actively?
- 60% of the people preferred Policy A over Policy B, making Policy A the von Neumann winner.
- Set S1 requires feasible score vectors to satisfy at least 40% of the population along both the comfort and speed criteria; the sketch below illustrates this construction.
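As a concrete illustration, the following sketch (my own assumptions, not the authors' procedure) encodes a target set of the S1 flavor, at least 40% approval along every criterion, and searches for a randomized policy whose worst-case score vector is as close as possible to that set in the ℓ∞ distance. The preference-tensor layout, the threshold, and the use of a generic SLSQP solver are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def linf_dist_to_box(z, threshold=0.4):
    """l-infinity distance from score vector z to the target set
    {v : v_l >= threshold for every criterion l}."""
    return max(0.0, threshold - float(z.min()))

def worst_case_distance(x, P, threshold=0.4):
    """Worst-case distance of the score vector of mixture x to the target set.

    P[i, j, l] is the fraction of the population preferring object i to
    object j along criterion l. Because the distance is a convex function of
    a linear function of the opponent's mixture, the worst case is attained
    at a pure opponent strategy j.
    """
    scores = np.einsum('i,ijl->jl', x, P)     # row j: score vector against pure j
    return max(linf_dist_to_box(scores[j], threshold) for j in range(P.shape[1]))

def blackwell_winner_sketch(P, threshold=0.4):
    """Crude numerical search for a distribution minimizing the worst-case
    distance; a generic solver stands in for the paper's procedure."""
    n = P.shape[0]
    x0 = np.full(n, 1.0 / n)
    res = minimize(lambda x: worst_case_distance(x, P, threshold), x0,
                   bounds=[(0.0, 1.0)] * n,
                   constraints=({'type': 'eq', 'fun': lambda x: x.sum() - 1.0},),
                   method='SLSQP')
    return res.x / res.x.sum()                # renormalize against round-off
```

Because the objective is piecewise linear in x, this particular target set also admits an equivalent linear-programming formulation; the generic solver is used here only for brevity.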
Discussion and future work
In this paper, the authors considered the problem of eliciting and learning from preferences along multiple criteria, as a way to obtain rich feedback under weaker assumptions.
- An important step towards deploying AI systems in the real world involves aligning their objectives with human values.
- Examples of such objectives include safety for autonomous vehicles, fairness for recommender systems, and effectiveness of assistive medical devices.
- The authors' paper takes a step towards accomplishing this goal by providing a framework to aggregate human preferences along such subjective criteria, which are often hard to encode mathematically.
- As a possible negative consequence, getting the choice of target set wrong could lead to incorrect inferences and unexpected behavior in the real world.
- Most closely related to our work is the field of computational social choice, which has focused on defining notions of winners from overall pairwise comparisons (see the survey for a review). Amongst them, three deterministic notions of a winner—the Condorcet, Borda, and Copeland winners—have been widely studied. In addition, Dudík et al. recently introduced the notion of a (randomized) von Neumann winner. Starting with the work of Yue et al., there have been several research papers studying an online version of preference learning, called the Dueling Bandits problem. Algorithms have been proposed to compete with Condorcet [60, 62, 4], Copeland [61, 56], Borda, and von Neumann winners.
The theoretical foundations of decision making based on multiple criteria have been widely studied within the operations research community. This sub-field—called multiple-criteria decision analysis—has focused largely on scoring, classification, and sorting based on multiple-criteria feedback. See the surveys [44, 63] for thorough overviews of existing methods and their associated guarantees. The problem of eliciting the user’s relative weighting of the various criteria has also been considered. However, relatively less attention has been paid to the study of randomized decisions and statistical inference, both of which form the focus of our work. From an applied perspective, the combination of multi-criteria assessments has received attention in disparate fields such as psychometrics [40, 35], healthcare, and recidivism prediction. In many of these cases, a variety of approaches—both linear and non-linear—have been empirically evaluated. Justification for non-linear aggregation of scores along the criteria has a long history in psychology and the behavioral sciences [27, 24, 54].
- AP is supported by a Swiss Re research fellowship at the Simons Institute for the Theory of Computing, and KB is supported by a JP Morgan AI Fellowship.
- This work was partially supported by an Office of Naval Research Young Investigator Award and an AFOSR grant to ADD, and by Office of Naval Research Grant DOD ONR-N00014-18-1-2640 to MJW.
Study subjects and analysis
We ran this study with 50 subjects. The cumulative comparison data is given in Appendix D, and the average weight vector elicited from the users was found to be w1 = [0.21, 0.19, 0.20, 0.18, 0.22]. In the overall preference elicitation, we observed an approximate ordering amongst the base policies: C ≻ E ≻ D ≻ B ≻ A.
The Blackwell winners R1 and R2 for the sets S1 and S2 with the ℓ∞ distance were found to be R1 = [0.09, 0.15, 0.30, 0.15, 0.31] and R2 = [0.01, 0.01, 0.31, 0.02, 0.65]. In the second phase, we obtained preferences from a set of 41 subjects comparing the randomized policies R1 and R2 with the baseline policies A-E. The results are aggregated in Table 1 in Appendix D.
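For completeness, here is a minimal sketch of the plug-in step implied by the sampling procedure above: form an empirical preference tensor from the subjects' pairwise responses, then compute the winner for the estimated tensor. The tuple layout of the comparison records is a hypothetical assumption made for illustration, not the actual data format of the study.

```python
import numpy as np

def empirical_preference_tensor(comparisons, n_objects, n_criteria):
    """Plug-in estimate of the preference tensor from pairwise responses.

    comparisons: iterable of (i, j, criterion, i_won) records, one per
    answered query (a hypothetical layout). Returns P_hat with
    P_hat[i, j, l] = empirical fraction of responses preferring i over j
    along criterion l; unqueried pairs default to a tie (0.5).
    """
    wins = np.zeros((n_objects, n_objects, n_criteria))
    counts = np.zeros((n_objects, n_objects, n_criteria))
    for i, j, l, i_won in comparisons:
        counts[i, j, l] += 1
        counts[j, i, l] += 1
        wins[i, j, l] += 1 if i_won else 0
        wins[j, i, l] += 0 if i_won else 1
    P_hat = np.where(counts > 0, wins / np.maximum(counts, 1), 0.5)
    return P_hat
```

The estimated tensor can then be handed to a winner computation such as the sketches above.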
- J. Abernethy, P. L. Bartlett, and E. Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pages 27–46, 2011.
- A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, 2010.
- M. Aghassi and D. Bertsimas. Robust game theory. Mathematical Programming, 107(1-2): 231–273, 2006.
- N. Ailon, Z. Karnin, and T. Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
- A. R. Alimov and I. Tsar’kov. Connectedness and other geometric properties of suns and Chebyshev sets. Fundamentalnaya i Prikladnaya Matematika, 19(4):21–91, 2014.
- S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza. Power to the people: The role of humans in interactive machine learning. AI Magazine, 35(4):105–120, 2014.
- K. J. Arrow et al. Social choice and individual values. 1951.
- V. Balestro, H. Martini, and R. Teixeira. Convex analysis in normed spaces and metric projections onto convex bodies. arXiv preprint arXiv:1908.08742, 2019.
- E. Bıyık, N. Huynh, M. J. Kochenderfer, and D. Sadigh. Active preference-based Gaussian process regression for reward learning. arXiv preprint arXiv:2005.02575, 2020.
- D. Black. On the rationale of group decision-making. Journal of political economy, 56(1): 23–34, 1948.
- D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
- J. d. Borda. Mémoire sur les élections au scrutin. Histoire de l’Academie Royale des Sciences pour 1781 (Paris, 1784), 1784.
- R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8, 2015.
- O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1):1–41, 2012.
- P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
- M. d. Condorcet. Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. 1785.
- A. H. Copeland. A reasonable social welfare function. Technical report, mimeo, University of Michigan, 1951.
- K. M. Douglas and R. J. Mislevy. Estimating classification accuracy for complex decision rules based on multiple scores. Journal of Educational and Behavioral Statistics, 35(3):280–306, 2010.
- M. Doumpos and C. Zopounidis. Regularized estimation for preference disaggregation in multiple criteria decision making. Computational Optimization and Applications, 38(1):61–80, 2007.
- J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5), 2015.
- M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. In Conference on Learning Theory, 2015.
- A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, 2005.
- D. Frisch and R. T. Clemen. Beyond expected utility: Rethinking behavioral decision research. Psychological bulletin, 116(1):46, 1994.
- D. Fudenberg and D. K. Levine. Self-confirming equilibrium. Econometrica: Journal of the Econometric Society, pages 523–545, 1993.
- S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2013.
- W. M. Goldstein and J. Beattie. Judgments of relative importance in decision making: The importance of interpretation and the interpretation of importance. In Frontiers of mathematical psychology, pages 110–137.
- K. Hofmann, S. Whiteson, and M. De Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 249–258, 2011.
- E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17), 2008.
- K. G. Jamieson, S. Katariya, A. Deshpande, and R. D. Nowak. Sparse dueling bandits. 2015.
- V. Kuleshov and S. Ermon. Estimating uncertainty online against an adversary. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- E. Lehrer. Partially specified probabilities: Decisions and games. American Economic Journal: Microeconomics, 4(1):70–100, 2012.
- R. D. Luce. Individual choice behavior. 1959.
- S. Mannor, V. Perchet, and G. Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Conference on Learning Theory, pages 339–355, 2014.
- M. T. McBee, S. J. Peters, and C. Waterman. Combining scores in multiple-criteria assessment systems: The impact of combination rule. Gifted Child Quarterly, 58(1):69–89, 2014.
- S. Miryoosefi, K. Brantley, H. Daume III, M. Dudik, and R. E. Schapire. Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems, pages 14070–14079, 2019.
- H. Moulin. Handbook of Computational Social Choice. Cambridge University Press, 2016.
- Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 2017.
- M. Palan, N. C. Landolfi, G. Shevchuk, and D. Sadigh. Learning reward functions by integrating human demonstrations and preferences. arXiv preprint arXiv:1906.08928, 2019.
- J. P. Papay. Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1):163–193, 2011.
- J.-P. Penot and R. Ratsimahalo. Characterizations of metric projections in Banach spaces and applications. In Abstract and Applied Analysis, volume 3, 1998.
- V. Perchet. Approachability, regret and calibration; implications and equivalences. arXiv preprint arXiv:1301.2663, 2013.
- V. Perchet. A note on robust nash equilibria with uncertainties. RAIRO-Operations Research, 48(3):365–371, 2014.
- J.-C. Pomerol and S. Barba-Romero. Multicriterion decision in management: principles and practice, volume 25. Springer Science & Business Media, 2012.
- D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems, 2017.
- W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2067–2069, 2018.
- M. Schulze. A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36(2):267–303, 2011.
- O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, 2013.
- O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1), 2017.
- A. Teixeira-Pinto and S.-L. T. Normand. Statistical methodology for classifying units on the basis of multiple-related measures. Statistics in medicine, 27(9):1329–1350, 2008.
- L. L. Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927.
- A. B. Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
- A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.
- A. Tversky and D. Kahneman. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.
- G. D. Walters. Taking the next step: Combining incrementally valid indicators to improve recidivism prediction. Assessment, 18(2):227–233, 2011.
- H. Wu and X. Liu. Double thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
- Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
- L. Zajícek. On the Fréchet differentiability of distance functions. Proceedings of the 12th Winter School on Abstract Analysis, pages 161–165, 1984.
- J. Zhang, P. Fiers, K. A. Witte, R. W. Jackson, K. L. Poggensee, C. G. Atkeson, and S. H. Collins. Human-in-the-loop optimization of exoskeleton assistance during walking. Science, 356(6344):1280–1284, 2017.
- M. Zoghi, S. Whiteson, R. Munos, and M. De Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2013.
- M. Zoghi, Z. S. Karnin, S. Whiteson, and M. De Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015.
- M. Zoghi, S. Whiteson, and M. de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17–26, 2015.
- C. Zopounidis and M. Doumpos. Multicriteria classification and sorting methods: A literature review. European Journal of Operational Research, 138(2):229–246, 2002.