Breaking the Curse of Many Agents: Provable Mean Embedding Q-Iteration for Mean-Field Reinforcement Learning

ICML, pp. 10092–10103 (2020)

Abstract

Multi-agent reinforcement learning (MARL) achieves significant empirical successes. However, MARL suffers from the curse of many agents. In this paper, we exploit the symmetry of agents in MARL. In the most generic form, we study a mean-field MARL problem. Such a mean-field MARL is defined on mean-field states, which are distributions […]

Introduction
  • Reinforcement learning (RL) (Sutton and Barto, 2018) searches for the optimal policy for sequential decision making through interacting with environments and learning from experiences.
  • By Propositions 2.1 and 2.2, the optimal action-value function Q∗ is related to the joint state through the empirical state distribution Ms defined in (2.10).
  • To capture such limiting dynamics of infinitely many agents with exchangeability, the authors define an MDP with M(S), the space of probability measures supported on S, as the mean-field state space as follows.
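As a toy illustration of this exchangeability argument (a sketch under assumed toy dynamics, not the paper's code; the function names and the particular value function are hypothetical), the snippet below represents the empirical state distribution M_s by the sorted multiset of observed agent states and checks numerically that an action-value defined through it is invariant under relabeling of the agents.

```python
# Minimal sketch: with exchangeable agents, a permutation-invariant function of the
# joint state can depend on it only through the empirical state distribution
# M_s = (1/N) * sum_i delta_{s_i}. We encode M_s by the sorted multiset of states.
import numpy as np

def empirical_distribution(joint_state: np.ndarray) -> np.ndarray:
    """Sorted multiset of agent states: a lossless, permutation-invariant
    representation of the empirical distribution M_s."""
    return np.sort(joint_state, axis=0)

def q_of_mean_field(action: float, joint_state: np.ndarray) -> float:
    """A toy action-value that depends on the joint state only through M_s
    (here via its first two empirical moments); purely illustrative."""
    m_s = empirical_distribution(joint_state)
    return float(action * m_s.mean() - 0.5 * m_s.var())

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 1))      # N = 100 observed agent states, S a subset of R
perm = rng.permutation(100)
assert np.isclose(q_of_mean_field(1.0, states),
                  q_of_mean_field(1.0, states[perm]))  # invariant under relabeling
```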
Highlights
  • Reinforcement learning (RL) (Sutton and Barto, 2018) searches for the optimal policy for sequential decision making through interacting with environments and learning from experiences
  • For cooperative tasks, multi-agent reinforcement learning (MARL) searches for the optimal policy that maximizes the social welfare (Ng, 1975), i.e., the expected total reward obtained by all agents (Tan, 1993; Panait and Luke, 2005; Wang and Sandholm, 2003; Claus and Boutilier, 1998; Lauer and Riedmiller, 2000; Dzeroski et al., 2001; Guestrin et al., 2002; Kar et al., 2013; Zambaldi et al., 2018).
  • We study mean-field MARL in the collaborative setting, where the mean-field states are distributions over a continuous space S.
  • We show that the mean-field fitted Q-iteration (MF-FQI) algorithm breaks the curse of many agents in the sense that its computational complexity scales only linearly in the number of observed agents, while its statistical accuracy enjoys a “blessing of many agents”, that is, a larger number of observed agents improves the statistical accuracy.
  • Our contribution is three-fold: (i) we propose the first model-free mean-field MARL algorithm, namely MF-FQI, that allows for continuous support with provable guarantees; (ii) we prove that MF-FQI breaks the curse of many agents by establishing its nonasymptotic computational and statistical rates of convergence; (iii) we motivate a principled framework for exploiting invariance in MARL, e.g., exchangeability, via mean embedding.
  • To learn the optimal action-value function Q∗ defined on M(Ω), which is a space of distributions, we introduce mean embedding, which embeds the space of distributions into a reproducing kernel Hilbert space (RKHS).
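To make the mean-embedding step concrete, here is a minimal sketch (an illustration under assumptions, not the paper's implementation): it embeds an empirical distribution into the RKHS induced by a Gaussian RBF kernel, so that μ_p̂(·) = (1/N) Σ_i K(x_i, ·) becomes an ordinary function that can be evaluated at query points and compared across distributions. The kernel choice, bandwidth, and sample sizes are assumptions made for illustration.

```python
# Minimal sketch of the kernel mean embedding of an empirical distribution into an RKHS.
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), evaluated pairwise."""
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_embedding(sample: np.ndarray):
    """Return mu(.) = (1/N) * sum_i K(x_i, .), the RKHS embedding of the empirical
    distribution (1/N) * sum_i delta_{x_i}."""
    return lambda query: rbf_kernel(sample, query).mean(axis=0)

# Two empirical distributions over the same space; their embeddings are ordinary
# functions, so comparing distributions reduces to comparing functions in the RKHS.
rng = np.random.default_rng(0)
p_sample = rng.normal(0.0, 1.0, size=(200, 1))
q_sample = rng.normal(0.5, 1.0, size=(200, 1))
grid = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
print(mean_embedding(p_sample)(grid))   # mu_p evaluated at a few query points
print(mean_embedding(q_sample)(grid))
```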
Results
  • The authors highlight that the mean-field MARL setting faces two challenges: (i) learning the value function and policy is intractable, as they are functionals of distributions, which are infinite-dimensional since S is continuous; and (ii) the mean-field state is only accessible through the observation of a finite number of agents, which provides only partial information.
  • To learn the optimal action-value function Q∗ defined on M(Ω), which is a space of distributions, the authors introduce mean embedding, which embeds the space of distributions into a reproducing kernel Hilbert space (RKHS).
  • The authors highlight that the Hölder continuity of K(·, ·) allows for an approximation of the mean embedding μp based on the empirical approximation p̂ of the distribution p ∈ M(Ω) from finitely many observations, which in turn allows for an approximation of the action-value function with finite observations and thus tackles this challenge.
  • For an empirical approximation of the state-action configuration ω_{a,p_s} = δ_a × p_s, where p_s is the empirical distribution of the observed states {s_i}_{i∈[N]}, the mean embedding takes the form μ_{ω_{a,p_s}}(·) = (1/N) Σ_{i∈[N]} K((a, s_i), ·).
  • Under the mean embedding with the feature mappings defined in (2.19), the parameterization of the action-value function takes the form Q = f(1/N · Σ_{i∈[N]} φ(a, s_i)), where φ denotes the feature mapping from (2.19).
  • The authors propose an algorithm that learns the optimal action-value function Q∗ from samples {δ_{a_i} × p_{i,s}}_{i∈[n]} that follow a sampling distribution ν over the space of state-action configurations M(Ω).
  • Here μ_{ω_{a,p_{i,s′}}} is the mean embedding of the distribution ω_{a,p_{i,s′}} = δ_a × p_{i,s′}, where p_{i,s′} is the empirical distribution supported on the set {s′_{i,j}}_{j∈[N]}, and Q_{λ_k} is the approximation of the optimal action-value function at the k-th iteration of MF-FQI (a schematic sketch of one such update follows this list).
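The bullets above can be pieced together into a schematic picture of one MF-FQI-style update. The sketch below is an assumption-heavy stand-in rather than the paper's Algorithm 1: it uses a finite action set, random Fourier features in place of the kernel machinery behind (2.19), and a linear f fitted by ridge regression. It does, however, show why evaluating Q costs only linear time in the number N of observed agents: the embedding is a plain average of N feature vectors.

```python
# Schematic MF-FQI-style iteration under simplifying assumptions (finite actions,
# random Fourier features, linear f with ridge regression); not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
D, GAMMA, LAMBDA = 64, 0.9, 1e-2            # feature dimension, discount, ridge penalty
ACTIONS = np.arange(4)                      # finite action set A (assumption)
W = rng.normal(size=(D, 2))                 # random Fourier frequencies for (a, s) pairs
B = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi(a: int, s: float) -> np.ndarray:
    """Feature map phi(a, s) on a state-action pair (random Fourier features)."""
    return np.sqrt(2.0 / D) * np.cos(W @ np.array([float(a), float(s)]) + B)

def embed(a: int, states: np.ndarray) -> np.ndarray:
    """Mean embedding of the configuration delta_a x p_s: (1/N) * sum_j phi(a, s_j)."""
    return np.mean([phi(a, s) for s in states], axis=0)

def q_value(theta: np.ndarray, a: int, states: np.ndarray) -> float:
    """Linear-in-embedding action-value Q(a, p_s) = <theta, mu_{a,p_s}>."""
    return float(theta @ embed(a, states))

def mf_fqi_step(theta: np.ndarray, batch) -> np.ndarray:
    """One fitted Q-iteration step: regress Bellman targets onto embedding features."""
    X, y = [], []
    for a, states, reward, next_states in batch:
        X.append(embed(a, states))
        y.append(reward + GAMMA * max(q_value(theta, b, next_states) for b in ACTIONS))
    X, y = np.asarray(X), np.asarray(y)
    return np.linalg.solve(X.T @ X + LAMBDA * np.eye(D), X.T @ y)   # ridge regression

# Synthetic batch of n = 200 sampled configurations, each observing N = 50 agents.
batch = [(int(rng.integers(4)), rng.normal(size=50), float(rng.normal()),
          rng.normal(size=50)) for _ in range(200)]
theta = np.zeros(D)
for _ in range(5):                          # a few MF-FQI-style iterations
    theta = mf_fqi_step(theta, batch)
```

In the paper, the fitting step is instead carried out over a function class built on the RKHS with a regularization parameter λ (hence the Q_{λ_k} notation above); the random-feature and ridge-regression shortcuts here are only meant to keep the sketch self-contained and runnable.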
Conclusion
  • Under Assumption 3.3, the following theorem characterizes the one-step approximation error of MF-FQI defined in Algorithm 1.
  • By Theorem 3.5, the approximation error of the action-value function attained by MF-FQI is characterized by the three terms on the right-hand side of (3.11).
  • MF-FQI tackles the “curse of many agents” via mean embedding of the mean-field state for the policy evaluation step, which approximately calculates the action-value function for the greedy policy.
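The exact statement of (3.11) is not reproduced here. As a rough template only, borrowed from standard fitted Q-iteration analyses (e.g., Munos and Szepesvari, 2008; Farahmand et al., 2010) rather than from the paper, the three sources of error typically enter a bound of the following shape:

```latex
\[
\bigl\| Q^{*} - Q^{\pi_{K}} \bigr\|
\;\lesssim\;
\underbrace{\frac{\gamma}{(1-\gamma)^{2}}\,\varepsilon_{\mathrm{approx}}}_{\text{bias of the function class}}
\;+\;
\underbrace{\frac{\gamma}{(1-\gamma)^{2}}\,\varepsilon_{\mathrm{stat}}(n, N)}_{\text{shrinks with more samples } n \text{ and agents } N}
\;+\;
\underbrace{\frac{\gamma^{K+1}}{(1-\gamma)^{2}}\,R_{\max}}_{\text{vanishes geometrically in } K}
\]
```

This template matches the qualitative message above: a richer function class and larger n and N tighten the first two terms, while additional iterations K of MF-FQI drive the last term to zero.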
References
  • Acciaio, B., Backhoff-Veraguas, J. and Carmona, R. (2018). Extended mean field control problems: stochastic maximum principle and transport perspective. arXiv preprint arXiv:1802.05754.
  • Altun, Y. and Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In International Conference on Computational Learning Theory. Springer.
  • Andersson, D. and Djehiche, B. (2011). A maximum principle for SDEs of mean-field type. Applied Mathematics & Optimization, 63 341–356.
  • Antos, A., Szepesvari, C. and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 89–129.
  • Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.
  • Bensoussan, A., Frehse, J., Yam, P. et al. (2013). Mean field games and mean field type control theory, vol.
  • Bloem-Reddy, B. and Teh, Y. W. (2019). Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082.
  • Bu, L., Babu, R., De Schutter, B. et al. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38 156–172.
  • Buckdahn, R., Djehiche, B. and Li, J. (2011). A general stochastic maximum principle for SDEs of mean-field type. Applied Mathematics & Optimization, 64 197–216.
  • Buckdahn, R., Djehiche, B., Li, J., Peng, S. et al. (2009). Mean-field backward stochastic differential equations: a limit approach. The Annals of Probability, 37 1524–1565.
  • Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7 331–368.
  • Carmona, R. and Delarue, F. (2018). Probabilistic Theory of Mean Field Games with Applications I-II. Springer.
  • Carmona, R., Delarue, F. and Lachapelle, A. (2013). Control of McKean–Vlasov dynamics versus mean field games. Mathematics and Financial Economics, 7 131–166.
  • Carmona, R., Delarue, F. et al. (2015). Forward–backward stochastic differential equations and controlled McKean–Vlasov dynamics. The Annals of Probability, 43 2647–2700.
  • Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360.
  • Claus, C. and Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998 746–752.
  • Dzeroski, S., De Raedt, L. and Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43 7–52.
  • Ernst, D., Geurts, P. and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6 503–556.
  • Farahmand, A.-m., Ghavamzadeh, M., Szepesvari, C. and Mannor, S. (2009). Regularized fitted q-iteration for planning in continuous-space markovian decision problems. In 2009 American Control Conference. IEEE.
  • Farahmand, A.-m., Ghavamzadeh, M., Szepesvari, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17 4809–4874.
  • Farahmand, A.-m., Szepesvari, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.
  • Fornasier, M. and Solombrino, F. (2014). Mean-field optimal control. ESAIM: Control, Optimisation and Calculus of Variations, 20 1123–1152.
  • Fukumizu, K., Gretton, A., Sun, X. and Scholkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems.
  • Gartner, T., Flach, P. A., Kowalczyk, A. and Smola, A. J. (2002). Multi-instance kernels. In International Conference on Machine Learning, vol. 2.
  • Gomes, D. A., Nurbekyan, L. and Pimentel, E. (2015). Economic models and mean-field games theory. In 30th Brazilian Mathematics Colloquium.
  • Gretton, A., Borgwardt, K., Rasch, M., Scholkopf, B. and Smola, A. J. (2007). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems.
  • Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B. and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13 723–773.
  • Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K. and Scholkopf, B. (2009). Covariate shift by kernel mean matching. Dataset shift in machine learning, 3 5.
  • Gueant, O., Lasry, J.-M. and Lions, P.-L. (2011). Mean field games and applications. In Paris-Princeton lectures on mathematical finance 2010.
  • Guestrin, C., Lagoudakis, M. and Parr, R. (2002). Coordinated reinforcement learning. In ICML, vol.
  • Guo, X., Hu, A., Xu, R. and Zhang, J. (2019). Learning mean-field games. arXiv preprint arXiv:1901.09585.
  • Haussler, D. (1999). Convolution kernels on discrete structures. Tech. rep., Department of Computer Science, University of California.
  • Hu, J. and Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4 1039–1069.
  • Huang, M., Caines, P. E. and Malhame, R. P. (2003). Individual and mass behaviour in large population stochastic wireless power control problems: centralized and Nash equilibrium solutions. In 42nd IEEE International Conference on Decision and Control, vol.
  • Huang, M., Caines, P. E. and Malhame, R. P. (2007). Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Transactions on Automatic Control, 52 1560–1571.
  • Huang, M., Caines, P. E. and Malhame, R. P. (2012). Social optima in mean field LQG control: centralized and decentralized strategies. IEEE Transactions on Automatic Control, 57 1736–1751.
  • Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
  • Yang, J., Ye, X., Trivedi, R., Xu, H. and Zha, H. (2017). Learning deep mean field games for modeling large population behavior. arXiv preprint arXiv:1711.03156.
  • Jiang, J. and Lu, Z. (2018). Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems.
  • Kar, S., Moura, J. M. and Poor, H. V. (2013). QD-learning: A collaborative distributed strategy for multiagent reinforcement learning through Consensus + Innovations. IEEE Transactions on Signal Processing, 61 1848–1862.
  • Kondor, R. and Jebara, T. (2003). A kernel between sets of vectors. In International Conference on Machine Learning.
  • Lasry, J.-M. and Lions, P.-L. (2006a). Jeux à champ moyen. I – Le cas stationnaire. Comptes Rendus Mathématique, 343 619–625.
  • Lasry, J.-M. and Lions, P.-L. (2006b). Jeux à champ moyen. II – Horizon fini et contrôle optimal. Comptes Rendus Mathématique, 343 679–684.
  • Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese journal of mathematics, 2 229–260.
  • Lauer, M. and Riedmiller, M. (2000). An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In International Conference on Machine Learning. Citeseer.
  • Lazaric, A., Ghavamzadeh, M. and Munos, R. (2016). Analysis of classification-based policy iteration algorithms. Journal of Machine Learning Research, 17 583–612.
  • Li, M., Jiao, Y., Yang, Y., Gong, Z., Wang, J., Wang, C., Wu, G., Ye, J. et al. (2019). Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. arXiv preprint arXiv:1901.11454.
  • Lin, K., Zhao, R., Xu, Z. and Zhou, J. (2018). Efficient large-scale fleet management via multi-agent deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM.
  • Lin, S.-B., Guo, X. and Zhou, D.-X. (2017). Distributed learning with regularized least squares. Journal of Machine Learning Research, 18 3202–3232.
  • Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings. Elsevier, 157–163.
  • Meyer-Brandis, T., Øksendal, B. and Zhou, X. Y. (2012). A mean-field stochastic maximum principle via Malliavin calculus. Stochastics: An International Journal of Probability and Stochastic Processes, 84 643–666.
  • Moll, B., Rachel, L. and Restrepo, P. (2019). Uneven growth: automation's impact on income and wealth inequality. Manuscript, Princeton University.
  • Muandet, K., Fukumizu, K., Dinuzzo, F. and Scholkopf, B. (2012). Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems.
  • Munos, R. and Szepesvari, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 815–857.
  • Nash, J. (1951). Non-cooperative games. Annals of Mathematics 286–295.
  • Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y. and Srebro, N. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.
  • Ng, Y.-K. (1975). Bentham or Bergson? Finite sensibility, utility functions and social welfare functions. The Review of Economic Studies, 42 545–569.
  • OpenAI (2018). OpenAI Five. https://blog.openai.com/openai-five/.
  • Panait, L. and Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11 387–434.
  • Scherrer, B. (2013). On the performance bounds of some policy search dynamic programming algorithms. arXiv preprint arXiv:1306.0539.
  • Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. (2015). Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 16 1629–1676.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shalev-Shwartz, S., Shammah, S. and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 484.
  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550 354.
  • Smola, A., Gretton, A., Song, L. and Scholkopf, B. (2007). A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory. Springer.
  • Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Scholkopf, B. and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11 1517–1561.
  • Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT press.
  • Szabo, Z., Gretton, A., Poczos, B. and Sriperumbudur, B. (2015). Two-stage sampled learning theory on distributions. In Artificial Intelligence and Statistics.
  • Szepesvari, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In International Conference on Machine Learning.
  • Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning.
  • Tolstikhin, I., Sriperumbudur, B. K. and Muandet, K. (2017). Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18 3002–3048.
  • Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D. and Silver, D. (2019). AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
  • Wang, X. and Sandholm, T. (2003). Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems.
  • Yang, E. and Gu, D. (2004). Multiagent reinforcement learning for multi-robot systems: A survey. Tech. rep., University of Essex.
  • Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W. and Wang, J. (2018). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.
  • Yang, Z., Xie, Y. and Wang, Z. (2019). A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137.
  • Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R. and Smola, A. J. (2017). Deep sets. In Advances in Neural Information Processing Systems.
  • Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E. et al. (2018). Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
  • We now establish a topological structure on the space M(Ω) adopted from Szabó et al. (2015). Recall that we denote by Ω = S × A the space of state-action pairs. We assume that Ω is a Polish space, and denote by B(Ω) the Borel σ-algebra of Ω. We denote by M0(A) the space of all point mass distributions on A, and denote by M(S) the space of all distributions on S. We assume that both M0(A) and M(S) are equipped with the weak topology, such that for all f ∈ CB(A) and g ∈ CB(S), the mappings p ↦ ∫ f(x) dp(x) and q ↦ ∫ g(x) dq(x) are continuous for p ∈ M0(A) and q ∈ M(S), respectively. Note that any p ∈ M0(A) and q ∈ M(S) define a product measure ω = p × q ∈ M(Ω) on (Ω, B(A) ⊗ B(S)). We endow the set M(Ω) with the product topology of the corresponding weak topologies defined on M0(A) and M(S), which makes M(Ω) a Polish space.
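Written out, the weak-topology requirement and the induced product state-action configuration read as follows; the test function h is a generic bounded measurable function introduced here only for illustration.

```latex
% Weak topology on M_0(A) and M(S): for every f in C_B(A) and g in C_B(S), the maps
\[
p \;\longmapsto\; \int_{A} f(x)\,\mathrm{d}p(x),
\qquad
q \;\longmapsto\; \int_{S} g(x)\,\mathrm{d}q(x)
\]
% are continuous. A point mass on A and a state distribution on S then induce the
% product state-action configuration, against which integration reduces to S:
\[
\omega \;=\; \delta_{a} \times q \;\in\; \mathcal{M}(\Omega),
\qquad
\int_{\Omega} h \,\mathrm{d}\omega \;=\; \int_{S} h(a, s)\,\mathrm{d}q(s).
\]
```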