Gradient Regularized V-Learning for Dynamic Treatment Regimes

NeurIPS 2020


Abstract

Deciding how to optimally treat a patient, including how to select treatments over time among the multiple available treatments, represents one of the most important issues that need to be addressed in medicine today. A dynamic treatment regime (DTR) is a sequence of treatment rules indicating how to individualize treatments for a patient...

Introduction
  • Clinical decision-makers regularly face the daunting challenge of choosing from multiple treatment options and treatment timings.
  • While clinical trials represent the gold standard for causal inference, clinical trials for longitudinal studies are expensive to conduct.
  • They have few patients and narrow inclusion criteria, and usually do not follow patients for long periods of time.
  • Unlike in static settings, developing time-varying treatment rules in longitudinal settings poses unique opportunities to understand how diseases evolve under different treatment plans, how individual patients respond to medication over time, and which timings are optimal for assigning treatments.
  • A DTR is a sequence of time-varying treatment rules that determine which treatment to provide at each decision point.
Highlights
  • Clinical decision-makers regularly face the daunting challenge of choosing from multiple treatment options and treatment timings
  • Unlike in static settings, developing time-varying treatment rules in longitudinal settings poses unique opportunities to understand how diseases evolve under different treatment plans, how individual patients respond to medication over time, and which timings are optimal for assigning treatments
  • This section is organized as follows: (1) we first introduce a set of the outcome and propensity models, M_{t:T}, required for estimating the value function V_t(d_{t:T}); (2) we discuss under what condition the models in M_{t:T} can be used to construct an efficient estimator of V_t(d_{t:T}) in semiparametric theory; (3) we describe our neural network architecture that parameterizes the models in M_{t:T}, and a theory that demonstrates that our proposed regularizer can encourage the models to satisfy the optimality condition of efficient estimators; and (4) we briefly discuss the two Gradient Regularized V-Learning (GRV) based dynamic treatment regime (DTR) learning algorithms (the value function notation is sketched after this list)
  • We have introduced Gradient Regularized V-Learning (GRV), a novel regularization method that enables recurrent neural network models to estimate the value function of a target DTR accurately and learn better DTRs
  • We hope that GRV will become a useful regularization method when RNNs are deployed to tackle the challenges of treatment effect estimation and decision making in machine learning
  • Our work focuses on a particular aspect of DTR evaluation and does not cover other aspects that are important for the real-world applications of DTRs
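
For reference, the value function discussed above can be written in the standard potential-outcome form used in the DTR literature. The sketch below is a reconstruction from the notation in these highlights, not the paper's verbatim definition, which may differ in detail (for example, in whether intermediate outcomes are summed):

```latex
% Sketch of the value function notation (a reconstruction, not the paper's
% verbatim definition). Y(\bar{A}_{t-1}, d_{t:T}) is the potential cumulative
% outcome when the observed treatments are kept up to time t-1 and the rules
% d_t, ..., d_T are applied to the evolving history thereafter.
V_t(d_{t:T}) \;=\; \mathbb{E}\!\left[\, Y\!\left(\bar{A}_{t-1},\, d_{t:T}\right) \right],
\qquad
V(d) \;=\; V_1(d_{1:T}).
```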
Methods
  • The first set of experiments is based on two non-Markovian simulation studies adapted from [43].
  • The authors call them treatment cost trade-off and survival rate maximization, respectively.
  • In the treatment cost trade-off study, the DTR objective is to minimize the total treatment cost over the treatment trajectory while keeping the health metric above a threshold at the end of the trajectory.
  • In the survival rate maximization study, the DTR objective is to maximize the survival rate of cancer patients choosing between an invasive and a non-invasive treatment; toy versions of both objectives are sketched after this list.
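
To make the two objectives concrete, here is a minimal, hypothetical scoring sketch for a single simulated trajectory; the variable names, threshold, and penalty values are illustrative assumptions, not the simulation parameters used in the paper:

```python
from typing import Sequence

def cost_tradeoff_return(treatment_costs: Sequence[float], final_health: float,
                         health_threshold: float = 0.0, penalty: float = 100.0) -> float:
    """Toy return for the treatment cost trade-off objective: minimize the total
    treatment cost while keeping the health metric above a threshold at the end
    of the trajectory (threshold and penalty values are illustrative)."""
    violation = penalty if final_health < health_threshold else 0.0
    return -(sum(treatment_costs) + violation)  # higher return = lower cost with constraint met

def survival_return(survived: bool) -> float:
    """Toy return for the survival rate maximization objective:
    1.0 if the simulated patient survives the trajectory, else 0.0."""
    return 1.0 if survived else 0.0
```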
Conclusion
  • The authors have introduced Gradient Regularized V-Learning (GRV), a novel regularization method that enables recurrent neural network models to estimate the value function of a target DTR accurately and learn better DTRs.
  • The authors hope that GRV will become a useful regularization method when RNNs are deployed to tackle the challenges of treatment effect estimation and decision making in machine learning.
  • The authors' work can help to develop accurate and individualized decision-making systems in many real-world applications, such as treatment recommendation.
  • Offline DTR or policy evaluation is still a challenging problem because the authors often do not have sufficient samples to estimate the outcomes for all the possible treatment plans over time.
Tables
  • Table 1: Performance of benchmarks, our estimator, and algorithms: the MSEs of the value function estimators (lower is better) and the values of the learned DTRs (higher is better). The mean and standard deviation are computed over 20 independent runs on a testing set of 20,000 individuals
Related Work
  • In the literature, DTR learning in causal inference and off-policy evaluation (OPE) in batch reinforcement learning (BRL) are the two branches of methods that attempt to solve the estimation problem of V(d). Here, we summarize the existing value function estimators in three classes.

    Importance Sampling estimators. The importance sampling (IS) estimator of V_t(d_{t:T}) is given by the empirical version of the R.H.S. of Equation (2), with P(A_s | H_s) in the denominator approximated by a propensity score model g_s(A_s, H_s). In the DTR literature, backward outcome weighted learning (BOWL) [6] is a method that derives the treatment rule d_t by optimizing the IS estimator of V_t(d_t, d_{t+1:T}) backwardly through time, given the previously optimized rules d_{t+1:T}. Simultaneous outcome weighted learning (SOWL) [6] optimizes the treatment rules in d jointly based on the IS estimator of V(d). In BRL, IS estimators [25, 26, 27] are used to evaluate the value function under a target policy by reweighting the rewards in the historical data with the probability ratio of the target policy and the policy that generated the data. IS estimators are known to be consistent and unbiased but suffer from high variance due to the inverse propensity score product.
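
To make this construction concrete, a standard form of such an IS estimator over n observed trajectories is sketched below. The outcome variable Y and the indexing are assumptions based on the surrounding description; the paper's Equation (2) may differ in detail:

```latex
% Sketch of an IS estimator of V_t(d_{t:T}): reweight each trajectory's outcome
% by the product of indicator weights over the fitted propensity models g_s,
% which approximate the behavior probabilities P(A_s | H_s).
\widehat{V}^{\,\mathrm{IS}}_t(d_{t:T})
  \;=\;
  \frac{1}{n}\sum_{i=1}^{n}
  \left[\, \prod_{s=t}^{T}
    \frac{\mathbb{1}\{A^{(i)}_s = d_s(H^{(i)}_s)\}}{g_s\!\left(A^{(i)}_s, H^{(i)}_s\right)}
  \right] Y^{(i)} .
```

When a propensity g_s is small, the product of inverse propensities blows up, which is the source of the high variance noted above.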
Funding
  • This work was supported by GlaxoSmithKline (GSK), the US Office of Naval Research (ONR), and the National Science Foundation (NSF) under grant 1722516.
Study Subjects and Analysis
Individuals: 20,000
The DTR value is defined as the cumulative outcome under the learned DTR. Specifically, we generate a large testing dataset with 20,000 individuals. For each learned DTR, we let each individual in the testing set follow the treatments recommended by the DTR.
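
This evaluation loop can be summarized with a short sketch. The names `policy` and `simulate_step` are hypothetical stand-ins for the learned DTR and the simulation environment, not the authors' code:

```python
import numpy as np

def evaluate_dtr(policy, simulate_step, initial_covariates, horizon):
    """Estimate the value of a learned DTR: let each individual in the testing
    set follow the treatments recommended by the DTR and average the cumulative
    outcomes. `policy(history)` returns a treatment; `simulate_step(history, a)`
    returns (next_covariates, outcome). Both are hypothetical stand-ins."""
    values = []
    for x0 in initial_covariates:
        history = [x0]
        total_outcome = 0.0
        for _ in range(horizon):
            a = policy(history)                    # treatment recommended by the DTR
            x_next, y = simulate_step(history, a)  # simulated response and outcome
            total_outcome += y
            history += [a, x_next]
        values.append(total_outcome)
    return float(np.mean(values))                  # empirical DTR value on the testing set
```

Averaged over a 20,000-individual testing set, as described here, this gives the DTR values reported alongside the estimator MSEs in Table 1.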

Additional simulation studies: 3
Overall, the performance gain highlights the effectiveness of our regularizer, which constructs an efficient value function estimator by adapting the nuisance models to solve the estimating equation during their training. In Appendix C, we show that our method performs well on three additional simulation studies from [6], where the covariate vector is relatively high-dimensional while the training set is small, with hundreds of samples. We also provide an ablation study showing that removing the regularizer leads to worse performance for our method.
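
A minimal sketch of the regularization idea described above, assuming the regularizer penalizes the squared empirical mean of per-sample estimating-equation terms computed from the outcome and propensity models; the exact terms used by GRV are defined in the paper, and the code below is an illustration rather than the authors' implementation:

```python
import torch

def grv_style_loss(nuisance_loss: torch.Tensor,
                   estimating_terms: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Illustrative training objective: fit the nuisance (outcome and propensity)
    models while encouraging them to approximately solve the estimating equation,
    i.e. drive the empirical mean of the per-sample estimating terms toward zero.
    `estimating_terms` is a length-n tensor whose exact construction is an
    assumption here, following the paper's discussion of efficient estimators."""
    estimating_equation = estimating_terms.mean()             # (1/n) * sum_i D_i
    return nuisance_loss + lam * estimating_equation.pow(2)   # penalize its square
```

In this sketch, `lam` trades off nuisance-model fit against how closely the fitted models satisfy the estimating equation.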

References
  • Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
  • Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Maximilian Kasy. Partial identification, distributional preferences, and the welfare ranking of policies. Review of Economics and Statistics, 98(1):111–131, 2016.
  • Razieh Nabi, Daniel Malinsky, and Ilya Shpitser. Learning optimal fair policies. Proceedings of Machine Learning Research, 97:4674, 2019.
  • Jörg Stoye. Minimax regret treatment choice with covariates or with limited validity of experiments. Journal of Econometrics, 166(1):138–156, 2012.
  • Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
  • Onur Atan, William R Zame, Qiaojun Feng, and Mihaela van der Schaar. Constructing effective personalized policies using counterfactual inference from biased data sets with many features. Machine Learning, 108(6):945–970, 2019.
  • S. A. Murphy, D. Oslin, A. Rush, and J. Zhu. Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology, 32:257–262, 2007.
  • P. Flume, B. O'Sullivan, K. Robinson, C. Goss, P. Mogayzel, D. Willey-Courand, J. Bujan, J. Finder, M. Lester, L. Quittell, R. Rosenblatt, R. Vender, Leslie A Hazle, K. Sabadosa, and B. Marshall. Cystic fibrosis pulmonary guidelines: chronic medications for maintenance of lung health. American Journal of Respiratory and Critical Care Medicine, 176(10):957–69, 2007.
  • Susan A Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.
  • Ying-Qi Zhao, Donglin Zeng, Eric B Laber, and Michael R Kosorok. New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 110(510):583–598, 2015.
  • Philip W Lavori and Ree Dawson. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163(1):29–38, 2000.
  • Philip W Lavori and Ree Dawson. Adaptive treatment strategies in chronic disease. Annual Review of Medicine, 59:443–453, 2008.
  • Susan A Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24(10):1455–1481, 2005.
  • Peter F Thall, Randall E Millikan, and Hsi-Guang Sung. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine, 19(8):1011–1028, 2000.
  • Peter F Thall, Hsi-Guang Sung, and Elihu H Estey. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association, 97(457):29–39, 2002.
  • Jared K Lunceford, Marie Davidian, and Anastasios A Tsiatis. Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics, 58(1):48–57, 2002.
  • Abdus S Wahed and Anastasios A Tsiatis. Optimal estimator for the survival distribution and related quantities for treatment policies in two-stage randomization designs in clinical trials. Biometrics, 60(1):124–133, 2004.
  • Abdus S Wahed and Anastasios A Tsiatis. Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data. Biometrika, 93(1):163–177, 2006.
  • Edward H Wagner, Brian T Austin, Connie Davis, Mike Hindmarsh, Judith Schaefer, and Amy Bonomi. Improving chronic illness care: translating evidence into action. Health Affairs, 20(6):64–78, 2001.
  • Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects. In Advances in Neural Information Processing Systems, pages 2503–2513, 2019.
  • Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
  • Miguel A Hernán, Babette Brumback, and James M Robins. Marginal structural models to estimate the joint causal effect of nonrandomized treatments. Journal of the American Statistical Association, 96(454):440–448, 2001.
  • James M Robins. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics, pages 189–326.
  • Michael JD Powell and J Swann. Weighted uniform sampling—a Monte Carlo technique for reducing variance. IMA Journal of Applied Mathematics, 2(3):228–236, 1966.
  • Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
  • A Rupam Mahmood, Hado P van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 3014–3022, 2014.
  • Susan A Murphy. A generalization error for Q-learning. Journal of Machine Learning Research, 6(Jul):1073–1097, 2005.
  • Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.
  • Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • Doron Blatt, Susan A Murphy, and Ji Zhu. A-learning for approximate planning. Ann Arbor, 1001:48109–2122, 2004.
  • Phillip J. Schulte, A. Tsiatis, E. Laber, and M. Davidian. Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science, 29(4):640–661, 2014.
  • Hoang M Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738, 2019.
  • Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
  • Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.
  • Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Richard E Bellman and Stuart E Dreyfus. Applied dynamic programming. Princeton University Press, 2015.
  • Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.
  • Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
  • Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
  • Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.
  • Baqun Zhang, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika, 100(3):681–694, 2013.
  • Xinkun Nie, Emma Brunskill, and Stefan Wager. Learning when-to-treat policies. arXiv preprint arXiv:1905.09751, 2019.
  • James M Robins, Andrea Rotnitzky, and Mark van der Laan. On profile likelihood: comment. Journal of the American Statistical Association, 95(450):477–482, 2000.
  • James M Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, volume 1999, pages 6–10. Indianapolis, IN, 2000.
  • Peter J Bickel, Chris AJ Klaassen, Ya'acov Ritov, and Jon A Wellner. Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press, Baltimore, 1993.
  • Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
  • Anastasios Tsiatis. Semiparametric theory and missing data. Springer Science & Business Media, 2007.
  • Michael R Kosorok. Introduction to empirical processes and semiparametric inference. Springer Science & Business Media, 2007.
  • Mark J Van der Laan and Sherri Rose. Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media, 2011.
  • Jonathan Levy. Tutorial: Deriving the efficient influence curve for large models. arXiv preprint arXiv:1903.01706, 2019.
  • M. J. van der Laan and A. Luedtke. Targeted learning of the mean outcome under an optimal dynamic treatment rule. Journal of Causal Inference, 3:61–95, 2015.
  • Mark J van der Laan and Alexander R Luedtke. Targeted learning of the mean outcome under an optimal dynamic treatment rule. Journal of Causal Inference, 3(1):61–95, 2015.
  • Iván Díaz, Nicholas Williams, Katherine L Hoffman, and Edward J Schenck. Nonparametric causal effects based on longitudinal modified treatment policies. arXiv preprint arXiv:2006.01366, 2020.
  • Aurélien F Bibaut, Ivana Malenica, Nikos Vlassis, and Mark J van der Laan. More efficient off-policy evaluation through regularized targeted learning. arXiv preprint arXiv:1912.06292, 2019.
  • Jiawei Huang and Nan Jiang. From importance sampling to doubly robust policy gradient. arXiv preprint arXiv:1910.09066, 2019.
  • Andrew Bennett and Nathan Kallus. Efficient policy learning from surrogate-loss classification reductions. arXiv preprint arXiv:2002.05153, 2020.
  • Nathan Kallus and Masatoshi Uehara. Statistically efficient off-policy policy gradients. arXiv preprint arXiv:2002.04014, 2020.
  • Liesbet De Bus, Bram Gadeyne, Johan Steen, Jerina Boelens, Geert Claeys, Dominique Benoit, Jan De Waele, Johan Decruyenaere, and Pieter Depuydt. A complete and multifaceted overview of antibiotic use and infection diagnosis in the intensive care unit: results from a prospective four-year registration. Critical Care, 22(1):241, 2018.
  • Muhammad Ali, Humaira Naureen, Muhammad Haseeb Tariq, Muhammad Junaid Farrukh, Abubakar Usman, Shahana Khattak, and Hina Ahsan. Rational use of antibiotics in an intensive care unit: a retrospective study of the impact on clinical outcomes and mortality rate. Infection and Drug Resistance, 12:493, 2019.
  • U Waheed, P Williams, S Brett, G Baldock, and N Soni. White cell count and intensive care unit outcome. Anaesthesia, 58(2):180–182, 2003.