Volume 40, Issue 6
A Stochastic Trust-Region Framework for Policy Optimization

Mingming Zhao, Yongfeng Li & Zaiwen Wen

J. Comp. Math., 40 (2022), pp. 1004-1030.

Published online: 2022-08

  • Abstract

In this paper, we study several challenging theoretical and numerical issues in the well-known trust region policy optimization for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to the policy. The trust region subproblem is constructed with a surrogate function consistent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme to ensure that each step improves the model function and stays in the trust region. To overcome the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio that is used to update the trust region radius and to decide whether the trial point is accepted. Moreover, for a Gaussian policy, which is commonly used for continuous action spaces, the maximization with respect to the mean and the covariance is performed separately to control the entropy loss. Our theoretical analysis shows that the deterministic version of the proposed algorithm tends to generate a monotonic improvement of the total expected reward, and global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method on robotic control and game-playing tasks from OpenAI Gym.
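The acceptance test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that the empirical standard deviation is added to the predicted increase in the ratio. Everything else here is an assumption made for illustration, including the function name `trust_region_step`, the thresholds `eta_accept` and `eta_expand`, the factors `shrink` and `expand`, and the choice to take the standard deviation over returns sampled under the current policy.

```python
from statistics import mean, stdev


def trust_region_step(returns_old, returns_new, predicted_increase, radius,
                      eta_accept=0.1, eta_expand=0.75, shrink=0.5, expand=2.0):
    """Decide whether to accept a trial policy and update the radius.

    returns_old / returns_new: sampled total rewards under the current and
    the trial policy. predicted_increase: increase of the surrogate model,
    assumed positive. All thresholds and factors are hypothetical defaults.
    """
    # Sample-based estimate of the actual improvement in total reward.
    actual_increase = mean(returns_new) - mean(returns_old)

    # Empirical standard deviation (assumption: taken over the returns
    # of the current policy).
    sigma = stdev(returns_old)

    # Ratio with the standard deviation added to the predicted increase,
    # so noisy estimates are less likely to trigger a spurious rejection.
    ratio = actual_increase / (predicted_increase + sigma)

    accepted = ratio >= eta_accept
    if not accepted:
        radius *= shrink      # poor agreement: shrink the trust region
    elif ratio >= eta_expand:
        radius *= expand      # very good agreement: enlarge it
    return accepted, radius, ratio


if __name__ == "__main__":
    print(trust_region_step([1.0, 2.0, 3.0], [3.0, 4.0, 5.0], 1.0, 1.0))
```

Adding a non-negative term to the denominator leaves the sign of the ratio unchanged while scaling down its magnitude, which is one way to read the abstract's remark about compensating for sampling bias.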

  • AMS Subject Headings

49L20, 90C15, 90C26, 90C40, 93E20

  • Copyright

COPYRIGHT: © Global Science Press

  • Email address

mmz102@pku.edu.cn (Mingming Zhao)

yongfengli@pku.edu.cn (Yongfeng Li)

wenzw@pku.edu.cn (Zaiwen Wen)

  • BibTex

@Article{JCM-40-1004,
  author = {Zhao, Mingming and Li, Yongfeng and Wen, Zaiwen},
  title = {A Stochastic Trust-Region Framework for Policy Optimization},
  journal = {Journal of Computational Mathematics},
  year = {2022},
  volume = {40},
  number = {6},
  pages = {1004--1030},
  issn = {1991-7139},
  doi = {https://doi.org/10.4208/jcm.2104-m2021-0007},
  url = {http://global-sci.org/intro/article_detail/jcm/20845.html}
}

  • RIS

TY  - JOUR
T1  - A Stochastic Trust-Region Framework for Policy Optimization
AU  - Zhao, Mingming
AU  - Li, Yongfeng
AU  - Wen, Zaiwen
JO  - Journal of Computational Mathematics
VL  - 40
IS  - 6
SP  - 1004
EP  - 1030
PY  - 2022
DA  - 2022/08
SN  - 1991-7139
DO  - http://doi.org/10.4208/jcm.2104-m2021-0007
UR  - https://global-sci.org/intro/article_detail/jcm/20845.html
KW  - Deep reinforcement learning
KW  - Stochastic trust region method
KW  - Policy optimization
KW  - Global convergence
KW  - Entropy control

  • TXT

Mingming Zhao, Yongfeng Li & Zaiwen Wen. (2022). A Stochastic Trust-Region Framework for Policy Optimization. Journal of Computational Mathematics. 40 (6). 1004-1030. doi:10.4208/jcm.2104-m2021-0007