TY - GEN
T1 - Reduced variance deep reinforcement learning with temporal logic specifications
AU - Gao, Qitong
AU - Hajinezhad, Davood
AU - Zhang, Yan
AU - Kantaros, Yiannis
AU - Zavlanos, Michael M.
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/4/16
Y1 - 2019/4/16
AB - In this paper, we propose a model-free reinforcement learning method to synthesize control policies that satisfy Linear Temporal Logic (LTL) specifications for mobile robots modeled as Markov Decision Processes (MDPs) with unknown transition probabilities. Specifically, we develop a reduced variance deep Q-Learning technique that relies on Neural Networks (NNs) to approximate the state-action values of the MDP and employs a reward function that depends on the accepting condition of the Deterministic Rabin Automaton (DRA) that captures the LTL specification. The key idea is to convert the deep Q-Learning problem into a nonconvex max-min optimization problem with a finite-sum structure, and to develop an Arrow-Hurwicz-Uzawa-type stochastic reduced variance algorithm with constant stepsize to solve it. Unlike Stochastic Gradient Descent (SGD) methods that are often used in deep reinforcement learning, our method can estimate the gradients of an unknown loss function more accurately and can improve the stability of the training process. Moreover, our method does not require learning the transition probabilities of the MDP, constructing a product MDP, or computing Accepting Maximal End Components (AMECs). This allows the robot to learn an optimal policy even if the environment cannot be modeled accurately or if AMECs do not exist. In the latter case, the resulting control policies minimize the frequency with which the system enters bad states of the DRA that violate the task specifications. To the best of our knowledge, this is the first model-free deep reinforcement learning algorithm that can synthesize policies maximizing the probability of satisfying an LTL specification even if AMECs do not exist. A rigorous convergence analysis and convergence rate are provided for the proposed algorithm, as well as numerical experiments that validate our method.
KW - Linear Temporal Logic
KW - Reduced Variance Stochastic Optimization
KW - Reinforcement Learning
UR - https://www.scopus.com/pages/publications/85066628698
U2 - 10.1145/3302509.3311053
DO - 10.1145/3302509.3311053
M3 - Conference contribution
AN - SCOPUS:85066628698
T3 - ICCPS 2019 - Proceedings of the 2019 ACM/IEEE International Conference on Cyber-Physical Systems
SP - 237
EP - 248
BT - ICCPS 2019 - Proceedings of the 2019 ACM/IEEE International Conference on Cyber-Physical Systems
A2 - Ramachandran, Gowri Sankar
A2 - Ortiz, Jorge
PB - Association for Computing Machinery, Inc
T2 - 10th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS 2019, part of the 2019 CPS-IoT Week
Y2 - 16 April 2019 through 18 April 2019
ER -