Reduced variance deep reinforcement learning with temporal logic specifications

  • Qitong Gao
  • Davood Hajinezhad
  • Yan Zhang
  • Yiannis Kantaros
  • Michael M. Zavlanos

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper, we propose a model-free reinforcement learning method to synthesize control policies that satisfy Linear Temporal Logic (LTL) specifications for mobile robots modeled as Markov Decision Processes (MDPs) with unknown transition probabilities. Specifically, we develop a reduced variance deep Q-Learning technique that relies on Neural Networks (NN) to approximate the state-action values of the MDP and employs a reward function that depends on the accepting condition of the Deterministic Rabin Automaton (DRA) that captures the LTL specification. The key idea is to convert the deep Q-Learning problem into a nonconvex max-min optimization problem with a finite-sum structure, and to develop an Arrow-Hurwicz-Uzawa type stochastic reduced variance algorithm with constant stepsize to solve it. Unlike the Stochastic Gradient Descent (SGD) methods that are often used in deep reinforcement learning, our method estimates the gradients of the unknown loss function more accurately and improves the stability of the training process. Moreover, our method does not require learning the transition probabilities of the MDP, constructing a product MDP, or computing Accepting Maximal End Components (AMECs). This allows the robot to learn an optimal policy even if the environment cannot be modeled accurately or if AMECs do not exist. In the latter case, the resulting control policies minimize the frequency with which the system enters bad states in the DRA that violate the task specifications. To the best of our knowledge, this is the first model-free deep reinforcement learning algorithm that can synthesize policies maximizing the probability of satisfying an LTL specification even if AMECs do not exist. A rigorous convergence analysis and rate of convergence are provided for the proposed algorithm, along with numerical experiments that validate our method.
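For intuition, the sketch below illustrates one common way to realize a stochastic reduced-variance primal-dual update of the kind the abstract describes: an SVRG-style control variate applied to simultaneous descent (min player) and ascent (max player) steps with a constant stepsize. This is a minimal sketch under stated assumptions, not the paper's actual algorithm; the oracle names grad_theta_i / grad_lam_i, the stepsize eta, and the epoch structure are all illustrative.

```python
# Minimal sketch (not the paper's algorithm) of an Arrow-Hurwicz-Uzawa-type
# stochastic variance-reduced update for a finite-sum max-min problem
#     min_theta max_lam  (1/n) * sum_i f_i(theta, lam).
# The SVRG-style control variate and all names/stepsizes are assumptions.
import numpy as np

def primal_dual_vr(grad_theta_i, grad_lam_i, theta, lam, n,
                   eta=1e-3, epochs=10, inner=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        # Snapshot the iterates and compute full gradients (control variates).
        theta0, lam0 = theta.copy(), lam.copy()
        g_theta0 = sum(grad_theta_i(i, theta0, lam0) for i in range(n)) / n
        g_lam0 = sum(grad_lam_i(i, theta0, lam0) for i in range(n)) / n
        for _ in range(inner):
            i = rng.integers(n)
            # Variance-reduced estimates: stochastic gradient, recentred by
            # its value at the snapshot, plus the full snapshot gradient.
            v_theta = grad_theta_i(i, theta, lam) - grad_theta_i(i, theta0, lam0) + g_theta0
            v_lam = grad_lam_i(i, theta, lam) - grad_lam_i(i, theta0, lam0) + g_lam0
            theta = theta - eta * v_theta  # primal descent (min player)
            lam = lam + eta * v_lam        # dual ascent (max player)
    return theta, lam

# Toy usage on a bilinear saddle problem f_i(theta, lam) = theta @ A_i @ lam.
if __name__ == "__main__":
    n, d = 8, 3
    rng = np.random.default_rng(1)
    A = rng.normal(size=(n, d, d))
    gt = lambda i, th, lm: A[i] @ lm    # d f_i / d theta
    gl = lambda i, th, lm: A[i].T @ th  # d f_i / d lam
    theta, lam = primal_dual_vr(gt, gl, np.ones(d), np.ones(d), n)
    print(theta, lam)
```

Note the design point this sketch shares with the abstract: the control variate is recomputed only once per epoch, which is what lets the inner updates use a constant stepsize rather than the diminishing schedule plain SGD would need to keep gradient noise under control.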

Original language: English
Title of host publication: ICCPS 2019 - Proceedings of the 2019 ACM/IEEE International Conference on Cyber-Physical Systems
Editors: Gowri Sankar Ramachandran, Jorge Ortiz
Publisher: Association for Computing Machinery, Inc
Pages: 237-248
Number of pages: 12
ISBN (Electronic): 9781450362856
State: Published - Apr 16 2019
Event: 10th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS 2019, part of the 2019 CPS-IoT Week - Montreal, Canada
Duration: Apr 16 2019 - Apr 18 2019

Publication series

Name: ICCPS 2019 - Proceedings of the 2019 ACM/IEEE International Conference on Cyber-Physical Systems

Conference

Conference: 10th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS 2019, part of the 2019 CPS-IoT Week
Country/Territory: Canada
City: Montreal
Period: 04/16/19 - 04/18/19

Keywords

  • Linear Temporal Logic
  • Reduced Variance Stochastic Optimization
  • Reinforcement Learning
