-A A +A

Safe Reinforcement learning

Reinforcement Learning (RL) is a known architecture for synthesising policies for Markov Decision Processes (MDP). We work on extending this paradigm to the synthesis of ‘safe policies’, or more general of policies such that a linear time property is satisfied. We convert the property into an automaton, then construct a product MDP between the automaton and the original MDP. A reward function is then assigned to the states of the product automaton, according to accepting conditions of the automaton. With this reward function, RL synthesises a policy that satisfies the property: as such, the policy synthesis procedure is `constrained' by the given specification. Additionally, we show that the RL procedure sets up an online value iteration method to calculate the maximum probability of satisfying the given property, at any given state of the MDP. We evaluate the performance of the algorithm on numerous numerical examples.


Themes:  Learning and Verification

People:  Hosein Hasanbeig, Tim Seabrook, Kyriakos Polymenakos, Alessandro Abate



Oxford Control and Verification (OXCAV) Group

C/O Department of Computer Science

Wolfson Building, Parks Rd, Oxford OX1 3QD

T: +44 (0) 18656 10767

Education - This is a contributing Drupal Theme
Design by WeebPal.