Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization

Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Elfatihi, Kfir Levy, Shie Mannor

Research output: Contribution to journal › Conference article › Peer-review


In robust Markov decision processes (RMDPs), the reward and the transition dynamics are assumed to lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally structured independently for each state. This so-called rectangularity condition is motivated solely by computational concerns; as a result, it lacks practical justification and may lead to overly conservative behavior. In this work, we study coupled-reward RMDPs where the transition kernel is fixed but the reward function lies within an α-radius of a nominal one. We draw a direct connection between this class of non-rectangular reward-RMDPs and regularization of the policy's visitation frequencies. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and show that it is less conservative than policies obtained under rectangular uncertainty.
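The connection sketched in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes an L2 ball of radius α around the nominal reward, in which case the worst-case return of a policy π equals ⟨d_π, r₀⟩ − α‖d_π‖₂, where d_π is the discounted state-action occupancy (visitation frequency). All MDP numbers and the finite-difference gradient are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration only.
gamma = 0.9
alpha = 0.1                                   # assumed uncertainty radius
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])    # P[s, a, s']
r0 = np.array([[1.0, 0.0], [0.5, 2.0]])       # nominal reward r0[s, a]
S, A = r0.shape
mu0 = np.full(S, 1.0 / S)                     # uniform initial distribution

def occupancy(logits):
    """Discounted state-action occupancy d_pi of a softmax policy."""
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sap,sa->sp', P, pi)     # state transitions under pi
    # d_s solves (I - gamma * P_pi^T) d_s = (1 - gamma) * mu0
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d_s[:, None] * pi                  # sums to 1 over (s, a)

def robust_return(logits):
    """Worst case over the L2 reward ball = occupancy-regularized return."""
    d = occupancy(logits)
    return (d * r0).sum() - alpha * np.linalg.norm(d)

def num_grad(logits, eps=1e-5):
    """Finite-difference gradient, a stand-in for an analytic policy gradient."""
    g = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):
        e = np.zeros_like(logits)
        e[idx] = eps
        g[idx] = (robust_return(logits + e) - robust_return(logits - e)) / (2 * eps)
    return g

logits = np.zeros((S, A))
before = robust_return(logits)
for _ in range(200):                          # plain gradient ascent on logits
    logits += 1.0 * num_grad(logits)
after = robust_return(logits)
```

Note that the regularizer acts on the occupancy measure of the whole policy rather than state-by-state, which is what distinguishes this coupled (non-rectangular) uncertainty from the rectangular case.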

Original language: English
Pages (from-to): 21090–21098
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Issue number: 19
State: Published - 25 Mar 2024
Event: 38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: 20 Feb 2024 – 27 Feb 2024

ASJC Scopus subject areas

  • Artificial Intelligence
