In many safety-critical reinforcement learning applications, such as healthcare and autonomous driving, it is essential that learned policies respect constraints (e.g., dosage limits or speed limits) to avoid severe consequences. Training a policy through trial-and-error in these domains can be risky because constraints may be violated during training. In this talk, we present an offline constrained RL algorithm that relies solely on a pre-collected offline dataset, without further interactions with the environment, to learn a near-optimal policy that satisfies the specified constraints. Previous work in this setting based on the actor-critic algorithm structure assumes the offline dataset covers all state-action pairs reachable by any policy. In contrast, our algorithm, based on the linear programming formulation of RL, can learn a nearly optimal feasible policy using datasets that only cover the state-action pairs reachable by the optimal policy. We provide theoretical guarantees on the sample complexity of learning a near-optimal feasible policy in the general function approximation setting.
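For context, below is a minimal sketch of the standard linear programming formulation of a constrained MDP over occupancy measures; the notation ($d$, $r$, $c$, $\tau$, $\gamma$, $\rho$, $P$) is illustrative and not taken from the talk, and the talk's algorithm builds on this general form rather than being defined by it.

$$
\begin{aligned}
\max_{d \ge 0} \quad & \sum_{s,a} d(s,a)\, r(s,a) \\
\text{s.t.} \quad & \sum_{s,a} d(s,a)\, c(s,a) \le \tau, \\
& \sum_{a} d(s,a) = (1-\gamma)\,\rho(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
\end{aligned}
$$

where $d$ is the discounted state-action occupancy measure, $r$ the reward, $c$ the cost with budget $\tau$, $\gamma$ the discount factor, $\rho$ the initial state distribution, and $P$ the transition kernel. An optimal feasible policy can be recovered from an optimal $d$ via $\pi(a \mid s) \propto d(s,a)$.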
# Bio
Kihyuk (Ki) Hong is a 5th-year PhD student in statistics at the University of Michigan. His research focuses on provable guarantees in reinforcement learning. His past work spans nonstationary bandits, offline RL, constrained RL, and infinite-horizon average-reward RL. Previously, he worked as a machine learning engineer at Facebook and Naver, specializing in search and recommendation systems. He is broadly interested in bridging theory and practice in reinforcement learning.