Batch Reinforcement Learning
Recently Published Documents


TOTAL DOCUMENTS: 31 (five years: 11)
H-INDEX: 10 (five years: 2)

Author(s): Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushova, Scott Reed, ...

Author(s): Vincent Francois-Lavet, Guillaume Rabusseau, Joelle Pineau, Damien Ernst, Raphael Fonteneau

When an agent has only limited information about its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term due to asymptotic bias (suboptimality even with unlimited data) and a term due to overfitting (additional suboptimality caused by limited data). In the context of reinforcement learning with partial observability, this paper analyzes the tradeoff between these two error sources. In particular, our theoretical analysis formally characterizes how a smaller state representation increases the asymptotic bias while decreasing the risk of overfitting.
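A schematic way to write this decomposition (illustrative shorthand, not the paper's exact bound): let $\pi_{\phi,D}$ denote the policy learned from a finite dataset $D$ under state representation $\phi$, and $\pi_{\phi,\infty}$ the policy learned with unlimited data. Then
\[
\underbrace{\mathbb{E}_{D}\!\left[V^{*} - V^{\pi_{\phi,D}}\right]}_{\text{suboptimality}}
=
\underbrace{\left(V^{*} - V^{\pi_{\phi,\infty}}\right)}_{\text{asymptotic bias (unlimited data)}}
+
\underbrace{\mathbb{E}_{D}\!\left[V^{\pi_{\phi,\infty}} - V^{\pi_{\phi,D}}\right]}_{\text{overfitting (finite data)}}
\]
A coarser representation $\phi$ (e.g., a shorter observation history) typically shrinks the second term at the cost of enlarging the first.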


Author(s): Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, ...

In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at every state. This can make batch RL overly conservative: it cannot exploit large policy changes at frequently-visited, high-confidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, in which the allowable deviation of the learned policy is state-action-dependent. We derive a new batch RL method, BRPO, which learns both the policy and the allowable deviation so that they jointly maximize a lower bound on policy performance. We show that BRPO achieves state-of-the-art performance on a number of tasks.
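To make the idea of a state-dependent deviation budget concrete, here is a minimal Python sketch. Everything in it, including the count-based budget, the KL penalty, and the penalty weight, is a hypothetical illustration of the general idea and not BRPO's actual parameterization or objective.

```python
# Hypothetical sketch of a state-dependent deviation budget for batch RL.
# This is NOT the BRPO algorithm; it only illustrates the idea that the
# allowed divergence from the behavior policy can vary with how well a
# state is covered by the batch data.
import numpy as np

def kl_categorical(p, q, eps=1e-8):
    """KL(p || q) between two categorical action distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def deviation_budget(visit_count, base=0.05, scale=0.5):
    """Allow larger policy changes at frequently-visited (high-confidence) states."""
    return base + scale * np.log1p(visit_count)

def penalized_value(q_values, learned_pi, behavior_pi, visit_count, penalty_weight=10.0):
    """Expected Q-value under the learned policy, penalized only when its KL
    divergence from the behavior policy exceeds the state's deviation budget."""
    expected_q = float(np.dot(learned_pi, q_values))
    excess = max(0.0, kl_categorical(learned_pi, behavior_pi) - deviation_budget(visit_count))
    return expected_q - penalty_weight * excess

# Toy example: the same greedy shift away from the behavior policy is cheap at a
# well-covered state (visit_count=500) but costly at a rarely-visited one (visit_count=2).
q = np.array([1.0, 0.2, 0.0])
behavior_pi = np.array([0.2, 0.5, 0.3])
greedy_pi = np.array([0.9, 0.05, 0.05])
print(penalized_value(q, greedy_pi, behavior_pi, visit_count=500))
print(penalized_value(q, greedy_pi, behavior_pi, visit_count=2))
```

In the toy example, the identical policy shift is tolerated where the batch provides high confidence and penalized where it does not, which is the behavior a uniform per-state constraint cannot express.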


2020, Vol. 35(3), pp. 1990-2001
Author(s): Hanchen Xu, Alejandro D. Dominguez-Garcia, Peter W. Sauer

2019, Vol. 65, pp. 1-30
Author(s): Vincent Francois-Lavet, Guillaume Rabusseau, Joelle Pineau, Damien Ernst, Raphael Fonteneau

This paper provides an analysis of the tradeoff between asymptotic bias (suboptimality with unlimited data) and overfitting (additional suboptimality due to limited data) in the context of reinforcement learning with partial observability. Our theoretical analysis formally characterizes how a smaller state representation decreases the risk of overfitting while potentially increasing the asymptotic bias. The analysis relies on expressing the quality of a state representation by bounding $L_1$ error terms of the associated belief states. The theoretical results are empirically illustrated when the state representation is a truncated history of observations, both on synthetic POMDPs and on a large-scale POMDP in the context of smart grids, with real-world data. Finally, similarly to known results in the fully observable setting, we briefly discuss and empirically illustrate how using function approximators and adapting the discount factor may enhance the tradeoff between asymptotic bias and overfitting in the partially observable context.
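As a concrete illustration of the truncated-history state representation mentioned above, here is a minimal Python sketch under assumed details (discrete observations, a hypothetical helper function); it is not the paper's code, and only shows how the history length acts as the knob between asymptotic bias and overfitting.

```python
# Minimal sketch (assumed details, not the paper's code): a truncated-history
# state representation for a POMDP. The history length h controls the
# bias-overfitting tradeoff: a small h yields fewer distinct states (less
# overfitting, more asymptotic bias); a large h does the reverse.
from collections import deque

def truncated_history_states(observations, h, pad=None):
    """Map an observation sequence to states given by the last h observations."""
    window = deque([pad] * h, maxlen=h)
    states = []
    for obs in observations:
        window.append(obs)
        states.append(tuple(window))  # hashable, so usable as a tabular state
    return states

# Example: with h=2 the learned policy conditions on the last two observations only.
obs_seq = ["o1", "o2", "o3", "o2", "o1"]
print(truncated_history_states(obs_seq, h=2))
# [(None, 'o1'), ('o1', 'o2'), ('o2', 'o3'), ('o3', 'o2'), ('o2', 'o1')]
```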

