Offline Reinforcement Learning

1 source - 4 claims

The decision problem is formulated as a discounted infinite-horizon Markov decision process with 90-day decision intervals. Functional fitted Q evaluation (FQE) estimates the value of a policy by kernel ridge regression in a reproducing kernel Hilbert space (RKHS) with a tensor-product state-action kernel. Functional fitted Q iteration (FQI) updates a B-spline functional linear policy by maximising a penalised empirical average of the estimated Q-function, rather than by taking a greedy maximum for each sample. Avoiding per-sample greedy maximisation is presented as a way to reduce computational cost, Q-function overestimation, and non-smooth policies in functional-action settings.
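The evaluation step described above can be illustrated with a minimal sketch. It is not the source's implementation. It assumes Gaussian kernels on states and actions, represents each functional action by a finite coefficient vector (for example, B-spline coefficients), and uses a plain fixed-point iteration that is not guaranteed to converge. All function names, bandwidths, and the ridge parameter are hypothetical choices made for illustration.

```python
import numpy as np


def rbf(X, Y, bandwidth):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def product_kernel(S1, A1, S2, A2, bw_s=1.0, bw_a=1.0):
    """Tensor-product kernel: k((s, a), (s', a')) = k_S(s, s') * k_A(a, a')."""
    return rbf(S1, S2, bw_s) * rbf(A1, A2, bw_a)


def fitted_q_evaluation(S, A, R, S_next, policy, gamma=0.9, ridge=1e-2, n_iter=50):
    """Sketch of FQE with kernel ridge regression (assumed setup, see lead-in).

    The Q-function is represented as Q(s, a) = sum_i alpha_i k((s, a), (S_i, A_i)).
    Each iteration regresses the targets R + gamma * Q(S_next, policy(S_next))
    onto the observed state-action pairs.
    """
    n = len(R)
    K = product_kernel(S, A, S, A)
    # Kernel between next-state pairs under the evaluated policy and the data.
    K_next = product_kernel(S_next, policy(S_next), S, A)
    gram = K + n * ridge * np.eye(n)
    alpha = np.zeros(n)
    for _ in range(n_iter):
        targets = R + gamma * (K_next @ alpha)
        alpha = np.linalg.solve(gram, targets)
    return alpha


def q_values(alpha, S, A, S_query, A_query):
    """Evaluate the fitted Q-function at query state-action pairs."""
    return product_kernel(S_query, A_query, S, A) @ alpha
```

A policy here is any function that maps an array of states to an array of action coefficient vectors, for instance a linear map `lambda S: S @ W` with a hypothetical weight matrix `W`. The value of the policy at a set of initial states can then be approximated by averaging `q_values(alpha, S, A, S0, policy(S0))`.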