Reinforcement Learning

The leave-one-out root-group baseline is described as unbiased because it is independent of the scored rollout tree: a baseline that does not depend on the sample being scored leaves the expected policy gradient unchanged. RAO trains all recursive nodes jointly, each with a local reward. The node reward combines local task success with a delegation bonus based on the success of the node's immediate children. The bonus uses the children's success rate rather than the number of successful children, which is intended to avoid rewarding indiscriminate spawning. Depth-level inverse-frequency weighting reduces domination by depths with many trajectories.
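
The three mechanisms above can be made concrete with a small sketch. It is not RAO's actual implementation: the `Node` structure, the additive form of the reward, the `bonus_weight` value, the normalization of the depth weights, and the use of root rewards for the baseline are all assumptions chosen for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One recursive call in a rollout tree (hypothetical structure)."""
    success: float                      # local task success, assumed in [0, 1]
    depth: int = 0
    children: list[Node] = field(default_factory=list)


def node_reward(node: Node, bonus_weight: float = 0.5) -> float:
    """Local success plus a bonus on the immediate-child success RATE.

    Assumption: the two terms are added, with a hypothetical bonus_weight.
    Dividing by the number of children means extra failing children lower
    the bonus, so spawning more children is not rewarded by itself.
    """
    if not node.children:
        return node.success
    rate = sum(c.success for c in node.children) / len(node.children)
    return node.success + bonus_weight * rate


def depth_weights(depths: list[int]) -> list[float]:
    """Inverse-frequency weight per trajectory.

    Assumption: weights are normalized so every depth level carries the
    same total mass and all weights sum to 1.
    """
    counts = Counter(depths)
    return [1.0 / (len(counts) * counts[d]) for d in depths]


def leave_one_out_baselines(root_rewards: list[float]) -> list[float]:
    """Baseline for rollout i: mean reward of the OTHER rollouts in its group.

    Rollout i's own reward never enters its baseline, which is the
    independence property the text refers to.
    """
    n = len(root_rewards)
    if n < 2:
        raise ValueError("need at least two rollouts per root group")
    total = sum(root_rewards)
    return [(total - r) / (n - 1) for r in root_rewards]


# One successful child out of one beats two successes out of four,
# even though the second node has more successful children in total.
focused = Node(1.0, children=[Node(1.0, depth=1)])
scattered = Node(1.0, children=[Node(s, depth=1) for s in (1.0, 1.0, 0.0, 0.0)])
rewards = (node_reward(focused), node_reward(scattered))
```

Under these assumptions, a count-based bonus would favor the `scattered` node (two successful children against one), while the rate-based bonus favors the `focused` one, which is the behavior the paragraph attributes to using the success rate.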