Policy Gradient Theory

1 source - 5 claims

The Bernoulli variance p(1-p) governs how informative each prompt's gradient is, which directly justifies the uncertainty term 4p(1-p) as the primary driver of the Energy Score. The term peaks at 1 when p = 0.5 and vanishes at p = 0 and p = 1. Ablation studies support this theoretical result: removing the uncertainty term causes the largest performance degradation of any single component. Theorem 1's guarantees, however, depend on a homogeneity approximation for score-function norms across reward outcomes, which may not hold in all settings.

The momentum term m_i(t) is the output of a complementary causal high-pass filter applied to the pass-rate sequence, which isolates rapid temporal changes in policy performance.

Without the difficulty anchor, the selector collapses toward initially easy prompts that transiently sit near p = 0.5 because of policy noise.
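
As a rough illustration of the two quantities described above, the following Python sketch computes the uncertainty term 4p(1-p) and one possible momentum signal. The function names `uncertainty` and `momentum`, the parameter `alpha`, and the choice of a first-order exponential moving average as the low-pass half of the complementary filter are all assumptions made for illustration; the source does not specify the filter or how the terms are combined into the Energy Score.

```python
from typing import List, Sequence


def uncertainty(p: float) -> float:
    """Scaled Bernoulli variance 4p(1-p): 0 at p = 0 or p = 1, and 1 at p = 0.5."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return 4.0 * p * (1.0 - p)


def momentum(pass_rates: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Complementary causal high-pass output of a pass-rate sequence.

    ASSUMPTION: the low-pass half is a first-order EMA with smoothing
    factor `alpha`, initialised at the first observation. The high-pass
    output is the current pass rate minus that EMA, so the two outputs
    sum back to the input at every step and only past and present
    samples are used.
    """
    out: List[float] = []
    ema = None
    for p in pass_rates:
        # Update the low-pass (slow) component with the current sample.
        ema = p if ema is None else alpha * p + (1.0 - alpha) * ema
        # The high-pass (fast) component is whatever the EMA has not absorbed.
        out.append(p - ema)
    return out


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 1.0):
        print(p, uncertainty(p))
    print(momentum([0.2, 0.2, 0.6, 0.6]))
```

Under these assumptions, a prompt whose pass rate is constant yields zero momentum, while a sudden jump in pass rate produces a positive spike that decays as the moving average catches up.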