Probabilistic Chunk Masking

1 source - 7 claims

Probabilistic Chunk Masking (PCM) is a drop-in modification to GRPO that selects a fixed budget of trajectory chunks per update and physically removes the rest before the forward and backward pass. It operates at the trajectory-phase level rather than the token level, and it uses an outcome-grounded signal rather than policy-internal uncertainty proxies such as entropy. PCM relies on three distinct mechanisms (concentration, exploration, and online adaptation), all of which are necessary for its effectiveness. The masked, biased estimator used by PCM is preferred over an unbiased importance-weighted alternative because it yields lower gradient-estimator variance at the same chunk budget. A budget of B=12 chunks (19% of the trajectory) is the PCM default and is applied without task-specific tuning across all benchmarks. On LIBERO-Object, PCM reaches the 98% success-rate threshold 2.38 times faster in wall-clock time than vanilla GRPO. This wall-clock gain is attributed entirely to per-step compute savings, because PCM's per-step learning curves match those of vanilla GRPO.
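The select-then-remove step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-chunk `scores` input, and the `explore` mixing weight are hypothetical, and how PCM derives and adapts its outcome-grounded signal is not specified here. The sketch samples a fixed budget of distinct chunks from a distribution that concentrates on high-scoring chunks while keeping some uniform exploration, then gathers only those chunks so the rest never enter the forward or backward pass. No importance weights are applied, in line with the masked, biased estimator described above. The total of 63 chunks in the usage lines is an assumption chosen so that B=12 is roughly 19% of the trajectory.

```python
import numpy as np


def select_chunks(scores, budget, explore=0.1, rng=None):
    """Sample `budget` distinct chunk indices (illustrative only).

    scores  : hypothetical non-negative per-chunk scores, shape (n_chunks,)
    budget  : number of chunks to keep (B in the text)
    explore : assumed weight of a uniform component, so every chunk
              keeps a non-zero chance of being selected
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if budget >= n:
        # Nothing to mask: keep every chunk.
        return np.arange(n)
    total = scores.sum()
    # Concentration term: probability proportional to the score.
    conc = scores / total if total > 0 else np.full(n, 1.0 / n)
    # Mix with a uniform distribution for exploration.
    probs = (1.0 - explore) * conc + explore / n
    idx = rng.choice(n, size=budget, replace=False, p=probs)
    return np.sort(idx)  # keep temporal order of the kept chunks


def drop_unselected(chunks, idx):
    """Physically remove unselected chunks: (n_chunks, ...) -> (budget, ...)."""
    return np.asarray(chunks)[idx]


# Usage with assumed sizes: 63 chunks of 8 steps and 7 action dimensions.
rng = np.random.default_rng(0)
scores = rng.random(63)              # placeholder scores, not a real signal
chunks = np.zeros((63, 8, 7))
kept = drop_unselected(chunks, select_chunks(scores, budget=12, rng=rng))
print(kept.shape)                    # (12, 8, 7)
```

Because the unselected chunks are removed from the batch rather than zeroed out by a mask, the per-update compute scales with the budget instead of the full trajectory length, which is the source of the per-step savings described above.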