ALFWorld Experiments
The predictive analysis used 100 ALFWorld valid_seen groups across six task types. Tier 1 ran 100 tasks under both baseline and gated conditions and found that gating reduced rollout wall-clock time by 13.25%. Tier 2 found that gating reduced off-policy training time by 32.2%. Tier 3 found an average on-policy wall-clock reduction of 10.7% across four seeds, but did not show a statistically significant improvement in held-out success at that seed count.
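As a reading aid for the percentages above, the following is a minimal sketch of how a relative wall-clock reduction can be computed from paired baseline and gated timings. The function names and the timing values are hypothetical placeholders, not measurements from these experiments, and the source does not state whether the multi-seed average was taken over per-seed reductions (as done here) or over total times.

```python
def percent_reduction(baseline_seconds, gated_seconds):
    """Relative wall-clock reduction of a gated run versus its baseline, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - gated_seconds) / baseline_seconds


def mean_percent_reduction(pairs):
    """Average the per-pair reductions (assumption: one pair per seed or per run)."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    reductions = [percent_reduction(b, g) for b, g in pairs]
    return sum(reductions) / len(reductions)


# Hypothetical (baseline, gated) timings in seconds, for illustration only.
example_pairs = [(100.0, 90.0), (200.0, 170.0)]
example_mean = mean_percent_reduction(example_pairs)
print(f"mean reduction: {example_mean:.2f}%")
```

Averaging per-pair reductions weights every seed equally; computing a single reduction from summed times would instead weight longer runs more heavily, so the two conventions can give different figures.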