Vision-Language-Action Models
1 sources - 4 claims
Reinforcement learning (RL) post-training of vision-language-action (VLA) models has become an important step for enabling generalization beyond the supervised fine-tuning (SFT) distribution. SFT is already effective in unimodal, well-covered regions, which leaves RL to refine the sparse-demonstration or multimodal regimes that make up only a fraction of the trajectory. Entropy is an ineffective signal for identifying these outcome-critical phases in VLA policies: SFT pretraining drives policies to low entropy, so the residual entropy reflects modeling noise rather than outcome criticality. The Neyman allocation framework and the C_c proxy are domain-agnostic in principle and could be applied to LLM reasoning tasks by computing C_c against verified outcomes in math or code domains.
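
For readers unfamiliar with Neyman allocation, the following is a minimal sketch of the textbook rule from stratified sampling: a budget of n samples is split across strata in proportion to N_h * sigma_h, where N_h is the stratum size and sigma_h its standard deviation. Treating trajectory phases as strata, and the phase names and numbers used below, are assumptions made purely for illustration. The C_c proxy is not defined in this summary, so the sketch takes the per-stratum spread as a given input rather than computing it.

```python
def neyman_allocation(sizes, stds, budget):
    """Split `budget` samples across strata in proportion to N_h * sigma_h.

    sizes  -- number of units in each stratum (N_h)
    stds   -- standard deviation of the quantity of interest per stratum (sigma_h)
    budget -- total number of samples to distribute (n)

    Returns fractional allocations; rounding to integers is left to the caller.
    """
    if len(sizes) != len(stds):
        raise ValueError("sizes and stds must have the same length")
    weights = [n_h * s_h for n_h, s_h in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # No stratum shows any spread: fall back to proportional allocation.
        weights = list(sizes)
        total = sum(weights)
    return [budget * w / total for w in weights]


# Hypothetical example: three trajectory phases (names and values are made up).
phases = ["well-covered", "sparse-demonstration", "multimodal"]
sizes = [80, 15, 5]        # share of trajectory steps in each phase
stds = [0.05, 0.4, 0.5]    # assumed outcome spread per phase
allocation = neyman_allocation(sizes, stds, budget=100)
for name, share in zip(phases, allocation):
    print(f"{name}: {share:.1f}")
```

With these made-up inputs the well-covered phase holds 80% of the steps but receives roughly a third of the budget, while the two high-variance phases receive the rest. This matches the intuition in the text that refinement effort should concentrate on a small fraction of the trajectory.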