Automated Evaluation and Ablations
State-aware reasoning improved performance over a vanilla Gemini 2.0 Flash baseline that had no explicit phase transitions or uncertainty-guided questioning. Combining image and dialogue input consistently outperformed image-only diagnosis across the evaluated datasets. Robustness testing across 1,168 augmented scenarios found that diagnostic accuracy, information gathering, hallucination rate, and management appropriateness all remained stable. To support rapid iteration, the researchers built an automated simulation environment comprising scenario generation, turn-by-turn multimodal dialogue simulation, and auto-rating.
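The three stages of the automated environment (scenario generation, turn-by-turn dialogue simulation, and auto-rating) can be illustrated with a minimal sketch. This is not the researchers' implementation: every name here (`Scenario`, `generate_scenarios`, `simulate_patient`, `toy_agent`, `auto_rate`, `evaluate`) is hypothetical, the scenarios are invented toy data, and the rule-based agent and keyword-matching patient stand in for what would presumably be model-driven components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs


@dataclass
class Scenario:
    """Hypothetical scenario: ground-truth label, image handle, patient facts."""
    diagnosis: str
    image_ref: str
    facts: Dict[str, str]  # keyword in a question -> patient's answer


@dataclass
class Transcript:
    turns: History = field(default_factory=list)
    final_diagnosis: str = ""


def generate_scenarios() -> List[Scenario]:
    # Stage 1 (assumed): scenario generation. Toy, hand-written data.
    return [
        Scenario("eczema", "img_001",
                 {"itch": "yes, it itches a lot", "duration": "about two weeks"}),
        Scenario("psoriasis", "img_002",
                 {"itch": "only mildly", "duration": "several months"}),
    ]


def simulate_patient(scenario: Scenario, question: str) -> str:
    # Simulated patient: answers by keyword lookup in the scenario facts.
    for keyword, answer in scenario.facts.items():
        if keyword in question.lower():
            return answer
    return "I am not sure."


def toy_agent(image_ref: str, history: History) -> Tuple[str, str]:
    # Stand-in for the diagnostic agent: asks two questions, then diagnoses.
    if len(history) == 0:
        return ("ask", "Does it itch?")
    if len(history) == 1:
        return ("ask", "What is the duration?")
    answers = " ".join(answer for _, answer in history)
    return ("diagnose", "psoriasis" if "months" in answers else "eczema")


Agent = Callable[[str, History], Tuple[str, str]]


def run_dialogue(agent: Agent, scenario: Scenario, max_turns: int = 5) -> Transcript:
    # Stage 2 (assumed): turn-by-turn simulation until a diagnosis or the cap.
    transcript = Transcript()
    for _ in range(max_turns):
        action, text = agent(scenario.image_ref, transcript.turns)
        if action == "diagnose":
            transcript.final_diagnosis = text
            break
        transcript.turns.append((text, simulate_patient(scenario, text)))
    return transcript


def auto_rate(scenario: Scenario, transcript: Transcript) -> Dict[str, float]:
    # Stage 3 (assumed): auto-rating against the scenario's ground truth.
    return {
        "correct": float(transcript.final_diagnosis == scenario.diagnosis),
        "questions_asked": float(len(transcript.turns)),
    }


def evaluate(agent: Agent) -> Dict[str, float]:
    ratings = [auto_rate(s, run_dialogue(agent, s)) for s in generate_scenarios()]
    n = len(ratings)
    return {
        "accuracy": sum(r["correct"] for r in ratings) / n,
        "mean_questions": sum(r["questions_asked"] for r in ratings) / n,
    }


report = evaluate(toy_agent)
```

Because the agent is passed in as a plain callable, two variants (for example, a state-aware agent and a baseline) could be scored on the same generated scenarios by calling `evaluate` once per variant and comparing the returned metrics.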