Causal Estimation Benchmarks

1 source - 7 claims

CausalFlow-T was benchmarked on four complete synthetic datasets with known counterfactuals. Causal reliability metrics included subgroup calibration, arm reconstruction error, tail variance ratio, hazard ratio (HR) recovery, and stability. On CVD Risk Toy, CausalFlow-T recovered a hazard ratio of 0.786 ± 0.051 against a true protective hazard ratio of 0.831. On the FIRE semi-synthetic oracle, CausalFlow-T and GNN-CVAE were the only models that passed the bias threshold. On LDL Toy, TARNet had the lowest absolute MAE but showed systematic error and variance collapse. On Cox Survival, CVAE had the better MAE, but CausalFlow-T had the best arm-1 error and the closest hazard ratio recovery. Taken together, the findings support evaluating longitudinal causal models with criteria beyond factual MAE.
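A minimal sketch of why MAE alone can mislead, and of how the reported HR figures compare. The metric definitions here are assumptions for illustration: `variance_ratio` is a plain predicted-to-true variance ratio, not necessarily the tail variance ratio used in the benchmark, and `within_one_sd` is a hypothetical helper, not the benchmark's pass criterion. The synthetic predictors are invented for the example and are not benchmark data.

```python
import math
import random
import statistics


def mae(pred, truth):
    """Mean absolute error between paired predictions and true values."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def variance_ratio(pred, truth):
    """Predicted variance over true variance; values far below 1 suggest collapse.

    Assumption: a plain variance ratio, used here as a stand-in metric.
    """
    return statistics.pvariance(pred) / statistics.pvariance(truth)


def within_one_sd(estimate, sd, true_value):
    """Hypothetical check: does the true value lie within estimate +/- sd?"""
    return abs(estimate - true_value) <= sd


# HR figures reported for CVD Risk Toy: estimate 0.786, spread 0.051, truth 0.831.
hr_gap = abs(0.786 - 0.831)
hr_covered = within_one_sd(0.786, 0.051, 0.831)

# Synthetic illustration (not benchmark data): two predictors of the same outcomes.
rng = random.Random(0)
truth = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Predictor A shrinks every prediction toward zero, so its spread collapses.
collapsed = [0.5 * t for t in truth]

# Predictor B keeps roughly the true spread but is noisier per individual.
rho = 0.7
spread_preserving = [
    rho * t + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0) for t in truth
]

for name, pred in [("collapsed", collapsed), ("spread-preserving", spread_preserving)]:
    print(f"{name}: MAE={mae(pred, truth):.3f}, "
          f"variance ratio={variance_ratio(pred, truth):.3f}")
print(f"HR gap={hr_gap:.3f}, true HR within one SD: {hr_covered}")
```

In this toy setup the collapsed predictor has the lower MAE even though it captures only a quarter of the true variance, which mirrors the pattern described for TARNet on LDL Toy: a good factual MAE can coexist with variance collapse.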