Batch Gradient Variance

Lower gradient variance is presented as beneficial because evaluation occurs on the full graph. Sequential per-batch SGD updates introduce a gradient-variance penalty: the final term in the sampled modified objective is the variance of the mini-batch gradients, denoted R(w). At random initialization, RNS has the lowest loss variance and gradient variance among the compared samplers. The main sampler-dependent difference is the stability and homogeneity of the batch gradients rather than the mean gradient scale.