Batch Gradient Variance
Lower gradient variance across mini-batches is presented as beneficial because evaluation occurs on the full graph. At random initialization, RNS has the lowest loss variance and the lowest gradient variance among the compared samplers. The final term in the sampled modified objective is the variance of the mini-batch gradients, denoted R(w). Sequential per-batch SGD updates introduce a gradient-variance penalty. The main sampler-dependent difference is the stability and homogeneity of the batch gradients rather than the mean gradient scale.
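To make the quantity R(w) concrete, the following is a minimal sketch of how a batch-gradient variance could be estimated. It assumes R(w) is the mean squared deviation of the per-batch gradients from their average; the exact normalization used in the source is not stated, so that form is an assumption. The function name `batch_gradient_variance`, the helper `batch_grad`, and the toy least-squares problem are all hypothetical stand-ins for illustration and do not implement RNS or any other graph sampler.

```python
import numpy as np

def batch_gradient_variance(batch_grads):
    """Mean squared deviation of per-batch gradients from their average.

    Assumed form: R(w) = (1/m) * sum_k ||g_k - g_bar||^2,
    where g_k is the gradient on batch k and g_bar is their mean.
    """
    g = np.asarray(batch_grads, dtype=float)  # shape (m, d)
    g_mean = g.mean(axis=0)                   # average gradient over batches
    return float(np.mean(np.sum((g - g_mean) ** 2, axis=1)))

# Toy data (hypothetical): linear least squares instead of a graph model.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = rng.normal(size=64)
w = np.zeros(3)  # parameters at which the variance is measured

def batch_grad(idx):
    """Gradient of the mean squared error on the rows selected by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

# Split a random permutation of the rows into 8 equal mini-batches.
batches = np.array_split(rng.permutation(64), 8)
grads = [batch_grad(b) for b in batches]

R = batch_gradient_variance(grads)
print(f"estimated batch-gradient variance: {R:.4f}")
```

Under this definition, a sampler whose batches produce more homogeneous gradients yields a smaller R(w) even when the mean gradient is unchanged, which matches the distinction drawn above between gradient stability and mean gradient scale.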