12.4. Stochastic Gradient Descent — Dive into Deep Learning 1.0.0-beta0 documentation

d2l.ai



- Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each iteration of stochastic gradient descent, we uniformly sample an index i ∈ {1, …, n} for data examples at random, and compute the gradient ∇f_i(x) to update x
- we want to emphasize that the stochastic gradient ∇f_i(x) is an unbiased estimate of the full gradient ∇f(x) because E_i[∇f_i(x)] = (1/n) ∑_{i=1}^{n} ∇f_i(x) = ∇f(x)
- The only way to resolve these conflicting goals is to reduce the learning rate dynamically as optimization progresses.
- In the first piecewise constant scenario we decrease the learning rate, e.g., whenever progress in optimization stalls. This is a common strategy for training deep networks. Alternatively we could decrease it much more aggressively by an exponential decay. Unfortunately this often leads to premature stopping before the algorithm has converged. …
- Sampling with replacement leads to an increased variance and decreased data efficiency relative to sampling without replacement. Hence, in practice we perform the latter (and this is the default choice throughout this book).
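The highlighted update rule can be sketched in a few lines of NumPy. This is a minimal illustration on a made-up least-squares objective (the names `a`, `b`, `x_true`, and `grad_i` are ours, not the book's): each step samples one index uniformly and follows the negative per-example gradient, and averaging all per-example gradients recovers the full gradient, which is exactly the unbiasedness claim above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical objective: f(x) = (1/n) * sum_i f_i(x) with
# f_i(x) = 0.5 * (a[i] @ x - b[i]) ** 2  (a toy least-squares problem).
n, d = 100, 3
a = rng.normal(size=(n, d))
x_true = np.array([1.0, -2.0, 3.0])
b = a @ x_true + 0.1 * rng.normal(size=n)

def grad_i(x, i):
    """Gradient of the single-example loss f_i at x."""
    return (a[i] @ x - b[i]) * a[i]

def full_grad(x):
    """Full gradient: the average of the n per-example gradients."""
    return a.T @ (a @ x - b) / n

def sgd_step(x, lr, rng):
    """One SGD step: sample i uniformly from {0, ..., n-1}."""
    i = rng.integers(n)
    return x - lr * grad_i(x, i)

x = np.zeros(d)

# Unbiasedness: E_i[grad f_i(x)] = (1/n) * sum_i grad f_i(x) = grad f(x).
avg = sum(grad_i(x, i) for i in range(n)) / n
assert np.allclose(avg, full_grad(x))

# Run a few hundred steps with a small constant learning rate.
for _ in range(500):
    x = sgd_step(x, lr=0.01, rng=rng)
```

After these steps `x` should sit close to `x_true`, with the per-iteration cost of a single example's gradient rather than all n of them.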
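The decay schedules mentioned in the highlights can be written as tiny functions of the step count t. This is a generic sketch with constants of our own choosing (not the book's): piecewise constant keeps the rate flat and drops it at chosen milestones, exponential decay shrinks it every step and risks stalling training prematurely, and polynomial decay is a common gentler compromise.

```python
import math

def piecewise_constant(t, base=0.5, milestones=(1000, 2000), factor=0.1):
    """Keep the rate flat, multiplying by `factor` at each milestone step."""
    lr = base
    for m in milestones:
        if t >= m:
            lr *= factor
    return lr

def exponential_decay(t, base=0.5, decay=1e-3):
    """lr(t) = base * exp(-decay * t): aggressive; can stop progress too early."""
    return base * math.exp(-decay * t)

def polynomial_decay(t, base=0.5, alpha=0.5):
    """lr(t) = base * (1 + t) ** (-alpha): decays, but much more slowly."""
    return base * (1 + t) ** (-alpha)
```

For large t the exponential schedule is orders of magnitude below the polynomial one, which is why it tends to halt optimization before convergence.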
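The last highlight's variance claim can be checked empirically. This is a small simulation of our own (not from the book): we estimate a population mean from minibatches drawn with and without replacement, and the without-replacement estimator shows the smaller variance predicted by the finite-population correction factor (N − k)/(N − 1).

```python
import numpy as np

rng = np.random.default_rng(0)

N, k, reps = 1000, 256, 5000
data = rng.normal(size=N)  # stand-in "population" of per-example quantities

def minibatch_mean_var(replace):
    """Variance of the minibatch-mean estimator over many repetitions."""
    means = [rng.choice(data, size=k, replace=replace).mean() for _ in range(reps)]
    return np.var(means)

var_with = minibatch_mean_var(replace=True)
var_without = minibatch_mean_var(replace=False)

# Without replacement, the estimator's variance is reduced by the
# finite-population correction (N - k) / (N - 1) relative to with replacement.
correction = (N - k) / (N - 1)
```

The ratio `var_without / var_with` should land near `correction` (about 0.74 here), illustrating why sampling without replacement is the default choice.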

Glasp is a social web highlighter that lets people highlight and organize quotes and thoughts from the web, and access other like-minded people's learning.