all entries are set to 0 , except for the entry corresponding to our token
the length of the sequence introduces a new notion of depth. In addition to the passing through the network in the input-to-output direction, inputs at the first time step must pass through a chain of
layers along the time steps in order to influence the output of the model at the final time step.
For example, with learning rate � > 0 , each update takes the form � ← � − � � . Let’s further assume that the objective function � is sufficiently smooth. Formally, we say that the objective is Lipschitz continuous with constant � , meaning that for any � and � , we have
Glasp is a social web highlighter that people can highlight and organize quotes and thoughts from the web, and access other like-minded people’s learning.