Rolling Average vs Gradient Descent with Softmax & Cross Entropy
If each Guess Factor bin is considered as an output unit before Softmax (a logit), and the loss is Cross Entropy, then the gradient with respect to the ith logit is:
- qi - 1, if bin i is hit
- qi, otherwise
Where qi is the output of the ith unit after Softmax (the estimated probability).
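As a quick sanity check (not from the original post), that gradient can be verified numerically against finite differences; the bin count and logit values below are made up:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical setup: 5 Guess Factor bins, bin 2 is the one hit.
z = np.array([0.5, -1.0, 2.0, 0.0, 1.5])  # logits
hit = 2
q = softmax(z)

# Analytic gradient of cross-entropy loss w.r.t. the logits:
# qi - 1 for the hit bin, qi otherwise.
grad = q.copy()
grad[hit] -= 1.0

# Finite-difference check of the same gradient.
def loss(z):
    return -np.log(softmax(z)[hit])

eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(5)[i]) - loss(z - eps * np.eye(5)[i])) / (2 * eps)
                    for i in range(5)])
print(np.allclose(grad, numeric, atol=1e-6))  # True
```

Note that the gradient components sum to zero, since the qi sum to 1.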
If the gradients are not applied to the logits as usual, but instead applied to qi itself, then:
- qi := qi - η * (qi - 1) = (1 - η) * qi + η * 1, if bin i is hit
- qi := qi - η * qi = (1 - η) * qi + η * 0, otherwise
Which is essentially a rolling average, where η (the learning rate) equals α (the decay rate) in an exponential moving average.
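To make the equivalence concrete, here's a small sketch (the bin estimates and η are made-up values) showing that the gradient step on qi and the EMA update produce identical numbers:

```python
import numpy as np

eta = 0.1                                  # learning rate, playing the role of alpha
q = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # current bin estimates (assumed values)
hit = 2
y = np.zeros_like(q)
y[hit] = 1.0                               # one-hot target: bin 2 was hit

# Gradient step applied directly to q: q := q - eta * (q - y)
q_grad_step = q - eta * (q - y)

# Exponential-moving-average form: q := (1 - eta) * q + eta * y
q_ema = (1 - eta) * q + eta * y

print(np.allclose(q_grad_step, q_ema))  # True
```

A side effect worth noting: because both q and y sum to 1, the updated estimates still sum to 1, so the update preserves normalization.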
Anyway, this analogy isn't how rolling average actually works, since the logits don't equal qi at all. But what if we replaced rolling average with gradient descent on the logits? I suppose it could learn even faster, as the outputs farther from the real value effectively get a higher decay rate...
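A sketch of what that replacement might look like; the bin count, learning rate, and `observe` helper are all assumptions for illustration, not code from an actual gun:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

BINS = 31           # assumed Guess Factor bin count
eta = 0.5           # assumed learning rate
z = np.zeros(BINS)  # logits start at zero, so softmax(z) starts uniform

def observe(z, hit_bin):
    """One gradient-descent step on the logits for a single observed hit."""
    q = softmax(z)
    grad = q.copy()
    grad[hit_bin] -= 1.0  # gradient is qi - 1 for the hit bin, qi otherwise
    return z - eta * grad

# Repeatedly observe hits on bin 15: probability mass concentrates there,
# and bins whose estimates are farther off receive larger corrections.
for _ in range(50):
    z = observe(z, 15)
q = softmax(z)
print(q.argmax())  # 15
```

Unlike the plain rolling average, the step size here scales with the error qi - yi, which is the "faster learning for outputs farther from the real value" effect described above.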