Rolling Average vs Gradient Descent with Softmax & Cross Entropy

Revision as of 27 July 2021 at 04:06.

If each Guess Factor bin is treated as an output unit before Softmax (a logit), and the loss is Cross Entropy, then the gradient of the loss with respect to each logit is:

q_i - 1, if bin is hit

q_i, otherwise

where q_i is the output of the i-th unit after Softmax (the estimated probability).
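A minimal sketch to check that gradient numerically, assuming a plain-Python setup where each Guess Factor bin is one entry in a list of logits. The function names (softmax, cross_entropy, logit_gradient, numeric_gradient) are made up for illustration and are not taken from any bot's code.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, hit):
    # Loss for a single observation where bin `hit` was the one hit.
    return -math.log(softmax(logits)[hit])

def logit_gradient(logits, hit):
    # Analytic gradient: q_i - 1 for the hit bin, q_i for every other bin.
    q = softmax(logits)
    return [q_i - (1.0 if i == hit else 0.0) for i, q_i in enumerate(q)]

def numeric_gradient(logits, hit, eps=1e-6):
    # Central finite differences, used only to cross-check the formula.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, hit) - cross_entropy(down, hit)) / (2 * eps))
    return grads
```

Calling logit_gradient and numeric_gradient on the same logits and hit bin should give matching values up to finite-difference error, and the analytic gradient sums to zero across bins.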

If the gradient is applied not to the logits as usual, but directly to q_i with learning rate η, then:

q_i := q_i - η * (q_i - 1) = (1 - η) * q_i + η * 1, if bin i hit

q_i := q_i - η * q_i = (1 - η) * q_i + η * 0, otherwise

This is essentially a rolling average, where η (the learning rate) equals α (the decay rate) in an exponential moving average.
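The equivalence can be sketched in a few lines, assuming the bins are stored as a list of probabilities q and the target is 1 for the hit bin and 0 elsewhere. Both function names are hypothetical.

```python
def gradient_step_on_q(q, hit, eta):
    # Apply the logit gradient (q_i - target_i) directly to q_i.
    return [q_i - eta * (q_i - (1.0 if i == hit else 0.0))
            for i, q_i in enumerate(q)]

def rolling_average(q, hit, alpha):
    # Exponential moving average toward the one-hot hit vector.
    return [(1.0 - alpha) * q_i + alpha * (1.0 if i == hit else 0.0)
            for i, q_i in enumerate(q)]
```

With eta equal to alpha the two functions return the same bins up to floating-point rounding, and if q sums to 1 before the update it still sums to 1 afterwards.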

Anyway, this analogy is not exactly how rolling average works, as the logits are not equal to the q_i at all. But what if we replace rolling average with gradient descent? I suppose it could learn even faster, as outputs farther from the real value get a higher decay rate...
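A rough sketch for trying this out, assuming the bins start uniform, plain (non-momentum) gradient descent is used on the logits, and the same sequence of hit bins is fed to both methods. All names here (logit_step, rolling_step, compare) are hypothetical, and whether gradient descent actually learns faster is left for experiment.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_step(logits, hit, eta):
    # One gradient descent step on the logits with cross entropy loss.
    q = softmax(logits)
    return [z - eta * (q_i - (1.0 if i == hit else 0.0))
            for i, (z, q_i) in enumerate(zip(logits, q))]

def rolling_step(q, hit, alpha):
    # One rolling average step directly on the bin probabilities.
    return [(1.0 - alpha) * q_i + alpha * (1.0 if i == hit else 0.0)
            for i, q_i in enumerate(q)]

def compare(hits, bins, eta, alpha):
    # Feed the same hit sequence to both methods, starting from uniform bins.
    logits = [0.0] * bins
    avg = [1.0 / bins] * bins
    for hit in hits:
        logits = logit_step(logits, hit, eta)
        avg = rolling_step(avg, hit, alpha)
    return softmax(logits), avg
```

For example, compare([2, 2, 3, 2], 5, 0.5, 0.5) returns the two estimated distributions side by side, so their responses to the same hits can be inspected for different values of eta and alpha.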

Xor (talk)‎
