Thread:Talk:Rolling Averages/Rolling Average vs Softmax & Cross Entropy

Revision as of 06:06, 27 July 2021

If each Guess Factor bin is considered an output unit before Softmax (a logit), and the loss is Cross Entropy, then the gradient of each logit is:

: q<sub>i</sub> - 1, if bin i is hit
: q<sub>i</sub>, otherwise

Where q<sub>i</sub> is the output of the i<sup>th</sup> unit after Softmax (estimated probability)
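As a concrete illustration (my own minimal sketch, not part of the original argument; the class name, method names and bin count are illustrative assumptions), this is the setup in Java: per-bin logits go through Softmax to get the estimated probabilities q<sub>i</sub>, and the Cross Entropy gradient with respect to each logit is q<sub>i</sub> - 1 for the hit bin and q<sub>i</sub> for every other bin.

<syntaxhighlight lang="java">
// Minimal sketch of the setup above (names and bin count are illustrative):
// treat each GuessFactor bin as a logit, softmax the logits into estimated
// probabilities q[i], and take the cross-entropy gradient w.r.t. the logits.
public class SoftmaxCrossEntropySketch {
    static final int BINS = 31; // hypothetical bin count

    // Softmax over the logits, with max-subtraction for numerical stability.
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);
        double[] q = new double[logits.length];
        double sum = 0;
        for (int i = 0; i < logits.length; i++) {
            q[i] = Math.exp(logits[i] - max);
            sum += q[i];
        }
        for (int i = 0; i < q.length; i++) q[i] /= sum;
        return q;
    }

    // Cross-entropy gradient w.r.t. each logit:
    // q[i] - 1 for the hit bin, q[i] for every other bin.
    static double[] crossEntropyGradient(double[] q, int hitBin) {
        double[] grad = q.clone();
        grad[hitBin] -= 1.0;
        return grad;
    }
}
</syntaxhighlight>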

If gradients are not applied to the logits as usual, but instead applied to q<sub>i</sub> itself, then:

: q<sub>i</sub> := q<sub>i</sub> - η * (q<sub>i</sub> - 1) = (1 - η) * q<sub>i</sub> + η * 1, if bin i is hit
: q<sub>i</sub> := q<sub>i</sub> - η * q<sub>i</sub> = (1 - η) * q<sub>i</sub> + η * 0, otherwise

This is essentially a rolling average, where η (the learning rate) equals α (the decay rate) in an exponential moving average.
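To make the equivalence concrete, here is a small sketch (a method to drop into the same illustrative class as above) of that update applied directly to the q<sub>i</sub>s. It is literally the exponential-moving-average update against a one-hot target, and since the target sums to 1 the q<sub>i</sub>s keep summing to 1.

<syntaxhighlight lang="java">
// Applying the gradient step directly to the probabilities q[i] (not to the
// logits) is exactly the rolling-average / exponential-moving-average update:
// every bin decays by (1 - eta) and the hit bin additionally receives eta * 1.
static void updateDirectlyOnQ(double[] q, int hitBin, double eta) {
    for (int i = 0; i < q.length; i++) {
        double target = (i == hitBin) ? 1.0 : 0.0; // one-hot observed outcome
        q[i] = (1 - eta) * q[i] + eta * target;    // rolling average with decay rate eta
    }
}
</syntaxhighlight>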

Anyway, this analogy isn't how rolling average actually works, as the logits don't equal the q<sub>i</sub>s at all. But what if we replaced rolling average with gradient descent? I suppose it could learn even faster, as the outputs farther from the real value would get a higher effective decay rate...
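A rough sketch of how I imagine that idea (my own guess at it, reusing the softmax helper from the first sketch; all names are illustrative): keep per-bin logits, apply the Cross Entropy gradient to the logits with ordinary gradient descent, and read the estimate back out through Softmax. The gradient magnitude grows with the error, so bins whose estimate is far from the observed outcome move faster.

<syntaxhighlight lang="java">
// Hedged sketch of the "replace rolling average with gradient descent" idea,
// reusing softmax(...) from the sketch above. Estimates far from the observed
// outcome get a larger gradient and therefore a larger step.
static double[] gradientStepOnLogits(double[] logits, int hitBin, double eta) {
    double[] q = softmax(logits);                        // current estimated probabilities
    for (int i = 0; i < logits.length; i++) {
        double grad = (i == hitBin) ? q[i] - 1.0 : q[i]; // cross-entropy gradient w.r.t. logit i
        logits[i] -= eta * grad;                         // plain gradient descent on the logits
    }
    return softmax(logits);                              // updated estimate after the step
}
</syntaxhighlight>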