Thread:Talk:Rolling Averages/Rolling Average vs Softmax & Cross Entropy
If each Guess Factor bin is considered an output unit before Softmax (a logit), and the loss is Cross Entropy, then the gradient with respect to each logit is:
- q_i - 1, if the bin is hit
- q_i, otherwise

where q_i is the output of the i-th unit after Softmax (the estimated probability).
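For concreteness, here is a minimal sketch of that gradient in Java (my own illustration, not from the post; the class name, bin count, and logit values are made up):

    import java.util.Arrays;

    public class SoftmaxGradientDemo {
        // Numerically stable softmax: subtract the max logit before exponentiating.
        static double[] softmax(double[] logits) {
            double max = Arrays.stream(logits).max().orElse(0.0);
            double[] q = new double[logits.length];
            double sum = 0.0;
            for (int i = 0; i < logits.length; i++) {
                q[i] = Math.exp(logits[i] - max);
                sum += q[i];
            }
            for (int i = 0; i < q.length; i++) {
                q[i] /= sum;
            }
            return q;
        }

        // Gradient of cross-entropy loss w.r.t. each logit, with hitBin as the
        // one-hot target: q_i - 1 for the hit bin, q_i otherwise.
        static double[] logitGradient(double[] q, int hitBin) {
            double[] grad = q.clone();
            grad[hitBin] -= 1.0;
            return grad;
        }

        public static void main(String[] args) {
            double[] logits = {0.5, -1.0, 2.0}; // hypothetical Guess Factor bin logits
            double[] q = softmax(logits);
            System.out.println("q    = " + Arrays.toString(q));
            System.out.println("grad = " + Arrays.toString(logitGradient(q, 2)));
        }
    }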
If the gradient is not applied to the logits as usual, but instead applied to q_i itself, then:
- q_i := q_i - eta * (q_i - 1) = (1 - eta) * q_i + eta * 1, if bin i is hit
- q_i := q_i - eta * q_i = (1 - eta) * q_i + eta * 0, otherwise
This is essentially a rolling average, where eta (the learning rate) equals alpha (the decay rate) in an exponential moving average.
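Written as code, the equivalence is explicit: the update is exactly an exponential moving average of the one-hot hit vector (again my own sketch; the method name is hypothetical):

    // Apply the gradient step directly to the probabilities q:
    // q_i := (1 - eta) * q_i + eta * target_i,
    // i.e. an exponential moving average of the one-hot hit vector.
    static void updateProbabilities(double[] q, int hitBin, double eta) {
        for (int i = 0; i < q.length; i++) {
            double target = (i == hitBin) ? 1.0 : 0.0;
            q[i] -= eta * (q[i] - target); // same as (1 - eta) * q[i] + eta * target
        }
    }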
That said, this analogy isn't how rolling average actually works, since a logit doesn't equal q_i at all. But what if we replaced rolling average with gradient descent? I suppose it could learn even faster, since outputs farther from the true value effectively get a higher decay rate...
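A minimal sketch of that idea (my own, assuming one logit per Guess Factor bin and reusing softmax from the sketch above): take the cross-entropy gradient step on the logits and read the bin probabilities off via softmax, so bins whose estimates are farther from the target receive larger updates.

    // Proposed alternative: keep per-bin logits and take the cross-entropy
    // gradient step on them instead of maintaining a rolling average.
    static void gradientStepOnLogits(double[] logits, int hitBin, double eta) {
        double[] q = softmax(logits); // current estimated probabilities
        for (int i = 0; i < logits.length; i++) {
            double target = (i == hitBin) ? 1.0 : 0.0;
            logits[i] -= eta * (q[i] - target); // larger error -> larger step
        }
    }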