Thread:Talk:Rolling Averages/Rolling Average vs Softmax & Cross Entropy

Revision as of 06:06, 27 July 2021

If each Guess Factor bin is considered an output unit before Softmax (a logit), and the loss is Cross Entropy, then the gradient of each logit is:

: q<sub>i</sub> - 1, if bin i is hit
: q<sub>i</sub>, otherwise

Where q<sub>i</sub> is the output of the i<sup>th</sup> unit after Softmax (estimated probability)
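As a concrete illustration (my own minimal sketch, not part of the original argument; the class name, method names and bin count are illustrative assumptions), this is the setup in Java: per-bin logits go through Softmax to get the estimated probabilities q<sub>i</sub>, and the Cross Entropy gradient with respect to each logit is q<sub>i</sub> - 1 for the hit bin and q<sub>i</sub> for every other bin.

<syntaxhighlight lang="java">
// Minimal sketch of the setup above (names and bin count are illustrative):
// treat each GuessFactor bin as a logit, softmax the logits into estimated
// probabilities q[i], and take the cross-entropy gradient w.r.t. the logits.
public class SoftmaxCrossEntropySketch {
    static final int BINS = 31; // hypothetical bin count

    // Softmax over the logits, with max-subtraction for numerical stability.
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);
        double[] q = new double[logits.length];
        double sum = 0;
        for (int i = 0; i < logits.length; i++) {
            q[i] = Math.exp(logits[i] - max);
            sum += q[i];
        }
        for (int i = 0; i < q.length; i++) q[i] /= sum;
        return q;
    }

    // Cross-entropy gradient w.r.t. each logit:
    // q[i] - 1 for the hit bin, q[i] for every other bin.
    static double[] crossEntropyGradient(double[] q, int hitBin) {
        double[] grad = q.clone();
        grad[hitBin] -= 1.0;
        return grad;
    }
}
</syntaxhighlight>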

If gradients are not applied to the logits as usual, but instead applied to q<sub>i</sub> itself, then:

: q<sub>i</sub> := q<sub>i</sub> - η * (q<sub>i</sub> - 1) = (1 - η) * q<sub>i</sub> + η * 1, if bin i is hit
: q<sub>i</sub> := q<sub>i</sub> - η * q<sub>i</sub> = (1 - η) * q<sub>i</sub> + η * 0, otherwise

This is essentially a rolling average, where η (the learning rate) equals α (the decay rate) in an exponential moving average.
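To make the equivalence concrete, here is a small sketch (a method to drop into the same illustrative class as above) of that update applied directly to the q<sub>i</sub>s. It is literally the exponential-moving-average update against a one-hot target, and since the target sums to 1 the q<sub>i</sub>s keep summing to 1.

<syntaxhighlight lang="java">
// Applying the gradient step directly to the probabilities q[i] (not to the
// logits) is exactly the rolling-average / exponential-moving-average update:
// every bin decays by (1 - eta) and the hit bin additionally receives eta * 1.
static void updateDirectlyOnQ(double[] q, int hitBin, double eta) {
    for (int i = 0; i < q.length; i++) {
        double target = (i == hitBin) ? 1.0 : 0.0; // one-hot observed outcome
        q[i] = (1 - eta) * q[i] + eta * target;    // rolling average with decay rate eta
    }
}
</syntaxhighlight>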

Anyway, this analogy isn't how rolling average actually works, as the logits don't equal the q<sub>i</sub>s at all. But what if we replaced rolling average with gradient descent? I suppose it could learn even faster, as the outputs farther from the real value would get a higher effective decay rate...
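A rough sketch of how I imagine that idea (my own guess at it, reusing the softmax helper from the first sketch; all names are illustrative): keep per-bin logits, apply the Cross Entropy gradient to the logits with ordinary gradient descent, and read the estimate back out through Softmax. The gradient magnitude grows with the error, so bins whose estimate is far from the observed outcome move faster.

<syntaxhighlight lang="java">
// Hedged sketch of the "replace rolling average with gradient descent" idea,
// reusing softmax(...) from the sketch above. Estimates far from the observed
// outcome get a larger gradient and therefore a larger step.
static double[] gradientStepOnLogits(double[] logits, int hitBin, double eta) {
    double[] q = softmax(logits);                        // current estimated probabilities
    for (int i = 0; i < logits.length; i++) {
        double grad = (i == hitBin) ? q[i] - 1.0 : q[i]; // cross-entropy gradient w.r.t. logit i
        logits[i] -= eta * grad;                         // plain gradient descent on the logits
    }
    return softmax(logits);                              // updated estimate after the step
}
</syntaxhighlight>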