Difference between revisions of "Thread:Talk:Rolling Averages/Rolling Average vs Softmax & Cross Entropy"
Revision as of 05:58, 27 July 2021
If each Guess Factor bin is considered an output unit before Softmax (a logit), and the loss is Cross Entropy, then the gradient with respect to each logit is:
- qi - 1, if bin i is hit
- qi, otherwise
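This closed-form gradient can be verified numerically (a minimal Python sketch; the function names are my own, and the loss taken is -log(q_hit), i.e. cross entropy against a one-hot target):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad_wrt_logits(logits, hit_bin):
    # d/dz_i of -log(softmax(z)[hit_bin]) simplifies to q_i - y_i:
    # q_i - 1 for the hit bin, q_i for every other bin
    q = softmax(logits)
    return [qi - (1.0 if i == hit_bin else 0.0) for i, qi in enumerate(q)]
```

A finite-difference check against -log(q_hit) agrees with the closed form.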
If the gradient is not applied to the logits as usual, but instead applied to qi itself, then:
- qi := qi - eta * (qi - 1) = (1 - eta) * qi + eta * 1, if bin i hit
- qi := qi - eta * qi = (1 - eta) * qi + eta * 0, otherwise
Which is essentially a rolling average, where eta (the learning rate) plays the role of alpha (the decay rate) in an exponential moving average.
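The equivalence is easy to see in code (a small sketch, treating the hit as a one-hot observation; names are mine):

```python
def ema_update(q, hit_bin, alpha):
    # exponential moving average toward the one-hot observation:
    # q_i := (1 - alpha) * q_i + alpha * y_i
    return [(1 - alpha) * qi + alpha * (1.0 if i == hit_bin else 0.0)
            for i, qi in enumerate(q)]

def grad_step_on_q(q, hit_bin, eta):
    # the gradient step applied directly to q itself:
    # q_i := q_i - eta * (q_i - y_i), which expands to the same EMA form
    return [qi - eta * (qi - (1.0 if i == hit_bin else 0.0))
            for i, qi in enumerate(q)]
```

With eta == alpha the two updates are identical, and both keep sum(q) equal to 1: the mass drained from the other bins exactly funds the hit bin, just as with a rolling-averaged bin array.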
Anyway, this analogy isn't exactly how rolling average works, since a logit doesn't equal qi at all. But what if we replaced rolling average with gradient descent? I suppose it could learn even faster, since the outputs farther from the real value get a higher decay rate...
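For comparison, the actual gradient-descent update on the logits would look like this (a sketch under the same one-hot-target assumption; whether it really learns faster in a bot is untested):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step_on_logits(logits, hit_bin, eta):
    # z_i := z_i - eta * (q_i - y_i); bins whose q is farthest from
    # the target move the most, i.e. get the fastest effective decay
    q = softmax(logits)
    return [z - eta * (qi - (1.0 if i == hit_bin else 0.0))
            for i, (z, qi) in enumerate(zip(logits, q))]
```

Repeated steps on the same hit bin push its probability up monotonically, while the step size on each other bin shrinks along with its qi.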