Rolling Average vs Gradient Descent with Softmax & Cross Entropy

If each Guess Factor bin is treated as an output unit before Softmax (a logit), and the loss is Cross Entropy, then the gradient of the loss with respect to each logit is:

q_i - 1, if bin is hit

q_i, otherwise

Where q_i is the output of the i^th unit after Softmax (estimated probability)
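
The gradient above can be checked numerically. The following is a minimal sketch in plain Python, not code from any actual gun; the function names (softmax, cross_entropy, analytic_grad, numeric_grad) and the example logits are made up for illustration.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, hit):
    # Loss when bin `hit` is the one that was actually hit.
    return -math.log(softmax(logits)[hit])

def analytic_grad(logits, hit):
    # q_i - 1 for the hit bin, q_i for every other bin.
    q = softmax(logits)
    return [q_i - (1.0 if i == hit else 0.0) for i, q_i in enumerate(q)]

def numeric_grad(logits, hit, eps=1e-6):
    # Central finite differences, one logit at a time.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        grad.append((cross_entropy(up, hit) - cross_entropy(down, hit)) / (2 * eps))
    return grad

example_logits = [0.3, -1.2, 0.8, 0.0, 2.1]  # arbitrary illustrative values
g_analytic = analytic_grad(example_logits, hit=2)
g_numeric = numeric_grad(example_logits, hit=2)
```

The two gradients should agree up to finite-difference error, and the entries of the gradient sum to zero because the q_i sum to 1.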

If the gradient step is applied directly to q_i itself, rather than to the logits as it normally would be, then:

q_i := q_i - η * (q_i - 1) = (1 - η) * q_i + η * 1, if bin i hit

q_i := q_i - η * q_i = (1 - η) * q_i + η * 0, otherwise

This is essentially a rolling average, where η (the learning rate) equals the α (decay rate) of an exponential moving average.
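
To make the equivalence concrete, here is a small sketch comparing the two updates side by side. The helper names and the starting values are hypothetical, chosen only for illustration.

```python
def gradient_step_on_q(q, hit, eta):
    # q_i := q_i - eta * (q_i - y_i), with y_i = 1 for the hit bin, else 0.
    return [q_i - eta * (q_i - (1.0 if i == hit else 0.0)) for i, q_i in enumerate(q)]

def rolling_average(q, hit, alpha):
    # Exponential moving average towards the one-hot observation.
    return [(1 - alpha) * q_i + alpha * (1.0 if i == hit else 0.0) for i, q_i in enumerate(q)]

q0 = [0.1, 0.2, 0.4, 0.2, 0.1]  # illustrative bin probabilities
by_gradient = gradient_step_on_q(q0, hit=1, eta=0.3)
by_average = rolling_average(q0, hit=1, alpha=0.3)
```

Both lists come out the same (up to floating-point rounding), and they still sum to 1, since the per-bin gradients sum to zero.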

Anyway, this analogy isn't exact, as the logits are not equal to q_i at all. But what if we replace the rolling average with actual gradient descent on the logits? I suppose it could learn even faster, as the outputs farther from the real value get a higher decay rate...
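
For comparison, a sketch of the normal variant, where the same gradient is applied to the logits and the probabilities are read back through Softmax. The names, the bin count and the learning rate here are assumptions for illustration only; whether this actually learns faster in a real gun is untested.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step_on_logits(logits, hit, eta):
    # z_i := z_i - eta * (q_i - y_i), where q = softmax(z).
    q = softmax(logits)
    return [z - eta * (q_i - (1.0 if i == hit else 0.0))
            for i, (z, q_i) in enumerate(zip(logits, q))]

logits = [0.0] * 5             # start from a uniform estimate
before = softmax(logits)
logits = step_on_logits(logits, hit=2, eta=0.5)
after = softmax(logits)        # bin 2 gains probability, the others lose some
```

Unlike the update on q_i, the step size in probability space now depends on the current estimate, because the change passes through Softmax.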

Xor (talk)‎

Then, one step further: you don't use VCS any more, and instead add the logits of the velocity bins, accel bins, distance bins, etc. all together. This structure essentially estimates the probability as a product (up to normalization) of the probabilities for each attribute, e.g. the one for velocity being high and the one for distance being close.

If the movement profiles with respect to velocity, distance, etc. are independent, this approach will be mostly the same as traditional segmented VCS, but with more data points.

Note that this approach is essentially a neural network without hidden units, or multiclass logistic regression.
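
A minimal sketch of such a structure, assuming one table of per-bin logits for each attribute. The class name AdditiveLogitModel, the segment counts and the learning rate are all hypothetical; this is only an illustration of the idea, not an implementation from any existing bot.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class AdditiveLogitModel:
    def __init__(self, segment_counts, num_bins, eta=0.1):
        # tables[a][s][b]: logit contribution of segment s of attribute a to bin b.
        self.tables = [[[0.0] * num_bins for _ in range(n)] for n in segment_counts]
        self.num_bins = num_bins
        self.eta = eta

    def predict(self, segments):
        # Sum the logit rows selected by each attribute, then apply Softmax.
        logits = [0.0] * self.num_bins
        for table, seg in zip(self.tables, segments):
            for b in range(self.num_bins):
                logits[b] += table[seg][b]
        return softmax(logits)

    def update(self, segments, hit_bin):
        # Every selected row receives the same gradient q - y,
        # because the total logit is just the sum of the rows.
        q = self.predict(segments)
        for table, seg in zip(self.tables, segments):
            row = table[seg]
            for b in range(self.num_bins):
                target = 1.0 if b == hit_bin else 0.0
                row[b] -= self.eta * (q[b] - target)

# Two attributes (say 3 velocity segments and 4 distance segments), 7 bins.
model = AdditiveLogitModel(segment_counts=[3, 4], num_bins=7)
for _ in range(50):
    model.update((1, 0), hit_bin=3)
seen = model.predict((1, 0))      # the situation that was trained on
shared = model.predict((1, 2))    # same velocity segment, unseen distance segment
```

Because the velocity row is shared, the unseen combination (1, 2) also leans towards bin 3, which is where the extra data points per bin come from.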

Xor (talk)‎
