Difference between revisions of "Thread:Talk:BeepBoop/Understanding BeepBoop/Property of gradient of cross-entropy loss with kernel density estimation"

From Robowiki

Revision as of 09:08, 5 February 2022

I'm quite curious about the behavior of the cross-entropy loss between a uniform distribution and a kernel density estimate with softmax weights:

Cross-entropy-kde.png

where a and b are the lower and upper bounds of the target uniform distribution, K is the kernel function (assumed normalized), x_j is the angle of the j-th data point, and z_j is its weight before softmax.
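The image itself is not reproduced in this text. One reading that is consistent with the definitions above, offered only as an assumption about what the figure shows, is the cross entropy between the uniform density 1/(b - a) on [a, b] and the softmax-weighted kernel density estimate. The symbols L and S_j are names introduced here for illustration.

```latex
% Assumed form of the loss; L and S_j are illustrative names, not from the original figure.
L = -\frac{1}{b-a} \int_a^b \log\!\Big( \sum_j S_j \, K(x - x_j) \Big) \, dx,
\qquad
S_j = \frac{e^{z_j}}{\sum_k e^{z_k}}
```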

The integral is usually evaluated numerically, for example by binning, so let's look at the gradient with respect to the i-th data point's pre-softmax weight z_i, consider only one of the bins (at angle x), and ignore the constant factors in front of the integral:

Derivative-cross-entropy-kde.png

It degenerates to the ordinary cross-entropy loss with softmax when K is either 1 or 0 (1 if and only if the "label" matches): the gradient is S_i - 1 when the label matches and S_i when it does not, where S_i is the softmax output for the i-th data point.
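As a sketch of where that degenerate case comes from, the single-bin gradient can be worked out by hand. This assumes the per-bin loss is the negative log of the softmax-weighted kernel sum; the names \ell, p, and K_j are shorthand introduced here, and the image above may use different notation.

```latex
% Assumptions: S_j = softmax(z)_j, K_j = K(x - x_j), p = \sum_j S_j K_j, \ell = -\log p.
\frac{\partial S_j}{\partial z_i} = S_j (\delta_{ij} - S_i)
\;\Longrightarrow\;
\frac{\partial \ell}{\partial z_i}
  = -\frac{1}{p} \sum_j K_j S_j (\delta_{ij} - S_i)
  = -\frac{S_i K_i - S_i \, p}{p}
  = S_i \left( 1 - \frac{K_i}{\sum_j S_j K_j} \right)

% Degenerate case: K_c = 1 for the matching label c and K_j = 0 otherwise, so p = S_c:
%   i = c:     S_c (1 - 1/S_c) = S_c - 1
%   i \neq c:  S_i (1 - 0)     = S_i
```

In this form K enters the gradient only through the ratio K_i / \sum_j S_j K_j.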

But things start to get interesting when K takes other values.

The overall scale of K(x - x_i) doesn't matter at all (e.g. when all data points are far from the target distribution): multiplying every kernel value by the same constant leaves the gradient unchanged. The effective learning rate is the same whether two data points are equally near the target distribution or equally far from it.

This behavior contradicts the intuition that when the data points are far from the target distribution, one should not learn much from that scenario.
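The scale-invariance observation can be checked numerically. The following is a minimal sketch, assuming the single-bin loss is -log(sum_j S_j K_j) with S = softmax(z); the function names (`bin_loss`, `bin_grad`, `numeric_grad`) and the sample values are hypothetical and not taken from BeepBoop.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bin_loss(z, k):
    # Assumed single-bin loss: negative log of the softmax-weighted kernel sum.
    return -np.log(np.dot(softmax(z), k))

def bin_grad(z, k):
    # Analytic gradient w.r.t. z under that assumption:
    # S_i * (1 - K_i / sum_j S_j K_j)
    s = softmax(z)
    return s * (1.0 - k / np.dot(s, k))

def numeric_grad(z, k, eps=1e-6):
    # Central finite differences, used only to cross-check bin_grad.
    g = np.zeros_like(z)
    for i in range(len(z)):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (bin_loss(z + d, k) - bin_loss(z - d, k)) / (2 * eps)
    return g

# Hypothetical example: three data points, kernel values at one bin.
z = np.array([0.3, -1.2, 0.8])
k_near = np.array([0.9, 0.5, 0.7])   # data points close to the bin
k_far = 1e-6 * k_near                # same shape, all points "far away"

g_near = bin_grad(z, k_near)
g_far = bin_grad(z, k_far)

# Degenerate 0/1 kernel: recovers the usual softmax cross-entropy gradient.
one_hot = np.array([1.0, 0.0, 0.0])
g_one_hot = bin_grad(z, one_hot)
```

Under these assumptions `g_near` and `g_far` agree up to floating-point error even though the kernel values differ by six orders of magnitude, and `g_one_hot` equals `softmax(z) - one_hot`.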