Property of gradient of cross-entropy loss with kernel density estimation

I'm quite curious about the behavior of the cross-entropy loss between a uniform distribution and a kernel density estimate with softmax weights:

H(p, q) = -\frac{1}{b - a} \int_a^b \log\!\left( \sum_j S_j\, K(x - x_j) \right) dx, \qquad S_j = \frac{e^{z_j}}{\sum_k e^{z_k}}

where a and b are the lower and upper bounds of the target uniform distribution, K is the kernel function (assumed normalized), x_j is the angle of the j-th data point, and z_j is its weight before softmax.
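
For concreteness, here is a minimal numerical sketch of this loss, assuming a Gaussian kernel and approximating the integral by binning (the function names, bandwidth, and example values are all made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gaussian_kernel(u, bandwidth=0.1):
    # Any normalized kernel K works; a Gaussian is just an assumption here.
    return np.exp(-0.5 * (u / bandwidth) ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))

def cross_entropy_uniform_kde(z, x_data, a, b, n_bins=1000):
    # H(p, q) = -(1/(b-a)) * integral_a^b log(sum_j S_j K(x - x_j)) dx,
    # approximated by averaging -log q over equally spaced bins in [a, b].
    s = softmax(z)
    x = np.linspace(a, b, n_bins)                                    # bin centers
    q = (s * gaussian_kernel(x[:, None] - x_data[None, :])).sum(axis=1)
    return -np.log(q).mean()                                         # mean over bins ~ (1/(b-a)) * integral

# Example: three data points ("angles") and a uniform target on [-0.5, 0.5]
x_data = np.array([0.1, 0.4, 2.0])
z = np.zeros(3)
print(cross_entropy_uniform_kde(z, x_data, a=-0.5, b=0.5))
```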

The integral is often calculated numerically, e.g. by binning, so let's consider the gradient with respect to the i-th data point's weight before softmax, restricted to a single bin (at angle x) and ignoring the factor multiplied in front of the integral:

\frac{\partial}{\partial z_i}\left[ -\log\!\left( \sum_j S_j\, K(x - x_j) \right) \right] = S_i - \frac{S_i\, K(x - x_i)}{\sum_j S_j\, K(x - x_j)}
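
For completeness, this per-bin expression follows from the chain rule together with the softmax Jacobian; writing q(x) = \sum_j S_j\, K(x - x_j):

\frac{\partial}{\partial z_i}\bigl[-\log q(x)\bigr]
  = -\frac{1}{q(x)} \sum_j \frac{\partial S_j}{\partial z_i}\, K(x - x_j)
  = -\frac{1}{q(x)} \sum_j S_j (\delta_{ij} - S_i)\, K(x - x_j)
  = S_i - \frac{S_i\, K(x - x_i)}{q(x)}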

It degenerates to the ordinary cross-entropy loss with softmax when K is either 1 or 0 (and 1 iff the "label" matches): the gradient is S_i - 1 when the label matches and S_i when it mismatches.
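
A quick numerical sanity check of the per-bin gradient and of this degenerate case, using finite differences (a sketch; all names and values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def per_bin_loss(z, k_vals):
    # -log(sum_j S_j K(x - x_j)) for a single bin; k_vals[j] = K(x - x_j)
    return -np.log(np.dot(softmax(z), k_vals))

def per_bin_grad(z, k_vals):
    # Closed form: S_i - S_i K(x - x_i) / sum_j S_j K(x - x_j)
    s = softmax(z)
    return s - s * k_vals / np.dot(s, k_vals)

z = np.array([0.3, -1.2, 0.8])

# 1) Generic kernel values: the closed form agrees with central finite differences.
k_vals = np.array([0.9, 0.05, 0.3])
h = 1e-6
numeric = np.array([(per_bin_loss(z + h * np.eye(3)[i], k_vals)
                     - per_bin_loss(z - h * np.eye(3)[i], k_vals)) / (2 * h)
                    for i in range(3)])
print(np.allclose(per_bin_grad(z, k_vals), numeric, atol=1e-5))        # True

# 2) Degenerate one-hot kernel: gradient is S_i - 1 at the matching index, S_i elsewhere.
k_onehot = np.array([0.0, 1.0, 0.0])
print(np.allclose(per_bin_grad(z, k_onehot), softmax(z) - k_onehot))   # True
```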

But things start to get interesting when K takes values other than 0 and 1.

The overall scale of K(x - x_i) doesn't matter at all (e.g. when all data points are far from the target distribution), since any common factor in K cancels between the numerator and denominator of the second term. The effective learning rate is the same whether two data points are equally near the target distribution or equally far from it.
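
This scale invariance is easy to verify numerically: multiplying all kernel values by a common factor leaves the per-bin gradient unchanged (self-contained sketch, same illustrative helpers as above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def per_bin_grad(z, k_vals):
    # S_i - S_i K(x - x_i) / sum_j S_j K(x - x_j): a common factor in k_vals cancels.
    s = softmax(z)
    return s - s * k_vals / np.dot(s, k_vals)

z = np.array([0.3, -1.2, 0.8])
k_near = np.array([0.9, 0.05, 0.3])      # data points near the target bin
k_far = 1e-6 * k_near                    # everything far away, same relative shape

print(np.allclose(per_bin_grad(z, k_near), per_bin_grad(z, k_far)))    # True
```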

This behavior contradicts the intuition that, when the data points are far from the target distribution, one should not learn much from that scenario.

    Xor (talk) 09:51, 5 February 2022