Property of gradient of cross-entropy loss with kernel density estimation
I'm quite curious about the behavior of the cross-entropy loss between a target uniform distribution and a kernel density estimate with softmax weights. Here a and b are the lower and upper bounds of the target uniform distribution, K is the kernel function (assumed normalized), x_j is the angle of the j-th data point, and z_j is its logit (the weight before softmax).
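The formula itself does not appear in the text. A reconstruction consistent with the definitions given here, writing S_j for the softmax weight of data point j, would be the following (an assumption about the intended form, not the original expression):

```latex
L = -\frac{1}{b-a}\int_a^b \log\Big(\sum_j S_j\,K(x - x_j)\Big)\,dx,
\qquad S_j = \frac{e^{z_j}}{\sum_k e^{z_k}}
```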
The integral is usually evaluated numerically, for example by binning, so let's consider the gradient with respect to the logit of the i-th data point, looking at only one of the bins (with angle x) and ignoring the constant factor 1 / (b - a) in front of the integral.
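The gradient formula does not appear in the text. Assuming the per-bin loss is -log(sum_j S_j K(x - x_j)), with S_j the softmax weight of data point j, and using dS_j/dz_i = S_j (delta_ij - S_i), the chain rule gives the following reconstruction:

```latex
\frac{\partial}{\partial z_i}\Big[-\log\sum_j S_j\,K(x - x_j)\Big]
= -\frac{S_i\,K(x - x_i) - S_i \sum_j S_j\,K(x - x_j)}{\sum_j S_j\,K(x - x_j)}
= S_i - \frac{S_i\,K(x - x_i)}{\sum_j S_j\,K(x - x_j)}
```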
This gradient degenerates to that of the ordinary cross-entropy loss with softmax (the multi-class classification case) when K is either 1 or 0 (1 iff the "label" matches): S_i - 1 when the label matches and S_i when it does not, where S_i denotes the softmax weight of data point i.
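A quick numerical check of this degenerate case, as a minimal sketch. It assumes the per-bin loss is -log(sum_j S_j K_j) and that its gradient is S_i - S_i K_i / sum_j S_j K_j. The names `bin_loss`, `bin_grad` and `numeric_grad` are hypothetical helpers for illustration, and the logits are arbitrary.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bin_loss(z, k):
    # Assumed per-bin loss: negative log of the softmax-weighted kernel density.
    return -np.log(np.dot(softmax(z), k))

def bin_grad(z, k):
    # Assumed analytic gradient w.r.t. the logits: S_i - S_i K_i / sum_j S_j K_j.
    s = softmax(z)
    return s - s * k / np.dot(s, k)

def numeric_grad(z, k, eps=1e-6):
    # Central finite differences, used only to cross-check bin_grad.
    g = np.zeros_like(z)
    for i in range(len(z)):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (bin_loss(z + d, k) - bin_loss(z - d, k)) / (2 * eps)
    return g

z = np.array([0.3, -1.2, 2.0, 0.5])        # arbitrary logits
k_onehot = np.array([0.0, 0.0, 1.0, 0.0])  # K is 1 only where the "label" matches

s = softmax(z)
grad = bin_grad(z, k_onehot)

# Ordinary softmax cross-entropy gradient: S_i - 1 at the label, S_i elsewhere.
expected = s.copy()
expected[2] -= 1.0

k_general = np.array([0.2, 0.5, 0.9, 0.1])  # a non-degenerate kernel evaluation
grad_general = bin_grad(z, k_general)
grad_general_numeric = numeric_grad(z, k_general)
```

Here `grad` and `expected` should agree, and `grad_general` should match its finite-difference estimate.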
But things start to get interesting when K is different.
The common scale of K(x - x_i) doesn't matter at all (e.g. whether all data points are far from the target distribution, or all near it). The effective learning rate is the same when two data points are equally near the target distribution as when they are equally far from it.
This behavior contradicts the intuition that the model should not learn as much from data points far from the target distribution as from data points near it.
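The scale invariance can be checked numerically. This is a minimal sketch assuming the per-bin gradient S_i - S_i K_i / sum_j S_j K_j; `bin_grad` is a hypothetical helper, and the kernel values are made up to stand for a "near" and a "far" configuration that differ only by a common factor.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bin_grad(z, k):
    # Assumed per-bin gradient w.r.t. the logits: S_i - S_i K_i / sum_j S_j K_j.
    s = softmax(z)
    return s - s * k / np.dot(s, k)

z = np.array([0.3, -1.2, 2.0, 0.5])       # arbitrary logits
k_near = np.array([0.8, 0.4, 0.6, 0.2])   # kernel values when points are near the bin
k_far = 1e-6 * k_near                     # same relative values, tiny common scale

grad_near = bin_grad(z, k_near)
grad_far = bin_grad(z, k_far)
```

The common factor cancels between the numerator and the denominator, so `grad_near` and `grad_far` come out equal up to floating-point error.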
Actually, when K(x - x_i) takes the form e^(-|x - x_i|), i.e. the Laplace distribution, there is one more interesting property: the distance between the center of the target uniform distribution and the data point closest to it can be subtracted from all the data points without affecting the gradient. Imagine all the data points moving toward the center of the target distribution at the same speed and stopping as soon as the first data point reaches it.
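A small check of this shift property, as a sketch under stated assumptions: a single bin at the center of the target distribution, all data points on the same side of it, and the per-bin gradient S_i - S_i K_i / sum_j S_j K_j. The helper names and numbers are hypothetical. The Gaussian kernel is my own addition, included only as a contrast to show that the property depends on the exponential form.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bin_grad(z, k):
    # Assumed per-bin gradient w.r.t. the logits: S_i - S_i K_i / sum_j S_j K_j.
    s = softmax(z)
    return s - s * k / np.dot(s, k)

def laplace(d):
    # Unnormalized Laplace kernel e^(-|d|); the normalization cancels anyway.
    return np.exp(-np.abs(d))

def gauss(d):
    # Unnormalized Gaussian kernel, used only for contrast.
    return np.exp(-d ** 2)

x = 0.0                                  # bin at the center of the target distribution
xs = np.array([1.5, 2.0, 3.2, 4.1])      # data points, all on the same side of x
z = np.array([0.3, -1.2, 2.0, 0.5])      # arbitrary logits

shift = np.min(np.abs(xs - x))           # distance of the closest data point
xs_shifted = xs - shift                  # move every point toward x by that distance

lap_before = bin_grad(z, laplace(x - xs))
lap_after = bin_grad(z, laplace(x - xs_shifted))
gau_before = bin_grad(z, gauss(x - xs))
gau_after = bin_grad(z, gauss(x - xs_shifted))
```

With the Laplace kernel the shift multiplies every K value by the same factor e^shift, so the gradient is unchanged. With the Gaussian kernel the factor depends on each point's position, so the gradient changes.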
Interesting observations! The scale invariance of K actually seems like a good property to me. It means that K doesn't really need to be normalized, or more precisely that multiplying K by a constant multiplies the gradient by that constant, which seems like the behavior you'd want. Most loss functions (e.g., mean squared error) learn more for far data points than close ones. That might be good for surfing, but I could imagine that you may want the opposite for targeting where you "give up" on hard data points and focus on the ones that you might score a hit on. So perhaps BeepBoop's loss is a middle ground that works decently well for both.
The thoughts on surfing & targeting are quite inspiring. And even if no data points fall within the K size (the hard case), that case is still valuable, since there may be some data point just outside the K size. Repeating the training process iteratively with the new weights may eventually turn that case into an easy one ;) Are you doing something similar as well?
One more finding. Actually you don't need to take the integral or use bins at all: you can compute the loss for each data point separately and sum the losses. Although the loss values aren't equal, the gradients are exactly the same. This yields one more insight: the absolute predicted values aren't important at all; all that matters is how close the data points are to the target distribution relative to each other. As a result, the data points in the cluster used for one prediction don't have to be in the same batch. They can be shuffled entirely without affecting the result (theoretically).