Property of gradient of cross-entropy loss with kernel density estimation
Interesting observations! The scale invariance of K actually seems like a good property to me. It means that K doesn't really need to be normalized, or more precisely that multiplying K by a constant multiplies the gradient by that constant, which seems like the behavior you'd want. Most loss functions (e.g., mean squared error) learn more from far data points than from close ones. That might be good for surfing, but I could imagine wanting the opposite for targeting, where you "give up" on hard data points and focus on the ones you might score a hit on. So perhaps BeepBoop's loss is a middle ground that works decently well for both.
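To make the far-versus-close point concrete, here is a minimal sketch comparing gradient magnitudes. It is not BeepBoop's actual loss: the one-dimensional setup, the unnormalized Gaussian kernel, the bandwidth `h`, and the function names `mse_grad` and `neg_kernel_grad` are all assumptions chosen for illustration.

```python
import math

def mse_grad(x, t):
    """Gradient of the squared error (x - t)**2 with respect to x."""
    return 2.0 * (x - t)

def neg_kernel_grad(x, t, h=1.0):
    """Gradient of -K(x, t) with respect to x, where K is an
    unnormalized Gaussian kernel exp(-(x - t)**2 / (2 * h**2)).
    The kernel and the bandwidth h are illustrative assumptions."""
    d = x - t
    return (d / h**2) * math.exp(-d * d / (2.0 * h * h))

target = 0.0
for dist in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"distance {dist:4.1f}: "
          f"MSE grad {mse_grad(dist, target):6.2f}, "
          f"kernel grad {neg_kernel_grad(dist, target):.6f}")
```

Under these assumptions the squared-error gradient keeps growing with distance, while the kernel-based gradient peaks at a distance of `h` and then decays toward zero, which is the "give up on hard data points" behavior.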
The thoughts on surfing and targeting are quite inspiring. Even when no data points fall within the width of K (the hard case), that case is still valuable, since there may be a data point just outside that width. Repeating the training process iteratively with the new weights may eventually turn that case into an easy one ;) Are you doing something similar as well?
One more finding: you don't actually need to take an integral or use bins at all. You can compute the loss for each data point separately and sum the losses. Although the loss values aren't equal, the gradients are exactly the same. This yields one more insight: the absolute predicted values aren't important at all; all that matters is how close they are to the target distribution relative to each other. As a result, the cluster used for one prediction doesn't need to be in the same batch. The points can be shuffled entirely without affecting the result (theoretically).
Oops, the calc is wrong.