Property of gradient of cross-entropy loss with kernel density estimation
Interesting observations! The scale invariance of K actually seems like a good property to me. It means that K doesn't really need to be normalized, or more precisely that multiplying K by a constant multiplies the gradient by that constant, which seems like the behavior you'd want. Most loss functions (e.g., mean squared error) produce larger gradients for far data points than for close ones, so they learn more from the far ones. That might be good for surfing, but I could imagine that you may want the opposite for targeting, where you "give up" on hard data points and focus on the ones that you might score a hit on. So perhaps BeepBoop's loss is a middle ground that works decently well for both.
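To make the far-versus-close point concrete, here is a rough sketch comparing gradient magnitudes. It assumes a Gaussian kernel and a plain kernel-score loss of -K(d), where d is the distance from the prediction to a data point. Both are stand-ins chosen for illustration and not necessarily BeepBoop's actual loss. The names `mse_grad`, `gaussian_kernel`, `kernel_loss_grad`, `h` and `scale` are made up for this example.

```python
import math

def mse_grad(d):
    """Gradient magnitude of the squared error d**2 with respect to d."""
    return 2.0 * d

def gaussian_kernel(d, h=1.0, scale=1.0):
    """Assumed Gaussian kernel with width h; `scale` is an arbitrary constant factor."""
    return scale * math.exp(-0.5 * (d / h) ** 2)

def kernel_loss_grad(d, h=1.0, scale=1.0):
    """Gradient magnitude of the assumed loss -K(d) with respect to d."""
    return (d / h ** 2) * gaussian_kernel(d, h, scale)

# Squared error keeps pushing harder as d grows; the kernel loss fades out
# once d is well beyond the kernel width, i.e. it "gives up" on far points.
for d in (0.5, 1.0, 3.0, 6.0):
    print(f"d={d:4.1f}  mse_grad={mse_grad(d):6.2f}  kernel_grad={kernel_loss_grad(d):.6f}")

# Multiplying K by a constant multiplies the gradient by the same constant.
ratio = kernel_loss_grad(1.0, scale=3.0) / kernel_loss_grad(1.0, scale=1.0)
print(f"gradient ratio for scale=3: {ratio:.1f}")
```

Under these assumptions the kernel gradient is largest around d = h and shrinks on both sides, so which points count as "hard" depends directly on the kernel width.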
The thoughts on surfing & targeting are quite inspiring. And even if no data points lie within the size of K (the hard case), that case is still valuable, since there may be a data point just outside the size of K. Repeating the training process iteratively with the new weights may eventually turn that case into an easy one ;) Are you doing something similar as well?