Talk:BeepBoop/Understanding BeepBoop


BeepBoop vs Yatagan

I've noticed that versus Yatagan BeepBoop never moves. The bullet shielding is effective enough that BeepBoop wins comfortably but I'm curious whether this behaviour is a bug or not. So far I've not seen any other bots that BeepBoop responds to in this way.

David414 (talk)‎

A quick way to find these bots is to filter for bots that BeepBoop beats at nearly 100% APS even though their average APS is quite high. Another quick way is to look at the bots with positive KNNPBI for BeepBoop.

Also, BeepBoop is open source, so you can have a look at the whitelist of shieldable bots ;)

Xor (talk)‎

I thought it might be some kind of whitelist; I should probably have looked for one. I wonder how much APS the whitelist is worth?

David414 (talk)‎

IIRC BeepBoop without the shield should be near 92%+ APS, so the shield whitelist is worth 2%-3% APS. Given that BeepBoop already has a very high APS, the APS gain from the shield system should be much higher on bots with a lower APS.

Xor (talk)‎

Property of gradient of cross-entropy loss with kernel density estimation

I'm quite curious about the behavior of the cross entropy loss between a target uniform distribution and kernel density estimation with softmax weight:

where a and b are the lower and upper bounds of the target uniform distribution, K is the kernel function (assumed normalized), x_j is the angle of data point j, and z_j is its logit (the weight before softmax).
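For reference, one way to write the loss described here is sketched below. It is a reconstruction from the definitions in this post, not a quote from BeepBoop's source; S_j is shorthand for the softmax weight of data point j, and the index k in the softmax denominator is introduced only for this sketch.

```latex
L = -\frac{1}{b - a} \int_a^b \log\!\left( \sum_j S_j \, K(x - x_j) \right) dx,
\qquad
S_j = \frac{e^{z_j}}{\sum_k e^{z_k}}
```

This is the usual cross entropy -∫ p(x) log q(x) dx with p uniform on [a, b] and q the softmax-weighted kernel density estimate.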

The integral is often calculated numerically, for example by binning, so let's consider the gradient with respect to the logit of the i-th data point, look at only one of the bins (with angle x), and ignore the constant factor 1 / (b - a) in front of the integral:

It degenerates to the ordinary cross-entropy loss with softmax (the multi-class classification case) when K is either 1 or 0 (and 1 iff the "label" matches): S_i - 1 when the label matches, or S_i when it does not. (S_i denotes the softmax weight of data point i.)
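A sketch of that single-bin gradient, reconstructed from the definitions in this post under the assumption that the per-bin loss is -log of the weighted kernel sum, and using the standard softmax derivative ∂S_j/∂z_i = S_j(δ_ij - S_i):

```latex
\frac{\partial}{\partial z_i}
\left[ -\log \sum_j S_j \, K(x - x_j) \right]
= S_i - \frac{S_i \, K(x - x_i)}{\sum_j S_j \, K(x - x_j)}
```

With K(x - x_i) = 1 for a single matching data point and 0 for all others, the denominator reduces to S_i and the expression becomes S_i - 1 for the match and S_i for the rest, which agrees with the degenerate case described next.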

But things start to get interesting when K is different.

The common scale of K(x - x_i) doesn't matter at all (e.g. whether all data points are far from the target distribution or all are near it). The effective learning rate is the same when two data points are equally near the target distribution as when they are equally far from it.

This behavior runs against the intuition that data points far from the target distribution should not drive as much learning as data points near it.
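The scale invariance can be checked numerically with a minimal sketch. It assumes the single-bin gradient S_i - S_i K_i / Σ_j S_j K_j; the helper names `softmax` and `bin_gradient` and all numbers are made up for illustration and are not taken from BeepBoop.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_gradient(logits, kernel_values):
    """Gradient of -log(sum_j S_j * K_j) with respect to each logit z_i."""
    weights = softmax(logits)
    density = sum(s * k for s, k in zip(weights, kernel_values))
    return [s - s * k / density for s, k in zip(weights, kernel_values)]

logits = [0.3, -1.2, 0.8]
near = [0.9, 0.5, 0.7]           # kernel values when the points are near the bin
far = [k * 1e-3 for k in near]   # same ratios, common scale 1000x smaller

grad_near = bin_gradient(logits, near)
grad_far = bin_gradient(logits, far)
print(grad_near)
print(grad_far)
```

Both calls print the same gradient up to floating-point error, because a common factor on every K_j cancels between the numerator and the denominator.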

Actually, when K(x - x_i) takes the form e^(-|x - x_i|), i.e. a Laplace distribution, there is one more interesting property: the distance between the center of the target uniform distribution and the data point closest to it can be subtracted from all the data points without affecting the gradient. Imagine all the data points moving toward the center of the target distribution at the same speed and stopping as soon as the first data point reaches it.
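A small numeric sketch of this shift property, restricted to a single bin at angle `x` with every data point on the same side of it (an assumption of this example; the helper names and numbers are hypothetical). A Gaussian-shaped kernel is included only as a contrast.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_gradient(logits, points, x, kernel):
    """Gradient of -log(sum_j S_j * K(x - x_j)) with respect to each logit."""
    weights = softmax(logits)
    k_vals = [kernel(x - p) for p in points]
    density = sum(s * k for s, k in zip(weights, k_vals))
    return [s - s * k / density for s, k in zip(weights, k_vals)]

def laplace(d):
    return math.exp(-abs(d))

def gaussian(d):
    return math.exp(-d * d)

x = 0.0
logits = [0.3, -1.2, 0.8]
points = [0.5, 1.2, 2.0]                 # all on the same side of the bin
shift = min(p - x for p in points)       # distance of the closest point
shifted = [p - shift for p in points]    # closest point now sits on the bin

lap_before = bin_gradient(logits, points, x, laplace)
lap_after = bin_gradient(logits, shifted, x, laplace)
gau_before = bin_gradient(logits, points, x, gaussian)
gau_after = bin_gradient(logits, shifted, x, gaussian)
print(lap_before, lap_after)
print(gau_before, gau_after)
```

For the Laplace kernel the shift multiplies every kernel value by the same factor e^shift, so it cancels in the gradient; for the Gaussian-shaped kernel the factor differs per point and the gradient changes.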

Xor (talk)‎

Interesting observations! The scale invariance of K actually seems like a good property to me. It means that K doesn't really need to be normalized, or more precisely that multiplying K by a constant multiplies the gradient by that constant, which seems like the behavior you'd want. Most loss functions (e.g., mean squared error) learn more for far data points than close ones. That might be good for surfing, but I could imagine that you may want the opposite for targeting where you "give up" on hard data points and focus on the ones that you might score a hit on. So perhaps BeepBoop's loss is a middle ground that works decently well for both.

--Kev (talk)‎

The thoughts on surfing & targeting are quite inspiring. And even if no data points fall within the width of K (the hard case), that case is still valuable, since there may be some data point just outside the width of K. Repeating the training process iteratively with the new weights may eventually turn that case into an easy one ;) Are you doing something similar as well?

Xor (talk)‎

One more finding. Actually, you don't need to take the integral or use bins at all: you can compute the loss for each data point separately and sum the losses. Although the loss values aren't equal, the gradients are exactly the same. This yields one more insight: the absolute predicted value isn't important at all; all that matters is how close the data points are to the target distribution relative to each other. As a result, the cluster used for one prediction doesn't need to be in the same batch; the data points can be shuffled entirely without affecting the result (theoretically).

Oops the calc is wrong.

Xor (talk)‎

Thread title	Replies	Last modified
BeepBoop vs Yatagan	3	14:36, 16 August 2025
Property of gradient of cross-entropy loss with kernel density estimation	3	17:26, 13 March 2022
