Faster normalXAngle --> faster sin,cos,tan
Inspired by recent discussion I did a little profiling of DrussGT, and was horrified to find that plain old robocode.util.Utils.normalRelativeAngle() was using a huge percentage of my processing time. So, I tried various methods to see how they were for speed, and came up with one which takes about 1/4 of the time:
public static final double normalRelativeAngle(double d){ d += Math.PI; double i = Math.floor(d*(1/(2*Math.PI))); d -= i*(2*Math.PI); return d-Math.PI; } public static final double normalAbsoluteAngle(double d){ double i = Math.floor(d*(1/(2*Math.PI))); d -= i*(2*Math.PI); return d; }
I think the big improvement comes from no divisions, no % and no decisions, all of which are slow. Of course, it isn't ulp-accurate (although I'm not sure the previous one was either?), but over 1M calls the largest difference I saw between this and the old one was -3.552713678800501E-15 which is small enough for me to call 0.
Of course, this technique can then be applied to sin and cos, which then run about 2x the speed of previously, and on my system this is now ~4x the speed of Math.sin/cos:
public static final double sin(double value) { value = value*K + 0.5; double i = Math.floor(value*(1.0/TRIG_DIVISIONS)); value -= i*TRIG_DIVISIONS; return sineTable[(int)value]; } public static final double cos(double value) { value = value*K + 0.5 + Math.PI*0.5*K; double i = Math.floor(value*(1.0/TRIG_DIVISIONS)); value -= i*TRIG_DIVISIONS; return sineTable[(int)value]; }
It can also be applied to the tan method, but I hardly use tan so I haven't bothered with it. Additionally, TRIG_DIVISIONS is no longer restricted to power-of-2 numbers because we lost the &.
This code is public domain, do with it as you like, and give credit if you think I deserve it! Enjoy, and write speedy bots!
Further experimentation with the profiler tells me that our lookup-table sin and cos are actually slower than an expanded polynomial version, I suspect this is due to occasional cache misses. However, the faster normalRelativeAngle code is still useful, here are my latest sin/cos implementations:
public static final double sin(double d) { d += Math.PI; double x2 = Math.floor(d*(1/(2*Math.PI))); d -= x2*(2*Math.PI); d-=Math.PI; x2 = d * d; return //accurate to 6.82e-8, 3.3x faster than Math.sin, //faster than lookup table in real-world conditions due to no cache misses //all values from "Fast Polynomial Approximations to Sine and Cosine", Garret, C. K., 2012 (((((-2.05342856289746600727e-08*x2 + 2.70405218307799040084e-06)*x2 - 1.98125763417806681909e-04)*x2 + 8.33255814755188010464e-03)*x2 - 1.66665772196961623983e-01)*x2 + 9.99999707044156546685e-01)*d; } public static final double cos(double d) { d += Math.PI; double x2 = Math.floor(d*(1/(2*Math.PI))); d -= x2*(2*Math.PI); d-=Math.PI; d *= d; return //max error 5.6e-7, 4x faster than Math.cos, //faster than lookup table in real-world conditions due to less cache misses //all values from "Fast Polynomial Approximations to Sine and Cosine", Garret, C. K., 2012 ((((- 2.21941782786353727022e-07*d + 2.42532401381033027481e-05)*d - 1.38627507062573673756e-03)*d + 4.16610337354021107429e-02)*d - 4.99995582499065048420e-01)*d + 9.99999443739537210853e-01; }
Upon doing my own research (heavy googling) I've decided to try a modified version of the LUT code for now. The key difference is the overall footprint is much smaller than the original implementation, hopefully allowing it to remain in memory. I'm doing this by reducing the number of divisions pretty drastically, storing the values as single-precision, and then doing linear interpolation between the nearest two divisions to get the final output. The asymptotic functions will use polynomial implementations, because I fear the interpolation will totally wreck accuracy. And of course, testing will prove whether this is a good approach or not.
Have a look at Polynomial Approximations to Sine and Cosine. This uses no memory and is even faster than the lookup table approach. Skilgannon uses this as far as I know.
Also have a look at Chebyshev Approximation (http://metamerist.com/cheby/example38.htm archive) which works pretty well for atan, asin and acos.
That is what I'm referring to as the "polynomial implementation" above. My intuition is that the LUT version is only slower in practice because it gets evicted out of the cache due to being so large. I haven't done a side-by-side comparison test yet, but I've shrunk it by roughly a factor of 32 from the smallest version I've seen on this page, and I've added a couple other tricks to reduce the calculations required to find the lookup index. Once I truly need the speed boost I'll begin benchmarking the two against each other.
Table lookups are slow I suspect due to array index out-of-bounds checking.
It also happens in kNN search. I was able to double the speed of my bot by replacing a few arrays with objects with hardcoded number of attributes. Only works with constant sized arrays, like data point coordinates.
No, I'm fairly sure it's caching, because profiling shows LUT twice as slow as polynomial, but a micro-benchmark shows LUT ~30% faster.
I generate a list of random numbers in an array and then time how long it takes for each of the functions to process the entire array in a loop. The result of each function I add to a running total, and I print the total at the end to make sure it doesn't get optimised away, and also to see how accurate they are. I could also calculate the overhead if I just add the results of the array, but I haven't tried that yet, and I think it might get optimised into SSE ops anyway.
Just a thought, but why not incorporate the faster normalRelativeAngle and normalAbsoluteAngle into the codebase of Robocode. That way everybody profits from it in the future when a newer version than 1.8.1.0 is standard.
Personally, I wouldn't have any issues with it, but the fact that using a * instead of / probably loses a few ULPS and I'm not sure that is what we want for inside of the game engine. It is still faster than the standard method even using / , but by doing the - in a separate operation there is probably some ULPS loss there as well.
Fast-math classes/functions have lower precision.