Cache effects on benchmarks
Maybe using a reference bot would make benchmarking more meaningful?
Pick one bot, put each tree inside it, one at a time, and run it against a 1v1 test bed.
For the best comparison, absolutely. However, it might be difficult to set up the battles so that every one is the same, particularly if the trees are non-deterministic due to things like ties between equal points. Also, it adds a lot of overhead, which would make testing very slow.
It would make testing slow, but the overhead is what will make benchmarks meaningful. Remove the overhead and you also remove cache thrashing.
Running a test bed much like we run challenges would do. Instead of making every battle identical, run many random battles and measure the average run time.
My worry is that many different trees would have to be tested in exactly the same way, and unlike scores, times differ from one computer to another. Perhaps we could put them all in parallel in the same robot and see how much time each one takes?
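A minimal sketch of that idea, assuming each candidate tree is adapted to a common wrapper (the NnSearcher interface and its add/nearest methods here are hypothetical, not any particular tree's real API): every tree receives the same points and the same queries each tick inside the one robot, and the time each spends is accumulated separately.

import java.util.List;

// Hypothetical common wrapper; each real kd-tree implementation would be adapted to it.
interface NnSearcher {
    void add(double[] point);
    List<double[]> nearest(double[] query, int k);
}

// Times several searchers side by side on identical input.
final class ParallelTreeTimer {
    private final String[] names;
    private final NnSearcher[] trees;
    private final long[] nanos; // accumulated time per tree, in nanoseconds

    ParallelTreeTimer(String[] names, NnSearcher[] trees) {
        this.names = names;
        this.trees = trees;
        this.nanos = new long[trees.length];
    }

    // Call once per scan: every tree gets the same point and the same query.
    void addAndQuery(double[] point, double[] query, int k) {
        for (int i = 0; i < trees.length; i++) {
            long start = System.nanoTime();
            trees[i].add(point);
            trees[i].nearest(query, k);
            nanos[i] += System.nanoTime() - start;
        }
    }

    // Print totals at the end of the battle so the ratios between trees can be compared,
    // even though the absolute times will differ between computers.
    void report() {
        for (int i = 0; i < trees.length; i++) {
            System.out.printf("%s: %.1f ms%n", names[i], nanos[i] / 1e6);
        }
    }
}

Comparing the ratios between the accumulated times, rather than the absolute numbers, would sidestep the computer-to-computer variation; the trade-off is that the trees now share one cache within the same robot, so the interleaving itself introduces some of the cache effects this thread is worried about.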