how to build a good test bed?
IIRC, in the past versions, the improvement over previous version is somewhat good indicator of final result, e.g. 0.5 increase in APS of common pairings (e.g. 300 common opponents) indicates 0.5 increase in final APS.
IMO the APS until full pairing is meanless, but the difference in common APS is useful.
However, this version breaks the previous pattern. difference in common APS is no longer an indicator, nor the full pairing APS.
The reason why I test agasint relatively “weak” bots is that the majority of the rumble is there. And what affects your score the most is also there. More than half of the bots in rumble is between in APS [40, 70), and there are only 160 bots above 70, and 324 bots below 40. Bots below APS 40 can be ignored IMO, as the improvement against them can only be marginal.
Until the pairing is complete, APS is not a good indicator. I always go to the details of my bot and then select an older version to compare with. In that case only the bots that both versions have fought, are taken into account. It indeed seems that the last 10% of the pairings involve the best opponents, GrubbmThree held around 58 APS till approx 1000 pairings, then fell down to 57.2. Note that even with 3000-5000 battles, there are still a lot of bots you have only have one fight against, so a few bad battles do have influence.
As for testbed, I used to have around 20 bots in my testbed (50 seasons): 5 top-50 bots, 5 'white-whales', 5 between place 100-300 en a few specific ones to check whether something was broke (f.e. bbo.RamboT must score less than 0.5%)