How to build a good test bed?
Until the pairings are complete, APS is not a good indicator. I always go to the details page of my bot and select an older version to compare with; in that case only the bots that both versions have fought are taken into account. It does seem that the last 10% of the pairings involve the best opponents: GrubbmThree held around 58 APS until approximately 1000 pairings, then fell to 57.2. Note that even with 3000-5000 battles there are still a lot of bots you have fought only once, so a few bad battles do have an influence.
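To illustrate the "only bots both versions have fought" comparison, here is a minimal Python sketch. It assumes you have already extracted per-opponent percentage scores for each version into plain dictionaries; the function name `common_aps`, the bot names and the scores are all made up for illustration and are not taken from the rumble pages.

```python
def common_aps(new_scores, old_scores):
    """Compare two versions using only opponents that both have fought.

    Each argument maps an opponent name to that version's average
    percentage score against it (hypothetical input format).
    Returns (new_aps, old_aps, number_of_common_opponents).
    """
    common = sorted(set(new_scores) & set(old_scores))
    if not common:
        raise ValueError("the two versions share no opponents")
    new_aps = sum(new_scores[bot] for bot in common) / len(common)
    old_aps = sum(old_scores[bot] for bot in common) / len(common)
    return new_aps, old_aps, len(common)


# Hypothetical data: the new version has incomplete pairings.
new_version = {"a.EasyBot": 80.0, "b.MidBot": 60.0}
old_version = {"a.EasyBot": 78.0, "b.MidBot": 58.0, "c.TopBot": 35.0}

new_aps, old_aps, n = common_aps(new_version, old_version)
print(f"new {new_aps:.1f} vs old {old_aps:.1f} over {n} common opponents")
```

With these made-up numbers the old version's overall APS would be pulled down by c.TopBot, which the new version has not met yet, so only the common-opponent figures are comparable.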
As for the testbed, I used to have around 20 bots in mine (50 seasons): 5 top-50 bots, 5 'white whales', 5 ranked between places 100 and 300, and a few specific ones to check whether something was broken (e.g. bbo.RamboT must score less than 0.5%).
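A testbed with that mix could be assembled roughly like this. It is only a sketch: it assumes you have the rumble ranking as a list ordered from place 1 downwards, and that you pick your white whales and sanity-check bots by hand. The function name `build_testbed` and its parameters are hypothetical, and choosing at random within each rank band is my assumption, not something described above.

```python
import random


def build_testbed(ranked_bots, white_whales, sanity_bots, per_group=5, seed=0):
    """Pick a small testbed from a ranking list ordered from place 1 down.

    ranked_bots  : bot names, index 0 is place 1 (assumed input format)
    white_whales : hand-picked bots you score unusually badly against
    sanity_bots  : hand-picked bots used to detect broken behaviour
    """
    rng = random.Random(seed)        # fixed seed keeps the testbed stable
    top = ranked_bots[:50]           # places 1..50
    middle = ranked_bots[99:300]     # places 100..300
    picked = rng.sample(top, min(per_group, len(top)))
    picked += rng.sample(middle, min(per_group, len(middle)))
    picked += list(white_whales)[:per_group]
    picked += list(sanity_bots)
    # Drop duplicates while preserving order.
    testbed = []
    for bot in picked:
        if bot not in testbed:
            testbed.append(bot)
    return testbed


# Hypothetical ranking of 400 placeholder names.
ranking = [f"bot{place}" for place in range(1, 401)]
whales = ["whale1", "whale2", "whale3", "whale4", "whale5"]
testbed = build_testbed(ranking, whales, ["bbo.RamboT"])
print(len(testbed), "bots in testbed")
```

Fixing the seed means the same opponents are used from one test run to the next, so scores of different versions stay comparable.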