User talk:Voidious/Robocode Version Tests

From Robowiki
Revision as of 01:08, 19 July 2009 by GrubbmGait (talk | contribs) (My thought on what version to use)

I didn't think there would be any differences between 1.6.1.4 and 1.7.1.1, but the score difference is quite big: 0.82%! I think we are still using the fix, though, right? » Nat | Talk » 15:51, 17 July 2009 (UTC)

Yes, it's a pretty big difference (0.72% actually). 250 battles might not be enough, though, maybe I should run more for that pairing. If there is a 0.72% difference, I'd probably be against the change. But there's lots more to test first - I see my 1.6.1.4 CPU constant is ~20% higher than the 1.7* versions, that could even play a part. --Voidious 20:36, 17 July 2009 (UTC)

The Surfer-vs-Surfer battles seem to be affected much more than the Surfer-vs-Random battles. » Nat | Talk » 05:08, 18 July 2009 (UTC)

Not necessarily, Nat. These results could just mean that Komarious and Ascendant are affected differently than Diamond and DrussGT. Komarious/Ascendant vs the random movers would be needed to tell that. Really this test would be more revealing if there was a full round-robin between involved bots. I'd also wonder: Do Komarious or Ascendant actually assume anything that breaks in Alpha2? It's also worth noting that tests of score alone can't tell us if changes better match assumptions a bot is making, since a bot could just by fluke happen to operate in a way that is happy with conditions that the code didn't try to assume at all. --Rednaxela 05:26, 18 July 2009 (UTC)
There's also just a much larger variance in those pairings, so maybe 500 battles isn't even enough. My initial goal was just to find out if there was a measurable difference among these versions, so I was going for diversity in the battles I used. Full round robin between all bots (in all versions) might help in deducing causes, but this testing is already taking a "metric ass-ton" of CPU cycles =), so I'd definitely reduce the # of bots if I were to try that. It would still be pretty speculative, though.
Assuming that we have enough battles, the Diamond vs Komarious result I believe shows there are other differences between 1.6.1.4 and 1.7.x that are contributing. I noticed the CPU constant is a little different, but I'm pretty sure nobody's skipping turns anyway. The Alpha2 updateMovement code shouldn't change anything for these two bots, I don't think.
The PrairieWolf result is bizarre. 2% is well above any margin of error on 500 battles, I think. I thought PrairieWolf would have decreased performance, if anything, when we changed the +1/-1 decel rules, but he does better in Alpha3.
--Voidious 05:48, 18 July 2009 (UTC)
ATWHEB (Assuming that we have enough battles), it seems that the surfers gain score from these changes, which seems weird, since most surfers use the old way, including the old decel-through-zero rules. I wonder what the DrussGT vs. Diamond score will look like.
I think PrairieWolf vibrates a bit shorter, so DuelistMini missed him. » Nat | Talk » 06:07, 18 July 2009 (UTC)
You might have to test DuelistMini against another opponent -- maybe it's the one that's doing worse, as opposed to PrairieWolf doing better? Thanks for running all these tests, btw. Hopefully they'll help pick the right version and not just confuse things further! =) --Darkcanuck 06:25, 18 July 2009 (UTC)
No problem, though I will be taking some time to work on Diamond this weekend. =) Yeah, I don't think we can draw many conclusions just yet. I also just realized that both PW and DM do data saving... I'm actually not sure if that might throw off results or not (hopefully it does?), but maybe I'll try to find another vibrating test bot. I'm gonna run some stuff in 1.7.1.1 overnight here. --Voidious 06:36, 18 July 2009 (UTC)
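
The CPU cost of the full round robin suggested above can be estimated with a rough sketch like the one below. The function name `round_robin_battle_count` is hypothetical, and the bot and version lists in the usage note are only illustrative picks from names mentioned in this thread, not the actual test matrix.

```python
from itertools import combinations


def round_robin_battle_count(bots, versions, battles_per_pairing):
    """Count the battles needed for a full round robin in every version.

    Each unordered pair of distinct bots fights `battles_per_pairing`
    battles in each Robocode version under test.
    """
    pairings = list(combinations(bots, 2))  # every unordered pair, no self-matches
    return len(pairings) * len(versions) * battles_per_pairing


# Illustrative inputs only (assumed, not the real test setup).
example_bots = ["Diamond", "DrussGT", "Komarious", "Ascendant"]
example_versions = ["1.6.1.4", "1.7.1.1", "Alpha2", "Alpha3"]
example_total = round_robin_battle_count(example_bots, example_versions, 500)
```

Because the number of pairings grows with the square of the number of bots, dropping even one or two bots cuts the total noticeably, which fits the idea of reducing the bot count before trying this.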

Just want to know: what do you guys consider 'unacceptable' (the red value)? Currently I use ±0.1 as 'margin of error' (green value), ±0.4 as 'acceptable' (black value), and anything beyond that as 'unacceptable'. I don't normally run a lot of battles, so I don't know where the thresholds should be. » Nat | Talk » 07:30, 18 July 2009 (UTC)
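
The colour scheme described above could be written down as a small helper, shown here as a minimal sketch. The function name `classify_difference` and its parameter names are hypothetical. The default thresholds are the ±0.1 and ±0.4 values from the comment, and treating a difference exactly on a threshold as belonging to the lower band is an assumption.

```python
def classify_difference(diff, noise=0.1, acceptable=0.4):
    """Map a score difference (in percentage points) to a colour band.

    green: within the assumed margin of error
    black: measurable but still treated as acceptable
    red:   beyond the acceptable threshold
    """
    size = abs(diff)  # the sign does not matter, only the magnitude
    if size <= noise:
        return "green"
    if size <= acceptable:
        return "black"
    return "red"
```

With these defaults, the 0.72% difference mentioned earlier on this page would land in the red band.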

Margin of error = ±0.1 when comparing results from 500 battles is probably about right, maybe even ±0.2. (That would mean margin of error on each 500-battle result is half that.) I'm unsure of my opinion on "acceptable". My first instinct is "any measurable change is unacceptable", but I need to think about it. I just really, really hate the idea of all our old Robocode bots slowly becoming (artificially) weaker and weaker, just because of changes to Robocode... --Voidious 16:37, 18 July 2009 (UTC)
I too dislike the idea of changes progressively making old bots weaker and weaker, but I really don't think that these changes are something that makes old bots progressively weaker. I have a feeling that most of the score changes that could be seen would not be due to slightly breaking old assumptions, but instead are just flukes arising from things just plain being a little different. The tests show DrussGT vs Ascendant having stronger DrussGT scores than before, but I doubt that Ascendant assumes old behavior in a way that DrussGT doesn't. At the very least, I think the number of bots that are not acting quite as intended due to old Robocode movement being unintuitive/weird is greater than the number of bots that truly intentionally assume old behavior that differs from the Alpha2 mode of operation (Alpha3 being a slightly different story). --Rednaxela 22:35, 18 July 2009 (UTC)
I mostly agree, which is the reason I might be OK with some measurable score differences. However, I think that if there continue to be changes that have measurable impacts on scoring, older bots will gradually get weaker (or should I say, "weakerer", since they're bound to get relatively "weaker", anyway). Even if the code doesn't assume anything explicitly, it may implicitly -- it was tuned in a certain Robocode environment that is no longer reflective of how Robocode works. --Voidious 22:59, 18 July 2009 (UTC)
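
For the margin-of-error estimates discussed above, here is a minimal sketch of the usual normal-approximation arithmetic. It assumes battles are independent and that the per-battle standard deviation of the score percentage is known. The `std_dev` value is something you would have to measure from your own results, and the function names are hypothetical.

```python
import math


def margin_of_error(std_dev, n_battles, z=1.96):
    """Approximate 95% margin of error on a mean score over n_battles.

    std_dev is the per-battle standard deviation of the score percentage
    (an input you must measure yourself; no value is assumed here).
    """
    return z * std_dev / math.sqrt(n_battles)


def margin_of_difference(std_dev, n_battles, z=1.96):
    """Margin of error on the difference between two such mean scores.

    For two independent results the variances add, so the margin grows
    by a factor of sqrt(2) relative to a single result.
    """
    return math.sqrt(2) * margin_of_error(std_dev, n_battles, z)
```

Under these assumptions, the margin on each single result is about 0.71 times the margin on the difference rather than exactly half, and quadrupling the number of battles is what halves the margin.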

Let me throw in my 2 cents to the discussion. Although I am a rather conservative guy, I also like clear and simple behaviour. Therefore my vote undoubtedly goes for the Alpha-2 variant. Differences in score between the old code and Alpha-2 I take for granted as the old code just seems flawed.
I think that the Alpha-3 is one step too far, as the vibrating movement was rather popular in the days before Wavesurfing, and is used in several bots from that time. As for solving 'bugs' in the old code, there have been more changes that did influence the score, and we accepted them because they were logical and better following the rules than before. Think about onBulletHitBullet, which happens 50% more often now than in 1.0.6, favouring the bots that take them into account. --GrubbmGait 00:08, 19 July 2009 (UTC)