User talk:Voidious/Robocode Version Tests

From Robowiki
Revision as of 23:35, 18 July 2009 by Rednaxela (talk | contribs) (reply about modes of operation)

I don't think there are any differences between 1.6.1.4 and 1.7.1.1. The difference is quite big, 0.82%! But I think we are still using the fix, right? » Nat | Talk » 15:51, 17 July 2009 (UTC)

Yes, it's a pretty big difference (0.72% actually). 250 battles might not be enough, though, maybe I should run more for that pairing. If there is a 0.72% difference, I'd probably be against the change. But there's lots more to test first - I see my 1.6.1.4 CPU constant is ~20% higher than the 1.7* versions, that could even play a part. --Voidious 20:36, 17 July 2009 (UTC)

The Surfer-vs-Surfer battles seem to be affected much more than the Surfer-vs-Random battles. » Nat | Talk » 05:08, 18 July 2009 (UTC)

Not necessarily, Nat. These results could just mean that Komarious and Ascendant are affected differently than Diamond and DrussGT. Komarious/Ascendant vs the random movers would be needed to tell that. Really, this test would be more revealing if there was a full round-robin between the involved bots. I'd also wonder: Do Komarious or Ascendant actually assume anything that breaks in Alpha2? It's also worth noting that tests of score alone can't tell us if changes better match assumptions a bot is making, since a bot could just by fluke happen to operate in a way that is happy with conditions that the code didn't try to assume at all. --Rednaxela 05:26, 18 July 2009 (UTC)
There's also just a much larger variance in those pairings, so maybe 500 battles isn't even enough. My initial goal was just to find out if there was a measurable difference among these versions, so I was going for diversity in the battles I used. Full round robin between all bots (in all versions) might help in deducing causes, but this testing is already taking a "metric ass-ton" of CPU cycles =), so I'd definitely reduce the # of bots if I were to try that. It would still be pretty speculative, though.
Assuming that we have enough battles, the Diamond vs Komarious result I believe shows there are other differences between 1.6.1.4 and 1.7.x that are contributing. I noticed the CPU constant is a little different, but I'm pretty sure nobody's skipping turns anyway. The Alpha2 updateMovement code shouldn't change anything for these two bots, I don't think.
The PrairieWolf result is bizarre. 2% is well above any margin of error on 500 battles, I think. I thought PrairieWolf would have decreased performance, if anything, when we changed the +1/-1 decel rules, but he does better in Alpha3.
--Voidious 05:48, 18 July 2009 (UTC)
ATWHEB (Assuming that we have enough battles), it seems that the surfers gain score from these changes, which seems weird, since most surfers use the old way, including the old decel-through-zero rules. I wonder what the DrussGT vs. Diamond score will look like.
I think PrairieWolf vibrates a bit shorter, so DuelistMini missed him. » Nat | Talk » 06:07, 18 July 2009 (UTC)
You might have to test DuelistMini against another opponent -- maybe it's the one that's doing worse, as opposed to PrairieWolf doing better? Thanks for running all these tests, btw. Hopefully they'll help pick the right version and not just confuse things further! =) --Darkcanuck 06:25, 18 July 2009 (UTC)
No problem, though I will be taking some time to work on Diamond this weekend. =) Yeah, I don't think we can draw many conclusions just yet. I also just realized that both PW and DM do data saving... I'm actually not sure if that might throw off results or not (hopefully it does?), but maybe I'll try to find another vibrating test bot. I'm gonna run some stuff in 1.7.1.1 overnight here. --Voidious 06:36, 18 July 2009 (UTC)

Just want to know, what do you guys consider 'unacceptable' (the red value)? Currently I use ±0.1 as 'margin of error' (green value), ±0.4 as 'acceptable' (black value), and anything beyond that is 'unacceptable'. I don't normally run a lot of battles so I don't know where the cutoffs should be. » Nat | Talk » 07:30, 18 July 2009 (UTC)
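
For reference, the colour scheme described above can be written down as a small function. This is only a sketch of the thresholds as stated (±0.1 green, ±0.4 black, anything larger red). The function name, the parameter names, and the choice to treat the boundaries as inclusive are assumptions, not something the comment specifies.

```python
def classify_difference(diff, margin=0.1, acceptable=0.4):
    """Map a score difference (in percentage points) to a colour label.

    The thresholds follow the comment above. Inclusive boundaries are an
    assumption made for this sketch.
    """
    size = abs(diff)  # the sign does not matter, only the magnitude
    if size <= margin:
        return "green"  # within the assumed margin of error
    if size <= acceptable:
        return "black"  # measurable, but labelled 'acceptable'
    return "red"        # labelled 'unacceptable'
```

With these defaults, the 0.72% difference mentioned earlier in the thread would be labelled red.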

Margin of error = ±0.1 when comparing results from 500 battles is probably about right, maybe even ±0.2. (That would mean margin of error on each 500-battle result is half that.) I'm unsure of my opinion on "acceptable". My first instinct is "any measurable change is unacceptable", but I need to think about it. I just really, really hate the idea of all our old Robocode bots slowly becoming (artificially) weaker and weaker, just because of changes to Robocode... --Voidious 16:37, 18 July 2009 (UTC)
I too dislike the idea of changes progressively making old bots weaker and weaker, but I really don't think that these changes are something that makes old bots progressively weaker. I have a feeling that most of the score changes that could be seen would not be due to it slightly breaking old assumptions, but instead are just flukes that make things different. The tests show DrussGT vs Ascendant having stronger DrussGT scores than before, but I doubt that Ascendant assumes old behavior in a way that DrussGT doesn't. At the very least, I think the number of bots that are not acting quite as intended due to old Robocode movement being unintuitive/weird is greater than the number of bots that truly intentionally assume old behavior that differs from the Alpha2 mode of operation (Alpha3 being a slightly different story). --Rednaxela 22:35, 18 July 2009 (UTC)
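
As a rough aid for the margin-of-error estimates discussed above, here is a minimal sketch of how such a margin could be computed from per-battle scores. It assumes the battles are independent and that a normal approximation with z = 1.96 (about 95% confidence) is reasonable. All function names are hypothetical, and none of this is taken from the actual test setup. Under these assumptions the margin on the difference between two results is √2 times the margin of each (when they are equal), so each result's margin would be roughly 0.7 of the combined figure rather than exactly half.

```python
import math
import statistics


def margin_of_error(scores, z=1.96):
    """Approximate margin of error of the mean of per-battle scores."""
    # Standard error of the mean, scaled by the z value.
    return z * statistics.stdev(scores) / math.sqrt(len(scores))


def margin_of_difference(margin_a, margin_b):
    """Margin of error of the difference between two independent means."""
    # Independent errors add in quadrature.
    return math.hypot(margin_a, margin_b)


def battles_needed(per_battle_stdev, target_margin, z=1.96):
    """Battles needed to bring one result's margin down to target_margin."""
    return math.ceil((z * per_battle_stdev / target_margin) ** 2)
```

Because the margin shrinks with the square root of the battle count, halving it takes about four times as many battles, which is why pairings with larger variance may need far more than 500.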