User talk:Voidious/Robocode Version Tests

Revision as of 17:37, 18 July 2009 by Voidious (talk | contribs) (margin of error / what's acceptable)

I didn't think there were any differences between 1.6.1.4 and 1.7.1.1, but the difference is quite big, 0.82%! I think we are still using the fix, right? » Nat | Talk » 15:51, 17 July 2009 (UTC)

Yes, it's a pretty big difference (0.72% actually). 250 battles might not be enough, though, maybe I should run more for that pairing. If there is a 0.72% difference, I'd probably be against the change. But there's lots more to test first - I see my 1.6.1.4 CPU constant is ~20% higher than the 1.7* versions, that could even play a part. --Voidious 20:36, 17 July 2009 (UTC)

The Surfer-vs-Surfer battles seem to be affected much more than the Surfer-vs-Random battles. » Nat | Talk » 05:08, 18 July 2009 (UTC)

Not necessarily, Nat. These results could just mean that Komarious and Ascendant are affected differently than Diamond and DrussGT. Komarious/Ascendant vs the random movers would be needed to tell that. Really, this test would be more revealing if there were a full round-robin between the involved bots. I'd also wonder: Do Komarious or Ascendant actually assume anything that breaks in Alpha2? It's also worth noting that tests of score alone can't tell us whether changes better match the assumptions a bot is making, since a bot could just by fluke happen to operate in a way that is happy with conditions that the code didn't try to assume at all. --Rednaxela 05:26, 18 July 2009 (UTC)
There's also just a much larger variance in those pairings, so maybe 500 battles isn't even enough. My initial goal was just to find out if there was a measurable difference among these versions, so I was going for diversity in the battles I used. Full round robin between all bots (in all versions) might help in deducing causes, but this testing is already taking a "metric ass-ton" of CPU cycles =), so I'd definitely reduce the # of bots if I were to try that. It would still be pretty speculative, though.
Assuming that we have enough battles, the Diamond vs Komarious result I believe shows there are other differences between 1.6.1.4 and 1.7.x that are contributing. I noticed the CPU constant is a little different, but I'm pretty sure nobody's skipping turns anyway. The Alpha2 updateMovement code shouldn't change anything for these two bots, I don't think.
The PrairieWolf result is bizarre. 2% is well above any margin of error on 500 battles, I think. I thought PrairieWolf would have decreased performance, if anything, when we changed the +1/-1 decel rules, but he does better in Alpha3.
--Voidious 05:48, 18 July 2009 (UTC)
ATWHEB (Assuming that we have enough battles), it seems that the surfers gain score from these changes, which seems weird, since most surfers use the old way, including the old decel-through-zero rules. I wonder what the DrussGT vs. Diamond score will look like.
I think PrairieWolf vibrates a bit shorter, so DuelistMini missed him. » Nat | Talk » 06:07, 18 July 2009 (UTC)
You might have to test DuelistMini against another opponent -- maybe it's the one that's doing worse, as opposed to PrairieWolf doing better? Thanks for running all these tests, btw. Hopefully they'll help pick the right version and not just confuse things further! =) --Darkcanuck 06:25, 18 July 2009 (UTC)
No problem, though I will be taking some time to work on Diamond this weekend. =) Yeah, I don't think we can draw many conclusions just yet. I also just realized that both PW and DM do data saving... I'm actually not sure if that might throw off results or not (hopefully it does?), but maybe I'll try to find another vibrating test bot. I'm gonna run some stuff in 1.7.1.1 overnight here. --Voidious 06:36, 18 July 2009 (UTC)
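On the question above of whether 250 or 500 battles is enough: a rough way to estimate it is the usual standard-error formula. This is only a sketch under assumptions not taken from the tests themselves: battles are treated as independent, the mean score is treated as approximately normal, and `per_battle_sd` (the standard deviation of a single battle's percent score) is a hypothetical input that would have to be measured for each pairing. The function name `battles_needed` is made up for illustration.

```python
import math


def battles_needed(per_battle_sd, margin, z=1.96):
    """Estimate how many battles give a mean score within +/- margin.

    per_battle_sd: standard deviation of one battle's percent score
                   (hypothetical; measure it per pairing).
    margin:        desired margin of error, in percentage points.
    z:             normal quantile; 1.96 is roughly 95% confidence.
    """
    if per_battle_sd < 0 or margin <= 0:
        raise ValueError("per_battle_sd must be >= 0 and margin must be > 0")
    # margin = z * sd / sqrt(n)  =>  n = (z * sd / margin) ** 2
    return math.ceil((z * per_battle_sd / margin) ** 2)


if __name__ == "__main__":
    # Illustrative spreads only, not measured values.
    for sd in (1.0, 2.0, 4.0):
        print(sd, battles_needed(sd, 0.1))
```

Because the battle count grows with the square of the spread, a pairing with twice the per-battle variation needs about four times as many battles for the same margin, which fits the observation that the high-variance pairings may need more than 500.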

Just want to know: what do you guys consider 'unacceptable' (the red value)? Currently I use ±0.1 as the 'margin of error' (green value) and ±0.4 as 'acceptable' (black value); anything beyond that is 'unacceptable'. I don't normally run a lot of battles, so I don't know where the thresholds should be. » Nat | Talk » 07:30, 18 July 2009 (UTC)

Margin of error = ±0.1 when comparing results from 500 battles is probably about right, maybe even ±0.2. (That would mean margin of error on each 500-battle result is half that.) I'm unsure of my opinion on "acceptable". My first instinct is "any measurable change is unacceptable", but I need to think about it. I just really, really hate the idea of all our old Robocode bots slowly becoming (artificially) weaker and weaker, just because of changes to Robocode... --Voidious 16:37, 18 July 2009 (UTC)
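To make the relationship between the margin on one result and the margin on a comparison concrete, here is a small sketch. It assumes the two results are independent and approximately normal, and `per_battle_sd` is a hypothetical per-battle standard deviation, not a figure from these tests. Under those assumptions the margins combine in quadrature, so the margin on a difference of two equal-sized runs is about 1.41 times the margin on each run. Treating it as double each run's margin, as suggested above, is the more conservative choice.

```python
import math


def margin_of_error(per_battle_sd, battles, z=1.96):
    """Margin of error of a mean score over `battles` independent battles."""
    return z * per_battle_sd / math.sqrt(battles)


def margin_of_difference(margin_a, margin_b):
    """Margin of error of (result A - result B) for independent results.

    Variances add, so the margins combine as sqrt(a^2 + b^2).
    """
    return math.hypot(margin_a, margin_b)


if __name__ == "__main__":
    sd = 1.0  # hypothetical per-battle standard deviation, in percentage points
    single = margin_of_error(sd, 500)
    diff = margin_of_difference(single, single)
    print(f"each 500-battle result: +/-{single:.3f}")
    print(f"difference of two results: +/-{diff:.3f}")
```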
