Thread history

From Talk:ScalarBot/Version History
Viewing a history listing
Time User Activity Comment
18:11, 16 June 2019 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
11:59, 14 June 2019 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
08:16, 6 June 2019 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
13:09, 5 June 2019 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
03:52, 31 May 2019 Xor (talk | contribs) Comment text edited  
03:38, 31 May 2019 Xor (talk | contribs) Comment text edited  
03:36, 31 May 2019 Xor (talk | contribs) Comment text edited  
03:34, 31 May 2019 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
03:59, 13 October 2017 Xor (talk | contribs) Comment text edited  
03:54, 13 October 2017 Xor (talk | contribs) Comment text edited  
03:49, 13 October 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
18:32, 12 October 2017 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
17:05, 12 October 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
12:10, 12 October 2017 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
12:09, 12 October 2017 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
13:34, 4 October 2017 Rsalesc (talk | contribs) New reply created (Reply to how to build a good test bed?)
04:53, 4 October 2017 Xor (talk | contribs) Comment text edited  
04:33, 4 October 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
00:29, 4 October 2017 Rsalesc (talk | contribs) New reply created (Reply to how to build a good test bed?)
00:07, 4 October 2017 Rsalesc (talk | contribs) New reply created (Reply to how to build a good test bed?)
23:56, 3 October 2017 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
23:30, 3 October 2017 Rsalesc (talk | contribs) New reply created (Reply to how to build a good test bed?)
11:19, 28 September 2017 Skilgannon (talk | contribs) Comment text edited  
11:17, 28 September 2017 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
10:09, 28 September 2017 Xor (talk | contribs) Comment text edited  
10:08, 28 September 2017 Xor (talk | contribs) Comment text edited  
10:03, 28 September 2017 Xor (talk | contribs) Comment text edited  
09:57, 28 September 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
08:25, 28 September 2017 Skilgannon (talk | contribs) Deleted (Author request)
08:25, 28 September 2017 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
07:53, 28 September 2017 Skilgannon (talk | contribs) New reply created (since deleted) (Reply to how to build a good test bed?)
07:16, 28 September 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
04:21, 28 September 2017 Beaming (talk | contribs) Comment text edited (typo)
04:21, 28 September 2017 Beaming (talk | contribs) New reply created (Reply to how to build a good test bed?)
03:18, 28 September 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
02:33, 28 September 2017 Beaming (talk | contribs) New reply created (Reply to how to build a good test bed?)
01:40, 28 September 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
23:15, 27 September 2017 Skilgannon (talk | contribs) Comment text edited  
23:14, 27 September 2017 Skilgannon (talk | contribs) New reply created (Reply to how to build a good test bed?)
16:40, 27 September 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
15:19, 27 September 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
13:07, 27 September 2017 GrubbmGait (talk | contribs) New reply created (Reply to how to build a good test bed?)
05:36, 27 September 2017 Xor (talk | contribs) Comment text edited  
05:34, 27 September 2017 Xor (talk | contribs) New reply created (Reply to how to build a good test bed?)
04:35, 27 September 2017 Beaming (talk | contribs) New reply created (Reply to how to build a good test bed?)
04:11, 27 September 2017 Xor (talk | contribs) Comment text edited  
03:08, 27 September 2017 Xor (talk | contribs) Comment text edited  
03:06, 27 September 2017 Xor (talk | contribs) Comment text edited  
03:04, 27 September 2017 Xor (talk | contribs) Comment text edited  
03:03, 27 September 2017 Xor (talk | contribs) New thread created  

how to build a good test bed?

Recently I tried a lot to tune the movement, and the result was promising: it performs very well in my test bed (which consists of some bots I performed badly against in the past, including some guess factor targeting bots, DC bots and a simple targeter with VG). However, the rumble result shows a huge performance regression ;/

Then I tried another one. When published to the rumble, it showed a huge performance increase (and a small increase at full pairing), but after ~5000 battles the performance had even decreased compared to the baseline version.

My test bed runs 35 rounds per battle for 30 seasons with 10 bots, 300 battles in total, but its results seem uncorrelated with the rumble score. Is it that the bots I chose make it a bad test bed, or just that I have too few battles?

The bots I use in my test bed are FloodHT, SandboxDT, RaikoMicro (gf targeting bots), Tron, Aleph (dc bots), Che, Fermet, WeeklongObsession (pattern matchers), GrubbmGrb (“simple” targeting)

Again, it seems that even after ~3000 battles, the rumble score is still not reliable enough to be used to compare two versions.

Then come my questions: How do you evaluate your bot? How many bots are there in your test bed and how many battles do you run for each of them?

Xor (talk)03:03, 27 September 2017

I think your problem is that you are already in the top 10 :) while you are testing against relatively simple bots (by modern standards). You probably already have scores pushing above 90% against these bots. If I were you I would choose a test bed from the top 30 or even the top 10. But after all, the only real test is the rumble; maybe there is a bunch of bots against which you are underperforming and none of them are in the test bed.

Otherwise I do something similar, but my bot is not that high, so my test bed shows relevant scores. Though sometimes it is somewhat off. I also notice that the score in the rumble always slides down until it settles. I am not sure why; maybe some bots which save stats keep improving with each round for a while.

But lately I notice that in the melee rumble the slide down is somewhat catastrophic. When I introduced EvBot v9.2 it was in the top 20 for the first 300 pairings or so, and then just plunged about 20 more places down. I have seen it with several of the latest releases and still cannot understand why.

Beaming (talk)04:35, 27 September 2017

IIRC, in past versions, the improvement over the previous version was a fairly good indicator of the final result, e.g. a 0.5 increase in APS over common pairings (e.g. 300 common opponents) indicated a 0.5 increase in final APS.

IMO the APS before full pairing is meaningless, but the difference in common APS is useful.

However, this version breaks the previous pattern: the difference in common APS is no longer an indicator, nor is the full pairing APS.

The reason why I test against relatively “weak” bots is that the majority of the rumble is there, and so is what affects your score the most. More than half of the bots in the rumble have an APS in [40, 70); there are only 160 bots above 70 and 324 bots below 40. Bots below APS 40 can be ignored IMO, as the improvement against them can only be marginal.

Xor (talk)05:34, 27 September 2017

Until the pairing is complete, APS is not a good indicator. I always go to the details of my bot and then select an older version to compare with. In that case only the bots that both versions have fought are taken into account. It indeed seems that the last 10% of the pairings involve the best opponents: GrubbmThree held around 58 APS until approx. 1000 pairings, then fell to 57.2. Note that even with 3000-5000 battles, there are still a lot of bots you have had only one fight against, so a few bad battles do have influence.

As for the testbed, I used to have around 20 bots in my testbed (50 seasons): 5 top-50 bots, 5 'white-whales', 5 between place 100-300 and a few specific ones to check whether something was broken (e.g. bbo.RamboT must score less than 0.5%)
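A minimal sketch of the common-pairings comparison described above, assuming each version's results are available as a plain mapping from opponent name to percentage score. The function name, the data format, and the numbers are made up for illustration; this is not the rumble server's own code.

```python
def common_aps_diff(old_scores, new_scores):
    """Compare two versions only over the opponents both have fought.

    old_scores / new_scores: dict of opponent name -> percentage score
    (0-100) of that version against the opponent (hypothetical format).
    Returns (number of common opponents, old APS, new APS, difference).
    """
    common = sorted(set(old_scores) & set(new_scores))
    if not common:
        raise ValueError("no common opponents")
    old_aps = sum(old_scores[bot] for bot in common) / len(common)
    new_aps = sum(new_scores[bot] for bot in common) / len(common)
    return len(common), old_aps, new_aps, new_aps - old_aps


# Illustrative numbers only: the new version has not met Aleph yet,
# so Aleph is left out of both averages.
old = {"RaikoMicro": 70.0, "Tron": 80.0, "Aleph": 75.0}
new = {"RaikoMicro": 72.0, "Tron": 79.0}
n, old_aps, new_aps, diff = common_aps_diff(old, new)
```

With few battles per pairing the difference is still noisy, which fits the point about single bad battles having influence.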

GrubbmGait (talk)13:07, 27 September 2017

Thanks for figuring that out! I thought the rumble stabilized very fast (the common APS diff over ~300 pairings being already useful), but it turned out not to.

Maybe I should build a test bed with more variety, e.g. bots from all over the rumble with different kinds of strategies.

Xor (talk)15:19, 27 September 2017
 
 
 

I found that when a newer version is tied with the previous version, comparing it with different older versions can help; or, best of all, compare it with some baseline version which has battled enough and is stable.

Xor (talk)16:40, 27 September 2017
 

It depends on what I am working on.

For movement, often a single bot is enough to prove a theory. Escape angle tuning is a rambot plus DevilFish, surfing mechanics is DoctorBob, anti-GF RaikoMicro, anti-fast-learning is Ascendant and for general unpredictability Shadow or Diamond.

Targeting I always find less interesting. Maybe because it is a more pure ML problem, with fewer ways to optimise that haven't already been studied in a related field. I decided to brute-force it by adding lots of features and then using a genetic optimization to tune the weights against recordings of the entire rumble population, about 5000 battles. The surfers I did separately, but with the same process.

Skilgannon (talk)23:14, 27 September 2017

Wow, thanks for sharing! In the past I only tuned the movement against RaikoMicro, using RoboRunner and carefully watching battles, and that way works very well. Recently I tried a more brute-force way but it does not seem to work. Maybe for an undeveloped ML area, some idea or theory is more useful.

About recordings of the entire population: I’m wondering whether it is useful to tune against wave surfers, which react to fire, in a setup where their reaction is irrelevant?

Or can we just treat wave surfers as some random movement that is not random enough? And with so many attributes, will their reaction to fire be inaccurate enough to be ignored, so that proper decay is enough?

BTW, I’m really curious about how long it takes for a generation ;) And how many threads you are using to run it ;)

Xor (talk)01:40, 28 September 2017

Movement I find much more interesting - I think there is still a lot of unexplored potential here. Targeting can only get as good as the ML system though. The only tricks I see from targeting side involve bullet shielding and bullet power optimization.

For surfers I evolved the weights in multiple steps - record data, tune weights, re-record data, retune weights etc. I agree fixed data isn't ideal against learning movements, but it seemed to work ok.

By recorded battles, I actually just recorded the ML style interactions. So the only work to do in the genetic algorithm was parse input line, add to tree, and if it was a firing tick then do KNN + kernel density and N ticks later check if the prediction was in the correct bounds.
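A rough sketch of that replay loop, under loud assumptions: the record format (`firing,gf_low,gf_high,feature...`), the function names, and the parameters are hypothetical. A brute-force neighbour search stands in for the Kd-Tree, and the sketch predicts before adding each point instead of modelling the delay between firing and the wave breaking.

```python
import math


def predict_gf(points, query, weights, k=3, bandwidth=0.2):
    """Brute-force KNN, then pick the neighbour GF with the highest
    Gaussian kernel density among the neighbours' GFs."""
    def dist(point):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, point[0], query))

    neighbours = sorted(points, key=dist)[:k]
    best_gf, best_density = 0.0, -1.0
    for _, candidate in neighbours:
        density = sum(math.exp(-0.5 * ((candidate - gf) / bandwidth) ** 2)
                      for _, gf in neighbours)
        if density > best_density:
            best_gf, best_density = candidate, density
    return best_gf


def replay_hit_rate(lines, weights, k=3):
    """Replay recorded waves and return hits / shots for these weights."""
    points = []  # stands in for the Kd-Tree: list of (features, gf)
    hits = shots = 0
    for line in lines:
        fields = line.strip().split(",")
        firing = fields[0] == "1"
        gf_low, gf_high = float(fields[1]), float(fields[2])
        features = [float(x) for x in fields[3:]]
        if firing and points:
            shots += 1
            aim = predict_gf(points, features, weights, k)
            if gf_low <= aim <= gf_high:  # prediction inside the hit bounds
                hits += 1
        # Every wave is learned from; only firing waves are predicted.
        points.append((features, (gf_low + gf_high) / 2))
    return hits / shots if shots else 0.0


# Made-up records: two learning-only waves, then two firing waves.
sample = [
    "0,0.4,0.6,0.1,0.9",
    "0,0.4,0.6,0.2,0.8",
    "1,0.45,0.65,0.15,0.85",
    "1,-0.6,-0.4,0.9,0.1",
]
rate = replay_hit_rate(sample, [1.0, 1.0])
```

Because all features are extracted at recording time, a candidate weight set can be scored by a single pass over the files with no Robocode engine involved.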

About 15 minutes per generation for an i5-2410M using 4 threads.

Skilgannon (talk)08:25, 28 September 2017

So only recording gun waves seems OK? And IMO the gun prediction of each wave can be evaluated immediately, since the result is already known. BTW, are you optimizing overall hit rate (e.g. total hits / total shots over all battles) or Robocode score (e.g. average bullet damage per battle)? I think the latter should be better when bullet power selection is also evaluated (or when it is not disabled). But since in real battles hits and misses also affect the total waves per round, that would be inaccurate for recorded battles. So how do you deal with bullet power? IMO using the recorded ones sounds reasonable, although not perfect.

The difference between evaluating overall hit rate and average bullet damage per battle is interesting. It seems that the latter weights each hit by its bullet damage. Also, when comparing average hit rate per battle with overall hit rate, the latter weights each battle by the number of bullets fired in it, while the former counts every battle equally.
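To make the weighting difference concrete, here is a small sketch with made-up numbers; the function names and the `(hits, shots)` data layout are assumptions for illustration. The pooled rate is a shots-weighted mean of the per-battle rates, while the per-battle mean counts each battle equally.

```python
def pooled_hit_rate(battles):
    """Total hits / total shots over all battles."""
    hits = sum(h for h, _ in battles)
    shots = sum(s for _, s in battles)
    return hits / shots


def mean_battle_hit_rate(battles):
    """Average of the per-battle hit rates, each battle counted once."""
    return sum(h / s for h, s in battles) / len(battles)


# (hits, shots) per battle, illustrative only: a long battle with a low
# hit rate and a short battle with a high one.
battles = [(10, 100), (30, 50)]
pooled = pooled_hit_rate(battles)            # 40 / 150
per_battle = mean_battle_hit_rate(battles)   # (0.10 + 0.60) / 2
```

The long battle pulls the pooled figure down much more than it pulls the per-battle mean, so the two objectives can rank the same weight sets differently.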

Xor (talk)09:57, 28 September 2017

I optimized for hit rate. Bullet power was kept the same as when it was recorded.

And I saved/loaded all waves (for learning), but only did prediction using firing waves.

Skilgannon (talk)11:17, 28 September 2017
 

So... each of those generations was evolved against those 5000 battles, right? What was the size of your population? I tried my hand at genetic tuning some time ago but I gave up because my evolving step seemed too slow. I'm wondering what your population size was when you got those 15 minutes, because one generation with 150 battles takes waay more than that for me :/ I'll need some reference to optimize my targeting system.

Rsalesc (talk)23:30, 3 October 2017

From memory, population size was about 20. It was something between a gradient descent and a genetic algorithm, by moving from the stronger members away from the weaker members, plus some random component. Remember, I had already extracted all of the features etc, and saved them just before inserting into the Kd-Tree, so the only thing I needed at evaluation time was:

  1. read data from file
  2. add points to the tree
  3. KNN/KDE
  4. count inliers vs outliers -> give a score

Then at the end multiply the evolved weights with the code weights, recompile, and collect a new set of data; repeat until happy.
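One possible reading of "moving from the stronger members away from the weaker members, plus some random component", sketched with hypothetical names and parameters (`step`, `noise`, keeping the stronger half unchanged). This is not the original code, and the toy fitness below only stands in for the replay hit rate.

```python
import random


def next_generation(population, fitness, step=0.5, noise=0.05, rng=random):
    """Keep the stronger half; replace the weaker half with copies of
    strong members pushed further away from a random weak member."""
    ranked = sorted(population, key=fitness, reverse=True)
    keep = (len(ranked) + 1) // 2
    strong, weak = ranked[:keep], ranked[keep:]
    children = []
    for parent in strong[:len(weak)]:
        loser = rng.choice(weak)
        # Step from the loser through the parent, then jitter. Clamping
        # at zero is an assumption: negative distance weights are avoided.
        child = [max(0.0, p + step * (p - q) + rng.gauss(0.0, noise))
                 for p, q in zip(parent, loser)]
        children.append(child)
    return strong + children


# Toy fitness: closeness to a made-up optimum weight vector.
target = [1.0, 2.0, 0.5]


def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


rng = random.Random(0)
population = [[rng.uniform(0.0, 3.0) for _ in target] for _ in range(20)]
first_best = max(map(toy_fitness, population))
for _ in range(30):
    population = next_generation(population, toy_fitness, rng=rng)
last_best = max(map(toy_fitness, population))
```

Since the stronger half survives unchanged, the best score never gets worse between generations when the fitness is deterministic, as it is with fixed recordings.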

Skilgannon (talk)23:56, 3 October 2017
 

I’m doing nearly the same thing now. I write KNN data points and GFs to files, so all I do is:

read data from file; add to tree; KNN/KDE; count inliers vs outliers. And I’m only doing KNN/KDE on firing waves.

However, it takes me ~10 min per generation with only 1500 TCRM battles.

My population size is also 20, and I’m also using 4 threads. It’s a Core i7 with 4 cores at 2.6 GHz, so it should be even faster than the i5-2410M, which has only 2 cores.

Are you reading data and adding to the tree at the same time, or reading all the data into memory in one go and then adding it to the tree?

Xor (talk)03:34, 31 May 2019

It was read a line, add to tree, and if it was a firing tick do a prediction. For parallelization I just started a new thread for each bot, and joined the thread when the bot was processed. It would probably be a bit faster with a thread pool.
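A minimal sketch of the thread-pool variant, assuming one recording file per opponent and a caller-supplied scoring function; `evaluate_all`, `bot_files`, and `evaluate_bot` are hypothetical names. It uses Python's standard `concurrent.futures`; in Java the analogous tool would be an `ExecutorService`.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_all(bot_files, evaluate_bot, max_workers=4):
    """Score every recorded opponent on a fixed-size pool of workers.

    bot_files: list of recording identifiers, one per opponent.
    evaluate_bot: function taking one identifier and returning a score.
    Returns a dict of identifier -> score.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; workers are reused
        # instead of starting one new thread per bot.
        scores = pool.map(evaluate_bot, bot_files)
        return dict(zip(bot_files, scores))


# Stand-in scoring function for illustration: the length of the name.
result = evaluate_all(["a", "bb", "ccc"], len)
```

In CPython, CPU-bound pure-Python scoring would not run in parallel on threads because of the global interpreter lock; `ProcessPoolExecutor` offers the same `map` interface if that matters.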

Unfortunately I think I lost this code, I think it was on my University computer...

Skilgannon (talk)13:09, 5 June 2019
 
 
 

Holy smoke! Using the whole rumble for tuning. It probably takes half a day to run one generation of a genetic algorithm.

Beaming (talk)02:33, 28 September 2017

5000 battles on the fly takes me ~4 hrs IIRC. But recorded battles should take less time IMO.

Xor (talk)03:18, 28 September 2017

What are recorded battles?

Beaming (talk)04:21, 28 September 2017

e.g. WaveSim by Voidious

Xor (talk)07:16, 28 September 2017