Difference between revisions of "Talk:RoboResearch/Development"
Latest revision as of 02:43, 3 September 2009
Showing Results
I have already talked with some robocoders about how they would like to see results in the RoboResearch GUI. So far the best design I've come up with (merging ideas from Synapse, Rednaxela, Voidious and Chase-san) would open a separate window for each challenge+challenger. It would have a table that shows live-feed results like the summary on wiki pages. That window could be expanded to show a second table underneath the first, with the scores for each season. The top table could also be expanded to include a row for each version of the challenger in the database. Whichever version is selected in the top table is the one whose seasons would be displayed in the lower table. Can I get any feedback or new ideas? --Simonton 05:35, 21 September 2008 (UTC)
Well, apparently everyone either liked the idea or doesn't care enough to offer alternative input. The above idea has been implemented with a slight difference: the seasons are displayed in a separate window, at a button push. This means they do not change as you select a different version - instead you can open the seasons for each version in its own window. --Simonton 15:29, 1 October 2008 (UTC)
I would like to add some statistics to the results windows. Firstly I would like to highlight the top score among versions in the results table (or make it italic or underlined, or whatever). But to make that more useful I'd like to calculate the standard deviation of each score, then highlight any that are "close enough" to be tied for the top score. Then I'd like to do the same thing but only within those scores you select. I do it for my own benefit, but I hope candy like this makes you want to update RoboResearch and reap the benefits of my labor :). At some point I'll make a zip file and put it on sourceforge for everyone, so you don't have to deal with SVN. --Simonton 15:38, 1 October 2008 (UTC)
I'd be in the former group, I didn't see anything I didn't like =) Thanks for all this work you're putting into it. If I update will I lose all the seasons I've already run? I'm not that familiar with the database system you're using... -- Skilgannon 16:47, 1 October 2008 (UTC)
You will not lose any data by updating. However I have lost data when closing down my database process before, so I recommend periodic backups of the database directory (and due to the way it works, I recommend not deleting a backup until you're sure a more recent one actually has all your data). If you had any scripts that you modified from those in SVN, they may get updated or deleted (run.bat and run.cfg come to mind). Note that SVN is not as up-to-date as the features listed on the wiki; I'm currently without internet at home, so I only get to commit on the weekends. While I'm posting again let me ask, does anyone have any pointers about how to calculate which scores are "close enough to be tied"? I'm sure it's not hard, but I might as well ask for pointers instead of doing the statistics research myself :). --Simonton 17:26, 1 October 2008 (UTC)
- The scores are "close enough to be tied" if result1 +- margin_of_error1 and result2 +- margin_of_error2 overlap. The more they overlap the closer they are to tied, tied% = overlap/(2*min(margin_of_error1, margin_of_error2))*100 -- Skilgannon 20:06, 1 October 2008 (UTC)
- Makes sense, but how does one calculate "margin_of_error"? Some percentage of a standard deviation? If so, what percentage? And what about long-learning challenges where there really isn't enough data to get a good standard deviation (e.g. at one season s.d. is zero)? --Simonton 20:34, 1 October 2008 (UTC)
- Once there are enough seasons run, I believe a number proportional to the standard deviations would make sense. If we assume the scores are roughly in a normal distribution, then once there is enough data for accurate calculation of standard deviation it's not hard to state criteria like "how wide is the region in which 90% of scores are expected to fall", in which case it would be 1.64485 standard deviations (see the table of confidence intervals here). One very important factor however, is that as the number of seasons increases, the "margin of error" of the AVERAGE should decrease proportionally, therefore I believe the most statistically sound thing to use for a "margin_of_error" would be, something like "1.64485 standard deviations, divided by the number of seasons", or if we want a more stringent confidence interval of 95% or something, then 1.95996 instead of 1.64485, but either way, something of that form would be most statistically sound I believe. --Rednaxela 22:27, 1 October 2008 (UTC)
- Update: Actually I'll revise that. If this is correct, we should divide by the square root of the number of seasons, rather than the number of seasons. That will give error in the average score to whatever confidence interval we specify. --Rednaxela 23:49, 1 October 2008 (UTC)
- I've done my research on confidence intervals. The simplest explanation I found is here. Basically, we can assume the actual mean (given infinite battles) will be within t * stdDev / sqrt(numBattles), where t is taken from a t table (numBattles - 1 in the rows, 1 - confidenceLevel in the columns). But I'm having trouble deciding how to use this information to answer the question, "Am I 95% confident that version A scores better than version B?" I'm not convinced the answer is the same as, "Are their confidence intervals disjoint?" Does anyone have (or can anyone find) any insight? --Simonton 16:45, 3 October 2008 (UTC)
- Ahh yes, looks about the same as what I figured out, except that the "t" values offer some further refinement. Well, if the 95% confidence intervals don't overlap, then because each has a 95% chance of containing the true value, the chance of one score being better than the other would be at the very least 95.0625%. I calculated this based on a 95% chance of a given score being in its interval, a 2.5% chance of being higher, and a 2.5% chance of being lower, and summing the chances of all possibilities that guarantee the score really is better (the ones where the higher mean is in its interval or higher, and the lower one is in its interval or lower), hence 0.95*(0.95+0.025)+0.025*(0.95+0.025)=0.950625; possibilities that were ambiguous I didn't include, to be safe. One could calculate a more accurate probability of one score being better than another when the 95% confidence intervals don't overlap; it would depend on how far apart the intervals actually are. However, I can guarantee it will be at least 95.0625% in all cases where the intervals have 95% confidence. That make some sense? --Rednaxela 17:30, 3 October 2008 (UTC)
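The margin-of-error and overlap ideas from this thread can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name ScoreInterval and its methods are hypothetical and not part of RoboResearch, the t value is passed in by the caller (looked up in a t table for numBattles - 1 degrees of freedom) because the Java standard library has no t distribution, and the treatment of a zero margin is an arbitrary choice.

```java
/** Illustrative sketch; ScoreInterval is a hypothetical name, not part of RoboResearch. */
final class ScoreInterval {
    final double mean;
    final double margin;

    ScoreInterval(double mean, double margin) {
        this.mean = mean;
        this.margin = margin;
    }

    /**
     * Builds mean +- tValue * stdDev / sqrt(n) from per-season scores.
     * The caller looks tValue up in a t table for n - 1 degrees of freedom.
     */
    static ScoreInterval of(double[] scores, double tValue) {
        int n = scores.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two seasons");
        }
        double sum = 0;
        for (double s : scores) {
            sum += s;
        }
        double mean = sum / n;
        double squares = 0;
        for (double s : scores) {
            squares += (s - mean) * (s - mean);
        }
        double stdDev = Math.sqrt(squares / (n - 1)); // sample standard deviation
        return new ScoreInterval(mean, tValue * stdDev / Math.sqrt(n));
    }

    /** Width of the region shared by both intervals, or 0 if they are disjoint. */
    static double overlap(ScoreInterval a, ScoreInterval b) {
        double low = Math.max(a.mean - a.margin, b.mean - b.margin);
        double high = Math.min(a.mean + a.margin, b.mean + b.margin);
        return Math.max(0, high - low);
    }

    /** tied% = overlap / (2 * min(margin1, margin2)) * 100, as proposed in the thread. */
    static double tiedPercent(ScoreInterval a, ScoreInterval b) {
        double smaller = Math.min(a.margin, b.margin);
        if (smaller == 0) {
            return 0; // assumption: a zero-width interval is never reported as tied
        }
        return overlap(a, b) / (2 * smaller) * 100;
    }

    public static void main(String[] args) {
        // 4.303 is the two-sided 95% t value for 2 degrees of freedom (3 seasons).
        ScoreInterval a = ScoreInterval.of(new double[] {80, 82, 84}, 4.303);
        ScoreInterval b = ScoreInterval.of(new double[] {85, 86, 87}, 4.303);
        System.out.printf("A: %.2f +- %.2f%n", a.mean, a.margin);
        System.out.printf("B: %.2f +- %.2f%n", b.mean, b.margin);
        System.out.printf("tied: %.1f%%%n", tiedPercent(a, b));
    }
}
```

Note that disjoint intervals are a conservative test: two versions can differ with 95% confidence even when their individual intervals overlap slightly, which is the doubt Simonton raises above.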
Nice work on this lately! I just got Subclipse working again here and downloaded the latest version, however am having one problem: Trying to start the gui causes the following error:
Exception in thread "main" java.sql.SQLException: socket creation error
    at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
    at org.hsqldb.jdbc.jdbcConnection.<init>(Unknown Source)
    at org.hsqldb.jdbcDriver.getConnection(Unknown Source)
    at org.hsqldb.jdbcDriver.connect(Unknown Source)
    at java.sql.DriverManager.getConnection(DriverManager.java:620)
    at java.sql.DriverManager.getConnection(DriverManager.java:200)
    at roboResearch.engine.Database.<init>(Database.java:65)
    at roboResearch.GUI.<init>(GUI.java:44)
    at roboResearch.GUI.main(GUI.java:25)
--Rednaxela 18:22, 1 October 2008 (UTC)
Now how is it you compliment the latest work when you can't even use it? ;) You need to run the database in a separate process (until I update SVN this weekend). I think that's your problem. Instructions for that are in ... umm ... I think it gives you the command to run if you try executing TUI with no command line args ... or invalid args ... or something like that. --Simonton 18:33, 1 October 2008 (UTC)
Ah nice, it works now. By the way, one issue I just noticed is that it doesn't like it very much when you browse for a bot outside of that robocode_bots directory ;) --Rednaxela 18:41, 1 October 2008 (UTC)
Networking Proxies
I know how to use raw TCP/IP streams to write the proxies required to implement a networked RoboResearch, but I'm not sure that's the "best" solution. I know how to use Spring to make remote method calls, but it does not allow callbacks for things like listeners. Does anybody know of some other technology that would be appropriate to support the kind of remote communication necessary? I think I've heard RMI can do callbacks, so that might be one possibility. JMS might be another. I'm just not familiar enough with any of these to know their strengths and limitations. Something that supports file transfer for uploading/downloading bots would be ideal. Any thoughts/recommendations/input? --Simonton 16:38, 25 September 2008 (UTC)
Just an idea: perhaps for networking RoboResearch you can have the server issue work units based on the computer speed (maybe the user has to enter this? or possibly use some C code to detect CPU info). For instance, slower computers get challenges that complete faster (i.e. those without very slow bots), etc... --Starrynte 21:57, 15 November 2008 (UTC)
- Assuming that every server has a work queue, assign new work units to the server with the shortest queue (attempt to give servers queues of similar lengths). The fastest server would tend to have the shortest queue most often so it will be given a greater amount of work. -- Nfwu 05:59, 16 November 2008 (UTC)
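The shortest-queue idea above could look something like this in Java. It is a minimal in-memory sketch, assuming each machine's pending work is visible to the dispatcher; the names ShortestQueueDispatcher and Worker are hypothetical, and no actual networking (RMI, JMS or raw sockets) is shown.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Illustrative sketch; all names here are hypothetical, not part of RoboResearch. */
final class ShortestQueueDispatcher {
    /** A machine that runs battles, with the work units it has not finished yet. */
    static final class Worker {
        final String name;
        final Queue<String> pending = new ArrayDeque<>();

        Worker(String name) {
            this.name = name;
        }
    }

    private final List<Worker> workers;

    ShortestQueueDispatcher(List<Worker> workers) {
        if (workers.isEmpty()) {
            throw new IllegalArgumentException("need at least one worker");
        }
        this.workers = workers;
    }

    /** Gives the unit to the worker with the fewest pending units; the first one wins ties. */
    Worker assign(String workUnit) {
        Worker target = workers.get(0);
        for (Worker w : workers) {
            if (w.pending.size() < target.pending.size()) {
                target = w;
            }
        }
        target.pending.add(workUnit);
        return target;
    }

    public static void main(String[] args) {
        Worker fast = new Worker("fast");
        Worker slow = new Worker("slow");
        ShortestQueueDispatcher dispatcher =
                new ShortestQueueDispatcher(Arrays.asList(fast, slow));
        for (int i = 1; i <= 6; i++) {
            Worker chosen = dispatcher.assign("unit-" + i);
            System.out.println("unit-" + i + " -> " + chosen.name);
            if (i % 2 == 0) {
                fast.pending.poll(); // simulate the fast machine finishing a unit
            }
        }
    }
}
```

Because a faster machine drains its queue sooner, it ends up with the shortest queue more often and receives more units, without anyone having to enter or detect its CPU speed.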
Statistics
I just had an idea for an 'automatic' mode for choosing battles, which is potentially both quicker and more accurate than just running lots of seasons. It involves running the bot that has the highest margin of error the most, and running the bots that have lower margins of error for fewer seasons. Of course, several seasons will have to be run initially to determine which bots have higher margins of error. From here, after every battle the future opponents are sorted in order of margin of error (descending) and the bot at the top is run next. Thus, instead of specifying how many seasons should be run, one could specify the required margin of error. --Skilgannon 11:58, 2 October 2008 (UTC)
Sounds like a good way to do things, and it would give a more accurate overall challenge score quicker, except one possible consideration is that the desired margin of error for every enemy isn't necessarily the same. For example, for a strong surfer, HoF scores tend to be in the 99.90+ range. For it, the necessary margin of error to tell the difference between versions would be considerably smaller than the necessary margin of error for other bots. One thought I had is that "goal" margins for each bot could perhaps be set by comparing previous versions, looking at what the typical score difference between recent versions was. Or perhaps it would be better to only allow manual configuration of the margin of error for each bot? --Rednaxela 13:39, 2 October 2008 (UTC)
I think this is an excellent idea. As for specifying a different margin of error for each reference bot, I don't ever see myself using such a feature. For example against HoF I don't really care if I can score 99.99 instead of 99.95 - it's all the same to me. --Simonton 19:17, 2 October 2008 (UTC)
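The 'automatic' mode proposed above might be sketched like this in Java. It is a rough illustration, not RoboResearch code: AdaptiveScheduler and its methods are hypothetical names, a single target margin is used for every opponent, and the fixed factor 1.95996 (the 95% normal value quoted on this page) is a simplification, since a t value would be more appropriate with only a few seasons.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch; AdaptiveScheduler is a hypothetical name, not part of RoboResearch. */
final class AdaptiveScheduler {
    private static final double Z_95 = 1.95996; // assumption: normal approximation

    private final Map<String, List<Double>> scoresByOpponent = new LinkedHashMap<>();

    void addOpponent(String opponent) {
        scoresByOpponent.computeIfAbsent(opponent, k -> new ArrayList<>());
    }

    void record(String opponent, double score) {
        scoresByOpponent.computeIfAbsent(opponent, k -> new ArrayList<>()).add(score);
    }

    /** Z_95 * stdDev / sqrt(n); infinite below two scores so untested opponents run first. */
    double marginOfError(String opponent) {
        List<Double> scores = scoresByOpponent.get(opponent);
        if (scores == null || scores.size() < 2) {
            return Double.POSITIVE_INFINITY;
        }
        int n = scores.size();
        double sum = 0;
        for (double s : scores) {
            sum += s;
        }
        double mean = sum / n;
        double squares = 0;
        for (double s : scores) {
            squares += (s - mean) * (s - mean);
        }
        return Z_95 * Math.sqrt(squares / (n - 1)) / Math.sqrt(n);
    }

    /** The opponent with the largest margin above the target, or empty when all are done. */
    Optional<String> next(double targetMargin) {
        String worst = null;
        double worstMargin = targetMargin;
        for (String opponent : scoresByOpponent.keySet()) {
            double margin = marginOfError(opponent);
            if (margin > worstMargin) {
                worst = opponent;
                worstMargin = margin;
            }
        }
        return Optional.ofNullable(worst);
    }

    public static void main(String[] args) {
        AdaptiveScheduler scheduler = new AdaptiveScheduler();
        scheduler.record("steadyBot", 70.0);
        scheduler.record("steadyBot", 70.5);
        scheduler.record("erraticBot", 40.0);
        scheduler.record("erraticBot", 60.0);
        // With a target margin of 1.0, the noisier opponent is scheduled next.
        System.out.println(scheduler.next(1.0).orElse("nothing left to run"));
    }
}
```

A caller would loop: ask next(targetMargin), run one battle against that opponent, record the score, and stop once next returns empty. Per-opponent targets, as Rednaxela suggests, would only need targetMargin to become a lookup.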
Is Skilgannon's idea being implemented? It would be a very useful option to have! --Rsim 20:01, 2 September 2009 (UTC)
I don't think so, as Simonton hasn't been actively developing RoboResearch for quite a while. (I did exchange a couple e-mails with him semi-recently about the Melee support I coded up, though.) This seems like a fine idea, so maybe one of us could implement it - it shouldn't be too difficult. --Voidious 20:26, 2 September 2009 (UTC)
... and we should ask Simonton for the RoboResearch SVN access to commit too. Adding RoboResearch to the Robocode installation is on a TODO list of Robocode. » Nat | Talk » 01:43, 3 September 2009 (UTC)