Difference between revisions of "Talk:Darkcanuck/RRServer/KnownIssues"
Latest revision as of 18:13, 31 October 2011
Survival count should total 35
I want to confirm what has already been mentioned: survival count should not necessarily total 35. It appears that when bots die on the same tick, they both get credited with a first place finish. --Simonton 22:49, 27 September 2008 (UTC)
- There was only one unmatching battle (17+14) in the over 5000 results I sent to this server. But indeed, 34 and 36 should be accepted too in my opinion. I think that battles against rambots (used to) have the most chance on getting these non-35 results. --GrubbmGait 23:16, 27 September 2008 (UTC)
The current check is that the survival totals at least 35, which should take ties into account. But the results that are getting stuck on my clients are for <35 1st place survivals. One client has 24-8 (32 total) for Cephalosporin vs UrChicken2. Two of my clients (not physically here, checked on them yesterday) had about a dozen battles involving RougeDC Classic, also with less than 34. What to do? --Darkcanuck 23:55, 27 September 2008 (UTC)
Well, could you try running the normal Robocode (non rumble) with pairings like the ones you're seeing this with in the rumble? If you can't reproduce it that way, then it sounds like it might be a bug in the rumble client, and if you can reproduce it, it should become evident what's happening by watching the battles when this happens (perhaps use the replay feature? Not sure how well that works though). I can't seem to see anything at all like that happening here. --Rednaxela 00:02, 28 September 2008 (UTC)
- I only see those clients once every few weeks, so it will be awhile. They're running 1.5.4 with the latest version of Java on Windows 2000. It's only ~12 battles out of 1500+, so a very low occurrence. If I turn off the survival check, you'll see them come through... --Darkcanuck 00:19, 28 September 2008 (UTC)
I would not accept results lower than 34 or higher than 40. This check is mainly to intercept results from wrongly set up clients, as those can be devastating for the rankings. A rare glitch of Robocode or the client can easily be repaired (when noticed). --GrubbmGait 01:16, 28 September 2008 (UTC)
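The range check proposed above can be sketched as follows. This is only an illustration of the 34-to-40 rule from the post, not the server's actual code; the function and constant names are hypothetical.

```python
# Hypothetical names; the bounds come from the proposal in the post above.
FIRST_PLACE_MIN = 34  # lowest combined first-place total to accept
FIRST_PLACE_MAX = 40  # highest combined first-place total to accept


def survival_total_ok(firsts_a: int, firsts_b: int) -> bool:
    """Return True when the combined first-place count is plausible for 35 rounds."""
    total = firsts_a + firsts_b
    return FIRST_PLACE_MIN <= total <= FIRST_PLACE_MAX


print(survival_total_ok(24, 8))   # the 32-total Cephalosporin vs UrChicken2 result
print(survival_total_ok(18, 18))  # 36: one round ended with both bots dying on the same tick
```

With these bounds the 24-8 result would still be rejected, while totals slightly above 35 caused by same-tick ties would pass.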
HTTP response code: 503
Hmm, I started getting a bunch of "HTTP response code: 503" errors when uploading results. Anyone know why? --Rednaxela 18:09, 2 December 2008 (UTC)
I've got the same problem, and sometimes also when downloading the rating page --lestofante 19:04, 2 December 2008 (UTC)
Yeah, also when the client tries to download the ratings here too. One thing I've noticed is that when it tries to download the rating pages, sometimes it's 503 and sometimes it's 500, but it's always 503 for uploading. Anyways right now I'm running my RR client with upload/download disabled in order to build up battles to upload in bulk when uploading is working again. --Rednaxela 21:06, 2 December 2008 (UTC)
- Hmm, the server does not look happy -- some sort of high load condition, I'll see what I can do. --Darkcanuck 04:04, 3 December 2008 (UTC)
ATTENTION! Someone has a bad participants list -- almost half the bots in the rumble have been removed! This is a very slow operation, especially when other clients are trying to add them back in. I've disabled bot removal for now while I investigate the source... --Darkcanuck 04:43, 3 December 2008 (UTC)
Mea culpa, yesterday I ran the rumble for at least 20 minutes 5 hour with an empty participants list before I saw, understood and fixed my client's problem... Next time I play with this option I will set the Upload flag to false! Now everything is running and I have over 2000 battles to upload --lestofante 11:29, 3 December 2008 (UTC)
- Ok, I've re-enabled your username + IP so you can upload again. 2000 battles, eh? Just make sure you're using one of the approved clients (1.5.4 or 1.6.0). Participant removals are still disabled just in case since the participant list can't change anyway.
Retired Bots
In details pages like this, bots that are 'retired' like Cunobelin 0.2.1 are showing up (see ELO score 0 bots on details pages). Did this accidentally happen when you disabled removal/retiring of bots temporarily? (Also, would it be possible to make details pages for retired bots viewable with the right URL? :)) --Rednaxela 16:52, 3 December 2008 (UTC)
- Thanks for pointing that out -- I was trying to figure out why the rumble pairings weren't going back to normal. I'm not quite sure how those bots got reactivated, need to think on that a bit. (Could be the economic crisis has affected their retirement plans?) It doesn't look like any rumble clients are trying to remove them though -- probably due to the failure to download the participants list. I'll see what I can do to remove them. --Darkcanuck 02:31, 4 December 2008 (UTC)
- Hm, I think it's a bug in the reactivation process... if both participants in a pairing have been retired, reactivating either also reactivates the pairing, ugh. --Darkcanuck 02:48, 4 December 2008 (UTC)
Performance
I'm uploading many battles I ran off-line, but the server is too slow. In 29 minutes I've uploaded only 262 battles, about 9 per minute!! Are you maybe updating the whole scoring system at every battle upload? --lestofante 13:30, 4 December 2008 (UTC)
The scoring is calculated for each upload yes. If it wasn't the results page and such wouldn't be nice and live and such. Apparently Darkcanuck is planning some things that may improve performance such as getting rid of table locking in favor of transactions, but really, I do think the roughly 2 seconds per battle that you describe is very acceptable. After all, running a battle takes considerably longer than 2 seconds in almost all cases, so it's not like it's that huge an impact on the total number of battles uploaded. --Rednaxela 13:47, 4 December 2008 (UTC)
I think the scoring system should be updated only when the scoring page or something similar is requested, so we update only WHAT we need WHEN we need it, saving a lot of time and resources. Pay attention: the upload delay is about 60 sec / 9 battles = 6 sec/battle, not 2... acceptable? Let's see.
- 1 execution of roborumble.sh = 1 iteration
- 1 iteration = 10 battles
- 6 sec/battle * 10 battles = 60 seconds of upload time per iteration
- 150-200 seconds = 1 iteration on an AMD64 X2 4000+, Ubuntu OS, OpenJDK as the virtual machine, UPLOAD=NOT, DOWNLOAD=NOT, plus roborumble loading time (measured with the Unix time command)
Result: upload time is about 1/4 of my rumble running time in the best case... Add the ratings download time plus the new-bot check, and the on-line rumble takes about DOUBLE the time of the off-line one (300-360 sec, measured with the Unix time command). IMHO that's not acceptable. If you have Unix (Macintosh is Unix) you can measure your execution time (I don't know how to do it on Windows). It's easy: time -p ./roborumble.sh; the result is the first line, expressed in seconds --lestofante 14:30, 4 December 2008 (UTC)
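The arithmetic in this post can be re-run as a small sketch. The inputs (262 battles in 29 minutes, 10 battles per iteration, 150-200 seconds of battle time per iteration) are the figures quoted above; the function names are hypothetical and the rounding is only for display.

```python
def seconds_per_battle(battles: int, minutes: float) -> float:
    """Average upload time per battle, in seconds."""
    return minutes * 60 / battles


def upload_share(upload_s: float, battle_s: float) -> float:
    """Fraction of one iteration's wall-clock time spent uploading."""
    return upload_s / (upload_s + battle_s)


per_battle = seconds_per_battle(262, 29)   # figures from the post
upload_per_iteration = per_battle * 10     # 10 battles per iteration
print(round(per_battle, 1), round(upload_per_iteration))

for battle_s in (150, 200):                # off-line iteration times from the post
    print(battle_s, round(upload_share(upload_per_iteration, battle_s), 2))
```

Under these numbers the upload takes somewhere between a quarter and a third of each iteration, which is in line with the "about 1/4 in the best case" estimate.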
Whoops, I read it as "262 in 9 minutes". Still 6 seconds is acceptable I think. Certainly far from ideal, but acceptable enough. I believe the real fix is making the RR client upload in the background while battles are running. And... er.... updating the scoring system when loading the scoring page would... often take a few minutes or longer if the scoring page hasn't been viewed in a while I believe. I'm quite sure updating it incrementally is a necessity. --Rednaxela 15:02, 4 December 2008 (UTC)
The background upload is a great idea! I will try to implement it now: a little Java program that grabs the 1vs1result.txt (roborumble must have UPLOAD=NOT), copies it into RAM, cleans the original, and uploads the copy from RAM in the background. We will see. --lestofante 15:59, 4 December 2008 (UTC)
- Haha, well, I kind of think it would be easier to just implement it as a patch to the normal roborumble client that just moves the upload process into another thread that runs in the background. Plus as a patch to the existing RR client, I think it's likely it would be adopted for the next official release of Robocode ;) --Rednaxela 18:11, 4 December 2008 (UTC)
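The idea of moving uploads into a background thread can be sketched roughly as below. This is not the RoboRumble client's code (which is Java); it is a minimal Python illustration of the pattern, and start_background_uploader and upload_one are hypothetical names. The upload is simulated by appending to a list.

```python
import queue
import threading


def start_background_uploader(upload_one):
    """Start a worker thread that uploads queued results; return (queue, thread)."""
    pending = queue.Queue()

    def worker():
        while True:
            result = pending.get()
            if result is None:  # sentinel: no more results will arrive
                break
            upload_one(result)  # slow network call happens off the battle thread

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return pending, thread


uploaded = []  # stand-in for the real server
pending, thread = start_background_uploader(uploaded.append)
for battle in ("battle-1", "battle-2", "battle-3"):
    pending.put(battle)  # returns immediately, so the next battle can start
pending.put(None)        # tell the worker to finish
thread.join()
print(uploaded)
```

A single worker keeps results in the order they were produced; the battle loop only pays the cost of a queue insert.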
I agree that 2 seconds would be nicer... Think of it as a break to let your processor cool off a bit? If you look elsewhere in these pages, scoring used to be batched but performance degraded dramatically as the database grew. It's faster (and more reliable) when scoring is done on each upload -- only the two bots uploaded get updated, but the Elo algorithm requires a lot of data to do so. Right now there's a scaling problem where many simultaneous uploads cause each one to slow down and I'm planning to address this soon. But if lots of clients are recovering from the past two days' problems (I took the server offline for several hours last night) then we have a higher volume of uploads right now too. --Darkcanuck 16:08, 4 December 2008 (UTC)
- I think most clients should be done recovering by now. I noticed the server refusing uploads for a bit last night and my client was easily back to normal long before I woke up. --Rednaxela 18:11, 4 December 2008 (UTC)
I've implemented Rednaxela's idea (see Talk:RoboRumble/Development#Background_Uploader), and in about 5 hours I've duplicated my month's upload... wow! And no side effects ^^" --lestofante 01:11, 5 December 2008 (UTC)
- Very cool! But can you do like Rednaxela suggested and make this another thread in the rumble client? You guys are determined to push my poor server to its limits... :) --Darkcanuck 01:29, 5 December 2008 (UTC)
- Well don't worry, I won't be using background upload on my laptop here... at least not until I get an external cooling fan to boost airflow... because if it's maintained at high CPU use with no chances to cool off... it gets up past 93C sometimes which I call iffy when the processor is only designed to operate up to 95C (and the builtin auto-shutoff limit is at 105C) --Rednaxela 03:11, 5 December 2008 (UTC)
- Another performance tip: I've had several hard drive failures/corruptions on machines I've used as clients which run continuously. My guess is that it's provoked by the constant disk writes caused by bots writing to their data directory at the end of each round. So I now have my robotcache dir mounted on a ramdisk -- it's much quieter and battles run faster too. But I do put in a nice delay in between iterations for cooling to keep the cpu under 80. --Darkcanuck 04:04, 5 December 2008 (UTC)
- Ok, look at Talk:RoboRumble/Development#Background_Uploader for the latest working version and usage instructions. Now I'm running (and uploading) 1 iteration of 10 battles in 150 seconds (p.s. total CPU use under 60% with amule + azureus + firefox + eclipse + compiz). Next I'm going to implement a patch for the 1.6.0 code; there is only one class to modify (UploaderResult), but my SVN client doesn't want to run, and manually importing Robocode's code into Eclipse gives me some path problems (help me please, see Talk:Robocode/Developers_Guide_for_building_Robocode) --lestofante 23:14, 5 December 2008 (UTC)
With the latest server revision it is no longer even necessary to run the script, because upload speed has increased considerably --lestofante 12:53, 6 December 2008 (UTC)
I hope this is the right place for this considering the time since the last post... Are there any ideas in the works to improve scalability of the server to accept more concurrent users submitting battle results? It works just fine when there are only the normal 2-5 clients working, and even works well when I add another 44 clients. However, the one time I unleashed 80 clients the upload process was very very slow for each individual client and slowed the response time of the web server greatly. Perhaps accepting uploads immediately and adding them to a queue to be processed? This would cause the rankings to no longer be real-time, however if you could accept 10+ battles per second then that point may be moot. I also understand due to participation and available time that this is likely not a high priority. --bwbaugh 12:44, 12 June 2011 (UTC)
While I believe that queueing the requests and using a daemon to process the results could help, there are also many problems that may arise. What if the daemon can't process the input fast enough? (That's likely to be the case here if you fire up all your clients.) The queue may grow to hundreds of thousands of requests, and that wouldn't help the situation. Also, queueing in the database adds more work for the disk, while if you cache in memory, it may run out quickly. I don't know if the server is currently running on a dedicated box (or VPS, for that matter) or cloud hosting, but I think the server is already stretched to its limit. The RoboRumble server is very database-write intensive, which makes it disk-intensive. (Replacing the disk with SSDs in RAID-0 might help, but that might make the CPU the bottleneck, and there is the cost of the SSDs.)
I am not sure, but Darkcanuck, if you replaced Apache with nginx or lighttpd (I prefer nginx) with php-fpm, would it work slightly faster due to the lower number of threads/processes? --Nat Pavasant 14:45, 12 June 2011 (UTC)
I suspect that the current bottleneck is actually the SQL server it's backed by. I know it does end up doing some fancy locking stuff. Personally, I think a *really* good way to improve scalability would be to change the upload protocol to allow multiple battles to be uploaded per HTTP request. This would make client uploads faster, AND also allow the server to pool multiple battle updates into a single SQL statement. --Rednaxela 00:09, 13 June 2011 (UTC)
From my knowledge of the RRServer from the day it was open sourced (I haven't checked the code recently), each request requires multiple SQL statements to fetch and update battle results, rankings, etc. IMO, it may improve things, but I am not really sure about it. Since each script may run longer, there is more chance of memory leaks (and this is internal to PHP, not the script; one of my PHP-based daemon scripts suffers from a PHP memory leak and shuts down every two days, though I am unable to find the real cause).
My opinion on redesigning the protocol: I believe we should drop HTTP because of its cumbersome headers. A simple protocol where the client connects and sends results line by line would be easier and faster (though we could no longer use a web server to serve it, many other servers would be easy to implement).
After all, Fnl said RoboRumble is one of the dirtiest corners of Robocode... --Nat Pavasant 11:00, 13 June 2011 (UTC)
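Pooling several battle results into one SQL statement could look roughly like the sketch below. It uses Python's built-in sqlite3 with an in-memory database purely as a stand-in; the table name, column names, bot names and scores are all hypothetical and do not reflect the real RRServer schema or data.

```python
import sqlite3


def insert_battles(conn, battles):
    """Insert every battle in the batch with a single multi-row INSERT."""
    if not battles:
        return
    placeholders = ", ".join(["(?, ?, ?, ?)"] * len(battles))
    flat = [value for battle in battles for value in battle]
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO battle_results (bot_a, bot_b, score_a, score_b) VALUES "
            + placeholders,
            flat,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE battle_results "
    "(bot_a TEXT, bot_b TEXT, score_a INTEGER, score_b INTEGER)"
)
batch = [  # made-up results, as one client request might carry them
    ("BotA 1.0", "BotB 2.1", 3150, 2875),
    ("BotA 1.0", "BotC 0.3", 4020, 1990),
    ("BotB 2.1", "BotC 0.3", 3600, 2400),
]
insert_battles(conn, batch)
row_count = conn.execute("SELECT COUNT(*) FROM battle_results").fetchone()[0]
print(row_count)
```

The gain comes from paying the per-statement and per-transaction overhead once per batch instead of once per battle; the ranking updates that follow each result would need their own batching and are not shown here.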
Any changes or new thoughts about this during the last 4.5 months? I drool at the idea of releasing all the processing power available to me during downtime, which I conservatively estimate to be at least 1,000 battles per minute (or 2.88 million battles per weekend) instead of the ~180 or so battles per minute I've got going now. The server can queue all the battles from the weekend and process them during the week! *wink* ... just kidding ... sort of. --bwbaugh 07:07, 31 October 2011 (UTC)
- While I don't have anything particularly useful to add, I must say that I bow to your processing power. With your backing, I am less likely to annoy everyone else by releasing new versions of my robot almost every day. :-D -- Skotty 17:13, 31 October 2011 (UTC)
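The weekend estimate in bwbaugh's post can be checked with a few lines. The 48-hour weekend is an assumption made here to reproduce the 2.88 million figure; the function name is hypothetical.

```python
MINUTES_PER_HOUR = 60
WEEKEND_HOURS = 48  # assumption: a full Saturday plus Sunday


def battles_per_weekend(battles_per_minute: int) -> int:
    """Battles produced over one weekend at a constant per-minute rate."""
    return battles_per_minute * MINUTES_PER_HOUR * WEEKEND_HOURS


print(battles_per_weekend(1000))  # 2880000, the 2.88 million quoted above
print(battles_per_weekend(180))   # 518400 at the current ~180 per minute
```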
Proofing against add/remove wars
I was just thinking, in order to proof against add/remove wars in the future, maybe it would be good to make the server check the participants list itself to verify the addition/removal? It may make addition/removal of a bot to/from the rumble slightly slower, but those two operations are rare under normal conditions. --Rednaxela 22:54, 4 December 2008 (UTC)
- I was thinking along similar lines too, but I would make it a periodic check. Say every hour, the server gets the new participant list and does the adds/removes directly, preventing clients from doing so. Also, the list could be obtained from the server so we can easily point to new locations or handle cases when the wiki is unavailable. --Darkcanuck 01:29, 5 December 2008 (UTC)
- Maybe have it also double-check the participants list again whenever a client sends results for a new bot? I'm saying that just because I think it's nice to be able to see your bots show up in the results as soon as the first battles are uploaded. --Rednaxela 03:11, 5 December 2008 (UTC)
- True. This feature is not high on my list yet -- I want to fix the locking first, then look at the details comparison stuff (including retired bot details). --Darkcanuck 04:04, 5 December 2008 (UTC)
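A server-side check of this kind boils down to comparing two sets. The sketch below is only an illustration under assumptions: the function name and bot names are hypothetical, and the guard against an empty list is an added safety measure suggested by the empty-participants-list incident described earlier on this page, not something the posts above specify.

```python
def plan_participant_changes(published, active):
    """Return (bots to add, bots to remove) given the published list and the active bots."""
    wanted = set(published)
    if not wanted:
        # An empty download would otherwise retire every bot in the rumble.
        raise ValueError("empty participants list; refusing to remove every bot")
    current = set(active)
    return sorted(wanted - current), sorted(current - wanted)


to_add, to_remove = plan_participant_changes(
    ["BotA 1.0", "BotB 2.1", "BotC 0.3"],  # list fetched from the wiki (made up)
    ["BotA 1.0", "BotB 2.0"],              # bots currently active on the server (made up)
)
print(to_add)     # ['BotB 2.1', 'BotC 0.3']
print(to_remove)  # ['BotB 2.0']
```

The same function could be run on a timer or whenever a client uploads results for an unknown bot, which covers both suggestions in the thread.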
Sort order on "Glicko-2 (RD)"
Really minor thing, but I noticed the "Glicko-2 (RD)" column doesn't sort correctly - eg, for descending, all 3 digit numbers list before all 4 digit numbers. Super minor, but figure it's worth letting ya know. Rankings page looks great, btw, nice work. --Voidious 00:02, 5 January 2009 (UTC)
Just to note, a fair number of columns are like that not just the Glicko-2 one. Also in the bot details pages. --Rednaxela 00:42, 5 January 2009 (UTC)
- Thanks for the heads-up. Looks like tablesorter gets confused by the parentheses around the RD value and possibly the class attributes for high/low scores... --Darkcanuck 05:49, 5 January 2009 (UTC)
Just a reminder about this ;-) It seems all the columns are being sorted as text, not numbers. For instance, when sorting highest to lowest, 900 comes before 2000. --Skilgannon 16:00, 30 April 2009 (UTC)
- Ah, good catch. I fixed this problem on every table *but* the main rankings during the last update. Should work for the main tables now too. --Darkcanuck 03:43, 1 May 2009 (UTC)
- Thanks =) It still seems to sort by text on the Glicko-2 column though... I'm not sure how easy the fix would be because of the RD value. Can you set custom comparators? Or is this nothing like Java? :-) --Skilgannon 10:31, 1 May 2009 (UTC)
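The text-versus-number sorting problem, and the custom comparator Skilgannon asks about, can be illustrated with a sort key that reads only the leading number of a cell. The rankings page itself uses a JavaScript table sorter, so this Python sketch only shows the idea; the cell format "rating (RD)" and the sample values are assumptions.

```python
import re

# Matches an optional sign and a decimal number at the start of the cell.
_LEADING_NUMBER = re.compile(r"\s*(-?\d+(?:\.\d+)?)")


def numeric_key(cell: str) -> float:
    """Sort key: the leading number in the cell, ignoring a trailing '(RD)' part."""
    match = _LEADING_NUMBER.match(cell)
    return float(match.group(1)) if match else float("-inf")


cells = ["900 (40)", "2000 (35)", "1500.5 (60)"]  # made-up Glicko-2 (RD) cells
text_order = sorted(cells, reverse=True)
number_order = sorted(cells, key=numeric_key, reverse=True)
print(text_order)    # ['900 (40)', '2000 (35)', '1500.5 (60)']
print(number_order)  # ['2000 (35)', '1500.5 (60)', '900 (40)']
```

Sorting as text compares character by character, so "9" beats "2" and 900 lands above 2000; extracting the number first restores the expected order.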
Dropping of ELO and Glicko rating
I don't know where the discussion was located, so I'm posting it here. Maybe the drop of ELO and Glicko rating is due to imprecision in the calculation? Because your current system and DB only store them in an INT(5) field, not a floating-point field? --Nat Pavasant 15:01, 25 May 2010 (UTC)
Not yet. I'm still trying to 2k it :) --Miked0801 17:18, 12 June 2011 (UTC)
- It's all arbitrary, but in my opinion, LBB is already closer to 2100 relative to the NanoBot division. =) Or 2200... --Voidious 17:27, 12 June 2011 (UTC)
Battles number cap in rankings page
I was browsing the melee General ranking page and noticed some bots have 65535 battles. --MN 01:43, 19 June 2011 (UTC)
- Good catch! I will try to find some time to shut the server down (briefly) and update the database. The problem is only in the rankings table so after it's fixed the true battle count will get updated as each bot gets another battle. --Darkcanuck 13:36, 19 July 2011 (UTC)
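The number 65535 is the largest value an unsigned 16-bit integer can hold, which suggests (as an assumption; the column type is not stated in this thread) that the battle count in the rankings table was stored in such a field. The sketch below shows how a counter stuck at that ceiling behaves when values are clamped; the function name is hypothetical, and whether a database clamps or rejects out-of-range values depends on its configuration.

```python
UINT16_MAX = 2**16 - 1  # 65535


def store_unsigned_16(value: int) -> int:
    """Clamp a value into the range an unsigned 16-bit column can represent."""
    return max(0, min(value, UINT16_MAX))


print(UINT16_MAX)                 # 65535
print(store_unsigned_16(65000))   # 65000: still fits
print(store_unsigned_16(70123))   # 65535: stuck at the cap
```

Widening the column removes the ceiling, and the stored count catches up the next time the bot's row is rewritten, which matches the fix described in the reply above.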
Twin Duel Matchmaking Issue?
Does anyone know why the same 4 bots in the twin duel consistently get the VAST majority of battles fought between them? I have run 15k+ twin duel matches, and almost every one (but not all) were between those four teams. I thought it might be because I don't have the other bots, but since at least 1 or 2 matches were in fact run with the others, I don't know what the issue is. --bwbaugh 10:14, 31 October 2011 (UTC)