ThreadDeath problem and large amount of skipped turns

Jump to navigation Jump to search

ThreadDeath problem and large amount of skipped turns

This continuation of problem investigation mentioned at Thread:Talk:RoboRumble/Client java version and Thread:Talk:RoboRumble/ThreadDeath problem.

After Skilgannon patch, I can see some useful debugging info. In particular, I added output for the number of skipped turns which triggers ThreadDeath call.

I see that as many as 100 or even more skips are detected. This is when TPS slider at max. Move it to 1000 and everything is handy dandy. Note that 1000 TPS is unrealistic with my CPU constant of 6 ms. I.e. per second I can get 1/0.006 about 170 tics. So the system have no time to go to the idle state in either case. On top of it, 160 skipped turns (times 6ms), i.e. about 1 seconds of inactivity, would be noticeable with a naked eye watching the battle. I see no freeze there. So something does not add up.

Here I need the experts help. I look at the source, and the only timing mechanism I see is that within allocated time the thread must report isSleeping. But I see no time checks anywhere, i.e. no one calls for nanoTime. So how the robocode decides that the timeout is exceeded?

Could it be that java calls to robocode waitSleeping are not done correctly.

Here is one more strange thing. I switched debug=true, than I have about 300*cpuConstant to finish a tic, and I time profiled everything. My time per tic is consistently shorter with debug=true, then with debug=false (default). I know it probabilistic but ... Also, in either case my inner methods never report times exceeding 25mS, which is 5 skipped ticks at most. So how do I end up with 100 skipped tics penalty.

Also, I cannot get to ThreadDeath when I play against HawkOnFire. This is very simple and fast bot, but how does it help my bot to avoid the skipped turn penalty?

Beaming (talk)19:38, 7 September 2017

It's interesting, I actually see considerable speed difference between running at 1000TPS and max, even though my CPU constant is 5.3ms. I think this is because many ticks have much less than 1ms of calculations, and these cause a sleep in 1000TPS mode. Also, just to confirm, I also only see the ThreadDeath issues when running at max, and never at 1000TPS.

From what I remember the engine works something like this:

  1. Engine triggers condition variable to release bot thread
  2. Engine goes to sleep for up to CPU_CONSTANT milliseconds
  3. Bot thread runs
  4. Bot thread finishes, triggers interrupts to wake up engine thread early
  5. Engine thread wakes up, and checks time. If time exceeded, divide time taken by CPU_CONSTANT milliseconds, subtract 1, say bot skipped this many turns, if exceeds 30 kill bot. Otherwise continue as normal.

So I'm wondering if maybe there is another thread that is delaying the engine thread from waking up, perhaps to do some maintenance (garbage collection, JIT compiler swapping out methods with optimized versions, etc). There are probably heuristics that say it is a good time to do maintenance, when switching between threads, if nothing else is available. When in 1000TPS mode these actions would happen when the sleep happens (ie. nicely scheduled, not interfering with bot timing etc), since the heuristics say it is better to work in a sleep than to not switch between threads quickly.

In my mind, the easiest way to fix this would be by also doing timing in the bot thread, and only checking the timing in the engine thread if the bot thread hasn't finished yet.

Thoughts? Does anybody see issues with my thinking here?

Skilgannon (talk)20:54, 7 September 2017

It does not look that item 5 is performed the way which would be reasonable (i.e. how you describe).

Have a look checkSkippedTurn() where decision about penalty is done (I believe it is actually your code :). It does not check CPU time, it makes comparison based on the internal robocode Ticks.

 int numSkippedTurns = (currentExecutionTime - lastExecutionTime) - 1;

Robocode should call something to increase time (tic) inside of robot peer. If it does not do so for a bot, that bot will be punished. I still cannot find the part of the code where time++ logic is executed. These threads drive me nuts.

So, I like your proposal to time the bot inside its thread.

Beaming (talk)21:45, 7 September 2017

Hmm, there is still a problem with my idea, robocode will still kill the thread if it doesn't respond in time.

I see two different ways to combat this:

  1. Put a turnState enum with states {START, RUNNING, FINISHED} which can be polled to know if the bot thread is finished, combine this with timing in the bot thread, and then if it is finished we know it is not the bot causing the delay, so we don't kill the thread.
  2. Put some small sleeps every ~100 ticks to give the JVM time to perform optimizations and cleanup without interfering with the bot threads.

These could be combined, since they attack the problem from different sides.

Skilgannon (talk)22:20, 7 September 2017