User talk:Pedersen/kdTree
I don't have a application for a tree yet (though conceivably it would work with a gun if I bothered to develop weights for the various dimensions, after determining what those dimensions should be) but I figured it would be fun to whip one up. It turns out it was pretty quick to crank out, and after a little debugging I not only had blazing fast speeds at high volumes, faster by far than the samples that came with the benchmark at about 5000 data points, I found that at about that range I started dipping below 100% accuracy. I have spent most of today, longer than it took to write the first draft, trying to find out what is going wrong. As the day winds to a close, I've found that I can get it to bleed accuracy by setting the capacities really low, which will also help with debugging (in terms of diagramming out in code and by hand). When I set the dimensions to 2, neighbors to 10, and sample size to 50, along with a bucket size of 4, I caused it to hit 90% accuracy. My time is up today though, so I'll have to proceed with debugging another evening. --Martin 00:27, 6 March 2010 (UTC)
The fact that accuracy bleeds as bucket capacity goes down, suggests to me that something is wrong with the logic for when to descend a branch or not is broken but the logic for processing points within a leaf is fine. Chances are you'll see accuracy decrease much further if you increase the count of fetched neighbors while keeping bucket size constant. Given the results you've said, it's quite possible that it is never descending into 'second choice' paths ever in fact. --Rednaxela 01:10, 6 March 2010 (UTC)
- It could be (also) using the nearest value to determine if the next path is viable, rather than the furthest in the current matching set. --Chase 02:06, 6 March 2010 (UTC)
I've suspected that the problem is when I split more than once in the same dimension, but that speculation hasn't borne out through trials. I know it has to do with the logic for talking alternate branches, since if I remove the conditional logic and force it (i.e. a full tree search) it gets 100% every time. --Martin 05:03, 6 March 2010 (UTC)
My hypothesis was that I could eliminate alternate paths by finding the distance between the reference point and the plane across the splitting point. The problem arose because rather than finding the simple distance in the same dimension between my reference point and the splitting plane, I was finding the distance to the exact point where the plane intersected all of the previous planes used to reach the present point. That was a bit wordy, so I'll give an example. Let's say our point is (7.5, 7.5). So far we have branched at 5 on the X axis and then 5 on the Y axis. The distance I should obtain is 2.5 (distance between (7.5,7.5) and (7.5,5.0), i.e. the closest possible point on the other side of the splitting plane. Instead, I was finding the distance to (5.0,5.0). Many eligible values could have lain in the arc that I culled.
Getting the distance corrected has fixed the accuracy problem, but now I am walking a heck of a lot more tree than previously, so it is much slower. My own 'flat tree' search is outperforming it, and it only outperforms Simontons's, in terms of reference implementations in the benchmark. --Martin 21:58, 8 March 2010 (UTC)
- Scratch that. I still had a bucket size of 4 from my stress testing. Putting the buckets to 40 (first test) put the performance halfway between Red's and Void's. So I'll have a contender yet, though I know Red has some stuff that hasn't been released yet. --Martin 22:07, 8 March 2010 (UTC)