User talk:Pedersen/kdTree

I don't have an application for a tree yet (though conceivably it would work with a gun if I bothered to develop weights for the various dimensions, after determining what those dimensions should be) but I figured it would be fun to whip one up. It turns out it was pretty quick to crank out, and after a little debugging I not only had blazing fast speeds at high volumes, faster by far than the samples that came with the benchmark at about 5000 data points, I found that at about that range I started dipping below 100% accuracy. I have spent most of today, longer than it took to write the first draft, trying to find out what is going wrong. As the day winds to a close, I've found that I can get it to bleed accuracy by setting the capacities really low, which will also help with debugging (in terms of diagramming out in code and by hand). When I set the dimensions to 2, neighbors to 10, and sample size to 50, along with a bucket size of 4, I caused it to hit 90% accuracy. My time is up today though, so I'll have to proceed with debugging another evening. --Martin 00:27, 6 March 2010 (UTC)

The fact that accuracy bleeds as bucket capacity goes down, suggests to me that something is wrong with the logic for when to descend a branch or not is broken but the logic for processing points within a leaf is fine. Chances are you'll see accuracy decrease much further if you increase the count of fetched neighbors while keeping bucket size constant. Given the results you've said, it's quite possible that it is never descending into 'second choice' paths ever in fact. --Rednaxela 01:10, 6 March 2010 (UTC)

It could be (also) using the nearest value to determine if the next path is viable, rather than the furthest in the current matching set. --Chase 02:06, 6 March 2010 (UTC)

I've suspected that the problem is when I split more than once in the same dimension, but that speculation hasn't borne out through trials. I know it has to do with the logic for talking alternate branches, since if I remove the conditional logic and force it (i.e. a full tree search) it gets 100% every time. --Martin 05:03, 6 March 2010 (UTC)

My hypothesis was that I could eliminate alternate paths by finding the distance between the reference point and the plane across the splitting point. The problem arose because rather than finding the simple distance in the same dimension between my reference point and the splitting plane, I was finding the distance to the exact point where the plane intersected all of the previous planes used to reach the present point. That was a bit wordy, so I'll give an example. Let's say our point is (7.5, 7.5). So far we have branched at 5 on the X axis and then 5 on the Y axis. The distance I should obtain is 2.5 (distance between (7.5,7.5) and (7.5,5.0)), i.e. the closest possible point on the other side of the splitting plane. Instead, I was finding the distance to (5.0,5.0). Many eligible values could have lain in the arc that I culled.
Getting the distance corrected has fixed the accuracy problem, but now I am walking a heck of a lot more tree than previously, so it is much slower. My own 'flat tree' search is outperforming it, and it only outperforms Simontons's, in terms of reference implementations in the benchmark. --Martin 21:58, 8 March 2010 (UTC)

Scratch that. I still had a bucket size of 4 from my stress testing. Putting the buckets to 40 (first test) put the performance halfway between Red's and Void's. So I'll have a contender yet, though I know Red has some stuff that hasn't been released yet. --Martin 22:07, 8 March 2010 (UTC)

Well, I posted the code, in case anyone wants to see my raw implementation. I started with Duyn's tutorial, at least as far as the Exemplar class and making the KDBucketTree class use generics, so that much at least will look familiar. There may also be other familiar parts, but I think at that after about 10 minutes I went off the reservation and ran with it. I made some tweaks to run with the benchmark tool, and then came a simple bug fix followed by the planar distance described above. Since then I also made a 5-way branching tree as a test but it didn't seem any better in terms of speed so it is back on the shelf. There's definitely room for improvement. For example I presently make no effort to re-balance the tree. When my sample exceeds my bucket capacity I sample all the data ranges and use the mean of the largest range. I avoid the infinite loop problem Red mentioned by not testing for re-splitting after a split. It will take a new element being added before testing for a split. Ultimately it may get a stack overflow error anyway. /shrug Another improvement could be to keep a list of all splitting points, find the squared planar distances from the control point, and processing them from nearest to farthest until the closest neighbor is closer than the closest plane. That would mean an end to walking the tree though. --Martin 01:05, 9 March 2010 (UTC)

public class KDBucketTreeComparator<T extends Exemplar> implements Comparator<KDBucketTree<T>> {

After figuring out how to define a Comparator for a generic class (above), I was explaining the design to a co-worker buddy and my design for "finding nearest neighbors by walking through a list of Leaves rather than walking the tree" fell apart for two reasons: 1) leaves don't have splitting planes, and 2) the splitting plane concept only applied to following the least favorable branch, which is lost if you aren't walking the tree. I'll still mull over the potential, but this is a setback. --Martin 21:44, 9 March 2010 (UTC)

I created another speedy-but-flawed implementation, getting in the 95-100% range at about a third of the KNN time of Void's at 100k records. It even beats his linear search by 1ms. It is just me, or are the flat searches faster than the tree searches for everyone else? I thought the trees were supposed to be a performance gain... But I digress. The new implementation still creates a tree but no longer walks it for finding nearest neighbors. Instead, it finds the distance to the nearest corner of every leaf, sorts them as a PriorityQueue, and walks through them until the closest neighbor is closer than the closest corner of the next leaf. --Martin 00:09, 10 March 2010 (UTC)

[View source↑]
[History↑]

You cannot post new threads to this discussion page because it has been protected from new threads, or you do not currently have permission to edit.

There are no threads on this page yet.

User talk:Pedersen/kdTree

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools