User talk:Pedersen/kdTree
I don't have an application for a tree yet (though conceivably it would work with a gun if I bothered to develop weights for the various dimensions, after determining what those dimensions should be) but I figured it would be fun to whip one up. It turns out it was pretty quick to crank out, and after a little debugging I not only had blazing fast speeds at high volumes, faster by far than the samples that came with the benchmark at about 5000 data points, I found that at about that range I started dipping below 100% accuracy. I have spent most of today, longer than it took to write the first draft, trying to find out what is going wrong. As the day winds to a close, I've found that I can get it to bleed accuracy by setting the capacities really low, which will also help with debugging (in terms of diagramming out in code and by hand). When I set the dimensions to 2, neighbors to 10, and sample size to 50, along with a bucket size of 4, I caused it to hit 90% accuracy. My time is up today though, so I'll have to proceed with debugging another evening. --Martin 00:27, 6 March 2010 (UTC)
The fact that accuracy bleeds as bucket capacity goes down, suggests to me that something is wrong with the logic for when to descend a branch or not is broken but the logic for processing points within a leaf is fine. Chances are you'll see accuracy decrease much further if you increase the count of fetched neighbors while keeping bucket size constant. Given the results you've said, it's quite possible that it is never descending into 'second choice' paths ever in fact. --Rednaxela 01:10, 6 March 2010 (UTC)
- It could be (also) using the nearest value to determine if the next path is viable, rather than the furthest in the current matching set. --Chase 02:06, 6 March 2010 (UTC)
I've suspected that the problem is when I split more than once in the same dimension, but that speculation hasn't borne out through trials. I know it has to do with the logic for talking alternate branches, since if I remove the conditional logic and force it (i.e. a full tree search) it gets 100% every time. --Martin 05:03, 6 March 2010 (UTC)
My hypothesis was that I could eliminate alternate paths by finding the distance between the reference point and the plane across the splitting point. The problem arose because rather than finding the simple distance in the same dimension between my reference point and the splitting plane, I was finding the distance to the exact point where the plane intersected all of the previous planes used to reach the present point. That was a bit wordy, so I'll give an example. Let's say our point is (7.5, 7.5). So far we have branched at 5 on the X axis and then 5 on the Y axis. The distance I should obtain is 2.5 (distance between (7.5,7.5) and (7.5,5.0)), i.e. the closest possible point on the other side of the splitting plane. Instead, I was finding the distance to (5.0,5.0). Many eligible values could have lain in the arc that I culled.
Getting the distance corrected has fixed the accuracy problem, but now I am walking a heck of a lot more tree than previously, so it is much slower. My own 'flat tree' search is outperforming it, and it only outperforms Simontons's, in terms of reference implementations in the benchmark. --Martin 21:58, 8 March 2010 (UTC)
- Scratch that. I still had a bucket size of 4 from my stress testing. Putting the buckets to 40 (first test) put the performance halfway between Red's and Void's. So I'll have a contender yet, though I know Red has some stuff that hasn't been released yet. --Martin 22:07, 8 March 2010 (UTC)
Well, I posted the code, in case anyone wants to see my raw implementation. I started with Duyn's tutorial, at least as far as the Exemplar class and making the KDBucketTree class use generics, so that much at least will look familiar. There may also be other familiar parts, but I think at that after about 10 minutes I went off the reservation and ran with it. I made some tweaks to run with the benchmark tool, and then came a simple bug fix followed by the planar distance described above. Since then I also made a 5-way branching tree as a test but it didn't seem any better in terms of speed so it is back on the shelf. There's definitely room for improvement. For example I presently make no effort to re-balance the tree. When my sample exceeds my bucket capacity I sample all the data ranges and use the mean of the largest range. I avoid the infinite loop problem Red mentioned by not testing for re-splitting after a split. It will take a new element being added before testing for a split. Ultimately it may get a stack overflow error anyway. /shrug Another improvement could be to keep a list of all splitting points, find the squared planar distances from the control point, and processing them from nearest to farthest until the closest neighbor is closer than the closest plane. That would mean an end to walking the tree though. --Martin 01:05, 9 March 2010 (UTC)
public class KDBucketTreeComparator<T extends Exemplar> implements Comparator<KDBucketTree<T>> {
After figuring out how to define a Comparator for a generic class (above), I was explaining the design to a co-worker buddy and my design for "finding nearest neighbors by walking through a list of Leaves rather than walking the tree" fell apart for two reasons: 1) leaves don't have splitting planes, and 2) the splitting plane concept only applied to following the least favorable branch, which is lost if you aren't walking the tree. I'll still mull over the potential, but this is a setback. --Martin 21:44, 9 March 2010 (UTC)
- I created another speedy-but-flawed implementation, getting in the 95-100% range at about a third of the KNN time of Void's at 100k records. It even beats his linear search by 1ms. It is just me, or are the flat searches faster than the tree searches for everyone else? I thought the trees were supposed to be a performance gain... But I digress. The new implementation still creates a tree but no longer walks it for finding nearest neighbors. Instead, it finds the distance to the nearest corner of every leaf, sorts them as a PriorityQueue, and walks through them until the closest neighbor is closer than the closest corner of the next leaf. --Martin 00:09, 10 March 2010 (UTC)
- Well, the fix was pretty simple. Turns out that PriotityQueue.iterator() is useless, so you can't do a modern for-loop. A minor refactor and the implementation is back on its feet and faster than the fastest benchmark tree, but I really think it should be even faster, so I'll mosey over to the recent post by Duyn about PriorityQueue performance. --Martin 00:38, 10 March 2010 (UTC)
- Fun story: Before Shadow switched to my tree, it was victim to this actually! PriorityQueue.iterator() (as used by Simonton's tree (which is what Shadow used to use)) returns the output in heap ordering (contrary to what ABC had assumed), the same type of ordering as what my currently released tree does if you tell it not to sort. Funny enough, Shadow still did quite well despite this bug. That behavior of PriorityQueue was mentioned in the talk page of my tree I believe. --Rednaxela 01:23, 10 March 2010 (UTC)
- Well, the fix was pretty simple. Turns out that PriotityQueue.iterator() is useless, so you can't do a modern for-loop. A minor refactor and the implementation is back on its feet and faster than the fastest benchmark tree, but I really think it should be even faster, so I'll mosey over to the recent post by Duyn about PriorityQueue performance. --Martin 00:38, 10 March 2010 (UTC)
- Haven't played with this stuff in a while, but last I did, my tree outperformed my linear search by maybe the 10-15k mark? Certainly by 25k (end of a normal battle), and by leaps and bounds beyond that. And Red's is several times faster than mine. At 100k my tree is slower than linear? How many dimensions/what cluster size? Trees do worse vs linear as those go up... I was usually testing ~10 dimensions and 30-50 cluster size. --Voidious 00:41, 10 March 2010 (UTC)
- I've been testing with 13 dimensions, at 5k, 50k, and 100k data points. Beyond that I blow out the memory. Your flat search is always run first and is always in the lead at the end. Maybe it is just a quirk of my work computer (Intel Dual CPU @ 2.66GHz, JRE 1.6.0_01). It consistently beats the other benchmark implementations. I don't know how old those examples are. It is rare that I can outperform it without a bug in my implementation. --Martin 19:25, 10 March 2010 (UTC)
- That... doesn't really make any sense. My tree in the example is about 25% to 50% slower than my currently posted one, however tests here show my tree being close to 30 times faster than Voidious' linear search. I doubt such a massive difference could be accounted for by a quirk of the computer. I'd be curious what your results would look like with a graph like I posted in the kd-tree talk page. Should I upload the updated version of the benchmark I have here (Has CSV export for that type of graph, updated version of my tree, Chase-San's new tree, and Duyn's tree)? I'm also curious whether you would have the same results for the non-random data set. --Rednaxela 21:01, 10 March 2010 (UTC)
- An update would be handy, for other reasons as well. As an example, I had to tweak the took to make use of the random number generator, as I wasn't aware of any sample file to use. There have been some other tweaks I did, but I doubt they were necessary, such as removing the dependence on execution parameters. I've always run it through Eclipse anyway. Perhaps that is part of the problem. If you prefer, you can email the benchmark to me directly, 'martinalanpedersen' by way of gmail. --Martin 22:03, 10 March 2010 (UTC)
- Here's an update to KNN.jar, with new/updated trees, csv output for graphing, and when started without arguments works in an interactive mode asking the user for the parameters (including support for optionally downloading the standard data file normally used with it). :) --Rednaxela 05:31, 11 March 2010 (UTC)
- An update would be handy, for other reasons as well. As an example, I had to tweak the took to make use of the random number generator, as I wasn't aware of any sample file to use. There have been some other tweaks I did, but I doubt they were necessary, such as removing the dependence on execution parameters. I've always run it through Eclipse anyway. Perhaps that is part of the problem. If you prefer, you can email the benchmark to me directly, 'martinalanpedersen' by way of gmail. --Martin 22:03, 10 March 2010 (UTC)
- That... doesn't really make any sense. My tree in the example is about 25% to 50% slower than my currently posted one, however tests here show my tree being close to 30 times faster than Voidious' linear search. I doubt such a massive difference could be accounted for by a quirk of the computer. I'd be curious what your results would look like with a graph like I posted in the kd-tree talk page. Should I upload the updated version of the benchmark I have here (Has CSV export for that type of graph, updated version of my tree, Chase-San's new tree, and Duyn's tree)? I'm also curious whether you would have the same results for the non-random data set. --Rednaxela 21:01, 10 March 2010 (UTC)
- I've been testing with 13 dimensions, at 5k, 50k, and 100k data points. Beyond that I blow out the memory. Your flat search is always run first and is always in the lead at the end. Maybe it is just a quirk of my work computer (Intel Dual CPU @ 2.66GHz, JRE 1.6.0_01). It consistently beats the other benchmark implementations. I don't know how old those examples are. It is rare that I can outperform it without a bug in my implementation. --Martin 19:25, 10 March 2010 (UTC)
- Yeah, and actually, if the tree is optimized well, it's speed can be indistinguishable from a well optimized linear search, even with almost no points. Check out the charts on Talk:Kd-tree to see how performance of various searches varies over the course of the battle. Even extremely early, the best few trees perform no worse than linear search right from the start. Actually, Voidious's tree is the only one significantly slower than linear search at low numbers of points. --Rednaxela 01:23, 10 March 2010 (UTC)
- You mean Simonton's right, because Voidious' does about the same as the rest, where as Simonton's is pretty slow according to the charts. --Chase 14:56, 10 March 2010 (UTC)
- #1 Voidious' Linear search [6.5008] - #2 Pedersen's KDBucketFlat [10.2425] - #3 Pedersen's KDBucketTree [13.9054] - #4 Voidious' Bucket PR k-d tree [16.6113]
I haven't tried the new benchmark yet, but I tried building the .jar file and running it in a DOS window and I get the same results. I also tried running my own flat search first so see if that gave a speed boost but the results were the same. It must be a quirk of this work computer, though I didn't muster the interest last night to test it at home. I'll check out the new benchmark. --Martin 18:59, 11 March 2010 (UTC)
Those results for my linear/tree wouldn't surprise me for 13 dimensions. I ran a quick test the other day, and for 8 dimensions, the tree became best around 15k; for 10, they stayed even through at least 100k. That was for 30-50 nearest neighbors and random data. That said, I'd expect Red's trees to fare much, much better... --Voidious 19:19, 11 March 2010 (UTC)
I find those results extremely surprising for 13 dimensions. I'm seeing the Voidious's tree becoming better at 13 dimensions, at under 5k results actually. I'm curious, could you post a graph like I've made from your computer? It's just a matter of running the latest version of KNN.jar, and telling it to output a CSV file for graphing. I get things looking like that only when I jack the dimension count up to 30 or higher. --Rednaxela 19:37, 11 March 2010 (UTC)
Ahhh, I ran the test with a variety of settings and I have the following interesting comparison:
random points, 10 dimensions, 40 nearest neighbors, 20k data points, - #1 Voidious' Linear search [0.9770] - #2 Rednaxela's kd-tree (3rd gen) [1.0578] - #3 Rednaxela's kd-tree (2nd gen) [1.1071] - #4 Chase-san's kd-tree (KDTreeC) [1.4763] - #5 duyn's Bucket kd-tree [1.5285] - #6 Voidious' Bucket PR k-d tree [1.8564] - #7 Simonton's Bucket PR k-d tree [2.0622] diamond's gun data sample, 13 dimensions (12 actually in use), 40 nearest neighbors, 25k data points, - #1 Rednaxela's kd-tree (2nd gen) [0.1075] - #2 Rednaxela's kd-tree (3rd gen) [0.1150] - #3 duyn's Bucket kd-tree [0.2217] - #4 Chase-san's kd-tree (KDTreeC) [0.3201] - #5 Simonton's Bucket PR k-d tree [0.4566] - #6 Voidious' Bucket PR k-d tree [0.5898] - #7 Voidious' Linear search [1.4292]
In other words, the distribution of the data matters a lot. --Rednaxela 20:01, 11 March 2010 (UTC)
Also, for smaller numbers of nearest neighbors:
random points, 10 dimensions, 1 nearest neighbors, 20k data points, - #1 Rednaxela's kd-tree (3rd gen) [0.2191] - #2 Rednaxela's kd-tree (2nd gen) [0.2317] - #3 duyn's Bucket kd-tree [0.2708] - #4 Chase-san's kd-tree (KDTreeC) [0.3963] - #5 Simonton's Bucket PR k-d tree [0.4126] - #6 Voidious' Bucket PR k-d tree [0.6314] - #7 Voidious' Linear search [0.8487] random points, 10 dimensions, 10 nearest neighbors, 20k data points, - #1 Rednaxela's kd-tree (3rd gen) [0.5733] - #2 Rednaxela's kd-tree (2nd gen) [0.6131] - #3 duyn's Bucket kd-tree [0.7744] - #4 Voidious' Linear search [0.8725] - #5 Chase-san's kd-tree (KDTreeC) [0.9344] - #6 Simonton's Bucket PR k-d tree [1.1397] - #7 Voidious' Bucket PR k-d tree [1.5931] random points, 10 dimensions, 1 nearest neighbors, 2k data points, - #1 Rednaxela's kd-tree (2nd gen) [0.0721] - #2 Rednaxela's kd-tree (3rd gen) [0.0815] - #3 Chase-san's kd-tree (KDTreeC) [0.0899] - #4 duyn's Bucket kd-tree [0.0914] - #5 Voidious' Linear search [0.0918] - #6 Voidious' Bucket PR k-d tree [0.0930] - #7 Simonton's Bucket PR k-d tree [0.1371] random points, 10 dimensions, 10 nearest neighbors, 2k data points, - #1 Voidious' Linear search [0.0969] - #2 Voidious' Bucket PR k-d tree [0.1159] - #3 Rednaxela's kd-tree (2nd gen) [0.1306] - #4 Chase-san's kd-tree (KDTreeC) [0.1337] - #5 Rednaxela's kd-tree (3rd gen) [0.1456] - #6 duyn's Bucket kd-tree [0.2188] - #7 Simonton's Bucket PR k-d tree [0.2613] random points, 10 dimensions, 10 nearest neighbors, 10k data points, - #1 Rednaxela's kd-tree (2nd gen) [0.3716] - #2 Rednaxela's kd-tree (3rd gen) [0.3796] - #3 Voidious' Linear search [0.4314] - #4 Chase-san's kd-tree (KDTreeC) [0.5080] - #5 duyn's Bucket kd-tree [0.5190] - #6 Voidious' Bucket PR k-d tree [0.6563] - #7 Simonton's Bucket PR k-d tree [0.7404]
--Rednaxela 20:13, 11 March 2010 (UTC)
And I promise no more data flood after this:
diamond's gun data sample, 13 dimensions (12 actually in use), 1 nearest neighbors, 25k data points, - #1 Rednaxela's kd-tree (2nd gen) [0.0249] - #2 Rednaxela's kd-tree (3rd gen) [0.0286] - #3 duyn's Bucket kd-tree [0.0435] - #4 Chase-san's kd-tree (KDTreeC) [0.1044] - #5 Simonton's Bucket PR k-d tree [0.1156] - #6 Voidious' Bucket PR k-d tree [0.1673] - #7 Voidious' Linear search [1.2885] diamond's gun data sample, 13 dimensions (12 actually in use), 100 nearest neighbors, 25k data points, - #1 Rednaxela's kd-tree (2nd gen) [0.2037] - #2 Rednaxela's kd-tree (3rd gen) [0.2292] - #3 duyn's Bucket kd-tree [0.4722] - #4 Chase-san's kd-tree (KDTreeC) [0.5124] - #5 Simonton's Bucket PR k-d tree [0.7547] - #6 Voidious' Bucket PR k-d tree [1.0399] - #7 Voidious' Linear search [1.8301] diamond's gun data sample, 13 dimensions (12 actually in use), 500 nearest neighbors, 25k data points, - #1 Rednaxela's kd-tree (2nd gen) [0.6796] - #2 Rednaxela's kd-tree (3rd gen) [0.7919] - #3 duyn's Bucket kd-tree [1.7949] - #4 Simonton's Bucket PR k-d tree [2.4160] - #5 Chase-san's kd-tree (KDTreeC) [3.6759] - #6 Voidious' Linear search [7.3524] - #7 Voidious' Bucket PR k-d tree [7.4192]
--Rednaxela 20:47, 11 March 2010 (UTC)