Difference between revisions of "Talk:Kd-tree"

From Robowiki
Jump to navigation Jump to search
(new result is on the way.)
(→‎Bucket PR k-d tree: new result, Simonton's the fastest, Voidious's the best.)
Line 161: Line 161:
  
 
Thank you, I have changed my local implementation to fixed your. Just found some bug in the benchmark, all trees except Voidious' and Rednaxela's does full Euclidian distance. Now it use squared distance, but the time didn't effect much. And just found some bug in my tree that make it run a little faster, or not? Anyway, my benchmarks is now running... with large input set(15,400000,150) and 50 tests at once. Will post new result and upload new version soon. It is now fully automated and get the input size/dimensions/neighbours from argv. &raquo; <span style="font-size:0.9em;color:darkgreen;">[[User:Nat|Nat]] | [[User_talk:Nat|Talk]]</span> &raquo; 15:03, 17 August 2009 (UTC)
 
Thank you, I have changed my local implementation to fixed your. Just found some bug in the benchmark, all trees except Voidious' and Rednaxela's does full Euclidian distance. Now it use squared distance, but the time didn't effect much. And just found some bug in my tree that make it run a little faster, or not? Anyway, my benchmarks is now running... with large input set(15,400000,150) and 50 tests at once. Will post new result and upload new version soon. It is now fully automated and get the input size/dimensions/neighbours from argv. &raquo; <span style="font-size:0.9em;color:darkgreen;">[[User:Nat|Nat]] | [[User_talk:Nat|Talk]]</span> &raquo; 15:03, 17 August 2009 (UTC)
 +
 +
<pre>
 +
NAT'S K-NEAREST NEIGHBOURS ALGORITHMS BENCHMARK
 +
------------------------------------------------
 +
Running 50 test(s) for k-nearest neighbours search:
 +
:: 15 dimension(s); 400000 data point(s); 150 neighbour(s)
 +
 +
 +
Running tests... COMPLETED.
 +
 +
RESULT << k-nearest neighbours search with flat/linear searching >>
 +
: Averaged used time              = 478.0353 miliseconds
 +
: Average adding time            = 1.359 microseconds
 +
: Average last node adding time  = 2.406 microseconds
 +
: Averaged  accuracy              = 100%
 +
: Worst case used time            = 776.3058 miliseconds
 +
: Best case used time            = 423.6457 miliseconds
 +
 +
RESULT << k-nearest neighbours search with Rednaxela's k-d tree >>
 +
: Averaged used time              = 318.4575 miliseconds
 +
: Average adding time            = 14.808 microseconds
 +
: Average last node adding time  = 13.042 microseconds
 +
: Averaged  accuracy              = 83%
 +
: Worst case used time            = 453.7179 miliseconds
 +
: Best case used time            = 157.7133 miliseconds
 +
 +
RESULT << k-nearest neighbours search with Simonton's Bucket PR k-d tree >>
 +
: Averaged used time              = 78.3186 miliseconds
 +
: Average adding time            = 3.446 microseconds
 +
: Average last node adding time  = 4.113 microseconds
 +
: Averaged  accuracy              = 100%
 +
: Worst case used time            = 247.4606 miliseconds
 +
: Best case used time            = 41.4053 miliseconds
 +
 +
RESULT << k-nearest neighbours search with Nat's Bucket PR k-d tree >>
 +
: Averaged used time              = 193.9578 miliseconds
 +
: Average adding time            = 3.556 microseconds
 +
: Average last node adding time  = 3.515 microseconds
 +
: Averaged  accuracy              = 100%
 +
: Worst case used time            = 805.729 miliseconds
 +
: Best case used time            = 41.7864 miliseconds
 +
 +
RESULT << k-nearest neighbours search with Voidious' Bucket PR k-d tree >>
 +
: Averaged used time              = 83.3344 miliseconds
 +
: Average adding time            = 4.1 microseconds
 +
: Average last node adding time  = 3.832 microseconds
 +
: Averaged  accuracy              = 100%
 +
: Worst case used time            = 96.4601 miliseconds
 +
: Best case used time            = 79.8629 miliseconds
 +
 +
 +
BEST RESULT:
 +
- #1 Simonton's Bucket PR k-d tree [78.3186]
 +
- #2 Voidious' Bucket PR k-d tree [83.3344]
 +
- #3 Nat's Bucket PR k-d tree [193.9578]
 +
- #4 Rednaxela's k-d tree [318.4575]
 +
- #5 flat/linear searching [478.0353]
 +
</pre>
 +
Even though Simonton's is the fastest averaged, it has quite large worst case. Voidious' the best here. And it seems that my tweak make it slower... My worst case is larger than linear worst case. This test run for 15 minutes on my machine so if you run it yourself please be patient. &raquo; <span style="font-size:0.9em;color:darkgreen;">[[User:Nat|Nat]] | [[User_talk:Nat|Talk]]</span> &raquo; 15:16, 17 August 2009 (UTC)

Revision as of 16:16, 17 August 2009

Bucket PR k-d tree

I think you all know about Bucket PR k-d tree (aka Simonton's tree), right? I'll get directly to my point.

The Bucket PR k-d tree (I'll refer as just a tree from now on) is a binary tree split by the median of the bound, right? I wonder if I make it m-ary tree, let say, Ternary or Quaternary tree instead of just binary? Will it better? » Nat | Talk » 09:12, 15 August 2009 (UTC)

Note, it doesn't by definition have to be split by the median, though the median is usually considered the best place. As far as your second question... it may be better, or it may not. I'm pretty sure it will depend heavily on the implementation details and the data set in question. --Rednaxela 14:51, 15 August 2009 (UTC)

In a binary tree you do 1 comparison to narrow it down to 1/2 the tree (if it's well balanced). In a 3-ary you do (on average) 1.5 comparisons to narrow it to 1/3. 4-ary 2 comparisons for 1/4 of the tree. 1*1/2 = 1.5*1/3 = 2*1/4. They're all theoretically the same. --Simonton 15:33, 15 August 2009 (UTC)

  • Also note that with a 4-ary tree it's best to do a binary search at each node, which is pretty much the same thing as going back to a binary tree, except you'd be doing two comparisons for the same dimension instead of stepping to the next. --Simonton 15:36, 15 August 2009 (UTC)

(Edit conflict) It could be faster splitting 4-ways, sure, but my gut says not by very much. I'd probably recommend doing the normal 2-way (binary) one first, since it will be much simpler to develop and to debug, and then you can try to modify it to do 4-way. You'll need the 2-way version, anyway, in order to compare the speeds. Good luck. --Voidious 15:38, 15 August 2009 (UTC)

It seems that Ternary tree is the fastest of binary, ternary and quaternary. But I've some question, does the Kd-Tree output the same as linear (brute-force) search. Some of my test state that:

Starting Bucket PR k-d tree performance test...
Generating points...
Data generated.
Performing linear search...
Linear search complete; time = 0.003425156
Performing binary k-d tree search...
Binary tree search complete; time = 0.001995574
: accuracy = 0.4666666666666667
Performing ternary k-d tree search...
Ternary tree search complete; time = 2.91098E-4
: accuracy = 1.0
Performing quaternary k-d tree search...
Quaternary tree search complete; time = 3.22178E-4
: accuracy = 1.0

Data completely random. Accuracy calculate by number of result that is same as linear / cluster size. This test with a hundred data points and cluster size of 15. If I increase the data points to 26,000 then the accuracy drop to zero. Is this my tree problem or it is known problem with Kd-Tree? » Nat | Talk » 16:34, 15 August 2009 (UTC)

The kd-tree should definitely give the same results as a brute force search, so you must still have some bugs to work out. --Voidious 16:38, 15 August 2009 (UTC)

Really? Simonton's one sometimes wrong too! » Nat | Talk » 11:28, 16 August 2009 (UTC)
Well, more test with m-ary tree and Simonton's and it seems that my tree and Simonton's have the exact same output but my linear didn't. I'll try your, Rednaxela's and Chase-san's one before conclude. Expected some kd-tree benchmarks this night (ICT), afternoon (UTC) or morning (EST) » Nat | Talk » 11:56, 16 August 2009 (UTC)
Really? I'd be extremely surprised if Simonton's kd-tree had a bug in it. Maybe it's your linear search that has a bug instead (or also)? But I can say with 100% certainty that a kd-tree nearest neighbors search should produce the exact same results as a brute force nearest neighbors search. --Voidious 15:01, 16 August 2009 (UTC)

Your tree is the only tree on this site (exclude Chase-san's tree because his one doesn't have k-nearest neighbour search) that is perfect, see this:

RESULT << k-nearest neighbours search with flat/linear searching >>
: Used time             = 447.54298 x 10^{-3} seconds
: Average adding time   = 1.374 x 10^{-6} seconds
: Last node adding time = 2.4440000000000004 x 10^{-6} seconds
: Accuracy              = 100%

RESULT << k-nearest neighbours search with Rednaxela's k-d tree >>
: Used time             = 1265.16806 x 10^{-3} seconds
: Average adding time   = 33.005 x 10^{-6} seconds
: Last node adding time = 19.696 x 10^{-6} seconds
: Accuracy              = 86%

RESULT << k-nearest neighbours search with Simonton's Bucket PR k-d tree >>
: Used time             = 96.38627 x 10^{-3} seconds
: Average adding time   = 3.039 x 10^{-6} seconds
: Last node adding time = 3.423 x 10^{-6} seconds
: Accuracy              = 80%

RESULT << k-nearest neighbours search with Nat's Bucket PR k-d tree >>
: Used time             = 117.65993 x 10^{-3} seconds
: Average adding time   = 3.368 x 10^{-6} seconds
: Last node adding time = 3.282 x 10^{-6} seconds
: Accuracy              = 80%

RESULT << k-nearest neighbours search with Voidious' Bucket PR k-d tree >>
: Used time             = 90.04413000000001 x 10^{-3} seconds
: Average adding time   = 4.639 x 10^{-6} seconds
: Last node adding time = 3.562 x 10^{-6} seconds
: Accuracy              = 100%

Actually, mine and Simonton's got all answers corrected, too, but I have this code in the answer checker:

if (Math.abs(linearAnswer[0].getDistance() - currentAnswer[0].getDistance()) > 0.00001)
	accuracy *= 0.8;

So they decrease to 80%. Input data for above result is quite large, k = 150 with 400000 data points with 15 dimensions. Here is another result with k = 5, points = 1000 and 4 dimensions:

RESULT << k-nearest neighbours search with flat/linear searching >>
: Used time             = 9.28114 x 10^{-3} seconds
: Average adding time   = 3.738 x 10^{-6} seconds
: Last node adding time = 7.613 x 10^{-6} seconds
: Accuracy              = 100%

RESULT << k-nearest neighbours search with Rednaxela's k-d tree >>
: Used time             = 3.67735 x 10^{-3} seconds
: Average adding time   = 22.053 x 10^{-6} seconds
: Last node adding time = 65.302 x 10^{-6} seconds
: Accuracy              = 60%

RESULT << k-nearest neighbours search with Simonton's Bucket PR k-d tree >>
: Used time             = 2.47867 x 10^{-3} seconds
: Average adding time   = 8.007 x 10^{-6} seconds
: Last node adding time = 4.959 x 10^{-6} seconds
: Accuracy              = 80%

RESULT << k-nearest neighbours search with Nat's Bucket PR k-d tree >>
: Used time             = 3.32095 x 10^{-3} seconds
: Average adding time   = 5.402 x 10^{-6} seconds
: Last node adding time = 5.238 x 10^{-6} seconds
: Accuracy              = 80%

RESULT << k-nearest neighbours search with Voidious' Bucket PR k-d tree >>
: Used time             = 0.21316000000000002 x 10^{-3} seconds
: Average adding time   = 4.586 x 10^{-6} seconds
: Last node adding time = 3.5620000000000003 x 10^{-6} seconds
: Accuracy              = 100%

Note [1]: This test suite was completely rewritten. The linear search is quite slow since I just basically Collections.sort() the points.
Note [2]: My tree is 6-ary tree with 22 buckets, which is the faster combination. The fastest isn't reveal yet that is why I create this test suite » Nat | Talk » 15:35, 16 August 2009 (UTC)

Hmm, interesting. I don't understand the part where you multiply by 0.8, even though you say they got all results correct. It looks like you're just comparing the distances on the first element - is it just because those two kd-trees don't sort the results before returning them? I also don't understand how my tree could have a total time of 1/10th of the others in that last test, though I wish it were true. =) --Voidious 15:49, 16 August 2009 (UTC)

Dunno why 0.8. But, when I change my answer checking code a bit, I accidentally remove Arrays.sort() so, yeah. They are 100% now, except Rednaxela's. About the time, yes it is true. And in first test it always stay there. Mine and Simonton's usually go between 60 and 300 × 10-3. Mine usually take more time, though. Note [3]: Data completely random. » Nat | Talk » 15:56, 16 August 2009 (UTC)

Oh, and I think that's why Diamond run faster than Dookious. Separate data from the tree seems to be a good idea, especially if you set your HashMap's density to 0.1 or something. » Nat | Talk » 16:01, 16 August 2009 (UTC)

Er.... this is strange. The test code I had showed my implementation having perfect accuracy... let me test again... --Rednaxela 16:16, 16 August 2009 (UTC)

Ugh.... I found part of why your benchmark shows it so slow.... because of how you put:
public HyperPoint getPosition() {
  return new HyperPoint(location);
}

in the test code for mine... which would be a huge cause for slowness... --Rednaxela 16:33, 16 August 2009 (UTC)

After fixing that, and removing the weird 0.8 thing, I get results like:

RESULT << k-nearest neighbours search with flat/linear searching >>
: Used time             = 19.19909 x 10^{-3} seconds
: Average adding time   = 4.426 x 10^{-6} seconds
: Last node adding time = 8.032 x 10^{-6} seconds
: Accuracy              = 100%

RESULT << k-nearest neighbours search with Rednaxela's k-d tree >>
: Used time             = 4.16715 x 10^{-3} seconds
: Average adding time   = 46.03 x 10^{-6} seconds
: Last node adding time = 9.987 x 10^{-6} seconds
: Accuracy              = 100%

RESULT << k-nearest neighbours search with Simonton's Bucket PR k-d tree >>
: Used time             = 14.88995 x 10^{-3} seconds
: Average adding time   = 11.440000000000001 x 10^{-6} seconds
: Last node adding time = 9.289000000000001 x 10^{-6} seconds
: Accuracy              = 100%

RESULT << k-nearest neighbours search with Nat's Bucket PR k-d tree >>
: Used time             = 5.1635800000000005 x 10^{-3} seconds
: Average adding time   = 38.910000000000004 x 10^{-6} seconds
: Last node adding time = 10.337000000000002 x 10^{-6} seconds
: Accuracy              = 100%

RESULT << k-nearest neighbours search with Voidious' Bucket PR k-d tree >>
: Used time             = 0.28069000000000005 x 10^{-3} seconds
: Average adding time   = 9.144 x 10^{-6} seconds
: Last node adding time = 8.520000000000001 x 10^{-6} seconds
: Accuracy              = 100%

most of the time, but if I re-run it, sometimes mine gets lower scores. I'll try and figure out what is happening with that. Haha, my adding speed really is terrible compared to my searching speed. --Rednaxela 16:43, 16 August 2009 (UTC)

Hmm.. alright... I'm now working on a new k-d tree that should be even more efficent than the Voidious one I believe... :) --Rednaxela 23:04, 16 August 2009 (UTC)

Curses... my attempt to make a better kd-tree by using heap-style storage of the tree has failed. While it would be efficent, the kd-tree is insufficently balanced and too sparse, causing the size of array that needed to be allocated to be HUGE :( --Rednaxela 03:30, 17 August 2009 (UTC)

Thank you, I have changed my local implementation to fixed your. Just found some bug in the benchmark, all trees except Voidious' and Rednaxela's does full Euclidian distance. Now it use squared distance, but the time didn't effect much. And just found some bug in my tree that make it run a little faster, or not? Anyway, my benchmarks is now running... with large input set(15,400000,150) and 50 tests at once. Will post new result and upload new version soon. It is now fully automated and get the input size/dimensions/neighbours from argv. » Nat | Talk » 15:03, 17 August 2009 (UTC)

NAT'S K-NEAREST NEIGHBOURS ALGORITHMS BENCHMARK
------------------------------------------------
Running 50 test(s) for k-nearest neighbours search:
:: 15 dimension(s); 400000 data point(s); 150 neighbour(s)


Running tests... COMPLETED.

RESULT << k-nearest neighbours search with flat/linear searching >>
: Averaged used time              = 478.0353 miliseconds
: Average adding time             = 1.359 microseconds
: Average last node adding time   = 2.406 microseconds
: Averaged  accuracy              = 100%
: Worst case used time            = 776.3058 miliseconds
: Best case used time             = 423.6457 miliseconds

RESULT << k-nearest neighbours search with Rednaxela's k-d tree >>
: Averaged used time              = 318.4575 miliseconds
: Average adding time             = 14.808 microseconds
: Average last node adding time   = 13.042 microseconds
: Averaged  accuracy              = 83%
: Worst case used time            = 453.7179 miliseconds
: Best case used time             = 157.7133 miliseconds

RESULT << k-nearest neighbours search with Simonton's Bucket PR k-d tree >>
: Averaged used time              = 78.3186 miliseconds
: Average adding time             = 3.446 microseconds
: Average last node adding time   = 4.113 microseconds
: Averaged  accuracy              = 100%
: Worst case used time            = 247.4606 miliseconds
: Best case used time             = 41.4053 miliseconds

RESULT << k-nearest neighbours search with Nat's Bucket PR k-d tree >>
: Averaged used time              = 193.9578 miliseconds
: Average adding time             = 3.556 microseconds
: Average last node adding time   = 3.515 microseconds
: Averaged  accuracy              = 100%
: Worst case used time            = 805.729 miliseconds
: Best case used time             = 41.7864 miliseconds

RESULT << k-nearest neighbours search with Voidious' Bucket PR k-d tree >>
: Averaged used time              = 83.3344 miliseconds
: Average adding time             = 4.1 microseconds
: Average last node adding time   = 3.832 microseconds
: Averaged  accuracy              = 100%
: Worst case used time            = 96.4601 miliseconds
: Best case used time             = 79.8629 miliseconds


BEST RESULT: 
 - #1 Simonton's Bucket PR k-d tree [78.3186]
 - #2 Voidious' Bucket PR k-d tree [83.3344]
 - #3 Nat's Bucket PR k-d tree [193.9578]
 - #4 Rednaxela's k-d tree [318.4575]
 - #5 flat/linear searching [478.0353]

Even though Simonton's is the fastest averaged, it has quite large worst case. Voidious' the best here. And it seems that my tweak make it slower... My worst case is larger than linear worst case. This test run for 15 minutes on my machine so if you run it yourself please be patient. » Nat | Talk » 15:16, 17 August 2009 (UTC)