I'm switching over to your 3rd gen tree, and had a question re: your statement here:
"... it has some additional features like an iterator to allow one to iterate over the nearest points in sorted order, and if you stop early it saves a notable amount of cpu."
How early are we talking about? For instance, if I grab an iterator for 200 maxPointsReturned, iterate over the first 150 of them and decide I'm done, is that still in the "saves a notable amount of cpu" territory?
I'd have to benchmark it to be sure, and it depends on the distribution of data of course, but I'd estimate stopping at 150 out of 200 points would shave off something like 10-15% of the search time compared to just getting 200 points.
IIRC it depends on the structure of your data. From what I understand, it slowly expands the hypersphere of contained points in bursts (grabbing further and further tree branches), sorting them in order of relevancy as it goes. Whether you save anything depends on whether stopping early lets you skip an expansion that the remaining points would have needed. An iterator for 200, then getting 150, will be slower than getting an iterator for 150, but chances are it will be faster than getting an iterator for 200 and using all 200.
Pretty much, yeah, though it does avoid the full effort of sorting them by using a min-max heap that tosses the most distant points off and keeps the closest point accessible in constant time. The search algorithm is exactly the same as searching for all 200 (it needs to remember the 200 closest points it's found so far, and know what the furthest and closest ones of that set are), except that it pauses the search when it is able to determine that no unchecked branch could have anything closer than the closest point not yet returned by the iterator.
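To illustrate the "pause the search" rule described above, here is a minimal sketch, not the tree's actual code. It makes several simplifying assumptions: each branch is reduced to a lower bound on distance (the kind of value distanceToRect provides) plus the distances of the points it holds, two java.util.PriorityQueue instances stand in for the min-max heap, and the found set is not trimmed to the N closest points. The names LazyNearestSketch, Branch and branchesExpanded are hypothetical.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical sketch of lazy, sorted nearest-neighbour iteration.
final class LazyNearestSketch implements Iterator<Double> {

    static final class Branch {
        final double lowerBound;   // no point in this branch is closer than this
        final double[] distances;  // distances of the points stored in this branch

        Branch(double lowerBound, double... distances) {
            this.lowerBound = lowerBound;
            this.distances = distances;
        }
    }

    // Unchecked branches, nearest lower bound first.
    private final PriorityQueue<Branch> unchecked =
            new PriorityQueue<>(Comparator.comparingDouble((Branch b) -> b.lowerBound));
    // Points found so far but not yet returned, closest first.
    private final PriorityQueue<Double> found = new PriorityQueue<>();
    private int remaining;
    int branchesExpanded = 0;  // exposed only to show how much work was done

    LazyNearestSketch(List<Branch> branches, int maxPointsReturned) {
        unchecked.addAll(branches);
        remaining = maxPointsReturned;
    }

    // Keep searching only while some unchecked branch could still hold
    // something closer than the closest point not yet returned.
    private void advance() {
        while (!unchecked.isEmpty()
                && (found.isEmpty() || unchecked.peek().lowerBound < found.peek())) {
            Branch b = unchecked.poll();
            for (double d : b.distances) {
                found.add(d);
            }
            branchesExpanded++;
        }
    }

    @Override
    public boolean hasNext() {
        if (remaining <= 0) {
            return false;
        }
        advance();
        return !found.isEmpty();
    }

    @Override
    public Double next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return found.poll();
    }
}
```

In this sketch, a caller that stops after the first few results never pays for branches whose lower bound lies beyond the last point it took, which is where the saving from stopping early comes from.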
Excellent info from you both, thanks!
One more question: the DistanceFunction file defines an Interface, but features no comments to describe what the two member methods are supposed to do.
distance() is utterly obvious... especially when reading your EuclideanDistanceFunction implementation. But distanceToRect()... I think I know what it requires, but I don't want to screw it up when I write my own DistanceFunction.
What precisely is distanceToRect defined as?
It's the minimum distance from that point you are testing to the hyper-rectangle defined by the min and max co-ordinates on each dimension.
So considering dimension x: if the point val is less than the min, the distance is (min - val); if it is between min and max, the distance is 0 (it is inside the rectangle); if it is more than max, the distance is (val - max).
If you are doing Euclidean, square each distance then add them together. I do Manhattan, so just use the absolute value.
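The per-dimension rule above can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name RectDistanceSketch and its method names are hypothetical, and the parameter order (point, min, max) is a guess rather than something taken from the DistanceFunction interface.

```java
// Hypothetical helper; names and parameter order are assumptions,
// not copied from the tree's DistanceFunction interface.
final class RectDistanceSketch {

    // Per-dimension gap between a value and the interval [min, max]:
    // 0 when the value is inside, otherwise the distance to the nearer edge.
    static double gap(double val, double min, double max) {
        if (val < min) {
            return min - val;
        }
        if (val > max) {
            return val - max;
        }
        return 0.0;
    }

    // Squared Euclidean: square each per-dimension gap, then add them up.
    static double squaredEuclideanToRect(double[] point, double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < point.length; i++) {
            double d = gap(point[i], min[i], max[i]);
            sum += d * d;
        }
        return sum;
    }

    // Manhattan: add up the per-dimension gaps, which are already non-negative.
    static double manhattanToRect(double[] point, double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < point.length; i++) {
            sum += gap(point[i], min[i], max[i]);
        }
        return sum;
    }
}
```

For the point (-1, 0.5, 4) and a box from (0, 0, 0) to (1, 1, 2), the per-dimension gaps are 1, 0 and 2, so the squared Euclidean result is 5 and the Manhattan result is 3.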
Yep, what Skilgannon said. Sorry I forgot to put comments in that interface.
If you're wondering, this is used to compare the search point to the bounding box associated with each branch of the tree, and allows efficient skipping of irrelevant branches.
Excellent. I was guessing it was that or related to that from reading your Euclidean distance implementation.
When I write a WeightedSquareEuclideanDistanceFunction and/or WeightedManhattanDistanceFunction, would you like me to commit them to your bitbucket hg repo? I'm happy to help contribute! :)
Sure, or if posted on the wiki or in a bot I can upload it to that some time. Thanks :)
DeBroglie rev0026 is up. Only change from rev0025 is your 3rd gen tree. Should perform pretty close to what it was doing before, though some differences are to be expected since your 3rd gen tree doesn't drop points. We'll see, though I have to get going to the doctor at the moment.
I tested with both the SqrEuclidean and Manhattan versions of my Weighted trees. Both seemed to work fine in several test battles with some bots I had sitting around. I ended up making a WeightedDistanceFunction class to be a superclass of both the WeightedManhattanDistanceFunction and WeightedSquareEuclideanDistanceFunction, to duplicate less of the code involved in weighting.
The weighted DF should fail over gracefully if given weights that mismatch the number of tree dimensions. Only thing I didn't implement was doing a Math.abs() on weights, since someone out there might invent a DistanceFunction that utilizes negative weights.
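For readers wondering what such a shared superclass might look like, here is a rough sketch. It is not the code from the bitbucket fork, and all class and method names are hypothetical. Two things are pure assumptions because the post does not spell them out: a dimension with no matching weight falls back to a weight of 1.0, and weights multiply the per-dimension term directly (the squared term in the Euclidean case).

```java
// Hypothetical sketch of a shared weighted superclass; the fallback
// behaviour and the way weights are applied are assumptions.
abstract class WeightedDistanceSketch {
    private final double[] weights;

    WeightedDistanceSketch(double[] weights) {
        this.weights = (weights == null) ? new double[0] : weights.clone();
    }

    // Assumed fallback: a dimension without a weight counts with weight 1.0.
    // Weights are used as given (no Math.abs), so negative weights pass through.
    protected double weight(int dimension) {
        return dimension < weights.length ? weights[dimension] : 1.0;
    }

    // Per-dimension distance from a value to the interval [min, max].
    protected static double gap(double val, double min, double max) {
        if (val < min) return min - val;
        if (val > max) return val - max;
        return 0.0;
    }

    // How one dimension's raw difference contributes to the total.
    protected abstract double term(double diff);

    double distance(double[] p1, double[] p2) {
        double sum = 0.0;
        for (int i = 0; i < p1.length; i++) {
            sum += weight(i) * term(Math.abs(p1[i] - p2[i]));
        }
        return sum;
    }

    double distanceToRect(double[] point, double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < point.length; i++) {
            sum += weight(i) * term(gap(point[i], min[i], max[i]));
        }
        return sum;
    }
}

final class WeightedManhattanSketch extends WeightedDistanceSketch {
    WeightedManhattanSketch(double[] weights) { super(weights); }
    @Override protected double term(double diff) { return diff; }
}

final class WeightedSquareEuclideanSketch extends WeightedDistanceSketch {
    WeightedSquareEuclideanSketch(double[] weights) { super(weights); }
    @Override protected double term(double diff) { return diff * diff; }
}
```

Keeping the weighting and the loop in the abstract base means each subclass only has to say how a single dimension's difference is turned into a term.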
If the code on my bitbucket fork meets your approval, I can toss you a pull request. :)
EDIT: Made a new bitbucket with all the work in a single commit, and decided to make WeightedDistanceFunction abstract.
Neat! When I first took a quick look at the version you initially posted, I was thinking to myself that WeightedDistanceFunction should have been abstract, yeah.
I'll merge it in some time shortly.