This tutorial will walk you through the process of writing a k-d tree for k-nearest neighbour search. The tree here will hold multiple items in each leaf, splitting only when a leaf overflows. It will split on the mean of the dimension with the largest variance. There are already several k-d trees on this wiki; see some of the others for ideas.
A k-d tree is a binary tree which successively splits a k-dimensional space in half. This lets us speed up a nearest neighbour search by not examining points in partitions which are too far away. A k-nearest neighbour search can be built from a nearest neighbour algorithm by not shrinking the search radius until k items have been found. The rest of this tutorial will refer to both as nearest neighbour queries.
==An Exemplar class==
We will call each item in our k-d tree an exemplar. Each exemplar has a domain—its spatial co-ordinates in the k-d tree's space. Each exemplar could also have an arbitrary payload, but our tree does not need to know about that. It will only handle storing exemplars based on their domain and returning them in a nearest neighbour search.
You might already have a class somewhere called Point which handles 2D co-ordinates. This terminology avoids conflict with that.
<pre>
import java.util.Arrays;

public class Exemplar {
  public final double[] domain;

  public Exemplar(final double[] coords) {
    this.domain = coords;
  }

  // Short hand. Shorter than calling Arrays.equals() each time.
  public boolean domainEquals(final Exemplar other) {
    return Arrays.equals(domain, other.domain);
  }
}
</pre>
While this class is fully usable as is, rarely will you be interested in just the domain of the nearest neighbours in a search. It is expected that specific data (e.g. guess factors) will be loaded by sub-classing this Exemplar class. Our k-d tree will be parameterised based on this expectation.
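For example, a specialised exemplar might carry a guess factor alongside its domain. The class below is only an illustration of that idea and is not part of the tutorial's code:

<pre>
// Hypothetical subclass: carries a guess factor as its payload.
public class GuessFactorExemplar extends Exemplar {
  public final double guessFactor;

  public GuessFactorExemplar(final double[] coords, final double guessFactor) {
    super(coords);
    this.guessFactor = guessFactor;
  }
}
</pre>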
==Basic Tree==
Here is a basic tree structure:
<pre>
public class BucketKdTree<T extends Exemplar> {
  private List<T> exemplars = new LinkedList<T>();
  private BucketKdTree<T> left, right;
  private int bucketSize;
  private final int dimensions;

  public BucketKdTree(int bucketSize, int dimensions) {
    this.bucketSize = bucketSize;
    this.dimensions = dimensions;
  }

  private boolean
  isTree() { return left != null; }
}
</pre>
Each tree is either an internal tree with both left and right sub-trees defined, or a leaf with its <code>exemplars</code> list filled. Because of our splitting algorithm, there is no point allowing a tree to be both: we split at the mean, which might not correspond with any actual exemplar.
Bucket size and the number of dimensions must be passed into the constructor. We could infer the number of dimensions from the first point added, but passing it in is simpler. Bucket size is not final because in theory it could be varied, though our implementation will not do so.
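For instance, a tree over three dimensions holding up to 8 exemplars per leaf (both numbers are arbitrary, and the exemplar subclass is the hypothetical one sketched earlier) would be constructed like this:

<pre>
// Bucket size and number of dimensions are fixed at construction.
BucketKdTree<GuessFactorExemplar> tree =
    new BucketKdTree<GuessFactorExemplar>(8, 3);
</pre>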
==Adding==
Each of the public API methods defers the actual addition to a private static method. This avoids accidentally referring to instance variables while we walk the tree. We will use this pattern for much of the tree's behaviour code.
We decide whether to split a leaf only after the add has been completed.
<pre>
// One at a time
public void
add(T ex) {
  BucketKdTree<T> tree = addNoSplit(this, ex);
  if (shouldSplit(tree)) {
    split(tree);
  }
}

// Bulk add gives us more data to choose a better split point
public void
addAll(Collection<T> exs) {
  // Some spurious function calls. Optimised for readability over
  // efficiency.
  final Set<BucketKdTree<T>> modTrees =
    new HashSet<BucketKdTree<T>>();
  for(T ex : exs) {
    modTrees.add(addNoSplit(this, ex));
  }

  for(BucketKdTree<T> tree : modTrees) {
    if (shouldSplit(tree)) {
      split(tree);
    }
  }
}
</pre>
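Continuing the hypothetical example from above, usage might look like this (the domain values are made up purely for illustration):

<pre>
// tree is the BucketKdTree<GuessFactorExemplar> constructed earlier.
tree.add(new GuessFactorExemplar(new double[]{ 123.0, 8.0, -0.5 }, 0.37));

// Bulk insertion defers splitting until all exemplars are in place.
List<GuessFactorExemplar> batch = new ArrayList<GuessFactorExemplar>();
batch.add(new GuessFactorExemplar(new double[]{ 250.0, -6.0, 0.1 }, -0.12));
batch.add(new GuessFactorExemplar(new double[]{ 310.0, 2.0, 0.8 }, 0.55));
tree.addAll(batch);
</pre>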
To add an exemplar, we traverse the tree from top down until we find a leaf. Then we add the exemplar to the list at that leaf.
To decide which sub-tree to traverse, each tree stores two values—a splitting dimension and a splitting value. If our new exemplar's domain along the splitting dimension is greater than the tree's splitting value, we put it in the right sub-tree. Otherwise, it goes in the left one.
Since adding takes little time compared to searching, we take this opportunity to make some optimisations:
* We keep track of the actual hyperrect the points in this tree occupy. This lets us rule out a tree, even though its space may intersect with our search sphere, if it doesn't actually contain any points within the hyperrect bounding our search sphere. This hyperrect is defined by <code>maxBounds</code> and <code>minBounds</code>.
* To save us having to do a full iteration when we come to split a leaf, we compute the running mean and variance for each dimension using [http://www.johndcook.com/standard_deviation.html Welford's method]:
<math>M_1 = x_1,\qquad S_1 = 0</math>

<math>M_k = M_{k-1} + \frac{x_k - M_{k-1}}{k} \qquad(\text{exMean})</math>

<math>S_k = S_{k-1} + (x_k - M_{k-1})\times(x_k - M_k) \qquad(\text{exSumSqDev})</math>
* We keep track of whether all exemplars in this leaf have the same domain. If they do, we know that comparisons on one exemplar apply to all exemplars in that leaf.
<pre>
private int splitDim;
private double split;

// These aren't initialised until add() is called.
private double[] exMean;
private double[] exSumSqDev;

// Optimisation when tree contains large number of duplicates
private boolean exemplarsAreUniform = true;

// Optimisation for searches. This lets us skip a node if its
// scope intersects with a search hypersphere but it doesn't contain
// any points that actually intersect.
private double[] maxBounds;
private double[] minBounds;

// Adds an exemplar without splitting overflowing leaves.
// Returns leaf to which exemplar was added.
private static <T extends Exemplar> BucketKdTree<T>
addNoSplit(BucketKdTree<T> tree, T ex) {
  // Some spurious function calls. Optimised for readability over
  // efficiency.
  BucketKdTree<T> cursor = tree;
  while (cursor != null) {
    updateBounds(cursor, ex);
    if (cursor.isTree()) {
      // Sub-tree
      cursor = ex.domain[cursor.splitDim] <= cursor.split
        ? cursor.left : cursor.right;
    } else {
      // Leaf
      cursor.exemplars.add(ex);
      final int nExs = cursor.exemplars.size();
      if (nExs == 1) {
        cursor.exMean =
          Arrays.copyOf(ex.domain, cursor.dimensions);
        cursor.exSumSqDev = new double[cursor.dimensions];
      } else {
        for(int d = 0; d < cursor.dimensions; d++) {
          final double coord = ex.domain[d];

          final double oldExMean = cursor.exMean[d];
          final double newMean = cursor.exMean[d] =
            oldExMean + (coord - oldExMean)/nExs;

          final double oldSumSqDev = cursor.exSumSqDev[d];
          cursor.exSumSqDev[d] = oldSumSqDev
            + (coord - oldExMean)*(coord - newMean);
        }
      }
      if (cursor.exemplarsAreUniform) {
        final List<T> cExs = cursor.exemplars;
        if (cExs.size() > 0 && !ex.domainEquals(cExs.get(0)))
          cursor.exemplarsAreUniform = false;
      }
      return cursor;
    }
  }
  throw new RuntimeException("Walked tree without adding anything");
}

private static <T extends Exemplar> void
updateBounds(BucketKdTree<T> tree, Exemplar ex) {
  final int dims = tree.dimensions;
  if (tree.maxBounds == null) {
    tree.maxBounds = Arrays.copyOf(ex.domain, dims);
    tree.minBounds = Arrays.copyOf(ex.domain, dims);
  } else {
    for(int d = 0; d < dims; d++) {
      final double dimVal = ex.domain[d];
      if (dimVal > tree.maxBounds[d])
        tree.maxBounds[d] = dimVal;
      else if (dimVal < tree.minBounds[d])
        tree.minBounds[d] = dimVal;
    }
  }
}
</pre>
==Splitting==
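As a rough sketch only, here is one way the <code>shouldSplit()</code> and <code>split()</code> methods called from <code>add()</code> could implement the rule described in the introduction: split an overflowing, non-uniform leaf on the mean of the dimension with the largest variance. The full source linked below may differ in its details.

<pre>
// Sketch only. A leaf is worth splitting once it overflows its bucket
// and actually contains distinct points.
private static <T extends Exemplar> boolean
shouldSplit(BucketKdTree<T> tree) {
  return !tree.isTree()
    && tree.exemplars.size() > tree.bucketSize
    && !tree.exemplarsAreUniform;
}

private static <T extends Exemplar> void
split(BucketKdTree<T> tree) {
  // Pick the dimension with the largest variance. exSumSqDev is
  // proportional to variance, so its largest entry is the same choice.
  int widest = 0;
  for(int d = 1; d < tree.dimensions; d++) {
    if (tree.exSumSqDev[d] > tree.exSumSqDev[widest])
      widest = d;
  }
  tree.splitDim = widest;
  tree.split = tree.exMean[widest];

  // Distribute the exemplars between two fresh leaves. addNoSplit()
  // rebuilds each child's bounds and running statistics for us.
  tree.left = new BucketKdTree<T>(tree.bucketSize, tree.dimensions);
  tree.right = new BucketKdTree<T>(tree.bucketSize, tree.dimensions);
  for(T ex : tree.exemplars) {
    addNoSplit(ex.domain[widest] <= tree.split ? tree.left : tree.right, ex);
  }

  // This node is now an internal tree. Its own bounds stay valid
  // since they still cover every point below it.
  tree.exemplars.clear();
  tree.exMean = null;
  tree.exSumSqDev = null;
}
</pre>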
==Searching==
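As with splitting, only a sketch is given here. One possible approach, following the idea from the introduction of not shrinking the search radius until k exemplars have been found: keep the k best candidates in a bounded priority queue, descend into the sub-tree on the query's side of each split first, and skip any tree whose bounding hyperrect cannot contain anything closer than the current k-th best. The helper names (<code>searchTree</code>, <code>distSq</code>, <code>minDistSqToBounds</code>) are illustrative only, and the full source linked below may take a different route.

<pre>
// Sketch only. Returns up to k nearest exemplars; assumes k >= 1.
public List<T>
search(final double[] query, int k) {
  // Max-heap of the best exemplars found so far: the furthest of the
  // current best sits at the head, so it is the first to be evicted.
  final PriorityQueue<T> best = new PriorityQueue<T>(k,
    new Comparator<T>() {
      public int compare(T a, T b) {
        return Double.compare(
          distSq(query, b.domain), distSq(query, a.domain));
      }
    });
  searchTree(this, query, k, best);
  // Contents are the k nearest found, in no particular order.
  return new ArrayList<T>(best);
}

private static <T extends Exemplar> void
searchTree(BucketKdTree<T> tree, double[] query, int k,
    PriorityQueue<T> best) {
  if (tree == null || tree.maxBounds == null) return;
  // Prune: once we already hold k candidates, skip any tree whose
  // bounding hyperrect cannot contain anything closer than the
  // current k-th best exemplar.
  if (best.size() >= k
      && minDistSqToBounds(query, tree.minBounds, tree.maxBounds)
         > distSq(query, best.peek().domain))
    return;
  if (tree.isTree()) {
    // Visit the sub-tree on the query's side of the split first.
    final BucketKdTree<T> near =
      query[tree.splitDim] <= tree.split ? tree.left : tree.right;
    final BucketKdTree<T> far =
      near == tree.left ? tree.right : tree.left;
    searchTree(near, query, k, best);
    searchTree(far, query, k, best);
  } else {
    for(T ex : tree.exemplars) {
      if (best.size() < k) {
        best.offer(ex);
      } else if (distSq(query, ex.domain)
          < distSq(query, best.peek().domain)) {
        best.poll();
        best.offer(ex);
      }
    }
  }
}

private static double
distSq(double[] a, double[] b) {
  double sum = 0;
  for(int d = 0; d < a.length; d++) {
    final double diff = a[d] - b[d];
    sum += diff*diff;
  }
  return sum;
}

// Squared distance from query to the nearest point of the hyperrect
// defined by minBounds and maxBounds.
private static double
minDistSqToBounds(double[] query, double[] minBounds, double[] maxBounds) {
  double sum = 0;
  for(int d = 0; d < query.length; d++) {
    double diff = 0;
    if (query[d] < minBounds[d]) diff = minBounds[d] - query[d];
    else if (query[d] > maxBounds[d]) diff = query[d] - maxBounds[d];
    sum += diff*diff;
  }
  return sum;
}
</pre>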
==Full Source Code==
For the full source code to the tree built in this tutorial, see [[User:Duyn/BucketKdTree|duyn's Bucket kd-tree]].