Regarding PrioQueue

Out of curiosity, have you compared your PrioQueue implementation that uses binary search to the heap implementation that I use in my 3rd gen tree? Looking over it, I expect the heap to be faster.

Consider the heapsort algorithm. Like most good sorting algorithms, it performs the sort in O(n log(n)). Roughly half the effort of heapsort is generating the heap-ordering, and half is generating the fully sorted order from the heap-ordering. This is not exactly what we're doing, but the general idea is that generating heap-ordering is lower cost than generating sorted-ordering, despite both having an O(n log(n)) cost.

With both the heap method and the binary search method, insertions are O(log(n)), except...

  1. With the heap method, you only generate heap order for items that don't end up within the closest N points. This doesn't change the O(log(n)) characteristic, but it ought to be lower cost.
  2. Your implementation also has some O(n) characteristics, because inserting in the middle of an ArrayList causes the ArrayList to copy everything after the insertion point. It's a fast O(n) operation when the number of points being returned is small, but I think it's still notable (see the sketch below).
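
To make point 2 concrete, here's roughly what I imagine the binary-search insertion looks like (a minimal sketch with placeholder names, not your actual PrioQueue code):

 import java.util.ArrayList;
 import java.util.Collections;

 // Sketch only: keep the k smallest distances in ascending order in an ArrayList.
 // The binary search is O(log n), but ArrayList.add(index, e) copies every element
 // after the insertion point, which is the O(n) cost in point 2.
 class SortedListInsertSketch {
    static void insert(ArrayList<Double> dists, ArrayList<Object> values,
                       double dist, Object value, int k) {
       int index = Collections.binarySearch(dists, dist);
       if (index < 0) index = -index - 1;  // binarySearch returns -(insertionPoint) - 1 when absent
       if (index >= k) return;             // no better than the current k-th best, skip it
       dists.add(index, dist);             // the O(n) arraycopy happens inside add(index, e)
       values.add(index, value);
       if (dists.size() > k) {             // evict the entry that just fell out of the top k
          dists.remove(k);
          values.remove(k);
       }
    }
 }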

You're probably right that the binary search method gets through JIT faster, but I'm unsure if that pays off due to the two disadvantages I note above. In addition, I'd suspect your SearchResult container object adds some extra cost to your result queue. When I was doing my heap-based implementation I found it was faster to have an Object[] array for the results and a double[] array for distances, and manage both in the heap, rather than using a container.
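
For comparison, the parallel-array heap idea looks something like this (a rough sketch with made-up names, not my actual Gen3 code). Once the heap is full, a candidate worse than the root is rejected with a single comparison and never heap-ordered, which is point 1 above:

 // Sketch of a bounded max-heap over parallel arrays (illustrative only).
 class ResultHeapSketch {
    private final double[] dists;   // heap keyed on distance, largest at index 0
    private final Object[] values;  // payloads kept in lockstep with dists
    private int size = 0;

    ResultHeapSketch(int capacity) {
       dists = new double[capacity];
       values = new Object[capacity];
    }

    double worstDist() {
       // +infinity until full, so nothing is pruned before k results are held
       return size < dists.length ? Double.POSITIVE_INFINITY : dists[0];
    }

    void offer(double dist, Object value) {
       if (size < dists.length) {           // still filling: sift the new entry up
          int i = size++;
          dists[i] = dist;
          values[i] = value;
          while (i > 0 && dists[(i - 1) / 2] < dists[i]) {
             swap(i, (i - 1) / 2);
             i = (i - 1) / 2;
          }
       } else if (dist < dists[0]) {        // full: only heap-order items that beat the worst
          dists[0] = dist;
          values[0] = value;
          siftDown(0);
       }                                    // else: rejected with one comparison, no heap work
    }

    private void siftDown(int i) {
       while (true) {
          int left = 2 * i + 1, right = left + 1, largest = i;
          if (left < size && dists[left] > dists[largest]) largest = left;
          if (right < size && dists[right] > dists[largest]) largest = right;
          if (largest == i) return;
          swap(i, largest);
          i = largest;
       }
    }

    private void swap(int a, int b) {
       double d = dists[a]; dists[a] = dists[b]; dists[b] = d;
       Object v = values[a]; values[a] = values[b]; values[b] = v;
    }
 }

The two flat arrays also avoid allocating a container object per result, which is the other half of the argument above.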

    Rednaxela (talk)20:46, 17 July 2013

    In the tests I did I found that the container didn't make any difference. I *did* try using the ResultHeap from your Gen2 tree, but it was quite a bit slower - the same speed as a linear-search, ArrayList-backed PrioQueue.

    I think it isn't just a JIT issue, but also the fact that we generally only have ~50 elements in the queue. At these sizes something like an Insertion Sort is comparable in speed to a Heap Sort, so I thought I'd implement what I know first.

    I'll give it another run, this time with your Gen3 heap.

    Also, I tried using the PrioQueue as the stack for storing the tree nodes. I used the pointRectDist as the ordering factor, but it increased the number of points visited by ~30% so I'm not sure what I did wrong. I know you did something similar, how did you get it to work?

      Skilgannon (talk)21:03, 17 July 2013

      Regarding the stack of tree nodes, perhaps you had it sorting in the wrong direction or something? Based on my tests, when setting it up so it always visits the nearest unvisited node next, it pretty reliably decreased the number of points visited. The overall performance gain was smaller than the reduction in points visited (due to the extra distance calculations Edit: not due to extra distance calculations, it turns out I was calculating bounding box distances for unvisited paths before I implemented the stack of tree nodes), but it did pay off slightly for me in any case.
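
      In case it helps, the general shape is something like this (a rough sketch with a placeholder Node type plus the ResultHeapSketch from above, not my actual tree code). Pending branches go into a min-priority queue keyed on pointRectDist, and the search stops as soon as the nearest pending branch is farther than the current worst of the k results:

       import java.util.PriorityQueue;

       // Best-first traversal sketch. Node and its fields are hypothetical placeholders.
       class BestFirstSketch {

          static class Node {
             Node left, right;              // null for leaves
             double[][] points;             // points stored in a leaf
             double[] minBound, maxBound;   // axis-aligned bounding box of everything below
             int splitDim;                  // split plane of an internal node
             double splitValue;             //   (used by the later stack sketch)

             boolean isLeaf() { return left == null; }

             double pointRectDist(double[] query) {
                // squared distance from the query to this node's bounding box
                double sum = 0;
                for (int i = 0; i < query.length; i++) {
                   double d = 0;
                   if (query[i] < minBound[i]) d = minBound[i] - query[i];
                   else if (query[i] > maxBound[i]) d = query[i] - maxBound[i];
                   sum += d * d;
                }
                return sum;
             }
          }

          static class Pending implements Comparable<Pending> {
             final double dist; final Node node;
             Pending(double dist, Node node) { this.dist = dist; this.node = node; }
             public int compareTo(Pending o) { return Double.compare(dist, o.dist); }
          }

          static void search(Node root, double[] query, ResultHeapSketch results) {
             PriorityQueue<Pending> queue = new PriorityQueue<Pending>();
             queue.add(new Pending(root.pointRectDist(query), root));
             while (!queue.isEmpty()) {
                Pending next = queue.poll();              // nearest unvisited branch first
                if (next.dist > results.worstDist())      // everything still queued is even
                   break;                                 //   farther, so the search is done
                Node node = next.node;
                if (node.isLeaf()) {
                   for (double[] p : node.points)
                      results.offer(squaredDist(p, query), p);
                } else {                                  // internal: both children exist here
                   queue.add(new Pending(node.left.pointRectDist(query), node.left));
                   queue.add(new Pending(node.right.pointRectDist(query), node.right));
                }
             }
          }

          static double squaredDist(double[] a, double[] b) {
             double sum = 0;
             for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
             return sum;
          }
       }

      If the ordering is reversed (farthest branch first), the pruning never kicks in early, which could explain the ~30% increase in visited points.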

        Rednaxela (talk)22:14, 17 July 2013

        OK, results are in. It turns out I still had some performance tucked away in my PrioQueue, like you thought. Odd that I didn't pick that up earlier.

        Next, using your MaxHeap from Gen3 was faster than my old PrioQueue, but slower than the new one. Although it is really hard to test without giving one implementation a JIT advantage when they have shared dependencies!

        Using the PrioQueue (the fast one) instead of the ArrayDeque now reduces the number of nodes I visit by ~20%, but it still takes longer than a stack - about the same amount longer as using the MaxHeap for the results. I skip pointRectDist checks on the chosen path when the results queue doesn't have any values, but beyond that it seems like a dead end to me. It seems that the overhead from the PrioQueue instead of an ArrayDeque is more than I save by visiting fewer nodes, even though it also needs slightly fewer pointRectDist calculations. I even tried trimming the stack occasionally when new data came in, thinking maybe it was growing and being pushed out of the fast region. Maybe I'll try using the MinHeap, but otherwise I'm going to stick with my simple lazily-evaluate-pointRectDist ArrayDeque implementation.
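
        Roughly, the lazy approach looks like this (a sketch reusing the placeholder Node and ResultHeapSketch types from the sketches above, not my actual implementation). Only the split-plane comparison is done when branches are pushed; pointRectDist is checked when a branch is popped, so a branch can still be discarded if better results have arrived in the meantime:

         import java.util.ArrayDeque;

         // Lazy-pointRectDist sketch (placeholder types, not the actual tree code).
         class LazyStackSketch {
            static void search(BestFirstSketch.Node root, double[] query, ResultHeapSketch results) {
               ArrayDeque<BestFirstSketch.Node> stack = new ArrayDeque<BestFirstSketch.Node>();
               stack.push(root);
               while (!stack.isEmpty()) {
                  BestFirstSketch.Node node = stack.pop();
                  // pointRectDist is only evaluated here, at pop time, and is skipped
                  // entirely while the results heap is still filling up.
                  if (results.worstDist() != Double.POSITIVE_INFINITY
                        && node.pointRectDist(query) > results.worstDist())
                     continue;                          // better results arrived since the push
                  if (node.isLeaf()) {
                     for (double[] p : node.points)
                        results.offer(BestFirstSketch.squaredDist(p, query), p);
                  } else {
                     // Push the far side first and the chosen path last, so the chosen path
                     // is explored first - only the cheap split comparison is needed here.
                     if (query[node.splitDim] < node.splitValue) {
                        stack.push(node.right);
                        stack.push(node.left);
                     } else {
                        stack.push(node.left);
                        stack.push(node.right);
                     }
                  }
               }
            }
         }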

        So, new results without garbage-creation:

        BEST RESULT: 
         - #1 Skilgannon's Cache-hit KDTree [0.0286]
         - #2 Rednaxela's kd-tree (3rd gen) [0.0291]
         - #3 Rednaxela's kd-tree (2nd gen) [0.0319]
         - #4 Voidious' Bucket PR k-d tree [0.1619]
        

        New results with garbage-creation:

        BEST RESULT: 
         - #1 Skilgannon's Cache-hit KDTree [0.0288]
         - #2 Rednaxela's kd-tree (3rd gen) [0.0314]
         - #3 Rednaxela's kd-tree (2nd gen) [0.0337]
         - #4 Voidious' Bucket PR k-d tree [0.1705]
        

        Notice my Cache-hit tree isn't really affected by the garbage, while all the others are.

          Skilgannon (talk)02:50, 18 July 2013

          Ahh, nice stuff! Could you post both the current version of your tree and the code you're using for "garbage-creation"? I'm curious to do some tests with it.

            Rednaxela (talk)03:06, 18 July 2013

            Here is the garbage creation code:

             public TestResult doTest(KNNImplementation algorithm, int numNeighbours, SampleData[] data, KNNPoint[][] solution) {
                  TestResult result = new TestResult(algorithm.getName(), data.length, numNeighbours, solution);
               
                  ArrayList<KNNPoint[]> neighbourLists = new ArrayList<KNNPoint[]>();
                  double sum = 0;//DEPENDENCY FOR GARBAGE
                  int addCount = 0;
                  for (SampleData sample : data) {
                     if (sample.save) {
                        long time = -System.nanoTime();
                        algorithm.addDataPoint(sample.entry);
                        time += System.nanoTime();
                        result.recordAdd(time);
                        addCount++;
                     }
                  
                     if (sample.search) {
                        long time = -System.nanoTime();
                        KNNPoint[] neighbourList = algorithm.getNearestNeighbors(sample.data, Math.min(numNeighbours, addCount));
                        time += System.nanoTime();
                        result.recordSearch(neighbourList, time);
                     }
                     //GARBAGE - vary size of array to vary effect - 0 should have no effect
                     double[] garbageD = new double[1000];
                     int[] garbageI = new int[garbageD.length];
                     for(int i = 0; i < garbageD.length; i++){
                        garbageD[i] = Math.random();
                        garbageI[i] = (int)(Integer.MAX_VALUE*Math.random());
                     }
                     for(int i = 0; i < garbageD.length; i++){
                        garbageD[i] *= garbageI[garbageD.length-1-i];
                        sum += garbageD[garbageD.length-1 - i];
                        garbageI[i] -= sum;
                     }
                  }
                  if(sum == -1.0)//DEPENDENCY FOR GARBAGE
                     System.out.println("Unlikely event, but cannot be eliminated.");
               
                  return result;
               }
            
              Skilgannon (talk)03:11, 18 July 2013

              Hmm... I just tried running it, including your garbage-generation:

              With the initially-uploaded version of your tree:

               - #1 Rednaxela's kd-tree (3rd gen) [0.0281]
               - #2 Skilgannon's Cache-hit KDTree [0.0336]
               - #3 Rednaxela's kd-tree (2nd gen) [0.0347]
               - #4 Voidious' Linear search [0.5787]
              

              With the newly-uploaded version of your tree:

               - #1 Rednaxela's kd-tree (3rd gen) [0.0281]
               - #2 Skilgannon's Cache-hit KDTree [0.0292]
               - #3 Rednaxela's kd-tree (2nd gen) [0.0346]
               - #4 Voidious' Linear search [0.5756]
              

              I'm running this version of KNN.jar with default options (all "yes", including standard data file and isolated processes), 40 nearest neighbors, 10 repetitions.

              Could you maybe try running this jar file exactly as-is to see what that results in on your machine?

                Rednaxela (talk)04:34, 18 July 2013

                Here's what I get:

                BEST RESULT:
                 - #1 Skilgannon's Cache-hit KDTree [0.0285]
                 - #2 Rednaxela's kd-tree (3rd gen) [0.0311]
                 - #3 Rednaxela's kd-tree (2nd gen) [0.0326]
                 - #4 Voidious' Linear search [0.4172]
                

                Strange, it must have to do with cache sizes and RAM speeds. Here is my CPU. I have 6GB of DDR3-1333, and java -version gives me

                java version "1.7.0"
                Java(TM) SE Runtime Environment (build 1.7.0-b147)
                Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
                

                I also have an SSD, if that changes anything. Anything stand out to you?

                  Skilgannon (talk)09:10, 18 July 2013

                  Here's the CPU I'm using. That was with 8GB of DDR3 running at 1066MHz, but that's intentionally underclocked by a lot. If I clock my memory up to its rated frequency of 1600MHz the result is as follows:

                   - #1 Rednaxela's kd-tree (3rd gen) [0.0276]
                   - #2 Skilgannon's Cache-hit KDTree [0.0288]
                   - #3 Rednaxela's kd-tree (2nd gen) [0.0340]
                   - #4 Voidious' Linear search [0.5593]

                  A little faster all around, but it seems the memory clock rate doesn't make much of a difference.

                  The AMD Phenom II I'm using has 6MB of L3 cache and 512kB of L2 cache per core, completely separate.

                  The Intel Core i5 you're using has 3MB of L3 cache and 256kB of L2 cache per core, though supposedly Intel has some method of allowing the two cores to access each other's L2 cache.

                  Java version:

                  java version "1.7.0_40"
                  OpenJDK Runtime Environment (IcedTea 2.4.1) (ArchLinux build 7.u40_2.4.1-1-x86_64)
                  OpenJDK 64-Bit Server VM (build 24.0-b50, mixed mode)

                  I have my root partition running on an SSD with the /home partition running on a spinning disk, but I doubt the disk situation affects this benchmark at all in any case.

                  Edit: I also find it very interesting that you get better performance numbers for the linear search, whereas I get better numbers for all of the trees except yours, which is about the same. The gap between my 3rd and 2nd gen trees is also much larger for me than it is for you.

                  Edit: Since you uploaded a small update to your code this morning I tried that also. The change was small, going from 0.0288 to 0.0284.

                    Rednaxela (talk)13:53, 18 July 2013
                     

                    Because of how Java runs the JIT and GC in separate threads, I just tried a couple quick things:

                    If I force Java to run on only one core, I get this result:

                     - #1 Skilgannon's Cache-hit KDTree [0.0334]
                     - #2 Rednaxela's kd-tree (3rd gen) [0.0343]
                     - #3 Rednaxela's kd-tree (2nd gen) [0.0375]
                     - #4 Voidious' Linear search [0.5844]

                    If I force Java to run on two cores, I get this result:

                     - #1 Rednaxela's kd-tree (3rd gen) [0.0280]
                     - #2 Skilgannon's Cache-hit KDTree [0.0304]
                     - #3 Rednaxela's kd-tree (2nd gen) [0.0341]
                     - #4 Voidious' Linear search [0.4806]

                    Compared to allowing all 6 cores, only allowing 2 cores improved the linear search result (more dramatically than I expected!), but it still hurt all of the kd-trees.

                      Rednaxela (talk)14:57, 18 July 2013

                      Maybe that's why my linear search score is so much better than yours?

                      BTW, newest code is a little bit faster.

                        Skilgannon (talk)15:39, 18 July 2013

                        So, it turns out that if I use Oracle Java in Windows instead of OpenJDK on Linux, the performance is pretty different:

                         - #1 Skilgannon's Cache-hit KDTree [0.0275]
                         - #2 Rednaxela's kd-tree (3rd gen) [0.0290]
                         - #3 Rednaxela's kd-tree (2nd gen) [0.0309]
                         - #4 Voidious' Linear search [0.5553]

                        The relative performance looks much more similar to what you saw, with a smaller difference between my 3rd and 2nd gen trees, and with yours performing better.

                        Between Oracle Java/Windows and OpenJDK/Linux, my 3rd gen tree and your cache-hit tree swap places, it seems.

                        java version "1.7.0_25"
                        Java(TM) SE Runtime Environment (build 1.7.0_25-b17)
                        Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

                        (Also turns out the server JVM is much better suited for the kd-tree test than the client JVM. Edited this post to switch the results to those using the server JVM)

                          Rednaxela (talk)02:43, 19 July 2013

                          The client JVM is not designed for heavy processing like the rumble.

                            MN (talk)22:51, 19 July 2013