Reason behind using Manhattan distance

On this page, I noticed:

 using Euclidean distance decreased my score against real-world targets considerably

However, a better score is just a result, not a reason, and I've been wondering for years why Manhattan works better...

Today, something came to mind. For faster calculation, most of us use SqrEuclidean instead of the real Euclidean distance. That doesn't affect the ordering, but once you feed the squared distance into a Gaussian function, boom: the actual distance (on the same scale as the Manhattan one) is squared twice, which dramatically shrinks the effective kernel size in some cases.
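
To make this concrete, here is a minimal sketch (assuming the weight function is a Gaussian of the form exp(-d²/(2σ²)); the class and method names are hypothetical, not from any actual gun):

    public class GaussianKernelSketch {
        // Squared Euclidean distance, the usual shortcut to avoid a sqrt.
        static double sqrEuclidean(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }

        // Correct: exp(-d^2 / (2 sigma^2)); sqrDist already equals d^2.
        static double gaussianFromSqrDist(double sqrDist, double sigma) {
            return Math.exp(-sqrDist / (2 * sigma * sigma));
        }

        // Expects a true distance d. Passing a squared distance here squares
        // the distance twice, i.e. exp(-d^4 / (2 sigma^2)), giving a much
        // narrower effective kernel.
        static double gaussianFromDist(double dist, double sigma) {
            return Math.exp(-dist * dist / (2 * sigma * sigma));
        }

        public static void main(String[] args) {
            double[] a = {0, 0, 0}, b = {1, 2, 2}; // true distance d = 3
            double sqr = sqrEuclidean(a, b);       // 9
            double sigma = 3;
            System.out.println(gaussianFromSqrDist(sqr, sigma)); // exp(-0.5) ~ 0.61
            System.out.println(gaussianFromDist(sqr, sigma));    // exp(-4.5) ~ 0.011 (double-squared)
        }
    }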

So do you remember whether your Euclidean-version gun used SqrEuclidean and passed that (a squared distance, compared to the Manhattan one) to the Gaussian, or whether the correct Euclidean distance was used for the Gaussian?

    Xor (talk) 15:37, 21 August 2018

    That was quite a while ago :-) But I know I tested a lot of different distance functions, including exotic things like multiplicative and log-based, and Manhattan worked best. I'm fairly sure I used Euclidean with a sqrt on the squared distance.

    Having a gun that is different from what people expect is helpful, since the tuning they do doesn't affect you as much. That's my guess as to why Manhattan worked best for me.

      Skilgannon (talk) 07:11, 23 August 2018

      Being different sounds reasonable, since there are plenty of VCS surfers (and VCS is more like Euclidean than Manhattan, IMHO). BTW, I'm curious what a log-based distance function is ;)

        Xor (talk) 11:41, 23 August 2018

        Log-based was something like log(1 + abs(a1 - b1))
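
        Summed over all the dimensions, in Java it would look something like this (a sketch of the idea, not the exact original code):

            // Sketch of a log-based distance: log(1 + |a_i - b_i|) per dimension, summed.
            public class LogDistanceSketch {
                static double logDistance(double[] a, double[] b) {
                    double sum = 0;
                    for (int i = 0; i < a.length; i++) {
                        sum += Math.log(1 + Math.abs(a[i] - b[i]));
                    }
                    return sum;
                }

                public static void main(String[] args) {
                    // Identical points -> distance 0; grows slowly with each offset.
                    System.out.println(logDistance(new double[]{0, 0}, new double[]{0, 0})); // 0.0
                    System.out.println(logDistance(new double[]{0, 0}, new double[]{1, 4})); // ln2 + ln5 ~ 2.30
                }
            }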

          Skilgannon (talk) 12:07, 23 August 2018

          I have 2 hypotheses:

          - Manhattan distance is more tolerant of noise than Euclidean distance; squaring a dimension amplifies noise.

          - Curse of dimensionality: Euclidean distance behaves oddly in high dimensions.

            MN (talk) 01:28, 28 August 2018

            Squaring does not affect the ordering of the nearest points, so with kNN the same data points should be chosen.
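
            For example, sorting by either value picks the same neighbours, since x → x² is strictly increasing for x ≥ 0 (a quick runnable sketch with made-up numbers):

                import java.util.Arrays;

                public class SqrOrderDemo {
                    public static void main(String[] args) {
                        // Ranking by squared distance selects exactly the same
                        // nearest neighbours as ranking by true distance.
                        final double[] dists = {3.87, 4.0, 1.2};
                        Integer[] byDist = {0, 1, 2};
                        Integer[] bySqr = {0, 1, 2};
                        Arrays.sort(byDist, (i, j) -> Double.compare(dists[i], dists[j]));
                        Arrays.sort(bySqr, (i, j) -> Double.compare(dists[i] * dists[i], dists[j] * dists[j]));
                        System.out.println(Arrays.toString(byDist)); // [2, 0, 1]
                        System.out.println(Arrays.toString(bySqr));  // [2, 0, 1]
                    }
                }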

            And about noise:

            [image: IMG 5655.GIF]

            Euclidean seems to be even more tolerant when the noise has less energy than the main dimensions.

            So Manhattan seems to be more "elite-oriented", dropping points with offsets in other dimensions more aggressively.

            Anyway, according to https://datascience.stackexchange.com/questions/20075/when-would-one-use-manhattan-distance-as-opposite-to-euclidean-distance

            "Manhattan distance (L1 norm) may be preferable to Euclidean distance (L2 norm) for the case of high dimensional data."

              Xor (talk) 03:19, 28 August 2018

              Suppose there are 3 data points:

              1 reference data point:

              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

              And 2 data points in the database:

              [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] (Euclidean distance = 3.87, Squared Euclidean distance = 15, Manhattan distance = 15)

              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4] (Euclidean distance = 4, Squared Euclidean distance = 16, Manhattan distance = 4)

              If noise changes a single 0 into a 4, it affects the squared Euclidean distance 4 times more than the Manhattan distance (16 vs. 4), so Euclidean picks the first point while Manhattan picks the second. If you divide all the numbers by 10, keeping them between 0 and 0.4 so they all have less energy than the main dimensions, the result is still the same.
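
              These numbers are easy to check with a quick sketch (just the standard distance formulas, nothing gun-specific):

                  import java.util.Arrays;

                  public class DistanceDemo {
                      static double sqrEuclidean(double[] a, double[] b) {
                          double sum = 0;
                          for (int i = 0; i < a.length; i++) {
                              double d = a[i] - b[i];
                              sum += d * d;
                          }
                          return sum;
                      }

                      static double euclidean(double[] a, double[] b) {
                          return Math.sqrt(sqrEuclidean(a, b));
                      }

                      static double manhattan(double[] a, double[] b) {
                          double sum = 0;
                          for (int i = 0; i < a.length; i++) {
                              sum += Math.abs(a[i] - b[i]);
                          }
                          return sum;
                      }

                      public static void main(String[] args) {
                          double[] ref = new double[15];   // reference point, all zeros
                          double[] ones = new double[15];
                          Arrays.fill(ones, 1);            // fifteen 1s
                          double[] spike = new double[15];
                          spike[14] = 4;                   // a single 4
                          // ones:  Euclidean 3.87, SqrEuclidean 15, Manhattan 15
                          // spike: Euclidean 4.00, SqrEuclidean 16, Manhattan  4
                          System.out.printf("ones:  %.2f %.0f %.0f%n",
                                  euclidean(ref, ones), sqrEuclidean(ref, ones), manhattan(ref, ones));
                          System.out.printf("spike: %.2f %.0f %.0f%n",
                                  euclidean(ref, spike), sqrEuclidean(ref, spike), manhattan(ref, spike));
                      }
                  }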

                MN (talk) 17:51, 28 August 2018