Reason behind using Manhattan distance

On this page, I noticed:

 using Euclidean distance decreased my score against real-world targets considerably

However, a better score is a result, not a reason. And I've been thinking for years about why Manhattan works better...

Today, something came to mind. For faster calculation, most of us use squared Euclidean distance (SqrEuclidean) instead of true Euclidean distance. This doesn't affect the ordering of neighbors, but once you feed the squared distance into a Gaussian function, boom, the actual distance (taken to the same degree as the Manhattan one) is effectively squared twice, which dramatically decreases the effective k in some cases.

So, do you remember whether your Euclidean version of the gun fed SqrEuclidean (i.e. a squared distance, compared to Manhattan) into the Gaussian, or whether the correct Euclidean distance was used for the Gaussian?
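
A minimal sketch of the concern above, written in Python for brevity rather than the Java a Robocode gun would use. The Gaussian form exp(-d^2 / (2 * bandwidth^2)), the name `gaussian_weight`, and the sample distance are assumptions for illustration only, not taken from any actual gun.

```python
import math

def gaussian_weight(distance, bandwidth=1.0):
    """Gaussian kernel weight for a neighbor at the given distance (assumed form)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

euclidean = 2.0                  # true Euclidean distance to some neighbor
sqr_euclidean = euclidean ** 2   # what a sqrt-free distance function returns

intended = gaussian_weight(euclidean)        # exponent grows with d^2
accidental = gaussian_weight(sqr_euclidean)  # exponent grows with d^4

print(f"weight from Euclidean distance:         {intended:.6f}")
print(f"weight from squared Euclidean distance: {accidental:.6f}")
```

For neighbors farther away than one bandwidth, the second weight falls off much faster, so fewer of the k neighbors contribute meaningfully.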

Xor (talk)15:37, 21 August 2018

That was quite a while ago :-) But I know I tested a lot of different distance functions, including exotic things like multiplicative and log-based, and Manhattan worked best. I'm fairly sure I used Euclidean with a sqrt on the squared distance.

Having a gun that is different from what people expect is helpful, since the tuning they do doesn't affect you as much. This is my guess as to why Manhattan worked best for me.

Skilgannon (talk)07:11, 23 August 2018

Log based was something like log(1+abs(a1-b1))

Skilgannon (talk)12:07, 23 August 2018
 

Just had a thought about DrussGT's hundreds of random VCS bins and Manhattan distance:

Suppose we have an infinite number of random VCS buffers (random bin sizes and dimensions, weighted equally, no decay). Then each unit increase in distance along one dimension results in a "1" decrease in the total number of buffers (the data weight) containing that data point.

When the distance increases by 1 in dimension A and by 1 in dimension B as well, the data weight decreases by 1 + 1 = 2, which is exactly how Manhattan distance works.

If we use Manhattan distance together with kNN, and decrease the weight linearly with distance, it should yield results similar to random VCS.
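
A rough sketch of this argument under strong simplifying assumptions: each buffer segments on a single attribute, all buffers share one bin width, and every integer bin offset is enumerated once as a deterministic stand-in for averaging over random offsets. The names `shared_buffers` and `manhattan` and the integer attribute scale are hypothetical, chosen only for illustration.

```python
def shared_buffers(a, b, bin_width=100):
    """Count (attribute, offset) buffers in which points a and b land in the same bin."""
    total = 0
    for ai, bi in zip(a, b):
        for offset in range(bin_width):
            # Shifting both values by the same offset moves the bin boundaries.
            if (ai + offset) // bin_width == (bi + offset) // bin_width:
                total += 1
    return total

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

query = (0, 0)
buffer_count = 2 * 100  # 2 attributes x 100 offsets
for point in [(10, 0), (0, 10), (10, 10), (30, 5)]:
    print(point, shared_buffers(query, point), buffer_count - manhattan(query, point))
```

In this toy setup the number of buffers shared with the query drops by exactly the Manhattan distance, as long as no per-attribute distance exceeds the bin width. Buffers that segment on several attributes at once would only follow this approximately.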

However, once a rolling average (decay) is used, things get quite different...

Xor (talk)16:43, 15 September 2018
 
 

I have 2 hypotheses:

- Manhattan distance is more tolerant of noise than Euclidean distance. Squaring a dimension amplifies noise.

- Curse of dimensionality. Euclidean distance behaves oddly in high dimensions.

MN (talk)01:28, 28 August 2018

Squaring does not affect the ordering of the nearest points, so with kNN the same data points should be chosen.

As for noise:

[Image: IMG 5655.GIF]

Euclidean seems to be even more tolerant when the noise has less energy than the main dimensions.

So Manhattan seems to be more "elite-oriented", dropping points with offsets in another dimension more aggressively.

Anyway, according to https://datascience.stackexchange.com/questions/20075/when-would-one-use-manhattan-distance-as-opposite-to-euclidean-distance:

Manhattan distance (L1 norm) may be preferable to Euclidean distance (L2 norm) for the case of high dimensional data:

Xor (talk)03:19, 28 August 2018

Suppose there are 3 data points:

1 reference data point:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

And 2 data points in the database:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] (Euclidean distance = 3.87, Squared Euclidean distance = 15, Manhattan distance = 15)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4] (Euclidean distance = 4, Squared Euclidean distance = 16, Manhattan distance = 4)

If noise changes a single 0 into a 4, it adds 16 to the squared Euclidean distance but only 4 to the Manhattan distance, a 4x larger effect. Euclidean distance will pick the first point, Manhattan distance will pick the second.
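
The numbers above can be checked with a short Python sketch. The helper names and the labels `uniform_offset` and `single_outlier` are made up here for illustration.

```python
import math

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

reference = [0] * 15
candidates = {
    "uniform_offset": [1] * 15,        # off by 1 in every dimension
    "single_outlier": [0] * 14 + [4],  # exact match except one noisy dimension
}

for name, point in candidates.items():
    print(name,
          round(euclidean(reference, point), 2),
          squared_euclidean(reference, point),
          manhattan(reference, point))

# Nearest neighbor under each metric.
nearest_euclidean = min(candidates, key=lambda n: euclidean(reference, candidates[n]))
nearest_manhattan = min(candidates, key=lambda n: manhattan(reference, candidates[n]))
print("Euclidean picks:", nearest_euclidean)
print("Manhattan picks:", nearest_manhattan)
```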

MN (talk)17:51, 28 August 2018

This is a good demonstration! Euclidean is sensitive to outliers and prefers a point that is mediocre on average over a good point with noise in a few dimensions.

Xor (talk)03:03, 29 August 2018
 

Shouldn't you be adding that +1 to the x value before squaring?

Euclidean distance = sqrt( (x+1)^2 )

Manhattan distance = | x+1 |

MN (talk)18:10, 28 August 2018

My case is noise in another dimension ;)

However, if noise is added to the main dimension, it will be

sqrt((1 + x)^2 + 1)

vs

|1 + x| + 1

and if we put the two curves together (shifted so that they intersect at x=0)

http://robowiki.net/w/images/5/5a/C3BD3E15-EEB6-4F63-826F-7C1F5E54A78E.gif

Euclidean looks terrible with large noise in one dimension, and Manhattan looks robust.

Xor (talk)02:53, 29 August 2018
 
 

I think it is due to noise rejection. For me, it comes down to how a small change in many dimensions is weighted compared to a big change in a single dimension, as you demonstrated above. You can also think of it as the difference between how L1 and L2 distance would affect a minimization problem. L1 rejects large noise, and is the most robust you can get while still maintaining a convex search space. L2 has a gradient that grows with the distance, so dimensions with more error are effectively weighted higher, and more than just in proportion to the amount of error.
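
A small Python sketch of the minimization view, using one-dimensional toy data invented for illustration. It relies on two standard facts: the value minimizing the sum of absolute errors is the median, and the value minimizing the sum of squared errors is the mean.

```python
import statistics

def l1_gradient(error):
    """Derivative of abs(error): same magnitude no matter how large the error is."""
    return (error > 0) - (error < 0)

def l2_gradient(error):
    """Derivative of error**2: grows in proportion to the error."""
    return 2 * error

for error in (1, 10, 50):
    print(error, l1_gradient(error), l2_gradient(error))

samples = [1, 2, 3, 4, 100]                 # one large outlier
l1_minimizer = statistics.median(samples)   # minimizes sum of |x - c|
l2_minimizer = statistics.mean(samples)     # minimizes sum of (x - c)**2
print("L1 minimizer:", l1_minimizer)
print("L2 minimizer:", l2_minimizer)
```

The outlier drags the L2 minimizer far from the bulk of the data, while the L1 minimizer stays put.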

Skilgannon (talk)18:15, 28 August 2018
 

Rethinking this after 4.5 years: L1/L2 distance resembles the L1/L2 norm penalty in logistic regression, where the L1 norm tends to find weights with more zeros, and the L2 norm tends to find equal but non-zero weights for collinear attributes. Since no one uses duplicated attributes, given the limited dimensionality, this benefit of the L2 norm is nullified.
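
To illustrate the "more zeros" property, here is a hedged Python sketch using a one-dimensional quadratic stand-in rather than logistic regression itself. The function names, the sample weights, and the penalty value are all assumptions for illustration. For a single weight, the L1-penalized minimizer is the soft-thresholding rule and the L2-penalized minimizer is a proportional shrink.

```python
def l1_shrink(weight, penalty):
    """Minimizer of 0.5*(w - weight)**2 + penalty*abs(w) (soft thresholding)."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0  # small weights are set exactly to zero

def l2_shrink(weight, penalty):
    """Minimizer of 0.5*(w - weight)**2 + 0.5*penalty*w**2."""
    return weight / (1.0 + penalty)

unpenalized = [3.0, 0.4, -0.2, -2.5]
penalty = 0.5
l1_weights = [l1_shrink(w, penalty) for w in unpenalized]
l2_weights = [l2_shrink(w, penalty) for w in unpenalized]
print("L1:", l1_weights)
print("L2:", [round(w, 3) for w in l2_weights])
```

The L1 penalty zeroes out the small weights entirely, while the L2 penalty only scales every weight down.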

The L1 norm's tendency to produce more zeros reminds me of pattern matching, where zero means a match and non-zero means a mismatch. Being able to make a partial but best-effort match effectively simulates using a large number of trees, each having a subset of the attributes.

Xor (talk)09:24, 17 January 2023