Reason behind using Manhattan distance
Fragment of a discussion from Talk:DrussGT/Understanding DrussGT
Jump to navigation
Jump to search
Rethinking this after 4.5 years, L1/L2 distancing resembles L1/L2 norm in logistic regression, where L1 norm tend to find weights with more zeros, and L2 norm tend to find equal but non-zero weights for co-linear attributes. Since no one is using duplicated attributes due to limited dimensionality, this benefit of L2 norm is nullified.
The property of having more zeros of L1 norm reminds me of pattern matching, where zero means a match and non-zero means a mismatch. Being able to make a partial but best-effort match effectively simulates using a large amount of trees, eaching having a subset of the attributes.