BeepBoop seems to be the new king
Fragment of a discussion from User talk:Kev
Congratulations (again) from me too ;) BeepBoop since 1.2 has had very surprising results (nearly 95!!!). And yet nothing worked when I tried to use gradient descent to train models. Would you mind sharing a little more about this part? E.g. initialization, learning rate, how to prevent getting a zero or negative exponent in the x^a formula…
I’ve been meaning to release the code for the training, but it’s currently a huge mess and I’m pretty busy! In the meantime, here are some details that might help:
- I initialized the powers to 1, biases to 0, and multipliers to a simple hand-made KNN formula.
- I constrained the powers to be positive, so I guess the formula should really be written as w(x+b)^abs(a).
- I used Adam with a learning rate of 1e-3 for optimization.
- Changing the KNN formula of course changes the nearest neighbors, so I alternated between training for a couple thousand steps and rebuilding the tree and making new examples.
- For simplicity/efficiency, I used binning to build a histogram over GFs for an observation. Simply normalizing the histogram so it sums to 1 to get an output distribution doesn’t work that well (for one thing, it can produce very low probabilities if the kernel width is small). Instead, I used the output distribution softmax(t * log(histogram + abs(b))) where t and b are learned parameters initialized to 1 and 1e-4.
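To make the two formulas above concrete, here is a rough sketch in plain Python. It is not the actual training code (that has not been released). The function names, the example histogram, and the clamping of x+b at zero are my own assumptions for illustration; only the formulas w(x+b)^abs(a) and softmax(t * log(histogram + abs(b))) and the initial values t=1, b=1e-4 come from the post.

```python
import math

def transform(x, w, b, a):
    """Weighted feature w * (x + b) ** abs(a); abs keeps the exponent non-negative."""
    # Assumption (not from the post): clamp the base at 0 so a fractional
    # power never produces a complex number.
    base = max(x + b, 0.0)
    return w * base ** abs(a)

def output_distribution(histogram, t=1.0, b=1e-4):
    """softmax(t * log(histogram + abs(b))) over the GF bins."""
    logits = [t * math.log(h + abs(b)) for h in histogram]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-bin histogram with two empty bins.
hist = [0.0, 3.0, 1.0, 0.0]
probs = output_distribution(hist)
```

At the initial values this reduces to additive smoothing: with t=1, softmax(log(h + b)) is just (h + b) / sum(h + b), so empty bins get a small nonzero probability instead of exactly zero. Training t and b then lets the model sharpen or flatten the distribution from that starting point.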