Faster normalXAngle --> faster sin,cos,tan
Have a look at Polynomial Approximations to Sine and Cosine. This uses no memory and is even faster than the lookup table approach. Skilgannon uses this as far as I know.
That is what I'm referring to as the "polynomial implementation" above. My intuition is that the LUT version is only slower in practice because it gets evicted out of the cache due to being so large. I haven't done a side-by-side comparison test yet, but I've shrunk it by roughly a factor of 32 from the smallest version I've seen on this page, and I've added a couple other tricks to reduce the calculations required to find the lookup index. Once I truly need the speed boost I'll begin benchmarking the two against each other.