Basically, the build-time compiled kernels were using --fast-math (which is correct), but the run-time compiled kernels were not.
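
One way to stop the two paths from drifting apart again is to have both read their flags from a single list. The snippet below is only a minimal sketch of that idea: `COMMON_KERNEL_FLAGS`, `build_time_command`, and `run_time_options` are hypothetical names, not taken from this codebase, and the flag is spelled exactly as written above.

```python
# Hypothetical sketch: all names here are illustrative, not the project's real API.
# The flag spelling is copied from the comment above; the actual compiler may differ.
COMMON_KERNEL_FLAGS = ["--fast-math"]


def build_time_command(compiler, source, output):
    """Return the command line for the build-time (ahead-of-time) kernel compile."""
    return [compiler, *COMMON_KERNEL_FLAGS, "-o", output, source]


def run_time_options(extra=()):
    """Return the option list handed to the run-time (JIT) kernel compiler."""
    return [*COMMON_KERNEL_FLAGS, *extra]


if __name__ == "__main__":
    # Both paths now carry the same math flags by construction.
    print(build_time_command("kernel-compiler", "kernel.src", "kernel.bin"))
    print(run_time_options(["-O3"]))
```

Because both functions read from the same list, a flag added for one path automatically applies to the other.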