This save a little memory and copying in the kernel by storing only a 4x3 matrix instead of a 4x4 matrix. We already did this in a few places, and those don't need to be special exceptions anymore now.
atomic_cas_float