The GPU kernel has to use atomics for accumulation because all offsets are processed in parallel and may write to the same output element. On CPUs the offsets are not processed in parallel, so we can disable the atomics there for a considerable speedup.
atomic_cas_float
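
For illustration only, here is a minimal sketch in C11 of the idea: a float add built on a compare-and-swap loop, plus a switch that falls back to a plain add when nothing runs concurrently. The names `cas_add_float`, `accumulate` and `use_atomics` are hypothetical and are not taken from this change; the actual kernel code may be structured quite differently.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add `value` to the float whose bit pattern is stored
 * in `addr`, using a compare-and-swap loop on the 32-bit representation. */
static void cas_add_float(_Atomic uint32_t *addr, float value) {
    uint32_t old_bits = atomic_load(addr);
    for (;;) {
        float old_val, new_val;
        uint32_t new_bits;
        memcpy(&old_val, &old_bits, sizeof old_val);
        new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);
        /* On failure, old_bits is refreshed with the current contents,
         * so the next iteration retries against the latest value. */
        if (atomic_compare_exchange_weak(addr, &old_bits, new_bits)) {
            return;
        }
    }
}

/* Hypothetical driver: sum `n` contributions into one accumulator.
 * use_atomics != 0 models the parallel (GPU-style) path,
 * use_atomics == 0 models the sequential (CPU-style) path. */
float accumulate(const float *contrib, size_t n, int use_atomics) {
    if (use_atomics) {
        _Atomic uint32_t acc = 0; /* bit pattern of 0.0f */
        for (size_t i = 0; i < n; ++i) {
            cas_add_float(&acc, contrib[i]);
        }
        uint32_t bits = atomic_load(&acc);
        float out;
        memcpy(&out, &bits, sizeof out);
        return out;
    }
    /* No concurrent writers: a plain add avoids the CAS retry loop. */
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += contrib[i];
    }
    return acc;
}
```

Both paths produce the same sum when run sequentially; the plain path simply skips the load, compare and retry work on every update, which is where the CPU-side saving would come from under these assumptions.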