To make porting to other architectures easier, this change clarifies that parallel_reduce does not need to be supported. The unused parallel_reduce implementation assumed a warp size of 32, but it would be easy to update if we ever need it in the future.