Mostly this is making inlining match CUDA 7.5 in a few performance critical
places. The end result is that performance is now better than before, possibly
due to less register spilling or other CUDA 8.0 compiler improvements.
On benchmarks scenes, there are 3% to 35% render time reductions. Stack memory
usage is reduced a little too.
Reviewed By: sergey
Differential Revision: https://developer.blender.org/D2269