The issue was caused by a limitation of GetNumaNodeProcessorMask():
on systems with more than 64 processors, this parameter is set to the
processor mask for the node only if the node is in the same processor
group as the calling thread. Otherwise, the parameter is set to zero.
Patch from Max Dmitrichenko, thanks!