Applying `.contiguous()` after the `permute` (i.e., on the MPS tensor before calling `gelu`) resolves the issue:
% python -c "import torch;import torch.nn.functional as f;x=torch.arange(1000,dtype=torch.float).reshape(10,10,10).permute(2,0,1);y=x.to('mps');print((f.gelu(x)-f.gelu(y).cpu()).abs().max().item())"
999.0
% python -c "import torch;import torch.nn.functional as f;x=torch.arange(1000,dtype=torch.float).reshape(10,10,10).permute(2,0,1);y=x.to('mps');print((f.gelu(x)-f.gelu(y.contiguous()).cpu()).abs().max().item())"
0.0
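For context, `permute` only rearranges the tensor's strides without moving any data, so the result is non-contiguous, and the MPS `gelu` kernel appears to mishandle that layout. A minimal CPU-only sketch (no MPS device needed) of what `.contiguous()` changes:

```python
import torch

# permute swaps strides in-place metadata only; the underlying storage
# is untouched, so the view is no longer C-contiguous.
x = torch.arange(1000, dtype=torch.float).reshape(10, 10, 10).permute(2, 0, 1)
print(x.is_contiguous())   # False: only the strides were rearranged

# .contiguous() materialises a fresh row-major copy with identical values.
y = x.contiguous()
print(y.is_contiguous())   # True
print(torch.equal(x, y))   # True: the values are unchanged, only the layout
```

Passing the already-contiguous copy to the MPS kernel sidesteps the stride-handling bug, which is why the second one-liner above prints `0.0`.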