Training using MPS - issues with MPSNNGraph.encodeBatch

I'm curious why so few people are posting about training on MPS - am I missing something?


I'm building a trivial Convolutional Neural Network (CNN) in MPSCNN and have come across an issue when exporting the weights.


When training, I run into issues calling **MPSNNGraph.encodeBatch** when the batch size is larger than 4 (which is somewhat peculiar given that an MTLTexture has 4 channels). Any time I increase the batch size, the weight and bias coefficients come back as **nan** - both via the data source's locally stored MPSCNNConvolutionWeightsAndBiasesState (from the update method) and when exporting the weights from the associated filter nodes.


I have increased the precision of the graph and of all the nodes' resultImages to float32, and also added gradient clipping to the optimizer, with no luck. Is there a way to tell whether this is a memory issue or an overflow of the data types used? Could the issue be in the optimizer, the gradients, the states, or the transfer from GPU to CPU?
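
For reference, by "increased the precision" and "added clipping" I mean roughly the following (a sketch, not my exact code; `graph`, `convNodes`, `lossNode` and the clip bounds are placeholders):

```swift
import MetalPerformanceShaders

// Sketch only: `graph`, `convNodes` and `lossNode` stand in for the real graph objects.
func applyFloat32AndClipping(device: MTLDevice,
                             graph: MPSNNGraph,
                             convNodes: [MPSCNNConvolutionNode],
                             lossNode: MPSCNNLossNode) -> MPSNNOptimizerStochasticGradientDescent {
    // Force float32 storage for the graph and for every node's intermediate result image.
    graph.format = .float32
    for node in convNodes {
        node.resultImage.format = .float32
    }
    lossNode.resultImage.format = .float32

    // SGD with gradient clipping enabled via the optimizer descriptor
    // (the clip bounds here are illustrative, not tuned values).
    let descriptor = MPSNNOptimizerDescriptor(learningRate: 0.001,
                                              gradientRescale: 1.0,
                                              applyGradientClipping: true,
                                              gradientClipMax: 1.0,
                                              gradientClipMin: -1.0,
                                              regularizationType: .none,
                                              regularizationScale: 0.0)
    return MPSNNOptimizerStochasticGradientDescent(device: device,
                                                   momentumScale: 0.0,
                                                   useNestrovMomentum: false,
                                                   optimizerDescriptor: descriptor)
}
```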


Any suggestions greatly appreciated - I've been stuck on this 'challenge' for a good part of a couple of weeks now - cheers.


Some suggestions from me to Apple and the MPS ML team:

- Best practice **example** of training using MPS

- Better debugging tools

Replies

... some more details: I'm using a softMaxCrossEntropy loss with a mean reduction and an SGD optimizer with a learning rate of approx. 0.001.
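
Concretely, the setup is roughly as follows (a sketch, not my exact code; `lastLayerNode` is a placeholder for the final filter node of the network):

```swift
import MetalPerformanceShaders

// Sketch of the loss/optimizer setup described above; `lastLayerNode` is a placeholder.
func makeLossAndOptimizer(device: MTLDevice,
                          lastLayerNode: MPSNNFilterNode) -> (MPSCNNLossNode, MPSNNOptimizerStochasticGradientDescent) {
    // Softmax cross-entropy with a mean reduction.
    let lossDescriptor = MPSCNNLossDescriptor(type: .softMaxCrossEntropy,
                                              reductionType: .mean)
    let lossNode = MPSCNNLossNode(source: lastLayerNode.resultImage,
                                  lossDescriptor: lossDescriptor)

    // Plain SGD, learning rate ~0.001, no momentum.
    let optimizer = MPSNNOptimizerStochasticGradientDescent(device: device, learningRate: 0.001)
    return (lossNode, optimizer)
}
```

(For comparison - if I remember Apple's MNIST training sample correctly, it uses a .sum reduction and sets lossDescriptor.weight to 1/batchSize rather than relying on .mean.)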


When I run the graph with a small batch (4) the gradients are relatively small, but with a large batch (128) the gradients are large and sometimes nan ... I would have assumed a mean reduction on the loss would, well, take the average.


I'm obviously missing something - any thoughts, suggestions much appreciated - cheers.

I think you're one of a very small number of people in the world who have actually used this API. ;-) I wouldn't be surprised if it still has a few rough edges.


I've had a blog post about training with MPS in the works for ages but other stuff keeps coming up instead.

"I've had a blog post about training with MPS in the works" - curious how far in the works this post is; working code by any chance?


I thought I had resolved it today by changing the format of the gradient image results from float16 to float32 - but the excitement was short lived.


I'm considering constructing the network without MPSNNGraph to see if it works - Turi Create is built on top of MPS, so it's supposedly possible.


Thanks for your response.

Some further information;


The graph's default storage format is set to float32. I adjusted the batch size (everything else remains constant) and set the training style to CPU to capture the gradients of my topmost layer. Below are the results (outputting the first 10 coefficients); (a, b, ...) just indicates the re-runs (first backward pass for each).


BATCH SIZE = 4

Gradient weights l1 (a) ... 1568 ... [0.0032182545, 0.0018722187, 0.004452133, 0.0027766703, 0.004814127, 0.002290076, 0.0005896213, 0.002064481, 0.0019948026, 0.0055566807, 0.003961149]


Gradient weights l1 (b) ... 1568 ... [0.0032182545, 0.0018722187, 0.004452133, 0.0027766703, 0.004814127, 0.002290076, 0.0005896213, 0.002064481, 0.0019948026, 0.0055566807, 0.003961149]


Gradient weights l1 (c)... 1568 ... [0.0032182545, 0.0018722187, 0.004452133, 0.0027766703, 0.004814127, 0.002290076, 0.0005896213, 0.002064481, 0.0019948026, 0.0055566807, 0.003961149]


BATCH SIZE = 8

Gradient weights l1 (a) ... 1568 ... [-0.35463914, 0.58976394, -0.59485054, 0.22903103, -0.51804817, 0.59701616, 0.5051392, 0.074297816, 0.4284085, -0.8984931, -0.10788263]


Gradient weights l1 (b) ... 1568 ... [-0.8611915, 0.12668955, -0.20884266, -0.102241494, -0.6502063, -0.23424746, -0.4674223, -0.6518867, -0.23104043, -0.40736914, -0.31194344]


BATCH SIZE = 16

Gradient weights l1 (a) ... 1568 ... [1.26359e+35, 5.4729107e+35, 3.3159668e+35, 5.214483e+35, 3.2493971e+35, 9.169122e+35, 9.311691e+35, 2.1583421e+35, 3.952557e+35, 2.3942557e+35, 3.6645236e+35]


Gradient weights l1 (b) ... 1568 ... [0.09119261, 0.05756697, 0.07213145, 0.014482293, 0.09319483, 0.038098965, 0.06368228, 0.09818763, 0.034319896, 0.032822747, 0.011597654]


Gradient weights l1 (c) ... 1568 ... [-nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan]


BATCH SIZE = 32

Gradient weights l1 (a) ... 1568 ... [1.2068136e+35, -2.3001325e+34, 2.1084688e+35, -2.9456847e+35, 9.786839e+33, -6.9434864e+35, -1.4935384e+35, -1.0668826e+35, -1.9871346e+35, 7.397618e+34, -2.4444336e+35]


Gradient weights l1 (b)... 1568 ... [-1.3880644e+35, -2.4221317e+34, -1.1778572e+35, -1.7336298e+35, -1.8964465e+35, -2.3253935e+35, -4.467901e+35, -2.1361668e+35, -8.294703e+34, -1.3844599e+35, -2.800067e+35]


Gradient weights l1 (c)... 1568 ... [-nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan]


I would have thought the gradients would be averaged? Regardless, it appears to become unstable as the batch size increases, whereas a batch size of 4 shows what you would expect - consistent gradients (everything else remaining constant - dropouts removed).


Is it overflowing somewhere in the process, or is it a memory issue?
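
For reference, this is roughly how those values are being printed from the data source's CPU-side update callback (a sketch; `dumpGradients` and its `label` parameter are just illustrative names, and the gradient buffer is assumed to hold float32 values):

```swift
import MetalPerformanceShaders

// Sketch: called from the data source's CPU-side update(with:sourceState:) callback.
// `dumpGradients` and `label` are illustrative names; the gradient buffer is assumed
// to hold float32 values (the data source here uses float32 storage).
func dumpGradients(from gradientState: MPSCNNConvolutionGradientState, label: String) {
    let buffer = gradientState.gradientForWeights
    let count = buffer.length / MemoryLayout<Float>.stride
    let pointer = buffer.contents().bindMemory(to: Float.self, capacity: count)
    let firstFew = Array(UnsafeBufferPointer(start: pointer, count: min(11, count)))
    print("Gradient weights \(label) ... \(count) ... \(firstFew)")
}
```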

I didn't get around to the code part yet.


I don't think you can train without MPSNNGraph? Turi Create also uses MPSNNGraph, see here: https://github.com/apple/turicreate/blob/master/src/unity/toolkits/neural_net/mps_graph_networks.mm


Of course, Turi Create runs on macOS, and the Mac version of MPS is not necessarily the same as the iOS one.

I think MPSNNGraph acts as a facade - everything seems to be exposed via the filters. It would just require you to make all the connections and manage all the intermediate images and states yourself, but it's possible.


Yes - I've been browsing through Turi Create but find it difficult to follow (a reminder of how important comments are).


I've submitted a ticket to Apple developer support - hopefully I'll get an answer this year (although I appreciate it's Christmas, developers are probably releasing updates/new apps like mad, and everyone else is trying to wind down for the year) - will keep you posted.

If you are on desktop, please don't forget to synchronize your resources. AMD in particular is fond of stuffing NaNs in output buffers that are not synchronized.

https://developer.apple.com/documentation/metal/synchronizing_a_managed_resource

See also MPS class routines for same.
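
For example, something along these lines before reading weights back on the CPU (a sketch; `conv` and `commandBuffer` are assumed to be your convolution kernel and the command buffer that ran the training pass, with float32 weight storage):

```swift
import MetalPerformanceShaders

// Sketch: export the weights, synchronize the state for CPU access,
// and only read the buffer after the command buffer has completed.
func readBackWeights(conv: MPSCNNConvolution, commandBuffer: MTLCommandBuffer) -> [Float] {
    let state = conv.exportWeightsAndBiases(with: commandBuffer,
                                            resultStateCanBeTemporary: false)
    state.synchronize(on: commandBuffer)   // MPS-level equivalent of the blit synchronize
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // Safe to read on the CPU now (assuming float32 storage).
    let count = state.weights.length / MemoryLayout<Float>.stride
    let pointer = state.weights.contents().bindMemory(to: Float.self, capacity: count)
    return Array(UnsafeBufferPointer(start: pointer, count: count))
}
```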