Apple TensorFlow Internal Error (0000000e:Internal Error)

When I train a model (private, for work) using Apple's TensorFlow build, I get an error like this:

    The Metal Performance Shaders operations encoded on it may not have completed.
    Error:
    (null)
    Internal Error (0000000e:Internal Error)
    <AGXG13XFamilyCommandBuffer: 0x355c49fc0>
        label = <none>
        device = <AGXG13XDevice: 0x10d981400>
            name = Apple M1 Pro
        commandQueue = <AGXG13XFamilyCommandQueue: 0x11dedb600>
            label = <none>
            device = <AGXG13XDevice: 0x10d981400>
                name = Apple M1 Pro
        retainedReferences = 1

When I run the same script on a server with a GeForce GPU, it works fine.

The error already occurs during the first epoch. Memory also appears to leak: usage starts at 3 GB and reaches 20 GB within that epoch.

Does anyone know how to deal with this problem? Thank you!

I don't know why this happens, but for me it only occurs when I'm computing gradients against a loss tensor that is non-flat. Oddly, even with this error, my model still trained.
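
If "non-flat" here means a loss tensor that still has dimensions (for example one value per example) rather than a single scalar, one thing to try is reducing the loss to a scalar before asking for gradients. That reading is my interpretation, not something I have confirmed as the cause. The sketch below uses a hypothetical toy model and random data purely for illustration; it is not the model from the question and is not a guaranteed fix for the Metal error.

```python
import tensorflow as tf

# Hypothetical toy model and data, only to illustrate the idea.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(3),
])
x = tf.random.normal((8, 4))
y = tf.random.uniform((8,), maxval=3, dtype=tf.int32)

# reduction="none" keeps one loss value per example, i.e. shape (8,).
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"
)

with tf.GradientTape() as tape:
    logits = model(x, training=True)
    per_example_loss = loss_fn(y, logits)
    # Collapse the per-example losses to a single scalar
    # before differentiating.
    loss = tf.reduce_mean(per_example_loss)

# Gradients are taken against the scalar loss, not the (8,) tensor.
grads = tape.gradient(loss, model.trainable_variables)
```

With a custom training loop this is a one-line change; with `model.fit`, the built-in Keras losses already reduce to a scalar by default, so it would mainly matter for custom losses.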
