When I train a model (private, for work) using Apple's TensorFlow, I get an error like this:
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG13XFamilyCommandBuffer: 0x355c49fc0>
label = <none>
device = <AGXG13XDevice: 0x10d981400>
name = Apple M1 Pro
commandQueue = <AGXG13XFamilyCommandQueue: 0x11dedb600>
label = <none>
device = <AGXG13XDevice: 0x10d981400>
name = Apple M1 Pro
retainedReferences = 1
When I run the same script on a server with a GeForce GPU, it works fine.
The error already occurs during the first epoch. Memory also appears to leak: usage starts at about 3 GB and reaches 20 GB within that epoch.
Does anyone know how to deal with this problem? Thank you!
I'd like to control whether the network training happens on CPU or GPU when using tensorflow-metal.
How can I do this?
Thanks!
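One possible approach, as a minimal sketch: it assumes that tensorflow-metal exposes the Apple GPU as an ordinary `GPU` physical device, so that TensorFlow's standard device-placement calls (`tf.config.list_physical_devices`, `tf.device`) apply. The helper names `choose_device` and `run_on` are hypothetical and only for illustration.

```python
def choose_device(prefer_gpu: bool, gpu_count: int) -> str:
    """Return a TensorFlow device string, falling back to CPU when no GPU is visible.

    Hypothetical helper, not part of TensorFlow.
    """
    if prefer_gpu and gpu_count > 0:
        return "/GPU:0"
    return "/CPU:0"


def run_on(prefer_gpu: bool) -> str:
    """Run a small matmul on the chosen device and return the device string used."""
    # Imported here so choose_device can be used without TensorFlow installed.
    import tensorflow as tf

    # Assumption: with tensorflow-metal installed, the Apple GPU is listed here.
    gpus = tf.config.list_physical_devices("GPU")
    device = choose_device(prefer_gpu, len(gpus))

    # Ops created inside this scope are requested to run on `device`.
    with tf.device(device):
        a = tf.random.normal((256, 256))
        b = tf.matmul(a, a)

    print("requested:", device, "| placed on:", b.device)
    return device
```

Calling `run_on(prefer_gpu=False)` should keep the work on the CPU. For training, the model construction and the `model.fit` call would go inside the same `with tf.device(...)` scope. Another option that may work is hiding the GPU entirely with `tf.config.set_visible_devices([], "GPU")`, which has to be called before TensorFlow initializes its devices.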