This does not seem to be affecting the training, but it looks important (I have no idea how to read it, though):
Error: command buffer exited with error status.
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG13XFamilyCommandBuffer: 0x29b027b50>
label = <none>
device = <AGXG13XDevice: 0x12da25600>
name = Apple M1 Max
commandQueue = <AGXG13XFamilyCommandQueue: 0x106477000>
label = <none>
device = <AGXG13XDevice: 0x12da25600>
name = Apple M1 Max
retainedReferences = 1
This happens while training a "heavy" model on a "heavy" dataset, so it may be related to a memory issue, but I have no idea how to address it.
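
To check the memory hypothesis, something like the following could log peak memory while training runs, so the error can be compared against memory growth. This is only a minimal sketch, assuming the training loop is written in Python. `peak_rss_mib` and `log_memory` are hypothetical helper names, not part of any framework, and the process-level figure may not reflect every GPU allocation made through Metal.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(step, every=100):
    """Print peak memory every `every` steps.

    Returns the logged value, or None when the step is skipped.
    """
    if step % every != 0:
        return None
    peak = peak_rss_mib()
    print(f"step {step}: peak RSS {peak:.1f} MiB")
    return peak
```

Calling `log_memory(step)` once per training step (the `step` counter is assumed to come from the training loop) would show whether peak memory keeps climbing before the command buffer error appears.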