Upgrading tensorflow-macos and tensorflow-metal not only breaks Conv2D with the groups argument, it also prevents training from finishing.
Today, after upgrading tensorflow-macos to 2.9.0 and tensorflow-metal to 0.5.0, my notebook stops making progress after about 16 minutes of training.
I tested four times. Each run completed about 17 to 18 epochs, at around 55 seconds per epoch, and then simply stopped making progress.
I checked Activity Monitor; both CPU and GPU usage were 0 at that point.
By chance, I found a lot of kernel faults in the Console app.
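For anyone trying to reproduce this, it helps to confirm which versions the notebook kernel actually loads. Here is a minimal sketch using only the Python standard library; the helper name `installed_version` is my own, not part of TensorFlow.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the two packages discussed in this post.
for name in ("tensorflow-macos", "tensorflow-metal"):
    print(f"{name}: {installed_version(name)}")
```

Running this inside the notebook, rather than in a separate terminal, shows the versions of the environment the training process really uses.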
The last one before I force-killed the process:
IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID 68905 likely leaking IOGPUResource (count=200000)
PID 68905 is in fact the training process.
I have seen this kind of issue on and off for several months, but it was less frequent and I could restart the notebook and train successfully. No luck today.
I hope Apple engineers can find the cause and fix it.
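To match a message like this against the training process without reading PIDs by eye, something like the sketch below may help. It assumes the message keeps the exact wording quoted above; the pattern and the helper name `parse_leak_line` are my own. Inside a notebook, `os.getpid()` returns the PID of the kernel process that runs the training.

```python
import os
import re

# Pattern based on the wording of the Console message quoted in this post.
LEAK_PATTERN = re.compile(r"PID (\d+) likely leaking IOGPUResource \(count=(\d+)\)")


def parse_leak_line(line):
    """Return (pid, count) from a leak message, or None if the line does not match."""
    match = LEAK_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, "
    "struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): "
    "PID 68905 likely leaking IOGPUResource (count=200000)"
)

pid, count = parse_leak_line(sample)
print(f"leaking PID: {pid}, resource count: {count}")
print(f"is this process: {pid == os.getpid()}")
```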