scaaml - very slow processing after some time - tensorflow metal

Hi, I'm running scaaml. It starts off fine, but after several iterations it slows right down.

2022-07-04 06:25:08.264476: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-07-04 06:25:08.268023: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2048/2048 [==============================] - 512s 250ms/step - loss: 1.8051 - acc: 0.3809 - val_loss: 1.9365 - val_acc: 0.3350
Epoch 19/30
 536/2048 [======>.......................] - ETA: 44:10:15 - loss: 1.7715 - acc: 0.3911

Previous flows were processed in a reasonable amount of time:

…173: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-07-04 06:16:20.906834: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 5 of 46). These functions will not be directly callable after loading.
2048/2048 [==============================] - 538s 263ms/step - loss: 1.8303 - acc: 0.3744 - val_loss: 1.8793 - val_acc: 0.3452
Epoch 18/30
2048/2048 [==============================] - ETA: 0s - loss: 1.8051 - acc: 0.3809
2022-07-04 06:25:08.264476: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-07-04 06:25:08.268023: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2048/2048 [==============================] - 512s 250ms/step - loss: 1.8051 - acc: 0.3809 - val_loss: 1.

I'm running the code elsewhere and it runs just fine.

I could run other GPU tasks and they picked up the GPU with no problem. It's as if, after running for an extended period, the resources/application stopped working properly but kept running, only very slowly.

Answered by Frameworks Engineer in 720497022

Thank you for the details! We've been able to reproduce this and are looking at the problem more closely.

After running the same code on the same samples, it happened again.

Epoch 19/30
 504/2048 [======>.......................] - ETA: 21:47:06 - loss: 1.8561 - acc: 0.371

I don't think it is a coincidence.

Run your code again, open Console.app, and start streaming messages with the Errors and Faults filters. If you see a lot of

IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID ???? likely leaking IOGPUResource (count=????)

after some time, we are experiencing the same issue.

In my case, when the count above reaches 200000, the training will stop making progress.

See https://developer.apple.com/forums/thread/706920
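If it is easier to watch for this from a terminal than from Console.app, a minimal sketch along these lines streams the unified log and filters for the IOGPUResource messages (the predicate string is an assumption; adjust it to match the exact message text on your system):

import subprocess

# Stream the macOS unified log and keep only the suspected leak messages.
# The predicate is an assumption; tweak it if the IOGPUDevice wording differs.
cmd = [
    "log", "stream",
    "--predicate", 'eventMessage CONTAINS "IOGPUResource"',
]
with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        # The count=... value should keep climbing while training slows down.
        print(line, end="")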

Hello @alz0r, thanks for reporting your issue! We couldn't reproduce it right away, so could you please provide some more information, such as:

  • tensorflow-macos version
  • tensorflow-metal version
  • OS version
  • Which machine you're running on

Also, how many epochs did it take for the training to considerably slow down?

It would also be very helpful if you provided the script (along with its parameters) that you used when you ran into the issue; we tried the demo in scaaml_intro and it did not slow down after running for hours. The more details you can provide, the easier it will be to identify the problem.
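For reference, a small sketch like the following can collect most of those details in one go (assuming the packages were installed via pip under the names tensorflow-macos and tensorflow-metal):

import platform
from importlib import metadata

import tensorflow as tf

# Installed pip package versions (package names assumed to be the usual ones)
print("tensorflow-macos:", metadata.version("tensorflow-macos"))
print("tensorflow-metal:", metadata.version("tensorflow-metal"))

# Version reported by TensorFlow itself, plus basic OS/machine details
print("tf.__version__  :", tf.__version__)
print("macOS version   :", platform.mac_ver()[0])
print("machine         :", platform.platform())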

Hi,

tensorflow-macos             2.9.2
tensorflow-metal             0.5.0
macOS Monterey 12.4 (patched and up to date)
Machine : iMac Retina 5K,
          27-inch, 2020,
          3.8 GHz 8-core Intel Core i7,
          128 GB 2667 MHz DDR4,
          Graphics: AMD Radeon Pro 5500 XT 8 GB

Command to run (as per documentation)

python3 train.py -c config/stm32f415_tinyaes.json

When running on the GPU the slowdown occurs at exactly the same epoch (19). As a test I disabled the GPU in a duplicate script and, whilst it takes considerably longer per step, it passes epoch 19 without issue. As you can see, with the GPU enabled the ETA for epoch 19 has gone up to 122:06:17.

Command to run (for CPU only, slight modification to the script included below)

python3 train_cpu.py -c config/stm32f415_tinyaes.json

Script modification to disable the GPU (I have left in the preceding and following lines of the original script so the placement can be identified; otherwise it's identical):

from scaaml.utils import tf_cap_memory

# tf is assumed to be imported as tensorflow earlier in the original train.py
try:
    # Hide all GPUs from TensorFlow so training runs on the CPU only
    tf.config.set_visible_devices([], 'GPU')
    visible_devices = tf.config.get_visible_devices()
    for device in visible_devices:
        assert device.device_type != 'GPU'
except (ValueError, RuntimeError):
    # Invalid device or cannot modify virtual devices once initialized.
    pass

def train_model(config):
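As a quick way to confirm the modification took effect, a couple of extra print statements (a sketch; they would sit right after the try/except block above, before train_model is called) show whether TensorFlow still sees the GPU:

# Sanity check: the list of visible GPUs should now be empty, while the
# physical GPU is still reported as present in the machine.
print("visible GPUs :", tf.config.get_visible_devices('GPU'))    # expected: []
print("physical GPUs:", tf.config.list_physical_devices('GPU'))  # Radeon still listed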

CPU ONLY

2048/2048 [==============================] - 5014s 2s/step - loss: 1.3966 - acc: 0.4811 - val_loss: 1.5574 - val_acc: 0.4297
Epoch 25/30
1502/2048 [=====================>........] - ETA: 22:02 - loss: 1.3701 - acc: 0.4919

GPU ENABLED

2022-07-05 14:43:20.822168: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 5 of 46). These functions will not be directly callable after loading.
2048/2048 [==============================] - 516s 252ms/step - loss: 1.9292 - acc: 0.3521 - val_loss: 1.9108 - val_acc: 0.3503
Epoch 18/30
2048/2048 [==============================] - ETA: 0s - loss: 1.8986 - acc: 0.3598
2022-07-05 14:52:39.447402: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-07-05 14:52:39.450685: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2048/2048 [==============================] - 546s 267ms/step - loss: 1.8986 - acc: 0.3598 - val_loss: 2.0514 - val_acc: 0.3303
Epoch 19/30
 741/2048 [=========>....................] - ETA: 122:06:17 - loss: 1.8543 - acc: 0.3750
/Users/alan/.pyenv/versions/3.9.5/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker

I have run the code on an external Linux system with GPUs and it runs without problems. This is blocking my research project (MSc): whilst I can still use CPU-only mode, the idea is to compare/baseline against various platforms and functionalities (whilst also using my own traces), so it is relevant to be able to use all the features available on the host system (the GPU in this case).

Hope this helps and you can offer a solution.

Regards,

alz0r


Accepted Answer

Thank you for the details! We've been able to reproduce this and are looking at the problem more closely.

Hi Team, any updates please? I'd appreciate any findings so I can add them to my research log. Kind Regards, Alze

Hi Team,

Any updates? Because I have to make progress on my research, I'm now using alternative platforms (i.e. I've bought a new laptop with a CUDA/NVIDIA GPU). However, it would be good to solve this issue: I shouldn't have to dual-boot a technically higher-specification machine into Ubuntu (which brings its own problems). I should be able to run TensorFlow optimized for GPUs on my Mac under its native OS, and I'm sure many other ML/AI people expect the same. Are we going to see any progress on this issue?

Kind Regards, Alze.
