GPU training deadlock

I've been using tensorflow-metal together with JupyterLab.

Sometimes during training, the notebook stops printing training progress. The training process seems dead: interrupting the kernel gets no response, and I have to restart the kernel and train again.

The problem doesn't always occur, and I couldn't tell what the cause was.

Then I recently started using tensorflow-probability, and now I can reproduce the problem 100% of the time on my machine.

Here is a demo that reproduces the problem.

import numpy as np
import tensorflow as tf

#tf.config.set_visible_devices([], 'GPU')  # uncomment to train on the CPU only

import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
from tensorflow_probability import sts

STEPS = 10000
BATCH = 64

# Synthetic data: a batch of noisy sinusoids with random period and phase.
noise = np.random.random(size=(BATCH, STEPS)) * 0.5
signal = np.sin((
    np.broadcast_to(np.arange(STEPS), (BATCH, STEPS)) / (10 + np.random.random(size=(BATCH, 1)))
    + np.random.random(size=(BATCH, 1))
) * np.pi * 2)

data = noise + signal

data = data.astype(np.float32) # float64 would fail under GPU training, no idea why

def build_model(observed):
    season = sts.Seasonal(
        num_seasons=10,
        num_steps_per_season=1,
        observed_time_series=observed,
    )

    model = sts.Sum([
        season,
    ], observed_time_series=observed)

    return model

model = build_model(data)

# Mean-field variational surrogate posterior over the model parameters.
variational_posteriors = sts.build_factored_surrogate_posterior(model=model)

# Fit the surrogate posterior for a few steps; this is where the hang occurs on GPU.
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=model.joint_distribution(observed_time_series=data).log_prob,
    surrogate_posterior=variational_posteriors,
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    num_steps=5,
)

print('loss', loss_curve)

After starting the demo with python demo.py, I can observe the Python process running and consuming CPU and GPU. Then, at some point, CPU and GPU usage drops to zero and it never prints anything.

The process doesn't respond to Ctrl+C, and I have to force-kill it.

I used Activity Monitor to sample the "dead" process. It shows that a lot of threads are waiting, including the main thread:

...
    +       2228 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1228  [0x180659808]
    +         2228 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1806210c0]

And some Metal threads:

...
    +                                         2228 tensorflow::PluggableDeviceContext::CopyDeviceTensorToCPU(tensorflow::Tensor const*, absl::lts_20210324::string_view, tensorflow::Device*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>)  (in _pywrap_tensorflow_internal.so) + 152  [0x28006290c]
    +                                           2228 tensorflow::PluggableDeviceUtil::CopyPluggableDeviceTensorToCPU(tensorflow::Device*, tensorflow::DeviceContext const*, tensorflow::Tensor const*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>)  (in _pywrap_tensorflow_internal.so) + 320  [0x2800689bc]
    +                                             2228 stream_executor::Stream::ThenMemcpy(void*, stream_executor::DeviceMemoryBase const&, unsigned long long)  (in _pywrap_tensorflow_internal.so) + 116  [0x286f0b08c]
    +                                               2228 stream_executor::(anonymous namespace)::CStreamExecutor::Memcpy(stream_executor::Stream*, void*, stream_executor::DeviceMemoryBase const&, unsigned long long)  (in _pywrap_tensorflow_internal.so) + 128  [0x2816595c8]
    +                                                 2228 metal_plugin::memcpy_dtoh(SP_Device const*, SP_Stream_st*, void*, SP_DeviceMemoryBase const*, unsigned long long, TF_Status*)  (in libmetal_plugin.dylib) + 444  [0x126acc224]
    +                                                   2228 ???  (in AGXMetalG13X)  load address 0x1c5cd0000 + 0x1c5ad8  [0x1c5e95ad8]
    +                                                     2228 -[IOGPUMetalBuffer initWithDevice:pointer:length:options:sysMemSize:gpuAddress:args:argsSize:deallocator:]  (in IOGPU) + 332  [0x19ac3ae3c]
    +                                                       2228 -[IOGPUMetalResource initWithDevice:remoteStorageResource:options:args:argsSize:]  (in IOGPU) + 476  [0x19ac469f8]
    +                                                         2228 IOGPUResourceCreate  (in IOGPU) + 224  [0x19ac4c970]
    +                                                           2228 IOConnectCallMethod  (in IOKit) + 236  [0x183104bc4]
    +                                                             2228 io_connect_method  (in IOKit) + 440  [0x183104da8]
    +                                                               2228 mach_msg  (in libsystem_kernel.dylib) + 76  [0x18061dd00]
    +                                                                 2228 mach_msg_trap  (in libsystem_kernel.dylib) + 8  [0x18061d954]

I'm no expert, but it looks like there is a deadlock.

Training on the CPU works if I uncomment line 4 (tf.config.set_visible_devices([], 'GPU')).
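In case it helps others, here is a minimal sketch (my own workaround, not an official fix) of making the CPU-only fallback explicit at the top of the script and verifying that it took effect; note that set_visible_devices has to run before any tensors or ops are created:

import tensorflow as tf

# Hide the GPU from TensorFlow so everything, including tensorflow-probability,
# runs on the CPU. Must be called before any tensor is created.
tf.config.set_visible_devices([], 'GPU')

# Sanity check: the visible device list should now contain only CPU devices.
print(tf.config.get_visible_devices())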

Here is my configuration.

  • MacBook Pro (14-inch, 2021)
  • Apple M1 Pro 32 GB
  • macOS 12.2.1 (21D62)
  • tensorflow-deps           2.8.0  
  • tensorflow-macos          2.8.0  
  • tensorflow-metal          0.4.0  
  • tensorflow-probability    0.16.0 

Hi @wangcheng

Thanks for reporting this, and especially for the script to reproduce it. I can confirm that the script reproduces the issue for me locally, so we can start investigating the reason for this deadlock. I will post an update once we have solved the issue and can confirm when the fix will be out.

We have identified the issue here. The provided demo network calls multiple ops that do not have a GPU implementation registered and therefore fall back onto the CPU, causing a lot of dead time. This makes it look like a deadlock, as the threads are busy waiting even though the computations are running (very slowly) in the background.

The list of ops running on the CPU was Softplus, RandomUniformInt, StatelessRandomUniformFullIntV2, BitwiseXor, StatelessRandomGetKeyCounter, MatrixDiagV3, MatrixBandPart, MatrixDiag, Any, StatelessRandomUniform, and StatelessRandomNormalV2. We have added these to the list of ops to accelerate in the Metal plugin, but in the meantime we recommend running this model on the CPU.
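For anyone who wants to check which ops end up on the CPU with their own model, one standard TensorFlow knob (not something specific to the Metal plugin) is device placement logging, enabled before any ops run:

import tensorflow as tf

# Print the device every op is placed on; entries mentioning CPU:0 are the
# ops that do not run on the Metal GPU. Enable this right after importing
# TensorFlow, before the training call.
tf.debugging.set_log_device_placement(True)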

Hi Apple engineers, I think I've found the issue: tensorflow-metal is leaking GPU resources.

Running that script for about 20 seconds, I see a stream of IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID 3963 likely leaking IOGPUResource (count=101000) messages in Console.app.

See also https://developer.apple.com/forums/thread/706920

Reinstalling seems to have solved it for me (a quick verification snippet follows the list):

  • pip uninstall tensorflow-macos
  • pip uninstall tensorflow-metal
  • python -m pip install tensorflow-macos
    * Name: tensorflow-macos
    * Version: 2.12.0
  • python -m pip install tensorflow-metal
    * Name: tensorflow-metal
    * Version: 0.8.0
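
After the reinstall, a quick sanity check I use (just a snippet of my own, not an official test) to confirm the version and that the Metal plugin still exposes the GPU:

import tensorflow as tf

print(tf.__version__)                          # expect 2.12.0 after the reinstall
print(tf.config.list_physical_devices('GPU'))  # should list the Apple GPU if tensorflow-metal loaded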