I've been using tensorflow-metal together with JupyterLab.
Sometimes during training, the notebook stops printing training progress. The training process appears to be dead, since interrupting the kernel has no effect. I have to restart the kernel and start training again.
The problem doesn't always occur, and I couldn't tell what caused it, until I recently started using tensorflow-probability. With it, I can reproduce the problem 100% of the time on my machine.
Here is the demo to reproduce the problem.
import numpy as np
import tensorflow as tf
#tf.config.set_visible_devices([], 'GPU')
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
from tensorflow_probability import sts
STEPS = 10000
BATCH = 64
noise = np.random.random(size=(BATCH, STEPS)) * 0.5
signal = np.sin((
    np.broadcast_to(np.arange(STEPS), (BATCH, STEPS)) / (10 + np.random.random(size=(BATCH, 1)))
    + np.random.random(size=(BATCH, 1))
) * np.pi * 2)
data = noise + signal
data = data.astype(np.float32) # float64 would fail under GPU training, no idea why
def build_model(observed):
    season = sts.Seasonal(
        num_seasons=10,
        num_steps_per_season=1,
        observed_time_series=observed,
    )
    model = sts.Sum([
        season,
    ], observed_time_series=observed)
    return model
model = build_model(data)
variational_posteriors = sts.build_factored_surrogate_posterior(model=model)
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=model.joint_distribution(observed_time_series=data).log_prob,
    surrogate_posterior=variational_posteriors,
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    num_steps=5,
)
print('loss', loss_curve)
After starting the demo with python demo.py, I can see the python process running and consuming CPU and GPU. Then the CPU and GPU usage drops to zero, and the process never prints anything.
The process doesn't respond to Ctrl+C, and I have to force kill it.
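For anyone trying to reproduce this repeatedly, a minimal sketch of running the demo in a child process with a timeout might save the manual force kill. The helper name run_with_timeout and the 600-second limit are hypothetical and not part of the original demo. It assumes the hung process can still be terminated by a kill signal, which matches the force kill described above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process and kill it if it exceeds timeout_s seconds.

    Returns (finished, returncode). finished is False when the child did
    not exit in time and had to be killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=timeout_s)
        return True, returncode
    except subprocess.TimeoutExpired:
        proc.kill()               # SIGKILL on POSIX; the child cannot ignore it
        returncode = proc.wait()  # reap the child so it does not linger
        return False, returncode


# Hypothetical usage: "demo.py" is the script from this post, and
# 600 seconds is an arbitrary limit, not a measured value.
# finished, code = run_with_timeout([sys.executable, "demo.py"], timeout_s=600)
# print("finished" if finished else "hung and was killed", code)
```

This only detects and cleans up the hang. It does not explain or fix it.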
I used Activity Monitor to sample the "dead" process. It shows that many threads are waiting, including the main thread.
...
+ 2228 _pthread_cond_wait (in libsystem_pthread.dylib) + 1228 [0x180659808]
+ 2228 __psynch_cvwait (in libsystem_kernel.dylib) + 8 [0x1806210c0]
And some Metal threads:
...
+ 2228 tensorflow::PluggableDeviceContext::CopyDeviceTensorToCPU(tensorflow::Tensor const*, absl::lts_20210324::string_view, tensorflow::Device*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>) (in _pywrap_tensorflow_internal.so) + 152 [0x28006290c]
+ 2228 tensorflow::PluggableDeviceUtil::CopyPluggableDeviceTensorToCPU(tensorflow::Device*, tensorflow::DeviceContext const*, tensorflow::Tensor const*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>) (in _pywrap_tensorflow_internal.so) + 320 [0x2800689bc]
+ 2228 stream_executor::Stream::ThenMemcpy(void*, stream_executor::DeviceMemoryBase const&, unsigned long long) (in _pywrap_tensorflow_internal.so) + 116 [0x286f0b08c]
+ 2228 stream_executor::(anonymous namespace)::CStreamExecutor::Memcpy(stream_executor::Stream*, void*, stream_executor::DeviceMemoryBase const&, unsigned long long) (in _pywrap_tensorflow_internal.so) + 128 [0x2816595c8]
+ 2228 metal_plugin::memcpy_dtoh(SP_Device const*, SP_Stream_st*, void*, SP_DeviceMemoryBase const*, unsigned long long, TF_Status*) (in libmetal_plugin.dylib) + 444 [0x126acc224]
+ 2228 ??? (in AGXMetalG13X) load address 0x1c5cd0000 + 0x1c5ad8 [0x1c5e95ad8]
+ 2228 -[IOGPUMetalBuffer initWithDevice:pointer:length:options:sysMemSize:gpuAddress:args:argsSize:deallocator:] (in IOGPU) + 332 [0x19ac3ae3c]
+ 2228 -[IOGPUMetalResource initWithDevice:remoteStorageResource:options:args:argsSize:] (in IOGPU) + 476 [0x19ac469f8]
+ 2228 IOGPUResourceCreate (in IOGPU) + 224 [0x19ac4c970]
+ 2228 IOConnectCallMethod (in IOKit) + 236 [0x183104bc4]
+ 2228 io_connect_method (in IOKit) + 440 [0x183104da8]
+ 2228 mach_msg (in libsystem_kernel.dylib) + 76 [0x18061dd00]
+ 2228 mach_msg_trap (in libsystem_kernel.dylib) + 8 [0x18061d954]
I'm no expert, but it looks like there is a deadlock.
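The samples above only show the native stacks. To also see which Python call the process is stuck in, a rough sketch using the standard library's faulthandler module could be added at the top of the demo. The helper names arm_hang_watchdog and disarm_hang_watchdog and the timeout value are hypothetical. As far as I understand, the dump is written by a separate watchdog thread, so it should still fire while the main thread is blocked in native code, but I have not verified this against this particular hang.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, file=sys.stderr):
    """Dump the Python traceback of every thread to file each time
    timeout_s seconds pass without the watchdog being cancelled."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training has finished normally."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the training call in the demo:
# arm_hang_watchdog(120)   # 120 seconds is an arbitrary guess
# loss_curve = tfp.vi.fit_surrogate_posterior(...)
# disarm_hang_watchdog()
```

If the process hangs, the traceback printed to stderr should show which Python frame was waiting on the device-to-host copy seen in the native sample.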
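Training on the CPU works: uncommenting the tf.config.set_visible_devices([], 'GPU') line near the top of the demo (line 4) hides the GPU, and the demo then finishes normally.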
Here is my configuration.
- MacBook Pro (14-inch, 2021)
- Apple M1 Pro 32 GB
- macOS 12.2.1 (21D62)
- tensorflow-deps 2.8.0
- tensorflow-macos 2.8.0
- tensorflow-metal 0.4.0
- tensorflow-probability 0.16.0