I've been using tensorflow-metal together with JupyterLab.
Sometimes during training, the notebook stops printing training progress. The training process appears to be dead: interrupting the kernel gets no response, and I have to restart the kernel and train again.
The problem doesn't always occur, and I couldn't tell what caused it.
That changed recently when I started using tensorflow-probability: now I can reproduce the problem on my machine 100% of the time.
Here is a demo that reproduces the problem:
import numpy as np
import tensorflow as tf
#tf.config.set_visible_devices([], 'GPU')
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
from tensorflow_probability import sts
STEPS = 10000
BATCH = 64
noise = np.random.random(size=(BATCH, STEPS)) * 0.5
signal = np.sin((
    np.broadcast_to(np.arange(STEPS), (BATCH, STEPS)) / (10 + np.random.random(size=(BATCH, 1)))
    + np.random.random(size=(BATCH, 1))
) * np.pi * 2)
data = noise + signal
data = data.astype(np.float32) # float64 would fail under GPU training, no idea why
def build_model(observed):
    season = sts.Seasonal(
        num_seasons=10,
        num_steps_per_season=1,
        observed_time_series=observed,
    )
    model = sts.Sum([
        season,
    ], observed_time_series=observed)
    return model

model = build_model(data)
variational_posteriors = sts.build_factored_surrogate_posterior(model=model)
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=model.joint_distribution(observed_time_series=data).log_prob,
    surrogate_posterior=variational_posteriors,
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    num_steps=5,
)
print('loss', loss_curve)
After starting the demo with python demo.py, I can see the Python process running and consuming CPU and GPU. Then the CPU and GPU usage drops to zero, and nothing is ever printed.
The process doesn't respond to Ctrl+C, and I have to force-kill it.
I used Activity Monitor to sample the "dead" process. It shows a lot of threads waiting, including the main thread:
...
+ 2228 _pthread_cond_wait (in libsystem_pthread.dylib) + 1228 [0x180659808]
+ 2228 __psynch_cvwait (in libsystem_kernel.dylib) + 8 [0x1806210c0]
And some Metal threads:
...
+ 2228 tensorflow::PluggableDeviceContext::CopyDeviceTensorToCPU(tensorflow::Tensor const*, absl::lts_20210324::string_view, tensorflow::Device*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>) (in _pywrap_tensorflow_internal.so) + 152 [0x28006290c]
+ 2228 tensorflow::PluggableDeviceUtil::CopyPluggableDeviceTensorToCPU(tensorflow::Device*, tensorflow::DeviceContext const*, tensorflow::Tensor const*, tensorflow::Tensor*, std::__1::function<void (tensorflow::Status const&)>) (in _pywrap_tensorflow_internal.so) + 320 [0x2800689bc]
+ 2228 stream_executor::Stream::ThenMemcpy(void*, stream_executor::DeviceMemoryBase const&, unsigned long long) (in _pywrap_tensorflow_internal.so) + 116 [0x286f0b08c]
+ 2228 stream_executor::(anonymous namespace)::CStreamExecutor::Memcpy(stream_executor::Stream*, void*, stream_executor::DeviceMemoryBase const&, unsigned long long) (in _pywrap_tensorflow_internal.so) + 128 [0x2816595c8]
+ 2228 metal_plugin::memcpy_dtoh(SP_Device const*, SP_Stream_st*, void*, SP_DeviceMemoryBase const*, unsigned long long, TF_Status*) (in libmetal_plugin.dylib) + 444 [0x126acc224]
+ 2228 ??? (in AGXMetalG13X) load address 0x1c5cd0000 + 0x1c5ad8 [0x1c5e95ad8]
+ 2228 -[IOGPUMetalBuffer initWithDevice:pointer:length:options:sysMemSize:gpuAddress:args:argsSize:deallocator:] (in IOGPU) + 332 [0x19ac3ae3c]
+ 2228 -[IOGPUMetalResource initWithDevice:remoteStorageResource:options:args:argsSize:] (in IOGPU) + 476 [0x19ac469f8]
+ 2228 IOGPUResourceCreate (in IOGPU) + 224 [0x19ac4c970]
+ 2228 IOConnectCallMethod (in IOKit) + 236 [0x183104bc4]
+ 2228 io_connect_method (in IOKit) + 440 [0x183104da8]
+ 2228 mach_msg (in libsystem_kernel.dylib) + 76 [0x18061dd00]
+ 2228 mach_msg_trap (in libsystem_kernel.dylib) + 8 [0x18061d954]
I'm no expert, but it looks like a deadlock.
Training on the CPU works if I uncomment the tf.config.set_visible_devices([], 'GPU') line near the top of the demo.
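In case it helps anyone else, here is a minimal sketch of the two ways I know to keep the work off the GPU. Hiding the GPU is what I actually use; the tf.device scope is an alternative I assume behaves the same way but haven't tested as thoroughly.
import tensorflow as tf

# Option 1: hide the GPU from TensorFlow entirely (must run before any ops execute).
tf.config.set_visible_devices([], 'GPU')

# Option 2 (untested alternative): pin a specific block of work to the CPU.
with tf.device('/CPU:0'):
    x = tf.random.normal((4, 4))
    y = tf.linalg.matmul(x, x)
print(y.device)  # should report a CPU device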
Here is my configuration:
MacBook Pro (14-inch, 2021)
Apple M1 Pro 32 GB
macOS 12.2.1 (21D62)
tensorflow-deps 2.8.0
tensorflow-macos 2.8.0
tensorflow-metal 0.4.0
tensorflow-probability 0.16.0
Today I upgraded tensorflow-macos to 2.9.0 and tensorflow-metal to 0.5.0, and found that my old notebook no longer runs.
It ran fine with tensorflow-macos 2.8.0 and tensorflow-metal 0.4.0.
Specifically, I found that the groups argument of the Conv2D layer is the cause.
Here is a demo:
import tensorflow as tf
from tensorflow import keras as tfk
# tf.config.set_visible_devices([], 'GPU')
Xs = tf.random.normal((32, 64, 48, 4))
ys = tf.random.normal((32,))
tf.random.set_seed(0)
model = tfk.Sequential([
    tfk.layers.Conv2D(
        filters=16,
        kernel_size=(4, 3),
        groups=4,  # the groups argument
        activation='relu',
    ),
    tfk.layers.Flatten(),
    tfk.layers.Dense(1, activation='sigmoid'),
])
model.compile(
    loss=tfk.losses.BinaryCrossentropy(),
    metrics=[
        tfk.metrics.BinaryAccuracy(),
    ],
)
model.fit(Xs, ys, epochs=2, verbose=1)
The error is:
W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:296 : UNIMPLEMENTED: Could not find compiler for platform METAL: NOT_FOUND: could not find registered compiler for platform METAL -- check target linkage
Removing the groups argument makes the code run again.
Training on the CPU, by uncommenting the tf.config.set_visible_devices([], 'GPU') line, gives a different error:
'apple-m1' is not a recognized processor for this target (ignoring processor)
LLVM ERROR: 64-bit code requested on a subtarget that doesn't support it!
Removing the groups argument also makes CPU training work. However, I didn't test CPU training before the upgrade.
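Until this is fixed, my workaround is to emulate the grouped convolution by hand: split the input channels into groups, run a plain Conv2D on each slice, and concatenate the results. This is only a sketch, assuming an ungrouped Conv2D still works on Metal; I haven't verified that it matches the weight layout of a real groups=4 layer.
import tensorflow as tf
from tensorflow import keras as tfk

def grouped_conv2d(x, filters, kernel_size, groups, **kwargs):
    # Split the channel axis into `groups` slices, convolve each slice with its
    # own plain Conv2D, and concatenate the outputs along the channel axis.
    # As with the built-in groups argument, filters must be divisible by groups.
    slices = tf.split(x, num_or_size_splits=groups, axis=-1)
    outputs = [
        tfk.layers.Conv2D(filters // groups, kernel_size, **kwargs)(s)
        for s in slices
    ]
    return tfk.layers.Concatenate(axis=-1)(outputs)

inputs = tfk.Input(shape=(64, 48, 4))
x = grouped_conv2d(inputs, filters=16, kernel_size=(4, 3), groups=4, activation='relu')
x = tfk.layers.Flatten()(x)
outputs = tfk.layers.Dense(1, activation='sigmoid')(x)
model = tfk.Model(inputs, outputs)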
My device is a 14-inch MacBook Pro running macOS 12.4.
Not only does upgrading tensorflow-macos and tensorflow-metal break Conv2D with the groups argument, it also prevents training from finishing.
Today, after upgrading tensorflow-macos to 2.9.0 and tensorflow-metal to 0.5.0, my notebook stops making progress after training for around 16 minutes.
I tested 4 times. It happily runs for around 17 to 18 epochs, each epoch taking around 55 seconds, and then it simply stops making progress.
I checked Activity Monitor; both CPU and GPU usage were 0 at that point.
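Since I can't predict when it will stall, I've started saving weights every epoch so a forced restart doesn't lose all the progress. This is only a sketch, not a fix: model, Xs, and ys stand in for whatever is actually being trained, and the checkpoint path is hypothetical.
from tensorflow import keras as tfk

# Hypothetical checkpoint path; `model`, `Xs`, `ys` stand in for the real training objects.
checkpoint_cb = tfk.callbacks.ModelCheckpoint(
    filepath='ckpt/epoch-{epoch:03d}.h5',
    save_weights_only=True,
    save_freq='epoch',
)
model.fit(Xs, ys, epochs=30, verbose=1, callbacks=[checkpoint_cb])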
I also happened to notice a lot of kernel faults in the Console app.
The last one before I force-killed the process:
IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID 68905 likely leaking IOGPUResource (count=200000)
The PID 68905 is in fact the training process.
I have been observing this kind of issue for several months, but it was never this frequent, and restarting the notebook would usually let training finish. No luck today.
I hope Apple engineers can find the cause and fix it.