TensorFlow crashes with a segfault
I have a script that crashes with a segmentation fault on Apple M1 hardware with tensorflow-macos 2.11.0. The same code works fine on other hardware, so I think the code is correct; but even if something were wrong with it, TensorFlow should raise an exception instead of crashing. I also reported this here: https://github.com/tensorflow/tensorflow/issues/59780

To reproduce on Apple M1 hardware:

Checkout https://github.com/rwth-i6/returnn (maybe commit 3a67da87c2fd8783c5c2469d72cf1319b5b45837 to be sure).
Run: python3 tests/test_TFUtil.py test_get_variable_grad_from_update_ops

The relevant code:
https://github.com/rwth-i6/returnn/blob/3a67da87c2fd8783c5c2469d72cf1319b5b45837/tests/test_TFUtil.py#L3507
https://github.com/rwth-i6/returnn/blob/3a67da87c2fd8783c5c2469d72cf1319b5b45837/returnn/tf/util/basic.py#L6649

Relevant log output:

...
grad: Tensor("test_get_variable_grad_from_update_ops/gradients_2/test_get_variable_grad_from_update_ops/sub_grad/tuple/control_dependency:0", shape=(), dtype=float32)
Fatal Python error: Segmentation fault

Thread 0x0000000103500580 (most recent call first):
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1454 in _call_tf_sessionrun
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1361 in _run_fn
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1378 in _do_call
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1371 in _do_run
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1191 in _run
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 968 in run
  File "/Users/az/Programmierung/crnn/tests/test_TFUtil.py", line 3529 in test_get_variable_grad_from_update_ops
  File "/Users/az/Programmierung/crnn/tests/test_TFUtil.py", line 4559 in <module>

fish: Job 1, 'python3 tests/test_TFUtil.py te…' terminated by signal SIGSEGV (Address boundary error)

Stack trace in LLDB in the crashing thread:

* thread #28, queue = 'metal gpu stream', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeaddc3f8010)
  * frame #0: 0x00000001836ea5a0 libobjc.A.dylib`objc_msgSend + 32
    frame #1: 0x000000018df96d38 MPSNDArray`___lldb_unnamed_symbol1550 + 2292
    frame #2: 0x000000018df98bbc MPSNDArray`___lldb_unnamed_symbol1567 + 300
    frame #3: 0x000000018df991e8 MPSNDArray`___lldb_unnamed_symbol1569 + 176
    frame #4: 0x0000000159a7d2b8 libmetal_plugin.dylib`invocation function for block in double dispatchOneKernel<MPSNDArrayIdentity>(MetalStream*, MPSNDArrayIdentity*, NSArray*, MPSNDArray*, char const*, MPSKernelDAGObject*) + 120
    frame #5: 0x00000001836a01b4 libdispatch.dylib`_dispatch_client_callout + 20
    frame #6: 0x00000001836af414 libdispatch.dylib`_dispatch_lane_barrier_sync_invoke_and_complete + 56
    frame #7: 0x0000000159a7d140 libmetal_plugin.dylib`double dispatchOneKernel<MPSNDArrayIdentity>(MetalStream*, MPSNDArrayIdentity*, NSArray*, MPSNDArray*, char const*, MPSKernelDAGObject*) + 120
    frame #8: 0x0000000159a7fffc libmetal_plugin.dylib`metal_plugin::MPSApplyMomentumOp<float>::Compute(metal_plugin::OpKernelContext*) + 2768
    frame #9: 0x0000000159a7f2fc libmetal_plugin.dylib`void metal_plugin::ComputeOpKernel<metal_plugin::MPSApplyMomentumOp<float> >(void*, TF_OpKernelContext*) + 44
    frame #10: 0x000000014cd00028 libtensorflow_framework.2.dylib`tensorflow::PluggableDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) + 148
    frame #11: 0x000000014cc847f0 libtensorflow_framework.2.dylib`tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long long) + 3764
    frame #12: 0x000000028a47eb6c _pywrap_tensorflow_internal.so`Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) + 1496
    frame #13: 0x000000028a47e468 _pywrap_tensorflow_internal.so`tsl::thread::EigenEnvironment::CreateThread(std::__1::function<void ()>)::'lambda'()::operator()() const + 80
    frame #14: 0x000000014cb9e878 libtensorflow_framework.2.dylib`tsl::(anonymous namespace)::PThread::ThreadFn(void*) + 120
    frame #15: 0x000000018386426c libsystem_pthread.dylib`_pthread_start + 148

As you can see from the output, the crash happens in the last session.run([minimize_op, grad]).
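For context, here is a minimal sketch of the pattern that crashes, as I understand it. This is not the actual RETURNN test (which recovers the gradient from the update ops via returnn.tf.util.basic.get_variable_grad_from_update_ops); the variable, loss, and names below are made up for illustration. It builds a graph-mode momentum update and fetches a gradient in the same session.run, which is the call that ends up in MPSApplyMomentumOp in the Metal plugin on M1:

import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # graph mode, as in RETURNN

with tf1.Session() as session:
    # Hypothetical scalar variable and loss, only for illustration.
    x = tf1.get_variable("x", shape=(), initializer=tf1.constant_initializer(2.0))
    loss = (x - 1.0) ** 2
    opt = tf1.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
    minimize_op = opt.minimize(loss)
    # The real test obtains the gradient from the update ops;
    # here we just call tf.gradients directly.
    (grad,) = tf.gradients(loss, [x])
    session.run(tf1.global_variables_initializer())
    # On M1 with the Metal plugin, a run like this is where the
    # segfault happens (MPSApplyMomentumOp in the stack trace above).
    print(session.run([minimize_op, grad]))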
Replies: 0 · Boosts: 0 · Views: 737 · Activity: Feb ’23
TensorFlow hangs in session.run
This is a new neural model I implemented, and I want to train it. It is modified from an existing attention-based encoder-decoder model, where everything works fine. With the new model, training just hangs in session.run and does not do anything. I also cannot interrupt it; it hangs inside the TensorFlow C++ code. This seems to be specific to Mac M1 hardware; I cannot reproduce the problem on other hardware or in other environments. I already posted this elsewhere, but it was suggested to also post it here.

So far I don't have a minimal example, and it will be quite a big effort to create one, as this is a very complex model. But here are some relevant details:

- This is based on RETURNN. We still use graph mode.
- I tested both with control flow v1 (calling disable_control_flow_v2) and with control flow v2. It hangs in both cases.
- I tested using tfdbg / enable_dump_debug_info; it then crashes with a segfault. (A short sketch of both debugging switches follows after the log output below.)
- I get a number of other warnings, which may be related; see below.

To reproduce:

Code: https://github.com/rwth-i6/i6_experiments/blob/81bcef39b5829aa43b84bcab4b4fa03f82fc3bc5/users/zeyer/experiments/exp2023_02_16_chunked_attention/demo_returnn_config.py
Checkout the i6_experiments repo, commit 81bcef39b5829aa43b84bcab4b4fa03f82fc3bc5.
Checkout RETURNN, commit 2ed598443f22de42599a0fee9bc43fbb5e0abec2.
Run: python3 returnn/rnn.py i6_experiments/users/zeyer/experiments/exp2023_02_16_chunked_attention/demo_returnn_config.py

With control flow v2:

2023-02-17 10:02:03.997491: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1: type_id: TFT_OPTIONAL args { type_id: TFT_PRODUCT args { type_id: TFT_TENSOR args { type_id: TFT_INT32 } } } is neither a subtype nor a supertype of the combined inputs preceding it: type_id: TFT_OPTIONAL args { type_id: TFT_PRODUCT args { type_id: TFT_TENSOR args { type_id: TFT_FLOAT } } } while inferring type of node 'output/rec/while/body/_38/output/rec/prev_target_embed_moved_input/cond/output/_1608'
2023-02-17 10:34:46.595736: W tensorflow/c/c_api.cc:291] Operation '{name:'global_step' id:1961 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-02-17 10:35:56.799620+0100 python3[5197:2744697] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
...
2023-02-17 10:36:01.801307+0100 python3[5197:2744697] Execution of the command buffer was aborted due to an error during execution. Ignored (for causing prior/excessive GPU errors) (00000004:kIOGPUCommandBufferCallbackErrorSubmissionsIgnored)
...

(Related: https://github.com/tensorflow/tensorflow/issues/57052)

With control flow v1:

2023-02-17 10:10:01.733679: W tensorflow/c/c_api.cc:291] Operation '{name:'global_step' id:1528 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-02-17 10:10:14.257716+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.257754+0100 python3[3727:2732582] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258366+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258504+0100 python3[3727:2732582] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258541+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:14.258587+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:19.258726+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
2023-02-17 10:10:19.258784+0100 python3[3727:2732395] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
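For reference, this is how I understand the two debugging switches from the details above are meant to be called; the dump directory is just an example path, and in my setup this is driven via the RETURNN config rather than a standalone script:

import tensorflow as tf

# Use control flow v1 instead of v2 (graph mode only).
tf.compat.v1.disable_control_flow_v2()

# tfdbg v2: dump debug info to a directory for later inspection
# with TensorBoard's Debugger V2 plugin ("/tmp/tfdbg_dump" is an example path).
tf.debugging.experimental.enable_dump_debug_info(
    "/tmp/tfdbg_dump",
    tensor_debug_mode="FULL_HEALTH",
    circular_buffer_size=-1,
)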
Replies: 0 · Boosts: 0 · Views: 626 · Activity: Feb ’23