Tensorflow metal: The Metal Performance Shaders operations encoded on it may not have completed.

This does not seem to be affecting the training, but it looks somewhat important (I have no clue how to read it, however):

Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x29b027b50>
    label = <none> 
    device = <AGXG13XDevice: 0x12da25600>
        name = Apple M1 Max 
    commandQueue = <AGXG13XFamilyCommandQueue: 0x106477000>
        label = <none> 
        device = <AGXG13XDevice: 0x12da25600>
            name = Apple M1 Max 
    retainedReferences = 1

This happens while training a "heavy" model on a "heavy" dataset, so it may be related to a memory issue, but I have no clue how to tackle it.
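For anyone else unsure how to read the log above, here is a minimal sketch that pulls out the status code, status name and device name. The line format it assumes ("description (8 hex digits:status name)" and "name = device") is inferred only from the logs posted in this thread, not from any Apple documentation, and the helper name summarize is hypothetical.

```python
import re

# Sample lines copied from the log in this thread.
SAMPLE_LOG = "\n".join([
    "Error: command buffer exited with error status.",
    "\tThe Metal Performance Shaders operations encoded on it may not have completed.",
    "\tError: ",
    "\t(null)",
    "\tInternal Error (0000000e:Internal Error)",
    "\t<AGXG13XFamilyCommandBuffer: 0x29b027b50>",
    "    label = <none> ",
    "    device = <AGXG13XDevice: 0x12da25600>",
    "        name = Apple M1 Max ",
])

# Assumed pattern: "<description> (<8 hex digits>:<status name>)" on its own line.
STATUS_RE = re.compile(
    r"^\s*(?P<desc>.+?) \((?P<code>[0-9a-fA-F]{8}):(?P<name>[^)]+)\)\s*$", re.M
)
# Assumed pattern: "name = <device name>" on its own line.
DEVICE_RE = re.compile(r"^\s*name = (?P<device>.+?)\s*$", re.M)


def summarize(log):
    """Return the first status code, status name and device name found in log."""
    status = STATUS_RE.search(log)
    device = DEVICE_RE.search(log)
    return {
        "code": int(status["code"], 16) if status else None,
        "status": status["name"] if status else None,
        "device": device["device"] if device else None,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

This only summarizes what the log already says; it does not explain why the command buffer failed.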

Hi @Alberto1999!

Thanks for reporting the issue. Would you happen to have a test script you could provide us that would reproduce this error message? I understand however that if this only happens sporadically during very memory heavy training it might be difficult to reproduce consistently. But I can confirm that this does not look like expected behavior so I would like to investigate it in more detail.

Additionally, on which OS version, tensorflow-macos version and tensorflow-metal version did you observe this?
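One possible way to collect those three version numbers in one go is sketched below. It uses only the Python standard library and assumes the packages were installed with pip under the names tensorflow-macos and tensorflow-metal; the helper name collect_versions is hypothetical.

```python
import platform
from importlib import metadata


def collect_versions(packages=("tensorflow-macos", "tensorflow-metal")):
    """Return a dict with the macOS version and the installed package versions."""
    # platform.mac_ver() returns an empty release string on non-macOS systems.
    info = {"macOS": platform.mac_ver()[0] or "unknown (not macOS?)"}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a reply should cover the OS and package versions asked about above.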

Hi there, the problem is very sporadic and not deterministic; it happens while training a heavy TF model. However, I can provide a link to a ZIP file with the Jupyter notebook and the dataset.

Alternatively, since the images come from the facades dataset, I can share just the code. The dataset can be downloaded from https://www.kaggle.com/datasets/balraj98/facades-dataset and needs to be placed in the same directory as the notebook, like this:

...
├── notebook.ipynb
└── dataset
       ├── trainA
       ├── trainB
       ├── testA
       └── testB

The whole code can be downloaded from here: https://drive.google.com/file/d/1Clqf1uSzMIntA551dp8B1Z-hZFPAa8VL/view?usp=sharing
It requires only basic packages, and the likelihood of seeing the error message increases with the batch size (so I suspect it has something to do with memory).

My machine is a 2021 16" MacBook Pro with an M1 Max (26-core GPU), 32 GB RAM and a 2 TB SSD, running macOS 12.4 (21F79).

Sure, let me know if you need more info about this

Hi there
I have a much simpler snippet that causes the same error, without external datasets:

import tensorflow as tf
import tensorflow.keras as K
import numpy as np
# Load the IMDB reviews, keeping only the 10000 most frequent words.
num_words = 10000
(X_train, y_train), (X_test, y_test) = K.datasets.imdb.load_data(num_words=num_words)
# Split the original test set in half: validation and test.
(X_valid, X_test) = X_test[:12500], X_test[12500:]
(y_valid, y_test) = y_test[:12500], y_test[12500:]
# Pad/truncate every review to 500 tokens.
maxlen = 500
X_train_trim = K.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test_trim = K.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)
X_valid_trim = K.preprocessing.sequence.pad_sequences(X_valid, maxlen=maxlen)
# Small binary classifier: embedding -> SimpleRNN -> sigmoid output.
model_K = K.models.Sequential([
    K.layers.Embedding(input_dim=num_words, output_dim=10),
    K.layers.SimpleRNN(32),
    K.layers.Dense(1, "sigmoid")
])
model_K.compile(loss='binary_crossentropy', optimizer="adam", metrics=["accuracy"])
# Train on the CPU, because SimpleRNN does not work on the M1 GPU.
with tf.device("/device:CPU:0"):
    history_K = model_K.fit(X_train_trim, y_train, epochs=10, batch_size=128, validation_data=(X_valid_trim, y_valid))

In addition to this, SimpleRNN does not work on the M1 GPU at all (hence the tf.device), as reported here: https://github.com/tensorflow/tensorflow/issues/56082 (LSTM, on the other hand, works fine).

I think this might be due to the graph creation, as a simple reimplementation of SimpleRNN has the same issue (although this explanation does not really hold, because otherwise LSTM would have the same issue too).

Hi, I have the same error on an M2 Max with TensorFlow, using an LSTM:

The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG14XFamilyCommandBuffer: 0x5cbea68f0>
    label = <none> 
    device = <AGXG14CDevice: 0x13385a200>
        name = Apple M2 Max 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x1422ab400>
        label = <none> 
        device = <AGXG14CDevice: 0x13385a200>
            name = Apple M2 Max 
    retainedReferences = 1

Hi, I am getting this error with the test script from the TensorFlow Metal plugin page. I have a Power Mac M3 on macOS 14.4 (the latest at this time). Unfortunately, I also created another thread: https://developer.apple.com/forums/thread/748413. Should I close that one?

TensorFlow Metal was working GREAT on my Power Mac M3 until Tuesday. Then my code started freezing. I ran the test script from https://developer.apple.com/metal/tensorflow-plugin/ and it now crashes. This used to work fine, but all of a sudden it does not. The results are shown below.

Were there ever any answers to the previous posts? Could this be a hardware problem?

The test script is just this:

import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

The errors I get are like the following:

Epoch 1/5
  1/782 [..............................] - ETA: 51:53 - loss: 6.0044 - accuracy: 0.0312Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOGPUCommandBufferCallbackErrorSubmissionsIgnored)
	<AGXG15XFamilyCommandBuffer: 0x1172515e0>
    label = <none> 
    device = <AGXG15SDevice: 0x1588e6000>
        name = Apple M3 Pro 
    commandQueue = <AGXG15XFamilyCommandQueue: 0x17427e400>
        label = <none> 
        device = <AGXG15SDevice: 0x1588e6000>
            name = Apple M3 Pro 
    retainedReferences = 1
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOGPUCommandBufferCallbackErrorSubmissionsIgnored)
	<AGXG15XFamilyCommandBuffer: 0x117257b40>
    label = <none> 
    device = <AGXG15SDevice: 0x1588e6000>
        name = Apple M3 Pro 
    commandQueue = <AGXG15XFamilyCommandQueue: 0x17427e400>
        label = <none> 
        device = <AGXG15SDevice: 0x1588e6000>
            name = Apple M3 Pro 
    retainedReferences = 1