Tensorflow metal: The Metal Performance Shaders operations encoded on it may not have completed.

Question

Alberto1999 OP

Created Aug ’22

Replies 6

Views 2.6k

Participants 4

This does not seem to be affecting the training, but it looks somewhat important (though I have no clue how to read it):

Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x29b027b50>
    label = <none> 
    device = <AGXG13XDevice: 0x12da25600>
        name = Apple M1 Max 
    commandQueue = <AGXG13XFamilyCommandQueue: 0x106477000>
        label = <none> 
        device = <AGXG13XDevice: 0x12da25600>
            name = Apple M1 Max 
    retainedReferences = 1

This happens while training a "heavy" model on a "heavy" dataset, so it may be related to a memory issue, but I have no clue how to address it.

Answer 1

Frameworks Engineer OP

Apple

Aug ’22

Hi @Alberto1999!

Thanks for reporting the issue. Would you happen to have a test script you could provide us that would reproduce this error message? I understand however that if this only happens sporadically during very memory heavy training it might be difficult to reproduce consistently. But I can confirm that this does not look like expected behavior so I would like to investigate it in more detail.

Additionally, on which OS version, tensorflow-macos version and tensorflow-metal version did you observe this?
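For anyone answering the version question above, here is a minimal sketch that gathers the requested details in one go. It assumes the packages were installed with pip under the distribution names `tensorflow-macos` and `tensorflow-metal`; the helper name `collect_versions` is made up for illustration, and anything that cannot be found is reported as "not installed" rather than guessed.

```python
import platform
from importlib import metadata


def collect_versions(packages=("tensorflow-macos", "tensorflow-metal", "tensorflow")):
    """Return a dict with the OS, Python and package versions for a bug report."""
    report = {
        # platform.mac_ver() returns an empty release string on non-macOS systems.
        "macOS": platform.mac_ver()[0] or "not macOS",
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a reply should cover the OS and plugin versions asked for here.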

Answer 2

Alberto1999 OP

Aug ’22

Hi there. The problem is very sporadic and not really deterministic; it happens during the training of a heavy TF model. I can, however, provide a link to a ZIP file with the Jupyter notebook and the dataset.

Alternatively, since the images come from the facades dataset, I can share just the code. The dataset can be downloaded from https://www.kaggle.com/datasets/balraj98/facades-dataset and needs to be placed in the notebook's directory, something like this:

...
├── notebook.ipynb
└── dataset
       ├── trainA
       ├── trainB
       ├── testA
       └── testB
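Before launching the notebook, the layout above can be checked with a few lines of Python. This is only a sketch: the folder names are taken from the tree above, while the helper name `missing_dataset_dirs` and the default path are assumptions for illustration.

```python
from pathlib import Path

# Sub-folders expected inside "dataset", as listed in the tree above.
EXPECTED_SUBDIRS = ("trainA", "trainB", "testA", "testB")


def missing_dataset_dirs(notebook_dir="."):
    """Return the expected sub-folders that are absent under <notebook_dir>/dataset."""
    dataset = Path(notebook_dir) / "dataset"
    return [name for name in EXPECTED_SUBDIRS if not (dataset / name).is_dir()]
```

An empty list means every expected folder is in place; otherwise the returned names show what still needs to be extracted.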

The whole code can be downloaded from here: https://drive.google.com/file/d/1Clqf1uSzMIntA551dp8B1Z-hZFPAa8VL/view?usp=sharing
It requires only basic packages, and the likelihood of seeing that error message is directly proportional to the batch size (so I suspect it has something to do with memory).

My machine is a 2021 16" MacBook Pro with an M1 Max (26 core GPU), 32 GB RAM and a 2 TB SSD, running macOS 12.4 (21F79)
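To make the batch-size suspicion a bit more concrete, here is a rough back-of-the-envelope sketch of how the input tensor alone grows with the batch size. The 256x256x3 float32 image shape is an assumption for illustration (the thread does not state the input size), and the helper name is made up. Real memory use is dominated by activations, gradients and optimizer state, which this does not model; the only point is that the footprint scales linearly with the batch size.

```python
def batch_input_bytes(batch_size, height=256, width=256, channels=3, bytes_per_value=4):
    """Bytes needed for one batch of input images (assumed shape, float32 by default)."""
    return batch_size * height * width * channels * bytes_per_value


if __name__ == "__main__":
    for batch_size in (1, 4, 16, 64):
        mib = batch_input_bytes(batch_size) / 2**20
        print(f"batch_size={batch_size:>3}: {mib:.2f} MiB of input data")
```

Halving the batch size and checking whether the error becomes rarer would be one cheap way to test the memory hypothesis.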

Answer 3

Alberto1999 OP

Aug ’22

Sure, let me know if you need more info about this

Answer 4

Alberto1999 OP

Aug ’22

Hi there
I have a much simpler snippet that causes the same error, without external datasets:

import tensorflow as tf
import tensorflow.keras as K
import numpy as np

# Load the IMDB reviews, keeping only the 10000 most frequent words
num_words = 10000
(X_train, y_train), (X_test, y_test) = K.datasets.imdb.load_data(num_words=num_words)

# Split the original test set in half: validation and test
(X_valid, X_test) = X_test[:12500], X_test[12500:]
(y_valid, y_test) = y_test[:12500], y_test[12500:]

# Pad/truncate every review to 500 tokens
maxlen = 500
X_train_trim = K.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test_trim = K.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)
X_valid_trim = K.preprocessing.sequence.pad_sequences(X_valid, maxlen=maxlen)

# Embedding -> SimpleRNN -> sigmoid output for binary sentiment classification
model_K = K.models.Sequential([
    K.layers.Embedding(input_dim=num_words, output_dim=10),
    K.layers.SimpleRNN(32),
    K.layers.Dense(1, activation="sigmoid")
])
model_K.compile(loss='binary_crossentropy', optimizer="adam", metrics=["accuracy"])

# Training is pinned to the CPU because SimpleRNN does not work on the M1 GPU
with tf.device("/device:CPU:0"):
    history_K = model_K.fit(X_train_trim, y_train, epochs=10, batch_size=128, validation_data=(X_valid_trim, y_valid))

In addition to this, SimpleRNN does not work on the M1 GPU whatsoever (hence the tf.device), as reported here: https://github.com/tensorflow/tensorflow/issues/56082 (LSTM, on the other hand, works fine)

I think this might be due to the graph creation, as a simple reimplementation of SimpleRNN has the same issue (although this does not really hold, otherwise LSTM would have the same issue)
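As an alternative to wrapping the training call in tf.device, the GPU can be hidden from TensorFlow for the whole process, which may help when checking whether the error message comes from GPU work at all. This is a sketch, not a fix: the helper name `hide_gpus` is made up, and it relies on `tf.config.set_visible_devices`, which has to be called before TensorFlow initializes its devices.

```python
def hide_gpus():
    """Hide all GPUs from TensorFlow; return True if the setting was applied."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return False
    try:
        # An empty list means no GPU is visible, so every op is placed on the CPU.
        tf.config.set_visible_devices([], "GPU")
    except RuntimeError:
        # Devices were already initialized; the visible set can no longer change.
        return False
    return True


if __name__ == "__main__":
    print("GPUs hidden:", hide_gpus())
```

Calling it at the very top of the script, before any model or tensor is created, is the safest placement.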

Answer 5

LCrossman OP

Mar ’24

Hi, I have the same error on an M2 Max with TensorFlow, in an LSTM:

	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG14XFamilyCommandBuffer: 0x5cbea68f0>
    label = <none> 
    device = <AGXG14CDevice: 0x13385a200>
        name = Apple M2 Max 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x1422ab400>
        label = <none> 
        device = <AGXG14CDevice: 0x13385a200>
            name = Apple M2 Max 
    retainedReferences = 1

Answer 6

carl24k OP

Mar ’24

Hi, I am getting this error with the test script from the tensorflow-metal plugin page. I have a Power Mac M3 on macOS 14.4 (the latest at this time). Unfortunately, I created another thread: https://developer.apple.com/forums/thread/748413. Should I close that one?

tensorflow-metal was working great on my Power Mac M3 until Tuesday. Then my code started freezing. I ran the test script from https://developer.apple.com/metal/tensorflow-plugin/ and it now crashes. It used to work fine, but all of a sudden it does not. The results are shown below.

Were there ever any answers to the previous posts? Could this be a hardware problem?

The test script is just this:

import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

The errors I get are like the following:

Epoch 1/5
  1/782 [..............................] - ETA: 51:53 - loss: 6.0044 - accuracy: 0.0312
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOGPUCommandBufferCallbackErrorSubmissionsIgnored)
	<AGXG15XFamilyCommandBuffer: 0x1172515e0>
    label = <none> 
    device = <AGXG15SDevice: 0x1588e6000>
        name = Apple M3 Pro 
    commandQueue = <AGXG15XFamilyCommandQueue: 0x17427e400>
        label = <none> 
        device = <AGXG15SDevice: 0x1588e6000>
            name = Apple M3 Pro 
    retainedReferences = 1
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOGPUCommandBufferCallbackErrorSubmissionsIgnored)
	<AGXG15XFamilyCommandBuffer: 0x117257b40>
    label = <none> 
    device = <AGXG15SDevice: 0x1588e6000>
        name = Apple M3 Pro 
    commandQueue = <AGXG15XFamilyCommandQueue: 0x17427e400>
        label = <none> 
        device = <AGXG15SDevice: 0x1588e6000>
            name = Apple M3 Pro 
    retainedReferences = 1
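When comparing reports like the ones in this thread, it can help to pull the status code and device name out of each dump. The following is a minimal sketch that assumes the log keeps the shape shown above (a status written as eight hex digits, a colon and a name in parentheses, plus `name = Apple ...` lines); the function name `summarize_metal_error` is made up, and nothing here interprets what the codes mean.

```python
import re

# Matches e.g. "(0000000e:Internal Error)" but not "(null)" or "(for causing ...)".
STATUS_RE = re.compile(r"\((?P<code>[0-9a-fA-F]{8}):(?P<name>[^)]+)\)")
# Matches the first "name = Apple ..." line, ignoring trailing whitespace.
DEVICE_RE = re.compile(r"name = (?P<device>Apple [^\n]+?)\s*$", re.MULTILINE)


def summarize_metal_error(log_text):
    """Extract the status code, status name and device name from one error dump."""
    status = STATUS_RE.search(log_text)
    device = DEVICE_RE.search(log_text)
    return {
        "code": int(status.group("code"), 16) if status else None,
        "name": status.group("name") if status else None,
        "device": device.group("device") if device else None,
    }
```

Run on the dumps in this thread, it would separate the "Internal Error" status seen on the M1 Max and M2 Max from the "SubmissionsIgnored" status seen on the M3 Pro, which suggests the reports may not all be the same failure.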