Deep Learning Chapter 10: Advanced use of recurrent neural networks not using the GPU

I am running a recurrent neural network example on a 27" iMac with an AMD Radeon Pro 5700 XT under macOS 12.3. The code runs and the Metal GPU is initialized, but the GPU is not active during training. Each epoch takes:

819/819 [==============================] - 32417s 40s/step - loss: 121.7538 - mae: 8.9641 - val_loss: 100.3145 - val_mae: 8.0313

 % python chapter-10.py
['"Date Time"', '"p (mbar)"', '"T (degC)"', '"Tpot (K)"', '"Tdew (degC)"', '"rh (%)"', '"VPmax (mbar)"', '"VPact (mbar)"', '"VPdef (mbar)"', '"sh (g/kg)"', '"H2OC (mmol/mol)"', '"rho (g/m**3)"', '"wv (m/s)"', '"max. wv (m/s)"', '"wd (deg)"']
420451
num_train_samples: 210225
num_val_samples: 105112
num_test_samples: 105114
2022-03-28 18:28:59.988516: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Metal device set to: AMD Radeon Pro 5700 XT

systemMemory: 128.00 GB
maxCacheSize: 7.99 GB

2022-03-28 18:28:59.989242: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-03-28 18:28:59.989616: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
WARNING:tensorflow:Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
Epoch 1/50
2022-03-28 18:29:02.342296: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
819/819 [==============================] - ETA: 0s - loss: 121.7538 - mae: 8.9641 2022-03-29 03:21:02.092397: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
819/819 [==============================] - 32417s 40s/step - loss: 121.7538 - mae: 8.9641 - val_loss: 100.3145 - val_mae: 8.0313
Epoch 2/50
381/819 [============>.................] - ETA: 4:50:31 - loss: 93.7597 - mae: 7.6880
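For what it's worth, the Metal device is visible to TensorFlow (see "Metal device set to: AMD Radeon Pro 5700 XT" above), so the plugin itself appears to be installed correctly. A minimal check, run separately from the chapter script, confirms the GPU is registered:

```python
import tensorflow as tf

# With tensorflow-metal installed, the AMD GPU should be listed here as a
# physical device of type GPU (backed by the Metal PluggableDevice).
print(tf.config.list_physical_devices("GPU"))
print(tf.config.list_physical_devices("CPU"))
```

Here is the script (chapter-10.py):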
import os

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_features = 14
steps = 120

fname = os.path.join("jena_climate_2009_2016.csv")

with open(fname) as f:
    data = f.read()

lines = data.split("\n")
header = lines[0].split(",")
lines = lines[1:]
print(header)
print(len(lines))

# Drop the timestamp column; keep the 14 numeric columns as features and the
# "T (degC)" column (values[1]) as the temperature target.
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(",")[1:]]
    temperature[i] = values[1]
    raw_data[i, :] = values[:]

# One sample per hour (records are every 10 minutes), a lookback of 120 hours,
# and a target temperature 24 hours past the end of each sequence.
sampling_rate = 6
sequence_length = 120
delay = sampling_rate * (sequence_length + 24 - 1)
batch_size = 256

# Chronological 50% / 25% / 25% train / validation / test split.
num_train_samples = int(0.5 * len(raw_data))
num_val_samples = int(0.25 * len(raw_data))
num_test_samples = len(raw_data) - num_train_samples - num_val_samples
print("num_train_samples:", num_train_samples)
print("num_val_samples:", num_val_samples)
print("num_test_samples:", num_test_samples)

train_dataset = keras.utils.timeseries_dataset_from_array(
    raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=sequence_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=0,
    end_index=num_train_samples)

val_dataset = keras.utils.timeseries_dataset_from_array(
    raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=sequence_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=num_train_samples,
    end_index=num_train_samples + num_val_samples)

test_dataset = keras.utils.timeseries_dataset_from_array(
    raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=sequence_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=num_train_samples + num_val_samples)

# Stacked SimpleRNN example from the chapter. Note that this block is unused:
# inputs and outputs are redefined by the LSTM model below, and only the LSTM
# model is compiled and trained.
inputs = keras.Input(shape=(steps, num_features))
x = layers.SimpleRNN(16, return_sequences=True)(inputs)
x = layers.SimpleRNN(16, return_sequences=True)(x)
outputs = layers.SimpleRNN(16)(x)


# LSTM with recurrent dropout -- this is the model that gets trained.
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.LSTM(32, recurrent_dropout=0.25)(inputs)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

callbacks = [
    keras.callbacks.ModelCheckpoint("jena_lstm_dropout.keras",
                                    save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
                    epochs=50,
                    validation_data=val_dataset,
                    callbacks=callbacks)

Any idea why the GPU is not active? How does this code example run on an M1 Ultra?
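To see where the ops actually land, one option (a small diagnostic sketch, not part of the chapter code) is to turn on device-placement logging before building the model:

```python
import tensorflow as tf

# Print the device every op is placed on; the log shows whether the LSTM ops
# run on /device:GPU:0 (the Metal device) or fall back to the CPU.
tf.debugging.set_log_device_placement(True)
```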

On Google Colab with a GPU, each epoch takes ~383 seconds (not 32,417 seconds):

```
--2022-03-29 16:58:04--  https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.132.216
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.132.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13565642 (13M) [application/zip]
Saving to: ‘jena_climate_2009_2016.csv.zip’

jena_climate_2009_2 100%[===================>]  12.94M  61.8MB/s    in 0.2s

2022-03-29 16:58:04 (61.8 MB/s) - ‘jena_climate_2009_2016.csv.zip’ saved [13565642/13565642]

Archive:  jena_climate_2009_2016.csv.zip
  inflating: jena_climate_2009_2016.csv
  inflating: __MACOSX/._jena_climate_2009_2016.csv
['"Date Time"', '"p (mbar)"', '"T (degC)"', '"Tpot (K)"', '"Tdew (degC)"', '"rh (%)"', '"VPmax (mbar)"', '"VPact (mbar)"', '"VPdef (mbar)"', '"sh (g/kg)"', '"H2OC (mmol/mol)"', '"rho (g/m**3)"', '"wv (m/s)"', '"max. wv (m/s)"', '"wd (deg)"']
420451
num_train_samples: 210225
num_val_samples: 105112
num_test_samples: 105114
WARNING:tensorflow:Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
Epoch 1/50
819/819 [==============================] - 383s 461ms/step - loss: 147.6203 - mae: 10.0559 - val_loss: 137.3602 - val_mae: 9.6686
Epoch 2/50
118/819 [===>....
```

Hi @dbl001

The meaningful line in the printout is "Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.", which is triggered by the `recurrent_dropout=0.25` argument passed to the LSTM layer. Unfortunately, when the call falls back to the generic GPU kernel, execution ends up being slow because of how the pluggable-device interface for RNN layers is implemented in TensorFlow core, which prevents us from truly leveraging the GPU. We are only able to hijack (and accelerate) the calls that would go into the cuDNN RNN kernels.

We hope to remedy this in the future, but for now I can only recommend running RNN layers that do not satisfy the cuDNN implementation conditions on the CPU. The conditions are listed here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
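
For concreteness, here is a sketch of two ways to act on that advice, using the same shapes as the chapter script (sequence_length=120, 14 features). These are illustrative suggestions, not an official fix:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Option A: drop recurrent_dropout so the LSTM meets the fast-kernel criteria
# and keeps the accelerated GPU path. Regular Dropout on the LSTM output is
# kept, but the recurrent connections are no longer regularized.
sequence_length, num_features = 120, 14  # same values as the chapter script
inputs = keras.Input(shape=(sequence_length, num_features))
x = layers.LSTM(32)(inputs)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])

# Option B (alternative): keep recurrent_dropout but run the whole model on
# the CPU by hiding the GPU. This must go at the very top of the script,
# before any other TensorFlow work, e.g.:
#
#   import tensorflow as tf
#   tf.config.set_visible_devices([], "GPU")
```

Option A changes the regularization slightly (no recurrent dropout), so results will not exactly match the chapter; Option B reproduces the chapter's model exactly, just at CPU speed.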
