Same problem here. I also noticed the drop in GPU usage to 50%. It's hard to keep going like this.
Any updates on this?
Just to reinforce: when I start training a model it uses about 7 GB of memory, and by the end of training the usage exceeds 100 GB (getting slower and slower because of swapping). On Google Colab (NVIDIA GPU), the same training worked perfectly without this excessive memory usage.
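In case it's useful, memory growth like this can be tracked per epoch with a small Keras callback. This is only a minimal sketch, assuming psutil is available; the class name is illustrative:

import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Print the resident memory of the current process after each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
        print(f"Epoch {epoch}: process memory = {rss_gb:.2f} GB")

# Usage: model.fit(train_dataset, epochs=1000, callbacks=[MemoryLogger()])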
Hey,
Thank you for your detailed response.
I've prepared two standalone scripts that try to replicate the issue using randomly generated data. In both scripts, I generate synthetic data of varying sizes to mirror the dynamic input sizes in my actual project. This synthetic data is then passed through a Keras model that includes a tf.keras.layers.Resizing layer.
The first script uses tf.data.Dataset to feed the model, while the second uses a generator function to yield the data in batches.
Interestingly, the memory issue seems to occur only in the script that uses tf.data.Dataset (memory fills up completely) and does not seem to occur when using the generator (~1.5 GB of memory). However, in my actual code, where I use the generator approach, I do observe the memory issue (usage growing beyond the available memory). Furthermore, the issue is absent when running on the CPU or on an NVIDIA GPU (via Google Colab), both of which stay below 1.5 GB of memory. You can find the two scripts below.
Script using tf.data.Dataset:
import numpy as np
import tensorflow as tf

# # Use CPU to test memory
# tf.config.set_visible_devices([], 'GPU')

def generate_data(num_samples, max_size):
    """Generate synthetic data of varying sizes"""
    data = []
    labels = []
    for _ in range(num_samples):
        size = np.random.randint(1, max_size + 1)
        data.append(np.ones((size, size)) * 255)  # stand-in for an image
        labels.append(np.random.randint(0, 2))  # binary classification for simplicity
    return data, labels

class DynamicResizeModel(tf.keras.Model):
    """A model that includes a resizing layer"""
    def __init__(self, target_size):
        super().__init__()
        self.target_size = target_size
        self.expand_dims = tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, -1))
        self.resize = tf.keras.layers.Resizing(*target_size)
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        x = self.expand_dims(inputs)
        x = self.resize(x)
        x = self.flatten(x)
        return self.dense(x)

# Generate training data
train_data, train_labels = generate_data(100, 1024)  # adjust these parameters as needed

# Convert the variable-sized data to ragged tensors
train_data = tf.ragged.constant(train_data)
train_labels = tf.constant(train_labels)

# Prepare a dataset
train_dataset = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(8)  # adjust batch size as needed

# Create and train the model
model = DynamicResizeModel(target_size=(128, 32))  # resize all inputs to 128x32
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_dataset, epochs=1000)
Script using a generator:
import numpy as np
import tensorflow as tf

# # Use CPU to test memory
# tf.config.set_visible_devices([], 'GPU')

def generate_data(num_samples, max_size):
    """Generate synthetic data of varying sizes"""
    data = []
    labels = []
    for _ in range(num_samples):
        size = np.random.randint(1, max_size + 1)
        data.append(np.ones((size, size)) * 255)  # stand-in for an image
        labels.append(np.random.randint(0, 2))  # binary classification for simplicity
    return data, labels

def data_generator(data, labels, batch_size):
    """Create a generator that yields batches of data"""
    num_samples = len(data)
    indices = np.arange(num_samples)
    while True:
        for i in range(0, num_samples, batch_size):
            batch_indices = indices[i:i + batch_size]
            batch_data = tf.ragged.constant([data[idx] for idx in batch_indices], dtype=tf.float32)
            batch_labels = np.array([labels[idx] for idx in batch_indices], dtype=np.float32)
            yield batch_data, batch_labels
        np.random.shuffle(indices)  # reshuffle after each full pass

class DynamicResizeModel(tf.keras.Model):
    """A model that includes a resizing layer"""
    def __init__(self, target_size):
        super().__init__()
        self.target_size = target_size
        self.expand_dims = tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, -1))
        self.resize = tf.keras.layers.Resizing(*target_size)
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        x = self.expand_dims(inputs)
        x = self.resize(x)
        x = self.flatten(x)
        return self.dense(x)

# Generate training data
num_samples = 100  # total number of samples in the dataset
max_size = 1024  # maximum size of the square matrices
train_data, train_labels = generate_data(num_samples, max_size)

# Create and train the model
model = DynamicResizeModel(target_size=(1024, 128))  # resize all inputs to 1024x128
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Set the parameters
batch_size = 8  # number of samples per batch

# Create a generator
train_generator = data_generator(train_data, train_labels, batch_size)

# Use fit to train the model
model.fit(train_generator, steps_per_epoch=num_samples // batch_size, epochs=1000)
Given these findings, I would like to understand whether this behavior is expected with the tensorflow-metal plugin or whether it is indeed an anomaly. If it is the former, could you provide guidance on optimizing my code to avoid the memory issue while using tensorflow-metal?
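One workaround I am considering, assuming the growth is related to each distinct input shape being handled (and possibly cached) separately, is resizing every sample to a fixed shape inside the data pipeline so the model only ever sees one static input shape. A minimal sketch of that idea, based on the first script; the function name is just illustrative and 128x32 matches the model's Resizing layer:

# Resize each ragged sample to a fixed 128x32 shape before batching,
# so every batch has the same static shape.
def to_fixed_shape(image, label):
    image = image.to_tensor()                                    # densify the ragged sample
    image = tf.image.resize(image[..., tf.newaxis], (128, 32))   # fixed size, matching the model
    return tf.squeeze(image, -1), label                          # drop the channel dim so the model is unchanged

train_dataset = (
    tf.data.Dataset.from_tensor_slices((train_data, train_labels))
    .map(to_fixed_shape, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=1024)
    .batch(8)
)

I have not verified whether this avoids the issue on tensorflow-metal, so any guidance on whether this is the right direction would be appreciated.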
Looking forward to your insights.