GPU utilization decays from 50% to 10% in non-batch inference for huggingface distilbert-base-cased

MacBook Pro M2 Max, 96 GB, macOS 13.3, tensorflow-macos 2.9.0, tensorflow-metal 0.5.0

Here's a reproducible test case:

from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset
from tqdm import tqdm
import numpy as np

imdb = load_dataset('imdb')
sentences = imdb['train']['text'][:500]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')

for i, sentence in tqdm(enumerate(sentences)):
  # Tokenize and run one sentence at a time (no padding, no batching)
  inputs = tokenizer(sentence, truncation=True, return_tensors='tf')
  output = model(inputs).logits
  pred = np.argmax(output.numpy(), axis=1)

  if i % 100 == 0:
    print(f"len(input_ids): {inputs['input_ids'].shape[-1]}")

I watched GPU utilization slowly decay from 50% to 10%, and it becomes excruciatingly slow towards the end. The printed progress confirms this:

Metal device set to: Apple M2 Max

systemMemory: 96.00 GB
maxCacheSize: 36.00 GB

3it [00:00, 10.87it/s]
len(input_ids): 391
101it [00:13,  6.38it/s]
len(input_ids): 215
201it [00:34,  4.78it/s]
len(input_ids): 237
301it [00:55,  4.26it/s]
len(input_ids): 256
401it [01:54,  1.12it/s]
len(input_ids): 55
500it [03:40,  2.27it/s]

I have found no evidence yet that this is a thermal-throttling issue, because after the big drop in GPU utilization, other processes (using only ~2% of the GPU) overtake this one.

What's going on here? Are there any profiling tips that would help me investigate? I am aware I can "fix" this by doing batched inference, but seeing this GPU utilization decay is unsettling, since the same thing could happen during a training session (which runs far longer).
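
One way I could gather more data is the TensorFlow profiler. Below is a minimal sketch of how I would wire it up; the log directory path is an arbitrary example, and I have not verified how much of the device trace the Metal plugin actually exposes:

import tensorflow as tf

tf.profiler.experimental.start('logdir/metal_inference')
for i, sentence in enumerate(sentences[:50]):  # profile a short slice of the loop
  inputs = tokenizer(sentence, truncation=True, return_tensors='tf')
  _ = model(inputs).logits
tf.profiler.experimental.stop()

The resulting trace can then be inspected in TensorBoard's Profile tab to see whether the time goes into kernel launches, host-side work, or something else.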

I found out this has something to do with the variation in input token length from one inference to the next. The GPU does not seem to like receiving lengths that vary greatly; maybe this causes some sort of fragmentation in GPU memory? Below is code that keeps only IMDB sentences with >= 512 tokens, so after truncation every input has the same length. It is able to sustain GPU utilization at ~30 it/s.

from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset
from tqdm import tqdm
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')

imdb = load_dataset('imdb')

print('starting collecting sentences with tokens >= 512')
# Keep only reviews that tokenize to >= 512 tokens, so after truncation
# every input the model sees has the same (1, 512) shape
sentences = [sentence for sentence in imdb['train']['text']
             if tokenizer(sentence, truncation=True, return_tensors='tf')['input_ids'].shape[-1] >= 512]
print('finished collecting sentences with tokens >= 512')

for k, sentence in tqdm(enumerate(sentences)):
  inputs = tokenizer(sentence, truncation=True, return_tensors='tf')

  output = model(inputs).logits
  pred = np.argmax(output.numpy(), axis=1)

  if k % 100 == 0:
    print(f"len(input_ids): {inputs['input_ids'].shape[-1]}")

Output:

7it [00:00, 31.12it/s]
len(input_ids): 512
107it [00:03, 32.38it/s]
len(input_ids): 512
...
...
3804it [02:00, 31.85it/s]
len(input_ids): 512
3904it [02:03, 32.50it/s]
len(input_ids): 512
3946it [02:04, 31.70it/s]
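
If the fixed-shape hypothesis is right, a simpler variant of the same experiment (which I have not benchmarked, so treat it as a sketch) would be to keep all 500 reviews but pad every input to the same length, so the per-call shape never changes; the cost is that short reviews then take as much compute as 512-token ones:

for i, sentence in tqdm(enumerate(sentences)):
  # padding='max_length' forces a constant (1, 512) input shape for every call
  inputs = tokenizer(sentence, truncation=True, padding='max_length',
                     max_length=512, return_tensors='tf')
  output = model(inputs).logits
  pred = np.argmax(output.numpy(), axis=1)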

Hi @kechan,

Thanks for reporting this and the code snippet to repro it. I'll investigate to see what the cause is.

The characteristic you pointed out, that the slowdown happens with varying-length tokens, is very interesting. It makes me think this might be explained by the caching behavior of the computation kernels, which are usually cached with static shapes. In this case the token length may end up affecting the caching key, so each different input length requires the graph to be rebuilt. I'll let you know once we have a better idea of what's going on.
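
As a rough analogy for that kind of shape-keyed caching, here is a minimal sketch using plain tf.function retracing; this is generic TensorFlow behavior rather than anything specific to tensorflow-metal, so take it only as an illustration of the hypothesis:

import tensorflow as tf

@tf.function
def toy_step(x):
  return tf.reduce_sum(x * x)

# Each previously unseen input shape triggers a new trace (and new kernels)
for length in (391, 215, 237, 256, 55):
  toy_step(tf.zeros([1, length]))

print(toy_step.experimental_get_tracing_count())  # one trace per distinct shape

# A signature with a dynamic dimension avoids the per-shape retraces
@tf.function(input_signature=[tf.TensorSpec([1, None], tf.float32)])
def toy_step_relaxed(x):
  return tf.reduce_sum(x * x)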
