GPU cannot be assigned properly during NLP tasks.

Dear All Developers,

I have reported an issue with the HuggingFace package in thread 683992.

In the beginning, I thought the problem came from HuggingFace. However, after some further tests, it seems to result from TensorFlow-Hub instead.

Here is the thing: I built a fine-tuned BERT model with TF and TF-Hub only, and I got the same error as before.
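Not the original poster's code, but a hedged sketch of the pattern that appears to trigger this: any model whose optimizer receives sparse (IndexedSlices) gradients, e.g. from an Embedding layer, which makes the optimizer update emit Unique / UnsortedSegmentSum ops — the CPU-only ops named in the error below.

```python
# Hypothetical minimal repro, assuming a tensorflow-metal build with the bug:
# sparse gradients from an Embedding layer make Adam emit Unique /
# UnsortedSegmentSum during the update step, which only have CPU kernels
# in the affected builds.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = tf.random.uniform((8, 4), maxval=1000, dtype=tf.int32)
y = tf.random.uniform((8, 1))

# On an affected GPU this raises InvalidArgumentError at device placement;
# on CPU (or a fixed build) it trains normally.
history = model.fit(x, y, epochs=1, verbose=0)
```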

Here are the details of the error.

InvalidArgumentError: Cannot assign a device for operation AdamWeightDecay/AdamWeightDecay/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU 
ResourceGather: GPU CPU 
AddV2: GPU CPU 
Sqrt: GPU CPU 
Unique: CPU 
ResourceScatterAdd: GPU CPU 
UnsortedSegmentSum: CPU 
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
Identity: GPU CPU 
StridedSlice: GPU CPU 
_Arg: GPU CPU 
Const: GPU CPU 

So, obviously, something is wrong on the TF side, and I don't think there is a quick solution.
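For anyone trying to narrow this down themselves, a small diagnostic (my own sketch, not from the thread): tf.debugging.set_log_device_placement makes TF print where each op actually runs, and tf.unique is one of the CPU-only ops named in the colocation dump above.

```python
# Diagnostic sketch (not a fix): log where each op is placed, so the
# CPU-only ops from the colocation dump (e.g. Unique) show up explicitly.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # enable before running ops

# Unique has no Metal GPU kernel in the affected builds; the log shows
# which device it lands on.
y, idx = tf.unique(tf.constant([1, 1, 2, 3, 3]))
print(y.numpy())   # unique values
print(idx.numpy()) # index of each input element in the unique list
```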

As transformers and related models are so powerful in the NLP area, it would be a great shame if we could not solve NLP tasks with GPU acceleration.

I will also raise this issue in the Feedback Assistant app; please comment here if you would also like Apple to fix it.

Sincerely,

hawkiyc

Hi hawkiyc!

Thank you so much for reporting this issue. The team is aware of it, has reproduced it, and is working on a fix. There is no known workaround at this time. The fix will be provided in an upcoming seed.

Please file a Feedback Assistant ticket and post its number here, so we can update you on progress.

Have a great day!

I have been experiencing a similar issue while training a GAN.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation loader/GeneratorDataset: Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

Any news about when and how the issue would be solved?

Is there any update on this? Any ETA?

I am seeing this error when training the TensorFlowTTS model on an Apple M1 chip.

Metal device set to: Apple M1 Max
...
systemMemory: 64.00 GB
maxCacheSize: 21.33 GB

Traceback (most recent call last):
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 528, in <module>
    main()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 516, in main
    trainer.fit(
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 1010, in fit
    self.run()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 104, in run
    self._train_epoch()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 126, in _train_epoch
    self._train_step(batch)
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 113, in _train_step
    self.one_step_forward(batch)
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation gradients/tacotron2/decoder/while_grad/tacotron2/decoder/while/Placeholder_0/accumulator: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Merge: GPU CPU 
AddV2: GPU CPU 

Same here when trying to fine-tune the Universal Sentence Encoder (TF-Hub). CPU training works, though slowly. To be able to train at all, hide the GPUs by adding tf.config.set_visible_devices([], 'GPU'). Any updates on this?

InvalidArgumentError: Cannot assign a device for operation Adam/Adam/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available. Colocation Debug Info: Colocation group had the following types and supported devices: Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
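For reference, the CPU-only workaround mentioned above can be sketched as follows. It disables the GPU entirely, so training is slow; it is only a stopgap until the fix ships, and it must run before TensorFlow touches any device.

```python
# Workaround sketch: hide the GPU before the first op or tensor is created,
# so every op falls back to its CPU kernel instead of failing placement.
import tensorflow as tf

tf.config.set_visible_devices([], 'GPU')  # must precede any GPU use

# Confirm that no GPU is visible to TensorFlow anymore.
print(tf.config.get_visible_devices('GPU'))
```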
