MPSGraph and training with self-created functions

The WWDC video showed that a GELU function can be implemented easily using MPSGraph. Can this function also be used for training? And if so, how?
Yes, feel free to get the gradient of the ops by using automatic differentiation, or use the basic math operations to write the GELU gradient by hand, just as we showed when writing GELU itself.
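As a concrete illustration, here is a minimal Swift sketch of both approaches. The shape `[32, 128]` and all variable names are illustrative assumptions, not taken from the video. GELU is composed from basic MPSGraph ops, Option 1 differentiates it automatically with `gradients(of:with:name:)`, and Option 2 builds the analytic derivative from the same primitives:

```swift
import MetalPerformanceShadersGraph

let graph = MPSGraph()

// Input placeholder; the [32, 128] shape is an arbitrary example.
let x = graph.placeholder(shape: [32, 128], dataType: .float32, name: "x")

// GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))), composed from basic ops.
let half  = graph.constant(0.5, dataType: .float32)
let one   = graph.constant(1.0, dataType: .float32)
let sqrt2 = graph.constant(2.0.squareRoot(), dataType: .float32)

let erfTerm = graph.erf(with: graph.division(x, sqrt2, name: nil), name: nil)
let gelu = graph.multiplication(
    graph.multiplication(half, x, name: nil),
    graph.addition(one, erfTerm, name: nil),
    name: nil)

// Option 1: automatic differentiation. Reduce to a scalar "loss" and ask
// the graph for its gradient with respect to x.
let loss = graph.reductionSum(with: gelu, axes: [0, 1], name: nil)
let grads = graph.gradients(of: loss, with: [x], name: nil)
let dLossDx = grads[x]!  // gradient tensor, usable in a training graph

// Option 2: write the GELU gradient by hand from the same basic ops:
// GELU'(x) = 0.5 * (1 + erf(x / sqrt(2))) + x * exp(-x^2 / 2) / sqrt(2 * pi)
let invSqrt2Pi = graph.constant(1.0 / (2.0 * Double.pi).squareRoot(),
                                dataType: .float32)
let gauss = graph.exponent(
    with: graph.negative(
        with: graph.multiplication(half, graph.square(with: x, name: nil),
                                   name: nil),
        name: nil),
    name: nil)
let manualGrad = graph.addition(
    graph.multiplication(half, graph.addition(one, erfTerm, name: nil),
                         name: nil),
    graph.multiplication(x, graph.multiplication(invSqrt2Pi, gauss, name: nil),
                         name: nil),
    name: nil)
```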
Of course, we will stitch all these ops together, and you should get optimal performance with a single kernel launch.
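Continuing the sketch above, a hypothetical driver shows that executing the whole stitched graph, forward output plus gradient, is a single run call on the caller's side (the input values here are placeholders):

```swift
import Metal

// Feed data into the placeholder and fetch both target tensors in one run.
let device = MTLCreateSystemDefaultDevice()!
let values = [Float](repeating: 0.1, count: 32 * 128)
let xData = MPSGraphTensorData(
    device: MPSGraphDevice(mtlDevice: device),
    data: values.withUnsafeBufferPointer { Data(buffer: $0) },
    shape: [32, 128],
    dataType: .float32)

let results = graph.run(
    feeds: [x: xData],
    targetTensors: [gelu, dLossDx],
    targetOperations: nil)
// results[gelu] and results[dLossDx] hold the forward output and gradient.
```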