I currently use motion capture in an app, and I am intrigued by the new Action Classifiers as a way to detect behaviors, either as a signal to start/end something or to score the user's performance. I am wondering how realistic it is to run the Vision framework with a machine learning model simultaneously with ARKit's motion capture.
It depends a lot on the device you are running on and what other computations your app is doing. One important point is that ARKit runs at 60 FPS, so performing an action classification with the Vision framework on every single frame might be unnecessary. Here is a developer sample that shows how to combine ARKit, Vision, and CoreML: Tracking and Altering Images. In that sample app we decided to run rectangle detection with the Vision framework at 10 FPS, so I would suggest you investigate what frequency is necessary for the action classification to work well.
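As a rough illustration of that throttling idea, here is a minimal sketch of an `ARSessionDelegate` that only runs Vision on roughly every sixth frame (about 10 FPS) and keeps the work off the session's thread. The class name, queue label, the `handle(_:)` helper, and the fixed orientation are assumptions for the example; `VNDetectHumanBodyPoseRequest` stands in for whatever request feeds your action classifier, which would actually consume a window of pose observations rather than a single frame.

```swift
import ARKit
import Vision

// Sketch only: throttle Vision work alongside ARKit's 60 FPS session.
// The pose observations collected here would be windowed and fed to a
// Create ML action classifier (not shown).
final class FrameProcessor: NSObject, ARSessionDelegate {

    // Run Vision at roughly 10 FPS instead of on every ARKit frame.
    private let visionInterval: TimeInterval = 1.0 / 10.0
    private var lastVisionTimestamp: TimeInterval = 0

    // Keep Vision requests off the ARKit delegate/render thread.
    private let visionQueue = DispatchQueue(label: "com.example.visionQueue")
    private var isProcessing = false

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // Skip frames until enough time has passed and no request is in flight.
        guard frame.timestamp - lastVisionTimestamp >= visionInterval,
              !isProcessing else { return }
        lastVisionTimestamp = frame.timestamp
        isProcessing = true

        // capturedImage is a CVPixelBuffer that Vision can consume directly.
        let pixelBuffer = frame.capturedImage

        visionQueue.async { [weak self] in
            defer { self?.isProcessing = false }

            let request = VNDetectHumanBodyPoseRequest()
            // Orientation depends on device/camera setup; .right is just an example.
            let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                orientation: .right)
            do {
                try handler.perform([request])
                if let observations = request.results as? [VNHumanBodyPoseObservation] {
                    self?.handle(observations)
                }
            } catch {
                print("Vision request failed: \(error)")
            }
        }
    }

    private func handle(_ observations: [VNHumanBodyPoseObservation]) {
        // Placeholder: accumulate keypoints into the fixed-size window your
        // action classifier expects, then run the model on that window.
    }
}
```

You can tune `visionInterval` (or switch to a simple frame counter) once you know the lowest rate at which your classifier still performs well.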