Hello,
Let me chime in, as I am also looking for a lip-sync solution for Vision Pro.
ARKit face tracking is not the same as a lip-sync SDK, and it cannot replace one. If a virtual character needs lip-sync (for interaction between you and the virtual character), another headset brand offers an SDK that outputs a value for the phoneme it detects. You can then hook that value into your own animation system to create realistic facial animation. A simple approach is blendshapes, but in Unity you would typically route it through an Animator, where you can tweak how the shapes interact and how they transition.
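To make that hook-up concrete, here is a minimal Unity C# sketch. The class name, the detector callback, and the viseme labels are placeholders I made up for illustration, not any particular SDK's API; the point is just routing a per-phoneme value into blendshape weights (or, alternatively, an Animator parameter) with some smoothing.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical driver: whatever detector you use calls OnPhonemeDetected()
// with the phoneme label it heard and a 0..1 weight (confidence or volume).
public class PhonemeBlendShapeDriver : MonoBehaviour
{
    [SerializeField] SkinnedMeshRenderer face;  // mesh that has the viseme blendshapes
    [SerializeField] float smoothing = 12f;     // higher = snappier transitions

    // Map phoneme labels to blendshape indices on the face mesh (example labels).
    readonly Dictionary<string, int> visemeIndex = new Dictionary<string, int>
    {
        { "A", 0 }, { "I", 1 }, { "U", 2 }, { "E", 3 }, { "O", 4 }
    };

    readonly Dictionary<string, float> targetWeights = new Dictionary<string, float>();

    public void OnPhonemeDetected(string phoneme, float weight)
    {
        // Push the detected phoneme up and let the others fall back to zero.
        foreach (var key in visemeIndex.Keys)
            targetWeights[key] = key == phoneme ? Mathf.Clamp01(weight) : 0f;
    }

    void LateUpdate()
    {
        foreach (var pair in visemeIndex)
        {
            float current = face.GetBlendShapeWeight(pair.Value) / 100f;
            targetWeights.TryGetValue(pair.Key, out float target);
            float next = Mathf.Lerp(current, target, Time.deltaTime * smoothing);
            face.SetBlendShapeWeight(pair.Value, next * 100f);

            // Instead of writing blendshapes directly, you could forward the value
            // to an Animator float parameter and let a blend tree handle transitions:
            // animator.SetFloat(pair.Key, next);
        }
    }
}
```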
Now that Apple is about to deliver the Vision Pro, it would be really helpful to have an SDK that uses the Neural Engine to deliver real-time prediction of which phoneme is being spoken, based on the audio.
For now we use an open-source lip-sync framework called uLipSync. It runs fine on Metal and uses some nice Unity Burst compilation. But it does not predict ahead, it only classifies the audio it has already heard, so there is some latency.
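For reference, this is roughly how we bridge uLipSync into a driver like the one above. The member names here (uLipSync.LipSyncInfo, info.phoneme, info.volume, and the onLipSyncUpdate event) are written from memory of the uLipSync README, so check them against the package version you actually install.

```csharp
using UnityEngine;

// Bridge between uLipSync's per-frame analysis callback and our blendshape driver.
public class ULipSyncBridge : MonoBehaviour
{
    [SerializeField] PhonemeBlendShapeDriver driver;

    // Register this method on the uLipSync component's onLipSyncUpdate event
    // (in the Inspector or via AddListener) so each analysis frame updates the face.
    public void OnLipSyncUpdate(uLipSync.LipSyncInfo info)
    {
        // uLipSync classifies the current audio frame only; it cannot look ahead,
        // which is where the latency mentioned above comes from.
        driver.OnPhonemeDetected(info.phoneme, info.volume);
    }
}
```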