How to find the camera transform (or view matrix) in world coordinates from a camera frame

I'm trying to implement a prototype that renders virtual objects in a mixed immersive space on top of the camera frames captured by CameraFrameProvider.

Here is what I have done:

  1. Get the camera's intrinsics from frame.primarySample.parameters.intrinsics
  2. Get the camera's extrinsics from frame.primarySample.parameters.extrinsics
  3. Get the device anchor with worldTrackingProvider.queryDeviceAnchor(atTimestamp: CACurrentMediaTime())
  4. Set up a RealityKit.RealityRenderer to render virtual objects on the captured camera frames:
        let realityRenderer = try RealityKit.RealityRenderer()
        realityRenderer.cameraSettings.colorBackground = .outputTexture()
        let cameraEntity = PerspectiveCamera()
        // see https://developer.apple.com/forums/thread/770235 
        let cameraTransform = deviceAnchor.originFromAnchorTransform * extrinsics.inverse
        
        cameraEntity.setTransformMatrix(cameraTransform, relativeTo: nil)
        cameraEntity.camera.near = 0.01
        cameraEntity.camera.far = 100
        cameraEntity.camera.fieldOfViewOrientation = .horizontal
        // manually calculated based on camera intrinsics
        cameraEntity.camera.fieldOfViewInDegrees = 105 

        realityRenderer.entities.append(cameraEntity)
        realityRenderer.activeCamera = cameraEntity
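
For reference, the field of view in step 4 can be derived from the intrinsics with the usual pinhole relation fov = 2 * atan(width / (2 * fx)). The following is only a minimal sketch: the helper name horizontalFieldOfViewDegrees is hypothetical, and it assumes that fx is the first diagonal entry of the intrinsics matrix (intrinsics.columns.0.x) and that the image width is given in the same pixel units as fx.

```swift
import Foundation

/// Horizontal field of view of a pinhole camera, in degrees.
/// - Parameters:
///   - focalLengthX: fx in pixels (assumed to be intrinsics.columns.0.x).
///   - imageWidth: width of the captured frame in pixels.
/// Hypothetical helper, not part of ARKit or RealityKit.
func horizontalFieldOfViewDegrees(focalLengthX: Double, imageWidth: Double) -> Double {
    // Half of the image width subtends half of the field of view.
    let halfAngleRadians = atan(imageWidth / (2.0 * focalLengthX))
    return 2.0 * halfAngleRadians * 180.0 / Double.pi
}

// Illustrative values only: when fx equals half the image width,
// the half angle is atan(1) = 45 degrees, so the full angle is 90 degrees.
let exampleFieldOfView = horizontalFieldOfViewDegrees(focalLengthX: 960, imageWidth: 1920)
print(exampleFieldOfView)
```

The result could be assigned to cameraEntity.camera.fieldOfViewInDegrees while fieldOfViewOrientation is .horizontal, replacing the hard-coded 105.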

Virtual objects that should be visible in the camera frames are clipped out when I use this camera transform.

If I use deviceAnchor.originFromAnchorTransform alone as the camera transform, virtual objects are rendered on the camera frames, but at the wrong positions (I think this is because the camera extrinsics aren't used to move the camera to the correct position).

My question is: how should I use the camera extrinsic matrix for this purpose?

Do the camera extrinsics describe an orientation similar to the device anchor's, with only a minor rotation and position change? Here are the extrinsics from one camera frame. It seems that the directions of the Y-axis and Z-axis are flipped by the extrinsics, so the camera is pointing in the wrong direction.

simd_float4x4([[0.9914258, 0.012555369, -0.13006608, 0.0], // X-axis
[-0.0009778949, -0.9946325, -0.10346654, 0.0], // Y-axis
[-0.13066702, 0.10270659, -0.98609203, 0.0],  // Z-axis
[0.024519, -0.019568002, -0.058280986, 1.0]]) // translation
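
The observation about the flipped axes can be checked numerically. The sketch below uses the matrix above and tests two things: that its rotation part is a proper rotation (determinant close to +1), and that undoing a Y/Z flip leaves an orientation close to identity. The flip itself is an assumption on my side, not something the documentation confirms: it would correspond to the camera frame using a +Y down, +Z forward convention while a RealityKit camera looks along -Z with +Y up. The names rotationPart, flipYZ, and deviceFromRenderCamera are hypothetical.

```swift
import simd

// Extrinsics copied from the camera frame above (four columns).
let extrinsics = simd_float4x4(columns: (
    SIMD4<Float>(0.9914258, 0.012555369, -0.13006608, 0.0),
    SIMD4<Float>(-0.0009778949, -0.9946325, -0.10346654, 0.0),
    SIMD4<Float>(-0.13066702, 0.10270659, -0.98609203, 0.0),
    SIMD4<Float>(0.024519, -0.019568002, -0.058280986, 1.0)
))

/// Upper-left 3x3 block of a rigid transform (hypothetical helper).
func rotationPart(_ m: simd_float4x4) -> simd_float3x3 {
    simd_float3x3(columns: (
        SIMD3<Float>(m.columns.0.x, m.columns.0.y, m.columns.0.z),
        SIMD3<Float>(m.columns.1.x, m.columns.1.y, m.columns.1.z),
        SIMD3<Float>(m.columns.2.x, m.columns.2.y, m.columns.2.z)
    ))
}

// A determinant near +1 means a proper rotation, not a mirror.
let rotationDeterminant = rotationPart(extrinsics).determinant

// ASSUMPTION: convert between the two axis conventions by negating Y and Z.
let flipYZ = simd_float4x4(diagonal: SIMD4<Float>(1, -1, -1, 1))

// Candidate pose of a RealityKit-style camera relative to the device.
let deviceFromRenderCamera = extrinsics.inverse * flipYZ

// If the assumption holds, this rotation should be close to identity,
// i.e. all three diagonal entries should be close to 1.
let candidateRotation = rotationPart(deviceFromRenderCamera)
let candidateDiagonal = SIMD3<Float>(candidateRotation.columns.0.x,
                                     candidateRotation.columns.1.y,
                                     candidateRotation.columns.2.z)
print(rotationDeterminant, candidateDiagonal)
```

If that holds, one thing to try is deviceAnchor.originFromAnchorTransform * deviceFromRenderCamera as the camera transform. This is untested against a real device and depends on the extrinsics convention being what I assumed.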