VisionOS ARKit CameraFrame Sample Parameters Extrinsics

The following documentation tells me that CameraFrame.Sample.Parameters.extrinsics is of type simd_float4x4, great! https://developer.apple.com/documentation/arkit/cameraframe/sample/parameters/4443449-extrinsics
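
For context, this is roughly how I get hold of these parameters. A minimal sketch, assuming the main camera access APIs and the corresponding entitlement; error handling is omitted and `observeCameraFrames` is just a name I made up:

```swift
import ARKit

// Sketch: read the extrinsics (and intrinsics) from incoming camera
// frames. Assumes the "main camera access" entitlement is present.
func observeCameraFrames() async throws {
    let session = ARKitSession()
    let provider = CameraFrameProvider()

    // Pick a supported format for the left main camera.
    guard let format = CameraVideoFormat
        .supportedVideoFormats(for: .main, cameraPositions: [.left])
        .first else { return }

    try await session.run([provider])

    guard let updates = provider.cameraFrameUpdates(for: format) else { return }
    for await frame in updates {
        guard let sample = frame.sample(for: .left) else { continue }
        let extrinsics: simd_float4x4 = sample.parameters.extrinsics
        print(extrinsics)
    }
}
```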

I have read in the answer to another post that this extrinsics matrix represents the pose of the physical camera relative to the device anchor.

  1. Did I understand correctly that the device anchor is where the scene is rendered from onto the user's display?
  2. In which coordinate system is this offset defined? Which axis is left, which is up, and which is forward?
  3. The last column of the extrinsics seems to define a translation of approximately 2 cm along the x axis, -2 cm along the y axis, and -5 cm along the z axis. I tried to measure the physical distance between the main left and right cameras to find out whether it is closer to 2 cm or 5 cm from the "middle"; it looks more like 5 cm, so I assume the z axis points towards the right (from the user's perspective). Is that so? For x and y, I assume the physical camera is approximately 2 cm in front of the user and 2 cm below; which of x and y is horizontal, and which is vertical?
  4. How is the camera image indexed: is it row-major, and is the origin at the top left?

I am looking forward to learning about all the details of these extrinsics in order to make use of them.

Answered by DaveloperAtComerge in 817393022

Did I understand correctly that the device anchor is where the scene is rendered from onto the user's display?

No. Rendering happens for each eye, from the eye's position. But this is an implementation detail that hopefully should not make a difference for app developers.

In which coordinate system is this offset defined? Which axis is left, which is up, and which is forward?

It is defined in the device anchor's coordinate system. IIRC for the device anchor, the X axis should be to the right when viewed from the user's perspective, the Y axis up, and the Z axis towards the user (right-handed). But you should be able to convert vectors/transforms from the camera's coordinate system to the device anchor's coordinate system via the extrinsic matrix "blindly", without needing to worry about or make assumptions about these details.
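
For illustration, converting a point with such a 4x4 matrix looks like this. A sketch using plain simd; which direction the matrix goes (and whether it needs inverting) is clarified later in this thread:

```swift
import simd

// Sketch: apply a 4x4 rigid transform to a 3D point via homogeneous
// coordinates. Pass `extrinsics` or `extrinsics.inverse` depending on
// the direction you need (see the accepted answer).
func transformPoint(_ point: SIMD3<Float>, by matrix: simd_float4x4) -> SIMD3<Float> {
    let h = matrix * SIMD4<Float>(point, 1)
    return SIMD3<Float>(h.x, h.y, h.z)
}
```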

I am looking forward to learning about all the details of these extrinsics in order to make use of them.

If you have a use case which requires more detailed information about the extrinsics, it would be great if you let us know, either here in the forum or via the Feedback Assistant.

Thanks a lot for all those clarifications! I see at least two use cases in which understanding the camera extrinsics is crucial:

  • An object is tracked with a non-generic algorithm which allows for much higher tracking accuracy in the specific use case than any out-of-the-box tracking solution. A pose is computed relative to the camera; where is the object in world space?
  • A gridded sheet is tracked using ARKit image tracking via a high-feature texture in its center. The user can color each cell of the grid with a set of distinct colors, which the system should interpret. Given a 3D coordinate in world space, which pixel area in the camera frame corresponds to it? (See the projection sketch right after these bullets.)
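
To make the second use case concrete, here is a sketch of projecting a world-space point into the camera image. All names are mine, the extrinsics direction follows the accepted answer below, and the sign conventions (camera looking down -Z, image origin at the top left) are assumptions that questions 3 and 4 at the top of the thread are trying to pin down:

```swift
import simd

// Sketch: project a world-space point to pixel coordinates.
// `worldFromDevice` is deviceAnchor.originFromAnchorTransform;
// `extrinsics`/`intrinsics` come from sample.parameters.
func project(worldPoint: SIMD3<Float>,
             worldFromDevice: simd_float4x4,
             extrinsics: simd_float4x4,
             intrinsics: simd_float3x3) -> SIMD2<Float>? {
    // world -> camera: per the accepted answer, the camera's pose in
    // device-anchor space is extrinsics.inverse, so the forward
    // (device -> camera) map is the extrinsics itself.
    let cameraFromWorld = extrinsics * worldFromDevice.inverse
    let p = cameraFromWorld * SIMD4<Float>(worldPoint, 1)

    // Assuming the camera looks down -Z; points behind it don't project.
    guard p.z < 0 else { return nil }

    // Pinhole projection; the vertical sign may need flipping depending
    // on the image origin convention (the indexing question above).
    let uvw = intrinsics * SIMD3<Float>(p.x / -p.z, p.y / -p.z, 1)
    return SIMD2<Float>(uvw.x, uvw.y)
}
```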
  1. We are now using the WorldTrackingProvider's queryDeviceAnchor with the current timestamp CACurrentMediaTime(). Is multiplying that with the camera extrinsics the correct approach to get the camera's transform in world space? (See the sketch after this list.)

  2. For debugging purposes, we are now drawing the captured frames onto a canvas which we position one meter in front of the camera location (as described above), with a pixel density of 1 / focal length, together with a small sphere at the center of the canvas and a tube going from the camera to the center of the canvas. It looks like the "left camera" really is the right camera (from the user's perspective); is that correct?

  3. When rendered for the right eye, the tube seems to be pointing perfectly forward, coming from slightly to the top left of the display. Does this mean that the whole scene is rendered from the camera's position? If not, what does it mean?

  4. Unfortunately, we have still not been able to display the tracked object at the correct pose in world space; there is a consistent offset which is very similar to the offset between the passthrough and the rendered frame on the canvas. Thanks a lot for your assistance so far, it is very much appreciated! I will try to test all the assumptions in a minimal project, which we could share if that helps; I will keep you posted here if I make any progress.
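
To make question 1 above concrete, this is roughly what we do right now. A sketch; `worldTracking` is our already-running WorldTrackingProvider, and whether `extrinsics` needs inverting here is exactly what I am asking:

```swift
import ARKit
import QuartzCore

// Sketch of our current approach: combine the device anchor queried at
// the current media time with the camera extrinsics to place the
// camera in world space.
func cameraInWorldSpace(worldTracking: WorldTrackingProvider,
                        extrinsics: simd_float4x4) -> simd_float4x4? {
    guard let deviceAnchor = worldTracking.queryDeviceAnchor(
        atTimestamp: CACurrentMediaTime()) else { return nil }

    // world <- device anchor, then (we assume) device anchor <- camera.
    // Whether `extrinsics` or `extrinsics.inverse` belongs here is
    // resolved in the accepted answer below.
    return deviceAnchor.originFromAnchorTransform * extrinsics
}
```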

Accepted Answer

I stand corrected about the left camera being the right one from the user's perspective. That conclusion was made because you said the extrinsics are in a coordinate system in which the x axis goes towards the user's right, and the extrinsics seem to have a translation with an x component of about 2.5 cm, which would mean that the camera is to the right.

After testing by putting my finger on the actual physical cameras, I saw that it is indeed the left camera. So naturally I asked myself: what am I doing wrong when interpreting the extrinsics?

Well it turns out: the extrinsics do not define the transformation from the device anchor to the camera, but from the camera to the device anchor. I had to invert the matrix, everything works now.
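
In code, the fix is a one-line change. A sketch; `worldFromDevice` is deviceAnchor.originFromAnchorTransform and `extrinsics` is sample.parameters.extrinsics:

```swift
import simd

// The fix: invert the extrinsics to get the camera's pose relative to
// the device anchor, then bring it into world space.
func worldFromCamera(worldFromDevice: simd_float4x4,
                     extrinsics: simd_float4x4) -> simd_float4x4 {
    // Wrong: worldFromDevice * extrinsics
    worldFromDevice * extrinsics.inverse
}
```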
