People Occlusion + Scene Understanding on visionOS

In ARKit for iPad, I could 1) build a mesh on top of the real world and 2) request a people occlusion map for use with my application, so people could move behind or in front of virtual content via compositing. However, in visionOS there is no ARFrame image to pass to the function that generates the occlusion data. Is it possible to do people occlusion in visionOS? If so, how is it done: through a data provider, or automatically when passthrough is enabled? If it's not possible, is this something that might have a solution in future updates as the platform develops? Being able to combine virtual content and the real world, with people able to interact with the content convincingly, is a really important aspect of AR, so it would make sense for this to be possible.
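For reference, this is roughly the iPad workflow being described; the function that takes an ARFrame and produces the occlusion data is presumably ARMatteGenerator. A minimal sketch on iPadOS (LiDAR device assumed; frame and commandBuffer would come from the render loop):

import ARKit
import Metal

let session = ARSession()
let device = MTLCreateSystemDefaultDevice()!

// Configure world tracking with a scene mesh and people-occlusion data.
let configuration = ARWorldTrackingConfiguration()
if ARWorldTrackingConfiguration.supportsSceneReconstruction(.mesh) {
    configuration.sceneReconstruction = .mesh                          // 1) mesh over the real world
}
if ARWorldTrackingConfiguration.supportsFrameSemantics(.personSegmentationWithDepth) {
    configuration.frameSemantics.insert(.personSegmentationWithDepth)  // 2) people occlusion data
}
session.run(configuration)

// In a custom Metal renderer, the per-frame occlusion matte is generated from an ARFrame:
let matteGenerator = ARMatteGenerator(device: device, matteResolution: .half)
// let matte = matteGenerator.generateMatte(from: frame, commandBuffer: commandBuffer)
// let dilatedDepth = matteGenerator.generateDilatedDepth(from: frame, commandBuffer: commandBuffer)

On visionOS there is no ARFrame (no camera pixel access) to feed into anything like generateMatte(from:), which is the gap being asked about.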

From the video, People Occlusion seemed to be working. https://twitter.com/tokufxug/status/1666767606475530240

@KTRosenberg I am also very concerned about the lack of ability to sense/interact with the world around you. It seems you will only get what xrOS decides to give you, which is currently simple things like walls, floors, etc. As you've no doubt noticed, there are repeated (and valid) references to user privacy in the sessions. The downside is that we are beholden to what xrOS shares with us; we can't implement any novel object detection of our own. This is of course v1, but it is vital to be able to sense/interact with the world around us for the AR vision to become a reality.

I’m especially concerned that: (1) passthrough mode requires RealityKit, which is very limiting, because I want to do the work to create things using Metal; and (2) VR mode doesn’t let you move around.

(2) destroys most ideas for interesting VR use cases and could be solved by introducing a user-defined safe boundary. (Not to compare with the competition, but this is the standard solution.)

(1) is a problem because it limits the creative style choices in the design of content and prevents the programmer from using their own engine in mixed reality mode. RealityKit is great for getting started, but when you want more control, it can get in the way. I know Metal can be used to generate resources for RealityKit, but this is also very limited and adds a ton of friction for the programmer (see the sketch at the end of this post). RealityKit limits the flexibility in how data are represented, and maybe people don’t want the style of RealityKit visuals in mixed reality. To solve (1), you would enable Metal with passthrough mode. However, without depth and lighting data, I understand you can’t achieve realistic lighting mixed with the real world. Either you enable a permission system so people know the risks of granting camera data, or an easier in-between solution would be the following:

I’ve thought about it a bit: rather than granting pixel data, have a special visionOS Metal command that says to draw triangles with the passthrough drawn on top in screen space, with occlusion, all handled behind the scenes by the system without the code ever touching the pixels. Use Apple-approved extension shaders for lighting. For example, there’s research showing it’s useful to have portals in VR that act as windows into the real world. Being able to specify a surface for the camera feed would enable this and many other use cases. The API would look like the old OpenGL 1.0 fixed-function pipeline: the program provides parameters, and the backend and compositor handle things without ever giving you the data directly. Like glFog:

[commandBuffer enableXRPassthroughPipeline withOptions:OCCLUSION];

[renderEncoder drawWithXRPassthroughComposition: triangle buffer… withRaytracing:YES];

These would let the programmer specify which geometry should be composed with the passthrough background, enabling either a full passthrough background or portals into the real world, with occlusion and lighting handled automatically. Occlusion tests would be disabled while this mode is enabled. The compositor would take steps to ensure the passthrough pixels stay separate from your render, and it could apply fixed-function raytracing.
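To show the shape of the idea, here is the same proposal sketched in Swift. Everything below is hypothetical: none of these methods exist in Metal or visionOS; they only illustrate a fixed-function-style API where the app submits geometry and options and never sees the camera pixels.

import Metal

// HYPOTHETICAL API SKETCH — none of these calls exist today.
func encodePassthroughPortals(commandBuffer: MTLCommandBuffer,
                              encoder: MTLRenderCommandEncoder,
                              portalVertices: MTLBuffer,
                              vertexCount: Int) {
    // Hypothetical: mark this frame as using system-composited passthrough with occlusion.
    commandBuffer.enableXRPassthroughPipeline(options: [.occlusion])

    // Hypothetical: draw portal geometry; the compositor fills it with the camera feed
    // and applies lighting/raytracing, so the app never touches passthrough pixels.
    encoder.drawWithXRPassthroughComposition(vertexBuffer: portalVertices,
                                             vertexCount: vertexCount,
                                             raytracing: true)
}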

I for one would prefer just implementing a robust permissions system, but this is the best I can think of otherwise.

Overall, I think RealityKit is too high-level, and a Metal-based solution allowing for passthrough, mobility, and some more control via a few fixed-function commands would work.
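To make the Metal-to-RealityKit friction mentioned above concrete: today the bridge is essentially rendering into a texture that RealityKit then samples, e.g. via TextureResource.DrawableQueue, rather than driving the display pipeline yourself. A rough sketch (descriptor details may differ slightly; baseColorTexture stands in for a TextureResource already used by some material):

import RealityKit
import Metal

// Attach a queue of Metal-writable drawables to an existing RealityKit texture.
func attachMetalRenderedTexture(to baseColorTexture: TextureResource) throws -> TextureResource.DrawableQueue {
    let descriptor = TextureResource.DrawableQueue.Descriptor(
        pixelFormat: .bgra8Unorm,
        width: 1024,
        height: 1024,
        usage: [.renderTarget, .shaderRead],
        mipmapsMode: .none
    )
    let drawableQueue = try TextureResource.DrawableQueue(descriptor)
    baseColorTexture.replace(withDrawables: drawableQueue)
    return drawableQueue
}

// Per frame, from your own Metal render loop:
func renderFrame(into drawableQueue: TextureResource.DrawableQueue) throws {
    let drawable = try drawableQueue.nextDrawable()
    // ... encode a Metal render pass whose color attachment is drawable.texture ...
    drawable.present()
}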

@KTRosenberg From looking at the SDK, it looks like SceneReconstructionProvider (https://developer.apple.com/documentation/arkit/scenereconstructionprovider) will be able to get you the raw detected geometry as MeshAnchor objects (https://developer.apple.com/documentation/arkit/meshanchor).

This doesn't solve the issue of pixel data access, but gives an idea of the lowest level scene understanding you'll be able to get (so far).
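A minimal sketch of what that looks like (run from inside an ImmersiveSpace; the system handles the world-sensing permission prompt):

import ARKit

let session = ARKitSession()
let sceneReconstruction = SceneReconstructionProvider()

func watchSceneMesh() async {
    guard SceneReconstructionProvider.isSupported else { return }
    do {
        try await session.run([sceneReconstruction])
        // Stream of MeshAnchor updates describing the reconstructed environment mesh.
        for await update in sceneReconstruction.anchorUpdates {
            let meshAnchor = update.anchor   // .geometry exposes vertex/normal/face buffers
            switch update.event {
            case .added, .updated:
                print("mesh \(meshAnchor.id): \(meshAnchor.geometry.faces.count) faces")
            case .removed:
                print("mesh \(meshAnchor.id) removed")
            }
        }
    } catch {
        print("ARKitSession error: \(error)")
    }
}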
