I would like to extract depth data for a given point in ARSession.currentFrame.smoothedSceneDepth.
Ideally this would end up looking something like:
ARView.depth(at point: CGPoint)
With the point being in UIKit coordinates just like the points passed to the raycasting methods.
I ultimately want to use this depth data to convert a 2D normalized landmark from a Vision image request into a 3D world-space coordinate in the scene - the only thing I am missing is accurate depth data for a given 2D point.
What I have available is:
- The normalized landmark from the Vision request.
- Ability to convert that landmark to AVFoundation coordinates.
- Ability to convert that landmark to screen-space/display coordinates (a sketch of that conversion follows this list).
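For reference, the screen-space conversion I use is roughly the sketch below. It assumes the Vision request ran on the frame's capturedImage and leans on ARFrame.displayTransform(for:viewportSize:); the function name and parameters are my own, not an established API.

import ARKit
import UIKit

/// Sketch: map a normalized AVFoundation-style image point ((0,0) top-left, (1,1) bottom-right)
/// to UIKit screen coordinates. `viewportSize` would typically be arView.bounds.size.
func screenPoint(fromNormalizedImagePoint point: CGPoint,
                 in frame: ARFrame,
                 viewportSize: CGSize,
                 orientation: UIInterfaceOrientation = .portrait) -> CGPoint {
    // displayTransform maps normalized image coordinates into normalized
    // view coordinates for the given orientation and viewport.
    let transform = frame.displayTransform(for: orientation, viewportSize: viewportSize)
    let normalizedViewPoint = point.applying(transform)
    return CGPoint(x: normalizedViewPoint.x * viewportSize.width,
                   y: normalizedViewPoint.y * viewportSize.height)
}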
When the depth data is provided correctly, I can combine the 2D position in UIKit/screen-space coordinates with the depth (in meters) to produce an accurate 3D world position using ARView.ray(through:).
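For reference, here is a rough sketch of that combination. The helper name worldPosition(at:depth:) is mine, and it assumes the scene-depth value is measured along the camera's forward axis, so the ray is rescaled so that its forward component matches the depth.

import ARKit
import RealityKit
import simd

extension ARView {
    /// Sketch: turn a screen-space point plus a depth value (meters) into a world-space position.
    func worldPosition(at screenPoint: CGPoint, depth: Float) -> SIMD3<Float>? {
        guard let ray = self.ray(through: screenPoint),
              let camera = self.session.currentFrame?.camera else { return nil }

        // Camera forward vector in world space (the camera looks down its local -Z axis).
        let forward = -SIMD3<Float>(camera.transform.columns.2.x,
                                    camera.transform.columns.2.y,
                                    camera.transform.columns.2.z)

        // Scene depth is a z-depth, so scale the ray so that its component
        // along the camera's forward direction equals `depth`.
        let direction = normalize(ray.direction)
        let distanceAlongRay = depth / dot(direction, forward)
        return ray.origin + direction * distanceAlongRay
    }
}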
What I have not been able to figure out is how to get this depth value for this coordinate on screen.
I can index the pixel buffer like this:
extension CVPixelBuffer {
    func value(for point: CGPoint) -> Float32 {
        CVPixelBufferLockBaseAddress(self, .readOnly)
        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)
        // Something potentially going wrong here.
        let pixelX = Int(CGFloat(width) * point.x)
        let pixelY = Int(CGFloat(height) * point.y)
        let bytesPerRow = CVPixelBufferGetBytesPerRow(self)
        let baseAddress = CVPixelBufferGetBaseAddress(self)!
        assert(CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32)
        let rowData = baseAddress + pixelY * bytesPerRow
        let distanceAtXYPoint = rowData.assumingMemoryBound(to: Float32.self)[pixelX]
        CVPixelBufferUnlockBaseAddress(self, .readOnly)
        return distanceAtXYPoint
    }
}
And then try to use this method like so:
guard let depthMap = (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap else { return nil }

// The depth at this coordinate, in meters.
let depthValue = depthMap.value(for: myGivenPoint)
The frame semantics [.smoothedSceneDepth, .sceneDepth] have been set properly on my ARConfiguration. The depth data is available.
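In case it is relevant, the setup looks roughly like the following sketch (arView stands in for my RealityKit view; the supportsFrameSemantics check is my addition).

import ARKit
import RealityKit

let arView = ARView(frame: .zero) // placeholder; in the real app this is the on-screen ARView

let configuration = ARWorldTrackingConfiguration()
// Scene depth requires LiDAR-capable hardware, so check support before opting in.
if ARWorldTrackingConfiguration.supportsFrameSemantics([.smoothedSceneDepth, .sceneDepth]) {
    configuration.frameSemantics.insert([.smoothedSceneDepth, .sceneDepth])
}
arView.session.run(configuration)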
If I hard-code the width and height values like so:
let pixelX: Int = width / 2
let pixelY: Int = height / 2
I get the correct depth value for the center of the screen. I have only been testing in portrait mode.
But I do not know how to index the depth data for any given point.
With some help I was able to figure out which coordinates and conversions I needed to use.
The Vision result comes in Vision coordinates: normalized, (0,0) Bottom-Left, (1,1) Top-Right.
AVFoundation coordinates are (0,0) Top-Left, (1,1) Bottom-Right.
To convert from Vision coordinates to AVFoundation coordinates, you must flip the Y-axis like so:
public extension CGPoint {
    func convertVisionToAVFoundation() -> CGPoint {
        return CGPoint(x: self.x, y: 1 - self.y)
    }
}
This AVFoundation coordinate is what needs to be used as input for indexing the depth buffer, like so:
public extension CVPixelBuffer {
    /// The input point must be in normalized AVFoundation coordinates,
    /// i.e. (0,0) in the Top-Left, (1,1) in the Bottom-Right.
    func value(from point: CGPoint) -> Float? {
        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)
        let colPosition = Int(point.x * CGFloat(width))
        let rowPosition = Int(point.y * CGFloat(height))
        return value(column: colPosition, row: rowPosition)
    }

    func value(column: Int, row: Int) -> Float? {
        guard CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32 else { return nil }
        CVPixelBufferLockBaseAddress(self, .readOnly)
        if let baseAddress = CVPixelBufferGetBaseAddress(self) {
            let width = CVPixelBufferGetWidth(self)
            let index = column + (row * width)
            let offset = index * MemoryLayout<Float>.stride
            let value = baseAddress.load(fromByteOffset: offset, as: Float.self)
            CVPixelBufferUnlockBaseAddress(self, .readOnly)
            return value
        }
        CVPixelBufferUnlockBaseAddress(self, .readOnly)
        return nil
    }
}
This is all that is needed to get depth for a given position from a Vision request.
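Putting the pieces together, a minimal sketch of the lookup might look like this; worldDepth(for:in:) is a hypothetical helper name built on the extensions above.

import ARKit
import RealityKit

/// Sketch: look up the scene depth, in meters, for a normalized Vision landmark.
func worldDepth(for visionPoint: CGPoint, in arView: ARView) -> Float? {
    guard let frame = arView.session.currentFrame,
          let depthMap = (frame.smoothedSceneDepth ?? frame.sceneDepth)?.depthMap else { return nil }
    // Flip from Vision (bottom-left origin) to AVFoundation (top-left origin) coordinates.
    let avPoint = visionPoint.convertVisionToAVFoundation()
    return depthMap.value(from: avPoint)
}

That depth can then be combined with ARView.ray(through:), as described in the question, to place the landmark in world space.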
Here is my body tracking Swift package, which has a 3D hand tracking example that uses this: