I would like to extract depth data for a given point in ARSession.currentFrame.smoothedSceneDepth.
Optimally this would end up looking something like:
ARView.depth(at point: CGPoint)
With the point being in UIKit coordinates just like the points passed to the raycasting methods.
I ultimately want to use this depth data to convert a 2D normalized landmark from a Vision image request into a 3D world-space coordinate in the scene - I only lack accurate depth data for a given 2D point.
What I have available is:
- The normalized landmark from the Vision request.
- Ability to convert this^ to AVFoundation coordinates.
- Ability to convert this^ to screen-space/display coordinates (a sketch of that conversion follows this list).
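For reference, the AVFoundation-to-display conversion can be done with ARFrame.displayTransform(for:viewportSize:). This is only a minimal sketch, assuming a portrait-only layout; the function name and parameters are placeholders of my own:

import ARKit
import RealityKit
import UIKit

/// Sketch: map a normalized AVFoundation image point into UIKit/screen coordinates of the ARView.
/// Assumes portrait orientation; `arView` and `frame` are placeholders for the view and current frame.
func screenPoint(for avFoundationPoint: CGPoint, in arView: ARView, frame: ARFrame) -> CGPoint {
    let viewportSize = arView.bounds.size
    // displayTransform maps normalized image coordinates to normalized viewport coordinates.
    let transform = frame.displayTransform(for: .portrait, viewportSize: viewportSize)
    let normalized = avFoundationPoint.applying(transform)
    return CGPoint(x: normalized.x * viewportSize.width,
                   y: normalized.y * viewportSize.height)
}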
When the depth data is provided correctly, I can combine the 2D position in UIKit/screen-space coordinates with the depth (in meters) to produce an accurate 3D world position using ARView.ray(through:), roughly as sketched below.
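The combination step looks something like this sketch (the names are placeholders; note that scene depth is measured along the camera's forward axis, so treating it as distance along the ray is only exact at the screen center):

import RealityKit
import UIKit
import simd

// Sketch of the combination step; `screenPoint`, `depthInMeters`, and `arView` are placeholders.
// ARKit's scene depth is a Z-depth along the camera's forward axis, so using it directly as
// distance along the ray is an approximation away from the screen center.
func worldPosition(screenPoint: CGPoint, depthInMeters: Float, arView: ARView) -> SIMD3<Float>? {
    guard let ray = arView.ray(through: screenPoint) else { return nil }
    let direction = simd_normalize(ray.direction)
    return ray.origin + direction * depthInMeters
}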
What I have not been able to figure out is how to get this depth value for this coordinate on screen.
I can index the pixel buffer like this:
extension CVPixelBuffer {
    func value(for point: CGPoint) -> Float32 {
        CVPixelBufferLockBaseAddress(self, .readOnly)
        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)
        // Something potentially going wrong here.
        let pixelX = Int(CGFloat(width) * point.x)
        let pixelY = Int(CGFloat(height) * point.y)
        let bytesPerRow = CVPixelBufferGetBytesPerRow(self)
        let baseAddress = CVPixelBufferGetBaseAddress(self)!
        assert(CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32)
        // Step to the start of the requested row, then read the Float32 at the requested column.
        let rowData = baseAddress + pixelY * bytesPerRow
        let distanceAtXYPoint = rowData.assumingMemoryBound(to: Float32.self)[pixelX]
        CVPixelBufferUnlockBaseAddress(self, .readOnly)
        return distanceAtXYPoint
    }
}
And then I try to use this method like so:
guard let depthMap = (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap else { return nil }
// The depth at this coordinate, in meters.
let depthValue = depthMap.value(for: myGivenPoint)
The frame semantics [.smoothedSceneDepth, .sceneDepth] have been set properly on my ARConfiguration, and the depth data is available.
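For completeness, the configuration is set up roughly like this (a sketch, assuming a world-tracking session on a LiDAR device; `arView` is a placeholder for the app's ARView):

import ARKit
import RealityKit

// Sketch: enable both depth frame semantics before running the session.
func runDepthSession(on arView: ARView) {
    let configuration = ARWorldTrackingConfiguration()
    // Scene depth is only supported on LiDAR-equipped devices.
    if ARWorldTrackingConfiguration.supportsFrameSemantics([.smoothedSceneDepth, .sceneDepth]) {
        configuration.frameSemantics.insert([.smoothedSceneDepth, .sceneDepth])
    }
    arView.session.run(configuration)
}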
If I hard-code the width and height values like so:
let pixelX: Int = width / 2
let pixelY: Int = height / 2
I get the correct depth value for the center of the screen. I have only been testing in portrait mode.
But I do not know how to index the depth data for any given point.
With some help I was able to figure out what coordinates and conversion I needed to use.
The Vision result comes in Vision coordinates: normalized, with (0,0) at the bottom-left and (1,1) at the top-right.
AVFoundation coordinates have (0,0) at the top-left and (1,1) at the bottom-right.
To convert from Vision coordinates to AVFoundation coordinates, you flip the Y-axis like so:
public extension CGPoint {
    func convertVisionToAVFoundation() -> CGPoint {
        return CGPoint(x: self.x, y: 1 - self.y)
    }
}
This AVFoundation coordinate is what needs to be used as input for indexing the depth buffer, like so:
import CoreGraphics
import CoreVideo

public extension CVPixelBuffer {

    /// The input point must be in normalized AVFoundation coordinates, i.e. (0,0) is the top-left and (1,1) the bottom-right.
    func value(from point: CGPoint) -> Float? {
        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)
        // Clamp so a coordinate of exactly 1.0 does not index one past the last pixel.
        let colPosition = min(Int(point.x * CGFloat(width)), width - 1)
        let rowPosition = min(Int(point.y * CGFloat(height)), height - 1)
        return value(column: colPosition, row: rowPosition)
    }

    func value(column: Int, row: Int) -> Float? {
        guard CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32 else { return nil }
        CVPixelBufferLockBaseAddress(self, .readOnly)
        defer { CVPixelBufferUnlockBaseAddress(self, .readOnly) }
        guard let baseAddress = CVPixelBufferGetBaseAddress(self) else { return nil }
        // Use bytesPerRow rather than width * stride, since rows can be padded for alignment.
        let bytesPerRow = CVPixelBufferGetBytesPerRow(self)
        let offset = row * bytesPerRow + column * MemoryLayout<Float>.stride
        return baseAddress.load(fromByteOffset: offset, as: Float.self)
    }
}
This is all that is needed to get depth for a given position from a Vision request.
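Putting the two extensions together, usage ends up looking something like this sketch (`visionPoint` and `session` are placeholders for the Vision landmark and the running ARSession):

import ARKit

// Sketch: from a Vision-normalized landmark to a depth value in meters.
func depth(forVisionPoint visionPoint: CGPoint, session: ARSession) -> Float? {
    guard let frame = session.currentFrame,
          let depthMap = (frame.smoothedSceneDepth ?? frame.sceneDepth)?.depthMap else { return nil }
    let avFoundationPoint = visionPoint.convertVisionToAVFoundation()
    return depthMap.value(from: avFoundationPoint)
}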
Here is my body tracking Swift package, which has a 3D hand tracking example that uses this: