Screen Space Coordinates to CVPixelBuffer Coordinates

I would like to extract depth data for a given point in ARSession.currentFrame.smoothedSceneDepth. Ideally this would end up looking something like ARView.depth(at point: CGPoint), with the point being in UIKit coordinates, just like the points passed to the raycasting methods.

I ultimately want to use this depth data to convert a 2D normalized landmark from a Vision image request into a 3D world-space coordinate in the scene - I only lack accurate depth data for a given 2D point.

What I have available is:

  • The normalized landmark from the Vision request.
  • The ability to convert that landmark to AVFoundation coordinates.
  • The ability to convert that landmark to screen-space/display coordinates.

When the depth data is provided correctly, I can combine the 2D position in UIKit/screen-space coordinates with the depth (in meters) to produce an accurate 3D world position using ARView.ray(through:). What I have not been able to figure out is how to get the depth value for a given coordinate on screen.

I can index the pixel buffer like this:

extension CVPixelBuffer {

    func value(for point: CGPoint) -> Float32 {

        CVPixelBufferLockBaseAddress(self, .readOnly)

        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)

        //Something potentially going wrong here.
        let pixelX = Int(CGFloat(width) * point.x)
        let pixelY = Int(CGFloat(height) * point.y)

        let bytesPerRow = CVPixelBufferGetBytesPerRow(self)
        let baseAddress = CVPixelBufferGetBaseAddress(self)!

        assert(kCVPixelFormatType_DepthFloat32 == CVPixelBufferGetPixelFormatType(self))

        let rowData = baseAddress + pixelY * bytesPerRow

        let distanceAtXYPoint = rowData.assumingMemoryBound(to: Float32.self)[pixelX]

        CVPixelBufferUnlockBaseAddress(self, .readOnly)

        return distanceAtXYPoint
    }
}

And then try to use this method like so:

guard let depthMap = (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap else { return nil }

//The depth at this coordinate, in meters.
let depthValue = depthMap.value(for: myGivenPoint)

The frame semantics [.smoothedSceneDepth, .sceneDepth] have been set properly on my ARConfiguration. The depth data is available.

If I hard-code the width and height values like so:

        let pixelX: Int = width / 2
        let pixelY: Int = height / 2

I get the correct depth value for the center of the screen. I have only been testing in portrait mode.

But I do not know how to index the depth data for any given point.

Answered by CodeName in 712036022

Okay, after finding this question and trying what it said, I made some progress. However, I am attempting to use arView.session.currentFrame.smoothedSceneDepth and not arView.session.currentFrame.estimatedDepthData.

Here is the updated extension:

extension CVPixelBuffer {

    func value(from point: CGPoint) -> Float? {

        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)

        //clamped(_:_:) here is a custom helper, not a standard API; it restricts the value to the given range.
        let normalizedYPosition = ((point.y / UIScreen.main.bounds.height) * 1.3).clamped(0, 1.0)

        let colPosition = Int(normalizedYPosition * CGFloat(height))

        let rowPosition = Int((1 - (point.x / UIScreen.main.bounds.width)) * CGFloat(width) * 0.8)

        return value(column: colPosition, row: rowPosition)
    }

    func value(column: Int, row: Int) -> Float? {

        guard CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32 else { return nil }

        CVPixelBufferLockBaseAddress(self, .readOnly)

        if let baseAddress = CVPixelBufferGetBaseAddress(self) {

            let width = CVPixelBufferGetWidth(self)
            let index = column + (row * width)
            let offset = index * MemoryLayout<Float>.stride
            let value = baseAddress.load(fromByteOffset: offset, as: Float.self)

            CVPixelBufferUnlockBaseAddress(self, .readOnly)

            return value
        }

        CVPixelBufferUnlockBaseAddress(self, .readOnly)

        return nil
    }
}

Note that point.y is associated with the column position and point.x with the row position, so the buffer appears to be rotated relative to the view. I suspect there is some conversion between coordinate spaces that I need to be doing that I am unaware of. To get this close to working, I had to multiply the normalized Y position by 1.3 and the X position by 0.8, as well as invert the X axis by subtracting it from 1.

The app still consistently crashes on this line: let value = baseAddress.load(fromByteOffset: offset, as: Float.self)
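
That crash is consistent with the ad-hoc 1.3 and 0.8 scale factors pushing column or row outside the buffer, so the computed byte offset lands past the end of the base address. As a minimal sketch (the clampedValue(column:row:) helper is hypothetical, not part of the code above), a bounds check before loading would return nil instead of crashing:

import CoreVideo

extension CVPixelBuffer {

    ///Hypothetical bounds-checked wrapper around value(column:row:).
    ///Returns nil for pixels outside the buffer instead of reading past the base address.
    func clampedValue(column: Int, row: Int) -> Float? {

        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)

        guard (0..<width).contains(column), (0..<height).contains(row) else { return nil }

        return value(column: column, row: row)
    }
}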

Accepted Answer

With some help I was able to figure out what coordinates and conversion I needed to use.

The Vision result comes in Vision coordinates: normalized, (0,0) Bottom-Left, (1,1) Top-Right.

AVFoundation coordinates are (0,0) Top-Left, (1,1) Bottom-Right.

To convert from Vision coordinates to AVFoundation coordinates, you must flip the Y-axis like so:

public extension CGPoint {
    func convertVisionToAVFoundation() -> CGPoint {
        return CGPoint(x: self.x, y: 1 - self.y)
    }
}

This AVFoundation coordinate is what needs to be used as input for indexing the depth buffer, like so:

public extension CVPixelBuffer {

    ///The input point must be in normalized AVFoundation coordinates, i.e. (0,0) is the top-left and (1,1) the bottom-right.
    func value(from point: CGPoint) -> Float? {

        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)

        let colPosition = Int(point.x * CGFloat(width))
        let rowPosition = Int(point.y * CGFloat(height))

        return value(column: colPosition, row: rowPosition)
    }

    func value(column: Int, row: Int) -> Float? {

        guard CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32 else { return nil }

        CVPixelBufferLockBaseAddress(self, .readOnly)

        if let baseAddress = CVPixelBufferGetBaseAddress(self) {

            let width = CVPixelBufferGetWidth(self)

            //Assumes the rows are tightly packed (no padding); otherwise use CVPixelBufferGetBytesPerRow.
            let index = column + (row * width)
            let offset = index * MemoryLayout<Float>.stride
            let value = baseAddress.load(fromByteOffset: offset, as: Float.self)

            CVPixelBufferUnlockBaseAddress(self, .readOnly)

            return value
        }

        CVPixelBufferUnlockBaseAddress(self, .readOnly)

        return nil
    }
}

This is all that is needed to get depth for a given position from a Vision request.

Here is my body tracking swift package that has a 3D hand tracking example that uses this:

https://github.com/Reality-Dev/BodyTracking
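
For reference, here is a minimal usage sketch tying the two extensions together. The names visionPoint (a normalized landmark from a Vision observation) and currentFrame (the ARFrame the request was performed on) are assumptions, not part of the code above:

//visionPoint is in normalized Vision coordinates (0,0 bottom-left).
let avFoundationPosition = visionPoint.convertVisionToAVFoundation()

if let depthMap = (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap,
   let depthAtPoint = depthMap.value(from: avFoundationPosition) {

    //The depth at this landmark, in meters.
    print("Depth at landmark: \(depthAtPoint) m")
}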

However, if you would like to find the position on screen for use with something such as UIKit or ARView.ray(through:), further transformation is required.

The Vision request was performed on arView.session.currentFrame.capturedImage.

arView.session.currentFrame is an ARFrame.

From the documentation on ARFrame.displayTransform(for:viewportSize:):

Normalized image coordinates range from (0,0) in the upper left corner of the image to (1,1) in the lower right corner. This method creates an affine transform representing the rotation and aspect-fit crop operations necessary to adapt the camera image to the specified orientation and to the aspect ratio of the specified viewport. The affine transform does not scale to the viewport's pixel size. The capturedImage pixel buffer is the original image captured by the device camera, and thus not adjusted for device orientation or view aspect ratio.

So the image being rendered on screen is a cropped version of the frame that the camera captures, and a transformation is needed to go from AVFoundation coordinates to display (UIKit) coordinates.

Converting from AVFoundation coordinates to display (UIKit) coordinates:



public extension ARView {

    func convertAVFoundationToScreenSpace(_ point: CGPoint) -> CGPoint? {

        //Convert from normalized AVFoundation coordinates (0,0 top-left, 1,1 bottom-right)
        //to screen-space coordinates.
        guard
            let arFrame = session.currentFrame,
            let interfaceOrientation = window?.windowScene?.interfaceOrientation
        else { return nil }

        let transform = arFrame.displayTransform(for: interfaceOrientation, viewportSize: frame.size)
        let normalizedCenter = point.applying(transform)
        let center = normalizedCenter.applying(CGAffineTransform.identity.scaledBy(x: frame.width, y: frame.height))

        return center
    }
}
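
As a usage sketch (assuming it runs inside a method on the same ARView, with avFoundationPosition produced by the Vision conversion described earlier), the screen-space point used further below can be obtained like this:

if let uiKitPosition = convertAVFoundationToScreenSpace(avFoundationPosition) {

    //uiKitPosition is now in UIKit points; it can be passed to ray(through:)
    //or used to place UIKit views over the camera feed.
    print("Screen position: \(uiKitPosition)")
}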

To go the opposite direction, from UIKit display coordinates to AVFoundation coordinates:


public extension ARView {

    func convertScreenSpaceToAVFoundation(_ point: CGPoint) -> CGPoint? {

        //Convert from screen-space UIKit coordinates
        //to normalized pixel coordinates (0,0 top-left, 1,1 bottom-right).
        guard
            let arFrame = session.currentFrame,
            let interfaceOrientation = window?.windowScene?.interfaceOrientation
        else { return nil }

        let inverseScaleTransform = CGAffineTransform.identity.scaledBy(x: frame.width, y: frame.height).inverted()
        let invertedDisplayTransform = arFrame.displayTransform(for: interfaceOrientation, viewportSize: frame.size).inverted()

        let unScaledPoint = point.applying(inverseScaleTransform)
        let normalizedCenter = unScaledPoint.applying(invertedDisplayTransform)

        return normalizedCenter
    }
}

To get a world-space coordinate from a UIKit screen coordinate and a corresponding depth value:


    /// Get the world-space position from a UIKit screen point and a depth value.
    /// - Parameters:
    ///   - screenPosition: A CGPoint representing a point on screen in UIKit coordinates.
    ///   - depth: The depth at this coordinate, in meters.
    /// - Returns: The position in world space of this coordinate at this depth.
    private func worldPosition(screenPosition: CGPoint, depth: Float) -> simd_float3? {

        guard
            let rayResult = arView.ray(through: screenPosition)
        else { return nil }

        //rayResult.direction is a normalized (1 meter long) vector pointing in the correct direction,
        //and we want to go the length of depth along this vector.
        let worldOffset = rayResult.direction * depth
        let worldPosition = rayResult.origin + worldOffset

        return worldPosition
    }

To set the position of an entity in world space for a given point on screen:


    guard
        let currentFrame = arView.session.currentFrame,
        let sceneDepth = (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap,
        let depthAtPoint = sceneDepth.value(from: avFoundationPosition),
        let worldPosition = worldPosition(screenPosition: uiKitPosition, depth: depthAtPoint)
    else { return }

    trackedEntity.setPosition(worldPosition, relativeTo: nil)

And don't forget to set the proper frameSemantics on your ARConfiguration:


    func runNewConfig() {

        // Create a session configuration
        let configuration = ARWorldTrackingConfiguration()

        //Goes with (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap
        let frameSemantics: ARConfiguration.FrameSemantics = [.smoothedSceneDepth, .sceneDepth]

        //Goes with currentFrame.estimatedDepthData
        //let frameSemantics: ARConfiguration.FrameSemantics = .personSegmentationWithDepth

        if ARWorldTrackingConfiguration.supportsFrameSemantics(frameSemantics) {
            configuration.frameSemantics.insert(frameSemantics)
        }

        // Run the view's session
        session.run(configuration)
    }
