How to improve the captured image's resolution?

To create signatures for human faces and compare their similarity, I'm using ARKit's capturedImage from ARFrame, which comes from the front-facing camera when running ARFaceTrackingConfiguration.

However, compared to using the Vision and AVFoundation frameworks directly, the quality of the signature analysis is significantly degraded by the capturedImage's low resolution. Going by capturedDepthData, the resolution of the capturedImage in ARKit is just 640 x 480, even though the video format is set to the highest supported resolution:

let configuration = ARFaceTrackingConfiguration()

// Pick the supported video format with the largest pixel count.
if let videoFormat = ARFaceTrackingConfiguration.supportedVideoFormats
    .sorted(by: { ($0.imageResolution.width * $0.imageResolution.height) < ($1.imageResolution.width * $1.imageResolution.height) })
    .last {
    configuration.videoFormat = videoFormat
}

I tried using captureHighResolutionFrame, as well as changing the video format:

if let videoFormat = ARFaceTrackingConfiguration.recommendedVideoFormatForHighResolutionFrameCapturing {
    configuration.videoFormat = videoFormat
}
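
The capture call itself looks roughly like this (a minimal sketch; sceneView.session and the iOS 16 availability check are assumptions on my part):

// Minimal sketch: request a single out-of-band high-resolution frame.
// Assumes `sceneView` is an ARSCNView whose session is already running (iOS 16+).
if #available(iOS 16.0, *) {
    sceneView.session.captureHighResolutionFrame { frame, error in
        guard let frame = frame else {
            print("High-resolution capture failed: \(String(describing: error))")
            return
        }
        let buffer: CVPixelBuffer = frame.capturedImage
        print("High-res frame: \(CVPixelBufferGetWidth(buffer)) x \(CVPixelBufferGetHeight(buffer))")
    }
}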

However, according to the documentation:

The system delivers a high-resolution frame out-of-band, which means that it doesn't affect the other frames that the session receives at a regular interval

Because the high-resolution frames are delivered asynchronously, the result seems to alternate between the standard captured images and the high-resolution images rather than replace the regular captured images. This is a concern because, depending on the size differences, displayTransform and CGAffineTransform have to be applied differently to scale the images.

On top of that, I need to be able to use the frames continuously at either 30 fps or 60 fps, as they're produced, rather than take occasional still pictures, which is what captureHighResolutionFrame seems to be designed for, judging by the shutter sound.
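
For reference, the continuous path I rely on is just the regular session delegate callback (a minimal sketch; processFace(imageBuffer:) is a placeholder for my signature pipeline):

// Minimal sketch of the continuous path: ARKit calls this on the session delegate
// for every regular frame at the configured frame rate (30 or 60 fps).
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    let imageBuffer: CVPixelBuffer = frame.capturedImage
    // Placeholder for the face-signature pipeline.
    processFace(imageBuffer: imageBuffer)
}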

In order to use the captured image, I'm currently transforming it in the following way:

// Wrap the camera pixel buffer and note its size in pixels.
let image: CIImage = CIImage(cvImageBuffer: imageBuffer)
let imageSize: CGSize = CGSize(width: CVPixelBufferGetWidth(imageBuffer), height: CVPixelBufferGetHeight(imageBuffer))

// Scale down to normalized image coordinates (0...1) so displayTransform can be applied.
let normalizeTransform: CGAffineTransform = CGAffineTransform(scaleX: 1.0 / imageSize.width, y: 1.0 / imageSize.height)

// Mirror the image when in portrait orientation.
let flipTransform: CGAffineTransform = metadata.orientation.isPortrait ? CGAffineTransform(scaleX: -1, y: -1).translatedBy(x: -1, y: -1) : .identity

guard let viewPort: CGRect = face.viewPort else { return nil }
let viewPortSize: CGSize = viewPort.size

// Convert from normalized image coordinates to normalized view coordinates for the current orientation.
guard let displayTransform: CGAffineTransform = face.arFrame?.displayTransform(for: metadata.orientation, viewportSize: viewPortSize) else {
  return nil
}

// Scale the normalized coordinates back up to the viewport's size.
let viewPortTransform: CGAffineTransform = CGAffineTransform(scaleX: viewPortSize.width, y: viewPortSize.height)

let scaledImage: CIImage = image
  .transformed(by: normalizeTransform
    .concatenating(flipTransform)
    .concatenating(displayTransform)
    .concatenating(viewPortTransform)
  )
  .cropped(to: viewPort)
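
The cropped result is then rendered out before it goes into the signature code, roughly like this (a minimal sketch; the shared CIContext is an assumption):

// Minimal sketch: render the cropped CIImage so it can be handed to Vision.
// `context` is assumed to be a shared CIContext created once, not per frame.
guard let cgImage: CGImage = context.createCGImage(scaledImage, from: scaledImage.extent) else {
    return nil
}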

Hi Ovis!

As you correctly noticed, captureHighResolutionFrame lets you additionally capture individual frames at a higher resolution, without replacing the regular session:didUpdateFrame callbacks at a given frame rate. The intention of this API is to obtain higher-quality frames at certain times when you need them, while still being able to run the session efficiently at a lower resolution. It sounds like this is not the suitable API for your use case.

However, the default video format of ARFaceTrackingConfiguration provides a higher resolution than just 640 x 480. For example, on an iPhone 14 Pro the capturedImage has a size of 1440 x 1080 pixels. Independent of the capturedImage resolution, capturedDepthData always has a resolution of 640 x 480.
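
You can check both values on your device from the session's current frame, for example (a minimal sketch; sceneView.session stands in for your running session):

// Minimal sketch: log both resolutions from the session's current frame.
if let frame = sceneView.session.currentFrame {
    let image = frame.capturedImage
    print("capturedImage: \(CVPixelBufferGetWidth(image)) x \(CVPixelBufferGetHeight(image))")
    if let depthMap = frame.capturedDepthData?.depthDataMap {
        print("capturedDepthData: \(CVPixelBufferGetWidth(depthMap)) x \(CVPixelBufferGetHeight(depthMap))")
    }
}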

Thank you for your reply.

My first question is: does the size of the capturedImage from the front-facing camera increase when a higher-resolution video format is set on ARFaceTrackingConfiguration?

My second question is: displayTransform seems to be the prevalent way of converting the capturedImage's coordinates to those of the camera image onscreen in ARKit. However, the method works in normalized image coordinates from (0, 0) to (1, 1), which means shrinking the captured image drastically and hurting the resolution:

let normalizeTransform: CGAffineTransform = CGAffineTransform(scaleX: 1.0 / imageSize.width, y: 1.0 / imageSize.height)

Do you have any recommendation on how to achieve the coordinate conversion without such a drastic step? My main objective is to convert the coordinates, orientation, and size of the capturedImage to those of the image on screen.
