DockKit tracking becomes erratic with increased zoom factor in iOS app

I'm developing an iOS app using DockKit to control a motorized stand. I've noticed that as the zoom factor of the AVCaptureDevice increases, the stand's movement becomes increasingly erratic up and down, almost like a pendulum motion. I'm not sure why this is happening or how to fix it.

Here's a simplified version of my tracking logic:

func trackObject(_ boundingBox: CGRect, _ dockAccessory: DockAccessory) async throws {
    guard let device = AVCaptureDevice.default(for: .video),
          let input = try? AVCaptureDeviceInput(device: device) else {
        fatalError("Camera not available")
    }
    
    let currentZoomFactor = device.videoZoomFactor
    let dimensions = device.activeFormat.formatDescription.dimensions
    let referenceDimensions = CGSize(width: CGFloat(dimensions.width), height: CGFloat(dimensions.height))
    
    let intrinsics = calculateIntrinsics(for: device, currentZoom: Double(currentZoomFactor))
    
    let deviceOrientation = UIDevice.current.orientation
    let cameraOrientation: DockAccessory.CameraOrientation = {
        switch deviceOrientation {
        case .landscapeLeft: return .landscapeLeft
        case .landscapeRight: return .landscapeRight
        case .portrait: return .portrait
        case .portraitUpsideDown: return .portraitUpsideDown
        default: return .unknown
        }
    }()
    
    let cameraInfo = DockAccessory.CameraInformation(
        captureDevice: input.device.deviceType,
        cameraPosition: input.device.position,
        orientation: cameraOrientation,
        cameraIntrinsics: useIntrinsics ? intrinsics : nil,
        referenceDimensions: referenceDimensions
    )
    
    let observation = DockAccessory.Observation(
        identifier: 0,
        type: .object,
        rect: boundingBox
    )
    let observations = [observation]
    
    try await dockAccessory.track(observations, cameraInformation: cameraInfo)
}

func calculateIntrinsics(for device: AVCaptureDevice, currentZoom: Double) -> matrix_float3x3 {
    let dimensions = CMVideoFormatDescriptionGetDimensions(device.activeFormat.formatDescription)
    let width = Float(dimensions.width)
    let height = Float(dimensions.height)
    
    let diagonalPixels = sqrt(width * width + height * height)
    let estimatedFocalLength = diagonalPixels * 0.8
    
    let fx = Float(estimatedFocalLength) * Float(currentZoom)
    let fy = fx
    let cx = width / 2.0
    let cy = height / 2.0
    
    return matrix_float3x3(
        SIMD3<Float>(fx, 0, cx),
        SIMD3<Float>(0, fy, cy),
        SIMD3<Float>(0, 0, 1)
    )
}

I'm calling this function regularly (10-30 times per second) with updated bounding box information. The erratic movement seems to worsen as the zoom factor increases.

Questions:

  1. Why might increasing the zoom factor cause this erratic movement?
  2. I'm currently calculating camera intrinsics based on the current zoom factor. Is this approach correct, or should I be doing something differently?
  3. Are there any other factors I should consider when using DockKit with a variable zoom?
  4. Could the frequency of calls to trackObject (10-30 times per second) be contributing to the erratic movement? If so, what would be an optimal frequency?

Any insights or suggestions would be greatly appreciated. Thanks!

Answered by Davidbaraff2 in 808590022

Be sure to pass in the camera intrinsics. Rather than compute them yourself, pull them from the AVCaptureDevice.

I've seen something similar: when the zoom is at its default it's fine, but as it increases, the fact that your view is zoomed in isn't known to the tracking system because of the incorrect intrinsics. A small offset at low zoom becomes a big offset at higher zoom, so the system tells the accessory to rotate too much. Feedback loop.
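
To put that feedback loop in numbers, here is an illustrative sketch (not code from the thread; the focal length, zoom factor, and pixel offset are made-up values): the yaw angle implied by a given pixel offset shrinks as fx grows with zoom, so intrinsics that still describe 1x overstate the error by roughly the zoom factor.

import Foundation

// Illustrative only: how a pixel offset maps to a yaw angle for a given fx.
func impliedYawError(pixelOffsetInPixels: Double, fx: Double) -> Double {
    atan2(pixelOffsetInPixels, fx)   // radians
}

let baseFx = 1500.0   // assumed focal length in pixels at 1x zoom
let zoom = 4.0        // assumed current videoZoomFactor
let offset = 100.0    // subject's horizontal offset from center, in pixels

let staleEstimate = impliedYawError(pixelOffsetInPixels: offset, fx: baseFx)            // intrinsics not updated for zoom
let zoomAwareEstimate = impliedYawError(pixelOffsetInPixels: offset, fx: baseFx * zoom) // zoom-aware intrinsics
// staleEstimate is roughly 4x larger than zoomAwareEstimate, so every correction
// overshoots, which produces the pendulum-like oscillation described above.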

Accepted Answer

These snippets might be of use to you:

if let captureConnection = videoDataOutput.connection(with: .video) {
    captureConnection.isEnabled = true
    captureConnection.isCameraIntrinsicMatrixDeliveryEnabled = true
}

[God almighty. Why is it so impossible to format code in this editor?]

This function pulls out the intrinsics and computes the field-of-view, but that was for something I was doing; just the intrinsics matrix here might be what you want:

nonisolated func computeFOV(_ sampleBuffer: CMSampleBuffer) -> Double? {
    guard let camData = CMGetAttachment(sampleBuffer,
                                        key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                        attachmentModeOut: nil) as? Data else { return nil }

    let intrinsics: matrix_float3x3? = camData.withUnsafeBytes { pointer in
        if let baseAddress = pointer.baseAddress {
            return baseAddress.assumingMemoryBound(to: matrix_float3x3.self).pointee
        }
        return nil
    }

    guard let intrinsics = intrinsics else { return nil }

    let fx = intrinsics[0][0]
    let w = 2 * intrinsics[2][0]
    return Double(atan2(w, 2 * fx))
}

Again, sorry for the totally ****** formatting. If someone can tell me how this is supposed to work, I'm all ears. I pasted code and hit "code block" but it didn't help much.
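
For completeness, here is one way the delivered matrix might be wired into the question's CameraInformation. This is a sketch under assumptions: the helper name, the orientation parameter, and the nil fallback are mine, not from the thread; the extraction and the initializer arguments mirror the code already shown.

import AVFoundation
import DockKit
import simd

// Hypothetical helper: build the CameraInformation for dockAccessory.track(...)
// from the intrinsics that AVFoundation attaches to each sample buffer once
// isCameraIntrinsicMatrixDeliveryEnabled is set on the connection.
func cameraInformation(for device: AVCaptureDevice,
                       orientation: DockAccessory.CameraOrientation,
                       sampleBuffer: CMSampleBuffer) -> DockAccessory.CameraInformation {
    // Read the delivered intrinsics instead of estimating them from the zoom factor.
    var intrinsics: matrix_float3x3?
    if let camData = CMGetAttachment(sampleBuffer,
                                     key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                     attachmentModeOut: nil) as? Data {
        intrinsics = camData.withUnsafeBytes { pointer in
            pointer.baseAddress.map { $0.assumingMemoryBound(to: matrix_float3x3.self).pointee }
        }
    }

    let dims = CMVideoFormatDescriptionGetDimensions(device.activeFormat.formatDescription)
    return DockAccessory.CameraInformation(
        captureDevice: device.deviceType,
        cameraPosition: device.position,
        orientation: orientation,
        cameraIntrinsics: intrinsics,   // stays nil if delivery isn't enabled
        referenceDimensions: CGSize(width: CGFloat(dims.width), height: CGFloat(dims.height))
    )
}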

Thank you so much! Don't worry about the bad editor formatting; I fully understand your frustration!

The code you provided works like a charm and the pendulum motion is gone now!

Thank you so much 🙏

@Davidbaraff2 can you help me again? I'm wondering what the referenceDimensions property of the CameraInformation object means. Does it refer to the size of the CMSampleBuffer or, for example, to the previewLayer where the camera output gets rendered?

Thank you very much!

I'll provide additional code tomorrow, but with the current implementation DockKit reacts extremely slowly to updated bounding box positions. Did you notice something similar, @Davidbaraff2?

@Davidbaraff2 that's the code I'm currently using. I'm trying to capture faces with Vision and the back camera in portrait mode. To be able to render the bounding boxes on screen, I noticed that using .leftMirrored as the orientation for the VNImageRequestHandler helps a lot, but I can't get DockKit to track the faces correctly.

What am I missing here?

class TestViewController: UIViewController {
    private let captureSession = AVCaptureSession()
    private let videoDataOutput = AVCaptureVideoDataOutput()
    
    private lazy var previewLayer = AVCaptureVideoPreviewLayer(session: self.captureSession)
    
    private var faceLayers: [CAShapeLayer] = []
    private var dockAccessory: DockAccessory?
    private var captureDevice: AVCaptureDevice?
    
    override func viewDidLoad() {
        super.viewDidLoad()
        
        Task {
            try await DockAccessoryManager.shared.setSystemTrackingEnabled(false)
            
            for await accessory in try DockAccessoryManager.shared.accessoryStateChanges {
                print(accessory.state)
               
                DispatchQueue.main.async {
                    //self.connectedToDock = accessory.state == DockAccessory.State.docked
                    self.dockAccessory = accessory.accessory
                }
            }
        }
        
        setupCamera()
        captureSession.startRunning()
        
        func setupCamera() {
            self.captureSession.sessionPreset = .hd1280x720
            let deviceDiscoverySession = AVCaptureDevice.DiscoverySession(
                deviceTypes: [
                    .builtInDualCamera,
                    .builtInTripleCamera
                ],
                mediaType: .video,
                position: .back
            )
            if let device = deviceDiscoverySession.devices.first {
                if let deviceInput = try? AVCaptureDeviceInput(device: device) {
                    if captureSession.canAddInput(deviceInput) {
                        captureSession.addInput(deviceInput)
                        
                        setupPreview()
                    }
                }
                
                self.captureDevice = device
            }
            
            func setupPreview() {
                self.previewLayer.videoGravity = .resizeAspectFill
                self.view.layer.addSublayer(self.previewLayer)
                self.previewLayer.frame = self.view.frame
                
                self.videoDataOutput.videoSettings = [
                    (kCVPixelBufferPixelFormatTypeKey as NSString) : NSNumber(value: kCVPixelFormatType_32BGRA)
                ] as [String : Any]

                self.videoDataOutput.setSampleBufferDelegate(
                    self,
                    queue: DispatchQueue(label: "camera queue")
                )
                self.captureSession.addOutput(self.videoDataOutput)
                
                let videoConnection = self.videoDataOutput.connection(with: .video)
                videoConnection?.videoOrientation = .portrait
            }
        }
    }
    
    private var frameCounter = 0
    private var lastTimestamp = Date()
}

extension TestViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
          return
        }

        let faceDetectionRequest = VNDetectFaceLandmarksRequest(completionHandler: { (request: VNRequest, error: Error?) in
            DispatchQueue.main.async {
                self.faceLayers.forEach({ drawing in drawing.removeFromSuperlayer() })

                if let observations = request.results as? [VNFaceObservation] {
                    self.handleFaceDetectionObservations(
                        observations: observations,
                        imageBuffer,
                        sampleBuffer
                    )
                }
            }
        })
        
        let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: imageBuffer, orientation: .leftMirrored, options: [:])
        
        do {
            try imageRequestHandler.perform([faceDetectionRequest])
        } catch {
          print(error.localizedDescription)
        }
    }
    
    private func handleFaceDetectionObservations(
        observations: [VNFaceObservation],
        _ pixelBuffer: CVPixelBuffer,
        _ sampleBuffer: CMSampleBuffer
    ) {
        for observation in observations {
            var faceRectConverted = self.previewLayer.layerRectConverted(fromMetadataOutputRect: observation.boundingBox)
            let faceRectanglePath = CGPath(rect: faceRectConverted, transform: nil)
            
            let faceLayer = CAShapeLayer()
            faceLayer.path = faceRectanglePath
            faceLayer.fillColor = UIColor.clear.cgColor
            faceLayer.strokeColor = UIColor.yellow.cgColor
            
            self.faceLayers.append(faceLayer)
            self.view.layer.addSublayer(faceLayer)
            
            if
                let captureDevice = captureDevice,
                let dockAccessory = dockAccessory
            {
                Task {
                    do {
                        try await trackWithDockKit(
                            observation.boundingBox,
                            dockAccessory,
                            pixelBuffer,
                            sampleBuffer
                        )
                    } catch {
                        print(error)
                    }
                }
            }
        }
    }
    
    func trackWithDockKit(_ boundingBox: CGRect, _ dockAccessory: DockAccessory, _ pixelBuffer: CVPixelBuffer, _ cmSampleBuffer: CMSampleBuffer) async throws {
        guard
            let device = captureDevice
        else {
            fatalError("Kamera nicht verfügbar")
        }
        
        let size = CGSize(width: CVPixelBufferGetWidth(pixelBuffer), height: CVPixelBufferGetHeight(pixelBuffer))
        
        let cameraInfo = DockAccessory.CameraInformation(
            captureDevice: device.deviceType,
            cameraPosition: device.position,
            orientation: .corrected,
            cameraIntrinsics: nil,
            referenceDimensions: size
        )
        
        let observation = DockAccessory.Observation(
            identifier: 0,
            type: .object,
            rect: boundingBox
        )
        let observations = [observation]
        
        try await dockAccessory.track(observations, cameraInformation: cameraInfo)
    }
}

For a little while I tried to pass rectangles to dockAccessory.track(), but then I gave up. For whatever reason, it didn't seem to work so well.

Instead, I measure how far the center of my rectangle is from being centered in the image (in angular units), and then call setVelocity (in the "yaw" axis) with a speed that is proportional to this distance. (If you get the sign wrong, you'll figure it out... quickly.)

I observed this gave me much smoother results than letting the camera do its own tracking. I also tried to directly call setOrientation() to make the camera turn to what I think should be the exact amount, but that works poorly. Using the error of how far the center of the rectangle is from the image center, as a measure of how fast to rotate the camera, gives much smoother results.

Note: if you don't call this routine very frequently, be very careful to have a safeguard in place that countermands the last "setVelocity" call. Because if you don't, and the last thing you tell the camera is "rotate that way with speed X", and you never say anything again, your camera will just spin and spin and spin...
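
A minimal sketch of one such safeguard, assuming a setVelocity wrapper like the one used in the routine below; the actor, timeout, and polling interval are my own choices, not from the thread:

import Foundation

// Hypothetical watchdog: if no tracking update has arrived recently, ask the
// caller to send a zero velocity so the stand doesn't keep spinning on a stale
// setVelocity command.
actor VelocityWatchdog {
    private var lastUpdate = Date.distantPast

    func noteUpdate() { lastUpdate = Date() }

    func isStale(timeout: TimeInterval = 0.5) -> Bool {
        Date().timeIntervalSince(lastUpdate) > timeout
    }
}

// Usage sketch: call watchdog.noteUpdate() from the track routine, and run a
// loop like this alongside it to countermand the last setVelocity call.
func runWatchdog(_ watchdog: VelocityWatchdog,
                 stop: @escaping () async -> Void) -> Task<Void, Never> {
    Task {
        while !Task.isCancelled {
            try? await Task.sleep(nanoseconds: 250_000_000)   // poll every 0.25 s
            if await watchdog.isStale() {
                await stop()   // e.g. await setVelocity(pitch: 0, yaw: 0, roll: 0)
            }
        }
    }
}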

Here's my own personal track() routine, since DockKit's track() doesn't meet my needs:

var errorIsLarge = false

func track(rectangles: [CGRect], fov: Double) async {
    guard !rectangles.isEmpty else {
        await setVelocity(pitch: 0, yaw: 0, roll: 0)
        return
    }

    let r = rectangles.reduce(CGRect.null) { $0.union($1) }
    let xOffset = 2 * (r.midX - 0.5)
    var thetaOffset = xOffset * fov
    print("Tracking: \(thetaOffset.asDegrees) degrees off, midX = \(r.midX)")

    if abs(thetaOffset) > (3.0).asRadians {
        print("Error is large set true")
        errorIsLarge = true
    }

    if abs(thetaOffset) < (3.0).asRadians {
        if !errorIsLarge {
            thetaOffset = 0
        }
    }

    if abs(thetaOffset) < (1.0).asRadians {
        errorIsLarge = false
        print("error is large is FALSE")
        thetaOffset = 0
    }

    print("Setting velocity to \(-thetaOffset * 3)")
    await setVelocity(pitch: 0, yaw: -thetaOffset * 3, roll: 0)
}

What are we doing here? Whenever we get more than 3 degrees off target, that's enough error to try to correct it. If our error is less than 3 degrees, and we're not in a "large error state", ignore the error. We want smooth tracking.

We leave being in a "large" error state when we reduce the error to less than 1 degree.

We set the velocity to the opposite of 3 X thetaOffset, where thetaOffset is how far off target we are, noting that as I said above, at times we want to tolerate small errors. The idea is that if you get close enough, stop trying to micro-correct.

In my case, I'm taking all my observation rectangles, unioning them, and taking the center. However you make observations, at the end of the day, you're going to get some absolute theta error, and that's what you deal with.

Note: I'm not trying to pitch up or down (or, god forbid, roll!) the camera. Just yaw (i.e. rotate around the vertical axis).
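
One possible way to drive this from the capture callback, assuming the computeFOV(_:) helper from the accepted answer and the track(rectangles:fov:) routine above live on the same object; the glue function is mine, not from the thread:

// Hypothetical glue inside captureOutput(_:didOutput:from:): derive the angle
// from the delivered intrinsics, gather the normalized bounding boxes from
// Vision, and hand both to the custom track routine.
func handleFrame(_ sampleBuffer: CMSampleBuffer, boundingBoxes: [CGRect]) {
    guard let fov = computeFOV(sampleBuffer) else { return }
    Task {
        await track(rectangles: boundingBoxes, fov: fov)
    }
}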
