How do I mix 2 hardware audio devices and record them as 1 audio track?

I am trying to mix the audio from 2 different hardware audio devices together in real time and record the result. Does anybody have any idea how to do this? This is on macOS.

Things I have tried and why it didn't work:

  • Adding 2 audio AVCaptureDevices to an AVCaptureMovieFileOutput or AVAssetWriter. This results in a file that has 2 audio tracks, which doesn't work for me for various reasons. Sure, I can mix them together afterwards with an AVAssetExportSession, but the mixing needs to happen in real time.

  • Programmatically creating an aggregate device and recording that as an AVCaptureDevice. This "sort of" works, but it always results in a recording with strange channel issues. For example, if I combine a 1-channel mic and a 2-channel device, I get a recording with 3-channel audio (L R C). If I make an aggregate out of 2 stereo devices, I get a recording with quadraphonic sound (L R Ls Rs), which won't even play back on some players. If I always force it to stereo, all stereo tracks get turned into mono for some reason.

  • Programmatically creating an aggregate device and trying to use it in an AVAudioEngine. I've had multiple problems with this, but the main one is that when the aggregate device is the input node, it only reports the format of its main device and not of its sub-devices. And I can't force it to be 3 or 4 channels without errors.

  • Using an AVCaptureSession to output the sample buffers of both devices, then converting those samples and feeding them into their own AVAudioPlayerNodes, and finally mixing those player nodes in an AVAudioEngine mixer. This actually works, but the resulting audio lags so far behind real time that it is unusable. If I record a webcam video along with the audio, the lip-sync is off by about half a second.
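
For illustration, the channel mapping the aggregate-device attempts would need is roughly the following. This is only a minimal sketch in plain Swift, not a working capture pipeline: `foldToStereo` and its parameters are hypothetical names, and it assumes the aggregate device exposes its sub-devices' channels back to back, in sub-device order, as Float buffers of equal length.

```swift
// Hypothetical helper: fold the concatenated channels of an aggregate device
// down to a single stereo pair.
// Assumption: `channels` holds one equal-length Float buffer per aggregate
// channel, ordered sub-device by sub-device.
func foldToStereo(channels: [[Float]],
                  subDeviceChannelCounts: [Int]) -> (left: [Float], right: [Float]) {
    precondition(channels.count == subDeviceChannelCounts.reduce(0, +),
                 "sub-device channel counts must add up to the aggregate channel count")
    let frameCount = channels.first?.count ?? 0
    var left = [Float](repeating: 0, count: frameCount)
    var right = [Float](repeating: 0, count: frameCount)
    var index = 0
    for count in subDeviceChannelCounts {
        guard count > 0 else { continue }
        // A mono sub-device feeds both sides; otherwise its first two
        // channels are treated as left and right.
        let l = channels[index]
        let r = count >= 2 ? channels[index + 1] : channels[index]
        for frame in 0..<frameCount {
            left[frame] += l[frame]
            right[frame] += r[frame]
        }
        index += count
    }
    // Clamp the sum so it stays inside the nominal [-1, 1] range.
    let clamp: (Float) -> Float = { min(max($0, -1), 1) }
    return (left.map(clamp), right.map(clamp))
}

// Example: a mono mic followed by a stereo device (3 aggregate channels).
let mixed = foldToStereo(channels: [[0.5, 0.25], [0.25, 0.25], [-0.25, 0.5]],
                         subDeviceChannelCounts: [1, 2])
```

With this mapping, the mono mic ends up centered and the stereo device keeps its left/right separation, which is the result I would expect from "forcing stereo".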

I really need help with this. If anybody has a way to do this, let me know.

Some caveats that have also been tripping me up:

  • The hardware devices that need to be recorded might not be the default input device for the system. The MBP built-in mic might be the default device, but I need to record 2 other devices and exclude the built-in mic.
  • The devices usually don't have the same audio format. I might be mixing an LPCM mono Int16 interleaved stream with an LPCM stereo Float32 non-interleaved stream.
  • It absolutely has to be real-time and a single audio track.
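
To make the format mismatch concrete, here is a minimal sketch in plain Swift of the conversion that has to happen before two such streams can be summed. `deinterleave` is a hypothetical helper written only for illustration; in a real macOS pipeline this step would more likely be handled by something like AVAudioConverter, and the sketch ignores sample-rate differences entirely.

```swift
// Hypothetical helper: convert interleaved Int16 PCM into one Float array
// per channel (non-interleaved Float32), so it can be summed with a stream
// that is already in that layout.
func deinterleave(_ samples: [Int16], channelCount: Int) -> [[Float]] {
    precondition(channelCount > 0 && samples.count % channelCount == 0,
                 "sample count must be a whole number of frames")
    let frameCount = samples.count / channelCount
    var output = [[Float]](repeating: [Float](repeating: 0, count: frameCount),
                           count: channelCount)
    for frame in 0..<frameCount {
        for channel in 0..<channelCount {
            // Dividing by 32768 maps Int16.min to exactly -1.0.
            output[channel][frame] = Float(samples[frame * channelCount + channel]) / 32768
        }
    }
    return output
}

// Example: two interleaved stereo frames (L0 R0 L1 R1).
let planar = deinterleave([16384, -16384, 0, Int16.min], channelCount: 2)
```

Once both devices deliver Float32 buffers in the same layout and at the same sample rate, mixing them into one track is just a per-sample sum followed by clamping.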

It shouldn't be this hard, right?

Did you find a solution to this? I am trying to do the same thing.
