Audio Unit Circular buffering

I have sort of asked this before, but I've hit another wall this time around. Here goes!

While there is plenty of information on how to use a circular buffer with the RemoteIO Audio Unit by way of the callbacks, I don't see any obvious way to do this with an AUAudioUnit subclass. Has anyone succeeded in doing this?

I see there are two handlers in the AUAudioUnit called inputHandler and outputHandler, but I have had no luck simply implementing them so that they copy the data into a circular buffer.


My intent is to accumulate a minimum number of samples/frames in the input buffer before the render block is called, so that I can perform an FFT on a reasonable amount of sample data, and so that overlap-add is possible.

I find myself restricted to the render block function, which (apparently, correct me if I'm wrong) insists that I render exactly the number of samples that come in.

What I'd like to avoid is offsetting the output, i.e. 1) saving the previous frame (assuming some zero data to begin with), 2) pulling input, and outputting a frame in between the previous and current frames by overlap-add. I'd much rather introduce some latency within the plugin instead, by waiting to render until enough data has arrived. However, currently I can only see a way to do the former. The latter is slightly trickier, and I find myself somewhat restricted by the API, and missing some documentation and/or code examples of how to do this.


The g-o-a-l is to accumulate more samples than the render block asks for before rendering, then perform window/FFT/filtering/IFFT and overlap-add.


For some reason the Apple forum doesn't like the word 'g o a l'; it said 'invalid characters'. 😝


Any advice is much appreciated. I find the docs are way too short and don't explain things in the detail they used to, especially the Audio Unit part of the docs.

I don't agree with this new way Apple has done the docs.

Accepted Reply

You really don't have any choice about when you're called, and (to some extent) how many samples you're asked to process on each call. Delay is inherent in spectral processing, and the system can account for it in the latency property.


Most FFT-based algorithms suffer a delay that is related to the FFT size. In many cases it's exactly equal to the FFT size. Once you know your latency, be sure to report the value in your AUAudioUnit subclass.
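
For illustration, reporting that latency from a subclass might look something like this; the class name, fftSize, and sample rate below are placeholders for whatever your DSP actually uses, not anything from Apple's API:

import Foundation
import AudioToolbox

class SpectralEffectAudioUnit: AUAudioUnit {
    // Placeholder processing parameters.
    private let fftSize = 1024
    private let sampleRate = 44_100.0

    // Report the processing delay (here assumed to equal one FFT block)
    // so the host can compensate for it.
    override var latency: TimeInterval {
        return Double(fftSize) / sampleRate
    }
}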


So now you just need to build a bit of machinery to handle the "asymmetry" between the number of samples you are handed on each call and the number your FFT needs. Your processing loop will look something like this:


process() {
  // 1. Stash all the input samples into an input ring buffer

  // 2. Process the input ring buffer, producing samples for the output ring buffer
  while( /* the input ring buffer contains enough to run an FFT */ ) {
    // a. Run the FFT on the (windowed!) samples at the end of the input ring buffer
    // b. "Magic happens here" - You do your DSP in the frequency domain
    // c. Perform the overlap-add, then place output into your output ring buffer
    // d. Advance the beginning of the input ring by the hop amount (i.e. fft_size - fft_overlap)
  }

  // 3. Copy all the available output samples from the output ring buffer
  // 4. Advance the output ring buffer by all but the hop amount
}


A few notes about the above:

  • Step 2 won't run during every call to process, and rarely runs more than once. Never assume anything about the "metrics" of your process calls.
  • Your ring buffer needs separate "copy" and "produce/consume" operations (one for the head side and one for the tail) so that you can copy more data than you plan to remove.
  • Steps 2.c. and 2.d. require a little bit of juggling to get right. You only produce a hop's worth of output on each iteration, but you need a scratch buffer to hold the time-domain results from the last iteration so you can overlap-add into the last result. A rough sketch of this machinery follows this list.
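
To make that bookkeeping concrete, here is a rough, single-channel sketch in Swift. Nothing here is Apple API: FloatRingBuffer, OverlapAddProcessor, hopSize and so on are made-up names, the frequency-domain step is omitted, the output handling is simplified compared to steps 3 and 4 above, and a real implementation would add bounds checking and keep allocations out of the render path.

import Foundation

// A made-up single-channel ring buffer with separate "copy" (peek) and
// "consume" (advance) operations. No overflow checks, for brevity.
final class FloatRingBuffer {
    private var storage: [Float]
    private var readIndex = 0
    private(set) var count = 0

    init(capacity: Int) {
        storage = [Float](repeating: 0, count: capacity)
    }

    // Produce: append n samples at the tail.
    func write(_ samples: UnsafePointer<Float>, _ n: Int) {
        for i in 0..<n {
            storage[(readIndex + count + i) % storage.count] = samples[i]
        }
        count += n
    }

    // Copy: read n samples from the head without consuming them.
    func peek(into dest: UnsafeMutablePointer<Float>, _ n: Int) {
        for i in 0..<n {
            dest[i] = storage[(readIndex + i) % storage.count]
        }
    }

    // Consume: drop n samples from the head.
    func advance(_ n: Int) {
        readIndex = (readIndex + n) % storage.count
        count -= n
    }
}

final class OverlapAddProcessor {
    let fftSize = 1024
    let hopSize = 256                          // fft_size - fft_overlap, i.e. 75% overlap here
    private let inputRing = FloatRingBuffer(capacity: 8192)
    private let outputRing = FloatRingBuffer(capacity: 8192)
    private var frame: [Float]                 // samples peeked from the input ring
    private var windowed: [Float]              // windowed (and, in real code, processed) frame
    private var overlapAccum: [Float]          // scratch holding the tails of previous frames
    private var window: [Float]

    init() {
        frame = [Float](repeating: 0, count: fftSize)
        windowed = [Float](repeating: 0, count: fftSize)
        overlapAccum = [Float](repeating: 0, count: fftSize)
        window = [Float](repeating: 0, count: fftSize)
        for i in 0..<fftSize {                 // Hann analysis window
            window[i] = 0.5 - 0.5 * cos(2.0 * Float.pi * Float(i) / Float(fftSize))
        }
        outputRing.write(overlapAccum, fftSize)   // prime the output with one block of zeros
    }

    func process(input: UnsafePointer<Float>, output: UnsafeMutablePointer<Float>, frameCount: Int) {
        // 1. Stash all the incoming samples.
        inputRing.write(input, frameCount)

        // 2. Run the FFT as many (or as few) times as the accumulated input allows.
        while inputRing.count >= fftSize {
            inputRing.peek(into: &frame, fftSize)                       // copy, don't consume
            for i in 0..<fftSize { windowed[i] = frame[i] * window[i] } // a. window the frame
            // b. "Magic happens here": FFT -> frequency-domain DSP -> inverse FFT (omitted).
            // c. Overlap-add into the scratch accumulator; its first hop is now complete.
            for i in 0..<fftSize { overlapAccum[i] += windowed[i] }
            outputRing.write(overlapAccum, hopSize)
            // Shift the accumulator left by one hop and clear the freed tail.
            for i in 0..<(fftSize - hopSize) { overlapAccum[i] = overlapAccum[i + hopSize] }
            for i in (fftSize - hopSize)..<fftSize { overlapAccum[i] = 0 }
            // d. Advance the input ring by the hop amount.
            inputRing.advance(hopSize)
        }

        // 3./4. Hand back whatever is ready, padding with silence while the pipeline fills.
        let available = min(outputRing.count, frameCount)
        for i in 0..<frameCount { output[i] = 0 }
        outputRing.peek(into: output, available)
        outputRing.advance(available)
    }
}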


Now for some "pro tips":

  • It's a lot easier if you prime your input and output buffers with zeros when the effect is first allocated, and when it's reset. That way, you can run the FFT on the first call and have some samples immediately available for output; otherwise you have to deal with some awkward special cases. This results in a latency equal to the block size you choose, but very few headaches in practice. Headaches are proportional to how hard you fight to reduce this latency. Choose your battles wisely! 🙂
  • Don't cut down on your temporary "scratch buffers" until you have confirmed that your DSP works properly. Once you're satisfied that the signal processing is stable, you can start to identify scratch buffers that can be re-used in the processing operation.
  • Fancy lock-free structures are great if you're actually planning to move data between threads, but in this example there is no need for the additional confusion and complexity. If everything's pre-allocated, a very simple buffer queue is plenty fast, especially if you use…
  • Accelerate.framework for all your math. Learn it, live it, love it. It's incredible how much you can get done in a render cycle with so little CPU usage. (A small windowing example follows this list.)
  • Build in some instrumentation to debug what you're doing. For example, create a separate scratch buffer that you can copy into during the process call, and dump some calculation results into there. On the main thread, run a periodic timer that'll peek into that buffer to see if your math is blowing out. If you're *really* keen, output data into a format that can be read by Octave or MATLAB, so you can even visualize and test your DSP output easily.
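
As a tiny taste of the Accelerate point above, windowing a whole frame is a couple of vDSP calls. The buffer names here are placeholders:

import Accelerate

let fftSize = 1024
var window = [Float](repeating: 0, count: fftSize)
let input = [Float](repeating: 0, count: fftSize)    // e.g. samples peeked from the input ring
var windowed = [Float](repeating: 0, count: fftSize)

// Build a normalized Hann window once, when the effect is allocated...
vDSP_hann_window(&window, vDSP_Length(fftSize), Int32(vDSP_HANN_NORM))

// ...then window an entire frame with a single call on every iteration.
vDSP_vmul(input, 1, window, 1, &windowed, 1, vDSP_Length(fftSize))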


It took me a long time to figure this kind of stuff out in practice, and many books don't cover the nuts & bolts of practical use.


Hope this helps!

Replies

One way to minimize latency is to request audio buffers smaller than the offset implied by the accumulation required by your filters or other processing. Newer iOS devices seem to support callback buffer sizes well under 5 ms. Then simply have the audio callbacks output silence (or some other sound prefix) until you have accumulated enough processed input audio to output. The number of samples you accumulate can be made independent of the callback buffer size and the fast convolution block size by several means, but a lock-free circular buffer or two seems to be one common solution.
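
For reference, requesting a small callback buffer on iOS is usually done through the shared AVAudioSession; the duration is only a preference, so check what you actually got. The 5 ms figure below is just an example:

import AVFoundation

let session = AVAudioSession.sharedInstance()
do {
    try session.setCategory(.playAndRecord, mode: .default, options: [])
    try session.setPreferredIOBufferDuration(0.005)   // ask for ~5 ms, i.e. ~240 frames at 48 kHz
    try session.setActive(true)
    print("actual I/O buffer duration:", session.ioBufferDuration)
} catch {
    print("audio session configuration failed:", error)
}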

I'm not sure if I understand you correctly. I might have misunderstood, but it sounds like you're describing the delayed-output method when you say "output silence". I'd rather introduce latency in the rendering (that is, halt rendering until enough samples have been accumulated) than produce a delay in the output signal. Have I misunderstood your suggestion?

Halting rendering until xyz and delaying rendering are identical at the physical output.

Are you sure we are talking about the same thing? I agree that at the physical output this is the same. What I'm making, however, is meant for real-time rendering, used as an effect on an audio track in a DAW, where the output of one plugin shouldn't cause signal delay, only latency. That is, I don't want the tracks using the effect to become offset from the rest of the tracks; hence I want to collect more samples before rendering, not simply output zeros.


The API asks me to render N frames for time [t, t+N-1], providing N frames to me. If I output zeros for that frame, the API will consider that part "rendered", meaning a signal that should have started at t now starts at some offset in the future from t, which is effectively a delayed signal. What I want instead is to wait for, say, N more samples, and once I've got 2N samples, render frames for time [t, t+2N-1], which causes latency in the chain rather than delay. But I guess this would depend on how the API works. If the API won't continue to render the following frames until it has completely finished the previous one, then of course the plugin could wait forever and still not receive any new samples.


I hope this clears up what I mean 🙂 Thanks again for your input! 🙂

Not clear yet what you want. In a real-time system, you will be outputting N samples while waiting for N more samples, or else the system isn't real-time. A voltmeter will show some voltage on the DAC output during that time interval; what do you want it to be? Thus, any delay is a latency, and any latency is a delay. (Unless you have a faster-than-light-speed time machine.)

You are talking from the perspective of the output/listener of the sound, not from the signal chain.


Let x[n] be a signal going into a filter of size M > N, producing y[n]. If you output zeros until M samples have accumulated, you are offsetting the signal in y[n] by N on each render pass, which is an actual delay effect in a linear time-invariant system. I do not want to produce a delay in the y[n] signal. I'd like to ensure that each render pass has at least M frames, such that x[0] corresponds to y[0], not to some delayed y[kN] for k > 0.


What you are arguing is that such a delay is indistinguishable at the output. This is true if and only if this is the only signal being output. If I have a duplicate of that signal on another channel, but without my effect on it, you'd hear phase cancellation, because the one with the effect on it would have a slight offset/delay.

Even from the point of view of a signal chain, not producing a processing delay is only possible in a non-buffered, non-realtime signal chain: that is, get your required data and process it well ahead of when it is needed, well before your Audio Unit is ever called.

One solution I've used to reduce unwanted phase cancellation effects is to make the delay of all parallel processing blocks equal to the latency of the longest processing block.

While I understand the effect of it, I'm not quite sure what you mean by making the delay of all parallel processing blocks equal to the latency of the longest block. I don't control the other processes, nor have access to them. The audio unit is meant for DAWs like Logic, and the offset and cancellation I'm referring to are relative to other audio tracks in such an environment. If one track is delayed, the offset pretty much kills the quality of the effect, especially in techniques like parallel compression.


All of this considered, this is why I previously asked for a way to "look ahead" in the input stream, getting M samples but rendering N samples, where M > N. If that were possible, I didn't see any immediate problem with doing an STFT and rendering y[n] aligned exactly with x[n], with no offset. But apparently this isn't possible, as far as I understand 😟


Thank you so much for this very elaborate reply! This helped a great deal!


I was aware of the inherent delay, but failed to see a way to compensate for it. I was under the impression that the latency property of the audio unit subclass was read-only, computed by the API purely for measuring performance; the docs are quite vague on this, in my opinion. If it can indeed "make up" for the delay, then that clears up the problem completely! I'll try it out!


Thanks again for this great reply — lots of great information! 😀