I'm trying to distinguish the difference in volume between background noise and someone speaking, in Swift.
Previously, I came across a tutorial that had me looking at the power levels in each channel. That produced the code listed in Sample One, which I call within the installTap closure. It was OK, but the variance between background noise and the voice I intended to record wasn't that great. It could be the math used to calculate it, but since I have no experience with audio data, it was like reading another language.
Then I came across another demo. Its code is much simpler, and the difference in values between background noise and a speaking voice is much greater, and therefore much more detectable. It's listed here as Sample Two, which I also call within the installTap closure.
My issue is that I want to understand what is happening in this code. In all my experience with other languages, audio was something I never dealt with, so this is way over my head.
I'm not looking for someone to explain this to me line by line, but if someone could point me to decent documentation so I can better grasp what is going on, I would appreciate it.
Thank you
Sample One
import Accelerate
import AVFoundation

func audioMetering(buffer: AVAudioPCMBuffer) {
    // buffer.frameLength is typically 1024
    let inNumberFrames = vDSP_Length(buffer.frameLength)
    if buffer.format.channelCount > 0 {
        let samples = buffer.floatChannelData![0]
        var avgValue: Float32 = 0
        // Mean of the magnitudes (absolute values) of the samples
        vDSP_meamgv(samples, 1, &avgValue, inNumberFrames)
        var v: Float = -100
        if avgValue != 0 {
            // Convert the linear average to decibels
            v = 20.0 * log10f(avgValue)
        }
        // Exponential smoothing toward the new measurement
        self.averagePowerForChannel0 = (self.LEVEL_LOWPASS_TRIG * v) + ((1 - self.LEVEL_LOWPASS_TRIG) * self.averagePowerForChannel0)
        self.averagePowerForChannel1 = self.averagePowerForChannel0
    }
    if buffer.format.channelCount > 1 {
        let samples = buffer.floatChannelData![1]
        var avgValue: Float32 = 0
        vDSP_meamgv(samples, 1, &avgValue, inNumberFrames)
        var v: Float = -100
        if avgValue != 0 {
            v = 20.0 * log10f(avgValue)
        }
        self.averagePowerForChannel1 = (self.LEVEL_LOWPASS_TRIG * v) + ((1 - self.LEVEL_LOWPASS_TRIG) * self.averagePowerForChannel1)
    }
}
Sample Two
private func getVolume(from buffer: AVAudioPCMBuffer, bufferSize: Int) -> Float {
    guard let channelData = buffer.floatChannelData?[0] else {
        return 0
    }
    let channelDataArray = Array(UnsafeBufferPointer(start: channelData, count: bufferSize))
    var outEnvelope = [Float]()
    var envelopeState: Float = 0
    let envConstantAtk: Float = 0.16   // fast attack
    let envConstantDec: Float = 0.003  // slow decay
    for sample in channelDataArray {
        let rectified = abs(sample)
        if envelopeState < rectified {
            envelopeState += envConstantAtk * (rectified - envelopeState)
        } else {
            envelopeState += envConstantDec * (rectified - envelopeState)
        }
        outEnvelope.append(envelopeState)
    }
    // 0.015 acts as a gate threshold, so low-level
    // noise entering from the microphone reads as silence
    if let maxVolume = outEnvelope.max(),
       maxVolume > Float(0.015) {
        return maxVolume
    } else {
        return 0.0
    }
}
Read up on "noise gates". The audio samples are PCM float values between -1.0 and 1.0. That's the instantaneous value of the signal amplitude in the channel, it cannot exceed maximum volume. Values close to 0.0 are very quiet. Assuming your microphone is correctly set up, when you speak loudly, close to the microphone, the peak signal amplitude will be close to 1.0. A background noise might be 30 to 40dB below this, or at a level of about 0.03 to 0.01. A reasonable noise gate might want to measure the input values and cut any sound at or below this level, if it persists, while it would permit sound above this level, again if it persists. A reasonable noise gate usually turns on quickly (so you don't cut off the beginning of speech) and turns off slowly (so you don't have abrupt cut-offs which can sound jarring).
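To make the idea concrete, here is a minimal noise-gate sketch in plain Swift. The threshold, attack, and release constants here are illustrative values I chose for the example, not taken from either sample:

```swift
// Minimal noise gate: passes samples through only while the smoothed
// signal level stays above a threshold; opens fast, closes slowly.
struct NoiseGate {
    var threshold: Float = 0.02   // roughly -34 dBFS (illustrative)
    var attack: Float = 0.5       // fast: gate opens quickly on speech
    var release: Float = 0.01     // slow: gate closes gently on silence
    private var level: Float = 0  // smoothed, rectified signal level

    mutating func process(_ sample: Float) -> Float {
        let rectified = abs(sample)
        // One-pole smoothing: fast rise, slow fall
        let k = rectified > level ? attack : release
        level += k * (rectified - level)
        // Mute the output while the smoothed level is below threshold
        return level > threshold ? sample : 0
    }
}

var gate = NoiseGate()
let quiet: [Float] = Array(repeating: 0.005, count: 100)  // background-noise level
let loud: [Float] = Array(repeating: 0.5, count: 100)     // speech-like level
let gatedQuiet = quiet.map { gate.process($0) }
let gatedLoud = loud.map { gate.process($0) }
```

Fed 100 quiet samples the gate stays closed (all zeros out); fed loud samples the smoothed level crosses the threshold almost immediately and the signal passes through unchanged.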
The first sample uses vDSP_meamgv, which is documented here: https://developer.apple.com/documentation/accelerate/1449731-vdsp_meamgv It isn't clear what LEVEL_LOWPASS_TRIG is supposed to be. It is probably a value between 0 and 1. Its name suggests it is a trigger value (i.e. a threshold for the noise gate), but it is actually used here to control how rapidly the averagePowerForChannel variables approach the measured value in a single buffer.
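Assuming LEVEL_LOWPASS_TRIG is a smoothing factor between 0 and 1 (my reading of the code, not a documented fact), that update line is a standard exponential moving average. A small standalone version shows how the factor controls how fast the reading chases a new measurement:

```swift
// Exponential smoothing, as in Sample One's update of averagePowerForChannel0:
// smoothed = k * new + (1 - k) * smoothed, with k playing the role of LEVEL_LOWPASS_TRIG.
func smooth(_ measurements: [Float], factor k: Float, start: Float = -100) -> [Float] {
    var smoothed = start
    return measurements.map { v in
        smoothed = k * v + (1 - k) * smoothed
        return smoothed
    }
}

// A jump from silence (-100 dB) to a steady -20 dB reading:
let steady = Array(repeating: Float(-20), count: 10)
let fast = smooth(steady, factor: 0.5)   // chases the new value quickly
let slow = smooth(steady, factor: 0.1)   // approaches it gradually
```

With a factor of 0.5 the smoothed value is within a fraction of a dB of -20 after ten buffers; with 0.1 it is still far away, which is why the choice of this constant changes how responsive the meter feels.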
The second sample is easier to comprehend. It takes every incoming sample (a floating point value between -1 and +1) and finds its absolute value. If the current rectified sample is larger than the 'envelopeState', the envelopeState is increased by 0.16 of the difference, while if it is smaller the envelopeState is decreased by 0.003 of the difference. So 'envelopeState' will rapidly rise if there is input, and slowly fall if there is none, or if the volume of that input falls below that of 'envelopeState'. The routine uses the peak value of the array of 'envelopeState' values as a gate. The buffer is considered full of background noise only if that peak value is less than 0.015. Remember envelopeState isn't the instantaneous signal level; it is a smoothed (filtered) estimate of it.
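A stripped-down version of that envelope follower (same 0.16/0.003 constants as Sample Two, but fed a synthetic burst instead of microphone data) makes the fast-rise/slow-fall behaviour easy to see:

```swift
// Envelope follower with fast attack and slow decay, as in Sample Two.
func envelope(of samples: [Float], attack: Float = 0.16, decay: Float = 0.003) -> [Float] {
    var state: Float = 0
    return samples.map { sample in
        let rectified = abs(sample)
        // Pick the fast constant when the signal is rising, the slow one when falling
        let k = state < rectified ? attack : decay
        state += k * (rectified - state)
        return state
    }
}

// A burst of "loud" samples followed by silence:
let burst = Array(repeating: Float(0.8), count: 50) + Array(repeating: Float(0), count: 50)
let env = envelope(of: burst)
```

The envelope climbs to nearly 0.8 within the 50-sample burst, then sinks only slightly over the following 50 samples of silence; with a decay constant of 0.003 it takes hundreds of samples to fall back toward zero, which is exactly the "turn off slowly" behaviour described above.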
Hope this helps.