AVSpeechSynthesizer buffer conversion, write format bug?

Is the format description that AVSpeechSynthesizer reports for the speech buffer correct? When I attempt to convert the buffer, I get back noise from two different conversion methods.

I am seeking to convert the speech buffer provided by AVSpeechSynthesizer's "func write(_ utterance: AVSpeechUtterance..." method. The goal is to convert the sample type, change the sample rate, and change the mono buffer to stereo. I later manipulate the buffer data and pass it through AVAudioEngine. For testing purposes, I have kept the sample rate at the original 22050.0.

What have I tried? I have a method named "resampleBuffer" that I've been using for years to do exactly this. When I apply it to the speech buffer, I get back noise. When I attempt to manually convert the sample type and channel count with "convertSpeechBufferToFloatStereo", I get back clipped output. I also tried byte-swapping the samples to account for the big-endian signed integer description, but that didn't work either.

The speech buffer description is:

inBuffer description: <AVAudioFormat 0x6000012862b0: 1 ch, 22050 Hz, 'lpcm' (0x0000000E) 32-bit big-endian signed integer>
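
The byte-swap test was essentially this (a sketch; the commented-out variant also appears in convertSpeechBufferToFloatStereo below):

// Sketch of the byte-swap test: treat each raw sample as a big-endian
// Int32, then normalize to Float. This still produced noise.
func swappedSample(_ raw: Int32) -> Float {
    return Float(Int32(bigEndian: raw)) / Float(Int32.max)
}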

import Cocoa
import AVFoundation

class SpeakerTest: NSObject, AVSpeechSynthesizerDelegate {
    let synth = AVSpeechSynthesizer()

    override init() {
        super.init()
    }

    func resampleBuffer( inSource: AVAudioPCMBuffer, newSampleRate: Double) -> AVAudioPCMBuffer?
    {
        // resample and convert mono to stereo

        var error          : NSError?
        let kChannelStereo = AVAudioChannelCount(2)
        let convertRate    = newSampleRate / inSource.format.sampleRate
        let outFrameCount  = AVAudioFrameCount(Double(inSource.frameLength) * convertRate)
        let outFormat      = AVAudioFormat(standardFormatWithSampleRate: newSampleRate, channels: kChannelStereo)!
        let avConverter    = AVAudioConverter(from: inSource.format, to: outFormat )
        let outBuffer      = AVAudioPCMBuffer(pcmFormat: outFormat, frameCapacity: outFrameCount)!
        let inputBlock     : AVAudioConverterInputBlock = { (inNumPackets, outStatus) -> AVAudioBuffer? in
            outStatus.pointee = AVAudioConverterInputStatus.haveData // very important, must have
            let audioBuffer : AVAudioBuffer = inSource
            return audioBuffer
        }
        avConverter?.sampleRateConverterAlgorithm = AVSampleRateConverterAlgorithm_Mastering
        avConverter?.sampleRateConverterQuality   = .max

        if let converter = avConverter
        {
            let status = converter.convert(to: outBuffer, error: &error, withInputFrom: inputBlock)
//            print("\(status): \(status.rawValue)")
            if ((status != .haveData) || (error != nil))
            {
                print("\(status): \(status.rawValue), error: \(String(describing: error))")
                return nil // conversion error
            }
        } else {
            return nil // converter not created
        }
//        print("success!")
        return outBuffer
    }


    func writeToFile(_ stringToSpeak: String, speaker: String)
    {
        var output    : AVAudioFile?
        let utterance = AVSpeechUtterance(string: stringToSpeak)
        let desktop   = "~/Desktop"
        let fileName  = "Utterance_Test.caf" // not in sandbox
        var tempPath  = desktop + "/" + fileName
        tempPath      = (tempPath as NSString).expandingTildeInPath

        let usingSampleRate = 22050.0  // 44100.0
        let outSettings = [
            AVFormatIDKey            : kAudioFormatLinearPCM, // kAudioFormatAppleLossless
            AVSampleRateKey          : usingSampleRate,
            AVNumberOfChannelsKey    : 2,
            AVEncoderAudioQualityKey : AVAudioQuality.max.rawValue
        ] as [String : Any]


        // temporarily ignore the speaker and use the default voice
        let curLangCode  = AVSpeechSynthesisVoice.currentLanguageCode()
        utterance.voice  = AVSpeechSynthesisVoice(language: curLangCode)
//        utterance.volume = 1.0
        print("Int32.max: \(Int32.max), Int32.min: \(Int32.min)")

        synth.write(utterance) { (buffer: AVAudioBuffer) in
            guard let pcmBuffer = buffer as? AVAudioPCMBuffer else {
                fatalError("unknown buffer type: \(buffer)")
            }
            if ( pcmBuffer.frameLength == 0 ) {
                // done
            } else {
                // append buffer to file
                var outBuffer : AVAudioPCMBuffer
                outBuffer = self.resampleBuffer( inSource: pcmBuffer, newSampleRate: usingSampleRate)! // doesn't work
//                outBuffer = self.convertSpeechBufferToFloatStereo( pcmBuffer ) // doesn't work
//                outBuffer = pcmBuffer // original format does work

                if ( output == nil ) {
                    //var bufferSettings = utterance.voice?.audioFileSettings
                    // Audio files cannot be non-interleaved.
                    var outSettings = outBuffer.format.settings
                    outSettings["AVLinearPCMIsNonInterleaved"] = false

                    let inFormat    = pcmBuffer.format
                    print("inBuffer description: \(inFormat.description)")
                    print("inBuffer settings: \(inFormat.settings)")
                    print("inBuffer format: \(inFormat.formatDescription)")
                    print("outBuffer settings: \(outSettings)\n")
                    print("outBuffer format: \(outBuffer.format.formatDescription)")

                    output = try! AVAudioFile(forWriting: URL(fileURLWithPath: tempPath), settings: outSettings)
                }
                try! output?.write(from: outBuffer)
                print("done")
            }
        }
    }
}


class ViewController: NSViewController {
    let speechDelivery = SpeakerTest()

    override func viewDidLoad() {
        super.viewDidLoad()
        let targetSpeaker   = "Allison"
        var sentenceToSpeak = ""
        for indx in 1...10
        {
            sentenceToSpeak += "This is sentence number \(indx). [[slnc 3000]] \n"
        }
        speechDelivery.writeToFile(sentenceToSpeak, speaker: targetSpeaker)
    }
}

Three tests can be performed (the three assignments to outBuffer above). The only one that works is writing the buffer directly to disk in its original format.

Is this really "32-bit big-endian signed integer"?

Am I addressing this correctly or is this a bug?

I'm on macOS 11.4

Adding the missing method and the calls to it:

                var outBuffer : AVAudioPCMBuffer
//                outBuffer = self.resampleBuffer( inSource: pcmBuffer, newSampleRate: usingSampleRate)! // doesn't work
                outBuffer = self.convertSpeechBufferToFloatStereo( pcmBuffer ) // doesn't work
//                outBuffer = pcmBuffer // original format does work

    func convertSpeechBufferToFloatStereo( _ inSource: AVAudioPCMBuffer ) -> AVAudioPCMBuffer
    {
        /*
         macOS speech buffer is int32ChannelData
         change format from int32ChannelData to floatChannelData
         duplicate left channel to right
         */
        let numSamples  = AVAudioFrameCount(inSource.frameLength)
        let sampleRate  = inSource.format.sampleRate
        let outFormat   = AVAudioFormat(commonFormat: AVAudioCommonFormat.pcmFormatFloat32,
                                        sampleRate: sampleRate, channels:AVAudioChannelCount(2),
                                        interleaved: false)
        let outSource = AVAudioPCMBuffer(pcmFormat: outFormat!, frameCapacity: numSamples)!
        outSource.frameLength = numSamples // frameLength must be set, or no data will be written to disk

        let sourceChannels  = UnsafeBufferPointer(start: inSource.int32ChannelData,  count: Int(inSource.format.channelCount))
        let destinChannels  = UnsafeBufferPointer(start: outSource.floatChannelData, count: Int(outSource.format.channelCount))
        let sourceLeftChan  = sourceChannels[0]
        let destinLeftChan  = destinChannels[0]
        let destinRightChan = destinChannels[1]

        for index in 0 ..< Int(numSamples)
        {
            // Must normalize Int32 to Float [-1.0, +1.0]
            // Int32.max: 2147483647, Int32.min: -2147483648
            // let sample   = Int32(bigEndian: sourceLeftChan[index])
            let sample   = sourceLeftChan[index]
            let floatVal = Float(sample) / Float(Int32.max)
            destinLeftChan[index]  = floatVal
            destinRightChan[index] = floatVal
        }
        return outSource
    }

Can you check if this is resolved in the latest Monterey beta? We made some fixes to the buffer format on macOS in Monterey.

I worked around the problem by writing the buffer to disk, then reading it back into memory.

The "resampleBuffer" method works once I've done this.

I haven't been able to determine what the buffer format actually is, but I do not think it's <AVAudioFormat 0x6000012862b0: 1 ch, 22050 Hz, 'lpcm' (0x0000000E) 32-bit big-endian signed integer>.
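
A minimal sketch of the read-back half of that round trip (assuming the speech buffers were already written to fileURL in their native format; error handling kept minimal):

import AVFoundation

// Read the file back; unlike the live speech buffer's format,
// processingFormat reflects what is really in the file.
func readBackSpeechFile(at fileURL: URL) throws -> AVAudioPCMBuffer? {
    let inFile = try AVAudioFile(forReading: fileURL)
    guard let readBuffer = AVAudioPCMBuffer(pcmFormat: inFile.processingFormat,
                                            frameCapacity: AVAudioFrameCount(inFile.length))
    else { return nil }
    try inFile.read(into: readBuffer)
    return readBuffer // resampleBuffer(inSource:newSampleRate:) works on this
}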

The same happens on iOS 14.6. The code below produces a very noisy audio file.

let speechSynthesizer = AVSpeechSynthesizer()
let utterance = AVSpeechUtterance(string: "Hi, my name is Alex.")
let voice = AVSpeechSynthesisVoice(identifier: "com.apple.speech.voice.Alex")!
utterance.voice = voice
let audioFileSettings = voice.audioFileSettings
let audioUrl = ... // valid URL
let audioFile = try! AVAudioFile(forWriting: audioUrl, settings: audioFileSettings, commonFormat: .pcmFormatInt32, interleaved: false)
speechSynthesizer.write(utterance) { buffer in
    let pcmBuffer = buffer as! AVAudioPCMBuffer
    if pcmBuffer.frameLength > 0 {
        try! audioFile.write(from: pcmBuffer)
    }
}

It happens with some voices (for example, Alex) but not with others (for example, Samantha). The code is so simple that I don't think the bug is in it; it must be a framework problem. Mr. Frameworks Engineer, fix this, please.
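
To see which voices claim which formats, a quick diagnostic sketch:

import AVFoundation

// Dump each installed voice's claimed file settings so the integer-format
// voices (like Alex) can be compared with the float ones.
for voice in AVSpeechSynthesisVoice.speechVoices() {
    print(voice.identifier, voice.audioFileSettings)
}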

I ran into the same issue but (after two days) found the following:

buffer.format cannot be used as the source format description when converting the buffer. The data is actually always 32-bit float.

If we trusted the format, the buffer would contain signed, packed, big-endian integer samples, which is wrong.

This explains why you will find the data in buffer.int32ChannelData and not in buffer.floatChannelData.

However, this alone would just produce audio that sounds overdriven. The explanation for the noise is that the big-endian flag is wrong as well; this wrong flag is what is responsible for the noise.

The simple solution is not to rely on buffer.format, and instead to use the following hard-coded format:

AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: Double(22050), channels: 1, interleaved: false)

which represents exactly the format generated by AVSpeechSynthesizer.

But if you use this correct format, keep in mind that you will still find the data in buffer.int32ChannelData and not in buffer.floatChannelData.
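
A minimal sketch of putting this together (reinterpretSpeechBuffer is a made-up name; it just copies the raw bytes into a buffer that carries the correct format):

import AVFoundation

// Repackage the mislabeled speech buffer under the format the samples
// actually use. The float bytes are only reachable through
// int32ChannelData because of the wrong format description, so copy
// them across unchanged.
func reinterpretSpeechBuffer(_ inBuffer: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
    guard let trueFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                         sampleRate: inBuffer.format.sampleRate,
                                         channels: 1,
                                         interleaved: false),
          let outBuffer = AVAudioPCMBuffer(pcmFormat: trueFormat,
                                           frameCapacity: inBuffer.frameLength),
          let src = inBuffer.int32ChannelData,
          let dst = outBuffer.floatChannelData
    else { return nil }
    outBuffer.frameLength = inBuffer.frameLength
    memcpy(dst[0], src[0], Int(inBuffer.frameLength) * MemoryLayout<Float32>.size)
    return outBuffer
}

The repackaged buffer can then go through a normal AVAudioConverter (for example, the resampleBuffer method from the question) to get float stereo at the desired sample rate.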
