I thought quitting Xcode and cleaning the project would be enough, but it was only after restarting my machine that the desired behavior came back.
I had not restarted the machine since installing Xcode 13 the previous day.
It may be that something else needs to be done on macOS, but because this works in the simulator, I filed a bug report.
Jul 9, 2021 at 2:27 PM – FB9298976
What would be a good workaround or alternative? I moved my code and workflow from NSSpeechSynthesizer to AVSpeechSynthesizer for the sole benefit of applying a pronunciation dictionary, so I will gladly consider whatever I can do in the interim to make this work.
I worked around the problem by writing the buffer to disk, then reading it back into memory.
The "resampleBuffer" method works once I've done this.
I haven't been able to determine what the buffer format is, but I do not think it's <AVAudioFormat 0x6000012862b0: 1 ch, 22050 Hz, 'lpcm' (0x0000000E) 32-bit big-endian signed integer>.
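For anyone who wants to try the same workaround, here is a minimal sketch of the round trip I mean; the function name and the temporary file name are mine, purely for illustration, and it assumes the buffer's own format settings are acceptable to AVAudioFile.

import AVFoundation

// Sketch of the disk round trip; roundTripThroughDisk and the temp file name are hypothetical.
func roundTripThroughDisk(_ buffer: AVAudioPCMBuffer) throws -> AVAudioPCMBuffer {
    let tempURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("speech_roundtrip.caf")

    // Write, scoped so the file is finalized before we read it back.
    // The file's processing format is forced to match the buffer's format
    // so write(from:) accepts the synthesizer buffer as-is.
    do {
        let outFile = try AVAudioFile(forWriting: tempURL,
                                      settings: buffer.format.settings,
                                      commonFormat: buffer.format.commonFormat,
                                      interleaved: buffer.format.isInterleaved)
        try outFile.write(from: buffer)
    }

    // On read, AVAudioFile presents deinterleaved Float32 (its processing format),
    // which may be why the resample step behaves afterwards.
    let inFile = try AVAudioFile(forReading: tempURL)
    guard let reloaded = AVAudioPCMBuffer(pcmFormat: inFile.processingFormat,
                                          frameCapacity: AVAudioFrameCount(inFile.length)) else {
        throw NSError(domain: "roundTripThroughDisk", code: -1)
    }
    try inFile.read(into: reloaded)
    return reloaded
}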
I just learned that "en-IN" is not a language choice on Catalina.
Substituting "en-US" will achieve the same outcome, though the voice will be US English:
let indiaVoice = AVSpeechSynthesisVoice(language: "en-US")!
Adding the missing method and calls:
var outBuffer : AVAudioPCMBuffer
// outBuffer = self.resampleBuffer( inSource: pcmBuffer, newSampleRate: usingSampleRate )! // doesn't work
outBuffer = self.convertSpeechBufferToFloatStereo( pcmBuffer )                             // doesn't work
// outBuffer = pcmBuffer                                                                   // the original format does work
func convertSpeechBufferToFloatStereo( _ inSource: AVAudioPCMBuffer ) -> AVAudioPCMBuffer
{
    /*
     The macOS speech buffer is int32ChannelData.
     Change the format from int32ChannelData to floatChannelData,
     then duplicate the left channel to the right.
     */
    let numSamples = AVAudioFrameCount(inSource.frameLength)
    let sampleRate = inSource.format.sampleRate
    let outFormat  = AVAudioFormat(commonFormat: AVAudioCommonFormat.pcmFormatFloat32,
                                   sampleRate: sampleRate, channels: AVAudioChannelCount(2),
                                   interleaved: false)
    let outSource = AVAudioPCMBuffer(pcmFormat: outFormat!, frameCapacity: numSamples)!
    outSource.frameLength = numSamples // frameLength must be set to ensure the data is written to disk

    let sourceChannels = UnsafeBufferPointer(start: inSource.int32ChannelData,  count: Int(inSource.format.channelCount))
    let destinChannels = UnsafeBufferPointer(start: outSource.floatChannelData, count: Int(outSource.format.channelCount))
    let sourceLeftChan  = sourceChannels[0]
    let destinLeftChan  = destinChannels[0]
    let destinRightChan = destinChannels[1]

    for index in 0 ..< Int(numSamples)
    {
        // Must normalize Int32 to Float [-1.0, +1.0]
        // Int32.max: 2147483647, Int32.min: -2147483648
        // let sample = Int32(bigEndian: sourceLeftChan[index])
        let sample   = sourceLeftChan[index]
        let floatVal = Float(sample) / Float(Int32.max)
        destinLeftChan[index]  = floatVal
        destinRightChan[index] = floatVal
    }
    return outSource
}
Filed report just now. Jun 30, 2021 at 9:19 PM – FB9225882
You are 100% correct.
The utterance was spoken aloud to completion, but nothing else worked.
I mistakenly thought the process would remain in memory until speech completed and the associated delegates fired.
For those who may need a working solution, here it is.
class SpeakerTest: NSObject, AVSpeechSynthesizerDelegate {

    let synth = AVSpeechSynthesizer()

    override init() {
        super.init()
        synth.delegate = self
    }

    func isSandboxEnvironment() -> Bool
    {
        let environ = ProcessInfo.processInfo.environment
        return ( environ["APP_SANDBOX_CONTAINER_ID"] != nil )
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance) {
        print("Utterance didFinish")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           willSpeakRangeOfSpeechString characterRange: NSRange,
                           utterance: AVSpeechUtterance)
    {
        print("speaking range: \(characterRange)")
    }

    func selectVoice(targetSpeaker: String, defLangCode: String) -> AVSpeechSynthesisVoice
    {
        var usedVoice = AVSpeechSynthesisVoice(language: defLangCode) // should be the default voice
        let userCode  = AVSpeechSynthesisVoice.currentLanguageCode()
        let voices    = AVSpeechSynthesisVoice.speechVoices()
        for voice in voices {
            // print("\(voice.identifier) \(voice.name) \(voice.quality) \(voice.language)")
            if (voice.name.lowercased() == targetSpeaker.lowercased())
            {
                usedVoice = AVSpeechSynthesisVoice(identifier: voice.identifier)
                break
            }
        }
        // ensure we return a valid voice
        if (usedVoice == nil) { usedVoice = AVSpeechSynthesisVoice(language: userCode) }
        return usedVoice!
    }

    func speak(_ string: String, speaker: String) {
        let utterance = AVSpeechUtterance(string: string)
        utterance.voice = selectVoice(targetSpeaker: speaker, defLangCode: "en-US")
        synth.speak(utterance)
    }

    func writeToBuffer(_ stringToSpeak: String, speaker: String)
    {
        print("entering writeToBuffer")
        let utterance = AVSpeechUtterance(string: stringToSpeak)
        utterance.voice = selectVoice(targetSpeaker: speaker, defLangCode: "en-US")
        synth.write(utterance) { (buffer: AVAudioBuffer) in
            print("executing synth.write")
            guard let pcmBuffer = buffer as? AVAudioPCMBuffer else {
                fatalError("unknown buffer type: \(buffer)")
            }
            if ( pcmBuffer.frameLength == 0 ) {
                print("buffer is empty")
            } else {
                print("buffer has content \(buffer)")
            }
        }
    }

    func writeToFile(_ stringToSpeak: String, speaker: String)
    {
        let utterance = AVSpeechUtterance(string: stringToSpeak)
        var output : AVAudioFile?

        let desktop  = "~/Desktop"
        let fileName = "Utterance_Test.caf" // not in sandbox
        var tempPath = desktop + "/" + fileName
        tempPath = (tempPath as NSString).expandingTildeInPath
        // if sandboxed, it goes in the container
        if ( isSandboxEnvironment() ) { tempPath = "Utterance_Test.caf" }

        utterance.voice = selectVoice(targetSpeaker: speaker, defLangCode: "en-US")
        synth.write(utterance) { (buffer: AVAudioBuffer) in
            guard let pcmBuffer = buffer as? AVAudioPCMBuffer else {
                fatalError("unknown buffer type: \(buffer)")
            }
            if ( pcmBuffer.frameLength == 0 ) {
                // done
            } else {
                // append buffer to file
                if ( output == nil ) {
                    let bufferSettings = utterance.voice?.audioFileSettings
                    output = try! AVAudioFile( forWriting: URL(fileURLWithPath: tempPath), settings: bufferSettings! )
                }
                try! output?.write(from: pcmBuffer)
            }
        }
    }
}

class ViewController: NSViewController {

    let speechDelivery = SpeakerTest()

    override func viewDidLoad() {
        super.viewDidLoad()

        let targetSpeaker = "Allison"
        var sentenceToSpeak = "This writes to buffer and disk."
        sentenceToSpeak += "Also, 'didFinish' and 'willSpeakRangeOfSpeechString' delegates fire."

        speechDelivery.writeToBuffer(sentenceToSpeak, speaker: targetSpeaker)
        speechDelivery.speak(sentenceToSpeak, speaker: targetSpeaker)
        speechDelivery.writeToFile(sentenceToSpeak, speaker: targetSpeaker)
    }

    override var representedObject: Any? {
        didSet {
            // Update the view, if already loaded.
        }
    }
}
I was still seeking a resolution, an alternative method, or a property to observe for identifying fast-forward behavior.
https://developer.apple.com/forums/thread/663489
AVPlayer.timeControlStatus and AVPlayer.rate are wrong during fast forward or backward
I found the above post, which describes the same behavior.
When fast-forward is engaged, the reported "rate" is zero, although playback is clearly not paused but moving at a fast rate.
Unfortunately, I have not identified a solution.
Oddly, I can't edit the post to address the formatting (missing newlines), and the image didn't show up, but here is the image to clarify what I mean regarding the fast-forward control.
Option-clicking here will increment the value by 0.1, from 1.0 to 2.0 (e.g. 1.1x, 1.2x, 1.3x).
Command-clicking here will rotate between 2x, 5x, 10x, 30x and 60x.
The property observer reports back 0.0 once you command-click.
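For reference, here is a minimal sketch of the kind of observation I mean, assuming a plain AVPlayer and Swift key-path KVO; the class name and the stored player are mine. The rate delivered here is what comes back as 0.0 after a command-click.

import AVFoundation

final class RateWatcher {
    // `player` is a placeholder for whatever AVPlayer instance you are driving.
    private let player: AVPlayer
    private var observations: [NSKeyValueObservation] = []

    init(player: AVPlayer) {
        self.player = player

        // During command-click fast-forward this reports 0.0,
        // even though playback is visibly racing ahead.
        observations.append(player.observe(\.rate, options: [.new]) { _, change in
            print("rate changed: \(change.newValue ?? 0)")
        })

        // timeControlStatus does not appear to be any more helpful in this situation.
        observations.append(player.observe(\.timeControlStatus, options: [.new]) { player, _ in
            print("timeControlStatus: \(player.timeControlStatus.rawValue)")
        })
    }
}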
This is absolutely extraordinary. I've been staring at this code for a while to understand its elegance, scalability, and effectiveness. I have more to study as I read through materials on property wrappers.
Neat-o!
That is an understatement.
Thank you for all your help.
Conclusion:
This can't be done with SFSpeechURLRecognitionRequest(url:).
You must use SFSpeechAudioBufferRecognitionRequest().
The solution is to use SFSpeechAudioBufferRecognitionRequest and read the audio into a buffer, then either shift the entire audio block left after every recognition (left-trimming, removing the previously recognized speech segment) or feed SFSpeechAudioBufferRecognitionRequest 60-second snippets of audio.
Also, because progress reporting didn't work, I used the running count to determine the current position being recognized relative to the length of the audio, and derived the progress from that.
Requirement
You must keep a rolling count of where your segments were found to keep track of your position.
Caveats
If the request was short (let's say 30 seconds), recognition will not proceed, so a short segment must be padded with silence and accounted for in your bookkeeping.
If the audio buffer contains blocks of more than one minute of non-speech (silence, music, unintelligible speech), you must wait for a timeout and then advance 60 seconds; otherwise you will simply time out and not get any further recognition data. I have not been able to determine how to shorten the timeout, which appears to be 22 seconds.
Example: If you have audio with a two-minute stretch of non-speech, you will need to wait 22 seconds after the first timeout, cancel the request, advance to the next position, append the audio, and wait on the next recognition request, which is another 22-second timeout before again advancing. So if the audio contains many stretches of non-speech, this process works but is problematic in terms of processing time. Granted, a 22-second timeout is better than a 1-minute timeout.
I am still tuning this process, but it does work.
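To make the snippet variant concrete, here is a minimal sketch of the shape of the loop; the type and method names (SnippetRecognizer, recognizeNextSnippet, frameOffset) are mine, speech-recognition authorization is assumed to have already been granted, and the silence padding and timeout handling described above are left to the caller.

import AVFoundation
import Speech

// Hypothetical helper illustrating the 60-second-snippet approach.
final class SnippetRecognizer {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private let file: AVAudioFile
    private var frameOffset: AVAudioFramePosition = 0   // rolling count = position/progress marker
    private var task: SFSpeechRecognitionTask?

    init(url: URL) throws {
        file = try AVAudioFile(forReading: url)
    }

    // Feed one 60-second snippet, then report progress from the rolling frame count.
    func recognizeNextSnippet(completion: @escaping (String?) -> Void) {
        let framesPerSnippet = AVAudioFrameCount(file.processingFormat.sampleRate * 60)
        file.framePosition = frameOffset
        guard let buffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                            frameCapacity: framesPerSnippet),
              (try? file.read(into: buffer)) != nil,
              buffer.frameLength > 0 else {
            completion(nil)   // end of audio
            return
        }
        frameOffset += AVAudioFramePosition(buffer.frameLength)
        let progress = Double(frameOffset) / Double(file.length)
        print("progress: \(progress)")

        // One request per snippet; only the final result is of interest here.
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = false
        request.append(buffer)
        request.endAudio()

        task = recognizer.recognitionTask(with: request) { result, error in
            if let result = result, result.isFinal {
                completion(result.bestTranscription.formattedString)
            } else if error != nil {
                completion(nil)   // timeout / non-speech: the caller advances and retries
            }
        }
    }
}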
I took the plunge today and upgraded to Big Sur (11.2.3), and unfortunately neither write, nor the callbacks for didFinish nor willSpeakRangeOfSpeechString, get called. I only upgraded to utilize this framework. Is this working for anyone, and if so, what is the secret? Is there an entitlement that must be enabled?
In the writeToBuffer function, none of the print statements are executed, so synth.write(utterance) is not executing.
Any advice?
func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance)
{
    print("Utterance didFinish")
}
I also considered that the callback didn't execute because of using the write method, so I only executed the speak method.
The speech was heard; however, the callbacks were not executed. I am at a loss for what is needed to make this work. I will gladly continue to utilize NSSpeechSynthesizer if someone has information on addSpeechDictionary. This is my only impediment in speech: getting the correct pronunciation without resorting to creating spelling workarounds for word pronunciations.
Fixed -- YES! Thank you for that information. I'm not going mad :-).
Hopefully this is not a silly question, but is there a way to resolve or work around this in Catalina? A major OS upgrade as the fix is a big ask.
Also, does AVSpeechSynthesizer support offline rendering faster than real-time? It's not something I can test under Catalina, and I haven't been able to find documentation for this feature, which does exist in NSSpeechSynthesizer.
Why? I have been using NSSpeechSynthesizer for years with an AVAudioEngine + Speech mixing workflow for recording audio to a file, rendering faster than real-time. I have a system and it works, but I must work around some challenges with pronunciations.
I can't make the speech dictionary (addSpeechDictionary) work under NSSpeechSynthesizer (I don't know if it's a known problem or just me) and have resorted to butchered spellings to get a voice to pronounce words correctly. My hope is that I can utilize AVSpeechSynthesizer in IPA mode, or some other method, to pronounce words correctly while also rendering to disk, all faster than real-time.
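For anyone considering the same route, this is the kind of IPA usage I have in mind, as a minimal sketch: AVSpeechSynthesisIPANotationAttribute is the AVFoundation constant, while the sample sentence, the notation string, and the helper name are mine and purely illustrative.

import AVFoundation

// Hypothetical helper: utteranceWithIPA and the sample strings are placeholders.
// AVSpeechSynthesisIPANotationAttribute is applied to the character range
// whose pronunciation should be overridden.
func utteranceWithIPA() -> AVSpeechUtterance {
    let text = "Hello, Ployer."                       // placeholder sentence
    let attributed = NSMutableAttributedString(string: text)
    let ipaKey = NSAttributedString.Key(rawValue: AVSpeechSynthesisIPANotationAttribute)

    // Range of the word whose pronunciation we want to control; the notation
    // string below is only illustrative, not a verified pronunciation.
    let range = (text as NSString).range(of: "Ployer")
    attributed.addAttribute(ipaKey, value: "plˈɔɪɝ", range: range)

    let utterance = AVSpeechUtterance(attributedString: attributed)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
    return utterance
}

Whether arbitrary IPA strings are honored on macOS, and whether they survive the write-to-file path shown earlier, is exactly what I still need to verify.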