Speech Recognition: how to utilize SFTranscriptionSegment to track recognition timestamps?


How is SFTranscriptionSegment used to track where in the audio recognized statements occur?
My goal is to transcribe the audio and record where in the audio file each sentence is spoken.
The timestamps appear to reset after each recognized phrase.

If I try to keep a running total of the timestamps plus durations, it no longer matches where the phrase was actually spoken after the first or second recognized phrase.

If I keep a running total starting from the first segment (SFTranscriptionSegment[0]) and then add each result's last segment timestamp + duration, I would expect to stay aligned with the next speech segment, but it does not.



The following affirmations are used as a test of speech recognition.
I synthesize the affirmations with the NSSpeechSynthesizer Veena voice (with only padded silence between sentences) and write them to a file.
I then run the file through speech recognition to test the output against a known set of sentences.
If I need to know where in the file a speech segment begins, how do I get that from the timestamps and durations?
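
For reference, the test file is generated roughly along these lines (a simplified sketch; the voice lookup, the [[slnc]] pause length, and the output handling are illustrative, not my exact code):

Code Block
import AppKit

// Rough sketch: find the Veena voice by name, then synthesize the affirmations
// to an audio file with NSSpeechSynthesizer, padding silence between sentences.
func synthesizeAffirmations(_ sentences: [String], to url: URL) {
    let veena = NSSpeechSynthesizer.availableVoices.first { voice in
        (NSSpeechSynthesizer.attributes(forVoice: voice)[.name] as? String) == "Veena"
    }
    let synth = NSSpeechSynthesizer(voice: veena) // falls back to the default voice if nil
    // [[slnc 2000]] inserts two seconds of silence between sentences.
    let text = sentences.joined(separator: " [[slnc 2000]] ")
    _ = synth?.startSpeaking(text, to: url)
}
code-block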

I set on-device recognition to true because on-device requests have no file-length limit; my target files can be up to two hours long, while my test files are 15-30 minutes, so this must be done on the device.

recogRequest.requiresOnDeviceRecognition = true

Running on macOS Catalina 10.15.7

Affirmations

I plan leisure time regularly.
I balance my work life and my leisure life perfectly.
I return to work refreshed and renewed.
I experience a sense of well being while I work.
I experience a sense of inner peace while I relax.
I like what I do and I do what I like.
I increase in mental and emotional health daily.
This transcription is now concluded.

The function below produces the following output:
Code Block
import Speech

func recognizeFile_Compact(url: NSURL) {
    let language = "en-US" // "en-GB"
    let recognizer = SFSpeechRecognizer(locale: Locale(identifier: language))!
    let recogRequest = SFSpeechURLRecognitionRequest(url: url as URL)
    recognizer.supportsOnDeviceRecognition = true   // make sure the device is ready to do the work
    recognizer.defaultTaskHint = .dictation         // give a hint that this is dictation
    recogRequest.requiresOnDeviceRecognition = true // we want the device to do all the work
    recogRequest.shouldReportPartialResults = false // we don't want partial results
    var strCount = 0
    _ = recognizer.recognitionTask(with: recogRequest, resultHandler: { (result, error) in
        guard let result = result else {
            print("Recognition failed, \(error!)")
            return
        }
        let progress = recognizer.queue.progress.fractionCompleted // we never get progress other than 0.0
        let text = result.bestTranscription.formattedString
        strCount += 1
        print(" #\(strCount), Progress: \(progress) \n\n",
              "FormattedString: \(text) \n\n",
              "BestTranscription: \(result.bestTranscription)", "\n\n")
        if result.isFinal { print("WE ARE FINALIZED") }
    })
}
code-block



#1, Progress: 0.0

FormattedString: I plan Lisa time regularly

BestTranscription: <SFTranscription: 0x600000cac240>, formattedString=I plan Lisa time regularly, segments=(
"<SFTranscriptionSegment: 0x6000026266a0>, substringRange={0, 1}, timestamp=15.96, duration=0.1499999999999986, confidence=0.862, substring=I, alternativeSubstrings=(\n), phoneSequence=AY, ipaPhoneSequence=\U02c8a\U0361\U026a, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x6000026275a0>, substringRange={2, 4}, timestamp=16.11, duration=0.3000000000000007, confidence=0.172, substring=plan, alternativeSubstrings=(\n planned,\n blend,\n blame,\n played\n), phoneSequence=p l AA n, ipaPhoneSequence=p.l.\U02c8\U00e6.n, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x600002625ec0>, substringRange={7, 4}, timestamp=16.41, duration=0.3300000000000018, confidence=0.71, substring=Lisa, alternativeSubstrings=(\n Liza,\n Lise\n), phoneSequence=l EE z uh, ipaPhoneSequence=l.\U02c8i.z.\U0259, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x600002626f40>, substringRange={12, 4}, timestamp=16.74, duration=0.2999999999999972, confidence=0.877, substring=time, alternativeSubstrings=(\n), phoneSequence=t AY m, ipaPhoneSequence=t.\U02c8a\U0361\U026a.m, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x6000026271e0>, substringRange={17, 9}, timestamp=17.04, duration=0.7200000000000024, confidence=0.88, substring=regularly, alternativeSubstrings=(\n), phoneSequence=r EH g y uh l ur l ee, ipaPhoneSequence=\U027b.\U02c8\U025b.g.j.\U0259.l.\U0259 \U027b.l.i, voiceAnalytics=(null)"
), speakingRate=0.000000, averagePauseDuration=0.000000





Replies

Conclusion:

This can't be done with SFSpeechURLRecognitionRequest(url:)
You must utilize SFSpeechAudioBufferRecognitionRequest()

The solution is to use SFSpeechAudioBufferRecognitionRequest: read the audio into a buffer, then either shift the entire audio block left after every recognition (left-trimming, i.e. removing the previously recognized speech segment) or feed SFSpeechAudioBufferRecognitionRequest 60-second snippets of audio.
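
A minimal sketch of the snippet approach (the function and parameter names here are illustrative, not my exact code): read roughly 60 seconds of the file into an AVAudioPCMBuffer, hand it to SFSpeechAudioBufferRecognitionRequest, and add the chunk's starting offset back onto each segment's request-relative timestamp to get its absolute position in the file.

Code Block
import Speech
import AVFoundation

// Sketch: recognize one ~60-second slice of the file starting at offsetSeconds,
// then report each segment's absolute position by adding the slice offset back in.
func recognizeChunk(of fileURL: URL,
                    startingAt offsetSeconds: Double,
                    lengthSeconds: Double = 60,
                    using recognizer: SFSpeechRecognizer) throws {
    let audioFile = try AVAudioFile(forReading: fileURL)
    let format = audioFile.processingFormat
    let sampleRate = format.sampleRate

    // Seek to the chunk start and read up to lengthSeconds of audio.
    audioFile.framePosition = AVAudioFramePosition(offsetSeconds * sampleRate)
    let framesToRead = AVAudioFrameCount(lengthSeconds * sampleRate)
    guard let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: framesToRead) else { return }
    try audioFile.read(into: buffer, frameCount: framesToRead)

    let request = SFSpeechAudioBufferRecognitionRequest()
    request.requiresOnDeviceRecognition = true
    request.shouldReportPartialResults = false
    request.append(buffer)
    request.endAudio() // no more audio for this chunk

    _ = recognizer.recognitionTask(with: request) { result, error in
        guard let result = result else { print("Recognition failed: \(error!)"); return }
        for segment in result.bestTranscription.segments {
            // Timestamps are relative to the start of this request,
            // so add the chunk offset to get the position in the file.
            let absoluteStart = offsetSeconds + segment.timestamp
            print(String(format: "%8.2fs  %@", absoluteStart, segment.substring))
        }
    }
}
code-block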

Also, because the progress reporting didn't work, I used the running count of the position currently being recognized, relative to the total length of the audio, to compute the progress myself.
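
Something like this works for the progress calculation (the helper name is mine):

Code Block
import AVFoundation

// Sketch: derive progress from the running offset and the file's total length,
// since queue.progress never moved past 0.0.
func recognitionProgress(offsetSeconds: Double, fileURL: URL) throws -> Double {
    let audioFile = try AVAudioFile(forReading: fileURL)
    let totalSeconds = Double(audioFile.length) / audioFile.processingFormat.sampleRate
    return min(offsetSeconds / totalSeconds, 1.0)
}
code-block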

Requirement
You must keep a rolling count of where your segments were found in order to track your position in the file.
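
Concretely, after each chunk's final result the offset can be advanced to the end of the last recognized segment, for example (a sketch; names are illustrative):

Code Block
import Speech

// Sketch: advance the running offset to the end of the last recognized segment
// so the next chunk starts where recognition left off.
func advanceOffset(after result: SFSpeechRecognitionResult,
                   chunkStart offsetSeconds: Double) -> Double {
    guard let last = result.bestTranscription.segments.last else {
        return offsetSeconds // nothing recognized; the caller decides how far to skip
    }
    // last.timestamp is relative to the start of this chunk's request.
    return offsetSeconds + last.timestamp + last.duration
}
code-block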

Caveats

If the request is short (say, around 30 seconds), recognition will not proceed, so a short segment must be padded with silence, and the padding must be accounted for in your position bookkeeping.
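
The padding step might look roughly like this (a sketch; the helper name and the choice of zeroed float frames are mine):

Code Block
import AVFoundation
import Speech

// Sketch: append a buffer of silent (zeroed) frames to the request so that a
// short chunk is still long enough for recognition to proceed. Keep the padding
// out of your position accounting.
func appendSilence(to request: SFSpeechAudioBufferRecognitionRequest,
                   seconds: Double,
                   format: AVAudioFormat) {
    let frames = AVAudioFrameCount(seconds * format.sampleRate)
    guard let silence = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frames) else { return }
    silence.frameLength = frames
    if let channels = silence.floatChannelData {
        for ch in 0..<Int(format.channelCount) {
            for i in 0..<Int(frames) { channels[ch][i] = 0 }
        }
    }
    request.append(silence)
}
code-block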

If the audio buffer contains more than about one minute of non-speech (silence, music, unintelligible speech), you must wait for a timeout and then advance 60 seconds; otherwise you will simply time out and get no further recognition data. I have not been able to find a way to shorten the timeout, which appears to be about 22 seconds.

Example: if the audio has a two-minute stretch of non-speech, you must wait 22 seconds for the first timeout, cancel the request, advance to the next position, append the audio, and wait on the new recognition request, which is another 22-second timeout, before advancing again. So if the audio contains many stretches of non-speech, the process still works but is costly in processing time. Granted, a 22-second timeout is better than a one-minute timeout.
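
The bookkeeping for those non-speech stretches might look roughly like this (the 25-second deadline is just my margin over the observed 22-second timeout; names are illustrative):

Code Block
import Foundation
import Speech

// Sketch: give each chunk's recognition task a deadline; if it is still running
// when the deadline fires, assume a non-speech timeout, cancel the task, and
// hand the caller a position 60 seconds further into the file.
func runWithDeadline(task: SFSpeechRecognitionTask,
                     chunkStart offsetSeconds: Double,
                     onTimeout: @escaping (Double) -> Void) {
    DispatchQueue.main.asyncAfter(deadline: .now() + 25) {
        if task.state == .starting || task.state == .running {
            task.cancel()                 // give up on this stretch of non-speech
            onTimeout(offsetSeconds + 60) // resume one minute further into the file
        }
    }
}
code-block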

I am still tuning this process but it does work.

  • What improvements or discoveries have you had since you posted this?
