How is SFTranscriptionSegment used to track where in the audio recognized statements occur?
My goal is to transcribe the audio and record where in the audio file each sentence is spoken.
The timestamps appear to reset after each recognized phrase.
If I keep a running total of timestamp + duration, it stops matching where the phrase was actually spoken after the first or second recognized phrase.
If I keep a running total of the first SFTranscriptionSegment's timestamp plus each subsequent result's last SFTranscriptionSegment timestamp + duration, I should stay aligned with the next speech segment, but I do not.
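Roughly, the bookkeeping I am attempting looks like the sketch below (segmentOffset and noteSegments are illustrative names, not my production code):

import Foundation
import Speech

// Sketch: keep a running offset and chain each result's segments onto it.
// This is the arithmetic that stops lining up after the first result or two.
var segmentOffset: TimeInterval = 0

func noteSegments(of transcription: SFTranscription) {
    guard let first = transcription.segments.first,
          let last = transcription.segments.last else { return }

    // Where I expect this phrase to start in the file:
    print("expected phrase start: \(segmentOffset + first.timestamp)s")

    for seg in transcription.segments {
        // timestamp and duration are TimeIntervals, in seconds
        print("  '\(seg.substring)' at \(seg.timestamp)s for \(seg.duration)s")
    }

    // Advance the running count past the end of the last segment,
    // hoping the next result's timestamps continue from there.
    segmentOffset += last.timestamp + last.duration
}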
The following affirmations are used as a test of speech recognition.
I output the affirmations to a file using the NSSpeechSynthesizer Veena voice, with only padded silence between sentences.
I then feed that file to speech recognition to test the output against a known set of sentences.
If I need to know where in the file a speech segment begins, how do I get it from the timestamps and durations?
I set on-device recognition to true because on-device requests have no length limit; my target files can be up to two hours long (my test files are 15 to 30 minutes), so this must be done on the device.
recogRequest.requiresOnDeviceRecognition = true
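(As a sanity check, SFSpeechRecognizer also exposes supportsOnDeviceRecognition, so a guard like this sketch is probably safer than assuming support:)

// Sketch: only force on-device work when the recognizer reports support.
if recognizer.supportsOnDeviceRecognition {
    recogRequest.requiresOnDeviceRecognition = true
} else {
    print("On-device recognition is not available for this locale")
}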
Running on macOS Catalina 10.15.7
Affirmations
I plan leisure time regularly.
I balance my work life and my leisure life perfectly.
I return to work refreshed and renewed.
I experience a sense of well being while I work.
I experience a sense of inner peace while I relax.
I like what I do and I do what I like.
I increase in mental and emotional health daily.
This transcription is now concluded.
The function below produces the following output.
import Foundation
import Speech

func recognizeFile_Compact(url: NSURL) {
    let language = "en-US" // "en-GB"
    let recognizer = SFSpeechRecognizer(locale: Locale(identifier: language))!
    let recogRequest = SFSpeechURLRecognitionRequest(url: url as URL)
    recognizer.supportsOnDeviceRecognition = true   // make sure the device is ready to do the work
    recognizer.defaultTaskHint = .dictation         // give a hint that this is dictation
    recogRequest.requiresOnDeviceRecognition = true // we want the device to do all the work
    recogRequest.shouldReportPartialResults = false // we don't want partial results
    var strCount = 0
    let recogTask = recognizer.recognitionTask(with: recogRequest, resultHandler: { (result, error) in
        guard let result = result else {
            print("Recognition failed, \(error!)")
            return
        }
        let progress = recognizer.queue.progress.fractionCompleted // we never get progress other than 0.0
        let text = result.bestTranscription.formattedString
        strCount += 1
        print(" #\(strCount), Progress: \(progress) \n\n",
              "FormattedString: \(text) \n\n",
              "BestTranscription: \(result.bestTranscription)",
              "\n\n")
        if result.isFinal {
            print("WE ARE FINALIZED")
        }
    })
    _ = recogTask // unused here; keep the reference if the task needs to be cancelled later
}
#1, Progress: 0.0
FormattedString: I plan Lisa time regularly
BestTranscription: <SFTranscription: 0x600000cac240>, formattedString=I plan Lisa time regularly, segments=(
"<SFTranscriptionSegment: 0x6000026266a0>, substringRange={0, 1}, timestamp=15.96, duration=0.1499999999999986, confidence=0.862, substring=I, alternativeSubstrings=(\n), phoneSequence=AY, ipaPhoneSequence=\U02c8a\U0361\U026a, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x6000026275a0>, substringRange={2, 4}, timestamp=16.11, duration=0.3000000000000007, confidence=0.172, substring=plan, alternativeSubstrings=(\n planned,\n blend,\n blame,\n played\n), phoneSequence=p l AA n, ipaPhoneSequence=p.l.\U02c8\U00e6.n, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x600002625ec0>, substringRange={7, 4}, timestamp=16.41, duration=0.3300000000000018, confidence=0.71, substring=Lisa, alternativeSubstrings=(\n Liza,\n Lise\n), phoneSequence=l EE z uh, ipaPhoneSequence=l.\U02c8i.z.\U0259, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x600002626f40>, substringRange={12, 4}, timestamp=16.74, duration=0.2999999999999972, confidence=0.877, substring=time, alternativeSubstrings=(\n), phoneSequence=t AY m, ipaPhoneSequence=t.\U02c8a\U0361\U026a.m, voiceAnalytics=(null)",
"<SFTranscriptionSegment: 0x6000026271e0>, substringRange={17, 9}, timestamp=17.04, duration=0.7200000000000024, confidence=0.88, substring=regularly, alternativeSubstrings=(\n), phoneSequence=r EH g y uh l ur l ee, ipaPhoneSequence=\U027b.\U02c8\U025b.g.j.\U0259.l.\U0259 \U027b.l.i, voiceAnalytics=(null)"
), speakingRate=0.000000, averagePauseDuration=0.000000
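For what it's worth, within this single result the segments do chain cleanly: 15.96 + 0.15 = 16.11 (start of "plan"), 16.41 + 0.33 = 16.74 (start of "time"), and the phrase ends at 17.04 + 0.72 = 17.76. A quick check of that, with the numbers copied from the dump above:

import Foundation

// Sanity check against result #1 above: each word's start should equal
// the previous word's start plus its duration (values copied from the dump).
let starts:    [TimeInterval] = [15.96, 16.11, 16.41, 16.74, 17.04]
let durations: [TimeInterval] = [0.15, 0.30, 0.33, 0.30, 0.72]
for i in 0..<starts.count - 1 {
    assert(abs(starts[i] + durations[i] - starts[i + 1]) < 0.001)
}
let phraseEnd = starts.last! + durations.last! // 17.76s, end of "regularly"
print("Phrase #1 spans \(starts[0])s to \(phraseEnd)s")

It is only from the second recognized phrase onward that this running arithmetic stops matching where the phrase actually sits in the file.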