I'm using the Speech framework to transcribe a very long audio file (1h+), and I want to present partial results along the way.
What I've noticed is that SFSpeechRecognizer processes audio in batches.
The delivered SFTranscriptionSegment values have timestamp set to 0.0 most of the time, but the timestamps seem to get meaningful values at the end of a "batch". Once a batch is done, the next reported partial results no longer contain those segments; recognition starts delivering partial results from the next batch.
Note that everything I'm describing here concerns the case where SFSpeechRecognitionResult has isFinal set to false.
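For reference, here is roughly how I'm observing this, as a minimal sketch (audioFileURL is a placeholder for the 1h+ recording, and error handling is omitted):

```swift
import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechURLRecognitionRequest(url: audioFileURL)
request.shouldReportPartialResults = true

recognizer.recognitionTask(with: request) { result, error in
    guard let result = result, !result.isFinal else { return }
    for segment in result.bestTranscription.segments {
        // timestamp stays 0.0 for most callbacks, then flips to a
        // batch-relative value right before the batch's segments stop
        // appearing in subsequent partial results.
        print(segment.substring, segment.timestamp, segment.duration)
    }
}
```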
I found no mention of this behavior in the docs.
What's problematic for me is that segment timestamps in each batch are relative to the batch itself, not to the entire audio file. Because of that, it's impossible to determine a segment's absolute timestamp: we don't know the absolute start time of the batch.
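The only workaround I can think of is a fragile sketch built entirely on the undocumented behavior above: treat non-zero timestamps as the sign that a batch is closing, and accumulate my own running offset (batchOffset below is my bookkeeping, not an API value):

```swift
import Speech

var batchOffset: TimeInterval = 0

func handlePartialResult(_ result: SFSpeechRecognitionResult) {
    let segments = result.bestTranscription.segments
    // Heuristic, not documented: non-zero timestamps seem to mean
    // the current batch is about to be dropped from partial results.
    guard let last = segments.last, last.timestamp > 0 else { return }
    for segment in segments {
        // Batch-relative timestamp + running offset ≈ absolute position.
        print(segment.substring, batchOffset + segment.timestamp)
    }
    // Advance the offset by the end time of the closing batch.
    batchOffset += last.timestamp + last.duration
}
```

Even if this happens to work today, it hinges on behavior that isn't documented anywhere, which is why I'm asking: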
Is there an Apple engineer here who could shed some light on this behavior? Is there a way to get a meaningful segment timestamp from the partial-results callbacks?