Hi Quinn,
My question concerns background CPU and network usage. I would like my app to at least:
Continuously perform voice activity detection (this does seem to work with a basic VAD algorithm, sketched after this list; and I imagine streaming apps are doing more work decoding audio than this anyway).
Send voice to a server for processing.
Receive and store (with minimal processing) JSON responses.
Play back synthesized voice.
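For reference, the VAD I have running is roughly this kind of energy-threshold check over microphone buffers. A minimal sketch; the threshold and buffer size are placeholders, and it assumes an audio session already configured for recording:

import AVFoundation

// Minimal energy-threshold VAD, roughly what I have now. The RMS threshold
// is a placeholder; real code would calibrate it against ambient noise.
final class SimpleVAD {
    private let engine = AVAudioEngine()

    // Calls onSpeech for every buffer whose energy crosses the threshold.
    func start(threshold: Float = 0.02,
               onSpeech: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            guard let samples = buffer.floatChannelData?[0],
                  buffer.frameLength > 0 else { return }
            let n = Int(buffer.frameLength)
            var sum: Float = 0
            for i in 0..<n { sum += samples[i] * samples[i] }  // sum of squares
            let rms = (sum / Float(n)).squareRoot()            // RMS energy
            if rms > threshold {
                onSpeech(buffer)  // voiced: hand off for recognition/upload
            }
        }
        try engine.start()
    }
}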
Ideally, rather than sending voice to the server, I'd like to perform Siri speech-to-text transcription and speech synthesis on the way back, allowing me to upload only text and receive text responses.
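Concretely, the text-only round trip I'm picturing looks something like this. SFSpeechRecognizer and requiresOnDeviceRecognition are real API; sendToServer is a hypothetical placeholder for my REST call, and speech-recognition permission is assumed to be granted already:

import Speech
import AVFoundation

// Sketch of the text-only round trip. sendToServer is a hypothetical
// placeholder for the REST call; everything else is real API.
final class TextPipeline {
    private let synthesizer = AVSpeechSynthesizer()  // strong reference so playback isn't cut off
    private var task: SFSpeechRecognitionTask?

    func run(audioFileURL: URL) {
        guard let recognizer = SFSpeechRecognizer(),
              recognizer.supportsOnDeviceRecognition else { return }
        let request = SFSpeechURLRecognitionRequest(url: audioFileURL)
        request.requiresOnDeviceRecognition = true  // audio never leaves the device
        task = recognizer.recognitionTask(with: request) { [weak self] result, _ in
            guard let result, result.isFinal else { return }
            let transcript = result.bestTranscription.formattedString
            // sendToServer(transcript) { reply in ... }  // hypothetical REST call
            let reply = transcript                        // stand-in for the server's text reply
            self?.synthesizer.speak(AVSpeechUtterance(string: reply))
        }
    }
}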
My understanding is there are some limitations on CPU usage for at least some of these cases. However, I imagine that audio streaming apps (YouTube, Spotify, etc.) must be doing a fair bit of decoding work themselves?
Thank you,
-- B.
Thanks, Quinn, that is incredibly helpful!
Re: Item 1, I currently have my own VAD but was thinking of just using Siri speech-to-text as well. Will test it out. CPU limits are definitely a concern, but I can test and see what happens. It becomes a CPU vs. network trade-off (the network path has a stable CPU cost). On-device voice transcription is highly desirable from an economic perspective because doing this on the server is costly (not to mention the user's data-plan bandwidth caps).
Did not know about the constrained vs. expensive network flags. Very helpful.
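For anyone following along, checking those flags at runtime looks roughly like this with NWPathMonitor; a minimal sketch where the branch bodies are placeholders for my two upload strategies:

import Network

// Watch whether the current path is expensive (cellular/hotspot) or
// constrained (Low Data Mode) and pick the upload strategy accordingly.
let monitor = NWPathMonitor()
monitor.pathUpdateHandler = { path in
    if path.isExpensive || path.isConstrained {
        // Transcribe on device and upload text only.
    } else {
        // Uploading raw audio is acceptable here.
    }
}
monitor.start(queue: .main)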
Another related question (though maybe I should start a new thread on this?):
Background Bluetooth mode (separate but related project): apps can receive Bluetooth events in the background, but do similar constraints apply? That is, can I safely perform a REST API request and be confident that I will have time to process the response?
Specific use case (sketched in code after the list):
1. Receive an audio sample from a Bluetooth peripheral (not headphones, nor anything that can present itself as such).
2. Upload the audio to a voice-to-text API (or use Siri speech-to-text).
3. Receive the result of [2].
4. Hit a REST service with the text obtained from [2].
5. Receive the result of [4].
6. Send the result of [4] (just some text data) back to the peripheral.
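To make the question concrete, the shape I have in mind is to wrap the whole sequence in a UIKit background task. beginBackgroundTask is real API; processAndReply is a hypothetical placeholder for steps 2 through 6, and whether this actually buys enough time is exactly my question:

import UIKit

// Hypothetical placeholder for steps 2-6 (transcribe, REST call, reply).
func processAndReply(_ data: Data, completion: @escaping () -> Void) { completion() }

// Wrap the BLE-event → transcribe → REST → reply sequence in a background
// task so the system knows work is in flight.
func didReceiveSample(_ data: Data) {
    var taskID = UIBackgroundTaskIdentifier.invalid
    taskID = UIApplication.shared.beginBackgroundTask(withName: "ProcessSample") {
        // Expiration handler: the system is reclaiming our time; clean up.
        UIApplication.shared.endBackgroundTask(taskID)
        taskID = .invalid
    }
    processAndReply(data) {
        UIApplication.shared.endBackgroundTask(taskID)
        taskID = .invalid
    }
}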
Oh gosh, I forgot to say explicitly (although I did tag it) that this is iOS. Just realized that RAM disks aren't an option there. There really isn't any API for producing asset files in memory?
I was able to eliminate these errors by configuring a custom URLSession for background mode; however, there is still an issue when I attempt to throw SFSpeech into the mix to perform a voice transcription request before uploading its results to the server. I get this error: "Lost connection to background transfer service".
This post mentions that dataTask isn't supposed to work with a background configuration, but for me it does. I don't think I can use downloadTask because I'm making a POST request.
Why would using SFSpeech (and therefore starting the dataTask within one of its delegate calls) cause this issue?
Once again the sequence of events is:
1. Bluetooth data received in background mode.
2. SFSpeech kicked off to convert speech to text.
3. POST request fired on a background URLSession in an SFSpeech delegate handler.
Here is how I configure the URLSession:
// Background session so the transfer can continue after the app suspends.
let configuration = URLSessionConfiguration.background(withIdentifier: "ChatGPT")
configuration.isDiscretionary = false                     // start transfers immediately; don't let the system defer them
configuration.shouldUseExtendedBackgroundIdleMode = true  // keep connections open while idle in the background
configuration.sessionSendsLaunchEvents = true             // relaunch the app in the background when transfers complete
configuration.allowsConstrainedNetworkAccess = true       // allow Low Data Mode paths
configuration.allowsExpensiveNetworkAccess = true         // allow cellular / personal hotspot paths
_session = URLSession(configuration: configuration, delegate: self, delegateQueue: nil)
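In case it's relevant, the alternative I'm experimenting with is the documented route: since background sessions are only supposed to run upload and download tasks, write the POST body to a file and use uploadTask(with:fromFile:) instead of dataTask. A minimal sketch; the URL and body are placeholders, and _session is the session configured above:

// Write the POST body to a file and hand it to an upload task, which
// background sessions do support (unlike data tasks).
let body = Data("{}".utf8)  // placeholder JSON body
let bodyURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("request-body.json")
try? body.write(to: bodyURL)

var request = URLRequest(url: URL(string: "https://example.com/transcripts")!)  // placeholder URL
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")

_session.uploadTask(with: request, fromFile: bodyURL).resume()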
Thank you,
-- B.
Did you ever figure this out? I’d like to transfer much larger files (hundreds of MB total) using this approach.
Thank you. One follow-up question in light of this:
Is user-initiated long-term recording a supported use case that's permissible on the App Store? I ask because the documentation for .playAndRecord and the audio background mode focuses almost exclusively on audio playback, with the recording capability seemingly implied to exist only to support communication.
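For reference, the session configuration I'm asking about is just the standard one; a minimal sketch, where the category options are my own choices:

import AVFoundation

// Standard long-form record + playback configuration. Also requires the
// `audio` background mode and an NSMicrophoneUsageDescription entry.
func configureSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .default,
                            options: [.allowBluetooth, .defaultToSpeaker])
    try session.setActive(true)
}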