CallKit breaks web based MediaStreams

Question

Created Nov ’24

We're integrating a web based group calling application within a native iOS application and finding that every time a CallKit session gets fully established the web based media streams break, rendering as gray with no audio.

Up to iOS 18 we worked around it by not fulfilling the call start action but that's no longer an option as the audio stopped getting automatically redirected to the speakers. We would now need the CXProvider's didActivateAudioSession callback but that would break the video.

The sample project loads up a simple webpage in a WKWebView which contains a video tag streaming the media from the device's camera. At the same time it sets up a new CallKit session by requesting and fulfilling a CXStartCallAction transaction.

You will notice that the media doesn't render and, if you follow the warnings we left in the code, you will find that not fulfilling the CXStartCallAction fixes it.

Unfortunately, that's not a workaround we can use, as we need the CXProvider delegate to inform us about audio session changes so we can redirect the audio to the speaker (so the proximity sensor doesn't activate and locking the screen doesn't end the call).
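For readers without the sample project, the CallKit half of the setup described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original sample: the class name `CallSession` and the handle value are hypothetical, and the WKWebView half is omitted.

```swift
import AVFoundation
import CallKit

// Hypothetical reconstruction of the CallKit side of the sample project.
// `CallSession` and the "group-call" handle value are illustrative names only.
final class CallSession: NSObject, CXProviderDelegate {
    private let provider = CXProvider(configuration: CXProviderConfiguration())
    private let callController = CXCallController()

    override init() {
        super.init()
        // Passing nil delivers delegate callbacks on the main queue.
        provider.setDelegate(self, queue: nil)
    }

    // Requests an outgoing call; CallKit then asks the delegate to perform it.
    func startCall() {
        let handle = CXHandle(type: .generic, value: "group-call")
        let action = CXStartCallAction(call: UUID(), handle: handle)
        callController.request(CXTransaction(action: action)) { error in
            if let error { print("Start call request failed: \(error)") }
        }
    }

    func providerDidReset(_ provider: CXProvider) {}

    func provider(_ provider: CXProvider, perform action: CXStartCallAction) {
        // Fulfilling the action is what lets CallKit activate its audio session.
        // In the scenario described above, this is where the web stream breaks.
        action.fulfill()
    }

    func provider(_ provider: CXProvider, didActivate audioSession: AVAudioSession) {
        // The callback the question depends on: route audio to the speaker
        // once CallKit has activated the session.
        do {
            try audioSession.overrideOutputAudioPort(.speaker)
        } catch {
            print("Speaker override failed: \(error)")
        }
    }
}
```

Skipping `action.fulfill()` is the workaround mentioned above; in that case `provider(_:didActivate:)` is never called.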

Any insights or workarounds would be greatly appreciated.

Answer 1

DTS Engineer OP

Apple

Nov ’24

Accepted Answer

Any insights or workarounds would be greatly appreciated.

Unfortunately, my answer here is that I think WKWebView and CallKit are architecturally incompatible and, to the extent anything works, that success is effectively "accidental", often as a side effect of incorrectly using one of the APIs. With each of the APIs, there are two fundamental conflicts in their design:

WKWebView

1. WKWebView was designed primarily as a foreground API and has never really integrated background operation as an intended use. Note that Picture in Picture is a form of foreground usage.
2. WKWebView's out-of-process rendering system means that the audio playback is actually occurring in a secondary process with its own audio session.

CallKit/PushKit

1. CallKit/PushKit are specifically designed as "background" APIs. Their entire purpose is to wake your app for incoming calls in the background, which means they can launch your app into the background at ANY time, even in the most secure device state ("Prior to first unlock").
2. This isn't obvious from a surface API read, but CallKit is an audio API (just a very specialized one). It has specific requirements about audio session configuration (like configuring the session before reporting the call) and session activation (don't activate the session yourself) because what CallKit actually does is modify your audio session into a specialized configuration that is different from the standard PlayAndRecord session.
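As a minimal sketch of the two requirements in point 2 (configure before reporting, never activate yourself), assuming a provider that already exists elsewhere in the app. The function name and parameters are hypothetical, and the category/mode choice is a common VoIP configuration rather than something this thread prescribes.

```swift
import AVFoundation
import CallKit

// Hypothetical helper: `provider`, `callID` and `callerName` come from the app.
func reportIncomingCall(using provider: CXProvider, callID: UUID, callerName: String) {
    // 1. Configure the audio session BEFORE reporting the call.
    let session = AVAudioSession.sharedInstance()
    do {
        try session.setCategory(.playAndRecord, mode: .voiceChat, options: [])
    } catch {
        print("Audio session configuration failed: \(error)")
    }
    // 2. Deliberately no setActive(true) here. CallKit activates the session
    //    and reports it through provider(_:didActivate:).

    let update = CXCallUpdate()
    update.remoteHandle = CXHandle(type: .generic, value: callerName)
    update.hasVideo = true

    provider.reportNewIncomingCall(with: callID, update: update) { error in
        if let error { print("Reporting the call failed: \(error)") }
    }
}
```

Audio I/O would then start in `provider(_:didActivate:)` and stop in `provider(_:didDeactivate:)`.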

The problem here is that the conflict between these two architectures will basically create a nearly endless stream of failures. For example, receiving calls in the background is "standard" voip functionality, however:

1. In my experience, it's difficult to get WKWebView into a fully functional state from a background launch.
2. If you manage to get past that point, WKWebView shouldn't be able to activate a PlayAndRecord session from the background, as that capability is specifically restricted to CallKit (and the PTT framework).
3. If you manage to do that anyway, it's typically because you distorted CallKit's audio session configuration in a way that means it's not ACTUALLY a correctly configured call session. That creates other weird side effects like interruption issues and/or a lower max volume.

However, the worst part of all this is that because of how the development process interacts with our background APIs, the typical experience of developers who try to get this working goes something like this:

1. An initial prototype is built and some basic experimentation is done. The approach seems promising except for <some details>.
2. Further testing and experimentation continue but it never seems to QUITE work the way you'd expect.

What's happening here is that #1 is almost always either focused entirely on the foreground and/or tested through the debugger, both of which distort the app's behavior in ways that allow things to work that would otherwise fail. For example, WKWebView cannot activate a PlayAndRecord session in the background, but it can when your app is in the foreground, assuming CallKit isn't already active.

In any case, the assumption here is that if you can JUST sort out <some detail> everything will work fine when, in fact, the opposite is true. Foreground operation is the easy part; background operation is where everything really starts to fall apart.

Moving to the specific issue you described here:

CallKit session gets fully established the web based media streams break, rendering as gray with no audio.

Yes. This is a DIRECT result of #2. CallKit activated its own audio session inside your app, which interrupted the audio session of your WebView, just like it would interrupt Music.app or Voice Memos.

You then said:

... Up to iOS 18 we worked around it by not fulfilling the call start action

Failing to fulfill the start action is functionally the same as not using CallKit at all. The CallKit audio session never activated, so you're not actually in a functioning CallKit call. Delaying the fulfill basically leaves the call in a half-complete state.

Unfortunately, you can't simply leave the call in this state. Every CallKit action has a timeout, after which the action will automatically fail. CXStartCallAction has one of the longest (600s), but this approach has always meant that your "call" could never be longer than 10 min.
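To observe that timeout in practice, a delegate along these lines could log when a pending action expires. This is a sketch under assumptions: the class name is hypothetical, and the 600s figure comes from the answer above rather than from anything the code checks.

```swift
import CallKit

// Hypothetical delegate that leaves the start action pending and logs its expiry.
final class TimeoutLoggingDelegate: NSObject, CXProviderDelegate {
    func providerDidReset(_ provider: CXProvider) {}

    func provider(_ provider: CXProvider, perform action: CXStartCallAction) {
        // Neither fulfill() nor fail() is called, so the action stays pending
        // until CallKit times it out. The deadline is visible up front.
        print("Start action pending until \(action.timeoutDate)")
    }

    func provider(_ provider: CXProvider, timedOutPerforming action: CXAction) {
        // Called once the deadline passes; CallKit fails the action at this point.
        print("\(type(of: action)) timed out (deadline was \(action.timeoutDate))")
    }
}
```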

In any case, here is the way I'd summarize all this:

1. If you intend to support receiving calls from the background, then you need CallKit and you can't/shouldn't really use WKWebView. It just isn't going to work.
2. If you only intend to support "foreground" calling (meaning, the call always starts when the app is in the foreground), then you don't need CallKit. Just use WKWebView and the "audio"* background category.

*One somewhat subtle point about voip apps is that the "voip" background category is NOT what keeps voip apps awake on calls; the "audio" background category is.

Note that "call notification" for #2 can be implemented without CallKit by using standard high priority alert pushes for call notification.
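On the app side, opting in to standard alert pushes could look roughly like the sketch below. The function name is hypothetical. The server-side details in the comments are the standard APNs request headers for a high priority alert push, not something spelled out in this thread.

```swift
import UIKit
import UserNotifications

// Hypothetical helper: ask for permission to show alerts, then register for
// standard (non-PushKit) remote notifications.
func registerForCallAlerts() {
    let center = UNUserNotificationCenter.current()
    center.requestAuthorization(options: [.alert, .sound]) { granted, error in
        if let error { print("Notification authorization failed: \(error)") }
        guard granted else { return }
        // The completion handler may run off the main thread, so hop back to
        // the main queue before touching UIApplication.
        DispatchQueue.main.async {
            UIApplication.shared.registerForRemoteNotifications()
        }
    }
}

// Server side (assumption, shown for context only): send the push with the
// headers `apns-push-type: alert` and `apns-priority: 10`.
```

The user then taps the alert to bring the app to the foreground, which is the point at which the WKWebView call can start.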

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 2

stefanc OP

Nov ’24

Hi Kevin,

Thank you very much for taking the time to write such a clear and detailed answer!

It does indeed match my experience with it, which is quite unfortunate, but hopefully this thread will help future people struggling with the same thing.

I believe there's also a 3rd option in which we keep the native CallKit ringing and redirect to the main app but end the call before presenting the web app. It worked pretty well for us so far but I'm going to have to test it further.
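A rough sketch of that third option, assuming the app already holds the call's UUID and a CXCallController. The function and parameter names are hypothetical.

```swift
import CallKit

// Hypothetical helper: end the CallKit call, then hand over to the web app.
func endCallKitCall(_ callID: UUID,
                    using controller: CXCallController,
                    then presentWebApp: @escaping () -> Void) {
    let transaction = CXTransaction(action: CXEndCallAction(call: callID))
    controller.request(transaction) { error in
        if let error {
            print("End call request failed: \(error)")
            return
        }
        // A nil error only means the request was accepted. If the web view
        // still fails to capture audio, waiting for the provider delegate's
        // provider(_:didDeactivate:) callback may be the safer signal.
        DispatchQueue.main.async(execute: presentWebApp)
    }
}
```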

Answer 3

DTS Engineer OP

Apple

Nov ’24

I believe there's also a 3rd option in which we keep the native CallKit ringing and redirect to the main app but end the call before presenting the web app. It worked pretty well for us so far but I'm going to have to test it further.

Sure, that's potentially viable. A few other points/options here:

1. It may not be feasible, but the "best" design here would probably be for your app to have an audio-only stream that it can then play through CallKit. That does have other issues (for example, synchronizing between the CallKit audio and the web video), but it also means you can fully support background calling.
2. I have no idea how well this would work, but you might be able to mute/hold the call (once you're foregrounded) instead of just ending it. The underlying issue here is that the phone call session has a higher session priority than a standard audio session. Making that concrete, if app one has a recording session active and the foreground app activates a recording session, then app one is interrupted and the foreground app activates. However, if app one has a CallKit session and the foreground app attempts to activate PlayAndRecord, then the foreground app's activation will fail. In any case, it's possible that muting/holding the CallKit call would allow your WebView to activate its own PlayAndRecord session.
3. Keep in mind that the main issue with NOT using CallKit is that ANY incoming call will IMMEDIATELY interrupt your audio session. That will both cut off audio and force your app to suspend "shortly" after the interruption. In practical terms, it means that just receiving (not even answering) a call can force your app to "hang up". This was actually the single biggest bug/issue that led to the creation of CallKit. In any case, if you're unable to keep a call active then this is an issue you need to keep in mind and design around. For example, you could use CXCallObserver to monitor system-wide call activity so that you're at least aware that the issue is happening.
4. The other option is to drop PushKit/CallKit entirely. Because of the different audiences* the documentation was written for, the overall impression is that PushKit is in some way more reliable than basic APNS, but that impression is simply wrong and always has been. More specifically, the delivery priority of a high priority standard alert push is basically "deliver this payload to the device at the earliest possible moment", so there isn't really any "faster" option. A voip push has the same delivery behavior and is simply routed to PushKit instead of going through the standard push system once it reaches the device. We actually rely on the fact that they are equivalent in the architecture we designed for end-to-end encrypted calling, so this equivalence in performance is definitely not accidental. Frankly, PushKit exists because our practical experience with APNS showed that push worked really well under real-world conditions and our experience with voip sockets showed that they didn't work very well.
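The CXCallObserver suggestion in point 3 could be sketched like this. The class name and the callback property are hypothetical, and how the app reacts (pausing the web call, warning the user) is left to the caller.

```swift
import CallKit

// Hypothetical monitor that reports system-wide call activity to the app.
final class SystemCallMonitor: NSObject, CXCallObserverDelegate {
    private let observer = CXCallObserver()

    // Invoked whenever a call that has not ended changes state.
    var onCallActivity: ((CXCall) -> Void)?

    override init() {
        super.init()
        // Passing nil delivers delegate callbacks on the main queue.
        observer.setDelegate(self, queue: nil)
    }

    // True while any call known to the system has not ended yet.
    var hasActiveCalls: Bool {
        observer.calls.contains { !$0.hasEnded }
    }

    func callObserver(_ callObserver: CXCallObserver, callChanged call: CXCall) {
        // An incoming call is reported here before it is answered, which is
        // also when a non-CallKit audio session would be interrupted.
        if !call.hasEnded {
            onCallActivity?(call)
        }
    }
}
```

This only makes the app aware of the interruption; it does not prevent it.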

*For the curious, the issue here is caused by different sections of our documentation using the words "unreliable/reliable" to refer to totally different issues. The "core" APNS documentation is very old (iOS 3?), written when the concept of mobile development/networking was still very new, so there was a major concern about developers assuming that every push would reach the app. In that context, "unreliable" meant something like "you cannot assume every push will always reach the device because of... (obvious reasons like no network/device off/etc)".

On the other hand, the PushKit/voip push documentation is both much more recent (iOS 8) and written for voip app developers who have (presumably) lots of experience with mobile networking. In that context, "reliable" means something like "the push will consistently reach its target "quickly", assuming that the device is reachable at all and barring any other external factors that would delay the push".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 4

stefanc OP

Nov ’24

Great insights again, thank you!

Holding the call was a great idea but unfortunately it doesn't seem to fix anything on our side.

At this point I guess we're just going to cut our losses and live with what we have until we can implement the whole thing natively.

0