I'm investigating some reported connectivity issues from users of our iOS app. The app is failing to load/refresh data for multiple minutes at a time, despite other apps, Safari, etc. working fine. We are investigating this at various layers in the pipeline (i.e. server-side, network), but I'm taking a look at what we know or can find out on the client.
I'm sure a lot of these kinds of issues get posted which end up being caused by transient network issues, but I have a few specific questions about how I can find out more, and some questions around the behaviour of URLSession, so hopefully there are some people qualified to answer (BatEskimo-signal activated).
packet trace or gtfo 🙂
Unfortunately we've been unable to get a network/packet trace as this requires reproducing the issue locally which we haven't been able to do.
Device logs show what look like typical timeout errors, occurring 60 seconds after initiating the requests:
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out." UserInfo={NSUnderlyingError=0x280764ae0 {Error Domain=kCFErrorDomainCFNetwork Code=-1001 "(null)" UserInfo={_kCFStreamErrorCodeKey=-2102, _kCFStreamErrorDomainKey=4}}, NSErrorFailingURLStringKey=https://REDACTED, NSErrorFailingURLKey=https://REDACTED, _kCFStreamErrorDomainKey=4, _kCFStreamErrorCodeKey=-2102, NSLocalizedDescription=The request timed out.})
The app is trying to send multiple requests over the minutes things are not working, and they are all failing with the same timeout error after 60 seconds. We've had users give up after 5 minutes because nothing is working. This is despite them having a good cellular or wifi connection and with other apps and Safari working. The users who have reported this so far are on iOS 12.
We are using a single URLSession for all our requests, created when the app starts. It's pretty vanilla: the config is a URLSessionConfiguration.default but with a custom User-Agent to override the default one, added via httpAdditionalHeaders. All our requests hit the same https hostname, and they are all POSTs.
Now the interesting part is that we have a separate health check request we send occasionally which sends a POST to exactly the same end point as normal requests, and we are seeing this succeed during the periods when regular requests are timing out. One difference with the ping check is that we use a different URLSession instance on the client. This URLSession is also created on startup, and uses the same configuration. The only difference is a delegate that we use to do some cert pinning and report any certificate mismatch from what we expect.
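For reference, here is a minimal sketch of the two-session setup described above. The User-Agent value and the names makeConfiguration, mainSession, healthCheckSession and PinningDelegate are placeholders for illustration, not our real code, and the pinning logic itself is left out.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // needed for URLSession on non-Apple platforms
#endif

// Hypothetical User-Agent value; the real one is not shown in this post.
let userAgent = "ExampleApp/1.0"

// Builds the shared configuration: the default one plus a custom User-Agent.
func makeConfiguration() -> URLSessionConfiguration {
    let config = URLSessionConfiguration.default
    config.httpAdditionalHeaders = ["User-Agent": userAgent]
    return config
}

// Session used for all regular requests; no delegate.
let mainSession = URLSession(configuration: makeConfiguration())

// Stand-in for the delegate that does cert pinning (implementation omitted).
final class PinningDelegate: NSObject, URLSessionDelegate {}

// Separate session used only for the health check; same configuration,
// but with the pinning delegate attached.
let healthCheckSession = URLSession(configuration: makeConfiguration(),
                                    delegate: PinningDelegate(),
                                    delegateQueue: nil)
```

The point of the sketch is only that the two sessions are separate instances, so each would be expected to manage its own connections.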
We do have DNS load balancing on our end point, so different connections can end up hitting a different IP.
So there are a few initial thoughts and questions I have:
- The failing requests could be going to a different IP than the successful health check ones, and a specific server could be bad in some way. Is there a way to log the resolved IP address that a particular URLSession task used, at the point of receiving the error? Googling and looking in the docs doesn't show an obvious way to get this information. I imagine since URLSession can maintain a pool of connections to the same host, and there can be redirects during a request, that this is difficult to expose "nicely" via the API. We can obviously do this with local profiling but we would like to add telemetry to gather this data in the wild if possible.
- Is it possible the "bad" URLSession is reusing a stale/dead persistent (keep-alive) connection, and everything on that socket is just timing out? What is the behaviour of connection reuse in these situations and under what circumstances will URLSession open a new connection? How long will it reuse a connection for? Will it continue reusing a connection even when requests are failing with timeout errors, even for multiple minutes?
- Is there a way to log exactly where in the lifetime of the request the URLSession task got to before it timed out? i.e. did it even resolve DNS? Did it connect at all? Did it finish the TLS handshake? Did it send headers? Did it receive anything at all? There is the NSURLSessionTaskMetrics API but it doesn't look like there's an easy way to correlate an event from urlSession(_ session: URLSession, task: URLSessionTask, didFinishCollecting metrics: URLSessionTaskMetrics) to a particular data task / request, so we'd have to log everything (maybe checking if response is null to detect an incomplete load) and correlate later.
- Some docs (e.g. "Technical Q&A QA1941" which I won't link because this post will be put in a moderator queue) talk about some retry behaviour in URLSession for idempotent (e.g. GET) vs. non-idempotent (e.g. POST) requests, at least for "The network connection was lost" errors. Is there a similar or related behaviour for timeouts, or when a connection looks dead? If this is some transient network issue, would GET requests behave better in such situations when stuff is timing out? There are reasons we are using POST but it would be interesting to know more about how differently idempotent requests are treated.
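To make the telemetry questions above more concrete, here is a rough sketch of the kind of logging we have in mind. It is an assumption about how this could be wired up, not something we have shipped: MetricsLogger and furthestPhase are made-up names, and print stands in for whatever telemetry sink is used. The delegate callback does receive the task, so task.taskIdentifier might be enough to correlate. The remoteAddress property on the transaction metrics only exists on iOS 13 and later, so it would not help the iOS 12 users mentioned earlier.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // needed for URLSession on non-Apple platforms
#endif

/// Names the furthest phase a transaction reached, judged by which
/// timestamps were recorded. Hypothetical helper for illustration.
func furthestPhase(dnsEnd: Date?, connectEnd: Date?, tlsEnd: Date?,
                   requestEnd: Date?, responseStart: Date?) -> String {
    if responseStart != nil { return "response started" }
    if requestEnd != nil { return "request sent" }
    if tlsEnd != nil { return "TLS handshake done" }
    if connectEnd != nil { return "connected" }
    if dnsEnd != nil { return "DNS resolved" }
    return "nothing completed"
}

/// Hypothetical session delegate that logs one line per transaction.
final class MetricsLogger: NSObject, URLSessionTaskDelegate {
    func urlSession(_ session: URLSession, task: URLSessionTask,
                    didFinishCollecting metrics: URLSessionTaskMetrics) {
        for t in metrics.transactionMetrics {
            let phase = furthestPhase(dnsEnd: t.domainLookupEndDate,
                                      connectEnd: t.connectEndDate,
                                      tlsEnd: t.secureConnectionEndDate,
                                      requestEnd: t.requestEndDate,
                                      responseStart: t.responseStartDate)
            // taskIdentifier ties this line back to the task that was started.
            var line = "task=\(task.taskIdentifier) "
                + "reused=\(t.isReusedConnection) phase=\(phase)"
            #if canImport(Darwin)
            // The resolved remote IP is only exposed on iOS 13 / macOS 10.15 and later.
            if #available(iOS 13.0, macOS 10.15, tvOS 13.0, watchOS 6.0, *) {
                line += " remote=\(t.remoteAddress ?? "unknown")"
            }
            #endif
            print(line)  // stand-in for a real telemetry call
        }
    }
}
```

One caveat with this approach: on a reused connection the DNS, connect and TLS timestamps are not populated, so "nothing completed" together with reused=true would mean the request never went out on an existing connection, rather than that DNS failed.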
Thanks in advance