9 Replies
      Latest reply on Dec 2, 2019 1:55 PM by Ragr
      Userbla Level 1 Level 1 (0 points)

        I'm investigating some reported connectivity issues from users of our iOS app. The app is failing to load/refresh data for multiple minutes at a time, despite other apps, Safari, etc. working fine. We are investigating this at various layers in the pipeline (i.e. server-side, network), but I'm taking a look at what we know or can find out on the client.

         

        I'm sure a lot of these kinds of issues get posted which end up being caused by transient network issues but I have a few specific questions about how I can find out more, and some questions around the behaviour of URLSession, so hopefully there are some people qualified to answer (BatEskimo-signal activated).

         

        packet trace or gtfo

         

        Unfortunately we've been unable to get a network/packet trace as this requires reproducing the issue locally which we haven't been able to do.

         

        Device logs show what look like typical timeout errors, occurring 60 seconds after initiating the requests:

         

        Error Domain=NSURLErrorDomain Code=-1001 "The request timed out." UserInfo={NSUnderlyingError=0x280764ae0 {Error Domain=kCFErrorDomainCFNetwork Code=-1001 "(null)" UserInfo={_kCFStreamErrorCodeKey=-2102, _kCFStreamErrorDomainKey=4}}, NSErrorFailingURLStringKey=https://REDACTED, NSErrorFailingURLKey=https://REDACTED, _kCFStreamErrorDomainKey=4, _kCFStreamErrorCodeKey=-2102, NSLocalizedDescription=The request timed out.})

         

        The app is trying to send multiple requests over the minutes things are not working, and they are all failing with the same timeout error after 60 seconds. We've had users give up after 5 minutes because nothing is working. This is despite them having a good cellular or wifi connection and with other apps and Safari working. The users who have reported this so far are on iOS 12.

         

        We are using a single URLSession for all our requests, created when the app starts. It's pretty vanilla: the config is a URLSessionConfiguration.default but with a custom User-Agent to override the default one, added via httpAdditionalHeaders. All our requests hit the same https hostname, and they are all POSTs.

         

        Now the interesting part is that we have a separate health check request we send occasionally which sends a POST to exactly the same end point as normal requests, and we are seeing this succeed during the periods when regular requests are timing out. One difference with the ping check is that we use a different URLSession instance on the client. This URLSession is also created on startup, and uses the same configuration. The only difference is a delegate that we use to do some cert pinning and report any certificate mismatch from what we expect.

         

        We do have DNS load balancing on our end point, so different connections can end up hitting a different IP.

         

        So there are a few initial thoughts and questions I have:

         

        • The failing requests could be going to a different IP than the successful health check ones, and a specific server could be bad in some way. Is there a way to log the resolved IP address that a particular URLSession task used, at the point of receiving the error? Googling and looking in the docs doesn't show an obvious way to get this information. I imagine since URLSession can maintain a pool of connections to the same host, and there can be redirects during a request, that this is difficult to expose "nicely" via the API. We can obviously do this with local profiling but we would like to add telemetry to gather this data in the wild if possible.
        • Is it possible the "bad" URLSession is reusing a stale/dead persistent (keep-alive) connection, and everything on that socket is just timing out? What is the behaviour of connection reuse in these situations and under what circumstances will URLSession open a new connection? How long will it reuse a connection for? Will it continue reusing a connection even when requests are failing with timeout errors, even for multiple minutes?
        • Is there a way to log exactly where in the lifetime of the request the URLSession task got to before it timed out? i.e. did it even resolve DNS? Did it connect at all? Did it finish the TLS handshake? Did it send headers? Did it receive anything at all? There is the NSURLSessionTaskMetrics API but it doesn't look like there's an easy way to correlate an event from urlSession(_ session: URLSession, task: URLSessionTask, didFinishCollecting metrics: URLSessionTaskMetrics) to a particular data task / request, so we'd have to log everything (maybe checking if response is null to detect an incomplete load) and correlate later.
        • Some docs (e.g. "Technical Q&A QA1941" which I won't link because this post will be put in a moderator queue) talk about some retry behaviour in URLSession for idempotent (e.g. GET) vs. non-idempotent (e.g. POST) requests, at least for "The network connection was lost" errors. Is there a similar or related behaviour for timeouts, or when a connection looks dead? If this is some transient network issue, would GET requests behave better in such situations when stuff is timing out? There are reasons we are using POST but it would be interesting to know more about how differently idempotent requests are treated

         

        Thanks in advance

        • Re: How to better diagnose -1001 "The request timed out." URLSession errors
          eskimo Apple Staff Apple Staff (12,425 points)

          In situations like this, where users are reporting issues from the field but you can’t reproduce them, the weapon of choice is a sysdiagnose log.  This includes a bunch of information from both CFNetwork and the underlying libnetcore (this is the internal name for the core implementation of the now-public Network framework).

          You can learn more about sysdiagnose logs on our Bug Reporting > Profiles and Logs page.

          Device logs show what look like typical timeout errors, occurring 60 seconds after initiating the requests:

          FYI, error -2102 is an internal error kCFStreamErrorRequestTimeout, which isn’t any more informative.

          Is there a way to log the resolved IP address that a particular URLSession task used, at the point of receiving the error?

          No.  IMO this would make a great addition to NSURLSessionTaskTransactionMetrics and if you agree I encourage you to file an enhancement request explaining your rationale.

          Please post your bug number, just for the record.

          I think you’ll be able to get this information out of a sysdiagnose log, but I’d have to look at the log in depth to be sure.

          Is it possible the "bad" URLSession is reusing a stale/dead persistent (keep-alive) connection, and everything on that socket is just timing out? What is the behaviour of connection reuse in these situations and under what circumstances will URLSession open a new connection?

          Is this HTTP/1.1 or HTTP 2?

          For HTTP 1.1 a timeout error on a specific request will close the connection used by that request because the nature of HTTP 1.1 means that it’s impossible to continue using that connection.

          Is there a way to log exactly where in the lifetime of the request the URLSession task got to before it timed out?

          I would’ve thought the task and transaction metrics would help here.

          There is the NSURLSessionTaskMetrics API but it doesn't look like there's an easy way to correlate an event from urlSession(_:task:didFinishCollecting:) to a particular data task / request, so we'd have to log everything (maybe checking if response is null to detect an incomplete load) and correlate later.

          I don’t understand this problem.  That delegate callback gets the task via the task parameter.  What more do you need for correlation purposes?

          Is there a similar or related behaviour for timeouts, or when a connection looks dead?

          I’m not sure.  I will say that whatever retry logic available won’t kick in for POST requests.  However, I don’t think that switching to GET makes sense.  Something is blocking your entire session, so that completely new requests are also failing.  In that situation it’s unlikely that an automatic retry would fair any better than the other requests you issue to your session.

          Share and Enjoy

          Quinn “The Eskimo!”
          Apple Developer Relations, Developer Technical Support, Core OS/Hardware
          let myEmail = "eskimo" + "1" + "@apple.com"

            • Re: How to better diagnose -1001 "The request timed out." URLSession errors
              Userbla Level 1 Level 1 (0 points)
              No.  IMO this would make a great addition to NSURLSessionTaskTransactionMetrics and if you agree I encourage you to file an enhancement request explaining your rationale.

              Will do

              Is this HTTP/1.1 or HTTP 2?

              HTTP 2

              I don’t understand this problem.  That delegate callback gets the task via the task parameter.  What more do you need for correlation purposes?

              Sorry, disregard my previous comments, it looks like it has everything we need.

                • Re: How to better diagnose -1001 "The request timed out." URLSession errors
                  eskimo Apple Staff Apple Staff (12,425 points)

                  HTTP 2

                  OK, that opens up a bunch of possibilities, in that the HTTP 2 connection might be open but there’s something broken with its internal state machine (either on the client or the server).  A CFNetwork diagnostic log is definitely the right next step here.

                  Share and Enjoy

                  Quinn “The Eskimo!”
                  Apple Developer Relations, Developer Technical Support, Core OS/Hardware
                  let myEmail = "eskimo" + "1" + "@apple.com"

                    • Re: How to better diagnose -1001 "The request timed out." URLSession errors
                      Userbla Level 1 Level 1 (0 points)

                      Thanks.

                       

                      45304558 for enhancements to NSURLSessionTaskMetrics / NSURLSessionTaskTransactionMetrics

                      • Re: How to better diagnose -1001 "The request timed out." URLSession errors
                        Userbla Level 1 Level 1 (0 points)

                        A couple more followup questions:

                         

                        To get CFNetwork diagnostics information from devices in the wild, am I correct in thinking we need to first setenv("CFNETWORK_DIAGNOSTICS", "3", 1); in main() as per https://developer.apple.com/library/archive/qa/qa1887/_index.html (and probably put this behind some kind of remote flag or Settings.app setting to avoid doing it for all our users), then have the appropriate users install the profile etc. from https://developer.apple.com/bug-reporting/profiles-and-logs/?name=sysdiagnose&platform=ios and capture a sysdiagnose and send it to us?

                         

                        Without setting the env var will there be anything useful in a sysdiagnose, or is the CFNETWORK_DIAGNOSTICS where all/most of the interesting stuff comes from?

                         

                        Also, is there a way to capture the CFNetwork diagnostic os_log output from within our app/process, so we could include it directly into our application logs which we can upload automatically? Or can it only be accessed via Analytics->... / Console.app?

                          • Re: How to better diagnose -1001 "The request timed out." URLSession errors
                            eskimo Apple Staff Apple Staff (12,425 points)

                            Also, is there a way to capture the CFNetwork diagnostic os_log output from within our app/process …

                            No.  There are numerous difficulties in making that work [1] but I realise that it’d be a super-useful feature so don’t let that stop you from filing an enhancement request describing your requirements

                            Please post your bug number, just for the record.

                            To get CFNetwork diagnostics information from devices in the wild, am I correct in thinking we need to first setenv("CFNETWORK_DIAGNOSTICS", "3", 1); in main()

                            Correct.

                            and probably put this behind some kind of remote flag or Settings.app setting to avoid doing it for all our users

                            If you put this in a production build, make sure that the UI informs the user of the privacy risks involved in enabling it.

                            then have the appropriate users install the profile

                            That step is not necessary.

                            capture a sysdiagnose and send it to us?

                            Yes.

                            Without setting the env var will there be anything useful in a sysdiagnose … ?

                            Yes, although it’s hard to say whether that will help you debug this question.  This environment variable enables CFNetwork diagnostics, which lets you see what’s going on internal to CFNetwork.  If you don’t set it then you still get a bunch of logging.  On the networking front that includes extensive DNS logging from mDNSResponder and a bunch of lower-level network logging from libnetcore (the private library used by CFNetwork for its on-the-wire activities; this is the same library that backs the new Network framework).  That might help you understand this problem, or it might not.  It kinda depends on what the actual problem is, and we don’t know that yet.

                            Share and Enjoy

                            Quinn “The Eskimo!”
                            Apple Developer Relations, Developer Technical Support, Core OS/Hardware
                            let myEmail = "eskimo" + "1" + "@apple.com"

                            [1] One major problem is that we can’t give you all log messages on the system because of privacy concerns, but a lot of the value in OSLog is that it lets you track activity as it crosses processes.  For an example close to your heart, much of the internals of CFNetwork are implemented in separate daemons (DNS is done via mDNSResponder, authentication UI via CFNetworkAgent, cookies and caches via nsurlstoraged, background sessions via nsurlsessiond, and so on).

                    • Re: How to better diagnose -1001 "The request timed out." URLSession errors
                      sergeyne Level 1 Level 1 (0 points)

                      I have similar strange periodic timeouts in my app as Userbla

                      But what is different:

                      - I use NSURLSession's shared instance which means no custom delegate code is involved.

                      - HTTP/1.1

                      - Timeout happens for a specific domain/ip address

                      - App restart fixes the issue

                      - I am able to reproduce the issue locally. Although it is hard and may take up to 1-2 hours sometimes

                      - Captured packet trace and network logs and opened a bug 48359240

                      - Requests are sent OK using the same code but if I create a new instance of NSURLSession

                      - The issue happens when the app was in the background for more than 10-15 minutes(?)
                      - Can't reproduce the issue in a separate sample project, so not sure what exactly causes the issue

                        • Re: How to better diagnose -1001 "The request timed out." URLSession errors
                          Artsiom_K Level 1 Level 1 (0 points)

                          Hi sergeyne,

                          We've faced with a similar issue, had your bug is closed with some resolution?

                           

                          Use case:

                          • Make a switch between wifi and 3G/4G;
                          • Moving to the background and foreground;

                          At some point, the user sees time out error for all requests. Only restart of the app helps.

                           

                          eskimo,

                          Using "setenv("CFNETWORK_DIAGNOSTICS", "3", 1);"

                          I've gathered the following.

                          default 19:54:56.119246 +0300 **** CFNetwork Diagnostics [3:37959] 18:54:56.116 {
                           Protocol Enqueue: request GET https://***** HTTP/1.1
                                    Request:  {url = https://****, cs = 0x0}
                                    Message: GET https://**** HTTP/1.1
                                     Accept: application/json
                               Content-Type: application/json
                                X-Client-Id: 2.3.41 (18460)||Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148
                            header: ***
                            Accept-Language: en-us
                          header: ***
                            Accept-Encoding: br, gzip, deflate
                               header: ***
                          } [3:37959]
                          default 19:54:56.274545 +0300 bluetoothd Matched UUID *** for device ""
                          default 19:54:56.392140 +0300 *** TIC TCP Conn Cancel [518:0x282f57780]
                          default 19:54:56.392264 +0300 *** [C518 ***:443 tcp, url hash: e37a5a66, tls] cancel
                          default 19:54:56.392425 +0300 *** [C518 ***:443 tcp, url hash: e37a5a66, tls] cancelled

                          And then.

                          default 19:55:06.535110 +0300 *** CFNetwork Diagnostics [3:37960] 18:55:06.533 {
                                  Did Timeout: (null)
                                       Loader:  {url = https://***, cs = 0x0}
                             Timeout Interval: 60.000 seconds
                          init to origin load: 0.000351071s
                                   total time: 60.1877s
                                  total bytes: 0
                          } [3:37960]

                          Do you have any clues about what we should look for do understand the reason for such behavior?
                          Thanks in advance.

                        • Re: How to better diagnose -1001 "The request timed out." URLSession errors
                          Ragr Level 1 Level 1 (0 points)

                          Ok, I lost a lot of time investigeting similar issue.

                           

                          In my case the problem was strange (bad?) firewall on the server. It was banning device when there was many (not so many) requests in short period of time.

                           

                          I believe you can do easy test if you are facing similar issue.

                          1. Send a lot of (depends of firewall settings) requests in loop (let's say 50 in 1 second).

                          2. Close/Kill app (this will close connections to server)

                          3. (OPTIONAL) Wait a while (lets say 60 seconds)

                          4. Start app again and try send request

                           

                          If you now got timeout for all next requests, you probably have same issue and you should talk with sever guys.

                           

                          PS: if you don't have access to server you can give user info that he should restart wifi on device to quit that timeout loop. It could be last resort in some cases.