Short spikes in timeouts when calling APNs

Hello, I would like to check with you on a possible APNs issue. We saw a huge spike in the number of failed requests to APNs (both api.push.apple.com and api.development.push.apple.com). On October 31st at 13:39, over a period of 1 to 2 minutes, more than 700k requests failed, which is more than 10% of all requests our service made to api.push.apple.com. Our service sends notification requests for multiple applications with different App IDs, at high volume.

Even more concerning, it now happens more or less regularly, 2-3 times a week. Each occurrence is still short, but I would like to check whether this is something APNs knows about.

Checking the graphs in CloudKit did not help much: their granularity is no finer than 1 day.

Please let me know if you are aware of any very short transient issues in APNs that happen more or less regularly and have a noticeable impact.

Just checking: it has been 8 days since the first occurrence, so what does "it is since happening regularly 2-3 times a week" mean? That it happened 3 times?

Please clarify. In any case, we need such an occurrence to be reported within 7 days of it happening, before our logs roll off. So when it happens next, please provide the time (with time zone) and as much detail as you can.

A sampling of apns-ids will be helpful.
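
One way to produce the time details requested above is to bucket client-side timeouts per minute in UTC, so each spike can be reported with an unambiguous timestamp. This is only a sketch: `spike_minutes` and its `threshold` parameter are hypothetical names, and it assumes the sending service records a timezone-aware datetime for every timed-out request.

```python
from collections import Counter
from datetime import datetime, timezone


def spike_minutes(timeout_times, threshold):
    """Return {minute_in_utc_iso: count} for minutes with >= threshold timeouts.

    `timeout_times` is an iterable of timezone-aware datetimes, one per
    timed-out request (hypothetical input from the sender's own logs).
    """
    counts = Counter()
    for t in timeout_times:
        if t.tzinfo is None:
            # Naive datetimes are ambiguous, so refuse them instead of guessing.
            raise ValueError("timeout timestamps must be timezone-aware")
        # Normalize to UTC and truncate to the start of the minute.
        minute = t.astimezone(timezone.utc).replace(second=0, microsecond=0)
        counts[minute] += 1
    return {m.isoformat(): n for m, n in sorted(counts.items()) if n >= threshold}
```

The resulting keys are ISO 8601 strings with an explicit +00:00 offset, which can be pasted into a report as they are.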

We noticed a few similar spikes of timeouts in the last 7 days. They were not as large, only on the order of tens of thousands of timed-out requests. Unlike the one mentioned originally, each came from a specific location: timeouts towards APNs spiked in India on Nov 5th at 9:40am UTC, Nov 6th at 5:53am UTC, and Nov 8th at 6:34am UTC; in the US on Nov 5th at 11:11am UTC; and in the EU on Nov 4th at 8:42am UTC (which is probably outside the 7-day window).

The most affected seem to be APNs requests for the bundle with App ID com.microsoft.skype.teams.

Since those were timeouts, we did not receive, and therefore could not log, responses with apns-ids.

Please let us know if you are aware of any outage, update, network issue, or anything else that could have caused these timeouts on your side. We are also investigating on our side.

Accepted Answer

@MichalZ12 While our investigations show we had some small bumps in the P99 response time in the past week, they don't coincide with your timestamps and were also not large enough to cause the issues you have observed - and definitely not on the P90 timeline.

At this point, we believe the issue might lie somewhere between your servers and APNs, with either the request or the response falling outside the timeout period you have.

Unfortunately, at the volume we are talking about, it is not possible for us to pinpoint a single notification to see if the request was processed or not without an apns-id.


Argun Tekant /  DTS Engineer / Core Technologies
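
On the missing apns-ids: the APNs provider API accepts an apns-id request header containing a canonical UUID, and only generates one itself when the header is omitted. A sender can therefore choose the ID up front and still have it on record when a request times out without a response. Below is a minimal sketch of that idea, not a description of how the poster's service works. `send_with_tracking`, the `send` callable, and the `timeouts` list are hypothetical names, and `com.example.app` is a placeholder topic.

```python
import logging
import uuid
from datetime import datetime, timezone

log = logging.getLogger("apns")


def build_apns_headers(topic, push_type="alert"):
    """Return APNs request headers that include a client-generated apns-id."""
    return {
        "apns-id": str(uuid.uuid4()),  # lowercase 8-4-4-4-12 UUID string
        "apns-topic": topic,
        "apns-push-type": push_type,
    }


def send_with_tracking(send, device_token, payload, topic, timeouts):
    """Send one notification; record its apns-id if the request times out.

    `send` is a hypothetical callable (headers, device_token, payload) that
    performs the HTTP/2 POST and raises TimeoutError on a client-side timeout.
    `timeouts` is a caller-owned list that collects records for later reporting.
    """
    headers = build_apns_headers(topic)
    sent_at = datetime.now(timezone.utc).isoformat()
    try:
        return send(headers, device_token, payload)
    except TimeoutError:
        record = {
            "apns_id": headers["apns-id"],
            "topic": topic,
            "sent_at_utc": sent_at,
        }
        timeouts.append(record)
        log.warning("APNs timeout: %s", record)
        return None
```

A sample of the collected apns_id values, together with the UTC send times, is the kind of detail requested earlier in the thread.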
