This issue has been fixed in Ventura beta 11 and the beta after it. We are waiting for the public release to verify the fix.
We ran route monitor during the upgrade and saw the following sequence of messages:
got message of size 180 on Thu Oct 6 15:57:00 2022
RTM_ADD: Add Route: len 180, pid: 135, seq 88, errno 17, flags:<UP,CLONING,STATIC>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
240.0.0.1 255.192.0.0 240.0.0.1
got message of size 180 on Thu Oct 6 15:57:00 2022
RTM_DELETE: Delete Route: len 180, pid: 135, seq 89, errno 3, flags:<UP,CLONING,STATIC>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
240.0.0.1 255.192.0.0 240.0.0.1
got message of size 180 on Thu Oct 6 15:57:00 2022
RTM_ADD: Add Route: len 180, pid: 135, seq 90, errno 17, flags:<UP,CLONING,STATIC>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
240.0.0.1 255.192.0.0 240.0.0.1
According to these messages, adding the route fails with errno 17 (EEXIST) because the route already exists, but deleting it fails with errno 3 (ESRCH) because the route does not exist, which is quite weird.
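For anyone reading the log above, here is a minimal sketch that decodes the two errno values it contains. The helper name describeErrno is hypothetical; ESRCH, EEXIST, and strerror are the standard C library definitions that Foundation exposes.

```swift
import Foundation

// Hypothetical helper: turn a raw errno value, such as the ones printed by
// `route monitor`, into a symbolic name plus the system's message text.
// Only the two values seen in the log above get a symbolic name.
func describeErrno(_ code: Int32) -> String {
    let names: [Int32: String] = [ESRCH: "ESRCH", EEXIST: "EEXIST"]
    let name = names[code] ?? "errno \(code)"
    let message = String(cString: strerror(code))
    return "\(name): \(message)"
}

print(describeErrno(17)) // the RTM_ADD failure: the route already exists
print(describeErrno(3))  // the RTM_DELETE failure: no matching route was found
```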
That sounds like a bug to me, and I encourage you to file it as such. Please post your bug number, just for the record.
Sure, I will.
Which seems like the basis of a workaround
Yes, but there is no way to know that the tunnel configuration has failed, so we would need to apply the settings twice in every case.
Is there a nicer way to know whether our settings have been applied? Or is there a way to read the routing table in code and, if our include route does not exist, apply the settings again?
Finally, one of our beta testers reported this issue and we were able to get a sysdiagnose from their machine. A bug has been submitted: FB11594187
Thanks @eskimo, for the link.
Is this WebSocket server local to the Mac? Or something that you’d expect to be accessible via the default hardware interface?
No, it's a remote WebSocket server.
Just in case it helps,
We have a split tunnel with only a fixed private range of IPv4 addresses in the include routes. So, there should not be any cause of a loop in the provider.
We also have a transparent proxy provider in the same system extension. It also sees a "Network is down" error on NWConnection whenever the PacketTunnel provider receives one.
What other macOS NECP policies should I look out for? Since the issue happens at random times and goes away when the machine is restarted (restarting just the PacketTunnelProvider does not help), it looks like some setting is misconfigured, but I'm not sure which. We have seen this issue with people on different networks and in different geographies.
We also have an NWPathMonitor running. When this issue starts happening, it reports that none of the wifi, wiredEthernet, cellular or other interface types is connected, or that the path is unsatisfied.
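For context, this is roughly how such a monitor can log what it sees. It is a minimal sketch, not our actual code: the describe helper and the queue label are hypothetical, while NWPathMonitor, usesInterfaceType(_:), and availableInterfaces are the Network framework APIs. It is meant to live inside a long-running process such as the provider.

```swift
import Network

// Hypothetical helper: summarise a path using the interface types named above.
func describe(_ path: NWPath) -> String {
    let types: [(String, NWInterface.InterfaceType)] = [
        ("wifi", .wifi), ("wiredEthernet", .wiredEthernet),
        ("cellular", .cellular), ("other", .other),
    ]
    // Names of the interface types the current path would use.
    let used = types.filter { path.usesInterfaceType($0.1) }.map { $0.0 }
    // Names of all interfaces currently available to the path (e.g. en0, utun3).
    let available = path.availableInterfaces.map { $0.name }
    return "status=\(path.status) uses=\(used) available=\(available)"
}

let monitor = NWPathMonitor()
monitor.pathUpdateHandler = { path in
    // Called on every path change; log it so it can be lined up with the failures.
    print(describe(path))
}
monitor.start(queue: DispatchQueue(label: "path-monitor")) // hypothetical label
```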
Thanks Matt, I will do that. It's just that we don't have definite steps to reproduce this issue. It happens after a long time of running the packet tunnel provider. We will try to reproduce and file a bug report.
There is only one PacketTunnel provider configured and running, the one in the Connected state. The other one, in the Running state, is a content filter. We see duplicates and inconsistent information for the same single packet tunnel VPN.
We did further testing and found that if the intermediate CA certificates are available on either side (as part of the anchor certificates on the client or as part of the trust object from the server), they are simply used to establish the trust chain to the root. If any intermediate certificate is missing on both sides, the evaluation fails. Here are our observations:
Given we have a Root certificate (rootCA), two intermediate certificates (iCA1 and iCA2)
Given leaf certificate is always available on both sides
If server presents rootCA as well as iCA1 and iCA2 while the client anchors rootCA as well as iCA1 and iCA2, the trust evaluation is successful.
If server presents rootCA as well as iCA1 and iCA2 while the client anchors rootCA as well as iCA1 but no iCA2, the trust evaluation is successful.
If server presents rootCA as well as iCA1 and iCA2 while the client anchors rootCA as well as iCA2 but no iCA1, the trust evaluation is successful.
If server presents only iCA2 while the client anchors rootCA as well as iCA2 but no iCA1, the trust evaluation fails since iCA1 is missing on both sides.
If server presents iCA1 and iCA2 while the client anchors rootCA only, the trust evaluation is successful.
If server presents only iCA2 while the client anchors rootCA and iCA1, the trust evaluation is successful.
Are these outcomes expected? Is this how certificate evaluation is expected to behave?
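To make the rule we inferred explicit, here is a small model of it. This is only a sketch of our reading of the observations: it does not call Security.framework, the function name chainCanBeBuilt is hypothetical, and since rootCA was anchored by the client in every case, the model says nothing about a missing root.

```swift
// Hypothetical model: the chain can be built when every intermediate CA is
// available from at least one side (presented by the server or anchored by
// the client). rootCA was always anchored in our tests, so it is not modelled.
func chainCanBeBuilt(serverPresents: Set<String>, clientAnchors: Set<String>) -> Bool {
    let intermediates: Set<String> = ["iCA1", "iCA2"]
    return intermediates.isSubset(of: serverPresents.union(clientAnchors))
}

// The six observations listed above, in the same order.
let observations: [(server: Set<String>, client: Set<String>, succeeded: Bool)] = [
    (["rootCA", "iCA1", "iCA2"], ["rootCA", "iCA1", "iCA2"], true),
    (["rootCA", "iCA1", "iCA2"], ["rootCA", "iCA1"], true),
    (["rootCA", "iCA1", "iCA2"], ["rootCA", "iCA2"], true),
    (["iCA2"], ["rootCA", "iCA2"], false),
    (["iCA1", "iCA2"], ["rootCA"], true),
    (["iCA2"], ["rootCA", "iCA1"], true),
]

for (index, observation) in observations.enumerated() {
    let predicted = chainCanBeBuilt(serverPresents: observation.server,
                                    clientAnchors: observation.client)
    print("case \(index + 1): predicted \(predicted), observed \(observation.succeeded)")
}
```

The model agrees with all six cases, which is why we read the behaviour as "an intermediate may come from either side".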
Thanks for the reply, Matt. We do not have the certificates in any keychain and there is no MDM on the machine. We have also tried passing true to SecTrustSetAnchorCertificatesOnly(_:_:) so that only our certificates are used for validation. The outcome is still the same.
Thanks for the tip @eskimo. Currently we are supporting macOS 11+ but will definitely use the new API for Monterey. Here is the feedback for documentation update: FB9964722
As for the original issue, we noticed that a replacement with the same version only happens when developer mode is enabled on the machine (systemextensionsctl developer on), regardless of SIP status. Once we turn developer mode off, the API works properly and we do not get the delegate callback for a replacement of the same version. It jumps directly to the completed result.
I am not sure why it behaves this way and could not find any documentation about it.
I hope this information helps others in the future.
We opened a TSI, where Matt informed us that NWProtocolWebSocket does not have an HTTP stack (unlike URLSessionWebSocketTask) and cannot parse HTTP responses from the server. Therefore, any error returned as part of the server's HTTP response during the WebSocket handshake is not available to clients.
We have filed a feedback request for the ability to parse and return the HTTP response in NWProtocolWebSocket: FB9878278
We are not using SCDynamicStore and have a very limited understanding of it. Can you let us know how we can use it to get the status of the utun interface? How can we get a callback when the status changes, and is there any API we can use to change the status from the container app?
In the container app we are observing both the NEVPNConfigurationChange and NEVPNStatusDidChange notifications, but we do not receive any update there.
Probably unrelated, but the only callback or update we receive is in the path update handler of NWPathMonitor, and the path remains satisfied with the same interface.
Thanks Matt. In console logs I do not see any error or unexpected log. Here is a screenshot of logs when bringing the interface down: https://ibb.co/7gz331J and here is a screenshot of when bringing it up: https://ibb.co/kyNyTGV
I would recommend controlling this behavior from either your container app or Network System Extension
As per our design, we give users control from the container app. However, we noticed that some (geeky) users bring our virtual interface down and then back up using ifconfig, and then report that our packet tunnel no longer works.
So, if there is a way for our container app or system extension to be notified when the interface is brought down/up, we could react to it by setting the tunnel settings again and bringing up the virtual interface. Is that at all possible? If yes, how?
To be honest this doesn't surprise me because of what I mentioned above, the provider and system state are out of sync.
When we bring any physical interface down and back up, it starts working fine again within a few seconds. So why does this issue occur only with packet tunnel providers? Shouldn't the system keep the state in sync as it does for any other interface? The provider itself is not informed about this change anyway.
Finally, I was able to solve this problem. Here are a few things we learned and fixed:
At first we were not receiving the FIN,ACK from the server when the connection was closed from the server side. In the server's packet capture we could see that the server was sending FIN,ACK but was not receiving an ACK from the client, so it kept retransmitting the FIN,ACK. Similarly, the client was not aware that the server had closed the connection, and anything it tried to send was not ACKed by the server and was retransmitted as well. This happened after 5 or more minutes of idle time. We narrowed it down to a NAT issue: the NAT device was dropping its mapping for the client after the connection had been idle for a certain period. Once the mapping was lost, the server and the client could no longer reach each other and both were stuck retransmitting. We fixed this by sending a TCP keep-alive every minute while the connection is idle.
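In case it helps, one way to get a keep-alive every minute on an NWConnection is through NWProtocolTCP.Options. This is a minimal sketch rather than our exact code: the host, port, probe count, and the use of default TLS options are placeholder assumptions.

```swift
import Network

// TCP options with keep-alive enabled so the NAT mapping stays fresh.
let tcpOptions = NWProtocolTCP.Options()
tcpOptions.enableKeepalive = true
tcpOptions.keepaliveIdle = 60      // seconds of idle time before the first probe
tcpOptions.keepaliveInterval = 60  // seconds between subsequent probes
tcpOptions.keepaliveCount = 3      // assumed value: unanswered probes before giving up

// Assumption: the connection runs TLS over TCP with default TLS options.
let parameters = NWParameters(tls: NWProtocolTLS.Options(), tcp: tcpOptions)

// Placeholder endpoint; substitute the real server.
let connection = NWConnection(host: "example.com", port: 443, using: parameters)
```

The idle value has to stay below the NAT device's mapping timeout, which in our case was somewhere around 5 minutes.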
After fixing that, we started getting the FIN,ACK from the server but were not able to see it in the NWConnection callback. SSL_shutdown on the server closes the write direction and sends a FIN,ACK to the client. On the client side, this is indicated by the isFinal property of the NWConnection.ContentContext received in the receiveMessage completion handler. Our mistake was that we expected either data or error in the completion handler to be non-nil, and otherwise discarded the callback. But when the server sends FIN,ACK, both error and data are nil and isFinal is set on the context. This is the hint that the server has closed its write channel and that the client should send any remaining messages and close the connection. Once we implemented that, the issue was completely fixed.
Thanks @eskimo for your help and suggestions. Let me know if you see any issue with the above fixes or any improvements that can be made.
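The receive handling described above can be sketched as follows. This is a simplified illustration, not our production code: NextStep, nextStep, and receiveLoop are hypothetical names, and the handling of remaining outbound messages is reduced to sending the final-message context before cancelling.

```swift
import Foundation
import Network

// What to do after one receiveMessage callback.
enum NextStep: Equatable {
    case fail            // an error was reported
    case finishAndClose  // context.isFinal was set: the peer closed its write side
    case keepReceiving   // ordinary message, or nothing to act on yet
}

// Pure decision logic; note that it never requires data to be non-nil.
func nextStep(hasError: Bool, isFinal: Bool) -> NextStep {
    if hasError { return .fail }
    return isFinal ? .finishAndClose : .keepReceiving
}

func receiveLoop(on connection: NWConnection,
                 handle: @escaping @Sendable (Data) -> Void) {
    connection.receiveMessage { data, context, _, error in
        // Deliver any payload first; data may legitimately be nil.
        if error == nil, let data = data, !data.isEmpty { handle(data) }

        switch nextStep(hasError: error != nil, isFinal: context?.isFinal ?? false) {
        case .fail:
            connection.cancel()
        case .finishAndClose:
            // Close our own write side, then tear the connection down.
            connection.send(content: nil, contentContext: .finalMessage, isComplete: true,
                            completion: .contentProcessed { _ in connection.cancel() })
        case .keepReceiving:
            receiveLoop(on: connection, handle: handle)
        }
    }
}
```

The key point is the middle case: a callback with nil data and nil error is not discarded, because isFinal alone carries the information that the server is done sending.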