Very basic question: diagnosing DNS issues

Our transparent proxy provider sends flows to a daemon which analyzes and then does proxying. Works fine.

Except that sometimes it stops working. As far as I can tell, it's due to DNS not working. Queries hang -- we've got some internal ones we log, that have timed out after 20 or 30 seconds. Now, clearly, we're doing something bad (because if we kill the daemon and it restarts, everything goes back to working).

Unfortunately, I have forgotten so much I can't figure out how to see where it's broken! Commands like dig @8.8.8.8 com. any fail. I am presuming that's because it's trying to do a lookup of "8.8.8.8" and that fails, but I could be wrong. Admittedly, that one doesn't hang the way our queries do; it simply says no servers could be reached. Meanwhile, pinging that address works. (Also, the local DNS host, the one provided via DHCP and listed in /etc/resolv.conf and ipconfig getstatus, behaves the same way.)
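
Since 8.8.8.8 is an IP literal, dig should not need to resolve it first, so "no servers could be reached" more likely means the queries themselves got no reply. One way to check that from inside an affected process, without going through the system resolver at all, is to send a hand-built query straight to the server over UDP with a short timeout. A minimal sketch, assuming plain BSD sockets; the names build_dns_query and probe_dns, the query ID 0x1234, and the 512-byte buffers are illustrative choices and not anything from the daemon:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Encode a minimal DNS query (RFC 1035 wire format) for `name` into buf.
   Returns the packet length, or -1 if it does not fit or a label is invalid. */
int build_dns_query(uint8_t *buf, size_t cap, const char *name,
                    uint16_t qtype, uint16_t id)
{
    assert(buf != NULL && name != NULL);
    size_t n = 12;                          /* fixed header size */
    if (cap < n) return -1;
    memset(buf, 0, 12);
    buf[0] = (uint8_t)(id >> 8);
    buf[1] = (uint8_t)(id & 0xff);
    buf[2] = 0x01;                          /* RD: recursion desired */
    buf[5] = 1;                             /* QDCOUNT = 1 */
    while (*name) {                         /* one length-prefixed label per dot */
        const char *dot = strchr(name, '.');
        size_t len = dot ? (size_t)(dot - name) : strlen(name);
        if (len == 0 || len > 63 || n + 1 + len > cap) return -1;
        buf[n++] = (uint8_t)len;
        memcpy(buf + n, name, len);
        n += len;
        name += len;
        if (*name == '.') name++;
    }
    if (n + 5 > cap) return -1;
    buf[n++] = 0;                           /* root label ends the name */
    buf[n++] = (uint8_t)(qtype >> 8);
    buf[n++] = (uint8_t)(qtype & 0xff);
    buf[n++] = 0;
    buf[n++] = 1;                           /* QCLASS = IN */
    return (int)n;
}

/* Send an A query for `name` to server_ip:53 and wait up to timeout_sec.
   Returns the reply length, 0 on timeout, or -1 on any other error. */
int probe_dns(const char *server_ip, const char *name, int timeout_sec)
{
    uint8_t pkt[512], reply[512];
    int len = build_dns_query(pkt, sizeof pkt, name, 1, 0x1234);
    if (len < 0) return -1;

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(53);
    if (inet_pton(AF_INET, server_ip, &sa.sin_addr) != 1) return -1;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return -1; }
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) < 0)
        perror("setsockopt");

    ssize_t got = -1;
    if (sendto(fd, pkt, (size_t)len, 0, (struct sockaddr *)&sa, sizeof sa) < 0) {
        perror("sendto");
    } else {
        got = recvfrom(fd, reply, sizeof reply, 0, NULL, NULL);
        if (got < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            got = 0;                        /* receive timeout expired */
        else if (got < 0)
            perror("recvfrom");
    }
    close(fd);
    return (int)got;
}
```

Calling probe_dns("8.8.8.8", "com.", 5) returns the reply length, 0 on timeout, or -1 on error. If it times out while ping to the same address still works, that would suggest UDP traffic to port 53 is being dropped or held somewhere on the path, rather than a problem in the resolver itself.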

I haven't been able to reproduce this myself, unfortunately. Somewhat interestingly, though, I have had a similar issue, which was clearly due to a Google Home WiFi access point (resetting it fixed the problem, as did moving to another area of the house so that a different AP in the mesh took over).

On my FreeBSD systems, I'd run tcpdump and truss/ktrace on named, but as I said, I've forgotten so much about how macOS does DNS I'm flailing.

Help?

I assume we're hitting some resource limit, but... I don't know which one, or how to find out. :(

Our transparent proxy provider sends flows to a daemon which analyzes and then does proxying. Works fine.

Does the daemon actually proxy the flows or just analyze them and then send information back to the Network Extension to proceed with proxying the flow? Are there any logs that start showing up when the system DNS starts hanging?

The TPP looks for specific flow types (using the application and destination), sends them up to the daemon if they're interesting, and then the daemon modifies them if necessary and sends them out to the internet.

I'm thinking it's not just DNS at this point. DNS is definitely failing, but I think almost all networking is blocked by something: I see mDNS traffic, but nothing else. And if we restart the TPP, it gets maybe one or two flows, which it then sends off to the daemon, and then nothing else. Whereas if we restart the daemon, everything starts working properly. That lasts for a while, and then the problem repeats.

I haven't been able to reproduce it! But several other people can do so, fairly reliably.

Whereas if we restart the daemon, everything starts working properly.

I wonder if your daemon is running out of open files for the process? Try this out to see if that is the case:

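#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

// Call this from inside the daemon so it reports the daemon's own limits.
void print_open_file_limits(void) {
    struct rlimit p_limit;

    // RLIMIT_NOFILE: the maximum number of open files for this process.
    if (getrlimit(RLIMIT_NOFILE, &p_limit) == -1) {
        // getrlimit only returns 0 or -1; errno carries the actual reason.
        printf("getrlimit failed with: %s\n", strerror(errno));
        return;
    }
    // rlim_t varies by platform, so cast to match the %llu format.
    printf("The (hard) limit on max number of open files for the process is: %llu\n", (unsigned long long)p_limit.rlim_max);
    printf("The (soft) limit on max number of open files for the process is: %llu\n", (unsigned long long)p_limit.rlim_cur);
}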

We ran into that early; the launchd.plist file for it sets the open file limits to 1000000.
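
A high RLIMIT_NOFILE does not by itself show how many descriptors the daemon actually holds when things wedge, so it may help to log the live count as well. A minimal sketch, assuming it is compiled into the daemon; count_open_fds is a hypothetical helper name, and probing each slot with fcntl(F_GETFD) is just one portable way to do this:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Count the descriptors currently open in this process by probing every
   slot in [0, max_fd) with fcntl(F_GETFD). A slot that fails with EBADF
   is closed; anything else is treated as open. */
int count_open_fds(int max_fd)
{
    assert(max_fd >= 0);
    int open_count = 0;
    for (int fd = 0; fd < max_fd; fd++) {
        errno = 0;
        if (fcntl(fd, F_GETFD) != -1 || errno != EBADF)
            open_count++;
    }
    return open_count;
}
```

Logging count_open_fds(getdtablesize()) periodically would show whether the count climbs steadily before a hang, which would point at a leak of sockets or flows rather than at the limit itself. From outside the process, lsof -p with the daemon's pid gives a similar picture.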
