getattrlistbulk lists same files over and over on macOS 15 Sequoia

A customer of mine reported that since updating to macOS 15 they aren't able to use my app anymore. The app performs a deep scan of selected folders by recursively calling getattrlistbulk. The problem is that it apparently keeps scanning forever, with the number of scanned files increasing linearly without end.

This happens for some folders on an SMB volume.

The customer confirmed that they can reproduce the issue with the small sample app I attach below. At first, I created a sample app that only scans the contents of the selected folder without recursively scanning the subfolders, but then the issue didn't occur, so it seems to be related to recursively calling getattrlistbulk.

The output of the sample app on the customer's Mac is similar to this:

start scan /Volumes/shares/Backup/Documents level 0 fileManagerCount 2847
continue scan /Volumes/shares/Backup/Documents new items 8, sum 8, errno 34
/Volumes/shares/Backup/Documents/A.doc
/Volumes/shares/Backup/Documents/B.doc
...
continue scan /Volumes/shares/Backup/Documents new items 7, sum 1903, errno 0
/Volumes/shares/Backup/Documents/FKV.pdf
/Volumes/shares/Backup/Documents/KFW.doc
/Volumes/shares/Backup/Documents/A.doc
/Volumes/shares/Backup/Documents/B.doc
...

which shows that counting the number of files in the root folder by using

try FileManager.default.contentsOfDirectory(atPath: path).count

returns 2847, while getattrlistbulk lists about 1903 files and then starts listing the files from the beginning, not even between repeated calls, but within a single call.

What could this issue be caused by?

(The website won't let me attach .swift files, so I include the source code of the sample app as a text attachment.)

Answered by DTS Engineer in 814122022


A customer of mine reported that since updating to macOS 15 they aren't able to use my app anymore, which performs a deep scan of selected folders by recursively calling getattrlistbulk.

SO, my immediate question is why are you calling getattrlistbulk? There was some value in it when it was originally introduced; however, it's well enough integrated into our higher level APIs that I can't really see any reason why you'd use it directly. The correct NSFileManager APIs are just as fast, far easier to use, and, most important in this case, more likely to correctly handle the weird edge cases that cause it to fail. I actually posted some performance testing on this issue, but the basic summary is that with a LOT of work and testing you MIGHT be able to get SLIGHTLY better performance out of fts (in my testing, 0.2s when iterating a set of 350,000 files)... but nothing that comes CLOSE to justifying all of the extra work. Note that I didn't specifically test getattrlistbulk because the three APIs I compared there are ALL "wrappers" around getattrlistbulk.

However, one qualifier there is the term "correct". FileManager has several very similar APIs that have COMPLETELY different performance characteristics. Case in point:

try FileManager.default.contentsOfDirectory(atPath: path).count

The "atPath" methods directory methods (contentsOfDirectory(atPath:) /enumerator(atPath:)) are FAR slower then than their URL equivalents (contentsOfDirectory(at:includingPropertiesForKeys:options:)/enumerator(at:includingPropertiesForKeys:options:errorHandler:)). Note that the difference in API design comes directly from the semantics of getattrlistbulk. The URL versions specifies the properties it needs so that they can be fetched during bulk iteration, while the "atPath" variant is forced to get "everything". Related to that point, you do need to be aware of exactly which methods in the API set you use. For example, properties retrieved through the URL APIs need to be accessed through the URL object itself:

"An array of keys that identify the file properties that you want pre-fetched for each item in the directory. For each returned URL, the specified properties are fetched and cached in the NSURL object."

...and NOT through DirectoryEnumerator.directoryAttributes/fileAttributes. Both of those are part of the legacy API and can end up forcing an unnecessary stat() call.
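Put together, the fast pattern looks something like this (a rough sketch; the path and key list are hypothetical examples):

import Foundation

// Request the needed properties up front so they're fetched during the bulk
// enumeration, then read them back from each URL's own cache, not through
// the legacy directoryAttributes/fileAttributes properties.
let directory = URL(fileURLWithPath: "/Volumes/shares/Backup/Documents")
let keys: [URLResourceKey] = [.isDirectoryKey, .fileSizeKey]

let contents = try FileManager.default.contentsOfDirectory(
    at: directory, includingPropertiesForKeys: keys, options: [])

for item in contents {
    let values = try item.resourceValues(forKeys: Set(keys))
    print(item.lastPathComponent, values.isDirectory == true, values.fileSize ?? 0)
}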

Moving to the specific issue...

The problem is that the app apparently keeps scanning forever, with the number of scanned files linearly increasing to infinity. This happens for some folders on a SMB volume.

I don't know exactly what the problem might be but here is what I can say:

  • The problem is NOT getattrlistbulk itself.

You said:

try FileManager.default.contentsOfDirectory(atPath: path).count returns 2847,

...but contentsOfDirectory(atPath:) is actually built as a wrapper around "enumeratorAtPath:", which is in turn built on getattrlistbulk.

  • I think your attributeBuffer is FAR too small:
let attributeBuffer = UnsafeMutableRawBufferPointer.allocate(byteCount: 2048, alignment: 16)

I know our code snippet from the man page uses a buffer size of "256" but, to be blunt, I have no idea why that was picked, as it's ridiculously small, small enough that the snippet won't actually work for all files. 2048 is "better" in the sense that it's big enough that it should work reliably, but I think it's much too small for good performance. All of the performance benefit this API provides comes from reducing the overall syscall count, which basically means "bigger is better". Ideally, you'll also want to be using multiple full pages, since that can allow performance opportunities that don't otherwise exist.

For comparison:

  1. The internal implementation of directory enumeration inside CoreFoundation (which is what FileManager is using) uses a complex calculation that tries to account for the size of every field actually requested, which is then multiplied by 20 (the number of files it "wants" to try and process per call). However, as a concrete reference, the ATTR_CMN_NAME calculation assumes a max file name contains 256 characters of 3 bytes each (the max size of a UTF-8 character), which means JUST including ATTR_CMN_NAME means it allocates a buffer of more than 15360 bytes (256 x 3 x 20)-> ~15 KB.

  2. More sensibly, fts doesn't bother with any calculation and hard codes the buffer at 32 * 1024-> 32 KB-> 2 pages, which is what I would do. There might be an argument for using more than 2 pages, but I can't think of an argument for using less than that.
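In the sample app's terms, that sizing advice would look something like this (a sketch mirroring the allocation quoted above):

// fts-style sizing: a couple of pages rather than 2 KB.
let bufferSize = 32 * 1024
let attributeBuffer = UnsafeMutableRawBufferPointer.allocate(byteCount: bufferSize, alignment: 16)
defer { attributeBuffer.deallocate() } // assuming the buffer lives for the scope of the scan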

(The website won't let me attach .swift files, so I include the source code of the sample app as a text attachment.)

I did take a look at your attachment, but I think something was completely broken in the upload process, garbling the file to the point where I couldn't really get it working. However, looking at the code, I can see one explanation for this:

returns 2847, while getattrlistbulk lists about 1903 files and then starts listing the files from the beginning, not even between repeated calls, but within a single call.

If your "ATTR_CMN_NAME" parsing fails in the right name, then you'll end up returning "parent path" as the new file full path. You'll then recurse on exactly the same directory. This might seems silly/theoretical, however, if you look at the code for fts, significant code is dedicated to dealing with paths that exceed PATH_MAX.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

SO, my immediate question is why are you calling getattrlistbulk?

I think I switched to getattrlistbulk many years ago for performance reasons. Perhaps I didn't test correctly at the time, or perhaps FileManager became faster in the meantime... in any case, I cannot switch back just yet because of some missing or unsupported URLResourceKey values. fileIdentifierKey, for instance, is only available since macOS 13.3 and iOS 16.4, and it was only added because I opened a feedback report about it (or it was coincidentally added at the same time).

  • I think your attributeBuffer is FAR too small:

(I would keep quoting you like above, but it seems that multiple quotes are not rendered correctly, so I have to resort to the Markdown quote syntax)

Thanks for pointing that out. I increased it to 32 * 1024 as suggested...

If your "ATTR_CMN_NAME" parsing fails in the right name, then you'll end up returning "parent path" as the new file full path. You'll then recurse on exactly the same directory.

... but still, even though your explanation makes sense, it cannot be the cause here. If any file name had failed to parse in my original code and resulted in the parent directory being scanned again, the output would show another start scan [...] line, which is not the case here. Instead, getattrlistbulk just continues by listing files we've already seen.

I did take a look at your attachment, but I think something was completely broken in the upload process, garbling the file to the point where I couldn't really get it working.

Here I try to upload the code again.

The file is uploaded incorrectly again. By comparing the uploaded version to my local one, it seems it doesn't like < and > characters and deletes everything in between.

Here is a version where I put spaces before and after those characters, maybe that works:

The file is uploaded incorrectly again. By comparing the uploaded version to my local one, it seems it doesn't like < and > characters and deletes everything in between.

Thank you, that worked, and I've now got a working copy of your build. Seems to work fine, though I didn't expect to replicate the failure (since you haven't been able to either).

Thanks for pointing that out. I increased it to 32 * 1024 as suggested...

Once your code was working, I went back to 256 bytes as well, and it looks like it worked better than I would have expected. I think you'll actually see a substantial performance increase, but I also think your old size is unlikely to be the direct failure source.

Instead, getattrlistbulk just continues by listing files we've already seen.

SO, the first thing to understand here is that there isn't actually a single "getattrlistbulk". getattrlistbulk is a VFS function that each individual file system implements. The top level VFS layer goes through a mapping process that translates from the high level semantics (like file handles) to VFS semantics (like vnodes) and then calls into the individual VFS driver to do the actual "work"*. If you're curious, this code is actually open source, with "smbfs_vnop_getattrlistbulk" being the entry point for getattrlistbulk.

*If getattrlistbulk is unsupported by a particular file system, then the VFS layer has a default implementation it falls back on. However, that implementation should work fine (it's just slow) and, more importantly, it would only come into play on an old/odd SMB server (see below).

The key point here is that this limits the failure points to:

  1. Something went wrong in the way your code processed the data.

It's hard to be sure we haven't missed something, but I don't see any obvious failure point, and the code does seem to "work". If this were the issue, I'd also expect there to be something unusual about the source data (for example, non-ASCII names), but that doesn't seem to be the case.

One major argument toward #1 here was this point:

...but contentsOfDirectory(atPath:) is actually built at as wrapper around "enumeratorAtPath:" which is then built on getattrlistbulk.

However, I'm less confident of that than I originally was. Some work has been done to transition more of Foundation to Swift, and that may have split the implementation. The big test I would add here is to call contentsOfDirectory(at:includingPropertiesForKeys:options:) as well or, even better, enumerate the directory with CFURLEnumeratorCreateForDirectoryURL(). Either of those should end up calling getattrlistbulk, and a failure in either would be good confirmation that the problem isn't actually with your code (a quick sketch of that cross-check follows this list).

  2. Something went wrong in how the local SMB client processed the data it received.

This is hard to rule out and is entangled with #3 (see below), but this is the one I'd consider "least" likely. Somewhat counterintuitively, the SMB protocol is old and complicated enough that a lot of work goes into compatibility testing from both the client and the server, as both sides try to avoid breaking "each other".

  3. Something went wrong with the data the remote SMB server sent.

As I mentioned above, #2 and #3 are entangled, as many issues come down to arguing about whether bad data was sent (by the server) or correct data was parsed/handled improperly (by the client). In any case, the important question here is what the server is. If it's macOS or some other "major" server, then there isn't a lot to say, but if it's something rare/weird, then that's worth a closer look.

  4. Something went wrong in the data the remote SMB server read from the underlying data source.

The final piece here is the data the server itself reads from its local file system, particularly if some kind of "middleware" file system is involved. Projects like FUSE have made it relatively easy to build a file system that "works" at a superficial level, and many of those implementations make EXTREMELY naive assumptions about how a VFS driver works and are never tested well enough for anyone to realize there's an issue.
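As a quick sketch of the cross-check suggested under #1 (the path is hypothetical), both of these should bottom out in getattrlistbulk on an SMB volume, so a loop or a count mismatch here would point away from your own parsing:

let directory = URL(fileURLWithPath: "/Volumes/shares/Backup/Documents")
let urlCount = try FileManager.default.contentsOfDirectory(
    at: directory, includingPropertiesForKeys: nil, options: []).count
let pathCount = try FileManager.default.contentsOfDirectory(atPath: directory.path).count
print("URL API: \(urlCount), path API: \(pathCount)")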

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

The big test I would add here is to call contentsOfDirectory(at:includingPropertiesForKeys:options:) as well or, even better, enumerate the directory with CFURLEnumeratorCreateForDirectoryURL(). Either of those should end up calling getattrlistbulk and a failure in either would be good confirmation that the problem isn't actually with your code.

Thanks for all your detailed insights. I understand that the URL version of try FileManager.default.contentsOfDirectory(atPath: path).count is faster, but both should call getattrlistbulk, and since at the very beginning I mentioned that it indeed returns a reasonable file count, it would seem that... there is indeed something wrong with my code? Although you also mentioned that you don't see any obvious failure point in it.

I updated my feedback report FB15497909 on October 28 by uploading the sysdiagnose and other tests performed by the user, but haven't heard anything yet. Are you already aware of it? Are you able to tell where the issue might be?

Thanks for all your detailed insights. I understand that the URL version of try FileManager.default.contentsOfDirectory(atPath: path).count is faster, but both should call getattrlistbulk, and since at the very beginning I mentioned that it indeed returns a reasonable file count, it would seem that... there is indeed something wrong with my code? Although you also mentioned that you don't see any obvious failure point in it.

So, to clarify here, it is VERY difficult for me to directly compare "our code" to "your code" and be CERTAIN that I haven't missed "something". There are multiple API layers involved, and the implementations themselves are structured very differently (the lowest level API is an enumeration layer written in C). I have done a surface level comparison of how our code interacts with getattrlistbulk, and from that "side" your code looks fine. Trying to go past that level isn't really practical.

I updated my feedback report FB15497909 on October 28 by uploading the sysdiagnose and other tests performed by the user, but haven't heard anything yet. Are you already aware of it? Are you able to tell where the issue might be?

I wasn't until now...

SO, having spent most of this morning looking at the data, I do have a few things worth looking more closely at. First off, returning to this point in my earlier email:

In any case, the important question here is what the server is. If it's macOS or some other "major" server, then there isn't a lot to say, but if it's something rare/weird, then that's worth a closer look.

What the SMB server actually is remains the most important question to answer. I wasn't able to find anything definitive in the sysdiagnose data, but the name itself, "hp-media", was evocative. "HP Media" was the marketing name for a number of different NAS products from ~15+ years ago. That's the right time frame for hardware to have been impacted by a suspiciously similar issue:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=612503

Moving to the file list itself, my basic answer is that something very weird is going on. If you start with the first file name and find its second instance, you can see that the looped files look like this:

Start loop 1:
Line 3-> T--Ex--R--.docx
...
Line 40703-> FE-W--.FMT

Start loop 2:
Line 40704-> T--Ex--R--.docx
...
Line 81404-> FE-W--.FMT

If you cut those two "chunks" out, pull them into separate files, and compare them, you'll find that with minimal cleanup:

  1. Some places return errno 2 while others return errno 0. Change these lines so the errno values match.

  2. Some of the "continue scan..." calls don't match up. Removes or merge these.

...then the result is two lists of ~40695 entries that EXACTLY match. In other words, the same ~40,000+ entries were returned twice. However, the really critical point is what happens at the end of the sequence:

Line 81404-> FE-W--.FM3
Line 81405-> FE-W--.FMT
Line 81406-> FE-W--.WK1
Line 81407-> FE-W--.WK3
Line 81408-> FE-W--.xls

In other words, the "loop" here was that the the entire iteration stopped at a particular point in time and then "reset", repeating the entire loop. I think that would match the behavior described in the bug I mentioned earlier.

Finally, I also looked at the Wireshark trace. I'm not qualified to do a truly in-depth analysis, but I did find multiple returns for "T--Ex--R--.docx". Filtering for SMB2, the first return was at packet 14566 and the second at packet 290050.

Based on all that, my instinct would be that the issue here is caused by the source server. This is pure speculation, but my guess is that the combination of very high file counts and multiple open directories being iterated simultaneously is causing a failure on the server side, probably crashing part of the SMB server. It's able to recover quickly (which is why there isn't an obvious failure), but it has lost its iteration state... so the entire process starts over.

Finally, with all that background, I did have another thought/question/explanation here:

Thanks for all your detailed insights. I understand that the URL version of try FileManager.default.contentsOfDirectory(atPath: path).count is faster, but both should call getattrlistbulk, and since at the very beginning I mentioned that it indeed returns a reasonable file count, it would seem that... there is indeed something wrong with my code?

The big difference I can think of here is that contentsOfDirectory(atPath:) only looks at the first directory level (it's "shallow", not "deep"), which means it never recurses, which significantly changes how it interacts with SMB.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

The big difference I can think of here is that contentsOfDirectory(atPath:) only looks at the first directory level (it's "shallow", not "deep"), which means it never recurses, which significantly changes how it interacts with SMB.

Thank you very much for your detailed help.

The customer agreed to run a new test app which uses FileManager.default.enumerator(at:includingPropertiesForKeys:errorHandler:) instead of recursive getattrlistbulk calls, and it loops as well. So it seems like something weird was changed in getattrlistbulk in macOS 15.

Here's what they say regarding their server model:

My original server was from HP. I rebuilt it about 8-9 years ago, and it has nothing to do with HP anymore; I just kept the naming convention for consistency. I think the issue is likely with Windows 10 Storage Spaces. I use Storage Spaces to pool the disk space for redundancy. It's kind of like drive mirroring.

I'm somewhat happy to learn that it's not a fault in my implementation. I guess the only thing I can do now is wait and see whether I get any response in Feedback Assistant, right? Or can you already tell that this won't likely be fixed in a future macOS release (if it's a macOS issue at all) and is something to be fixed in the customer's setup (Windows 10 Storage Spaces, which I've never heard of before)?

Answer One, a possible workaround:

While looking through the log again today, I think I actually found the point the problem occurs, which is this log sequence:

2024-10-27 19:48:02.915-0400 kernel smbfs_enum_dir: Resuming enumeration for <Documents>
2024-10-27 19:48:02.915-0400 kernel smbfs_find_cookie: Key 5, offset 1130, nodeID 0x400000007bd6c name <Front Door before and after.bmp> for <Documents>
2024-10-27 19:48:02.915-0400 kernel smbfs_enum_dir: offset 1130 d_offset 0 d_main_cache.offset 0 for <Documents>
2024-10-27 19:48:02.915-0400 kernel smbfs_fetch_new_entries: fetch offset 1130 d_offset 0 cachep->offset 0 for <Documents>
2024-10-27 19:48:02.915-0400 kernel smbfs_fetch_new_entries: Main cache needs to be refilled <Documents>
2024-10-27 19:48:02.915-0400 kernel smbfs_fetch_new_entries: Dir has not been modified recently <Documents>
2024-10-27 19:48:02.915-0400 kernel smbfs_fetch_new_entries: Restart enum offset 1130 d_offset 0 for <Documents>

In other words, the failure here is occurring when you "return" to iterating the previous directory. That is something you could avoid/mitigate by removing/modifying the recursion "cycle" of your directory walk. Basically, what you'd do is this:

  1. iterate the directory with getattrlistbulk
  2. if(file)-> process normally
  3. if(directory)-> cache/store entry
  4. finish iteration of current directory
  5. for each directory in #3, return to #1

In concrete terms, if you have this hierarchy:

dir1
	file1
	dir2
		file2
	file3
	dir3
		file4
		dir4
			fileC
		file5
	file6

Currently, the order you process files is exactly the same as the order above:

iterate dir1
process file1
iterate dir2
process dir2/file2
process file3
iterate dir3
process dir3/file4
iterate dir3/dir4
process dir3/dir4/fileC
process dir3/file5
process file6

With the new approach, the same hierarchy would process as:

iterate dir1
process file1
process file3
process file6
iterate dir2
process dir2/file2
iterate dir3
process dir3/file4
process dir3/file5
iterate dir3/dir4
process dir3/dir4/fileC

This does add some additional bookkeeping and memory risk; however, I do think it's a "safer" approach overall. IF the issue is the SMB server (see answer 2 for the other possibility), then the issue is almost certainly caused by the complexity of tracking nested iteration. In other words, the problem isn't "iterating 10,000 files", it's RETURNING to that iteration after having iterated lots of other directories. The approach above removes that because, as far as the file system is concerned, you only ever iterate one directory. You can also use this as an easy opportunity to flatten your recursion, so there are some memory benefits as well.
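As a rough sketch of steps 1-5 above (where "entries" stands in for whatever getattrlistbulk listing you already have; it's a hypothetical helper, not a real API):

import Foundation

struct Entry {
    let url: URL
    let isDirectory: Bool
}

func flatScan(root: URL, entries: (URL) -> [Entry], process: (Entry) -> Void) {
    var pending = [root]                  // step 3's storage
    while let dir = pending.popLast() {   // step 5: return to stored directories
        for entry in entries(dir) {       // step 1: one full, uninterrupted iteration
            if entry.isDirectory {
                pending.append(entry.url) // step 3: defer subdirectories
            } else {
                process(entry)            // step 2: files handled immediately
            }
        }                                 // step 4: this directory finishes before any other starts
    }
}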

Finally, flattening also helps with the "larger" iteration context. As a simplified example, imagine this particular bug is that the server drops the least recent iteration anytime there are 10 or more simultaneous iterations. As far as the server is concerned, 10 apps iterating 1 directory look exactly the same as a nested iteration 10 levels deep. Flattening the iteration obviously solves the second case, but it probably helps the first one as well: your single iteration never "blocks" (because you never recurse on getattrlistbulk), so your iteration is unlikely to ever be the "oldest". Something may still fail, but it won't be your app.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer Two, a completely different theory:

SO, my other answer is a perfect example of two things:

  1. Sometimes it's better to be lucky than good.

  2. In an investigation like this, the MOST important thing is to not assume you understand the problem. Focusing too much on "what the problem is" can very well mean you miss the entire problem.

Point 1:

This morning I was looking at a completely unrelated problem from a totally different developer. This particular problem happened to involve a file system performance problem for a single user who had recently upgraded to macOS 14.7.1. I was looking into the activity of an endpoint security client (more on why later) and, as part of that, I put the bundle ID from that ES client into Spotlight, which happily searched my entire hard drive (I hadn't restricted it to the sysdiagnose log).

What it found was a spindump log from that ES client inside YOUR sysdiagnose file (the name starts with "RTProtectionDaemon").

Point 2:

Because of the overall context, my focus here had been on the assumption that this was somehow tied to your implementation or the server. What that focus hadn't accounted for were the two MOST important facts about the problem. Those are:

  1. The problem is "weird"*. Case in point, getattrlistbulk isn't simply failing (like crashing) it's working totally fine for quite awhile, somehow "resetting itself", then (seemingly) continuing to work "fine".

  2. It's happening to one (or a small number) of users.

*My working definition of "weird" is "how hard would it be for me to make this bug happen if I were TRYING to make the system work this way". The harder it would be to make a bug happen, the weirder it is.

With those two factors, an ES client issue is one of the first things I should have looked for. The ES API presents so many "hooks" into the system that it can basically break "anything" in the system. More to the point, the typical ES client failure isn't "the client breaks", it's "the client creates performance problems or unexpectedly interferes with system activity in a way that causes OTHER things to break".

Once I started looking for an ES client, odd things immediately started popping up. For example, these two lines are logged every ~5s, a total of 2226 times, starting here:

2024-10-27 16:46:24.380-0400 kernel Client connected (pid 336 cdhash 10bcbf9d9d3ec4c2ebc3166d7b7598d7eb70e4ec)
2024-10-27 16:46:24.381-0400 kernel Client disconnected (pid 336, RTProtectionDaem)

And ending here:

2024-10-27 19:51:49.680-0400 kernel Client connected (pid 336 cdhash 10bcbf9d9d3ec4c2ebc3166d7b7598d7eb70e4ec)
2024-10-27 19:51:49.681-0400 kernel Client disconnected (pid 336, RTProtectionDaem)

That's not normal. I have no idea why a process would be doing that, but nothing I can think of seems like a good idea.

Similarly, RTProtectionDaem is visible in the spindump, with one thread deep in directory iteration. I would love to see a trace captured while you were actively scanning, but I think it's a safe/reasonable assumption that it's probably interacting with the same hierarchy as you are.

Summarizing all this, I'm not sure which of these two (SMB server or ES client) is the true failure, but if I had to guess, I would lean toward the ES client. I don't think it's this on its own:

I think the issue is likely with Windows 10 Storage Spaces. I use Storage Spaces to pool the disk space for redundancy. It’s kind of like drive mirroring.

...because, just like macOS, Windows is (or should be) better "layered" than that. The SMB server shouldn't know or care about the lower level storage layer, for the same reason our SMB server doesn't care about AppleRAID vs. APFS vs. HFS+. All the SMB server cares about are the top level "files".

It's possible there's an issue with the SMB server itself, but that seems unlikely unless he's running something particularly old and out of date. That's actually the reason I was concerned about a NAS box in particular. It's not just that such a box is old, it's that it's old AND less likely to have ever been updated.

I'm somewhat happy to learn that it's not a fault in my implementation. I guess the only thing I can do now is wait and see whether I get any response in Feedback Assistant, right?

Setting expectations here, I think it'll be a while before you hear anything back there. If it is a bug on our side, then I wouldn't expect it to be fixed in a regular software update. There are SO many configurations and edge cases in SMB that ANY change is inherently very high risk, so fixes tend to be held for major updates (when there is far more time for testing). That also assumes that it's truly "our bug", which I just don't know.

Or can you already tell that this won't likely be fixed in a future macOS release (if it's a macOS issue at all) and is something to be fixed in the customer's setup (Windows 10 Storage Spaces, which I've never heard of before)?

The main issue here is making sure he's running the most recent Windows update (for Windows 10).

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks for your detailed response. Flattening sounds like a good idea, although I'm a little bit reluctant to adopt it since this would likely mean that if one day I switch to FileManager.default.enumerator(at:includingPropertiesForKeys:errorHandler:) the bug would appear again (unless it has been fixed in the meantime).

The user just confirmed to me that when copying the folder in the Finder, the same looping issue occurs. So I was wondering: wouldn't FileManager.default.enumerator(at:includingPropertiesForKeys:errorHandler:) (which I assume is used by the Finder) profit from using iteration instead of recursion as well? If we put the memory benefits you mentioned aside, are there any downsides to using iteration instead of recursion?

Thanks for your detailed response. Flattening sounds like a good idea, although I'm a little bit reluctant to adopt it since this would likely mean that if one day I switch to FileManager.default.enumerator(at:includingPropertiesForKeys:errorHandler:) the bug would appear again (unless it has been fixed in the meantime).

Nah, you can do exactly what I've described with enumerator(at:includingPropertiesForKeys:options:errorHandler:). One of its options is skipsSubdirectoryDescendants, which lets you disable recursion.

The user just confirmed to me that when copying the folder in the Finder, the same looping issue occurs. So I was wondering: wouldn't FileManager.default.enumerator(at:includingPropertiesForKeys:errorHandler:) (which I assume is used by the Finder) profit from using iteration instead of recursion as well?

No, it's definitely not using that. If you pay attention to exactly what the Finder does when copying files and the functionality provided by NSFileManager, you'll pretty quickly realize that NSFileManager's copy API just can't do what the Finder does, since the Finder provides "real time" progress and the NSFileManager API simply doesn't do that. The "copyfile" Unix API will let you do that, but the Finder isn't using that either, which is why the Finder can do server side SMB copying, which copyfile cannot.

In terms of why the Finder isn't having issues, I suspect it's because:

  • For display purposes, the Finder doesn't try recursive iteration. It's just trying to display/cache directories as quickly as possible, so recursion would be counterproductive.

  • I'm not sure what's happening on the copy side. It uses its own copy engine, but that copy engine is not easy to quickly analyze. I think it might actually fail, but I think that would heavily depend on exactly what was being copied (both total volume and source/destination).

If we put the memory benefits you mentioned aside, are there any downsides to using iteration instead of recursion?

Actually, having thought about it more, yes, there is, though I don't think it matters to you. There's a difference here in terms of their "worst case" hierarchy, which would be:

  1. Iteration-> A nested series of directories, each of which is "full" of lots of directories. My approach ends up "collecting" every directory before processing it, which could cause a major accumulation of unprocessed directories.

  2. Recursion-> Deep file hierarchies.

The issue here is that the worst case for #1 is much more common than #2. That is, a directory is more likely to contain 1000 directories than it is to be nested 1000 levels deep.

However, whether or not this actually matters depends on what you're doing with the data. If you're "discarding" the data (for example, calculating the total size of a single directory), then the difference could be quite significant. However, if you're "recording" the data (meaning you're storing everything you collect for later display), then the distinction doesn't actually matter. In the recording case, all you're really changing is the order you visit objects, so your peak usage is always the same no matter what.

Note that in the "recording" case, I think you also need to be thinking in terms of the the entire record processing "sequence", not just "how do I iterate as quickly as possible". I think I would actually start with a file format that I'd use to store the data I was collecting, which I would then "feed" files AND directories into to as I performed the scan. Directories are initially stored as "unscanned", then "scanning" (meaning, the engine has started iterating their contents), then "scanned" (meaning, the engine has scanned their full contents). That structure could then be used to store you intermediate state (so you're scanning isn't storing the full intermediate data) and could also be used to let you resume scans.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

One of its options is skipsSubdirectoryDescendants, which lets you disable recursion.

But then I get a shallow enumeration. I still want to scan an entire directory hierarchy.

The "copyfile" Unix API will let you do that but the Finder isn't using that either, which is why the Finder can do server side SMB copying, which copyfile cannot

I've noticed that, which is why I've been using copyfile for a long time now myself.

In terms of why the Finder isn't having issues

I think you misunderstood. I said that the Finder is having the same issue.

One of its options is skipsSubdirectoryDescendants, which lets you disable recursion. But then I get a shallow enumeration. I still want to scan an entire directory hierarchy.

I know, but you can still use NSDirectoryEnumerator. What you do is:

  1. Shallow enumerate the current directory.
  2. During that enumeration, collect the directories you find.
  3. Repeat from #1 using the directories you find in #2.

In other words, you do a shallow enumeration of a given directory, then repeat the shallow enumeration on whatever directories you find. This is actually the same thing you'd do with getattrlistbulk.
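A sketch of that loop with the FileManager API (property keys and error handling kept minimal):

import Foundation

func flatScan(root: URL, process: (URL) -> Void) {
    var pending = [root] // directories we know about but haven't scanned yet
    while let dir = pending.popLast() {
        // Shallow enumeration: skipsSubdirectoryDescendants disables recursion.
        guard let enumerator = FileManager.default.enumerator(
            at: dir,
            includingPropertiesForKeys: [.isDirectoryKey],
            options: [.skipsSubdirectoryDescendants],
            errorHandler: nil) else { continue }
        for case let item as URL in enumerator {
            if (try? item.resourceValues(forKeys: [.isDirectoryKey]))?.isDirectory == true {
                pending.append(item) // found a directory: queue it for later
            } else {
                process(item)
            }
        }
    }
}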

The "copyfile" Unix API will let you do that but the Finder isn't using that either, which is why the Finder can do server side SMB copying, which copyfile cannot

I've noticed that, which is why I've been using copyfile for a long time now myself.

Have or haven't? How are you copying files? FYI, if server side copying is an issue, this post has a code snippet for that.

In terms of why the Finder isn't having issues

I think you misunderstood. I said that the Finder is having the same issue.

I did misunderstand. What's the actual failure? The copy itself failing, display issue(s), etc.?

(which I assume is used by the Finder) profit from using iteration instead of recursion as well?

In theory, yes, but it is a tricky trade-off to make. To start with, there's a basic question of how far the Finder (or any other part of the system) should "go" to resolve issues that are caused by other, external components. Trying to make broken systems work can easily turn into a time sink that greatly complicates your implementation without an equal "payoff". You're facing exactly the same choice, but the right choice for you may be to "make it work", simply because you're in a totally different business situation.

On the technical side, the resource usage difference can easily be quite large, since "wide" hierarchies are much more common than "deep" hierarchies. The Finder is also pretty constrained in what it can do: its memory usage needs to be relatively flat/constrained so that it runs well on low end machines and/or when other app usage constrains memory. It also can't rely on being able to write data out, since that may not be possible.

It's likely neither of those constraints really applies to your app, since the user will tolerate both more memory use and the storage space for your app to write out data.

I also want to go back and expand on this:

Note that in the "recording" case, I think you also need to be thinking in terms of the the entire record processing "sequence", not just "how do I iterate as quickly as possible". I think I would actually start with a file format that I'd use to store the data I was collecting, which I would then "feed" files AND directories into to as I performed the scan.

For a product like yours, you're effectively creating a "database" that duplicates the file system's hierarchy. When your ultimate goal is to end up with a record for every file system object, that changes the resource situation significantly. The additional "in progress" directory records iteration creates don't actually need to be "extra" at all; you were always going to need a record for each directory, so you're really just creating them at a different point in time and then revisiting them "later" to finish the iteration.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

In other words, you do a shallow enumeration of a given directory, then repeat the shallow enumeration on whatever directories you find. This is actually the same thing you'd do with getattrlistbulk.

Right. For some reason I thought it would be less efficient.

Have or haven't? How are you copying files? FYI, if server side copying is an issue, this post has a code snippet for that.

Have. To copy files, I'm just calling copyfile. I didn't know about server-side copying, but that may explain why every now and then a user complains that my app is slow, though they never got back to me explaining their setup or comparing the speed of my app to that of the Finder.
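(For reference, the call pattern is roughly this, with hypothetical paths and flags:)

import Darwin

// COPYFILE_ALL copies data plus metadata; COPYFILE_CLONE lets APFS clone
// instead of copying bytes where it can. Real code checks errors per file.
let status = copyfile("/path/to/source", "/path/to/destination", nil,
                      copyfile_flags_t(COPYFILE_ALL | COPYFILE_CLONE))
if status < 0 {
    perror("copyfile")
}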

What's the actual failure? The copy itself failing, display issue(s), etc?

I'm not sure; they simply replied "I just tried [in the Finder] and it did indeed loop!" I assume the "Preparing X files" display when starting a copy operation just kept counting up.

On the technical side, the resource usage difference can easily be quite large, since "wide" hierarchies are much more common than "deep" hierarchies

That makes sense. My app does a deep scan to determine which files to copy, then copies them, and finally stores the directory hierarchy so it can be compared with the next scan. My users also assume that the app kind of runs in the background and doesn't use too many resources, so I think I'll stick with recursion for now.

Have. To copy files, I'm just calling copyfile. I didn't know about server-side copying, but that may explain why every now and then a user complains that my app is slow, though they never got back to me explaining their setup or comparing the speed of my app to the that of the Finder.

Yes, that's almost certainly the case. Again, the key thing here is that the best case for server-side copying (basically, duplicating a large number of files on a remote APFS volume) can be ASTRONOMICALLY faster, as you basically end up comparing a file clone operation against moving the full data back and forth across the network. At one point while testing this, I did one test (duplicating a single 6.4 GB file) where the result was:

Carbon-> 0.210258s
NSFileManager-> 166.364387s

Yes, that's effectively "instant" vs. "~2.5 min.". That's also over gigabit ethernet between two machines "next to each other". The same copy over the broader internet would show a TINY increase on the Carbon side (conservatively... 0.5s?) and HOURS on NSFileManager. Falling back to Carbon for this is obviously painful, but the gap between the Finder and NSFileManager is SO large that I understand why it would be justified.

That makes sense. My app does a deep scan to determine what files to copy, then copies them, and finally stores the directory hierarchy so it can be compared with the next scan.

SO, making sure this is clear, the performance "trick" here is to use your final scan storage format as your "intermediate" format instead of doing the entire scan in memory. If done properly, this means:

  • No additional memory or storage usage, as you're simply building up the scan format incrementally instead of dumping it from memory at the end.

  • Easy support for resuming scans, since your on-disk format now captures your intermediate state during the scan.

Note that resuming a scan is much easier in the iterative case than in the recursive case. In the recursive case, your app has multiple nested scans active, which makes it harder to know exactly "where" a given scan should resume. In the iterative case, you have:

  1. Directories that are "done" (meaning you've captured every object directly in that directory).

  2. Directories that you "know about" but have not scanned.

  3. EXACTLY one directory that you're currently scanning.

When resuming, all you need to do is discard any data from #3 and simply rescan that one directory. This may be more work than is justified at the moment, but I think it's something you should think about if/when you have the time.
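Using the DirectoryRecord/ScanState sketch from earlier in the thread (hypothetical types), the resume step could be as small as:

// Demote the single "scanning" directory back to "unscanned", discarding its
// partial contents, then continue the normal loop from whatever is unscanned.
func directoriesToResume(_ records: inout [DirectoryRecord]) -> [String] {
    for i in records.indices where records[i].state == .scanning {
        records[i].state = .unscanned
    }
    return records.filter { $0.state == .unscanned }.map(\.path)
}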

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
