Given a directory path (or NSURL) I need to get the total number of files/documents in that directory - recursively - as fast and light as possible.
I don't need to list the files or filter them.
All the APIs I have found so far (NSFileManager, NSURL, NSDirectoryEnumerator) collect too much information, and the recursive ones aggregate the whole hierarchy before returning.
Applied to a large directory, this means a high CPU peak, a slow operation, and a huge memory impact, even if transient.
My question: What API is best to use to accomplish this count, must I scan recursively the hierarchy? Is there a "lower level" API I could use that is below NSFileManager that provides better performance?
Back in the middle ages, I used the old Mac OS 8 (pre-Mac OS X) file-system APIs, which were immensely fast and allowed doing this without aggregating anything.
I write my code in Objective-C, using the latest Xcode and macOS, and of course ARC.
My question: What API is best to use to accomplish this count, must I scan recursively the hierarchy?
Use NSDirectoryEnumerator, created with "enumeratorAtURL", requesting only the specific properties you need and reading them through the NSURL objects it returns. Note that "enumeratorAtPath" is SIGNIFICANTLY slower, as are the other path-based methods. The performance difference is a direct result of the arguments themselves: both of the NSURL methods ask for "includingPropertiesForKeys" because they use that list as part of the fetch cycle for iterating the directory. "enumeratorAtPath" is required to prefetch significant additional data, which ends up requiring an extra stat call.
Is there a "lower level" API I could use that is below NSFileManager
Yes, that API is "fts".
that provides better performance?
It's lower level but that doesn't mean it's any faster. As it happens, I have a test tool lying around that specifically tests fts and enumeratorAtURL. Running it today, here is how things break down:
Run 1:
ftsTest -> File Count: 357248, Time: 4.610113s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.817799s
enumeratorAtURL (nil) -> File Count: 357248, Time: 9.075000s
enumeratorAtPath -> File Count: 357248, Time: 13.589894s
Run 2:
ftsTest -> File Count: 357248, Time: 4.502485s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.828324s
enumeratorAtURL (nil) -> File Count: 357248, Time: 8.988125s
enumeratorAtPath -> File Count: 357248, Time: 14.359770s
Note: "empty" here means passing in an empty array ("[NSArray array]") instead of nil. The empty array means "I don't need anything at all", while the nil value is interpreted as a "default" set which is why it ends up being slower.
Also, be aware that this is with a very specific fts configuration ("FTS_PHYSICAL | FTS_NOCHDIR | FTS_NOSTAT").
The default fts configuration ("0") is actually slower than enumeratorAtURL:
ftsTest -> File Count: 357248, Time: 5.979837s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.815174s
However, the big issue here is why I had this code lying around in the first place. I originally wrote it to try to sort out EXACTLY the question you're asking ("What's the fastest way to iterate a large directory hierarchy...") and what I expected to find was that fts wasn't really THAT much faster than enumeratorAtURL. This is what I actually found:
2016-02-09 13:01:48.089 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.366178
2016-02-09 13:01:51.611 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.521848
2016-02-09 13:01:59.645 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -8.033731
2016-02-09 13:02:03.253 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.608108
2016-02-09 13:02:11.072 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.818806
2016-02-09 13:02:14.562 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.489380
2016-02-09 13:02:22.358 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.796076
2016-02-09 13:02:25.783 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.425546
2016-02-09 13:02:33.339 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.555523
2016-02-09 13:02:36.817 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.478089
In other words, 8 years ago fts was ~2x SLOWER than enumeratorAtURL. It was the slow API that "caught up", not the "fast, low-level API". The key point here is that "low level" doesn't always mean better or faster; sometimes it just means "different". In this particular case, NSFileManager and the CoreFoundation (CFURLEnumeratorCreateForDirectoryURL) file APIs are so widely used that a great deal of effort has been put into making them as fast as possible. Indeed, the API "underneath" all of this was SPECIFICALLY added to make THAT layer (Foundation/CoreFoundation) faster, NOT the lower-level Unix APIs. fts eventually adopted it, but it was significantly slower until it did.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware