What is the recommended way to count files recursively in a specific folder

Question

Created Jul ’24

Replies 7

Boosts 0

Participants 4

Given a directory path (or NSURL) I need to get the total number of files/documents in that directory - recursively - as fast and light as possible.

I don't need to list the files, and not filter them.

All the APIs I found so far (NSFileManger, NSURL, NSDirectoryEnumerator) collect too much information, and those who are recursive - are aggregating the whole hierarchy before returning.

If applied to large directory - this both implies a high CPU peak and slow action, and a huge memory impact - even if transient.

My question: What API is best to use to accomplish this count, must I scan recursively the hierarchy? Is there a "lower level" API I could use that is below NSFileManager that provides better performance?

One time in the middle-ages, I used old MacOS 8 (before MacOS X) file-system APIs that were immensely fast and allowed doing this without aggregating anything.

I write my code in Objective-C, using latest Xcode and MacOS and of course ARC.

Answered by DTS Engineer in 796984022

My question: What API is best to use to accomplish this count, must I scan recursively the hierarchy?

NSDirectoryEnumerator when created with "enumeratorAtURL" and including specific properties required and accessing them through the NSURL object. Note that the "enumeratorAtPath" is SIGNIFICANTLY slower, as are the other path methods. The performance difference is a direct result of the arguments themselves- both of the NSURL methods ask for "includingPropertiesForKeys" because they use that property as part of the fetch cycle for iterating the directory. "enumeratorAtPath" is required to prefetch significant additional data, which ends up requiring an extra stat call.

Is there a "lower level" API I could use that is below NSFileManager

Yes, that API is "fts".

that provides better performance?

It's lower level but that doesn't mean it's any faster. As it happens, I have a test tool lying around that specifically tests fts and enumeratorAtURL. Running it today, here is how things break down:

Run 1:
ftsTest                -> File Count: 357248, Time: 4.610113s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.817799s
enumeratorAtURL (nil)  -> File Count: 357248, Time: 9.075000s
enumeratorAtPath       -> File Count: 357248, Time: 13.589894s

Run: 2
ftsTest                -> File Count: 357248, Time: 4.502485s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.828324s
enumeratorAtURL (nil)  -> File Count: 357248, Time: 8.988125s
enumeratorAtPath       -> File Count: 357248, Time: 14.359770s

Note: "empty" here means passing in an empty array ("[NSArray array]") instead of nil. The empty array means "I don't need anything at all", while the nil value is interpreted as a "default" set which is why it ends up being slower.

Also, be aware that this is with a very specific fts configuration ("FTS_PHYSICAL | FTS_NOCHDIR | FTS_NOSTAT"). fts default configuration ("0") is actually slower than enumeratorAtURL:

ftsTest                -> File Count: 357248, Time: 5.979837s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.815174s

However, the big issue here is why I had this code lying around in the first place. I originally wrote it to trying to sort out EXACTLY the question you're asking ("What's the fastest way to iterate a larger directory hiearchy...") and what I expected to find was that fts wasn't really THAT much faster than enumeratorAtURL. This is what I actually found:

2016-02-09 13:01:48.089 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.366178
2016-02-09 13:01:51.611 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.521848
2016-02-09 13:01:59.645 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -8.033731
2016-02-09 13:02:03.253 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.608108
2016-02-09 13:02:11.072 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.818806
2016-02-09 13:02:14.562 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.489380
2016-02-09 13:02:22.358 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.796076
2016-02-09 13:02:25.783 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.425546
2016-02-09 13:02:33.339 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.555523
2016-02-09 13:02:36.817 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.478089

In other words, 8 years ago fts was ~2x SLOWER than enumeratorAtURL. It was the slow API that "caught up", not the "fast low, level API". The key point here is that "low level" doesn't always mean better or faster, sometimes it just means "different". In this particular case, NSFileManager and the CoreFoundation (CFURLEnumeratorCreateForDirectoryURL) file APIs are so widely used that a great deal of effort has been put into making them as fast as possible. Indeed, the API "underneath" all of this was SPECIFICALLY added to make THAT layer (Foundation/CoreFoundation) faster, NOT the lower level Unix APIs. fts eventually adopted it, but it was significantly slower until it did.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Boost

Answer 1

ssmith_c OP

Jul ’24

man readdir ?

0

Answer 2

endecotp OP

Jul ’24

Yes I believe you need to recursively descend through the directory tree. So you need to read each directory, determine which of the directory entries are sub-directories, and visit those. This boils down to readdir and stat.

I am aware of one possible optimisation, if you want to squeeze the last bit of performance. In order to determine whether a directory entry is a sub-directory, rather than a regular file, in general you need to stat it. But, when you stated the parent directory you got st_nlink, which tells you the number of hard links to that directory. Each directory's .. entry counts as a hard link to its parent directory, and its . entry is a hard link to itself. So if st_nlink is 4, then the directory can have at most 3 subdirectories. So you can enumerate the directory contents, checking if each entry is a subdirectory, until you have found 3. At that point you can stop stating everything - the remaining entries must be non-directories.

This is classic UNIX stuff that has worked since forever. It's not impossible that new Apple filesystems have alternative APIs that make this more efficient. Maybe someone else will comment. I don't know where the bottleneck is - is it the disk access speed? - is it the number of kernel calls?

Things you need to consider if you want a robust implementation:

Hard links
Symlinks
Hard or symbolic links that introduce cycles
Other special directory entries
Mount points
Whether to include . and .. and other dotfiles in your count
Unreadable (and unexecutable) directories

Personally, I'd do something like this:

auto count_files(std::filesystem::path p)
{
  int n_files = 0;
  for (auto& entry: std::filesystem::recursive_directory_iterator(p)) {
    if (is_regular_file(entry)) ++n_files;
  }
  return n_files;
}

1

Answer 3

suMac OP

Jul ’24

one

0

Answer 4

suMac OP

Jul ’24

Thank you very much. I'll try your suggested approach as well. Seems interesting.

I think the inefficiency comes from the basic functionality of the APIs I found so far ( in MacOS Cocoa).

They all assume that I'm interested in the actual items, and they fetch/pre-fetch meta-data for each file/item whereas all I need to do is COUNT. Furthermore - all existing APIs that support recursive scan of directory hierarchies - are AGGREGATING the results along the scan, and won't return until they have the full list of items.

This of course both burdens CPU work and Memory consumption.

I never thought to go to Posix APIs until now, because

I dislike them (A True Mac programmer since 1987) because they're all synchronous and plain dumb - they do NOT cover the rich and strange behaviors and attributes of modern file-system.
At least in the past, apple had its low-level FileSystem APIs exposed, and I could work magic with them. Since then the FileSystem has changed, but I don't know which APIs exist today.
By example, I wrote the following naive code:

 static NSUInteger totalCount = 0;   NSDirectoryEnumerator *dirPathEnumenumerator = [fm  enumeratorAtPath:@"/Users/me/Documents"];
 NSString *currRelativePath = nil; // local path
 while ( (currRelativePath = [dirPathEnumenumerator nextObject]) != nil)  {
 NSDictionary *fileAttributes = [dirPathEnumenumerator fileAttributes];
  if (![fileAttributes[NSFileType] isEqualToString:NSFileTypeDirectory])
         totalCount++;

And for my ~100,000 files Documents folder, it took about 6 seconds to run.

I then wrote it differently - using SHALLOW directory enumeration, and instead of recursing, I dispatched "need to scan" code-blocks onto concurrent NSOperationQueue, thus removing recursion (and stacks) and also spreading the task over several cores -- like this:


   static NSUInteger totalCount = 0;
   static NSFileManager *fmm = nil;
   static NSArray *requiredProperties = nil;

-(void)countFilesInDocumentsFolder {
    NSOperationQueue *q = [[NSOperationQueue alloc] init];
    q.maxConcurrentOperationCount = 5;
    q.qualityOfService = NSQualityOfServiceUtility;
    q.name = @"file counting queue";
    dirFullPath = @"/Users/me/Documents";
    NSURL *topURL = [NSURL fileURLWithPath:dirFullPath];

    if (fmm == nil)
        fmm = [NSFileManager defaultManager];
    if (requiredProperties == nil)
        requiredProperties = @[NSURLNameKey, NSURLIsRegularFileKey ,NSURLIsDirectoryKey, NSURLIsSymbolicLinkKey, NSURLIsVolumeKey, NSURLIsPackageKey];
    
    [self countFilesInDirectory:topURL usingQueue:q];
    [q waitUntilAllOperationsAreFinished];
    NSLog (@"Total File count in directory: %@ is: %lu", dirFullPath, totalCount);
}

-(void)countFilesInDirectory:(NSURL *)directoryURL usingQueue:(NSOperationQueue *)queue {
    [queue addOperationWithBlock:^{
        NSError *error = nil;
        NSArray<NSURL *> *itemURLs = [fmm contentsOfDirectoryAtURL:directoryURL includingPropertiesForKeys:requiredProperties options:NSDirectoryEnumerationSkipsHiddenFiles error:&error];
        if (error) {
            NSLog(@"Failed to get contents of: %@, Error:%@", directoryURL, error);
            return;
        };
        for (NSURL *url in itemURLs) {
            
            NSDictionary<NSURLResourceKey, id> *fileAttributes = [url resourceValuesForKeys:requiredProperties error:&error];
            if (error!=nil || fileAttributes == nil) {
                NSLog(@"Failed to retrieve attributes for:%@ Error:%@",url, error);
                continue;
            }
            if ([fileAttributes[NSURLIsDirectoryKey] boolValue]) {
                [queue addOperationWithBlock:^{
                    [self countFilesInDirectory:url usingQueue:queue];
                }];
            }
            else {
                if ( [fileAttributes[NSURLIsRegularFileKey] boolValue])
                        totalCount++;
                }
            }
        }
    }];
}

And this one - although lengthy - took about 0.45 sec to do the same job (even a little better).

So... with my ***** of a Mac and a fast-as-hell SSD, I am still very far from satisfied.

I'll go down the POSIX rabbit hole and see what goes.

Thanks!

If I find the POSIX faster than my current thing, I'll accept your answer as the best :)

0

Answer 5

DTS Engineer OP

Apple

Jul ’24

Recommended

My question: What API is best to use to accomplish this count, must I scan recursively the hierarchy?

NSDirectoryEnumerator when created with "enumeratorAtURL" and including specific properties required and accessing them through the NSURL object. Note that the "enumeratorAtPath" is SIGNIFICANTLY slower, as are the other path methods. The performance difference is a direct result of the arguments themselves- both of the NSURL methods ask for "includingPropertiesForKeys" because they use that property as part of the fetch cycle for iterating the directory. "enumeratorAtPath" is required to prefetch significant additional data, which ends up requiring an extra stat call.

Is there a "lower level" API I could use that is below NSFileManager

Yes, that API is "fts".

that provides better performance?

It's lower level but that doesn't mean it's any faster. As it happens, I have a test tool lying around that specifically tests fts and enumeratorAtURL. Running it today, here is how things break down:

Run 1:
ftsTest                -> File Count: 357248, Time: 4.610113s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.817799s
enumeratorAtURL (nil)  -> File Count: 357248, Time: 9.075000s
enumeratorAtPath       -> File Count: 357248, Time: 13.589894s

Run: 2
ftsTest                -> File Count: 357248, Time: 4.502485s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.828324s
enumeratorAtURL (nil)  -> File Count: 357248, Time: 8.988125s
enumeratorAtPath       -> File Count: 357248, Time: 14.359770s

Note: "empty" here means passing in an empty array ("[NSArray array]") instead of nil. The empty array means "I don't need anything at all", while the nil value is interpreted as a "default" set which is why it ends up being slower.

Also, be aware that this is with a very specific fts configuration ("FTS_PHYSICAL | FTS_NOCHDIR | FTS_NOSTAT"). fts default configuration ("0") is actually slower than enumeratorAtURL:

ftsTest                -> File Count: 357248, Time: 5.979837s
enumeratorAtURL (empty)-> File Count: 357248, Time: 4.815174s

However, the big issue here is why I had this code lying around in the first place. I originally wrote it to trying to sort out EXACTLY the question you're asking ("What's the fastest way to iterate a larger directory hiearchy...") and what I expected to find was that fts wasn't really THAT much faster than enumeratorAtURL. This is what I actually found:

2016-02-09 13:01:48.089 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.366178
2016-02-09 13:01:51.611 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.521848
2016-02-09 13:01:59.645 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -8.033731
2016-02-09 13:02:03.253 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.608108
2016-02-09 13:02:11.072 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.818806
2016-02-09 13:02:14.562 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.489380
2016-02-09 13:02:22.358 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.796076
2016-02-09 13:02:25.783 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.425546
2016-02-09 13:02:33.339 TestDirPeformance[2394:31781259] ftsTest: 211763, Time: -7.555523
2016-02-09 13:02:36.817 TestDirPeformance[2394:31781259] dirEnumTest: 252195, Time: -3.478089

In other words, 8 years ago fts was ~2x SLOWER than enumeratorAtURL. It was the slow API that "caught up", not the "fast low, level API". The key point here is that "low level" doesn't always mean better or faster, sometimes it just means "different". In this particular case, NSFileManager and the CoreFoundation (CFURLEnumeratorCreateForDirectoryURL) file APIs are so widely used that a great deal of effort has been put into making them as fast as possible. Indeed, the API "underneath" all of this was SPECIFICALLY added to make THAT layer (Foundation/CoreFoundation) faster, NOT the lower level Unix APIs. fts eventually adopted it, but it was significantly slower until it did.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

0

Answer 6

DTS Engineer OP

Apple

Jul ’24

Recommended

All the APIs I found so far (NSFileManger, NSURL, NSDirectoryEnumerator) collect too much information, and those who are recursive - are aggregating the whole hierarchy before returning.

No or at least not exactly. When iterating a large directory hierarchy like this, the biggest performance factor is the total number of system calls into the kernel. This is actually what makes "enumeratorAtPath" so slow- it's API contract requires that it return a signficant amount of "extra" data, which basicalkly forces a stat() call on every file. What fts and enumeratorAtURL both do is use a different API which returns information about multiple files with in the same system call. Both of them use a "nextObject" pattern, but that's because most calls to "nextObject" are just returning data they'd already fetched, NOT calling into the kernel for new data.

This is also why enumeratorAtURL has the "includingPropertiesForKeys" option (fts also ahs somewhat similar functionality). That same system call can retrieve additional data within the same "fetch" operation, so getting that additional information during an iteration cycle is FAR faster than requesting it againt "later".

If applied to large directory - this both implies a high CPU peak and slow action, and a huge memory impact - even if transient.

I'm not sure what you were testing here, but this was probably caused by a accumulating temporary file objects in the "outer" autoreleasepool. This is the code I used to test "enumeratorAtURL":

void  dirURLEnumTest(NSString* volPath, NSArray* keys)
{
    NSURL* volURL = [NSURL fileURLWithPath: volPath];
    NSDirectoryEnumerator *fileEnum = [[NSFileManager defaultManager] enumeratorAtURL: volURL includingPropertiesForKeys: keys options:
                                           0 errorHandler:^BOOL(NSURL * _Nonnull url, NSError * _Nonnull error) {
        NSLog(@"Error: %@ at %@", error.debugDescription, url.path);
        return false;
    }];
   
    CountFile(volPath.fileSystemRepresentation);
    NSString* fileString= [fileEnum nextObject];
    
    while (fileString != NULL)
    {
        @autoreleasepool
        {
            //do file manipulation here….
            CountFile([fileString fileSystemRepresentation]);
            
            fileString= [fileEnum nextObject];
        }
    }
}

The "@autoreleasepool" inside the while loop is absolutely critical to totally memory usage. Without that pool, memory usage peaks at ~491mb of memory. With that pool, it peaks at ~6.3mb. That total is basically "flat" accross file count as well. The number increased slightly to ~6.5mb with 4x the file count:

enumeratorAtURL (empty)-> File Count: 1428982, Time: 25.334312s

However, I believe that all of that is due to malloc holding more memory for reuse (instead of returning to the system), not NSDirectoryEnumerator's implementation. fts actually went up significantly more (~5.6mb) and, again, I think that's simply because malloc was holding on to "more" instead of immediately returning to the system. In both cases, none of that would be relevant in a real application.

One time in the middle-ages, I used old MacOS 8 (before MacOS X) file-system APIs that were immensely fast and allowed doing this without aggregating anything.

I suspect you're thinking of "FSGetCatalogInfoBulk", which was the fastest way to retrieve catalog data about the filesystem hiearchy. That API is what lead to getattrlistbulk (see it's man page for details), which is what fts and enumeratorAtURL are built on. However, getattrlistbulk is not an API I'd recommend building on. It's a tricky API to use at all and I don't think you'd actually get much of a performance benefit unless you're trying to do something far more involved/complex. The trickiest part of using getattrlistbulk is managing it's memory buffers and parsing it's output into individual files, which is basically what fts and enumeratorAtURL were designed to "do".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

0

Answer 7

DTS Engineer OP

Apple

Jul ’24

Following up on your later messages, as I was working on my response through most of today.

And for my ~100,000 files Documents folder, it took about 6 seconds to run.

Here's how long my enumeratorAtURL code took:

enumeratorAtURL (empty)-> File Count: 100000, Time: 1.344492s

And this one - although lengthy - took about 0.45 sec to do the same job (even a little better).

Yes. What you're doing here is replacing a single serial run through the directory hiearchy with a series of parallel interation with inside the secondary hiearchy. Particularly on APFS, this can speed things up as there isn't very much I/O contention (the catalog data is all in memory) or lock contention (APFS is better than at this than many file systems).

FYI, contentsOfDirectoryAtURL is a wrapper around enumeratorAtURL, iterating the directory hiearchy and accumulating into a target array. In your particular case, you inadvertendly ended up jumping from the slowest enumerator (enumeratorAtPath) to the fastest (enumeratorAtURL + a property array), which helped create the big performance jump. If you shift your existing code to enumeratorAtURL, you'll find that the performance is identical except now you can use @autoreleasepool to limit your memory growth.

Keep in mind that the performance benefit you get here is going hugely variable depending on the character of the data you're iterating. In a directory that only contains files, you end up the same as just using enumeratorAtURL. At the other extreme (say 1 file per directory), you end up creating an explosion of tiny operations that don't actually "do" anything and you've lost all of the benefits bulk iteration provides. You also need to be careful about queue width, otherwise you end up burying the system in wasted I/O threads.

Finally, the underlying file system target can make a huge difference. APFS is highly parallel and EXTREMELY fast on local flash. Things will look very different when you looking at a slow SMB device.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

1