Unexpected Permission denied error on file sharing volume

I am getting recurring errors running code on macOS 15.1 on arm that is using a volume mounted from a machine running macOS 14.7.1 on x86. The code I am running copies files to the remote volume and deletes files and directories on the remote volume. The files and directories it deletes are typically files it previously had copied.

The problem is that I get permission failures trying to delete certain directories.

After this happens, if I try to list the directory using Terminal on the 15.1 system, I get a strange error:

ls -lA TestVAppearances.app/Contents/runtime-arm/Contents
total 0
ls: fts_read: Permission denied

If I try to list the directory on the target (14.7.1) system, there is no error:

TestVAppearances.app/Contents/runtime-arm/Contents:
total 0

So, let me start with the error here:

ls: fts_read: Permission denied

"fts" ("File Traversal Stream") is basically the Unix/POSIX equivalent to Foundation's NSDirectoryEnumerator. The details are documented in its man page ("man fts"), but "fts_read" returns the next file system object in the iteration set. So "Permission denied" would mean you weren't able to read the contents of that directory.

Moving to here:

The problem is that I get permission failures trying to delete certain directories.

  • How were you deleting the directories?

  • What actually failed, particularly in relation to the directory you're trying to list?

After this happens, if I try to list the directory using Terminal on the 15.1 system, I get a strange error:

  • Do you get the same failure if you open a new terminal window and navigate to the directory?

  • Do you get the same failure if you unmount and remount the SMB volume?

  • What do you see if you list the contents of the parent directory?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Kevin, I thank you for your response. I believe progress is happening.

For clarity, let me call the arm machine the client system and the x86 machine the server system.

I am deleting a directory tree on the server system from a Java application running on the client system. Java uses basic system calls (rmdir and unlink) to delete items.

I put a breakpoint on the exception handler and discovered an interesting situation.

The failure on directory deletion is directory not empty. That should not happen because before attempting to delete the directory, my program deleted its contents.

When I examine the directory that I could not delete (D) in Terminal on the server system, it is indeed not empty. It contains an empty subdirectory (S), which my program previously "deleted".

A few seconds later, directory S disappeared (as viewed in Terminal on the server system)!

It appears that there is a race condition. The operation to delete S apparently succeeded, but did not take effect immediately. The operation to delete D somehow overtook the previous operation and failed as a result.

From Terminal on the client system, S appears to exist but trying to list its contents fails with the fts_read error. I get the same error if I open a new Terminal window and navigate to D and try to list S.

If I unmount the volume and reconnect, I see the same bad state in Terminal. Listing D shows S. Listing S gets the fts_read error.
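
The state described above (the parent lists S, but listing S fails) can also be checked from code rather than from Terminal. What follows is only a minimal Java sketch: ListProbe and probe are hypothetical names, not part of the program discussed here, and the errno values in the comments are the usual JDK mapping on Unix-like systems, not something confirmed in this thread. It shows how a caller could tell "permission denied" apart from "no such file" when enumerating a directory.

```java
import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.NotDirectoryException;
import java.nio.file.Path;

/** Hypothetical diagnostic helper; the class and method names are illustrative. */
final class ListProbe {

    /** Tries to enumerate dir and returns a short description of the outcome. */
    static String probe(Path dir) {
        int count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path ignored : entries) {
                count++;
            }
            return "ok: " + count + " entries";
        } catch (NoSuchFileException e) {
            return "missing";            // typically ENOENT
        } catch (AccessDeniedException e) {
            return "denied";             // typically EACCES
        } catch (NotDirectoryException e) {
            return "not a directory";    // typically ENOTDIR
        } catch (IOException | DirectoryIteratorException e) {
            // Anything else, including a failure partway through the listing.
            return "other: " + e;
        }
    }
}
```

Running ListProbe.probe on both D and S from the client would show whether the failure is reported when the directory is opened or only while its entries are read.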

Is this a bug or am I doing something wrong?

Is there a reliable way to work around this problem?

For clarity, let me call the arm machine the client system and the x86 machine the server system.

What are the network conditions between these two machines? What's the latency and bandwidth of the connection? Note that latency in particular has a huge effect here.

I am deleting a directory tree on the server system from a Java application running on the client system. Java uses basic system calls (rmdir and unlink) to delete items.

Just to clarify, where are the actual directory commands being issued? Are you:

  1. Calling rmdir/unlink on the Mac, targeting the files in the SMB mount.

OR

  2. Telling your server app to delete those files "directly" and then viewing the changes through the SMB mount on the Mac?

Note that while race conditions are possible in both cases, they're all but guaranteed in the second.

Jumping to here:

It appears that there is a race condition. The operation to delete S apparently succeeded, but did not take effect immediately. The operation to delete D somehow overtook the previous operation and failed as a result.

What's the actual SMB server? More specifically, is it a Windows machine? I'm not sure how widely it's being used*, but the SMB2/3 delete works by:

*This was part of SMB2 which isn't exactly "new".

smb2fs_smb_delete(struct smb_share *share, struct smbnode *np, enum vtype vnode_type,
...
    /*
     * Looking at Win <-> Win with SMB 2/3, delete is handled by opening the file
     * with Delete and "Read Attributes", then a Set Info is done to set
     * "Delete on close", then a Close is sent.
     */
...

That's a fairly elegant approach, but I believe it can mean that a delete ends up being "deferred" because some other process/client has the directory open.

However, please keep in mind that this is only one example among many. Part of the nature of network file systems is simply that it's nearly impossible to create a network file system that:

  1. "Feels" like a local file system under normal usage conditions.

  2. Doesn't exhibit "weird behavior" under specific conditions and/or when monitored more closely.

In the category of weird, I'm not sure how these connect to each other:

A few seconds later, directory S disappeared (as viewed in Terminal on the server system)!

...

If I unmount the volume and reconnect, I see the same bad state in Terminal. Listing D shows S. Listing S gets the fts_read error.

Are you saying that the server and the client are persistently showing inconsistent results, particularly across unmounts? The unmount is important here because, as far as the system is concerned, it basically "forgets" everything it knows about the previous volume state when it unmounts the volume. So any data it's showing came from the server*.

*Is the client mounting multiple shares from the server, particularly shares that "overlap", so the client can "see" the same directory through two different mountpoints? Things become more complicated when multiple shares are involved because the client "knows" that both shares are from the same source.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

As I already mentioned, the client and server are both macOS systems and the server volume is offered using macOS file sharing. The connection is a few feet of ethernet. There are no multiple shares.

As you suggest, the persistence seems to be on the server side. Not the file system, but the network server. Using Unix tools on the server to view the directories produces the expected results, except for the delay. Using Unix tools on the client to view or delete the directory fails. This failure still occurs even now.

As I mentioned, the client is using ordinary Unix system calls to delete files and directories.

I just noticed that in some other applications that were installed by my program there is a .DS_Store file in the corresponding location. I do not have Calculate All Sizes enabled on the server system, but I notice that Finder displays the sizes of bundled applications. I suspect there is a problem with Finder trying to recalculate the size of the application concurrently with my program trying to delete the application. The problem might involve creating a .DS_Store file (a known problem) in D or it might just introduce a delay in the actual deletion of S.

My program already has retry logic to handle the spontaneous creation of .DS_Store files during a directory tree deletion, but it cannot handle the permanent failure introduced by file sharing.

Ordinary Unix calls...

I just noticed that in some other applications that were installed by my program there is a .DS_Store file in the corresponding location. I do not have Calculate All Sizes enabled on the server system, but I notice that Finder displays the sizes of bundled applications. I suspect there is a problem with Finder trying to recalculate the size of the application concurrently with my program trying to delete the application.

Assuming the underlying volume is an APFS volume, then I don't think it's actually calculating the size. One of APFS's lesser-known features is "Fast Directory Sizing", which works by having the file system calculate the size of a hierarchy and then storing (and updating based on changes) that value as metadata. It's not enabled everywhere, as it can have performance downsides as well, but app bundles are a nearly "ideal" use case and I believe the system specifically enables it there.

*Basically, "Fast Directory Sizing" forces the file system to perform a size update every time that directory hierarchy is modified, which can be "wasted" work on a directory that's frequently updated or never "looked at" by the user. App bundles are a great example of the opposite case: they're rarely updated and generally user visible.

The problem might involve creating a .DS_Store file (a known problem) in D or it might just introduce a delay in the actual deletion of S.

Possibly, though I think focusing too much on the specific cause can be misleading. Jumping back to here:

The failure on directory deletion is directory not empty. That should not happen because before attempting to delete the directory, my program deleted its contents.

First off, any file system is an inherently shared data structure that many processes can modify at any time. Strictly speaking, it isn't possible to create a program that is GUARANTEED to be able to delete any given directory: file deletion is generally slower than creation, so it's entirely possible for a process to create files faster than you can delete them, preventing deletion.

That dynamic is part of the reason why the "Trash" abstraction exists. Instead of trying to "block" modification to a given hierarchy/object, the Finder moves that object into a "special" directory which:

  1. Discourages further modification/interactions with those objects. For example, the Finder and LaunchServices both block launching apps that are in the trash, even though the lower level system will happily do so.

  2. Moves the object out of the user's normal "interaction space". For example, most users don't store their documents inside "Trash" even though they "could".

  3. Delays the final deletion, increasing the likelihood that the object won't be in use when the user ACTUALLY tries to delete it (by emptying the trash).

In the app context, the equivalent operation would be to move the object "somewhere else", then delete the object from that location. Typically you use NSFileManager.URLForDirectory:inDomain:appropriateForURL:create:error: to get a temp directory on the same volume, move the object to that location, then delete it from there.
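
For a program that is not macOS specific, the same "move it somewhere else, then delete it" idea can be approximated in plain Java. This is a rough sketch under stated assumptions, not the NSFileManager approach itself: Files.createTempDirectory stands in for URLForDirectory:inDomain:appropriateForURL:create:error:, the MoveThenDelete class, its method names, and the ".deleting-" prefix are hypothetical, and the caller is assumed to pass a holdingParent directory that lives on the same volume as the target.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: move a tree out of the way first, then delete it. */
final class MoveThenDelete {

    /**
     * Moves target into a fresh holding directory under holdingParent and
     * deletes it from there. holdingParent is assumed to be on the same
     * volume as target, so the move is a rename rather than a copy.
     */
    static void moveAsideAndDelete(Path target, Path holdingParent) throws IOException {
        Path holding = Files.createTempDirectory(holdingParent, ".deleting-");
        // ATOMIC_MOVE fails instead of silently falling back to copy + delete.
        Files.move(target, holding.resolve(target.getFileName()),
                   StandardCopyOption.ATOMIC_MOVE);
        deleteRecursively(holding);
    }

    /** Deletes root and everything below it, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse path order puts every entry ahead of its parent directory.
            deepestFirst = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }
}
```

The rename takes the object out of its original location in one step, so a failure during the later recursive delete leaves leftovers only inside the holding directory. It does not by itself guarantee that the delete succeeds on a network volume.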

Lastly, where are these files actually located on the user's machine? More specifically, if they're in the user's home directory, it's possible that some kind of cloud storage provider (iCloud or 3rd party) is entangled with this as well. That doesn't actually change the basic issue, but it could explain what was "interested" in this directory.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

First off, any file system is an inherently shared data structure that many processes can modify at any time. Strictly speaking, it isn't possible to create a program that is GUARANTEED to be able to delete any given directory: file deletion is generally slower than creation, so it's entirely possible for a process to create files faster than you can delete them, preventing deletion.

Of course, but that is not what is happening here. At most one file is being created, a .DS_Store file. Having encountered this problem before in a strictly local context, I changed my program to retry and that has worked (with local files).

If the problem is APFS delaying the deletion of the directory, then perhaps I need a longer timeout on my retry loop?

The real problem is not the failure to delete the directory; it is the inconsistent state that is created (as viewed by a network client). Apparently, the file server believes the directory exists (it shows up when remotely listing the parent) even after the directory ceases to exist (as viewed on the server file system). At the very least, when the network client tries to list or delete this directory, the file server should notice that the directory no longer exists, update its state and return a nonexistent file error. Instead, it returns a permissions error but does not update its state, so any future attempt to list or delete the directory will get the same error.

This bad state apparently lasted for days.

I consider this a bug in the file server.

I probably will modify my program to move the application before trying to delete it. I don't like the idea of adding macOS specific code to an application that is currently not OS specific, but that may be the only option. I believe that macOS specific code is needed to find a location (trash) where the Finder (or APFS) will not continue to try to calculate the application size. (Does moving the application to the trash stop APFS from recalculating its size?)

The directories are not in the home directory. They are on an external SSD volume.

Of course, but that is not what is happening here. At most one file is being created, a .DS_Store file. Having encountered this problem before in a strictly local context, I changed my program to retry and that has worked (with local files).

If the problem is APFS delaying the deletion of the directory, then perhaps I need a longer timeout on my retry loop?

To clarify, I don't think APFS itself is a key factor. How long is your current retry timing? Certainly comparing local and network volumes, much longer timeouts are often required for a network file system.
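
A retry loop with a configurable deadline, of the kind being discussed here, might look roughly like the following. This is a sketch under assumptions, not the poster's actual code: RetryingDelete and its method names are hypothetical, and the choice to retry only on "directory not empty" mirrors the failure described earlier in the thread. Each attempt walks the tree again, so a file that appears between attempts (such as a .DS_Store) is removed on the next pass.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Duration;

/** Hypothetical helper: delete a directory tree, retrying until a deadline. */
final class RetryingDelete {

    /** One depth-first deletion pass; entries that are already gone are skipped. */
    static void deleteTreeOnce(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.deleteIfExists(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc)
                    throws IOException {
                // An entry that vanished before it could be visited is not an error.
                if (exc instanceof NoSuchFileException) {
                    return FileVisitResult.CONTINUE;
                }
                throw exc;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                // Throws DirectoryNotEmptyException if something reappeared in dir.
                Files.deleteIfExists(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    /** Repeats the deletion pass until it succeeds or the timeout elapses. */
    static void deleteTree(Path root, Duration timeout, Duration pause)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                deleteTreeOnce(root);
                return;
            } catch (DirectoryNotEmptyException e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // give up and report the last failure
                }
                Thread.sleep(pause.toMillis());
            }
        }
    }
}
```

A call such as RetryingDelete.deleteTree(appPath, Duration.ofSeconds(60), Duration.ofMillis(500)) would keep trying for a minute (appPath being a placeholder for the tree to delete). A loop like this only helps with transient failures; it cannot clear a state in which the same error is returned indefinitely.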

The real problem is not the failure to delete the directory; it is the inconsistent state that is created (as viewed by a network client).

What does "viewed" here actually mean? How is the user actually "looking at" the file/directory?

The complication here is around exactly what happened here:

Apparently, the file server believes the directory exists (it shows up when remotely listing the parent) even after the directory ceases to exist (as viewed on the server file system).

The question here is which of these actually happened:

1) The server returned old data when the client asked.
OR
2) The client displayed old data and never asked the server.

Both of those are possible, but I'd generally consider #2 more likely than #1. The server has relatively "free" access* to the underlying data, so it's easy for it to just check the disk and see what's there. By contrast, the client is ACTIVELY trying to avoid server communication, which creates "space" for this kind of failure. Note that #2 doesn't necessarily mean the problem is the client. There's a complicated architecture here where the client is supposed to tell the server about directories it's "interested in" and the server is supposed to tell the client about changes to those location(s), so failure is possible in both components.

That leads back to an earlier issue here:

Using Unix tools on the client to view or delete the directory fails. This failure still occurs even now.

The big thing that still isn't clear to me is the difference between:

  • Part of the system (your app, the terminal, etc.) is seeing weird/old data.

vs

  • The "entire" system is seeing weird/old data, particularly the Finder and ESPECIALLY across a mount/unmount cycle.

I can see how the first case might happen depending on EXACTLY how the directory is being examined and manipulated. Whatever is viewing that data ends up in a state where the client has bad data AND the server doesn't know that the client is "interested" in that directory. The client keeps showing the bad data and the server never realizes there was any reason to update the client.

The second case is much, much stranger. Unmounting resets the client's "world" and, more importantly, I expect it to basically reset the server as well.

Moving to here:

The directories are not in the home directory. They are on an external SSD volume.

What's the file system on the SSD? APFS, HFS+, or something else?

At the very least, when the network client tries to list or delete this directory,

Yes and no. On the listing side, no, not necessarily. It may just be returning its local data and relying on the server to notify it of changes (which it should be). On the delete side... what's the specific command/function you're executing? It's possible it's effectively doing an "ls" first, which means it would fail in exactly the same way.

the file server should notice that the directory no longer exists, update its state and return a nonexistent file error. Instead, it returns a permissions error but does not update its state, so any future attempt to list or delete the directory will get the same error.

As I mentioned above, I would not assume that the server is the failure point here.

I probably will modify my program to move the application before trying to delete it. I don't like the idea of adding macOS specific code to an application that is currently not OS specific, but that may be the only option. I believe that macOS specific code is needed to find a location (trash) where the Finder (or APFS) will not continue to try to calculate the application size. (Does moving the application to the trash stop APFS from recalculating its size?)

Clarifying here, I don't think the size calculation is a factor here at all. Fast directory sizing is part of APFS's internal implementation and won't have any effect on the kind of issue you're describing. I only mentioned it to explain how the Finder could be getting an accurate directory size without a bunch of extra directory iterations.

Also, is the Finder actively open and monitoring these directories while all this is happening? The Finder doesn't normally create .DS_Store files for directories it isn't actually interacting with.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

To clarify, I don't think APFS itself is a key factor. How long is your current retry timing? Certainly comparing local and network volumes, much longer timeouts are often required for a network file system.

My timeout is currently 20 seconds, but I have previously used 60 seconds. I have never had a similar problem with local directory deletion. When local directories fail to delete, it is because of a spontaneously created .DS_Store file.

What does "viewed" here actually mean?

For example, using Terminal on the client system to list the directory, delete the directory, or list the parent directory.

What's the file system on the SSD? APFS, HFS+, or something else?

HFS+. I use this volume with various old macOS releases back to 10.10.

On the delete side... what's the specific command/function you're executing?

When diagnosing from Terminal, rmdir. Probably the same system call from my application.

As I mentioned above, I would not assume that the server is the failure point here.

I believe the problem survives a client reboot. I will double check that after I submit this reply.

Also, is the Finder actively open and monitoring these directories while all this is happening? The Finder doesn't normally create .DS_Store files for directories it isn't actually interacting with.

Yes (on the server system). And I have observed .DS_Store files after the deletion attempt that were not there before the deletion attempt. (The Finder is open because after updating the applications on the server system, the next thing I will do is run some of them.)

One more thing: I successfully implemented a workaround where the client moves the application to the (remote) trash instead of deleting it. That worked when the server system was 14.7.1, but fails after updating it to 14.7.2 (Operation not permitted).

My only current workaround is to rename the application (within the same directory) and leave it there for me to delete manually using the Finder on the server system.

I can confirm that after rebooting the client system, I get the same error report from Terminal on the client system.

Mac-mini:testSystem alan$ ll -R VAquaManager.app.1734032083499
total 32
drwxr-xr-x@ 1 alan  staff  16384 Dec 12 11:34 Contents

VAquaManager.app.1734032083499/Contents:
total 64
drwxr-xr-x@ 1 alan  staff  16384 Dec  8 09:37 runtime-arm
drwxr-xr-x@ 1 alan  staff  16384 Dec  8 09:37 runtime-x86

VAquaManager.app.1734032083499/Contents/runtime-arm:
total 32
drwxr-xr-x  1 alan  staff  16384 Dec 12 11:34 Contents

VAquaManager.app.1734032083499/Contents/runtime-arm/Contents:
total 0
ls: fts_read: Permission denied

Using Terminal on the server system, I get:

alan@Alans-iMac testSystem % ll -R VAquaManager.app.1734032083499
total 0
drwxr-xr-x@ 4 alan  staff  136 Dec 12 11:34 Contents

VAquaManager.app.1734032083499/Contents:
total 0
drwxr-xr-x@ 3 alan  staff  102 Dec  8 09:37 runtime-arm
drwxr-xr-x@ 3 alan  staff  102 Dec  8 09:37 runtime-x86

VAquaManager.app.1734032083499/Contents/runtime-arm:
total 0
drwxr-xr-x@ 2 alan  staff  68 Dec 12 11:34 Contents

VAquaManager.app.1734032083499/Contents/runtime-arm/Contents:
total 0

VAquaManager.app.1734032083499/Contents/runtime-x86:
total 0
drwxr-xr-x@ 3 alan  staff  102 Dec 12 11:35 Contents

VAquaManager.app.1734032083499/Contents/runtime-x86/Contents:
total 0
drwxr-xr-x@ 2 alan  staff  68 Dec 12 11:35 Home

VAquaManager.app.1734032083499/Contents/runtime-x86/Contents/Home:
total 0

I am somewhat surprised that moving the application to a directory that is not and has never been displayed by Finder (before trying to delete the application) does not fix the problem.

I am somewhat surprised that moving the application to a directory that is not and has never been displayed by Finder (before trying to delete the application) does not fix the problem.

There's an odd difference in the listing output that might explain the issue. The values for "runtime-arm" match:

Client:
drwxr-xr-x@ 1 alan  staff  16384 Dec  8 09:37 runtime-arm

Server:
drwxr-xr-x@ 3 alan  staff  102 Dec  8 09:37 runtime-arm

But the values for the contents of "runtime-arm" do NOT match:

Client: 
drwxr-xr-x  1 alan  staff  16384 Dec 12 11:34 Contents

Server:
drwxr-xr-x@ 2 alan  staff  68 Dec 12 11:34 Contents

The "@" symbol above indicates that an extended attribute has been attached, so what does the command: " xattr -lx <path> " return for the 4 objects above?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I'm not sure there is a real difference.

The client may have gotten an error when it tried to read the extended attributes of runtime-arm/Contents.

In any case, the extended attribute is com.apple.provenance.

The client may have gotten an error when it tried to read the extended attributes of runtime-arm/Contents.

What was the error? I'm not expecting the read to work; I want to see how it failed.

In any case, the extended attribute is com.apple.provenance.

What was the full value? And was that the only xattr?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I tried to replicate the situation on the server, but the newly copied application files do not have xattrs, so I am unable to say what error might have been reported. The file that I described had a value of 01 02 00 F5 28 1A 84 40 15 BA C9 when I inspected it on the server. However, its current value is empty. Other files in that application had the same attribute and it seems they all now have empty values. I do not see any other attributes.
