For clarity, let me call the ARM machine the client system and the x86 machine the server system.
What are the network conditions between these two machines? What are the latency and bandwidth of the connection? Note that latency in particular has a huge effect here.
I am deleting a directory tree on the server system from a Java application running on the client system. Java uses basic system calls (rmdir and unlink) to delete items.
Just to clarify, where are those directory commands actually being issued? Are you:
- Calling rmdir/unlink on the Mac, targeting the files in the SMB mount.
OR
- Telling your server app to delete those files "directly" and then viewing the changes through the SMB mount on the Mac?
Note that while race conditions are possible in both cases, they're all but guaranteed in the second.
Jumping to here:
It appears that there is a race condition. The operation to delete S apparently succeeded, but did not take effect immediately. The operation to delete D somehow overtook the previous operation and failed as a result.
What's the actual SMB server? More specifically, is it a Windows machine? I'm not sure how widely it's being used*, but the SMB2/3 delete works by:
*This was part of SMB2 which isn't exactly "new".
smb2fs_smb_delete(struct smb_share *share, struct smbnode *np, enum vtype vnode_type,
...
/*
* Looking at Win <-> Win with SMB 2/3, delete is handled by opening the file
* with Delete and "Read Attributes", then a Set Info is done to set
* "Delete on close", then a Close is sent.
*/
...
That's a fairly elegant approach, but I believe it can mean that a delete ends up being "deferred" because some other process/client has the directory open.
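When a delete has been deferred like that, the visible symptom on the client is often an rmdir that fails with ENOTEMPTY (or EEXIST on some systems) even though the directory "looks" empty. One common mitigation is to retry the rmdir briefly and give the pending "delete on close" time to land. A hedged sketch; the function name, retry count, and delay are arbitrary choices of mine, not values from any Apple source:

```c
// Retry rmdir(2) a few times when it fails with ENOTEMPTY/EEXIST,
// on the theory that a child's "delete on close" is still pending
// on the server and will complete shortly.
#include <errno.h>
#include <unistd.h>

static int rmdir_with_retry(const char *path, int attempts) {
    for (int i = 0; i < attempts; i++) {
        if (rmdir(path) == 0)
            return 0;
        if (errno != ENOTEMPTY && errno != EEXIST)
            return -1;                 // some other failure; give up now
        usleep(100 * 1000);            // 100 ms pause before trying again
    }
    return -1;                         // errno from the last rmdir is preserved
}
```

This only papers over the race, of course; it doesn't remove it, and it does nothing for the case where another client holds the file open indefinitely.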
However, please keep in mind that this is only one example among many. It's simply part of the nature of network file systems that it's nearly impossible to create one that:
- "Feels" like a local file system under normal usage conditions.
- Doesn't exhibit "weird behavior" under specific conditions and/or when monitored more closely.
In the category of weird, I'm not sure how these connect to each other:
A few seconds later, directory S disappeared (as viewed in Terminal on the server system)!
...
If I unmount the volume and reconnect, I see the same bad state in Terminal. Listing D shows S. Listing S gets the fts_read error.
Are you saying that the server and the client are persistently showing inconsistent results, particularly across unmounts? The unmount is important here because, as far as the system is concerned, it basically "forgets" everything it knows about the previous volume state when it unmounts the volume. So any data it's showing came from the server*.
*Is the client mounting multiple shares from the server, particularly shares that "overlap", so the client can "see" the same directory through two different mountpoints? Things become more complicated when multiple shares are involved because the client doesn't "know" that both shares are from the same source.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware