API to mark files as duplicate?

APFS currently supports copy-on-write and can internally store files as deltas of each other, which is great. A feature like this usually calls for two operations to be supported:

  1. cloning or reflinking, which is available on the file level via clonefile(2).
  2. marking two files as duplicates of each other so they would become clones, which is the subject of this question.
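
For reference, operation (1) is a single call today. A minimal sketch, assuming macOS and a volume that supports cloning; clone_path is a hypothetical wrapper name, and the non-Apple branch exists only so the snippet compiles elsewhere:

```c
#include <errno.h>

#ifdef __APPLE__
#include <sys/attr.h>
#include <sys/clonefile.h>
#endif

/* Clone src to dst. dst must not exist yet.
 * Returns 0 on success, -1 on failure with errno set. */
int clone_path(const char *src, const char *dst)
{
#ifdef __APPLE__
    /* flags = 0: default behavior (no CLONE_NOFOLLOW, no CLONE_NOOWNERCOPY) */
    return clonefile(src, dst, 0);
#else
    /* Assumption: no clonefile(2) outside macOS, so just report failure. */
    (void)src;
    (void)dst;
    errno = ENOTSUP;
    return -1;
#endif
}
```

Note that clonefile(2) only creates a new file; it cannot turn two existing files into clones, which is exactly the gap described in (2).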


People are scratching their heads writing deduplication software that tries to emulate (2) using (1), but the implementation always comes out a bit sloppy. Sure, you can clone one file over its duplicate and copy the Unix attributes and everything back onto the new "clone", and then you get reminded that there are also ACLs, extra forks, and all those extended attributes. This is inelegant and easy to get wrong. APFS needs to provide such a feature as a primitive from the driver itself.
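
To make the problem concrete, here is roughly what the emulation looks like. This is only a sketch under several assumptions: macOS, both paths on the same volume, the caller has already verified that the contents are byte-identical, and nobody touches either file in the meantime. The function name replace_with_clone and the ".dedupe-tmp" suffix are made up for illustration. It leans on copyfile(3) with COPYFILE_METADATA to carry over stat data, ACLs, and extended attributes, and it still ignores things like hard links to the duplicate:

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

#ifdef __APPLE__
#include <copyfile.h>
#include <sys/attr.h>
#include <sys/clonefile.h>
#endif

/* Replace dup with a clone of keep while trying to preserve dup's metadata.
 * Returns 0 on success, -1 on failure with errno set. */
int replace_with_clone(const char *keep, const char *dup)
{
#ifdef __APPLE__
    char tmp[PATH_MAX];
    int n = snprintf(tmp, sizeof tmp, "%s.dedupe-tmp", dup); /* hypothetical temp name */
    if (n < 0 || (size_t)n >= sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Step 1: clone the file we keep into a temporary sibling of dup. */
    if (clonefile(keep, tmp, 0) != 0)
        return -1;

    /* Step 2: copy dup's stat info, ACL, and xattrs onto the clone,
     * Step 3: then move the clone over dup. */
    if (copyfile(dup, tmp, NULL, COPYFILE_METADATA) != 0 ||
        rename(tmp, dup) != 0) {
        int saved = errno;
        unlink(tmp);   /* best-effort cleanup of the temporary clone */
        errno = saved;
        return -1;
    }
    return 0;
#else
    /* Assumption: macOS-only APIs; report failure elsewhere. */
    (void)keep;
    (void)dup;
    errno = ENOTSUP;
    return -1;
#endif
}
```

Every step here is a separate syscall with its own failure mode, and nothing guarantees the two files are still identical by the time rename runs. A filesystem-level primitive would not have that race.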


Linux is not exactly the place people should look to for good interface design, but the interface that originated in Btrfs (and has since been generalized to other filesystems) is fairly good. Instead of operating on whole files, it gives a bit more control by addressing file data: each file is treated as a contiguous blob from which ranges are selected by offset and length. Here are the two operations we have seen again, with a bit more flavor, spread across three ioctls:

  1. ioctl_ficlone(2): clone all the data from one fd into another.
  2. ioctl_ficlonerange(2): clone the data from one fd into another, but only the range given by an offset-length pair.
  3. ioctl_fideduperange(2): take two fds, two offsets, and a length, and tell the filesystem to let file1[off1, off1+len) and file2[off2, off2+len) share storage if they are identical.
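
For comparison, a minimal sketch of the third call on Linux. It assumes a filesystem that supports deduplication (Btrfs or XFS, for example) and a destination fd that the caller is allowed to modify; dedupe_range is a hypothetical wrapper name, and the non-Linux branch exists only so the snippet compiles elsewhere:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#ifdef __linux__
#include <linux/fs.h>
#endif

/* Ask the filesystem to share storage between src[src_off, src_off+len)
 * and dst[dst_off, dst_off+len).
 * Returns the number of bytes deduplicated, 0 if the ranges differ,
 * or -1 on failure with errno set. */
long long dedupe_range(int src_fd, uint64_t src_off,
                       int dst_fd, uint64_t dst_off, uint64_t len)
{
#ifdef __linux__
    /* One header plus one destination record. */
    struct file_dedupe_range *arg =
        calloc(1, sizeof *arg + sizeof(struct file_dedupe_range_info));
    if (arg == NULL)
        return -1;

    arg->src_offset = src_off;
    arg->src_length = len;
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst_fd;
    arg->info[0].dest_offset = dst_off;

    long long result = -1;
    int saved = 0;
    if (ioctl(src_fd, FIDEDUPERANGE, arg) != 0) {
        saved = errno;                      /* the call itself failed */
    } else if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME) {
        result = (long long)arg->info[0].bytes_deduped;
    } else if (arg->info[0].status == FILE_DEDUPE_RANGE_DIFFERS) {
        result = 0;                         /* contents differ, nothing shared */
    } else {
        saved = -arg->info[0].status;       /* negative errno for this destination */
    }

    free(arg);
    if (result < 0)
        errno = saved;
    return result;
#else
    /* Assumption: FIDEDUPERANGE is Linux-only; report failure elsewhere. */
    (void)src_fd; (void)src_off; (void)dst_fd; (void)dst_off; (void)len;
    errno = ENOTSUP;
    return -1;
#endif
}
```

The kernel does the comparison and the sharing in one step, so the caller neither has to trust an earlier hash nor fix up any metadata afterwards, since both files keep their own inodes.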


With these primitives, programmers will find their work deduplicating files much easier. Apple should seriously consider adding these interfaces to take advantage of what APFS is capable of.

Replies

APFS needs to provide such a feature as a primitive from the driver itself.

The best way to get this feedback in front of the folks who actually work on the code is to put it in an enhancement request.

Please post your bug number, just for the record.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

As a developer of a (not so sloppy ;-) ) deduplication app, diskDedeupe, I fully agree that APFS could support deduplication better. Apple has probably decided not to implement online deduplication because of its high resource requirements (e.g. the memory usage of ZFS when online deduplication is enabled).


To implement improved offline deduplication, block-level operations would be very helpful. Time-consuming hashing could also become obsolete if APFS provided hashes for blocks at the filesystem level. These hashes could also be used for data consistency checks.