Making filecopy faster by changing block size

I'm using the filecopy function to copy many files and I noticed that it always takes longer than similar tools like cp or a Finder copy (I already did a comparison in my other post). What I didn't know before was that I can set the block size which apparently can have a big influence on how fast the file copy operation is.

The question now is: what should I consider before manually setting the block size? Does it make sense to have a block size that is not a power of 2? Can certain block sizes cause an error, such as a value that is too large (for the Mac the code is running on, or for the source and target devices)? When should or shouldn't I deviate from the default? Is there a way to find out the optimal block size for given source and target devices, or at least one that performs better than the default?

In the following sample code I tried to measure the average time for varying block sizes, but I'm not sure it's the best way to measure it, since each loop iteration can have wildly different durations.

import Cocoa

class AppDelegate: NSObject, NSApplicationDelegate {
    
    func applicationDidFinishLaunching(_ aNotification: Notification) {
        let openPanel = NSOpenPanel()
        openPanel.runModal()
        let source = openPanel.urls[0]
        openPanel.canChooseDirectories = true
        openPanel.canChooseFiles = false
        openPanel.runModal()
        let destination = openPanel.urls[0].appendingPathComponent(source.lastPathComponent)
        let date = Date()
        let count = 10
        for _ in 0..<count {
            try? FileManager.default.removeItem(at: destination)
            do {
                try copy(source: source, destination: destination)
            } catch {
                preconditionFailure(error.localizedDescription)
            }
        }
        print(-date.timeIntervalSinceNow / Double(count))
    }
    
    func copy(source: URL, destination: URL) throws {
        try source.withUnsafeFileSystemRepresentation { sourcePath in
            try destination.withUnsafeFileSystemRepresentation { destinationPath in
                let state = copyfile_state_alloc()
                defer {
                    copyfile_state_free(state)
                }
//                var bsize = Int32(16_777_216)
                var bsize = Int32(1_048_576)
                if copyfile_state_set(state, UInt32(COPYFILE_STATE_BSIZE), &bsize) != 0
                    || copyfile_state_set(state, UInt32(COPYFILE_STATE_STATUS_CB), unsafeBitCast(copyfileCallback, to: UnsafeRawPointer.self)) != 0
                    || copyfile_state_set(state, UInt32(COPYFILE_STATE_STATUS_CTX), unsafeBitCast(self, to: UnsafeRawPointer.self)) != 0
                    || copyfile(sourcePath, destinationPath, state, copyfile_flags_t(COPYFILE_ALL | COPYFILE_NOFOLLOW | COPYFILE_EXCL)) != 0 {
                    throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
                }
            }
        }
    }

    private let copyfileCallback: copyfile_callback_t = { what, stage, state, src, dst, ctx in
        if what == COPYFILE_COPY_DATA {
            if stage == COPYFILE_ERR {
                return COPYFILE_QUIT
            }
            var size: off_t = 0
            copyfile_state_get(state, UInt32(COPYFILE_STATE_COPIED), &size)
            let appDelegate = unsafeBitCast(ctx, to: AppDelegate.self)
            if !appDelegate.setCopyFileProgress(Int64(size)) {
                return COPYFILE_QUIT
            }
        }
        return COPYFILE_CONTINUE
    }
    
    private func setCopyFileProgress(_ progress: Int64) -> Bool {
        return true
    }
    
}

what should I consider before manually setting the block size?

Like most performance tuning questions, this doesn’t have an easy answer )-:

Does it make sense to have a block size that is not a power of 2?

No. Things will go better if your block size matches the allocation block size of the underlying volume, and that’ll always be a power of two.
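For example, you can read both the volume’s allocation block size and the kernel’s preferred transfer size with statfs. A minimal sketch (the path is a placeholder; f_bsize is the allocation block size, f_iosize the optimal transfer size):

import Foundation

var fs = statfs()
if statfs("/", &fs) == 0 {
    print("allocation block size:", fs.f_bsize)   // typically 4096 on APFS
    print("optimal transfer size:", fs.f_iosize)
}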

Can certain block sizes cause an error, such as a value that is too large (for the Mac the code is running on, or for the source and target devices)?

No.

Well, if you use a stupidly large number you could run out of memory, but I don’t think that’s what you’re asking about.

Is there a way to find out the optimal block size for given source and target devices … ?

copyfile already tries to do that. If you’re curious how that works, you can look at the Darwin source for it.

since each loop iteration can have wildly different durations.

Right.

If you’re going to do this sort of performance optimisation the first step is to come up with a reliable metric. Without that, you’re just wandering around in the dark.

The main problem with doing that is caching. There are at least two levels of caching that are relevant here:

  • The macOS file system cache, aka the UBC (unified buffer cache)

  • The disk drive’s cache

It’s not too hard to defeat the UBC:

  • Unmount the volume between runs.

  • Create a new source file for each run, using code that prevents the file from going into the UBC [1].

Defeating the disk drive’s cache is trickier. One sure-fire option is to power cycle the drive [2] but that’s a pain to set up. Probably the easiest option for you to try right now is to copy a large file, one that’s large enough to be unlikely to fit in the drive’s cache.


Finally, have you tried using NSFileManager for this? While it is ‘only’ a wrapper around copyfile, it does have a lot of smarts.
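For example (a minimal sketch; the URLs are placeholders):

import Foundation

let sourceURL = URL(fileURLWithPath: "/path/to/source")
let destinationURL = URL(fileURLWithPath: "/path/to/destination")
// FileManager decides its own buffering strategy on top of copyfile.
try FileManager.default.copyItem(at: sourceURL, to: destinationURL)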

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] To do this:

  1. Open a new file.

  2. Use fcntl to set F_NOCACHE.

  3. Write pseudo random data to the file.

  4. From a buffer that’s page aligned.

  5. With each write being a multiple of the page size (this means that the file offset is always a multiple of the page size).
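A minimal sketch of those steps (names are illustrative, and the total size is assumed to be a multiple of the page size):

import Foundation

// Illustrative only: write `size` bytes of pseudo random data, bypassing the UBC.
func createUncachedFile(at path: String, size: Int) throws {
    let pageSize = Int(getpagesize())
    precondition(size % pageSize == 0, "size must be a multiple of the page size")
    let fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0o644)            // 1. open a new file
    guard fd >= 0 else { throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno)) }
    defer { close(fd) }
    guard fcntl(fd, F_NOCACHE, 1) == 0 else {                           // 2. set F_NOCACHE
        throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
    }
    let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: pageSize, alignment: pageSize) // 4. page-aligned buffer
    defer { buffer.deallocate() }
    var remaining = size
    while remaining > 0 {
        for i in 0..<pageSize {                                          // 3. pseudo random data
            buffer[i] = UInt8.random(in: .min ... .max)
        }
        guard write(fd, buffer.baseAddress, pageSize) == pageSize else { // 5. page-multiple writes
            throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
        }
        remaining -= pageSize
    }
}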

[2] Assuming the drive doesn’t have a persistent cache, which can be a thing.

Thanks a lot for your valuable input!

Things will go better if your block size matches the allocation block size of the underlying volume

When you say "match", do you mean equal, or a multiple, or something else? I'm asking because the value returned by URLResourceKey.preferredIOBlockSizeKey for my Macintosh HD is 1048576, but significantly increasing the block size of filecopy from the "preferred" 1KB to 1MB or higher usually performs better.

I'm asking because the value returned by URLResourceKey.preferredIOBlockSizeKey for my Macintosh HD is 1048576, but significantly increasing the block size of filecopy from the "preferred" 1KB to 1MB or higher usually performs better.

Why do you say preferred 1KB, when in the same sentence you say the preferred block size is 1MB?

Why do you say preferred 1KB, when in the same sentence you say the preferred block size is 1MB?

Sorry, my mistake. Let me give you some actual results: when using the "preferred" 1_048_576 block size (returned by URLResourceKey.preferredIOBlockSizeKey) it takes about 6.15 seconds to copy a 7 GB file from my Mac to another folder on the same Mac. When using 16_777_216 block size, I get about 3 seconds instead. If I don't set the block size option for filecopy, I get about the same time as with the preferred block size, so I guess it's probably the same value.

My question is: what makes the 1_048_576 block size "preferred", since using 16_777_216 drastically increases the filecopy performance? And can we assume that increasing the block size will always give better performance for filecopy?

When you say "match", do you mean equal, or a multiple, or something else?

I meant “an even multiple of”. Sorry about the ambiguity.
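For example, with illustrative numbers, rounding a candidate buffer size down to an even multiple:

let allocationBlockSize = 4_096        // say, f_bsize from statfs
let requested = 1_000_000              // a candidate buffer size
let bsize = max(allocationBlockSize, (requested / allocationBlockSize) * allocationBlockSize)
// bsize == 999_424, an even multiple of 4_096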

what makes the 1_048_576 block size "preferred"

There’s a clear space / speed trade-off being made here. After all, a 7 GB buffer would be even faster, but you wouldn’t want to allocate that much memory.

Once you get reliable performance metrics, you should graph out this trade-off. It wouldn’t surprise me if 1 MiB was the ‘knee’ in that curve.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Things will go better if your block size matches the allocation block size of the underlying volume

Thanks again for your input. To clarify the "things will go better" part: does that mean that providing a block size that doesn't match can impact performance? Should I make sure to calculate the next power of two bigger than the file size to copy (with a maximum threshold of course), or can I simply do this:

var bsize = Int32(16_777_216)
if let size = Int32(exactly: file.size) {
    bsize = min(size, bsize)
}

I'd be amazed if that made any difference.

I'd be amazed if that made any difference.

Why? It sets a block size of 16 MiB for files that are larger than 16 MiB, or uses the file size itself for files that are smaller.

Why? It sets a block size of 16 MiB for files that are larger than 16 MiB, or uses the file size itself for files that are smaller.

What do you think happens for files that are smaller than the block size? Do you imagine that it somehow reads beyond the end of the file on the disk until it reaches the block size you have specified? No, it won't do that (for all sorts of reasons).

But don't listen to me. You were doing benchmarks above. Continue to use benchmarks to determine what is best.

Do you imagine that it somehow reads beyond the end of the file on the disk until it reaches the block size you have specified?

In case you're wondering why I don't simply use a fixed block size of 16_777_216: I thought that the larger the allocated space, the more time it would take to allocate it, and the less space would be available to the rest of the system while the copy is in progress. It may be a negligible time difference, but since I can avoid it with a very simple calculation, why not do it?

does that mean that providing a block size that doesn't match can impact performance?

Yes.

However, not in the example you outlined. It’s important when the buffer size is less than the file size because it keeps all transfers aligned within the file, which facilitates uncached I/O.

I thought that the larger the allocated space, the more time it would take to allocate it

I don’t think you should be optimising at that level at this time. As endecotp pointed out, it’s time to benchmark and use that to guide your optimisation.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

It’s important when the buffer size is less than the file size because it keeps all transfers aligned within the file, which facilitates uncached I/O.

Good, that's what I was aiming at: use a block size equal to the file size up to a maximum power of 2. I don't think I was really trying to optimize the buffer allocation time; rather it didn't feel right to allocate an arbitrarily large buffer. After all, space is constrained and there is a reason why we can specify the block size, or it would automatically be set to infinity... right?

it’s time to benchmark and use that to guide your optimisation

Here are my results when copying a file with a given size using varying block sizes and 5 repetitions. (Explicitly writing random data to the file makes the file creation quite slow, but simply writing a buffer with uninitialized data seems to work as well.)

  • Different file sizes seem to follow a similar trend. Using a multiple of 1000 or a multiple of the page size also doesn't seem to make a difference in the overall trend.
  • The lowest block sizes, 1'024 and 2'048, seem to be special and cause a very fast copy.
  • From 4'096 upwards, the time decreases...
  • ...up to 65'536, where it suddenly increases again; from then on it decreases consistently.
  • The bigger the file, the higher the block size needs to be to notice a difference.
    • With a 100 MB file, increasing the block size from 1'048'576 to 2'097'152 makes the operation about twice as fast, with little improvements above that block size.
    • With a 1 GB file, increasing the block size from 1'048'576 to 8'388'608 makes the operation about twice as fast, with little improvements above that block size.
  • Without using F_NOCACHE, the operation gets slowly faster when increasing the block size from 1'048'576, and then gets slower again from 8'388'608 upwards. Not sure if that means anything.

Here are the graphs for a 100 MB and a 1 GB file.

Copying a 100 MB file:

Copying a 1 GB file:

Copying a 100 MB file created without F_NOCACHE:

And here is the code:

import Cocoa

class AppDelegate: NSObject, NSApplicationDelegate {

    func applicationDidFinishLaunching(_ aNotification: Notification) {
        print("page size", getpagesize())
        let openPanel = NSOpenPanel()
        openPanel.canChooseDirectories = true
        openPanel.runModal()
        test(url: openPanel.urls[0])
    }
    
    func test(url: URL) {
        let source = url.appendingPathComponent("file source")
        let destination = url.appendingPathComponent("file destination")
        let fileSizes = [1_000, 1_000_000, 10_000_000, 100_000_000, 1_000_000_000, 10_000_000_000, Int(getpagesize()) * 10_000]
        let blockSizes: [Int32] = (10..<31).map({ 1 << $0 })
        let repetitions = 5
        var times = [[TimeInterval]](repeating: [TimeInterval](repeating: 0, count: repetitions), count: blockSizes.count)
        for fileSize in fileSizes {
            print("fileSize", fileSize)
            for (i, blockSize) in blockSizes.enumerated() {
                print("blockSize", blockSize)
                for j in 0..<repetitions {
                    try? FileManager.default.removeItem(at: destination)
                    var date = Date()
                    print("create", terminator: " ")
                    createFile(source: source, size: fileSize)
                    print(-date.timeIntervalSinceNow)
                    date = Date()
                    print("copy", terminator: " ")
                    do {
                        try copy(source: source, destination: destination, blockSize: blockSize)
                    } catch {
                        preconditionFailure(error.localizedDescription)
                    }
                    let time = -date.timeIntervalSinceNow
                    times[i][j] = time
                    print(time)
                }
                let average = times[i].reduce(0, +) / Double(repetitions)
                print("average copy", average)
                print()
            }
            
            let header = blockSizes.map({ NumberFormatter.localizedString(from: $0 as NSNumber, number: .decimal) }).joined(separator: "\t")
            try! Data(([header] + (0..<repetitions).map { j in
                (["\(j)"] + (0..<blockSizes.count).map { i in
                    return timeToString(times[i][j])
                }).joined(separator: "\t")
            }).joined(separator: "\n").utf8).write(to: url.appendingPathComponent("results \(fileSize).tsv"))
        }
    }
    
    func timeToString(_ time: TimeInterval) -> String {
        return String(format: "%.6f", time)
    }
    
    func createFile(source: URL, size: Int) {
        // Page-aligned buffer; deliberately left uninitialized (see note above).
        let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: size, alignment: Int(getpagesize()))
        defer {
            buffer.deallocate()
        }
//        for i in 0..<size {
//            buffer[i] = UInt8.random(in: 0...255)
//        }
        let fp = fopen(source.path, "w")
        assert(fp != nil)
        // Bypass the UBC so each run reads from disk rather than from cache.
        let success = fcntl(fileno(fp), F_NOCACHE, 1)
        assert(success == 0)
        let bytes = fwrite(buffer.baseAddress!, 1, size, fp)
        assert(bytes == size)
        fclose(fp)
    }

    func copy(source: URL, destination: URL, blockSize: Int32) throws {
        try source.withUnsafeFileSystemRepresentation { sourcePath in
            try destination.withUnsafeFileSystemRepresentation { destinationPath in
                let state = copyfile_state_alloc()
                defer {
                    copyfile_state_free(state)
                }
                var blockSize = blockSize
                if copyfile_state_set(state, UInt32(COPYFILE_STATE_BSIZE), &blockSize) != 0
                    || copyfile_state_set(state, UInt32(COPYFILE_STATE_STATUS_CB), unsafeBitCast(copyfileCallback, to: UnsafeRawPointer.self)) != 0
                    || copyfile_state_set(state, UInt32(COPYFILE_STATE_STATUS_CTX), unsafeBitCast(self, to: UnsafeRawPointer.self)) != 0
                    || copyfile(sourcePath, destinationPath, state, copyfile_flags_t(COPYFILE_ALL | COPYFILE_NOFOLLOW | COPYFILE_EXCL)) != 0 {
                    throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
                }
            }
        }
    }

    private let copyfileCallback: copyfile_callback_t = { what, stage, state, src, dst, ctx in
        if what == COPYFILE_COPY_DATA {
            if stage == COPYFILE_ERR {
                return COPYFILE_QUIT
            }
        }
        return COPYFILE_CONTINUE
    }

}

I also tested whether copying a 1 KB file performs better with a 1'000 or a 1'024 block size, and the first iteration of the whole test is always an outlier. Am I still doing something wrong?
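One mitigation I'm considering, assuming the outlier is just warm-up cost (caches, page-ins and so on), is an untimed warm-up copy followed by reporting the median instead of the mean:

// Sketch: one untimed warm-up copy, then the median of the timed runs.
try? FileManager.default.removeItem(at: destination)
try copy(source: source, destination: destination, blockSize: blockSize)
var samples = [TimeInterval]()
for _ in 0..<repetitions {
    try? FileManager.default.removeItem(at: destination)
    let start = Date()
    try copy(source: source, destination: destination, blockSize: blockSize)
    samples.append(-start.timeIntervalSinceNow)
}
print("median copy", samples.sorted()[samples.count / 2])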

By the way, I just had to learn the hard way that this code

var blockSize = UInt32(16_777_216)
if copyfile_state_set(state, UInt32(COPYFILE_STATE_BSIZE), &blockSize) != 0 {
    throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
}

always throws on macOS 11 and older with POSIX error 22 (Invalid argument), regardless of the provided block size. There's no mention of this in the documentation, and I had assumed it would just work until I got bug reports from users running older systems.
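So, assuming the cut-off really is macOS 12, I now only set the option where it's known to work:

var blockSize = UInt32(16_777_216)
if #available(macOS 12, *) {
    // COPYFILE_STATE_BSIZE fails with EINVAL on macOS 11 and older (see above).
    if copyfile_state_set(state, UInt32(COPYFILE_STATE_BSIZE), &blockSize) != 0 {
        throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
    }
}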
