Unexpected behavior of concurrent code?

I am trying to run some long-running code asynchronously on a dispatch queue in concurrent and serial mode. I am getting much better execution times in serial mode (and when using DispatchQueue.sync). When I check CPU utilization, I see all my cores 100% utilized in concurrent mode and 3 cores barely running in serial mode, yet serial is about 3x faster...


import Foundation
var data = [Int].init(repeating: 0, count: 10_000_000)
let totalStartTime = Date()
var g = DispatchGroup()
var q = DispatchQueue(label: "myQueue", qos: .default, attributes: [.concurrent])
(0..<8).forEach { i in
    q.async(group: g) {
        let startTime = Date()
        for i in data.indices { data[i] = Int(arc4random_uniform(1000)) }
        print("\((Date().timeIntervalSince(startTime)))", "s")
    }
}
g.wait()
print("Total:", "\((Date().timeIntervalSince(totalStartTime)))", "s")


Output from concurrent, using attributes: [.concurrent]:

10.6698750257492 s
10.6752609610558 s
10.6778860092163 s
10.6787539720535 s
10.6814050078392 s
10.6833950281143 s
10.6855019927025 s
10.6869139671326 s
Total: 10.6888610124588 s

Output from serial, using attributes: []:

0.451411008834839 s
0.437644004821777 s
0.432780981063843 s
0.431493043899536 s
0.447892963886261 s
0.426267981529236 s
0.42144501209259 s
0.421342968940735 s
Total: 3.472864985466 s


I'd expect the individual times to be approximately the same in concurrent and serial mode, and the total time to be several times faster in concurrent mode, not the other way around. Am I missing anything?


Replies

It’s hard to say exactly what’s going on here because you’ve only shown the concurrent version of your code. However, I do have two specific concerns:

  • You’re accessing data from multiple threads, which is not safe.

  • arc4random has an internal lock to protect its internal data structures, so I fear your concurrent code is just bashing its head against that lock. (One way to sidestep both issues is sketched after this list.)
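For illustration, here is one possible way to address both points. It is only a sketch, using DispatchQueue.concurrentPerform and a small hand-rolled SplitMix64 generator; both are illustrative choices, not something prescribed in this thread. Each task fills a disjoint chunk of the array with its own generator, so there is no shared mutable state and no shared lock to contend on:

import Foundation

// Minimal per-task PRNG (SplitMix64), so nothing contends on a shared lock.
struct SplitMix64: RandomNumberGenerator {
    var state: UInt64
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

let count = 10_000_000
let chunks = 8
let chunkSize = count / chunks
var data = [Int](repeating: 0, count: count)

let start = Date()
data.withUnsafeMutableBufferPointer { buffer in
    // concurrentPerform blocks until all iterations finish, so no group is needed.
    DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
        var rng = SplitMix64(state: UInt64(chunk) &+ 1)   // per-task generator
        let lower = chunk * chunkSize
        let upper = (chunk == chunks - 1) ? count : lower + chunkSize
        // Each chunk writes a disjoint index range, so no two tasks touch the same element.
        for i in lower..<upper {
            buffer[i] = Int.random(in: 0..<1000, using: &rng)
        }
    }
}
print("Total:", Date().timeIntervalSince(start), "s")

Whether this actually scales with the core count still depends on how much of the per-element cost is computation versus memory traffic, which is what the final reply in this thread gets at.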

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

Thanks for the reply and great comments. I have added a commented-out line that switches the queue from concurrent to serial; to see the serial behavior, comment out the line "var q..." and uncomment the following one. I also moved the array into the closure, so each thread now has its own array.


I replaced the arc4random code with Date and the overall execution time is now significantly shorter. The concurrent version is about 2x faster than the serial one; that's on a 4-core CPU with HT. Each individual run is way longer in the concurrent version than it is in the serial version.


1. Is the relative performance gain of about 2x that I am seeing now all I should expect to get with 4 cores and HT? Just trying to understand the limits and the overhead...

2. Isn't the absolute performance suspiciously good? Does 0.5 to 1.5 ns per operation like "data[i] = data[i].addingTimeInterval(1_000)" sound reasonable on a ~2.3 GHz CPU, or am I working with numbers where the actual activity was optimized out by the compiler?




import Foundation

let totalStartTime = Date()
var g = DispatchGroup()
var q = DispatchQueue(label: "myQueue", qos: .userInitiated, attributes: [.concurrent])
/* var q = DispatchQueue(label: "myQueue", qos: .default, attributes: []) */

(0..<8).forEach { i in
    q.async(group: g) {
        let startTime = Date()
        var data = [Date].init(repeating: Date(), count: 10_000_000)
        for i in data.indices { data[i] = data[i].addingTimeInterval(1_000) }
        let time = Date().timeIntervalSince(startTime)
        print(time, "s")
    }
}

g.wait()
print("Total:", "\((Date().timeIntervalSince(totalStartTime)))", "s")
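Regarding question 2 above: one simple way to rule out dead-code elimination is to make each task's result observable, for example by folding the array into a checksum after the timed loop. The variant below is a hypothetical tweak, not part of the original post, and it reuses the q and g defined in the code above. If the timings stay roughly the same, the loop was not optimized away:

q.async(group: g) {
    let startTime = Date()
    var data = [Date].init(repeating: Date(), count: 10_000_000)
    for i in data.indices { data[i] = data[i].addingTimeInterval(1_000) }
    let time = Date().timeIntervalSince(startTime)
    // Reading every element keeps the loop observable, so the optimizer
    // cannot treat it as dead code. The checksum pass runs after the timed
    // region, so it does not inflate the measurement.
    let checksum = data.reduce(0.0) { $0 + $1.timeIntervalSinceReferenceDate }
    print(time, "s", "(checksum: \(checksum))")
}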


Concurrent:

0.151054978370667 s
0.152642011642456 s
0.157088994979858 s
0.157481014728546 s
0.157347977161407 s
0.154221951961517 s
0.157037019729614 s
0.156207025051117 s
Total: 0.194258987903595 s


Serial:

0.057561993598938 s
0.0358740091323853 s
0.0370190143585205 s
0.0376629829406738 s
0.0370659828186035 s
0.0368690490722656 s
0.0420860052108765 s
0.0410929918289185 s
Total: 0.404715001583099 s

Creating micro-benchmarks is hard. To do it right you need to have a good understanding of how the CPU actually works, so you can focus on the specific part of the CPU you’re trying to benchmark. I’m not really the right person to help with that; while I have a reasonable understanding of the problem, my expertise is more on the I/O side of things.

With regard to your current code, a slight performance boost seems reasonable to me because you’re not actually testing core speed but memory bandwidth. Let me explain why.

addingTimeInterval(_:) is effectively a single floating point (Double) add. So your inner loop consists of:

  1. Read memory into a floating point register

  2. Add a constant to that register

  3. Write the floating point register to memory

Modern CPUs typically have tonnes of computation power (step 2), so this is constrained by the memory accesses (steps 1 and 3). In short, you’ve created a fancy memcpy (-:

Now, the real memcpy can saturate the memory bus from a single thread. It’s unlikely that the compiler has managed to construct code to do this as efficiently as that, so it stands to reason you’ll get some boost from multi-threading. However, those threads quickly saturate the memory bus, and thus you can’t get the per-core scaling you’d expect.

Taking a step back, I recommend you re-think your benchmarking, with the goal of producing a less ‘micro’ benchmark that more accurately reflects your product’s intended use case.
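To illustrate the memory-bound versus compute-bound distinction, here is a sketch of a more compute-heavy variant. It is not something from this thread: the per-element arithmetic, element count, and task count are arbitrary illustrative choices. Because each element needs far more arithmetic than memory traffic, the memory bus stops being the bottleneck and the per-task times should scale much closer to the number of physical cores:

import Foundation

let elements = 1_000_000
let tasks = 8

let start = Date()
DispatchQueue.concurrentPerform(iterations: tasks) { task in
    let taskStart = Date()
    var local = [Double](repeating: 0, count: elements)
    for i in local.indices {
        var x = Double(i % 97) + 1.0
        // A chain of dependent arithmetic keeps the FPU busy, so memory
        // accesses are a small fraction of the per-element cost.
        for _ in 0..<100 {
            x = (x * 1.000001 + 0.5).squareRoot()
        }
        local[i] = x
    }
    // Consuming the result keeps the optimizer from discarding the loop.
    print("task \(task):", local.reduce(0.0, +), "in", Date().timeIntervalSince(taskStart), "s")
}
print("Total:", Date().timeIntervalSince(start), "s")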

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"