Attempts to apply concurrency to modulus loop

I wish to perform the following function on large arrays of UInt32s in Swift:


for i in 0..<array.count {
    array[i] = array[i] % UInt32(i + 1)
}
As each element is changed independently of the others, this function looks like a good candidate for parallelisation, so I started looking into GCD methods.


My first attempt (in a Playground) used dispatch_async:


import Cocoa
import XCPlayground
XCPlaygroundPage.currentPage.needsIndefiniteExecution = true

let myQueue = dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0)

func moduloFunc(inout thisArray: [UInt32]) {
    let len = thisArray.count
    for i in 0..<len {
        // each block is queued asynchronously, so moduloFunc
        // returns before the blocks have actually run
        dispatch_async(myQueue) {
            thisArray[i] = thisArray[i] % UInt32(i + 1)
        }
    }
}

// test1 is a [UInt32] array with 100,000+ elements

let testLen = test1.count

moduloFunc(&test1)

print((test1.filter({$0 > UInt32(testLen)})).count)
// prints how many elements are still unmodified after the function returns


When run, moduloFunc works through the array about twice as fast as the regular for-loop, but because the blocks are dispatched asynchronously, the function returns (and the array is passed on to print) before they have all finished; consequently, the print shows most of the array unmodified. Most of what I've read about GCD says I should add a 'completion handler' to moduloFunc, but I'm unclear how that would be implemented in this case (and many of the examples online are based on Objective-C).
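The closest I can piece together is a dispatch_group version, something like this (an untested sketch, reusing the Playground setup above):


let group = dispatch_group_create()

for i in 0..<test1.count {
    dispatch_group_async(group, myQueue) {
        test1[i] = test1[i] % UInt32(i + 1)
    }
}

// this block is only run once every block queued above has finished
dispatch_group_notify(group, myQueue) {
    print((test1.filter({$0 > UInt32(testLen)})).count)
}


but I don't know if that's the right approach.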


If dispatch_async is replaced with dispatch_sync, all elements are successfully modified, but the function runs only about 20% faster than the regular for-loop.


Following on from this post and this link about concurrent loops, I tried using dispatch_apply instead:


let myQueue = dispatch_queue_create("myQueue", DISPATCH_QUEUE_CONCURRENT)

func moduloFunc(inout thisArray: [UInt32]) {
    let len = thisArray.count

    dispatch_apply(len, myQueue) {i in
        thisArray[i] = thisArray[i] % UInt32(i + 1)
    }
}


This worked better: the function went through the loop as fast as the dispatch_async example and most of the elements were modified — but still not all of them. (A few hundred remained unchanged, generally.) Changing DISPATCH_QUEUE_CONCURRENT to DISPATCH_QUEUE_SERIAL produced the same result as the dispatch_sync example.


I sense I'm missing something obvious here. Or am I going about this completely the wrong way?


Running on a 2008 8-core Mac Pro, OS X 10.11.6, Xcode 7.3.1, Swift 2.2


Replies

I wish to perform the following function on large arrays of UInt32s in Swift:

How big is this array? If it’s huge, the maximum performance is going to be determined by the machine’s bandwidth to main memory, so throwing more cores at the problem is unlikely to be a big win. Problems like this (large amounts of memory, small amounts of computation per byte) are better tackled with vector instructions.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

You might want to consider using dispatch_barrier_async to queue the print statement after the other async blocks complete.
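For example (a sketch only, reusing the names from your Playground code; note that a barrier only takes effect on a queue you create yourself, not on a global queue):


let barrierQueue = dispatch_queue_create("barrierQueue", DISPATCH_QUEUE_CONCURRENT)

for i in 0..<test1.count {
    dispatch_async(barrierQueue) {
        test1[i] = test1[i] % UInt32(i + 1)
    }
}

// the barrier block waits for all previously queued blocks to drain
dispatch_barrier_async(barrierQueue) {
    print((test1.filter({$0 > UInt32(testLen)})).count)
}


("barrierQueue" is just a placeholder label.)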

The arrays are large: as low as 100,000; usually running into the tens or hundreds of millions (or larger).


By "vector instructions", are you referring to Metal?

I was thinking about CPU SIMD instructions (if you’re on the Mac, that means one of the various flavours of SSE). You can access these via the very lightweight <simd/simd.h> abstraction (it’s very nice in Swift, just import simd).

Doing this on the GPU (via Metal or OpenCL) would be even faster but you have to worry about getting the data to and from the GPU. Again, for simple calculations like this the upper bound is set by the system memory bandwidth.

How does this bazillion element array flow through the system? Do you read it all from disk, crunch it, and then write it back out? Or do you set it up once and then run lots of crunch passes? With data of this size the overall flow becomes critical. For example:

  • If you’re reading and writing the data, it’s important to overlap the I/O and the crunch

  • If you’re looping on the data, you can win by doing things in chunks and assigning different chunks to different CPUs (and, most importantly, their per-CPU caches).

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

SIMD is completely new to me. I'll get back to you after experiments. Any tutorial suggestions for a beginner?

Any tutorial suggestions for a beginner?

On SIMD in general? Or on Swift’s simd module?

SIMD is an industry standard thing so, while I don’t have any specific resources to recommend, it shouldn’t be hard to find tutorials for it.

Swift’s simd module is relatively new but a quick ’net search turned up Russ Bishop’s “Swift 2: SIMD”, which seems like a good place to start.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

I've had a first look over SIMD in Swift. So far it's not so helpful. Currently it only supports vectors of Doubles, Floats and signed Int32s, not the unsigned Int32s in my loop, so there's a conversion to factor in. Also, SIMD (in general) appears to have no modulus operator, so I'd have to write a function for that. From the latest watchOS docs, it looks like Swift 3 will add UInt32 support to SIMD, so I may wait for that.
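For reference, this is roughly the kind of helper I think I'd have to write, using the identity a % b = a - b * floor(a / b) on float4 lanes (untested; the mod name is my own):


import simd
import Darwin

// hand-rolled elementwise modulus for float4, since SIMD has no % operator;
// floor has to be applied to each lane separately
func mod(a: float4, _ b: float4) -> float4 {
    let q = a / b
    let f = float4(floor(q.x), floor(q.y), floor(q.z), floor(q.w))
    return a - b * f
}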


I did have some results playing with Accelerate. Instead of using


func moduloFunc(inout testArray: [UInt32], inout modArray: [UInt32]) {
    let len = testArray.count
    for i in 0..<len {
        testArray[i] = testArray[i] % modArray[i]
    }
}

// create 'test', a large [UInt32] array

var modArray = [UInt32](1...UInt32(test.count))
// substitute for 'index + 1' in original version; a tad quicker

moduloFunc(&test, modArray: &modArray)


I tried


import Cocoa
import Accelerate

func moduloFunc(inout testArray: [UInt32], inout floatTestArray: [Float], inout floatModArray: [Float]) {
    let rILen = Int32(testArray.count)
    let rULen = UInt(testArray.count)

    // convert testArray into float equivalent
    vDSP_vfltu32(testArray, 1, &floatTestArray, 1, rULen)
   
    // apply modulo function to floatTestArray
    vvfmodf(&floatTestArray, floatTestArray, floatModArray, [rILen])
}

// create 'test'

let testLen = test.count
var fTest = [Float](count: testLen, repeatedValue: 0.0)
var fModArray = fTest

var modArray = [UInt32](1...UInt32(testLen))

// convert modArray to float equivalent
vDSP_vfltu32(modArray, 1, &fModArray, 1, UInt(testLen))

moduloFunc(&test, floatTestArray: &fTest, floatModArray: &fModArray)


and I end up with a float version of the modulo'd array. (Modulused array?) In isolated tests this runs faster than the plain loop, but when it's put into the main program I'm working on, where it's called several times, it actually makes the program take a bit longer than before. Again, I'm probably missing something here.
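If I do need the results back as UInt32s, it looks like vDSP has a reverse conversion as well (untested):


// convert the fmod'd Floats back into the UInt32 array,
// truncating each value toward zero
vDSP_vfixu32(fTest, 1, &test, 1, UInt(testLen))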


Thanks for all your comments. I'll keep reading about Accelerate and SIMD (and dispatch_various), and will look out for SIMD when Swift 3 is out of beta.

Incidentally, this variation


func modFunc (inout randArray: [UInt32], inout moduloArray: [UInt32]) {
    let len = randArray.count
    let myQueue = dispatch_queue_create("myQueue", DISPATCH_QUEUE_CONCURRENT)
   
    randArray.withUnsafeMutableBufferPointer { buffer -> Void in
        dispatch_apply(len, myQueue) {j in
            buffer[j] = buffer[j] % moduloArray[j]
        }
    }
}


does work to completion, but, when put into the main program, makes it about 6–7 times slower (due to dispatch overhead, I suppose).

Accepted Reply

And then it gets stranger…


modFunc is called several times in a loop in the main program, along with other functions. When I tried the above variation (dispatch_apply with withUnsafeMutableBufferPointer) and put the program through Time Profiler in Instruments, I saw that CPU usage went up with each cycle, as you'd expect with concurrent processing, but the peaks lasted the length of each cycle rather than only while modFunc was running; the dispatch queues seemed not to be "letting go" when the function returned.


I looked through several online links and tried various code combinations, none of which were much help. Then I remembered the Concurrency Programming Guide mentioning striding to improve performance, so I re-edited the above code to:


func modFunc(inout randArray: [UInt32], inout moduloArray: [UInt32]) {
    let len = randArray.count
    let section = len / 8   // array length is always a multiple of 8
    let myQueue = dispatch_queue_create("myQueue", DISPATCH_QUEUE_CONCURRENT)

    randArray.withUnsafeMutableBufferPointer { buffer in
        // eight chunks, one dispatch_apply iteration per chunk,
        // instead of one iteration per element
        dispatch_apply(8, myQueue) { j in
            let piece = j * section
            for i in 0..<section {
                buffer[piece + i] = buffer[piece + i] % moduloArray[piece + i]
            }
        }
    }
}


And… it worked; Time Profiler showed CPU usage only peaked when modFunc was called, and the program ran a little faster, by 8–14% depending on the size of the arrays, which is around what I'd expected.


I'm pleased about this, but still curious as to why it worked. Should tasks always be "cut up" for dispatch_apply to work effectively?