Is there a way to use SIMD in Swift for better performance? (Question includes test code).

I have an app in Swift that does a lot of numerical processing on ordered pairs and vectors, so I'm looking into ways to improve performance, including adopting the SIMD types from the Accelerate/simd framework for some calculations. But I'm not seeing performance improve.

There are some related posts on the forums, but they seem inconclusive and also more complex than what I need. I pared my testing down to a couple of brief XCTests with self.measure blocks around repeated add and multiply operations on two Double values. The tests set random initial values so the compiler can't optimize the loop calculations based on constants. There's also no big collection of fixture data, so allocations, vector index dereferences, or similar issues can't be involved.

The regular multiple-instruction code runs an order of magnitude faster than the SIMD code! I don't understand why. Would the SIMD code be faster in C++ - could it be some Swift conversion overhead? Or is there some aspect of my SIMD code that incurs a known penalty? I'm curious whether anyone out there is using SIMD in Swift in production, and whether you see anything in my test code that explains the difference.

Code Block Swift
func testPerformance_double() {
    // Random starting values prevent the compiler from folding the loop into constants.
    var xL = Double.random(in: 0.0...1.0)
    var yL = Double.random(in: 0.0...1.0)
    let xR = Double.random(in: 0.0...1.0)
    let yR = Double.random(in: 0.0...1.0)
    let increment = Double.random(in: 0.0...0.1)
    Swift.print("xL: \(xL), xR: \(xR), increment: \(increment)")
    var result: Double = 0.0
    self.measure {
        for _ in 0..<100000 {
            result = xL + xR
            result = yL + yR
            result = xL * xR
            result = yL * yR
            xL += increment
            yL += increment
        }
    }
    Swift.print("last result: \(result)") // read from result so it isn't dead code
}


Code Block Swift
func testPerformance_simd() {
    // Same random setup, but packed into 2-element SIMD vectors.
    var vL = simd_double2(Double.random(in: 0.0...1.0), Double.random(in: 0.0...1.0))
    let vR = simd_double2(Double.random(in: 0.0...1.0), Double.random(in: 0.0...1.0))
    let increment = Double.random(in: 0.0...0.1)
    let vIncrement = simd_double2(increment, increment)
    var result = simd_double2(0.0, 0.0)
    Swift.print("vL.x: \(vL.x), vL.y: \(vL.y), increment: \(increment)")
    self.measure {
        for _ in 0..<100000 {
            result = vL + vR
            result = vL * vR
            vL = vL + vIncrement
        }
    }
    Swift.print("last result: \(String(describing: result))") // read from result
}


The measurements show the block with SIMD operations taking an order of magnitude more time than the block with separate scalar operations!

...testPerformance_double measured [Time, seconds] average: 0.049, relative standard deviation: 3.059%, values: [0.049262, 0.049617, 0.048499, 0.047859, 0.048270, 0.048564, 0.047529, 0.052578, 0.047267, 0.047432], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100

...testPerformance_simd measured [Time, seconds] average: 0.579, relative standard deviation: 5.932%, values: [0.626196, 0.605790, 0.635180, 0.611197, 0.553179, 0.548163, 0.552648, 0.549264, 0.552745, 0.551465], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100

Accepted Reply

With more investigation I think I've solved this issue! Hope this helps others who are investigating performance with XCTest and SIMD or other Accelerate technologies...

XCTest performance tests can work great for benchmarking and investigating alternate implementations, even for micro performance, but the trick is to make sure you're not measuring code built for debug or running under the debugger.

I now have XCTest running the performance tests from my original post and showing meaningful (and actionable) results. On my current machine, the 100,000-iteration regular Double calculation block has an average measurement of 0.000328 s, while the simd_double2 test block has an average measurement of 0.000257 s - about 78% of the non-SIMD time, very close to the difference I measured in my release build. So now I can reliably measure what performance gains I'll get from SIMD and other Accelerate APIs as I decide whether to adopt them.

Here's the approach I recommend:
  1. Put all of your performance XCTests in separate files from functional tests, so you can have a separate target compile them.

  2. Create a separate Performance Test target in the Xcode project. If you already have a UnitTest target, it's easy just to duplicate it and rename.

  3. Separate your tests between these targets, with the functional tests only in the original Unit Test target, and the performance tests in the Performance Test target.

  4. Create a new Performance Test Scheme associated with the Performance Test Target.

  5. THE IMPORTANT PART: Edit the Performance Test scheme's Test action: set its Build Configuration to Release, uncheck Debug Executable, and uncheck everything under Diagnostics. This makes sure that when you run Product > Test, it's Release-optimized code that gets run for your performance measurements.

There are a couple of additional steps if you want to be able to run performance tests ad hoc from the editor, with your main app set as the current scheme. First you'll need to add the Performance Test target to your main app scheme's Test section.

The problem now is that your main app's scheme has only one setting for test configuration (Debug vs. Release), so if it's set to Debug, running your performance tests ad hoc will show the behavior from my original post, with the SIMD code in particular an order of magnitude slower.

I do want my main app's test configuration to remain Debug for working with functional unit test code. So to make performance tests work tolerably in this scenario, I edited the build settings of the Performance Test target (only) so that its Debug settings were more like Release - the key setting being the Swift Compiler Code Generation optimization level, changed from No Optimization [-Onone] to Optimize for Speed [-O] for Debug. While I don't think this is going to be quite as accurate as running under the Performance Test scheme with the Release configuration and all other debug options disabled, I'm now able to run the performance tests under my main app's scheme and see reasonable results - again showing SIMD time measurements in the 75-80% range compared to non-SIMD for the test in question.
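
One optional safety net (just a sketch; it assumes the DEBUG compilation condition is defined only for the unoptimized Debug configuration, which is the Xcode template default) is to have performance tests skip themselves when accidentally built without optimization:
Code Block Swift
import XCTest

final class PerformanceTests: XCTestCase {
    func testPerformance_double() throws {
        #if DEBUG
        // Assumption: DEBUG is defined only in the unoptimized Debug configuration.
        throw XCTSkip("Performance tests are only meaningful in optimized (Release) builds.")
        #endif
        // ... self.measure { ... } block from the original test goes here ...
    }
}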

Replies

my testing down to a couple of brief XCTests

It seems XCTest is not a good tool for measuring micro performance.

With this simple code:
Code Block
import Foundation
import simd

class MyClass {
    func measure1(code: () -> Void) -> Double {
        let start = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)
        code()
        let end = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)
        let time = Double(end - start) / 1_000_000_000
        return time
    }
    func measure(code: () -> Void) {
        var times: [Double] = Array(repeating: 0, count: 10)
        for i in times.indices {
            let t = measure1(code: code)
            times[i] = t
        }
        let avg = times.reduce(0, +) / Double(times.count)
        print("average: \(String(format: "%.6f", avg)), values: \(times)")
    }
    func testPerformance_double() {
        //...Exactly the same as shown
    }
    func testPerformance_simd() {
        //...Exactly the same as shown
    }
}
let myObj = MyClass()
print("testPerformance_double:")
myObj.testPerformance_double()
print("testPerformance_simd:")
myObj.testPerformance_simd()


The output:
(Tested on Mac mini (2018), Xcode 12.3, macOS Command Line Tool project, Release settings.)
Code Block
testPerformance_double:
xL: 0.6408629334044746, xR: 0.735714601868435, increment: 0.05068415313623026
average: 0.000156, values: [0.00023855, 0.000140175, 0.000211979, 0.000193554, 0.000178274, 0.000118627, 0.000157364, 0.000136575, 9.3264e-05, 9.3267e-05]
last result: 35599.67430242713
testPerformance_simd:
vL.x: 0.2631681036829726, vL.y: 0.026889765245537545, increment: 0.04897156513471709
average: 0.000156, values: [0.000269612, 0.000122409, 0.000192792, 0.000141103, 0.000124708, 0.00012638, 0.000124496, 0.000211614, 0.000122489, 0.000122688]
last result: SIMD2<Double>(15474.766375371206, 40414.973240285304)

(I have chosen one representative result from a few tries; there were no significant differences in the other tries.)

I'm afraid your code is too simple for measuring micro performance; the Swift compiler will likely eliminate much of the work inside the iteration.

Performance improvement is a very interesting topic, but you may need better ways to explore it.
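
For example, one way to keep the optimizer from discarding the arithmetic is to accumulate into a running value that is printed afterwards. A rough sketch (not one of the original tests):
Code Block Swift
func testPerformance_double_checksum() {
    var xL = Double.random(in: 0.0...1.0)
    let xR = Double.random(in: 0.0...1.0)
    let increment = Double.random(in: 0.0...0.1)
    var checksum = 0.0
    self.measure {
        for _ in 0..<100000 {
            checksum += xL + xR   // accumulate so each iteration's result is observable
            checksum += xL * xR
            xL += increment
        }
    }
    print("checksum: \(checksum)") // reading the checksum keeps the loop body live
}
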
Thanks for the reply, OOPer.

OK, since XCTest isn't useful for this, I've approached it by adopting SIMD in some production code and comparing actual performance at runtime - please see the new code and results below. I now see about a 15% improvement in release-build performance, and I'm wondering if that's about what one would expect.

The code here converts a set of x,y Double values from model coordinates to screen coordinates, so there's a multiply and an add for every x and y. I'm pre-populating the output array with zeros and passing it in to keep allocation out of the picture:

Regular implementation:
Code Block Swift
final func xyPointsToPixels(points: [(Double, Double)],
                            output: inout [CGPoint]) {
    if output.count < points.count {
        Swift.print("ERROR: output array not pre-allocated")
        return
    }
    let start = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)
    for n in 0..<points.count {
        output[n].x = CGFloat(self._opt_xAdd + points[n].0 * self._opt_xfactorDouble)
        output[n].y = CGFloat(self._opt_yAdd + points[n].1 * self._opt_yfactorDouble)
    }
    let end = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)
    let time = Double(end - start) / 1_000_000_000
    os_log(.info, "=== regular conversion of %d points took %g", points.count, time)
}


SIMD implementation:
Code Block Swift
final func xyPointsToPixels_simd(points: [simd_double2],
output: inout [CGPoint]) {
if output.count < points.count {
Swift.print("ERROR: output array not pre-allocated")
return
}
let start = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)
for n in 0..<points.count {
let xyVec = self._opt_simd_add + points[n] * self._opt_simd_factor
output[n].x = CGFloat(xyVec.x)
output[n].y = CGFloat(xyVec.y)
}
let end = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)
let time = Double(end - start) / 1_000_000_000
os_log(.info, "=== simd conversion of %d points took %g", points.count, time)
}
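
For context, the call site looks roughly like this (a sketch; modelPoints stands in for the app's [simd_double2] data): the output array is allocated once and then reused on each conversion pass.
Code Block Swift
// Pre-allocate once (e.g. when the data set size is known), then reuse per conversion pass.
var pixelPoints = [CGPoint](repeating: .zero, count: modelPoints.count)
xyPointsToPixels_simd(points: modelPoints, output: &pixelPoints)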


A debug build run in the debugger is just as misleading with this code as the XCTest was - again, SIMD is an order of magnitude slower there. So in general SIMD seems to behave badly for debugging, though maybe there are some build settings that could improve that.

Changing the build scheme to Release and launching the app normally with the console log level set to info, I was finally able to get more reasonable-looking data. Here, the SIMD implementation was slightly faster than the normal implementation, confirming that the slower execution was a debug-build issue. The SIMD average time is around 85% of the regular average - is that about what would be expected? (I was actually hoping for a little better, considering we're executing one instruction where we were executing two, especially for the multiply.)

My outputs:
info 11:38:27.658463-0800 MathPaint === simd conversion of 4122 points took 4.741e-06
info 11:38:28.303478-0800 MathPaint === simd conversion of 4123 points took 5.876e-06
info 11:38:28.724909-0800 MathPaint === simd conversion of 4122 points took 5.793e-06
info 11:38:31.132216-0800 MathPaint === simd conversion of 4122 points took 7.305e-06
info 11:38:31.675180-0800 MathPaint === simd conversion of 4123 points took 6.942e-06
info 11:38:32.186911-0800 MathPaint === simd conversion of 4123 points took 5.849e-06
info 11:38:34.185091-0800 MathPaint === simd conversion of 4122 points took 5.832e-06
info 11:38:34.603739-0800 MathPaint === simd conversion of 4122 points took 5.425e-06
info 11:38:37.465219-0800 MathPaint === simd conversion of 4123 points took 7.502e-06
info 11:38:38.840133-0800 MathPaint === simd conversion of 4123 points took 8.319e-06
simd average: 6.356e-06

info 11:49:35.332700-0800 MathPaint === regular conversion of 4123 points took 7.058e-06
info 11:49:36.014312-0800 MathPaint === regular conversion of 4122 points took 5.488e-06
info 11:49:38.079446-0800 MathPaint === regular conversion of 4122 points took 7.05e-06
info 11:49:39.658169-0800 MathPaint === regular conversion of 4122 points took 9.533e-06
info 11:49:41.327541-0800 MathPaint === regular conversion of 4122 points took 8.659e-06
info 11:49:42.779920-0800 MathPaint === regular conversion of 4122 points took 8.923e-06
info 11:49:43.286273-0800 MathPaint === regular conversion of 4122 points took 5.422e-06
info 11:49:43.847928-0800 MathPaint === regular conversion of 4122 points took 7.464e-06
info 11:49:49.293082-0800 MathPaint === regular conversion of 4123 points took 8.986e-06
info 11:49:49.793853-0800 MathPaint === regular conversion of 4122 points took 5.573e-06
regular average: 7.516e-06



The SIMD average time is around 85% of the regular average - is that about what would be expected?

It depends on many things. In your case, you are using a 2-element vector operation, which executes 2 scalar operations simultaneously.
So in the ideal case, the best you can expect is 50% of the scalar time; are you OK with this?

I, too, expected a little better than 85%, but I do not think it's too bad.

With a detailed examination of the generated code, it could get a little better.
But I recommend exploring other Accelerate features if you want to improve performance significantly.
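
For example (just a sketch, not the poster's code; it assumes the point data could be kept in flat [Double] arrays, one for x and one for y, rather than tuples), vDSP can apply a multiply-and-add across an entire array in a single call:
Code Block Swift
import Accelerate

// vDSP_vsmsaD computes D[n] = A[n] * B + C (vector * scalar + scalar, double precision).
func scaleAndOffset(_ values: [Double], factor: Double, offset: Double) -> [Double] {
    var result = [Double](repeating: 0, count: values.count)
    var b = factor
    var c = offset
    vDSP_vsmsaD(values, 1, &b, &c, &result, 1, vDSP_Length(values.count))
    return result
}

// Hypothetical usage with separate coordinate arrays:
// let xPixels = scaleAndOffset(xModel, factor: xFactor, offset: xAdd)
// let yPixels = scaleAndOffset(yModel, factor: yFactor, offset: yAdd)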