The current Apple Silicon chips support ARM Neon intrinsics. I have had success including the arm_neon.h header file, both with Xcode and using clang directly on the command line.
That's a good observation! I had run many more tests than I showed; the switch happened off-screen ;)
But you know, now that we have found the most dramatic speedup, it's worth revisiting. In the above code, the fastest I was able to achieve was 0.86 seconds (using direct indexing of the buffer). Going back to .append() increases the speed again!
let arrayLength = (1 << 24)
let randomSource = GKLinearCongruentialRandomSource()
var buffer: [Float] = []
buffer.reserveCapacity(arrayLength)
var idx: Int = 0
while idx < arrayLength {
    buffer.append(Float(randomSource.nextInt())/Float(RAND_MAX))
    idx += 1
}
This gives me a loop time of 0.78 seconds, a 9% reduction in execution time compared to 0.86 seconds. It is only possible because I am reserving capacity in the array before starting the loop, so append() never has to reallocate.
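To illustrate why reserving capacity matters, here is a minimal sketch that counts how often an array's storage grows during a run of appends. countReallocations is a hypothetical helper written for this illustration, not part of the benchmark above, and it uses a smaller count so it finishes quickly.

```swift
// Hypothetical helper: counts how many times the array's capacity changes
// while appending `count` elements, with or without reserveCapacity.
func countReallocations(count: Int, reserve: Bool) -> Int {
    var values: [Float] = []
    if reserve {
        values.reserveCapacity(count)
    }
    var reallocations = 0
    var lastCapacity = values.capacity
    var idx = 0
    while idx < count {
        values.append(Float(idx))
        if values.capacity != lastCapacity {
            // A capacity change means new storage was allocated and the
            // existing elements were moved into it.
            reallocations += 1
            lastCapacity = values.capacity
        }
        idx += 1
    }
    return reallocations
}

let withoutReserve = countReallocations(count: 1 << 16, reserve: false)
let withReserve = countReallocations(count: 1 << 16, reserve: true)
print("Reallocations without reserveCapacity: \(withoutReserve)")
print("Reallocations with reserveCapacity: \(withReserve)")
```

With the capacity reserved up front, the count should stay at zero; without it, the array has to grow several times along the way.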
As for build settings and optimizations, I created a new Xcode project for the Swift version and therefore got the default settings. I left the Apple sample code project with whatever Apple had set as the defaults. No fiddling on my part.
Good idea about changing the Obj-C code to use arc4random() instead of rand(). I did so, and found that for the 16.7 million floats (1 << 24) in Apple's sample code, rand() takes about 0.13 seconds, while arc4random() takes about 1.25 seconds.
I tried a few more things with the Swift code that produced some very interesting results. Switching from a for-in loop to a while loop drastically reduces the execution time.
let randomRange: ClosedRange<Float> = 0...Float(100.0)
let arrayLength = (1 << 24)
var buffer: [Float] = Array(repeating: 0.0, count: arrayLength)
var idx: Int = 0
while idx < arrayLength {
    buffer[idx] = Float.random(in: randomRange)
    idx += 1
}
The loop in the above code runs in 5.12 seconds, less than half the time that the for idx in 0..<arrayLength version takes!
This promising result led me back to using GameplayKit.
let arrayLength = (1 << 24)
let randomSource = GKLinearCongruentialRandomSource()
var buffer: [Float] = Array(repeating: 0.0, count: arrayLength)
var idx: Int = 0
while idx < arrayLength {
    buffer[idx] = Float(randomSource.nextInt())/Float(RAND_MAX)
    idx += 1
}
The loop in the above code runs in 0.86 seconds!
This is still significantly slower than rand(), but I'm fine with it. I may look more into other implementations in the future.
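One other implementation that might be worth trying is a custom generator passed to Float.random(in:using:). Float.random(in:) uses SystemRandomNumberGenerator by default, which is designed to be cryptographically secure, so it may be paying a cost similar to arc4random(). The sketch below swaps in SplitMix64, a small non-cryptographic generator; the SplitMix64 type and the seed are assumptions for illustration, and it has not been timed here, so whether it closes the gap with rand() would need measuring.

```swift
// Assumption: SplitMix64 is an illustrative, non-cryptographic generator.
// It is not part of the Swift standard library or GameplayKit.
struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64

    init(seed: UInt64) {
        state = seed
    }

    mutating func next() -> UInt64 {
        // Wrapping arithmetic (&+=, &*) is intentional: overflow is part of the algorithm.
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

let arrayLength = 1 << 16  // smaller than the 1 << 24 used in the thread, to keep the sketch quick
let randomRange: ClosedRange<Float> = 0...100
var generator = SplitMix64(seed: 42)  // arbitrary seed

var buffer: [Float] = []
buffer.reserveCapacity(arrayLength)
var idx = 0
while idx < arrayLength {
    // Same call as before, but drawing bits from the custom generator.
    buffer.append(Float.random(in: randomRange, using: &generator))
    idx += 1
}
print("Generated \(buffer.count) values")
```

A seeded generator like this produces the same sequence on every run, which is handy for reproducible benchmarks but unsuitable for anything security-related.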
I appreciate the feedback and suggestions. It really helped get me thinking.
Sure, I will post a few snippets. These are all on an M1 Pro.
Creating an array of data for later passing in to the Metal buffer:
let arrayLength = (1 << 24)
let randomRange: ClosedRange<Float> = 0...Float(100.0)
var randomFloats: [Float] = []
randomFloats.reserveCapacity(arrayLength)
print("Capacity reserved")
let start = DispatchTime.now()
for _ in 0..<arrayLength {
    randomFloats.append(Float.random(in: randomRange))
}
let end = DispatchTime.now()
let totalTime = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000.0
print("Total time \(totalTime) seconds.")
The output for this one is:
Capacity reserved
Total time 11.37157275 seconds.
Program ended with exit code: 0
Here we try to create the Metal buffer directly:
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError( "Failed to get the system's default Metal device." )
}
let arrayLength = (1 << 24)
let bufferSize = arrayLength * MemoryLayout<Float>.size
let randomRange: ClosedRange<Float> = 0...Float(100.0)
print("Starting")
let start = DispatchTime.now()
guard let buffer = device.makeBuffer(bytes: (0..<arrayLength).map { _ in Float.random(in: randomRange) },
                                     length: bufferSize,
                                     options: .storageModeShared) else {
    fatalError( "Failed to make buffer" )
}
let end = DispatchTime.now()
let totalTime = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000.0
print("Total time \(totalTime) seconds.")
The output for this one is:
2022-09-03 08:43:27.183968-0500 computeTest[2988:125455] Metal GPU Frame Capture Enabled
2022-09-03 08:43:27.184253-0500 computeTest[2988:125455] Metal API Validation Enabled
Starting
Total time 10.732446667 seconds.
Program ended with exit code: 0
I've tried a number of others, but this should give a good idea of what is going on. Using GKLinearCongruentialRandomSource from GameplayKit speeds it up by a few percent, but it still doesn't compare to the Objective-C version in the above linked sample code. That entire program runs in less than 1 second on my MacBook Pro.
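For comparison, here is a sketch of a third variant that skips the intermediate Swift array: allocate the shared MTLBuffer first, then write the floats straight into its memory through contents(). It has not been timed here, and the name floats is just illustrative.

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("Failed to get the system's default Metal device.")
}

let arrayLength = 1 << 24
let bufferSize = arrayLength * MemoryLayout<Float>.stride
let randomRange: ClosedRange<Float> = 0...100

// Allocate an uninitialized shared buffer instead of copying from an array.
guard let buffer = device.makeBuffer(length: bufferSize, options: .storageModeShared) else {
    fatalError("Failed to make buffer")
}

// View the buffer's raw memory as Floats so the loop can write in place.
let floats = buffer.contents().bindMemory(to: Float.self, capacity: arrayLength)

var idx = 0
while idx < arrayLength {
    floats[idx] = Float.random(in: randomRange)
    idx += 1
}
print("Filled \(arrayLength) floats in place")
```

This only removes the extra array allocation and copy; the cost of Float.random(in:) itself is unchanged, so the generator is still likely to dominate the loop time.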
Hello all, I went back and did some more reading. It seems that the best (public) explanation for what "package power" means on Apple Silicon is that it covers everything on your M1 chip except memory, and I think that DRAM power covers the memory. So package power + DRAM power accounts for everything on the M1 chip; it does not include things such as laptop backlight, disk power, etc.
I will keep my ears and eyes open for any better explanation, but I think that for my purposes this is a good enough answer.