This is typically code using AVX, AVX2, or FP16 extensions, which Rosetta 2 doesn't translate. So you'll need an ARM-native build, or a version of that x64 code that avoids those extensions — SSE 4.2 is the maximum Rosetta 2 supports.
Note that the most recent Intel GPUs removed fp64 support. Maybe Apple won't update its Intel lineup, so you'll be safe.
This is still an issue. Can we have a timing system that doesn't go through the slow, unconfigurable os_log? We don't use an unreasonable number of scopes with any of the other profilers, but these swamp os_signpost.
Hmm, I guess option+click does the capture without bringing up the popup. So question answered, but this might help others.
Here's the developer doc built from the sources; it mentions iOS 13.
https://developer.apple.com/documentation/metal/mtlcompileoptions/3564462-preserveinvariance?language=objc
And here's the flag itself.
/*!
@property preserveInvariance
@abstract If YES, set the compiler to compile shaders to preserve invariance. The default is false.
*/
@property (readwrite, nonatomic) BOOL preserveInvariance API_AVAILABLE(macos(11.0), macCatalyst(14.0), ios(13.0));
@end
We ran this on new and old hardware, but our app targets iOS 12 as a minimum, since our minspec device is the iPhone 5S and Apple stopped iOS updates for it at 12, so we can't move higher.
We will also move to reverse-Z and an infinite far plane, but that's a much bigger code change. That should also improve depth precision, though here the same matrix is used for both passes, so depth precision shouldn't matter.
enum MTLSamplerMinMagFilter
{
MTLSamplerMinMagFilterNearest,
MTLSamplerMinMagFilterLinear,
MTLSamplerMinMagFilterMin, <- add
MTLSamplerMinMagFilterMax, <- add
}
Seems that p-cores can drop offline under thermal constraint. That would make affinity assignments more challenging. We now set round-robin 45/41 priorities as per the talk, set QoS 31 on non-important threads, and run audio threads at very high FIFO priority. But we can't control where any of this runs consistently, and we have both placement and priority control on all other platforms. We're not even sure, given these vague numbers, whether we consistently get p- or e-cores. That makes consistent performance tuning a lot more difficult.
Seems like the new Swift threading model is all about running no more threads than cores and keeping each core busy with work. So that implementation is locking threads to cores via affinity. That's exactly why we also need affinity control. And this API doesn't appear to be exposed to C++ code.
I hadn't seen that presentation, so digging into it. Thanks!
Apparently macOS on M1 doesn't implement this call either, and also returns an error. So iOS apps running on an M1 Mac are a failure case too. We'll just let threads run wherever for now, but it's less than ideal.
I had no idea where to find that return code 46 means "not supported", but thanks as always, Quinn — you are super helpful. It's KERN_NOT_SUPPORTED from mach/kern_return.h, since thread_policy_set returns a kern_return_t rather than an errno.
What is the alternative? QoS control isn't even remotely the same as affinity. The system reserves higher priorities and QoS levels that are unavailable to us, so it should stay responsive even if we use affinity in a game. Android at least has affinity control, and it hasn't destroyed that platform.
And when you're building a game and want to run jobs consistently on specific cores and monitor them in captures, having no affinity control on macOS or iOS is a problem. We use core affinity on every platform except Apple's, and the workarounds aren't ideal for optimizing performance.
I somehow feel this is like the removal of dylib hot-loading in iOS 12. We used to be able to reload our C++ game code; now Apple requires app devs to completely relaunch builds, which kills iteration — look at Unity or UE4/5 having to do the same. Which iOS release removed affinity hinting?
This API came out in macOS 10.5, and the call is nominally available on iOS; it just seems to have been disabled of late. Maybe it still works for setting other policy values, but at this point it's a little late for it to be experimental.
https://developer.apple.com/library/archive/releasenotes/Performance/RN-AffinityAPI/#//apple_ref/doc/uid/TP40006635-CH1-DontLinkElementID_2
We have 2 big and 4 little cores, and the little cores run 2-3x slower than the big ones. We'd like to prioritize tasks onto the big cores and then actually see those tasks running there — maybe even ignore the little cores entirely so that we hit our frame rate. There are no scheduler examples from Apple on how to do this, and throwing 50 queues at libdispatch isn't the correct model either.
Also we're running iOS builds on macOS M1. Does this call work there?
Same problem here trying to set the affinity mask/hint. What is return code 46? Is Apple trying to prevent use of this API on iOS? The macOS side correctly returns 0.
We're using the following code, as per Apple's documentation. It fails for every mask value. Even a hint would be fine — we can live without real affinity support and just hint and hope.
thread_affinity_policy_data_t policy = { (integer_t)( mask & 0xFFFFFFFF ) };
// thread_policy_set takes a Mach thread port, not a pthread_t,
// so convert with pthread_mach_thread_np
kern_return_t kr = thread_policy_set( pthread_mach_thread_np( pthread_self() ),
                                      THREAD_AFFINITY_POLICY,
                                      (thread_policy_t)&policy,
                                      THREAD_AFFINITY_POLICY_COUNT );
This leads to 5-8 ms of CPU driver processing that overlaps with the 10-26 ms nextDrawable wait. I can't post a picture from Metal System Trace here, but it seems one should be able to completely commit one command buffer before getting stalled by the API. Having to use two command buffers just to work around the nextDrawable stall isn't great, but that's my workaround for now. With that, no stall is seen, since it's all triple buffered; if I switch to double buffering, it becomes unusable.
Very little of the render command buffer submission depends on the drawable. The framebuffer command buffer just reads the results of the offscreen pass to display to the UIView. So in the single-command-buffer case, the nextDrawable stall effectively blocks all of that work from being submitted.
Seeing select() take 200 ms or more because of this, which then hitches our game while we're trying to communicate with a remote manager. I tried setting TCP_NODELAY to fix it, to no effect; it seems we'd also need TCP_QUICKACK, but that isn't defined. Since OS X is a BSD-derived OS, it's odd that this is missing and that full support isn't there.
I probably can't share links on this forum, but this is the issue.
Nagle's Algorithm and Delayed ACK Do Not Play Well Together in a TCP/IP Network
Yes, I ended up putting stage boundaries around all of our render passes, and on iOS I use the stage-boundary calls. I thought I was going to have to set draw-boundary data on each draw call, but the stage boundaries were really just a timestamp injected into the command stream. The WWDC video was helpful.
MTLParallelRenderCommandEncoder wasn't supported, but I was able to define timers around its sub-encoders. It was a ton of code and tricky to support on both macOS and iOS: I had to deal with 3 different encoder types, and adjust the timestamps on Intel macOS.
It's done now and working at least for macOS 11+ and iOS 14+. Also should solve M1 timings.
Could stage-boundary and depthClamp support (which the docs erroneously list as family v4_1 instead of v2_4) be added to the Metal Feature Set tables, so I don't waste time writing workarounds for missing functionality?