vkCmdDrawIndexedIndirectCount functionality under Metal

Hello,

It looks like my previous question was closed without being resolved.

https://developer.apple.com/forums/thread/668171

Below are FPS values from our new benchmark. Indirect command buffers are not working properly. So there is no way to emulate multi-draw indirect count functionality other than a loop of draw indirect commands. As you can see below, the same hardware runs three times slower under Metal because of it, and Apple M1 performance is worse than AMD integrated graphics performance.

We have a buffer with multiple draw commands. How should we render it efficiently under Metal?

AMD Vega 56 eGPU:
Direct3D12: 94.0
Direct3D11: 87.2
Vulkan: 91.1
Metal: 35.8

AMD Ryzen™ 7 4800H:
Direct3D12: 21.1
Direct3D11: 19.4
Vulkan: 20.5

Apple M1:
Metal: 16.9

Thank you

What problems are you seeing with indirect command buffers? It would be helpful if you create a request via Feedback Assistant to see what might be going on.
Everything is described in the original post. The main problem is performance, because even a loop of indirect draw calls is faster than an indirect command buffer:

https://www.icloud.com/iclouddrive/0ICuhBkHgGuLjCxaJwRyHoLmw#execute_commands_in_buffer
https://www.icloud.com/iclouddrive/0hDo_q0oXs4uzC25yZdKmL83A#multiple_draw_indirect

I filed a Feedback Assistant report more than half a year ago. There was no answer, so I posted here.

Thank you!

Hi frustumo,

I checked on each of the tickets mentioned in the last thread.

  • FB8254449 - Still under investigation, but as mentioned in the other thread, you should be able to use ICBs, although it sounds like you weren't able to get the performance you wanted.
  • FB8638856 - Closed because you created 2 other FBA requests for the separate issues there.
  • FB8928674 - Got stuck because I guess the driver engineer thought he needed an Xcode project. I just pointed out that you attached a reproducer and am trying to get the driver team to look at it again.
  • FB8928678 - Was looked at some. One engineer suspects the perf issue is due to the low bandwidth to the eGPU and the GPU fetching the ICB over the Thunderbolt bus. There is no way for you to control the location of the ICB though, so this would be something the GPU driver team needs to handle. I have pushed this to the AMD driver team to look at. I'm asking what else you may be able to do in the meantime.

Thank you for your answer.

I have created a new FB9127527 issue with the benchmark and more information inside.

Would you be able to provide an Xcode project to reproduce the issue?

There are links inside FB9127527 to a notarized application for macOS, Windows, and Linux. The other FB reports contain multiple simple tests that reproduce the problem on macOS. Thank you.

So there is no way to emulate multi-draw indirect count

There is, but you have to call drawPrimitives or drawIndexedPrimitives multiple times, each call pointing at the next indirect draw record in the buffer. I don't know why Metal left the drawCount out of the API, but the current implementation has a drawLimit of 1. The nice thing is that indirect draw works back to iPhone 5S.
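The loop described above can be sketched on the CPU side as follows. This is only an illustration, not Metal code: `DrawIndexedIndirectArgs` is a hypothetical struct that mirrors the field order of Metal's `MTLDrawIndexedPrimitivesIndirectArguments`, and `IssueIndirectDraw` is a hypothetical callback standing in for the encoder's indexed indirect draw call, which takes an indirect buffer and a byte offset into it. The function name `multiDrawIndexedIndirect` is likewise made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical mirror of MTLDrawIndexedPrimitivesIndirectArguments:
// five tightly packed 32-bit fields, 20 bytes per draw record.
struct DrawIndexedIndirectArgs {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t indexStart;
    int32_t  baseVertex;
    uint32_t baseInstance;
};
static_assert(sizeof(DrawIndexedIndirectArgs) == 20, "unexpected padding");

// Hypothetical stand-in for the encoder call that issues ONE indexed
// indirect draw reading its arguments at the given byte offset.
using IssueIndirectDraw = std::function<void(std::size_t byteOffset)>;

// Emulates a multi-draw by issuing maxDrawCount single indirect draws,
// each one pointing at the next record in the indirect buffer.
// Returns the number of draw calls issued.
std::size_t multiDrawIndexedIndirect(std::size_t firstOffset,
                                     std::size_t stride,
                                     uint32_t maxDrawCount,
                                     const IssueIndirectDraw& issue) {
    // Records must not overlap.
    assert(stride >= sizeof(DrawIndexedIndirectArgs));
    for (uint32_t i = 0; i < maxDrawCount; ++i) {
        issue(firstOffset + static_cast<std::size_t>(i) * stride);
    }
    return maxDrawCount;
}
```

In a real renderer the callback body would be the single encoder call, so the CPU cost grows linearly with the maximum draw count even when the GPU ends up skipping most of the records.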

You can even fill the indirect buffers on the GPU with a compute pass and then draw from them indirectly. But you still have to issue the draw 10 times, even if compute culls and produces only 5 results, so make sure to set the instance count to 0 on the remaining draws. Or, if you can wait a frame, you could read the count back on the CPU.
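As a rough CPU reference for what such a culling pass would write into the indirect buffer, here is a minimal sketch. It assumes a hypothetical `DrawIndexedIndirectArgs` struct mirroring Metal's indexed indirect argument layout and a hypothetical `compactVisibleDraws` helper; in practice this logic would live in a compute kernel, typically with an atomic counter for the output slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of Metal's indexed indirect draw arguments.
struct DrawIndexedIndirectArgs {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t indexStart;
    int32_t  baseVertex;
    uint32_t baseInstance;
};

// Packs the visible draws to the front of the buffer and zeroes the
// rest, so the fixed number of draw calls issued by the CPU turns the
// culled slots into no-ops (instanceCount == 0).
// Returns the number of surviving draws.
std::size_t compactVisibleDraws(std::vector<DrawIndexedIndirectArgs>& draws,
                                const std::vector<bool>& visible) {
    assert(visible.size() == draws.size());
    std::size_t kept = 0;
    for (std::size_t i = 0; i < draws.size(); ++i) {
        if (visible[i]) {
            draws[kept++] = draws[i];  // kept <= i, so in-place is safe
        }
    }
    for (std::size_t i = kept; i < draws.size(); ++i) {
        draws[i] = DrawIndexedIndirectArgs{};  // all fields zero
    }
    return kept;
}
```

With 10 input draws of which 5 are visible, the first 5 records hold the surviving draws and the last 5 have an instance count of 0, which matches the "call draw 10 times" scheme described above.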

This is the same eGPU hardware with 3 times lower performance under Metal:

https://gravitymark.tellusim.com/report/?id=bc453e851c5dede3cedef6c3ac9caca2f8dffa47
https://gravitymark.tellusim.com/report/?id=7f1b799adc588938fc02f140a2ee48dbd4f36e69

ICBs are working stably with the latest OS updates. We have updated our macOS benchmark and released the iOS version:

https://apps.apple.com/us/app/gravitymark-gpu-benchmark/id1595186532

ICBs give a 2.5x performance boost in comparison with the previous version. Thank you for the great improvements.
