Just for the ones curious about this issue, thanks to Github user baldand (who, btw, is developing a super cool Metal wrapper for Python, https://github.com/baldand/py-metal-compute) who pointed out that I needed to set setPurgeableState in my MTLBuffers to be sure they get released.
Post
Replies
Boosts
Views
Activity
Hi guys, thanks for taking a look, really appreciate it.
I tried to explore the Metal capture at some point, but because the project is coded outside XCode, I didn't manage to make it work. For example, I tried to follow these suggestions (https://alia-traces.github.io/metal/tools/xcode/2020/07/18/adding-framecapture-outside-of-xcode.html) (https://developer.apple.com/documentation/metal/frame_capture_debugging_tools/capturing_gpu_command_data_programmatically?language=objc#3325255) to capture Metal outside XCode in a gputrace file but MTLCaptureManager.supportsDestination(.gpuTraceDocument) returned that is not supported. I read I needed to enable it with a Info.plist with the flag MetalCaptureEnabled set to true.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>MetalCaptureEnabled</key>
<true/>
</dict>
</plist>
But I never managed to make it work. I put Info.plist in the directory where I run my application (via Python) as suggested here (https://stackoverflow.com/questions/63707866/macos-metal-failing-to-capture-gpu-frame-from-command-line-app), in the location of the libraries, Python binary, etc) but never seemed to be taken into account.
Do you have any other suggestions to enable the capture? Same for the instruments trace, any hint to do it from a command-line application would be welcome.
Cheers,
S
Amazing, that METAL_CAPTURE_ENABLED should be mentioned in the official documentation! :). It worked without a problem, now I can capture the gputrace.
There is indeed some register spilling in 2 of 3 kernels being run in the main for..loop of the FDTD process. As commented, I need to use MTLCommandBuffer.waitUntilCompleted() at each loop otherwise the results start to be unstable given the nature of the time-difference method (you need to wait that stress values finish calculation before calculating particle values, and so on).
For the spillage, and after analyzing the trace, I may have an idea to try to split the most intense kernels (they are massive currently) into mini kernels that should reduce the number of registers. These mini kernels can be run independently using a single command buffer in a single encoding operation, hoping this will maximize occupancy. It would be interesting to see how all 3 backends (OpenCL, Metal and CUDA) behave since this splitting will be common to all of them.
Thanks again for that METAL_CAPTURE_ENABLED info, I can see I will use it heavily in the future.
The instruments trace shows that the OpenCL execution has higher occupancy and much higher memory bandwidth (390 GB/s in OpenCL vs 110 GB/s in Metal). The access to memory strategy of the OpenCL alone can explain the difference. But keep in mind this execution of the OpenCL was with the very large kernels, so memory is being accessed aggressively, still, it is nice to see hitting the peak performance.
This is the Metal instruments overview using the "mini-kernels" approach
and here OpenCL using the original large kernels
From the instruments trace (and apologies since I'm just learning how to use it correctly), besides seeing some of the evidence, how can go into more details to identify what specific bottleneck may explain the difference?
Awesome, thanks for taking a look, and apologies, I should have pointed you to the right spot beforehand... you are watching the wrong function :), This segment of
code points to the Rayleigh-Sommerfeld Integral that is included in the library to complement the FDTD solution of the Viscoelastic solver, which is mainly covered in the FDTD3D_GPU_VERSION.h file. I'm linking to the experimental branch FirstTestAppleSilicon that has all the recent "mini-kernels" approach, those changes have not been yet merged into the main branch. All the recent tests of the instruments profiling were done with that new branch.
As a side note, yes, in the ForwardSimpe function I should move out of the main loop the defaultLibrary.makeFunction(name: "ForwardSimpleMetal")! call.], but that function is already working quite well, the one that really initiated this thread is at FDTD3D_GPU_VERSION.h
So back to FDTD3D_GPU_VERSION.h, there you can see all the 3 GPU backends compilations controlled with macro conditions (i.e #if defined(METAL) ...) . You can see (here), that for the code in question I didn't do that mistake of creating the library inside the main loop. Be aware that for FDTD3D_GPU_VERSION.h I used a this wrapper mtlpp instead of a Swift-based interface since FDTD3D_GPU_VERSION.h was originally developed as a C-based Python extension (I have a student who will work to change that so we can use Swift instead, but that is for another day). You can see there how each of the GPU backends get their synchronization.
Btw, the mini-kernel Metal approach showed significant improvements for the AMD GPUs, the W6800 is now showing even better results and the Vega 56 shows now also a better performance than OpenCL. Here you can see updated benchmarks,
The Vega 56 was the one showing a dramatic improvement from 144s to 83s after using the mini-kernels approach.
I haven't yet completed doing the proper tests in OpenCL and CUDA to see how the mini-kernels approach will work with those backends. But at least this exercise has been quite productive to improve performance in Metal-supported GPUs, it is clear that with adequate fine-tuning Metal can and should do better than OpenCL, but now I just need to continue to investigate how to ensure Metal shows also a similar trend in the M1 Max as it shows with the AMD GPUs.
Thanks again for taking a look.
It is pity this seems to get lost through the cracks, and now I have a bigger issue. The latest MacOS 12.3 update truly broke the operation of my W6800 GPU!!! I can't run anymore any of my Metal code, it just hangs there indefinitely. I saw multiple reports of a drop in performance with eGPUS associated with the latest MacOS release. Please check what is going on! This is highly disruptive.
Thanks for the reply. The problem with my eGPU was just fixed yesterday with the roll of MacOS 12.3.1. There were reports all over the place in many forums of issues with eGPU associated with 12.3. After the update, now my programs were able to run again. It was just my app that hanged. But anyway, it is not a problem anymore. Looking forward to hearing more on the original issue. Maybe to give extra motivation, given the excellent experience with the M1 Max, for our applications, we are now considering using Mac Studios with the M1 Ultra (and anything that is your pipeline of new processors) for neuronavigated procedures in human participants where this numerical model running in OpenCL/Metal will be tightly integrated, so being able to be future-proof with Metal is truly something critical for us moving forward.
Hi guys, any updates on what I could try out? Or is this something you think could be addressed in the future with Metal 3 once it becomes available?
AFAIK, it is not possible... but I'd love to hear if this may be possible. I ended up packing multiple "virtual" arrays in each of the 32 allowed buffers you can pass. Then I used some indexing to differentiate one virtual array from another. This approach has obviously important limitations in that there may be instances a virtual array can't fit along others in the maximum size of the buffer. But at least it allowed us to pass more arrays than the upper limit of 32.
I just want to wrap up this thread as we managed to finally bring OpenCL and Metal to show the same level of performance. It took a lot of changes (replacing C++ wrappers by Swift and then later with a modified Python library that compiles kernels on the flight), but ultimately the biggest difference was we packed as many constants as possible as #define statements instead of passing them through constant memory. Once that change all those changes were done, finally, the M1 Max Metal's performance is slightly better than OpenCL, so an improvement of 300%, which was dramatic. For those interested, the newest and much more simplified code is at https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. Below is a screenshot of the performance test (I also pushed it to a more challenging test that illustrates better how the M1 processors stand vs the A6000).