Post

Replies

Boosts

Views

Activity

Why is the speed of metal shading kernel so slow?
Hi, I am recently writing metal shader language to parallelize the algorithms to accelerate the speed of it. I created a simple example to show the acceleration result of it. Since Rust is used in our algorithm, so I used metal-rs as the wrapper to execute the MSL kernels from rust side. In this example, I am calculating the result of two arrays, and kernel looks like: kernel void two_array_addition_2( constant uint* a [[buffer(0)]], constant uint* b [[buffer(1)]], device uint* c [[buffer(2)]], uint idx [[thread_position_in_grid]] ) { c[idx] = a[idx] + b[idx]; } in the main.rs, you can see a function called execute_kernel() , this function has all it needs to execute the kernel in MSL (such as commandEncoder, piplelineState, etc). use core::mem; use metal::{Buffer, MTLSize}; use objc::rc::autoreleasepool; use std::time::Instant; use two_array_addition::abstractions::state::MetalState; fn execute_kernel( name: &str, state: &MetalState, input_a: &Buffer, input_b: &Buffer, output_c: &Buffer, ) -> Vec<u32> { // assert!(input_a.len() == input_b.len() && input_a.len() == output_c.len()); // let len = input_a.len() as u64; let len = input_a.length() as u64 / mem::size_of::<u32>() as u64; // 1. Init the MetalState // - we inited it // 2. Set up Pipeline State let pipeline = state.setup_pipeline(name).unwrap(); // 3. Allocate the buffers for A, B, and C // - we allocated outside of this function let mut result: &[u32] = &[]; autoreleasepool(|| { // 4. Create the command buffer & command encoder let (command_buffer, command_encoder) = state.setup_command( &pipeline, Some(&[(0, input_a), (1, input_b), (2, output_c)]), ); // 5. command encoder dispatch the threadgroup size and num of threads per threadgroup let threadgroup_count = MTLSize::new((len + 256 - 1) / 256, 1, 1); let thread_per_threadgroup = MTLSize::new(256, 1, 1); // let grid_size = MTLSize::new(len, 1, 1); // let threadgroup_count = MTLSize::new(pipeline.max_total_threads_per_threadgroup(), 1, 1); command_encoder.dispatch_thread_groups(threadgroup_count, thread_per_threadgroup); command_encoder.end_encoding(); command_buffer.commit(); command_buffer.wait_until_completed(); // 6. Copy the result back to the host let start = Instant::now(); result = MetalState::retrieve_contents::<u32>(output_c); let duration = start.elapsed(); println!("Duration for copying result back to host: {:?}", duration); }); result.to_vec() } The performance of the result is kinda interesting to me. This is the result: $ cargo run -r This is expected to run for a while... please wait... Generating input arrays... Generating input arrays... Generating output array... Generating expected output... Duration for allocating buffers: 2.015258s Executing 1st kernel (1)... Duration for copying result back to host: 5.75µs Executing 1st kernel (2)... Duration for copying result back to host: 542ns Executing 2nd kernel (1)... Duration for copying result back to host: 1µs Executing 2nd kernel (2)... Duration for copying result back to host: 458ns Duration expected: 183.406167ms Duration for 1st kernel (1): 1.894994875s Duration for 1st kernel (2): 537.318208ms Duration for 2nd kernel (1): 501.33275ms Duration for 2nd kernel (2): 497.339916ms You have successfully run the kernels! The speed is slower when executing in the MSL kernel, while I reckon of the dataset is quite big ($2^{29}$) The first kernel execution takes more time to launch. Is there any way to optimize the MSL in this case? And in most case, when you design the algorithm into parallelism, what would be the concerns? The machine I am using is M1 Pro with 14-core GPU and 16 GB memory. Does anyone have idea / explanation for why these happen? Thank you
1
0
402
Sep ’24