To optimize buffer reads, I intend to use threadgroup memory.
But it seems:
(1) There is no API like std::memcpy in MSL;
(2) Also, there is no API like [setBuffer:offset:atIndex:] to set data for threadgroup memory.
The amount of data is 2 to 4 KB. What is the fastest way to copy data from device memory to threadgroup memory? Thanks!
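
For reference, one common pattern for this kind of copy is a cooperative loop in which every thread of the threadgroup copies a strided share of the data, followed by a barrier. The following is only a minimal sketch under assumptions: the kernel name `copy_to_tile`, the buffer indices, the `float` element type, and the 4 KB tile size (`TILE_COUNT`) are all hypothetical and not taken from real code.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical size: 1024 floats = 4 KB of threadgroup memory.
#define TILE_COUNT 1024

kernel void copy_to_tile(device const float *src [[buffer(0)]],
                         device float       *dst [[buffer(1)]],
                         uint tid    [[thread_index_in_threadgroup]],
                         uint tgSize [[threads_per_threadgroup]],
                         uint gid    [[thread_position_in_grid]])
{
    // Statically sized threadgroup array, shared by all threads in the group.
    threadgroup float tile[TILE_COUNT];

    // Cooperative copy: thread t copies elements t, t + tgSize, t + 2*tgSize, ...
    // so neighbouring threads read neighbouring device addresses.
    for (uint i = tid; i < TILE_COUNT; i += tgSize) {
        tile[i] = src[i];
    }

    // Wait until every thread has finished writing before anyone reads the tile.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Placeholder use of the tile, only so the sketch has an observable result.
    dst[gid] = tile[gid % TILE_COUNT];
}
```

If the threadgroup array is passed as a kernel argument with `[[threadgroup(n)]]` instead of being declared inside the kernel, its size (not its contents) is set on the host with `[setThreadgroupMemoryLength:atIndex:]`; the contents still have to be filled in by the kernel itself, as in the loop above.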