6 Replies
      Latest reply on Oct 11, 2019 12:26 AM by trojanfoe
      Marek Simonik Level 1 (0 points)

        Situation (simplified):

        There are 3 Metal buffers (created at app initialization via MTLDevice, not via MTLHeap):

        • Buffer A: contains a total of n float3 elements (which may represent particle positions)
        • Buffer B: a double buffer of A (same element count and layout)
        • Buffer C: contains a total of n float elements (the i-th element of C corresponds to the i-th element of A (and also B))
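        For reference, a minimal sketch of how such buffers can be created (n, the options and the force-unwraps are illustrative):

        import Metal

        let device = MTLCreateSystemDefaultDevice()!
        let n = 100_000  // number of particles (illustrative)

        // float3 is padded to 16 bytes, hence SIMD3<Float>.stride
        let bufferA = device.makeBuffer(length: n * MemoryLayout<SIMD3<Float>>.stride, options: [])!  // n float3
        let bufferB = device.makeBuffer(length: n * MemoryLayout<SIMD3<Float>>.stride, options: [])!  // double buffer of A
        let bufferC = device.makeBuffer(length: n * MemoryLayout<Float>.stride, options: [])!         // n float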

         

        I also have 3 kernels (Metal compute functions), which manipulate the 3 buffers:

        • Kernel preprocess(A, B): its parameters are A and B. It reads (old) values from A and writes updated values into B. [1 thread = 1 element]
        • Kernel process(B, C): this is a very computationally expensive/time-consuming kernel. (Just to be clear: each element of buffer B represents a particle position.) For each element/particle of B, the kernel finds the particle's nearest neighbors (i.e. their positions). After the neighbors are found, it reduces their positions into one float value, which is written to buffer C. [1 thread = 1 element (i.e. particle)]
        • Kernel postprocess(A, B, C): it reads elements from B and C, combines them in a simple way and writes the result back to A (something like A[i] = B[i]*C[i]). [1 thread = 1 element]
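        In code, the three kernels look roughly like this (a simplified sketch – the actual neighbor search in process is much more involved):

        #include <metal_stdlib>
        using namespace metal;

        kernel void preprocess(device const float3* A [[buffer(0)]],
                               device float3*       B [[buffer(1)]],
                               uint                 i [[thread_position_in_grid]])
        {
            B[i] = A[i];  // plus whatever per-element update is applied
        }

        kernel void process(device const float3* B [[buffer(0)]],
                            device float*        C [[buffer(1)]],
                            uint                 i [[thread_position_in_grid]])
        {
            // expensive part: find B[i]'s nearest neighbors in B and reduce
            // their positions into a single float (see Problem below)
            C[i] = 0.0;
        }

        kernel void postprocess(device float3*       A [[buffer(0)]],
                                device const float3* B [[buffer(1)]],
                                device const float*  C [[buffer(2)]],
                                uint                 i [[thread_position_in_grid]])
        {
            A[i] = B[i] * C[i];
        }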

         

         

        Problem:

        The kernel process takes a serious amount of time to complete, because for each particle it iterates over a range of B's values. Particles that are near each other (i.e. adjacent threads) iterate over very similar ranges of B – there's heavy memory reuse of B between adjacent threads. The kernel is defined similarly to this:

        kernel void process(..., device float3* B, ...) {...}
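        To make the access pattern concrete, here is a stripped-down sketch of the hot loop (the radius and the reduction are placeholders; the real kernel also tiles B through threadgroup memory):

        #include <metal_stdlib>
        using namespace metal;

        kernel void process(device const float3* B [[buffer(0)]],
                            device float*        C [[buffer(1)]],
                            constant uint&       n [[buffer(2)]],
                            uint                 i [[thread_position_in_grid]])
        {
            float3 p = B[i];
            float reduced = 0.0;
            // Adjacent threads walk nearly identical windows of B,
            // so the same read-only data is fetched over and over.
            for (uint j = 0; j < n; ++j) {
                float3 q = B[j];                 // repeated, read-only load of B
                if (distance(p, q) < 0.1f)       // neighbor test (radius illustrative)
                    reduced += q.x + q.y + q.z;  // toy reduction
            }
            C[i] = reduced;
        }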

         

         

        Question:

        Since the kernel process repeatedly loads the same regions of B (even though I use threadgroup memory), and B is read-only in this kernel, I thought it would be a good idea for performance to cache the buffer B – to transfer it into a highly cached portion of memory optimized for repeated reads. AFAIK such a portion of memory is the constant address space. Is it possible to do the following? And if it is, then how?

        1. Call the kernel preprocess(A, B) [this updates the contents of buffer B – hence buffer B cannot be in the constant address space]
        2. Transfer the buffer B (containing the new, updated values from previous step) into constant address space.  <——— How do I do this?
        3. Call the kernel process(B, C) [read repeatedly values from the buffer B which is in constant address space (improving performance)]
        4. Call the kernel postprocess(A, B, C)
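        What I imagine (but haven't verified) is that step 2 is not a copy at all, but just a matter of declaring the parameter with a different qualifier – a sketch of that guess:

        #include <metal_stdlib>
        using namespace metal;

        // The same MTLBuffer is bound for both kernels; only the parameter's
        // address space qualifier differs per kernel.
        kernel void preprocess(device const float3* A [[buffer(0)]],
                               device float3*       B [[buffer(1)]],
                               uint                 i [[thread_position_in_grid]])
        {
            B[i] = A[i];     // B is written here, so it must be device
        }

        kernel void process(constant float3* B [[buffer(0)]],
                            device float*    C [[buffer(1)]],
                            uint             i [[thread_position_in_grid]])
        {
            C[i] = B[i].x;   // B is only read here – can it simply be constant?
        }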

         

        Extra question:

        When I create a buffer at app initialization, I don't specify which address space it belongs to, so when is it decided that a buffer belongs to the device or constant address space? Is it per command buffer, per encoder, or per individual pipeline state of one encoder? Or does a buffer's position among the address spaces change throughout the app's lifetime?

         

        Example: Say I have 1 buffer and 2 kernels:

        // CPU side:
        let data = device.makeBuffer(...)

        commandEncoder.setComputePipelineState(fooPipeline)
        commandEncoder.setBuffer(data, offset: 0, index: 0)
        commandEncoder.dispatchThreadgroups(...)

        commandEncoder.setComputePipelineState(barPipeline)
        commandEncoder.setBuffer(data, offset: 0, index: 0)
        commandEncoder.dispatchThreadgroups(...)

        commandEncoder.endEncoding()

        // GPU side:
        kernel void foo(device float* data)   {...}  // foo writes to data
        kernel void bar(constant float& data) {...}  // bar only reads from data

         

        Where is the buffer data stored (in the device or the (cached) constant address space)? What would happen if I swapped the order in which foo and bar are dispatched?

        • Re: Confusion over the `constant` memory address space and memory layout
          KMT Level 9 (15,185 points)

              >when is it decided that a buffer belongs into device or constant address space?


          See: https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/func-var-qual/func-var-qual.html


          "All arguments to a graphics (vertex or fragment) or compute function that are a pointer or reference to a type must be declared with an address space qualifier. For graphics functions, an argument that is a pointer or reference to a type must be declared in the device or constant address space.


          The address space for a variable at program scope must be constant.

          Any variable that is a pointer or reference must be declared with one of the address space qualifiers discussed in this section. If an address space qualifier is missing on a pointer or reference type declaration, a compilation error occurs."
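          A minimal illustration of that rule (my own example, not from the docs):

          #include <metal_stdlib>
          using namespace metal;

          kernel void ok(device float*   a,   // fine: qualified as device
                         constant float* b)   // fine: qualified as constant
          { }

          // kernel void bad(float* c) { }    // error: missing address space qualifier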


          Best of luck w/your homework

            • Re: Confusion over the `constant` memory address space and memory layout
              Marek Simonik Level 1 (0 points)

              Thanks for the reply.

              First I want to point out that this is not homework – I'm trying to optimize nearest-neighbor traversal for a personal project (and to understand the internals of Metal).

               

              I had read that documentation page several times before I posted this question, but I wasn't able to find an answer to my problem in it (i.e. I believe your citation doesn't address my problem – please take a look at the example code at the bottom of my original question).

               

              Another formulation of the question you tried to answer: there's one Metal buffer created at app initialization. Creating a buffer means (from my point of view) reserving an address range of n bytes in memory – is that device or constant memory (or are those two internally the same)? Can the same buffer (i.e. the same bytes in memory) be in the device space for one kernel and in the constant space for another kernel?

               

              The cause of my confusion probably comes from my weak, merely intuitive definition of the constant and device memory address spaces (I haven't been able to find the difference between them precisely defined in terms of hardware and internal implementation) and from my unjustified assumptions about them.

              Unfortunately, the most technical material Apple has published on the internal implementation (that I was able to find) is the WWDC 2016 session Advanced Metal Shader Optimization (which doesn't go into much detail on my problem).

               

              In other words: what does it mean for a buffer to be in the constant or device address space? Based on the fact that device and constant are called spaces, I'd assume that in GPU memory there are 2 discrete pools of memory – constant and device. But on the other hand, when creating a new buffer, I don't specify where the chunk of data representing my buffer belongs (to the constant or to the device pool) – which would suggest that there's only one pool of address space in GPU memory that stores MTLBuffers.

              Then, when dispatching a particular kernel, a decision would be made based on the address space qualifier on how to process the data: if a buffer parameter is qualified as device, the GPU does nothing special and just reads/writes the one memory pool of MTLBuffers. When it's qualified as constant, the GPU could copy the entire buffer to some fast on-chip cache (de facto duplicating the buffer from the pool) and then just operate on the cached data (and since it's guaranteed we won't write into a constant buffer, no synchronization is required) – but again, that contradicts the term address space.

              Moreover, in the aforementioned WWDC session (take a look at page 11 in the slides), there's a diagram showing the constant and device spaces as discrete pools of memory, which again contradicts the fact that we don't specify a buffer's address space membership at creation time... (for me, it's just a circle of confusion).

                • Re: Confusion over the `constant` memory address space and memory layout
                  KMT Level 9 (15,185 points)

                  I understood the circle of confusion, which is why I felt it necessary to cite the docs rather than just link to them. I like to keep things simple and work out from there. Remember, they are reference material, not a tutorial, so it's best to read them without over-analysis...therein lies confusion.

                   

                  If you think they are wrong/misleading, feel free to file bugs against them.

                   

                  Good luck.

                  • Re: Confusion over the `constant` memory address space and memory layout
                    wcm Apple Staff (335 points)

                    It may be useful to think of these qualifiers not as mapping to discrete pools of memory, but as describing the intended memory access patterns. By following the advice given in the documentation and WWDC sessions, you'll be able to achieve the best possible performance, regardless of target hardware.

                      • Re: Confusion over the `constant` memory address space and memory layout
                        Marek Simonik Level 1 (0 points)

                        I understand that the concrete implementation of those qualifiers depends on the GPU architecture. But from now on, let's assume just the iOS platform and the "latest" (until tomorrow) GPU families (A9, A9X SoCs) with a unified memory model:

                        1. Are both constant and device buffers stored in one common memory location (the RAM)?
                        2. When a buffer is specified as constant for a specific kernel, does it get copied somewhere (the L2 cache?), or is it accessed from the same memory as a device-qualified buffer (but with some optimizations)?
                        3. In the WWDC session, the following is said about constant buffer preloading: "...So the idea here is that rather than loading through the constant address space, what we can actually do is take your data and put it into special constant registers that are even faster for the ALU to access...". What are those constant data registers? Are those the same on-chip registers that hold contexts for individual threads?
                        4. What happens when I use too much constant memory? Does the excess constant memory get spilled into global memory (i.e. device, a.k.a. RAM), resulting in extra latency?

                         

                        Finally, and most importantly: does your advice to "think of these qualifiers not as mapping to discrete pools of memory, but as describing the intended memory access patterns" mean that I don't need to perform step 2 from my original question explicitly (step 2 being "transfer the buffer B into the constant address space"), because the HW/Metal implementation will do it implicitly?

                         

                        Thank you.

                          • Re: Confusion over the `constant` memory address space and memory layout
                            trojanfoe Level 1 (10 points)

                            These are good questions and the responses seem very evasive. Did you manage to get your answer?

                             

                            The only semantic difference I have found so far is that Metal will complain if a `constant` buffer is not aligned on a 256-byte boundary (or rather, if the buffer's offset isn't). If, however, data in the `constant` address space is copied somewhere fast, then that would make it useful for the global uniforms in my implementation, and I would like to know this detail.
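                            For what it's worth, I work around the alignment complaint by rounding offsets up before calling setBuffer(_:offset:index:) – a sketch (encoder, uniformsBuffer and MyUniforms are placeholder names from my own code):

                            // Round a byte offset up to the next 256-byte boundary before
                            // binding it for an argument declared `constant`.
                            func alignUp(_ offset: Int, to alignment: Int = 256) -> Int {
                                precondition(alignment & (alignment - 1) == 0)  // power of two
                                return (offset + alignment - 1) & ~(alignment - 1)
                            }

                            // e.g. packing two uniform structs into one buffer:
                            let secondSlot = alignUp(MemoryLayout<MyUniforms>.stride)
                            encoder.setBuffer(uniformsBuffer, offset: secondSlot, index: 1)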