Posts

Post not yet marked as solved
2 Replies
Especially on iOS, you can find yourself in this position if your application uses too much memory. Applications, and especially daemons, are often sandboxed for memory usage, and exceeding the limit will produce an exception of this kind. You can read more about it here: https://developer.apple.com/documentation/metrickit/improving_your_app_s_performance/reducing_your_app_s_memory_use

If you simply crash in this vImage function with EXC_BAD_ACCESS or similar, then that is a strong indication that the buffer passed to vImage was too small, nonexistent, or not writable. The vImage routine simply writes a byte to the last valid byte in the vImage_Buffer to see what will happen. If all is well, it goes through and nobody notices. If the buffer is too small (depressingly common), then the write may run off the end of the buffer and cause a crash. This is a signal that the bug is in the caller's code and not a problem in vImage. (As the framework ultimately responsible for touching potentially improperly configured pixel buckets, vImage would otherwise find itself having to diagnose a lot of other people's bugs, so the check is there as a means of self defense.)

You can use vImageBuffer_Init to allocate buffers for you to help reduce this sort of error. Otherwise, you can use vmmap <pid> to dump a list of the application's allocations and check whether the crashing data address is in the buffer it is supposed to be in, or somewhere off the end. vmmap will also tell you whether the allocation is read-only or read-write. If you aren't the one calling vImage directly, then the error might lie in the caller's code, or potentially could be in an incorrectly allocated CFData, CGImageDestination, or similar. Where are those pixels supposed to end up? Look there.

Note that vmmap shows the view of memory from the perspective of virtual memory -- the set of large, page-aligned allocations that underlie what is going on in the system. It is also the granularity along which memory protection functions. For smaller allocations, malloc() allocates a biggish chunk of page-aligned memory and then sub-allocates from there as needed. So, while you will see the large page-aligned region that contains your 16-byte allocation, that 16-byte allocation won't be listed individually as such, unless MallocGuardEdges, GuardMalloc, or similar is on.
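For reference, a minimal sketch of letting vImage size and allocate the destination buffer for you, rather than filling in a vImage_Buffer by hand (the dimensions and pixel format here are just illustrative):

    #include <Accelerate/Accelerate.h>

    // Let vImage pick rowBytes and allocate storage instead of guessing by hand.
    // Example: 8-bit, 4-channel (ARGB8888) destination.
    vImage_Buffer dest;
    vImage_Error err = vImageBuffer_Init( &dest,
                                          1080,      /* height */
                                          1920,      /* width  */
                                          32,        /* bits per pixel */
                                          kvImageNoFlags );
    if( err != kvImageNoError || dest.data == NULL )
    {
        // handle allocation failure
    }

    // ... pass &dest to the vImage call that was crashing ...

    free( dest.data );   // storage from vImageBuffer_Init is released with free()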
Post not yet marked as solved
4 Replies
Keep the GPU busy

The GPU clock slows way down when it is asleep and takes a long time to come back up again. If you are submitting a lot of short jobs with small breaks in between -- just the time to get the job back on the CPU, look at it, and queue up the next one is enough to cause problems -- then it will go to sleep and take a very long time to come back. We have measured 2-4x performance loss due to this in the lab, even on extremely large machine learning workloads. These are enormous. Your workload is not going to be any better off. You need to be pushing work to the GPU in such a way that when one command buffer completes, the next one is already fully queued up and ready to go, so that the GPU can move seamlessly from one to the next without skipping a beat. Small ad hoc perf experiments invariably get this wrong. The GPU cycles down, takes a long time to spin back up again, not to mention the time to just wake it up, and all you measure is overhead.

Use MTLHeaps

It can very easily take longer for the CPU to allocate and wire down memory than it will take the GPU to run its full workload using that memory. While developing Metal Performance Shaders, we found that even hand-tuned kernels would still run slower than the CPU if we did not keep the memory allocation under control. This is why MPS goes through a lot of effort to provide temporary MPSImages and buffers. Temporary images and buffers are backed by a MTLHeap. The MPS heap is used to recycle memory over the course of the command buffer, and can also migrate from command buffer to command buffer if the time interval is short enough. Even if you don't use MPSKernels, you can use it in your program by making use of MPSTemporaryImages and buffers.

Why is a heap critical? Would you write your ordinary CPU-based application by setting up all of your storage needs up front at compile time as global arrays? No. Of course you wouldn't! Not only is it a major hassle to anticipate everything that might ever happen, you would probably also waste tons of memory statically allocating for the worst case, and more memory still by failing to do enough analysis on your application workflows to find ways to reuse and alias memory whenever possible to keep the overall allocation size down. This reuse is also beneficial for the caches. For a complex enough program, it is quite possible your memory needs are indeterminable, or so large that the program will be jetsammed for consuming too much. Consider: why is so much energy devoted online to memory-safe languages, as if nothing could otherwise be done about the heap? I mean, you could statically allocate everything up front, and thereby never leak any memory again! This has always been possible in C…. Well, the reason is that the heap is in fact AWESOME, and it is inconceivable not to use it. The question is really just how to use it safely. <Insert unending religious argument here>

So, it should not be a surprise to any GPU programmer that statically allocating writable MTLResources up front is a bad idea. Just because it is easy doesn't mean it is a good idea. Your application should use MTLHeaps to allocate and deallocate MTLResources over the course of the command buffer, or multiple command buffers, as appropriate. In this way, memory can be reused and the cost of allocating tons of memory per command buffer eliminated. Only then can the GPU shine.
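If you are using MPS, MPSTemporaryImage gives you this heap behavior with very little code. A minimal sketch, where the command queue, kernels (an MPSCNNConvolution and pooling kernel here), input/output images, and sizes are all assumed to exist and are purely illustrative:

    #import <MetalPerformanceShaders/MetalPerformanceShaders.h>

    // Intermediate results come from the MPS heap attached to the command buffer
    // instead of permanently allocated MTLTextures.
    id<MTLCommandBuffer> cmdBuf = [queue commandBuffer];

    MPSImageDescriptor *desc =
        [MPSImageDescriptor imageDescriptorWithChannelFormat: MPSImageFeatureChannelFormatFloat16
                                                       width: 512
                                                      height: 512
                                             featureChannels: 32];

    // Backing store is allocated lazily from the heap at first use and is
    // recycled when readCount reaches 0.
    MPSTemporaryImage *intermediate =
        [MPSTemporaryImage temporaryImageWithCommandBuffer: cmdBuf
                                           imageDescriptor: desc];
    intermediate.readCount = 1;   // it will be read exactly once

    [convolution encodeToCommandBuffer: cmdBuf
                           sourceImage: inputImage
                      destinationImage: intermediate];
    [pooling encodeToCommandBuffer: cmdBuf
                       sourceImage: intermediate     // readCount drops to 0 here
                  destinationImage: outputImage];

    [cmdBuf commit];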
For MPS, which can't know the full nature of its workload in advance, complicated by the fact that the MTLHeap is not dynamically resizable, this meant solving the problem at two levels. For simple usage, a largish heap is speculatively allocated ahead of time, in a fashion similar to how malloc allocates large chunks of memory as needed and then sub-allocates from there for smaller malloc calls. We attached it to the MTLCommandBuffer, which provides a nice linear timeline for memory usage so that mere code can reason about when each bit is used and for how long, as long as no kernels are running concurrently. (This can be problematic when both render and compute encoders are running, unfortunately.) It also provides a time -- command buffer completion -- when we can safely tear down the whole thing and return the memory to the system. For more complicated workloads like MPSNNGraph, the entire workload is introspected ahead of time, a high-water mark is determined, and only then is the heap allocated; if the estimate proves incorrect, more heaps are allocated as needed to back additional MTLResources. This works because MPSTemporaryImages and buffers do not allocate their backing store at creation, but defer it to first use, and of course retire their exclusive-use right on the backing store when the readCount reaches 0. The MPSTemporaryImage does know, however, how big its allocation will be before this occurs, so it is possible to traverse the entire graph, making all MPS resources, then determine how big they are, then make a MTLHeap to hold them, and only then allocate all the underlying MTLResource objects just in time for encoding. I have long felt the MTLCommandBuffer should have a feature that does just this! But until it does, this is your job.

Compile offline

Your CPU code is compiled offline, long before the user sees it. This can take quite a while, and is certainly not something you'd want to attempt every time your app is launched. So, don't do it on the GPU either. Just as on the CPU, jitting from source to ready-to-run code at the time you need it can easily take more time than it takes to run the code. To avoid this problem, compile your kernels to a .metallib ahead of time and load them as needed. If you think your code would benefit from jitting to remove expensive but unused special cases, for example, then make use of Metal function constants to turn that behavior on and off. This will let you avoid the expensive front end of the compiler, which is most of the cost, and enjoy the benefit of jitting the code without paying for jitting the code from source.

Get these overheads out of the way, and we can begin to have a discussion about how to write a fast kernel.
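A minimal sketch of what that can look like; the library name, kernel name, and constant index are made up for illustration:

    #import <Metal/Metal.h>

    // Load a library compiled offline into the app bundle (no source compile at
    // runtime), then specialize it with a function constant rather than jitting
    // a different source variant.
    NSError *error = nil;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    NSURL *libURL = [[NSBundle mainBundle] URLForResource: @"Kernels"
                                            withExtension: @"metallib"];
    id<MTLLibrary> library = [device newLibraryWithURL: libURL error: &error];

    // Matches [[function_constant(0)]] in the shader source.
    bool useSpecialCase = false;
    MTLFunctionConstantValues *constants = [MTLFunctionConstantValues new];
    [constants setConstantValue: &useSpecialCase
                           type: MTLDataTypeBool
                        atIndex: 0];

    id<MTLFunction> fn = [library newFunctionWithName: @"myKernel"
                                       constantValues: constants
                                                error: &error];
    id<MTLComputePipelineState> pso =
        [device newComputePipelineStateWithFunction: fn error: &error];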
Post not yet marked as solved
4 Replies
The useful window for the GPU appears to be gated from both ends. On one side, there is substantial time spent copying all your data to wired-down memory in MTLResources, encoding your workload to a command buffer, jitting code, waiting for the GPU to become available (possibly including waking up the GPU), actually running the code, copying the data back to system memory (if discrete), and then waiting for your CPU thread to be swapped in to receive the data. 2.5 ms does not at all sound unreasonable for this to take on older devices. On the other end, there is a watchdog timer running that will kill your GPU workload if it runs more than a few seconds. This is there to keep UI frame rates up. The GPU doesn't preemptively multitask well, so if your job runs a long time, the user interface may not be able to refresh and the machine appears to freeze. Ideally, your workload should be done in the time it takes to refresh the screen, less than 1/60th of a second.

So the useful workload appears to be one that is large enough not to suffer unduly from 2.5 ms of overhead for involving the GPU, and one that does not run so long as to trigger the watchdog timer or damage UI interactivity. This is a fairly narrow window, and you would not be blamed if you naively came to believe that GPU compute is doomed. Importantly, however, nearly all of the GPU overhead described above occurs because of poor program design!

The REAL lesson here is that you should pick one device, CPU or GPU, and stick with it! Don't bounce data back and forth all the time. (There is a similar lesson to be learned for people learning to use the CPU vector units, though of course it manifests on much smaller time scales.) Which should you choose? Start with where your data is and where it will be consumed. If the data starts and ends on the CPU, then except for the largest workloads, stick with the CPU. Look to see what easy wins you can get from Accelerate.framework and GCD (a minimal sketch follows below). Maybe even try your hand at writing some vector code. If that still isn't enough, or you need the CPU for other things, then you might need to go to the GPU. If the data starts and ends life on the GPU, then obviously use the GPU for everything if you can. To be clear, the GPU is not good at inherently serial workloads like Huffman decoding, so some things that are not parallelizable just don't belong there.

If you do decide to use the GPU, you have to understand that you are working on a high-throughput machine with consequently less tolerance for overhead from other factors, so additional work will be needed on your end to make sure these do not become a problem. Many of these things could also pose problems on the CPU, but the working environment there is structured to make it really hard to do them, so you don't run into them and they aren't a problem. Metal makes all of these things possible, and more often than not does not go out of its way to make practices deemed harmful difficult. Any one of them has the potential to cause large (factor of 2-10) losses in performance, and alone can make the GPU run slower than the CPU. In combination, well… it doesn't look good. I'll detail essential GPU survival strategies in the posts that follow:
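As a taste of those easy wins, here is a minimal sketch of handing an elementwise operation to Accelerate and splitting independent rows across cores with GCD. The function names and sizes are illustrative only:

    #include <Accelerate/Accelerate.h>
    #include <dispatch/dispatch.h>

    // Elementwise add of two float arrays using vDSP (Accelerate).
    void AddArrays( const float *a, const float *b, float *result, size_t count )
    {
        vDSP_vadd( a, 1, b, 1, result, 1, count );
    }

    // Split independent rows of work across cores with GCD.
    void ProcessRows( float *image, size_t rowCount, size_t rowLength )
    {
        dispatch_apply( rowCount,
                        dispatch_get_global_queue( QOS_CLASS_USER_INITIATED, 0 ),
                        ^( size_t row ){
            float *p = image + row * rowLength;
            // per-row work, e.g. scale the row in place
            float scale = 2.0f;
            vDSP_vsmul( p, 1, &scale, p, 1, rowLength );
        });
    }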
Post not yet marked as solved
5 Replies
Sometimes this can happen normally when the buffer you are writing to either isn't resident in the cache, or isn't resident in memory at all, because it was recently allocated and needs to be zero-filled by the kernel after the pages are made resident (and probably other pages evicted). In the latter case, you are seeing hundreds of stores complete normally, and then in the one loop iteration that hits a new page, the store takes a very long time due to the VM trap. You are more likely to see this in situ in an app than in some tight benchmark loop that is reusing the same memory over and over. If these effects are the ones causing trouble for you, calling memset on the buffer before you write to it should move the stall to memset, assuming the buffer is not so large that it doesn't fit in the cache, causing other problems. This won't make things go any faster, but you'll at least see your code running as you thought it would, which lets you rule out instruction selection as the problem and go see about reusing memory more effectively. (A sketch of this appears at the end of this post.)

"As I understand latest arm chips have out of order execution, so, even arm.com docs say that it's not easy to predict exact timing."

Out-of-order execution, but in-order completion, so that the program order is observed to occur as written and we don't get unpredictable results. What usually (not always) happens with relatively straight-line, non-branchy code on out-of-order machines is that the out-of-order buffers fill up with work as intended, and the pipeline is thereafter limited by the rate at which instructions can retire. New instructions can not enter the pipeline until others retire to make room. Instructions can not retire if the work is not done, and the program counter where the samples land is not updated until the instruction retires, so the pattern of long instructions showing up as stalls in traces reasserts itself even with out-of-order execution. Most vector loops have this behavior. If the machine is able to retire some number of instructions per cycle, then that is the pattern you see. For example, a set of Intel processors at one point could retire 4 instructions per cycle, and if you had well-tuned code with no microcode expansion or long instructions like division going on, you'd see the samples landing every 4 instructions. When you are not seeing instructions retiring at 4 per cycle, but perhaps only managing one or two, that is an indication of either a stall or an instruction that was broken down into many operations as microcode. Microcode can both load a pile of unobservable operations into the pipeline, slowing things down, and introduce decoder stalls that prevent other instructions from being decoded. So, I try to avoid microcoded instructions in my own code. You can look up the work of Agner Fog online for a list of Intel microcoded instructions. I am not aware of a similar effort for Apple Silicon. That said, sometimes you can just tell. If an instruction is doing multiple very different things, such as store with update -- we have address calculation, a data store, and an update of a general purpose register -- then it is a very tempting target for a hardware engineer to involve different ALUs to accomplish each part, which means the instruction will have to be microcoded.
Sometimes machines have magical boundaries that optimize instructions that are cracked into two micro-ops but not three or more (for example), so depending on the microarchitecture, some limited microcoding can be okay, but really complex ops with lots of micro-operations are much more likely to be bad news. You see a lot of this in older ISAs like arm32 / Thumb, and not so much in newer ones like arm64, which emphasize more RISC-style instructions over complex ones. If I didn't have specs on the microarchitecture, I might try removing the update on the store and see if that makes any difference. You can also try compiling the code in C and seeing what the compiler does. The compiler writers at Apple do have specs on the hardware and tune the compiler accordingly, so if the compiler is not emitting a store with update, that may be a sign that either the compiler writer missed an important and obvious optimization, or the compiler writer knows something you don't and the instruction is bad news!! The former outcome might have been more common 20 years ago due to manpower constraints, but these days, I'd be betting on the back-end engineer knowing what he is doing and having had the time to do the right thing. There are of course many, many right things to do.
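Returning to the zero-fill point above, a minimal sketch of moving the page-fault cost out of the loop you are measuring by touching the buffer first (the size is illustrative):

    #include <stdlib.h>
    #include <string.h>

    // Freshly allocated pages are not resident; the first store to each page
    // takes a VM trap while the kernel zero-fills it. Touching the buffer up
    // front moves that cost into memset, so the timed loop shows the cost of
    // the stores themselves.
    size_t count = 1 << 20;
    float *dst = malloc( count * sizeof *dst );
    memset( dst, 0, count * sizeof *dst );       // fault + zero-fill happens here

    // ... start timing ...
    for( size_t i = 0; i < count; i++ )
        dst[i] = (float) i;                      // now measures stores, not faults
    // ... stop timing ...

    free( dst );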
Post marked as solved
4 Replies
Note that __fp16 will only emit the hardware conversion on Intel if you pass -mf16c to the compiler. Fortunately, all Big Sur supported machines support the conversion, but I don't believe it is on by default yet for x86_64 targeting that OS. It is an Ivy Bridge or later instruction. If you don't pass -mf16c, then you'll get a software conversion, which is not very cheap. On Apple Silicon, the appropriate instructions have been there since Cortex A9 and should be there on any arm64 machine. If you need to convert a whole bunch of half-float values in an array, please see vImageConvert_PlanarFtoPlanar16F and vImageConvert_Planar16FtoPlanarF. If you need to test whether the machine supports the instruction, you can use sysctlbyname( "hw.optional.f16c", ...)
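A minimal sketch of both of those, with illustrative buffer sizes; note the sysctl key only exists on Intel machines, so treat a missing key as "not supported" there:

    #include <sys/sysctl.h>
    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>

    // 1) Runtime check for the F16C conversion instructions (Intel).
    int hasF16C = 0;
    size_t size = sizeof hasF16C;
    if( sysctlbyname( "hw.optional.f16c", &hasF16C, &size, NULL, 0 ) != 0 )
        hasF16C = 0;    // key absent: treat as unsupported

    // 2) Bulk conversion of an array with vImage, planar float -> half float.
    //    Wrap the arrays as 1-row vImage_Buffers.
    size_t count = 4096;
    float    *src = malloc( count * sizeof(float) );
    uint16_t *dst = malloc( count * sizeof(uint16_t) );

    vImage_Buffer srcBuf = { .data = src, .height = 1, .width = count,
                             .rowBytes = count * sizeof(float) };
    vImage_Buffer dstBuf = { .data = dst, .height = 1, .width = count,
                             .rowBytes = count * sizeof(uint16_t) };

    vImage_Error err = vImageConvert_PlanarFtoPlanar16F( &srcBuf, &dstBuf,
                                                         kvImageNoFlags );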
Post not yet marked as solved
3 Replies
PART 2:

    // make a colorspace conversion recipe
    CGColorConversionInfoRef recipe = CGColorConversionInfoCreateWithOptions( exrColorSpace, desiredColorSpace, NULL /* options */ );

    vImage_CGImageFormat exrFormat = (vImage_CGImageFormat)
    {
        .bitsPerPixel = pixelSize * 8,
        .bitsPerComponent = channelSize * 8,
        .colorSpace = exrColorSpace,
        .bitmapInfo = kCGBitmapFloatComponents | kCGImageAlphaLast | (channelSize == 2 ? kCGBitmapByteOrder16Host : kCGBitmapByteOrder32Host)
    };
    vImage_CGImageFormat textureFormat = (vImage_CGImageFormat)
    {
        .bitsPerPixel = pixelSize * 8,
        .bitsPerComponent = channelSize * 8,
        .colorSpace = desiredColorSpace,
        .bitmapInfo = kCGBitmapFloatComponents | kCGImageAlphaLast | (channelSize == 2 ? kCGBitmapByteOrder16Host : kCGBitmapByteOrder32Host)
    };
    vImage_Flags vImageFlags =
#if DEBUG
        kvImagePrintDiagnosticsToConsole;
#else
        kvImageNoFlags;
#endif
    vImageConverterRef converter = vImageConverter_CreateWithCGColorConversionInfo( recipe, &exrFormat, &textureFormat, NULL, vImageFlags, &vImageError);

    vImage_Buffer converted = buf;
    if( converter && kvImageNoError != vImageConverter_MustOperateOutOfPlace( converter, &buf, &converted, kvImageNoFlags))
    {
        vImageBuffer_Init( &converted, buf.height, buf.width, pixelSize * 8, kvImageNoFlags);
        if( NULL == converted.data)
        {
            // handle malloc failure
        }
    }

    if( tex )
        for( unsigned long mipLevel = 0; mipLevel < mipLevelCount; mipLevel++ )
        {
            // Make a decoder for the part of the file you want to read
            //    default parameters will read the default layer from the first part
            //  axr_decoder_create_rgba is a simplified interface for RGBA data. If you are after depth information,
            //  you'll possibly want to take the longer route with axr_decoder_create() and configure manually.
            axr_decoder_t decoder = axr_decoder_create_rgba( axrData, desiredLayerName, desiredPartition, mipLevel, myFlags);
            if( NULL == decoder )
            {
#if ! __has_feature(objc_arc)
                os_release(axrData);
                [tex release];
#endif
                vImageConverter_Release(converter);
                free(buf.data);
                *e = -1;
                return nil;
            }

            axr_decoder_info_t info = axr_decoder_get_info(decoder, axr_decoder_info_current);
            unsigned long width = info.subregion.size.width;
            unsigned long height = info.subregion.size.height;

            if( axr_error_success == (axrErr = axr_decoder_read_rgba_pixels( decoder, buf.data, buf.rowBytes, 1.0, myFlags )))
            {
                if( converter )
                {
                    buf.width = converted.width = width;
                    buf.height = converted.height = height;
                    // Note that a similar colorspace conversion is also available on the GPU using MPSImageConversion
                    vImageError = vImageConvert_AnyToAny( converter, &buf, &converted, NULL, kvImageNoFlags);
                    // handle error...
                }
                [tex replaceRegion: (MTLRegion){{0,0,0}, {width, height, 1}}
                       mipmapLevel: mipLevel
                             slice: 0
                         withBytes: converted.data
                       bytesPerRow: converted.rowBytes
                     bytesPerImage: height * converted.rowBytes ];
            }
#if ! __has_feature(objc_arc)
            os_release(decoder);
#endif
        }

    vImageConverter_Release(converter);
    if( converted.data != buf.data )
        free(converted.data);
    free(buf.data);
#if ! __has_feature(objc_arc)
    os_release(axrData);
#endif
    return tex;
}

AppleEXR is new in Big Sur and the associated iOS / iPadOS / tvOS / watchOS releases.
Post not yet marked as solved
3 Replies
If you are tasked with supporting the Full Monty, I recommend using AppleEXR.dylib and vImage instead. The pathway for mipmaps with full colorspace conversion looks something like this:

PART 1:

id <MTLTexture> MakeMipMapEXRTexture( int argc, const char *argv[], int * e )
{
    int error = 0;
    const axr_flags_t myFlags =
#if DEBUG
        axr_flags_print_debug_info;
#else
        axr_flags_default;
#endif
    void * fileData = mmap(...);
    axr_error_t axrErr = axr_error_success;

    // Note: in C++ many of the function arguments below have default values and can be omitted

    // create a axr_data_t to represent the file. See also axr_introspect_data()
    axr_data_t axrData = axr_data_create( fileData, fileSize, &axrErr, myFlags,
                                          ^( void * __nonnull fileData, size_t fileSize ){ munmap( fileData, fileSize);});
    if( NULL == axrData )
    {
        *e = (int) axrErr;
        return nil;
    }

    // Look for the layer and part that I want in the file
    // Each EXR file may be segmented up into many parts
    unsigned long desiredPartition = 0;
    const char * desiredLayerName = NULL;
    unsigned long partitionCount = axr_data_get_part_count(axrData);
    unsigned long mipLevelCount = 0;
    bool found = false;
    for( unsigned long part = 0; part < partitionCount && ! found; part++ )
    {
        axr_part_info_t partInfo = axr_data_get_part_info( axrData, part, axr_part_info_current);

        // and each part may have many layers!!
        unsigned long layerCount = axr_data_get_layer_count( axrData, part);
        for( unsigned long layer = 0; layer < layerCount; layer++)
        {
            axr_layer_info_t layerInfo = axr_data_get_layer_info( axrData, part, layer, axr_layer_info_current);

            if( false == IsThisTheLayerToDisplay( &partInfo, &layerInfo ))
                continue;

            desiredPartition = part;
            desiredLayerName = layerInfo.name;
            mipLevelCount = axr_data_get_level_count( axrData, part);
            found = true;
            break;
        }
    }

    // Figure out how big to make the buffer to receive the pixels
    axr_decoder_t tempDecoder = axr_decoder_create_rgba( axrData, desiredLayerName, desiredPartition, 0, myFlags);
    axr_type_t sampleType = axr_decoder_get_channel_info(tempDecoder, 0, axr_channel_info_current).sampleType;
    uint32_t channelSize = (uint32_t) axr_type_get_size(sampleType);
    uint32_t pixelSize = 4 * channelSize;       // RGBA = 4 channels
    axr_decoder_info_t info = axr_decoder_get_info(tempDecoder, axr_decoder_info_current);
    vImage_Buffer buf;
    vImage_Error vImageError = vImageBuffer_Init( &buf, info.subregion.size.height, info.subregion.size.width, pixelSize * 8 /* bits per byte */, kvImageNoFlags);
    if( NULL == buf.data )
    {
        // ...handle malloc failure ...
    }

    // read the desired content from the file into buf as RGBA pixels
    id <MTLDevice> device = MTLCreateSystemDefaultDevice();
    MTLTextureDescriptor * desc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat: channelSize == 2 ?
                                                                  MTLPixelFormatRGBA16Float : MTLPixelFormatRGBA32Float
                                                                                      width: buf.width
                                                                                     height: buf.height
                                                                                  mipmapped: mipLevelCount > 1];
    // don't forget an autoreleasepool when working with Metal, especially command buffers!!
    id <MTLTexture> tex = [device newTextureWithDescriptor: desc];
#if ! __has_feature(objc_arc)
    [desc release];
#endif

    // sort out colorspace
    CGColorSpaceRef exrColorSpace = axr_decoder_create_rgba_colorspace( tempDecoder, myFlags, NULL);
    CGColorSpaceRef desiredColorSpace = CGColorSpaceCreateWithName( kCGColorSpaceExtendedSRGB );   // or whatever colorspace you are using for your rendering surface in Metal
    // Caution: extended linear HDR to SDR colorspace conversions on Big Sur don't do tone mapping yet,
    //          so large out-of-[0,1] values would get clamped to [0,1], possibly causing noticeable hue shifts.
    //   I used kCGColorSpaceExtendedSRGB so this wouldn't happen.
    //   EXR files may put values > 1 or < 0 into your shader.
#if ! __has_feature(objc_arc)
    os_release(tempDecoder);
#endif

<continued...>
Post not yet marked as solved
3 Replies
The easy answer for trivial cases is to use the CGImageSourceRef:

int main(int argc, const char * argv[])
{
    int error = 0;
    id <MTLTexture> tex = MakeMipMapEXRTexture(argc, argv, &error);

    // load arg1 as the path to the EXR file
    CFStringRef s = CFStringCreateWithCString(NULL, argv[1], kCFStringEncodingUTF8 );
    CFURLRef url = CFURLCreateWithFileSystemPath(NULL, s, kCFURLPOSIXPathStyle, false);
    CFRelease(s);

    // Create a CGImageSource and make a CGImage out of it
    CGImageSourceRef source = CGImageSourceCreateWithURL( url, NULL);
    CFRelease(url);
    CGImageRef image = CGImageSourceCreateImageAtIndex(source, 0, NULL);
    CFRelease(source);

    // Create a RGBAf16 context and draw the image into it
    CGContextRef context = CGBitmapContextCreate( NULL,
                                                  CGImageGetWidth(image),
                                                  CGImageGetHeight(image),
                                                  16,
                                                  CGImageGetWidth(image) * 8,
                                                  CGColorSpaceCreateWithName( kCGColorSpaceSRGB ),
                                                  kCGBitmapByteOrder16Host | kCGImageAlphaPremultipliedLast | kCGBitmapFloatComponents );
    CGRect where = CGRectMake(0, 0, CGImageGetWidth(image), CGImageGetHeight(image));
    CGContextClearRect( context, where);
    CGContextDrawImage( context, where, image);
    CGContextFlush(context);
    unsigned long width = CGImageGetWidth(image);
    unsigned long height = CGImageGetHeight(image);
    void * bytes = CGBitmapContextGetData(context);
    size_t rowBytes = CGBitmapContextGetBytesPerRow(context);

    CGImageRelease(image);

    @autoreleasepool {
        id <MTLDevice> device = MTLCreateSystemDefaultDevice();

        MTLTextureDescriptor * d = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat: MTLPixelFormatRGBA16Float
                                                                                      width: width
                                                                                     height: height
                                                                                  mipmapped: NO];
        id <MTLTexture> tex = [device newTextureWithDescriptor: d];
        [tex replaceRegion: (MTLRegion){ {0,0,0}, {width, height, 1}}
               mipmapLevel: 0
                 withBytes: bytes
               bytesPerRow: rowBytes];
    }

    CGContextRelease(context);
    return 0;
}

That said, this isn't going to get you very far with complex cases like cube maps, ripmaps/mipmaps, and depth buffers, which are also representable in OpenEXR. So there is that. You also need to pay attention to color with OpenEXR. It is encoded in linear gamma, defined by a set of chromaticities, and not necessarily something simple like sRGB. You have to actually look at the chromaticities, for example to distinguish between sRGB, another RGB, or XYZ, and sometimes the data is YCbCr, which is another level of stuff. Insofar as colorspace conversions go, if your drawing pipeline is in linear gamma already, then great! Otherwise, you may find yourself having to do some color conversions so that the artwork in your assets isn't completely off.
Post not yet marked as solved
5 Replies
ImageIO.framework has supported OpenEXR for many major releases now. If you are just trying to get RGBA content up on the screen or loaded into one of these other frameworks, the CGImageSourceRef should be your first stop. It works just like it does for JPEGs, TIFFs, and such. For example, in this case, we draw the image into a CG context:

    #include <ImageIO/ImageIO.h>
    #include <CoreGraphics/CoreGraphics.h>

    const char * path_to_file = ...;
    CFStringRef s = CFStringCreateWithCString(NULL, path_to_file, kCFStringEncodingUTF8 );
    CFURLRef url = CFURLCreateWithFileSystemPath(NULL, s, kCFURLPOSIXPathStyle, false);
    CFRelease(s);
    CGImageSourceRef source = CGImageSourceCreateWithURL( url, NULL);
    CFRelease(url);
    CGImageRef image = CGImageSourceCreateImageAtIndex(source, 0, NULL);
    CFRelease(source);
    CGContextRef context = CGBitmapContextCreate( NULL,
                                                  CGImageGetWidth(image),
                                                  CGImageGetHeight(image),
                                                  16,
                                                  CGImageGetWidth(image) * 8, // 8 = 4 channels * 16 bits/chan RGBA
                                                  CGColorSpaceCreateDeviceRGB(),
                                                  kCGBitmapByteOrder16Host | kCGImageAlphaPremultipliedLast | kCGBitmapFloatComponents );
    CGRect where = CGRectMake(0, 0, CGImageGetWidth(image), CGImageGetHeight(image));
    CGContextClearRect( context, where);
    CGContextDrawImage( context, where, image);
    CGContextFlush(context);
    CGImageRelease(image);
    CGContextRelease(context);

__________________

New in Big Sur, we've replaced ImageIO's underlying OpenEXR implementation with an Apple-developed library, AppleEXR. It is accelerated for Neon (Apple Silicon) and AVX2 (Intel) and uses GCD. It is also available on iOS, iPadOS, watchOS, and tvOS. You may find it is quite a bit faster than the ImageIO EXR plugin that was there before. It has a C interface, so it is callable from C/C++/ObjC/Swift as API, and it supports ARC:

    #include <AppleEXR.h>     <== /usr/include/AppleEXR.h
    link: -lAppleEXR          <== /usr/lib/libAppleEXR.dylib

It is exposed to provide more direct access to lower-level features in the OpenEXR file format that are important to some apps. The file format is extremely flexible, and not all of its feature set fits entirely under the CoreGraphics / ImageIO feature space. It is not source-level compatible with the Academy Software Foundation's OpenEXR implementation, just file-format compatible, so some refactoring of your preexisting OpenEXR code would be needed to use it, if you have any. (OpenEXR has a C++ interface, which can be challenging to export stably over the long term as a dynamic library.) Relevant documentation is in the C version of the header. AppleEXR.h is annotated in Doxygen style, so you might be able to get something nice with the Doxygen tool, though I will confess that as of the WWDC release, I have not yet made it a priority to make the Doxygen output look pretty. AppleEXR presumes some basic knowledge of the OpenEXR file format or experience with OpenEXR, so for most developers, I would start with CGImageSourceRef first and then drill down if you can't get what you need out of it. For more details on OpenEXR itself and the file format features, you may visit the Academy Software Foundation website at openexr.com/documentation.html
Post not yet marked as solved
1 Replies
You can't. The MPSTemporaryImage lives only in GPU memory and can't map its storage back to the CPU to be read. You can substitute a regular MPSImage for it in your computation, which can be read, but keep in mind that this negates the memory savings of the temporary image, and many such images may cause your job to exceed available memory on either the GPU or the CPU and cause problems. Also, don't forget to synchronize the MPSImage before you look at its contents. Otherwise, you are liable to see a bunch of NaNs.
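A minimal sketch of that substitution, including the synchronize step needed on discrete GPUs; the device, descriptor, command buffer, kernel (an MPSCNNKernel here), and input image are assumed to exist already:

    // Use a regular MPSImage where the MPSTemporaryImage used to be, so the
    // result can be read back on the CPU.
    MPSImage *result = [[MPSImage alloc] initWithDevice: device
                                        imageDescriptor: desc];

    [kernel encodeToCommandBuffer: cmdBuf
                      sourceImage: input
                 destinationImage: result];

    // On discrete GPUs the backing texture is managed; synchronize it back to
    // CPU-visible memory before reading, or you will read stale data.
    [result synchronizeOnCommandBuffer: cmdBuf];

    [cmdBuf commit];
    [cmdBuf waitUntilCompleted];

    // Now it is safe to read, e.g. with -getBytes:... on result.texture.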
Post not yet marked as solved
7 Replies
If you are on the desktop, please don't forget to synchronize your resources. AMD GPUs in particular are fond of stuffing NaNs into output buffers that are not synchronized. https://developer.apple.com/documentation/metal/synchronizing_a_managed_resource See also the MPS class routines for the same.
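For a plain MTLBuffer or MTLTexture in the managed storage mode, that synchronization is a blit pass. A minimal sketch, with the command buffer and managed buffer names being illustrative:

    // Schedule a GPU -> CPU synchronization for a managed resource before
    // reading its contents on the CPU.
    id<MTLBlitCommandEncoder> blit = [commandBuffer blitCommandEncoder];
    [blit synchronizeResource: managedBuffer];   // or a managed MTLTexture
    [blit endEncoding];

    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];

    // Only now are the CPU-visible contents of managedBuffer up to date.
    const float *results = (const float *) managedBuffer.contents;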
Post not yet marked as solved
2 Replies
Lots in the (C) headers!
Post not yet marked as solved
4 Replies
vImageBuffer_InitWithCVPixelBuffer reads the CVPixelBuffer and copies the data to a (typically internally allocated) vImage_Buffer.data memory store. Along the way it will convert from the CVPixelBuffer format to the format provided as the 2nd argument of the vImageBuffer_InitWithCVPixelBuffer call. Since you passed NULL there, it is complaining that there is no format to convert to. You also aren't using it correctly: you seem to think that you are supposed to init the vImage_Buffer yourself and then (unknown magic) happens in the vImageBuffer_InitWithCVPixelBuffer call. If the CVPixelBuffer is not ARGB8888, then the code will look like:

    vImage_Error err;
    vImage_Buffer buffer;
    vImageCVImageFormatRef vformat = vImageCVImageFormat_CreateWithCVPixelBuffer( pixelBuffer );
    vImage_CGImageFormat resultFormat = (vImage_CGImageFormat){
        .bitsPerComponent = 8,
        .bitsPerPixel = 32,
        .colorSpace = CGColorSpaceCreateDeviceRGB(),
        .bitmapInfo = kCGImageAlphaPremultipliedFirst | kCGImageByteOrderDefault
    };
    err = vImageBuffer_InitWithCVPixelBuffer(&buffer, &resultFormat, pixelBuffer, vformat, NULL, kvImageNoFlags);

    vImagePixelCount alpha[256] = {0};
    vImagePixelCount red[256] = {0};
    vImagePixelCount green[256] = {0};
    vImagePixelCount blue[256] = {0};
    vImagePixelCount *histogram[4] = {blue, green, red, alpha};
    err = vImageHistogramCalculation_ARGB8888(&buffer, histogram, kvImageNoFlags);

    free(buffer.data);
    vImageCVImageFormat_Release(vformat);

If it is ARGB8888, then you can just wrap the CVPixelBuffer with the vImage_Buffer:

    vImage_Buffer buffer;
    // (lock the base address with CVPixelBufferLockBaseAddress before reading it)
    buffer.data = (unsigned char *)CVPixelBufferGetBaseAddress( pixelBuffer );
    buffer.rowBytes = CVPixelBufferGetBytesPerRow( pixelBuffer );
    buffer.width = CVPixelBufferGetWidth( pixelBuffer );
    buffer.height = CVPixelBufferGetHeight( pixelBuffer );

    vImagePixelCount alpha[256] = {0};
    vImagePixelCount red[256] = {0};
    vImagePixelCount green[256] = {0};
    vImagePixelCount blue[256] = {0};
    vImagePixelCount *histogram[4] = {blue, green, red, alpha};
    err = vImageHistogramCalculation_ARGB8888(&buffer, histogram, kvImageNoFlags);