How to dispatch my `MLCustomLayer` to GPU instead of CPU
MLCustomLayer implementation always dispatches to CPU instead of GPU

Background: I am trying to run my CoreML model, which contains a custom layer, on the iPhone 13 Pro. My custom layer runs successfully on the CPU, but it is always dispatched to the CPU instead of the phone's GPU, despite the encodeToCommandBuffer member function being defined in the application's binding class for the custom layer. I have been following the Swift example suggested in the CoreMLTools documentation to get this working, but note that my implementation is purely in Objective-C++. Despite reading the documentation in depth, I have not come across any resolution to the problem. Any help looking into this issue (or perhaps even a bug in CoreML) would be much appreciated! Below, I provide a minimal example based on the Swift example mentioned above.

Implementation

My toy Objective-C++ implementation is based on the Swift example here. It implements the Swish activation function for both the CPU and the GPU.

PyTorch model to CoreML MLModel conversion

For brevity, I will not define my toy PyTorch model, nor the Python bindings that allow the custom Swish layer to be scripted/traced and then converted to a CoreML MLModel, but I can provide these if necessary. Just note that the Python layer's name and bindings should match the name of the class defined below, i.e. `ToySwish`. To convert the scripted/traced PyTorch model (called `torchscript_model` in the listing below) to a CoreML MLModel, I use CoreMLTools (from Python) and then save the model as follows:

```python
input_shapes = [[1, 64, 256, 256]]
mlmodel = coremltools.converters.convert(
    torchscript_model,
    source='pytorch',
    inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape)
            for i, input_shape in enumerate(input_shapes)],
    add_custom_layers=True,
    minimum_deployment_target=coremltools.target.iOS14,
    compute_units=coremltools.ComputeUnit.CPU_AND_GPU,
)
mlmodel.save('toy_swish_model.mlmodel')
```

Metal shader

I use the same Metal shader function `swish` from Swish.metal here.
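The linked Swish.metal is not reproduced above, so for reference here is a minimal sketch of what such a `swish` kernel typically looks like, assuming the `texture2d_array<half>` textures that CoreML passes to custom layers. This is my reconstruction for illustration; the linked file remains the authoritative version.

```metal
#include <metal_stdlib>
using namespace metal;

// Element-wise swish, y = x * sigmoid(x), applied per RGBA16Float texel slice.
kernel void swish(texture2d_array<half, access::read>  inTexture  [[texture(0)]],
                  texture2d_array<half, access::write> outTexture [[texture(1)]],
                  ushort3 gid [[thread_position_in_grid]])
{
    // Skip threads that fall outside the texture bounds.
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height()) {
        return;
    }
    const float4 x = float4(inTexture.read(uint2(gid.xy), gid.z));
    const float4 y = x / (1.0f + exp(-x));
    outTexture.write(half4(y), uint2(gid.xy), gid.z);
}
```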
MLCustomLayer binding class for the Swish MLModel layer

I define an Objective-C++ class analogous to the Swift example. The class inherits from NSObject and adopts the MLCustomLayer protocol, following the guidelines in the Apple documentation for integrating a CoreML MLModel with a custom layer. It is defined as follows.

Class definition and resource setup:

```objc
#import <Foundation/Foundation.h>
#import <CoreML/CoreML.h>
#import <Metal/Metal.h>

@interface ToySwish : NSObject <MLCustomLayer>
@end

@implementation ToySwish {
    id<MTLComputePipelineState> swishPipeline;
}

- (instancetype)initWithParameterDictionary:(NSDictionary<NSString *, id> *)parameters
                                      error:(NSError *__autoreleasing _Nullable *)error {
    self = [super init];
    if (self) {
        NSError *errorPSO = nil;
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
        id<MTLFunction> swishFunction = [defaultLibrary newFunctionWithName:@"swish"];
        swishPipeline = [device newComputePipelineStateWithFunction:swishFunction error:&errorPSO];
        assert(errorPSO == nil);
    }
    return self;
}

- (BOOL)setWeightData:(NSArray<NSData *> *)weights
                error:(NSError *__autoreleasing _Nullable *)error {
    return YES;
}

- (NSArray<NSArray<NSNumber *> *> *)outputShapesForInputShapes:(NSArray<NSArray<NSNumber *> *> *)inputShapes
                                                         error:(NSError *__autoreleasing _Nullable *)error {
    return inputShapes;
}
```

CPU compute method (shown only for completeness):

```objc
- (BOOL)evaluateOnCPUWithInputs:(NSArray<MLMultiArray *> *)inputs
                        outputs:(NSArray<MLMultiArray *> *)outputs
                          error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to CPU");
    for (NSInteger i = 0; i < inputs.count; i++) {
        NSInteger num_elems = inputs[i].count;
        float *input_ptr = (float *)inputs[i].dataPointer;
        float *output_ptr = (float *)outputs[i].dataPointer;
        for (NSInteger j = 0; j < num_elems; j++) {
            // Swish: x * sigmoid(x).
            output_ptr[j] = input_ptr[j] / (1.0f + expf(-input_ptr[j]));
        }
    }
    return YES;
}
```

Encode GPU commands to the command buffer. Note that, according to the documentation, this command buffer should not be committed here, as it is executed by CoreML after the method returns:

```objc
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer
                       inputs:(NSArray<id<MTLTexture>> *)inputs
                      outputs:(NSArray<id<MTLTexture>> *)outputs
                        error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to GPU");

    id<MTLComputeCommandEncoder> computeEncoder =
        [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(computeEncoder != nil);

    for (NSUInteger i = 0; i < inputs.count; i++) {
        [computeEncoder setComputePipelineState:swishPipeline];
        [computeEncoder setTexture:inputs[i] atIndex:0];
        [computeEncoder setTexture:outputs[i] atIndex:1];

        NSUInteger w = swishPipeline.threadExecutionWidth;
        NSUInteger h = swishPipeline.maxTotalThreadsPerThreadgroup / w;
        MTLSize threadgroupSize = MTLSizeMake(w, h, 1);

        // Number of threadgroups needed to cover the current input texture.
        NSUInteger groupWidth  = (inputs[i].width       + threadgroupSize.width  - 1) / threadgroupSize.width;
        NSUInteger groupHeight = (inputs[i].height      + threadgroupSize.height - 1) / threadgroupSize.height;
        NSUInteger groupDepth  = (inputs[i].arrayLength + threadgroupSize.depth  - 1) / threadgroupSize.depth;
        MTLSize threadgroups = MTLSizeMake(groupWidth, groupHeight, groupDepth);

        // These are threadgroup counts, so dispatch them with dispatchThreadgroups:.
        [computeEncoder dispatchThreadgroups:threadgroups threadsPerThreadgroup:threadgroupSize];
    }
    [computeEncoder endEncoding];

    return YES;
}
```

Run inference for a given input

The MLModel is loaded and compiled in the application. I check that the model configuration's computeUnits property is set to MLComputeUnitsAll as desired, which should allow the model's layers to be dispatched to the CPU, GPU and Apple Neural Engine (ANE).
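For illustration, a minimal sketch of how this compile-and-load step can be written (this is a reconstruction rather than the exact application code; `modelURL` stands in for the URL of the bundled .mlmodel):

```objc
// Sketch: compile the .mlmodel on device, then load it with an explicit
// configuration whose computeUnits allow CPU, GPU and the Neural Engine.
NSError *error = nil;
NSURL *compiledURL = [MLModel compileModelAtURL:modelURL error:&error];

MLModelConfiguration *config = [[MLModelConfiguration alloc] init];
config.computeUnits = MLComputeUnitsAll;

MLModel *model = [MLModel modelWithContentsOfURL:compiledURL
                                   configuration:config
                                           error:&error];
```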
I define an MLDictionaryFeatureProvider object called `feature_provider` from an NSMutableDictionary of input features (input tensors in this case), and then pass it to the loaded model's predictionFromFeatures: method as follows:

```objc
@autoreleasepool {
    [model predictionFromFeatures:feature_provider error:error];
}
```

This computes a single forward pass of my model. When it executes, the string 'Dispatching to CPU' is printed instead of 'Dispatching to GPU'. This (along with the slow execution time) indicates that the Swish layer is being run from the evaluateOnCPUWithInputs method, and therefore on the CPU, rather than on the GPU as expected.

I am quite new to developing for iOS and to Objective-C++, so I might have missed something quite simple, but from reading the documentation and examples it is not at all clear to me what the issue is. Any help or advice would be really appreciated :)

Environment

- Xcode 13.1
- iPhone 13
- iOS 15.1.1
- iOS deployment target 15.0
3 replies · 0 boosts · 1.6k views · Nov ’21
Memory leak for CoreML inference on iOS device
In my mobile application, I observe a memory leak when running inference with my image convolution model. The leak occurs when getting predictions from the model. Given a pointer to a loaded MLModel object called `module` and an input feature provider `feature_provider` (of type MLDictionaryFeatureProvider*), the memory leak is observed each time a prediction is made by calling:

```objc
[module predictionFromFeatures:feature_provider error:NULL];
```

The amount of memory leaked between iterations appears to be related to the output size of the model. Assuming the mobile GPU backend runs in half precision (float16), I observe the following for the given output sizes:

- Output image of dimension [1,3,3840,2160] (of size 1*3*3840*2160*16 bits / (8 bits * 1000^2) == 49.7664 MB): constant increase in memory of approximately 91.7 MB after each image prediction.
- Output image of dimension [1,3,2048,1080] (of size 1*3*2048*1080*16 bits / (8 bits * 1000^2) == 13.27104 MB): constant increase in memory of approximately 23.7 MB after each image prediction.

Is there a known issue with the CoreML MLModel's predictionFromFeatures: allocating memory each time it is called, or is this the intended behaviour? At the moment this is preventing me from running inference on mobile devices, and I was wondering if anyone has a suggested workaround, patch, or advice? Thank you in advance, and please find the information to reproduce the issue below.

To Reproduce

To reproduce the problem, a simple model with three convolutions and one pixel-shuffle layer was converted from PyTorch to an MLModel. The MLModel was then run under a debugger in a mobile application. A breakpoint was set on the line computing the predictions in a loop, and the memory use was observed to increase after each iteration. Alternatively to setting a breakpoint, the number of prediction iterations can be set to 50 (assuming the output size is [1,3,3840,2160] and the phone has 4 GB of memory), which causes the application to run out of memory at runtime.

The PyTorch model:

```python
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        upscale_factor = 8
        self.Conv1 = nn.Conv2d(in_channels=48, out_channels=48, kernel_size=3, stride=1)
        self.Conv2 = nn.Conv2d(48, 48, 3, 1)
        self.Conv3 = nn.Conv2d(48, 3 * (upscale_factor * upscale_factor), 3, 1)
        self.PS = nn.PixelShuffle(upscale_factor)

    def forward(self, x):
        Conv1 = self.Conv1(x)
        Conv2 = self.Conv2(Conv1)
        Conv3 = self.Conv3(Conv2)
        y = self.PS(Conv3)
        return y
```

The PyTorch to MLModel converter:

```python
import torch
import coremltools

def convert_torch_to_coreml(torch_model, input_shapes, save_path):
    torchscript_model = torch.jit.script(torch_model)
    mlmodel = coremltools.converters.convert(
        torchscript_model,
        inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape)
                for i, input_shape in enumerate(input_shapes)],
    )
    mlmodel.save(save_path)
```

Generate the MLModel using the above definitions:

```python
if __name__ == "__main__":
    torch_model = Model()
    # input_shapes = [[1,48,256,135]]  # 2K
    input_shapes = [[1,48,480,270]]  # 4K
    coreml_model_path = "./toy.mlmodel"
    convert_torch_to_coreml(torch_model, input_shapes, coreml_model_path)
```

Mobile application:

The mobile application was generated using PyTorch's iOS TestApp and adapted for our use case. The adapted TestApp is available here.
The most relevant lines in the application for loading the model and running inference are included below.

Copy the input tensor's data into an MLMultiArray:

```objc
+ (MLMultiArray *)tensorToMultiArray:(at::Tensor)input {
    float *input_ptr = input.data_ptr<float>();
    int batch  = (int)input.size(0);
    int ch     = (int)input.size(1);
    int height = (int)input.size(2);
    int width  = (int)input.size(3);
    int pixels = ch * height * width;

    NSArray *shape = @[[NSNumber numberWithInt:batch],
                       [NSNumber numberWithInt:ch],
                       [NSNumber numberWithInt:height],
                       [NSNumber numberWithInt:width]];
    MLMultiArray *output = [[MLMultiArray alloc] initWithShape:shape
                                                      dataType:MLMultiArrayDataTypeFloat32
                                                         error:NULL];
    float *output_ptr = (float *)output.dataPointer;
    for (int pixel_index = 0; pixel_index < pixels; ++pixel_index) {
        output_ptr[pixel_index] = input_ptr[pixel_index];
    }
    return output;
}
```

Load the model, set up the input feature provider, and run inference over multiple iterations:

```objc
NSError *error = nil;
NSString *modelPath = [NSString stringWithUTF8String:model_path.c_str()];
NSURL *modelURL = [NSURL fileURLWithPath:modelPath];
NSURL *compiledModel = [MLModel compileModelAtURL:modelURL error:&error];
MLModel *module = [MLModel modelWithContentsOfURL:compiledModel error:NULL];

NSMutableDictionary *feature_inputs = [[NSMutableDictionary alloc] init];
for (int i = 0; i < inputs.size(); ++i) {
    NSString *key = [NSString stringWithFormat:@"input_%d", i];
    [feature_inputs setValue:[Converter tensorToMultiArray:inputs[i].toTensor()] forKey:key];
}
MLDictionaryFeatureProvider *feature_provider =
    [[MLDictionaryFeatureProvider alloc] initWithDictionary:feature_inputs error:NULL];

// Running inference on the model results in the memory leak
for (int i = 0; i < iter; ++i) {
    [module predictionFromFeatures:feature_provider error:NULL];
}
```

A variant of this loop that scopes each prediction in its own autorelease pool is sketched at the end of this post.

Complete example source

The complete minimal example of both the MLModel generation and the TestApp is available here.

System environment

Original environment:
- coremltools version: 5.0b5
- OS: built on macOS, targeting iOS for the mobile application
- macOS version: Big Sur (11.4)
- iOS version: 14.7.1 (run on iPhone 12)
- Xcode version: 12.5.1 (12E507)
- How Python was installed: from source
- Python version: 3.8.10
- How PyTorch was installed: from source
- PyTorch version: 1.8.1

Updated ('latest') environment:
- coremltools version: 5.0b5
- OS: built on macOS, targeting iOS for the mobile application
- macOS version: Big Sur (11.4)
- iOS version: 15.0.2 (run on iPhone 12)
- Xcode version: 13.0 (13A233)
- How Python was installed: from source
- Python version: 3.8.10
- How PyTorch was installed: from source
- PyTorch version: 1.10.0-rc2

Additional Information

Given the model definition and tensor output shapes above, the corresponding tensor input shapes for the model are as follows:
- Output shape [1,3,3840,2160] has input shape [1,48,480,270]
- Output shape [1,3,2048,1080] has input shape [1,48,256,135]
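For reference, here is a minimal variant of the prediction loop above that scopes each call in its own autorelease pool, mirroring the `@autoreleasepool` usage in the first post. This is only a sketch of a possible mitigation, not something verified here; it helps only if the per-iteration growth comes from autoreleased prediction outputs rather than a genuine CoreML leak:

```objc
// Sketch of a possible mitigation: drain autoreleased objects created by each
// prediction before the next iteration, so memory growth stays bounded if the
// objects are merely autoreleased rather than truly leaked.
for (int i = 0; i < iter; ++i) {
    @autoreleasepool {
        [module predictionFromFeatures:feature_provider error:NULL];
    }
}
```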
2 replies · 0 boosts · 1.7k views · Oct ’21