What are some reliable mechanisms to prevent data duplication in CoreData CloudKit?

Each of our data rows contains a unique uuid column.

Previously, before adopting CloudKit, the uuid column had a unique constraint. This enabled us to prevent data duplication.

Now that we are integrating CloudKit into our existing CoreData stack, the unique constraint has been removed. The following user flow causes data duplication.

Steps to cause data duplication when using CloudKit

  1. Launch the app for the first time.
  2. Since the store is empty, pre-defined data with a pre-defined uuid is generated.
  3. The pre-defined data is synced to iCloud.
  4. The app is uninstalled.
  5. The app is re-installed.
  6. Launch the app for the first time after the re-install.
  7. Since the store is empty again, pre-defined data with the same pre-defined uuid is generated.
  8. The old pre-defined data from step 3 is synced to the device.
  9. We now have 2 pre-defined rows with the same uuid! :(

I was wondering, is there a way for us to prevent such duplication?

In step 8, we wish we had a way to execute the following logic before the data is written into CoreData:

Check whether the uuid already exists in CoreData. If not, write to CoreData. If it does, pick the row with the latest update date and overwrite the existing data with it.
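As a minimal sketch of that rule: assuming a hypothetical `Row` value type with `uuid` and `updatedAt` fields (not our real entity), the selection logic could look roughly like this. It only decides which rows win and which should be discarded; it does not hook into the CloudKit import itself.

```swift
import Foundation

// Hypothetical stand-in for one stored row; the field names are assumptions.
struct Row: Equatable {
    let uuid: UUID
    let updatedAt: Date
    let title: String
}

/// For each uuid, keeps the row with the latest `updatedAt`.
/// Every other row sharing that uuid is returned in `losers`.
func resolveDuplicates(_ rows: [Row]) -> (winners: [Row], losers: [Row]) {
    var winnerByUUID: [UUID: Row] = [:]
    var losers: [Row] = []

    for row in rows {
        guard let current = winnerByUUID[row.uuid] else {
            // First time this uuid is seen: it wins by default.
            winnerByUUID[row.uuid] = row
            continue
        }
        if row.updatedAt > current.updatedAt {
            // The newer row replaces the current winner.
            losers.append(current)
            winnerByUUID[row.uuid] = row
        } else {
            losers.append(row)
        }
    }
    return (Array(winnerByUUID.values), losers)
}
```

The order of `winners` is not defined, since it comes from a dictionary. On an exact `updatedAt` tie, the row seen first is kept.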

I once tried to insert the above logic into https://developer.apple.com/documentation/coredata/nsmanagedobject/1506209-willsave . To prevent the save, I called self.managedObjectContext?.rollback(). But it just crashes.

Do you have any idea what reliable mechanisms I can use to prevent data duplication in CoreData CloudKit?


Additional info:

Before adopting CloudKit

We were using the following CoreData stack

import CoreData

class CoreDataStack {
    static let INSTANCE = CoreDataStack()
    
    private init() {
    }
    
    private(set) lazy var persistentContainer: NSPersistentContainer = {
        precondition(Thread.isMainThread)
        
        let container = NSPersistentContainer(name: "***", managedObjectModel: NSManagedObjectModel.wenote)
        
        container.loadPersistentStores(completionHandler: { (storeDescription, error) in
            if let error = error as NSError? {
                // This is a serious fatal error. We will just simply terminate the app, rather than using error_log.
                fatalError("Unresolved error \(error), \(error.userInfo)")
            }
        })
        
        // So that when backgroundContext write to persistent store, container.viewContext will retrieve update from
        // persistent store.
        container.viewContext.automaticallyMergesChangesFromParent = true
        
        // TODO: Not sure these are required...
        //
        //container.viewContext.mergePolicy = NSMergeByPropertyObjectTrumpMergePolicy
        //container.viewContext.undoManager = nil
        //container.viewContext.shouldDeleteInaccessibleFaults = true
        
        return container
    }()
}

Our CoreData schema had

  1. A unique constraint.
  2. The Deny deletion rule for relationships.
  3. No default values for non-null fields.

After adopting CloudKit

class CoreDataStack {
    static let INSTANCE = CoreDataStack()
    
    private init() {
    }
    
    private(set) lazy var persistentContainer: NSPersistentContainer = {
        precondition(Thread.isMainThread)
        
        let container = NSPersistentCloudKitContainer(name: "***", managedObjectModel: NSManagedObjectModel.wenote)
        
        container.loadPersistentStores(completionHandler: { (storeDescription, error) in
            if let error = error as NSError? {
                // This is a serious fatal error. We will just simply terminate the app, rather than using error_log.
                fatalError("Unresolved error \(error), \(error.userInfo)")
            }
        })
        
        // So that when backgroundContext write to persistent store, container.viewContext will retrieve update from
        // persistent store.
        container.viewContext.automaticallyMergesChangesFromParent = true
        
        // TODO: Not sure these are required...
        //
        //container.viewContext.mergePolicy = NSMergeByPropertyObjectTrumpMergePolicy
        //container.viewContext.undoManager = nil
        //container.viewContext.shouldDeleteInaccessibleFaults = true
        
        return container
    }()
}

We changed the CoreData schema to have

  1. No unique constraint.
  2. The Nullify deletion rule for relationships.
  3. Default values for non-null fields.

Based on feedback from a Developer Technical Support engineer at https://developer.apple.com/forums/thread/699634?login=true , they mentioned we can use

  1. Detecting Relevant Changes by Consuming Store Persistent History
  2. Removing Duplicate Data

But it isn't entirely clear how this should be implemented, as the GitHub link provided is broken.
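For what it's worth, here is a rough sketch of how those two steps might fit together: read persistent history to find newly inserted objects, then delete all but the newest row for a given uuid. The entity name "Note" and the attributes "uuid" and "updatedAt" are assumptions for illustration, and this is not the code from the broken link.

```swift
import CoreData

// Turn on history tracking and remote change notifications for a store.
// Both options are set explicitly here before the stores are loaded.
func enableHistoryTracking(on description: NSPersistentStoreDescription) {
    description.setOption(true as NSNumber,
                          forKey: NSPersistentHistoryTrackingKey)
    description.setOption(true as NSNumber,
                          forKey: NSPersistentStoreRemoteChangeNotificationPostOptionKey)
}

// Step 1: collect the object IDs inserted since the last processed token.
// Returns the newest token so the caller can persist it for the next run.
func insertedObjectIDs(
    after token: NSPersistentHistoryToken?,
    in context: NSManagedObjectContext
) throws -> (ids: [NSManagedObjectID], lastToken: NSPersistentHistoryToken?) {
    let request = NSPersistentHistoryChangeRequest.fetchHistory(after: token)
    let result = try context.execute(request) as? NSPersistentHistoryResult
    let transactions = result?.result as? [NSPersistentHistoryTransaction] ?? []
    let ids = transactions
        .flatMap { $0.changes ?? [] }
        .filter { $0.changeType == .insert }
        .map { $0.changedObjectID }
    return (ids, transactions.last?.token ?? token)
}

// Step 2: for one uuid, keep the most recently updated row and delete the rest.
// "Note", "uuid" and "updatedAt" are hypothetical names.
func removeDuplicates(of uuid: UUID, in context: NSManagedObjectContext) throws {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Note")
    request.predicate = NSPredicate(format: "uuid == %@", uuid as CVarArg)
    request.sortDescriptors = [NSSortDescriptor(key: "updatedAt", ascending: false)]

    let matches = try context.fetch(request)
    for duplicate in matches.dropFirst() {
        context.delete(duplicate)
    }
    if context.hasChanges {
        try context.save()
    }
}
```

One way to wire this up would be to observe `.NSPersistentStoreRemoteChange` on the container's `persistentStoreCoordinator`, call `insertedObjectIDs(after:in:)` on a background context, and run `removeDuplicates(of:in:)` for the uuid of each inserted object. Because the cleanup runs after the import rather than before the write, duplicates may exist briefly.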

Answered by DelawareMathGuy in 717306022

hi,

if the issue is about loading pre-defined data on a device (independent of the CloudKit not supporting uniquing of objects on its own), any strategy of "if the store is empty, then load up all the pre-defined data" will fail, because you cannot count on when or if the cloud is available -- there's no guaranteed way to properly make the decision "should i load the pre-defined data."

your DTS answer is probably the best -- examine the history tracking. this document from Apple is probably the relevant one.

a second possible strategy is to use two different configurations in your Core Data model. all pre-defined data goes into a local store that is not synched to the cloud; but all user-defined/user-modified data goes into a cloud store that is synched with the cloud.
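a rough sketch of the two-store setup, assuming the model defines two configurations named "Local" and "Cloud" (hypothetical names), with placeholder file names and a placeholder iCloud container identifier:

```swift
import CoreData

// Builds one description per configuration. Only the "Cloud" store gets
// CloudKit options, so only its entities are synced.
// "Local", "Cloud", the file names and the container identifier are placeholders.
func makeStoreDescriptions(in directory: URL) -> [NSPersistentStoreDescription] {
    let local = NSPersistentStoreDescription(
        url: directory.appendingPathComponent("local.sqlite"))
    local.configuration = "Local"
    // No cloudKitContainerOptions: this store stays on the device.

    let cloud = NSPersistentStoreDescription(
        url: directory.appendingPathComponent("cloud.sqlite"))
    cloud.configuration = "Cloud"
    cloud.cloudKitContainerOptions = NSPersistentCloudKitContainerOptions(
        containerIdentifier: "iCloud.com.example.app")

    return [local, cloud]
}
```

you would assign the result to `container.persistentStoreDescriptions` before calling `loadPersistentStores`, for example with `NSPersistentContainer.defaultDirectoryURL()` as the directory. note that Core Data relationships cannot span two stores, so pre-defined and user data would have to reference each other by uuid or through fetched properties.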

you can check these references:

hope that helps,

DMG
