SwiftData + CloudKit process for deduplication / consuming relevant store changes?

Offline devices often create duplicate entities that need to be merged when CloudKit syncs. Consider user-created tags. A user might create a Note and tag it with a newly created tag “Family.” On a separate offline device, they might create another note and another tag also called “Family.” When the devices sync, the two “Family” tags need to be identified as duplicates based on their name property and merged into a single entity, with their original relationships consolidated onto the single merged Tag. All of this needs to happen before the CloudKit sync data is presented to the UI in the main context.
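
To make the merge step concrete, here is a minimal sketch of the deduplication logic using plain value types. The `Tag` struct, its `createdAt` and `noteIDs` fields, and the "oldest tag wins" rule are all assumptions made for illustration; they are not SwiftData models and not necessarily the rule Apple's sample uses.

```swift
import Foundation

// Hypothetical stand-in for the app's Tag model (not a SwiftData @Model).
struct Tag {
    let id: UUID
    var name: String
    var createdAt: Date
    var noteIDs: Set<UUID>   // IDs of the notes this tag is attached to
}

/// Groups tags by name, keeps the oldest tag in each group as the winner,
/// and moves every duplicate's note relationships onto that winner.
/// Returns the surviving tags and the IDs of the tags that should be deleted.
func deduplicate(_ tags: [Tag]) -> (winners: [Tag], removedIDs: [UUID]) {
    var winners: [Tag] = []
    var removedIDs: [UUID] = []

    let groups = Dictionary(grouping: tags, by: { $0.name })
    for name in groups.keys.sorted() {
        // Sort deterministically so every device would pick the same winner.
        let sorted = groups[name]!.sorted {
            ($0.createdAt, $0.id.uuidString) < ($1.createdAt, $1.id.uuidString)
        }
        var winner = sorted[0]
        for duplicate in sorted.dropFirst() {
            winner.noteIDs.formUnion(duplicate.noteIDs)  // consolidate relationships
            removedIDs.append(duplicate.id)
        }
        winners.append(winner)
    }
    return (winners, removedIDs)
}
```

The important property is that the winner is chosen deterministically, so two devices processing the same duplicates independently converge on the same surviving Tag.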

With Core Data we have the mechanism to consume relevant store changes described here. These tools allow us to listen for remote changes, process them appropriately (e.g. remove / merge duplicates), and then merge the changes into the app’s main context. This perfectly solves the problem described in the first paragraph above. Apple provides code using this mechanism for deduplicating tags in a sample app.
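
For readers who have not used that mechanism, the Core Data side looks roughly like the sketch below. The function names and the `excludingAuthor` parameter are my own; only the Core Data calls are real API. It assumes persistent history tracking is enabled on the store, and that the fetch is called on the context's own queue (for example inside `performAndWait`).

```swift
import CoreData

/// Fetches the history transactions recorded after `token`, skipping those
/// written by the given author (e.g. the app's own context).
func fetchRemoteTransactions(
    in context: NSManagedObjectContext,
    after token: NSPersistentHistoryToken?,
    excludingAuthor author: String
) throws -> [NSPersistentHistoryTransaction] {
    let request = NSPersistentHistoryChangeRequest.fetchHistory(after: token)
    // Pre-filter in the store rather than in memory.
    if let historyFetch = NSPersistentHistoryTransaction.fetchRequest {
        historyFetch.predicate = NSPredicate(format: "author != %@", author)
        request.fetchRequest = historyFetch
    }
    let result = try context.execute(request) as? NSPersistentHistoryResult
    return result?.result as? [NSPersistentHistoryTransaction] ?? []
}

/// Merges already-processed transactions into the main (view) context.
func merge(_ transactions: [NSPersistentHistoryTransaction],
           into viewContext: NSManagedObjectContext) {
    viewContext.perform {
        for transaction in transactions {
            viewContext.mergeChanges(fromContextDidSave: transaction.objectIDNotification())
        }
    }
}
```

Deduplication would run between these two steps, on a background context, so the main context only ever sees the merged result.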

Is there a mechanism to solve this deduplication problem using SwiftData technology without implementing and maintaining a parallel Core Data stack?

Consider "How to Observe Data Changes in SwiftData using Persistent History Tracking" at Fatbobman's Blog from Nov 2, 2023 (fatbobman.com). DMG

I'd like to respond to DelawareMathGuy's suggestion to use fatbobman's SwiftDataKit to implement history processing, but my response is too long to fit in a reply to a root comment, so I'm putting it here. Fatbobman has written some great blog posts about SwiftData, and the suggestion deserves a real response.

I have actually already implemented fatbobman's approach in a dev branch of my project. I don't think it's viable for production in a commercial app, for a few reasons:

  1. In a Core Data history processor you can execute a query with a predicate, but with FBM's hack you cannot, because SwiftData itself does not support passing predicates to the history fetch. Normally you set the transaction author to something like "app" or "widget", and if the author is missing you can assume the transaction came from CloudKit. You then query your history for author != excluded authors and process only the relevant changes from the network. With SwiftData no pre-filtering of transactions is possible, so you have to fetch every transaction that has happened and filter in memory.
        let fetchRequest = NSPersistentHistoryChangeRequest.fetchHistory(after: timestamp)
        // In SwiftData, the fetchRequest.fetchRequest created by fetchHistory is nil and predicate cannot be set.
  2. You can't set merge policies (like NSMergeByPropertyObjectTrumpMergePolicy) in SwiftData using FBM's approach, so you can't easily control how the merge happens.
  3. His approach is fully based on his SwiftDataKit extensions, which are based entirely on undocumented internal implementation details of SwiftData. For example, to get a SwiftData PersistentIdentifier, he defines a mock Codable struct, populates it with undocumented elements of a Core Data NSManagedObjectID, encodes it to JSON, and then decodes that JSON back into a SwiftData PersistentIdentifier. So it depends on the undocumented structure of PersistentIdentifier and its relationship to the underlying Core Data object. That's probably stable...
// from https://github.com/fatbobman/SwiftDataKit/blob/main/Sources/SwiftDataKit/CoreData/NSManagedObjectID.swift
// Compute PersistentIdentifier from NSManagedObjectID
public extension NSManagedObjectID {
    // Compute PersistentIdentifier from NSManagedObjectID
    var persistentIdentifier: PersistentIdentifier? {
        guard let storeIdentifier, let entityName else { return nil }
        let json = PersistentIdentifierJSON(
            implementation: .init(primaryKey: primaryKey,
                                  uriRepresentation: uriRepresentation(),
                                  isTemporary: isTemporaryID,
                                  storeIdentifier: storeIdentifier,
                                  entityName: entityName)
        )
        let encoder = JSONEncoder()
        guard let data = try? encoder.encode(json) else { return nil }
        let decoder = JSONDecoder()
        return try? decoder.decode(PersistentIdentifier.self, from: data)
    }
}

// Extensions to expose needed implementation details
extension NSManagedObjectID {
    // Primary key is last path component of URI
    var primaryKey: String {
        uriRepresentation().lastPathComponent
    }

    // Store identifier is host of URI
    var storeIdentifier: String? {
        guard let identifier = uriRepresentation().host() else { return nil }
        return identifier
    }

    // Entity name from entity name
    var entityName: String? {
        guard let entityName = entity.name else { return nil }
        return entityName
    }
}

So, as I worked through his approach, I came to see it as a clever hack that I wouldn't be comfortable depending on in production, and one that ultimately yields a solution that isn't very good: you request all transactions from all sources and filter in memory, without being able to set a merge policy for the final set of transactions. I think what he made is a neat workaround, but for a commercial app I think it would be better to implement the parallel Core Data stack and do real history change processing, or to fix the gaps discussed above (unavailable predicates and merge policies). But best of all would be a mechanism to do this in SwiftData itself.
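
For comparison, this is roughly what the parallel Core Data stack needs in order to cover the two gaps above (transaction author and merge policy). It is only a sketch: the container name "Model" and the author string "app" are placeholders, and a real setup also has to point the container at the same store and an equivalent model as the SwiftData side, which is not shown here.

```swift
import CoreData

// Sketch of a Core Data stack configured for history processing.
// "Model" and "app" are placeholder names, not values from a real project.
func makeContainer() -> NSPersistentCloudKitContainer {
    let container = NSPersistentCloudKitContainer(name: "Model")
    guard let description = container.persistentStoreDescriptions.first else {
        fatalError("Missing persistent store description")
    }
    // Record history and post a notification when another process
    // (or the CloudKit mirroring machinery) writes to the store.
    description.setOption(true as NSNumber, forKey: NSPersistentHistoryTrackingKey)
    description.setOption(true as NSNumber,
                          forKey: NSPersistentStoreRemoteChangeNotificationPostOptionKey)

    container.loadPersistentStores { _, error in
        if let error { fatalError("Failed to load store: \(error)") }
    }

    // Tag the app's own writes so history queries can exclude them.
    container.viewContext.transactionAuthor = "app"
    // Same policy as NSMergeByPropertyObjectTrumpMergePolicy.
    container.viewContext.mergePolicy = NSMergePolicy.mergeByPropertyObjectTrump
    return container
}
```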
