      Latest reply on Nov 18, 2015 9:09 AM by Buckley
      Buckley Level 1 (0 points)

        This post will actually have multiple questions in it. Apologies in advance.

         

        In Swift 2.1 (Xcode 7.1.1 7B1005), emoji with skin tone modifiers are represented by 2 characters in a String.CharacterView. For example,

         

        // U+270C (victory hand) followed by U+1F3FE (a skin tone modifier)
        let str = "\u{270C}\u{1F3FE}"
        print("\(str.characters.count)")

         

        This code prints 2. These forums don't support image emoji, but the Unicode codepoints in the string above render as a hand with a skin tone on iOS 8.3+ and OS X 10.10.3+. On earlier systems, they render as a hand followed by a colored square.

         

        This is because skin tone emoji are implemented as two existing Unicode codepoints next to each other in order to provide backwards compatibility. That way, systems too old to display the skin tone emoji will display characters that still convey the intent of the sender. The same is true for multi-person groupings.
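
        To make the two-codepoint structure concrete, here is a minimal sketch that lists the scalars and the UTF-8 length of the string used above. It assumes a recent Swift toolchain (Swift 5 or later), whose String API differs from the Swift 2.1 API used elsewhere in this post; the names victoryHand and scalarNames are only illustrative.

```swift
// Assumes Swift 5 or later; not the Swift 2.1 API discussed in this post.
let victoryHand = "\u{270C}\u{1F3FE}"

// Format each Unicode scalar in the string as U+XXXX.
let scalarNames = victoryHand.unicodeScalars.map {
    "U+" + String($0.value, radix: 16, uppercase: true)
}

print(scalarNames)             // ["U+270C", "U+1F3FE"]
print(victoryHand.utf8.count)  // 7 (3 bytes for U+270C, 4 bytes for U+1F3FE)
```

        The scalar and byte counts do not depend on how the string is segmented into characters, so they are the same on every system.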

         

        That Swift strings allow me to work with extended grapheme clusters instead of Unicode codepoints is, for me, one of the most important features of the language. However, with the addition of skin tone and multi-person grouping emoji, we are now in a situation where the extended grapheme clusters for the same codepoints could be different on different systems. I can only imagine that this problem will affect more characters in the future, as new characters are created in the same backwards-compatible manner.

         

        Which brings me to question #1: Is there any way to allow Swift strings to combine codepoints into extended grapheme clusters based on how they will be displayed on the system in which the code is running? It's OK that the code will produce different results on different machines, in fact, for my needs, it's preferable.

         

        Question #2: If the answer to question #1 is no, is there any way to allow Swift strings to always combine the skin tone and multi-person grouping emoji, essentially breaking compatibility with older systems? This would not be ideal, as it would require shipping app updates every time new Unicode characters are added, but it would be better than nothing.
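
        To illustrate the kind of behavior question #2 is asking for, here is a rough sketch, assuming Swift 5 or later. The function clustersMergingSkinTones is a hypothetical helper, not a standard library API. It only handles the skin tone modifiers U+1F3FB through U+1F3FF, does not check that the preceding cluster is actually an emoji, and does not attempt multi-person groupings.

```swift
// Hypothetical helper, not part of the standard library. Assumes Swift 5 or later.
// Splits `text` into clusters, always attaching a skin tone modifier
// (U+1F3FB...U+1F3FF) to the cluster that precedes it.
func clustersMergingSkinTones(_ text: String) -> [String] {
    let modifiers: ClosedRange<UInt32> = 0x1F3FB...0x1F3FF
    var clusters: [String] = []
    for character in text {
        // A Character always contains at least one scalar; the fallback is never used.
        let firstScalar = character.unicodeScalars.first?.value ?? 0
        if modifiers.contains(firstScalar), !clusters.isEmpty {
            // Standalone modifier: glue it onto the previous cluster.
            clusters[clusters.count - 1].append(character)
        } else {
            clusters.append(String(character))
        }
    }
    return clusters
}

print(clustersMergingSkinTones("a\u{270C}\u{1F3FE}b").count)  // 3
```

        On a system whose segmentation already attaches the modifier, the helper leaves the clusters unchanged, so the result is the same either way.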

         

        Question #3: If the answer to questions 1 or 2 is yes, is there any way to do so in a way that allows String.UTF8View.Index.samePositionIn() to return the position based on these emoji being represented as a single extended grapheme cluster?

         

        Question #4: Speaking of String.UTF8View.Index.samePositionIn(), the documentation states that it will, "Return the position in characters that corresponds exactly to self, or if no such position exists, nil." I can understand why this would return nil if the index == endIndex, but why does it return nil otherwise? For example,

         

        let str = "\u{1F1EE}\u{1F1F9}"
        var index = str.utf8.startIndex
        // UTF-8 indexes that map to a position in str.characters
        var nonNilIndexes: [String.UTF8View.Index] = []
        // UTF-8 indexes for which samePositionIn() returns nil
        var nilIndexes: [String.UTF8View.Index] = []
        while index != str.utf8.endIndex {
            if index.samePositionIn(str) == nil {
                nilIndexes.append(index)
            } else {
                nonNilIndexes.append(index)
            }
            index = index.successor()
        }

         

        The Unicode codepoints in the string above render as a single character which doesn't render on these forums: an Italian flag emoji. Swift considers the Italian flag emoji used in this code to be a single extended grapheme cluster, composed of 8 UTF-8 bytes. When this code is run, only index 0 is placed in nonNilIndexes. Indexes 1, 2, 3, 4, 5, 6, and 7 are placed in nilIndexes.
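
        The byte layout behind this result can be sketched as follows, assuming Swift 5 or later rather than the Swift 2.1 API above. The variable names are illustrative. The sketch records the UTF-8 byte offset at which each Character begins; it does not call samePositionIn(), so it only shows where the cluster boundaries fall, not how that method decides what to return.

```swift
// Assumes Swift 5 or later; not the Swift 2.1 API discussed in this post.
let flag = "\u{1F1EE}\u{1F1F9}"

// Record the UTF-8 byte offset at which each Character starts.
var clusterStartOffsets: [Int] = []
var offset = 0
for character in flag {
    clusterStartOffsets.append(offset)
    offset += String(character).utf8.count
}

print(flag.unicodeScalars.count)  // 2
print(flag.utf8.count)            // 8
print(clusterStartOffsets)        // [0]
```

        Only byte offset 0 is the start of a Character; offsets 1 through 7 all fall inside the one cluster.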

         

        Question #5: The Swift 2.1 book states "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character." However, an official blog post about Swift 2 strings says, "The characters property segments the text into extended grapheme clusters, which are an approximation of user-perceived characters (in this case: c, a, f, and é)." (emphasis mine) How close is this approximation supposed to be? Reading the official documentation, one would be led to believe that it is not an approximation, and one character is guaranteed to represent one extended grapheme cluster. What guarantees should we expect from Swift in this regard? If we have code that, for example, attempts to extract the UTF-8 bytes for the first extended grapheme cluster in a string, would it be inappropriate to use Swift strings for this task (ignoring the performance implications)?
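
        For concreteness, the task described in question #5 might look like the following minimal sketch, assuming Swift 5 or later. The function utf8BytesOfFirstCluster is a hypothetical name, and its result depends entirely on how the running system segments the string into Characters, which is exactly the guarantee being asked about.

```swift
// Hypothetical helper. Assumes Swift 5 or later.
// Returns the UTF-8 bytes of the first Character, or an empty array for an empty string.
func utf8BytesOfFirstCluster(in text: String) -> [UInt8] {
    guard let first = text.first else { return [] }
    return Array(String(first).utf8)
}

// "e" followed by U+0301 (combining acute accent) forms one cluster.
let bytes = utf8BytesOfFirstCluster(in: "e\u{301}clair")
print(bytes)  // [101, 204, 129]
```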

         

        Sorry for the long post. Thanks in advance.