Is there any way to get String.characters to group skin tone emoji as a single extended grapheme cluster?

This post will actually have multiple questions in it. Apologies in advance.


In Swift 2.1 (Xcode 7.1.1 7B1005), emoji with skin tone modifiers are represented by two characters in a String.CharacterView. For example,


var str = "\u{270C}\u{1F3FE}"
print("\(str.characters.count)")


This code prints 2. These forums don't support image emoji, but the Unicode codepoints in the string above render as a single hand with a skin tone on iOS 8.3+ and OS X 10.10.3+. On earlier systems, they render as a hand and a colored square.


This is because skin tone emoji are implemented as two existing Unicode codepoints placed next to each other, in order to provide backwards compatibility. That way, systems too old to display the skin tone emoji will display characters that still convey the intent of the sender. The same is true for multi-person groupings.
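The decomposition is easy to see by enumerating the string's Unicode scalars (the unicodeScalars view never applies grapheme clustering, so it behaves the same on old and new systems):

```swift
let str = "\u{270C}\u{1F3FE}"  // VICTORY HAND + EMOJI MODIFIER FITZPATRICK TYPE-5
// Print each Unicode scalar as a hex codepoint
for scalar in str.unicodeScalars {
    print(String(scalar.value, radix: 16, uppercase: true))
}
// prints 270C, then 1F3FE: the base character followed by the modifier
```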


That Swift strings allow me to work with extended grapheme clusters instead of Unicode codepoints is, for me, one of the most important features of the language. However, with the addition of skin tone and multi-person grouping emoji, we are now in a situation where the extended grapheme clusters for the same codepoints could be different on different systems. I can only imagine that this problem will affect more characters in the future, as new characters are created in the same backwards-compatible manner.


Which brings me to question #1: Is there any way to allow Swift strings to combine codepoints into extended grapheme clusters based on how they will be displayed on the system in which the code is running? It's OK that the code will produce different results on different machines, in fact, for my needs, it's preferable.


Question #2: If the answer to question #1 is no, is there any way to allow Swift strings to always combine the skin tone and multi-person grouping emoji, essentially breaking compatibility with older systems? This would not be ideal, as it would require shipping app updates every time new Unicode characters are added, but it would be better than nothing.
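For what it's worth, here is a rough sketch of the "always combine" option from question #2, assuming one only cares about skin tone modifiers: walk the Unicode scalars yourself and glue any Fitzpatrick modifier (U+1F3FB through U+1F3FF) onto the scalar before it. mergingSkinTones is a hypothetical helper, and it deliberately ignores every other clustering rule (combining marks, regional indicator pairs, and so on), so it is a starting point rather than a real segmenter:

```swift
// Rough sketch: group skin tone modifiers with the preceding scalar,
// ignoring all other grapheme clustering rules.
func mergingSkinTones(_ string: String) -> [String] {
    var clusters: [String] = []
    for scalar in string.unicodeScalars {
        // Fitzpatrick skin tone modifiers occupy U+1F3FB...U+1F3FF
        let isSkinTone = scalar.value >= 0x1F3FB && scalar.value <= 0x1F3FF
        if isSkinTone && !clusters.isEmpty {
            // Attach the modifier to the previous cluster
            clusters[clusters.count - 1] += String(scalar)
        } else {
            clusters.append(String(scalar))
        }
    }
    return clusters
}
// mergingSkinTones("\u{270C}\u{1F3FE}") yields a single cluster
```

The obvious drawback, as noted above, is that the modifier table has to be updated by hand whenever Unicode adds new modifier-style characters.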


Question #3: If the answer to questions 1 or 2 is yes, is there any way to do so in a way that allows String.UTF8View.Index.samePositionIn() to return the position based on these emoji being represented as a single extended grapheme cluster?


Question #4: Speaking of String.UTF8View.Index.samePositionIn(), the documentation states that it will, "Return the position in characters that corresponds exactly to self, or if no such position exists, nil." I can understand why this would return nil if the index == endIndex, but why does it return nil otherwise? For example,


var str = "\u{1F1EE}\u{1F1F9}"
var index = str.utf8.startIndex
var nonNilIndexes: [String.UTF8View.Index] = []
var nilIndexes: [String.UTF8View.Index] = []
while index != str.utf8.endIndex {
    let currentIndex = index.samePositionIn(str)
    if currentIndex == nil {
        nilIndexes.append(index)
    } else {
        nonNilIndexes.append(index)
    }
    index = index.successor()
}


The Unicode codepoints in the string above render as a single character that these forums can't display: an Italian flag emoji. Swift considers the Italian flag emoji used in this code to be a single extended grapheme cluster made up of 8 UTF-8 bytes. When this code is run, only index 0 is placed in nonNilIndexes; indexes 1 through 7 are placed in nilIndexes.


Question #5: The Swift 2.1 book states, "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character." However, an official blog post about Swift 2 strings says, "The characters property segments the text into extended grapheme clusters, which are an approximation of user-perceived characters (in this case: c, a, f, and é)." (emphasis mine) How close is this approximation supposed to be? Reading the official documentation, one would be led to believe that it is not an approximation, and that one character is guaranteed to represent one extended grapheme cluster. What guarantees should we expect from Swift in this regard? If we have code that, for example, attempts to extract the UTF-8 bytes of the first extended grapheme cluster in a string, would it be inappropriate to use Swift strings for this task (ignoring the performance implications)?
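Regarding that last task, a sketch in later Swift syntax (Swift 4 and onward, where the characters view was folded into String itself); utf8BytesOfFirstCluster is a hypothetical helper name:

```swift
// Hypothetical helper: the UTF-8 bytes of the first extended grapheme
// cluster, exactly as Swift segments it.
func utf8BytesOfFirstCluster(_ string: String) -> [UInt8] {
    // In Swift 4+, `first` yields the first Character (grapheme cluster)
    guard let first = string.first else { return [] }
    return Array(String(first).utf8)
}
// For "\u{1F1EE}\u{1F1F9}!" this yields the flag's 8 bytes, not just "!"
```

Whether that extraction is *guaranteed* to match user-perceived characters is exactly the question above.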


Sorry for the long post. Thanks in advance.

Replies

I’m not going to tackle your actual questions, alas, but I do have a suggestion. Rather than posting code snippets containing characters that don’t render on the forums, you should encode those characters using Swift’s \u{...} notation. That makes it easier for folks to read your code and try things out for themselves.

For example, in your last example you could have written:

var str = "\u{1F1EE}\u{1F1F9}"

Share and Enjoy

Quinn "The Eskimo!"
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

Thanks for the good suggestion. I will edit the post as you suggested.

Similar issues with miscounting have been labelled as bugs before. You're right that it's a tough case, because the string will display as a different number of characters depending on the OS and perhaps even the font.

That's a very good point about the font affecting the number of extended grapheme clusters. I hadn't considered that, since in most cases, the Apple Color Emoji font is used to render emoji. (Though as this forum shows, there are exceptions.) This could also affect non-emoji characters.


I will go ahead and file a bug for this specific case, thanks. It would still be great to get some clarification on questions 4 and 5.