My (admittedly limited) experience with Java is that it treats strings as an array of UTF-16 code units, much like NSString. This is convenient but error prone. It is easy to make mistakes when processing strings that contain non-ASCII characters. For example:
let myString: NSString = "My String"
print(myString.substring(with: NSRange(location: 1, length: myString.length - 2))) // prints "y Strin"
let nonBMP: NSString = "😀My String"
print(nonBMP.substring(with: NSRange(location: 1, length: nonBMP.length - 2))) // prints "[?]My Strin"
The nonBMP example fails because 😀 is U+1F600 GRINNING FACE, which is outside of the Basic Multilingual Plane and thus has to be represented by a UTF-16 surrogate pair. Accessing the string using naïve UTF-16 indexing splits the surrogate pair, resulting in a malformed string ([?] is my representation of the Unicode replacement character).
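To see the surrogate pair directly, you can compare the different views Swift offers on the same string. This is only a small sketch using the standard library; the names face and units are mine, chosen for illustration.

```swift
let face = "😀" // U+1F600 GRINNING FACE

print(face.count)                 // 1 Character (extended grapheme cluster)
print(face.unicodeScalars.count)  // 1 Unicode scalar
print(face.utf16.count)           // 2 UTF-16 code units, that is, a surrogate pair
print(face.utf8.count)            // 4 UTF-8 code units

// The two UTF-16 code units are the high and low surrogates.
let units = Array(face.utf16)
print(units.map { String($0, radix: 16) }) // prints ["d83d", "de00"]
```

Any API that indexes by UTF-16 code unit can land between those two values, which is what happens in the NSRange example.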
I must stress that this problem isn’t limited to ‘weird’ characters. Consider this:
let nonASCII: NSString = "Zo\u{00eb}"
print(nonASCII) // prints "Zoë"
print(nonASCII.substring(with: NSRange(location: 1, length: nonASCII.length - 2))) // prints "o"
let nonASCII2: NSString = "Zoe\u{0308}"
print(nonASCII2) // prints "Zoë"
print(nonASCII2.substring(with: NSRange(location: 1, length: nonASCII2.length - 2))) // prints "oe"
The nonASCII2 example shows how this can affect text you may well find in an English-only app. In this example, the ë character is represented as a combination of U+0065 LATIN SMALL LETTER E and U+0308 COMBINING DIAERESIS, and the naïve approach splits them apart.
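A quick way to convince yourself that the two spellings are the same text with different encodings is to compare them in Swift. Again, this is a sketch; precomposed and decomposed are names I made up for the two forms.

```swift
let precomposed = "Zo\u{00eb}"  // ë as a single scalar, U+00EB
let decomposed = "Zoe\u{0308}"  // e followed by U+0308 COMBINING DIAERESIS

print(precomposed.unicodeScalars.count)     // 3
print(decomposed.unicodeScalars.count)      // 4
print(precomposed.count, decomposed.count)  // prints "3 3"

// Swift string comparison uses canonical equivalence, so these compare equal.
print(precomposed == decomposed)            // prints "true"
```

The user sees the same three characters either way, but code that counts UTF-16 code units sees three in one case and four in the other.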
Swift strings don’t suffer from such problems because they treat strings as a sequence of extended grapheme clusters. To wit:
let myString = "My String"
print(myString.dropFirst().dropLast()) // prints "y Strin"
let nonBMP = "😀My String"
print(nonBMP.dropFirst().dropLast()) // prints "My Strin"
let nonASCII = "Zo\u{00eb}"
print(nonASCII) // prints "Zoë"
print(nonASCII.dropFirst().dropLast()) // prints "o"
let nonASCII2 = "Zoe\u{0308}"
print(nonASCII2) // prints "Zoë"
print(nonASCII2.dropFirst().dropLast()) // prints "o"
The cost of this correctness is that string indices are more complex, and thus you can’t treat a string as an array of characters. If you attempt to do this, using the extensions that Claude31 posted, you can run into performance problems. Specifically, those index(_:offsetBy:) calls are O(n), so it’s easy to go accidentally quadratic.
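To make the quadratic trap concrete, here is a sketch comparing integer-offset access with a walk over the string’s own indices. Both functions are hypothetical helpers I wrote for illustration; they are not the extensions from earlier in the thread.

```swift
// Each index(_:offsetBy:) call walks forward from startIndex,
// so calling it once per character makes the whole loop quadratic.
func countSpacesQuadratic(_ s: String) -> Int {
    var total = 0
    for offset in 0..<s.count { // s.count is itself an O(n) walk
        let i = s.index(s.startIndex, offsetBy: offset)
        if s[i] == " " { total += 1 }
    }
    return total
}

// Walking the string's own indices visits each character once, so this is linear.
func countSpacesLinear(_ s: String) -> Int {
    var total = 0
    for i in s.indices where s[i] == " " {
        total += 1
    }
    return total
}

let sample = "My String with some spaces"
print(countSpacesQuadratic(sample), countSpacesLinear(sample)) // prints "4 4"
```

Both give the same answer, but only the second scales well to long strings. An integer subscript on String hides the first pattern behind syntax that looks cheap.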
The Swift team is interested in improving the ergonomics of strings, and I encourage you to wade into those efforts. There’s no single thread I can point you to here (there have been lots of discussions, covering lots of topics), but a major player here is Michael Ilseman and you can look at his posts to get up to speed (on that page, click on More Topics to see a longer list).
Notwithstanding all of the above, my experience is that:
I almost never want to access localised strings like this. I try to avoid parsing localised strings entirely, but when I find it necessary to do that my parsing code almost always involves finding delimiters in the string and then splitting based on that.
If I’m dealing with non-localised strings, I generally prefer dealing with the data as UTF-8 (either the string’s .utf8 view, or just an array of UInt8). Even if the data contains strings that might be localised, I find it better to find the delimiters using the UTF-8 view and then extract any localised content as strings from there.
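As a rough sketch of that second approach, here is one way to find an ASCII delimiter in the UTF-8 view and then pull the pieces out as strings. The key=value format and the splitKeyValue function are assumptions of mine, purely for illustration.

```swift
// Hypothetical helper: splits a "key=value" line at the first "=".
func splitKeyValue(_ line: String) -> (key: String, value: String)? {
    // Find the ASCII delimiter by scanning UTF-8 code units.
    guard let delimiter = line.utf8.firstIndex(of: UInt8(ascii: "=")) else {
        return nil
    }
    // UTF-8 view indices are String.Index values, so they can slice the string itself.
    let key = String(line[..<delimiter])
    let value = String(line[line.utf8.index(after: delimiter)...])
    return (key, value)
}

if let pair = splitKeyValue("name=Zoe\u{0308}") {
    print(pair.key)          // prints "name"
    print(pair.value)        // prints "Zoë"
    print(pair.value.count)  // prints "3"
}
```

The delimiter search never has to reason about grapheme clusters, and the localised value comes out the other side as an ordinary String with the combining sequence intact.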
Which brings me back to your example:
Suppose I want the first and last char removed from a string.
That’s obviously a straw man that you’re using to illustrate your point. I find such examples problematic because they don’t represent what folks should be doing with strings.
So, my question for you is, can you post some more realistic examples? That is, examples of problems that you’ve encountered in real code. Make sure to describe the context because, when it comes to string processing, context really matters.
Share and Enjoy
—
Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware
let myEmail = "eskimo" + "1" + "@apple.com"