If you write:

let c = "c" as Unicode.Scalar

that should work.
While that’s true, it’s only part of the story. The main issue here is one of definition:
Traditional Cocoa APIs, like NSString, define a character as a single UTF-16 code unit, that is, a single unsigned 16-bit value (unichar). For example, if you look at the return result of -characterAtIndex: you’ll find it’s a unichar.

Note This approach is grandfathered in from the original NeXT adoption of Unicode. At the time Unicode was a 16-bit encoding and there was only the Basic Multilingual Plane.
Swift defines a character to be an extended grapheme cluster. This, in turn, may be made up of multiple Unicode scalar values, each of which may comprise multiple UTF-16 code units.
For example, consider this code:
import Foundation

let s = "nai\u{0308}ve"
print("String:")
for ch in s {
    print(" \(ch)")
}
print("NSString:")
let ns = s as NSString
for i in 0..<ns.length {
    print(" 0x\(String(ns.character(at: i), radix: 16))")
}
which prints:
String:
 n
 a
 ï
 v
 e
NSString:
 0x6e
 0x61
 0x69
 0x308
 0x76
 0x65
Note how String returns the i and the accent (U+0308 COMBINING DIAERESIS) as one character but NSString gives you them separately (0x69 and 0x308).
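To make the layers concrete, here’s a quick test you can run in a playground, counting the same string at each level:

let s = "nai\u{0308}ve"
print(s.count)                 // 5: Swift Characters (grapheme clusters)
print(s.unicodeScalars.count)  // 6: Unicode scalar values
print(s.utf16.count)           // 6: UTF-16 code units, which is what NSString counts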
NSCharacterSet was designed around the NSString model, and that design is reflected in Swift’s CharacterSet. The end result is that CharacterSet is very badly named in Swift. The type would be better named UnicodeScalarSet.
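You can see the scalar orientation directly in the API: contains(_:) on CharacterSet takes a Unicode.Scalar, not a Character. A minimal illustration:

import Foundation

let lower = CharacterSet.lowercaseLetters
let a: Unicode.Scalar = "a"
print(lower.contains(a))  // -> true; there is no Character-based overload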
There are ongoing efforts to fix this (it comes up regularly on Swift Forums [1]) but this is not easy. There’s a deep semantic issue in play here, namely that the semantics of an extended grapheme cluster are more than the sum of its parts. Consider this code:
import Foundation

func isAllLowerCase(_ s: String) -> Bool {
    let cs = CharacterSet.lowercaseLetters
    for us in s.unicodeScalars {
        if !cs.contains(us) {
            return false
        }
    }
    return true
}
which seems reasonable enough until you actually test it:
print(isAllLowerCase("naive")) // -> true
print(isAllLowerCase("nai\u{0308}ve")) // -> false
The reason you get false in the second test is that the accent is, in and of itself, not a lowercase letter. Logically the accent takes on the ‘lowercaseness’ of the character it’s combined with. Alas, CharacterSet has no way to represent that concept, because CharacterSet only deals with Unicode scalars.
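You can verify that for yourself: the combining accent, taken on its own, is not in lowercaseLetters.

import Foundation

let accent: Unicode.Scalar = "\u{0308}"  // U+0308 COMBINING DIAERESIS, a nonspacing mark
print(CharacterSet.lowercaseLetters.contains(accent))  // -> false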
So what should you do? That very much depends on your specific goals. If, for example, you know you’re dealing with ASCII strings, you can use CharacterSet because that works fine for ASCII. However, if you’re dealing with arbitrary Unicode that’s been typed in by the user then things get more complex.
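For example, one sketch is to test whole Characters rather than scalars, assuming the Character-level case properties that later shipped in Swift 5 (Character.isLowercase); the isAllLowerCased(_:) name here is just for illustration:

func isAllLowerCased(_ s: String) -> Bool {
    // allSatisfy tests every Character, that is, every grapheme cluster,
    // so the accent is judged together with the letter it combines with.
    return s.allSatisfy { $0.isLowercase }
}

print(isAllLowerCased("naive"))          // -> true
print(isAllLowerCased("nai\u{0308}ve"))  // -> true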
If you can explain more about your goals we should be able to help you further.
Share and Enjoy
—
Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware
let myEmail = "eskimo" + "1" + "@apple.com"
[1] There’s actually a proposal in review right now, SE-0211 Add Unicode Properties to Unicode.Scalar, that lays some of the groundwork for this. Check out the evolution thread and the review thread.
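Assuming the proposal is accepted as reviewed, you’d be able to query a scalar’s Unicode properties directly, something like:

let scalar: Unicode.Scalar = "a"
print(scalar.properties.isLowercase)  // -> true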