Check if character in character set?

Hi,


How do I check to see if a Character is in a Character Set?


In Objective-C, I could write:


[[NSCharacterSet lowercaseLetterCharacterSet] characterIsMember:c]


Where "c" is a variable of type char.


In Swift, I tried to write:


CharacterSet.lowercaseLetters.contains(c)


Where "c" is of type Character, but I received an error saying that the function was expecting a parameter of type "Unicode.Scalar".


Thanks,

Frank

Replies

If you write :

let c = "c" as Unicode.Scalar


that should work

If you write :

let c = "c" as Unicode.Scalar

that should work

While that’s true, it’s only part of the story. The main issue here is one of definition:

  • Traditional Cocoa APIs, like

    NSString
    , define a character is a single UTF-16 code unit, that is, a single unsigned 16-bit value (
    unichar
    ). For example, if you look at the return result of
    -characterAtIndex:
    you’ll find it’s a
    unichar
    .

    Note This approach is grandfathered in the from the original NeXT adoption of Unicode. At the time Unicode was a 16-bit encoding and there was only the Basic Multilingual Plane

  • Swift defines a character to be an extended grapheme cluster. This, in turn, is made up of multiple Unicode scalar values, each of which comprise multiple UTF-16 code units.

For example, consider this code:

let s = "nai\u{0308}ve"
print("String:")
for ch in s {
    print("  \(ch)")
}
print("NSString:")
let ns = s as NSString
for i in 0..<ns.length {
    print("  0x\(String(ns.character(at: i), radix: 16))")
}

which prints:

String:
  n
  a
  ï
  v
  e
NSString:
  0x6e
  0x61
  0x69
  0x308
  0x76
  0x65

Note how

String
returns the
i
and the accent (U+0308 COMBINING DIAERESIS) as one character but
NSString
gives you them separately (0x69 and 0x308).
NSCharacterSet
was designed around the
NSString
model, and that design is reflected in Swift
CharacterSet
. The end result is that
CharacterSet
is very badly named in Swift. The type would be better named as
UnicodeScalarSet
.

There are ongoing efforts to fix this (it comes up regularly on Swift Forums [1]) but this is not easy. There’s a deep semantic issue in play here, namely that the semantics of an extended grapheme cluster are more than the some of its parts. Consider this code:

func isAllLowerCase(_ s: String) -> Bool {
    let cs = CharacterSet.lowercaseLetters
    for us in s.unicodeScalars {
        if !cs.contains(us) {
            return false
        }
    }
    return true
}

which seems reasonable enough until you actually test it:

print(isAllLowerCase("naive"))          // -> true
print(isAllLowerCase("nai\u{0308}ve"))  // -> false

The reason you get false in the second test is that the accent is, in and of itself, not a lowercase letter. Logically the accent takes on the ‘lowercaseness’ of the character it’s combined with. Alas,

CharacterSet
has no way to represent that concept, because
CharacterSet
only deals with Unicode scalars.

So what should you do? That very much depends on your specific goals. If, for example, you know you’re dealing with ASCII strings, you can use

CharacterSet
because that works fine for ASCII. However, if you’re dealing with arbitrary Unicode that’s been typed in by the user then things get more complex.

If you can explain more about your goals we should be able to help you further.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

[1] There’s actually a proposal in review right now, SE-0211 Add Unicode Properties to Unicode.Scalar, that lays some of the ground work for this. Check out the evolution thread and the review thread.

Thanks for the detailed answer. I ended up using the expression "c.unicodeScalars.first!" as the argument to the contains function.


My goal in this case is to take a URL string and filter out anything that isn't a letter or a digit to create something I can use as a temporary filename. So if I err on the side of throwing something out that's really a letter, that is fine.


Frank

My goal in this case is to take a URL string and filter out anything that isn't a letter or a digit to create something I can use as a temporary filename.

OK. Just be aware that the various built-in character sets are based on Unicode properties, so, while the results might make sense, they might not always be what you’re expecting. For example:

let u = Unicode.Scalar(0x1B05)! // U+1B05 BALINESE LETTER AKARA
func test(_ cs: CharacterSet) {
    print(cs.contains(u))
}
test(.alphanumerics)            // true
test(.letters)                  // true
test(.lowercaseLetters)         // false
test(.uppercaseLetters)         // false

In situations like this, where the user won’t see the name of the temporary file, I tend to set up my own ASCII-only character sets. For example:

extension CharacterSet {
    static let asciiDigits = CharacterSet(charactersIn: Unicode.Scalar("0")!...Unicode.Scalar("9")!)
    … and so on …
}

ps In my last post I failed to reference the Swift forums thread that’s probably most relevant to this situation, namely Character and String properties.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

Here's my approach. Handles both single- and multi-scalar characters equally thanks to allSatisfy.

public extension CharacterSet {

	func contains(_ character: Character) -> Bool {
		character.unicodeScalars.allSatisfy(contains)
	}

	func containsAll(in string: String) -> Bool {
		string.unicodeScalars.allSatisfy(contains)
	}
}

Some other useful/helpful operators...

public extension CharacterSet {

	static func + (lhs: CharacterSet, rhs: CharacterSet) -> CharacterSet {
		lhs.union(rhs)
	}

	static func - (lhs: CharacterSet, rhs: CharacterSet) -> CharacterSet {
		lhs.subtracting(rhs)
	}
}