Simple String Question

Hi Apple Developer Community,

Is there a simple and *clean* way to get a substring of a given string?

Suppose I want the first and last characters removed from a string. In Swift, I can do

let myString = "My String"
print(myString[myString.index(myString.startIndex, offsetBy: 1)..<myString.index(myString.endIndex, offsetBy: -1)])

which is extremely long, complicated, and UGLY. In fact, it is so long, complicated, and ugly that it's probably better to use multiple lines (which feels ridiculous for such a simple task).

Surely, there's a better, cleaner way to do this?
For example, in Java, I can simply do

String myString = "My String";
System.out.println(myString.substring(1, myString.length() - 1));

which is much cleaner and easier to read. Is there something like this in Swift?
Thanks in advance 🙂
KC

Replies

You can write an extension to String.

extension String {
   
    subscript (r: Range<Int>) -> String {
        let start = index(self.startIndex, offsetBy: r.lowerBound)
        let end = self.index(self.startIndex, offsetBy: r.upperBound)
        return String(self[start ..< end]) 
    }
   
    subscript (r: ClosedRange<Int>) -> String {
       
        let start = self.index(self.startIndex, offsetBy: r.lowerBound)
        let end = self.index(self.startIndex, offsetBy: r.upperBound + 1)
        return String(self[start ..< end]) 
    }
}


Then

let myString = "My String"
print(myString[1...4])
print(myString[1..<4])


will produce

y St

y S

"which is much cleaner and easier to read. Is there something like this in Swift?"

"Clean" is a subjective word; what counts as clean depends on the reader's experience, among other things.


Also, there are no integer-based String manipulation methods in the Swift Standard Library.


I can think of some reasons why the Swift team does not provide such convenience methods.


- The unit of an integer-based offset or length differs by use case: bytes (of some specific encoding), UTF-16 code units, Unicode code points, or Characters.

- If the desired unit is not the same as the internal representation of String, the cost of such a String manipulation can be O(n), which is extremely slow when you manipulate large Strings.
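
To illustrate the first point, here is a small sketch (the sample string is my own, not from the thread) showing that one string has a different "length" depending on which unit you count:

```swift
// "Zoë😀", with the ë written as "e" + U+0308 COMBINING DIAERESIS
let sample = "Zoe\u{0308}😀"

let characterCount = sample.count                // Characters (extended grapheme clusters)
let scalarCount = sample.unicodeScalars.count    // Unicode code points
let utf16Count = sample.utf16.count              // UTF-16 code units
let utf8Count = sample.utf8.count                // UTF-8 bytes

print(characterCount, scalarCount, utf16Count, utf8Count)  // prints "4 5 6 9"
```

An integer offset of 3 therefore points at a different place in the text for each of these four views.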


In my opinion, such convenience methods should be provided even if they are inefficient in some cases, because there are many cases where efficiency is not the main concern. Programmers just need to know how efficient those methods are, as with any other method.


But given all the characteristics I described above, I do not think such methods are cleaner.


With all of that in mind, you can define your own extension as shown in Claude31's reply. But please never think of it as a general-purpose solution.

My (admittedly limited) experience with Java is that it treats strings as an array of UTF-16 code units, much like NSString. This is convenient but error prone. It is easy to make mistakes when processing strings that contain non-ASCII characters. For example:

let myString: NSString = "My String"
print(myString.substring(with: NSRange(location: 1, length: myString.length - 2)))      // prints "y Strin"

let nonBMP: NSString = "😀My String"
print(nonBMP.substring(with: NSRange(location: 1, length: nonBMP.length - 2)))          // prints "[?]My Strin"

The nonBMP example fails because 😀 is U+1F600 GRINNING FACE, which is outside of the Basic Multilingual Plane and thus has to be represented by a UTF-16 surrogate pair. Accessing the string using naïve UTF-16 indexing splits the surrogate pair, resulting in a malformed string ([?] is my representation of the Unicode replacement character).

I must stress that this problem isn’t limited to ‘weird’ characters. Consider this:

let nonASCII: NSString = "Zo\u{00eb}"
print(nonASCII)                                                                         // prints "Zoë"
print(nonASCII.substring(with: NSRange(location: 1, length: nonASCII.length - 2)))      // prints "o"

let nonASCII2: NSString = "Zoe\u{0308}"
print(nonASCII2)                                                                        // prints "Zoë"
print(nonASCII2.substring(with: NSRange(location: 1, length: nonASCII2.length - 2)))    // prints "oe"

The nonASCII2 example shows how this can affect text you may well find in an English-only app. In this example, the ë character is represented as a combination of U+0065 LATIN SMALL LETTER E and U+0308 COMBINING DIAERESIS, and the naïve approach splits them apart.

Swift strings don’t suffer from such problems because they treat strings as a sequence of extended grapheme clusters. To wit:

let myString = "My String"
print(myString.dropFirst().dropLast())  // prints "y Strin"

let nonBMP = "😀My String"
print(nonBMP.dropFirst().dropLast())    // prints "My Strin"

let nonASCII = "Zo\u{00eb}"
print(nonASCII)                         // prints "Zoë"
print(nonASCII.dropFirst().dropLast())  // prints "o"

let nonASCII2 = "Zoe\u{0308}"
print(nonASCII2)                        // prints "Zoë"
print(nonASCII2.dropFirst().dropLast()) // prints "o"

The cost of this correctness is that string indices are more complex, and thus you can’t treat a string as an array of characters. If you attempt to do this (using the extensions that Claude31 posted), you can run into performance problems. Specifically, those index(_:offsetBy:) calls are O(n), so it’s easy to go accidentally quadratic.
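
As a rough sketch of how that happens (the character(atOffset:) helper is a hypothetical stand-in for an integer subscript, not a Standard Library API), compare an integer-indexed loop with plain iteration:

```swift
extension String {
    // Hypothetical helper: index(_:offsetBy:) walks `offset` characters
    // from the start on every call, so each lookup is O(n).
    func character(atOffset offset: Int) -> Character {
        return self[index(startIndex, offsetBy: offset)]
    }
}

let text = String(repeating: "ab", count: 1_000)

// Accidentally quadratic: n lookups, each of which is O(n).
var slowCount = 0
for i in 0..<text.count {
    if text.character(atOffset: i) == "a" { slowCount += 1 }
}

// Linear: iterate the characters directly.
var fastCount = 0
for ch in text where ch == "a" { fastCount += 1 }

print(slowCount, fastCount)  // prints "1000 1000"
```

Both loops give the same answer, but the first one does work proportional to the square of the string's length.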

The Swift team is interested in improving the ergonomics of strings, and I encourage you to wade into those efforts. There’s no single thread I can point you to here (there have been lots of discussions, covering lots of topics), but a major player here is Michael Ilseman and you can look at his posts to get up to speed (on that page, click on More Topics to see a longer list).

Notwithstanding all of the above, my experience is that:

  • I almost never want to access localised strings like this. I try to avoid parsing localised strings entirely, but when I find it necessary to do that my parsing code almost always involves finding delimiters in the string and then splitting based on that.

  • If I’m dealing with non-localised strings, I generally prefer dealing with the data as UTF-8 (either the string’s .utf8 view, or just an array of UInt8). Even if the data contains strings that might be localised, I find it better to find the delimiters using the UTF-8 view and then extract any localised content as strings from there.
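
Here is a minimal sketch of that delimiter-based approach. The splitKeyValue function and the "key=value" input format are assumptions made up for illustration:

```swift
// Hypothetical helper: splits "key=value" at the first "=" byte,
// found via the UTF-8 view rather than integer character offsets.
func splitKeyValue(_ line: String) -> (key: String, value: String)? {
    let utf8 = line.utf8
    guard let separator = utf8.firstIndex(of: UInt8(ascii: "=")) else {
        return nil  // no delimiter present
    }
    // UTF8View indices are String.Index values, so they can slice the string directly.
    let key = String(line[..<separator])
    let value = String(line[utf8.index(after: separator)...])
    return (key, value)
}

if let pair = splitKeyValue("name=Zoe\u{0308}") {
    print(pair.key)    // prints "name"
    print(pair.value)  // prints "Zoë"
}
```

No integer offsets are involved: the delimiter search produces an index, and the localised content on either side comes out as whole strings.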

Which brings me back to your example:

"Suppose I want the first and last char removed from a string."

That’s obviously a straw man that you’re using to illustrate your point. I find such examples problematic because they don’t represent what folks should be doing with strings.

So, my question for you is, can you post some more realistic examples? That is, examples of problems that you’ve encountered in real code. Make sure to describe the context because, when it comes to string processing, context really matters.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

In this code, you should test the value of the upper bound to avoid a crash.


extension String {

    subscript (r: Range<Int>) -> String {
        // Return an empty string rather than crash on out-of-range offsets.
        if r.lowerBound < 0 || r.upperBound > self.count { return "" }
        let start = self.index(self.startIndex, offsetBy: r.lowerBound)
        let end = self.index(self.startIndex, offsetBy: r.upperBound)
        return String(self[start ..< end])  // 30.8.2018
    }

    subscript (r: ClosedRange<Int>) -> String {
        // A closed range includes its upper bound, so that offset must be a valid character.
        if r.lowerBound < 0 || r.upperBound >= self.count { return "" }
        let start = self.index(self.startIndex, offsetBy: r.lowerBound)
        let end = self.index(self.startIndex, offsetBy: r.upperBound + 1)
        return String(self[start ..< end])  // 30.8.2018
    }
}



This also works with characters such as emoji.
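
As an alternative sketch (the substring(offsets:) name is hypothetical), the bounds test can be folded into the index computation with index(_:offsetBy:limitedBy:), which returns nil instead of crashing and avoids the separate O(n) call to count:

```swift
extension String {
    // Hypothetical helper: returns nil when the offsets fall outside the string.
    func substring(offsets r: Range<Int>) -> Substring? {
        guard r.lowerBound >= 0,
              let start = index(startIndex, offsetBy: r.lowerBound, limitedBy: endIndex),
              let end = index(start, offsetBy: r.count, limitedBy: endIndex)
        else { return nil }
        return self[start..<end]
    }
}

let greeting = "😀My String"
print(greeting.substring(offsets: 1..<3) ?? "out of range")   // prints "My"
print(greeting.substring(offsets: 5..<50) ?? "out of range")  // prints "out of range"
```

Returning an optional makes the out-of-range case explicit for the caller, whereas an empty string is indistinguishable from a legitimately empty result. The O(n) caveat about integer offsets still applies.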