Regular expression to remove characters from strings

Here's one for the regular expression aficionados. I have some strings containing full stops. I want to remove these when they occur immediately between two other characters, as in "a.2" or "1.2.3". But I don't want to remove them at the end of lines.


I presume I'll have to use regular expressions to do this, and I think I know how to write one that will match those characters (although it looks so ugly I'm ashamed to post it here.) But I can't figure out how to use it to filter my strings. Suggestions would be welcome.
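
One possible starting point, as a minimal sketch: Foundation's replacingOccurrences(of:with:options:) accepts a regular expression when passed the .regularExpression option, so no separate filtering step is needed. The function name removingInlineDots and the pattern are illustrative assumptions, and "between two other characters" is read here as "between two non-whitespace characters".

```swift
import Foundation

// Hypothetical helper: removes every "." that sits directly between two
// non-whitespace characters. A "." at the end of a line or before a space is kept.
func removingInlineDots(_ text: String) -> String {
    // (?<=\S) : the preceding character is non-whitespace (lookbehind)
    // \.      : a literal dot
    // (?=\S)  : the following character is non-whitespace (lookahead)
    let pattern = #"(?<=\S)\.(?=\S)"#
    return text.replacingOccurrences(of: pattern, with: "", options: .regularExpression)
}

print(removingInlineDots("a.2"))          // a2
print(removingInlineDots("1.2.3"))        // 123
print(removingInlineDots("Version 1.2.")) // Version 12.
```

Because the lookarounds do not consume the neighbouring characters, consecutive matches such as the two dots in "1.2.3" are both found.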

Accepted Reply

The responses from Claude and Quinn inspired me to come up with this code (from a playground):


import Foundation

var testString = "I want to get rid 1.2.3 of inline citations. But I don't want 1.2.4 to get rid of full stops."
let pattern = #"\.[^\s]"# // Matches a "." together with the non-whitespace character that follows it.
let regex = try! NSRegularExpression(pattern: pattern, options: [])
let mString = NSMutableString(string: testString)
// Delete every "." that is followed by a non-whitespace character, along with that character.
regex.replaceMatches(in: mString, options: [], range: NSMakeRange(0, mString.length), withTemplate: "")
testString = String(mString)
// Remove the digits that are left over from the citations.
testString = testString.filter {CharacterSet.decimalDigits.inverted.contains($0.unicodeScalars.first!)}
// Collapse the double spaces left behind where the citations were.
testString = testString.replacingOccurrences(of: "  ", with: " ")


This gives me what I want:


I want to get rid of inline citations. But I don't want to get rid of full stops.


Having to switch between String and NSMutableString is a bit of a nuisance, but perhaps that won't be necessary in future versions of Swift.
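
For what it's worth, the round trip through NSMutableString can probably be avoided already. The sketch below performs the same three steps on a plain String, using replacingOccurrences(of:with:options:) with the .regularExpression option. The function name strippingCitations is hypothetical, and \d is assumed to be an acceptable stand-in for the CharacterSet.decimalDigits filter.

```swift
import Foundation

// Hypothetical helper: the same three steps as above, without NSMutableString.
func strippingCitations(_ text: String) -> String {
    var result = text
    // 1. Delete every "." together with the non-whitespace character after it.
    result = result.replacingOccurrences(of: #"\.\S"#, with: "", options: .regularExpression)
    // 2. Delete the digits that remain.
    result = result.replacingOccurrences(of: #"\d"#, with: "", options: .regularExpression)
    // 3. Collapse the double spaces left behind.
    result = result.replacingOccurrences(of: "  ", with: " ")
    return result
}

let sample = "I want to get rid 1.2.3 of inline citations. But I don't want 1.2.4 to get rid of full stops."
print(strippingCitations(sample))
// I want to get rid of inline citations. But I don't want to get rid of full stops.
```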


I'll wait to see if you guys have any improvements before marking this as correct!

Replies

Just an awful solution, waiting for improvement (everything should be handled in the regex, without needing to test explicitly for lastChar and without all those type conversions).


In fact, as it stands, the regex could be replaced by simply replacing dots with ""!


import Foundation

var testString: NSMutableString = "Hello.You."
var testConverted = testString as String
let lastChar = String(testConverted.last!)
// Set the last character aside if it is a dot, so that it can be restored later.
if lastChar == "." {
    testString = NSMutableString(string: String(testConverted.dropLast()))
}

let patternDot = "\\.+" // One or more consecutive dots.
let regexDot = try! NSRegularExpression(pattern: patternDot, options: [])
// Remove all the remaining dots.
_ = regexDot.replaceMatches(in: testString, options: [], range: NSRange(location: 0, length: testString.length), withTemplate: "")
testConverted = testString as String
// Put the trailing dot back.
if lastChar == "." {
    testString = NSMutableString(string: testConverted + lastChar)
}

print(testString) // HelloYou.

I want to remove these when they occur immediately between two other characters, as in "a.2" or "1.2.3". But I don't want to remove them at the end of lines.

How do you want to handle multiple dots in a row? That is, do you expect 1..2 to map to 12? Or to 1.2? Or 1..2?

How do you want to handle leading dots?

For the moment I’m assuming you want all dots removed except one at the end, in which case I first define a helper on NSRegularExpression [1]:

extension NSRegularExpression {

    func stringByReplacingMatches(in string: String, withTemplate template: String) -> String {
        let r = NSRange.init(string.startIndex..<string.endIndex, in: string)
        return self.stringByReplacingMatches(in: string, options: [], range: r, withTemplate: template)
    }
}

And then use it:

let re = try! NSRegularExpression(pattern: #"\.+(.)"#, options: [])
print(re.stringByReplacingMatches(in: "1.2", withTemplate: #"$1"#))             // 12
print(re.stringByReplacingMatches(in: "1..2", withTemplate: #"$1"#))            // 12
print(re.stringByReplacingMatches(in: "..1.2", withTemplate: #"$1"#))           // 12
print(re.stringByReplacingMatches(in: "1.2.", withTemplate: #"$1"#))            // 12.
print(re.stringByReplacingMatches(in: "1.2..", withTemplate: #"$1"#))           // 12.
print(re.stringByReplacingMatches(in: "...123...456...", withTemplate: #"$1"#)) // 123456.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

[1] Getting decent regular expression support into Swift continues to be an important goal. See this Swift Forums post for some background.

There is probably something much simpler, maybe less elegant, without regex:


let originalString = "Hello.2..You."
var newString = originalString
if originalString.count > 0 {
    newString = String(originalString.dropLast()).replacingOccurrences(of: ".", with: "") + String(originalString.last!)
    print(newString)
}


Hello2You.

You get

I want to get rid of inline citations. But I don't want to get rid of full stops.


Is that what you want, or do you want this instead?

I want to get rid 123 of inline citations. But I don't want 124 to get rid of full stops.


The code does not compile (Xcode 10.3): mString.count needs to be testString.count.


Without regex:


import Foundation

let originalString = "I want to get rid 1.2.3 of inline citations. But I don't want 1.2.4 to get rid of full stops. "
var newString = originalString
// Strip trailing spaces, so that the trailing "." is kept even if the string ends in ".    "
if originalString.count > 0 {
    while newString.count > 0 && newString.last! == " " {          // No crash if original is "  "
        newString = String(newString.dropLast())
    }
    if newString.count > 0 {
        // Replace ". " with a specific substring that will (hopefully) never appear in the text,
        // then replace "." with "",
        // then restore ". " and put the trailing "." back again.
        newString = String(newString.dropLast()).replacingOccurrences(of: ". ", with: "ßfiµ").replacingOccurrences(of: ".", with: "").replacingOccurrences(of: "ßfiµ", with: ". ") + String(newString.last!)
    }
}
print("new", newString)

Result

I want to get rid 123 of inline citations. But I don't want 124 to get rid of full stops.


With regex:

I want to get rid of inline citations. But I don't want to get rid of full stops.

No, I wanted to get rid of the citations entirely, and leave just the text behind. The citations represent chapter.section.line references that are embedded in the text.


And yes, I need to have mString.length instead of mString.count. As you said, the latter does not compile. I've changed the code in my post accordingly.

The inline citations are always of the form x.y.z, so there are no leading dots to worry about. If there were, I'd take care of them with the trimmingCharacters(in:) method of String.
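
One caveat, illustrated with a small sketch: trimmingCharacters(in:) trims both ends of the string, so it would also remove a final full stop. If only leading dots should go, drop(while:) is one alternative. The sample string below is made up for illustration.

```swift
import Foundation

let text = "..1.2.3 Some text."

// trimmingCharacters(in:) removes matching characters from BOTH ends.
let trimmedBothEnds = text.trimmingCharacters(in: CharacterSet(charactersIn: "."))
print(trimmedBothEnds) // 1.2.3 Some text

// Dropping only the leading dots keeps the final full stop.
let trimmedLeading = String(text.drop(while: { $0 == "." }))
print(trimmedLeading)  // 1.2.3 Some text.
```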


Sorry, I should have given my testString example string in my first post, to illustrate more clearly what I was dealing with.

Test with: "The event occurred on 8.27.2019 in New York."


Of course, you get

The event occurred on in New York.


Is that what you want?


I fear that shows your spec is not precise enough.

Yes, that is the expected result. But the texts I am dealing with contain no dates, or anything else of the form x.y.z except the inline citations.

So if you are sure it cannot contain dates, email addresses, or names such as George W. Bush, … then you probably have your solution.

I’m not a huge fan of regular expressions but in some cases they are the right tool for the job. Consider this:

let pattern = #"[ \t]+[0-9]+(\.[0-9]+)*[ \t]+"#
let re = try! NSRegularExpression(pattern: pattern, options: [])
print(re.stringByReplacingMatches(in: "I want to get rid 1.2.3 of inline citations. But I don't want 1.2.4 to get rid of full stops. ", withTemplate: #" "#))
// "I want to get rid of inline citations. But I don't want to get rid of full stops. "

There’s a few assumptions here:

  • A citation has two or more dot-separated decimals.

  • It normalises leading and trailing spaces and tabs to a single space.
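
A note on the first assumption: as posted, the group is followed by *, so the pattern also matches a lone number such as 42. If "two or more" is meant to be enforced, a stricter variant might look like the sketch below. The helper name removingCitations is hypothetical, and the String-based replacingOccurrences(of:with:options:) call is used only to keep the example self-contained.

```swift
import Foundation

// Hypothetical helper: like the pattern above, but with `+` instead of `*`
// after the group, so at least one ".digits" component is required.
func removingCitations(_ text: String) -> String {
    let pattern = #"[ \t]+[0-9]+(\.[0-9]+)+[ \t]+"#
    return text.replacingOccurrences(of: pattern, with: " ", options: .regularExpression)
}

// The citation 1.2.3 is removed, but the lone number 42 is kept.
print(removingCitations("See 1.2.3 here and 42 there. "))
```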

Share and Enjoy

Quinn “The Eskimo!”

I get the following output:

I want to get rid of inline citations. But I don't want 1.2.4 to get rid of full stops.


Only the first 1.2.3. is removed…

Only the first 1.2.3. is removed…

How are you testing this? I did the following:

  1. In Xcode 10.3 on macOS 10.14.6, I created a new project from the Command Line Tool template.

  2. I replaced the code in main.swift with this:

    import Foundation

    extension NSRegularExpression {

        func stringByReplacingMatches(in string: String, withTemplate template: String) -> String {
            let r = NSRange.init(string.startIndex..<string.endIndex, in: string)
            return self.stringByReplacingMatches(in: string, options: [], range: r, withTemplate: template)
        }
    }

    let pattern = #"[ \t]+[0-9]+(\.[0-9]+)*[ \t]+"#
    let re = try! NSRegularExpression(pattern: pattern, options: [])
    print(re.stringByReplacingMatches(in: "I want to get rid 1.2.3 of inline citations. But I don't want 1.2.4 to get rid of full stops. ", withTemplate: #" "#))
    // "I want to get rid of inline citations. But I don't want to get rid of full stops. "

  3. I ran it, and it prints as shown.

Share and Enjoy

Quinn “The Eskimo!”

Tested in a playground, Xcode 10.3, macOS 10.14.6


With the following code

var newTestString = "I want to get rid 1.2.3 of inline citations. But I don't want 1.2.4 to get rid 1.2.5 of full stops. "
let newPattern = #"[ \t]+[0-9]+(\.[0-9]+)*[ \t]+"#
let re = try! NSRegularExpression(pattern: newPattern, options: [])
print("Eskimo", re.stringByReplacingMatches(in: newTestString, range: NSMakeRange(0, testString.count), withTemplate: #" "#))

I get

Eskimo I want to get rid of inline citations. But I don't want 1.2.4 to get rid 1.2.5 of full stops.

The code you posted won’t compile. To start, it’s missing the import Foundation. Secondly, line 4 references testString, not newTestString. With those two fixes I get this:

Eskimo I want to get rid of inline citations. But I don't want to get rid of full stops.

Even with the above result, this code has a more subtle, and hence more worrying, bug. The count property of Swift’s String returns the number of Character elements in the string, where each element is an extended grapheme cluster. In contrast, the length property of NSRange is counted in UTF-16 code units. For all but the simplest ASCII strings, these are likely to be different. If you want to get an NSRange that spans the entire string, use the code I posted on 28 Aug.
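
To make the difference concrete, here is a minimal sketch. The sample string is an assumption chosen for illustration: it contains an emoji that lies outside the Basic Multilingual Plane and so takes two UTF-16 code units.

```swift
import Foundation

let s = "a👍b"

print(s.count)        // 3 (Character elements)
print(s.utf16.count)  // 4 (UTF-16 code units; the emoji needs a surrogate pair)

// An NSRange built from String indices is measured in UTF-16 code units,
// so it covers the whole string.
let fullRange = NSRange(s.startIndex..<s.endIndex, in: s)
print(fullRange.length) // 4
```

A range built with NSMakeRange(0, s.count) would stop one code unit short here, so a match at the very end of the string could be missed.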

I’m not sure what’s going wrong at your end, but the bugs above suggest that there’s a problem with your testing methodology.

Share and Enjoy

Quinn “The Eskimo!”