Swift Regex too slow als grep replacement?

I made a log analyzer which does currently use the SHELL which works fine. Now I wanted to replace it with native Swift code but the result is really slow.

Here is an example with about 100 MB real log data. It needs just a second via SHELL but more than 80 seconds with Swift code.

Is there a way to improve the Swift code?

In SHELL I use

% time (cd /regTest/logs;cat access.log access.log.1 | grep '26\/Apr' | egrep -v '(www.apple.com/go/applebot | www.bing.com/bingbot.htm | www.googlebot.com/bot.html | Xing Bot)' | awk '/GET/ {print $1}' | sort -n | uniq 1>/dev/null)
1,09s user 0,05s system 105% cpu 1,081 total

Here is the result of my Swift test:

% time ./regTest
Found 1813 lines.
 82,54s user 0,26s system 99% cpu 1:22,83 total

My Swift sample code

import Foundation
import RegexBuilder

guard let fullText = try? String(contentsOf: URL(filePath: "/regTest/logs/access.log")) + String(contentsOf: URL(filePath: "/regTest/logs/access.log.1")) else {
    print("Cannot read files!")
    exit(1)
}

let yesterdayRegEx = Regex {
    Capture {
        "26/Apr"
    }
}
let botListRegEx = Regex {
    Capture {
        ChoiceOf {
            "www.apple.com/go/applebot"
            "www.bing.com/bingbot.htm"
            "www.googlebot.com/bot.html"
            "Xing Bot"
        }
    }
}

let dateMatch = fullText.split(separator: "\n")
    .filter{ $0.firstMatch(of: yesterdayRegEx) != nil }
    .filter{ $0.firstMatch(of: botListRegEx) == nil }

print("Found \(dateMatch.count) lines.")
Answered by GreatOm in 757419022

There was another discussion on https://forums.swift.org/t/use-regexbuilder-als-grep-replacement/65782 where I was able to get much faster code. This code does now run in 3 seconds.

let paths = ["regTest/logs/access.log", "/regTest/logs/access.log.1"]
var startDate = Date()

let matchedLines = try await withThrowingTaskGroup(of: [String].self, body: { group in
    for aPath in paths {
        group.addTask {
            let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
                        .filter { $0.contains("26/Apr") }
                        .filter { $0.contains("\"GET ") }
                        .filter{ !$0.contains(botListRegEx) }
            var matched : [String] = []
            for try await line in lines {
                if let m = line.firstMatch(of: ipAtStartRegEx) {
                    let (s, _) = m.output
                    matched.append(String(s))
                }
            }
            return matched
        }
    }
    var matchedLines : [String] = []
    for try await partialMatchedLines in group {
        matchedLines.append(contentsOf: partialMatchedLines)
    }
    return matchedLines
})

let uniqArray = Set(matchedLines)

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")
print("Found \(uniqArray.count) IP adresses.")

One thing to note is that your shell script includes significant concurrency; the five processes will run at the same time on different processors. Your swift code does everything sequentially on one processor. I don't know how significant that is compared to other factors.

How quick is the swift code without the actual regex testing, i.e. how quickly can it read the file, split into lines and count the lines?

The Swift code is fast for everything but the RegEx. The file read takes about 1 s but the RegEx takes more than 80 s. That's why I ask for help to speed-up the RegEx.

My Swift sample code does no sort and uniq as in the shell example. This will be added after the RegEx is usable. I won´t care if the code in not quite so fast as my shell but it must be much quicker to be usable.

I note that you are capturing the matches. You don't need to do that, do you? Is it faster if you just use "contains", rather than "firstMatchOf"?

    .filter{ $0.contains(yesterdayRegEx) }
    .filter{ $0.contains(botListRegEx) }

Makes no difference.

But now I got another idea: I removed the filter to get a clue how much time the spilt needs. It is about 50 s! Therefore the split is the major slowdown and the RegEx take just 30 s.

$0.contains(yesterdayRegEx)

Did you also change the definition of the regex to remove the Capture { } ?

I removed the filter to get a clue how much time the spilt needs. It is about 50 s!

Right. It's important to know what is actually slow, first!

My Swift is poor. (I'm really a C++ person). But clearly you don't actually need to create new strings for each individual line of the input. I'm not sure what the best, i.e. most idiomatic, way to do that is in Swift. Maybe someone else will comment.

Good catch.

Did you also change the definition of the regex to remove the Capture { } ?

Nope. I did just replace "firstMatch". Now I used a plain string and it is much faster. Also I use now another split which is way faster (20 vs. 50s). So here is the code

let matchedLines = fullText.split(whereSeparator: \.isNewline)
    .filter{ $0.contains(yesterdayRegEx) }
    .filter{ $0.contains(botListRegEx) }

Match time: 52.7 s

let matchedLines = fullText.split(whereSeparator: \.isNewline)
    .filter{ $0.contains("26/Apr") }
    .filter{ $0.contains(botListRegEx) }

Match time: 21.8 s

So the slow part is now the Split which take most of the time (21 s).

You need a Swift expert to understand that. You need, somehow, to avoid allocating memory for and copying strings for every line.

Or alternatively, create a regex you can apply to the entire input, which matches just the bits you want but captures the entire line. Then you wouldn't need to split.

Accepted Answer

There was another discussion on https://forums.swift.org/t/use-regexbuilder-als-grep-replacement/65782 where I was able to get much faster code. This code does now run in 3 seconds.

let paths = ["regTest/logs/access.log", "/regTest/logs/access.log.1"]
var startDate = Date()

let matchedLines = try await withThrowingTaskGroup(of: [String].self, body: { group in
    for aPath in paths {
        group.addTask {
            let lines = FileHandle(forReadingAtPath: aPath)!.bytes.lines
                        .filter { $0.contains("26/Apr") }
                        .filter { $0.contains("\"GET ") }
                        .filter{ !$0.contains(botListRegEx) }
            var matched : [String] = []
            for try await line in lines {
                if let m = line.firstMatch(of: ipAtStartRegEx) {
                    let (s, _) = m.output
                    matched.append(String(s))
                }
            }
            return matched
        }
    }
    var matchedLines : [String] = []
    for try await partialMatchedLines in group {
        matchedLines.append(contentsOf: partialMatchedLines)
    }
    return matchedLines
})

let uniqArray = Set(matchedLines)

print("Match time: \(abs(startDate.timeIntervalSinceNow)) s")
print("Found \(matchedLines.count) lines.")
print("Found \(uniqArray.count) IP adresses.")

I tried to post a further reply last week but the Submit button seemed to be disabled. I've just read this thread, which seems to explain that forum bug:

https://developer.apple.com/forums/thread/728466

So here is the post that I tried to send last week, except that I've replaced the string that I'm not allowed to mention with nonsense:

I note that none of your regular expressions actually need to be regular expressions, i.e. they are just fixed text strings. For example, where you have "non.sense", I think you only want it to literally match "non.sense", not "non0sense".

Since I like a challenge, I've just tried to do this in C++. Using simple string searching for the patterns, I can process a 100 MB file in 0.165 seconds. But if I just replace the date search with a fixed-string regex, the runtime increases to about 3 seconds. (Which I find pretty poor.)

It is interesting that this is about the same as the runtime you are seeing. It would not surprise me if Swift and Clang's libc++ share the same (poor) regex implementation.

I note that none of your regular expressions actually need to be regular expressions, i.e. they are just fixed text strings. For example, where you have "non.sense", I think you only want it to literally match "non.sense", not "non0sense".

Yes. The first two filters are fixed strings but the last one contains a lot of different search strings. Therefore RegEx makes sense. You're right that a simple string search may be faster but since it takes now just 3 seconds I'm fine with it. Thanks for sharing your thoughts on my code.

Swift Regex too slow als grep replacement?
 
 
Q