Actually, my "fix" from earlier isn't quite right -- the line (from whatever input file) could have null chars in it, and grep ought to handle that gracefully instead of exploding (GNU grep handles this just fine). But we see:
% printf '\0' | LANG="en_US.UTF-8" grep -o 'b*'
Assertion failed: (advance > 0), function procline, file util.c, line 732.
[1] 24086 done printf '\0' |
24087 abort LANG="en_US.UTF-8" ./grep-debug -o 'b*'
Same sort of result, but now the '\0' char is coming from the input, rather than the line buffer's terminal '\0'.
So maybe something like so:
diff --git a/grep/util.c b/grep/util.c
index f362f97..1689061 100644
--- a/grep/util.c
+++ b/grep/util.c
@@ -691,7 +691,7 @@ procline(struct parsec *pc)
#ifdef __APPLE__
/* rdar://problem/86536080 */
if (pmatch.rm_so == pmatch.rm_eo) {
- if (MB_CUR_MAX > 1) {
+ if (MB_CUR_MAX > 1 && nst < pc->ln.len && pc->ln.dat[nst] != '\0') {
wchar_t wc;
int advance;
@@ -721,7 +721,7 @@ procline(struct parsec *pc)
* either make progress or end the search.
*/
if (pmatch.rm_so == pmatch.rm_eo) {
- if (MB_CUR_MAX > 1) {
+ if (MB_CUR_MAX > 1 && nst < pc->ln.len && pc->ln.dat[nst] != '\0') {
wchar_t wc;
int advance;
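The guard in the diff above can also be written as a standalone helper, which makes the edge cases easier to see. This is only a sketch: `advance_after_empty_match` is a hypothetical name, not a function in Apple's grep, and stepping one byte over an invalid multibyte sequence is my own assumption (the diff above only covers the end of the line and null bytes).

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of grep: return how many bytes to step
 * past a zero-length match at offset `off` in a buffer of `len` bytes.
 * Returns 0 when `off` is at or beyond the end, so the caller can stop.
 */
static size_t
advance_after_empty_match(const char *dat, size_t len, size_t off)
{
	wchar_t wc;
	size_t room;
	int n;

	assert(dat != NULL);
	if (off >= len)
		return (0);		/* nothing left to scan */

	/* Never let mbtowc() look past the end of the line. */
	room = len - off;
	if (room > (size_t)MB_CUR_MAX)
		room = (size_t)MB_CUR_MAX;

	(void)mbtowc(NULL, NULL, 0);	/* reset any shift state */
	n = mbtowc(&wc, &dat[off], room);

	/*
	 * n == 0: a null byte that came from the input itself.
	 * n < 0:  an invalid or truncated multibyte sequence (assumption:
	 *         skip one byte; the diff above does not address this).
	 * Either way, step over a single byte instead of asserting.
	 */
	if (n <= 0)
		return (1);
	return ((size_t)n);
}
```

With this shape, a caller would stop the search when the helper returns 0 and otherwise add the result to `nst`, so a null byte in the input advances by one byte just like the `nst++` branch does.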
Here's the bug.
Let's use this example:
% printf '%s' 'a' | grep -o 'b*'
The code from earlier:
/*
 * rdar://problem/86536080 - if our first match
 * was 0-length, we wouldn't progress past that
 * point. Incrementing nst here ensures that if
 * no other pattern matches, we'll restart the
 * search at one past the 0-length match and
 * either make progress or end the search.
 */
if (pmatch.rm_so == pmatch.rm_eo) {
	if (MB_CUR_MAX > 1) {
		wchar_t wc;
		int advance;

		advance = mbtowc(&wc,
		    &pc->ln.dat[nst],
		    MB_CUR_MAX);
		assert(advance > 0);
		nst += advance;
	} else {
		nst++;
	}
}
Here's the problem: pc->ln.dat is the string for the current line. nst is an offset into that string. Note that this code is enclosed in a loop. The first time around that loop, pc->ln.dat is "a", and nst is 0. Thus &pc->ln.dat[nst] is effectively "a". mbtowc returns 1 as we would expect.
The loop iterates, and now pc->ln.dat is still "a", but nst is 1, so &pc->ln.dat[nst] is "" (the empty string). When mbtowc is given a pointer to a null char (as we have here), it returns 0. Given that, the assertion now fails.
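The two iterations can be reproduced outside grep. A minimal sketch, assuming a hypothetical helper `width_at` (not part of grep) that makes the same `mbtowc` call as `procline`:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of grep: report what mbtowc() returns
 * for the character starting at offset `off` of the null-terminated
 * string `line`, mirroring the call made in procline().
 */
static int
width_at(const char *line, size_t off)
{
	wchar_t wc;

	assert(line != NULL);
	/* Reset any shift state left over from earlier conversions. */
	(void)mbtowc(NULL, NULL, 0);
	return (mbtowc(&wc, &line[off], MB_CUR_MAX));
}
```

Here `width_at("a", 0)` gives 1, while `width_at("a", 1)` points at the terminator and gives 0, which is exactly the value that `assert(advance > 0)` rejects.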
The problem can be stated in one of two ways:
1. The loop should have exited early after the first iteration (or at least changed the local match state so that we never arrive at the aforementioned code block), or
2. The code block should be amended so that it never tries to read at or past the terminating null char.
For option 2, something like this -- as I have tested by compiling Apple's grep from source -- would suffice:
diff --git a/grep/util.c b/grep/util.c
index f362f97..ab3aec1 100644
--- a/grep/util.c
+++ b/grep/util.c
@@ -691,7 +691,7 @@ procline(struct parsec *pc)
#ifdef __APPLE__
/* rdar://problem/86536080 */
if (pmatch.rm_so == pmatch.rm_eo) {
- if (MB_CUR_MAX > 1) {
+ if (MB_CUR_MAX > 1 && nst < pc->ln.len) {
wchar_t wc;
int advance;
@@ -721,7 +721,7 @@ procline(struct parsec *pc)
* either make progress or end the search.
*/
if (pmatch.rm_so == pmatch.rm_eo) {
- if (MB_CUR_MAX > 1) {
+ if (MB_CUR_MAX > 1 && nst < pc->ln.len) {
wchar_t wc;
int advance;
To restate the problem: the latest grep on macOS indexes into the current line out of bounds. The only reason the error isn't more catastrophic is that grep terminates the current line buffer with an additional null character (which is not part of the original input file). Reading that terminator makes mbtowc report a width of 0, and that just so happens to tickle the assert that checks how many bytes wide the current character (which is outside of the string!) is.
Okay... I guess comments on replies don't get formatted, so reposting here:
I don't think there's anything unusual about my environment, nor locale:
% locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Note that the problem goes away if you specify a locale that does not support multibyte characters (i.e. where MB_CUR_MAX=1):
% printf '%s' 'a' | LANG=C grep -o 'b*'
vs
% printf '%s' 'a' | LANG="en_US.UTF-8" grep -o 'b*'
Assertion failed: (advance > 0), function procline, file util.c, line 732.
[1] 20179 done printf '%s' 'a' |
20180 abort LANG="en_US.UTF-8" ./grep-debug -o 'b*'
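The locale dependence comes from `MB_CUR_MAX`, which is evaluated at run time against the current `LC_CTYPE`. A small sketch to check what a given locale reports; `mb_cur_max_in` is a hypothetical helper name, and I am assuming grep picks up its locale from the environment in the usual `setlocale` way:

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/*
 * Hypothetical helper: switch LC_CTYPE to `name` and return MB_CUR_MAX
 * for it, or 0 if that locale is not available on this system.
 */
static size_t
mb_cur_max_in(const char *name)
{
	assert(name != NULL);
	if (setlocale(LC_CTYPE, name) == NULL)
		return (0);	/* locale not installed */
	return ((size_t)MB_CUR_MAX);
}
```

`mb_cur_max_in("C")` is 1, so the `nst++` branch is taken and `mbtowc` is never called. A UTF-8 locale such as `en_US.UTF-8` reports a value greater than 1 where it is installed (the exact number depends on the C library), which is what sends `procline` down the `mbtowc` path.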