Forum: RetroDigital BBS

freebsd arm64 (rpi4) problem with regex?

From Mike Scott@3:770/3 to All on Thu Feb 3 15:52:47 2022

XPost: comp.unix.bsd.freebsd.misc

This is on freebsd13.0/arm64/rpi4

A problem arising from milter-regex. This fails to accept known-good
regular expressions, directly taken from a working i386 system.

I believe the problem lies in the regex library, as a test program fails
to compile regular expressions that contain backslashed special characters:

The salient chunk of my test program is
regex_t re;
if( regcomp( &re, argv[1], REG_ICASE ) ) {
printf("bad re\n");

which works on "simple" things:
# ./a '123' 'abc123def'
re <<123>> string <<abc123def>>
matching:- <<123>>

but fails on \s and \t etc:
# ./a '\s' 'abc def'
re <<\s>> string <<abc def>>
bad re

although this also works
# ./a '\\' 'abc\def'
re <<\\>> string <<abc\def>>
matching:- <<\>>

(test program takes the re and a test string as its two args)

I'd not be surprised if this is another char <==> int problem, but the
regex stuff is a tad more complex than spfmilter was.

Can anyone check this out please?

--
Mike Scott
Harlow, England
--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Lew Pitcher@3:770/3 to Mike Scott on Thu Feb 3 16:39:18 2022

XPost: comp.unix.bsd.freebsd.misc

On Thu, 03 Feb 2022 15:52:47 +0000, Mike Scott wrote:

This is on freebsd13.0/arm64/rpi4

A problem arising from milter-regex. This fails to accept known-good
regular expressions, directly taken from a working i386 system.

I believe the problem lies in the regex library, as a test program fails
to compile regular expressions that contain backslashed special
characters:

The salient chunk of my test program is
regex_t re;
if( regcomp( &re, argv[1], REG_ICASE ) ) {
printf("bad re\n");

In general, it would be helpful to know /why/ regcomp(3) disliked a given
regex. Try using regerror(3) [1]. Something like this (caution: code
neither syntax checked nor tested) ...
    regex_t re;
    int regcomp_rc;

    if ((regcomp_rc = regcomp(&re, argv[1], REG_ICASE)) != 0)
    {
        char regcomp_error[256]; /* or some other large-enough size */

        /* regerror() takes the regex_t, not the pattern string */
        regerror(regcomp_rc, &re, regcomp_error, sizeof regcomp_error);
        printf("bad re: regcomp() = %d (%s)\n", regcomp_rc, regcomp_error);
        /*
        ... other error handling as required
        */
    }
could give you a better idea of why regcomp() didn't like a given regex.

which works on "simple" things:
# ./a '123' 'abc123def'
re <<123>> string <<abc123def>>
matching:- <<123>>

but fails on \s and \t etc:
# ./a '\s' 'abc def'
re <<\s>> string <<abc def>>
bad re

re_format(7) [2] gives a list of handled backslash-escaped sequences,
and '\t' isn't one of the handled sequences. Given that, regex(7)
says that an atom may be
"...
a '\' followed by any other character (matching that character taken
as an ordinary character, as if the '\' had not been present)
..."

So, it looks like regcomp() /should/ handle your test case here.

although this also works
# ./a '\\' 'abc\def'
re <<\\>> string <<abc\def>>
matching:- <<\>>

It looks to me like the regcomp(3) backslash-handling logic may be
rejecting anything that doesn't match its list of handled characters
(although it /should/ handle your '\t' as 't', according to the docs).

(test program takes the re and a test string as its two args)

I'd not be surprised if this is another char <==> int problem, but the
regex stuff is a tad more complex than spfmilter was.

Can anyone check this out please?

[1] https://www.freebsd.org/cgi/man.cgi?query=regex&sektion=3
[2] https://www.freebsd.org/cgi/man.cgi?query=re_format&sektion=7

HTH
--
Lew Pitcher
"In Skills, We Trust"

From Christian Weisgerber@3:770/3 to Mike Scott on Thu Feb 3 18:23:07 2022

XPost: comp.unix.bsd.freebsd.misc

On 2022-02-03, Mike Scott <usenet.16@scottsonline.org.uk.invalid> wrote:

This is on freebsd13.0/arm64/rpi4

A problem arising from milter-regex. This fails to accept known-good
regular expressions, directly taken from a working i386 system.

I think your arm64 system is at a different revision of FreeBSD
than your i386 one.

I believe the problem lies in the regex library, as a test program fails
to compile regular expressions that contain backslashed special characters:

but fails on \s and \t etc:
# ./a '\s' 'abc def'
re <<\s>> string <<abc def>>
bad re

What are "\s" and "\t" supposed to mean? In traditional regular
expressions, they have no meaning. In that case, the '\' used to
be ignored, i.e., they were equivalent to plain "s" and "t".

However, that was changed in this commit...

regex(3): Interpret many escaped ordinary characters as EESCAPE
https://cgit.freebsd.org/src/commit/lib/libc/regex?id=adeebf4cd47c3e85155d92f386bda5e519b75ab2

... so such sequences would now result in an error.

Subsequently, some GNU extensions have been added that give new
meaning to "\s" but not to "\t".
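
A quick way to see which side of that commit a given box is on is to feed a few escapes to regcomp() and print what regerror() says. This is only a sketch: probe_pattern() is a made-up helper name, and it compiles as a basic RE with REG_ICASE to mirror the OP's test program.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` as a POSIX basic RE and report
 * the outcome. Returns regcomp()'s return code (0 on success). */
int probe_pattern(const char *pattern)
{
    regex_t re;
    char msg[256];
    int rc = regcomp(&re, pattern, REG_ICASE);

    if (rc != 0) {
        /* regerror() wants the regex_t that regcomp() was given. */
        regerror(rc, &re, msg, sizeof msg);
        printf("%-6s -> error %d (%s)\n", pattern, rc, msg);
        return rc;
    }
    printf("%-6s -> compiles\n", pattern);
    regfree(&re);
    return 0;
}
```

Calling probe_pattern("\\s"), probe_pattern("\\t") and probe_pattern("\\b") from a small main() on each machine should show the difference directly. What gets printed depends on the libc revision, so no expected output is given here.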

--
Christian "naddy" Weisgerber naddy@mips.inka.de

From Mike Scott@3:770/3 to Lew Pitcher on Fri Feb 4 14:05:36 2022

XPost: comp.unix.bsd.freebsd.misc

On 03/02/2022 16:39, Lew Pitcher wrote:

On Thu, 03 Feb 2022 15:52:47 +0000, Mike Scott wrote:

This is on freebsd13.0/arm64/rpi4

A problem arising from milter-regex. This fails to accept known-good
regular expressions, directly taken from a working i386 system.

I believe the problem lies in the regex library, as a test program fails
to compile regular expressions that contain backslashed special
characters:

The salient chunk of my test program is
regex_t re;
if( regcomp( &re, argv[1], REG_ICASE ) ) {
printf("bad re\n");

.....

Thanks for the responses.

Firstly, I have to confess to some history: way, way back, I modified milter-regex to use pcre rather than libc's regex routines. That's
probably why my patterns still have \s strings and the like: these are
valid in pcre as \s -> whitespace and \t -> tab etc.

That said, the "proper" package code from freebsd was reinstated several
years ago, and both systems (i386 and arm64) are running the same
packaged version 2.7.2. (It means my re's certainly haven't worked for a
while, but that's a separate issue: ooops!) I have the exact same
milter-regex config file on both machines.

On the arm64 box (fbsd 13.0), I get logged:
parse_ruleset: /usr/local/etc/milter-regex.conf:196: regcomp:
^\s*Fwd.?\s*$: trailing backslash (\)

As has been pointed out, \s may not mean what I wanted it to: but is nevertheless valid, and that re should be accepted as equivalent to
^s*Fwd.?s*$
(man re_format is unambiguous on this)

On the i386 (running fbsd 11.4), the file compiles happily in full.
Hence my supposition about errors in the regex library.

The error returned in my test code on the arm64 from regcomp() is 5 (REG_EESCAPE). On the i386 I get

% ./a '\b' '123abc4 56'
re <<\b>> string <<123abc4 56>>
matching:- <<b>>

and on the arm64:
root@kirk:/usr/plumtree/config/milter-regex # ./a '\b' '123abc4 56'
re <<\b>> string <<123abc4 56>>
bad re 5

Hmmm. I'm wondering about char's and int's. It's been a long, long while
since I looked into the depths of Henry Spencer's original code (that on
a Sun): I have a vague recollection of liberties being taken with them
but IMWBW.

FWIW the test code, hacked from elsewhere, is

#include <stdio.h>
#include <regex.h>
#include <stdlib.h>

#define MAXMATCH 100

int main(int argc, char *argv[]) {
    regex_t re;
    regmatch_t matches[MAXMATCH];

    if( argc != 3 ) exit(1);

    printf("re <<%s>> string <<%s>>\n", argv[1], argv[2]);

    /* if( regcomp( &re, argv[1], REG_EXTENDED | REG_ICASE ) ) { */
    int errc = regcomp( &re, argv[1], REG_ICASE );
    if( errc ) {
        printf("bad re %x\n", errc);
        exit(1);
    }

    int err = regexec( &re, argv[2], MAXMATCH, matches, 0);
    if( err ) {
        printf("match failed %d\n", err);
        exit(1);
    }

    printf("matching:- <<");
    int p;
    for( p = matches[0].rm_so; p < matches[0].rm_eo; ++p )
        printf("%c", argv[2][p]);
    printf(">>\n");

}

--
Mike Scott
Harlow, England

From Christian Weisgerber@3:770/3 to Mike Scott on Fri Feb 4 15:11:43 2022

XPost: comp.unix.bsd.freebsd.misc

On 2022-02-04, Mike Scott <usenet.16@scottsonline.org.uk.invalid> wrote:

On the arm64 box (fbsd 13.0), I get logged:
parse_ruleset: /usr/local/etc/milter-regex.conf:196: regcomp:
^\s*Fwd.?\s*$: trailing backslash (\)

On the i386 (running fbsd 11.4), the file compiles happily in full.
Hence my supposition about errors in the regex library.

Again: This is an intentional change in behavior that was at some
point introduced in FreeBSD's libc regex code.

Specifically this commit, which is in 13.x but not in 11.x: https://cgit.freebsd.org/src/commit/lib/libc/regex?id=adeebf4cd47c3e85155d92f386bda5e519b75ab2

Here's the full commit message:

------------------->
regex(3): Interpret many escaped ordinary characters as EESCAPE

In IEEE 1003.1-2008 [1] and earlier revisions, BRE/ERE grammar allows for
any character to be escaped, but "ORD_CHAR preceded by an unescaped
<backslash> character [gives undefined results]".

Historically, we've interpreted an escaped ordinary character as the
ordinary character itself. This becomes problematic when some extensions
give special meanings to an otherwise ordinary character
(e.g. GNU's \b, \s, \w), meaning we may have two different valid
interpretations of the same sequence.

To make this easier to deal with and given that the standard calls this
undefined, we should throw an error (EESCAPE) if we run into this scenario
to ease transition into a state where some escaped ordinaries are blessed
with a special meaning -- it will either error out or have extended
behavior, rather than have two entirely different versions of undefined
behavior that leave the consumer of regex(3) guessing as to what behavior
will be used or leaving them with false impressions.

This change bumps the symbol version of regcomp to FBSD_1.6 and provides
the old escape semantics for legacy applications, just in case one has an
older application that would immediately turn into a pumpkin because of an
extraneous escape that's embedded or otherwise critical to its
operation.

This is the final piece needed before enhancing libregex with GNU
extensions and flipping the switch on bsdgrep.

[1] http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/

PR: 229925 (exp-run, courtesy of antoine)
Differential Revision: https://reviews.freebsd.org/D10510
<-------------------

--
Christian "naddy" Weisgerber naddy@mips.inka.de

From druck@3:770/3 to Christian Weisgerber on Fri Feb 4 20:55:23 2022

XPost: comp.unix.bsd.freebsd.misc

On 03/02/2022 18:23, Christian Weisgerber wrote:

What are "\s" and "\t" supposed to mean? In traditional regular
expressions, they have no meaning.

I'm not sure how many decades ago you are claiming for traditional
regex, but \s and \t have been any white space and tab for as long as I
can remember.

---druck

From Lew Pitcher@3:770/3 to druck on Fri Feb 4 21:32:26 2022

XPost: comp.unix.bsd.freebsd.misc

On Fri, 04 Feb 2022 20:55:23 +0000, druck wrote:

On 03/02/2022 18:23, Christian Weisgerber wrote:

What are "\s" and "\t" supposed to mean? In traditional regular
expressions, they have no meaning.

I'm not sure how many decades ago you are claiming for traditional
regex, but \s and \t have been any white space and tab for as long as I
can remember.

Yah.... no.

In POSIX regular expressions, neither \s nor \t has any documented
"special" meaning; for BREs,
"The interpretation of an ordinary character preceded by a backslash
( '\' ) is undefined, except for:
* The characters ')', '(', '{', and '}'
* The digits 1 to 9 inclusive (see BREs Matching Multiple Characters)
* A character inside a bracket expression"
and for EREs,
"An ordinary character is any character in the supported character set,
except for the ERE special characters listed in ERE Special
Characters. The interpretation of an ordinary character preceded by a
backslash ( '\' ) is undefined."
where, ERE Special Characters consists of a handful of punctuation
characters, and no alphabetics [1].

A common implementation of the POSIX regular expression parser defines a regular expression atom, in part, as
"..., a '\' followed by one of the characters "^.[$()|*+?{\" (matching
that character taken as an ordinary character), a '\' followed by any
other character (matching that character taken as an ordinary
character, as if the '\' had not been present), ..." [2]

In neither case does either \s or \t have any "special" meaning.

OTOH, Perl-compatible regular expressions recognize \s and \t as having
special meanings, with \s meaning "any white space character", and \t
meaning "tab (hex 09)"

It is worth noting that the OP was asking about POSIX regular
expressions, as handled by the POSIX regcomp(3) interface, and /not/ pcre regular expressions.
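
For anyone wanting the PCRE meaning without leaving POSIX syntax, \s can be spelled as the character class [[:space:]], and a tab can be put into the pattern as a literal tab character. A rough sketch follows; ere_matches() and FWD_PATTERN are names invented for this example, and it assumes the pattern is meant as an extended RE (REG_EXTENDED), which the OP's test program did not use.

```c
#include <regex.h>
#include <stddef.h>

/* PCRE-style  ^\s*Fwd.?\s*$  rewritten with a POSIX character class.
 * A tab would be written as a real tab character; in C source, "\t" is
 * turned into one by the compiler before regcomp() ever sees it. */
const char *const FWD_PATTERN = "^[[:space:]]*Fwd.?[[:space:]]*$";

/* Hypothetical helper: returns 1 if `subject` matches `pattern`
 * (extended RE, case-insensitive), 0 if it does not match, and -1 if
 * the pattern fails to compile. */
int ere_matches(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this, ere_matches(FWD_PATTERN, "  fwd: ") should report a match and ere_matches(FWD_PATTERN, "Fwd: hello") should not, without relying on any backslash escape that POSIX leaves undefined.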

HTH

[1] https://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html

[2] https://man7.org/linux/man-pages/man7/regex.7.html
--
Lew Pitcher
"In Skills, We Trust"

From Ahem A Rivet's Shot@3:770/3 to druck on Fri Feb 4 21:53:16 2022

XPost: comp.unix.bsd.freebsd.misc

On Fri, 4 Feb 2022 20:55:23 +0000
druck <news@druck.org.uk> wrote:

On 03/02/2022 18:23, Christian Weisgerber wrote:

What are "\s" and "\t" supposed to mean? In traditional regular expressions, they have no meaning.

I'm not sure how many decades ago you are claiming for traditional
regex, but \s and \t have been any white space and tab for as long as I
can remember.

They are in many places (including pcre) but not re_format(7).

--
Steve O'Hara-Smith
Odds and Ends at http://www.sohara.org/

From A. Dumas@3:770/3 to Lew Pitcher on Fri Feb 4 23:09:32 2022

XPost: comp.unix.bsd.freebsd.misc

On 04-02-2022 22:32, Lew Pitcher wrote:

OTOH, Perl-compatible regular expressions recognize \s and \t as having special meanings,

Not only pcre, also enhanced or extended. I don't have FreeBSD here but
this is from the MacOS man page which is based on BSD: (conclusion below
that)

-----
ENHANCED FEATURES
When the REG_ENHANCED flag is passed to one of the regcomp()
variants, additional features are activated. Like the enhanced regex
implementations in scripting languages such as perl(1) and python(1),
these additional features may conflict with the IEEE Std 1003.2
(``POSIX.2'') standards in some ways. Use this with care in situations
which require portability (including to past versions of the Mac OS X
using the previous regex implementation).

For enhanced basic REs, `+', `?' and `|' remain regular characters, but
`\+', `\?' and `\|' have the same special meaning as the unescaped
characters do for extended REs, i.e., one or more matches, zero or one
matches and alteration, respectively. For enhanced extended REs, back
references are available. Additional enhanced features are listed below.

Within a bracket expression, most characters lose their magic.
This also applies to the additional enhanced features, which don't
operate inside a bracket expression.

Assertions (available for both enhanced basic and enhanced extended REs)
In addition to `^' and `$' (the assertions that match the null string at
the beginning and end of line, respectively), the following assertions
become available:

[...]

Shortcuts (available for both enhanced basic and enhanced extended REs)
The following shortcuts can be used to replace more complicated
bracket expressions.

[...]
\s Matches a space character. This is equivalent to `[[:space:]]'.
[...]

Literal Sequences (available for both enhanced basic and enhanced
extended REs)
Literals are normally just ordinary characters that are matched
directly. Under enhanced mode, certain character sequences are
converted to specific literals.

[...]
\t The ``horizontal-tab'' character (ASCII code 9).
[...]
-----

So in practice it turns out that, using the built-in BSD-based grep on
MacOS without any flags, it does support both \s and \t.

From Lew Pitcher@3:770/3 to A. Dumas on Fri Feb 4 23:07:13 2022

XPost: comp.unix.bsd.freebsd.misc

On Fri, 04 Feb 2022 23:09:32 +0100, A. Dumas wrote:

On 04-02-2022 22:32, Lew Pitcher wrote:

OTOH, Perl-compatible regular expressions recognize \s and \t as having
special meanings,

Not only pcre, also enhanced or extended. I don't have FreeBSD here but
this is from the MacOS man page which is based on BSD: (conclusion below that)

-----
ENHANCED FEATURES
When the REG_ENHANCED flag is passed to one of the regcomp()
variants, additional features are activated.

[snip]

So in practice it turns out that, using the built-in BSD-based grep on
MacOS without any flags, it does support both \s and \t.

From the OP

The salient chunk of my test program is
regex_t re;
if( regcomp( &re, argv[1], REG_ICASE ) ) {
printf("bad re\n");

If I correctly understand the documentation you posted, to get the BSD
regex "Enhanced features" that include enhanced escape parsing, the OP
would have had to specify the REG_ENHANCED flag to regcomp()

I note that the OP /did not/ include this flag, nor has this flag
been discussed elsethread. I also note that the OP /did not/ include the REG_EXTENDED flag, so his regcomp() will interpret the regex as a BRE.

Still, it is up to the implementation as to how it will handle those
escaped ordinary characters that the POSIX standard leaves undefined
(for both BREs and EREs, per the POSIX text quoted upthread; the OP is
compiling his regex as a BRE).
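
To make the BRE/ERE point concrete, here is a small sketch (re_matches() is a made-up helper, not anything from milter-regex) that compiles the same pattern with and without REG_EXTENDED. In a POSIX basic RE an unescaped `+' is an ordinary character; in an extended RE it means "one or more".

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on match, 0 on no match, and -1 if the
 * pattern does not compile. `cflags` goes straight to regcomp():
 * 0 selects basic REs, REG_EXTENDED selects extended REs. */
int re_matches(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

So re_matches("^ab+$", "abbb", REG_EXTENDED) finds a match, while re_matches("^ab+$", "abbb", 0) does not, because the basic RE is looking for a literal "ab+". The same flag decides how the rest of the OP's pattern, including `.?', is read.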

--
Lew Pitcher
"In Skills, We Trust"


From Mike Scott@3:770/3 to Christian Weisgerber on Wed Feb 9 15:25:16 2022

XPost: comp.unix.bsd.freebsd.misc

On 03/02/2022 18:23, Christian Weisgerber wrote:

On 2022-02-03, Mike Scott <usenet.16@scottsonline.org.uk.invalid> wrote:

This is on freebsd13.0/arm64/rpi4

A problem arising from milter-regex. This fails to accept known-good
regular expressions, directly taken from a working i386 system.

I think your arm64 system is at a different revision of FreeBSD
than your i386 one.

I believe the problem lies in the regex library, as a test program fails
to compile regular expressions that contain backslashed special characters:
but fails on \s and \t etc:
# ./a '\s' 'abc def'
re <<\s>> string <<abc def>>
bad re

What are "\s" and "\t" supposed to mean? In traditional regular
expressions, they have no meaning. In that case, the '\' used to
be ignored, i.e., they were equivalent to plain "s" and "t".

However, that was changed in this commit...

regex(3): Interpret many escaped ordinary characters as EESCAPE
https://cgit.freebsd.org/src/commit/lib/libc/regex?id=adeebf4cd47c3e85155d92f386bda5e519b75ab2

... so such sequences would now result in an error.

Subsequently, some GNU extensions have been added that give new
meaning to "\s" but not to "\t".

Thanks to everyone for comments.

I'll note my own error in accidentally trying to use pcre-style
expressions in regex; I also take on board the changes being made: but
just perhaps, the man pages should also be kept in step with the code --
and incompatible changes like that noted above perhaps merit a warning
in a large flashing fluorescent font for some years.

Meanwhile I've removed the offending RE's from the milter's config file,
and it runs happily. Whether it runs /correctly/ remains to be seen.

Thanks again.

--
Mike Scott
Harlow, England
