• freebsd arm64 (rpi4) problem with regex?

    From Mike Scott@3:770/3 to All on Thu Feb 3 15:52:47 2022
    XPost: comp.unix.bsd.freebsd.misc

    This is on freebsd13.0/arm64/rpi4

    A problem arising from milter-regex. This fails to accept known-good
    regular expressions, directly taken from a working i386 system.

    I believe the problem lies in the regex library, as a test program fails
    to compile regular expressions that contain backslashed special characters:

    The salient chunk of my test program is
    regex_t re;
    if( regcomp( &re, argv[1], REG_ICASE ) ) {
    printf("bad re\n");

    which works on "simple" things:
    # ./a '123' 'abc123def'
    re <<123>> string <<abc123def>>
    matching:- <<123>>


    but fails on \s and \t etc:
    # ./a '\s' 'abc def'
    re <<\s>> string <<abc def>>
    bad re

    although this also works
    # ./a '\\' 'abc\def'
    re <<\\>> string <<abc\def>>
    matching:- <<\>>


    (test program takes the re and a test string as its two args)


    I'd not be surprised if this is another char <==> int problem, but the
    regex stuff is a tad more complex than spfmilter was.

    Can anyone check this out please?

    --
    Mike Scott
    Harlow, England
    --- SoupGate-Win32 v1.05
    * Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)
  • From Lew Pitcher@3:770/3 to Mike Scott on Thu Feb 3 16:39:18 2022
    XPost: comp.unix.bsd.freebsd.misc

    On Thu, 03 Feb 2022 15:52:47 +0000, Mike Scott wrote:

    This is on freebsd13.0/arm64/rpi4

    A problem arising from milter-regex. This fails to accept known-good
    regular expressions, directly taken from a working i386 system.

    I believe the problem lies in the regex library, as a test program fails
    to compile regular expressions that contain backslashed special
    characters:

    The salient chunk of my test program is
    regex_t re;
    if( regcomp( &re, argv[1], REG_ICASE ) ) {
    printf("bad re\n");

    In general, it would be helpful to know /why/ regcomp(3) disliked a given
    regex. Try using regerror(3) [1]. Something like this (caution: code
    neither syntax checked nor tested) ...
    regex_t re;
    int regcomp_rc;

    if((regcomp_rc = regcomp(&re, argv[1], REG_ICASE)) != 0)
    {
        char regcomp_error[256]; /* or some other large-enough size */

        /* regerror() takes the regex_t that failed to compile, not the pattern */
        regerror(regcomp_rc,&re,regcomp_error,sizeof regcomp_error);
        printf("bad re: regcomp() = %d (%s)\n",regcomp_rc,regcomp_error);
        /*
        ... other error handling as required
        */
    }
    could give you a better idea of why regcomp() didn't like a given regex.
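
A slightly fuller sketch of the regerror(3) approach, wrapped in a helper so it can be reused. The name compile_with_message() and its parameters are made up for illustration; only regcomp(), regerror() and regfree() come from <regex.h>. The message text varies between libc implementations, so treat it as diagnostic output only.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: compile `pattern` with `cflags`.
 * On failure, copy regerror()'s description into `msg`.
 * Returns the regcomp() status, which is 0 on success.
 */
static int compile_with_message(const char *pattern, int cflags,
                                char *msg, size_t msglen)
{
    assert(pattern != NULL && msg != NULL && msglen > 0);

    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc != 0) {
        /* regerror() wants the status code and the regex_t, not the pattern */
        regerror(rc, &re, msg, msglen);
    } else {
        msg[0] = '\0';
        regfree(&re);   /* only free a successfully compiled regex */
    }
    return rc;
}
```

In the test program this could be called as compile_with_message(argv[1], REG_ICASE, buf, sizeof buf), printing buf whenever the return value is non-zero.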


    which works on "simple" things:
    # ./a '123' 'abc123def'
    re <<123>> string <<abc123def>>
    matching:- <<123>>


    but fails on \s and \t etc:
    # ./a '\s' 'abc def'
    re <<\s>> string <<abc def>>
    bad re

    re_format(7) [2] gives a list of handled backslash-escaped sequences,
    and '\t' isn't one of the handled sequences. Beyond that list,
    re_format(7) says that an atom may be
    "...
    a '\' followed by any other character (matching that character taken
    as an ordinary character, as if the '\' had not been present)
    ..."

    So, it looks like regcomp() /should/ handle your test case here.

    although this also works
    # ./a '\\' 'abc\def'
    re <<\\>> string <<abc\def>>
    matching:- <<\>>

    It looks to me like the regcomp(3) backslash-handling logic may be
    rejecting anything that doesn't match its list of handled characters
    (although it /should/ handle your '\t' as 't', according to the docs).

    (test program takes the re and a test string as its two args)


    I'd not be surprised if this is another char <==> int problem, but the
    regex stuff is a tad more complex than spfmilter was.

    Can anyone check this out please?

    [1] https://www.freebsd.org/cgi/man.cgi?query=regex&sektion=3
    [2] https://www.freebsd.org/cgi/man.cgi?query=re_format&sektion=7

    HTH
    --
    Lew Pitcher
    "In Skills, We Trust"
  • From Christian Weisgerber@3:770/3 to Mike Scott on Thu Feb 3 18:23:07 2022
    XPost: comp.unix.bsd.freebsd.misc

    On 2022-02-03, Mike Scott <usenet.16@scottsonline.org.uk.invalid> wrote:

    This is on freebsd13.0/arm64/rpi4

    A problem arising from milter-regex. This fails to accept known-good
    regular expressions, directly taken from a working i386 system.

    I think your arm64 system is at a different revision of FreeBSD
    than your i386 one.

    I believe the problem lies in the regex library, as a test program fails
    to compile regular expressions that contain backslashed special characters:

    but fails on \s and \t etc:
    # ./a '\s' 'abc def'
    re <<\s>> string <<abc def>>
    bad re

    What are "\s" and "\t" supposed to mean? In traditional regular
    expressions, they have no meaning. In that case, the '\' used to
    be ignored, i.e., they were equivalent to plain "s" and "t".

    However, that was changed in this commit...

    regex(3): Interpret many escaped ordinary characters as EESCAPE
    https://cgit.freebsd.org/src/commit/lib/libc/regex?id=adeebf4cd47c3e85155d92f386bda5e519b75ab2

    ... so such sequences would now result in an error.

    Subsequently, some GNU extensions have been added that give new
    meaning to "\s" but not to "\t".
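
Since the result depends on the libc revision, one way to see what a given system does is to probe it directly. The following is a minimal sketch; probe_escape() is a hypothetical name, and it simply reports the regcomp() status for the two-character pattern made of a backslash and one character.

```c
#include <assert.h>
#include <regex.h>

/*
 * Hypothetical probe: build the pattern "\c" and return what the local
 * regcomp() says about it (0 means accepted, non-zero is a REG_* code).
 */
static int probe_escape(char c, int cflags)
{
    assert(c != '\0');

    char pattern[3] = { '\\', c, '\0' };
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0)
        regfree(&re);
    return rc;
}
```

Calling probe_escape('s', 0) and probe_escape('t', 0) on each machine shows whether that libc accepts, rejects, or gives extended meaning to the sequence; the answer is expected to differ between FreeBSD releases and between other platforms.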

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de
  • From Mike Scott@3:770/3 to Lew Pitcher on Fri Feb 4 14:05:36 2022
    XPost: comp.unix.bsd.freebsd.misc

    On 03/02/2022 16:39, Lew Pitcher wrote:
    On Thu, 03 Feb 2022 15:52:47 +0000, Mike Scott wrote:

    This is on freebsd13.0/arm64/rpi4

    A problem arising from milter-regex. This fails to accept known-good
    regular expressions, directly taken from a working i386 system.

    I believe the problem lies in the regex library, as a test program fails
    to compile regular expressions that contain backslashed special
    characters:

    The salient chunk of my test program is
    regex_t re;
    if( regcomp( &re, argv[1], REG_ICASE ) ) {
    printf("bad re\n");
    .....

    Thanks for the responses.

    Firstly, I have to confess to some history: way, way back, I modified milter-regex to use pcre rather than libc's regex routines. That's
    probably why my patterns still have \s strings and the like: these are
    valid in pcre as \s -> whitespace and \t -> tab etc.

    That said, the "proper" package code from freebsd was reinstated several
    years ago, and both systems (i386 and arm64) are running the same
    packaged version 2.7.2. (It means my re's certainly haven't worked for a
    while, but that's a separate issue: ooops!) I have the exact same
    milter-regex config file on both machines.

    On the arm64 box (fbsd 13.0), I get logged:
    parse_ruleset: /usr/local/etc/milter-regex.conf:196: regcomp:
    ^\s*Fwd.?\s*$: trailing backslash (\)

    As has been pointed out, \s may not mean what I wanted it to, but it is
    nevertheless valid, and that re should be accepted as equivalent to
    ^s*Fwd.?s*$
    (man re_format is unambiguous on this)

    On the i386 (running fbsd 11.4), the file compiles happily in full.
    Hence my supposition about errors in the regex library.
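
For what it is worth, the rule that was logged can be written without any escaped letters by using a POSIX character class. This is only a sketch under two assumptions: that \s was meant in the PCRE sense of "any whitespace", and that the rule was meant as an extended RE (so that `?` is a quantifier). FWD_PATTERN and is_bare_fwd() are names invented here, not part of milter-regex.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Assumed intent of ^\s*Fwd.?\s*$, spelled with [[:space:]] instead of \s. */
static const char FWD_PATTERN[] = "^[[:space:]]*Fwd.?[[:space:]]*$";

/*
 * Hypothetical helper: returns 1 if `subject` matches FWD_PATTERN,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
static int is_bare_fwd(const char *subject)
{
    assert(subject != NULL);

    regex_t re;
    if (regcomp(&re, FWD_PATTERN, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A tab, where \t was intended, can be given as a literal tab character in the pattern or as [[:blank:]] if a space is also acceptable.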

    The error returned in my test code on the arm64 from regcomp() is 5 (REG_EESCAPE). On the i386 I get

    % ./a '\b' '123abc4 56'
    re <<\b>> string <<123abc4 56>>
    matching:- <<b>>

    and on the arm64:
    root@kirk:/usr/plumtree/config/milter-regex # ./a '\b' '123abc4 56'
    re <<\b>> string <<123abc4 56>>
    bad re 5

    Hmmm. I'm wondering about char's and int's. It's been a long, long while
    since I looked into the depths of Henry Spencer's original code (that on
    a Sun): I have a vague recollection of liberties being taken with them
    but IMWBW.



    FWIW the test code, hacked from elsewhere, is

    #include <stdio.h>
    #include <regex.h>
    #include <stdlib.h>

    #define MAXMATCH 100

    int main(int argc, char *argv[]) {
        regex_t re;
        regmatch_t matches[MAXMATCH];

        if( argc != 3 ) exit(1);

        printf("re <<%s>> string <<%s>>\n", argv[1], argv[2]);

        /* if( regcomp( &re, argv[1], REG_EXTENDED | REG_ICASE ) ) { */
        int errc = regcomp( &re, argv[1], REG_ICASE );
        if( errc ) {
            printf("bad re %x\n", errc);
            exit(1);
        }

        int err = regexec( &re, argv[2], MAXMATCH, matches, 0);
        if( err ) {
            printf("match failed %d\n", err);
            exit(1);
        }

        printf("matching:- <<");
        int p;
        for( p = matches[0].rm_so; p < matches[0].rm_eo; ++p )
            printf("%c", argv[2][p]);
        printf(">>\n");

    }

    --
    Mike Scott
    Harlow, England
  • From Christian Weisgerber@3:770/3 to Mike Scott on Fri Feb 4 15:11:43 2022
    XPost: comp.unix.bsd.freebsd.misc

    On 2022-02-04, Mike Scott <usenet.16@scottsonline.org.uk.invalid> wrote:

    On the arm64 box (fbsd 13.0), I get logged:
    parse_ruleset: /usr/local/etc/milter-regex.conf:196: regcomp:
    ^\s*Fwd.?\s*$: trailing backslash (\)

    On the i386 (running fbsd 11.4), the file compiles happily in full.
    Hence my supposition about errors in the regex library.

    Again: This is an intentional change in behavior that was at some
    point introduced in FreeBSD's libc regex code.

    Specifically this commit, which is in 13.x but not in 11.x: https://cgit.freebsd.org/src/commit/lib/libc/regex?id=adeebf4cd47c3e85155d92f386bda5e519b75ab2

    Here's the full commit message:

    ------------------->
    regex(3): Interpret many escaped ordinary characters as EESCAPE

    In IEEE 1003.1-2008 [1] and earlier revisions, BRE/ERE grammar allows for
    any character to be escaped, but "ORD_CHAR preceded by an unescaped
    <backslash> character [gives undefined results]".

    Historically, we've interpreted an escaped ordinary character as the
    ordinary character itself. This becomes problematic when some extensions
    give special meanings to an otherwise ordinary character
    (e.g. GNU's \b, \s, \w), meaning we may have two different valid
    interpretations of the same sequence.

    To make this easier to deal with and given that the standard calls this
    undefined, we should throw an error (EESCAPE) if we run into this scenario
    to ease transition into a state where some escaped ordinaries are blessed
    with a special meaning -- it will either error out or have extended
    behavior, rather than have two entirely different versions of undefined
    behavior that leave the consumer of regex(3) guessing as to what behavior
    will be used or leaving them with false impressions.

    This change bumps the symbol version of regcomp to FBSD_1.6 and provides
    the old escape semantics for legacy applications, just in case one has an
    older application that would immediately turn into a pumpkin because of an
    extraneous escape that's embedded or otherwise critical to its
    operation.

    This is the final piece needed before enhancing libregex with GNU
    extensions and flipping the switch on bsdgrep.

    [1] http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/

    PR: 229925 (exp-run, courtesy of antoine)
    Differential Revision: https://reviews.freebsd.org/D10510
    <-------------------

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de
  • From druck@3:770/3 to Christian Weisgerber on Fri Feb 4 20:55:23 2022
    XPost: comp.unix.bsd.freebsd.misc

    On 03/02/2022 18:23, Christian Weisgerber wrote:
    What are "\s" and "\t" supposed to mean? In traditional regular
    expressions, they have no meaning.

    I'm not sure how many decades ago you are claiming for traditional
    regex, but \s and \t have been any white space and tab for as long as I
    can remember.

    ---druck
  • From Lew Pitcher@3:770/3 to druck on Fri Feb 4 21:32:26 2022
    XPost: comp.unix.bsd.freebsd.misc

    On Fri, 04 Feb 2022 20:55:23 +0000, druck wrote:

    On 03/02/2022 18:23, Christian Weisgerber wrote:
    What are "\s" and "\t" supposed to mean? In traditional regular
    expressions, they have no meaning.

    I'm not sure how many decades ago you are claiming for traditional
    regex, but \s and \t have been any white space and tab for as long as I
    can remember.

    Yah.... no.

    In POSIX regular expressions, neither \s nor \t have any documented
    "special" meaning; for BREs,
    "The interpretation of an ordinary character preceded by a backslash
    ( '\' ) is undefined, except for:
    * The characters ')', '(', '{', and '}'
    * The digits 1 to 9 inclusive (see BREs Matching Multiple Characters)
    * A character inside a bracket expression"
    and for EREs,
    "An ordinary character is any character in the supported character set,
    except for the ERE special characters listed in ERE Special
    Characters. The interpretation of an ordinary character preceded by a
    backslash ( '\' ) is undefined."
    where, ERE Special Characters consists of a handful of punctuation
    characters, and no alphabetics [1].

    A common implementation of the POSIX regular expression parser defines a regular expression atom, in part, as
    "..., a '\' followed by one of the characters "^.[$()|*+?{\" (matching
    that character taken as an ordinary character), a '\' followed by any
    other character (matching that character taken as an ordinary
    character, as if the '\' had not been present), ..." [2]

    In neither case do either \s or \t have any "special" meaning.

    OTOH, Perl-compatible regular expressions recognize \s and \t as having
    special meanings, with \s meaning "any white space character", and \t
    meaning "tab (hex 09)".

    It is worth noting that the OP was asking about POSIX regular
    expressions, as handled by the POSIX regcomp(3) interface, and /not/ pcre regular expressions.
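
The rules quoted above can be turned into a rough pre-flight check for patterns before they reach regcomp(). The sketch below is an approximation, not a full POSIX parser: first_undefined_escape() is a hypothetical name, the lists of escapable characters are taken from the text quoted in this post, and bracket expressions are skipped with simplified handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical lint: return the offset of the first backslash whose
 * following character is not on the "defined" list, or -1 if none.
 * `extended` selects the ERE list; otherwise the BRE list is used.
 */
static long first_undefined_escape(const char *pat, int extended)
{
    assert(pat != NULL);

    const char *ere_ok = "^.[$()|*+?{\\";         /* from the regex(7) quote */
    const char *bre_ok = ".[\\*^$(){}123456789";  /* specials, \( \) \{ \}, \1-\9 */
    const char *ok = extended ? ere_ok : bre_ok;
    size_t i = 0;

    while (pat[i] != '\0') {
        if (pat[i] == '[') {
            /* Inside a bracket expression a backslash is an ordinary character. */
            i++;
            if (pat[i] == '^') i++;
            if (pat[i] == ']') i++;               /* a leading ']' is literal */
            while (pat[i] != '\0' && pat[i] != ']') {
                if (pat[i] == '[' && pat[i + 1] != '\0' &&
                    strchr(":.=", pat[i + 1]) != NULL) {
                    char closer = pat[i + 1];     /* [:class:], [.x.], [=x=] */
                    i += 2;
                    while (pat[i] != '\0' &&
                           !(pat[i] == closer && pat[i + 1] == ']'))
                        i++;
                    if (pat[i] != '\0') i += 2;
                } else {
                    i++;
                }
            }
            if (pat[i] == ']') i++;
            continue;
        }
        if (pat[i] == '\\') {
            char next = pat[i + 1];
            if (next == '\0' || strchr(ok, next) == NULL)
                return (long)i;                   /* trailing or undefined escape */
            i += 2;
            continue;
        }
        i++;
    }
    return -1;
}
```

Run over a config file's patterns, a non-negative result points at the escapes that POSIX leaves undefined and that a given regcomp() is therefore free to accept, reinterpret, or reject.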

    HTH

    [1] https://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html

    [2] https://man7.org/linux/man-pages/man7/regex.7.html
    --
    Lew Pitcher
    "In Skills, We Trust"
  • From Ahem A Rivet's Shot@3:770/3 to druck on Fri Feb 4 21:53:16 2022
    XPost: comp.unix.bsd.freebsd.misc

    On Fri, 4 Feb 2022 20:55:23 +0000
    druck <news@druck.org.uk> wrote:

    On 03/02/2022 18:23, Christian Weisgerber wrote:
    What are "\s" and "\t" supposed to mean? In traditional regular expressions, they have no meaning.

    I'm not sure how many decades ago you are claiming for traditional
    regex, but \s and \t have been any white space and tab for as long as I
    can remember.

    They are in many places (including pcre) but not re_format(7).

    --
    Steve O'Hara-Smith
    Odds and Ends at http://www.sohara.org/
  • From A. Dumas@3:770/3 to Lew Pitcher on Fri Feb 4 23:09:32 2022
    XPost: comp.unix.bsd.freebsd.misc

    On 04-02-2022 22:32, Lew Pitcher wrote:
    OTOH, Perl-compatible regular expressions recognize \s and \t as having
    special meanings,

    Not only pcre, also enhanced or extended. I don't have FreeBSD here but
    this is from the MacOS man page which is based on BSD: (conclusion below
    that)

    -----
    ENHANCED FEATURES
    When the REG_ENHANCED flag is passed to one of the regcomp()
    variants, additional features are activated. Like the enhanced regex implementations in scripting languages such as
    perl(1) and python(1), these additional features may conflict with
    the IEEE Std 1003.2 (``POSIX.2'') standards in some ways. Use this with
    care in situations which require
    portability (including to past versions of the Mac OS X using the previous regex implementation).

    For enhanced basic REs, `+', `?' and `|' remain regular
    characters, but `\+', `\?' and `\|' have the same special meaning as the unescaped characters do for extended REs, i.e.,
    one or more matches, zero or one matches and alteration,
    respectively. For enhanced extended REs, back references are available.
    Additional enhanced features are listed below.

    Within a bracket expression, most characters lose their magic.
    This also applies to the additional enhanced features, which don't
    operate inside a bracket expression.

    Assertions (available for both enhanced basic and enhanced extended REs)
    In addition to `^' and `$' (the assertions that match the null
    string at the beginning and end of line, respectively), the following assertions become available:

    [...]

    Shortcuts (available for both enhanced basic and enhanced extended REs)
    The following shortcuts can be used to replace more complicated
    bracket expressions.

    [...]
    \s Matches a space character. This is equivalent to `[[:space:]]'.
    [...]

    Literal Sequences (available for both enhanced basic and enhanced
    extended REs)
    Literals are normally just ordinary characters that are matched
    directly. Under enhanced mode, certain character sequences are
    converted to specific literals.

    [...]
    \t The ``horizontal-tab'' character (ASCII code 9).
    [...]
    -----

    So in practice it turns out that, using the built-in BSD-based grep on
    MacOS without any flags, it does support both \s and \t.
  • From Lew Pitcher@3:770/3 to A. Dumas on Fri Feb 4 23:07:13 2022
    XPost: comp.unix.bsd.freebsd.misc

    On Fri, 04 Feb 2022 23:09:32 +0100, A. Dumas wrote:

    On 04-02-2022 22:32, Lew Pitcher wrote:
    OTOH, Perl-compatible regular expressions recognize \s and \t as having
    special meanings,

    Not only pcre, also enhanced or extended. I don't have FreeBSD here but
    this is from the MacOS man page which is based on BSD: (conclusion below that)

    -----
    ENHANCED FEATURES
    When the REG_ENHANCED flag is passed to one of the regcomp()
    variants, additional features are activated.
    [snip]
    So in practice it turns out that, using the built-in BSD-based grep on
    MacOS without any flags, it does support both \s and \t.

    From the OP
    The salient chunk of my test program is
    regex_t re;
    if( regcomp( &re, argv[1], REG_ICASE ) ) {
    printf("bad re\n");

    If I correctly understand the documentation you posted, to get the BSD
    regex "Enhanced features" that include enhanced escape parsing, the OP
    would have had to specify the REG_ENHANCED flag to regcomp()

    I note that the OP /did not/ include this flag, nor has this flag
    been discussed elsethread. I also note that the OP /did not/ include the REG_EXTENDED flag, so his regcomp() will interpret the regex as a BRE.

    Still, it is up to the implementation as to how it will handle the
    expansion of those escaped characters that the POSIX standard leaves
    undefined (this applies to both BREs and EREs, going by the POSIX text
    quoted elsethread; the OP is using a BRE).
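
To make the BRE/ERE distinction concrete, here is a small sketch showing how the same pattern behaves with and without REG_EXTENDED. re_matches() is a hypothetical wrapper written for this illustration; the flags and library calls are the standard <regex.h> ones.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical wrapper: returns 1 if `text` matches `pattern`,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
static int re_matches(const char *pattern, const char *text, int cflags)
{
    assert(pattern != NULL && text != NULL);

    regex_t re;
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With cflags of 0 the pattern "ab+c" is a BRE, where `+` is an ordinary character, so it matches the literal text "ab+c" but not "abbc". With REG_EXTENDED the `+` becomes a quantifier and "abbc" matches.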

    --
    Lew Pitcher
    "In Skills, We Trust"

  • From Mike Scott@3:770/3 to Christian Weisgerber on Wed Feb 9 15:25:16 2022
    XPost: comp.unix.bsd.freebsd.misc

    On 03/02/2022 18:23, Christian Weisgerber wrote:
    On 2022-02-03, Mike Scott <usenet.16@scottsonline.org.uk.invalid> wrote:

    This is on freebsd13.0/arm64/rpi4

    A problem arising from milter-regex. This fails to accept known-good
    regular expressions, directly taken from a working i386 system.

    I think your arm64 system is at a different revision of FreeBSD
    than your i386 one.

    I believe the problem lies in the regex library, as a test program fails
    to compile regular expressions that contain backslashed special characters:

    but fails on \s and \t etc:
    # ./a '\s' 'abc def'
    re <<\s>> string <<abc def>>
    bad re

    What are "\s" and "\t" supposed to mean? In traditional regular
    expressions, they have no meaning. In that case, the '\' used to
    be ignored, i.e., they were equivalent to plain "s" and "t".

    However, that was changed in this commit...

    regex(3): Interpret many escaped ordinary characters as EESCAPE
    https://cgit.freebsd.org/src/commit/lib/libc/regex?id=adeebf4cd47c3e85155d92f386bda5e519b75ab2

    ... so such sequences would now result in an error.

    Subsequently, some GNU extensions have been added that give new
    meaning to "\s" but not to "\t".


    Thanks to everyone for comments.

    I'll note my own error in accidentally trying to use pcre-style
    expressions in regex; I also take on board the changes being made: but
    just perhaps, the man pages should also be kept in step with the code --
    and incompatible changes like that noted above perhaps merit a warning
    in a large flashing fluorescent font for some years.

    Meanwhile I've removed the offending RE's from the milter's config file,
    and it runs happily. Whether it runs /correctly/ remains to be seen.

    Thanks again.

    --
    Mike Scott
    Harlow, England