• Re: "sed" question

    From Grant Taylor@21:1/5 to Keith Thompson on Thu Mar 7 20:38:28 2024
    XPost: comp.unix.shell

    On 3/7/24 18:09, Keith Thompson wrote:
    I know that's what awk does, but I don't think I would have expected
    it if I didn't know about it.

    Okay. I think that's a fair observation.

    $0 is the current input line.

    Or $0 is the current /record/ in awk parlance.

    If you don't change anything, or if you modify $0 itself, whitespace
between fields is preserved.

    If you modify any of the fields, $0 is recomputed and whitespace
    between tokens is collapsed.

    I don't agree with that.

    % echo 'one   two   three' | awk '{print $0; print $1,$2,$3}'
    one   two   three
    one two three

    I didn't /modify/ anything and awk does print the fields with different
    white space.

    awk *could* have been defined to preserve inter-field whitespace even
    when you modify individual fields,

    I question the veracity of that. Specifically when lengthening or
    shortening the value of a field. E.g. replacing "two" with "fifteen".
    This is particularly germane when you look at $0 as fixed-width
    formatted output.
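
    For instance, pad some columns to a fixed width and then lengthen a
    field; awk rebuilds $0 with a single-space OFS and the alignment is
    gone:

    % printf '%-8s %-8s %-8s\n' alpha two gamma | awk '{$2 = "fifteen"; print}'
    alpha fifteen gamma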

    and I think I would have found that more intuitive.

    I don't agree.

    (And ideally there would be a way to refer to that inter-field
    whitespace.)

    Remember, awk is meant for working on fields of data in a record. By
    default, the fields are delimited by white space characters. I'll say
    it this way: awk is meant for working on the non-white space characters.
    Or, put yet another way, awk is not meant for working on white space
    characters.

    The fact that modifying a field has the side effect of messing up $0
    seems counterintuitive.

    Maybe.

    But I think it's one that is acceptable for what awk is intended to do.

    Perhaps the behavior matches your intuition better than it matches
    mine.

    I sort of feel like you are wanting to / trying to use awk in places
    where sed might be better. sed just sees a string of text and is
    ignorant of any structure without a carefully crafted RE to provide it.

    Conversely awk is quite happy working with an easily identified field
    based on the count with field separators of one or more white space
    characters.

    Consider the output of `netstat -an` wherein you have multiple columns
    of IP addresses.

    Please find a quick way, preferably that doesn't involve negation
    (because what needs to be negated may be highly dynamic) that lists
    inbound SMTP connections on an email server but doesn't list outbound
    SMTP connections.

    awk makes it trivial to identify and print records that have the SMTP
    port in the local IP column, thus ignoring outbound connections with
    SMTP in the remote column.
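
    For instance (a rough sketch; assuming Linux-style `netstat -an` output
    where the local address is field 4 and the state is field 6 -- adjust
    the field numbers for your platform):

    % netstat -an | awk '$4 ~ /:25$/ && $6 == "ESTABLISHED"'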

    Aside: Yes, I know that ss and the likes have more features for this,
    but this is my example and ss is not installed everywhere.

    I sort of view awk as somewhat akin to SQL wherein fields in awk are
    like columns in SQL.

    I'd be more than a little bit surprised to find an SQL interface that
    preserved white space /between/ columns. -- Many will do it /within/
    columns.

    awk makes it trivial to take field oriented output from commands and
    apply some logic / parsing / action on specific fields in records.

    (And perhaps this should be moved to comp.lang.awk if it doesn't die
    out soon.

    comp.lang.awk added and followup pointed there.

    Though sed and awk are both languages in their own right
    and tools that can be used from the shell, so I'd argue there's a
    topicality overlap.)

    ;-)



    --
    Grant. . . .

  • From Kaz Kylheku@21:1/5 to Keith Thompson on Fri Mar 8 09:38:50 2024
    On 2024-03-08, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    But awk doesn't work with fixed-width data. The length of each field,
    and the length of $0, is variable.

    GNU Awk, however, can. It has a FIELDWIDTHS variable where you can
    specify column widths that then work instead of the FS field separator.
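
    A minimal sketch (GNU Awk only; the widths here are just made up for
    illustration):

    $ echo '2024-03-08' | gawk 'BEGIN { FIELDWIDTHS = "4 1 2 1 2" } { print $1, $3, $5 }'
    2024 03 08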

    There is also FPAT (see below).

    If awk *purely* dealt with input lines only as lists of tokens, then
    this:

    echo 'one   two   three' | awk '{print $0}'

    would print "one two three" rather than "one   two   three" (and awk
    would lose the ability to deal with arbitrarily formatted input). The
    fact that the inter-field whitespace is reset only when individual
    fields are touched feels arbitrary to me.

    There is no inter-field whitespace.

    There is the original record in $0, and parsed out fields in $1, $2, ..

    The fields don't have any space.

    The space comes from the value of the OFS variable.

    $ echo 'one two three' | awk -v OFS=: '{ $1=$1; print }'
    one:two:three

    GNU Awk has a FPAT mechanism by which we can specify the positive
    regex for recognizing fields as tokens. By means of that, we can
    save the separating whitespace, turning it into a field:

    $ echo 'one two three' | \
    awk -v FPAT='[^ ]+|[ ]+' -v OFS= \
    '{ $1=$1; print; print NF }'
    one two three
    5

    There you go. We now have 5 fields. The interstitial space is a field.
    We set OFS to empty and so $1=$1 doesn't collapse the separation.

    Not really. I'm just remarking on one particular awk feature that I
    find a bit counterintuitive.

    The proposed feature of preserving the whitespace separation is a niche
    use case in relation to Awk's orientation toward tabular data.

    In tabular data that is not formatted into nice columns for a monospaced
    font, the whitespace doesn't matter. Awk's behavior is that it will
    normalize the separation.

    In tabular data that is aligned visually, preserving the whitespace
    will not work, if any of your field edits change a field width.

    I believe that your niche use case has value, though.

    That's why, in the TXR Lisp Awk macro, I implemented something which
    helps with that use case: the "kfs" variable (keep field separators).
    This Boolean variable, if true, causes the separating character
    sequences to be retained, and turned into fields.

    $ echo 'one two three' | txr -e '(awk (:set kfs t) (t))'
    one two three

    We can see the field list f instead, printed machine-readably:

    $ echo 'one two three' | txr -e '(awk (:set kfs t) (t (prinl f)))'
    ("" "one" " " "two" " " "three" "")

    There is a leading and trailing empty separator. There is a
    very good reason for that, in that the default strategy in Awk
    is that, for instance " a b c " produces three fields.
    For consistency, if we retain separation, we should always have
    five fields where there would be three.

    My FPAT approach above in the GNU Awk example won't do this correctly;
    more work is needed.

    This kfs variable is not so recent; I implemented it in 2016.

  • From Janis Papanagnou@21:1/5 to Mr. Man-wai Chang on Fri Mar 8 15:46:20 2024
    On 08.03.2024 10:03, Mr. Man-wai Chang wrote:

    The original Awk doesn't support regular expressions, right?

    Where did you get that from? - Awk without regexps makes little sense;
    mind that the basic syntax of Awk programs is described as
    /pattern/ { action }
    What would remain if there's no regexp patterns; string comparisons?

    Because regex was not yet talked about back then??

    Stable Awk (1985) was released 1987. The (initial) old Awk (1977) was
    released 1979. Before that tool we had Sed (1974), and before that we
    had Ed and Grep (1973). My perception is that regexps were there as a
    basic concept of UNIX in all these tools, so why should Awk be exempt?
    According to the authors, Awk was designed to see how Sed and Grep
    could be generalized.

    Janis

  • From Grant Taylor@21:1/5 to Janis Papanagnou on Fri Mar 8 09:12:05 2024
    On 3/8/24 08:46, Janis Papanagnou wrote:
    Awk without regexps makes little sense;

    I think this comes down to what is a regular expression and what is not
    a regular expression.

    mind that the basic syntax of Awk programs is described as
    pattern { action }

    I'm guessing that 40-60% of the awk that I use doesn't use what I would consider to be regular expressions.

    (NF == 5){print $3}
    (NF == 8){print $4}

    Or:

    {total+=$5}
    END{print total}

    I usually think of regular expressions when I'm doing a sub(/re/, ...)
    type thing or a (... ~ /re/) type conditional. More specifically, the
    things between the // in both of those statements are the REs.

    Maybe I have an imprecise understanding / definition.



    --
    Grant. . . .

  • From Janis Papanagnou@21:1/5 to Mr. Man-wai Chang on Sat Mar 9 03:15:00 2024
    On 08.03.2024 16:56, Mr. Man-wai Chang wrote:
    On 8/3/2024 10:46 pm, Janis Papanagnou wrote:

    Stable Awk (1985) was released 1987. The (initial) old Awk (1977) was
    released 1979. Before that tool we had Sed (1974), and before that we
    had Ed and Grep (1973). My perception is that regexps were there as a
    basic concept of UNIX in all these tools, so why should Awk be exempt?
    According to the authors, Awk was designed to see how Sed and Grep could
    be generalized.

    That part of history is beyond me. Sorry... my fault for not doing a check.

    The mistake may stem from a myth (I've heard it before); it may be a
    misinterpretation of the statement that in the first Awk there was no
    match function (which is true, but it refers to the concrete match()
    function, not the abstract operation of a (regexp) pattern match).

    Janis

  • From Janis Papanagnou@21:1/5 to Grant Taylor on Sat Mar 9 03:07:24 2024
    On 08.03.2024 16:12, Grant Taylor wrote:
    On 3/8/24 08:46, Janis Papanagnou wrote:
    Awk without regexps makes little sense;

    I think this comes down to what is a regular expression and what is not
    a regular expression.

    mind that the basic syntax of Awk programs is described as
    pattern { action }

    I'm guessing that 40-60% of the awk that I use doesn't use what I would consider to be regular expressions.
    [...]
    Maybe I have an imprecise understanding / definition.

    Your definition matches the common naming, from which I deliberately
    deviate. (I think "pattern" is an inferior name and "condition" would
    be better, where a 'condition' can also be a regexp that I regularly
    write as '/regexp/' or '/pattern/' in explanations.) So I agree that
    this alone likely doesn't serve well as an explanation for the
    existence of regexps in Awk. The rationale is better seen in the
    statement "Awk was designed to see how Sed and Grep could be
    generalized" that I quoted (not literally, but from the original Awk
    book).

    Janis

  • From Christian Weisgerber@21:1/5 to Janis Papanagnou on Sat Mar 9 12:27:05 2024
    XPost: comp.unix.shell

    On 2024-03-06, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    $ awk '{print $1, "1-1"}' newsrc-news.eternal-september.org-test >
    newsrc-news.eternal-september.org

    In this specific case of regular data you can simplify that to

    awk '$2="1-1"' sourcefile > targetfile

    That had me scratching my head. You can't have an action without
    enclosing braces. But it's still legal syntax because... it's an
    expression serving as a pattern. The assignment itself is a side
    effect.

    Care needs to be taken when using this shortcut so the expression
    doesn't evaluate as false:

    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2=4'
    one 4
    two 4
    three 4
    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2=0'
    $

    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2="4"'
    one 4
    two 4
    three 4
    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2=""'
    $

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de

  • From Julieta Shem@21:1/5 to Christian Weisgerber on Sat Mar 9 11:52:09 2024
    Christian Weisgerber <naddy@mips.inka.de> writes:

    On 2024-03-06, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    $ awk '{print $1, "1-1"}' newsrc-news.eternal-september.org-test >
    newsrc-news.eternal-september.org

    In this specific case of regular data you can simplify that to

    awk '$2="1-1"' sourcefile > targetfile

    That had me scratching my head. You can't have an action without
    enclosing braces. But it's still legal syntax because... it's an
    expression serving as a pattern. The assignment itself is a side
    effect.

    Without braces, the default action takes place, which is ``{print}''.

  • From Kenny McCormack@21:1/5 to jshem@yaxenu.org on Sat Mar 9 15:40:04 2024
    In article <877cibsbja.fsf@yaxenu.org>, Julieta Shem <jshem@yaxenu.org> wrote:
    Christian Weisgerber <naddy@mips.inka.de> writes:

    On 2024-03-06, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    $ awk '{print $1, "1-1"}' newsrc-news.eternal-september.org-test >
    newsrc-news.eternal-september.org

    In this specific case of regular data you can simplify that to

    awk '$2="1-1"' sourcefile > targetfile

    That had me scratching my head. You can't have an action without
    enclosing braces. But it's still legal syntax because... it's an
    expression serving as a pattern. The assignment itself is a side
    effect.

    Without braces, the default action takes place, which is ``{print}''.

    Somehow, I think Christian knows that (since everybody knows that).

    My guess is that he just doesn't like it...

    --
    "Everything Roy (aka, AU8YOG) touches turns to crap."
    --citizens of alt.obituaries--

  • From Janis Papanagnou@21:1/5 to Ed Morton on Sat Mar 9 21:07:00 2024
    On 09.03.2024 17:52, Ed Morton wrote:

    About 20 or so years ago we had a discussion in this NG (which I'm not
    going to search for now) and, shockingly, a consensus was reached that
    we should encourage people to always write:

    '{$2="1-1"} 1'


    I don't recall such a "consensus". If you want to avoid cryptic code
    you'd rather write

    '{$2="1-1"; print}'

    Don't you think?

    And of course add more measures in case the data is not as regular as
    the sample data suggests. (See my other postings what may be defined
    as data, line missing or spurious blanks in the data, comment lines
    or empty lines that have to be preserved, etc.)

    instead of:

    $2="1-1"


    Janis

  • From Janis Papanagnou@21:1/5 to Christian Weisgerber on Sat Mar 9 21:00:02 2024
    On 09.03.2024 13:27, Christian Weisgerber wrote:
    On 2024-03-06, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    $ awk '{print $1, "1-1"}' newsrc-news.eternal-september.org-test >
    newsrc-news.eternal-september.org

    In this specific case of regular data you can simplify that to

    awk '$2="1-1"' sourcefile > targetfile

    That had me scratching my head.

    Part of the joy of programming in Awk. ;-)

    You can't have an action without
    enclosing braces. But it's still legal syntax because...

    it's an expression serving as a pattern.

    This is the key observation!

    Here we have only a condition in the general condition { action } form.

    The assignment itself is a side effect.

    Assignments generally have a side effect, inherently. :-)


    Care needs to be taken when using this shortcut so the expression
    doesn't evaluate as false:

    I've carefully formulated "In this specific case of regular data ..."


    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2=4'
    one 4
    two 4
    three 4
    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2=0'
    $

    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2="4"'
    one 4
    two 4
    three 4
    $ printf 'one 1\ntwo 2\nthree 3\n' | awk '$2=""'
    $


    Other questions on the data may be whether...
    - the article number list may contain spaces
    - the space after the colon is always present
    - blank lines may exist in the file
    - comment lines are possible in the file

    These all will require a more "complex" awk pattern or action, yet
    are still simple to solve. Maybe something like

    BEGIN { FS=":[[:space:]]*" }
    !NF || /^[[:space:]]*#/ || $0=$1": 1-1"
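
    For instance, run against some made-up sample lines (the group name is
    just illustrative):

    $ printf '# a comment\n\ncomp.lang.awk: 1-1056\n' | awk '
        BEGIN { FS=":[[:space:]]*" }
        !NF || /^[[:space:]]*#/ || $0=$1": 1-1"'
    # a comment

    comp.lang.awk: 1-1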


    Janis

  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Sat Mar 9 20:49:43 2024
    On 2024-03-09, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 09.03.2024 17:52, Ed Morton wrote:

    About 20 or so years ago we had a discussion in this NG (which I'm not
    going to search for now) and, shockingly, a consensus was reached that
    we should encourage people to always write:

    '{$2="1-1"} 1'


    I don't recall such a "consensus". If you want to avoid cryptic code
    you'd rather write

    '{$2="1-1"; print}'

    Don't you think?

    I don't remember it either, but it's a no brainer that '$2=expr'
    is incorrect if expr is arbitrary, and the intent is that
    the implicit print is to be unconditionally invoked.

    If expr is a nonblank, nonzero literal term, then the assignment
    is obviously true and '$2=literal' as the entire program is a fine
    idiom.

    I don't agree with putting braces around it and adding 1, or explicit
    print.

    You are not then using Awk like it was meant to be.

    When Awk was conceived, the authors peered into a crystal ball and saw
    Perl. After the laughter died down, they got serious and made sure to
    provide for idioms like:

    awk '!s[$0]++'
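
    That one prints each distinct input line once, in order of first
    appearance:

    $ printf 'a\nb\na\nc\nb\n' | awk '!s[$0]++'
    a
    b
    c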

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Geoff Clare@21:1/5 to Mr. Man-wai Chang on Tue Mar 12 13:47:09 2024
    Mr. Man-wai Chang wrote:

    Do Linux and Unix have a ONE AND ONLY ONE STANDARD regex library?

    It seemed that tools and programming languages have their own
    implementations, let alone different versions among them.

    In the POSIX/UNIX standard the functions used for handling regular
    expressions are regcomp() and regexec() (and regerror() and regfree()).
    They are in the C library, not a separate "regex library".

    They support different RE flavours via flags. The standard requires
    that "basic regular expressions" (default) and "extended regular
    expressions" (with REG_EXTENDED flag) are supported. Implementations
    can support other flavours with non-standard flags.

    POSIX requires that awk uses extended regular expressions (i.e. the
    same as regcomp() with REG_EXTENDED).
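
    A small shell illustration of the difference (grep uses BREs by
    default, while awk always uses EREs):

    $ echo 'ab+c' | grep 'b+'        # BRE: '+' is a literal plus sign
    ab+c
    $ echo 'abbc' | awk '/ab+c/'     # ERE: '+' means "one or more"
    abbc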

    --
    Geoff Clare <netnews@gclare.org.uk>

  • From Aharon Robbins@21:1/5 to netnews@gclare.org.uk on Tue Mar 12 19:00:13 2024
    In article <tv26ck-3qt.ln1@ID-313840.user.individual.net>,
    Geoff Clare <netnews@gclare.org.uk> wrote:
    POSIX requires that awk uses extended regular expressions (i.e. the
    same as regcomp() with REG_EXTENDED).

    There is the additional requirement that \ inside [....] can
    be used to escape characters, so that [abc\]def] is valid in
    awk but not in other uses of REG_EXTENDED.
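
    For instance, with an awk that follows the POSIX rule, a line
    containing a ']' matches that bracket expression:

    $ printf 'x]y\n' | awk '/[abc\]def]/'
    x]y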

  • From Geoff Clare@21:1/5 to Aharon Robbins on Wed Mar 13 13:57:33 2024
    Aharon Robbins wrote:

    In article <tv26ck-3qt.ln1@ID-313840.user.individual.net>,
    Geoff Clare <netnews@gclare.org.uk> wrote:
    POSIX requires that awk uses extended regular expressions (i.e. the
    same as regcomp() with REG_EXTENDED).

    There is the additional requirement that \ inside [....] can
    be used to escape characters,

    Yes, awk effectively has an extra "layer" of backslash escaping
    before the ERE rules kick in, both inside and outside [....].
    I didn't mention this so as not to overload the OP with
    information - he seemed more interested in the different flavours
    of RE than in nitty gritty details like that.

    --
    Geoff Clare <netnews@gclare.org.uk>

  • From Janis Papanagnou@21:1/5 to Ed Morton on Thu Mar 14 03:01:14 2024
    On 12.03.2024 23:49, Ed Morton wrote:
    On 3/9/2024 2:07 PM, Janis Papanagnou wrote:
    On 09.03.2024 17:52, Ed Morton wrote:

    About 20 or so years ago we had a discussion in this NG (which I'm not
    going to search for now) and, shockingly, a consensus was reached that
    we should encourage people to always write:

    '{$2="1-1"} 1'


    I don't recall such a "consensus".

    I do, I have no reason to lie about it, but I can't be bothered
    searching through 20-year-old usenet archives for it (I did take a very
    quick shot at it but I don't even know how to write a good search for
    it - you can't just google "awk '1'" and I'm not even sure if it was
    in comp.lang.awk or comp.unix.shell).

    I didn't say anything about "lying"; why do you insinuate so?

    But your memory may mislead you. (Or mine, or Kaz', of course.)

    (And no, I won't do the search for you, since you have been the
    one contending something here.)

    Without a reference such a statement is just void (and not more
    than a rhetorical move).

    You should at least elaborate on the details and facts of that
    "consensus" - but for the _specific OP context_ (not for made
    up cases).


    If you want to avoid cryptic code you'd rather write

    '{$2="1-1"; print}'

    Don't you think?


    If I'm writing a multi-line script I use an explicit `print`, but it
    just doesn't matter for a tiny one-line script like that.

    Actually, for the given case, the yet better solution is what the
    OP himself said (in CUS, where his question was initially posted):

    Grant Taylor on alt.comp.software.thunderbird suggested [...]:
    $ awk '{print $1, "1-1"}'

    That suggestion doesn't overwrite fields and is conceptually clear.
    It inherently also handles (possible?) cases where there are more
    than two fields in the data (e.g. caused by spurious blanks).

    Everyone using awk
    needs to know the `1` idiom as it's so common and once you've seen it
    once it's not hard to figure out what `{$2="1-1"} 1` does.

    The point is that $2="1-1" as a condition is also an Awk idiom.


    By changing `condition` to `{condition}1` we just add 3 chars to remove
    the guesswork from anyone reading it in future and protect against
    unconsidered values, so we don't just make it less cryptic but also
    less fragile.

    Your examples below are meaningless since you make up cases that have
    nothing to do with the situation here, especially in the context of
    my posting saying clearly: "In this specific case of regular data".

    The more problematic issue is that $2="1-1" and also {$2="1-1"}
    both overwrite fields, and thus the fields get reorganized, which has
    side effects that a newbie coder probably won't expect.

    But YMMV, of course.

    Janis


    For example, lets say someone wants to copy the $1 value into $3 and
    print every line:

    $ printf '1 2 3\n4 5 7\n' | awk '{$3=$1}1'
    1 2 1
    4 5 4

    $ printf '1 2 3\n0 5 7\n' | awk '{$3=$1}1'
    1 2 1
    0 5 0

    $ printf '1 2 3\n4 5 7\n' | awk '$3=$1'
    1 2 1
    4 5 4

    $ printf '1 2 3\n0 5 7\n' | awk '$3=$1'
    1 2 1

    Note the 2nd line is undesirably (because I wrote the requirements)
    missing from that last output.

    It happens ALL the time that people don't consider all possible input
    values, so it's safer to just write code that reflects your intent: if
    you intend for every line to be printed, then write code that will
    print every line.

    Ed.


    And of course add more measures in case the data is not as regular as
    the sample data suggests. (See my other postings what may be defined
    as data, line missing or spurious blanks in the data, comment lines
    or empty lines that have to be preserved, etc.)

    instead of:

    $2="1-1"


    Janis


