• Simplify an AWK pipeline?

    From Robert Mesibov@21:1/5 to All on Wed Aug 16 16:48:57 2023
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple
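
    For reference, the one-liner spelled out with comments (the same
    program, just reformatted for reading):

    # Pass 1 (FNR==NR holds while reading the first "demo"): mask the
    # unique fields and count how often each masked record occurs.
    FNR == NR { $1 = $3 = 1; a[$0]++; next }

    # Pass 2: save the original line, mask it the same way, and print
    # the saved original if its masked form occurred more than once.
    { x = $0; $1 = $3 = 1 }
    a[$0] > 1 { print x }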

    I would like those 2 sets of partial duplicates (the rose-hat-apple set and the pear-hat-apple set) to be sorted alphabetically and separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of coding for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single AWK command, if possible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Robert Mesibov on Thu Aug 17 00:59:20 2023
    On 2023-08-16, Robert Mesibov <robert.mesibov@gmail.com> wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    Like this?

    $ txr group.tl < data
    002 pear bb hat apple
    007 pear gg hat apple

    006 pear ff law tiger

    001 rose aa hat apple
    003 rose cc hat apple

    008 shoe hh cup heron

    004 shoe dd try tiger

    009 worm ii cup heron

    005 worm ee law tiger

    $ cat group.tl
    (flow (get-lines)
      (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
      (each ((group @1))
        (put-lines group)
        (put-line)))

    Here's a dime, kid, ...

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Robert Mesibov on Thu Aug 17 05:38:54 2023
    On 17.08.2023 01:48, Robert Mesibov wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each
    record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in
    those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of coding for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single
    AWK command, if possible.

    You can alternatively do it in a single awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it runs in a single awk instance
    (as you want), and it makes only a single pass. (I think the code
    is also a bit clearer than the one you posted above, but YMMV.)
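
    (A commented rendering of those two rules: note that a[k] always
    begins with RS, so each group prints with a leading blank line,
    which is exactly the separation between sets that you asked for.)

    # Collect whole records under a key built from the non-unique fields;
    # the RS prefix makes every stored group start on a fresh line.
    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }

    # Print only the groups whose key occurred more than once.
    END { for(k in a) if (c[k]>1) print a[k] }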

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Mesibov@21:1/5 to Janis Papanagnou on Thu Aug 17 13:56:47 2023
    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:

    You can alternatively do it in a single awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it runs in a single awk instance
    (as you want), and it makes only a single pass. (I think the code
    is also a bit clearer than the one you posted above, but YMMV.)

    Janis

    Many thanks, Janis, that's very nice, but it depends on specifying the non-unique fields 2, 4 and 5. In the real-world cases I work with, there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID fields (2, 4, 5...300+). That's why I replace
    the unique-ID fields with the arbitrary value "1" when testing for duplication.

    Bob

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Robert Mesibov on Thu Aug 17 23:42:50 2023
    On 17.08.2023 22:56, Robert Mesibov wrote:
    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:

    You can alternatively do it in a single awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it runs in a single awk instance
    (as you want), and it makes only a single pass. (I think the code
    is also a bit clearer than the one you posted above, but YMMV.)

    Janis

    Many thanks, Janis, that's very nice, but it depends on specifying
    the non-unique fields 2, 4 and 5. In the real-world cases I work
    with, there are 1-2 unique ID code fields and sometimes 300+
    non-unique-ID fields (2, 4, 5...300+). That's why I replace the
    unique-ID fields with the arbitrary value "1" when testing for
    duplication.

    That was not apparent from your description. But constructing the
    key explicitly is not mandatory; you can also define it by
    elimination (as in your code). The point is what follows in the
    code after the k=... statement.
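
    For instance, a minimal sketch of that elimination variant (untested;
    it blanks the unique-ID fields with "" rather than setting them to 1):

    { x = $0                      # keep the original record for output
      $1 = $3 = ""; k = $0        # key = record with unique-ID fields blanked
      a[k] = a[k] RS x ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }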

    Janis


    Bob


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Fri Aug 18 00:03:50 2023
    On 17.08.2023 23:27, Kenny McCormack wrote:
    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying the
    non-unique fields 2, 4 and 5. In the real-world cases I work with,
    there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
    fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
    the arbitrary value "1" when testing for duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    Yes, indeed. (See my other post.)


    2) You probably don't need to mess with SUBSEP. Your data seems to
    be OK with assuming no embedded spaces (i.e., so using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data.

    Yes, SUBSEP is the default separation character for array
    subscripts, and of course you can use other characters (that
    require less typing). Why you think that "nobody uses that
    functionality anymore" is beyond me; I doubt you have any evidence
    for that, so I interpret it just as "I [Kenny] don't use it
    anymore", which is fine by me.


    3) I don't see how Janis's solution implements your need for sorting.
    Unless he is using the WHINY_USERS option. Or asort or asorti or
    PROCINFO["sorted_in"] or ...

    Sorting can be meaningful in three different respects here.

    I interpreted the OP as doing the 'sort' just to be able to compare
    each record with the previous one, i.e. to bring the duplicates
    together; this is unnecessary, though, with the approach I used
    with the keys in an associative array. Since the original data is
    also already sorted by a unique numeric key, and I concatenate the
    records sequentially, it's also not necessary to sort the data in
    that respect. So what's left is the third thing that can be sorted,
    and that's the order of the classes: that all, say, "pear" records
    come before all "rose" records. This sort, in case it would be
    desired, is not reflected in my approach.

    Janis



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to robert.mesibov@gmail.com on Thu Aug 17 21:27:41 2023
    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying
    the non-unique fields 2, 4 and 5. In the real-world cases I work
    with, there are 1-2 unique ID code fields and sometimes 300+
    non-unique-ID fields (2, 4, 5...300+). That's why I replace the
    unique-ID fields with the arbitrary value "1" when testing for
    duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    2) You probably don't need to mess with SUBSEP. Your data seems to
    be OK with assuming no embedded spaces (i.e., so using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data. (A
    small illustration follows point 3 below.)

    3) I don't see how Janis's solution implements your need for sorting.
    Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...
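
    To illustrate point 2: SUBSEP is what awk uses under the hood for
    pseudo-multi-dimensional subscripts such as a[i,j]. A small sketch:

    BEGIN {
        a["x", "y"] = 1             # actually stored under "x" SUBSEP "y"
        for (k in a) {
            split(k, part, SUBSEP)  # recover the individual subscripts
            print part[1], part[2]  # prints: x y
        }
    }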

    --
    "Every time Mitt opens his mouth, a swing state gets its wings."

    (Should be on a bumper sticker)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Mesibov@21:1/5 to All on Thu Aug 17 15:36:38 2023
    Apologies for not explaining that there are numerous non-unique-ID fields, and yes, what I am aiming for is a sort beginning with the first non-unique-ID field.

    My code is complicated because I need to preserve the original records for the output, while also "de-uniquifying" the unique-ID fields in a working copy in order to hunt for partial duplicates.

    I'll continue to tinker with this and report back if I can simplify the code, but I would be grateful for any other AWK solutions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Kaz Kylheku on Thu Aug 17 23:00:45 2023
    On 2023-08-17, Kaz Kylheku <864-117-4973@kylheku.com> wrote:

    (flow (get-lines)
      (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
                                                  ^^^^^^^^
    [...]

    This selects the second, fourth and fifth fields and each field after
    the fifth, as the non-unique fields on which to group.

    I inferred the requirement that the complement of the unique fields
    should be used: all fields which are not the unique ones.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Robert Mesibov on Fri Aug 18 01:47:10 2023
    On 18.08.2023 00:36, Robert Mesibov wrote:

    I'll continue to tinker with this and report back if I can simplify
    the code, but I would be grateful for any other AWK solutions.

    For any additional sorting Kenny gave hints (see his point 3) that
    can simply be added if you're using GNU awk.
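
    E.g. with GNU awk the END loop can walk the keys in sorted order; a
    sketch using gawk's PROCINFO["sorted_in"]:

    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk only: ascending string index order
        for (k in a)
            if (c[k] > 1)
                print a[k]
    }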

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Mesibov@21:1/5 to All on Fri Aug 18 01:23:00 2023
    Many thanks again, Janis. I doubt that I can improve on

    awk '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++}; END {for (i in a) if (b[i]>1) print a[i]}' demo

    and the sorting isn't critical.

    Bob

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)