• Simplify an AWK pipeline?

    From Robert Mesibov@21:1/5 to All on Wed Aug 16 16:48:57 2023
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple
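
    For reference, the one-liner spelled out with comments (the same
    program, just reformatted for reading):

    # Pass 1 (FNR==NR holds while reading the first "demo"): mask the
    # unique fields and count how often each masked record occurs.
    FNR == NR { $1 = $3 = 1; a[$0]++; next }

    # Pass 2: save the original line, mask it the same way, and print
    # the saved original if its masked form occurred more than once.
    { x = $0; $1 = $3 = 1 }
    a[$0] > 1 { print x }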

    I would like those 2 sets of partial duplicates (the rose-hat-apple set and the pear-hat-apple set) to be sorted alphabetically and separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of coding for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single AWK command, if possible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Robert Mesibov on Thu Aug 17 00:59:20 2023
    On 2023-08-16, Robert Mesibov <robert.mesibov@gmail.com> wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    Like this?

    $ txr group.tl < data
    002 pear bb hat apple
    007 pear gg hat apple

    006 pear ff law tiger

    001 rose aa hat apple
    003 rose cc hat apple

    008 shoe hh cup heron

    004 shoe dd try tiger

    009 worm ii cup heron

    005 worm ee law tiger

    $ cat group.tl
    (flow (get-lines)
      (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
      (each ((group @1))
        (put-lines group)
        (put-line)))

    Here's a dime, kid, ...

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Robert Mesibov on Thu Aug 17 05:38:54 2023
    On 17.08.2023 01:48, Robert Mesibov wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each
    record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in
    those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of coding for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single
    AWK command, if possible.

    You can alternatively do it in a single awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it runs in a single awk instance
    (as you want), and it makes only a single pass. (I think the code
    is also a bit clearer than the one you posted above, but YMMV.)
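
    (A commented rendering of those two rules: note that a[k] always
    begins with RS, so each group prints with a leading blank line,
    which is exactly the separation between sets that you asked for.)

    # Collect whole records under a key built from the non-unique fields;
    # the RS prefix makes every stored group start on a fresh line.
    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }

    # Print only the groups whose key occurred more than once.
    END { for(k in a) if (c[k]>1) print a[k] }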

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Mesibov@21:1/5 to Janis Papanagnou on Thu Aug 17 13:56:47 2023
    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:

    You can alternatively do it in a single awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it runs in a single awk instance
    (as you want), and it makes only a single pass. (I think the code
    is also a bit clearer than the one you posted above, but YMMV.)

    Janis

    Many thanks, Janis, that's very nice, but it depends on specifying the non-unique fields 2, 4 and 5. In the real-world cases I work with, there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID fields (2, 4, 5...300+). That's why I replace
    the unique-ID fields with the arbitrary value "1" when testing for duplication.

    Bob

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Robert Mesibov on Thu Aug 17 23:42:50 2023
    On 17.08.2023 22:56, Robert Mesibov wrote:
    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:

    You can alternatively do it in a single awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it runs in a single awk instance
    (as you want), and it makes only a single pass. (I think the code
    is also a bit clearer than the one you posted above, but YMMV.)

    Janis

    Many thanks, Janis, that's very nice, but it depends on specifying
    the non-unique fields 2, 4 and 5. In the real-world cases I work
    with, there are 1-2 unique ID code fields and sometimes 300+
    non-unique-ID fields (2, 4, 5...300+). That's why I replace the
    unique-ID fields with the arbitrary value "1" when testing for
    duplication.

    That was not apparent from your description. But constructing the
    key explicitly is not mandatory; you can also define it by
    elimination (as in your code). The point is what follows in the
    code after the k=... statement.
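
    For instance, a minimal sketch of that elimination variant (untested;
    it blanks the unique-ID fields with "" rather than setting them to 1):

    { x = $0                      # keep the original record for output
      $1 = $3 = ""; k = $0        # key = record with unique-ID fields blanked
      a[k] = a[k] RS x ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }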

    Janis


    Bob


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Fri Aug 18 00:03:50 2023
    On 17.08.2023 23:27, Kenny McCormack wrote:
    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying the
    non-unique fields 2, 4 and 5. In the real-world cases I work with,
    there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
    fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
    the arbitrary value "1" when testing for duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    Yes, indeed. (See my other post.)


    2) You probably don't need to mess with SUBSEP. Your data seems to
    be OK with assuming no embedded spaces (i.e., so using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data.

    Yes, SUBSEP is the default separation character for array
    subscripts, and of course you can use other characters (that
    require less typing). Why you think that "nobody uses that
    functionality anymore" is beyond me; I doubt you have any evidence
    for that, so I interpret it just as "I [Kenny] don't use it
    anymore", which is fine by me.


    3) I don't see how Janis's solution implements your need for sorting.
    Unless he is using the WHINY_USERS option. Or asort or asorti or
    PROCINFO["sorted_in"] or ...

    Sorting can be meaningful in three different respects here.

    I interpreted the OP as doing the 'sort' just to be able to compare
    each record with the previous one, i.e. to bring the duplicates
    together; this is unnecessary, though, with the approach I used
    with the keys in an associative array. Since the original data is
    also already sorted by a unique numeric key, and I concatenate the
    records sequentially, it's also not necessary to sort the data in
    that respect. So what's left is the third thing that can be sorted,
    and that's the order of the classes: that all, say, "pear" records
    come before all "rose" records. This sort, in case it would be
    desired, is not reflected in my approach.

    Janis



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to robert.mesibov@gmail.com on Thu Aug 17 21:27:41 2023
    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying
    the non-unique fields 2, 4 and 5. In the real-world cases I work
    with, there are 1-2 unique ID code fields and sometimes 300+
    non-unique-ID fields (2, 4, 5...300+). That's why I replace the
    unique-ID fields with the arbitrary value "1" when testing for
    duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    2) You probably don't need to mess with SUBSEP. Your data seems to
    be OK with assuming no embedded spaces (i.e., so using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data. (A
    small illustration follows point 3 below.)

    3) I don't see how Janis's solution implements your need for sorting.
    Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...
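
    To illustrate point 2: SUBSEP is what awk uses under the hood for
    pseudo-multi-dimensional subscripts such as a[i,j]. A small sketch:

    BEGIN {
        a["x", "y"] = 1             # actually stored under "x" SUBSEP "y"
        for (k in a) {
            split(k, part, SUBSEP)  # recover the individual subscripts
            print part[1], part[2]  # prints: x y
        }
    }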

    --
    "Every time Mitt opens his mouth, a swing state gets its wings."

    (Should be on a bumper sticker)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Mesibov@21:1/5 to All on Thu Aug 17 15:36:38 2023
    Apologies for not explaining that there are numerous non-unique-ID fields, and yes, what I am aiming for is a sort beginning with the first non-unique-ID field.

    My code is complicated because I need to preserve the original records for the output, while also "de-uniquifying" the unique-ID fields in a working copy in order to hunt for partial duplicates.

    I'll continue to tinker with this and report back if I can simplify the code, but I would be grateful for any other AWK solutions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Kaz Kylheku on Thu Aug 17 23:00:45 2023
    On 2023-08-17, Kaz Kylheku <864-117-4973@kylheku.com> wrote:

    (flow (get-lines)
      (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
                                                  ^^^^^^^^
    [...]

    This selects the second, fourth and fifth fields and each field after
    the fifth, as the non-unique fields on which to group.

    I inferred the requirement that the complement of the unique fields
    should be used: all fields which are not the unique ones.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Robert Mesibov on Fri Aug 18 01:47:10 2023
    On 18.08.2023 00:36, Robert Mesibov wrote:

    I'll continue to tinker with this and report back if I can simplify
    the code, but I would be grateful for any other AWK solutions.

    For any additional sorting Kenny gave hints (see his point 3) that
    can simply be added if you're using GNU awk.
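
    E.g. with GNU awk the END loop can walk the keys in sorted order; a
    sketch using gawk's PROCINFO["sorted_in"]:

    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk only: ascending string index order
        for (k in a)
            if (c[k] > 1)
                print a[k]
    }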

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Mesibov@21:1/5 to All on Fri Aug 18 01:23:00 2023
    Many thanks again, Janis. I doubt that I can improve on

    awk '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++}; END {for (i in a) if (b[i]>1) print a[i]}' demo

    and the sorting isn't critical.

    Bob

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)