I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:
awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo
Here the first pass overwrites the two unique code fields with the
constant 1 and counts each normalized record (e.g. "001 rose aa hat apple"
becomes "1 rose 1 hat apple", so it collides with "003 rose cc hat apple");
the second pass saves each original record in x, normalizes $0 the same
way, and prints x whenever its normalized form was counted more than
once. This returns
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
007 pear gg hat apple
I would like those 2 sets of partial duplicates (the rose-hat-apple
set and the pear-hat-apple set) to be sorted alphabetically and
separated, like this:
002 pear bb hat apple
007 pear gg hat apple
001 rose aa hat apple
003 rose cc hat apple
I can do that by piping the first AWK command's output to
sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'
but this seems like a lot of code for the result (the second awk just
compares each record's normalized form with the previous one's and
prints a blank line whenever the group key changes). I'd be grateful
for suggestions on how to get the sorted, separated result in a single
AWK command, if possible.
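One possible direction is to build the groups in a single pass and sort
the qualifying keys inside END, so no external sort is needed. A minimal
sketch (rec and keys are just scratch names; fields 1 and 3 are blanked
to 1 as above):

awk '
{ rec = $0; $1 = $3 = 1; a[$0] = a[$0] RS rec; c[$0]++ }
END {
    n = 0
    for (k in a) if (c[k] > 1) keys[++n] = k
    for (i = 2; i <= n; i++) {              # insertion sort: portable awk,
        t = keys[i]                         # fine for a modest number of keys
        for (j = i - 1; j >= 1 && keys[j] > t; j--) keys[j+1] = keys[j]
        keys[j+1] = t
    }
    for (i = 1; i <= n; i++)
        print a[keys[i]]                    # each a[k] starts with RS, so a
}' demo                                     # blank line precedes each group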
You can alternatively do it in one awk instance, like this...
{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
END { for(k in a) if (c[k]>1) print a[k] }
which is not much shorter character-wise, but it doesn't need the
external sort command, it is all in one awk instance (as you want),
and it makes a single pass. (I think the code is also a bit clearer
than the one you posted above, but YMMV.)
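Run on "demo" it prints, for example (the group order of for (k in a)
is unspecified; the leading RS stored in each a[k] is what yields the
blank line before each group):

001 rose aa hat apple
003 rose cc hat apple

002 pear bb hat apple
007 pear gg hat apple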
Janis
On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:
[...]
Many thanks, Janis, that's very nice, but it depends on specifying
the non-unique fields 2, 4 and 5. In the real-world cases I work
with, there are 1-2 unique ID code fields and sometimes 300+
non-unique-ID fields (2, 4, 5...300+). That's why I replace the
unique-ID fields with the arbitrary value "1" when testing for
duplication.
Bob
In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
Robert Mesibov <robert.mesibov@gmail.com> wrote:
...
>Many thanks, Janis, that's very nice, but it depends on specifying the
>non-unique fields 2, 4 and 5. In the real-world cases I work with,
>there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
>fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
>the arbitrary value "1" when testing for duplication.
1) Well, it seems like it shouldn't be too hard for you to retrofit your
hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set
them to "" instead of 1.
2) You probably don't need to mess with SUBSEP. Your data seems to be OK
with assuming no embedded spaces (i.e., using space as the delimiter is OK).
Note that SUBSEP is intended to be used as the delimiter for the implementation of old-fashioned pseudo-multi-dimensional arrays in AWK, but nobody uses that functionality anymore. Therefore, some AWK programmers
have co-opted SUBSEP as a symbol provided by the language to represent a character that is more-or-less guaranteed to never occur in user data.
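For illustration, the old pseudo-multi-dimensional syntax that SUBSEP
underlies (a minimal sketch; the default SUBSEP is the control
character "\034"):

BEGIN {
    x["a", "b"] = 1                      # really x["a" SUBSEP "b"]
    if (("a", "b") in x) print "found"   # same key, joined with SUBSEP
    for (k in x) {
        split(k, part, SUBSEP)
        print part[1], part[2]           # prints: a b
    }
}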
3) I don't see how Janis's solution implements your need for sorting.
Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...
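For instance, with GNU awk (4.0+) the blanked-key grouping can be
combined with PROCINFO["sorted_in"], which makes for (k in a) visit
indices in a chosen order. A sketch:

{ rec = $0; $1 = $3 = ""; a[$0] = a[$0] RS rec; c[$0]++ }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk-only: ascending string order
    for (k in a) if (c[k] > 1) print a[k]
}

On "demo" this prints the pear group and then the rose group, each
preceded by a blank separator line; records within a group stay in
input order.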
>Many thanks, Janis, that's very nice, but it depends on specifying the
>non-unique fields 2, 4 and 5. In the real-world cases I work with,
>there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
>fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
>the arbitrary value "1" when testing for duplication.
(flow (get-lines)
(sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
[...]
I'll continue to tinker with this and report back if I can simplify
the code, but I would be grateful for any other AWK solutions.