I'm thinking of writing a simple command-line program that removes duplicates in a given folder. I mean you would copy the program into a given folder, run it, and all duplicates and multiplicates (where a duplicate means a file with a different name but exactly the same binary size and byte content) would be removed, leaving only one file from each multiplicate set.
This should work for a big pile of files. I need it because, for example, I once recovered an HDD, and since I got multiple copies of files on that disc, the recovered files are generally multiplicated and consume a lot of disk space.
So is there some approach I should take to make this process faster? I would probably read the list of files and their sizes in the current directory, then either sort it or just go through the list, and whenever I find an exact size match, read the files into RAM and compare them byte by byte. I'm not sure whether to do the sorting, as I need to write this quickly, and maybe sorting complicates things a bit without giving much. Some thoughts?
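A minimal sketch of that byte-by-byte comparison step in C, reading both files in fixed-size chunks rather than loading whole files into RAM (the function name and buffer size are my own choices, not taken from any existing program; the caller is expected to have already matched the sizes):

#include <stdio.h>
#include <string.h>

/* Returns 1 if both files can be opened and have identical byte content, 0 otherwise. */
int files_have_same_bytes(const char* path_a, const char* path_b)
{
    FILE* fa = fopen(path_a, "rb");
    FILE* fb = fopen(path_b, "rb");
    int same = (fa != NULL && fb != NULL);
    while (same) {
        unsigned char buf_a[65536], buf_b[65536];
        size_t na = fread(buf_a, 1, sizeof buf_a, fa);
        size_t nb = fread(buf_b, 1, sizeof buf_b, fb);
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) { same = 0; break; }
        if (na == 0) break;   /* both files ended at the same point: identical */
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return same;
}

Because sizes are checked first, this function only ever runs on files that could actually be equal.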
fir wrote:
I'm thinking of writing a simple command-line program that removes duplicates in a given folder [...]
Curiously, I could add that I once searched for a program to remove duplicates, but the ones I found did not look good... so such a command-line program (or in fact a program with no command line at all, since maybe I don't even want to add command-line options) is quite practically needed.
fir wrote:
[...]
Assuming I have code to read the list of filenames in a given directory (which I found), what do you suggest I add to remove such duplicates? Here is the code that reads those filenames into a list (tested to work, but not tested for being 100% error-free):
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   /* realloc, exit */

void StrCopyMaxNBytes(char* dest, const char* src, int n)
{
    for (int i = 0; i < n; i++) { dest[i] = src[i]; if (!src[i]) break; }
}

/* list of file names */
#define FileNameListEntry_name_max 500
typedef struct FileNameListEntry { char name[FileNameListEntry_name_max]; } FileNameListEntry;
FileNameListEntry* FileNameList = NULL;
int FileNameList_Size = 0;

void FileNameList_AddOne(const char* name)
{
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList,
                        FileNameList_Size * sizeof(FileNameListEntry));
    StrCopyMaxNBytes(FileNameList[FileNameList_Size - 1].name,
                     name, FileNameListEntry_name_max);
}

/* collect list of filenames */
WIN32_FIND_DATA ffd;

void ReadDirectoryFileNamesToList(const char* dir)
{
    HANDLE h = FindFirstFile(dir, &ffd);
    if (h == INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1); }
    do {
        if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
            FileNameList_AddOne(ffd.cFileName);
    } while (FindNextFile(h, &ffd));
    FindClose(h);
}

int main()
{
    ReadDirectoryFileNamesToList("*");
    for (int i = 0; i < FileNameList_Size; i++)
        printf("\n %d %s", i, FileNameList[i].name);
    return 0;
}
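To answer the "what should I add" question, here is one possible next step on top of the listing code above, sketched under some assumptions (the struct gains size and removed fields filled during the directory scan, files_have_same_bytes is the chunked comparison sketched earlier in the thread, and the Win32 DeleteFile call does the actual removal):

/* Sketch only: assumes FileNameListEntry has been extended with
     unsigned long long size;  // set during the scan:
                               // size = ((unsigned long long)ffd.nFileSizeHigh << 32) | ffd.nFileSizeLow;
     int removed;              // initialised to 0
   and that files_have_same_bytes() is the byte-by-byte check from earlier. */
void RemoveDuplicates(void)
{
    for (int i = 0; i < FileNameList_Size; i++) {
        if (FileNameList[i].removed) continue;
        for (int j = i + 1; j < FileNameList_Size; j++) {
            if (FileNameList[j].removed) continue;
            if (FileNameList[i].size != FileNameList[j].size) continue;   /* cheap size filter first */
            if (files_have_same_bytes(FileNameList[i].name, FileNameList[j].name)
                && DeleteFile(FileNameList[j].name))                      /* keep i, drop j */
                FileNameList[j].removed = 1;
        }
    }
}

The outer loop is still the half-square scan discussed later in the thread, but the size test discards most pairs before any file is opened.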
On 9/21/2024 11:53 AM, fir wrote:
[...]
I'm thinking of writing a simple command-line program that removes duplicates in a given folder
Not sure if this will help you or not... ;^o
Fwiw, I have to sort and remove duplicates in this experimental locking system that I called the multex. Here is the C++ code I used to do it. I sort and then remove any duplicates, so say a thread's local lock set was:
31, 59, 69, 31, 4, 1, 1, 5
would become:
1, 4, 5, 31, 59, 69
this ensures no deadlocks. As for the algorithm for removing duplicates, well, there is more than one. Actually, I don't know which one my C++ impl is using right now.
https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/Ti8LFyH4CgAJ
// Deadlock free baby!
void ensure_locking_order()
{
    // sort and remove duplicates
    std::sort(m_lock_idxs.begin(), m_lock_idxs.end());
    m_lock_idxs.erase(std::unique(m_lock_idxs.begin(), m_lock_idxs.end()),
                      m_lock_idxs.end());
}
Using the std C++ template lib.
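For what it's worth, a sketch of the same sort-then-unique idea in plain C (qsort from <stdlib.h> plus a hand-written unique pass), since the same pattern shows up later in the thread when grouping files by size or checksum:

#include <stdlib.h>

static int cmp_int(const void* a, const void* b)
{
    int x = *(const int*)a, y = *(const int*)b;
    return (x > y) - (x < y);
}

/* Sorts arr[0..n-1] and removes adjacent duplicates; returns the new length. */
size_t sort_unique(int* arr, size_t n)
{
    if (n == 0) return 0;
    qsort(arr, n, sizeof arr[0], cmp_int);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (arr[i] != arr[out - 1])
            arr[out++] = arr[i];
    return out;
}

With the example set {31, 59, 69, 31, 4, 1, 1, 5} this returns 6 and leaves {1, 4, 5, 31, 59, 69} at the front of the array.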
I'm thinking of writing a simple command-line program that removes duplicates in a given folder [...]
... you just need to read all the files in the folder and compare each one byte by byte to the other files in the folder of the same size.
On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:
... you just need to read all the files in the folder and compare each one byte by byte to the other files in the folder of the same size.
For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons. That’s an O(N²) algorithm.
There is a faster way.
Lawrence D'Oliveiro wrote:
For N files, that requires N × (N − 1) ÷ 2 byte-by-byte comparisons. That's an O(N²) algorithm. There is a faster way.
Not quite: most files have different sizes, so most of the binary comparisons are discarded because the sizes differ (and I read those sizes linearly while building the list of filenames).
What I posted seems to work OK. It doesn't work fast, but it's hard to say whether it can be optimised or whether it simply takes as long as it should.
Paul wrote:
The normal way to do this is to do a hash check on the files and compare the hashes. You can use MD5SUM, SHA1SUM, or SHA256SUM as a means to compare two files. If you want to be picky about it, stick with SHA256SUM.
The code I posted works OK, and if someone has Windows and MinGW/TDM they can compile it and check the application if they want.
Hashing is not necessary in my opinion, though it could probably speed things up. I'm not strongly convinced that the probability of a mistake in this hashing is strictly zero (as I have never used it and would probably need to write my own hashing)... it is probably mathematically proven to be almost zero, but for now I'm more interested in whether the code I posted is OK.
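If the worry is that a hash mistake could cause a wrong deletion, the hash can be used purely as a prefilter, with the final decision still made byte by byte; then a collision can only cost a wasted comparison, never an incorrect removal. A sketch using 64-bit FNV-1a (a simple non-cryptographic hash chosen here only for brevity; Paul's SHA-256 suggestion is the stronger choice in practice), with files_have_same_bytes being the byte-by-byte check sketched earlier:

#include <stdio.h>
#include <stdint.h>

/* 64-bit FNV-1a over a whole file; returns 0 if the file cannot be opened
   (callers can treat 0 as "unknown" and fall back to a direct comparison). */
uint64_t file_fnv1a(const char* path)
{
    FILE* f = fopen(path, "rb");
    if (!f) return 0;
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV offset basis */
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;             /* FNV prime */
        }
    fclose(f);
    return h;
}

/* Compute each file's hash once and store it next to its size; only pairs with
   equal (size, hash) go to files_have_same_bytes() for the final byte-by-byte check. */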
On Sat, 9/21/2024 10:36 PM, fir wrote:
[...]
The normal way to do this is to do a hash check on the files and compare the hashes. You can use MD5SUM, SHA1SUM, or SHA256SUM as a means to compare two files. If you want to be picky about it, stick with SHA256SUM.
hashdeep64 -c MD5 -j 1 -r H: > H_sums.txt # Took about two minutes to run this on an SSD
# On a hard drive, use -j 1. For an SSD, use a higher thread count for -j.
Size MD5SUM Path
Same size, same hash value. The size is zero. The MD5SUM in this case is always the same (the initialization value of MD5).
0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\AadConfigurations\AadConfiguration.lock
0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\V2AccountStore.lock
Same size, different hash value. These are not the same file.
65536, a8113cfdf0227ddf1c25367ecccc894b, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\5213954f4433d4fbe45ed37ffc67d43fc43b54584bfd3a8d.bin
65536, 5e91acf90e90be408b6549e11865009d, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\bf7b3ea78a361dc533a9344051255c035491d960f2bc7f31.bin
You can use the "sort" command to sort by the first and second fields if you want. Sorting the output lines places identical files next to one another in the output.
The output of data recovery software is full of "fragments". Using the "file" command (a Linux command; a Windows port is available) allows you to ignore files which have no value (listed as "data"). Recognizable files will be listed as "PNG", "JPG", and so on.
A utility such as PhotoRec can attempt to glue files back together. Your mileage may vary. That is a scan-based file recovery method; I have not used it.
https://en.wikipedia.org/wiki/PhotoRec
Paul
On 22/09/2024 11:24, fir wrote:
[...]
I was going to post similar ideas (doing a linear pass working out
checksums for each file, sorting the list by checksum and size, then candidates for a byte-by-byte comparison, if you want to do that, will
be grouped together).
But if you're going to reject everyone's suggestions in favour of your
own already working solution, then I wonder why you bothered posting.
(I didn't post after all because I knew it would be futile.)
Bart wrote:
But if you're going to reject everyone's suggestions in favour of your own already working solution, then I wonder why you bothered posting.
I want to discuss, not to do everything that is mentioned... is that hard to understand? So I may read up on the options, but I literally have no time to implement even the good ideas. This program I wrote has been shown to work and I'm now using it.
fir wrote:
[...]
Also, note that I posted a whole working program while some others just say what could be done... working code was my main goal, not so much a contest over what is fastest (that is also an interesting topic, but not the main goal).
Bart wrote:
[...]
Yet to say something about efficiency: from observing how it works, this program is quadratic in the sense that it has a half-square loop over the directory file list, so for 20k files that may be something like 20k × 20k / 2 − 20k comparisons. But it mostly compares only sizes, so I'm not sure how serious that kind of quadratic behaviour is... are 200M int comparisons a problem? Maybe they become one for larger sets.
In terms of real binary comparisons it is not fully quadratic; it is like a set of smaller squares on the diagonal of this large square, if you know what I mean. And that may be a problem: if among those 20k files 100 have the same size, it makes about 100 × 100 full loads and 100 × 100 full byte-by-byte binary compares, which is practically the full square if there are indeed 100 duplicates (maybe less than 100 × 100, since on the first finding of a duplicate I mark it as a duplicate and then skip it in the loop).
But in practice it does show that for folders bigger than about 3k files it slows down, probably disproportionately, so optimisation is needed for large folders. That's from observing it.
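Since the slowdown comes from the half-square outer loop, one modest change, sketched under the same assumptions as before (entries carry name, size, and removed, files_have_same_bytes does the byte-by-byte check), is to qsort the list by size first: equal-size files then sit next to each other, so only those short runs need loading and comparing, and the outer scan drops from roughly N²/2 size checks to an N log N sort.

static int cmp_by_size(const void* a, const void* b)
{
    const FileNameListEntry* x = (const FileNameListEntry*)a;
    const FileNameListEntry* y = (const FileNameListEntry*)b;
    return (x->size > y->size) - (x->size < y->size);
}

void RemoveDuplicatesSorted(void)
{
    /* qsort is from <stdlib.h>, already included in the program above */
    qsort(FileNameList, FileNameList_Size, sizeof FileNameList[0], cmp_by_size);
    for (int i = 0; i < FileNameList_Size; ) {
        int j = i + 1;
        while (j < FileNameList_Size && FileNameList[j].size == FileNameList[i].size)
            j++;                                   /* [i, j) is one run of equal sizes */
        for (int a = i; a < j; a++) {
            if (FileNameList[a].removed) continue;
            for (int b = a + 1; b < j; b++)
                if (!FileNameList[b].removed &&
                    files_have_same_bytes(FileNameList[a].name, FileNameList[b].name) &&
                    DeleteFile(FileNameList[b].name))
                    FileNameList[b].removed = 1;
        }
        i = j;
    }
}

The "small squares on the diagonal" (the runs of equal-size files) are still compared pairwise, but everything outside them costs only the sort.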
fir wrote:
[...]
But as I said, I mainly wanted this done to reclaim some space from these recovered, somewhat junk files... and having it work, even in this partially quadratic way, is more important than having it optimised.
It works, and if I see it slow down on large folders, I can split those big folders into several of about 3k files each and run this duplicate remover in each one. More hand work, but it can be done by hand.
I'm thinking of writing a simple command-line program that removes duplicates in a given folder [...]
I have had the same problem. My solution was to use extended file attributes and a file checksum, e.g. sha512sum; also, I wrote this in Perl (see code below). Using the file attributes, I can re-run the