• program to remove duplicates

    From fir@21:1/5 to All on Sat Sep 21 20:53:47 2024
i think of writing a simple command-line program
that removes duplicates in a given folder

i mean one would copy the program into a given folder,
run it, and all duplicates and multiplicates (where a
duplicate means a file with a different name but the
exact same binary size and byte content) get removed,
leaving only one file of each multiplicate set

this should work for a big batch of files -
i need it because, for example, i once recovered a hdd disk,
and as i got some copies of files on that disc,
the recovered files are generally multiplicated
and consume a lot of disk space

so is there some approach i should take to make this
process faster?

probably i would need to read the list of files and sizes in
the current directory, then sort or go through the list, and when an
exact size match is found read the files into ram and compare them byte by byte

i'm not sure whether to do the sorting, as i need to write it quickly
and maybe sorting will complicate things a bit without gaining much

some thoughts?
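
A minimal sketch of that byte-by-byte step, comparing two presumed-equal-size files in fixed-size chunks with plain stdio so neither file has to be loaded into ram whole (the function name and buffer size are only illustrative, not from the thread):

#include <stdio.h>
#include <string.h>

/* compare two files of (presumably) equal size in 64 KiB chunks;
   returns 1 if identical, 0 if they differ or either cannot be read */
static int files_equal(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int equal = (fa && fb);

    while (equal) {
        unsigned char buf_a[65536], buf_b[65536];
        size_t na = fread(buf_a, 1, sizeof buf_a, fa);
        size_t nb = fread(buf_b, 1, sizeof buf_b, fb);
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) { equal = 0; break; }
        if (na == 0) break;   /* both files hit EOF while still equal */
    }

    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return equal;
}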

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to fir on Sat Sep 21 20:56:18 2024
    fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
[...]

curiously, i could add that i once searched for a program to remove duplicates,
but they didn't look good.. so such a command-line
(or in fact command-line-less, as i don't even want to add command-line
options maybe) program is quite practically needed

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to fir on Sat Sep 21 21:27:08 2024
    fir wrote:
    fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
[...]


assuming i got code to read the list of filenames in a given directory
(which i found), what do you suggest i should add to remove such duplicates?
- here's the code to read those filenames into a list
(tested to work, but not tested for being 100% error-free)

#include<windows.h>
#include<stdio.h>
#include<stdlib.h>

    void StrCopyMaxNBytes(char* dest, char* src, int n)
    {
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    }

//list of file names
enum { FileNameListEntry_name_max = 500 };
typedef struct { char name[FileNameListEntry_name_max]; } FileNameListEntry;

    FileNameListEntry* FileNameList = NULL;
    int FileNameList_Size = 0;

    void FileNameList_AddOne(char* name)
    {
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList, FileNameList_Size * sizeof(FileNameListEntry) );
    StrCopyMaxNBytes((char*)&FileNameList[FileNameList_Size-1].name,
    name, FileNameListEntry_name_max);
    return ;
    }


    // collect list of filenames
    WIN32_FIND_DATA ffd;

    void ReadDIrectoryFileNamesToList(char* dir)
    {
    HANDLE h = FindFirstFile(dir, &ffd);

if(h==INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1);}

    do {
    if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
    FileNameList_AddOne(ffd.cFileName);
    }
while (FindNextFile(h, &ffd));

FindClose(h);
}



    int main()
    {

    ReadDIrectoryFileNamesToList("*");

    for(int i=0; i< FileNameList_Size; i++)
    printf("\n %d %s", i, FileNameList[i].name );

    return 'ok';
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to fir on Sat Sep 21 22:12:04 2024
    fir wrote:
    fir wrote:
    fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
[...]

ok i sketched some code, only i don't know how to move a given file
(given by filename) to some subfolder.. is there such a function in c?
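
For moving a file, the standard C rename() works within the same volume, and on Windows _mkdir() from <direct.h> creates the subfolder; a minimal sketch (the folder name and error handling are only illustrative):

#include <stdio.h>
#include <direct.h>   /* _mkdir on Windows */
#include <errno.h>

/* move "name" from the current directory into a "duplicates" subfolder */
static int move_to_duplicates(const char *name)
{
    char dest[1000];

    /* create the subfolder if it is not there yet */
    if (_mkdir("duplicates") != 0 && errno != EEXIST)
        return -1;

    snprintf(dest, sizeof dest, "duplicates\\%s", name);
    return rename(name, dest);   /* 0 on success */
}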

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to All on Sat Sep 21 23:13:50 2024
ok i wrote this duplicates remover, but i don't know whether it's free of errors etc

here's the code; you may comment if you see some errors, alternatives or
improvements (note i wrote it in the time between when i first posted and this
moment here, so it's kind of a speedy draft (i reused old routines for loading
files etc))

#include<windows.h>
#include<stdio.h>
#include<stdlib.h>

    void StrCopyMaxNBytes(char* dest, char* src, int n)
    {
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    }

//list of file names
enum { FileNameListEntry_name_max = 500 };
typedef struct { char name[FileNameListEntry_name_max];
unsigned int file_size; } FileNameListEntry;

    FileNameListEntry* FileNameList = NULL;
    int FileNameList_Size = 0;

    void FileNameList_AddOne(char* name, unsigned int file_size)
    {
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList, FileNameList_Size * sizeof(FileNameListEntry) );
    StrCopyMaxNBytes((char*)&FileNameList[FileNameList_Size-1].name,
    name, FileNameListEntry_name_max);
    FileNameList[FileNameList_Size-1].file_size = file_size;
    return ;
    }


    // collect list of filenames
    WIN32_FIND_DATA ffd;

    void ReadDIrectoryFileNamesToList(char* dir)
    {
    HANDLE h = FindFirstFile(dir, &ffd);

if(h==INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1);}

    do {
    if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
    {
    FileNameList_AddOne(ffd.cFileName, ffd.nFileSizeLow);
if(ffd.nFileSizeHigh!=0) { printf("this program only works for files up to 4GB"); exit(-1);}
    }
    }
while (FindNextFile(h, &ffd));

FindClose(h);
}

    #include <sys/stat.h>

    int GetFileSize2(char *filename)
    {
    struct stat st;
    if (stat(filename, &st)==0) return (int) st.st_size;

    printf("error obtaining file size for %s", filename); exit(-1);
    return -1;
    }

    int FolderExist(char *name)
    {
    static struct stat st;
    if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
    return 0;
    }


    //////////

    unsigned char* bytes2 = NULL;
    int bytes2_size = 0;
    int bytes2_allocked = 0;

    unsigned char* bytes2_resize(int size)
    {
    bytes2_size=size;
    if((bytes2_size+100)*2<bytes2_allocked | bytes2_size>bytes2_allocked)
    return bytes2=(unsigned char*)realloc(bytes2, (bytes2_allocked=(bytes2_size+100)*2)*sizeof(unsigned char));
    }

    void bytes2_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes2_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////////


    unsigned char* bytes1 = NULL;
    int bytes1_size = 0;
    int bytes1_allocked = 0;

    unsigned char* bytes1_resize(int size)
    {
    bytes1_size=size;
    if((bytes1_size+100)*2<bytes1_allocked | bytes1_size>bytes1_allocked)
    return bytes1=(unsigned char*)realloc(bytes1, (bytes1_allocked=(bytes1_size+100)*2)*sizeof(unsigned char));
    }

    void bytes1_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////



    int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
    {
    bytes1_load(file_a);
    bytes2_load(file_b);
if(bytes1_size!=bytes2_size) { printf("\n something is wrong - compared files are assumed to be the same size"); exit(-1); }

    for(unsigned int i=0; i<=bytes1_size;i++)
    if(bytes1[i]!=bytes2[i]) return 0;

    return 1;

    }

    #include<direct.h>
    #include <dirent.h>
    #include <errno.h>

    int duplicates_moved = 0;
    void MoveDuplicateToSubdirectory(char*name)
    {

    if(!FolderExist("duplicates"))
    {
    int n = _mkdir("duplicates");
    if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
    }

    static char renamed[1000];
    int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

    if(rename(name, renamed))
    {printf("\n rename %s %s failed", name, renamed); exit(-1);}

    duplicates_moved++;

    }

    int main()
    {
    printf("\n (RE)MOVE FILE DUPLICATES");
    printf("\n ");

    printf("\n this program searches for binaric (comparec byute to
    byte)");
    printf("\n duplicates/multiplicates of files in its own");
    printf("\n folder (no search in subdirectories, just flat)");
    printf("\n and if found it copies it into 'duplicates'");
    printf("\n subfolder it creates If you want to remove that");
    printf("\n duplicates you may delete the subfolder then,");
    printf("\n if you decided to not remove just move the contents");
    printf("\n of 'duplicates' subfolder back");
    printf("\n ");
    printf("\n note this program not work on files larger than 4GB ");
    printf("\n and no warranty at all youre responsible for any dameges ");
    printf("\n if use of this program would eventually do - i just
    wrote ");
    printf("\n the code and it work for me but not tested it to much besides");
    printf("\n ");
    printf("\n september 2024");

    printf("\n ");
    printf("\n starting.. ");

    ReadDIrectoryFileNamesToList("*");

    // for(int i=0; i< FileNameList_Size; i++)
    // printf("\n %d %s %d", i, FileNameList[i].name, FileNameList[i].file_size );


    for(int i=0; i< FileNameList_Size; i++)
    {
    for(int j=i+1; j< FileNameList_Size; j++)
    {
    if(FileNameList[i].file_size!=FileNameList[j].file_size) continue;
    if( CompareTwoFilesByContentsAndSayIfEqual(FileNameList[i].name, FileNameList[j].name))
    {
    // printf("\nduplicate found (%s) ", FileNameList[j].name);
    MoveDuplicateToSubdirectory(FileNameList[j].name);
    }

    }

    }

    printf(" \n\n %d duplicates moved \n\n\n", duplicates_moved);

    return 'ok';
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Chris M. Thomasson on Sun Sep 22 00:18:09 2024
    Chris M. Thomasson wrote:
    On 9/21/2024 11:53 AM, fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
    [...]

    Not sure if this will help you or not... ;^o

Fwiw, I have to sort and remove duplicates in this experimental locking system that I called the multex. Here is the C++ code I used to do it. I
sort and then remove any duplicates, so say a thread's local lock set was:

    31, 59, 69, 31, 4, 1, 1, 5

    would become:

    1, 4, 5, 31, 59, 69

this ensures no deadlocks. As for the algorithm for removing duplicates, well, there is more than one. Actually, I don't know which one my C++
impl is using right now.

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/Ti8LFyH4CgAJ

    // Deadlock free baby!
    void ensure_locking_order()
    {
    // sort and remove duplicates

    std::sort(m_lock_idxs.begin(), m_lock_idxs.end());

    m_lock_idxs.erase(std::unique(m_lock_idxs.begin(),
    m_lock_idxs.end()), m_lock_idxs.end());
    }

    Using the std C++ template lib.

i'm not sure what you are talking about, but i am writing about finding file
duplicates (by binary content, not by name).. it is a disk thing and i
don't think mutexes are needed - you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the same size

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to All on Sun Sep 22 00:48:05 2024
okay, that previous code had some errors, but i made changes and this one
seems to work


i ran it on about 50 GB of files from recuva and it moved about 22 GB out as
duplicates... by an eyeball test it seems to work

#include<windows.h>
#include<stdio.h>
#include<stdlib.h>

    void StrCopyMaxNBytes(char* dest, char* src, int n)
    {
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    }

//list of file names
enum { FileNameListEntry_name_max = 500 };
typedef struct { char name[FileNameListEntry_name_max];
unsigned int file_size; int is_duplicate; } FileNameListEntry;

    FileNameListEntry* FileNameList = NULL;
    int FileNameList_Size = 0;

    void FileNameList_AddOne(char* name, unsigned int file_size)
    {
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList, FileNameList_Size * sizeof(FileNameListEntry) );
    StrCopyMaxNBytes((char*)&FileNameList[FileNameList_Size-1].name,
    name, FileNameListEntry_name_max);
    FileNameList[FileNameList_Size-1].file_size = file_size;
    FileNameList[FileNameList_Size-1].is_duplicate = 0;

    return ;
    }


    // collect list of filenames
    WIN32_FIND_DATA ffd;

    void ReadDIrectoryFileNamesToList(char* dir)
    {
    HANDLE h = FindFirstFile(dir, &ffd);

if(h==INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1);}

    do {
    if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
    {
    FileNameList_AddOne(ffd.cFileName, ffd.nFileSizeLow);
if(ffd.nFileSizeHigh!=0) { printf("this program only works for files up to 4GB"); exit(-1);}
    }
    }
while (FindNextFile(h, &ffd));

FindClose(h);
}

    #include <sys/stat.h>

    int GetFileSize2(char *filename)
    {
    struct stat st;
    if (stat(filename, &st)==0) return (int) st.st_size;

    printf("\n *** error obtaining file size for %s", filename); exit(-1);
    return -1;
    }

    int FolderExist(char *name)
    {
    static struct stat st;
    if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
    return 0;
    }


    //////////

    unsigned char* bytes2 = NULL;
    int bytes2_size = 0;
    int bytes2_allocked = 0;

    unsigned char* bytes2_resize(int size)
    {
    bytes2_size=size;
    return bytes2=(unsigned char*)realloc(bytes2,
    bytes2_size*sizeof(unsigned char));

    }

    void bytes2_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes2_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////////


    unsigned char* bytes1 = NULL;
    int bytes1_size = 0;
    int bytes1_allocked = 0;

    unsigned char* bytes1_resize(int size)
    {
    bytes1_size=size;
    return bytes1=(unsigned char*)realloc(bytes1,
    bytes1_size*sizeof(unsigned char));

    }

    void bytes1_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////



    int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
    {
    bytes1_load(file_a);
    bytes2_load(file_b);
if(bytes1_size!=bytes2_size) { printf("\n something is wrong - compared files are assumed to be the same size"); exit(-1); }

    for(unsigned int i=0; i<bytes1_size;i++)
    if(bytes1[i]!=bytes2[i]) return 0;

    return 1;

    }

    #include<direct.h>
    #include <dirent.h>
    #include <errno.h>

    int duplicates_moved = 0;
    void MoveDuplicateToSubdirectory(char*name)
    {

    if(!FolderExist("duplicates"))
    {
    int n = _mkdir("duplicates");
    if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
    }

    static char renamed[1000];
    int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

    if(rename(name, renamed))
    {printf("\n rename %s %s failed", name, renamed); exit(-1);}

    duplicates_moved++;

    }

    int main()
    {
    printf("\n (RE)MOVE FILE DUPLICATES");
    printf("\n ");

    printf("\n this program searches for binaric (comparec byute to
    byte)");
    printf("\n duplicates/multiplicates of files in its own");
    printf("\n folder (no search in subdirectories, just flat)");
    printf("\n and if found it copies it into 'duplicates'");
    printf("\n subfolder it creates If you want to remove that");
    printf("\n duplicates you may delete the subfolder then,");
    printf("\n if you decided to not remove just move the contents");
    printf("\n of 'duplicates' subfolder back");
    printf("\n ");
    printf("\n note this program not work on files larger than 4GB ");
    printf("\n and no warranty at all youre responsible for any dameges ");
    printf("\n if use of this program would eventually do - i just
    wrote ");
    printf("\n the code and it work for me but not tested it to much besides");
    printf("\n ");
    printf("\n september 2024");

    printf("\n ");
    printf("\n starting.. ");

    ReadDIrectoryFileNamesToList("*");


    printf("\n\n found %d files in current directory", FileNameList_Size);
    for(int i=0; i< FileNameList_Size; i++)
    printf("\n #%d %s %d", i, FileNameList[i].name, FileNameList[i].file_size );

    // return 'ok';

    for(int i=0; i< FileNameList_Size; i++)
    {
    if(FileNameList[i].is_duplicate) continue;

    for(int j=i+1; j< FileNameList_Size; j++)
    {
    if(FileNameList[j].is_duplicate) continue;

    if(FileNameList[i].file_size!=FileNameList[j].file_size) continue;

    if( CompareTwoFilesByContentsAndSayIfEqual(FileNameList[i].name, FileNameList[j].name))
    {
    printf("\n#%d %s (%d) has duplicate #%d %s (%d) ",i, FileNameList[i].name,FileNameList[i].file_size, j, FileNameList[j].name, FileNameList[j].file_size);
    FileNameList[j].is_duplicate=1;
    // MoveDuplicateToSubdirectory(FileNameList[i].name);
    }

    }

    }

    printf("\n moving duplicates to subfolder...");

    for(int i=0; i< FileNameList_Size; i++)
    {
    if(FileNameList[i].is_duplicate) MoveDuplicateToSubdirectory(FileNameList[i].name);
    }

    printf(" \n\n %d duplicates moved \n\n\n", duplicates_moved);

    return 'ok';
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to fir on Sun Sep 22 01:28:09 2024
    On Sat, 21 Sep 2024 20:53:47 +0200, fir wrote:

i think of writing a simple command-line program
that removes duplicates in a given folder

    <https://packages.debian.org/bookworm/duff> <https://packages.debian.org/bookworm/dupeguru> <https://packages.debian.org/trixie/backdown>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to fir on Sun Sep 22 02:06:49 2024
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

... you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the
same size

    For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
    That’s an O(N²) algorithm.

    There is a faster way.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 04:36:03 2024
    Lawrence D'Oliveiro wrote:
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

... you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the
same size

    For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons. That’s an O(N²) algorithm.

    There is a faster way.

not quite, as most files have different sizes, so most binary comparisons
are discarded because the sizes of the files differ (and those sizes i read
linearly when building the list of filenames)

what i posted seems to work ok; it doesn't work fast, but it's hard to say whether it
can be optimised or it simply takes as long as it should.. hard to say

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to fir on Sun Sep 22 07:09:45 2024
    On Sun, 22 Sep 2024 04:36:03 +0200, fir wrote:

    Lawrence D'Oliveiro wrote:

    There is a faster way.

    not quite ...

    Yes there is. See how those other programs do it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to fir on Sun Sep 22 03:29:08 2024
    On Sat, 9/21/2024 10:36 PM, fir wrote:
    Lawrence D'Oliveiro wrote:
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

[...]

not quite, as most files have different sizes, so most binary comparisons
are discarded because the sizes of the files differ (and those sizes i read linearly when building the list of filenames)

what i posted seems to work ok; it doesn't work fast, but it's hard to say whether it can be optimised or it simply takes as long as it should.. hard to say

    The normal way to do this, is do a hash check on the
    files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
    as a means to compare two files. If you want to be picky about
    it, stick with SHA256SUM.

    hashdeep64 -c MD5 -j 1 -r H: > H_sums.txt # Took about two minutes to run this on an SSD
    # Hard drive, use -j 1 . For an SSD, use a higher thread count for -j .

    Size MD5SUM Path

Same size, same hash value. The size is zero. The MD5SUM in this case is always the same (the MD5 of empty input).

    0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\AadConfigurations\AadConfiguration.lock
    0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\V2AccountStore.lock

    Same size, different hash value. These are not the same file.

    65536, a8113cfdf0227ddf1c25367ecccc894b, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\5213954f4433d4fbe45ed37ffc67d43fc43b54584bfd3a8d.bin
    65536, 5e91acf90e90be408b6549e11865009d, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\bf7b3ea78a361dc533a9344051255c035491d960f2bc7f31.bin

    You can use the "sort" command, to sort by the first and second fields if you want.
    Sorting the output lines, places the identical files next to one another, in the output.

    The output of data recovery software is full of "fragments". Using
    the "file" command (Windows port available, it's a Linux command),
    can allow ignoring files which have no value (listed as "Data").
    Recognizable files will be listed as "PNG" or "JPG" and so on.

    A utility such as Photorec, can attempt to glue together files. Your mileage may vary.
    That is a scan based file recovery method. I have not used it.

    https://en.wikipedia.org/wiki/PhotoRec

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to fir on Sun Sep 22 11:38:17 2024
    On 22/09/2024 11:24, fir wrote:
    Paul wrote:

    The normal way to do this, is do a hash check on the
    files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
    as a means to compare two files. If you want to be picky about
    it, stick with SHA256SUM.


the code i posted works ok, and anyone who has windows and mingw/tdm may compile it and check the application if they want to

hashing is not necessary imo, though it could probably speed things up - i'm
not strongly convinced that the probability of a mistake in this hashing
is strictly zero (as i have never used it and would probably need to produce my
own hashing).. it's probably mathematically proven to be almost
zero, but for now at least it is more interesting to me whether the code i posted is ok

I was going to post similar ideas (doing a linear pass working out
checksums for each file, then sorting the list by checksum and size, so that
candidates for a byte-by-byte comparison, if you want to do that, will
be grouped together).
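
A minimal sketch of that idea, assuming a simple FNV-1a checksum as a stand-in for the MD5/SHA sums mentioned above (FNV-1a itself is not from the thread); the checksum only groups candidates, and a final byte-by-byte compare would still decide actual equality:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { char name[500]; unsigned int size; uint64_t sum; } Entry;

/* FNV-1a over the whole file: cheap, one linear pass per file */
static uint64_t checksum_file(const char *path)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    FILE *f = fopen(path, "rb");
    int c;
    if (!f) return 0;
    while ((c = fgetc(f)) != EOF)
        h = (h ^ (unsigned char)c) * 0x100000001b3ULL;
    fclose(f);
    return h;
}

/* sort by (size, checksum) so potential duplicates become neighbours */
static int cmp_entry(const void *a, const void *b)
{
    const Entry *x = a, *y = b;
    if (x->size != y->size) return x->size < y->size ? -1 : 1;
    if (x->sum  != y->sum)  return x->sum  < y->sum  ? -1 : 1;
    return 0;
}

/* usage: fill an Entry array, set sum = checksum_file(name) for each file,
   qsort(entries, n, sizeof(Entry), cmp_entry), then walk the sorted array
   and byte-compare only runs with equal size and equal checksum */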

    But if you're going to reject everyone's suggestions in favour of your
    own already working solution, then I wonder why you bothered posting.

    (I didn't post after all because I knew it would be futile.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Paul on Sun Sep 22 12:24:06 2024
    Paul wrote:
The normal way to do this, is do a hash check on the
files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
as a means to compare two files. If you want to be picky about
it, stick with SHA256SUM.
[...]

i am not doing recovery - this removes duplicates

i mean programs such as recuva, when they recover files, recover tens of
thousands of files and gigabytes with lost names and some common types
(.mp3, .jpg, .txt and so on), and many of those files are binary duplicates

this code i posted last just finds files that are duplicates and moves
them to a 'duplicates' subdirectory, and it can show that half of those
files or more (many gigabytes) are pure duplicates, so one may remove
the subfolder and recover the space

the code i posted works ok, and anyone who has windows and mingw/tdm may
compile it and check the application if they want to

hashing is not necessary imo, though it could probably speed things up - i'm
not strongly convinced that the probability of a mistake in this hashing
is strictly zero (as i have never used it and would probably need to produce my
own hashing).. it's probably mathematically proven to be almost
zero, but for now at least it is more interesting to me whether the code i
posted is ok

you may see the main procedure of it

first it builds a list of files with sizes
using the windows winapi function

HANDLE h = FindFirstFile(dir, &ffd);

(it's linear, say 12k calls for 12k files in the folder)

then it runs a square loop (12k * 12k / 2 - 12k iterations)

and binarily compares those that have the same size


    int GetFileSize2(char *filename)
    {
    struct stat st;
    if (stat(filename, &st)==0) return (int) st.st_size;

    printf("\n *** error obtaining file size for %s", filename); exit(-1);
    return -1;
    }

    void bytes1_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
    }


    int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
    {
    bytes1_load(file_a);
    bytes2_load(file_b);
if(bytes1_size!=bytes2_size) { printf("\n something is wrong - compared files are assumed to be the same size"); exit(-1); }

    for(unsigned int i=0; i<bytes1_size;i++)
    if(bytes1[i]!=bytes2[i]) return 0;

    return 1;

    }


this has 2 elements: the file load into ram and then the comparisons

(reading the file sizes again is redundant, as i already got the info from the
FindFirstFile(dir, &ffd) winapi function, but maybe to be sure i
read it also from this stat() function again)

and then finally i got a linear part to move the ones on the list
marked as duplicates to the subfolder

    int FolderExist(char *name)
    {
    static struct stat st;
    if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
    return 0;
    }


    int duplicates_moved = 0;
    void MoveDuplicateToSubdirectory(char*name)
    {

    if(!FolderExist("duplicates"))
    {
    int n = _mkdir("duplicates");
    if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
    }

    static char renamed[1000];
    int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

    if(rename(name, renamed))
    {printf("\n rename %s %s failed", name, renamed); exit(-1);}

    duplicates_moved++;

    }


i'm not sure whether some of these functions are slow, and there is an
element of redundancy in calling if(!FolderExist("duplicates"))
many times, as if it were a normal "ram based" function and not a disk-related one
- but it's probably okay i guess (and this disk-related function i hope
does not really hit the disk but hopefully only reads some cached info about it)
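
If that repeated stat() check ever turns out to matter, one option (a sketch, not from the posted code) is to create the subfolder at most once per run and remember that in a static flag:

#include <stdio.h>
#include <stdlib.h>
#include <direct.h>   /* _mkdir */
#include <errno.h>

/* create the "duplicates" subfolder at most once per run, instead of
   calling FolderExist()/stat() for every file that gets moved */
static void EnsureDuplicatesFolderOnce(void)
{
    static int done = 0;
    if (done) return;
    if (_mkdir("duplicates") != 0 && errno != EEXIST)
    { printf("\n i cannot create subfolder"); exit(-1); }
    done = 1;
}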

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 14:46:12 2024
    Bart wrote:
[...]

    But if you're going to reject everyone's suggestions in favour of your
    own already working solution, then I wonder why you bothered posting.

    (I didn't post after all because I knew it would be futile.)


i want to discuss, not to do everything that is mentioned.. is that hard to
understand? so i may read about the options, but i literally have no time to
implement even the good ideas - this program i wrote showed itself to work and i'm
now using it

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 14:48:11 2024
    fir wrote:
    Bart wrote:
[...]


i want to discuss, not to do everything that is mentioned.. is that hard to
understand? so i may read about the options, but i literally have no time to
implement even the good ideas - this program i wrote showed itself to work and i'm
now using it

also note i posted a whole working program while some others just say what
could be done... working code was my main goal, not really entering a
contest of what is fastest (that is also an interesting topic but not the
main goal)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:06:39 2024
    fir wrote:
    fir wrote:
    Bart wrote:
[...]

also note i posted a whole working program while some others just say what
could be done... working code was my main goal, not really entering a
contest of what is fastest (that is also an interesting topic but not the
main goal)

an interesting thing is yet how it works in the system...
i'm used to writing cpu-intensive applications and to controlling frame
times and cpu usage.. but generally i have never written disk-based apps

this one here is the first.. i use sysinternals on windows, and when i run
this prog it has like 3 stages
1) read the directory info (if it's big, like 30k files, it may take some time)
2) the square part that reads file contents, compares, and sets duplicate flags on the list
3) the rename part - i mean i call the "rename" function on duplicates

most of the time is taken by the square part, and the disk usage indicator is
full; cpu usage is 50%, which probably means one core is fully used

the disk indicator in the tray shows (in the square phase) something like
R: 1.6 GB
O: 635 KB
W: 198 B

i don't know what it is; R is for read for sure and W is for write, but what
is it exactly?


there is also the question whether closing or killing the program in those phases
may cause some disc damage? - as for most of the time in the square phase it
only does reads, i'm quite sure that closing in the read phase should not cause any
errors - but i'm not so sure about the renaming phase

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:26:49 2024
    fir wrote:
    Bart wrote:
[...]



yet to say something about the efficiency

when i observe how it works - this program is square in the sense that it has
a half-square loop over the directory file list, so it may be like
20k*20k/2 - 20k comparisons, but mostly it only compares sizes, so
i'm not sure how serious this kind of squareness is.. are 200M int comparisons
a problem? - maybe they become one for larger sets

in the sense of real binary comparisons it is not fully square but
more like sets of smaller squares on the diagonal of this large square,
if you (some) know what i mean... and that may be a problem, because
if among those 20k files 100 have the same size, then it makes about 100x100 full
loads and 100x100 full binary byte-to-byte compares, which
is practically the full square if there are indeed 100 duplicates
(maybe it's less than 100x100, as at the first finding of a duplicate i mark it
as a duplicate and then skip it in the loop)

but indeed it shows in practice that for folders bigger than 3k
files it slows down, probably disproportionately, so optimisation is
at hand / needed for large folders

that's from observing it



but as i said, i mainly wanted this done to free some space from
these recovered, somewhat junk files.. and having it in the partially square
way is more important than having it optimised

it works, and if i see it slowing down on large folders i can split those
big folders into a few of about 3k files each and run this duplicate mover in each one

more hand work, but it can be done by hand

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:51:24 2024
    fir wrote:

this program has yet one pleasant property as it works

    #5 f1795800624.bmp (589878) has duplicate #216 f1840569816.bmp (589878)
    #6 f1795801784.bmp (589878) has duplicate #217 f1840570976.bmp (589878)
    #7 f1795802944.bmp (589878) has duplicate #218 f1840572136.bmp (589878)
    #8 f1795804112.bmp (589878) has duplicate #219 f1840573296.bmp (589878)
    #9 f1795805272.bmp (589878) has duplicate #220 f1840574456.bmp (589878)

and those numbers on the left go from #1 to, say, #3000 (the last);
as it marks duplicates "forward", i mean if #8 has duplicate #218 then
#218 is marked as a duplicate and excluded in both loops (the outside/row
one and the inside/element one), so the scanning speeds up the further
it goes, and it's not linear like when copying files

it's nicely pleasant

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:32:05 2024
    fir wrote:
    fir wrote:
    Bart wrote:
[...]

however, saying that, the checksumming/hashing idea is of course kinda good
(the sorting probably less so, as it may be a bit harder to write - i'm never
sure whether my old hand-written quicksort code is error-free; i once tested like 30
quicksort versions in my life trying to rewrite it, and once i got some
mistake into this code and was later never strictly sure whether the version i
finally got is good - it's probably good, but i'm not sure)

but i would need to be sure that my own way of hashing has
practically no chance of generating the same hash for different files..
and i have never done these things, so i haven't thought it through.. and now it's a
side thing, possibly not worth studying

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From DFS@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 17:11:02 2024
    On 9/21/2024 10:06 PM, Lawrence D'Oliveiro wrote:
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

... you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the
same size

    For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons. That’s an O(N²) algorithm.


for (i = 0; i < N; i++) {
    for (j = i+1; j < N; j++) {
        ... byte-byte compare file i to file j
    }
}


    For N = 10, 45 byte-byte comparisons would be made (assuming all files
    are the same size)



    There is a faster way.

    Calc the checksum of each file once, then compare the checksums as above?

    Which is still an O(N^2) algorithm, but I would assume it's faster than
    45 byte-byte comparisons.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
• From Josef Möllers@21:1/5 to fir on Tue Oct 1 16:34:47 2024
    On 21.09.24 20:53, fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
    [...]

I have had the same problem. My solution was to use extended file
attributes and a file checksum, e.g. sha512sum; also, I wrote this in
PERL (see code below). Using the file attributes, I can re-run the
program after a while without having to re-calculate the checksums.
So, this solution only works for filesystems that have extended file
attributes, but you could also use some simple database (sqlite3?) to
map checksums to pathnames.

What I did was to walk through the directory tree and check whether the file
being considered already has a checksum in an extended attribute. If
not, I calculate the checksum and store it in the extended
attribute. Also, I store the pathname in a hash (remember, this is
PERL), keyed by the checksum.
If there is a collision (checksum already in the hash), I remove the new
file (and link the new filename to the old file). One could be paranoid
and do a byte-by-byte file comparison then.

If I needed to do this in a C program, I'd probably use a GList to store
the hash, but otherwise the code logic would be the same.

    HTH,

    Josef

    #! /usr/bin/perl

    use warnings;
    use strict;
    use File::ExtAttr ':all'; # In case of problems, maybe insert "use Scalar:Utils;" in /usr/lib/x86_64-linux-gnu/perl5/5.22/File/ExtAttr.pm
    use Digest::SHA;
    use File::Find;
    use Getopt::Std;

    # OPTIONS:
    # s: force symlink
# n: don't do the actual removing/linking
    # v: be more verbose
    # h: print short help
    my %opt = (
    s => undef,
    n => undef,
    v => undef,
    h => undef,
    );
    getopts('hnsv', \%opt);

    if ($opt{h}) {
    print STDERR "usage: lndup [-snvh] [dirname..]\n";
    print STDERR "\t-s: use symlink rather than hard link\n";
    print STDERR "\t-n: don't remove/link, just show what would be done\n";
    print STDERR "\t-v: be more verbose (show pathname and SHA512 sum\n";
    print STDERR "\t-h: show this text\n";
    exit(0);
    }

    my %file;

    if (@ARGV == 0) {
    find({ wanted => \&lndup, no_chdir => 1 }, '.');
    } else {
    find({ wanted => \&lndup, no_chdir => 1 }, @ARGV);
    }

    # NAME: lndup
    # PURPOSE: To handle a single file
    # ARGUMENTS: None, pathname is taken from $File::Find::name
    # RETURNS: Nothing
    # NOTE: The SHA512 sum of a file is calculated.
    # IF a file with the same sum was already found earlier, AND
    # iF both files are NOT the same (hard link) AND
    # iF both files reside on the same disk
    # THEN the second occurrence is removed and
    # replaced by a link to the first occurrence
    sub lndup {
    my $pathname = $File::Find::name;

    return if ! -f $pathname;
    if (-s $pathname) {
    my $sha512sum = getfattr($pathname, 'SHA512');
    if (!defined $sha512sum) {
    my $ctx = Digest::SHA->new(512);
    $ctx->addfile($pathname);
    $sha512sum = $ctx->hexdigest;
    print STDERR "$pathname $sha512sum\n" if $opt{v};
    setfattr($pathname, "SHA512", $sha512sum);
    } elsif ($opt{v}) {
    print STDERR "Using sha512sum from attributes\n";
    }

    if (exists $file{$sha512sum}) {
    if (!same_file($pathname, $file{$sha512sum})) {
    my $links1 = (stat($pathname))[3];
    my $links2 = (stat($file{$sha512sum}))[3];
# If one of them is a symbolic link, make sure it's $pathname
    if (is_symlink($file{$sha512sum})) {
    print STDERR "Swapping $pathname and
    $file{$sha512sum}\n" if $opt{v};
    swap($file{$sha512sum}, $pathname);
    }
    # If $pathname has more links than $file{$sha512sum},
    # exchange the two names.
    # This ensures that $file{$sha512sum} has the most links.
    elsif ($links1 > $links2) {
    print STDERR "Swapping $pathname and
    $file{$sha512sum}\n" if $opt{v};
    swap($file{$sha512sum}, $pathname);
    }

    print "rm \"$pathname\"; ln \"$file{$sha512sum}\" \"$pathname\"\n";
    if (! $opt{n}) {
    my $same_disk = same_disk($pathname,
    $file{$sha512sum});
    if (unlink($pathname)) {
    if (! $same_disk || $opt{s}) {
    symlink($file{$sha512sum}, $pathname) ||
    print STDERR "Failed to symlink($file{$sha512sum}, $pathname): $!\n";
    } else {
    link($file{$sha512sum}, $pathname) || print
    STDERR "Failed to link($file{$sha512sum}, $pathname): $!\n";
    }
    } else {
    print STDERR "Failed to unlink $pathname: $!\n";
    }
    }
    # print "Removing $pathname\n";
    # unlink $pathname or warn "$0: Cannot remove $_: $!\n";

    }
    } else {
    $file{$sha512sum} = $pathname;
    }
    }
    }

    # NAME: same_disk
    # PURPOSE: To check if two files are on the same disk
    # ARGUMENTS: pn1, pn2: pathnames of files
    # RETURNS: true if files are on the same disk, else false
    # NOTE: The check is made by comparing the device numbers of the
    # filesystems of the two files.
    sub same_disk {
    my ($pn1, $pn2) = @_;

    my @s1 = stat($pn1);
    my @s2 = stat($pn2);

    return $s1[0] == $s2[0];
    }

    # NAME: same_file
    # PURPOSE: To check if two files are the same
    # ARGUMENTS: pn1, pn2: pathnames of files
    # RETURNS: true if files are the same, else false
    # NOTE: files are the same if device number AND inode number
    # are identical
    sub same_file {
    my ($pn1, $pn2) = @_;

    my @s1 = stat($pn1);
    my @s2 = stat($pn2);

    return ($s1[0] == $s2[0]) && ($s1[1] == $s2[1]);
    }

    sub is_symlink {
    my ($path) = @_;

    return -l $path;
    }

    sub swap {
    my $tmp;
    $tmp = $_[0];
    $_[0] = $_[1];
    $_[1] = $tmp;
    }
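
A minimal C sketch of the same checksum-to-pathname map (the "GList to store the hash" idea above), assuming GLib is available; it reads whole files into memory for hashing and only reports candidates, so a byte-by-byte confirmation would still be advisable before removing anything:

/* build with: gcc dedup_sketch.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

/* map: SHA-512 hex digest -> first pathname seen with that digest */
static void handle_file(GHashTable *seen, const char *path)
{
    gchar *data = NULL;
    gsize len = 0;
    if (!g_file_get_contents(path, &data, &len, NULL))
        return;   /* unreadable file: skip it */

    gchar *digest = g_compute_checksum_for_data(G_CHECKSUM_SHA512,
                                                (const guchar *)data, len);
    g_free(data);

    const char *first = g_hash_table_lookup(seen, digest);
    if (first) {
        printf("duplicate candidate: %s == %s\n", path, first);
        g_free(digest);
    } else {
        /* the table takes ownership of digest and the copied path */
        g_hash_table_insert(seen, digest, g_strdup(path));
    }
}

int main(void)
{
    GHashTable *seen = g_hash_table_new_full(g_str_hash, g_str_equal,
                                             g_free, g_free);
    /* a real tool would walk the directory tree here and call
       handle_file() for every regular file it finds */
    handle_file(seen, "example.bin");
    g_hash_table_destroy(seen);
    return 0;
}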

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to josef@invalid.invalid on Tue Oct 1 20:38:23 2024
    In article <lm2fk7FpccjU1@mid.individual.net>,
Josef Möllers <josef@invalid.invalid> wrote:
    ...
    I have had the same problem. My solution was to use extended file
    attributes and some file checksum, eg sha512sum, also, I wrote this in
    PERL (see code below). Using the file attributes, I can re-run the

    And is thus entirely OT here. Keith will tell the same.

    --
    "You can safely assume that you have created God in your own image when
    it turns out that God hates all the same people you do." -- Anne Lamott

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)