• program to remove duplicates

    From fir@21:1/5 to All on Sat Sep 21 20:53:47 2024
i think of writing a simple command-line program
that removes duplicates in a given folder

i mean one would copy the program into a given folder,
run it, and all duplicates and multiplicates (where a
duplicate means a file with a different name but the
exact same binary size and byte content) get removed,
leaving only one file of each multiplicate set

this should work for a big batch of files -
i need it because, for example, i once recovered a hdd disk,
and as i got some copies of files on that disc,
the recovered files are generally multiplicated
and consume a lot of disk space

so is there some approach i should take to make this
process faster?

probably i would need to read the list of files and sizes in
the current directory, then sort or go through the list, and when an
exact size match is found read the files into ram and compare them byte by byte

i'm not sure whether to do the sorting, as i need to write it quickly
and maybe sorting will complicate things a bit without gaining much

some thoughts?
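
A minimal sketch of that byte-by-byte step, comparing two presumed-equal-size files in fixed-size chunks with plain stdio so neither file has to be loaded into ram whole (the function name and buffer size are only illustrative, not from the thread):

#include <stdio.h>
#include <string.h>

/* compare two files of (presumably) equal size in 64 KiB chunks;
   returns 1 if identical, 0 if they differ or either cannot be read */
static int files_equal(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int equal = (fa && fb);

    while (equal) {
        unsigned char buf_a[65536], buf_b[65536];
        size_t na = fread(buf_a, 1, sizeof buf_a, fa);
        size_t nb = fread(buf_b, 1, sizeof buf_b, fb);
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) { equal = 0; break; }
        if (na == 0) break;   /* both files hit EOF while still equal */
    }

    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return equal;
}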

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to fir on Sat Sep 21 20:56:18 2024
    fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
[...]

curiously, i could add that i once searched for a program to remove duplicates,
but they didn't look good.. so such a command-line
(or in fact command-line-less, as i don't even want to add command-line
options maybe) program is quite practically needed

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to fir on Sat Sep 21 21:27:08 2024
    fir wrote:
    fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
[...]


assuming i got code to read the list of filenames in a given directory
(which i found), what do you suggest i should add to remove such duplicates?
- here's the code to read those filenames into a list
(tested to work, but not tested for being 100% error-free)

#include<windows.h>
#include<stdio.h>
#include<stdlib.h>

    void StrCopyMaxNBytes(char* dest, char* src, int n)
    {
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    }

//list of file names
enum { FileNameListEntry_name_max = 500 };
typedef struct { char name[FileNameListEntry_name_max]; } FileNameListEntry;

    FileNameListEntry* FileNameList = NULL;
    int FileNameList_Size = 0;

    void FileNameList_AddOne(char* name)
    {
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList, FileNameList_Size * sizeof(FileNameListEntry) );
    StrCopyMaxNBytes((char*)&FileNameList[FileNameList_Size-1].name,
    name, FileNameListEntry_name_max);
    return ;
    }


    // collect list of filenames
    WIN32_FIND_DATA ffd;

    void ReadDIrectoryFileNamesToList(char* dir)
    {
    HANDLE h = FindFirstFile(dir, &ffd);

if(h==INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1);}

    do {
    if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
    FileNameList_AddOne(ffd.cFileName);
    }
while (FindNextFile(h, &ffd));

FindClose(h);
}



    int main()
    {

    ReadDIrectoryFileNamesToList("*");

    for(int i=0; i< FileNameList_Size; i++)
    printf("\n %d %s", i, FileNameList[i].name );

    return 'ok';
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to fir on Sat Sep 21 22:12:04 2024
    fir wrote:
    fir wrote:
    fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
[...]

ok i sketched some code, only i don't know how to move a given file
(given by filename) to some subfolder.. is there such a function in c?
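
For moving a file, the standard C rename() works within the same volume, and on Windows _mkdir() from <direct.h> creates the subfolder; a minimal sketch (the folder name and error handling are only illustrative):

#include <stdio.h>
#include <direct.h>   /* _mkdir on Windows */
#include <errno.h>

/* move "name" from the current directory into a "duplicates" subfolder */
static int move_to_duplicates(const char *name)
{
    char dest[1000];

    /* create the subfolder if it is not there yet */
    if (_mkdir("duplicates") != 0 && errno != EEXIST)
        return -1;

    snprintf(dest, sizeof dest, "duplicates\\%s", name);
    return rename(name, dest);   /* 0 on success */
}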

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to All on Sat Sep 21 23:13:50 2024
ok i wrote this duplicates remover, but i don't know whether it's free of errors etc

here's the code; you may comment if you see some errors, alternatives or
improvements (note i wrote it in the time between when i first posted and this
moment here, so it's kind of a speedy draft (i reused old routines for loading
files etc))

#include<windows.h>
#include<stdio.h>
#include<stdlib.h>

    void StrCopyMaxNBytes(char* dest, char* src, int n)
    {
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    }

//list of file names
enum { FileNameListEntry_name_max = 500 };
typedef struct { char name[FileNameListEntry_name_max];
unsigned int file_size; } FileNameListEntry;

    FileNameListEntry* FileNameList = NULL;
    int FileNameList_Size = 0;

    void FileNameList_AddOne(char* name, unsigned int file_size)
    {
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList, FileNameList_Size * sizeof(FileNameListEntry) );
    StrCopyMaxNBytes((char*)&FileNameList[FileNameList_Size-1].name,
    name, FileNameListEntry_name_max);
    FileNameList[FileNameList_Size-1].file_size = file_size;
    return ;
    }


    // collect list of filenames
    WIN32_FIND_DATA ffd;

    void ReadDIrectoryFileNamesToList(char* dir)
    {
    HANDLE h = FindFirstFile(dir, &ffd);

if(h==INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1);}

    do {
    if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
    {
    FileNameList_AddOne(ffd.cFileName, ffd.nFileSizeLow);
if(ffd.nFileSizeHigh!=0) { printf("this program only works for files up to 4GB"); exit(-1);}
    }
    }
while (FindNextFile(h, &ffd));

FindClose(h);
}

    #include <sys/stat.h>

    int GetFileSize2(char *filename)
    {
    struct stat st;
    if (stat(filename, &st)==0) return (int) st.st_size;

    printf("error obtaining file size for %s", filename); exit(-1);
    return -1;
    }

    int FolderExist(char *name)
    {
    static struct stat st;
    if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
    return 0;
    }


    //////////

    unsigned char* bytes2 = NULL;
    int bytes2_size = 0;
    int bytes2_allocked = 0;

    unsigned char* bytes2_resize(int size)
    {
    bytes2_size=size;
    if((bytes2_size+100)*2<bytes2_allocked | bytes2_size>bytes2_allocked)
    return bytes2=(unsigned char*)realloc(bytes2, (bytes2_allocked=(bytes2_size+100)*2)*sizeof(unsigned char));
    }

    void bytes2_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes2_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////////


    unsigned char* bytes1 = NULL;
    int bytes1_size = 0;
    int bytes1_allocked = 0;

    unsigned char* bytes1_resize(int size)
    {
    bytes1_size=size;
    if((bytes1_size+100)*2<bytes1_allocked | bytes1_size>bytes1_allocked)
    return bytes1=(unsigned char*)realloc(bytes1, (bytes1_allocked=(bytes1_size+100)*2)*sizeof(unsigned char));
    }

    void bytes1_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////



    int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
    {
    bytes1_load(file_a);
    bytes2_load(file_b);
if(bytes1_size!=bytes2_size) { printf("\n something is wrong - compared files are assumed to be the same size"); exit(-1); }

    for(unsigned int i=0; i<=bytes1_size;i++)
    if(bytes1[i]!=bytes2[i]) return 0;

    return 1;

    }

    #include<direct.h>
    #include <dirent.h>
    #include <errno.h>

    int duplicates_moved = 0;
    void MoveDuplicateToSubdirectory(char*name)
    {

    if(!FolderExist("duplicates"))
    {
    int n = _mkdir("duplicates");
    if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
    }

    static char renamed[1000];
    int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

    if(rename(name, renamed))
    {printf("\n rename %s %s failed", name, renamed); exit(-1);}

    duplicates_moved++;

    }

    int main()
    {
    printf("\n (RE)MOVE FILE DUPLICATES");
    printf("\n ");

    printf("\n this program searches for binaric (comparec byute to
    byte)");
    printf("\n duplicates/multiplicates of files in its own");
    printf("\n folder (no search in subdirectories, just flat)");
    printf("\n and if found it copies it into 'duplicates'");
    printf("\n subfolder it creates If you want to remove that");
    printf("\n duplicates you may delete the subfolder then,");
    printf("\n if you decided to not remove just move the contents");
    printf("\n of 'duplicates' subfolder back");
    printf("\n ");
    printf("\n note this program not work on files larger than 4GB ");
    printf("\n and no warranty at all youre responsible for any dameges ");
    printf("\n if use of this program would eventually do - i just
    wrote ");
    printf("\n the code and it work for me but not tested it to much besides");
    printf("\n ");
    printf("\n september 2024");

    printf("\n ");
    printf("\n starting.. ");

    ReadDIrectoryFileNamesToList("*");

    // for(int i=0; i< FileNameList_Size; i++)
    // printf("\n %d %s %d", i, FileNameList[i].name, FileNameList[i].file_size );


    for(int i=0; i< FileNameList_Size; i++)
    {
    for(int j=i+1; j< FileNameList_Size; j++)
    {
    if(FileNameList[i].file_size!=FileNameList[j].file_size) continue;
    if( CompareTwoFilesByContentsAndSayIfEqual(FileNameList[i].name, FileNameList[j].name))
    {
    // printf("\nduplicate found (%s) ", FileNameList[j].name);
    MoveDuplicateToSubdirectory(FileNameList[j].name);
    }

    }

    }

    printf(" \n\n %d duplicates moved \n\n\n", duplicates_moved);

    return 'ok';
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Chris M. Thomasson on Sun Sep 22 00:18:09 2024
    Chris M. Thomasson wrote:
    On 9/21/2024 11:53 AM, fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
    [...]

    Not sure if this will help you or not... ;^o

Fwiw, I have to sort and remove duplicates in this experimental locking system that I called the multex. Here is the C++ code I used to do it. I
sort and then remove any duplicates, so say a thread's local lock set was:

    31, 59, 69, 31, 4, 1, 1, 5

    would become:

    1, 4, 5, 31, 59, 69

this ensures no deadlocks. As for the algorithm for removing duplicates, well, there is more than one. Actually, I don't know which one my C++
impl is using right now.

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/Ti8LFyH4CgAJ

    // Deadlock free baby!
    void ensure_locking_order()
    {
    // sort and remove duplicates

    std::sort(m_lock_idxs.begin(), m_lock_idxs.end());

    m_lock_idxs.erase(std::unique(m_lock_idxs.begin(),
    m_lock_idxs.end()), m_lock_idxs.end());
    }

    Using the std C++ template lib.

i'm not sure what you are talking about, but i am writing about finding file
duplicates (by binary content, not by name).. it is a disk thing and i
don't think mutexes are needed - you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the same size

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to All on Sun Sep 22 00:48:05 2024
okay, that previous code had some errors, but i made changes and this one
seems to work


i ran it on about 50 GB of files from recuva and it moved about 22 GB out as
duplicates... by an eyeball test it seems to work

#include<windows.h>
#include<stdio.h>
#include<stdlib.h>

    void StrCopyMaxNBytes(char* dest, char* src, int n)
    {
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    }

//list of file names
enum { FileNameListEntry_name_max = 500 };
typedef struct { char name[FileNameListEntry_name_max];
unsigned int file_size; int is_duplicate; } FileNameListEntry;

    FileNameListEntry* FileNameList = NULL;
    int FileNameList_Size = 0;

    void FileNameList_AddOne(char* name, unsigned int file_size)
    {
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList, FileNameList_Size * sizeof(FileNameListEntry) );
    StrCopyMaxNBytes((char*)&FileNameList[FileNameList_Size-1].name,
    name, FileNameListEntry_name_max);
    FileNameList[FileNameList_Size-1].file_size = file_size;
    FileNameList[FileNameList_Size-1].is_duplicate = 0;

    return ;
    }


    // collect list of filenames
    WIN32_FIND_DATA ffd;

    void ReadDIrectoryFileNamesToList(char* dir)
    {
    HANDLE h = FindFirstFile(dir, &ffd);

if(h==INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1);}

    do {
    if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
    {
    FileNameList_AddOne(ffd.cFileName, ffd.nFileSizeLow);
if(ffd.nFileSizeHigh!=0) { printf("this program only works for files up to 4GB"); exit(-1);}
    }
    }
while (FindNextFile(h, &ffd));

FindClose(h);
}

    #include <sys/stat.h>

    int GetFileSize2(char *filename)
    {
    struct stat st;
    if (stat(filename, &st)==0) return (int) st.st_size;

    printf("\n *** error obtaining file size for %s", filename); exit(-1);
    return -1;
    }

    int FolderExist(char *name)
    {
    static struct stat st;
    if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
    return 0;
    }


    //////////

    unsigned char* bytes2 = NULL;
    int bytes2_size = 0;
    int bytes2_allocked = 0;

    unsigned char* bytes2_resize(int size)
    {
    bytes2_size=size;
    return bytes2=(unsigned char*)realloc(bytes2,
    bytes2_size*sizeof(unsigned char));

    }

    void bytes2_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes2_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////////


    unsigned char* bytes1 = NULL;
    int bytes1_size = 0;
    int bytes1_allocked = 0;

    unsigned char* bytes1_resize(int size)
    {
    bytes1_size=size;
    return bytes1=(unsigned char*)realloc(bytes1,
    bytes1_size*sizeof(unsigned char));

    }

    void bytes1_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
    }

    /////////////



    int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
    {
    bytes1_load(file_a);
    bytes2_load(file_b);
if(bytes1_size!=bytes2_size) { printf("\n something is wrong - compared files are assumed to be the same size"); exit(-1); }

    for(unsigned int i=0; i<bytes1_size;i++)
    if(bytes1[i]!=bytes2[i]) return 0;

    return 1;

    }

    #include<direct.h>
    #include <dirent.h>
    #include <errno.h>

    int duplicates_moved = 0;
    void MoveDuplicateToSubdirectory(char*name)
    {

    if(!FolderExist("duplicates"))
    {
    int n = _mkdir("duplicates");
    if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
    }

    static char renamed[1000];
    int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

    if(rename(name, renamed))
    {printf("\n rename %s %s failed", name, renamed); exit(-1);}

    duplicates_moved++;

    }

    int main()
    {
    printf("\n (RE)MOVE FILE DUPLICATES");
    printf("\n ");

    printf("\n this program searches for binaric (comparec byute to
    byte)");
    printf("\n duplicates/multiplicates of files in its own");
    printf("\n folder (no search in subdirectories, just flat)");
    printf("\n and if found it copies it into 'duplicates'");
    printf("\n subfolder it creates If you want to remove that");
    printf("\n duplicates you may delete the subfolder then,");
    printf("\n if you decided to not remove just move the contents");
    printf("\n of 'duplicates' subfolder back");
    printf("\n ");
    printf("\n note this program not work on files larger than 4GB ");
    printf("\n and no warranty at all youre responsible for any dameges ");
    printf("\n if use of this program would eventually do - i just
    wrote ");
    printf("\n the code and it work for me but not tested it to much besides");
    printf("\n ");
    printf("\n september 2024");

    printf("\n ");
    printf("\n starting.. ");

    ReadDIrectoryFileNamesToList("*");


    printf("\n\n found %d files in current directory", FileNameList_Size);
    for(int i=0; i< FileNameList_Size; i++)
    printf("\n #%d %s %d", i, FileNameList[i].name, FileNameList[i].file_size );

    // return 'ok';

    for(int i=0; i< FileNameList_Size; i++)
    {
    if(FileNameList[i].is_duplicate) continue;

    for(int j=i+1; j< FileNameList_Size; j++)
    {
    if(FileNameList[j].is_duplicate) continue;

    if(FileNameList[i].file_size!=FileNameList[j].file_size) continue;

    if( CompareTwoFilesByContentsAndSayIfEqual(FileNameList[i].name, FileNameList[j].name))
    {
    printf("\n#%d %s (%d) has duplicate #%d %s (%d) ",i, FileNameList[i].name,FileNameList[i].file_size, j, FileNameList[j].name, FileNameList[j].file_size);
    FileNameList[j].is_duplicate=1;
    // MoveDuplicateToSubdirectory(FileNameList[i].name);
    }

    }

    }

    printf("\n moving duplicates to subfolder...");

    for(int i=0; i< FileNameList_Size; i++)
    {
    if(FileNameList[i].is_duplicate) MoveDuplicateToSubdirectory(FileNameList[i].name);
    }

    printf(" \n\n %d duplicates moved \n\n\n", duplicates_moved);

    return 'ok';
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to fir on Sun Sep 22 01:28:09 2024
    On Sat, 21 Sep 2024 20:53:47 +0200, fir wrote:

i think of writing a simple command-line program
that removes duplicates in a given folder

    <https://packages.debian.org/bookworm/duff> <https://packages.debian.org/bookworm/dupeguru> <https://packages.debian.org/trixie/backdown>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to fir on Sun Sep 22 02:06:49 2024
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

... you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the
same size

    For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
    That’s an O(N²) algorithm.

    There is a faster way.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 04:36:03 2024
    Lawrence D'Oliveiro wrote:
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

... you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the
same size

    For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons. That’s an O(N²) algorithm.

    There is a faster way.

not quite, as most files have different sizes, so most binary comparisons
are discarded because the sizes of the files differ (and those sizes i read
linearly when building the list of filenames)

what i posted seems to work ok; it doesn't work fast, but it's hard to say whether it
can be optimised or it simply takes as long as it should.. hard to say

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to fir on Sun Sep 22 07:09:45 2024
    On Sun, 22 Sep 2024 04:36:03 +0200, fir wrote:

    Lawrence D'Oliveiro wrote:

    There is a faster way.

    not quite ...

    Yes there is. See how those other programs do it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to fir on Sun Sep 22 03:29:08 2024
    On Sat, 9/21/2024 10:36 PM, fir wrote:
    Lawrence D'Oliveiro wrote:
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

[...]

not quite, as most files have different sizes, so most binary comparisons
are discarded because the sizes of the files differ (and those sizes i read linearly when building the list of filenames)

what i posted seems to work ok; it doesn't work fast, but it's hard to say whether it can be optimised or it simply takes as long as it should.. hard to say

    The normal way to do this, is do a hash check on the
    files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
    as a means to compare two files. If you want to be picky about
    it, stick with SHA256SUM.

    hashdeep64 -c MD5 -j 1 -r H: > H_sums.txt # Took about two minutes to run this on an SSD
    # Hard drive, use -j 1 . For an SSD, use a higher thread count for -j .

    Size MD5SUM Path

Same size, same hash value. The size is zero. The MD5SUM in this case is always the same (the MD5 of empty input).

    0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\AadConfigurations\AadConfiguration.lock
    0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\V2AccountStore.lock

    Same size, different hash value. These are not the same file.

    65536, a8113cfdf0227ddf1c25367ecccc894b, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\5213954f4433d4fbe45ed37ffc67d43fc43b54584bfd3a8d.bin
    65536, 5e91acf90e90be408b6549e11865009d, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\bf7b3ea78a361dc533a9344051255c035491d960f2bc7f31.bin

    You can use the "sort" command, to sort by the first and second fields if you want.
    Sorting the output lines, places the identical files next to one another, in the output.

    The output of data recovery software is full of "fragments". Using
    the "file" command (Windows port available, it's a Linux command),
    can allow ignoring files which have no value (listed as "Data").
    Recognizable files will be listed as "PNG" or "JPG" and so on.

    A utility such as Photorec, can attempt to glue together files. Your mileage may vary.
    That is a scan based file recovery method. I have not used it.

    https://en.wikipedia.org/wiki/PhotoRec

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to fir on Sun Sep 22 11:38:17 2024
    On 22/09/2024 11:24, fir wrote:
    Paul wrote:

    The normal way to do this, is do a hash check on the
    files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
    as a means to compare two files. If you want to be picky about
    it, stick with SHA256SUM.


the code i posted works ok, and anyone who has windows and mingw/tdm may compile it and check the application if they want to

hashing is not necessary imo, though it could probably speed things up - i'm
not strongly convinced that the probability of a mistake in this hashing
is strictly zero (as i have never used it and would probably need to produce my
own hashing).. it's probably mathematically proven to be almost
zero, but for now at least it is more interesting to me whether the code i posted is ok

I was going to post similar ideas (doing a linear pass working out
checksums for each file, then sorting the list by checksum and size, so that
candidates for a byte-by-byte comparison, if you want to do that, will
be grouped together).
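
A minimal sketch of that idea, assuming a simple FNV-1a checksum as a stand-in for the MD5/SHA sums mentioned above (FNV-1a itself is not from the thread); the checksum only groups candidates, and a final byte-by-byte compare would still decide actual equality:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { char name[500]; unsigned int size; uint64_t sum; } Entry;

/* FNV-1a over the whole file: cheap, one linear pass per file */
static uint64_t checksum_file(const char *path)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    FILE *f = fopen(path, "rb");
    int c;
    if (!f) return 0;
    while ((c = fgetc(f)) != EOF)
        h = (h ^ (unsigned char)c) * 0x100000001b3ULL;
    fclose(f);
    return h;
}

/* sort by (size, checksum) so potential duplicates become neighbours */
static int cmp_entry(const void *a, const void *b)
{
    const Entry *x = a, *y = b;
    if (x->size != y->size) return x->size < y->size ? -1 : 1;
    if (x->sum  != y->sum)  return x->sum  < y->sum  ? -1 : 1;
    return 0;
}

/* usage: fill an Entry array, set sum = checksum_file(name) for each file,
   qsort(entries, n, sizeof(Entry), cmp_entry), then walk the sorted array
   and byte-compare only runs with equal size and equal checksum */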

    But if you're going to reject everyone's suggestions in favour of your
    own already working solution, then I wonder why you bothered posting.

    (I didn't post after all because I knew it would be futile.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Paul on Sun Sep 22 12:24:06 2024
    Paul wrote:
The normal way to do this, is do a hash check on the
files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
as a means to compare two files. If you want to be picky about
it, stick with SHA256SUM.
[...]

i am not doing recovery - this removes duplicates

i mean programs such as recuva, when they recover files, recover tens of
thousands of files and gigabytes with lost names and some common types
(.mp3, .jpg, .txt and so on), and many of those files are binary duplicates

this code i posted last just finds files that are duplicates and moves
them to a 'duplicates' subdirectory, and it can show that half of those
files or more (many gigabytes) are pure duplicates, so one may remove
the subfolder and recover the space

the code i posted works ok, and anyone who has windows and mingw/tdm may
compile it and check the application if they want to

hashing is not necessary imo, though it could probably speed things up - i'm
not strongly convinced that the probability of a mistake in this hashing
is strictly zero (as i have never used it and would probably need to produce my
own hashing).. it's probably mathematically proven to be almost
zero, but for now at least it is more interesting to me whether the code i
posted is ok

you may see the main procedure of it

first it builds a list of files with sizes
using the windows winapi function

HANDLE h = FindFirstFile(dir, &ffd);

(it's linear, say 12k calls for 12k files in the folder)

then it runs a square loop (12k * 12k / 2 - 12k iterations)

and binarily compares those that have the same size


    int GetFileSize2(char *filename)
    {
    struct stat st;
    if (stat(filename, &st)==0) return (int) st.st_size;

    printf("\n *** error obtaining file size for %s", filename); exit(-1);
    return -1;
    }

    void bytes1_load(unsigned char* name)
    {
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf( "errot: cannot open file %s for load ", name); exit(-1); }
    int loaded = fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
    }


    int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
    {
    bytes1_load(file_a);
    bytes2_load(file_b);
if(bytes1_size!=bytes2_size) { printf("\n something is wrong - compared files are assumed to be the same size"); exit(-1); }

    for(unsigned int i=0; i<bytes1_size;i++)
    if(bytes1[i]!=bytes2[i]) return 0;

    return 1;

    }


this has 2 elements: the file load into ram and then the comparisons

(reading the file sizes again is redundant, as i already got the info from the
FindFirstFile(dir, &ffd) winapi function, but maybe to be sure i
read it also from this stat() function again)

and then finally i got a linear part to move the ones on the list
marked as duplicates to the subfolder

    int FolderExist(char *name)
    {
    static struct stat st;
    if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
    return 0;
    }


    int duplicates_moved = 0;
    void MoveDuplicateToSubdirectory(char*name)
    {

    if(!FolderExist("duplicates"))
    {
    int n = _mkdir("duplicates");
    if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
    }

    static char renamed[1000];
    int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

    if(rename(name, renamed))
    {printf("\n rename %s %s failed", name, renamed); exit(-1);}

    duplicates_moved++;

    }


i'm not sure whether some of these functions are slow, and there is an
element of redundancy in calling if(!FolderExist("duplicates"))
many times, as if it were a normal "ram based" function and not a disk-related one
- but it's probably okay i guess (and this disk-related function i hope
does not really hit the disk but hopefully only reads some cached info about it)
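
If that repeated stat() check ever turns out to matter, one option (a sketch, not from the posted code) is to create the subfolder at most once per run and remember that in a static flag:

#include <stdio.h>
#include <stdlib.h>
#include <direct.h>   /* _mkdir */
#include <errno.h>

/* create the "duplicates" subfolder at most once per run, instead of
   calling FolderExist()/stat() for every file that gets moved */
static void EnsureDuplicatesFolderOnce(void)
{
    static int done = 0;
    if (done) return;
    if (_mkdir("duplicates") != 0 && errno != EEXIST)
    { printf("\n i cannot create subfolder"); exit(-1); }
    done = 1;
}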

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 14:46:12 2024
    Bart wrote:
[...]

    But if you're going to reject everyone's suggestions in favour of your
    own already working solution, then I wonder why you bothered posting.

    (I didn't post after all because I knew it would be futile.)


i want to discuss, not to do everything that is mentioned.. is that hard to
understand? so i may read about the options, but i literally have no time to
implement even the good ideas - this program i wrote showed itself to work and i'm
now using it

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 14:48:11 2024
    fir wrote:
    Bart wrote:
[...]


i want to discuss, not to do everything that is mentioned.. is that hard to
understand? so i may read about the options, but i literally have no time to
implement even the good ideas - this program i wrote showed itself to work and i'm
now using it

also note i posted a whole working program while some others just say what
could be done... working code was my main goal, not really entering a
contest of what is fastest (that is also an interesting topic but not the
main goal)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:06:39 2024
    fir wrote:
    fir wrote:
    Bart wrote:
[...]

also note i posted a whole working program while some others just say what
could be done... working code was my main goal, not really entering a
contest of what is fastest (that is also an interesting topic but not the
main goal)

an interesting thing is yet how it works in the system...
i'm used to writing cpu-intensive applications and to controlling frame
times and cpu usage.. but generally i have never written disk-based apps

this one here is the first.. i use sysinternals on windows, and when i run
this prog it has like 3 stages
1) read the directory info (if it's big, like 30k files, it may take some time)
2) the square part that reads file contents, compares, and sets duplicate flags on the list
3) the rename part - i mean i call the "rename" function on duplicates

most of the time is taken by the square part, and the disk usage indicator is
full; cpu usage is 50%, which probably means one core is fully used

the disk indicator in the tray shows (in the square phase) something like
R: 1.6 GB
O: 635 KB
W: 198 B

i don't know what it is; R is for read for sure and W is for write, but what
is it exactly?


there is also the question whether closing or killing the program in those phases
may cause some disc damage? - as for most of the time in the square phase it
only does reads, i'm quite sure that closing in the read phase should not cause any
errors - but i'm not so sure about the renaming phase

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:26:49 2024
    fir wrote:
    Bart wrote:
[...]



yet to say something about the efficiency

when i observe how it works - this program is square in the sense that it has
a half-square loop over the directory file list, so it may be like
20k*20k/2 - 20k comparisons, but mostly it only compares sizes, so
i'm not sure how serious this kind of squareness is.. are 200M int comparisons
a problem? - maybe they become one for larger sets

in the sense of real binary comparisons it is not fully square but
more like sets of smaller squares on the diagonal of this large square,
if you (some) know what i mean... and that may be a problem, because
if among those 20k files 100 have the same size, then it makes about 100x100 full
loads and 100x100 full binary byte-to-byte compares, which
is practically the full square if there are indeed 100 duplicates
(maybe it's less than 100x100, as at the first finding of a duplicate i mark it
as a duplicate and then skip it in the loop)

but indeed it shows in practice that for folders bigger than 3k
files it slows down, probably disproportionately, so optimisation is
at hand / needed for large folders

that's from observing it



but as i said, i mainly wanted this done to free some space from
these recovered, somewhat junk files.. and having it in the partially square
way is more important than having it optimised

it works, and if i see it slowing down on large folders i can split those
big folders into a few of about 3k files each and run this duplicate mover in each one

more hand work, but it can be done by hand

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:51:24 2024
    fir wrote:

this program has yet one pleasant property as it works

    #5 f1795800624.bmp (589878) has duplicate #216 f1840569816.bmp (589878)
    #6 f1795801784.bmp (589878) has duplicate #217 f1840570976.bmp (589878)
    #7 f1795802944.bmp (589878) has duplicate #218 f1840572136.bmp (589878)
    #8 f1795804112.bmp (589878) has duplicate #219 f1840573296.bmp (589878)
    #9 f1795805272.bmp (589878) has duplicate #220 f1840574456.bmp (589878)

and those numbers on the left go from #1 to, say, #3000 (the last);
as it marks duplicates "forward", i mean if #8 has duplicate #218 then
#218 is marked as a duplicate and excluded in both loops (the outside/row
one and the inside/element one), so the scanning speeds up the further
it goes, and it's not linear like when copying files

it's nicely pleasant

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From fir@21:1/5 to Bart on Sun Sep 22 16:32:05 2024
    fir wrote:
    fir wrote:
    Bart wrote:
[...]

however, saying that, the checksumming/hashing idea is of course kinda good
(the sorting probably less so, as it may be a bit harder to write - i'm never
sure whether my old hand-written quicksort code is error-free; i once tested like 30
quicksort versions in my life trying to rewrite it, and once i got some
mistake into this code and was later never strictly sure whether the version i
finally got is good - it's probably good, but i'm not sure)

but i would need to be sure that my own way of hashing has
practically no chance of generating the same hash for different files..
and i have never done these things, so i haven't thought it through.. and now it's a
side thing, possibly not worth studying

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From DFS@21:1/5 to Lawrence D'Oliveiro on Sun Sep 22 17:11:02 2024
    On 9/21/2024 10:06 PM, Lawrence D'Oliveiro wrote:
    On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:

... you just need to read all files in a
folder and compare each byte by byte to the other files in the folder of the
same size

    For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons. That’s an O(N²) algorithm.


for (i = 0; i < N; i++) {
    for (j = i+1; j < N; j++) {
        ... byte-byte compare file i to file j
    }
}


    For N = 10, 45 byte-byte comparisons would be made (assuming all files
    are the same size)



    There is a faster way.

    Calc the checksum of each file once, then compare the checksums as above?

    Which is still an O(N^2) algorithm, but I would assume it's faster than
    45 byte-byte comparisons.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
• From Josef Möllers@21:1/5 to fir on Tue Oct 1 16:34:47 2024
    On 21.09.24 20:53, fir wrote:


i think of writing a simple command-line program
that removes duplicates in a given folder
    [...]

I have had the same problem. My solution was to use extended file
attributes and a file checksum, e.g. sha512sum; also, I wrote this in
PERL (see code below). Using the file attributes, I can re-run the
program after a while without having to re-calculate the checksums.
So, this solution only works for filesystems that have extended file
attributes, but you could also use some simple database (sqlite3?) to
map checksums to pathnames.

What I did was to walk through the directory tree and check whether the file
being considered already has a checksum in an extended attribute. If
not, I calculate the checksum and store it in the extended
attribute. Also, I store the pathname in a hash (remember, this is
PERL), keyed by the checksum.
If there is a collision (checksum already in the hash), I remove the new
file (and link the new filename to the old file). One could be paranoid
and do a byte-by-byte file comparison then.

If I needed to do this in a C program, I'd probably use a GList to store
the hash, but otherwise the code logic would be the same.

    HTH,

    Josef

    #! /usr/bin/perl

    use warnings;
    use strict;
    use File::ExtAttr ':all'; # In case of problems, maybe insert "use Scalar:Utils;" in /usr/lib/x86_64-linux-gnu/perl5/5.22/File/ExtAttr.pm
    use Digest::SHA;
    use File::Find;
    use Getopt::Std;

    # OPTIONS:
    # s: force symlink
# n: don't do the actual removing/linking
    # v: be more verbose
    # h: print short help
    my %opt = (
    s => undef,
    n => undef,
    v => undef,
    h => undef,
    );
    getopts('hnsv', \%opt);

    if ($opt{h}) {
    print STDERR "usage: lndup [-snvh] [dirname..]\n";
    print STDERR "\t-s: use symlink rather than hard link\n";
    print STDERR "\t-n: don't remove/link, just show what would be done\n";
    print STDERR "\t-v: be more verbose (show pathname and SHA512 sum\n";
    print STDERR "\t-h: show this text\n";
    exit(0);
    }

    my %file;

    if (@ARGV == 0) {
    find({ wanted => \&lndup, no_chdir => 1 }, '.');
    } else {
    find({ wanted => \&lndup, no_chdir => 1 }, @ARGV);
    }

    # NAME: lndup
    # PURPOSE: To handle a single file
    # ARGUMENTS: None, pathname is taken from $File::Find::name
    # RETURNS: Nothing
    # NOTE: The SHA512 sum of a file is calculated.
    # IF a file with the same sum was already found earlier, AND
    # iF both files are NOT the same (hard link) AND
    # iF both files reside on the same disk
    # THEN the second occurrence is removed and
    # replaced by a link to the first occurrence
    sub lndup {
    my $pathname = $File::Find::name;

    return if ! -f $pathname;
    if (-s $pathname) {
    my $sha512sum = getfattr($pathname, 'SHA512');
    if (!defined $sha512sum) {
    my $ctx = Digest::SHA->new(512);
    $ctx->addfile($pathname);
    $sha512sum = $ctx->hexdigest;
    print STDERR "$pathname $sha512sum\n" if $opt{v};
    setfattr($pathname, "SHA512", $sha512sum);
    } elsif ($opt{v}) {
    print STDERR "Using sha512sum from attributes\n";
    }

    if (exists $file{$sha512sum}) {
    if (!same_file($pathname, $file{$sha512sum})) {
    my $links1 = (stat($pathname))[3];
    my $links2 = (stat($file{$sha512sum}))[3];
# If one of them is a symbolic link, make sure it's $pathname
    if (is_symlink($file{$sha512sum})) {
    print STDERR "Swapping $pathname and
    $file{$sha512sum}\n" if $opt{v};
    swap($file{$sha512sum}, $pathname);
    }
    # If $pathname has more links than $file{$sha512sum},
    # exchange the two names.
    # This ensures that $file{$sha512sum} has the most links.
    elsif ($links1 > $links2) {
    print STDERR "Swapping $pathname and
    $file{$sha512sum}\n" if $opt{v};
    swap($file{$sha512sum}, $pathname);
    }

    print "rm \"$pathname\"; ln \"$file{$sha512sum}\" \"$pathname\"\n";
    if (! $opt{n}) {
    my $same_disk = same_disk($pathname,
    $file{$sha512sum});
    if (unlink($pathname)) {
    if (! $same_disk || $opt{s}) {
    symlink($file{$sha512sum}, $pathname) ||
    print STDERR "Failed to symlink($file{$sha512sum}, $pathname): $!\n";
    } else {
    link($file{$sha512sum}, $pathname) || print
    STDERR "Failed to link($file{$sha512sum}, $pathname): $!\n";
    }
    } else {
    print STDERR "Failed to unlink $pathname: $!\n";
    }
    }
    # print "Removing $pathname\n";
    # unlink $pathname or warn "$0: Cannot remove $_: $!\n";

    }
    } else {
    $file{$sha512sum} = $pathname;
    }
    }
    }

    # NAME: same_disk
    # PURPOSE: To check if two files are on the same disk
    # ARGUMENTS: pn1, pn2: pathnames of files
    # RETURNS: true if files are on the same disk, else false
    # NOTE: The check is made by comparing the device numbers of the
    # filesystems of the two files.
    sub same_disk {
    my ($pn1, $pn2) = @_;

    my @s1 = stat($pn1);
    my @s2 = stat($pn2);

    return $s1[0] == $s2[0];
    }

    # NAME: same_file
    # PURPOSE: To check if two files are the same
    # ARGUMENTS: pn1, pn2: pathnames of files
    # RETURNS: true if files are the same, else false
    # NOTE: files are the same if device number AND inode number
    # are identical
    sub same_file {
    my ($pn1, $pn2) = @_;

    my @s1 = stat($pn1);
    my @s2 = stat($pn2);

    return ($s1[0] == $s2[0]) && ($s1[1] == $s2[1]);
    }

    sub is_symlink {
    my ($path) = @_;

    return -l $path;
    }

    sub swap {
    my $tmp;
    $tmp = $_[0];
    $_[0] = $_[1];
    $_[1] = $tmp;
    }
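
A minimal C sketch of the same checksum-to-pathname map (the "GList to store the hash" idea above), assuming GLib is available; it reads whole files into memory for hashing and only reports candidates, so a byte-by-byte confirmation would still be advisable before removing anything:

/* build with: gcc dedup_sketch.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

/* map: SHA-512 hex digest -> first pathname seen with that digest */
static void handle_file(GHashTable *seen, const char *path)
{
    gchar *data = NULL;
    gsize len = 0;
    if (!g_file_get_contents(path, &data, &len, NULL))
        return;   /* unreadable file: skip it */

    gchar *digest = g_compute_checksum_for_data(G_CHECKSUM_SHA512,
                                                (const guchar *)data, len);
    g_free(data);

    const char *first = g_hash_table_lookup(seen, digest);
    if (first) {
        printf("duplicate candidate: %s == %s\n", path, first);
        g_free(digest);
    } else {
        /* the table takes ownership of digest and the copied path */
        g_hash_table_insert(seen, digest, g_strdup(path));
    }
}

int main(void)
{
    GHashTable *seen = g_hash_table_new_full(g_str_hash, g_str_equal,
                                             g_free, g_free);
    /* a real tool would walk the directory tree here and call
       handle_file() for every regular file it finds */
    handle_file(seen, "example.bin");
    g_hash_table_destroy(seen);
    return 0;
}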

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to josef@invalid.invalid on Tue Oct 1 20:38:23 2024
    In article <lm2fk7FpccjU1@mid.individual.net>,
Josef Möllers <josef@invalid.invalid> wrote:
    ...
    I have had the same problem. My solution was to use extended file
    attributes and some file checksum, eg sha512sum, also, I wrote this in
    PERL (see code below). Using the file attributes, I can re-run the

    And is thus entirely OT here. Keith will tell the same.

    --
    "You can safely assume that you have created God in your own image when
    it turns out that God hates all the same people you do." -- Anne Lamott

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)