• Re: backup of backup or alternating backups?

    From Henrik Ahlgren@21:1/5 to Default User on Mon Sep 30 19:00:02 2024
    On Mon, 2024-09-30 at 12:39 -0400, Default User wrote:
    But of course, any errors on drive A propagate daily to drive B.

    Having both drives connected and spinning simultaneously creates a
    window of opportunity for some nasty ransomware (or a software bug,
    mistake, power surge, whatever) to destroy both backups. Of course it
    is safer to always have one copy offline.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to Default User on Mon Sep 30 20:30:01 2024
    On Mon, 30 Sep 2024, Default User wrote:

    Hi!

    On a thread at another mailing list, someone mentioned that they, each
    day, alternate doing backups between two external usb drives. That got
    me to thinking (which is always dangerous) . . .

    I have a full backup on usb external drive A, "refreshed" daily using rsnapshot. Then, every day, I use rsync to make usb external drive B an "exact" copy of usb external drive A. It seemed to be a good idea,
    since if drive A fails, I can immediately plug in drive B to replace
    it, with no down time, and nothing lost.
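
    (Concretely, such a mirror step is roughly the following; the mount
    points are placeholders:)

      # make drive B an exact copy of drive A; -H preserves the hard links
      # rsnapshot creates between snapshots, --delete removes anything that
      # is no longer on A
      rsync -aH --delete /mnt/backup-a/ /mnt/backup-b/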

    But of course, any errors on drive A propagate daily to drive B.

    So, is there a consensus on which would be better: 
    1) continue to "mirror" drive A to drive B?
    or,
    2) alternate backups daily between drives A and B?


    IMO it can take days, weeks even months to discover that something has
    got corrupted and/or deleted in error.

    I don't think either strategy is "better", they have different pros and
    cons. But in particular, your strategy doesn't require both drives to be
    online at once and at least gives you a one day window to discover that
    you've synced corruption.

    I think my strategy would be something more akin to the following (I
    think rsnapshot can do this but I've not actually used it)

    1. Alternate disks (as you are doing)
    2. Create a new directory YYYYMMDD and backup into that directory,
    creating hard links to the files from the previous backup (two days
    before)
    3. Delete the oldest directories as/when you start running out of space.
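
    A rough, untested sketch of that with plain rsync and --link-dest
    (mount points, source path and retention count are all made up):

      #!/bin/bash
      set -eu

      # 1. Alternate disks: even days since the epoch -> A, odd days -> B.
      if (( $(date +%s) / 86400 % 2 == 0 )); then
          dest=/mnt/backup-a
      else
          dest=/mnt/backup-b
      fi

      today=$(date +%Y%m%d)
      # Newest dated directory already on this disk (from two days before).
      prev=$(ls -1d "$dest"/2???????/ 2>/dev/null | tail -n 1)

      # 2. Back up into a new YYYYMMDD directory, hard-linking files that
      #    are unchanged relative to the previous backup on this disk.
      rsync -aH ${prev:+--link-dest="$prev"} /home/ "$dest/$today/"

      # 3. Prune the oldest directories as space runs out, e.g. keep 30.
      ls -1d "$dest"/2???????/ | head -n -30 | xargs -r rm -rf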


    On a slight tangent, how does rsnapshot deal with ext4 uninited extents?
    These are subtly different to sparse files, they're still not written to
    disk but the disk blocks are explicitly reserved for the file:

    truncate (sparse file) vs fallocate (blocks reserved)

    I've noticed that, at least on bookworm, lseek for SEEK_HOLE/SEEK_DATA
    treats fallocate as a hole similar to a sparse file. I haven't tested
    tar with the --sparse option but I suspect it will treat the two types
    of file the same too.
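
    A quick way to set the two cases up side by side for testing (file
    names are arbitrary):

      truncate  -s 1G sparse.img     # hole only: no blocks reserved
      fallocate -l 1G prealloc.img   # uninited extents: blocks reserved, not written

      du -h --apparent-size sparse.img prealloc.img   # both report 1G
      du -h sparse.img prealloc.img                   # on-disk usage differs
      filefrag -v prealloc.img       # on ext4 the extents are flagged "unwritten"

    How lseek(SEEK_DATA), tar --sparse or rsnapshot handle the second file
    is exactly the open question.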

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jonathan Dowland@21:1/5 to Default User on Mon Sep 30 23:00:01 2024
    On Mon Sep 30, 2024 at 5:39 PM BST, Default User wrote:
    So, is there a consensus on which would be better:
    1) continue to "mirror" drive A to drive B?
    or,
    2) alternate backups daily between drives A and B?

    I'd go for (2), especially if you're continuing to do daily backups, so
    the older backup with an alternating pattern is at most 2 days old.

    I do daily backups to a permanently-connected drive (I used to use
    rsnapshot for that, then rdiff-backup, now I use borg); monthly syncs of
    that drive to one of two external drives (rsync of the borg repository),
    which live off-site and I alternate those. If I lose my
    permanently-connected backup drive, one of the external drives is a
    month old and the other two months old, which I am happy with.


    --
    Please do not CC me for listmail.

    πŸ‘±πŸ» Jonathan Dowland
    ✎ jmtd@debian.org
    πŸ”— https://jmtd.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Kjörling@21:1/5 to All on Mon Sep 30 23:40:01 2024
    On 30 Sep 2024 13:12 -0400, from hunguponcontent@gmail.com (Default User):
    Having both drives connected and spinning simultaneously creates a
    window of opportunity for some nasty ransomware (or a software bug,
    mistake, power surge, whatever) to destroy both backups.

    Also why I would not want all backup-storage devices connected
    simultaneously. All it takes is one piece of software going haywire
    and you may have a situation where both the original and all backups
    are corrupted simultaneously.


    Of course it is safer to always have one copy offline.

    True. But easier (and cheaper) said than done. [...]

    Not at all. Backup to one of those external drives one day; the other
    one the next; the first one the day after that; and so on.

    It seems to me that you already have everything you need to remove
    this particular failure mode. You just need to tweak your usage
    slightly.

    I do that myself, except I switch approximately weekly rather than
    daily.

    --
    Michael Kjörling 🔗 https://michael.kjorling.se
    “Remember when, on the Internet, nobody cared that you were a dog?”

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Christensen@21:1/5 to Default User on Tue Oct 1 03:20:01 2024
    On 9/30/24 09:39, Default User wrote:
    Hi!

    On a thread at another mailing list, someone mentioned that they, each
    day, alternate doing backups between two external usb drives. That got
    me to thinking (which is always dangerous) . . .

    I have a full backup on usb external drive A, "refreshed" daily using rsnapshot. Then, every day, I use rsync to make usb external drive B an "exact" copy of usb external drive A. It seemed to be a good idea,
    since if drive A fails, I can immediately plug in drive B to replace
    it, with no down time, and nothing lost.

    But of course, any errors on drive A propagate daily to drive B.

    So, is there a consensus on which would be better:
    1) continue to "mirror" drive A to drive B?
    or,
    2) alternate backups daily between drives A and B?


    I migrated my data to a dedicated ZFS file server several years ago, in
    part due to advanced ZFS backup features -- snapshots, compression, de-duplication, replication, etc. I used FreeBSD, but Debian has ZFS
    and should be able to do the same thing.


    My live server has a ZFS pool with two striped mirrors of two 3 TB HDD's
    each and a special mirror of two 180 GB SSD's:

    2024-09-30 16:44:38 toor@f5 ~
    # zpool iostat -v p5
                                      capacity     operations     bandwidth
    pool                            alloc   free   read  write   read  write
    ------------------------------  -----  -----  -----  -----  -----  -----
    p5                              3.19T  2.39T     49      2  28.4M  69.2K
      mirror-0                      1.58T  1.14T     21      0  14.0M  10.7K
        gpt/hdd1.eli                    -      -      8      0  6.99M  5.35K
        gpt/hdd2.eli                    -      -     12      0  6.99M  5.35K
      mirror-1                      1.58T  1.13T     20      0  14.0M  10.4K
        gpt/hdd3.eli                    -      -     10      0  7.00M  5.20K
        gpt/hdd4.eli                    -      -      9      0  7.00M  5.20K
    special                             -      -      -      -      -      -
      mirror-2                      29.4G   120G      7      2   408K  48.1K
        gpt/ssd1.eli                    -      -      3      1   204K  24.1K
        gpt/ssd2.eli                    -      -      3      1   204K  24.1K
    ------------------------------  -----  -----  -----  -----  -----  -----


    The 'special' SSD mirror stores metadata, which improves overall
    performance.


    I create ZFS filesystems for groups of data -- Samba users, CVS
    repository, rsync(1) backups of various non-ZFS filesystems, raw disk
    image backups, etc.


    ZFS has various properties that you can tune for each filesystem. Here
    is the filesystem for my Samba data:

    2024-09-30 16:50:07 toor@f5 ~
    # zfs get all p5/samba/dpchrist | sort | egrep 'NAME|inherited'
    NAME               PROPERTY               VALUE                      SOURCE
    p5/samba/dpchrist  atime                  off                        inherited from p5
    p5/samba/dpchrist  com.sun:auto-snapshot  true                       inherited from p5
    p5/samba/dpchrist  compression            on                         inherited from p5
    p5/samba/dpchrist  dedup                  verify                     inherited from p5
    p5/samba/dpchrist  mountpoint             /var/local/samba/dpchrist  inherited from p5/samba
    p5/samba/dpchrist  special_small_blocks   16K                        inherited from p5


    'atime' is off to eliminate metadata writes when files and directories
    are read.


    'com.sun:auto-snapshot' is true so that zfs-auto-snapshot(8) run via
    crontab(1) will find this filesystem, take snapshots periodically
    (daily, monthly, yearly), and manage (prune) those snapshots:

    2024-09-30 16:54:00 toor@f5 ~
    # crontab -l
    9 3 * * * /usr/local/sbin/zfs-auto-snapshot -k d 40
    21 3 1 * * /usr/local/sbin/zfs-auto-snapshot -k m 99
    27 3 1 1 * /usr/local/sbin/zfs-auto-snapshot -k y 99


    I currently have 96 snapshots (i.e. backups) of the above filesystem
    going back three and a half years:

    2024-09-30 16:59:48 dpchrist@f5 ~
    $ ls -d /var/local/samba/dpchrist/.zfs/snapshot/zfs-auto-snap_[dmy]* | wc -l
    96

    2024-09-30 17:01:12 dpchrist@f5 ~
    $ ls -dt /var/local/samba/dpchrist/.zfs/snapshot/zfs-auto-snap_[dmy]* | tail -n 1
    /var/local/samba/dpchrist/.zfs/snapshot/zfs-auto-snap_m-2020-03-01-00h21


    'compression' is on so that compressible files are compressed. (The
    default compression algorithm will skip files that are incompressible.)


    'dedup' is on so that duplicate blocks are saved only once within the
    pool. De-duplication metadata is stored on the pool 'special' SSD
    mirror, which improves de-duplication performance.


    'special_small_blocks' is set to 16K so that files of size 16 KiB and
    smaller are stored on the pool 'special' SSD mirror, which improves
    small file read and write performance.
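
    (Those values all show as inherited from the pool root, so they would
    have been set once at the top -- roughly like this, as a sketch rather
    than the actual command history:)

      # zfs set atime=off p5
      # zfs set compression=on p5
      # zfs set dedup=verify p5
      # zfs set special_small_blocks=16K p5
      # zfs set com.sun:auto-snapshot=true p5
      # zfs set mountpoint=/var/local/samba p5/samba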


    I have a backup server with matching pool construction. I periodically replicate live server snapshots to the backup server (via SSH pull and pre-shared keys). I would like to automate this task.
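
    (That replication step amounts to something like the following, run on
    the backup server as a pull; the host, pool, and snapshot names here
    are placeholders:)

      # ssh root@live zfs send -R -I p5@2024-09-01 p5@2024-09-30 \
            | zfs receive -Fdu backuppool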


    Both servers have SATA HDD mobile rack bays:

    https://www.startech.com/en-us/hdd/drw150satbk


    I have a pair of 6 TB HDD's in corresponding mobile rack trays, one for near-site backups and one for off-site backups. Each HDD contains one ZFS
    pool. I periodically insert the near-site HDD into the backup server
    and replicate the live server snapshots to the removable HDD. I
    periodically rotate the near-site HDD and the off-site HDD.


    Be warned that ZFS has a non-trivial learning curve. I suggest the
    Lucas books if you are interested:

    https://mwl.io/nonfiction/os#fmzfs

    https://mwl.io/nonfiction/os#fmaz


    David

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Oct 1 16:20:01 2024
    Also why I would not want all backup-storage devices connected simultaneously. All it takes is one piece of software going haywire
    and you may have a situation where both the original and all backups
    are corrupted simultaneously.

    You can minimize this risk by having them both connected simultaneously
    but to different machines (this is also necessary if you want A and B to
    be in different physical locations, e.g. to survive disasters), and then
    make sure the machine which copies from A to B doesn't have write access
    to A.
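
    In practice that can be as simple as having the B machine pull (host
    name and paths here are illustrative):

      # run on the machine holding B; it only ever reads from A over SSH
      rsync -aH --delete hostA:/srv/backup-a/ /srv/backup-b/
      # a forced command such as "rrsync -ro /srv/backup-a" on hostA can
      # additionally restrict that SSH key to read-only access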


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Purgert@21:1/5 to Default User on Wed Oct 2 17:40:01 2024
    On Sep 30, 2024, Default User wrote:
    (...)
    So, is there a consensus on which would be better: 
    1) continue to "mirror" drive A to drive B?
    or,
    2) alternate backups daily between drives A and B?

    Primarily, I do (1); though every so often I do a variation of (2).

    Backups from all the PCs in the house go to drive "A" (a spare desktop
    in the basement playing server) as a daily process. This is performed
    with rsync in a cronjob on the PCs dumping to dated directories and
    symlinking the "current", so the next run just hardlinks anything that
    hasn't changed.
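
    That looks roughly like this per client (paths and the host name are
    placeholders, not the actual cron job):

      today=$(date +%F)
      rsync -aH --link-dest=/srv/backup/$HOSTNAME/current \
          /home/ "server:/srv/backup/$HOSTNAME/$today/"
      ssh server ln -sfn "$today" "/srv/backup/$HOSTNAME/current"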

    "Drive A" is backed up to "Drive B" (an external USB SSD; only mounted
    for the copy job, then unmounted afterwards). Every 6 months or so
    (yes, this should be more frequent, but meh) I do this with "Drive C",
    which otherwise lives at the parents' house.


    --
    |_|O|_|
    |_|_|O| Github: https://github.com/dpurgert
    |O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jonathan Dowland@21:1/5 to Default User on Sun Oct 6 20:50:01 2024
    On Wed Oct 2, 2024 at 12:33 AM BST, Default User wrote:
    May I ask why you decided to switch from rsnapshot to rdiff-backup, and
    then to borg?

    Sure!

    At the time I was using rsnapshot, I was subscribed to some very high
    traffic mailing lists (such as LKML), and storing the mail in Maildir
    format (=1 file per email). rsnapshot's design of lots of hardlinks for
    files that are present in more than one backup increment proved very
    expensive at the time (I switched to rdiff-backup in around 2006-2007).

    I have a lot of time for rdiff-backup, I think it's very well designed.
    It addressed the problem I had with rsnapshot, and the backup format is
    simple enough and well documented that you could feasibly write other
    tools to read from it, should you need to. That gave me confidence.

    The main issue I hit with rdiff-backup was if I wanted to move files
    or directories containing large files around on my storage: that
    resulted in the new locations being considered "new", and the next
    backup increment being comparatively large. This reduced the number
    of increments I could fit into my backup storage (= shorter horizon
    for restores, although I've never had to restore back in time a great
    deal), and I found I started limiting the amount of moving around I
    was doing of large file (size) trees, to avoid that happening.

    I switched to Borg in Summer 2020 mainly to address that (Borg
    de-duplicates files and stores them content-addressed, after a fashion,
    so file moves don't grow increment sizes). At the time rdiff-backup was
    not being actively developed; that has changed. I was nervous about
    Borg's significant increase in complexity, but I've been running it for
    four years now and it's been fine.


    --
    Please do not CC me for listmail.

    πŸ‘±πŸ» Jonathan Dowland
    ✎ jmtd@debian.org
    πŸ”— https://jmtd.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From eben@gmx.us@21:1/5 to Jonathan Dowland on Sun Oct 6 22:30:01 2024
    On 10/6/24 14:44, Jonathan Dowland wrote:
    On Wed Oct 2, 2024 at 12:33 AM BST, Default User wrote:
    May I ask why you decided to switch from rsnapshot to rdiff-backup, and
    then to borg?

    The main issue I hit with rdiff-backup was if I wanted to move files
    or directories containing large files around on my storage: that
    resulted in the new locations being considered "new", and the next
    backup increment being comparatively large.


    I use rdiff to do the backups on the "server" (its job is serving video
    content to the TV box over NFS) and ran into that problem, so what I did was write a series of scripts that relinked identical files. It's not perfect,
    I suspect there are still bugs. It tries to be efficient (by not comparing files that can't possibly be the same because they have different sizes, or
    are already linked), but it gets the job done. Eventually. Running it
    takes about as long as running the backup in the first place. But hey,
    we're talking about 1 GiB of filespace which might change by 10-20 MiB
    between backups, so not a big deal.
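
    The core of such a relinking pass, very roughly (not the actual
    scripts; paths are placeholders):

      # given two candidate paths in consecutive increments: skip pairs that
      # differ in size or are already the same inode, otherwise compare and
      # re-link
      old=prev/some/file; new=curr/some/file
      if [ "$(stat -c %s "$old")" -eq "$(stat -c %s "$new")" ] &&
         [ "$(stat -c %i "$old")" -ne "$(stat -c %i "$new")" ] &&
         cmp -s -- "$old" "$new"
      then
          ln -f -- "$old" "$new"   # replace the duplicate with a hard link
      fi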

    --
    We're standing there pounding a dead parrot on the counter,
    and the management response is to frantically swap in new counters
    to see if that fixes the problem.
    -- Peter Gutmann

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michel Verdier@21:1/5 to Jonathan Dowland on Mon Oct 7 10:40:08 2024
    On 2024-10-06, Jonathan Dowland wrote:

    At the time I was using rsnapshot, I was subscribed to some very high
    traffic mailing lists (such as LKML), and storing the mail in Maildir
    format (=1 file per email). rsnapshot's design of lots of hardlinks for files that are present in more than one backup increment proved very expensive at the time (I switched to rdiff-backup in around 2006-2007).

    Do you mean expensive in inodes? Which filesystem did you use?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jonathan Dowland@21:1/5 to Michel Verdier on Mon Oct 7 21:50:01 2024
    On Mon Oct 7, 2024 at 9:37 AM BST, Michel Verdier wrote:
    Do you mean expensive in inodes? Which filesystem did you use?

    It was 18 years ago so I can't remember that clearly, but I think it was
    a mixture of inodes expense and an enlarged amount of CPU time with the
    file churn (mails moved from new to cur, and later to a separate archive Maildir, that sort of thing). It was probably ext3 given the time.


    --
    Please do not CC me for listmail.

    πŸ‘±πŸ» Jonathan Dowland
    ✎ jmtd@debian.org
    πŸ”— https://jmtd.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jonathan Dowland@21:1/5 to eben on Mon Oct 7 21:50:01 2024
    On Sun Oct 6, 2024 at 9:24 PM BST, eben wrote:
    I use rdiff to do the backups on the "server" (its job is serving video content to the TV box over NFS) and ran into that problem, so what I did was write a series of scripts that relinked identical files. It's not perfect,
    I suspect there are still bugs. It tries to be efficient (by not comparing files that can't possibly be the same because they have different sizes, or are already linked), but it gets the job done. Eventually.

    That's a neat solution!



    --
    Please do not CC me for listmail.

    πŸ‘±πŸ» Jonathan Dowland
    ✎ jmtd@debian.org
    πŸ”— https://jmtd.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Ritter@21:1/5 to eben@gmx.us on Mon Oct 7 22:30:01 2024
    eben@gmx.us wrote:

    I use rdiff to do the backups on the "server" (its job is serving video content to the TV box over NFS) and ran into that problem, so what I did was write a series of scripts that relinked identical files. It's not perfect,
    I suspect there are still bugs. It tries to be efficient (by not comparing files that can't possibly be the same because they have different sizes, or are already linked), but it gets the job done. Eventually. Running it
    takes about as long as running the backup in the first place. But hey,
    we're talking about 1 GiB of filespace which might change by 10-20 MiB between backups, so not a big deal.


    Possibly of interest: Debian package rdfind:

    Description: find duplicate files utility
    rdfind is a program to find duplicate files and optionally list, delete
    them or replace them with symlinks or hard links. It is a command
    line program written in c++, which has proven to be pretty quick compared
    to its alternatives.
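
    e.g. to replace duplicates under a backup tree with hard links (path is
    illustrative; do a dry run first):

      rdfind -dryrun true /srv/backup
      rdfind -makehardlinks true /srv/backup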

    -dsr-

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From eben@gmx.us@21:1/5 to Dan Ritter on Mon Oct 7 22:50:01 2024
    On 10/7/24 16:06, Dan Ritter wrote:
    eben@gmx.us wrote:

    I use rdiff to do the backups on the "server" ... and ran into that
    problem, so what I did was write a series of scripts that relinked
    identical files.

    Possibly of interest: Debian package rdfind:

    Description: find duplicate files utility rdfind is a program to find duplicate files and optionally list, delete them or replace them with symlinks or hard links. It is a command line program written in c++,
    which has proven to be pretty quick compared to its alternatives.

    That's cool. I wonder if the apt subsystem on there still works. The installation's pretty old. It's off 90+% of the time and behind double-NAT
    the rest, so I'm not excessively worried.

    --
    I'm an apatheist. The question is no longer
    interesting, and the answer no longer matters.

    -- petro on ASR

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Charles Curley@21:1/5 to Jonathan Dowland on Tue Oct 8 04:00:02 2024
    On Mon, 07 Oct 2024 20:44:44 +0100
    "Jonathan Dowland" <jmtd@debian.org> wrote:

    It was 18 years ago so I can't remember that clearly, but I think it
    was a mixture of inodes expense and an enlarged amount of CPU time
    with the file churn (mails moved from new to cur, and later to a
    separate archive Maildir, that sort of thing). It was probably ext3
    given the time.

    Interesting.

    I've used rsnapshot for several years now with no such issue. My
    rsnapshot repository resides on ext4, on its own LVM logical volume, on
    top of an encrypted RAID 5 array on four four terabyte spinning rust
    drives.

    root@hawk:~# df /crc/rsnapshot/
    Filesystem                            Size  Used Avail Use% Mounted on
    /dev/mapper/hawk--vg--raid-rsnapshot  247G  179G   55G  77% /crc/rsnapshot
    root@hawk:~# df -i /crc/rsnapshot/
    Filesystem                           Inodes IUsed IFree IUse% Mounted on
    /dev/mapper/hawk--vg--raid-rsnapshot    16M  3.2M   13M   21% /crc/rsnapshot
    root@hawk:~#

    As you can see, I am not greatly worried about running out of inodes.

    I have 11G of mail, also in maildir format, to back up. Since the
    archive goes back a year, I probably have more than 11G in the archive.
    Plus other stuff: /etc, etc., etc..

    As for the churn, that should be less of an issue now than it might have
    been 18 years ago, even though my motherboard dates to 2015. I
    definitely notice it (the hard drive activity light if nothing else),
    but it doesn't slow me down at all.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From tomas@tuxteam.de@21:1/5 to Jonathan Dowland on Tue Oct 8 06:40:01 2024
    On Mon, Oct 07, 2024 at 08:44:44PM +0100, Jonathan Dowland wrote:
    On Mon Oct 7, 2024 at 9:37 AM BST, Michel Verdier wrote:
    Do you mean expensive in inodes? Which filesystem did you use?

    It was 18 years ago so I can't remember that clearly, but I think it was
    a mixture of inodes expense and an enlarged amount of CPU time with the
    file churn (mails moved from new to cur, and later to a separate archive Maildir, that sort of thing). It was probably ext3 given the time.

    Note that the transition to Ext4 must have been around 2006, making
    huge directories viable (HTree). So perhaps this is a factor too.

    Cheers
    --
    t


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michel Verdier@21:1/5 to Jonathan Dowland on Tue Oct 8 10:50:02 2024
    On 2024-10-07, Jonathan Dowland wrote:

    It was 18 years ago so I can't remember that clearly, but I think it was
    a mixture of inodes expense and an enlarged amount of CPU time with the
    file churn (mails moved from new to cur, and later to a separate archive Maildir, that sort of thing). It was probably ext3 given the time.

    OK, I see. I use nnml so I have no new/cur moves. Also I add the dateext
    parameter for logrotate so old logs keep the same name.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Smith@21:1/5 to Michel Verdier on Tue Oct 8 20:00:01 2024
    Hi,

    On Tue, Oct 08, 2024 at 10:41:33AM +0200, Michel Verdier wrote:
    I add the dateext parameter for logrotate so old logs keep the same name.

    This is another drawback to the design of rsnapshot. It doesn't matter
    that the files in your backup retain the same path: if they differ in
    any way, you'll get a new copy.

    i.e. if you have a 1GiB log file /var/log/somelog and you append one
    byte to it, rsync will take care of only transferring one byte, but both
    the old 1GiB file and the new 1GiB-and-1-byte version will be stored in
    their entirety in your backups.

    Other backup systems would chunk these files and recognise that the vast
    majority of the new file is the same as the old file and only store
    those chunks once. But they would be more complicated than rsnapshot.

    "Differing at all" can also include mere metadata changes such as
    permissions, ownership or times, since all hardlinked versions of a file
    share all this metadata.
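
    You can see this directly by comparing inode numbers across two
    snapshot directories after touching a large file (the snapshot root and
    backup-point names below depend on rsnapshot.conf and are invented):

      printf x >> /var/log/somelog      # change the 1GiB file by one byte
      rsnapshot daily                   # assumes a "daily" retain level
      ls -i /srv/rsnapshot/daily.1/localhost/var/log/somelog \
            /srv/rsnapshot/daily.0/localhost/var/log/somelog
      # different inode numbers: two complete, unshared copies on disk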

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Smith@21:1/5 to Charles Curley on Wed Oct 9 00:20:01 2024
    Hi,

    On Mon, Oct 07, 2024 at 07:52:55PM -0600, Charles Curley wrote:
    I've used rsnapshot for several years now with no such issue. My
    rsnapshot repository resides on ext4, on its own LVM logical volume, on
    top of an encrypted RAID 5 array on four four terabyte spinning rust
    drives.

    root@hawk:~# df -i /crc/rsnapshot/
    Filesystem                           Inodes IUsed IFree IUse% Mounted on
    /dev/mapper/hawk--vg--raid-rsnapshot    16M  3.2M   13M   21% /crc/rsnapshot

    This really isn't that much data and you have four drives to spread
    random reads across, so I'm not surprised that you don't really feel it
    yet.

    When you have hundreds of millions of files in rsnapshot it really
    starts to hurt because every backup run involves:

    - Deleting the oldest tree of files;
    - Walking the entire tree of the most recent backup once to cp -l it and
    then;
    - Walking it all again when rsync compares the new data to your previous
    iteration.

    Worse, it's all small, largely random IO which is worst case for
    spinning media. It easily gets to the point where the copy and compare
    steps take much longer than the actual data transfer.

    Other backup solutions get better performance by using some sort of
    index, manifest or other database, not just by walking every inode in
    the filesystem. But they are then more complicated.

    This rsnapshot I have is really quite slow with only two 7200rpm HDDs.
    It spends way longer walking its data store than actually backing up any
    data. I could definitely make it speedier by switching to something
    else. But I like rsnapshot for this particular case.

    $ sudo find /data/backup/rsnapshot -print0 | grep -zc '.'
    202326554

    (This is a btrfs filesystem which doesn't report an inode count with df
    -i)

    Although it probably matters most how many files you have only in the
    most recent backup iteration rather than the entire rsnapshot store. For
    me that is approx 5.8 million.

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Wright@21:1/5 to tomas@tuxteam.de on Wed Oct 9 04:30:01 2024
    On Tue 08 Oct 2024 at 06:37:43 (+0200), tomas@tuxteam.de wrote:
    On Mon, Oct 07, 2024 at 08:44:44PM +0100, Jonathan Dowland wrote:
    On Mon Oct 7, 2024 at 9:37 AM BST, Michel Verdier wrote:
    Do you mean expensive in inodes? Which filesystem did you use?

    It was 18 years ago so I can't remember that clearly, but I think it was
    a mixture of inodes expense and an enlarged amount of CPU time with the file churn (mails moved from new to cur, and later to a separate archive Maildir, that sort of thing). It was probably ext3 given the time.

    Note that the transition to Ext4 must have been around 2006, making
    huge directories viable (HTree). So perhaps this is a factor too.

    Perhaps you're on the inside track with respect to Debian.
    I didn't use ext4 at all until it was added to the squeeze
    installer (Feb 2011), and only when I was sure that a lenny
    ext3 installation would not need to read a file from a
    squeeze-written ext4 partition on the same machine.

    I think I eliminated my last ext3 partition at the end of 2014.
    (I was extremely conservative with my laptop during part of
    2013/2014, as I was totally reliant on this sole machine to be
    trouble-free.)

    Cheers,
    David.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michel Verdier@21:1/5 to Andy Smith on Wed Oct 9 11:00:01 2024
    On 2024-10-08, Andy Smith wrote:

    When you have hundreds of millions of files in rsnapshot it really
    starts to hurt because every backup run involves:

    - Deleting the oldest tree of files;

    rsnapshot can rename it out of the way and delete it after the backup is
    done, so the deletion only involves the backup system.
    - Walking the entire tree of the most recent backup once to cp -l it and
    then;

    rsnapshot only renames directories when rotating backups then does rsync
    with hard links to the newest

    This rsnapshot I have is really quite slow with only two 7200rpm HDDs.
    It spends way longer walking its data store than actually backing up any data. I could definitely make it speedier by switching to something
    else. But I like rsnapshot for this particular case.

    On 7200rpm HDDs I was using xfs over RAID1 and the slowest/blocking part
    was the deletion

    Although it probably matters most how many files you have only in the
    most recent backup iteration rather than the entire rsnapshot store. For
    me that is approx 5.8 million.

    I don't remember exactly, but I was probably around your volume.
    rsync uses metadata so it also depends on the filesystem. Some are
    quicker. I think metadata is quite like the index used by other backup
    systems.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Smith@21:1/5 to Michel Verdier on Wed Oct 9 12:40:01 2024
    Hi,

    On Wed, Oct 09, 2024 at 10:57:12AM +0200, Michel Verdier wrote:
    On 2024-10-08, Andy Smith wrote:

    When you have hundreds of millions of files in rsnapshot it really
    starts to hurt because every backup run involves:

    - Deleting the oldest tree of files;

    rsnapshot can rename it out of the way and delete it after the backup is
    done, so the deletion only involves the backup system.

    Yes but this is still a necessary part of each backup cycle. You can't
    do another backup run while that job is still outstanding and the load
    it puts on the system is still there regardless of the timing within the
    backup procedure.

    - Walking the entire tree of the most recent backup once to cp -l it and
    then;

    rsnapshot only renames directories when rotating backups then does rsync
    with hard links to the newest

    Okay yes when you set link_dest to 1 in rsnapshot.conf then rsync will
    do that bit during its run, but having to hard link a directory tree of
    5 million files is not speedy. Other backup designs do not do this
    because they don't need to take any form of copies of what is already
    there. The point is that this step is "compare and hard link if
    unchanged" whereas usually it is "compare and do nothing if unchanged".
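
    (For reference, that is the following line in rsnapshot.conf, where
    fields must be separated by tabs:)

      link_dest	1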

    rsync uses metadata so it also depends on the filesystem. Some are
    quicker. I think metadata is quite like the index used by other backup systems.

    The big difference is that to read the metadata of a tree of files in
    the filesystem you have to walk through all the inodes which is a lot of
    small random access.

    70 years of database theory has tried to make queries efficient and
    minimise random access, maximise cache locality etc. Otherwise all
    databases would just be filesystems!

    Like I say I like and use rsnapshot in some places, but speed and
    resource efficiency are not its winning points.

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Oct 9 17:00:02 2024
    Like I say I like and use rsnapshot in some places, but speed and
    resource efficiency are not its winning points.

    I have never used Rsnapshot, but I used Rsync backups for many years and
    then moved to Bup. The time to perform backups has been *very*
    substantially shortened by moving to Bup.

    The size of the backup repository is also nicely reduced (probably a mix
    of compression and of deduplication between files on different hosts
    that are backed up to the same repository).

    It is also much less demanding on the backup server, both in terms of
    RAM use and CPU time (I use low-power SBCs for that job).

    A full restore from Bup can be fairly slow, OTOH. Luckily, I've only
    ever had to fetch a few files from the backup (via `fuse`), but it does
    make it more costly to *use* your backup (e.g. I used to have a script
    which tracked the size of the last set of backed up files, as a way to
    detect unexpected changes in this size, but that is now impractical).
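
    For reference, a minimal Bup cycle plus the fuse-style restore looks
    roughly like this (repository path and file names are made up, and the
    exact layout under the mount point may differ):

      export BUP_DIR=/srv/bup
      bup init                      # once, to create the repository
      bup index /home               # scan for changed files
      bup save -n home /home        # store a new "home" backup set

      # pull a few files back out through the FUSE view of the repository
      mkdir -p /mnt/bup && bup fuse /mnt/bup
      cp /mnt/bup/home/latest/home/me/notes.txt /tmp/
      fusermount -u /mnt/bup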


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)