• Unused blocks and fstrim

    From Steve Keller@21:1/5 to All on Fri Sep 20 11:30:01 2024
    I'd like to understand some technical details about how fstrim, file
    systems, and block devices work.

    Do ext4 and btrfs keep a list of blocks that have already been reported as
    unused, or do they have to report all unused blocks to the block device
    layer every time the fstrim command is issued?

    Does LVM keep information on every block about its usage or does it always
    have to pass trim operations to the lower layer?

    And does software RAID, i.e. /dev/md*, keep this information on every block?
    Can RAID skip syncing unused blocks in a RAID-1 array when I replace a disk?

    Steve

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to Steve Keller on Fri Sep 20 12:10:01 2024
    On Fri, 20 Sep 2024, Steve Keller wrote:

    I'd like to understand some technical details about how fstrim, file
    systems, and block devices work.

    Do ext4 and btrfs keep a list of blocks that have already been reported as
    unused, or do they have to report all unused blocks to the block device
    layer every time the fstrim command is issued?

    Does LVM keep information on every block about its usage or does it always have to pass trim operations to the lower layer?

    And does software RAID, i.e. /dev/md* keep this information on every block? Can RAID skip unused blocks from syncing in a RAID-1 array when I replace a disk?

    Steve


    By default, iSCSI, md, LVM, and ext2 do not keep this information. I don't
    know if it's configurable somewhere, but I suspect not. I don't know about
    btrfs.

    Some of this data is cached, but not between reboots.
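    A quick way to check which layers will pass discards down (though not what
    has already been trimmed) is lsblk's discard columns; non-zero
    DISC-GRAN/DISC-MAX means the device accepts discard requests:

```shell
# List discard capabilities for every block device in the stack.
# A zero DISC-GRAN/DISC-MAX on a dm/md device means discard requests
# stop at that layer instead of being passed down to the disk.
lsblk --discard
```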

    The raid rebuild is a particular pain point IMO. It's important to do a
    discard after a failed-disk rebuild, otherwise every block is 'in use' on
    the underlying storage.

    After a rebuild I always create a LV with all the free space and then
    discard it.
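    That trick might look something like this (the volume group name is an
    example, and this requires root on a machine where the free extents really
    are unused):

```shell
# After the md rebuild finishes, reclaim all unallocated space in the VG:
lvcreate -n scratch -l 100%FREE vg0   # temporary LV covering all free extents
blkdiscard /dev/vg0/scratch           # issue discards for the whole LV
lvremove -y vg0/scratch               # give the extents back to the VG
```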

    I think an md rebuild that skips VG free space would suit me better than
    discard tracking at all the different levels. I guess ZFS users might
    have a different view of how useful LVM-aware mdraid is :-)

  • From Michael Kjörling@21:1/5 to All on Fri Sep 20 13:00:01 2024
    On 20 Sep 2024 10:04 +0000, from debianuser@woodall.me.uk (Tim Woodall):
    I guess ZFS users might
    have a different view of how useful lvm aware mdraid is :-)

    ZFS nowadays has the pool `autotrim` property (default off) and the
    `zpool trim` subcommand for manual or scripted usage. This is one of
    those times when ZFS' awareness of actual storage usage and allocation
    comes in handy at what's typically considered other layers in the
    stack.
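    For reference, the commands involved are roughly (pool name assumed):

```shell
# Trim freed space automatically as it is released:
zpool set autotrim=on tank
# Or trim on demand (e.g. from cron) and watch progress:
zpool trim tank
zpool status -t tank
```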

    --
    Michael Kjörling 🔗 https://michael.kjorling.se “Remember when, on the Internet, nobody cared that you were a dog?”

  • From Steve Keller@21:1/5 to Tim Woodall on Mon Sep 23 15:30:01 2024
    Tim Woodall <debianuser@woodall.me.uk> writes:

    By default, iSCSI, md, LVM, and ext2 do not keep this information. I don't
    know if it's configurable somewhere, but I suspect not. I don't know about
    btrfs.

    Some of this data is cached, but not between reboots.

    I have played a bit and it seems for ext4 and btrfs they keep
    information on already trimmed blocks but only as long as the file
    system is mounted:

    # lvcreate -n foo -L1G vg0
    Logical volume "foo" created.
    # mkfs.ext4 /dev/vg0/foo
    mke2fs 1.47.0 (5-Feb-2023)
    [...]
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 973.4 MiB (1020678144 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 0 B (0 bytes) trimmed
    # umount /mnt
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 973.4 MiB (1020678144 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 0 B (0 bytes) trimmed
    # umount /mnt
    # mkfs.btrfs -f /dev/vg0/foo
    btrfs-progs v6.2
    See http://btrfs.wiki.kernel.org for more information.

    Performing full device TRIM /dev/vg0/foo (1.00GiB) ...
    [...]
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 1022.6 MiB (1072267264 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # umount /mnt
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 1022.6 MiB (1072267264 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # umount /mnt

    I also currently play with ext4 and btrfs on QCOW2 with discard
    support. Looks nice.
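    A sketch of that setup, assuming a reasonably recent QEMU (file names and
    options are illustrative); discard=unmap lets a guest-side fstrim punch
    holes in the qcow2 file:

```shell
# Create a sparse qcow2 image:
qemu-img create -f qcow2 disk.qcow2 10G
# Attach it with discard passthrough enabled (other options omitted):
qemu-system-x86_64 -m 1G -drive file=disk.qcow2,if=virtio,discard=unmap
# After running fstrim inside the guest, check the host-side file size:
du -h disk.qcow2
```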

    The raid rebuild is a particular pain point IMO. It's important to do a discard after a failed disk rebuild otherwise every block is 'in use' on
    the underlying storage.

    Hmm, does a RAID rebuild really always copy the whole new disk, even
    the unused space? But what kind of info is then kept in the first
    128 MiB of /dev/md0, if not a flag for every block telling whether it's
    used or not?

    After a rebuild I always create a LV with all the free space and then
    discard it.

    :(

    I currently have RAID only on a server with HDDs, which don't support
    TRIM anyway. I have only needed to rebuild the two-disk RAID-1 twice,
    and I seem to remember that not the whole disk was copied, but I might
    be wrong about that.

    Steve

  • From Tim Woodall@21:1/5 to Steve Keller on Mon Sep 23 16:40:01 2024
    On Mon, 23 Sep 2024, Steve Keller wrote:

    Tim Woodall <debianuser@woodall.me.uk> writes:


    The raid rebuild is a particular pain point IMO. It's important to do a
    discard after a failed disk rebuild otherwise every block is 'in use' on
    the underlying storage.

    Hmm, does a RAID rebuild really always copy the whole new disk, even
    the unused space? But what kind of info is then kept in the first
    128 MiB of /dev/md0, if not a flag for every block telling whether it's
    used or not?

    After a rebuild I always create a LV with all the free space and then
    discard it.

    :(

    I currently have RAID only on a server with HDDs which don't support
    TRIM anyway. I have only needed twice to rebuild the RAID-1 with 2
    disks and I seem to remember that not the whole disk was copied, but I
    might be wrong on that.


    I think the bitmaps are for dirty blocks, so that a resync after a power
    failure is quick; they're not for a failed-disk replacement rebuild.

    But perhaps there's a config option somewhere so that the md device can
    track discards in a bitmap.
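    The existing bitmap machinery can at least be inspected and tuned (device
    names are examples); as far as I know it tracks dirty regions for fast
    resync, not discard state:

```shell
# Dump the internal write-intent bitmap of an array member:
mdadm --examine-bitmap /dev/sda1
# Add an internal bitmap to an existing array; this speeds up resync
# after an unclean shutdown, but not a full disk-replacement rebuild:
mdadm --grow --bitmap=internal /dev/md0
```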

    My guess is most people run at 90% capacity so it's not that useful...
