• Unused blocks and fstrim

    From Steve Keller@21:1/5 to All on Fri Sep 20 11:30:01 2024
    I'd like to understand some technical details about how fstrim, file
    systems, and block devices work.

    Do ext4 and btrfs keep a list of blocks that have already been reported as
    unused, or do they have to report all unused blocks to the block device
    layer every time the fstrim command is issued?

    Does LVM keep information on every block about its usage or does it always
    have to pass trim operations to the lower layer?

    And does software RAID, i.e. /dev/md*, keep this information on every block?
    Can RAID skip syncing unused blocks in a RAID-1 array when I replace a disk?

    Steve

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to Steve Keller on Fri Sep 20 12:10:01 2024
    On Fri, 20 Sep 2024, Steve Keller wrote:

    I'd like to understand some technical details about how fstrim, file
    systems, and block devices work.

    Do ext4 and btrfs keep a list of blocks that have already been reported as
    unused, or do they have to report all unused blocks to the block device
    layer every time the fstrim command is issued?

    Does LVM keep information on every block about its usage or does it always have to pass trim operations to the lower layer?

    And does software RAID, i.e. /dev/md* keep this information on every block? Can RAID skip unused blocks from syncing in a RAID-1 array when I replace a disk?

    Steve


    By default, iSCSI, md, LVM, and ext2 do not keep this information. I don't
    know if it's configurable somewhere, but I suspect not. I don't know about
    btrfs.

    Some of this data is cached, but not between reboots.
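    A quick way to check which layers will pass discards down (though not what
    has already been trimmed) is lsblk's discard columns; non-zero
    DISC-GRAN/DISC-MAX means the device accepts discard requests:

```shell
# List discard capabilities for every block device in the stack.
# A zero DISC-GRAN/DISC-MAX on a dm/md device means discard requests
# stop at that layer instead of being passed down to the disk.
lsblk --discard
```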

    The raid rebuild is a particular pain point IMO. It's important to do a
    discard after a failed-disk rebuild, otherwise every block is 'in use' on
    the underlying storage.

    After a rebuild I always create a LV with all the free space and then
    discard it.
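    That trick might look something like this (the volume group name is an
    example, and this requires root on a machine where the free extents really
    are unused):

```shell
# After the md rebuild finishes, reclaim all unallocated space in the VG:
lvcreate -n scratch -l 100%FREE vg0   # temporary LV covering all free extents
blkdiscard /dev/vg0/scratch           # issue discards for the whole LV
lvremove -y vg0/scratch               # give the extents back to the VG
```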

    I think an md rebuild that skips VG free space would suit me better than
    discard tracking at all the different levels. I guess ZFS users might
    have a different view of how useful LVM-aware mdraid is :-)

  • From Michael Kjörling@21:1/5 to All on Fri Sep 20 13:00:01 2024
    On 20 Sep 2024 10:04 +0000, from debianuser@woodall.me.uk (Tim Woodall):
    I guess ZFS users might
    have a different view of how useful lvm aware mdraid is :-)

    ZFS nowadays has the pool `autotrim` property (default off) and the
    `zpool trim` subcommand for manual or scripted usage. This is one of
    those times when ZFS' awareness of actual storage usage and allocation
    comes in handy at what's typically considered other layers in the
    stack.
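    For reference, the commands involved are roughly (pool name assumed):

```shell
# Trim freed space automatically as it is released:
zpool set autotrim=on tank
# Or trim on demand (e.g. from cron) and watch progress:
zpool trim tank
zpool status -t tank
```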

    --
    Michael Kjörling 🔗 https://michael.kjorling.se “Remember when, on the Internet, nobody cared that you were a dog?”

  • From Steve Keller@21:1/5 to Tim Woodall on Mon Sep 23 15:30:01 2024
    Tim Woodall <debianuser@woodall.me.uk> writes:

    By default, iSCSI, md, LVM, and ext2 do not keep this information. I don't
    know if it's configurable somewhere, but I suspect not. I don't know about
    btrfs.

    Some of this data is cached, but not between reboots.

    I have played a bit and it seems for ext4 and btrfs they keep
    information on already trimmed blocks but only as long as the file
    system is mounted:

    # lvcreate -n foo -L1G vg0
    Logical volume "foo" created.
    # mkfs.ext4 /dev/vg0/foo
    mke2fs 1.47.0 (5-Feb-2023)
    [...]
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 973.4 MiB (1020678144 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 0 B (0 bytes) trimmed
    # umount /mnt
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 973.4 MiB (1020678144 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 0 B (0 bytes) trimmed
    # umount /mnt
    # mkfs.btrfs -f /dev/vg0/foo
    btrfs-progs v6.2
    See http://btrfs.wiki.kernel.org for more information.

    Performing full device TRIM /dev/vg0/foo (1.00GiB) ...
    [...]
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 1022.6 MiB (1072267264 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # umount /mnt
    # mount /dev/vg0/foo /mnt
    # fstrim -v /mnt
    /mnt: 1022.6 MiB (1072267264 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # fstrim -v /mnt
    /mnt: 126 MiB (132087808 bytes) trimmed
    # umount /mnt

    I also currently play with ext4 and btrfs on QCOW2 with discard
    support. Looks nice.
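    A sketch of that setup, assuming a reasonably recent QEMU (file names and
    options are illustrative); discard=unmap lets a guest-side fstrim punch
    holes in the qcow2 file:

```shell
# Create a sparse qcow2 image:
qemu-img create -f qcow2 disk.qcow2 10G
# Attach it with discard passthrough enabled (other options omitted):
qemu-system-x86_64 -m 1G -drive file=disk.qcow2,if=virtio,discard=unmap
# After running fstrim inside the guest, check the host-side file size:
du -h disk.qcow2
```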

    The raid rebuild is a particular pain point IMO. It's important to do a discard after a failed disk rebuild otherwise every block is 'in use' on
    the underlying storage.

    Hmm, does a RAID rebuild really always copy the whole new disk, even
    the unused space? But what kind of info is then kept in the first
    128 MiB of /dev/md0, if not a flag for every block telling whether it's
    used or not?

    After a rebuild I always create a LV with all the free space and then
    discard it.

    :(

    I currently have RAID only on a server with HDDs, which don't support
    TRIM anyway. I have only needed to rebuild the two-disk RAID-1 twice,
    and I seem to remember that not the whole disk was copied, but I might
    be wrong about that.

    Steve

  • From Tim Woodall@21:1/5 to Steve Keller on Mon Sep 23 16:40:01 2024
    On Mon, 23 Sep 2024, Steve Keller wrote:

    Tim Woodall <debianuser@woodall.me.uk> writes:


    The raid rebuild is a particular pain point IMO. It's important to do a
    discard after a failed disk rebuild otherwise every block is 'in use' on
    the underlying storage.

    Hmm, does a RAID rebuild really always copy the whole new disk, even
    the unused space? But what kind of info is then kept in the first
    128 MiB of /dev/md0, if not a flag for every block telling whether it's
    used or not?

    After a rebuild I always create a LV with all the free space and then
    discard it.

    :(

    I currently have RAID only on a server with HDDs which don't support
    TRIM anyway. I have only needed twice to rebuild the RAID-1 with 2
    disks and I seem to remember that not the whole disk was copied, but I
    might be wrong on that.


    I think the bitmaps are for dirty blocks, so that a resync after a power
    failure is quick; they're not for a failed-disk replacement rebuild.

    But perhaps there's a config option somewhere so that the md device can
    track discards in a bitmap.
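    The existing bitmap machinery can at least be inspected and tuned (device
    names are examples); as far as I know it tracks dirty regions for fast
    resync, not discard state:

```shell
# Dump the internal write-intent bitmap of an array member:
mdadm --examine-bitmap /dev/sda1
# Add an internal bitmap to an existing array; this speeds up resync
# after an unclean shutdown, but not a full disk-replacement rebuild:
mdadm --grow --bitmap=internal /dev/md0
```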

    My guess is most people run at 90% capacity so it's not that useful...
