Better than BTRFS
Disk partitioning script for 500G SSD
sfdisk /dev/sda <<EOF
label: gpt
size=+1M name="BIOS1" type=21686148-6449-6E6F-744E-656564454649
size=+128M name="ESP1" type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B
size=+1G name="BPOOL1" type=BC13C2FF-59E6-4262-A352-B275FD6F7172
size=+400G name="RPOOL1" type=0FC63DAF-8483-4772-8E79-3D69D8477DE4
name="OVERPROVISION"
EOF
For 8TB drives
sudo sfdisk /dev/sdX <<EOF
label: gpt
name="PAIR-1A"
EOF
sudo sfdisk /dev/sdY <<EOF
label: gpt
name="PAIR-1B"
EOF
After that you will have stable, readable partition names under /dev/disk/by-partlabel:
ls -l /dev/disk/by-partlabel/
total 0
lrwxrwxrwx 1 root root 10 Mar 28 20:00 PAIR-1A -> ../../sda1
lrwxrwxrwx 1 root root 10 Mar 28 20:00 PAIR-1B -> ../../sdb1
Pool creation
zpool create \
-o ashift=12 \
-O acltype=posixacl -O compression=lz4 \
-O dnodesize=auto -O normalization=formD -O relatime=on \
-O xattr=sa tank mirror /dev/disk/by-partlabel/PAIR-1{A,B}
You should enable TLER (Time Limited Error Recovery) on those cheap shucked WD MyBook drives to make ZFS more responsive in case of bad blocks.
#!/usr/bin/bash
# Limit SATA Disk Error Recovery to 7 seconds
# On consumer disks this is usually left as unlimited
# which results in timeout errors and removal of the
# drive from ZFS/mdraid pool
# https://emdroid.github.io/operating%20systems/linux/desktop-drives-raid-tler-apm-aam/
# https://superuser.com/a/1551961
shopt -s extglob
for disk in /dev/disk/by-id/ata-!(*part*); do
echo "${disk}"
smartctl -l scterc,70,70 "${disk}"
done
NB: the setting is not persistent, so this needs to be run after every reboot.
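One way to rerun the script on every boot is a oneshot systemd unit. The sketch below only prints such a unit. The script location /usr/local/sbin/set-tler.sh and the unit name set-tler.service are hypothetical, and it assumes a systemd-based system.

```shell
#!/bin/sh
# Print a oneshot systemd unit that runs the TLER script at boot.
# SCRIPT_PATH is a hypothetical install location for the script above.
SCRIPT_PATH="${SCRIPT_PATH:-/usr/local/sbin/set-tler.sh}"

print_tler_unit() {
    cat <<EOF
[Unit]
Description=Limit SATA disk error recovery time (SCT ERC / TLER)

[Service]
Type=oneshot
ExecStart=$SCRIPT_PATH

[Install]
WantedBy=multi-user.target
EOF
}

# Possible usage (as a user with sudo rights), not run here:
#   print_tler_unit | sudo tee /etc/systemd/system/set-tler.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable set-tler.service
```

Printing the unit instead of installing it directly makes it easy to review before anything is written under /etc.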
Runtime knobs are in /sys/module/zfs/parameters, persistent knobs are in /etc/modprobe.d/zfs.conf
Example: set the maximum ARC size to 16G. By default this is 50% of system RAM.
# set to 16G
options zfs zfs_arc_max=17179869184
It makes sense to increase L2ARC write speeds when using NVMe drives as cache devices.
# write to l2arc at speed 512MB/s
options zfs l2arc_write_max=536870912 l2arc_write_boost=536870912
Also write ZFS prefetch data into the L2ARC to speed up cache warming; this helps with backups.
options zfs l2arc_noprefetch=0
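The byte values in the options above can be derived with shell arithmetic rather than by hand. This is a minimal sketch; the helper names gib_to_bytes, mib_to_bytes and zfs_conf_lines are made up for illustration.

```shell
#!/bin/sh
# Convert GiB / MiB into the byte counts the zfs module options expect.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

# Print modprobe.d lines for a 16G ARC cap and 512MB/s L2ARC writes.
zfs_conf_lines() {
    echo "options zfs zfs_arc_max=$(gib_to_bytes 16)"
    echo "options zfs l2arc_write_max=$(mib_to_bytes 512) l2arc_write_boost=$(mib_to_bytes 512)"
}

# Possible usage, not run here:
#   zfs_conf_lines | sudo tee /etc/modprobe.d/zfs.conf                        # persistent
#   gib_to_bytes 16 | sudo tee /sys/module/zfs/parameters/zfs_arc_max         # runtime
```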
Some useful commands
sudo zpool status
sudo zpool iostat -lvL tank 1
sudo arcstat
S_COLORS=always iostat -xm 1
Policy driven snapshots https://github.com/jimsalterjrs/sanoid
sudo apt install --no-install-recommends libcapture-tiny-perl libconfig-inifiles-perl pv lzop mbuffer
curl -L https://github.com/jimsalterjrs/sanoid/archive/v2.0.3.tar.gz | tar -xzv
ZFS snapshots to borgbackup repositories https://github.com/mikroskeem/zorg
Clone the repository and see flake.nix for dependencies for your system
Use badblocks to test a drive for bad blocks (the -w write test destroys all data on the drive):
sudo badblocks -B -b 4096 -svw /dev/sdX
-w defaults to -t 0xaa -t 0x55 -t 0xff -t 0x00
It's easiest to run the command without positional arguments (and interrupt it with SIGINT right away) to get the total block count, then divide the range into parts manually. For example, "From block 0 to 3418095615" divided up into 5x4 steps (5 parts, each with 4 patterns). Note that badblocks takes the last block first and the first block second:
#!/bin/sh
out="/root/bblogs"
dev="/dev/sdX"
mkdir -p "$out"
basedev="$(basename "$dev")"
badblocks -Bsvwb 4096 -t 0xaa -t 0x55 -t 0xff -t 0x00 "$dev" 683619123 0 2>&1 | tee "${out}/${basedev}1"
badblocks -Bsvwb 4096 -t 0xaa -t 0x55 -t 0xff -t 0x00 "$dev" 1367238246 683619124 2>&1 | tee "${out}/${basedev}2"
badblocks -Bsvwb 4096 -t 0xaa -t 0x55 -t 0xff -t 0x00 "$dev" 2050857369 1367238247 2>&1 | tee "${out}/${basedev}3"
badblocks -Bsvwb 4096 -t 0xaa -t 0x55 -t 0xff -t 0x00 "$dev" 2734476492 2050857369 2>&1 | tee "${out}/${basedev}4"
badblocks -Bsvwb 4096 -t 0xaa -t 0x55 -t 0xff -t 0x00 "$dev" 3418095615 2734476492 2>&1 | tee "${out}/${basedev}5"
This basic bodge script could be improved. Preferably the functionality would be built into badblocks: give it a file in which to keep track of its progress, and it automagically resumes after an interruption (say, a power outage).
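Until then, the manual division can at least be automated. The following is a rough sketch, not a tested replacement for the script above: split_ranges and print_badblocks_cmds are hypothetical helper names, the ranges are non-overlapping so the boundaries differ slightly from the hand-computed ones, and the commands are only printed, never executed.

```shell
#!/bin/sh
# split_ranges LAST PARTS
# Print "first last" pairs that cover blocks 0..LAST in PARTS consecutive,
# non-overlapping ranges. The final range absorbs any remainder.
split_ranges() {
    last=$1 parts=$2
    step=$(( (last + 1) / parts ))
    first=0
    i=1
    while [ "$i" -le "$parts" ]; do
        if [ "$i" -eq "$parts" ]; then
            end=$last
        else
            end=$(( first + step - 1 ))
        fi
        echo "$first $end"
        first=$(( end + 1 ))
        i=$(( i + 1 ))
    done
}

# print_badblocks_cmds DEV LAST PARTS
# Print one badblocks command per range (badblocks wants last-block first).
print_badblocks_cmds() {
    dev=$1
    n=1
    split_ranges "$2" "$3" | while read -r from to; do
        echo "badblocks -Bsvwb 4096 -t 0xaa -t 0x55 -t 0xff -t 0x00 $dev $to $from  # part $n"
        n=$(( n + 1 ))
    done
}

# Possible usage: print_badblocks_cmds /dev/sdX 3418095615 5
```

Printing the commands first gives a chance to check the ranges before starting a destructive multi-day run.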
After an unintended interruption, look at the logs. You will see at what step the process got interrupted:
Checking for bad blocks in read-write mode
From block 2734476492 to 3418095615
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: 93.98% done, 66:38:32 elapsed. (0/0/0 errors)
The above means that only the last pattern (0x00) of the last part needs to be run again, so leave only this badblocks command in the script:
badblocks -Bsvwb 4096 -t 0x00 "$dev" 3418095615 2734476492 2>&1 | tee "${out}/${basedev}5"
TIL: ZFS logs to its own private log buffer and not into dmesg
watch -n1 tail -n30 /proc/spl/kstat/zfs/dbgmsg
Note: in a log line such as
dsl_scan.c:3425:dsl_process_async_destroys(): freed 1169878 blocks in 30001ms from free_bpobj/bptree txg 33826167; err=85
err=85 (ERESTART on Linux) seems to mean there is more work to do.
I was running ZFS 2.0.3-9+deb11u1 and the pool was nearly full (92% usage).
ZFS was struggling to free space after files were deleted: zpool get freeing was stuck at 10TB and going down so slowly that it looked like it would take 10 days to reach 0.
The fix was to set zfs_free_bpobj_enabled to 0 (default is 1). After that, freeing started dropping a lot faster, but it soon got stuck, so I changed it back to 1 and let it slowly work through the remaining data.
Another option is zfs_free_min_time_ms, which can be set to 0 (default 1000). This makes background freeing take as little time as possible in each TXG. This helps when ZFS starts to hang because it is blocked on writing out TXGs, but it also seems to make freeing a lot slower.
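Both knobs can be flipped at runtime through /sys/module/zfs/parameters. Below is a small sketch of a helper for that; set_zfs_param and the PARAM_DIR override are made up for illustration (the override only exists so the helper can be tried against a scratch directory), and writing to the real directory requires root.

```shell
#!/bin/sh
# Directory holding the runtime knobs; overridable for dry runs.
PARAM_DIR="${PARAM_DIR:-/sys/module/zfs/parameters}"

# set_zfs_param NAME VALUE
# Write VALUE to the parameter and print the old and new values.
set_zfs_param() {
    name=$1 value=$2
    file="$PARAM_DIR/$name"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (not root, or zfs module not loaded?)" >&2
        return 1
    fi
    old=$(cat "$file")
    echo "$value" > "$file"
    echo "$name: $old -> $(cat "$file")"
}

# Possible usage as root, not run here:
#   set_zfs_param zfs_free_bpobj_enabled 0
#   set_zfs_param zfs_free_min_time_ms 0
#   watch -n10 zpool get freeing tank
```

Printing the previous value makes it easier to put the parameter back once freeing has caught up.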