When applications combine O_DIRECT with AIO (io_uring or libaio) to drive high-IOPS random workloads, OpenZFS appears to convert these asynchronous requests into synchronous operations. Even with a deep application queue (iodepth=64), OpenZFS effectively processes them at a queue depth of 1, waiting for each request to be acknowledged before proceeding.
Adding concurrency with multiple jobs (numjobs > 1) only modestly improves IOPS, at the cost of significantly higher latency.
Hardware: PCIe 5.0 NVMe SSDs, AMD Zen5, DDR5 DRAM 6400 MT/s
Software: Rocky Linux 10.2 (6.12.0-211.18.1.el10_2.x86_64), OpenZFS 2.4.2
[root@memverge4 ~]# zpool create -f -o ashift=12 -O recordsize=16K -O atime=off -O xattr=sa -O compression=off -O dedup=off tank raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
[root@memverge4 ~]# df -h /tank
Filesystem Size Used Avail Use% Mounted on
tank 12T 192K 12T 1% /tank
[root@memverge4 ~]#
[root@memverge4 ~]# echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
[root@memverge4 ~]# echo 16 > /sys/module/zfs/parameters/zvol_threads
[root@memverge4 ~]# echo 2048 > /sys/module/zfs/parameters/zfs_vdev_max_active
[root@memverge4 ~]# echo 0 > /sys/module/zfs/parameters/zfs_vdev_direct_write_verify
[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-39-g5e65
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=372MiB/s][w=23.8k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=7790: Wed Jun 10 06:43:37 2026
write: IOPS=23.0k, BW=359MiB/s (377MB/s)(21.1GiB/60001msec)
slat (usec): min=26, max=3685, avg=43.04, stdev=38.17
clat (nsec): min=1002, max=28033k, avg=2739276.82, stdev=1944877.94
lat (usec): min=78, max=28444, avg=2782.32, stdev=1972.86
Only 23 kIOPS at 2.8 ms average latency.
With --numjobs=4 it scales only to about 60 kIOPS, and the 4.3 ms average latency is too high for PCIe 5.0 NVMe SSDs:
[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=4 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
...
fio-3.42-39-g5e65
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=922MiB/s][w=59.0k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=7874: Wed Jun 10 06:45:10 2026
write: IOPS=59.4k, BW=928MiB/s (973MB/s)(54.4GiB/60001msec)
slat (usec): min=29, max=4212, avg=66.83, stdev=75.39
clat (nsec): min=1031, max=29206k, avg=4243159.85, stdev=3395372.40
lat (usec): min=45, max=29965, avg=4309.99, stdev=3443.21
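As a rough consistency check on the two libaio runs above: if each submission call blocks until the write has been processed, a job cannot issue more than one I/O per average slat. The sketch below computes that ceiling from the reported slat averages. The helper name `slat_ceiling` is hypothetical, and reading slat as time blocked in submission is an assumption, not something the fio output proves. The io_uring run reports a much smaller slat, so this check does not carry over to it.

```shell
# Hypothetical helper: IOPS ceiling if every submission blocks for the
# average slat. $1 = average slat in microseconds, $2 = number of jobs.
slat_ceiling() {
    awk -v slat="$1" -v jobs="$2" \
        'BEGIN { printf "%.0f\n", jobs * 1000000 / slat }'
}

slat_ceiling 43.04 1   # numjobs=1 run: prints 23234 (fio reported 23.0k IOPS)
slat_ceiling 66.83 4   # numjobs=4 run: prints 59853 (fio reported 59.4k IOPS)
```

Both ceilings land within about 1% of the measured IOPS, which is consistent with the submission path being serialized per job.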
The numbers are essentially the same with ioengine=io_uring:
[root@memverge4 ~]# echo 0 > /proc/sys/kernel/io_uring_disabled
[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=4 --iodepth=64 --exitall --group_reporting --ioengine=io_uring --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=64
...
fio-3.42-39-g5e65
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=951MiB/s][w=60.8k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=7979: Wed Jun 10 06:48:13 2026
write: IOPS=61.5k, BW=961MiB/s (1008MB/s)(56.3GiB/60004msec)
slat (nsec): min=250, max=3488.3k, avg=1577.34, stdev=8749.91
clat (usec): min=666, max=31620, avg=4159.12, stdev=3945.10
lat (usec): min=711, max=31623, avg=4160.70, stdev=3945.90
I also found the following discussion, #17940:
"ZFS does not support async AIO. In fact every I/O is being submitted synchronously. You can observe this with zpool iostat -vq 1. You will see that there is only one thing in the queue during your write/reads for the most part. There was work on adding async I/O through the DMU async stuff, but it never got finished. So what you observed by increasing numjobs was more work going on."
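The observation suggested in the quote can be scripted as below. This is a minimal sketch, not output from this system: `watch_pool_queues` is a hypothetical wrapper name, and the pool name `tank` and the 1-second interval are taken from the commands in this report. It only wraps `zpool iostat -vq`, which prints per-vdev pending and active queue counts.

```shell
# Hypothetical wrapper: print per-vdev queue statistics for a pool
# at a fixed interval. $1 = pool name (default: tank),
# $2 = interval in seconds (default: 1).
watch_pool_queues() {
    pool="${1:-tank}"
    interval="${2:-1}"
    zpool iostat -vq "$pool" "$interval"
}

# Usage, in a second terminal while fio is running:
#   watch_pool_queues tank 1
```

If the quoted explanation holds, the active counts in the write queues should stay near 1 per job during the fio run instead of approaching iodepth.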
Next, I destroyed the OpenZFS filesystem and, on the same disks, created a zvol formatted with XFS:
[root@memverge4 ~]# zpool create -f -o ashift=12 -O recordsize=16K -O atime=off -O xattr=sa -O compression=off -O dedup=off tank raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
[root@memverge4 ~]# zfs create -V 11T -o volblocksize=16K -o compression=off -o dedup=off tank/fiotest
[root@memverge4 ~]# mkfs.xfs -K /dev/zvol/tank/fiotest
specified blocksize 4096 is less than device physical sector size 16384
switching to logical sector size 512
meta-data=/dev/zvol/tank/fiotest isize=512 agcount=48, agsize=55924053 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0 metadir=0
data = bsize=4096 blocks=2684354544, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
= zoned=0 start=0 reserved=0
[root@memverge4 ~]# mount /dev/zvol/tank/fiotest /tank
[root@memverge4 ~]# df -h /tank
Filesystem Size Used Avail Use% Mounted on
/dev/zd0 10T 197G 9.9T 2% /tank
[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-39-g5e65
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=3065MiB/s][w=196k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=6603: Wed Jun 10 14:11:45 2026
write: IOPS=184k, BW=2875MiB/s (3015MB/s)(168GiB/60003msec)
slat (nsec): min=691, max=5004.3k, avg=3502.59, stdev=14054.01
clat (nsec): min=150, max=174677k, avg=344079.59, stdev=620787.75
lat (usec): min=8, max=174678, avg=347.58, stdev=621.12
A single job reaches 184 kIOPS at 348 us average latency, much better than the OpenZFS filesystem with either 1 or 4 jobs.
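The reported latencies can be cross-checked with Little's law (requests in flight = IOPS x average latency). The sketch below uses the rounded IOPS and average lat values from the fio outputs in this report; the helper name `inflight` is hypothetical.

```shell
# Hypothetical helper: average number of requests in flight.
# $1 = IOPS, $2 = average total latency in microseconds.
inflight() {
    awk -v iops="$1" -v lat_us="$2" \
        'BEGIN { printf "%.0f\n", iops * lat_us / 1000000 }'
}

inflight 23000 2782.32    # OpenZFS filesystem, numjobs=1: prints 64
inflight 59400 4309.99    # OpenZFS filesystem, numjobs=4: prints 256
inflight 184000 347.58    # XFS on zvol, numjobs=1: prints 64
```

In every run fio keeps iodepth x numjobs requests outstanding, so the higher latency on the OpenZFS filesystem is mostly time spent waiting in that queue. The quantity that actually differs is the time per completed I/O (1/IOPS): about 43 us on the filesystem versus about 5.4 us on the zvol with one job.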
The lack of async I/O is a performance killer for modern PCIe 4.0/5.0 and upcoming PCIe 6.0 NVMe SSDs.
OpenZFS is an all-in-one storage management solution, so switching to XFS is not the best option.
I would like to ask the community to add async I/O to the OpenZFS filesystem. The combination of these two I/O features (direct and async), for both file and block access, is what modern NVMe devices and workloads require.
I wrote about adding direct I/O for zvols a few days ago in #18644.