True asynchronous direct I/O (io_uring/libaio) for high-IOPS NVMe topologies

When an application combines O_DIRECT with asynchronous I/O (io_uring or libaio) to drive a high-IOPS random workload, OpenZFS appears to intercept the asynchronous requests and turn them into synchronous operations. Even with a deep application queue depth (iodepth=64), OpenZFS effectively processes them at a queue depth of 1, waiting for each request to be acknowledged before proceeding to the next.

Multi-threaded concurrency (numjobs > 1) improves IOPS only slightly, at the cost of significantly higher latency.

Hardware: PCIe 5.0 NVMe SSDs, AMD Zen5, DDR5 DRAM 6400 MT/s
Software: Rocky Linux 10.2 (6.12.0-211.18.1.el10_2.x86_64), OpenZFS 2.4.2

[root@memverge4 ~]# zpool create -f -o ashift=12 -O recordsize=16K -O atime=off -O xattr=sa -O compression=off -O dedup=off tank raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
[root@memverge4 ~]# df -h /tank
Filesystem      Size  Used Avail Use% Mounted on
tank             12T  192K   12T   1% /tank
[root@memverge4 ~]#
[root@memverge4 ~]# echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active
[root@memverge4 ~]# echo 1024 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
[root@memverge4 ~]# echo 16 > /sys/module/zfs/parameters/zvol_threads
[root@memverge4 ~]# echo 2048 > /sys/module/zfs/parameters/zfs_vdev_max_active
[root@memverge4 ~]# echo 0 > /sys/module/zfs/parameters/zfs_vdev_direct_write_verify

[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-39-g5e65
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=372MiB/s][w=23.8k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=7790: Wed Jun 10 06:43:37 2026
  write: IOPS=23.0k, BW=359MiB/s (377MB/s)(21.1GiB/60001msec)
    slat (usec): min=26, max=3685, avg=43.04, stdev=38.17
    clat (nsec): min=1002, max=28033k, avg=2739276.82, stdev=1944877.94
     lat (usec): min=78, max=28444, avg=2782.32, stdev=1972.86

Only 23 kIOPS at 2.8 ms average latency.

With --numjobs=4 it scales only to about 60 kIOPS, and the 4.3 ms average latency is far too high for PCIe 5.0 NVMe SSDs:

[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=4 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
...
fio-3.42-39-g5e65
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=922MiB/s][w=59.0k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=7874: Wed Jun 10 06:45:10 2026
  write: IOPS=59.4k, BW=928MiB/s (973MB/s)(54.4GiB/60001msec)
    slat (usec): min=29, max=4212, avg=66.83, stdev=75.39
    clat (nsec): min=1031, max=29206k, avg=4243159.85, stdev=3395372.40
     lat (usec): min=45, max=29965, avg=4309.99, stdev=3443.21
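
The two libaio runs above can be cross-checked with Little's law (average requests in flight = completion rate x average latency). This is only a sketch using the IOPS and "lat" averages printed by fio; the helper name in_flight is hypothetical. It shows how full fio keeps its own queue, not where inside OpenZFS the requests wait.

```python
# Little's law: average requests in flight = completion rate * average latency.
# The figures are copied from the two libaio runs above (fio "IOPS" and "lat" avg).

def in_flight(iops: float, avg_lat_usec: float) -> float:
    """Average number of outstanding requests implied by Little's law."""
    return iops * avg_lat_usec / 1_000_000

runs = {
    "numjobs=1": (23.0e3, 2782.32),
    "numjobs=4": (59.4e3, 4309.99),
}

for name, (iops, lat_usec) in runs.items():
    print(f"{name}: ~{in_flight(iops, lat_usec):.0f} requests in flight")
```

The results come out at roughly 64 and 256, i.e. numjobs x iodepth. So fio does keep its queues full, and the measured latency is approximately queue depth divided by IOPS: it follows from the low completion rate rather than being an independent problem.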

The numbers are about the same with ioengine=io_uring:

[root@memverge4 ~]# echo 0 > /proc/sys/kernel/io_uring_disabled
[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=4 --iodepth=64 --exitall --group_reporting --ioengine=io_uring --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=64
...
fio-3.42-39-g5e65
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=951MiB/s][w=60.8k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=7979: Wed Jun 10 06:48:13 2026
  write: IOPS=61.5k, BW=961MiB/s (1008MB/s)(56.3GiB/60004msec)
    slat (nsec): min=250, max=3488.3k, avg=1577.34, stdev=8749.91
    clat (usec): min=666, max=31620, avg=4159.12, stdev=3945.10
     lat (usec): min=711, max=31623, avg=4160.70, stdev=3945.90

I also found the following discussion:

https://github.com/openzfs/zfs/issues/17940

"ZFS does not support async AIO. In fact every I/O is being submitted synchronously. You can observe this with zpool iostat -vq 1. You will see that there is only one thing in the queue during your write/reads for the most part. There was work on adding async I/O through the DMU async stuff, but it never got finished. So what you observed by increasing numjobs was more work going on."

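The quoted comment suggests watching the vdev queues with zpool iostat -vq 1. Below is a minimal sketch of capturing that output from a script while fio runs in another terminal. It assumes the pool is named tank as above; the helper names iostat_cmd and sample_queues are hypothetical. The output is returned as raw text because its column layout is not shown in this report.

```python
import subprocess

def iostat_cmd(pool: str, interval: int = 1, count: int = 10) -> list[str]:
    """Build a `zpool iostat` command line: -v per-vdev rows, -q queue statistics."""
    return ["zpool", "iostat", "-vq", pool, str(interval), str(count)]

def sample_queues(pool: str, interval: int = 1, count: int = 10) -> str:
    """Run the command and return its raw output (needs ZFS installed and a live pool)."""
    result = subprocess.run(
        iostat_cmd(pool, interval, count),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling print(sample_queues("tank")) during the benchmark would print ten one-second samples, which can then be attached to the results.
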
Next, I destroyed the OpenZFS filesystem and, on the same disks, created a zvol formatted with XFS:

[root@memverge4 ~]# zpool create -f -o ashift=12 -O recordsize=16K -O atime=off -O xattr=sa -O compression=off -O dedup=off tank raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
[root@memverge4 ~]# zfs create -V 11T -o volblocksize=16K -o compression=off -o dedup=off tank/fiotest
[root@memverge4 ~]# mkfs.xfs -K /dev/zvol/tank/fiotest
specified blocksize 4096 is less than device physical sector size 16384
switching to logical sector size 512
meta-data=/dev/zvol/tank/fiotest isize=512    agcount=48, agsize=55924053 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=2684354544, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0
[root@memverge4 ~]# mount /dev/zvol/tank/fiotest /tank
[root@memverge4 ~]# df -h /tank
Filesystem      Size  Used Avail Use% Mounted on
/dev/zd0         10T  197G  9.9T   2% /tank
[root@memverge4 ~]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-39-g5e65
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=3065MiB/s][w=196k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=6603: Wed Jun 10 14:11:45 2026
  write: IOPS=184k, BW=2875MiB/s (3015MB/s)(168GiB/60003msec)
    slat (nsec): min=691, max=5004.3k, avg=3502.59, stdev=14054.01
    clat (nsec): min=150, max=174677k, avg=344079.59, stdev=620787.75
     lat (usec): min=8, max=174678, avg=347.58, stdev=621.12

Single thread: 184 kIOPS and 348 us! Much better than the OpenZFS filesystem with either numjobs=1 or numjobs=4.
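
To put a number on the gap, here is a short calculation using only the averages fio reported in the runs above. The variable names are illustrative.

```python
# Average write IOPS and total latency ("lat" avg, usec) as reported by fio above.
zfs_1job_iops, zfs_1job_lat = 23.0e3, 2782.32   # ZFS dataset, libaio, numjobs=1
zfs_4jobs_iops = 59.4e3                         # ZFS dataset, libaio, numjobs=4
xfs_zvol_iops, xfs_zvol_lat = 184e3, 347.58     # XFS on zvol, libaio, numjobs=1

iops_ratio_1job = xfs_zvol_iops / zfs_1job_iops
iops_ratio_4jobs = xfs_zvol_iops / zfs_4jobs_iops
latency_ratio_1job = zfs_1job_lat / xfs_zvol_lat

print(f"IOPS vs ZFS numjobs=1:    {iops_ratio_1job:.1f}x")
print(f"IOPS vs ZFS numjobs=4:    {iops_ratio_4jobs:.1f}x")
print(f"Latency vs ZFS numjobs=1: {latency_ratio_1job:.1f}x lower")
```

With a single job, XFS on the zvol delivers about 8x the IOPS at about one eighth of the latency, and it still delivers about 3x the IOPS of the four-job OpenZFS run.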

The lack of async I/O is a performance killer on modern PCIe 4.0/5.0 and upcoming 6.0 NVMe SSDs.

OpenZFS is an all-in-one storage management solution, so switching to XFS is not a good option.

I would like to ask the community to add async I/O to the OpenZFS filesystem. The combination of these two I/O features (direct and async), for both file and block access, is what modern NVMe devices and workloads require.

Regarding direct I/O for zvols, I opened a separate issue a few days ago: https://github.com/openzfs/zfs/issues/18644

True asynchronous direct I/O (io_uring/libaio) for high-IOPS NVMe topologies #18660
