An interactive simulator tracing a single I/O request from a guest VM through VFIO-user, the NVMf target, RAID mirroring, RDMA transport, and into the storage node's lvol/blobstore. Based on the real path for raid_270 — two remote replicas reached over NVMe-oF RDMA.
Each step shows actual source code from the SPDK codebase, what objects are allocated, what tuning knobs control behavior, and deep explanations of how SPDK's internal subsystems work.
| Stage | Location | Object Allocated | Tuning Knobs | What The Knobs Do |
|---|---|---|---|---|
| VFIO-user entry | Baremetal | Request bookkeeping, guest memory mapping | — | Guest memory is registered with libvfio-user at controller setup time. No per-IO allocation knobs here. |
| NVMf transport buffering | Baremetal | Transport-side iobuf payloads, from the transport poll-group's iobuf cache | num_shared_buffers<br>buf_cache_size<br>io_unit_size<br>max_io_size | num_shared_buffers: how many buffers the transport reserves from the central iobuf pool. Validated at transport create time against the iobuf pool size; if it exceeds the pool, SPDK warns.<br>buf_cache_size: per-poll-group cache of iobuf buffers. Default UINT32_MAX means auto-calculate: (num_shared_buffers × 3/4) / num_poll_groups. Each poll group pre-populates this many from the central pool.<br>io_unit_size: determines whether small or large buffers are used. If ≤ small_bufsize (8192), only the small pool is used.<br>max_io_size: maximum I/O size the transport accepts. Default 131072 (128K) for RDMA. |
| bdev request object | Baremetal | bdev_io metadata envelope, from the global mempool | bdev_io_pool_size<br>bdev_io_cache_size | bdev_io_pool_size: total number of bdev_io objects in the global mempool. Default 65535. Must be ≥ bdev_io_cache_size × (num_threads + 1).<br>bdev_io_cache_size: pre-populated per-thread cache. Default 256. Each SPDK thread grabs this many from the global pool at channel creation, ensuring no thread can be starved under normal load. |
| RAID read-buffer staging | Baremetal | Generic bdev iobuf (reads only), from the bdev mgmt channel iobuf cache | iobuf_small_cache_size<br>iobuf_large_cache_size | iobuf_small_cache_size: per-thread local cache of small buffers for the bdev module. Default 128. Served first, before hitting the central pool.<br>iobuf_large_cache_size: the same for large buffers. Default 16; fewer because large buffers are expensive (132K each). |
| NVMe initiator submission | Baremetal | Qpair request slot + NVMe request state | io_queue_requests<br>nvme_ioq_poll_period_us<br>delay_cmd_submit | io_queue_requests: outstanding command slots per qpair. Default 0 (use the controller-reported max). Too low caps concurrency; too high wastes memory and complicates recovery.<br>nvme_ioq_poll_period_us: how often the I/O qpair poller runs. Default 0 (poll continuously). Set non-zero to reduce CPU usage at the cost of latency.<br>delay_cmd_submit: batch command submissions. Default true. Improves throughput by coalescing doorbell writes. |
| RDMA transport | Network | Work request entries, qpair queue slots, RDMA CQ/SQ entries | transport_ack_timeout<br>rdma_srq_size<br>rdma_max_cq_size<br>rdma_cm_event_timeout_ms | transport_ack_timeout: RDMA ACK timeout. Default 0 (no timeout). Controls how long to wait for RDMA acknowledgments before declaring a path dead.<br>rdma_srq_size: shared receive queue size. Allows multiple qpairs to share receive buffers, reducing total memory. Default 0 (disabled, per-qpair RQ).<br>rdma_max_cq_size: max completion queue size. Default 0 (driver-chosen). Larger means more completions batched per poll.<br>rdma_cm_event_timeout_ms: timeout for RDMA connection manager events. |
| NVMf target buffering | Storage Node | Transport-side iobuf (same code path as baremetal) | num_shared_buffers<br>buf_cache_size | Same as baremetal transport buffering, but these are the storage node's independent settings: different machine, different pool, different pressure profile. |
| bdev/lvol execution | Storage Node | bdev_io + blobstore metadata | bdev_io_pool_size | The storage node's own bdev_io pool. lvol/blobstore also allocates internal request metadata for cluster mapping and blob I/O translation. |
Your baremetal is not just "running SPDK". It is running SPDK in two opposite roles on the same machine: guest-facing target work through VFIO-user and replica-facing initiator work through NVMe-oF RDMA. The same hot threads bridge those two worlds.
- app_thread (cpumask 0x10): rpc_subsystem_poll_servers, many timed bdev_nvme_poll_adminq, one active bdev_nvme_poll, and bdev_nvme_remove_poller.
- nvmf_tgt_poll_group_000, 001, 002 (cpumask 0xE0): nvmf_VFIOUSER, active bdev_nvme_poll, timed vfio_user_poll_vfu_ctx, many timed bdev_channel_poll_qos.

State scales as threads × controllers × channels × paths, not just "number of volumes". Each hot thread can own: a VFIO-user transport poll-group context, RAID bdev channels, NVMe bdev channels, QoS pollers, and a bdev management channel with its own caches. There is an spdk_bdev_mgmt_channel on each thread; this channel holds the per-thread bdev_io cache and the per-thread "bdev" iobuf cache. The transport's own per-poll-group cache lives in tgroup->buf_cache.

| Thing | Per process / global | Per thread | Per controller / path | Per thread × controller |
|---|---|---|---|---|
| Central iobuf small/large pools | yes | no | no | no |
| bdev global bdev_io mempool | yes | no | no | no |
| bdev mgmt channel caches (bdev_io, bdev iobuf) | no | yes | no | no |
| NVMf RDMA transport poll-group iobuf cache | no | yes (per poll-group thread) | no | no |
| VFIO-user transport cache | no | effectively none on your baremetal | no | no |
| NVMe controller object / admin queue / reconnect state | no | no | yes | no |
| Keep-alive pollers | no | depends where scheduled | controller-driven | effectively distributed with controller ownership |
| NVMe io_path / qpair state | no | no | partial | yes |
| IO channels (raid, nvme, lvol, etc.) | no | thread-local handles | refer to controller/path-backed objects | yes, often the real multiplier |
- small_pool_count=1048576 × small_bufsize=8192: exactly 8,589,934,592 bytes = 8 GiB of DMA-capable small buffers. This is central pool memory before any per-thread caching.
- large_pool_count=65536 × large_bufsize=131072: exactly 8,589,934,592 bytes = 8 GiB of DMA-capable large buffers.
- The iobuf_get_stats sample showed zero pressure; you gave yourself absurd headroom.
- iobuf_small_cache_size=128 × 8192: 1 MiB per thread pinned in the local "bdev" iobuf small cache once a bdev mgmt channel is populated.
- iobuf_large_cache_size=32 × 131072: 4 MiB per thread pinned in the local "bdev" iobuf large cache.
- Known busy threads: app_thread + nvmf_tgt_poll_group_000/001/002. Real pinned memory can be higher if more threads open bdev mgmt channels.
- bdev_io_cache_size=128: each bdev mgmt channel pre-populates 128 bdev_io objects from the global mempool. Across the same 4 known busy threads, that's 512 pre-reserved bdev_io objects. The exact byte size depends on sizeof(spdk_bdev_io) + max_driver_ctx_size, which varies with compiled modules, but the object count is exact.
- bdev_io_pool_size=1048576: you configured 1,048,576 bdev_io objects. This says very clearly: you do not want bdev_io scarcity to be the first bottleneck.
- The VFIO-user transport runs with num_shared_buffers=0 and buf_cache_size=0. It operates on guest memory directly instead of staging into transport-owned iobuf buffers.

| Layer | Knob | Your Value | Status | Why it matters on your box |
|---|---|---|---|---|
| NVMe initiator | keep_alive_timeout_ms | 30000 | set, non-default | Longer controller keepalive window. Less churn from transient stalls than the 10s upstream default. |
| NVMe initiator | reconnect_delay_sec | 2 | set, non-default | Reconnect attempts every 2s from init script. Still may be overridden per-controller later by diskengine attach. |
| NVMe initiator | ctrlr_loss_timeout_sec | 120 | set, non-default | Controller can stay in reconnect mode for 120s before being deleted. Much more forgiving than a 10s per-controller override. |
| NVMe initiator | fast_io_fail_timeout_sec | 0 | explicitly set | No fast-fail. I/O waits while recovery happens instead of failing quickly. |
| NVMe initiator | io_queue_requests | 128 | set, non-default | Deliberately conservative per-qpair queue depth. Strongly affects concurrency and qpair memory/state scale. |
| NVMe initiator | io_path_stat | true | set, non-default | Path statistics enabled. Helpful for visibility, small extra accounting cost. |
| bdev layer | bdev_io_pool_size | 1048576 | set, much larger than default | Massive global metadata-object pool. bdev_io starvation is very unlikely to be first bottleneck. |
| bdev layer | bdev_io_cache_size | 128 | set, lower than default 256 | Smaller per-thread cache than upstream default, but huge global pool makes this fine. Slightly less memory pinned per thread. |
| bdev iobuf cache | iobuf_small_cache_size | 128 | set | 1 MiB small-buffer cache per thread in the bdev module. |
| bdev iobuf cache | iobuf_large_cache_size | 32 | set, larger than default 16 | 4 MiB large-buffer cache per thread. Better reuse for larger IO, more memory pinned locally. |
| central iobuf | small_pool_count | 1048576 | set, enormous | 8 GiB small-pool headroom. This is why zero observed iobuf pressure is believable. |
| central iobuf | large_pool_count | 65536 | set, enormous | 8 GiB large-pool headroom. 128K-style workloads will not hit central starvation first. |
| central iobuf | small_bufsize | 8192 | set / same as common expectation | 4K and 8K I/O classify as small-pool traffic. |
| central iobuf | large_bufsize | 131072 | set | 16K–128K I/O classify as large-pool traffic on your host. |
| NVMf target config | poll-groups-mask | 0xE0 | set, topology-defining | Forces target poll groups to cores 5, 6, 7. This is why your runtime showed exactly 3 hot bridge threads. |
| VFIO-user transport | max_queue_depth | 128 | set, lower than VFIO-user default 256 | Constricts guest-facing queue depth. |
| VFIO-user transport | max_io_qpairs_per_ctrlr | 2 | set, much lower than VFIO-user default | Constricts guest-visible queue-pair fanout. Lower guest-side parallelism, less per-VM controller state. |
| VFIO-user transport | num_shared_buffers, buf_cache_size | effectively 0 | unset/default | Important: your baremetal guest-facing transport does not use transport iobuf caches the way RDMA target does. |
This latest dump is clearly the storage-node / RDMA-target runtime, not the baremetal VFIO-user bridge. The tell is the active poller mix: every hot thread is running nvmf_RDMA plus bdev_nvme_poll, and the app thread is running nvmf_rdma_accept. That is the signature of an RDMA target exporting namespaces while also polling local NVMe bdevs underneath.
- 28 poll-group threads (nvmf_tgt_poll_group_000 ... 027). nvmf_tgt_poll_group_027 is colocated with app_thread on lcore 4; all other poll-group threads sit alone on lcores 5–31.
- cpumask fffffff0: these threads are allowed on lcores 4–31. In the current runtime they are effectively spread one-per-reactor across that range, with the final poll-group sharing lcore 4 with the app thread.
- The app thread runs nvmf_rdma_accept (new RDMA connections), rpc_subsystem_poll_servers, many bdev_nvme_poll_adminq, and bdev_nvme_remove_poller. So lcore 4 is the acceptor / control-plane / adminq home on this storage node.
- Every nvmf_tgt_poll_group_* thread has active nvmf_RDMA and active bdev_nvme_poll. Translation: the same thread accepts/executes target requests and also polls the local NVMe-backed bdev path below lvol/blobstore.
- nvmf_ctrlr_keep_alive_poll pollers, 30s period: your dump shows many timed nvmf_ctrlr_keep_alive_poll instances per poll-group thread with period_ticks=60000000000 at a 2GHz tick rate = 30 seconds. That means controller keep-alive handling is distributed across the target poll groups, not centralized only on the app thread.

iobuf_get_stats Interpretation

| Module | Small pool | Large pool | Interpretation |
|---|---|---|---|
| accel | cache=0 main=0 retry=0 | cache=0 main=0 retry=0 | The acceleration framework is not using iobuf meaningfully in this sample. |
| bdev | cache=0 main=0 retry=0 | cache=0 main=0 retry=0 | Very important: the generic bdev iobuf path is not pressuring the system in this sample. That strongly suggests the observed hot path is dominated by transport-side buffering, not RAID-read-style or generic bdev staging. |
| nvmf_RDMA | cache=357,458,952 main=1 retry=0 | cache=1,776,058,924 main=24,723,075 retry=0 | This is the real storage-node story. RDMA target buffering is hot. Zero retries means no starvation. Huge cache counts mean requests are usually served from per-poll-group local cache. Non-zero main on the large pool means caches do refill from the central ring, but that is normal; it is not pressure by itself. The important signal is still retry=0. |
The hot counters above live in the nvmf_RDMA transport poll-group cache, not the generic bdev iobuf cache. So if you are debugging memory/buffer behavior here, start with transport-side request staging and poll-group fanout, not generic bdev buffer starvation.
The iobuf subsystem is SPDK's centralized DMA buffer management. It runs as a two-tier allocation scheme: a global central pool (per NUMA node) and per-thread local caches.
At startup, SPDK calls spdk_iobuf_initialize() which creates two pools per NUMA node:
Small pool: small_pool_count buffers × small_bufsize bytes each. Default: 8192 buffers × 8KB = 64MB of DMA memory. Backed by a single contiguous spdk_malloc() allocation with 4KB alignment. Stored in an spdk_ring (lock-free multi-producer/multi-consumer ring buffer).
Large pool: large_pool_count buffers × large_bufsize bytes each. Default: 1024 buffers × 132KB = ~132MB of DMA memory. Same structure. The 132KB default (not 128K) accounts for metadata that may need to travel alongside a 128K payload.
These are DMA-capable allocations (SPDK_MALLOC_DMA), meaning they can be used directly by RDMA NICs and NVMe controllers without additional memory registration.
When an iobuf channel is created (via spdk_iobuf_channel_init()), it pre-populates a local cache by dequeuing buffers from the central ring in batches of 64 (IOBUF_POPULATE_BATCH_SIZE).
Get path (spdk_iobuf_get()): First checks the local cache (STAILQ). If empty, dequeues a batch from the central ring (up to min(32, cache_size) at once). If the central ring is also empty and an entry is provided, the request is queued on a wait list and the entry's callback will be invoked when a buffer becomes available.
Put path (spdk_iobuf_put()): If there are waiters on the queue, the buffer goes directly to the first waiter (callback invoked immediately — zero-copy handoff). If no waiters, the buffer goes into the local cache. If the cache overflows (exceeds cache_size + batch_size), a batch is returned to the central ring.
Stats tracked per channel: cache (served from local cache), main (had to go to central ring), retry (central ring was empty, request queued). These are what iobuf_get_stats RPC reports.
In spdk_iobuf_get(), the decision is simple:
- len ≤ cache->small.bufsize → use the small pool
- otherwise → use the large pool (asserts len ≤ cache->large.bufsize)
On your host with defaults: 4K or 8K payloads → small pool. 16K–128K payloads → large pool.
This means a workload doing 4K random I/Os pressures the small pool, while sequential 128K I/Os pressure the large pool. They are completely independent tuning problems.
Each consumer must call spdk_iobuf_register_module(name) before creating channels. The registered modules in your stack:
"bdev" — the generic bdev layer. Creates one iobuf channel per bdev management channel (per-thread). Uses iobuf_small_cache_size (default 128) and iobuf_large_cache_size (default 16) from bdev_set_options.
"nvmf_RDMA" / "nvmf_VFIOUSER" — NVMf transports. Create one iobuf channel per transport poll group. Use buf_cache_size from nvmf_create_transport. VFIO-user defaults to 0 (no shared buffers needed since it uses guest memory). RDMA defaults to auto-calculated from num_shared_buffers.
"accel" — the acceleration framework. Also has its own iobuf cache sizes.
The iobuf_get_stats RPC iterates all modules and their channels, summing cache/main/retry stats. This is your primary diagnostic for buffer pressure.
The total global pool might be 8192 small buffers. But if you have 16 SPDK threads, each wanting a cache of 128, that's 2048 buffers locked in caches before any I/O happens. Add NVMf transport poll groups (one per core) each wanting their own cache, and you can exhaust the central pool at startup.
Formula: total needed ≥ (num_bdev_threads × iobuf_small_cache_size) + (num_poll_groups × buf_cache_size) + headroom
SPDK ships a helper: scripts/calc-iobuf.py — use it to calculate the right pool sizes for your deployment.
SPDK doesn't use interrupts or kernel threads for I/O. Everything runs in userspace polling loops.
Reactor = the outer per-core loop (lib/event/reactor.c). One reactor per CPU core. It drains event queues, then iterates all SPDK threads assigned to that core, calling spdk_thread_poll(thread, ...) for each one. The reactor is the scheduler.
SPDK Thread = the inner logical execution context (lib/thread/thread.c). Owns pollers, IO channels, and state. All state transitions must happen on the owning thread. spdk_thread_poll() makes the thread current, runs its messages, runs active pollers, runs timed pollers when due, and updates stats.
Poller = repeated work registered on a thread. Concrete examples from your stack:
- bdev_nvme_poll (module/bdev/nvme/bdev_nvme.c:3924): progresses remote RDMA qpairs, polls for completions
- bdev_nvme_poll_adminq (module/bdev/nvme/bdev_nvme.c:6108): polls the admin queue for health, discovery, and AER events. Period: nvme_adminq_poll_period_us (default 10ms)
- vfio_user_poll_vfu_ctx (lib/nvmf/vfio_user.c:4338): picks up guest commands from VFIO-user queues
- nvmf_tgroup_poll (lib/nvmf/rdma.c): progresses the RDMA target request state machine

In your prod: a baremetal core runs a reactor → that reactor polls SPDK threads → one thread owns an NVMe poll group with a poller progressing remote RDMA qpairs → another thread may own VFIO-user controller work. This is why "what core did it run on?" matters.
- io_unit_size / max_io_size: checked against small_bufsize and large_bufsize from iobuf. The critical decision: if io_unit_size ≤ small_bufsize, only the small iobuf pool is used for transport buffers. Otherwise both pools are used.
- buf_cache_size: auto-calculated as (num_shared_buffers × 3/4) / num_poll_groups. This means 75% of the "reserved" buffers are distributed as caches, keeping 25% for burst absorption. Set explicitly to override.
- io_queue_requests: defaults to the controller-reported maximum (MQES). Each slot consumes a tracking structure and an RDMA work request. Too low caps concurrency and throughput. Too high wastes memory per controller and makes recovery after path failures slower (more in-flight I/Os to drain/retry).
- reconnect_delay_sec: works together with ctrlr_loss_timeout_sec to define the reconnection window.
- action_on_timeout: none (log only), abort (send abort command), reset (reset the controller). Only active if timeout_us is non-zero.
- timeout_us: per-command timeout that triggers action_on_timeout. Separate from transport retries.
- small_pool_count: number of small buffers, each small_bufsize bytes. Total small pool memory = small_pool_count × small_bufsize. At defaults: 8192 × 8KB = 64MB. This pool serves ALL iobuf consumers: bdev layer, NVMf transports, accel framework. Must be large enough for: Σ(all per-thread caches) + burst headroom.
- large_pool_count: number of large buffers, each large_bufsize bytes. Total large pool memory = large_pool_count × large_bufsize. At defaults: 1024 × 132KB ≈ 132MB. Large buffers are used for I/Os > small_bufsize (8K). Sequential 128K workloads live here.
- small_bufsize: the small/large boundary; len ≤ small_bufsize uses the small pool. Increasing this means fewer I/Os hit the large pool, but each small buffer consumes more DMA memory.
- large_bufsize: the maximum buffer size; io_unit_size must be ≤ this value. Increasing this allows larger max I/O sizes but costs proportionally more DMA memory.
- Note: iobuf_set_options can only be called during SPDK startup (before subsystems initialize). It cannot be changed at runtime. Get it wrong and you must restart the SPDK process.
- bdev_io_pool_size: total bdev_io objects in the global mempool. A bdev_io is the metadata envelope for every block I/O request; it tracks the bdev, offset, length, iovecs, completion callback, driver context (like raid_bdev_io or nvme_bdev_io), and internal state. It does NOT contain payload data. Must satisfy: pool_size ≥ bdev_io_cache_size × (num_threads + 1). If you have 32 threads with cache_size 256, the minimum pool is 8448.
- bdev_io_cache_size: per-thread cache of bdev_io objects. At channel creation, each thread bulk-dequeues this many from the global pool into a thread-local STAILQ. This guarantees that under normal load, bdev_io allocation is a simple linked-list pop: no lock contention, no global pool access. Only when the cache is exhausted does a thread go to the global mempool.
- iobuf_small_cache_size / iobuf_large_cache_size: the bdev module's per-thread iobuf caches, analogous to the transport's buf_cache_size. The bdev layer uses these when spdk_bdev_io_get_buf() is called (e.g., by RAID for reads, or by the NVMe bdev for reads without a pre-existing buffer). Higher values reduce central pool contention but consume more buffers from the global pool at startup.

For one VM I/O on your baremetal:
At each step, a different kind of object is allocated: transport request, bdev_io, iobuf payload buffer, or qpair request slot. If you ask "which one of those is pressuring the system?", the tuning problem becomes tractable.
Based on your baremetal runtime samples: hot threads bridge VFIO-user and NVMe initiator work, many remote controllers are active, each replica namespace has two enabled RDMA paths, and iobuf_get_stats showed zero retries. Your first-order cost is controller/qpair/poller scale, not iobuf starvation. Focus on io_queue_requests, poll-group layout, and controller/path count before increasing buffer pools.
Useful diagnostic RPCs: framework_get_reactors, thread_get_pollers, thread_get_io_channels, iobuf_get_stats, bdev_nvme_get_controllers, bdev_nvme_get_io_paths, bdev_raid_get_bdevs, nvmf_get_subsystems