An interactive simulator tracing a single I/O request from a guest VM through VFIO-user, the NVMf target, RAID mirroring, RDMA transport, and into the storage node's lvol/blobstore. Based on the real path for raid_270 — two remote replicas reached over NVMe-oF RDMA.
Each step shows actual source code from the SPDK codebase, what objects are allocated, what tuning knobs control behavior, and deep explanations of how SPDK's internal subsystems work.
| Stage | Location | Object Allocated | Tuning Knobs | What The Knobs Do |
|---|---|---|---|---|
| VFIO-user entry | Baremetal | Request bookkeeping, guest memory mapping | — | Guest memory is registered with libvfio-user at controller setup time. No per-IO allocation knobs here. |
| NVMf transport buffering | Baremetal | Transport-side iobuf payloads, from the transport poll-group's iobuf cache | num_shared_buffers<br>buf_cache_size<br>io_unit_size<br>max_io_size | num_shared_buffers: how many buffers the transport reserves from the central iobuf pool. Validated at transport create time against the iobuf pool size; if it exceeds the pool, SPDK warns.<br>buf_cache_size: per-poll-group cache of iobuf buffers. Default UINT32_MAX means auto-calculate: (num_shared_buffers × 3/4) / num_poll_groups. Each poll group pre-populates this many from the central pool.<br>io_unit_size: determines whether small or large buffers are used. If ≤ small_bufsize (8192), only the small pool is used.<br>max_io_size: maximum I/O size the transport accepts. Default 131072 (128K) for RDMA. |
| bdev request object | Baremetal | bdev_io metadata envelope, from the global mempool | bdev_io_pool_size<br>bdev_io_cache_size | bdev_io_pool_size: total number of bdev_io objects in the global mempool. Default 65535. Must be ≥ bdev_io_cache_size × (num_threads + 1).<br>bdev_io_cache_size: pre-populated per-thread cache. Default 256. Each SPDK thread grabs this many from the global pool at channel creation, ensuring no thread can be starved under normal load. |
| RAID read-buffer staging | Baremetal | Generic bdev iobuf (reads only), from the bdev mgmt channel iobuf cache | iobuf_small_cache_size<br>iobuf_large_cache_size | iobuf_small_cache_size: per-thread local cache of small buffers for the bdev module. Default 128. Served first, before hitting the central pool.<br>iobuf_large_cache_size: the same for large buffers. Default 16; fewer because large buffers are expensive (132K each). |
| NVMe initiator submission | Baremetal | Qpair request slot + NVMe request state | io_queue_requests<br>nvme_ioq_poll_period_us<br>delay_cmd_submit | io_queue_requests: outstanding command slots per qpair. Default 0 (use the controller-reported max). Too low caps concurrency; too high wastes memory and complicates recovery.<br>nvme_ioq_poll_period_us: how often the I/O qpair poller runs. Default 0 (poll continuously). Set non-zero to reduce CPU usage at the cost of latency.<br>delay_cmd_submit: batch command submissions. Default true. Improves throughput by coalescing doorbell writes. |
| RDMA transport | Network | Work request entries, qpair queue slots, RDMA CQ/SQ entries | transport_ack_timeout<br>rdma_srq_size<br>rdma_max_cq_size<br>rdma_cm_event_timeout_ms | transport_ack_timeout: RDMA ACK timeout. Default 0 (no timeout). Controls how long to wait for RDMA acknowledgments before declaring a path dead.<br>rdma_srq_size: shared receive queue size. Allows multiple qpairs to share receive buffers, reducing total memory. Default 0 (disabled, per-qpair RQ).<br>rdma_max_cq_size: max completion queue size. Default 0 (driver-chosen). Larger means more completions batched per poll.<br>rdma_cm_event_timeout_ms: timeout for RDMA connection manager events. |
| NVMf target buffering | Storage Node | Transport-side iobuf (same code path as baremetal) | num_shared_buffers<br>buf_cache_size | Same as baremetal transport buffering, but these are the storage node's independent settings: different machine, different pool, different pressure profile. |
| bdev/lvol execution | Storage Node | bdev_io + blobstore metadata | bdev_io_pool_size | The storage node's own bdev_io pool. lvol/blobstore also allocates internal request metadata for cluster mapping and blob I/O translation. |
Your baremetal is not just "running SPDK". It is running SPDK in two opposite roles on the same machine: guest-facing target work through VFIO-user and replica-facing initiator work through NVMe-oF RDMA. The same hot threads bridge those two worlds.
- app_thread (cpumask 0x10): rpc_subsystem_poll_servers, many timed bdev_nvme_poll_adminq, one active bdev_nvme_poll, and bdev_nvme_remove_poller.
- nvmf_tgt_poll_group_000, 001, 002 (cpumask 0xE0): nvmf_VFIOUSER, active bdev_nvme_poll, timed vfio_user_poll_vfu_ctx, many timed bdev_channel_poll_qos.

State scales as threads × controllers × channels × paths, not just "number of volumes". Each hot thread can own: a VFIO-user transport poll-group context, RAID bdev channels, NVMe bdev channels, QoS pollers, and a bdev management channel with its own caches. There is an spdk_bdev_mgmt_channel on each thread; this channel holds the per-thread bdev_io cache and the per-thread "bdev" iobuf cache. The transport's own per-poll-group cache lives in tgroup->buf_cache.

| Thing | Per process / global | Per thread | Per controller / path | Per thread × controller |
|---|---|---|---|---|
| Central iobuf small/large pools | yes | no | no | no |
| bdev global bdev_io mempool | yes | no | no | no |
| bdev mgmt channel caches (bdev_io, bdev iobuf) | no | yes | no | no |
| NVMf RDMA transport poll-group iobuf cache | no | yes (per poll-group thread) | no | no |
| VFIO-user transport cache | no | effectively none on your baremetal | no | no |
| NVMe controller object / admin queue / reconnect state | no | no | yes | no |
| Keep-alive pollers | no | depends where scheduled | controller-driven | effectively distributed with controller ownership |
| NVMe io_path / qpair state | no | no | partial | yes |
| IO channels (raid, nvme, lvol, etc.) | no | thread-local handles | refer to controller/path-backed objects | yes, often the real multiplier |
- small_pool_count=1048576 × small_bufsize=8192: exactly 8,589,934,592 bytes = 8 GiB of DMA-capable small buffers. This is central pool memory before any per-thread caching.
- large_pool_count=65536 × large_bufsize=131072: exactly 8,589,934,592 bytes = 8 GiB of DMA-capable large buffers.
- The iobuf_get_stats sample showed zero pressure; you gave yourself absurd headroom.
- iobuf_small_cache_size=128 × 8192: 1 MiB per thread pinned in the local "bdev" iobuf small cache once a bdev mgmt channel is populated.
- iobuf_large_cache_size=32 × 131072: 4 MiB per thread pinned in the local "bdev" iobuf large cache.
- Known busy threads: app_thread + nvmf_tgt_poll_group_000/001/002. Real pinned memory can be higher if more threads open bdev mgmt channels.
- bdev_io_cache_size=128: each bdev mgmt channel pre-populates 128 bdev_io objects from the global mempool. Across the same 4 known busy threads, that's 512 pre-reserved bdev_io objects. The exact byte size depends on sizeof(spdk_bdev_io) + max_driver_ctx_size, which varies with compiled modules, but the object count is exact.
- bdev_io_pool_size=1048576: you configured 1,048,576 bdev_io objects. This says very clearly: you do not want bdev_io scarcity to be the first bottleneck.
- The VFIO-user transport runs with num_shared_buffers=0 and buf_cache_size=0. It operates on guest memory directly instead of staging into transport-owned iobuf buffers.

| Layer | Knob | Your Value | Status | Why it matters on your box |
|---|---|---|---|---|
| NVMe initiator | keep_alive_timeout_ms | 30000 | set, non-default | Longer controller keepalive window. Less churn from transient stalls than the 10s upstream default. |
| NVMe initiator | reconnect_delay_sec | 2 | set, non-default | Reconnect attempts every 2s from init script. Still may be overridden per-controller later by diskengine attach. |
| NVMe initiator | ctrlr_loss_timeout_sec | 120 | set, non-default | Controller can stay in reconnect mode for 120s before being deleted. Much more forgiving than a 10s per-controller override. |
| NVMe initiator | fast_io_fail_timeout_sec | 0 | explicitly set | No fast-fail. I/O waits while recovery happens instead of failing quickly. |
| NVMe initiator | io_queue_requests | 128 | set, non-default | Deliberately conservative per-qpair queue depth. Strongly affects concurrency and qpair memory/state scale. |
| NVMe initiator | io_path_stat | true | set, non-default | Path statistics enabled. Helpful for visibility, small extra accounting cost. |
| bdev layer | bdev_io_pool_size | 1048576 | set, much larger than default | Massive global metadata-object pool. bdev_io starvation is very unlikely to be first bottleneck. |
| bdev layer | bdev_io_cache_size | 128 | set, lower than default 256 | Smaller per-thread cache than upstream default, but huge global pool makes this fine. Slightly less memory pinned per thread. |
| bdev iobuf cache | iobuf_small_cache_size | 128 | set | 1 MiB small-buffer cache per thread in the bdev module. |
| bdev iobuf cache | iobuf_large_cache_size | 32 | set, larger than default 16 | 4 MiB large-buffer cache per thread. Better reuse for larger IO, more memory pinned locally. |
| central iobuf | small_pool_count | 1048576 | set, enormous | 8 GiB small-pool headroom. This is why zero observed iobuf pressure is believable. |
| central iobuf | large_pool_count | 65536 | set, enormous | 8 GiB large-pool headroom. 128K-style workloads will not hit central starvation first. |
| central iobuf | small_bufsize | 8192 | set / same as common expectation | 4K and 8K I/O classify as small-pool traffic. |
| central iobuf | large_bufsize | 131072 | set | 16K–128K I/O classify as large-pool traffic on your host. |
| NVMf target config | poll-groups-mask | 0xE0 | set, topology-defining | Forces target poll groups to cores 5, 6, 7. This is why your runtime showed exactly 3 hot bridge threads. |
| VFIO-user transport | max_queue_depth | 128 | set, lower than VFIO-user default 256 | Constricts guest-facing queue depth. |
| VFIO-user transport | max_io_qpairs_per_ctrlr | 2 | set, much lower than VFIO-user default | Constricts guest-visible queue-pair fanout. Lower guest-side parallelism, less per-VM controller state. |
| VFIO-user transport | num_shared_buffers, buf_cache_size | effectively 0 | unset/default | Important: your baremetal guest-facing transport does not use transport iobuf caches the way RDMA target does. |
This latest dump is clearly the storage-node / RDMA-target runtime, not the baremetal VFIO-user bridge. The tell is the active poller mix: every hot thread is running nvmf_RDMA plus bdev_nvme_poll, and the app thread is running nvmf_rdma_accept. That is the signature of an RDMA target exporting namespaces while also polling local NVMe bdevs underneath.
- 28 poll-group threads (nvmf_tgt_poll_group_000 ... 027). nvmf_tgt_poll_group_027 is colocated with app_thread on lcore 4; all other poll-group threads sit alone on lcores 5–31.
- cpumask fffffff0: these threads are allowed on lcores 4–31. In the current runtime they are effectively spread one-per-reactor across that range, with the final poll-group sharing lcore 4 with the app thread.
- The app thread runs nvmf_rdma_accept (new RDMA connections), rpc_subsystem_poll_servers, many bdev_nvme_poll_adminq, and bdev_nvme_remove_poller. So lcore 4 is the acceptor / control-plane / adminq home on this storage node.
- Every nvmf_tgt_poll_group_* thread has active nvmf_RDMA and active bdev_nvme_poll. Translation: the same thread accepts/executes target requests and also polls the local NVMe-backed bdev path below lvol/blobstore.
- nvmf_ctrlr_keep_alive_poll pollers, 30s period: your dump shows many timed nvmf_ctrlr_keep_alive_poll instances per poll-group thread with period_ticks=60000000000 at a 2GHz tick rate = 30 seconds. That means controller keep-alive handling is distributed across the target poll groups, not centralized only on the app thread.

iobuf_get_stats Interpretation

| Module | Small pool | Large pool | Interpretation |
|---|---|---|---|
| accel | cache=0 main=0 retry=0 | cache=0 main=0 retry=0 | The acceleration framework is not using iobuf meaningfully in this sample. |
| bdev | cache=0 main=0 retry=0 | cache=0 main=0 retry=0 | Very important: the generic bdev iobuf path is not pressuring the system in this sample. That strongly suggests the observed hot path is dominated by transport-side buffering, not RAID-read-style or generic bdev staging. |
| nvmf_RDMA | cache=357,458,952 main=1 retry=0 | cache=1,776,058,924 main=24,723,075 retry=0 | This is the real storage-node story. RDMA target buffering is hot. Zero retries means no starvation. Huge cache counts mean requests are usually served from per-poll-group local cache. Non-zero main on the large pool means caches do refill from the central ring, but that is normal; it is not pressure by itself. The important signal is still retry=0. |
The hot counters above live in the nvmf_RDMA transport poll-group cache, not the generic bdev iobuf cache. So if you are debugging memory/buffer behavior here, start with transport-side request staging and poll-group fanout, not generic bdev buffer starvation.
The iobuf subsystem is SPDK's centralized DMA buffer management. It runs as a two-tier allocation scheme: a global central pool (per NUMA node) and per-thread local caches.
At startup, SPDK calls spdk_iobuf_initialize() which creates two pools per NUMA node:
Small pool: small_pool_count buffers × small_bufsize bytes each. Default: 8192 buffers × 8KB = 64MB of DMA memory. Backed by a single contiguous spdk_malloc() allocation with 4KB alignment. Stored in an spdk_ring (lock-free multi-producer/multi-consumer ring buffer).
Large pool: large_pool_count buffers × large_bufsize bytes each. Default: 1024 buffers × 132KB = ~132MB of DMA memory. Same structure. The 132KB default (not 128K) accounts for metadata that may need to travel alongside a 128K payload.
These are DMA-capable allocations (SPDK_MALLOC_DMA), meaning they can be used directly by RDMA NICs and NVMe controllers without additional memory registration.
When an iobuf channel is created (via spdk_iobuf_channel_init()), it pre-populates a local cache by dequeuing buffers from the central ring in batches of 64 (IOBUF_POPULATE_BATCH_SIZE).
Get path (spdk_iobuf_get()): First checks the local cache (STAILQ). If empty, dequeues a batch from the central ring (up to min(32, cache_size) at once). If the central ring is also empty and an entry is provided, the request is queued on a wait list and the entry's callback will be invoked when a buffer becomes available.
Put path (spdk_iobuf_put()): If there are waiters on the queue, the buffer goes directly to the first waiter (callback invoked immediately — zero-copy handoff). If no waiters, the buffer goes into the local cache. If the cache overflows (exceeds cache_size + batch_size), a batch is returned to the central ring.
Stats tracked per channel: cache (served from local cache), main (had to go to central ring), retry (central ring was empty, request queued). These are what iobuf_get_stats RPC reports.
In spdk_iobuf_get(), the decision is simple:
- len ≤ cache->small.bufsize → use the small pool
- otherwise → use the large pool (asserts len ≤ cache->large.bufsize)
On your host with defaults: 4K or 8K payloads → small pool. 16K–128K payloads → large pool.
This means a workload doing 4K random I/Os pressures the small pool, while sequential 128K I/Os pressure the large pool. They are completely independent tuning problems.
Each consumer must call spdk_iobuf_register_module(name) before creating channels. The registered modules in your stack:
"bdev" — the generic bdev layer. Creates one iobuf channel per bdev management channel (per-thread). Uses iobuf_small_cache_size (default 128) and iobuf_large_cache_size (default 16) from bdev_set_options.
"nvmf_RDMA" / "nvmf_VFIOUSER" — NVMf transports. Create one iobuf channel per transport poll group. Use buf_cache_size from nvmf_create_transport. VFIO-user defaults to 0 (no shared buffers needed since it uses guest memory). RDMA defaults to auto-calculated from num_shared_buffers.
"accel" — the acceleration framework. Also has its own iobuf cache sizes.
The iobuf_get_stats RPC iterates all modules and their channels, summing cache/main/retry stats. This is your primary diagnostic for buffer pressure.
The total global pool might be 8192 small buffers. But if you have 16 SPDK threads, each wanting a cache of 128, that's 2048 buffers locked in caches before any I/O happens. Add NVMf transport poll groups (one per core) each wanting their own cache, and you can exhaust the central pool at startup.
Formula: total needed ≥ (num_bdev_threads × iobuf_small_cache_size) + (num_poll_groups × buf_cache_size) + headroom
SPDK ships a helper: scripts/calc-iobuf.py — use it to calculate the right pool sizes for your deployment.
SPDK doesn't use interrupts or kernel threads for I/O. Everything runs in userspace polling loops.
Reactor = the outer per-core loop (lib/event/reactor.c). One reactor per CPU core. It drains event queues, then iterates all SPDK threads assigned to that core, calling spdk_thread_poll(thread, ...) for each one. The reactor is the scheduler.
SPDK Thread = the inner logical execution context (lib/thread/thread.c). Owns pollers, IO channels, and state. All state transitions must happen on the owning thread. spdk_thread_poll() makes the thread current, runs its messages, runs active pollers, runs timed pollers when due, and updates stats.
Poller = repeated work registered on a thread. Concrete examples from your stack:
- bdev_nvme_poll (module/bdev/nvme/bdev_nvme.c:3924): progresses remote RDMA qpairs, polls for completions
- bdev_nvme_poll_adminq (module/bdev/nvme/bdev_nvme.c:6108): polls the admin queue for health, discovery, and AER events. Period: nvme_adminq_poll_period_us (default 10ms)
- vfio_user_poll_vfu_ctx (lib/nvmf/vfio_user.c:4338): picks up guest commands from VFIO-user queues
- nvmf_tgroup_poll (lib/nvmf/rdma.c): progresses the RDMA target request state machine

In your prod: a baremetal core runs a reactor → that reactor polls SPDK threads → one thread owns an NVMe poll group with a poller progressing remote RDMA qpairs → another thread may own VFIO-user controller work. This is why "what core did it run on?" matters.
- io_unit_size / max_io_size: checked against small_bufsize and large_bufsize from iobuf. The critical decision: if io_unit_size ≤ small_bufsize, only the small iobuf pool is used for transport buffers. Otherwise both pools are used.
- buf_cache_size: auto-calculated as (num_shared_buffers × 3/4) / num_poll_groups. This means 75% of the "reserved" buffers are distributed as caches, keeping 25% for burst absorption. Set explicitly to override.
- io_queue_requests: defaults to the controller-reported maximum (MQES). Each slot consumes a tracking structure and an RDMA work request. Too low caps concurrency and throughput. Too high wastes memory per controller and makes recovery after path failures slower (more in-flight I/Os to drain/retry).
- reconnect_delay_sec: works together with ctrlr_loss_timeout_sec to define the reconnection window.
- action_on_timeout: none (log only), abort (send abort command), reset (reset the controller). Only active if timeout_us is non-zero.
- timeout_us: per-command timeout that triggers action_on_timeout. Separate from transport retries.
- small_pool_count: number of small buffers, each small_bufsize bytes. Total small pool memory = small_pool_count × small_bufsize. At defaults: 8192 × 8KB = 64MB. This pool serves ALL iobuf consumers: bdev layer, NVMf transports, accel framework. Must be large enough for: Σ(all per-thread caches) + burst headroom.
- large_pool_count: number of large buffers, each large_bufsize bytes. Total large pool memory = large_pool_count × large_bufsize. At defaults: 1024 × 132KB ≈ 132MB. Large buffers are used for I/Os > small_bufsize (8K). Sequential 128K workloads live here.
- small_bufsize: the small/large boundary; len ≤ small_bufsize uses the small pool. Increasing this means fewer I/Os hit the large pool, but each small buffer consumes more DMA memory.
- large_bufsize: the maximum buffer size; io_unit_size must be ≤ this value. Increasing this allows larger max I/O sizes but costs proportionally more DMA memory.
- Note: iobuf_set_options can only be called during SPDK startup (before subsystems initialize). It cannot be changed at runtime. Get it wrong and you must restart the SPDK process.
- bdev_io_pool_size: total bdev_io objects in the global mempool. A bdev_io is the metadata envelope for every block I/O request; it tracks the bdev, offset, length, iovecs, completion callback, driver context (like raid_bdev_io or nvme_bdev_io), and internal state. It does NOT contain payload data. Must satisfy: pool_size ≥ bdev_io_cache_size × (num_threads + 1). If you have 32 threads with cache_size 256, the minimum pool is 8448.
- bdev_io_cache_size: per-thread cache of bdev_io objects. At channel creation, each thread bulk-dequeues this many from the global pool into a thread-local STAILQ. This guarantees that under normal load, bdev_io allocation is a simple linked-list pop: no lock contention, no global pool access. Only when the cache is exhausted does a thread go to the global mempool.
- iobuf_small_cache_size / iobuf_large_cache_size: the bdev module's per-thread iobuf caches, analogous to the transport's buf_cache_size. The bdev layer uses these when spdk_bdev_io_get_buf() is called (e.g., by RAID for reads, or by the NVMe bdev for reads without a pre-existing buffer). Higher values reduce central pool contention but consume more buffers from the global pool at startup.

For one VM I/O on your baremetal:
At each step, a different kind of object is allocated: transport request, bdev_io, iobuf payload buffer, or qpair request slot. If you ask "which one of those is pressuring the system?", the tuning problem becomes tractable.
Based on your baremetal runtime samples: hot threads bridge VFIO-user and NVMe initiator work, many remote controllers are active, each replica namespace has two enabled RDMA paths, and iobuf_get_stats showed zero retries. Your first-order cost is controller/qpair/poller scale, not iobuf starvation. Focus on io_queue_requests, poll-group layout, and controller/path count before increasing buffer pools.
Useful diagnostic RPCs: framework_get_reactors, thread_get_pollers, thread_get_io_channels, iobuf_get_stats, bdev_nvme_get_controllers, bdev_nvme_get_io_paths, bdev_raid_get_bdevs, nvmf_get_subsystems