qemu/hw-nvme-fix-handling-of-over-committed-queues.patch
Jiabo Feng dddaf7aa9e QEMU update to version 8.2.0-26:
- vdpa-dev: Fix initialisation order to restore VDUSE compatibility
- tcg: Allow top bit of SIMD_DATA_BITS to be set in simd_desc()
- migration: fix-possible-int-overflow
- target/m68k: Map FPU exceptions to FPSR register
- qemu-options: Fix CXL Fixed Memory Window interleave-granularity typo
- hvf: arm: Fix encodings for ID_AA64PFR1_EL1 and debug System registers
- hw/intc/arm_gic: Fix handling of NS view of GICC_APR<n>
- qio: Inherit follow_coroutine_ctx across TLS
- target/riscv: Fix the element agnostic function problem
- accel/tcg: Fix typo causing tb->page_addr[1] to not be recorded
- tcg/loongarch64: Fix tcg_out_movi vs some pcrel pointers
- migration: Fix file migration with fdset
- ui/vnc: don't return an empty SASL mechlist to the client
- target/arm: Fix FJCVTZS vs flush-to-zero
- hw/ppc/e500: Prefer QOM cast
- sphinx/qapidoc: Fix to generate doc for explicit, unboxed arguments
- hw/ppc/e500: Remove unused "irqs" parameter
- hw/ppc/e500: Add missing device tree properties to i2c controller node
- hw/i386/amd_iommu: Don't leak memory in amdvi_update_iotlb()
- hw/arm/mps2-tz.c: fix RX/TX interrupts order
- target/i386: csv: Add support to migrate the incoming context for CSV3 guest
- target/i386: csv: Add support to migrate the outgoing context for CSV3 guest
- target/i386: csv: Add support to migrate the incoming page for CSV3 guest
- target/i386: csv: Add support to migrate the outgoing page for CSV3 guest
- linux-headers: update kernel headers to include CSV3 migration cmds
- vfio: Only map shared region for CSV3 virtual machine
- vga: Force full update for CSV3 guest
- target/i386: csv: Load initial image to private memory for CSV3 guest
- target/i386: csv: Do not register/unregister guest secure memory for CSV3 guest
- target/i386: cpu: Populate CPUID 0x8000_001F when CSV3 is active
- target/i386: csv: Add command to load vmcb to CSV3 guest memory
- target/i386: csv: Add command to load data to CSV3 guest memory
- target/i386: csv: Add command to initialize CSV3 context
- target/i386: csv: Add CSV3 context
- next-kbd: convert to use qemu_input_handler_register()
- qemu/bswap: Undefine CPU_CONVERT() once done
- exec/memop: Remove unused memop_big_endian() helper
- hw/nvme: fix handling of over-committed queues
- 9pfs: fix crash on 'Treaddir' request
- hw/misc/psp: Pin the hugepage memory specified by mem2 during use for psp
- hw/misc: support tkm use mem2 memory
- hw/i386: add mem2 option for qemu
- kvm: add support for guest physical bits
- target/i386: add guest-phys-bits cpu property

Signed-off-by: Jiabo Feng <fengjiabo1@huawei.com>
(cherry picked from commit f45f35e88509a4ffa9f62332ee9601e9fe1f8d09)

From c4423b70160eb7ae91dac9f2cf61513758ee017d Mon Sep 17 00:00:00 2001
From: Klaus Jensen <k.jensen@samsung.com>
Date: Tue, 29 Oct 2024 13:15:19 +0100
Subject: [PATCH] hw/nvme: fix handling of over-committed queues

If a host chooses to use the SQHD "hint" in the CQE to know if there is
room in the submission queue for additional commands, it may result in a
situation where there are not enough internal resources (struct
NvmeRequest) available to process the command. For lack of a better
term, the host may "over-commit" the device (i.e., it may have more
inflight commands than the queue size).
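
As a minimal sketch of what SQHD-based accounting can look like on the
host side (hypothetical code, not taken from QEMU or any particular
driver), the host derives the number of free submission queue slots
from the head value reported in the most recent CQE:

    #include <stdint.h>

    /* Hypothetical helper: number of free SQ slots, given the SQHD
     * value from the latest CQE and the host's current SQ tail. Note
     * that slots count as free as soon as the device has *fetched* the
     * commands, not once it has completed them -- which is exactly what
     * permits the over-commit described above. */
    static inline uint16_t sq_free_slots(uint16_t sqhd, uint16_t sq_tail,
                                         uint16_t qsize)
    {
        /* one slot is kept unused to distinguish full from empty */
        return (sqhd + qsize - sq_tail - 1) % qsize;
    }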

For example, assume a queue with N entries. The host submits N commands
and all are picked up for processing, advancing the head and emptying
the queue. Regardless of which of these N commands completes first, the
SQHD field of that CQE will indicate to the host that the queue is
empty, which allows the host to issue N commands again. However, if the
device has not yet posted CQEs for all the previous commands, the device
will have fewer than N resources available to process the commands, so
queue processing is suspended.
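
Why processing then stays suspended can be seen in the shape of the
nvme_process_sq() bottom-half in hw/nvme/ctrl.c; the following is an
abridged paraphrase of its main loop, not the verbatim code:

    /* The loop stops when the SQ is empty OR when no free NvmeRequest
     * is left on sq->req_list. In the over-commit case the second
     * condition fires first, and (before this fix) nothing rescheduled
     * the bottom-half unless a doorbell write landed on a full CQ. */
    while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
        /* pop an NvmeRequest, fetch the SQE, dispatch the command ... */
    }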

And here lies an 11-year-old latent bug. In the absence of any
additional tail updates on the submission queue, we never schedule the
processing bottom-half again unless we observe a head update on an
associated full completion queue. This has been sufficient to handle
N-to-1 SQ/CQ setups (in the absence of over-commit, of course).
Incidentally, that "kick all associated SQs" mechanism can now be
killed, since we now just schedule queue processing when we return a
processing resource to a non-empty submission queue, which happens to
cover both edge cases. However, we must retain kicking the CQ if it was
previously full.
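
Concretely, the two scheduling points after this change (summarizing
the diff below) are:

    /* 1) In nvme_post_cqes(): when a processing resource (NvmeRequest)
     *    is returned to a submission queue that is not yet empty, kick
     *    that SQ's bottom-half directly: */
    if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
        qemu_bh_schedule(sq->bh);
    }

    /* 2) In nvme_process_db(): on a CQ head doorbell update, kick the
     *    CQ's bottom-half only if the queue was previously full, so
     *    that deferred CQEs get posted: */
    if (nvme_cq_full(cq)) {
        qemu_bh_schedule(cq->bh);
    }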

So, apparently, no previous driver tested with hw/nvme has ever used
SQHD (e.g., neither the Linux NVMe driver nor SPDK uses it). But then
OSv shows up with a driver that actually does. I salute you.

Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Zhongrui Tang <tangzhongrui_yewu@cmss.chinamobile.com>
---
 hw/nvme/ctrl.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 104aebc5ea..29445938d5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1516,9 +1516,16 @@ static void nvme_post_cqes(void *opaque)
             stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
             break;
         }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
+
         nvme_inc_cq_tail(cq);
         nvme_sg_unmap(&req->sg);
+
+        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
+            qemu_bh_schedule(sq->bh);
+        }
+
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -7575,7 +7582,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         /* Completion queue doorbell write */
 
         uint16_t new_head = val & 0xffff;
-        int start_sqs;
         NvmeCQueue *cq;
 
         qid = (addr - (0x1000 + (1 << 2))) >> 3;
@@ -7626,18 +7632,15 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
 
-        start_sqs = nvme_cq_full(cq) ? 1 : 0;
+        /* schedule deferred cqe posting if queue was previously full */
+        if (nvme_cq_full(cq)) {
+            qemu_bh_schedule(cq->bh);
+        }
+
         cq->head = new_head;
         if (!qid && n->dbbuf_enabled) {
             stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
         }
-        if (start_sqs) {
-            NvmeSQueue *sq;
-            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-                qemu_bh_schedule(sq->bh);
-            }
-            qemu_bh_schedule(cq->bh);
-        }
 
         if (cq->tail == cq->head) {
             if (cq->irq_enabled) {
--
2.41.0.windows.1