- vdpa-dev: Fix initialisation order to restore VDUSE compatibility
- tcg: Allow top bit of SIMD_DATA_BITS to be set in simd_desc()
- migration: fix-possible-int-overflow
- target/m68k: Map FPU exceptions to FPSR register
- qemu-options: Fix CXL Fixed Memory Window interleave-granularity typo
- hvf: arm: Fix encodings for ID_AA64PFR1_EL1 and debug System registers
- hw/intc/arm_gic: Fix handling of NS view of GICC_APR<n>
- qio: Inherit follow_coroutine_ctx across TLS
- target/riscv: Fix the element agnostic function problem
- accel/tcg: Fix typo causing tb->page_addr[1] to not be recorded
- tcg/loongarch64: Fix tcg_out_movi vs some pcrel pointers
- migration: Fix file migration with fdset
- ui/vnc: don't return an empty SASL mechlist to the client
- target/arm: Fix FJCVTZS vs flush-to-zero
- hw/ppc/e500: Prefer QOM cast
- sphinx/qapidoc: Fix to generate doc for explicit, unboxed arguments
- hw/ppc/e500: Remove unused "irqs" parameter
- hw/ppc/e500: Add missing device tree properties to i2c controller node
- hw/i386/amd_iommu: Don't leak memory in amdvi_update_iotlb()
- hw/arm/mps2-tz.c: fix RX/TX interrupts order
- target/i386: csv: Add support to migrate the incoming context for CSV3 guest
- target/i386: csv: Add support to migrate the outgoing context for CSV3 guest
- target/i386: csv: Add support to migrate the incoming page for CSV3 guest
- target/i386: csv: Add support to migrate the outgoing page for CSV3 guest
- linux-headers: update kernel headers to include CSV3 migration cmds
- vfio: Only map shared region for CSV3 virtual machine
- vga: Force full update for CSV3 guest
- target/i386: csv: Load initial image to private memory for CSV3 guest
- target/i386: csv: Do not register/unregister guest secure memory for CSV3 guest
- target/i386: cpu: Populate CPUID 0x8000_001F when CSV3 is active
- target/i386: csv: Add command to load vmcb to CSV3 guest memory
- target/i386: csv: Add command to load data to CSV3 guest memory
- target/i386: csv: Add command to initialize CSV3 context
- target/i386: csv: Add CSV3 context
- next-kbd: convert to use qemu_input_handler_register()
- qemu/bswap: Undefine CPU_CONVERT() once done
- exec/memop: Remove unused memop_big_endian() helper
- hw/nvme: fix handling of over-committed queues
- 9pfs: fix crash on 'Treaddir' request
- hw/misc/psp: Pin the hugepage memory specified by mem2 during use for psp
- hw/misc: support tkm use mem2 memory
- hw/i386: add mem2 option for qemu
- kvm: add support for guest physical bits
- target/i386: add guest-phys-bits cpu property

Signed-off-by: Jiabo Feng <fengjiabo1@huawei.com>
(cherry picked from commit f45f35e88509a4ffa9f62332ee9601e9fe1f8d09)

From c4423b70160eb7ae91dac9f2cf61513758ee017d Mon Sep 17 00:00:00 2001
From: Klaus Jensen <k.jensen@samsung.com>
Date: Tue, 29 Oct 2024 13:15:19 +0100
Subject: [PATCH] hw/nvme: fix handling of over-committed queues

If a host chooses to use the SQHD "hint" in the CQE to know if there is
room in the submission queue for additional commands, it may result in
a situation where there are not enough internal resources (struct
NvmeRequest) available to process the command. For lack of a better
term, the host may "over-commit" the device (i.e., it may have more
inflight commands than the queue size).

For example, assume a queue with N entries. The host submits N commands
and all are picked up for processing, advancing the head and emptying
the queue. Regardless of which of these N commands completes first, the
SQHD field of that CQE will indicate to the host that the queue is
empty, which allows the host to issue N commands again. However, if the
device has not posted CQEs for all the previous commands yet, it will
have fewer than N resources available to process the commands, so queue
processing is suspended.

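Purely as an illustrative sketch (not code from this patch, the Linux
driver, or OSv; all names below are made up), a host relying on SQHD
might account for free submission-queue slots like this:

    #include <stdint.h>

    /* Hypothetical host-side view of one submission queue. */
    struct sq_view {
        uint16_t size;  /* queue depth N */
        uint16_t tail;  /* next slot the host will fill */
        uint16_t head;  /* last SQHD value observed in a CQE */
    };

    /* Usable slots, with one entry kept unused so that the full and
     * empty states remain distinguishable. */
    static uint16_t sq_free_slots(const struct sq_view *sq)
    {
        return (uint16_t)((sq->head + sq->size - sq->tail - 1) % sq->size);
    }

    /* SQHD only reports how far the device has fetched commands, not
     * how many it has completed, which is exactly what permits the
     * over-commit described above. */
    static void sq_note_cqe(struct sq_view *sq, uint16_t sqhd)
    {
        sq->head = sqhd;
    }

With this accounting, sq_free_slots() can report the queue as empty
even while the device still holds CQEs (and thus NvmeRequest resources)
for earlier commands.
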
And here lies an 11-year-old latent bug. In the absence of any
additional tail updates on the submission queue, we never schedule the
processing bottom-half again unless we observe a head update on an
associated full completion queue. This has been sufficient to handle
N-to-1 SQ/CQ setups (in the absence of over-commit, of course).
Incidentally, that "kick all associated SQs" mechanism can now be
removed, since we now just schedule queue processing when we return a
processing resource to a non-empty submission queue, which happens to
cover both edge cases. However, we must retain kicking the CQ if it was
previously full.

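Condensed from the hunks below (context and surrounding code omitted),
the two points that re-arm queue processing after this change are:

    /* nvme_post_cqes(): a CQE was just posted; if the SQ had run out
     * of free NvmeRequest resources but still has commands queued,
     * kick its bottom half before handing the resource back. */
    if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
        qemu_bh_schedule(sq->bh);
    }

    /* nvme_process_db(): on a CQ head doorbell write, if the CQ was
     * full, CQE posting had been suspended, so resume it. */
    if (nvme_cq_full(cq)) {
        qemu_bh_schedule(cq->bh);
    }
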
So, apparently, no previous driver tested with hw/nvme has ever used
SQHD (e.g., neither the Linux NVMe driver nor SPDK uses it). But then
OSv shows up with a driver that actually does. I salute you.

Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Zhongrui Tang <tangzhongrui_yewu@cmss.chinamobile.com>
---
 hw/nvme/ctrl.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 104aebc5ea..29445938d5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1516,9 +1516,16 @@ static void nvme_post_cqes(void *opaque)
             stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
             break;
         }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
+
         nvme_inc_cq_tail(cq);
         nvme_sg_unmap(&req->sg);
+
+        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
+            qemu_bh_schedule(sq->bh);
+        }
+
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -7575,7 +7582,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         /* Completion queue doorbell write */
 
         uint16_t new_head = val & 0xffff;
-        int start_sqs;
         NvmeCQueue *cq;
 
         qid = (addr - (0x1000 + (1 << 2))) >> 3;
@@ -7626,18 +7632,15 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
 
-        start_sqs = nvme_cq_full(cq) ? 1 : 0;
+        /* scheduled deferred cqe posting if queue was previously full */
+        if (nvme_cq_full(cq)) {
+            qemu_bh_schedule(cq->bh);
+        }
+
         cq->head = new_head;
         if (!qid && n->dbbuf_enabled) {
             stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
         }
-        if (start_sqs) {
-            NvmeSQueue *sq;
-            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-                qemu_bh_schedule(sq->bh);
-            }
-            qemu_bh_schedule(cq->bh);
-        }
 
         if (cq->tail == cq->head) {
             if (cq->irq_enabled) {
--
2.41.0.windows.1