glibc/x86-Only-align-destination-to-1x-VEC_SIZE-in-memset-.patch


From 5a64f933655384477d85122c6855dc6d84061810 Mon Sep 17 00:00:00 2001
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed, 1 Nov 2023 15:30:26 -0500
Subject: [PATCH] x86: Only align destination to 1x VEC_SIZE in memset 4x
loop

Current code aligns to 2x VEC_SIZE. Aligning to 2x has no effect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.

Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9469261cf1924d350feeec64d2c80cafbbdcdd4d)
---
sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 3d9ad49cb9..0f0636b90f 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -293,7 +293,7 @@ L(more_2x_vec):
leaq (VEC_SIZE * 4)(%rax), %LOOP_REG
#endif
/* Align dst for loop. */
- andq $(VEC_SIZE * -2), %LOOP_REG
+ andq $(VEC_SIZE * -1), %LOOP_REG
.p2align 4
L(loop):
VMOVA %VMM(0), LOOP_4X_OFFSET(%LOOP_REG)
--
2.27.0