107692]
Followed by the discussion in pr107692, -munroll-only-small-loops
Does not turns on/off -funroll-loops, and current check in
pass_rtl_unroll_loops::gate would cause -fno-unroll-loops do not take
effect. Revert the change about targetm.loop_unroll_adjust and apply
the backend option change to strictly follow the rule that
-funroll-loops takes full control of loop unrolling, and
munroll-only-small-loops just change its behavior to unroll small size
loops.
gcc/ChangeLog:
PR target/107692
* common/config/i386/i386-common.cc (ix86_optimization_table):
Enable loop unroll O2, disable -fweb and -frename-registers
by default.
* config/i386/i386-options.cc
(ix86_override_options_after_change):
Disable small loop unroll when funroll-loops enabled, reset
cunroll_grow_size when it is not explicitly enabled.
(ix86_option_override_internal): Call
ix86_override_options_after_change instead of calling
ix86_recompute_optlev_based_flags and ix86_default_align
separately.
* config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
factor if -munroll-only-small-loops enabled.
* loop-init.cc (pass_rtl_unroll_loops::gate): Do not enable
loop unrolling for -O2-speed.
(pass_rtl_unroll_loops::execute): Rmove
targetm.loop_unroll_adjust check.
gcc/testsuite/ChangeLog:
PR target/107692
* gcc.dg/guality/loop-1.c: Remove additional option for ia32.
* gcc.target/i386/pr86270.c: Add -fno-unroll-loops.
* gcc.target/i386/pr93002.c: Likewise.
Modern processors has multiple way instruction decoders
For x86, icelake/zen3 has 5 uops, so for small loop with <= 4
instructions (usually has 3 uops with a cmp/jmp pair that can be
macro-fused), the decoder would have 2 uops bubble for each iteration
and the pipeline could not be fully utilized.
Therefore, this patch enables loop unrolling for small size loop at O2
to fullfill the decoder as much as possible. It turns on rtl loop
unrolling when targetm.loop_unroll_adjust exists and O2 plus speed only.
In x86 backend the default behavior is to unroll small loops with less
than 4 insns by 1 time.
This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with
0.9% codesize increment. For other benchmarks the variants are minor
and overall codesize increased by 0.2%.
The kernel image size increased by 0.06%, and no impact on eembc.
gcc/ChangeLog:
* common/config/i386/i386-common.cc (ix86_optimization_table):
Enable small loop unroll at O2 by default.
* config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
factor if -munroll-only-small-loops enabled and -funroll-loops/
-funroll-all-loops are disabled.
* config/i386/i386.h (struct processor_costs): Add 2 field
small_unroll_ninsns and small_unroll_factor.
* config/i386/i386.opt: Add -munroll-only-small-loops.
* doc/invoke.texi: Document -munroll-only-small-loops.
* loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
loop unrolling for -O2-speed and above if target hook
loop_unroll_adjust exists.
(pass_rtl_unroll_loops::execute): Set UAP_UNROLL flag
when target hook loop_unroll_adjust exists.
* config/i386/x86-tune-costs.h: Update all processor costs
with small_unroll_ninsns = 4 and small_unroll_factor = 2.
gcc/testsuite/ChangeLog:
* gcc.dg/guality/loop-1.c: Add additional option
-mno-unroll-only-small-loops.
* gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
* gcc.target/i386/pr93002.c: Likewise.
We set up INLINE_HINT_known_hot hint only when we have profile feedback,
now add function attribute judgement for it, when both caller and callee
have __attribute__((hot)), we will also set up INLINE_HINT_known_hot hint
for it.
With this patch applied,
ADL Multi-copy: 538.imagic_r 16.7%
ICX Multi-copy: 538.imagic_r 15.2%
CLX Multi-copy: 538.imagic_r 12.7%
Znver3 Multi-copy: 538.imagic_r 10.6%
Arm Multi-copy: 538.imagic_r 13.4%
gcc/ChangeLog
* ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
judgement for INLINE_HINT_known_hot hint.
gcc/testsuite/ChangeLog:
* gcc.dg/ipa/inlinehint-6.c: New test.
and fix bugs
- add-fp-model-options.patch: New file
- enable-simd-math.patch: Enable simd math library in C and Fortran
- fix-CTOR-vectorization.patch: New file
- fix-range-set-by-vectorization-on-niter-IVs.patch: New file
- medium-code-mode.patch: Fix bugs when used with fpic
- optabs-Dont-use-scalar-conversions-for-vectors.patch: New file
- PR92429-do-not-fold-when-updating.patch: New file
- redundant-loop-elimination.patch: Fix some programming specifications
- fix-ICE-in-vect.patch: New file
- Fix-type-mismatch-in-SLPed-constructors.patch: New file
- add-check-for-pressure-in-sche1.patch: New file
- revert-moutline-atomics.patch: New file
- fix-ICE-in-eliminate-stmt.patch: New file
- revise-type-before-build-MULT.patch: New file
- Simplify-X-C1-C2.patch: New file
- gcc.spec: Add new patches