performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs


Related, but not the same, question: Performance optimisations of x86-64 assembly - Alignment and branch prediction. Also somewhat related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++.

The following is not a real-world test case: this primality testing algorithm is not sensible. I suspect no real-world algorithm would ever execute such a small inner loop quite so many times (num is a prime of size about 2**50). In C++11:

using nt = unsigned long long;
bool is_prime_float(nt num) {
   for (nt n = 2; n <= sqrt(num); ++n) {
      if ( (num % n) == 0 ) { return false; }
   }
   return true;
}

Then g++ -std=c++11 -O3 -S produces the following, with rcx containing n and xmm6 containing sqrt(num). See my previous post for all the remaining code (which is never executed in this example, as rcx never becomes large enough to be treated as signed negative).

    jmp .L20
    .p2align 4,,10
.L37:
    pxor    %xmm0, %xmm0
    cvtsi2sdq   %rcx, %xmm0
    ucomisd %xmm0, %xmm6
    jb  .L36   // exit the loop
.L20:
    xorl    %edx, %edx
    movq    %rbx, %rax
    divq    %rcx
    testq   %rdx, %rdx
    je  .L30   // failed the divisibility test
    addq    $1, %rcx
    jns .L37
// further code to deal with the case when ucomisd can't be used

I time this using std::chrono::steady_clock. I kept getting weird performance changes simply from adding or deleting other code, and eventually tracked this down to an alignment issue. The directive .p2align 4,,10 tries to align to a 2**4 = 16-byte boundary, but only uses at most 10 bytes of padding to do so, I guess to balance alignment against code size.
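A minimal timing harness along these lines might look as follows (just a sketch, not the exact harness used; the test value is the ~2**50 prime that appears later in the post):

#include <chrono>
#include <cmath>
#include <iostream>

using nt = unsigned long long;

// the function under test, as defined above
bool is_prime_float(nt num) {
   for (nt n = 2; n <= sqrt(num); ++n) {
      if ( (num % n) == 0 ) { return false; }
   }
   return true;
}

int main() {
    const nt num = 1000000000000037ULL;   // a prime of size roughly 2**50
    auto start = std::chrono::steady_clock::now();
    bool result = is_prime_float(num);
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> secs = stop - start;
    std::cout << num << (result ? " is prime" : " is not prime")
              << " -- " << secs.count() << " secs\n";
    return 0;
}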

I wrote a python script to replace .p2align 4,,10 with a manually controlled number of nop instructions (a sketch of that substitution follows the plot). The following scatter plot shows the fastest 15 of 20 runs, with time in seconds and the number of bytes of padding on the x-axis:

[Scatter plot: run time in seconds vs. number of bytes of padding]
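The substitution itself is simple; here is a rough sketch of the idea (the actual script was in Python and isn't shown, so this C++ version, including its file handling and usage, is purely illustrative):

#include <fstream>
#include <iostream>
#include <string>

// Copy an assembly file, replacing each ".p2align 4,,10" directive with a
// chosen number of one-byte nop instructions. (The real script presumably
// targeted only the directive in front of .L37.)
int main(int argc, char** argv) {
    if (argc != 4) {
        std::cerr << "usage: pad <in.s> <out.s> <num_nops>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);
    int nops = std::stoi(argv[3]);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(".p2align 4,,10") != std::string::npos) {
            for (int i = 0; i < nops; ++i)
                out << "\tnop\n";   // one byte of padding per nop
        } else {
            out << line << '\n';
        }
    }
    return 0;
}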

From objdump, with no padding, the pxor instruction occurs at offset 0x402f5f. Running this on my laptop, a Sandybridge i5-3210M with TurboBoost disabled, I found that

  • For 0 bytes of padding: slow performance (0.42 secs).
  • For 1-4 bytes of padding (offset 0x402f60 to 0x402f63): slightly better (0.41 secs, visible on the plot).
  • For 5-20 bytes of padding (offset 0x402f64 to 0x402f73): fast performance (0.37 secs).
  • For 21-32 bytes of padding (offset 0x402f74 to 0x402f7f): slow performance (0.42 secs).
  • Then the pattern repeats with a 32-byte period.

So a 16-byte alignment doesn't give the best performance -- it just puts us in the slightly better (or lower-variation, judging from the scatter plot) region. Alignment of 32 plus 4 to 19 bytes gives the best performance.

Why am I seeing this performance difference? Why does it seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)?

I don't see any branch-prediction problems. Could this be a uop cache quirk?

By changing the C++ algorithm to cache sqrt(num) in a 64-bit integer and make the loop purely integer based, I remove the problem -- alignment now makes no difference at all.
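A minimal sketch of that integer-only variant (the exact code isn't shown, so this just illustrates the obvious transformation):

#include <cmath>

using nt = unsigned long long;

bool is_prime_int(nt num) {
   const nt limit = static_cast<nt>(sqrt(num));  // computed once, outside the loop
   for (nt n = 2; n <= limit; ++n) {             // the loop is now purely integer
      if ( (num % n) == 0 ) { return false; }
   }
   return true;
}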

Here's what I found on Skylake for the same loop. All the code to reproduce my tests on your hardware is on github.

I observe 3 different performance levels based on alignment, whereas the OP only saw 2 primary ones. The levels are distinct and repeatable2:

[Plot: performance vs. offset from 32-byte alignment, showing three distinct levels]

We see 3 distinct performance levels here (the pattern repeats starting at offset 32), which we'll call regions 1, 2 and 3, from left to right (region 2 is split into two parts straddling region 3). The fastest region (1) is from offset 0 to 8, the middle (2) region is from 9-18 and 28-31, and the slowest (3) is from 19-27. The difference between each region is close to or exactly 1 cycle/iteration.

Based on the performance counters, the fastest region is very different from the other two:

  • All the instructions are delivered from the legacy decoder, not from the DSB1.
  • There are exactly 2 decoder <-> microcode switches (idq_ms_switches) for every iteration of the loop.

On the other hand, the 2 slower regions are fairly similar:

  • All the instructions are delivered from the DSB (uop cache), and not from the legacy decoder.
  • There are exactly 3 decoder <-> microcode switches per iteration of the loop.

The transition from the fastest to the middle region, as the offset changes from 8 to 9, corresponds exactly to when the loop starts fitting in the uop buffer, because of alignment issues. You can count this out in exactly the same way Peter did in his answer:

Offset 8:

lsd? <_start.L37>:
  ab 1 4000a8:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1 4000ac:  f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1 4000b1:  66 0f 2e f0             ucomisd xmm6,xmm0
  ab 1 4000b5:  72 21                   jb     4000d8 <_start.L36>
  ab 2 4000b7:  31 d2                   xor    edx,edx
  ab 2 4000b9:  48 89 d8                mov    rax,rbx
  ab 3 4000bc:  48 f7 f1                div    rcx
  !!!! 4000bf:  48 85 d2                test   rdx,rdx
       4000c2:  74 0d                   je     4000d1 <_start.L30>
       4000c4:  48 83 c1 01             add    rcx,0x1
       4000c8:  79 de                   jns    4000a8 <_start.L37>

In the first column I've annotated how the uops for each instruction end up in the uop cache. "ab 1" means they go in the set associated with addresses like ...???a? or ...???b? (each set covers 32 bytes, aka 0x20), while 1 means way 1 (out of a max of 3).

At the point marked !!!! this busts out of the uop cache, because the test instruction has nowhere to go: all 3 ways are used up.
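To make the set mapping concrete, here is a small sketch that just applies the "each set covers 32 bytes" rule to the instruction start addresses above; it ignores the other uop-cache constraints (such as the per-way uop limit), so it only illustrates which 32-byte window each instruction belongs to:

#include <cstdint>
#include <cstdio>

// Which 32-byte code window (and hence uop-cache set) does an instruction
// start in? This models only the "each set covers 32 bytes" rule above.
static uint64_t window(uint64_t addr) { return addr & ~0x1fULL; }

int main() {
    // Instruction start addresses for the offset-8 layout above.
    const uint64_t addrs[] = {0x4000a8, 0x4000ac, 0x4000b1, 0x4000b5,  // pxor .. jb
                              0x4000b7, 0x4000b9, 0x4000bc,            // xor, mov, div
                              0x4000bf, 0x4000c2, 0x4000c4, 0x4000c8}; // test .. jns
    for (uint64_t a : addrs)
        printf("%#llx -> window %#llx\n",
               (unsigned long long)a, (unsigned long long)window(a));
    // test at 0x4000bf still falls in window 0x4000a0, whose 3 ways are
    // already used up; one more byte of padding pushes it to 0x4000c0,
    // the start of the next window.
    return 0;
}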

Let's look at offset 9 on the other hand:

00000000004000a9 <_start.L37>:
  ab 1 4000a9:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1 4000ad:  f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1 4000b2:  66 0f 2e f0             ucomisd xmm6,xmm0
  ab 1 4000b6:  72 21                   jb     4000d9 <_start.L36>
  ab 2 4000b8:  31 d2                   xor    edx,edx
  ab 2 4000ba:  48 89 d8                mov    rax,rbx
  ab 3 4000bd:  48 f7 f1                div    rcx
  cd 1 4000c0:  48 85 d2                test   rdx,rdx
  cd 1 4000c3:  74 0d                   je     4000d2 <_start.L30>
  cd 1 4000c5:  48 83 c1 01             add    rcx,0x1
  cd 1 4000c9:  79 de                   jns    4000a9 <_start.L37>

Now there is no problem! The test instruction has slipped into the next 32B line (the cd line), so everything fits in the uop cache.

So that explains why stuff changes between the MITE and DSB at that point. It doesn't, however, explain why the MITE path is faster. I tried some simpler tests with div in a loop, and you can reproduce this with simpler loops without any of the floating point stuff. It's weird and sensitive to random other stuff you put in the loop.

For example this loop also executes faster out of the legacy decoder than the DSB:

align 32
    <add nops here to switch between DSB and MITE>
.top:
    add r8, r9
    xor eax, eax
    div rbx
    xor edx, edx
    times 5 add eax, eax
    dec rcx
    jnz .top

In that loop, adding the pointless add r8, r9 instruction, which doesn't interact with the rest of the loop, sped things up for the MITE version (but not the DSB version).

So I think the difference between region 1 and regions 2 and 3 is due to the former executing out of the legacy decoder (which, oddly, makes it faster).


Let's also take a look at the offset 18 to offset 19 transition (where region 2 ends and region 3 starts):

Offset 18:

00000000004000b2 <_start.L37>:
  ab 1 4000b2:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1 4000b6:  f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1 4000bb:  66 0f 2e f0             ucomisd xmm6,xmm0
  ab 1 4000bf:  72 21                   jb     4000e2 <_start.L36>
  cd 1 4000c1:  31 d2                   xor    edx,edx
  cd 1 4000c3:  48 89 d8                mov    rax,rbx
  cd 2 4000c6:  48 f7 f1                div    rcx
  cd 3 4000c9:  48 85 d2                test   rdx,rdx
  cd 3 4000cc:  74 0d                   je     4000db <_start.L30>
  cd 3 4000ce:  48 83 c1 01             add    rcx,0x1
  cd 3 4000d2:  79 de                   jns    4000b2 <_start.L37>

Offset 19:

00000000004000b3 <_start.L37>:
  ab 1 4000b3:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1 4000b7:  f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1 4000bc:  66 0f 2e f0             ucomisd xmm6,xmm0
  cd 1 4000c0:  72 21                   jb     4000e3 <_start.L36>
  cd 1 4000c2:  31 d2                   xor    edx,edx
  cd 1 4000c4:  48 89 d8                mov    rax,rbx
  cd 2 4000c7:  48 f7 f1                div    rcx
  cd 3 4000ca:  48 85 d2                test   rdx,rdx
  cd 3 4000cd:  74 0d                   je     4000dc <_start.L30>
  cd 3 4000cf:  48 83 c1 01             add    rcx,0x1
  cd 3 4000d3:  79 de                   jns    4000b3 <_start.L37>

The only difference I see here is that the first 4 instructions in the offset 18 case fit into the ab cache line, but only 3 in the offset 19 case. If we hypothesize that the DSB can only deliver uops to the IDQ from one cache set per cycle, this means that at some point one uop may be issued and executed a cycle earlier in the offset 18 scenario than in the 19 scenario (imagine, for example, that the IDQ is empty). Depending on exactly what port that uop goes to in the context of the surrounding uop flow, that may delay the loop by one cycle. Indeed, the difference between regions 2 and 3 is ~1 cycle (within the margin of error).

So I think we can say that the difference between 2 and 3 is likely due to uop cache alignment -- region 2 has better alignment than 3, in terms of issuing one additional uop one cycle earlier.


Some additional notes on things I checked that didn't pan out as being a possible cause of the slowdowns:

  • Despite the DSB modes (regions 2 and 3) having 3 microcode switches versus the 2 of the MITE path (region 1), that doesn't seem to directly cause the slowdown. In particular, simpler loops with div execute in identical cycle counts, but still show 3 and 2 switches for the DSB and MITE paths respectively. So that's normal and doesn't directly imply the slowdown.

  • Both paths execute an essentially identical number of uops and, in particular, have an identical number of uops generated by the microcode sequencer. So it's not as if there is more overall work being done in the different regions.

  • There wasn't a difference in cache misses (very low, as expected) at the various levels, branch mispredictions (essentially zero3), or any other types of penalties or unusual conditions I checked.

What did bear fruit was looking at the pattern of execution unit usage across the various regions. Here's a look at the distribution of uops executed per cycle and some stall metrics:

+----------------------------+----------+----------+----------+
|                            | Region 1 | Region 2 | Region 3 |
+----------------------------+----------+----------+----------+
| cycles:                    | 7.7e8    | 8.0e8    | 8.3e8    |
| uops_executed_stall_cycles | 18%      | 24%      | 23%      |
| exe_activity_1_ports_util  | 31%      | 22%      | 27%      |
| exe_activity_2_ports_util  | 29%      | 31%      | 28%      |
| exe_activity_3_ports_util  | 12%      | 19%      | 19%      |
| exe_activity_4_ports_util  | 10%      | 4%       | 3%       |
+----------------------------+----------+----------+----------+
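As a rough sanity check on those cycle counts (assuming they cover one full run with the test prime from footnote 3, which is my assumption rather than something stated above): sqrt(1000000000000037) is about 3.2e7, so the loop runs roughly 3.2e7 iterations; 7.7e8 / 3.2e7 is about 24 cycles per iteration in region 1, and both (8.0e8 - 7.7e8) / 3.2e7 and (8.3e8 - 8.0e8) / 3.2e7 come out to about 0.9 cycles per iteration, consistent with the "close to or exactly 1 cycle/iteration" difference between regions noted earlier.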

I sampled a few different offset values and the results were consistent within each region, yet between the regions you get quite different results. In particular, in region 1, you have fewer stall cycles (cycles where no uop is executed). You also have significant variation in the non-stall cycles, although no clear "better" or "worse" trend is evident. For example, region 1 has many more cycles (10% vs 3% or 4%) with 4 uops executed, but the other regions largely make up for it with more cycles with 3 uops executed, and few cycles with 1 uop executed.

The difference in UPC4 that the execution distribution above implies fully explains the difference in performance (this is probably a tautology, since we have already confirmed the uop count is the same between them).

Let's see what toplev.py has to say about it ... (results omitted).

Well, toplev suggests that the primary bottleneck is the front-end (50+%). I don't think you can trust this, because the way it calculates FE-bound seems broken in the case of long strings of micro-coded instructions. FE-bound is based on frontend_retired.latency_ge_8, which is defined as:

Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 8 cycles which was not interrupted by a back-end stall. (Supports PEBS)

Normally that makes sense. You are counting instructions which were delayed because the frontend wasn't delivering cycles. The "not interrupted by a back-end stall" condition ensures that this doesn't trigger when the front-end isn't delivering uops simply because the backend is not able to accept them (e.g., when the RS is full because the backend is performing some low-throughput instructions).

It kind of seems like it's the div instructions -- even a simple loop with pretty much only one div shows:

FE      Frontend_Bound:                57.59 %           [100.00%]
BAD     Bad_Speculation:                0.01 %below      [100.00%]
BE      Backend_Bound:                  0.11 %below      [100.00%]
RET     Retiring:                      42.28 %below      [100.00%]

That is, the only bottleneck is the front-end ("retiring" is not a bottleneck, it represents the useful work). Clearly, such a loop is trivially handled by the front-end and is instead limited by the backend's ability to chew through all the uops generated by the div operation. Toplev might get this really wrong because (1) it may be that the uops delivered by the microcode sequencer aren't counted in the frontend_retired.latency... counters, so that every div operation causes that event to count all subsequent instructions (even though the CPU was busy during that period -- there was no real stall), or (2) the microcode sequencer might deliver all its uops essentially "up front", slamming ~36 uops into the IDQ, at which point it doesn't deliver any more until the div is finished, or something like that.

Still, we can look at the lower levels of toplev for hints:

The main difference toplev calls out between regions 1 and regions 2 and 3 is the increased penalty of ms_switches for the latter two regions (since they incur 3 every iteration vs 2 for the legacy path). Internally, toplev estimates a 2-cycle penalty in the frontend for such switches. Of course, whether these penalties actually slow anything down depends in a complex way on the instruction queue and other factors. As mentioned above, a simple loop with div doesn't show any difference between the DSB and MITE paths, but a loop with additional instructions does. So the switch bubble could be absorbed in the simpler loops (where the backend processing of all the uops generated by the div is the main factor), but once you add other work to the loop, the switches become a factor, at least for the transition period between the div and non-div work.

So I guess my conclusion is that the way the div instruction interacts with the rest of the frontend uop flow, and the backend execution, isn't completely understood. We know it involves a flood of uops, delivered both from the MITE/DSB (seems like 4 uops per div) and from the microcode sequencer (seems like ~32 uops per div, although it changes with different input values to the div op) -- but we don't know what those uops are (we can see the port distribution though). All that makes the behavior fairly opaque, but I think it is probably down to either the MS switches bottlenecking the front-end, or slight differences in the uop delivery flow resulting in different scheduling decisions which end up making the MITE order master.


1 Of course, most of the uops are not delivered from the legacy decoder or the DSB at all, but by the microcode sequencer (MS). So we speak loosely about instructions being delivered, not uops.

2 Note that the x axis here is "offset from 32B alignment". That is, 0 means that the top of the loop (label .L37) is aligned to a 32B boundary, and 5 means the loop starts five bytes below a 32B boundary (using nop for padding) and so on. So my padding bytes and offset are the same. The OP used a different meaning for offset, if I understand it correctly: his 1 byte of padding resulted in a 0 offset. So you would subtract 1 from the OP's padding values to get my offset values.

3 In fact, the branch prediction rate for a typical test with prime=1000000000000037 was ~99.999997%, reflecting only 3 mispredicted branches in the entire run (likely on the first pass through the loop, and the last iteration).

4 UPC, i.e., uops per cycle -- a measure closely related to IPC for similar programs, and one that is a bit more precise when we are looking in detail at uop flows. In this case, we already know the uop counts are the same for all variations of alignment, so UPC and IPC will be directly proportional.

