1、Computer Architecture-A Quantitative Approach计算机体系结构计算机体系结构计算机计算机体系结构体系结构Chapter 4 (2)Instruction-Level Parallelism Software Approaches王奕王奕ELecture for ILP:Software approaches(软件方法软件方法)Basic Compiler Technique for Exposing ILP Loop unrolling(基本的发现(基本的发现ILP的编译技术是循环展开)的编译技术是循环展开)Static Branch Predicti
2、on(静态分支预测静态分支预测)Static multiple Issue:VLIW(静态多指令发射(静态多指令发射VLIW)Advanced Compilor Support for Exposing and Exploiting ILP(对发现和开发对发现和开发ILP的高级编译器支持的高级编译器支持)Software pipelining(软件流水软件流水)Global Code scheduling(全局代码调度)(全局代码调度)Hardware Support for Exposing More Parallelism at compile time(对编译时开发对编译时开发ILP的硬
3、件支持的硬件支持)Conditional or Predicated(断言的断言的)instructions(条件指令或预条件指令或预测指令测指令)Compiler speculation with hardware support(在硬件支持下在硬件支持下的编译器投机技术的编译器投机技术)FP Loop:Where are the Hazards?Loop:LD F0,0(R1);F0=vector element ADDD F4,F0,F2;add scalar from F2 SD 0(R1),F4;store result SUBI R1,R1,8;decrement pointer
4、8B(DW)BNEZ R1,Loop;branch R1!=zero NOP;delayed branch slotAssumptions of the latency of the FP operations:FP ALU opAnother FP ALU op 3FP ALU opStore double 2 Load doubleFP ALU op 1Load doubleStore double 0Integer opInteger op 0 Where are the stalls?Reducing stalls from schedulling in BB and delayed
5、branchLoop:LD F0,0(R1)ADDD F4,F0,F2 SD 0(R1),F4 SUBI R1,R1,#8 BNEZ R1,LoopF D X M W F D s A1 A2 A3 A4 W F s D s s X M W F s s D X M W F s D X M W 10 CC10 CC F FLoop:LD F0,0(R1)SUBI R1,R1,#8 ADDD F4,F0,F2 BNEZ R1,Loop SD +8(R1),F4F D X M W F D X M W F DA1A2A3A4W F D X M W F D s X M W 6 CC F s D X M W
6、Unroll Loop Four Times(straightforward way)Rewrite loop to minimize stalls?1 Loop:LDF0,0(R1)2ADDDF4,F0,F23SD0(R1),F4;drop SUBI&BNEZ4LDF6,-8(R1)5ADDDF8,F6,F26SD-8(R1),F8;drop SUBI&BNEZ7LDF10,-16(R1)8ADDDF12,F10,F29SD-16(R1),F12;drop SUBI&BNEZ10LDF14,-24(R1)11ADDD F16,F14,F212SUBIR1,R1,#32;alter to 4*
7、8/13SD+8(R1),F1614BNEZR1,LOOP15NOP 15+4 x(1+2)=27 clock cycles,or 6.8 per iteration Assumes R1 is multiple of 41 cycle stall2 cycles stallUnrolled Loop That Minimizes Stalls What assumptions made when moved code?OK to move store past SUBI even though changes register OK to move loads before stores:g
8、et right data?When is it safe for compiler to do such changes?1 Loop:LDF0,0(R1)2LDF6,-8(R1)3LDF10,-16(R1)4LDF14,-24(R1)5ADDDF4,F0,F26ADDDF8,F6,F27ADDDF12,F10,F28ADDDF16,F14,F29SD0(R1),F410SD-8(R1),F811SUBIR1,R1,#3212SD+16(R1),F1213BNEZR1,LOOP14SD8(R1),F16;8-32=-24 14 clock cycles,or 3.5 per iteratio
9、nUsing Loop unrolling and scheduling with static Multiple IssueInteger InstructionFP instructionClock cycleLoop:L.D F0,0(R1)1 L.D F0,-8(R1)2 L.D F0,-16(R1)ADD.D F4,F0.F23 L.D F0,-24(R1)ADD.D F8,F6.F24 L.D F0,-32(R1)ADD.D F12,F10.F25 S.D F4,0(R1)ADD.D F16,F14.F26 S.D F8,-8(R1)ADD.D F20,F18.F27 S.D F1
10、2,-16(R1)8 DADDUI R1,R1,#-409 S.D F16,16(R1)10 BNE R1,R2,Loop11 S.D F20,8(R1)12Static Branch Prediction静态分支预测静态分支预测 Static branch predictors are used in processors when branch behavior is expected highly predictable at compile time.(静态分支预测一般用于分支行为在编译器时就具有很高有可预测性的情形静态分支预测一般用于分支行为在编译器时就具有很高有可预测性的情形)Se
11、veral different methods Always predict a branch as taken or untaken (总是预测转移成功或不成功总是预测转移成功或不成功)Predict on the basis of branch direction(基于转移方向的预测基于转移方向的预测)Backward-going branch to be taken,(向后预测为成功向后预测为成功)Forward-going branch to be not taken.(向前预测为不成功向前预测为不成功)Profile-based Prediction(基于以往概要信息基于以往概要信息
12、(含多方面的行为含多方面的行为)的预测的预测)Static Multiple issue:VLIW(静态多发射:静态多发射:VLIW)VLIW:Very Long Instruction Word(超长指令字超长指令字)Each“instruction”has explicit coding for multiple operations(每条每条“指令指令”都显式地包括多个操作都显式地包括多个操作)In EPIC,grouping called a“packet”In Transmeta,grouping called a“molecule”(with“atoms”as ops)Tradeo
13、ff instruction space for simple decoding (为了编码简单,牺牲了一些代码空间为了编码简单,牺牲了一些代码空间)The long instruction word has room for many operations By definition,all the operations the compiler puts in the long instruction word are independent=execute in parallel E.g.,2 integer operations,2 FP ops,2 Memory refs,1 bra
14、nch 16 to 24 bits per field=7*16 or 112 bits to 7*24 or 168 bits wide Need compiling technique that schedules across several branchesLoop Unrolling in VLIWMemory MemoryFPFPInt.op/Clockreference 1reference 2operation 1 op.2 branchLD F0,0(R1)LD F6,-8(R1)1LD F10,-16(R1)LD F14,-24(R1)2LD F18,-32(R1)LD F
15、22,-40(R1)ADDD F4,F0,F2ADDD F8,F6,F23LD F26,-48(R1)ADDD F12,F10,F2 ADDD F16,F14,F24ADDD F20,F18,F2 ADDD F24,F22,F25SD 0(R1),F4SD-8(R1),F8ADDD F28,F26,F26SD-16(R1),F12 SD-24(R1),F167SD-32(R1),F20 SD-40(R1),F24SUBI R1,R1,#488SD-0(R1),F28BNEZ R1,LOOP9 Unrolled 7 times to avoid delays 7 results in 9 clo
16、cks,or 1.3 clocks per iteration(1.8X)Average:2.5 ops per clock,50%efficiency Note:Need more registers in VLIW(15 vs.6 in SS)Problems for VLIW Technical problems(技术问题技术问题)Increase in code size(代码的增长代码的增长)Loop unrolling Unused function slots Limitations of lockstep operation(锁定同步操作的限制锁定同步操作的限制)A stall
17、 in any function unit may cause the entire processor to stall Logistical problem(逻辑问题逻辑问题)Binary code compatibility(二进制代码的兼容性二进制代码的兼容性)Major challenge for all multiple-issue processors Exploit large amounts of ILPAdvanced Compiler Support for Exploiting ILP(编译器对开发编译器对开发ILP的高级支持的高级支持)Detecting and En
18、hancing Loop-level Parallelism(检测并增强循环级并行检测并增强循环级并行)Eliminating Dependent Computations(消除相关计算消除相关计算)Software pipelining:Symbolic loop unrolling(软件流水:符号循环展开软件流水:符号循环展开)Global Code Scheduling(全局代码调度全局代码调度)Trace Scheduling:focus on Critical path (路径调度:关注关键路径路径调度:关注关键路径)SuperblocksDetecting and Enhancin
19、g Loop-level Parallelism Loop-carried dependence(循环传递相关循环传递相关-存在循环之间的相关性存在循环之间的相关性)Data accesses in later iterations are dependent on data values produced in earlier iterations A loop is parallel if it can be written without a cycle in the dependences.(一个循环中,如果相关性没有构成一个环,就说这个循环是可并行的一个循环中,如果相关性没有构成一个
20、环,就说这个循环是可并行的)An assumption All array indices(下标下标)are affine(仿射的仿射的).A one-dimensional array index is affine,if it can be written in the form of ai+b.A dependence exists if two conditions hold(满足下面两条件,即相关存在满足下面两条件,即相关存在):Two indices J,K,within the limits of the loop.(下标的两个取值下标的两个取值,j,k)The loop sto
21、res into Ea j+b and later fetch from the same element Eck+d,it can satisfy aj+b=ck+d (存数与读取数下标满足存数与读取数下标满足aj+b=ck+d)GCD(Greatest common divisor)test-最大公因子测试最大公因子测试 If a loop-carried dependence exists,then GCD(c,a)must divide(d-b).(GCD(c,a)必须被(d-b)整除)Eliminating Dependent Computations-消除相关计算消除相关计算DAD
22、DUI R1,R2,#4DADDUI R1,R1,#4ADD R1,R2,R3ADD R4,R1,R6ADD R8,R4,R7SUM=SUM+XDADDUI R1,R2,#8ADD R1,R2,R3ADD R4,R6,R7ADD R8,R1,R4SUM=SUM+X1+X2+X3+X4+X5SUM=(SUM+X1)+(X2+X3)+(X4+X5)R8=R2+R3+R6+R7把R1与R7的位置对换Software Pipelining-软件流水软件流水Observation:if iterations from loops are independent,then can get more ILP
23、 by taking instructions from different iterations (如果循环的迭代之间是不相关的,则可以从不同迭代中取指执行可以获得更多的可并行性如果循环的迭代之间是不相关的,则可以从不同迭代中取指执行可以获得更多的可并行性)Software pipelining:reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop(Tomasulo in SW)(软件流水是从源循环的不同迭
24、代体中取出必要的指令,重软件流水是从源循环的不同迭代体中取出必要的指令,重新建立新的循环,提供连续指令给多发射处理器新建立新的循环,提供连续指令给多发射处理器)Iteration 0Iteration 1Iteration 2Iteration 3Iteration 4Software-pipelined iterationSoftware Pipelining ExampleBefore:Unrolled 3 times 1 LDF0,0(R1)2ADDD F4,F0,F2 3SD0(R1),F4 4LDF6,-8(R1)5ADDD F8,F6,F2 6SD-8(R1),F8 7LDF10,
25、-16(R1)8ADDD F12,F10,F2 9SD-16(R1),F12 10SUBI R1,R1,#24 11BNEZ R1,LOOPAfter:Software Pipelined 1SD0(R1),F4;Stores Mi 2ADDDF4,F0,F2;Adds to Mi-1 3LDF0,-16(R1);Loads Mi-2 4SUBIR1,R1,#8 5BNEZR1,LOOP Symbolic Loop Unrolling Maximize result-use distance Less code space than unrolling Fill&drain pipe only
26、 once per loop vs.once per each unrolled iteration in loop unrolling5 cycles per iterationSW PipelineLoop Unrolledoverlapped opsTimeTimeTrace Scheduling(路径调度路径调度专用于专用于VLIW)Parallelism across IF branches vs.LOOP branches (挖掘跨越挖掘跨越 if if 转移和转移和 LOOP LOOP 转移的并行性转移的并行性)Two steps(路径调度技术包含两个独立的处理过程路径调度技术包
27、含两个独立的处理过程)Trace Selection(路径选择路径选择)Find likely sequence of basic blocks(trace预测路径)of(statically predicted or profile predicted)long sequence of straight-line code(首先根首先根据转移行为预测转移可能的两个路径方向,找出使用概率大的那个方向作为扩展基本块的方向,据转移行为预测转移可能的两个路径方向,找出使用概率大的那个方向作为扩展基本块的方向,这个方向的后继指令称为预测路径)这个方向的后继指令称为预测路径)Trace Compacti
28、on(路径压缩路径压缩)Squeeze trace into few VLIW instructions(将选定路径上的操作封装成超长指令将选定路径上的操作封装成超长指令)Need bookkeeping code in case prediction is wrong This is a form of compiler-generated speculationCompiler must generate“fixup(修正修正)”code to handle cases in which trace is not the taken branch(预测失效要采取补偿措施预测失效要采取补偿措
29、施)Needs extra registers:undoes bad guess by discardingExample of Trace Scheduling AI=AI+BIBI=CI=X 操作AI=0?Y 操作TFAI=AI+BI;IF AI=0 BI=;ELSEX 操作;CI=;Y 操作;Out-ofTraceIntoTraceExampleYNBNEZ R4,I5I1 LW R4,0(R1)I2 LW R5,0(R2)I3 ADDI R4,R4,R5I4 SW 0(R1),R4I6 SW 0(R3),X 操作 Y 操作I5 SW 0(R2),YNBNEZ R4,I5I1 LW R4
30、,0(R1)I2 LW R5,0(R2)I3 ADDI R4,R4,R5I4 SW 0(R1),R4I5 SW 0(R2),I6 SW 0(R3),I6 SW 0(R3),X 操作 Y 操作清除 I5、I6 指令结果Split 模块Rejoin 模块原始代码路径调度之后的代码Advantages of HW(Tomasulo)vs.SW(VLIW)SpeculationHW advantages:HW better at memory disambiguation(内存释意)(内存释意)since knows actual addressesHW better at branch predic
31、tion since lower overheadHW maintains precise exception modelHW does not execute bookkeeping instructions(补偿代码)Same software works across multiple implementationsSmaller code size(not as many no ops filing blank instructions)SW advantages:Window of instructions that is examined for parallelism much
32、higherMuch less hardware involved in VLIW(unless you are Intel!)More involved types of speculation can be done more easilySpeculation can be based on large-scale program behavior,not just local informationSuperscalar v.VLIW Smaller code size(较小的代码长度较小的代码长度)Binary compatability (二进制代码的兼容性好)(二进制代码的兼容性
33、好)across generations of hardware Simplified Hardware for decoding,issuing instructions No Interlock Hardware(compiler checks?)More registers,but simplified Hardware for Register Ports(multiple independent register files?)Hardware Support for Expoilting ILP at compile time Conditional/predicated inst
34、ruction)(条件指令或预测指令条件指令或预测指令)A conditional instruction refers to a condition which is evaluated as part of the instruction execution,(条件指令的条件判断仅仅作为指令执行的一部分条件指令的条件判断仅仅作为指令执行的一部分)Example:If(A=0)S=T BNEZ R1,L CMOV R2,R3,R1 ADDU R2,R3,R0 L:the CPU always executes the instruction but writes the result onl
35、y if the condition is met.(CPU总是会执行这条指令,但是否写结果要看条件是否满足总是会执行这条指令,但是否写结果要看条件是否满足)A conditional branch changes a control dependence into a data dependence.(把控制相关转成数据相关把控制相关转成数据相关)Conditional instructions The execution of all instruction is controlled by a predicate.When predicate is false,the instructi
36、on becomes a no-op Simply convert small blocks of code that are branch dependent.Eliminate nonloop branches Can be used to speculatively move an instruction that is time critical.Compiler Speculation with Hardware Support-硬件支持的编译投机硬件支持的编译投机 Move speculated instructions not only before the branch,but
37、 before the condition evaluation.Four methods for supporting ambitious(大胆的大胆的)speculation Hardware and OS cooperatively ignore exceptions for speculative instructions.(硬件与硬件与OS协同忽略投机指令引起的异常中断协同忽略投机指令引起的异常中断)Speculative instructions that never raise exceptions are used.(调度那些不影响异常中断行为的指令作为投机指令调度那些不影响异常中断行为的指令作为投机指令)Poison bits are attached to the result registers written by speculative instructions.(采用抑制位的投机技术采用抑制位的投机技术)A mechanism is provided to indicate that an instruction is speculative,the hardware buffers the result until the instruction no longer speculative.
侵权处理QQ:3464097650--上传资料QQ:3464097650
【声明】本站为“文档C2C交易模式”,即用户上传的文档直接卖给(下载)用户,本站只是网络空间服务平台,本站所有原创文档下载所得归上传人所有,如您发现上传作品侵犯了您的版权,请立刻联系我们并提供证据,我们将在3个工作日内予以改正。