Parallel Processing & Distributed Systems - Chapter 7: Pipeline - Thoai Nam
Pipeline throughput
– Determined by how often an instruction exits the pipeline
– Depends on the overhead of clock skew and register setup time
– Depends on the time required for the slowest pipe stage (see the sketch after this list)
Pipeline stall
– Delays the execution of some instruction and of all succeeding instructions
– “Slows down” the pipeline
Pipeline designer's goals
– Balance the length of the pipeline stages
– Reduce / avoid pipeline stalls
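A minimal sketch (the stage times below are hypothetical, not from the slides) of how the slowest stage plus the clock skew and setup overhead determines the clock period, and hence the throughput:

  # Clock period of a pipeline = slowest stage time + per-stage overhead (ns).
  stage_times_ns = [10, 8, 10, 7]     # an unbalanced 4-stage pipeline (assumed)
  overhead_ns = 1                     # clock skew + register setup (assumed)
  clock_ns = max(stage_times_ns) + overhead_ns      # 11 ns per cycle
  unpipelined_ns = sum(stage_times_ns)              # 35 ns per instruction
  print(clock_ns, unpipelined_ns / clock_ns)        # 11 ns, about 3.18x speedup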
- Pipeline (Thoai Nam)
- Concepts
A technique to make fast CPUs by overlapping the execution of multiple instructions
– Space-time diagram of a 4-stage pipeline (stages S1-S4, one instruction issued per cycle):

  Cycle:           1    2    3    4    5    6    7    8
  Instruction i    S1   S2   S3   S4
  Instruction i+1       S1   S2   S3   S4
  Instruction i+2            S1   S2   S3   S4
  Instruction i+3                 S1   S2   S3   S4
  Instruction i+4                      S1   S2   S3   S4
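A minimal sketch (not part of the original slides) that prints the overlap pattern shown above for a 4-stage pipeline issuing one instruction per cycle:

  # Print the space-time diagram of a 4-stage pipeline, one instruction per cycle.
  STAGES = ["S1", "S2", "S3", "S4"]
  for i in range(5):                      # instructions i .. i+4
      row = ["    "] * i + STAGES         # instruction i+k starts in cycle k+1
      print(f"instruction i+{i}: " + " ".join(row))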
- Concepts (cont'd)

  Pipeline speedup = (Average instruction time without pipelining) / (Average instruction time with pipelining)
                   = (CPI without pipelining * Clock cycle without pipelining) / (CPI with pipelining * Clock cycle with pipelining)

  (CPI = number of Cycles Per Instruction)

– CPI without pipelining = Ideal CPI * Pipeline depth
– CPI with pipelining = Ideal CPI + Pipeline stall clock cycles per instruction
– Assuming the clock cycle is unchanged by pipelining:

  Pipeline speedup = (Ideal CPI * Pipeline depth) / (Ideal CPI + Pipeline stall clock cycles per instruction)
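A minimal sketch (the numbers are hypothetical, not from the slides) applying the speedup formula above:

  # Pipeline speedup = (Ideal CPI * Pipeline depth) / (Ideal CPI + stall cycles per instruction)
  def pipeline_speedup(ideal_cpi, pipeline_depth, stall_cycles_per_instr):
      cpi_unpipelined = ideal_cpi * pipeline_depth
      cpi_pipelined = ideal_cpi + stall_cycles_per_instr
      return cpi_unpipelined / cpi_pipelined

  # Example: 5-stage pipeline, ideal CPI = 1, 0.25 stall cycles per instruction
  print(pipeline_speedup(1.0, 5, 0.25))   # -> 4.0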
- A Simple DLX Pipeline
Fetch a new instruction on each clock cycle
An instruction step = a pipe stage: IF (instruction fetch), ID (instruction decode), EX (execute), MEM (memory access), WB (write back)

  Cycle:           1    2    3    4    5    6    7    8
  Instruction i    IF   ID   EX   MEM  WB
  Instruction i+1       IF   ID   EX   MEM  WB
  Instruction i+2            IF   ID   EX   MEM  WB
  Instruction i+3                 IF   ID   EX   MEM  WB
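A small sketch (assuming one instruction issued per cycle and no stalls): with a k-stage pipeline, n instructions take k + (n - 1) cycles, which matches the 8 cycles for 4 instructions in the diagram above:

  def total_cycles(k, n):
      # k cycles to fill the pipeline, then one instruction completes per cycle
      return k + (n - 1)

  print(total_cycles(5, 4))   # -> 8, as in the DLX diagram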
- Structural Hazard
Due to resource conflicts
Instances of structural hazards:
– Some functional unit is not fully pipelined
  » A sequence of instructions that all use that unit cannot be initiated at a rate of one per cycle
– Some resource has not been duplicated enough, e.g.:
  » Only one register-file write port when two writes are needed in the same cycle
  » A single memory port shared by data and instruction accesses (its cost is illustrated below)
Why allow this type of hazard?
– To reduce cost
– To reduce the latency of the unit
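A hedged illustration (the instruction mix is hypothetical) of the cost of the shared memory port: every data reference conflicts with an instruction fetch and adds one stall cycle:

  # Assumed mix: 40% of instructions are loads/stores; with a single memory
  # port each of them causes one structural-hazard stall cycle.
  ideal_cpi = 1.0
  mem_ref_fraction = 0.40
  stall_cycles_per_conflict = 1
  effective_cpi = ideal_cpi + mem_ref_fraction * stall_cycles_per_conflict
  print(effective_cpi)   # -> 1.4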
- Hardware Solution to Data Hazards
Forwarding (bypassing / short-circuiting) technique
– Reduces the delay between two dependent instructions
– The ALU result is fed back to the ALU input latches
– The forwarding hardware checks for a dependence and forwards the needed result to the ALU inputs of the next two instructions
Example (ADD produces R1; the following instructions use R1 without stalling):

  ADD R1, R2, R3      IF  ID  EX  MEM WB
  SUB R4, R1, R5          IF  ID  EX  MEM WB   (no stall)
  AND R6, R1, R7              IF  ID  EX  MEM WB   (no stall)
  OR  R8, R1, R9                  IF  ID  EX  MEM WB   (no stall)
  XOR R1, R10, R11                    IF  ID  EX  MEM WB
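A minimal sketch (not the actual DLX forwarding unit) of the check the forwarding hardware performs: a source register written by one of the two previous instructions must be forwarded to the ALU input, while the third instruction after ADD (the OR above) already gets R1 from the register file:

  instrs = [
      ("ADD", "R1", ["R2", "R3"]),
      ("SUB", "R4", ["R1", "R5"]),
      ("AND", "R6", ["R1", "R7"]),
      ("OR",  "R8", ["R1", "R9"]),
      ("XOR", "R1", ["R10", "R11"]),
  ]
  for i, (op, dest, srcs) in enumerate(instrs):
      # destinations written by the two preceding instructions
      recent_dests = {d for _, d, _ in instrs[max(0, i - 2):i]}
      if any(s in recent_dests for s in srcs):
          print(op, ": operand forwarded to the ALU input")
      else:
          print(op, ": operands read from the register file")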
- Software Solution to Data Hazards
Pipeline scheduling (instruction scheduling)
– Use the compiler to rearrange the generated code to eliminate hazards (counted in the sketch below)
Example:

  Source code    Generated code (data hazards)     Rearranged code (no hazards)
  c = a + b      LW  Ra, a                         LW  Ra, a
  d = e - f      LW  Rb, b                         LW  Rb, b
                 ADD Rc, Ra, Rb   <- hazard        LW  Re, e
                 SW  c, Rc                         ADD Rc, Ra, Rb
                 LW  Re, e                         LW  Rf, f
                 LW  Rf, f                         SW  c, Rc
                 SUB Rd, Re, Rf   <- hazard        SUB Rd, Re, Rf
                 SW  d, Rd                         SW  d, Rd
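A minimal sketch (assuming a one-cycle load-use delay, as in DLX with forwarding) that counts the load-use stalls in the two schedules above; the rearranged code separates each load from its use and removes both stalls:

  def load_use_stalls(seq):
      # seq: list of (opcode, destination, sources); a stall occurs when an
      # instruction uses the result of the load immediately before it
      return sum(1 for prev, curr in zip(seq, seq[1:])
                 if prev[0] == "LW" and prev[1] in curr[2])

  generated  = [("LW","Ra",[]), ("LW","Rb",[]), ("ADD","Rc",["Ra","Rb"]),
                ("SW","c",["Rc"]), ("LW","Re",[]), ("LW","Rf",[]),
                ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]
  rearranged = [("LW","Ra",[]), ("LW","Rb",[]), ("LW","Re",[]),
                ("ADD","Rc",["Ra","Rb"]), ("LW","Rf",[]), ("SW","c",["Rc"]),
                ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]
  print(load_use_stalls(generated), load_use_stalls(rearranged))   # -> 2 0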
- Reducing Control Hazard Effects
Predict whether the branch is taken or not
Compute the branch target address earlier
Several schemes are used:
– Pipeline freezing (stall until the branch outcome is known)
– Predict-not-taken scheme
– Predict-taken scheme (not applicable in DLX)
– Delayed branch
- Predict-Not-Taken Scheme
Predict the branch as not taken and allow execution to continue
– Must not change the machine state till the branch outcome is known
If the branch is not taken: no penalty
If the branch is taken:
– Restart the fetch at the branch target
– Stall one cycle
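A hedged illustration (branch frequencies are hypothetical) of the cost of this scheme, where only taken branches pay the one-cycle penalty described above:

  base_cpi = 1.0
  branch_fraction = 0.20   # assumed: 20% of instructions are branches
  taken_fraction = 0.60    # assumed: 60% of those branches are taken
  penalty_cycles = 1       # restart the fetch at the branch target
  effective_cpi = base_cpi + branch_fraction * taken_fraction * penalty_cycles
  print(effective_cpi)     # -> 1.12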
- Delayed Branch
Change the order of execution so that the instruction in the branch delay slot is always valid and useful
“From before” approach: an instruction from before the branch (independent of the branch condition) fills the delay slot

  Before:                    After:
  ADD R1, R2, R3             If R2 = 0 then ...
  If R2 = 0 then ...         ADD R1, R2, R3      <- delay slot
- Delayed Branch (cont'd)
“From fall through” approach: the delay slot is filled with the instruction from the fall-through (not-taken) path

  Before:                    After:
  ADD R1, R2, R3             ADD R1, R2, R3
  If R1 = 0 then ...         If R1 = 0 then ...
  SUB R4, R5, R6             SUB R4, R5, R6      <- delay slot

– The instruction order looks unchanged, but SUB now sits in the delay slot and executes even when the branch is taken, so it must be safe to execute on that path

