Computer Architecture Review
Review

1. Pipeline characteristics:
   - Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
   - Pipeline rate is limited by the slowest pipeline stage.
   - Multiple tasks operate simultaneously.
   - Potential speedup = number of pipe stages.
   - Unbalanced lengths of pipe stages reduce the speedup.
   - Time to "fill" the pipeline and time to "drain" it reduce the speedup.
2. RISC MIPS: the 5 steps of the MIPS datapath are IF, ID, EX, MEM, WB.
3. Three hazards:
   - Structural: the hardware cannot support two operations in the same cycle (e.g., a single memory port).
   - Data: an instruction needs the result of an earlier instruction.
   - Control: branches and jumps change the instruction flow.
4. Aliasing: two different cache entries can hold data for the same physical address! On an update, all cache entries with the same physical address must be updated, or memory becomes inconsistent.
5. TLBs: a way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB. Each entry holds: Virtual Address, Physical Address, Dirty, Ref, Valid, Access.
6. P408: calculating speedup and hit performance.
7. SPEC: System Performance Evaluation Cooperative.
8. Moore's Law: the number of transistors in a dense integrated circuit doubles approximately every 18 months, with performance doubling accordingly.
9. Performance summary: needs good benchmarks and good ways to summarize performance.
10. AMAT = Average Memory Access Time.
    Example: suppose a processor executes at Clock Rate = 200 MHz (5 ns per cycle) with ideal (no-miss) CPI = 1.1, and an instruction mix of 50% arith/logic, 30% ld/st, 20% control. Suppose 10% of memory operations incur a 50-cycle miss penalty, and 1% of instructions incur the same penalty.
    CPI = ideal CPI + average stalls per instruction
        = 1.1 (cycles/ins) + 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss) + 1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)
        = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
    So about 65% (= 2.0/3.1) of the time the processor is stalled waiting for memory!
    AMAT = (1/1.3) x (1 + 0.01 x 50) + (0.3/1.3) x (1 + 0.10 x 50) = 2.54
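The pipeline speedup points above can be illustrated with a small sketch (not from the original notes; the stage latencies are hypothetical):

```python
# Pipeline throughput sketch: hypothetical per-stage latencies in ns.
def pipelined_time(stage_times, n_tasks):
    """Fill the pipeline once, then finish one task per slowest-stage time."""
    cycle = max(stage_times)          # rate is limited by the slowest stage
    fill = len(stage_times) * cycle   # time to "fill" the pipeline
    return fill + (n_tasks - 1) * cycle

def unpipelined_time(stage_times, n_tasks):
    return n_tasks * sum(stage_times)

balanced = [10, 10, 10, 10, 10]       # balanced 5-stage pipeline
print(unpipelined_time(balanced, 1000) / pipelined_time(balanced, 1000))
# close to 5, i.e., the number of pipe stages

unbalanced = [5, 5, 30, 5, 5]         # same total work, one slow stage
print(unpipelined_time(unbalanced, 1000) / pipelined_time(unbalanced, 1000))
# well below 5: unbalanced stage lengths reduce speedup
```

Note that the latency of one task never improves (five stages still take the full pipeline depth); only throughput does, matching the first bullet of item 1.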
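The worked CPI/AMAT example above can be verified numerically; a minimal sketch using the numbers from the notes:

```python
# Recompute the worked example: 200 MHz, ideal CPI = 1.1.
ideal_cpi = 1.1
data_ops  = 0.30   # ld/st memory ops per instruction
data_miss = 0.10   # fraction of data ops that miss
inst_miss = 0.01   # fraction of instruction fetches that miss
penalty   = 50     # miss penalty in cycles

# CPI = ideal CPI + average stall cycles per instruction
cpi = ideal_cpi + data_ops * data_miss * penalty + 1.0 * inst_miss * penalty
print(round(cpi, 2))   # 3.1

# 1.3 memory accesses per instruction: 1 fetch + 0.3 data ops
amat = (1 / 1.3) * (1 + inst_miss * penalty) + (0.3 / 1.3) * (1 + data_miss * penalty)
print(round(amat, 2))  # 2.54
```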
11. Von Neumann vs. Harvard structure performance comparison:
    - 16KB split I&D caches: instruction miss rate = 0.64%, data miss rate = 6.47%.
    - 32KB unified cache: aggregate miss rate = 1.99%.
    - Assume 33% data ops, so 75% of accesses are instruction fetches (1.0/1.33); hit time = 1, miss time = 50.
    - Note that a data hit incurs 1 extra stall cycle in the unified cache (only one port).
    - AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
    - AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
12. Write-through (needs a valid bit) vs. write-back (needs dirty and valid bits). Write allocate vs. no-allocate on a write miss: either fetch the block into the cache first and then write, or write directly to the lower-level memory without allocating.
13. Improving cache performance (P426):
    - Reduce the miss rate (3Cs):
      - 2:1 cache rule: a direct-mapped cache of size X has about the same miss rate as a 2-way set-associative cache of size X/2.
      - Larger block size: spatial locality reduces compulsory misses, but conflict misses may grow, and capacity misses may grow if the cache is small; it also raises the miss penalty.
      - Higher associativity: reduces conflict misses but lengthens hit time (check the net effect on AMAT).
      - "Victim cache": add a small buffer to hold data discarded from the cache.
      - "Pseudo-associativity".
      - Hardware prefetching of instructions and data.
      - Software prefetching of data. Prefetching comes in two flavors: binding prefetch loads directly into a register (must be the correct address and register!); non-binding prefetch loads into the cache, can be incorrect, and frees HW/SW to guess.
      - Compiler optimizations: merging arrays, loop interchange, loop fusion, blocking.
    - Reduce the miss penalty:
      - Read priority over write on miss: the simple policy is to make the read miss wait until the write buffer is empty.
      - Early restart and critical word first: don't wait for the full block to be loaded before restarting the CPU.
      - Non-blocking caches to reduce stalls on misses.
      - Add a second-level cache. Which techniques apply to the L2 cache? Reduce conflict misses via higher associativity.
    - Reduce the time to hit in the cache.
14. Main memory uses DRAM; caches use SRAM.
15. Tomasulo's algorithm: register renaming via reservation stations.
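The split (Harvard) vs. unified cache AMAT figures above can be checked numerically:

```python
# Split vs. unified cache AMAT, using the miss rates from the notes.
hit, penalty = 1, 50
inst_frac, data_frac = 0.75, 0.25   # 75% of accesses are instruction fetches

# 16KB split I&D caches
amat_harvard = inst_frac * (hit + 0.0064 * penalty) + data_frac * (hit + 0.0647 * penalty)
print(round(amat_harvard, 2))       # 2.05

# 32KB unified cache: data accesses stall 1 extra cycle (single port)
amat_unified = inst_frac * (hit + 0.0199 * penalty) + data_frac * (hit + 1 + 0.0199 * penalty)
print(round(amat_unified, 3))       # 2.245, i.e. ~2.24 in the notes
```

Despite its higher aggregate miss rate, the split design also avoids the structural hazard of sharing one port, which is why it wins here.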
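Among the compiler optimizations listed above, blocking (loop tiling) can be sketched for matrix multiply; this is an illustrative sketch, with the block size `B` chosen arbitrarily (in practice it is sized so the working set fits in cache):

```python
# Blocked (tiled) matrix multiply: work on B x B sub-blocks so each tile of
# A, X, and C is reused while it is still resident in the cache.
def matmul_blocked(A, X, n, B=4):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for kk in range(0, n, B):
                # inner loops stay inside one B x B tile (temporal locality)
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + B, n)):
                            s += A[i][k] * X[k][j]
                        C[i][j] = s
    return C
```

The result is identical to the naive triple loop; only the traversal order changes, which is what reduces capacity misses.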