# Evaluating and Mitigating Bandwidth Bottlenecks Across the Memory Hierarchy in GPUs

Saumay Dublish, Vijay Nagarajan, Nigel Topham

The University of Edinburgh

> ISPASS 2017, 25<sup>th</sup> April, Santa Rosa, California

### Multithreading on GPUs

### Deeper Memory Hierarchy

### Goals

- <u>Characterize</u>: Understand the bandwidth bottlenecks across different levels of the memory hierarchy, such as L1, L2 and DRAM
- <u>Cause</u>: Investigate the architectural causes of congestion
- <u>Effect</u>: Design-space exploration to evaluate the effect of mitigating congestion
- <u>Proposal</u>: Use cause and effect analysis to present cost-effective configurations of the memory hierarchy

### **Experimental Environment**

### Platform

- GPGPU-Sim (v3.2.2)
- GPUWattch (McPAT)

### Benchmark Suites

- Rodinia
- Parboil
- MapReduce

### **Baseline Configuration**

#### GTX 480 NVIDIA GPU

- 15 SMs
- Private L1 Data Cache (16 KB; 32 MSHRs)
- Shared L2 Cache (768 KB; 32 MSHRs/bank)
- L1-L2 Interconnect (Crossbar; 32+32 bytes)
- DRAM (384-bit bus width)

### Latency Tolerance

Performance versus latency curve for memory-intensive benchmarks

Added latencies due to increasing congestion

1. Baseline memory latencies critically higher than performance plateau latencies
2. Baseline memory latencies critically higher than ideal access latencies to L2/DRAM

### Infinite Bandwidth

Significant congestion in the cache hierarchy

### Understanding Bandwidth Bottleneck

- While the bandwidth provided decreases in the lower levels of the memory hierarchy, bandwidth demand does not reduce proportionally.
- This leads to a <u>bandwidth skew</u> between adjacent levels.
- As a result, requests queue up in the memory hierarchy for long durations, causing congestion.
- L2 access queues are full for 46% of their usage lifetime.
- DRAM access queues are full for 39% of their usage lifetime.

### Structural Hazards

- Prolonged contention for cache resources such as MSHRs or replaceable cache lines.
- Pending requests must complete and relinquish the resources.
- Therefore, new miss requests get serialized, increasing the memory latencies even more.

High cache hit latencies

### Back Pressure

#### Independent compute?

- Cascading effect of structural hazards
- Higher level gets throttled
- Eventually throttles core performance

Restricted parallelism on cores

1. L1 MSHR: 41% (Structural Hazards)
2. L2 back pressure: 48% (Back pressure)

1. Crossbar (response path): 42% (Back pressure)
2. DRAM: 35% (Back pressure)

### Classifying the Design Space

- Category-1: Operate at peak throughput
  - Minimize stalls by exploiting existing peak throughput
  - e.g. MSHRs, Access Queue size
- Category-2: Increase peak throughput
  - Minimize stalls by increasing the peak throughput
  - e.g. Crossbar flit size, DRAM bus width

### Identifying the Design Space

- L1 parameters
  - L1 Miss Queue
  - L1 MSHR
  - Memory pipeline width
- L2 parameters
  - L2 Miss/Response Queue
  - L2 MSHR
  - L2 Data Port Width
  - L2 Banks
  - Flit Size (Crossbar)
- DRAM parameters
  - Scheduler Queue
  - Banks
  - Bus width

Improving bandwidth in isolation can lead to even more congestion at the lower levels

### Core frequency scaling on real GTX 480

Shows the criticality of the L2 bandwidth

### Scaling L1 and L2 parameters by 4x

A case for synergistic scaling!

Higher speedup on mitigating congestion in the cache hierarchy compared to DRAM (as done in HBM)

[Figure: IPC (normalized to baseline)]

### Pruning the Design Space

- Scaling all architectural parameters by 4x is impractical
- Need to prune the design space
- We now know the ...
  - Causes of congestion (at each memory level)
  - Effects of reducing congestion (at different memory levels)

#### Cost effective configuration

Mitigate causes where the effect is maximum

Boost bandwidth resources where it hurts most!

### Cost-effective Design Space

- L1 parameters
  - L1 Miss Queue
  - L1 MSHR
  - Memory pipeline width
- <u>L2 parameters</u>
  - L2 Miss/Response Queue
  - L2 MSHR
  - <del>L2 Data Port Width</del>
  - L2 Banks
  - Flit Size (Crossbar)
- DRAM parameters
  - Scheduler Queue
  - Banks
  - Bus width

After pruning:

- L1 parameters
  - L1 Miss Queue
    - Simple buffers
    - Minimal cost of scaling
    - Scale by 4x
  - L1 MSHR
    - Fully associative array
    - Moderate cost of scaling
    - Scale by 1.5x
  - Memory pipeline width
- L2 parameters
  - L2 Miss/Response Queue
  - Flit Size (Crossbar)
    - Baseline crossbar: 32+32
    - Crossbar scales quadratically with flit size
    - "Asymmetric Crossbar"

### Asymmetric Crossbar

#### Reads >> Writes

| Crossbar | Point-to-point wiring (bytes) | Wiring overhead |
|---|---|---|
| Symmetric | 32+32 = 64 | Baseline |
| Asymmetric | 16+48 = 64 | None |
| Asymmetric | 16+68 or 32+52 = 84 | 20 bytes |

### Cost-effective Design Space: Summary

- L1 Cache
  - L1 Miss Queue: 8 entries $\rightarrow$ 32 entries
  - Memory pipeline width: 10 wide $\rightarrow$ 40 wide
  - L1 MSHR: 32 entries $\rightarrow$ 48 entries
- L2 Cache
  - L2 Miss/Response Queue: 8 entries $\rightarrow$ 32 entries
  - Flit Size (Crossbar): 32+32 $\rightarrow$ 16+48 (= 64), 16+68 or 32+52 (= 84)

Evaluate 3 cost-effective configurations: 16+48, 16+68, 32+52

### Results

- Area overhead 1.1%: point-to-point wires remain the same as baseline
- Area overhead 1.6%: investing in the response path gives better returns
- Higher speedup on resolving the bandwidth bottleneck in the cache hierarchy
- Configuration with synergistic scaling (of L1 and L2) and asymmetric crossbar with higher response bandwidth (16+68) performs best

[Figure: IPC (normalized to baseline)]

### Conclusion

#### Problem

- High congestion across the memory hierarchy
- Congestion leads to high memory latencies (both to L2 and DRAM)
- High latencies appear in the critical path for memory-intensive applications, causing slowdown

#### Observation

- Characterize stalls and develop insights about the bandwidth bottleneck
- Significant bandwidth bottleneck in the cache hierarchy
- Addressing the bandwidth problem in isolation can even lead to slowdown

#### Proposal

- Synergistic scaling of the bandwidth of the L1 and L2 caches
- Asymmetric scaling of the bandwidth of the crossbar
- 23% speedup with 1.1% area overhead (no additional wires in crossbar)
- 29% speedup with 1.6% area overhead (additional wiring in crossbar)

### Questions?

#### Saumay Dublish

saumay.dublish@ed.ac.uk

http://homepages.inf.ed.ac.uk/s1433370/

Institute for Computing Systems Architecture

The University of Edinburgh
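The arithmetic behind the cost-effective design space summary can be checked with a short script. This is only a minimal sketch: the helper names (`scale_factor`, `wiring_overhead`) and the data layout are hypothetical and not part of the talk or of GPGPU-Sim. The numbers are the ones quoted on the summary and asymmetric crossbar slides, and each flit configuration is assumed to be a pair of path widths in bytes whose sum gives the point-to-point wiring.

```python
# Sketch only: helper names and data layout are assumptions made for
# illustration; the values are copied from the slides.

# (baseline, scaled) values from the cost-effective design space summary
SCALED_PARAMETERS = {
    "L1 miss queue (entries)": (8, 32),
    "Memory pipeline width": (10, 40),
    "L1 MSHR (entries)": (32, 48),
    "L2 miss/response queue (entries)": (8, 32),
}

# Crossbar flit sizes in bytes, written as on the slides (e.g. 32+32).
# Assumption: the two numbers are the widths of the two crossbar paths.
BASELINE_FLIT = (32, 32)
CANDIDATE_FLITS = [(16, 48), (16, 68), (32, 52)]


def scale_factor(baseline: int, scaled: int) -> float:
    """Return how many times larger the scaled value is than the baseline."""
    return scaled / baseline


def wiring_overhead(flit, baseline=BASELINE_FLIT) -> int:
    """Return the extra point-to-point wiring (bytes) relative to the baseline."""
    return sum(flit) - sum(baseline)


if __name__ == "__main__":
    for name, (base, scaled) in SCALED_PARAMETERS.items():
        print(f"{name}: {base} -> {scaled} ({scale_factor(base, scaled):g}x)")
    for flit in CANDIDATE_FLITS:
        label = f"{flit[0]}+{flit[1]}"
        print(f"flit {label}: {sum(flit)} bytes, overhead {wiring_overhead(flit)} bytes")
```

Under these assumptions the queues and the pipeline width come out at 4x, the L1 MSHR at 1.5x, and only the 16+68 and 32+52 crossbars add wiring (20 bytes each), which matches the figures on the slides.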