Translation of Foreign Scientific and Technical Literature
Impact of Die-to-Die and Within-Die Parameter Variations on the Clock Frequency and Throughput of Multi-Core Processors
Keith A. Bowman, Member, IEEE, Alaa R. Alameldeen, Member, IEEE, Srikanth T. Srinivasan,
Member, IEEE, and Chris B. Wilkerson, Member, IEEE
Abstract—A statistical performance simulator is developed to explore the impact of parameter variations on the maximum clock frequency (FMAX) and throughput distributions of multi-core processors in a future 22 nm technology. The simulator captures the effects of die-to-die (D2D) and within-die (WID) transistor and interconnect parameter variations on critical path delays in a die. A key component of the simulator is an analytical multi-core processor throughput model, which enables computationally efficient and accurate throughput calculations, as compared with cycle-accurate performance simulators, for single-threaded and highly parallel multi-threaded (MT) workloads. Based on microarchitecture designs from previous microprocessors, three multi-core processors with either small, medium, or large cores are projected for the 22 nm technology generation to investigate a range of design options. These three multi-core processors are optimized for maximum throughput within a constant die area. A traditional single-core processor is also scaled to the 22 nm technology to provide a baseline comparison. The salient contributions from this paper are: 1) product-level variation analysis for multi-core processors must focus on throughput, rather than just FMAX, and 2) multi-core processors are more variation tolerant than single-core processors due to the larger impact of memory latency and bandwidth on throughput. To elucidate these two points, statistical simulations indicate that multi-core and single-core processors with an equivalent total core area have similar FMAX distributions (mean degradation of 9% and standard deviation of 5%)
for MT applications. In contrast to single-core processors, memory latency and bandwidth constraints significantly limit the throughput dependency on FMAX in multi-core processors, thus reducing the throughput mean degradation and standard deviation by 50% for the small and medium core designs and by ~30% for the large core design. This improvement in the throughput distribution indicates that multi-core processors could significantly reduce the product design and process development complexities due to parameter variations as compared to single-core processors, enabling faster time to market for high-performance microprocessor products.
Index Terms—Clock frequency distribution, critical path delay variations, die-to-die (D2D) variations, inter-die variations, intra-die variations, maximum clock frequency (FMAX) distribution, multi-core, parameter fluctuations, parameter variations, performance distribution, throughput distribution, within-die (WID) variations.
Manuscript received May 17, 2008; revised August 15, 2008. First published May 19, 2009; current version published November 18, 2009.
The authors are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: keith.a.bowman@intel.com; alaa.r.alameldeen@intel.com; srikanth.t.srinivasan@intel.com; chris.wilkerson@intel.com).
Digital Object Identifier 10.1109/TVLSI.2008.2006057
I. INTRODUCTION
MICROPROCESSORS have always been vulnerable to parameter variations in the manufacturing process. As process technology continues scaling, variations in transistor and interconnect characteristics are increasing relative to nominal design targets. The adverse effects of parameter variations on the maximum clock frequency (FMAX) and power of a microprocessor are also becoming more pronounced with technology scaling [1], [2]. Parameter variations can be classified into two categories: die-to-die (D2D) and within-die (WID). D2D variations, resulting from lot-to-lot, wafer-to-wafer, and a portion of the within-wafer variations, affect all transistors and interconnects on a die equally. Conversely, WID variations, consisting of random and systematic components, induce different electrical characteristics across a die [3]. A random WID parameter variation fluctuates randomly and independently from device to device (i.e., device-to-device correlation is zero). A systematic WID parameter variation results from a repeatable and governing principle, where the device-to-device correlation is empirically determined as a function of the distance between the devices. Although systematic WID variations exhibit a correlated behavior, the profile of these variations can randomly change from die to die. From a design perspective, systematic WID variations behave as continuous and smooth correlated random WID variations [1], [3]–[6].
In designing high-performance microprocessors, the importance of accurately estimating the impact of parameter variations on product-level performance directly relates to the overall revenue of a company. An overestimation increases design complexity, possibly leading to higher power consumption, an increase in design time, an increase in die size, rejection of otherwise good design options, and even missed market windows [3]. Conversely, an underestimation can compromise product performance and overall yield as well as increase the silicon debug time [3]. In summary, overestimating variations impacts the design effort and underestimating variations impacts the manufacturing effort.
In recent technology generations, multi-core processors have emerged as a power-efficient approach to designing high-performance microprocessors. Multi-core processors employ more than one core on a die, where the number of cores and the core complexity are key design tradeoffs. Multi-core processors can achieve better performance than single-core processors for multi-threaded (MT) applications by executing threads in parallel across the cores.
Previous research has investigated the impact of D2D and WID parameter variations on the FMAX and power distributions of single-core processors [1], [2], [4], [5], [7], [8]. The impact of parameter variations on power, where leakage is the dominant variation component, does not fundamentally change from single-core to multi-core processors. A multi-core processor may enable much finer granularity in placing portions of the chip into a sleep state. When all transistors on the chip are in an operational mode, however, the relative effect of D2D and WID parameter variations on the leakage is expected to be similar between single-core and multi-core processors. In contrast, the multi-core design represents a fundamental shift in microprocessor performance from the traditional single-core design, where the parallelism in MT applications is exploited across the cores in a die.
In this paper, the impact of D2D and WID parameter variations on the FMAX and throughput distributions of multi-core processors [9] is explored. The throughput metric represents the actual microprocessor performance, thus providing an architecture-level perspective of device and circuit parameter variability. In Section II, an analytical multi-core processor throughput model is derived to enable accurate throughput calculations for highly parallel workloads with runtime efficiency. In Section III, three multi-core processors and a single-core processor are projected for a future 22 nm technology generation based on historical data and traditional scaling trends. Applying the analytical throughput model, a multi-core processor optimization is described in Section IV to maximize the throughput of the three multi-core processors. In Section V, the analytical throughput model is integrated into a statistical performance simulator that captures the effects of D2D and WID parameter variations on critical path delays across a die to generate FMAX and throughput distributions for a given multi-core design. In Section VI, the impact of parameter variations on the FMAX and throughput distributions of the three optimal multi-core processors and the single-core processor is presented. Section VII concludes by summarizing the key insights.
II. MULTI-CORE PROCESSOR THROUGHPUT MODEL
A compact analytical throughput model is derived to enable computationally efficient and accurate projections of multi-core processor throughput for highly parallel MT applications. Since the statistical performance simulator, which will be described in Section V, performs thousands of throughput calculations per multi-core design, the runtime efficiency is an essential feature. For this reason, an analytical modeling approach is desired rather than a computationally expensive throughput simulator. The throughput model derivation starts by separating the die area ($A_{die}$) into two main parts as
$$A_{die} = A_{cores} + A_{L2}(N). \tag{1}$$
$A_{cores}$ is the total area allocated to the cores, where each core is assumed to contain private level-1 (L1) instruction and data caches. $A_{L2}(N)$ is the total level-2 (L2) cache area with $N$ cores sharing the cache. The L2 cache size in units of megabytes is calculated as
$$S_{L2}(N) = \frac{A_{L2}(N)}{A_{1MB}} \tag{2}$$
where $A_{1MB}$ is the cache area per 1 MB, as determined by process technology.
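Equations (1) and (2) can be sketched numerically as follows. This is a minimal illustration, not the authors' simulator: the function name `l2_cache_size_mb` and all input values are hypothetical, and areas are assumed to be given in mm².

```python
def l2_cache_size_mb(a_die_mm2, a_cores_mm2, a_1mb_mm2):
    """Return the shared L2 cache size S_L2(N) in MB.

    Eq. (1) rearranged gives the L2 area: A_L2(N) = A_die - A_cores.
    Eq. (2) converts that area to megabytes: S_L2(N) = A_L2(N) / A_1MB.
    """
    a_l2_mm2 = a_die_mm2 - a_cores_mm2
    if a_l2_mm2 < 0:
        raise ValueError("core area exceeds die area")
    if a_1mb_mm2 <= 0:
        raise ValueError("area per MB must be positive")
    return a_l2_mm2 / a_1mb_mm2


# Illustrative inputs only (assumed, not taken from the paper):
# a 200 mm^2 die, 120 mm^2 of cores, 2.5 mm^2 of cache area per MB.
s_l2_mb = l2_cache_size_mb(200.0, 120.0, 2.5)
print(f"S_L2(N) = {s_l2_mb} MB")
```

With a constant die area, any area given to the cores is taken from the L2 cache, which is the tradeoff the model is built around.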
For a given workload, the cycles per instruction (CPI) for a single core are modeled as
$$CPI(1) = CPI_{com} + M_{rate}(S_{L2}(1)) \, L_{miss}(F_{clk}). \tag{3}$$
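As a rough numerical illustration of equation (3), the sketch below adds a computation term to the product of a miss rate (misses per instruction) and a miss penalty (cycles per miss). The form of the miss penalty is an assumption made only for this example: a fixed memory latency in nanoseconds converted to core clock cycles, with no bandwidth effects. The helper names and all numbers are hypothetical and are not taken from the paper.

```python
def miss_penalty_cycles(f_clk_ghz, mem_latency_ns):
    """Hypothetical L_miss(F_clk): a fixed latency in ns expressed in cycles.

    Assumption for illustration only; bandwidth effects are not modeled.
    """
    return mem_latency_ns * f_clk_ghz


def cpi_single_core(cpi_com, miss_rate, l_miss_cycles):
    """Eq. (3): CPI(1) = CPI_com + M_rate * L_miss."""
    return cpi_com + miss_rate * l_miss_cycles


# Illustrative inputs only (assumed, not values from the paper).
cpi_com = 1.0          # CPI with a perfect L2 cache
miss_rate = 0.005      # L2 misses per instruction
mem_latency_ns = 100.0

for f_clk_ghz in (3.0, 4.0):
    l_miss = miss_penalty_cycles(f_clk_ghz, mem_latency_ns)
    cpi = cpi_single_core(cpi_com, miss_rate, l_miss)
    ns_per_instruction = cpi / f_clk_ghz
    print(f"{f_clk_ghz} GHz: CPI = {cpi:.3f}, "
          f"{ns_per_instruction:.3f} ns per instruction")
```

Under these assumed inputs, raising the clock from 3 GHz to 4 GHz (about 33%) improves the time per instruction by only about 11%, because the miss penalty measured in cycles grows with the clock frequency.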
$CPI_{com}$, the computation component of CPI, is the core CPI with a perfect L2 cache (i.e., no cache misses). $CPI_{com}$ is independent of the processor clock frequency ($F_{clk}$). $M_{rate}(S_{L2}(1))$, the miss rate, is the number of misses per instruction for a cache of size $S_{L2}(1)$. $L_{miss}(F_{clk})$, the miss penalty, is the average number of cycles per L2 cache miss. $L_{miss}(F_{clk})$ is a function of $F_{clk}$. The product of $M_{rate}(S_{L2}(1))$ and $L_{miss}(F_{clk})$ represents the memory latency and memory bandwidth components of CPI. $S_{L2}(1)$ is the effective L2 cache size for one core. If the cores do not share code or data in the cache, then the average cache size per core is 1/Nth of the entire L2 cache size ($S_{L2}(1) = S_{L2}(N)/N$). For