Translation of Foreign Scientific and Technical Literature
applications that share code or data, the working set size is adjusted by the average number ($N_{share}$) of cores that share an L2 cache line, where $N_{share}(N)$ is a function of $N$. The average cache size for a single core is calculated as [10]
$$S_{L2}(1) = \frac{S_{L2}(N)}{N - N_{share}(N) + 1} \tag{4}$$
To project the miss rate for caches of different sizes, the square-root rule of thumb is typically applied, which models the cache miss rate as
$$M_{rate}(S_{L2}(1)) = \frac{M_{rate}(1\,\mathrm{MB})}{\sqrt{S_{L2}(1)/S_{1MB}}} \tag{5}$$
where $S_{1MB}$ is 1 MB. For some applications, the square-root model in (5) is less accurate than the working set model, in which the miss rate remains constant as cache size increases until the working set fits in the cache, after which the miss rate falls off sharply. Since the dependency of miss rate on cache size is application specific, the miss rate of a single core is simulated at multiple cache sizes with an industrial cycle-accurate simulator to determine the appropriate miss rate model for an individual application. Based on simulations across a wide range of applications, the square-root model provides the most accurate approximation of the average miss rate.
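The two cache models described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the step-shaped working set model takes two miss rate levels (`m_before`, `m_after`) that are assumptions, since the text only states that the miss rate stays constant and then falls off once the working set fits.

```python
import math

S_1MB = 1.0  # reference cache size in MB, as in (5)


def per_core_cache_mb(s_l2_total_mb: float, n_cores: int, n_share: float) -> float:
    """Average L2 capacity available to one core, following (4)."""
    return s_l2_total_mb / (n_cores - n_share + 1)


def sqrt_miss_rate(m_rate_1mb: float, s_l2_mb: float) -> float:
    """Square-root rule of thumb (5): miss rate scales as 1/sqrt(cache size)."""
    return m_rate_1mb / math.sqrt(s_l2_mb / S_1MB)


def working_set_miss_rate(s_l2_mb: float, working_set_mb: float,
                          m_before: float, m_after: float) -> float:
    """Assumed step model: constant miss rate until the working set fits."""
    return m_before if s_l2_mb < working_set_mb else m_after


if __name__ == "__main__":
    # Illustrative values only: 32 MB shared L2, 8 cores, no line sharing.
    s1 = per_core_cache_mb(32.0, n_cores=8, n_share=1.0)
    print(s1, sqrt_miss_rate(0.01, s1))
```

With no sharing ($N_{share} = 1$) the per-core capacity reduces to $S_{L2}(N)/N$, and with every line shared by all cores ($N_{share} = N$) each core effectively sees the whole cache.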
To model instructions per cycle (IPC) for the multi-core processor, the effects of limited off-chip memory bandwidth are captured by separating $L_{miss}(F_{clk})$ into two components as
$$L_{miss}(F_{clk}) = \frac{L_{mem}(F_{clk})}{N_{pr}} + L_{link}(F_{clk}). \tag{6}$$
$L_{mem}(F_{clk})$, the off-chip DRAM memory latency, is calculated as the average number of cycles spent in the DRAM array to obtain data. In modeling out-of-order nonblocking cores that exploit memory-level parallelism (MLP), $L_{mem}(F_{clk})$ is divided by the average number ($N_{pr}$) of parallel memory requests since each request blocks the processor for a fraction of the total memory latency [11]. For in-order blocking cores, $N_{pr} = 1$. $L_{link}(F_{clk})$, the total link latency, includes the latency of the physical off-chip link and the queuing latency (e.g., waiting in miss status handling registers (MSHRs) and bus queues). $L_{link}(F_{clk})$ is separated into two components as
$$L_{link}(F_{clk}) = L_s(F_{clk}) + L_q(F_{clk}) \tag{7}$$
where $L_s(F_{clk})$ and $L_q(F_{clk})$ are the service and queuing latencies per cache miss, respectively. $L_s(F_{clk})$ is the physical off-chip link latency for data to traverse the link from the processor to the DRAM chip and back, where no transmission errors are assumed. $L_q(F_{clk})$ is computed as the mean queuing latency. Assuming the physical off-chip link to memory represents an M/D/1 queue (Markovian arrival rate for requests with a deterministic service time and an infinite number of request sources), $L_q(F_{clk})$ is modeled as
$$L_q(F_{clk}) = \frac{U\,L_s(F_{clk})}{2\,(1 - U)} \tag{8}$$
where $U$ is the link utilization. Using Little's law, $U$ is computed as
$$U = \lambda\,L_s(F_{clk}). \tag{9}$$
The parameter $\lambda$ is the number of memory requests per cycle, which is calculated as
$$\lambda = IPC(N)\,M_{rate}(S_{L2}(1)) \tag{10}$$
where $IPC(N)$ represents the IPC for a multi-core processor with $N$ cores. From (7)–(9), the total link latency is calculated as
$$L_{link}(F_{clk}) = L_s(F_{clk}) + \frac{\lambda\,(L_s(F_{clk}))^2}{2\,(1 - \lambda L_s(F_{clk}))}. \tag{11}$$
As described in (12), the IPC for a multi-core processor is calculated from (3), (6), and (11) [10]. Since $\lambda$ is a function of $IPC(N)$, (12) reduces to a quadratic equation, which yields an explicit solution for $IPC(N)$. The multi-core processor throughput $TP(N)$ is given in (13), and $F_{clk,nom}$ is the nominal processor clock frequency. The terms $CPI_{mem,lat}(F_{clk})/F_{clk}$ and $CPI_{mem,bw}(F_{clk})/F_{clk}$ represent the memory latency and bandwidth components of throughput, which are modeled in (14) and (15), respectively. Additional assumptions are applied to trade off accuracy for runtime efficiency: 1) MT benchmarks are perfectly parallelizable (i.e., only the parallel portion of MT applications is modeled); 2) average benchmark performance is an appropriate metric for evaluating general trends; and 3) the additional inter-thread interactions and operating system overhead when scheduling threads on a multi-core processor are negligible.
$$IPC(N) = \frac{N}{CPI(1)} = \frac{N}{CPI_{com} + M_{rate}(S_{L2}(1))\left(\dfrac{L_{mem}(F_{clk})}{N_{pr}} + L_s(F_{clk}) + \dfrac{\lambda\,(L_s(F_{clk}))^2}{2\,(1 - \lambda L_s(F_{clk}))}\right)} \tag{12}$$
$$TP(N) = IPC(N)\,F_{clk} = \frac{N}{\dfrac{CPI_{com}}{F_{clk}} + \dfrac{CPI_{mem,lat}(F_{clk})}{F_{clk}} + \dfrac{CPI_{mem,bw}(F_{clk})}{F_{clk}}} \tag{13}$$
$$\frac{CPI_{mem,lat}(F_{clk})}{F_{clk}} = M_{rate}(S_{L2}(1))\,\frac{L_{mem}(F_{clk,nom})}{F_{clk,nom}\,N_{pr}} \tag{14}$$
$$\frac{CPI_{mem,bw}(F_{clk})}{F_{clk}} = M_{rate}(S_{L2}(1))\,\frac{L_s(F_{clk,nom})}{F_{clk,nom}} \cdot \frac{1 - \dfrac{1}{2}\,\dfrac{\lambda\,F_{clk}\,L_s(F_{clk,nom})}{F_{clk,nom}}}{1 - \dfrac{\lambda\,F_{clk}\,L_s(F_{clk,nom})}{F_{clk,nom}}} \tag{15}$$
The analytical model in (13)–(15) is validated for both single-threaded (ST) and highly parallel MT applications. For ST applications, one core is assumed to have access to the entire L2 cache. Although the model primarily targets the performance of highly parallel MT applications, the analytical model is easily modified for ST applications by adjusting the miss rate from $M_{rate}(S_{L2}(1))$ to $M_{rate}(S_{L2}(N))$. In validating the analytical model for ST applications, the model projections of the average IPC from 460 workloads are compared with an industrial cycle-accurate simulator for different core types and cache sizes. The 460 workloads consist of server, multimedia, games, SPEC2K, and office productivity applications. The only workload-specific model parameters are $CPI_{com}$, $M_{rate}(1\,\mathrm{MB})$, and $N_{pr}$. $CPI_{com}$ is extracted by operating the simulator with a perfect L2 cache; $M_{rate}(1\,\mathrm{MB})$ and $N_{pr}$ are extracted by operating with a 1 MB cache. The $CPI_{com}$, $M_{rate}(1\,\mathrm{MB})$, and $N_{pr}$ values applied in the analytical model represent the average extracted values across the 460 workloads. Comparing the analytical model to the industrial cycle-accurate simulator across a variety of core types and L2 cache sizes, the model projections of average IPC for the 460 workloads are within 4% of the simulation results.
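The link latency in (7)–(11), the explicit IPC solution of (12), and the throughput in (13) can be sketched as follows. This is an illustrative reading of the equations rather than the authors' code. The function names, the parameter values in the example, and the treatment of the miss rate as misses per instruction are assumptions. The `throughput` helper further assumes that memory and link latencies are fixed in seconds, so their cycle counts scale with $F_{clk}/F_{clk,nom}$, which is how (14) and (15) appear to be constructed.

```python
import math


def link_latency(lam: float, l_s: float) -> float:
    """Total link latency (11) in cycles: service time plus M/D/1 queuing delay."""
    u = lam * l_s  # link utilization, (9)
    if not 0.0 <= u < 1.0:
        raise ValueError("link utilization must be in [0, 1)")
    return l_s + u * l_s / (2.0 * (1.0 - u))  # (7) with (8)


def solve_ipc(n, cpi_com, m_rate, l_mem, l_s, n_pr=1.0):
    """Solve (12) for IPC(N), with lambda = IPC(N) * m_rate from (10).

    Substituting (10) into (12) gives a*x^2 + b*x + c = 0 in x = IPC(N).
    The smaller positive root is returned; it keeps the utilization below 1.
    """
    k = m_rate * l_s                                # utilization per unit of IPC
    base = cpi_com + m_rate * (l_mem / n_pr + l_s)  # CPI with an idle link
    a = k * (k - 2.0 * base)
    b = 2.0 * (base + n * k)
    c = -2.0 * n
    disc = b * b - 4.0 * a * c
    # Numerically stable form of the smaller root; also valid when a == 0.
    return -2.0 * c / (b + math.sqrt(disc))


def throughput(n, f_clk, f_nom, cpi_com, m_rate, l_mem_nom, l_s_nom, n_pr=1.0):
    """TP(N) = IPC(N) * F_clk as in (13), under the fixed-time latency assumption."""
    scale = f_clk / f_nom  # latencies in cycles grow with clock frequency
    ipc = solve_ipc(n, cpi_com, m_rate, l_mem_nom * scale, l_s_nom * scale, n_pr)
    return ipc * f_clk


if __name__ == "__main__":
    # Illustrative inputs: 4 cores, CPI_com = 1, 0.01 misses per instruction,
    # 200-cycle DRAM latency and 20-cycle link service time at nominal clock.
    ipc = solve_ipc(4, 1.0, 0.01, 200.0, 20.0)
    print("IPC(N):", ipc)
    print("link latency (cycles):", link_latency(ipc * 0.01, 20.0))
    for f in (2e9, 4e9):
        print(f, throughput(4, f, 2e9, 1.0, 0.01, 200.0, 20.0))
```

Under these assumptions, doubling the clock frequency raises throughput by less than a factor of two, because the memory latency and bandwidth terms in (13) do not shrink with $F_{clk}$.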
In validating the analytical model for highly parallel MT applications, the IPC model in (12) is compared with Asim simulations [12] in Fig. 1 for a variety of recognition, mining, and synthesis (RMS) benchmarks [13] across the number of cores contained in the multi-core processor. These RMS benchmarks focus on the basic building blocks of matrix-oriented data manipulation and calculations that are increasingly being utilized to model and process complex systems [13]. The benchmarks are: 1) kmeans; 2) ADAt; 3) sparse_mvm_sym, symmetrical sparse matrix-vector multiplication; 4) dense_mmm, dense matrix-matrix multiplication; and 5) sparse_mvm, sparse matrix-vector multiplication.
The Asim simulator [12] evaluates each workload while capturing the effects of multiple cores, shared L2 cache, and the interconnection network between the L2 cache and off-chip DRAM memory. The comparison in Fig. 1 is based on a 2-wide in-order core with a 32 MB L2 cache, 128-byte cache line size, and 200-cycle memory latency. Since $N_{pr} = 1$ for the in-order core, the only workload-specific inputs to the analytical model are $CPI_{com}$ and the cache miss rate. For three of the benchmarks (kmeans, ADAt, and dense_mmm), the square-root cache miss rate model in (5) is applied. For the other two benchmarks (sparse_mvm_sym and sparse_mvm), the working set model is used to estimate the cache miss rate. For the kmeans, ADAt, dense_mmm, and sparse_mvm benchmarks, the analytical model agrees closely with Asim simulations, where the worst-case error is less than 5%. The sparse_mvm_sym benchmark contains large sections of serial execution, leading to a worst-case model error of 22%. Although the model is less accurate for MT applications with large portions of serial execution, the multi-core processor throughput model agrees well with the Asim simulator for MT applications with large sections of parallel execution and with an industrial cycle-accurate simulator for ST applications. As previously discussed, the analytical model primarily targets highly parallel MT workloads with negligible serial execution. In the remainder of this paper, the MT applications are assumed perfectly parallelizable, for which the analytical model is sufficiently accurate. If MT applications with large portions of serial execution are considered in future work, then the analytical throughput model in (13)–(15) may be extended [14] to improve the accuracy for these applications.
Fig. 1. Comparison of IPC model projections from (12) with Asim simulator [12] for a variety of RMS benchmarks [13] versus number of cores.
III. MULTI-CORE PROCESSOR DESIGNS
In optimizing a multi-core processor in Section IV and in exploring the impact of parameter variations on multi-core processor FMAX and throughput in Section VI, three separate multi-core processors are evaluated. These three processors contain either small, medium, or large cores to investigate a range of multi-core processor design options. In addition, a traditional single-core processor, containing a monolithic core, is used as a baseline comparison. The small, medium, and large cores are based on the Intel Pentium P54C (in-order) [15], the Intel Pentium III (out-of-order) [16], and the Intel CORE 2 (advanced out-of-order) [17] microprocessors, respectively. In Fig. 2, the product introduction technology generation, core area, average $F_{clk}$, normalized average SPECint throughput, cache size, supply voltage ($V_{DD}$), and core power for each core type are summarized based on historical data [15]–[20]. Note that the core area excludes the L2 cache area.