Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Date of Submission: May 4, 1998
Report Number: 
Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add significantly to hardware complexity. This paper takes an alternative or complementary approach to providing more data bandwidth, called the data-decoupled architecture. The approach, with help from a compiler or support from hardware, partitions the memory stream into multiple independent streams early in the processor pipeline and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies decoupling memory accesses to a program's local variables, which are allocated on the run-time stack. Using a set of integer programs from the SPEC 95 benchmark suite, it is shown that local variable accesses constitute a large portion of all memory references, while their reference space is typically very small, averaging around 7 words per function. To service local variable accesses quickly, three optimizations, including fast data combining and dead variable detection, are proposed. Some of the important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied through simulation. The potential performance of the proposed scheme is measured under various configurations, and it is concluded that the scheme is a viable alternative to building a single multi-ported data cache.
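
The partitioning idea can be illustrated with a minimal software sketch. It is not the mechanism evaluated in the report: it assumes, purely for illustration, that a reference counts as a local variable access whenever its base register is a stack or frame pointer, and the names `MemRef`, `STACK_BASE_REGS`, `partition`, and `local_fraction` are hypothetical. Each reference in a trace is routed to one of two independent queues, mirroring the separate memory access queues described above.

```python
from collections import deque
from dataclasses import dataclass

# Assumption: references based on these registers are treated as
# local variable (run-time stack) accesses. The names are illustrative.
STACK_BASE_REGS = {"sp", "fp"}


@dataclass(frozen=True)
class MemRef:
    op: str        # "load" or "store"
    base_reg: str  # base register used by the addressing mode
    offset: int    # displacement in bytes


def partition(stream):
    """Split a memory reference stream into two independent queues."""
    local_queue, data_queue = deque(), deque()
    for ref in stream:
        if ref.base_reg in STACK_BASE_REGS:
            local_queue.append(ref)   # would feed the local variable cache
        else:
            data_queue.append(ref)    # would feed the conventional data cache
    return local_queue, data_queue


def local_fraction(stream):
    """Fraction of references classified as local variable accesses."""
    local_queue, data_queue = partition(stream)
    total = len(local_queue) + len(data_queue)
    return len(local_queue) / total if total else 0.0


# A made-up four-reference trace, for illustration only.
trace = [
    MemRef("load", "sp", 8),
    MemRef("store", "fp", -4),
    MemRef("load", "r3", 0),
    MemRef("load", "sp", 12),
]
```

Calling `partition(trace)` places the three stack-based references in the first queue and the remaining one in the second. A real design would have to make this decision in the compiler or in early pipeline stages, and would need to handle references whose classification cannot be determined from the base register alone.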