Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Date of Submission: October 19, 1998
Report Number: 98-020 REVISED
Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add significantly to the hardware complexity. This paper takes an alternative or complementary approach to providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and hardware, partitions the memory stream into multiple independent streams early in the processor pipeline, and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to a program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC 95 benchmark suite, it is shown that local variable accesses constitute a large portion of all memory references, while their reference space is very small, averaging around 7 words per procedure. To service local variable accesses quickly, two optimizations, fast data forwarding and access combining, are proposed and studied. Some of the important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied based on simulations. The potential performance of the proposed scheme is measured using various configurations, and it is concluded that the scheme can become a viable alternative to building a single multi-ported data cache.
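
The partitioning idea described above can be illustrated with a minimal software sketch. It is not the paper's mechanism: the abstract does not say how local variable accesses are identified, so the sketch assumes, purely for illustration, that any access whose base register is a stack or frame pointer targets the run-time stack. The names `MemAccess`, `STACK_BASE_REGS`, and `partition` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

# Assumption for illustration only: accesses based on these registers
# are treated as local variable (run-time stack) accesses.
STACK_BASE_REGS = {"sp", "fp"}


@dataclass(frozen=True)
class MemAccess:
    op: str        # "load" or "store"
    base_reg: str  # base register used to form the address
    offset: int    # byte offset from the base register


def partition(stream):
    """Split one memory access stream into two independent queues.

    Program order is preserved within each queue; each queue would
    then feed its own cache in a data-decoupled design.
    """
    local_queue, general_queue = deque(), deque()
    for access in stream:
        if access.base_reg in STACK_BASE_REGS:
            local_queue.append(access)
        else:
            general_queue.append(access)
    return local_queue, general_queue


stream = [
    MemAccess("load", "sp", 8),
    MemAccess("load", "r4", 0),
    MemAccess("store", "fp", -4),
    MemAccess("load", "r7", 16),
]
local_queue, general_queue = partition(stream)
print(len(local_queue), len(general_queue))
```

In this toy stream, the two stack-relative accesses go to the local variable queue and the other two go to the general queue, so neither queue competes with the other for cache ports.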