ABSTRACT

As the gap between processor speed and memory access time increases, the average service time of memory-based operations is becoming a dominant factor in the overall performance of modern processors. RISC processors, such as MIPS [10], can access the cache memory within a single cycle and guarantee a latency of three cycles between the time a Load instruction is fetched and the time an instruction that depends on the value of the Load can use it (i.e., the fetch-load-to-use delay). However, such a machine can execute at most a single instruction per cycle and runs at a relatively slow clock frequency. Modern processors can execute several instructions per cycle in an Out-of-Order (OOO) fashion, run at much higher clock frequencies due to their deep pipeline designs, and use several levels of memory hierarchy. However, it takes them from two to five cycles to access the fastest level of the cache hierarchy, depending on the size and the layout of the cache. Additionally, the fetch-load-to-use delay can vary from five to hundreds of cycles, even when the data reside in the fastest cache.