Parallel and high-performance systems | 11 | Computer Architecture

ABSTRACT

In Chapter 4 we explored a number of techniques, including pipelining, that can be used to make a single processor perform better. As we discovered, pipelining has its limits in terms of improving performance. Instruction execution can only be divided into so many steps, and operations such as memory accesses and control transfers (as well as dependencies between instructions) can cause delays. Superpipelined, superscalar, and very long instruction word (VLIW) designs are implementation and architectural approaches that have been used to overcome (to some degree) the difficulties inherent in extracting performance from a single processor, but each of these approaches has its own costs and limitations. Ultimately, given the level of implementation technology available at any point in time, designers can make a CPU execute instructions only so fast, and no faster. If this is not fast enough for our purposes, that is, if we cannot get the performance we need from a system with a single CPU, the remaining, obvious alternative is to use multiple processors to increase performance. Machines with multiple processing units are commonly known as parallel processing systems, though a more appropriate term might be concurrent or cooperative processing.

It should come as no surprise to the reader that there have been, and still are, many types of high-performance computer systems, most of which are parallel to some extent. The need for high-performance computing hardware is common across many types of applications, each of which has different characteristics that favor some approaches over others. Some algorithms are more easily parallelized than others, and the nature of the inherent parallelism may be quite different from one program to another. Certain applications, such as computational fluid dynamics (CFD) codes, may be able to take advantage of massively parallel systems with thousands of floating-point processors. Others, for example game tree searching, may only be able to efficiently use a small number of central processing units (CPUs), and the operations required of each one may be quite different from those required of the CFD machine. Thus, a wide variety of systems ranging from two to tens of thousands of processors have been built and found some degree of success.
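
The observation that some programs can exploit thousands of processors while others benefit from only a few can be illustrated with a rough calculation. The sketch below uses Amdahl's law, a standard idealized model that the abstract itself does not name, so its use here is an assumption. The function name `amdahl_speedup` and the two workload fractions are hypothetical, chosen only for illustration, and are not measurements of any real CFD or game-tree code.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Idealized speedup when only part of a program's work can run in parallel.

    parallel_fraction: share of the single-processor run time that can be
        spread across processors (0.0 to 1.0).
    processors: number of processors applied to the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0.0 and 1.0")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of processor count;
    # the parallel part shrinks in proportion to the number of processors.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical workloads: the fractions are illustrative assumptions only.
    workloads = {
        "highly parallel (assumed 99.9% parallel)": 0.999,
        "mostly serial (assumed 50% parallel)": 0.50,
    }
    for label, fraction in workloads.items():
        for n in (2, 16, 1024):
            speedup = amdahl_speedup(fraction, n)
            print(f"{label}: {n} processors -> {speedup:.1f}x")
```

Under this model, a workload that is only half parallelizable can never run more than twice as fast no matter how many processors are added, while one that is almost entirely parallelizable keeps improving into the hundreds or thousands of processors. The model ignores communication and synchronization costs, so real systems would typically do worse than these figures.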