ABSTRACT

Solution 1.1 (page 34): How fast is a divide? Runtime is dominated by the divide and data resides in registers, so we can assume that the number of clock cycles for each loop iteration equals the divide throughput (which is assumed to be identical to latency here). Take care if SIMD operations are involved; if p divides can be performed concurrently, the benchmark will underestimate the divide latency by a factor of p.
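The reasoning above can be tried out with a small benchmark. The following is only a minimal sketch under stated assumptions, not the reference solution: the integrand 4/(1+x^2) on [0,1] is assumed here as a loop body with exactly one divide per iteration, the function names are hypothetical, and the clock frequency must be supplied by the user because the code does not detect it.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Midpoint-rule integration of 4/(1+x^2) over [0,1] with n slices.
   Each iteration performs exactly one divide; the operands stay in registers.
   The choice of integrand is an assumption made for illustration. */
double integrate_pi(long n)
{
    assert(n > 0);
    const double h = 1.0 / (double)n;
    double sum = 0.0;
    for (long i = 0; i < n; ++i) {
        double x = ((double)i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);   /* the divide that dominates runtime */
    }
    return sum * h;
}

/* Convert a measured runtime into clock cycles per loop iteration.
   clock_hz is the core clock frequency in Hz (user-supplied assumption). */
double cycles_per_iteration(double seconds, long n, double clock_hz)
{
    return seconds * clock_hz / (double)n;
}

/* Time one integration run using CPU time from the standard C library.
   The integral is written to *result so the loop cannot be optimized away. */
double time_integration(long n, double *result)
{
    clock_t start = clock();
    *result = integrate_pi(n);
    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC;
}

/* Print the integral and the estimated cycles per iteration. */
void report_divide_cost(long n, double clock_hz)
{
    double result = 0.0;
    double seconds = time_integration(n, &result);
    printf("integral = %.12f\n", result);
    printf("cycles per iteration = %.2f\n",
           cycles_per_iteration(seconds, n, clock_hz));
}
```

Calling, for example, `report_divide_cost(100000000L, 2.0e9)` from `main` on a core assumed to run at 2 GHz prints an estimate of the cycles per iteration. If the compiler vectorizes the loop so that p divides execute concurrently, the printed value would be expected to be roughly the divide latency divided by p, which is the underestimate described above; inspecting the generated assembly is one way to check for this.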