Collective operations calculate such quantities as vector norms and error residuals or the value of some physical quantity such as the value of the energy to check conservation of energy. Because these operations involve communication across images, they must be implemented carefully and used sparingly. Naive implementations based on a linear algorithm limit scalability because execution time increases linearly with the number of images. Implementations based on a binary-tree algorithm improve scalability because the execution time increases more slowly like the logarithm of the number of images. Optimization of collective operations is an active area of research [11] [18] [66] [96]. Indeed, for very large machines, it may be necessary to design new algorithms free of collective operations [4].