Matrix transposition is an important operation in many parallel application codes. Although it is one of the simplest operations in linear algebra, it is among the hardest to perform efficiently in parallel, because it stresses not only local memory but also the network that connects remote memory across the machine. As a result, it often becomes a major performance bottleneck.
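To make the local-memory side of this concrete, the following minimal sketch contrasts a naive transpose with a tiled one on a row-major matrix stored as a flat list. The function names `transpose_naive` and `transpose_blocked` and the `block` parameter are illustrative assumptions, not taken from any particular code. The sketch shows only the access pattern: for every contiguous read, the write lands `rows` elements away. It does not model the network traffic that arises when source and destination elements reside on different nodes.

```python
def transpose_naive(a, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list."""
    out = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # Reads walk `a` contiguously; writes jump through `out`
            # with a stride of `rows` elements.
            out[j * rows + i] = a[i * cols + j]
    return out


def transpose_blocked(a, rows, cols, block=2):
    """Same result, computed tile by tile (hypothetical tile size `block`)."""
    out = [0] * (rows * cols)
    for i0 in range(0, rows, block):
        for j0 in range(0, cols, block):
            # Within one tile, both the reads and the writes touch only
            # a small, bounded region of memory.
            for i in range(i0, min(i0 + block, rows)):
                for j in range(j0, min(j0 + block, cols)):
                    out[j * rows + i] = a[i * cols + j]
    return out
```

Both functions produce the same output; tiling only changes the order in which elements are visited, which is one common way to reduce the stress on local memory.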