SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors
SIMD instructions have been widely adopted due to their effectiveness at speeding up fine-grained data parallelism. Developers often take advantage of SIMD through automatic vectorization. For nests of short loops, however, automatic vectorization performs poorly. Vectorizers attempt to vectorize only a single loop, which uses only a fraction of the processor’s capacity when the loop is shorter than the processor’s SIMD width. Vectorizing multiple nested loops is not straightforward because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize.
We present a solution in the context of compiling small tensor multiplication algorithms. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed to permute instructions.
We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 10% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.7 times as fast on average as that produced by GCC’s state-of-the-art vectorizer, with a maximum speedup of 15 times. We discuss potential extensions to vectorize more general algorithms.
Slides: WPMVP Small Tensor Multiplication.20180224.pdf (2.5 MiB)
Sat 24 Feb