SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors (WPMVP 2018 - Workshop on Programming Models for SIMD/Vector Processing)

Who

Christopher Rodrigues, Amarin Phaosawasdi, Peng Wu

Track

WPMVP 2018

Time Zone

The program is currently displayed in (GMT+01:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Sat 24 Feb 2018 11:00 - 11:30 at Europa 5 - WPMVP 2018 Session 2

Abstract

SIMD instructions have been widely adopted due to their effectiveness at speeding up fine-grained data parallelism. Developers often take advantage of SIMD through automatic vectorization. For nests of short loops, however, automatic vectorization performs poorly. Vectorizers attempt to vectorize only a single loop, which uses only a fraction of the processor’s capacity when the loop is shorter than the processor’s SIMD width. Vectorizing multiple nested loops is not straightforward because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize.
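To make the problem concrete, here is a minimal sketch in C, not taken from the paper. It assumes a hypothetical 4x4 single-precision matrix product and a 16-lane vector unit; the names `matmul_nest` and `matmul_collapsed` and the extent `N` are illustrative only. The first function is the usual loop nest, whose innermost loop has only four iterations. The second collapses the `i` and `j` loops into one 16-iteration loop that could fill all lanes, but the lanes now read `a` and `b` with different strides, which is the access pattern the abstract describes as hard to vectorize profitably.

```c
#include <stddef.h>

enum { N = 4 };  /* hypothetical extent: shorter than a 16-lane vector */

/* Reference loop nest: the innermost j loop has only N = 4 iterations,
   so vectorizing it alone would use 4 of the 16 assumed lanes. */
void matmul_nest(const float *a, const float *b, float *c) {
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

/* Same computation with i and j collapsed into one loop of N * N = 16
   iterations, one per assumed lane. Across lanes, a is read with a
   stride of N repeated N times, and b is read as a repeating row. */
void matmul_collapsed(const float *a, const float *b, float *c) {
    for (size_t k = 0; k < N; k++)
        for (size_t lane = 0; lane < N * N; lane++) {
            size_t i = lane / N;
            size_t j = lane % N;
            c[lane] += a[i * N + k] * b[k * N + j];
        }
}
```

Both functions accumulate each output element over `k` in the same order, so they produce identical results; only the shape of the loop that a vectorizer would see differs.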

We present a solution in the context of compiling small tensor multiplication algorithms. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed to permute instructions.
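The idea of caching a tile in a vector register and replacing strided accesses with permutes can be sketched as follows. This is an illustration under stated assumptions, not the authors' compiler output: it models a 16-lane register as a plain C struct (`vreg`), models a permute instruction as the hypothetical helper `permute`, and uses a 4x4 tile (`T`). A real implementation would use hardware permute instructions and index tables fixed at compile time.

```c
#include <stddef.h>

enum { T = 4, LANES = T * T };  /* assumption: a 4x4 float tile fills one register */

typedef struct { float lane[LANES]; } vreg;  /* scalar model of a vector register */

/* Model of a register permute: out.lane[l] = src.lane[idx[l]]. */
static vreg permute(vreg src, const unsigned char idx[LANES]) {
    vreg out;
    for (size_t l = 0; l < LANES; l++)
        out.lane[l] = src.lane[idx[l]];
    return out;
}

/* c += a * b for row-major 4x4 tiles. Each tile is loaded once with a
   contiguous access; the strided reads happen as in-register permutes. */
void tile_matmul(const float *a, const float *b, float *c) {
    vreg ra, rb, acc;
    for (size_t l = 0; l < LANES; l++) {
        ra.lane[l] = a[l];
        rb.lane[l] = b[l];
        acc.lane[l] = c[l];
    }
    for (size_t k = 0; k < T; k++) {
        unsigned char ia[LANES], ib[LANES];
        for (size_t l = 0; l < LANES; l++) {
            ia[l] = (unsigned char)((l / T) * T + k);  /* a[i][k], i = l / T */
            ib[l] = (unsigned char)(k * T + l % T);    /* b[k][j], j = l % T */
        }
        vreg pa = permute(ra, ia);
        vreg pb = permute(rb, ib);
        for (size_t l = 0; l < LANES; l++)
            acc.lane[l] += pa.lane[l] * pb.lane[l];    /* one full-width multiply-add */
    }
    for (size_t l = 0; l < LANES; l++)
        c[l] = acc.lane[l];
}
```

In this model each `k` step costs two permutes and one full-width multiply-add, and memory is touched only when the tiles are loaded and the result is stored.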

We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 10% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.7 times as fast on average as that produced by GCC’s state-of-the-art vectorizer, with a maximum speedup of 15 times. We discuss potential extensions to vectorize more general algorithms.

File attachments

Small Tensor Multiplication slides (WPMVP Small Tensor Multiplication.20180224.pdf)	2.6MiB

Christopher Rodrigues

Huawei America Research Lab

Amarin Phaosawasdi

Huawei America Research Lab

Peng Wu

Huawei America Research Lab

United States

Session Program

Sat 24 Feb
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

10:30 - 12:00	WPMVP 2018 Session 2 (WPMVP) at Europa 5

10:30 30m Talk	Small SIMD Matrices for CERN High Throughput Computing (WPMVP). Florian Lemaitre (CERN), Benjamin Couturier (CERN), Lionel Lacassagne (University Paris 6). File attached.
11:00 30m Talk	SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors (WPMVP). Christopher Rodrigues (Huawei America Research Lab), Amarin Phaosawasdi (Huawei America Research Lab), Peng Wu (Huawei America Research Lab). File attached.
11:30 30m Talk	MIPP: a Portable C++ SIMD Wrapper and its use for Error Correction Coding in 5G Standard (WPMVP). Adrien Cassagne (INRIA), Olivier Aumage, Denis Barthou, Camille Leroux (INRIA), Christophe Jégo (IMS Lab - Institut Polytechnique de Bordeaux). File attached.