An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming (PPoPP 2018 - Tutorials)

Who

Rafael Asenjo, Jim Cownie, Aleksei Fedotov

Track

PPoPP 2018 Tutorials

When

Sun 25 Feb 2018 08:30 - 10:00 at Pacific 1 - An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming Session 1
Sun 25 Feb 2018 10:30 - 12:00 at Pacific 1 - An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming Session 2
Sun 25 Feb 2018 13:30 - 15:00 at Pacific 1 - An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming Session 3
Sun 25 Feb 2018 15:30 - 17:00 at Pacific 1 - An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming Session 4

Abstract

Due to energy constraints, high performance computing platforms are becoming increasingly heterogeneous, achieving greater performance per watt through the use of hardware that is tuned to specific computational kernels or application domains. It can be challenging for developers to match computations to accelerators, choose models for targeting those accelerators, and then coordinate the use of those accelerators in the context of their larger applications. This tutorial starts with a survey of heterogeneous architectures and programming models, and discusses how to determine if a computation is suitable for a particular accelerator. Next, Intel® Threading Building Blocks (Intel® TBB), a widely used, portable C++ template library for parallel programming, is introduced. TBB is available as both a commercial product and as a permissively licensed open-source project at http://www.threadingbuildingblocks.org. The library provides generic parallel algorithms, concurrent containers, a work-stealing task scheduler, a data flow programming abstraction, low-level primitives for synchronization, thread local storage and a scalable memory allocator. The generic algorithms in TBB capture many of the common design patterns used in parallel programming. While TBB was first introduced in 2006 as a shared-memory parallel programming library, it has recently been extended to support heterogeneous programming. These new extensions make it easier for developers to coordinate the use of accelerators such as integrated and discrete GPUs, or devices such as FPGAs, within their parallel C++ applications. This tutorial will introduce students to the TBB library and provide a hands-on opportunity to use some of its features for shared-memory programming.
The students will then be given an overview of the new features included in the library for heterogeneous programming and have a hands-on opportunity to convert an example they developed for shared-memory into one that performs hybrid execution on both the CPU and an accelerator. Finally, students will be provided with an overview of the TBB Flow Graph Analyzer tool and shown how it can be used to understand application inefficiencies related to utilization of system resources.

Goals

By the end of the tutorial, attendees will be familiar with the important architectural features of commonly available accelerators and will have a sense of what optimizations and types of parallelism are suitable for these devices. They will also know the TBB library, have experience using its generic algorithms and concurrent containers to create a shared-memory parallel program, understand its features for heterogeneous programming, and know how to build and execute a hybrid application.

Prerequisite Knowledge

Attendees should be comfortable programming in C++ using modern features such as templates and lambda expressions. Attendees should also have an understanding of basic parallel programming concepts such as threads and locks. No previous experience with Intel® Threading Building Blocks is required.

For the hands-on session there will be three alternatives: 1) Use VirtualBox to run Ubuntu Linux on a virtual machine; 2) Use a remote PC with Linux via ssh; 3) Install the compiler and TBB library on your own laptop. If you’d rather go for alternative 3, please follow the setup instructions below to prepare your machine for the hands-on session:

Windows or Linux instructions:

1. Download and install the Intel® C++ Compiler.

a) Other popular compilers (such as MSVC, GCC, or clang) should also suffice. However, this tutorial was tested with the Intel® C++ Compiler, so other compilers might require some troubleshooting or adjustment of command-line options.

b) Note: on Windows, Microsoft* Visual Studio is required.

c) Note: on Linux, the GNU make utility is required.

2. Set up the Threading Building Blocks (TBB) library environment:

a) Download the latest TBB release from https://github.com/01org/tbb/releases

b) Use an archive utility to extract the contents of the downloaded archive. E.g.:
- On Linux: issue the command “tar -xzvf .tgz”
- On Windows: open the archive in Windows Explorer and copy its contents into an empty local folder on your disk.

c) Open a command prompt and change its working directory to the directory where the archive was extracted.
- On Windows, it might also be necessary to set up the Microsoft* Visual C++ compiler first, e.g. by issuing the command “C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat”

d) We will refer to the top-level directory extracted from the archive as “tbbroot”.

e) Set up the TBB environment variables by sourcing the tbbvars script. E.g.:
- On Linux: issue the command “source <tbbroot>/bin/tbbvars.sh intel64 linux auto_tbbroot”
- On Windows: issue the command “<tbbroot>\bin\tbbvars.bat intel64”
  - Note: it might be necessary to specify the version of the Microsoft* Visual C++ runtime to use as the second argument to the “tbbvars.bat” script.

3. Check that TBB works. From the configured command line:

a) On Linux: go to “<tbbroot>/examples/test_all/fibonacci” and issue the command “make”. Make sure “TEST PASSED” is printed on the command prompt.

b) On Windows: go to “<tbbroot>\examples\test_all\fibonacci\msvs” and run “msbuild fibonacci.sln”. Note where the output file is written. It should appear as something like “fibonacci.vcxproj -> ” at the tail of the build command output. Run the resulting executable and make sure “TEST PASSED” is printed on the command prompt.

4. Install the Graphics Display Driver (required for Windows* platforms):

a) Choose an appropriate one from https://downloadcenter.intel.com

5. Download and install the Intel® SDK for OpenCL™ Applications along with the OpenCL™ Drivers from https://software.intel.com/en-us/intel-opencl/download

a) The page https://software.intel.com/en-us/articles/opencl-drivers can help you understand what should be installed on the system to compile and execute OpenCL programs.

6. Check that the OpenCL environment is set up correctly:

a) Download “Platform and Device Capabilities Viewer” from https://software.intel.com/en-us/intel-opencl-support/code-samples
- Extract the archive and change the working directory to the extracted “CapsBasic” directory in the command prompt.
- Build the sample:
  - On Linux: issue “make”
  - On Windows: set up the Microsoft Visual C++ compiler the same way as in section 2.c., then issue the command “msbuild CapsBasic_<year>.sln”, replacing “<year>” with one of the values 2012, 2013, or 2015, depending on the installed Visual Studio* version. Note the location of the output file, as in section 3.b.

b) Run the binary and make sure the sample enumerates the devices on the platform.

7. Follow the instructions on the page https://software.intel.com/en-us/articles/getting-started-with-flow-graph-analyzer to obtain Intel(R) Advisor Flow Graph Analyzer

It will also be possible to use USB sticks with pre-configured virtual machine images, as well as to access remote machines through an SSH connection (instructions will be provided during the tutorial).

Mac instructions:

On Mac, OpenCL is already available if Xcode is installed. The downside on this OS is that Flow Graph Analyzer traces cannot be collected yet.

Install Parallel Studio for Mac using this link: https://software.intel.com/en-us/parallel-studio-xe/choose-download/free-trial-mac
Then do:

source /opt/intel/bin/compilervars.sh intel64

git clone https://github.com/01org/tbb

cd tbb

git checkout tbb_tutorials

cd examples/ppopp18

Outline of the Tutorial (6 hours)

Part 1: Motivation and background (90 minutes)

An introduction to heterogeneous architectures – 45 minutes

Important features of different accelerators such as GPUs and FPGAs
How to measure performance and energy
A survey of heterogeneous programming models
How to determine if a computation is suitable for an accelerator

Success stories: heterogeneous applications on top of TBB – 45 minutes

Overview of hybrid scheduling, goals and challenges
- Description of hybrid pipeline and parallel_for implementations
- An overview of experimental evaluations

Part 2: An Introduction to Threading Building Blocks (90 minutes)

An overview of the Threading Building Blocks Library – 40 minutes
- The philosophy and features of the library
- Generic Algorithms and the TBB task scheduler
  - parallel_for, parallel_reduce, pipeline, task_group, parallel_invoke, etc…
  - support for C++17 parallel STL
  - how the TBB runtime library maps algorithms to tasks
- Concurrent Containers
  - concurrent_hash_map, concurrent_queue, concurrent_vector, etc…
  - the benefits and limitations of concurrent containers
Hands-On – 50 minutes
- “Hello TBB”; verifying that the environment is set up correctly – 10 minutes
- Parallelizing a sample application with generic algorithms – 20 minutes
  - A sample serial application will be provided
  - Students will be led through a parallelization of the application using algorithms such as parallel_for
- Counting strings using a concurrent_hash_map – 20 minutes
  - A sample serial application will be provided
  - Students will be led through the parallelization of the application using parallel_for and concurrent_hash_map

Part 3: An Introduction to Heterogeneous Programming with TBB (90 minutes)

An overview of the flow graph and its features for heterogeneity – 90 minutes
- An overview of the TBB flow graph – 45 minutes
  - Types of graphs
  - Graph operation
- TBB flow graph heterogeneous features – 35 minutes
  - Using async_node to do asynchronous communication
  - Using streaming_node and the OpenCL factory to access accelerators
    - NOTE: OpenCL is only one of the models supported by the streaming_node
- An overview of the Flow Graph Analyzer tool – 10 minutes

Part 4: Hands-On Programming with TBB flow graph (90 minutes)

Converting a previously parallelized sample application to a TBB flow graph to express its high-level parallelism – 40 minutes
- The students will add flow graph nodes to their existing sample code
Building a hybrid application with opencl_node – 40 minutes
- The students will add an opencl_node to their existing sample to create a hybrid application that uses both the CPU and an accelerator
  - Students will explore using tokens to dynamically balance load across devices
Using the Flow Graph Analyzer tool to understand inefficiencies in a TBB-based application – 10 minutes

About the presenters

Rafael Asenjo obtained a PhD in Telecommunication Engineering from the U. of Malaga, Spain in 1997. From 2001 to 2017, he was an Associate Professor in the Computer Architecture Department, and he has been a Full Professor since 2017. He collaborated on the IBM XL-UPC compiler and on Cray’s Chapel runtime. In the last five years, he has focused on productively exploiting heterogeneous chips. In 2013 and 2014 he visited UIUC to work on CPU+GPU chips. In 2015 and 2016 he also started to work on CPU+FPGA chips while visiting U. of Bristol. He has served as General Chair for ACM PPoPP’16 and as an Organization Committee member as well as a Program Committee member for several HPC-related conferences (PACT’17, EuroPar’17, SC’15). His research interests include heterogeneous programming models and architectures, parallelization of irregular codes and energy consumption. He co-presented two TBB tutorials at both Europar’17 and SC17.

Jim Cownie is an ACM Distinguished Engineer and Intel Principal Engineer. He has been involved with parallel computing since starting to work for Inmos in 1979. Along the way he owned the profiling chapter in the MPI-1 standard and has worked on parallel debuggers and OpenMP implementations. He gave a TBB tutorial at PACT in 2007 and has helped with the presentation of TBB tutorials at both Europar 2017 and SC17.

Aleksei Fedotov is a software engineer at Intel. He worked for a few years on various TBB features such as parallel algorithms, containers, and C++11 support. He now leads the architecture and development of the Flow Graph API, including support for heterogeneity. His interests include parallel computer architectures, parallel programming, runtime development, optimization and machine learning.

Links to information about Threading Building Blocks

http://www.threadingbuildingblocks.org
The Special Issue of Parallel Universe Magazine, “Intel® Threading Building Blocks Celebrates 10 Years!” https://goparallel.sourceforge.net/wp-content/uploads/2016/06/ParallelUniverseMagazine_Special_Edition_v2.compressed.pdf
Vasanth Tovinkere and Michael Voss, “Flow Graph Designer: A Tool for Designing and Analyzing Intel® Threading Building Blocks Flow Graphs”, 2014 43rd International Conference on Parallel Processing Workshops (ICPPW), pp. 149-158, 2014 http://doi.org/10.1109/ICPPW.2014.31
