PPoPP 2018
Sat 24 - Wed 28 February 2018 Vösendorf / Wien, Austria

Due to energy constraints, high performance computing platforms are becoming increasingly heterogeneous, achieving greater performance per watt through the use of hardware that is tuned to specific computational kernels or application domains. It can be challenging for developers to match computations to accelerators, choose models for targeting those accelerators, and then coordinate the use of those accelerators in the context of their larger applications. This tutorial starts with a survey of heterogeneous architectures and programming models, and discusses how to determine if a computation is suitable for a particular accelerator.

Next, Intel® Threading Building Blocks (Intel® TBB), a widely used, portable C++ template library for parallel programming, is introduced. TBB is available as both a commercial product and a permissively licensed open-source project at http://www.threadingbuildingblocks.org. The library provides generic parallel algorithms, concurrent containers, a work-stealing task scheduler, a data flow programming abstraction, low-level primitives for synchronization, thread-local storage, and a scalable memory allocator. The generic algorithms in TBB capture many of the common design patterns used in parallel programming. While TBB was first introduced in 2006 as a shared-memory parallel programming library, it has recently been extended to support heterogeneous programming. These new extensions make it easier for developers to coordinate the use of accelerators, such as integrated and discrete GPUs or FPGAs, within their parallel C++ applications.

This tutorial will introduce students to the TBB library and provide a hands-on opportunity to use some of its features for shared-memory programming. The students will then be given an overview of the new features included in the library for heterogeneous programming and have a hands-on opportunity to convert an example they developed for shared memory into one that performs hybrid execution on both the CPU and an accelerator. Finally, students will be provided with an overview of the TBB Flow Graph Analyzer tool and shown how it can be used to understand application inefficiencies related to the utilization of system resources.

Goals

By the end of the tutorial, attendees will be familiar with the important architectural features of commonly available accelerators and will have a sense of which optimizations and types of parallelism are suitable for these devices. They will also know the TBB library, have experience using its generic algorithms and concurrent containers to create a shared-memory parallel program, understand its features for heterogeneous programming, and know how to build and execute a hybrid application.

Prerequisite Knowledge

Attendees should be comfortable programming in C++ using modern features such as templates and lambda expressions. Attendees should also have an understanding of basic parallel programming concepts such as threads and locks. No previous experience with Intel® Threading Building Blocks is required.
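As a rough self-check of the prerequisites above, the following sketch combines a function template, lambda expressions, threads, and a lock. It uses only the C++ standard library and no TBB; the names `run_in_threads` and `locked_sum` are hypothetical helpers invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Applies body(i) for every i in [0, n), spreading the indices across
// num_threads standard threads. Hypothetical helper, not a TBB API.
template <typename Body>
void run_in_threads(std::size_t n, unsigned num_threads, Body body) {
    assert(num_threads > 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        // Thread t handles indices t, t + num_threads, t + 2 * num_threads, ...
        workers.emplace_back([=] {
            for (std::size_t i = t; i < n; i += num_threads) body(i);
        });
    }
    for (auto& w : workers) w.join();
}

// Sums 0 + 1 + ... + (n - 1) using a mutex-protected shared accumulator.
inline long long locked_sum(std::size_t n, unsigned num_threads) {
    long long total = 0;
    std::mutex m;
    run_in_threads(n, num_threads, [&](std::size_t i) {
        std::lock_guard<std::mutex> guard(m);  // held only for the update
        total += static_cast<long long>(i);
    });
    return total;
}
```

Attendees who can explain why the lock is needed here, and why taking it on every iteration limits scalability, should be well prepared for the hands-on parts.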

For the hands-on session there will be three alternatives: 1) Use VirtualBox to run Ubuntu Linux on a virtual machine; 2) Use a remote Linux PC via ssh; 3) Install the compiler and TBB library on your own laptop. If you prefer alternative 3, please follow the setup instructions below to prepare your machine for the hands-on session:

Windows or Linux instructions:

  1. Download and install Intel® C++ Compiler.

    a) Other popular compilers (such as MSVC, GCC, clang) should work too. However, this tutorial was tested using the Intel® C++ Compiler, so other compilers might require some troubleshooting or adjustment of command-line options.

    b) Note: on Windows, Microsoft* Visual Studio is required.

    c) Note: on Linux, the GNU make utility is required.

  2. Set up the Threading Building Blocks (TBB) library environment:

    a) Download the latest TBB release from https://github.com/01org/tbb/releases

    b) Use an archive utility to extract the contents of the downloaded archive. E.g.:

    • On Linux: issue the command “tar -xzvf .tgz”

    • On Windows: open the archive in Windows Explorer and copy its contents into an empty local folder on your disk.

    c) Open a command prompt and change its working directory to the directory where the archive was extracted.

    • On Windows, you might also need to set up the Microsoft* Visual C++ compiler first, e.g. by issuing the command “C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat”

    d) We will refer to the top-level directory extracted from the archive as “tbbroot”.

    e) Set up the TBB environment variables by sourcing the tbbvars script. E.g.:

    • On Linux: issue the command “source <tbbroot>/bin/tbbvars.sh intel64 linux auto_tbbroot”
    • On Windows: issue the command “<tbbroot>\bin\tbbvars.bat intel64”
      • Note: it might be necessary to specify the version of the Microsoft* Visual C++ runtime to use as the second argument to the “tbbvars.bat” script.
  3. Check that TBB works. From the configured command line:

    a) On Linux: go to “<tbbroot>/examples/test_all/fibonacci” and issue the command “make”. Make sure “TEST PASSED” is printed on the command prompt.

    b) On Windows: go to “<tbbroot>\examples\test_all\fibonacci\msvs” and run “msbuild fibonacci.sln”. Note where the output file is written; it should appear as something like “fibonacci.vcxproj -> ” at the tail of the build command output. Run the resulting executable to make sure “TEST PASSED” is printed on the command prompt.

  4. Install Graphics Display Driver (required for Windows* platforms):

    a) Choose an appropriate one from https://downloadcenter.intel.com

  5. Download and install Intel® SDK for OpenCL™ Applications along with OpenCL™ Drivers from https://software.intel.com/en-us/intel-opencl/download

    a) The page https://software.intel.com/en-us/articles/opencl-drivers can be useful to understand what should be installed on the system to be able to compile and execute OpenCL programs.

  6. Check that the OpenCL environment is set up correctly:

    a) Download “Platform and Device Capabilities Viewer” from https://software.intel.com/en-us/intel-opencl-support/code-samples

    • Extract the archive and, in the command prompt, change the working directory to the extracted “CapsBasic” directory.

    • Build the sample:

      • On Linux: issue “make”

      • On Windows: set up the Microsoft Visual C++ compiler as described in step 2.c, then issue the command “msbuild CapsBasic_<year>.sln”, replacing “<year>” with 2012, 2013, or 2015, depending on the installed version of Visual Studio*. Note where the output file is written, as in step 3.b.

    b) Run the binary and make sure the sample enumerates the devices on the platform.

  7. Follow the instructions on the page https://software.intel.com/en-us/articles/getting-started-with-flow-graph-analyzer to obtain Intel® Advisor Flow Graph Analyzer.

It will also be possible to use USB sticks with pre-configured virtual machine images, or to access remote machines through an SSH connection (instructions will be provided during the tutorial).

Mac instructions:

On Mac, OpenCL is already available if Xcode is installed. The downside on this OS is that Flow Graph Analyzer traces cannot be collected yet.

  1. Install Parallel Studio for Mac using this link: https://software.intel.com/en-us/parallel-studio-xe/choose-download/free-trial-mac

  2. Then do:

source /opt/intel/bin/compilervars.sh intel64

git clone https://github.com/01org/tbb

cd tbb

git checkout tbb_tutorials

cd examples/ppopp18

Outline of the Tutorial (6 hours)

Part 1: Motivation and background (90 minutes)

An introduction to heterogeneous architectures – 45 minutes

  • Important features of different accelerators such as GPUs and FPGAs
  • How to measure performance and energy
  • A survey of heterogeneous programming models
  • How to determine if a computation is suitable for an accelerator

Success stories: heterogeneous applications on top of TBB – 45 minutes

  • Overview of hybrid scheduling, goals and challenges
    • Description of hybrid pipeline and parallel_for implementations
    • An overview of experimental evaluations

Part 2: An Introduction to Threading Building Blocks (90 minutes)

  • An overview of the Threading Building Blocks Library – 40 minutes
    • The philosophy and features of the library
    • Generic Algorithms and the TBB task scheduler
      • parallel_for, parallel_reduce, pipeline, task_group, parallel_invoke, etc…
      • support for C++17 parallel STL
      • how the TBB runtime library maps algorithms to tasks
    • Concurrent Containers
      • concurrent_hash_map, concurrent_queue, concurrent_vector, etc…
      • the benefits and limitations of concurrent containers
  • Hands-On – 50 minutes
    • “Hello TBB”; verifying that the environment is set up correctly – 10 minutes
    • Parallelizing a sample application with generic algorithms – 20 minutes
      • A sample serial application will be provided
      • Students will be led through a parallelization of the application using algorithms such as parallel_for
    • Counting strings using a concurrent_hash_map – 20 minutes
      • A sample serial application will be provided
      • Students will be led through the parallelization of the application using parallel_for and concurrent_hash_map
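The string-counting exercise in Part 2 can be pictured with the sketch below. It is a stand-in written only with the C++ standard library so that it builds without TBB: the hypothetical `ShardedCounter` plays the role that concurrent_hash_map plays in the tutorial, and the fixed chunking in the hypothetical `count_strings` stands in for parallel_for, which would use TBB's work-stealing scheduler rather than one thread per chunk.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a concurrent hash map: keys are spread over a
// fixed number of shards, each with its own mutex, so threads that touch
// different shards do not contend with each other.
class ShardedCounter {
public:
    void increment(const std::string& key) {
        Shard& s = shards_[std::hash<std::string>{}(key) % kShards];
        std::lock_guard<std::mutex> guard(s.lock);
        ++s.counts[key];
    }

    // Not safe to call while other threads are still incrementing;
    // call it after all workers have been joined.
    std::map<std::string, int> snapshot() const {
        std::map<std::string, int> merged;
        for (const Shard& s : shards_)
            merged.insert(s.counts.begin(), s.counts.end());
        return merged;
    }

private:
    static constexpr std::size_t kShards = 16;
    struct Shard {
        std::mutex lock;
        std::unordered_map<std::string, int> counts;
    };
    std::array<Shard, kShards> shards_;
};

// Counts occurrences of each string, giving each thread one contiguous chunk.
inline std::map<std::string, int> count_strings(
        const std::vector<std::string>& words, unsigned num_threads) {
    assert(num_threads > 0);
    ShardedCounter counter;
    std::vector<std::thread> workers;
    const std::size_t chunk = (words.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(words.size(), t * chunk);
        const std::size_t end = std::min(words.size(), begin + chunk);
        workers.emplace_back([&words, &counter, begin, end] {
            for (std::size_t i = begin; i < end; ++i) counter.increment(words[i]);
        });
    }
    for (auto& w : workers) w.join();
    return counter.snapshot();
}
```

The design point carried over from the exercise is that the container, not the caller, is responsible for making concurrent updates safe, so the loop body stays as simple as in the serial version.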

Part 3: An Introduction to Heterogeneous Programming with TBB (90 minutes)

  • An overview of the flow graph and its features for heterogeneity – 90 minutes
    • An overview of the TBB flow graph – 45 minutes
      • Types of graphs
      • Graph operation
    • TBB flow graph heterogeneous features – 35 minutes
      • Using async_node to do asynchronous communication
      • Using streaming_node and the OpenCL factory to access accelerators
        • NOTE: OpenCL is only one of the models supported by the streaming_node
    • An overview of the Flow Graph Analyzer tool – 10 minutes
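To give a feel for the data flow style covered in Part 3, here is a deliberately simplified, serial model of nodes connected by edges. It is not the TBB flow graph API: `FunctionNode`, `add_successor`, and `sum_of_squares` are hypothetical names for this sketch, messages are processed inline on the calling thread, and none of the task scheduling, buffering, or concurrency limits of the real library are modeled.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, serial model of a data flow node: it applies `body` to each
// incoming message and forwards the result to every successor.
template <typename In, typename Out>
class FunctionNode {
public:
    explicit FunctionNode(std::function<Out(const In&)> body)
        : body_(std::move(body)) {}

    // Registers a receiver that is called with each output message.
    void add_successor(std::function<void(const Out&)> receiver) {
        successors_.push_back(std::move(receiver));
    }

    // Delivers one message to this node and propagates the result downstream.
    void try_put(const In& message) {
        const Out result = body_(message);
        for (auto& s : successors_) s(result);
    }

private:
    std::function<Out(const In&)> body_;
    std::vector<std::function<void(const Out&)>> successors_;
};

// Connects the output of `from` to the input of `to`. The edge stores a
// reference to `to`, so `to` must stay alive while messages are being sent.
template <typename A, typename B, typename C>
void make_edge(FunctionNode<A, B>& from, FunctionNode<B, C>& to) {
    from.add_successor([&to](const B& m) { to.try_put(m); });
}

// Builds the two-stage graph  square -> accumulate  and feeds it 1..n.
inline int sum_of_squares(int n) {
    assert(n >= 0);
    int total = 0;
    FunctionNode<int, int> square([](const int& x) { return x * x; });
    FunctionNode<int, int> accumulate([&total](const int& x) {
        total += x;
        return total;
    });
    make_edge(square, accumulate);
    for (int i = 1; i <= n; ++i) square.try_put(i);
    return total;
}
```

The point of the model is the separation of concerns: the computation is described as nodes and edges first, and only then are messages pushed in, which is what lets a runtime decide where and when each node body runs.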

Part 4: Hands-On Programming with TBB flow graph (90 minutes)

  • Converting a previously parallelized sample application to a TBB flow graph to express its high-level parallelism – 40 minutes
    • The students will add flow graph nodes to their existing sample code
  • Building a hybrid application with opencl_node – 40 minutes
    • The students will add an opencl_node to their existing sample to create a hybrid application that uses both the CPU and an accelerator
      • Students will explore using tokens to dynamically balance load across devices
  • Using the Flow Graph Analyzer tool to understand inefficiencies in a TBB-based application – 10 minutes
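The idea of dynamically balancing load across devices can be sketched without any accelerator at all. In the simplified model below, two standard threads stand in for a CPU and an accelerator; each one claims the next chunk of work whenever it becomes free, so the faster worker ends up with more chunks. This is plain dynamic self-scheduling, assumed here as an approximation of the token scheme used in the tutorial; `hybrid_sum`, `HybridResult`, and the chunk-size parameters are hypothetical, and no OpenCL device is involved.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct HybridResult {
    long long sum = 0;            // combined result from both workers
    std::size_t cpu_items = 0;    // items handled by the "CPU" worker
    std::size_t accel_items = 0;  // items handled by the simulated accelerator
};

// Sums `data` with two workers that repeatedly claim the next unprocessed
// chunk from a shared atomic cursor. Each worker has its own chunk size,
// mimicking devices that prefer different granularities.
inline HybridResult hybrid_sum(const std::vector<int>& data,
                               std::size_t cpu_chunk,
                               std::size_t accel_chunk) {
    assert(cpu_chunk > 0 && accel_chunk > 0);
    std::atomic<std::size_t> next{0};
    long long partial[2] = {0, 0};   // one slot per worker, no sharing
    std::size_t items[2] = {0, 0};

    auto worker = [&](int id, std::size_t chunk) {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk);  // claim a chunk
            if (begin >= data.size()) break;                  // nothing left
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) partial[id] += data[i];
            items[id] += end - begin;
        }
    };

    std::thread cpu(worker, 0, cpu_chunk);
    std::thread accel(worker, 1, accel_chunk);  // simulated device, same code
    cpu.join();
    accel.join();

    HybridResult r;
    r.sum = partial[0] + partial[1];
    r.cpu_items = items[0];
    r.accel_items = items[1];
    return r;
}
```

How the items split between the two workers varies from run to run; only the total is deterministic, which is the property a hybrid application has to preserve.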

About the presenters

Rafael Asenjo obtained a PhD in Telecommunication Engineering from the U. of Malaga, Spain in 1997. From 2001 to 2017, he was an Associate Professor in the Computer Architecture Department, and he has been a Full Professor since 2017. He collaborated on the IBM XL-UPC compiler and on Cray’s Chapel runtime. In the last five years, he has focused on productively exploiting heterogeneous chips. In 2013 and 2014 he visited UIUC to work on CPU+GPU chips. In 2015 and 2016 he also started to work on CPU+FPGA chips while visiting U. of Bristol. He has served as General Chair for ACM PPoPP’16 and as an Organization Committee member as well as a Program Committee member for several HPC related conferences (PACT’17, EuroPar’17, SC’15). His research interests include heterogeneous programming models and architectures, parallelization of irregular codes and energy consumption. He co-presented two TBB tutorials at both Europar’17 and SC17.

Jim Cownie is an ACM Distinguished Engineer and Intel Principal Engineer. He has been involved with parallel computing since starting to work for Inmos in 1979. Along the way he owned the profiling chapter in the MPI-1 standard and has worked on parallel debuggers and OpenMP implementations. He gave a TBB tutorial at PACT in 2007 and has helped with the presentation of TBB tutorials at both Europar 2017 and SC17.

Aleksei Fedotov is a software engineer at Intel. He worked for a few years on various TBB features such as parallel algorithms, containers, and C++11 support. He now leads the architecture and development of the Flow Graph API, including support for heterogeneity. His interests include parallel computer architectures, parallel programming, runtime development, optimization and machine learning.


Sun 25 Feb

Tutorials: An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming, at Pacific 1

Presenters: Rafael Asenjo (Universidad de Málaga), Jim Cownie (Intel), Aleksei Fedotov (Intel)

  • Session 1: 08:30 - 10:00
  • Session 2: 10:30 - 12:00
  • Session 3: 13:30 - 15:00
  • Session 4: 15:30 - 17:00