PPoPP 2018
Sat 24 - Wed 28 February 2018 Vösendorf / Wien, Austria

Due to energy constraints, high performance computing platforms are becoming increasingly heterogeneous, achieving greater performance per watt through the use of hardware that is tuned to specific computational kernels or application domains. It can be challenging for developers to match computations to accelerators, choose models for targeting those accelerators, and then coordinate the use of those accelerators in the context of their larger applications. This tutorial starts with a survey of heterogeneous architectures and programming models, and discusses how to determine if a computation is suitable for a particular accelerator.

Next, Intel® Threading Building Blocks (Intel® TBB), a widely used, portable C++ template library for parallel programming, is introduced. TBB is available as both a commercial product and a permissively licensed open-source project at http://www.threadingbuildingblocks.org. The library provides generic parallel algorithms, concurrent containers, a work-stealing task scheduler, a data flow programming abstraction, low-level primitives for synchronization, thread-local storage, and a scalable memory allocator. The generic algorithms in TBB capture many of the common design patterns used in parallel programming. While TBB was first introduced in 2006 as a shared-memory parallel programming library, it has recently been extended to support heterogeneous programming. These new extensions make it easier for developers to integrate accelerators, such as integrated and discrete GPUs or devices such as FPGAs, into their parallel C++ applications and to coordinate their use.

This tutorial will introduce students to the TBB library and provide a hands-on opportunity to use some of its features for shared-memory programming. The students will then be given an overview of the new features included in the library for heterogeneous programming and have a hands-on opportunity to convert an example they developed for shared-memory programming into one that performs hybrid execution on both the CPU and an accelerator. Finally, students will be provided with an overview of the TBB Flow Graph Analyzer tool and shown how it can be used to understand application inefficiencies related to the utilization of system resources.


By the end of the tutorial, attendees will be familiar with the important architectural features of commonly available accelerators and will have a sense of what optimizations and types of parallelism are suitable for these devices. They will also know the TBB library, have experience using its generic algorithms and concurrent containers to create a shared-memory parallel program, understand its features for heterogeneous programming, and know how to build and execute a hybrid application.

Prerequisite Knowledge

Attendees should be comfortable programming in C++ using modern features such as templates and lambda expressions. Attendees should also have an understanding of basic parallel programming concepts such as threads and locks. No previous experience with Intel® Threading Building Blocks is required.

Outline of the Tutorial (6 hours)

Part 1: Motivation and background (100 minutes)

An introduction to heterogeneous architectures – 45 minutes

  • Important features of different accelerators such as GPUs and FPGAs
  • How to measure performance and energy
  • A survey of heterogeneous programming models
  • How to determine if a computation is suitable for an accelerator

Success stories: heterogeneous applications on top of TBB – 45 minutes

  • Overview of hybrid scheduling, goals and challenges
    • Description of hybrid pipeline and parallel_for implementations
    • An overview of experimental evaluations

********************* 10 minute break ************************

Part 2: An Introduction to Threading Building Blocks (100 minutes)

An overview of the Threading Building Blocks Library – 40 minutes

  • The philosophy and features of the library
    • Generic Algorithms and the TBB task scheduler
      • parallel_for, parallel_reduce, pipeline, task_group, parallel_invoke, etc…
      • support for C++17 parallel STL
      • how the TBB runtime library maps algorithms to tasks
    • Concurrent Containers
      • concurrent_hash_map, concurrent_queue, concurrent_vector, etc…
      • the benefits and limitations of concurrent containers

Hands-On – 60 minutes

  • “Hello TBB”: verifying that the environment is set up correctly – 15 minutes
  • Parallelizing a sample application with generic algorithms – 20 minutes
    • A sample serial application will be provided
    • Students will be led through a parallelization of the application using algorithms such as parallel_for
  • Counting strings using a concurrent_hash_map – 25 minutes
    • A sample serial application will be provided
    • Students will be led through the parallelization of the application using parallel_for and concurrent_hash_map

********************* 10 minute break ************************

Part 3: An Introduction to Heterogeneous Programming with TBB (160 minutes)

An overview of the flow graph and its features for heterogeneity – 70 minutes

  • An overview of the TBB flow graph
    • Using async_node to do asynchronous communication
    • Using streaming_node and the OpenCL factory to access accelerators
      • NOTE: OpenCL is only one of the models supported by the streaming_node
    • An overview of the Flow Graph Analyzer tool

Hands-On – 90 minutes

  • Converting a previously parallelized sample application to a TBB flow graph to express its high-level parallelism – 30 minutes
    • The students will add flow graph nodes to their existing sample code
  • Building a hybrid application with opencl_node – 30 minutes
    • The students will add an opencl_node to their existing sample to create a hybrid application that uses both the CPU and an accelerator
    • Students will explore using tokens to dynamically balance load across devices
  • Using the Flow Graph Analyzer tool to understand inefficiencies in a TBB-based application

About the presenters

Rafael Asenjo obtained a PhD in Telecommunication Engineering from the U. of Malaga, Spain in 1997. From 2001 to 2017, he was an Associate Professor in the Computer Architecture Department, where he has been a Full Professor since 2017. He collaborated on the IBM XL-UPC compiler and on Cray’s Chapel runtime. In the last five years, he has focused on productively exploiting heterogeneous chips. In 2013 and 2014 he visited UIUC to work on CPU+GPU chips. In 2015 and 2016 he also started to work on CPU+FPGA chips while visiting the U. of Bristol. He has served as General Chair for ACM PPoPP’16 and as an Organization Committee member as well as a Program Committee member for several HPC-related conferences (PACT’17, EuroPar’17, SC’15). His research interests include heterogeneous programming models and architectures, parallelization of irregular codes and energy consumption. He co-presented TBB tutorials at both EuroPar’17 and SC17.

Jim Cownie is an ACM Distinguished Engineer and Intel Principal Engineer. He has been involved with parallel computing since starting to work for Inmos in 1979. Along the way he owned the profiling chapter in the MPI-1 standard and has worked on parallel debuggers and OpenMP implementations. He gave a TBB tutorial at PACT in 2007 and has helped with the presentation of TBB tutorials at both Europar 2017 and SC17.

Alexei Katranov is a software engineer at Intel. He has almost 10 years of professional experience in parallel programming and C++. Alexei is involved in multiple activities related to parallelism and also owns Intel TBB task scheduler development. He gave talks about heterogeneous computations in TBB at IWOCL in 2016 and SES in 2017.

Aleksei Fedotov is a software engineer at Intel. He worked for a few years on various TBB features such as parallel algorithms, containers, and C++11 support. Now he leads the architecture and development of the Flow Graph API, including support for heterogeneity. His interests include parallel computer architectures, parallel programming, runtime development, optimization and machine learning.

Links to information about Threading Building Blocks

Tutorial Advertisement (PPoPP_2018-TBB_Tutorial_Advertising.pdf), 352 KiB

Sun 25 Feb

08:30 - 10:00: Tutorials - An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming Session 1 at Room C
10:30 - 12:00: Tutorials - An Introduction to Intel® Threading Building Blocks (Intel® TBB) and its Support for Heterogeneous Programming Session 2 at Room C