The current wave of advances in Deep Learning (DL) has led to many exciting challenges and opportunities for Computer Science and Artificial Intelligence researchers alike. Modern DL frameworks such as Caffe/Caffe2, TensorFlow, CNTK, Torch, and several others have emerged that offer ease of use and flexibility to describe, train, and deploy various types of Deep Neural Networks (DNNs), including deep convolutional nets. In this tutorial, we will provide an overview of interesting trends in DL and how cutting-edge hardware architectures are playing a key role in moving the field forward. We will also present an overview of DL frameworks from an architectural as well as a performance standpoint. Most DL frameworks have utilized a single GPU to accelerate DNN training and inference. However, approaches to parallelizing training are also being actively explored, and the DL community has adopted MPI-based parallel/distributed training as well. Thus, we will highlight new challenges for MPI runtimes to efficiently support DNN training and describe how we have designed efficient communication primitives in MVAPICH2 to support scalable DNN training. Finally, we will discuss how co-design of the OSU-Caffe framework and MVAPICH2 runtime enables scale-out of DNN training to 160 GPUs.
Recent advancements in Artificial Intelligence (AI) have been fueled by the resurgence of Deep Neural Networks (DNNs) and various Deep Learning (DL) frameworks like Caffe, Facebook Caffe2, Google TensorFlow, and Microsoft Cognitive Toolkit (CNTK). DNNs have found widespread applications in classical areas like Image Recognition, Speech Processing, and Textual Analysis, as well as areas like Cancer Detection, Medical Imaging, and even Autonomous Vehicle systems. The momentum that DL has gained recently can be attributed to two driving elements: first, the public availability of various data sets like ImageNet, CIFAR, etc., and second, the widespread adoption of data-parallel hardware like GPUs and accelerators to perform DNN training. The raw number-crunching capabilities of GPUs have significantly improved DNN training. Today, the community is designing better, bigger, and deeper networks to improve accuracy through models like AlexNet, GoogLeNet, Inception v3, and VGG. The models differ in architecture (number and type of layers) but share the common requirement of faster computation and communication capabilities in the underlying systems. Based on these trends, this tutorial is proposed with the following objectives:
- Help newcomers to the field of distributed Deep Learning (DL) on modern high-performance computing clusters to understand various design choices and implementations of several popular DL frameworks.
- Guide Message Passing Interface (MPI) application researchers, designers, and developers to achieve optimal training performance with distributed DL frameworks like OSU-Caffe, CNTK, and ChainerMN on modern HPC clusters with high-performance interconnects (e.g., InfiniBand), NVIDIA GPUs, and multi-/many-core processors.
- Demonstrate the impact of advanced optimizations and tuning of CUDA-Aware MPI libraries like MVAPICH2 on DNN training performance through case studies with representative benchmarks and applications.
This tutorial is targeted at various categories of people working in the areas of Deep Learning and MPI-based distributed DNN training on modern HPC clusters with high-performance interconnects. The specific audiences this tutorial is aimed at include:
- Scientists, engineers, researchers, and students engaged in designing next-generation Deep Learning frameworks and applications over high-performance interconnects and GPUs
- Designers and developers of Caffe, TensorFlow, and other DL frameworks who are interested in scaling out DNN training to multiple nodes of a cluster
- Newcomers to the field of Deep Learning on modern high-performance computing clusters who are interested in familiarizing themselves with Caffe, CNTK, OSU-Caffe, and other MPI-based DL frameworks
- Managers and administrators responsible for setting up next-generation Deep Learning execution environments and modern high-performance clusters/facilities in their organizations/laboratories
There is no fixed prerequisite. Attendees with a general knowledge of HPC and networking will be able to understand and appreciate the material. The tutorial is designed so that attendees are exposed to the topics in a smooth and progressive manner. The content level will be as follows: 60% beginner, 30% intermediate, and 10% advanced.