Preface

David B. Kirk and Wen-mei W. Hwu

We are proud to introduce to you the third edition of Programming Massively Parallel Processors: A Hands-on Approach.

Mass-market computing systems that combine multi-core CPUs and many-thread GPUs have brought terascale computing to laptops and petascale computing to clusters. Armed with such computing power, we are at the dawn of the pervasive use of computational experiments in science, engineering, health, and business disciplines. Many will be able to achieve breakthroughs in their disciplines using computational experiments of unprecedented scale, accuracy, controllability, and observability. This book provides a critical ingredient for that vision: teaching parallel programming to millions of graduate and undergraduate students so that computational thinking and parallel programming skills will be as pervasive as calculus.

Since the second edition came out in 2012, we have received numerous comments from our readers and instructors. Many told us about the existing features they value. Others gave us ideas about how we should expand its contents to make the book even more valuable. Furthermore, the hardware and software technology for heterogeneous parallel computing has advanced tremendously since then. In the hardware arena, two more generations of GPU computing architectures, Maxwell and Pascal, have been introduced since the second edition. In the software domain, CUDA 6.0 through CUDA 8.0 have allowed programmers to access the new hardware features of Maxwell and Pascal. New algorithms have also been developed. Accordingly, we added five new chapters and completely rewrote more than half of the existing chapters.

Broadly speaking, we aim for three major improvements in the third edition while preserving the most valued features of the first two editions. The improvements are (1) adding new Chapter 9, Parallel patterns—parallel histogram computation; Chapter 11, Parallel patterns: merge sort; and Chapter 12, Parallel patterns: graph search, which introduce frequently used parallel algorithm patterns; (2) adding new Chapter 16, Application case study—machine learning, a case study on deep learning; and (3) adding a new chapter that clarifies the evolution of the advanced features of CUDA. These additions are designed to further enrich the learning experience of our readers.

As we made these improvements, we preserved the features of the previous editions that contributed to the book’s popularity. First, we’ve kept the book as concise as possible. While it is tempting to keep adding material, we wanted to minimize the number of pages a reader needs to go through in order to learn all the key concepts. We accomplished this by moving some of the second edition chapters into appendices. Second, we have kept our explanations as intuitive as possible. While it is tempting to formalize some of the concepts, especially when we cover basic parallel algorithms, we have strived to keep all our explanations intuitive and practical.

Target Audience

The target audience of this book is the many graduate and undergraduate students in all science and engineering disciplines where computational thinking and parallel programming skills are needed to achieve breakthroughs. We assume that the reader has at least some basic C programming experience. We especially target computational scientists in fields such as computational finance, data analytics, cognitive computing, mechanical engineering, civil engineering, electrical engineering, bio-engineering, physics, chemistry, astronomy, and geography, all of whom use computation to further their field of research. As such, these scientists are both experts in their domains and programmers. The book takes the approach of teaching parallel programming by building up an intuitive understanding of the techniques. We use CUDA C, a parallel programming environment that is supported on NVIDIA GPUs. There are nearly 1 billion of these processors in the hands of consumers and professionals, and more than 400,000 programmers actively using CUDA. The applications that you develop as part of the learning experience will be used and run by a very large user community.

How to Use the Book

We would like to offer some of our experience in teaching courses with this book. Since 2006, we have taught multiple types of courses: in a one-semester format and in a one-week intensive format. The original ECE498AL course has become a permanent course known as ECE408 or CS483 at the University of Illinois at Urbana-Champaign. We started to write up some early chapters of this book when we offered ECE498AL the second time. The first four chapters were also tested in an MIT class taught by Nicolas Pinto in spring 2009. Since then, we have used the book for numerous offerings of ECE408 as well as the Coursera Heterogeneous Parallel Programming course and the VSCSE and PUMPS summer schools.

A Three-Phased Approach

In ECE408, the lectures and programming assignments are balanced with each other and organized into three phases:

Phase 1: One lecture based on Chapter 2, Data parallel computing, is dedicated to teaching the basic CUDA memory/threading model, the CUDA extensions to the C language, and the basic programming/debugging tools. After the lecture, students can write a simple vector addition code in a couple of hours. This is followed by a series of four to six lectures that give students the conceptual understanding of the CUDA memory model, the CUDA thread execution model, GPU hardware performance features, and modern computer system architecture. These lectures are based on Chapter 3, Scalable parallel execution; Chapter 4, Memory and data locality; and Chapter 5, Performance considerations. The performance of the students' matrix multiplication code increases by about 10 times through this period.

Phase 2: A series of lectures cover floating-point considerations in parallel computing and the common data-parallel programming patterns needed to develop a high-performance parallel application. These lectures are based on Chapter 7, Parallel patterns: convolution; Chapter 8, Parallel patterns: prefix sum; Chapter 9, Parallel patterns—parallel histogram computation; Chapter 10, Parallel patterns: sparse matrix computation; Chapter 11, Parallel patterns: merge sort; and Chapter 12, Parallel patterns: graph search. The students complete assignments on convolution, vector reduction, prefix sum, histogram, sparse matrix-vector multiplication, merge sort, and graph search through this period. We typically leave two or three of the more advanced patterns for a graduate-level course.

Phase 3: Once the students have established solid CUDA programming skills, the remaining lectures cover application case studies, computational thinking, a broader range of parallel execution models, and parallel programming principles. These lectures are based on Chapter 13, CUDA dynamic parallelism; Chapter 14, Application case study—non-Cartesian magnetic resonance imaging; Chapter 15, Application case study—molecular visualization and analysis; Chapter 16, Application case study—machine learning; Chapter 17, Parallel programming and computational thinking; Chapter 18, Programming a heterogeneous computing cluster; Chapter 19, Parallel programming with OpenACC; and Chapter 20, More on CUDA and graphics processing unit computing. (The voice and video recordings of these lectures are available as part of the Illinois–NVIDIA GPU Teaching Kit.)

Tying It All Together: The Final Project

While the lectures, labs, and chapters of this book help lay the intellectual foundation for the students, what brings the learning experience together is the final project. It is so important to the full-semester course that it is prominently positioned and commands nearly two months of focus. It incorporates five innovative aspects: mentoring, workshop, clinic, final report, and symposium. (While much of the information about the final project is available in the Illinois–NVIDIA GPU Teaching Kit, we would like to offer the thinking behind the design of these aspects.)

Students are encouraged to base their final projects on problems that represent current challenges in the research community. To seed the process, the instructors should recruit several computational science research groups to propose problems and serve as mentors. The mentors are asked to contribute a one-to-two-page project specification sheet that briefly describes the significance of the application; what the mentor would like to accomplish with the student teams on the application; the technical skills (particular math, physics, and chemistry courses) required to understand and work on the application; and a list of web and traditional resources that students can draw upon for technical background, general information, and building blocks, along with specific URLs or FTP paths to particular implementations and coding examples. These project specification sheets also give students experience in defining their own research projects later in their careers. (Several examples are available in the Illinois–NVIDIA GPU Teaching Kit.)

Students are also encouraged to contact their potential mentors during their project selection process. Once the students and the mentors agree on a project, they enter into a collaborative relationship, featuring frequent consultation and project reporting. We, the instructors, attempt to facilitate the collaborative relationship between students and their mentors, making it a very valuable experience for both mentors and students.

The project workshop

The project workshop is the primary vehicle that enables the entire class to contribute to each other's final project ideas. We usually dedicate six of the lecture slots to project workshops. The workshops are designed for the students' benefit. For example, if a student has identified a project, the workshop serves as a venue to present preliminary thinking, get feedback, and recruit teammates. If a student has not identified a project, they can simply attend the presentations, participate in the discussions, and join one of the project teams. Students are not graded during the workshops, in order to keep the atmosphere nonthreatening and to enable them to focus on a meaningful dialog with the instructor(s), teaching assistants, and the rest of the class.

The workshop schedule is designed so that the instructor(s) and teaching assistants can take time to provide feedback to the project teams and the students can ask questions. Presentations are limited to 10 minutes to leave time for feedback and questions during the class period. This limits the class size to about 24 presenters, assuming 90-minute lecture slots. All presentations are pre-loaded onto a PC in order to control the schedule strictly and maximize feedback time. Since not all students present at the workshop, we have been able to accommodate up to 50 students in each class, with extra workshop time available as needed. At the University of Illinois, the high demand for ECE408 has propelled the size of the classes significantly beyond the ideal size for project workshops. We will comment on this issue at the end of the section.

The instructor(s) and TAs must make a commitment to attend all the presentations and to give useful feedback. Students typically need most help in answering the following questions. First, are the projects too big or too small for the amount of time available? Second, is there existing work in the field that the project can benefit from? Third, are the computations being targeted for parallel execution appropriate for the CUDA programming model?

The design document

Once the students decide on a project and form a team, they are required to submit a design document for the project. This helps them to think through the project steps before they jump into it. The ability to do such planning will be important to their later career success. The design document should discuss the background and motivation for the project, application-level objectives and potential impact, main features of the end application, an overview of their design, an implementation plan, their performance goals, a verification plan and acceptance test, and a project schedule.

The teaching assistants hold a project clinic for final project teams during the week before the class symposium. This clinic helps ensure that students are on track and that they have identified potential roadblocks early in the process. Student teams are asked to come to the clinic with an initial draft of the following three versions of their application: (1) the best sequential CPU code in terms of performance, preferably with AVX and other optimizations, to establish a strong serial baseline for their speedup comparisons; (2) the best CUDA parallel code in terms of performance, which is the main output of the project; and (3) a sequential CPU version based on the same algorithm as the CUDA version, which the students use to characterize the parallel algorithm overhead in terms of the extra computations involved.

Student teams are asked to be prepared to discuss the key ideas used in each version of the code, any numerical stability issues, any comparison against previous results on the application, and the potential impact on the field if they achieve tremendous speedup. From our experience, the optimal schedule for the clinic is 1 week before the class symposium. An earlier time typically results in less mature projects and less meaningful sessions. A later time will not give students sufficient time to revise their projects according to the feedback.

The project report

Students are required to submit a project report on their team's key findings. We recommend a whole-day class symposium. During the symposium, students use presentation slots proportional to the size of their teams, highlighting the best parts of their project report for the benefit of the whole class. The presentation accounts for a significant part of students' grades. Each student must answer questions directed to them individually, so that different grades can be assigned to individuals on the same team. The symposium is an opportunity for students to learn to produce a concise presentation that motivates their peers to read the full paper. After their presentation, the students also submit a full report on their final project.

Class Competition

In 2016, the enrollment level of ECE408 far exceeded what the final project process can accommodate. As a result, we moved from the final project to a class competition. In the middle of the semester, we announce a competition challenge problem. We use one lecture to explain the challenge problem and the rules that will be used to rank the teams. The students work in teams to develop parallel solutions to the challenge. The final ranking of each team is determined by the execution time, correctness, and clarity of their parallel code. The students demonstrate their solutions at the end of the semester and submit a final report. This is a compromise that preserves some of the benefits of final projects when the class size makes them infeasible.

Illinois–NVIDIA GPU Teaching Kit

The Illinois–NVIDIA GPU Teaching Kit is a publicly available resource that contains lectures, lab assignments, final project guidelines, and sample project specifications for instructors who use this book for their classes. While this book provides the intellectual content for these classes, the additional material will be crucial in achieving the overall education goals. It can be accessed at http://syllabus.gputeachingkit.com/.

Finally, we encourage you to submit your feedback. We would like to hear from you if you have any ideas for improving this book, and we would like to know how we can improve the supplementary online material. We would also like to know what you liked about the book. We look forward to hearing from you.

Online Supplements

The lab assignments, final project guidelines, and sample project specifications are also available to instructors who use this book for their classes. We invite you to take advantage of the online material that accompanies this book, which is available at http://textbooks.elsevier.com/9780128119860.
