SMCW08 — Presentations
Over 120 scientists, developers, and engineers from 30 companies and institutions met in Shanghai for the Shanghai Many-Core Workshop (SMCW08). Over the two days, attendees were treated to 19 technical presentations by some of the top research and industry experts.
Welcome & Introduction
Mark K. Smith, Gelato Central Operations
Presentation (Pdf, 988 kB)
Many-Cores in the Future
Robert Schreiber, HP
The change from single core to multi-core processors is expected to continue, taking us to many-core chips (64 processors) and beyond. Cores are more numerous, but not faster. They also may be less reliable. Chip-level parallelism raises important questions about architecture, software, algorithms, and applications. A chip-edge bandwidth crisis is looming, but new technologies may help us cope. I'll consider the directions in which the architecture may be headed, and look at the impact on parallel programming and scientific computing.
No presentation available currently
Taking Multi-Core and Parallelism Seriously: The Intel Perspective
Xinmin Tian, Intel
Processor architecture and micro-architecture are undergoing a vigorous shake-up. The major chip manufacturers have shifted their focus to "multi-core" processors, a "right turn" away from GHz competition. The new focus is multiple cores with shared caches, providing increased concurrency instead of increased clock speed. The challenge is that software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must learn to make effective use of increasing parallelism, an adaptation that has never been easy. This talk consists of two parts. The first part will focus on parallel programming tools, such as compilers, performance analysis tools and correctness checking tools, that are applicable to developing mainstream parallel applications. We will also share some of the challenges that developers face today in developing applications for homogeneous multi-core systems, and discuss the situation expected with the advent of heterogeneous many-core systems in the next few years. The second part will cover progress in Transactional Memory (TM) research and development. We will discuss open TM research problems, as a growing community of researchers and industry software and hardware vendors is working on both software and hardware support for the TM approach.
This presentation cannot be published.
Challenges for HPC Future
Richard Kaufmann, HP
This talk will describe some of the trends in the base technologies used in HPC, and how these trends will affect HPC developers and users. The two biggest developments are multi-core processors (where Moore's Law has become Moore's Cores!) and accelerators (the use of GPUs, FPGAs and other specialized ASICs for general purpose computing). After listening to the talk, perhaps the audience will be better armed to ask probing questions of their computer suppliers and application developers.
No presentation available currently
Software Engineering for Multi-Core Systems - An Experience Report
Christoph Schaefer, University of Karlsruhe
The emergence of inexpensive parallel computers powered by multi-core chips, combined with stagnating clock rates, raises new challenges for software engineering. As future performance improvements will not come for free from increased clock rates, performance-critical applications will need to be parallelized. However, little is known about the engineering principles for parallel general-purpose applications.
This talk presents an experience report based on four diverse case studies with multi-core software development for general-purpose applications. Our empirical findings include:
- Auto-tuning is indispensable, as manually tuning thread assignment, number of pipeline stages, size of data partitions and other parameters is difficult and error-prone.
- Architectures that encompass several parallel components are poorly understood. Tuneable architectural patterns with parallelism at several levels need to be discovered.
Representing all case studies, I will focus on the parallelization process of a large commercial application containing multi-level parallelism and how our prototype of an auto-tuning framework is used to find the best configuration of the application's parallel sections.
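As a rough illustration of the auto-tuning idea described above, the sketch below times every combination of thread count and partition size for a toy workload and keeps the fastest one. It is not the framework presented in the talk: `run_workload`, `autotune` and the parameter grid are hypothetical names chosen for this example, and exhaustive search is only one of many possible tuning strategies.

```python
import itertools
import time
from concurrent.futures import ThreadPoolExecutor


def run_workload(data, num_threads, chunk_size):
    """Hypothetical tunable workload: sum of squares over data partitions."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial_sums = pool.map(lambda chunk: sum(x * x for x in chunk), chunks)
        return sum(partial_sums)


def autotune(data, thread_options, chunk_options, repeats=3):
    """Time every configuration; return (best_config, timings)."""
    timings = {}
    for config in itertools.product(thread_options, chunk_options):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_workload(data, *config)
            best = min(best, time.perf_counter() - start)
        # Keep the fastest of the repeated runs to reduce timing noise.
        timings[config] = best
    best_config = min(timings, key=timings.get)
    return best_config, timings


if __name__ == "__main__":
    data = list(range(200_000))
    best_config, timings = autotune(data, [1, 2, 4], [1_000, 10_000, 50_000])
    print("best (threads, chunk size):", best_config)
```

In CPython the global interpreter lock limits the speedup of CPU-bound threads, so the timings here mainly demonstrate the search procedure rather than realistic scaling.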
Presentation (Pdf, 1.6 MB)
May Happen in Parallel Analysis
Yao Shi, Tsinghua University
Concurrent program analysis is an urgent and useful topic for programmers targeting multi-core processors. A fundamental technique of concurrent program analysis is May-Happen-in-Parallel (MHP) analysis, which determines whether any two statements may be executed in parallel. Knowing which statement pairs can never run concurrently reduces the false positive rate and makes concurrent program analysis more efficient.
However, current MHP techniques are limited: they can process at most ten KLOC, and only under certain restrictions (object-oriented programs, the OpenMP model, etc.). We propose a framework in the Open64 Compiler to solve general MHP problems for C/C++ programs.
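To make the MHP question concrete, here is a minimal sketch for a deliberately restricted case: a structured fork/join program written as a tree of sequential and parallel blocks. In that setting, two statements may happen in parallel exactly when they sit in different branches of the same parallel block. This is an illustration only, not the Open64 framework from the talk, which targets general C/C++ programs. The `Stmt`, `Seq`, `Par` and `mhp_pairs` names are hypothetical, and statement names are assumed to be unique.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Set, Tuple, Union


@dataclass
class Stmt:
    name: str


@dataclass
class Seq:
    parts: List["Node"]  # children executed one after another


@dataclass
class Par:
    branches: List["Node"]  # children executed concurrently, then joined


Node = Union[Stmt, Seq, Par]


def mhp_pairs(node: Node) -> Tuple[Set[str], Set[frozenset]]:
    """Return (statements under node, pairs that may happen in parallel)."""
    if isinstance(node, Stmt):
        return {node.name}, set()
    children = node.parts if isinstance(node, Seq) else node.branches
    results = [mhp_pairs(child) for child in children]
    stmts: Set[str] = set()
    pairs: Set[frozenset] = set()
    for child_stmts, child_pairs in results:
        stmts |= child_stmts
        pairs |= child_pairs
    if isinstance(node, Par):
        # Statements in different branches of a parallel block may overlap.
        for (a_stmts, _), (b_stmts, _) in combinations(results, 2):
            pairs |= {frozenset((a, b)) for a in a_stmts for b in b_stmts}
    return stmts, pairs


if __name__ == "__main__":
    program = Seq([
        Stmt("init"),
        Par([Seq([Stmt("a1"), Stmt("a2")]), Stmt("b")]),
        Stmt("done"),
    ])
    _, pairs = mhp_pairs(program)
    print(sorted(sorted(pair) for pair in pairs))
```

A race detector could use the result to skip any pair of conflicting accesses that is not in the MHP set.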
Presentation (Pdf, 83 kB)
Scalable Concurrency in Many-Core Processors
Li Zhang, University of Amsterdam
Many-core technology offers tremendous computational capability and parallelism on a single chip. How to harness that parallelism has become a key issue in the field.
This talk will present the SVP (Self-Adaptive-Network-Entity Virtual Processor) programming model, which exploits parallelism explicitly. It tackles the problem of extracting and using the massive concurrency available in hardware cost-effectively. uTC, an extension to the C language, is defined as a concurrency-oriented parallel language built on the SVP model. An architectural solution based on the SVP model, the micro-threaded architecture, will also be introduced. It resembles the dataflow computational model and supports explicit context switching, register-level data synchronization and dynamic concurrency management. The proposed chip multiprocessor, organized as a microgrid, aims to scale across a large number of on-chip processing cores in terms of both power and performance.
Presentation (Pdf, 1.3 MB)
Dynamic Optimization - An Open Discussion
Wei Chung Hsu, University of Minnesota
This will be an open discussion led by Wei Chung Hsu and focused on Dynamic Optimizations.
No presentation available currently
Communication Analysis and Optimized Mapping of Explicit Parallel Codes
Lei Shang, Institute of Computing Technology, CAS
Mapping logical computing units onto physical computing units is one of the basic problems in parallel computing, especially for hierarchical architectures and topology-sensitive systems such as SMP clusters, multi-core SMPs and many-core systems. An optimized mapping depends on both hardware characteristics and application communication patterns. In our work, we are building a general framework that optimizes the mapping by combining application analysis with knowledge of the hardware architecture. The framework uses communication analysis techniques for explicit parallel codes and adapts abstract and heuristic methods for graph partitioning.
A toolbox approach is in the works; it will make it convenient for any explicit parallel code to obtain an optimized mapping and improve performance at low cost.
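The mapping problem can be illustrated with a small greedy heuristic: given the communication volume between pairs of logical processes, co-locate the heaviest-communicating pairs on the same physical node while capacity allows. This is a sketch under simplifying assumptions (a flat set of identical nodes, volumes already known), not the framework or partitioning method from the talk. The `greedy_mapping` and `inter_node_volume` names and the sample volumes are hypothetical.

```python
def greedy_mapping(num_procs, comm, num_nodes, cores_per_node):
    """Assign each logical process (0..num_procs-1) to a physical node.

    comm maps (i, j) process pairs to a communication volume. Heavy pairs
    are co-located first; this is a heuristic, not an optimal partitioner.
    """
    if num_procs > num_nodes * cores_per_node:
        raise ValueError("not enough cores for all processes")
    placement = {}
    load = [0] * num_nodes

    def place(proc, node):
        placement[proc] = node
        load[node] += 1

    def emptiest():
        return min(range(num_nodes), key=lambda n: load[n])

    # Visit pairs from heaviest to lightest communication volume.
    for (i, j), _volume in sorted(comm.items(), key=lambda kv: -kv[1]):
        for proc, partner in ((i, j), (j, i)):
            if proc in placement:
                continue
            node = placement.get(partner)
            if node is not None and load[node] < cores_per_node:
                place(proc, node)  # join the partner's node
            else:
                place(proc, emptiest())
    # Processes that never communicate go wherever there is room.
    for proc in range(num_procs):
        if proc not in placement:
            place(proc, emptiest())
    return placement


def inter_node_volume(placement, comm):
    """Total volume exchanged between processes on different nodes."""
    return sum(v for (i, j), v in comm.items() if placement[i] != placement[j])


if __name__ == "__main__":
    comm = {(0, 1): 10, (2, 3): 8, (1, 2): 1}
    placement = greedy_mapping(4, comm, num_nodes=2, cores_per_node=2)
    print(placement, inter_node_volume(placement, comm))
```

For the sample volumes, a round-robin placement would put both heavy pairs on different nodes, whereas the greedy placement leaves only the light (1, 2) link crossing nodes.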
Presentation (Pdf, 510 kB)
HP Compiler Lab - The Many-Core Perspective
Shin-Ming Liu, HP
No presentation available currently
GPU Computing Research at UIUC
Wen-mei W. Hwu, University of Illinois at Urbana-Champaign
In the next decade, we are going to see continued performance scaling in single-chip, massively parallel compute engines. According to the semiconductor industry road map, these chips could provide up to 10,000x speedup over our current microprocessors by the end of the year 2016. Such a dramatic increase in computation power will likely enable revolutionary work in science, engineering and many other disciplines. Like any other massively parallel computer system, in order to achieve high performance, an application programmer currently has to understand the desirable parallel programming idioms, potential performance pitfalls, and proven coding strategies for the platform. However, the programming and code optimization models of GPU computing design are quite different from those of traditional CPUs. In this presentation, I will describe the vision and recent results of a collaborative effort between the University of Illinois and NVIDIA on building an infrastructure of programming tools, educational materials (www.courses.ece.uiuc.edu/ece498/al), application development experience, and architectural directions needed for application developers to fully exploit the hardware compute power of current and future GPU computing platforms.
No presentation available currently
Massively Parallel GPU Computing with NVIDIA's CUDA
David Kirk, Nvidia
In the past, graphics processors were special purpose hardwired application accelerators, suitable only for conventional rasterization-style graphics applications. Modern GPUs are now fully programmable, massively parallel floating point processors. This talk will describe NVIDIA's massively multi-threaded computing architecture and CUDA software for GPU computing. The architecture is a scalable, highly parallel architecture that delivers high throughput for data-intensive processing. Although not truly general-purpose processors, GPUs can now be used for a wide variety of compute-intensive applications beyond graphics.
No presentation available currently
Intel Threading Building Block (TBB)
Colt Gan, Intel
Presentation (Pdf, 664 kB)
Model-Driven Development Tool for Parallel Applications
James Gan, IBM
Parallel programming is extremely difficult. Programmers must be very careful to avoid common defects such as deadlocks and data races. Our tool offers a much easier style of programming. First, it does not require explicit concurrency; instead, the developer creates a sequential computing kernel. The developer then creates a model of the parallel application being developed, and the model is transformed into a parallel application. The model-driven development tool brings the following benefits:
- Progressive disclosure of information to the developer: the software engineer can easily develop a parallel program before becoming an expert in parallel programming.
- Concurrency patterns handle classical scenarios quickly: if the use case fits one of map/reduce, master/worker, pipeline or fork/join, it can be implemented easily.
- Task-oriented API: when creating a special task flow, the developer only needs to specify the dependencies between tasks. Tasks are automatically scheduled onto multiple cores with their dependencies taken into account.
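The task-oriented style described above, where the developer states only the dependencies and a scheduler decides what runs when, can be sketched with the Python standard library (`graphlib` requires Python 3.9 or later). This is not the API of the IBM tool: `run_task_graph` and the task names are hypothetical, and each task is assumed to take a dictionary holding the results of the tasks it depends on.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_task_graph(tasks, dependencies, max_workers=4):
    """Run tasks in parallel while respecting their dependencies.

    tasks: name -> callable taking a dict of its dependencies' results
    dependencies: name -> set of task names that must finish first
    """
    sorter = TopologicalSorter(
        {name: dependencies.get(name, set()) for name in tasks}
    )
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies
    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies have all completed.
            for name in sorter.get_ready():
                inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
                running[pool.submit(tasks[name], inputs)] = name
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unblock dependent tasks
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda _: [1, 2, 3, 4],
        "square": lambda r: [x * x for x in r["load"]],
        "double": lambda r: [2 * x for x in r["load"]],
        "combine": lambda r: sum(r["square"]) + sum(r["double"]),
    }
    dependencies = {
        "square": {"load"},
        "double": {"load"},
        "combine": {"square", "double"},
    }
    print(run_task_graph(tasks, dependencies)["combine"])
```

The example graph is a fork/join: `square` and `double` can run concurrently once `load` finishes, and `combine` waits for both.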
Presentation (Pdf, 877 kB)
The Parallel Framework for Realizing the Power of Multi-Core Processors
Yurong Chen, Intel
This talk will discuss a methodology for analyzing scalability bottlenecks, and demonstrate how to improve performance on future chip multi-processor (CMP) systems. With the prevalence of CMPs and the number of cores increasing steadily for the foreseeable future, one key issue is how to effectively manage and execute more and more threads on a CMP at the same time. I will introduce the parallel framework, which uses an iterative parallel performance tuning method on the multi-core processor. Some emerging video processing applications are used to show how we can parallelize them to enable real-time performance on the multi-core processor.
I will also examine parallel performance tuning techniques in this talk, and show how to use the analysis tools to improve scalability.
Presentation (Pdf, 3.4 MB)
Parallel Processing Models and Research at CERN
Sverre Jarp, European Organization for Nuclear Research
With its current parallel processing paradigm (High Throughput Computing) CERN has been able to embrace multicore systems since Day 1. Today, to prepare for the start-up of the Large Hadron Collider, we have a large installation of Intel-64 Woodcrest/Clovertown systems as well as an IA-64 Montecito cluster. However, the parallel processing paradigm requires additional memory per process, and leads to other complications, such as inefficient scheduling. This talk will first explain the issues with the current multi-core processing model which is unlikely to scale to many-core environments (with hundreds of cores). Next I will describe several experimental programming models, based on multi-threading, that are being tried out in order to improve the situation. I will also briefly describe the tools we have deployed, such as performance monitors and threading tools. Finally I will highlight our ongoing educational effort that we think is mandatory in order to get the programming community to "think parallel".
Presentation (Pdf, 1.6 MB)
General Purpose Programming of Many-Core Devices and Many-Core Systems
Steven Ericsson-Zenith, Institute for Advanced Science & Engineering
This talk will highlight general purpose programming of many-core devices and many-core systems. I will discuss the process oriented programming model that is the basis of my work and also provide some historical anecdotes from my experience with Occam and the Transputer.
I will also discuss the Carnap programming language and the open source project that is implementing a compiler for that language for many-core platforms.
I will also speak in my capacity as Chief Scientist for Manycore Corporation and their architect of intellectual property for devices to support the Process Oriented Programming model.
Presentation (Pdf, 2.3 MB)
Dynamic Helper Thread Generation
Wei Chung Hsu, University of Minnesota
A multi-core CPU (or chip-level multiprocessor, CMP) combines two or more independent cores into a single chip. Most processor vendors are offering multi-core/many-core chips today, and more such CPUs will be coming out in the near future. At present, such multi-core/many-core CPUs are mainly used to improve throughput or highly parallel application performance rather than single thread performance. Since multiple processor cores on the same chip may share the level-2 on-chip cache, one or more helper threads can be spawned and executed speculatively ahead of the main thread to prefetch data into the shared cache. This can significantly reduce the cache miss penalty of the main thread, which is often the major performance bottleneck for modern applications.
This talk presents the design and implementation of a runtime optimization system that can automatically generate and spawn helper threads to speed up single threaded applications. The performance results are measured and collected on an UltraSparc IV+ dual-core CPU system.
No presentation available currently
High Performance Data Mining
Judy Qiu, Indiana University
The ever-increasing number of cores per chip will be accompanied by a pervasive data deluge whose size will probably increase even faster than CPU core count over the next few years. This suggests the importance of parallel data analysis and data mining applications with good multi-core, cluster and grid performance. This talk considers data clustering, mixture models and dimensional reduction, presenting a unified framework applicable to bioinformatics, cheminformatics and demographics. Deterministic annealing is used to lessen the effect of local minima. We present performance results on 4- and 8-core systems, identifying effects from cache, runtime fluctuations, synchronization and memory bandwidth.
Presentation (Pdf, 2.8 MB)
Parallel Garbage Collection
Xiao-Feng Li, Intel
Garbage collection (GC) is one of the key components in modern programming systems, such as Java, C#, JavaScript, Ruby, etc. Its performance impacts overall software scalability on multi-core platforms. While the major efforts in software parallelization are focused on multi-core programming, threading, and compilation, we investigate GC parallelization technology systematically. We classify the topics into the following categories: traversal of the object connection graph, live object marking, object copying order, heap compaction, large object management, and concurrent collection. Each category has its own characteristics and is worth separate study. In this talk, we will describe and compare the parallelization techniques in each category, and also discuss their interactions with underlying platforms.
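As one example from the categories listed above, live object marking can be parallelized by letting several worker threads drain a shared work queue of objects whose references still need scanning. The sketch below is a toy model, not the collector discussed in the talk: the heap is a plain dictionary from object name to referenced object names, `parallel_mark` is a hypothetical name, and a single lock stands in for the atomic mark-bit update a real collector would use.

```python
import queue
import threading


def parallel_mark(roots, references, num_workers=4):
    """Return the set of objects reachable from roots.

    references maps an object name to the names it points to.
    Object names must not be None, which is used as a stop sentinel.
    """
    marked = set()
    lock = threading.Lock()
    work = queue.Queue()

    def try_mark(obj):
        # Check-and-set under a lock so each object is queued only once.
        with lock:
            if obj in marked:
                return False
            marked.add(obj)
            return True

    def worker():
        while True:
            obj = work.get()
            if obj is None:
                work.task_done()
                return
            for child in references.get(obj, ()):
                if try_mark(child):
                    work.put(child)
            # Children are queued before the parent is reported done, so
            # the unfinished-task count cannot reach zero prematurely.
            work.task_done()

    for root in roots:
        if try_mark(root):
            work.put(root)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    work.join()  # every queued object has been scanned
    for _ in threads:
        work.put(None)  # tell each worker to exit
    for thread in threads:
        thread.join()
    return marked


if __name__ == "__main__":
    heap = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
    print(sorted(parallel_mark(["a"], heap)))
```

In the sample heap, `e` points to a live object but is itself unreachable from the root, so it stays unmarked and would be reclaimed.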
Presentation (Pdf, 877 kB)