Program

The final program PDF is now available.

Below is the online list of confirmed presentations and speakers.

General

Presentation

Speaker

Welcome & Introduction

Mark K. Smith, Gelato Central Operations

Why Many-Core - The BIG Picture

Presentation

Speaker

Challenges for HPC Future

This talk will describe some of the trends in the base technologies used in HPC, and how these trends will affect HPC developers and users. The two biggest developments are multi-core processors (where Moore's Law has become Moore's Cores!) and accelerators (the use of GPUs, FPGAs and other specialized ASICs for general purpose computing). After listening to the talk, perhaps the audience will be better armed to ask probing questions of their computer suppliers and application developers.

Richard Kaufmann, HP

Many-Cores in the Future

The change from single core to multi-core processors is expected to continue, taking us to many-core chips (64 processors) and beyond. Cores are more numerous, but not faster. They also may be less reliable. Chip-level parallelism raises important questions about architecture, software, algorithms, and applications. A chip-edge bandwidth crisis is looming, but new technologies may help us cope. I'll consider the directions in which the architecture may be headed, and look at the impact on parallel programming and scientific computing.

Robert Schreiber, HP

Taking Multi-Core and Parallelism Seriously: The Intel Perspective

Processor architecture and micro-architecture are undergoing a vigorous shake-up. The major chip manufacturers have shifted their focus to “multi-core” processors, taking a “right turn” away from GHz competition. The new focus is multiple cores with shared caches, providing increased concurrency instead of increased clock speed. The challenge is that software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must learn to make effective use of increasing parallelism. This adaptation has never been easy. This talk consists of two parts. The first part will focus on parallel programming tools, such as compilers, performance analysis tools and correctness checking tools, that are applicable to developing mainstream parallel applications. We will also share some of the challenges that developers face today in developing applications for homogeneous multi-core systems, and we will discuss the situation that will arise with the advent of heterogeneous many-core systems in the next few years. The second part will cover progress on Transactional Memory (TM) technology research and development. We will discuss open transactional memory research problems, as there is a growing community of researchers and industry software and hardware vendors working on both software and hardware support for the TM approach.

Xinmin Tian, Intel

New Paradigms - Thinking Parallel

Presentation

Speaker

General Purpose Programming of Many-Core Devices and Many-Core Systems

This talk will highlight general purpose programming of many-core devices and many-core systems. I will discuss the process oriented programming model that is the basis of my work and also provide some historical anecdotes from my experience with Occam and the Transputer.

I will also discuss the Carnap programming language and the open source project that is implementing a compiler for that language for many-core platforms.

I will also speak in my capacity as Chief Scientist for Manycore Corporation and their architect of intellectual property for devices to support the Process Oriented Programming model.

Steven Ericsson-Zenith, Institute for Advanced Science & Engineering

Dynamic Helper Thread Generation

A multi-core CPU (or chip-level multiprocessor, CMP) combines two or more independent cores into a single chip. Most processor vendors are offering multi-core/many-core chips today, and more such CPUs will be coming out in the near future. At present, such multi-core/many-core CPUs are mainly used to improve throughput or highly parallel application performance rather than single thread performance. Since multiple processor cores on the same chip may share the level-2 on-chip cache, one or more helper threads can be spawned and executed speculatively ahead of the main thread to prefetch data into the shared cache. This can significantly reduce the cache miss penalty of the main thread, which is often the major performance bottleneck for modern applications.

This talk presents the design and implementation of a runtime optimization system that can automatically generate and spawn helper threads to speed up single threaded applications. The performance results are measured and collected on an UltraSparc IV+ dual-core CPU system.

Wei Chung Hsu, University of Minnesota

Parallel Processing Models and Research at CERN

With its current parallel processing paradigm (High Throughput Computing) CERN has been able to embrace multicore systems since Day 1. Today, to prepare for the start-up of the Large Hadron Collider, we have a large installation of Intel-64 Woodcrest/Clovertown systems as well as an IA-64 Montecito cluster. However, the parallel processing paradigm requires additional memory per process, and leads to other complications, such as inefficient scheduling. This talk will first explain the issues with the current multi-core processing model which is unlikely to scale to many-core environments (with hundreds of cores). Next I will describe several experimental programming models, based on multi-threading, that are being tried out in order to improve the situation. I will also briefly describe the tools we have deployed, such as performance monitors and threading tools. Finally I will highlight our ongoing educational effort that we think is mandatory in order to get the programming community to "think parallel".

Sverre Jarp, European Organization for Nuclear Research

Parallel Garbage Collection

Garbage collection (GC) is one of the key components in modern programming systems, such as Java, C#, JavaScript, Ruby, etc. Its performance impacts the overall software scalability on multi-core platforms. While the major efforts in software parallelization are focused on multi-core programming, threading, and compilation, we investigate GC parallelization technology systematically. We classify the topics into the following categories: traversal of the object connection graph, live object marking, object copying order, heap compaction, large object management, and concurrent collection. Each category has its own characteristics and is worth separate study. In this talk, we will systematically describe and compare the parallelization techniques in each category, and also discuss their interactions with the underlying platforms.

Xiao-Feng Li, Intel

High Performance Data Mining

The ever-increasing number of cores per chip will be accompanied by a pervasive data deluge whose size will probably increase even faster than CPU core count over the next few years. This suggests the importance of parallel data analysis and data mining applications with good multi-core, cluster and grid performance. This talk considers data clustering, mixture models and dimensional reduction, presenting a unified framework applicable to bioinformatics, cheminformatics and demographics. Deterministic annealing is used to lessen the effect of local minima. We present performance results on 4- and 8-core systems, identifying effects from cache, runtime fluctuations, synchronization and memory bandwidth.

Judy Qiu, Indiana University

Many-Core Application Development

Presentation

Speaker

The Parallel Framework for Realizing the Power of Multi-Core Processors

This talk will discuss a methodology for analyzing scalability bottlenecks and demonstrate how to improve performance on future chip multi-processor (CMP) systems. With CMPs prevalent and the number of cores set to increase steadily for the foreseeable future, one key issue is how to effectively manage and execute more and more threads on a CMP at the same time. I will introduce the parallel framework, which uses an iterative parallel performance tuning method on multi-core processors. Some emerging video processing applications are used to show how they can be parallelized to achieve real-time performance on a multi-core processor.

I will also examine all aspects of parallel performance tuning techniques in this talk, and show how to use the analysis tools to improve the scalability performance.

Yurong Chen, Intel

Intel Threading Building Block (TBB)

Colt Gan, Intel

Model-Driven Development Tool for Parallel Applications

Parallel programming is extremely difficult. Programmers must be very careful to avoid common defects such as deadlocks and data races. Our tool provides a much easier style of programming. First, it does not require explicit concurrency. Instead, the developer creates a sequential computing kernel. The developer then creates a model of the parallel application being developed, and the model is transformed into a parallel application. The model-driven development tool brings the following benefits:

1. Progressive disclosure of information to the developer

The software engineer can easily develop a parallel program before he or she becomes an expert in parallel programming.

2. Concurrency patterns handle classical scenarios quickly

If the use case fits one of map/reduce, master/worker, pipeline, or fork/join, it can be implemented easily.

3. Task-oriented API

When creating a special task flow, the developer only needs to specify the dependencies between tasks. Tasks are automatically scheduled onto multiple cores with their dependencies taken into account.
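
To illustrate the task-oriented style described in item 3, here is a minimal Python sketch of dependency-driven scheduling on a thread pool. It is not the tool or API presented in the talk: the helper name `run_task_graph`, its parameters, and the example task names are all hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_task_graph(tasks, deps, max_workers=4):
    """Run callables in `tasks`, starting each only after its dependencies finish.

    tasks: dict mapping a task name to a zero-argument callable.
    deps:  dict mapping a task name to the names it depends on.
    Returns a dict mapping each task name to its result.
    Hypothetical helper for illustration; not the API of the tool in the talk.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every task whose dependencies have all completed.
            ready = [name for name, blockers in remaining.items() if not blockers]
            for name in ready:
                del remaining[name]
                running[pool.submit(tasks[name])] = name
            if not running:
                raise ValueError("dependency cycle or unknown dependency")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                # A finished task no longer blocks its dependents.
                for blockers in remaining.values():
                    blockers.discard(name)
    return results


# Example task flow: "merge" depends on two independent "load" tasks.
example_tasks = {
    "load_a": lambda: [1, 2, 3],
    "load_b": lambda: [4, 5],
    "merge": lambda: "merged",
}
example_deps = {"merge": ["load_a", "load_b"]}
example_results = run_task_graph(example_tasks, example_deps)
```

The caller states only which tasks depend on which. The two `load` tasks are free to run concurrently, and `merge` is submitted once both have completed.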

James Gan, IBM

GPU Computing Research at UIUC

In the next decade, we are going to see continued performance scaling in single-chip, massively parallel compute engines. According to the semiconductor industry road map, these chips could provide up to 10,000x speedup over our current microprocessors by the end of the year 2016. Such a dramatic increase in computation power will likely enable revolutionary work in science, engineering and many other disciplines. As with any other massively parallel computer system, an application programmer who wants high performance currently has to understand the desirable parallel programming idioms, potential performance pitfalls, and proven coding strategies for the platform. However, the programming and code optimization models of GPU computing design are quite different from those of traditional CPUs. In this presentation, I will describe the vision and recent results of a collaborative effort between the University of Illinois and NVIDIA on building an infrastructure of programming tools, educational materials (www.courses.ece.uiuc.edu/ece498/al), application development experience, and architectural directions needed for application developers to fully exploit the hardware compute power of current and future GPU computing platforms.

Wen-mei W. Hwu, University of Illinois at Urbana-Champaign

Massively Parallel GPU Computing with NVIDIA's CUDA

In the past, graphics processors were special purpose hardwired application accelerators, suitable only for conventional rasterization-style graphics applications. Modern GPUs are now fully programmable, massively parallel floating point processors. This talk will describe NVIDIA’s massively multi-threaded computing architecture and CUDA software for GPU computing. The architecture is a scalable, highly parallel architecture that delivers high throughput for data-intensive processing. Although not truly general-purpose processors, GPUs can now be used for a wide variety of compute-intensive applications beyond graphics.

David Kirk, Nvidia

Compiling Code for Many-Core

Presentation

Speaker

Dynamic Optimization - An Open Discussion

This will be an open discussion led by Wei Chung Hsu and focused on Dynamic Optimizations.

Wei Chung Hsu, University of Minnesota

HP Compiler Lab - The Many-Core Perspective

Shin-Ming Liu, HP

Software Engineering for Multi-Core Systems - An Experience Report

The emergence of inexpensive parallel computers powered by multi-core chips combined with stagnating clock rates raises new challenges for software engineering. As future performance improvements will not come for free from increased clock rates, performance critical applications will need to be parallelized. However, little is known about the engineering principles for parallel general-purpose applications.

This talk presents an experience report based on four diverse case studies with multi-core software development for general-purpose applications. Our empirical findings include:

1) Auto-tuning is indispensable, as manually tuning thread assignment, number of pipeline stages, size of data partitions and other parameters is difficult and error-prone.

2) Architectures that encompass several parallel components are poorly understood. Tuneable architectural patterns with parallelism at several levels need to be discovered.

As a representative case study, I will focus on the parallelization process of a large commercial application containing multi-level parallelism and on how our prototype auto-tuning framework is used to find the best configuration of the application's parallel sections.
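
As a rough illustration of what auto-tuning parameters such as thread count, number of pipeline stages and data partition size can look like, the Python sketch below tries every combination in a small search space and keeps the cheapest one. The abstract does not describe the search strategy of the prototype framework, so exhaustive search is an assumption made here. The names `autotune` and `synthetic_cost` are hypothetical, and the cost model is made up to stand in for real timing runs.

```python
import itertools


def autotune(measure, search_space):
    """Exhaustively try every parameter combination and keep the cheapest.

    measure:      callable taking keyword parameters and returning a cost
                  (for example, a measured run time in seconds).
    search_space: dict mapping a parameter name to the values to try.
    Returns (best_config, best_cost).
    Hypothetical helper; not the framework described in the talk.
    """
    names = list(search_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        cost = measure(**config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def synthetic_cost(threads, stages, chunk):
    # Stand-in for a real timing run: an assumed cost model, not measured data.
    work = 1000.0 / threads          # parallel work shrinks with more threads
    overhead = 5.0 * threads         # coordination cost grows with threads
    pipeline = abs(stages - 3) * 10  # assume three stages is the sweet spot
    chunking = abs(chunk - 64) / 8   # assume 64-element partitions work best
    return work + overhead + pipeline + chunking


example_space = {
    "threads": [1, 2, 4, 8, 16, 32],
    "stages": [1, 2, 3, 4],
    "chunk": [16, 64, 256],
}
best_config, best_cost = autotune(synthetic_cost, example_space)
```

In practice `measure` would run and time the application's parallel section. Exhaustive search is only feasible for small spaces, so a real tuner would likely use a smarter search.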

Christoph Schaefer, University of Karlsruhe

Communication Analysis and Optimized Mapping of Explicit Parallel Codes

Mapping logical computing units onto physical computing units is one of the basic problems in parallel computing, especially for hierarchical architectures or topology-sensitive systems, such as SMP clusters, multi-core SMP and many-core systems. The optimized mapping depends on both hardware characteristics and application communication patterns. In our work, we are building a general framework for optimizing the mapping by combining application analysis with knowledge of the hardware architecture. Our framework uses communication analysis techniques for explicit parallel codes, together with adapted abstract and heuristic methods for graph partitioning.

A toolbox is in the works that will make it convenient for any explicit parallel code to obtain an optimized mapping and improve performance cheaply.

Lei Shang, Institute of Computing Technology, CAS

May Happen in Parallel Analysis

Concurrent program analysis is an urgent and useful topic for programmers targeting multi-core processors. A fundamental technique of concurrent program analysis is May-Happen-in-Parallel (MHP) analysis that determines whether any two statements may be executed in parallel. This reduces the false positive rate and makes concurrent program analysis more efficient.

However, current MHP techniques are limited: they can process at most ten KLOC, and only under some restrictions (Object-Oriented, OpenMP model, etc.). We propose a framework in the Open64 Compiler to solve general MHP problems for C/C++ programs.
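
To make the MHP question concrete, the following toy Python sketch computes may-happen-in-parallel pairs for a program written as a tree of sequential and parallel regions. It assumes purely structured fork/join parallelism, unique statement labels, and no other synchronization. It is far simpler than the general C/C++ problem the talk addresses and is unrelated to the Open64 framework. The tree encoding and the name `mhp_pairs` are hypothetical.

```python
from itertools import combinations


def mhp_pairs(node):
    """Return (statements, pairs) for a structured fork/join program tree.

    node is either a statement label (str) or a tuple ("seq", children)
    or ("par", children). `pairs` holds frozensets {a, b} of statement
    labels that may execute in parallel. Labels are assumed to be unique.
    """
    if isinstance(node, str):
        return {node}, set()
    kind, children = node
    if kind not in ("seq", "par"):
        raise ValueError("unknown node kind: %r" % (kind,))
    statements, pairs, child_sets = set(), set(), []
    for child in children:
        child_statements, child_pairs = mhp_pairs(child)
        child_sets.append(child_statements)
        statements |= child_statements
        pairs |= child_pairs
    if kind == "par":
        # Statements in different branches of a parallel region may overlap.
        for left, right in combinations(child_sets, 2):
            pairs |= {frozenset((a, b)) for a in left for b in right}
    return statements, pairs


# s1; then (s2; s3) in parallel with s4; then s5.
example_program = ("seq", ["s1", ("par", [("seq", ["s2", "s3"]), "s4"]), "s5"])
example_statements, example_pairs = mhp_pairs(example_program)
```

In this encoding, two statements may happen in parallel exactly when their lowest common ancestor in the tree is a `par` node. Here `s4` may overlap with `s2` or `s3`, while `s1` and `s5` overlap with nothing.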

Yao Shi, Tsinghua University

Scalable Concurrency in Many-Core Processors

Many-core technology offers tremendous computational capability and parallelism on a single chip. Meanwhile, how to harness the power of this parallelism has become a key issue in the field.

This talk will present the SVP (Self-Adaptive-Network-Entity Virtual Processor) programming model with explicit parallelism exploitation. It tackles the issue of extracting and utilizing the massive concurrency in hardware cost-effectively. Based on the SVP model, uTC (an extension to the C language) is defined as a concurrency-oriented parallel language. An architectural solution based on the SVP model, the micro-threaded architecture, will also be introduced. It resembles the dataflow computational model and is capable of explicit context switching, register-level data synchronization and dynamic concurrency management. The proposed chip multi-processor, organized as a microgrid, aims to be scalable across a large number of on-chip processing cores in terms of both power and performance.

Li Zhang, University of Amsterdam