Presentations—Gelato ICE Itanium® Conference & Expo | San Jose | April 2006

Over 200 scientists, developers, and engineers from 80 companies and institutions met in San Jose, California, for the April 2006 Gelato ICE: Itanium® Conference & Expo. Attendees addressed current high-performance computing issues and collaborative solutions specific to Linux on the Intel Itanium architecture. Over a 3-day period, attendees were treated to 65+ technical presentations by some of the top research and industry users of Linux on the Itanium-based platform.

Aside from the presentations and discussions, attendees participated in a variety of social events. In addition to the technical presentations listed below, view some of the photographs from the meeting.

Keynotes
Itanium: Its Rationale and Potential from an HP Labs Perspective, William S. Worley - Secure64 & Itanium Solutions Alliance

Trends in Computer System Design, Jerry Huck - HP

The Road Ahead: Intel Itanium Architecture and Software, Don Soltis & James Reinders - Intel

General Interest
Welcome, Mark K. Smith - Gelato Central Operations

Basic Itanium Architecture, Cameron McNairy - Intel

An Evaluation of High Performance Octave on Itanium, Ashok Krishnamurthy - Ohio Supercomputer Center

Highlights of the Upcoming October Gelato Conference, Jon Lau - National Grid Office

An Overview of Common Interconnects for Commodity Clusters, Doug Johnson - Ohio Supercomputer Center

Numerical Computation Tools for Itanium, Matthieu Delahaye & Shailesh Patel - Gelato Central Operations

Topics for Enterprise
Oracle: An Enterprise Itanium Use Case Study, Brian Hirano - Oracle

An Update on Xen on Itanium, Alex Williamson - HP

Enterprise Graphics on IPF, Hansong Zhang - SGI

MCA: Machine Check Architecture, Cameron McNairy - Intel

Itanium Solutions Alliance Developer Days
Itanium Architecture, Cameron McNairy - Intel

Hardware Overview, Jeff Donsbach - HP

Itanium Firmware (EFI), Jeff Donsbach - HP

A Systematic Approach to Tuning Software, Sverre Jarp - European Organization for Nuclear Research

Completing a Successful Migration, Jeff Donsbach - HP

Tools and Tuning
Columbia Application Tuning Case Studies, Johnny Chang - National Aeronautics and Space Administration

HP Caliper: An Update to the Linux IPF Performance Tool, Curt Wohlgemuth & Steve Williams - HP

VTune Update, Paul M. Cohen - Intel

An Update on the Current State of Open|SpeedShop, Jack Carter - SGI

Valgrind, Julian Seward - OpenWorks

A Dynamic Instrumentation-Based System for Building Program Analysis Tools for the IPF Platform, Jasper Kamperman - Intel

OpenMP: Past, Present, and Future, Timothy Mattson - Intel

Update on the Perfmon2 Interface, Stéphane Eranian - HP

Focus on GCC
The ISP RAS Effort to Improve GCC for Itanium, Arutyun I. Avetisyan - Institute for System Programming, Russian Academy of Science

GCC IP Issues, Dan Berlin - Google

Open64: An Alternative Backend for GCC, Shin-Ming Liu - HP

Aliasing in GCC, Dan Berlin - Google

Superblock Update, Robert Kidd - University of Illinois at Urbana-Champaign

An Interblock VLIW-Targeted Instruction Scheduler for GCC, Andrey Belevantsev - Institute for System Programming, Russian Academy of Science

Parallel Programming with GCC, Diego Novillo - Red Hat

LTO: A Brief Introduction, Mark Mitchell - CodeSourcery

LLVM: A Brief Introduction, Chris Lattner - Apple

Focus on Scalability
Blktrace: An Overview, Alan Brunelle - HP

Scaling Linux to 512 Processors and Beyond, John Hawkes - SGI

NFS Performance, Peter Chubb - University of New South Wales

Scalability Mini-Track Wrap Up, Lee Schermerhorn - HP

Advanced Topics
Mathematical Modeling to Formally Prove Correctness, John R. Harrison - Intel

Kernel Optimization for Enterprise Workloads, Kenneth Chen - Intel

Suggested Improvements in Itanium and Software, Clemens C. J. Roothaan - Gelato Honorary Member

Evolution of PCI IO: A Linux IO Geek's Perspective on HW, Grant Grundler - HP

Local and Remote Memory: Memory in a NUMA System, Christoph Lameter - SGI

Decimal Floating-Point, John Crawford - Intel

The Itanium Vector Math Library (VML), Clemens C. J. Roothaan - Gelato Honorary Member

Research
Preparing for the First Beam at the LHC, Lawrence Pinsky - University of Houston

Computing Optimal Equilibrium Strategies for Network Economies, Alejandro Jofré - University of Chile

Mathematical Libraries and the Implementation of Parallel Solvers for Engineering, Hugo Daniel Scolnik - University of Buenos Aires

Superpages / VM Work, Ian Wienand - University of New South Wales

Experiences on the Itanium-Based Grid Test Bed at UPRM, Wilson Rivera - University of Puerto Rico Mayaguez

In Search of Collaboration, Ping-Hui Kao - HP

Itanium Virtualization and vNUMA, Matthew Chapman - University of New South Wales

Bioinformatics in Biomining, Nicholas Loira & Andres Aravena - University of Chile

Keynotes
Monday
Itanium: Its Rationale and Potential from an HP Labs Perspective

The Intel/HP Itanium architecture definition effort started with the results of an HP Labs research program, called PA Wide Word internally, conducted from January 1990 to December 1993. Concepts and conclusions formulated during this research program established technical principles for a fundamental advance in processor architecture and led to the Intel/HP partnership. Less noticed in published accounts is the fact that many capabilities Intel and HP jointly innovated in the Itanium architecture were specifically designed to enable construction of secure systems.

Non-security objectives have led modern general-purpose operating systems to continue to rely upon a more than 40-year-old, CPU-only, hardware protection model. This limited hardware protection model simply is incapable of supporting the levels of remote-attack security required in today's massively complex systems, in today's online world. As a result, we find vulnerable servers surrounded by vulnerable external protective appliances. All require periodic patching and re-testing. It's not clear the good guys are winning.

Intel's Itanium 2 systems now offer the means for building "inherently secure" systems. Inherently secure means that the software controlling the hardware platform has specific, strong security properties. Without an inherently secure foundation, the current trends of virtualizing servers and consolidating network protective appliances magnify, rather than mitigate, security risks. Secure64's inherently secure hardware platform control software fully utilizes the capabilities of the Itanium architecture to provide such a foundation. This offers substantial benefits for information systems and infrastructures, and can establish Itanium hardware platforms as the winners both for secure consolidation and for secure virtualization.

William S. Worley - Secure64 & Itanium Solutions Alliance

Dr. William (Bill) Worley Jr. is the CTO of Secure64 Software Corporation. He is a retired HP Fellow (Chief Scientist and Distinguished Contributor) and a Commissioner of the Colorado Governor's Science and Technology Commission. He received an MS (Physics) and MS (Information Science) from the University of Chicago and a PhD (Computer Science) from Cornell University. Bill is a system architect. At HP, he directed the team that developed the PA RISC architecture. He later directed the development of the PA Wide Word architecture, the foundation for the HP/Intel partnership that led to the Itanium 2 microprocessor family. Prior to HP, during 13 years with IBM, he contributed to architectures for mainframes, storage systems, and IBM's first RISC architecture. In the years prior to his retirement from HP, Bill focused on hardware and software architectures for secure systems. Following retirement, Bill joined Secure64 Software as a co-founder and CTO. Secure64 has developed a multi-core platform control system, including a queued, asynchronous network stack, which fully exploits the security capabilities of the Itanium architecture.

 
Tuesday
Trends in Computer System Design

This presentation will examine the issues and tradeoffs in high-performance commercial system design. The current family of chipsets and system enclosures from HP will be used to examine how system requirements influence design choices. These requirements include performance, power, reliability, availability, serviceability, and manageability.

Jerry Huck - HP

Jerry Huck is an HP Fellow with HP's server global business unit, which produces the Itanium-based HP Integrity servers running HP-UX, Linux, OpenVMS, and Windows operating environments. He contributes to technology and strategy development, including work on platform and processor architecture, virtualization, performance analysis, and manageability solutions. Huck joined HP in 1983 and participated in the development of HP's PA-RISC architecture, specializing in floating-point and virtual memory definition. He and his team developed the 64-bit instruction set extensions to PA-RISC in the early 1990s. Starting in 1994, Huck led the HP side of the instruction set and platform definition team for the co-developed Intel Itanium architecture. He continues to evangelize HP's server offerings with customers and industry analysts. He received his PhD from Stanford and holds more than 15 patents in computer architecture and design.

 
Wednesday
The Road Ahead: Intel Itanium Architecture and Software

Join us to hear about the road ahead for Itanium processors from hardware and software experts working at Intel. Itanium processor-based systems are winning in traditional RISC markets such as scalable enterprise, high-performance computing (HPC) and mainframe replacement. These markets require robust throughput, scalar and floating-point (FP) performance of the processor, as well as its memory and I/O system. Future Itanium processor designs will feature increased core count, higher operating frequencies, increased memory bandwidth, and lower memory latency. Itanium processor-based systems span from a few to thousands of processors.

Itanium processors are well suited for these varied environments because of high reliability, agile configurability, and strong software support. Enhanced reliability results from specialized soft error resistant circuits, integrated checking and error recovery algorithms along with extensive error checking and correction of array elements and datapaths. Processor reliability is increasingly critical due to virtualization applications because virtual processors may exist on each physical processor and an unrecoverable soft error of a physical processor would affect many virtual processors. Intel designs excel at addressing this challenge. Itanium processor designs also benefit greatly from the unmatched manufacturing capabilities and silicon processing experience of Intel, as well as a strong software ecosystem and excellent software development products.

Don Soltis - Intel

Don Soltis is a Senior Principal Engineer at Intel and has spent the past 10 years on Itanium CPU architecture, design, and development. He has 20 years of experience in CPU and ASIC design, working on PA-RISC CPUs, I/O, memory, and graphics chips. His favorite activities are freshwater fly-fishing in the Colorado mountains and saltwater fly-fishing in southwest Florida.

James Reinders - Intel

James Reinders is a Senior Engineer who joined Intel Corporation in 1989 and has contributed to projects including the world's first TeraFLOP supercomputer (ASCI Red), compilers, and architecture work for the iWarp, Pentium Pro, Pentium II, Itanium, and Pentium 4 processors.

Reinders is currently the Director of Business Development and Marketing for Intel's Software Development Products and serves as the chief evangelist and spokesperson. He has been a leader in the creation of Intel's Software Products, including product plans, support, technical marketing, marketing, and business development. Reinders is also the author of a recent book titled "VTune Performance Analyzer Essentials."

 
General Interest
Monday
Welcome

Welcome, introduction, and overview of Gelato Federation activities.

Mark K. Smith - Gelato Central Operations

Mark K. Smith is the Managing Director of the Gelato Federation. He works with Federation members and sponsors around the world, fostering collaborative relationships among members, sponsors, and the general community to advance the Linux-Itanium platform. Mark leads a technical team at University of Illinois and dedicates time to educating the general community about the advantages of the platform. Prior to joining Gelato, he worked in the software industry for 10 years. Mark holds a PhD in Engineering from the University of Illinois.

 
Basic Itanium Architecture

The Itanium architecture and the paradigm of explicit parallel instruction computing (EPIC) are often poorly understood. This presentation will cover important aspects of the EPIC paradigm, including software pipelining, the register stack engine, predication, parallel instruction groups, data and control speculation, and many other mysteries of the Itanium application and system architectures.

Cameron McNairy - Intel

Cameron McNairy is a Principal Engineer and an Intel Architect for the Montecito program. Prior to Montecito, Cameron was a micro-architect for the Itanium 2 processor, contributing to its design and final validation. He plans to focus on performance, RAS (reliability, availability, serviceability), and system interface issues in the design of future IPF products. He came to the Itanium 2 team soon after its inception from performance work on the first Itanium processor. Cameron received a BSEE and an MSEE from Brigham Young University. He is a member of the Institute of Electrical and Electronics Engineers.

 
An Evaluation of High Performance Octave on Itanium

GNU Octave is a MATLAB-style interactive application for performing numerical computations. The Octave language is mostly compatible with MATLAB. MATLAB (and Octave) are being used as an executable specification language to develop synthetic compact applications for the DARPA HPCS program. This work has identified a clear need for a MATLAB-style interpreter that can handle large address spaces, run on multiple processors, and leverage high-performance interconnects.

The Ohio Supercomputer Center (OSC), Ohio State University, and Indiana University have been collaborating on research and software technologies for parallel Octave. We have constructed a version of parallel Octave for the Itanium 2 cluster at OSC. This interpreter has a 64-bit address space for large matrix support and uses the high-bandwidth Myrinet interconnect. This talk will review the software architecture, performance and scalability of parallel Octave on the OSC Itanium 2 cluster.

Ashok Krishnamurthy - Ohio Supercomputer Center

Ashok Krishnamurthy is the Director of Research and Scientific Development at Ohio Supercomputer Center and an Associate Professor of Electrical and Computer Engineering at Ohio State University. His research interests are: signal and image processing, high-performance computing applications, and data mining. His undergraduate degree in Electrical Engineering is from the Indian Institute of Technology, Madras, and his MS and PhD in Electrical and Computer Engineering are from the University of Florida. He is involved in the DARPA High Productivity Computing Systems Program and the DoD High Performance Computing Modernization Program.

 
Wednesday
Highlights of the Upcoming October Gelato Conference

This presentation will highlight the next Gelato ICE: Itanium Conference & Expo to be held October 1-4 in Singapore.

Jon Lau - National Grid Office

Jon Lau is the Assistant Head (Technical) at the National Grid Office (NGO) as well as the Technical Manager of the National Grid Pilot Platform (NGPP). He coordinates the technical issues of the NGPP and virtual grid communities, which span from network and security to middleware software. He developed the first Access Grid (AG) node in Singapore and has since seen the deployment of several sites in Singapore. He is currently involved in several other initiatives such as the Global Operational Grid, the Digital Media Grid Rendering Service, and the SG@Schools PC-Grid.

Prior to joining the NGO in January 2003, he was the Director of Engineering at eXage Private Limited, a high-tech spin-off from the Kent Ridge Digital Labs (KRDL), where he led the development team in designing a scalable architecture ready to evolve to meet the needs of eXage's customers. Jon's technological experience is driven from both hardware interests and software R&D work at both the Information Technology Institute and KRDL. The many projects that Jon has been involved with include WinViz, a data visualization tool, as well as the Expert Advisory System on the Internet (a national project), where he performed the role of a Technical Manager. Jon holds a Bachelor's Degree in Computing and a Master of Technology, both from the National University of Singapore.

 
An Overview of Common Interconnects for Commodity Clusters

There is a wide variety of interconnects for commodity clusters, and choosing the appropriate network when constructing a new cluster can be daunting. This presentation gives a hardware overview of the more common interconnects available, their performance, and a comparison of the software available for the hardware.

Doug Johnson - Ohio Supercomputer Center

Doug Johnson is the Technical Lead for the Cluster Ohio project and production Linux clusters at OSC. He has worked on many projects to address usability and manageability of clusters of commodity systems. His current areas of interest include: grid meta-scheduling, storage for clusters, and high-availability services for clusters.

 
Numerical Computation Tools for Itanium

If you are developing new algorithms (signal processing, voice encoding, etc.), analyzing and visualizing data, or simply performing scientific and numerical computations like matrix operations, several tools are available today to help you. These applications usually manipulate large amounts of data and perform CPU-intensive operations, which makes the Itanium processor a suitable platform. We will explore the various solutions (MATLAB, Octave, and Scilab) and the functionality they offer, then present the results of our informal benchmarking and speed-comparison tests and discuss the planned evolution.

Matthieu Delahaye - Gelato Central Operations

Although Matthieu Delahaye has worked on the Gelato portal since its creation in 2002, he officially joined Gelato Central Operations as a Software Engineer in August 2004. In addition to maintaining the Gelato portal, Matthieu works on Gelato Coconut, Gelato Vanilla, and other challenging infrastructure and development projects around the Itanium processor. Matthieu made his first kernel hacks while involved in the parisc-linux port effort, and then joined the Debian Project. At the same time, he received an MS in Computer Science from ESIEE, where he subsequently worked for two years in the IT Department.

Shailesh Patel - Gelato Central Operations

Shailesh Patel was born in India and grew up in Dubai, UAE. He graduated from the National Institute of Technology (NIT), India with a BS in Engineering and then completed his MS in Computer Engineering at California State University, Long Beach. He has worked as a J2EE developer, creating software for the subtitling and marketing industry. At the University of Illinois at Urbana-Champaign, he worked with the SandBox group and the openIMPACT team. Currently, Patel works for Gelato Central Operations on the Vanilla project, developing optimized binaries for the Itanium platform.

 
Topics for Enterprise
Monday
Oracle: An Enterprise Itanium Use Case Study

Oracle's 4-way Itanium 2 TPC-C benchmarks, announced in November of 2002, were the culmination of a two-year project involving engineers from Intel, HP, and Oracle. Since that time, multiple groups in Oracle and Intel have continued to work closely on multiple versions of Oracle and Linux-based Itanium platforms to ensure performance and stability for enterprise solutions. This talk discusses the initial performance work and the evolution of Oracle's and Intel's focuses, and presents some of the current areas Oracle and Intel are jointly investigating.

Brian Hirano - Oracle

Brian Hirano is a Consulting Member of the technical staff in Server Technologies at Oracle Corporation. He leads the Oracle effort to release TPC-C benchmarks on McKinley-based Itanium platforms on Linux, HP-UX, and Windows, working with teams from Intel and HP. In addition to his development duties in the Oracle Database's Virtual Operating System group, Brian also works with hardware and operating system vendors on Oracle-related issues.

 
An Update on Xen on Itanium

Xen is rapidly becoming the de facto standard for open-source virtualization, with capabilities and performance matching or exceeding leading industry products. Paravirtualization techniques, efficient inter-domain virtual I/O mechanisms, clever migration, and support for multiple architectures (including VT and Pacifica hardware) have contributed to a broad base of developers and piqued industry interest. Xen/ia64 is the first non-x86 architecture supported by Xen. It is still a work in progress, but the core hypervisor component utilizes code and/or experience from Xen, Linux/ia64, and the HP vBlades research project. Many interesting strategies are employed to ensure correctness, optimize performance, and leverage the many rapidly developing layers of tools provided by Xen.

We will provide a brief overview of virtualization in general, Xen specifically, and the current status of Xen/ia64. Then, we will spend the remaining time discussing some interesting details about the inner workings of Xen on Itanium.

Alex Williamson - HP

Alex Williamson is a member of HP's Open Source and Linux Organization focusing on HP Integrity enablement and more recently Xen/ia64. Alex has been involved with Linux/ia64 since 2000 and has made numerous contributions to the Linux kernel.

 
Tuesday
Enterprise Graphics on IPF

Large shared memory architectures enable friendly programming models and allow efficient processing and visualization of large data sets produced in the areas of computer-aided design (CAD), science and engineering simulations, and new high-resolution sensor technology. In this talk, we'll look at the SGI Altix multiprocessor systems as an example of large shared memory architectures. We'll then showcase applications that have a large memory footprint in genome matching and visualization of CAD and high-resolution sensor data.

Hansong Zhang - SGI

Dr. Hansong Zhang leads CPU-based visualization efforts at SGI, where he advocates the cross-pollination between parallel computing, visualization, and media applications. Prior to SGI, Zhang worked at nVidia on real-time special effects. He was also the graphics architect at Intrinsic Graphics, a vendor of cross-platform game software. Zhang received his degree from the University of North Carolina, Chapel Hill.

 
MCA: Machine Check Architecture

The Itanium Machine Check Architecture (MCA) is at the center of the Itanium reliability, availability, and serviceability (RAS) approach. Itanium's MCA defines methods and requirements that tie together the processor, processor abstraction layer (PAL), system abstraction layer (SAL), operating system (OS), and application. This presentation will cover the various components and their roles, and then turn the focus to the MCA foundations: the PAL and the processor that it abstracts.

Cameron McNairy - Intel

 
Itanium Solutions Alliance Developer Days
Wednesday
Itanium Architecture

The Itanium architecture and the paradigm of explicit parallel instruction computing (EPIC) are often poorly understood. This presentation will cover important aspects of the EPIC paradigm, including software pipelining, the register stack engine, predication, parallel instruction groups, data and control speculation, and many other mysteries of the Itanium application and system architectures.

Cameron McNairy - Intel

 
Hardware Overview

This presentation will be an overview of the current Itanium product lines offered in the marketplace and a quick summary of the integral system specifications that set these systems apart.

Jeff Donsbach - HP

Jeff Donsbach is a Senior Software Engineer in HP's Solutions Alliances Engineering organization and Linux Expertise Center. The group helps ISVs, large and small, port and optimize their applications for HP platforms. Jeff has 20+ years of application development experience on various UNIX flavors and 15 years of experience working with 64-bit systems. Jeff has worked with a wide range of applications and ISVs from various industries, including Databases, CAD/CAM, Software Development Tools, Molecular Modeling, High Performance Computing, and Middleware.

 
Itanium Firmware (EFI)

This session provides an overview of the extensible firmware interface (EFI), which is used to manage system boot, install, diagnostics, and firmware properties.

Jeff Donsbach - HP

 
A Systematic Approach to Tuning Software

In this talk, we will look at performance optimization and bottleneck identification. In order to optimize an application, one needs to understand the "phase space" defined by the hardware and the external software. One also needs to understand the application itself: the algorithms used and the overall impact on the hardware platform. Furthermore, one needs to know which hardware/software tools are available for performance work.

This talk will therefore try to define a systematic and detailed approach in this field:

  • Definition of the hardware/compiler phase space:
    • CPU specifications (frequency, microarchitectural features), multi-core designs, cache sizes, bus speeds, chip sets, I/O rates, etc.
    • Compilers (versions, features, flags, etc.) and the compilers' encounters with the application software, algorithms, programming style, etc.
  • Review of performance tools (hardware/software)
  • Illustration of measurements inside our phase-space with a few applications, ideally in three forms:
    • Software kernels (testing only one feature at a time)
    • Well-known physics benchmark jobs (typically with emphasis on one physics feature, such as tracking in detector geometries, etc.)
    • Full-blown applications (e.g. a physics simulation framework, etc.)

The talk will hopefully provide some answers, and also give the audience enough "ammunition" to get started on their own.
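
As a rough illustration of the "software kernels (testing only one feature at a time)" idea, the sketch below times two kernels that perform the same arithmetic on the same data and differ only in access order. The kernel names, the array size, and the `measure` helper are assumptions made for this example and are not taken from the talk; in an interpreted language the stride effect is heavily blurred, so read it as a template for a harness rather than as a meaningful measurement.

```python
import timeit
from typing import Callable, Dict

N = 1 << 16                 # illustrative problem size (assumption)
DATA = list(range(N))


def sum_unit_stride() -> int:
    """Visit every element once, in address order."""
    return sum(DATA[i] for i in range(N))


def sum_stride_16() -> int:
    """Visit every element once, but in 16 interleaved passes."""
    total = 0
    for start in range(16):
        total += sum(DATA[i] for i in range(start, N, 16))
    return total


def measure(kernels: Dict[str, Callable[[], int]],
            repeat: int = 3, number: int = 2) -> Dict[str, float]:
    """Return the best observed seconds per call for each kernel."""
    return {
        name: min(timeit.repeat(fn, repeat=repeat, number=number)) / number
        for name, fn in kernels.items()
    }


timings = measure({"unit_stride": sum_unit_stride, "stride_16": sum_stride_16})
for name, seconds in timings.items():
    print(f"{name:12s} {seconds * 1e3:8.3f} ms per call")
```

Because both kernels compute the same sum, any timing difference can be attributed to the one feature that was varied, which is the point of isolating features before moving on to benchmark jobs and full applications.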

Sverre Jarp - European Organization for Nuclear Research

Sverre Jarp is the Chief Technology Officer at CERN's openlab for DataGrid Application, a joint collaboration with industry to assess leading-edge information technology for the Large Hadron Collider's Computing Grid in 2007. He has been working in computing at CERN, the European Organization for Nuclear Research, for over 30 years and has held various managerial and technical positions promoting advanced but cost-effective computing solutions for the laboratory. In 2001-02, he spent a sabbatical year at HP Labs, Palo Alto, California, USA, working on software for the Itanium Processor Family. His current field of interest is compiler optimization. Jarp holds a degree in Theoretical Physics from the Norwegian University of Science and Technology in Trondheim.

 
Completing a Successful Migration

Tips on evaluating, locating, and resolving problems before they happen in migrations will be covered. Information on additional resources on Linux solution migrations will also be presented.

Jeff Donsbach - HP

 
 
Tools and Tuning
Monday
Columbia Application Tuning Case Studies

This talk will present several case studies of application performance enhancements on the SGI Altix platform. The enhancements include both explicit (dplace) and implicit (cpubind/cpuset_pin) process-pinning, eliminating memory contention in OpenMP applications, eliminating unaligned memory accesses, and system profiling. These enhancements enabled 2- to 20-fold improvements in application performance.
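
For readers unfamiliar with process pinning, the following minimal sketch shows the general idea using the Linux affinity calls exposed by Python's `os` module. It is not the `dplace` or cpuset mechanism named in the abstract; it is a swapped-in technique that restricts a process to one CPU in a comparable way. The `pinned_to_one_cpu` helper is hypothetical, and on platforms without `os.sched_setaffinity` (or where the call is not permitted) it yields `None` and does nothing.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def pinned_to_one_cpu() -> Iterator[Optional[int]]:
    """Temporarily restrict the calling process to one CPU it may already use."""
    if not hasattr(os, "sched_setaffinity"):
        yield None                      # affinity API not available here
        return
    try:
        original = os.sched_getaffinity(0)
        target = min(original)          # arbitrary choice: lowest allowed CPU
        os.sched_setaffinity(0, {target})
    except OSError:
        yield None                      # pinning not permitted in this environment
        return
    try:
        yield target
    finally:
        os.sched_setaffinity(0, original)   # restore the previous CPU set


with pinned_to_one_cpu() as cpu:
    inside = os.sched_getaffinity(0) if cpu is not None else None
    print("pinned to CPU:", cpu, "allowed set:", inside)
```

On a NUMA machine, keeping a process on one CPU also keeps it close to the memory it first touched, which is one reason pinning can help.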

Johnny Chang - National Aeronautics and Space Administration

Johnny Chang is a member of the Application Performance and Productivity group at the NASA Advanced Supercomputing (NAS) Division located in Moffett Field, California. He is part of a group that provides consulting services to the 700+ users of the Columbia supercomputer, a cluster of twenty 512p SGI Altix systems. His work includes code porting, debugging, tuning and optimization, and code scaling. Johnny received his PhD in Chemical Physics from the University of Texas at Austin in 1985. He has published papers in multi-photon dynamics, quantum scattering, path-integral methods, quantum functional sensitivity analysis, and, most recently, weather modeling.

 
HP Caliper: An Update to the Linux IPF Performance Tool

HP Caliper is a sophisticated general-purpose performance analysis tool that takes advantage of the Itanium processor's advanced performance monitoring unit to provide detailed and accurate performance measurements at the application and system level with minimal perturbation to the system's behavior.

Besides an overview of HP Caliper, we will discuss new features, including system-wide profiling and a new graphical user interface based on the rich client platform of Eclipse.

Curt Wohlgemuth - HP

Curt Wohlgemuth is an engineer in the HP Caliper project. He has worked at HP for many years, primarily in the areas of language tools, dynamic translation, and performance tools.

Steve Williams - HP

Stephen Williams is a member of the HP Caliper team and has worked at HP for the past 17 years. He has worked on debuggers and performance tools and has specialized in user interfaces.

 
VTune Update

This talk will cover what's new for tuning Intel Itanium 2-based applications, including native Eclipse IDE and NUMA aware support for data collection.

Paul M. Cohen - Intel

Paul Cohen is a Performance Tools Product Line Marketing Manager at Intel. He is responsible for Intel tools targeted at improving the performance of customer applications. His current focus is improving usability of the VTune Performance Analyzer, making it a robust enterprise-grade solution able to deal with extremely large executables (100MB+) that other products are unable to profile. In addition, he is working on integration of the VTune Analyzer with Intel C and FORTRAN compilers under Eclipse, with the ability to provide a close connection between Intel compiler optimization reports and performance bottlenecks represented in the VTune Analyzer.

 
An Update on the Current State of Open|SpeedShop

Open|SpeedShop is SGI's next-generation Linux performance analysis tool. Based on the concepts of SGI's IRIX SpeedShop, Open|SpeedShop is designed to be modular and easily extendable. It supports the concept of plugins, which allow users to create their own performance experiments. Another key feature of the performance tool is its usability. Its user interface is designed for scientists in general, not just computer scientists. Open|SpeedShop currently supports four user interfaces: a GUI, an interactive command line, a batch command file, and a pure Python module. The Open|SpeedShop baseline functionality includes support for single system image (SSI) machines and for clusters (i.e. multiple OS kernels).

Current experiments are exclusive and inclusive user time, program counter (PC) sampling, MPI call tracing, input/output tracing, floating point exception tracing, and CPU hardware performance counter experiments. Open|SpeedShop enables FORTRAN (77, 90, and 95), C, and C++ programmers to use an advanced performance analysis tool within the open-source environment. The infrastructure and base components are released as open source under the GPL and LGPL licenses. Open|SpeedShop is being co-funded by the Department of Energy (DOE).
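
The distinction between exclusive and inclusive time can be shown with a small worked example. The sketch below is not Open|SpeedShop code or its Python module; the `CallNode` type, the function names in the tree, and the timings are invented for illustration. Exclusive time is the time spent in a function's own body, and inclusive time adds the time spent in everything it calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CallNode:
    """One function in a hypothetical call tree (illustrative type only)."""
    name: str
    self_seconds: float                      # exclusive time: the body itself
    children: List["CallNode"] = field(default_factory=list)


def inclusive_time(node: CallNode) -> float:
    """Exclusive time of the node plus the inclusive time of each callee."""
    return node.self_seconds + sum(inclusive_time(c) for c in node.children)


def report(root: CallNode) -> Dict[str, Dict[str, float]]:
    """Collect both metrics for every node (assumes unique function names)."""
    rows: Dict[str, Dict[str, float]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        rows[node.name] = {
            "exclusive": node.self_seconds,
            "inclusive": inclusive_time(node),
        }
        stack.extend(node.children)
    return rows


tree = CallNode("main", 1.0, [
    CallNode("solve", 4.0, [CallNode("dot", 2.5)]),
    CallNode("write_output", 0.5),
])
rows = report(tree)
for name, row in rows.items():
    print(f"{name:13s} exclusive={row['exclusive']:.1f}s inclusive={row['inclusive']:.1f}s")
```

In this made-up tree, `main` has a small exclusive time but the largest inclusive time, while `solve` is where most of the exclusive time accumulates. Reading the two columns together is what points a user at the function worth tuning.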

Jack Carter - SGI

Jack Carter has over 20 years of experience working with compilers and compiler-related tools, with extensive work on linkage and post-linkage object transformation technology. Currently he is a member of SGI's Open|SpeedShop team.

 
Tuesday
Valgrind

Valgrind is a GPL'd suite of simulation-based debugging and profiling tools for Linux. Around a common core a number of tools have been built, two of which are Memcheck, a memory error detector, and Cachegrind, a low-level cache profiler. The system is structured as a common core, which provides CPU virtualization, debug info management, and error management, and handles other simulation nasties, particularly signals, threads, and syscalls. The rich set of services provided by the core makes it relatively easy to build sophisticated dynamic analysis tools. The project Web site is http://www.valgrind.org.

Valgrind currently runs on {x86,amd64,ppc32,ppc64}-linux. A key component is dynamic-translation based CPU virtualization. This converts blocks of code into an architecture-neutral intermediate representation, hands them to the currently active tool for instrumentation, and then re-synthesizes runnable code from them. In this talk, I will take a look at the challenges of porting this and other important Valgrind components to Itanium.

Julian Seward - OpenWorks

Julian Seward founded the Valgrind project in 2000 and is the project lead and a full time developer. His background is in compiler technology for functional programming languages. He worked for several years on the Glasgow Haskell Compiler, an open-source compiler for the functional language Haskell, with earlier postdoctoral work on compilation of a hybrid functional/OO language. More recently, he led a small group developing a vectorizing code generator for SIMD architectures. He holds a PhD in Computer Science from the University of Manchester, UK. He is heavily involved with open-source software and is also the author of bzip2, a widely used lossless compression program.

 
Wednesday
A Dynamic Instrumentation-Based System for Building Program Analysis Tools for the IPF Platform

We will present a dynamic instrumentation-based system called Pin for building a variety of program analysis tools for the IPF platform. In this talk, we will introduce the basic concepts of dynamic instrumentation and provide details of the inner working of this system. We will also talk about various optimizations that happen in this system to ensure that programs running under control of Pin perform reasonably well. Some specific features of IPF, which create challenges for building a system like Pin, will be explored. We will provide several real world examples of how this system has been used for building program analysis tools. We will also talk about various applications of this system in building tools for architecture research and performance analysis.

Jasper Kamperman - Intel

Jasper Kamperman is the Product Manager for the Performance Tools Lab in Intel's Developer Products Division. He has a Master's Degree in Physics from the University of Utrecht in the Netherlands and holds a PhD in Computer Science from the University of Amsterdam. Jasper has presented at numerous conferences and published in scientific as well as trade journals. Before joining Intel, Jasper was the Director of Product Management at Reasoning, Inc. Previous engagements include a position as researcher at CWI, the Dutch Center for Mathematics and Computer Science, and consultant with ID Research (Now Ordina Research), a high-tech consultancy firm.

 
OpenMP: Past, Present, and Future

As the industry moves to multi-core processors, multi-threaded software will be essential. OpenMP is the industry standard API for writing multi-threaded software. It is focused on the needs of applications programmers and attempts to make it relatively simple to write parallel software. In this talk, we will discuss the history of OpenMP, some of the more innovative ways it is being used today, and OpenMP innovations you can expect to see in the future.
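
As a rough illustration of the directive style OpenMP uses (this sketch is not taken from the talk, and the function name sum_of_squares is hypothetical), a serial loop can be marked for parallel execution with a single pragma. The reduction clause gives each thread a private copy of total and combines the copies when the loop ends.

```c
#include <assert.h>

/* Hypothetical example: returns 0^2 + 1^2 + ... + (n-1)^2.
 * With OpenMP enabled, the loop iterations are divided among threads.
 * Without OpenMP support, the pragma is ignored and the loop runs serially. */
long sum_of_squares(long n)
{
    assert(n >= 0);
    long total = 0;

    /* Each thread accumulates into its own private total; the private
     * copies are added together at the end of the loop. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++) {
        total += i * i;
    }
    return total;
}
```

The result is the same whether or not the compiler is asked to honor the pragma, which is one reason this style is often described as incremental: the serial program stays valid.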

Timothy Mattson - Intel

Tim Mattson earned a PhD for his work on quantum molecular scattering theory (UCSC, 1985). This was followed by a Post-doc at Caltech where he worked on the Caltech/JPL hypercubes. Since then, he has held a number of commercial and academic positions with high performance computers as the common thread. Application areas have included mathematics libraries, exploration geophysics, computational chemistry, molecular biology, and bioinformatics.

Dr. Mattson joined Intel in 1993. Among his many roles at Intel, he was applications manager for the ASCI teraFLOPS project, helped create OpenMP, founded the Open Cluster Group (OSCAR), and launched Intel's programs in computing for the Life Sciences. Currently, Mattson is conducting research on abstractions that bridge across parallel system design, parallel programming environments, and application software. This work builds on his recent book on Design Patterns in Parallel Programming (written with Professors Beverly Sanders and Berna Massingill and published by Addison Wesley). The patterns provide the "human angle" and help keep his research focused on technologies that help general programmers solve real problems.

 
 
Update on the Perfmon2 Interface

In this short presentation, we will update the audience about the progress of the perfmon2 interface. What are the latest features on Itanium and other architectures? We will cover the user level tools and Montecito support, and will report on the progress on getting our implementation accepted in the mainline kernel for all major platforms.

Stéphane Eranian - HP

Stéphane Eranian is a Senior Research Scientist at HP Labs, where he has been working on the porting of Linux to the IA-64 platform since 1998. He has made numerous contributions to the Linux/IA-64 kernel and related user-level programs. He is the main architect of the Linux/IA-64 kernel performance monitoring subsystem (perfmon). He is also the creator of the pfmon tool, which uses this subsystem to collect performance information.

Before joining HP, Stéphane worked on his PhD at Chorus Systems (now Jaluna) in France. He holds a D.E.A. (BSc degree) in Operating systems from Universite PARIS 6, France, and a Doctorate (PhD degree) in Computer Science from Universite PARIS 7, France. He is a member of USENIX and co-author of "IA-64 Linux Kernel: Design and Implementation."

 
Focus on GCC
Monday
The ISP RAS Effort to Improve GCC for Itanium

Ongoing work at ISP RAS on improving GCC for Itanium processors will be presented. Discussion will cover a past project with HP on improving GCC instruction scheduling and the current effort on implementing a new VLIW-targeted instruction scheduler. Future plans on improving GCC for Itanium and potential collaboration projects will also be presented including plans for a GCC meeting in Moscow this summer.

Arutyun I. Avetisyan - Institute for System Programming, Russian Academy of Science

Arutyun Avetisyan is the Deputy Director of the Institute for System Programming (ISP) at the Russian Academy of Sciences (RAS) in Moscow, Russia. His research focuses include parallel and distributed programming, cluster and grid technologies, and compiler technologies. Dr. Avetisyan leads a project on a model based parallel program performance tuning system.

 
GCC IP Issues

This talk will cover a variety of intellectual property issues that come up while working on GCC, including:

  • Copyright: Assignments of copyright, and how we deal with issues of contributions of code from other open source/commercial projects.
  • Patents: How we deal with them in GCC, and what we require of companies that are going to contribute to GCC.
  • General other issues related to intellectual property and GCC.
Dan Berlin - Google

Daniel Berlin is an Advisory Engineer at IBM T.J. Watson Research Center, where he works on compiler optimization research for current and future IBM architectures. His main focus is designing and implementing new and existing optimization algorithms for GCC. He is responsible for implementing and maintaining several passes in GCC, including alias analysis, various SSA optimizations, and high level loop transforms. He received his CS Degree from the University of Rochester and has a JD from George Washington University School of Law.

 
Open64: An Alternative Backend for GCC

While GCC's Tree-based SSA optimization has been making good progress, the Itanium processor may benefit more in the near future from alternate high-performance optimizations. The Open64 compiler is the basis of the Open Research Compiler (ORC), which Intel has been promoting for Itanium-specific optimizations over the past couple of years. This effort aims to present Open64 as an alternative backend for GCC/G++ on the Itanium/Linux platform. In addition to Itanium, this alternative backend supports the EM64T/IA32 target as well as several other embedded processors. In alignment with this effort, HP is coordinating the update of the GCC/G++ front-end and driving the quality on the Itanium/Linux platform. In this talk, the short- and long-term perspectives of this alternative backend will be presented.

Shin-Ming Liu - HP

Shin-Ming Liu is the Project Manager for High-Level Optimization and GCC of the Itanium C/C++ Compiler Section of the Java, Compiler, and Tools Lab at HP in Cupertino, California. Liu led the development effort for the high-level optimization and code generator project in the compiler targeted for the Itanium processor. In this project, he helped redesign the high-level optimization into a highly robust, scalable, and efficient component by rearchitecting the infrastructure, from which many new techniques were developed. Many highly recognized programming analysis methods were adopted as well. Liu led the reinvention of compiler development methodology by focusing on modularization, memory footprint control, canonical internal representation, and automatic error detection. Before joining HP, he worked at MIPS/SGI in the area of compiler front end, middle end, back end, and linker. During that time, he co-authored several technical publications.

 
Aliasing in GCC

This talk will cover aliasing in GCC, including:

  • An overview of the algorithms used to generate aliasing information.
  • An overview of how the aliasing information is represented in GCC's IR.
  • The improvements made in recent GCC versions to both of the above.
Dan Berlin - Google

 
Superblock Update

Superblock scheduling is a common technique to increase the level of ILP in generated code. By performing tail duplication, a Superblock-forming compiler creates a longer extended basic block, simplifying the task of moving instructions across basic block boundaries. More significantly, the control flow into the duplicated tail is dramatically simplified. This allows the compiler to draw much tighter bounds on the conditions that exist when the block is executed and allows the code in the block to be specialized for those conditions. This combination of radical control flow transformation followed by specializing optimizations, termed "structural compilation," has been shown in the OpenIMPACT compiler to be particularly useful in developing ILP when compiling for the Itanium processor.
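
The effect of tail duplication can be sketched at the source level. The two functions below are hypothetical and are not taken from GCC or OpenIMPACT; a real compiler performs this transformation on its intermediate representation, not on C source. In the first function both arms of the branch merge into a shared tail, so nothing is known about k there. In the second, each path gets its own copy of the tail, and the copy on the frequent path can be specialized for k == 4.

```c
/* Before: the two paths join, and the tail must work for any k. */
int blend_merged(int x, int hot)
{
    int k;
    if (hot) {
        k = 4;          /* assumed to be the frequently taken path */
    } else {
        k = x % 7;
    }
    /* Shared tail: reached from both arms. */
    return x * k + k;
}

/* After tail duplication: the tail is copied into each path.
 * On the hot path k is known to be 4, so that copy is specialized. */
int blend_superblock(int x, int hot)
{
    if (hot) {
        /* Specialized copy of the tail with k replaced by the constant 4. */
        return x * 4 + 4;
    }
    int k = x % 7;
    /* Unspecialized copy of the tail for the infrequent path. */
    return x * k + k;
}
```

Both functions return the same value for every input; only the shape of the control flow differs. The hot path in the second version is a single straight-line block with one entry, which is the property a superblock scheduler relies on.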

As a first step toward developing structural compilation techniques in GCC, we implemented Superblock formation at the Tree-SSA level. By performing structural transformations early, we give the compiler's high level optimizers an opportunity to specialize the transformed program, thereby cultivating higher levels of ILP. The early results of this modification are mixed, with some benchmarks improving and others slowing. I will present the effects of this structural transformation on later optimizations and thoughts on the changes that will be necessary to allow optimizations to benefit from this transformation.

Robert Kidd - University of Illinois at Urbana-Champaign

Robert Kidd is a graduate student in the IMPACT research group at the University of Illinois at Urbana-Champaign. Within the IMPACT compiler, he is responsible for the development of an interprocedural analysis and optimization framework that fits within the usage model of a traditional production compiler. Previous work within IMPACT has addressed GCC compatibility and general maintenance of the code generator. His work with GCC, supported by the Gelato Federation, aims to improve the performance on the Itanium processor.

 
An Interblock VLIW-Targeted Instruction Scheduler for GCC

Modern VLIW architectures (e.g. Itanium) require instruction level parallelism (ILP) to be explicitly exposed by a compiler. An instruction scheduler is a key compiler component for utilizing ILP. The current GCC scheduler has a number of pitfalls in approaching this goal, including: the oldest interblock scheduling algorithm, non-optimal region formation, a traditional two-pass execution scheme, and lack of transformations for eliminating false dependencies.

This presentation will cover an ongoing approach for implementing a new aggressive instruction scheduler for GCC. The scheduling algorithm is based on a selective scheduling approach. It is mainly targeted for VLIW-like platforms, but the framework being implemented is general enough to be used for other targets in the future. The key features of the approach are as follows: it works with DAG regions, supports code motion by adding bookkeeping insns, supports register renaming and forward substitution, and integrates with software pipelining. We will discuss the algorithm and its adaptation to GCC, implementation issues, and the current state of the project.

Andrey Belevantsev - Institute for System Programming, Russian Academy of Science

Andrey Belevantsev is a Project Manager for the GCC Itanium project at RAS with a team of six. The current project of the team is implementing an aggressive VLIW-targeted interblock scheduler for GCC. Andrey's responsibilities include leading the team, designing the scheduler infrastructure, and implementing the code motion part. His research interests lie in the area of compiler optimizations, static analysis, and security, focusing on instruction scheduling, alias analysis, and interprocedural optimizations.

 
Parallel Programming with GCC

Multiprocessor systems are becoming increasingly popular, but taking advantage of their parallel capabilities is not always straightforward. Software developed for these systems must explicitly make use of concurrency.

In this talk, I will describe two recent additions to the GNU Compiler Collection (GCC) for developing software that can take advantage of parallelism: vectorization and OpenMP. Vectorization is a compiler feature that takes advantage of the multimedia capabilities of modern CPUs by offloading the execution of some inner loops into separate co-processors. OpenMP is a standard specification of compiler directives for C, C++, and FORTRAN. It provides new directives to specify parallelism, synchronization, and data sharing. This talk will describe both features in detail, provide usage examples, and give tips to take full advantage of these features when developing your applications.
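
As a small illustration of the first feature (not taken from the talk; the function name saxpy is just a conventional label), the loop below is the kind of inner loop a vectorizer looks for: a fixed stride, no dependence between iterations, and pointers declared restrict so the compiler may assume the arrays do not overlap. In GCC, vectorizing such a loop usually means rewriting it to process several elements per iteration with the CPU's SIMD instructions; building with -ftree-vectorize at an optimizing level is one way to ask for that, though whether the loop is actually vectorized depends on the compiler version and target.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * restrict promises the compiler that x and y do not overlap,
 * which removes a possible dependence between iterations. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source code stays an ordinary scalar loop; any vectorization happens inside the compiler, so the function behaves the same with or without it.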

Diego Novillo - Red Hat

Diego Novillo was born in Cordoba, Argentina, and holds a PhD in Parallel Computing from the University of Alberta, Canada. He is currently a member of the compiler group at Red Hat Canada, working to improve the GNU Compiler Collection (GCC), developing new ports and implementing new analyses and optimizations. He is one of the main architects of GCC's global optimization framework.

 
Tuesday
LTO: A Brief Introduction

Many compilers have obtained significant performance wins by using "link-time optimization," i.e. by performing optimizations that cross the boundaries of a single program unit. For example, if the argument to a function is a constant in one module and the function is defined in another module, the result of the function call may be constant as well. But compilation of either module independently cannot determine that fact.
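
The constant-argument case can be sketched as follows. The two "modules" are shown in one listing for brevity, and the names square and tile_area are hypothetical; in practice each part would live in its own source file and be compiled separately.

```c
/* ---- module A (imagine this in its own file) ---- */
int square(int v)
{
    return v * v;
}

/* ---- module B (imagine this in a second file) ---- */
int square(int v);   /* only this declaration is visible to module B */

int tile_area(void)
{
    /* The argument is the constant 3. Compiling module B alone, the
     * compiler sees no body for square() and has to emit a real call.
     * An optimizer that sees both modules together could fold the
     * whole expression to the constant 9. */
    return square(3);
}
```

The program's result is the same either way; what changes is whether the optimizer has enough information to replace the call with its value.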

The GNU compiler collection (GCC) does not presently implement link-time optimization, although it does provide a limited form of inter-module optimization, as implemented by Geoff Keating. Working with partners at AMD, HP, and IBM, we have developed a proposal for implementing link-time optimization in GCC based on serializing GCC's existing data structures. Thus, our proposal is conservative in that it leverages GCC's existing data structures and requires only minimal changes to GCC's core optimizers. A significant advantage of our approach is that the serialized data structures will be available to other consumers, such as program analyzers and IDEs. Finally, our approach would facilitate the implementation of the most significant missing feature in G++: the "export" keyword.

Mark Mitchell - CodeSourcery

Mark Mitchell is the founder of CodeSourcery and has been the Free Software Foundation's Release Manager for GCC since 2001. Mitchell received degrees in Computer Science from Harvard and Stanford. He left Stanford's PhD program after starting CodeSourcery, where, with his fellow Sourcerers, he strives to make the GNU Toolchain the choice of software developers everywhere.

 
LLVM: A Brief Introduction

This talk will provide a brief introduction to LLVM (http://llvm.org), focusing on LLVM's robust interprocedural link-time optimization, runtime optimization, and just-in-time code generation support. Work is currently underway to integrate LLVM's mid-level and interprocedural optimization capabilities into the GNU Compiler Collection (GCC) compiler. Design, implementation, and status of GCC integration will be discussed.

Chris Lattner - Apple

Chris Lattner is the Chief Architect of the LLVM Compiler Infrastructure, which aims to build efficient and highly optimizing open-source compiler components. He currently leads a team at Apple Computer, which aims to integrate the GCC front-end with the LLVM optimizer and code generator, providing GCC with interprocedural link-time optimizations as well as a modern and efficient code generator. Chris holds a PhD in Computer Science from the University of Illinois at Urbana Champaign (UIUC).

 
Focus on Scalability
Wednesday
Blktrace: An Overview

"You can't count what you can't measure" is an old software engineering truism that inspires one to develop means to accurately and efficiently measure the various subsystems within Linux in order to make concrete performance improvements to the Linux kernel itself. Given that measuring how Linux manages I/O is a key component towards understanding overall system performance, Jens Axboe has recently been working on a new capability within Linux called Blktrace, which allows one to efficiently capture block I/O subsystem events for later analysis.

This presentation will start by providing an overview of Blktrace through a discussion about its kernel implementation and an overview of the utilities provided to capture traces. We will then show how it is currently being used to measure the LVM/DM subsystem as part of an effort to understand Linux IO performance from top-to-bottom.

Alan Brunelle - HP

Alan D. Brunelle works for HP's Open Source and Linux Organization in the Linux Scalability and Performance Group. He has been working on tools to measure performance in order to help understand how to improve Linux in that area. During his time in the group, Alan has primarily been focused on the Linux storage I/O stack, and his work on the blktrace utility has combined efforts in tool smithing with I/O. Prior to his Linux work, Alan worked in Tru64 TruCluster technology, again primarily in the I/O sphere. Prior to joining HP in 1988, he worked on attached processor card software with Alacron, Inc, as well as graphics algorithmic design and Unix/Mach device driver development with CalComp. Alan earned an MSc in Computer Science from UMASS/Lowell (1989) and a BSc in Computer Science from the University of New Hampshire (1984).

 
Scaling Linux to 512 Processors and Beyond

SGI's Altix family of servers currently supports up to 512 Intel Itanium 2 processors and four terabytes of cache-coherent shared main memory, and newer platforms will substantially increase those limits. Some high-performance computing workloads benefit from executing on maximum hardware configurations and in a single system image environment. In the past few years as hardware capacity has increased, SGI and the Linux community in general have pushed kernel scalability to keep up. This presentation discusses the technical challenges of scaling to hundreds, even thousands, of processors and many terabytes of memory, what has been done to overcome those challenges, and what work remains.

John Hawkes - SGI

John Hawkes has been involved with the development and tuning of high-performance multiprocessor computers since the early 1970s, from HP's earliest multiprocessor Basic Language computer, to Elxsi's custom message-passing SMP, to MIPS's R6000 Uni- and Multiprocessors, to SGI's Challenge SMP and Altix ia64 ccNUMA. His involvement with Linux dates back to SGI's exploratory work in the late 1990s and continues today with the Altix servers, principally focusing on the measurement and analysis of system performance and scaling. In recent years he has co-authored papers about Linux performance for Usenix/Freenix and the Ottawa Linux Symposium (OLS).

 
NFS Performance

There have been many complaints about NFS performance on the Linux kernel mailing lists when it is compared with performance on IRIX or Solaris. Is it *really* so bad? And, what can be done to fix the problem? Over the Southern Summer, Gelato@UNSW has been trying to find out. We currently have tools to capture traces from real systems, anonymize them (so that real users don't mind if we grab information), and replay at a higher rate. In doing so, we have discovered, firstly, that there are problems; secondly, that there is a degree of regularity in most traces that can be exploited to improve NFS performance generally. This is a work-in-progress talk; we expect to have more results by the time of this conference.

Peter Chubb - University of New South Wales

Peter Chubb is a Senior Research Engineer at National ICT Australia and a Research Officer at UNSW. He completed his PhD under Associate Professor John Lions in 1989. Peter worked at Softway Pty Ltd as a consultant and software engineer doing UNIX kernel, security, and embedded work. He joined Gelato@UNSW at its inception in 2002.

Peter started using UNIX in 1979 and has never used Microsoft operating systems for more than a few moments. His home life includes wife Lucy, who also works at Gelato@UNSW, and two small daughters. Peter's hobbies include music (he runs a recorder consort), aquaria (3 tanks at present, no room for more), and fine wines.

 
Scalability Mini-Track Wrap Up

In each of the scalability presentations, we will try to leave time for questions and answers. However, we expect/hope that attendees will have additional scalability questions, issues, or topics not directly related to the presentations. The scalability wrap up session will provide an opportunity to discuss general scalability topics and areas for further investigation and collaboration to measure and improve the scalability of Linux on Itanium platforms. To this end, we encourage attendees to share any scalability or general performance concerns, war stories ("wins" are good, too!), unsolved mysteries, work in progress, etc., including a couple of slides/graphs if you think that would be helpful to illustrate the issue.

Lee Schermerhorn - HP

As a member of the Linux Performance and Scalability team within HP's Open Source and Linux Organization (OSLO), Lee Schermerhorn works on performance characterization and engineering for Linux on HP platforms (primarily HP's Itanium-based Integrity platforms), with emphasis on NUMA scheduling/affinity and (storage) IO performance.

 
Advanced Topics
Monday
Mathematical Modeling to Formally Prove Correctness

Formal verification attempts to establish the correctness of a computer artifact (hardware, software, microcode, protocol, etc.) by rigorous modeling and mathematical proof, rather than merely by testing or simulation. Formal verification in the hardware industry is widely practiced, and increasingly seen as necessary. We can perhaps identify at least three reasons:

  • Hardware is designed in a more modular way than most software, with refinement an important design method. Constraints of interconnect layering and timing mean that one cannot really design "spaghetti hardware."
  • More proofs in the hardware domain can be largely automated, reducing the need for intensive interaction by a human expert with the mechanical theorem-proving system.
  • The potential consequences of a hardware error are greater, since such errors often cannot be patched or worked around, and may in extremis necessitate a hardware replacement.

It is not surprising that a considerable amount of effort has been in the floating-point domain. Floating-point algorithms have proven themselves difficult to get right. Yet in marked contrast to some other targets for formal verification, it is not hard to come up with widely accepted formal specifications of how floating-point operations should behave. In fact, many operations are specified almost completely by the IEEE Standard governing binary floating-point arithmetic. However, in some other respects, floating-point operations present a difficult challenge for formal verification. We will describe some of our work in formally verifying algorithms for operations such as division, square root, and transcendental functions for the Intel Itanium architecture.
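
To give a flavor of the kind of algorithm involved: on Itanium, division is carried out in software by refining an approximate reciprocal. The sketch below shows only the textbook Newton-Raphson refinement step in plain C. It is not the Itanium instruction sequence and not one of the verified algorithms; it uses ordinary multiplies and adds instead of fused multiply-add, and the function name refine_reciprocal is hypothetical.

```c
#include <assert.h>

/* Refines an initial guess y0 of 1/b using the Newton-Raphson update
 *     e = 1 - b*y,   y = y + y*e
 * In exact arithmetic each step squares the relative error e. */
double refine_reciprocal(double b, double y0, int steps)
{
    assert(steps >= 0);
    double y = y0;
    for (int k = 0; k < steps; k++) {
        double e = 1.0 - b * y;   /* how far b*y is from 1 */
        y = y + y * e;            /* correct the guess by that amount */
    }
    return y;
}
```

For example, starting from the guess 0.3 for 1/3, the relative error falls from 0.1 to 0.01 to 0.0001 and so on. The hard part, and the part a formal proof has to address, is showing that the final result is correctly rounded for every possible input, which testing a handful of values cannot establish.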

John R. Harrison - Intel

John Harrison has worked in formal verification and automated theorem proving since 1990, when he joined Mike Gordon's "Hardware Verification Group" (HVG) at the University of Cambridge Computer Laboratory. As well as working on the development of the HOL theorem prover, he developed a particular interest in the formalization of real analysis and its application to formal verification of floating-point hardware. After completing his PhD research in 1995, John Harrison spent a very enjoyable year at Åbo Akademi University and Turku Centre for Computer Science (TUCS) in Turku, Finland, where he was a member of Ralph Back's Programming Methods Research Group. John Harrison then returned to Cambridge and worked on a formal model of floating-point arithmetic and its application to the verification of some realistic algorithms for transcendental functions. This work attracted the attention of Intel, and in 1998 John Harrison joined the company as a Senior Software Engineer, specializing in the design and formal verification of mathematical algorithms. He has formally verified and in many cases designed or redesigned numerous algorithms for mathematical functions including division, square root and trigonometric functions. In his limited spare time over the past 10 years, John Harrison has been working on a book, giving a comprehensive introduction to automated theorem proving. He hopes that this book will finally reach publication in 2006, and the associated code is already available from his Web page.

 
Kernel Optimization for Enterprise Workloads

Linux has been receiving a great deal of attention in the past few years. The popularity is being propelled by a wide range of adoption of Linux for enterprise computing. Major software vendors have been supporting their products on Linux for many years. As the enterprise software solution stack builds up every day, it is crucial that Linux kernel development takes this opportunity to ensure that the kernel provides the necessary infrastructure for enterprise applications to excel. This means developing enterprise-focused OS features, improving performance by extending the scalability, as well as improving many other areas.

Adding to the excitement, the Intel Itanium 2 processor is built with many innovative features that push the performance envelope. Featuring massive caches and CPU execution resources, EPIC technology (Explicitly Parallel Instruction Computing) provides a variety of optimization opportunities. In this talk, we will highlight kernel optimization work done on Linux-ia64, ranging from several critical low-level assembly routines to generic kernel components. We will present how the Linux-ia64 kernel utilizes Itanium architecture features to extend scalability and performance for enterprise workloads.

Kenneth Chen - Intel

Ken Chen works at Intel as a Linux kernel engineer. His first encounter with Itanium was to develop processor firmware for the first generation of the Itanium processor, followed by many years of work on optimizing enterprise software on Itanium architecture. For the last several years, he has worked on the Linux kernel, which he optimized for Itanium platforms, ranging from low-level assembly code to generic SMP/ccNUMA scalability. His latest venture is optimizing the Linux kernel for a wide range of enterprise workloads and collaborating with the Linux community to produce a superior enterprise-class-ready Linux kernel on Itanium.

 
Suggested Improvements in Itanium and Software

In general, the Itanium is a major step forward in computer design. Nevertheless, there are still gaps in the instruction repertoire, and the specifications of some instructions could be expanded or modified. There are also some mandates by C++ concerning corner cases, which cannot be justified by any mathematical reasoning whatsoever; there is even one IEEE mandate that cannot pass muster. A detailed list of shortcomings and possible remedies will be presented for your consideration.

Clemens C. J. Roothaan - Gelato Honorary Member

Clemens Roothaan is a Professor Emeritus of Physics and Chemistry at the University of Chicago. In the 1950's, he published detailed algorithms to solve quantum mechanical movements of electrons in molecules and atoms. Today, most computer programs in this area are based on his method. After his retirement from the University in 1988, Roothaan started to work for HP Labs in Palo Alto, California. He has worked on the Itanium design team since 1990. Currently, Roothaan is working on a large software suite of scientific tools for function evaluation.

 
Evolution of PCI IO: A Linux IO Geek's Perspective on HW

PCI has been around since 1993 and has seen substantial changes since its conception. New features and functionality have been introduced with each generation (e.g. 64-bit, 3.3v, MSI, MSI-X, Split transactions, etc). PCI-e is the latest generation and is *not* HW compatible with previous generations. This gave HW vendors the "opportunity" (forced them really) to re-implement and take advantage of some of the features PCI-e offers.

This talk will explain a few PCI features and broken HW implementations, and will cover the reasons why PCI-e is an improvement over previous PCI-X implementations.

Grant Grundler - HP

Grant Grundler was born in Toronto, Canada, and grew up near Silicon Valley, California. He graduated from California State University, Hayward with a BS in Computer Science. He lived and worked in Germany for three years as a PC technician/support, ski tour "host," windsurf instructor, and firmware designer/developer for a custom TokenBus networking card. Back in "the States," Grant worked for three years at Olivetti on SVR4 ports to i860, MIPS R4000 (M700-10), and the first Alpha workstation. Since 1993, Grant has worked for HP on HP-UX SCSI drivers and HP-UX PCI subsystem design. He currently works on IO support and drivers for both parisc- and ia64-linux ports. Grant's public presentations are available at http://iou.parisc-linux.org/.

 
Tuesday
Local and Remote Memory: Memory in a NUMA System

Memory becomes difficult to handle in a NUMA system because storage is available at various "distances" from the running process. A higher distance means longer latency or less bandwidth, and therefore implies slower access to memory. Performance in a NUMA system depends on assigning available memory to processes in such a way that memory access speed is optimized. The kernel has various mechanisms to automatically or manually control NUMA memory placement.

The page allocator attempts to locate memory that is near the node where a process is executing. However, if the data is later used by processes running on other nodes, then the memory will not have been allocated in the best way. The kernel allows manual control of memory allocation per process via memory allocation policies. Similar issues occur in the SLAB allocator. The SLAB allocator was revised last year in order to ensure that allocations occur in an optimal way and that allocations are controllable in the same way as in the page allocator.

The kernel itself must be aware of where its own data structures will be placed and ensure that data to be used by certain processors is on memory nodes local to these processors. Improvements in this area enhance the placement of core kernel structures and also allow device drivers to place their data local to hardware devices. Finally, the kernel now has the ability to migrate the physical location of pages to improve performance after a process has been reassigned to a processor on another node.
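The local-first placement described above can be illustrated with a toy Python model. This is only a sketch and not kernel code: the distance table, the free-page counts, and the function name pick_node are hypothetical, and the optional allowed set merely stands in for a per-process memory allocation policy.

```python
# Toy model of local-first NUMA page placement (not kernel code).
# DISTANCE[i][j] is a hypothetical relative cost for a CPU on node i
# to access memory on node j; a smaller value means closer memory.
DISTANCE = [
    [10, 20, 40],
    [20, 10, 20],
    [40, 20, 10],
]


def pick_node(cpu_node, free_pages, allowed=None):
    """Return the nearest node that still has a free page, or None.

    `allowed` is a hypothetical stand-in for a per-process policy that
    restricts which nodes may be used.
    """
    candidates = [
        node for node in range(len(free_pages))
        if free_pages[node] > 0 and (allowed is None or node in allowed)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda node: DISTANCE[cpu_node][node])


free = [0, 5, 5]                          # node 0 has no free pages left
print(pick_node(0, free))                 # prints 1: nearest node with memory
print(pick_node(0, free, allowed={2}))    # prints 2: policy forces node 2
```

In this model a process on node 0 falls back to node 1 when local memory is exhausted, and an explicit policy can override the default choice.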

Christoph Lameter - SGI

Christoph Lameter is the Technical Lead at SGI for the Linux Kernel. He has been leading the effort to make the kernel more NUMA aware by reworking the SLAB allocator, page allocator, and various other components for optimal performance. Christoph's patches made it possible for the kernel to change the physical location of pages transparently while processes are running (page migration) and he introduced the functionality necessary to locally reclaim memory for optimal placement of memory. Christoph is currently serving on the Technical Advisory Board of OSDL, the Technical Program Committee for Gelato ICE, and the Advisory Board of the Linux Professional Institute. He has been teaching various classes on operating system design and programming languages in San Jose. He earned a PhD for work on the implications of quantum theory for concepts of reality.

 
Wednesday
Decimal Floating-Point

The IEEE 754 Floating-Point Standard is up for revision. A major new addition is a decimal data type and computation rules. This talk will define and motivate the need for decimal (vs binary) floating-point (FP), and demonstrate that a binary-integer based implementation is effective for high-performance software emulation, as well as being amenable to sharing hardware with binary FP units for maximum efficiency and leverage.
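As a small illustration of the motivation, the Python sketch below contrasts binary and decimal arithmetic and then adds two decimal values using only an integer coefficient and a power-of-ten exponent. It is an assumption-laden toy, not the encoding proposed for the standard and not the implementation discussed in the talk; the helper names to_int_pair and add_pairs are hypothetical, and only finite values are handled.

```python
from decimal import Decimal

# Binary floating-point cannot represent 0.1 or 0.2 exactly.
print(0.1 + 0.2 == 0.3)                                     # prints False
# Decimal arithmetic keeps these values exact.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))    # prints True


def to_int_pair(text):
    """Split a finite decimal string into (integer coefficient, exponent)."""
    sign, digits, exponent = Decimal(text).as_tuple()
    coefficient = int("".join(map(str, digits)))
    return (-coefficient if sign else coefficient), exponent


def add_pairs(a, b):
    """Add two (coefficient, exponent) pairs using only integer arithmetic."""
    (ca, ea), (cb, eb) = a, b
    e = min(ea, eb)
    # Scale both coefficients to the common (smaller) exponent, then add.
    return ca * 10 ** (ea - e) + cb * 10 ** (eb - e), e


# 0.1 + 0.25 = 0.35, represented as 35 * 10**-2
print(add_pairs(to_int_pair("0.1"), to_int_pair("0.25")))   # prints (35, -2)
```

The coefficient is an ordinary binary integer, so the addition itself needs no decimal-specific hardware; only the scaling by powers of ten is decimal in nature.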

John Crawford - Intel

John H. Crawford is an Intel Fellow at the Intel Corporation, Santa Clara, California, where he investigates emerging technology directions and issues for future Itanium Processor Family products. Crawford was the Chief Architect of both the Intel386 and Intel486 microprocessors, and co-project manager of the Pentium microprocessor. He managed the joint Intel/HP team that defined the Itanium Processor Family instruction set architecture, and directed aspects of Itanium processor product development. Crawford was awarded the ACM/IEEE Eckert-Mauchly Award, and the IEEE Ernst Weber Engineering Leadership Recognition. He was elected to the National Academy of Engineering in 2002. Crawford received an ScB in Computer Science from Brown, and an MS in Computer Science from the University of North Carolina, Chapel Hill.

 
 
The Itanium Vector Math Library (VML)

The VML project was conceived in 1990 in parallel with the development of the Itanium. To match the ambitious design of Itanium, all mathematical and computational procedures relevant for the functions at hand were re-examined. Powerful new methods were developed to determine (1) Chebyshev expansions of a function by a straightforward transformation of its Maclaurin expansion, and (2) a Remez expansion from a Chebyshev expansion by a quadratically convergent iterative process.
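The abstract does not describe the transformation that was developed for VML. To make concrete what converting a truncated Maclaurin series into a Chebyshev expansion means, the Python sketch below uses the textbook change of basis: Horner's scheme carried out in the Chebyshev basis. The function names are hypothetical, and the sketch should not be read as the method used in VML.

```python
def times_x(c):
    """Multiply a Chebyshev series by x.

    Uses x*T0 = T1 and x*Tk = (T(k+1) + T(k-1)) / 2 for k >= 1.
    """
    out = [0.0] * (len(c) + 1)
    for k, ck in enumerate(c):
        if k == 0:
            out[1] += ck
        else:
            out[k + 1] += ck / 2
            out[k - 1] += ck / 2
    return out


def poly_to_chebyshev(coeffs):
    """Convert power-series coefficients [a0, a1, ...] to Chebyshev coefficients."""
    if not coeffs:
        return []
    # Horner's scheme: ((a_n * x + a_(n-1)) * x + ...) * x + a_0
    result = [float(coeffs[-1])]
    for a in reversed(coeffs[:-1]):
        result = times_x(result)
        result[0] += a
    return result


# x**3 = (3*T1 + T3) / 4
print(poly_to_chebyshev([0, 0, 0, 1]))    # prints [0.0, 0.75, 0.0, 0.25]

# Truncated Maclaurin series of exp(x): 1 + x + x**2/2 + x**3/6
print(poly_to_chebyshev([1, 1, 1 / 2, 1 / 6]))
```

The result is exact for the truncated polynomial on the interval [-1, 1]; it says nothing about the truncation error of the original series.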

The functions implemented are (1) reciprocal, division, square root, and reciprocal square root; (2) exponentials, logarithms, and the power function; (3) trigonometric functions and their inverses; (4) hyperbolic functions and their inverses. Actually each VML code is a subroutine that yields a vector of results from a vector of arguments. For corner cases due to pathological input arguments, the VML codes deliver the expected results directly, thereby avoiding error detection by hardware, and subsequent elaborate and costly error processing. Version 1 of VML, comprising 56 functions, is available in the Public Domain. For 55 of these functions, the floating-point performance of the inner loop is 100% saturated; the one exception is the single precision logarithm, which has one floating point vacancy in its 12 cycle inner loop.

The design and implementation strategies of the VML programming model will be shown in detail by a representative example. An earlier presentation entitled "Exploiting the Power of Itanium" on 7/31/2003 in Amsterdam is available at http://www.sara.nl/news/2003/20030813/lecture_roothaan_eng.html.

Clemens C. J. Roothaan - Gelato Honorary Member

 
Research
Monday
Preparing for the First Beam at the LHC

The Large Hadron Collider (LHC) at CERN, the European Laboratory for Particle Research in Geneva, Switzerland, is expecting to have the first beam next year. This is the culmination of a construction project lasting more than a decade, including the development of the supporting software and computing models. ALICE is one of the four major detectors that are being prepared for physics at the LHC, and the University of Houston is a member of the US contingent of institutions involved in that experiment. Along with the Ohio Supercomputer Center and the facility at NERSC (LBL), the University of Houston Itanium cluster has been participating in the increasingly severe sequence of "data challenges" that are being wrapped up now in preparation for the actual turn-on of the LHC. The ALICE computing model is necessarily dependent upon a grid-based model that will include many different platforms, Itanium among them. The data challenges have provided a good venue to compare the relative attributes of the various platforms in running the kind of simulations and analysis codes that are relevant to particle physics applications. An overview of these results will be presented along with a summary of the overall ALICE computing plans and the status of their deployment.

Lawrence Pinsky - University of Houston

Lawrence Pinsky is the chairperson of the Physics Department at the University of Houston. He holds a BS in Physics from Carnegie-Mellon University and an MA and PhD in Physics from the University of Rochester. Professor Pinsky also holds JD and LLM degrees from the University of Houston's Law Center. He has published over 125 articles in refereed journals and gives 5 to 10 invited talks each year. He is on the organizing committees of several major international conferences each year, including the recent CHEP'06 (Computing in High Energy Physics-2006) conference in Mumbai, India.

Professor Pinsky is a member of the ALICE-USA Collaboration and has served as the Computing Coordinator for that effort. He is a member of the ALICE Computing Board and the CERN Grid Deployment Board. At the University of Houston, he is a member of the Executive Committee of the Texas Learning and Computation Center. Pinsky also has an extensive NASA-supported research effort in the development of Monte Carlo Transport codes for use in simulating the space radiation environment.

 
Computing Optimal Equilibrium Strategies for Network Economies

Models for regulating, planning, and operating industries that work on networks, such as energy, transportation, and telecommunication, are key tools today. These models correspond to large stochastic optimization/equilibrium problems, which are very difficult to solve. In this talk, we will show three new distributed algorithms/strategies for computing a solution and their implementation on Itanium 2 clusters. This family of models and solutions is currently used by several companies and institutions participating in these industries.

Alejandro Jofré - University of Chile

Dr. Alejandro Jofré is a Professor at the University of Chile and a researcher in the Department of Mathematical Engineering. His research interests include optimization and mathematical economics. Since April 2000, he has been the Vice Director of the Centre for Mathematical Modeling and is the leader of projects related to energy and telecommunication networks, such as "Rockmass Geo-Mechanical Instabilities in Copper Mines." He is also a Professor in the PhD program in Mathematical Economics at the University Paris 1- Sorbonne and an associate member of the Center for Experimental Math, Canada. Jofré holds a PhD in Applied Mathematics from the University of Pau, France, and has published more than 30 papers.

 
Mathematical Libraries and the Implementation of Parallel Solvers for Engineering

Our research is focused on developing a highly efficient parallelizable solver of huge systems of linear equations that arise from finite element discretizations of complex nonlinear engineering problems. Those problems are nonlinear, require many linearizations, and hence several days of CPU time on Itanium platforms. Another important application is the reconstruction of tomographic images.

This work includes a comparison of mathematical libraries such as MKL (Linux) and MLIB (HP) from the point of view of their performance on numerical problems using sequential and parallel implementations. The new solver uses BLAS routines at levels 1, 2, and 3, excluding complex data types. The conclusions of our study present the results obtained with several problems.

Hugo Daniel Scolnik - University of Buenos Aires

Hugo Daniel Scolnik is a Professor in the Computer Sciences Department (that he founded in 1984) at the School of Sciences of the University of Buenos Aires (UBA) where he teaches Cryptography, Numerical Analysis, and Optimization. For his Gelato-related work, Scolnik co-directed a Gelato-sponsored project comparing 64- and 32-bit architectures from the point of view of their performance for scientific programming. Scolnik is also currently directing three of his five graduate students on Gelato-related theses.

Beyond his work at UBA, Scolnik was an international consultant for United Nations agencies, HP, and Hitachi. He has been a Visiting Professor in several countries. He represents Argentina on the International Federation for Information Processing (IFIP) Technical Committee 7 (TC7). He has published papers on Optimization, Numerical Analysis, Automata Theory, Artificial Intelligence, Robotics, and Mathematical Modeling, and has refereed several journals. In 2003, Scolnik won the Konex Award for the best trajectory in Science and Technology for the 1993-2003 decade in the area of Informatics. Scolnik received a Licenciado en Ciencias Matemáticas at the University of Buenos Aires in 1964, and a PhD in Mathematics from the University of Zurich, Switzerland, in 1970.

 
Superpages / VM Work

This talk will present a short overview of Gelato@UNSW's latest work on issues relating to Itanium Linux virtual memory. Our work revolves around both taking advantage of unique properties of the Itanium MMU and some more "radical" ideas for overhauling parts of the Linux VM layer. Topics touched on will include using the long-format VHPT, strategies for providing dynamic superpages, and approaches for greater abstraction within the Linux VM implementation.

Ian Wienand - University of New South Wales

Ian Wienand has been a Research Assistant with Gelato@UNSW since late 2003, working on various Itanium Linux projects. He has recently changed the nature of his engagement to undertake a Master's Degree within the group, looking at new approaches for Itanium Linux virtual memory.

 
Tuesday
Experiences on the Itanium-Based Grid Test Bed at UPRM

The Parallel and Distributed Computing Laboratory (PDCLab) at the University of Puerto Rico, Mayaguez, has deployed an experimental grid test bed to perform research in the area of grid computing. The PDCLab grid test bed was deployed using components that allow flexible re-configuration, management, and programmability. The test bed was built upon heterogeneous components including an Itanium-based cluster. This presentation discusses the hardware and software configurations of the grid test bed, the rationale used to choose each of the grid components, and the research issues being investigated.

Wilson Rivera - University of Puerto Rico Mayaguez

Dr. Wilson Rivera obtained his PhD in Computational Engineering from Mississippi State University, while working at the NSF Engineering Research Center for Computational Field Simulation. There he concentrated on developing domain decomposition algorithms for solving time dependent partial differential equations with applications in Computational Fluid Dynamics. Dr. Rivera is an Associate Professor at the University of Puerto Rico Mayaguez Campus (UPRM). He leads the Parallel and Distributed Computing Laboratory (PDCLab) at UPRM. His current funded projects address fundamental research problems in the areas of grid computing (automated grid deployment, adaptive grid services, dynamic resource management and grid performance) and workflow management (workflow modeling, metadata description and dynamic scheduling). Rivera is also the Executive Director for the Institute for Computing and Informatics Studies at UPRM and is a faculty member of the NSF Center for Subsurface Sensing and Imaging Systems (CenSSIS) and the NSF Center for Collaborative Adaptive Sensing of the Atmosphere (CASA).

 
In Search of Collaboration

In advancing Linux on Itanium, there are many technical areas crying out for collaboration. HP is, in particular, interested in collaborating with research institutes, universities, and vendors in three areas: scalability, virtualization, and GCC. We believe these three areas are critical to the success of Itanium. In this presentation, we will present the needs in the three areas from HP's viewpoint. We will have short discussions on your needs as well. The goal is to stimulate off-line discussions concerning potential collaborations. In some cases, it could lead to funding from HP. Please join us to start more collaborations.

Ping-Hui Kao - HP

Ping-Hui Kao is a System Architect in HP's Open Source and Linux Organization (OSLO) R&D. He contributed to the HP-UX operating system and kernel, especially in filesystems, Windows NT work on HPPA-based platforms, Linux kernel developments, and HA and cluster technologies. In addition to various engineering tasks, he is in charge of Orchestrated Collaborative Engineering (OCE). As part of OCE, Ping-Hui manages an R&D lab in Beijing, China, and coordinates collaboration with the OSS community and consortia for OSLO.

 
Wednesday
Itanium Virtualization and vNUMA

In recent years, virtualization has become a hot technology, being widely deployed for applications such as server consolidation. However, Itanium, like x86, was not originally designed with virtualization as a goal. In this presentation, I will talk about the challenges of virtualizing the Itanium architecture. I will present the various possible approaches, including para-virtualization, pre-virtualization (an automated technique we have developed), and hardware-assisted virtualization in the form of Intel Virtualization Technology.

I will also provide an overview of vNUMA, a novel application of these virtualization techniques. vNUMA provides a virtual ccNUMA-like environment on a cluster, by transparently implementing shared memory underneath the operating system. Thus, a single instance of an existing operating system such as Linux can run across multiple nodes of a cluster. While the general principles are applicable to any architecture, the initial version has been built for Itanium systems.

Matthew Chapman - University of New South Wales

Matthew Chapman is a PhD student at the University of New South Wales, Sydney, Australia, and also works part-time for HP Labs. His research interests include operating systems, computer architecture, and virtualization. Matthew has considerable experience with the Itanium architecture, having contributed to the Itanium ports of Linux, L4, Xen, and his own project vNUMA. He is also active in the wider open-source community, such as the rdesktop project, which he founded.

 
Bioinformatics in Biomining

Given the explosive growth of genomic databases in recent times, the development of efficient searching tools becomes more relevant every day. In particular, the design of biochemical elements used for bioidentification experiments requires search algorithms incorporating specific biological constraints. These experiments are designed to identify the biological diversity of metagenomic or environmental samples and are useful in ecology, environmental studies, and infection diagnosis, among others. We are focusing on text search algorithms for short words (under 60 symbols) where a small number of substitutions are allowed. The databases used are on the order of gigabytes. We have developed an efficient solution for this problem, which can take advantage of the Itanium 2 architecture. In this work, we will present a comparative study of the performance of this algorithm on several architectures. This work is being developed at the Laboratory of Bioinformatics and Mathematics of Genome, Center for Mathematical Modeling, University of Chile.
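The search problem stated above (short words, a small number of substitutions allowed) can be written down as a naive baseline in Python. This is not the authors' algorithm, which the abstract does not describe; it is a straightforward sliding-window scan with an early exit, and the function name and sample sequence are hypothetical.

```python
def find_with_substitutions(text, pattern, max_subs):
    """Yield (position, mismatches) for each window of `text` that matches
    `pattern` with at most `max_subs` substituted symbols."""
    m = len(pattern)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_subs:
                    break          # too many substitutions, skip this window
        else:
            # The inner loop finished without exceeding the budget.
            yield start, mismatches


hits = list(find_with_substitutions("ACGTACGAACGT", "ACGT", max_subs=1))
print(hits)    # prints [(0, 0), (4, 1), (8, 0)]
```

The cost of this baseline grows with the product of database size and word length, which is why gigabyte-scale databases call for a more efficient approach.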

Nicolas Loira - University of Chile

Nicolas Loira is a Computer Engineer from the University of Chile, with a background in videogame programming, system administration, and IT. After four years in the field of Bioinformatics, Nicolas currently works designing and implementing algorithms to handle and analyze the massive amounts of data produced by some of the most important biotechnology projects in Chile.

Andres Aravena - University of Chile

Andres Aravena is a Mathematical Engineer and MSc candidate in Computer Science at the University of Chile. He is currently the Project Manager of the Laboratory of Bioinformatics and Genome Mathematics at the Center for Mathematical Modeling at the University of Chile, a group focused on one of the main Chilean biotechnological projects. His previous experience includes system development and network management at a nationwide university, and representation on the national academic networking consortium.

 

 

All content © copyright 2002-2006 Gelato Federation.

Gelato Central Operations is housed within the Coordinated Science Laboratory (CSL) of the College of Engineering at the University of Illinois at Urbana-Champaign (UIUC).