Morning tutorial: BSC Performance Analysis and Tools (Judit Gimenez)
Summary: With the increasing complexity of applications and platforms, good and flexible performance tools are crucial for analyzing and optimizing parallel codes. In this tutorial we describe the performance tools developed at BSC (tools.bsc.es), an open-source project whose goal is to understand the behavior of applications. The key component is Paraver, a trace-based performance analyzer with great flexibility to explore the collected data.
Judit Gimenez is a researcher at UPC/BSC. She has been working in the area of parallel computing since 1989, when she obtained her Computer Science degree. Her first job involved the development and support of parallel systems based on transputers. She has been involved in technology transfer activities, participating in different initiatives to promote the usage of parallel computing by European SMEs. She has been responsible for the development and distribution of performance tools for the last 17 years, and has led the Performance Tools team in the Computer Science department at BSC since its creation. She actively participated in the POP Center of Excellence (EU H2020 project), promoting performance and optimization as the path to improved productivity.
Afternoon tutorial: Resilience in parallel applications (George Bosilca, Bogdan Nicolae)
Summary: As the age of Exascale draws closer and the size of large-scale, distributed applications continues to increase, so do the failure rate and, with it, the need for advanced resilience techniques to handle failures. Over the last years, resilience has evolved from an open question to a clear requirement: the occurrence of failures is no longer questioned, and the focus is instead on how frequently such radical events happen during the execution of applications at scale. Solutions to transparently manage faults at the system level exist, but their scalability is limited and their overhead in terms of performance and resource utilization remains high, even for low failure frequencies. Empowering developers to deal with failures at the application level instead brings more opportunities to reduce the resilience overhead, an approach that needs holistic support from all layers: hardware and software, as well as the parallel programming paradigm. This tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: (1) application-defined checkpoint-restart (as demonstrated through the VeloC runtime); and (2) user-level failure mitigation (as demonstrated through the ULFM runtime); as well as opportunities to mix them toward more portable and efficient solutions.
The target audience is anyone interested in understanding the reliability challenges that come with the increasing scale of computer infrastructures, and any practitioner interested in acquiring a basic understanding of some of the potential solutions that empower application developers to overcome process and node failures. Prior knowledge of MPI and some familiarity with the C programming language are necessary. The content is structured to cover 30% beginner, 40% intermediate and 30% advanced level.
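As a rough illustration of the application-defined checkpoint-restart idea mentioned in the summary, the following C sketch saves the application state every few iterations and resumes from the last saved state after a failure. It is a minimal single-process sketch using only the C standard library, not the VeloC API; the names `app_state`, `save_checkpoint`, `load_checkpoint` and `run`, and the `fail_at` parameter used to simulate a crash, are all hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical application state: an iteration counter plus some data. */
typedef struct {
    int iteration;
    double data[4];
} app_state;

/* Write the state to a temporary file and then rename it, so that a crash
   during the write does not leave a truncated checkpoint behind.
   Assumes POSIX semantics, where rename() replaces an existing file. */
static int save_checkpoint(const char *path, const app_state *s)
{
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);
    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || n != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path);
}

/* Read a previously saved state; returns 0 on success, -1 otherwise. */
static int load_checkpoint(const char *path, app_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Iterate up to `last`, checkpointing every `interval` iterations.
   If the iteration counter reaches `fail_at`, return early to simulate a
   crash in which the in-memory state is lost (pass -1 for no failure).
   Returns the iteration count reached. */
static int run(const char *path, int last, int interval, int fail_at)
{
    assert(interval > 0);
    app_state s;
    if (load_checkpoint(path, &s) != 0)
        memset(&s, 0, sizeof s);        /* no checkpoint: fresh start */

    while (s.iteration < last) {
        if (s.iteration == fail_at)
            return s.iteration;         /* simulated failure */
        for (int i = 0; i < 4; i++)
            s.data[i] += 1.0;           /* stand-in for real computation */
        s.iteration++;
        if (s.iteration % interval == 0)
            save_checkpoint(path, &s);  /* errors ignored in this sketch */
    }
    return s.iteration;
}
```

Calling `run` a second time with the same path after a simulated failure resumes from the last checkpoint rather than from iteration 0, so only the work done since that checkpoint is repeated. The checkpoint interval trades the cost of writing state against the amount of work lost per failure.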
George Bosilca is a Research Assistant Professor and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. His research interests revolve around providing support for parallel applications to maximize their efficiency, scalability and heterogeneity at any scale and in any setting. Dr. Bosilca works on programming paradigms for parallel applications, and on designing parallel programming paradigms that provide scalable and portable constructs for dealing with heterogeneity and resilience.
Bogdan Nicolae is a Computer Scientist at Argonne National Laboratory, USA. In the past, he held appointments at Huawei Research, Germany and IBM Research, Ireland. He specializes in scalable storage, data management and fault tolerance for large-scale distributed systems, with a focus on high performance architectures and cloud computing. He holds a PhD from University of Rennes 1, France and a Dipl. Eng. degree from Politehnica University Bucharest, Romania.
Afternoon tutorial: “MPI + Y” – interoperable APIs for maximising asynchrony
Summary: The tutorial features both an introduction and a hands-on session for two topics: the Task-Aware MPI (TAMPI) library and a recent GASPI extension which leverages the concept of shared windows. The TAMPI library provides a new MPI_TASK_MULTIPLE threading level which facilitates the development of hybrid MPI+OpenMP/OmpSs-2 applications. With MPI_TASK_MULTIPLE, any task can invoke synchronous MPI calls without blocking the underlying hardware thread, thus avoiding potential deadlocks. The GASPI extension, a SHAred Notifications (SHAN) communication library, is primarily aimed at migrating flat MPI legacy codes towards an asynchronous execution model. To achieve this goal, the SHAN library makes use of notified communication, both within and across shared memory nodes. The SHAN API publishes solver data as well as corresponding datatypes in shared memory. It also leverages one-sided notified GASPI communication (for non-local communication with other nodes) in order to pipeline packing and unpacking of the published datatypes with communication.
The hands-on will focus on the migration from flat MPI legacy code towards both hybrid MPI + OmpSs-2 and the GASPI SHAN extension.
Dr. Christian Simmendinger is working as a Senior System Architect for T-Systems - Solutions for Research. After studying physics, he received a PhD in the area of Solid State Physics at the Institute for Theoretical Physics (University of Stuttgart). In 2013, together with Rui Machado and Carsten Lojewski, he was awarded the Joseph-von-Fraunhofer Prize for his contributions to GASPI/GPI, a novel PGAS API for next-generation Exascale supercomputers.
Dr. Vicenç Beltran is a Senior Researcher at the Barcelona Supercomputing Center (BSC), where he works on parallel and distributed programming models, hardware accelerators, domain-specific languages, operating systems and tools for HPC systems. He has participated in several EU and industrial projects, including DEEP, DEEP-ER and DEEP-EST (leading the WPs on programming models and resiliency), INTERTWinE (leading the WP on runtime interoperability) and REPSOLVER II (leading the development of a DSL infrastructure). He currently leads the Runtime Systems for Parallel Programming Models group, which develops the OmpSs v2 programming model.
EuroMPI 2018 Workshops and Tutorials Chair
Rosa M Badia, Barcelona Supercomputing Center