Morning tutorial: BSC Performance Analysis and Tools (Judit Gimenez)
Summary: As applications and platforms grow more complex, good and flexible performance tools are crucial for analyzing and optimizing parallel codes. In this tutorial we describe the performance tools developed at BSC (tools.bsc.es), an open-source project whose goal is to understand the behavior of applications. The key component is Paraver, a trace-based performance analyzer with great flexibility to explore the collected data.
Judit Gimenez is a researcher at UPC/BSC. She has been working in the area of parallel computing since 1989, when she obtained her Computer Science degree. Her first job was the development and support of parallel systems based on transputers. She has been involved in technology transfer activities, participating in different initiatives to promote the use of parallel computing by European SMEs. She has been responsible for the development and distribution of performance tools for the last 17 years, and has led the Performance Tools team in the Computer Science department at BSC since its creation. She actively participated in the POP Center of Excellence (EU H2020 project), promoting performance and optimization as the path to improved productivity.
Afternoon tutorial: Resilience in parallel applications (George Bosilca, Bogdan Nicolae)
Summary: As the age of Exascale draws closer and the size of large-scale, distributed applications continues to increase, so does the failure rate, and thus the need for advanced resilience techniques to handle failures. Over the last years, resilience has evolved from an open question to a clear requirement: the occurrence of failures is no longer questioned, and the focus is instead on how frequently such radical events occur during the execution of applications at scale. Solutions to transparently manage faults at the system level exist, but their scalability is limited and their overhead in terms of performance and resource utilization remains high, even for low failure frequencies. Empowering developers to deal with failures at the application level instead brings more opportunities to reduce the resilience overhead, but it needs holistic support from all layers: hardware, software, and the parallel programming paradigm. This tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: (1) application-defined checkpoint-restart (as demonstrated through the VeloC runtime); and (2) user-level failure mitigation (as demonstrated through the ULFM runtime), as well as opportunities to mix them toward more portable and efficient solutions.
The target audience is anyone interested in understanding the reliability challenges that come with the increasing scale of computer infrastructures, and any practitioner interested in acquiring a basic understanding of some of the potential solutions that empower application developers to overcome process and node failures. Prior knowledge of MPI and some familiarity with the C programming language are necessary. The content is structured to cover 30% beginner, 40% intermediate and 30% advanced level.
George Bosilca is a Research Assistant Professor and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. His research interests revolve around providing support for parallel applications to maximize their efficiency, scalability and heterogeneity at any scale and in any setting. Dr. Bosilca works on programming paradigms for parallel applications, and on designing parallel programming paradigms that provide scalable and portable constructs for dealing with heterogeneity and resilience.
Bogdan Nicolae is a Computer Scientist at Argonne National Laboratory, USA. In the past, he held appointments at Huawei Research, Germany and IBM Research, Ireland. He specializes in scalable storage, data management and fault tolerance for large-scale distributed systems, with a focus on high performance architectures and cloud computing. He holds a PhD from University of Rennes 1, France and a Dipl. Eng. degree from Politehnica University Bucharest, Romania.
EuroMPI 2018 Workshops and Tutorials Chair
Rosa M Badia, Barcelona Supercomputing Center