Online platform
IMPORTANT: All registered participants, chairs, and keynote speakers should have received an e-mail with instructions on how to access the conference platform. If not, please contact pdp2021@protonmail.com as soon as possible.
IMPORTANT: All speakers must be online during the broadcasting of their videos in order to answer subsequent questions. Failure to do so without justification may affect the publication of the work in the conference proceedings.
Conference Program
IMPORTANT: All times are expressed in Central European Time (CET), that is, GMT+1. Please double-check session times against your local timezone before joining.
Wednesday, March 10th
9:00-9:30 Welcome to PDP 2021 and general information
9:30-10:15 Keynote Speaker: Rafael Asenjo, Universidad de Malaga (Spain)
Title: oneAPI for GPUs and FPGAs: portability, yes!, performance portability, not quite.
Abstract: Heterogeneous platforms are becoming increasingly common in HPC programming, with compute resources that include a diverse collection of integrated and discrete graphics processors, FPGAs and other domain-specific compute engines. Can we program CPUs, GPUs and FPGAs using a unified, standards-based programming model? Yes: oneAPI includes a cross-architecture language, Data Parallel C++ (DPC++). DPC++ is an evolution of C++ that incorporates the SYCL language with extensions for Unified Shared Memory (USM), ordered queues and reductions, among other features. oneAPI also includes libraries for API-based programming, such as domain-specific libraries, math kernel libraries and Threading Building Blocks (TBB). The main benefit of using oneAPI over other heterogeneous programming models is the single programming language approach, which enables one to target both GPUs and FPGAs using the same programming model, and therefore to have cleaner, portable, and more readable code. Understanding the tradeoffs in using these accelerators and how to select and optimize computations for offload to these devices is an important and timely topic. The goal of oneAPI is to augment C++ to create a model that covers all of these devices, without sacrificing performance. C++ continues to grow in importance in HPC programming, and the combination of oneAPI's DPC++ and oneAPI's Threading Building Blocks (oneTBB) provides a powerful toolset for expressing heterogeneous applications in C++. We will see that DPC++/SYCL provides portability, and that in many cases we can port our code from CPU to GPU or FPGA by changing just a single line of code (the device selector) and recompiling. However, performance portability is a different matter: it means that we can get a similar fraction of peak performance on a wide range of target architectures using the same code.
In this regard, we will elaborate on four oneAPI-related aspects: 1) why it is challenging to achieve performance portability; 2) how the GPU and FPGA architectures differ; 3) which accelerator better suits different application needs; and 4) which optimizations are relevant to specialize our code and get the most out of GPUs or FPGAs. Finally, we will also touch on scheduling alternatives that may come in handy for exploiting several devices at the same time.
10:15-10:30 Break
10:30-12:00 Session 1: Distributed Computing
- A Federated Content Distribution System to Build Health Data Synchronization Services
- Nonblocking Data Structures for Distributed-Memory Machines: Stacks as an Example
- Bucket MapReduce: Relieving the Disk I/O Intensity of Data-Intensive Applications in MapReduce Frameworks
- Job Classification Through Long-Term Log Analysis Towards Power-Aware HPC System Operation
12:00-13:30 Lunch Break
13:30-15:00 Session 2: High-performance Computing Applications
- An Efficient Practical Non-Blocking PageRank Algorithm for Large Scale Graphs
- Parallel Asynchronous Stochastic Dual Coordinate Descent Algorithms for Efficiency and Convergence
- A Synchronized and Dynamic Distributed Graph structure to allow the native distribution of Multi-Agent System simulations
- A Case Study of Run-Time Testing of Self-Organizations in Multi-Embedded-Agent Systems
15:00-15:15 Break
15:15-16:45 Session 3: Parallel Programming
- Building representative and balanced datasets of OpenMP parallel regions
- A Cross-Platform OpenVX Library for FPGA Accelerators
- Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces
- Towards On-the-fly Self-Adaptation of Stream Parallel Patterns
- Quantifiability: Correctness of Concurrent Programs in Vector Space
Thursday, March 11th
9:30-10:15 Keynote Speaker: Clemens Grelck, University of Amsterdam (The Netherlands)
Title: Single Assignment C: High Productivity meets High Performance
Abstract: Today's computing landscape is characterised by a large variety of rapidly moving platforms accompanied by an even larger variety of likewise rapidly moving machine-centric programming models. At the same time, parallel computing has long left the niche of high-performance computing and can truly be considered mainstream these days. While parallel computing gurus are more sought after than ever, the current developments essentially expose average software engineers to the notorious challenges and pitfalls of parallel programming, both regarding functional correctness and performance. This growing fraction of parallel programmers by need, not by interest, is often overwhelmed by the complexities and intricacies of our mainstream parallel programming models. Single Assignment C (SAC) is a purely functional high-productivity language that aims at addressing the challenges of modern hardware from the other end: abstraction. Our goal is to compile a single, entirely architecture-agnostic source program to a variety of different computing architectures. SAC puts the emphasis on truly multi-dimensional and truly stateless arrays and pursues a stringent co-design of programming language and compiler technology. The abstract view on arrays combined with the functional semantics supports far-reaching program transformations. A highly optimised runtime system takes care of automatic memory management with an emphasis on immediate reuse. Fully compiler-directed parallelisation for a variety of parallel architectures, from multi-core to GPGPU accelerators to clusters, enables average programmers to harness the power of modern computing systems with hardly any additional effort or insight into the abysses of parallel programming and computer architecture. We introduce the essential language design concepts of SAC and demonstrate how SAC supports programmers to write highly abstract, reusable and elegant code.
We discuss the major challenges in compiling SAC programs into efficiently executable code across a variety of multi- and many-core architectures, and show some performance figures that demonstrate the ability of SAC to achieve runtime performance levels competitive with machine-centric programming environments.
10:15-10:30 Break
10:30-12:00 Session 4: Neural Networks and Deep Learning
- Performance Modeling for Distributed Training of Convolutional Neural Networks
- Evaluation of MPI Allreduce for Distributed Training of Convolutional Neural Networks
- Analyzing the distributed training of deep-learning models via data locality
- High Performance and Energy Efficient Integer Matrix Multiplication for Deep Learning
12:00-13:30 Lunch Break
13:30-15:00 Session 5: Networking and On-Chip Technologies
- General hardware multicasting for fine-grained message-passing architectures
- A Case for Low-Latency Network-on-Chip using Compression Routers
- Low-Latency Low-Energy Memory-Cube Networks Using Dual-Voltage Bypassing Datapaths
- Application Characterization for Near Memory Processing
15:00-15:15 Break
15:15-16:45 Session 6: Performance Evaluation and Tuning
- Optimizing Parallel Applications via Dynamic Concurrency Throttling and Turbo Boosting
- Boosting Graph Analytics by Tuning Threads and Data Affinity on NUMA Systems
- Performance Analysis of Array Database Systems in Non-Uniform Memory Architecture
- Combining Thread Throttling and Mapping to Optimize the EDP of Parallel Applications
- Toward a Better Performance Portability Metric
Friday, March 12th
9:30-10:15 Keynote Speaker: Christoph W. Kessler, Linköping University (Sweden)
Title: Towards future-proof programs for heterogeneous parallel systems
Abstract: We live in the era of parallel and heterogeneous computer systems, with multi- and many-core CPUs, GPUs and other types of accelerators being omnipresent. The execution and programming models exposed by modern computer architectures are diverse, parallel, heterogeneous, distributed, and far away from the sequential von-Neumann model of the early days of computing. Yet, the convenience of single-threaded programming, together with technical debt from legacy code, makes us mentally stick to programming interfaces that follow the familiar von-Neumann model, typically extended with various platform-specific APIs for controlling parallelism and accelerator usage. High-level parallel programming based on generic, portable programming constructs known as algorithmic skeletons can raise the level of abstraction and bridge the semantic gap between a sequential-looking, platform-independent single-source program code and the heterogeneous and parallel hardware. Could this be a recipe for writing future-proof, performance-portable programs that still provide von-Neumann style simplicity? We present the design principles of one such framework, the latest generation of our open-source programming framework SkePU for heterogeneous parallel systems, which is based on modern C++, using variadic template metaprogramming and a custom source-to-source pre-compiler. We also survey some of the automated optimizations that are enabled by SkePU's high-level programming interface. We conclude with an outlook to future perspectives for the skeleton programming approach. Acknowledgments: The latest version of SkePU (https://skepu.github.io) is joint work with August Ernstsson and Johan Ahlqvist, and was partly funded by EU H2020 project EXA2PRO.
10:15-10:30 Break
10:30-12:00 Special sessions I and II: Cloud Computing on Infrastructure as a Service and Its Application; On-chip Parallel and Network-Based Systems (Chair: TBD)
- Location-aware Task Allocation Strategies for IoT-Fog-Cloud Environments
- Kubernetes WANWide: a Deployment Scenario to Expose and Use Edge Computing Resources?
- Local Traffic-Based Energy-Efficient Hybrid Switching for On-Chip Networks
- High-Performance Parallel Fault Simulation for Multi-Core Systems
12:00-13:15 Lunch Break
13:15-15:00 Special session III: High-Performance Computing in Modelling and Simulation
- Machine Learning Migration for Efficient Near-Data Processing
- LIMITLESS - LIght-weight MonItoring Tool for LargE Scale Systems
- On GPU optimizations of stencil codes for highly parallel simulations
- Sigma: a Scalable High Performance Big Data Architecture
- A compact encoding of security logs for high performance activity detection
- Towards Parallel Multi-density Clustering for Urban Hotspots Detection
- Massively parallel simulations on GPGPUs of subsurface flow on heterogeneous soils
15:00-15:15 Break
15:15-16:45 Special session IV: Security in Parallel, Distributed and Network-Based Computing
- Camouflaged bot detection using the friend list
- Selection of Deep Neural Network Models for IoT Anomaly Detection Experiments
- Attack Surface Assessment for Cybersecurity Engineering in the Automotive Domain
- Parallel Privacy-Preserving Shortest Paths by Radius-Stepping
- A technique for early detection of cyberattacks using the traffic self-similarity property and a statistical approach
- WorkTrue: An Efficient and Secure Cloud-based Workflow Management System