Author Archive: TEXTAROSSA Project

By TEXTAROSSA Project

Streaming Programming Models

One of the aims of the TEXTAROSSA project is to define and develop a stream-based programming paradigm that integrates vertically with the heterogeneous TEXTAROSSA node.

To this end, the TEXTAROSSA project leverages the FastFlow [1] C++ header-only library, which provides application designers with abstractions for parallel programming (e.g., Pipeline, ordered Task-Farm, Divide & Conquer, Parallel-For-Reduce, Macro Data-Flow) and a carefully designed run-time system. At the lower layer, the library defines so-called Building Blocks (BBs), i.e., recurrent data-flow compositions of concurrent activities working in a streaming fashion, which represent the primary abstraction layer for building FastFlow parallel patterns and streaming topologies [2, 3]. A parallel application is conceived by adequately selecting and assembling a small set of BBs modelling data and control flows. The BBs can be combined and nested in different ways, forming either acyclic or cyclic concurrency graphs, where nodes are FastFlow concurrent entities and edges are communication channels.


Figure 1: Example of a FastFlow application: the communication topology is described as a composition of building blocks in a data-flow graph.
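
To make this concrete, the following minimal example (our own sketch, written against the public FastFlow tutorial API; the stage names are ours) assembles such a topology: a three-stage pipeline whose source, map-like stage, and sink are sequential building blocks connected by streaming channels.

```cpp
#include <iostream>
#include <ff/ff.hpp>
using namespace ff;

// Source: generates a stream of 10 integers, then closes the stream.
struct Source : ff_node_t<long> {
    long *svc(long *) override {
        for (long i = 1; i <= 10; ++i) ff_send_out(new long(i));
        return EOS;                       // end-of-stream marker
    }
};
// Square: a map-like stage transforming each item in flight.
struct Square : ff_node_t<long> {
    long *svc(long *x) override { *x *= *x; return x; }
};
// Sink: consumes the stream.
struct Sink : ff_node_t<long> {
    long *svc(long *x) override {
        std::cout << *x << '\n';
        delete x;
        return GO_ON;                     // keep waiting for more input
    }
};

int main() {
    Source src; Square sq; Sink snk;
    ff_Pipe<> pipe(src, sq, snk);         // pipeline building block
    return pipe.run_and_wait_end();       // negative value on error
}
```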

Within the project, the aim is to extend FastFlow with a new offloader node able to delegate computation to an FPGA accelerator hosted on the TEXTAROSSA node. This is done by programmatically loading the desired compute kernel onto the FPGA and streaming the input/output data to/from the FPGA accelerator. This will allow FastFlow applications to seamlessly leverage heterogeneous compute resources by combining, in the same concurrency graph, traditional nodes, which use CPU threads, and offloader nodes, which delegate work to accelerators.


Figure 2: FastFlow node comparison: Traditional vs. Offloader.

The main challenge in designing the offloader node is how to maximize the performance gain from using the accelerator card. Indeed, while the accelerator is expected to compute the results of the compute kernel substantially faster than the CPU, the communication with the card causes extra delays. The aim is to design schemes that minimize or hide the impact of the extra time needed to send data to, and receive results from, the accelerator card.
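
As a purely illustrative sketch of one such scheme (the offloader node is still under development; FpgaKernel, fpga_write_async, and fpga_read are hypothetical placeholders stubbed in software, not an existing FastFlow or FPGA API), an offloader node could keep several requests in flight so that data transfers overlap with kernel execution:

```cpp
#include <queue>
#include <ff/ff.hpp>
using namespace ff;

// Hypothetical FPGA API, stubbed in software so the sketch is self-contained:
// a real implementation would load a kernel and stream data over PCIe.
struct FpgaKernel {};
struct Request { long *data; };
static Request *fpga_write_async(FpgaKernel &, long *x) { return new Request{x}; }
static long *fpga_read(Request *r) {         // blocks until the result is ready
    long *d = r->data; *d *= *d; delete r;   // stub: square computed "on the FPGA"
    return d;
}

// Offloader sketch: keeps up to DEPTH requests in flight, so the transfer of
// item i+1 overlaps with the FPGA computation on item i.
struct Offloader : ff_node_t<long> {
    FpgaKernel &kernel;
    std::queue<Request *> in_flight;
    static constexpr std::size_t DEPTH = 4;      // max outstanding requests

    explicit Offloader(FpgaKernel &k) : kernel(k) {}

    long *svc(long *x) override {
        in_flight.push(fpga_write_async(kernel, x)); // send without waiting
        if (in_flight.size() < DEPTH) return GO_ON;  // fill the pipeline first
        long *done = fpga_read(in_flight.front());   // oldest result is ready
        in_flight.pop();
        return done;                                 // forward downstream
    }
    void svc_end() override {                        // drain at end-of-stream
        while (!in_flight.empty()) {
            ff_send_out(fpga_read(in_flight.front()));
            in_flight.pop();
        }
    }
};
```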

Works Cited
[1] M. Aldinucci, M. Danelutto, P. Kilpatrick and M. Torquati, “FastFlow: High-level and Efficient Streaming on Multi-core,” in Programming Multi-core and Many-core Computing Systems, John Wiley & Sons, Ltd, 2017, pp. 261-280.
[2] M. Torquati, Harnessing Parallelism in Multi/Many-Cores with Streams and Parallel Patterns, PhD thesis, University of Pisa, 2019.
[3] M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick and M. Torquati, “Design patterns percolating to parallel programming framework implementation,” International Journal of Parallel Programming, vol. 42, no. 6, pp. 1012-1031, 2013.

Leading Partner: CINI/UNITO

By TEXTAROSSA Project

Webinar “PATC: Heterogeneous Programming on FPGA with OmpSs@FPGA”

Carlos Alvarez (BSC) will present a webinar entitled “PATC: Heterogeneous Programming on FPGA with OmpSs@FPGA” on March 24, 2023, 09:00-17:30, in the context of the TEXTAROSSA project.

Link/Registration: https://www.bsc.es/education/training/patc-courses/hybrid-patc-heterogeneous-programming-fpgas-ompssfpga-0

By TEXTAROSSA Project

Approximate computing for AI

Machine Learning in general, and Deep Neural Networks (DNNs) in particular, have recently been shown to tolerate low-precision representations of their parameters.

This represents an opportunity to accelerate computations, reduce storage, and, most importantly, reduce power consumption. At the edge and on embedded devices, the latter is critical.

In neural networks, two game-changing factors are emerging.

First, the RISC-V open instruction set architecture (ISA) enables the seamless implementation of custom instruction-set extensions. Second, several novel formats for real-number arithmetic exist. In TEXTAROSSA we aim to merge these two major components by developing an accelerator for mixed precision, employing one or more promising low-precision formats (e.g., Posit, bfloat). We aim to develop an extension of the base RISC-V ISA that allows for computation with such formats, as well as for their interoperability with the standard 32-bit IEEE floats (a.k.a. binary32) and traditional fixed-point formats, so as to provide a compact representation of real numbers with minimal to no accuracy deterioration and a compression factor of 2 to 4. In TEXTAROSSA we pursue two main paths to exploit low-precision formats.

The first one is the design by UNIPISA of an IP core for a lightweight PPU (Posit Processing Unit) to be connected to a 64-bit RISC-V processor as a co-processor, with an extension of the Instruction Set Architecture (ISA). We focus on the compression abilities of posits by providing a co-processor designed with only conversions in mind, called the light PPU. It can convert binary32 floating-point numbers to posit numbers with 16 or 8 bits. This co-processor can be paired with a RISC-V core that already has a floating-point unit (e.g., the Ariane 64-bit RISC-V) without disrupting the existing pipeline. On the other hand, we can use this unit, together with the posit-to-fixed conversion modules, to enable ALU computation on posit numbers on a RISC-V core that does not support floating point.
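
The sketch below illustrates the intended light-PPU usage pattern: weights are stored compressed at half the size and expanded on the fly for the existing binary32 FPU. The type and intrinsic names (posit16_t, f32_to_p16, p16_to_f32) are hypothetical placeholders for the ISA extension, and the software stub merely keeps the top 16 bits of the binary32 pattern (bfloat16-style) so the example is self-contained; a real PPU would apply the posit encoding instead.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// HYPOTHETICAL light-PPU interface: placeholder names, not final mnemonics.
// The stub round-trips through the top 16 bits of the binary32 pattern
// (bfloat16-style) only to make the sketch runnable.
using posit16_t = std::uint16_t;

static posit16_t f32_to_p16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<posit16_t>(bits >> 16);
}
static float p16_to_f32(posit16_t p) {
    std::uint32_t bits = static_cast<std::uint32_t>(p) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

int main() {
    // Store network weights at half the size (2x compression vs. binary32)...
    std::vector<float> weights = {0.015625f, -1.5f, 3.25f};
    std::vector<posit16_t> stored;
    for (float w : weights) stored.push_back(f32_to_p16(w));

    // ...and expand on the fly when feeding the existing binary32 FPU.
    for (posit16_t p : stored) std::printf("%g ", p16_to_f32(p));
    std::printf("\n");
    return 0;
}
```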

The second one is the design by UNIPISA of a complete Posit Processing Unit (namely the Full PPU, FPPU) that can be connected to a RISC-V processor core with a further extension of the ISA, adding complete posit arithmetic capabilities to such a core. This approach enables us to deliver efficient real-number arithmetic with 8 or 16 bits (thus reducing the bits used by a factor of 4 or 2, compared to binary32 numbers), even in low-power processors that are not equipped with a traditional floating-point unit. The low-power performance of the PPU co-processors has also been validated by UNIPISA and POLIMI.

Leading Partner: UNIPI

References

  1. M. Cococcioni, F. Rossi, E. Ruffaldi and S. Saponara, “A Lightweight Posit Processing Unit for RISC-V Processors in Deep Neural Network Applications,” in IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 4, pp. 1898-1908, 1 Oct.-Dec. 2022, doi: 10.1109/TETC.2021.3120538.
  2. M. Piccoli, D. Zoni, W. Fornaciari, M. Cococcioni, F. Rossi, E. Ruffaldi, S. Saponara and G. Massari, “Dynamic Power Consumption of the Full Posit Processing Unit: Analysis and Experiments,” in PARMA-DITAM 2023, Open Access Series in Informatics (OASIcs), Dagstuhl, Germany, 2023, to appear.
By TEXTAROSSA Project

TEXTAROSSA won the Favorite Zany Acronym Award

The TEXTAROSSA project won the superlative award for the Favorite Zany Acronym from HPCwire. Not the most important of scientific achievements, but certainly fun!

Read the full story here.

By TEXTAROSSA Project

Mixed-Precision Computing

Motivation
ICT energy use is growing fast and is expected to reach 20% of global electricity demand by 2030, up from around 5% today (https://www.nature.com/articles/d41586-018-06610-y). Supercomputers are part of this trend, and they are also reaching the limits of the power supply that can be provided within a single site.

Approximate computing is a class of techniques to reduce the energy consumption across the lifetime of an application. Precision tuning is a subset of approximate computing that trades off the precision of a computation against the time and energy spent on it.
As a simple example, consider the time you would spend computing the area of a circle (the square of the radius, times Pi) when approximating Pi to 3.14, against the same computation performed using an approximation of Pi to 3.14159265359.
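
A quick numerical check of this example (our own illustration) shows that the cheap approximation changes the result by only about 0.05%:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double r = 2.0;
    const double coarse = 3.14 * r * r;          // Pi to 3 significant digits
    const double fine = 3.14159265359 * r * r;   // Pi to 12 significant digits
    std::printf("coarse = %.6f, fine = %.6f, relative error = %.2e\n",
                coarse, fine, std::fabs(coarse - fine) / fine);
    return 0;   // prints a relative error of about 5.1e-04
}
```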

TEXTAROSSA Contributions

TEXTAROSSA develops techniques to automatically transform a program, or a fragment of a program, to use smaller data types while keeping the error under control. A number of adjustments are performed during the operation of the running program to keep the approximation in line with the data set on which the computation is currently running.

Our techniques are implemented as part of the LLVM compiler, an industry-standard tool used to generate the executable program from the source code produced by an application programmer.
In particular, we extend the TAFFO (https://taffo-org.github.io/) set of plugins to support heterogeneous accelerators such as graphics cards (GPUs), which are extensively used in supercomputing to provide fast and massively parallel computation facilities.
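
As an illustration of how a programmer drives precision tuning, the sketch below annotates a variable with its expected run-time range; the annotation grammar is sketched from the TAFFO documentation, so please consult the project site for the authoritative syntax.

```cpp
#include <cstdio>

// TAFFO-style range annotation (grammar sketched from the TAFFO docs; see
// https://taffo-org.github.io/ for the authoritative syntax). It tells the
// compiler the run-time range of the variable, so the TAFFO passes can lower
// the float to a fixed-point type with a bounded, known error.
float __attribute__((annotate("scalar(range(0, 10))"))) radius = 2.0f;

int main() {
    // After precision tuning, this arithmetic can be compiled down to
    // integer/fixed-point instructions instead of floating-point ones.
    float area = 3.14159f * radius * radius;
    std::printf("area = %f\n", area);
    return 0;
}
```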

Leading Partner: CINI/POLIMI

By TEXTAROSSA Project

TEXTAROSSA @ SAMOS conference

Professor William Fornaciari (POLIMI) held a keynote speech at the SAMOS conference entitled “Design of Secure Power Monitors for Hardware Accelerators”.

The abstract, slides, and video will be published on the dedicated SAMOS conference page.

By TEXTAROSSA Project

Today, June 27th, 2022, the TEXTAROSSA Project Technical Committee meets in Rome for the first time since the pandemic forced us to work mostly remotely. We’re glad to be back working together in person!

By TEXTAROSSA Project

IDV-A prototype

One pillar of TEXTAROSSA is to improve the energy efficiency of future servers by developing a new high-efficiency cooling system at the node and system level. This innovative system, based on two-phase cooling technology, is developed by InQuattro and will be fully integrated with optimized multi-level runtime resource management to increase resource exploitation and performance.

Two Integrated Development Vehicle (IDV) platforms will be developed to demonstrate the capability of this two-phase liquid-cooling technology.
With IDV-A, the project does not develop a new HPC node but focuses on the development of a blade based on an HPC node already under development, where the standard single-phase liquid cooling is replaced by this innovative two-phase liquid cooling.

To demonstrate the improvement brought by such technology, it is important to compare it with a state-of-the-art implementation of single-phase liquid cooling. Atos has therefore selected a node developed for a hybrid blade in OpenSequana. OpenSequana is a new concept in which all interfaces of BullSequana XH3000 blades are published, so that any OEM can develop a blade embedding one or several nodes of its interest and benefit from the BullSequana XH3000 infrastructure for administration, power, and cooling.
This node embeds one host with two CPUs and four 700 W NVIDIA Hopper GPUs.

The BullSequana Direct Liquid Cooling (DLC) technology is based on cold plates and water blocks inside the blades, connected to a secondary loop inside the rack, with a primary loop at up to 40°C at the rack input to allow free cooling (no energy spent to chill the liquid of the primary loop). This technology is capable of efficiently cooling this latest generation of GPUs, but it might reach its limits in a few years, as component power consumption keeps increasing while maximum case temperatures keep decreasing.

Two-phase liquid cooling is an interesting opportunity to improve the cooling efficiency beyond the current technology.

The first challenge is to demonstrate the capability to evacuate the heat generated by several components with a 700 W peak consumption each. Moreover, an OpenSequana blade is 100% liquid-cooled, with no air cooling, in order to maximize blade density and achieve a cooling efficiency close to 1. In IDV-A, as a first step, only the high-consumption CPUs and GPUs will be cooled with two-phase liquid cooling, by developing new water blocks compatible with the two-phase technology. The standard DLC cold plates will be reused to cool the other components, such as DIMMs, disks, voltage regulators, interconnect controllers… A heat exchanger between the two-phase liquid and the secondary loop will be added inside the blade.

The second challenge will be to remove this additional level of heat exchange and to study how the rack can evolve to provide a secondary loop running the InQuattro liquid.

Leading Partner: ATOS

By TEXTAROSSA Project

TEXTAROSSA presented at IPDPS 2022

The TEXTAROSSA project has been presented at the Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL) workshop of IPDPS 2022 by Prof. William Fornaciari (POLIMI).

Title: Design of secure power monitors for accelerators, by exploiting ML techniques, in the Euro-HPC TEXTAROSSA project
Download Slides

By TEXTAROSSA Project

Task-based Programming Models: StarPU

Finding novel programming models to better exploit complex and heterogeneous hardware has been an active research topic for decades. The objective is to help developers parallelize their applications and to increase the efficiency of the resulting programs. Among the successful approaches, the task-based method has demonstrated significant potential.

In this model, a developer splits their program into tasks connected by dependencies, which can usually be represented as a directed acyclic graph. A runtime system is then in charge of executing the task graph by distributing the tasks over the processing units while ensuring a coherent execution. It can use advanced scheduling strategies that attempt to reduce the makespan or the energy consumption, keep the processing units busy, and relieve developers by moving data where it is needed.

StarPU is one such runtime system, developed at Inria since 2009. It targets users with a background in HPC and has been successfully used to parallelize a dozen different numerical methods: dense/sparse linear algebra solvers, FMM, BEM solvers, H-matrix solvers, and seismic simulations, among others. In most of these applications, CPUs and GPUs are used jointly. As such, StarPU is an ideal tool for studying the new technologies and features that will be developed in the TEXTAROSSA project through the task-based method.
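
For readers unfamiliar with the model, the following minimal sketch (our own example, using the public StarPU C API from C++) registers a vector with the runtime and submits a single task on it; dependencies between tasks touching the same data handle are inferred automatically from the declared access modes.

```cpp
#include <starpu.h>
#include <cstdint>
#include <cstdio>

// Task body: scales a vector in place; runs on whatever CPU worker the
// scheduler picks.
static void scal_cpu(void *buffers[], void *cl_arg) {
    auto *v = (struct starpu_vector_interface *)buffers[0];
    float factor = *(float *)cl_arg;
    float *data = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    for (unsigned i = 0; i < n; ++i) data[i] *= factor;
}

int main() {
    if (starpu_init(nullptr) != 0) return 1;

    static struct starpu_codelet cl;
    starpu_codelet_init(&cl);
    cl.cpu_funcs[0] = scal_cpu;   // device variants (e.g., cuda_funcs) go here
    cl.nbuffers = 1;
    cl.modes[0] = STARPU_RW;

    float x[4] = {1, 2, 3, 4}, factor = 2.0f;
    starpu_data_handle_t h;
    starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                (uintptr_t)x, 4, sizeof(float));

    // Submit a task; the runtime tracks dependencies through the handle h.
    starpu_task_insert(&cl, STARPU_RW, h,
                       STARPU_VALUE, &factor, sizeof(factor), 0);

    starpu_data_unregister(h);    // waits for pending tasks on h
    starpu_shutdown();
    std::printf("x[0] = %g\n", x[0]);  // prints 2
    return 0;
}
```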

More precisely, we will study the use of FPGAs in the task-based method and examine their complementarity to classical HPC devices. At a high level, any existing task-based application can quickly use FPGAs by simply providing alternative implementations of its tasks. At the runtime-system level, an FPGA is nothing more than another type of accelerator device. Consequently, the challenges will come mostly from task scheduling and from the optimization of energy consumption. This is why we will provide a new scheduling strategy tackling both aspects, which we will validate on two existing StarPU-based applications: ScalFMM and Chameleon.

Leading Partner: INRIA