

Streaming Programming Models

One of the aims of the TEXTAROSSA project is to define and develop a stream-based programming paradigm able to integrate vertically with the heterogeneous TEXTAROSSA node.

For this activity, the TEXTAROSSA project leverages FastFlow [1], a C++ header-only library that provides application designers with abstractions for parallel programming (e.g., Pipeline, ordered Task-Farm, Divide & Conquer, Parallel-For-Reduce, Macro Data-Flow) and a carefully designed run-time system. At the lower layer, the library defines so-called Building Blocks (BBs), i.e., recurrent data-flow compositions of concurrent activities working in a streaming fashion, which represent the primary abstraction layer for building FastFlow parallel patterns and streaming topologies [2, 3]. A parallel application is conceived by selecting and assembling a small set of BBs that model data and control flows. The BBs can be combined and nested in different ways, forming either acyclic or cyclic concurrency graphs, where nodes are FastFlow concurrent entities and edges are communication channels.

Figure 1: Example of a FastFlow application: the communication topology is described as a composition of building blocks in a data-flow graph.
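
To make the idea of nodes connected by communication channels more concrete, the following is a minimal sketch in Python. It is not FastFlow code and does not use the FastFlow API: the names `stage`, `run_pipeline` and the `EOS` marker are hypothetical, and the sketch assumes one thread per node and one FIFO queue per edge.

```python
import queue
import threading

EOS = object()  # end-of-stream marker (hypothetical name, used only in this sketch)


def stage(func, inbox, outbox):
    """Run one pipeline node: apply func to every item until EOS arrives."""
    while True:
        item = inbox.get()
        if item is EOS:
            outbox.put(EOS)  # propagate termination to the next node
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Create one thread per function, connect them with FIFO channels, stream items through."""
    channels = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, channels[i], channels[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        channels[0].put(item)
    channels[0].put(EOS)
    results = []
    while True:
        out = channels[-1].get()
        if out is EOS:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Two nodes in sequence: increment, then square.
print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * x]))
```

Because each node is a single thread reading from a FIFO channel, the output order matches the input order; richer compositions (farms, feedback edges) would need additional coordination that this sketch leaves out.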

Within the project, the aim is to extend FastFlow with a new offloader node able to delegate computation to an FPGA accelerator hosted on the TEXTAROSSA node. This is done by programmatically loading the desired compute kernel onto the FPGA and streaming the input/output data to/from the FPGA accelerator. FastFlow applications will then be able to use heterogeneous compute resources seamlessly, simply by combining traditional nodes, which run on CPU threads, and offloader nodes, which delegate work to accelerators, in the same concurrency graph.

Figure 2: FastFlow node comparison: Traditional vs. Offloader.

The main challenge in designing the offloader node is how to maximize the performance gain from using the accelerator card. While the accelerator is expected to compute the results of the compute kernel substantially faster than the CPU, communication with the card introduces extra delays. The aim is to design schemes that minimize or hide the extra time needed to send data to and receive data from the accelerator card.
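
One common way to hide transfer time is to overlap it with computation. The sketch below is a simple timing model, not a measurement and not the project's actual scheme. It assumes that sending, computing and receiving take a constant time per item and can proceed concurrently on different items; the function names and the numbers in the example are illustrative only.

```python
def serial_time(n_items, t_send, t_compute, t_recv):
    """Total time when each item is sent, computed and received one after the other."""
    return n_items * (t_send + t_compute + t_recv)


def overlapped_time(n_items, t_send, t_compute, t_recv):
    """Total time under ideal overlap of the three phases.

    The first item pays the full latency; after that, one item completes
    every `bottleneck` time units, where the bottleneck is the slowest phase.
    """
    if n_items == 0:
        return 0.0
    bottleneck = max(t_send, t_compute, t_recv)
    return (t_send + t_compute + t_recv) + (n_items - 1) * bottleneck


# Illustrative numbers (arbitrary time units): transfers dominate the kernel time.
n = 100
print(serial_time(n, 2.0, 1.0, 2.0))      # no overlap
print(overlapped_time(n, 2.0, 1.0, 2.0))  # ideal overlap
```

In this model the overlapped schedule is bounded by the slowest phase, so when transfers dominate, a faster kernel alone does not shorten the run.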

Works Cited
[1] M. Aldinucci, M. Danelutto, P. Kilpatrick and M. Torquati, “FastFlow: High-level and Efficient Streaming on Multi-core,” in Programming Multi-core and Many-core Computing Systems, John Wiley & Sons, Ltd, 2017, pp. 261-280.
[2] M. Torquati, Harnessing Parallelism in Multi/Many-Cores with Streams and Parallel Patterns, University of Pisa, 2019.
[3] M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick and M. Torquati, “Design patterns percolating to parallel programming framework implementation,” International Journal of Parallel Programming, vol. 42, no. 6, pp. 1012-1031, 2013.

Leading Partner: CINI/UNITO


Mixed-Precision Computing

ICT energy use is growing fast and is expected to reach 20% of global demand in 2030, up from the current 5%. Supercomputers are part of this trend, and are also reaching the limits of the power supply that can be provided within a single site.

Approximate computing is a class of techniques to reduce energy consumption across the lifetime of an application. Precision tuning is a subset of approximate computing that aims at trading off the precision of a computation against the time and energy spent on it.
As a simple example, consider the time you would spend to compute the area of a circle (square power of the radius, times Pi), when approximating Pi to 3.14 against the same computation performed using an approximation of Pi to 3.14159265359.
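
The accuracy side of this trade-off can be checked with a few lines of Python. This is only an illustration of the circle example: the radius value and the helper names are arbitrary choices, and the cost side (time and energy) is not modelled here.

```python
import math


def circle_area(radius, pi_approx):
    """Area of a circle using a given approximation of Pi."""
    return pi_approx * radius ** 2


def relative_error(approx, reference):
    """Relative error of approx with respect to reference."""
    return abs(approx - reference) / abs(reference)


radius = 2.0  # arbitrary example value
reference = circle_area(radius, math.pi)
for pi_approx in (3.14, 3.14159265359):
    area = circle_area(radius, pi_approx)
    print(f"Pi ~ {pi_approx}: area = {area:.10f}, "
          f"relative error = {relative_error(area, reference):.2e}")
```

Whether the coarser result is acceptable depends on the error the application can tolerate, which is the quantity precision tuning keeps under control.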

TEXTAROSSA Contributions

TEXTAROSSA develops techniques to automatically transform a program, or a fragment of a program, to use smaller data types while keeping the error under control. A number of adjustments are performed while the program is running to keep the approximation in line with the data set currently being processed.

Our techniques are implemented as part of the LLVM compiler, an industry-standard tool that is used to generate the executable program from the source code produced by an application programmer.
In particular, we extend the TAFFO set of plugins to support heterogeneous accelerators such as graphics cards (GPUs), which are extensively used in supercomputing to provide fast and massively parallel computation facilities.

Leading Partner: CINI/POLIMI


IDV-A prototype

One pillar of TEXTAROSSA is to improve the energy efficiency of future servers by developing a new high-efficiency cooling system at node and system level. This innovative technology, based on two-phase cooling, is developed by InQuattro and will be fully integrated into an optimized multi-level runtime resource management to increase resource exploitation and performance.

Two Integrated Development Vehicle (IDV) platforms will be developed to demonstrate the capability of this two-phase liquid-cooling technology.
With IDV-A, the project does not develop a new HPC node but focuses on the development of a blade based on an HPC node already under development, where the standard single-phase liquid cooling is replaced by this innovative two-phase liquid cooling.

To demonstrate the improvement brought by this technology, it is important to compare it with a state-of-the-art implementation of single-phase liquid cooling. Atos has therefore selected a node developed for a hybrid blade in OpenSequana. OpenSequana is a new concept where all interfaces of BullSequanaXH3000 blades are published, so that any OEM can develop a blade embedding one or several nodes of its interest and benefit from the BullSequanaXH3000 infrastructure for administration, power and cooling.
This node embeds one host with two CPUs and four 700 W NVIDIA Hopper GPUs.

The BullSequana Direct Liquid Cooling (DLC) technology is based on cold plates and water blocks inside the blades, connected to a secondary loop inside the rack, with a primary loop at up to 40°C at rack input to allow free cooling (no energy is spent to cool the liquid of the primary loop). This technology can efficiently cool the latest generation of GPU components, but it might reach its limit in a few years as component power consumption keeps increasing while maximum case temperatures keep decreasing.

Two-phase liquid cooling is an interesting opportunity to improve the cooling efficiency beyond the current technology.

The first challenge is to demonstrate the capability to evacuate the heat generated by several components with 700 W peak consumption. An OpenSequana blade is already 100% liquid cooled: no air is used, which maximizes blade density and achieves a cooling efficiency close to 1. In IDV-A, as a first step, only the high-consumption CPUs and GPUs will be cooled with two-phase liquid cooling, by developing new water blocks compatible with the two-phase technology. The standard DLC cold plate will be reused to cool other components such as DIMMs, disks, voltage regulators and interconnect controllers. One heat exchanger between the two-phase liquid and the secondary loop will be added inside the blade.

The second challenge will be to remove this additional level of heat exchanger and to study the evolution of the rack to provide a secondary loop with the InQuattro liquid.

Leading Partner: ATOS


Task-based Programming Models: StarPU

Finding novel programming models to make better use of complex and heterogeneous hardware has been an active research topic for decades. The objectives are to help developers parallelize their applications and to increase the efficiency of the resulting applications. Among the successful approaches, the task-based method has demonstrated significant potential.

In this model, developers split their program into tasks connected by dependencies, which can usually be represented by a directed acyclic graph. A runtime system is then in charge of executing the task graph by distributing the tasks over the processing units while ensuring a coherent execution. It can use advanced scheduling strategies that attempt to reduce the makespan or energy consumption and keep the processing units busy, and it relieves developers of data management by moving the data when needed.
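
As a rough illustration of what such a runtime does, the sketch below runs a small task graph on a thread pool, releasing each task as soon as all of its dependencies have completed. It is a toy model written for this post, not StarPU code: the function `run_task_graph`, the task names and the convention that each task receives a dictionary of its dependencies' results are all assumptions of the sketch.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task_graph(tasks, deps, workers=4):
    """Execute a DAG of tasks.

    tasks: name -> callable taking a dict {dependency name: result}
    deps:  name -> list of names the task depends on (missing key = no deps)
    """
    results = {}
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining or running:
            # Submit every task whose dependencies are all satisfied.
            ready = [name for name, pending in remaining.items() if not pending]
            for name in ready:
                del remaining[name]
                inputs = {d: results[d] for d in deps.get(name, ())}
                running[pool.submit(tasks[name], inputs)] = name
            if not running:
                raise ValueError("unsatisfiable dependencies (cycle or unknown task)")
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                for pending in remaining.values():
                    pending.discard(name)
    return results


# "a" and "b" are independent and may run in parallel; "c" needs both; "d" needs "c".
tasks = {
    "a": lambda r: 2,
    "b": lambda r: 3,
    "c": lambda r: r["a"] + r["b"],
    "d": lambda r: r["c"] * 10,
}
deps = {"c": ["a", "b"], "d": ["c"]}
print(run_task_graph(tasks, deps)["d"])
```

A real runtime system additionally decides on which processing unit each ready task runs and moves the data accordingly; here every task simply goes to the next free thread.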

StarPU is such a runtime system, and it has been developed at Inria since 2009. It targets users with knowledge in HPC and has been successfully used to parallelize a dozen different numerical methods: dense/sparse linear algebra solvers, FMM, BEM solvers, H-matrix solvers, and seismic simulations, among others. In most of these applications, both CPUs and GPUs are used jointly. As such, StarPU is a natural tool for studying, through the task-based method, the new technologies and features that will be developed in the TEXTAROSSA project.

More precisely, we will study the use of FPGAs in the task-based method and examine how they complement the classical HPC devices. At a high level, any existing task-based application can quickly use FPGAs by simply providing alternative implementations of the tasks. At the runtime system level, an FPGA is nothing more than another type of accelerator device. Consequently, the challenges will come more from task scheduling and from the optimization of energy consumption. This is why we will provide a new scheduling strategy that tackles both aspects, which we will validate on two existing StarPU-based applications: ScalFMM and Chameleon.
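
The idea of one task with several device-specific implementations, and of a scheduler that chooses among them by time or by energy, can be sketched as follows. This is not the StarPU API and not the scheduling strategy the project will provide: the `Implementation` class, the `pick_implementation` function and all time and power figures are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Implementation:
    """One device-specific variant of a task (hypothetical structure)."""
    device: str
    run: Callable
    est_time_s: float   # estimated execution time, made-up value
    est_power_w: float  # estimated average power, made-up value

    @property
    def est_energy_j(self):
        return self.est_time_s * self.est_power_w


def pick_implementation(impls, available_devices, objective="time"):
    """Choose the variant that minimizes estimated time or estimated energy."""
    candidates = [i for i in impls if i.device in available_devices]
    if not candidates:
        raise RuntimeError("no implementation for the available devices")
    if objective == "time":
        return min(candidates, key=lambda i: i.est_time_s)
    if objective == "energy":
        return min(candidates, key=lambda i: i.est_energy_j)
    raise ValueError(f"unknown objective: {objective}")


def scale(xs):
    """The task body; here all variants share the same Python function."""
    return [2 * x for x in xs]


variants = [
    Implementation("cpu", scale, est_time_s=4.0, est_power_w=100.0),
    Implementation("gpu", scale, est_time_s=0.5, est_power_w=300.0),
    Implementation("fpga", scale, est_time_s=1.0, est_power_w=40.0),
]

fastest = pick_implementation(variants, {"cpu", "gpu", "fpga"}, "time")
greenest = pick_implementation(variants, {"cpu", "gpu", "fpga"}, "energy")
print(fastest.device, greenest.device)
```

With these illustrative figures the two objectives select different devices, which is the kind of tension a scheduler that targets both time and energy has to resolve.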

Leading Partner: INRIA


Bit-Packing Compressor in FPGA-Hardware

Figure 1

Figure 2

Figure 3

Data compression is a key issue when storing and transmitting large datasets. Compression algorithms that represent integers with codes of different lengths are effective only if the output stream is tightly packed, and a hardware implementation of these procedures has clear advantages over software execution.

For example, many fast video cameras use high-performance CMOS sensor technology and are commonly adopted to acquire and store scientific images. A typical use case is in nuclear fusion experiments with magnetically confined plasmas, where fast cameras acquire and store images of plasma discharges. A Photron FASTCAM SA4 camera has been in operation at ENEA on the Frascati Tokamak Upgrade and Proto-Sphera tokamak experiments. The FASTCAM SA4 provides up to 3600 frames/s at 1024×1024 pixel resolution with 12-bit pixel depth. The SA4 is supplied with the 16 GB memory option, hence it can capture all the frames in the ~2 s of a plasma discharge at the maximum performance of 3600 fps at 1024×1024 pixel resolution. For each plasma discharge, the camera downloads 9.5 GB of raw uncompressed data, which image processing software can convert to standard formats such as “tif” (fig.1).

The raw data acquired with a digital camera like the SA4 is a measure of radiant intensity, i.e., the quantity of light energy actually reflected from or transmitted through the object being imaged by an analog or digital device. It is the only variable that can be used by processing techniques in quantitative scientific experiments, as shown in fig.2 with a different colour scale. The 12-bit pixel depth, combined with the huge amount of data produced by the SA4 camera, poses a problem of storage versus data integrity.
Since the smallest addressable unit in a digital processor is a byte, preserving the data would require saving each pixel in a two-byte format, so that 4 of every 16 stored bits (25%) carry no information. The alternative could be lossy compression, but this would cause undesirable data loss. To save storage space while preserving data integrity, a Bit-Packing compression algorithm operating at bit level has been developed to reduce the number of bits required to store the 12-bit raw data of the SA4 camera. Roughly speaking, the Bit-Packing algorithm maps two 12-bit pixels to three 8-bit bytes. In this specific compression domain, Bit-Packing is one of the most frequently applied compression schemes. However, (de)compression should not add CPU load at run time, and should be as fast and as energy-efficient as possible. To achieve that target, developing a Bit-Packing algorithm for Field Programmable Gate Arrays (FPGAs) is a candidate solution in line with the aims of the TEXTAROSSA project.
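
The mapping of two 12-bit pixels to three bytes can be sketched in a few lines of Python. The bit layout chosen here (first pixel in the high bits, second pixel in the low bits) is an assumption of this sketch; the text does not specify the layout used on the FPGA, and the function names are hypothetical.

```python
def pack_pair(p0, p1):
    """Pack two 12-bit pixels into three bytes (layout is an assumption of this sketch).

    byte 0: upper 8 bits of p0
    byte 1: lower 4 bits of p0 followed by upper 4 bits of p1
    byte 2: lower 8 bits of p1
    """
    if not (0 <= p0 < 4096 and 0 <= p1 < 4096):
        raise ValueError("pixels must fit in 12 bits")
    return bytes([p0 >> 4, ((p0 & 0xF) << 4) | (p1 >> 8), p1 & 0xFF])


def unpack_pair(packed):
    """Recover the two 12-bit pixels from three packed bytes."""
    p0 = (packed[0] << 4) | (packed[1] >> 4)
    p1 = ((packed[1] & 0xF) << 8) | packed[2]
    return p0, p1


packed = pack_pair(0xABC, 0x123)
print(packed.hex())         # three bytes instead of four
print(unpack_pair(packed))  # the original pixel values are recovered exactly
```

The round trip is lossless because only the unused padding bits are dropped.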

For more than 20 years, High-Level Synthesis (HLS) has been widely investigated and studied as a way to raise the level of abstraction in programmable electronic system design, allowing SW developers to design dedicated computing architectures to be implemented on top of the FPGA (Field Programmable Gate Array) technology.

The advantages of FPGAs come from:
– the huge parallelism available: thousands of DSP modules and millions of small programmable LUTs make it possible to implement designs able to sustain hundreds of GFlop/s, or several TOp/s for simpler limited-precision fixed-point arithmetic

– the high internal memory bandwidth: thousands of independent small block RAMs provide an internal memory bandwidth of several TB/s, which is needed to sustain the high computing performance potentially achievable thanks to the huge HW parallelism

– the availability of IP (Intellectual Property) blocks, on the same FPGA silicon, that implement Arm processor cores, PCIe and memory controllers, high-speed communication links, …

– the low power consumption (a few tens of watts)

Despite these advantages, FPGA adoption in the HPC community has been quite limited, mainly because programming FPGAs required HW design expertise. To overcome this limitation, HLS flows play a fundamental role, as they allow the behavior of the system to be synthesized to be defined by means of high-level languages such as C/C++, Matlab, SYCL, …

In past years, the immaturity of HLS flows was the main obstacle to their widespread adoption: the quality of the produced HW was poor (the efficiency was low), integrating the algorithm accelerated in HW with the other parts running on conventional host computers was sometimes cumbersome, and reprogramming the FPGA was a problem, often requiring a reboot of the system to detect the new HW.

We are now in a phase where HLS flows seem to have reached maturity. Flows such as Vitis from Xilinx or oneAPI from Intel produce quite efficient parallel and pipelined implementations of a design, although the SW programmer still needs some awareness of the architecture to be generated and can control the compiler behavior and the structure of the generated HW through pragmas. These flows take care of low-level details such as the different clock domains, offer simple syntax to manage communication and synchronization among different kernels, and give access to a wide set of pre-implemented IP and computing libraries. HLS flows use the partial reprogramming capabilities of FPGAs so that the interface shell (i.e., the PCIe, memory, and all the I/O interfaces) is kept fixed rather than reprogrammed each time, which solves the rebooting problem. Furthermore, modern HLS flows come with run-time support that allows simple and efficient integration between the application running on the host and the accelerated kernels running on the FPGA.

In the TEXTAROSSA project, one of the main aims is software development for an FPGA accelerator using an HLS flow (namely, Vitis), abstracting from the low-level FPGA details of the Xilinx Alveo U280 in operation at the ENEA FPGA LAB in Portici.

A kernel function implementing the Bit-Packing algorithm deletes the four leading bits of the pixel components of each image in an input stream, producing a compressed image on the output stream.
If m and n are the numbers of bits used to store a component of an NxN image at the input and at the output respectively, the input image has size mN^2/8 [Byte] and the output image nN^2/8 [Byte], so the compression ratio is m/n. We denote by S the size, in bits, of the words of the input and output streams of the kernel; S is constrained to be a power of 2. In the beginning, input and output images are aligned with respect to S, i.e., both the input and output streams have a pixel component starting at bit 0 of the input/output word; to determine how many components must be read before both the input and the output streams are aligned again, we must find the smallest number of pixel components k∈ℕ+ satisfying the following equation (% is the modulus operator):

(k * m) % S = (k * n) % S = 0  	(1)

From (1) it is easy to verify that the solution is:

k=LCM(LCM(m,S)/m,LCM(n,S)/n) 	(2)

where LCM(a,b) returns the least common multiple of a and b. In our case we set the stream size S=512 bits (thus saturating the PCIe bandwidth under the reasonable hypothesis of a clock frequency fck=300 MHz), m=16 bits and n=12 bits; expression (2) gives k=128. Since S bits are read from the stream at a time, data are aligned again after k * m/S = 4 reads from the input stream and k * n/S = 3 writes to the output stream. From these computations, we derive the structure of the Bit-Packing compression function:
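
Expression (2) and the resulting numbers of reads and writes can be checked numerically. The sketch below evaluates (2) and compares it with a brute-force search for the smallest k at which both streams sit on a word boundary; the function names are hypothetical.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def realignment_period(m, n, S):
    """Return (k, reads, writes) from expression (2) for component sizes m, n and word size S."""
    k = lcm(lcm(m, S) // m, lcm(n, S) // n)
    return k, k * m // S, k * n // S


def realignment_period_brute_force(m, n, S):
    """Smallest k >= 1 such that both k*m and k*n are multiples of S."""
    k = 1
    while (k * m) % S or (k * n) % S:
        k += 1
    return k


print(realignment_period(16, 12, 512))  # (128, 4, 3)
```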

for (i=0; i<InputImageSize/S; i+= k*m/S) {
  read k*m/S words from the input stream;
  while (not all the input bits have been copied) {
    copy n bits from the input word to the output word;
    skip (m-n) bits of the input word;
  }
  write the k*n/S words that have just been filled;
}
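
A software model of this loop can help to check the word-level behaviour before moving to HLS. The Python sketch below is not the FPGA kernel: it represents each S-bit stream word as a Python integer and assumes that component j of a word occupies bits j*m to (j+1)*m-1 counted from the least significant bit, with the meaningful n bits in the low part of each component. These layout choices and the function name are assumptions of the sketch.

```python
def bitpack_words(words_in, m=16, n=12, S=512, k=128):
    """Model of the Bit-Packing loop on a list of S-bit words (Python ints).

    Every group of k*m/S input words yields k*n/S output words.
    """
    reads, writes = k * m // S, k * n // S  # 4 reads and 3 writes with the defaults
    if S % m or len(words_in) % reads:
        raise ValueError("input must be whole groups of words holding whole components")
    component_mask = (1 << n) - 1
    word_mask = (1 << S) - 1
    words_out = []
    for i in range(0, len(words_in), reads):
        packed, filled = 0, 0
        for word in words_in[i:i + reads]:
            for j in range(S // m):
                # Copy the n low bits of the component, skip its (m-n) leading bits.
                component = (word >> (j * m)) & component_mask
                packed |= component << filled
                filled += n
        for t in range(writes):
            words_out.append((packed >> (t * S)) & word_mask)
    return words_out


# Build one aligned group: 128 components of 16 bits each in four 512-bit words.
pixels = [(37 * i) % 4096 for i in range(128)]
words = [sum(pixels[w * 32 + j] << (16 * j) for j in range(32)) for w in range(4)]
compressed = bitpack_words(words)
print(len(words), "input words ->", len(compressed), "output words")
```

Unpacking the three output words 12 bits at a time returns the original pixel values, which confirms that no information is lost in this model.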

The Bit-Packing compressed version of the image in fig.2 is shown in fig.3.

Benchmarking sessions are ongoing in order to evaluate Time and Energy to Solution within the WP1 and WP4 activities of TEXTAROSSA.

Leading Partner: ENEA
People: Francesco Iannone, Paolo Palazzari


INFN progress in neural network research for HPC

INFN started tackling neural simulations with its own engine, the Distributed and Plastic Spiking Neural Network (DPSNN), which is a scalable C++/MPI code for HPC platforms at extreme scales simulating the spiking dynamics of a brain cortex modeled as a grid of cortical columns populated with neurons and their interconnecting synapses.
It was first used to model brain cortex behaviour, with a special focus on sleep-like states, and to gauge compute and power efficiency on different architectures.
INFN has now transitioned to another, more versatile tool, the NEST Simulator; this is a C++/MPI/OpenMP code by the NEST Initiative that gives the user a domain-specific language to design a virtual neurophysiology experiment, from the equations driving the dynamics of the components of interest in the cortex (with a rich library of many types of both neurons and synapses ready to be used) to the topology of their interconnections (the so-called connectome), all the way to the necessary supporting tools, like probing or stimulating electrodes that read or inject electrical currents into the simulated cortex.
NEST offers the experimenter an intuitive Python interface to easily set up a detailed and complex protocol of interaction between a simulated cortex and a set of external stimuli.
Coupled with the huge set of tools for analysis, visualization and data transformation available to the Python user, NEST allows for a compact yet expressive way to perform even very involved neural simulations. INFN has used NEST to implement a biologically-inspired thalamo-cortical model that can be trained to classify handwritten digits from the MNIST dataset and then mimic the wake-sleep cycle, in order to test the enhancing effects of sleep on the quality of learning and recognition, even in noisy environments.
NEST does not currently support running on GPUs; INFN is therefore closely following and collaborating in the development of NeuronGPU (by B. Golosio), a CUDA/C++/MPI code that replicates many features of NEST with a similar Python interface while targeting high performance on NVIDIA GPUs, and which is poised to be integrated in the near future in some capacity into NEST as a GPU-enabling component called NEST-GPU.


MathLib: Colella’s Dwarves

Mathematical libraries are a key component of the software stack for exascale-class applications. Indeed, the deployment of exascale computing depends on the development of new, suitable numerical algorithms, since most existing ones cannot cope with many of the issues arising in the race to exascale.
Mathematical libraries provide building blocks implementing up-to-date methods and algorithms that application developers can reuse in the form of highly reliable and high-performance components.

Each new generation of computer architectures brings new challenges in achieving high-performance mathematical solvers. A bi-directional interaction between new computer architectures/environments and new mathematical software is ever more crucial for the deployment and effective use of new technology to advance Science, Industry and Society.

The need to improve computing performance in the near and medium term indicates that exascale and also post-exascale platforms will continue to emphasize heterogeneity. This type of architecture exploits, at node level, accelerators such as modern GPUs and reconfigurable hardware such as FPGAs to boost both performance and energy efficiency. As computer architectures become heterogeneous, there is a need for algorithms that support mixed precision and minimize communication among the different devices and memory levels. The new power-to-solution metric requires rethinking many computational kernels of HPC applications, looking for the best trade-off between reducing energy consumption and minimizing time-to-solution, while promoting reproducibility and scalability.

We will provide new high-performance algorithms and software modules for some of the so-called Colella’s dwarves, the classes of numerical methods that Colella identified as crucial for science and engineering. In particular, we will focus on algorithms and software for sparse linear algebra, where data sets include many zero values and are usually stored in compressed data structures to reduce storage and memory bandwidth requirements; they are generally accessed with indexed loads and stores, and the main computational kernels are communication bound.
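
As an example of a compressed data structure and of indexed accesses, the sketch below stores a matrix in the widely used Compressed Sparse Row (CSR) format and multiplies it by a vector. It is a plain Python illustration with hypothetical function names, not a kernel of the library described here, and CSR is just one of several possible storage formats.

```python
def dense_to_csr(matrix):
    """Convert a dense matrix (list of rows) to CSR arrays: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # where the next row starts in values/col_idx
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x with A given in CSR format."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]  # indexed load: x is read through col_idx
        y.append(acc)
    return y


A = [
    [4, 0, 0],
    [0, 0, 2],
    [1, 0, 3],
]
values, col_idx, row_ptr = dense_to_csr(A)
print(values, col_idx, row_ptr)
print(csr_matvec(values, col_idx, row_ptr, [1, 2, 3]))
```

Only the non-zero entries are stored and touched, but each of them costs an extra index lookup and an irregular access to `x`, so the kernel performs few arithmetic operations per byte moved.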

The library kernels will be of immediate use in a wide range of applications, ranging from classical scientific simulation to AI techniques, including automatic pattern recognition in complex systems, and will be tested in some of the applications proposed as use cases in this project.

Leading Partner: CNR


The TEXTAROSSA Co-Design Approach


Supercomputing offers access to enormous amounts of computational power that are needed by many applications in different scientific and industrial sectors. Public services such as weather predictions require such capabilities, as do critical industries like Oil & Gas (for fuel deposit discovery) and Pharmaceutical (for drug design). Scientific discoveries in fields such as quantum physics or high energy physics are also made possible by supercomputers.

However, increasingly powerful supercomputers are hitting a ceiling imposed by the ability to provide (and sustain) electrical power through the grid. To avoid this limitation, supercomputing hardware designers need to rely on systems that are less power-hungry, but more difficult for application designers to effectively use, due to characteristics such as heterogeneity (that is, the use of processing elements different from the typical “processor” that is commonly found also in laptop and desktop personal computers) and reconfigurability (that is, the use of systems whose functions are programmable at the hardware rather than software level).

TEXTAROSSA Contributions

TEXTAROSSA aims at making the advantages of reconfigurable hardware and associated technical advances available to application developers by means of a co-design approach.
Whereas in standard supercomputing application design, the hardware is given, and the application developer works only at creating the software application, in co-design, hardware and software are designed together, at least in part.

TEXTAROSSA will leverage co-design, which was first developed in the context of embedded systems, through a new Integrated Development Vehicle, a hardware-software platform for supercomputing including reconfigurable hardware elements. TEXTAROSSA aims at providing tools that will help the application developer in designing and implementing the application on this type of platform, semi-automatically performing the tasks of deciding which activities will be performed on the reconfigurable hardware, and producing optimized hardware accelerators for those activities.