Author Archive TEXTAROSSA Project

By TEXTAROSSA Project

Mixed-Precision Computing

Motivation
ICT energy use is growing fast and is expected to reach 20% of global demand by 2030, up from the current 5% (https://www.nature.com/articles/d41586-018-06610-y). Supercomputers are part of this trend, and they are also reaching the limits of the power supply that can be provided within a single site.

Approximate computing is a class of techniques that reduce the energy consumption across the lifetime of an application. Precision tuning is a subset of approximate computing that trades off the precision of a computation against the time and energy spent on it.
As a simple example, consider the time you would spend computing the area of a circle (the square of the radius, times Pi) when approximating Pi as 3.14, compared with the same computation performed with Pi approximated as 3.14159265359.
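The circle example can be turned into a small Python sketch that puts a number on the precision side of the trade-off. The function names below are illustrative and the radius is an arbitrary choice; the sketch only measures the error introduced by each approximation of Pi, not the time or energy saved.

```python
import math

def circle_area(radius, pi_approx):
    """Area of a circle computed with a caller-supplied approximation of Pi."""
    return pi_approx * radius ** 2

def relative_error(approx, reference):
    """Relative distance between an approximate value and a reference value."""
    return abs(approx - reference) / abs(reference)

radius = 2.0  # arbitrary radius chosen for the illustration
reference = circle_area(radius, math.pi)

coarse_err = relative_error(circle_area(radius, 3.14), reference)
fine_err = relative_error(circle_area(radius, 3.14159265359), reference)

print(f"Pi ~ 3.14          -> relative error {coarse_err:.2e}")
print(f"Pi ~ 3.14159265359 -> relative error {fine_err:.2e}")
```

Whether the coarser result is acceptable depends on the application, which is exactly the decision that precision tuning tries to automate.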

TEXTAROSSA Contributions

TEXTAROSSA develops techniques that automatically transform a program, or a fragment of a program, to use smaller data types while keeping the error under control. The transformation is adjusted while the program runs, so that the approximation stays in line with the data set currently being processed.

Our techniques are implemented as part of the LLVM compiler, an industry-standard tool that is used to generate the executable program from the source code produced by an application programmer.
In particular, we extend the TAFFO (https://taffo-org.github.io/) set of plugins to support heterogeneous accelerators such as graphics cards (GPUs), which are extensively used in supercomputing to provide fast and massively parallel computation facilities.
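To give a flavour of what "smaller data with the error under control" can mean, here is a minimal Python sketch that assumes fixed-point numbers as the smaller data type. It is not the TAFFO implementation and does not use the LLVM infrastructure: the helper names, the dot-product kernel and the search over bit widths are all assumptions made for the illustration.

```python
def to_fixed(value, frac_bits):
    """Quantize a real value to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))

def fixed_dot(xs, ys, frac_bits):
    """Dot product computed with integer arithmetic on quantized inputs."""
    acc = 0
    for x, y in zip(xs, ys):
        # Each product carries 2 * frac_bits fractional bits.
        acc += to_fixed(x, frac_bits) * to_fixed(y, frac_bits)
    return acc / (1 << (2 * frac_bits))

def smallest_frac_bits(xs, ys, max_abs_error, candidates=range(1, 33)):
    """Smallest fractional width whose result stays within max_abs_error, or None."""
    reference = sum(x * y for x, y in zip(xs, ys))
    for bits in candidates:
        if abs(fixed_dot(xs, ys, bits) - reference) <= max_abs_error:
            return bits
    return None

# Illustrative data set: a different data set may need a different width.
xs = [0.1, 0.7, -0.3]
ys = [1.5, -2.25, 0.8]
for tolerance in (1e-1, 1e-3, 1e-6):
    print(tolerance, smallest_frac_bits(xs, ys, tolerance))
```

The search depends on the input data, which is why the approximation has to be revisited when the data set changes.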

Leading Partner: CINI/POLIMI


TEXTAROSSA @ SAMOS conference

Professor William Fornaciari (POLIMI) gave a keynote speech at the SAMOS conference entitled “Design of Secure Power Monitors for Hardware Accelerators”.

Abstract, Slides, and Video will be published on the SAMOS conference dedicated page.


Today, June 27th, 2022, the TEXTAROSSA Project Technical Committee meets in Rome for the first time after the pandemic forced us to work mostly remotely. We’re glad to be back working together in person!


IDV-A prototype

One pillar of TEXTAROSSA is to improve the energy efficiency of future servers by developing a new high-efficiency cooling system at node and system level. This innovative system, based on two-phase cooling technology, is developed by InQuattro and will be fully integrated into an optimized multi-level runtime resource management to increase resource exploitation and performance.

Two Integrated Development Vehicle (IDV) platforms will be developed to demonstrate the capability of this two-phase liquid-cooling technology.
With IDV-A, the project does not develop a new HPC node; it focuses on a blade built around an HPC node already under development, in which the standard single-phase liquid cooling is replaced by the innovative two-phase liquid cooling.

To demonstrate the improvement brought by this technology, it is important to compare it with a state-of-the-art implementation of single-phase liquid cooling. Atos has therefore selected a node developed for a hybrid blade in OpenSequana. OpenSequana is a new concept in which all interfaces of BullSequana XH3000 blades are published, so that any OEM can develop a blade embedding one or several nodes of its interest and benefit from the BullSequana XH3000 infrastructure for administration, power and cooling.
This node embeds one host with two CPUs and four 700 W NVIDIA Hopper GPUs.

The BullSequana Direct Liquid Cooling (DLC) technology is based on cold plates and water blocks inside the blades, connected to a secondary loop inside the rack, with a primary loop at up to 40°C at rack input to allow free cooling (no energy is spent to cool the liquid of the primary loop). This technology cools the latest generation of GPUs efficiently, but it might reach its limits in a few years if component power consumption keeps increasing while maximum case temperatures keep decreasing.

Two-phase liquid cooling is an interesting opportunity to improve the cooling efficiency beyond the current technology.
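A back-of-the-envelope Python sketch can illustrate why evaporation is attractive: a single-phase loop carries heat as a temperature rise of the liquid, while a two-phase loop absorbs it as latent heat. Only the 700 W figure comes from the text above. The 10 K temperature rise and the latent heat of 200 kJ/kg are placeholder assumptions chosen for the illustration; they are not properties of the InQuattro fluid or measurements from the project.

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K), liquid water near room temperature

def single_phase_flow(power_w, delta_t_k, specific_heat=WATER_SPECIFIC_HEAT):
    """Mass flow (kg/s) needed to carry power_w as sensible heat with a delta_t_k rise."""
    return power_w / (specific_heat * delta_t_k)

def two_phase_flow(power_w, latent_heat_j_per_kg, vapor_fraction=1.0):
    """Mass flow (kg/s) needed when vapor_fraction of the liquid evaporates."""
    return power_w / (latent_heat_j_per_kg * vapor_fraction)

gpu_power_w = 700.0               # peak consumption of one GPU, from the text
assumed_delta_t_k = 10.0          # assumption: allowed coolant temperature rise
assumed_latent_heat = 200_000.0   # assumption: placeholder latent heat in J/kg

print("single-phase kg/s:", single_phase_flow(gpu_power_w, assumed_delta_t_k))
print("two-phase kg/s:   ", two_phase_flow(gpu_power_w, assumed_latent_heat))
```

With different fluid properties the comparison changes, so the numbers printed here should be read only as an order-of-magnitude illustration of the two heat-transport mechanisms.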

The first challenge is to demonstrate the capability to evacuate the heat generated by several components with 700 W peak consumption. An OpenSequana blade is 100% liquid cooled: no air is used, in order to maximize the blade density and achieve a cooling efficiency close to 1. In IDV-A, as a first step, only the high-consumption CPUs and GPUs will be cooled with two-phase liquid cooling, through new water blocks compatible with the two-phase technology. The standard DLC cold plate will be reused to cool the other components, such as DIMMs, disks, voltage regulators, interconnect controllers… A heat exchanger between the two-phase liquid and the secondary loop will be added inside the blade.

The second challenge will be to remove this additional level of heat exchanger and to study the evolution of the rack to provide a secondary loop with the InQuattro liquid.

Leading Partner: ATOS


TEXTAROSSA presented at IPDPS 2022

The TEXTAROSSA project was presented at the Scalable Deep Learning over Parallel And Distributed Infrastructures (SCADL) workshop of IPDPS 2022 by Prof. William Fornaciari (POLIMI).

Title: Design of secure power monitors for accelerators, by exploiting ML techniques, in the Euro-HPC TEXTAROSSA project


Task-based Programming Models: StarPU

Finding novel programming models to better use complex and heterogeneous hardware has been an active research topic for decades. The objectives are to help developers parallelize their applications and to increase the efficiency of the resulting applications. Among the successful approaches, the task-based method has demonstrated significant potential.

In this model, developers split their program into tasks connected by dependencies, which can usually be represented by a directed acyclic graph. A runtime system is then in charge of executing the task graph by distributing the tasks over the processing units while ensuring a coherent execution. It can use advanced scheduling strategies that attempt to reduce the makespan or the energy consumption and keep the processing units busy, and it relieves developers of data management by moving the data when needed.
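The task-graph idea can be sketched in a few lines of Python with the standard library module graphlib. This is not the StarPU API: the function run_task_graph and the four toy tasks are hypothetical, and the sketch runs the ready tasks one after another where a real runtime system would dispatch them to different processing units.

```python
from graphlib import TopologicalSorter

def run_task_graph(tasks, dependencies):
    """Run every task once, in an order that respects the dependencies.

    tasks:        dict mapping task name -> callable taking the results so far
    dependencies: dict mapping task name -> set of task names it depends on
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results, order = {}, []
    while sorter.is_active():
        # Tasks whose dependencies are all satisfied; sorted for determinism.
        for name in sorted(sorter.get_ready()):
            results[name] = tasks[name](results)
            order.append(name)
            sorter.done(name)
    return results, order

tasks = {
    "load_a": lambda r: [1, 2, 3],
    "load_b": lambda r: [4, 5, 6],
    "multiply": lambda r: [x * y for x, y in zip(r["load_a"], r["load_b"])],
    "reduce": lambda r: sum(r["multiply"]),
}
dependencies = {
    "load_a": set(),
    "load_b": set(),
    "multiply": {"load_a", "load_b"},
    "reduce": {"multiply"},
}
results, order = run_task_graph(tasks, dependencies)
print(order, results["reduce"])
```

In this toy graph the two load tasks have no dependency on each other, so a parallel runtime would be free to execute them at the same time.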

StarPU is such a runtime system, and it has been developed at Inria since 2009. It targets users with knowledge in HPC and has been successfully used to parallelize a dozen different numerical methods: dense/sparse linear algebra solvers, FMM, BEM solvers, H-matrix solvers, and seismic simulations, among others. In most of these applications, both CPUs and GPUs are used jointly. As such, StarPU is the perfect tool to study, through the task-based method, the new technologies and features that will be developed in the TEXTAROSSA project.

More precisely, we will study the use of FPGAs in the task-based method and examine their complementarity to the classical HPC devices. At a high level, any existing task-based application can quickly use FPGAs by simply providing some alternative implementations of the tasks. At the runtime system level, an FPGA is nothing more than another type of device accelerator. Consequently, the challenges will come more from the task scheduling and optimization of the energy consumption. This is why we will provide a new scheduling strategy to tackle both aspects that we will validate on two existing StarPU-based applications: ScalFMM and Chameleon.
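The idea that a task only needs an alternative implementation to become usable on an FPGA can be illustrated with a small Python sketch. Everything in it is hypothetical: the function pick_implementation, the device names and the cost figures are invented for the illustration, and the selection rule (lowest estimated cost) is a deliberately simple stand-in, not the scheduling strategy that the project will provide.

```python
def pick_implementation(implementations, available_devices, estimated_cost):
    """Pick the cheapest usable implementation of a task.

    implementations:   dict device kind -> callable implementing the task
    available_devices: set of device kinds present on the node
    estimated_cost:    dict device kind -> estimated cost (e.g. energy)
    """
    usable = [kind for kind in implementations if kind in available_devices]
    if not usable:
        raise ValueError("no implementation for the available devices")
    best = min(usable, key=lambda kind: estimated_cost[kind])
    return best, implementations[best]

def scale_cpu(values, factor):
    return [v * factor for v in values]

def scale_fpga(values, factor):
    # Stand-in for launching an FPGA kernel: same result, different device.
    return [v * factor for v in values]

# One task, two alternative implementations.
scale_task = {"cpu": scale_cpu, "fpga": scale_fpga}
costs = {"cpu": 5.0, "fpga": 1.5}  # hypothetical estimates, not measurements

device, run = pick_implementation(scale_task, {"cpu", "fpga"}, costs)
result = run([1, 2, 3], 2)
print(device, result)
```

The application code that submits the task does not change when the FPGA variant is added; only the table of implementations grows.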

Leading Partner: INRIA


TEXTAROSSA @ INFN Workshop on Computing

Today (May 25, 2022), Alessandro Lonardo (INFN) is presenting the TEXTAROSSA project at the annual INFN Workshop on Computing.


Bit-Packing Compressor in FPGA-Hardware


Figure 1

Figure 2

Figure 3

Data compression is a key issue for storing and transmitting large datasets. Compression algorithms that represent integers with codes of different lengths are effective only if the output stream is tightly packed, and a hardware implementation of these procedures has clear advantages over software execution.

For example, many fast video cameras use high-performance CMOS sensors and are commonly adopted to acquire and store scientific images. A typical use case is in nuclear fusion experiments on magnetically confined plasmas, where fast cameras acquire and store images of plasma discharges. A Photron FASTCAM SA4 camera has been in operation at ENEA on the Frascati Tokamak Upgrade and Proto-Sphera tokamak experiments. The FASTCAM SA4 provides up to 3600 frames/s at 1024×1024 pixel resolution with 12-bit pixel depth. The SA4 is supplied with the 16 GB memory option, hence it can capture all the frames in the ~2 s of a plasma discharge at the maximum performance of 3600 fps and 1024×1024 pixel resolution. For each plasma discharge the camera downloads 9.5 GB of raw uncompressed data, which image processing software can convert to a standard format such as “tif” (fig.1).

The raw data acquired with a digital camera like the SA4 is a measure of radiant intensity, that is, the quantity of light energy actually reflected from or transmitted through the object being imaged. It is the only variable that can be used by processing techniques in quantitative scientific experiments, as shown in fig.2 with a different color scale. The 12-bit pixel depth, combined with the huge amount of data produced by the SA4 camera, poses a problem of storage versus data integrity.
Because the smallest addressable unit in a digital processor is a byte, preserving the data would require saving each pixel in a two-byte format, so that 4 of every 16 stored bits (25%) are padding. The alternative could be lossy compression, but this would cause undesirable data loss. To save storage space while preserving data integrity, a Bit-Packing compression algorithm operating at bit level has been developed to reduce the number of bits required to store the 12-bit raw data of the SA4 camera. Roughly speaking, the Bit-Packing algorithm maps two 12-bit pixels to three 8-bit bytes. In this specific compression domain, Bit-Packing is one of the most frequently applied compression schemes. However, (de)compression should not add any CPU load at run time; it should be as fast as possible and efficient in terms of energy consumption. To achieve that target, developing the Bit-Packing algorithm for Field Programmable Gate Arrays (FPGAs) is a candidate solution in line with the aims of the TEXTAROSSA project.
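The mapping of two 12-bit pixels onto three bytes can be sketched in software as follows. The byte layout (first pixel in the most significant bits, big-endian byte order) is an assumption made for this illustration; the text does not specify the layout used by the FPGA kernel, and the function names are hypothetical.

```python
def pack_pair(p0, p1):
    """Pack two 12-bit pixel values into three bytes (24 bits, no padding)."""
    if not (0 <= p0 < 4096 and 0 <= p1 < 4096):
        raise ValueError("pixels must fit in 12 bits")
    combined = (p0 << 12) | p1  # 24-bit integer with p0 in the upper half
    return combined.to_bytes(3, "big")

def unpack_pair(packed):
    """Recover the two 12-bit pixel values from three packed bytes."""
    combined = int.from_bytes(packed, "big")
    return combined >> 12, combined & 0xFFF

packed = pack_pair(0xABC, 0x123)
print(packed.hex(), unpack_pair(packed))
```

Stored as two-byte values the same pair would take four bytes, so the packed form needs three quarters of the space and the round trip loses no information.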

For more than 20 years, High-Level Synthesis (HLS) has been widely investigated and studied as a way to raise the level of abstraction in programmable electronic system design, allowing SW developers to design dedicated computing architectures to be implemented on top of the FPGA (Field Programmable Gate Array) technology.

The advantages of FPGAs come from:
– the huge parallelism available: thousands of DSP modules and millions of small programmable LUTs make it possible to implement designs that sustain hundreds of GFlop/s, or several TOp/s for simpler limited-precision fixed-point arithmetic

– the high internal memory bandwidth: thousands of independent small block RAMs provide an internal memory bandwidth of several TB/s, which is needed to sustain the high computing performance made possible by the huge HW parallelism

– the availability, on the same FPGA silicon, of IP (Intellectual Property) blocks that implement Arm processor cores, PCIe and memory controllers, high-speed communication links, …

– the low power consumption (a few tens of watts)

Despite these advantages, FPGA adoption in the HPC community has been quite limited, mainly because programming FPGAs required HW design expertise. To overcome this limitation, HLS flows play a fundamental role, as they allow the behavior of the system to be synthesized to be defined in high-level languages such as C/C++, Matlab, SYCL, …

In past years, the immaturity of HLS flows was the main obstacle to their widespread adoption: the quality of the produced HW was poor (the efficiency was low), integrating the algorithm accelerated in HW with the other parts running on conventional host computers was sometimes cumbersome, and reprogramming the FPGA was a problem, often requiring a reboot of the system to detect the new HW.

We are now in a phase where HLS flows seem to have reached maturity. Flows such as Vitis from Xilinx or oneAPI from Intel produce quite efficient parallel and pipelined implementations of a design, although the SW programmer still needs some awareness of the architecture to be generated and can control the compiler behavior and the structure of the generated HW through pragmas. These flows take care of low-level details such as the different clock domains, offer a simple syntax to manage communication and synchronization among different kernels, and give access to a wide set of pre-implemented IP and computing libraries. HLS flows use the partial reprogramming capabilities of FPGAs so that the interface shell (i.e., the PCIe, memory, and all the I/O interfaces) is kept fixed instead of being reprogrammed each time, which solves the rebooting problem. Furthermore, modern HLS flows come with run-time support that allows simple and efficient integration between the application running on the host and the accelerated kernels running on the FPGA.

In the TEXTAROSSA project, one of the main aims is software development for an FPGA accelerator using an HLS flow (namely, Vitis), abstracting away the low-level FPGA details of the Xilinx Alveo U280 in operation at the ENEA FPGA LAB in Portici.

A kernel function implementing the Bit-Packing algorithm deletes the four leading bits of the pixel components of each image read from an input stream, producing a compressed image on the output stream.
If m and n are the numbers of bits used to store a component of an NxN image at the input and at the output respectively, the input image has size mN^2/8 bytes and the output image nN^2/8 bytes, so the compression ratio is m/n. We denote by S the size, in bits, of the words of the input and output streams of the kernel; S is constrained to be a power of 2. In the beginning, the input and output images are aligned with respect to S, i.e., both the input and output streams have a pixel component starting at bit 0 of the input/output word. To determine how many components must be read before both streams are aligned again, we must find the smallest number of pixel components k∈ℕ+ satisfying the following equation (% is the modulus operator):

(k * m) % S = (k * n) % S = 0  	(1)

From (1) it is easy to verify that the solution is:

k=LCM(LCM(m,S)/m,LCM(n,S)/n) 	(2)

where LCM(a,b) returns the Least Common Multiple of a and b. In our case we set the stream size S=512 bits (thus saturating the PCIe bandwidth under the reasonable hypothesis of a clock frequency fck=300 MHz), m=16 bits and n=12 bits; expression (2) gives k=128. As each access to a stream transfers S bits, data are aligned again after k * m/S=4 reads from the input stream and k * n/S=3 writes to the output stream. From these computations, we derive the structure of the Bit-Packing compression function:
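Expression (2) can be checked numerically with a short Python sketch. It reads "aligned" as both k*m and k*n being multiples of S, and compares the closed form against a brute-force search; the function names are illustrative, and math.lcm requires Python 3.9 or later.

```python
from math import lcm

def realignment_period(m, n, S):
    """Closed form (2): components to process before both streams realign."""
    return lcm(lcm(m, S) // m, lcm(n, S) // n)

def brute_force_period(m, n, S):
    """Smallest k >= 1 such that k*m and k*n are both multiples of S."""
    k = 1
    while (k * m) % S != 0 or (k * n) % S != 0:
        k += 1
    return k

m, n, S = 16, 12, 512  # values used in the text
k = realignment_period(m, n, S)
reads = k * m // S     # input words consumed per period
writes = k * n // S    # output words produced per period
print(k, reads, writes, brute_force_period(m, n, S))
```

The brute-force search is only there as a cross-check; for the values in the text both approaches agree on k=128, with 4 reads and 3 writes per period.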

for (i=0; i<InputImageSize/S; i+= k*m/S) {
  read k*m/S words from the input stream;
  while (not all the input bits have been copied) {
    copy n bits from the input word to the output word;
    skip (m-n) bits of the input word;
  }
  write the k*n/S words that have just been filled;
}
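The loop structure above can be emulated in software with the following Python sketch, which models each S-bit stream word as a Python integer. It is not the Vitis kernel: the bit ordering (component 0 at bit 0 of each word, pixel value in the low n bits of each component) and the helper functions words_from_pixels and unpack_stream are assumptions added to make the example self-checking.

```python
from math import lcm

S, M, N = 512, 16, 12                      # word width, input and output bits
K = lcm(lcm(M, S) // M, lcm(N, S) // N)    # components per period (128)
READS, WRITES = K * M // S, K * N // S     # 4 input words -> 3 output words

def words_from_pixels(pixels):
    """Build S-bit input words holding one pixel per M-bit component."""
    per_word = S // M
    words = []
    for start in range(0, len(pixels), per_word):
        word = 0
        for slot, pixel in enumerate(pixels[start:start + per_word]):
            word |= pixel << (slot * M)
        words.append(word)
    return words

def pack_stream(input_words):
    """Compress S-bit input words into tightly packed S-bit output words."""
    if len(input_words) % READS:
        raise ValueError("input length must be a multiple of READS words")
    output_words = []
    for start in range(0, len(input_words), READS):
        acc, acc_bits = 0, 0               # accumulator for the output side
        for word in input_words[start:start + READS]:
            for slot in range(S // M):
                component = (word >> (slot * M)) & ((1 << M) - 1)
                # Copy N bits and skip the M-N leading bits of the component.
                acc |= (component & ((1 << N) - 1)) << acc_bits
                acc_bits += N
        for i in range(WRITES):            # acc now holds WRITES * S bits
            output_words.append((acc >> (i * S)) & ((1 << S) - 1))
    return output_words

def unpack_stream(output_words):
    """Recover the N-bit pixels from packed output words (for checking)."""
    pixels = []
    for start in range(0, len(output_words), WRITES):
        acc = 0
        for i, word in enumerate(output_words[start:start + WRITES]):
            acc |= word << (i * S)
        pixels.extend((acc >> (j * N)) & ((1 << N) - 1) for j in range(K))
    return pixels

pixels = [(i * 37) % 4096 for i in range(2 * K)]   # synthetic 12-bit test data
packed = pack_stream(words_from_pixels(pixels))
print(len(packed), unpack_stream(packed) == pixels)
```

A hardware implementation would process the components of a word in parallel rather than in a Python loop, but the read, copy/skip and write steps correspond one to one.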

Fig.3 shows the Bit-Packing compressed version of the image in fig.2.

Benchmarking sessions are ongoing in order to evaluate Time and Energy to Solution within the WP1 and WP4 activities of TEXTAROSSA.

Leading Partner: ENEA
People: Francesco Iannone, Paolo Palazzari


INFN progress in neural network research for HPC

INFN started tackling neural simulations with its own engine, the Distributed and Plastic Spiking Neural Network (DPSNN), a scalable C++/MPI code for HPC platforms at extreme scales. It simulates the spiking dynamics of a brain cortex modeled as a grid of cortical columns populated with neurons and their interconnecting synapses.
It was first used to model brain cortex behaviour, with a special focus on sleep-like states, and to gauge compute and power efficiency on different architectures.
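For readers unfamiliar with spiking dynamics, the following Python sketch simulates a single leaky integrate-and-fire neuron, one of the simplest spiking models. It is not taken from DPSNN: the model choice, the forward-Euler integration and all parameter values are assumptions made for the illustration, and a cortex simulation couples a very large number of such units through synapses.

```python
def simulate_lif(input_current, dt=0.1, tau=10.0, v_rest=-65.0, v_reset=-65.0,
                 v_threshold=-50.0, resistance=10.0):
    """Forward-Euler simulation of one leaky integrate-and-fire neuron.

    input_current: list of input currents, one value per time step
    Returns (membrane potentials, indices of the steps where a spike occurred).
    """
    v = v_rest
    potentials, spikes = [], []
    for step, current in enumerate(input_current):
        # Leak towards the resting potential plus the drive from the input.
        dv = (-(v - v_rest) + resistance * current) / tau
        v += dt * dv
        if v >= v_threshold:   # threshold crossing: emit a spike and reset
            spikes.append(step)
            v = v_reset
        potentials.append(v)
    return potentials, spikes

# Illustrative inputs: a weak and a strong constant current.
_, weak_spikes = simulate_lif([1.0] * 1000)
_, strong_spikes = simulate_lif([2.0] * 1000)
print(len(weak_spikes), len(strong_spikes))
```

With these illustrative parameters the weak input settles below the threshold and never fires, while the strong input produces a regular spike train.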
INFN has now transitioned to another, more versatile tool, the NEST Simulator. This is a C++/MPI/OpenMP code by the NEST Initiative (https://nest-initiative.org) that gives the user a domain-specific language to design a virtual neurophysiology experiment: from the equations driving the dynamics of the components of interest in the cortex (with a rich, ready-to-use library of many types of neurons and synapses), to the topology of their interconnections (the so-called connectome), all the way to the necessary supporting tools, like probing or stimulating electrodes that read or inject electrical currents into the simulated cortex.
NEST offers the experimenter an intuitive Python interface to easily set up a detailed and complex protocol of interaction between a simulated cortex and a set of external stimuli.
Coupled with the huge set of tools for analysis, visualization and data transformation available to the Python user, NEST allows for a compact yet expressive way to perform even very involved neural simulations. INFN has used NEST to implement a biologically inspired thalamo-cortical model that can be trained to classify handwritten digits from the MNIST dataset and then mimic the wake-sleep cycle, in order to test the enhancing effects of sleep on the quality of learning and recognition, even in noisy environments.
NEST does not currently support running on GPUs. INFN is therefore closely following and collaborating in the development of NeuronGPU (by B. Golosio), a CUDA/C++/MPI code that replicates many features of NEST with a similar Python interface while targeting high performance on NVIDIA GPUs, and which is poised to be integrated into NEST in the near future, in some capacity, as a GPU-enabling component called NEST-GPU.


In Quattro and POLIMI testing the new two-phase cooling system.



In Quattro and Politecnico di Milano are collecting experimental data at the HEAP Lab on March 23-24. The goal is to build the thermal model of the two-phase cooling system. The thermal test chip developed by POLIMI and the (portable) prototype of the two-phase cooling system have been integrated to simulate the thermal behavior of a real processor.

A first step towards a real in-field validation!

In Quattro: Giorgia Rancione, Luca Saraceno
Politecnico di Milano: William Fornaciari and Federico Terraneo