News

By TEXTAROSSA Project

Task-based Programming Models: StarPU

Finding novel programming models to better use complex and heterogeneous hardware has been an active research topic for decades. The objectives are to help developers parallelize their applications and to increase the efficiency of the resulting applications. Among the successful approaches, the task-based method has demonstrated significant potential.

In this model, a developer splits their program into tasks connected by dependencies, which can usually be represented by a directed acyclic graph. A runtime system is then in charge of executing the task graph by distributing the tasks over the processing units while ensuring a coherent execution. It can use advanced scheduling strategies that attempt to reduce the makespan or energy consumption and keep the processing units busy, and it relieves developers by moving the data when needed.
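As a minimal sketch of the idea (not how any particular runtime system is implemented), the following Python snippet executes a small task graph in dependency order using the standard-library `graphlib` module. The task names and the `run_task_graph` helper are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_task_graph(tasks, deps):
    """Run tasks in an order that respects their dependencies.

    tasks: dict mapping a task name to a zero-argument callable
    deps:  dict mapping a task name to the set of task names it depends on
    Returns the list of task names in the order they were executed.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        # Every task returned by get_ready() has all its predecessors done,
        # so a real runtime could dispatch them to different units at once.
        for name in sorter.get_ready():
            tasks[name]()
            order.append(name)
            sorter.done(name)
    return order


# Hypothetical four-task pipeline; each task only records that it ran.
log = []
tasks = {name: (lambda n=name: log.append(n))
         for name in ("load", "factor", "solve", "check")}
deps = {
    "load": set(),
    "factor": {"load"},
    "solve": {"factor"},
    "check": {"solve", "load"},
}
order = run_task_graph(tasks, deps)
```

This sketch runs the ready tasks one after another; a runtime system would instead distribute them over the available processing units and move the data they need.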

StarPU is such a runtime system; it has been developed at Inria since 2009. It targets users with HPC knowledge and has been successfully used to parallelize a dozen different numerical methods: dense/sparse linear algebra solvers, FMM, BEM solvers, H-matrix solvers, and seismic simulations, among others. In most of these applications, CPUs and GPUs are used jointly. As such, StarPU is well suited to studying, through the task-based method, the new technologies and features that will be developed in the TEXTAROSSA project.

More precisely, we will study the use of FPGAs in the task-based method and examine their complementarity with the classical HPC devices. At a high level, any existing task-based application can quickly use FPGAs by simply providing some alternative implementations of the tasks. At the runtime system level, an FPGA is nothing more than another type of accelerator device. Consequently, the challenges will come more from task scheduling and the optimization of energy consumption. This is why we will provide a new scheduling strategy that tackles both aspects, which we will validate on two existing StarPU-based applications: ScalFMM and Chameleon.
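The notion of a task with alternative implementations can be illustrated with a toy Python sketch. This is not the StarPU API: the `pick_implementation` helper, the device names, and the cost figures are all hypothetical, and the selection rule (lowest estimated cost among idle devices) is only one simple policy a scheduler might use.

```python
def pick_implementation(implementations, idle_devices):
    """Choose where to run a task.

    implementations: dict mapping a device kind to (callable, estimated_cost)
    idle_devices:    set of device kinds currently available
    Returns (device_kind, callable) with the lowest estimated cost.
    """
    candidates = [(cost, dev)
                  for dev, (_, cost) in implementations.items()
                  if dev in idle_devices]
    if not candidates:
        raise RuntimeError("no idle device can run this task")
    _, dev = min(candidates)
    return dev, implementations[dev][0]


def scale_cpu(values, factor):
    # Reference implementation of the task.
    return [v * factor for v in values]


def scale_fpga(values, factor):
    # Stand-in for an FPGA kernel: same result, different device.
    return [v * factor for v in values]


# Hypothetical cost estimates (for example joules per call).
scale_task = {"cpu": (scale_cpu, 5.0), "fpga": (scale_fpga, 1.5)}

dev_all, impl_all = pick_implementation(scale_task, {"cpu", "fpga"})
dev_cpu, impl_cpu = pick_implementation(scale_task, {"cpu"})
result = impl_all([1, 2, 3], 2)
```

Adding FPGA support to the task amounts to adding one more entry to the dictionary; the rest of the application is unchanged, which is the point made in the paragraph above.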

Leading Partner: INRIA


TEXTAROSSA @ INFN Workshop on Computing

Today (May 25, 2022), Alessandro Lonardo (INFN) is presenting the TEXTAROSSA project at the annual INFN Workshop on Computing.


Bit-Packing Compressor in FPGA-Hardware


Figure 1

Figure 2

Figure 3

Data compression is a key issue for storing and transmitting large datasets. Compression algorithms based on variable-length codes for integer representation are effective only if the output stream is tightly packed. Hardware implementation of these procedures has definite advantages over software execution. For example, many fast video cameras use high-performance CMOS sensor technology and are commonly adopted for acquiring and storing scientific images. Typical use cases are nuclear fusion experiments with magnetically confined plasmas, in which fast cameras acquire and store images of plasma discharges. A Photron FASTCAM SA4 camera has been in operation at ENEA on the Frascati Tokamak Upgrade and Proto-Sphera tokamak experiments. The FASTCAM SA4 provides up to 3600 frames/s at 1024×1024 pixel resolution with a 12-bit pixel depth. The SA4 is supplied with the 16 GB memory option, hence it can capture all the frames in the ~2 s of a plasma discharge at its maximum performance of 3600 fps at 1024×1024 pixel resolution. The camera downloads 9.5 GB of raw uncompressed data for each plasma discharge, and image processing software can convert the data to standard formats such as “tif” (fig. 1). The raw data acquired with a digital camera like the SA4 are a measure of radiant intensity, that is, the quantity of light energy reflected from or transmitted through the object being imaged by an analog or digital device. It is the only variable that can be used by processing techniques in quantitative scientific experiments, as shown in fig. 2 with a different color scale. The 12-bit pixel depth, together with the huge amount of data produced by the SA4 camera, poses a problem of storage versus data integrity.
In fact, since the smallest addressable unit in a digital processor is a byte, preserving the data would require saving each pixel in a two-byte format, so that 4 of every 16 stored bits (25%) carry no information. The alternative could be lossy compression, but this would cause undesirable data loss. In order to save storage space while preserving data integrity, a Bit-Packing compression algorithm operating at the bit level has been developed to reduce the number of bits required to store the 12-bit raw data of the SA4 camera. Roughly speaking, the Bit-Packing algorithm maps two 12-bit pixels to three 8-bit bytes. In this specific compression domain, Bit-Packing is one of the most frequently applied compression schemes. However, (de)compression should not add CPU load at run time; it should be as fast as possible and efficient in terms of energy consumption. To achieve that target, the development of a Bit-Packing algorithm for Field Programmable Gate Arrays (FPGAs) is a candidate solution in line with the aims of the TEXTAROSSA project.
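The mapping of two 12-bit pixels onto three bytes can be sketched in a few lines of Python. The byte layout used here (first pixel in the high bits, big-endian output) is an assumption made for illustration, and `pack_pair` and `unpack_pair` are hypothetical helper names; the source does not specify the layout used by the actual kernel.

```python
def pack_pair(p0, p1):
    """Pack two 12-bit pixels into three bytes (illustrative layout)."""
    if not (0 <= p0 < 4096 and 0 <= p1 < 4096):
        raise ValueError("pixels must fit in 12 bits")
    word = (p0 << 12) | p1          # 24-bit value holding both pixels
    return word.to_bytes(3, "big")


def unpack_pair(data):
    """Recover the two 12-bit pixels from three packed bytes."""
    word = int.from_bytes(data, "big")
    return word >> 12, word & 0xFFF


packed = pack_pair(0xABC, 0x123)
restored = unpack_pair(packed)
```

Two pixels stored as 16-bit values would occupy four bytes, so the packed form takes three quarters of the space while remaining exactly reversible.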

For more than 20 years, High-Level Synthesis (HLS) has been widely investigated and studied as a way to raise the level of abstraction in programmable electronic system design, allowing SW developers to design dedicated computing architectures to be implemented on top of the FPGA (Field Programmable Gate Array) technology.

The advantages of FPGAs come from:
– the huge parallelism available: thousands of DSP modules and millions of small programmable LUTs make it possible to implement designs able to sustain hundreds of GFlop/s, or several TOp/s if we refer to simpler limited-precision fixed-point arithmetic

– the high internal memory bandwidth: thousands of independent small block RAMs provide an internal memory bandwidth of several TB/s, needed to sustain the high computing performance potentially achievable thanks to the huge HW parallelism

– the availability of IP (Intellectual Property) blocks, on the same FPGA silicon, that implement Arm processor cores, PCIe and memory controllers, and high-speed communication links, …

– the low power consumption (a few tens of watts)

Despite these advantages, FPGA adoption in the HPC community has been quite limited, mainly because programming them required HW design expertise. To overcome this limitation, HLS flows play a fundamental role, as they allow the behavior of the system to be synthesized to be defined in high-level languages such as C/C++, Matlab, SYCL, …

In past years, the immaturity of HLS flows was the main limitation to their widespread adoption: the quality of the produced HW was poor (the efficiency was low), the integration of the algorithm accelerated in HW with other parts running on conventional host computers was sometimes cumbersome, and reprogramming the FPGA was a problem, often requiring a reboot of the system to detect the new HW.

We are now in a phase where HLS flows seem to have reached maturity. There are HLS flows (like Vitis from Xilinx or oneAPI from Intel) that produce quite efficient parallel and pipelined implementations of a design, although the SW programmer still needs some awareness of the architecture to be generated and can control the compiler behavior and the structure of the generated HW through pragmas. These flows take care of low-level details such as the different clock domains, offer simple syntax to manage communication and synchronization among different kernels, and give access to a wide set of pre-implemented IP and computing libraries. HLS flows use the partial reprogramming capabilities of FPGAs, so the interface shell (i.e., the PCIe, memory, and all the I/O interfaces) is kept fixed rather than reprogrammed each time, which solves the rebooting problem. Furthermore, modern HLS flows come with run-time support that allows simple and efficient integration between the application running on the host and the accelerated kernels running on the FPGA.

In the TEXTAROSSA project, one of the main aims is software development for an FPGA accelerator using an HLS flow (namely, Vitis), abstracting away the low-level FPGA details of the Xilinx Alveo U280 in operation at the ENEA FPGA LAB in Portici.

A kernel function implementing the Bit-Packing algorithm deletes the four leading bits of each pixel component of an image read from an input stream, producing a compressed image on the output stream.
If m and n are the numbers of bits used to store a component of an N×N image before and after packing, the input image has size mN^2/8 bytes and the output image nN^2/8 bytes, so the compression ratio is m/n. We denote by S the width, in bits, of the input and output streams of the kernel; S is constrained to be a power of 2. At the beginning, the input and output images are aligned with respect to S, i.e., both the input and output streams have a pixel component starting at bit 0 of the input/output word. To determine how many components must be read before both the input and the output streams are aligned again, we must find the smallest number of pixel components k∈ℕ+ satisfying the following equation (% is the modulus operator):

(k * m) % S = (k * n) % S = 0  	(1)

From (1) it is easy to verify that the solution is:

k=LCM(LCM(m,S)/m,LCM(n,S)/n) 	(2)

where LCM(a,b) returns the Least Common Multiple of a and b. In our case we set the stream width S=512 bits (thus saturating the PCIe bandwidth under the reasonable hypothesis of using fck=300 MHz), m=16 bits and n=12 bits; expression (2) gives k=128. As we read S bits at a time from the stream, data are aligned again after k * m/S=4 reads from the input stream and k * n/S=3 writes to the output stream. From these computations, we derive the structure of the Bit-Packing compression function:
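The value of k can be checked numerically. The short Python sketch below evaluates expression (2) and compares it with a brute-force search for the smallest k at which both streams are aligned on an S-bit boundary; the function names are hypothetical and introduced only for this check.

```python
from math import lcm  # available from Python 3.9


def alignment_period(m, n, S):
    """Expression (2): components to process before both streams realign."""
    return lcm(lcm(m, S) // m, lcm(n, S) // n)


def alignment_period_brute(m, n, S):
    """Smallest k >= 1 with k*m and k*n both multiples of S."""
    k = 1
    while (k * m) % S != 0 or (k * n) % S != 0:
        k += 1
    return k


m, n, S = 16, 12, 512
k = alignment_period(m, n, S)
reads = k * m // S    # S-bit words read from the input stream
writes = k * n // S   # S-bit words written to the output stream
```

With m=16, n=12 and S=512 the two functions agree, and the read and write counts are in the ratio m/n = 4/3, matching the compression ratio.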

for (i=0; i<InputImageSize/S; i+= k*m/S) {
  read k*m/S words from the input stream;
  while (not all the input bits have been copied) {
    copy n bits from the input word to the output word;
    skip (m-n) bits of the input word;
  }
  write the k*n/S words that have just been filled;
}
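A software model of this loop can help check the word counts. The Python sketch below represents each S-bit stream word as an integer and processes groups of k components at a time. The bit layout (components stored lowest bits first, the n low-order bits of each m-bit component kept) and all function names are assumptions made for illustration; this is a behavioral model, not the HLS kernel itself.

```python
M_BITS = 16   # m: bits per stored input component
N_BITS = 12   # n: bits kept per component
S = 512       # stream word width in bits
K = 128       # components per aligned group, from expression (2)

WORD_MASK = (1 << S) - 1


def to_words(values, width):
    """Concatenate fixed-width values (lowest bits first) into S-bit words."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    return [(acc >> (w * S)) & WORD_MASK
            for w in range(len(values) * width // S)]


def pack_group(in_words):
    """Turn K*M_BITS/S input words into K*N_BITS/S output words."""
    acc, count = 0, 0
    for word in in_words:
        for j in range(S // M_BITS):
            # Keep the n low-order bits; the (m - n) leading bits are skipped.
            comp = (word >> (j * M_BITS)) & ((1 << N_BITS) - 1)
            acc |= comp << (count * N_BITS)
            count += 1
    return [(acc >> (w * S)) & WORD_MASK for w in range(K * N_BITS // S)]


def pack_stream(in_words):
    """Process the input stream one aligned group at a time."""
    step = K * M_BITS // S
    if len(in_words) % step:
        raise ValueError("input length must be a multiple of k*m/S words")
    out = []
    for i in range(0, len(in_words), step):
        out.extend(pack_group(in_words[i:i + step]))
    return out


# Two groups of synthetic 12-bit pixel values.
pixels = [(i * 37) % 4096 for i in range(2 * K)]
in_words = to_words(pixels, M_BITS)
packed = pack_stream(in_words)
```

For this input, eight 512-bit words go in and six come out, and the result is identical to packing the same pixel values directly at 12 bits each.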

The Bit-Packing compressed image of fig. 2 is depicted in fig. 3.

Benchmarking sessions are ongoing to evaluate time and energy to solution within the WP1 and WP4 activities of TEXTAROSSA.

Leading Partner: ENEA
People: Francesco Iannone, Paolo Palazzari


INFN progress in neural network research for HPC

INFN started tackling neural simulations with its own engine, the Distributed and Plastic Spiking Neural Network (DPSNN), which is a scalable C++/MPI code for HPC platforms at extreme scales simulating the spiking dynamics of a brain cortex modeled as a grid of cortical columns populated with neurons and their interconnecting synapses.
It was first used to model brain cortex behaviour, with a special focus on sleep-like states, and to gauge compute and power efficiency on different architectures.
INFN has now transitioned to another, more versatile tool, the NEST Simulator. This is a C++/MPI/OpenMP code by the NEST Initiative (https://nest-initiative.org) that gives the user a domain-specific language to design a virtual neurophysiology experiment: from the equations driving the dynamics of the components of interest in the cortex (with a rich library of many types of both neurons and synapses ready to be used), to the topology of their interconnections (the so-called connectome), all the way to the necessary supporting tools, like probing or stimulating electrodes that read or inject electrical currents into the simulated cortex.
NEST offers the experimenter an intuitive Python interface to easily set up a detailed and complex protocol of interaction between a simulated cortex and a set of external stimuli.
Coupled with the huge set of tools for analysis, visualization and data transformation available to the Python user, NEST allows for a compact yet expressive way to perform even very involved neural simulations. INFN has used NEST to implement a biologically inspired thalamo-cortical model that can be trained to classify handwritten digits from the MNIST dataset and then mimic the wake-sleep cycle, in order to test the enhancing effects of sleep on the quality of learning and recognition, even in noisy environments.
NEST does not currently support running on GPUs; therefore INFN is closely following and collaborating in the development of NeuronGPU (by B. Golosio), a CUDA/C++/MPI code that replicates many features of NEST with a similar Python interface while targeting high performance on NVIDIA GPUs. NeuronGPU is poised to be integrated into NEST in the near future, in some capacity, as a GPU-enabling component called NEST-GPU.


In Quattro and POLIMI testing the new two-phase cooling system.



In Quattro and Politecnico di Milano are collecting experimental data at the HEAP Lab on March 23-24. The goal is to build the thermal model of the two-phase cooling system. The thermal test chip developed by POLIMI and the (portable) prototype of the two-phase cooling system have been integrated to simulate the thermal behavior of a real processor.

A first step towards a real in-field validation!

In Quattro: Giorgia Rancione, Luca Saraceno
Politecnico di Milano: William Fornaciari and Federico Terraneo


MathLib: Colella’s Dwarves


Mathematical libraries are a key component of the software stack for exascale-class applications. As a matter of fact, exascale computing deployment is conditioned by the development of new suitable numerical algorithms, since most existing ones cannot cope with many of the issues arising in the race to exascale.
Mathematical libraries provide building blocks implementing up-to-date methods and algorithms that application developers can reuse in the form of highly reliable, high-performance components.

Any new generation of computer architectures brings new challenges in achieving high-performance mathematical solvers, and a bi-directional interaction between new computer architectures/environments and new mathematical software is ever more crucial for the deployment and effective use of new technology to advance Science, Industry and Society.

The need to improve computing performance in the near and medium term indicates that exascale and post-exascale platforms will continue to emphasize heterogeneity. This type of architecture exploits, at node level, accelerators such as modern GPUs and reconfigurable hardware such as FPGAs to boost both performance and energy efficiency. As computer architectures become heterogeneous, there is a need for algorithms that support mixed precision and minimize communication among the different devices and memory levels. The new power-to-solution metric requires a rethinking of many computational kernels of HPC applications, looking for the best trade-off between reducing energy consumption and minimizing time-to-solution, while promoting reproducibility and scalability.

We will provide new high-performance algorithms and software modules for some of the so-called Colella’s dwarves, the numerical methods that Colella identified as crucial for science and engineering. In particular, we will focus on algorithms and software for sparse linear algebra, where data sets include many zero values and are usually stored in compressed data structures to reduce storage and memory bandwidth requirements; they are generally accessed with indexed loads and stores, and the main computational kernels are communication bound.
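To make the notions of compressed storage and indexed loads concrete, here is a minimal Python sketch of the widely used compressed sparse row (CSR) format and the corresponding matrix-vector product. CSR is chosen only as a familiar example; the source does not state which formats the library will use, and the function names are hypothetical.

```python
def dense_to_csr(rows):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)     # store only the nonzero entries
                col_idx.append(j)    # remember the column of each entry
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]  # indexed load from x
        y.append(acc)
    return y


A = [[4, 0, 0],
     [0, 0, 2],
     [1, 0, 3]]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
```

Each multiplication needs one value, one column index and one indirectly addressed element of x, so the kernel performs little arithmetic per byte moved, which is why such kernels tend to be limited by data movement rather than by computation.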

The library kernels will be of immediate use in a wide range of applications, ranging from classical scientific simulation to AI techniques, including automatic pattern recognition in complex systems, and will be tested in some of the applications proposed as use cases in this project.

Leading Partner: CNR


The TEXTAROSSA Co-Design Approach

Motivation

Supercomputing offers access to enormous amounts of computational power that are needed by many applications in different scientific and industrial sectors. Public services such as weather predictions require such capabilities, as do critical industries like Oil & Gas (for fuel deposit discovery) and Pharmaceutical (for drug design). Scientific discoveries in fields such as quantum physics or high energy physics are also made possible by supercomputers.

However, increasingly powerful supercomputers are hitting a ceiling imposed by the ability to provide (and sustain) electrical power through the grid. To avoid this limitation, supercomputing hardware designers need to rely on systems that are less power-hungry, but more difficult for application designers to effectively use, due to characteristics such as heterogeneity (that is, the use of processing elements different from the typical “processor” that is commonly found also in laptop and desktop personal computers) and reconfigurability (that is, the use of systems whose functions are programmable at the hardware rather than software level).

TEXTAROSSA Contributions

TEXTAROSSA aims at making the advantages of reconfigurable hardware and associated technical advances available to application developers by means of a co-design approach.
Whereas in standard supercomputing application design, the hardware is given, and the application developer works only at creating the software application, in co-design, hardware and software are designed together, at least in part.

TEXTAROSSA will leverage co-design, which was first developed in the context of embedded systems, through a new Integrated Development Vehicle, a hardware-software platform for supercomputing including reconfigurable hardware elements. TEXTAROSSA aims at providing tools that will help the application developer in designing and implementing the application on this type of platform, semi-automatically performing the tasks of deciding which activities will be performed on the reconfigurable hardware, and producing optimized hardware accelerators for those activities.


European High Performance Computing Joint Undertaking Workshop

The Tuscany Region of Italy, via TOUR4EU, organized in Brussels a workshop entitled “European High Performance Computing Joint Undertaking Workshop”, in which Prof. Aldinucci and Prof. Saponara presented the current EU projects, including TEXTAROSSA. Please find the workshop report here (in Italian).


TEXTAROSSA hiring: position open at INRIA

PhD Position F/M Optimization of high-performance applications on heterogeneous computing nodes

A PhD position is open in HiePACS, a joint project-team with Bordeaux INP, Bordeaux University and CNRS, and CAMUS, a joint project-team with Strasbourg University and CNRS.

The purpose of the HiePACS project is to efficiently perform frontier simulations arising from challenging research and industrial multiscale applications. The solution of these challenging problems requires a multidisciplinary approach involving applied mathematics, computational science and computer science. In applied mathematics, it essentially involves advanced numerical schemes. In computational science, it involves massively parallel computing and the design of highly scalable algorithms and codes to be executed on future petaflop (and beyond) platforms. Through this approach, HiePACS intends to contribute to all steps that go from the design of new high-performance, more scalable, more robust and more accurate numerical schemes to the optimized implementations of the associated algorithms and codes on very high performance supercomputers.

The CAMUS research team focuses on parallelization, optimization, profiling, modeling, and compilation. The team has a growing interest in the approaches used and enhanced in the high-performance community. The team’s research activities are organized into five main, closely related issues aimed at the following objectives: performance, correctness and productivity. These issues are: static parallelization and optimization of programs (where all statically detected parallelisms are expressed, as well as all “hypothetical” parallelisms that could eventually be taken advantage of at runtime), profiling and execution behavior modeling (where expressive representation models of the program execution behavior will be used as engines for dynamic parallelizing processes), dynamic parallelization and optimization of programs (such transformation processes running inside a virtual machine), object-oriented programming and compiling for multicores (where object parallelism, expressed or detected, has to result in efficient runs), and finally proof of program transformations (where the correctness of many static and dynamic program transformations has to be ensured).

The objectives of the thesis will be to study how the new features of the TEXTAROSSA computing nodes can be used to develop high performance applications. With this aim, we will study how high performance task-based applications can be adapted in order to exploit the full potential of the platform. We will thus consider two existing high performance libraries by adapting them, designing advanced scheduling strategies and considering energy consumption awareness as a major constraint of the work.



Press Release

TEXTAROSSA, a project co-funded by the European High Performance Computing (EuroHPC) Joint Undertaking, kicked off on April 1st to drive innovation in efficiency and usability of high-end HPC systems.

TEXTAROSSA (Towards EXtreme scale Technologies and AcceleRatOrS for HW/SW Supercomputing Applications for exascale) is funded by the European High Performance Computing (EuroHPC) Joint Undertaking within the EuroHPC-01-2019/Extreme scale computing and data driven technologies. The three-year project, led by ENEA (Italy), aggregates 17 institutions and companies located in 5 European countries (Italy, France, Poland, Germany and Spain).
The TEXTAROSSA project aims to achieve a broad impact on the High Performance Computing (HPC) field both in pre-exascale and exascale scenarios. The TEXTAROSSA consortium will develop new hardware accelerators, innovative two-phase cooling equipment, advanced algorithms, methods and software products. The developed technologies will be tested on the Integrated Development Vehicles (IDV) mirroring and extending the European Processor Initiative’s ARM64-based architecture, and on an OpenSequana testbed. To drive the technology development and assess the impact of the proposed innovations from node to system levels, TEXTAROSSA will use a selected but representative number of HPC, HPDA and AI applications covering challenging HPC domains such as general-purpose numerical kernels, High Energy Physics (HEP), Oil & Gas, climate modelling, as well as emerging domains such as High Performance Data Analytics (HPDA) and High Performance Artificial Intelligence (HPC-AI).

Read the full press release.