PDAS, the Parallel Data Analytics Service

The PDAS (http://ophidia.cmcc.it/) is designed to address big data challenges in the scientific domain. It executes data-intensive analysis and I/O exploiting advanced parallel computing techniques and smart data distribution methods. It provides a framework for parallel I/O and data analysis, an array-based storage model and a hierarchical storage organization to partition and distribute multidimensional scientific datasets. Since the storage model does not rely on a scientific dataset file format, it can be exploited in different scientific domains and with very heterogeneous sets of data.

The internal storage model is based on a key-value pair approach to store multi dimensional data (called data cubes). Values are organized into arrays; each array is identified by a key-value pair obtained from the combination of some dimension values (called explicit dimensions), while the other dimensions (called implicit dimensions) identify a value through their position. From a physical point of view, a data cube is horizontally partitioned into several blocks (called fragments or chunks) that are distributed across multiple I/O nodes. Each I/O node hosts a set of I/O servers optimized to manage n-dimensional arrays. These I/O servers manage a set of databases consisting of one or more data cube fragments.

PDAS comes with an extensive set of primitives to operate on n-dimensional arrays (i.e. on the arrays contained in fragments). To achieve flexibility requirements, primitives are designed as dynamic libraries in order to be plugged in different I/O servers with no effort. Furthermore, plugins can also be nested into each other to enable more complex tasks.

Currently available array-based functions allow data sub-setting, data aggregation (i.e. max, min, avg), array concatenation, algebraic expressions, predicate evaluation and compression routines (i.e. zlib, xz, lzo). Core functions of well-known numerical libraries (e.g. GSL, PETSc) and tools (e.g. CDO) have been included into the primitives. PDAS also provides many parallel operators to work on a whole data cube; i.e. on all fragments associated to a data cube. Some examples are: datacube sub-setting (slicing and dicing), datacube aggregation, array-based primitives at the datacube level, datacube duplication, datacube pivoting, and NetCDF file import and export. Other operators can also handle more than one data cube at time, thus allowing comparison.

The PDAS is being used in the EUBrazilCC project to address scientific challenges related to the use case 3 on Biodiversity and Climate Change. The framework will be exploited both for batch and interactive analytics sessions on NetCDF climate data and Landsat satellite data. The PDAS can be deployed in different cloud-based scenarios according to user requirements in terms of scalability, performance and ease of deployment.
For more information please contact us at info@eubrazilcc.eu.