e-Science Central, the cloud-based data analytics platform

e-Science Central (e-SC, http://www.esciencecentral.co.uk) offers a visual programming model based on workflows. It is well suited to scientific applications, which often consist of a set of tasks connected in a directed acyclic graph that transforms the original input data into a desired end result. With e-Science Central, application development usually consists of just dragging blocks from a palette of available tools and connecting them together into a workflow. The user can then execute the workflow; the results are stored within the e-SC file system and can be displayed in the web front-end.
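The idea of a workflow as a directed acyclic graph of blocks can be illustrated with a minimal Python sketch. It is not e-SC code: the function run_workflow, the block names and the way data is passed between blocks are all assumptions made for illustration, and a real engine would also handle storage, errors and parallel execution.

```python
from graphlib import TopologicalSorter


def run_workflow(blocks, edges, source_input):
    """Run each block once, in dependency order (hypothetical helper).

    blocks: dict mapping block name -> function taking a list of upstream outputs
    edges: list of (upstream, downstream) block-name pairs
    source_input: value handed to blocks that have no upstream block
    """
    # Map every block to the upstream blocks it depends on.
    upstream = {name: [] for name in blocks}
    for src, dst in edges:
        upstream[dst].append(src)

    results = {}
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[name]] or [source_input]
        results[name] = blocks[name](inputs)
    return results


# Three illustrative blocks connected in a chain: load -> scale -> total.
blocks = {
    "load": lambda ins: [float(x) for x in ins[0].split(",")],
    "scale": lambda ins: [x * 2 for x in ins[0]],
    "total": lambda ins: sum(ins[0]),
}
edges = [("load", "scale"), ("scale", "total")]

results = run_workflow(blocks, edges, "1,2,3")
print(results["total"])  # 12.0
```

Ordering the blocks topologically guarantees that every block runs only after all of its upstream blocks have produced their output, which is what the acyclic structure of the workflow makes possible.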

e-Science Central currently provides about 250 workflow blocks, which range from generic utilities for manipulating data matrices to specialist blocks wrapping the Weka data mining tools and control blocks that can be used in workflow parallelization. The number of blocks grows constantly as the system is exposed to new projects and application domains. For applications that require specific tools not yet included in e-SC, users (developers) need to build the relevant blocks themselves. Blocks can be created in various programming languages (Java, R, Octave, JavaScript) or can simply wrap existing tools.

Workflow enactment is done by one or more workflow engines. By design, the system is suited to automatically processing batches of input data without user interaction, and can scale to problems that require TFLOPS of computing power and TBs of storage.

Although e-SC is self-contained and can operate on its own, offering users a web interface, it also exposes an API that enables external applications to interact with the storage and workflow management subsystems. The API allows the system to become part of larger applications in which e-SC delivers a secure and scalable data analysis service.

Target deployment of e-Science Central

Computing resources provided by the EUBrazilCC project are heterogeneous and thus their federation will give rise to a set of heterogeneous pools of computing and storage nodes. This is an important factor that affects the deployment of the e-SC system.

Currently, e-SC can handle only a single pool of workflow engines. Engines in the pool are considered to be equivalent, which imposes various restrictions on the way the system may be used. One of these limitations is the inability to handle a set of heterogeneous resources. If the resource layer offers computing nodes with different access policies, computing power, memory and storage capacities, the system cannot handle them properly. As a result, users may observe unexpected workflow execution failures or poor execution performance.

In the EUBrazilCC project we will investigate possible options for extending the e-SC system so that multiple engine pools can be present and used to enact workflows. The target deployment is shown in Figure 10. e-Science Central will be extended with a scheduling component and multiple workflow invocation queues. Most importantly, the extended system should perform no worse than the current workload balancing, which has proved to scale very effectively.
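One possible shape for a scheduling component with one invocation queue per engine pool is sketched below in Python. The design is still to be investigated in the project, so this is only an illustration under stated assumptions: the classes EnginePool and Invocation, their fields, and the "least-loaded eligible pool" policy are all hypothetical and not part of e-SC.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class EnginePool:
    """A pool of equivalent engines with its own invocation queue (hypothetical)."""
    name: str
    memory_gb: int
    cores: int
    queue: deque = field(default_factory=deque)


@dataclass
class Invocation:
    """A workflow invocation and its assumed resource requirements."""
    workflow_id: str
    min_memory_gb: int = 0
    min_cores: int = 1


def schedule(invocation, pools):
    """Queue the invocation on the least-loaded pool that can run it."""
    eligible = [
        p for p in pools
        if p.memory_gb >= invocation.min_memory_gb and p.cores >= invocation.min_cores
    ]
    if not eligible:
        raise ValueError(f"no pool can run {invocation.workflow_id}")
    # On a tie, min() keeps the first eligible pool in the list.
    target = min(eligible, key=lambda p: len(p.queue))
    target.queue.append(invocation)
    return target


pools = [EnginePool("small", memory_gb=4, cores=2),
         EnginePool("large", memory_gb=64, cores=16)]

first = schedule(Invocation("wf-1", min_memory_gb=32), pools)
second = schedule(Invocation("wf-2"), pools)
print(first.name, second.name)  # large small
```

Filtering by requirements before balancing by queue length keeps a demanding workflow away from nodes that cannot run it, which is the failure mode a single pool of supposedly equivalent engines cannot avoid.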

Integration with the cloud infrastructure

In order to allow multiple engine pools to be created, e-Science Central will be extended with components that can access the infrastructure layer and allocate engine VMs based on the current workload. EUBrazilCC offers a variety of options for accessing the infrastructure layer, such as direct access to management middleware through the OCCI interface, the Virtual Infrastructure Manager (described later), and federation of IaaS cloud providers via Fogbow (see section 2.7). The exact way to integrate those components into the infrastructure will be investigated and decided in the course of the project.
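Allocating engine VMs based on the current workload can be reduced, in its simplest form, to comparing the queue length of a pool with the number of engines it already runs. The Python sketch below shows only that decision step. The function names, the invocations-per-engine ratio and the pool size limits are assumptions chosen for illustration, and no call to OCCI, the Virtual Infrastructure Manager or Fogbow is shown because the integration path has not yet been decided.

```python
import math


def engines_needed(queued, per_engine, min_engines, max_engines):
    """Return how many engine VMs a pool should have for its queue length."""
    if per_engine <= 0:
        raise ValueError("per_engine must be positive")
    wanted = math.ceil(queued / per_engine)
    # Keep the pool size within the configured limits.
    return max(min_engines, min(max_engines, wanted))


def scaling_delta(running, queued, per_engine=4, min_engines=1, max_engines=10):
    """Positive: VMs to start; negative: VMs to release; zero: no change."""
    return engines_needed(queued, per_engine, min_engines, max_engines) - running


print(scaling_delta(running=2, queued=17))   # 3
print(scaling_delta(running=5, queued=0))    # -4
print(scaling_delta(running=3, queued=100))  # 7
```

A component of this kind would pass the resulting delta to whichever infrastructure interface is eventually selected, starting or releasing that many engine VMs for the pool.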
