12. Integrating Tecgraf's CSGrid System to Microsoft Azure Cloud for Batch Job Submission

Author(s): 
Gama, Daltro S.; Rodriguez, Noemi; Lima, M Julia
Focus area: 
The work presented in this position paper is the result of a master's thesis developed at the Computer Science Department of PUC-Rio in collaboration with the Tecgraf Institute. It focuses on investigating and improving infrastructure services for business and research. The results reinforce the potential benefits of cooperation between research institutions and industry, building on the EUBrazilCC project and the Microsoft Azure platform.
Who stands to benefit and how: 
Tecgraf's CSGrid users interested in cloud-scale batch submission within the Azure public cloud infrastructure, as well as developers interested in building integrations with the Azure infrastructure.
Focus of your position paper: 
The CSGrid system is built as an instance of the CSBase framework, developed over the last ten years by the Tecgraf Institute at PUC-Rio and used in production in a wide range of scenarios. CSGrid offers a collaborative and extensible environment that abstracts the use of distributed computational and storage resources for High Performance Computing (HPC), providing functionality that can be used either directly by end users or through programming interfaces. CSGrid is targeted at batch submission of non-interactive programs previously uploaded and shared by users in a project-centric environment: input and output data files are organized in project areas, keeping cooperation between users centralized in the same workspace. Originally, CSGrid was designed for fixed-size clusters and grid computing. Several scientific and commercial applications run in the CSGrid environment, including a few at the Brazilian oil company Petrobras and others at SINAPAD. Our goal in this work was to develop an integration with Azure's public cloud platform by implementing a software module for CSGrid that exploits the resource elasticity typically provided by cloud computing. First, we evaluated the computing power of the Azure infrastructure by measuring basic network throughput and processing time for CPU-bound processes. The results met our expectations, especially when using the D-series virtual machines provided by Azure. Afterwards, we designed a new module to handle the integration between CSGrid and the Azure platform. We implemented this module using Azure's REST APIs to access services for blob storage, messaging, and virtual machine management.
This new module gave CSGrid the capacity to exploit the elasticity of the cloud platform, especially for bag-of-tasks workloads, under different policies, such as keeping as many virtual machines as queued jobs in order to minimize wall-clock time, or adapting the number of virtual machines to the status of the execution queue. Provisioning and deprovisioning policies were designed to minimize the financial cost of both virtual machines and blob storage usage. The CSGrid-Azure integration was built on a new CSGrid API designed during the EUBrazilCC project for integrating CSGrid with other platforms, such as the Fogbow cloud federation developed by the Federal University of Campina Grande (UFCG). This new API also encourages integration with other cloud APIs, such as OpenStack or Amazon EC2/S3. Recently, Microsoft also developed a batch execution platform called Azure Batch®. While the CSGrid integration was being developed, Azure Batch had not yet been released. Microsoft's solution has requirements similar to those of CSGrid's batch execution, although the first release of Azure Batch only allowed the execution of MS Windows binaries, an inconvenience when working with CSGrid. Nevertheless, a new CSGrid module might be implemented in the future to integrate with Azure Batch. The experience of integrating Microsoft's Azure public cloud platform with Tecgraf's CSGrid system can be considered a successful exploration of CSGrid's new capabilities. It illustrates how far CSGrid, a system that has been used commercially for many years, can evolve to integrate with more contemporary technology. This work was made possible by the Microsoft Research Award Program in 2014/2015.
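As an illustration of the two provisioning policies mentioned above (one virtual machine per queued job, and adapting the pool to the queue status), the following is a minimal sketch in Python. It is not the CSGrid module's actual code: the function names, the `max_vms` cap, the `step` parameter, and the decision to count running jobs as already occupying one virtual machine each are all assumptions made for the example.

```python
def one_vm_per_job(queued_jobs: int, running_jobs: int, max_vms: int) -> int:
    """Hypothetical policy: target one VM per job to minimize wall-clock time.

    Assumes each running job already occupies one VM, so the target pool
    size is running + queued, capped at max_vms (an assumed budget limit).
    """
    return min(queued_jobs + running_jobs, max_vms)


def adaptive_target(current_vms: int, queued_jobs: int, idle_vms: int,
                    max_vms: int, step: int = 1) -> int:
    """Hypothetical policy: grow or shrink the pool based on queue status.

    - Jobs are waiting and the pool is below the cap: add up to `step` VMs.
    - The queue is empty and some VMs are idle: release up to `step` idle VMs,
      which keeps the cost of unused VMs low.
    - Otherwise: keep the pool as it is.
    """
    if queued_jobs > 0 and current_vms < max_vms:
        return min(current_vms + step, max_vms)
    if queued_jobs == 0 and idle_vms > 0:
        return max(current_vms - min(step, idle_vms), 0)
    return current_vms
```

In this sketch the first policy favors execution time, since every job gets a machine as soon as the cap allows, while the second changes the pool gradually and releases idle machines once the queue drains, trading some waiting time for lower cost.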