Introducing Nextflow with GA4GH TES: A New Era of Scalable Data Processing on Azure

By Venkat_Malladi



We’re excited to announce that Nextflow, the powerful workflow management system for data-driven computational pipelines, is now fully supported through the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES). This integration brings seamless scalability, enhanced efficiency, and robust performance to your data processing tasks in the cloud as well as on local compute.



What is TES?




The Task Execution Service (TES) API is a standardized schema and API for describing and executing batch execution tasks. It provides a common way to submit and manage tasks across a variety of compute environments, including on-premises High Performance Computing and High Throughput Computing (HPC/HTC) systems, cloud computing platforms, and hybrid environments. The TES API is designed to be flexible and extensible, allowing it to be adapted to a wide range of use cases, such as “bringing compute to the data” solutions for federated and distributed data analysis, or load balancing across multi-cloud infrastructures.
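To make this concrete, here is a minimal task document roughly following the TES v1 schema; the task name, container image, command, and resource values are illustrative placeholders, not part of this announcement:

Code:
{
  "name": "hello-tes",
  "resources": {
    "cpu_cores": 1,
    "ram_gb": 1.0,
    "disk_gb": 10.0
  },
  "executors": [
    {
      "image": "ubuntu:22.04",
      "command": ["echo", "Hello, TES!"]
    }
  ]
}

A client submits a document like this to the service’s tasks endpoint, and the TES implementation decides where and how to run it; the same document works against any compliant TES backend.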



Why Nextflow and TES?




Nextflow in conjunction with TES is an ideal choice for managing computational workflows due to its ability to abstract and simplify the configuration of complex data processing tasks. The standardized TES API offers a unified approach to task execution, ensuring compatibility across various computational environments. This integration not only enhances portability and scalability but also significantly reduces the configuration needed to set up and manage workflows across cloud and local compute environments. This streamlined approach allows researchers and developers to focus on their core scientific objectives rather than the intricacies of infrastructure management.



How does it work?




Previously, to run a Nextflow pipeline on Azure you had to create a config file specifying the compute resources to use. By default, Nextflow uses a single compute configuration for all tasks. An example is below:



Code:
process {
    executor = 'azurebatch'
    queue = 'Standard_E2d_v4'

    withLabel:process_low         { queue = 'Standard_E2d_v4' }
    withLabel:process_medium      { queue = 'Standard_E8d_v4' }
    withLabel:process_high        { queue = 'Standard_E16d_v4' }
    withLabel:process_high_memory { queue = 'Standard_E32d_v4' }
}

azure {
    storage {
        accountName = "<Your storage account name>"
        sasToken = "<Your storage account SAS Token>"
    }
    batch {
        location = "<Your location>"
        accountName = "<Your batch account name>"
        accountKey = "<Your batch account key>"
        autoPoolMode = false
        allowPoolCreation = true
        pools {
            Standard_E2d_v4 {
                autoScale = true
                vmType = 'Standard_E2d_v4'
                vmCount = 2
                maxVmCount = 20
            }
            Standard_E8d_v4 {
                autoScale = true
                vmType = 'Standard_E8d_v4'
                vmCount = 2
                maxVmCount = 20
            }
            Standard_E16d_v4 {
                autoScale = true
                vmType = 'Standard_E16d_v4'
                vmCount = 2
                maxVmCount = 20
            }
            Standard_E32d_v4 {
                autoScale = true
                vmType = 'Standard_E32d_v4'
                vmCount = 2
                maxVmCount = 10
            }
        }
    }
}



With the integration of Nextflow and TES, we are able to simplify this configuration and let the native Nextflow compute directives (e.g., `cpus`, `memory`, `disk`) inform Azure of the minimum machine requirements. TES looks at the available Batch quota as well as the minimal compute requirements to choose the cheapest available compute that meets the minimal requirements for each process.
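For example, a process can declare its requirements directly in the pipeline code rather than in a platform-specific config; the process name and resource values below are a hypothetical sketch:

Code:
process ALIGN_READS {
    // Native Nextflow directives; the TES backend uses these to
    // request the cheapest machine that satisfies them
    cpus 4
    memory '8 GB'
    disk '50 GB'

    script:
    """
    echo "alignment work would run here"
    """
}

The platform-side configuration then shrinks to: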



Code:
plugins {
    id 'nf-ga4gh'
}

process {
    executor = 'tes'
}

azure {
    storage {
        accountName = "<Your storage account name>"
        accountKey = "<Your storage account key>"
    }
}

tes.endpoint = "<Your TES endpoint>"
tes.basicUsername = "<Your TES username>"
tes.basicPassword = "<Your TES password>"



How to Get Started




To help you get up and running quickly, we’re introducing the `nf-hello-gatk` project, a Nextflow pipeline example designed to showcase the powerful capabilities of Nextflow. This pipeline demonstrates how to use Nextflow for genomic data analysis with GATK (Genome Analysis Toolkit), leveraging Azure Batch to scale compute resources efficiently.



  1. Deploy TES on Azure: Follow the guide to deploy TES on Azure.
  2. Install Nextflow: Follow the Nextflow installation guide to set up Nextflow on your local machine or cloud environment.
  3. Generate the TES config: Fill out the following config with your own TES and Azure credentials and save it as tes.config.

Code:
process {
    executor = 'tes'
}

azure {
    storage {
        accountName = "<Your storage account name>"
        accountKey = "<Your storage account key>"
    }
}

tes.endpoint = "<Your TES endpoint>"
tes.basicUsername = "<Your TES username>"
tes.basicPassword = "<Your TES password>"
  4. Execute the pipeline:

Code:
./nextflow run seqeralabs/nf-hello-gatk -c tes.config -w 'az://work' --outdir 'az://outputs' -r main



After completion, all results can be found in the blob container prefix specified by --outdir.
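For example, assuming the storage account from your tes.config and the Azure CLI installed, you could list the results with something like:

Code:
# List the pipeline outputs; the container name 'outputs'
# follows the --outdir value used in the run command above
az storage blob list \
    --account-name <Your storage account name> \
    --container-name outputs \
    --output table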



Enhanced Workflow Management




With this integration, you can now manage and execute your Nextflow workflows on Azure Batch with greater ease. This includes:



  • Automatic Scaling: Dynamically scale compute resources based on your workflow’s demands.
  • Cost Efficiency: Optimize your cloud expenditure with Azure Batch’s cost-effective pricing model.
  • Seamless Integration: Utilize the TES API for straightforward interaction with Azure Batch.

We believe this integration will significantly enhance your data processing capabilities, making it easier to handle large-scale workflows with greater efficiency and cost-effectiveness. Stay tuned for more updates and community contributions as we continue to enhance our support for Nextflow on Azure Batch with TES.



Acknowledgments



We would like to thank Liam Beckman, from Oregon Health and Science University Computational Biology, and Ben Sherman, software engineer at Seqera, for their contributions to native support for TES in Nextflow.
