Friday, 14 April 2017

How to Load Data into Microsoft Azure SQL Data Warehouse using PolyBase & Talend ETL

Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on a massively parallel processing (MPP) architecture, SQL Data Warehouse can handle any enterprise workload.
With the increasing focus on real-time business decisions, there has been a paradigm shift toward not only keeping data warehouse systems up to date but also reducing load times. The fastest way to load data into SQL Data Warehouse is to use PolyBase to load it from Azure Blob storage. PolyBase leverages SQL Data Warehouse's massively parallel processing (MPP) architecture to read and load data from Azure Blob storage in parallel.
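On the warehouse side, PolyBase is driven entirely through T-SQL. As a minimal one-time setup sketch (all object names and the storage account are hypothetical; the IDENTITY string is arbitrary for blob access, since the account key does the authenticating):

    -- Required once per database before creating a scoped credential.
    CREATE MASTER KEY;

    -- Credential for the storage account that will hold the staged files.
    CREATE DATABASE SCOPED CREDENTIAL AzureStorageCred
    WITH IDENTITY = 'user',
         SECRET   = '<storage-account-access-key>';

    -- External data source pointing at the staging blob container.
    CREATE EXTERNAL DATA SOURCE AzureBlobStage
    WITH (
        TYPE       = HADOOP,
        LOCATION   = 'wasbs://stage@<storage-account>.blob.core.windows.net',
        CREDENTIAL = AzureStorageCred
    );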
One of Talend's key differentiators is its open source nature and the ability to leverage custom components, developed either in-house or by the open source community at Talend Exchange. Today our focus is on one such custom component, tAzureSqlDWBulkExec, and how it enables Talend to use PolyBase to load data into SQL Data Warehouse.
For simplicity, we will focus on the following two scenarios:
  • Load data from any source into SQL DW
  • Load data into SQL DW while leveraging Azure HDInsight and Spark

Load data from any source into SQL DW

[Figure: Talend job flow for loading data from any source into SQL DW via Azure Blob storage]
In this scenario, data is ingested from one or more sources as part of a Talend job. If needed, the data is transformed, cleansed and enriched using the processing and data quality connectors that Talend provides out of the box. The output must conform to a delimited file format, produced using tFileOutputDelimited.
The output file is then uploaded to Azure Blob storage using tAzureStoragePut. Once the file lands in Blob storage, tAzureSqlDWBulkExec bulk loads the data from the delimited file into a SQL Data Warehouse table.
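Under the hood, that bulk load amounts to exposing the staged file as an external table and materializing it with a CTAS statement, which the MPP compute nodes execute by reading the blob files in parallel. A sketch of the kind of T-SQL involved, reusing the AzureBlobStage data source above (table, column and folder names are hypothetical):

    -- Describe the delimited files written by tAzureStoragePut.
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
    );

    -- External table over the staged folder; no data is moved yet.
    CREATE EXTERNAL TABLE dbo.Sales_ext (
        OrderId   INT,
        OrderDate DATE,
        Amount    DECIMAL(18,2)
    )
    WITH (
        LOCATION    = '/sales/',
        DATA_SOURCE = AzureBlobStage,
        FILE_FORMAT = CsvFormat
    );

    -- CTAS pulls the files into the warehouse in parallel.
    CREATE TABLE dbo.Sales
    WITH (DISTRIBUTION = HASH(OrderId))
    AS SELECT * FROM dbo.Sales_ext;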

Load data into SQL DW while leveraging Azure HDInsight and Spark

[Figure: Talend Big Data job leveraging an HDInsight Spark cluster to stage output for SQL DW]

As data volumes have grown, so has the need to process data faster. Apache Spark, a fast, general-purpose processing engine compatible with Hadoop, has become the go-to big data processing framework for many data-driven enterprises. Azure HDInsight is a fully managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark (see How to Utilize Talend with Microsoft HDInsight for instructions on connecting to an HDInsight cluster from Talend Studio).
Talend Big Data Platform (Enterprise version) provides graphical tools and wizards to generate native Spark code that combines in-memory analytics, machine learning and caching to deliver optimal performance and increased efficiency over hand-coding.  The generated Spark code can be run natively on an HDInsight cluster directly from Talend Studio.
In this scenario, a Talend Big Data job will be set up to leverage an HDInsight Spark cluster to ingest data from one or more sources, apply transformations and write the results to HDFS (Azure Blob storage). The Talend Big Data job can write its output in any of the following formats supported by PolyBase (see the file format sketch after this list):
  • Delimited Text – using tFileOutputDelimited
  • Hive ORC – using tHiveOutput
  • Parquet – using tHiveOutput / tFileOutputParquet
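Each option maps to a PolyBase external file format on the warehouse side. Hypothetical definitions for the three formats (the Parquet compression codec shown is one commonly used choice):

    CREATE EXTERNAL FILE FORMAT TextFormat
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

    CREATE EXTERNAL FILE FORMAT OrcFormat
    WITH (FORMAT_TYPE = ORC);

    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET,
          DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');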
After the Spark job completes, a standard job is executed that bulk loads the data from the Spark output files into a SQL Data Warehouse table using tAzureSqlDWBulkExec.
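As a sketch of that final step, assume the Spark job wrote Parquet files to a hypothetical /spark-out/ folder in the container behind the AzureBlobStage data source defined earlier:

    -- External table over the Spark output (schema is illustrative).
    CREATE EXTERNAL TABLE dbo.SparkOut_ext (
        CustomerId INT,
        Revenue    DECIMAL(18,2)
    )
    WITH (
        LOCATION    = '/spark-out/',
        DATA_SOURCE = AzureBlobStage,
        FILE_FORMAT = ParquetFormat
    );

    -- Load into the target table; PolyBase reads the Parquet files in parallel.
    INSERT INTO dbo.CustomerRevenue
    SELECT CustomerId, Revenue
    FROM dbo.SparkOut_ext;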

Performance Benchmark

tAzureSqlDWBulkExec drives native PolyBase functionality and therefore fully realizes the performance benefits of loading data into Azure SQL Data Warehouse this way. In-house tests have shown this approach to deliver roughly a 10x throughput improvement over a standard JDBC load.
