Thursday 20 April 2017

Best IaaS Architecture - Azure Challenge




Designing an IT infrastructure depends heavily on the services, products, and applications an enterprise or organization provides to its users. It also depends on which development model the company follows. It is therefore difficult to design a generalized IT infrastructure, but we can start from a basic IT infrastructure model and adapt it as future needs arise.

In this case study, we will look at how to design and develop a simple IT infrastructure, how to balance incoming traffic across the available resources (VMs), how to build a disaster-recovery-capable architecture using an Availability Set, and how to combine standard Azure services into a straightforward, standard IT infrastructure.

2. Type: IaaS (Infrastructure as a Service).

3. Architecture diagram



4. Solution Details

The following are the features, technologies used, and existing/potential end-users and customers:
  1. High security: the overall architecture is divided into multiple subnets. Each tier of VMs is launched into a different private subnet, which adds one more security layer within the Virtual Network. 
  2. High availability: there are two Availability Zones. Both contain the same VMs and run independently.
  3. Different load balancer types provide different features, and the type can be changed at any time. 
    1. Performance: select Performance when you have endpoints (cloud services or web sites) in multiple datacenters (also known as regions) and you want to direct the user to the endpoint with the lowest latency. 
    2. Failover: select Failover when you have endpoints in the same or different datacenters and want to use a primary endpoint for all traffic, but provide backups in case the primary or the backup endpoints go offline.
    3. Round Robin: select Round Robin when you want to distribute the load equally across a set of endpoints in the same datacenter or across different datacenters. 
  4. Disaster Recovery (warm standby) is possible with this architecture. We can keep one Availability Zone on standby and, when needed, rapidly scale the environment up (in other words, activate the other Availability Zone) to meet full production demand.
  5. Primary and secondary databases with the SQL Data Sync service provide an always-available standby copy of the original database.
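The three load-balancing methods in point 3 can be sketched in a few lines. This is a hypothetical simulation for illustration only; endpoint names and latency figures are made up, and the real Azure service makes these decisions at the DNS level.

```python
# Naive models of the Performance, Failover, and Round Robin methods.
from itertools import cycle

endpoints = [
    {"name": "east-us", "latency_ms": 40, "healthy": True, "priority": 1},
    {"name": "west-eu", "latency_ms": 95, "healthy": True, "priority": 2},
]

def performance(eps):
    # Direct the user to the healthy endpoint with the lowest latency.
    return min((e for e in eps if e["healthy"]), key=lambda e: e["latency_ms"])

def failover(eps):
    # Use the highest-priority healthy endpoint; fall back down the list.
    return min((e for e in eps if e["healthy"]), key=lambda e: e["priority"])

round_robin = cycle(endpoints)  # hand out endpoints in turn

print(next(round_robin)["name"], next(round_robin)["name"])  # east-us west-eu
print(performance(endpoints)["name"])   # east-us (lowest latency)
print(failover(endpoints)["name"])      # east-us (priority 1)
endpoints[0]["healthy"] = False
print(failover(endpoints)["name"])      # west-eu (primary offline)
```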
5. Economic viability of the Solution
Today, companies know disaster is not a matter of if, but when, so it is important to take precautions before something happens. According to a study by the University of Oregon, every dollar spent on hazard mitigation, including Disaster Recovery, saves companies four dollars in recovery and response costs. Beyond cost savings, smart companies also view Disaster Recovery as critical to their survival: 51% of companies that experienced a major data loss closed within two years (source: Gartner), and 44% of companies that experienced a major fire never re-opened (source: EBM). Again, disaster is not a matter of if, but when, so we need to be ready for Disaster Recovery and to plan for its cost.

6. Relevant screenshots from the Azure portal:
  1. All VMs
  2. VNET Network
  3. Subnets
  4. Availability Zone A VMs
  5. Availability Zone B VMs

Example Reference Architecture for Microsoft Azure Resource Manager (ARM)


Azure has two different deployment models for creating and working with resources: Azure Classic and Azure Resource Manager. The information detailed in this article is based on the Azure Resource Manager model (ARM).

Summary

The Microsoft Azure cloud platform provides a feature-rich Infrastructure-as-a-Service (IaaS) environment fully capable of supporting all InterSystems products. As with any platform or deployment model, care must be taken to ensure all aspects of an environment are considered, such as performance, availability, operations, and management procedures.  Specifics of each of those areas are covered in this article.

Performance

Within Azure ARM there are several options available for compute virtual machines (VMs) and their associated storage, and the option most directly relevant to InterSystems products is network-attached IaaS disks stored as VHD files in Azure page blob storage. There are several other options, such as block blobs, file storage, and others, but those are more specific to an individual application's requirements than to supporting the operation of Caché.  There are two types of storage where the disks are stored: Premium and Standard.  Premium storage is better suited for production workloads that require guaranteed, predictable low-latency Input/Output Operations per Second (IOPS) and throughput.  Standard storage is a more economical option for non-production or archive-type workloads.  Take care when selecting a particular VM type, because not all VM types have access to premium storage. 

Virtual IP Address and Automatic Failover

Most IaaS cloud providers lack the ability to provide a Virtual IP (VIP) address, which is typically used in database failover designs. To address this, several of the most commonly used connectivity methods, specifically ECP clients and CSP Gateways, have been enhanced within Caché to be mirror-aware and no longer rely on VIP capabilities.  
Connectivity methods such as xDBC, direct TCP/IP sockets, and other direct-connect protocols still require a VIP.  To address those, InterSystems database mirroring technology makes automatic failover possible within Azure by using APIs to interact with the Azure Internal Load Balancer (ILB) to achieve VIP-like functionality, thus providing a complete and robust high-availability design within Azure.

Backup Operations

Performing a backup using either traditional file-level or snapshot-based methods can be a challenge in cloud deployments.  This can now be achieved within the Azure ARM platform using Azure Backup and Azure Automation Runbooks along with InterSystems External Freeze and Thaw API capabilities, allowing for true 24x7 operational resiliency and assurance of clean, regular backups.  Alternatively, many of the third-party backup tools available on the market can be used by deploying backup agents within the VM itself and leveraging file-level backups in conjunction with Logical Volume Manager (LVM) snapshots.

Example Architecture

As part of this document, a sample Azure architecture is provided as a starting point for your application specific deployment, and can be used as a guideline for numerous deployment possibilities.  This reference architecture demonstrates a highly robust Caché database deployment including database mirror members for high availability, application servers using InterSystems Enterprise Cache Protocol (ECP), web servers with InterSystems CSP Gateway, and both internal and external Azure load balancers.

Azure Architecture

Deploying any Caché-based application on Microsoft Azure requires some specific considerations in certain areas.  This section discusses those areas, which must be considered in addition to any regular technical requirements you may have for your application.  
Two examples are provided in this document: one based on the InterSystems TrakCare unified healthcare information system, and another based on a complete InterSystems HealthShare health informatics platform deployment including Information Exchange, Patient Index, Health Insight, Personal Community, and Health Connect.

Virtual Machines

Azure virtual machines (VMs) are available in two tiers: basic and standard. Both types offer a choice of sizes.  The basic tier does not provide some capabilities available in the standard tier, such as load balancing and auto-scaling. For this reason, the standard tier is used for TrakCare deployments.
Standard tier VMs come in various sizes grouped into different series, i.e. A, D, DS, F, FS, G, and GS.  The DS, GS, and newer FS sizes support the use of Azure Premium Storage.
Production servers typically need Premium Storage for reliable, low-latency, high-performance I/O. For this reason, the example TrakCare and HealthShare deployment architectures detailed in this document use FS, DS, or GS series VMs.  Note that not all virtual machine sizes are available in all regions.
For more details on virtual machine sizes, see the Azure documentation.

Storage

Azure Premium Storage is required for TrakCare and HealthShare servers. Premium Storage stores data on Solid State Drives (SSDs) and provides high throughput at low latencies, whereas Standard Storage stores data on Hard Disk Drives (HDDs) resulting in lower performance levels.
Azure Storage is a redundant and highly available system; however, it is important to note that Availability Sets currently don't provide redundancy across storage fault domains, and in rare circumstances this can lead to issues. Microsoft has mitigation workarounds and is working on making this process widely available and easier for end customers. It is advisable to work directly with your local Microsoft team to determine whether any mitigation is required.
When a disk is provisioned against a premium storage account, its IOPS and throughput (bandwidth) depend on the size of the disk. Currently there are three types of premium storage disks: P10, P20, and P30. Each has specific limits for IOPS and throughput, as shown in the following table.
Azure Premium Storage Disk Types
                       P10        P20        P30
Disk size              128 GB     512 GB     1024 GB
IOPS per disk          500        2300       5000
Throughput per disk    100 MB/s   150 MB/s   200 MB/s

Note: Ensure there is sufficient bandwidth available on a given VM to drive the disk traffic. For example, a STANDARD_DS13 VM has 256 MB per second of dedicated bandwidth available for all premium storage disk traffic. That means four P30 premium storage disks attached to this VM have a throughput limit of 256 MB per second, not the 800 MB per second that four P30 disks could theoretically provide.
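The bandwidth cap in the note above is simple arithmetic, worked through here. The 256 MB/s figure is the DS13 value quoted in this article; other VM sizes have different caps.

```python
# A VM's dedicated premium-storage bandwidth caps aggregate disk
# throughput, regardless of what the attached disks could deliver.
P30_THROUGHPUT_MB_S = 200
vm_bandwidth_mb_s = 256            # e.g. a STANDARD_DS13 VM

disks = [P30_THROUGHPUT_MB_S] * 4  # four P30 disks attached
theoretical = sum(disks)           # 800 MB/s on paper
effective = min(vm_bandwidth_mb_s, theoretical)

print(theoretical, effective)      # 800 256
```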
For more details and limits on premium storage disks, including provisioned capacity, performance, sizes, IO sizes, cache hits, throughput targets, and throttling, see the Azure Premium Storage documentation.

High Availability

InterSystems recommends having two or more virtual machines in a defined Availability Set. This configuration is required because, during either a planned or unplanned maintenance event, at least one virtual machine in the set will be available, meeting the 99.95% Azure SLA.  This is important because, during data center updates, VMs outside an Availability Set can be brought down in parallel, upgraded, and brought back online in no particular order, leaving the application unavailable for the duration of the maintenance window. 
Therefore, a highly available architecture requires two of every server: load-balanced web servers, database mirrors, multiple application servers, and so on.
For more information on Azure high-availability best practices, see the Azure documentation.

Web Server Load Balancing

External and internal load-balanced web servers may be required for your Caché-based application. External load balancers are used for access over the Internet or WAN (VPN or Express Route), and internal load balancers are potentially used for internal traffic.  The Azure load balancer is a Layer 4 (TCP, UDP) load balancer that distributes incoming traffic among healthy service instances in cloud services or virtual machines defined in a load-balancer set.
The web server load balancers must be configured with client IP address session persistence (two-tuple) and the shortest probe timeout possible, which is currently 5 seconds. TrakCare requires session persistence for the period a user is logged in.  
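Two-tuple client IP session persistence means the backend is chosen by hashing the source and destination IP, so a given client keeps landing on the same web server. A minimal sketch of that idea, with made-up addresses and backend names (the real load balancer's hash function is internal to Azure):

```python
# Naive model of 2-tuple (source IP, destination IP) session persistence.
import hashlib

backends = ["web1", "web2", "web3"]

def pick_backend(src_ip, dst_ip):
    # Hash the 2-tuple deterministically onto one backend.
    key = f"{src_ip}-{dst_ip}".encode()
    idx = int(hashlib.sha256(key).hexdigest(), 16) % len(backends)
    return backends[idx]

first = pick_backend("203.0.113.7", "10.0.1.4")
# Repeated requests from the same client map to the same backend.
assert all(pick_backend("203.0.113.7", "10.0.1.4") == first for _ in range(10))
print(first)
```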
The following diagram provided by Microsoft demonstrates a simple example of the Azure Load Balancer within an ARM deployment model.


For more information on Azure load balancer features such as the distribution algorithm, port forwarding, service monitoring, Source NAT, and the different types of available load balancers, see the Azure documentation. 
In addition to the Azure external load balancer, Azure provides the Azure Application Gateway. The Application Gateway is a Layer 7 (HTTP/HTTPS) load balancer with support for cookie-based session affinity and SSL termination (SSL offload).  SSL offloading removes the encryption/decryption overhead from the web servers, since the SSL connection is terminated at the load balancer. This approach simplifies management, as the SSL certificate is deployed and managed in the gateway instead of on every node in the web farm.
For more information, see the Azure Application Gateway documentation.

Database Mirroring

When deploying Caché-based applications on Azure, providing high availability for the Caché database server requires synchronous database mirroring within the primary Azure region and, depending on your uptime service-level agreement requirements, potentially asynchronous database mirroring to replicate data to a hot standby in a secondary Azure region for disaster recovery.
A database mirror is a logical grouping of two database systems, known as failover members, which are physically independent systems connected only by a network. After arbitrating between the two systems, the mirror automatically designates one of them as the primary system; the other one automatically becomes the backup system. External client workstations or other computers connect to the mirror through the mirror Virtual IP (VIP), which is specified during mirroring configuration. The mirror VIP is automatically bound to an interface on the primary system of the mirror. 
Note: In Azure, it is not possible to configure the mirror VIP, so an alternative solution has been devised.
The current recommendation for deploying a database mirror in Azure is to configure three VMs (primary, backup, arbiter) in the same Azure Availability Set.  This ensures that at any given time, Azure will guarantee external connectivity with at least two of these VMs with a 99.95% SLA, and that each will be in different update and fault domains.  This provides adequate isolation and redundancy of the database data itself.
Additional details on database mirroring can be found in the InterSystems documentation.
A challenge within any IaaS cloud provider, including Azure, is handling automatic failover of client connections to the application in the absence of Virtual IP capabilities.  To retain automatic failover for client connections, a couple of approaches have been taken.  
Firstly, InterSystems has enhanced the CSP Gateway to be mirror-aware, so connectivity from a web server with the CSP Gateway to a database server no longer requires a VIP.  The CSP Gateway auto-negotiates with both mirror members and redirects to whichever member is the primary.  This complements the already mirror-aware capabilities of ECP clients, if you are using them. 
Secondly, connectivity outside of the CSP Gateways and ECP clients still requires a VIP-like capability.  InterSystems has worked with Microsoft Azure engineers to develop an API interface, driven from within Caché by the ^ZMIRROR routine, that interacts with an Azure Internal Load Balancer to update the server pool definition to point at whichever mirror member becomes primary.  The sample template ^ZMIRROR and ^ZAZUREHA routines can be found here.
These routines will need to be modified for your specific Azure subscription details.
The Azure Internal Load Balancer (ILB) provides a single IP address as a VIP-like capability to direct all network traffic to the primary mirror member, and it will only distribute traffic to that member.  This method does not rely on polling, and allows immediate redirection as soon as any mirror member in a mirror configuration becomes the primary member.  Polling may be used in conjunction with this method in some DR scenarios using Azure Traffic Manager.  
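The mechanism above can be reduced to a small simulation: when a mirror member is promoted to primary (in practice, triggered from ^ZMIRROR), the ILB's backend pool is repointed at it, so the ILB's single IP always reaches the primary. The class and member names here are illustrative stand-ins, not the real Azure API.

```python
# Minimal model of the VIP-like failover: the ILB backend pool always
# contains exactly the current primary mirror member.
class InternalLoadBalancer:
    def __init__(self):
        self.backend_pool = []

    def set_backend(self, member):
        # A real implementation would call the Azure REST API / SDK here.
        self.backend_pool = [member]

def on_become_primary(ilb, member):
    # Hook invoked when this mirror member wins the failover arbitration.
    ilb.set_backend(member)

ilb = InternalLoadBalancer()
on_become_primary(ilb, "mirror-a")   # mirror-a elected primary
assert ilb.backend_pool == ["mirror-a"]
on_become_primary(ilb, "mirror-b")   # failover: mirror-b promoted
assert ilb.backend_pool == ["mirror-b"]
```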

Backup and Restore

There are multiple options available for backup operations.  The following three options are viable for an Azure deployment with InterSystems products. The first two options incorporate a snapshot-type procedure, which involves suspending database writes to disk prior to creating the snapshot and then resuming updates once the snapshot has succeeded.  The following high-level steps are taken to create a clean backup using either of the snapshot methods:
  • Pause writes to the database via database Freeze API call.
  • Create snapshots of the OS + data disks.
  • Resume Caché writes via database Thaw API call.
  • The backup facility archives the snapshots to the backup location.
Additional steps, such as integrity checks, can be added on a periodic interval to ensure a clean and consistent backup.
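The four steps above can be sketched as an orchestration script. The function bodies here just record the step order; the real freeze/thaw calls would be the InterSystems External Freeze/Thaw APIs and the snapshot an Azure Backup operation. The try/finally matters: writes must resume even if the snapshot fails.

```python
# Record the backup sequence to show the required ordering.
log = []

def freeze():   log.append("freeze")    # pause database writes (Freeze API)
def snapshot(): log.append("snapshot")  # snapshot OS + data disks
def thaw():     log.append("thaw")      # resume database writes (Thaw API)
def archive():  log.append("archive")   # copy snapshots to backup location

def run_backup():
    freeze()
    try:
        snapshot()
    finally:
        thaw()      # always resume writes, even if the snapshot fails
    archive()

run_backup()
assert log == ["freeze", "snapshot", "thaw", "archive"]
```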
The decision on which option to use depends on the operational requirements and policies of your organization.  InterSystems is available to discuss the various options in more detail.

Azure Backup

Backup operations can now be achieved within the Azure ARM platform using Azure Backup and Azure Automation Runbooks along with InterSystems External Freeze and Thaw API capabilities, allowing for true 24x7 operational resiliency and assurance of clean, regular backups.  Details for managing and automating Azure Backup can be found here.

Logical Volume Manager Snapshots

Alternatively, many of the third-party backup tools available on the market can be used by deploying individual backup agents within the VM itself and leveraging file-level backups in conjunction with Logical Volume Manager (LVM) snapshots.
One of the major benefits of this model is the ability to perform file-level restores of either Windows- or Linux-based VMs.  A couple of points to note with this solution: since Azure and most other IaaS cloud providers do not provide tape media, all backup repositories are disk-based for short-term archiving, and low-cost blob or bucket-type storage can be leveraged for long-term retention (LTR).  If using this method, it is highly recommended to use a backup product that supports de-duplication technologies to make the most efficient use of disk-based backup repositories.
Some examples of backup products with cloud support include, but are not limited to: Commvault, EMC NetWorker, HPE Data Protector, and Veritas NetBackup.  InterSystems does not validate or endorse one product over another. 

Caché Online Backup

For small deployments, the built-in Caché Online Backup facility is also a viable option.  This InterSystems online database backup utility backs up data in database files by capturing all blocks in the databases and then writes the output to a sequential file. This proprietary backup mechanism is designed to cause no downtime for users of the production system.  
In Azure, after the online backup has finished, the backup output file and all other files in use by the system must be copied to an Azure File share. This process needs to be scripted and executed within the virtual machine.
The Azure File shares should use an Azure RA-GRS storage account for maximum availability. Note that Azure File shares have a maximum share size of 5 TB, a maximum file size of 1 TB, and a maximum of 60 MB/s throughput per share (shared by all clients). 
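A backup script copying output files to the share could pre-check them against those limits. A small sketch; the 5 TB/1 TB figures are the limits quoted above at the time of writing, so check the current Azure documentation before relying on them.

```python
# Pre-flight check of backup files against Azure File share limits.
MAX_SHARE_TB = 5   # maximum total share size
MAX_FILE_TB = 1    # maximum single file size

def fits_on_share(file_sizes_tb):
    # True only if every file and the total fit within the share limits.
    return (sum(file_sizes_tb) <= MAX_SHARE_TB
            and all(s <= MAX_FILE_TB for s in file_sizes_tb))

print(fits_on_share([0.8, 0.9]))   # True
print(fits_on_share([1.2]))        # False: single file over 1 TB
```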
Online backup is the entry-level approach for smaller sites wishing to implement a low-cost backup solution. However, as databases grow, external backups with snapshot technology are recommended as a best practice, with advantages including the backup of external files, faster restore times, and an enterprise-wide view of data and management tools.

Disaster Recovery

When deploying a Caché-based application on Azure, Disaster Recovery (DR) resources, including network, servers, and storage, are recommended to be in a different Azure region. The amount of capacity required in the designated DR Azure region depends on your organizational needs.  In most cases 100% of the production capacity is required when operating in DR mode; however, less capacity can be provisioned and then elastically expanded when needed.
Asynchronous database mirroring is used to continuously replicate to the DR Azure region's virtual machines. Mirroring uses database transaction journals to replicate updates over a TCP/IP network in a way that has minimal performance impact on the primary system.  Compression and encryption are highly recommended for these DR asynchronous mirror members. 
All external clients on the public Internet who wish to access the application are routed through Azure Traffic Manager, a DNS-based service. Microsoft Azure Traffic Manager (ATM) acts as a switch that directs traffic to the currently active data center.  Azure Traffic Manager supports a number of algorithms for determining how end users are routed to the various service endpoints; details of the various algorithms can be found here.
For the purposes of this document, the 'priority' traffic-routing method is used in conjunction with Traffic Manager endpoint monitoring and failover; details of endpoint monitoring and failover can be found here.
Traffic Manager works by making regular requests to each endpoint and then verifying the response. If an endpoint fails to provide a valid response, Traffic Manager shows its status as Degraded. It is no longer included in DNS responses, which instead will return an alternative, available endpoint. In this way, user traffic is directed away from failing endpoints and toward endpoints that are available.
Using the above methods, traffic is only ever allowed to reach the specific region and specific mirror member that is currently active.  This is controlled by the endpoint definition, which is a mirror_status page presented by the InterSystems CSP Gateway.  Only the primary mirror member will ever report "success" as an HTTP 200 to the monitoring probe. 
The following diagram provided by Microsoft demonstrates, at a high level, the priority traffic-routing algorithm.
Azure Traffic Manager yields a single endpoint, such as "https://my-app.trafficmanager.net", that all clients can connect to.  In addition, a CNAME record could be configured to provide a vanity URL such as "https://www.my-app-domain.com". Azure Traffic Manager is configured with one profile that contains the endpoint addresses of both regions.  
At any given time, only one of the regions will report online based on the endpoint monitoring, which ensures that traffic flows to only one region at a time.  No additional steps are needed for failover between the regions, since endpoint monitoring will detect that the application in the primary Azure region is down and that the application is now live in the secondary Azure region: the DR async mirror member is promoted to primary, which allows the CSP Gateway to report HTTP 200 to the Traffic Manager endpoint monitoring.
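The interaction of priority routing, endpoint monitoring, and the mirror_status probe can be simulated in a few lines. Region names, priorities, and the probe function are illustrative assumptions; in reality Traffic Manager issues real HTTP probes and answers DNS queries.

```python
# Model: only the region whose CSP Gateway mirror_status probe returns
# HTTP 200 is handed out in DNS answers; ties break on priority.
def probe(region):
    # Stand-in for Traffic Manager's HTTP probe of the CSP Gateway
    # mirror_status page: only the primary mirror member returns 200.
    return 200 if region["is_primary_mirror"] else 503

def resolve(regions):
    healthy = [r for r in regions if probe(r) == 200]
    return min(healthy, key=lambda r: r["priority"])["name"]

regions = [
    {"name": "primary-region",   "priority": 1, "is_primary_mirror": True},
    {"name": "secondary-region", "priority": 2, "is_primary_mirror": False},
]
print(resolve(regions))              # primary-region

# DR failover: the async member in the secondary region is promoted.
regions[0]["is_primary_mirror"] = False
regions[1]["is_primary_mirror"] = True
print(resolve(regions))              # secondary-region
```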
There are many alternatives to the solution described above, and it can be customized based on your organization's operational requirements and service-level agreements.

Network Connectivity

Depending on your application's connectivity requirements, multiple connectivity models are available: the Internet, an IPsec VPN, or a dedicated link using Azure Express Route.  The right method depends on the application and user needs.  The bandwidth available under each of the three methods varies, so it is best to check with your Azure representative or the Azure Portal to confirm the connectivity options available in a given region.
If you are using Express Route, there are several options including multiple circuits and multi-region access that can be enabled for disaster recovery scenarios.  It is important to work with the Express Route provider to understand the high availability and disaster recovery scenarios they support.

Security

Care needs to be taken when deciding to deploy an application with a public cloud provider.  Your organization's standard security policies, or new ones developed specifically for cloud, should be followed to maintain security compliance.  Cloud deployments carry the added risk of data residing outside client data centers and physical security control.  The use of InterSystems database and journal encryption for data at rest (databases and journals) and data in flight (network communications), with AES and SSL/TLS encryption respectively, is highly recommended.
As with all encryption key management, proper procedures need to be documented and followed per your organization's policies to ensure data safety and prevent unwanted data access or a security breach.
When access is allowed over the Internet, third party firewall devices may be required for extra functionality such as intrusion detection, denial of service protection etc.

Architecture Diagram Examples

The diagrams below illustrate a typical Caché installation providing high availability in the form of database mirroring (both synchronous failover and DR asynchronous), application servers using ECP, and multiple load-balanced web servers.

TrakCare Example

The following diagram illustrates a typical TrakCare deployment with multiple load-balanced web servers, two EPS print servers as ECP clients, and a database mirror configuration.  The Virtual IP address is only used for connectivity not associated with ECP or the CSP Gateway.  The ECP clients and CSP Gateway are mirror-aware and do not require a VIP.
The sample reference architecture diagram below includes high availability in the active or primary region, and disaster recovery to another Azure region if the primary Azure region is unavailable. Also within this example, the database mirrors contain the TrakCare DB, TrakCare Analytics, and Integration namespaces, all within a single mirror set.

TrakCare Azure Reference Architecture Diagram - PHYSICAL ARCHITECTURE

In addition, the following diagram is provided showing a more logical view of architecture with the associated high-level software products installed and functional purpose.
TrakCare Azure Reference Architecture Diagram - LOGICAL ARCHITECTURE

HealthShare Example

The following diagram illustrates a typical HealthShare deployment with multiple load-balanced web servers and multiple HealthShare products, including Information Exchange, Patient Index, Personal Community, Health Insight, and Health Connect.  Each of those products includes a database mirror pair for high availability within an Azure Availability Set.  The Virtual IP address is only used for connectivity not associated with ECP or the CSP Gateway.  The CSP Gateways used for web service communications between the HealthShare products are mirror-aware and do not require a VIP.
The sample reference architecture diagram below includes high availability in the active or primary region, and disaster recovery to another Azure region if the primary Azure region is unavailable. 

HealthShare Azure Reference Architecture Diagram – PHYSICAL ARCHITECTURE



In addition, the following diagram is provided showing a more logical view of architecture with the associated high-level software products installed, connectivity requirements and methods, and the respective functional purpose.
HealthShare Azure Reference Architecture Diagram – LOGICAL ARCHITECTURE


COMMENTS

Given that the Azure pricing for storage contains a transaction element, is there any indication as to how many of these transactions will be consumed opening or saving an object as well as other common actions - obviously a simple object will use much less than a complex one.
Hi Matthew,
Thank you for your question. Pricing is tricky and best discussed with your Microsoft representative.  Looking at premium storage accounts, you only pay for the provisioned disk type, not transactions; however, there are caveats.  For example, if you need only 100 GB of storage, you will be charged for a P10 disk @ 128 GB.  A good Microsoft article to help explain the details can be found here.
Regards,
Mark B
This is great Mark, excellent write up.
Ran into a similar problem a couple of years ago on AWS with the mirror VIP; had a less sophisticated solution with a custom business service on a target production/namespace listening on a keep-alive socket for the ELB to detect which mirror member was active.... re-used it for an auto-scaling group too, as an availability indicator we could put logic behind.   Those links up there to the routines appear broken for me, would love to take a look at that magic.
What does Azure's VPN solution look like for site-to-site connections?   The diagrams above maybe suggest this is possibly bolted to on-prem, but just curious if you had any comments on that with Azure.
Did you provision a DNS zone on a legible domain for internal communications?  I abused a couple of *.info domains for this purpose and found that the hostnames enumerated from Cache were from the instances and not very usable for inter-host communication, and broke things like Enterprise Manager, HS endpoint enumeration, etc.
Does Azure have an Internet gateway or a NAT solution to provide outbound communication from a single address (or fault tolerance)?  The diagram for Web Server Load Balancing looks like they work for both inbound and outbound; just wondered if that was the case.
Again, excellent resource, thanks for taking the time.

Hi Ron,
There are many options available for many different deployment scenarios.  Specifically, for multi-site VPN you can use the Azure VPN Gateway.  Here is a diagram from Microsoft's documentation showing it.  
Here is the link as well to the multi-site VPN details.
As for Internet gateways, yes, Azure has that concept, and the load balancers can be internal or external.  You control access with network security groups, and also by using Azure Traffic Manager and Azure DNS services.  There are tons of options here, and it's really up to you what and how you want to control and manage the network.  Here is a link to Azure's documentation about how to make a load balancer Internet-facing.
The link to the code for some reason wasn't marked as public in the GitHub repository.  I'll take care of that now.
Regards,

Friday 14 April 2017

How to Load Data into Microsoft Azure SQL Data Warehouse using PolyBase & Talend ETL

Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on a massively parallel processing (MPP) architecture, SQL Data Warehouse can handle any enterprise workload.
With the increasing focus on real-time business decisions, there has been a paradigm shift toward not only keeping data warehouse systems up to date but also reducing load times.  The fastest and most optimal way to load data into SQL Data Warehouse is to use PolyBase to load data from Azure Blob storage.  PolyBase uses SQL Data Warehouse's massively parallel processing (MPP) design to load data in parallel from Azure Blob storage.
One of Talend's key differentiators is its open source nature and the ability to leverage custom components, either developed in-house or by the open source community @ Talend Exchange.  Today our focus is on one such custom component, tAzureSqlDWBulkExec, and how it enables Talend to utilize PolyBase to load data into SQL Data Warehouse.
For simplicity, we will key in on the following two scenarios:
  • Load data from any source into SQL DW
  • Load data into SQL DW while leveraging Azure HDInsight and Spark

Load data from any source into SQL DW

talendazureetl1
In this scenario data can be ingested from one or more sources as part of a Talend job.  If needed, data will be transformed, cleansed and enriched using various processing and data quality connectors that Talend provides out of the box.  The output will need to conform to a delimited file format using tFileOutputDelimited.
The output file will then be loaded into Azure Blob storage using tAzureStoragePut.  Once the file is in Blob storage, tAzureSqlDWBulkExec is used to bulk load the data from the delimited file into a SQL Data Warehouse table.
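For reference, the delimited extract is just a flat text file. A toy sketch (hypothetical schema, path, and values) of what tFileOutputDelimited might emit before tAzureStoragePut uploads it:

```shell
# Hypothetical pipe-delimited extract; the field/row terminators must
# match the external file format defined on the SQL DW side.
printf 'OrderID|Customer|Amount\n' > /tmp/orders.csv
printf '1001|Contoso|250.00\n1002|Fabrikam|99.95\n' >> /tmp/orders.csv
wc -l < /tmp/orders.csv   # prints 3
```

Header handling is worth deciding up front: PolyBase treats every row as data, so either drop the header line or skip it when defining the external table load.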

Load data into SQL DW while leveraging Azure HDInsight and Spark

screen-shot-2017-02-08-at-8-09-59-am

As data volumes have increased so has the need to process data faster.  Apache Spark, a fast and general processing engine compatible with Hadoop, has become the go-to big data processing framework for several data-driven enterprises.  Azure HDInsight is a fully-managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark (Please refer to the following link, How to Utilize Talend with Microsoft HDInsight, for instructions on how to connect to an HDInsight cluster using Talend Studio).
Talend Big Data Platform (Enterprise version) provides graphical tools and wizards to generate native Spark code that combines in-memory analytics, machine learning and caching to deliver optimal performance and increased efficiency over hand-coding.  The generated Spark code can be run natively on an HDInsight cluster directly from Talend Studio.
In this scenario, a Talend Big Data job will be set up to leverage an HDInsight Spark cluster to ingest data from one or more sources, apply transformations and write the results to HDFS (Azure Blob storage).  The output file in the Talend Big Data job can use any of the following formats supported by PolyBase:
  • Delimited Text – using tFileOutputDelimited
  • Hive ORC – using tHiveOutput
  • Parquet – using tHiveOutput / tFileOutputParquet
After the completion of the Spark job, a standard job will be executed that bulk loads the data from the Spark output file into a SQL Data Warehouse table using tAzureSqlDWBulkExec.

Performance Benchmark

tAzureSqlDWBulkExec utilizes native PolyBase capability and therefore inherits the full performance benefit of loading data into Azure SQL Data Warehouse this way.  In-house tests have shown this approach to provide a 10x throughput improvement versus standard JDBC.

How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 1

Learn how to use Cloudera Director, Microsoft Active Directory (AD DS, AD CS, AD DNS), Samba, and SSSD to deploy a secure EDH cluster for workloads in the public cloud.
Authenticating users in Apache Hadoop is the first line of security we recommend. As in most, if not all, RDBMSs, a user is given a username and a password to validate their identity, and this is required to access any data managed by those systems. The goal is the same in Apache Hadoop. Since the Hadoop stack does not ship its own authentication component, a Kerberos Key Distribution Center (KDC) is used as the mechanism to identify users.
There are two Kerberos KDC implementations supported on a CDH cluster: an MIT KDC installation, and/or integration with the Kerberos KDC built into Microsoft Active Directory (AD). We generally recommend the latter to our enterprise customers, and this blog will focus on a direct integration of CDH with the Active Directory KDC. This integration is favored because of the other tools that will be used to communicate with Active Directory.
Active Directory
Active Directory is best known for Active Directory Domain Services (AD DS), an identity management service that authenticates users and groups. However, AD includes other powerful services, such as AD CS and AD DNS.
On May 6, 2016, my colleague Ben Spivey wrote a blog post on securing a cluster on Amazon AWS, covering the AD DS and AD CS services in depth. For more details, Ben's post is a good place to start. This post will spend more time on the AD DNS service.
Active Directory Domain Name System
Deploying a CDH cluster requires both forward and reverse name resolution for internal IP addresses. When deploying a cluster on-premises, this is usually handled by your system administrator; when you deploy a cluster on Amazon AWS, it is configured automatically when you launch an EC2 instance.
A forward DNS lookup resolves a Fully Qualified Domain Name (FQDN) to an IP address, and a reverse DNS lookup does the opposite, resolving an IP address to an FQDN. Currently, Microsoft Azure does not provide reverse DNS lookup for internal private IP addresses. This will be covered later.
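The two lookup directions are easy to see from the shell. Illustrated here with localhost, since the VM names in this post are environment-specific:

```shell
# Forward lookup: name -> IP
getent hosts localhost

# Reverse lookup: IP -> name. CDH requires this direction to succeed
# for every cluster host, which is exactly what Azure's internal DNS
# does not provide out of the box.
getent hosts 127.0.0.1
```

For cluster hosts you would run the same checks against the FQDN and the private IP of each node (the post uses the `host` command for this; `getent` exercises the full NSS resolution path instead of querying DNS directly).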
There are many options for DNS when deploying on Azure: you can install the supported BIND package for your Linux OS, use an existing Active Directory Domain Name System, and so on. This post covers AD DNS in more detail.
If not already configured, ensure your AD administrator has properly configured a reverse DNS zone in the DNS Manager as seen below.
Reverse Zone
DNS Manager
The important section in the figure above is the red box in the “Reverse Lookup Zones”. It shows the zone configured to host all the DNS objects for a particular subnet.
Forward Zone
DNS Manager
This is a view of the “Forward Lookup Zones” for the CLOUDERA.MORANTUS.COM domain.
Active Directory Users and Computers
A view of my OU tree, showing zero entries.
Azure Virtual Machine
I provisioned a VM in Azure with all the default DNS settings, and we will join it to our AD DS and DNS services.
DNS Services
As you can see, the hostname -f command displays a very long FQDN for my VM, and hostname -i gives the IP address associated with the VM. Next, I did a forward DNS lookup using host <FQDN>, which resolved to the IP address. Then I did a reverse DNS lookup using host <IPaddress>; as shown in the red box above, it did not locate a reverse entry for that IP address. A reverse lookup is a requirement for a CDH deployment. We’ll revisit this later.
SAMBA
In order to configure our RHEL 6.7 VM to communicate with Active Directory, we need a tool called Samba. Samba is a Linux-based utility that enables the integration of Linux systems with AD.
Join the VM to AD with Samba
  1.  Ensure the DNS servers property of your Virtual Network in the Azure portal points to your AD server.

Azure Portal

  2.  Install the packages needed to integrate with AD



  3.  Configure the VM to point to the AD DNS server
AD DNS Server
The nameserver is the IP address of the AD server. This can also be applied by running “service network restart” on the VM.
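The resulting resolver configuration looks like the following (the nameserver IP is hypothetical; the search domain follows the forward zone shown earlier):

```
# /etc/resolv.conf
search cloudera.morantus.com
nameserver 10.0.0.10   # private IP of the AD server
```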
  4.  Configure Samba to join the AD domain and verify the entry in AD. This must be executed as a privileged user; in this case, “jmorantus” is an admin account in Active Directory.
AD DNS Server
Note: You can ignore the failed DNS update error shown above. We need to create a Kerberos keytab with a privileged account to update/create DNS objects in AD; that step will be executed later.
DNS Objects
As you can see above, we succeeded in joining our VM to the AD domain, and an AD object was created in the Servers OU.
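The join in the step above is driven by Samba’s net ads join command (run here under the jmorantus admin account), which reads its realm settings from smb.conf. A minimal AD-member sketch (the realm matches the CLOUDERA.MORANTUS.COM domain shown earlier; the other values are typical defaults, not taken from the original):

```
# /etc/samba/smb.conf -- minimal settings for joining an AD domain
[global]
   workgroup = CLOUDERA
   realm = CLOUDERA.MORANTUS.COM
   security = ads
   kerberos method = secrets and keytab
```

With this in place, `net ads join -U jmorantus` performs the join and creates the machine account in AD.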
  5.  Configure the Kerberos krb5.conf file to generate a keytab file for updating DNS in AD
Keytab File Generation
  6.  Update/Create Forward and Reverse DNS entries
Create/Update Forward and Reverse DNS Settings
View of Forward DNS entry added to AD DNS service

View of reverse DNS entry added to AD DNS service.
Note: it’s worth mentioning that Active Directory will age out DNS entries that it considers “inactive”. An additional process should be implemented to keep these entries “alive” in AD.
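The keytab-driven DNS update in the last two steps depends on the realm configuration in /etc/krb5.conf. A hedged sketch (the KDC hostname is invented; the realm follows the forward zone shown earlier):

```
# /etc/krb5.conf -- realm settings for the AD KDC
[libdefaults]
    default_realm = CLOUDERA.MORANTUS.COM
    dns_lookup_kdc = true

[realms]
    CLOUDERA.MORANTUS.COM = {
        kdc = ad1.cloudera.morantus.com
        admin_server = ad1.cloudera.morantus.com
    }
```

With a keytab for the privileged account generated, a kinit against that keytab followed by nsupdate with the -g flag (GSS-TSIG) is what adds the A and PTR records to AD DNS.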
SSSD
The System Security Services Daemon (SSSD) is used to cache user and group information locally on a Linux system. This integration is also necessary to configure authorization with Apache Sentry for data access.
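A minimal /etc/sssd/sssd.conf for an AD-joined host might look like the following (the domain name matches this post; the remaining values are typical defaults, not taken from the original):

```
# /etc/sssd/sssd.conf -- resolve users/groups from AD and cache them locally
[sssd]
    config_file_version = 2
    services = nss, pam
    domains = cloudera.morantus.com

[domain/cloudera.morantus.com]
    id_provider = ad
    auth_provider = ad
    cache_credentials = true
```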

Now that SSSD is fully configured, we’ll verify we can read user information from AD.
Read User Information in AD

Here you can see that with SSSD stopped, the VM does not know of the user “scm-cloudera”; with SSSD running, the user information is pulled from AD. If you are looking for a commercial option, Cloudera also recommends Centrify.
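The check in the screenshot can be reproduced with getent and id; shown here against a local account, since scm-cloudera only exists in AD:

```shell
# With SSSD running, AD users resolve through NSS exactly like local ones.
getent passwd root     # a local account always resolves
id root

# For the AD-backed account used in this post you would run:
# getent passwd scm-cloudera
```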
Conclusion
You should now be able to configure a VM on Azure, join an AD domain, and create DNS entries in the AD DNS server. These steps will work for other cloud providers and for on-premises deployments as well. In Part 2 of this series, we’ll cover creating a Kerberized cluster with Cloudera Director on Azure.