Azure Data Factory (ADF) is a fully managed, serverless data integration service. You can use it to create data pipelines for ETL (extract, transform, load) or ELT (extract, load, transform) workflows, and it offers a GUI-based, code-free option for building those pipelines. Let's dig in.
High Level Architecture
Pipelines are the bread and butter of ADF; they are what you use to build your data workflows. A Pipeline is simply a logical grouping of Activities, and Activities come in 3 different varieties (more on that below). Pipelines can be triggered manually, on a schedule, or in response to an event.
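Everything you build in the GUI can also be created through APIs and SDKs. As a minimal sketch (not the only way to do it), here is roughly what creating and triggering a trivial Pipeline looks like with the Python SDK (azure-mgmt-datafactory); the subscription, resource group, factory, pipeline, and trigger names are hypothetical placeholders:

```python
# pip install azure-identity azure-mgmt-datafactory
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, WaitActivity, PipelineReference,
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence, TriggerPipelineReference,
)

# Hypothetical identifiers -- replace with your own subscription, resource group, and factory.
sub_id, rg_name, df_name = "<subscription-id>", "my-rg", "my-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# A Pipeline is just a logical grouping of Activities -- here, a single Wait activity.
pipeline = PipelineResource(
    activities=[WaitActivity(name="WaitTenSeconds", wait_time_in_seconds=10)])
adf_client.pipelines.create_or_update(rg_name, df_name, "DemoPipeline", pipeline)

# Trigger the pipeline manually...
adf_client.pipelines.create_run(rg_name, df_name, "DemoPipeline", parameters={})

# ...or on a schedule (every 15 minutes) via a Schedule trigger.
trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Minute", interval=15,
        start_time=datetime(2024, 1, 1, tzinfo=timezone.utc), time_zone="UTC"),
    pipelines=[TriggerPipelineReference(pipeline_reference=PipelineReference(
        reference_name="DemoPipeline", type="PipelineReference"))],
))
adf_client.triggers.create_or_update(rg_name, df_name, "Every15Minutes", trigger)
adf_client.triggers.begin_start(rg_name, df_name, "Every15Minutes").result()
```

The later sketches in this post reuse the same adf_client, rg_name, and df_name.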
Activity Type 1: Data Movement
The first type of Activity is the Data Movement activity. This category contains just two Activities: Delete, which removes data from a data store, and Copy, which copies data from a source data store to a sink (destination) data store. The list of supported data stores is quite large, but there are some limitations: some data stores are not supported as sources, some are not supported as sinks, and some only support self-hosted runtimes. See the docs for a full breakdown of all the data stores and their supported scenarios.
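To make the Copy activity concrete, here is a rough sketch of what it looks like in the Python SDK. The dataset names are hypothetical (Datasets are covered in the next paragraph), and the client setup comes from the earlier sketch:

```python
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, AzureSqlSink,
)

# Copy from a blob dataset (source) to an Azure SQL dataset (sink).
# "InputBlobDataset" / "OutputSqlDataset" are hypothetical Dataset names.
copy_activity = CopyActivity(
    name="CopyBlobToSql",
    inputs=[DatasetReference(reference_name="InputBlobDataset", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="OutputSqlDataset", type="DatasetReference")],
    source=BlobSource(),
    sink=AzureSqlSink(),
)

# Reuses adf_client / rg_name / df_name from the earlier sketch.
adf_client.pipelines.create_or_update(
    rg_name, df_name, "CopyPipeline",
    PipelineResource(activities=[copy_activity]),
)
```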
You define your data stores by creating Linked Services; you can think of a Linked Service as something like a connection string. In addition to defining how to connect to each data store (the Linked Service), you must also define one or more Datasets for each data store. A Dataset identifies the specific data (tables, files, folders, documents) you want to target within that data store.
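For example, a Linked Service for an Azure Storage account plus a Dataset pointing at a specific blob file might look roughly like the sketch below; the connection string, container, and names are placeholders, and the client is the one from the first sketch:

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString,
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
)

# A Linked Service is roughly a connection string: how ADF connects to the store.
storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))
adf_client.linked_services.create_or_update(rg_name, df_name, "MyStorageLinkedService", storage_ls)

# A Dataset identifies the specific data (here, one CSV file in a folder) within that store.
blob_ds = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=LinkedServiceReference(
        reference_name="MyStorageLinkedService", type="LinkedServiceReference"),
    folder_path="input-container/raw",
    file_name="data.csv",
))
adf_client.datasets.create_or_update(rg_name, df_name, "InputBlobDataset", blob_ds)
```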
Activity Type 2: Data Transformation
The next type of Activity is the Data Transformation activity. These allow you to transform and process your data. Data Transformation requires some type of compute, and each type of compute supports its own specific set of Activities. Let's discuss the various options.
Most Data Transformation Activities use compute options that exist outside of ADF (but there are exceptions, which I will discuss later). See the docs for a full breakdown of the various compute options. Some examples are Azure HDInsight clusters, Azure Databricks, Azure Machine Learning, and more. Just as you did with data stores, you must create Linked Services that represent your compute options and how to connect to them. Each compute option supports a limited number of Activities. For example, the HDInsight cluster option supports 5 Activities: Hive, Pig, Spark, MapReduce, and Hadoop Streaming.
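As one example of dispatching work to external compute, a sketch of an Azure Databricks notebook activity might look like this; the Databricks Linked Service name, notebook path, and parameter are hypothetical:

```python
from azure.mgmt.datafactory.models import (
    PipelineResource, DatabricksNotebookActivity, LinkedServiceReference,
)

# The activity runs on the compute described by the Databricks Linked Service,
# not inside ADF itself -- ADF only dispatches and monitors the job.
notebook_activity = DatabricksNotebookActivity(
    name="TransformSales",
    notebook_path="/Shared/transform_sales",            # hypothetical notebook
    linked_service_name=LinkedServiceReference(
        reference_name="MyDatabricksLinkedService",     # hypothetical Linked Service
        type="LinkedServiceReference"),
    base_parameters={"run_date": "2024-01-01"},
)

adf_client.pipelines.create_or_update(
    rg_name, df_name, "TransformPipeline",
    PipelineResource(activities=[notebook_activity]),
)
```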
As noted above, there are a few compute options that are different. The first one is an on-demand HDInsight cluster. ADF automatically spins up an HDInsight cluster before a job is submitted, and then removes the cluster when the job is complete. Note that provisioning the cluster on demand can take 20 minutes or more, so keep that in mind. The cluster can be of type 'Hadoop' or 'Spark', and it's created inside your own Azure Subscription.
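A rough sketch of the Linked Service definition for such an on-demand cluster, with placeholder subscription, tenant, and service principal values:

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, HDInsightOnDemandLinkedService,
    LinkedServiceReference, SecureString,
)

# On-demand HDInsight: ADF provisions the cluster in *your* subscription before the
# job runs and tears it down afterwards (allow 20+ minutes for provisioning).
on_demand_hdi = LinkedServiceResource(properties=HDInsightOnDemandLinkedService(
    cluster_size=4,
    time_to_live="00:15:00",                        # idle time before the cluster is deleted
    version="4.0",
    linked_service_name=LinkedServiceReference(     # storage account the cluster uses
        reference_name="MyStorageLinkedService", type="LinkedServiceReference"),
    host_subscription_id="<subscription-id>",
    cluster_resource_group="my-rg",
    tenant="<tenant-id>",
    service_principal_id="<sp-app-id>",
    service_principal_key=SecureString(value="<sp-secret>"),
    cluster_type="spark",                           # 'hadoop' or 'spark'
))
adf_client.linked_services.create_or_update(rg_name, df_name, "OnDemandHDInsight", on_demand_hdi)
```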
The second exception is when you are working with Data Flows. For compute, Data Flows use on-demand, fully managed, auto-scaled Apache Spark clusters that run on Azure-hosted Integration Runtimes. The first type of Data Flow is the "Mapping" Data Flow, which gives you a GUI-based visual interface for building complex data transformation logic without writing any code; see the docs for the full list of supported transformations. The second type is the "Wrangling" Data Flow, which offers code-free data preparation using Power Query, letting you run Power Query M functions inside your ADF Pipelines. Wrangling Data Flows only support a handful of data sources; see the docs for full details.
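Once a Mapping Data Flow has been authored in the designer, a Pipeline runs it through an Execute Data Flow activity. A minimal sketch, assuming a hypothetical Data Flow named "CleanCustomersFlow" already exists in the factory:

```python
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteDataFlowActivity, DataFlowReference,
)

# Run a Mapping Data Flow (authored in the visual designer) from a pipeline.
dataflow_activity = ExecuteDataFlowActivity(
    name="RunCleanCustomers",
    data_flow=DataFlowReference(reference_name="CleanCustomersFlow",
                                type="DataFlowReference"),
)

adf_client.pipelines.create_or_update(
    rg_name, df_name, "DataFlowPipeline",
    PipelineResource(activities=[dataflow_activity]),
)
```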
The last exception to discuss is running SSIS Packages. When ADF executes SSIS Packages, it will do so on fully managed compute instances running on Azure-hosted SSIS Integration Runtimes. More on the various types of Integration Runtimes below.
Activity Type 3: Control
Control Activities serve a variety of purposes. Some resemble programming constructs, such as setting variables, appending to variables, loops, and conditionals. Others deal with the Pipeline itself, like invoking another Pipeline, getting metadata, and waiting for a specified amount of time. See the docs for a full list of all control Activities.
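A small sketch of a few of these, again via the Python SDK: a ForEach loop over a pipeline parameter that invokes another Pipeline, followed by a Wait (the names and parameter are hypothetical):

```python
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, Expression,
    ForEachActivity, ExecutePipelineActivity, PipelineReference, WaitActivity,
)

# Loop over a list passed in as a pipeline parameter and invoke another pipeline per item.
for_each = ForEachActivity(
    name="ForEachTable",
    items=Expression(value="@pipeline().parameters.tables"),
    activities=[ExecutePipelineActivity(
        name="RunCopyPipeline",
        pipeline=PipelineReference(reference_name="CopyPipeline",
                                   type="PipelineReference"))],
)

control_pipeline = PipelineResource(
    parameters={"tables": ParameterSpecification(type="Array")},
    activities=[for_each, WaitActivity(name="CoolDown", wait_time_in_seconds=30)],
)
adf_client.pipelines.create_or_update(rg_name, df_name, "ControlPipeline", control_pipeline)
```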
Integration Runtimes
Integration Runtimes are the compute infrastructure ADF uses to execute Data Movement Activities against linked data stores, dispatch Data Transformation Activities to linked compute instances, execute Data Flows, and execute SSIS Packages. Integration Runtimes come in three different flavors.
Self-Hosted Integration Runtimes
The first type of runtime is a Self-Hosted runtime. As the name suggests, this option uses your own machine(s) that you host yourself (either in the cloud or on-premises). A single Self-Hosted runtime can scale out across up to four of your self-hosted machines, and Self-Hosted runtimes only support Windows operating systems. Some of the reasons that would require you to use a Self-Hosted runtime include the following (a registration sketch follows the list):
If a Linked Service is not accessible from ADF's public cloud environment
If a Data Store requires a custom driver that is not included on the Azure-hosted runtimes
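Registering a Self-Hosted runtime in ADF is a two-step process: create the runtime resource in the factory, then install the runtime software on your own machine using the generated authentication key. A minimal sketch, with a hypothetical runtime name:

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

# Register the Self-Hosted Integration Runtime in the factory.
shir = IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
    description="Runtime for on-premises data stores"))
adf_client.integration_runtimes.create_or_update(rg_name, df_name, "OnPremRuntime", shir)

# Fetch the auth keys; one of them is pasted into the runtime installer on your machine.
keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, "OnPremRuntime")
print(keys.auth_key1)
```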
Azure-hosted Integration Runtimes
The next type of runtime is an Azure-hosted runtime. As the name suggests, this option uses fully managed machines that are hosted and automatically scaled by Azure. The Azure-hosted runtime also includes the fully managed, auto-scaled Apache Spark cluster used to run Data Flows. You can configure an Azure-hosted runtime to operate in a specific Azure Region, or you can configure it for "Auto-Resolve", which means it will pick the best Azure Region to use for each task.
Since this runtime is hosted by Azure, it only has access, by default, to Linked Services with publicly accessible endpoints. However, you can configure your Azure-hosted runtime to use a "Managed" Virtual Network. This virtual network is fully managed by Azure and allows you to create "Managed" Private Endpoints for certain types of Linked Services, so the runtime can communicate with those Linked Services privately. Check the docs for full details (look at the column for Managed Private Endpoint).
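For illustration, a sketch of creating an Azure-hosted runtime pinned to a specific region and sized for Data Flow workloads (the names, region, and sizes are placeholders):

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties, IntegrationRuntimeDataFlowProperties,
)

# An Azure-hosted runtime pinned to a region, with a Spark cluster size for Data Flows.
azure_ir = IntegrationRuntimeResource(properties=ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        location="westeurope",
        data_flow_properties=IntegrationRuntimeDataFlowProperties(
            compute_type="General", core_count=8, time_to_live=10)),
))
adf_client.integration_runtimes.create_or_update(rg_name, df_name, "WestEuropeRuntime", azure_ir)
```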
Some of the reasons that would require you to use an Azure-hosted runtime include:
You do not want to create and manage your own Self-hosted machines
If you need to execute a Data Flow
If you want to use the on-demand HDInsight cluster
Azure-hosted SSIS Integration Runtimes
The last type of runtime to discuss is the Azure-hosted SSIS runtime. These are hosted and fully managed by Azure. By default, this runtime is provisioned in the public Azure network. However, you do have the option to inject the runtime into your own private Azure Virtual Network, if desired.
The only reason that would require you to use an Azure-hosted SSIS runtime:
If you need to run the Execute SSIS Package Activity
More Reading
That's everything that I wanted to cover for a high-level overview of Azure Data Factory. There are newer products that build on these same capabilities, namely Azure Synapse Analytics and, more recently, Microsoft Fabric. I may eventually cover those products as well, and if I do I will update this page with links.
I've written about Azure Data Factory before. Specifically, I've written a 2-part series on doing CI/CD with Azure Data Factory. Please check that series out if you're curious. In that series, I also include a link to my GitHub repo, where I have some example Azure DevOps Pipelines that will help you get started.
Thanks for reading!