
Introduction to Azure Synapse Analytics Part 2: Data Preparation and Management

In this article we explore how Azure Synapse Analytics helps with data preparation and management, eliminating the need for custom extract, transform, and load (ETL) code.
Here we highlight ELT, T-SQL, and Java/Scala/Python support: all tools that help ingest and prepare data so that BI and data science teams can use it to answer business questions.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.

The work of data scientists and business intelligence (BI) teams involves making sense of the data that comes from business processes. However, to transform this sea of unstructured data into valuable insights ready for consumption, we need a data warehouse platform and a set of programming tools.

The traditional way to build a data warehouse is to connect an integration process to multiple sources. It uses a three-step approach for loading data into the data warehouse, called extract, transform, and load (ETL). In ETL, we transform our data in a staging area before loading it into the system.

Data scientists and BI teams often use ETL on-premises with relational, structured data. ETL is not well suited to data lake operations, so we should use a different approach when loading data into Azure Synapse Analytics.

In the first article of this series, we explored Azure Synapse Analytics and some of its features. This second article of the series explores how to extract, transform, and load data in Azure Synapse Analytics. It also highlights how its flexible language support helps engage developers to provide BI teams and data scientists with robust data solutions.

ETL versus ELT

With extract, load, and transform (ELT), raw data coming from data sources loads directly into the target system and is only transformed afterward. ELT suits structured and unstructured data sources in the cloud, which makes it more appropriate than ETL for handling enormous datasets and preparing them for business intelligence and data analytics.

Using ELT, we first extract the source data into a file format that lands in Azure. Typically, this format is a delimited text file (CSV).

We can use one of many loading methods, including the bulk copy program (BCP) and the SqlBulkCopy API, both popular among SQL Server developers. However, in the Big Data and cloud environment of Azure Synapse Analytics, PolyBase external tables and the COPY statement let us load data faster and scale better.

PolyBase efficiently allows multiple nodes to import data in parallel. Within Azure Data Factory (ADF), we can use PolyBase to define a pipeline that transfers data from an Azure Synapse Analytics table to a Databricks DataFrame, or from there back to an Azure Synapse Analytics table.

We can use the COPY statement in Azure Data Factory to load data into Azure Synapse Analytics.

With PolyBase and the COPY statement, we can access external data stored in Azure Blob Storage or Azure Data Lake Store via the T-SQL language. For the most flexibility when loading, use the COPY statement.
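Here is a minimal sketch of what loading a set of CSV files from Blob Storage with the COPY statement might look like. The target table, storage account, container, folder, and SAS token are placeholders to adapt to your environment:

SQL
-- Load all CSV files from a landing folder into a staging table in the dedicated SQL pool
-- (the staging table must already exist)
COPY INTO [dbo].[FactInvoices_staging]
FROM 'https://<storage-account-name>.blob.core.windows.net/<container-name>/invoices/*.csv'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
);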

Dedicated SQL Pool

Dedicated SQL pool (formerly known as SQL Data Warehouse (DW)) refers to the enterprise data warehousing features in Azure Synapse Analytics. Developers benefit from the power of T-SQL to build data definition language (DDL) and data manipulation language (DML) statements for relational database operations targeting a dedicated SQL pool in their data warehouse.

As in SQL Server, the CREATE TABLE AS SELECT (CTAS) syntax is helpful to quickly create a new table in Azure Synapse Analytics’ dedicated SQL pool from the results of a SELECT statement. CTAS runs as a parallel operation, and you can specify the data distribution and table structure type.

Here is a simple example of how to create a new table from an existing one using CTAS and the round-robin option, which quickly distributes rows evenly, at random, across all distributions:

SQL
CREATE TABLE [dbo].[FactInvoices_new] 
WITH
( 
DISTRIBUTION = ROUND_ROBIN 
,CLUSTERED COLUMNSTORE INDEX 
) 
AS 
SELECT  * 
FROM    [dbo].[FactInvoices];

The previous FactInvoices fact table contains data created initially in a transactional system and then loaded into the dedicated SQL pool for analysis.

The following example uses CTAS to create a table with the total shipment value and weight for each customer, based on a join of two tables:

SQL
CREATE TABLE [dbo].[ShipmentsPerCustomer]  
WITH  
(  
    DISTRIBUTION = ROUND_ROBIN,  
    CLUSTERED COLUMNSTORE INDEX  
)
AS  
SELECT  
c.CustomerID, 
c.Name AS CustomerName,   
SUM(s.Value) as TotalValue, 
SUM(s.Weight) as TotalWeight 
FROM [dbo].[Customer] AS c 
INNER JOIN [dbo].[Shipment] AS s 
ON c.CustomerID = s.CustomerID 
GROUP BY c.CustomerID, c.Name;

Azure Blob Storage

Organizations must deal with data stored in all types of files. Storing and managing documents, images, videos, logs, and database files is challenging, but Azure Blob Storage helps.

The Blob in Azure Blob Storage stands for binary large object, which can hold any kind of content. Azure Blob Storage helps organizations manage many types of unstructured text or binary files in the cloud at a low cost. It secures these files with user access control through Azure Active Directory (Azure AD).

There are many ways to upload content to Azure Blob Storage, including AzCopy and the Azure CLI.

Azure Blob Storage integrates well with other Azure platform products. For example, we can use Azure Blob Storage with Azure Content Delivery Network (CDN) to direct content to users from the nearest data center. This reduces load times and increases responsiveness. Also, to boost application performance and enhance user experience, we can configure Azure Search to index the multiple file types in our Azure Blob Storage.

We can access the data in Azure Blob Storage via the Azure Storage REST API from any device that can make HTTP requests. We can also access Azure Blob Storage from the Azure portal and the command line with Azure PowerShell and Azure CLI.

For platforms like .NET, Java, Python, and Node.js, we can use an Azure Storage client library that wraps the REST API calls to access Azure Blob Storage from our applications.

The following Java code example creates a storage container and uploads a local file to this storage container. Since Azure Blob Storage is optimized for storing massive amounts of unstructured data, we can use this code to move files containing images, videos, audio, logs, and other types of unstructured data:

Java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobErrorCode;
import com.azure.storage.blob.models.BlobStorageException;

String sasToken = "<your-sas-token>";

/* Create a new BlobServiceClient with a SAS token */
BlobServiceClient blobServiceClient = new BlobServiceClientBuilder()
    .endpoint("https://<your-storage-account-name>.blob.core.windows.net")
    .sasToken(sasToken)
    .buildClient();

/* Create a container client and the container itself */
BlobContainerClient containerClient = blobServiceClient.getBlobContainerClient("sample-container");
try {
    containerClient.create();
} catch (BlobStorageException ex) {
    // The container may already exist, so don't throw an error
    if (!ex.getErrorCode().equals(BlobErrorCode.CONTAINER_ALREADY_EXISTS)) {
        throw ex;
    }
}

/* Upload the local file to the container */
BlobClient blobClient = containerClient.getBlobClient("sample-blob");
blobClient.uploadFromFile("sample-file");

Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a highly productive data lake dedicated to running Big Data analytics on Azure while managing massive amounts of data. Built on top of Azure Blob Storage, it incorporates all of its benefits, including file system semantics, file-level security, and scale. Additionally, it offers low cost, high availability, and resilience. These additional features further lower the total cost of ownership for running Big Data analytics on Azure.
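To make the PolyBase route mentioned earlier more concrete, here is a rough sketch of loading delimited files from Azure Data Lake Storage Gen2 into a dedicated SQL pool using external tables and CTAS. All object names, columns, and the managed identity credential are assumptions to adapt to your environment, and a database master key must already exist before creating the credential:

SQL
-- Credential and external data source pointing at the ADLS Gen2 account (hypothetical names)
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'Managed Service Identity';

CREATE EXTERNAL DATA SOURCE AdlsInvoices
WITH
(
    TYPE = HADOOP,
    LOCATION = 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

-- File format describing the delimited text files landed in the lake
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- External table over the raw files; no data moves yet
CREATE EXTERNAL TABLE [dbo].[FactInvoices_external]
(
    InvoiceID INT,
    CustomerID INT,
    Value DECIMAL(18, 2)
)
WITH
(
    LOCATION = '/invoices/',
    DATA_SOURCE = AdlsInvoices,
    FILE_FORMAT = CsvFileFormat
);

-- Import the files into the dedicated SQL pool in parallel with CTAS
CREATE TABLE [dbo].[FactInvoices_staged]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [dbo].[FactInvoices_external];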

Read, Transform, and Move Data Between Azure Synapse Analytics and Azure Databricks

To bridge the gap between a data lake and a data warehouse, Azure has its own Databricks implementation, called Azure Databricks. This cloud tool works alongside Azure Synapse Analytics to explore, prepare, train, and transform data.

To demonstrate concepts, get insights, and validate our ideas, let’s use Azure Synapse Studio notebooks to experiment with our data. A notebook is a readable, human-friendly document we create by freely adding blocks of text and code snippets. Synapse notebooks support four Apache Spark languages: PySpark (Python), Spark (Scala), Spark SQL, and .NET for Apache Spark (C#).

Let’s explore using the Scala language to access Azure Blob Storage, read a table, perform an SQL query, and write data back to another table.

First, we set up the Azure Blob Storage account access key in the notebook session configuration:

Scala
spark.conf.set( 
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net", 
  "<storage-account-access-key>")

Next, we set up a Java Database Connectivity (JDBC) connection string and a Blob Storage staging directory (tempDir), run a query against Azure Synapse, and load the results into a Spark DataFrame using spark.read:

Scala
import org.apache.spark.sql.DataFrame

val df: DataFrame = spark.read
  .format("com.databricks.spark.sqldw") 
  .option("url", "jdbc:sqlserver://<connection-string>") 
  .option("tempDir", "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>") 
  .option("forwardSparkAzureStorageCredentials", "true") 
  .option("query", "select OrderId, count(*) as itemCount from FactOrderDetails group by OrderId") 
  .load()

Finally, we write the transformed data back to an Azure Synapse table with df.write:

Scala
df.write
  .format("com.databricks.spark.sqldw") 
  .option("url", "jdbc:sqlserver://<connection-string>") 
  .option("forwardSparkAzureStorageCredentials", "true") 
  .option("dbTable", "<table-name>") 
  .option("tempDir", "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>") 
  .save()

Conclusion

We have explored how Azure Synapse Analytics uses ELT as an alternative to ETL for extracting, loading, and transforming data. ELT moves the load step ahead of transformation and lets the data be transformed inside the data warehouse itself.

We can ingest, prepare, and manage data in Azure Synapse Analytics with no-code tools or a supported language, including T-SQL, Java, Scala, Python, and C#. Azure Synapse Analytics’ flexible language support enables developers to leverage their programming skills to deliver data solutions that BI and data science teams can use to answer business questions.

To learn more, continue to the third and final article of this series to explore how data science and business intelligence teams can use Azure Synapse Analytics data to gain new insight into business processes.

Plus, you can learn more about how to use Azure Synapse Analytics to drive business intelligence and machine learning by attending Microsoft’s Hands-on Training Series for Azure Synapse Analytics. Led by a member of the Azure Synapse engineering team, each episode offers hands-on Azure Synapse training.

This article is part of the series 'Introduction to Azure Synapse Analytics'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Instructor / Trainer, Alura Cursos Online
Brazil
