For most, if not all, organizations, data is the most important business asset. We’ve previously looked at what a modern data warehouse is and how to assess whether you need one. Now, let’s take it a step further and dive into how to build a modern data warehouse in Microsoft Azure!
The goal of implementing a data warehouse
Building and implementing a modern data warehouse can take up to three years, and given all the aspects you need to focus on, it’s simply not possible to implement the entire solution in one go. That’s why it makes sense to structure your data warehouse implementation into smaller parts. In other words, you want to implement your data warehouse in phases, with each phase having its own goals and requirements.
With that in mind, the goals for your first phase might typically look like this:
- Build a central, unified data model that uses data from a single business area, such as sales, marketing, or human resources, while still allowing for expansion to other areas of operation in the future.
- Integrate data from your largest business system into the data warehouse.
- Make data available within a specified time, regardless of when the data is collected or what type of data it is.
- Load new data from your business system into the data warehouse on a regular schedule.
- Prepare the data in the data warehouse for use by employees across the organization, with security settings that give users access to the data that is relevant to them and shield them from data that is not.
- Build an analytical model that holds large amounts of historical business data and gives you valuable insight into it.
- Develop a dashboard for the specific business area, with all security measures in place, that responds in a few seconds or less.
While these goals may vary slightly depending on your specific needs and requirements, these are often standard goals for many data warehouse implementations. However, as you can probably imagine, they require a lot of work and come with a number of challenges that you must overcome to make your implementation successful.
With that in mind, let's look at the process of implementing an enterprise data warehouse.
Typical modern data warehouse architecture
To look at the implementation process, we will consider each component of a typical modern data warehouse architecture in Microsoft Azure. In this example, the Azure architecture consists of:
- SQL Server as a data source.
- Blob storage in the form of Azure Data Lake Storage Gen 2 for storing data before loading it into the data warehouse.
- Azure Synapse (dedicated SQL pool) to perform analysis on huge amounts of data.
- Azure Analysis Services to provide data modeling capabilities.
- Azure Active Directory to authenticate users connecting to Azure Analysis Services through Power BI.
How to get data
As we mentioned above, one of the first goals of implementing a data warehouse is to build a central, unified data model that uses data from a single area of business. You also need to integrate data from your largest business system into the data warehouse.
To do this, you need to combine all your structured, unstructured, and semi-structured data. Typically, unstructured data will consist of logs, files, and various types of media. On the other hand, structured data will be the data you get from your business applications like CRM, marketing platform, or sales platform. As mentioned earlier, in this example, we will only use one data source.
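To make this concrete, here is a minimal sketch of extracting a table from a single SQL Server source system. It assumes Python with the pyodbc, pandas, and pyarrow packages installed; the server, database, table, and column names are placeholders, not values from a real system.

```python
# A minimal sketch: pull structured data from one source system (an
# on-premises SQL Server here). All connection details are placeholders.
import pyodbc
import pandas as pd

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=onprem-sql.contoso.local;"   # hypothetical source server
    "DATABASE=SalesDB;"                  # hypothetical business database
    "Trusted_Connection=yes;"
)

with pyodbc.connect(conn_str) as conn:
    # Extract only the columns the unified data model actually needs.
    orders = pd.read_sql(
        "SELECT OrderID, CustomerID, OrderDate, Amount FROM dbo.Orders", conn
    )

# Write the extract locally before transferring it to the lake; Parquet keeps
# the file compact (more on file formats later in this post).
orders.to_parquet("orders_extract.parquet", index=False)
```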
Data storage
Once you know how and what data you want to bring into your data warehouse, the next step is to extract all the data from the respective file sources. Here you face one of the main challenges of a modern data warehouse: how do you store data efficiently?
To answer this question, you usually need to consider three important things:
- Where to store the files and how to structure and organize them.
- How you want to split the files and how much data each file should contain.
- What file format you want to extract the data to.
Let's take a closer look at these questions.
1. Where do you want to store the files and how do you want to structure and organize them?
It is very important to plan how your unstructured, semi-structured, and structured raw data from your data sources will be stored. Typically, when you deploy a modern data warehouse on Microsoft Azure, you can store your files in a data lake or Blob storage.
Blob storage is Microsoft's object storage solution for the cloud. It is specifically designed and optimized for storing large amounts of unstructured data. It is well suited for:
- Serving images, files, or documents directly to a browser.
- Storing files for distributed access across an entire company.
- Streaming video and audio.
- Writing to log files.
- Storing data for backup and restore, archiving, or disaster recovery.
- Storing data for analysis by an Azure-hosted or on-premises data analytics solution.
Azure Data Lake Storage Gen 2, in contrast, is built on Azure Blob Storage and adds a set of features specifically geared toward big data analytics. It effectively combines the capabilities of Azure Data Lake Storage Gen 1 with those of Azure Blob Storage. This gives Data Lake Storage Gen 2:
- A hierarchical file system
- File system semantics
- File-level security
- Scalability
- Low-cost storage
- High availability
- Strong consistency
- Disaster recovery options
While choosing the right storage solution depends on your specific needs and requirements, modern data warehouses are designed and implemented with big data analytics in mind. As such, when implementing a modern data warehouse, Azure Data Lake Storage Gen 2 may be the most appropriate choice.
When you choose to implement it, you will typically enjoy the following benefits:
- Centralized access to a copy of the data from your different data sources.
- Optimized data warehouse performance, because you don't have to copy or transform data as a prerequisite for analytics. Compared with Blob Storage's flat namespace, the hierarchical namespace also improves the performance of management tasks, which improves overall performance.
- Easier data processing, because you can organize your data into folders and subfolders.
- Maintainable security, because you can define POSIX permissions on folders or individual files.
- Cost-effectiveness, because it is built on Azure Blob Storage, which is designed to be low-cost, and the additional features further reduce the total cost of ownership.
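As an illustration, here is a minimal sketch of landing an extracted file in Azure Data Lake Storage Gen 2 using its hierarchical namespace. It assumes Python with the azure-identity and azure-storage-file-datalake packages; the storage account, container, and folder names are hypothetical.

```python
# A sketch of uploading an extract into a hierarchical folder structure in
# Azure Data Lake Storage Gen 2. Account, container, and paths are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

filesystem = service.get_file_system_client("raw")                 # hypothetical container
directory = filesystem.get_directory_client("sales/orders/2024/01")
directory.create_directory()                                       # real folders, thanks to the hierarchical namespace

file_client = directory.get_file_client("orders_extract.parquet")
with open("orders_extract.parquet", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```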
2. How will you split the files and how much data will each file contain?
Once you have decided which storage solution to use, the next important decision is how you will structure the data in your data lake. In other words, you need to plan which folders to store the data in, how to divide those folders, and how to name the folders and individual files.
It is crucial that you plan these aspects carefully, as they will ultimately determine how easily you can navigate through the data stored in the data lake.
The next step is to plan how to split the files and how much data each file will contain. This usually involves considering the amount of data you already have and how quickly your data is growing. Using this information, you can decide how to split the data into files.
For example, with dimension data, you would typically use one file for all the data in a dimension table, regardless of how much data the table stores. With transactional data, on the other hand, you can choose to store a day, a month, a year, or a longer or shorter period per file, depending on your specific needs and requirements.
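The sketch below shows one possible way to encode such a layout, with a single file per dimension table and transactional data partitioned by year and month. The folder and table names are illustrative assumptions, not a prescribed standard.

```python
# One possible lake layout: one file per dimension table, transactional
# (fact) data partitioned by year and month. Names are illustrative only.
from datetime import date

def dimension_path(table: str) -> str:
    # Dimension tables are usually small enough to live in a single file.
    return f"raw/sales/dimensions/{table}/{table}.parquet"

def transaction_path(table: str, day: date) -> str:
    # Fact data is split by period so each load only touches a few files.
    return (
        f"raw/sales/transactions/{table}/"
        f"year={day.year}/month={day.month:02d}/{table}_{day:%Y_%m}.parquet"
    )

print(dimension_path("customer"))
# raw/sales/dimensions/customer/customer.parquet
print(transaction_path("orders", date(2024, 1, 15)))
# raw/sales/transactions/orders/year=2024/month=01/orders_2024_01.parquet
```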
3. What format do you want to use to extract the data?
The next decision you will be faced with is what format you want to use to extract the data into. While this may sound like a simple choice, it is crucial to get it right as the file format has a significant impact on the final data size.
Typically, you have a choice between the following file formats:
- Avro format
- Binary format
- Delimited text format
- Excel format
- JSON format
- ORC format
- Parquet format
- XML format
You need to carefully consider what format your data is currently in and what the impact will be of saving it in one of the formats above. For example, converting text files extracted from a SQL database can significantly increase the data size with some formats, while others can reduce it considerably.
Making the right choice has the potential to not only significantly reduce the amount of storage space you need, but can also significantly reduce the time it takes to transfer your data to the cloud.
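One practical way to evaluate the trade-off is to write the same extract in two candidate formats and compare the resulting file sizes, as in this small sketch (assuming pandas and pyarrow; the file names follow the earlier hypothetical example).

```python
# Compare the size of the same extract written as delimited text and as
# Parquet before committing to a format. File names are placeholders.
import os
import pandas as pd

df = pd.read_parquet("orders_extract.parquet")   # the extract from the earlier step

df.to_csv("orders_extract.csv", index=False)
df.to_parquet("orders_extract_snappy.parquet", compression="snappy", index=False)

for path in ("orders_extract.csv", "orders_extract_snappy.parquet"):
    print(path, os.path.getsize(path), "bytes")
```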
Once the planning is done, you can extract the data and transfer it to the data lake. You have many options here, such as the Azure CLI and PowerShell. However, one of the best options is TimeXtender: specifically designed for high-performance copying of data into Azure Blob Storage, it is a fast and efficient way to transfer your data from your local storage to Azure.
There are a few things to keep in mind when copying your data to Azure Data Lake Storage Gen 2, especially when using a command-line tool such as AzCopy. First, don't run the tool on the same machine that runs your production workloads, as the resources it requires can interfere with production work.
You should also create your storage account in a region close to your on-premises storage to ensure faster transfers. Finally, AzCopy creates a temporary log file while transferring data, which allows it to restart the transfer if it is interrupted. You should therefore make sure you have enough storage space available for these log files.
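For reference, a transfer with AzCopy can be scripted roughly as follows. This is a hedged sketch: it assumes the azcopy executable is on the PATH and that the destination URL carries a SAS token (or that you have authenticated with `azcopy login` beforehand); the local path and storage account are placeholders.

```python
# A sketch of driving an AzCopy transfer from a script. Paths, account name,
# and the SAS token are placeholders.
import subprocess

source = r"D:\exports\sales"                                            # local extract folder
destination = "https://<storage-account>.dfs.core.windows.net/raw/sales?<SAS-token>"

subprocess.run(
    ["azcopy", "copy", source, destination, "--recursive"],
    check=True,   # raise if the transfer fails so it can be retried
)
```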
Using data from the Microsoft Azure data warehouse
Remember, the ultimate goal of having a modern data warehouse built on Microsoft Azure is to deliver data to, for example, a Microsoft Power BI dashboard across the enterprise. To achieve this, you need to load the files from the data lake into the data warehouse.
Here you use PolyBase to load the files into the data warehouse. It takes advantage of Azure Synapse's Massively Parallel Processing (MPP) architecture, which makes it the fastest way to load data into Azure Synapse.
Loading data into Azure Synapse is a two-step process. In the first step, you create a set of external tables for the data. These external tables are simply table definitions that point to data stored outside Azure Synapse, in our case in the data lake. It's important to note that this step does not move any data into the warehouse.
In the next step, you create staging tables and load the data into them. In this step, the data is copied into the data warehouse. Once the data has been copied into Azure Synapse, you transform it and move it into production tables suitable for semantic modeling.
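The two steps can be expressed in T-SQL against the dedicated SQL pool, as in the sketch below (run here through pyodbc). The schema, table, data source, and file format names are assumptions; the external data source and file format are presumed to have been created beforehand.

```python
# A sketch of the two-step PolyBase pattern: an external table over the lake
# files, then a CTAS into a staging table. Names and credentials are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mysynapse.sql.azuresynapse.net;"   # hypothetical dedicated SQL pool endpoint
    "DATABASE=SalesDW;UID=loader;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# Step 1: external table -- a definition pointing at the lake; no data moves yet.
cursor.execute("""
CREATE EXTERNAL TABLE ext.Orders (
    OrderID     INT,
    CustomerID  INT,
    OrderDate   DATE,
    Amount      DECIMAL(18, 2)
)
WITH (
    LOCATION    = '/sales/transactions/orders/',
    DATA_SOURCE = LakeDataSource,      -- created earlier with CREATE EXTERNAL DATA SOURCE
    FILE_FORMAT = ParquetFileFormat    -- created earlier with CREATE EXTERNAL FILE FORMAT
)
""")

# Step 2: staging table -- the CTAS is what actually copies data into the warehouse.
cursor.execute("""
CREATE TABLE stg.Orders
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.Orders
""")
```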
Then you load the data into a tabular model in Azure Analysis Services. During this step, you typically create a semantic model using SQL Server Data Tools (SSDT). You also have the option of creating the semantic model by importing it from a Power BI Desktop file.
It is important to remember here that you need to add the relationships to the semantic model so that you can join data across tables. This is simply because Azure Synapse does not support foreign keys. Once you have completed this step, you will be able to visualize your data in Power BI.
Power BI has two options for connecting to Azure Analysis Services so you can visualize your data. The first is to import your data into Power BI. The second option is to use Live Connection, where Power BI pulls data directly from Azure Analysis Services.
While the choice ultimately depends on your specific needs and requirements, it is recommended to use Live Connection because you don't need to copy data to Power BI.
When visualizing your data, there are also a few things to keep in mind. First, Azure Analysis Services is specifically designed to handle the query requirements of a Power BI dashboard. As a result, it is a best practice to query Azure Analysis Services directly from Power BI.
With the above in mind, the second thing to keep in mind is that you should avoid running queries directly against the data warehouse. Dashboard refreshes would count toward the number of concurrent queries, which can impact performance.
Expand capabilities and functions
We mentioned earlier that a modern data warehouse should be implemented in phases, and our example above perfectly illustrates what the first phase of implementation might look like. So, what does the implementation look like in later stages when we want to incorporate more features into the data warehouse?
In this example, we build on the previous example and add some important features that are essential for modern data warehouse implementations. These features include:
- Pipeline Automation with Data Factory.
- Incremental loading of data.
- Integration of data from multiple sources.
- Load and use binary data such as geospatial data, images, and other media.
In this example, the Azure architecture consists of:
- On-premises SQL Server and external data as data sources.
- Blob storage to store data before loading it into Azure Synapse.
- Azure Data Factory to orchestrate and automate the movement and transformation of data and coordinate the different stages of the extract, load, transform (ELT) process.
- Azure Analysis Services which provides data modeling capabilities.
- Power BI for data analysis.
- Azure Active Directory to authenticate users using Power BI to connect to Azure Analysis Services.
Data pipeline and incremental loading
To get your data into the data warehouse, you use Data Factory pipelines. These pipelines are logical groupings of activities that work together to accomplish a specific task. For example, a pipeline might contain a set of activities that ingest and cleanse data from a variety of systems and then start a data flow to analyze that data.
Another example might be when you use a copy activity to copy external data from, for example, a SQL database to Azure Blob Storage. This is similar to our example where we use a pipeline to load and transform data into Azure Synapse.
One of the biggest advantages of using these pipelines is that they allow you to manage your activities as a set instead of individually. You deploy and schedule the pipeline instead of deploying each activity independently.
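To give a feel for what such a pipeline looks like under the hood, the dict below sketches the JSON definition of a pipeline with a single copy activity. The dataset and activity names are hypothetical; in practice you would author this in the Data Factory UI or deploy it via ARM templates or the SDK.

```python
# A rough sketch of the JSON behind a Data Factory pipeline with one copy
# activity, written as a Python dict. Names are hypothetical.
pipeline_definition = {
    "name": "CopyOrdersToLake",
    "properties": {
        "activities": [
            {
                "name": "CopyFromSqlToLake",
                "type": "Copy",
                "inputs": [{"referenceName": "OnPremOrdersDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "LakeOrdersDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "SqlServerSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ]
    },
}
```

Deploying and scheduling this single definition is what lets you manage the copy, cleansing, and load activities as one unit.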
Unlike our first example, this architecture will also implement incremental data loading. When using an automated ELT process, it is much more efficient to load only new data, or in other words, only data that has changed, into the data warehouse compared to loading all the data.
An efficient way to identify changed data in SQL Server is to use temporal tables. These tables, also known as system-versioned tables, give you information about the data stored in the table at any point in time by automatically recording the history of all changes in a separate history table. From an ELT perspective, you can then query the historical data to determine whether to perform an incremental load.
Ultimately, Data Factory will perform an incremental load if there are any changes. Remember that after a new batch of data is loaded into the warehouse, you need to update the Analysis Services tabular model. It is also important to remember that data cleansing should be part of the ELT process to ensure good-quality data.
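Here is a hedged sketch of the pattern: enabling system versioning on a source table and then pulling only the rows that changed after the last recorded watermark. Table, column, and history-table names are assumptions, and the watermark bookkeeping (storing the last load time somewhere durable) is left out for brevity.

```python
# A sketch of incremental loading with a system-versioned (temporal) source
# table. Object names are placeholders; watermark storage is omitted.
from datetime import datetime
import pyodbc
import pandas as pd

# One-time setup on the source table (T-SQL): add the period columns and
# switch system versioning on so every change is recorded in a history table.
ENABLE_VERSIONING = """
ALTER TABLE dbo.Orders ADD
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START HIDDEN DEFAULT SYSUTCDATETIME(),
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END   HIDDEN DEFAULT CONVERT(DATETIME2, '9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo);
ALTER TABLE dbo.Orders SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Orders_History));
"""

def load_changes_since(conn: pyodbc.Connection, watermark: datetime) -> pd.DataFrame:
    # Rows inserted or updated after the watermark; deleted rows can be found
    # in dbo.Orders_History if they need to be propagated as well.
    query = "SELECT *, ValidFrom FROM dbo.Orders WHERE ValidFrom > ?"
    return pd.read_sql(query, conn, params=[watermark])
```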
Multiple data sources
Unlike our first example, we will now incorporate more data sources. Here, Data Factory orchestrates and coordinates the extraction of data from the external sources into Blob storage or Azure Data Lake Storage Gen 2, using a copy activity to move data from both on-premises and cloud data sources. As before, you can copy data to the data lake in any of the file formats mentioned earlier.
From here, Data Factory can copy the data directly into Azure Synapse using the Blob storage connector. However, it is important to remember that the Blob storage connector only supports certain authentication types, such as account key authentication, shared access signature authentication, and system-assigned managed identity authentication, among others.
Because the connector requires a connection string or shared access signature, it cannot be used to copy a blob with public read access. For a blob with public read access, use PolyBase to create an external table over the Blob storage and copy the external table into Azure Synapse.
Binary data
When working with binary data, Data Factory's Copy Activity can also copy from the data source to the data lake and on to Azure Synapse. However, it is important to note that when using the Copy Activity, you can only copy from binary data to binary data.
You should also remember that PolyBase, used as described above, only supports column sizes of up to 8,000 bytes. In that case, you need to break the data into smaller chunks during the copy and then reassemble the chunks after the copy is complete.
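The chunk-and-reassemble workaround itself is straightforward; the sketch below illustrates the idea with a hypothetical image file and an 8,000-byte chunk size.

```python
# Split a binary value into pieces small enough for the 8,000-byte column
# limit, keep their order, and concatenate them again downstream.
CHUNK_SIZE = 8000

def split_binary(blob: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]

def reassemble(chunks: list[bytes]) -> bytes:
    return b"".join(chunks)

image = open("site_photo.jpg", "rb").read()   # hypothetical image / geospatial payload
pieces = split_binary(image)                  # load as ordered rows (blob_id, seq, chunk)
assert reassemble(pieces) == image            # the pieces concatenate back losslessly
```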
Final thoughts
It's no secret that in today's competitive marketplace, flexibility is critical to success. It not only allows organizations to adapt to changing market conditions and take advantage of business opportunities as they arise, but it also makes them more efficient.
So, when you want to give your organization the agility and flexibility required by today’s competitive marketplace, it’s essential that you implement or modernize your data warehouse. It allows you to combine data from all your business areas into a single source of truth, giving you increased scalability, flexibility, and automation capabilities.
Hopefully this post helped illustrate how you can implement a modern data warehouse. To learn more about modern data warehouses or how our automated, low-code, drag-and-drop Data Estate Builder can help you implement and run a modern data warehouse without writing any code, visit our website for more information.
Want to know how to create a modern data warehouse in Microsoft Azure up to 10 times faster than standard methods?
