Every row of your data is an insight waiting to be found. That is why it is critical you can get every row loaded into your data warehouse. When the data is clean, loading data into Azure SQL Data Warehouse is easy using PolyBase. It is elastic, globally available, and leverages Massively Parallel Processing (MPP). In reality, clean data is a luxury that is not always available. In those cases you need to know which rows failed to load and why.
In Azure SQL Data Warehouse the Create External Table definition has been extended to include a Rejected_Row_Location parameter. This value represents the location in the External Data Source where the Error File(s) and Rejected Row(s) will be written.
CREATE EXTERNAL TABLE [dbo].[Reject_Example]
(
    [Col_one]   TINYINT      NULL,
    [Col_two]   VARCHAR(100) NULL,
    [Col_three] NUMERIC(2,2) NULL
)
WITH
(
    DATA_SOURCE = EDS_Reject_Row,
    LOCATION = 'Read_Directory',
    FILE_FORMAT = CSV,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 100,
    REJECTED_ROW_LOCATION = 'Reject_Directory'
);

What happens when data is loaded?
When a user runs a Create Table as Select (CTAS) on the table above, PolyBase creates a directory on the External Data Source at the Rejected_Row_Location if one doesn't exist. A child directory is created with the name "_rejectedrows". The "_" prefix ensures the directory is escaped from other data processing unless it is explicitly named in the location parameter.
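The flow above can be sketched with a hypothetical CTAS load (the target table name and distribution choice here are illustrative, not from the original example):

```sql
-- Load the external table into an internal table. Any rows that fail
-- type conversion (up to the REJECT_VALUE threshold of 100) are written
-- as files under the Rejected_Row_Location's "_rejectedrows" directory
-- on the external data source, along with the matching error files.
CREATE TABLE [dbo].[Clean_Example]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT [Col_one], [Col_two], [Col_three]
FROM [dbo].[Reject_Example];
```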
Today at Informatica World, Scott Guthrie, EVP, Cloud + AI, along with Anil Chakravarthy, CEO of Informatica, announced the availability of Informatica Intelligent Cloud Services (IICS) for Azure. Microsoft has partnered with Informatica, a leader in Enterprise Data Management, to help our customers accelerate data warehouse modernization. This service is available as a free preview on Azure today.
Informatica provides a discovery-driven approach to data warehouse migration. This approach simplifies the process of identifying and moving data into Azure SQL Data Warehouse (SQL DW), Microsoft's petabyte-scale, fully managed, globally available analytics platform. With the recently released SQL DW Compute Optimized Gen2 tier, you can enjoy 5x the performance, 4x the concurrency, and 5x the scale of the previous generation.
With this release, Informatica Intelligent Cloud Services for Azure can be launched directly from the Azure Portal. You can enjoy a single sign-on experience and don’t have to create a separate Informatica account. With Informatica Data Accelerator for Azure, you can discover and load data into SQL DW. Informatica’s discovery-driven approach allows you to work with thousands of tables and columns.
“We are very excited about this next step in our long-standing partnership with Microsoft”, said Pratik Parekh, VP, Product Management, Informatica.
Gaining insights rapidly from data is critical to competitiveness in today’s business world. Azure SQL Data Warehouse (SQL DW), Microsoft’s fully managed analytics platform, leverages Massively Parallel Processing (MPP) to run complex interactive SQL queries at every level of scale.
An enterprise data warehouse doesn't exist in isolation but is part of a complete analytics solution that includes the 'pipelines' that extract, transform, and load data, often from many disparate sources, as well as staging and reporting layers. Tasks such as continuous integration and continuous delivery require extensive manual coding and don't offer the flexibility needed for rapidly changing business needs. This is where data warehouse automation tools provide value for customers.
Data warehouse automation (DWA) tools are meta-data driven, code generation tools that streamline developing and managing a data warehouse solution. DWA tools provide more than just ETL automation, they automate the complete life cycle of a data warehouse solution, from analysis, design, and implementation to documentation, monitoring and maintenance.
Benefits of data warehouse automation
Data warehouse automation (DWA) tools bring the benefits of meta-data driven automation and code generation to streamline developing and managing a data warehouse solution. DWA tools provide more than just ETL automation; they automate the complete life cycle of a data warehouse solution.
SQL Operations Studio can be leveraged with Azure SQL Data Warehouse (SQL DW) to create rich, customizable dashboard widgets surfacing insights into your data warehouse. This unlocks key scenarios around managing and tuning your data warehouse to ensure it is optimized for consistent performance. Previously, developers had to manually and continuously execute complex DMV queries to extract insights from their data warehouse. This led to a repetitious process when following development and tuning best practices with SQL DW. Now with SQL Operations Studio, customized insight widgets can be embedded directly within the query tool, enabling you to seamlessly monitor and troubleshoot issues with your data warehouse.
The following widgets can be generated by using the provided T-SQL monitoring scripts within SQL Operations Studio for common data warehouse insights.
Detect data skew across distributions to help identify and troubleshoot query performance issues:
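As a sketch of the kind of DMV check such a widget wraps, row counts per distribution can be compared with DBCC PDW_SHOWSPACEUSED (the table name here is hypothetical):

```sql
-- Returns one row per distribution for the named table; large
-- differences in the ROWS column across distributions indicate
-- data skew that can slow down queries touching this table.
DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
```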
Assess columnstore health and statistics:
Identify and understand workload patterns through active sessions queries, queued queries, loads, and backups:
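One simple way to see this pattern, assuming a query along these lines behind the widget, is to group requests by their state:

```sql
-- Count requests by state to see how many queries are actively
-- running versus queued behind the available concurrency slots.
SELECT [status], COUNT(*) AS request_count
FROM sys.dm_pdw_exec_requests
GROUP BY [status]
ORDER BY request_count DESC;
```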
Ensure adequate resources are allocated such as memory and TempDB:
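As one sketch of a resource check, per-node memory consumption can be read from the OS performance counters exposed through the DMVs (the specific counter shown is one of several that are relevant):

```sql
-- Current SQL Server memory usage on each node, from the
-- performance counters surfaced through the DMV layer.
SELECT pdw_node_id, counter_name, cntr_value
FROM sys.dm_pdw_nodes_os_performance_counters
WHERE counter_name = 'Total Server Memory (KB)';
```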
We are pleased to announce that Azure SQL Data Warehouse (Azure SQL DW) now supports automatic creation of column level statistics. Azure SQL DW is a fast, flexible, and secure analytics platform for the enterprise.
Modern systems such as Azure SQL DW rely on cost-based optimizers to generate quality execution plans for user queries. Even though Azure SQL DW implements a cost-based optimizer, the system has relied on developers and administrators to create statistics objects manually. When all queries are known in advance, determining what statistics objects need to be created is an achievable task. However, when the system is faced with ad-hoc queries, which are typical of data warehousing workloads, system administrators may struggle to predict what statistics need to be created, leading to potentially suboptimal query execution plans and longer query response times. One way to mitigate this problem is to create statistics objects on all table columns in advance. However, that approach comes with a penalty, as statistics objects must be maintained during the table loading process, causing longer load times.
Azure SQL DW now supports automatic creation of statistics objects, providing greater flexibility, productivity, and ease of use for system administrators and developers.
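Automatic statistics creation is controlled through the database-level AUTO_CREATE_STATISTICS option; a minimal sketch, with a placeholder database name:

```sql
-- Enable automatic creation of single-column statistics for the
-- data warehouse (run from a connection to the server).
ALTER DATABASE [MySqlDw] SET AUTO_CREATE_STATISTICS ON;

-- Verify the setting.
SELECT name, is_auto_create_stats_on
FROM sys.databases;
```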
Continuing our series of tutorials on SaaS application patterns with SQL Database, we are delighted to announce an additional cross tenant analytics tutorial. This new tutorial shows how to extract and load tenant data into Azure SQL Data Warehouse (SQL DW) using Azure Data Factory (ADF) and then analyze it in Power BI.
In this tutorial, ADF is used to orchestrate data movement from tenant databases into a SQL Data Warehouse. Parameterized ADF V2 (preview) pipelines are defined to iterate across tenant databases, loading data from multiple databases in parallel. To accelerate loading, ADF stages extracted data in blob files and then uses PolyBase to load into SQL DW. Staging the data and enabling PolyBase are simple check-box operations in ADF.
The tutorial uses an Extract, Load and Transform (ELT) pattern – once data is loaded into staging tables in SQL DW, ADF invokes a stored procedure to upsert the data into star-schema tables, ready for query. Power BI is then used to visualize the data and extract insights that in the tutorial scenario can help the ISV improve their ticket selling application and business.
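The upsert step can be sketched as a stored procedure along these lines (the table and column names here are hypothetical, not taken from the tutorial):

```sql
CREATE PROCEDURE [dbo].[sp_UpsertDimVenue]
AS
BEGIN
    -- Update dimension rows that already exist, from the staging table.
    UPDATE d
    SET    d.VenueName = s.VenueName
    FROM   dbo.dim_Venue AS d
    JOIN   stage.Venue   AS s ON d.VenueId = s.VenueId;

    -- Insert rows that are new to the dimension.
    INSERT INTO dbo.dim_Venue (VenueId, VenueName)
    SELECT s.VenueId, s.VenueName
    FROM   stage.Venue AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.dim_Venue AS d
                       WHERE d.VenueId = s.VenueId);
END;
```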
To get started, check out the analytics tutorial, which provides step-by-step instructions.
Today we made Azure SQL Data Warehouse (SQL DW) Compute Optimized Gen2 Tier generally available to our customers. Even though data and data sources grow exponentially, organizations continue to demand faster and faster insights. Azure SQL DW Compute Optimized Gen2 tier delivers on this need with major performance improvements made possible through adaptive caching.
Analytics workload performance is typically determined by two major factors, I/O bandwidth to storage and repartitioning speed, also known as shuffle speed. This blog post looks under the hood of how Azure SQL DW exploits the latest hardware trends to improve effective I/O bandwidth available.
One of the recent hardware innovations becoming widely available is NVM Express (NVMe) solid-state drive (SSD) devices. NVMe SSDs offer significantly more I/O bandwidth than SATA SSDs or hard drives. A typical single NVMe device used in Azure generally offers up to 2 GB/sec of local I/O bandwidth, with multiple devices available per physical host, resulting in bandwidth previously reserved for very high-end storage systems. Azure SQL DW Compute Optimized Gen2 tier takes full advantage of NVMe devices through adaptive caching of recently used data on NVMe. With this breakthrough, on customer workloads we have observed up to five times faster query performance.
Data is transformative. The ability to turn data into breakthrough insights is foundational to remain relevant in an increasingly competitive market.
To help our customers deliver fast insights from exponentially growing data, today we are announcing the general availability of the Compute Optimized Gen2 tier of Azure SQL Data Warehouse. With this performance optimized tier, we are bringing the best of Microsoft software and hardware innovations to dramatically accelerate query performance and concurrency for our customers.
Fast, flexible, and secure cloud data warehouse
We launched Azure SQL Data Warehouse three years ago to make powerful, SQL-based, massively parallel processing (MPP) data warehousing accessible to all. It was the first data warehouse that helped customers reduce costs by enabling them to scale compute and storage independently, and by offering pause and resume capabilities. This flexibility, combined with fast query performance, comprehensive data security, and governance capabilities, has led to adoption by thousands of customers like Adobe, Toshiba, and LG Electronics. To keep up with customer demand, we have expanded the service to 33 Azure regions, making it the most globally available of all cloud data warehouse services.
Azure SQL Data Warehouse Compute Optimized Gen2 tier
Today, we announced general availability of Azure SQL Data Warehouse (SQL DW) Compute Optimized Gen2 tier, the new generation of Azure SQL DW. Azure SQL DW is a fast, flexible, and secure cloud data warehouse tuned for running complex queries fast and across petabytes of data.
We see two key trends that drive data warehousing decisions: the amount of data continues to grow exponentially, and the need to deliver insights from all this data is ever more urgent. Azure SQL DW Compute Optimized Gen2 tier is designed to help customers accomplish just this by delivering dramatic query performance improvements. In addition, SQL DW now supports up to 128 concurrent queries while being able to provision five times more computing power compared to the previous product generation.
“After upgrading to the Gen2 of SQL Data Warehouse, our data warehouse workload has seen an average of 5.4 times performance improvement. This enhancement to the service is phenomenal and helps us deliver key customer insights for our business” said Brent Niezgocki, Senior Software Engineer for the Azure Active Directory analytics team at Microsoft.
Fast query performance through adaptive caching
As organizations look to accelerate time to insight, performance in the domain of
We are excited to announce that Replicated Tables, a new type of table distribution, are now generally available in Azure SQL Data Warehouse (SQL DW). SQL DW is a fully managed, flexible, and secure cloud data warehouse tuned for running complex queries fast and across petabytes of data.
The key to performance for large-scale data warehouses is how data is distributed across the system. When queries join across tables and the data is distributed differently, data movement is required to complete the query. The same can be said when transforming data to load, enrich, and apply business rules. With Replicated Tables, the data is available on all compute nodes, so data movement is eliminated and queries run faster. In some cases, such as small dimension tables, choosing a Replicated Table over a Round Robin table can increase performance because data movement is reduced. As with all optimization techniques, performance gains may vary and should be tested.
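A table is made replicated at creation time through its distribution option; a minimal sketch, using a hypothetical small dimension table:

```sql
-- A full copy of this table is maintained on every compute node,
-- so joins against it require no data movement.
CREATE TABLE [dbo].[DimGeography]
(
    GeographyId INT           NOT NULL,
    City        NVARCHAR(100) NULL,
    CountryCode CHAR(2)       NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```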
Reducing data movement to boost performance
During the public preview of Replicated Tables, SQL Data Warehouse customers saw up to 5x performance gains when transforming data with Replicated Tables compared to using Round Robin distribution.
Taking a look at an example of