Azure Data Lake Analytics and U-SQL Spring 2018 Updates: Parquet support, small files, dynamic output, fast file sets, and much more!

11 Jun 2018

Source: https://blogs.msdn.microsoft.com/azuredatalake/2018/06/11/azure-data-lake-analytics-and-u-sql-spring-2018-updates-parquet-support-small-files-dynamic-output-fast-file-sets-and-much-more/

 

Hello Azure Data Lake and U-SQL fans and followers.

It is high time to publish the release notes for all the cool features we released over the winter, along with the list of pending deprecations and breaking changes. There was so much cool new stuff that writing the release notes took me several weeks (on top of my day job!), so the next release will probably already be out by the time you read this! I promise that the June release notes will come sooner! :)

Without further ado, here are the Spring 2018 Updates for Azure Data Lake U-SQL and Developer Tooling!

Supporting data formats of your choice at high scale

The top items include expanding our built-in support for standard file formats with native Parquet support for extractors and outputters (in public preview) and ORC (in private preview)!

In addition, now that the fast file set feature is generally available, you can consume hundreds of thousands of files in bulk in a single EXTRACT statement. We will publish a blog post at a later date with much more detail on how this capability helps you process that many files efficiently and scalably.
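
As a rough illustration of what a file set looks like in a script, here is a minimal sketch. The folder layout, column names, and output path are hypothetical and chosen only for this example; the pattern syntax and the built-in CSV extractor and outputter are standard U-SQL.

```usql
// Hypothetical layout: /input/2018/06/11/<name>.csv, one folder per day.
@rows =
    EXTRACT vendor int,
            amount decimal,
            date DateTime,      // virtual column, filled from the {date:...} path parts
            filename string     // virtual column, filled from the {filename} path part
    FROM "/input/{date:yyyy}/{date:MM}/{date:dd}/{filename}.csv"
    USING Extractors.Csv();

// Filtering on a virtual column should let the compiler skip files outside the range.
@june =
    SELECT vendor,
           SUM(amount) AS total_amount,
           COUNT(DISTINCT filename) AS file_count
    FROM @rows
    WHERE date >= DateTime.Parse("2018-06-01") AND date < DateTime.Parse("2018-07-01")
    GROUP BY vendor;

OUTPUT @june
TO "/output/june_totals.csv"
USING Outputters.Csv(outputHeader : true);
```

A single EXTRACT statement covers every file that matches the pattern, so the script stays the same whether the folders hold ten files or many thousands.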

Important aspects of processing files at scale include:

  1. the ability to generate many files from a rowset in a single statement, providing a way to dynamically partition the data for future use with Hadoop or Spark, or to provide individual files for customers. This has been our top customer ask on the ADL Feedback forum, and it is now in private preview!
  2. the ability to handle many small files. We recommend that you make your files large enough for the processing to be efficient (300 MB to 4 GB is a good range), but often your file formats (e.g., images) or data ingestion pipelines (e.g., EventHub archives) cannot reach that size. Thus, we are adding the ability to group several files into a single vertex to increase the efficiency and lower the cost of your job (we have seen 10- to 30-fold improvements in some customer jobs!).

You can find a great end-to-end example using several of these important capabilities together in our new Azure blog post announcing the new release and the accompanying detailed walk-through blog post.

Other cool stuff

There is so much to discover that I encourage you all to read through the release notes and use the samples, which we have also published as a Visual Studio solution on our U-SQL GitHub site.

Here are a few additional highlights:

  1. Use the new AU modeler to optimize your job's cost/performance trade-off!
  2. Extract from files that use the Windows code pages!
  3. Augment your script with job information through @@JobInfo!
  4. Light-weight, self-contained script development with in-script C# named lambdas and script-scoped U-SQL objects!

Thanks to all of you who continue to volunteer to test the preview features and provide us with valuable feedback. We are looking forward to seeing you use all the cool new stuff.

Make sure that you update your scripts that are affected by the future deprecations and breaking changes!

Please contact us or leave a comment below if you have feedback on this or other features.

Here is the list of topics with links to the detailed release notes: