Category Archives: Big Data

23 Jul

Build secure Oozie workflows in Azure HDInsight with Enterprise Security Package

Customers love to use Hadoop and often rely on Oozie, a workflow and coordination scheduler for Hadoop, to accelerate and ease their big data implementations. Oozie is integrated with the Hadoop stack and supports several types of Hadoop jobs. However, for users of Azure HDInsight with domain-joined clusters, Oozie was not a supported option. To get around this limitation, customers had to run Oozie on a regular cluster, which was costly and added administrative overhead. Today we are happy to announce that customers can now use Oozie in domain-joined Hadoop clusters too.

In domain-joined clusters, authentication happens through Kerberos, and fine-grained authorization is enforced through Ranger policies. Oozie supports impersonation of users and a basic authorization model for workflow jobs.

Moreover, Hive Server 2 actions submitted as part of an Oozie workflow are logged and auditable through Ranger as well. Fine-grained authorization through Ranger is enforced on Oozie jobs only when Ranger policies are present; otherwise, coarse-grained authorization based on HDFS permissions (only available on ADLS Gen1) is enforced.

Learn more about how to create an Oozie workflow and submit jobs in a domain-joined cluster, and how to use Oozie with Hadoop to define and run a workflow.
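
As a concrete illustration, a workflow can be submitted to Oozie through its REST API; in a domain-joined (ESP) cluster the request must carry Kerberos credentials rather than a plain user name. The sketch below is a minimal example, not the full tutorial flow: it assumes the `requests` and `requests-kerberos` packages, and the head node name, user, and workflow path are placeholders.

```python
# Minimal sketch: submit an Oozie workflow over the REST API with Kerberos auth.
# Assumes `pip install requests requests-kerberos`; host, user, and paths are placeholders.
import requests
from requests_kerberos import HTTPKerberosAuth

OOZIE_URL = "http://headnode0:11000/oozie"  # hypothetical head node

# Oozie expects the job configuration as an XML properties document.
job_conf = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property><name>user.name</name><value>alice</value></property>
  <property><name>oozie.wf.application.path</name>
    <value>/user/alice/workflows/hive-demo</value></property>
</configuration>"""

resp = requests.post(
    f"{OOZIE_URL}/v1/jobs?action=start",
    data=job_conf,
    headers={"Content-Type": "application/xml;charset=UTF-8"},
    auth=HTTPKerberosAuth(),  # uses the caller's Kerberos ticket (run kinit first)
)
resp.raise_for_status()
print(resp.json()["id"])  # Oozie returns the new workflow job ID
```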


17 Jul

Intelligent Healthcare with Azure Bring Your Own Key (BYOK) technology

Sensitive health data processed by hospitals and insurers is under constant attack from malicious actors who try to gain access to healthcare systems in order to steal or extort personal health information. Change Healthcare has implemented a Bring Your Own Key (BYOK) solution based on Microsoft Azure cloud services and is introducing Intelligent Healthcare today.

Change Healthcare is enabling payers and providers to have immediate and granular control over their data by transferring to them the ownership of the encryption keys used to encrypt data at rest. This allows Change Healthcare customers to make security changes without involving Change Healthcare personnel, and to have their cloud-based systems re-encrypted and operational without service interruptions. The BYOK management capabilities include revoking access to encryption keys, and rotating or deleting encryption keys on demand and at the time of a potential compromise.
 
For the Intelligent Healthcare solution, Change Healthcare implemented Azure SQL Database Transparent Data Encryption (TDE) with BYOK support. TDE with BYOK encrypts databases, log files, and backups when written to disk, which protects data at rest from unauthorized access. TDE with BYOK support integrates with Azure Key Vault, which provides highly available and scalable secure storage for RSA cryptographic keys backed by hardware security modules (HSMs).
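
To illustrate the mechanism (this is a conceptual sketch, not Change Healthcare's implementation), TDE with BYOK follows an envelope-encryption pattern: the symmetric database encryption key (DEK) that encrypts the data is itself wrapped by a customer-held RSA key, the TDE protector, stored in Key Vault. Rotating or revoking the protector therefore never requires re-encrypting the data itself. A minimal sketch using the `cryptography` package:

```python
# Envelope-encryption sketch of the TDE-with-BYOK pattern (illustrative only):
# data is encrypted with a symmetric DEK; the DEK is wrapped by a customer RSA key.
import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Customer-held key; in production this lives in Azure Key Vault, not in app memory.
customer_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

dek = AESGCM.generate_key(bit_length=256)  # database encryption key
nonce = os.urandom(12)
ciphertext = AESGCM(dek).encrypt(nonce, b"patient record", None)

oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_dek = customer_key.public_key().encrypt(dek, oaep)  # stored with the data

# Key rotation: unwrap with the old key, rewrap with the new one; data is untouched.
new_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
rewrapped = new_key.public_key().encrypt(customer_key.decrypt(wrapped_dek, oaep), oaep)

# Revocation: without the customer's private key, the wrapped DEK (and therefore
# the encrypted data) is unreadable.
```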


17 Jul

Blockchain as a tool for anti-fraud

Healthcare costs are skyrocketing. In 2016, healthcare costs in the US were estimated at nearly 18 percent of GDP! Healthcare is becoming less affordable worldwide, and a serious chasm is widening between those who can afford healthcare and those who cannot. Many factors drive the high cost of healthcare; one of them is fraud. In healthcare there are several types of fraud, including prescription fraud, medical identity fraud, financial fraud, and occupational fraud. The National Health Care Anti-Fraud Association conservatively estimates that healthcare fraud costs the US about $68 billion annually, roughly three percent of the $2.26 trillion the US spends on healthcare overall. There are two root vulnerabilities in healthcare organizations: insufficient protection of data integrity, and a lack of transparency.

Insufficient protection of data integrity enables fraudulent modification of records

Cybersecurity involves safeguarding the confidentiality, availability, and integrity of data. Cybersecurity is often mistakenly equated with protecting just the confidentiality of data to prevent unauthorized access. However, protecting the availability of data is equally important: you must secure timely and reliable access to data, as well as the integrity of the data. You must ensure records are accurate, complete, and protected from unauthorized modification.
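
As a toy sketch of the tamper-evidence property that makes blockchain attractive here (illustrative, not any production ledger), each entry can embed a hash of the previous entry, so modifying any historical record breaks the chain and is immediately detectable:

```python
# Toy hash chain: each entry commits to the previous one, making edits detectable.
import hashlib, json

def entry_hash(entry: dict) -> str:
    # Serialize deterministically so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def append(ledger: list, record: dict) -> None:
    prev = ledger[-1]["hash"] if ledger else "0" * 64
    entry = {"record": record, "prev": prev}
    entry["hash"] = entry_hash(entry)  # hash covers the record and the prev link
    ledger.append(entry)

def verify(ledger: list) -> bool:
    prev = "0" * 64
    for e in ledger:
        expected = entry_hash({"record": e["record"], "prev": e["prev"]})
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True

ledger = []
append(ledger, {"claim": 101, "amount": 250.00})
append(ledger, {"claim": 102, "amount": 90.00})
assert verify(ledger)
ledger[0]["record"]["amount"] = 9250.00  # fraudulent edit...
assert not verify(ledger)                # ...is detected
```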


12 Jul

Welcome our newest family member – Data Box Disk

Last year at Ignite, I talked to you about the preview of Azure Data Box, a ruggedized, portable, and simple way to move large datasets into Azure. So far, the response has been phenomenal. Customers have used Data Box to move petabytes of data into Azure.

While our customers and partners love Data Box, they told us they also wanted a lower-capacity, even easier-to-use option. They cited examples such as moving data from remote office/branch office (ROBO) locations, which have smaller data sets and minimal on-site tech support. They said they needed an option for recurring, incremental transfers for ongoing backups and archives. And they said it needed to have the same traits as Data Box: fast, simple, and secure.

Got it. We hear the message loud and clear. So, I’m here today with our partners at Inspire 2018 to announce a new addition to the Data Box family: Azure Data Box Disk.

How it works

Data Box Disk leverages the same infrastructure and management experience as Azure Data Box. You can receive up to five 8 TB disks, totaling 40 TB per order. Data Box Disk is fast, utilizing SSD technology, and is shipped overnight, so you can get started quickly.


12 Jul

Lightning fast query performance with Azure SQL Data Warehouse

Azure SQL Data Warehouse is a fast, flexible, and secure analytics platform for enterprises of all sizes. Today we announced significant query performance improvements for Azure SQL Data Warehouse (SQL DW) customers, enabled through enhancements in the distributed query execution layer.

Analytics workload performance is determined by two major factors: I/O bandwidth to storage, and repartitioning speed, also known as shuffle speed. In a previous blog post, we described how SQL DW caches relevant data to take advantage of NVMe-based local storage. In this blog post, we go under the hood of SQL DW to see how shuffle speed has improved.

Data movement is an operation in which parts of the distributed tables are moved to different nodes during query execution. This operation is required when the data is not available on the target node, most commonly when the tables do not share the same distribution key. The most common data movement operation is shuffle. During a shuffle, for each input row, SQL DW computes a hash value using the join columns and then sends that row to the node that owns that hash value. Either one or both sides of the join can participate in the shuffle.
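
The sketch below is an illustrative model of that routing step (not SQL DW internals): rows are hash-partitioned on the join column, so matching keys from both tables always land on the same node and the join can then proceed locally.

```python
# Illustrative shuffle: route each row to the node that owns the hash of its join key.
import hashlib

NUM_NODES = 4  # placeholder node count

def owner_node(join_key: str, num_nodes: int = NUM_NODES) -> int:
    # A stable hash of the join column decides which node receives the row.
    digest = hashlib.md5(join_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

rows = [("cust_17", 250.0), ("cust_42", 90.0), ("cust_17", 12.5)]
partitions = {n: [] for n in range(NUM_NODES)}
for key, amount in rows:
    partitions[owner_node(key)].append((key, amount))

# Rows with the same join key always land on the same node, so after the
# shuffle each node can join its local partition independently.
assert len({owner_node(k) for k, _ in rows if k == "cust_17"}) == 1
```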


11 Jul

Kafka 1.0 on HDInsight lights up real time analytics scenarios

Data engineers love Kafka on HDInsight as a high-throughput, low-latency ingestion platform in their real-time data pipelines. They already leverage Kafka features such as message compression, configurable retention policies, and log compaction. With the release of Apache Kafka 1.0 on HDInsight, customers now get key features that make it easy to implement the most demanding scenarios. Here is a quick introduction:

Idempotent producers so that you don’t have to deduplicate

Consider a cellular billing system in which the producer writes the amount of data consumed by users to a Kafka topic called data-consumption-events. If the broker or the connection fails, the producer will not receive an acknowledgment of a message write and will retry that message. This leads to duplicate writes to the system, causing users to be overbilled.

In critical scenarios like the one above, data engineers had to write and maintain custom deduplication logic, such as hashing and saving message IDs. However, with idempotent producers turned on, Kafka handles that logic for you. Records include unique producer IDs and the sequence number of the message, and Kafka brokers will only accept a message from a producer if its sequence number is exactly one more than the last committed sequence number.
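
For illustration, enabling idempotence is a single producer setting. The sketch below uses the `confluent-kafka` Python client; the broker address is a placeholder, and the topic matches the billing example above:

```python
# Minimal sketch: an idempotent Kafka producer using the confluent-kafka client.
# Broker address is a placeholder; the topic matches the billing example above.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",
    "enable.idempotence": True,  # broker de-duplicates retried writes
    "acks": "all",               # required by (and implied with) idempotence
})

def on_delivery(err, msg):
    # With idempotence on, a retried message is written exactly once per partition.
    print("failed:" if err else "delivered to:", err or msg.topic())

producer.produce("data-consumption-events", key="user-42",
                 value="bytes_used=1048576", callback=on_delivery)
producer.flush()
```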


10 Jul

Azure HDInsight now supports Apache Spark 2.3

Apache Spark 2.3.0 is now available for production use on the managed big data service Azure HDInsight. Ranging from bug fixes (more than 1400 tickets were fixed in this release) to new experimental features, Apache Spark 2.3.0 brings advancements and polish to all areas of its unified data platform.

Data engineers relying on Python UDFs get a 10x to 100x speedup, thanks to revamped object serialization between the Spark runtime and Python. Data scientists will be delighted by better integration of deep learning frameworks like TensorFlow with Spark machine learning pipelines. Business analysts will welcome the fast vectorized reader for the ORC file format, which finally makes interactive analytics in Spark practical over this popular columnar data format. Developers building real-time applications may be interested in experimenting with the new Continuous Processing mode in Spark Structured Streaming, which brings event-processing latency down to the millisecond level.

Vectorized object serialization in Python UDFs

It is worth mentioning that PySpark is already fast and takes advantage of vectorized data processing in the core Spark engine as long as you are using DataFrame APIs. This is good news, as it covers the majority of use cases if you follow best practices for PySpark.
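
Beyond faster serialization for row-at-a-time UDFs, Spark 2.3 also introduced vectorized (Pandas) UDFs that operate on Arrow-backed batches. A minimal sketch, with a made-up column name and data:

```python
# Minimal sketch of a Spark 2.3 vectorized (Pandas) UDF; column and values are
# illustrative. Requires PyArrow to be installed alongside PySpark.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v: pd.Series) -> pd.Series:
    # Receives a whole pandas Series per Arrow batch instead of one row at a time.
    return v + 1

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])
df.select(plus_one("x").alias("x_plus_one")).show()
```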
