Azure Data Lake & Databricks

 

BACKGROUND

As Big Data operations handling huge amounts of data become more and more frequent, we need different ways of storing data: better adapted to unstructured information, able to read and write very fast, and fully scalable, accessible and secure.

DESCRIPTION

What is a Data Lake?

A Data Lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture: each data element is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question. Because data is kept in its native format, a data lake handles the three Vs of big data (Volume, Velocity and Variety) while providing tools for analysis, querying and processing. It removes the usual restrictions of a typical data warehouse system by offering unlimited space, unrestricted file sizes, schema on read, and various ways to access the data, including programming APIs, SQL-like queries and REST (HTTP-based) requests.
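As a taste of that REST access, here is a minimal sketch (assuming the requests package and an already obtained OAuth2 bearer token; the account name and directory are placeholders, not values from this post) that lists a directory in an Azure Data Lake Store account through its WebHDFS-compatible endpoint:

import requests

# Placeholders: the ADL account name and a valid OAuth2 access token for it
account = "<your ADL account name>"
token = "<an OAuth2 access token>"

# List the contents of a directory through the WebHDFS-compatible REST API
resp = requests.get(
  "https://" + account + ".azuredatalakestore.net/webhdfs/v1/<your directory>?op=LISTSTATUS",
  headers={"Authorization": "Bearer " + token})
print(resp.json())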

Azure Data Lake Analytics (ADL)


Azure offers an excellent Data Lake service, fully accessible through a specific structured query language: U-SQL. We'll dedicate a future post to U-SQL. For now it is enough to know that it evolved out of Microsoft's internal Big Data language called SCOPE, and that it combines a familiar SQL-like declarative language with the extensibility and programmability of C# types and the C# expression language, plus big data processing concepts such as schema on read, custom processors and reducers. It also provides the ability to query and combine data from a variety of data sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.
For more details about Azure Data Lake we recommend resources such as this video from Azure.

We use Azure Data Lake Analytics (ADL) mainly as a Data Sink (basically a storage medium capable of receiving data) for Big Data operations, due to its flexibility, scalability and the ability to search the stored resources using U-SQL. ADL is also especially well suited as a source for Power BI visualizations.

Connecting to Azure Data Lake from Azure Databricks

In this post we'll focus on how to connect to an ADL directory from a Databricks notebook.

Step 1: Create the ADL Account

Create the ADL account in the Azure portal, manually or using IaC (Infrastructure as Code). We'll assume you have done this already.

In this example we'll use a directory called output and a manually uploaded parquet file called test1. Original, isn't it?  We techies are full of imagination and creative ideas!

Step 2: Register an app in AAD

Register your ADL access application in Azure Active Directory (AAD). We'll call it ADLAccess in our example. Follow the instructions in Azure's documentation. You also need to generate a secret key value for the registered app, which is explained in the same link.



The important pieces of information you need to note down here (you can sanity-check them with the sketch just after this list) are:
  • Application ID
  • Secret Key Value
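A quick way to verify that the pair is valid (a sketch, assuming the adal Python package is available; every value below is a placeholder) is to ask AAD for an OAuth2 token for the Data Lake resource using exactly those credentials:

import adal

# Authority URL for your AAD tenant (placeholder tenant ID)
context = adal.AuthenticationContext("https://login.microsoftonline.com/<your AAD tenant ID>")

# Request a token for the Azure Data Lake resource with the registered app's credentials
token = context.acquire_token_with_client_credentials(
  "https://datalake.azure.net/",   # ADL resource URI
  "<application ID>",              # Application ID noted above
  "<secret key value>")            # Secret Key Value noted above

print(token["tokenType"], token["expiresIn"])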

Step 3: Grant permissions to the AAD registered app in your ADL account

Give permissions to the AAD registered app in your Data Lake account. The scope is a given directory, so go to the Data Explorer, select your directory and click on the Access menu. In this example the target directory is output.



Then, add the registered app with the right permissions.


Step 4: Make sure the Databricks cluster is compatible

Once we are in Azure Databricks we need a cluster running Databricks Runtime 4.0 or later. If the cluster is older than 4.0, our configuration settings will not work. Read more about Databricks cluster runtime versions here.


Remember: the settings to connect to ADL from our Databricks cluster will only work on runtime version 4.0 or later. Version 3.5 clusters will fail with a syntax error because the extra_configs property is not recognized.
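As a quick sanity check from the notebook itself (a sketch): Databricks Runtime 4.0 bundles Apache Spark 2.3, so the Spark session should report version 2.3.0 or higher.

# Print the Apache Spark version of the attached cluster; expect 2.3.0+ on Databricks Runtime 4.0
print(spark.version)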

Step 5: Settings

In the Databricks notebook we need to provide the settings that let the Spark session connect to the ADL account and directory.
It is very important to assign the right value to each property, and this is never well explained in the Databricks and Azure documentation. So, we think this post can be really useful!

The properties and the values they must hold are:

  • "dfs.adls.oauth2.access.token.provider.type" should always be "ClientCredential".

  • "dfs.adls.oauth2.client.id" should be the Application ID of the AAD registered app, for instance "7c91465a-3f03-4df0-87a2-ca905fb43z00". Do not write the name of the registered app here! It is a common mistake.

  • "dfs.adls.oauth2.credential" should be the value of the secret key generated for the AAD registered app, for instance "hjKNhKLBR+VGSMKHGXUERaRSGGHd5rP/SYid4FnYPhC=".

  • "dfs.adls.oauth2.refresh.url" should be the OAuth2 token endpoint URL, which contains your AAD tenant ID: "https://login.microsoftonline.com/<your AAD tenant ID>/oauth2/token".

We have included these fake values (yes, they are not valid, so do not waste your time trying them in your Azure subscription...) because it is really important that you understand which piece of information matches each property.

Now, we'll include all the values in a cool String/String Map:
configs = {
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.client.id": "7c91465a-3f03-4df0-87a2-ca905fb43z00",
  "dfs.adls.oauth2.credential": "hjKNhKLBR+VGSMKHGXUERaRSGGHd5rP/SYid4FnYPhC=",
  "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your AAD tenant ID>/oauth2/token"}

Step 6: Mount a DBFS volume

Well, now comes the tricky part. Remember this is not our own application running in its own process: we are running our code inside Databricks. So, we need to play by Databricks' rules and change our mindset.

In Databricks we have something available similar to Hadoop's HDFS: the Databricks File System (DBFS). The next points are critically important:

First, we need to create a mount point in the DBFS.
dbutils.fs.mount(
  source = "adl://<the name of the target ADL account>.azuredatalakestore.net/output",
  mount_point = "/mnt/pytest",
  extra_configs = configs
)

Remember that output is the name of the ADL account directory we want to work with.
In Scala the parameter names change to camel case:
  • mount_point becomes mountPoint
  • extra_configs becomes extraConfigs
This is expected, since Python and Scala use different naming conventions, but it is better to be aware of it.
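To confirm that the mount actually worked, a quick check (just a sketch using the standard dbutils file-system helpers) is to list the active mounts and the contents of the new mount point:

# List the active mounts; /mnt/pytest should appear with the adl:// source
for m in dbutils.fs.mounts():
  print(m.mountPoint, m.source)

# List the mounted ADL directory; it should show the test1.parquet file uploaded earlier
display(dbutils.fs.ls("/mnt/pytest"))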

Juggling with Data

Then, we can use the DBFS commands to read, write and interact with the files in the ADL, or, if we are using Spark DataFrames, we can use the DataFrame API to write to the DBFS mounted directory and therefore to the ADL directory.
We strongly recommend reading the DBFS documentation before putting this post into practice.

Well, let's try to access a parquet file we uploaded previously to the output directory in ADL:
moviesDF = spark.read.parquet("/mnt/pytest/test1.parquet")
display(moviesDF)



It worked!
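Before playing further with the data, a couple of quick sanity checks on the DataFrame never hurt (a small sketch):

# Print the schema inferred from the parquet metadata and count the rows read
moviesDF.printSchema()
print(moviesDF.count())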

Now, we'll try to query the parquet data using pure SQL...
%sql 
CREATE TEMPORARY TABLE movieTable
USING parquet
OPTIONS (
path "/mnt/pytest/test1.parquet"
)

Then...
%sql SELECT * FROM movieTable where genres = "Drama"
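As an aside, the same temporary table and query can be run from Python without the %sql magic, in case you prefer to stay in a single language (a sketch using the standard Spark API):

# Register the DataFrame as a temporary view and query it with spark.sql
moviesDF.createOrReplaceTempView("movieTable")
display(spark.sql('SELECT * FROM movieTable WHERE genres = "Drama"'))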



That was OK. Now, we'll try to build a subset of the data matching the same query, keep it as a Spark DataFrame, and save it to the DBFS (and therefore to the ADL directory):
moviesDF.filter(moviesDF.genres == "Drama").write.save("/mnt/pytest/queryDrama.parquet")

The strength of the Spark API shows here: in a single line we can do everything we needed.
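The same write can be made a bit more explicit (a sketch), fixing the output format and the save mode and reading the result back to verify it:

# Write the filtered subset as parquet, overwriting any previous output,
# then read it back from the mounted ADL directory to verify
(moviesDF
  .filter(moviesDF.genres == "Drama")
  .write
  .mode("overwrite")
  .parquet("/mnt/pytest/queryDrama.parquet"))

display(spark.read.parquet("/mnt/pytest/queryDrama.parquet").limit(5))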

Now we can check that the new parquet file is in the output directory of the ADL account:



Voila!!

We should unmount the DBFS directory now.
dbutils.fs.unmount("/mnt/pytest")
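If the notebook is likely to be re-run, a slightly more defensive version (a sketch) unmounts only when the mount point actually exists, avoiding an error on a second run:

# Unmount /mnt/pytest only if it is currently mounted
if any(m.mountPoint == "/mnt/pytest" for m in dbutils.fs.mounts()):
  dbutils.fs.unmount("/mnt/pytest")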

CONCLUSIONS

  1. Azure Data Lake (ADL) is a flexible, fast and powerful storage service for unstructured data that can easily be used in our Spark applications handling huge amounts of data, mainly as a very capable Data Sink.

  2. We have learned the right process to follow in Azure to assign the correct values to the Spark properties in order to connect successfully to an ADL account.

  3. Once we have mounted a DBFS volume onto our ADL account, we have displayed data stored in ADL files, handled it, and written new data back to the same ADL account directory in an easy and straightforward way.

  4. We have seen that the approach followed in Databricks is based on mounting DBFS directories onto the ADL account directory.

  5. Besides, there are many options for working with the data using the Spark APIs (for instance the DataFrame API), reading from and writing to DBFS as mounted onto the ADL account directory, and persisting every output we generate.

This is great for us, as we can generate processed data outputs (from ETL processes, operational reports, etc.), save them to the ADL account and then publish them as REST resources through the API in a totally non-blocking way. This is a really exciting improvement for Big Data business processes and for how the results are made accessible to the business staff.

What's Next?

In upcoming posts we'll continue with Azure Data Lake: structured queries using U-SQL, and exploiting the data and building visualizations from Azure Data Lake with Power BI.

We really hope these simple posts are helpful to you. When something is published everything seems really easy, but there is a lot of work and many lessons learned behind it. Please leave your comments and experiences.

Good Luck!