Using Azure Open Datasets with Databricks

Azure Open Datasets is now available in preview! As a result, we have easy access to curated public datasets that can accelerate our data and AI projects. This post demonstrates how to use Azure Open Datasets with Databricks by loading a curated NOAA weather dataset.

Weather is only one of the many (and growing) datasets available with Azure Open Datasets. Be sure to check the Azure Open Datasets catalog regularly to see what else is available.

Setting up Databricks

If you don’t already have an Azure Databricks workspace, then follow the steps below to add a Databricks resource to Azure. Otherwise, you can skip to creating a cluster.

Don’t already have an Azure account? No problem, you can create a free account here.

Create Azure Databricks Resource
Add Azure Databricks Resource

Creating an Azure Databricks resource is straightforward: give the workspace a name, then select your subscription, resource group, and location. The pricing tier is up to you. We will not be using any Premium features in this post; however, there is no harm in selecting the Premium pricing tier, especially if you want to load the data into Databricks Delta sometime in the future.

Azure Databricks Resource Create Details
Create Azure Databricks Resource

Once the resource has been created, which typically takes a few minutes, you can launch the workspace.

Azure Databricks Launch Workspace
Click the Launch link

Creating the Databricks Cluster

While creating the workspace is fun, there is not much we can do with data until we have a cluster, so we will create one next.

Azure Databricks Clusters navigation icon

First click on the Clusters link located on the left navigation bar.

Azure Databricks Create Cluster button

Then click on Create Cluster at the top of the Clusters page.

The Azure Open Datasets Python SDK requires Python 3.6 or later!

For your cluster to run Python 3.6 or later, choose one of the following Databricks Runtimes:

  • Runtime: 5.4 ML (does not have to be GPU) = Python 3.6
  • Runtime: 5.5 ML (does not have to be GPU) = Python 3.6.5
Azure Databricks Create Cluster Page
Cluster Details

Don’t forget to start your cluster.

Azure Databricks Cluster Status Page
Running Cluster
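
Once a notebook is attached to the running cluster, a quick check can confirm that the runtime really meets the Python 3.6 requirement. The following is only a minimal sketch: the helper name meets_minimum is made up for illustration and is not part of any SDK.

```python
import sys


def meets_minimum(version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Show the interpreter version the cluster is actually running.
print(sys.version)

# Fail early with a clear message instead of a confusing error later on.
if not meets_minimum(sys.version_info):
    raise RuntimeError("Python 3.6 or later is required; choose a different runtime.")
```

If the cell raises, recreate the cluster with one of the runtimes listed above.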

Install Azure Open Dataset SDK

For Databricks to use Azure Open Datasets, we need to install the Python SDK. The following steps walk you through installing a Python package from PyPI.

Azure Databricks navigation icon

First go to the workspace landing page by clicking on Azure Databricks in the navigation bar.

Azure Databricks Import Library icon

Then click on Import Library.

Next, on the Create Library page, select PyPI and enter the package name: azureml-opendatasets. Click Create.

Azure Databricks create library page
Create Azure Open Datasets Library

Finally, install the library on your running cluster by checking the cluster and clicking Install. It may take a minute or two to complete the install.

Azure Databricks install Azure Open Datasets Python SDK
Install Azure Open Dataset Library
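
After the install finishes, it can be worth confirming from a notebook that the package is importable on the cluster before writing any loading code. The snippet below is a small sketch using only the standard library; is_importable is a hypothetical helper name, not something the SDK provides.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Prints True once the library is installed on the attached cluster.
print("azureml.opendatasets importable:", is_importable("azureml.opendatasets"))
```

If this prints False, check the library status on the cluster page and make sure the notebook is attached to the cluster where the library was installed.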

Notebook to Load Data

With the cluster running and the library installed we can create our Databricks Notebook to load the NOAA data.

Azure Databricks navigation icon

Again click on Azure Databricks on the left navigation bar.

Azure Databricks new notebook link

Then click on New Notebook.

Create a Python Notebook using your running cluster.

Azure Databricks create notebook
Create Azure Databricks Notebook

Add the following code to your notebook. I included the output I received from Databricks when executing the code.

Code to load the last month of weather data into a Spark DataFrame
from azureml.opendatasets import NoaaIsdWeather
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Use a single timestamp so both ends of the window come from the same instant.
end_date = datetime.today()
start_date = end_date - relativedelta(months=1)

# Get historical weather data for the past month.
isd = NoaaIsdWeather(start_date, end_date)

# Convert the dataset to a Spark DataFrame.
df = isd.to_spark_dataframe()
Azure Databricks Load NOAA Data Output
Spark Output from Load
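
The load code uses a rolling one-month window that ends at the moment the cell runs, so the rows returned change from run to run. If you would rather load one complete calendar month, one option is to compute the boundaries with the standard library. This is a sketch under assumptions: previous_calendar_month is a hypothetical helper, and whether NoaaIsdWeather treats the end date as inclusive is not something this post confirms.

```python
from datetime import datetime, timedelta


def previous_calendar_month(today):
    """Return (first_day, last_day) of the month before `today`, both at midnight."""
    first_of_this_month = today.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Stepping back one day from the 1st always lands on the last day of the prior month.
    last_of_previous = first_of_this_month - timedelta(days=1)
    first_of_previous = last_of_previous.replace(day=1)
    return first_of_previous, last_of_previous


start_date, end_date = previous_calendar_month(datetime.today())
print(start_date, end_date)

# In the notebook, these values could replace the rolling window:
# isd = NoaaIsdWeather(start_date, end_date)
```

A fixed window like this makes reruns of the notebook easier to compare, since the same month is requested each time until the calendar rolls over.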
Code to print schema
df.printSchema()
Azure Databricks print schema output
Spark Output from Print Schema
Code to Display 5 Rows
display(df.limit(5))
Azure Databricks display rows output
Spark Output from Display
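
With the data in a Spark DataFrame, a natural next step is a simple aggregate. The sketch below wraps three standard DataFrame calls (filter, groupBy, and avg) in a hypothetical helper named average_by. The column names in the usage comment, stationName and temperature, are assumptions on my part; confirm them against the output of df.printSchema() before running.

```python
def average_by(df, group_col, value_col):
    """Average `value_col` per `group_col`, skipping rows where the value is null.

    `df` is expected to be a Spark DataFrame; only DataFrame methods are used,
    so no extra imports are needed in the notebook.
    """
    non_null = df.filter("{} IS NOT NULL".format(value_col))
    return non_null.groupBy(group_col).avg(value_col)


# Usage in the notebook (column names are assumed, check the schema first):
# display(average_by(df, "stationName", "temperature").limit(5))
```

Spark names the resulting column avg(<value_col>), so it may be worth renaming it before joining or saving the result.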

Conclusion

Azure Open Datasets can accelerate your ML projects, and I am excited to see what additional datasets get added to the catalog.

For a comparison, please check out Using Azure Notebook Workflows to Ingest NOAA Weather Data to see how NOAA weather data was loaded using NOAA's APIs and Databricks Notebook Workflows. While this is not an exact one-to-one comparison, you should be able to see how Azure Open Datasets simplifies loading publicly available data assets.

Please let me know your thoughts and please rate the post.

Best, Jonathan



Categories: Azure Open Datasets, Databricks


1 reply

  1. Thanks for the detailed post. It is very helpful
