Setting up and Integrating Databricks with your Organization

Databricks is a lakehouse platform in the cloud. Utilizing technologies such as Apache Spark, Delta Lake and MLflow., Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI.

This document will walk through setting up Databricks to be used as a data Store in DeltaStream.

Setting up Databricks Workspace

Prerequisites

  1. Sign up for a Databricks account using AWS and complete the workspace setup (steps 1 and 2) or use an existing Databricks workspace

  2. Have an AWS account whose S3 will host your Delta Lake data. If you don't have an account, you can sign up for a free trial of AWS.

Create Databricks App Token

  1. Navigate to your Databricks workspace

  2. In the top right of the screen, click down on your account name and select "User Settings"

  3. In the menu bar that appears, select "Developer" and under "Access Tokens", select "Manage"

  4. Select "Generate new token", optionally add a comment for the token and choose a lifetime for the token. Then click "Generate" to create the token.

  1. Make sure the save or download the newly generated token value, as this will be used when creating the Store later on.

Databricks documentation for generating access tokens for a workspace

Add Databricks SQL Warehouse

  1. Navigate to your Databricks workspace

  2. Select the "SQL Warehouses" tab in the left side panel. You'll see a list of the existing SQL warehouses in your workspace. Databricks creates a starter warehouse for you.

  3. To create a new SQL Warehouse, select "Create SQL warehouse". To edit an existing SQL warehouse, click the 3 vertical dots on the right of each listed warehouse and click "Edit".

  4. Configure your SQL warehouse with your preferred specifications. To learn more about configuring your SQL warehouse, take a look at the Databricks documentation. For a more optimal experience, we recommend choosing "serverless" as the SQL warehouse type (more info about Databricks serverless SQL warehouse).

  5. Click "Save" to create the SQL warehouse and take note of the warehouse ID on the overview page. We'll need this ID when creating the Store later on. You can also access the warehouse overview by clicking on the name of the SQL warehouse from the "SQL Warehouses" initial landing page from step 1.

Add S3 Bucket as External Location for Data

Use an existing S3 bucket or create a new one

  1. To create a new AWS S3 bucket:

    1. Navigate to the "S3" page in the AWS console

    2. Select "Create bucket"

    3. Enter a name for your S3 bucket and select "Create bucket" at the bottom to create your new S3 bucket

Documentation for creating, configuring, and working with Amazon S3 buckets

Add Databricks connection to the newly created S3 bucket

  1. Navigate to your Databricks workspace

  2. Select the "Catalog" tab in the left side panel. This will bring you to a view of your Unity Catalog.

  3. At the top of the page, select "+ Add" and in the dropdown select "Add an external location"

  4. Select AWS Quickstart to set up the Databricks and S3 connection. Optionally, advanced users can set up their external location manually instead. For this tutorial, we'll continue with the AWS Quickstart option. Select "Next".

  5. Enter the name of an existing S3 bucket to link to your Databricks workspace and click "Generate new token". Copy that token, then select "Launch in Quickstart" which will bring you back to the AWS console in a page called "Quick create stack".

  6. Enter the access token copied in step 5 into the "Databricks Personal Access Token" field in the AWS "Quick create stack" page. Then at the bottom of the page, click that you acknowledge that AWS CloudFormation might create IAM resources with custom names, and select "Create stack" to launch stack initialization.

  7. After a couple minutes, you'll see the stack creation complete

Databricks documentation on external locations

(Optional) Create a Unity Catalog Metastore

Relevant if you are getting an error message such as "Metastore Storage Root URL Does Not Exist".

  1. Have an S3 bucket to use for metastore-level managed storage in AWS (follow steps above for creating a new S3 bucket), we'll use the bucket created in the previous step

  2. Navigate to the Databricks account settings Catalog page, here you can either create a new metastore or edit existing metastores

  3. If creating a new metastore, click the "Create metastore" button and follow the prompts. You'll set the name, region, S3 path, and workspaces for the metastore.

  4. If editing an existing metastore, click on the name of the metastore you'd like to edit. From this page you can assign new workspaces, set an S3 path, edit the metastore admin, and other actions.

Databricks documentation on creating a Unity Catalog Metastore

Adding Databricks as a DeltaStream Store

  1. Navigate to the DeltaStream web console

  2. Select "Stores" in the left panel then select "+ New Store" in the top right

  3. Enter the following information

    • Store Name – A name to identify your DeltaStream Store (See Store)

    • Store Type – Databricks

    • Access Region – DeltaStream access region to associate with the Store (See Region)

    • URL – URL for Databricks workspace which can be found by navigating to the Databricks accounts page and clicking on the workspace you want to use

    • Warehouse ID – The ID for a Databricks SQL Warehouse in your Databricks workspace (See Add Databricks SQL Warehouse)

    • Databricks Cloud Region – The AWS region that the "Cloud S3 Bucket" exists

    • Cloud S3 Bucket – An AWS S3 bucket that is connected as an external location in your Databricks workspace (See Add S3 Bucket as External Location for Data)

  4. Click "Next" to enter credentials

    • App Token – The Databricks access token for your user in your Databricks workspace (See Create Databricks App Token)

    • Access Key ID – Access key associated with the AWS account where the "Cloud S3 Bucket" exists

    • Secret Access Key – Secret access key associated with the AWS account where the "Cloud S3 Bucket" exists

  5. Click "Save" to create the Store

For instructions on creating the Store using DSQL, see CREATE STORE

Process Streaming Data and Sink to Databricks

For the steps below, let's assume you already have a Stream defined called 'pageviews' which is backed by a topic in Kafka. We'll also assume that there is a Databricks Store called 'databricks_store' (See Adding Databricks as a DeltaStream Store). We are going to perform a simple filter on the pageviews Stream and sink the results into Databricks.

If you are curious about how to set up a Stream or a Kafka Store, see Starting with the Web App or Starting with the CLI.

Inspect the Databricks store

  1. Navigate to the "Stores" tab in the left side panel. This will display a list of the existing Stores.

  2. Select the Store called "databricks_store" then select "Entities". This will bring up a list of the existing catalogs in your Databricks workspace.

  3. (Optional) Create a new catalog

    1. Click on the 3 vertical dots next to your store name and select "Create Entity"

    2. In the popup, enter a name for the new catalog and select "Create". You should now be able to see the new catalog in the entities list. Note, if you are getting an error message "Metastore Storage Root URL Does Not Exist", then make sure you have properly set up your Databricks Unity Catalog metastore.

  4. Click on a catalog name to see the schemas that exist under that catalog

  5. (Optional) Create a new schema

    1. Select "+ New Entity" to create a new schema

    2. In the popup, enter a name for the new schema and select "Create". You should now be able to see the new schema in the entities list.

  6. Click on a schema name to see the tables that exist under that schema

Write a CTAS query to sink data into Databricks

  1. Navigate to the "SQL Editor" tab in the left side panel

  2. In the SQL editor box, write the CREATE TABLE AS SELECT (CTAS) query to ingest from 'pageviews' and output to a new Table called 'pv_table'

CREATE TABLE pv_table WITH (
  'store' = 'databricks_store', 
  'databricks.catalog.name' = 'new_catalog', 
  'databricks.schema.name' = 'new_schema', 
  'databricks.table.name' = 'pageviews', 
  'table.data.file.location' = 's3://deltastream-databricks-bucket2/test'
) AS 
SELECT 
  viewtime, 
  pageid, 
  userid 
FROM 
  pageviews 
WHERE 
  pageid != 'Page_3';
  1. Select "Run"

  2. Navigate to "Queries" tab in the left side panel to see the existing queries, including the query from step 2. It takes a little bit of time for the query to transition into the 'Running' state.

  3. Refresh until you see the query is in the 'Running' state

  1. Navigate to the "Stores" tab in the left side panel

  2. Navigate to the new table created by the above CTAS with the following clicks "databricks_store" --> "Entities" --> "new_catalog" --> "new_schema" --> "pageviews". If you wrote your CTAS such that the store/catalog/schema/table names are different, then navigate accordingly.

  3. Select "Print" to see a sample of the data in your Databricks table

Last updated