Databricks

Databricks is a lakehouse platform in the cloud. Utilizing technologies such as Apache Spark, Delta Lake, and MLflow, Databricks combines the functionality of data warehouses and data lakes to offer an open and unified platform for data and AI.

This document walks you through setting up Databricks to be used as a data Store in DeltaStream.

Setting up the Databricks Workspace

Prerequisites

  1. Sign up for a Databricks account using AWS and complete the workspace setup (steps 1 and 2), or use an existing Databricks workspace.

  2. Have an AWS account whose S3 hosts your Delta Lake data. If you don't have an account, you can sign up for a free trial of AWS.

Create a Databricks App Token

  1. Navigate to your Databricks workspace.

  2. In the top right of the screen, click your account name and select User Settings.

  3. In the menu bar that displays, click Developer, and under Access Tokens, click Manage.

  4. Click Generate new token. Add an optional comment for the token and then choose a lifetime for the token. Then click Generate to create the token.

Save or download the newly-generated token value. You will need this when creating the store later on.

For more details on generating access tokens for a workspace, see the Databricks documentation.

Add Databricks SQL Warehouse

  1. Navigate to your Databricks workspace.

  2. In the lefthand navigation, click SQL Warehouses. A list of the existing SQL warehouses in your workspace displays. Databricks creates a starter warehouse for you.

  3. To create a new SQL warehouse, click Create SQL warehouse. To edit an existing SQL warehouse, click the 3 vertical dots to the right of the warehouse you want, then click Edit.

  4. Configure your SQL warehouse with your preferred specifications. (To learn more about configuring your SQL warehouse, take a look at the Databricks documentation.) For a more optimal experience, we recommend choosing serverless as the SQL warehouse type; more information about Databricks serverless SQL warehouses is available in the Databricks documentation.

  5. Click Save to create the SQL warehouse. Record the warehouse ID on the overview page; you will need this ID when you create the store later on. You can also access the warehouse overview by clicking the name of the SQL warehouse on the SQL Warehouses landing page from step 2.

Add an S3 Bucket as External Location for Data

Use an existing S3 bucket or create a new one.

  1. To create a new AWS S3 bucket:

    1. In the AWS console, navigate to the S3 page.

    2. Click Create bucket.

    3. Enter a name for your S3 bucket and then at the bottom click Create bucket to create your new S3 bucket.

For more details, see the documentation for creating, configuring, and working with Amazon S3 buckets.

Add Databricks connection to the newly-created S3 bucket

  1. Navigate to your Databricks workspace.

  2. In the lefthand navigation, click Catalog. This displays a view of your Unity Catalog.

  3. At the top of the page, click + Add, and from the list that displays click Add an external location.

  4. Click AWS Quickstart to set up the Databricks and S3 connection, and then click Next. Advanced users can opt to set up their external location manually instead, but this tutorial continues with the AWS Quickstart option.

  5. Enter the name of an existing S3 bucket to link to your Databricks workspace. Then click Generate new token. Copy that token, then click Launch in Quickstart. This brings you back to the AWS console and displays a page called Quick create stack.

  6. On the AWS Quick create stack page, in the Databricks Personal Access Token field, enter the access token you copied in step 5. Then at the bottom of the page, click to acknowledge that AWS CloudFormation might create IAM resources with custom names. Then click Create stack to launch stack initialization.

  7. In a few minutes, you'll see the stack creation complete.

For more information on external locations, see the Databricks documentation.

(Optional) Create a Unity Catalog Metastore

This step is relevant if you receive an error message such as Metastore Storage Root URL Does Not Exist. In this case:

  1. Ensure you have an S3 bucket to use for metastore-level managed storage in AWS. You can use the bucket created in the previous step, or follow the steps above to create a new one.

  2. Navigate to the Databricks account settings Catalog page. From here, either create a new metastore or edit an existing one.

  3. If you're creating a new metastore, click Create metastore and follow the prompts to set the name, region, S3 path, and workspaces for the metastore.

  4. If you're editing an existing metastore, click the name of the metastore you wish to edit. From this page you can assign new workspaces, set an S3 path, edit the metastore admin, and take other actions.

For more information on creating a Unity Catalog Metastore, see the Databricks documentation.

Adding Databricks as a DeltaStream Store

  1. Open DeltaStream. In the lefthand navigation, click Resources and then click Add Store +.

  2. From the menu that displays, click Databricks. The Add Store window opens.

  3. Enter the authentication and connection parameters. These include:

    • Store Name – A unique name to identify your DeltaStream store. (For more details see Store). Store names are limited to a maximum of 255 characters. Only alphanumeric characters, dashes, and underscores are allowed.

    • Store Type – Databricks.

    • Access Region – DeltaStream access region to associate with the store; indicates where data is stored or streams through. (For more details see Region).

    • URL – The URL for your Databricks workspace. Find this by navigating to the Databricks accounts page and clicking the workspace you wish to use.

    • Warehouse ID – The ID for a Databricks SQL warehouse in your Databricks workspace. (For more details see Add Databricks SQL Warehouse).

    • Databricks Cloud Region – The AWS region in which the Cloud S3 Bucket exists.

    • Cloud S3 Bucket – An AWS S3 bucket that is connected as an external location in your Databricks workspace (see Add an S3 Bucket as External Location for Data, above).

    • App Token – The Databricks access token for your user in your Databricks workspace. (For more details see Create a Databricks App Token, above.)

    • Access Key ID – Access key associated with the AWS account in which the Cloud S3 Bucket exists.

    • Secret Access Key – Secret access key associated with the AWS account in which the Cloud S3 Bucket exists.

  4. Click Add.

Your Databricks store displays on the Resources page in your list of stores.

Note For instructions on creating the store using DSQL, see CREATE STORE.
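
As a rough illustration of that path, the sketch below shows what such a statement could look like, using the parameters described above. The property names shown here are assumptions for illustration only, not the documented syntax; refer to the CREATE STORE reference for the exact parameters.

-- Hypothetical sketch only; parameter names are assumed, not taken from the CREATE STORE reference.
CREATE STORE databricks_store WITH (
  'type' = DATABRICKS,
  'access_region' = "AWS us-east-1",
  'uris' = 'https://<your-workspace>.cloud.databricks.com',   -- URL
  'databricks.app_token' = '<app-token>',                     -- App Token
  'databricks.warehouse_id' = '<warehouse-id>',               -- Warehouse ID
  'aws.access_key_id' = '<access-key-id>',                    -- Access Key ID
  'aws.secret_access_key' = '<secret-access-key>',            -- Secret Access Key
  'aws.s3.bucket' = '<s3-bucket-name>',                       -- Cloud S3 Bucket
  'aws.s3.bucket.region' = 'us-east-1'                        -- Databricks Cloud Region
);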

Process Streaming Data and Sink to Databricks

For the steps below, assume you already have a stream called pageviews defined, which is backed by a topic in Kafka. Assume also there is a Databricks store labelled Databricks_Test_Store. (For more details see Adding Databricks as a DeltaStream Store, above.) You'll now perform a simple filter on the pageviews stream and sink the results into Databricks.

Note For more information on setting up a stream or a Kafka store, see Starting with the Web App or Starting with the CLI.
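
If you don't yet have the pageviews stream, here is a minimal sketch of what its definition might look like over a Kafka topic. The column names follow the query below; the topic name and value format are assumptions, so adjust them to match your Kafka store.

-- Hypothetical stream definition over an existing Kafka topic named 'pageviews'
CREATE STREAM pageviews (
  viewtime BIGINT,
  userid VARCHAR,
  pageid VARCHAR
) WITH (
  'topic' = 'pageviews',
  'value.format' = 'json'
);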

Inspect the Databricks store

  1. In the lefthand navigation, click Resources. This displays a list of the existing stores.

  2. Click the Databricks_Test_Store. The store page displays, with the Catalogs tab active. Here you can view a list of the existing catalogs in your Databricks workspace.

  3. (Optional) Create a new catalog. To do this, click + Add Catalog. When prompted, enter a name for the new catalog and click Add. The new catalog displays in the list. Important If you receive the error message Metastore Storage Root URL Does Not Exist, verify that you've properly set up your Databricks Unity Catalog metastore.

  4. To see the schemas that exist in a particular catalog, click the catalog you want.

  5. (Optional) Create a new schema. To do this, select + Add Schema. In the window that displays, enter a name for the new schema and then click Add. The new schema now displays in the list.

  6. To see the tables that exist under a particular schema, click the schema you want.
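
To inspect the store from the CLI instead, a minimal sketch using the LIST STORES and DESCRIBE STORE commands is shown below; see their reference pages for the exact syntax and output.

-- List your stores, then inspect the Databricks store created above
LIST STORES;
DESCRIBE STORE Databricks_Test_Store;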

Write a CTAS (CREATE TABLE AS SELECT) Query to Sink Data into Databricks

  1. In the lefthand navigation, click Workspace.

  2. In the SQL pane of your workspace, write the CREATE TABLE AS SELECT (CTAS) query to ingest from pageviews and output to a new table titled pv_table:

CREATE TABLE pv_table WITH (
  'store' = 'databricks_store', 
  'databricks.catalog.name' = 'new_catalog', 
  'databricks.schema.name' = 'new_schema', 
  'databricks.table.name' = 'pageviews', 
  'table.data.file.location' = 's3://deltastream-databricks-bucket2/test'
) AS 
SELECT 
  viewtime, 
  pageid, 
  userid 
FROM 
  pageviews 
WHERE 
  pageid != 'Page_3';

  3. Click Run.

  4. In the lefthand navigation, click Queries to see the existing queries, including the query from the step immediately prior. It may take a few moments for the query to transition into the Running state. Keep refreshing your screen until the query transitions.
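
Alternatively, a minimal sketch for checking the query from the CLI (see LIST QUERIES for the full output reference):

-- Show existing queries and their states; rerun until the CTAS query shows as Running
LIST QUERIES;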

View the results

  1. In the lefthand navigation, click Resources. This displays a list of the existing stores.

  2. To view the new table created by the above CTAS, navigate to databricks_store --> Catalogs --> new_catalog --> new_schema --> pageviews. Of course, if you wrote your CTAS such that the store/catalog/schema/table names are different, navigate accordingly.

  3. To view a sample of the data in your Databricks table, click Print.
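
You can also sample the sinked data directly in the Databricks SQL editor. A minimal sketch, assuming the catalog, schema, and table names used in the CTAS query above:

-- Sample rows from the Delta table populated by the CTAS query
SELECT viewtime, pageid, userid
FROM new_catalog.new_schema.pageviews
LIMIT 10;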
