# Databricks

[Databricks](https://www.databricks.com/) is a lakehouse platform in the cloud. This article walks you through setting up Databricks as a [Data Store](/overview/core-concepts/store.md) in DeltaStream.

## Setting up the Databricks Workspace

### Prerequisites

1. [Sign up for a Databricks account using AWS and complete the workspace setup](https://docs.databricks.com/en/getting-started/index.html) (steps 1 and 2) or use an existing Databricks workspace.
2. Have an AWS account with an S3 bucket that hosts your Delta Lake data.\
   If you don't have an account, you can sign up for a [free trial of AWS](https://aws.amazon.com/free).

### Create a Databricks App Token

1. Navigate to your Databricks workspace.
2. In the top right of the screen, click your account name and select **User Settings**.<br>

   <figure><img src="/files/FqlCSsrHbqL4gByeaJp8" alt=""><figcaption></figcaption></figure>
3. In the menu bar that displays, click **Developer**, and under **Access Tokens**, click **Manage**.<br>

   <figure><img src="/files/GQUwPHg7ILqwhEQQcYlb" alt=""><figcaption></figcaption></figure>
4. Click **Generate new token**. Add an optional comment for the token and then choose a lifetime for the token. Then click **Generate** to create the token.<br>

   <figure><img src="/files/bfZOt91f533BPWs1cVqZ" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/k3OJfd5s8pjptAUuAenT" alt="" width="375"><figcaption></figcaption></figure>

5. Save or download the newly generated token value. You will need it when [creating the store later on](#adding-databricks-as-a-deltastream-store).

For more details on generating access tokens for a workspace, see the [Databricks documentation](https://docs.databricks.com/en/dev-tools/auth.html#databricks-personal-access-token-authentication).

### Add Databricks SQL Warehouse

1. Navigate to your Databricks workspace.
2. In the lefthand navigation, click **SQL Warehouses**. A list of the existing SQL warehouses in your workspace displays. Databricks creates a starter warehouse for you.
3. To create a new SQL warehouse, click **Create SQL warehouse**. To edit an existing SQL warehouse, click the three vertical dots to the right of the warehouse you want, and then click **Edit**.<br>

   <figure><img src="/files/2KdaRghb4iMmw2ZfMIfE" alt=""><figcaption></figcaption></figure>
4. Configure your SQL warehouse with your preferred specifications. (To learn more about configuring your SQL warehouse, see the [Databricks documentation](https://docs.databricks.com/en/sql/admin/create-sql-warehouse.html).) For the best experience, we recommend choosing **serverless** as the SQL warehouse type; for more information, see the [Databricks serverless SQL warehouse documentation](https://docs.databricks.com/en/admin/sql/serverless.html).<br>

   <figure><img src="/files/Offe02v2ZkolFbapQWSG" alt=""><figcaption></figcaption></figure>
5. Click **Save** to create the SQL warehouse. Record the warehouse ID on the overview page; you will need this ID when [you create the store later on](#adding-databricks-as-a-deltastream-store).\
   You can also access the warehouse overview by clicking the name of the SQL warehouse on the **SQL Warehouses** landing page from step 2.<br>

   <figure><img src="/files/TsS67SvOVIOOsuVOciIF" alt=""><figcaption></figcaption></figure>

### Add an S3 Bucket as External Location for Data

#### Use an existing S3 bucket or create a new one.

1. **To create a new AWS S3 bucket**:
   1. In the [AWS console](https://console.aws.amazon.com), navigate to the S3 page.<br>

      <figure><img src="/files/03x7x5BfR9gEcAyBWnIy" alt=""><figcaption></figcaption></figure>
   2. Click **Create bucket**.

      <figure><img src="/files/T4iz226FKn518YdmVTUL" alt=""><figcaption></figcaption></figure>
   3. Enter a name for your S3 bucket and then at the bottom click **Create bucket** to create your new S3 bucket.

For more details, see the AWS [documentation for creating, configuring, and working with Amazon S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-buckets-s3.html).

#### Add a Databricks connection to the S3 bucket

1. Navigate to your Databricks workspace.
2. In the lefthand navigation, click **Catalog**. This displays a view of your [Unity Catalog](https://www.databricks.com/product/unity-catalog).
3. At the top of the page, click **+ Add**, and from the list that displays click **Add an external location**.<br>

   <figure><img src="/files/PCqFwJ6lGHQS3Vt0QGZi" alt=""><figcaption></figcaption></figure>
4. Click **AWS Quickstart** to set up the Databricks and S3 connection, and then click **Next**. Advanced users can opt to set up their external location manually instead, but this article continues with the AWS Quickstart option.<br>

   <figure><img src="/files/Ld6XrOieOaYyX3ruA9fV" alt="" width="375"><figcaption></figcaption></figure>
5. Enter the name of an existing S3 bucket to link to your Databricks workspace. Then click **Generate new token**. Copy that token, then click **Launch in Quickstart**. This brings you back to the AWS console and displays a page called **Quick create stack**.<br>

   <figure><img src="/files/oIJX1E3WqXGQGjF5sQNU" alt="" width="375"><figcaption></figcaption></figure>
6. On the AWS **Quick create stack** page, in the **Databricks Personal Access Token** field, enter the access token you copied in step 5. Then at the bottom of the page, click to acknowledge that AWS CloudFormation might create IAM resources with custom names. Then click **Create stack** to launch stack initialization.<br>

   <figure><img src="/files/1ARMIjysbQfMaYbP2mEp" alt=""><figcaption></figcaption></figure>

   <figure><img src="/files/guZGSXTp2GAeeVGuxgmZ" alt=""><figcaption></figcaption></figure>
7. Within a few minutes, the stack creation completes.<br>

   <figure><img src="/files/9jwqkG8dqQUe3Op9PHwV" alt=""><figcaption></figcaption></figure>

For more information on external locations, see the [Databricks documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-external-locations.html).

#### (Optional) Create a Unity Catalog Metastore

This step is relevant if you receive an error message such as `Metastore Storage Root URL Does Not Exist`. In this case:

1. Ensure you have an S3 bucket in AWS to use for metastore-level managed storage. You can use the bucket created in the previous step, or follow the steps above to create a new one.
2. Navigate to the [Databricks account settings Catalog page](https://accounts.cloud.databricks.com/data). From here, either create a new metastore or edit existing metastores.<br>

   <figure><img src="/files/f5ObCHpnKp4EkDcBJA6Y" alt=""><figcaption></figcaption></figure>
3. If you're creating a new metastore, click **Create metastore** and follow the prompts to set the name, region, S3 path, and workspaces for the metastore.
4. If you're editing an existing metastore, click on the name of the metastore you wish to edit. From this page you can assign new workspaces, set an S3 path, edit the metastore admin, and take other actions.

For more information on creating a Unity Catalog metastore, see the [Databricks documentation](https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html).

## Adding Databricks as a DeltaStream Store

1. Open DeltaStream. In the lefthand navigation, click **Resources** ( ![](/files/Zwq1BBdRyaRsv55N3KNm) ), and then click **Add Store +**.

   <div align="center"><figure><img src="/files/qO7Xks6faJ2X31inVtNu" alt="" width="563"><figcaption></figcaption></figure></div>
2. From the menu that displays, click **Databricks**. The **Add Store** window opens.<br>

   <div align="center"><figure><img src="/files/NcO2j8hUd5p3eCUzPUgf" alt="" width="306"><figcaption></figcaption></figure></div>
3. Enter the authentication and connection parameters. These include:
   * **Store Name** – A unique name to identify your DeltaStream store. (For more details see [Data Store](/overview/core-concepts/store.md)).\
     Store names are limited to a maximum of 255 characters. Only alphanumeric characters, dashes, and underscores are allowed.
   * **Store Type** – Databricks
   * **URL** – The URL of your Databricks workspace. Find this by navigating to the [Databricks accounts page](https://accounts.cloud.databricks.com/workspaces) and clicking the workspace you wish to use.
   * **Warehouse ID** – The ID for a Databricks SQL warehouse in your Databricks workspace. (For more details see [#add-databricks-sql-warehouse](#add-databricks-sql-warehouse "mention")).
   * **Databricks Cloud Region** – The AWS region in which the **Cloud S3 Bucket** exists.
   * **Cloud S3 Bucket** – An AWS S3 bucket that is connected as an external location in your Databricks workspace (see [#add-s3-bucket-as-external-location-for-data](#add-s3-bucket-as-external-location-for-data "mention")).
   * **App Token** – The Databricks access token for your user in your Databricks workspace. (For more details see [#create-databricks-app-token](#create-databricks-app-token "mention").)
   * **Access Key ID** – Access key associated with the AWS account in which the **Cloud S3 Bucket** exists.
   * **Secret Access Key** – Secret access key associated with the AWS account in which the **Cloud S3 Bucket** exists.
4. Click **Add**.

Your Databricks store displays on the **Resources** page in your list of stores.

<figure><img src="/files/3C5dvuQJkEdxEP68qHM4" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
**Note** For instructions on creating the store using DSQL, see [CREATE STORE](/reference/sql-syntax/ddl/create-store.md).
{% endhint %}
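As a sketch of what that DSQL might look like, the statement below creates a Databricks store using the parameters described in step 3 above. The property names and placeholder values here are illustrative assumptions, not authoritative syntax; consult the [CREATE STORE](/reference/sql-syntax/ddl/create-store.md) reference for the exact parameter names.

```sql
-- Illustrative sketch only; property names and placeholder values are
-- assumptions. See the CREATE STORE reference for the exact syntax.
CREATE STORE databricks_store WITH (
  'type' = DATABRICKS,
  'uris' = 'https://dbc-12345678-abcd.cloud.databricks.com', -- workspace URL
  'databricks.warehouse_id' = '1234567890abcdef',            -- SQL warehouse ID
  'databricks.app_token' = '<databricks-app-token>',
  'databricks.cloud.s3.bucket' = 'deltastream-databricks-bucket',
  'databricks.cloud.region' = 'AWS us-east-1',
  'aws.access_key_id' = '<access-key-id>',
  'aws.secret_access_key' = '<secret-access-key>'
);
```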

## Process Streaming Data and Sink to Databricks

For the steps below, assume you already have a [stream](/overview/core-concepts/databases.md#stream) defined called **pageviews**, which is backed by a topic in Kafka. Assume also that there is a Databricks store named **databricks\_store**. (For more details see [#adding-databricks-as-a-deltastream-store](#adding-databricks-as-a-deltastream-store "mention").) Now perform a simple filter on the pageviews stream and sink the results into Databricks.

{% hint style="info" %}
**Note** For more information on setting up a stream or a Kafka store, see [Starting with the Web App](/getting-started/starting-with-web-app.md) or [Starting with the CLI](/getting-started/starting-with-cli.md).
{% endhint %}
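For reference, a pageviews stream like the one assumed here can be declared with a DSQL statement along these lines. The column list, topic name, and value format are illustrative assumptions matching the pageviews example used in the getting-started guides linked above; adjust them to your own Kafka topic.

```sql
-- Illustrative sketch; columns, topic, and format are assumptions
-- based on the pageviews example in the getting-started guides.
CREATE STREAM pageviews (
  viewtime BIGINT,
  userid   VARCHAR,
  pageid   VARCHAR
) WITH (
  'topic' = 'pageviews',
  'value.format' = 'JSON'
);
```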

### Inspect the Databricks store

1. In the lefthand navigation, click **Resources** ( ![](/files/Zwq1BBdRyaRsv55N3KNm) ). This displays a list of the existing stores.<br>

   <figure><img src="/files/g7clW7ArMFpOcmkqDWG7" alt="" width="563"><figcaption></figcaption></figure>
2. Click **databricks\_store**. The store page displays, with the **Databases** tab active. Here you can view a list of the existing catalogs in your Databricks workspace.<br>

   <figure><img src="/files/yxwNUlXrnxZLGjsValBG" alt="" width="369"><figcaption></figcaption></figure>
3. (Optional) Create a new database. To do this, click **+ Add Database**. When prompted, enter a name for the new database and click **Add**. The new database displays in the list.\
   \
   **Important** If you receive this error message -- `Metastore Storage Root URL Does Not Exist` -- verify that you've properly [set up your Databricks Unity Catalog metastore](#optional-create-a-unity-catalog-metastore).
4. To see the namespaces that exist in a particular database, click the database you want.
5. (Optional) Create a new namespace. To do this, click **+ Add Namespace**. In the window that displays, enter a name for the new namespace and then click **Add**. The new namespace displays in the list.
6. To see the tables that exist under a particular namespace, click the namespace you want.

### Write a CTAS (CREATE TABLE AS SELECT) Query to Sink Data into Databricks

1. In the lefthand navigation, click **Workspace** ( ![](/files/ZXcAkgugP7AuG9QFRXKO) ).
2. In the SQL pane of your workspace, write the [CREATE TABLE AS SELECT (CTAS)](/reference/sql-syntax/query/create-table-as.md) query to ingest from **pageviews** and output to a new table titled **pv\_table**.

```sql
CREATE TABLE pv_table WITH (
  'store' = 'databricks_store', 
  'databricks.catalog.name' = 'new_catalog', 
  'databricks.schema.name' = 'new_schema', 
  'databricks.table.name' = 'pageviews', 
  'table.data.file.location' = 's3://deltastream-databricks-bucket2/test'
) AS 
SELECT 
  viewtime, 
  pageid, 
  userid 
FROM 
  pageviews 
WHERE 
  pageid != 'Page_3';
```

3. Click **Run**.
4. In the lefthand navigation click **Queries** ( ![](/files/HOEvY09XthGMf2h6wEx6) ) to see the existing queries, including the query from the step immediately prior.\
   It may take a few moments for the query to transition into the **Running** state. Keep refreshing your screen until the query transitions.

<figure><img src="/files/Y7otaKuz16yXoHaqq6ZS" alt="" width="563"><figcaption></figcaption></figure>

### View the results

1. In the lefthand navigation, click **Resources** ( ![](/files/Zwq1BBdRyaRsv55N3KNm) ). This displays a list of the existing stores.
2. To view the new table created by the above CTAS, navigate to **databricks\_store** --> **Databases** --> **new\_catalog** --> **new\_schema** --> **pageviews**.\
   If you used different store, catalog, schema, or table names in your CTAS, navigate accordingly.
3. To view a sample of the data in your Databricks table, click **Print**.<br>

   <figure><img src="/files/JilK8idTTW4VxAv7xzca" alt="" width="563"><figcaption></figcaption></figure>

