
Connecting Your Databricks Workspace

This guide walks you through connecting your Databricks workspace to CloudYali for cost tracking and optimization.


Prerequisites

Before you begin, ensure you have:

  • A Databricks workspace with Unity Catalog enabled
  • Account Admin permissions (for creating Service Principals)
  • Metastore Admin permissions (for granting System Tables access)
  • A CloudYali account with permissions to add cloud integrations
Unity Catalog Required

System Tables (used for billing data) require Unity Catalog to be enabled on your workspace. If Unity Catalog is not yet enabled, follow the Databricks Unity Catalog setup guide first.

Billing Data Availability

After creating a new Databricks account, billing data in system.billing.usage takes 24-48 hours to appear. The tables will exist but return no rows until Databricks processes the first usage records. You can complete the setup steps below immediately — data will flow once it becomes available.


Step 1: Create a Service Principal

A Service Principal provides secure, non-interactive authentication for CloudYali to access your Databricks workspace.

  1. Go to your Databricks Account Console
  2. Navigate to User management → Service principals
  3. Click Add service principal
  4. Enter name: CloudYali-Integration
  5. Click Add
  6. Note the Application ID (a UUID) — this is also your Client ID for OAuth and GRANT statements

Using Databricks CLI

# Install and configure the Databricks CLI
# https://docs.databricks.com/en/dev-tools/cli/install.html

# Create a Service Principal
databricks service-principals create \
--display-name "CloudYali-Integration"

# Note the application_id from the output

Step 2: Generate OAuth Secret

Create an OAuth M2M (Machine-to-Machine) secret for the Service Principal.

  1. In the Account Console, go to User management → Service principals
  2. Click on CloudYali-Integration
  3. Go to the Credentials & secrets tab
  4. Click Generate secret
  5. Copy the Client Secret (the Client ID shown here is the same Application ID from Step 1)
Store Your Secret Securely
  • Copy and store the Client Secret immediately — you won't be able to view it again
  • CloudYali encrypts and stores these credentials in AWS Secrets Manager
  • These credentials provide read-only access to billing data only
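Under the hood, the Client ID and Client Secret are exchanged for a short-lived access token via the Databricks OAuth M2M token endpoint on your workspace. A minimal Python sketch of that exchange — the host and credentials are placeholders, and `build_token_request` is an illustrative helper, not part of any CloudYali or Databricks SDK:

```python
from urllib.parse import urlencode

def build_token_request(host: str, client_id: str, client_secret: str) -> dict:
    """Build the OAuth M2M (client credentials) token request for a workspace."""
    return {
        "url": f"{host}/oidc/v1/token",
        "auth": (client_id, client_secret),  # HTTP Basic auth with the SP credentials
        "data": urlencode({"grant_type": "client_credentials", "scope": "all-apis"}),
        "headers": {"Content-Type": "application/x-www-form-urlencoded"},
    }

# With the `requests` library installed, you could then fetch a token:
# import requests
# req = build_token_request("https://dbc-abc123.cloud.databricks.com", cid, secret)
# token = requests.post(req["url"], auth=req["auth"], data=req["data"],
#                       headers=req["headers"]).json()["access_token"]
```

CloudYali performs this exchange for you automatically; the sketch is only to show why the Client Secret must be stored verbatim.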

Step 3: Assign Workspace Access

Add the Service Principal to your Databricks workspace:

  1. Go to your Databricks workspace (e.g., https://your-workspace.cloud.databricks.com)
  2. Navigate to Admin Settings → Identity and access → Service principals
  3. Click Add service principal
  4. Select CloudYali-Integration from the list
  5. Click Add
Multiple Workspaces

If you have multiple workspaces, repeat this step for each workspace you want to track. You can use the same Service Principal across all workspaces.


Step 4: Grant System Tables Access

The Service Principal needs read access to Databricks System Tables for billing data, pricing lookups, resource inventory, and optimization recommendations.

Required Permissions

Cost Tracking & Pricing:

| System Table | Purpose |
| --- | --- |
| system.billing.usage | Daily DBU/DSU consumption data across all workspaces, SKUs, clusters, and jobs — used for cost tracking and attribution |
| system.billing.list_prices | Public list prices per SKU — used for cost calculations and pricing analysis |

Workspace Inventory:

| System Table | Purpose |
| --- | --- |
| system.access.workspaces_latest | Workspace metadata (name, URL, status) — used for workspace name lookups in filters and dashboards |

Resource Inventory & Optimization:

| System Table | Purpose |
| --- | --- |
| system.compute.clusters | Cluster configurations (auto-termination, autoscaling, runtime, Photon, Spot usage) — used for inventory and optimization recommendations |
| system.compute.warehouses | SQL Warehouse configurations (type, size, scaling, auto-stop) — used for warehouse inventory and idle detection |
| system.compute.node_types | Available instance types with hardware specs (memory, cores, GPUs) — used for rightsizing recommendations |
| system.compute.warehouse_events | Warehouse lifecycle events (scale up/down, start/stop) — used for utilization analysis |
| system.lakeflow.jobs | Job definitions with names, owners, schedules, and tags — used for job inventory and cost attribution |
| system.lakeflow.job_tasks | Job task definitions with task keys and dependencies — used for task-level inventory |
| system.lakeflow.job_run_timeline | Per-run start/end times, duration, and result state — used for job cost analysis |
| system.lakeflow.pipelines | Delta Live Tables pipeline definitions with owners and tags — used for pipeline inventory |
| system.serving.served_entities | Model serving endpoint configurations — used for served entity inventory |
| system.mlflow.experiments_latest | MLflow experiment metadata — used for experiment inventory |
Billing Data Is Account-Scoped

system.billing.usage contains data from all workspaces in your Databricks account, regardless of which workspace the query runs from. You only need to grant access and connect from one workspace to get billing data for the entire account.

Compute & Job Tables Are Regionally Scoped

system.compute.* and system.lakeflow.* tables contain data from all workspaces in the same region as the metastore. If you have workspaces in multiple regions, connect from one workspace per region for full coverage.

Open a SQL editor in your Databricks workspace and run the following, replacing <client-id> with the Client ID (Application ID) you copied in Step 1 (a UUID like fab9e00e-ca35-11ec-9d64-0242ac120002):

-- Allow access to the system catalog and schemas
GRANT USE CATALOG ON CATALOG system TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.access TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.compute TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.serving TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.mlflow TO `<client-id>`;

-- Cost tracking & pricing
GRANT SELECT ON system.billing.usage TO `<client-id>`;
GRANT SELECT ON system.billing.list_prices TO `<client-id>`;

-- Workspace inventory
GRANT SELECT ON system.access.workspaces_latest TO `<client-id>`;

-- Resource inventory & optimization
GRANT SELECT ON system.compute.clusters TO `<client-id>`;
GRANT SELECT ON system.compute.warehouses TO `<client-id>`;
GRANT SELECT ON system.compute.node_types TO `<client-id>`;
GRANT SELECT ON system.compute.warehouse_events TO `<client-id>`;
GRANT SELECT ON system.lakeflow.jobs TO `<client-id>`;
GRANT SELECT ON system.lakeflow.job_tasks TO `<client-id>`;
GRANT SELECT ON system.lakeflow.job_run_timeline TO `<client-id>`;
GRANT SELECT ON system.lakeflow.pipelines TO `<client-id>`;
GRANT SELECT ON system.serving.served_entities TO `<client-id>`;
GRANT SELECT ON system.mlflow.experiments_latest TO `<client-id>`;
Metastore Admin Required

These GRANT statements require Metastore Admin privileges. If you get a PERMISSION_DENIED: User does not have MANAGE error, you need to assign yourself as Metastore Admin first:

  1. Go to Account Console → Data
  2. Click on your metastore name
  3. Under Metastore admin, click Edit
  4. Add your user as Metastore Admin and save

Then retry the GRANT statements. The backticks around the Client ID are required because it contains hyphens.
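If you run these grants for several Service Principals or metastores, generating the script programmatically avoids copy/paste errors. A hypothetical Python helper (`grant_script` is illustrative; the schema and table lists mirror the GRANT block above):

```python
# Schemas and tables from Step 4 of this guide.
SCHEMAS = ["billing", "access", "compute", "lakeflow", "serving", "mlflow"]
TABLES = [
    "billing.usage", "billing.list_prices",
    "access.workspaces_latest",
    "compute.clusters", "compute.warehouses", "compute.node_types",
    "compute.warehouse_events",
    "lakeflow.jobs", "lakeflow.job_tasks", "lakeflow.job_run_timeline",
    "lakeflow.pipelines",
    "serving.served_entities", "mlflow.experiments_latest",
]

def grant_script(client_id: str) -> str:
    """Emit the full GRANT script for one Service Principal Client ID."""
    # Backticks are required because the UUID contains hyphens.
    principal = f"`{client_id}`"
    lines = [f"GRANT USE CATALOG ON CATALOG system TO {principal};"]
    lines += [f"GRANT USE SCHEMA ON SCHEMA system.{s} TO {principal};" for s in SCHEMAS]
    lines += [f"GRANT SELECT ON system.{t} TO {principal};" for t in TABLES]
    return "\n".join(lines)

print(grant_script("fab9e00e-ca35-11ec-9d64-0242ac120002"))
```

Paste the output into a SQL editor and run it as a Metastore Admin, exactly as with the hand-written statements.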

What These Permissions Enable

Resource Inventory:

  • Clusters — track all-purpose and job cluster configurations, state, and lifecycle
  • SQL Warehouses — track warehouse configurations, sizing, and auto-stop settings
  • Jobs & Job Tasks — track job definitions, task-level details, owners, schedules, and tags for cost attribution
  • Pipelines — track Delta Live Tables pipeline definitions, owners, and tags
  • Served Entities — track model serving endpoint configurations and entity versions
  • MLflow Experiments — track experiment metadata for ML workflow inventory

Optimization Recommendations:

  • Clusters missing auto-termination (idle clusters running indefinitely)
  • All-Purpose Compute used for production jobs (Jobs Compute is up to 55% cheaper)
  • Fixed-size clusters without autoscaling enabled
  • Clusters running outdated Databricks Runtime versions
  • Clusters without Photon enabled (3-8x performance improvement)
  • Job clusters not using Spot/Preemptible instances
All Access Is Read-Only

All grants are SELECT only — CloudYali never modifies, creates, or deletes any Databricks resources. The billing and pricing system tables are free to query. Compute and job system tables incur minimal SQL Warehouse cost.

Verify Access

Test the grants by running as the Service Principal (or verify from CloudYali after setup):

-- Verify billing data access
SELECT COUNT(*) FROM system.billing.usage WHERE usage_date >= current_date() - INTERVAL 7 DAYS;

-- Verify pricing data access
SELECT COUNT(*) FROM system.billing.list_prices;

-- Verify resource inventory access
SELECT COUNT(*) FROM system.compute.clusters;
SELECT COUNT(*) FROM system.compute.warehouses;
SELECT COUNT(*) FROM system.lakeflow.jobs;
SELECT COUNT(*) FROM system.lakeflow.job_tasks;
SELECT COUNT(*) FROM system.lakeflow.pipelines;
SELECT COUNT(*) FROM system.serving.served_entities;
SELECT COUNT(*) FROM system.mlflow.experiments_latest;
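If you prefer to script the verification, the same queries can be submitted through the Databricks SQL Statement Execution API (`POST /api/2.0/sql/statements/`) while authenticated as the Service Principal. The sketch below only builds the request; `verification_request` is an illustrative helper and the host and warehouse values are placeholders:

```python
import json

def verification_request(host: str, warehouse_id: str, table: str) -> dict:
    """Build a SQL Statement Execution API request that counts rows in a table."""
    return {
        "url": f"{host}/api/2.0/sql/statements/",
        "body": json.dumps({
            "warehouse_id": warehouse_id,
            "statement": f"SELECT COUNT(*) FROM {table}",
            "wait_timeout": "30s",  # wait synchronously up to 30 seconds
        }),
    }

# POST req["body"] to req["url"] with a Bearer token obtained via OAuth M2M
# (see Step 2); a non-error response confirms the grant on that table works.
req = verification_request("https://dbc-abc123.cloud.databricks.com",
                           "4cc57a6c45e6f73e", "system.billing.usage")
```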

Step 5: Set Up a SQL Warehouse

CloudYali needs a SQL Warehouse to execute queries against System Tables. A Serverless SQL Warehouse is recommended for minimal cost (~$5/month).

Option A: Use an Existing SQL Warehouse

If you already have a SQL Warehouse, you can use it. To find its Warehouse ID:

  1. Go to SQL Warehouses in your workspace
  2. Click on the warehouse name
  3. Go to Connection details tab
  4. Copy the Warehouse ID — the last segment of the HTTP path (e.g., if the HTTP path is /sql/1.0/warehouses/4cc57a6c45e6f73e, the ID is 4cc57a6c45e6f73e)
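Extracting the ID from the HTTP path is trivially scriptable if you are automating onboarding; a hypothetical helper for illustration:

```python
def warehouse_id_from_http_path(http_path: str) -> str:
    """Return the Warehouse ID, i.e. the last segment of the HTTP path."""
    return http_path.rstrip("/").rsplit("/", 1)[-1]

# Example: "/sql/1.0/warehouses/4cc57a6c45e6f73e" -> "4cc57a6c45e6f73e"
```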

Option B: Create a New Serverless SQL Warehouse

  1. Go to SQL Warehouses in your workspace
  2. Click Create SQL warehouse
  3. Configure:
    • Name: CloudYali-Billing
    • Cluster size: 2X-Small (smallest available)
    • Type: Serverless (recommended) or Pro
    • Auto stop: 10 minutes (minimizes cost)
    • Scaling: Min 1, Max 1
  4. Click Create
  5. Copy the Warehouse ID from the Connection details tab (last segment of the HTTP path)
Cost Optimization

A Serverless SQL Warehouse with 2X-Small size and 10-minute auto-stop typically costs less than $5/month for CloudYali's billing queries (a few queries every 6-12 hours).


Step 6: Gather Your Credentials

Before proceeding to CloudYali, ensure you have the following information:

| Credential | Where to Find | Example |
| --- | --- | --- |
| Workspace URL | Browser address bar when in your workspace | https://dbc-abc123.cloud.databricks.com |
| SQL Warehouse ID | SQL Warehouse → Connection details (last segment of HTTP path) | 4cc57a6c45e6f73e |
| Client ID | Service Principal → Credentials & secrets (Step 2) | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
| Client Secret | Service Principal → Credentials & secrets (Step 2) | dose...xxxx |

Step 7: Connect in CloudYali

  1. Log in to your CloudYali account
  2. Navigate to Settings in the left sidebar
  3. Select Cloud Providers from the settings menu
  4. Locate the Databricks integration card and click Connect
  5. Enter your credentials:
    • Workspace Name: A friendly name for this workspace (e.g., "Production Workspace")
    • Host URL: Your Databricks workspace URL (e.g., https://dbc-abc123.cloud.databricks.com)
    • SQL Warehouse ID: The warehouse ID from Step 5 (e.g., 4cc57a6c45e6f73e)
    • Client ID: The OAuth Client ID from Step 2
    • Client Secret: The OAuth Client Secret from Step 2
  6. Click Test Connection to verify the setup
  7. Click Connect Workspace to complete the onboarding
Workspace ID

The Workspace ID is automatically extracted from your Host URL — you don't need to provide it separately.
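For the curious: on Azure-style hosts (adb-&lt;workspace-id&gt;.&lt;n&gt;.azuredatabricks.net) the numeric workspace ID is embedded directly in the hostname, whereas AWS-style hosts (e.g., dbc-abc123.cloud.databricks.com) use a deployment name and the ID is resolved via the API instead. An illustrative parser, not CloudYali's actual implementation:

```python
import re
from urllib.parse import urlparse

def workspace_id_from_host(url: str):
    """Return the numeric workspace ID for adb-style hosts, else None."""
    host = urlparse(url).hostname or ""
    m = re.match(r"adb-(\d+)\.", host)
    return m.group(1) if m else None
```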


What Happens Next

After connecting your Databricks workspace:

  • Initial sync imports billing data from the last 3 months
  • Data typically appears within a few hours of connecting
  • Ongoing syncs run automatically every 6-12 hours
  • Cost reports become available in the Cost Analysis section
Data Sync Schedule

CloudYali syncs your Databricks billing data every 6-12 hours. The initial sync imports up to 3 months of historical data. Subsequent syncs are incremental, importing only new data since the last sync.

You can monitor sync status from the Cloud Providers page.


Troubleshooting

Connection Test Failed

  • Verify the Workspace URL is correct and accessible (e.g., https://your-workspace.cloud.databricks.com)
  • Ensure the SQL Warehouse HTTP Path is correct and the warehouse is running
  • Check that the Client ID and Client Secret are from the correct Service Principal
  • Confirm the Service Principal has been added to the workspace (Step 3)

"Permission Denied" or "User does not have MANAGE" on System Tables

This error occurs when running GRANT statements without Metastore Admin privileges.

Solution:

  1. Go to Account Console → Data
  2. Click on your metastore name
  3. Under Metastore admin, click Edit
  4. Add your user as Metastore Admin and save
  5. Retry the GRANT statements

Note: Being an Account Admin is not enough — you specifically need the Metastore Admin role to manage permissions on system tables.

"INSUFFICIENT_PERMISSIONS: User does not have USE SCHEMA"

This error occurs during data sync if the USE CATALOG and USE SCHEMA grants were not applied. Even with SELECT access on individual tables, Databricks requires explicit catalog/schema traversal permissions.

Solution: Run the following grants in a SQL editor (requires Metastore Admin):

GRANT USE CATALOG ON CATALOG system TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.access TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.compute TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.lakeflow TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.serving TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA system.mlflow TO `<client-id>`;

"Table does not exist" for System Tables

  • Verify Unity Catalog is enabled: run SELECT current_catalog(); — should return a catalog name (e.g., workspace or main)
  • Check system tables are available: run SHOW SCHEMAS IN system; — should list billing, compute, lakeflow, etc.
  • If schemas are missing, system tables may not be enabled for your account — contact Databricks support
  • For new accounts, billing data takes 24-48 hours to appear in system.billing.usage even though the table exists

SQL Warehouse Not Starting

  • Verify you have permissions to create/use SQL Warehouses
  • Check your Databricks workspace quotas
  • For Serverless warehouses, ensure Serverless SQL is enabled for your workspace
  • Try creating a Pro warehouse instead if Serverless is not available

Missing Cost Data

  • Data syncs run every 6-12 hours — allow time for the initial sync to complete
  • Ensure your Databricks workspace has recent usage (clusters, jobs, or SQL queries)
  • Verify the System Tables contain data: run SELECT COUNT(*) FROM system.billing.usage in your workspace
  • Check sync status from the Cloud Providers page in CloudYali

OAuth Authentication Errors

  • Regenerate the OAuth secret in the Account Console and update in CloudYali
  • Verify the Service Principal is not deleted or disabled
  • Ensure the Service Principal is added to the correct workspace

Managing Your Connection

After setup, you can:

  • View sync status from the Cloud Providers page
  • Update credentials if you need to rotate the OAuth secret
  • Add workspaces to track additional Databricks environments
  • Trigger manual sync to import the latest data immediately
  • Disconnect the integration if no longer needed

For additional help, please contact our support team at support@cloudyali.io.