# How can I use a service principal to read and write to a database from Databricks?

This demo shows how you can use an Entra ID service principal to read and write to an Azure SQL DB or Azure Postgres database from a Databricks notebook.

## Solution
- [Prerequisites](#prerequisites)
- [Azure SQL DB](#azure-sql-db)
- [Azure Postgres](#azure-postgres)

## Prerequisites

### Create an Entra app registration
First, navigate to your tenant's Entra ID directory and add a new [app registration](https://learn.microsoft.com/en-us/entra/identity-platform/quickstart-register-app). This will create an Entra ID service principal that you can use to authenticate to Azure resources:

Next, create a secret for the service principal.

Note: You'll need the tenant ID, service principal name, client ID and secret value for the steps below.

The secret value will appear on the screen. Do not leave the page before recording it ([it will never be displayed again](https://learn.microsoft.com/en-us/entra/identity-platform/how-to-add-credentials?tabs=client-secret#add-a-credential-to-your-application)). You may want to keep the browser tab open and complete the remaining steps in a second tab for convenience.
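
If you want to confirm that the new credentials work before continuing, you can request a token locally with the azure-identity package, using the same ```ClientSecretCredential``` class that appears later in the notebook code. This is just a sketch; the tenant ID, client ID and secret placeholders stand for the values you recorded above.

``` python
# Optional check that the service principal can authenticate.
# Run anywhere Python is available, after: pip install azure-identity
from azure.identity import ClientSecretCredential

tenant_id = "<tenant-id>"                # from the app registration overview
client_id = "<client-id>"                # the application (client) ID
client_secret = "<client-secret-value>"  # the secret value recorded above

credential = ClientSecretCredential(tenant_id, client_id, client_secret)

# Request a token for Azure Resource Manager just to prove authentication succeeds
token = credential.get_token("https://management.azure.com/.default")
print("Token acquired, expires at:", token.expires_on)
```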

### Create an Azure Key Vault
Create an Azure Key Vault in the same region as your other resources.

Assign yourself and the Azure Databricks application the Key Vault Secrets Officer role. (Databricks will access the key vault referenced in your secret scope using the Databricks application's own service principal, which is unique to your tenant. You might expect a Unity Catalog-enabled workspace to use the workspace's managed identity to connect to the key vault, but unfortunately that's not the case.)

Add three secrets to the key vault: one for your tenant ID, another for the service principal's client ID (not the secret ID) and a third for the secret value. You will need these to authenticate as the service principal in your Databricks notebook. Refer back to the secret value in the other browser tab as needed.
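
If you would rather script this step than use the portal, the sketch below adds the three secrets with the azure-keyvault-secrets package. The vault name is a placeholder, and the sketch assumes you are signed in with the Azure CLI; the secret names match the keys retrieved later with ```dbutils.secrets.get```.

``` python
# Sketch: add the three secrets from a machine where you are signed in via `az login`.
# Requires: pip install azure-identity azure-keyvault-secrets
from azure.identity import AzureCliCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://<your-key-vault-name>.vault.azure.net"  # placeholder vault name
client = SecretClient(vault_url=vault_url, credential=AzureCliCredential())

# Secret names match the keys used later in the Databricks notebook
client.set_secret("tenant-id", "<tenant-id>")
client.set_secret("sql-dbx-read-write-test-sp-client-id", "<client-id>")
client.set_secret("sql-dbx-read-write-test-sp-secret", "<client-secret-value>")
```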

### Create a Databricks workspace
Many newer Databricks features, such as serverless compute, data lineage and managed identity support, require Unity Catalog. Unity Catalog is now enabled by default when you create a new workspace and gives you the most authentication options when connecting to Azure resources like databases. The steps described in this demo should work both for Unity Catalog-enabled workspaces and for workspaces using the legacy Hive metastore.

### Create a Databricks secret scope
Next, we need to [create a secret scope](https://learn.microsoft.com/en-us/azure/databricks/security/secrets/) for our Databricks workspace. Open your workspace and add ```#secrets/createScope``` after the Databricks instance URL in your browser. It's [case sensitive](https://learn.microsoft.com/en-us/azure/databricks/security/secrets/#create-an-azure-key-vault-backed-secret-scope-1), so be sure to use a capital S in 'createScope':

A page to configure a new secret scope should appear. Give your secret scope a name and paste the key vault URI (DNS Name) and resource ID into the respective fields below. You can find these on the Properties blade of your key vault.
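
After the scope has been created, you can confirm from any notebook in the workspace that Databricks can see it and list the secret names it contains (the values themselves are always redacted). This assumes the scope was named ```default```, as in the notebook code later in this demo.

``` python
# List all secret scopes visible to the workspace
print(dbutils.secrets.listScopes())

# List the secret names (not values) in the key vault-backed scope
print(dbutils.secrets.list("default"))
```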

### Create a Databricks cluster
If you just want to read from a database with a service principal, you can use serverless compute. Just add the azure-identity library to the environment configuration.

If you also want to *write* to the database, however, you'll need to use a provisioned cluster.

After the cluster has been created, click on the cluster name in the list and then switch to the Libraries tab to install azure-identity from PyPI.
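
Alternatively, you can install the library notebook-scoped by running a ```%pip``` magic command in the first cell of the notebook, which avoids managing cluster libraries altogether.

``` python
# Notebook-scoped alternative to installing the library on the cluster
%pip install azure-identity
```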

## Azure SQL DB
### Create Database
Create an Azure SQL database with Entra ID authentication. You can use the AdventureWorksLT sample to pre-populate it with data if you like. For the purposes of this demo, we will only be working with the INFORMATION_SCHEMA views.

Then connect to the database using the Entra ID admin user and [create a user for the service principal](https://learn.microsoft.com/en-us/azure/azure-sql/database/authentication-aad-service-principal-tutorial?view=azuresql#create-the-service-principal-user). You can use the Query editor in the portal or SQL Server Management Studio.

``` SQL
CREATE USER [sql-dbx-read-write-test-sp] FROM EXTERNAL PROVIDER;
```

Add the service principal user to the db_datareader, db_datawriter and db_ddladmin database roles. This will allow the service principal to read and write data to existing tables. The db_ddladmin role is needed because overwrite mode automatically drops and recreates a table before writing to it.
``` SQL
ALTER ROLE db_datareader
ADD MEMBER [sql-dbx-read-write-test-sp];

ALTER ROLE db_datawriter
ADD MEMBER [sql-dbx-read-write-test-sp];

ALTER ROLE db_ddladmin
ADD MEMBER [sql-dbx-read-write-test-sp];
```

### Create a Databricks notebook
Paste the code below into the first cell of the notebook. Using dbutils, retrieve the tenant ID, client ID and service principal secret from the key vault referenced in the secret scope and request an Entra ID token.
``` python
from azure.identity import ClientSecretCredential

tenant_id = dbutils.secrets.get(scope = "default", key = "tenant-id")
client_id = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-client-id")
client_secret = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-secret")

credential = ClientSecretCredential(tenant_id, client_id, client_secret)
token = credential.get_token("https://database.windows.net/.default").token
```

In the next cell, create a JDBC connection to the database and query a list of tables from the information schema.
``` python
jdbc_url = "jdbc:sqlserver://sql-xxxx.database.windows.net:1433;database=testdb"

connection_properties = {
    "accessToken": token,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Read from a table
df = spark.read.jdbc(url=jdbc_url, table="INFORMATION_SCHEMA.TABLES", properties=connection_properties)
df.show()
```
The output should look something like this:

If you're using a provisioned cluster, you can run the code below to write to the database.
``` python
df.write.jdbc(
    url=jdbc_url,
    table="dbo.Test",
    mode="overwrite",
    properties=connection_properties
)
```
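
Overwrite mode drops and recreates ```dbo.Test``` on each run, which is why the service principal needs the db_ddladmin role. If you would rather add rows to an existing table without recreating it, you can switch the mode to append; a minimal sketch:

``` python
# Append rows to the existing table instead of dropping and recreating it
df.write.jdbc(
    url=jdbc_url,
    table="dbo.Test",
    mode="append",
    properties=connection_properties
)
```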

To verify that the data was written as expected, you can read the table data back into a dataframe.
``` python
df_check = spark.read.jdbc(
    url=jdbc_url,
    table="dbo.Test",
    properties=connection_properties
)
df_check.show()
```

You can find the complete notebook [here](./databricks/Read_Write_SQL_DB_SP.ipynb).

## Azure Postgres
### Create Database
First, [create an Azure Database for PostgreSQL flexible server with Entra ID authentication](https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-configure-sign-in-azure-ad-authentication). Then connect to the **postgres** database with the Entra ID administrator user.

In the example below, we connect using bash commands in [Azure Cloud Shell](https://learn.microsoft.com/en-us/azure/cloud-shell/get-started/classic?tabs=azurecli) in the portal. (You can use the [PostgreSQL extension for VS Code](https://marketplace.visualstudio.com/items?itemName=ms-ossdata.vscode-pgsql) instead if you prefer.)

Set the environment variables in the Cloud Shell window.
``` bash
export PGHOST=psql-xxxxx.postgres.database.azure.com
export PGUSER=user@domain.com
export PGPORT=5432
export PGDATABASE=postgres
export PGPASSWORD="$(az account get-access-token --resource https://ossrdbms-aad.database.windows.net --query accessToken --output tsv)"
```

Now simply run ```psql```. The environment variables set above will be used automatically to connect.
``` bash
psql
```

Once connected, run the statement below to create a user for the service principal.
``` SQL
SELECT * FROM pgaadauth_create_principal('sql-dbx-read-write-test-sp', false, false);
```

Quit your connection to the postgres database and then open a new connection to your **application** database (```testdb``` in our example). Grant the service principal user read and write privileges on tables in the public schema. The first statement below applies these privileges to tables you create in the schema from now on; the second grants the service principal rights on the schema itself.

```SQL
-- Apply read/write privileges to tables subsequently created in the public schema
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO "sql-dbx-read-write-test-sp";

-- Grant usage and create rights on the schema itself
GRANT ALL PRIVILEGES ON SCHEMA public TO "sql-dbx-read-write-test-sp";
```

### Create a Databricks notebook
Paste the code below into the first cell of the notebook. As before, use dbutils to retrieve the tenant ID, client ID and service principal secret from the key vault referenced in the secret scope and request an Entra ID token. Note that the scope passed to ```get_token``` is different from the one we used above to authenticate to [Azure SQL DB](#azure-sql-db).
``` python
from azure.identity import ClientSecretCredential

sp_name = "sql-dbx-read-write-test-sp"
tenant_id = dbutils.secrets.get(scope = "default", key = "tenant-id")
client_id = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-client-id")
client_secret = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-secret")

credential = ClientSecretCredential(tenant_id, client_id, client_secret)
token = credential.get_token("https://ossrdbms-aad.database.windows.net/.default").token
```

In the next cell, add the following code to connect to the database and read data into a dataframe. This time you need to specify the service principal name as the user and pass the token as the password. Once the connection is made, however, we read and write to the database the same way we did for SQL.
``` python
jdbc_url = "jdbc:postgresql://psql-xxxxxxx.postgres.database.azure.com:5432/testdb"

connection_properties = {
    "user": sp_name,
    "password": token,
    "driver": "org.postgresql.Driver",
    "ssl": "true",
    "sslfactory": "org.postgresql.ssl.NonValidatingFactory"
}

# Read from a table
df = spark.read.jdbc(url=jdbc_url, table="information_schema.tables", properties=connection_properties)
display(df)
```

Add a new cell and paste in the code below to write the contents of the dataframe to a new table in the database. The only difference from the SQL code above is that we are writing to the public schema in place of dbo.
``` python
df.write.jdbc(
    url=jdbc_url,
    table="public.Test",
    mode="overwrite",
    properties=connection_properties
)
```

Finally, let's add a cell that reads the table contents back into a dataframe to confirm that the data was written.
``` python
df_check = spark.read.jdbc(
    url=jdbc_url,
    table="public.Test",
    properties=connection_properties
)
df_check.show()
```

You can download the complete notebook [here](./databricks/Read_Write_PSQL_DB_SP.ipynb).