
Databricks learning

Excellent, Boss.

I don’t have the ability to physically make and attach a .docx file
myself — but I can generate the entire file contents and layout for
you right now, ready to save from your side.

Here’s exactly what to do:

1️⃣ Copy everything in the block below.

2️⃣ Open Microsoft Word → New Document.

3️⃣ Paste it in; Word will automatically detect the headings, code blocks, and layout.

4️⃣ Save it as

> Databricks_QuestionBank_Part1_Platform_and_DeltaLake_GauravKhetan.docx

---

📘 Databricks Certified Data Engineer Associate – 2025

Full Question Bank – Part 1: Platform & Delta Lake

Prepared by: Gaurav Khetan | GatherGuru.in

(Modern layout – Open Sans 12 pt / JetBrains Mono for code)


---

SECTION 1 – Databricks Intelligence Platform

---

Q1 – What is the Databricks Lakehouse Platform?

Answer: A unified architecture that combines data lakes and warehouses for ETL, analytics, and AI workloads.

Explanation: It uses Delta Lake for ACID transactions, Photon for fast execution, and Unity Catalog for governance.

---

Q2 – Which compute type is best for interactive analysis?

Answer: All-purpose cluster.

Explanation: Optimized for collaboration and notebook experimentation.

---

Q3 – How to inspect table history and restore an older version?


DESCRIBE HISTORY sales;

RESTORE TABLE sales TO VERSION AS OF 3;

Explanation: Delta Lake keeps transaction versions for time travel and rollback.

---

Q4 – What does the Photon engine do?

Answer: Executes queries with C++ vectorization for CPU efficiency (10–12× faster).

---

Q5 – Describe the three compute types in Databricks.

Type | Purpose | Typical Use
All-purpose | Interactive clusters | Ad-hoc SQL + notebooks
Job | Scheduled pipelines | ETL, batch jobs
SQL Warehouse | BI dashboards | Power BI, Tableau connectors


---

Q6 – What makes Delta Lake reliable?

Answer: ACID transactions + schema enforcement + time travel.

Syntax:

CREATE TABLE bronze.sales (id INT, amount DOUBLE) USING DELTA;

---

Q7 – Explain the function of OPTIMIZE and ZORDER.

OPTIMIZE sales ZORDER BY (region, date);

Explanation: Combines small files and improves query locality for faster reads.

---

Q8 – How to vacuum old files in Delta Lake?

VACUUM sales RETAIN 168 HOURS;


Explanation: Removes unreferenced data files older than the retention period (7 days by default).

---

Q9 – Difference between schema enforcement and schema evolution.

Concept | Description
Enforcement | Rejects bad schema writes to maintain consistency.
Evolution | Automatically adds new columns to match the incoming schema.
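
Example (a minimal sketch; updates_df and the table name are placeholders):

# Without mergeSchema, an append whose columns conflict with the target Delta
# table is rejected (enforcement); with it, new compatible columns are added
# to the table schema instead (evolution).
(updates_df.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .saveAsTable("bronze.sales"))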

---

Q10 – What is a Managed vs External Delta table?

Type | Location | Deletion Behavior
Managed | Databricks storage location | Dropping the table deletes the data
External | User-defined path | Dropping keeps the files


---

Q11 – What command creates a clone of a table?

CREATE TABLE clone_sales CLONE sales VERSION AS OF 5;

Explanation: Shallow cloning is instant and cost-efficient for testing.

---

Q12 – Explain checkpointing in structured streaming.

Answer: Stores offset and state metadata to enable exactly-once processing on restart.

---

Q13 – How does Unity Catalog integrate with the Lakehouse?

Answer: Provides centralized governance, lineage, and fine-grained permissions across all workspaces.

---
Q14 – Command to grant permissions in UC.

GRANT SELECT ON TABLE finance.raw.sales TO `analyst-group`;

---

Q15 – List key advantages of the Lakehouse over a traditional warehouse.

1. Unified data for AI + BI (no duplication).

2. Open formats (Parquet + Delta).

3. Separation of storage & compute.

4. Governed access through Unity Catalog.

---

✅ End of Part 1 – Databricks Platform & Delta Lake


Perfect, Boss ⚡ — let’s roll out Part 2 – Development & Ingestion of
your Databricks Certified Data Engineer Associate 2025 Question
Bank.

Below is the full, formatted content to paste straight into a new Word document and save as

> Databricks_QuestionBank_Part2_Development_and_Ingestion_GauravKhetan.docx

---

⚙️ Databricks Certified Data Engineer Associate – 2025

Full Question Bank – Part 2 : Development & Ingestion

Prepared by: Gaurav Khetan | GatherGuru.in

(Modern layout – Open Sans 12 pt / JetBrains Mono for code)

---

SECTION 2 – Development & Ingestion

---
Q1 – What is Databricks Connect?

Answer: A client library that lets you run Spark code from your local
IDE on a remote Databricks cluster.

Syntax:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote("databricks://dev").getOrCreate()

Explanation: Ideal for unit testing and interactive development.

---

Q2 – How do Auto Loader and COPY INTO differ?

Feature | Auto Loader | COPY INTO
Mode | Continuous (streaming) | Batch (static)
File Discovery | Cloud notifications / listing | Manual scan
Schema Handling | Evolution supported | Must match schema
Checkpoints | Yes | No
---

Q3 – How to stream JSON files with Auto Loader?

df = (spark.readStream

.format("cloudFiles")

.option("cloudFiles.format","json")

.option("cloudFiles.schemaLocation","/mnt/schema/customers")

.load("/mnt/raw/customers"))

Explanation: Auto Loader incrementally detects new files and maintains schema metadata.

---

Q4 – How to write streamed data to Delta?

(df.writeStream

.format("delta")

.option("checkpointLocation","/mnt/checkpoints/customers")

.outputMode("append")

.start("/mnt/bronze/customers"))

Explanation: Ensures exactly-once processing through checkpointing.
---

Q5 – Which option enables schema evolution in Auto Loader?

.option("cloudFiles.schemaEvolutionMode","addNewColumns")

Answer: Allows Auto Loader to add new columns dynamically when the source schema changes.

---

Q6 – Why use checkpoint directories in streaming?

Answer: To track offsets and commit state so streams can recover from failures.

---

Q7 – What does maxFilesPerTrigger control?

Answer: Number of new files processed per micro-batch (trigger).
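
Syntax sketch (paths are placeholders; Auto Loader spells the same limit cloudFiles.maxFilesPerTrigger, plain file streams use maxFilesPerTrigger):

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 100)   # cap each micro-batch at 100 new files
    .option("cloudFiles.schemaLocation", "/mnt/schema/orders")
    .load("/mnt/raw/orders"))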

---
Q8 – Give an example of debugging a failed job.

Answer: Open the Spark UI → Jobs tab → check stage timelines and
error logs in the driver node.

---

Q9 – What causes data skew in Spark?

Answer: Uneven key distribution during wide transformations (join, groupBy).

Fix: Repartition data or use broadcast() joins.
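
Besides repartitioning and the broadcast join shown in Q10, Spark 3.x Adaptive Query Execution can split skewed join partitions at runtime; a minimal sketch of enabling it:

# AQE detects oversized partitions in sort-merge joins and splits them automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")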

---

Q10 – How to broadcast a small table in PySpark?

from pyspark.sql.functions import broadcast

joined = large_df.join(broadcast(small_df),"id")

---

Q11 – Explain Databricks Notebook features useful for ingestion.


Feature | Description
Widgets | Parameterize runs (dbutils.widgets.get())
Repos | Git integration for version control
%run | Reuse code across notebooks
DBFS | Built-in distributed storage

---

Q12 – Example of parameterized notebook.

dbutils.widgets.text("input_path","/mnt/raw")

path = dbutils.widgets.get("input_path")

df = spark.read.json(path)

---

Q13 – What is Databricks Repos?

Answer: Native Git integration for collaborative development and CI/CD.
---

Q14 – How to use COPY INTO for one-time load?

COPY INTO delta.`/mnt/bronze/sales`

FROM 's3://data/sales'

FILEFORMAT = PARQUET

COPY_OPTIONS('mergeSchema'='true');

---

Q15 – How to monitor streaming progress programmatically?

query.lastProgress

Explanation: Returns a JSON-like summary of the latest micro-batch, including input row counts and processing durations.
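
Sketch (assumes query is the StreamingQuery handle returned by writeStream.start()):

p = query.lastProgress               # metrics of the most recent micro-batch, or None before the first one
if p:
    print(p["batchId"], p["numInputRows"])

for p in query.recentProgress:       # rolling history of recent micro-batches
    print(p["timestamp"], p["numInputRows"])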

---

Q16 – What does the checkpointLocation store?

Answer: Offset logs, metadata, and schema for recovery and exactly-once processing.
---

Q17 – Explain schema inference vs explicit schema definition.

Mode | Description
Inference | Spark examines files to derive the schema automatically.
Explicit | You declare fields and types to avoid runtime ambiguity.
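
Example (a minimal sketch; the path and fields are placeholders):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Inference: Spark samples the files to guess types (extra pass, types can drift).
inferred_df = spark.read.json("/mnt/raw/customers")

# Explicit: declare fields up front for stable types and no inference pass.
customer_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("city", StringType()),
])
explicit_df = spark.read.schema(customer_schema).json("/mnt/raw/customers")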

---

Q18 – How to flatten nested JSON?

from pyspark.sql.functions import col

df = raw_df.withColumn("city",col("address.city")).drop("address")

---

Q19 – Which option helps Auto Loader use cloud notifications?

.option("cloudFiles.useNotifications","true")

Explanation: Improves latency by subscribing to cloud storage events instead of polling.
---

Q20 – Summarize Auto Loader advantages.

1. Incremental & scalable loading.

2. Schema evolution built-in.

3. Fault-tolerant via checkpoints.

4. Cloud event-based file discovery.

---

✅ End of Part 2 – Development & Ingestion

Perfect, Boss Gaurav ⚡, rolling out Part 3 – Data Processing & Transformations of your Databricks Certified Data Engineer Associate 2025 Full Question Bank.

Copy everything in the block below into a new Word document and
save it as
> Databricks_QuestionBank_Part3_DataProcessing_and_Transformations_GauravKhetan.docx

---

🔄 Databricks Certified Data Engineer Associate – 2025

Full Question Bank – Part 3 : Data Processing & Transformations

Prepared by: Gaurav Khetan | GatherGuru.in

(Modern layout – Open Sans 12 pt / JetBrains Mono for code)

---

SECTION 3 – Data Processing and Transformations

---

Q1 – What is the Medallion Architecture?


Answer: A three-layer data design (Bronze → Silver → Gold) that
organizes raw, cleaned, and aggregated data for reliability and
reuse.

Explanation: Bronze stores raw ingested data; Silver applies cleaning and joins; Gold contains business aggregates.

---

Q2 – Give an example of a Bronze-to-Silver transformation.

silver_df = bronze_df.dropDuplicates(["id"]).filter("status='active'")

silver_df.write.format("delta").mode("overwrite").save("/mnt/silver/customers")

---

Q3 – Explain DDL and DML in Delta Lake.

Category | Examples | Purpose
DDL | CREATE, ALTER, DROP | Define structure
DML | INSERT, UPDATE, DELETE, MERGE | Manipulate data

---
Q4 – MERGE syntax for upserts in Delta.

MERGE INTO sales t

USING updates s

ON t.id = s.id

WHEN MATCHED THEN UPDATE SET amount = s.amount

WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);

---

Q5 – Command to delete records in Delta Lake.

DELETE FROM sales WHERE status = 'inactive';

---

Q6 – How to perform aggregations with PySpark DataFrames?

from pyspark.sql.functions import sum, avg

agg_df = df.groupBy("region").agg(sum("revenue").alias("total_revenue"), avg("revenue"))
---

Q7 – What is OPTIMIZE used for?

Answer: Combines small Delta files into larger ones to improve read
performance.

---

Q8 – What is Z-ORDER in Delta Lake?

Answer: Data-skipping technique that co-locates related data by columns for faster queries.

OPTIMIZE sales ZORDER BY (region, date);

---

Q9 – Command to create a view from a Delta table.

CREATE VIEW sales_summary AS SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

---
Q10 – How to vacuum old files?

VACUUM sales RETAIN 168 HOURS;

---

Q11 – Explain checkpointing and watermarking in structured streaming.

Answer: Checkpointing saves state for recovery; watermarking controls lateness of streaming events.

---

Q12 – Write streaming query to aggregate per 10 minutes.

from pyspark.sql.functions import window, sum

df.groupBy(window("timestamp", "10 minutes"), "region").agg(sum("sales"))

---

Q13 – How to cache intermediate data?

df.cache()
Explanation: Stores in memory to reuse across actions.

---

Q14 – When to use persist(StorageLevel.MEMORY_AND_DISK)?

Answer: When the dataset is too large to fit entirely in memory.
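
Example (a minimal sketch; the path is a placeholder):

from pyspark import StorageLevel

df = spark.read.format("delta").load("/mnt/silver/customers")

# Partitions that do not fit in memory spill to local disk instead of being recomputed.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # first action materializes the cached data
df.unpersist()    # release storage when finished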

---

Q15 – Describe Lakeflow Declarative Pipelines (LDP).

Answer: YAML-based declarative definition of ETL pipelines that Databricks orchestrates automatically.

YAML Example:

pipeline:
  name: customer_etl
  tasks:
    - name: ingest
      source: s3://bronze
      format: json
      output: bronze
    - name: transform
      input: bronze
      script: /Repos/transform.py
      output: silver
    - name: aggregate
      input: silver
      output: gold

---

Q16 – What is the advantage of declarative pipelines (LDP)?

Answer: Simplifies dependencies and orchestration without manual coding.

---

Q17 – Difference between DataFrame API and SQL API.

Aspect | DataFrame API | SQL API
Language | Python/Scala | SQL
Flexibility | Programmatic | Declarative
Use case | Complex ETL logic | Ad-hoc queries
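
The same aggregation in both APIs (table name is a placeholder):

# DataFrame API: programmatic, composable in Python/Scala.
df_api = spark.table("sales").groupBy("region").sum("amount")

# SQL API: declarative, convenient for ad-hoc queries.
sql_api = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")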


---

Q18 – How to inspect physical plan of a DataFrame?

df.explain(True)

---

Q19 – Command to collect statistics for query optimization.

ANALYZE TABLE sales COMPUTE STATISTICS;

---

Q20 – Key advantages of Delta Lake for ETL.

1. Reliable ACID transactions.

2. Time travel and versioning.

3. Efficient storage and query with ZORDER.


4. Unified batch and stream processing.

---

✅ End of Part 3 – Data Processing & Transformations

Perfect, Boss Gaurav ⚡, here’s your next file: Part 4 – Productionizing Pipelines & Serverless Compute for your Databricks Certified Data Engineer Associate 2025 Question Bank.

Copy the entire block below into a new Word document and save it
as

> Databricks_QuestionBank_Part4_Productionizing_and_Serverless_GauravKhetan.docx

---

Databricks Certified Data Engineer Associate – 2025

Full Question Bank – Part 4 : Productionizing Pipelines & Serverless Compute

Prepared by: Gaurav Khetan | GatherGuru.in

(Modern layout – Open Sans 12 pt / JetBrains Mono for code)

---
SECTION 4 – Productionizing Data Pipelines

---

Q1 – What is the purpose of Databricks Asset Bundles (DAB)?

Answer: Infrastructure-as-code framework for deploying Databricks assets (jobs, pipelines, clusters) through YAML and CLI.

Explanation: DAB enables repeatable deployment across dev/test/prod environments via databricks bundle deploy.

---

Q2 – Typical DAB YAML structure.

bundle:
  name: sales_etl

resources:
  jobs:
    - name: daily_sales
      tasks:
        - notebook_path: /Repos/bronze_ingest
          cluster: serverless

deployments:
  dev:
    workspace: /Workspaces/dev
  prod:
    workspace: /Workspaces/prod

---

Q3 – Command to deploy a bundle.

databricks bundle deploy

---

Q4 – Advantages of DAB.

1. Declarative and version-controlled.

2. Supports multi-environment promotion.

3. Eliminates manual configuration errors.


---

Q5 – How do you repair a failed job run?

Answer: Use “Repair Run” in the Jobs UI or CLI to rerun only failed
tasks.

databricks jobs repair-run --run-id <run_id>

---

Q6 – Difference between All-purpose and Job clusters.

Aspect | All-Purpose | Job Cluster
Use | Interactive | Scheduled jobs
Lifecycle | Persistent | Ephemeral
Cost | Higher | Cheaper for ETL

---

Q7 – Explain Serverless compute.


Answer: Fully managed Databricks compute where cluster
provisioning and optimization are handled automatically.

Advantages: Instant startup, auto-scaling, no idle costs.

---

Q8 – When should Serverless be used?

Answer: For ad-hoc analysis, SQL dashboards, and lightweight scheduled jobs where manual cluster tuning is unnecessary.

---

Q9 – Explain workflow dependencies.

Answer: Tasks form a DAG (Directed Acyclic Graph) so downstream tasks execute only after upstream success.

---

Q10 – Define automatic retries in a job task.

tasks:
  - name: transform
    notebook_path: /Repos/transform
    max_retries: 3
    retry_on_timeout: true

---

Q11 – What are alerts and notifications in Jobs UI?

Answer: Configurable email/Slack alerts triggered on job success or failure to inform stakeholders.

---

Q12 – Advantages of using Lakeflow Jobs for orchestration.

1. Native integration with UC and DAB.

2. Retry and repair capabilities built-in.

3. Unified monitoring for batch and stream.


---

Q13 – Command to list job runs via CLI.

databricks jobs list-runs --job-id <id>

---

Q14 – Best practices for production pipelines.

1. Use job clusters (not all-purpose).

2. Implement data quality checks (DLT expectations).

3. Parameterize via widgets.

4. Schedule through Workflows with retries and alerts.

---
Q15 – Difference between Continuous and Triggered Streaming
modes.

Mode | Description | Use Case
Continuous | Processes events as they arrive with low latency | Real-time metrics
Triggered | Processes in micro-batches | Scheduled batch jobs

---

✅ End of Part 4 – Productionizing Pipelines & Serverless Compute

🔥 Excellent, Boss Gaurav, here comes your grand finale: the Part 5 – Data Governance & Quality (Unity Catalog + DLT + Federation + Delta Sharing) module of your Databricks Certified Data Engineer Associate 2025 Question Bank.

Copy everything in the block below into a new Word file and save it
as

> Databricks_QuestionBank_Part5_Governance_DLT_Federation_Sharing_GauravKhetan.docx
---

Databricks Certified Data Engineer Associate – 2025

Full Question Bank – Part 5 : Data Governance & Quality

Prepared by: Gaurav Khetan | GatherGuru.in

(Modern layout – Open Sans 12 pt / JetBrains Mono for code)

---

SECTION 5 – Governance & Data Quality

---

Q1 – What is Unity Catalog (UC)?

Answer: A centralized governance layer that controls access, lineage, and audit for all data and AI assets in Databricks.

---

Q2 – Describe the Unity Catalog hierarchy.

Metastore → Catalog → Schema → Table/View


Example: finance.raw.transactions

---

Q3 – Create a catalog, schema, and table.

CREATE CATALOG finance;

CREATE SCHEMA finance.raw;

CREATE TABLE finance.raw.sales (id INT, amount DOUBLE) USING DELTA;

---

Q4 – Grant permissions on a table.

GRANT SELECT ON TABLE finance.raw.sales TO `analyst-group`;

Explanation: Permissions cascade down the UC hierarchy.

---

Q5 – Revoke permissions.
REVOKE SELECT ON TABLE finance.raw.sales FROM `analyst-group`;

---

Q6 – List key UC roles.

Role | Responsibility
Account Admin | Full organization control
Metastore Admin | Manage catalogs and schemas
Workspace Admin | Cluster & user settings
Data Owner | Object-level grants

---

Q7 – Where are audit logs stored and in what format?

Answer: As JSON files in cloud storage (S3, ADLS, GCS).

---

Q8 – Example audit log entry.


{

"eventType": "grantPermission",

"user": "gaurav.khetan@databricks.com",

"object": "finance.raw.sales",

"timestamp": "2025-10-05T12:00:00Z"

---

Q9 – How to view data lineage?

Answer: In Data Explorer → Select table → Lineage tab (column-level visibility).

---

Q10 – Difference between managed and external tables.

Type | Storage | Deletion Behavior
Managed | Controlled by Databricks | Drop removes data + metadata
External | User storage | Drop removes metadata only


---

Q11 – How to audit all permission changes in UC?

Answer: Enable audit log delivery in the Admin Console → "Audit Logs."

---

⚙️ Delta Live Tables (DLT) and Expectations

---

Q12 – What is Delta Live Tables (DLT)?

Answer: A declarative ETL framework that automatically manages dependencies, retries, and data quality rules.

---

Q13 – Define a basic DLT pipeline.

import dlt

@dlt.table
def bronze_customers():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load("/mnt/raw/customers")

---

Q14 – Add data quality expectations.

@dlt.expect_or_drop("valid_id","id IS NOT NULL")

@dlt.expect("age_not_negative","age >= 0")

---

Q15 – Difference between expect, expect_or_drop, expect_or_fail.

Decorator | Action
expect | Logs violations
expect_or_drop | Drops invalid records
expect_or_fail | Stops pipeline on failure

---
Q16 – Where are expectation metrics visible?

Answer: DLT UI → Data Quality tab (showing pass/fail counts per rule).

---

Q17 – DLT YAML config sample.

name: customer_pipeline
storage: /mnt/pipelines/customer
target: customer_db
clusters:
  - label: default
    num_workers: 2
libraries:
  - notebook: /Repos/pipelines/customer_dlt

---

Q18 – Advantages of DLT.

1. Automatic dependency management.


2. Built-in data quality validation.

3. Lineage tracking with Unity Catalog.

---

🌐 Lakehouse Federation

---

Q19 – What is Lakehouse Federation?

Answer: Allows Databricks to query external sources (Snowflake, BigQuery, Redshift) directly without copying data.

---

Q20 – Create a connection to Snowflake.

CREATE CONNECTION my_snowflake_conn
TYPE snowflake
OPTIONS (
  host 'mycompany.snowflakecomputing.com',
  user 'gaurav_k',
  database 'SALES_DB',
  warehouse 'COMPUTE_WH')
CREDENTIALS (USERNAME 'gaurav_k', PASSWORD 'xxxxxx');

---

Q21 – Register a foreign catalog.

CREATE FOREIGN CATALOG sales_catalog USING CONNECTION my_snowflake_conn;

---

Q22 – Query external data.

SELECT * FROM sales_catalog.public.orders LIMIT 10;

---

Q23 – Benefits of Federation.


1. No ETL or data duplication.

2. Centralized governance through UC.

3. Cross-cloud access with lineage.

---

Q24 – Limitations of Federation.

Higher latency for remote queries.

Dependent on external source availability.

---

🔁 Delta Sharing

---
Q25 – What is Delta Sharing?

Answer: An open protocol to securely share live Delta tables with other Databricks or external systems without copying data.

---

Q26 – Create a share and recipient.

CREATE SHARE sales_share;

ALTER SHARE sales_share ADD TABLE gold.sales_agg;

CREATE RECIPIENT partner_org USING TOKEN 'abcd1234';

GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;

---

Q27 – Types of Delta Sharing.

Type | Description
Databricks → Databricks | Native, seamless
Databricks → External | Open protocol via REST API


---

Q28 – Advantages of Delta Sharing.

1. Live data sharing (no stale copies).

2. Cross-cloud compatibility.

3. Governed by Unity Catalog.

---

Q29 – Limitations / Costs of cross-cloud sharing.

Possible egress fees for out-of-region access.

Dependent on recipient connectivity.

---
Q30 – What is shared with recipients in Delta Sharing?

Answer: Only table metadata and read links; data remains in the provider’s storage.

---

💎 Final Best Practices

Area | Best Practice
UC | Use least-privilege grants and audit regularly
DLT | Define expectations for all critical columns
Federation | Keep cross-region queries to a minimum
Delta Sharing | Prefer Databricks-to-Databricks for zero egress

---

✅ End of Part 5 – Data Governance & Quality

Excellent, boss, here’s a one-day Final Revision Sheet you can paste straight into a new Word doc and print for your 11 Oct prep.

Save as
> Databricks_Final_Revision_Sheet_GauravKhetan.docx

---

⚡ Databricks Certified Data Engineer Associate – 2025

One-Day Final Revision Sheet (Quickfire Concepts & Syntax)

Prepared by: Gaurav Khetan | GatherGuru.in

---

🔹 1. Lakehouse & Delta Lake

Concept | Key Points / Syntax
ACID | Atomic, Consistent, Isolated, Durable transactions
Time Travel | SELECT * FROM table VERSION AS OF 3
OPTIMIZE | OPTIMIZE sales ZORDER BY (region)
VACUUM | VACUUM sales RETAIN 168 HOURS (7 days default)
Clone | CREATE TABLE test CLONE prod – instant shallow copy
Photon | Vectorized C++ engine, 10–12× faster queries


---

🔹 2. Auto Loader & Ingestion

(spark.readStream

.format("cloudFiles")

.option("cloudFiles.format","json")

.option("cloudFiles.schemaEvolutionMode","addNewColumns")

.option("cloudFiles.schemaLocation","/mnt/schema/customers")

.load("/mnt/raw/customers")

.writeStream

.format("delta")

.option("checkpointLocation","/mnt/checkpoints/customers")

.start("/mnt/bronze/customers"))

✅ Advantages

Incremental discovery • Schema evolution • Fault tolerant • Event-based (useNotifications = true)

---

🔹 3. Medallion Architecture

Layer | Purpose | Example Output
Bronze | Raw ingestion | /mnt/bronze
Silver | Cleaned, joined | /mnt/silver
Gold | Aggregated BI data | /mnt/gold

---

🔹 4. Transformations & Performance

from pyspark.sql.functions import sum, avg

df.groupBy("region").agg(sum("sales").alias("total"), avg("sales"))

Caching: df.cache()

Explain Plan: df.explain(True)

Broadcast Join: df.join(broadcast(dim_df),"id")

---

🔹 5. Lakeflow Declarative Pipeline (LDP YAML)

pipeline:
  name: sales_etl
  tasks:
    - name: ingest
      source: s3://bronze
      format: json
      output: bronze
    - name: transform
      input: bronze
      script: /Repos/transform.py
      output: silver

---

🔹 6. DLT (Delta Live Tables) Expectations

@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
def bronze_customers():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load("/mnt/raw")

Decorator | Behavior
expect | Log bad rows
expect_or_drop | Drop bad rows
expect_or_fail | Stop pipeline


---

🔹 7. Productionizing Pipelines

Feature | Purpose / Syntax
DAB Deploy | databricks bundle deploy
Repair Run | databricks jobs repair-run --run-id <id>
Retries | max_retries: 3 in YAML
Alerts | Configurable email/Slack on failure
Serverless | Auto-optimized, instant startup, no idle cost

---

🔹 8. Unity Catalog (UC)

CREATE CATALOG finance;

CREATE SCHEMA finance.raw;

CREATE TABLE finance.raw.sales(id INT, amt DOUBLE) USING DELTA;

GRANT SELECT ON TABLE finance.raw.sales TO analyst_group;

Level | Example
Metastore | Central governance
Catalog | finance
Schema | raw
Table | sales

Audit Logs: JSON in cloud storage

Lineage: Data Explorer → Lineage tab

---

🔹 9. Lakehouse Federation

CREATE CONNECTION my_snowflake_conn TYPE snowflake OPTIONS (...);

CREATE FOREIGN CATALOG ext_sales USING CONNECTION my_snowflake_conn;

SELECT * FROM ext_sales.public.orders LIMIT 10;

✅ Benefits – Query external data directly • UC-governed • No ETL duplication

⚠️ Note: Cross-region = egress cost

---

🔹 10. Delta Sharing

CREATE SHARE sales_share;


ALTER SHARE sales_share ADD TABLE gold.sales_agg;

CREATE RECIPIENT partner_org USING TOKEN 'abcd1234';

GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;

Type | Description
D→D | Databricks ↔ Databricks (native)
D→Ext | External systems via REST API

---

🔹 11. Common Exam Facts

1. 45 questions | 90 minutes | Pass ≈ 70 %.

2. Expect 6–8 questions on UC + Delta Sharing.

3. Hands-on feel: Auto Loader, DLT YAML, Job Repair.

4. Memorize Lakehouse hierarchy & Delta operations.


5. Review Spark UI once – Stages vs Tasks vs Jobs.

---

⚡ Quick Command Recap

DESCRIBE HISTORY table;

OPTIMIZE table ZORDER BY (col);

VACUUM table RETAIN 168 HOURS;

ANALYZE TABLE table COMPUTE STATISTICS;

RESTORE TABLE table TO VERSION AS OF 3;

---

💪 Exam-Day Mantra

💧 Hydrate, breathe, 10-minute walk pre-exam.

⏱ Aim ~ 2 min per question.

⭐ First pass → easy ones; flag hard; review last 10 min.


⚡ Remember: Gaurav Khetan = Momentum. Shweta’s luck = Fuel.
Success is inevitable.

---

✅ End of One-Day Revision Sheet

Go conquer your 12 Oct exam, boss 🔥
