Databricks learning
Excellent, Boss.
I don’t have the ability to physically make and attach a .docx file
myself — but I can generate the entire file contents and layout for
you right now, ready to save from your side.
Here’s exactly what to do:
1️⃣ Copy everything in the block below.
2️⃣ Open Microsoft Word → New Document.
3️⃣ Paste it in — Word will automatically detect the headings, code
blocks, and layout.
4️⃣ Save it as
> Databricks_QuestionBank_Part1_Platform_and_DeltaLake_GauravKhetan.docx
---
📘 Databricks Certified Data Engineer Associate – 2025
Full Question Bank – Part 1: Platform & Delta Lake
Prepared by: Gaurav Khetan | GatherGuru.in
(Modern layout – Open Sans 12 pt / JetBrains Mono for code)
---
SECTION 1 – Databricks Intelligence Platform
---
Q1 – What is the Databricks Lakehouse Platform?
Answer: A unified architecture that combines data lakes and
warehouses for ETL, analytics, and AI workloads.
Explanation: It uses Delta Lake for ACID transactions, Photon for
fast execution, and Unity Catalog for governance.
---
Q2 – Which compute type is best for interactive analysis?
Answer: All-purpose cluster.
Explanation: Optimized for collaboration and notebook
experimentation.
---
Q3 – How to inspect table history and restore an older version?
DESCRIBE HISTORY sales;
RESTORE TABLE sales TO VERSION AS OF 3;
Explanation: Delta Lake keeps transaction versions for time travel
and rollback.
---
Q4 – What does the Photon engine do?
Answer: Executes queries with C++ vectorization for better CPU efficiency (up to roughly 12× faster on some workloads).
---
Q5 – Describe the three compute types in Databricks.
Type | Purpose | Typical Use
All-purpose | Interactive clusters | Ad-hoc SQL + notebooks
Job | Scheduled pipelines | ETL, batch jobs
SQL Warehouse | BI dashboards | Power BI, Tableau connectors
---
Q6 – What makes Delta Lake reliable?
Answer: ACID transactions + schema enforcement + time travel.
Syntax:
CREATE TABLE bronze.sales (id INT, amount DOUBLE) USING DELTA;
---
Q7 – Explain the function of OPTIMIZE and ZORDER.
OPTIMIZE sales ZORDER BY (region, date);
Explanation: Combines small files and improves query locality for
faster reads.
---
Q8 – How to vacuum old files in Delta Lake?
VACUUM sales RETAIN 168 HOURS;
Explanation: Cleans obsolete versions older than 7 days by default.
---
Q9 – Difference between schema enforcement and schema
evolution.
Concept | Description
Enforcement | Rejects writes with a mismatched schema to maintain consistency.
Evolution | Automatically adds new columns to match the incoming schema.
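A minimal PySpark sketch of evolution in practice (df and the target path are placeholder names): writing a DataFrame that carries an extra column succeeds only when mergeSchema is enabled; otherwise enforcement rejects the write.
# Appends data whose DataFrame has a new column; mergeSchema adds that
# column to the Delta table schema instead of failing the write.
(df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/silver/customers"))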
---
Q10 – What is a Managed vs External Delta table?
Type | Location | Deletion Behavior
Managed | Databricks storage location | Dropping the table deletes the data
External | User-defined path | Dropping the table keeps the files
---
Q11 – What command creates a clone of a table?
CREATE TABLE clone_sales SHALLOW CLONE sales VERSION AS OF 5;
Explanation: Shallow cloning is instant and cost-efficient for testing.
---
Q12 – Explain checkpointing in structured streaming.
Answer: Stores offset and state metadata to enable exactly-once
processing on restart.
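A minimal sketch of the idea (events_df and the paths are placeholders): restarting the same query with the same checkpointLocation resumes from the recorded offsets instead of reprocessing the source.
# The checkpoint directory holds offsets and state; reusing it on restart
# lets the stream continue exactly where it stopped.
(events_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start("/mnt/bronze/events"))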
---
Q13 – How does Unity Catalog integrate with the Lakehouse?
Answer: Provides centralized governance, lineage, and fine-grained
permissions across all workspaces.
---
Q14 – Command to grant permissions in UC.
GRANT SELECT ON TABLE finance.raw.sales TO `analyst-group`;
---
Q15 – List key advantages of the Lakehouse over a traditional
warehouse.
1. Unified data for AI + BI (no duplication).
2. Open formats (Parquet + Delta).
3. Separation of storage & compute.
4. Governed access through Unity Catalog.
---
✅ End of Part 1 – Databricks Platform & Delta Lake
Perfect, Boss ⚡ — let’s roll out Part 2 – Development & Ingestion of
your Databricks Certified Data Engineer Associate 2025 Question
Bank.
Below is the full, formatted content to paste straight into a new
Word document and save as
> Databricks_QuestionBank_Part2_Development_and_Ingestion_GauravKhetan.docx
---
⚙️ Databricks Certified Data Engineer Associate – 2025
Full Question Bank – Part 2 : Development & Ingestion
Prepared by: Gaurav Khetan | GatherGuru.in
(Modern layout – Open Sans 12 pt / JetBrains Mono for code)
---
SECTION 2 – Development & Ingestion
---
Q1 – What is Databricks Connect?
Answer: A client library that lets you run Spark code from your local
IDE on a remote Databricks cluster.
Syntax:
from databricks.connect import DatabricksSession

# Workspace host, token, and cluster are read from the local Databricks
# config profile or environment variables.
spark = DatabricksSession.builder.getOrCreate()
Explanation: Ideal for unit testing and interactive development.
---
Q2 – How do Auto Loader and COPY INTO differ?
Feature | Auto Loader | COPY INTO
Mode | Continuous (streaming) | Batch (static)
File Discovery | Cloud notifications / listing | Manual scan
Schema Handling | Evolution supported | Must match schema
Checkpoints | Yes | No
---
Q3 – How to stream JSON files with Auto Loader?
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schema/customers")
    .load("/mnt/raw/customers"))
Explanation: Auto Loader incrementally detects new files and
maintains schema metadata.
---
Q4 – How to write streamed data to Delta?
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/customers")
    .outputMode("append")
    .start("/mnt/bronze/customers"))
Explanation: Ensures exactly-once processing through
checkpointing.
---
Q5 – Which option enables schema evolution in Auto Loader?
.option("cloudFiles.schemaEvolutionMode","addNewColumns")
Answer: Allows Auto Loader to add new columns dynamically when
source schema changes.
---
Q6 – Why use checkpoint directories in streaming?
Answer: To track offsets and commit state so streams can recover
from failures.
---
Q7 – What does maxFilesPerTrigger control?
Answer: Number of new files processed per micro-batch (trigger).
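For example, a hedged sketch reusing the Q3 source (paths are placeholders):
# Caps each micro-batch at 100 newly discovered files.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 100)
    .load("/mnt/raw/customers"))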
---
Q8 – Give an example of debugging a failed job.
Answer: Open the Spark UI → Jobs tab → check stage timelines and
error logs in the driver node.
---
Q9 – What causes data skew in Spark?
Answer: Uneven key distribution during wide transformations (join,
groupBy).
Fix: Repartition data or use broadcast() joins.
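Two quick sketches of the fix (skewed_df is a placeholder; the broadcast variant follows in Q10):
# Let Adaptive Query Execution split oversized partitions in skewed joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Or spread rows evenly across more partitions before the wide transformation.
balanced_df = skewed_df.repartition(200)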
---
Q10 – How to broadcast a small table in PySpark?
from pyspark.sql.functions import broadcast
joined = large_df.join(broadcast(small_df),"id")
---
Q11 – Explain Databricks Notebook features useful for ingestion.
Feature | Description
Widgets | Parameterize runs (dbutils.widgets.get())
Repos | Git integration for version control
%run | Reuse code across notebooks
DBFS | Built-in distributed storage
---
Q12 – Example of parameterized notebook.
dbutils.widgets.text("input_path","/mnt/raw")
path = dbutils.widgets.get("input_path")
df = spark.read.json(path)
---
Q13 – What is Databricks Repos?
Answer: Native Git integration for collaborative development and
CI/CD.
---
Q14 – How to use COPY INTO for one-time load?
COPY INTO delta.`/mnt/bronze/sales`
FROM 's3://data/sales'
FILEFORMAT = PARQUET
COPY_OPTIONS('mergeSchema'='true');
---
Q15 – How to monitor streaming progress programmatically?
query.lastProgress
Explanation: Returns JSON with input rows and batch duration.
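A small sketch of reading that report (query is the handle returned by writeStream.start()):
# lastProgress is the report for the most recent micro-batch (None before the
# first one completes); recentProgress keeps a short history.
progress = query.lastProgress
if progress is not None:
    print(progress["numInputRows"], progress["durationMs"])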
---
Q16 – What does the checkpointLocation store?
Answer: Offset logs, metadata, and schema for recovery and
exactly-once processing.
---
Q17 – Explain schema inference vs explicit schema definition.
Mode | Description
Inference | Spark examines files to derive the schema automatically.
Explicit | You declare fields and types to avoid runtime ambiguity.
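A short sketch of the explicit approach (field names are illustrative):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

customer_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("city", StringType(), True),
])

# No inference pass over the files; the types are fixed up front.
df = spark.read.schema(customer_schema).json("/mnt/raw/customers")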
---
Q18 – How to flatten nested JSON?
from pyspark.sql.functions import col
df = raw_df.withColumn("city",col("address.city")).drop("address")
---
Q19 – Which option helps Auto Loader use cloud notifications?
.option("cloudFiles.useNotifications","true")
Explanation: Improves latency by subscribing to cloud storage
events instead of polling.
---
Q20 – Summarize Auto Loader advantages.
1. Incremental & scalable loading.
2. Schema evolution built-in.
3. Fault-tolerant via checkpoints.
4. Cloud event-based file discovery.
---
✅ End of Part 2 – Development & Ingestion
Perfect, Boss Gaurav ⚡ — rolling out Part 3 – Data Processing &
Transformations of your Databricks Certified Data Engineer
Associate 2025 Full Question Bank.
Copy everything in the block below into a new Word document and
save it as
> Databricks_QuestionBank_Part3_DataProcessing_and_Transformations_GauravKhetan.docx
---
🔄 Databricks Certified Data Engineer Associate – 2025
Full Question Bank – Part 3 : Data Processing & Transformations
Prepared by: Gaurav Khetan | GatherGuru.in
(Modern layout – Open Sans 12 pt / JetBrains Mono for code)
---
SECTION 3 – Data Processing and Transformations
---
Q1 – What is the Medallion Architecture?
Answer: A three-layer data design (Bronze → Silver → Gold) that
organizes raw, cleaned, and aggregated data for reliability and
reuse.
Explanation: Bronze stores raw ingested data; Silver applies
cleaning and joins; Gold contains business aggregates.
---
Q2 – Give an example of a Bronze-to-Silver transformation.
silver_df = bronze_df.dropDuplicates(["id"]).filter("status = 'active'")
silver_df.write.format("delta").mode("overwrite").save("/mnt/silver/customers")
---
Q3 – Explain DDL and DML in Delta Lake.
Category | Examples | Purpose
DDL | CREATE, ALTER, DROP | Define structure
DML | INSERT, UPDATE, DELETE, MERGE | Manipulate data
---
Q4 – MERGE syntax for upserts in Delta.
MERGE INTO sales t
USING updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);
---
Q5 – Command to delete records in Delta Lake.
DELETE FROM sales WHERE status = 'inactive';
---
Q6 – How to perform aggregations with PySpark DataFrames?
from pyspark.sql.functions import sum, avg
agg_df = df.groupBy("region").agg(
    sum("revenue").alias("total_revenue"), avg("revenue"))
---
Q7 – What is OPTIMIZE used for?
Answer: Combines small Delta files into larger ones to improve read
performance.
---
Q8 – What is Z-ORDER in Delta Lake?
Answer: Data-skipping technique that co-locates related data by
columns for faster queries.
OPTIMIZE sales ZORDER BY (region, date);
---
Q9 – Command to create a view from a Delta table.
CREATE VIEW sales_summary AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
---
Q10 – How to vacuum old files?
VACUUM sales RETAIN 168 HOURS;
---
Q11 – Explain checkpointing and watermarking in structured
streaming.
Answer: Checkpointing saves offsets and state for recovery; watermarking defines how long to wait for late events before finalizing windowed results.
---
Q12 – Write streaming query to aggregate per 10 minutes.
from pyspark.sql.functions import window, sum

df.groupBy(window("timestamp", "10 minutes"), "region").agg(sum("sales"))
---
Q13 – How to cache intermediate data?
df.cache()
Explanation: Stores in memory to reuse across actions.
---
Q14 – When to use persist(StorageLevel.MEMORY_AND_DISK)?
Answer: When dataset is too large for memory only.
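For example, a minimal sketch:
from pyspark import StorageLevel

# Keep what fits in memory and spill the remaining partitions to local disk.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # first action materializes the cached data
df.unpersist()  # release when no longer needed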
---
Q15 – Describe Lakeflow Declarative Pipelines (LDP).
Answer: YAML-based declarative definition of ETL pipelines that
Databricks orchestrates automatically.
YAML Example:
pipeline:
  name: customer_etl
  tasks:
    - name: ingest
      source: s3://bronze
      format: json
      output: bronze
    - name: transform
      input: bronze
      script: /Repos/transform.py
      output: silver
    - name: aggregate
      input: silver
      output: gold
---
Q16 – What is the advantage of declarative pipelines (LDP)?
Answer: Simplifies dependencies and orchestration without manual
coding.
---
Q17 – Difference between DataFrame API and SQL API.
Aspect | DataFrame API | SQL API
Language | Python/Scala | SQL
Flexibility | Programmatic | Declarative
Use case | Complex ETL logic | Ad-hoc queries
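The same aggregation both ways, as a quick sketch (the sales table and its columns are illustrative):
from pyspark.sql.functions import sum

# DataFrame API: programmatic
result_df = spark.table("sales").groupBy("region").agg(sum("amount").alias("total"))

# SQL API: declarative
result_sql = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")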
---
Q18 – How to inspect physical plan of a DataFrame?
df.explain(True)
---
Q19 – Command to collect statistics for query optimization.
ANALYZE TABLE sales COMPUTE STATISTICS;
---
Q20 – Key advantages of Delta Lake for ETL.
1. Reliable ACID transactions.
2. Time travel and versioning.
3. Efficient storage and query with ZORDER.
4. Unified batch and stream processing.
---
✅ End of Part 3 – Data Processing & Transformations
Perfect, Boss Gaurav ⚡ — here’s your next file,
Part 4 – Productionizing Pipelines & Serverless Compute for your
Databricks Certified Data Engineer Associate 2025 Question Bank.
Copy the entire block below into a new Word document and save it
as
> Databricks_QuestionBank_Part4_Productionizing_and_Serverless_GauravKhetan.docx
---
Databricks Certified Data Engineer Associate – 2025
Full Question Bank – Part 4 : Productionizing Pipelines & Serverless Compute
Prepared by: Gaurav Khetan | GatherGuru.in
(Modern layout – Open Sans 12 pt / JetBrains Mono for code)
---
SECTION 4 – Productionizing Data Pipelines
---
Q1 – What is the purpose of Databricks Asset Bundles (DAB)?
Answer: Infrastructure-as-code framework for deploying Databricks
assets (jobs, pipelines, clusters) through YAML and CLI.
Explanation: DAB enables repeatable deployment across
dev/test/prod environments via databricks bundle deploy.
---
Q2 – Typical DAB YAML structure.
bundle:
  name: sales_etl

resources:
  jobs:
    - name: daily_sales
      tasks:
        - notebook_path: /Repos/bronze_ingest
          cluster: serverless

deployments:
  dev:
    workspace: /Workspaces/dev
  prod:
    workspace: /Workspaces/prod
---
Q3 – Command to deploy a bundle.
databricks bundle deploy
---
Q4 – Advantages of DAB.
1. Declarative and version-controlled.
2. Supports multi-environment promotion.
3. Eliminates manual configuration errors.
---
Q5 – How do you repair a failed job run?
Answer: Use “Repair Run” in the Jobs UI or CLI to rerun only failed
tasks.
databricks jobs repair-run --run-id <run_id>
---
Q6 – Difference between All-purpose and Job clusters.
Aspect | All-Purpose | Job Cluster
Use | Interactive | Scheduled jobs
Lifecycle | Persistent | Ephemeral
Cost | Higher | Cheaper for ETL
---
Q7 – Explain Serverless compute.
Answer: Fully managed Databricks compute where cluster
provisioning and optimization are handled automatically.
Advantages: Instant startup, auto-scaling, no idle costs.
---
Q8 – When should Serverless be used?
Answer: For ad-hoc analysis, SQL dashboards, and lightweight
scheduled jobs where manual cluster tuning is unnecessary.
---
Q9 – Explain workflow dependencies.
Answer: Tasks form a DAG (Directed Acyclic Graph) so downstream
tasks execute only after upstream success.
---
Q10 – Define automatic retries in a job task.
tasks:
  - name: transform
    notebook_path: /Repos/transform
    max_retries: 3
    retry_on_timeout: true
---
Q11 – What are alerts and notifications in Jobs UI?
Answer: Configurable email/Slack alerts triggered on job success or
failure to inform stakeholders.
---
Q12 – Advantages of using Lakeflow Jobs for orchestration.
1. Native integration with UC and DAB.
2. Retry and repair capabilities built-in.
3. Unified monitoring for batch and stream.
---
Q13 – Command to list job runs via CLI.
databricks jobs list-runs --job-id <id>
---
Q14 – Best practices for production pipelines.
1. Use job clusters (not all-purpose).
2. Implement data quality checks (DLT expectations).
3. Parameterize via widgets.
4. Schedule through Workflows with retries and alerts.
---
Q15 – Difference between Continuous and Triggered Streaming
modes.
Mode | Description | Use Case
Continuous | Processes events as they arrive with low latency | Real-time metrics
Triggered | Processes in micro-batches | Scheduled batch jobs
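A hedged sketch of how the mode is picked on the writeStream (paths are placeholders):
# Mode is chosen via trigger():
#   .trigger(processingTime="10 seconds")  -> frequent micro-batches (near real time)
#   .trigger(availableNow=True)            -> process what is available, then stop (triggered)
query = (df.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/metrics")
    .trigger(availableNow=True)
    .start("/mnt/gold/metrics"))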
---
✅ End of Part 4 – Productionizing Pipelines & Serverless Compute
🔥 Excellent, Boss Gaurav — here comes your grand finale,
the Part 5 – Data Governance & Quality (Unity Catalog + DLT +
Federation + Delta Sharing) module of your Databricks Certified
Data Engineer Associate 2025 Question Bank.
Copy everything in the block below into a new Word file and save it
as
> Databricks_QuestionBank_Part5_Governance_DLT_Federation_Sharing_GauravKhetan.docx
---
Databricks Certified Data Engineer Associate – 2025
Full Question Bank – Part 5 : Data Governance & Quality
Prepared by: Gaurav Khetan | GatherGuru.in
(Modern layout – Open Sans 12 pt / JetBrains Mono for code)
---
SECTION 5 – Governance & Data Quality
---
Q1 – What is Unity Catalog (UC)?
Answer: A centralized governance layer that controls access,
lineage, and audit for all data and AI assets in Databricks.
---
Q2 – Describe the Unity Catalog hierarchy.
Metastore → Catalog → Schema → Table/View
Example: finance.raw.transactions
---
Q3 – Create a catalog, schema, and table.
CREATE CATALOG finance;
CREATE SCHEMA finance.raw;
CREATE TABLE finance.raw.sales (id INT, amount DOUBLE) USING DELTA;
---
Q4 – Grant permissions on a table.
GRANT SELECT ON TABLE finance.raw.sales TO `analyst-group`;
Explanation: Privileges granted higher in the UC hierarchy (catalog or schema) are inherited by the objects beneath them.
---
Q5 – Revoke permissions.
REVOKE SELECT ON TABLE finance.raw.sales FROM `analyst-group`;
---
Q6 – List key UC roles.
Role | Responsibility
Account Admin | Full organization control
Metastore Admin | Manage catalogs and schemas
Workspace Admin | Cluster & user settings
Data Owner | Object-level grants
---
Q7 – Where are audit logs stored and in what format?
Answer: As JSON files in cloud storage (S3, ADLS, GCS).
---
Q8 – Example audit log entry.
{
  "eventType": "grantPermission",
  "user": "gaurav.khetan@databricks.com",
  "object": "finance.raw.sales",
  "timestamp": "2025-10-05T12:00:00Z"
}
---
Q9 – How to view data lineage?
Answer: In Data Explorer → Select table → Lineage tab (column-level
visibility).
---
Q10 – Difference between managed and external tables.
Type | Storage | Deletion Behavior
Managed | Controlled by Databricks | Drop removes data + metadata
External | User storage | Drop removes metadata only
---
Q11 – How to audit all permission changes in UC?
Answer: Enable audit log delivery in the Admin Console → “Audit
Logs.”
---
⚙️ Delta Live Tables (DLT) and Expectations
---
Q12 – What is Delta Live Tables (DLT)?
Answer: A declarative ETL framework that automatically manages
dependencies, retries, and data quality rules.
---
Q13 – Define a basic DLT pipeline.
import dlt

@dlt.table
def bronze_customers():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/customers"))
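A hedged follow-on sketch of a dependent Silver table; dlt.read_stream wires the dependency so the pipeline runs bronze_customers first:
@dlt.table
def silver_customers():
    # Reads the bronze table defined above; DLT infers the dependency.
    return dlt.read_stream("bronze_customers").dropDuplicates(["id"])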
---
Q14 – Add data quality expectations.
@dlt.expect_or_drop("valid_id","id IS NOT NULL")
@dlt.expect("age_not_negative","age >= 0")
---
Q15 – Difference between expect, expect_or_drop, expect_or_fail.
Decorator | Action
expect | Logs violations
expect_or_drop | Drops invalid records
expect_or_fail | Stops pipeline on failure
---
Q16 – Where are expectation metrics visible?
Answer: DLT UI → Data Quality tab (showing pass/fail counts per
rule).
---
Q17 – DLT YAML config sample.
name: customer_pipeline
storage: /mnt/pipelines/customer
target: customer_db
clusters:
  - label: default
    num_workers: 2
libraries:
  - notebook: /Repos/pipelines/customer_dlt
---
Q18 – Advantages of DLT.
1. Automatic dependency management.
2. Built-in data quality validation.
3. Lineage tracking with Unity Catalog.
---
🌐 Lakehouse Federation
---
Q19 – What is Lakehouse Federation?
Answer: Allows Databricks to query external sources (Snowflake,
BigQuery, Redshift) directly without copying data.
---
Q20 – Create a connection to Snowflake.
CREATE CONNECTION my_snowflake_conn
TYPE snowflake
OPTIONS (
  host 'mycompany.snowflakecomputing.com',
  sfWarehouse 'COMPUTE_WH',
  user 'gaurav_k',
  password 'xxxxxx'
);
---
Q21 – Register a foreign catalog.
CREATE FOREIGN CATALOG sales_catalog
USING CONNECTION my_snowflake_conn
OPTIONS (database 'SALES_DB');
---
Q22 – Query external data.
SELECT * FROM sales_catalog.public.orders LIMIT 10;
---
Q23 – Benefits of Federation.
1. No ETL or data duplication.
2. Centralized governance through UC.
3. Cross-cloud access with lineage.
---
Q24 – Limitations of Federation.
Higher latency for remote queries.
Dependent on external source availability.
---
🔁 Delta Sharing
---
Q25 – What is Delta Sharing?
Answer: An open protocol to securely share live Delta tables with
other Databricks or external systems without copying data.
---
Q26 – Create a share and recipient.
CREATE SHARE sales_share;
ALTER SHARE sales_share ADD TABLE gold.sales_agg;
CREATE RECIPIENT partner_org;  -- an activation token/link is generated for the recipient
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;
---
Q27 – Types of Delta Sharing.
Type | Description
Databricks → Databricks | Native, seamless
Databricks → External | Open protocol via REST API
---
Q28 – Advantages of Delta Sharing.
1. Live data sharing (no stale copies).
2. Cross-cloud compatibility.
3. Governed by Unity Catalog.
---
Q29 – Limitations / Costs of cross-cloud sharing.
Possible egress fees for out-of-region access.
Dependent on recipient connectivity.
---
Q30 – What is shared with recipients in Delta Sharing?
Answer: Only table metadata and read links; data remains in
provider’s storage.
---
💎 Final Best Practices
Area | Best Practice
UC | Use least-privilege grants and audit regularly
DLT | Define expectations for all critical columns
Federation | Keep cross-region queries to a minimum
Delta Sharing | Prefer Databricks-to-Databricks for zero egress
---
✅ End of Part 5 – Data Governance & Quality
Excellent, boss — here’s a one-day Final Revision Sheet you can
paste straight into a new Word doc and print for your 11 Oct prep.
Save as
> Databricks_Final_Revision_Sheet_GauravKhetan.docx
---
⚡ Databricks Certified Data Engineer Associate – 2025
One-Day Final Revision Sheet (Quickfire Concepts & Syntax)
Prepared by: Gaurav Khetan | GatherGuru.in
---
🔹 1. Lakehouse & Delta Lake
Concept | Key Points / Syntax
ACID | Atomic, Consistent, Isolated, Durable transactions
Time Travel | SELECT * FROM table VERSION AS OF 3
OPTIMIZE | OPTIMIZE sales ZORDER BY (region)
VACUUM | VACUUM sales RETAIN 168 HOURS (7-day default)
Clone | CREATE TABLE test SHALLOW CLONE prod – instant shallow copy
Photon | Vectorized C++ engine, up to roughly 12× faster on some workloads
---
🔹 2. Auto Loader & Ingestion
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.schemaLocation", "/mnt/schema/customers")
    .load("/mnt/raw/customers")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/customers")
    .start("/mnt/bronze/customers"))
✅ Advantages
Incremental discovery • Schema evolution • Fault tolerant • Event-based (useNotifications = true)
---
🔹 3. Medallion Architecture
Layer | Purpose | Example Output
Bronze | Raw ingestion | /mnt/bronze
Silver | Cleaned, joined | /mnt/silver
Gold | Aggregated BI data | /mnt/gold
---
🔹 4. Transformations & Performance
from pyspark.sql.functions import sum, avg
df.groupBy("region").agg(sum("sales").alias("total"), avg("sales"))
Caching: df.cache()
Explain Plan: df.explain(True)
Broadcast Join: df.join(broadcast(dim_df),"id")
---
🔹 5. Lakeflow Declarative Pipeline (LDP YAML)
pipeline:
  name: sales_etl
  tasks:
    - name: ingest
      source: s3://bronze
      format: json
      output: bronze
    - name: transform
      input: bronze
      script: /Repos/transform.py
      output: silver
---
🔹 6. DLT (Delta Live Tables) Expectations
@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
def bronze_customers():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw"))
Decorator | Behavior
expect | Log bad rows
expect_or_drop | Drop bad rows
expect_or_fail | Stop pipeline
---
🔹 7. Productionizing Pipelines
Feature | Purpose / Syntax
DAB | Deploy with databricks bundle deploy
Repair Run | databricks jobs repair-run --run-id <id>
Retries | max_retries: 3 in YAML
Alerts | Configurable email/Slack on failure
Serverless | Auto-optimized, instant startup, no idle cost
---
🔹 8. Unity Catalog (UC)
CREATE CATALOG finance;
CREATE SCHEMA finance.raw;
CREATE TABLE finance.raw.sales(id INT, amt DOUBLE) USING DELTA;
GRANT SELECT ON TABLE finance.raw.sales TO analyst_group;
Level | Example
Metastore | Central governance
Catalog | finance
Schema | raw
Table | sales
Audit Logs: JSON in cloud storage
Lineage: Data Explorer → Lineage tab
---
🔹 9. Lakehouse Federation
CREATE CONNECTION my_snowflake_conn TYPE snowflake OPTIONS (...);
CREATE FOREIGN CATALOG ext_sales USING CONNECTION my_snowflake_conn OPTIONS (database '...');
SELECT * FROM ext_sales.public.orders LIMIT 10;
✅ Benefits – Query external data directly • UC-governed • No ETL duplication
⚠️ Note: Cross-region = egress cost
---
🔹 10. Delta Sharing
CREATE SHARE sales_share;
ALTER SHARE sales_share ADD TABLE gold.sales_agg;
CREATE RECIPIENT partner_org;
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;
Type | Description
D→D | Databricks ↔ Databricks (native)
D→Ext | External systems via REST API
---
🔹 11. Common Exam Facts
1. 45 questions | 90 minutes | Pass ≈ 70 %.
2. Expect 6–8 questions on UC + Delta Sharing.
3. Hands-on feel: Auto Loader, DLT YAML, Job Repair.
4. Memorize Lakehouse hierarchy & Delta operations.
5. Review Spark UI once – Stages vs Tasks vs Jobs.
---
⚡ Quick Command Recap
DESCRIBE HISTORY table;
OPTIMIZE table ZORDER BY (col);
VACUUM table RETAIN 168 HOURS;
ANALYZE TABLE table COMPUTE STATISTICS;
RESTORE TABLE table TO VERSION AS OF 3;
---
💪 Exam-Day Mantra
💧 Hydrate, breathe, 10-minute walk pre-exam.
⏱ Aim ~ 2 min per question.
⭐ First pass → easy ones; flag hard; review last 10 min.
⚡ Remember: Gaurav Khetan = Momentum. Shweta’s luck = Fuel.
Success is inevitable.
---
✅ End of One-Day Revision Sheet
Go conquer your 12 Oct exam, boss 🔥