
Commit 7622960

SireInsectus authored and committed
Publishing v2.0.0
1 parent 8c77e8c commit 7622960

64 files changed: +4039 −4266 lines

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+# Databricks notebook source
+# MAGIC %md-sandbox
+# MAGIC
+# MAGIC <div style="text-align: center; line-height: 0; padding-top: 9px;">
+# MAGIC <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
+# MAGIC </div>
+
+# COMMAND ----------
+
+# MAGIC %md
+# MAGIC # Apache Spark Programming with Databricks
+# MAGIC ## Course Agenda
+
+# COMMAND ----------
+
+# MAGIC %md
+# MAGIC ## Introductions
+# MAGIC * [ASP 1.1 - Databricks Platform]($./ASP 1 - Introductions/ASP 1.1 - Databricks Platform)
+# MAGIC * [ASP 1.1L - Explore Datasets Lab]($./ASP 1 - Introductions/Labs/ASP 1.1L - Explore Datasets Lab)
+
+# COMMAND ----------
+
+# MAGIC %md
+# MAGIC ## Spark Core
+# MAGIC * [ASP 2.1 - Spark SQL]($./ASP 2 - Spark Core/ASP 2.1 - Spark SQL)
+# MAGIC * [ASP 2.2 - Reader & Writer]($./ASP 2 - Spark Core/ASP 2.2 - Reader & Writer)
+# MAGIC * [ASP 2.2L - Spark SQL Lab]($./ASP 2 - Spark Core/Labs/ASP 2.2L - Spark SQL Lab)
+# MAGIC * [ASP 2.3 - DataFrame & Column]($./ASP 2 - Spark Core/ASP 2.3 - DataFrame & Column)
+# MAGIC * [ASP 2.3L - Ingesting Data Lab]($./ASP 2 - Spark Core/Labs/ASP 2.3L - Ingesting Data Lab)
+# MAGIC * [ASP 2.4L - Purchase Revenues Lab]($./ASP 2 - Spark Core/Labs/ASP 2.4L - Purchase Revenues Lab)
+
+# COMMAND ----------
+
+# MAGIC %md
+# MAGIC ## Functions
+# MAGIC * [ASP 3.1 - Aggregation]($./ASP 3 - Functions/ASP 3.1 - Aggregation)
+# MAGIC * [ASP 3.1L - Revenue by Traffic Lab]($./ASP 3 - Functions/Labs/ASP 3.1L - Revenue by Traffic Lab)
+# MAGIC * [ASP 3.2 - Datetimes]($./ASP 3 - Functions/ASP 3.2 - Datetimes)
+# MAGIC * [ASP 3.2L - Active Users Lab]($./ASP 3 - Functions/Labs/ASP 3.2L - Active Users Lab)
+# MAGIC * [ASP 3.3 - Complex Types]($./ASP 3 - Functions/ASP 3.3 - Complex Types)
+# MAGIC * [ASP 3.4 - Additional Functions]($./ASP 3 - Functions/ASP 3.4 - Additional Functions)
+# MAGIC * [ASP 3.4L - Abandoned Carts Lab]($./ASP 3 - Functions/Labs/ASP 3.4L - Abandoned Carts Lab)
+# MAGIC * [ASP 3.5 - UDFs]($./ASP 3 - Functions/ASP 3.5 - UDFs)
+# MAGIC * [ASP 3.5L - Sort Day Lab]($./ASP 3 - Functions/Labs/ASP 3.5L - Sort Day Lab)
+# MAGIC
+
+# COMMAND ----------
+
+# MAGIC %md
+# MAGIC ## Performance
+# MAGIC * [ASP 4.1 - Query Optimization]($./ASP 4 - Performance/ASP 4.1 - Query Optimization)
+# MAGIC * [ASP 4.2 - Partitioning]($./ASP 4 - Performance/ASP 4.2 - Partitioning)
+# MAGIC * [ASP 4.3L - De-Duping Data Lab]($./ASP 4 - Performance/Labs/ASP 4.3L - De-Duping Data Lab)
+
+# COMMAND ----------
+
+# MAGIC %md
+# MAGIC ## Streaming
+# MAGIC * [ASP 5.1 - Streaming Query]($./ASP 5 - Streaming/ASP 5.1 - Streaming Query)
+# MAGIC * [ASP 5.1L - Coupon Sales Lab]($./ASP 5 - Streaming/Labs/ASP 5.1L - Coupon Sales Lab)
+# MAGIC * [ASP 5.2L - Hourly Activity by Traffic Lab]($./ASP 5 - Streaming/Labs/ASP 5.2L - Hourly Activity by Traffic Lab)
+# MAGIC * [ASP 5.3L - Activity by Traffic Lab]($./ASP 5 - Streaming/Labs/ASP 5.3L - Activity by Traffic Lab)
+
+# COMMAND ----------
+
+# MAGIC %md
+# MAGIC ## Delta Lake
+# MAGIC * [ASP 6.1 - Delta Lake]($./ASP 6 - Delta Lake/ASP 6.1 - Delta Lake)
+# MAGIC * [ASP 6.1L - Delta Lake Lab]($./ASP 6 - Delta Lake/Labs/ASP 6.1L - Delta Lake Lab)
+
+# COMMAND ----------
+
+# MAGIC %md-sandbox
+# MAGIC &copy; 2022 Databricks, Inc. All rights reserved.<br/>
+# MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
+# MAGIC <br/>
+# MAGIC <a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>

Apache-Spark-Programming-with-Databricks/ASP 1.2 - Databricks Platform.py renamed to Apache-Spark-Programming-with-Databricks/ASP 1 - Introductions/ASP 1.1 - Databricks Platform.py

Lines changed: 30 additions & 146 deletions
@@ -36,7 +36,7 @@
 
 # COMMAND ----------
 
-# MAGIC %run ./Includes/Classroom-Setup
+# MAGIC %run ../Includes/Classroom-Setup
 
 # COMMAND ----------
 
@@ -187,13 +187,39 @@
 
 # COMMAND ----------
 
+# MAGIC %md ## Our First Table
+# MAGIC
+# MAGIC It is located in the path identified by **`eventsPath`** (a variable we created for you).
+# MAGIC
+# MAGIC We can see those files by running the following cell.
+
+# COMMAND ----------
+
+files = dbutils.fs.ls(eventsPath)
+display(files)
+
+# COMMAND ----------
+
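Outside Databricks, where `dbutils` is unavailable, the same kind of directory listing can be sketched with the standard library. This is a hypothetical local analogue of the name and size fields `dbutils.fs.ls` reports, not the `dbutils` API itself; `demo_dir` stands in for `eventsPath`:

```python
import tempfile
from pathlib import Path

def list_files(directory):
    """Return (name, size) pairs for a directory's entries, loosely
    mirroring the name/size fields dbutils.fs.ls reports for DBFS paths."""
    return [(p.name, p.stat().st_size if p.is_file() else 0)
            for p in sorted(Path(directory).iterdir())]

# Demo against a throwaway local directory instead of a DBFS path:
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "part-0000.parquet").write_bytes(b"demo")
print(list_files(demo_dir))  # [('part-0000.parquet', 4)]
```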
+# MAGIC %md ## But, Wait!
+# MAGIC I cannot use variables in SQL commands.
+# MAGIC
+# MAGIC With the following trick, you can!
+# MAGIC
+# MAGIC Declare the Python variable as a variable in the Spark context, which SQL commands can access:
+
+# COMMAND ----------
+
+spark.sql(f"SET c.events_path = {eventsPath}")
+
+# COMMAND ----------
+
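The effect of this trick is that `${c.events_path}` references in later SQL text get expanded against the conf values set via `SET`. The mechanism can be sketched in plain Python as a simplified analogue (this is not Spark's actual implementation, just an illustration of the substitution):

```python
import re

conf = {}  # stands in for the Spark session's SQL conf

def set_conf(key, value):
    """Analogue of spark.sql(f"SET {key} = {value}")."""
    conf[key] = value

def expand(sql):
    """Replace each ${key} in the SQL text with its conf value,
    as Spark's variable substitution does before execution."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: conf[m.group(1)], sql)

set_conf("c.events_path", "/mnt/training/ecommerce/events/events.parquet")
sql = 'CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "${c.events_path}")'
print(expand(sql))
```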
 # MAGIC %md ## Create table
 # MAGIC Run <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/index.html#sql-reference" target="_blank">Databricks SQL Commands</a> to create a table named `events` using BedBricks event files on DBFS.
 
 # COMMAND ----------
 
 # MAGIC %sql
-# MAGIC CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "/mnt/training/ecommerce/events/events.parquet");
+# MAGIC CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "${c.events_path}");
 
 # COMMAND ----------
 
@@ -294,154 +320,12 @@
 
 # COMMAND ----------
 
-# MAGIC %md
-# MAGIC # Explore Datasets Lab
-# MAGIC
-# MAGIC We will use tools introduced in this lesson to explore the datasets used in this course.
-# MAGIC
-# MAGIC ### BedBricks Case Study
-# MAGIC This course uses a case study that explores clickstream data for the online mattress retailer, BedBricks.
-# MAGIC You are an analyst at BedBricks working with the following datasets: `events`, `sales`, `users`, and `products`.
-# MAGIC
-# MAGIC ##### Tasks
-# MAGIC 1. View data files in DBFS using magic commands
-# MAGIC 1. View data files in DBFS using dbutils
-# MAGIC 1. Create tables from files in DBFS
-# MAGIC 1. Execute SQL to answer questions on BedBricks datasets
-
-# COMMAND ----------
-
-# MAGIC %md ### 1. List data files in DBFS using magic commands
-# MAGIC Use a magic command to display files located in the DBFS directory: **`/mnt/training/ecommerce`**
-# MAGIC
-# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see four items: `events`, `products`, `sales`, `users`
-
-# COMMAND ----------
-
-# TODO
-
-# COMMAND ----------
-
-# MAGIC %md ### 2. List data files in DBFS using dbutils
-# MAGIC - Use **`dbutils`** to get the files at the directory above and save it to the variable **`files`**
-# MAGIC - Use the Databricks display() function to display the contents in **`files`**
-# MAGIC
-# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see four items: `events`, `items`, `sales`, `users`
-
-# COMMAND ----------
-
-# TODO
-files = dbutils.FILL_IN
-display(files)
-
-# COMMAND ----------
-
-# MAGIC %md ### 3. Create tables below from files in DBFS
-# MAGIC - Create `users` table using files at location `"/mnt/training/ecommerce/users/users.parquet"`
-# MAGIC - Create `sales` table using files at location `"/mnt/training/ecommerce/sales/sales.parquet"`
-# MAGIC - Create `products` table using files at location `"/mnt/training/ecommerce/products/products.parquet"`
-# MAGIC
-# MAGIC (We created `events` table earlier using files at location `"/mnt/training/ecommerce/events/events.parquet"`)
-
-# COMMAND ----------
-
-# TODO
-
-# COMMAND ----------
-
-# MAGIC %md Use the data tab of the workspace UI to confirm your tables were created.
-
-# COMMAND ----------
-
-# MAGIC %md ### 4. Execute SQL to explore BedBricks datasets
-# MAGIC Run SQL queries on the `products`, `sales`, and `events` tables to answer the following questions.
-# MAGIC - What products are available for purchase at BedBricks?
-# MAGIC - What is the average purchase revenue for a transaction at BedBricks?
-# MAGIC - What types of events are recorded on the BedBricks website?
-# MAGIC
-# MAGIC The schema of the relevant dataset is provided for each question in the cells below.
-
-# COMMAND ----------
-
-# MAGIC %md
-# MAGIC #### Q1: What products are available for purchase at BedBricks?
-# MAGIC
-# MAGIC The **`products`** dataset contains the ID, name, and price of products on the BedBricks retail site.
-# MAGIC
-# MAGIC | field | type | description |
-# MAGIC | --- | --- | --- |
-# MAGIC | item_id | string | unique item identifier |
-# MAGIC | name | string | item name in plain text |
-# MAGIC | price | double | price of item |
-# MAGIC
-# MAGIC Execute a SQL query that selects all from the **`products`** table.
-# MAGIC
-# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see 12 products.
-
-# COMMAND ----------
-
-# TODO
-
-# COMMAND ----------
-
-# MAGIC %md #### Q2: What is the average purchase revenue for a transaction at BedBricks?
-# MAGIC
-# MAGIC The **`sales`** dataset contains order information representing successfully processed sales.
-# MAGIC Most fields correspond directly to fields from the clickstream data associated with a sale finalization event.
-# MAGIC
-# MAGIC | field | type | description |
-# MAGIC | --- | --- | --- |
-# MAGIC | order_id | long | unique identifier |
-# MAGIC | email | string | the email address to which sales configuration was sent |
-# MAGIC | transaction_timestamp | long | timestamp at which the order was processed, recorded in milliseconds since epoch |
-# MAGIC | total_item_quantity | long | number of individual items in the order |
-# MAGIC | purchase_revenue_in_usd | double | total revenue from order |
-# MAGIC | unique_items | long | number of unique products in the order |
-# MAGIC | items | array | provided as a list of JSON data, which is interpreted by Spark as an array of structs |
-# MAGIC
-# MAGIC Execute a SQL query that computes the average **`purchase_revenue_in_usd`** from the **`sales`** table.
-# MAGIC
-# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> The result should be `1042.79`.
-
-# COMMAND ----------
-
-# TODO
-
-# COMMAND ----------
-
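Note that the epoch encodings in these schemas differ by unit: `sales.transaction_timestamp` is milliseconds since epoch, while the `events` timestamps are microseconds. A minimal sketch of the conversion in plain Python (the sample values below are made up for illustration, not taken from the datasets):

```python
from datetime import datetime, timezone

def from_epoch(value, unit):
    """Convert an integer epoch value in 'ms' or 'us' to an aware UTC datetime."""
    divisor = 1_000 if unit == "ms" else 1_000_000
    return datetime.fromtimestamp(value / divisor, tz=timezone.utc)

# Hypothetical sample values:
tx_ms = 1_593_878_946_592          # like sales.transaction_timestamp (milliseconds)
touch_us = 1_593_878_946_592_107   # like events.user_first_touch_timestamp (microseconds)

print(from_epoch(tx_ms, "ms").isoformat())
print(from_epoch(touch_us, "us").isoformat())
```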
-# MAGIC %md #### Q3: What types of events are recorded on the BedBricks website?
-# MAGIC
-# MAGIC The **`events`** dataset contains two weeks' worth of parsed JSON records, created by consuming updates to an operational database.
-# MAGIC Records are received whenever (1) a new user visits the site, or (2) a user provides their email for the first time.
-# MAGIC
-# MAGIC | field | type | description |
-# MAGIC | --- | --- | --- |
-# MAGIC | device | string | operating system of the user device |
-# MAGIC | user_id | string | unique identifier for user/session |
-# MAGIC | user_first_touch_timestamp | long | first time the user was seen, in microseconds since epoch |
-# MAGIC | traffic_source | string | referral source |
-# MAGIC | geo (city, state) | struct | city and state information derived from IP address |
-# MAGIC | event_timestamp | long | event time recorded as microseconds since epoch |
-# MAGIC | event_previous_timestamp | long | time of previous event in microseconds since epoch |
-# MAGIC | event_name | string | name of events as registered in clickstream tracker |
-# MAGIC | items (item_id, item_name, price_in_usd, quantity, item_revenue_in_usd, coupon) | array | an array of structs for each unique item in the user’s cart |
-# MAGIC | ecommerce (total_item_quantity, unique_items, purchase_revenue_in_usd) | struct | purchase data (this field is only non-null in those events that correspond to a sales finalization) |
-# MAGIC
-# MAGIC Execute a SQL query that selects distinct values in **`event_name`** from the **`events`** table.
-# MAGIC
-# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see 23 distinct **`event_name`** values.
-
-# COMMAND ----------
-
-# TODO
-
-# COMMAND ----------
-
 # MAGIC %md ### Clean up classroom
+# MAGIC Clean up any temp files, tables, and databases created by this lesson.
 
 # COMMAND ----------
 
-# MAGIC %run ./Includes/Classroom-Cleanup
+classroom_cleanup()
 
 # COMMAND ----------
 
