|
36 | 36 |
|
37 | 37 | # COMMAND ----------
|
38 | 38 |
|
39 |
| -# MAGIC %run ./Includes/Classroom-Setup |
| 39 | +# MAGIC %run ../Includes/Classroom-Setup |
40 | 40 |
|
41 | 41 | # COMMAND ----------
|
42 | 42 |
|
|
187 | 187 |
|
188 | 188 | # COMMAND ----------
|
189 | 189 |
|
| 190 | +# MAGIC %md ## Our First Table |
| 191 | +# MAGIC |
| 192 | +# MAGIC Our first table is located in the path identified by **`eventsPath`** (a variable we created for you). |
| 193 | +# MAGIC |
| 194 | +# MAGIC We can see those files by running the following cell: |
| 195 | + |
| 196 | +# COMMAND ---------- |
| 197 | + |
| 198 | +files = dbutils.fs.ls(eventsPath) |
| 199 | +display(files) |
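The listing stored in `files` can also be filtered in plain Python before it is displayed. The sketch below is illustrative only: it assumes each entry exposes `path`, `name`, and `size` attributes, and it uses a hypothetical `FileInfo` stand-in and made-up paths so that it runs outside Databricks.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.ls;
# the attribute names (path, name, size) are an assumption of this sketch.
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])


def parquet_files(files):
    """Return the paths of all entries whose name ends in .parquet."""
    return [f.path for f in files if f.name.endswith(".parquet")]


# Made-up sample listing; on a cluster this would be dbutils.fs.ls(eventsPath).
sample = [
    FileInfo("dbfs:/tmp/example/events/_SUCCESS", "_SUCCESS", 0),
    FileInfo("dbfs:/tmp/example/events/part-00000.parquet", "part-00000.parquet", 1024),
]

print(parquet_files(sample))
```

On a cluster, the same helper could be applied to the real `files` list instead of `sample`.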
| 200 | + |
| 201 | +# COMMAND ---------- |
| 202 | + |
| 203 | +# MAGIC %md ## But, Wait! |
| 204 | +# MAGIC Python variables cannot be referenced directly in SQL commands. |
| 205 | +# MAGIC |
| 206 | +# MAGIC With the following trick you can! |
| 207 | +# MAGIC |
| 208 | +# MAGIC Set the Python variable's value as a Spark configuration property, which SQL commands can then reference as **`${c.events_path}`**: |
| 209 | + |
| 210 | +# COMMAND ---------- |
| 211 | + |
| 212 | +spark.sql(f"SET c.events_path = {eventsPath}") |
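To make the trick less magical, here is a minimal sketch of the idea behind `${...}` substitution. It is not Spark's implementation: `substitute_variables` is a hypothetical helper, and the path is a made-up placeholder for whatever `eventsPath` holds.

```python
import re


def substitute_variables(sql: str, conf: dict) -> str:
    """Replace each ${key} in sql with conf[key]; unknown keys are left untouched."""
    def _lookup(match):
        key = match.group(1)
        return conf.get(key, match.group(0))

    return re.sub(r"\$\{([^}]+)\}", _lookup, sql)


# Hypothetical value standing in for eventsPath.
events_path = "/tmp/example/events.parquet"

# Plays the role of the session configuration populated by the SET command.
conf = {"c.events_path": events_path}

query = 'CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "${c.events_path}")'
resolved = substitute_variables(query, conf)
print(resolved)
```

In the notebook, the SQL engine performs this replacement itself before the statement is executed; the helper above only imitates that step.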
| 213 | + |
| 214 | +# COMMAND ---------- |
| 215 | + |
190 | 216 | # MAGIC %md ## Create table
|
191 | 217 | # MAGIC Run <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/index.html#sql-reference" target="_blank">Databricks SQL Commands</a> to create a table named `events` using BedBricks event files on DBFS.
|
192 | 218 |
|
193 | 219 | # COMMAND ----------
|
194 | 220 |
|
195 | 221 | # MAGIC %sql
|
196 |
| -# MAGIC CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "/mnt/training/ecommerce/events/events.parquet"); |
| 222 | +# MAGIC CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "${c.events_path}"); |
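An alternative to the `%sql` cell is to build the statement in Python and hand it to `spark.sql`. The sketch below only builds the string so that it runs anywhere; `build_create_table_sql` is a hypothetical helper and the path is a placeholder, not the course's actual `eventsPath` value.

```python
def build_create_table_sql(table_name: str, path: str) -> str:
    """Build a CREATE TABLE statement for a parquet-backed table at the given path."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f'USING parquet OPTIONS (path "{path}")'
    )


# Hypothetical value standing in for eventsPath.
events_path = "/tmp/example/events.parquet"

statement = build_create_table_sql("events", events_path)
print(statement)

# On a cluster, the statement would then be executed with:
# spark.sql(statement)
```

This avoids the `SET` step, at the cost of moving the SQL out of a `%sql` cell and into a Python string.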
197 | 223 |
|
198 | 224 | # COMMAND ----------
|
199 | 225 |
|
|
294 | 320 |
|
295 | 321 | # COMMAND ----------
|
296 | 322 |
|
297 |
| -# MAGIC %md |
298 |
| -# MAGIC # Explore Datasets Lab |
299 |
| -# MAGIC |
300 |
| -# MAGIC We will use tools introduced in this lesson to explore the datasets used in this course. |
301 |
| -# MAGIC |
302 |
| -# MAGIC ### BedBricks Case Study |
303 |
| -# MAGIC This course uses a case study that explores clickstream data for the online mattress retailer, BedBricks. |
304 |
| -# MAGIC You are an analyst at BedBricks working with the following datasets: `events`, `sales`, `users`, and `products`. |
305 |
| -# MAGIC |
306 |
| -# MAGIC ##### Tasks |
307 |
| -# MAGIC 1. View data files in DBFS using magic commands |
308 |
| -# MAGIC 1. View data files in DBFS using dbutils |
309 |
| -# MAGIC 1. Create tables from files in DBFS |
310 |
| -# MAGIC 1. Execute SQL to answer questions on BedBricks datasets |
311 |
| - |
312 |
| -# COMMAND ---------- |
313 |
| - |
314 |
| -# MAGIC %md ### 1. List data files in DBFS using magic commands |
315 |
| -# MAGIC Use a magic command to display files located in the DBFS directory: **`/mnt/training/ecommerce`** |
316 |
| -# MAGIC |
317 |
| -# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see four items: `events`, `products`, `sales`, `users` |
318 |
| - |
319 |
| -# COMMAND ---------- |
320 |
| - |
321 |
| -# TODO |
322 |
| - |
323 |
| -# COMMAND ---------- |
324 |
| - |
325 |
| -# MAGIC %md ### 2. List data files in DBFS using dbutils |
326 |
| -# MAGIC - Use **`dbutils`** to get the files at the directory above and save it to the variable **`files`** |
327 |
| -# MAGIC - Use the Databricks display() function to display the contents in **`files`** |
328 |
| -# MAGIC |
329 |
| -# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see four items: `events`, `items`, `sales`, `users` |
330 |
| - |
331 |
| -# COMMAND ---------- |
332 |
| - |
333 |
| -# TODO |
334 |
| -files = dbutils.FILL_IN |
335 |
| -display(files) |
336 |
| - |
337 |
| -# COMMAND ---------- |
338 |
| - |
339 |
| -# MAGIC %md ### 3. Create tables below from files in DBFS |
340 |
| -# MAGIC - Create `users` table using files at location `"/mnt/training/ecommerce/users/users.parquet"` |
341 |
| -# MAGIC - Create `sales` table using files at location `"/mnt/training/ecommerce/sales/sales.parquet"` |
342 |
| -# MAGIC - Create `products` table using files at location `"/mnt/training/ecommerce/products/products.parquet"` |
343 |
| -# MAGIC |
344 |
| -# MAGIC (We created `events` table earlier using files at location `"/mnt/training/ecommerce/events/events.parquet"`) |
345 |
| - |
346 |
| -# COMMAND ---------- |
347 |
| - |
348 |
| -# TODO |
349 |
| - |
350 |
| -# COMMAND ---------- |
351 |
| - |
352 |
| -# MAGIC %md Use the data tab of the workspace UI to confirm your tables were created. |
353 |
| - |
354 |
| -# COMMAND ---------- |
355 |
| - |
356 |
| -# MAGIC %md ### 4. Execute SQL to explore BedBricks datasets |
357 |
| -# MAGIC Run SQL queries on the `products`, `sales`, and `events` tables to answer the following questions. |
358 |
| -# MAGIC - What products are available for purchase at BedBricks? |
359 |
| -# MAGIC - What is the average purchase revenue for a transaction at BedBricks? |
360 |
| -# MAGIC - What types of events are recorded on the BedBricks website? |
361 |
| -# MAGIC |
362 |
| -# MAGIC The schema of the relevant dataset is provided for each question in the cells below. |
363 |
| - |
364 |
| -# COMMAND ---------- |
365 |
| - |
366 |
| -# MAGIC %md |
367 |
| -# MAGIC #### Q1: What products are available for purchase at BedBricks? |
368 |
| -# MAGIC |
369 |
| -# MAGIC The **`products`** dataset contains the ID, name, and price of products on the BedBricks retail site. |
370 |
| -# MAGIC |
371 |
| -# MAGIC | field | type | description |
372 |
| -# MAGIC | --- | --- | --- | |
373 |
| -# MAGIC | item_id | string | unique item identifier | |
374 |
| -# MAGIC | name | string | item name in plain text | |
375 |
| -# MAGIC | price | double | price of item | |
376 |
| -# MAGIC |
377 |
| -# MAGIC Execute a SQL query that selects all from the **`products`** table. |
378 |
| -# MAGIC |
379 |
| -# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see 12 products. |
380 |
| - |
381 |
| -# COMMAND ---------- |
382 |
| - |
383 |
| -# TODO |
384 |
| - |
385 |
| -# COMMAND ---------- |
386 |
| - |
387 |
| -# MAGIC %md #### Q2: What is the average purchase revenue for a transaction at BedBricks? |
388 |
| -# MAGIC |
389 |
| -# MAGIC The **`sales`** dataset contains order information representing successfully processed sales. |
390 |
| -# MAGIC Most fields correspond directly with fields from the clickstream data associated with a sale finalization event. |
391 |
| -# MAGIC |
392 |
| -# MAGIC | field | type | description| |
393 |
| -# MAGIC | --- | --- | --- | |
394 |
| -# MAGIC | order_id | long | unique identifier | |
395 |
| -# MAGIC | email | string | the email address to which sales configuration was sent | |
396 |
| -# MAGIC | transaction_timestamp | long | timestamp at which the order was processed, recorded in milliseconds since epoch | |
397 |
| -# MAGIC | total_item_quantity | long | number of individual items in the order | |
398 |
| -# MAGIC | purchase_revenue_in_usd | double | total revenue from order | |
399 |
| -# MAGIC | unique_items | long | number of unique products in the order | |
400 |
| -# MAGIC | items | array | provided as a list of JSON data, which is interpreted by Spark as an array of structs | |
401 |
| -# MAGIC |
402 |
| -# MAGIC Execute a SQL query that computes the average **`purchase_revenue_in_usd`** from the **`sales`** table. |
403 |
| -# MAGIC |
404 |
| -# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> The result should be `1042.79`. |
405 |
| - |
406 |
| -# COMMAND ---------- |
407 |
| - |
408 |
| -# TODO |
409 |
| - |
410 |
| -# COMMAND ---------- |
411 |
| - |
412 |
| -# MAGIC %md #### Q3: What types of events are recorded on the BedBricks website? |
413 |
| -# MAGIC |
414 |
| -# MAGIC The **`events`** dataset contains two weeks worth of parsed JSON records, created by consuming updates to an operational database. |
415 |
| -# MAGIC Records are received whenever: (1) a new user visits the site, (2) a user provides their email for the first time. |
416 |
| -# MAGIC |
417 |
| -# MAGIC | field | type | description| |
418 |
| -# MAGIC | --- | --- | --- | |
419 |
| -# MAGIC | device | string | operating system of the user device | |
420 |
| -# MAGIC | user_id | string | unique identifier for user/session | |
421 |
| -# MAGIC | user_first_touch_timestamp | long | first time the user was seen in microseconds since epoch | |
422 |
| -# MAGIC | traffic_source | string | referral source | |
423 |
| -# MAGIC | geo (city, state) | struct | city and state information derived from IP address | |
424 |
| -# MAGIC | event_timestamp | long | event time recorded as microseconds since epoch | |
425 |
| -# MAGIC | event_previous_timestamp | long | time of previous event in microseconds since epoch | |
426 |
| -# MAGIC | event_name | string | name of events as registered in clickstream tracker | |
427 |
| -# MAGIC | items (item_id, item_name, price_in_usd, quantity, item_revenue in usd, coupon)| array | an array of structs for each unique item in the user’s cart | |
428 |
| -# MAGIC | ecommerce (total_item_quantity, unique_items, purchase_revenue_in_usd) | struct | purchase data (this field is only non-null in those events that correspond to a sales finalization) | |
429 |
| -# MAGIC |
430 |
| -# MAGIC Execute a SQL query that selects distinct values in **`event_name`** from the **`events`** table |
431 |
| -# MAGIC |
432 |
| -# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> You should see 23 distinct **`event_name`** values. |
433 |
| - |
434 |
| -# COMMAND ---------- |
435 |
| - |
436 |
| -# TODO |
437 |
| - |
438 |
| -# COMMAND ---------- |
439 |
| - |
440 | 323 | # MAGIC %md ### Clean up classroom
|
| 324 | +# MAGIC Clean up any temporary files, tables, and databases created by this lesson. |
441 | 325 |
|
442 | 326 | # COMMAND ----------
|
443 | 327 |
|
444 |
| -# MAGIC %run ./Includes/Classroom-Cleanup |
| 328 | +classroom_cleanup() |
445 | 329 |
|
446 | 330 | # COMMAND ----------
|
447 | 331 |
|
|