# Databricks notebook source
# MAGIC %md-sandbox
# MAGIC
# MAGIC <div style="text-align: center; line-height: 0; padding-top: 9px;">
# MAGIC <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
# MAGIC </div>

# COMMAND ----------

# MAGIC %md
# MAGIC # Spark SQL
# MAGIC
# MAGIC Demonstrate fundamental concepts in Spark SQL using the DataFrame API.
# MAGIC
# MAGIC ##### Objectives
# MAGIC 1. Run a SQL query
# MAGIC 1. Create a DataFrame from a table
# MAGIC 1. Write the same query using DataFrame transformations
# MAGIC 1. Trigger computation with DataFrame actions
# MAGIC 1. Convert between DataFrames and SQL
# MAGIC
# MAGIC ##### Methods
# MAGIC - <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#spark-session-apis" target="_blank">SparkSession</a>: `sql`, `table`
# MAGIC - <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>:
# MAGIC   - Transformations: `select`, `where`, `orderBy`
# MAGIC   - Actions: `show`, `count`, `take`
# MAGIC   - Other methods: `printSchema`, `schema`, `createOrReplaceTempView`

# COMMAND ----------

# MAGIC %run ./Includes/Classroom-Setup-SQL

# COMMAND ----------

# MAGIC %md ## Multiple Interfaces
# MAGIC Spark SQL is a module for structured data processing with multiple interfaces.
# MAGIC
# MAGIC We can interact with Spark SQL in two ways:
# MAGIC 1. Executing SQL queries
# MAGIC 1. Working with the DataFrame API

# COMMAND ----------

# MAGIC %md
# MAGIC **Method 1: Executing SQL queries**
# MAGIC
# MAGIC This is how we interacted with Spark SQL in the previous lesson.

# COMMAND ----------

# MAGIC %sql
# MAGIC SELECT name, price
# MAGIC FROM products
# MAGIC WHERE price < 200
# MAGIC ORDER BY price

# COMMAND ----------

# MAGIC %md **Method 2: Working with the DataFrame API**
# MAGIC
# MAGIC We can also express Spark SQL queries using the DataFrame API.
# MAGIC The following cell returns a DataFrame containing the same results as those retrieved above.

# COMMAND ----------

display(spark.table("products")
        .select("name", "price")
        .where("price < 200")
        .orderBy("price"))

# COMMAND ----------

# MAGIC %md We'll go over the syntax for the DataFrame API later in the lesson, but you can see that this builder design pattern allows us to chain a sequence of operations very similar to the ones we find in SQL.

# COMMAND ----------

# MAGIC %md ## Query Execution
# MAGIC We can express the same query using any interface. The Spark SQL engine generates the same query plan, which it uses to optimize and execute the query on our Spark cluster.
# MAGIC
# MAGIC ![query execution engine](https://files.training.databricks.com/images/aspwd/spark_sql_query_execution_engine.png)
# MAGIC
# MAGIC <img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> Resilient Distributed Datasets (RDDs) are the low-level representation of datasets processed by a Spark cluster. In early versions of Spark, you had to write <a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html" target="_blank">code manipulating RDDs directly</a>. In modern versions of Spark you should instead use the higher-level DataFrame APIs, which Spark automatically compiles into low-level RDD operations.

# COMMAND ----------
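
# MAGIC %md To see this for yourself, you can call `explain`, which prints a DataFrame's query plan without executing it. Here is a minimal sketch comparing the two interfaces on the query from the start of this lesson:

# COMMAND ----------

# Both versions of the query compile to the same underlying plan
spark.sql("SELECT name, price FROM products WHERE price < 200 ORDER BY price").explain()

(spark.table("products")
  .select("name", "price")
  .where("price < 200")
  .orderBy("price")
  .explain())

# COMMAND ----------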

# MAGIC %md ## Spark API Documentation
# MAGIC
# MAGIC To learn how we work with DataFrames in Spark SQL, let's first look at the Spark API documentation.
# MAGIC The main Spark [documentation](https://spark.apache.org/docs/latest/) page includes links to API docs and helpful guides for each version of Spark.
# MAGIC
# MAGIC The [Scala API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html) and [Python API](https://spark.apache.org/docs/latest/api/python/index.html) are most commonly used, and it's often helpful to reference the documentation for both languages.
# MAGIC Scala docs tend to be more comprehensive, and Python docs tend to have more code examples.
# MAGIC
# MAGIC #### Navigating Docs for the Spark SQL Module
# MAGIC Find the Spark SQL module by navigating to `org.apache.spark.sql` in the Scala API or `pyspark.sql` in the Python API.
# MAGIC The first class we'll explore in this module is the `SparkSession` class. You can find this by entering "SparkSession" in the search bar.

# COMMAND ----------

# MAGIC %md
# MAGIC ## SparkSession
# MAGIC The `SparkSession` class is the single entry point to all functionality in Spark using the DataFrame API.
# MAGIC
# MAGIC In Databricks notebooks, the SparkSession is created for you and stored in a variable called `spark`.

# COMMAND ----------

spark

# COMMAND ----------

# MAGIC %md The example from the beginning of this lesson used the SparkSession method `table` to create a DataFrame from the `products` table. Let's save this in the variable `productsDF`.

# COMMAND ----------

productsDF = spark.table("products")

# COMMAND ----------

# MAGIC %md
# MAGIC Below are several additional methods we can use to create DataFrames. All of these can be found in the <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html" target="_blank">documentation</a> for `SparkSession`.
# MAGIC
# MAGIC #### `SparkSession` Methods
# MAGIC | Method | Description |
# MAGIC | --- | --- |
# MAGIC | sql | Returns a DataFrame representing the result of the given query |
# MAGIC | table | Returns the specified table as a DataFrame |
# MAGIC | read | Returns a DataFrameReader that can be used to read data in as a DataFrame |
# MAGIC | range | Creates a DataFrame with a column containing elements in a range from start to end (exclusive) with step value and number of partitions |
# MAGIC | createDataFrame | Creates a DataFrame from a list of tuples, primarily used for testing |

# COMMAND ----------
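
# MAGIC %md As a quick sketch of two of these methods, the cell below builds small DataFrames with `range` and `createDataFrame`. The tuples and column names here are made up purely for illustration:

# COMMAND ----------

# range: a single-column DataFrame of ids from 0 up to (but not including) 10, stepping by 2
display(spark.range(0, 10, 2))

# createDataFrame: a DataFrame from a list of tuples with explicit column names
display(spark.createDataFrame([("laptop", 999), ("mouse", 25)], ["name", "price"]))

# COMMAND ----------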

# MAGIC %md Let's use a SparkSession method to run SQL.

# COMMAND ----------

resultDF = spark.sql("""
SELECT name, price
FROM products
WHERE price < 200
ORDER BY price
""")

display(resultDF)

# COMMAND ----------

# MAGIC %md ## DataFrames
# MAGIC Recall that expressing our query using methods in the DataFrame API returns results in a DataFrame. Let's store this in the variable `budgetDF`.
# MAGIC
# MAGIC A **DataFrame** is a distributed collection of data grouped into named columns.

# COMMAND ----------

budgetDF = (spark.table("products")
            .select("name", "price")
            .where("price < 200")
            .orderBy("price"))

# COMMAND ----------

# MAGIC %md We can use `display()` to output the results of a DataFrame.

# COMMAND ----------

display(budgetDF)

# COMMAND ----------

# MAGIC %md The **schema** defines the column names and types of a DataFrame.
# MAGIC
# MAGIC Access a DataFrame's schema using the `schema` attribute.

# COMMAND ----------

budgetDF.schema

# COMMAND ----------

# MAGIC %md View a nicer output for this schema using the `printSchema()` method.

# COMMAND ----------

budgetDF.printSchema()

# COMMAND ----------
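
# MAGIC %md The `schema` attribute returns a `StructType`, which can also be inspected programmatically. A small sketch:

# COMMAND ----------

# Column names as a list, and the type of a single field looked up by name
print(budgetDF.schema.fieldNames())
print(budgetDF.schema["price"].dataType)

# COMMAND ----------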

# MAGIC %md ## Transformations
# MAGIC When we created `budgetDF`, we used a series of DataFrame transformation methods, e.g., `select`, `where`, `orderBy`.
# MAGIC
# MAGIC ```
# MAGIC productsDF
# MAGIC   .select("name", "price")
# MAGIC   .where("price < 200")
# MAGIC   .orderBy("price")
# MAGIC ```
# MAGIC Transformations operate on and return DataFrames, allowing us to chain transformation methods together to construct new DataFrames.
# MAGIC However, these operations can't execute on their own, as transformation methods are **lazily evaluated**.
# MAGIC
# MAGIC Running the following cell does not trigger any computation.

# COMMAND ----------

(productsDF
  .select("name", "price")
  .where("price < 200")
  .orderBy("price"))

# COMMAND ----------
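
# MAGIC %md As an aside, these transformations also accept `Column` expressions in place of the string expressions used above. A minimal sketch of the same chain using `col`; like the version above, it triggers no computation:

# COMMAND ----------

from pyspark.sql.functions import col

# Still a pure transformation chain: nothing executes here either
(productsDF
  .select("name", "price")
  .where(col("price") < 200)
  .orderBy(col("price")))

# COMMAND ----------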

# MAGIC %md ## Actions
# MAGIC Conversely, DataFrame actions are methods that **trigger computation**.
# MAGIC Actions are needed to trigger the execution of any DataFrame transformations.
# MAGIC
# MAGIC The `show` action causes the following cell to execute transformations.

# COMMAND ----------

(productsDF
  .select("name", "price")
  .where("price < 200")
  .orderBy("price")
  .show())

# COMMAND ----------

# MAGIC %md Below are several examples of <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#dataframe-apis" target="_blank">DataFrame</a> actions.
# MAGIC
# MAGIC ### DataFrame Action Methods
# MAGIC | Method | Description |
# MAGIC | --- | --- |
# MAGIC | show | Displays the top n rows of the DataFrame in a tabular form |
# MAGIC | count | Returns the number of rows in the DataFrame |
# MAGIC | describe, summary | Computes basic statistics for numeric and string columns |
# MAGIC | first, head | Returns the first row |
# MAGIC | collect | Returns an array that contains all rows in this DataFrame |
# MAGIC | take | Returns an array of the first n rows in the DataFrame |

# COMMAND ----------

# MAGIC %md `count` returns the number of records in a DataFrame.

# COMMAND ----------

budgetDF.count()

# COMMAND ----------

# MAGIC %md `collect` returns an array of all rows in a DataFrame.

# COMMAND ----------

budgetDF.collect()

# COMMAND ----------
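
# MAGIC %md `take`, listed in the table above, returns only the first n rows, which makes it a safer alternative to `collect` on large DataFrames. A quick sketch:

# COMMAND ----------

# Returns a list of the first two Row objects rather than every row
budgetDF.take(2)

# COMMAND ----------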

# MAGIC %md ## Convert between DataFrames and SQL

# COMMAND ----------

# MAGIC %md
# MAGIC `createOrReplaceTempView` creates a temporary view based on the DataFrame. The lifetime of the temporary view is tied to the SparkSession that was used to create the DataFrame.

# COMMAND ----------

budgetDF.createOrReplaceTempView("budget")

# COMMAND ----------

display(spark.sql("SELECT * FROM budget"))

# COMMAND ----------
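
# MAGIC %md The conversion works in the other direction too: once the view exists, `spark.table` brings it back into the DataFrame API. A minimal sketch:

# COMMAND ----------

# The temporary view can be referenced like any other table name
display(spark.table("budget"))

# COMMAND ----------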

# MAGIC %md
# MAGIC # Spark SQL Lab
# MAGIC
# MAGIC ##### Tasks
# MAGIC 1. Create a DataFrame from the `events` table
# MAGIC 1. Display the DataFrame and inspect its schema
# MAGIC 1. Apply transformations to filter and sort `macOS` events
# MAGIC 1. Count results and take the first 5 rows
# MAGIC 1. Create the same DataFrame using a SQL query
# MAGIC
# MAGIC ##### Methods
# MAGIC - <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html?highlight=sparksession" target="_blank">SparkSession</a>: `sql`, `table`
# MAGIC - <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a> transformations: `select`, `where`, `orderBy`
# MAGIC - <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a> actions: `count`, `take`
# MAGIC - Other <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a> methods: `printSchema`, `schema`, `createOrReplaceTempView`

# COMMAND ----------

# MAGIC %md ### 1. Create a DataFrame from the `events` table
# MAGIC - Use SparkSession to create a DataFrame from the `events` table

# COMMAND ----------

# TODO
eventsDF = FILL_IN

# COMMAND ----------

# MAGIC %md ### 2. Display DataFrame and inspect schema
# MAGIC - Use methods above to inspect DataFrame contents and schema

# COMMAND ----------

# TODO

# COMMAND ----------

# MAGIC %md ### 3. Apply transformations to filter and sort `macOS` events
# MAGIC - Filter for rows where `device` is `macOS`
# MAGIC - Sort rows by `event_timestamp`
# MAGIC
# MAGIC <img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> Use single and double quotes in your filter SQL expression

# COMMAND ----------

# TODO
macDF = (eventsDF
  .FILL_IN
)

# COMMAND ----------

# MAGIC %md ### 4. Count results and take first 5 rows
# MAGIC - Use DataFrame actions to count and take rows

# COMMAND ----------

# TODO
numRows = macDF.FILL_IN
rows = macDF.FILL_IN

# COMMAND ----------

# MAGIC %md **CHECK YOUR WORK**

# COMMAND ----------

from pyspark.sql import Row

assert(numRows == 1938215)
assert(len(rows) == 5)
assert(type(rows[0]) == Row)

# COMMAND ----------

# MAGIC %md ### 5. Create the same DataFrame using a SQL query
# MAGIC - Use SparkSession to run a SQL query on the `events` table
# MAGIC - Use SQL commands to write the same filter and sort query used earlier

# COMMAND ----------

# TODO
macSQLDF = spark.FILL_IN

display(macSQLDF)

# COMMAND ----------

# MAGIC %md **CHECK YOUR WORK**
# MAGIC - You should only see `macOS` values in the `device` column
# MAGIC - The fifth row should be an event with timestamp `1592539226602157`

# COMMAND ----------

verify_rows = macSQLDF.take(5)
assert (macSQLDF.select("device").distinct().count() == 1 and len(verify_rows) == 5 and verify_rows[0]['device'] == "macOS"), "Incorrect filter condition"
assert (verify_rows[4]['event_timestamp'] == 1592539226602157), "Incorrect sorting"
del verify_rows

# COMMAND ----------

# MAGIC %md ### Classroom Cleanup

# COMMAND ----------

# MAGIC %run ./Includes/Classroom-Cleanup

# COMMAND ----------

# MAGIC %md-sandbox
# MAGIC &copy; 2022 Databricks, Inc. All rights reserved.<br/>
# MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
# MAGIC <br/>
# MAGIC <a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>
