# COMMAND ----------

# MAGIC %md ## Multiple Interfaces
# MAGIC Spark SQL is a module for structured data processing with multiple interfaces.
# MAGIC
# MAGIC We can interact with Spark SQL in two ways:
# MAGIC 1. Executing SQL queries
# MAGIC 2. Working with the DataFrame API
# COMMAND ----------

# MAGIC %md
# MAGIC **Method 1: Executing SQL queries**
# MAGIC
# MAGIC This is how we interacted with Spark SQL in the previous lesson.

# COMMAND ----------

# MAGIC %md **Method 2: Working with the DataFrame API**
# MAGIC
# MAGIC We can also express Spark SQL queries using the DataFrame API.
# MAGIC The following cell returns a DataFrame containing the same results as those retrieved above.

# COMMAND ----------

# COMMAND ----------

# MAGIC %md We'll go over the syntax for the DataFrame API later in the lesson, but you can see this builder design pattern allows us to chain a sequence of operations very similar to those we find in SQL.

# COMMAND ----------

# MAGIC %md ## Spark API Documentation
# MAGIC
# MAGIC To learn how we work with DataFrames in Spark SQL, let's first look at the Spark API documentation.
# MAGIC The main Spark <a href="https://spark.apache.org/docs/latest/" target="_blank">documentation</a> page includes links to API docs and helpful guides for each version of Spark.
# MAGIC
# MAGIC The <a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html" target="_blank">Scala API</a> and <a href="https://spark.apache.org/docs/latest/api/python/index.html" target="_blank">Python API</a> are most commonly used, and it's often helpful to reference the documentation for both languages.
# MAGIC Scala docs tend to be more comprehensive, and Python docs tend to have more code examples.
# MAGIC
# MAGIC #### Navigating Docs for the Spark SQL Module
# MAGIC Find the Spark SQL module by navigating to **`org.apache.spark.sql`** in the Scala API or **`pyspark.sql`** in the Python API.
# MAGIC The first class we'll explore in this module is the **`SparkSession`** class. You can find this by entering "SparkSession" in the search bar.

# COMMAND ----------

# MAGIC %md
# MAGIC ## SparkSession
# MAGIC The **`SparkSession`** class is the single entry point to all functionality in Spark using the DataFrame API.
# MAGIC
# MAGIC In Databricks notebooks, the SparkSession is created for you, stored in a variable called **`spark`**.

# COMMAND ----------

# MAGIC %md
# MAGIC Below are several additional methods we can use to create DataFrames. All of these can be found in the <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html" target="_blank">documentation</a> for **`SparkSession`**.
# MAGIC
# MAGIC #### **`SparkSession`** Methods
# MAGIC | Method | Description |
# MAGIC | --- | --- |
# MAGIC | sql | Returns a DataFrame representing the result of the given query |
# MAGIC | table | Returns the specified table as a DataFrame |
# MAGIC | read | Returns a DataFrameReader that can be used to read data in as a DataFrame |
# MAGIC | range | Creates a DataFrame with a column containing elements in a range from start to end (exclusive) with step value and number of partitions |

# COMMAND ----------

# MAGIC %md ## Transformations
# MAGIC When we created **`budget_df`**, we used a series of DataFrame transformation methods, e.g. **`select`**, **`where`**, **`orderBy`**.
# MAGIC
# MAGIC <strong><code>products_df
# MAGIC .select("name", "price")
# MAGIC .where("price < 200")
# MAGIC .orderBy("price")
# MAGIC </code></strong>
# MAGIC
# MAGIC Transformations operate on and return DataFrames, allowing us to chain transformation methods together to construct new DataFrames.
# MAGIC However, these operations can't execute on their own, as transformation methods are **lazily evaluated**.
# MAGIC
# MAGIC Running the following cell does not trigger any computation.

# COMMAND ----------

# MAGIC %md ## Actions
# MAGIC Conversely, DataFrame actions are methods that **trigger computation**.
# MAGIC Actions are needed to trigger the execution of any DataFrame transformations.
# MAGIC
# MAGIC The **`show`** action causes the following cell to execute transformations.


# COMMAND ----------

# MAGIC %md
# MAGIC **`count`** returns the number of records in a DataFrame.

# COMMAND ----------

# COMMAND ----------

# MAGIC %md
# MAGIC **`collect`** returns an array of all rows in a DataFrame.

# COMMAND ----------

budget_df.collect()

# COMMAND ----------
