A wrapper around the Apache Spark Connect client that adds the functionality applications need to communicate with a remote Dataproc Spark session over the Spark Connect protocol, without requiring additional setup.
Install the client:

```sh
pip install dataproc_spark_connect
```

Uninstall the client:

```sh
pip uninstall dataproc_spark_connect
```

This client requires permissions to manage Dataproc Sessions and Session Templates. If you are running the client outside of Google Cloud, you must set the following environment variables:
- `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark workloads
- `GOOGLE_CLOUD_REGION` - The Compute Engine region where you run the Spark workload
- `GOOGLE_APPLICATION_CREDENTIALS` - Your Application Credentials
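For example, a minimal sketch of setting these variables from Python before creating a session; the project ID, region, and credentials path below are placeholders, and it assumes the builder falls back to these variables when no explicit session config is passed:

```python
import os

# Placeholder values; substitute your own project, region, and key path.
os.environ['GOOGLE_CLOUD_PROJECT'] = 'my-project'
os.environ['GOOGLE_CLOUD_REGION'] = 'us-central1'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/credentials.json'

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# With the variables above set, the builder can create a remote Dataproc
# session without an explicit session configuration.
spark = DataprocSparkSession.builder.getOrCreate()
```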
- Install the latest versions of the Dataproc Python client and Dataproc Spark Connect modules:

  ```sh
  pip install google_cloud_dataproc dataproc_spark_connect --force-reinstall
  ```
- Add the required imports into your PySpark application or notebook and start a Spark session with the following code instead of using environment variables:

  ```python
  from google.cloud.dataproc_spark_connect import DataprocSparkSession
  from google.cloud.dataproc_v1 import Session

  session_config = Session()
  session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
  session_config.runtime_config.version = '2.2'

  spark = DataprocSparkSession.builder.dataprocSessionConfig(session_config).getOrCreate()
  ```
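Once `getOrCreate()` returns, the session behaves like a regular PySpark `SparkSession` over Spark Connect. A minimal smoke test using only standard PySpark APIs (the column names and rows here are arbitrary):

```python
# Build a small DataFrame locally and run it through the remote session.
df = spark.createDataFrame([(1, 'spark'), (2, 'connect')], ['id', 'name'])
df.show()

# Stop the session to release the remote Dataproc resources.
spark.stop()
```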
The package supports the sparksql-magic library for executing Spark SQL queries directly in Jupyter notebooks.
Installation: To use magic commands, install the required dependencies manually:
```sh
pip install dataproc-spark-connect
pip install IPython sparksql-magic
```

- Load the magic extension:

  ```
  %load_ext sparksql_magic
  ```
- Configure default settings (optional):

  ```
  %config SparkSql.limit=20
  ```
- Execute SQL queries:

  ```
  %%sparksql
  SELECT * FROM your_table
  ```
- Advanced usage with options (cache the results and create a view):

  ```
  %%sparksql --cache --view result_view df
  SELECT * FROM your_table WHERE condition = true
  ```
Available options:

- `--cache`/`-c`: Cache the DataFrame
- `--eager`/`-e`: Cache with eager loading
- `--view VIEW`/`-v VIEW`: Create a temporary view
- `--limit N`/`-l N`: Override the default row display limit
- `variable_name`: Store the result in a variable
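As a rough illustration of combining these options (the variable and view names mirror the advanced example above; the table and condition remain placeholders), the stored DataFrame and temporary view should stay usable from ordinary Python and SQL afterwards:

```python
# `df` was populated by the %%sparksql cell above; it is a regular DataFrame.
df.printSchema()

# The temporary view created with --view is visible to the Spark session too.
spark.sql('SELECT COUNT(*) AS n FROM result_view').show()
```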
See sparksql-magic for more examples.
Note: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:
```sh
pip install dataproc-spark-connect
```

For development instructions, see the guide.
We'd love to accept your patches and contributions to this project. There are just a few small guidelines you need to follow.
Contributions to this project must be accompanied by a Contributor License Agreement. You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project. Head over to https://cla.developers.google.com to see your current agreements on file or to sign a new one.
You generally only need to submit a CLA once, so if you've already submitted one (even if it was for a different project), you probably don't need to do it again.
All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.