
Commit 3ee0484

Update README to add new data sources (#22)
* update README
* more update
1 parent 7c06061 commit 3ee0484

5 files changed: 39 additions & 10 deletions

README.md

Lines changed: 17 additions & 10 deletions
@@ -38,16 +38,23 @@ spark.readStream.format("fake").load().writeStream.format("console").start()

## Example Data Sources

-| Data Source | Short Name | Description | Dependencies |
-|-------------------------------------------------------------------------|----------------|-----------------------------------------------|-----------------------|
-| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
-| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | `faker` |
-| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
-| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py) | `googlesheets` | Read table from public Google Sheets | None |
-| [KaggleDataSource](pyspark_datasources/kaggle.py) | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |
-| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
-| [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Read from OpenSky Network. | None |
-| [SalesforceDataSource](pyspark_datasources/salesforce.py) | `pyspark.datasource.salesforce` | Streaming datasource for writing data to Salesforce | `simple-salesforce` |
+| Data Source | Short Name | Type | Description | Dependencies | Example |
+|-------------------------------------------------------------------------|----------------|----------------|-----------------------------------------------|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **Batch Read** | | | | | |
+| [ArrowDataSource](pyspark_datasources/arrow.py) | `arrow` | Batch Read | Read Apache Arrow files (.arrow) | `pyarrow` | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")` |
+| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Batch/Streaming Read | Generate fake data using the `Faker` library | `faker` | `pip install pyspark-data-sources[fake]`<br/>`spark.read.format("fake").load()` or `spark.readStream.format("fake").load()` |
+| [GithubDataSource](pyspark_datasources/github.py) | `github` | Batch Read | Read pull requests from a Github repository | None | `pip install pyspark-data-sources`<br/>`spark.read.format("github").load("apache/spark")` |
+| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py) | `googlesheets` | Batch Read | Read table from public Google Sheets | None | `pip install pyspark-data-sources`<br/>`spark.read.format("googlesheets").load("https://docs.google.com/spreadsheets/d/...")` |
+| [HuggingFaceDatasets](pyspark_datasources/huggingface.py) | `huggingface` | Batch Read | Read datasets from HuggingFace Hub | `datasets` | `pip install pyspark-data-sources[huggingface]`<br/>`spark.read.format("huggingface").load("imdb")` |
+| [KaggleDataSource](pyspark_datasources/kaggle.py) | `kaggle` | Batch Read | Read datasets from Kaggle | `kagglehub`, `pandas` | `pip install pyspark-data-sources[kaggle]`<br/>`spark.read.format("kaggle").load("titanic")` |
+| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Batch Read | Read stock data from Alpha Vantage | None | `pip install pyspark-data-sources`<br/>`spark.read.format("stock").option("symbols", "AAPL,GOOGL").option("api_key", "key").load()` |
+| **Batch Write** | | | | | |
+| [LanceSink](pyspark_datasources/lance.py) | `lance` | Batch Write | Write data in Lance format | `lance` | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")` |
+| **Streaming Read** | | | | | |
+| [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Streaming Read | Read from OpenSky Network. | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()` |
+| [WeatherDataSource](pyspark_datasources/weather.py) | `weather` | Streaming Read | Fetch weather data from tomorrow.io | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()` |
+| **Streaming Write** | | | | | |
+| [SalesforceDataSource](pyspark_datasources/salesforce.py) | `pyspark.datasource.salesforce` | Streaming Write | Streaming datasource for writing data to Salesforce | `simple-salesforce` | `pip install pyspark-data-sources[salesforce]`<br/>`df.writeStream.format("pyspark.datasource.salesforce").option("username", "user").start()` |

See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

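The Example column above pairs each source with its install extra and `format(...)` call. A minimal end-to-end sketch of that pattern, not part of this diff and assuming PySpark 4.0+ with the Python Data Source API (`spark.dataSource.register`), using the `fake` source:

```python
# Sketch only: install with `pip install pyspark-data-sources[fake]`.
from pyspark.sql import SparkSession
from pyspark_datasources.fake import FakeDataSource

spark = SparkSession.builder.getOrCreate()

# Register the class so the short name "fake" resolves in format().
spark.dataSource.register(FakeDataSource)

# Batch read: a small DataFrame of generated rows.
spark.read.format("fake").load().show()

# The same source also supports streaming reads, as in the hunk header above.
query = (
    spark.readStream.format("fake").load()
    .writeStream.format("console")
    .start()
)
query.awaitTermination(10)  # run briefly for demonstration
query.stop()
```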
docs/datasources/arrow.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# ArrowDataSource
+
+> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
+> or use `pip install pyspark-data-sources[arrow]`.
+
+::: pyspark_datasources.arrow.ArrowDataSource

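The page above only references the class; a hedged usage sketch for the `arrow` short name, based on the README table's Example column (the temporary file path and sample data are illustrative):

```python
# Sketch only: install with `pip install pyspark-data-sources[arrow]` (pulls in pyarrow).
import pyarrow as pa
import pyarrow.ipc as ipc
from pyspark.sql import SparkSession
from pyspark_datasources.arrow import ArrowDataSource

# Write a tiny Arrow IPC file so there is something to read back.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
with ipc.new_file("/tmp/example.arrow", table.schema) as writer:
    writer.write_table(table)

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ArrowDataSource)

spark.read.format("arrow").load("/tmp/example.arrow").show()
```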
docs/datasources/lance.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# LanceSink
+
+> Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
+> or use `pip install pyspark-data-sources[lance]`.
+
+::: pyspark_datasources.lance.LanceSink

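A matching write-side sketch for the `lance` sink, again assembled from the README table's Example column rather than anything shown in this file:

```python
# Sketch only: install with `pip install pyspark-data-sources[lance]`.
from pyspark.sql import SparkSession
from pyspark_datasources.lance import LanceSink

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(LanceSink)

# Write a small DataFrame as a Lance dataset under /tmp/lance_data.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("lance").mode("append").save("/tmp/lance_data")
```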
docs/datasources/opensky.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# OpenSkyDataSource
+
+> No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.
+
+::: pyspark_datasources.opensky.OpenSkyDataSource

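For the streaming readers, a sketch of the `opensky` source; the `region` option value comes from the README table, and other options are not documented in this diff:

```python
# Sketch only: no extra dependencies beyond `pip install pyspark-data-sources`.
from pyspark.sql import SparkSession
from pyspark_datasources.opensky import OpenSkyDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(OpenSkyDataSource)

# Stream live aircraft state vectors for a region to the console.
query = (
    spark.readStream.format("opensky")
    .option("region", "EUROPE")
    .load()
    .writeStream.format("console")
    .start()
)
query.awaitTermination(30)  # run briefly for demonstration
query.stop()
```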
docs/datasources/weather.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# WeatherDataSource
+
+> No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.
+
+::: pyspark_datasources.weather.WeatherDataSource

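Likewise for the `weather` source; the `locations` and `apikey` option names are taken from the README table, and the key value below is a placeholder:

```python
# Sketch only: requires a Tomorrow.io API key ("YOUR_API_KEY" is a placeholder).
from pyspark.sql import SparkSession
from pyspark_datasources.weather import WeatherDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(WeatherDataSource)

query = (
    spark.readStream.format("weather")
    .option("locations", "[(37.7749, -122.4194)]")  # list of (lat, lon) pairs
    .option("apikey", "YOUR_API_KEY")
    .load()
    .writeStream.format("console")
    .start()
)
```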