
Conversation


@wengh (Contributor) commented Mar 27, 2025

The Kaggle data source is a thin wrapper around kagglehub.dataset_load, which loads a Kaggle dataset as a pandas DataFrame.
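
For context, a minimal usage sketch mirroring the test added in this PR (the Titanic handle and file name are just the example dataset used there):

```python
from pyspark.sql import SparkSession
from pyspark_datasources import KaggleDataSource  # requires the optional kagglehub dependency

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(KaggleDataSource)

df = (
    spark.read.format("kaggle")
    .options(handle="yasserh/titanic-dataset")  # Kaggle dataset handle: owner/dataset
    .load("Titanic-Dataset.csv")                # file within the dataset
)
df.show()
```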

Addition of Kaggle Data Source:

  • pyspark_datasources/kaggle.py: Added the KaggleDataSource and KaggleDataReader classes for reading Kaggle datasets into Spark.
  • pyspark_datasources/__init__.py: Exported KaggleDataSource at the package level.

Documentation Updates:

  • docs/datasources/kaggle.md: Added documentation for the KaggleDataSource, including requirements and usage examples.
  • docs/index.md: Updated the index to include the KaggleDataSource in the list of available data sources.

Project Configuration:

  • pyproject.toml: Added the kagglehub library as an optional dependency for the project.

Testing:

  • tests/test_data_sources.py: Added a new test case for the KaggleDataSource to ensure it can read a dataset from Kaggle correctly.

Summary by CodeRabbit

  • New Features
    • Introduced support for reading datasets directly from Kaggle into Spark DataFrames via a new data source.
  • Documentation
    • Added documentation for the Kaggle data source, including usage details and requirements.
    • Updated the data sources overview to include Kaggle.
  • Chores
    • Added and updated dependencies to support Kaggle integration.
  • Tests
    • Added tests to verify Kaggle data source functionality.
wengh commented Mar 27, 2025

@allisonwang-db
oops I accidentally closed #7 when rebasing

Review thread on the new data source's imports:

```python
from functools import cached_property
from typing import Iterator

import pyarrow as pa
```

allisonwang-db (Owner):
Do we have to depend on pyarrow? Can we throw a better error message if pyarrow is not installed?

wengh (Contributor, Author):

Good point. Depending on pyarrow should be fine, since the PySpark data source API itself depends on pyarrow. Let's import pyarrow lazily so that PySpark shows its own error message if it's missing.
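
For illustration, a minimal sketch of that lazy-import pattern (ExampleReader and its placeholder data are hypothetical, not the PR's actual code):

```python
from typing import TYPE_CHECKING, Iterator

if TYPE_CHECKING:
    # Only visible to type checkers; nothing is imported at runtime here.
    import pyarrow as pa


class ExampleReader:
    def read(self, partition) -> Iterator["pa.RecordBatch"]:
        # Deferred import: if pyarrow is missing, the failure surfaces when Spark
        # actually runs the reader (where PySpark reports the missing dependency),
        # not when this package is imported.
        import pyarrow as pa

        table = pa.table({"value": [1, 2, 3]})  # placeholder data for the sketch
        yield from table.to_batches()
```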

@allisonwang-db (Owner) left a comment:

🚢

@allisonwang-db (Owner) commented:

@coderabbitai review


coderabbitai bot commented Apr 15, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Apr 15, 2025

Walkthrough

This update introduces a new data source, KaggleDataSource, enabling Spark to read datasets directly from Kaggle via the kagglehub library. Documentation is expanded to include usage instructions for this data source, and the documentation navigation is updated accordingly. The package’s dependencies are updated to include kagglehub and pyarrow, and new extras groups are defined for optional dependencies. The KaggleDataSource is made publicly available in the main package, and a corresponding test is added to verify its integration and data loading functionality.

Changes

| File(s) | Change Summary |
|---|---|
| pyspark_datasources/kaggle.py | Added new KaggleDataSource and KaggleDataReader classes for reading Kaggle datasets into Spark. |
| pyspark_datasources/__init__.py | Imported KaggleDataSource to make it publicly accessible in the package. |
| docs/datasources/kaggle.md | Added documentation for the new KaggleDataSource, including usage and dependency information. |
| docs/index.md | Updated data sources table to include KaggleDataSource with dependencies and improved formatting. |
| mkdocs.yml | Added datasources/kaggle.md to the documentation navigation under "Data Sources". |
| pyproject.toml | Added pyarrow and kagglehub as dependencies, updated extras, and removed dev-only pyarrow. |
| tests/test_data_sources.py | Added test function test_kaggle_datasource to verify Kaggle data source integration. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Spark
    participant KaggleDataSource
    participant kagglehub
    User->>Spark: Read DataFrame (format="kaggle", options)
    Spark->>KaggleDataSource: Initialize with options
    KaggleDataSource->>kagglehub: dataset_load(handle, path, ...)
    kagglehub-->>KaggleDataSource: Returns pandas DataFrame
    KaggleDataSource->>KaggleDataSource: Convert to PyArrow Table & cache
    KaggleDataSource->>Spark: Provide schema and data reader
    Spark->>User: Returns DataFrame
```
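To make the walkthrough concrete, here is a rough structural sketch of a data source built on the pyspark.sql.datasource API. Everything in it (class names, the in-memory pandas DataFrame, the hard-coded schema) is illustrative; the actual KaggleDataSource delegates to kagglehub.dataset_load and derives its schema and record batches from the cached result.

```python
from functools import cached_property
from typing import Iterator

import pandas as pd
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import LongType, StringType, StructField, StructType


class DemoPandasDataSource(DataSource):
    """Hypothetical data source that caches a pandas DataFrame as a PyArrow table."""

    @classmethod
    def name(cls) -> str:
        return "demo_pandas"

    @cached_property
    def _data(self):
        import pyarrow as pa  # deferred import, as discussed in the review thread

        # Stand-in for kagglehub.dataset_load(...); any pandas DataFrame works here.
        pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
        return pa.Table.from_pandas(pdf)

    def schema(self) -> StructType:
        # Hard-coded for the sketch; the real source infers this from the cached data.
        return StructType(
            [StructField("name", StringType()), StructField("age", LongType())]
        )

    def reader(self, schema) -> "DataSourceReader":
        return DemoPandasReader(self)


class DemoPandasReader(DataSourceReader):
    def __init__(self, source: DemoPandasDataSource):
        self.source = source

    def read(self, partition) -> Iterator:
        # Arrow record batches can be yielded directly to Spark.
        yield from self.source._data.to_batches()
```

Registration and reading then follow the same pattern as in the tests: spark.dataSource.register(DemoPandasDataSource) followed by spark.read.format("demo_pandas").load().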

Poem

In the warren where data hops free,
Kaggle joins the Spark party with glee!
With kagglehub in tow and pyarrow in store,
Datasets leap in, through a brand-new door.
Docs now sparkle, the tests all agree—
More data to munch for you and for me!
🐇✨

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
tests/test_data_sources.py (3)

28-33: Consider making the test more robust against dataset changes

This test successfully validates that the KaggleDataSource can load a dataset from Kaggle. However, it relies on a specific external dataset with hardcoded expectations about the row and column counts, which could make the test brittle if the dataset changes.

Consider either:

  1. Mocking the Kaggle API to avoid external dependencies in unit tests
  2. Adding a fallback assertion to prevent test failures if the dataset is slightly modified
  3. Using a more stable dataset or one that you control
```diff
 def test_kaggle_datasource(spark):
     spark.dataSource.register(KaggleDataSource)
     df = spark.read.format("kaggle").options(handle="yasserh/titanic-dataset").load("Titanic-Dataset.csv")
     df.show()
-    assert df.count() == 891
-    assert len(df.columns) == 12
+    # Check that data was loaded and has expected structure
+    row_count = df.count()
+    assert row_count > 0, "DataFrame should not be empty"
+    assert len(df.columns) > 0, "DataFrame should have columns"
+    # Log actual values to help debug future failures
+    print(f"Loaded {row_count} rows with {len(df.columns)} columns")
```
🧰 Tools
🪛 Ruff (0.8.2)

29-29: KaggleDataSource may be undefined, or defined from star imports

(F405)


28-29: Explicit imports improve code clarity and avoid F405 warnings

Static analysis tools flagged that KaggleDataSource may be undefined due to star imports. While it works since the class is properly exported, explicit imports are preferred for better code clarity and to avoid namespace pollution.

```diff
- from pyspark_datasources import *
+ from pyspark_datasources import FakeDataSource, GithubDataSource, KaggleDataSource
```


28-33: Add tests for error handling scenarios

The current test verifies the happy path, but it's important to test error scenarios as well, such as invalid dataset handles, authentication failures, or network issues.

Consider adding tests for error scenarios:

```python
def test_kaggle_datasource_invalid_handle(spark):
    spark.dataSource.register(KaggleDataSource)
    with pytest.raises(Exception) as excinfo:
        spark.read.format("kaggle").options(handle="invalid/dataset").load("nonexistent.csv")
    assert "dataset not found" in str(excinfo.value).lower() or "invalid handle" in str(excinfo.value).lower()
```

pyspark_datasources/kaggle.py (1)

111-112: Handle large datasets more efficiently

The current implementation reads all data into memory at once, which may not be efficient for very large datasets. Consider supporting batched reading or partitioning.

Add support for partitioning the data:

```diff
-def read(self, partition) -> Iterator["pa.RecordBatch"]:
-    yield from self.source._data.to_batches()
+def read(self, partition) -> Iterator["pa.RecordBatch"]:
+    # Get the number of batches from options with a default value
+    options_copy = self.source.options.copy()
+    batch_size = options_copy.get("batch_size", 10000)
+    try:
+        batch_size = int(batch_size)
+    except (ValueError, TypeError):
+        batch_size = 10000
+
+    # Generate batches with the specified size
+    table = self.source._data
+    yield from table.to_batches(max_chunksize=batch_size)
+
+# Update the docstring to document the new option
+# Add to the Options section:
+# - `batch_size`: The maximum number of rows per batch (default: 10000).
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7495ae7 and e897d0d.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • docs/datasources/kaggle.md (1 hunks)
  • docs/index.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyproject.toml (1 hunks)
  • pyspark_datasources/__init__.py (1 hunks)
  • pyspark_datasources/kaggle.py (1 hunks)
  • tests/test_data_sources.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pyspark_datasources/__init__.py (1)
pyspark_datasources/kaggle.py (1)
  • KaggleDataSource (13-104)
🪛 Ruff (0.8.2)
tests/test_data_sources.py

29-29: KaggleDataSource may be undefined, or defined from star imports

(F405)

pyspark_datasources/__init__.py

5-5: .kaggle.KaggleDataSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

🔇 Additional comments (8)
pyproject.toml (1)

14-20: Dependency additions and extras grouping look correct.

The changes to dependencies and extras are appropriate for supporting the new Kaggle data source. No issues found.

Also applies to: 26-28

mkdocs.yml (1)

27-27: Documentation navigation updated appropriately.

The addition of the Kaggle data source documentation to the navigation is correct.

docs/datasources/kaggle.md (1)

1-5: KaggleDataSource documentation is clear and complete.

The documentation for the new data source is concise and follows project conventions.

docs/index.md (1)

31-39: Data sources table updated correctly.

The new KaggleDataSource entry and improved formatting are accurate and helpful.

pyspark_datasources/__init__.py (1)

5-5: KaggleDataSource import is correct for public API exposure.

The import exposes KaggleDataSource at the package level, matching the pattern for other data sources. The Ruff warning about unused import can be safely ignored in this context.


pyspark_datasources/kaggle.py (3)

13-60: Excellent documentation for the data source

The comprehensive docstring for KaggleDataSource includes clear explanations of the purpose, options, authentication requirements, and usage examples, making it easy for users to understand and use this data source.


1-11: Good type checking and imports organization

The imports are well-organized, and the use of TYPE_CHECKING for pyarrow is a good practice to avoid unnecessary imports at runtime while still providing type hints for static analyzers.


87-89: Good use of temporary directory to avoid permission issues

The use of a temporary directory for caching and explicitly setting KAGGLEHUB_CACHE is a good approach to avoid permission issues with the default cache location.

Comment on lines +82 to +89
```python
os.environ["KAGGLE_USERNAME"] = username
os.environ["KAGGLE_KEY"] = key

kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()}

# Cache in a temporary directory to avoid writing to ~ which may be read-only
with tempfile.TemporaryDirectory() as tmpdir:
    os.environ["KAGGLEHUB_CACHE"] = tmpdir
```
@coderabbitai bot commented Apr 15, 2025

🛠️ Refactor suggestion

Avoid setting environment variables directly

Setting environment variables directly can have unintended side effects, especially in multi-threaded environments or when tests run concurrently. Consider using a more isolated approach.

Use a context manager to temporarily set and restore environment variables:

```diff
+import contextlib
+import os
+
+@contextlib.contextmanager
+def temp_environ():
+    """Context manager to temporarily modify os.environ."""
+    env_copy = os.environ.copy()
+    try:
+        yield
+    finally:
+        os.environ.clear()
+        os.environ.update(env_copy)

 # In the _data method:
-if username or key:
-    if not (username and key):
-        raise ValueError(
-            "Both username and key must be provided to authenticate."
-        )
-    os.environ["KAGGLE_USERNAME"] = username
-    os.environ["KAGGLE_KEY"] = key
-
-# Cache in a temporary directory to avoid writing to ~ which may be read-only
-with tempfile.TemporaryDirectory() as tmpdir:
-    os.environ["KAGGLEHUB_CACHE"] = tmpdir
-    import kagglehub
+with temp_environ():
+    if username or key:
+        if not (username and key):
+            raise ValueError(
+                "Both username and key must be provided to authenticate."
+            )
+        os.environ["KAGGLE_USERNAME"] = username
+        os.environ["KAGGLE_KEY"] = key
+
+    # Cache in a temporary directory to avoid writing to ~ which may be read-only
+    with tempfile.TemporaryDirectory() as tmpdir:
+        os.environ["KAGGLEHUB_CACHE"] = tmpdir
+        import kagglehub
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-os.environ["KAGGLE_USERNAME"] = username
-os.environ["KAGGLE_KEY"] = key
-
-kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()}
-
-# Cache in a temporary directory to avoid writing to ~ which may be read-only
-with tempfile.TemporaryDirectory() as tmpdir:
-    os.environ["KAGGLEHUB_CACHE"] = tmpdir
+import contextlib
+import os
+import tempfile
+import ast
+# Other necessary imports
+
+@contextlib.contextmanager
+def temp_environ():
+    """Context manager to temporarily modify os.environ."""
+    env_copy = os.environ.copy()
+    try:
+        yield
+    finally:
+        os.environ.clear()
+        os.environ.update(env_copy)
+
+# ... inside the _data method of the module
+kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()}
+with temp_environ():
+    if username or key:
+        if not (username and key):
+            raise ValueError(
+                "Both username and key must be provided to authenticate."
+            )
+        os.environ["KAGGLE_USERNAME"] = username
+        os.environ["KAGGLE_KEY"] = key
+
+    # Cache in a temporary directory to avoid writing to ~ which may be read-only
+    with tempfile.TemporaryDirectory() as tmpdir:
+        os.environ["KAGGLEHUB_CACHE"] = tmpdir
+        import kagglehub
```
wengh (Contributor, Author) replied:

It runs in a separate process, so setting environment variables should be fine.

coderabbitai bot replied:

You are on the CodeRabbit Free Plan. In order to use the Chat feature, please contact CodeRabbit support.

@allisonwang-db merged commit 3e7564d into allisonwang-db:master on Apr 18, 2025