
Conversation


@wengh (Contributor) commented Mar 27, 2025

The Kaggle data source is a thin wrapper around kagglehub.dataset_load, which loads a Kaggle dataset as a pandas DataFrame.
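
For context, a minimal usage sketch mirroring the test added in this PR (the Titanic handle and file name are just the example dataset used there):

```python
from pyspark.sql import SparkSession
from pyspark_datasources import KaggleDataSource  # requires the optional kagglehub dependency

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(KaggleDataSource)

df = (
    spark.read.format("kaggle")
    .options(handle="yasserh/titanic-dataset")  # Kaggle dataset handle: owner/dataset
    .load("Titanic-Dataset.csv")                # file within the dataset
)
df.show()
```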

Addition of Kaggle Data Source:

  • pyspark_datasources/kaggle.py: Added the KaggleDataSource and KaggleDataReader classes for reading Kaggle datasets into Spark.
  • pyspark_datasources/__init__.py: Exported KaggleDataSource at the package level.

Documentation Updates:

  • docs/datasources/kaggle.md: Added documentation for the KaggleDataSource, including requirements and usage examples.
  • docs/index.md: Updated the index to include the KaggleDataSource in the list of available data sources.

Project Configuration:

  • pyproject.toml: Added the kagglehub library as an optional dependency for the project.

Testing:

  • tests/test_data_sources.py: Added a new test case for the KaggleDataSource to ensure it can read a dataset from Kaggle correctly.

Summary by CodeRabbit

  • New Features
    • Introduced support for reading datasets directly from Kaggle into Spark DataFrames via a new data source.
  • Documentation
    • Added documentation for the Kaggle data source, including usage details and requirements.
    • Updated the data sources overview to include Kaggle.
  • Chores
    • Added and updated dependencies to support Kaggle integration.
  • Tests
    • Added tests to verify Kaggle data source functionality.
wengh commented Mar 27, 2025

@allisonwang-db
oops I accidentally closed #7 when rebasing

Review thread on the new data source's imports:

```python
from functools import cached_property
from typing import Iterator

import pyarrow as pa
```

allisonwang-db (Owner):
Do we have to depend on pyarrow? Can we throw a better error message if pyarrow is not installed?

wengh (Contributor, Author):

Good point. Depending on pyarrow should be fine, since the PySpark data source API itself depends on pyarrow. Let's import pyarrow lazily so that PySpark shows its own error message if it's missing.
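
For illustration, a minimal sketch of that lazy-import pattern (ExampleReader and its placeholder data are hypothetical, not the PR's actual code):

```python
from typing import TYPE_CHECKING, Iterator

if TYPE_CHECKING:
    # Only visible to type checkers; nothing is imported at runtime here.
    import pyarrow as pa


class ExampleReader:
    def read(self, partition) -> Iterator["pa.RecordBatch"]:
        # Deferred import: if pyarrow is missing, the failure surfaces when Spark
        # actually runs the reader (where PySpark reports the missing dependency),
        # not when this package is imported.
        import pyarrow as pa

        table = pa.table({"value": [1, 2, 3]})  # placeholder data for the sketch
        yield from table.to_batches()
```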

@allisonwang-db (Owner) left a comment:

🚢

@allisonwang-db (Owner) commented:

@coderabbitai review


coderabbitai bot commented Apr 15, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Apr 15, 2025

Walkthrough

This update introduces a new data source, KaggleDataSource, enabling Spark to read datasets directly from Kaggle via the kagglehub library. Documentation is expanded to include usage instructions for this data source, and the documentation navigation is updated accordingly. The package’s dependencies are updated to include kagglehub and pyarrow, and new extras groups are defined for optional dependencies. The KaggleDataSource is made publicly available in the main package, and a corresponding test is added to verify its integration and data loading functionality.

Changes

| File(s) | Change Summary |
|---|---|
| pyspark_datasources/kaggle.py | Added new KaggleDataSource and KaggleDataReader classes for reading Kaggle datasets into Spark. |
| pyspark_datasources/__init__.py | Imported KaggleDataSource to make it publicly accessible in the package. |
| docs/datasources/kaggle.md | Added documentation for the new KaggleDataSource, including usage and dependency information. |
| docs/index.md | Updated data sources table to include KaggleDataSource with dependencies and improved formatting. |
| mkdocs.yml | Added datasources/kaggle.md to the documentation navigation under "Data Sources". |
| pyproject.toml | Added pyarrow and kagglehub as dependencies, updated extras, and removed dev-only pyarrow. |
| tests/test_data_sources.py | Added test function test_kaggle_datasource to verify Kaggle data source integration. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Spark
    participant KaggleDataSource
    participant kagglehub
    User->>Spark: Read DataFrame (format="kaggle", options)
    Spark->>KaggleDataSource: Initialize with options
    KaggleDataSource->>kagglehub: dataset_load(handle, path, ...)
    kagglehub-->>KaggleDataSource: Returns pandas DataFrame
    KaggleDataSource->>KaggleDataSource: Convert to PyArrow Table & cache
    KaggleDataSource->>Spark: Provide schema and data reader
    Spark->>User: Returns DataFrame
```
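To make the walkthrough concrete, here is a rough structural sketch of a data source built on the pyspark.sql.datasource API. Everything in it (class names, the in-memory pandas DataFrame, the hard-coded schema) is illustrative; the actual KaggleDataSource delegates to kagglehub.dataset_load and derives its schema and record batches from the cached result.

```python
from functools import cached_property
from typing import Iterator

import pandas as pd
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import LongType, StringType, StructField, StructType


class DemoPandasDataSource(DataSource):
    """Hypothetical data source that caches a pandas DataFrame as a PyArrow table."""

    @classmethod
    def name(cls) -> str:
        return "demo_pandas"

    @cached_property
    def _data(self):
        import pyarrow as pa  # deferred import, as discussed in the review thread

        # Stand-in for kagglehub.dataset_load(...); any pandas DataFrame works here.
        pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
        return pa.Table.from_pandas(pdf)

    def schema(self) -> StructType:
        # Hard-coded for the sketch; the real source infers this from the cached data.
        return StructType(
            [StructField("name", StringType()), StructField("age", LongType())]
        )

    def reader(self, schema) -> "DataSourceReader":
        return DemoPandasReader(self)


class DemoPandasReader(DataSourceReader):
    def __init__(self, source: DemoPandasDataSource):
        self.source = source

    def read(self, partition) -> Iterator:
        # Arrow record batches can be yielded directly to Spark.
        yield from self.source._data.to_batches()
```

Registration and reading then follow the same pattern as in the tests: spark.dataSource.register(DemoPandasDataSource) followed by spark.read.format("demo_pandas").load().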

Poem

In the warren where data hops free,
Kaggle joins the Spark party with glee!
With kagglehub in tow and pyarrow in store,
Datasets leap in, through a brand-new door.
Docs now sparkle, the tests all agree—
More data to munch for you and for me!
🐇✨

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
tests/test_data_sources.py (3)

28-33: Consider making the test more robust against dataset changes

This test successfully validates that the KaggleDataSource can load a dataset from Kaggle. However, it relies on a specific external dataset with hardcoded expectations about the row and column counts, which could make the test brittle if the dataset changes.

Consider either:

  1. Mocking the Kaggle API to avoid external dependencies in unit tests
  2. Adding a fallback assertion to prevent test failures if the dataset is slightly modified
  3. Using a more stable dataset or one that you control
```diff
 def test_kaggle_datasource(spark):
     spark.dataSource.register(KaggleDataSource)
     df = spark.read.format("kaggle").options(handle="yasserh/titanic-dataset").load("Titanic-Dataset.csv")
     df.show()
-    assert df.count() == 891
-    assert len(df.columns) == 12
+    # Check that data was loaded and has expected structure
+    row_count = df.count()
+    assert row_count > 0, "DataFrame should not be empty"
+    assert len(df.columns) > 0, "DataFrame should have columns"
+    # Log actual values to help debug future failures
+    print(f"Loaded {row_count} rows with {len(df.columns)} columns")
```
🧰 Tools
🪛 Ruff (0.8.2)

29-29: KaggleDataSource may be undefined, or defined from star imports

(F405)


28-29: Explicit imports improve code clarity and avoid F405 warnings

Static analysis tools flagged that KaggleDataSource may be undefined due to star imports. While it works since the class is properly exported, explicit imports are preferred for better code clarity and to avoid namespace pollution.

```diff
- from pyspark_datasources import *
+ from pyspark_datasources import FakeDataSource, GithubDataSource, KaggleDataSource
```


28-33: Add tests for error handling scenarios

The current test verifies the happy path, but it's important to test error scenarios as well, such as invalid dataset handles, authentication failures, or network issues.

Consider adding tests for error scenarios:

```python
def test_kaggle_datasource_invalid_handle(spark):
    spark.dataSource.register(KaggleDataSource)
    with pytest.raises(Exception) as excinfo:
        spark.read.format("kaggle").options(handle="invalid/dataset").load("nonexistent.csv")
    assert "dataset not found" in str(excinfo.value).lower() or "invalid handle" in str(excinfo.value).lower()
```

pyspark_datasources/kaggle.py (1)

111-112: Handle large datasets more efficiently

The current implementation reads all data into memory at once, which may not be efficient for very large datasets. Consider supporting batched reading or partitioning.

Add support for partitioning the data:

```diff
-def read(self, partition) -> Iterator["pa.RecordBatch"]:
-    yield from self.source._data.to_batches()
+def read(self, partition) -> Iterator["pa.RecordBatch"]:
+    # Get the number of batches from options with a default value
+    options_copy = self.source.options.copy()
+    batch_size = options_copy.get("batch_size", 10000)
+    try:
+        batch_size = int(batch_size)
+    except (ValueError, TypeError):
+        batch_size = 10000
+
+    # Generate batches with the specified size
+    table = self.source._data
+    yield from table.to_batches(max_chunksize=batch_size)
+
+# Update the docstring to document the new option
+# Add to the Options section:
+# - `batch_size`: The maximum number of rows per batch (default: 10000).
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7495ae7 and e897d0d.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • docs/datasources/kaggle.md (1 hunks)
  • docs/index.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyproject.toml (1 hunks)
  • pyspark_datasources/__init__.py (1 hunks)
  • pyspark_datasources/kaggle.py (1 hunks)
  • tests/test_data_sources.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pyspark_datasources/__init__.py (1)
pyspark_datasources/kaggle.py (1)
  • KaggleDataSource (13-104)
🪛 Ruff (0.8.2)
tests/test_data_sources.py

29-29: KaggleDataSource may be undefined, or defined from star imports

(F405)

pyspark_datasources/__init__.py

5-5: .kaggle.KaggleDataSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

🔇 Additional comments (8)
pyproject.toml (1)

14-20: Dependency additions and extras grouping look correct.

The changes to dependencies and extras are appropriate for supporting the new Kaggle data source. No issues found.

Also applies to: 26-28

mkdocs.yml (1)

27-27: Documentation navigation updated appropriately.

The addition of the Kaggle data source documentation to the navigation is correct.

docs/datasources/kaggle.md (1)

1-5: KaggleDataSource documentation is clear and complete.

The documentation for the new data source is concise and follows project conventions.

docs/index.md (1)

31-39: Data sources table updated correctly.

The new KaggleDataSource entry and improved formatting are accurate and helpful.

pyspark_datasources/__init__.py (1)

5-5: KaggleDataSource import is correct for public API exposure.

The import exposes KaggleDataSource at the package level, matching the pattern for other data sources. The Ruff warning about unused import can be safely ignored in this context.


pyspark_datasources/kaggle.py (3)

13-60: Excellent documentation for the data source

The comprehensive docstring for KaggleDataSource includes clear explanations of the purpose, options, authentication requirements, and usage examples, making it easy for users to understand and use this data source.


1-11: Good type checking and imports organization

The imports are well-organized, and the use of TYPE_CHECKING for pyarrow is a good practice to avoid unnecessary imports at runtime while still providing type hints for static analyzers.


87-89: Good use of temporary directory to avoid permission issues

The use of a temporary directory for caching and explicitly setting KAGGLEHUB_CACHE is a good approach to avoid permission issues with the default cache location.

Comment on lines +82 to +89
```python
os.environ["KAGGLE_USERNAME"] = username
os.environ["KAGGLE_KEY"] = key

kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()}

# Cache in a temporary directory to avoid writing to ~ which may be read-only
with tempfile.TemporaryDirectory() as tmpdir:
    os.environ["KAGGLEHUB_CACHE"] = tmpdir
```
@coderabbitai bot commented Apr 15, 2025

🛠️ Refactor suggestion

Avoid setting environment variables directly

Setting environment variables directly can have unintended side effects, especially in multi-threaded environments or when tests run concurrently. Consider using a more isolated approach.

Use a context manager to temporarily set and restore environment variables:

```diff
+import contextlib
+import os
+
+@contextlib.contextmanager
+def temp_environ():
+    """Context manager to temporarily modify os.environ."""
+    env_copy = os.environ.copy()
+    try:
+        yield
+    finally:
+        os.environ.clear()
+        os.environ.update(env_copy)

 # In the _data method:
-if username or key:
-    if not (username and key):
-        raise ValueError(
-            "Both username and key must be provided to authenticate."
-        )
-    os.environ["KAGGLE_USERNAME"] = username
-    os.environ["KAGGLE_KEY"] = key
-
-# Cache in a temporary directory to avoid writing to ~ which may be read-only
-with tempfile.TemporaryDirectory() as tmpdir:
-    os.environ["KAGGLEHUB_CACHE"] = tmpdir
-    import kagglehub
+with temp_environ():
+    if username or key:
+        if not (username and key):
+            raise ValueError(
+                "Both username and key must be provided to authenticate."
+            )
+        os.environ["KAGGLE_USERNAME"] = username
+        os.environ["KAGGLE_KEY"] = key
+
+    # Cache in a temporary directory to avoid writing to ~ which may be read-only
+    with tempfile.TemporaryDirectory() as tmpdir:
+        os.environ["KAGGLEHUB_CACHE"] = tmpdir
+        import kagglehub
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-os.environ["KAGGLE_USERNAME"] = username
-os.environ["KAGGLE_KEY"] = key
-
-kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()}
-
-# Cache in a temporary directory to avoid writing to ~ which may be read-only
-with tempfile.TemporaryDirectory() as tmpdir:
-    os.environ["KAGGLEHUB_CACHE"] = tmpdir
+import contextlib
+import os
+import tempfile
+import ast
+# Other necessary imports
+
+@contextlib.contextmanager
+def temp_environ():
+    """Context manager to temporarily modify os.environ."""
+    env_copy = os.environ.copy()
+    try:
+        yield
+    finally:
+        os.environ.clear()
+        os.environ.update(env_copy)
+
+# ... inside the _data method of the module
+kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()}
+with temp_environ():
+    if username or key:
+        if not (username and key):
+            raise ValueError(
+                "Both username and key must be provided to authenticate."
+            )
+        os.environ["KAGGLE_USERNAME"] = username
+        os.environ["KAGGLE_KEY"] = key
+
+    # Cache in a temporary directory to avoid writing to ~ which may be read-only
+    with tempfile.TemporaryDirectory() as tmpdir:
+        os.environ["KAGGLEHUB_CACHE"] = tmpdir
+        import kagglehub
```
wengh (Contributor, Author) replied:

It runs in a separate process, so setting environment variables should be fine.

coderabbitai bot replied:

You are on the CodeRabbit Free Plan. In order to use the Chat feature, please contact CodeRabbit support.

@allisonwang-db merged commit 3e7564d into allisonwang-db:master on Apr 18, 2025